Position count matrices (PCMs)

We obtained Position Count Matrices (PCMs) from JASPAR [1], which also includes data from UniPROBE [2], HOCOMOCO [3], the Kellis Lab ENCODE Motif database [4], and TRANSFAC [5], for five species: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, and Caenorhabditis elegans.

Database content


Database #Matrices Retrieval date (YYYY/MM/DD)
JASPAR [6] 515 2016/10/19
HOCOMOCO [7] 20 2016/10/19
UniPROBE [8] 16 2016/10/19
Cistrome [9] 1 2016/10/19
Kellis Lab ENCODE Motif Database [4] 1 2016/10/19
Total 553 -

Database #Matrices Retrieval date (YYYY/MM/DD)
JASPAR [6] 515 2017/03/22
HOCOMOCO [7] 81 2017/03/22
Kellis Lab ENCODE Motif Database [4] 130 2017/03/22
TRANSFAC [5] 584 2017/03/22
Total 1310 -

  • GENCODE Release 26 (GRCh38.p10)
  • GENCODE Release 26 (mapped to GRCh37)

Database #Matrices Retrieval date (YYYY/MM/DD)
JASPAR [6] 499 2017/03/23
HOCOMOCO [7] 67 2017/03/23
Kellis Lab ENCODE Motif Database [4] 124 2017/03/23
TRANSFAC [5] 194 2017/03/23
Total 884 -

  • GENCODE Release M13 (GRCm38.p5)

Database #Matrices Retrieval date (YYYY/MM/DD)
JASPAR [6] 489 2017/03/23
HOCOMOCO [7] 67 2017/03/23
Kellis Lab ENCODE Motif Database [4] 121 2017/03/23
TRANSFAC [5] 50 2017/03/23
Total 727 -

  • UCSC RGSC 6.0/rn6 with RefSeq tracks

Database #Matrices Retrieval date (YYYY/MM/DD)
JASPAR [6] 26 2017/03/23
HOCOMOCO [7] 0 2017/03/23
Kellis Lab ENCODE Motif Database [4] 0 2017/03/23
TRANSFAC [5] 14 2017/03/23
Total 40 -

  • UCSC WBcel235/ce11 with RefSeq tracks

Database #Matrices Retrieval date (YYYY/MM/DD)
JASPAR [6] 129 2017/03/23
HOCOMOCO [7] 0 2017/03/23
Kellis Lab ENCODE Motif Database [4] 0 2017/03/23
TRANSFAC [5] 92 2017/03/23
Total 221 -

  • UCSC BDGP Release 6 + ISO1 MT/dm6 with RefSeq tracks

Processing

We downloaded the JASPAR CORE Vertebrata data set to cover Homo sapiens, Mus musculus, and Rattus norvegicus; for Drosophila melanogaster we use the JASPAR CORE Insecta data set, and for Caenorhabditis elegans the JASPAR CORE Nematoda data set. From HOCOMOCO, we use the provided data for Homo sapiens and Mus musculus; the latter is also used for Rattus norvegicus. The Kellis Lab ENCODE Motifs are based on human ChIP-seq data, so we consider these PCMs for Homo sapiens, Mus musculus, and Rattus norvegicus. From TRANSFAC, we obtained species-specific sets for all considered organisms.

From this initial set, we removed all TFs that could not be mapped to an Ensembl gene ID, using the Ensembl Genes 87 database and the current versions of the reference genomes: GRCh38, GRCm38, Rnor_6, BDGP6, and WBcel235. In this way, we generate species-specific sets of PCMs, assuming motif conservation among vertebrates for the JASPAR CORE Vertebrata PCMs, the HOCOMOCO mouse PCMs, and the Kellis Lab ENCODE Motif PCMs. Neither HOCOMOCO nor the Kellis Lab ENCODE Motifs are considered for Drosophila melanogaster and Caenorhabditis elegans.

Next, for each species set, we computed the information content $IC$ of each PCM $M$, normalized by the motif length $|M|$, as $$P(i,j)= \frac{M(i,j)+pc}{4 \cdot pc+\sum_{i'}M(i',j)},$$ $$IC=-\frac{\sum_{i,j}P(i,j)\log(P(i,j))}{|M|},$$ with $i \in \{A,C,G,T\}$, $j \in \{1,\dots,|M|\}$, and a pseudo count $pc=1$. As defined here, $IC$ is an entropy-like measure, so the smaller the $IC$ value of a matrix, the more informative the matrix is.
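The normalized score can be sketched in a few lines of Python. This is only an illustration of the formula above, not the code used to build the collection. The dictionary layout of the PCM and the function name `normalized_ic` are assumptions. The logarithm base is not stated in the text; base 2 is assumed here because the cut-offs reported below exceed $\ln 4 \approx 1.386$, the largest value a natural-log version of this score could reach.

```python
import math

NUCLEOTIDES = "ACGT"

def normalized_ic(pcm, pc=1.0, log=math.log2):
    """Length-normalized entropy of a PCM (smaller means more informative).

    pcm maps each nucleotide to a list of counts, one entry per motif position.
    The base-2 logarithm is an assumption; pass another function to change it.
    """
    length = len(pcm["A"])
    total = 0.0
    for j in range(length):
        column_sum = sum(pcm[n][j] for n in NUCLEOTIDES)
        for n in NUCLEOTIDES:
            # P(i, j) with pseudo count pc added to every cell
            p = (pcm[n][j] + pc) / (4 * pc + column_sum)
            total += p * log(p)
    return -total / length

# Hypothetical toy matrices, not taken from any database.
uniform = {n: [10, 10, 10] for n in NUCLEOTIDES}
specific = {"A": [40, 0, 0], "C": [0, 40, 0], "G": [0, 0, 40], "T": [0, 0, 0]}

print(normalized_ic(uniform))   # 2.0, the least informative case
print(normalized_ic(specific))  # much smaller, i.e. more informative
```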

Figure 1, a violin plot, shows the distribution of the normalized information content for the Homo sapiens data set. Across all species collections, we find that the JASPAR matrices have a small variance and that the poorest JASPAR PCM is still more informative than several PCMs from the other databases. Therefore, we decided to keep all JASPAR PCMs and to use the normalized information content of the poorest JASPAR PCM as a species-specific cut-off value for the remaining databases. The cut-offs are:

  • Homo sapiens: $1.55928$
  • Mus musculus: $1.55928$
  • Rattus norvegicus: $1.55928$
  • Drosophila melanogaster: $1.67157$
  • Caenorhabditis elegans: $1.33879$
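The cut-off rule described above can be sketched as follows. The function name, the data layout, and the toy values are hypothetical. Whether a PCM that scores exactly at the cut-off is kept is not stated in the text; this sketch assumes it is.

```python
def apply_jaspar_cutoff(ic_by_database):
    """Keep all JASPAR PCMs; keep other PCMs whose IC does not exceed the cut-off.

    ic_by_database maps a database name to {motif name: normalized IC}.
    The cut-off is the largest (poorest) IC among the JASPAR PCMs.
    """
    cutoff = max(ic_by_database["JASPAR"].values())
    kept = {}
    for database, ic_by_motif in ic_by_database.items():
        if database == "JASPAR":
            kept[database] = sorted(ic_by_motif)
        else:
            # Assumption: a PCM exactly at the cut-off is retained.
            kept[database] = sorted(
                motif for motif, ic in ic_by_motif.items() if ic <= cutoff
            )
    return cutoff, kept

# Hypothetical toy values, not taken from any real database.
toy = {
    "JASPAR": {"J1": 1.10, "J2": 1.50},
    "HOCOMOCO": {"H1": 1.40, "H2": 1.80},
    "TRANSFAC": {"T1": 1.95, "T2": 0.90},
}
cutoff, kept = apply_jaspar_cutoff(toy)
print(cutoff)  # 1.5
print(kept)
```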

Figure 1. Normalized information content for PCMs extracted from JASPAR CORE Vertebrata, HOCOMOCO human, the Kellis Lab ENCODE Motif database, and TRANSFAC human. Small values indicate high-quality matrices. The poorest PCM in JASPAR is still better than several PCMs from the other databases.

If multiple motifs exist for one TF in one database, we keep only the PCM with the best (that is, smallest) $IC$ value. Motifs that are explicitly marked as secondary or tertiary binding motifs, as in JASPAR, are not removed.
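A minimal sketch of this per-TF selection is given below. The record fields (`tf`, `id`, `ic`, `alternative`) are hypothetical, and the sketch assumes that a motif flagged as a secondary or tertiary motif is always retained and does not compete with the primary motifs.

```python
def select_best_per_tf(motifs):
    """Keep the lowest-IC motif per TF, plus all flagged alternative motifs."""
    best = {}
    alternatives = []
    for motif in motifs:
        if motif["alternative"]:
            # Secondary/tertiary motifs are kept regardless of their IC.
            alternatives.append(motif)
        elif motif["tf"] not in best or motif["ic"] < best[motif["tf"]]["ic"]:
            best[motif["tf"]] = motif
    return list(best.values()) + alternatives

# Hypothetical records for one database.
records = [
    {"tf": "TF_A", "id": "a1", "ic": 1.2, "alternative": False},
    {"tf": "TF_A", "id": "a2", "ic": 0.9, "alternative": False},
    {"tf": "TF_A", "id": "a3", "ic": 1.4, "alternative": True},
    {"tf": "TF_B", "id": "b1", "ic": 1.0, "alternative": False},
]
selected_ids = sorted(m["id"] for m in select_best_per_tf(records))
print(selected_ids)  # ['a2', 'a3', 'b1']
```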

In order to merge the different databases per species, we execute the following merging procedure on the filtered sets of the individual databases:

  1. Consider all JASPAR matrices.
  2. Add all HOCOMOCO matrices that are not included in the set of step (1).
  3. Add all Kellis Lab ENCODE Motif database matrices that are not included in the set of step (2).
  4. Add all TRANSFAC matrices that are not included in the set of step (3).

This procedure ensures that we generate the largest possible set of open-source PCMs from our collection of TF binding motifs. In addition to the unified set, we also provide the user with the option to work with all PCMs from a single database.
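The merging steps above amount to a priority-ordered union. The sketch below assumes that "not included" is decided by a shared TF identifier (for example, the mapped Ensembl gene ID); the text does not spell out the matching key, and all names in the example are hypothetical.

```python
PRIORITY = ["JASPAR", "HOCOMOCO", "Kellis", "TRANSFAC"]

def merge_databases(filtered, priority=PRIORITY):
    """Union of PCM sets in which earlier databases take precedence.

    filtered maps a database name to {TF identifier: motif}. A TF already
    covered by a higher-priority database is not added again.
    """
    merged = {}
    for database in priority:
        for tf, motif in filtered.get(database, {}).items():
            if tf not in merged:
                merged[tf] = (database, motif)
    return merged

# Hypothetical filtered sets for one species.
filtered = {
    "JASPAR": {"TF_A": "J_A"},
    "HOCOMOCO": {"TF_A": "H_A", "TF_B": "H_B"},
    "Kellis": {"TF_C": "K_C"},
    "TRANSFAC": {"TF_B": "T_B", "TF_D": "T_D"},
}
merged = merge_databases(filtered)
for tf in sorted(merged):
    print(tf, merged[tf])
```

Here `TF_A` is taken from JASPAR and `TF_B` from HOCOMOCO, because those databases come first in the priority order.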

As mentioned above, TRAP computes TF affinities based on a biophysical model of TF binding. Therefore, PCMs have to be converted to Position Specific Energy Matrices (PSEMs) before they can be used in TRAP. Intuitively, PSEMs represent the mismatch energy of a given motif. For a detailed explanation and motivation of the energy-based score, see [10]. A PCM $M$ is converted to a PSEM $E$ according to:

$$E_{i,j}=\frac{1}{\lambda}\log\left(\frac{M_{max,j}}{M_{i,j}}\,b_{i,j}\right),$$ $$M_{max,j}=\max\limits_{i\in\{A,C,G,T\}}M_{i,j}.$$

The parameter $\lambda$ scales the mismatch energies, and $b_{i,j}$ denotes the background frequency of nucleotide $i$ with respect to the most frequent nucleotide at position $j$. This conversion formula is part of the mismatch energy postulated in formula (4) in [10]. By definition, if $i$ is the most frequent nucleotide at position $j$, then $E_{i,j}=0$, as there should be no mismatch energy for the best possible sequence match.

Note that, during conversion, a pseudo count $pc = 1$ is added to each $M_{i,j}$.
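The conversion can be illustrated with a short Python sketch. It is not the C++ converter shipped with TRAP. Two points are assumptions: $b_{i,j}$ is read as the ratio of the background frequency of nucleotide $i$ to that of the most frequent nucleotide at position $j$, and the logarithm is taken as the natural logarithm. The function name and the toy matrix are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def pcm_to_psem(pcm, background, lam=0.7, pc=1.0):
    """Convert a PCM to mismatch energies, one value per nucleotide and position."""
    length = len(pcm["A"])
    psem = {n: [0.0] * length for n in NUCLEOTIDES}
    for j in range(length):
        # The pseudo count is added to every cell before the conversion.
        counts = {n: pcm[n][j] + pc for n in NUCLEOTIDES}
        best = max(NUCLEOTIDES, key=lambda n: counts[n])
        for n in NUCLEOTIDES:
            # Assumed reading of b_{i,j}: background ratio relative to `best`.
            b_ratio = background[n] / background[best]
            psem[n][j] = math.log(counts[best] / counts[n] * b_ratio) / lam
    return psem

# Hypothetical two-position motif with a uniform background.
toy_pcm = {"A": [9, 0], "C": [0, 7], "G": [0, 1], "T": [0, 1]}
uniform_bg = {n: 0.25 for n in NUCLEOTIDES}
psem = pcm_to_psem(toy_pcm, uniform_bg)
print(psem["A"][0])  # 0.0, no mismatch energy for the most frequent nucleotide
print(psem["C"][0])  # positive mismatch energy
```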

The conversion is done by a C++ tool provided by the authors of TRAP, which is also included in the TEPIC repository. As suggested in [10], we use the following parameters for the conversion:

  • $\lambda=0.7$
  • $m=0.584$
  • $n=-5.66$

The slope $m$ and the intercept $n$ are used to compute a matrix-specific parameter $R_0$ that combines the concentration of the corresponding TF and the equilibrium constant of the binding reaction with its optimal binding site, as defined in [10]. The authors of TRAP found a linear approximation for $\ln(R_0)$ as a function of the motif length:

$$\ln(R_0)=m \cdot |M|+n,$$

where $|M|$ denotes the length of the PCM as above.
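As a small worked example of this approximation, the sketch below evaluates $R_0$ for a hypothetical motif of length 10 with the parameters listed above. The function name `r0_from_length` is an assumption for illustration.

```python
import math

def r0_from_length(motif_length, m=0.584, n=-5.66):
    """R_0 from the linear approximation ln(R_0) = m * |M| + n."""
    return math.exp(m * motif_length + n)

# For |M| = 10: ln(R_0) = 0.584 * 10 - 5.66 = 0.18
print(round(r0_from_length(10), 4))  # 1.1972
```

Because $\ln(R_0)$ grows linearly with $|M|$, each additional motif position multiplies $R_0$ by $e^{m}$.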

Further, we use species-specific GC-content values:

  • Homo sapiens $=0.41$
  • Mus musculus $=0.42$
  • Rattus norvegicus $=0.42$
  • Drosophila melanogaster $=0.43$
  • Caenorhabditis elegans $=0.36$
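One common way to turn a GC-content value into per-nucleotide background frequencies is to split it evenly between G and C and the remainder evenly between A and T. The sketch below assumes this strand-symmetric model; the text does not state how the TRAP converter uses the GC content internally, so this is an illustration only.

```python
GC_CONTENT = {
    "Homo sapiens": 0.41,
    "Mus musculus": 0.42,
    "Rattus norvegicus": 0.42,
    "Drosophila melanogaster": 0.43,
    "Caenorhabditis elegans": 0.36,
}

def background_from_gc(gc):
    """Background frequencies under the assumption A = T and C = G."""
    at = (1.0 - gc) / 2.0
    return {"A": at, "C": gc / 2.0, "G": gc / 2.0, "T": at}

human_bg = background_from_gc(GC_CONTENT["Homo sapiens"])
print(human_bg)
```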

Figure 2. Visualization of the PCM preprocessing workflow.

Bibliography

  1. Mathelier, A.; Fornes, O.; Arenillas, D. J.; Chen, C. Y.; Denay, G.; Lee, J.; Shi, W.; Shyr, C.; Tan, G.; Worsley-Hunt, R.; et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res.
  2. Hume, M. A.; Barrera, L. A.; Gisselbrecht, S. S.; Bulyk, M. L. UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res.
  3. Kulakovskiy, I. V.; Medvedeva, Y. A.; Schaefer, U.; Kasianov, A. S.; Vorontsov, I. E.; Bajic, V. B.; Makeev, V. J. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res.
  4. Kheradpour, P.; Kellis, M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res.
  5. Matys, V.; Kel-Margoulis, O. V.; Fricke, E.; Liebich, I.; Land, S.; Barre-Dirrie, A.; Reuter, I.; Chekmenev, D.; Krull, M.; Hornischer, K.; Voss, N.; Stegmaier, P.; Lewicki-Potapov, B.; Saxel, H.; Kel, A. E.; Wingender, E. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res.
  6. Sandelin, Albin; Alkema, Wynand; Engstrom, Par; Wasserman, Wyeth W; Lenhard, Boris. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. Oxford Univ Press.
  7. Kulakovskiy, Ivan V; Medvedeva, Yulia A; Schaefer, Ulf; Kasianov, Artem S; Vorontsov, Ilya E; Bajic, Vladimir B; Makeev, Vsevolod J. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. Oxford Univ Press.
  8. Hume, Maxwell A; Barrera, Luis A; Gisselbrecht, Stephen S; Bulyk, Martha L. UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. Oxford Univ Press.
  9. Liu, Tao; Ortiz, Jorge A; Taing, Len; Meyer, Clifford A; Lee, Bernett; Zhang, Yong; Shin, Hyunjin; Wong, Swee S; Ma, Jian; Lei, Ying; et al. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biology. BioMed Central.
  10. Roider, H. G.; Kanhere, A.; Manke, T.; Vingron, M. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics.