GeneTrail 3.2
Advanced high-throughput enrichment analysis
Position count matrices (PCMs)
We obtained position count matrices (PCMs) from JASPAR [1], which also includes data from UniPROBE [2], as well as from HOCOMOCO [3], the Kellis Lab ENCODE Motif database [4], and TRANSFAC [5]. The collection covers five species: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, and Caenorhabditis elegans.
Database content
Database | #Matrices | Retrieval date (YYYY/MM/DD) |
---|---|---|
JASPAR [6] | 515 | 2016/10/19 |
HOCOMOCO [7] | 20 | 2016/10/19 |
UniPROBE [8] | 16 | 2016/10/19 |
Cistrome [9] | 1 | 2016/10/19 |
Kellis Lab ENCODE Motif Database [4] | 1 | 2016/10/19 |
Total | 553 | - |
Database | #Matrices | Retrieval date (YYYY/MM/DD) |
---|---|---|
JASPAR [6] | 515 | 2017/03/22 |
HOCOMOCO [7] | 81 | 2017/03/22 |
Kellis Lab ENCODE Motif Database [4] | 130 | 2017/03/22 |
TRANSFAC [5] | 584 | 2017/03/22 |
Total | 1310 | - |
- GENCODE Release 26 (GRCh38.p10)
- GENCODE Release 26 (mapped to GRCh37)
Database | #Matrices | Retrieval date (YYYY/MM/DD) |
---|---|---|
JASPAR [6] | 499 | 2017/03/23 |
HOCOMOCO [7] | 67 | 2017/03/23 |
Kellis Lab ENCODE Motif Database [4] | 124 | 2017/03/23 |
TRANSFAC [5] | 194 | 2017/03/23 |
Total | 884 | - |
- GENCODE Release M13 (GRCm38.p5)
Database | #Matrices | Retrieval date (YYYY/MM/DD) |
---|---|---|
JASPAR [6] | 489 | 2017/03/23 |
HOCOMOCO [7] | 67 | 2017/03/23 |
Kellis Lab ENCODE Motif Database [4] | 121 | 2017/03/23 |
TRANSFAC [5] | 50 | 2017/03/23 |
Total | 727 | - |
- UCSC RGSC 6.0/rn6 with RefSeq tracks
Processing
We downloaded the JASPAR CORE Vertebrata data set to cover Homo sapiens, Mus musculus, and Rattus norvegicus. For Drosophila melanogaster we use the JASPAR CORE Insecta data set, and for Caenorhabditis elegans the JASPAR CORE Nematoda PCMs. From HOCOMOCO we use the provided data for Homo sapiens and Mus musculus; the mouse data is also used for Rattus norvegicus. The Kellis Lab ENCODE motifs are based on human ChIP-seq data. We consider these PCMs for Homo sapiens and, assuming motif conservation among vertebrates, also for Mus musculus and Rattus norvegicus. From TRANSFAC, we obtained species-specific sets for all considered organisms.
From this initial set, we removed all TFs that could not be mapped to an Ensembl gene ID, using the Ensembl Genes 87 database and the current versions of the reference genomes: GRCh38, GRCm38, Rnor_6, BDGP6, and WBcel235. In this way, we generate species-specific sets of PCMs, assuming motif conservation among vertebrates for the JASPAR CORE Vertebrata PCMs, the HOCOMOCO mouse PCMs, and the Kellis Lab ENCODE Motif PCMs. Neither HOCOMOCO nor the Kellis Lab ENCODE motifs are considered for Drosophila melanogaster and Caenorhabditis elegans.
Next, for each species set, we computed the information content $IC$ of each PCM $M$, normalized by the motif length $|M|$, as $$P(i,j)= \frac{M(i,j)+pc}{4 \cdot pc+\sum_{i'}M(i',j)},$$ $$IC=-\frac{\sum_{i,j}\log(P(i,j)) \cdot P(i,j)}{|M|},$$ with $i, i' \in \{A,C,G,T\}$, $j \in \{1,\dots,|M|\}$, and a pseudo count $pc=1$. Defined this way, $IC$ is the mean per-position entropy of the motif. Hence, the smaller the $IC$ value of a matrix, the more informative the matrix is.
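The normalized information content can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the database: the function name `normalized_ic` and the dictionary layout of the PCM are hypothetical, and the logarithm base is an assumption. The text does not state the base; base 2 is used here because the reported cut-offs exceed $\ln 4 \approx 1.386$, which is the largest value a natural logarithm could produce.

```python
import math

NUCLEOTIDES = "ACGT"

def normalized_ic(pcm, pc=1.0):
    """Mean per-position entropy of a PCM (smaller means more informative).

    pcm: dict mapping each nucleotide to a list of counts, one per position
         (hypothetical layout chosen for this sketch).
    pc:  pseudo count added to every cell.
    """
    length = len(pcm["A"])
    total = 0.0
    for j in range(length):
        column_sum = sum(pcm[n][j] for n in NUCLEOTIDES)
        for n in NUCLEOTIDES:
            p = (pcm[n][j] + pc) / (4 * pc + column_sum)
            # log base 2 is an assumption; the source only writes "log"
            total += p * math.log2(p)
    return -total / length

# A uniform column carries no information, a sharp column carries a lot.
uniform = {n: [5, 5] for n in NUCLEOTIDES}
sharp = {"A": [100, 100], "C": [0, 0], "G": [0, 0], "T": [0, 0]}
print(normalized_ic(uniform), normalized_ic(sharp))
```

The uniform matrix reaches the maximum value and the sharp matrix a much smaller one, which matches the remark that low values indicate informative matrices.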
Figure 1, a violin plot, shows the distribution of the normalized information content for the Homo sapiens data set. Across all species collections, we find that the normalized information content of the JASPAR matrices has a small variance and that the poorest JASPAR PCM is still more informative than several PCMs from other databases. Therefore, we decided to consider all JASPAR PCMs and to use the normalized information content value of the poorest JASPAR PCM as a species-specific cut-off value for the remaining databases. The cut-offs are:
- Homo sapiens: $1.55928$
- Mus musculus: $1.55928$
- Rattus norvegicus: $1.55928$
- Drosophila melanogaster: $1.67157$
- Caenorhabditis elegans: $1.33879$
If multiple motifs exist for one TF in one database, we consider only the PCM with the best (smallest) $IC$ value. Motifs that are specifically marked as secondary or tertiary binding motifs, as in JASPAR, are not removed.
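The filtering rules above can be illustrated with a short Python sketch. The record layout (`id`, `tf`, `ic`, `secondary`) and the function names are hypothetical. Two details are assumptions that the text does not settle: a PCM whose value equals the cut-off is kept, and secondary motifs are still subject to the cut-off.

```python
def jaspar_cutoff(jaspar_ics):
    """Cut-off = normalized IC of the poorest (largest value) JASPAR PCM."""
    return max(jaspar_ics)

def select_pcms(records, cutoff=None):
    """Filter one database: apply the cut-off, then keep the best PCM per TF.

    records: list of dicts with keys 'id', 'tf', 'ic' and an optional
             boolean 'secondary' flag (hypothetical layout).
    cutoff:  None for JASPAR, where all PCMs are considered.
    """
    best_per_tf = {}
    secondary = []
    for rec in records:
        if cutoff is not None and rec["ic"] > cutoff:
            continue  # less informative than the poorest JASPAR PCM
        if rec.get("secondary", False):
            secondary.append(rec)  # marked secondary/tertiary motifs are kept
            continue
        current = best_per_tf.get(rec["tf"])
        if current is None or rec["ic"] < current["ic"]:
            best_per_tf[rec["tf"]] = rec
    return list(best_per_tf.values()) + secondary

records = [
    {"id": "H1", "tf": "TP53", "ic": 1.2},
    {"id": "H2", "tf": "TP53", "ic": 0.9},
    {"id": "H3", "tf": "MYC", "ic": 1.9},
    {"id": "H4", "tf": "TP53", "ic": 1.4, "secondary": True},
]
kept = select_pcms(records, cutoff=1.55928)
print([rec["id"] for rec in kept])
```

In this toy input, `H3` fails the cut-off, `H1` loses to the better TP53 motif `H2`, and `H4` survives because it is flagged as a secondary motif.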
In order to merge the different databases per species, we execute the following merging procedure on the filtered sets of the individual databases:
1. Consider all JASPAR matrices.
2. Add all HOCOMOCO matrices that are not included in the set of step (1).
3. Add all Kellis Lab ENCODE Motif database matrices that are not included in the set of step (2).
4. Add all TRANSFAC matrices that are not included in the set of step (3).
This procedure ensures that we generate the largest possible set of open-source PCMs from our collection of TF binding motifs. In addition to the unified set, we also provide the user with the option to work with all PCMs from a single database.
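The merging procedure amounts to a priority-ordered union. The following Python sketch illustrates it under the assumption that "not included" is decided per TF identifier (for example the mapped Ensembl gene ID); the text does not specify the matching key, and the function name and data layout are hypothetical.

```python
def merge_databases(databases):
    """Merge filtered PCM sets in priority order.

    databases: list of (database_name, {tf_id: pcm}) tuples, highest
               priority first (hypothetical layout).
    Returns a dict tf_id -> (database_name, pcm).
    """
    merged = {}
    for name, pcms in databases:
        for tf_id, pcm in pcms.items():
            # A TF already covered by a higher-priority database is skipped.
            if tf_id not in merged:
                merged[tf_id] = (name, pcm)
    return merged

databases = [
    ("JASPAR", {"TP53": "j1", "MYC": "j2"}),
    ("HOCOMOCO", {"MYC": "h1", "SP1": "h2"}),
    ("Kellis", {"SP1": "k1"}),
    ("TRANSFAC", {"SP1": "t1", "GATA1": "t2"}),
]
merged = merge_databases(databases)
for tf_id, (name, _) in merged.items():
    print(tf_id, name)
```

Because every later database only contributes TFs that are still missing, the order of the list encodes the preference JASPAR, HOCOMOCO, Kellis Lab ENCODE Motif database, TRANSFAC.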
As mentioned above, TRAP computes TF affinities based on a biophysical model of TF binding. Therefore, PCMs have to be converted to Position Specific Energy Matrices (PSEMs) before they can be used in TRAP. Intuitively, PSEMs represent the mismatch energy of a given motif. For a detailed explanation and motivation of the energy-based score, see [10]. A PCM $M$ is converted to a PSEM $E$ according to:
$$E_{i,j}=\frac{1}{\lambda}\log\left(\frac{M_{max,j}}{M_{i,j}}\,b_{i,j}\right),$$ $$M_{max,j}=\max\limits_{i\in\{A,C,G,T\}}(M_{i,j}).$$
The parameter $\lambda$ scales the mismatch energies, and $b_{i,j}$ denotes the background frequency of nucleotide $i$ relative to that of the most frequent nucleotide at position $j$. This conversion formula is part of the mismatch energy postulated in formula (4) in [10]. By definition, if $i$ is the most frequent nucleotide at position $j$, then $E_{i,j}=0$, as there should be no mismatch energy for the best possible sequence match.
Note that, during conversion, a pseudo count $pc = 1$ is added to each $M_{i,j}$.
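To make the conversion concrete, here is a minimal Python sketch of the formula. It is not the C++ tool by the TRAP authors. Several points are assumptions for illustration: the logarithm is taken as the natural logarithm, the background frequencies are derived from a GC content as $b_C=b_G=GC/2$ and $b_A=b_T=(1-GC)/2$, ties for the most frequent nucleotide are broken in the order A, C, G, T, and the function names and PCM layout are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def background(gc):
    """Background nucleotide frequencies from a GC content (assumed model)."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}

def pcm_to_psem(pcm, gc, lam=0.7, pc=1.0):
    """Convert a PCM (dict nucleotide -> list of counts) to a PSEM."""
    bg = background(gc)
    length = len(pcm["A"])
    psem = {n: [0.0] * length for n in NUCLEOTIDES}
    for j in range(length):
        # The pseudo count is added to every cell before the conversion.
        counts = {n: pcm[n][j] + pc for n in NUCLEOTIDES}
        best = max(NUCLEOTIDES, key=lambda n: counts[n])
        for n in NUCLEOTIDES:
            ratio = counts[best] / counts[n]
            relative_bg = bg[n] / bg[best]
            # Natural logarithm is an assumption; the source writes "log".
            psem[n][j] = math.log(ratio * relative_bg) / lam
    return psem

pcm = {"A": [9, 2], "C": [0, 2], "G": [0, 7], "T": [0, 1]}
psem = pcm_to_psem(pcm, gc=0.5)
print(psem["A"][0], psem["C"][0])
```

With a GC content of 0.5 the background term cancels, so the energy of the best nucleotide at each position is zero and all other entries depend only on the count ratio.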
The conversion is performed by a C++ tool provided by the authors of TRAP, which is also included in the TEPIC repository. As suggested in [10], we use the following parameters for the conversion:
- $\lambda=0.7$
- $m=0.584$
- $n=-5.66$
The slope $m$ and the intercept $n$ are used to compute a matrix-specific parameter $R_0$ that combines the concentration of the corresponding TF and the equilibrium constant of the binding reaction with its optimal binding site, as defined in [10]. The authors of TRAP found a linear approximation for $R_0$:
$$\ln(R_0)=m \cdot |M|+n,$$ where $|M|$ denotes the length of the PCM as above.
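As a small worked illustration of the linear approximation, the following Python snippet evaluates $R_0$ for a given motif length with the parameter values listed above. The function name `r0` is hypothetical.

```python
import math

def r0(motif_length, m=0.584, n=-5.66):
    """Matrix-specific parameter R_0 from ln(R_0) = m * |M| + n."""
    return math.exp(m * motif_length + n)

# For a motif of length 10, ln(R_0) = 0.584 * 10 - 5.66 = 0.18.
print(math.log(r0(10)))
```

Since the slope is positive, longer motifs receive a larger $R_0$ under this approximation.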
Further, we use species-specific GC-content values:
- Homo sapiens: $0.41$
- Mus musculus: $0.42$
- Rattus norvegicus: $0.42$
- Drosophila melanogaster: $0.43$
- Caenorhabditis elegans: $0.36$
Bibliography
1. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res.
2. UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res.
3. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res.
4. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res.
5. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res.
6. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res.
7. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res.
8. UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res.
9. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biol.
10. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics.