GeneTrail

Position count matrices (PCMs)

We obtained Position Count Matrices (PCMs) from JASPAR [1] (which also includes data from UniPROBE [2]), HOCOMOCO [3], the Kellis Lab ENCODE Motif database [4], and TRANSFAC [5] for five species: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, and Caenorhabditis elegans.

Database content

Homo sapiens

Database	#Matrices	Retrieval date (YYYY/MM/DD)
JASPAR [6]	515	2016/10/19
HOCOMOCO [7]	20	2016/10/19
UniPROBE [8]	16	2016/10/19
Cistrome [9]	1	2016/10/19
Kellis Lab ENCODE Motif Database [4]	1	2016/10/19
Total	553	-

Database	#Matrices	Retrieval date (YYYY/MM/DD)
JASPAR [6]	515	2017/03/22
HOCOMOCO [7]	81	2017/03/22
Kellis Lab ENCODE Motif Database [4]	130	2017/03/22
TRANSFAC [5]	584	2017/03/22
Total	1310	-

Reference genomes used for annotation of transcriptional start sites of all genes (protein coding and non-coding).

GENCODE Release 26 (GRCh38.p10)
GENCODE Release 26 (mapped to GRCh37)

Mus musculus

Database	#Matrices	Retrieval date (YYYY/MM/DD)
JASPAR [6]	499	2017/03/23
HOCOMOCO [7]	67	2017/03/23
Kellis Lab ENCODE Motif Database [4]	124	2017/03/23
TRANSFAC [5]	194	2017/03/23
Total	884	-

Reference genome used for annotation of transcriptional start sites of all genes (protein coding and non-coding).

GENCODE Release M13 (GRCm38.p5)

Rattus norvegicus

Database	#Matrices	Retrieval date (YYYY/MM/DD)
JASPAR [6]	489	2017/03/23
HOCOMOCO [7]	67	2017/03/23
Kellis Lab ENCODE Motif Database [4]	121	2017/03/23
TRANSFAC [5]	50	2017/03/23
Total	727	-

Reference genome used for annotation of transcriptional start sites of all genes (protein coding and non-coding).

UCSC RGSC 6.0/rn6 with RefSeq tracks

Caenorhabditis elegans

Database	#Matrices	Retrieval date (YYYY/MM/DD)
JASPAR [6]	26	2017/03/23
HOCOMOCO [7]	0	2017/03/23
Kellis Lab ENCODE Motif Database [4]	0	2017/03/23
TRANSFAC [5]	14	2017/03/23
Total	40	-

Reference genome used for annotation of transcriptional start sites of all genes (protein coding and non-coding).

UCSC WBcel235/ce11 with RefSeq tracks

Drosophila melanogaster

Database	#Matrices	Retrieval date (YYYY/MM/DD)
JASPAR [6]	129	2017/03/23
HOCOMOCO [7]	0	2017/03/23
Kellis Lab ENCODE Motif Database [4]	0	2017/03/23
TRANSFAC [5]	92	2017/03/23
Total	221	-

Reference genome used for annotation of transcriptional start sites of all genes (protein coding and non-coding).

UCSC BDGP Release 6 + ISO1 MT/dm6 with RefSeq tracks

Processing

We downloaded the JASPAR CORE Vertebrata data set to cover Homo sapiens, Mus musculus, and Rattus norvegicus. For Drosophila melanogaster, we use the JASPAR CORE Insecta data set, and for Caenorhabditis elegans, the JASPAR CORE Nematoda PCMs. From HOCOMOCO, we use the provided data for Homo sapiens and Mus musculus; the mouse data is also used for Rattus norvegicus. The Kellis Lab ENCODE Motifs are based on human ChIP-seq data; we therefore consider these PCMs for Homo sapiens, Mus musculus, and Rattus norvegicus. From TRANSFAC, we obtained species-specific sets for all considered organisms.

From this initial set, we removed all TFs that could not be mapped to an Ensembl gene ID, using the Ensembl Genes 87 database and the current versions of the reference genomes: GRCh38, GRCm38, Rnor_6, BDGP6, and WBcel235. We thereby generate species-specific sets of PCMs, assuming motif conservation among vertebrates for the JASPAR CORE Vertebrata PCMs, the HOCOMOCO mouse PCMs, and the Kellis Lab ENCODE Motif PCMs. Neither HOCOMOCO nor Kellis Lab ENCODE Motifs are considered for Drosophila melanogaster and Caenorhabditis elegans.

Next, for each species set, we computed the information content $IC$ of each PCM $M$, normalized by motif length $|M|$, as $$P(i,j)= \frac{M(i,j)+pc}{4 \cdot pc+\sum_{i'}M(i',j)},$$ $$IC=-\frac{\sum_{i,j}\log(P(i,j)) \cdot P(i,j)}{|M|},$$ with $i \in \{A,C,G,T\}$, $j \in \{1,\dots,|M|\}$, and a pseudo count $pc=1$. Note that the smaller the $IC$ value of a matrix, the more informative the matrix is.
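The normalized information content defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the collection: the dictionary layout of the PCM and the function name `normalized_ic` are assumptions made here. The logarithm base is also an assumption: base 2 is used because the cut-off values reported for this measure exceed $\ln 4 \approx 1.386$, the maximum attainable with the natural logarithm.

```python
import math

BASES = "ACGT"

def normalized_ic(pcm, pc=1.0):
    """Length-normalized IC of a PCM, as defined in the text.

    pcm: dict mapping each base in 'ACGT' to a list of counts,
         one entry per motif position (hypothetical layout).
    pc:  pseudo count added to every cell.
    """
    length = len(pcm["A"])
    total = 0.0
    for j in range(length):
        column_sum = sum(pcm[b][j] for b in BASES)
        for b in BASES:
            p = (pcm[b][j] + pc) / (4 * pc + column_sum)
            # Assumption: base-2 logarithm (see lead-in).
            total += p * math.log2(p)
    return -total / length

# A uniform column is the least informative case; a conserved one scores lower.
uniform = {b: [5, 5] for b in BASES}
conserved = {"A": [96, 96], "C": [0, 0], "G": [0, 0], "T": [0, 0]}
print(normalized_ic(uniform), normalized_ic(conserved))
```

Under these assumptions, a matrix with identical counts in every cell reaches the maximum of 2, and strongly conserved motifs approach 0.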

Figure 1, a violin plot, shows the distribution of the normalized information content for the Homo sapiens data set. Across all species collections, we find that the JASPAR matrices have a small variance and that the poorest JASPAR PCM is still more informative than several PCMs from other databases. Therefore, we decided to keep all JASPAR PCMs and to use the normalized information content value of the poorest JASPAR PCM as a species-specific cut-off value for the remaining databases. The cut-offs are:

Homo sapiens: $1.55928$
Mus musculus: $1.55928$
Rattus norvegicus: $1.55928$
Drosophila melanogaster: $1.67157$
Caenorhabditis elegans: $1.33879$

**Figure 1.** Normalized information content for *PCMs* extracted from JASPAR Core Vertebrata, HOCOMOCO human, the Kellis ENCODE Motif database, and TRANSFAC Human. Small values indicate high quality matrices. Clearly, the poorest *PCM* in JASPAR is still better than several *PCMs* out of the other databases.

If multiple motifs exist for one TF in one database, we keep only the PCM with the best (that is, lowest) $IC$ value. If the motifs are specifically marked as secondary or tertiary binding motifs, as in JASPAR, we do not remove them.
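The cut-off filter and the per-TF selection for a non-JASPAR database could look like the following sketch. The tuple layout, the function name `filter_database`, and the choice to keep matrices whose value equals the cut-off are assumptions for illustration. The exception for secondary and tertiary motifs is not modeled here.

```python
def filter_database(entries, cutoff):
    """Keep, per TF, the matrix with the lowest IC that passes the cut-off.

    entries: iterable of (tf_name, matrix_id, ic) tuples (hypothetical layout).
    cutoff:  species-specific IC of the poorest JASPAR PCM.
    Returns a dict mapping tf_name to (matrix_id, ic).
    """
    best = {}
    for tf_name, matrix_id, ic in entries:
        if ic > cutoff:
            # Less informative than the poorest JASPAR PCM: discard.
            continue
        if tf_name not in best or ic < best[tf_name][1]:
            best[tf_name] = (matrix_id, ic)
    return best

# Illustrative identifiers and values only.
entries = [("SP1", "m1", 1.2), ("SP1", "m2", 0.9), ("GATA1", "m3", 1.7)]
kept = filter_database(entries, cutoff=1.55928)
print(kept)
```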

In order to merge the different databases per species, we execute the following merging procedure on the filtered sets of the individual databases:

1. Consider all JASPAR matrices.
2. Add all HOCOMOCO matrices that are not included in the set of step (1).
3. Add all Kellis Lab ENCODE Motif database matrices that are not included in the set of step (2).
4. Add all TRANSFAC matrices that are not included in the set of step (3).

This procedure ensures that we generate the largest possible set of open-source PCMs from our collection of TF binding motifs. In addition to the unified set, we also provide the user with the option to work with all PCMs from a single database.
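The priority-ordered merge can be sketched as follows. This is an illustration under the assumption that a matrix counts as "already included" when its TF is already present in the merged set; the function name `merge_databases` and the data layout are hypothetical.

```python
def merge_databases(ordered_sets):
    """Merge filtered per-database sets in priority order.

    ordered_sets: list of (database_name, {tf_name: matrix_id}) pairs,
                  highest priority first (JASPAR, HOCOMOCO, Kellis, TRANSFAC).
    Returns a dict mapping tf_name to (database_name, matrix_id).
    """
    merged = {}
    for database_name, matrices in ordered_sets:
        for tf_name, matrix_id in matrices.items():
            # Assumption: a TF already covered by an earlier database is skipped.
            if tf_name not in merged:
                merged[tf_name] = (database_name, matrix_id)
    return merged

# Illustrative identifiers only.
unified = merge_databases([
    ("JASPAR", {"SP1": "j1"}),
    ("HOCOMOCO", {"SP1": "h1", "GATA1": "h2"}),
    ("Kellis", {"GATA1": "k1", "CTCF": "k2"}),
    ("TRANSFAC", {"CTCF": "t1", "MYC": "t2"}),
])
print(unified)
```

Each later database only contributes TFs that the earlier ones lack, so the order of the list encodes the preference among sources.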

As mentioned above, TRAP computes TF affinities that are based on a biophysical model of TF binding. Therefore, PCMs have to be converted to Position Specific Energy Matrices (PSEMs) before they can be used in TRAP. Intuitively, PSEMs represent the mismatch energy of a given motif. For a detailed explanation and motivation of the energy-based score, please see [10]. A PCM $M$ is converted to a PSEM $E$ according to:

$$E_{i,j}=\frac{1}{\lambda}log(\frac{M_{max,j}}{M_{i,j}}b_{i,j}),$$ $$M_{max,j}=\max\limits_{i\in\{A,C,G,T\}}(M_{i,j}).$$

The parameter $\lambda$ is used for scaling the mismatch energies and $b_{i,j}$ denotes the background frequency of nucleotide $i$ with respect to the most frequent nucleotide at position $j$. This conversion formula is part of the mismatch energy postulated in formula (4) in [10]. By definition, if $i=max$, then $E_{i,j}=0$, as there should be no mismatch energy for the best possible sequence match.

Note that, during conversion, a pseudo count $pc = 1$ is added to each $M_{i,j}$.
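The conversion can be illustrated with a short Python sketch. It is not the C++ tool shipped with TRAP. Several points are assumptions made here: $b_{i,j}$ is taken as the ratio of the background frequency of nucleotide $i$ to that of the most frequent nucleotide at position $j$, the background is derived from a GC-content value with G and C (and A and T) sharing their mass equally, the logarithm is natural, and the names `pcm_to_psem` and `gc_content` are hypothetical.

```python
import math

BASES = "ACGT"

def pcm_to_psem(pcm, gc_content, lam=0.7, pc=1.0):
    """Convert a PCM to a PSEM following the formula in the text.

    pcm:        dict mapping each base to a list of counts per position.
    gc_content: genome-wide GC fraction used to build the background.
    lam:        scaling parameter lambda.
    pc:         pseudo count added to every cell before conversion.
    """
    # Assumption: G and C share gc_content equally; A and T share the rest.
    background = {
        "A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2,
        "C": gc_content / 2, "G": gc_content / 2,
    }
    length = len(pcm["A"])
    psem = {b: [0.0] * length for b in BASES}
    for j in range(length):
        counts = {b: pcm[b][j] + pc for b in BASES}
        top = max(BASES, key=lambda b: counts[b])
        for b in BASES:
            # Assumption: b_{i,j} = background[i] / background[top].
            ratio = (counts[top] / counts[b]) * (background[b] / background[top])
            psem[b][j] = math.log(ratio) / lam
    return psem

example = {"A": [8], "C": [2], "G": [2], "T": [2]}
energies = pcm_to_psem(example, gc_content=0.5)
print(energies)
```

With a uniform background, the most frequent nucleotide at each position receives a mismatch energy of zero, and all other nucleotides receive a positive energy.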

The conversion is done by a C++ tool provided by the authors of TRAP. This is also included in the TEPIC repository. As suggested in [10], we use the following parameters for the conversion:

$\lambda=0.7$
$m=0.584$
$n=-5.66$

The slope $m$ and the intercept $n$ are used to compute a matrix-specific parameter $R_0$ that combines the concentration of the corresponding TF and the equilibrium constant of the binding reaction with its optimal binding site, as defined in [10]. The authors of TRAP found a linear approximation for $R_0$ with:

$$ln(R_0)=m*|M|+n,$$

where $|M|$ denotes the length of the PCM as above.
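As a small worked example of this linear approximation, the following sketch evaluates $\ln(R_0)$ and $R_0$ for a given motif length with the parameter values listed above. The function names are hypothetical.

```python
import math

def ln_r0(motif_length, m=0.584, n=-5.66):
    """Linear approximation ln(R_0) = m * |M| + n."""
    return m * motif_length + n

def r0(motif_length, m=0.584, n=-5.66):
    """R_0 itself, obtained by exponentiating the approximation."""
    return math.exp(ln_r0(motif_length, m, n))

# For a motif of length 10: ln(R_0) = 0.584 * 10 - 5.66 = 0.18.
print(ln_r0(10), r0(10))
```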

Further, we use species-specific GC-content values:

Homo sapiens $=0.41$
Mus musculus $=0.42$
Rattus norvegicus $=0.42$
Drosophila melanogaster $=0.43$
Caenorhabditis elegans $=0.36$

**Figure 2.** Visualization of the PCM preprocessing workflow.

Bibliography

[1] Mathelier, A. and Fornes, O. and Arenillas, D. J. and Chen, C. Y. and Denay, G. and Lee, J. and Shi, W. and Shyr, C. and Tan, G. and Worsley-Hunt, R. and others. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. (2016).
[2] Hume, M. A. and Barrera, L. A. and Gisselbrecht, S. S. and Bulyk, M. L. UniPROBE update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. (2015).
[3] Kulakovskiy, I. V. and Medvedeva, Y. A. and Schaefer, U. and Kasianov, A. S. and Vorontsov, I. E. and Bajic, V. B. and Makeev, V. J. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. (2013).
[4] Kheradpour, P. and Kellis, M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res. (2014).
[5] Matys, V. and Kel-Margoulis, O. V. and Fricke, E. and Liebich, I. and Land, S. and Barre-Dirrie, A. and Reuter, I. and Chekmenev, D. and Krull, M. and Hornischer, K. and Voss, N. and Stegmaier, P. and Lewicki-Potapov, B. and Saxel, H. and Kel, A. E. and Wingender, E. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. (2006).
[6] Sandelin, Albin and Alkema, Wynand and Engstrom, Par and Wasserman, Wyeth W and Lenhard, Boris. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research, Oxford Univ Press (2004).
[7] Kulakovskiy, Ivan V and Medvedeva, Yulia A and Schaefer, Ulf and Kasianov, Artem S and Vorontsov, Ilya E and Bajic, Vladimir B and Makeev, Vsevolod J. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Research, Oxford Univ Press (2013).
[8] Hume, Maxwell A and Barrera, Luis A and Gisselbrecht, Stephen S and Bulyk, Martha L. UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Research, Oxford Univ Press (2014).
[9] Liu, Tao and Ortiz, Jorge A and Taing, Len and Meyer, Clifford A and Lee, Bernett and Zhang, Yong and Shin, Hyunjin and Wong, Swee S and Ma, Jian and Lei, Ying and others. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biology, BioMed Central (2011).
[10] Roider, H. G. and Kanhere, A. and Manke, T. and Vingron, M. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics (2007).

GeneTrail 3.2

Advanced high-throughput enrichment analysis
