GeneTrail 3.2
Advanced high-throughput enrichment analysis
Epigenomics analysis
Motivation
The GeneTrail epigenomics workflow provides functionality to analyze histone marks, DNA methylation patterns, and open-chromatin regions of two or more sample groups. The epigenetic marks are used to identify genes that transition between certain chromatin states in the different groups. The gene sets of the different transition groups are then used to analyze affected signaling pathways. To this end, several analysis steps are carried out: First, our web service identifies, which epigenetic modifications are present in the promoter, enhancer and gene body regions of each transcript. This information then defines a combined chromatin state for all regulatory and coding regions that influence this transcript, i.e. active, poised, repressed, or no information. In the following sections, we call this the chromatin state of a transcript. The transcript information is then used to define the chromatin state of each gene. In the last step, The epigenetic marks are used to divide the genes into transition groups, e.g. genes that transition from a certain chromatin state in one sample to another state in a second sample. For all resulting transition groups, enrichment analyses are then carried out to identify associated pathways. An overview of the workflow is shown in the figure below.
Input data
The input for the epigenomics workflow are genomic positions affected by a certain epigenetic mark. Currently, we support open chromatin regions, histone modifications, DNA methylation patterns, and associated gene expression measurements. In the following sections, we describe the file formats that can be used to upload different data sources.
Open chromatin regions or histone modifications
Open chromatin regions or histone modifications are expected to be given as peaks in a BED file format.
BED format
In this format every line represents a region of interest. Each individual line contains at least three fields.
Chromosome Start position of the region End position of the regionExample:
chr1 180775 180925 chr1 181395 181545 chr1 273895 274045 chr1 629895 630045 chr1 633855 634005 ...
DNA methylation patterns
DNA methylation can either be uploaded as IDAT file resulting from Illumina BeadArrays or as BED file from bisulfite sequencing experiments.
Gene expression patterns
RNA-Seq input has to be a tab-separated matrix of gene- or transcript expression values. The rows of this matrix are expected to be gene or transcript identifiers and the columns should represent measured samples. These samples can include replicates for each group. The matrix has to have row names and a header. All expression measurements have to be uploaded in a single matrix. After the matrix was successfully uploaded and parsed, a table will appear in which you can assign each measurement (column in the matrix) to a group.
Naming convention
After a file has been uploaded, it must be assigned to the associated epigenetic mark. In case the name of the uploaded file contains the mark this assignment is done automatically.
Upload
All files can either be uploaded individually or in form of a ZIP archive. After the files have been uploaded, they have to be assigned to the investigated sample groups.
Identifying epigenetic modifications in promoter, enhancer, and gene body regions of transcripts
We define a promoter of a transcript as window around the transcription start site (TSS). This window size is a parameter and can be 1000 base pairs (also written as 1kb), 2kb, or 5kb, resulting in a total promoter size of 2001, 4001 and 10001 bp, respectively, for each transcript. The gene body is defined as the union of the genomic positions of all exons. Enhancer regions are taken from the database GeneHancer [1]. This database comprises regions of possible enhancers and their possible gene targets for humans. The genomic positions for the TSS and exons are taken from GENCODE [2]. A promoter, enhancer, or gene body region is stated to be affected by an epigenetic mark, if at least one bp of the region is affected by the measured mark. For histone modifications and open-chromatin regions, we use bedtools [3] to test for an intersection of the genomic positions. For DNA methylation, we use RnBeads [4] to calculate a beta-value for each region that states the degree of DNA methylation in this region. We choose 0.8 as beta-value cutoff. If a VCF file is given that contains DNA methylation data, it is first converted to bedgraph format by biscuit [5], and then processed by RnBeads as described above. As a result, each transcript is assigned a set of epigenetic marks for its regulatory and gene body regions.
Predicting a combined chromatin state of promoter, enhancer, and gene body regions of a transcript
An epigenetic mark can have a positive or negative effect on transcription, depending on its location. The tri-methylation of lysine 36 in histone 3 (in short H3K36me3) for example can positively affect transcription if it is found in the gene body . Another example is DNA methylation that can have a negative effect on transcription if found in promoter regions [6]. In our analysis, the epigenetic modification pattern, affecting the regulatory and gene body regions of a transcript, is analyzed based on this a priori knowledge gathered from the specialized databases HIstome [7] and HHMD (Human Histone Modification Databse) [8], as well as from a manual literature search to complement this knowledge. To this end, decision rules are extracted from this knowledge base, e.g. "if promoter is covered by H3K4me3 then transcript is 'active'". We then apply these decision rules to the epigenetic modification patterns of the regulatory and gene body regions of a transcript to assign the transcript a combined chromatin state, which is either 'active', 'poised', or 'repressed'. If no epigenetic marks are present in the promotor, enhancer, or gene body regions of a transcript, the chromatin state of the transcript is set to 'no information'.
Predicting the chromatin state of a gene
If a gene codes for several transcripts, it might occur that the chromatin states for the regulatory and gene body regions of those transcripts are assigned to different values. In that case, we assume that the function of these transcripts are redundant and that any transcript can fulfill the function of the gene. Therefore, we assign a gene to the most active chromatin state of its transcripts. Hence, if at least one transcripts is active, the gene is assigned to be active. If this is not the case and if at least one transcripts is poised, the gene is assigned to be poised. If all transcripts are repressed, the gene is also assigned to be repressed. In the case that no transcript contains any epigenetic mark, the gene is assigned to have no information.
Performing enrichment analyses
In the last step of the workflow, the different uploaded sample groups are compared using enrichment analyses. For each pair of groups, we define 16 sets of genes, one for each chromatin state pair. Thereby, chromatin state pairs represent transitions between chromatin states of two groups. A gene is assigned to a chromatin state pair (e.g., active - repressed) if it is assigned to the first chromatin state in the first group (in the example active) and to the second chromatin state in the second group (in the example repressed). For these 16 sets of genes, we perform an Over-Representation Analysis (ORA).
Over-representation analysis
Let us assume that we have a biological category (signaling pathway or biological process) that has k entries in our test set, which consists of n entries, and l entries in the reference (all investigated genes), which consists of m entries. We can then use one of the following statistical tests to check if the test set has more entries in our category then expected by chance.
Hypergeometric test
For this analysis, all elements of test set are always part of the reference. For this purpose, the hypergeometric test can be applied to compute a p-value for the analyzed category:
$$P_C=\begin{cases} \sum\limits_{i=k}^{min(n,l)} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} &, \text{if }k' < k\\ \sum\limits_{i=max(n+l-m, 0)}^{k} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} &, \text{if }k'\ge k \end{cases}$$Multiple testing correction
Since for each test set in our analysis multiple biological categories are tested simultaneously, we need to adjust the resulting p-values in order the account for the multiple testing problem. For this purpose GeneTrail provides a variety of methods (cf. [9]).
Biblibgraphy
- GeneHancer: genome-wide integration of enhancers and target genes in GeneCards Database Oxford Academic
- GENCODE reference annotation for the human and mouse genomes Nucleic acids research Oxford University Press
- BEDTools: the Swiss-army tool for genome feature analysis Current protocols in bioinformatics Wiley Online Library
- RnBeads 2.0: comprehensive analysis of DNA methylation data Genome biology BioMed Central
- biscuit-0.1.3. Zenodo. (View online)
- Functions of DNA methylation: islands, start sites, gene bodies and beyond Nature Reviews Genetics Nature Publishing Group
- HIstome—a relational knowledgebase of human histone proteins and histone modifying enzymes Nucleic acids research Oxford University Press
- HHMD: the human histone modification database Nucleic acids research Oxford University Press
- Multi-omics Enrichment Analysis using the GeneTrail2 Web Service Bioinformatics Oxford University Press