GeneTrail 3.2
Advanced high-throughput enrichment analysis
Time-series analysis
Motivation
The GeneTrail time series workflow provides functionality for the analysis of time resolved gene, protein, or miRNA expression data sets. In order to process these data sets, several analysis steps are conducted: our web service first uses a two-stage clustering approach to group biological entities with very similar expression patterns over time. For the resulting clusters, enrichment analyses are carried out to identify associated pathways. An overview of the workflow is shown in Figure 1.
Input
The input for a time series analysis are gene expression values obtained from microarrays, RNA-seq experiments or mass-spectrometry runs. The measurements can be uploaded as a plain text, tab-separated matrix where columns represent specific time points and each row represents the expression measurements for a particular gene at those time points.
ExampleT0 T1 T2 GeneA 0.1 4.3 2.3 GeneB 3.2 1.2 1.1 GeneC 2.7 9.1 0.3
In general we recommend to upload already normalized and logarithmized data. This gives users the complete control over quality control, batch effect removal, and normalization. However, for RNA-Seq data GeneTrail providesa variety of normalization methods: log(TMM+1), log(GeTMM+1), log(TPM+1), log(CPM+1).
Clustering
In order to group genes, proteins or miRNA with similar expression patterns, we carry out two clustering steps. First, a strict clustering is performed that generates small groups with a high similarity between all members. The second clustering then combines similar clusters generated in the first step to super-clusters. These super-clusters give a general overview of the data. However, they can be further refined to investigate the contained subclusters and finally also the individual genes.
For each stage of our clustering approach, we need to carry out four steps: (1) filtering the gene expression data , (2) calculating the distance between all gene pairs, (2) hierachical clustering, and (3) cutting the dendrogram to obtain clusters.
Filtering
In a preprocessing step, we remove all genes that only show limited expression changes over time. To this end, we implemented two metrics that summarize the overall expression changes in a time series $t={t_1,...,t_n}$. The result of a metric is then compared to a user defined threshold $\delta$ to remove unwanted genes, i.e. genes with low overall expression changes with respect to the chosen metric.
Absolute expression difference
The first measure calculates the absolute difference between the highest and lowest expression value in $t$.
$$d(t)=max(t)-min(t)$$Average expression change
The second method calculates the average expression difference over all consecutive time points.\\
$$d(t)=\frac{1}{n-1}\sum_{i=1}^{n-1}|x_i - x_{i+1}|$$Distance measures
In order to quantify the distance of two time courses, we implemented a variety of distance measures. For the description of these measures, we consider both curves as $n$-dimensional vectors $p=(p_1, p_i, ..., p_n)$ and $q=(q_1, q_i, ..., q_n)$, where $i \in {1,..n}$ represents time point $i$.
Distance measures for time points
In this section, we describe distance measures that calculate the distance between $p$ and $q$ based on their entries.
Euclidean distance
The Euclidean distance is a metric for distance between two points in Euclidean space. It is defined as:
$$d(p,q) = \sqrt{\sum_{i=1}^{n}(q_i - p_i )^2}$$Minimized euclidean distance
For the minimized version of the Euclidean distance, we shift one of the vectors such that the distance between the two is minimized. This means that height differences between the curves are ignored.
$$d(p,q) = \sqrt{\sum_{i=1}^{n}(q_i - s - p_i )^2}$$The optimal value for $s$ can be found as follows:
$$s = \frac{1}{n}\sum_{i=1}^{n}(q_i - p_i )$$Distance measures for transitions between time poinnts
In this section, we describe distance measures that consider the transitions between the time points rather than the time points themselves.
Angle distance
We define the angle distance as the sum of all angles between the transitions of consecutive time points in the time series.
$$d(p,q) = \sum_{i=1}^{n-1} \theta((1, p_{i+1} - p_i), (1, q_{i+1}-q_i))$$where $\theta$ is defined as the angle between the transition of two consecutive points.
Euclidean distance for gradients
This version of the Euclidean distance, calculates the distance between $p$ and $q$ based on the gradients of all consecutive time points.
$$d(p,q) = \sqrt{\sum_{i=1}^{n-1}((q_{i+1} - q_{i}) - (p_{i+1} - p_{i}))^2}$$Association measures
In addition to the distance measures described above, we can also use a variety of association/similarity measures. Before a clustering can be applied, these values have to be transformed to distance measures.
Pearson correlation
The Pearson correlation coefficient (PCC) is a measure for linear dependence between two random variables $P$ and $Q$. We assume that $p$ and $q$ are samples drawn from these variables.
$$r(p, q) = \frac{1}{n-1} \sum_{i=1}^{n}(\frac{p_i-\bar{p}}{s_P})(\frac{q_i - \bar{q}}{s_Q})$$,where $\bar{p},\bar{q}$ and $s_P$, $s_Q$ are the sample means and samples variances respectively.
The Pearson correlation coefficient $r$ ranges from -1 to 1. A value of 1 implies that the relationship between $P$ and $Q$ is perfectly described by a linear function, with all data points lying on a line for which both $P$ and $Q$ increase. A value of −1 implies that all data points lie on a line for which $P$ increases as $Q$ decreases. A value of 0 implies that there is no linear dependence between the variables. This means we can define the following distance measure:
$$d(p,q) = 1 - r(p, q)$$Spearman correlation
The Spearman correlation coefficient (SCC) is a non-parametric measure for dependence between between two random variables $P$ and $Q$. Here, we assume that $p$ and $q$ are samples drawn from these variables. The SCC assesses how well the relationship between two variables can be described using a monotonic function.
$$r(p,q) = 1 - \frac{6 \sum\limits_{i=1}^{n}(rank(p_i) - rank(q_i))^2}{n(n^2-1)}$$The rank $rank(x_i)$ of a sample $x_i$ is the position of that sample in the decreasingly ordered sequence of all samples. In accordance with the PCC, we can transform the SCC into a distance measure:
$$d(p,q) = 1 - r(p, q)$$Hierarchical clustering
For the hierarchical clustering, we rely on the fastcluster R-package (https://cran.r-project.org/web/packages/fastcluster/index.html). It provides an agglomerative clustering approach, where each observation starts in its own cluster and pairs of clusters are merged until a complete hierarchy is generated. In each step, all the clusters with the smallest distance are merged. In order to identify the minimum distance of two clusters $A$ and $B$, users can choose from a variety of methods.
Average Linkage
$$d(A,B)=\frac{1}{|A||B|}\sum_{a \in A, b \in B} d(a,b)$$Complete Linkage
$$d(A,B)=\max_{a \in A, b \in B} \{d(a,b)\}$$Single Linkage
$$d(A,B)=\min_{a \in A, b \in B} \{d(a,b)\}$$McQuitty
$$d(A,B)=\frac{1}{(|A|+|B|)(|A|+|B|-1)}\sum_{a,b \in A \cup B} d(a,b)$$Ward’s Method
$$d(A,B)=\frac{d(\bar{a},\bar{b})}{\frac{1}{|A|} + \frac{1}{|B|}}$$Final clusters
The resulting dendrogram contains the complete cluster hierarchy. In order to obtain a set of clusters that is of special interest to the user, the dendrogram has to be cut. To this end, users can define a threshold $t \in [0,1]$ that describes the quantile of merges that should be used to generate a final result. A smaller threshold will result in a stricter clustering with more individual clusters that have a higher similarity between all members.
Over-representation analysis
For the resulting clusters and super-clusters, enrichment analyses are carried out to identify associated pathways. To this end, lets assume that we have a biological category (signaling pathway or biological process) that has k entries in our test set (cluster), which consists of n entries, and l entries in the reference (all investigated genes), which consists of m entries. We can then use one of the following statistical tests to check if the test set has more entries in our category then expected by chance.
Hypergeometric test
For this analysis, all elements of test set is always part of the reference. For this purpose, the hypergeometric test can be applied to compute a p-value for the analyzed category:
$$P(K \ge k)=\sum\limits_{i=max(n+l-m, 0)}^{k} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}}$$Multiple testing correction
Since for each cluster in our analysis multiple biological categories are tested simultaneously, we need to adjust the resulting p-values in order the account for the multiple testing problem. For this purpose GeneTrail provides a variety of methods.