# GeneTrail 3.2

#### Advanced high-throughput enrichment analysis

## Time-series analysis

### Motivation

The GeneTrail time series workflow provides functionality for the analysis of time resolved gene, protein, or miRNA expression data sets. In order to process these data sets, several analysis steps are conducted: our web service first uses a two-stage clustering approach to group biological entities with very similar expression patterns over time. For the resulting clusters, enrichment analyses are carried out to identify associated pathways. An overview of the workflow is shown in Figure 1.

### Input

The input for a time series analysis are gene expression values obtained from microarrays, RNA-seq experiments or mass-spectrometry runs. The measurements can be uploaded as a plain text, tab-separated matrix where columns represent specific time points and each row represents the expression measurements for a particular gene at those time points.

**Example**

T0 T1 T2 GeneA 0.1 4.3 2.3 GeneB 3.2 1.2 1.1 GeneC 2.7 9.1 0.3

In general we recommend to upload already normalized and logarithmized data. This gives users the complete control over quality control, batch effect removal, and normalization. However, for RNA-Seq data GeneTrail providesa variety of normalization methods: log(TMM+1), log(GeTMM+1), log(TPM+1), log(CPM+1).

### Clustering

In order to group genes, proteins or miRNA with similar expression patterns, we carry out two clustering steps. First, a strict clustering is performed that generates small groups with a high similarity between all members. The second clustering then combines similar clusters generated in the first step to *super-clusters*.
These super-clusters give a general overview of the data. However, they can be further refined to investigate the contained subclusters and finally also the individual genes.

For each stage of our clustering approach, we need to carry out four steps: (1) filtering the gene expression data , (2) calculating the distance between all gene pairs, (2) hierachical clustering, and (3) cutting the dendrogram to obtain clusters.

### Filtering

In a preprocessing step, we remove all genes that only show limited expression changes over time. To this end, we implemented two metrics that summarize the overall expression changes in a time series $t={t_1,...,t_n}$. The result of a metric is then compared to a user defined threshold $\delta$ to remove unwanted genes, i.e. genes with low overall expression changes with respect to the chosen metric.

#### Absolute expression difference

The first measure calculates the absolute difference between the highest and lowest expression value in $t$.

$$d(t)=max(t)-min(t)$$#### Average expression change

The second method calculates the average expression difference over all consecutive time points.\\

$$d(t)=\frac{1}{n-1}\sum_{i=1}^{n-1}|x_i - x_{i+1}|$$### Distance measures

In order to quantify the distance of two time courses, we implemented a variety of distance measures. For the description of these measures, we consider both curves as $n$-dimensional vectors $p=(p_1, p_i, ..., p_n)$ and $q=(q_1, q_i, ..., q_n)$, where $i \in {1,..n}$ represents time point $i$.

#### Distance measures for time points

In this section, we describe distance measures that calculate the distance between $p$ and $q$ based on their entries.

##### Euclidean distance

The Euclidean distance is a metric for distance between two points in Euclidean space. It is defined as:

$$d(p,q) = \sqrt{\sum_{i=1}^{n}(q_i - p_i )^2}$$##### Minimized euclidean distance

For the minimized version of the Euclidean distance, we shift one of the vectors such that the distance between the two is minimized. This means that height differences between the curves are ignored.

$$d(p,q) = \sqrt{\sum_{i=1}^{n}(q_i - s - p_i )^2}$$The optimal value for $s$ can be found as follows:

$$s = \frac{1}{n}\sum_{i=1}^{n}(q_i - p_i )$$#### Distance measures for transitions between time poinnts

In this section, we describe distance measures that consider the transitions between the time points rather than the time points themselves.

##### Angle distance

We define the angle distance as the sum of all angles between the transitions of consecutive time points in the time series.

$$d(p,q) = \sum_{i=1}^{n-1} \theta((1, p_{i+1} - p_i), (1, q_{i+1}-q_i))$$where $\theta$ is defined as the angle between the transition of two consecutive points.

##### Euclidean distance for gradients

This version of the Euclidean distance, calculates the distance between $p$ and $q$ based on the gradients of all consecutive time points.

$$d(p,q) = \sqrt{\sum_{i=1}^{n-1}((q_{i+1} - q_{i}) - (p_{i+1} - p_{i}))^2}$$#### Association measures

In addition to the distance measures described above, we can also use a variety of association/similarity measures. Before a clustering can be applied, these values have to be transformed to distance measures.

##### Pearson correlation

The Pearson correlation coefficient (PCC) is a measure for linear dependence between two random variables $P$ and $Q$. We assume that $p$ and $q$ are samples drawn from these variables.

$$r(p, q) = \frac{1}{n-1} \sum_{i=1}^{n}(\frac{p_i-\bar{p}}{s_P})(\frac{q_i - \bar{q}}{s_Q})$$,where $\bar{p},\bar{q}$ and $s_P$, $s_Q$ are the sample means and samples variances respectively.

The Pearson correlation coefficient $r$ ranges from -1 to 1. A value of 1 implies that the relationship between $P$ and $Q$ is perfectly described by a linear function, with all data points lying on a line for which both $P$ and $Q$ increase. A value of −1 implies that all data points lie on a line for which $P$ increases as $Q$ decreases. A value of 0 implies that there is no linear dependence between the variables. This means we can define the following distance measure:

$$d(p,q) = 1 - r(p, q)$$##### Spearman correlation

The Spearman correlation coefficient (SCC) is a non-parametric measure for dependence between between two random variables $P$ and $Q$. Here, we assume that $p$ and $q$ are samples drawn from these variables. The SCC assesses how well the relationship between two variables can be described using a monotonic function.

$$r(p,q) = 1 - \frac{6 \sum\limits_{i=1}^{n}(rank(p_i) - rank(q_i))^2}{n(n^2-1)}$$The rank $rank(x_i)$ of a sample $x_i$ is the position of that sample in the decreasingly ordered sequence of all samples. In accordance with the PCC, we can transform the SCC into a distance measure:

$$d(p,q) = 1 - r(p, q)$$### Hierarchical clustering

For the hierarchical clustering, we rely on the *fastcluster* R-package (https://cran.r-project.org/web/packages/fastcluster/index.html). It provides an agglomerative clustering approach, where each observation starts in its own cluster and pairs of clusters are merged until a complete hierarchy is generated. In each step, all the clusters with the smallest distance are merged. In order to identify the minimum distance of two clusters $A$ and $B$, users can choose from a variety of methods.

#### Average Linkage

$$d(A,B)=\frac{1}{|A||B|}\sum_{a \in A, b \in B} d(a,b)$$#### Complete Linkage

$$d(A,B)=\max_{a \in A, b \in B} \{d(a,b)\}$$#### Single Linkage

$$d(A,B)=\min_{a \in A, b \in B} \{d(a,b)\}$$#### McQuitty

$$d(A,B)=\frac{1}{(|A|+|B|)(|A|+|B|-1)}\sum_{a,b \in A \cup B} d(a,b)$$#### Ward’s Method

$$d(A,B)=\frac{d(\bar{a},\bar{b})}{\frac{1}{|A|} + \frac{1}{|B|}}$$#### Final clusters

The resulting dendrogram contains the complete cluster hierarchy. In order to obtain a set of clusters that is of special interest to the user, the dendrogram has to be cut. To this end, users can define a threshold $t \in [0,1]$ that describes the quantile of merges that should be used to generate a final result. A smaller threshold will result in a stricter clustering with more individual clusters that have a higher similarity between all members.

### Over-representation analysis

For the resulting clusters and super-clusters, enrichment analyses are carried out to identify associated pathways. To this end, lets assume that we have a biological category (signaling pathway or biological process) that has k entries in our test set (cluster), which consists of n entries, and l entries in the reference (all investigated genes), which consists of m entries. We can then use one of the following statistical tests to check if the test set has more entries in our category then expected by chance.

#### Hypergeometric test

For this analysis, all elements of test set is always part of the reference. For this purpose, the hypergeometric test can be applied to compute a p-value for the analyzed category:

$$P(K \ge k)=\sum\limits_{i=max(n+l-m, 0)}^{k} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}}$$#### Multiple testing correction

Since for each cluster in our analysis multiple biological categories are tested simultaneously, we need to adjust the resulting p-values in order the account for the multiple testing problem. For this purpose GeneTrail provides a variety of methods.