Set-level statistics

The general idea of all set-level statistics is to revise if a certain category $C$ is significantly enriched or depleted in the analyzed data. A category is a set of biological entities like genes, proteins, or metabolites that are associated with a certain biological process, molecular function, or any molecular signature that might be of interest. The category is used to divide the input data into two groups, entries that are contained and entries that are not contained. Based on this information, a statistical test is applied that computes the differences between these two groups. In the end a category is declared significantly enriched if the upper-tailed p-value of a test is significant and depleted if the lower-tailed p-value is significant.

$$P_{enriched} = P(X \ge x)$$ $$P_{depleted} = P(X \le x)$$

Hereafter, methods are introduced that can be used for this purpose. We use $L=(l_{1},l_{2},\ldots,l_{m})$ to denote all biological entities in the input data, $S=(s_{1},s_{2},\ldots,s_{m})$ as corresponding scores and $X \subseteq S$ as the scores of all entries that are contained in category C. The size of X is denoted by n. We also denote $Y=S \setminus X$. Additionally, we provide general guidelines for the usage of these algorithms.

Over-representation analysis

One of the first methods to investigate enrichment of predefined categories was the over-representation analysis (ORA). Accordingly, this approach has been employed by many authors, e.g. [1], [2], [3], [4], [5]. In GeneTrail 2 we implemented the version of ORA that was presented by Backes et al. [1]. This approach is based on the hypergeometric distribution and can be used to test if a set of selected biological entities is significantly more or less present in a biological category than expected by chance.

In contrast to all other methods described on this page, ORA works only on a subset T of biological entities that fulfill a certain condition, like all genes that are in the top $10\%$ of a sorted list L. Additionally, ORA relies on a reference set (background). For test set T it is then checked if the contained entries are more or less present in category C than expected by reference set R.

Assume a biological category C has k entries in list $T = (t_{1},t_{2},\ldots,t_{n})$ and l entries in reference set $R=(r_{1},r_{2},\ldots,r_{m})$. Based on this information we expect to find $k'=\frac{n*l}{m}$ elements of list L in category C on average.

If T is a subset of R, the hypergeometric test is applied to compute a p-value for C:

$$P_C=\begin{cases} \sum\limits_{i=k}^{min(n,l)} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} &, \text{if }k' < k\\ \sum\limits_{i=max(n+l-m, 0)}^{k} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} &, \text{if }k'\ge k \end{cases}$$

If T is not a subset of R, Fisher's exact test is used instead:

$$P_C=\begin{cases} \sum\limits_{i=k}^{min(n,l+k)} \frac{\binom{n}{i}\binom{m}{l+k-i}}{\binom{m+n}{l+k}} &, \text{if }k' < k\\ \sum\limits_{i=max(l+k-m,0)}^{k} \frac{\binom{n}{i}\binom{m}{l+k-i}}{\binom{m+n}{l+k}} &, \text{if }k'\ge k \end{cases}$$

Weighted gene set enrichment analysis (WGSEA)

Another popular method for detecting enrichment of functional categories is the weighted gene set enrichment analysis (WGSEA) introduced by Mootha et al. [6] and improved by the same group two years later [7]. This method is based on the Kolmogorov-Smirnov test [8], which compares the deviation between the distribution functions of two groups $X$ and $Y$. In terms of enrichment analysis, the goal of WGSEA is to determine whether members of category $C$ tend to occur towards the top (or bottom) of a sorted list L.

The test can be defined using a running sum statistic based on a biological category C with j members occurring in a sorted list $L = \{l_{1},l_{2},\ldots,l_{n}\}$. The running sum increases every time a gene in the list L is in C and decreases every time a gene is not in C. In contrast to the Kolmogorov-Smirnov statistic GSEA is signed and uses weights for each step.

$$RS(i)=\begin{cases} 0 &, \text{if }i=0\\ RS(i-1) + \frac{|l_i|^p}{N_R} &, \text{if }l_i \text{ belongs to C}\\ RS(i-1) - \frac{1}{(n - j)} &, \text{if }l_i \text{ does not belong to C} \end{cases}$$

, where $N_R = \sum\limits_{l_i \in C}|l_i|^p$ is the sum of all list entries contained in category C.

The exponent $p$ can be used to control the weight of each step. When $p=0$, this formulation reduces to the standard Kolmogorov-Smirnov statistic [7]. In our implementation of the algorithm we use $p = 1$, as we additionally provide an unweighted version of the test.

The test statistic $RS_C$ is the maximal deviation from 0 of $RS(i)$.

$$RS_C = \max\limits_{1 \le i \le n} \{|RS(i)|\}$$

The p-value for test statistic $s_{max}$ can be calculated via a permutation test with t permutations.

If $RS_C > 0$ an upper-tailed p-value is computed:

$$p_{enriched} = \frac{1}{t} \sum\limits_{i=1}^{t}I(RS_i \ge RS_C)$$

If $RS_c < 0$ a lower-tailed p-value is computed:

$$p_{depleted} = \frac{1}{t} \sum\limits_{i=1}^{t}(RS_i \le RS_C)$$

Gene set enrichment analysis (GSEA)

We also implemented an unweighted version of the gene set enrichment analysis (GSEA) [1], which is equivalent to the standard Kolmogorov-Smirnov statistic [8]. It is a non-parametric hypothesis test, which is based solely on the order of an input list L. Focusing on ranks rather than on the absolute value has the advantage that the method is more robust and can penalize outliers, which might otherwise have a big influence on the results. Like the weighted version of the test, GSEA evaluates whether the genes of the considered category are randomly distributed or accumulated on top or bottom of the list [1].

The test statistic can be defined similar to the weighted version by using a running sum statistic based on a biological category C with j members occurring in a sorted list $L = \{l_{1},l_{2},\ldots,l_{n}\}$.

$$ RS(i)=\begin{cases} 0 &, \text{if }i=0\\ RS(i-1) + n - j &, \text{if }l_i \text{ belongs to C}\\ RS(i-1)- j &, \text{if }l_i \text{ does not belong to C} \end{cases} $$

The test statistic $RS_C$ is the maximal deviation from 0 of $RS(i)$.

$RS_C = \max\limits_{1 \le i \le n} \{|RS(i)|\}$$

A huge advantage of the unweighted GSEA is that an exact p-value for test statistic $RS_C$ can be computed via the dynamic programming algorithm presented in the next section.

Exact p-value computation

Keller et al. [9] showed that a p-value for GSEA can be computed as the probability that a random running sum reaches a maximal deviation greater or equal than $RS_C$.

The authors developed an algorithm, which is able to compute the number of running sum statistics that are smaller than $RS_C$. This can be used to compute an exact p-value for $RS_C$ based on the complement of this event.

$p = 1-\frac{X}{Y},$

where X is the number of running sum statistics with a maximum deviation of at most $RS-1$ and Y is the number of all possible running sum statistics. Y can be computed as $\binom{n}{j}$, where n is the length of list L and j is the number of entries that belong to category C.

The value of X can be calculated via a dynamic programming algorithm. The algorithm iterates over a matrix M with $(j+1) \times (n - j +1)$ entries. $M(i,k)$ is the number of running sum statistics with i members of the considered category and k non-members whose maximum deviation of zero is less than $RS_C-1$.

The matrix is initialized with:

$$ M(i,0)=\begin{cases} 1&, \text{if } i\cdot(n-k) < |RS_C|\\ 0& \text{else} \end{cases} $$ $$ M(0,j)=\begin{cases} 1&, \text{if } j\cdot k< |RS_C|\\ 0& \text{else} \end{cases} $$

The value for X can be found via dynamic programming over matrix M for $i=0$ to $j$ and for $k=0$ to $n-j$:

$$ M(i,k)=\begin{cases} M(i-1,k) + M (i,k-1)&, \text{if } (*)\\ 0& \text{else} \end{cases}, $$

where (*) is $ i\cdot(n-j) - k \cdot j < |RS_C|$ for an upper-tailed p-value and $-|RS_C| < i\cdot(n-j) - k \cdot j$ for a lower-tailed p-value.

Averaging methods

Another approach pursued by many authors [10] , [11] , [12] , [13] is simply to average over the test statistics of the genes that belong to a certain category. Ackermann and Strimmer [14] show that these averaging methods can perform similar or in some conditions even better than other methods.

The following methods have been proposed to compute a test statistic for the scores $X = (x_{1},x_{2},\ldots,x_{n})$ of list L that belong to a category C.

$$ts_{mean} = \sum\limits_{X_i \in X}\frac{x_i}{n}$$ $$ts_{sum} = \sum\limits_{x_i \in X}x_i$$ $$ts_{median} = \begin{cases} x_{\frac{n+1}{2}}, & \text{if }n\text{ is odd,}\\ \frac{1}{2}(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}), & \text{if }n\text{ even.} \end{cases}$$

The p-value for test statistic $ts^{*}$ (* = mean, sum, or median) can be calculated via a permutation test with t iterations. In each step the scores are randomly permuted and the computation is repeated. The score for permutation i is denoted by $ts^{*}_{i}$. This test is identical for each method. The only difference is the parameter $a$ based on which we decide if an upper-tailed or a lower-tailed p-value is calculated. For the mean and the median statistic an upper-tailed p-value is calculated if the test statistic is bigger than the population mean ($\bar{\mu}$) or median ($\tilde{\mu}$) respectively. Accordingly, a lower-tailed p-value is computed if the test statistic is smaller than the corresponding parameter. For the sum statistic a right-sided p-value is calculated if $ts_{sum}$ is positive and a left-sided p-value if the test statistic is negative. Consequently, the sum statistic assumes the data to be centered around 0.

$$a = \begin{cases} \bar{s}, \text{if * = mean} \\ \tilde{s}, \text{if * = median}\\ 0, \text{if * = sum}\\ \end{cases}$$

If $ ts^{*} > a$ an upper-tailed p-value is computed:

$$p_{enriched} = \frac{1}{t} \sum\limits_{i=1}^{t}I(ts^{*}_i \ge ts^{*})$$

If $ts^{*} < a$ a lower-tailed p-value is computed:

$$p_{depleted} = \frac{1}{t} \sum\limits_{i=1}^{t}I(ts^{*}_i \le ts^{*})$$

Maxmean statistic

Based upon averaging methods as described in the last section, Efron et al. [15] introduce the maxmean statistic, a test that separates between positive and negative scores. This separation facilitates the detection of gene sets containing both up- and down-regulated genes . Furthermore, this test statistic prevents categories with only very few unusually large scores to be significant, yet it allows to detect sets with both moderately large positive and negative scores [14].

The test statistic for the scores $S = (s_{1},s_{2},\ldots,s_{n})$ of list L that belong to a category C is defined as:

$$ s_{max} = max(\bar{s}^{(+)},-\bar{s}^{(-)} )$$

where $\bar{s}^{(+)} = \sum\limits_{s_i \in S}I(s_i \ge 0) \frac{s_i}{n}$ and $\bar{s}^{(-)} = \sum\limits_{s_i \in S}I(s_i \le 0) \frac{s_i}{n}$.

The p-value for test statistic $s_{max}$ can be calculated via a permutation test with t permutations. This test is performed analogously to the permutation test in Section \ref{sec:averaging}. An upper-tailed p-value is computed if $s_{max} = \bar{s}^{(+)}$ and respectively a lower-tailed if $s_{max} = -\bar{s}^{(-)}$.

One sample t-test

An additional approach to judge the significance of biological categories is the one sample t-test [16]. This test can be used to determine if a given sample has the same mean as the rest of the population. The test assumes the population to be normally distributed. In terms of enrichment analysis this test can be used to compare the mean of all tested biological entities to the mean of the ones contained in a certain functional category. A category is declared depleted if the corresponding mean is significantly smaller than the mean of all biological entities and enriched if the mean is significantly larger.

The test statistic for a sample $X=(x_{1},x_{2},\ldots,x_{n})$ is defined as:

$$t=\frac{\bar{x} - \mu_0}{\sqrt{\frac{Var(X)}{n}}}$$

A p-value for the test statistic t can be derived from a t-distribution with $n-1$ degrees of freedom.

Welch's t-test and Wilcoxon rank-sum test

Similar to the one-sample t-test, the Welch's t-test (cf. Section Scoring) and the Wilcoxon rank-sum test (cf. Section Scoring)) can also be used to determine if a biological category C is significantly enriched or depleted. Both tests can be used to test all biological entities that are contained in C against the ones that are not contained.

Biblibgraphy

  1. Backes, Christina and Keller, Andreas and Kuentzer, Jan and Kneissl, Benny and Comtesse, Nicole and Elnakady, Yasser A and Müller, Rolf and Meese, Eckart and Lenhof, Hans-Peter GeneTrail—advanced gene set enrichment analysis Nucleic acids research Oxford Univ Press (View online)
  2. Draghici, Sorin and Khatri, Purvesh and Martins, Rui P. and Ostermeier, G. Charles and Krawetz, Stephen A. Global functional profiling of gene expression Genomics Elsevier (View online)
  3. Hosack, Douglas A and Dennis Jr, Glynn and Sherman, Brad T and Lane, H Clifford and Lempicki, Richard A and others Identifying biological themes within lists of genes with EASE Genome Biol (View online)
  4. Khatri, Purvesh and Draghici, Sorin Ontological analysis of gene expression data: current tools, limitations, and open problems Bioinformatics Oxford Univ Press (View online)
  5. Zhang, Bing and Schmoyer, Denise and Kirov, Stefan and Snoddy, Jay GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies BMC bioinformatics BioMed Central Ltd
  6. Mootha, Vamsi K and Lindgren, Cecilia M and Eriksson, Karl-Fredrik and Subramanian, Aravind and others PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes Nature genetics Nature Publishing Group (View online)
  7. Subramanian, Aravind and Tamayo, Pablo and Mootha, Vamsi K and Mukherjee, Sayan and Ebert, Benjamin L and Gillette, Michael A and Paulovich, Amanda and Pomeroy, Scott L and Golub, Todd R and Lander, Eric S and others Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles Proceedings of the National Academy of Sciences of the United States of America National Acad Sciences (View online)
  8. Hollander, Myles and Wolfe, Douglas A and Chicken, Eric Nonparametric statistical methods John Wiley and Sons
  9. Keller, A. and Backes, C. and Lenhof, H. P. Computation of significance scores of unweighted Gene Set Enrichment Analyses BMC Bioinformatics (View online)
  10. Jiang, Zhen and Gentleman, Robert Extensions to gene set enrichment Bioinformatics Oxford Univ Press
  11. Pavlidis, Paul and Lewis, Darrin P and Noble, William Stafford Exploring gene expression data with class scores
  12. Smyth, Gordon K Limma: linear models for microarray data Springer
  13. Tian, Lu and Greenberg, Steven A and Kong, Sek Won and Altschuler, Josiah and Kohane, Isaac S and Park, Peter J Discovering statistically significant pathways in expression profiling studies Proceedings of the National Academy of Sciences of the United States of America National Acad Sciences
  14. Ackermann, Marit and Strimmer, Korbinian A general modular framework for gene set enrichment analysis BMC Bioinformatics (View online)
  15. Efron, Bradley and Tibshirani, Robert On testing the significance of sets of genes The annals of applied statistics JSTOR
  16. Zar, Jerrold H and others Biostatistical analysis Pearson Education India