## Set-level statistics

The general idea of all set-level statistics is to test whether a certain category $C$ is significantly enriched or depleted in the analyzed data. A category is a set of biological entities like genes, proteins, or metabolites that are associated with a certain biological process, molecular function, or any molecular signature that might be of interest. The category is used to divide the input data into two groups: entries that are contained in the category and entries that are not. Based on this partition, a statistical test is applied that quantifies the difference between the two groups. Finally, a category is declared significantly enriched if the upper-tailed p-value of the test is significant and depleted if the lower-tailed p-value is significant.

$$P_{enriched} = P(X \ge x)$$ $$P_{depleted} = P(X \le x)$$

Hereafter, methods are introduced that can be used for this purpose. We use $L=(l_{1},l_{2},\ldots,l_{m})$ to denote all biological entities in the input data, $S=(s_{1},s_{2},\ldots,s_{m})$ as corresponding scores and $X \subseteq S$ as the scores of all entries that are contained in category C. The size of X is denoted by n. We also denote $Y=S \setminus X$. Additionally, we provide general guidelines for the usage of these algorithms.

### Over-representation analysis

One of the first methods to investigate enrichment of predefined categories was the over-representation analysis (ORA). This approach has been employed by many authors, e.g. [1-5]. In GeneTrail 2 we implemented the version of ORA that was presented by Backes et al. [1]. This approach is based on the hypergeometric distribution and can be used to test whether a set of selected biological entities is significantly more or less present in a biological category than expected by chance.

In contrast to all other methods described on this page, ORA works only on a subset $T$ of biological entities that fulfill a certain condition, like all genes that are in the top $10\%$ of a sorted list $L$. Additionally, ORA relies on a reference set $R$ (background). For the test set $T$, it is then checked whether its entries are more or less frequent in category $C$ than expected from the reference set $R$.

Assume a biological category $C$ has $k$ entries in the test set $T = (t_{1},t_{2},\ldots,t_{n})$ and $l$ entries in the reference set $R=(r_{1},r_{2},\ldots,r_{m})$. Here, $n$ and $m$ denote the sizes of $T$ and $R$, respectively. Based on this information we expect to find $k'=\frac{n \cdot l}{m}$ elements of $T$ in category $C$ on average.

If T is a subset of R, the hypergeometric test is applied to compute a p-value for C:

$$P_C=\begin{cases} \sum\limits_{i=k}^{min(n,l)} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} &, \text{if }k' < k\\ \sum\limits_{i=max(n+l-m, 0)}^{k} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}} &, \text{if }k'\ge k \end{cases}$$
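The case distinction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GeneTrail 2 implementation: the function name `ora_hypergeometric` and its return format are hypothetical, and the binomial coefficients are evaluated exactly with `math.comb`, which is only practical for moderately sized sets.

```python
from math import comb


def ora_hypergeometric(k: int, n: int, l: int, m: int) -> tuple[float, str]:
    """One-sided hypergeometric p-value for ORA (illustrative sketch).

    k: number of entries of the test set T that belong to category C
    n: size of the test set T
    l: number of entries of the reference set R that belong to C
    m: size of the reference set R (T is assumed to be a subset of R)
    """
    expected = n * l / m  # k' in the text

    def pmf(i: int) -> float:
        # Probability of drawing exactly i category members in n draws.
        return comb(l, i) * comb(m - l, n - i) / comb(m, n)

    if expected < k:
        # More hits than expected: upper tail, i = k .. min(n, l).
        p = sum(pmf(i) for i in range(k, min(n, l) + 1))
        return p, "enriched"
    # At most as many hits as expected: lower tail, i = max(n+l-m, 0) .. k.
    p = sum(pmf(i) for i in range(max(n + l - m, 0), k + 1))
    return p, "depleted"


# 3 of 4 selected genes are in a category holding 5 of 20 reference genes.
p_value, direction = ora_hypergeometric(k=3, n=4, l=5, m=20)
```

In this toy example $k' = 1 < k = 3$, so the upper tail is summed and the p-value is $155/4845 \approx 0.032$.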

If T is not a subset of R, Fisher's exact test is used instead:

$$P_C=\begin{cases} \sum\limits_{i=k}^{min(n,l+k)} \frac{\binom{n}{i}\binom{m}{l+k-i}}{\binom{m+n}{l+k}} &, \text{if }k' < k\\ \sum\limits_{i=max(l+k-m,0)}^{k} \frac{\binom{n}{i}\binom{m}{l+k-i}}{\binom{m+n}{l+k}} &, \text{if }k'\ge k \end{cases}$$
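For the case in which $T$ is not a subset of $R$, the same idea can be sketched as follows. Again, this is only an illustration of the formula above: the function name `ora_fisher` and the returned label are hypothetical, and exact binomial coefficients are used for simplicity.

```python
from math import comb


def ora_fisher(k: int, n: int, l: int, m: int) -> tuple[float, str]:
    """One-sided Fisher's exact test for ORA (illustrative sketch).

    k: number of entries of the test set T that belong to category C
    n: size of the test set T
    l: number of entries of the reference set R that belong to C
    m: size of the reference set R (T and R are treated as separate sets)
    """
    expected = n * l / m  # k' in the text
    in_category = l + k   # category members in T and R combined

    def pmf(i: int) -> float:
        # Probability that exactly i of the l + k category members fall into T.
        return comb(n, i) * comb(m, in_category - i) / comb(m + n, in_category)

    if expected < k:
        # Upper tail, i = k .. min(n, l + k).
        p = sum(pmf(i) for i in range(k, min(n, in_category) + 1))
        return p, "enriched"
    # Lower tail, i = max(l + k - m, 0) .. k.
    p = sum(pmf(i) for i in range(max(in_category - m, 0), k + 1))
    return p, "depleted"


# 3 of 4 test genes and 2 of 10 reference genes are in the category.
p_value, direction = ora_fisher(k=3, n=4, l=2, m=10)
```

Here $k' = 0.8 < k = 3$, so the upper tail is summed, which gives $190/2002 \approx 0.095$.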

### Weighted gene set enrichment analysis (WGSEA)

Another popular method for detecting enrichment of functional categories is the weighted gene set enrichment analysis (WGSEA) introduced by Mootha et al. [6] and improved by the same group two years later [7]. This method is based on the Kolmogorov-Smirnov test, which measures the maximal deviation between the empirical distribution functions of two groups $X$ and $Y$. In terms of enrichment analysis, the goal of WGSEA is to determine whether members of category $C$ tend to occur towards the top (or bottom) of a sorted list $L$.

The test can be defined using a running sum statistic based on a biological category $C$ with $j$ members occurring in a sorted list $L = (l_{1},l_{2},\ldots,l_{n})$. The running sum increases every time a gene in the list $L$ is in $C$ and decreases every time a gene is not in $C$. In contrast to the Kolmogorov-Smirnov statistic, the WGSEA statistic is signed and uses a weight for each step.

$$RS(i)=\begin{cases} 0 &, \text{if }i=0\\ RS(i-1) + \frac{|l_i|^p}{N_R} &, \text{if }l_i \text{ belongs to C}\\ RS(i-1) - \frac{1}{(n - j)} &, \text{if }l_i \text{ does not belong to C} \end{cases}$$

where $N_R = \sum\limits_{l_i \in C}|l_i|^p$ is the sum of the weighted absolute scores of all list entries contained in category $C$, with $|l_i|$ denoting the absolute value of the score of entry $l_i$.

The exponent $p$ can be used to control the weight of each step. When $p=0$, every member of $C$ contributes the same step $\frac{1}{j}$ and this formulation reduces to the standard Kolmogorov-Smirnov statistic. In our implementation of the algorithm we use $p = 1$, as we additionally provide an unweighted version of the test.

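The test statistic $RS_C$ is the value of $RS(i)$ with the maximal absolute deviation from 0. It keeps its sign, so that enrichment ($RS_C > 0$) and depletion ($RS_C < 0$) can be distinguished.

$$RS_C = RS(i^*), \quad i^* = \arg\max\limits_{1 \le i \le n} |RS(i)|$$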
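The running sum and its extreme value can be sketched as follows. This is a minimal illustration, not the GeneTrail 2 code: the function names are hypothetical, the list is assumed to be already sorted, and category membership is passed as a list of booleans aligned with the scores. The sketch returns the signed extreme value, because the p-value computation distinguishes positive and negative statistics.

```python
def weighted_running_sum(scores, in_category, p=1.0):
    """Return RS(0), ..., RS(n) for a sorted list of scores.

    scores:      scores of the sorted list L
    in_category: booleans, True where the entry belongs to category C
    p:           exponent that controls the weight of each step
    """
    n = len(scores)
    j = sum(in_category)
    if j == 0 or j == n:
        raise ValueError("category must contain some, but not all, entries")
    # N_R: sum of the weighted absolute scores of the category members.
    n_r = sum(abs(s) ** p for s, c in zip(scores, in_category) if c)
    rs = [0.0]
    for s, c in zip(scores, in_category):
        step = abs(s) ** p / n_r if c else -1.0 / (n - j)
        rs.append(rs[-1] + step)
    return rs


def signed_extreme(rs):
    """Value of RS(i), i >= 1, with the largest absolute deviation from 0."""
    return max(rs[1:], key=abs)


scores = [4.0, 3.0, 2.0, 1.0]
members = [True, False, True, False]
rs = weighted_running_sum(scores, members)
rs_c = signed_extreme(rs)  # 4/6, reached at the first list entry
```

By construction the running sum returns to 0 at the end of the list, since the positive steps add up to 1 and the negative steps add up to -1.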

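The p-value for the test statistic $RS_C$ can be calculated via a permutation test with $t$ permutations. In the following, $RS_i$ denotes the test statistic obtained in the $i$-th permutation and $I$ is the indicator function.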

If $RS_C > 0$ an upper-tailed p-value is computed:

$$p_{enriched} = \frac{1}{t} \sum\limits_{i=1}^{t}I(RS_i \ge RS_C)$$

If $RS_C < 0$ a lower-tailed p-value is computed:

$$p_{depleted} = \frac{1}{t} \sum\limits_{i=1}^{t}I(RS_i \le RS_C)$$
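A permutation test of this kind can be sketched as follows. The text does not state what is permuted, so this sketch assumes that the category membership is shuffled across the positions of the sorted list. That choice, the function names, the fixed random seed, and the treatment of $RS_C = 0$ as the lower-tailed case are all assumptions made for illustration.

```python
import random


def signed_statistic(scores, in_category, p=1.0):
    """Signed extreme value of the weighted running sum (RS_C)."""
    n, j = len(scores), sum(in_category)
    n_r = sum(abs(s) ** p for s, c in zip(scores, in_category) if c)
    rs, best = 0.0, 0.0
    for s, c in zip(scores, in_category):
        rs += abs(s) ** p / n_r if c else -1.0 / (n - j)
        if abs(rs) > abs(best):
            best = rs
    return best


def permutation_p_value(scores, in_category, t=1000, seed=0):
    """Return (RS_C, p-value) using t random permutations of the labels."""
    rng = random.Random(seed)
    observed = signed_statistic(scores, in_category)
    labels = list(in_category)
    hits = 0
    for _ in range(t):
        rng.shuffle(labels)  # assumption: membership labels are permuted
        permuted = signed_statistic(scores, labels)
        if observed > 0:
            hits += permuted >= observed  # upper-tailed count
        else:
            hits += permuted <= observed  # lower-tailed count
    return observed, hits / t


# Category members occupy the top five positions of a list of 20 entries.
scores = [float(s) for s in range(20, 0, -1)]
members = [True] * 5 + [False] * 15
rs_c, p_value = permutation_p_value(scores, members)
```

Note that with $t$ permutations the smallest attainable non-zero p-value is $\frac{1}{t}$, so $t$ limits the resolution of the result.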

### Gene set enrichment analysis (GSEA)

We also implemented an unweighted version of the gene set enrichment analysis (GSEA), which is equivalent to the standard Kolmogorov-Smirnov statistic. It is a non-parametric hypothesis test that is based solely on the order of an input list $L$. Focusing on ranks rather than on absolute values makes the method more robust, because it limits the influence of outliers, which might otherwise dominate the results. Like the weighted version of the test, GSEA evaluates whether the genes of the considered category are randomly distributed or accumulate at the top or bottom of the list.

The test statistic can be defined similarly to the weighted version by using a running sum statistic based on a biological category $C$ with $j$ members occurring in a sorted list $L = (l_{1},l_{2},\ldots,l_{n})$.

$$RS(i)=\begin{cases} 0 &, \text{if }i=0\\ RS(i-1) + n - j &, \text{if }l_i \text{ belongs to C}\\ RS(i-1)- j &, \text{if }l_i \text{ does not belong to C} \end{cases}$$

The test statistic $RS_C$ is the value of $RS(i)$ with the maximal absolute deviation from 0.
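Because the unweighted running sum only uses integer steps, it can be sketched without any floating-point arithmetic. The function names below are hypothetical, and category membership is assumed to be given as booleans in the order of the sorted list.

```python
def unweighted_running_sum(in_category):
    """Return RS(0), ..., RS(n) for the unweighted test.

    in_category: booleans in list order, True where the entry belongs to C
    """
    n = len(in_category)
    j = sum(in_category)
    rs = [0]
    for member in in_category:
        # +(n - j) for a category member, -j for every other entry.
        rs.append(rs[-1] + (n - j if member else -j))
    return rs


def unweighted_statistic(in_category):
    """Signed value of RS(i), i >= 1, with the largest absolute deviation."""
    return max(unweighted_running_sum(in_category)[1:], key=abs)


top_heavy = [True, True, False, False, False]
print(unweighted_running_sum(top_heavy))  # [0, 3, 6, 4, 2, 0]
print(unweighted_statistic(top_heavy))    # 6
```

The $j$ positive steps of size $n - j$ and the $n - j$ negative steps of size $j$ cancel, so the running sum always ends at 0.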

### Welch's t-test and Wilcoxon rank-sum test

Similar to the one-sample t-test, Welch's t-test (cf. Section Scoring) and the Wilcoxon rank-sum test (cf. Section Scoring) can also be used to determine whether a biological category $C$ is significantly enriched or depleted. Both tests compare the scores of all biological entities that are contained in $C$ against the scores of those that are not.
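How the two groups enter these tests can be sketched as follows. The sketch uses the textbook definitions of Welch's t statistic and of the rank-sum statistic with mid-ranks for ties, which are assumed here and may differ in detail from the definitions in Section Scoring. The function names are hypothetical, and only the test statistics are computed, not the p-values.

```python
from math import sqrt
from statistics import mean, variance


def split_scores(scores, in_category):
    """Split the scores into X (inside the category) and Y (outside)."""
    x = [s for s, c in zip(scores, in_category) if c]
    y = [s for s, c in zip(scores, in_category) if not c]
    return x, y


def welch_t(x, y):
    """Welch's t statistic; both groups need at least two values."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def rank_sum(x, y):
    """Sum of the ranks of X in the pooled sample (mid-ranks for ties)."""
    pooled = sorted(x + y)

    def mid_rank(value):
        first = pooled.index(value) + 1  # 1-based rank of first occurrence
        return first + (pooled.count(value) - 1) / 2

    return sum(mid_rank(value) for value in x)


scores = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
members = [True, True, True, False, False, False]
x, y = split_scores(scores, members)
t_statistic = welch_t(x, y)  # positive: category scores are higher
w_statistic = rank_sum(x, y)  # 15.0, the largest possible value here
```

A positive t statistic or a large rank sum indicates that the category members tend to have higher scores than the remaining entries, which corresponds to enrichment.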

### Bibliography

1. Backes, Christina; Keller, Andreas; Kuentzer, Jan; Kneissl, Benny; Comtesse, Nicole; Elnakady, Yasser A; Müller, Rolf; Meese, Eckart; Lenhof, Hans-Peter. GeneTrail—advanced gene set enrichment analysis. Nucleic Acids Research. Oxford Univ Press.
2. Draghici, Sorin; Khatri, Purvesh; Martins, Rui P.; Ostermeier, G. Charles; Krawetz, Stephen A. Global functional profiling of gene expression. Genomics. Elsevier.
3. Hosack, Douglas A; Dennis Jr, Glynn; Sherman, Brad T; Lane, H Clifford; Lempicki, Richard A; et al. Identifying biological themes within lists of genes with EASE. Genome Biol.
4. Khatri, Purvesh; Draghici, Sorin. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. Oxford Univ Press.
5. Zhang, Bing; Schmoyer, Denise; Kirov, Stefan; Snoddy, Jay. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics. BioMed Central Ltd.
6. Mootha, Vamsi K; Lindgren, Cecilia M; Eriksson, Karl-Fredrik; Subramanian, Aravind; et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. Nature Publishing Group.
7. Subramanian, Aravind; Tamayo, Pablo; Mootha, Vamsi K; Mukherjee, Sayan; Ebert, Benjamin L; Gillette, Michael A; Paulovich, Amanda; Pomeroy, Scott L; Golub, Todd R; Lander, Eric S; et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. National Acad Sciences.
8. Hollander, Myles; Wolfe, Douglas A; Chicken, Eric. Nonparametric statistical methods. John Wiley and Sons.
9. Keller, A.; Backes, C.; Lenhof, H. P. Computation of significance scores of unweighted Gene Set Enrichment Analyses. BMC Bioinformatics.
10. Jiang, Zhen; Gentleman, Robert. Extensions to gene set enrichment. Bioinformatics. Oxford Univ Press.
11. Pavlidis, Paul; Lewis, Darrin P; Noble, William Stafford. Exploring gene expression data with class scores.
12. Smyth, Gordon K. Limma: linear models for microarray data. Springer.
13. Tian, Lu; Greenberg, Steven A; Kong, Sek Won; Altschuler, Josiah; Kohane, Isaac S; Park, Peter J. Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America. National Acad Sciences.
14. Ackermann, Marit; Strimmer, Korbinian. A general modular framework for gene set enrichment analysis. BMC Bioinformatics.
15. Efron, Bradley; Tibshirani, Robert. On testing the significance of sets of genes. The Annals of Applied Statistics. JSTOR.
16. Zar, Jerrold H; et al. Biostatistical analysis. Pearson Education India.