GeneTrail

P-value computation

To determine the significance levels of the computed set-level scores, GeneTrail offers two strategies: the gene set strategy and the phenotype strategy.

Gene set strategy

The gene set strategy is based on permuting the identifier-level scores. An advantage of this strategy is that it allows the direct computation of p-values for some methods and thus avoids costly permutation tests. This leads to a higher resolution of the computed p-values and very low computation times. For an in-depth discussion of the advantages and disadvantages of the respective methods we refer the reader to Tian et al. [1].

In the following sections, we discuss the different methods implemented to compute p-values based on this strategy.

Method 1 - Use the underlying distribution

The first way is to take advantage of the distribution that a certain test statistic follows under the null hypothesis. If this distribution or the corresponding cumulative distribution function (CDF) is known or can be estimated, it can be used to obtain a p-value for the test statistic. For a continuous distribution, the lower-tailed p-value is obtained simply by applying the CDF to the observed value, and the upper-tailed p-value is one minus that result.
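As a minimal sketch of this idea, the following Python snippet derives p-values from a CDF. It assumes, purely for illustration, that the test statistic follows a standard normal distribution under the null hypothesis; the function name `p_values_from_cdf` is hypothetical and not part of GeneTrail. For a t-test, the CDF of the appropriate Student's t distribution would take the place of the normal CDF.

```python
from statistics import NormalDist


def p_values_from_cdf(statistic, null_distribution=NormalDist()):
    """Derive p-values by applying the null distribution's CDF.

    Assumption: the null distribution is continuous and exposes a
    `cdf` method (here: the standard normal from the stdlib).
    """
    cdf_value = null_distribution.cdf(statistic)
    return {
        "lower": cdf_value,                                    # P(T <= t)
        "upper": 1.0 - cdf_value,                              # P(T >= t)
        "two_sided": 2.0 * min(cdf_value, 1.0 - cdf_value),
    }


result = p_values_from_cdf(1.96)
print(result)
```

No permutations are needed here, which is why this route is fast and not limited in resolution by the number of permutations.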

Set-level statistics that use this method

One-sample t-test
Welch t-test
Wilcoxon rank sum test

Method 2 - Permutation tests

If the underlying distribution is not known, a permutation test has to be used instead. In this test, the labels of the observed data points are rearranged and the test statistic is recalculated to estimate its distribution under the null hypothesis [2]. The p-value is calculated as the fraction of permutation values that are at least as extreme as the original statistic, which was derived from the non-permuted data [3]. In order to obtain an exact p-value, all possible permutations would have to be computed. However, in practical situations this is computationally expensive and often infeasible. For example, class labels that represent two classes with 50 samples each can be permuted in $\binom{100}{50} \approx 10^{29}$ different ways [3]. Therefore, p-values are often approximated by computing a limited number of random permutations [3].
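The number of label arrangements in this example is the number of ways to choose which 50 of the 100 samples receive the first class label, i.e. a binomial coefficient. A quick check in Python (the variable name is ours):

```python
import math

# Ways to split 100 samples into two classes of 50 samples each.
n_label_arrangements = math.comb(100, 50)
print(f"{n_label_arrangements:.3e}")
```

The result is on the order of $10^{29}$, which makes exhaustive enumeration impractical.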

Given a test statistic $t$ and a set $X$ that contains $N$ permutation values $\hat t_1, \hat t_2, \dots, \hat t_N$, the one-sided p-values are defined as follows, where $I(\cdot)$ denotes the indicator function [3]:

$$p_{upper}=\frac{1}{N}\sum_{i=1}^{N}I(\hat t_i \ge t)$$ $$p_{lower}=\frac{1}{N}\sum_{i=1}^{N}I(\hat t_i \le t)$$

Commonly, a pseudocount is introduced to avoid p-values of 0 [3]:

$$p_{upper}=\frac{1}{N}(1+\sum_{i=1}^{N}I(\hat t_i \ge t))$$ $$p_{lower}=\frac{1}{N}(1+\sum_{i=1}^{N}I(\hat t_i \le t))$$
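The formulas above can be sketched in Python as follows. The function name `permutation_p_values` and the example numbers are made up for illustration; the code mirrors the definitions given here (division by $N$, optional pseudocount of 1) and is not taken from the GeneTrail implementation.

```python
def permutation_p_values(t, permuted_values, pseudocount=True):
    """Return (p_upper, p_lower) from a list of permutation values.

    Mirrors the formulas in the text: counts of permutation values
    that are >= t (upper) or <= t (lower), divided by N, with an
    optional pseudocount of 1 added to the count.
    """
    n = len(permuted_values)
    extra = 1 if pseudocount else 0
    n_upper = sum(1 for value in permuted_values if value >= t)
    n_lower = sum(1 for value in permuted_values if value <= t)
    return (extra + n_upper) / n, (extra + n_lower) / n


# Hypothetical permutation values for an observed statistic t = 2.5.
permuted = [0.1, 1.2, 2.5, 3.0, -0.4, 0.9, 1.7, 2.9, 0.3, 1.1]
plain = permutation_p_values(2.5, permuted, pseudocount=False)
smoothed = permutation_p_values(2.5, permuted, pseudocount=True)
print(plain, smoothed)
```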

Schematic description

Compute the test statistic $t$ for the original score list $L$.
Repeat the following steps $N$ times.

Generate a random permutation $\hat L$ of $L$.
Evaluate the test statistic on $\hat L$ yielding $\hat t$.
If $\hat t \ge t$ increase a counter $X$.

Output the p-value $X/N$.

Accordingly, a lower-tailed p-value can be calculated by replacing $\hat t \ge t$ with $\hat t \le t$.
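The schematic description above can be sketched in Python as follows. This is an illustration under stated assumptions, not GeneTrail's implementation: the score list is represented as a dictionary from identifier to score, the set-level statistic is the mean score of the set members (a stand-in for any averaging method), and all names (`mean_score`, `gene_set_permutation_p_value`, the toy identifiers) are hypothetical.

```python
import random


def mean_score(scores, gene_set):
    """Assumed set-level statistic: mean score of the set members."""
    members = [scores[gene] for gene in gene_set]
    return sum(members) / len(members)


def gene_set_permutation_p_value(scores, gene_set, statistic=mean_score,
                                 n_permutations=1000, seed=0):
    """Upper-tailed p-value obtained by permuting identifier-level scores."""
    rng = random.Random(seed)
    genes = list(scores)
    values = [scores[gene] for gene in genes]
    t = statistic(scores, gene_set)           # statistic on the original list
    counter = 0
    for _ in range(n_permutations):
        rng.shuffle(values)                   # random permutation of the scores
        permuted = dict(zip(genes, values))   # scores reassigned to identifiers
        if statistic(permuted, gene_set) >= t:
            counter += 1
    return counter / n_permutations


# Toy data: 20 identifiers with scores 0..19; the set holds the top three.
toy_scores = {f"g{i}": float(i) for i in range(20)}
p_top = gene_set_permutation_p_value(toy_scores, {"g17", "g18", "g19"})
p_bottom = gene_set_permutation_p_value(toy_scores, {"g0", "g1", "g2"})
print(p_top, p_bottom)
```

A fixed seed is used only to make the sketch reproducible; replacing `>=` with `<=` yields the lower-tailed variant.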

Set-level statistics that use this method

All averaging methods
Weighted GSEA

Method 3 - Exact p-values

It is also worth mentioning that methods have been proposed that compute an exact p-value without enumerating all permutations. Keller et al. [4] propose such an algorithm for the unweighted version of the GSEA method.
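To illustrate how an exact p-value can be obtained without enumerating all permutations, the following sketch counts running-sum paths with dynamic programming. It is written in the spirit of such approaches and is not a reproduction of the algorithm in [4]. The statistic definition is an assumption: a running sum that increases by $n - l$ for each set member and decreases by $l$ for each non-member (with $n$ identifiers in total and $l$ in the set), scored by its maximum absolute deviation from zero. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def max_abs_running_sum(n_genes, member_positions):
    """Unweighted running-sum statistic for one ranking (assumed definition)."""
    members = set(member_positions)
    set_size = len(members)
    running_sum, best = 0, 0
    for position in range(n_genes):
        running_sum += (n_genes - set_size) if position in members else -set_size
        best = max(best, abs(running_sum))
    return best


def exact_upper_p_value(n_genes, set_size, observed):
    """P(statistic >= observed) over all C(n_genes, set_size) placements."""
    # paths[k]: number of prefixes containing k set members whose running
    # sum has stayed strictly below `observed` in absolute value so far.
    paths = {0: 1}
    for seen in range(1, n_genes + 1):
        next_paths = {}
        for k, count in paths.items():
            for k_new in (k, k + 1):  # next gene is a non-member / a member
                if k_new > set_size or seen - k_new > n_genes - set_size:
                    continue
                running_sum = (k_new * (n_genes - set_size)
                               - (seen - k_new) * set_size)
                if abs(running_sum) < observed:
                    next_paths[k_new] = next_paths.get(k_new, 0) + count
        paths = next_paths
    below = paths.get(set_size, 0)
    return 1 - Fraction(below, comb(n_genes, set_size))


# Toy example: 20 identifiers, the 5 set members occupy the top ranks.
observed = max_abs_running_sum(20, [0, 1, 2, 3, 4])
p_exact = exact_upper_p_value(20, 5, observed)
print(observed, p_exact)
```

The table has at most $l + 1$ entries per position, so the count is obtained in time polynomial in $n$ and $l$, and the resulting p-value is exact rather than limited by the number of sampled permutations.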

Set-level statistics that use this method

Unweighted GSEA

Phenotype strategy

The phenotype strategy randomly redistributes the measurements between the sample and reference group. This strategy always requires that a permutation test is performed. As new identifier-level scores must be derived for every permutation, the method can only be used if a data matrix was supplied.

The difference between the phenotype and the gene set strategy lies in how the permuted score list is computed: the phenotype strategy permutes the group labels instead of the gene labels.

Schematic description

Compute the test statistic $t$ for the original score list $L$.
Repeat the following steps $N$ times.

Generate a random assignment to the sample and reference group.
Compute a new score list $\hat L$ from the data matrix using this assignment.
Evaluate the test statistic on $\hat L$ yielding $\hat t$.
If $\hat t \ge t$ increase a counter $X$.

Output the p-value $X/N$.

Accordingly, a lower-tailed p-value can be calculated by replacing $\hat t \ge t$ with $\hat t \le t$.
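The phenotype scheme above can be sketched in Python as follows. The sketch rests on several assumptions that are ours, not GeneTrail's: the data matrix is a dictionary from identifier to a list of measurements, identifier-level scores are differences of group means, and the set-level statistic is the mean score of the set members. All names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def identifier_scores(matrix, labels):
    """Assumed identifier-level score: mean(sample) - mean(reference)."""
    scores = {}
    for gene, row in matrix.items():
        sample = [x for x, label in zip(row, labels) if label == "sample"]
        reference = [x for x, label in zip(row, labels) if label == "reference"]
        scores[gene] = mean(sample) - mean(reference)
    return scores


def set_statistic(scores, gene_set):
    """Assumed set-level statistic: mean score of the set members."""
    return mean([scores[gene] for gene in gene_set])


def phenotype_permutation_p_value(matrix, labels, gene_set,
                                  n_permutations=1000, seed=0):
    """Upper-tailed p-value obtained by permuting the group labels."""
    rng = random.Random(seed)
    t = set_statistic(identifier_scores(matrix, labels), gene_set)
    shuffled = list(labels)
    counter = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)                       # random group assignment
        new_scores = identifier_scores(matrix, shuffled)  # new score list
        if set_statistic(new_scores, gene_set) >= t:
            counter += 1
    return counter / n_permutations


# Toy data matrix: 3 identifiers, 4 sample and 4 reference measurements.
toy_matrix = {
    "a": [5, 6, 5, 6, 1, 2, 1, 2],
    "b": [4, 5, 6, 5, 0, 1, 2, 1],
    "c": [1, 2, 1, 2, 2, 1, 2, 1],
}
toy_labels = ["sample"] * 4 + ["reference"] * 4
p_value = phenotype_permutation_p_value(toy_matrix, toy_labels, {"a", "b"})
print(p_value)
```

Shuffling the labels keeps the group sizes fixed, and every iteration recomputes the identifier-level scores from the matrix, which is why this strategy requires the data matrix rather than a precomputed score list.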

Bibliography

[1] Tian, L., Greenberg, S. A., Kong, S. W., Altschuler, J., Kohane, I. S., Park, P. J. Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America, National Acad Sciences, 2005.
[2] Edgington, E., Onghena, P. Randomization tests. CRC Press, 2007.
[3] Knijnenburg, T. A., Wessels, L. F. A., Reinders, M. J. T., Shmulevich, I. Fewer permutations, more accurate P-values. Bioinformatics, Oxford Univ Press, 2009.
[4] Keller, A., Backes, C., Lenhof, H. P. Computation of significance scores of unweighted Gene Set Enrichment Analyses. BMC Bioinformatics, 2007.

GeneTrail 3.2

Advanced high-throughput enrichment analysis
