GeneTrail

Input data

GeneTrail is able to read various input file formats through which the user can provide measurement data that should be analyzed. In general, GeneTrail will try to automatically detect the meta-data of the uploaded data. This means it attempts to detect the used data format, identifier type, and organism the data was derived from. If errors arise during this step, it is important to understand which input types are supported by GeneTrail.

Thus, in the following we discuss the expected input formats and the assumptions GeneTrail makes about their contents.

As GeneTrail can process data not only from microarray experiments but also from, e.g., mass-spectrometry experiments, we use the term entity to refer to genes, proteins, miRNAs, etc. Similarly, we use the term identifier for the name of such an entity as it is used in a database such as Ensembl, UniProt, or NCBI Gene.

Identifier lists

The simplest way to provide input data to GeneTrail is to upload a list of identifiers. An identifier list can be either a (typically short) list of relevant entities or a (typically long) list of entities sorted by relevance.

Different methods assume different properties of the input list. For example, the ORA method [1] requires a list of relevant entities, whereas methods such as the Kolmogorov-Smirnov test assume that the identifier list is sorted by relevance. Do not use a list prepared for one kind of method to compute enrichments with the other kind.

The input format for identifier lists recognized by GeneTrail is a simple text file containing exactly one identifier per line:

GDA
SCN3A
SCN3B
RPLP2
GFER
SNORA68
SNORA65
PIP5KL1
BTBD1
RPLP0
BTBD2
BTBD3
...
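The format above can be checked locally before uploading. The following is a minimal sketch, not GeneTrail's own parser; the function name read_identifier_list and the decision to skip blank lines are assumptions made for illustration.

```python
from typing import List


def read_identifier_list(text: str) -> List[str]:
    """Return the identifiers in file order, one per non-empty line."""
    identifiers = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        identifier = line.strip()
        if not identifier:
            continue  # assumption: blank lines (e.g. a trailing newline) are ignored
        if len(identifier.split()) != 1:
            raise ValueError(
                f"line {line_number}: expected exactly one identifier, got {line!r}"
            )
        identifiers.append(identifier)
    return identifiers


example = "GDA\nSCN3A\nSCN3B\nRPLP2\n"
print(read_identifier_list(example))
```

Keeping the file order matters for sorted lists, which is why the sketch returns a list rather than a set.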

Identifier level scores

Similar to identifier lists, score lists are provided in a text-based format containing one identifier per line. The difference is that a score, a numerical value measuring the relevance of the entity, is provided in an additional column. The two columns are separated by whitespace, preferably a single tab character.

GDA 0.05501
SCN3A   -0.017374
SCN3B   0.33427200000000046
RPLP2   -0.10048799999999997
GFER    0.08075766666666603
SNORA68 0.2532145
SNORA65 -0.289492
PIP5KL1 0.267125
BTBD1   -0.824291000000001
RPLP0   0.050174750000000046
BTBD2   -0.424771999999999
BTBD3   0.267594
RPLP1   -0.1359804999999995
ATP6    -0.2206155
...
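A score list in this layout can be validated with a few lines of code. This is a sketch under the assumption that any run of whitespace separates the two columns; read_score_list is a hypothetical helper, not part of GeneTrail.

```python
from typing import List, Tuple


def read_score_list(text: str) -> List[Tuple[str, float]]:
    """Parse 'identifier<whitespace>score' lines into (identifier, score) pairs."""
    scores = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split()  # splits on tabs and on runs of spaces alike
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected identifier and score, got {line!r}"
            )
        identifier, raw_score = fields
        scores.append((identifier, float(raw_score)))  # float() rejects non-numeric text
    return scores


example = "GDA 0.05501\nSCN3A\t-0.017374\nSCN3B\t0.33427200000000046\n"
for identifier, score in read_score_list(example):
    print(identifier, score)
```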

If possible, prefer score lists to identifier lists. A score list can be used in any scenario in which an identifier list can be used and is much less likely to run into the difficulties frequently encountered with identifier lists.

Note that GeneTrail does not check whether the scores follow a certain distribution. While most of the implemented methods work surprisingly well even if their assumptions are violated, we recommend choosing an analysis technique that is appropriate for the data. To this end, the (unweighted) Kolmogorov-Smirnov test and the Wilcoxon test are non-parametric enrichment methods that do not require a specific score distribution.
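To illustrate why a score list is the more general input, the sketch below derives both kinds of identifier list from one set of scores. The function names, the use of the absolute score, and the threshold value are assumptions chosen for the example; an appropriate cutoff depends on the scoring scheme.

```python
from typing import List, Tuple

Scores = List[Tuple[str, float]]


def ranked_identifiers(scores: Scores, descending: bool = True) -> List[str]:
    """Identifier list sorted by relevance, as rank-based methods expect."""
    ordered = sorted(scores, key=lambda item: item[1], reverse=descending)
    return [identifier for identifier, _ in ordered]


def relevant_identifiers(scores: Scores, threshold: float) -> List[str]:
    """Short list of relevant entities, as ORA-style methods expect.

    Assumption: an entity is 'relevant' if its absolute score reaches the threshold.
    """
    return [identifier for identifier, score in scores if abs(score) >= threshold]


scores = [("GDA", 0.05501), ("BTBD1", -0.824291), ("SCN3B", 0.334272)]
print(ranked_identifiers(scores))
print(relevant_identifiers(scores, threshold=0.3))
```

The reverse direction is not possible: an identifier list carries no scores, so it cannot be re-sorted or re-thresholded later.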
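As a rough illustration of why the unweighted Kolmogorov-Smirnov approach needs no distributional assumption, the sketch below computes a running-sum statistic from ranks alone: the score values are never used, only the order. It is a simplified two-sample formulation written for this page, and GeneTrail's actual implementation and p-value computation may differ. The function name is hypothetical.

```python
from typing import List, Set


def ks_running_sum_statistic(ranked: List[str], category: Set[str]) -> float:
    """Signed maximum deviation between the empirical distributions of
    category members and non-members along a ranked identifier list."""
    hits_total = sum(1 for identifier in ranked if identifier in category)
    misses_total = len(ranked) - hits_total
    if hits_total == 0 or misses_total == 0:
        raise ValueError("category must contain some, but not all, ranked identifiers")

    hits_seen = misses_seen = 0
    largest_deviation = 0.0
    for identifier in ranked:
        if identifier in category:
            hits_seen += 1
        else:
            misses_seen += 1
        deviation = hits_seen / hits_total - misses_seen / misses_total
        if abs(deviation) > abs(largest_deviation):
            largest_deviation = deviation
    return largest_deviation


ranked = ["GDA", "SCN3A", "SCN3B", "RPLP2"]
print(ks_running_sum_statistic(ranked, {"GDA", "SCN3A"}))
```

A positive value indicates that category members accumulate near the top of the list, a negative value that they accumulate near the bottom.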

Measurements

GeneTrail provides support for directly analyzing matrices containing high-throughput measurements. These can be normalized expression values obtained from microarray or RNA-seq experiments or protein abundances from mass-spectrometry runs. Additionally, we offer rudimentary support for analyzing count data obtained via RNA-seq.

Analyzing data from high-throughput experiments involves more than applying a statistical test to each row of the dataset. In practice, quality control, batch effect removal, and normalization must be performed carefully. The features offered by GeneTrail are provided for convenience and assume that the data has been properly prepared!

Measurements can be uploaded as a plain-text, tab-separated matrix. Optionally, the first row of the file contains the names of the samples. Each subsequent row contains the measurement data for one identifier in all samples. Thus, each row except the first starts with an identifier followed by N numerical values, where N is the number of samples.

Sample1	Sample2	Sample3
GeneA	0.1	4.3	2.3
GeneB	3.2	-1.2	1.1
GeneC	2.7	9.1	0.3
...
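The layout above, a header with N sample names followed by rows with N + 1 fields, can be verified with a short script. This sketch assumes the optional header row is present; read_measurement_matrix is a hypothetical helper and not GeneTrail's parser.

```python
from typing import Dict, List, Tuple


def read_measurement_matrix(text: str) -> Tuple[List[str], Dict[str, List[float]]]:
    """Parse a tab-separated matrix whose first row holds the sample names."""
    lines = [line for line in text.splitlines() if line.strip()]
    sample_names = lines[0].split("\t")
    rows: Dict[str, List[float]] = {}
    for line_number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        identifier, values = fields[0], fields[1:]
        if len(values) != len(sample_names):
            raise ValueError(
                f"row {line_number} ({identifier}): expected "
                f"{len(sample_names)} values, found {len(values)}"
            )
        rows[identifier] = [float(value) for value in values]
    return sample_names, rows


example = "Sample1\tSample2\tSample3\nGeneA\t0.1\t4.3\t2.3\nGeneB\t3.2\t-1.2\t1.1\n"
samples, matrix = read_measurement_matrix(example)
print(samples)
print(matrix["GeneB"])
```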

The advantage of uploading a matrix of measurements is that sample-based (sometimes called phenotype-based) permutation schemes can be used to determine p-values.
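The idea behind a sample-based permutation scheme can be sketched on a single row of the matrix: the sample labels are shuffled repeatedly and the statistic is recomputed each time. This is only an illustration under simplifying assumptions. The statistic (difference of group means), the function names, and the number of permutations are chosen for the example; in an enrichment analysis the scores of all entities and the enrichment statistic itself would be recomputed for every permutation.

```python
import random
from statistics import mean
from typing import List


def mean_difference(values: List[float], in_first_group: List[bool]) -> float:
    """Difference between the mean of the first and the second sample group."""
    first = [v for v, flag in zip(values, in_first_group) if flag]
    second = [v for v, flag in zip(values, in_first_group) if not flag]
    return mean(first) - mean(second)


def permutation_p_value(values: List[float], in_first_group: List[bool],
                        n_permutations: int = 1000, seed: int = 0) -> float:
    """Two-sided p-value obtained by shuffling the sample labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, in_first_group))
    shuffled = list(in_first_group)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed labeling itself and avoid a p-value of zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


values = [1.0, 1.2, 0.9, 4.8, 5.1, 5.3]
labels = [True, True, True, False, False, False]
print(permutation_p_value(values, labels))
```

With only a handful of samples the number of distinct labelings is small, which limits the smallest p-value that can be reached.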

Microarray data

A major use case of GeneTrail is the analysis of microarray data. For this experimental platform, well-established normalization pipelines exist that usually generate normally or log-normally distributed expression values. GeneTrail can work directly with this kind of data and offers a range of statistics that can be used to derive scores from expression matrices.

RNA-seq data

RNA-seq data usually comes in the form of count data. This means that for each transcript and sample, the number of reads mapped to the transcript is reported. The distribution of this data is fundamentally different from the distribution of microarray data, and hence new methods for the analysis of count data have been developed. GeneTrail offers some basic support for directly analyzing count data. For this purpose, it uses the DESeq2 [2], edgeR [3], and RUVSeq [4] R packages to compute scores from count data.

Note that currently for count data, no sample-based permutations can be performed due to the prohibitive runtime of the score computation process.

These packages perform some level of normalization. However, GeneTrail performs no quality control or proper batch effect removal. Just as with microarray data, the web service relies on normalized or at least well-behaved input data.

Others

Data from other experimental platforms can also be used in GeneTrail. Here, however, it is up to the user to select an appropriate scoring scheme.

Reference sets

Besides the list of relevant entities, the ORA method requires a second list of identifiers which represents the universe of identifiers that can be detected by an experiment. The input format is the same as for identifier lists.

Use a reference set that best fits your experimental platform. For microarrays, this would be the list of genes or transcripts for which probes are present on the array. For RNA-seq experiments this should be the list of all genes or transcripts that are present in the used genome annotation.
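The influence of the reference set can be seen in a hypergeometric test, which is a common way to evaluate over-representation. The sketch below is an assumption about how such a p-value can be computed and is not taken from GeneTrail's source code; the function and variable names are hypothetical.

```python
from math import comb
from typing import Iterable


def ora_p_value(test_set: Iterable[str], reference_set: Iterable[str],
                category: Iterable[str]) -> float:
    """Probability of seeing at least the observed number of category members
    in the test set when drawing from the reference set at random."""
    reference = set(reference_set)
    test = set(test_set) & reference        # entities outside the universe are ignored
    members = set(category) & reference     # same for category members

    population = len(reference)
    successes = len(members)
    draws = len(test)
    observed = len(test & members)

    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, min(successes, draws) + 1)
    )
    return tail / comb(population, draws)


reference = [f"g{i}" for i in range(10)]
print(ora_p_value({"g0", "g1"}, reference, {"g0", "g1", "g2"}))
```

Because the population size enters every term, the same test set can yield very different p-values for different reference sets, which is why the reference should match the experimental platform.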

BED files

Open-chromatin regions or histone marks, needed for the epigenomics workflow, can be uploaded in BED file format. In this format, every line represents a region of interest. Each line contains at least three fields:

Chromosome
Start position of the region
End position of the region

chr1	180775	180925
chr1	181395	181545
chr1	273895	274045
chr1	629895	630045
chr1	633855	634005
...
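A minimal sketch for checking the three mandatory fields before uploading is shown below. It relies on the general BED convention that the start position is 0-based and the end position is exclusive; the function name and the handling of comment, track, and browser lines are assumptions for illustration, and additional columns are simply ignored.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]


def read_bed_regions(text: str) -> List[Region]:
    """Return (chromosome, start, end) for every region line."""
    regions = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header or comment lines
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: fewer than three fields")
        chromosome, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"line {line_number}: invalid interval {start}-{end}")
        regions.append((chromosome, start, end))
    return regions


example = "chr1\t180775\t180925\nchr1\t181395\t181545\n"
for chromosome, start, end in read_bed_regions(example):
    print(chromosome, start, end, "length:", end - start)
```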

An additional description of the format can be found here.

Single Cell formats

For our single cell analysis workflow, we need an scRNA-Seq matrix and additional, user-defined meta information for each cell. In the following, both file formats are described.

scRNA-seq data

The matrix containing scRNA-Seq data uses the same format as described in Measurements, i.e. a tab-separated text file with cell identifiers as columns and gene identifiers as rows.

Cell1	Cell2	Cell3
GeneA	0.1	4.3	2.3
GeneB	3.2	-1.2	1.1
GeneC	2.7	9.1	0.3
...

The content of the matrix can either be counts, UMIs, or normalized expression values.

Note: We estimate whether the values in the matrix were normalized based on the presence of non-integer values. If all values are integers, we assume that the matrix is not normalized.
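The heuristic described in the note can be sketched as follows. This is an illustration only: the function name is hypothetical, and treating a value such as 3.0 as an integer is an assumption about how the check behaves.

```python
from typing import Iterable, Sequence


def looks_normalized(rows: Iterable[Sequence[float]]) -> bool:
    """Return True if at least one value in the matrix is not a whole number."""
    return any(not float(value).is_integer() for row in rows for value in row)


counts = [[3, 0, 12], [7, 1, 0]]
expression = [[0.1, 4.3, 2.3], [3.2, -1.2, 1.1]]
print(looks_normalized(counts))
print(looks_normalized(expression))
```

A consequence of this heuristic is that a normalized matrix that happens to be rounded to whole numbers would be treated as count data.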

Metadata

The metadata file can be used to group cells based on e.g. experimental factors (batch, sample id, ...), or based on a research question (age: Is an effect related to age?, healthy-diseased: Are there differences on single cell level?, ...). Therefore, the content of the meta information is completely chosen by the user. We do not require specific information to be present (e.g., sample id, or the analysis batch).

The metadata file is a tab-separated text file in which each column provides additional meta information for the cells that should be analyzed. The cell identifiers have to match the column names of the scRNA-Seq matrix. Only cells with an entry in both the metadata file and the scRNA-Seq matrix are analyzed in the subsequent workflow. Currently, at most three columns from the metadata file can be selected for the analysis. The uploaded metadata file may contain more columns; the user is asked to select the relevant ones on our upload page.

MetaInfo1	MetaInfo2	MetaInfo3
Cell1	age-3	batch_1	cluster_6
Cell2	age-5	batch_1	cluster_2
Cell3	age-3	batch_2	cluster_6
...
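Since only cells present in both files are analyzed, it can be useful to check the overlap before uploading. The sketch below assumes the layout shown above, i.e. a header row with the metadata column names followed by one row per cell; read_metadata and cells_to_analyze are hypothetical helpers.

```python
from typing import Dict, List, Tuple


def read_metadata(text: str) -> Tuple[List[str], Dict[str, Dict[str, str]]]:
    """Parse the metadata file into column names and a per-cell mapping."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split("\t")
    cells: Dict[str, Dict[str, str]] = {}
    for line_number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        cell, values = fields[0], fields[1:]
        if len(values) != len(columns):
            raise ValueError(
                f"row {line_number} ({cell}): expected {len(columns)} values"
            )
        cells[cell] = dict(zip(columns, values))
    return columns, cells


def cells_to_analyze(matrix_cells: List[str],
                     metadata: Dict[str, Dict[str, str]]) -> List[str]:
    """Cells that appear both as matrix columns and as metadata rows."""
    return [cell for cell in matrix_cells if cell in metadata]


metadata_text = "MetaInfo1\tMetaInfo2\nCell1\tage-3\tbatch_1\nCell3\tage-3\tbatch_2\n"
columns, metadata = read_metadata(metadata_text)
print(cells_to_analyze(["Cell1", "Cell2", "Cell3"], metadata))
```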

Note: As the meta information is used to group cells, please provide categorical entries in the relevant columns (i.e. factors in R). Please do not select columns for the analysis that contain numerical values or other values that are unique to a cell. This would result in a large number of groups, and computations would be very slow at best. Such columns can be present in the metadata file, but please deselect them for the analysis on our upload page.
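Columns that are unlikely to work as grouping variables can be spotted with a simple check. The criteria in this sketch (all values numeric, or every value unique) are a direct reading of the note above, but the function name and the exact rules are assumptions and not the validation performed by the upload page.

```python
from typing import List


def _is_number(value: str) -> bool:
    try:
        float(value)
    except ValueError:
        return False
    return True


def unsuitable_for_grouping(values: List[str]) -> bool:
    """Flag a metadata column that is purely numeric or unique per cell."""
    all_numeric = all(_is_number(value) for value in values)
    all_unique = len(set(values)) == len(values)
    return all_numeric or all_unique


print(unsuitable_for_grouping(["age-3", "age-5", "age-3"]))
print(unsuitable_for_grouping(["0.13", "2.5", "0.13"]))
print(unsuitable_for_grouping(["barcode_a", "barcode_b", "barcode_c"]))
```

Numeric factors such as an age in years can be turned into categories by adding a prefix, as in the age-3 entries of the example file.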

Epigenetic formats

For our epigenetic analysis workflow, we need the genomic positions affected by a certain epigenetic mark. Therefore, we allow BED files as input for histone modifications, open chromatin regions, and DNA methylation calls. Furthermore, expression data can be used to complement the epigenetic information. To this end, please upload an expression matrix containing bulk RNA-Seq or microarray data.

It is possible to upload files for more than two groups and, for our analysis, it is best to upload various epigenetic marks along with expression data. This results in a lot of file uploads, which can be demanding. For convenience, we offer the upload of a ZIP file containing some or all of the files needed for the analysis. It is also possible to upload several ZIP files, and to mix ZIP file uploads with normal file uploads.

Troubleshooting

GeneTrail does not recognize my score list exported from Excel

MS Excel is a popular tool for managing biological datasets. However, there are some pitfalls, especially when it comes to interoperability with other tools. It can happen that Excel reformats gene identifiers as dates. For example, the gene Apr1 is routinely recognized as April the first. Please make sure that no such conversions have taken place before exporting your data from Excel.

For more information see also Zeeberg et al. [6].
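A quick scan for date-like identifiers can reveal such conversions before a file is uploaded. The pattern below is a heuristic sketch covering forms such as 1-Apr or Apr-01; the exact representation Excel produces depends on version and locale, so the regular expression and the function name are assumptions for illustration.

```python
import re
from typing import List

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Matches day-month ("1-Apr") or month-day ("Apr-01") style entries.
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$", re.IGNORECASE
)


def date_like_identifiers(identifiers: List[str]) -> List[str]:
    """Return the identifiers that look like dates rather than gene names."""
    return [identifier for identifier in identifiers
            if DATE_LIKE.match(identifier.strip())]


print(date_like_identifiers(["GDA", "1-Apr", "SCN3A", "2-Sep"]))
```

Any identifier flagged this way should be compared against the original data, because the conversion cannot be undone reliably from the exported file alone.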

Bibliography

[1] Draghici, Sorin and Khatri, Purvesh and Martins, Rui P. and Ostermeier, G. Charles and Krawetz, Stephen A. Global functional profiling of gene expression. Genomics, Elsevier, 2003.
[2] Love, Michael I and Huber, Wolfgang and Anders, Simon. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol, 2014.
[3] Robinson, Mark D and McCarthy, Davis J and Smyth, Gordon K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, Oxford Univ Press, 2010.
[4] Risso, Davide and Ngai, John and Speed, Terence P and Dudoit, Sandrine. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature biotechnology, Nature Publishing Group, 2014.
[5] Subramanian, Aravind and Kuehn, Heidi and Gould, Joshua and Tamayo, Pablo and Mesirov, Jill P. GSEA-P: a desktop application for Gene Set Enrichment Analysis. Bioinformatics, Oxford Univ Press, 2007.
[6] Zeeberg, Barry R and Riss, Joseph and Kane, David W and Bussey, Kimberly J and Uchio, Edward and Linehan, W Marston and Barrett, J Carl and Weinstein, John N. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC bioinformatics, BioMed Central Ltd, 2004.

GeneTrail 3.2

Advanced high-throughput enrichment analysis