GeneTrail


    Tutorial

    To start a new GeneTrail analysis, select "GeneTrail" from the menu above.

    Right-click here and choose "save target as" to download an example test set (Organism: Human, IDs: NCBI Gene IDs).

    First, choose the analysis method. Two methods are available: "Over-/Under-Representation Analysis" (ORA) and "Gene Set Enrichment Analysis" (GSEA). Which one to select depends on the experiment. If a subset of genes or proteins has been detected from some larger set, ORA should be applied. If genes have been measured for an arbitrary feature, GSEA should be applied. ORA can also be used for such measured data. For example, consider the overexpressed genes in a microarray experiment: one could compute the subset of all genes on the array that are upregulated at least x-fold and analyze this subset against all genes on the microarray (-> ORA). However, the result depends strongly on the choice of the threshold, and a gene that is upregulated 10-fold is treated exactly like a gene that is upregulated 2-fold. Thus, for such analyses, GSEA is recommended. The input for GSEA is, for example, a list of microarray identifiers sorted by expression level.
    02_analysismethod.png
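    The difference between the two kinds of input can be sketched in a few lines of Python. This is only an illustration, not GeneTrail code: the fold-change values are made up, the function names are hypothetical, and the gene IDs are used purely as sample keys.

```python
# Made-up fold changes keyed by NCBI Gene ID (illustrative values only).
fold_change = {"5894": 10.0, "11186": 2.0, "11848": 1.1, "7157": 0.4}


def ora_inputs(fold_change, threshold):
    """ORA: the test set is every gene at or above the threshold,
    the reference set is every gene that was measured."""
    test_set = sorted(g for g, fc in fold_change.items() if fc >= threshold)
    reference_set = sorted(fold_change)
    return test_set, reference_set


def gsea_input(fold_change):
    """GSEA: all genes, sorted from highest to lowest fold change."""
    return sorted(fold_change, key=fold_change.get, reverse=True)


test_set, reference_set = ora_inputs(fold_change, threshold=2.0)
ranked = gsea_input(fold_change)

print(test_set)  # the 10-fold and the 2-fold gene are indistinguishable here
print(ranked)    # the ranking keeps the order of the measurements
```

    Note how the thresholded test set discards the magnitude of the change, while the ranked list preserves it.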
    The following steps differ slightly between GSEA and ORA. This tutorial walks through an ORA as the example, because the steps are the same for both methods except that GSEA does not require the upload of a reference set.
    In the first step of the parameter specification, the organism has to be defined.
    03_organism.png
    Now, the identifier type of the data to be uploaded has to be specified. The following identifiers are supported:

    Identifier type                       Examples
    NCBI GeneID                           5894, 11186, 11848
    NCBI NP/XP number (Protein RefSeq)    NP_006261, XP_941900, NP_872606
    NCBI Protein GI                       28201876, 113431221, 121114292
    NCBI NM/XM number (RNA RefSeq)        NM_018993, NM_008284, NM_021168
    NCBI RNA GI                           54792783, 51093847, 91105420
    SwissProt/UniProt                     Q9NZD4, P55008, O15155
    UniGene                               Hs.652097, Hs.652094, Hs.652089
    Ensembl Gene ID                       ENSG00000003147, ENSG00000005801, ENSG00000008130
    SGD yeast ORF ID                      YCR024C-B, YCR108C, YLR157W-E
    Amersham Human Whole Genome           GE200018, GE897528, GE519380
    Affymetrix HG-U133A                   1487_at, 1320_at, 1316_at
    Affymetrix HG-U95A                    1014_at, 1015_s_at, 1017_at
    Affymetrix HG-U133 Plus 2.0           1552258_at, 1487_at, 1438_at
    Affymetrix HG-U133B                   200017_at, 200018_at, 200013_at
    04_identifier.png
    In this step, up to 5 test sets of the identifier type specified in the previous step can be uploaded. Each line must contain exactly one identifier, all files must use the same identifier type, and each file is limited to a size of 500 kB. The files that have been uploaded so far are listed.
    05_test.png
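    Before uploading, a test set file can be checked locally against these rules. The following Python sketch is not part of GeneTrail. The function name is hypothetical, the size limit is assumed to mean 500 * 1024 bytes, and skipping blank lines is an assumption as well.

```python
import os
import tempfile

MAX_BYTES = 500 * 1024  # assumption: the 500 kB limit read as 500 * 1024 bytes


def read_test_set(path):
    """Return the identifiers in the file; raise ValueError on a format problem."""
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the size limit")
    identifiers = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # assumption: blank lines are tolerated
            if len(fields) != 1:
                raise ValueError(f"line {number}: expected exactly one identifier")
            identifiers.append(fields[0])
    return identifiers


# Demo with a temporary file holding three NCBI Gene IDs, one per line.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "test_set.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("5894\n11186\n11848\n")
    identifiers = read_test_set(demo_path)

print(identifiers)
```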
    For ORA, a reference set has to be specified. Please note that this step is skipped for GSEA. There are two options for specifying the reference set: it can be uploaded in the same way as the test set in the previous step, or it can be selected from the drop-down menu, where GeneTrail offers a variety of common reference sets. These reference sets include:

    Set                                       Description
    All Genes (GeneIDs) of selected Organism  All annotated NCBI Gene IDs of the organism specified in the second step.
    Affymetrix HG-U133A                       All spots on the Affymetrix HG-U133A chip that can be mapped to NCBI Gene IDs.
    Affymetrix HG-U95A                        All spots on the Affymetrix HG-U95A chip that can be mapped to NCBI Gene IDs.
    Affymetrix HG-U133 Plus 2.0               All spots on the Affymetrix HG-U133 Plus 2.0 chip that can be mapped to NCBI Gene IDs.
    Affymetrix HG-U133B                       All spots on the Affymetrix HG-U133B chip that can be mapped to NCBI Gene IDs.
    Amersham Human Whole Genome               All spots on the Amersham Human Whole Genome cDNA chip that can be mapped to NCBI Gene IDs.
    Heidelberg Human Fetal Brain              NCBI Gene IDs of genes that encode proteins on the protein array designed at "Deutsches Ressourcenzentrum für Genomforschung GmbH".
    06_reference.png
    In the last step the functional categories and analysis parameters have to be defined by the user:
    • The user has to select at least one of the functional categories provided.
    • Additionally, the user can upload a self-defined category file for analysis in the following GeneTrail format (.gtf):
      #CategoryName1
      ID1
      ID2
      ID43
      #CategoryName2
      ID23
      ID2
      ID54
      We also support the gene matrix transposed file format (.gmt) and the gene matrix file format (.gmx), which are standard file formats for gene sets. Please note that we ignore the second column or the second row of these file formats respectively, because we expect a description and not an identifier in these fields.
    • The statistical method has to be selected. GeneTrail proposes a method based on the uploaded test and reference sets. If none of the provided methods is suitable, a warning message is shown.
    • The method for adjusting p-values for the so-called multiple testing problem has to be selected. Available are the conservative Bonferroni adjustment and the False Discovery Rate. The user can also choose to compute unadjusted p-values, but this option is not recommended.
    • The significance threshold has to be specified. Only categories with p-values below this threshold are considered significant. Please note that only significant categories are listed on the output page.
    • Only categories with at least the chosen minimum number of genes are considered. This option is important for two reasons. First, a narrow category, such as a transcription factor that regulates only two genes of which one is in the test set, would likely be statistically significant, although the finding is probably irrelevant. Second, it affects the correction for multiple testing. If the user is interested only in large KEGG pathways, such as insulin signalling or apoptosis, the many small pathways increase the number of tests, and the adjusted p-values of the large pathways are penalized more than necessary. In such a case, the threshold should be increased to about 10 genes.
    • To give an idea of how the genes in the test and reference set are distributed along the chromosomes, gene localizations are smoothed by using a Gaussian distribution for each gene. The user has to specify the standard deviation of the Gaussian distributions: the higher the standard deviation, the smoother the visualization. Additionally, the chromosomes are divided into a certain number of discrete bins. Usually, these parameters do not need to be changed.
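    The two p-value adjustments named in the list above can be illustrated with a short Python sketch. It is not GeneTrail's implementation. In particular, it assumes that the False Discovery Rate option corresponds to the Benjamini-Hochberg procedure, which this tutorial does not state, and the raw p-values are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.01, 0.02, 0.04, 0.5]  # made-up p-values, one per category
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

    Both adjustments scale with the number of tested categories, which is why raising the minimum category size (and thereby testing fewer categories) changes the adjusted p-values of the remaining ones.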
    07_analyses.png
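    The GeneTrail category format (.gtf) from the last step is simple enough to check locally before uploading. The following Python sketch reads the example shown in the tutorial. The function name is hypothetical, and ignoring blank lines is an assumption rather than documented GeneTrail behaviour.

```python
def parse_category_file(lines):
    """Map each '#CategoryName' header to the identifiers listed below it."""
    categories = {}
    current = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("#"):
            current = line[1:].strip()
            categories.setdefault(current, [])
        elif current is None:
            raise ValueError("identifier found before the first category header")
        else:
            categories[current].append(line)
    return categories


example = """#CategoryName1
ID1
ID2
ID43
#CategoryName2
ID23
ID2
ID54
"""

categories = parse_category_file(example.splitlines())
for name, members in categories.items():
    print(name, members)
```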




    After all parameters have been specified, the computations are performed. Depending on the chosen analyses and the load on the server, this may take several minutes. However, the results page is generated immediately and updated continuously, so that intermediate results can be accessed. The results page can also be bookmarked and remains accessible for 15 days after the computation. The results pages of an over-representation analysis and a gene set enrichment analysis are shown and interpreted below.








    08_resultORA.png
    If more than one test set has been uploaded, the ORA summary page contains hyperlinks to each test set at the top. For each test set, the number of uploaded IDs, the number of known IDs, and the number of IDs with an amino acid sequence are shown. Clicking on the blue "show detailed information" button opens a list of all genes with additional information, e.g. the chromosomal localization.
    For each category type (e.g. KEGG or GO), the total numbers of annotated genes in the test and reference set are given. In addition, the number of pathways/transcription factors/chromosomes with more than a user-defined number of genes (see step 5 above) is provided, together with the number of significant categories without adjustment, with Bonferroni adjustment, and with FDR adjustment. For Gene Ontology and the protein-protein interactions, graphical representations of the results are also available.
    Clicking on the "Show details" button opens the detailed results. Each row starts with the name of the category (e.g. a certain GO term or KEGG pathway). If available, the names are linked directly to the original sources. Next, a red upward or green downward arrow indicates whether the category is enriched or depleted, followed by the computed p-value. The last three columns contain the number of observed genes in the test set that are in the category, the number expected in the test set by chance, and the genes that are in the category.
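    The observed and expected numbers in an ORA result row can be reproduced with a small Python sketch. It assumes the test set is a subset of the reference set and uses a one-sided hypergeometric test. Which test GeneTrail actually applies depends on the uploaded sets, so this is an illustration rather than the tool's method, and all numbers are made up.

```python
from math import comb


def expected_count(test_size, reference_size, category_in_reference):
    """Genes of the category expected in the test set by chance."""
    return test_size * category_in_reference / reference_size


def enrichment_p_value(observed, test_size, reference_size, category_in_reference):
    """P(X >= observed) when test_size genes are drawn without replacement."""
    total = comb(reference_size, test_size)
    upper = min(test_size, category_in_reference)
    outside = reference_size - category_in_reference
    favourable = sum(
        comb(category_in_reference, i) * comb(outside, test_size - i)
        for i in range(observed, upper + 1)
    )
    return favourable / total


# Made-up example: 100 reference genes, 10 of them in the category,
# 10 genes in the test set, 4 of which fall into the category.
expected = expected_count(10, 100, 10)
p_value = enrichment_p_value(4, 10, 100, 10)
direction = "enriched" if 4 > expected else "depleted"
print(expected, direction, p_value)
```

    A category counts as enriched when more genes are observed than expected, and as depleted when fewer are observed. For depletion, the lower tail P(X <= observed) would be summed instead.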
    09_resultGSEA.png
    As on the ORA results page, the GSEA summary page contains hyperlinks to each test set at the top if more than one test set has been uploaded. For each test set, the number of uploaded IDs, the number of known IDs, and the number of IDs with an amino acid sequence are shown. Additionally, the sorted list with duplicate entries removed can be accessed. Clicking on the blue "show detailed information" button opens a list of all genes with additional information, such as the chromosomal localization.
    The category headers are again similar to the ORA headers. For each functional category type, the total numbers of annotated genes in the test and reference set are given. In addition, the number of categories with more than a user-defined number of genes (see step 5 above) is provided, together with the number of significant categories without adjustment, with Bonferroni adjustment, and with FDR adjustment. For Gene Ontology, a graphical representation of the results is also available.
    Clicking on the "Show details" button opens the detailed results. Each row starts with the name of the category. If available, the names are linked directly to the original source. The second column contains the p-value of the category, followed by the number of genes of the test set that are in the category. The last column contains a graphical representation of the running sum statistic. For details on how the running sum statistic is generated, we refer to Subramanian et al. (2005). If the function is colored red, the extreme value of the running sum (its largest deviation from zero) is positive. If the function is colored green, this extreme value is negative.
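    To make the running sum and its coloring more concrete, here is a simplified Python sketch. It uses an unweighted running sum that steps up for genes in the category and down for all others, which is an assumption made for illustration. It is neither the weighted statistic of Subramanian et al. (2005) nor necessarily the exact variant GeneTrail computes, and the gene names are made up.

```python
def running_sum(ranked_genes, category):
    """Unweighted running sum over a sorted gene list (simplified sketch)."""
    hits = [gene in category for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(hits) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("category must contain some, but not all, of the genes")
    total = 0.0
    path = []
    for is_hit in hits:
        # Step sizes are chosen so that the sum returns to zero at the end.
        total += 1.0 / n_hits if is_hit else -1.0 / n_misses
        path.append(total)
    return path


def extreme_value(path):
    """Value of the running sum with the largest absolute deviation from zero."""
    return max(path, key=abs)


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # made-up sorted gene list

top = extreme_value(running_sum(ranked, {"g1", "g2"}))
bottom = extreme_value(running_sum(ranked, {"g5", "g6"}))

for value in (top, bottom):
    print(value, "red" if value > 0 else "green")
```

    A category whose genes cluster at the top of the sorted list produces a positive extreme value, and one whose genes cluster at the bottom produces a negative one.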