# GeneTrail 3.2

#### Advanced high-throughput enrichment analysis

## Single-Cell analysis

### Motivation

The GeneTrail single cell workflow provides functionality to analyze single cell RNA sequencing (scRNA-seq) data sets. Particularly, it is designed to (1) identify for each cell active biological processes, and, (2) subsequently, based on these results, to characterize functional differences between cells. In order to process scRNA-seq data sets, several analysis steps are conducted: First, we optionally normalize the user uploaded RNA-seq matrix. Then, we conduct an enrichment analysis for each single-cell to identify pathways associated with the expressed genes. Additionally, we cluster the cells and perform dimension reduction. As a last step, we group individual cells based on cell clusters or uploaded annotations, e.g. tissue or sample id, and find pathways that are associated with specific groups.

An overview of the workflow is shown in Figure 1.

### Input data

The input for our single cell workflow is a gene expression matrix obtained from a scRNA-Seq run and metadata file. Both can be uploaded as plain text files.

### Expression matrix

The expression matrix is assumed to be a tab-separated matrix where columns represent specific cells and each row represents the expression measurements for a particular gene in those cells. The content of the matrix can either be read/UMI counts or normalized expression values.

**Example**

Cell1 Cell2 Cell3 GeneA 0 4 0 GeneB 2 1 0 GeneC 2 19 5 ...

### Metadata

The metadata file can be uploaded as tab-separated text file in which each column provides additional meta information for the cells that should be analyzed. The cell identifier have to match with the column names of the scRNA-Seq matrix. Only cells with an entry in both, the metadata file, and the scRNA-Seq matrix, are analyzed in the subsequent workflow.

MetaInfo1 MetaInfo2 MetaInfo3 Cell1 age-3 batch_1 cluster_6 Cell2 age-5 batch_1 cluster_2 Cell3 age-3 batch_2 cluster_6 ...

### Filtering and normalization

As already mentioned, users are able to either upload preprocessed and normalized expression values or raw count data. For raw counts, GeneTrail provides several processing steps that are discussed in the following sections.

### Filtering

Current scRNA-seq protocols involve steps to isolate single cells. During this process several artifacts can occur. Some of the cells might not have been separated properly, which could result in doublets, while others could potentially be damaged. GeneTrail provides several filtering procedures that help to remove these artifacts. To this end, we follow the best practices guide by Luecken and Theis [1].

#### Removing damaged cells

A low number of UMI/read counts, a low number of expressed genes and a high percentage of mitochondrial gene counts might be an indication for cells with a broken membrane, where the mRNA in the cytoplasm might have leaked out. Hence, users can select thresholds for each of those criteria to remove affected cells.

#### Removing doublets

Doublets can potentially be detected by a very high number of UMI/read counts or by a very high number of expressed genes. Here, users are again able to select thresholds in order to exclude affected cells from further analyses.

### Normalization

For our analysis to work properly, we require a within-sample normalization, as we compare the expression values of different genes. This means that depending on the used scRNA-Seq protocol (e.g. for full-length sequencing protocols like SMART-seq2) a gene length normalization is required. Additionally, normalized expression values should be log transformed ($log(x+1)$). Currently, we offer two methods for gene-length normalization: TPM [2] and GeTMM [3], and two methods for between-sample normalization: CPM and TMM [4]. Based on the selected protocol, we already preselect suitable default parameters. After normalization all expression values are log transformed ($log_2(x+1)$).

### Enrichment analysis of individual cells

Next, for each single cell, an enrichment analysis is conducted. To this end, we consider only the most expressed genes of the cell. Therefore, the user can determine a threshold for filtering the genes that should be further analyzed. Alternatively, we use the $x$ genes with the highest normalized expression values. In both cases, we generate one gene set for each single cell, independently. The gene sets are used as test sets in an Over-Representation Analysis (ORA) . As a reference set all protein coding genes are used.

### Over-representation analysis

Let us assume that we have a biological category (signaling pathway or biological process) that has k entries in our test set, which consists of n entries, and l entries in the reference, which consists of m entries. We can then use one of the following statistical tests to check if the test set has more entries in our category then expected by chance.

#### Hypergeometric test

If all elements of the test set are also a part of the reference, the hypergeometric test is applied to compute a p-value for the analyzed category:

$$P(K>=k)=\sum\limits_{i=max(n+l-m, 0)}^{k} \frac{\binom{l}{i}\binom{m-l}{n-i}}{\binom{m}{n}}$$

#### Fisher's exact test

If the test set contains elements that are not a part of the reference, the Fisher's exact test is applied to compute a p-value for the analyzed category:

$$P(K>=k)=\sum\limits_{i=max(l+k-m,0)}^{k} \frac{\binom{n}{i}\binom{m}{l+k-i}}{\binom{m+n}{l+k}}$$

#### Multiple testing correction

Since for each cell in our single cell analysis multiple biological categories are tested simultaneously, we need to adjust the resulting p-values in order the account for the multiple testing problem. For this purpose GeneTrail provides a variety of methods (cf. [5]).

### Group comparison of enrichment results

Along with the scRNA-seq expression matrix, a user can upload metadata for each cell (e.g. tissue of origin, sample id, or clinical information). This information can be used to define groups of special interest to the user. Additionally, we cluster the cells in our workflow independently from the given annotations using Seurat3 . Subsequently, the clusters can also be used to partition the cells into groups.

In order to find pathways that are associated with certain groups, we perform a $\chi$-squared test on the previously generated enrichment results. The intuition behind the $\chi$-squared test is to identify pathways that are more often enriched in cells from one particular group compared to all other groups. To this end, we perform one $\chi$-squared test for each pathway and for each group.

The $\chi$-squared test for two variables can be defined over a $2 \times 2$ contingency table. Let $p$ be a pathway and $g$ be a group of interest. Furthermore, let $n$ be the total number of cells. We create a $2\times 2$ contingency table as shown below. The columns indicate if pathway $p$ is enriched for a cell and the rows indicate if a cell is contained in group $g$. The cells in the contingency table ($c_{ij}$) count the number of single cells with a combination of these attributes, e.g. the top-left cell of the contingency table ($c_{00}$) counts the number of single cells in group $g$ for which pathway $p$ is enriched. The $\chi$-squared test then tests if being in group $g$ is statistically independent from having a significant result for pathway $p$. The $\chi$-squared test statistics is defined as

$$\chi = n\cdot\sum_i^2 \sum_j^2 \frac{(\frac{c_{ij}}{n}-p_{ij})^2}{p_{ij}},$$

with $p_{ij}$ as the estimated probability of $c_{ij}$ calculated based on the global distribution. We obtained a $p$-value for this $\chi$-squared statistic using Boost version 1.71.

e not e i a b not i c d

The $2\times 2$ contingency table used to test if pathway $p$ is more often found to be enriched in group $g$ compared to what is expected given all other groups. The variable $e$ represents cells in which pathway $p$ is enriched. The variable $i$ represents cells that are contained in group $g$.

### Marker gene detection

For each group, we compare the gene expression of this group against all other groups in order to find genes that are associated with a specific group. The marker genes for a specific group are calculated as follows: For each group, the expression matrix is separated into cells belonging to the given group and cells that do not belong to the group. For these two new groups of cells, a Wilcoxon rank-sum test, and an independent-shrinkage t-test is performed using GeneTrail with default parameters.

### Dimension reduction

In order to visualize the cells, we offer several dimension reductions including UMAP and t-SNE. The t-SNE dimension reduction is calculated with Seurat3 using the following parameters: In the normalization step of Seurat3, we use LogNormalize as normalization method, and 10000 as scaling factor. To reduce the dimensionality of the data set, only the most variable genes are kept as features. To find the most variable genes, we select vst as selection method. The number of most variable genes is a parameter of the GeneTrail single cell workflow and defaults to 2000. Additionally, we choose 50 as perplexity. The UMAP dimension reduction is also calculated with Seurat3 using the same parameters and 30 for the n.neighbours parameter.

Finally, we provide another UMAP reduction representation calculated with Monocle3 [6], which is also the basis for the pseudo-time analysis described in the following section. For the Monocle3 analysis, we first reduced the dimension of the gene expression data set to the first 50 principal components and afterwards performed a dimension reduction to the first two UMAP coordinates using standard parameters of Monocle3.

### Pseudo-time analysis

The pseudo-time analysis offered by GeneTrail is performed using the Monocle3 R package with standard parameters and a dimension reduction as discussed above. The basis for the pseudo-time calculation is a graph learning algorithm that tries to learn the ancestry of the cells. Monocle3 provides the coordinates of the graph vertices in their UMAP reduction space. Therefore, it is currently not possible to change the dimension reduction representation to any other than the UMAP representation of Monocle3 for the pseudo-time plot on our results page. If the graph is generated, the pseudo-time is calculated based on the distance to a selected start cell. In our automatized workflow, we currently do not support the manual selection of a start cell, but we estimate the start cell using a script from the Monocle3 documentation.

### Enrichment analysis of user-selected groups

On the results page, we offer an on-demand analysis to compare user-selected groups against each other by performing an enrichment analysis. To this end, a user can decide which groups (i.e. cells with the same annotation) should be part of either the sample or the reference set of the enrichment analysis. The annotation can originate from the metadata file or from the clusters that GeneTrail automatically calculated. If more than one group of cells is chosen for the sample (reference) set, the union of the cells in these groups is taken as sample (reference) set. To ensure mutual exclusiveness of the sample and reference set, the cells that are in both sets are removed.

After the assignment of annotation groups to either the reference or the sample set and an appropriate curation of the reference and the sample set as described above, we perform an enrichment analysis to compare the sample to the reference set.

We compare the expression of the sample and the reference set as follows: For each gene $i$, we calculate the mean expression value in the sample set $m_{si}$ and the mean expression value in the reference set $m_{ri}$. A score for each gene is calculated as $s_i=m_{si}-m_{ri}$. Hence, if the expression values for this gene are on average higher (lower) in the sample set than in the reference set, the gene is assigned a positive (negative) score. On the resulting list of scores, we perform two ORAs, one for the highest 1000 positive genes (representing the sample group) and one for the lowest 1000 negative genes (representing the reference group). The ORA is performed using the GeneTrail workflow with default parameters (see Section 4.1 of this supplement for further information on ORA in GeneTrail).

### Biblibgraphy

- Current best practices in single-cell RNA-seq analysis: a tutorial Molecular systems biology
- Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples Theory in biosciences Springer
- Gene length corrected trimmed mean of M-values (GeTMM) processing of RNA-seq data performs similarly in intersample analyses while improving intrasample comparisons BMC bioinformatics BioMed Central
- A scaling normalization method for differential expression analysis of RNA-seq data Genome biology BioMed Central
- Multi-omics Enrichment Analysis using the GeneTrail2 Web Service Bioinformatics Oxford University Press
- The single-cell transcriptional landscape of mammalian organogenesis Nature Nature Publishing Group