Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions.
Chromium Single Cell Gene Expression is a powerful technology that systematically measures individual cell transcriptomes. The combination of 10x barcodes and unique molecular identifier (UMI) sequences enables the quantification of mRNA molecules in each cell. However, due to intrinsic differences introduced in the workflow, e.g. differences in capture and reverse transcription efficiency and dropout issues (Lytal et al., 2020), the RNA molecule counts reflect both biological and technical variation. Therefore, data normalization is needed to remove technical variation while preserving biological variation before downstream processing.
Currently, a widely used normalization approach is to divide each raw UMI count by the total UMI count detected in that cell, multiply by a scale factor (usually 10,000), add a pseudo-count (typically 1), and log-transform the result. The size-factor normalization reduces technical variation from sequencing depth, while the log transformation dampens the effect of expression outliers and helps prevent high-abundance genes from dominating downstream analysis because of their higher technical variability. Many popular single cell tools provide functions that implement this method, such as the NormalizeData function in Seurat, the normalize_total and log1p functions in Scanpy, and LogNorm in Loupe Browser (10x Genomics).
This approach can mitigate the relationship between sequencing depth and gene expression, but its effect can be uneven across genes of different abundance. Hafemeister and Satija (2019) observed that it fails to effectively normalize high-abundance genes and that, after normalization, high-abundance genes show disproportionately high variance in cells with low UMI counts. Despite this, a benchmarking study showed that the method can achieve satisfactory performance in clustering and embedding (Lytal et al., 2020). The method may therefore successfully separate common cell types; however, it has also been shown to generate a low-dimensional embedding in which the ordering of cells within a cell type remains strongly correlated with cellular sequencing depth, which could hinder accurate representation of cell subtypes (Hafemeister and Satija, 2019).
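The size-factor and log-transformation procedure described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used by Seurat, Scanpy, or Loupe Browser; the function name `log_normalize` and the cells-by-genes list layout are assumptions made for the example.

```python
import math

def log_normalize(counts, scale_factor=10_000, pseudo_count=1.0):
    """Depth-normalize and log-transform raw UMI counts.

    counts: list of cells, each a list of raw UMI counts per gene
    (hypothetical layout chosen for this sketch).
    """
    normalized = []
    for cell in counts:
        total = sum(cell)  # total UMI count detected in this cell
        if total == 0:
            # An empty cell has no depth to normalize by; every gene maps to log(pseudo_count).
            normalized.append([math.log(pseudo_count)] * len(cell))
            continue
        normalized.append(
            [math.log(c / total * scale_factor + pseudo_count) for c in cell]
        )
    return normalized

# Two cells with identical composition but a 10-fold difference in depth
example = log_normalize([[1, 0, 3], [10, 0, 30]])
```

In this toy input the two cells receive identical normalized values, which is the intended effect of dividing by the per-cell total.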
Besides the normalization method described above, many other data normalization methods have been developed for single cell RNA-seq data. This article introduces some of the normalization methods commonly used in the single cell community.
1. SCTransform
- Publication: Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression
- Tutorial: Using sctransform in Seurat
- Programming language: R
- Normalization steps: In the first step, a negative binomial generalized linear model is used to fit each gene with observed total UMI count in a cell (a proxy for sequencing depth) as a covariate. In the second step, each model parameter, including the intercept, slope, and negative binomial dispersion parameters, is regularized based on the relationship between parameter values and gene mean, to avoid overfitting. Lastly, the regularized parameters are used to define an affine function using the negative binomial model to transform observed UMI counts into Pearson residuals.
- Features: 1) No assumption of a fixed size factor; 2) Regularization minimizes overfitting of single cell RNA-seq data; 3) Pearson residuals are independent of sequencing depth and are suitable for variable gene selection, dimensionality reduction, clustering, visualization, and differential expression.
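The final step of SCTransform, turning an observed count into a Pearson residual, can be illustrated as follows. This is only a sketch of the residual formula from Hafemeister and Satija (2019), assuming the regularized intercept, slope, and dispersion for a gene are already available; the per-gene fitting and the regularization are not shown, and the function name and parameter values are hypothetical.

```python
import math

def pearson_residual(count, total_umi, intercept, slope, theta):
    """Pearson residual of one UMI count under a negative binomial model.

    Expected count: log(mu) = intercept + slope * log10(total_umi)
    Variance:       mu + mu**2 / theta
    """
    mu = math.exp(intercept + slope * math.log10(total_umi))
    variance = mu + mu ** 2 / theta
    return (count - mu) / math.sqrt(variance)

# Hypothetical parameters: with slope = ln(10), mu scales linearly with depth,
# so a cell with 10,000 UMIs has an expected count of about 10 for this gene.
b0, b1, theta = math.log(0.001), math.log(10), 10.0
at_expectation = pearson_residual(10, 10_000, b0, b1, theta)
above_expectation = pearson_residual(20, 10_000, b0, b1, theta)
```

A count equal to the model expectation gives a residual near zero, and counts above the expectation give positive residuals scaled by the modeled standard deviation.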
2. BASiCS
- Publication: BASiCS: Bayesian Analysis of Single-Cell Sequencing Data
- Tutorial: BASiCS vignettes
- Programming language: R
- Normalization steps: Bayesian Analysis of Single-Cell Sequencing data (BASiCS) builds a joint hierarchical model for spike-in and biological genes to simultaneously quantify technical variation and cell-to-cell biological heterogeneity. If spike-in gene data is not available, BASiCS is able to quantify technical variation from technical replicates of samples where cells from a population are randomly allocated to multiple independent experimental replicates (Eling et al., 2018).
- Features: 1) If spike-in genes are unavailable, technical replicates are needed to quantify technical variation. 2) In addition to data normalization, BASiCS is also able to identify highly (and lowly) variable genes within one group and identify changes of gene expression between multiple groups.
3. SCnorm
- Publication: SCnorm: robust normalization of single-cell RNA-seq data
- Tutorial: SCnorm vignettes
- Programming language: R
- Normalization steps: SCnorm uses quantile regression to estimate the dependence of log-transformed transcript expression on sequencing depth for each gene. Genes are then grouped by the similarity of this dependence, and a second quantile regression is used to estimate scale factors in each group. Within-group adjustment for sequencing depth is performed using the estimated scale factors to provide normalized counts. If multiple biological conditions are present, SCnorm normalization is performed on each condition, and then rescaling is conducted across conditions. In the rescaling, genes are split into quartiles based on their expression levels. For each group and condition, each gene is scaled by a common scale factor estimated as the median fold-change between each gene’s condition-specific mean and its mean across conditions. If good spike-ins are available, the performance of post-normalization rescaling between conditions could be improved.
- Features: 1) Instead of using a global scale factor, SCnorm estimates scale factors separately for different gene groups showing distinct dependence on sequencing depth; 2) SCnorm can scale counts between different conditions; 3) spike-in data is optional.
4. Scran
- Publication: Pooling across cells to normalize single-cell RNA sequencing data with many zero counts
- Tutorial: Pooling normalization
- Programming language: R
- Normalization steps: All cells are averaged to make a reference pseudo-cell. Multiple cell pools are selected, and in each pool, expression values for cells are summed together and normalized against the reference to obtain a pool-based size factor. The pool-based size factor is equal to the sum of the cell-based size factors in that pool, which gives one linear equation per pool. Repeating this over multiple cell pools yields a system of linear equations that is solved to estimate the size factor for each cell.
- Features: Scran generates size factors for each individual cell and the results can be used in any normalization or analysis method that allows user-specified size factors, such as log-transformed normalization (dividing each count by cell-specific size factor and log-transformed with the addition of a pseudo-count).
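The pooling idea described above can be sketched on a toy matrix. This is a simplified illustration, not the scran implementation: it assumes overlapping pools taken around a ring of cells, a single pool size, and a square system solved by Gaussian elimination, and all function names are hypothetical.

```python
from statistics import median

def solve_linear_system(a, b):
    """Solve a square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x

def pooled_size_factors(counts, pool_size=3):
    """Estimate one size factor per cell from pooled counts (cells x genes).

    Assumption of this sketch: the number of cells is not a multiple of
    pool_size, so the ring-shaped pool design has a unique solution.
    """
    n_cells, n_genes = len(counts), len(counts[0])
    # Reference pseudo-cell: per-gene average over all cells
    reference = [sum(cell[g] for cell in counts) / n_cells for g in range(n_genes)]
    design, pool_factors = [], []
    for start in range(n_cells):
        members = [(start + k) % n_cells for k in range(pool_size)]
        pooled = [sum(counts[i][g] for i in members) for g in range(n_genes)]
        # Pool-based size factor: median ratio of pooled counts to the reference
        ratios = [pooled[g] / reference[g] for g in range(n_genes) if reference[g] > 0]
        pool_factors.append(median(ratios))
        # One equation per pool: the sum of its members' size factors equals the pool factor
        design.append([1.0 if i in members else 0.0 for i in range(n_cells)])
    return solve_linear_system(design, pool_factors)

# Five cells with the same composition and relative depths 1 to 5
base = [4, 2, 6, 8]
toy_counts = [[depth * b for b in base] for depth in (1, 2, 3, 4, 5)]
size_factors = pooled_size_factors(toy_counts)
```

For this toy input the recovered size factors are proportional to the simulated depths. Summing counts across a pool reduces the number of zeros before ratios are taken, which is the motivation for pooling in sparse single cell data.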
5. Linnorm
- Publication: Linnorm: improved statistical analysis for single cell RNA-seq expression data
- Tutorial: Linnorm User Manual
- Programming language: R
- Normalization steps: Linnorm can perform data normalization and transformation. First, Linnorm calculates the relative expression scale for each gene in each cell. Then, it filters out genes to ensure that the genes used for modeling are largely homogeneous: it removes i) low-count genes with a high proportion of zeros and ii) highly variable genes. Linnorm then transforms the data based on the relative expression scale, modulated by a transformation parameter. The transformation parameter is optimized to minimize the deviation of the transformed data from homoscedasticity (homogeneity of variance) and normality. The relative expression scale is multiplied by the optimized transformation parameter and then log-transformed. Each gene’s mean expression value across all cells is calculated, and then the expression mean and each cell’s expression are fitted to a linear model. Each model is shifted based on the normalization strength coefficient μ (0 ≤ μ ≤ 1, default value 0.5). Finally, the shifted model is used for normalization.
- Features: In addition to data normalization, Linnorm provides a function for data transformation that minimizes deviation from the homoscedasticity (homogeneity of variance) and normality assumptions.
6. PsiNorm
- Publication: PsiNorm: a scalable normalization for single-cell RNA-seq data
- Tutorial: PsiNorm User Manual
- Programming language: R
- Normalization steps: PsiNorm performs between-sample normalization for single-cell RNA-seq data based on the power-law Pareto type I distribution. PsiNorm estimates the shape parameter alpha of the Pareto type I distribution for each cell by maximum likelihood and uses it as a multiplicative normalization factor for that cell's input counts.
- Features: PsiNorm is a highly scalable normalization method that provides comparable performance with shorter runtime and lower RAM consumption, and it can work with out-of-memory data.
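The PsiNorm calculation can be sketched with the standard maximum likelihood estimator of the Pareto type I shape parameter, alpha = n / Σ log(x_i / x_min). This is a minimal illustration rather than the package's code: it assumes a pseudo-count of 1 is added so that zero counts are valid under the Pareto model, and the function names are hypothetical.

```python
import math

def pareto_shape(cell_counts, pseudo_count=1):
    """Maximum likelihood estimate of the Pareto type I shape parameter for one cell.

    Assumption of this sketch: counts are shifted by a pseudo-count so that
    every value is positive. Cells whose counts are all identical are not
    handled, because the sum of logs would be zero.
    """
    shifted = [c + pseudo_count for c in cell_counts]
    x_min = min(shifted)
    log_sum = sum(math.log(x / x_min) for x in shifted)
    return len(shifted) / log_sum

def psi_normalize(counts):
    """Multiply each cell's counts by that cell's estimated shape parameter."""
    return [[c * pareto_shape(cell) for c in cell] for cell in counts]

shallow_cell = [0, 1, 3, 7]
deep_cell = [0, 3, 15, 63]
alphas = [pareto_shape(shallow_cell), pareto_shape(deep_cell)]
normalized = psi_normalize([shallow_cell, deep_cell])
```

In this toy input the more deeply sequenced cell gets a smaller alpha, so its counts are scaled down relative to the shallow cell. Each cell is processed independently, which is consistent with the scalability noted above.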
There is no consensus on the best-performing normalization method. It is good practice to test different normalization methods and compare their results in cell clustering, embedding, and differential gene expression analysis.
Required skills:
- Familiarity with a programming language (most commonly R)
References:
- Lytal et al. (2020). Normalization methods on single-cell RNA-seq data: an empirical survey. Front Genet. 11, 41.
- Hafemeister and Satija (2019). Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296.
- Vallejos et al. (2015). BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. PLOS Computational Biology. 11(6), e1004333.
- Eling et al. (2018). Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 7(3), 284.
- Bacher et al. (2017). SCnorm: robust normalization of single-cell RNA-seq data. Nat Methods. 14, 584.
- Lun et al. (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75.
- Yip et al. (2017). Linnorm: improved statistical analysis for single cell RNA-seq expression data. Nucleic Acids Res. 45, 13097.
- Borella et al. (2022). PsiNorm: a scalable normalization for single-cell RNA-seq data. Bioinformatics. 38, 164.