The cell annotation model was co-developed by 10x Genomics and the Cellarium AI Lab at the Data Sciences Platform of the Broad Institute. The model is in beta.
Cell Ranger introduces a new pipeline for automated cell type annotation, which can be applied to the Gene Expression outputs of Cell Ranger count, multi, and aggr to generate accurate cell type labels. This method assigns cell types by comparing each cell's gene expression profile to annotated reference datasets, rather than relying on known marker genes for each cell type or tissue-specific references. Please note that the cell annotation pipeline is a beta feature.
Specifically, each cell barcode's gene expression profile is compared to a model built on the Chan Zuckerberg CELL by GENE (CZ CELLxGENE) census, identifying the most similar cell types. A consensus label is then assigned to each barcode, with the results summarized in the web_summary.html file. These labels can be viewed in Loupe Browser or accessed via the cell_types.csv output file.
The algorithm generates an embedding for each cell barcode by first applying principal component analysis (PCA) to the reference dataset, extracting the top 512 components for each reference cell. The gene expression profile of each cell barcode being analyzed is transformed into the same 512-dimensional (512-D) embedding. To classify a cell, the algorithm performs an approximate nearest-neighbor (ANN) search, identifying the 500 most similar cells in the reference set based on these embeddings. The most common cell type among these nearest neighbors is then assigned to the query cell.
This figure shows the gene expression profile of a single 10x Barcode (shown in red), transformed into a 512-D embedding. The approximate nearest neighbors (primarily yellow cells) of the 10x Barcode are shown within the grey circle.
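The embed-then-vote procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the pipeline's implementation: it uses toy data with 20 genes, 5 components, and 15 neighbors in place of 512 components and 500 neighbors; it uses an exact brute-force search with Euclidean distance in place of the approximate nearest-neighbor search (the distance metric is an assumption); and the function names `fit_pca`, `embed`, and `annotate` are hypothetical.

```python
from collections import Counter

import numpy as np


def fit_pca(reference, n_components):
    """Fit PCA on the reference matrix (cells x genes) via SVD."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]


def embed(expression, mean, components):
    """Project cells into the embedding space learned from the reference."""
    return (expression - mean) @ components.T


def annotate(query_emb, ref_emb, ref_labels, k):
    """Assign each query cell the most common label among its k nearest reference cells."""
    # Exact Euclidean distances (assumption); the real pipeline uses an approximate search.
    dists = np.linalg.norm(ref_emb[None, :, :] - query_emb[:, None, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return [
        Counter(ref_labels[i] for i in row).most_common(1)[0][0]
        for row in nearest
    ]


# Toy reference: two well-separated synthetic "cell types" over 20 genes.
rng = np.random.default_rng(0)
t_center = np.r_[np.full(10, 5.0), np.zeros(10)]
b_center = np.r_[np.zeros(10), np.full(10, 5.0)]
reference = np.vstack([
    t_center + rng.normal(size=(60, 20)),
    b_center + rng.normal(size=(60, 20)),
])
ref_labels = ["T cell"] * 60 + ["B cell"] * 60

# Toy query barcodes: three drawn near the T cell center, two near the B cell center.
query = np.vstack([
    t_center + rng.normal(size=(3, 20)),
    b_center + rng.normal(size=(2, 20)),
])

mean, components = fit_pca(reference, n_components=5)
ref_emb = embed(reference, mean, components)
query_emb = embed(query, mean, components)  # same transform as the reference
predicted = annotate(query_emb, ref_emb, ref_labels, k=15)
print(predicted)
```

The key design point is that the query cells are projected with the mean and components learned from the reference, so both sets of cells live in the same embedding space before neighbors are compared.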
Cell type terms are sourced from the Cell Ontology, which CZ CELLxGENE uses to annotate all datasets. The reference datasets can vary in the granularity of their annotations: some experts may assign highly specific terms like "CD8-positive, CD25-positive, alpha-beta regulatory T cell," while others might use broader classifications such as "T cell." Please note that the cell annotation algorithm may perform poorly on samples such as cancers or cell lines, as these are not well represented in the CZ CELLxGENE database.
Our goal is to help users identify high-level cell types (e.g., T cells, B cells). To achieve this, the algorithm maps specific terms from the Cell Ontology to selected high-level cell types. These broader categories are displayed in both the web_summary.html and the .cloupe file. Some selected groups are illustrated in the figure below:
We benchmarked five human tissues (brain, blood, heart, kidney, and lung), some of which included multiple tissue types. The datasets consisted of both cell and nuclei data, tested using 3’ Single Cell Gene Expression v2 and v3 chemistries, as well as 5’ Single Cell Immune Profiling v1 and v2 chemistries.
Coarse and fine cell type annotations are available in the cell_types.csv file, which offers the option to refine classifications further.
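As a rough sketch of how the coarse and fine labels might be used together, the example below tallies barcodes per coarse cell type and lists the fine labels found within each. The column names (`barcode`, `coarse_cell_type`, `fine_cell_type`), the barcodes, and the rows are assumptions made for illustration; check the header of your own cell_types.csv before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt: column names, barcodes, and labels are illustrative only.
SAMPLE = """barcode,coarse_cell_type,fine_cell_type
AAACCTGAGAAACCAT-1,T cell,"CD8-positive, alpha-beta T cell"
AAACCTGAGAAACCGC-1,T cell,"CD4-positive, alpha-beta T cell"
AAACCTGAGAAACCTA-1,T cell,"CD4-positive, alpha-beta T cell"
AAACCTGAGAAACGAG-1,B cell,naive B cell
AAACCTGAGAAACGCC-1,monocyte,classical monocyte
"""


def summarize(handle):
    """Count barcodes per coarse type and per fine type within each coarse type."""
    coarse_counts = Counter()
    fine_by_coarse = {}
    for row in csv.DictReader(handle):
        coarse = row["coarse_cell_type"]
        coarse_counts[coarse] += 1
        fine_by_coarse.setdefault(coarse, Counter())[row["fine_cell_type"]] += 1
    return coarse_counts, fine_by_coarse


# For a real file, pass open("cell_types.csv", newline="") instead of the in-memory sample.
coarse_counts, fine_by_coarse = summarize(io.StringIO(SAMPLE))

for coarse, n in coarse_counts.most_common():
    print(f"{coarse}: {n} barcodes")
    for fine, m in fine_by_coarse[coarse].most_common():
        print(f"  {fine}: {m}")
```

Fine labels that contain commas, such as "CD8-positive, alpha-beta T cell", need to be quoted in the CSV, which is why the sketch relies on the `csv` module rather than splitting lines on commas.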