Cell Ranger Molecule Info (HDF5 File)

The cellranger pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid barcode and a valid UMI and that were assigned with high confidence to a gene or Feature Barcode. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries and feature set(s) used (general information about the HDF5 file format is available here). This file is called molecule_info.h5 in cellranger count outputs and sample_molecule_info.h5 in cellranger multi outputs.


(root)
    ├─ barcode_idx
    ├─ barcode_info	[HDF5 group]
    │   ├─ genomes
    │   └─ pass_filter
    ├─ barcodes
    ├─ count
    ├─ feature_idx
    ├─ features	[HDF5 group]
    │   ├─ _all_tag_keys
    │   ├─ target_sets [for Targeted Gene Expression or Flex]
    │   │    └─ [target set name]
    │   ├─ feature_type
    │   ├─ genome
    │   ├─ id
    │   ├─ name
    │   ├─ pattern [Feature Barcode only]
    │   ├─ read [Feature Barcode only]
    │   └─ sequence [Feature Barcode only]
    ├─ gem_group
    ├─ library_idx
    ├─ library_info
    ├─ metrics_json [HDF5 dataset; see below]
    ├─ probe_idx [Flex only, Cell Ranger v7.1+]
    ├─ probes [HDF5 group; Flex only, Cell Ranger v7.1+]
    │   ├─ feature_id
    │   ├─ feature_name
    │   ├─ probe_id
    │   └─ region
    ├─ umi
    └─ umi_type

You can examine the contents of the H5 file using software such as HDFView or the h5dump command. For example, the following command lists every group and dataset in the file:


h5dump -n molecule_info.h5

    HDF5 "molecule_info.h5" {
    FILE_CONTENTS {
    group      /
    dataset    /barcode_idx
    group      /barcode_info
    dataset    /barcode_info/genomes
    dataset    /barcode_info/pass_filter
    dataset    /barcodes
    dataset    /count
    dataset    /feature_idx
    group      /features
    dataset    /features/_all_tag_keys
    dataset    /features/feature_type
    dataset    /features/genome
    dataset    /features/id
    dataset    /features/name
    dataset    /gem_group
    dataset    /library_idx
    dataset    /library_info
    dataset    /metrics_json
    dataset    /umi
    dataset    /umi_type
    }
    }

The following HDF5 datasets in the molecule info file correspond to columns of a table. Each row of that table corresponds to a unique (UMI, cell-barcode, feature) tuple indicating the feature best supported by the reads (i.e., including PCR duplicates) assigned to that UMI and cell-barcode.

Column	Type	Description
`barcode_idx`	uint64	A zero-based index into the barcodes dataset (see next section), indicating the cell-barcode assigned to this putative molecule.
`count`	uint32	Number of reads associated with this putative molecule that were confidently mapped to the assigned feature.
`feature_idx`	uint32	A zero-based index into the feature list (see next section), indicating the feature to which this putative molecule was assigned.
`gem_group`	uint16	Integer label that distinguishes data coming from distinct 10x Genomics GEM reactions (such as different channels or chips).
`library_idx`	uint16	A zero-based index into the `library_info` array (see next section) that distinguishes data coming from distinct 10x Genomics libraries (for example, gene expression and Feature Barcode). There may be multiple libraries associated with a single GEM well.
`umi`	uint32	The processed (i.e., corrected) UMI sequence, 2-bit encoded (see the 2-bit encoding section).
`umi_type`	uint32	A boolean array specifying whether the molecule aligned to an exonic (1) or intronic (0) region of the associated feature.
`probe_idx`	uint32	Present for Flex analysis with Cell Ranger v7.1 and later. A zero-based index for the probes dataset (see Probe reference section), indicating the probe used to capture this transcript.
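
Because each row of this table is one unique (UMI, cell-barcode, feature) tuple, counting rows per barcode and feature gives a UMI count, and summing `count` over the same rows gives the supporting reads. The sketch below illustrates this on small in-memory lists. The function name `umi_and_read_counts` and the toy values are hypothetical; in practice the columns would be read from the file with an HDF5 library.

```python
from collections import Counter

def umi_and_read_counts(barcode_idx, feature_idx, count):
    """Tally molecules (UMIs) and reads per (barcode_idx, feature_idx) pair.

    The i-th entries of the three columns together describe one molecule.
    """
    umis = Counter()
    reads = Counter()
    for b, f, n in zip(barcode_idx, feature_idx, count):
        key = (int(b), int(f))
        umis[key] += 1        # one row = one unique UMI for this pair
        reads[key] += int(n)  # reads that support this UMI
    return umis, reads

# Toy columns with made-up values: four molecules in total.
barcode_idx = [0, 0, 0, 2]
feature_idx = [5, 5, 7, 5]
count = [3, 1, 2, 4]

umis, reads = umi_and_read_counts(barcode_idx, feature_idx, count)
```

These tallies cover every barcode in the file, not only those called as cells; restricting them to cells requires the `pass_filter` dataset.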

In addition, the molecule info file has datasets corresponding to information about the libraries, barcodes, and feature set(s) that were used in the analysis, as described below.

Experiment reference

The barcodes, library_info, and metrics_json datasets contain information about the experiments contained in this analysis:

Dataset	Type	Description
`barcodes`	string	A list of all barcodes that had at least 1 read in this experiment. The `barcode_idx` column described in the previous section contains indices into this list of barcodes. To distinguish between identical cell-barcode sequences observed in different GEM reactions, the GEM well is appended to the end of the cell-barcode sequence (e.g., `AAACCCAAGGAGAGTA-1`).
`library_info`	string	A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will at a minimum contain the metadata `library_id`, `library_type`, and `gem_group`.
`metrics_json`	string	Pipeline metrics in JSON format that are used internally by Cell Ranger (more detail on the metrics pages).
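
As a rough illustration of how these datasets might be consumed, the sketch below parses a `library_info` JSON string and splits the GEM well suffix from a barcode. The helper names and the example JSON values are hypothetical; only the keys `library_id`, `library_type`, and `gem_group` come from the description above. It assumes the JSON has already been read from the file as a string or bytes object.

```python
import json

def parse_library_info(raw):
    """Decode the library_info JSON text into a list of dicts.

    The position of each dict in the list is its library_idx.
    """
    if isinstance(raw, bytes):  # HDF5 readers often return bytes
        raw = raw.decode("utf-8")
    return json.loads(raw)

def split_barcode(barcode):
    """Split 'SEQUENCE-N' into the barcode sequence and the GEM well N."""
    sequence, _, gem_well = barcode.rpartition("-")
    return sequence, int(gem_well)

# Made-up example content, for illustration only.
example = (
    b'[{"library_id": 0, "library_type": "Gene Expression", "gem_group": 1},'
    b' {"library_id": 1, "library_type": "Antibody Capture", "gem_group": 1}]'
)

libraries = parse_library_info(example)
library_types = [lib["library_type"] for lib in libraries]
sequence, gem_well = split_barcode("AAACCCAAGGAGAGTA-1")
```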

Observed cell-barcodes

The HDF5 group barcode_info contains information regarding the barcodes that were called as cells during the analysis. This HDF5 group contains:

Dataset	Type	Description
`genomes`	string	A list of all genome references used for gene expression libraries in this analysis.
`pass_filter`	uint64	A matrix with three columns that contains one row per passing cell-barcode. Each row is a tuple (`barcode_idx`, `library_idx`, `genome_idx`), where `genome_idx` is an index into the `genomes` dataset. For Feature Barcode libraries, `genome_idx` will correspond to the genome reference used for the gene expression data from the specified cell-barcode.
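
One possible use of `pass_filter` is to restrict the per-molecule rows to barcodes that were called as cells. The following is a minimal sketch on toy data; the function names are hypothetical, and it assumes `pass_filter` and `barcode_idx` have already been loaded as sequences of integers.

```python
def cell_barcode_indices(pass_filter):
    """Return the set of barcode_idx values listed in pass_filter."""
    return {int(row[0]) for row in pass_filter}

def keep_cell_molecules(barcode_idx, pass_filter):
    """Return the row numbers of molecules whose barcode was called as a cell."""
    cells = cell_barcode_indices(pass_filter)
    return [i for i, b in enumerate(barcode_idx) if int(b) in cells]

# Toy data: each pass_filter row is (barcode_idx, library_idx, genome_idx).
pass_filter = [(0, 0, 0), (2, 0, 0), (2, 1, 0)]
barcode_idx = [0, 1, 2, 2, 3]

cell_rows = keep_cell_molecules(barcode_idx, pass_filter)
```

This sketch matches on `barcode_idx` alone. A stricter filter could instead match (`barcode_idx`, `library_idx`) pairs against the first two columns of `pass_filter`.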

Feature reference

The HDF5 group features contains information regarding the feature reference(s) used for the analysis. The datasets within the features group represent columns in a table containing one row per feature. Values in the feature_idx column described in the previous section provide indices into the rows of this table of features.

In addition to the columns described below, user-specified tags may also be present. The dataset _all_tag_keys contains a list of user-specified tags as well as built-in tags (genome, pattern, read, and/or sequence).

Column	Type	Description
`feature_type`	string	The type of feature reference to which this feature belongs (Gene Expression, CRISPR Guide Capture, Antibody Capture, or Custom).
`genome`	string	The genome reference for a given feature (e.g., "GRCh38" or "mm10"). For non-gene expression features, this entry is an empty string.
`id`	string	The unique id corresponding to this feature (for example, an Ensembl gene ID).
`name`	string	A human-readable name associated with this feature (for example, the common name associated with a gene).
`pattern`	string	[Feature Barcode only] Specifies how to extract the Feature Barcode sequence from the read.
`read`	string	[Feature Barcode only] Specifies which RNA sequencing read ("R1" or "R2") contains the Feature Barcode.
`sequence`	string	[Feature Barcode only] Feature-barcode sequence associated with this feature (e.g., a sgRNA protospacer sequence).
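
Since the datasets in the features group are parallel columns, a `feature_idx` value can be resolved by reading the same position from each column. The sketch below shows the idea with a plain dictionary standing in for the HDF5 group. The helper names and all feature values are made up for illustration, and the byte handling assumes the reader returns strings as bytes.

```python
def as_text(value):
    """Return a str whether the reader produced bytes or str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)

def feature_record(features, feature_idx):
    """Collect one row of the feature table as a dict."""
    idx = int(feature_idx)
    columns = ("id", "name", "feature_type", "genome")
    return {col: as_text(features[col][idx]) for col in columns}

# Stand-in for the features group; values are hypothetical.
features = {
    "id": [b"gene_0001", b"ab_0001"],
    "name": [b"GeneA", b"AntibodyA"],
    "feature_type": [b"Gene Expression", b"Antibody Capture"],
    "genome": [b"GRCh38", b""],  # empty for non-gene expression features
}

record = feature_record(features, 1)
```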

Probe reference

Present for Flex analysis with Cell Ranger v7.1 and later. The HDF5 group probes contains information regarding the probe set used for the analysis. The datasets within the probes group represent the columns in a table containing one row per probe. Values in the probe_idx column described in the previous section provide indices for the rows of this table of probes.

Column	Type	Description
`feature_name`	string	The name of the feature (gene) targeted by this probe.
`feature_id`	string	The Ensembl gene identifier of the gene targeted by this probe.
`probe_id`	string	A unique identifier assigned to each probe.
`region`	string	Present only when v1.0.1 probe set reference CSV is used. The region targeted by the probe may be either `spliced` (overlapping a splice junction on the gene) or `unspliced`.

2-bit encoding

The UMI sequences are 2-bit encoded as follows:

Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
The least significant byte (LSB) contains the 3'-most nucleotides.

Note that the cell-barcode sequences do not have this encoding. Instead, they are stored as plain strings in the barcodes dataset.
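
The encoding rules above can be sketched as a pair of small functions. This is an illustration, not Cell Ranger's own code: the names `decode_umi` and `encode_umi` are hypothetical, and the UMI length is assumed to be known from the assay chemistry, because the integer alone cannot distinguish leading "A" bases from absent ones.

```python
NUCLEOTIDES = "ACGT"  # 0="A", 1="C", 2="G", 3="T"

def decode_umi(value, length):
    """Turn a 2-bit encoded integer into a UMI string of the given length."""
    value = int(value)
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[value & 0b11])  # lowest two bits = 3'-most base
        value >>= 2
    return "".join(reversed(bases))  # restore 5' to 3' order

def encode_umi(sequence):
    """Inverse of decode_umi, included to allow a round-trip check."""
    value = 0
    for base in sequence:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value

decoded = decode_umi(27, 4)  # 27 is 0b00011011, i.e. A, C, G, T
```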
