The spaceranger pipeline outputs an HDF5 file (molecule_info.h5) containing per-molecule information for all molecules that contain a valid barcode and a valid UMI and were assigned with high confidence to a gene or protein. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries, feature sets, and barcode lists used for the analysis. Refer to the HDF5 Matrix format for more general information.
molecule_info.h5
│   ├── file_version
│   ├── filetype
├── barcode_idx
├── barcode_info [HDF5 group]
│   ├── genomes
│   └── pass_filter
├── barcodes
├── count
├── feature_idx
├── features [HDF5 group]
│   ├── _all_tag_keys
│   ├── feature_type
│   ├── genome
│   ├── id
│   ├── name
│   └── target_sets [HDF5 group]
│       ├── target panel CSV [For Targeted GEX]
│       └── probe set reference CSV [For Visium FFPE]
├── gem_group
├── library_idx
├── library_info
├── metrics_json [Contains Slide Serial Number and Capture Area information if supplied]
├── probe_idx [For Visium FFPE]
├── probes [HDF5 group] [For Visium FFPE]
│   ├── feature_id
│   ├── feature_name
│   ├── probe_id
│   └── region [Present when v2 probe set reference CSV is used]
├── umi
└── umi_type
The contents of the .h5 file can be examined using HDFView software or the h5dump command.
h5dump -n molecule_info.h5
HDF5 "molecule_info.h5" {
FILE_CONTENTS {
group /
dataset /barcode_idx
group /barcode_info
dataset /barcode_info/genomes
dataset /barcode_info/pass_filter
dataset /barcodes
dataset /count
dataset /feature_idx
group /features
dataset /features/_all_tag_keys
dataset /features/feature_type
dataset /features/genome
dataset /features/id
dataset /features/name
group /features/target_sets
dataset /features/target_sets/[target set name]
dataset /gem_group
dataset /library_idx
dataset /library_info
dataset /metrics_json
dataset /probe_idx
group /probes
dataset /probes/feature_id
dataset /probes/feature_name
dataset /probes/probe_id
dataset /umi
dataset /umi_type
}
}
The following HDF5 datasets in the molecule info file correspond to columns of a table. Each row of that table corresponds to a unique (UMI, spot-barcode, feature) tuple indicating the feature best supported by the reads (i.e., including PCR duplicates) assigned to that UMI and spot-barcode.
Column | Type | Description |
---|---|---|
barcode_idx | uint64 | A zero-based index into the barcodes dataset (see next section), indicating the spot-barcode assigned to this putative molecule. |
count | uint32 | Number of reads associated with this putative molecule that were confidently mapped to the assigned feature. |
feature_idx | uint32 | A zero-based index into the feature list (see next section), indicating the feature to which this putative molecule was assigned. |
gem_group | uint16 | Integer label that is currently one (1) for all Space Ranger output. |
library_idx | uint16 | Integer label that is currently one (1) for all Space Ranger output. |
umi | uint32 | 2-bit encoded (see note below) processed (i.e. corrected) UMI sequence. |
umi_type | uint32 | A boolean array specifying whether the molecule aligned to an exonic (1) or intronic (0) region of the associated feature. |
probe_idx | uint32 | Present only when a probe set reference CSV is used. A zero-based index into the probes dataset, indicating the probe with which this transcript was captured. |
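Because each row of this table is a unique (UMI, spot-barcode, feature) tuple, counting the rows that share a (barcode_idx, feature_idx) pair gives a UMI count for that pair, and summing the count column gives the supporting reads. The following is a minimal sketch of that aggregation, assuming the columns have already been loaded into equal-length sequences (for example, by reading each dataset with an HDF5 library such as h5py). The function name summarize_molecules and the toy values are hypothetical and not part of Space Ranger.

```python
from collections import defaultdict


def summarize_molecules(barcode_idx, feature_idx, count):
    """Aggregate molecule rows into per-(barcode, feature) UMI and read totals.

    Each input row is one (UMI, spot-barcode, feature) tuple, so the number
    of rows sharing a (barcode_idx, feature_idx) pair is its UMI count.
    """
    if not (len(barcode_idx) == len(feature_idx) == len(count)):
        raise ValueError("columns must have the same length")
    umis = defaultdict(int)
    reads = defaultdict(int)
    for b, f, c in zip(barcode_idx, feature_idx, count):
        key = (int(b), int(f))
        umis[key] += 1        # one row == one putative molecule
        reads[key] += int(c)  # reads supporting that molecule
    return dict(umis), dict(reads)


# Toy columns (hypothetical values, not taken from a real run).
barcode_idx = [0, 0, 0, 2]
feature_idx = [5, 5, 7, 5]
count = [3, 1, 2, 4]

umis, reads = summarize_molecules(barcode_idx, feature_idx, count)
print(umis[(0, 5)], reads[(0, 5)])  # 2 4
```

The integer keys can then be resolved to strings by indexing into the barcodes dataset and the features/id dataset.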
The molecule_info.h5 file has datasets corresponding to information about the libraries, barcode lists, and feature sets that were used.
At the top level of the HDF5 file hierarchy, the barcodes, library_info, and metrics_json datasets provide information about the experiments contained in this analysis:
Dataset | Type | Description |
---|---|---|
barcodes | string | A list of all spot-barcodes associated with this experiment (including those that were not observed). The barcode_idx column described in the previous section contains indices into this list of barcodes. Each spot-barcode sequence has a trailing digit that is currently one (1) in output generated from Space Ranger (e.g., AGAATGGTCTGCAT-1). |
library_info | string | A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will at a minimum contain the metadata library_id, library_type, and gem_group. |
metrics_json | string | Pipeline metrics in JSON format that are used internally by Space Ranger. From Space Ranger v2.0 onwards, this file also contains the slide serial number and capture area information if it was supplied to the spaceranger count pipeline. |
gene_ids | string | The Ensembl gene IDs contained in this reference. The gene column defined in the previous section is an index into this array. |
gene_names | string | The common gene symbol associated with each of the above gene_ids . |
genome_ids | string | The list of genomes represented in this reference. In most cases, this will be a single genome. The genome column defined in the previous section is an index into this array. |
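Since library_info is a JSON-formatted array of objects, it can be decoded with a standard JSON parser once the string has been read from the file. The sketch below assumes the raw value arrives as bytes, str, or a one-element sequence holding either, because HDF5 readers differ in how they return string datasets. The helper name parse_library_info and the example payload are hypothetical; only the three keys named in the table above are checked.

```python
import json


def parse_library_info(raw):
    """Decode the library_info dataset into a list of dicts."""
    # Unwrap a one-element container if the reader returned one.
    if not isinstance(raw, (bytes, str)):
        raw = raw[0]
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    libraries = json.loads(raw)
    # Keys the format description says are present at a minimum.
    required = {"library_id", "library_type", "gem_group"}
    for lib in libraries:
        missing = required - set(lib)
        if missing:
            raise KeyError(f"library entry lacks {sorted(missing)}")
    return libraries


# Hypothetical payload; real files may carry additional keys.
example = b'[{"library_id": 0, "library_type": "Gene Expression", "gem_group": 1}]'
libraries = parse_library_info(example)
print(libraries[0]["library_type"])  # Gene Expression
```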
The HDF5 group barcode_info gives information regarding the barcodes determined to be underneath the tissue.
Dataset | Type | Description |
---|---|---|
genomes | string | A list of all genome references used for gene expression libraries in this analysis. |
pass_filter | uint64 | A matrix with three columns that contains one row per passing spot-barcode. Each row is a tuple (barcode_idx, library_idx, genome_idx), where genome_idx is an index into the genomes dataset. |
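The first column of pass_filter indexes into the top-level barcodes dataset, so the spot-barcodes under the tissue can be recovered with a simple lookup. This is a minimal sketch under the assumption that pass_filter has been loaded as a sequence of three-element rows and barcodes as a sequence of bytes or str values. The function name barcodes_under_tissue and the toy data are hypothetical.

```python
def barcodes_under_tissue(pass_filter, barcodes):
    """Return the de-duplicated spot-barcodes listed in pass_filter.

    `pass_filter` is a sequence of (barcode_idx, library_idx, genome_idx)
    rows; only the first column is needed to look up the barcode string.
    Results are ordered by barcode index.
    """
    indices = sorted({int(row[0]) for row in pass_filter})
    result = []
    for i in indices:
        value = barcodes[i]
        if isinstance(value, bytes):
            value = value.decode("ascii")
        result.append(value)
    return result


# Hypothetical toy data, shortened for readability.
barcodes = [b"AAAC-1", b"AAAG-1", b"AAAT-1", b"AACA-1"]
pass_filter = [(2, 0, 0), (0, 0, 0)]
in_tissue = barcodes_under_tissue(pass_filter, barcodes)
print(in_tissue)  # ['AAAC-1', 'AAAT-1']
```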
The HDF5 group features contains information regarding the feature reference(s) used for the analysis. The datasets within the features group represent columns in a table containing one row per feature (gene). Values in the feature_idx column described in the previous section provide indices into the rows of this table of features.
In addition to the columns described below, user-specified tags may also be present. The dataset _all_tag_keys contains a list of user-specified tags as well as built-in tags (e.g. genome, pattern, read, and/or sequence).
Column | Type | Description |
---|---|---|
feature_type | string | The type of feature reference to which this feature belongs, i.e., Gene Expression or Antibody Capture. |
genome | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). |
id | string | The unique id corresponding to this feature (for example, an Ensembl gene ID). |
name | string | A human-readable name associated with this feature (for example, the common name associated with a gene). |
pattern | string | PEX only. Specifies how to extract the Antibody Barcode sequence from the read. |
read | string | PEX only. Specifies which RNA sequencing read ("R1" or "R2") contains the Antibody Barcode. |
sequence | string | PEX only. Nucleotide barcode sequence associated with this feature (e.g., antibody barcode). |
isotype_control | string | PEX only. True/False indicating whether antibody is an isotype control. |
secondary_name | string | PEX only. Secondary human-readable name for this feature. |
The features group also contains an HDF5 group target_sets, which contains the probe set reference CSV for Visium FFPE samples and the target panel CSV for Targeted Gene Expression. When a target gene panel is present, indices of the target genes are stored inside target_sets, in an HDF5 dataset named after the target gene panel (e.g., "Human Gene Signature").
The HDF5 group probes is present only when a probe set reference CSV is used. It contains information regarding the probe set used for the analysis. The datasets within the probes group represent the columns in a table containing one row per probe. Values in the probe_idx column described in the previous section provide indices into the rows of this table of probes.
Column | Type | Description |
---|---|---|
feature_name | string | The name of the feature (gene) targeted by this probe. |
feature_id | string | The Ensembl gene identifier of the gene targeted by this probe. |
probe_id | string | A unique identifier assigned to each probe. |
region | string | Present only when v2 probe set reference CSV is used. The region targeted by the probe may be either spliced (overlapping a splice junction on the gene) or unspliced. |
The UMI sequences are 2-bit encoded as follows:
- Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
- The least significant byte (LSB) contains the 3'-most nucleotides.
Note that the spot-barcode sequences do not have this encoding. Instead, they are stored as plain strings in the barcodes HDF5 dataset.
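The 2-bit UMI encoding described above can be sketched as follows. The sketch assumes that the lowest two bits of the integer hold the 3'-most nucleotide, which is how the byte-order rule is read here. It also assumes that the UMI length is known to the caller, since the length is not stored with each value. The function names decode_umi and encode_umi are hypothetical helpers, not part of Space Ranger.

```python
NUCLEOTIDES = "ACGT"  # 0="A", 1="C", 2="G", 3="T"


def decode_umi(encoded, length):
    """Decode a 2-bit packed UMI into its nucleotide string (5' to 3').

    Assumes the lowest two bits hold the 3'-most base. `length` is the
    UMI length in nucleotides and must be supplied by the caller.
    """
    encoded = int(encoded)
    if encoded < 0 or encoded >> (2 * length):
        raise ValueError("value does not fit in the given UMI length")
    bases = []
    for position in range(length):
        # Walk from the highest used bit pair (5' end) down to bit 0 (3' end).
        shift = 2 * (length - 1 - position)
        bases.append(NUCLEOTIDES[(encoded >> shift) & 0b11])
    return "".join(bases)


def encode_umi(sequence):
    """Inverse of decode_umi, useful for round-trip checks."""
    value = 0
    for base in sequence:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value


print(decode_umi(27, 4))   # ACGT  (27 == 0b00011011)
print(encode_umi("ACGT"))  # 27
```

A uint32 value has room for at most 16 nucleotides under this scheme, so length should not exceed 16 for values read from the umi column.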