Space Ranger outputs unfiltered (raw_feature_bc_matrix) and filtered (filtered_feature_bc_matrix) feature-barcode matrices in two file formats: the Market Exchange Format (MEX, described on this page) and the Hierarchical Data Format (HDF5).
Each element of the matrix is the number of UMIs associated with a feature (row) and a barcode (column).
Type | Description |
---|---|
Unfiltered feature-barcode matrix | Contains every barcode from the fixed list of known-good barcode sequences that has at least one read. This includes both background and tissue-associated barcodes. |
Filtered feature-barcode matrix | Contains only tissue-associated barcodes. For Visium probe-based assays, genes not in the filtered probe set are removed from the filtered matrix by default. |
Raw probe-barcode matrix | Contains columns that indicate the probes in the filtered probe reference, the probes that passed gDNA filtering, and the probe barcodes that are in spots. It is similar to the feature-barcode matrix, but is organized at the probe level rather than the gene level. |
Each matrix is stored in the Market Exchange Format (MEX) for sparse matrices. Each matrix directory also contains gzipped TSV files with the features and barcode sequences corresponding to the row and column indices, respectively. For example, the matrix output may look like:
$ cd /home/jdoe/runs/sample345/outs
$ tree filtered_feature_bc_matrix
filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
0 directories, 3 files
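To illustrate what matrix.mtx.gz holds, here is a minimal sketch that reads a Matrix Market coordinate file using only the Python standard library. The file contents, the counts, and the helper name read_mex_triplets are made up for illustration; real files are much larger and are normally loaded with the R or Python libraries described later on this page.

```python
import gzip
import os
import tempfile

# Made-up MEX content: 3 features (rows) x 2 barcodes (columns), 3 nonzero UMI counts.
EXAMPLE_MTX = """%%MatrixMarket matrix coordinate integer general
% made-up comment line
3 2 3
1 1 5
3 1 2
2 2 7
"""


def read_mex_triplets(path):
    """Return ((n_rows, n_cols, n_nonzero), [(row, col, count), ...]).

    Row and column indices are 1-based, as in the Matrix Market format.
    """
    with gzip.open(path, mode="rt") as handle:
        # The banner and comment lines all start with '%'; skip them.
        lines = [line for line in handle if line.strip() and not line.startswith("%")]
    # The first remaining line gives the dimensions and the number of stored entries.
    n_rows, n_cols, n_nonzero = (int(x) for x in lines[0].split())
    entries = [tuple(int(x) for x in line.split()) for line in lines[1:]]
    return (n_rows, n_cols, n_nonzero), entries


# Write the example to a temporary gzipped file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    mtx_path = os.path.join(tmp_dir, "matrix.mtx.gz")
    with gzip.open(mtx_path, mode="wt") as handle:
        handle.write(EXAMPLE_MTX)
    shape, entries = read_mex_triplets(mtx_path)

print(shape)
print(entries)
```

Each data line is a (feature row, barcode column, UMI count) triplet, so only nonzero entries take up space in the file.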
Features correspond to row indices. For each matrix, the components of features.tsv.gz are:
- Column 1: feature ID
- Column 2: feature name
- Column 3: type of feature, i.e., Gene Expression or Antibody Capture
Below is a minimal example features.tsv.gz file showing data collected for three genes.
$ gzip -cd filtered_feature_bc_matrix/features.tsv.gz
ENSG00000187634 SAMD11 Gene Expression
ENSG00000188976 NOC2L Gene Expression
ENSG00000187961 KLHL17 Gene Expression
For Gene Expression (GEX) data, the ID corresponds to gene_id in the annotation field of the reference GTF. Correspondingly, the name corresponds to gene_name in the annotation field of the reference GTF. If no gene_name field is present in the reference GTF, the gene name is equivalent to the gene ID. Similarly, for Protein Expression (PEX) data, the feature ID and name are taken from the first two columns of the Feature Reference CSV file.
For multi-species experiments, gene IDs and names are prefixed with the genome name to avoid name collisions between genes of different species, e.g., GAPDH becomes hg19_GAPDH and Gm15816 becomes mm10_Gm15816.
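As a rough sketch of how the genome prefix could be separated from a feature name downstream, the snippet below parses tab-separated feature rows and strips a known prefix. The feature IDs are placeholders, and the GENOMES tuple and split_genome_prefix helper are hypothetical names used only for illustration.

```python
import csv
import io

# Made-up features.tsv rows for a hypothetical human/mouse mixed reference.
# The IDs in column 1 are placeholders, not real gene IDs.
EXAMPLE_FEATURES = (
    "hg19_PLACEHOLDER_ID_1\thg19_GAPDH\tGene Expression\n"
    "mm10_PLACEHOLDER_ID_2\tmm10_Gm15816\tGene Expression\n"
)

# Assumed genome names, taken from the example in the text.
GENOMES = ("hg19", "mm10")


def split_genome_prefix(value, genomes=GENOMES):
    """Return (genome, bare_value); genome is None when no known prefix matches."""
    for genome in genomes:
        prefix = genome + "_"
        if value.startswith(prefix):
            return genome, value[len(prefix):]
    return None, value


rows = list(csv.reader(io.StringIO(EXAMPLE_FEATURES), delimiter="\t"))
parsed = [split_genome_prefix(name) for _, name, _ in rows]
print(parsed)
```

Matching against a list of known genome names, rather than splitting on the first underscore, avoids mis-splitting when a genome name or gene name itself contains an underscore.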
Barcode sequences correspond to column indices.
$ gzip -cd filtered_feature_bc_matrix/barcodes.tsv.gz | head -10
AACACTTGGCAAGGAA-1
AACAGGATTCATAGTT-1
AACAGGTTATTGCACC-1
AACAGGTTCACCGAAG-1
AACAGTCAGGCTCCGC-1
AACAGTCCACGCGGTG-1
AACATAGTCTATCTAC-1
AACATCTTAAGGCTCA-1
AACCAATCTGGTTGGC-1
AACCACTGCCATAGCC-1
Each barcode sequence includes a suffix with a dash separator followed by a number:
AACACTTGGCAAGGAA-1
More details on the barcode sequence format are available in the barcoded BAM section.
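If the sequence and the numeric suffix need to be handled separately, a small helper along these lines may be enough. This is only a sketch; split_barcode is a hypothetical name, and it assumes the suffix is always a dash followed by digits, as in the example above.

```python
def split_barcode(barcode):
    """Split 'SEQUENCE-N' into (sequence, suffix_number)."""
    # rpartition splits on the last dash, so the suffix is whatever follows it.
    sequence, _, suffix = barcode.rpartition("-")
    if not sequence or not suffix.isdigit():
        raise ValueError(f"unexpected barcode format: {barcode!r}")
    return sequence, int(suffix)


sequence, suffix = split_barcode("AACACTTGGCAAGGAA-1")
print(sequence, suffix)
```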
R and Python support the MEX format, and sparse matrices can be used for more efficient manipulation, as described below.
The R package Matrix supports loading MEX format data and can be used to load the sparse feature-barcode matrix, as shown in the example code below.
# load package
library(Matrix)
# set the file paths of the filtered matrix
matrix_dir <- "/opt/sample345/outs/filtered_feature_bc_matrix/"
barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
features.path <- paste0(matrix_dir, "features.tsv.gz")
matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")
# load matrix.mtx.gz
mat_filtered <- readMM(file = matrix.path)
# load features.tsv.gz
feature.names <- read.delim(features.path,
                            header = FALSE,
                            stringsAsFactors = FALSE)
# load barcodes.tsv.gz
barcode.names <- read.delim(barcode.path,
                            header = FALSE,
                            stringsAsFactors = FALSE)
# set the matrix column and row names
colnames(mat_filtered) <- barcode.names$V1
rownames(mat_filtered) <- feature.names$V1
The csv, os, gzip, and scipy.io modules can be used to load a feature-barcode matrix into Python as shown below (edit the path to the matrix directory to match your run).
import csv
import gzip
import os
import scipy.io
# define MEX directory
matrix_dir = "/opt/sample345/outs/filtered_feature_bc_matrix"
# read in the MEX format matrix (returned as a SciPy sparse matrix)
mat_filtered = scipy.io.mmread(os.path.join(matrix_dir, "matrix.mtx.gz"))
# list of feature IDs, e.g. 'ENSG00000187634'
features_path = os.path.join(matrix_dir, "features.tsv.gz")
feature_ids = [row[0] for row in csv.reader(gzip.open(features_path, mode="rt"), delimiter="\t")]
# list of gene names, e.g. 'SAMD11'
gene_names = [row[1] for row in csv.reader(gzip.open(features_path, mode="rt"), delimiter="\t")]
# list of feature types, e.g. 'Gene Expression'
feature_types = [row[2] for row in csv.reader(gzip.open(features_path, mode="rt"), delimiter="\t")]
# list of barcodes, e.g. 'AACACTTGGCAAGGAA-1'
barcodes_path = os.path.join(matrix_dir, "barcodes.tsv.gz")
barcodes = [row[0] for row in csv.reader(gzip.open(barcodes_path, mode="rt"), delimiter="\t")]
Space Ranger represents the feature-barcode matrix using sparse formats (only the nonzero entries are stored) in order to minimize file size. All of our programs, and many other programs for gene expression analysis, support sparse formats.
However, certain programs (e.g. Excel) only support dense formats (where every row-column entry is explicitly stored, even if it's a zero). Here are a few methods for converting feature-barcode matrices to CSV format:
Method 1: Python
Follow the steps in the Loading matrices into Python section to get the MEX data into a matrix format. To view the matrix as a data table and save as a CSV file, convert the matrix into a pandas dataframe with the following code:
import pandas as pd
# transform table to pandas dataframe and label rows and columns
matrix = pd.DataFrame.sparse.from_spmatrix(mat_filtered)
matrix.columns = barcodes
matrix.insert(loc=0, column="feature_id", value=feature_ids)
matrix.insert(loc=1, column="gene", value=gene_names)
matrix.insert(loc=2, column="feature_type", value=feature_types)
# display matrix
print(matrix)
# save the table as a CSV (note the CSV will be a very large file)
matrix.to_csv("mex_matrix_filtered.csv", index=False)
The output should look similar to the following:
feature_id gene feature_type AAACAACGAATAGTTC-1 ...
0 ENSG00000243485 MIR1302-2HG Gene Expression 0
1 ENSG00000237613 FAM138A Gene Expression 0
2 ENSG00000186092 OR4F5 Gene Expression 0
3 ENSG00000238009 AL627309.1 Gene Expression 0
4 ENSG00000239945 AL627309.3 Gene Expression 0
...
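To make the difference between sparse and dense storage concrete, and to show why a dense CSV can become very large, the following sketch expands a handful of made-up (row, column, count) triplets into a dense table and compares how many values each form stores. The densify helper and the numbers are illustrative assumptions, not Space Ranger output.

```python
def densify(n_rows, n_cols, entries):
    """Expand 1-based (row, col, count) triplets into a dense list of rows."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, count in entries:
        dense[row - 1][col - 1] = count
    return dense


# Made-up sparse entries for a 3-feature x 2-barcode matrix.
entries = [(1, 1, 5), (3, 1, 2), (2, 2, 7)]
dense = densify(3, 2, entries)

stored_sparse = len(entries)                   # one value per nonzero entry
stored_dense = sum(len(row) for row in dense)  # one value per feature-barcode pair

print(dense)
print(stored_sparse, stored_dense)
```

The dense form always stores one value per feature-barcode pair, so its size grows with the product of the two dimensions regardless of how many entries are zero.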
Method 2: mat2csv
You can convert a feature-barcode matrix to dense CSV format using the spaceranger mat2csv command.
This command takes two arguments: an input matrix generated by Space Ranger (either an HDF5 file or a MEX directory), and an output path for the dense CSV. For example, to convert a matrix from a pipestance named sample123 in the current directory, either of the following commands would work:
# convert from MEX
$ spaceranger mat2csv sample123/outs/filtered_feature_bc_matrix sample123.csv
# or, convert from HDF5
$ spaceranger mat2csv sample123/outs/filtered_feature_bc_matrix.h5 sample123.csv
You can then load sample123.csv into Excel.
While it is possible to use mat2csv for small datasets, we strongly recommend using R or Python (as shown in the sections above) to examine these matrix files.
Method 3: Shell commands
Please see this Q&A article for shell commands to convert MEX files to CSV. This method creates a single file that is sparse (zeroes are ignored).
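The shell commands from that article are not reproduced here. As a rough Python analogue of a sparse, single-file export, the sketch below writes one CSV row per nonzero entry. The column layout (feature_id, barcode, count), the counts, and the write_sparse_csv helper are assumptions made for illustration and may differ from what the article produces.

```python
import csv
import io


def write_sparse_csv(handle, feature_ids, barcodes, entries):
    """Write one CSV row per nonzero entry; zero counts are never written."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["feature_id", "barcode", "count"])
    for row, col, count in entries:
        # MEX indices are 1-based, Python lists are 0-based.
        writer.writerow([feature_ids[row - 1], barcodes[col - 1], count])


# Feature IDs and barcodes from the examples on this page; the counts are made up.
feature_ids = ["ENSG00000187634", "ENSG00000188976", "ENSG00000187961"]
barcodes = ["AACACTTGGCAAGGAA-1", "AACAGGATTCATAGTT-1"]
entries = [(1, 1, 5), (3, 1, 2), (2, 2, 7)]

buffer = io.StringIO()
write_sparse_csv(buffer, feature_ids, barcodes, entries)
sparse_csv = buffer.getvalue()
print(sparse_csv)
```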