The Xenium Onboard Analysis pipeline generates several output files using the Zarr format. Xenium Explorer reads these files to display cell segmentation, secondary analysis clustering results, and transcript assignment on the nuclei-stained tissue morphology images.
The Zarr format stores large N-dimensional arrays as compressed chunks, which keeps large datasets compact and lets readers load only the chunks they need. Zarr files can be read and modified with Python; see the zarr Python library documentation and tutorials for details.
In the sections below, we describe the group arrays and attributes associated with each Zarr file in the Xenium output bundle, along with example Python code for viewing these files.
The cells.zarr.zip output file contains the cell and nucleus segmentation masks used for transcript assignment and the polygon boundaries used for visualization. It is the only file in the output bundle that contains the cell and nucleus segmentation data.
It has the following hierarchy of array data:

    (root)
    ├── cell_id
    ├── cell_summary
    ├── masks
    │   ├── 0
    │   ├── 1
    │   └── homogeneous_transform
    └── polygon_sets
        ├── 0
        │   ├── cell_index
        │   ├── method
        │   ├── num_vertices
        │   └── vertices
        └── 1
            ├── cell_index
            ├── method
            ├── num_vertices
            └── vertices

Arrays
Description of root group arrays:
Path | Type | Description |
---|---|---|
/cell_id | uint32 | The first column consists of the cell_id prefix, and the second column is the dataset suffix (see Cell ID format mapping for string conversion). |
/cell_summary | float64 | An array containing information about each cell (see attributes below). |
Description of segmentation /masks arrays:
Path | Type | Description |
---|---|---|
[mask_index] | uint32 | Contains masks for the nucleus and cell segmentations in image space. The mask_index=0 is the nucleus segmentation mask and the mask_index=1 is the cell segmentation mask. The arrays at these indices contain the masks and have a 2D shape (rows, columns) of the morphology image that segmentation was performed on. Each value is the cell index for that pixel. Pixels with value=0 are background, and the cell indices start at 1. |
homogeneous_transform | float32 | The 4x4 transform matrix used to convert data from physical space (microns) to stitched-image space (pixels). This is needed to generate polygons from the masks. |
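
As a minimal sketch of how the homogeneous_transform could be applied, the function below multiplies a 4x4 matrix by a physical-space point (x, y, z, 1) to obtain pixel coordinates. It assumes the column-vector convention (translation in the last column). The function name `microns_to_pixels` and the matrix values are illustrative and not taken from a real dataset, so check the convention against your own file.

```python
def microns_to_pixels(transform, x_um, y_um, z_um=0.0):
    """Apply a 4x4 homogeneous transform to one physical-space point.

    `transform` can be a nested list or a 4x4 array read from
    masks/homogeneous_transform. Assumes column vectors: out = M @ [x, y, z, 1].
    """
    point = (x_um, y_um, z_um, 1.0)
    # Each output coordinate is one matrix row dotted with the point.
    out = [sum(float(row[k]) * point[k] for k in range(4)) for row in transform]
    w = out[3]  # equals 1 for a pure affine transform
    return out[0] / w, out[1] / w


# Hypothetical transform: 4 pixels per micron plus a (10, 20) pixel offset.
example_transform = [
    [4.0, 0.0, 0.0, 10.0],
    [0.0, 4.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

x_px, y_px = microns_to_pixels(example_transform, 2.5, 5.0)
print(x_px, y_px)  # 20.0 40.0
```

Because the masks have shape (rows, columns), a pixel lookup would index the mask as `mask[round(y_px), round(x_px)]`.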
Description of segmentation /polygon_sets arrays:
Path | Type | Description |
---|---|---|
[polygon_sets_index] | uint32 | Contains polygons for the nucleus and cell segmentations in physical space. The polygon_sets_index=0 contains the nucleus segmentation polygons and the polygon_sets_index=1 contains the cell segmentation polygons. |
cell_index | uint32 | An index for the cell that is associated with each nucleus or cell polygon. It corresponds to the /cell_id array in the root group. |
method | uint32 | An integer value describing the segmentation method used to derive this polygon. The integer corresponds to the order of methods listed in the root segmentation_methods attribute (starts at 0). |
num_vertices | int32 | Each element is the number of vertices for a given polygon, including a repeat of the initial vertex. A polygon with no vertices indicates the absence of a polygon for that cell. |
vertices | float32 | The XY coordinates in physical space (µm) for each vertex in the polygon. The coordinates for the first vertex are repeated at the end. |
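
Since each polygon repeats its first vertex at the end, it forms a closed ring, and its area can be computed with the shoelace formula. The sketch below is one way to cross-check a polygon against an area value such as cell_area. How the vertices are laid out inside the stored array is not described here, so the hypothetical helper `polygon_area` simply takes a list of (x, y) pairs in µm.

```python
def polygon_area(vertices):
    """Shoelace area (in µm²) of a closed ring [(x0, y0), ..., (x0, y0)].

    An empty vertex list (no polygon for that cell) gives an area of 0.
    """
    twice_area = 0.0
    # Walk consecutive vertex pairs; the repeated first vertex closes the ring.
    for (x0, y0), (x1, y1) in zip(vertices[:-1], vertices[1:]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


# Illustrative 10 µm x 10 µm square: num_vertices would be 5, counting the repeat.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (0.0, 0.0)]
print(polygon_area(square))  # 100.0
```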
Attributes
Description of root group attributes:
Field | Type | Description |
---|---|---|
major_version | int | Major version for the cells.zarr.zip file. This number is increased when breaking changes are made. |
minor_version | int | Minor version. |
number_cells | int | The number of cells in the dataset. |
polygon_set_names | list[str] | Each element is the unique, machine-readable name of a polygon set (e.g., a single polygon associated with nuclei is called "nucleus"). |
polygon_set_display_names | list[str] | Each element is the display name of a polygon set in Xenium Explorer (e.g., "Nucleus boundaries", "Cell boundaries"). |
polygon_set_descriptions | list[str] | Each element is the description of a polygon set in Xenium Explorer (e.g., "DAPI-based nuclei segmentation", "Cell Segmentation"). |
spatial_units | str | The units of the stitched image space ("microns"). |
segmentation_methods | list[str] | Describes how a polygon’s boundary was generated (e.g., "Segmented by boundary stain (ATP1A1+CD45+E-Cadherin)", "Segmented by interior stain (18S)", "Segmented by nucleus expansion of 5.0µm", "Segmented by nuclear stain (DAPI)"). |
Description of the /cell_summary array columns (type float64):
Field | Description |
---|---|
cell_centroid_x | X coordinate of cell centroid in µm. |
cell_centroid_y | Y coordinate of cell centroid in µm. |
cell_area | Area of cell in µm². |
nucleus_centroid_x | X coordinate of nucleus centroid in µm. |
nucleus_centroid_y | Y coordinate of nucleus centroid in µm. |
nucleus_area | Area of nucleus in µm². |
z_level | Z-level in which the cell was found in µm. |
nucleus_count | Number of nuclei associated with this cell. |
The analysis.zarr.zip output file contains the automated secondary analysis clustering results. It has the following hierarchy of array data:

    (root)
    └── cell_groups
        ├── 0
        │   ├── indices
        │   └── indptr
        ├── 1
        │   ├── indices
        │   └── indptr
        ├── [...]
        └── 9
            ├── indices
            └── indptr

There are 10 cell clustering results (clustering_index = 0 to 9) stored in this file: the first is for graph-based clustering and the remaining nine are for K-means clustering (K = 2 to 10).
Descriptions for /cell_groups/[clustering_index] group arrays:
Path | Type | Description |
---|---|---|
/indices | uint32 | An array of the cell indices for all cells assigned to one of the clusters in the secondary analysis. Cluster assignment determines the order of cell indices in each of these cell_groups/[clustering_index] arrays. |
/indptr | uint32 | An array of the positions in /cell_groups/[clustering_index]/indices where each cluster's block of cell indices begins. For example, "[0, 218440]" for cell_groups/1 means that cluster 1 starts at position 0 (the 1st element) of indices and cluster 2 starts at position 218440 (the 218,441st element). |
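
The indices/indptr pair can be unpacked into one cluster label per cell, as in the minimal sketch below. It assumes that the cell indices in indices are 0-based positions into the cell list and that the last cluster runs to the end of indices. The function name `cluster_labels` and the toy arrays are illustrative only.

```python
def cluster_labels(indices, indptr, number_cells):
    """Return a list where entry i is the 1-based cluster of cell i (0 = unassigned)."""
    labels = [0] * number_cells
    # Assumption: indptr holds start offsets, so the final cluster ends at len(indices).
    bounds = [int(p) for p in indptr] + [len(indices)]
    for cluster, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:]), start=1):
        for cell_index in indices[start:stop]:
            labels[int(cell_index)] = cluster
    return labels


# Toy data: cluster 1 holds cells 0, 3, 4; cluster 2 holds cells 1, 2; cell 5 is unassigned.
toy_indices = [0, 3, 4, 1, 2]
toy_indptr = [0, 3]
print(cluster_labels(toy_indices, toy_indptr, number_cells=6))  # [1, 2, 2, 1, 1, 0]
```

With a real file, the inputs would be the arrays under `root["cell_groups"][clustering_index]`.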
Descriptions for the cell_groups group attributes:
Field | Type | Description |
---|---|---|
major_version | int | Major version for the analysis.zarr.zip file. This number is increased when breaking changes are made. |
minor_version | int | Minor version. |
number_groupings | int | The number of clustering results in the dataset (graph-based and K-means clusters). |
grouping_names | list[str] | Contains a list of unique clustering method names for all the clustering results (e.g., "gene_expression_graphclust", "gene_expression_kmeans_2_clusters"). |
group_names | list[list[str]] | For each of the clustering result groups (e.g., "gene_expression_kmeans_2_clusters"), there is an inner list of all the clusters in the group (e.g., "[‘Cluster 1’, ‘Cluster 2’]"). |
The cell_feature_matrix.zarr.zip output file contains a matrix of counts per cell and per feature (including gene and non-gene codewords) for transcripts that passed the default quality value (Q-Score) threshold of Q20. It has the following hierarchy of array data:

    (root)
    └── cell_features
        ├── cell_id
        ├── data
        ├── indices
        └── indptr

Description for /cell_features group arrays:
Path | Type | Description |
---|---|---|
/cell_id | uint32 | The first column consists of the cell_id prefix, and the second column is the dataset suffix (see Cell ID format mapping for string conversion). |
/data | uint32 | An array of counts (Q-Score ≥ 20) for a particular cell and specified feature, stored in a compressed sparse row (CSR) format (array V) that only contains nonzero counts. |
/indices | uint32 | Contains column indices (column_index in CSR format) that specify the cell index for each nonzero count value in the /data values array. |
/indptr | uint32 | Contains indices (row_index in CSR format) where each group of nonzero counts starts for a given feature in /data. For example, "[0, 21282, 28505, …]" means the nonzero counts for feature 1 start at 0, the nonzero counts for feature 2 start at 21282, etc. for all features in the dataset. |
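
To illustrate the CSR layout described in this table (rows are features, columns are cells), the sketch below expands one feature's nonzero counts into a dense per-cell list. The helper name `feature_row_counts` and the toy arrays are hypothetical. The code appends len(data) as a final boundary, so it works whether or not indptr already ends with that value.

```python
def feature_row_counts(data, indices, indptr, feature_index, number_cells):
    """Return the dense list of per-cell counts for one feature (0-based index)."""
    # Boundaries of each feature's block in data/indices.
    bounds = [int(p) for p in indptr] + [len(data)]
    start, stop = bounds[feature_index], bounds[feature_index + 1]
    counts = [0] * number_cells
    for cell_index, count in zip(indices[start:stop], data[start:stop]):
        counts[int(cell_index)] = int(count)
    return counts


# Toy matrix with 3 features and 4 cells:
#   feature 0 -> cell 1 has 2 counts, cell 3 has 1 count
#   feature 1 -> no counts
#   feature 2 -> cell 0 has 5 counts
toy_data = [2, 1, 5]
toy_indices = [1, 3, 0]
toy_indptr = [0, 2, 2, 3]

print(feature_row_counts(toy_data, toy_indices, toy_indptr, 0, 4))  # [0, 2, 0, 1]
print(feature_row_counts(toy_data, toy_indices, toy_indptr, 2, 4))  # [5, 0, 0, 0]
```

If indptr has number_features + 1 entries, the same three arrays can also be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(number_features, number_cells))` to work with the whole matrix at once.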
Description for /cell_features group attributes:
Field | Type | Description |
---|---|---|
major_version | int | Major version for the cell_feature_matrix.zarr.zip file. This number is increased when breaking changes are made. |
minor_version | int | Minor version. |
number_cells | int | The number of cells in the dataset. |
number_features | int | The number of features (e.g., genes, controls, unassigned) in the dataset. |
feature_keys | list[str] | Each element is the name of the feature (e.g., gene name). |
feature_ids | list[str] | Each element is the ID of the feature (e.g., gene id). |
feature_types | list[str] | Each element is the type of the feature (e.g., gene, negative_control_codeword). |
The transcripts.zarr.zip output file contains data to evaluate transcript quality and localization. It has the following hierarchy of array data:

    (root)
    ├── codeword_category
    ├── gene_category
    ├── density
    │   └── gene
    │       ├── data
    │       ├── indices
    │       └── indptr
    └── grids
        ├── 0
        │   ├── 0,0
        │   │   ├── codeword_identity
        │   │   ├── gene_identity
        │   │   ├── id
        │   │   ├── location
        │   │   ├── quality_score
        │   │   ├── status
        │   │   ├── uuid
        │   │   └── valid
        │   ├── 0,1
        │   │   ├── codeword_identity
        │   │   ├── gene_identity
        │   │   ├── id
        │   │   ├── location
        │   │   ├── quality_score
        │   │   ├── status
        │   │   ├── uuid
        │   │   └── valid
        │   ├── [X,Y]
        [...]

The /density array and associated attributes contain transcript density bin information, which is shown in the analysis_summary.html Region Details panel and in the transcript density view in Xenium Explorer.
The /grids arrays contain a pyramid structure of downsampled transcript levels. Storing the transcript information this way divides it into smaller chunks and supports subsampling at zoomed-out views. The number of levels corresponds to the selected tissue region size; smaller regions require fewer levels to store subsampled transcript information.
For example, if there are seven levels in total, grids/0 is the most zoomed-in level and grids/6 is the most zoomed-out level. The most zoomed-out level contains a subsample of the transcript information and can fit in a single file (0,0). The most zoomed-in level describes where every transcript is located, so its data must be split across more files ((0,0), (0,1), etc.); the arrangement of the files is specified in the file names.
Arrays
Description for root group arrays:
Path | Type | Description |
---|---|---|
/codeword_category | bool | A num_codewords x 7 boolean table that contains information about the categories that codewords belong to. Column names and descriptions are contained in codeword_category/.zattrs. |
/gene_category | bool | A num_genes x 7 boolean table that contains information about the categories that genes belong to. Column names and descriptions are contained in gene_category/.zattrs. |
Description for /density/gene group arrays:
Path | Type | Description |
---|---|---|
/data | uint16 | An array of the Q-Score ≥ 20 counts for a particular transcript density grid cell and specified gene (chunked at 50,000 elements), stored in a compressed sparse row (CSR) format (array V) that only contains nonzero counts. Each bin is 10 µm. The rows of this matrix are a collapsed encoding of two quantities, (gene, grid_row), and the columns correspond to grid_col, where grid_row and grid_col specify the location in the density grid and gene specifies the index of the gene. |
/indices | uint16 | Contains the feature indices (column_index in CSR format) that correspond to the order of feature counts in the /data values array (chunked at 50,000 elements). |
/indptr | uint32 | Contains indices (row_index in CSR format) where each group of counts starts for a given density grid cell in /data. |
Description for /grids/[grid_index]/[grid_position] group arrays:
Path | Type | Description |
---|---|---|
/gene_identity | uint16 | The gene index(es) for each transcript. Gene indices are zero-based and reference the gene_names attribute attached to the gene parent group (see root attribute table below). Codewords corresponding to no-call (absence of a codeword) are denoted by the value 65535. Columns are: gene_call. |
/id | uint32 | The transcript ID (1st column) and FOV index (2nd column). This array is not guaranteed to be sorted in any particular order. The transcript ID is a unique value for each transcript within the FOV. |
/location | float32 | The location of each transcript in physical coordinate space. Columns are: x_position, y_position, and z_position of the transcript. |
/quality_score | float32 | The calibrated Q-Score for each transcript. |
/status | uint8 | The status of a transcript used in the pipeline to indicate that it passed filtering; always 0 if present in final output file. |
/uuid | uint32 | Unique identifier for transcripts; used by the pipeline. |
/valid | uint8 | The status of a transcript used in the pipeline to indicate that it passed filtering; always 1 if present in final output file. |
/codeword_identity | uint16 | The codeword index for each RNA. Codeword indices are zero-based and reference the codeword_names attribute attached to the dataset. Unknown codewords are given by max_value (uint16). Currently, the first column indicates the codeword index and the second column is unused. |
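
The sketch below shows one way the gene_identity and quality_score arrays could be combined: it looks up each transcript's gene name, skips no-call entries (65535), and keeps transcripts at or above a Q-Score threshold. The function name `named_transcripts`, the threshold default, and the toy gene names are illustrative. The function expects a 1D sequence of gene indices, so if gene_identity is stored as a single-column 2D array, pass its first column.

```python
NO_CALL = 65535  # gene_identity value for codewords with no gene call


def named_transcripts(gene_indices, quality_scores, gene_names, min_q=20.0):
    """Return (gene_name, q_score) pairs for called transcripts with Q-Score >= min_q."""
    kept = []
    for gene_index, q in zip(gene_indices, quality_scores):
        gene_index = int(gene_index)
        if gene_index == NO_CALL or float(q) < min_q:
            continue
        kept.append((gene_names[gene_index], float(q)))
    return kept


# Toy inputs standing in for arrays from one grid tile and the gene_names attribute.
toy_gene_names = ["GENE_A", "GENE_B", "GENE_C"]
toy_gene_indices = [0, 65535, 2, 1]
toy_quality = [40.0, 35.0, 12.5, 20.0]

print(named_transcripts(toy_gene_indices, toy_quality, toy_gene_names))
# [('GENE_A', 40.0), ('GENE_B', 20.0)]
```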
Attributes
Description for root group attributes:
Field | Type | Description |
---|---|---|
name | str | The name of the dataset ("RnaDataset"). |
major_version | int | Major version for the transcripts.zarr.zip file. This number is increased when breaking changes are made. |
minor_version | int | Minor version. |
dataset_uuid | str | Unique ID for this dataset. |
data_format | int | A field for internal pipeline use. Always set to 0. |
number_rnas | int | The total number of transcripts in the dataset. |
spatial_units | str | The units of the stitched image space ("micron"). |
fov_names | list[str] | Names of the FOVs used in the dataset as referenced by the FOV indices (/grids/[grid_index]/[grid_position]/id). |
number_genes | int | The number of genes in the dataset. |
gene_names | list[str] | Names of the genes. |
codeword_count | int | The number of codewords. |
codeword_gene_mapping | list[int] | The index of the gene in gene_names specified by each codeword. |
codeword_gene_names | list[str] | The name of the gene in gene_names specified by each codeword. |
coordinate_space | str | For internal pipeline use. Should have the value "refined-final_global_micron". |
Note: The key root group attributes used by Xenium Explorer are shown above. This is not a comprehensive attribute list from all Xenium Onboard Analysis versions.
Description for /density/gene array attributes:
Field | Type | Description |
---|---|---|
grid_size | list[float] | List of the XY grid spacings in µm (10 µm in current version). |
rows | int | The number of density grid (bin) rows. |
cols | int | The number of density grid (bin) columns. |
gene_names | list[str] | The names of genes. |
origin | dict[str,float] | Origin of the grid as {"x": min_x, "y": min_y}. |
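
As a rough sketch of how these attributes fit together, the function below maps a physical coordinate to a density bin. It assumes that grid_size is ordered [x spacing, y spacing], that columns run along X and rows along Y, and that bins are counted from origin. None of this is confirmed by the table above, and the name `density_bin` and the sample values are hypothetical.

```python
import math


def density_bin(x_um, y_um, origin, grid_size, rows, cols):
    """Return (grid_row, grid_col) for a point in µm, or None if it lies outside the grid."""
    col = math.floor((x_um - origin["x"]) / grid_size[0])
    row = math.floor((y_um - origin["y"]) / grid_size[1])
    if 0 <= row < rows and 0 <= col < cols:
        return row, col
    return None


# Illustrative attribute values with 10 µm bins.
toy_origin = {"x": 100.0, "y": 50.0}
toy_grid_size = [10.0, 10.0]

print(density_bin(125.0, 58.0, toy_origin, toy_grid_size, rows=20, cols=30))  # (0, 2)
print(density_bin(95.0, 58.0, toy_origin, toy_grid_size, rows=20, cols=30))   # None
```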
Description for /grids array attributes:
Field | Type | Description |
---|---|---|
grid_key_names | list[str] | The names of the grid keys used by the current grid (e.g., "grid_x_loc"). |
number_levels | int | The number of levels in the grid pyramid (must be ≥1). |
grid_size | list[float] | The size of a grid element for each grid pyramid level. |
grid_keys | list[list[str]] | The grid keys (e.g., "0,0,0") for each level of the grid pyramid. |
grid_number_objects | list[list[str]] | The number of transcripts in each grid element, in each level of the grid pyramid. |
The cells.zarr.zip and cell_feature_matrix.zarr.zip files have cell_id arrays in integer format (uint32). The first column is the cell_id_prefix; these integer values are determined by the polygon vertices of all cells in the dataset. The second column is the dataset_suffix, an integer that defaults to 1 and may be changed to designate cells originating from different datasets.
Other files (e.g., H5/MTX, CSV) have cell_id in string format (e.g., cmlbdfdf-1). To map between these formats, here is the conversion process from integer to string:
- Convert the cell_id_prefix to its hexadecimal (hex) representation. Pad it with leading zeroes so that it has eight digits (e.g., 3d51 becomes 00003d51).
- Shift the characters from the normal hex range [0 - 9, a - f] to the range [a - p]:

Hex code | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Shifted code | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p |

- Add a dash and append the dataset_suffix as an unpadded integer.
For example, given an integer cell_id_prefix = 1437536272 and dataset_suffix = 1:
- Hex conversion of prefix = "55af1010"
- Shifted code = "ffkpbaba"
- Append suffix for the final string cell_id = "ffkpbaba-1"
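
The conversion steps above can be sketched in Python as follows. The function names `cell_id_to_str` and `cell_id_from_str` are illustrative helpers, not part of any Xenium tool. The second function is simply the inverse of the documented integer-to-string steps.

```python
HEX_DIGITS = "0123456789abcdef"
SHIFTED_DIGITS = "abcdefghijklmnop"


def cell_id_to_str(cell_id_prefix, dataset_suffix):
    """Convert the integer (cell_id_prefix, dataset_suffix) pair to a string cell_id."""
    # Step 1: eight-digit, zero-padded lowercase hex.
    hex_prefix = format(int(cell_id_prefix), "08x")
    # Step 2: shift each hex digit from [0-9a-f] to [a-p].
    shifted = hex_prefix.translate(str.maketrans(HEX_DIGITS, SHIFTED_DIGITS))
    # Step 3: dash plus the unpadded dataset suffix.
    return f"{shifted}-{int(dataset_suffix)}"


def cell_id_from_str(cell_id):
    """Inverse mapping: string cell_id back to (cell_id_prefix, dataset_suffix)."""
    shifted, suffix = cell_id.rsplit("-", 1)
    hex_prefix = shifted.translate(str.maketrans(SHIFTED_DIGITS, HEX_DIGITS))
    return int(hex_prefix, 16), int(suffix)


print(cell_id_to_str(1437536272, 1))   # ffkpbaba-1
print(cell_id_from_str("ffkpbaba-1"))  # (1437536272, 1)
```

To convert a whole cell_id array, the first function can be applied to each (prefix, suffix) row.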
This code snippet shows how to read a Zarr array into a numpy N-dimensional array:

    # Import Python libraries
    # This script was tested with zarr v2.13.6
    import zarr
    import numpy as np


    # Function to open a Zarr file
    def open_zarr(path: str) -> zarr.Group:
        store = (
            zarr.ZipStore(path, mode="r")
            if path.endswith(".zip")
            else zarr.DirectoryStore(path)
        )
        return zarr.group(store=store)


    # For example, use the above function to open the cells Zarr file,
    # which contains segmentation mask Zarr arrays
    root = open_zarr("cells.zarr.zip")

    # Look at group array info and structure
    root.info
    root.tree()  # shows structure, array dimensions, data types

    # Create cell and nucleus segmentation mask np array objects to read or modify
    cellseg_mask = np.array(root["masks"][1])
    nucseg_mask = np.array(root["masks"][0])

    # Show dimensions of the 2D segmentation mask arrays (also shown in .tree())
    # The .ndim attribute shows the number of dimensions
    # The shape should match the number of pixels in the morphology image.
    cellseg_mask.shape
    nucseg_mask.shape

    # Show max value of cells in the masks (value=0 are background pixels)
    # Cell indices start at 1, so .max() returns the highest cell index, which
    # should equal the total cells detected in the dataset (reported in e.g.,
    # analysis_summary.html summary tab metric).
    cellseg_mask.max()
    nucseg_mask.max()

    # Examples for exploring file contents
    # How to show array
    root["masks"][0][0:9]  # or root["masks/0"]
    root["cell_summary"][0:9]

    # How to show attribute values
    root.attrs["major_version"]
    root.attrs["segmentation_methods"]

    # How to list out attribute names and values
    dict(root.attrs.items())
    dict(root['cell_summary'].attrs.items())

Using the same Python function as above to read in the file, here are a few example lines to view the analysis.zarr.zip and transcripts.zarr.zip arrays and attributes:

    # Read in secondary analysis Zarr arrays
    root = open_zarr("analysis.zarr.zip")

    # Examples for exploring file contents
    # How to show a slice of the clustering_index arrays
    root["cell_groups"][0]["indices"][0:9]
    # How to show attributes
    root["cell_groups"].attrs["group_names"]

    # Read in transcripts Zarr arrays
    root = open_zarr("transcripts.zarr.zip")

    # Examples for exploring file contents
    # How to show array info
    root['grids'][0]['0,0']['gene_identity'].shape
    root['grids'][0]['0,0']['quality_score'][0:9]
    root['grids'][0]['0,0']['location'][0:9,]
    # How to show array attributes
    root.attrs['major_version']
    root['density']['gene'].attrs['gene_names'][0:9]