Aggregating Multiple GEM Wells with cellranger-atac aggr

Many experiments involve generating data for multiple samples that are processed through different Gel Bead-in Emulsion (GEM) wells on Chromium instruments. Depending on the experimental design, these could be replicates from the same set of cells, cells from different tissues or time points from the same individual, or cells from different individuals. The cellranger-atac aggr pipeline can be used to aggregate these into a single peak-barcode matrix.

When conducting large studies involving multiple GEM wells, run cellranger-atac count on FASTQ data from each of the GEM wells individually, then pool the results using cellranger-atac aggr, as described here.

The cellranger-atac aggr command takes a CSV file listing cellranger-atac count output files (specifically, the fragments.tsv.gz and singlecell.csv from each run) and produces a single peak-barcode matrix containing all the data.

When combining multiple GEM wells, the barcode sequences for each channel are distinguished by a GEM well suffix appended to the barcode sequence (see Understanding GEM wells).

By default, the reads from each GEM well are subsampled such that all GEM wells have the same effective sequencing depth, measured in terms of the median number of unique fragments per cell. However, it is possible to turn off this normalization altogether (see Depth Normalization).

Each GEM well is a physically distinct set of GEM partitions corresponding to a single Chromium chip channel, but draws barcode sequences randomly from the pool of valid barcodes, known as the barcode whitelist. To keep the barcodes unique when aggregating multiple libraries, Cell Ranger ATAC appends a small integer identifying the GEM well to the barcode nucleotide sequence, and uses that nucleotide sequence plus ID as the unique identifier in the feature-barcode matrix. For example, AGACCATTGAGACTTA-1 and AGACCATTGAGACTTA-2 are distinct cell barcodes from different GEM wells, despite having the same barcode nucleotide sequence.

This number, which indicates which GEM well the barcode sequence came from, is called the GEM well suffix. The numbering of the GEM wells reflects the order in which the GEM wells are listed in the Aggregation CSV.
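
The suffix scheme can be illustrated with a short Python sketch. It is not Cell Ranger ATAC's own code: the function name `suffix_barcodes` is hypothetical, and the sketch assumes that suffixes start at 1 and that any suffix already present on an input barcode is replaced.

```python
def suffix_barcodes(wells):
    """Give each barcode a GEM well suffix based on the well's position in the list.

    `wells` is a list of (library_id, barcodes) pairs in Aggregation CSV order.
    Illustrative only: this mimics the naming scheme, not Cell Ranger ATAC's code.
    """
    combined = []
    # Assumption: GEM wells are numbered from 1 in the order they are listed.
    for gem_well, (library_id, barcodes) in enumerate(wells, start=1):
        for barcode in barcodes:
            sequence = barcode.split("-")[0]  # drop any existing suffix such as "-1"
            combined.append((f"{sequence}-{gem_well}", library_id))
    return combined


wells = [
    ("LV123", ["AGACCATTGAGACTTA-1"]),
    ("LV456", ["AGACCATTGAGACTTA-1"]),
]
for barcode, library_id in suffix_barcodes(wells):
    print(barcode, library_id)
```

The same nucleotide sequence from the two libraries ends up under two different identifiers, so the barcodes no longer collide.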

cellranger-atac aggr is not designed for combining multiple sequencing runs of the same GEM well. Instead, pass a list of FASTQ files from resequenced libraries to the --fastqs argument of cellranger-atac count.

Prior to aggregating data, first run a single instance of cellranger-atac count on each individual GEM well prepared using the Chromium platform, as described in single GEM well analysis.

For example, suppose you ran three count pipelines as follows:


$ cd /opt/runs
$ cellranger-atac count --id=LV123 ...
... wait for pipeline to finish ...
$ cellranger-atac count --id=LV456 ...
... wait for pipeline to finish ...
$ cellranger-atac count --id=LV789 ...
... wait for pipeline to finish ...

You can aggregate these three runs to get an aggregated matrix and analysis. In order to do so, you need to create an Aggregation CSV as detailed in the next section.

Create a CSV file with a header line containing the following columns:

library_id: Unique identifier for this input GEM well. This will be used for labeling purposes only; it does not need to match any previous ID assigned to the GEM well.
fragments: Path to the fragments.tsv.gz file produced by cellranger-atac count. For example, if you processed your GEM well by calling cellranger-atac count --id=ID in some directory /DIR, the fragments would be /DIR/ID/outs/fragments.tsv.gz.
cells: Path to the singlecell.csv file produced by cellranger-atac count.
(Optional) Additional custom columns containing library metadata (e.g., lab or sample origin). These custom library annotations do not affect the analysis pipeline, unless the column name is batch, in which case the batch effect correction algorithm will be applied (see Aggregating libraries with different chemistry versions). However, these columns can be visualized downstream in the Loupe Browser. Unlike other CSV inputs to Cell Ranger ATAC, these custom columns may contain characters outside the ASCII range (e.g., non-Latin characters).
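
Before launching a long run, it can be worth checking that the file has the columns described above. The following Python sketch is one way to do that. The function name `check_aggregation_csv` and the specific checks (required columns present, unique library_id values, non-empty paths) are assumptions made for illustration, not checks that Cell Ranger ATAC is documented to perform.

```python
import csv
import io

REQUIRED_COLUMNS = ("library_id", "fragments", "cells")


def check_aggregation_csv(text):
    """Return a list of problems found in Aggregation CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in header]
    if problems:
        return problems
    seen = set()
    for row in reader:
        library_id = row["library_id"]
        if library_id in seen:
            problems.append(f"duplicate library_id: {library_id}")
        seen.add(library_id)
        # A short row yields None for the missing fields, which is also caught here.
        for column in ("fragments", "cells"):
            if not row[column]:
                problems.append(f"{library_id}: empty {column} path")
    return problems


example = "library_id,fragments,cells\nLV123,/opt/runs/LV123/outs/fragments.tsv.gz,\n"
print(check_aggregation_csv(example))
```

Extra columns such as batch pass through untouched, since only the three required columns are inspected.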

You can either make the CSV file in a text editor, or create it in Excel and export to CSV. Continuing the example from the previous section, your Excel spreadsheet should look like this:

	A	B	C
1	library_id	fragments	cells
2	LV123	/opt/runs/LV123/outs/fragments.tsv.gz	/opt/runs/LV123/outs/singlecell.csv
3	LV456	/opt/runs/LV456/outs/fragments.tsv.gz	/opt/runs/LV456/outs/singlecell.csv
4	LV789	/opt/runs/LV789/outs/fragments.tsv.gz	/opt/runs/LV789/outs/singlecell.csv

When you save it as a CSV, the result looks like this:


library_id,fragments,cells
LV123,/opt/runs/LV123/outs/fragments.tsv.gz,/opt/runs/LV123/outs/singlecell.csv
LV456,/opt/runs/LV456/outs/fragments.tsv.gz,/opt/runs/LV456/outs/singlecell.csv
LV789,/opt/runs/LV789/outs/fragments.tsv.gz,/opt/runs/LV789/outs/singlecell.csv
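
For larger studies, the file can also be generated by a script rather than by hand. The Python sketch below builds the CSV text for the three count runs from the earlier example. The helper name `build_aggregation_csv` is hypothetical, and the sketch assumes every run sits directly under one parent directory and keeps the default outs/ layout.

```python
import csv
import io
from pathlib import PurePosixPath


def build_aggregation_csv(run_dir, run_ids):
    """Return Aggregation CSV text for count runs located under run_dir."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["library_id", "fragments", "cells"])
    for run_id in run_ids:
        # Assumption: each run keeps the default <run_dir>/<id>/outs layout.
        outs = PurePosixPath(run_dir) / run_id / "outs"
        writer.writerow([run_id, str(outs / "fragments.tsv.gz"), str(outs / "singlecell.csv")])
    return buffer.getvalue()


print(build_aggregation_csv("/opt/runs", ["LV123", "LV456", "LV789"]), end="")
```

The order of `run_ids` matters, because the GEM well suffixes follow the row order of the CSV.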

These are the required command line arguments (also available through cellranger-atac aggr --help):

Argument	Description
`--id=ID`	A unique run id and output folder name [a-zA-Z0-9_-]+ of maximum length 64 characters.
`--csv=CSV`	Path to CSV file enumerating `cellranger-atac count` outputs (see Setting up a CSV).
`--reference=PATH`	Path to folder containing a Cell Ranger ATAC or Cell Ranger ARC reference.
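
The constraint on --id can be checked ahead of time. The Python sketch below encodes the character set and length limit from the table. The function name `is_valid_run_id` is hypothetical, and the pipeline may apply further checks that are not covered here.

```python
import re

# Character set and maximum length as stated for the --id argument.
RUN_ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")
MAX_RUN_ID_LENGTH = 64


def is_valid_run_id(run_id):
    """Check a candidate --id value against the documented character set and length."""
    return len(run_id) <= MAX_RUN_ID_LENGTH and RUN_ID_PATTERN.fullmatch(run_id) is not None


for candidate in ("AGG123", "AGG 123", "a" * 65):
    print(repr(candidate[:10]), is_valid_run_id(candidate))
```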

See the list of optional parameters on the command line arguments page.

After specifying input arguments and options, run cellranger-atac aggr:


$ cd /home/jdoe/runs
$ cellranger-atac aggr  --id=AGG123 \
                        --csv=AGG123_libraries.csv \
                        --normalize=depth \
                        --reference=/home/jdoe/refs/hg19

The pipeline will begin to run, creating a new folder named with the aggregation ID specified with the --id argument (e.g. /home/jdoe/runs/AGG123). If this output folder already exists, cellranger-atac will assume it is an existing pipestance and attempt to resume running it.

The cellranger-atac aggr pipeline generates output files that contain all of the data from the individual input jobs, aggregated into single output files, for convenient multi-sample analysis. The GEM well suffix of each barcode is updated to prevent barcode collisions, as described in Understanding GEM wells.

Each output file produced by cellranger-atac aggr follows the format described in the Understanding Output section, but includes the union of all the relevant barcodes from each input job.

cellranger-atac aggr does not perform a cell-calling step. It simply aggregates the cell calls encoded in singlecell.csv from each input job into a final set of cell calls.

A successful run will conclude with a message like this:


Outputs:
- Barcoded and aligned fragment file:           /home/jdoe/runs/AGG123/outs/fragments.tsv.gz
- Fragment file index:                          /home/jdoe/runs/AGG123/outs/fragments.tsv.gz.tbi
- Per-barcode fragment counts & metrics:        /home/jdoe/runs/AGG123/outs/singlecell.csv
- Bed file of all called peak locations:        /home/jdoe/runs/AGG123/outs/peaks.bed
- Filtered peak barcode matrix in hdf5 format:  /home/jdoe/runs/AGG123/outs/filtered_peak_bc_matrix.h5
- Filtered peak barcode matrix in mex format:   /home/jdoe/runs/AGG123/outs/filtered_peak_bc_matrix
- Directory of analysis files:                  /home/jdoe/runs/AGG123/outs/analysis
- HTML file summarizing aggregation analysis :  /home/jdoe/runs/AGG123/outs/web_summary.html
- Filtered tf barcode matrix in hdf5 format:    /home/jdoe/runs/AGG123/outs/filtered_tf_bc_matrix.h5
- Filtered tf barcode matrix in mex format:     /home/jdoe/runs/AGG123/outs/filtered_tf_bc_matrix
- Loupe Browser input file:                /home/jdoe/runs/AGG123/outs/cloupe.cloupe
- csv summarizing important metrics and values: /home/jdoe/runs/AGG123/outs/summary.csv
- Summary of all data metrics:                  /home/jdoe/runs/AGG123/outs/summary.json
- Annotation of peaks with genes:               /home/jdoe/runs/AGG123/outs/peak_annotation.tsv
- Csv of aggregation of libraries:              /home/jdoe/runs/AGG123/outs/aggregation_csv.csv

Pipestance completed successfully!

Once cellranger-atac aggr has successfully completed, you can browse the resulting summary HTML file in any supported web browser, open the .cloupe file in Loupe Browser, or refer to the Understanding Output section to explore the data by hand. For machine-readable versions of the summary metrics, refer to the Summary Metrics page.

If you are aggregating libraries generated by different chemistry versions (v2 vs. v1.1) of the Single Cell ATAC reagents (not to be confused with the Cell Ranger ATAC pipeline version), you might observe systematic differences in chromatin accessibility profiles between libraries. The cellranger-atac aggr pipeline optionally incorporates batch effect correction (algorithm details) to overcome this. To enable this module, you should include the following column in your aggregation CSV file:

batch: (optional) Unique identifier for the batch that this GEM well belongs to. Libraries with the same batch identifier will be considered to be in the same batch.

For example, if the LV123 sample in the previous example is a v1.1 chemistry library, and the LV456 and LV789 samples are v2 libraries, you would set up the aggregation CSV file like this:


library_id,fragments,cells,batch
LV123,/opt/runs/LV123/outs/fragments.tsv.gz,/opt/runs/LV123/outs/singlecell.csv,v1.1_lib
LV456,/opt/runs/LV456/outs/fragments.tsv.gz,/opt/runs/LV456/outs/singlecell.csv,v2_lib
LV789,/opt/runs/LV789/outs/fragments.tsv.gz,/opt/runs/LV789/outs/singlecell.csv,v2_lib

The v1.1_lib and v2_lib identifiers are merely example identifiers. Every sample from a given batch has to have the same batch identifier, but otherwise the identifier itself is arbitrary.
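
Because the identifiers are free text, a misspelled batch name silently creates an extra batch. One way to catch this is to list the libraries under each batch before running the pipeline, as in the Python sketch below. The function name `libraries_by_batch` is hypothetical and the check is a suggestion, not part of Cell Ranger ATAC.

```python
import csv
import io
from collections import defaultdict


def libraries_by_batch(text):
    """Group library_id values by the batch column of Aggregation CSV text."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row["batch"]].append(row["library_id"])
    return dict(groups)


example = """library_id,fragments,cells,batch
LV123,/opt/runs/LV123/outs/fragments.tsv.gz,/opt/runs/LV123/outs/singlecell.csv,v1.1_lib
LV456,/opt/runs/LV456/outs/fragments.tsv.gz,/opt/runs/LV456/outs/singlecell.csv,v2_lib
LV789,/opt/runs/LV789/outs/fragments.tsv.gz,/opt/runs/LV789/outs/singlecell.csv,v2_lib
"""
print(libraries_by_batch(example))
```

A batch that contains a single unexpected library is a hint that its identifier was mistyped.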

This Chemistry Batch Correction is specifically intended to correct for systematic variability in chromatin accessibility profiles caused by different versions of the Single Cell ATAC chemistry. 10x Genomics has tested and verified its effectiveness primarily on aggregating Single Cell ATAC v1.1 and v2 chemistries with well-matched input material. The module may be useful in other scenarios but will require careful validation of results.
Chemistry batch correction affects the dimensionality reduction, t-SNE, and UMAP visualization and clustering results. Values in the aggregated matrix are not adjusted by Chemistry Batch Correction. Differential accessibility analysis is still performed on the peak-barcode matrix.
The batch effect score (described in the algorithm details) is recommended for evaluating the performance of batch correction. In addition to the batch effect itself, the score depends on the composition of the cell population across batches.
When the chemistry batch correction is enabled, by default the dimensionality reduction will be performed with LSA (latent semantic analysis). The number of dimensions is set to 100. When PCA (principal components analysis) is specified as the method, FBPCA (functions for principal component analysis) will be used to perform dimensionality reduction. PLSA (probabilistic latent semantic analysis) is not compatible with chemistry batch correction.
The minimum System Requirements of 64GB RAM are sufficient for batch correction on datasets totaling 128k cells.

When combining data from multiple GEM wells, the cellranger-atac aggr pipeline by default equalizes the sequencing depth per cell between GEM wells before merging. When libraries are sequenced to very different depths per cell, you may observe that cells cluster by library of origin rather than by cell type. This is commonly referred to as a batch effect in the literature. Many factors can cause batch effects in single cell data, and sequencing depth is only one of them. The downsampling normalization in cellranger-atac aggr specifically addresses sequencing depth batch effects but not others. It is possible to turn off normalization or change the way normalization is done. The none option may be appropriate if you want to maximize sensitivity and plan to deal with depth normalization or more general batch correction in a downstream step.

There are two normalization modes:

none: Do not normalize at all.
depth (default): Subsample reads from higher-depth GEM wells until all GEM wells have the same median number of unique fragments per cell.
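
The idea behind depth mode can be sketched in Python. This is a simplified illustration and not the pipeline's algorithm: it assumes the target is the lowest median among the libraries and that the median scales linearly with the fraction of data kept. The function name `depth_subsampling_rates` and the median values are hypothetical.

```python
def depth_subsampling_rates(median_fragments_per_cell):
    """Map each library to the fraction of data kept so all medians match the lowest one.

    Assumes the median scales linearly with the kept fraction, which is a
    simplification and not necessarily how Cell Ranger ATAC subsamples.
    """
    target = min(median_fragments_per_cell.values())
    return {lib: target / median for lib, median in median_fragments_per_cell.items()}


# Hypothetical medians, for illustration only.
rates = depth_subsampling_rates({"LV123": 20000, "LV456": 10000, "LV789": 16000})
print(rates)
```

Under these assumptions the shallowest library is kept in full, and deeper libraries are reduced in proportion to how far they exceed it.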
