Specifying Input FASTQ Files for cellranger-arc count

The cellranger-arc count pipeline requires ATAC and GEX FASTQ files as input, which typically come from running cellranger-arc mkfastq, a 10x Genomics-aware convenience wrapper for bcl2fastq. However, it is possible to use FASTQ files from other sources, such as Illumina's bcl2fastq or BCL Convert, a published dataset, or the 10x Genomics bamtofastq tool. Input FASTQ files must conform to the naming conventions of bcl2fastq and mkfastq for cellranger-arc count to successfully complete. These files are specified using a libraries CSV file and passed to the cellranger-arc count pipeline using the --libraries argument.
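As a rough illustration of the libraries CSV described above, the sketch below renders a three-column file and assembles a command line that points at it. The paths, sample names, and run ID are hypothetical placeholders, and the `--id` and `--reference` values are assumptions to be checked against `cellranger-arc count --help`; only `--libraries` is discussed on this page.

```python
import csv
import io


def libraries_csv(rows):
    """Render (fastqs, sample, library_type) tuples as libraries CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["fastqs", "sample", "library_type"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical paths and sample names, for illustration only.
rows = [
    ("/PATH/TO/GEX_FASTQS", "my_gex_sample", "Gene Expression"),
    ("/PATH/TO/ATAC_FASTQS", "my_atac_sample", "Chromatin Accessibility"),
]
csv_text = libraries_csv(rows)
print(csv_text)

# Command line that would consume the text once saved as libraries.csv.
# The command is only assembled here, not executed.
command = [
    "cellranger-arc", "count",
    "--id=my_run",                     # hypothetical run ID
    "--reference=/PATH/TO/REFERENCE",  # hypothetical reference path
    "--libraries=libraries.csv",
]
print(" ".join(command))
```

Building the file with the `csv` module rather than string concatenation keeps the output valid if a path ever contains a comma.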

The cellranger-arc count pipeline can process data from one Multiome ATAC library and one Multiome GEX library, each of which could be sequenced on multiple flow cells. Multi-library analysis is not possible at this time. cellranger-arc count must not be used to process GEX or ATAC data alone.
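Because the pipeline needs both library types, a quick pre-flight check on the libraries CSV can catch a missing one before a run is launched. The helper below is a minimal sketch, not part of Cell Ranger ARC; the function name `missing_library_types` is hypothetical.

```python
import csv
import io

REQUIRED_TYPES = {"Gene Expression", "Chromatin Accessibility"}


def missing_library_types(csv_text):
    """Return the required library types that do not appear in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {row["library_type"].strip() for row in reader}
    return REQUIRED_TYPES - seen


gex_only = (
    "fastqs,sample,library_type\n"
    "/PATH/TO/PROJECT_FOLDER,MySample,Gene Expression\n"
)
print(missing_library_types(gex_only))
```

An empty set means both a GEX line and an ATAC line are present.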

There are multiple ways bcl2fastq, bcl-convert, and mkfastq can be invoked, resulting in a wide range of potential output file names and locations. Because finding the right FASTQ files and the right arguments to process them can be confusing, some common scenarios are illustrated below.

To serve as inputs for Cell Ranger ARC, FASTQ files should conform to the naming conventions of bcl2fastq and mkfastq described below.

For GEX FASTQ files:

[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz

Where Read Type is one of:

  • I1: Dual index i7 read (optional)
  • I2: Dual index i5 read (optional)
  • R1: Read 1
  • R2: Read 2

For ATAC FASTQ files:

[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz

Where Read Type is one of:

  • I1: Dual index i7 read (optional)
  • R1: Read 1
  • R2: Dual index i5 read
  • R3: Read 2

Cell Ranger ARC will also accept ATAC FASTQs in this format:

  • I1: Dual index i7 read (optional)
  • R1: Read 1
  • I2: Dual index i5 read
  • R2: Read 2
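The naming convention above can be expressed as a regular expression. The following is a minimal sketch for checking file names before a run; the helper name `parse_fastq_name` is hypothetical, and the pattern accepts any sample order number (S1, S2, ...) because the examples on this page use more than S1.

```python
import re

# sample name, sample order, three-digit lane, read type, fixed chunk "001"
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<order>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>I1|I2|R1|R2|R3)_001\.fastq\.gz$"
)


def parse_fastq_name(name):
    """Return the name's components as a dict, or None if it does not conform."""
    match = FASTQ_NAME.match(name)
    return match.groupdict() if match else None


print(parse_fastq_name("test_sample1_S1_L001_R2_001.fastq.gz"))
print(parse_fastq_name("MySample.fastq.gz"))
```

A `None` result flags a file that would need renaming before it can be used as input.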

Where are your GEX FASTQ files?

How are your GEX FASTQ files named?

How did I get here?

By running cellranger-arc mkfastq with a simple CSV layout file or Illumina Experiment Manager samplesheet, or by running bcl2fastq directly (with an IEM samplesheet) on a flow cell.

If you ran mkfastq, your files will be in the MKFASTQ_ID/outs/fastq_path folder, and the file hierarchy may look similar to this:

MKFASTQ_ID
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLC5BBXX
            |-- test_sample1
            |   |-- test_sample1_S1_L001_I1_001.fastq.gz
            |   |-- test_sample1_S1_L001_I2_001.fastq.gz
            |   |-- test_sample1_S1_L001_R1_001.fastq.gz
            |   |-- test_sample1_S1_L001_R2_001.fastq.gz
            |   |-- test_sample1_S1_L002_I1_001.fastq.gz
            |   |-- test_sample1_S1_L002_I2_001.fastq.gz
            |   |-- test_sample1_S1_L002_R1_001.fastq.gz
            |   |-- test_sample1_S1_L002_R2_001.fastq.gz
            |   |-- test_sample1_S1_L003_I1_001.fastq.gz
            |   |-- test_sample1_S1_L003_I2_001.fastq.gz
            |   |-- test_sample1_S1_L003_R1_001.fastq.gz
            |   `-- test_sample1_S1_L003_R2_001.fastq.gz
            |-- test_sample2
            |   |-- test_sample2_S2_L001_I1_001.fastq.gz
            |   |-- test_sample2_S2_L001_I2_001.fastq.gz
            |   |-- test_sample2_S2_L001_R1_001.fastq.gz
            |   |-- test_sample2_S2_L001_R2_001.fastq.gz
            |   |-- test_sample2_S2_L002_I1_001.fastq.gz
            |   |-- test_sample2_S2_L002_I2_001.fastq.gz
            |   |-- test_sample2_S2_L002_R1_001.fastq.gz
            |   |-- test_sample2_S2_L002_R2_001.fastq.gz
            |   |-- test_sample2_S2_L003_I1_001.fastq.gz
            |   |-- test_sample2_S2_L003_I2_001.fastq.gz
            |   |-- test_sample2_S2_L003_R1_001.fastq.gz
            |   `-- test_sample2_S2_L003_R2_001.fastq.gz
        |-- Reports
        |-- Stats
        |-- Undetermined_S0_L001_I1_001.fastq.gz
        ...
        `-- Undetermined_S0_L003_R2_001.fastq.gz

If you ran bcl2fastq directly, your file hierarchy may look similar to this:

BCL2FASTQ_OUTPUT_DIR
|-- HFLC5BBXX
    |-- test_sample1
    |   |-- test_sample1_S1_L001_I1_001.fastq.gz
    |   |-- test_sample1_S1_L001_I2_001.fastq.gz
    |   |-- test_sample1_S1_L001_R1_001.fastq.gz
    |   |-- test_sample1_S1_L001_R2_001.fastq.gz
    |   |-- test_sample1_S1_L002_I1_001.fastq.gz
    |   |-- test_sample1_S1_L002_I2_001.fastq.gz
    |   |-- test_sample1_S1_L002_R1_001.fastq.gz
    |   |-- test_sample1_S1_L002_R2_001.fastq.gz
    |   |-- test_sample1_S1_L003_I1_001.fastq.gz
    |   |-- test_sample1_S1_L003_I2_001.fastq.gz
    |   |-- test_sample1_S1_L003_R1_001.fastq.gz
    |   `-- test_sample1_S1_L003_R2_001.fastq.gz
    |-- test_sample2
    |   |-- test_sample2_S2_L001_I1_001.fastq.gz
    |   |-- test_sample2_S2_L001_I2_001.fastq.gz
    |   |-- test_sample2_S2_L001_R1_001.fastq.gz
    |   |-- test_sample2_S2_L001_R2_001.fastq.gz
    |   |-- test_sample2_S2_L002_I1_001.fastq.gz
    |   |-- test_sample2_S2_L002_I2_001.fastq.gz
    |   |-- test_sample2_S2_L002_R1_001.fastq.gz
    |   |-- test_sample2_S2_L002_R2_001.fastq.gz
    |   |-- test_sample2_S2_L003_I1_001.fastq.gz
    |   |-- test_sample2_S2_L003_I2_001.fastq.gz
    |   |-- test_sample2_S2_L003_R1_001.fastq.gz
    |   `-- test_sample2_S2_L003_R2_001.fastq.gz
...

You will have one set of FASTQ files per sample, prefixed with the name of the sample as it appears in the simple CSV layout file or IEM samplesheet.
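Since each sample's files share the sample name as a prefix, the values that are valid in the sample column of the libraries CSV can be recovered from a directory listing. The sketch below groups file names by that prefix and skips the Undetermined files; the helper name `samples_by_prefix` is hypothetical and the file list is illustrative.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"^(.+)_S\d+_L\d{3}_(?:I1|I2|R1|R2|R3)_001\.fastq\.gz$")


def samples_by_prefix(filenames):
    """Group conforming FASTQ file names by their sample-name prefix."""
    groups = defaultdict(list)
    for name in filenames:
        match = PATTERN.match(name)
        # Undetermined reads belong to no sample, so they are left out.
        if match and match.group(1) != "Undetermined":
            groups[match.group(1)].append(name)
    return dict(groups)


listing = [
    "test_sample1_S1_L001_R1_001.fastq.gz",
    "test_sample1_S1_L001_R2_001.fastq.gz",
    "test_sample2_S2_L001_R1_001.fastq.gz",
    "Undetermined_S0_L001_R1_001.fastq.gz",
    "notes.txt",
]
for sample, files in samples_by_prefix(listing).items():
    print(sample, len(files))
```

In practice the listing could come from walking the fastq_path folder, for example with `pathlib.Path.rglob("*.fastq.gz")`.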

For more information on the naming conventions, please visit Illumina's support site or refer to the bcl2fastq User Guide. The scenario where your files do not conform to the naming convention is described in a different section later on this page.

The table below describes the line in the libraries CSV file you would use in the corresponding scenario. Be sure to substitute the capitalized text as appropriate. The "All Samples" entries in this table are provided for technical completeness.

Situation: All samples (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,,Gene Expression
  ...

Situation: All samples (mkfastq), multiple flow cells
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_FLOWCELL1/outs/fastq_path,,Gene Expression
  /PATH/TO/MKFASTQ_FLOWCELL2/outs/fastq_path,,Gene Expression
  ...

Situation: All samples (bcl2fastq direct)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/BCL2FASTQ_OUTPUT_DIR,,Gene Expression
  ...

Situation: Process test_sample1 (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample1,Gene Expression
  ...

Situation: Process test_sample1 and test_sample2 as a single merged sample (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample1,Gene Expression
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample2,Gene Expression
  ...

How did I get here?

An Illumina Experiment Manager-formatted samplesheet was used with either no entry or a blank entry for the Sample_Project column. Your hierarchy may look similar to this:

fastq_path
|-- Reports
|-- Stats
|-- test_sample_S1_L001_I1_001.fastq.gz
|-- test_sample_S1_L001_I2_001.fastq.gz
|-- test_sample_S1_L001_R1_001.fastq.gz
|-- test_sample_S1_L001_R2_001.fastq.gz
|-- test_sample_S1_L002_I1_001.fastq.gz
|-- test_sample_S1_L002_I2_001.fastq.gz
|-- test_sample_S1_L002_R1_001.fastq.gz
|-- test_sample_S1_L002_R2_001.fastq.gz
|-- test_sample_S1_L003_I1_001.fastq.gz
|-- test_sample_S1_L003_I2_001.fastq.gz
|-- test_sample_S1_L003_R1_001.fastq.gz
|-- test_sample_S1_L003_R2_001.fastq.gz
|-- Undetermined_S0_L001_I1_001.fastq.gz
...
`-- Undetermined_S0_L003_R2_001.fastq.gz

This is fine; you would use the same arguments as if the FASTQs were organized into subfolders within the output folder.

Situation: All samples (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,,Gene Expression
  ...

Situation: All samples (bcl2fastq direct)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/BCL2FASTQ_OUTPUT_DIR,,Gene Expression
  ...

Situation: Process test_sample only (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample,Gene Expression
  ...

How did I get here?

It is likely that FASTQ files have been transferred from either a mkfastq or bcl2fastq run into another folder. They still retain the names assigned by bcl2fastq, which is a combination of sample name, sample order, lane, read type, and chunk. Your file hierarchy may look like this:

PROJECT_FOLDER
|-- MySample_S1_L001_I1_001.fastq.gz
|-- MySample_S1_L001_I2_001.fastq.gz
|-- MySample_S1_L001_R1_001.fastq.gz
|-- MySample_S1_L001_R2_001.fastq.gz
|-- MySample_S1_L002_I1_001.fastq.gz
|-- MySample_S1_L002_I2_001.fastq.gz
|-- MySample_S1_L002_R1_001.fastq.gz
|-- MySample_S1_L002_R2_001.fastq.gz

This is fine; since the files are named according to the bcl2fastq standard, you would use the same arguments as if the FASTQs were organized into a flow cell folder or mkfastq output folder.

How did I get here?

It is likely that you received files that were processed through a proprietary LIMS, which employs its own naming conventions.

10x Genomics pipelines require files to be named in the bcl2fastq convention in order to run properly. You will need to determine the corresponding sample and read type for each file, likely by consulting your sequencing core or the individual who demultiplexed your flow cell.

It is highly likely that these files were initially processed with bcl2fastq. Once you have tracked down the origin of each file, rename the files in the following format:

[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz

Where Read Type is one of:

  • I1: Dual index i7 read (optional)
  • I2: Dual index i5 read (optional)
  • R1: Read 1
  • R2: Read 2
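Once the sample, lane, and read type of each file are known, the target names can be generated rather than typed by hand. The sketch below builds a rename plan under the assumption that this mapping has already been established with your sequencing core; the LIMS-style file names and the helper `bcl2fastq_name` are hypothetical.

```python
ALLOWED_READ_TYPES = {"I1", "I2", "R1", "R2"}  # GEX read types listed above


def bcl2fastq_name(sample, lane, read_type, sample_order=1):
    """Build a file name in the bcl2fastq convention for one FASTQ file."""
    if read_type not in ALLOWED_READ_TYPES:
        raise ValueError(f"unexpected read type: {read_type}")
    return f"{sample}_S{sample_order}_L{lane:03d}_{read_type}_001.fastq.gz"


# Hypothetical LIMS names mapped to (sample, lane, read type).
lims_files = {
    "run42_laneA_fwd.fq.gz": ("MySample", 1, "R1"),
    "run42_laneA_rev.fq.gz": ("MySample", 1, "R2"),
}

rename_plan = {old: bcl2fastq_name(*info) for old, info in lims_files.items()}
for old, new in rename_plan.items():
    print(f"{old} -> {new}")
```

Printing the plan first, and only then applying it with `os.rename` or symbolic links, makes it easier to review the mapping before any file is touched.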

After the files have been renamed in the specified format, you will use the following arguments:

Situation: All samples
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/PROJECT_FOLDER,,Gene Expression
  ...

Situation: Process SAMPLENAME only
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/PROJECT_FOLDER,SAMPLENAME,Gene Expression
  ...

Where are your ATAC FASTQ files?

How are your ATAC FASTQ files named?

How did I get here?

By running cellranger-arc mkfastq with a simple CSV layout file or Illumina Experiment Manager samplesheet, or by running bcl2fastq directly (with an IEM samplesheet) on a flow cell.

If you ran mkfastq, your files will be in the MKFASTQ_ID/outs/fastq_path folder, and your file hierarchy may look similar to this:

MKFASTQ_ID
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLC5BBXX
            |-- test_sample1
            |   |-- test_sample1_S1_L001_I1_001.fastq.gz
            |   |-- test_sample1_S1_L001_R1_001.fastq.gz
            |   |-- test_sample1_S1_L001_R2_001.fastq.gz
            |   |-- test_sample1_S1_L001_R3_001.fastq.gz
            |   |-- test_sample1_S1_L002_I1_001.fastq.gz
            |   |-- test_sample1_S1_L002_R1_001.fastq.gz
            |   |-- test_sample1_S1_L002_R2_001.fastq.gz
            |   |-- test_sample1_S1_L002_R3_001.fastq.gz
            |   |-- test_sample1_S1_L003_I1_001.fastq.gz
            |   |-- test_sample1_S1_L003_R1_001.fastq.gz
            |   |-- test_sample1_S1_L003_R2_001.fastq.gz
            |   `-- test_sample1_S1_L003_R3_001.fastq.gz
            |-- test_sample2
            |   |-- test_sample2_S2_L001_I1_001.fastq.gz
            |   |-- test_sample2_S2_L001_R1_001.fastq.gz
            |   |-- test_sample2_S2_L001_R2_001.fastq.gz
            |   |-- test_sample2_S2_L001_R3_001.fastq.gz
            |   |-- test_sample2_S2_L002_I1_001.fastq.gz
            |   |-- test_sample2_S2_L002_R1_001.fastq.gz
            |   |-- test_sample2_S2_L002_R2_001.fastq.gz
            |   |-- test_sample2_S2_L002_R3_001.fastq.gz
            |   |-- test_sample2_S2_L003_I1_001.fastq.gz
            |   |-- test_sample2_S2_L003_R1_001.fastq.gz
            |   |-- test_sample2_S2_L003_R2_001.fastq.gz
            |   `-- test_sample2_S2_L003_R3_001.fastq.gz
        |-- Reports
        |-- Stats
        |-- Undetermined_S0_L001_I1_001.fastq.gz
        ...
        `-- Undetermined_S0_L003_R3_001.fastq.gz

If you ran bcl2fastq directly, your file hierarchy may look similar to this:

BCL2FASTQ_OUTPUT_DIR
|-- HFLC5BBXX
    |-- test_sample1
    |   |-- test_sample1_S1_L001_I1_001.fastq.gz
    |   |-- test_sample1_S1_L001_R1_001.fastq.gz
    |   |-- test_sample1_S1_L001_R2_001.fastq.gz
    |   |-- test_sample1_S1_L001_R3_001.fastq.gz
    |   |-- test_sample1_S1_L002_I1_001.fastq.gz
    |   |-- test_sample1_S1_L002_R1_001.fastq.gz
    |   |-- test_sample1_S1_L002_R2_001.fastq.gz
    |   |-- test_sample1_S1_L002_R3_001.fastq.gz
    |   |-- test_sample1_S1_L003_I1_001.fastq.gz
    |   |-- test_sample1_S1_L003_R1_001.fastq.gz
    |   |-- test_sample1_S1_L003_R2_001.fastq.gz
    |   `-- test_sample1_S1_L003_R3_001.fastq.gz
    |-- test_sample2
    |   |-- test_sample2_S2_L001_I1_001.fastq.gz
    |   |-- test_sample2_S2_L001_R1_001.fastq.gz
    |   |-- test_sample2_S2_L001_R2_001.fastq.gz
    |   |-- test_sample2_S2_L001_R3_001.fastq.gz
    |   |-- test_sample2_S2_L002_I1_001.fastq.gz
    |   |-- test_sample2_S2_L002_R1_001.fastq.gz
    |   |-- test_sample2_S2_L002_R2_001.fastq.gz
    |   |-- test_sample2_S2_L002_R3_001.fastq.gz
    |   |-- test_sample2_S2_L003_I1_001.fastq.gz
    |   |-- test_sample2_S2_L003_R1_001.fastq.gz
    |   |-- test_sample2_S2_L003_R2_001.fastq.gz
    |   `-- test_sample2_S2_L003_R3_001.fastq.gz
...

You will have one set of FASTQ files per sample, prefixed with the name of the sample as it appears in the simple CSV layout file or IEM samplesheet. Other situations described later on this page deal with the presence of four separate sets of files (four "samples" from bcl2fastq's point of view) per single biological sample/library.

For more information on the naming conventions, please visit Illumina's support site or refer to the bcl2fastq User Guide. The scenario where your files do not conform to the naming convention is described in a different section later on this page.

The table below describes the line in the libraries CSV file you would use in the corresponding scenario. Be sure to substitute the capitalized text as appropriate. The "All Samples" entries in this table are provided for technical completeness.

Situation: All samples (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,,Chromatin Accessibility
  ...

Situation: All samples (mkfastq), multiple flow cells
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_FLOWCELL1/outs/fastq_path,,Chromatin Accessibility
  /PATH/TO/MKFASTQ_FLOWCELL2/outs/fastq_path,,Chromatin Accessibility
  ...

Situation: All samples (bcl2fastq direct)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/BCL2FASTQ_OUTPUT_DIR,,Chromatin Accessibility
  ...

Situation: Process test_sample1 (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample1,Chromatin Accessibility
  ...

Situation: Process test_sample1 and test_sample2 as a single merged sample (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample1,Chromatin Accessibility
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample2,Chromatin Accessibility
  ...

How did I get here?

It is likely that the input samplesheet used explicitly separated the four oligos in a 10x Genomics sample index set into four separate sample names. You may see a file hierarchy similar to this:

bcl2fastq_output
|-- HFLC5BBXX
    |-- SI-GA-A1_1
    |   |-- SI-GA-A1_1_S1_L001_I1_001.fastq.gz
    |   |-- SI-GA-A1_1_S1_L001_R1_001.fastq.gz
    |   |-- SI-GA-A1_1_S1_L001_R2_001.fastq.gz
    |   `-- SI-GA-A1_1_S1_L001_R3_001.fastq.gz
    |-- SI-GA-A1_2
    |   |-- SI-GA-A1_2_S2_L001_I1_001.fastq.gz
    |   |-- SI-GA-A1_2_S2_L001_R1_001.fastq.gz
    |   |-- SI-GA-A1_2_S2_L001_R2_001.fastq.gz
    |   `-- SI-GA-A1_2_S2_L001_R3_001.fastq.gz
    |-- SI-GA-A1_3
    |   |-- SI-GA-A1_3_S3_L001_I1_001.fastq.gz
    |   |-- SI-GA-A1_3_S3_L001_R1_001.fastq.gz
    |   |-- SI-GA-A1_3_S3_L001_R2_001.fastq.gz
    |   `-- SI-GA-A1_3_S3_L001_R3_001.fastq.gz
    |-- SI-GA-A1_4
    |   |-- SI-GA-A1_4_S4_L001_I1_001.fastq.gz
    |   |-- SI-GA-A1_4_S4_L001_R1_001.fastq.gz
    |   |-- SI-GA-A1_4_S4_L001_R2_001.fastq.gz
    |   `-- SI-GA-A1_4_S4_L001_R3_001.fastq.gz
|-- Reports
|-- Stats
|-- Undetermined_S0_L001_I1_001.fastq.gz
|-- Undetermined_S0_L001_R1_001.fastq.gz
|-- Undetermined_S0_L001_R2_001.fastq.gz
`-- Undetermined_S0_L001_R3_001.fastq.gz

You probably want to merge all samples from the SI-GA-A1 index set into a single analysis. If you process only one index at a time, you will see fewer reads than expected, which may translate to lower than expected coverage or cell count for the experiment.

Situation: All samples (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,,Chromatin Accessibility
  ...

Situation: Process all SI-GA-A1 reads in a single analysis
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_1,Chromatin Accessibility
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_2,Chromatin Accessibility
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_3,Chromatin Accessibility
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_4,Chromatin Accessibility
  ...

Situation: Only process first sample index
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_1,Chromatin Accessibility
  ...
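The four lines needed to merge one sample index set follow a regular pattern, so they can be generated instead of copied by hand. The sketch below assumes the four per-oligo sample names are the index set name followed by _1 through _4, as in the hierarchy shown above; the helper name `sample_index_rows` is hypothetical.

```python
def sample_index_rows(fastq_path, index_set, library_type, n_oligos=4):
    """Return one libraries CSV row per oligo of a sample index set."""
    return [
        (fastq_path, f"{index_set}_{i}", library_type)
        for i in range(1, n_oligos + 1)
    ]


rows = sample_index_rows(
    "/PATH/TO/MKFASTQ_ID/outs/fastq_path",
    "SI-GA-A1",
    "Chromatin Accessibility",
)
for row in rows:
    print(",".join(row))
```

The GEX line for the same run still has to be added to the file, since the pipeline requires both library types.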

How did I get here?

An Illumina Experiment Manager-formatted samplesheet was used with either no entry or a blank entry for the Sample_Project column. Your hierarchy may look similar to this:

fastq_path
|-- Reports
|-- Stats
|-- test_sample_S1_L001_I1_001.fastq.gz
|-- test_sample_S1_L001_R1_001.fastq.gz
|-- test_sample_S1_L001_R2_001.fastq.gz
|-- test_sample_S1_L001_R3_001.fastq.gz
|-- test_sample_S1_L002_I1_001.fastq.gz
|-- test_sample_S1_L002_R1_001.fastq.gz
|-- test_sample_S1_L002_R2_001.fastq.gz
|-- test_sample_S1_L002_R3_001.fastq.gz
|-- test_sample_S1_L003_I1_001.fastq.gz
|-- test_sample_S1_L003_R1_001.fastq.gz
|-- test_sample_S1_L003_R2_001.fastq.gz
|-- test_sample_S1_L003_R3_001.fastq.gz
|-- Undetermined_S0_L001_I1_001.fastq.gz
...
`-- Undetermined_S0_L003_R3_001.fastq.gz

This is fine; you would use the same arguments as if the FASTQs were organized into subfolders within the output folder.

Situation: All samples (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,,Chromatin Accessibility
  ...

Situation: All samples (bcl2fastq direct)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/BCL2FASTQ_OUTPUT_DIR,,Chromatin Accessibility
  ...

Situation: Process test_sample only (mkfastq)
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample,Chromatin Accessibility
  ...

How did I get here?

It is likely that FASTQ files have been transferred from either a mkfastq or bcl2fastq run into another folder. They still retain the names assigned by bcl2fastq, which is a combination of sample name, sample order, lane, read type, and chunk. Your file hierarchy may look similar to this:

PROJECT_FOLDER
|-- MySample_S1_L001_I1_001.fastq.gz
|-- MySample_S1_L001_I2_001.fastq.gz
|-- MySample_S1_L001_R1_001.fastq.gz
|-- MySample_S1_L001_R2_001.fastq.gz
|-- MySample_S1_L002_I1_001.fastq.gz
|-- MySample_S1_L002_I2_001.fastq.gz
|-- MySample_S1_L002_R1_001.fastq.gz
|-- MySample_S1_L002_R2_001.fastq.gz

This is fine; since the files are named according to the bcl2fastq standard, you would use the same arguments as if the FASTQs were organized into a flow cell folder or mkfastq output folder.

Situation: All samples
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/PROJECT_FOLDER,,Chromatin Accessibility
  ...

Situation: Process MySample only
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/PROJECT_FOLDER,MySample,Chromatin Accessibility
  ...

How did I get here?

It is likely that you received files that were processed through a proprietary LIMS, which employs its own naming conventions.

10x Genomics pipelines require files to be named in the bcl2fastq convention in order to run properly. You will need to determine the corresponding sample and read type for each file, likely by consulting your sequencing core or the individual who demultiplexed your flow cell.

It is highly likely that these files were initially processed with bcl2fastq, so you will need to rename the files in one of the following formats, once you track down their origin:

[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz

Where Read Type is one of:

  • I1: Dual index i7 read (optional)
  • R1: Read 1
  • R2: Dual index i5 read
  • R3: Read 2

Alternatively, Cell Ranger ARC will also accept ATAC FASTQs in this format:

  • I1: Dual index i7 read (optional)
  • R1: Read 1
  • I2: Dual index i5 read
  • R2: Read 2

After you have renamed those files into that format, you'll use the following arguments:

Situation: All samples
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/PROJECT_FOLDER,,Chromatin Accessibility
  ...

Situation: Process SAMPLENAME only
Line in libraries CSV:
  fastqs,sample,library_type
  /PATH/TO/PROJECT_FOLDER,SAMPLENAME,Chromatin Accessibility
  ...
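After renaming ATAC files, it can help to confirm that each lane carries one of the two accepted read layouts listed above. The check below is a minimal sketch, not part of Cell Ranger ARC; the helper name `atac_layout_ok` is hypothetical, and it treats I1 as optional, as the lists above state.

```python
ATAC_LAYOUTS = [
    {"R1", "R2", "R3"},  # R2 holds the dual index i5 read, R3 holds Read 2
    {"R1", "I2", "R2"},  # I2 holds the dual index i5 read, R2 holds Read 2
]


def atac_layout_ok(read_types):
    """Return True if a lane's read types match an accepted ATAC layout."""
    required = set(read_types) - {"I1"}  # I1 is optional in both layouts
    return required in ATAC_LAYOUTS


print(atac_layout_ok(["I1", "R1", "R2", "R3"]))
print(atac_layout_ok(["R1", "R2"]))
```

A lane with only R1 and R2 fails the check because the i5 index read is missing.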