Pipestance Structure

The pipeline output directory, described in Understanding Output, contains all of the data produced by one invocation of a pipeline (a pipestance) as well as rich metadata describing the characteristics of each stage. The directory has a specific structure that the Martian pipeline framework uses to track the state of the pipeline as execution proceeds.

Cell Ranger's notion of a pipeline is very flexible in that a pipeline can be composed of stages that run stage code or sub-pipelines that may themselves contain stages or sub-pipelines.

Cell Ranger pipelines follow the convention that stages are named with verbs (e.g., ALIGN_READS, MARK_DUPLICATES, FILTER_BARCODES) and sub-pipelines are named with nouns and prefixed with an underscore (e.g., _BCSORTER). Each stage runs in its own directory bearing its name, and each stage's directory is contained within its parent pipeline's directory.

For example, the cellranger-arc mkfastq pipeline has the following process graph:

where

  • MAKE_FASTQS_CS is the top-level pipeline stage
  • MAKE_FASTQS is a sub-pipeline contained in MAKE_FASTQS_CS
  • PREPARE_SAMPLESHEET, BCL2FASTQ_WITH_SAMPLESHEET, MAKE_QC_SUMMARY, and MERGE_FASTQS_BY_LANE_SAMPLE are stages contained in the MAKE_FASTQS sub-pipeline.
  • MAKE_FASTQS_PREFLIGHT and MAKE_FASTQS_PREFLIGHT_LOCAL are preflight stages, which validate inputs prior to running the other stages. These also belong to MAKE_FASTQS, but have no connections to other stages because they don't produce any outputs.

The MAKE_FASTQS_CS stage is not strictly necessary since it contains no stages and only one child pipeline (MAKE_FASTQS); however, it serves to mask some of the low-level inputs required by the MAKE_FASTQS pipeline.

Every pipestance operates entirely within its pipeline output directory. When the pipestance completes, the pipeline output directory contains three kinds of output: metadata files, the pipestance output file directory, and the top-level pipeline stage directory.

  • Metadata files are files prefixed with an underscore (_) and usually contain unstructured text or JSON-encoded arrays and hashes.
  • The pipestance output file directory is a directory called outs/ that contains the pipestance's output files.
  • The top-level pipeline stage directory is a directory named according to the top-level pipeline stage that contains the child stage directories that compose this pipestance.
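The three kinds of output listed above can be told apart by name alone. The following is a minimal Python sketch, not part of Cell Ranger ARC or Martian; the function name `classify_pipestance_entries` and the `other` bucket are assumptions introduced for illustration.

```python
from pathlib import Path


def classify_pipestance_entries(pipestance_dir):
    """Group the entries of a pipeline output directory by kind.

    Hypothetical helper: applies only the naming rules stated in the text
    (underscore-prefixed files are metadata, outs/ holds output files, and
    any other directory is treated as a top-level pipeline stage directory).
    """
    groups = {"metadata": [], "outs": [], "stage_dirs": [], "other": []}
    for entry in sorted(Path(pipestance_dir).iterdir()):
        if entry.is_file() and entry.name.startswith("_"):
            groups["metadata"].append(entry.name)
        elif entry.is_dir() and entry.name == "outs":
            groups["outs"].append(entry.name)
        elif entry.is_dir():
            groups["stage_dirs"].append(entry.name)
        else:
            # Anything the text does not describe is kept separate.
            groups["other"].append(entry.name)
    return groups
```

On a completed pipestance one would expect a single entry under `stage_dirs`, named after the top-level pipeline stage.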

The top-level pipeline stage directory is a stage directory that contains any number of child stage directories as well as one stage output directory for each fork run by that stage. The top-level pipeline stages for Cell Ranger ARC are:

Most of the Cell Ranger ARC pipelines contain single-fork stages, which means there is one fork0 stage output directory within each stage directory. Chunk output directories are a subset of stage output directories that additionally contain runtime information specific to the job or process being run by that chunk (e.g., a process ID or cluster job ID).

For example, the cellranger-arc mkfastq pipeline's pipeline output directory contains the following directory structure:

| Path | Description |
|---|---|
| _log | Metadata file |
| outs/ | Pipestance output file directory |
| MAKE_FASTQS_CS/ | Top-level pipeline stage directory |
| MAKE_FASTQS_CS/fork0/ | Stage output directory |
| MAKE_FASTQS_CS/fork0/files/ | Stage output files |
| MAKE_FASTQS_CS/MAKE_FASTQS/ | Stage directory |
| MAKE_FASTQS_CS/MAKE_FASTQS/fork0/ | Stage output directory |
| MAKE_FASTQS_CS/MAKE_FASTQS/fork0/files/ | Stage output files |
| MAKE_FASTQS_CS/MAKE_FASTQS/BCL2FASTQ_WITH_SAMPLESHEET/ | Stage directory |
| MAKE_FASTQS_CS/MAKE_FASTQS/BCL2FASTQ_WITH_SAMPLESHEET/fork0/ | Stage output directory |
| MAKE_FASTQS_CS/MAKE_FASTQS/BCL2FASTQ_WITH_SAMPLESHEET/fork0/chnk0/ | Chunk output directory |
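The same labels can be derived by walking the top-level pipeline stage directory. The sketch below is an illustration only: it assumes that stage output directories are always named `fork` followed by a number and chunk output directories `chnk` followed by a number, as in the listing above, and the function name `label_directories` is hypothetical.

```python
import re
from pathlib import Path

FORK_RE = re.compile(r"^fork\d+$")   # assumed pattern for stage output directories
CHUNK_RE = re.compile(r"^chnk\d+$")  # assumed pattern for chunk output directories


def label_directories(top_stage_dir):
    """Map each directory (relative to the pipestance root) to its role."""
    top = Path(top_stage_dir).resolve()
    root = top.parent
    labels = {top.relative_to(root).as_posix(): "top-level pipeline stage directory"}

    def visit(directory, inside_fork):
        for child in sorted(p for p in directory.iterdir() if p.is_dir()):
            rel = child.relative_to(root).as_posix()
            if inside_fork:
                # Inside a fork, label known subdirectories and do not descend.
                if child.name == "files":
                    labels[rel] = "stage output files"
                elif CHUNK_RE.match(child.name):
                    labels[rel] = "chunk output directory"
                elif child.name in ("split", "join"):
                    labels[rel] = f"special stage output directory ({child.name})"
            elif FORK_RE.match(child.name):
                labels[rel] = "stage output directory"
                visit(child, inside_fork=True)
            else:
                labels[rel] = "stage directory"
                visit(child, inside_fork=False)

    visit(top, inside_fork=False)
    return labels
```

Calling `label_directories("MAKE_FASTQS_CS")` from inside a mkfastq pipeline output directory should reproduce the right-hand column of the listing for the directories that exist.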

The metadata contained in the pipeline output directory includes

| File Name | Description |
|---|---|
|  | Metadata cache that is populated when a pipestance completes to minimize re-aggregation of metadata |
|  | The MRO call used to invoke this pipestance |
|  | The log messages that are reported to your terminal window when running cellranger-arc commands |
| _mrosource | The entire MRO describing the pipeline with all @include statements dereferenced |
| _perf | Detailed runtime performance data for every stage in the pipestance |
| _timestamp | The start and finish time for this pipestance |
| _vdrkill | A list of all of the volatile data (temporary files) removed during pipeline execution as well as total number of files and bytes deleted |
| _versions | Versions of the components used by the pipeline |
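Because metadata files usually hold either unstructured text or JSON, a reader can try JSON first and fall back to raw text. This is a hedged sketch: which files are JSON is not stated here, so the code makes no per-file assumption, and the helper names `read_metadata` and `load_pipestance_metadata` are hypothetical. The files are only read, never written.

```python
import json
from pathlib import Path


def read_metadata(path):
    """Return parsed JSON if the file is valid JSON, otherwise its raw text."""
    text = Path(path).read_text(errors="replace")
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def load_pipestance_metadata(pipestance_dir):
    """Read every underscore-prefixed file at the top of a pipeline output directory."""
    return {
        entry.name: read_metadata(entry)
        for entry in sorted(Path(pipestance_dir).iterdir())
        if entry.is_file() and entry.name.startswith("_")
    }
```

Note that a text file whose entire content happens to be valid JSON (for example a bare number) would be returned as a parsed value, and that files such as `_perf` may be large, so loading everything at once is a convenience rather than a recommendation.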

Stage directories contain stage output directories, stage output files, and the stage directories of any child stages or pipelines.

Stage output directories typically contain:

| File Name | Contents |
|---|---|
| files/ | Directory containing any files created by this stage that were not considered volatile (temporary) |
| split/ | A special stage output directory for the step that divided this stage's input into parallel chunks |
| chnkN/ | Chunk output directory for the Nth parallel chunk executed |
| join/ | A special stage output directory for the step that recombined this stage's parallel output chunks into a single output dataset again |
| _complete | A file that, when present, signifies that this stage has successfully completed |
| _errors | A file that, when present, signifies that this stage failed. Contains the errors that resulted in stage failure. |
| _invocation | The MRO call used to execute this stage by the Martian framework |
| _outs | The output files generated by this stage |
| _vdrkill | A list of all of the volatile data (temporary files) removed during pipeline execution as well as total number of files and bytes deleted |
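The `_complete` and `_errors` marker files make it possible to check the state of a stage without parsing any logs. The sketch below is illustrative only; the status strings and the function names `stage_output_status` and `find_failures` are assumptions, and "incomplete" covers both stages that have not started and stages still running.

```python
from pathlib import Path


def stage_output_status(stage_output_dir):
    """Classify one stage output directory from its marker files."""
    directory = Path(stage_output_dir)
    if (directory / "_errors").exists():
        return "failed"
    if (directory / "_complete").exists():
        return "complete"
    return "incomplete"


def find_failures(top_stage_dir):
    """Return (directory, error text) pairs for every _errors file found."""
    failures = []
    for errors_file in sorted(Path(top_stage_dir).rglob("_errors")):
        failures.append((errors_file.parent, errors_file.read_text(errors="replace")))
    return failures
```

For example, `find_failures("MAKE_FASTQS_CS")` run inside a mkfastq pipeline output directory would list each failing directory together with the error text it recorded.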

Chunk output directories are a subset of stage output directories that, in addition to the aforementioned stage output, may contain:

| File Name | Contents |
|---|---|
| _args | The arguments passed to the stage's stage code |
| _jobinfo | Metadata describing the stage's execution, including performance metrics, job manager jobid and jobname, and process ID |
| _jobscript | The script submitted to the cluster job manager (cluster mode) |
| _stdout | Any stage code output that was printed to the stdout stream |
| _stderr | Any stage code output that was printed to the stderr stream |
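When a stage misbehaves, the per-chunk `_stdout` and `_stderr` files are often the first place to look. The following sketch gathers the last few lines of one stream from every chunk of a stage output directory. It assumes chunk directories are named `chnk` followed by a number; the function name `chunk_log_tails` and its parameters are hypothetical.

```python
from pathlib import Path


def chunk_log_tails(stage_output_dir, stream="_stderr", tail_lines=20):
    """Return {chunk directory name: last lines of the chosen stream file}."""
    tails = {}
    for chunk in sorted(Path(stage_output_dir).glob("chnk*")):
        log = chunk / stream
        if not log.is_file():
            continue  # this chunk did not record the requested stream
        lines = log.read_text(errors="replace").splitlines()
        tails[chunk.name] = lines[-tail_lines:] if tail_lines > 0 else []
    return tails
```

A call such as `chunk_log_tails("MAKE_FASTQS_CS/MAKE_FASTQS/BCL2FASTQ_WITH_SAMPLESHEET/fork0")` only reads the files, which is consistent with treating the metadata as read-only.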

These metadata files should be treated as read-only, and altering the contents of metadata files is not recommended.