V(D)J Assembly

The assembly process takes the reads for a single barcode as input. These reads are then glued together, outputting a set of assembled contigs that represent the best estimate of transcript sequences present. Each base in each contig is assigned a quality value. The numbers of UMIs and reads supporting each contig are also tracked.

The assembler uses the V(D)J reference sequence during assembly, unless the pipeline is run in de novo mode. Parts of the Annotation Algorithm page may be relevant for learning more about the assembly process.

Contig assembly is complicated by noise that can arise from many sources. Some sources of noise include:

Background (extracellular) mRNA
Cell doublets
Errors in transcription in the cell
Errors in reverse transcription to make cDNA
Random errors during sequencing
Index hopping in the sequencing process

Steps in the assembly algorithm

Step	Operation
Adapter trimming	Trim adapters using a custom algorithm.
Read subsampling	Downsample reads for a given barcode to retain a maximum of 80,000 reads. Additional reads beyond 80,000 do not improve results.
Read trimming	Trim off nucleotides in the read after the enrichment primers.
Graph formation	Build a De Bruijn graph using kmer length (k) = 20.
Reference-free graph simplification	Simplify the graph by removing noisy edges.
Reference-assisted graph simplification	Use the V(D)J reference to remove noisy edges.
Contaminant filtering	Filter out barcodes that are likely to be contaminants.
Chimera filtering	Filter out barcodes with chimeric contigs that are likely artifacts.
UMI filtering	Filter out UMIs that are likely to be artifacts.
Contig construction	Build contigs by looking for the best path through the graph for each UMI.
Competitive deletion of contigs	Compare contigs, remove weak contigs that are likely to be artifacts.
Contig confidence	Define contigs that are likely to represent bona fide transcripts from a single cell (associated to one barcode).
Contig quality scores	Assign a quality score to each base on each contig.

Known adapter and primer sequences from the 5’ and 3’ ends of reads are trimmed using a custom 10x Genomics trimming tool.

Some cells have extremely high coverage. High coverage can result either from genuinely deep sequencing or from high mRNA expression in plasma cells (commonly seen in BCR).

Very high coverage (greater than 80,000 reads) of transcripts can be problematic because it degrades computational performance and adds little information. Therefore, coverage is capped to a maximum of 80,000 reads per barcode. If there are more than 80,000 reads for any given 10x Barcode, the reads are downsampled.
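
The cap described above can be illustrated with a minimal sketch. The text does not say how reads are chosen when downsampling, so uniform random sampling without replacement and the fixed seed are assumptions, and the function name `subsample_reads` is hypothetical.

```python
import random

MAX_READS_PER_BARCODE = 80_000  # cap stated in the text


def subsample_reads(reads, max_reads=MAX_READS_PER_BARCODE, seed=0):
    """Return at most max_reads reads for one barcode.

    Assumption: reads are drawn uniformly at random without replacement.
    The fixed seed only makes the sketch reproducible.
    """
    reads = list(reads)
    if len(reads) <= max_reads:
        return reads  # nothing to do below the cap
    rng = random.Random(seed)
    return rng.sample(reads, max_reads)
```

Barcodes at or below the cap are returned unchanged, so only very deeply covered barcodes are affected.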

The inner enrichment primers hybridize to constant regions of V(D)J genes. Any bases to the right of those positions should not be present in the data. They are trimmed from the reads.

A De Bruijn graph using k = 20 is created and transformed into a directed graph. The edges of the graph are DNA sequences corresponding to unbranched paths in the De Bruijn graph.
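
As a rough illustration of this step, the sketch below builds a kmer graph from reads and collapses unbranched runs of kmers into single sequences. It is not the pipeline's implementation: the function names are hypothetical, kmers are treated as nodes for simplicity, and isolated cycles are ignored.

```python
def build_kmer_graph(reads, k=20):
    """Map each kmer to the set of kmers that directly follow it in some read."""
    successors = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            successors.setdefault(kmer, set())
        for left, right in zip(kmers, kmers[1:]):
            successors[left].add(right)
    return successors


def unbranched_paths(successors):
    """Collapse maximal unbranched runs of kmers into DNA sequences."""
    predecessors = {kmer: set() for kmer in successors}
    for left, rights in successors.items():
        for right in rights:
            predecessors[right].add(left)

    def is_interior(kmer):
        # A kmer continues a run if it has one way in and its
        # predecessor has one way out.
        preds = predecessors[kmer]
        return len(preds) == 1 and len(successors[next(iter(preds))]) == 1

    paths = []
    for start in successors:
        if is_interior(start):
            continue  # runs are emitted from their first kmer only
        seq, current = start, start
        while len(successors[current]) == 1:
            nxt = next(iter(successors[current]))
            if len(predecessors[nxt]) != 1:
                break
            seq += nxt[-1]  # extend the sequence by one base
            current = nxt
        paths.append(seq)
    return sorted(paths)
```

With k = 3 and the reads `AAACCC` and `AAACGG`, the shared prefix and the two diverging suffixes each become one sequence, which is the shape of a simple branch in the graph.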

A collection of heuristic steps is applied to simplify the graph. During this process read support on each edge is tracked and edited. Several examples of simplification steps are described:

Branch cleaning:
- For each branch in the graph, and for each UMI, if one branch has ten times more reads than a second branch, read support for the UMI from the second branch is removed.
- When two branches emanate from a vertex, the weaker branch is deleted based on these criteria:
  - There are at least twice as many reads on the strong branch.
  - There are fewer than 8 reads for any UMI on the weak branch.
  - For every UMI, the strong branch has twice as many reads as the weak branch, with at most one exception (such as events like alternate splicing) where the event is supported by only one UMI.
Path cleaning: For each UMI, the strongest path is defined. Then graph edges that are not on this path are deleted.
Component cleaning: For each UMI, if one graph component has ten times more reads supporting it than a second component, the read support for the second component is deleted.
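
The two branch-cleaning rules can be sketched as follows, with each branch represented as a dictionary from UMI to read count. This is a simplified reading of the criteria above, not the pipeline's code: "ten times more" and "twice as many" are interpreted as "at least", the exception rule is reduced to "at most one UMI violates the twofold ratio", and the function names are hypothetical.

```python
def prune_umi_support(branch_a, branch_b, ratio=10):
    """Remove a UMI's reads from a branch when the other branch has
    at least `ratio` times as many reads for that UMI."""
    a, b = dict(branch_a), dict(branch_b)
    for umi in set(a) & set(b):
        if a[umi] >= ratio * b[umi]:
            del b[umi]
        elif b[umi] >= ratio * a[umi]:
            del a[umi]
    return a, b


def weak_branch_is_deletable(strong, weak):
    """Apply the three deletion criteria to a strong/weak branch pair."""
    # Criterion 1: at least twice as many reads on the strong branch.
    total_ok = sum(strong.values()) >= 2 * sum(weak.values())
    # Criterion 2: every UMI has fewer than 8 reads on the weak branch.
    small_ok = all(n < 8 for n in weak.values())
    # Criterion 3: per-UMI twofold ratio, with at most one exception.
    exceptions = sum(1 for umi, n in weak.items() if strong.get(umi, 0) < 2 * n)
    return total_ok and small_ok and exceptions <= 1
```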

If the pipeline is run in reference-assisted mode (not de novo assembly), bubbles in the graph are popped with the aid of the reference sequence. There are several heuristic tests, all of which require that both bubble branches have the same length. An example scenario is when branch 1 is supported by at least three UMIs and has a kmer matching the reference, whereas branch 2 is supported by a single UMI, and has no kmers matching the reference. In this scenario, the weaker branch (branch 2) is deleted.
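
The example scenario above can be written as a small check. Only this one scenario is covered; the other heuristic tests are not described in the text and are not represented. The function names and the representation of a branch as a sequence plus a set of UMIs are assumptions.

```python
def kmers(seq, k):
    """Return the set of kmers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def branch_to_delete(seq1, umis1, seq2, umis2, reference_kmers, k=20):
    """Return 1 or 2 for the bubble branch to delete, or None if the
    example rule does not apply."""
    if len(seq1) != len(seq2):
        return None  # all tests require equal-length branches
    candidates = [
        (seq1, umis1, seq2, umis2, 2),  # branch 1 strong, branch 2 weak
        (seq2, umis2, seq1, umis1, 1),  # branch 2 strong, branch 1 weak
    ]
    for strong_seq, strong_umis, weak_seq, weak_umis, weak_index in candidates:
        strong_hits = kmers(strong_seq, k) & reference_kmers
        weak_hits = kmers(weak_seq, k) & reference_kmers
        if (len(strong_umis) >= 3 and strong_hits
                and len(weak_umis) == 1 and not weak_hits):
            return weak_index
    return None
```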

In cases where pairs of barcodes have identical productive contigs, the following criteria are used to label a barcode as a contaminant:

There is a 10-fold or greater difference in UMI counts between the two barcodes.
The sequence of the barcode with the smaller UMI count can be derived from the barcode with the higher UMI count by a 1-base-pair insertion or deletion (after correction).

If both conditions are met, the barcode with the smaller UMI count is marked as a contaminant and removed.

To filter out barcodes with contigs that share a common V region but have different CDR3 sequences, the algorithm follows these criteria:

Contigs must share a matching V region prefix sequence, allowing up to 1 mismatch.
CDR3 sequences must differ by a Hamming distance of at least 1.
The V region prefix match must be at least 25 bases long.
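
A minimal sketch of these three criteria for a pair of contigs is shown below. How the V region prefix is extracted is not specified in the text, so the sketch takes the prefix and CDR3 strings as inputs. Treating CDR3 sequences of unequal length as different is an assumption, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def looks_chimeric(v_prefix_a, cdr3_a, v_prefix_b, cdr3_b,
                   min_prefix=25, max_mismatches=1):
    """Check a contig pair against the three criteria listed above."""
    n = min(len(v_prefix_a), len(v_prefix_b))
    if n < min_prefix:
        return False  # prefix match must span at least 25 bases
    if hamming(v_prefix_a[:n], v_prefix_b[:n]) > max_mismatches:
        return False  # V prefixes do not match closely enough
    if len(cdr3_a) != len(cdr3_b):
        return True  # assumption: unequal lengths count as different
    return hamming(cdr3_a, cdr3_b) >= 1
```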

UMIs that survive these filtration steps are retained:

Find the single strongest path for each UMI. A strong path either contains a reference kmer, or if assembled de novo, matches a primer (described above).
Find good graph edges that appear on one or more strong paths.
Sort the reads based on these good graph edge assignments.
Find the UMIs for these reads.
Remove any UMI for which less than 50% of kmers are contained in good edges.
For reference-assisted assembly, if none of the strong paths had a V segment annotation, remove all the UMIs for that barcode.
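
The 50% kmer rule in the steps above can be sketched as follows. The sketch assumes that the good edges have already been reduced to a set of kmers and that kmers are counted per occurrence across all reads of a UMI; both are assumptions, as are the function names.

```python
def good_kmer_fraction(reads, good_edge_kmers, k=20):
    """Fraction of a UMI's kmer occurrences that lie in good edges."""
    total = hits = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            total += 1
            hits += read[i:i + k] in good_edge_kmers
    return hits / total if total else 0.0


def filter_umis(reads_by_umi, good_edge_kmers, k=20, min_fraction=0.5):
    """Keep UMIs with at least min_fraction of kmers in good edges."""
    return {
        umi: reads
        for umi, reads in reads_by_umi.items()
        if good_kmer_fraction(reads, good_edge_kmers, k) >= min_fraction
    }
```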

Initially, every strong path that either contains an enrichment primer (de novo assembly) or is annotated by a CDR3 (in the reference-assisted assembly) is called a contig.

Then, in reference-assisted assembly:

Contigs are trimmed to remove nucleotides occurring before the 5' UTR for a V segment and after enrichment primers.
Contigs that have only a C annotation are deleted. These deleted contigs are enriched for artifacts.
If a contig has a single-base indel relative to the reference that is supported by a single UMI (or one UMI plus one additional read), the indel is corrected to reflect the reference sequence.

Contigs with fewer than 300 base pairs are removed.

At this stage in assembly, there can be some redundancy among contigs arising from actual differences in transcripts, laboratory technical artifacts, or artifacts in contig construction.

Steps to eliminate redundancy:

The number of UMIs assigned to each contig is computed.
Junction selection:
- For reference-assisted assembly, if two productive contigs share the same junction sequence (defined as 100 bases ending at the end of a J segment), the junction supported by the most UMIs is selected. If there is a tie, junction selection is arbitrary.
- For de novo assembly, if two contigs are annotated with the same CDR3 sequence, the contig with the most UMIs is selected.
Non-productive contigs are de-duplicated. Any contig for which at least 75% of its kmers are contained in a productive contig is deleted. If 75% of the kmers in a non-productive contig are contained in a longer non-productive contig, the shorter contig is deleted. In de novo assembly, the same criteria apply, with productive replaced by "has a CDR3".
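
The de-duplication of non-productive contigs can be sketched as below. The sketch compares sets of distinct kmers, which is an assumption about how the 75% figure is computed, and the function names are hypothetical.

```python
def kmer_set(seq, k=20):
    """Return the set of distinct kmers in a contig."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contained_fraction(contig, other, k=20):
    """Fraction of the contig's distinct kmers that also occur in other."""
    own = kmer_set(contig, k)
    if not own:
        return 0.0
    return len(own & kmer_set(other, k)) / len(own)


def dedupe_nonproductive(productive, nonproductive, k=20, threshold=0.75):
    """Drop non-productive contigs that are mostly contained in a
    productive contig or in a longer non-productive contig."""
    kept = []
    for i, contig in enumerate(nonproductive):
        if any(contained_fraction(contig, p, k) >= threshold for p in productive):
            continue
        if any(
            len(other) > len(contig)
            and contained_fraction(contig, other, k) >= threshold
            for j, other in enumerate(nonproductive) if j != i
        ):
            continue
        kept.append(contig)
    return kept
```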

Competitive deletion of contigs aims to delete contigs that arise from extracellular mRNA in the sample or other background processes.

For reference-assisted assembly, the junction sequence of each productive contig is defined as the 100 nucleotides ending at the end of the annotated J segment. The junction UMI support for the contig is the number of UMIs that cover the junction sequence, and the junction read support is the number of reads that support it. Suppose two contigs have respective (junction UMI support, junction read support) = (u1,n1) and (u2,n2), and that (u1,n1) is sufficiently larger than (u2,n2). For example, u1 ≥ 2, u2 = 1, n1 ≥ 2 * n2 would qualify. (Some similar criteria also apply but are not listed here.) Then, if the contigs have the same chain type, the second contig is deleted.
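
The single example criterion given above (u1 ≥ 2, u2 = 1, n1 ≥ 2 * n2, same chain type) can be expressed as a short check. The other criteria are not listed in the text and are not covered here. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JunctionSupport:
    chain: str  # chain type, e.g. "TRA" or "IGH"
    umis: int   # junction UMI support (u)
    reads: int  # junction read support (n)


def second_is_outcompeted(first, second):
    """Return True if the example rule says the second contig is deleted."""
    return (
        first.chain == second.chain
        and first.umis >= 2
        and second.umis == 1
        and first.reads >= 2 * second.reads
    )
```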

In de novo mode, a similar criterion is applied to contigs containing a CDR3, but instead of the junction sequence used in the reference-assisted assembly, the 100 nucleotides starting at the end of the CDR3 are used. Chain type is not considered when deleting a contig, and the two strongest contigs are protected from deletion.

The presence of extracellular mRNA or multiplets can interfere with the accurate assembly of contigs or lead to inconsistent clonotype calling in cells. To mitigate this issue, the information associated with each cell barcode is evaluated.

These criteria help determine if the assembled contigs associated with a barcode can be considered high confidence:

Cells without any productive contigs are assigned low confidence.

In reference-assisted assembly, a cell barcode with productive contigs is deemed low confidence if it satisfies any of the following criteria:

Low confidence due to potential multiplet:
- There are more than four productive contigs.
- There are more than two contigs for any chain type (e.g. having three TRA contigs): Typically, a single cell barcode is expected to contain one productive TRA and one productive TRB chain for T cells, or one productive heavy chain and one productive light chain for B cells. The presence of additional productive contigs beyond these expectations suggests it may not reliably represent a single cell.
Low confidence due to ambient RNA:
- The number of filtered UMIs with three or more supporting reads is counted for each cell barcode. A cell barcode is deemed low confidence if it has fewer than three UMIs meeting this criterion.
Low confidence due to low junction support:
- A junction segment is defined as the last 80 bases where the right end of a J region aligns to the contig.
- To assess confidence, the maximum number of UMIs associated with any junction for a given cell barcode is determined. The cell barcode is called “high_confidence: False” if this maximum is one or less, and either of the following conditions is met:
  - The number of filtered UMIs, each with more than three supporting reads, is less than four.
  - There are more than two productive contigs associated with that cell barcode.
- Determine the minimum number of UMIs associated with any junction for a given cell barcode. The cell barcode is called "high_confidence: False" if the minimum number of UMIs associated with any junction is one or less and there are fewer than three filtered UMIs each with at least n/20 read pairs, where n is an “N50 of N50s” statistic.
- To determine n:
  - Count the number of reads supporting each UMI for a given cell barcode.
  - The N50 of UMI counts is the number of supporting reads for which the sum of supporting reads for UMIs above this cutoff is 50% of the total reads associated with the cell barcode.
  - The N50 of N50s is the same mathematical operation for the N50s of each cell barcode.
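
The "N50 of N50s" statistic can be sketched with the conventional N50 definition: sort the values from largest to smallest and return the value at which the running sum first reaches half of the total. Whether the pipeline breaks ties or handles the boundary in exactly this way is an assumption, and the function names are hypothetical.

```python
def n50(values):
    """Value at which the descending running sum reaches half the total."""
    ordered = sorted(values, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for value in ordered:
        running += value
        if running >= half:
            return value
    return 0  # empty input


def n50_of_n50s(reads_per_umi_by_barcode):
    """Apply n50 per barcode, then apply n50 to the per-barcode results."""
    return n50([n50(counts) for counts in reads_per_umi_by_barcode.values()])
```

For a barcode whose UMIs have 10, 5, 3, and 2 supporting reads, the total is 20, and the first UMI alone already reaches half of that, so its N50 is 10. The threshold n/20 used above would then be computed from the value returned by `n50_of_n50s`.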

In de novo assembly mode, similar criteria are used, with a key adjustment: the term "productive contig" is replaced by "contig having a CDR3 sequence". Additionally, the chain type test is omitted.

Each base in the assembled contig is assigned a Phred-scaled quality value (QV), representing an estimate of the probability of an error at that base. The QV is computed with a hierarchical model that accounts for the errors in:

Reverse transcription (RT): these errors affect all reads with the same UMI, and
Sequencing: these errors affect individual reads

The sequencing error model uses the reported sequencer QVs. At recommended sequencing depths, many reads per UMI are observed. This allows for sequencing errors in individual reads to be corrected rapidly.

The estimated error rate for the V(D)J RT reaction is 1e-4 per base. Therefore, assembled bases that are covered by a single UMI are assigned Q40, and bases covered by at least two UMIs are assigned Q60.
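
The relationship between the RT error rate and Q40 follows from the standard Phred definition, Q = -10 * log10(p). The sketch below shows that conversion and the assignment rule stated above. The Q60 value for two or more UMIs is taken directly from the text rather than derived, and the function names are hypothetical.

```python
import math

RT_ERROR_RATE = 1e-4  # per-base RT error rate stated in the text


def phred(error_probability):
    """Convert an error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_probability)


def base_quality(n_umis):
    """Quality assigned to an assembled base from its UMI coverage.

    The values 40 and 60 are the ones stated in the text.
    """
    if n_umis < 1:
        raise ValueError("an assembled base needs at least one UMI")
    return 40 if n_umis == 1 else 60
```

Here `phred(RT_ERROR_RATE)` evaluates to approximately 40, matching the quality assigned to bases covered by a single UMI.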
