Genomic Processing
Raw sequencing data requires extensive processing before biological analysis can begin: adapter contamination must be removed, duplicates identified, errors corrected, reads mapped to a reference, and in some cases assembled de novo. DiffBio provides eight differentiable operators for this work: six cover the pipeline from raw reads to assembled and binned genomes, and two are specialized operators for CRISPR guide design and population genetics.
The Sequencing Pipeline
A typical next-generation sequencing experiment produces millions of short reads (100-300 bp) that must be processed through a series of steps:
graph TB
A["Raw Reads"] --> B["Adapter Removal<br/>(trim technical sequences)"]
B --> C["Error Correction<br/>(fix sequencing mistakes)"]
C --> D["Duplicate Weighting<br/>(handle PCR amplification bias)"]
D --> E["Read Mapping<br/>(align to reference genome)"]
E --> F["Downstream Analysis<br/>(variant calling, expression, etc.)"]
style A fill:#d1fae5,stroke:#059669,color:#064e3b
style B fill:#e0e7ff,stroke:#4338ca,color:#312e81
style C fill:#e0e7ff,stroke:#4338ca,color:#312e81
style D fill:#e0e7ff,stroke:#4338ca,color:#312e81
style E fill:#e0e7ff,stroke:#4338ca,color:#312e81
style F fill:#d1fae5,stroke:#059669,color:#064e3b
Each step traditionally uses hard decisions -- trim or keep, duplicate or unique, mapped or unmapped. DiffBio replaces these binary decisions with soft, differentiable operations.
Adapter Contamination
Sequencing library preparation ligates adapter sequences to DNA fragments. When the fragment is shorter than the read length, the sequencer reads into the adapter, contaminating the biological sequence.
SoftAdapterRemoval uses differentiable Smith-Waterman alignment to find
adapter matches, then applies sigmoid-weighted soft trimming based on the
match position. Instead of hard clipping at a fixed threshold:
- The alignment score determines match confidence (continuous, not binary)
- Soft trimming gradually down-weights bases near the adapter boundary
- Temperature controls the sharpness of the trimming transition
Both the match threshold and trimming temperature are learnable, allowing the operator to adapt to dataset-specific adapter contamination patterns.
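The interaction between match confidence and soft trimming can be illustrated with a minimal NumPy sketch. This is not the SoftAdapterRemoval implementation: the function name `soft_trim_weights`, its parameters, and the choice to multiply a confidence sigmoid by a positional sigmoid are assumptions made for illustration, and the alignment score is passed in directly rather than computed by Smith-Waterman.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_trim_weights(read_length, adapter_start, match_score, threshold, temperature=1.0):
    """Per-base keep weights in (0, 1]. Hypothetical helper, not the DiffBio API."""
    positions = np.arange(read_length)
    # Continuous match confidence: how far the alignment score exceeds the threshold.
    confidence = sigmoid((match_score - threshold) / temperature)
    # Soft indicator that a base lies at or beyond the adapter boundary.
    in_adapter = sigmoid((positions - adapter_start) / temperature)
    # Down-weight adapter bases only to the extent the match is trusted.
    return 1.0 - confidence * in_adapter

# A confident adapter match starting at position 6 of a 10-base read.
strong = soft_trim_weights(10, adapter_start=6.0, match_score=20.0, threshold=10.0, temperature=0.5)
# A weak match below the threshold leaves the read essentially untouched.
weak = soft_trim_weights(10, adapter_start=6.0, match_score=2.0, threshold=10.0, temperature=0.5)
```

Lowering `temperature` in this sketch moves the weights toward a hard clip at `adapter_start`, while raising it spreads the transition over more bases.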
PCR Duplicate Handling
PCR amplification before sequencing creates identical copies of DNA fragments. These duplicates inflate apparent read depth and can bias variant calling.
Traditional tools (Picard MarkDuplicates, samtools markdup) perform hard binary classification: each read is either a duplicate (flagged or removed) or unique (kept). This discards information -- some reads marked as duplicates may represent independent observations of the same genomic position.
DifferentiableDuplicateWeighting replaces binary removal with probabilistic
weighting:
| Approach | Decision | Gradient Flow |
|---|---|---|
| Traditional | Binary: remove or keep | None |
| DiffBio | Continuous weight in [0, 1] | Through similarity computation |
The operator learns sequence embeddings, computes pairwise similarity, and assigns weights inversely proportional to soft cluster size. Highly similar reads receive lower weights; unique reads receive full weight. Temperature controls the sharpness of the clustering.
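As a rough illustration of weighting by soft cluster size, the sketch below uses fixed embeddings and cosine similarity. The function name `soft_duplicate_weights`, the exponential membership kernel, and the default temperature are assumptions for this example; the actual operator learns its embeddings and may define similarity differently.

```python
import numpy as np

def soft_duplicate_weights(embeddings, temperature=0.1):
    """Weight each read by the inverse of its soft cluster size (illustrative only)."""
    # Cosine similarity between every pair of read embeddings.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Soft membership: 1 for identical reads, decaying as similarity drops.
    membership = np.exp((similarity - 1.0) / temperature)
    # Each read counts itself (membership 1) plus a fraction of its near-duplicates.
    cluster_size = membership.sum(axis=1)
    return 1.0 / cluster_size

# Reads 0 and 1 are identical; read 2 is unrelated.
embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
weights = soft_duplicate_weights(embeddings)
```

In this toy input the two identical reads share their weight roughly equally and the unrelated read keeps a weight close to 1, matching the behaviour described above.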
Sequencing Error Correction
Sequencing platforms introduce errors at characteristic rates -- Illumina produces approximately 0.1-1% substitution errors, with error rates increasing toward read ends. These errors propagate to downstream analysis as false variants or misalignments.
SoftErrorCorrection uses a neural network (MLP) to predict corrected base
probabilities from local sequence context:
- A sliding window of size \(w\) (default 11) captures the local context around each position
- Both sequence (one-hot) and quality score features are concatenated
- The MLP predicts a probability distribution over the 4 bases
- Temperature-controlled softmax blends the original and corrected calls
The operator is inspired by the DeepConsensus approach for consensus calling. Quality scores serve as prior confidence -- positions with low quality are corrected more aggressively.
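One way to picture quality scores acting as prior confidence is a per-position blend between the original call and the network's prediction. The sketch below is an assumption-heavy illustration, not the SoftErrorCorrection code: `blend_calls` is a hypothetical name, the logits stand in for MLP output, and the blend weight is taken from the standard Phred definition (error probability 10^(-Q/10)), which may differ from how the operator actually uses quality.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def blend_calls(original_one_hot, correction_logits, phred, temperature=1.0):
    """Blend original and predicted base distributions, trusting high-quality calls more."""
    corrected = softmax(correction_logits, temperature)
    # Phred definition: Q = -10 * log10(error probability).
    error_prob = 10.0 ** (-phred / 10.0)
    trust = (1.0 - error_prob)[:, None]
    return trust * original_one_hot + (1.0 - trust) * corrected

# Two positions originally called "A" (base order A, C, G, T is assumed);
# the stand-in logits favour "C" at both positions.
original = np.array([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
logits = np.array([[0.0, 4.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0]])
phred = np.array([40.0, 2.0])  # high quality, then very low quality
blended = blend_calls(original, logits, phred)
```

With these inputs the high-quality position keeps its original call, while the low-quality position shifts most of its probability mass to the predicted base.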
Read Mapping
Read mapping aligns short reads to a reference genome to determine their
genomic origin. NeuralReadMapper implements a cross-attention approach:
- Sequence encoding: Both the read and reference window are encoded using a shared transformer encoder with positional embeddings
- Cross-attention: Multi-head attention between read and reference embeddings computes soft alignment scores at each position
- Position prediction: A softmax over reference positions produces a mapping probability distribution
The output is a soft mapping -- each read has a probability distribution over reference positions rather than a single hard alignment. This is valuable for multi-mapping reads (repetitive regions) where the true origin is ambiguous.
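The final softmax step can be sketched with plain dot-product scores. This simplification pools the read into a single vector and skips the transformer encoder and multi-head attention entirely; `soft_mapping` and the toy embeddings are hypothetical and only show how a repeat yields a split probability distribution.

```python
import numpy as np

def soft_mapping(read_embedding, reference_embeddings, temperature=1.0):
    """Probability distribution over reference positions (illustrative sketch)."""
    dim = read_embedding.shape[0]
    # Scaled dot-product score between the read and each reference position.
    scores = reference_embeddings @ read_embedding / np.sqrt(dim)
    scaled = scores / temperature
    scaled = scaled - scaled.max()  # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Positions 0 and 2 carry the same embedding, mimicking a repeated region.
reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
read = np.array([4.0, 0.0])
mapping = soft_mapping(read, reference)
```

Here the read matches both repeat copies equally well, so the two positions receive equal probability instead of one being chosen arbitrarily.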
Genome Assembly
When no reference genome is available, reads must be assembled de novo into contiguous sequences (contigs). Assembly algorithms construct an overlap graph (or de Bruijn graph) where nodes represent reads (or k-mers) and edges represent overlaps.
GNNAssemblyNavigator uses graph attention networks to navigate assembly
graphs:
- Message passing: GATv2 attention layers propagate information between connected nodes, learning which overlaps are most reliable
- Edge scoring: Each edge receives a traversal score based on the node representations of its endpoints
- Soft path selection: Temperature-controlled softmax selects the next edge to traverse, enabling gradient flow through the assembly path
This differentiable approach enables joint optimization of assembly decisions -- the GNN learns to prefer paths that produce longer, more accurate contigs.
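Edge scoring and soft path selection can be sketched as follows. A dot product between endpoint embeddings stands in for the learned edge scorer, and the embeddings are fixed rather than produced by GATv2 layers; `next_edge_probabilities` is a hypothetical helper, not part of GNNAssemblyNavigator.

```python
import numpy as np

def next_edge_probabilities(node_embeddings, edges, current, temperature=1.0):
    """Softmax over the outgoing edges of `current` (illustrative sketch)."""
    outgoing = [(u, v) for (u, v) in edges if u == current]
    # Stand-in edge score: similarity between the two endpoint representations.
    scores = np.array([node_embeddings[u] @ node_embeddings[v] for u, v in outgoing])
    scaled = (scores - scores.max()) / temperature
    probs = np.exp(scaled)
    return outgoing, probs / probs.sum()

node_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
edges = [(0, 1), (0, 2), (1, 3)]

# Same graph, two temperatures: lower temperature approaches a greedy choice.
outgoing, soft = next_edge_probabilities(node_embeddings, edges, current=0, temperature=1.0)
_, sharp = next_edge_probabilities(node_embeddings, edges, current=0, temperature=0.1)
```

Because every candidate edge keeps a nonzero probability, a loss defined on the resulting path can still send gradient to edges that were not the top choice.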
Metagenomic Binning
Environmental samples contain DNA from many organisms mixed together. Metagenomic binning groups assembled contigs by their organism of origin.
DifferentiableMetagenomicBinner implements a VAMB-style VAE approach:
- Input features: Tetranucleotide frequencies (TNF, 136 canonical 4-mers) and abundance profiles across samples
- VAE encoding: An encoder maps combined features to a latent space
- Latent clustering: Contigs from the same genome cluster together in latent space
- Soft bin assignment: Temperature-controlled softmax assigns contigs to bins with continuous probabilities
The beta-VAE objective balances reconstruction quality against latent space regularity, producing clusters that correspond to individual genomes.
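The figure of 136 canonical 4-mers follows from merging each tetranucleotide with its reverse complement: of the 256 possible 4-mers, 16 are their own reverse complement and the remaining 240 collapse into 120 pairs. A short sketch of that feature computation is shown here; the helper names `canonical` and `tnf`, and the choice to skip windows containing ambiguous bases, are assumptions for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

# Every distinct canonical tetranucleotide, in a fixed order.
CANONICAL_4MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})

def tnf(sequence):
    """Normalized canonical tetranucleotide frequencies for one contig."""
    counts = dict.fromkeys(CANONICAL_4MERS, 0)
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if set(kmer) <= set("ACGT"):  # skip windows with N or other symbols
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in CANONICAL_4MERS]

features = tnf("ACGTACGTAC")
```

The resulting vector has one entry per canonical 4-mer and would be concatenated with the abundance profile before encoding.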
CRISPR Guide Design
CRISPR-Cas9 genome editing requires a guide RNA (gRNA) that directs the Cas9 nuclease to a specific genomic target. Guide efficiency varies dramatically based on sequence context -- some guides cut efficiently while others fail.
DifferentiableCRISPRScorer predicts on-target efficiency using a
DeepCRISPR-inspired CNN architecture:
| Input | Shape | Description |
|---|---|---|
| Guide sequence | (23, 4) | One-hot encoded 20nt guide + 3nt PAM |
| Epigenetic features | (23, \(k\)) | Optional chromatin accessibility, etc. |
The 1D CNN extracts local sequence patterns predictive of cutting efficiency, followed by fully connected layers that produce a score in [0, 1]. Because the scoring is differentiable, it can be integrated into guide optimization pipelines -- gradients indicate which sequence positions most affect the predicted efficiency.
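The (23, 4) guide input from the table can be built with a simple one-hot encoder. The sketch below assumes the column order A, C, G, T and a hypothetical helper name `one_hot_guide`; the channel order the scorer actually expects is not stated here and should be checked against the API reference.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order

def one_hot_guide(guide, pam):
    """Encode a 20 nt guide plus 3 nt PAM as a (23, 4) one-hot array."""
    if len(guide) != 20 or len(pam) != 3:
        raise ValueError("expected a 20 nt guide and a 3 nt PAM")
    sequence = (guide + pam).upper()
    encoded = np.zeros((23, 4), dtype=np.float32)
    for i, base in enumerate(sequence):
        # str.index raises ValueError for ambiguous bases such as N.
        encoded[i, BASES.index(base)] = 1.0
    return encoded

# Arbitrary example sequence followed by an NGG-style PAM.
encoded = one_hot_guide("GACGTTACCGGATCAATGCA", "TGG")
```

Optional epigenetic tracks of shape (23, k) would be supplied alongside this array, one value per position and feature.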
Population Genetics
DifferentiableAncestryEstimator implements a Neural ADMIXTURE-style model
for estimating ancestry proportions from genotype data. Given a genotype
vector of \(n\) SNPs and \(K\) ancestral populations:
- An autoencoder maps the genotype to a latent representation
- A softmax layer produces ancestry proportions \(\mathbf{q} \in \Delta^{K-1}\) (summing to 1)
- A decoder reconstructs genotype frequencies from the ancestry estimate
Temperature controls the sharpness of ancestry assignments -- lower temperature produces more confident (peaky) estimates, higher temperature allows more admixture. The entire model is differentiable, enabling gradient-based optimization of the number of populations \(K\) and integration with downstream association studies.
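The softmax and decoder steps can be sketched in a few lines. The encoder is omitted and its output is replaced by fixed logits; the decoder is assumed to be a linear mix of per-population allele frequencies, as in the classical admixture model. The names `ancestry_proportions` and `reconstruct_frequencies` and all numeric values are hypothetical.

```python
import numpy as np

def ancestry_proportions(logits, temperature=1.0):
    """Temperature-scaled softmax: a point on the simplex over K populations."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # numerical stability
    q = np.exp(scaled)
    return q / q.sum()

def reconstruct_frequencies(q, population_freqs):
    """Expected allele frequency per SNP as a q-weighted mix of population frequencies."""
    return q @ population_freqs

# Stand-in encoder output for K = 3 populations.
logits = np.array([2.0, 1.0, 0.0])
# Allele frequencies per population (rows) and SNP (columns), n = 4.
population_freqs = np.array([
    [0.9, 0.1, 0.5, 0.2],
    [0.2, 0.8, 0.5, 0.6],
    [0.1, 0.3, 0.9, 0.4],
])

q_soft = ancestry_proportions(logits, temperature=1.0)
q_sharp = ancestry_proportions(logits, temperature=0.25)
reconstruction = reconstruct_frequencies(q_soft, population_freqs)
```

Because the reconstruction is a convex combination of the population rows, each reconstructed frequency stays between the smallest and largest population frequency for that SNP.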
Why Differentiability Matters for Genomic Processing
Traditional genomic processing tools make hard decisions at each step. A read is either trimmed or not. A duplicate is either removed or kept. A read maps to one position or is discarded. These binary decisions cannot be revisited in light of downstream evidence.
DiffBio's differentiable operators enable:
- Adaptive preprocessing: Quality thresholds, trimming parameters, and error correction sensitivity adjust to the specific dataset through gradient-based optimization
- Joint preprocessing-analysis: A variant calling loss propagates gradients back through mapping, error correction, and duplicate weighting, learning preprocessing parameters that maximize variant detection accuracy
- Soft decisions preserve information: Probabilistic duplicate weights and soft mapping positions retain uncertainty that hard decisions discard, improving downstream statistical power
- Learned assembly strategies: The GNN assembly navigator learns traversal preferences from training data, adapting to genome-specific repeat structures and error profiles
Further Reading
- Preprocessing Operators -- adapter removal, deduplication, error correction
- Assembly & Mapping Operators -- neural read mapping and GNN assembly
- CRISPR Operators -- guide RNA scoring
- Population Operators -- ancestry estimation
- Preprocessing Pipeline -- end-to-end read processing
- Genomic Processing API -- full API reference
References
- Dominguez Mantes et al. "Neural ADMIXTURE for rapid genomic clustering." Nature Computational Science 3, 2023.
- Nissen et al. "Improved metagenome binning and assembly using deep variational autoencoders." Nature Biotechnology 39, 2021.
- Chuai et al. "DeepCRISPR: optimized CRISPR guide RNA design by deep learning." Genome Biology 19, 2018.
- Baid et al. "DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer." Nature Biotechnology 41, 2023.