DiffBio
End-to-end differentiable bioinformatics pipelines built on JAX/Flax and Datarax
DiffBio provides differentiable implementations of core bioinformatics algorithms, so that entire genomic analysis pipelines can be optimized with gradients. Built on the Datarax framework, it applies automatic differentiation to sequence alignment, variant calling, and related genomic analysis steps.
Key Features
Differentiable Operators
- Smith-Waterman Alignment: Smooth, differentiable sequence alignment with soft gap penalties
- Pileup Generation: Differentiable read pileup computation for variant detection
- Quality Filtering: Soft-thresholded quality score filtering
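As an illustration of what "soft-thresholded" filtering can mean, the sketch below replaces a hard keep/drop mask with a sigmoid weight per base. It is a minimal example under stated assumptions, not DiffBio's implementation: the function name `soft_quality_weights`, the `temperature` parameter, and the quality values are all hypothetical.

```python
import jax
import jax.numpy as jnp


def soft_quality_weights(quality, threshold, temperature=1.0):
    """Return a weight in (0, 1) per base instead of a hard keep/drop mask.

    Hypothetical helper for illustration; not part of the DiffBio API.
    """
    return jax.nn.sigmoid((quality - threshold) / temperature)


# Made-up Phred-like quality scores for five bases.
quality = jnp.array([5.0, 15.0, 20.0, 25.0, 40.0])
threshold = 20.0

# Bases well above the threshold get weights near 1, bases well below near 0.
weights = soft_quality_weights(quality, threshold, temperature=2.0)


def total_weight(t):
    """Sum of soft weights, viewed as a function of the threshold."""
    return soft_quality_weights(quality, t, temperature=2.0).sum()


# Because the mask is smooth, the threshold itself receives a gradient.
grad_threshold = jax.grad(total_weight)(threshold)
print(weights, grad_threshold)
```

A hard mask such as `quality >= threshold` has zero gradient almost everywhere, which is why a smooth surrogate is needed before a threshold can be learned.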
End-to-End Pipelines
- Variant Calling Pipeline: Complete differentiable pipeline from reads to variants
- Composable Architecture: Chain operators using the Datarax framework
Training Utilities
- Gradient-based Optimization: Train alignment and variant calling models end-to-end
- Custom Loss Functions: Flexible loss definitions for bioinformatics tasks
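To make the idea of gradient-based training with a custom loss concrete, here is a small sketch that learns a single quality threshold by plain gradient descent. It uses only core JAX rather than DiffBio's training utilities, whose API is not shown in this page; the data, the labels, the learning rate, and the function `loss_fn` are assumptions made for the example.

```python
import jax
import jax.numpy as jnp

# Made-up training data: quality scores and whether each base was trustworthy.
quality = jnp.array([8.0, 12.0, 18.0, 27.0, 33.0, 38.0])
labels = jnp.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])


def loss_fn(threshold):
    """Binary cross-entropy between soft keep-probabilities and labels.

    Written in the numerically stable "logits" form:
    BCE(z, y) = softplus(z) - y * z, with z = quality - threshold.
    """
    logits = quality - threshold
    return jnp.mean(jax.nn.softplus(logits) - labels * logits)


threshold = jnp.array(10.0)  # deliberately poor starting point
initial_loss = loss_fn(threshold)

# Compile one function that returns both the loss and its gradient.
step = jax.jit(jax.value_and_grad(loss_fn))

learning_rate = 1.0
for _ in range(200):
    loss, grad = step(threshold)
    threshold = threshold - learning_rate * grad

final_loss = loss_fn(threshold)
print(float(threshold), float(initial_loss), float(final_loss))
```

In this toy setup the threshold drifts toward the gap between the low-quality and high-quality examples. A real training run would more likely use an optimizer library and batched data.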
Quick Start
Installation
Basic Usage
import jax
import jax.numpy as jnp
from flax import nnx

from diffbio.operators.singlecell import SoftKMeansClustering, SoftClusteringConfig

# Configure and create an operator
config = SoftClusteringConfig(n_clusters=5, n_features=20)
operator = SoftKMeansClustering(config, rngs=nnx.Rngs(42))

# Generate synthetic data and run
data = {"embeddings": jax.random.normal(jax.random.key(0), (100, 20))}
result, state, metadata = operator.apply(data, {}, None)
print(f"Cluster assignments: {result['cluster_assignments'].shape}")
Gradient Computation
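# Gradients flow through all operators.
# This reuses `operator` and `data` from the Basic Usage example.
def loss_fn(input_data):
    result, _, _ = operator.apply(input_data, {}, None)
    return result["cluster_assignments"].sum()

# jax.grad differentiates with respect to every array in the input dict.
grad = jax.grad(loss_fn)(data)
print(f"Gradient is non-zero: {bool(jnp.any(grad['embeddings'] != 0))}")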
Architecture
DiffBio follows the Datarax operator pattern for composable, differentiable data processing:
graph LR
A[Raw Reads] --> B[Quality Filter]
B --> C[Smith-Waterman Alignment]
C --> D[Pileup Generation]
D --> E[Variant Calling]
E --> F[Variants]
style A fill:#e0e7ff,stroke:#4338ca,color:#312e81
style B fill:#e0e7ff,stroke:#4338ca,color:#312e81
style C fill:#dbeafe,stroke:#2563eb,color:#1e3a5f
style D fill:#dbeafe,stroke:#2563eb,color:#1e3a5f
style E fill:#ede9fe,stroke:#7c3aed,color:#4c1d95
style F fill:#d1fae5,stroke:#059669,color:#064e3b
Each operator in the pipeline is fully differentiable, allowing gradients to flow from the final output back through all processing steps.
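The following toy sketch shows what gradient flow through chained stages looks like in plain JAX. It does not use the Datarax operator API; the three stage functions, their names, and the data are hypothetical stand-ins for a quality filter, a pileup, and a variant score.

```python
import jax
import jax.numpy as jnp


def quality_filter(quality, threshold):
    """Soft per-read weight in (0, 1) instead of a hard keep/drop decision."""
    return jax.nn.sigmoid(quality - threshold)


def pileup(base_one_hot, weights):
    """Weighted base counts at one position: (reads, 4) -> (4,)."""
    return (base_one_hot * weights[:, None]).sum(axis=0)


def alt_allele_fraction(counts, alt_index=1):
    """Fraction of (soft) counts supporting the alternate base."""
    return counts[alt_index] / (counts.sum() + 1e-8)


def pipeline(threshold, quality, base_one_hot):
    """Chain the three stages; every step is differentiable."""
    weights = quality_filter(quality, threshold)
    counts = pileup(base_one_hot, weights)
    return alt_allele_fraction(counts)


# Four made-up reads covering one position, encoded as A=0, C=1, G=2, T=3.
bases = jnp.array([0, 1, 1, 0])
base_one_hot = jax.nn.one_hot(bases, 4)
quality = jnp.array([24.0, 18.0, 26.0, 16.0])

fraction = pipeline(20.0, quality, base_one_hot)

# The gradient of the final output reaches the first stage's parameter.
grad_threshold = jax.grad(pipeline)(20.0, quality, base_one_hot)
print(fraction, grad_threshold)
```

The non-zero `grad_threshold` is the point of the example: a parameter of the earliest stage can be tuned against a loss defined on the final output.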
Documentation Structure
Why Differentiable Bioinformatics?
Traditional bioinformatics pipelines consist of discrete, non-differentiable operations that prevent end-to-end optimization. DiffBio addresses this by:
- Smooth Approximations: Using temperature-scaled softmax and sigmoid functions to create differentiable versions of discrete operations
- Gradient Flow: Enabling gradients to propagate through entire pipelines for joint optimization
- Learnable Parameters: Allowing alignment scores, gap penalties, and quality thresholds to be learned from data
- GPU Acceleration: Leveraging JAX's XLA compilation for high-performance computation
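The smooth-approximation idea in the list above can be sketched with two standard relaxations: a temperature-scaled softmax in place of `argmax`, and a temperature-scaled log-sum-exp in place of `max`. The helper names `soft_argmax_weights` and `soft_max` are chosen for this example and are not taken from DiffBio.

```python
import jax
import jax.numpy as jnp


def soft_argmax_weights(scores, temperature):
    """Smooth stand-in for a one-hot argmax; sharpens as temperature -> 0."""
    return jax.nn.softmax(scores / temperature)


def soft_max(scores, temperature):
    """Smooth upper bound on max(scores) via temperature-scaled log-sum-exp."""
    return temperature * jax.nn.logsumexp(scores / temperature)


scores = jnp.array([1.0, 3.0, 2.0])

warm = soft_argmax_weights(scores, 1.0)   # weight spread over all entries
cold = soft_argmax_weights(scores, 0.01)  # nearly one-hot at the largest score

# The gradient of the smooth max is exactly the softmax weighting, so every
# candidate receives some gradient instead of only the single winner.
grad_of_soft_max = jax.grad(soft_max)(scores, 1.0)
print(warm, cold, grad_of_soft_max)
```

Lower temperatures track the discrete operation more closely but concentrate the gradient on fewer entries, so the temperature trades fidelity against how easily the model can be optimized.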
Applications
- Adaptive Alignment: Learn optimal scoring parameters for specific sequence types
- Joint Optimization: Train variant callers end-to-end together with the alignment step that feeds them
- Transfer Learning: Fine-tune pre-trained models on domain-specific data
- Neural Integration: Combine traditional algorithms with neural network components
Citation
If you use DiffBio in your research, please cite:
@software{diffbio2026,
title={DiffBio: End-to-End Differentiable Bioinformatics Pipelines},
author={Shafiei, Mahdi},
year={2026},
url={https://github.com/avitai/DiffBio},
version={0.1.0}
}
License
DiffBio is released under the MIT License.
Contributing
We welcome contributions! See our Contributing Guide for details.