Skip to content

Benchmarks¤

DiffBio includes a benchmark suite that evaluates operators on real datasets with field-standard metrics and comparison tables against published SOTA methods.


Running Benchmarks¤

# CI tier (~1 min, subsampled datasets)
uv run python benchmarks/run_all.py --tier ci --quick

# Nightly tier (~30 min, full Tier 1+2 benchmarks)
uv run python benchmarks/run_all.py --tier nightly

# Full suite (~2 hours on GPU)
uv run python benchmarks/run_all.py --tier full

# Filter by domain
uv run python benchmarks/run_all.py --tier nightly --domains singlecell

# Single benchmark
uv run python benchmarks/singlecell/bench_batch_correction.py --quick

Benchmark Suite¤

Single-Cell (6 benchmarks)¤

Benchmark Operator Dataset Metrics Baselines
Foundation Annotation LinearEmbeddingProbe on native or imported embeddings immune_human Accuracy, macro-F1, train loss DiffBio native, Geneformer precomputed, scGPT precomputed
Batch Correction DifferentiableHarmony immune_human (33K cells, 10 batches) Full scib-metrics (aggregate, silhouette, NMI, ARI, iLISI, cLISI) scVI, Harmony (R), Scanorama, BBKNN
Clustering SoftKMeansClustering immune_human (16 cell types) ARI, NMI, silhouette Leiden, Louvain, sklearn k-means
VAE Integration VAENormalizer (ZINB) immune_human ELBO + scib-metrics suite scVI, scANVI
Trajectory Pseudotime + Velocity pancreas (3.7K cells) Pseudotime range, velocity shape scVelo, DPT, Monocle3
GRN Inference DifferentiableGRN benGRN mESC (11.6K edges) AUPRC, precision, recall GENIE3, GRNBoost2, pySCENIC

Genomics (3 scaffold benchmarks)¤

Benchmark Operator Dataset Metrics Baselines
Promoter Classification LinearEmbeddingProbe on native, frozen, or imported sequence embeddings synthetic_genomics scaffold Accuracy, macro-F1, train loss DiffBio native, DiffBio frozen encoder, sequence precomputed adapters
TFBS Classification LinearEmbeddingProbe on native, frozen, or imported sequence embeddings synthetic_genomics scaffold Accuracy, macro-F1, train loss DiffBio native, DiffBio frozen encoder, sequence precomputed adapters
Splice-Site Classification LinearEmbeddingProbe on native, frozen, or imported sequence embeddings synthetic_genomics scaffold Accuracy, macro-F1, train loss DiffBio native, DiffBio frozen encoder, sequence precomputed adapters

Drug Discovery (3 benchmarks)¤

Benchmark Operator Dataset Metrics Baselines
MolNet BBBP CircularFingerprintOperator + MLP bbbp Test ROC-AUC, train ROC-AUC GCN, AttentiveFP, D-MPNN
Davis DTI DifferentiableDTIPipeline with TransformerSequenceEncoder + DifferentiableMolecularFingerprint davis RMSE, Pearson, Spearman non-differentiable fingerprint, differentiable drug encoder
BioSNAP DTI DifferentiableDTIPipeline with TransformerSequenceEncoder + DifferentiableMolecularFingerprint biosnap ROC-AUC, PR-AUC, MRR, Recall@1, Recall@5 non-differentiable fingerprint, differentiable drug encoder

Epigenomics (3 benchmarks)¤

Benchmark Operator Dataset Metrics Baselines
Peak Calling DifferentiablePeakCaller ENCODE_CTCF_K562 Precision, recall, F1, Jaccard MACS2, HOMER, Genrich
Contextual Peak Calling ContextualEpigenomicsOperator ablation suite synthetic_contextual_epigenomics Precision, recall, F1, chromatin consistency sequence-only, +TF, +TF+chromatin
Chromatin-State Prediction ContextualEpigenomicsOperator ablation suite synthetic_contextual_epigenomics Accuracy, chromatin consistency sequence-only, +TF, +TF+chromatin

Alignment (2 benchmarks)¤

Benchmark Operator Dataset Metrics Baselines
MSA SoftProgressiveMSA balifam100 (59 families) SP score, TC score MAFFT, ClustalW, MUSCLE, T-Coffee
Pairwise SmoothSmithWaterman balifam100 Alignment score BLAST, SSEARCH, FASTA

Structure Prediction (2 benchmarks)¤

Benchmark Operator Dataset Metrics Baselines
RNA Folding DifferentiableRNAFold ArchiveII Sensitivity, PPV, F1 ViennaRNA, LinearFold, EternaFold
Protein SS DifferentiableSecondaryStructure Ideal backbones Q3 accuracy DSSP, STRIDE, KAKSI

Molecular Dynamics (1 benchmark)¤

Benchmark Operator Dataset Metrics Baselines
LJ Fluid ForceFieldOperator + MDIntegrator 64K LJ system Steps/sec, energy drift jax-md (direct), LAMMPS

Statistical (1 benchmark)¤

Benchmark Operator Dataset Metrics Baselines
DE Analysis DifferentiableNBGLM immune_human (2 cell types) Concordance with t-test DESeq2, edgeR, Wilcoxon

Architecture¤

Built on the Datarax, Artifex, Opifex, and Calibrax ecosystem:

  • datarax: DataSourceModule and execution patterns for loading and iterating real datasets
  • artifex: benchmark and model-adapter reference patterns for model-facing integrations
  • opifex: scientific-training and optimization surfaces used by benchmarked operators
  • calibrax: BenchmarkResult, Metric, Point, TimingCollector, Store, regression detection, comparison, and publication export
  • DiffBio-specific: check_gradient_flow() for verifying operator differentiability

Benchmark Pattern¤

Every benchmark inherits from DiffBioBenchmark and implements only _run_core():

class MyBenchmark(DiffBioBenchmark):
    def _run_core(self) -> dict[str, Any]:
        # 1. Load data via DataSource
        source = MyDataSource(MyConfig(data_dir=self.data_dir))
        data = source.load()

        # 2. Create operator
        operator = MyOperator(config, rngs=nnx.Rngs(42))

        # 3. Train operator (for operators with learnable parameters)
        #    Use an unsupervised loss and gradient descent via optax.
        #    Physics-based operators (e.g. Smith-Waterman, DSSP) that
        #    have no learnable params can skip this step.
        opt = nnx.Optimizer(operator, optax.adam(1e-3), wrt=nnx.Param)
        for step in range(n_steps):
            loss, grads = nnx.value_and_grad(unsupervised_loss)(operator)
            opt.update(operator, grads)

        # 4. Evaluate trained operator
        result, _, _ = operator.apply(data, {}, None)

        # 5. Compute metrics
        metrics = evaluate_my_domain(result, ground_truth)

        # 6. Return standard dict
        return {
            "metrics": metrics,
            "operator": operator,
            "input_data": data,
            "loss_fn": lambda m, d: jnp.sum(m.apply(d, {}, None)[0]["output"]),
            "n_items": len(source),
            "iterate_fn": lambda: operator.apply(data, {}, None),
            "baselines": MY_BASELINES,
            "dataset_info": {"n_items": len(source)},
            "operator_name": "MyOperator",
            "dataset_name": "my_dataset",
        }

The base class handles: gradient flow check, throughput measurement, comparison table printing, BenchmarkResult construction, and CLI.

Standard Benchmark Tags¤

Every DiffBioBenchmark result emits these baseline Calibrax tags:

  • framework: always diffbio
  • operator: operator or pipeline name
  • dataset: dataset identifier used by the benchmark
  • task: canonical task slug derived from the benchmark name unless overridden

When a benchmark's _run_core() returns raw operator output under result_data, the base class also promotes canonical foundation-model metadata into Calibrax tags and result metadata:

  • model_family
  • adapter_mode
  • artifact_id
  • preprocessing_version

These values are decoded from the operator's foundation_model payload and stored once in the shared benchmark layer. The corresponding result metadata also includes foundation_model, comparison_axes, and one deterministic comparison_key so regression and comparison tooling can group by dataset, task, and artifact identity without benchmark-specific code.

Shared foundation-suite reports preserve the same contract. Each task report now exposes:

  • comparison_axes: the canonical ordering used for provenance-aware grouping
  • foundation_model: normalized per-model provenance when present
  • comparison_key: one deterministic row keyed by comparison_axes, with None for axes that do not apply to a given model

Each full foundation-suite report also stores:

  • regression_expectations: the canonical comparison_axes, task_order, per-task required_models ordering, Calibrax-native metric_defs, and the stored calibrax baseline/threshold policy used for regression checks
  • deferred_tasks: planned-but-unverified tasks that stay outside the current stable promotion scope, with the required follow-on harness or evidence made explicit in the saved report and Calibrax run metadata

Use build_foundation_promotion_report() from benchmarks._foundation_models to convert a stored suite report plus an optional Calibrax GuardResult into a deterministic promotion-review artifact. That artifact keeps the in-scope task list, deferred scope, required models, threshold policy, and any missing promotion evidence in one machine-readable record.

Use save_foundation_suite_report() from benchmarks._foundation_models to persist these deterministic suite reports as canonical JSON. Use save_foundation_suite_run() to mirror the same suite report into a Calibrax Store, and check_foundation_suite_regressions() to run the stored suite against the main baseline with the persisted threshold policy. Use save_foundation_promotion_report() to persist the promotion-review record as canonical JSON once the relevant regression check has been attached. For single-cell promotion review, use benchmarks.singlecell.foundation_suite.build_singlecell_foundation_promotion_report(); it attaches the Calibrax guard result before building the shared promotion artifact and fails closed unless an existing baseline is available or baseline bootstrap is requested explicitly. For genomics promotion review, use benchmarks.genomics.foundation_suite.build_genomics_foundation_promotion_report(); it follows the same fail-closed guard path while preserving the current pre-promotion scaffold provenance boundaries in the stored suite report.

Imported Foundation-Model Benchmarks¤

The current stable imported-model path is precomputed embeddings. For single-cell workloads, benchmarks consume a SingleCellPrecomputedAdapter implementation, align artifact rows by cell_ids, and then run the normal DiffBio downstream benchmark.

Imported foundation-model benchmarks must expose one shared metadata contract at the benchmark layer. In addition to benchmark tags, the promoted foundation_model metadata now carries:

  • dataset
  • task
  • model_family
  • adapter_mode
  • artifact_id
  • preprocessing_version

This keeps artifact identity and benchmark scenario identity on one shared schema for comparison, regression, and provenance tooling.

Genomics pre-promotion scaffold reports also carry dataset_provenance so synthetic interface validation cannot be mistaken for biological validation. The synthetic_genomics scaffold is recorded with:

  • dataset_name: synthetic_genomics
  • source_type: scaffold
  • curation_status: synthetic
  • provenance_label: deterministic_motif_scaffold
  • biological_validation: interface_validation_only
  • promotion_eligible: false

Any custom or curated genomics source must provide its own dataset_provenance payload before it can be reported through the foundation suite.

The first supported imported adapters are GeneformerPrecomputedAdapter and ScGPTPrecomputedAdapter. This remains deliberately narrower than generic checkpoint support:

  • supported: external embedding artifacts with explicit cell_ids
  • supported: benchmark tagging by model_family, adapter_mode, artifact_id, and preprocessing_version
  • supported: deterministic single-cell quick-suite reports across native DiffBio, Geneformer, and scGPT adapters for annotation and batch correction
  • supported: explicit scGPT batch-context metadata in comparison reports via requires_batch_context, batch_key, and context_version
  • supported: canonical single-cell deferral metadata that keeps grn_transfer outside stable promotion until a dedicated foundation-aware GRN harness exists
  • Pre-promotion scaffold: a shared SequencePrecomputedAdapter contract plus a genomics quick-suite scaffold for promoter, TFBS, and splice-site tasks
  • Pre-promotion scaffold: FrozenSequenceEncoderAdapter for in-process frozen sequence encoder benchmarking under adapter_mode=frozen_encoder
  • Pre-promotion scaffold: DNABERT2PrecomputedAdapter and NucleotideTransformerPrecomputedAdapter for aligned precomputed genomics artifacts, pending genomics realism and promotion evidence
  • supported: deterministic DTI source contracts for Davis affinity regression and BioSNAP binary interaction scaffolds, including paired-input batching and metric packaging for regression, classification, and ranking
  • supported: DTI benchmark metadata that exposes the shared paired-input required keys, dataset/split provenance, synthetic-scaffold promotion status, and metric groups for regression, classification, and ranking outputs
  • supported: a shared differentiable DTI integration path that encodes proteins with TransformerSequenceEncoder, encodes molecules with DifferentiableMolecularFingerprint, and promotes the protein foundation metadata into Calibrax benchmark tags and comparison keys
  • supported: DTI mini-batch provenance remains contract-valid by updating batch n_pairs while retaining the split-level source_n_pairs
  • supported: DTI comparison reports benchmark the differentiable pipeline against a fixed non-differentiable scaffold-feature baseline, include calibration metrics for binary interaction tasks, and use the Opifex optimizer factory for trainable DTI benchmark paths
  • supported: a shared contextual epigenomics source contract with canonical sequence, tf_context, chromatin_contacts, and targets keys
  • supported: contextual target-semantics validation for binary_peak_mask and chromatin_state_id, including per-task output-class counts in benchmark metadata and suite reports
  • supported: ContextualEpigenomicsOperator with one configurable code path for sequence-only, +TF, and +TF+chromatin modes, backed by an Artifex transformer and an optional structured chromatin-guidance loss
  • supported: deterministic contextual epigenomics ablation benchmarks and suite reports for peak calling and chromatin-state prediction across sequence_only, tf_context, and tf_plus_chromatin
  • supported: Calibrax-stored contextual epigenomics ablation comparisons using dataset, task, and contextual_variant as comparison axes, with explicit metric semantics for task quality and chromatin consistency
  • not yet supported: arbitrary Geneformer checkpoint loading into DiffBio
  • not yet supported: external frozen DNABERT-2 or Nucleotide Transformer checkpoint imports in stable APIs
  • not yet supported: tokenizer interchangeability claims across upstream models
  • not yet supported: stable DTI biological-promotion claims from the synthetic fallback sources alone
  • not yet supported: real cell-type-resolved epigenomics datasets for the contextual benchmark family
  • not yet supported: stable biological promotion of contextual epigenomics ablation gains from the synthetic contextual source alone

Important: Operators with learnable parameters (neural networks, learnable centroids, GLM coefficients) must be trained before evaluation. Comparing untrained random weights against optimised baselines produces misleading results. Use an unsupervised loss appropriate to the domain (e.g. reconstruction error, compactness, log-likelihood).


Datasets¤

Required Downloads¤

Dataset Size Path Download
immune_human 2.0 GB /media/mahdi/ssd23/Data/scib/Immune_ALL_human.h5ad Figshare
pancreas 51 MB /media/mahdi/ssd23/Data/scvelo/endocrinogenesis_day15.h5ad GitHub

From Cloned Repos¤

Dataset Repo Used By
balifam100 ../balifam/ MSA, pairwise alignment
ArchiveII ../RNAFoldAssess/ RNA folding
mESC ground truth ../benGRN/ GRN inference

Test Environment¤

Component Value
Platform Linux 6.8.0 (Ubuntu)
Python 3.12.6
JAX 0.10.0
GPU NVIDIA GeForce RTX 4090 (24 GB)
Backend CUDA