Data Sources API¤
Data source modules for loading bioinformatics and drug discovery datasets, extending Datarax's DataSourceModule.
Genomics Sources¤
BAMSource¤
diffbio.sources.bam.BAMSource¤
```python
BAMSource(
    config: BAMSourceConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)
```
Bases: IndexedBatchSourceMixin, DataSourceModule
BAM/CRAM file data source extending Datarax DataSourceModule.
Provides efficient access to aligned sequencing reads with:
- Lazy loading using pysam iterators
- Indexed random access via BAI/CRAI files
- Quality filtering at load time
- One-hot encoded sequence output
Inherits from DataSourceModule (StructuralModule) because:
- Non-parametric: BAM reading is deterministic
- Frozen config: file parameters don't change
- Domain-specific: requires genomics-specific handling
Performance Tips (from pysam best practices):
- Use indexed BAM files for random access
- Filter by region to reduce data loading
- Set min_mapping_quality to filter at read time
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `BAMSourceConfig` | BAM source configuration | required |
| `rngs` | `Rngs \| None` | Random number generators (unused for data loading) | `None` |
| `name` | `str \| None` | Optional module name | `None` |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If BAM file not found |
| `ImportError` | If pysam is not installed |
BAMSourceConfig¤
diffbio.sources.bam.BAMSourceConfig dataclass¤
```python
BAMSourceConfig(
    file_path: Path = None,
    reference_path: Path | None = None,
    include_unmapped: bool = False,
    min_mapping_quality: int | None = None,
    region: str | None = None,
    handle_n: Literal["uniform", "zero"] = "uniform",
)
```
Bases: StructuralConfig
Configuration for BAM/CRAM data source.
Attributes:
| Name | Type | Description |
|---|---|---|
| `file_path` | `Path` | Path to BAM/CRAM file |
| `reference_path` | `Path \| None` | Optional path to reference FASTA (required for CRAM) |
| `include_unmapped` | `bool` | Whether to include unmapped reads (default: False) |
| `min_mapping_quality` | `int \| None` | Minimum mapping quality to include (default: None) |
| `region` | `str \| None` | Optional genomic region to query (e.g., "chr1:1000-2000") |
| `handle_n` | `Literal['uniform', 'zero']` | How to handle N nucleotides in sequences |
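The `handle_n` option decides what an ambiguous `N` base becomes in the one-hot output. The sketch below is a plain-Python illustration rather than DiffBio's implementation: the `one_hot_encode` helper, the A/C/G/T channel order, and the reading of "uniform" as 0.25 per channel and "zero" as an all-zero row are assumptions inferred from the option names.

```python
from typing import Literal

# Assumed channel order; the library's actual order is not stated in this reference.
ALPHABET = "ACGT"


def one_hot_encode(
    sequence: str,
    handle_n: Literal["uniform", "zero"] = "uniform",
) -> list[list[float]]:
    """Encode a DNA string as a (length, 4) nested list (hypothetical helper)."""
    if handle_n not in ("uniform", "zero"):
        raise ValueError(f"unknown handle_n: {handle_n!r}")
    # Row used for N: equal weight on every base, or no signal at all.
    n_row = [0.25] * 4 if handle_n == "uniform" else [0.0] * 4
    rows = []
    for base in sequence.upper():
        if base in ALPHABET:
            row = [0.0] * 4
            row[ALPHABET.index(base)] = 1.0
        else:
            # In this sketch, any non-ACGT symbol is treated like N.
            row = list(n_row)
        rows.append(row)
    return rows


encoded_uniform = one_hot_encode("ACNT", handle_n="uniform")
encoded_zero = one_hot_encode("ACNT", handle_n="zero")
```

Under these assumptions, "uniform" keeps every row summing to 1, while "zero" lets downstream code detect ambiguous positions by their all-zero rows.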
FastaSource¤
diffbio.sources.fasta.FastaSource¤
```python
FastaSource(
    config: FastaSourceConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)
```
Bases: IndexedBatchSourceMixin, DataSourceModule
FASTA file data source extending Datarax DataSourceModule.
Provides efficient access to DNA/RNA sequences with:
- Lazy loading using samtools-compatible .fai index
- Dictionary-like access by sequence name
- One-hot encoded sequence output
- Support for compressed BGZF files
Inherits from DataSourceModule (StructuralModule) because:
- Non-parametric: FASTA reading is deterministic
- Frozen config: file parameters don't change
- Domain-specific: requires genomics-specific handling
Performance Tips (from pyfaidx best practices):
- Use indexed FASTA files (.fai) for random access
- Access regions with slicing for large chromosomes
- BGZF compression reduces disk space while maintaining random access
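Random access through a `.fai` index comes down to arithmetic on the index fields. The sketch below shows that calculation for an uncompressed FASTA file, using the five tab-separated columns of the samtools faidx format (name, length, offset, bases per line, bytes per line). `FaiEntry`, `parse_fai_line`, and `byte_offset` are illustrative names, not part of DiffBio or pyfaidx, and the index line is a made-up example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaiEntry:
    """One line of a samtools-style .fai index (illustrative container)."""

    name: str
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the file
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the newline


def parse_fai_line(line: str) -> FaiEntry:
    """Parse the first five tab-separated fields of a .fai line."""
    name, length, offset, line_bases, line_width = line.rstrip("\n").split("\t")[:5]
    return FaiEntry(name, int(length), int(offset), int(line_bases), int(line_width))


def byte_offset(entry: FaiEntry, position: int) -> int:
    """Byte offset of a 0-based base position within the FASTA file."""
    if not 0 <= position < entry.length:
        raise IndexError(f"position {position} out of range for {entry.name}")
    full_lines, column = divmod(position, entry.line_bases)
    return entry.offset + full_lines * entry.line_width + column


# Made-up index line: sequence starts at byte 6, 60 bases per 61-byte line.
entry = parse_fai_line("chr1\t248956422\t6\t60\t61\n")
first = byte_offset(entry, 0)     # first base of the sequence
wrapped = byte_offset(entry, 60)  # first base of the second sequence line
```

Because the offset is computed directly, a reader can seek to any base without scanning the preceding lines, which is why slicing large chromosomes stays cheap.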
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `FastaSourceConfig` | FASTA source configuration | required |
| `rngs` | `Rngs \| None` | Random number generators (unused for data loading) | `None` |
| `name` | `str \| None` | Optional module name | `None` |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If FASTA file not found |
| `ImportError` | If pyfaidx is not installed |
sequence_names property¤
Get list of all sequence names in the FASTA file.
__getitem__¤
Get sequence by index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Index of the sequence | required |
Returns:
| Type | Description |
|---|---|
| `Element \| None` | Element at the given index, or None if out of bounds |
get_batch¤
Return the next batch of elements, advancing the internal index.
get_by_name¤
Get sequence by name/ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Sequence identifier (e.g., "chr1", "seq1") | required |
Returns:
| Type | Description |
|---|---|
| `Element \| None` | Element for the sequence, or None if not found |
FastaSourceConfig¤
diffbio.sources.fasta.FastaSourceConfig dataclass¤
```python
FastaSourceConfig(
    file_path: Path = None,
    handle_n: Literal["uniform", "zero"] = "uniform",
    create_index: bool = True,
)
```
Bases: StructuralConfig
Configuration for FASTA data source.
Attributes:
| Name | Type | Description |
|---|---|---|
| `file_path` | `Path` | Path to FASTA file |
| `handle_n` | `Literal['uniform', 'zero']` | How to handle N nucleotides ("uniform" or "zero") |
| `create_index` | `bool` | Whether to create the .fai index if it does not exist (default: True) |
MolNet Benchmark Source¤
MolNetSource¤
diffbio.sources.molnet.MolNetSource¤
```python
MolNetSource(
    config: MolNetSourceConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)
```
Bases: DataSourceModule
MolNet benchmark data source extending Datarax DataSourceModule.
Provides standardized access to MoleculeNet benchmark datasets with proper train/valid/test splits. Supports automatic downloading and caching.
Inherits from DataSourceModule (StructuralModule) because:
- Non-parametric: data loading is deterministic
- Frozen config: dataset parameters don't change
- Domain-specific: requires molecular data handling
References
Wu et al. "MoleculeNet: A Benchmark for Molecular Machine Learning" Chemical Science, 2018.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `MolNetSourceConfig` | MolNet source configuration | required |
| `rngs` | `Rngs \| None` | Random number generators (unused for data loading) | `None` |
| `name` | `str \| None` | Optional module name | `None` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If dataset_name is unknown |
| `FileNotFoundError` | If data not found and download=False |
__getitem__¤
Get element by index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Index of the element | required |
Returns:
| Type | Description |
|---|---|
| `Element \| None` | Element at the given index, or None if out of bounds |
MolNetSourceConfig¤
diffbio.sources.molnet.MolNetSourceConfig dataclass¤
```python
MolNetSourceConfig(
    dataset_name: str = "",
    split: Literal["train", "valid", "test"] = "train",
    data_dir: Path | None = None,
    download: bool = True,
)
```
Bases: StructuralConfig
Configuration for MolNet benchmark data source.
Attributes:
| Name | Type | Description |
|---|---|---|
| `dataset_name` | `str` | Name of the MolNet dataset (e.g., "bbbp", "tox21", "esol") |
| `split` | `Literal['train', 'valid', 'test']` | Which split to load ("train", "valid", or "test") |
| `data_dir` | `Path \| None` | Directory to store downloaded data (default: ~/.diffbio/molnet) |
| `download` | `bool` | Whether to download if data not found (default: True) |
Indexed View Source¤
IndexedViewSource¤
diffbio.sources.indexed_view.IndexedViewSource¤
```python
IndexedViewSource(
    config: IndexedViewSourceConfig,
    source: DataSourceModule,
    indices: ndarray,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)
```
Bases: DataSourceModule
Lazy-loading view into a data source using index mapping.
This source wraps an existing DataSourceModule and provides access only to elements at specified indices. Elements are loaded ON-DEMAND from the underlying source, enabling lazy loading for large datasets.
Key Features:
- LAZY LOADING: Elements fetched from underlying source only when accessed
- Memory efficient: Only stores indices, not actual data
- Preserves underlying source's lazy loading behavior
- Supports shuffling of view indices (not underlying data)
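The index-mapping idea behind these features can be sketched in a few lines of plain Python. The classes below are hypothetical stand-ins, not the DiffBio implementation: `LazyIndexView` stores only a list of indices and forwards each lookup to the wrapped source, and `CountingSource` exists only to show that nothing is read until an element is requested.

```python
from __future__ import annotations

import random
from typing import Any, Iterator, Sequence


class CountingSource:
    """Toy source that counts how many elements have been fetched."""

    def __init__(self, items: Sequence[Any]) -> None:
        self.items = list(items)
        self.reads = 0

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> Any:
        self.reads += 1
        return self.items[idx]


class LazyIndexView:
    """Illustrative index-mapped view (not the DiffBio class)."""

    def __init__(
        self,
        source: Any,
        indices: Sequence[int],
        shuffle: bool = False,
        seed: int | None = None,
    ) -> None:
        self._source = source
        self._indices = list(indices)  # only indices are stored, never elements
        if shuffle:
            # Shuffles the view's index list; the underlying data is untouched.
            random.Random(seed).shuffle(self._indices)

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, idx: int) -> Any | None:
        if not 0 <= idx < len(self._indices):
            return None  # mirrors the documented "None if out of bounds"
        return self._source[self._indices[idx]]  # fetched on demand

    def __iter__(self) -> Iterator[Any]:
        for i in range(len(self)):
            yield self[i]


source = CountingSource(["a", "b", "c", "d", "e"])
view = LazyIndexView(source, [4, 0, 2])
reads_before = source.reads  # nothing has been fetched yet
first = view[0]              # view index 0 maps to source index 4
reads_after = source.reads   # exactly one element has been fetched
```

The explicit `__iter__` matters in this sketch: because `__getitem__` returns `None` instead of raising `IndexError`, Python's fallback iteration protocol would never terminate.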
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `IndexedViewSourceConfig` | Configuration for the view source | required |
| `source` | `DataSourceModule` | Underlying data source to wrap | required |
| `indices` | `ndarray` | Array of indices into the source to expose | required |
| `rngs` | `Rngs \| None` | Random number generators for shuffling | `None` |
| `name` | `str \| None` | Optional name for the module | `None` |
__getitem__¤
Get element at view index (LAZY - fetches from underlying source).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Index into the VIEW (0 to len(view)-1) | required |
Returns:
| Type | Description |
|---|---|
| `Element \| None` | Element from underlying source at mapped index, or None if out of bounds |
IndexedViewSourceConfig¤
diffbio.sources.indexed_view.IndexedViewSourceConfig dataclass¤
Bases: StructuralConfig
Configuration for IndexedViewSource.
Attributes:
| Name | Type | Description |
|---|---|---|
| `shuffle` | `bool` | Whether to shuffle the view indices on initialization and reset |
| `seed` | `int \| None` | Random seed for shuffling (optional) |
AnnData Source¤
AnnDataSource¤
diffbio.sources.anndata_source.AnnDataSource¤
```python
AnnDataSource(
    config: AnnDataSourceConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)
```
Bases: DataSourceModule
Eager-loading AnnData source for single-cell RNA-seq data.
Loads all data from .h5ad files into JAX arrays at initialization, then provides pure JAX iteration, batching, and indexed access. Follows the same eager-loading pattern as Datarax's HFEagerSource.
Provides:
- Full dataset loading via `load()`
- Per-cell indexed access via `__getitem__`
- Iteration via `__iter__` with optional O(1) memory shuffling
- Batch retrieval via `get_batch(batch_size)`
- Automatic sparse-to-dense conversion
- JAX array output for count matrices and embeddings
Output dictionary keys:
- `counts`: Dense JAX array of shape (n_cells, n_genes) from `.X`
- `obs`: Dict of cell metadata columns from `.obs`
- `var`: Dict of gene metadata columns from `.var`
- `obsm`: Dict of embedding JAX arrays from `.obsm` (empty if absent)
Example
```python
from diffbio.sources.anndata_source import AnnDataSource, AnnDataSourceConfig

config = AnnDataSourceConfig(file_path="pbmc3k.h5ad")
source = AnnDataSource(config)
print(len(source))  # 2700
print(source.load()["counts"].shape)  # (2700, 32738)

# Iterate one cell at a time
for cell in source:
    print(cell["counts"].shape)  # (32738,)
    break

# Fetch a batch of cells
batch = source.get_batch(32)
print(batch["counts"].shape)  # (32, 32738)
```
Loads all data to JAX arrays at construction time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `AnnDataSourceConfig` | AnnDataSourceConfig with file path and options. | required |
| `rngs` | `Rngs \| None` | Optional RNG state for shuffling. | `None` |
| `name` | `str \| None` | Optional module name. | `None` |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the file does not exist. |
| `ImportError` | If anndata is not installed. |
__getitem__¤
Get data for a single cell by index.
Supports negative indexing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Cell index (supports negative indexing). | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with the data for the requested cell. |
Raises:
| Type | Description |
|---|---|
| `IndexError` | If idx is out of bounds. |
__iter__¤
Iterate over cells with optional O(1) memory shuffling.
Yields:
| Type | Description |
|---|---|
| `dict[str, Any]` | Per-cell dictionaries, one for each cell. |
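Shuffling with O(1) memory means computing each shuffled position on the fly instead of materializing a permuted index array. The sketch below illustrates the idea with a simple affine permutation, `i -> (a * i + b) mod n`, which visits every index exactly once whenever `a` and `n` are coprime. This is a stand-in chosen for brevity: it is not the algorithm behind Grain's `index_shuffle`, and its output is far less random than a proper shuffle. The `affine_permutation` name is hypothetical.

```python
import math
import random
from typing import Iterator


def affine_permutation(n: int, seed: int) -> Iterator[int]:
    """Yield each index in range(n) exactly once, in a seed-dependent order."""
    if n <= 0:
        return
    rng = random.Random(seed)
    # Pick a multiplier coprime to n so the map is a bijection on range(n).
    a = rng.randrange(1, n) if n > 1 else 1
    while math.gcd(a, n) != 1:
        a = rng.randrange(1, n)
    b = rng.randrange(n)
    # Only a, b, and the loop counter are kept in memory.
    for i in range(n):
        yield (a * i + b) % n


order = list(affine_permutation(10, seed=42))
```

The same seed always reproduces the same order, which is the property a seeded shuffle in a data pipeline needs for repeatable epochs.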
AnnDataSourceConfig¤
diffbio.sources.anndata_source.AnnDataSourceConfig dataclass¤
```python
AnnDataSourceConfig(
    file_path: str | None = None,
    backed: bool = False,
    shuffle: bool = False,
    seed: int = 42,
    split: str | None = None,
)
```
Bases: StructuralConfig
Configuration for AnnDataSource.
Attributes:
| Name | Type | Description |
|---|---|---|
| `file_path` | `str \| None` | Path to the .h5ad file (string or Path object). |
| `backed` | `bool` | Whether to open in backed mode (memory-mapped). |
| `shuffle` | `bool` | Whether to shuffle during iteration. |
| `seed` | `int` | Integer seed for Grain's index_shuffle. |
| `split` | `str \| None` | Optional split name for pipeline integration. |
AnnData Interop¤
to_anndata¤
diffbio.sources.anndata_interop.to_anndata¤
Convert a DiffBio data dict to an AnnData object.
Translates the standard DiffBio dictionary format (as produced by
AnnDataSource.load()) into an anndata.AnnData object for use
with scanpy, scvi-tools, and other AnnData-based tools.
JAX arrays in counts and obsm are converted to numpy via
np.asarray(). The obs and var dicts become pandas
DataFrames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_dict` | `dict[str, Any]` | Dictionary with keys `counts`, `obs`, `var`, and `obsm`, as produced by `AnnDataSource.load()` | required |
Returns:
| Type | Description |
|---|---|
| `AnnData` | AnnData object with `.X`, `.obs`, `.var`, and `.obsm` populated from the input dictionary. |
Raises:
| Type | Description |
|---|---|
| `ImportError` | If anndata or pandas is not installed. |
from_anndata¤
diffbio.sources.anndata_interop.from_anndata¤
Convert an AnnData object to a DiffBio data dict.
Translates an anndata.AnnData object into the standard DiffBio
dictionary format compatible with AnnDataSource.load() output.
Sparse .X matrices are converted to dense before wrapping in a
JAX array. .obs and .var DataFrames become plain dicts of
numpy arrays. .obsm entries become JAX arrays.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `adata` | `AnnData` | AnnData object to convert. | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with keys `counts`, `obs`, `var`, and `obsm`, matching the `AnnDataSource.load()` output format. |
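The sparse-to-dense step that `from_anndata` describes can be pictured with the CSR layout that sparse count matrices commonly use. The following is a plain-Python sketch of that expansion on a toy matrix. The `csr_to_dense` helper is purely illustrative and is not a DiffBio function; real code would call the sparse matrix's own conversion method instead.

```python
def csr_to_dense(
    data: list[float],
    indices: list[int],
    indptr: list[int],
    n_cols: int,
) -> list[list[float]]:
    """Expand CSR arrays into a dense row-major nested list."""
    n_rows = len(indptr) - 1
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        # indptr[row]:indptr[row + 1] is the slice of stored values for this row.
        for k in range(indptr[row], indptr[row + 1]):
            dense[row][indices[k]] = data[k]
    return dense


# Toy matrix: 3 cells x 4 genes with three non-zero counts.
data = [5.0, 1.0, 2.0]
indices = [0, 3, 1]
indptr = [0, 2, 2, 3]
dense = csr_to_dense(data, indices, indptr, n_cols=4)
```

Dense storage grows with `n_cells * n_genes` regardless of how many entries are zero, so the conversion trades memory for the convenience of a regular array.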
Usage Examples¤
Reading BAM/CRAM Files¤
```python
from pathlib import Path
from diffbio.sources import BAMSource, BAMSourceConfig

# Load aligned reads from a BAM file
config = BAMSourceConfig(
    file_path=Path("sample.bam"),
    min_mapping_quality=20,  # Filter low-quality alignments
    include_unmapped=False,  # Skip unmapped reads
)
source = BAMSource(config)
print(f"Number of reads: {len(source)}")

# Access reads
for element in source:
    # One-hot encoded sequence (length, 4)
    sequence = element.data["sequence"]
    # Phred quality scores (length,)
    quality = element.data["quality_scores"]
    # Read name
    name = element.data["read_name"]
    print(f"Read {name}: {sequence.shape}, avg quality: {quality.mean():.1f}")
```
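The `quality_scores` field holds Phred scores, which relate to the base-call error probability by p = 10^(-Q/10). A minimal sketch of that conversion on plain Python numbers follows. The helper name is illustrative and the example scores are made up; with the JAX arrays returned by the source, the same formula would be applied element-wise.

```python
def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred quality score Q to an error probability 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


scores = [10, 20, 30, 40]  # made-up example scores
error_probs = [phred_to_error_probability(q) for q in scores]

# Sum of per-base error probabilities: expected number of miscalled bases.
expected_errors = sum(error_probs)
```

Averaging error probabilities is often more informative than averaging raw scores, since the Phred scale is logarithmic.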
Reading FASTA Files¤
```python
from pathlib import Path
from diffbio.sources import FastaSource, FastaSourceConfig

# Load sequences from a FASTA file
config = FastaSourceConfig(
    file_path=Path("genome.fasta"),
    handle_n="uniform",  # or "zero" for N nucleotides
    create_index=True,  # Create .fai index for random access
)
source = FastaSource(config)
print(f"Number of sequences: {len(source)}")
print(f"Sequence names: {source.sequence_names}")

# Access by index
for element in source:
    seq_id = element.data["sequence_id"]
    sequence = element.data["sequence"]  # One-hot encoded
    print(f"{seq_id}: {sequence.shape[0]} bp")

# Access by name
chr1 = source.get_by_name("chr1")
if chr1 is not None:
    print(f"Chromosome 1 length: {chr1.data['sequence'].shape[0]}")
```
Region-Based BAM Access¤
```python
from pathlib import Path
from diffbio.sources import BAMSource, BAMSourceConfig

# Load only reads from a specific region
config = BAMSourceConfig(
    file_path=Path("sample.bam"),
    region="chr1:1000000-2000000",  # 1Mb region on chr1
)
source = BAMSource(config)
print(f"Reads in region: {len(source)}")
```
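Region strings such as `"chr1:1000000-2000000"` follow the samtools convention of 1-based, inclusive coordinates. The sketch below parses one into a contig name and a 0-based, half-open interval, which is the form Python slicing expects. `parse_region` is a hypothetical helper, not a DiffBio or pysam function, and it handles only the plain `contig:start-end` form.

```python
import re

# Matches "contig:start-end" with plain decimal coordinates only.
_REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str) -> tuple[str, int, int]:
    """Return (contig, start, end) with a 0-based, half-open interval."""
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"unsupported region string: {region!r}")
    start = int(match["start"])
    end = int(match["end"])
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in region: {region!r}")
    # 1-based inclusive [start, end] becomes 0-based half-open [start - 1, end).
    return match["contig"], start - 1, end


contig, start, end = parse_region("chr1:1000000-2000000")
span = end - start  # number of bases covered by the region
```

Validating the string before opening the file gives a clearer error than a failure deep inside the reader.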
Loading MolNet Benchmarks¤
```python
from diffbio.sources import MolNetSource, MolNetSourceConfig

# Load BBBP (Blood-Brain Barrier Penetration) dataset
config = MolNetSourceConfig(
    dataset_name="bbbp",
    split="train",  # "train", "valid", or "test"
    download=True,  # Auto-download if not found
)
source = MolNetSource(config)
print(f"Dataset size: {len(source)}")
print(f"Task type: {source.task_type}")  # "classification"
print(f"Number of tasks: {source.n_tasks}")  # 1

# Iterate over elements
for element in source:
    smiles = element.data["smiles"]
    label = element.data["y"]
    print(f"{smiles}: {label}")
```
Available MolNet Datasets¤
| Dataset | Task Type | Tasks | Description |
|---|---|---|---|
| `bbbp` | classification | 1 | Blood-brain barrier penetration |
| `tox21` | classification | 12 | Toxicity across 12 assays |
| `hiv` | classification | 1 | HIV replication inhibition |
| `bace` | classification | 1 | BACE-1 inhibitor activity |
| `clintox` | classification | 2 | Clinical trial toxicity |
| `sider` | classification | 27 | Drug side effects |
| `esol` | regression | 1 | Aqueous solubility |
| `freesolv` | regression | 1 | Hydration free energy |
| `lipophilicity` | regression | 1 | Octanol/water partition |
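A lookup table like the one above is typically what backs the documented `ValueError` for an unknown `dataset_name`. The sketch below shows one way such a check could look. `DatasetInfo`, `MOLNET_DATASETS`, and `lookup_dataset` are hypothetical names, and the values are copied from the table, not read from the library.

```python
from typing import NamedTuple


class DatasetInfo(NamedTuple):
    task_type: str
    n_tasks: int


# Values copied from the dataset table; not read from DiffBio itself.
MOLNET_DATASETS = {
    "bbbp": DatasetInfo("classification", 1),
    "tox21": DatasetInfo("classification", 12),
    "hiv": DatasetInfo("classification", 1),
    "bace": DatasetInfo("classification", 1),
    "clintox": DatasetInfo("classification", 2),
    "sider": DatasetInfo("classification", 27),
    "esol": DatasetInfo("regression", 1),
    "freesolv": DatasetInfo("regression", 1),
    "lipophilicity": DatasetInfo("regression", 1),
}


def lookup_dataset(name: str) -> DatasetInfo:
    """Return task metadata for a dataset name, or raise ValueError."""
    try:
        return MOLNET_DATASETS[name]
    except KeyError:
        known = ", ".join(sorted(MOLNET_DATASETS))
        raise ValueError(f"Unknown dataset {name!r}; expected one of: {known}") from None


tox21 = lookup_dataset("tox21")
regression_sets = sorted(
    name for name, info in MOLNET_DATASETS.items() if info.task_type == "regression"
)
```

Listing the known names in the error message makes a typo in `dataset_name` easy to spot.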
Using IndexedViewSource for Lazy Loading¤
```python
from diffbio.sources import (
    IndexedViewSource,
    IndexedViewSourceConfig,
    MolNetSource,
    MolNetSourceConfig,
)
from diffbio.splitters import ScaffoldSplitter, ScaffoldSplitterConfig

# Underlying source to split (BBBP is used here as an example)
data_source = MolNetSource(MolNetSourceConfig(dataset_name="bbbp", split="train"))

# Create a splitter
splitter_config = ScaffoldSplitterConfig(smiles_key="smiles")
splitter = ScaffoldSplitter(splitter_config)

# Get split indices
result = splitter.split(data_source)

# Create lazy view for training data
view_config = IndexedViewSourceConfig(shuffle=True, seed=42)
train_view = IndexedViewSource(
    view_config,
    data_source,
    result.train_indices,
)

# Iterate - elements loaded on demand
for element in train_view:
    print(element.data["smiles"])
```
Integration with Datarax Samplers¤
```python
from diffbio.sources import MolNetSource, MolNetSourceConfig
from diffbio.splitters import ScaffoldSplitter, ScaffoldSplitterConfig
from datarax.samplers import ShuffleSampler, ShuffleSamplerConfig

# Load dataset
source_config = MolNetSourceConfig(dataset_name="bbbp", split="train")
source = MolNetSource(source_config)

# Split by scaffold
splitter_config = ScaffoldSplitterConfig(smiles_key="smiles")
splitter = ScaffoldSplitter(splitter_config)

# Create split sources (lazy loading)
train_source, valid_source, test_source = splitter.create_split_sources(
    source,
    lazy=True,
)

# Use with Datarax sampler
sampler_config = ShuffleSamplerConfig(batch_size=32)
train_sampler = ShuffleSampler(sampler_config, data_source=train_source)

# Training loop
for batch in train_sampler:
    # Process batch
    pass
```
Custom Data Directory¤
```python
from pathlib import Path
from diffbio.sources import MolNetSource, MolNetSourceConfig

# Specify custom data directory
config = MolNetSourceConfig(
    dataset_name="tox21",
    split="train",
    data_dir=Path("/custom/path/to/data"),
    download=True,
)
source = MolNetSource(config)
```
Accessing Individual Elements¤
```python
from diffbio.sources import MolNetSource, MolNetSourceConfig

config = MolNetSourceConfig(dataset_name="esol", split="train")
source = MolNetSource(config)

# Access by index
element = source[0]
if element is not None:
    smiles = element.data["smiles"]
    solubility = element.data["y"]
    metadata = element.metadata  # {"idx": 0, "dataset": "esol"}
```
Data Element Format¤
MolNetSource Elements¤
Each element from MolNetSource contains:
| Field | Type | Description |
|---|---|---|
| `data["smiles"]` | `str` | SMILES representation of molecule |
| `data["y"]` | `float` or `jnp.ndarray` | Label(s) for the molecule |
| `state` | `dict` | Empty state dictionary |
| `metadata["idx"]` | `int` | Index within the split |
| `metadata["dataset"]` | `str` | Dataset name |
IndexedViewSource Elements¤
Elements from IndexedViewSource are passed through from the underlying source, with indices remapped to the view's subset.
Configuration Options¤
MolNetSourceConfig¤
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_name` | `str` | required | Name of MolNet dataset |
| `split` | `str` | `"train"` | Which split: "train", "valid", "test" |
| `data_dir` | `Path \| None` | `None` (uses `~/.diffbio/molnet`) | Data storage directory |
| `download` | `bool` | `True` | Auto-download if missing |
IndexedViewSourceConfig¤
| Parameter | Type | Default | Description |
|---|---|---|---|
| `shuffle` | `bool` | `False` | Shuffle view indices on initialization and reset |
| `seed` | `int \| None` | `None` | Random seed for shuffling |