Data Sources API¤

Data source modules for loading bioinformatics and drug discovery datasets, extending Datarax's DataSourceModule.

Genomics Sources¤

BAMSource¤

diffbio.sources.bam.BAMSource ¤

BAMSource(
    config: BAMSourceConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)

Bases: IndexedBatchSourceMixin, DataSourceModule

BAM/CRAM file data source extending Datarax DataSourceModule.

Provides efficient access to aligned sequencing reads with:

  • Lazy loading using pysam iterators
  • Indexed random access via BAI/CRAI files
  • Quality filtering at load time
  • One-hot encoded sequence output

Inherits from DataSourceModule (StructuralModule) because:

  • Non-parametric: BAM reading is deterministic
  • Frozen config: file parameters don't change
  • Domain-specific: requires genomics-specific handling

Example

config = BAMSourceConfig(file_path=Path("sample.bam"))
source = BAMSource(config)
for element in source:
    print(element.data["read_name"], element.data["sequence"].shape)

Performance Tips (from pysam best practices):

  • Use indexed BAM files for random access
  • Filter by region to reduce data loading
  • Set min_mapping_quality to filter at read time
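
As a rough illustration of what load-time filtering means, the sketch below applies the two filter options to plain Python records. The `Read` class and `keep_read` function are hypothetical stand-ins, not pysam or DiffBio types, and the rule that unmapped reads bypass the mapping-quality check is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical stand-in for an aligned read (not a pysam or DiffBio type)."""

    name: str
    mapping_quality: int
    is_unmapped: bool


def keep_read(read, *, include_unmapped=False, min_mapping_quality=None):
    """Return True if the read passes the (assumed) filter semantics."""
    if read.is_unmapped:
        # Assumption: unmapped reads are governed only by include_unmapped.
        return include_unmapped
    if min_mapping_quality is not None and read.mapping_quality < min_mapping_quality:
        return False
    return True


reads = [
    Read("r1", 60, False),  # confidently mapped
    Read("r2", 5, False),   # mapped, but low mapping quality
    Read("r3", 0, True),    # unmapped
]
kept = [r.name for r in reads if keep_read(r, min_mapping_quality=20)]
print(kept)  # ['r1']
```

Filtering while reading, rather than after loading, means rejected reads never need to be one-hot encoded.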

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | `BAMSourceConfig` | BAM source configuration | required |
| rngs | `Rngs \| None` | Random number generators (unused for data loading) | None |
| name | `str \| None` | Optional module name | None |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If BAM file not found |
| `ImportError` | If pysam is not installed |

__len__ ¤

__len__() -> int

Return the number of reads in the source.

__getitem__ ¤

__getitem__(idx: int) -> Element | None

Get read by index.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | `int` | Index of the read | required |

Returns:

| Type | Description |
|---|---|
| `Element \| None` | Element at the given index, or None if out of bounds |

__iter__ ¤

__iter__() -> Iterator[Element]

Return iterator over reads.

reset ¤

reset(seed: int | None = None) -> None

Reset iteration state, optionally with a new seed.

get_batch ¤

get_batch(
    batch_size: int, key: Array | None = None
) -> list[Element]

Return the next batch of elements, advancing the internal index.
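
The cursor-style behavior described above (each call returns the next slice and advances an internal index until `reset` is called) can be sketched with a plain list. `BatchCursor` is a hypothetical helper, not the `IndexedBatchSourceMixin` implementation; in particular, returning a short final batch is an assumption, and the `key` argument is left out.

```python
class BatchCursor:
    """Hypothetical helper illustrating an index-advancing get_batch."""

    def __init__(self, items):
        self._items = list(items)
        self._index = 0  # position of the next element to return

    def get_batch(self, batch_size):
        batch = self._items[self._index:self._index + batch_size]
        self._index += len(batch)  # advance past what was returned
        return batch

    def reset(self):
        self._index = 0  # start again from the first element


cursor = BatchCursor(range(5))
first = cursor.get_batch(2)   # [0, 1]
second = cursor.get_batch(2)  # [2, 3]
last = cursor.get_batch(2)    # [4], a short final batch in this sketch
cursor.reset()
again = cursor.get_batch(2)   # [0, 1]
```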

BAMSourceConfig¤

diffbio.sources.bam.BAMSourceConfig dataclass ¤

BAMSourceConfig(
    file_path: Path = None,
    reference_path: Path | None = None,
    include_unmapped: bool = False,
    min_mapping_quality: int | None = None,
    region: str | None = None,
    handle_n: Literal["uniform", "zero"] = "uniform",
)

Bases: StructuralConfig

Configuration for BAM/CRAM data source.

Attributes:

| Name | Type | Description |
|---|---|---|
| file_path | `Path` | Path to BAM/CRAM file |
| reference_path | `Path \| None` | Optional path to reference FASTA (required for CRAM) |
| include_unmapped | `bool` | Whether to include unmapped reads (default: False) |
| min_mapping_quality | `int \| None` | Minimum mapping quality to include (default: None) |
| region | `str \| None` | Optional genomic region to query (e.g., "chr1:1000-2000") |
| handle_n | `Literal['uniform', 'zero']` | How to handle N nucleotides in sequences |
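
The `region` string follows the samtools-style `contig:start-end` form, where coordinates are 1-based and inclusive. The sketch below shows one way to split such a string into its parts; `parse_region` is a hypothetical helper for illustration and is not how `BAMSource` is documented to parse regions. It does not handle contig names that themselves contain colons.

```python
import re

# contig, optionally followed by :start or :start-end (commas allowed in numbers)
_REGION = re.compile(
    r"^(?P<contig>[^:]+)(?::(?P<start>[\d,]+)(?:-(?P<end>[\d,]+))?)?$"
)


def parse_region(region):
    """Split 'chr1:1000-2000' into ('chr1', 1000, 2000); missing parts become None."""
    match = _REGION.match(region)
    if match is None:
        raise ValueError(f"Malformed region string: {region!r}")
    start = int(match["start"].replace(",", "")) if match["start"] else None
    end = int(match["end"].replace(",", "")) if match["end"] else None
    if start is not None and end is not None and start > end:
        raise ValueError(f"Region start exceeds end: {region!r}")
    return match["contig"], start, end


print(parse_region("chr1:1000-2000"))  # ('chr1', 1000, 2000)
print(parse_region("chr1"))            # ('chr1', None, None)
```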

FastaSource¤

diffbio.sources.fasta.FastaSource ¤

FastaSource(
    config: FastaSourceConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)

Bases: IndexedBatchSourceMixin, DataSourceModule

FASTA file data source extending Datarax DataSourceModule.

Provides efficient access to DNA/RNA sequences with:

  • Lazy loading using samtools-compatible .fai index
  • Dictionary-like access by sequence name
  • One-hot encoded sequence output
  • Support for compressed BGZF files

Inherits from DataSourceModule (StructuralModule) because:

  • Non-parametric: FASTA reading is deterministic
  • Frozen config: file parameters don't change
  • Domain-specific: requires genomics-specific handling

Example

config = FastaSourceConfig(file_path=Path("genome.fasta"))
source = FastaSource(config)
elem = source.get_by_name("chr1")
print(elem.data["sequence"].shape)

Performance Tips (from pyfaidx best practices):

  • Use indexed FASTA files (.fai) for random access
  • Access regions with slicing for large chromosomes
  • BGZF compression reduces disk space while maintaining random access
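
To show why a `.fai` index makes random access cheap, the sketch below computes the byte offset of a single base directly from an index record. The five tab-separated columns (name, length, offset, bases per line, bytes per line) follow the samtools faidx format; the helper names and the tiny in-memory FASTA are made up for illustration, and this is not the code path pyfaidx or `FastaSource` uses.

```python
def parse_fai_line(line):
    """Parse one .fai record: name, length, offset, linebases, linewidth."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return name, int(length), int(offset), int(linebases), int(linewidth)


def byte_offset(pos, offset, linebases, linewidth):
    """Byte position of the 0-based base `pos` within the FASTA file."""
    # Whole lines skipped cost `linewidth` bytes each (bases plus newline).
    return offset + (pos // linebases) * linewidth + pos % linebases


# Ten bases wrapped at six per line; the header line occupies bytes 0-5.
fasta = b">seq1\nACGTAC\nGTAC\n"
name, length, offset, linebases, linewidth = parse_fai_line("seq1\t10\t6\t6\t7\n")

where = byte_offset(7, offset, linebases, linewidth)
base = fasta[where:where + 1]
print(where, base)  # 14 b'T'
```

Because the offset is pure arithmetic, fetching a small window from a large chromosome requires a seek and a short read rather than a scan of the file.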

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | `FastaSourceConfig` | FASTA source configuration | required |
| rngs | `Rngs \| None` | Random number generators (unused for data loading) | None |
| name | `str \| None` | Optional module name | None |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If FASTA file not found |
| `ImportError` | If pyfaidx is not installed |

sequence_names property ¤

sequence_names: list[str]

Get list of all sequence names in the FASTA file.

__len__ ¤

__len__() -> int

Return the number of sequences in the source.

__getitem__ ¤

__getitem__(idx: int) -> Element | None

Get sequence by index.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | `int` | Index of the sequence | required |

Returns:

| Type | Description |
|---|---|
| `Element \| None` | Element at the given index, or None if out of bounds |

__iter__ ¤

__iter__() -> Iterator[Element]

Return iterator over sequences.

reset ¤

reset(seed: int | None = None) -> None

Reset iteration state, optionally with a new seed.

get_batch ¤

get_batch(
    batch_size: int, key: Array | None = None
) -> list[Element]

Return the next batch of elements, advancing the internal index.

get_by_name ¤

get_by_name(name: str) -> Element | None

Get sequence by name/ID.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | `str` | Sequence identifier (e.g., "chr1", "seq1") | required |

Returns:

| Type | Description |
|---|---|
| `Element \| None` | Element for the sequence, or None if not found |

FastaSourceConfig¤

diffbio.sources.fasta.FastaSourceConfig dataclass ¤

FastaSourceConfig(
    file_path: Path = None,
    handle_n: Literal["uniform", "zero"] = "uniform",
    create_index: bool = True,
)

Bases: StructuralConfig

Configuration for FASTA data source.

Attributes:

| Name | Type | Description |
|---|---|---|
| file_path | `Path` | Path to FASTA file |
| handle_n | `Literal['uniform', 'zero']` | How to handle N nucleotides ("uniform" or "zero") |
| create_index | `bool` | Whether to create the .fai index if it does not exist (default: True) |
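
The two `handle_n` modes can be pictured with a small pure-Python encoder. This is a sketch under stated assumptions: the A, C, G, T channel order, the reading of "uniform" as 0.25 in every channel, and the reading of "zero" as an all-zero row are plausible interpretations of the option names, not confirmed details of DiffBio's encoder, which returns JAX arrays rather than lists.

```python
_CHANNELS = "ACGT"  # assumed channel order


def one_hot(sequence, handle_n="uniform"):
    """Encode a nucleotide string as rows of four floats."""
    if handle_n not in ("uniform", "zero"):
        raise ValueError(f"Unknown handle_n mode: {handle_n!r}")
    rows = []
    for base in sequence.upper():
        if base in _CHANNELS:
            rows.append([1.0 if base == c else 0.0 for c in _CHANNELS])
        elif handle_n == "uniform":
            # Ambiguous base: spread probability evenly over the four channels.
            rows.append([0.25] * 4)
        else:
            # Ambiguous base: contribute nothing to any channel.
            rows.append([0.0] * 4)
    return rows


print(one_hot("AN"))                   # [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
print(one_hot("AN", handle_n="zero"))  # [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

In this sketch every character outside A, C, G, and T is treated like N.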

MolNet Benchmark Source¤

MolNetSource¤

diffbio.sources.molnet.MolNetSource ¤

MolNetSource(
    config: MolNetSourceConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)

Bases: DataSourceModule

MolNet benchmark data source extending Datarax DataSourceModule.

Provides standardized access to MoleculeNet benchmark datasets with proper train/valid/test splits. Supports automatic downloading and caching.

Inherits from DataSourceModule (StructuralModule) because:

  • Non-parametric: data loading is deterministic
  • Frozen config: dataset parameters don't change
  • Domain-specific: requires molecular data handling

Example

config = MolNetSourceConfig(dataset_name="bbbp", split="train")
source = MolNetSource(config)
for element in source:
    print(element.data["smiles"], element.data["y"])

References

Wu et al. "MoleculeNet: A Benchmark for Molecular Machine Learning" Chemical Science, 2018.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | `MolNetSourceConfig` | MolNet source configuration | required |
| rngs | `Rngs \| None` | Random number generators (unused for data loading) | None |
| name | `str \| None` | Optional module name | None |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If dataset_name is unknown |
| `FileNotFoundError` | If data not found and download=False |

task_type property ¤

task_type: str

Get the task type for this dataset.

n_tasks property ¤

n_tasks: int

Get the number of tasks for this dataset.

__len__ ¤

__len__() -> int

Return the number of elements in the source.

__getitem__ ¤

__getitem__(idx: int) -> Element | None

Get element by index.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | `int` | Index of the element | required |

Returns:

| Type | Description |
|---|---|
| `Element \| None` | Element at the given index, or None if out of bounds |

__iter__ ¤

__iter__() -> Iterator[Element]

Return iterator over elements.

MolNetSourceConfig¤

diffbio.sources.molnet.MolNetSourceConfig dataclass ¤

MolNetSourceConfig(
    dataset_name: str = "",
    split: Literal["train", "valid", "test"] = "train",
    data_dir: Path | None = None,
    download: bool = True,
)

Bases: StructuralConfig

Configuration for MolNet benchmark data source.

Attributes:

| Name | Type | Description |
|---|---|---|
| dataset_name | `str` | Name of the MolNet dataset (e.g., "bbbp", "tox21", "esol") |
| split | `Literal['train', 'valid', 'test']` | Which split to load ("train", "valid", or "test") |
| data_dir | `Path \| None` | Directory to store downloaded data (default: ~/.diffbio/molnet) |
| download | `bool` | Whether to download if data not found (default: True) |

Indexed View Source¤

IndexedViewSource¤

diffbio.sources.indexed_view.IndexedViewSource ¤

IndexedViewSource(
    config: IndexedViewSourceConfig,
    source: DataSourceModule,
    indices: ndarray,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)

Bases: DataSourceModule

Lazy-loading view into a data source using index mapping.

This source wraps an existing DataSourceModule and provides access only to elements at specified indices. Elements are loaded ON-DEMAND from the underlying source, enabling lazy loading for large datasets.

Key Features:

  • LAZY LOADING: Elements fetched from underlying source only when accessed
  • Memory efficient: Only stores indices, not actual data
  • Preserves underlying source's lazy loading behavior
  • Supports shuffling of view indices (not underlying data)

Example

# Create view of first 1000 elements
indices = jnp.arange(1000)
config = IndexedViewSourceConfig()
view = IndexedViewSource(config, original_source, indices)
view[0]  # Fetches original_source[indices[0]] lazily
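
The index-mapping idea can be sketched in a few lines of plain Python. `IndexedView` below is a hypothetical miniature, not the `IndexedViewSource` implementation: it stores only the index list, fetches from the wrapped object on access, returns None for out-of-range view indices, and optionally shuffles its own indices with a seeded generator (the use of `random.Random` is an assumption made for this sketch).

```python
import random


class IndexedView:
    """Hypothetical miniature of an index-mapped lazy view."""

    def __init__(self, source, indices, shuffle=False, seed=None):
        self._source = source          # wrapped source, never copied
        self._indices = list(indices)  # the only data the view stores
        if shuffle:
            # Shuffles the view's index list, not the underlying data.
            random.Random(seed).shuffle(self._indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self._indices):
            return None  # out of bounds for the view
        return self._source[self._indices[idx]]  # fetched on demand

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


source = [f"item{i}" for i in range(10)]
view = IndexedView(source, [7, 2, 5])
print(len(view), view[0], view[3])  # 3 item7 None
```

Because the view holds indices only, several views (for example train, valid, and test) can share one underlying source without duplicating its elements.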

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | `IndexedViewSourceConfig` | Configuration for the view source | required |
| source | `DataSourceModule` | Underlying data source to wrap | required |
| indices | `ndarray` | Array of indices into the source to expose | required |
| rngs | `Rngs \| None` | Random number generators for shuffling | None |
| name | `str \| None` | Optional name for the module | None |

__len__ ¤

__len__() -> int

Return number of elements in the view.

__getitem__ ¤

__getitem__(idx: int) -> Element | None

Get element at view index (LAZY - fetches from underlying source).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | `int` | Index into the VIEW (0 to len(view)-1) | required |

Returns:

| Type | Description |
|---|---|
| `Element \| None` | Element from underlying source at mapped index, or None if out of bounds |

__iter__ ¤

__iter__() -> Iterator[Element]

Iterate over view elements (LAZY - fetches on demand).

IndexedViewSourceConfig¤

diffbio.sources.indexed_view.IndexedViewSourceConfig dataclass ¤

IndexedViewSourceConfig(
    shuffle: bool = False, seed: int | None = None
)

Bases: StructuralConfig

Configuration for IndexedViewSource.

Attributes:

| Name | Type | Description |
|---|---|---|
| shuffle | `bool` | Whether to shuffle the view indices on initialization and reset |
| seed | `int \| None` | Random seed for shuffling (optional) |

AnnData Source¤

AnnDataSource¤

diffbio.sources.anndata_source.AnnDataSource ¤

AnnDataSource(
    config: AnnDataSourceConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)

Bases: DataSourceModule

Eager-loading AnnData source for single-cell RNA-seq data.

Loads all data from .h5ad files to JAX arrays at initialization, then provides pure JAX iteration, batching, and indexed access. Follows the same eager-loading pattern as datarax's HFEagerSource.

Provides:

  • Full dataset loading via load()
  • Per-cell indexed access via __getitem__
  • Iteration via __iter__ with optional O(1) memory shuffling
  • Batch retrieval via get_batch(batch_size)
  • Automatic sparse-to-dense conversion
  • JAX array output for count matrices and embeddings

Output dictionary keys:

  • counts: Dense JAX array of shape (n_cells, n_genes) from .X
  • obs: Dict of cell metadata columns from .obs
  • var: Dict of gene metadata columns from .var
  • obsm: Dict of embedding JAX arrays from .obsm (empty if absent)

Example

config = AnnDataSourceConfig(file_path="pbmc3k.h5ad")
source = AnnDataSource(config)
print(len(source))                # 2700
print(source.load()["counts"].shape)  # (2700, 32738)

for cell in source:
    print(cell["counts"].shape)   # (32738,)
    break

batch = source.get_batch(32)
print(batch["counts"].shape)      # (32, 32738)

Loads all data to JAX arrays at construction time.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | `AnnDataSourceConfig` | AnnDataSourceConfig with file path and options. | required |
| rngs | `Rngs \| None` | Optional RNG state for shuffling. | None |
| name | `str \| None` | Optional module name. | None |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the file does not exist. |
| `ImportError` | If anndata is not installed. |

__len__ ¤

__len__() -> int

Return the number of cells in the dataset.

__getitem__ ¤

__getitem__(idx: int) -> dict[str, Any]

Get data for a single cell by index.

Supports negative indexing.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | `int` | Cell index (supports negative indexing). | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with counts, obs, obsm keys. |

Raises:

| Type | Description |
|---|---|
| `IndexError` | If idx is out of bounds. |

__iter__ ¤

__iter__() -> Iterator[dict[str, Any]]

Iterate over cells with optional O(1) memory shuffling.

Yields:

| Type | Description |
|---|---|
| `dict[str, Any]` | Per-cell dictionaries with counts, obs, obsm keys. |

AnnDataSourceConfig¤

diffbio.sources.anndata_source.AnnDataSourceConfig dataclass ¤

AnnDataSourceConfig(
    file_path: str | None = None,
    backed: bool = False,
    shuffle: bool = False,
    seed: int = 42,
    split: str | None = None,
)

Bases: StructuralConfig

Configuration for AnnDataSource.

Attributes:

| Name | Type | Description |
|---|---|---|
| file_path | `str \| None` | Path to the .h5ad file (string or Path object). |
| backed | `bool` | Whether to open in backed mode (memory-mapped). |
| shuffle | `bool` | Whether to shuffle during iteration. |
| seed | `int` | Integer seed for Grain's index_shuffle. |
| split | `str \| None` | Optional split name for pipeline integration. |

AnnData Interop¤

to_anndata¤

diffbio.sources.anndata_interop.to_anndata ¤

to_anndata(data_dict: dict[str, Any]) -> AnnData

Convert a DiffBio data dict to an AnnData object.

Translates the standard DiffBio dictionary format (as produced by AnnDataSource.load()) into an anndata.AnnData object for use with scanpy, scvi-tools, and other AnnData-based tools.

JAX arrays in counts and obsm are converted to numpy via np.asarray(). The obs and var dicts become pandas DataFrames.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data_dict | `dict[str, Any]` | Dictionary with keys: counts (JAX or numpy array of shape (n_cells, n_genes)); obs (dict mapping column names to per-cell arrays); var (dict mapping column names to per-gene arrays); obsm (optional; dict mapping embedding names to arrays). | required |

Returns:

| Type | Description |
|---|---|
| `AnnData` | AnnData object with .X, .obs, .var, and .obsm populated from the input dictionary. |

Raises:

| Type | Description |
|---|---|
| `ImportError` | If anndata or pandas is not installed. |

from_anndata¤

diffbio.sources.anndata_interop.from_anndata ¤

from_anndata(adata: AnnData) -> dict[str, Any]

Convert an AnnData object to a DiffBio data dict.

Translates an anndata.AnnData object into the standard DiffBio dictionary format compatible with AnnDataSource.load() output.

Sparse .X matrices are converted to dense before wrapping in a JAX array. .obs and .var DataFrames become plain dicts of numpy arrays. .obsm entries become JAX arrays.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adata | `AnnData` | AnnData object to convert. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with keys: counts (dense JAX array of shape (n_cells, n_genes)); obs (dict mapping column names to numpy arrays); var (dict mapping column names to numpy arrays); obsm (dict mapping embedding names to JAX arrays). |

Usage Examples¤

Reading BAM/CRAM Files¤

from pathlib import Path
from diffbio.sources import BAMSource, BAMSourceConfig

# Load aligned reads from a BAM file
config = BAMSourceConfig(
    file_path=Path("sample.bam"),
    min_mapping_quality=20,  # Filter low-quality alignments
    include_unmapped=False,  # Skip unmapped reads
)
source = BAMSource(config)

print(f"Number of reads: {len(source)}")

# Access reads
for element in source:
    # One-hot encoded sequence (length, 4)
    sequence = element.data["sequence"]
    # Phred quality scores (length,)
    quality = element.data["quality_scores"]
    # Read name
    name = element.data["read_name"]

    print(f"Read {name}: {sequence.shape}, avg quality: {quality.mean():.1f}")

Reading FASTA Files¤

from pathlib import Path
from diffbio.sources import FastaSource, FastaSourceConfig

# Load sequences from a FASTA file
config = FastaSourceConfig(
    file_path=Path("genome.fasta"),
    handle_n="uniform",  # or "zero" for N nucleotides
    create_index=True,   # Create .fai index for random access
)
source = FastaSource(config)

print(f"Number of sequences: {len(source)}")
print(f"Sequence names: {source.sequence_names}")

# Access by index
for element in source:
    seq_id = element.data["sequence_id"]
    sequence = element.data["sequence"]  # One-hot encoded
    print(f"{seq_id}: {sequence.shape[0]} bp")

# Access by name
chr1 = source.get_by_name("chr1")
if chr1 is not None:
    print(f"Chromosome 1 length: {chr1.data['sequence'].shape[0]}")

Region-Based BAM Access¤

from pathlib import Path
from diffbio.sources import BAMSource, BAMSourceConfig

# Load only reads from a specific region
config = BAMSourceConfig(
    file_path=Path("sample.bam"),
    region="chr1:1000000-2000000",  # 1Mb region on chr1
)
source = BAMSource(config)

print(f"Reads in region: {len(source)}")

Loading MolNet Benchmarks¤

from diffbio.sources import MolNetSource, MolNetSourceConfig

# Load BBBP (Blood-Brain Barrier Penetration) dataset
config = MolNetSourceConfig(
    dataset_name="bbbp",
    split="train",  # "train", "valid", or "test"
    download=True,  # Auto-download if not found
)
source = MolNetSource(config)

print(f"Dataset size: {len(source)}")
print(f"Task type: {source.task_type}")  # "classification"
print(f"Number of tasks: {source.n_tasks}")  # 1

# Iterate over elements
for element in source:
    smiles = element.data["smiles"]
    label = element.data["y"]
    print(f"{smiles}: {label}")

Available MolNet Datasets¤

| Dataset | Task Type | Tasks | Description |
|---|---|---|---|
| bbbp | classification | 1 | Blood-brain barrier penetration |
| tox21 | classification | 12 | Toxicity across 12 assays |
| hiv | classification | 1 | HIV replication inhibition |
| bace | classification | 1 | BACE-1 inhibitor activity |
| clintox | classification | 2 | Clinical trial toxicity |
| sider | classification | 27 | Drug side effects |
| esol | regression | 1 | Aqueous solubility |
| freesolv | regression | 1 | Hydration free energy |
| lipophilicity | regression | 1 | Octanol/water partition |
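
The table above can be mirrored as a small lookup, which is roughly what the `task_type` and `n_tasks` properties expose per dataset. The dictionary and `describe_dataset` helper below are hypothetical and only restate the table; how `MolNetSource` stores this metadata internally is not documented here. Raising `ValueError` for unknown names matches the behavior listed for the constructor.

```python
# (task_type, n_tasks) per dataset, copied from the table above.
MOLNET_DATASETS = {
    "bbbp": ("classification", 1),
    "tox21": ("classification", 12),
    "hiv": ("classification", 1),
    "bace": ("classification", 1),
    "clintox": ("classification", 2),
    "sider": ("classification", 27),
    "esol": ("regression", 1),
    "freesolv": ("regression", 1),
    "lipophilicity": ("regression", 1),
}


def describe_dataset(name):
    """Return (task_type, n_tasks) or raise ValueError for unknown names."""
    try:
        return MOLNET_DATASETS[name]
    except KeyError:
        known = ", ".join(sorted(MOLNET_DATASETS))
        raise ValueError(f"Unknown dataset {name!r}; expected one of: {known}") from None


print(describe_dataset("tox21"))  # ('classification', 12)
```

A lookup like this is handy for sizing a model's output head before any data is downloaded.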

Using IndexedViewSource for Lazy Loading¤

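from diffbio.sources import (
    IndexedViewSource,
    IndexedViewSourceConfig,
    MolNetSource,
    MolNetSourceConfig,
)
from diffbio.splitters import ScaffoldSplitter, ScaffoldSplitterConfig

# Load the source to split (any DataSourceModule with a "smiles" field works)
data_source = MolNetSource(MolNetSourceConfig(dataset_name="bbbp", split="train"))

# Create a splitter
splitter_config = ScaffoldSplitterConfig(smiles_key="smiles")
splitter = ScaffoldSplitter(splitter_config)

# Get split indices
result = splitter.split(data_source)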

# Create lazy view for training data
view_config = IndexedViewSourceConfig(shuffle=True, seed=42)
train_view = IndexedViewSource(
    view_config,
    data_source,
    result.train_indices,
)

# Iterate - elements loaded on demand
for element in train_view:
    print(element.data["smiles"])

Integration with Datarax Samplers¤

from diffbio.sources import MolNetSource, MolNetSourceConfig
from diffbio.splitters import ScaffoldSplitter, ScaffoldSplitterConfig
from datarax.samplers import ShuffleSampler, ShuffleSamplerConfig

# Load dataset
source_config = MolNetSourceConfig(dataset_name="bbbp", split="train")
source = MolNetSource(source_config)

# Split by scaffold
splitter_config = ScaffoldSplitterConfig(smiles_key="smiles")
splitter = ScaffoldSplitter(splitter_config)

# Create split sources (lazy loading)
train_source, valid_source, test_source = splitter.create_split_sources(
    source,
    lazy=True,
)

# Use with Datarax sampler
sampler_config = ShuffleSamplerConfig(batch_size=32)
train_sampler = ShuffleSampler(sampler_config, data_source=train_source)

# Training loop
for batch in train_sampler:
    # Process batch
    pass

Custom Data Directory¤

from pathlib import Path
from diffbio.sources import MolNetSource, MolNetSourceConfig

# Specify custom data directory
config = MolNetSourceConfig(
    dataset_name="tox21",
    split="train",
    data_dir=Path("/custom/path/to/data"),
    download=True,
)
source = MolNetSource(config)

Accessing Individual Elements¤

from diffbio.sources import MolNetSource, MolNetSourceConfig

config = MolNetSourceConfig(dataset_name="esol", split="train")
source = MolNetSource(config)

# Access by index
element = source[0]
if element is not None:
    smiles = element.data["smiles"]
    solubility = element.data["y"]
    metadata = element.metadata  # {"idx": 0, "dataset": "esol"}

Data Element Format¤

MolNetSource Elements¤

Each element from MolNetSource contains:

| Field | Type | Description |
|---|---|---|
| data["smiles"] | `str` | SMILES representation of molecule |
| data["y"] | `float` or `jnp.ndarray` | Label(s) for the molecule |
| state | `dict` | Empty state dictionary |
| metadata["idx"] | `int` | Index within the split |
| metadata["dataset"] | `str` | Dataset name |

IndexedViewSource Elements¤

Elements from IndexedViewSource are passed through from the underlying source, with indices remapped to the view's subset.

Configuration Options¤

MolNetSourceConfig¤

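| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_name | `str` | `""` (must be set) | Name of MolNet dataset |
| split | `Literal["train", "valid", "test"]` | "train" | Which split: "train", "valid", "test" |
| data_dir | `Path \| None` | None (uses ~/.diffbio/molnet) | Data storage directory |
| download | `bool` | True | Auto-download if missing |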

IndexedViewSourceConfig¤

| Parameter | Type | Default | Description |
|---|---|---|---|
| shuffle | `bool` | False | Shuffle the view indices on initialization and reset |
| seed | `int \| None` | None | Random seed for shuffling |