Preprocessing Operators API¤

Differentiable preprocessing operators for read quality control, adapter removal, and error correction.

SoftAdapterRemoval¤

diffbio.operators.preprocessing.adapter_removal.SoftAdapterRemoval ¤

SoftAdapterRemoval(
    config: AdapterRemovalConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)

Bases: TemperatureOperator

Differentiable adapter removal for sequencing reads.

This operator performs soft adapter trimming using a differentiable approach. It finds potential adapter matches at the 3' end of reads and applies sigmoid-weighted trimming that maintains gradient flow.

The algorithm: 1. Compute soft alignment scores between sequence suffix and adapter 2. Find the soft trim position using weighted position averaging 3. Apply sigmoid-weighted retention to each position

Parameters:

Name	Type	Description	Default
`config`	`AdapterRemovalConfig`	AdapterRemovalConfig with adapter parameters.	required
`rngs`	`Rngs \| None`	Flax NNX random number generators (optional).	`None`
`name`	`str \| None`	Optional operator name.	`None`

Example

config = AdapterRemovalConfig(adapter_sequence="AGATCGGAAGAG")
remover = SoftAdapterRemoval(config)
data = {"sequence": encoded_seq, "quality_scores": quality}
result, state, meta = remover.apply(data, {}, None)

Parameters:

Name	Type	Description	Default
`config`	`AdapterRemovalConfig`	Adapter removal configuration.	required
`rngs`	`Rngs \| None`	Random number generators (optional).	`None`
`name`	`str \| None`	Optional operator name.	`None`

apply ¤

apply(
    data: PyTree,
    state: PyTree,
    metadata: dict[str, Any] | None,
    random_params: Any = None,
    stats: dict[str, Any] | None = None,
) -> tuple[PyTree, PyTree, dict[str, Any] | None]

Apply soft adapter removal to sequence data.

This method finds potential adapter sequences and applies differentiable soft trimming to remove them while maintaining gradient flow.

Parameters:

Name	Type	Description	Default
`data`	`PyTree`	Dictionary containing: - "sequence": One-hot encoded sequence (length, alphabet_size) - "quality_scores": Phred quality scores (length,)	required
`state`	`PyTree`	Element state (passed through unchanged)	required
`metadata`	`dict[str, Any] \| None`	Element metadata (passed through unchanged)	required
`random_params`	`Any`	Not used (deterministic operator)	`None`
`stats`	`dict[str, Any] \| None`	Not used	`None`

Returns:

Type	Description
`tuple[PyTree, PyTree, dict[str, Any] \| None]`	Tuple of (transformed_data, state, metadata): - transformed_data contains: `- "sequence": Soft-trimmed sequence - "quality_scores": Soft-trimmed quality scores - "adapter_score": Maximum adapter match score - "trim_position": Soft trim position` state is passed through unchanged metadata is passed through unchanged

AdapterRemovalConfig¤

diffbio.operators.preprocessing.adapter_removal.AdapterRemovalConfig `dataclass` ¤

AdapterRemovalConfig(
    adapter_sequence: str = "AGATCGGAAGAG",
    temperature: float = 1.0,
    learnable_temperature: bool = True,
    match_threshold: float = 0.5,
    min_overlap: int = 6,
)

Bases: OperatorConfig

Configuration for SoftAdapterRemoval.

Attributes:

Name	Type	Description
`adapter_sequence`	`str`	Adapter sequence to remove (default: Illumina universal).
`temperature`	`float`	Temperature for soft matching and trimming. Lower = sharper trimming, Higher = smoother.
`match_threshold`	`float`	Minimum alignment score ratio to consider a match.
`min_overlap`	`int`	Minimum overlap length to consider adapter presence.

DifferentiableDuplicateWeighting¤

diffbio.operators.preprocessing.duplicate_filter.DifferentiableDuplicateWeighting ¤

DifferentiableDuplicateWeighting(
    config: DuplicateWeightingConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)

Bases: TemperatureOperator

Differentiable duplicate weighting for sequencing reads.

This operator assigns probabilistic weights to reads based on their uniqueness within a batch. Instead of hard duplicate removal, it down-weights reads that are similar to others, maintaining gradient flow.

The algorithm: 1. Embed sequences using learned convolutional features 2. Compute pairwise soft similarity matrix 3. Compute soft cluster sizes from similarity matrix 4. Assign weights inversely proportional to cluster size

Note: This operator works on batched data where reads can be compared. For single-read processing, it returns weight=1.0.

Parameters:

Name	Type	Description	Default
`config`	`DuplicateWeightingConfig`	DuplicateWeightingConfig with weighting parameters.	required
`rngs`	`Rngs \| None`	Flax NNX random number generators.	`None`
`name`	`str \| None`	Optional operator name.	`None`

Example

config = DuplicateWeightingConfig(similarity_threshold=0.9)
weighter = DifferentiableDuplicateWeighting(config, rngs=nnx.Rngs(42))
data = {"sequence": encoded_seq, "quality_scores": quality}
result, state, meta = weighter.apply(data, {}, None)

Parameters:

Name	Type	Description	Default
`config`	`DuplicateWeightingConfig`	Duplicate weighting configuration.	required
`rngs`	`Rngs \| None`	Random number generators for initialization.	`None`
`name`	`str \| None`	Optional operator name.	`None`

apply ¤

apply(
    data: PyTree,
    state: PyTree,
    metadata: dict[str, Any] | None,
    random_params: Any = None,
    stats: dict[str, Any] | None = None,
) -> tuple[PyTree, PyTree, dict[str, Any] | None]

Apply duplicate weighting to sequence data.

For single sequences, returns weight=1.0. For batched sequences, computes uniqueness-based weights.

Parameters:

Name	Type	Description	Default
`data`	`PyTree`	Dictionary containing: - "sequence": One-hot encoded sequence (length, alphabet_size) or batch (batch, length, alphabet_size) - "quality_scores": Quality scores (length,) or (batch, length)	required
`state`	`PyTree`	Element state (passed through unchanged)	required
`metadata`	`dict[str, Any] \| None`	Element metadata (passed through unchanged)	required
`random_params`	`Any`	Not used (deterministic operator)	`None`
`stats`	`dict[str, Any] \| None`	Not used	`None`

Returns:

Type	Description
`tuple[PyTree, PyTree, dict[str, Any] \| None]`	Tuple of (transformed_data, state, metadata): - transformed_data contains: `- "sequence": Original sequence (unchanged) - "quality_scores": Original quality scores (unchanged) - "uniqueness_weight": Weight based on uniqueness - "embedding": Sequence embedding (for downstream use)` state is passed through unchanged metadata is passed through unchanged

DuplicateWeightingConfig¤

diffbio.operators.preprocessing.duplicate_filter.DuplicateWeightingConfig `dataclass` ¤

DuplicateWeightingConfig(
    temperature: float = 1.0,
    learnable_temperature: bool = True,
    similarity_threshold: float = 0.9,
    embedding_dim: int = 32,
)

Bases: OperatorConfig

Configuration for DifferentiableDuplicateWeighting.

Attributes:

Name	Type	Description
`temperature`	`float`	Temperature for soft similarity computation. Lower = sharper clustering, Higher = smoother.
`similarity_threshold`	`float`	Minimum similarity to consider as duplicate.
`embedding_dim`	`int`	Dimension of learned sequence embedding.

SoftErrorCorrection¤

diffbio.operators.preprocessing.error_correction.SoftErrorCorrection ¤

SoftErrorCorrection(
    config: ErrorCorrectionConfig,
    *,
    rngs: Rngs | None = None,
    name: str | None = None,
)

Bases: TemperatureOperator

Differentiable error correction for sequencing reads.

This operator uses a neural network to refine base calls based on local sequence context and quality scores. It outputs soft base probabilities that maintain gradient flow.

The algorithm: 1. For each position, extract a window of sequence and quality data 2. Pass through MLP to predict corrected base probabilities 3. Output soft one-hot representation blending original and corrected

Parameters:

Name	Type	Description	Default
`config`	`ErrorCorrectionConfig`	ErrorCorrectionConfig with model parameters.	required
`rngs`	`Rngs \| None`	Flax NNX random number generators.	`None`
`name`	`str \| None`	Optional operator name.	`None`

Example

config = ErrorCorrectionConfig(window_size=11, hidden_dim=64)
corrector = SoftErrorCorrection(config, rngs=nnx.Rngs(42))
data = {"sequence": encoded_seq, "quality_scores": quality}
result, state, meta = corrector.apply(data, {}, None)

Parameters:

Name	Type	Description	Default
`config`	`ErrorCorrectionConfig`	Error correction configuration.	required
`rngs`	`Rngs \| None`	Random number generators for initialization.	`None`
`name`	`str \| None`	Optional operator name.	`None`

apply ¤

apply(
    data: PyTree,
    state: PyTree,
    metadata: dict[str, Any] | None,
    random_params: Any = None,
    stats: dict[str, Any] | None = None,
) -> tuple[PyTree, PyTree, dict[str, Any] | None]

Apply error correction to sequence data.

This method corrects each position in the sequence using the neural network model, producing soft corrected base probabilities.

Parameters:

Name	Type	Description	Default
`data`	`PyTree`	Dictionary containing: - "sequence": One-hot encoded sequence (length, alphabet_size) - "quality_scores": Phred quality scores (length,)	required
`state`	`PyTree`	Element state (passed through unchanged)	required
`metadata`	`dict[str, Any] \| None`	Element metadata (passed through unchanged)	required
`random_params`	`Any`	Not used (deterministic operator)	`None`
`stats`	`dict[str, Any] \| None`	Not used	`None`

Returns:

Type	Description
`tuple[PyTree, PyTree, dict[str, Any] \| None]`	Tuple of (transformed_data, state, metadata): - transformed_data contains: `- "sequence": Corrected soft one-hot sequence - "quality_scores": Original quality scores - "correction_confidence": Average correction weight` state is passed through unchanged metadata is passed through unchanged

ErrorCorrectionConfig¤

diffbio.operators.preprocessing.error_correction.ErrorCorrectionConfig `dataclass` ¤

ErrorCorrectionConfig(
    window_size: int = 11,
    hidden_dim: int = 64,
    num_layers: int = 2,
    use_quality: bool = True,
    temperature: float = 1.0,
    learnable_temperature: bool = True,
)

Bases: OperatorConfig

Configuration for SoftErrorCorrection.

Attributes:

Name	Type	Description
`window_size`	`int`	Size of context window around each position. Must be odd. Default is 11 (5 bases on each side).
`hidden_dim`	`int`	Hidden layer dimension in the MLP.
`num_layers`	`int`	Number of hidden layers in the MLP.
`use_quality`	`bool`	Whether to include quality scores as input.
`temperature`	`float`	Temperature for output softmax.

Usage Examples¤

Adapter Removal¤

from flax import nnx
from diffbio.operators.preprocessing import SoftAdapterRemoval, AdapterRemovalConfig

config = AdapterRemovalConfig(
    adapter_sequence="AGATCGGAAGAG",
    temperature=1.0,
    match_threshold=0.8,
)
adapter_removal = SoftAdapterRemoval(config, rngs=nnx.Rngs(42))

data = {"sequence": read, "quality_scores": quality}
result, _, _ = adapter_removal.apply(data, {}, None)
trimmed = result["sequence"]

Duplicate Weighting¤

from diffbio.operators.preprocessing import (
    DifferentiableDuplicateWeighting,
    DuplicateWeightingConfig,
)

config = DuplicateWeightingConfig(embedding_dim=32)
dup_weighting = DifferentiableDuplicateWeighting(config, rngs=nnx.Rngs(42))

data = {"sequences": reads, "quality_scores": quality}
result, _, _ = dup_weighting.apply(data, {}, None)
weights = result["weights"]

Error Correction¤

from diffbio.operators.preprocessing import SoftErrorCorrection, ErrorCorrectionConfig

config = ErrorCorrectionConfig(hidden_dim=64, window_size=5)
error_correction = SoftErrorCorrection(config, rngs=nnx.Rngs(42))

data = {"sequence": read, "quality_scores": quality}
result, _, _ = error_correction.apply(data, {}, None)
corrected = result["sequence"]

Preprocessing Operators API¤

SoftAdapterRemoval¤

diffbio.operators.preprocessing.adapter_removal.SoftAdapterRemoval ¤

apply ¤

AdapterRemovalConfig¤

diffbio.operators.preprocessing.adapter_removal.AdapterRemovalConfig dataclass ¤

DifferentiableDuplicateWeighting¤

diffbio.operators.preprocessing.duplicate_filter.DifferentiableDuplicateWeighting ¤

apply ¤

DuplicateWeightingConfig¤

diffbio.operators.preprocessing.duplicate_filter.DuplicateWeightingConfig dataclass ¤

SoftErrorCorrection¤

diffbio.operators.preprocessing.error_correction.SoftErrorCorrection ¤

apply ¤

ErrorCorrectionConfig¤

diffbio.operators.preprocessing.error_correction.ErrorCorrectionConfig dataclass ¤

Usage Examples¤

Adapter Removal¤

Duplicate Weighting¤

Error Correction¤

diffbio.operators.preprocessing.adapter_removal.AdapterRemovalConfig `dataclass` ¤

diffbio.operators.preprocessing.duplicate_filter.DuplicateWeightingConfig `dataclass` ¤

diffbio.operators.preprocessing.error_correction.ErrorCorrectionConfig `dataclass` ¤