Skip to content

Sequence Alignment¤

Sequence alignment arranges two or more biological sequences to identify regions of similarity that may reflect functional, structural, or evolutionary relationships. DiffBio provides differentiable alignment operators that enable gradient-based learning of alignment parameters.


The Alignment Problem¤

Given two sequences \(A\) and \(B\), find an arrangement that maximizes a scoring function:

\[ \text{Score} = \sum_{\text{aligned pairs}} s(a_i, b_j) + \sum_{\text{gaps}} g(k) \]

Where \(s(a, b)\) is the substitution score and \(g(k)\) is the gap penalty for a gap of length \(k\).

Local vs Global Alignment¤

Type Algorithm Goal DiffBio Operator
Local Smith-Waterman Best matching subsequence SmoothSmithWaterman
Global Needleman-Wunsch Full end-to-end alignment
Profile Profile HMM Sequence-to-model alignment ProfileHMMSearch
Multiple Progressive Align multiple sequences SoftProgressiveMSA

DiffBio focuses on local alignment (Smith-Waterman) as the foundation.


Smith-Waterman Recurrence¤

For sequences \(A\) (length \(m\)) and \(B\) (length \(n\)), the standard recurrence fills a matrix \(H\):

\[ H(i, j) = \max \begin{cases} 0 \\ H(i-1, j-1) + s(A_i, B_j) \\ H(i-1, j) + g \\ H(i, j-1) + g \end{cases} \]

The optimal local alignment score is \(\max_{i,j} H(i,j)\).

Why This Blocks Gradients¤

The max operation selects one branch and discards the others. Gradients only flow through the winning branch — all other branches receive zero gradient. This creates a hard, non-differentiable decision at every cell.


Smooth Smith-Waterman¤

DiffBio replaces max with temperature-scaled logsumexp:

\[ H(i, j) = \tau \cdot \log \sum_{k} \exp\left(\frac{c_k}{\tau}\right) \]

Where \(c_k\) are the four candidates (0, diagonal, up, left). Gradients now flow to all branches, weighted by their relative magnitude.

Temperature Effect¤

\(\tau\) Behavior Use Case
0.01 Nearly identical to hard max Final evaluation
0.1 - 1.0 Balanced accuracy and gradient flow Training
10.0+ All branches contribute equally Warm start / exploration

The operator uses a scan-based implementation that fills the matrix row by row, compatible with jax.jit and jax.grad.


Scoring Matrices¤

The substitution score \(s(a, b)\) determines how well two characters match.

DNA Scoring¤

For DNA (alphabet size 4: A, C, G, T), a simple match/mismatch matrix:

\[ s(a, b) = \begin{cases} +2 & \text{if } a = b \\ -1 & \text{if } a \neq b \end{cases} \]

DiffBio provides create_dna_scoring_matrix(match, mismatch) for this.

Learned Scoring¤

Because DiffBio's scoring matrix is an nnx.Param, gradients from a downstream loss update scoring values during training. This lets the alignment learn task-specific substitution preferences — for example, penalizing transitions less than transversions, or learning scoring from structural similarity rather than sequence identity.


Gap Penalties¤

Gaps represent insertions or deletions (indels). Two common models:

Linear: \(g(k) = d \cdot k\) — each gap position costs \(d\).

Affine: \(g(k) = d_{\text{open}} + d_{\text{extend}} \cdot (k-1)\) — opening a gap is expensive, extending it is cheaper. This better models biological indels, which tend to occur in contiguous blocks.

Both gap_open and gap_extend are learnable parameters in DiffBio.


Alignment Outputs¤

SmoothSmithWaterman returns three key outputs:

Output Key Shape Description
score scalar Soft maximum alignment score
alignment_matrix \((m+1, n+1)\) Full DP matrix
soft_alignment \((m, n)\) Per-position alignment scores

The full DP matrix can be used for traceback analysis. The per-position scores give a soft alignment probability: higher values indicate positions that contribute more to the optimal alignment.


When to Use Differentiable Alignment¤

Good fit:

  • Learning scoring matrices for non-standard alphabets or domains
  • Joint optimization where alignment feeds into a downstream classifier
  • Sensitivity analysis — which scoring parameters affect the result most

Not needed:

  • Production alignment of millions of reads (BWA/minimap2 are faster)
  • Standard DNA alignment with known scoring (BLOSUM62, fixed gap penalties)
  • When alignment is a preprocessing step with no learnable parameters

DiffBio's alignment operators are designed for learning and analysis, not for replacing production aligners on large-scale data.


Further Reading¤

References¤

  1. Smith & Waterman. "Identification of common molecular subsequences." J. Mol. Biol. 147(1), 1981.
  2. Petti et al. "End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman." Bioinformatics 39(1), 2023.