Multiple Sequence Alignments#

The phylozoo.core.sequence module provides immutable containers for multiple sequence alignments (MSAs), along with comprehensive tools for distance computation and bootstrap resampling.

All classes and functions on this page can be imported from the core sequence module:

from phylozoo.core.sequence import *
# or directly
from phylozoo.core.sequence import MSA

Working with Multiple Sequence Alignments#

The MSA class is the canonical container for aligned sequences in PhyloZoo. It provides an immutable, labeled, and read-only representation that ensures data integrity throughout your analysis pipeline.

Creating an MSA#

MSAs can be created from dictionaries mapping taxon names to sequence strings:

from phylozoo.core.sequence import MSA

# Create from sequence dictionary
sequences = {
    "taxon1": "ACGTACGT",
    "taxon2": "ACGTACGT",
    "taxon3": "ACGTTTAA"
}

msa = MSA(sequences)

All sequences must have the same length. The constructor validates these properties and raises PhyloZooValueError if the input is invalid.

For performance-critical applications, you can also create MSAs directly from pre-encoded arrays:

import numpy as np

# Create from coded array (efficient for internal operations)
coded_array = np.array([[0, 1, 2, 3, 0, 1, 2, 3],  # taxon1
                       [0, 1, 2, 3, 0, 1, 2, 3],  # taxon2
                       [0, 1, 2, 3, 3, 3, 0, 0]], # taxon3
                      dtype=np.int8)
taxa_order = ("taxon1", "taxon2", "taxon3")

msa = MSA.from_coded_array(coded_array, taxa_order)

Accessing Sequences#

MSAs provide several methods for accessing sequences and metadata:

# Get sequence for a specific taxon
sequence = msa.get_sequence("taxon1")  # Returns: "ACGTACGT"

# Check if taxon exists
exists = "taxon1" in msa  # Returns: True

# Access metadata
taxa = msa.taxa          # Returns: frozenset of taxon names
taxa_order = msa.taxa_order  # Returns: tuple with canonical order
length = msa.sequence_length # Returns: 8
num_taxa = msa.num_taxa      # Returns: 3

# Access internal representation
coded = msa.coded_array  # Returns: read-only numpy array

These methods provide safe access to sequences without exposing mutable internals, maintaining immutability guarantees.

File Input/Output#

MSAs support reading and writing in multiple phylogenetic formats:

FASTA (default): Standard sequence alignment format — see FASTA format
NEXUS: Comprehensive phylogenetic data format — see NEXUS format

# Load from file (auto-detects format by extension)
msa = MSA.load("alignment.fasta")

# Load with explicit format
msa = MSA.load("alignment.nexus", format="nexus")

# Save to file
msa.save("output.fasta")

Bootstrap Resampling#

The sequence module provides functions for bootstrap resampling, which is essential for assessing the statistical support of phylogenetic inferences.

Basic Bootstrap#

The bootstrap() function generates bootstrap replicates by sampling alignment columns with replacement:

from phylozoo.core.sequence import bootstrap

# Generate 100 bootstrap replicates
for replicate in bootstrap(msa, n_bootstrap=100, seed=42):
    # Each replicate is a new MSA with resampled columns
    print(f"Replicate has {replicate.sequence_length} columns")

The length parameter controls how many columns to sample (defaults to full alignment length), and seed ensures reproducible results for testing and debugging.

Gene-Based Bootstrap#

For multi-gene alignments, the bootstrap_per_gene() function resamples columns within each gene separately:

from phylozoo.core.sequence import bootstrap_per_gene

# Define gene boundaries (lengths must sum to total alignment length)
gene_lengths = [100, 200, 150]  # Three genes

for replicate in bootstrap_per_gene(msa, gene_lengths, n_bootstrap=100, seed=42):
    # Columns are resampled within each gene independently
    print(f"Gene-based replicate: {replicate.sequence_length} columns")

Distance Computation#

The sequence module provides efficient functions for computing evolutionary distances from multiple sequence alignments.

Hamming Distance#

The hamming_distances() function computes normalized Hamming distances between all pairs of sequences:

\[d(i,j) = \frac{1}{L} \sum_{k=1}^{L} \mathbf{1}_{s_i[k] \neq s_j[k]}\]

where \(L\) is the number of valid (non-gap, non-unknown) positions, and the indicator function equals 1 when sequences differ at position \(k\).

from phylozoo.core.sequence import hamming_distances
from phylozoo.core.distance.classifications import is_metric

# Compute distance matrix
distance_matrix = hamming_distances(msa)

# Check if result is a proper metric
if is_metric(distance_matrix):
    print("Hamming distances form a metric")

The function excludes positions where either sequence contains gaps (-) or unknown characters (N), focusing only on positions where both sequences have valid nucleotides. The implementation uses vectorized NumPy operations for efficient computation on large alignments.

Multiple Sequence Alignments#

Working with Multiple Sequence Alignments#

Creating an MSA#

Accessing Sequences#

File Input/Output#

Bootstrap Resampling#

Basic Bootstrap#

Gene-Based Bootstrap#

Distance Computation#

Hamming Distance#

See Also#