sequence#

Sequence module.

This module provides classes and functions for working with multiple sequence alignments (MSAs). Sequences are stored internally as NumPy arrays for efficient computation. The public API (MSA, bootstrap, bootstrap_per_gene, hamming_distances) is re-exported here; the implementation is split across the base, distances, bootstrap, and io submodules.

Main Class#

Multiple Sequence Alignment (MSA) base module.

This module provides the MSA class for working with multiple sequence alignments. Sequences are stored internally as numpy arrays for efficient computation.

class phylozoo.core.sequence.base.MSA(sequences: dict[str, str])[source]#

Bases: IOMixin

Immutable Multiple Sequence Alignment.

A Multiple Sequence Alignment (MSA) represents a collection of aligned sequences where all sequences have the same length. Each sequence is associated with a taxon (sequence identifier).

Sequences are stored internally as numpy arrays (int8) for efficient computation. This class is immutable - once created, the alignment structure cannot be modified. All sequences are stored internally in a canonical order.

Parameters:

sequences (dict[str, str]) – Dictionary mapping taxon names (sequence identifiers) to sequence strings. All sequences must have the same length.

Raises:

PhyloZooValueError – If sequences dictionary is empty, or if sequences have different lengths.
PhyloZooTypeError – If sequences is not a dictionary.

Notes

Supported I/O formats:

fasta (default): .fasta, .fa, .fas
nexus: .nexus, .nex, .nxs

The internal representation uses:

Numpy array (_coded_array) of shape (num_taxa, sequence_length) with dtype int8
Precomputed lookup table for efficient character encoding
Reverse lookup table for decoding back to strings when needed
Taxa order for consistent indexing

Examples

>>> sequences = {
...     "taxon1": "ACGTACGT",
...     "taxon2": "ACGTACGT",
...     "taxon3": "ACGTACGT"
... }
>>> msa = MSA(sequences)
>>> len(msa)
3
>>> msa.sequence_length
8
>>> "taxon1" in msa
True
>>> msa.get_sequence("taxon1")
'ACGTACGT'
>>> msa.get_sequence("nonexistent")
None

__contains__(taxon: str) → bool[source]#

Check if a taxon is in the alignment.

Parameters:: taxon (str) – Taxon name to check.
Returns:: True if taxon is in the alignment, False otherwise.
Return type:: bool

Examples

>>> sequences = {"taxon1": "ACGT", "taxon2": "ACGT"}
>>> msa = MSA(sequences)
>>> "taxon1" in msa
True
>>> "taxon3" in msa
False

__len__() → int[source]#

Return the number of taxa in the alignment.

Returns:: Number of taxa.
Return type:: int

__repr__() → str[source]#

Return string representation of the MSA.

Returns:: String representation showing number of taxa and sequence length.
Return type:: str

__str__() → str[source]#

Return detailed string representation of the MSA.

Returns:: Detailed string representation with sequence previews.
Return type:: str

property coded_array: ndarray#

Get the coded array representation of the MSA.

Returns a numpy array of shape (num_taxa, sequence_length) where each character is encoded according to the character encoding dictionary. Characters not in the encoding are mapped to -1.

Returns:: Array of shape (num_taxa, sequence_length) with dtype int8. Each element is the encoded value of the corresponding character.
Return type:: np.ndarray

Notes

This returns the internal array directly (not a copy for efficiency). The array is read-only from the user’s perspective due to immutability.

copy() → MSA[source]#

Create a copy of the MSA.

Returns:: A new MSA instance with the same sequences.
Return type:: MSA

Notes

Returns a new instance with the same data. The copy shares no mutable state with the original.

classmethod from_coded_array(coded_array: ndarray, taxa_order: tuple[str, ...]) → MSA[source]#

Create an MSA directly from a coded numpy array.

This method creates an MSA instance without encoding/decoding sequences, which is useful for efficient operations like bootstrapping where you already have a coded array representation.

Parameters:

coded_array (np.ndarray) – Numpy array of shape (num_taxa, sequence_length) with dtype int8. Each element should be an encoded character value according to DEFAULT_NUCLEOTIDE_ENCODING.
taxa_order (tuple[str, ...]) – Tuple of taxon names in the order corresponding to the rows of coded_array. Must have length equal to the first dimension of coded_array.

Returns:

A new MSA instance created from the coded array.

Return type:

MSA

Raises:

PhyloZooTypeError – If coded_array is not a numpy array, or if taxa_order is not a tuple.
PhyloZooValueError – If coded_array is empty, has wrong shape, wrong dtype, or if taxa_order length doesn’t match coded_array shape.

Examples

>>> import numpy as np
>>> coded = np.array([[0, 1, 2, 3], [0, 1, 2, 3]], dtype=np.int8)
>>> taxa_order = ("taxon1", "taxon2")
>>> msa = MSA.from_coded_array(coded, taxa_order)
>>> msa.sequence_length
4
>>> msa.num_taxa
2

Notes

This method bypasses the normal encoding process, making it efficient for operations that work directly with coded arrays (e.g., bootstrap sampling). The coded_array is copied to ensure immutability. Uses DEFAULT_NUCLEOTIDE_ENCODING for character encoding.

get_sequence(taxon: str) → str | None[source]#

Get the sequence for a given taxon (lazy decoding from numpy array).

Parameters:: taxon (str) – Taxon name.
Returns:: Sequence string if taxon exists, None otherwise.
Return type:: str | None

Examples

>>> sequences = {"taxon1": "ACGT", "taxon2": "ACGT"}
>>> msa = MSA(sequences)
>>> msa.get_sequence("taxon1")
'ACGT'
>>> msa.get_sequence("taxon3")
None

get_sequences() → dict[str, str][source]#

Get a copy of all sequences as a dictionary.

Returns:: Dictionary mapping taxon names to sequences.
Return type:: dict[str, str]

Notes

Returns a copy to maintain immutability. The returned dictionary can be modified without affecting the MSA.

property num_taxa: int#

Get the number of taxa in the alignment (read-only).

Returns:: Number of taxa.
Return type:: int

property sequence_length: int#

Get the length of sequences in the alignment (read-only).

Returns:: Length of all sequences.
Return type:: int

property taxa: frozenset[str]#

Get the set of all taxon names (read-only).

Returns:: Frozen set of all taxon names.
Return type:: frozenset[str]

property taxa_order: tuple[str, ...]#

Get the taxon names in canonical order (read-only).

Returns:: Tuple of taxon names in sorted order.
Return type:: tuple[str, …]

Distance Computation#

Distance computation module for MSA.

This module provides functions for computing distance matrices from MSAs, including normalized Hamming distances.

phylozoo.core.sequence.distances.hamming_distances(msa: MSA) → DistanceMatrix[source]#

Compute normalized Hamming distances between all pairs of taxa in an MSA.

The Hamming distance between two sequences is the number of positions where they differ, normalized by the number of positions where both sequences have valid nucleotides (i.e., excluding positions with gaps or unknown characters).

This function uses vectorized numpy operations for efficiency, making it suitable for large alignments.

Parameters:: msa (MSA) – The multiple sequence alignment.
Returns:: A symmetric distance matrix with normalized Hamming distances between all pairs of taxa. The matrix has zero diagonal and is normalized to [0, 1] range.
Return type:: DistanceMatrix

Examples

>>> sequences = {
...     "taxon1": "ACGTACGT",
...     "taxon2": "ACGTACGT",
...     "taxon3": "ACGTTTAA"
... }
>>> msa = MSA(sequences)
>>> dm = hamming_distances(msa)
>>> dm.get_distance("taxon1", "taxon2")
0.0
>>> dm.get_distance("taxon1", "taxon3")
0.5  # 4 differences out of 8 positions

Notes

The normalization excludes positions where either sequence has a gap (-) or unknown character (N). Only positions where both sequences have valid nucleotides (A, C, G, T) are considered.

This implementation uses vectorized numpy operations for maximum efficiency on large alignments.

Bootstrap Functions#

Bootstrap module for Multiple Sequence Alignments.

This module provides functions for generating bootstrap replicates of MSAs, including standard bootstrap and per-gene bootstrap methods.

phylozoo.core.sequence.bootstrap.bootstrap(msa: MSA, n_bootstrap: int, length: int | None = None, seed: int | None = None) → Iterator[MSA][source]#

Generate bootstrap replicates of an MSA.

This function yields bootstrapped versions of the input MSA by sampling columns with replacement. Each bootstrap replicate has the same number of taxa and the same sequence length (or specified length).

Parameters:

msa (MSA) – The multiple sequence alignment to bootstrap.
n_bootstrap (int) – Number of bootstrap replicates to generate. Must be positive.
length (int | None, optional) – Length of the bootstrapped sequences. If None, uses the original sequence length. By default None.
seed (int | None, optional) – Random seed for reproducibility. If None, uses random state. By default None.

Yields:

MSA – A bootstrapped MSA replicate.

Examples

>>> sequences = {
...     "taxon1": "ACGTACGT",
...     "taxon2": "ACGTACGT",
...     "taxon3": "ACGTACGT"
... }
>>> msa = MSA(sequences)
>>> bootstrapped = next(bootstrap(msa, n_bootstrap=1, seed=42))
>>> bootstrapped.sequence_length == msa.sequence_length
True
>>> bootstrapped.num_taxa == msa.num_taxa
True

Notes

Bootstrap sampling is done by sampling column indices with replacement using numpy’s random number generator for efficiency. The function uses vectorized operations to create bootstrapped sequences from the coded array representation. Uses MSA.from_coded_array() to avoid encoding/decoding overhead for maximum efficiency.

phylozoo.core.sequence.bootstrap.bootstrap_per_gene(msa: MSA, gene_lengths: list[int], n_bootstrap: int, seed: int | None = None) → Iterator[MSA][source]#

Generate bootstrap replicates of an MSA with per-gene bootstrapping.

This function bootstraps each gene separately, maintaining the structure of gene boundaries. For each gene, columns are sampled with replacement only from within that gene’s boundaries.

Parameters:

msa (MSA) – The multiple sequence alignment to bootstrap.
gene_lengths (list[int]) – List of gene lengths. The sum of gene lengths must equal the sequence length of the MSA.
n_bootstrap (int) – Number of bootstrap replicates to generate. Must be positive.
seed (int | None, optional) – Random seed for reproducibility. If None, uses random state. By default None.

Yields:

MSA – A bootstrapped MSA replicate with per-gene bootstrapping.

Raises:

PhyloZooValueError – If gene_lengths is empty, contains non-positive values, or if the sum of gene lengths does not equal the MSA sequence length.

Examples

>>> sequences = {
...     "taxon1": "ACGTACGT",
...     "taxon2": "ACGTACGT",
...     "taxon3": "ACGTACGT"
... }
>>> msa = MSA(sequences)
>>> gene_lengths = [4, 4]  # Two genes of length 4 each
>>> bootstrapped = next(bootstrap_per_gene(msa, gene_lengths, n_bootstrap=1, seed=42))
>>> bootstrapped.sequence_length == msa.sequence_length
True

Notes

Per-gene bootstrapping preserves gene boundaries by sampling columns only within each gene’s region. This is useful when genes have different evolutionary histories or when gene boundaries are biologically meaningful. Uses MSA.from_coded_array() to avoid encoding/decoding overhead for maximum efficiency.

I/O Support#

MSA I/O module.

This module provides format handlers for reading and writing MSAs to/from files. Format handlers are registered with FormatRegistry for use with the IOMixin system.

The following format handlers are defined and registered:

fasta: FASTA format for sequence alignments (extensions: .fasta, .fa, .fas)
- Writer: to_fasta() - Converts MSA to FASTA string
- Reader: from_fasta() - Parses FASTA string to MSA
nexus: NEXUS format for sequence alignments (extensions: .nexus, .nex, .nxs)
- Writer: to_nexus() - Converts MSA to NEXUS string
- Reader: from_nexus() - Parses NEXUS string to MSA

These handlers are automatically registered when this module is imported. MSA inherits from IOMixin, so you can use:

msa.save(‘file.fasta’) - Save to file (auto-detects format)
msa.load(‘file.fasta’) - Load from file (auto-detects format)
msa.to_string(format=’fasta’) - Convert to string
MSA.from_string(string, format=’fasta’) - Parse from string
MSA.convert(‘in.fasta’, ‘out.nexus’) - Convert between formats

phylozoo.core.sequence.io.from_fasta(fasta_string: str, **kwargs: Any) → MSA[source]#

Parse a FASTA format string and create an MSA.

Parameters:

fasta_string (str) – FASTA format string containing sequence alignment data.
**kwargs – Additional arguments (currently unused, for compatibility).

Returns:

Parsed MSA.

Return type:

MSA

Raises:

PhyloZooParseError – If the FASTA string is malformed or cannot be parsed (e.g., empty taxon identifier, sequence data before header, no sequences found).

Examples

>>> from phylozoo.core.sequence.io import from_fasta
>>> fasta_str = ">taxon1\nACGTACGT\n>taxon2\nTGCAACGT\n"
>>> msa = from_fasta(fasta_str)
>>> len(msa)
2
>>> msa.get_sequence("taxon1")
'ACGTACGT'

Notes

This parser:

Handles multi-line sequences (concatenates lines between headers)
Strips whitespace from sequence lines
Converts taxon identifiers to strings (FASTA format uses strings)
Raises error if sequences have different lengths

phylozoo.core.sequence.io.from_nexus(nexus_string: str, **kwargs: Any) → MSA[source]#

Parse a NEXUS format string and create an MSA.

Parameters:

nexus_string (str) – NEXUS format string containing sequence alignment data.
**kwargs – Additional arguments (currently unused, for compatibility).

Returns:

Parsed MSA.

Return type:

MSA

Raises:

PhyloZooParseError – If the NEXUS string is malformed or cannot be parsed (e.g., missing TAXA or CHARACTERS blocks, mismatched number of taxa and matrix rows, invalid matrix format).

Examples

>>> from phylozoo.core.sequence.io import from_nexus
>>>
>>> nexus_str = '''#NEXUS
...
... BEGIN TAXA;
...     DIMENSIONS ntax=2;
...     TAXLABELS
...         taxon1
...         taxon2
...     ;
... END;
...
... BEGIN CHARACTERS;
...     DIMENSIONS nchar=8;
...     FORMAT datatype=DNA missing=N gap=-;
...     MATRIX
...         taxon1    ACGTACGT
...         taxon2    TGCAACGT
...     ;
... END;'''
>>>
>>> msa = from_nexus(nexus_str)
>>> len(msa)
2
>>> msa.get_sequence("taxon1")
'ACGTACGT'

Notes

This parser supports:

TAXA block with TAXLABELS
CHARACTERS block with MATRIX
Handles missing and gap characters as specified in FORMAT
Converts taxon identifiers to strings

phylozoo.core.sequence.io.to_fasta(msa: MSA, **kwargs: Any) → str[source]#

Convert an MSA to FASTA format string.

FASTA format consists of:

Each sequence starts with a ‘>’ followed by the taxon identifier
The sequence follows on subsequent lines (can be wrapped)
Sequences are separated by newlines

Parameters:

msa (MSA) – The MSA to convert.
**kwargs –
Additional arguments:
- line_length (int): Maximum line length for sequences (default: 80). Set to 0 or None for no wrapping.

Returns:

The FASTA format string representation of the MSA.

Return type:

str

Examples

>>> from phylozoo.core.sequence import MSA
>>> from phylozoo.core.sequence.io import to_fasta
>>>
>>> sequences = {"taxon1": "ACGTACGT", "taxon2": "TGCAACGT"}
>>> msa = MSA(sequences)
>>> fasta_str = to_fasta(msa)
>>> '>taxon1' in fasta_str
True
>>> 'ACGTACGT' in fasta_str
True

Notes

Taxon identifiers are converted to strings for output. Sequences are wrapped to 80 characters per line by default.

phylozoo.core.sequence.io.to_nexus(msa: MSA, **kwargs: Any) → str[source]#

Convert an MSA to NEXUS format string.

NEXUS format for sequences consists of:

#NEXUS header
TAXA block with taxon labels
CHARACTERS block with sequence data

Parameters:

msa (MSA) – The MSA to convert.
**kwargs –
Additional arguments:
- datatype (str): Data type (default: ‘DNA’). Can be ‘DNA’, ‘RNA’, ‘PROTEIN’, etc.
- missing (str): Missing data character (default: ‘N’)
- gap (str): Gap character (default: ‘-‘)

Returns:

The NEXUS format string representation of the MSA.

Return type:

str

Examples

>>> from phylozoo.core.sequence import MSA
>>> from phylozoo.core.sequence.io import to_nexus
>>>
>>> sequences = {"taxon1": "ACGTACGT", "taxon2": "TGCAACGT"}
>>> msa = MSA(sequences)
>>> nexus_str = to_nexus(msa)
>>> '#NEXUS' in nexus_str
True
>>> 'BEGIN TAXA' in nexus_str
True
>>> 'BEGIN CHARACTERS' in nexus_str
True

Notes

The NEXUS format includes:

Taxa block with DIMENSIONS and TAXLABELS
Characters block with DIMENSIONS, FORMAT, and MATRIX
Taxon identifiers are converted to strings