Multiple Sequence Alignments
============================

The :mod:`phylozoo.core.sequence` module provides immutable containers for multiple sequence
alignments (MSAs), together with functions for distance computation and bootstrap
resampling.

All classes and functions on this page can be imported from the core sequence module:

.. code-block:: python

   from phylozoo.core.sequence import *
   # or directly
   from phylozoo.core.sequence import MSA

Working with Multiple Sequence Alignments
------------------------------------------

The :class:`~phylozoo.core.sequence.base.MSA` class is the canonical container for aligned
sequences in PhyloZoo. It stores sequences labeled by taxon name and is immutable, so an
alignment cannot change once it has been constructed and passed through an analysis pipeline.

Creating an MSA
^^^^^^^^^^^^^^^^

MSAs can be created from dictionaries mapping taxon names to sequence strings:

.. code-block:: python

   from phylozoo.core.sequence import MSA

   # Create from sequence dictionary
   sequences = {
       "taxon1": "ACGTACGT",
       "taxon2": "ACGTACGT",
       "taxon3": "ACGTTTAA"
   }

   msa = MSA(sequences)

All sequences must have the same length. The constructor validates this
and raises :class:`~phylozoo.utils.exceptions.general.PhyloZooValueError` if the input is invalid.
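
The length check itself can be illustrated without PhyloZoo. The following is a minimal
sketch only: ``check_equal_lengths`` is a hypothetical helper, and it raises the built-in
``ValueError`` rather than ``PhyloZooValueError``.

.. code-block:: python

   def check_equal_lengths(sequences):
       """Return the common sequence length, or raise ValueError.

       Hypothetical helper for illustration; not part of PhyloZoo.
       """
       lengths = {taxon: len(seq) for taxon, seq in sequences.items()}
       if len(set(lengths.values())) > 1:
           raise ValueError(f"Sequences differ in length: {lengths}")
       # An empty dictionary has no common length; this sketch reports 0.
       return next(iter(lengths.values()), 0)


   aligned = {"taxon1": "ACGTACGT", "taxon2": "ACGTACGT", "taxon3": "ACGTTTAA"}
   common_length = check_equal_lengths(aligned)

   try:
       check_equal_lengths({"taxon1": "ACGT", "taxon2": "ACG"})
       ragged_rejected = False
   except ValueError:
       ragged_rejected = True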

For performance-critical applications, you can also create MSAs directly from pre-encoded arrays:

.. code-block:: python

   import numpy as np

   # Create from coded array (efficient for internal operations)
   coded_array = np.array([[0, 1, 2, 3, 0, 1, 2, 3],  # taxon1
                          [0, 1, 2, 3, 0, 1, 2, 3],  # taxon2
                          [0, 1, 2, 3, 3, 3, 0, 0]], # taxon3
                         dtype=np.int8)
   taxa_order = ("taxon1", "taxon2", "taxon3")

   msa = MSA.from_coded_array(coded_array, taxa_order)
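
The coded array above is consistent with the mapping A=0, C=1, G=2, T=3. A minimal sketch
of such an encoding follows. The ``CODES`` table and the ``encode`` helper are hypothetical,
and the codes PhyloZoo actually uses (in particular for gaps and unknown characters) may differ.

.. code-block:: python

   import numpy as np

   # Assumed nucleotide codes, inferred from the example above.
   CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


   def encode(sequences, taxa_order):
       """Encode equal-length sequences as an int8 array, one row per taxon."""
       rows = [[CODES[base] for base in sequences[taxon]] for taxon in taxa_order]
       return np.array(rows, dtype=np.int8)


   sequences = {
       "taxon1": "ACGTACGT",
       "taxon2": "ACGTACGT",
       "taxon3": "ACGTTTAA",
   }
   taxa_order = ("taxon1", "taxon2", "taxon3")
   coded_array = encode(sequences, taxa_order)
   # coded_array has shape (3, 8); the last row is [0, 1, 2, 3, 3, 3, 0, 0]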

Accessing Sequences
^^^^^^^^^^^^^^^^^^^

MSAs provide several methods for accessing sequences and metadata:

.. code-block:: python

   # Get sequence for a specific taxon
   sequence = msa.get_sequence("taxon1")  # Returns: "ACGTACGT"

   # Check if taxon exists
   exists = "taxon1" in msa  # Returns: True

   # Access metadata
   taxa = msa.taxa          # Returns: frozenset of taxon names
   taxa_order = msa.taxa_order  # Returns: tuple with canonical order
   length = msa.sequence_length # Returns: 8
   num_taxa = msa.num_taxa      # Returns: 3

   # Access internal representation
   coded = msa.coded_array  # Returns: read-only numpy array

These accessors expose sequences and metadata without handing out mutable internals,
which preserves the immutability guarantee.
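
NumPy itself supports read-only arrays, which is one way a guarantee of this kind can be
enforced. The sketch below shows the general NumPy mechanism only and makes no claim about
how ``MSA`` is implemented internally.

.. code-block:: python

   import numpy as np

   coded = np.array([[0, 1, 2, 3], [0, 1, 2, 0]], dtype=np.int8)
   coded.setflags(write=False)  # mark the buffer as read-only

   try:
       coded[0, 0] = 3
       write_blocked = False
   except ValueError:
       # NumPy raises ValueError when assigning into a read-only array
       write_blocked = True

   # Work on a copy when a modified version is needed
   editable = coded.copy()
   editable[0, 0] = 3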

File Input/Output
^^^^^^^^^^^^^^^^^

MSAs support reading and writing in multiple phylogenetic formats:

- **FASTA** (default): Standard sequence alignment format; see :doc:`FASTA format <../utils/io/formats/fasta>`
- **NEXUS**: Comprehensive phylogenetic data format; see :doc:`NEXUS format <../utils/io/formats/nexus>`

.. code-block:: python

   # Load from file (auto-detects format by extension)
   msa = MSA.load("alignment.fasta")

   # Load with explicit format
   msa = MSA.load("alignment.nexus", format="nexus")

   # Save to file
   msa.save("output.fasta")
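
For orientation, an aligned FASTA file stores each sequence under a header line that starts
with ``>``. The following minimal sketch parses such text into a dictionary mapping taxon
names to sequence strings. ``parse_fasta`` is a hypothetical helper, not PhyloZoo's reader,
and it treats the whole header line as the taxon name.

.. code-block:: python

   def parse_fasta(text):
       """Parse FASTA text into a {taxon: sequence} dictionary."""
       sequences = {}
       current = None
       for line in text.splitlines():
           line = line.strip()
           if not line:
               continue
           if line.startswith(">"):
               current = line[1:]
               sequences[current] = ""
           elif current is not None:
               # Sequence data may be wrapped over several lines
               sequences[current] += line
       return sequences


   fasta_text = ">taxon1\nACGTACGT\n>taxon2\nACGT\nACGT\n"
   parsed = parse_fasta(fasta_text)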

.. seealso::
   The ``MSA`` class uses the :class:`~phylozoo.utils.io.mixin.IOMixin` interface, providing
   consistent file handling across PhyloZoo classes. For details on the I/O system,
   see the :doc:`I/O manual <../utils/io/overview>`.


Bootstrap Resampling
---------------------

The sequence module provides functions for bootstrap resampling, which is essential
for assessing the statistical support of phylogenetic inferences.

Basic Bootstrap
^^^^^^^^^^^^^^^

The :func:`~phylozoo.core.sequence.bootstrap.bootstrap` function generates bootstrap replicates
by sampling alignment columns with replacement:

.. code-block:: python

   from phylozoo.core.sequence import bootstrap

   # Generate 100 bootstrap replicates
   for replicate in bootstrap(msa, n_bootstrap=100, seed=42):
       # Each replicate is a new MSA with resampled columns
       print(f"Replicate has {replicate.sequence_length} columns")

The ``length`` parameter controls how many columns to sample (it defaults to the full alignment
length), and ``seed`` makes the replicates reproducible, which is useful for testing and debugging.
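
Column resampling with replacement can be sketched directly in NumPy. This illustrates the
technique on a coded array, assuming rows are taxa and columns are alignment positions. It
is not PhyloZoo's implementation, and ``resample_columns`` is a hypothetical helper.

.. code-block:: python

   import numpy as np


   def resample_columns(coded_array, rng, length=None):
       """Return a new array whose columns are drawn with replacement."""
       n_columns = coded_array.shape[1]
       if length is None:
           length = n_columns
       # Draw column indices uniformly, with replacement
       columns = rng.integers(0, n_columns, size=length)
       return coded_array[:, columns]


   alignment = np.array([[0, 1, 2, 3, 0, 1, 2, 3],
                         [0, 1, 2, 3, 3, 3, 0, 0]], dtype=np.int8)

   rng = np.random.default_rng(42)
   replicate = resample_columns(alignment, rng)

   # A generator created from the same seed reproduces the same replicate
   repeat = resample_columns(alignment, np.random.default_rng(42))

Each column of a replicate is a whole column of the original alignment, so the pairing of
characters across taxa at a site is preserved.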

Gene-Based Bootstrap
^^^^^^^^^^^^^^^^^^^^

For multi-gene alignments, the :func:`~phylozoo.core.sequence.bootstrap.bootstrap_per_gene`
function resamples columns within each gene separately:

.. code-block:: python

   from phylozoo.core.sequence import bootstrap_per_gene

   # Define gene boundaries (lengths must sum to total alignment length)
   gene_lengths = [100, 200, 150]  # Three genes

   for replicate in bootstrap_per_gene(msa, gene_lengths, n_bootstrap=100, seed=42):
       # Columns are resampled within each gene independently
       print(f"Gene-based replicate: {replicate.sequence_length} columns")
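
Resampling within gene blocks can be sketched in NumPy as follows. This is an illustration
under the assumption that genes occupy consecutive column ranges in the order given by
``gene_lengths``. The helper ``resample_within_genes`` is hypothetical and is not PhyloZoo's
implementation.

.. code-block:: python

   import numpy as np


   def resample_within_genes(coded_array, gene_lengths, rng):
       """Resample columns with replacement inside each gene block."""
       if sum(gene_lengths) != coded_array.shape[1]:
           raise ValueError("gene_lengths must sum to the alignment length")
       blocks = []
       start = 0
       for gene_length in gene_lengths:
           # Indices are drawn only from this gene's own column range
           columns = rng.integers(start, start + gene_length, size=gene_length)
           blocks.append(coded_array[:, columns])
           start += gene_length
       return np.concatenate(blocks, axis=1)


   # Toy alignment: gene 1 is all 0s (3 columns), gene 2 is all 1s (5 columns)
   alignment = np.array([[0, 0, 0, 1, 1, 1, 1, 1],
                         [0, 0, 0, 1, 1, 1, 1, 1]], dtype=np.int8)
   replicate = resample_within_genes(alignment, [3, 5], np.random.default_rng(42))
   # Gene boundaries survive: the first 3 columns are still 0, the rest still 1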

Distance Computation
--------------------

The sequence module provides efficient functions for computing evolutionary distances
from multiple sequence alignments.

Hamming Distance
^^^^^^^^^^^^^^^^

The :func:`~phylozoo.core.sequence.distances.hamming_distances` function computes normalized
Hamming distances between all pairs of sequences:

.. math::

   d(i,j) = \frac{1}{L} \sum_{k=1}^{L} \mathbf{1}_{s_i[k] \neq s_j[k]}

where :math:`L` is the number of positions at which both sequences are valid (non-gap,
non-unknown), the sum runs over those positions only, and the indicator function equals 1
when the sequences differ at position :math:`k`.

.. code-block:: python

   from phylozoo.core.sequence import hamming_distances
   from phylozoo.core.distance.classifications import is_metric

   # Compute distance matrix
   distance_matrix = hamming_distances(msa)

   # Check if result is a proper metric
   if is_metric(distance_matrix):
       print("Hamming distances form a metric")

The function excludes positions where either sequence contains a gap (``-``) or an unknown
character (``N``), so each pairwise distance is computed only over positions where both
sequences have valid nucleotides. The implementation uses vectorized NumPy operations for
efficient computation on large alignments.
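
A minimal NumPy sketch of this pairwise-deletion computation is shown below. It works on
characters rather than on PhyloZoo's coded array, ``pairwise_hamming`` is a hypothetical
helper, and returning ``nan`` for a pair with no shared valid position is a choice made for
this sketch only.

.. code-block:: python

   import numpy as np


   def pairwise_hamming(sequences, invalid="-N"):
       """Normalized Hamming distances, skipping gap/unknown positions per pair."""
       chars = np.array([list(seq) for seq in sequences])
       valid = ~np.isin(chars, list(invalid))
       n = len(sequences)
       distances = np.zeros((n, n))
       for i in range(n):
           for j in range(i + 1, n):
               both_valid = valid[i] & valid[j]
               n_valid = both_valid.sum()
               n_diff = ((chars[i] != chars[j]) & both_valid).sum()
               d = n_diff / n_valid if n_valid else np.nan
               distances[i, j] = distances[j, i] = d
       return distances


   distances = pairwise_hamming(["ACGTACGT", "ACGTACGT", "ACGTTTAA", "AC-TNCGA"])
   # distances[0, 2] is 0.5: 4 of 8 positions differ
   # distances[0, 3] is 1/6: 1 difference over the 6 positions valid in both

Because invalid positions are dropped per pair, the denominator :math:`L` can differ from
one pair of sequences to the next.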

See Also
--------

- :doc:`API Reference <../../api/core/sequences>` - Complete function signatures and detailed examples
- :doc:`Distance Matrices <distance>` - Working with distance matrices computed from alignments