Pairwise Distances#

The phylozoo.core.distance module provides immutable containers for pairwise distances between taxa, along with comprehensive tools for classification and optimization. Distance matrices are fundamental to many phylogenetic inference algorithms.

All classes and functions on this page can be imported from the core distance module:

from phylozoo.core.distance import *
# or directly
from phylozoo.core.distance import DistanceMatrix

Working with Distance Matrices#

The DistanceMatrix class is the canonical container for pairwise distances in PhyloZoo. It provides an immutable, labeled, and read-only representation that ensures data integrity throughout your analysis pipeline.

Creating a Distance Matrix#

Distance matrices can be created from NumPy arrays with optional labels:

import numpy as np
from phylozoo.core.distance import DistanceMatrix

# Create a symmetric distance matrix
matrix = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0]
])

# Create with labels
dm = DistanceMatrix(matrix, labels=["A", "B", "C"])

# Or without labels (defaults to 0, 1, 2, ...)
dm2 = DistanceMatrix(matrix)

The matrix must be square and symmetric. The constructor validates these properties and raises PhyloZooValueError if the input is invalid.

Accessing Distances#

Distance matrices provide several methods for accessing distances and metadata:

# Get distance between two labeled items
distance = dm.get_distance("A", "B")  # Returns: 1.0

# Get the index of a label
index = dm.get_index("A")  # Returns: 0

# Access labels and underlying array
labels = dm.labels        # Returns: ("A", "B", "C")
array = dm.np_array       # Returns: read-only numpy array

These methods provide safe access to distances without exposing the mutable underlying array, maintaining immutability guarantees.

File Input/Output#

Distance matrices support reading and writing in multiple phylogenetic formats:

NEXUS (default): Standard phylogenetic data format — see NEXUS format
PHYLIP: Classic phylogenetic format — see PHYLIP format
CSV: Comma-separated values for interoperability — see CSV format

# Load from file (auto-detects format by extension)
dm = DistanceMatrix.load("distances.nexus")

# Load with explicit format
dm = DistanceMatrix.load("distances.phy", format="phylip")

# Save to file
dm.save("output.csv", format="csv")

Classification Functions#

The distance module provides functions to classify distance matrices based on mathematical properties. These classifications are essential for validating that distance matrices meet the requirements of specific algorithms.

Basic Properties#

Non-negativity

The is_nonnegative() function checks that all distances are non-negative: \(d(i,j) \geq 0\) for all \(i, j\). This is a fundamental requirement for distance functions.

Zero Diagonal

The has_zero_diagonal() function verifies that all diagonal entries are exactly zero, ensuring that the distance from any item to itself is zero. This is a strict equality check using NumPy equality, intended to verify metric conventions.

Triangle Inequality

The satisfies_triangle_inequality() function checks whether the distance matrix satisfies the triangle inequality:

\[d(i,k) \leq d(i,j) + d(j,k) \quad \forall i, j, k\]

This property is fundamental to metric spaces and is required by many distance-based algorithms. The implementation uses Numba-accelerated triple loops for efficient computation on moderately sized matrices.

Example:

from phylozoo.core.distance.classifications import satisfies_triangle_inequality

if satisfies_triangle_inequality(dm):
    print("Matrix satisfies triangle inequality")
else:
    print("Matrix violates triangle inequality")

Metric Properties#

Metrics

A metric distance matrix satisfies four properties:

Non-negativity: \(d(x, y) \geq 0\) for all \(x, y\)
Triangle inequality: \(d(x, z) \leq d(x, y) + d(y, z)\) for all \(x, y, z\)
Zero diagonal: \(d(x, x) = 0\) for all \(x\)
Symmetry: \(d(x, y) = d(y, x)\) for all \(x, y\) (enforced by constructor)

The is_metric() function provides a convenient wrapper that checks all of these properties. Use this function when downstream algorithms require a proper metric and you want a single boolean check.

Pseudo-metrics

A pseudo-metric is like a metric but allows distinct items to have zero distance. The is_pseudo_metric() function checks non-negativity and triangle inequality, but does not require a zero diagonal.

Kalmanson Conditions

The Kalmanson condition [Kalmanson, 1975] is a pair of inequalities that must hold for every ordered quadruple \((i < j < k < l)\) in a circular ordering:

\[\begin{split}d(i,j) + d(k,l) &\leq d(i,k) + d(j,l) \\ d(i,l) + d(j,k) &\leq d(i,k) + d(j,l)\end{split}\]

The is_kalmanson() function checks these conditions with respect to a given circular ordering. This function requires:

The matrix to be a pseudo-metric
The provided CircularOrdering to contain exactly the same labels as the distance matrix

The check is implemented using Numba-accelerated loops.

Tree Metrics

The is_tree_metric() function checks whether a distance matrix is a tree metric — a metric that arises as the path-length distances on some edge-weighted tree. The check is based on the four-point condition: for every choice of four taxa the maximum of the three pairwise sums

\[\{d_{ij}+d_{kl},\; d_{ik}+d_{jl},\; d_{il}+d_{jk}\}\]

must be achieved by at least two of them [Bandelt and Dress, 1992]. The test uses a Numba-accelerated loop over all \(\binom{n}{4}\) quartets and returns in \(O(n^4)\) time.

from phylozoo.core.distance.classifications import is_tree_metric

if is_tree_metric(dm):
    print("Distance matrix is a tree metric")

Totally Decomposable Metrics

The is_totally_decomposable() function checks whether a distance matrix is totally decomposable: the split-prime residual \(d^0\) of the canonical split decomposition is zero, meaning the metric is expressed exactly as a weighted sum of split metrics. Every tree metric is totally decomposable, but the converse does not hold — totally decomposable metrics may have incompatible d-splits. See next section for details on split decompositions.

from phylozoo.core.distance.classifications import is_totally_decomposable

if is_totally_decomposable(dm):
    print("All pairwise distances are explained by d-splits alone")

Split Decomposition#

The phylozoo.core.distance.decomposition module implements the canonical split decomposition of [Bandelt and Dress, 1992]. Every finite metric \(d\) decomposes uniquely as

\[d = d^0 + \sum_{S} \alpha_S\,\delta_S\]

where \(S\) ranges over all d-splits (bipartitions with positive isolation index), \(\alpha_S\) is the isolation index of \(S\), \(\delta_S\) is the corresponding split metric, and \(d^0\) is the split-prime residual — a metric that admits no further d-splits.

All classes and functions can be imported from the core distance module:

from phylozoo.core.distance import isolation_index, split_decomposition

Isolation Index#

The isolation_index() function computes the isolation index of a split \((A, B)\):

\[\begin{split}\alpha_{A,B} = \tfrac{1}{2}\, \min_{\substack{i,j\in A \\ k,l\in B}} \bigl(\max\{d_{ij}+d_{kl},\, d_{ik}+d_{jl},\, d_{il}+d_{jk}\} - d_{ij} - d_{kl}\bigr)\end{split}\]

The index is always non-negative; it is strictly positive if and only if the split is a d-split of the distance matrix.

import numpy as np
from phylozoo.core.distance import DistanceMatrix, isolation_index
from phylozoo.core.split import Split

d = np.array([
    [ 0,  4,  5,  7, 13,  8,  6],
    [ 4,  0,  1,  3,  9, 12, 10],
    [ 5,  1,  0,  2,  8, 13, 11],
    [ 7,  3,  2,  0,  6, 11, 13],
    [13,  9,  8,  6,  0,  5,  7],
    [ 8, 12, 13, 11,  5,  0,  2],
    [ 6, 10, 11, 13,  7,  2,  0],
], dtype=float)
dm = DistanceMatrix(d, labels=list("ABCDEFG"))

s = Split({"E", "F", "G"}, {"A", "B", "C", "D"})
print(isolation_index(dm, s))  # 6.0

Split Decomposition#

The split_decomposition() function computes the full decomposition and returns both the WeightedSplitSystem of all d-splits and the residual DistanceMatrix.

system, residual = split_decomposition(dm)

print(len(system.splits))          # 4  (Table 1 has four d-splits)
import numpy as np
print(np.allclose(residual.np_array, 0))  # True  (totally decomposable)

for sp in system.splits:
    print(sp, "→ α =", system.get_weight(sp))

Round-Trip Identity#

The decomposition satisfies \(d^1 + d^0 = d\) exactly (up to floating-point noise), where \(d^1 = \sum_S \alpha_S\,\delta_S\) is computed from the weighted split system via distances_from_splitsystem():

from phylozoo.core.split.algorithms import distances_from_splitsystem

d1_dm = distances_from_splitsystem(system)
labels = list(dm.labels)
n = len(labels)
d1 = np.array([[d1_dm.get_distance(labels[r], labels[c]) for c in range(n)]
               for r in range(n)])

assert np.allclose(d1 + residual.np_array, dm.np_array)  # always holds

Conversely, when a WeightedSplitSystem encodes a tree metric (all d-splits are pairwise compatible), the decomposition is the inverse of distances_from_splitsystem():

from phylozoo.core.split import WeightedSplitSystem, Split
from phylozoo.core.distance import split_decomposition
from phylozoo.core.split.algorithms import distances_from_splitsystem

# Build a tree split system, convert to distances, then recover the system
tree_system = WeightedSplitSystem({
    Split({"A"}, {"B", "C", "D"}): 1.0,
    Split({"B"}, {"A", "C", "D"}): 1.0,
    Split({"C"}, {"A", "B", "D"}): 1.0,
    Split({"D"}, {"A", "B", "C"}): 1.0,
    Split({"A", "B"}, {"C", "D"}): 2.0,
})
dm = distances_from_splitsystem(tree_system)
recovered, residual = split_decomposition(dm)

assert recovered.splits == tree_system.splits        # same d-splits
assert np.allclose(residual.np_array, 0)             # zero residual

Pairwise Distances#

Working with Distance Matrices#

Creating a Distance Matrix#

Accessing Distances#

File Input/Output#

Classification Functions#

Basic Properties#

Metric Properties#

Split Decomposition#

Isolation Index#

Split Decomposition#

Round-Trip Identity#

See Also#