Parallel Execution

The phylozoo.utils.parallel module provides the standard interface for parallel execution across PhyloZoo. It defines a small set of backends (sequential, threading, multiprocessing) behind a single ParallelConfig configuration object and a common executor protocol, so that every function that supports parallelization exposes the same API to its callers.

All classes on this page can be imported from the parallel module:

from phylozoo.utils.parallel import ParallelConfig, ParallelBackend

Note

No PhyloZoo function currently exposes a parallel parameter. This module is the intended mechanism for adding parallelization, both in future PhyloZoo functions and in dependent packages, so that all parallelized functions share a uniform interface.

How parallelization is organized

Parallel execution in PhyloZoo follows a single pattern: a function that supports parallel execution accepts a ParallelConfig (or None for the default sequential behavior), obtains an executor from it, and uses the executor’s map() / starmap() to dispatch work. Choosing a backend, the number of workers, and the chunk size is therefore done in one place (the config), and the function itself does not need to know about threads, processes, or pools.

Backends

Three backends are available through ParallelBackend:

  • SEQUENTIAL — no parallelization; executes one item at a time. The default, and the right choice for debugging or when parallel overhead would outweigh the benefits.

  • THREADING — uses a thread pool (concurrent.futures.ThreadPoolExecutor). Good for I/O-bound work or when worker memory must be shared; limited by Python’s GIL for CPU-bound tasks.

  • MULTIPROCESSING — uses a process pool (multiprocessing.Pool). Best for CPU-bound tasks; bypasses the GIL but requires that the function and its arguments be picklable.

Each backend is implemented as a small executor class (SequentialExecutor, ThreadingExecutor, MultiprocessingExecutor) that conforms to the ParallelExecutor protocol with map and starmap methods.
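To make the shape of such a protocol concrete, here is a minimal sketch that uses only the standard library. The names ExecutorSketch, SequentialSketch, ThreadingSketch, and squares are hypothetical stand-ins, not PhyloZoo's classes; the real executors may differ in signatures and in details such as laziness or chunking.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import starmap
from typing import Any, Callable, Iterable, Iterator, Protocol, Tuple


class ExecutorSketch(Protocol):
    """Hypothetical stand-in for an executor protocol with map and starmap."""

    def map(self, func: Callable[[Any], Any], iterable: Iterable[Any]) -> Iterator[Any]:
        ...

    def starmap(
        self, func: Callable[..., Any], iterable: Iterable[Tuple[Any, ...]]
    ) -> Iterator[Any]:
        ...


class SequentialSketch:
    """Runs one item at a time using the built-in map and itertools.starmap."""

    def map(self, func, iterable):
        return map(func, iterable)

    def starmap(self, func, iterable):
        return starmap(func, iterable)


class ThreadingSketch:
    """Dispatches items to a thread pool; results keep the input order."""

    def __init__(self, n_jobs: int = 4) -> None:
        self.n_jobs = n_jobs

    def map(self, func, iterable):
        with ThreadPoolExecutor(max_workers=self.n_jobs) as pool:
            # Collect results before the pool shuts down.
            return iter(list(pool.map(func, iterable)))

    def starmap(self, func, iterable):
        # Unpack each argument tuple into positional arguments.
        return self.map(lambda args: func(*args), iterable)


def squares(executor: ExecutorSketch, values: Iterable[int]) -> list:
    """Calling code depends only on the protocol, not on a concrete backend."""
    return list(executor.map(lambda v: v * v, values))


print(squares(SequentialSketch(), [1, 2, 3]))  # [1, 4, 9]
print(squares(ThreadingSketch(2), [1, 2, 3]))  # [1, 4, 9]
```

Because squares only calls map, swapping the backend changes how the work is scheduled but not the result.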

Configuring parallel execution

Construct a ParallelConfig to describe how work should be dispatched.

Sequential (default)

from phylozoo.utils.parallel import ParallelConfig, ParallelBackend

config = ParallelConfig(backend=ParallelBackend.SEQUENTIAL)

Multiprocessing with a fixed number of workers

config = ParallelConfig(
    backend=ParallelBackend.MULTIPROCESSING,
    n_jobs=4,
)

Multiprocessing using all available cores

Pass n_jobs=None (or -1) to use every available CPU core:

config = ParallelConfig(
    backend=ParallelBackend.MULTIPROCESSING,
    n_jobs=None,
)
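As an illustration of the None / -1 convention, the sketch below resolves an n_jobs value to a concrete worker count with os.cpu_count(). The helper resolve_n_jobs is hypothetical and is not part of PhyloZoo; how ParallelConfig resolves the value internally, and how it treats other non-positive values, is an assumption here.

```python
import os
from typing import Optional


def resolve_n_jobs(n_jobs: Optional[int]) -> int:
    """Hypothetical helper: turn an n_jobs setting into a worker count."""
    if n_jobs is None or n_jobs == -1:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    if n_jobs < 1:
        # Assumed behavior for this sketch: reject other non-positive values.
        raise ValueError(f"n_jobs must be a positive integer, None, or -1; got {n_jobs}")
    return n_jobs


print(resolve_n_jobs(4))     # 4
print(resolve_n_jobs(None))  # number of CPU cores reported by the OS
```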

Threading

config = ParallelConfig(
    backend=ParallelBackend.THREADING,
    n_jobs=4,
)

The backend can also be supplied as a string ("sequential", "threading", "multiprocessing") for convenience.
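One common way to support both spellings is an enum whose member values are the backend strings. The sketch below shows that idea with a hypothetical BackendSketch enum and coerce_backend helper; whether ParallelBackend is implemented this way, and whether it normalizes case, is not stated in this page and is assumed only for illustration.

```python
from enum import Enum
from typing import Union


class BackendSketch(Enum):
    """Hypothetical stand-in for ParallelBackend; member values are assumed."""

    SEQUENTIAL = "sequential"
    THREADING = "threading"
    MULTIPROCESSING = "multiprocessing"


def coerce_backend(value: Union[str, BackendSketch]) -> BackendSketch:
    """Accept either an enum member or its string value."""
    if isinstance(value, BackendSketch):
        return value
    # Enum lookup by value; raises ValueError for an unknown backend name.
    return BackendSketch(value)


print(coerce_backend("threading"))               # BackendSketch.THREADING
print(coerce_backend(BackendSketch.SEQUENTIAL))  # BackendSketch.SEQUENTIAL
```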

Using an executor directly

Once a config is built, call get_executor() to obtain an executor and use its map() or starmap() methods:

import itertools
from phylozoo.utils.parallel import ParallelConfig, ParallelBackend

def process_quartet(indices: tuple[int, int, int, int]) -> float:
    i, j, k, l = indices
    # ... do work ...
    return 0.0

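# Build the work list and run the pool only when executed as a script.
# With the "spawn" start method (the default on Windows and macOS),
# worker processes re-import this module, so unguarded pool creation
# would run again in every worker.
if __name__ == "__main__":
    combinations = list(itertools.combinations(range(20), 4))

    config = ParallelConfig(
        backend=ParallelBackend.MULTIPROCESSING,
        n_jobs=4,
    )
    executor = config.get_executor()
    results = list(executor.map(process_quartet, combinations))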

For functions that take multiple positional arguments, use starmap() with an iterable of argument tuples:

def pair_distance(a: int, b: int) -> float:
    # ... do work ...
    return 0.0

pairs = [(0, 1), (0, 2), (1, 2)]
results = list(executor.starmap(pair_distance, pairs))
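The difference between the two methods can be seen with the standard library alone. This sketch assumes the executor's starmap() mirrors itertools.starmap, unpacking each tuple into positional arguments; the body of pair_distance is a placeholder chosen for illustration.

```python
from itertools import starmap


def pair_distance(a: int, b: int) -> float:
    # Placeholder computation for illustration only.
    return float(abs(a - b))


pairs = [(0, 1), (0, 2), (1, 2)]

# starmap unpacks each tuple: pair_distance(0, 1), pair_distance(0, 2), ...
via_starmap = list(starmap(pair_distance, pairs))

# The same result with map requires a wrapper that unpacks the tuple.
via_map = list(map(lambda pair: pair_distance(*pair), pairs))

print(via_starmap)  # [1.0, 2.0, 1.0]
print(via_map)      # [1.0, 2.0, 1.0]
```

The lambda wrapper is fine for sequential or threaded execution, but it cannot be pickled, which is one reason to prefer starmap() with a module-level function when using multiprocessing.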

Pattern for PhyloZoo functions

When a PhyloZoo function wants to support parallel execution, it should accept a parallel keyword argument typed as ParallelConfig | None, default to a sequential config when None is given, and dispatch its inner loop through the executor:

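from phylozoo.utils.parallel import ParallelConfig, ParallelBackend

def _process_item(item: int) -> float:
    # Placeholder helper, defined at module level so it stays picklable.
    # ... do work ...
    return float(item)

def my_parallel_function(
    data: list[int],
    parallel: ParallelConfig | None = None,
) -> list[float]:
    # Fall back to sequential execution when no config is given.
    if parallel is None:
        parallel = ParallelConfig(backend=ParallelBackend.SEQUENTIAL)
    executor = parallel.get_executor()
    return list(executor.map(_process_item, data))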

This keeps the choice of backend and worker count in the caller’s hands while the function body stays agnostic to the parallelization strategy.

Warning

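For the multiprocessing backend, func, every item in iterable, and every return value must be picklable, because they are sent between processes. Module-level functions are picklable; lambdas and functions defined inside other functions (closures) are not.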
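A quick way to check a callable before handing it to a process pool is to try pickling it. The helper is_picklable below is a hypothetical convenience built on the standard pickle module, not a PhyloZoo function.

```python
import math
import pickle
from typing import Any


def is_picklable(obj: Any) -> bool:
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def make_closure(offset: int):
    # A function defined inside another function: pickle cannot find it by name.
    def add_offset(x: int) -> int:
        return x + offset

    return add_offset


print(is_picklable(math.sqrt))        # True: importable by name
print(is_picklable(lambda x: x + 1))  # False: anonymous function
print(is_picklable(make_closure(3)))  # False: nested function
```

Functions are pickled by reference to their importable name, so the usual fix is to move the logic into a module-level function and pass any extra state as explicit arguments.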

See Also