Comparative Orthologous Read-based Analysis of Lineage Substitutions
CORAL is a tool for scalable extraction, detection, and analysis of point mutations across the evolutionary history of species. It simulates reads from the genomes of multiple species, aligns them to a shared reference genome, filters alignments by mapping quality, extracts unambiguous trinucleotide substitutions, and summarizes mutation rates and mutation spectra.
Preprint available at https://doi.org/10.64898/2026.02.02.703326
- Linux (or WSL2 on Windows)
- Conda (Miniforge or Anaconda recommended)
git clone https://github.com/asafpinhasitechnion/CORAL.git
cd CORAL
conda env create -f environment.yml
conda activate coral-env
pip install -e .

coral --help
samtools --version
bwa
datasets --version

The provided environment.yml installs all required dependencies, including:
- Python 3.10
- BWA (classic)
- SAMtools
- NCBI Datasets CLI
- unzip
- All required Python dependencies
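As a complement to the manual version checks, the following is a minimal sketch of a PATH check for the command-line tools named above. The executable names (`coral`, `samtools`, `bwa`, `datasets`, `unzip`) are taken from the commands and dependency list in this README; the helper `find_missing_tools` is hypothetical and not part of CORAL.

```python
import shutil

# Executable names taken from the verification commands and dependency list above.
REQUIRED_TOOLS = ["coral", "samtools", "bwa", "datasets", "unzip"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it inside the activated `coral-env` environment; a tool reported as missing usually means the environment was not activated or the install step did not complete.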
PHYLIP is not required for the core pipeline.
Install it only if you use phylogenetic inference via coral run_multi or coral run_phylip:
conda install -c bioconda phylip

coral run_single \
--outgroup Saccharomyces_mikatae_IFO_1815 GCF_947241705.1 \
--species Saccharomyces_paradoxus GCF_002079055.1 \
Saccharomyces_cerevisiae_S288C GCF_000146045.2 \
--output ../test_output \
--mapq 60 \
--suffix test

This runs the full pipeline, including genome download, reference indexing, read simulation, alignment, mutation extraction, and summary table and plot generation.
coral run_multi \
--species-list '[["Drosophila_melanogaster","GCF_000001215.4"],["Drosophila_sechellia","GCF_004382195.2"],["Drosophila_mauritiana","GCF_004382145.1"],["Drosophila_simulans","GCF_016746395.2"]]' \
--outgroup Drosophila_simulans \
--output ../test_output \
--run-id drosophila_test \
--mapq 60

Note: Multi-species mode is experimental and intended for exploratory analyses.
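The `--species-list` value in the example above is a JSON array of `[name, accession]` pairs, which is easy to mistype by hand. The sketch below shows one way to build and shell-quote that string from Python, using the pairs from the example; the helper name `species_list_arg` is hypothetical and not part of CORAL.

```python
import json
import shlex

# (name, accession) pairs copied from the run_multi example above.
species = [
    ("Drosophila_melanogaster", "GCF_000001215.4"),
    ("Drosophila_sechellia", "GCF_004382195.2"),
    ("Drosophila_mauritiana", "GCF_004382145.1"),
    ("Drosophila_simulans", "GCF_016746395.2"),
]


def species_list_arg(pairs):
    """Serialize (name, accession) pairs as compact JSON, quoted for a POSIX shell."""
    payload = json.dumps([list(pair) for pair in pairs], separators=(",", ":"))
    return shlex.quote(payload)


# Prints the flag and its quoted value, ready to paste into the command line.
print("--species-list", species_list_arg(species))
```

Generating the argument this way keeps the quoting consistent when the species set is long or produced by another script.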
- Download genomes by NCBI assembly accession
- Index the reference genome for alignment
- Simulate FASTQ reads by sliding a window across genomes
- Align simulated reads to the outgroup reference
- Filter alignments by MAPQ and coverage
- Allow customization of aligner and parameters
- Generate pileups from reference and aligned BAMs
- Extract unambiguous trinucleotide substitutions
- Optionally retain genomic positions
- Normalize mutation counts by underlying trinucleotide abundance
- Collapse complementary strands into canonical spectra
- Generate summary tables and visualizations
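The normalization and strand-collapsing steps listed above can be illustrated with a small sketch. It assumes the common pyrimidine-centered convention (a substitution whose reference middle base is A or G is mapped to its reverse complement) and assumes that normalization means dividing each count by the abundance of the collapsed trinucleotide; the source does not specify CORAL's exact convention or formula, and all function names and the toy numbers here are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(context, alt):
    """Map (reference trinucleotide, new middle base) to the pyrimidine-centered strand."""
    if context[1] in "CT":
        return context, alt
    return revcomp(context), alt.translate(COMPLEMENT)


def collapse_counts(counts):
    """Sum mutation counts over complementary strands."""
    collapsed = Counter()
    for (context, alt), n in counts.items():
        collapsed[canonical(context, alt)] += n
    return collapsed


def collapse_abundance(abundance):
    """Sum trinucleotide abundances over complementary strands."""
    collapsed = Counter()
    for context, n in abundance.items():
        key = context if context[1] in "CT" else revcomp(context)
        collapsed[key] += n
    return collapsed


def normalized_spectrum(counts, abundance):
    """Mutations per available trinucleotide site (assumed normalization)."""
    c = collapse_counts(counts)
    a = collapse_abundance(abundance)
    return {key: n / a[key[0]] for key, n in c.items() if a[key[0]] > 0}


# Toy input: ACG>ATG on one strand and its complement CGT>CAT on the other.
counts = {("ACG", "T"): 6, ("CGT", "A"): 4}
abundance = {"ACG": 60, "CGT": 40}
spectrum = normalized_spectrum(counts, abundance)
print(spectrum)
```

In the toy input both entries describe the same event seen from opposite strands, so they collapse into a single `("ACG", "T")` class with 10 mutations over 100 sites.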
Each run produces a self-contained output directory containing:
- Mutations/*_mutations.csv.gz – per-branch mutation lists
- Mutations/*_mutations.json – trinucleotide mutation counts
- Tables/*.tsv – normalized mutation spectra
- Plots/*.png – diagnostic and summary plots
Mutation files are named:
<taxon1>__<taxon2>__<reference>__mutations.*
This indicates mutations inferred on the branch leading to taxon1 since divergence from taxon2, using reference as the outgroup genome.
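For downstream scripts, the naming pattern above can be split back into its parts. The following is a minimal sketch that assumes taxon and reference names never contain a double underscore and that no extra fields (such as a run suffix) are added to the name; the helper `parse_mutation_filename` and the example file name are hypothetical, with the name assembled from the species in the single-pair example.

```python
from typing import NamedTuple


class BranchFile(NamedTuple):
    taxon1: str      # branch on which mutations were inferred
    taxon2: str      # taxon it diverged from
    reference: str   # outgroup genome used as reference
    extension: str   # e.g. "csv.gz" or "json"


def parse_mutation_filename(name):
    """Split '<taxon1>__<taxon2>__<reference>__mutations.<ext>' into its fields."""
    stem, sep, extension = name.partition("__mutations.")
    if not sep:
        raise ValueError(f"not a mutation file name: {name!r}")
    parts = stem.split("__")
    if len(parts) != 3:
        raise ValueError(f"expected three '__'-separated fields in {name!r}")
    return BranchFile(parts[0], parts[1], parts[2], extension)


# Hypothetical file name following the documented pattern.
example = (
    "Saccharomyces_paradoxus__Saccharomyces_cerevisiae_S288C__"
    "Saccharomyces_mikatae_IFO_1815__mutations.csv.gz"
)
info = parse_mutation_filename(example)
print(info.taxon1, info.taxon2, info.reference, info.extension)
```

Pass only the base name of the file (for instance via `pathlib.Path(path).name`), since the sketch does not strip directory components.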
See OUTPUT_FORMAT.md for full file format and naming conventions.
- tutorial.ipynb – command-line tutorial and examples
- OUTPUT_FORMAT.md – output file structure and naming conventions
Details, benchmarking, and results are available in the preprint: https://doi.org/10.64898/2026.02.02.703326
The final reference will be updated upon publication.