# Ghost Population Detection Pipeline
This repository contains a Snakemake-based workflow to detect unsampled "ghost" populations in genomic datasets using demographic inference, statistical testing, and model selection.
The pipeline integrates STRUCTURE, IMa3, ARGweaver, and custom analysis scripts to evaluate signals of ghost introgression in population genomics data.
## Key Features
- **STRUCTURE-based inference** of admixture models across multiple K values
- **Likelihood Ratio Tests (LRTs)** using IMa3 for model comparison
- **Bootstrap testing** and AIC/BIC model selection
- **ARGweaver-based coalescent simulations** and TMRCA distribution analysis
- **Multimodality tests** (e.g., Hartigan’s Dip Test) to detect non-standard coalescent patterns
- Fully automated with **Snakemake**
- Reproducible environments with **Conda**
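
The model-comparison features listed above (LRTs and AIC/BIC selection) come down to a few lines of arithmetic on log-likelihoods. The following is a minimal, standard-library-only sketch of that arithmetic, not the code in `scripts/LRT_test.py`. The function names and the example log-likelihood values are hypothetical. The plain chi-square reference distribution is also an assumption: tests that fix a parameter at a boundary (for example, a migration rate of zero) are often compared against a mixture distribution instead.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed-form starting points for the regularized upper incomplete gamma
    # function: Q(1/2, y) = erfc(sqrt(y)) and Q(1, y) = exp(-y).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_null: float, loglik_alt: float, df: int):
    """Return (statistic, p-value) for a nested-model LRT.

    df is the number of extra free parameters in the alternative model.
    Assumes a plain chi-square null distribution (see the caveat above).
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    return stat, chi2_sf(stat, df)


def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * loglik


if __name__ == "__main__":
    # Hypothetical log-likelihoods for a model without and with a ghost population.
    stat, p_value = likelihood_ratio_test(loglik_null=-1234.5, loglik_alt=-1230.0, df=1)
    print(f"LRT statistic = {stat:.3f}, p = {p_value:.5f}")
    print(f"AIC (alt) = {aic(-1230.0, 3):.1f}, BIC (alt) = {bic(-1230.0, 3, 100):.1f}")
```

With AIC and BIC, the model with the lower score is preferred. BIC penalizes extra parameters more heavily as the number of observations grows, so the two criteria can disagree on whether a ghost-population model is justified.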
## Repository Structure
```text
ghost-pop-gen/
├── Snakefile                 # Main Snakemake pipeline
├── Snakefile.part3           # ARGweaver + modality analysis
├── environment.yml           # Main conda environment
├── config/                   # YAML config files and model specifications
│   ├── config.yaml
│   ├── model1.par
│   └── nested_models_2pop.txt
├── data/                     # Input files (FASTA, .u, .str)
│   ├── fasta/
│   ├── ima3_inputs_2pop/
│   ├── ima3_inputs_3pop/
│   └── structure_inputs/
├── envs/                     # Conda envs for specific tools
│   └── argweaver_py2.yaml
├── results/                  # Output files and visualizations
│   ├── structure_outputs/
│   ├── ima3/
│   ├── *.csv
│   └── *.png
├── scripts/                  # R, Python, and Bash helper scripts
├── software/                 # Compiled tools (e.g., ARGweaver)
└── README.md
```

## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/Megmugure/ghost-pop-gen.git
  cd ghost-pop-gen
  ```

- Create and activate the conda environment:

  ```bash
  conda env create -f environment.yml
  conda activate ghost-pop-gen
  ```

## Usage

To run the full workflow:

```bash
snakemake --cores 4
```

To perform a dry run:

```bash
snakemake -n
```

To run the ARGweaver + modality testing component separately:

```bash
snakemake -s Snakefile.part3 --cores 4
```

To generate a DAG (workflow graph):

```bash
snakemake --dag | dot -Tpng > dag.png
```

To run individual analysis scripts:

```bash
# Run Kolmogorov-Smirnov test on TMRCA values
Rscript scripts/KS_tests.R data/input.tmrca

# Run LRT test
python scripts/LRT_test.py results/lrt_values.txt

# Run STRUCTURE bootstrap LRT
bash scripts/bootstrap_test.sh data/structure_data.str
```

## Citation

If you use this pipeline in your research, please cite:

(preprint link or DOI coming soon)

## License

This project is licensed under the MIT License. See the LICENSE file for full details.

## Contact

Margaret Wanjiku (GitHub: Megmugure)