A production-ready, modular Snakemake workflow for calling Copy Number Variations (CNVs) from tumor exome sequencing data. This pipeline integrates CNVkit for core coverage analysis and segmentation with PureCN for sophisticated tumor purity and ploidy estimation, designed specifically for FFPE tumor samples.
- ποΈ Modular Architecture: Organized into focused modules for maintainability and flexibility
- π End-to-End Automation: From BAM files to final VCF calls with comprehensive QC
- 𧬠FFPE-Optimized: Implements best practices for FFPE data with pooled Panel of Normals and contamination filtering
- π― Hybrid Purity Estimation: PureCN for Tumor/Normal pairs, fallback estimates for Tumor-only samples
- π Rich Visualizations: Automated generation of scatter plots, diagrams, heatmaps, and gene-focused plots
- π Reproducible: Self-contained Conda environments ensure portability across systems
- β‘ HPC-Ready: Optimized for cluster execution with Slurm integration and resource management
The pipeline follows a modular design with five core components:
cnv_pipeline.smk (Main Controller)
βββ rules/01_purity.smk # Purity estimation with PureCN
βββ rules/02_pon.smk # Panel of Normals generation
βββ rules/03_calling.smk # CNV calling and export
βββ rules/04_plotting.smk # Comprehensive visualization
-
π― Target Preparation (
02_pon.smk
)- Annotate targets BED file with gene information from refFlat
- Split large regions for optimal bin sizes
- Create antitarget regions for off-target coverage
-
π§ͺ Purity Estimation (
01_purity.smk
)- PureCN analysis for
TvsN
samples with VCF files - Fallback to user-supplied estimates for other cases
- Consolidated purity values across all samples
- PureCN analysis for
-
π Panel of Normals Generation (
02_pon.smk
)- Parallel CNVkit target and antitarget coverage calculation for all normal samples
- Automated QC filtering based on noise metrics
- Pooled reference creation from clean normals
-
π― CNV Calling (
03_calling.smk
)- Tumor sample processing with pooled reference
- Contamination region exclusion
- Absolute copy number calling with purity integration
- VCF export for downstream analysis
-
π Visualization (
04_plotting.smk
)- Scatter plots showing copy number across chromosomes
- Diagram plots for copy number segments
- Heatmaps for cross-sample comparison
- Gene-focused plots for regions of interest
git clone [repository-url]
cd sm-cnv-calling
Prerequisites: Miniconda or Anaconda installed
The pipeline uses two main conda environments that need to be created before running. You can either let Snakemake create them automatically or set them up manually:
Option 1: Automatic Setup (Recommended)
# Snakemake will automatically create the environments when first run
snakemake -s scripts/snakemake/cnv_pipeline.smk --use-conda --conda-prefix conda_envs --cores 1 --dry-run
Option 2: Manual Setup
# Create the conda environments manually with specific names
conda env create -f scripts/snakemake/envs/cnvkit.yaml -n cnvkit
conda env create -f scripts/snakemake/envs/purecn.yaml -n purecn
# Alternatively, create them in the conda_envs directory as Snakemake expects
mkdir -p conda_envs
conda env create -f scripts/snakemake/envs/cnvkit.yaml -p conda_envs/cnvkit
conda env create -f scripts/snakemake/envs/purecn.yaml -p conda_envs/purecn
The pipeline will automatically use these environments for different steps:
cnvkit
: Used for CNVkit operations, coverage analysis, and plottingpurecn
: Used for PureCN-based purity estimation
This repository includes a helper script to download standard accessory files (access*.bed
, refFlat.txt
) for common genome builds.
# Create a directory for your references
mkdir -p references
# Activate the cnvkit conda environment (if created manually)
# conda activate cnvkit
# Or if using conda_envs directory:
# conda activate conda_envs/cnvkit
# Download files for hg19
python scripts/snakemake/helpers/download_references.py --genome hg19 --output-dir references/
This will download access-5k-mappable.hg19.bed
and refFlat.txt
into the references/
directory.
Edit scripts/snakemake/config.yaml
to specify your reference files. If you used the downloader script, these paths would be references/access-5k-mappable.hg19.bed
, references/refFlat.txt
, etc.:
reference_genome: "/path/to/your/ucsc.hg19.fasta"
targets_bed: "/path/to/your/targets.bed"
access_bed: "references/access-5k-mappable.hg19.bed"
blacklist: "/path/to/your/genomic_blacklist.bed"
annotate_refFlat: "references/refFlat.txt" # Optional: comment out to disable gene annotation
purecn_normal_panel_vcf: "/path/to/your/normal_panel.vcf.gz"
# Optional: Skip PureCN analysis entirely (default: false)
skip_purecn: false
Skipping PureCN: If you encounter issues with PureCN dependencies or want to run only CNVkit analysis, set skip_purecn: true
in the config. This will:
- Skip all PureCN-related rules and dependencies
- Use only fallback purity values from the sample sheet
- Still run the complete CNVkit workflow for CNV calling
Create your scripts/snakemake/samples.tsv
with sample information (see Input Formats below).
The conda environments will be automatically created by Snakemake when using the --use-conda
flag. Execute the pipeline:
# Run the complete pipeline (environments created automatically)
snakemake -s scripts/snakemake/cnv_pipeline.smk --use-conda --conda-prefix conda_envs --cores 8
# For cluster execution (SLURM)
sbatch scripts/run_cnv_pipeline.sh
For additional reference files (like hg38 or updated files), you can use the included downloader:
# Download hg38 files
python scripts/snakemake/helpers/download_references.py --genome hg38 --output-dir references/
# Download with custom config path
python scripts/snakemake/helpers/download_references.py --genome hg19 --output-dir references/ --config my_custom_download_config.yaml
All output paths are centralized under the dirs
section:
dirs:
purecn_setup: "results/01_purecn_setup" # PureCN interval files
purecn_runs: "results/02_purecn_runs" # PureCN analysis outputs
purity_values: "results/03_purity_values" # Consolidated purity estimates
pon_creation: "results/04_pon_creation" # Panel of Normals files
cnvkit_runs: "results/05_cnvkit_runs" # CNVkit coverage/segmentation
final_calls: "results/06_final_calls" # Final copy number calls
final_vcfs: "results/07_final_vcfs" # Exported VCF files
plots: "results/08_plots" # All visualization outputs
logs: "logs" # Pipeline execution logs
Update these paths to match your system:
Parameter | Description | Format |
---|---|---|
reference_genome |
Reference FASTA file | .fasta , .fa |
targets_bed |
Target regions for exome capture | .bed |
access_bed |
Mappable regions (e.g., UCSC access-5k) | .bed |
blacklist |
Contaminated regions to exclude | .bed |
annotate_refFlat |
Gene annotations for CNVkit (optional) | .txt |
purecn_normal_panel_vcf |
PureCN normal panel VCF | .vcf.gz |
The pipeline uses two conda environments defined in the conda_envs
section:
conda_envs:
cnvkit: "scripts/snakemake/envs/cnvkit.yaml" # CNVkit operations and plotting
purecn: "scripts/snakemake/envs/purecn.yaml" # PureCN purity estimation
When using --conda-prefix conda_envs
, Snakemake will create environments in:
conda_envs/cnvkit/
- Contains cnvkit, pandas, requests, tqdm, and related toolsconda_envs/purecn/
- Contains R, PureCN, and related Bioconductor packages
Parameter | Default | Description |
---|---|---|
purecn_genome |
"hg19" |
Reference genome build |
purecn_assay_name |
"MyTumorAssay" |
Assay identifier for PureCN |
pon_qc_metric_threshold |
0.3 |
Maximum noise threshold for normals |
segment_threshold |
1.0e-4 |
P-value threshold for segmentation |
genes_of_interest:
- "TP53"
- "MYC"
- "EGFR"
- "ERBB2"
Specify genes for focused plotting. Gene names should match your reference annotation.
Parameter | Default | Description |
---|---|---|
default_threads |
4 |
Default CPU cores per job |
default_mem_mb |
16000 |
Default memory (MB) per job |
cnvkit_batch_mem_mb |
24000 |
Memory for CNVkit batch processing |
cnvkit_call_mem_mb |
8000 |
Memory for CNVkit call processing with VCFs |
cnvkit_export_mem_mb |
4000 |
Memory for CNVkit VCF export |
cnvkit_plot_mem_mb |
6000 |
Memory for CNVkit individual plot operations |
cnvkit_heatmap_mem_mb |
24000 |
Memory for CNVkit cohort heatmap generation (increase to 32000+ for >50 samples) |
The pipeline requires a tab-separated file with the following columns:
Column | Required | Description | Example |
---|---|---|---|
sample_id |
β | Unique sample identifier | PatientA |
analysis_type |
β | Analysis mode (see below) | TvsN |
tumor_bam |
Conditional | Path to tumor BAM file | /path/to/tumor.bam |
normal_bam |
Conditional | Path to normal BAM file | /path/to/normal.bam |
vcf |
Optional | Path to variant VCF file | /path/to/variants.vcf.gz |
tumor_sample_id_vcf |
Optional | Tumor sample ID in VCF for B-allele frequency calculation | PatientA-T |
normal_sample_id_vcf |
Optional | Normal sample ID in VCF for B-allele frequency calculation | PatientA-N |
fallback_purity |
β | Backup purity estimate (0-1) | 0.35 |
Type | Description | Required Files | Use Case |
---|---|---|---|
TvsN |
Tumor vs Normal | tumor_bam , normal_bam , vcf (optional) |
Paired analysis with germline control; normal BAM used for Panel of Normals |
To |
Tumor only | tumor_bam , vcf (optional) |
Single sample analysis without matched normal |
sample_id analysis_type tumor_bam normal_bam vcf tumor_sample_id_vcf normal_sample_id_vcf fallback_purity
PatientA_T1 TvsN /data/bams/PatientA_T1.bam /data/bams/PatientA_N.bam /data/vcfs/PatientA_T1.vcf.gz PatientA-T PatientA-N 0.35
PatientA_T2 TvsN /data/bams/PatientA_T2.bam /data/bams/PatientA_N.bam /data/vcfs/PatientA_T2.vcf.gz PatientA-T PatientA-N 0.40
PatientB TvsN /data/bams/PatientB_T.bam /data/bams/PatientB_N.bam /data/vcfs/PatientB.vcf.gz PatientB-T PatientB-N 0.30
PatientC To /data/bams/PatientC_T.bam /data/vcfs/PatientC.vcf.gz PatientC-T 0.45
Note:
- The normal BAM files from
TvsN
samples are automatically used to construct the Panel of Normals (PoN) for improved CNV calling accuracy. - Multiple tumors per patient: If the same normal BAM is used for multiple tumor samples (e.g.,
PatientA_T1
andPatientA_T2
both usePatientA_N.bam
), the pipeline automatically deduplicates normal BAMs to avoid processing the same file multiple times in the PoN. - VCF Sample IDs: For samples with VCF files, add
tumor_sample_id_vcf
andnormal_sample_id_vcf
columns to specify the exact sample identifiers in the VCF file. This enables proper B-allele frequency calculation from germline variants and accurate VCF parsing when sample names in the VCF differ from the sample sheet IDs.
- Coordinate-sorted and indexed (
.bai
files required) - Aligned to the same reference specified in
config.yaml
- Duplicate-marked (recommended but not required)
- Base quality scores recalibrated (recommended for optimal results)
- Compressed and indexed (
.gz
+.tbi
files) - Same reference coordinate system as BAM files
- Contains germline variants for optimal PureCN analysis
- Quality-filtered variants recommended (PASS filter)
The pipeline generates organized outputs across multiple directories:
scatter_genome/{sample}.genome.pdf
: Copy number scatter plots across chromosomesdiagram/{sample}.diagram.pdf
: Segmented copy number diagramsscatter_gene/{sample}/{gene}.pdf
: Gene-focused plots for regions of interest, organized by sampleheatmap_cohort.pdf
: Cross-sample copy number heatmap
Note: Gene-focused plots are organized into sample-specific subdirectories for better organization. For example, with samples PatientA_T1
, PatientB
and genes TP53
, MYC
, the structure would be:
results/08_plots/scatter_gene/
βββ PatientA_T1/
β βββ TP53.pdf
β βββ MYC.pdf
βββ PatientB/
βββ TP53.pdf
βββ MYC.pdf
results/07_final_vcfs/
: Final VCF files with copy number callsresults/06_final_calls/
: CNVkit native format calls (.call.cns
)results/03_purity_values/
: Consolidated purity estimatesresults/04_pon_creation/pooled_reference.cnn
: Panel of Normals reference
results/01_purecn_setup/
: PureCN interval and setup filesresults/02_purecn_runs/
: Individual PureCN analysis resultsresults/05_cnvkit_runs/
: CNVkit coverage and segmentation files
Execute only purity estimation:
snakemake -s scripts/snakemake/cnv_pipeline.smk --use-conda \
purity_consolidation \
--cores 8
Generate only visualizations:
snakemake -s scripts/snakemake/cnv_pipeline.smk --use-conda \
all_plots \
--cores 4
Override default resources in your execution:
snakemake -s scripts/snakemake/cnv_pipeline.smk --use-conda \
--set-resources cnvkit_batch:mem_mb=32000 \
--cores 16
For SLURM clusters, use the provided execution script:
# Edit cluster parameters in run_cnv_pipeline.sh
./scripts/run_cnv_pipeline.sh
Or create a custom SLURM profile following Snakemake cluster documentation.
snakemake -s scripts/snakemake/cnv_pipeline.smk --summary
snakemake -s scripts/snakemake/cnv_pipeline.smk --lint
snakemake -s scripts/snakemake/cnv_pipeline.smk --dag | dot -Tpng > pipeline_dag.png
The pipeline includes several QC checkpoints:
- Reference File Validation: Automatic verification of all required reference files
- Normal Sample QC: Automated filtering of noisy normal samples using CNVkit metrics
- Purity Validation: Consolidated purity estimates with fallback handling
- Output Validation: File existence and format checks throughout the pipeline
- Minimum: 16 GB RAM for small exomes
- Recommended: 32 GB RAM for large panels or whole exomes
- PureCN: May require up to 24 GB for complex samples
Step | Single Sample | 10 Samples | 50 Samples |
---|---|---|---|
Purity Estimation | 30-60 min | 2-4 hours | 6-12 hours |
Panel of Normals | N/A | 1-2 hours | 3-6 hours |
CNV Calling | 15-30 min | 1-3 hours | 4-8 hours |
Visualization | 5-10 min | 30-60 min | 2-4 hours |
Estimates based on typical exome data (~50 MB BAM files) running on modern HPC infrastructure
If you use this pipeline in your research, please cite:
- CNVkit: Talevich et al. (2016). CNVkit: Genome-wide copy number detection and visualization from targeted sequencing. PLOS Computational Biology
- PureCN: Riester et al. (2016). PureCN: copy number calling and SNV classification using targeted short read sequencing. Source Code for Biology and Medicine
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all existing tests pass
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and support:
- Issues: Use GitHub Issues for bug reports and feature requests
- Discussions: Use GitHub Discussions for usage questions and community support
- Documentation: Refer to individual tool documentation for CNVkit and PureCN
Submit the main pipeline job using the provided shell script:
# Submit with default of 50 parallel jobs
sbatch scripts/run_cnv_pipeline.sh
# Submit with a custom number of parallel jobs
sbatch scripts/run_cnv_pipeline.sh 100
The script will handle temporary directories and logging. All final results will be placed in the results/
directory, following the structure outlined in the Snakefile
.