LongPhase-S is a program for somatic haplotagging and tumor purity estimation from tumor-normal pair long-read sequencing data. It also supports germline variant phasing and haplotagging, and is based on LongPhase (v2.0.0).
Key Features:
- Enhances somatic variant calling accuracy
- Somatic haplotagging
- Tumor purity estimation
We evaluated the performance of LongPhase-S across six cancer cell lines using ONT tumor–normal paired long-read sequencing data. Germline variants were called using "Clair3 v1.0.10", and benchmarking was conducted against different truth variant sources:
- HCC1395 : Truth variants from SEQC2
- COLO829 : Truth variants from New York Genome Center
- Four additional cell lines : Truth variants from DeepSomatic orthogonal tools benchmark
This figure compares the performance of somatic SNV calling between "ClairS v0.4.1 with SS+RS model" and "DeepSomatic v1.8.0", before and after applying LongPhase-S.
Note: The following results are based on ORIG_BAM data.
This figure shows the F1-score of somatic SNV calling across different tumor purity levels and cell lines, using the same somatic variant callers and LongPhase-S integration as described above. Different tumor purity levels were simulated by mixing tumor and normal BAM files at specified ratios, with tumor coverage fixed at 50x and normal coverage fixed at 25x.
Note: The following results are based on SUBSAMPLE_BAM data.
This figure shows the performance of LongPhase-S in somatic SNV haplotagging, using tumor VCF files generated by "ClairS v0.4.1 with the SS+RS model". The evaluation is conducted across various cell lines, with metrics including F1-score, recall, and precision for different haplotype tagging categories.
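For readers unfamiliar with the metrics named above, the following is a minimal sketch of the standard precision, recall, and F1 definitions. The helper name `prf` and the example counts are hypothetical, and this is not the benchmark code used to produce the figure.

```shell
# prf TP FP FN -> prints "precision recall f1", each to three decimals.
# Hypothetical helper for illustration; not part of LongPhase-S.
prf() {
  awk -v tp="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
    p = (tp + fpos) ? tp / (tp + fpos) : 0   # precision = TP / (TP + FP)
    r = (tp + fneg) ? tp / (tp + fneg) : 0   # recall    = TP / (TP + FN)
    f = (p + r) ? 2 * p * r / (p + r) : 0    # F1 = harmonic mean of the two
    printf "%.3f %.3f %.3f\n", p, r, f
  }'
}

# Made-up counts: 90 true positives, 10 false positives, 30 false negatives
prf 90 10 30
```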
This figure compares tumor purity estimation results between "ASCAT v3.2.0" and "LongPhase-S v1.0.0" (using different somatic variant callers) across multiple cell lines. Different tumor purity levels were simulated by mixing tumor and normal BAM files at specified ratios, with tumor coverage fixed at 50x and normal coverage fixed at 25x.
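The mixing arithmetic can be sketched as follows. This assumes that a simulated purity p at a fixed tumor coverage C draws p × C of coverage from the tumor BAM and (1 − p) × C from the normal BAM. The exact procedure used for the benchmark is not documented here, and the helper name `mix_fractions` and the source coverages in the example are hypothetical.

```shell
# mix_fractions PURITY TARGET_COV TUMOR_SRC_COV NORMAL_SRC_COV
# Prints the fraction of reads to keep from the tumor BAM and from the
# normal BAM so that the mixture reaches TARGET_COV at the given purity.
# Hypothetical helper; the fractions could then be given to a read
# subsampler (for example the fraction argument of `samtools view -s`).
mix_fractions() {
  awk -v p="$1" -v c="$2" -v t="$3" -v n="$4" 'BEGIN {
    tf = p * c / t          # tumor-derived share of the target coverage
    nf = (1 - p) * c / n    # normal-derived share of the target coverage
    if (tf > 1 || nf > 1) exit 1   # a source BAM is too shallow
    printf "%.3f %.3f\n", tf, nf
  }'
}

# Example: 60% purity at 50x, from a hypothetical 80x tumor BAM and 60x normal BAM
mix_fractions 0.6 50 80 60
```

The function exits with a non-zero status when either source BAM does not have enough coverage to supply its share.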
This figure shows an example of a tagged tumor BAM file visualized in IGV, demonstrating somatic haplotagging results on HCC1395/HCC1395BL tumor/normal ONT data.
Clone and compile using the following commands. Make sure zlib is installed in the build environment. If you need an isolated environment, a Dockerfile is also provided.
git clone https://github.com/CCU-Bioinformatics-Lab/longphase-s.git
cd longphase-s
autoreconf -i
./configure
make -j 4
This command performs tumor purity estimation and somatic variant calling from tumor-normal paired BAM files, then tags (assigns) each read in the tumor BAM to one haplotype based on the phased normal SNP VCF and the tumor SNV VCF. See Input Preparation for details on how to prepare the required input files:
- Reference genome: Generate reference index
- BAM files: Generate alignment and index files
- Phased normal SNP VCF: Generate germline SNP file and then perform phasing
- Tumor SNV VCF: Generate somatic SNV file
In addition, the haplotype block of each read is stored in the PS tag (only for reads with phased SNPs). The phased VCF can be generated by other programs as long as the PS or HP tags are encoded. Users can specify --log to additionally output a plain-text file containing the haplotype tag of each read, which can be read without parsing the BAM.
longphase-s somatic_haplotag \
-s phased_germline_snp.vcf \
-b normal.bam \
--tumor-snv-file tumor_snv_indel.vcf \
--tumor-bam-file tumor.bam \
-r reference.fasta \
-t 8 \
-o tagged_tumor_bam_prefix \
--tagSupplementary \
-q 20
Note: See Input Preparation for preprocessing steps to prepare the required input files.
- Tagged tumor BAM file
  - The reads will be tagged as:
    - HP:z:1 or HP:z:2 for reads with germline SNPs
    - HP:z:1-1 or HP:z:2-1 for reads with somatic SNPs derived from germline haplotype 1 or 2
    - HP:z:3 for reads with somatic SNPs that cannot be derived from germline haplotypes
- Tumor purity estimation file
- Somatic calling result VCF (if --output-somatic-vcf is enabled)
- Benchmark metrics file (if truth files are provided)
  - If --truth-vcf is provided, it will evaluate the performance of somatic haplotagging by comparing the reads that truly contain somatic variants with the reads that are tagged as somatic reads.
  - If --truth-bed is also provided, the evaluation will only consider variants within these regions.
  - See detailed benchmark methodology
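To check how reads are distributed across the tag categories listed above, a small counter over SAM text can help. This is a minimal sketch: the helper name `count_hp_tags` is hypothetical, the output file name in the usage comment is an assumption based on the -o prefix, and the pattern accepts any tag type letter after HP.

```shell
# count_hp_tags: read SAM records on stdin and print "<HP value> <count>",
# one line per distinct HP value, plus "untagged" for reads without HP.
# Hypothetical helper; not part of LongPhase-S.
count_hp_tags() {
  awk -F '\t' '
    /^@/ { next }                        # skip SAM header lines
    {
      tag = "untagged"
      for (i = 12; i <= NF; i++)         # optional fields start at column 12
        if ($i ~ /^HP:[A-Za-z]:/) tag = substr($i, 6)
      n[tag]++
    }
    END { for (t in n) print t, n[t] }
  ' | LC_ALL=C sort
}

# Typical use (assumes samtools is installed; the file name is an assumption):
#   samtools view tagged_tumor_bam_prefix.bam | count_hp_tags
```

Reads tagged 1-1, 2-1, or 3 are the ones assigned to a somatic category.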
This command uses tumor-normal paired BAM and VCF files, along with haplotype information, to estimate tumor purity and outputs the estimation file.
longphase-s estimate_purity \
-s phased_normal_snp.vcf \
-b normal.bam \
--tumor-snv-file tumor_snv.vcf \
--tumor-bam-file tumor.bam \
-r reference.fasta \
-t 8 \
-o output_prefix
Note: See Input Preparation for preprocessing steps to prepare the required input files.
For SNP-only phasing, LongPhase takes SNPs in VCF format (e.g., SNP.vcf), an indexed reference in FASTA format (e.g., reference.fasta, reference.fasta.fai), and one or more indexed read-to-reference alignments in BAM format (e.g., alignment1.bam, alignment1.bai, alignment2.bam, ...) (see Input Preparation). Users should specify the sequencing platform (--ont for Nanopore or --pb for PacBio). An example of SNP phasing is shown below.
longphase-s phase \
-s SNP.vcf \
-b alignment1.bam \
-b alignment2.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--ont # or --pb for PacBio HiFi
If your SNV VCF is produced by DeepSomatic, you can enable a built-in pre-processing step that
keeps only GERMLINE variants and normalizes the GT field according to VAF before phasing. This is
triggered by --deepsomatic_output.
See detailed DeepSomatic output support documentation
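To illustrate the filtering half of this pre-processing, the sketch below keeps VCF header lines and records whose FILTER column is GERMLINE. It assumes that "GERMLINE variants" refers to the FILTER value in the DeepSomatic VCF. It does not reproduce the GT normalization by VAF, and it is not the built-in implementation triggered by --deepsomatic_output. The helper name `keep_germline` and the file names are hypothetical.

```shell
# keep_germline: read a VCF on stdin, keep header lines and records whose
# FILTER column (7th field) is exactly GERMLINE.
# Hypothetical helper; GT normalization by VAF is not shown here.
keep_germline() {
  awk -F '\t' '/^#/ || $7 == "GERMLINE"'
}

# Typical use (file names are hypothetical):
#   keep_germline < deepsomatic_output.vcf > germline_only.vcf
```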
This command tags (assigns) each read in the BAM to one haplotype in the phased SNP/SV VCF, i.e., reads are tagged as HP:i:1 or HP:i:2. In addition, the haplotype block of each read is stored in the PS tag. The phased VCF can also be generated by other programs as long as the PS or HP tags are encoded. Users can specify --log to additionally output a plain-text file containing the haplotype tag of each read, which can be read without parsing the BAM.
longphase-s haplotag \
-r reference.fasta \
-s phased_snp.vcf \
--sv-file phased_sv.vcf \
-b alignment.bam \
-t 8 \
-o tagged_bam_prefix
- Generate reference index
- Generate alignment and index files
- Generate germline SNP file
- Generate somatic SNV file
For somatic haplotagging and tumor purity estimation commands, the preprocessing workflow is:
- Generate tumor SNV VCF using somatic variant callers
- Generate normal SNP VCF using germline variant callers
- Perform phasing on normal SNP VCF using normal BAM to obtain phased normal SNP VCF (e.g., using phasing command)
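The indexing and phasing parts of this workflow can be sketched as a dry run that only prints the commands. It assumes samtools is used for indexing and ONT reads for phasing. The variant-calling steps are omitted because they depend on the chosen callers. The helper name `print_prep_plan`, the output prefix phased_normal_snp, and the file names are hypothetical.

```shell
# print_prep_plan REF NORMAL_BAM TUMOR_BAM NORMAL_SNP_VCF
# Prints (does not run) the indexing and phasing commands, one per line.
# Hypothetical helper; samtools for indexing and --ont are assumptions.
print_prep_plan() {
  ref=$1; normal=$2; tumor=$3; snp=$4
  echo "samtools faidx $ref"        # reference index (.fai)
  echo "samtools index $normal"     # normal BAM index
  echo "samtools index $tumor"      # tumor BAM index
  # Phase the normal SNP VCF using the normal BAM
  echo "longphase-s phase -s $snp -b $normal -r $ref -t 8 -o phased_normal_snp --ont"
}

print_prep_plan reference.fasta normal.bam tumor.bam normal_snp.vcf
```

Reviewing the printed commands before running them makes it easy to swap --ont for --pb or to adjust the thread count.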
Ming-En Ho, Zhenxian Zheng, Ruibang Luo, Huai-Hsiang Chiang, Yao-Ting Huang, LongPhase-S: purity estimation and variant recalibration with somatic haplotyping for long-read sequencing, bioRxiv, 2025.
Yao-Ting Huang, ythuang at cs.ccu.edu.tw







