
LongPhase-S

LongPhase-S is a program for somatic haplotagging and tumor purity estimation from tumor-normal pair long-read sequencing data. It also supports germline variant phasing and haplotagging, and is based on LongPhase (v2.0.0).

Key Features:

  • Enhances somatic variant calling accuracy
  • Somatic haplotagging
  • Tumor purity estimation

Somatic Haplotagging Workflow

workflow

Performance Evaluation

We evaluated the performance of LongPhase-S across six cancer cell lines using ONT tumor–normal paired long-read sequencing data. Germline variants were called using "Clair3 v1.0.10", and benchmarking was conducted against different truth variant sources:

  • HCC1395 : Truth variants from SEQC2
  • COLO829 : Truth variants from New York Genome Center
  • Four additional cell lines : Truth variants from DeepSomatic orthogonal tools benchmark

Somatic Variant Calling Performance (ONT)

This figure compares the performance of somatic SNV calling between "ClairS v0.4.1 with SS+RS model" and "DeepSomatic v1.8.0", before and after applying LongPhase-S.

Note: The following results are based on ORIG_BAM data.

SNV performance

SNV performance (original cell line)

INDEL performance

INDEL performance (original cell line)

This figure shows the F1-score of somatic SNV calling across different tumor purity levels and cell lines, using the same somatic variant callers and LongPhase-S integration as described above. Different tumor purity levels were simulated by mixing tumor and normal BAM files at specified ratios, with tumor coverage fixed at 50x and normal coverage fixed at 25x.

Note: The following results are based on SUBSAMPLE_BAM data.

SNV performance across varying tumor purity

SNV performance (varying tumor purity and cell lines)

INDEL performance across varying tumor purity

INDEL performance (varying tumor purity and cell lines)

Somatic Haplotagging Performance

This figure shows the performance of LongPhase-S in somatic SNV haplotagging, using tumor VCF files generated by "ClairS v0.4.1 with the SS+RS model". The evaluation is conducted across various cell lines, with metrics including F1-score, recall, and precision for different haplotype tagging categories.

haplotagging performance

Tumor Purity Estimation

This figure compares tumor purity estimation results between "ASCAT v3.2.0" and "LongPhase-S v1.0.0" (using different somatic variant callers) across multiple cell lines. Different tumor purity levels were simulated by mixing tumor and normal BAM files at specified ratios, with tumor coverage fixed at 50x and normal coverage fixed at 25x.

Purity estimation

IGV Visualization of Somatic Haplotagging Result

This figure shows an example of a tagged tumor BAM file visualized in IGV, demonstrating somatic haplotagging results on HCC1395/HCC1395BL tumor/normal ONT data.

haplotagging IGV case


Installation

Clone and compile with the following commands. Make sure zlib is installed in the build environment. If you need to set up a virtual environment, we also provide a Dockerfile.

git clone https://github.com/CCU-Bioinformatics-Lab/longphase-s.git
cd longphase-s
autoreconf -i
./configure
make -j 4

Usage

Somatic haplotagging command

SNV and INDEL somatic haplotagging

This command performs tumor purity estimation and somatic variant calling using tumor-normal pair BAM files, then tags (assigns) each read in the tumor BAM to one haplotype based on the phased normal SNP VCF and the tumor SNP VCF. See Input Preparation for details on how to prepare the required input files.

In addition, the haplotype block of each read is stored in the PS tag (only for reads with phased SNPs). The phased VCF can be generated by other programs as long as the PS or HP tags are encoded. Users can specify --log to additionally output a plain-text file containing the haplotype tags of each read, so that the BAM does not need to be parsed.

longphase-s somatic_haplotag \
-s phased_germline_snp.vcf \
-b normal.bam \
--tumor-snv-file tumor_snv_indel.vcf \
--tumor-bam-file tumor.bam \
-r reference.fasta \
-t 8 \
-o tagged_tumor_bam_prefix \
--tagSupplementary \
-q 20

Note: See Input Preparation for preprocessing steps to prepare the required input files.

Output files:

  • Tagged tumor BAM file
    • The reads will be tagged as:
      • HP:z:1 or HP:z:2 for reads with germline SNPs
      • HP:z:1-1 or HP:z:2-1 for reads with somatic SNPs derived from germline haplotype 1 or 2
      • HP:z:3 for reads with somatic SNPs that cannot be derived from germline haplotypes
  • Tumor purity estimation file
  • Somatic calling result VCF (if --output-somatic-vcf is enabled)
  • Benchmark metrics file (if truth files are provided)
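The tag values listed above can be turned into readable categories when summarizing a tagged BAM. The following is a minimal sketch, not part of LongPhase-S: `classify_hp` and its category labels are hypothetical names chosen for illustration, and only the tag values documented above are handled.

```shell
#!/bin/sh
# classify_hp VALUE: map an HP tag value to a category label.
# Hypothetical helper; the labels are illustrative, not LongPhase-S output.
classify_hp() {
    case "$1" in
        1|2)     echo "germline_H$1" ;;            # HP:z:1 or HP:z:2
        1-1|2-1) echo "somatic_on_H${1%%-*}" ;;    # HP:z:1-1 or HP:z:2-1
        3)       echo "somatic_unassigned" ;;      # HP:z:3
        *)       echo "untagged_or_unknown" ;;     # anything else
    esac
}

classify_hp 1      # prints germline_H1
classify_hp 2-1    # prints somatic_on_H2
classify_hp 3      # prints somatic_unassigned
```

Assuming samtools is installed and the tagged BAM is written as `tagged_tumor_bam_prefix.bam` (the output file name is an assumption), the HP values could be extracted with `samtools view tagged_tumor_bam_prefix.bam | grep -o 'HP:z:[^[:space:]]*'` and passed to the helper after stripping the `HP:z:` prefix.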

Somatic haplotagging benchmark:

  • If --truth-vcf is provided, the command evaluates somatic haplotagging performance by comparing the reads that truly contain somatic variants with the reads tagged as somatic.
  • If --truth-bed is also provided, the evaluation only considers variants within these regions.
  • See detailed benchmark methodology
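For orientation, the read-level comparison described above can be summarized with the standard precision, recall, and F1 definitions. This is a sketch under stated assumptions: `compute_metrics` is a hypothetical helper, and the true-positive, false-positive, and false-negative counts are assumed to be read counts (tagged somatic and truly somatic, tagged somatic but not truly somatic, truly somatic but not tagged). The exact counting rules used by LongPhase-S are described in its benchmark methodology.

```shell
#!/bin/sh
# compute_metrics TP FP FN: print precision, recall, and F1 from read counts.
# Hypothetical helper using the standard definitions; guards against division by zero.
compute_metrics() {
    awk -v tp="$1" -v fp="$2" -v fneg="$3" 'BEGIN {
        p = (tp + fp   > 0) ? tp / (tp + fp)   : 0   # precision
        r = (tp + fneg > 0) ? tp / (tp + fneg) : 0   # recall
        f = (p + r     > 0) ? 2 * p * r / (p + r) : 0
        printf "precision=%.4f recall=%.4f f1=%.4f\n", p, r, f
    }'
}

# Example with made-up counts: 90 true positives, 10 false positives, 30 false negatives.
compute_metrics 90 10 30   # prints precision=0.9000 recall=0.7500 f1=0.8182
```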

Tumor purity estimation command

This command uses tumor-normal pair BAM and VCF files, along with haplotype information, to estimate tumor purity, and outputs the estimation file.

longphase-s estimate_purity \
-s phased_normal_snp.vcf \
-b normal.bam \
--tumor-snv-file tumor_snv.vcf \
--tumor-bam-file tumor.bam \
-r reference.fasta \
-t 8 \
-o output_prefix

Note: See Input Preparation for preprocessing steps to prepare the required input files.

Phasing command

SNP-only phasing

For SNP-only phasing, the input of LongPhase consists of SNPs in VCF (e.g., SNP.vcf), an indexed reference in FASTA (e.g., reference.fasta, reference.fasta.fai), and one or more indexed read-to-reference alignments in BAM (e.g., alignment1.bam, alignment1.bai, alignment2.bam, ...) (see Input Preparation). Users should specify the sequencing platform (--ont for Nanopore or --pb for PacBio). An example of SNP phasing usage is shown below.

longphase-s phase \
-s SNP.vcf \
-b alignment1.bam \
-b alignment2.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--ont # or --pb for PacBio HiFi
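Because the phasing command expects an indexed reference and indexed BAM files, it can help to check for the index files before launching a long run. The following is a minimal sketch: `check_inputs` is a hypothetical helper, not part of LongPhase-S, and it assumes the FASTA index is named `<reference>.fai` and the BAM index is either `<name>.bam.bai` or `<name>.bai`.

```shell
#!/bin/sh
# check_inputs REF BAM...: return non-zero if any expected index file is missing.
# Hypothetical pre-flight helper; index naming conventions are assumptions.
check_inputs() {
    ref=$1
    shift
    status=0
    [ -f "$ref.fai" ] || { echo "missing index: $ref.fai" >&2; status=1; }
    for bam in "$@"; do
        # Accept either alignment.bam.bai or alignment.bai
        if [ ! -f "$bam.bai" ] && [ ! -f "${bam%.bam}.bai" ]; then
            echo "missing index for: $bam" >&2
            status=1
        fi
    done
    return $status
}

# Usage (not executed here):
#   check_inputs reference.fasta alignment1.bam alignment2.bam && longphase-s phase ...
```

If an index is missing, `samtools faidx reference.fasta` and `samtools index alignment1.bam` are the usual ways to create it, assuming samtools is available.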

DeepSomatic output support

If your SNV VCF is produced by DeepSomatic, you can enable a built-in pre-processing step that keeps only GERMLINE variants and normalizes the GT field according to VAF before phasing. This is triggered by --deepsomatic_output.

See detailed DeepSomatic output support documentation
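To illustrate the first half of that pre-processing step, the sketch below keeps only germline records from a VCF. It is not the built-in implementation: `keep_germline` is a hypothetical helper, and it assumes that germline calls are marked with `GERMLINE` in the FILTER column (column 7). The VAF-based GT normalization is not reproduced here, since its thresholds are not stated in this document.

```shell
#!/bin/sh
# keep_germline [FILE...]: print VCF header lines plus records whose FILTER is GERMLINE.
# Hypothetical helper; assumes germline calls carry GERMLINE in column 7.
keep_germline() {
    awk -F'\t' '/^#/ || $7 == "GERMLINE"' "$@"
}

# Tiny made-up VCF: one GERMLINE record and one PASS record.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\t.\tA\tG\t30\tGERMLINE\t.\nchr1\t200\t.\tC\tT\t40\tPASS\t.\n' |
    keep_germline   # prints the two header lines and the chr1 100 record
```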

Haplotagging command

This command tags (assigns) each read (in BAM) to one haplotype in the phased SNP/SV VCF, i.e., reads will be tagged as HP:i:1 or HP:i:2. In addition, the haplotype block of each read is stored in the PS tag. The phased VCF can also be generated by other programs as long as the PS or HP tags are encoded. Users can specify --log to additionally output a plain-text file containing the haplotype tags of each read, so that the BAM does not need to be parsed.

longphase-s haplotag \
-r reference.fasta \
-s phased_snp.vcf \
--sv-file phased_sv.vcf \
-b alignment.bam \
-t 8 \
-o tagged_bam_prefix

Input Preparation

Preprocessing for somatic mode commands

For somatic haplotagging and tumor purity estimation commands, the preprocessing workflow is:

  1. Generate tumor SNV VCF using somatic variant callers
  2. Generate normal SNP VCF using germline variant callers
  3. Perform phasing on the normal SNP VCF using the normal BAM to obtain the phased normal SNP VCF (e.g., using the phasing command)
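Step 3 and the subsequent somatic haplotagging call can be chained as sketched below. The script only prints the commands (a dry run) so they can be reviewed first. All file names are placeholders, `plan` is a hypothetical helper, and the assumption that the phasing output is written to `<prefix>.vcf` should be checked against your own run. Only options shown elsewhere in this document are used.

```shell
#!/bin/sh
# Placeholder inputs; replace with your own files.
NORMAL_VCF=normal_snp.vcf          # step 2: germline caller output
TUMOR_VCF=tumor_snv.vcf            # step 1: somatic caller output
NORMAL_BAM=normal.bam
TUMOR_BAM=tumor.bam
REF=reference.fasta
THREADS=8
PHASED_PREFIX=phased_normal_snp    # assumed to produce phased_normal_snp.vcf
OUT_PREFIX=tagged_tumor_bam_prefix

# plan: print the two commands in order without running them.
plan() {
    echo "longphase-s phase -s $NORMAL_VCF -b $NORMAL_BAM -r $REF -t $THREADS -o $PHASED_PREFIX --ont"
    echo "longphase-s somatic_haplotag -s $PHASED_PREFIX.vcf -b $NORMAL_BAM --tumor-snv-file $TUMOR_VCF --tumor-bam-file $TUMOR_BAM -r $REF -t $THREADS -o $OUT_PREFIX"
}

plan
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere. Once the paths are confirmed, the printed lines can be executed directly, replacing `--ont` with `--pb` for PacBio data.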

Citation

Ming-En Ho, Zhenxian Zheng, Ruibang Luo, Huai-Hsiang Chiang, Yao-Ting Huang, LongPhase-S: purity estimation and variant recalibration with somatic haplotying for long-read sequencing, bioRxiv, 2025.


Contact

Yao-Ting Huang, ythuang at cs.ccu.edu.tw