IntGenomicsLab/lr_somatic is a robust bioinformatics pipeline for processing and analyzing somatic DNA sequencing data from the long-read sequencing technologies of Oxford Nanopore and PacBio. It supports both canonical base DNA and modified base calling, including specialized applications such as Fiber-seq.
This end-to-end pipeline handles the entire workflow, from raw read processing and alignment to comprehensive somatic variant calling, covering single nucleotide variants, indels, structural variants, copy number alterations, and modified bases.
It can be run in both matched tumour-normal and tumour-only mode, offering flexibility depending on the user's study design.
Developed using Nextflow DSL2, it offers high portability and scalability across diverse computing environments. By leveraging Docker or Singularity containers, installation is streamlined and results are highly reproducible. Each process runs in an isolated container, simplifying dependency management and updates. Where applicable, pipeline components are sourced from nf-core/modules, promoting reuse, interoperability, and consistency within the broader Nextflow and nf-core ecosystems.
1) Pre-processing:
a. Raw read QC (cramino)
b. Alignment to the reference genome (minimap2)
c. Post-alignment QC (cramino, samtools idxstats, samtools flagstat, samtools stats)
d. Steps specific to modified base calling (Modkit, Fibertools)
2i) Matched mode: small variant calling:
a. Calling germline SNPs (Clair3)
b. Phasing and haplotagging the SNPs in the normal and tumour BAM (LongPhase)
c. Calling somatic SNVs (ClairS)
2ii) Tumour-only mode: small variant calling:
a. Calling germline SNPs and somatic SNVs (ClairS-TO)
b. Phasing and haplotagging germline SNPs in the tumour BAM (LongPhase)
3) Large variant calling:
a. Somatic structural variant calling (Severus)
b. Copy number alteration calling (long-read version of ASCAT)
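The branching between matched and tumour-only mode in the steps above can be written down as a small lookup table. The following is only an illustrative sketch, not code from the pipeline: the names `SMALL_VARIANT_STEPS`, `LARGE_VARIANT_STEPS`, and `planned_steps` are hypothetical, and it assumes that the large variant callers run in both modes, which the list above suggests but does not state explicitly.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (tool, task), taken from the step list above

# Hypothetical lookup table: small variant calling depends on the mode.
SMALL_VARIANT_STEPS: Dict[str, List[Step]] = {
    "matched": [
        ("Clair3", "germline SNP calling"),
        ("LongPhase", "phasing and haplotagging of normal and tumour BAM"),
        ("ClairS", "somatic SNV calling"),
    ],
    "tumour-only": [
        ("ClairS-TO", "germline SNP and somatic SNV calling"),
        ("LongPhase", "phasing and haplotagging of tumour BAM"),
    ],
}

# Assumption: these run regardless of mode.
LARGE_VARIANT_STEPS: List[Step] = [
    ("Severus", "somatic structural variant calling"),
    ("ASCAT", "copy number alteration calling"),
]


def planned_steps(has_normal: bool) -> List[Step]:
    """Return the variant calling steps expected for one sample."""
    mode = "matched" if has_normal else "tumour-only"
    return SMALL_VARIANT_STEPS[mode] + LARGE_VARIANT_STEPS


if __name__ == "__main__":
    for tool, task in planned_steps(has_normal=False):
        print(f"{tool}: {task}")
```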
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
First prepare a samplesheet with your input data that looks as follows:
sample,bam_tumor,bam_normal,platform,sex,fiber
sample1,tumour.bam,normal.bam,ont,female,n
sample2,tumour.bam,,ont,female,y
sample3,tumour.bam,,pb,male,n
sample4,tumour.bam,normal.bam,pb,male,y
Each row represents a sample. The BAM files should always be unaligned BAM files. All fields except bam_normal are required. If bam_normal is empty, the pipeline will run in tumour-only mode. platform should be either ont or pb, for Oxford Nanopore or PacBio sequencing, respectively. sex refers to the biological sex of the sample and should be either female or male. Finally, fiber specifies whether your sample is Fiber-seq data and should be either y (yes) or n (no).
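Before launching a run, it can help to check a samplesheet against the rules just described. The sketch below is an independent helper, not part of the pipeline, and the pipeline's own validation may differ. The function name `check_samplesheet` and the added `mode` key are hypothetical; the column names and allowed values are taken from the description above.

```python
import csv
import io
from typing import Dict, List

# Columns that must be filled in; bam_normal must exist but may be empty.
REQUIRED = ("sample", "bam_tumor", "platform", "sex", "fiber")
ALLOWED = {
    "platform": {"ont", "pb"},
    "sex": {"female", "male"},
    "fiber": {"y", "n"},
}


def check_samplesheet(text: str) -> List[Dict[str, str]]:
    """Parse samplesheet CSV text and raise ValueError on the first problem."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED + ("bam_normal",) if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED:
            if not (row[col] or "").strip():
                raise ValueError(f"line {line_no}: '{col}' is required")
        for col, allowed in ALLOWED.items():
            if row[col] not in allowed:
                raise ValueError(
                    f"line {line_no}: {col}={row[col]!r}, expected one of {sorted(allowed)}"
                )
        # An empty bam_normal means the sample runs in tumour-only mode.
        has_normal = bool((row["bam_normal"] or "").strip())
        row["mode"] = "matched" if has_normal else "tumour-only"
        rows.append(row)
    return rows


if __name__ == "__main__":
    example = (
        "sample,bam_tumor,bam_normal,platform,sex,fiber\n"
        "sample1,tumour.bam,normal.bam,ont,female,n\n"
        "sample2,tumour.bam,,ont,female,y\n"
    )
    for row in check_samplesheet(example):
        print(row["sample"], row["mode"])
```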
Now, you can run the pipeline using:
nextflow run IntGenomicsLab/lr_somatic \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
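As an illustration of the -params-file route, the sketch below writes the two parameters from the command above (input and outdir) to a JSON file and assembles the matching command line. The helper names `write_params_file` and `build_command` and the file name `params.json` are hypothetical; the sketch only builds the command and does not execute it.

```python
import json
from pathlib import Path
from typing import List


def write_params_file(path: str, input_csv: str, outdir: str) -> Path:
    """Write pipeline parameters as JSON for Nextflow's -params-file option."""
    params = {"input": input_csv, "outdir": outdir}
    target = Path(path)
    target.write_text(json.dumps(params, indent=2) + "\n")
    return target


def build_command(params_file: Path, profile: str) -> List[str]:
    """Assemble the nextflow invocation as an argument list."""
    return [
        "nextflow", "run", "IntGenomicsLab/lr_somatic",
        "-profile", profile,
        "-params-file", str(params_file),
    ]


if __name__ == "__main__":
    params_path = write_params_file("params.json", "samplesheet.csv", "results")
    print(" ".join(build_command(params_path, "docker")))
```

Keeping parameters in a file rather than on the command line makes a run easier to repeat and to record alongside its results.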
IntGenomicsLab/lr_somatic was originally written by Luuk Harbers, Alexandra Pančíková, Robert Forsyth, Marios Eftychiou, Ruben Cools, and Jonas Demeulemeester.
This pipeline produces a series of different output files. The main output is an aligned and phased tumour BAM file, which can be used by any typical downstream tool that takes BAM files as input. Furthermore, there are sample-specific QC outputs from cramino (fastq), cramino (bam), mosdepth, samtools (stats/flagstat/idxstats), and optionally fibertools. Finally, there is a multiqc report that combines the output from mosdepth and samtools into one HTML report.
Besides QC and the aligned and phased BAM file, there is output from the (structural) variant and copy number callers, some of which are optional. The output from these callers can be found in their respective folders. For the small and structural variant callers (clairS, clairS-TO, and severus), these folders contain, among others, vcf files with the called variants. For ascat, they contain files with the final copy number information and plots of the copy number profiles.
Example output directory structure:
results
│
├── multiqc
│
├── sample1
│   ├── bamfiles
│   ├── qc
│   │   ├── tumour
│   │   └── normal
│   ├── variants
│   │   ├── severus
│   │   └── clairs
│   └── ascat
│
└── sample2
    ├── bamfiles
    ├── qc
    │   ├── tumour
    │   └── normal
    ├── variants
    │   ├── severus
    │   └── clairs
    └── ascat
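Given a layout like the one above, the variant call files can be collected per sample with a short script. This is a minimal sketch that assumes the VCFs sit somewhere under each sample's variants folder and end in .vcf or .vcf.gz; the exact file names are not documented here, and `list_variant_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List

# Assumed file endings; index files such as .vcf.gz.tbi are skipped.
VCF_SUFFIXES = (".vcf", ".vcf.gz")


def list_variant_files(results_dir: str) -> Dict[str, List[Path]]:
    """Map each sample folder name to the VCF files under its variants/ folder."""
    found: Dict[str, List[Path]] = {}
    for sample_dir in sorted(Path(results_dir).iterdir()):
        variants_dir = sample_dir / "variants"
        if not variants_dir.is_dir():
            continue  # e.g. the multiqc folder has no variants/ subfolder
        found[sample_dir.name] = sorted(
            path for path in variants_dir.rglob("*")
            if path.is_file() and path.name.endswith(VCF_SUFFIXES)
        )
    return found


if __name__ == "__main__":
    for sample, vcfs in list_variant_files("results").items():
        print(sample, [str(vcf) for vcf in vcfs])
```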
If you would like to contribute to this pipeline, please see the contributing guidelines.
If you use IntGenomicsLab/lr_somatic for your analysis, please cite it using the following doi: 10.5281/zenodo.XXXXXX
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.