Skip to content

Commit

Permalink
Merge pull request #21 from TRON-Bioinformatics/test_ivar_trim
Browse files Browse the repository at this point in the history
Add step for primer trimming with iVar
  • Loading branch information
priesgo authored Apr 1, 2022
2 parents 3b9e3db + 7e0d982 commit c94136b
Show file tree
Hide file tree
Showing 7 changed files with 319 additions and 21 deletions.
2 changes: 2 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,5 @@ test:
bash tests/test_07.sh
bash tests/test_08.sh
bash tests/test_09.sh
bash tests/test_10.sh
bash tests/test_11.sh
25 changes: 20 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ When FASTQ files are provided the pipeline includes the following steps:
- **Trimming**. `fastp` is used to trim reads with default values. This step also includes QC filtering.
- **Alignment**. `BWA mem` is used for the alignment of single or paired end samples.
- **BAM preprocessing**. BAM files are prepared and duplicate reads are marked using GATK and Picard tools.
- **Primer trimming**. When a BED with primers is provided, these are trimmed from the reads using iVar. This is applicable to the results from all variant callers.
- **Coverage analysis**. `samtools coverage` and `samtools depth` are used to compute the horizontal and vertical
coverage respectively.
- **Variant calling**. Four different variant callers are employed: BCFtools, LoFreq, iVar and GATK.
Expand Down Expand Up @@ -110,6 +111,17 @@ or GATK's ReadBackedPhasing. Unfortunately these tools do not support the scenar
undefined number of subclones.
For this reason, phasing is implemented with custom Python code at `bin/phasing.py`.

## Primers trimming

With some library preparation protocols such as ARTIC it is recommended to trim the primers from the reads.
We have observed that if primers are not trimmed spurious mutations are being called specially SNVs with lower frequencies and long deletions.
Also the variant allele frequencies of clonal mutations are underestimated.

The BED files containing the primers for each ARTIC version can be found at https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019.

If the adequate BED file is provided to the CoVigator pipeline with `--primers` the primers will be trimmed with iVar.
This affects the output of every variant caller, not only iVar.

## Reference data

The default SARS-CoV-2 reference files correspond to Sars_cov_2.ASM985889v3 and were downloaded from Ensembl servers.
Expand Down Expand Up @@ -230,10 +242,16 @@ Input:
* --library: required only when using --input_fastqs
* --input_fastas_list: alternative to --name and --fasta for batch processing
Optional input only required to use a custom reference:
* --reference: the reference genome FASTA file, *.fai, *.dict and bwa indexes are required.
* --gff: the GFFv3 gene annotations file (required to run iVar and to phase mutations from all variant callers)
* --snpeff_data: path to the SnpEff data folder, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --snpeff_config: path to the SnpEff config file, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --snpeff_organism: organism to annotate with SnpEff, it will be useful to use the pipeline on other virus than SARS-CoV-2
Optional input:
* --fastq2: the second input FASTQ file
* --reference: the reference genome FASTA file, *.fai, *.dict and bwa indexes are required.
* --gff: the GFFv3 gene annotations file (only required to run iVar)
* --primers: a BED file containing the primers used during library preparation. If provided primers are trimmed from the reads.
* --min_base_quality: minimum base call quality to take a base into account for variant calling (default: 20)
* --min_mapping_quality: minimum mapping quality to take a read into account for variant calling (default: 20)
* --vafator_min_base_quality: minimum base call quality to take a base into account for VAF annotation (default: 0)
Expand All @@ -251,9 +269,6 @@ Optional input:
* --open_gap_score: global alignment open gap score, only applicable for assemblies (default: -3)
* --extend_gap_score: global alignment extend gap score, only applicable for assemblies (default: -0.1)
* --skip_sarscov2_annotations: skip some of the SARS-CoV-2 specific annotations (default: false)
* --snpeff_data: path to the SnpEff data folder, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --snpeff_config: path to the SnpEff config file, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --snpeff_organism: organism to annotate with SnpEff, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --keep_intermediate: keep intermediate files (ie: BAM files and intermediate VCF files)
Output:
Expand Down
28 changes: 19 additions & 9 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ nextflow.enable.dsl = 2

include { READ_TRIMMING_PAIRED_END; READ_TRIMMING_SINGLE_END } from './modules/01_fastp'
include { ALIGNMENT_PAIRED_END; ALIGNMENT_SINGLE_END } from './modules/02_bwa'
include { BAM_PREPROCESSING; COVERAGE_ANALYSIS } from './modules/03_bam_preprocessing'
include { BAM_PREPROCESSING; COVERAGE_ANALYSIS; PRIMER_TRIMMING_IVAR } from './modules/03_bam_preprocessing'
include { VARIANT_CALLING_BCFTOOLS; VARIANT_CALLING_LOFREQ ; VARIANT_CALLING_GATK ;
VARIANT_CALLING_IVAR ; VARIANT_CALLING_ASSEMBLY; IVAR2VCF } from './modules/04_variant_calling'
include { VARIANT_NORMALIZATION ; PHASING } from './modules/05_variant_normalization'
Expand Down Expand Up @@ -33,6 +33,7 @@ params.gff = false
params.snpeff_data = false
params.snpeff_config = false
params.snpeff_organism = false
params.primers = false

params.output = "."
params.min_mapping_quality = 20
Expand Down Expand Up @@ -89,6 +90,8 @@ else {
skip_sarscov2_annotations = true
}

primers = params.primers ? file(params.primers) : false

skip_snpeff = false
if (! snpeff_data || ! snpeff_config || ! snpeff_organism) {
log.info "Skipping SnpEff annotation as either --snpeff_data, --snpeff_config or --snpeff_organism was not provided"
Expand Down Expand Up @@ -188,27 +191,32 @@ workflow {
bam_files = ALIGNMENT_SINGLE_END.out
}
BAM_PREPROCESSING(bam_files, reference)
COVERAGE_ANALYSIS(BAM_PREPROCESSING.out.preprocessed_bam)
preprocessed_bams = BAM_PREPROCESSING.out.preprocessed_bam
if (primers) {
PRIMER_TRIMMING_IVAR(preprocessed_bams, primers)
preprocessed_bams = PRIMER_TRIMMING_IVAR.out.trimmed_bam
}
COVERAGE_ANALYSIS(preprocessed_bams)

// variant calling
vcfs_to_normalize = null
if (!params.skip_bcftools) {
VARIANT_CALLING_BCFTOOLS(BAM_PREPROCESSING.out.preprocessed_bam, reference)
VARIANT_CALLING_BCFTOOLS(preprocessed_bams, reference)
vcfs_to_normalize = vcfs_to_normalize == null?
VARIANT_CALLING_BCFTOOLS.out : vcfs_to_normalize.concat(VARIANT_CALLING_BCFTOOLS.out)
}
if (!params.skip_lofreq) {
VARIANT_CALLING_LOFREQ(BAM_PREPROCESSING.out.preprocessed_bam, reference)
VARIANT_CALLING_LOFREQ(preprocessed_bams, reference)
vcfs_to_normalize = vcfs_to_normalize == null?
VARIANT_CALLING_LOFREQ.out : vcfs_to_normalize.concat(VARIANT_CALLING_LOFREQ.out)
}
if (!params.skip_gatk) {
VARIANT_CALLING_GATK(BAM_PREPROCESSING.out.preprocessed_bam, reference)
VARIANT_CALLING_GATK(preprocessed_bams, reference)
vcfs_to_normalize = vcfs_to_normalize == null?
VARIANT_CALLING_GATK.out : vcfs_to_normalize.concat(VARIANT_CALLING_GATK.out)
}
if (!params.skip_ivar && gff) {
VARIANT_CALLING_IVAR(BAM_PREPROCESSING.out.preprocessed_bam, reference, gff)
VARIANT_CALLING_IVAR(preprocessed_bams, reference, gff)
IVAR2VCF(VARIANT_CALLING_IVAR.out, reference)
vcfs_to_normalize = vcfs_to_normalize == null?
IVAR2VCF.out : vcfs_to_normalize.concat(IVAR2VCF.out)
Expand Down Expand Up @@ -238,14 +246,16 @@ workflow {

if (input_fastqs) {
// we can only add technical annotations when we have the reads
VAFATOR(normalized_vcfs.combine(BAM_PREPROCESSING.out.preprocessed_bam, by: 0))
VAFATOR(normalized_vcfs.combine(preprocessed_bams, by: 0))
VARIANT_VAF_ANNOTATION(VAFATOR.out.annotated_vcf)
normalized_vcfs = VARIANT_VAF_ANNOTATION.out.vaf_annotated
}

// NOTE: phasing has to happen before SnpEff annotation for MNVs to be annotated correctly
PHASING(normalized_vcfs, reference, gff)
normalized_vcfs = PHASING.out
if (params.gff) {
PHASING(normalized_vcfs, reference, gff)
normalized_vcfs = PHASING.out
}

if (! skip_snpeff) {
// only when configured we run SnpEff
Expand Down
36 changes: 35 additions & 1 deletion modules/03_bam_preprocessing.nf
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,40 @@ process BAM_PREPROCESSING {
"""
}

process PRIMER_TRIMMING_IVAR {
cpus params.cpus
memory params.memory
if (params.keep_intermediate) {
publishDir "${params.output}", mode: "copy"
}
publishDir "${params.output}", mode: "copy", pattern: "${name}.deduplication_metrics.txt"
tag "${name}"

conda (params.enable_conda ? "bioconda::gatk4=4.2.0.0 bioconda::ivar=1.3.1" : null)

input:
tuple val(name), file(bam), file(bai)
file(primers)

output:
tuple val(name), file("${bam.baseName}.trimmed.sorted.bam"), file("${bam.baseName}.trimmed.sorted.bai"), emit: trimmed_bam

"""
ivar trim \
-i ${bam} \
-b ${primers} \
-p ${bam.baseName}.trimmed
gatk SortSam \
--java-options '-Xmx${params.memory} -Djava.io.tmpdir=./tmp' \
--INPUT ${bam.baseName}.trimmed.bam \
--OUTPUT ${bam.baseName}.trimmed.sorted.bam \
--SORT_ORDER coordinate
gatk BuildBamIndex --INPUT ${bam.baseName}.trimmed.sorted.bam
"""
}

process COVERAGE_ANALYSIS {
cpus params.cpus
memory params.memory
Expand All @@ -81,4 +115,4 @@ process COVERAGE_ANALYSIS {
samtools coverage ${bam} > ${name}.coverage.tsv
samtools depth -s -d 0 -H ${bam} > ${name}.depth.tsv
"""
}
}
15 changes: 9 additions & 6 deletions nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@ process.shell = ['/bin/bash', '-euo', 'pipefail']
cleanup = true
conda.createTimeout = '1 h'

VERSION = '0.9.3'
VERSION = '0.10.0'

manifest {
name = 'TRON-Bioinformatics/covigator-ngs-pipeline'
Expand All @@ -89,10 +89,16 @@ Input:
* --library: required only when using --input_fastqs
* --input_fastas_list: alternative to --name and --fasta for batch processing
Optional input only required to use a custom reference:
* --reference: the reference genome FASTA file, *.fai, *.dict and bwa indexes are required.
* --gff: the GFFv3 gene annotations file (required to run iVar and to phase mutations from all variant callers)
* --snpeff_data: path to the SnpEff data folder, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --snpeff_config: path to the SnpEff config file, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --snpeff_organism: organism to annotate with SnpEff, it will be useful to use the pipeline on other virus than SARS-CoV-2
Optional input:
* --fastq2: the second input FASTQ file
* --reference: the reference genome FASTA file, *.fai, *.dict and bwa indexes are required.
* --gff: the GFFv3 gene annotations file (only required to run iVar)
* --primers: a BED file containing the primers used during library preparation. If provided primers are trimmed from the reads. Only applicable to FASTQs.
* --min_base_quality: minimum base call quality to take a base into account for variant calling (default: 20)
* --min_mapping_quality: minimum mapping quality to take a read into account for variant calling (default: 20)
* --vafator_min_base_quality: minimum base call quality to take a base into account for VAF annotation (default: 0)
Expand All @@ -110,9 +116,6 @@ Optional input:
* --open_gap_score: global alignment open gap score, only applicable for assemblies (default: -3)
* --extend_gap_score: global alignment extend gap score, only applicable for assemblies (default: -0.1)
* --skip_sarscov2_annotations: skip some of the SARS-CoV-2 specific annotations (default: false)
* --snpeff_data: path to the SnpEff data folder, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --snpeff_config: path to the SnpEff config file, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --snpeff_organism: organism to annotate with SnpEff, it will be useful to use the pipeline on other virus than SARS-CoV-2
* --keep_intermediate: keep intermediate files (ie: BAM files and intermediate VCF files)
Output:
Expand Down
Loading

0 comments on commit c94136b

Please sign in to comment.