This pipeline performs resequencing analysis by calling freebayes on reads aligned to the genome using bwa mem. Aligned BAM files are then filtered with markduplicates to select reads aligned unambiguously to the genome. Genotypes are called from all samples, and the resulting VCF file is normalized using vcfwave and bcftools, which is the final output of this pipeline.
There's also an additional wiki for this project with more information about this pipeline: nf-resequencing-mem wiki
In order to execute this pipeline, you will need nextflow
installed and one of this
different executors: conda
, singularity
and docker
.
The proper way to running this pipeline is described in
pipeline sharing
nextflow manual, and lets nextflow to download and execute the pipeline, for example:
nextflow pull cnr-ibba/nf-resequencing-mem
You can specify the version of the pipeline to be downloaded by providing the
-r
option, for example:
nextflow pull cnr-ibba/nf-resequencing-mem -r v0.6.2
In order to test this pipeline in your infrastructure, you can call this pipeline
on the test dataset using the test
profile:
nextflow run cnr-ibba/nf-resequencing-mem -profile test,<your profile>
where <your profile>
is one of docker
, singularity
, apptainer
or conda
profiles.
Nextflow will download the required containers (if necessary) and will execute
the pipeline on the test dataset.
For more information, please see Profiles scope section
of this documentation.
When running Nextflow, Nextflow looks for a file named nextflow.config
in the
current directory and in the script base directory (if it is not the same as the
current directory). Finally it checks for the file $HOME/.nextflow/config
, which
have lower priority than the other configuration files.
There's also a institutional configuration for this pipeline, which is located
at cnr-ibba/nf-config.
Additionally, you can provide a custom config file using the -config
option,
which has an higher priority respect to the other configuration files.
There are tree different scopes, which can be used to customize this
pipeline according your needs:
- The
params
scope defines variables which applies to the whole pipeline: those params could be the input samplesheet or the genome to be used - The
profiles
scope defines variables which apply to the particular profile invoked with the-profile
Nextflow parameter - The
process
scope can define parameters applied to a single process (for example the number of CPUs used or the required RAM)
All this configuration scopes can be customized in each configuration file,
and configuring a value in an higher priority file will override values
defined in lower priority configuration files.
There are also configuration files in the conf
folder of this repository, for
example the conf/modules.config
file keeps the configuration for each module
used within this pipeline: you can affect a module behavior without modifying the
proper module simply changing the proper process configuration section.
The params scope let you to define pipeline parameters. You can set those params directly in config file or pass those from command line (which have the highest priority and override each parameter defined in other configuration files). These are pipeline parameters which are mandatory:
--input
: (required) specify the samplesheet CSV/TSV file wheresample,fastq_1,fastq_2
columns are described (see assets/samplesheet.csv for an example). In thefastq_1
andfastq_2
columns you need to specify the path fore R1 and R2 files respectively. If you have single paired reads, leavefastq_2
column empty. if you have more file for the same sample, specify all the single/pair files using the same sample name: this pipeline will append all reads belonging to the same sample before calling trimgalore in theCAT_FASTQ
process.--genome_fasta
: (required) path to genome (FASTA) file. If file is compressed, index calculation will be forced even if provided by CLI--outdir
: (required) directory in which store results
There are also additional pipeline parameters that can be provided and can be used to save intermediate results or to skip a particular step:
--genome_fasta_fai
: path to fasta index file (skip fasta index step)--genome_bwa_index
: path to genome bwa index directory (skip bwa index step)--save_cram
: (bool, def. false) save markduplicated cram files with their indexes in results folder--save_trimmed
: (bool, def. false) save trimmed reads in results folder--save_fasta_index
: (bool, def. false) save fasta index (for reusing with this pipeline)--save_bwa_index
: (bool, def. false) save bwa index (for reuse with this pipeline)--save_freebayes
: (bool, def. false) save freebayes output file (not normalized!)--remove_fastq_duplicates
: (bool, def. false) remove FASTQ duplicates by IDs using seqkit/rmdup--save_unique_fastq
: (bool, def. false) write de-duplicated FASTQ files (requires--remove_fastq_duplicates
option)--ploidy
: (integer, def. 2) Sets the default ploidy for the analysis--gvcf
: (bool, def. false) save output in gVCF format (def. false)--gvcf_chunk
: (integer) When writing gVCF output emit a record for every N bases.--gvcf_dont_use_chunk
: (bool, def. false) When writing the gVCF output emit a record for all bases, will also route an int to--gvcf_chunk
similar to--output-mode EMIT_ALL_SITES
from GATK--skip_normalization
: (bool, def. false) skip VCF normalization steps--normalization_only
: (bool, def. false) only normalize a VCF file (skip all the other steps, see Normalize a vcf file)--input_vcf
: path to a VCF file to be normalized (required when--normalization_only
is set)--input_tbi
: path to a VCF index file (required when--normalization_only
is set)--snpeff_database
: annotate the VCF file with SnpEff by providing a pre-built database that can be found using thejava -jar snpEff.jar databases
command. If the database is known to SnpEff will be downloaded and managed by the pipeline itself. Otherwise you can provide a custom database using--snpeff_cachedir
and--snpeff_config
parameters. See our wiki Creating a snpEff database for more information.--snpeff_cachedir
: SnpEff cache directory. It must contain a subdirectory with the same name of--snpeff_database
, with a valid SnpEff database as a content. Is required when annotating with SnpEff with a custom database--snpeff_config
: SnpEff custom config file. Is required only with a custom database. Needs to have the same custom database defined by--snpeff_database
option (see: Building databases of SnpEff documentation)
You can have a list of available parameters by calling:
nextflow run cnr-ibba/nf-resequencing-mem --help
In addition, instead of passing parameters using CLI, you can create a custom
configuration file and define each params in the params scope. According
nextflow best practices, parameters defined in params scope should be
placed in a json file, for example params.json
:
{
"input": "<samplesheet.csv>",
"genome_fasta": "<genome_fasta>",
"outdir": "<results dir>",
"save_fasta_index": true
}
Parameters have the same name used within the CLI, but without the --
prefix.
Nextflow can be then called like this:
nextflow run cnr-ibba/nf-resequencing-mem -resume -profile <your profile> \
-params-file params.json
There's also a way to provide parameters using a config file:
according to the nextflow community, this
configuration file should be used to specify all the parameters
that can't be specified using the command line, for example custom arguments
to be passed to a certain module using ext.args
option can be defined in
a custom.config
file like this:
process {
withName: PICARD_MARKDUPLICATES {
ext.args = '--TMP_DIR $TMPDIR'
}
}
Then you can call nextflow
providing such configuration file:
nextflow run cnr-ibba/nf-resequencing-mem -resume -profile <your profile> \
-config custom.config
You can name this file as you prefer, but if is named nextflow.config
it will
be automatically loaded by nextflow: is better to use a different name to avoid
confusion and explictly provide the configuration file using the -config
option.
See Nextflow Configuration
documentation for more information.
The profile scope is used to define configuration attributes that can be activated
in different environment or context when providing the profile name using the
-profile
option. For example, pipeline dependencies are managed using nextflow
by defining three different profiles:
- conda: every pipeline step will manage its requirements using conda in a
specific environment. Conda environments are created inside
work
directory (but you can change this behavior usingcacheDir
option within the conda scope). This in not the recommended profile in a HPC environment, should be used if you aren't able to use singularity or docker profiles. - docker: manage requirements using docker images. You will need to be part of
the
docker
group in order to use this profile. - singularity: manage requirements using singularity images. You can execute
this profile without any permissions.
singularity
software need to be installed and available in your$PATH
bash environment variable - apptainer: manage requirements using apptainer images, which is the community driven continuation of Singularity, maintained under the Linux Foundation.
Then there are also profiles which define the executor used to submit/collect a job, that can be customized in order to affect the pipeline in a particular environment. Profiles can be combined when calling nextflow, for example:
nextflow run -profile singularity,slurm ...
will enable all the configuration to be applied on slurm executor using singularity.
Profiles are also defined in the institutional configuration file, with the aim
to describe and configure the infrastructure available by a certain institution.
For this pipeline, the base repository (--custom_config_base
)
for the institutional configuration is located at
cnr-ibba/nf-configs using
ibba as default branch. You
call any profile described by the
cnr-ibba/nf-config/README.md
file.
The process configuration scope allows you to provide a specific configuration for a specific process. For example, within this scope you can declare the maximum number of CPUs used by a certain process or the amount of memory required, for example:
process {
// https://www.nextflow.io/docs/edge/reference/process.html#resourcelimits
resourceLimits = [
cpus: 32,
memory: 64.GB
]
}
You can specify here any property described in the process
directive and the executor sections and override them.
With the withLabel
selector, you can configure of all processes annotated with
such label selector. For example:
process {
// only processes with this label have those parameters
withLabel: process_high {
cpus = 4
memory = 4.GB
}
}
will affect all the processes annotated with process_high
label. Similarly, you
can provide additional parameters to a certain step. For example, supposing to
change the freebayes
ploidy since you are dealing with a non-diploid sample
(default): you can provide a custom.config
file in which pass extra parameters
to freebayes:
process {
withName: FREEBAYES_CHUNK {
ext.args = '--ploidy 4'
}
}
NOTE: freebayes ploidy can be set using the
--ploidy
parameter, see Params scope section of this documentation.
There are parameters which are nextflow specific. They start all with a single
-
before the parameter name and need to be provided using the nextflow CLI.
Here are the most used parameters:
-resume
: recover previous attempt-profile
: specify one ofdocker
,singularity
,apptainer
andconda
profiles.singularity
andapptainer
are the recommended profile in a HPC environment-work-dir
: save temporary files in a default temporary directory (def.$projectDir/work
)
The local executor is the default executor used when calling pipeline. It spawns pipeline processes using fork and threads using all your local CPUs and memory available. If you need to set a limit to the resources used, enable this configuration for local executor:
executor {
// for the local executor, I will set the maximum values of CPU and MEMORY
$local {
cpus = 8
memory = '16 GB'
}
}
Using the $
symbol before the executor name, let you to specify different
configuration in the same executor scope. See
scope executor
documentation page for more information.
You can change the default executor by specifying the pbs
profile. Simply add
such profile to your command line, for example:
nextflow run cnr-ibba/nf-resequencing-mem -resume -profile pbs,singularity \
--input <samplesheet.csv> --genome_fasta <genome.fasta> --outdir <results dir>
In alternative, you can export this environment variable:
export NXF_EXECUTOR=pbs
Like pbs
executor, simply add such profile to your command line, for example:
nextflow run cnr-ibba/nf-resequencing-mem -resume -profile slurm,singularity \
--input <samplesheet.csv> --genome_fasta <genome.fasta> --outdir <results dir>
In alternative, you can export this environment variable:
export NXF_EXECUTOR=slurm
This pipeline could run using the AWS batch queue system.
In order to do so, you need to configure your credentials with aws cli:
you require to configure a IAM account with permission to run batch, EC2 and S3.
You require also a S3 bucket in which nextflow could store and retrieve data (nextflow
will make a copy of the input data and will retrieve the results from here) and
a AWS batch queue with EC2 spot instances as recommended compute environment.
After that, you could launch this pipeline by providing only awsbatch as profile
and the queue name and the AWS region with the --awsqueue
and --awsregion
parameters:
nextflow run cnr-ibba/nf-resequencing-mem -resume -profile awsbatch \
-bucket-dir s3://<s3 bucket name>/<subfolder> \
--input '<samplesheet.csv>' --genome_fasta <genome_fasta> \
--awsqueue <aws batch queue name> --awsregion <aws region>
Please see the Amazon Cloud section of nextflow documentation to get other information on nextflow and AWS usage.
With this pipeline is it possible to perform the normalization workflow on a VCF
file, without running the whole pipeline. This is useful when you have a VCF file
that needs to be normalized, for example after a freebayes run. You can call
this pipeline providing the --normalization_only
parameter and the --input_vcf
and --input_tbi
parameters:
nextflow run cnr-ibba/nf-resequencing-mem -resume -profile <your profile> \
--normalization_only --input_vcf <input.vcf> --input_tbi <input.tbi> \
--genome_fasta <genome.fasta> --outdir <results dir>
Other provided parameters will be ignored, and the pipeline will normalize the
VCF file and will store the normalized VCF file in the outdir
directory.
You can find more information about normalization workflow in our
Normalizing VCF files and
Running normalization only
wiki pages
We have a wiki page that describes how to merge different datasets of SNPs using this pipeline: you can find it at Merging different datasets of SNPs
Rarely this pipeline could fail at INPUT_CHECK:SAMPLESHEET_CHECK
claiming that given sample sheet does not appear to contain a header
,
even if your sample sheet starts with sample,fastq_1,fastq_2
header
section. This behavior is very uncommon and described in nf-core/tools
issue #1539. There's a way
to overcome this issue by providing --has_header
parameter to
bin/check_samplesheet.py
which let to skip the header check sections. However,
you have to ensure your samplesheet have the required sample,fastq_1,fastq_2
header section, otherwise this pipeline will not work as intended.
You can do it by customizing the SAMPLESHEET_CHECK
step
in the process scope, for example:
process {
withName: SAMPLESHEET_CHECK {
ext.args = '--has_header'
}
}
The freebayes step can take a lot of time when calculating SNPs with a lot of data.
This process is calculated using multiple processes by splitting the whole genome in
regions (relying on CRAM alignment sizes), and then by calling SNPs on each region
on a single process.
In the last step, all results are collected and sorted to produce the final VCF file
(see subworkflows/local/cram_freebayes_parallel
subworkflow for more information).
You can customize the region splitting, for example by using a greater cumulative
coverage (def. is 500_000_000
) or the minimum fragment length (def. is 10_000
)
in the split process like this:
process {
withName: FREEBAYES_SPLITCRAM {
ext.args = '--max_coverage 100_000_000 --min_length 20_000'
}
}
however, having a lot of smaller regions could be problematic if the overlapping region shared by segments become larger than the unique region.
There are some freebayes processes that can fail by reaching time or memory
limits. According to freebayes creators (see issues
#465
and #626) this is unpredictable
and may depends on coverage, allelic complexity and sample or ploidy
number. Nextflow can resubmit such process increasing the required resources at
each step until maxRetries
attempts are reached: you could increase the retry
attempts like this:
process {
withName: FREEBAYES_CHUNK {
maxRetries = 10
}
}
If this is not sufficient and jobs continue to fail for memory and time requirements,
you could think to skip regions with a very high coverage (-g
option) or
try to sub-sample regions with high coverage with the --limit-coverage
option,
like this:
process {
withName: FREEBAYES_CHUNK {
ext.args = '--limit-coverage 25'
}
}
This pipeline compute a coverage step before calling freebayes, so you could try to determine which value make sense to be used for filtering.
N.B. Even with downsampling this step could require a lot of time, and time required for each step will be unpredictable. Using a very low coverage limit could affect the SNP calling process. Use this option with caution.
When a FASTQ
file have duplicated IDs, the MarkDuplicates
fails with an error
message like this:
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 11: RGE200002173L1C027R02800999933
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:571)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:270)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
This is a not an issue with MarkDuplicates
(as discussed
here)
but an issue at demultiplexing step: the only way to deal with this problem is
to make rid of duplicated IDs using seqkit/rmdup
by providing the --remove_fastq_duplicates
option.
Markduplicates writes temporary files into /tmp
partition by default. If your
organization have a different location where temporary files should be stored
(ex /scratch
or any other $TMP
position) and your jobs are running out of
spaces, you should provide a different temporary location to MarkDuplicates
steps,
for example:
process {
withName: PICARD_MARKDUPLICATES {
ext.args = '--TMP_DIR $TMPDIR'
}
}
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md
file.