For an explanation of the general purpose of HybPiper, and the approach it takes to generate target locus sequences from sequence capture data, please see the documentation, wiki and tutorial at the mossmatters
repo:
- https://github.com/mossmatters/HybPiper/
- https://github.com/mossmatters/HybPiper/wiki
- https://github.com/mossmatters/HybPiper/wiki/Tutorial
To simplify running HybPiper, I’ve provided a Singularity container based on the Linux distribution Ubuntu 22.04, with all the software required to run the HybPiper pipeline (including some additional functionality, see below for details), as well as all the dependencies (BioPython, BLAST, BWA, BBmap, Exonerate, SPAdes, Samtools). The container is called hybpiper-paragone.sif
.
To run HybPiper using this container, I’ve provided a Nextflow pipeline that uses the software in the Singularity container. This pipeline runs all HybPiper steps with a single command. The pipeline script is called hybpiper.nf
. It comes with an associated config file called hybpiper.config
. The only input required is a folder of sequencing reads for your samples, and a target file in .fasta
format. The Nextflow pipeline will automatically generate the namelist.txt
file required by some of the HybPiper scripts, and will run all HybPiper scripts on each sample in parallel. It also includes an optional read-trimmming step to QC your reads prior to running HybPiper, using the software Trimmomatic. The number of parallel processes running at any time, as well as computing resources given to each process (e.g. number of CPUs, amount of RAM etc) can be configured by the user by modifying the provided config file. The pipeline can be run directly on your local computer, and on an HPC system submitting jobs via a scheduler (e.g. SLURM, PBS, etc).
Your target file
, which can contain either nucleotide OR protein sequences (see below for corresponding pipeline options), should follow the formatting described for HybPiper here. Briefly, your target genes should be grouped and differentiated by a suffix in the fasta header, consisting of a dash followed by an ID unique to each gene, e.g.:
>AJFN-4471
AATGTTATACAGGATGAAGAGAAACTGAATACTGCAAACTCCGATTGGATGCGGAAATACAAAGGCT...
>Ambtr-4471
AGTGTTATTCAAGATGAAGATGTATTGTCGACAGCCAATGTGGATTGGATGCGGAAATATAAGGGCA...
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGGGGGTAGAGAACATGCTCTATGCTATGCTTTGC...
>Arath-4691
GAGCTTGGATCTATCGCTTGCGCAGCTCTCTGTGCTTGCACTCTTACAATAGCTTCTCCTGTTATTG...
>BHYC-4691
GAAGTGAACTGTGTTGCTTGTGGGTTTCTTGCTGCTCTTGCTGTCACTGCTTCTCCCGTAATCGCTG...
etc...
You will need to provide the hybpiper.nf
pipeline with either:
a) A directory of paired-end forwards and reverse reads (and, optionally, a file of single-end reads) for each sample; or
b) A directory of single-end reads.
For the read files to be recognised by the pipeline, they should be named according to the default convention:
*_R1.fastq
*_R2.fastq
*_single.fastq (optional; will be used if running with the flag `--paired_and_single`)
OR
*_R1.fq
*_R2.fq
*_single.fq (optional; will be used if running with the flag `--paired_and_single`)
NOTE:
- It’s fine if there’s text after the
R1
/R1
and before the.fastq
/.fq
, as long as it's the same for both read files. - It’s fine if the input files are gzipped (i.e. suffix
.gz
). - You can provide a folder of single-end reads only (use the flag
--single_only
if you do). Your read files should be named*_single.fastq
(or.fastq.gz
,.fq
,.fq.gz
). - You can specify a custom pattern used for read file matching via the parameters
--read_pairs_pattern <pattern>
or--single_pattern <pattern>
. - If your samples have been run across multiple lanes, you'll likely want to combine read files for each sample before processing. The pipeline can do this for you - see here for details.
Please see the Wiki entry Running on Linux.
Please see the Wiki entry Running on a Mac.
NOTE: Macs using the new Apple M1 chip are not yet supported.
Please see the Wiki entry Running on a PC.
Example run command:
nextflow run hybpiper.nf -c hybpiper.config -entry assemble -profile standard_singularity --illumina_reads_directory reads_for_hybpiper --targetfile_dna Angiosperms353_targetSequences.fasta
\Mandatory arguments:
############################################################################
--illumina_reads_directory <directory>
Path to folder containing illumina read file(s)
AND
--targetfile_dna <file> File containing fasta sequences of target genes
(nucleotides)
OR
--targetfile_aa <file> File containing fasta sequences of target genes
(amino-acids)
#############################################################################
Optional arguments:
-profile <profile> Configuration profile to use. Can use multiple
(comma separated). Available: standard (default),
standard_singularity, slurm_singularity, conda,
conda_slurm
--namelist A text file containing sample names. Only these
samples will be processed, By default, all samples
in the provided <Illumina_reads_directory>
directory are processed
--combine_read_files Group and concatenate read-files via a common prefix.
Useful if samples have been run across multiple lanes.
Default prefix is all text preceding the first
underscore (_) in read filenames
--combine_read_files_num_fields <int>
Number of fields (delimited by an underscore) to use
for combining read files when using the
`--combine_read_files` flag. Default is 1
--num_forks <int> Specify the number of parallel processes (e.g.
concurrent runs of 'hybpiper assemble') to run at any
one time. Can be used to prevent Nextflow from using
all the threads/cpus on your machine. Default is
to use the maximum number possible
--outdir <directory_name>
Specify the name of the pipeline results directory.
Default is 'results'
--paired_and_single Use when providing both paired-end R1 and R2 read
files as well as a file of single-end reads for each
sample
--single_end Use when providing providing only a folder of
single-end reads
--read_pairs_pattern <pattern>
Provide a comma-separated read-pair pattern for
matching fowards and reverse paired-end readfiles,
e.g. '1P,2P'. Default is 'R1,R2'
--single_pattern <pattern>
Provide a pattern for matching single-end read
files. Default is 'single'
#######################################################################################
################################ Trimmomatic options: #################################
#######################################################################################
--use_trimmomatic Trim forwards and reverse reads using Trimmomatic.
Default is off
--trimmomatic_leading_quality <int>
Cut bases off the start of a read, if below this
threshold quality.Default is 3
--trimmomatic_trailing_quality <int>
Cut bases off the end of a read, if below this
threshold quality. Default is 3
--trimmomatic_min_length <int>
Drop a read if it is below this specified length.
Default is 36
--trimmomatic_sliding_window_size <int>
Size of the sliding window used by Trimmomatic;
specifies the number of bases to average across.
Default is 4
--trimmomatic_sliding_window_quality <int>
Specifies the average quality required within the
sliding window. Default is 20
#######################################################################################
############################# hybpiper assemble options: ##############################
#######################################################################################
Options for step: map_reads:
--bwa Use BWA to search reads for hits to target. Requires
BWA and a target file that is nucleotides!
--diamond Use DIAMOND instead of BLASTx
--diamond_sensitivity {mid-sensitive,sensitive,more-sensitive,very-sensitive,
ultra-sensitive}
Use the provided sensitivity for DIAMOND searches.
--evalue e-value threshold for blastx/DIAMOND hits, default: 0.0001
--max_target_seqs Max target seqs to save in BLASTx search, default is 10
Options for step: distribute_reads:
--distribute_low_mem Distributing and writing reads to individual gene
directories will be 40-50 percent slower, but can use
less memory/RAM with large input files
Options for step: assemble_reads:
--cov_cutoff <int> Coverage cutoff to pass to the SPAdes assembler.
Default is 8
--single_cell_assembly Run SPAdes assemblies in MDA (single-cell) mode.
Default is False
--kvals Values of k for SPAdes assemblies. SPAdes needs to be
compiled to handle larger k-values! Default is
auto-detection by SPAdes
--merged Merge forward and reverse reads, and run SPAdes
assembly with merged and unmerged (the latter
in interleaved format) data. Default is off
--timeout_assemble_reads Kill long-running gene assemblies if they take longer
than X percent of average.
Options for step: extract_contigs:
--not_protein_coding If provided, extract sequences from SPAdes contigs using
BLASTn rather than Exonerate (step: extract_contigs)
--extract_contigs_blast_task
{blastn,blastn-short,megablast,dc-megablast}
Task to use for BLASTn searches during the extract_contigs
step of the assembly pipeline. See
https://www.ncbi.nlm.nih.gov/books/NBK569839/ for a
description of tasks. Default is: blastn
--extract_contigs_blast_evalue
Expectation value (E) threshold for saving hits.
Default is: 10
--extract_contigs_blast_word_size
Word size for wordfinder algorithm (length of best perfect
match)
--extract_contigs_blast_gapopen
Cost to open a gap
--extract_contigs_blast_gapextend
Cost to extend a gap
--extract_contigs_blast_penalty
Penalty for a nucleotide mismatch
--extract_contigs_blast_reward
Reward for a nucleotide match
--extract_contigs_blast_perc_identity
Percent identity. Can be used as a pre-filter at the BLASTn
stage, followed by --thresh (see below)
--extract_contigs_blast_max_target_seqs
Maximum number of aligned sequences to keep (value of 5 or
more is recommended). Default is: 500
--thresh Percent identity threshold for retaining Exonerate/BLASTn
hits. Default is 55, but increase this if you are worried
about contaminant sequences. Exonerate hit identity is
calculated using amino-acids, BLASTn hit identity is calculated
using nucleotides
--paralog_min_length_percentage <decimal>
Minimum length percentage of a SPAdes contig vs
reference protein query for a paralog warning to be
generated and a putative paralog contig to be
recovered. Default is 0.75
--depth_multiplier Assign a long paralog as the "main" sequence if it
has a coverage depth <depth_multiplier> times all
other long paralogs. Set to zero to not use depth.
Default is 10
--target Use the target file sequence with this taxon name in
Exonerate searches for each gene. Other targets for
that gene will be used only for read sorting. Can be a
tab-delimited file (one <gene>\t<taxon_name> per line)
or a single taxon name
--exclude Do not use any sequence with the specified taxon name
string in Exonerate searches. Sequenced from this
taxon will still be used for read sorting
--timeout_extract_contigs Kill long-running processes if they take longer than
X seconds. Default is 120
--no_stitched_contig Do not create stitched contigs; use longest Exonerate
hit only. Default is off
--no_pad_stitched_contig_gaps_with_n
When constructing stitched contigs, do not pad any gaps
between hits (with respect to the "best" protein reference)
with a number of Ns corresponding to the reference gap multiplied
by 3 (Exonerate) or reference gap (BLASTn). Default is: True.
--chimeric_stitched_contig_check
Attempt to determine whether a stitched contig is a potential
chimera of contigs from multiple paralogs. Default is: False
--bbmap_memory <int> MB memory (RAM) to use for bbmap.sh if a chimera check is
performed during step extract_contigs. Default: is 1000
--bbmap_subfilter <int> Ban alignments with more than this many
substitutions when performing read-pair mapping to
supercontig reference (bbmap.sh). Default is 7
--chimeric_stitched_contig_edit_distance <int>
Minimum number of base differences between one read
of a read pair vs the stitched-contig reference for a
read pair to be flagged as discordant. Default is 5
--chimeric_stitched_contig_discordant_reads_cutoff <int>
Minimum number of discordant reads pairs required
to flag a stitched-contig as a potential chimera of
contigs from multiple paralogs. Default is 5
--trim_hit_sliding_window_size <int>
Size of the sliding window (amino acids for Exonerate,
nucleotides for BLASTn) when trimming hit termini.
Default is: 5 (Exonerate) or 15 (BLASTn)
--exonerate_hit_sliding_window_thresh <int>
Percentage similarity threshold for the sliding window
(amino acids for Exonerate, nucleotides for BLASTn)
when trimming hit termini. Default is: 75 (Exonerate)
or 65 (BLASTn)
--exonerate_skip_hits_with_frameshifts
Skip Exonerate hits where the SPAdes sequence contains
a frameshift. See:
https://github.com/mossmatters/HybPiper/wiki/Troubleshooting-common-
issues,-and-recommendations#42-hits-where-the-spades-contig-contains-frameshifts.
Default is: False
--exonerate_skip_hits_with_internal_stop_codons
Skip Exonerate hits where the SPAdes sequence contains an
internal in-frame stop codon. See:
https://github.com/mossmatters/HybPiper/wiki/Troubleshooting,-common-
issues,-and-recommendations#31-sequences-containing-stop-codons.
A single terminal stop codon is allowed, but see option
"--exonerate_skip_hits_with_terminal_stop_codons" below. Default is: False.
--exonerate_skip_hits_with_terminal_stop_codons
Skip Exonerate hits where the SPAdes sequence contains a single
terminal stop codon. Only applies when option
"--exonerate_skip_hits_with_internal_stop_codons" is also provided.
Only use this flag if your target file exclusively contains
protein-coding genes with no stop codons included, and you would like
to prevent any in-frame stop codons in the output sequences.
Default is: False.
--exonerate_refine_full
Run Exonerate searches using the parameter "--refine full".
Default is: False.
--no_intronerate Do not run intronerate to recover fasta files for supercontigs
with introns (if present), and introns-only. If this flag is used,
fasta files in `subfolders 09_sequences_intron` and
`10_sequences_supercontig` will be empty
--no_padding_supercontigs If Intronerate is run, and a supercontig is created
by concatenating multiple SPAdes contigs, do not add
10 "N" characters between contig joins. By default,
Ns will be added
--keep_intermediate_files Keep all intermediate files and logs, which can be
useful for debugging. Default action is to delete
them, which greatly reduces the total file number
--verbose_logging If supplied, enable verbose login. NOTE: this can
increase the size of the log files by an order of
magnitude
--compress_sample_folder
Tarball and compress the sample folder after assembly
has completed (<sample_name>.tar.gz). Default is: False
#######################################################################################
####################### hybpiper paralog_retriever options: ###########################
#######################################################################################
--paralogs_list_threshold_percentage PARALOGS_LIST_THRESHOLD_PERCENTAGE
Percent of total number of samples and genes that must
have paralog warnings to be reported in the
<genes_with_paralogs.txt> report file. The default is
0.0, meaning that all genes and samples with at least
one paralog warning will be reported
--figure_length FIGURE_LENGTH
Length dimension (in inches) for the output heatmap
file. Default is automatically calculated based on the
number of genes
--figure_height FIGURE_HEIGHT
Height dimension (in inches) for the output heatmap
file. Default is automatically calculated based on the
number of samples
--sample_text_size SAMPLE_TEXT_SIZE
Size (in points) for the sample text labels in the
output heatmap file. Default is automatically
calculated based on the number of samples
--gene_text_size GENE_TEXT_SIZE
Size (in points) for the gene text labels in the
output heatmap file. Default is automatically
calculated based on the number of genes
--heatmap_filetype {png,pdf,eps,tiff,svg}
File type to save the output heatmap image as. Default
is png
--heatmap_dpi HEATMAP_DPI
Dots per inch (DPI) for the output heatmap image.
Default is 300
#######################################################################################
######################## hybpiper recovery_heatmap options: ###########################
#######################################################################################
--figure_length FIGURE_LENGTH
Length dimension (in inches) for the output heatmap
file. Default is automatically calculated based on the
number of genes
--figure_height FIGURE_HEIGHT
Height dimension (in inches) for the output heatmap
file. Default is automatically calculated based on the
number of samples
--sample_text_size SAMPLE_TEXT_SIZE
Size (in points) for the sample text labels in the
output heatmap file. Default is automatically
calculated based on the number of samples
--gene_text_size GENE_TEXT_SIZE
Size (in points) for the gene text labels in the
output heatmap file. Default is automatically
calculated based on the number of genes
--heatmap_filetype {png,pdf,eps,tiff,svg}
File type to save the output heatmap image as. Default
is *.png
--heatmap_dpi HEATMAP_DPI
Dot per inch (DPI) for the output heatmap image.
Default is 150
Please see the Wiki entry Additional pipeline features and details for further explanation of the parameters above, and general pipeline functionality.
If you're just after the unaligned .fasta
files for each of your target genes (not including putative paralogs), the two main output folders of interest are probably:
07_sequences_dna
08_sequences_aa
If you need the per-sample output folders produced by a standard HybPiper run, these can be found in folder:
04_processed_gene_directories
For a full explanation of output folders and files, please see the Wiki entry Output folders and files.
For details on adapting the pipeline to run on local and HPC computing resources, see here.
Please see the Wiki entry Issues.
30 April 2025
- Updated the HybPiper version in the Singularity container
hybpiper-paragone
to version 2.3.2, and updated thehybpiper-nf
script to version 1.1.0. - Updated the command line options for the entry point 'assemble' to reflect HybPiper version 2.3.2.
31 May 2024
- Updated the HybPiper version in the Singularity container
hybpiper-paragone
to version 2.1.7, and updated thehybpiper-nf
script to version 1.0.4 - Fixed an issue when calling
hybpiper retrieve_sequences
with an amino-acid target file. - Updated the flag
--use_diamond
to--diamond
to match stand-alone HybPiper. - Updated the flag
--run_intronerate
to--no_intronerate
to reflect changes made in HybPiper >= 2.1.6
03 July 2023
- Updated the HybPiper version in the Singularity container
hybpiper-paragone
to version 2.1.5, and updated thehybpiper-nf
script to version 1.0.3 - Added the new options
exonerate_hit_sliding_window_size
andexonerate_hit_sliding_window_thresh
to thehybpiper assemble
options.
15 May 2023
- Bugfix: BBmap.sh memory for Hybpiper chimera test was set to a default of 1 Mb. Changed to 1000 Mb.
- Update version in hybpiper-nf script to version 1.0.2
14 April 2023
- Updated the HybPiper version in the Singularity container
hybpiper-paragone
to version 2.1.3, and updated thehybpiper-nf
script to version 1.0.1 - Bugfix: get target file basename for HybPiper commands (fixes error when using a target file that isn't in the current working directory)
01 February 2023
- Added a
conda
andconda_slurm
profile. This allows the pipeline to be run using conda packages rather than the Singularity container. The corresponding conda environment is created in the nextflowwork
directory. - Updated the
hybpiper.nf
script to support all native HybPiper 2 parameters. - Added a script version number; view by using the flag
--version
. - Split the main HybPiper
assemble
pipeline, and the HybPiper commandscheck_targetfile
and thefix_targetfile
; these are now separate entry points to thehybpiper.nf
script, accessible by using the parameter-entry
, e.g.-entry assemble
. - Added separate help docs for each entry point, e.g.
-entry assemble --help
.
28 November 2022
- Change repository name from
HybPiper-RBGV
tohybpiper-nf
. - Update the required container name from
hybpiper-yang-and-smith-rbgv.sif
tohybpiper-paragone.sif
. - Update containerised HybPiper version and corresponding Nextflow scripts to HybPiper version 2
09 November 2021
- Nextflow script updated to DSL2. NOTE: Nextflow version >= 21.04.1 is now required!
- Singularity *.def file updated to use Ubuntu 20.04, and to install Muscle and FastTreeMP.