v0.1
The NGS-4-ECOPROD wrapper/pipeline collection is primarily dedicated to metagenome data processing and analysis. The installation script sets up a Miniconda folder and Conda environments in which all necessary tools are installed; it does not interfere with the Linux system. ngs4ecoprod aims to simplify the often complex tasks (especially for beginners) associated with this kind of data by automating the key steps that turn raw sequence data into human-interpretable results. The pipeline provides basic analysis scripts and tools from the public domain. The overarching goal is to optimize time utilization by streamlining data workflows, allowing researchers to devote more time to the substantive biological analysis.
This repository is developed in the framework of NGS-4-ECOPROD at the University of Göttingen. The pipeline aims to automate and simplify metagenomic workflows, including 16S/18S rRNA gene amplicon analysis, metagenomes derived from Illumina paired-end sequencing, and metagenomes derived from Nanopore long reads.
The pipeline was tested under Linux (Ubuntu 20.04 & 22.04 LTS) and is encapsulated in a Miniconda environment so that it does not affect the Linux operating system it is installed on.
NOTE: After the recent updates of Miniconda and Mamba (12th of July), installation seems to work only on recent Linux distributions (2020 and up).
You can either install the pipeline as a user into your home directory, or as server admin into, for example, /opt and make it accessible to every user via an alias in each user's .bashrc or the configuration file of the corresponding shell (e.g., zsh, ksh, tcsh).
The current disk space requirement for the installation is approximately 23 GB without the databases. Including all databases, the total disk space needed increases to roughly 1 TB (July 2023): SILVA requires an additional 1 GB, the kraken2 database (nt) 676 GB, the kaiju database (nr) 187 GB, and GTDB-Tk & PLSDB 80 GB.
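Before installing, you can quickly check whether enough disk space is available on the target path (a minimal sketch using the standard df tool):
# Check free disk space for a user installation (home) or a system-wide installation (/opt)
df -h ~
df -h /opt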
# 1. Download installation script
wget https://raw.githubusercontent.com/dschnei1/ngs4ecoprod/main/install_ngs4ecoprod
# 2. Install ngs4ecoprod (in this example into your home ~/ngs4ecoprod)
bash install_ngs4ecoprod -i ~/ngs4ecoprod
# 3. Restart terminal or type
source ~/.bashrc
# 4. Activate environment
activate_ngs4ecoprod
# 5. Remove installer
rm -f install_ngs4ecoprod
# 1. Download bash installation script
wget https://raw.githubusercontent.com/dschnei1/ngs4ecoprod/main/install_ngs4ecoprod
# 2. Install ngs4ecoprod (in this example system-wide into /opt/ngs4ecoprod)
sudo bash install_ngs4ecoprod -i /opt/ngs4ecoprod
# 3. To activate the environment ensure every user has the following alias in .bashrc:
# alias activate_ngs4ecoprod='source /opt/ngs4ecoprod/bin/activate ngs-4-ecoprod'
# Example command
echo "alias activate_ngs4ecoprod='source /opt/ngs4ecoprod/bin/activate ngs-4-ecoprod'" >> ~/.bashrc
# 4. Restart terminal or type
source ~/.bashrc
# 5. Activate environment
activate_ngs4ecoprod
# 6. Remove installer
rm -f install_ngs4ecoprod
Note: Before first use, please run GNU parallel once after activating your Conda environment and agree to its citation notice (cite or pay GNU parallel):
parallel --citation
Here is a list of all software installed by install_ngs4ecoprod via conda; in addition, NanoPhase, metaWRAP, GTDB-Tk, BLCA, and sra-toolkit are installed alongside.
SILVA database for ngs4_16S, ngs4_16S_blca & ngs4_18S
ngs4_download_silva_db -i ~/ngs4ecoprod/ngs4ecoprod/db
Download GTDB-tk and PLSDB databases for ngs4_np_assembly
ngs4_download_nanophase -i ~/ngs4ecoprod/ngs4ecoprod/db
Note: The GTDB-Tk database download is very slow; a mirror of the database will be available soon.
Download precompiled kraken2 (nt) and kaiju (nr) databases for ngs4_tax & ngs4_np_tax
ngs4_download_tax_k2k -i ~/ngs4ecoprod/ngs4ecoprod/db
Note: The download will be performed in the current directory - make sure you have ~580 GB (+676 GB if you install/extract to the same disk) of disk space before starting the download.
To remove the pipeline, do the following (adapt .bashrc to your shell):
# 1. Remove conda folder
rm -rf ~/ngs4ecoprod
# 2. Remove alias from .bashrc
sed -i -E "/^alias activate_ngs4ecoprod=.*/d" ~/.bashrc
So far the repository contains the following data processing scripts:
- Amplicon analysis pipeline (16S rRNA gene, Bacteria and Archaea; 18S rRNA gene, Eukaryota): ngs4_16S, ngs4_16S_blca, ngs4_16S_blca_ncbi, ngs4_18S, ngs4_18S_blca
- Nanopore: metagenome analysis (under active development!): ngs4_np_qf, ngs4_np_tax, ngs4_np_assembly
- Illumina: metagenome analysis (under active development!): ngs4_qf, ngs4_tax
ngs4_16S is a 16S rRNA gene amplicon analysis pipeline that processes raw reads into amplicon sequence variant (ASV) sequences, a read count table, and a phylogenetic tree of the ASV sequences. The pipeline uses the tools fastp, cutadapt, vsearch, mafft, FastTree, NCBI BLAST, BLCA, and R.
In principle, the pipeline performs the following steps:
- all raw reads are quality filtered
- all primer sequences are removed
- paired-end reads are stitched together
- reads are sorted by length
- reads are dereplicated (by default sorted by decreasing abundance)
- reads are denoised, see UNOISE3
- de novo chimera removal
- reference-based chimera removal (reference SILVA NR99 138.1)
- final set of ASVs
- quality filtered reads are mapped back against the ASVs
- blastn against SILVA
- ASV count table generation
- phylogenetic tree from ASVs
- data formatting and curation (minimum of 85% query coverage + lineage correction for 16S rRNA gene amplicons)
- final ASV count table
- Optional: for a more robust classification, use BLCA against SILVA or NCBI's 16S rRNA database
The default configuration of the pipeline is for Illumina MiSeq paired-end reads using reagent kit v3 (2x 300 bp, 600 cycles) with the primer pair S-D-Bact-0341-b-S-17 and S-D-Bact-0785-a-A-21 proposed by Klindworth et al. (2013). However, by changing the primer sequence, sequence length, and ASV length parameters, this pipeline can be used for any overlapping paired-end bacterial or archaeal amplicon raw sequence data (see options). The script also performs a lineage correction (removing uncertain assignments from species to phylum based on percent identity: <98.7% species, <94.5% genus, <86.5% family, <82.0% order, <78.5% class, <75% phylum), as proposed by Yarza et al. (2014), to avoid over- or misinterpretation of the blast classification.
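To make the rank cut-offs tangible, here is a minimal, hypothetical shell sketch (not part of the pipeline; the function name is illustrative) that maps a blastn percent identity to the lowest rank that is still kept:
# Yarza et al. (2014) thresholds: assignments below a rank's cut-off are removed
rank_cutoff() {
    local id=$1  # blastn percent identity, e.g. 96.2
    if   awk -v i="$id" 'BEGIN{exit !(i>=98.7)}'; then echo species
    elif awk -v i="$id" 'BEGIN{exit !(i>=94.5)}'; then echo genus
    elif awk -v i="$id" 'BEGIN{exit !(i>=86.5)}'; then echo family
    elif awk -v i="$id" 'BEGIN{exit !(i>=82.0)}'; then echo order
    elif awk -v i="$id" 'BEGIN{exit !(i>=78.5)}'; then echo class
    elif awk -v i="$id" 'BEGIN{exit !(i>=75.0)}'; then echo phylum
    else echo unclassified
    fi
}
rank_cutoff 96.2   # prints "genus": the species-level assignment would be dropped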
A very basic R script (based on ampvis2) is provided to start your analyses. I highly recommend filling the metadata file (metadata.tsv) with all information about the samples that you have at hand. For more information, microsud has compiled an extensive overview of available microbiome analysis tools.
Before you start, you need demultiplexed forward and reverse paired-end reads placed in one folder, and the sample names must meet the following naming convention:
<Sample_name>_<forward=R1_or_reverse=R2>.fastq.gz
# Example
Sample_1_R1.fastq.gz
Sample_1_R2.fastq.gz
Sample_2_R1.fastq.gz
Sample_2_R2.fastq.gz
etc.
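If your files arrive with a different suffix, renaming them is straightforward. A hypothetical example, assuming your provider delivers Sample_1_1.fastq.gz / Sample_1_2.fastq.gz (adapt the pattern to your actual file names):
# Rename *_1.fastq.gz/*_2.fastq.gz to the required *_R1.fastq.gz/*_R2.fastq.gz scheme
for f in *_1.fastq.gz; do mv "$f" "${f%_1.fastq.gz}_R1.fastq.gz"; done
for f in *_2.fastq.gz; do mv "$f" "${f%_2.fastq.gz}_R2.fastq.gz"; done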
Afterwards you can start the pipeline (here with example data) to process your 16S rRNA gene amplicon data.
ngs4_16S \
-i ~/ngs4ecoprod/ngs4ecoprod/example_data/16S \
-o ~/ngs4_16S \
-d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
-p 3 -t 8
-i Input folder containing paired-end fastq.gz
Note: files must be named according to the following scheme
Sample_name_R1.fastq.gz
Sample_name_R2.fastq.gz
-o Output folder
-d Path to SILVA database
-l Optional: Minimum length of forward and reverse sequence in bp [default: 200]
-q Optional: Minimum phred score [default: 20]
-p Number of processes [default: 1]
-t Number of CPU threads per process [default: 1]
-f Forward primer [default: CCTACGGGNGGCWGCAG]
-r Reverse primer [default: GGATTAGATACCCBDGTAGTC]
Note: Use the reverse complement sequence of your 16S rRNA gene reverse primer
-a Optional: Minimum length of amplicon [default: 400]
-u Optional: minsize of UNOISE [default: 8]
Note: Only change under special circumstances, i.e., very low sample number
-h Print this help
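As an illustration of adapting the pipeline to another primer pair, here is a hypothetical call for 16S V4 data (515F/806R); the input/output paths and the length cut-offs are examples only, adapt them to your own protocol:
# Note: -r is the reverse complement of the 806R primer sequence
ngs4_16S \
    -i ~/my_16S_V4_data \
    -o ~/ngs4_16S_v4 \
    -d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
    -f GTGYCAGCMGCCGCGGTAA \
    -r ATTAGAWACCCBNGTAGTCC \
    -l 150 -a 250 \
    -p 3 -t 8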
Optional: Since ngs4_16S uses a "simple" blastn (megablast, best hit) to infer the taxonomy of the ASVs, you might want to use a more sophisticated approach for taxonomic assignment. You can apply the Bayesian-based lowest common ancestor (BLCA) classification method to your data. This will take more computation time (depending on the diversity/number of ASVs in your samples and your hardware), mainly because BLCA performs a blastn search and a clustalo alignment of the ASV sequences.
There are two scripts: ngs4_16S_blca, which uses BLCA with the SILVA 138.1 database, and ngs4_16S_blca_ncbi, which uses BLCA against NCBI's 16S rRNA database.
To run BLCA with SILVA on your data after ngs4_16S has finished, process your data with ngs4_16S_blca as follows:
ngs4_16S_blca \
-i ~/ngs4_16S \
-d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
-t 8
To run BLCA with NCBI's 16S rRNA gene database on your data after ngs4_16S has finished, process your data with ngs4_16S_blca_ncbi as follows (note: every time you start the script, the most recent version of the database will be downloaded):
ngs4_16S_blca_ncbi -i ~/ngs4_16S -t 8
ASV_sequences.fasta → FASTA file containing all ASVs from your dataset
ASV_table.tsv → ASV read count table including blast classification
ASV.tre → Phylogenetic tree of the ASV sequences
markergene_16S.R → Basic R script to visualize and analyze your data
metadata.tsv → Template metadata file including SampleID
ngs4_16S_DATE_TIME.log → Pipeline log file
ASV_table_BLCA.tsv → ASV read count table including BLCA SILVA classification
ngs4_16S_blca_DATE_TIME.log → Pipeline log file
ASV_table_BLCA_ncbi.tsv → ASV read count table including BLCA NCBI classification
ngs4_16S_blca_ncbi_DATE_TIME.log → Pipeline log file
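For a quick sanity check of the results you can peek at the count table from the shell (file names as in the list above, output folder as in the example run):
# Show the first lines and columns of the ASV count table
head -n 5 ~/ngs4_16S/ASV_table.tsv | cut -f 1-6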
The ngs4_18S pipeline is intended for 18S rRNA gene amplicons and is very similar to the 16S rRNA gene pipeline, except that the Yarza correction is not applied to the blastn hits. The default settings currently match the primer pair TAReuk454FWD1 and TAReukREV3 designed by Stoeck et al. (2010), but by tweaking the settings (primer sequences, amplicon size) the pipeline can be adapted to other primers (paired-end sequences must overlap).
ngs4_18S \
-i ~/raw_18S_data \
-o ~/ngs4_18S \
-d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
-p 3 -t 8
-i Input folder containing paired-end fastq.gz
Note: files must be named according to the following scheme
Sample_name_R1.fastq.gz
Sample_name_R2.fastq.gz
-o Output folder
-d Path to SILVA database
-l Optional: Minimum length of forward and reverse sequence in bp [default: 200]
-q Optional: Minimum phred score [default: 20]
-p Number of processes [default: 1]
-t Number of CPU threads per process [default: 1]
-f Optional: Forward primer [default: CCAGCASCYGCGGTAATTCC]
-r Optional: Reverse primer [default: TYRATCAAGAACGAAAGT]
Note: Use the reverse complement sequence of your 18S rRNA gene reverse primer
-a Optional: Minimum length of amplicon [default: 350]
-u Optional: minsize parameter of UNOISE [default: 8]
Note: Only change under special circumstances, i.e., very low sample number
-h Print this help
ngs4_18S_blca \
-i ~/ngs4_18S \
-d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
-p 3 -t 8
ASV_sequences.fasta → FASTA file containing all ASVs from your dataset
ASV_table.tsv → ASV read count table including blast classification
ASV.tre → Phylogenetic tree of the ASV sequences
markergene_18S.R → Basic R script to visualize and analyze your data
metadata.tsv → Template metadata file including SampleID
ngs4_18S_DATE_TIME.log → Pipeline log file
ASV_table_BLCA.tsv → ASV read count table including BLCA SILVA classification
ngs4_18S_blca_DATE_TIME.log → Pipeline log file
To ensure high-quality long reads, the first step is to filter your data with ngs4_np_qf, which applies a general quality filter with fastp and afterwards removes barcode leftovers at the ends and/or in the middle of the long reads with Porechop_ABI (an extension of Porechop).
Before you start, you need your basecalled long reads in one folder, and your file names must meet the following naming convention:
<Sample_name>.fastq.gz
# Example
Sample_1.fastq.gz
Sample_2.fastq.gz
Sample_3.fastq.gz
etc.
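If your basecaller wrote the reads into per-barcode folders (as Guppy/Dorado do by default), you can merge them into one file per sample; the folder and sample names below are illustrative:
# Concatenate all basecalled reads of one barcode into a single sample file
cat fastq_pass/barcode01/*.fastq.gz > Sample_1.fastq.gz
cat fastq_pass/barcode02/*.fastq.gz > Sample_2.fastq.gz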
Note: Before you can perform a test run with the example data, you have to download the example data (Zymo-gut-mock-Kit20 (SRR17913199) described in the NanoPhase paper):
ngs4_download_np_example -i ~/ngs4ecoprod/ngs4ecoprod/example_data
ngs4_np_qf -i ~/ngs4ecoprod/ngs4ecoprod/example_data/nanopore -o ~/ngs4_np -p 3 -t 12
-i Input folder containing nanopore raw data as fastq.gz
Note: files must be named according to the following scheme (ending with .fastq.gz)
SampleName.fastq.gz
-o Output folder
-q Optional: Minimum phred score [default: 15]
Note: you might have to lower this for old chemistry/flow cells (<R10.4)
-l Optional: Minimum length of nanopore read [default: 500]
-p Number of processes [default: 1]
-t Number of CPU threads per process [default: 1]
-h Print this help
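To judge how many reads survived filtering, you can count the reads per file before and after with standard tools (the file path is illustrative):
# A fastq record spans four lines, so the read count is the line count divided by 4
zcat ~/ngs4_np/Sample_1.fastq.gz | awk 'END{print NR/4}'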
To get a "rough" estimate of the taxonomic composition of your metagenomes you can use ngs4_np_tax
which is a combination of Kraken2 and Kaiju against NCBIs nt and nr, respectively. These tools use large databases and also some more RAM (up to 670 Gb) per process, however, with -m this can be reduced. The script will produce a read count table which you can then analyze in R.
ngs4_np_tax -i ~/ngs4_np -d ~/ngs4ecoprod/ngs4ecoprod/db/ -p 1 -t 20 -m
-i Folder containing quality filtered fastq.gz
-d Path to databases (kraken2 & kaiju)
-p Number of processes [default: 1]
Note: per process you need 187-670 GB of RAM
-t Number of CPU threads per process [default: 1]
-m Reduce RAM requirements to 187 GB (--memory-mapping for kraken2), slower
Note: If your database is NOT located on an SSD, expect long processing times
-h Print this help
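Before launching, it can help to verify that enough RAM is available and that the database disk is an SSD (a sketch using standard Linux tools):
# Available memory in GB
free -g
# ROTA=0 marks non-rotational (SSD) devices
lsblk -d -o NAME,ROTA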
Now to the interesting part: assembling your quality-filtered long reads and generating metagenome-assembled genomes (MAGs). This task is performed by NanoPhase, which uses several tools to complete it: metaWRAP, maxbin2, metabat2, semibin, checkm, and GTDB-tk, among others.
ngs4_np_assembly -i ~/ngs4_np -p 1 -t 20
-i Folder containing quality filtered fastq.gz
-p Number of processes [default: 1]
Note: It is better to use only one process here, depending on your system
-t Number of CPU threads per process [default: 1]
-h Print this help
ngs4_qf performs quality filtering on your raw sequence data. In detail, low-quality sequences are removed, sequences are trimmed if the quality drops below the threshold, and sequences are polished according to the consensus if reads overlap. In addition, adapter leftovers and possible phiX leftovers are removed.
Note:
There is one requirement for the script to work (see example files), your file names have to meet the following scheme:
<Sample_name>_<forward=R1_or_reverse=R2>.fastq.gz
# Example
Sample_1_R1.fastq.gz
Sample_1_R2.fastq.gz
Sample_2_R1.fastq.gz
Sample_2_R2.fastq.gz
etc.
ngs4_qf -i ~/ngs4ecoprod/ngs4ecoprod/example_data -o ~/ngs4_test_run -d ~/ngs4ecoprod/ngs4ecoprod/db -p 3 -t 14
-i Input folder containing paired-end fastq.gz
Note: files must be named according to the following scheme
Sample_name_R1.fastq.gz
Sample_name_R2.fastq.gz
-o Output folder
-d Path to databases
-l Optional: Minimum length of sequence in bp [default: 50]
-q Optional: Minimum phred score [default: 20]
-p Number of processes [default: 1]
-t Number of CPU threads per process [default: 1]
-h Print this help
With ngs4_tax you assign taxonomy to your data with Kraken2 and Kaiju. Both classifications are merged, with the Kraken2 annotation (higher precision) prioritized over the Kaiju annotation (higher sensitivity). In the end you will have a relative abundance table with taxonomic assignments.
Note:
This step is RAM intensive: per process you need at least 187 GB (-m) or 670 GB of RAM.
In addition, make sure you have both databases (nt & nr) located on an SSD!
ngs4_tax -i ~/ngs4_illumina -d ~/ngs4ecoprod/ngs4ecoprod/db -p 1 -t 10 -m
-i Folder containing quality filtered fastq.gz
-d Path to databases (kraken2 & kaiju)
-p Number of processes (default: 1)
Note: per process you need 187-670 GB of RAM
-t Number of CPU threads per process (default: 1)
-m Reduce RAM requirements to 187 GB (--memory-mapping for kraken2), slower
Note: If you use -m and your database is NOT located on an SSD, expect long processing times
-h Print this help
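Once finished, you can take a first look at the merged table from the shell; the file name below is illustrative, check your output folder for the actual name:
# Pretty-print the first lines of the relative abundance table
column -t -s $'\t' ~/ngs4_illumina/abundance_table.tsv | head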
#ngs4_assemble -i ~/ngs4_illumina
Dominik Schneider ([email protected])
Please cite all the software tools and databases incorporated into ngs4ecoprod that you used in your analysis (see the software list of the ngs4ecoprod environment).
Since this repository currently has no associated publication, please cite via the GitHub link: https://github.com/dschnei1/ngs4ecoprod
→ Install ngs4ecoprod & download SILVA database
wget https://raw.githubusercontent.com/dschnei1/ngs4ecoprod/main/install_ngs4ecoprod
bash install_ngs4ecoprod -i ~/ngs4ecoprod
source ~/.bashrc
activate_ngs4ecoprod
rm -f install_ngs4ecoprod
ngs4_download_silva_db -i ~/ngs4ecoprod/ngs4ecoprod/db
→ 16S rRNA gene amplicon pipeline on example (or your data)
ngs4_16S -i ~/ngs4ecoprod/ngs4ecoprod/example_data/16S -o ~/ngs4_16S -d ~/ngs4ecoprod/ngs4ecoprod/db/silva -p 3 -t 8