This workflow replicates the QA protocol implemented at JGI for Illumina reads. It uses the program “rqcfilter2” from BBTools (38:96), which implements the protocol as a pipeline.
RQCFilterData Database: a 106G tar file that includes reference datasets of artifacts, adapters, contaminants, the phiX genome, and host genomes.
Prepare the Database
mkdir -p refdata
wget https://portal.nersc.gov/cfs/m3408/db/RQCFilterData.tgz
tar xvzf RQCFilterData.tgz -C refdata
rm RQCFilterData.tgz
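The commands above delete the archive even if extraction fails part-way, which would mean downloading the 106G file again. Below is a minimal sketch of a more defensive variant, assuming a POSIX shell and a tar that supports the `z` option. The function name `prepare_refdata` is hypothetical and not part of the workflow.

```shell
#!/bin/sh
# Hypothetical helper (not part of the workflow): extract the reference
# archive into a target directory and delete the archive only on success.
prepare_refdata() {
    archive=$1   # path to the already downloaded RQCFilterData.tgz
    dest=$2      # target directory, e.g. refdata

    [ -f "$archive" ] || { echo "missing archive: $archive" >&2; return 1; }
    mkdir -p "$dest" || return 1

    # Remove the archive only if tar exits successfully.
    if tar xzf "$archive" -C "$dest"; then
        rm "$archive"
    else
        echo "extraction failed; keeping $archive" >&2
        return 1
    fi
}
```

It could be used as `wget https://portal.nersc.gov/cfs/m3408/db/RQCFilterData.tgz && prepare_refdata RQCFilterData.tgz refdata`.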
Description of the files:
- .wdl file: the WDL file for workflow definition
- .json file: the example input for the workflow
- .conf file: the conf file for running Cromwell.
- .sh file: the shell script for running the example workflow
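To show how these four files fit together, here is a rough sketch of a Cromwell invocation. It is not the contents of the provided .sh script. The function name `cromwell_cmd` and all file names are placeholders. The sketch assumes Cromwell's standard `run` subcommand, its `-i` inputs option, and the `-Dconfig.file` Java property for the conf file. It only prints the command, so it can be inspected before anything is executed.

```shell
#!/bin/sh
# Hypothetical sketch: print the Cromwell command for one run.
# All four file names are placeholders supplied by the caller;
# paths containing spaces are not quoted in the printed command.
cromwell_cmd() {
    jar=$1      # Cromwell jar
    conf=$2     # .conf file for running Cromwell
    wdl=$3      # .wdl workflow definition
    inputs=$4   # .json workflow input
    printf 'java -Dconfig.file=%s -jar %s run %s -i %s\n' \
        "$conf" "$jar" "$wdl" "$inputs"
}
```

For example, `cromwell_cmd cromwell.jar cromwell.conf workflow.wdl input.json` prints a single `java ... run workflow.wdl -i input.json` line.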
Inputs:
- fastq (Illumina paired-end interleaved fastq)
- project name
{
"metaTReadsQC.input_files": ["/global/cfs/cdirs/m3408/ficus/example/12889.1.295318.GTGCTTAC-GTAAGCAC.fastq.gz"],
"metaTReadsQC.proj":"nmdc:xxxxxxx"
}
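When running the workflow on other data, an input file of the same shape can be generated rather than edited by hand. This is a minimal sketch. The helper name `write_inputs_json` is hypothetical, and it assumes neither value contains a double quote or backslash, since no JSON escaping is done.

```shell
#!/bin/sh
# Hypothetical helper: write a workflow input JSON for one interleaved fastq.
# Assumes the values need no JSON escaping (no double quotes or backslashes).
write_inputs_json() {
    fastq=$1   # path to the interleaved fastq(.gz)
    proj=$2    # project name, e.g. nmdc:xxxxxxx
    out=$3     # JSON file to write
    cat > "$out" <<EOF
{
  "metaTReadsQC.input_files": ["$fastq"],
  "metaTReadsQC.proj": "$proj"
}
EOF
}
```

For example, `write_inputs_json /path/to/reads.fastq.gz nmdc:xxxxxxx input.json` writes a file with the two keys shown in the example above.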
The output is one directory, named after the prefix of the fastq input file, that contains the output files: statistics, a status log, a shell script to reproduce the steps, and others.
The main QC fastq output is named prefix.fastq.gz.
|-- nmdc_xxxxxxx_filtered.fastq.gz
|-- nmdc_xxxxxxx_filterStats.txt
|-- nmdc_xxxxxxx_filterStats2.txt
|-- nmdc_xxxxxxx_qa_stats.json
|-- filtered/adaptersDetected.fa
|-- filtered/reproduce.sh
|-- filtered/spikein.fq.gz
|-- filtered/status.log
|-- ...
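After a run, it can be useful to confirm that the top-level files from the listing above are present. The following is a sketch only. The helper name `check_outputs` is hypothetical, and it checks just the four prefixed files shown, not the contents of `filtered/`.

```shell
#!/bin/sh
# Hypothetical helper: report which of the four prefixed output files
# are missing or empty in an output directory.
# Returns non-zero if any of them is absent.
check_outputs() {
    dir=$1      # output directory
    prefix=$2   # file prefix, e.g. nmdc_xxxxxxx
    missing=0
    for suffix in filtered.fastq.gz filterStats.txt filterStats2.txt qa_stats.json; do
        f="$dir/${prefix}_${suffix}"
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In the example above, the prefix appears to be the project name with the colon replaced by an underscore, so a call might look like `check_outputs /path/to/output nmdc_xxxxxxx`.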