WDL wrapper around RefNAAP and RABV-GLUE for execution on Terra.bio. Additionally, a custom Coding DNA Sequence (CDS) coverage calculation module is available for quality control of the processed data.
Figure 1: RefNAAP-wdl.
RefNAAP is a reference-based Oxford Nanopore Technologies (ONT) assembly analysis pipeline for RABV genomes. In summary, it performs the following steps:
- It runs quality control on the read files using FastQC and MultiQC to generate a quality report.
- It trims the left and right ends of the reads by 25 base pairs, and filters out reads shorter than 50 bp. These values can be customized.
- It generates the assembly from the reads using reference-based assembly with minimap2, gap fixing, and medaka.
It uses a reference file composed of 14 different RABV sequences for the reference-based assembly.
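The trimming and length-filtering step described above can be sketched in a few lines of Python. This is only an illustration of the logic, not RefNAAP's own code: the function name `trim_and_filter` is hypothetical, and it assumes the length filter is applied after trimming, which the description does not state.

```python
def trim_and_filter(reads, trim_left=25, trim_right=25, min_length=50):
    """Trim both ends of each read, then drop reads shorter than min_length.

    `reads` is a list of (name, sequence) tuples. The defaults mirror the
    workflow defaults (trim_left=25, trim_right=25, size=50).
    Assumption: the length filter runs on the trimmed read.
    """
    kept = []
    for name, seq in reads:
        # Clamp the right boundary at 0 so very short reads become empty.
        right = max(len(seq) - trim_right, 0)
        trimmed = seq[trim_left:right]
        if len(trimmed) >= min_length:
            kept.append((name, trimmed))
    return kept


reads = [("read_a", "A" * 120), ("read_b", "C" * 90)]
filtered = trim_and_filter(reads)
print([(name, len(seq)) for name, seq in filtered])
```

Here `read_a` keeps 70 bases after trimming and is retained, while `read_b` is left with 40 bases and is filtered out.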
RABV-GLUE is a sequence-oriented resource for comparative genomic analysis of rabies virus (RABV), developed using the GLUE software framework. It organises RABV genome sequence data along evolutionary lines, aiming to leverage new and existing RABV sequences in order to improve our understanding of the epidemiology and pathology of RABV. It provides the following information:
- RABV Major Clade
- RABV Minor Clade
- Closest full genome accession in the RABV-GLUE database
To emulate the behaviour of RABV-GLUE Online, a custom module was created that receives the closest full genome accession and downloads it from NCBI. The downloaded sequence is then used to create a BLAST database for the calculation of coverage per CDS. A sequence is identified as RABV if at least one CDS has over 75 percent coverage (the default value, which can be adjusted). The module produces the following information:
- Identification of RABV sequence
- Per CDS coverage (N, P, M, G and L)
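The identification rule above can be expressed as a short Python sketch. It is not the module's actual implementation: the function name `is_rabv` is hypothetical, and treating a coverage exactly equal to the threshold as passing is an assumption, since the text says "over 75 percent" while the input is named `min_gene_coverage`.

```python
CDS_NAMES = ("N", "P", "M", "G", "L")


def is_rabv(cds_coverage, min_gene_coverage=75.0):
    """Return True if at least one CDS reaches the coverage threshold.

    `cds_coverage` maps a CDS name to its percent coverage; a CDS missing
    from the mapping is treated as 0 percent.
    Assumption: the comparison is inclusive (>=).
    """
    return any(
        cds_coverage.get(cds, 0.0) >= min_gene_coverage for cds in CDS_NAMES
    )


print(is_rabv({"N": 98.2, "P": 40.0, "M": 0.0, "G": 12.5, "L": 60.1}))
print(is_rabv({"N": 10.0, "P": 40.0}))
```

A single well-covered CDS is enough for a positive call, so partial assemblies can still be identified as RABV.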
RefNAAP-wdl is available on Terra.bio, a cloud-native platform for researchers to access data, run analysis tools, and collaborate. With Terra.bio, you can process your data without prior knowledge of the command line.
The following steps assume you have already set up an account on Terra.bio and created a workspace to work with RefNAAP-wdl.
To begin using RefNAAP-wdl on Terra.bio, you will need to import the workflow from Dockstore, which is available at: RefNAAP-wdl Dockstore Import.
Figure 2: RefNAAP-wdl on Dockstore.
Once you are on the Dockstore page for RefNAAP-wdl, locate the Launch with section on the right side of the page and click on Terra.
Figure 3: Launching a workflow with Terra.bio on Dockstore.
After clicking the Terra button, you will be redirected to Terra.bio. There, select the Destination Workspace into which you would like to import the workflow, then click the Import button.
Figure 4: Importing workflow interface on Terra.bio.
RefNAAP-wdl should now be available in Terra.bio on the WORKFLOWS tab. Clicking on RefNAAP-wdl loads the workflow interface. In the workflow configuration section, select the Run workflow(s) with inputs defined by data table option. RefNAAP-wdl is a sample-level workflow.
Figure 5: RefNAAP-wdl on Terra.bio.
Several inputs are available for workflow customization: required inputs that are necessary for execution, and optional inputs that have default values but can be overwritten by the user.
Note: To provide inputs from the data table, Terra uses the `this.{column_name}` notation. For example, to pass the ONT reads that are in the `ont_reads` column on the data table to the `read1` input, the value should be passed as `this.ont_reads`.
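As a rough illustration of the notation, the Python sketch below builds the values that would be typed into the workflow input fields. The helper `table_ref` and the `sample_id` column name are hypothetical, and the `task_name.variable` key format is an assumption based on the Terra Task Name and Variable columns of the input table.

```python
import json


def table_ref(column_name):
    """Return the Terra data-table reference for a column, e.g. this.ont_reads."""
    return f"this.{column_name}"


workflow_inputs = {
    # Column name taken from the example in the note above.
    "refnaap_wf.read1": table_ref("ont_reads"),
    # Hypothetical column name, used here only for illustration.
    "refnaap_wf.samplename": table_ref("sample_id"),
}

print(json.dumps(workflow_inputs, indent=2))
```

Each value points at a column of the selected data table row, so one configuration can be reused for every sample in the table.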
Table 1: Input description for RefNAAP-wdl
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
refnaap_wf | read1 | File | Base-called ONT read file in FASTQ file format (compressed). | | Required |
refnaap_wf | samplename | String | Name of sample to be analyzed. | | Required |
ncbi_datasets_blast | blast_evalue | String | BLAST e-value threshold. | "1e-10" | Optional |
ncbi_datasets_blast | cpu | Int | Number of CPUs to allocate to the task. | 4 | Optional |
ncbi_datasets_blast | disk_size | Int | Amount of storage (in GB) to allocate to the task. | 50 | Optional |
ncbi_datasets_blast | docker | String | The Docker container to use for the task. | "us-docker.pkg.dev/general-theiagen/theiagen/ncbi-datasets-blast:16.38.1_20250321" | Optional |
ncbi_datasets_blast | memory | Int | Amount of memory/RAM (in GB) to allocate to the task. | 8 | Optional |
ncbi_datasets_blast | min_gene_coverage | Float | Minimum percent coverage for BLAST to determine the presence of a CDS. | 75.0 | Optional |
ncbi_datasets_blast | min_percent_identity | Float | Minimum percent identity for BLAST to determine the presence of a CDS. | 75.0 | Optional |
rabv_genotype | cpu | Int | Number of CPUs to allocate to the task. | 4 | Optional |
rabv_genotype | disk_size | Int | Amount of storage (in GB) to allocate to the task. | 50 | Optional |
rabv_genotype | docker | String | The Docker container to use for the task. | "us-docker.pkg.dev/general-theiagen/theiagen/rabvglue:1.1.113_20250320" | Optional |
rabv_genotype | memory | Int | Amount of memory/RAM (in GB) to allocate to the task. | 8 | Optional |
refnaap | cpu | Int | Number of CPUs to allocate to the task. | 8 | Optional |
refnaap | disk_size | Int | Amount of storage (in GB) to allocate to the task. | 100 | Optional |
refnaap | docker | String | The Docker container to use for the task. | "us-docker.pkg.dev/general-theiagen/internal/refnaap:b3ad097" | Optional |
refnaap | memory | Int | Amount of memory/RAM (in GB) to allocate to the task. | 16 | Optional |
refnaap | min_coverage | Int | Amplicon regions need a minimum of this average coverage number. | 5 | Optional |
refnaap | model | String | Basecall model. | "r10_min_high_g303" | Optional |
refnaap | size | Int | Filter reads less than this length. | 50 | Optional |
refnaap | trim_left | Int | Bases to trim from left side of read. | 25 | Optional |
refnaap | trim_right | Int | Bases to trim from right side of read. | 25 | Optional |
Note: Available basecall models:
r103_min_high_g345, r103_min_high_g360, r103_prom_high_g360, r103_prom_snp_g3210, r103_prom_variant_g3210, r10_min_high_g303, r10_min_high_g340, r941_min_fast_g303, r941_min_high_g303, r941_min_high_g330, r941_min_high_g340_rle, r941_min_high_g344, r941_min_high_g351, r941_min_high_g360, r941_prom_fast_g303, r941_prom_high_g303, r941_prom_high_g330, r941_prom_high_g344, r941_prom_high_g360, r941_prom_high_g4011, r941_prom_snp_g303, r941_prom_snp_g322, r941_prom_snp_g360, r941_prom_variant_g303, r941_prom_variant_g322, r941_prom_variant_g360
Note: When BLASTing to calculate the percent coverage of a given CDS, only the largest fragment that aligns is considered.
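The effect of considering only the largest aligned fragment can be shown with a small Python sketch. This is an illustration under stated assumptions rather than the module's code: the function name is hypothetical, fragments are given as 1-based inclusive (start, end) positions on the CDS, and coverage is taken as the fragment length divided by the CDS length.

```python
def percent_coverage_largest_fragment(fragments, cds_length):
    """Percent of the CDS covered by the single longest aligned fragment.

    `fragments` is a list of (start, end) positions on the CDS,
    1-based and inclusive. Shorter fragments are ignored, not summed.
    """
    if cds_length <= 0:
        raise ValueError("cds_length must be positive")
    if not fragments:
        return 0.0
    longest = max(end - start + 1 for start, end in fragments)
    return 100.0 * longest / cds_length


# Two fragments of 600 and 301 bases on a 1200-base CDS.
fragments = [(1, 600), (700, 1000)]
coverage = percent_coverage_largest_fragment(fragments, 1200)
print(coverage)
```

In this example the longest fragment alone gives 50 percent coverage, which falls below the default `min_gene_coverage` of 75.0, even though the two fragments together span just over 75 percent of the CDS. Fragmented assemblies can therefore report lower coverage than their total aligned length suggests.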
RefNAAP-wdl produces several outputs that are populated back to the data table.
Table 2: Output description for RefNAAP-wdl
Variable | Type | Description |
---|---|---|
blast_results | File | File containing the BLAST results. |
datasets_ncbi_docker | String | Docker image used for the NCBI datasets module. |
datasets_ncbi_reference_fasta | File | File, in FASTA format, with the closest full reference sequence identified by RABV-GLUE and used to create the BLAST database. |
datasets_ncbi_report | File | Report file from NCBI Datasets for the reference sequence download. |
datasets_ncbi_version | File | Version of NCBI Datasets used. |
G_percent_coverage | Float | Percent coverage of the G CDS in the RABV genome. |
L_percent_coverage | Float | Percent coverage of the L CDS in the RABV genome. |
M_percent_coverage | Float | Percent coverage of the M CDS in the RABV genome. |
N_percent_coverage | Float | Percent coverage of the N CDS in the RABV genome. |
P_percent_coverage | Float | Percent coverage of the P CDS in the RABV genome. |
rabv_identified | String | Indication that the sequence analysed has been identified as RABV. |
rabvglue_closest_reference | String | Accession of the closest reference identified by RABV-GLUE. |
rabvglue_major_clade | String | Major clade identified by RABV-GLUE. |
rabvglue_minor_clade | String | Minor clade identified by RABV-GLUE. |
refnaap_analysis_date | String | Date of analysis with RefNAAP. |
refnaap_assembly_fasta | File | Consensus assembly generated by RefNAAP in FASTA format. |
refnaap_docker | String | Docker image used for the RefNAAP module. |
refnaap_multiqc_report | File | MultiQC report generated by RefNAAP in HTML format. |
The RABV-GLUE and custom Coding DNA Sequence (CDS) coverage calculation modules of the RefNAAP-wdl workflow are also available as their own standalone workflow, RABVGlue-wdl, for execution on Terra.bio. This allows for the analysis of RABV genomes that have been assembled through alternative methods, such as de novo assembly.
The following steps assume you have already set up an account on Terra.bio and created a workspace to work with RABVGlue-wdl.
To begin using RABVGlue-wdl on Terra.bio, you will need to import the workflow from Dockstore, which is available at: RABVGlue-wdl Dockstore Import.
Figure 6: RABVGlue-wdl on Dockstore.
Once you are on the Dockstore page for RABVGlue-wdl, locate the Launch with section on the right side of the page and click on Terra.
Figure 7: Launching a workflow with Terra.bio on Dockstore.
After clicking the Terra button, you will be redirected to Terra.bio. There, select the Destination Workspace into which you would like to import the workflow, then click the Import button.
Figure 8: Importing workflow interface on Terra.bio.
RABVGlue-wdl should now be available in Terra.bio on the WORKFLOWS tab. Clicking on RABVGlue-wdl loads the workflow interface. In the workflow configuration section, select the Run workflow(s) with inputs defined by data table option. RABVGlue-wdl is a sample-level workflow.
Figure 9: RABVGlue-wdl on Terra.bio.
Several inputs are available for workflow customization: required inputs that are necessary for execution, and optional inputs that have default values but can be overwritten by the user.
Note: To provide inputs from the data table, Terra uses the `this.{column_name}` notation. For example, to pass assemblies stored in a data table column named `assembly_fasta` to the `assembly_fasta` input, the value should be passed as `this.assembly_fasta`.
Table 3: Input description for RABVGlue-wdl
Terra Task Name | Variable | Type | Description | Default Value | Terra Status |
---|---|---|---|---|---|
rabvglue_wf | assembly_fasta | File | FASTA file with the RABV sequence to be analyzed. | | Required |
ncbi_datasets_blast | blast_evalue | String | BLAST e-value threshold. | "1e-10" | Optional |
ncbi_datasets_blast | cpu | Int | Number of CPUs to allocate to the task. | 4 | Optional |
ncbi_datasets_blast | disk_size | Int | Amount of storage (in GB) to allocate to the task. | 50 | Optional |
ncbi_datasets_blast | docker | String | The Docker container to use for the task. | "us-docker.pkg.dev/general-theiagen/theiagen/ncbi-datasets-blast:16.38.1_20250321" | Optional |
ncbi_datasets_blast | memory | Int | Amount of memory/RAM (in GB) to allocate to the task. | 8 | Optional |
ncbi_datasets_blast | min_gene_coverage | Float | Minimum percent coverage for BLAST to determine the presence of a CDS. | 75.0 | Optional |
ncbi_datasets_blast | min_percent_identity | Float | Minimum percent identity for BLAST to determine the presence of a CDS. | 75.0 | Optional |
rabv_genotype | cpu | Int | Number of CPUs to allocate to the task. | 4 | Optional |
rabv_genotype | disk_size | Int | Amount of storage (in GB) to allocate to the task. | 50 | Optional |
rabv_genotype | docker | String | The Docker container to use for the task. | "us-docker.pkg.dev/general-theiagen/theiagen/rabvglue:1.1.113_20250320" | Optional |
rabv_genotype | memory | Int | Amount of memory/RAM (in GB) to allocate to the task. | 8 | Optional |
Note: When BLASTing to calculate the percent coverage of a given CDS, only the largest fragment that aligns is considered.
RABVGlue-wdl produces several outputs that are populated back to the data table.
Table 4: Output description for RABVGlue-wdl
Variable | Type | Description |
---|---|---|
blast_results | File | File containing the BLAST results. |
datasets_ncbi_docker | String | Docker image used for the NCBI datasets module. |
datasets_ncbi_reference_fasta | File | File, in FASTA format, with the closest full reference sequence identified by RABV-GLUE and used to create the BLAST database. |
datasets_ncbi_report | File | Report file from NCBI Datasets for the reference sequence download. |
datasets_ncbi_version | File | Version of NCBI Datasets used. |
G_percent_coverage | Float | Percent coverage of the G CDS in the RABV genome. |
L_percent_coverage | Float | Percent coverage of the L CDS in the RABV genome. |
M_percent_coverage | Float | Percent coverage of the M CDS in the RABV genome. |
N_percent_coverage | Float | Percent coverage of the N CDS in the RABV genome. |
P_percent_coverage | Float | Percent coverage of the P CDS in the RABV genome. |
rabv_identified | String | Indication that the sequence analysed has been identified as RABV. |
rabvglue_closest_reference | String | Accession of the closest reference identified by RABV-GLUE. |
rabvglue_major_clade | String | Major clade identified by RABV-GLUE. |
rabvglue_minor_clade | String | Minor clade identified by RABV-GLUE. |
If you have any questions or concerns, please raise a GitHub issue or email Theiagen's general support at [email protected].