Home
This is pre-publication software that is currently under active development. Use it at your own risk. Bug reports are welcome, but a user cannot depend on getting support at this time.
Pipeline for analyzing genomic read sets for public and animal health purposes.
Author: Karin Lagesen, @karinlag
Contact information: please submit an issue, and the author will get back to you.
This software uses the Nextflow.io workflow system to run various analyses appropriate for genomic epidemiology and comparative microbiology purposes. The Nextflow system allows for running the same pipeline on a local computer and on a cluster without changing the code.
For installation, see the installation pages. Please note: at the time of writing (August 2018), this software has only been tested on Ubuntu and on the University of Oslo Abel cluster (i.e. under Slurm).
For details on how to run, see the Run pages. The pipeline consists of a run script that launches one of several analysis scripts. Each script chains several tools into one analysis. For each script, a nextflow script, a template config file and a template profile file are provided. For each compute system, the profile file needs to be adjusted so that the required software is available. The easiest way to do that is to create a conda environment with the required software, then enter the location of that environment in the appropriate conda config files. Once this is done, the profile file should not need further modification. For each run, the template config file should be modified to specify the settings for that run, such as input data, species, databases needed and software options.
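The split between a per-system profile and a per-run config can be illustrated with a minimal Python sketch of a launcher. This is not the pipeline's actual run script: the function name, the config file name `my_run.config` and the profile name `local` are hypothetical, and it assumes the profile is available to Nextflow under that name. Only the `.nf` script names and the standard `nextflow run` options `-c`, `-profile` and `-resume` are taken as given.

```python
import shlex

# Script names are the ones this page lists for the pipeline; everything
# else (config file name, profile name) is a hypothetical placeholder.
TRACK_SCRIPTS = {
    "qc_track": "qc_track.nf",
    "specific_gene": "specific_gene.nf",
    "asm_annot": "asm_annot.nf",
}


def build_command(track, config_file, profile_name, resume=False):
    """Return the Nextflow command for one track as a list of arguments."""
    if track not in TRACK_SCRIPTS:
        raise ValueError(f"unknown track: {track!r}")
    cmd = [
        "nextflow", "run", TRACK_SCRIPTS[track],
        "-c", config_file,         # per-run settings: inputs, species, databases
        "-profile", profile_name,  # per-system settings: conda, slurm, ...
    ]
    if resume:
        cmd.append("-resume")      # reuse cached results from an earlier run
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch runs anywhere.
    print(shlex.join(build_command("qc_track", "my_run.config", "local")))
```

Keeping the profile and the config as separate arguments mirrors the text above: the profile is set once per compute system, while the config changes with every run.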
The pipeline has been developed as a series of scripts, where each script has a specific input and a set of logically connected analyses. Each script comes with its own nextflow script and a separate config file, which is used to specify inputs and software options for that specific run.
The current pipeline contains the following scripts:
- qc_track.nf: Basic QC
  - Fastqc is run on all input files, followed by multiqc, which aggregates the results.
- specific_gene.nf: MLST, virulence and AMR annotation
  - The software ARIBA is used to annotate MLST, virulence and AMR directly from reads. This script can be used to run all three at once, or just one or two of them.
- asm_annot.nf: Assembly and annotation
  - This script first runs fastqc and multiqc, then strips PhiX using bbduk, trims with trimmomatic, assembles with SPAdes, polishes the assembly with pilon, evaluates the assembly with QUAST, and finally annotates with prokka.
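The order of steps in asm_annot.nf can be written down as a small Python data structure. This is an illustrative sketch only: it treats the steps as strictly linear, as the description does, whereas the actual Nextflow script may run some of them in parallel. The stage labels and the helper `upstream_of` are hypothetical names.

```python
# Stage order of asm_annot.nf as described on this page: (purpose, tool).
ASM_ANNOT_STAGES = [
    ("read QC", "fastqc"),
    ("QC aggregation", "multiqc"),
    ("PhiX removal", "bbduk"),
    ("read trimming", "trimmomatic"),
    ("assembly", "SPAdes"),
    ("assembly polishing", "pilon"),
    ("assembly evaluation", "QUAST"),
    ("annotation", "prokka"),
]


def upstream_of(tool):
    """Return the tools listed before `tool` in the linear order above."""
    names = [name for _, name in ASM_ANNOT_STAGES]
    return names[:names.index(tool)]  # raises ValueError for unknown tools


if __name__ == "__main__":
    for purpose, tool in ASM_ANNOT_STAGES:
        print(f"{purpose}: {tool}")
```

A lookup such as `upstream_of("SPAdes")` then lists the tools whose software must be present in the conda environment before assembly can start.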
The following features are planned for future releases:
- Species identification
- SNP tree analyses, probably both with parsnp and kSNP
- Pan-genome analysis, probably using ROARY
This software is already available on the UiO Abel cluster. Please see the University of Oslo Abel pages for how to run the software on the cluster.