Woltka

Woltka (Web of Life Toolkit App) is a bioinformatics package for shotgun metagenome data analysis. It takes full advantage of, and is not limited by, the WoL reference phylogeny. It bridges first-pass sequence aligners with advanced analytical platforms (such as QIIME 2). Highlights of this program include:

  • gOTU: fine-grain community ecology.
  • Tree-based, rank-free classification.
  • Combined taxonomic & functional analysis.

Woltka ships with a QIIME 2 plugin. See here for instructions.

Overview

Where does Woltka fit in a pipeline

Woltka is a classifier. It serves as a middle layer between sequence alignment and community ecology analyses.

What does Woltka do

Woltka processes alignments -- the mappings of query sequences against reference sequences (such as microbial genomes or genes) -- and infers the best placement of the queries in a hierarchical classification system. One query can match multiple references simultaneously. Woltka finds the most suitable classification unit(s) to describe the query according to the criteria specified by the researcher. Woltka generates profiles (feature tables) -- the frequencies (counts) of classification units, which describe the composition of samples.
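As a rough illustration of how a profile arises from alignments, the following Python sketch counts reference hits for a single sample. It is not Woltka's implementation: the function name `build_profile`, the toy read and genome IDs, and the rule that a query matching k references adds 1/k to each of them are all assumptions made for illustration.

```python
from collections import Counter


def build_profile(matches):
    """Summarize one sample as counts per reference (subject).

    matches: dict mapping query ID -> set of subject IDs it aligned to.
    Assumed rule: a query with k matches contributes 1/k to each subject.
    """
    profile = Counter()
    for query, subjects in matches.items():
        if not subjects:          # query with no match is ignored
            continue
        weight = 1 / len(subjects)
        for subject in subjects:
            profile[subject] += weight
    return dict(profile)


# Toy alignment result (hypothetical IDs): read2 matches two genomes.
sample = {
    "read1": {"G001"},
    "read2": {"G001", "G002"},
    "read3": {"G002"},
    "read4": {"G003"},
}
profile = build_profile(sample)
```

Running this on several samples and placing the resulting dictionaries side by side gives the sample-by-feature layout of a feature table.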

What does Woltka not do

Woltka does NOT align sequences. You need to align your FastQ (or Fast5, etc.) files against a reference database (we recommend WoL) using an aligner of your choice (BLAST, Bowtie2, etc.). The resulting alignment files can then be fed into Woltka.

Woltka does NOT analyze profiles. We recommend using QIIME 2 for robust downstream analyses of the profiles to decode the relationships among microbial communities and with their environments.

Installation

Requirement: Python 3.6 or above, with Python package biom-format.

pip install git+https://github.com/qiyunzhu/woltka.git

After installation, launch the program by executing:

woltka

More details about installation are provided here.

Example usage

Woltka provides several small test datasets under woltka/tests/data. To access them, download this GitHub repo, unzip, and navigate to this directory.

One can execute the following commands to make sure that Woltka functions correctly, and to get an impression of the basic usage of Woltka.

(Note: a more complete list of commands is provided here. Alternatively, you can skip this test dataset and check out the instructions for working with WoL.)

1. gOTU table generation (details):

woltka gotu -i align/bowtie2 -o table.biom

The input path, align/bowtie2, is a directory containing five Bowtie2 alignment files (S01.sam.xz, S02.sam.xz, ..., S05.sam.xz; SAM format, xz-compressed), one per sample, each representing the mapping of that sample's shotgun metagenomic sequences against a reference genome database.
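To make the input format concrete, here is a minimal Python sketch that reads one xz-compressed SAM file and yields the query and reference name of each mapped read. It relies only on the standard SAM column layout (query name, flag, reference name in the first three columns) and Python's built-in `lzma` module. The function name `read_sam_xz` is hypothetical, and this is not how Woltka itself parses alignments.

```python
import lzma


def read_sam_xz(path):
    """Yield (query, reference) pairs from an xz-compressed SAM file."""
    with lzma.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("@"):        # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            query, flag, ref = fields[0], int(fields[1]), fields[2]
            if flag & 4 or ref == "*":      # flag bit 0x4 marks an unmapped read
                continue
            yield query, ref


# Example call (path taken from the test dataset described above):
# for query, ref in read_sam_xz("align/bowtie2/S01.sam.xz"):
#     print(query, ref)
```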

The output file, table.biom, is a feature table in BIOM format, which can then be analyzed using various bioinformatics programs such as QIIME 2.

2. Taxonomic profiling at the ranks of phylum, genus and species (details):

woltka classify \
  -i align/bowtie2 \
  --map taxonomy/g2tid.txt \
  --nodes taxonomy/nodes.dmp \
  --names taxonomy/names.dmp \
  --rank phylum,genus,species \
  -o output_dir

The mapping file (g2tid.txt) translates genome IDs to taxonomic IDs, which then allows Woltka to classify query sequences based on the NCBI taxonomy (nodes.dmp and names.dmp).

The output directory (output_dir) will contain three feature tables: phylum.biom, genus.biom and species.biom, each representing a taxonomic profile at one of the three ranks.
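The idea of translating a genome ID to a taxonomic ID and then moving up the taxonomy to a requested rank can be sketched in Python as follows. This is an illustrative sketch, not Woltka's code: the function names, the genome ID, and the four-node toy taxonomy are invented, and only the pipe-delimited column order (taxonomic ID, parent ID, rank) follows the nodes.dmp convention.

```python
def parse_nodes(lines):
    """Parse nodes.dmp-style lines into {taxid: (parent_taxid, rank)}."""
    tree = {}
    for line in lines:
        taxid, parent, rank = [x.strip() for x in line.split("|")][:3]
        tree[taxid] = (parent, rank)
    return tree


def lift_to_rank(taxid, tree, rank):
    """Walk up from taxid until a node of the requested rank is found."""
    while taxid in tree:
        parent, this_rank = tree[taxid]
        if this_rank == rank:
            return taxid
        if parent == taxid:       # the root is its own parent
            return None
        taxid = parent
    return None


# Toy taxonomy with made-up IDs: root > phylum > genus > species.
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tphylum\t|\n",
    "20\t|\t10\t|\tgenus\t|\n",
    "30\t|\t20\t|\tspecies\t|\n",
]
tree = parse_nodes(toy_nodes)
g2tid = {"G000001": "30"}         # hypothetical genome-to-taxid mapping

# Classify one genome hit at each of the three requested ranks.
result = {rank: lift_to_rank(g2tid["G000001"], tree, rank)
          for rank in ("phylum", "genus", "species")}
```

A query whose taxonomic ID sits above the requested rank has no ancestor at that rank, so `lift_to_rank` returns `None` for it in this sketch.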

3. Functional profiling by UniRef entries then by GO terms (biological process):

woltka classify \
  -i align/bowtie2 \
  --coords function/coords.txt.xz \
  --map function/uniref.map.xz \
  --map function/go/process.tsv.xz \
  --map-as-rank \
  --rank uniref,process \
  -o output_dir

Here, the input files are still read-to-genome alignments, instead of read-to-gene ones, but Woltka matches reads to genes based on their coordinates on the genomes (as indicated by the file coords.txt). This ensures consistency between taxonomic and functional classifications.

Subsequently, Woltka is able to assign query sequences to functional units, as defined in mapping files (uniref.map and go/process.tsv). As you can see, compressed files are supported and auto-detected.

Similarly, the output files are two functional profiles: uniref.biom and process.biom.
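The coordinate-based matching of reads to genes described above can be pictured with a simple interval-overlap check. The Python sketch below is only a naive illustration under stated assumptions: the function name, the gene coordinates, and the `min_overlap` threshold of 0.8 are hypothetical, and Woltka's actual algorithm and file format may differ.

```python
def match_read_to_genes(read_start, read_end, genes, min_overlap=0.8):
    """Return IDs of genes covering at least min_overlap of the read.

    genes: list of (gene_id, start, end) on the same genome,
    using 1-based inclusive coordinates.
    """
    read_len = read_end - read_start + 1
    hits = []
    for gene_id, start, end in genes:
        # Length of the shared interval; zero or negative means no overlap.
        overlap = min(read_end, end) - max(read_start, start) + 1
        if overlap > 0 and overlap / read_len >= min_overlap:
            hits.append(gene_id)
    return hits


# Two hypothetical, slightly overlapping genes on one genome.
genes = [("geneA", 100, 400), ("geneB", 380, 900)]

hits_inside = match_read_to_genes(150, 249, genes)   # read lies within geneA
hits_border = match_read_to_genes(330, 429, genes)   # read straddles both genes
```

Here the first read is assigned to geneA, while the second overlaps neither gene by the assumed threshold and is left unassigned. Once a read is tied to a gene, mapping files can translate that gene to higher-level functional units.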

One can also combine taxonomic and functional profiling in a stratification analysis. See details.

Notes

Citation

Woltka is currently under development. Please directly cite this GitHub repository: https://github.com/qiyunzhu/woltka

Grants

The development of Woltka is supported by: (to be added).

Contact

Please forward any questions to the project leader: Dr. Qiyun Zhu ([email protected]) or the senior PI: Dr. Rob Knight ([email protected]).