A “gOTU” (pronounced "go-to") is the minimal unit for community ecology studies based on shotgun metagenomes or other forms of whole-genome microbiome data. This is in contrast to conventional practice, in which taxonomic units such as genera or species are used. In this sense, gOTU is analogous to sOTU in 16S rRNA studies. The advantages of using gOTUs include: 1) the highest possible resolution; 2) independence from taxonomy, which is coarse and error-prone as a classification system; and 3) support for phylogeny-based analyses such as Faith’s PD and UniFrac. The last advantage is enhanced by the “Web of Life” (WoL) reference phylogeny.
To generate a gOTU table, one needs a multiplexed alignment file, or a directory of per-sample alignment files. These files can be generated by aligning sequencing data against a reference genome database. We recommend using SHOGUN with the "Web of Life" database (WoL, available for download at: https://biocore.github.io/wol/). For example:
shogun align -a bowtie2 -d WoLr1 -i input.fasta -o .
Then one can run Woltka to convert the alignment file(s) into a gOTU table:
woltka gotu -i alignment.bowtie2.sam -o table.biom
The output file table.biom is a BIOM table with rows as genome IDs (gOTUs), columns as sample IDs, and cell values as counts of gOTUs in samples.
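To make the input side of this conversion concrete, here is a minimal Python sketch that extracts read-to-reference hits from SAM-formatted lines. It is not Woltka's code. It relies only on the standard SAM layout (header lines start with `@`; the first three columns are read name, bitwise flag, and reference name; flag bit `0x4` marks an unmapped read). The function name `parse_sam_hits` and the read and genome IDs are hypothetical.

```python
def parse_sam_hits(lines):
    """Collect {read ID: set of reference IDs} from SAM-formatted lines."""
    hits = {}
    for line in lines:
        if line.startswith("@"):
            continue  # header line (@HD, @SQ, @PG, ...)
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a valid alignment record
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            continue  # read is unmapped
        hits.setdefault(qname, set()).add(rname)
    return hits


# Hypothetical records: one read with two hits, one unmapped read.
sam_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:G000001\tLN:1000",
    "S1_1\t0\tG000001\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "S1_1\t256\tG000002\t20\t1\t5M\t*\t0\t0\tACGTA\tIIIII",
    "S1_2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
hits = parse_sam_hits(sam_lines)
print(hits)
```

In this sketch, mates of a paired-end read share one read name and are merged into a single entry; a real converter may treat them differently.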
If necessary, you may convert a BIOM table into a tab-delimited file:
biom convert --to-tsv -i table.biom -o table.tsv
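The tab-delimited file can be inspected without any BIOM tooling. The sketch below uses only the Python standard library. It assumes the layout that `biom convert --to-tsv` typically writes: an optional comment line, then a header row starting with `#OTU ID`, then one row per gOTU. The function name `read_gotu_tsv`, the sample IDs, and the genome IDs are hypothetical.

```python
import io


def read_gotu_tsv(handle):
    """Parse a tab-delimited table into (sample IDs, {genome ID: {sample ID: count}})."""
    samples = None
    table = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumption: the header row starts with "#OTU ID"; any other
            # comment line (e.g. "# Constructed from biom file") is skipped.
            if line.startswith("#OTU ID"):
                samples = line.split("\t")[1:]
            continue
        if samples is None:
            raise ValueError("header row not found before data rows")
        fields = line.split("\t")
        table[fields[0]] = dict(zip(samples, (float(x) for x in fields[1:])))
    return samples, table


# Hypothetical content standing in for table.tsv.
example = io.StringIO(
    "# Constructed from biom file\n"
    "#OTU ID\tS1\tS2\n"
    "G000001\t12.0\t0.0\n"
    "G000002\t3.5\t7.0\n"
)
samples, table = read_gotu_tsv(example)
per_sample_total = {s: sum(row[s] for row in table.values()) for s in samples}
print(per_sample_total)
```

For a real file, replace the `io.StringIO` object with `open("table.tsv")`. Per-sample totals like these can help when choosing a sampling depth for downstream diversity analyses.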
Note: Both SHOGUN and WoL are available at the Qiita server. If you are a Qiita user, the alignment file can be automatically generated and downloaded from the Qiita interface. See details.
The generated BIOM table can be imported into a QIIME 2 artifact:
qiime tools import --type 'FeatureTable[Frequency]' --input-path table.biom --output-path table.qza
These intermediate steps are automated if you use the QIIME 2 plugin of Woltka.
One can then investigate the microbiome by applying classical QIIME analyses on the gOTU table. For example, with the WoL reference phylogeny (direct download link: tree.qza), one can do:
qiime diversity core-metrics-phylogenetic \
--i-phylogeny tree.qza \
--i-table table.qza \
--p-sampling-depth 1000 \
--m-metadata-file metadata.tsv \
--output-dir core-metrics-results
It is quite common that one query sequence can be aligned to multiple reference genomes. In such cases, Woltka by default counts each gOTU as 1 / k, where k is the total number of matching genomes.
Alternatively, one may choose to discard all non-unique matches, by adding a flag:
woltka gotu --uniq ...
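The two counting rules can be illustrated with a small Python sketch. It is not Woltka's implementation; it only applies the rules as described above: each read contributes 1 / k to each of its k matching genomes, or, in the unique-only mode, reads with more than one match are discarded. The function name `count_gotus` and the read and genome IDs are hypothetical.

```python
from collections import defaultdict


def count_gotus(read_hits, uniq=False):
    """Tally per-genome counts from {read ID: set of matching genome IDs}."""
    counts = defaultdict(float)
    for genomes in read_hits.values():
        k = len(genomes)
        if k == 0:
            continue  # read has no match
        if uniq and k > 1:
            continue  # unique-only mode: drop multi-mapped reads
        for genome in genomes:
            counts[genome] += 1 / k  # each genome receives an equal share
    return dict(counts)


# Hypothetical hits for three reads in one sample.
hits = {
    "r1": {"G000001"},
    "r2": {"G000001", "G000002"},
    "r3": {"G000002", "G000003", "G000004"},
}
fractional = count_gotus(hits)
unique_only = count_gotus(hits, uniq=True)
print(fractional)
print(unique_only)
```

With fractional counting, the counts in a sample sum to the number of matched reads (three here), so no read is lost. With unique-only counting, only `r1` is kept.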
Technically, one can use any sequence aligner and reference genome database to generate alignment files, which can then be converted into a gOTU table. We cannot validate the quality of the outcome, but we understand that you may want to do this for consistency with existing parts of your analytical pipeline. For example:
bwa mem refseq.fna input.R1.fq input.R2.fq > output.sam
blastn -db refseq_genomes -query input.fa -max_target_seqs 1 -outfmt 6 -out output.txt
However, most of these protocols generate mappings of reads to nucleotides (e.g., chromosomes or scaffolds), rather than to genomes. In order to produce gOTUs, one needs to supply Woltka with a nucleotide-to-genome mapping file (nucl2g.txt, example provided under taxonomy/nucl):
woltka gotu --map nucl2g.txt ...
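The sketch below shows, in Python, what such a translation does. It assumes the mapping file is a two-column, tab-delimited list of nucleotide ID and genome ID; check the provided example for the actual layout. The function names and all IDs are hypothetical, and this is not Woltka's code.

```python
import io


def read_nucl2g(handle):
    """Read an assumed two-column, tab-delimited nucleotide-to-genome map."""
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 2:
            continue  # skip comments and malformed lines
        mapping[fields[0]] = fields[1]
    return mapping


def nucl_hits_to_genomes(read_hits, mapping):
    """Translate {read ID: set of nucleotide IDs} into {read ID: set of genome IDs}."""
    out = {}
    for read, nucls in read_hits.items():
        genomes = {mapping[n] for n in nucls if n in mapping}
        if genomes:
            out[read] = genomes
    return out


# Hypothetical mapping: two sequences belong to the same genome.
nucl2g = read_nucl2g(io.StringIO(
    "NC_000001.1\tG000001\n"
    "NC_000002.1\tG000001\n"
    "NZ_000003.1\tG000002\n"
))
read_hits = {
    "r1": {"NC_000001.1", "NC_000002.1"},
    "r2": {"NC_000002.1", "NZ_000003.1"},
    "r3": {"NX_unknown"},
}
genome_hits = nucl_hits_to_genomes(read_hits, nucl2g)
print(genome_hits)
```

Read `r1` hits two sequences of the same genome, so after translation it counts as a unique match to one gOTU. Without the mapping it would look like a two-way ambiguous match.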