Skip to content

Extremely accurate algorithm used to group DNA sequences from microbial communities into operational taxonomic units (proxy for species) for ecological or biomedical research. This algorithm uses the information contained in the distribution of DNA sequences across samples along with sequence distances to cluster sequences accurately.

Notifications You must be signed in to change notification settings

spacocha/dbOTUcaller

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 

Repository files navigation

NOTE: There is a faster implementation of this coded by Scott Olesen (https://github.com/swo/dbotu3). Please refer to this latest version instead.

OUTDATED INFORMATION ABOUT THIS VERSION OF DISTRIBUTION-BASED CLUSTERING

Distribution-based clustering version 2.0

Introduction

This is a compact version of distribution-based clustering re-written in python and about 10-fold faster than the previous version with the same accuracy.

Please cite the original distribution-based clustering paper when using this program: Appl Environ Microbiol. 2013 Nov;79(21):6593-603. doi: 10.1128/AEM.00342-13. Epub 2013 Aug 23. Distribution-based clustering: using ecology to refine the operational taxonomic unit. Preheim SP1, Perrotta AR, Martin-Platero AM, Gupta A, Alm EJ.

Outline: 1.) Requirements 2.) Quick-start guide to running dbOTUcaller 3.) Running dbOTUcaller from SmileTrain 4.) Additional tools 5.) Contact me

1.) Requirements:

These are the following verified programs and versions needed to run dbOTUcaller- R (version 2.15.3) python (version 2.7.3) rpy2 (version 2.3.6) numpy (version 1.5.1) biopython

These are the required imports into python for the program. Test to make sure these are functioning in python: import Bio from Bio import AlignIO from Bio.Align import MultipleSeqAlignment import numpy as np import sys import argparse import csv from datetime import datetime import rpy2 import rpy2.robjects as robjects from rpy2.robjects.packages import importr base = importr('base') chisqtest=rpy2.robjects.r("chisq.test") import gc import math

Errors in the import of any of these will cause the program to fail.

2.) Quick-start quide to running dbOTUcaller

Once all of the required programs and packages are installed and accessible by python, distribution-based clustering can be called with the following inputs A.) OTU table input file This is a file with counts of 100% clusters (i.e. dereplicated sequences) in all of the libraries. The first row is headers with 'OTU' as the first column header, the first column is OTU ids. B.) alignment file This is a file of all the dereplicated sequences aligned to a reference. Typically this is aligned to a dataset representing only the amplified portion of the 16S rRNA gene that is sequenced. C.) output prefix A unique name that will be used as the prefix for the output files. Four output files will be generated, .log, .list, .fasta and .mat

There are many flags that can be used to alter the parameters of the clustering. Please type the following for a list of commands: python dbOTUcaller.py --help

This generates the following output: usage: dbOTUcaller.py [-h] [-d DIST_CUTOFF] [-k K_FOLD] [-p PVALUE] [-u] [-v] [-s SPLIT] [-j USEJS] OTUtablefile alignmentfile output

Create OTUs using ecological and genetic information (DBC version 2.0)

positional arguments: OTUtablefile OTU table input alignmentfile alignment file input output unique prefix for output log, list, OTU table and fasta files

optional arguments: -h, --help show this help message and exit -d DIST_CUTOFF, --dist_cutoff DIST_CUTOFF maximum genetic variation allowed to be within the same population (i.e. OTU) -k K_FOLD, --k_fold K_FOLD abundance criteria: existing OTU rep must have at least k-fold increase over the candidate sequence to be joined (default 10 for seq error only) -p PVALUE, --pvalue PVALUE pvalue cut-off: this could vary depending on the total number of libraries -u, --unaligned use the unaligned sequence to correct alignment issues -v, --verbose verbose option to work through method in log file -s SPLIT, --split SPLIT input a list of the sequences clustered to the same percent as the distance cut-off to speed up the analysis -j USEJS, --useJS USEJS Merge statistically significantly different sequences if below Jensen-Shannon cut-off?

Example data can be found at the following website: http://web.mit.edu/afs/athena.mit.edu/user/s/p/spacocha/Public/16S-workshop/16S-analysis-workshop-tar.tar.gz

Download and open the file above. Within the example_data2_res folder will be an example of the uniput OTU table file (unique.f0.good.mat), input alignment (unique.good.align) and output files with the previx unique.dbOTU (unique.dbOTU.log, unique.dbOTU.list, unique.dbOTU.fasta, unique.dbOTU.mat).

3.) Running dbOTUcaller from SmileTrain

Interfacing with SmileTrain, which is a python wrapper for USEARCH and custom perl and python scripts is a fast and easy way to go from raw data to processed files quickly and reliably. SmileTrain can be accessed through GitHub:

https://github.com/almlab/SmileTrain

To install SmileTrain, follow the documentation on the wiki:

https://github.com/almlab/SmileTrain/wiki

Alternatively, a practical guide to running SmileTrain and dbOTUcaller can be found with the example datasets: http://web.mit.edu/afs/athena.mit.edu/user/s/p/spacocha/Public/16S-workshop/16S-analysis-workshop-tar.tar.gz

In a file called Practical_notes_SmileTrain_dbOTU.docx

Follow the instructions within that document for quick installation guide and test data.

4.) Additional tools

A.) Merging different analysis

When running samples across different sequencing lanes, or combining separate analyses, you will have to use the following commands to merge datasets. -This assumes you have run --k_fold 10 (to remove potential sequencing errors) on at least two different samples and you want to join them. The following example assumes you have done two different dbOTU calling analyses in ../Reactor_and_Coupon/raw_data/dbOTU_dir and ../GroundWater/raw_data/dbOTUcaller_dir with the output files as indicated.

Merge the data together: 1.) Figure out how the sequences relate to each other- they must be the same length: perl ~/lib/dbOTUcaller/perllib/translate_fasta_eOTU.pl ../Reactor_and_Coupon/raw_data/dbOTU_dir/unique.fa ../GroundWater/raw_data/dbOTUcaller_dir/unique.fa > translate_complete.txt

2.) Only keep sequences that are parents- the rest are considered as errors. Notice you use the unique.dbOTU.list from the output of the dbOTU analysis and the translate_complete.txt from step 1: perl ~/lib/dbOTUcaller/perllib/merge_eOTU_results_parentsonly2.pl translate_complete.txt ../Reactor_and_Coupon/raw_data/dbOTU_dir/unique.dbOTU.list,../GroundWater/raw_data/dbOTUcaller_dir/unique.dbOTU.list > merged_output.txt

3.) make a text file which contains each mat file on a different line with ls: ls ../Reactor_and_Coupon/raw_data/dbOTU_dir/unique.f0.mat ../GroundWater/raw_data/dbOTUcaller_dir/unique.f0.mat > matfiles

4.) Merge the different analyses into one using the output prefix "myoutput" and the output from steps 1, 2 and 3: perl ~/lib/dbOTUcaller/perllib/create_merged_dataset.pl merged_output.txt translate_complete.txt matfiles myoutput

files with myoutput.* can be used for additional analyses and should contain samples from both analyses. This can be done with two or more datasets

5.) Contact me:

Please create an issue to contact me about problems installing or running dbOTUcaller. I am happy to help, but documenting issues will help others who encounter the same issues.

About

Extremely accurate algorithm used to group DNA sequences from microbial communities into operational taxonomic units (proxy for species) for ecological or biomedical research. This algorithm uses the information contained in the distribution of DNA sequences across samples along with sequence distances to cluster sequences accurately.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published