This tutorial is largely inspired by the MiSeq SOP from the Schloss Lab.
Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 79(17):5112-20.
- Introduction
- Software Required for this Tutorial
- Downloading the Data and Starting mothur
- Reducing Sequencing and PCR Errors
- Processing Improved Sequences
- Analysis
- Batch Mode
The 16S rRNA gene is a section of prokaryotic DNA found in all bacteria and archaea. This gene codes for an rRNA, and this rRNA in turn makes up part of the ribosome. The first 'r' in rRNA stands for ribosomal. The ribosome is composed of two subunits, the large subunit (LSU) and the small subunit (SSU).
The 16S rRNA gene is a commonly used tool for identifying bacteria for several reasons. First, traditional characterization depended upon phenotypic traits like Gram-positive or Gram-negative, bacillus or coccus, etc. Taxonomists today consider analysis of an organism's DNA more reliable than classification based solely on phenotypes. Secondly, researchers may, for a number of reasons, want to identify or classify only the bacteria within a given environmental or medical sample. While there is a homologous gene in eukaryotes, the 18S rRNA gene, it is distinct, thereby rendering the 16S rRNA gene a useful tool for extracting and identifying bacteria as separate from plant, animal, fungal, and protist DNA within the same sample. Thirdly, the 16S rRNA gene is relatively short at 1.5 kb, making it faster and cheaper to sequence than many other unique bacterial genes.
mothur is a command-line program for analyzing sequence data from microbial communities, notably 16S rRNA amplicon data. mothur is licensed under the GPL and is free to use.
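If you don't already have mothur (and, for the visualization at the end, KronaTools) installed, one convenient route is the bioconda channel. This is only a suggestion and assumes you have conda set up with bioconda enabled:
# optional: install mothur and KronaTools through bioconda (assumes conda is available)
conda install -c bioconda mothur
conda install -c bioconda krona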
Firstly, download and unzip the sample dataset:
wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
unzip MiSeqSOPData.zip
In the MiSeq_SOP directory, you'll find the read files in fastq format, as well as a file called stability.files.
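If you want to peek at this file from the shell before starting mothur, something like this works (assuming you unzipped into your current directory, so the data sits in MiSeq_SOP):
head -n 4 MiSeq_SOP/stability.files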
The first lines of stability.files look like this:
F3D0    F3D0_S188_L001_R1_001.fastq    F3D0_S188_L001_R2_001.fastq
F3D141  F3D141_S207_L001_R1_001.fastq  F3D141_S207_L001_R2_001.fastq
F3D142  F3D142_S208_L001_R1_001.fastq  F3D142_S208_L001_R2_001.fastq
F3D143  F3D143_S209_L001_R1_001.fastq  F3D143_S209_L001_R2_001.fastq
The first column is the name of the sample. The second column is the name of the forward read for that sample and the third column is the name of the reverse read for that sample.
Now it's time to start mothur. Type mothur in your terminal. You should see your prompt change to
mothur >
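When you later need to leave the interactive session (for example, to download files from the shell), mothur's quit() command returns you to your terminal:
quit()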
The first thing we want to do is combine our two sets of reads for each sample and then combine the data from all of the samples. This is done using the make.contigs command, which requires stability.files as input. This command will extract the sequence and quality score data from your fastq files, create the reverse complement of the reverse read and then join the reads into contigs.
make.contigs(file=stability.files, processors=8)
It took 30 secs to process 152360 sequences.
Group count:
F3D0 7793
F3D1 5869
F3D141 5958
F3D142 3183
F3D143 3178
F3D144 4827
F3D145 7377
F3D146 5021
F3D147 17070
F3D148 12405
F3D149 13083
F3D150 5509
F3D2 19620
F3D3 6758
F3D5 4448
F3D6 7989
F3D7 5129
F3D8 5294
F3D9 7070
Mock 4779
Total of all groups is 152360
Output File Names:
stability.trim.contigs.fasta
stability.trim.contigs.qual
stability.contigs.report
stability.scrap.contigs.fasta
stability.scrap.contigs.qual
stability.contigs.groups
The stability.contigs.report file will tell you something about the contig assembly for each read. Let's see what these sequences look like using the summary.seqs command:
summary.seqs(fasta=stability.trim.contigs.fasta)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 248 248 0 3 1
2.5%-tile: 1 252 252 0 3 3810
25%-tile: 1 252 252 0 4 38091
Median: 1 252 252 0 4 76181
75%-tile: 1 253 253 0 5 114271
97.5%-tile: 1 253 253 6 6 148552
Maximum: 1 502 502 249 243 152360
Mean: 1 252.811 252.811 0.70063 4.44854
# of Seqs: 152360
This tells us that we have 152360 sequences that for the most part vary between 248 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this; the reads are supposed to be 251 bp each, so this read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs.
screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)
You'll notice that mothur remembered that we used 8 processors in make.contigs. To see what else mothur knows about you, run the following:
get.current()
Current files saved by mothur:
fasta=stability.trim.contigs.good.fasta
group=stability.contigs.good.groups
qfile=stability.trim.contigs.qual
processors=8
summary=stability.trim.contigs.summary
What this means is that mothur remembers your latest fasta file and group file as well as the number of processors you have. So you could run:
mothur > summary.seqs(fasta=stability.trim.contigs.good.fasta)
mothur > summary.seqs(fasta=current)
mothur > summary.seqs()
and get the same output for each command.
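You can also change these defaults yourself with set.current; for instance, to make the following commands use fewer processors (a small example, assuming your mothur version provides set.current):
set.current(processors=2)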
Now that we have screened out these problem sequences, let's move on to the next step.
We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same sequences several times, we'll make our sequences unique:
unique.seqs(fasta=stability.trim.contigs.good.fasta)
If two sequences are identical, they're considered duplicates and will get merged. In the screen output there are two columns: the first is the number of sequences characterized and the second is the number of unique sequences remaining.
Another thing to do to make our lives easier is to simplify the names and group files. If you look at the most recent versions of those files, you'll see that together they are 13 MB. This may not seem like much, but with a full MiSeq run those long sequence names can add up and make life tedious. So we'll run count.seqs to generate a table where the rows are the names of the unique sequences and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.
This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands we'll refer to it with the count option:
count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
summary.seqs(count=stability.trim.contigs.good.count_table)
Using stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 252 252 0 3 3227
25%-tile: 1 252 252 0 4 32265
Median: 1 252 252 0 4 64530
75%-tile: 1 253 253 0 5 96794
97.5%-tile: 1 253 253 0 6 125832
Maximum: 1 270 270 0 12 129058
Mean: 1 252.462 252.462 0 4.36663
# of unique seqs: 16477
total # of seqs: 129058
Now we need to align our sequences to the reference alignment.
First we need to download the SILVA database.
# This step should be done outside mothur
wget http://www.mothur.org/w/images/9/98/Silva.bacteria.zip
unzip Silva.bacteria.zip
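Alternatively, you can stay inside mothur and run the same shell commands through its system() command (this assumes wget and unzip are available on your machine):
system(wget http://www.mothur.org/w/images/9/98/Silva.bacteria.zip)
system(unzip Silva.bacteria.zip)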
If you have quit mothur to download the database, rerun the mothur command, then take a look at the database you have downloaded:
summary.seqs(fasta=silva.bacteria/silva.bacteria.fasta, processors=8)
Now do the alignment using align.seqs:
align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.bacteria/silva.bacteria.fasta)
We can then run summary.seqs again to get a summary of our alignment:
summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
You'll see that the bulk of the sequences start at position 13862 and end at position 23444. Some sequences start at position 13144 or 13876 and end at 22587 or 25294. These deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you'll see sequences that start and end at the same position, indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region we'll re-run screen.seqs to keep sequences that start at or before position 13862 and end at or after position 23444. We'll also set the maximum homopolymer length to 8, since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that we can update it for the sequences we're removing, and we're also using the summary file so we don't have to figure out all the start and stop positions again:
screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=13862, end=23444, maxhomop=8)
summary.seqs(fasta=current, count=current)
Now we can trim both ends of the aligned reads to make sure they all overlap exactly the same region. We can do this with filter.seqs:
filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
We may have introduced redundancy by trimming the ends of the sequences, so we will re-run unique.seqs:
unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
This identified 3 duplicate sequences that we've now merged with previous unique sequences. The next thing we want to do to further de-noise our sequences is to pre-cluster them using the pre.cluster command, allowing for up to 2 differences between sequences. This command will split the sequences by group, sort them by abundance, and then go from the most abundant to the least, identifying sequences that are within 2 nt of each other. If they are, they get merged. We generally favor allowing 1 difference for every 100 bp of sequence, which for these ~250 bp reads gives diffs=2:
pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
At this point we have removed as much sequencing error as we can, and it is time to turn our attention to removing chimeras. We'll do this using the UCHIME algorithm, which is called within mothur using the chimera.uchime command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive, since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:
chimera.uchime(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
Running chimera.uchime with the count file will remove the chimeric sequences from the count file, but you still need to remove those sequences from the fasta file. We do this using remove.seqs:
remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)
As a final quality control step, we need to see if there are any "undesirables" in our dataset. Sometimes the primer set we pick will amplify other things that make it to this point in the pipeline, such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria. There's also just random stuff that we want to get rid of. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs command:
classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.bacteria/silva.bacteria.fasta, taxonomy=silva.bacteria/silva.bacteria.rdp.tax, cutoff=80)
Now that everything is classified, we want to remove our undesirables. We do this with the remove.lineage command:
remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
We will use cluster.split to cluster the sequences into OTUs:
cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)
We used taxlevel=4, which corresponds to the taxonomic level of Order.
Next we want to know how many sequences are in each OTU from each group, and we can do this using the make.shared command. Here we tell mothur that we're really only interested in the 0.03 cutoff level:
make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.03)
We also want to know the taxonomy for each of our OTUs. We can get the consensus taxonomy for each OTU using the classify.otu command:
classify.otu(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.pick.taxonomy, label=0.03)
If you open the file stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy, you can get information about your OTUs.
OTU Size Taxonomy
Otu0001 12328 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0002 8918 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0003 7850 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0004 7478 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0005 7478 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0006 6650 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0007 6341 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Bacteroidaceae(100);Bacteroides(100);Bacteroides_unclassified(100);
Otu0008 5374 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Rikenellaceae(100);Alistipes(100);Alistipes_unclassified(100);
Otu0009 3618 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
This tells you that Otu0001 was observed 12328 times across your samples and that 100% of its sequences were classified as Barnesiella.
We'll visualize the composition of our datasets using Krona.
python mothur_krona_XML.py stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.tax.summary > mothur.xml
ktImportXML mothur.xml
Now open xml.krona.html in your browser!
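If you prefer a more descriptive file name, the KronaTools import scripts accept an -o option (an assumption based on a standard KronaTools installation), for example:
ktImportXML -o stability.krona.html mothur.xml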
It is perfectly acceptable to enter the commands for your analysis from within mothur; we call this the interactive mode. If you are doing a lot of these types of analyses, or you want to use this SOP on your own data without thinking too much, you can run mothur in batch mode using
./mothur script.batch
where script.batch (or whatever name you want, really) is a text file containing all the commands that you previously entered in interactive mode.
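For example, a minimal batch file built from the first few commands of this tutorial might look like this (a sketch only; a real script.batch would continue with the rest of the pipeline):
make.contigs(file=stability.files, processors=8)
summary.seqs(fasta=stability.trim.contigs.fasta)
screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)
unique.seqs(fasta=stability.trim.contigs.good.fasta)
count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)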
If you have time, copy all the commands from this tutorial into a file, and try to make mothur work in batch mode!