This tutorial is largely inspired by the MiSeq SOP from the Schloss Lab.
Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 79(17):5112-20.
- Introduction
- Software Required for this Tutorial
- Downloading the Data and Starting mothur
- Reducing Sequencing and PCR Errors
- Processing Improved Sequences
- Analysis
- Batch Mode
The 16S rRNA gene is a section of prokaryotic DNA found in all bacteria and archaea. This gene codes for an rRNA, and this rRNA in turn makes up part of the ribosome. The first 'r' in rRNA stands for ribosomal. The ribosome is composed of two subunits, the large subunit (LSU) and the small subunit (SSU).
The 16S rRNA gene is a commonly used tool for identifying bacteria for several reasons. First, traditional characterization depended upon phenotypic traits like Gram-positive or Gram-negative, bacillus or coccus, etc. Taxonomists today consider analysis of an organism's DNA more reliable than classification based solely on phenotypes. Secondly, researchers may, for a number of reasons, want to identify or classify only the bacteria within a given environmental or medical sample. While there is a homologous gene in eukaryotes, the 18S rRNA gene, it is distinct, thereby rendering the 16S rRNA gene a useful tool for extracting and identifying bacteria as separate from plant, animal, fungal, and protist DNA within the same sample. Thirdly, the 16S rRNA gene is relatively short at 1.5 kb, making it faster and cheaper to sequence than many other unique bacterial genes.
mothur is a command-line program for analyzing sequence data from microbial communities, notably 16S rRNA amplicon data. mothur is licensed under the GPL and is free to use.
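If you don't already have mothur (and, for the visualization at the end, KronaTools) installed, one convenient route is the bioconda channel. This is only a suggestion and assumes you have conda set up with bioconda enabled:
# optional: install mothur and KronaTools through bioconda (assumes conda is available)
conda install -c bioconda mothur
conda install -c bioconda krona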
Firstly, download and unzip the sample dataset:
wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
unzip MiSeqSOPData.zip
In the MiSeq_SOP directory, you'll find the read files in fastq format, as well as a file called stability.files.
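If you want to peek at this file from the shell before starting mothur, something like this works (assuming you unzipped into your current directory, so the data sits in MiSeq_SOP):
head -n 4 MiSeq_SOP/stability.files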
The first lines of stability.files look like this:
F3D0    F3D0_S188_L001_R1_001.fastq    F3D0_S188_L001_R2_001.fastq
F3D141  F3D141_S207_L001_R1_001.fastq  F3D141_S207_L001_R2_001.fastq
F3D142  F3D142_S208_L001_R1_001.fastq  F3D142_S208_L001_R2_001.fastq
F3D143  F3D143_S209_L001_R1_001.fastq  F3D143_S209_L001_R2_001.fastq
The first column is the name of the sample. The second column is the name of the forward read for that sample and the third column is the name of the reverse read for that sample.
Now it's time to start mothur. Type mothur in your terminal. You should see your prompt change to
mothur >
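When you later need to leave the interactive session (for example, to download files from the shell), mothur's quit() command returns you to your terminal:
quit()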
The first thing we want to do is combine our two sets of reads for each sample and then combine the data from all of the samples. This is done using the make.contigs command, which requires stability.files as input. This command will extract the sequence and quality score data from your fastq files, create the reverse complement of the reverse read and then join the reads into contigs.
make.contigs(file=stability.files, processors=8)
It took 30 secs to process 152360 sequences.
Group count:
F3D0 7793
F3D1 5869
F3D141 5958
F3D142 3183
F3D143 3178
F3D144 4827
F3D145 7377
F3D146 5021
F3D147 17070
F3D148 12405
F3D149 13083
F3D150 5509
F3D2 19620
F3D3 6758
F3D5 4448
F3D6 7989
F3D7 5129
F3D8 5294
F3D9 7070
Mock 4779
Total of all groups is 152360
Output File Names:
stability.trim.contigs.fasta
stability.trim.contigs.qual
stability.contigs.report
stability.scrap.contigs.fasta
stability.scrap.contigs.qual
stability.contigs.groups
The stability.contigs.report file will tell you something about the contig assembly for each read. Let's see what these sequences look like using the summary.seqs command:
summary.seqs(fasta=stability.trim.contigs.fasta)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 248 248 0 3 1
2.5%-tile: 1 252 252 0 3 3810
25%-tile: 1 252 252 0 4 38091
Median: 1 252 252 0 4 76181
75%-tile: 1 253 253 0 5 114271
97.5%-tile: 1 253 253 6 6 148552
Maximum: 1 502 502 249 243 152360
Mean: 1 252.811 252.811 0.70063 4.44854
# of Seqs: 152360
This tells us that we have 152360 sequences that for the most part vary between 248 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this; the reads are supposed to be 251 bp each, so this read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs.
screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)
You'll notice that mothur remembered that we used 8 processors in make.contigs. To see what else mothur knows about you, run the following:
get.current()
Current files saved by mothur:
fasta=stability.trim.contigs.good.fasta
group=stability.contigs.good.groups
qfile=stability.trim.contigs.qual
processors=8
summary=stability.trim.contigs.summary
What this means is that mothur remembers your latest fasta file and group file as well as the number of processors you have. So you could run:
mothur > summary.seqs(fasta=stability.trim.contigs.good.fasta)
mothur > summary.seqs(fasta=current)
mothur > summary.seqs()
and get the same output for each command.
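You can also change these defaults yourself with set.current; for instance, to make the following commands use fewer processors (a small example, assuming your mothur version provides set.current):
set.current(processors=2)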
Now that we have screened out these problem sequences, let's move on to the next step.
We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same sequences several times, we'll make our sequences unique:
unique.seqs(fasta=stability.trim.contigs.good.fasta)
If two sequences are identical, they're considered duplicates and will get merged. In the screen output there are two columns: the first is the number of sequences characterized and the second is the number of unique sequences remaining.
Another thing to do to make our lives easier is to simplify the names and group files. If you look at the most recent versions of those files, you'll see that together they are 13 MB. This may not seem like much, but with a full MiSeq run those long sequence names can add up and make life tedious. So we'll run count.seqs to generate a table where the rows are the names of the unique sequences and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.
This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands we'll refer to it with the count option:
count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
summary.seqs(count=stability.trim.contigs.good.count_table)
Using stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 252 252 0 3 3227
25%-tile: 1 252 252 0 4 32265
Median: 1 252 252 0 4 64530
75%-tile: 1 253 253 0 5 96794
97.5%-tile: 1 253 253 0 6 125832
Maximum: 1 270 270 0 12 129058
Mean: 1 252.462 252.462 0 4.36663
# of unique seqs: 16477
total # of seqs: 129058
Now we need to align our sequences to the reference alignment.
First we need to download the SILVA database.
# This step should be done outside mothur
wget http://www.mothur.org/w/images/9/98/Silva.bacteria.zip
unzip Silva.bacteria.zip
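Alternatively, you can stay inside mothur and run the same shell commands through its system() command (this assumes wget and unzip are available on your machine):
system(wget http://www.mothur.org/w/images/9/98/Silva.bacteria.zip)
system(unzip Silva.bacteria.zip)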
If you have quit mothur to download the database, rerun the mothur command, then take a look at the database you have downloaded:
summary.seqs(fasta=silva.bacteria/silva.bacteria.fasta, processors=8)
Now do the alignment using align.seqs:
align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.bacteria/silva.bacteria.fasta)
We can then run summary.seqs again to get a summary of our alignment:
summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
You'll see that the bulk of the sequences start at position 13862 and end at position 23444. Some sequences start at position 13144 or 13876 and end at 22587 or 25294. These deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you'll see sequences that start and end at the same position, indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region we'll re-run screen.seqs to keep sequences that start at or before position 13862 and end at or after position 23444. We'll also set the maximum homopolymer length to 8, since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that we can update it for the sequences we're removing, and we're also using the summary file so we don't have to figure out all the start and stop positions again:
screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=13862, end=23444, maxhomop=8)
summary.seqs(fasta=current, count=current)
Now we can trim both ends of the aligned reads to make sure they all overlap exactly the same region. We can do this with filter.seqs:
filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
We may have introduced redundancy by trimming the ends of the sequences, so we will re-run unique.seqs:
unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
This identified 3 duplicate sequences that we've now merged with previous unique sequences. The next thing we want to do to further de-noise our sequences is to pre-cluster them using the pre.cluster command, allowing for up to 2 differences between sequences. This command will split the sequences by group, sort them by abundance, and then go from the most abundant to the least, identifying sequences that are within 2 nt of each other. If they are, they get merged. We generally favor allowing 1 difference for every 100 bp of sequence, which for these ~250 bp reads gives diffs=2:
pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
At this point we have removed as much sequencing error as we can, and it is time to turn our attention to removing chimeras. We'll do this using the UCHIME algorithm, which is called within mothur using the chimera.uchime command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive, since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:
chimera.uchime(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
Running chimera.uchime with the count file will remove the chimeric sequences from the count file, but you still need to remove those sequences from the fasta file. We do this using remove.seqs:
remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)
As a final quality control step, we need to see if there are any "undesirables" in our dataset. Sometimes the primer set we pick will amplify other things that make it to this point in the pipeline, such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria. There's also just random stuff that we want to get rid of. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs command:
classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.bacteria/silva.bacteria.fasta, taxonomy=silva.bacteria/silva.bacteria.rdp.tax, cutoff=80)
Now that everything is classified, we want to remove our undesirables. We do this with the remove.lineage command:
remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
We will use cluster.split to cluster the sequences into OTUs:
cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)
We used taxlevel=4, which corresponds to the taxonomic level of Order.
Next we want to know how many sequences are in each OTU from each group, and we can do this using the make.shared command. Here we tell mothur that we're really only interested in the 0.03 cutoff level:
make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.03)
We also want to know the taxonomy for each of our OTUs. We can get the consensus taxonomy for each OTU using the classify.otu command:
classify.otu(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.pick.taxonomy, label=0.03)
If you open the file stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy, you can get information about your OTUs.
OTU Size Taxonomy
Otu0001 12328 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0002 8918 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0003 7850 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0004 7478 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0005 7478 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0006 6650 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0007 6341 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Bacteroidaceae(100);Bacteroides(100);Bacteroides_unclassified(100);
Otu0008 5374 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Rikenellaceae(100);Alistipes(100);Alistipes_unclassified(100);
Otu0009 3618 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
This tells you that Otu0001 was observed 12328 times across your samples and that 100% of its sequences were classified as Barnesiella.
We'll visualize the composition of our datasets using Krona.
python mothur_krona_XML.py stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.tax.summary > mothur.xml
ktImportXML mothur.xml
Now open xml.krona.html in your browser!
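If you prefer a more descriptive file name, the KronaTools import scripts accept an -o option (an assumption based on a standard KronaTools installation), for example:
ktImportXML -o stability.krona.html mothur.xml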
It is perfectly acceptable to enter the commands for your analysis from within mothur; we call this the interactive mode. If you are doing a lot of these types of analyses, or you want to use this SOP on your own data without thinking too much, you can run mothur in batch mode using
./mothur script.batch
where script.batch (or whatever name you want, really) is a text file containing all the commands that you previously entered in interactive mode.
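For example, a minimal batch file built from the first few commands of this tutorial might look like this (a sketch only; a real script.batch would continue with the rest of the pipeline):
make.contigs(file=stability.files, processors=8)
summary.seqs(fasta=stability.trim.contigs.fasta)
screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)
unique.seqs(fasta=stability.trim.contigs.good.fasta)
count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)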
If you have time, copy all the commands from this tutorial into a file, and try to make mothur work in batch mode!