Getting Started

The LCA tool uses bitscore, identity percentage and the LCA script itself in the following order:

Step 1, determine the top percentage based on bitscore:
Of all the blast hits a sub-selection will be made of the top hits based on a percentage of the bitscore. If the top percentage is set to 8% the top will be calculated like 0.92*bitscore best hit=top threshold. In this example, all hits with a bitscore above 294.4 will continue to the next step.

Query	Subject	Identity percentage	Coverage	bitscore
Otu1	hit1	93	95	320
Otu1	hit2	93	95	320
Otu1	hit3	91	94	310
Otu1	hit4	90	95	301
Otu1	hit5	89	94	300
Otu1	hit6	85	95	290
Otu1	hit7	84	95	290
Otu1	hit8	81	90	281
Otu1	hit9	80	89	275
Otu1	hit10	79	88	274

Step 2, filter on threshold:
From the top hits there will be filtered on identity, coverage and bitscore. Let's choose a threshold of 90 identity, 90 coverage and 250 bitscore. All hits with scores above the thresholds will go to the next step.

Query	Subject	Identity percentage	Coverage	bitscore
Otu1	hit1	93	95	320
Otu1	hit2	93	95	320
Otu1	hit3	91	94	310
Otu1	hit4	90	95	301
Otu1	hit5	89	94	300

Step 3, determine the lowest common ancestor:
Of all the remaining hits the lowest common ancestor is determined. The script starts at species level, it checks if all the hits are coming from the same species. If not it checks at genus level and so on.

Query	Subject	taxonomy
Otu1	hit1	Animalia / Arthropoda / Insecta / Diptera / Asilidae / Scarbroughia / Scarbroughia delicatula
Otu1	hit2	Animalia / Arthropoda / Insecta / Diptera / Asilidae / Scarbroughia / Scarbroughia delicatula
Otu1	hit3	Animalia / Arthropoda / Insecta / Diptera / Asilidae / Schildia / Schildia fragilis
Otu1	hit4	Animalia / Arthropoda / Insecta / Diptera / Asilidae / Scytomedes / Scytomedes haemorrhoidalis

With this example data and thresholds the identification of otu1 will be Asilidae. This is a basic explanation of the LCA approach, more advanced filter settings and output options are shown at the example commands.

Usage

There is an example input file included in the example folder, this file will be used to execute the commands. The file consist of 11 columns, has a specific header and is tab separated.

Column explanation input file:

Column name	Description
Query ID	Id of the input sequence (qseqid)
Subject	Means Subject Title (stitle)
Subject accession	Subject accession (sacc)
Subject Taxonomy ID	Unique Subject Taxonomy ID (staxid), this value can be any value and is not used to find the lca
Identity percentage	Identity percentage of hit ()
Coverage	Means Query Coverage Per Subject (qcovs)
Evalue	Expected value of the hit (evalue)
Bitscore	Bit score of hit (bitscore)
Source	The source of where the taxonomy comes from, this value can be any value and is not used to find the lca
Taxonomy	This column contains the taxonomy of the hit in the order Kingdom / phylum / class / order / family / genus /species

The script itself has multiple parameter options.

Parameters:

Parameter	Description
-i	Input file
-o	Output file
-b	Bitscore top percentage threshold
-id	Minimum identity threshold
-cov	Minimum coverage threshold
-t	Check the top hit first or perform an lca analysis on all hits. Options:['only_lca', 'best_hit', "best_hits_range"]
-tid	Identity threshold for the tophit, only used when -t best_hit or best_hits_range
-tcov	Coverage threshold for the tophit, only used when -t best_hit or best_hits_range
-fh	Filter hits, filter out lines that contain a certain string
-flh	Filter lca hits, during the determination of the lca ignore this string
-minbit	Minimum bitscore threshold

Examples

Example 1:
This command performs an lca analysis on all the hits per input sequence. The top percentage threshold is 8%, minimal idenity and coverage is set to 80%. If you perform an lca analysis on all the hits the identification will never be on species level.

python lca.py -i example/example.tabular -o output1_example.tabular -b 8 -id 80 -cov 80 -t only_lca

Example 1 output explanation:
Otu1 has no identification and there is no lca analysis performed. You can see this in the last column. The hits of this sequence did not passed the thresholds.

#query	#lca rank	#lca taxon	#kingdom	#phylum	#class	#order	#family	#genus	#species	#method
Otu1	no identification	no identification	no identification	no identification	no identification	no identification	no identification	no identification	no identification	no lca

For Otu3 and Otu 6 an lca analysis is performed. The identification of otu 3 is at genus level and is identified as Chaetopterus.

#query	#lca rank	#lca taxon	#kingdom	#phylum	#class	#order	#family	#genus	#species	#method
Otu3	genus	Chaetopterus	Eukaryota	Annelida	Polychaeta	Spionida	Chaetopteridae	Chaetopterus	no identification	lca
Otu6	class	Polychaeta	Eukaryota	Annelida	Polychaeta	no identification	no identification	no identification	no identification	lca

Otu16 is identified as "unknown genus". This is because that is how the information is stored in the reference database. In this case the hit is coming from BOLD http://www.boldsystems.org/index.php/Public_RecordView?processid=POLNB1246-14

#query	#lca rank	#lca taxon	#kingdom	#phylum	#class	#order	#family	#genus	#species	#method
Otu16	genus	unknown genus	Animalia	Annelida	Polychaeta	Polychaeta incertae sedis	unknown family	unknown genus	no identification	lca

Example 2 taxonomy filter:
This command performs an lca analysis on all the hits per sequence like example 1, but now during the lca analysis if the text "unknown" exist in a rank it will be ignored. In many databases like Genbank or BOLD not all taxonomy is filled in by the uploader.

python lca.py -i example/example.tabular -o output2_example.tabular -b 8 -id 80 -cov 80 -t only_lca -flh unknown

Example 2 taxonomy filter output explanation:
Otu6 is now now identified as Thelepus at genus level. If you look at the example.tabular file you can see that this makes sense, in example 1 the rank was pushed up to class level because of the "unknowns" in the first hit.

#query	#lca rank	#lca taxon	#kingdom	#phylum	#class	#order	#family	#genus	#species	#method
Otu6	genus	Thelepus	Eukaryota	Annelida	Polychaeta	Terebellida	Terebellidae	Thelepus	no identification	lca
Otu16	order	Polychaeta incertae sedis	Animalia	Annelida	Polychaeta	Polychaeta incertae sedis	no identification	no identification	no identification	lca

Example 3 best hit species level identification:
This command performs an lca analysis if the top hit falls below the "top hit" thresholds. If the top blast hit (sorted on evalue) exceeds the thresholds the top hit is chosen and the input sequence gets a species level identification. Notice the change in parameters -t best_hit -tid 98 -tcov 100. If the top hit has an identity above 98% and a coverage above or equal to 100% this hit will be the taxonomic identification and will be at species level.

python lca.py -i example/example.tabular -o output3_example.tabular -b 8 -id 80 -cov 80 -t best_hit -tid 98 -tcov 100 -flh unknown

Example 3 best hit species level identification output explanation:
Now we have an otu that is identified on species level. Also the last column of Otu9 is different than in the previous examples. The value "best hit" means that there is no lca analysis performed, but the species of the top hit is chosen for identification. If you look at the example.tabular file you can see that the top hit of otu9 has an identity above 98 and coverage equal to 100.

#query	#lca rank	#lca taxon	#kingdom	#phylum	#class	#order	#family	#genus	#species	#method
Otu6	genus	Thelepus	Eukaryota	Annelida	Polychaeta	Terebellida	Terebellidae	Thelepus	no identification	lca
Otu9	species	Myxine glutinosa	Eukaryota	Chordata	unknown class	Myxiniformes	Myxinidae	Myxine	Myxine glutinosa	best hit

Example 4 best hit range:
In some cases the sequences in the reference database are wrongly morphological identified or contaminated by human or bacterial DNA. It can also happen that a certain marker (16S, CO1, ITS) is not distinctive enough. Many people choose the top hit as the identification for the input sequence without looking at the second hit while that it can even be a better choice. In these cases the following command can help to solve this. Notice the changed parameter -t best_hits_range

python lca.py -i example/example.tabular -o output4_example.tabular -b 8 -id 80 -cov 80 -t best_hits_range -tid 98 -tcov 100 -flh unknown

Example 4 best hit range output explanation:
After executing the command otu6 stays the same. Otu6 does not have hits above -tid 98 -tcov 100 so the parameter -t best_hits_range has no effect here. If we look at otu9 you see that it occurs twice in the output with two extra columns. Otu9 had multiple hits above -tid 98 -tcov 100 on two different species. In the blast output of otu9 there were 6 hits on the species Myxine glutinosa with an identity between 98.7 and 100%.

#query	#lca rank	#lca taxon	#kingdom	#phylum	#class	#order	#family	#genus	#species	#method	#identity	#coverage
Otu6	genus	Thelepus	Eukaryota	Annelida	Polychaeta	Terebellida	Terebellidae	Thelepus	no identification	lca
Otu9	species	Myxine glutinosa	Eukaryota	Chordata	unknown class	Myxiniformes	Myxinidae	Myxine	Myxine glutinosa	top hit (6)	98.7-100.0	100.0-100.0
Otu9	species	Myxine limosa	Eukaryota	Chordata	unknown class	Myxiniformes	Myxinidae	Myxine	Myxine limosa	top hit (4)	98.1-98.7	100.0-100.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

usage.md

usage.md

Getting Started

Usage

Examples

Files

usage.md

Latest commit

History

usage.md

File metadata and controls

Getting Started

Usage

Examples