Add new program mode: NCBI Blast #22

mvences · 2021-07-25T12:21:44Z

This additional function is lower priority and should be dealt with after issues #20 and #21.

One additional function TaxI3 should offer is to compare a set of sequences (from the input file) online against the NCBI GenBank reference data set (which comprises many millions of sequences) using the server's BLAST algorithm, and to retrieve the best matches together with their identification. In principle this process is rather easy, but there are several drawbacks:

First of all, the process is very slow and only realistic for a small number of sequences.
Second, the process sometimes simply fails for some sequences in a set (due to failures of the NCBI server, which can be very heavily accessed from around the world at certain times). No result is retrieved for these sequences, so they may need to be resubmitted.
And most importantly, sometimes there are many matches that are equally good (100% match) but differ in the relevant metadata (species name), which is problematic if our goal is to find out to which species an unknown sequence belongs.
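To make the second and third points concrete, here is a minimal sketch of how they could be handled. It assumes nothing about the actual TaxI3 code: `fetch_with_retries`, `tied_top_species`, and the `(species, identity)` hit format are hypothetical names chosen for illustration, and `fetch` stands in for whatever function ends up sending one sequence to NCBI.

```python
import time
from typing import Callable, Dict, Iterable, List, Set, Tuple


def fetch_with_retries(
    seq_ids: Iterable[str],
    fetch: Callable[[str], object],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Tuple[Dict[str, object], List[str]]:
    """Call fetch(seq_id) up to `attempts` times; return (results, failed_ids)."""
    results: Dict[str, object] = {}
    failed: List[str] = []
    for seq_id in seq_ids:
        for attempt in range(attempts):
            try:
                results[seq_id] = fetch(seq_id)
                break
            except Exception:
                # Assumed to be a server-side failure: wait, then try again.
                if attempt < attempts - 1:
                    time.sleep(delay_seconds)
        else:
            # Every attempt failed; report the sequence instead of dropping it.
            failed.append(seq_id)
    return results, failed


def tied_top_species(hits: Iterable[Tuple[str, float]]) -> Set[str]:
    """Return the species names of all hits that share the highest identity."""
    hits = list(hits)
    if not hits:
        return set()
    best = max(identity for _, identity in hits)
    return {species for species, identity in hits if identity == best}
```

If `tied_top_species` returns more than one name for a query, the identification is ambiguous and could be flagged in the output rather than reported as a single species.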

However, BLAST searches against this online database have many advantages and offer many important options, such as retrieving for a query sequence all geographic localities where this species may occur, and so on. So we should not totally omit it from TaxI3 as many users will expect such an option.

Maybe to start, this can be implemented in a very simple way without many options: take each sequence, submit it to the NCBI BLAST search, retrieve only one hit (the first that the database returns), and write a simple output file with the basic information returned from the database.
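A rough sketch of that simple mode, assuming a Biopython version that still provides `Bio.Blast.NCBIWWW.qblast` and `Bio.Blast.NCBIXML.read` and that the NCBI service accepts the request. The function names, the choice of `blastn` against `nt`, and the output columns are my assumptions, not anything decided for TaxI3.

```python
import csv
from typing import Iterable, List, Tuple


def first_hit_row(query_id: str, record) -> List[str]:
    """Flatten the first alignment of a parsed BLAST record into one output row."""
    if not record.alignments:
        # No hit returned for this query: keep the row, leave the fields empty.
        return [query_id, "", "", ""]
    alignment = record.alignments[0]
    hsp = alignment.hsps[0]
    identity = 100.0 * hsp.identities / hsp.align_length
    return [query_id, alignment.accession, alignment.title, f"{identity:.2f}"]


def blast_first_hits(sequences: Iterable[Tuple[str, str]], out_path: str) -> None:
    """Submit each (id, sequence) pair to NCBI BLAST and write the first hit."""
    # Imported here so first_hit_row can be used without Biopython installed.
    from Bio.Blast import NCBIWWW, NCBIXML

    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["query", "accession", "hit_title", "identity_percent"])
        for query_id, sequence in sequences:
            # hitlist_size=1 asks the server for a single hit only.
            handle = NCBIWWW.qblast("blastn", "nt", sequence, hitlist_size=1)
            record = NCBIXML.read(handle)
            handle.close()
            writer.writerow(first_hit_row(query_id, record))
```

Each `qblast` call is a separate online request and can take a long time, which is one more reason to keep the number of submitted sequences small.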

I also suggest building a "blocker" into this mode that first counts the number of sequences in the input file and only takes the first 100 sequences for submission to NCBI, issuing the message "This process is very time consuming, and for now only allows comparing 100 sequences at once; the first 100 sequences from the input file are being used".
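The blocker could look roughly like this. The sketch assumes the input is a FASTA file, which may not match every input format TaxI3 accepts; `read_fasta` and `limit_sequences` are hypothetical helper names.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

MAX_SEQUENCES = 100
LIMIT_MESSAGE = (
    "This process is very time consuming, and for now only allows comparing "
    "100 sequences at once; the first 100 sequences from the input file are "
    "being used"
)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def limit_sequences(
    lines: Iterable[str], limit: int = MAX_SEQUENCES
) -> Tuple[List[Tuple[str, str]], int, Optional[str]]:
    """Count all sequences, keep the first `limit`, and return a message if cut."""
    records = list(read_fasta(lines))
    total = len(records)
    message = LIMIT_MESSAGE if total > limit else None
    return records[:limit], total, message
```

The caller would show `message` to the user whenever it is not `None` and then pass only the returned records on to the BLAST step.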

Once the implementation is successful, we can think about ways to improve the output.

Probably the easiest way to implement this is using Biopython, see this link:

https://biopython.org/docs/dev/api/Bio.Blast.NCBIWWW.html

necrosovereign · 2021-10-04T13:29:28Z

As far as I understand, the NCBI BLAST web API is deprecated. Apparently, they expect developers to create their own copies of the databases with cloud providers (e.g. Google, Amazon) and direct their requests to those copies.

mvences · 2021-10-04T13:35:49Z

Really? Oh, this is very understandable; people have probably kept sending massive BLAST requests to their servers, which made them very slow.
OK, let me look into this ... we will keep this issue open for now, but will probably drop it eventually.

mvences assigned necrosovereign Jul 25, 2021
