Add new program mode: NCBI Blast #22

Open
mvences opened this issue Jul 25, 2021 · 2 comments

@mvences
Contributor

mvences commented Jul 25, 2021

This additional function is lower priority and should be dealt with after issues #20 and #21.

One additional function TaxI3 should offer is to compare a set of sequences (from the input file) online against the NCBI GenBank reference data set (which comprises many millions of sequences) using the server's BLAST algorithm, and to retrieve the best matches together with their identification. In principle this process is rather easy, but there are several drawbacks:

  • First, the process is very slow and only realistic for a small number of sequences.
  • Second, the process sometimes simply fails for some sequences in a set (due to failures of the NCBI server, which can be very heavily accessed from all over the world at times). No result is retrieved for these sequences, so they may need to be resubmitted.
  • Most importantly, there are sometimes many matches that are equally good (100% match) but differ in the relevant metadata (species name), which is problematic if our goal is to find out to which species an unknown sequence belongs.
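The second and third points can be sketched in plain Python. Everything here is an assumption for illustration and not part of TaxI3: the helper names `retry_failed` and `ambiguous_species`, the `(species, percent identity)` shape of a hit, the `search` callable standing in for the actual BLAST request, and the default of three rounds.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Hit = Tuple[str, float]  # (species name, percent identity); hypothetical shape


def retry_failed(
    queries: Dict[str, str],
    search: Callable[[str], Optional[List[Hit]]],
    max_rounds: int = 3,
) -> Dict[str, Optional[List[Hit]]]:
    """Run `search` on every sequence and resubmit those that gave no result."""
    results: Dict[str, Optional[List[Hit]]] = {name: None for name in queries}
    pending = list(queries)
    for _ in range(max_rounds):
        still_failing = []
        for name in pending:
            try:
                results[name] = search(queries[name])
            except Exception:
                # Treat any server or network error as "no result yet".
                results[name] = None
            if results[name] is None:
                still_failing.append(name)
        pending = still_failing
        if not pending:
            break
    return results


def ambiguous_species(hits: List[Hit]) -> Set[str]:
    """Return the species tied for the best identity, or an empty set if unique."""
    if not hits:
        return set()
    best = max(identity for _, identity in hits)
    top = {species for species, identity in hits if identity == best}
    return top if len(top) > 1 else set()
```

Sequences that still have `None` after the last round would be reported to the user rather than silently dropped, and a non-empty result from `ambiguous_species` signals that the top hit alone should not be trusted for species assignment.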

However, BLAST searches against this online database have many advantages and offer many important options, such as retrieving all geographic localities where the species matching a query sequence may occur. So we should not omit it from TaxI3 entirely, as many users will expect such an option.

To start, this could be implemented in a very simple way without many options: take each sequence, submit it to the NCBI BLAST search, retrieve only the first hit that the database returns, and write a simple output file with the basic information returned from the database.
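A minimal sketch of that loop, with the actual BLAST request left as a callable so the output logic can be tried offline. The function name `write_first_hits`, the column names, and the tab-separated format are assumptions, not an agreed TaxI3 output format.

```python
import csv
from typing import Callable, Dict, Iterable, Optional, Tuple


def write_first_hits(
    sequences: Iterable[Tuple[str, str]],
    search_first_hit: Callable[[str], Optional[Dict[str, str]]],
    out_path: str,
) -> int:
    """Write one row per query sequence; return how many queries had no hit."""
    columns = ["query", "hit_id", "hit_description", "percent_identity"]
    missing = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t")
        writer.writeheader()
        for name, sequence in sequences:
            hit = search_first_hit(sequence)
            if hit is None:
                # Keep the query in the output with empty hit columns.
                missing += 1
                hit = {}
            row = {"query": name}
            row.update({key: hit.get(key, "") for key in columns[1:]})
            writer.writerow(row)
    return missing
```

Queries without a hit stay in the file with empty columns, so the user can see which sequences need to be repeated.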

For this mode I also suggest building in a "blocker" that first counts the number of sequences in the input file and submits only the first 100 of them to NCBI, issuing an error message "This process is very time consuming, and for now only allows comparing 100 sequences at once; the first 100 sequences from the input file are being used".
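The blocker could look roughly like this. The names `apply_blocker`, `MAX_SEQUENCES`, and `LIMIT_MESSAGE` are hypothetical, and how the message is shown to the user is left to the caller.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")

MAX_SEQUENCES = 100
# The wording names 100 explicitly, so it only fits the default limit.
LIMIT_MESSAGE = (
    "This process is very time consuming, and for now only allows comparing "
    "100 sequences at once; the first 100 sequences from the input file "
    "are being used"
)


def apply_blocker(
    records: Sequence[T], limit: int = MAX_SEQUENCES
) -> Tuple[List[T], Optional[str]]:
    """Count the records and keep at most `limit`; return a message if any were dropped."""
    if len(records) <= limit:
        return list(records), None
    return list(records[:limit]), LIMIT_MESSAGE
```

Returning the message instead of raising lets the program continue with the first 100 sequences, as described above.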

Once the implementation is successful, we can think about ways to improve the output.

Probably the easiest way to implement this is using Biopython, see this link:

https://biopython.org/docs/dev/api/Bio.Blast.NCBIWWW.html
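A rough sketch of a single search with the `qblast` function documented on that page. The choice of `blastn` against the `nt` database, the function names, and the keys of the returned dictionary are assumptions. The XML result is parsed here with the standard library instead of Biopython's own parser so that the parsing step can be tried without Biopython or network access; the element names follow the BLAST XML output format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def first_hit_from_xml(xml_text: str) -> Optional[Dict[str, str]]:
    """Extract basic information about the first hit from BLAST XML output."""
    root = ET.fromstring(xml_text)
    hit = root.find(".//Hit")
    if hit is None:
        return None
    identity = ""
    hsp = hit.find("Hit_hsps/Hsp")
    if hsp is not None:
        matches = int(hsp.findtext("Hsp_identity", "0"))
        length = int(hsp.findtext("Hsp_align-len", "0"))
        if length:
            identity = f"{100.0 * matches / length:.1f}"
    return {
        "hit_id": hit.findtext("Hit_id", ""),
        "hit_description": hit.findtext("Hit_def", ""),
        "percent_identity": identity,
    }


def blast_first_hit(sequence: str) -> Optional[Dict[str, str]]:
    """Submit one nucleotide sequence to NCBI and return its first hit, if any."""
    # Imported here because it needs Biopython installed; the call below
    # also needs network access and can take a long time to return.
    from Bio.Blast import NCBIWWW

    handle = NCBIWWW.qblast("blastn", "nt", sequence, hitlist_size=1)
    try:
        return first_hit_from_xml(handle.read())
    finally:
        handle.close()
```

The percent identity is computed from the first alignment segment of the hit only, which is a simplification.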

@necrosovereign
Collaborator

As far as I can tell, the NCBI BLAST web API is deprecated. Apparently, they expect developers to create their own copies of the databases with cloud providers (e.g. Google, Amazon) and direct their requests to those copies.

@mvences
Contributor Author

mvences commented Oct 4, 2021

Really? Oh, this is very understandable; people have probably kept sending massive BLAST requests, which made their servers very slow.
OK, let me look into this ... we will keep this issue open for now, but will probably drop it eventually.
