
(Question) Can I use this for protein homology to UniRef50? #144

Open
jolespin opened this issue Mar 2, 2024 · 2 comments

Comments


jolespin commented Mar 2, 2024

I'm looking for a faster alternative to Diamond for aligning proteins to UniRef50 so I can map identifiers to de novo proteins.

Can I use this tool to accomplish this task?

Collaborator

mheinzinger commented Mar 5, 2024

Not out of the box, no.
There are approaches that use embedding distances from protein language models (pLMs) for remote homology detection, such as the following (not a full list, just an excerpt that came to mind; disclaimer: the latter is from our group):

You can use the code base provided in the first link to align proteins, and/or you can use the recipe described in the latter link to find remote homologs. My 2 cents: if you really want to align proteins, I do not think that embeddings will give you a speed-up. At least, I am not aware of an implementation that would a) generate embeddings and b) align them to some DB in less time than MMseqs2/Diamond. What embeddings might give you is a fast pre-filter if you already have your DB pre-computed (see second link for details).
But I would probably just use Foldseek, potentially together with predicted 3Di sequences if you care about speed (disclaimer #2: also from us): https://github.com/mheinzinger/ProstT5/tree/main/scripts
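
To make the pre-filter idea concrete, here is a minimal sketch. It is not the method from either linked work, and it assumes a lot: every protein is already represented by one fixed-length embedding vector, the reference vectors are normalized once ahead of time, and candidates are ranked by brute-force cosine similarity. The function names (`l2_normalize`, `build_reference_index`, `prefilter`), the IDs, and the toy 3-dimensional vectors are all hypothetical stand-ins; real pLM embeddings are much larger, and a DB the size of UniRef50 would need an approximate nearest-neighbor index instead of a linear scan.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_reference_index(reference_embeddings):
    """Normalize all reference embeddings once, ahead of time (the pre-computed DB)."""
    return {ref_id: l2_normalize(vec) for ref_id, vec in reference_embeddings.items()}


def prefilter(query_vec, index, top_k=2):
    """Return the top_k (reference ID, cosine similarity) pairs for one query."""
    q = l2_normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, ref_vec)), ref_id)
        for ref_id, ref_vec in index.items()
    )
    best = heapq.nlargest(top_k, scored)
    return [(ref_id, score) for score, ref_id in best]


# Toy vectors standing in for per-protein embeddings (assumption, not real data).
reference_embeddings = {
    "ref_A": [1.0, 0.0, 0.0],
    "ref_B": [0.9, 0.1, 0.0],
    "ref_C": [0.0, 1.0, 0.0],
    "ref_D": [0.0, 0.0, 1.0],
}
index = build_reference_index(reference_embeddings)

# The query embedding is computed at search time; only the references are pre-computed.
candidates = prefilter([0.95, 0.0, 0.05], index, top_k=2)
print(candidates)
```

Only the shortlisted candidates would then be passed on to an actual aligner such as MMseqs2, Diamond, or Foldseek, which is where the identifier mapping would come from.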

Author

jolespin commented Mar 5, 2024

Ok this is very useful information.

> What embeddings might give you is some fast pre-filter if you have your DB already pre-computed

I'll look into your paper for more details, but I'm a bit confused. Let's say you have a model that uses protein embeddings for UniRef50. When you say the DB is pre-computed, are you referring to the query proteins, the reference proteins, or both?


2 participants