Add new tasks based on UniProt keywords #37
it's used in several places
not just domain anymore
Simplify, add docstrings and comments
Other tasks could be added but they are already covered so I chose not to add them.
Minor suggestions.
You can merge
" ['Biological process', 'Cellular component', 'Coding sequence diversity', 'Disease', "
" 'Domain', 'Ligand', 'Molecular function', 'PTM', 'Technical term']"
"Can be multiply defined. Defaults to creating all of the possible keyword tasks.",
default=[
The default should be the subset that we actually want, so without "Biological process", for example.
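A minimal sketch of what a narrower default could look like. It assumes the option is defined with `argparse` (the diff does not show which CLI library is used), and the flag name `--keyword-category`, the helper names, and the exact default subset are all hypothetical placeholders. The only part taken from this thread is the category list and the idea of leaving out "Biological process".

```python
import argparse

# Categories copied from the help text in the diff.
ALL_KEYWORD_CATEGORIES = [
    "Biological process", "Cellular component", "Coding sequence diversity",
    "Disease", "Domain", "Ligand", "Molecular function", "PTM", "Technical term",
]

# Placeholder default subset (an assumption): everything except "Biological process".
DEFAULT_KEYWORD_CATEGORIES = [
    category for category in ALL_KEYWORD_CATEGORIES if category != "Biological process"
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--keyword-category",  # hypothetical flag name
        dest="keyword_categories",
        action="append",  # the flag can be given several times
        choices=ALL_KEYWORD_CATEGORIES,
        # Use None rather than a list here: with action="append", argparse
        # appends user values to a list default instead of replacing it.
        default=None,
        help="UniProt keyword category to create a task for. Can be given multiple times.",
    )
    return parser


def parse_keyword_categories(argv: list[str]) -> list[str]:
    args = build_parser().parse_args(argv)
    # Fall back to the default subset only when the flag was not given at all.
    return args.keyword_categories or list(DEFAULT_KEYWORD_CATEGORIES)
```

With this shape, running without the flag creates only the default subset, while passing the flag one or more times selects exactly the categories named.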
entities=pd.Series(gene_keyword_df_dict[task].index).rename("symbol"),
outcomes=gene_keyword_df_dict[task],
main_task_directory=main_task_directory,
task_name="UniProt keyword " + task,
We could do this using an f-string.
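A small sketch of the suggestion. The value assigned to `task` is a made-up example; in the diff it is the keyword category currently being processed.

```python
# Hypothetical example value standing in for the `task` variable in the diff.
task = "Domain"

# Form currently in the diff: string concatenation.
concatenated_name = "UniProt keyword " + task

# Suggested form: an f-string that produces the same text.
fstring_name = f"UniProt keyword {task}"
```

Both forms give the same task name, so the change is purely stylistic.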
UniProt keywords provide a nice array of new gene properties, specifically properties of the proteins encoded by the genes.
This PR would add the following:
The highest-value ones, because they cover genuinely new ground, are:
These are much more structural/chemical than our other properties and should provide a non-trivial differentiation between models.
We also have Cellular component, Molecular function and Disease, but I think these are basically covered by HPA. It could be interesting to have another version of the same question, but those are not as high value.
There are also 'Technical term' and 'Coding sequence diversity', which do not appear to have much value at all.
What should I update in the Excel file? Just the three really interesting ones?