-
Notifications
You must be signed in to change notification settings - Fork 2
[ENH] AptaTrans: training code and training schema #92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@@ -0,0 +1,348 @@ | |||
"""AptaTrans' deep neural network for aptamer-protein interaction prediction.""" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the plan is to make all files private but expose the classes being used in __init__.py
? So _model.py
, _pipeline.py
, and so on
pyaptamer/utils/augment.py
Outdated
@@ -0,0 +1,26 @@ | |||
__author__ = ["nennomp"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be private I think? One more question, what is the reason for creating so many util files and all of them not being in just one file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually, I separated them into different purpose modules. Wouldn't we end up with a monolithic utils.py
file if we don't separate them by what they do?
Not sure which design choice is better. My idea was that in the future you may add other methods for rna, proteins, etc. and having separate modules would be helpful and more organized. Otherwise, we could put them all into a seq_to_vec_utils.py
or something idk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, I had no idea what the purpose for them was (and I still do not have a very concrete idea to be very honest) but I think this is something we can discuss tomorrow (so that you can explain what you are doing and I can follow it if I ever need to)? This is how I was doing it:
_{algorithm}_utils.py
for utils depending on one algorithm.x_to_y
for converter functions.
Your idea for specific molecule groups makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, let's discuss it as an agenda point tomorrow and decide which approach to use going forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schedule this for Tuesday then :D
pyaptamer/base/_base_solver.py
Outdated
from tqdm import tqdm | ||
|
||
|
||
class BaseSolver(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a very important class I do not expect in a PR for AptaTrans
, why not:
- Move this to a new PR including other classes similar to this which can be useful for other algorithms too.
- Rename it?
BaseSolver
does not seem to imply that it is a base class to train neural networks on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, taking into account your suggestion about using pytorch lightining (which is a good idea), I think the best approach would be to delegate any training to such library.
Maybe I keep the custom solvers for now so that we may have a working AptaTrans sooner rather than later, and then we refactor any training code to lightining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to talk about tomorrow too I guess :D (I agree with what you are saying though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agenda getting bigger and bigger for tomorrow 😁.
By the way, I gave a quick look at lightning and we would need to do some wrapping to convert nn.Module
to lightning.LightningModule
to make the models compatible with lightning training pipelines.
Probably making wrappers but also keeping the pytorch compatible ones could be a good idea, let's see what others think tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we just reinvent lightning
in this class? I do not particularly mind for now, but we should avoid reinventing the wheel
@satvshr This is still a draft PR, it's not ready for review. |
Oh yeah I know, I just had few observations while trying to find your training code given I am trying to do the same for |
@@ -1,9 +1,14 @@ | |||
"""Utils for the pyaptamer package.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just observed this, can you move everything out of utils into their respective files? It breaks the purpose of having util files for separate purposes if we move some functions to root import imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree, in any case we should avoid making unrelated changes in a PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something we had discussed in the weekly regarding the structure of utils, the main essence of the discussion was that init should only contain imports which are going to be used by users and algorithms for transformations, etc. Not private functions that serves the use case of a single algorithm, in that sense we should also move literally every function (in _aptanet_utils
and so on) in utils to init, hence resulting in no structure for the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why rename this file? Please do not touch files not a part of the current PR :D It is not a private utility, rather its a file format converter helper function. I will bring this up in the next meeting to decide what should be moved to the utils init folder and what should not be, could you specify what the functions currently in the utils/__init__.py
file are for and are they general purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you specify what the functions currently in the utils/init.py file are for and are they general purpose?
I meant in this thread (if you mistook it for docstring) sorry for not being clear!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understood what you meant, no worries. I will probably address the points during the weekend, not sure if I will be able to do it today.
Thanks for the input.
@@ -0,0 +1,2 @@ | |||
aptamer,protein,label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about structuring the data
directory and moving files into their respective paths? SO pdb
goes into data/pdb
and so on. @fkiraly thoughts?
ps: If you agree and do this, you will have to change the paths of the helper functions too that I added in utils for file conversions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
|
||
|
||
class AptaTransPipeline: | ||
"""AptaTrans pipeline for aptamer affinity prediction, by Shin et al. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we changing the docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think references do not display properly in the header - may I suggest to revert?
|
||
return (apta_words, prot_words) | ||
|
||
def _init_aptamer_experiment(self, target: str) -> Aptamer: | ||
"""Initialize the aptamer experiment.""" | ||
# initialize the aptamer recommendation experiment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is redundant since it's saying the same thing of the method's docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we had discussed making the comment the method docstring instead as it provided more clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we had discussed making the comment the method docstring instead as it provided more clarity?
Yep, then the inline comment became redundant and I removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but it shows you removed the comment and did not edit the method docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused. I remember a comment about this method not having a docstring and being suggested to move the inline comment to the docstring. Cannot find it right now though.
Anyway, acknowledging what is being shown it makes sense to me to remove the inline comment as it is redundant no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we discussed was making the comment:"# initialize the aptamer recommendation experiment" the function docstring instead of "Initialize the aptamer experiment.", but what you have done is remove the comment and keep the function docstring unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not an important line but I was still wondering why it gets removed. I do not mind if it gets removed, but it is just strange that it does.
I agree that the scope of this PR has massively increased and such concerns should be moved to a new PR, this (utils structure) was already discussed though in a weekly, as mentioned above. I would also recommend making the notebook a separate PR dependent on this PR and making it as a priority, as it would help out with benchmarking |
I now understand how making unnecessary changes (out of the PR scope) such as docstrings or file names makes it hard to follow the changes being made. I will make sure to avoid doing this mistake in the future. Renaming the utilities was dicussed during a weekly. Let me know if you want me to revert it and address it in a separate PR. Just as a comment, I plan to remove the notebook from here and implement it in a separate PR as we discussed during the weekly, as having them running is a top priority. |
I think doing that in a separate PR (file renames and utils restructure) as a new issue is best practice (and what Franz asked above) |
That would be appreciated - I hope it is not too difficult to do this - since we are squashing pull requests, it is a matter of copy-paste refactor hopefully. Let me know if it turns out to be more difficult than I thought. |
@NennoMP, would you mind describing how the content in this PR has been rearranged across PR (or whether you have decided to keep it in one place) |
hm, there are a number of conflicts now - did we merge in the wrong sequence? |
|
||
print(f"Downloading {name}...") | ||
try: | ||
dataset = load_dataset(f"gcos/pyaptamer-{name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work for dataformats not noticed by huggingface, was trying to make a loader for a fasta file and was going to use this until I noticed this.
So, the content of the PR is pretty much the same, except for training AptaTrans' deep neural network. We implemented that using Data-preprocessing and loading, and overall training schema are still implemented in this PR. Additionally, I need to implement a I updated the PR description to reflect this. |
Changes to look at from last time:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder of change requests above.
- docstrings get changed, shortened or removed. Please revert unless there is a clear reason.
- please revert renaming of modules in
utils
New requests/questions:
AptaTransPipeline
has a paramterprot_words
which does not seem properly described, and its description is discrepant with the example. Further, should there not be a default for this?- I think there should be a
predict
method as well. Is it just a matter of renamingpredict_api
?
Adds training code and training schema for AptaTrans, including code related to data loading and pre-processing. Specifically, training code via
lightning
for AptaTrans' deep neural network has already been implemente din #130. This PR will still involve implementing the same for pretraining AptaTrans' autoe-encoders.This pull requests resolves #49 and is stacked on #63 (which should be merged before this).
Data and model weights
Particularly big files (datasets, model weights) are hosted on GC.OS Huggin Face.
LAST UPDATED: 25-09-2025