Skip to content

Conversation

NennoMP
Copy link
Collaborator

@NennoMP NennoMP commented Aug 10, 2025

Adds training code and training schema for AptaTrans, including code related to data loading and pre-processing. Specifically, training code via lightning for AptaTrans' deep neural network has already been implemente din #130. This PR will still involve implementing the same for pretraining AptaTrans' autoe-encoders.

This pull requests resolves #49 and is stacked on #63 (which should be merged before this).

Data and model weights

Particularly big files (datasets, model weights) are hosted on GC.OS Huggin Face.

LAST UPDATED: 25-09-2025

@NennoMP NennoMP self-assigned this Aug 10, 2025
@NennoMP NennoMP added the enhancement New feature or request label Aug 10, 2025
@NennoMP NennoMP marked this pull request as ready for review August 11, 2025 22:49
@NennoMP NennoMP marked this pull request as draft August 11, 2025 22:49
@@ -0,0 +1,348 @@
"""AptaTrans' deep neural network for aptamer-protein interaction prediction."""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the plan is to make all files private but expose the classes being used in __init__.py? So _model.py, _pipeline.py, and so on

@@ -0,0 +1,26 @@
__author__ = ["nennomp"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be private I think? One more question, what is the reason for creating so many util files and all of them not being in just one file?

Copy link
Collaborator Author

@NennoMP NennoMP Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, I separated them into different purpose modules. Wouldn't we end up with a monolithic utils.py file if we don't separate them by what they do?

Not sure which design choice is better. My idea was that in the future you may add other methods for rna, proteins, etc. and having separate modules would be helpful and more organized. Otherwise, we could put them all into a seq_to_vec_utils.py or something idk

Copy link
Collaborator

@satvshr satvshr Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I had no idea what the purpose for them was (and I still do not have a very concrete idea to be very honest) but I think this is something we can discuss tomorrow (so that you can explain what you are doing and I can follow it if I ever need to)? This is how I was doing it:

  1. _{algorithm}_utils.py for utils depending on one algorithm.
  2. x_to_y for converter functions.

Your idea for specific molecule groups makes sense.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, let's discuss it as an agenda point tomorrow and decide which approach to use going forward.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schedule this for Tuesday then :D

from tqdm import tqdm


class BaseSolver(ABC):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a very important class I do not expect in a PR for AptaTrans, why not:

  1. Move this to a new PR including other classes similar to this which can be useful for other algorithms too.
  2. Rename it? BaseSolver does not seem to imply that it is a base class to train neural networks on.

Copy link
Collaborator Author

@NennoMP NennoMP Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, taking into account your suggestion about using pytorch lightining (which is a good idea), I think the best approach would be to delegate any training to such library.

Maybe I keep the custom solvers for now so that we may have a working AptaTrans sooner rather than later, and then we refactor any training code to lightining.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to talk about tomorrow too I guess :D (I agree with what you are saying though)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agenda getting bigger and bigger for tomorrow 😁.

By the way, I gave a quick look at lightning and we would need to do some wrapping to convert nn.Module to lightning.LightningModule to make the models compatible with lightning training pipelines.

Probably making wrappers but also keeping the pytorch compatible ones could be a good idea, let's see what others think tomorrow.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we just reinvent lightning in this class? I do not particularly mind for now, but we should avoid reinventing the wheel

@NennoMP
Copy link
Collaborator Author

NennoMP commented Aug 18, 2025

@satvshr This is still a draft PR, it's not ready for review.

@satvshr
Copy link
Collaborator

satvshr commented Aug 18, 2025

@satvshr This is still a draft PR, it's not ready for review.

Oh yeah I know, I just had few observations while trying to find your training code given I am trying to do the same for DeepAptamer.

@NennoMP NennoMP requested a review from fkiraly August 18, 2025 18:58
@NennoMP NennoMP marked this pull request as ready for review August 18, 2025 18:59
@NennoMP NennoMP removed the request for review from fkiraly August 18, 2025 19:01
@@ -1,9 +1,14 @@
"""Utils for the pyaptamer package."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just observed this, can you move everything out of utils into their respective files? It breaks the purpose of having util files for separate purposes if we move some functions to root import imo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, in any case we should avoid making unrelated changes in a PR

Copy link
Collaborator

@satvshr satvshr Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we had discussed in the weekly regarding the structure of utils, the main essence of the discussion was that init should only contain imports which are going to be used by users and algorithms for transformations, etc. Not private functions that serves the use case of a single algorithm, in that sense we should also move literally every function (in _aptanet_utils and so on) in utils to init, hence resulting in no structure for the file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why rename this file? Please do not touch files not a part of the current PR :D It is not a private utility, rather its a file format converter helper function. I will bring this up in the next meeting to decide what should be moved to the utils init folder and what should not be, could you specify what the functions currently in the utils/__init__.py file are for and are they general purpose?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you specify what the functions currently in the utils/init.py file are for and are they general purpose?

I meant in this thread (if you mistook it for docstring) sorry for not being clear!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood what you meant, no worries. I will probably address the points during the weekend, not sure if I will be able to do it today.

Thanks for the input.

@@ -0,0 +1,2 @@
aptamer,protein,label
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about structuring the data directory and moving files into their respective paths? SO pdb goes into data/pdb and so on. @fkiraly thoughts?
ps: If you agree and do this, you will have to change the paths of the helper functions too that I added in utils for file conversions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.



class AptaTransPipeline:
"""AptaTrans pipeline for aptamer affinity prediction, by Shin et al.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we changing the docstring?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think references do not display properly in the header - may I suggest to revert?


return (apta_words, prot_words)

def _init_aptamer_experiment(self, target: str) -> Aptamer:
"""Initialize the aptamer experiment."""
# initialize the aptamer recommendation experiment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is redundant since it's saying the same thing of the method's docstring.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we had discussed making the comment the method docstring instead as it provided more clarity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we had discussed making the comment the method docstring instead as it provided more clarity?

Yep, then the inline comment became redundant and I removed it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but it shows you removed the comment and did not edit the method docstring.

Copy link
Collaborator Author

@NennoMP NennoMP Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused. I remember a comment about this method not having a docstring and being suggested to move the inline comment to the docstring. Cannot find it right now though.

Anyway, acknowledging what is being shown it makes sense to me to remove the inline comment as it is redundant no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we discussed was making the comment:"# initialize the aptamer recommendation experiment" the function docstring instead of "Initialize the aptamer experiment.", but what you have done is remove the comment and keep the function docstring unchanged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not an important line but I was still wondering why it gets removed. I do not mind if it gets removed, but it is just strange that it does.

@satvshr
Copy link
Collaborator

satvshr commented Sep 4, 2025

Can you revert changes that do not concern AptaTrans, and if you want to reorganize utils, do it in another PR? We should probably also discuss whether these changes are necessary.

I agree that the scope of this PR has massively increased and such concerns should be moved to a new PR, this (utils structure) was already discussed though in a weekly, as mentioned above.

I would also recommend making the notebook a separate PR dependent on this PR and making it as a priority, as it would help out with benchmarking AptaTrans.

@NennoMP
Copy link
Collaborator Author

NennoMP commented Sep 4, 2025

The neural network code looks great, I am confused about the changes in the utils module. I think we should not be reorganizing existing code in the module in the same PR where we are doing something else.

Can you revert changes that do not concern AptaTrans, and if you want to reorganize utils, do it in another PR? We should probably also discuss whether these changes are necessary.

I now understand how making unnecessary changes (out of the PR scope) such as docstrings or file names makes it hard to follow the changes being made.

I will make sure to avoid doing this mistake in the future. Renaming the utilities was dicussed during a weekly. Let me know if you want me to revert it and address it in a separate PR.

Just as a comment, I plan to remove the notebook from here and implement it in a separate PR as we discussed during the weekly, as having them running is a top priority.

@satvshr
Copy link
Collaborator

satvshr commented Sep 4, 2025

Let me know if you want me to revert it and address it in a separate PR.

I think doing that in a separate PR (file renames and utils restructure) as a new issue is best practice (and what Franz asked above)

@fkiraly
Copy link
Contributor

fkiraly commented Sep 4, 2025

Can you revert changes that do not concern AptaTrans, and if you want to reorganize utils, do it in another PR? We should probably also discuss whether these changes are necessary.

I now understand how making unnecessary changes (out of the PR scope) such as docstrings or file names makes it hard to follow the changes being made.

That would be appreciated - I hope it is not too difficult to do this - since we are squashing pull requests, it is a matter of copy-paste refactor hopefully.

Let me know if it turns out to be more difficult than I thought.

NennoMP added a commit that referenced this pull request Sep 11, 2025
@NennoMP NennoMP mentioned this pull request Sep 11, 2025
2 tasks
NennoMP added a commit that referenced this pull request Sep 11, 2025
@fkiraly
Copy link
Contributor

fkiraly commented Sep 15, 2025

@NennoMP, would you mind describing how the content in this PR has been rearranged across PR (or whether you have decided to keep it in one place)

@satvshr
Copy link
Collaborator

satvshr commented Sep 16, 2025

@NennoMP Can you use the csv loader from #114 when it is approved? In the weekly meeting today Franz mentioned that converting a csv to a DataFrame is not enough and we should make it similar to how sklearn loaders return data. Reference

@fkiraly
Copy link
Contributor

fkiraly commented Sep 16, 2025

hm, there are a number of conflicts now - did we merge in the wrong sequence?


print(f"Downloading {name}...")
try:
dataset = load_dataset(f"gcos/pyaptamer-{name}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work for dataformats not noticed by huggingface, was trying to make a loader for a fasta file and was going to use this until I noticed this.

@NennoMP
Copy link
Collaborator Author

NennoMP commented Sep 25, 2025

@NennoMP, would you mind describing how the content in this PR has been rearranged across PR (or whether you have decided to keep it in one place)

So, the content of the PR is pretty much the same, except for training AptaTrans' deep neural network. We implemented that using lightning in #130, which was also a "test" to see whether we liked or not such an approach. We also removed the jupyter notebook and we now have a dedicated PR for that.

Data-preprocessing and loading, and overall training schema are still implemented in this PR. Additionally, I need to implement a lightning wrapper for pre-training AptaTran's auto-encoders, similarly to what we did in #130. I think this should be done in the scope of this PR and doesn't need a separate one.

I updated the PR description to reflect this.

NennoMP added a commit that referenced this pull request Sep 28, 2025
@NennoMP
Copy link
Collaborator Author

NennoMP commented Sep 28, 2025

Changes to look at from last time:

  • pyaptamer.aptatrans._model_lightning.py: Added a AptaTransEncoderLightning class for defining the pre-training logic (in lightning) of AptaTrans' encoders. Needed as it differs from when you simply train/fine-tune the deep neural network;
  • Consequently, updated and/or added a few tests in pyaptamer.aptatrans.tests/.

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder of change requests above.

  • docstrings get changed, shortened or removed. Please revert unless there is a clear reason.
  • please revert renaming of modules in utils

New requests/questions:

  • AptaTransPipeline has a paramter prot_words which does not seem properly described, and its description is discrepant with the example. Further, should there not be a default for this?
  • I think there should be a predict method as well. Is it just a matter of renaming predict_api?

NennoMP added a commit that referenced this pull request Sep 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AptaTrans: Training code and training schema
3 participants