[ENH] AptaTrans: training code and training schema #92

NennoMP · 2025-08-10T17:44:21Z

Adds training code and training schema for AptaTrans, including code related to data loading and pre-processing. Note that training code via lightning for AptaTrans' deep neural network has already been implemented in #130. This PR will still involve implementing the same for pretraining AptaTrans' auto-encoders.

This pull request resolves #49 and is stacked on #63 (which should be merged before this).

Data and model weights

Particularly big files (datasets, model weights) are hosted on GC.OS Hugging Face.

LAST UPDATED: 25-09-2025

satvshr · 2025-08-18T13:29:00Z

pyaptamer/aptatrans/model.py

@@ -0,0 +1,348 @@
+"""AptaTrans' deep neural network for aptamer-protein interaction prediction."""
+


I think the plan is to make all files private but expose the classes being used in __init__.py? So _model.py, _pipeline.py, and so on
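For illustration, the private-module layout suggested here can be sketched as follows. This is a minimal sketch, not the actual pyaptamer layout: the package name `demo_aptatrans` and the class `DemoModel` are hypothetical placeholders, and the package is built in a temporary directory only so the example can run on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk so the layout can be imported and inspected.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_aptatrans"  # hypothetical package name
pkg.mkdir()

# Private module: the leading underscore marks it as an implementation detail.
(pkg / "_model.py").write_text(
    "class DemoModel:\n"
    "    '''Placeholder standing in for a model class.'''\n"
)

# __init__.py re-exports only the classes meant for users.
(pkg / "__init__.py").write_text(
    "from demo_aptatrans._model import DemoModel\n"
    "\n"
    "__all__ = ['DemoModel']\n"
)

sys.path.insert(0, str(root))
demo = importlib.import_module("demo_aptatrans")

# Users import from the package root; the class still lives in the private module.
print(demo.DemoModel.__module__)  # demo_aptatrans._model
print(demo.__all__)  # ['DemoModel']
```

With this layout the file names can change later without breaking user imports, since only the names listed in `__init__.py` are part of the public surface.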

satvshr · 2025-08-18T13:31:07Z

pyaptamer/utils/augment.py

@@ -0,0 +1,26 @@
+__author__ = ["nennomp"]


It should be private I think? One more question, what is the reason for creating so many util files rather than keeping them all in one file?

Conceptually, I separated them into different purpose modules. Wouldn't we end up with a monolithic utils.py file if we don't separate them by what they do?

Not sure which design choice is better. My idea was that in the future you may add other methods for rna, proteins, etc. and having separate modules would be helpful and more organized. Otherwise, we could put them all into a seq_to_vec_utils.py or something idk

Ah ok, I had no idea what the purpose for them was (and I still do not have a very concrete idea to be very honest) but I think this is something we can discuss tomorrow (so that you can explain what you are doing and I can follow it if I ever need to)? This is how I was doing it:

_{algorithm}_utils.py for utils depending on one algorithm.

x_to_y for converter functions.

Your idea for specific molecule groups makes sense.

Yep, let's discuss it as an agenda point tomorrow and decide which approach to use going forward.

Schedule this for Tuesday then :D

satvshr · 2025-08-18T13:37:37Z

pyaptamer/base/_base_solver.py

+from tqdm import tqdm
+
+
+class BaseSolver(ABC):


Seems like a very important class I do not expect in a PR for AptaTrans, why not:

Move this to a new PR including other classes similar to this which can be useful for other algorithms too.

Rename it? BaseSolver does not seem to imply that it is a base class to train neural networks on.

Actually, taking into account your suggestion about using PyTorch Lightning (which is a good idea), I think the best approach would be to delegate any training to that library.

Maybe I keep the custom solvers for now so that we may have a working AptaTrans sooner rather than later, and then we refactor any training code to lightning.

Something to talk about tomorrow too I guess :D (I agree with what you are saying though)

Agenda getting bigger and bigger for tomorrow 😁.

By the way, I gave a quick look at lightning and we would need to do some wrapping to convert nn.Module to lightning.LightningModule to make the models compatible with lightning training pipelines.

Probably making wrappers but also keeping the pytorch compatible ones could be a good idea, let's see what others think tomorrow.

did we just reinvent lightning in this class? I do not particularly mind for now, but we should avoid reinventing the wheel

NennoMP · 2025-08-18T13:56:43Z

@satvshr This is still a draft PR, it's not ready for review.

satvshr · 2025-08-18T15:15:11Z

@satvshr This is still a draft PR, it's not ready for review.

Oh yeah I know, I just had a few observations while trying to find your training code, given I am trying to do the same for DeepAptamer.

satvshr · 2025-08-19T19:42:38Z

pyaptamer/utils/__init__.py

@@ -1,9 +1,14 @@
 """Utils for the pyaptamer package."""


Just observed this, can you move everything out of utils into their respective files? It breaks the purpose of having util files for separate purposes if we move some functions to root import imo

I disagree, in any case we should avoid making unrelated changes in a PR

This is something we discussed in the weekly regarding the structure of utils. The essence of the discussion was that init should only contain imports that are going to be used by users and algorithms for transformations, etc., not private functions that serve the use case of a single algorithm. Otherwise, by the same logic, we would also have to move literally every function in utils (in _aptanet_utils and so on) to init, leaving the module with no structure.

satvshr · 2025-08-19T19:46:04Z

pyaptamer/utils/_struct_to_aaseq.py

Why rename this file? Please do not touch files that are not part of the current PR :D It is not a private utility; rather, it's a file format converter helper function. I will bring this up in the next meeting to decide what should be moved to the utils init file and what should not. Could you specify what the functions currently in the utils/__init__.py file are for, and whether they are general purpose?

could you specify what the functions currently in the utils/__init__.py file are for and are they general purpose?

I meant in this thread (if you mistook it for docstring) sorry for not being clear!

I understood what you meant, no worries. I will probably address the points during the weekend, not sure if I will be able to do it today.

Thanks for the input.

satvshr · 2025-08-22T08:14:28Z

pyaptamer/datasets/data/dummy_data.csv

@@ -0,0 +1,2 @@
+aptamer,protein,label


What do you think about structuring the data directory and moving files into their respective paths? So pdb goes into data/pdb and so on. @fkiraly thoughts?
PS: If you agree and do this, you will also have to change the paths in the helper functions that I added in utils for file conversions.

Makes sense.

fkiraly · 2025-09-04T20:30:47Z

pyaptamer/aptatrans/_pipeline.py



 class AptaTransPipeline:
-    """AptaTrans pipeline for aptamer affinity prediction, by Shin et al.


why are we changing the docstring?

I think references do not display properly in the header - may I suggest to revert?

fkiraly · 2025-09-04T20:31:17Z

pyaptamer/aptatrans/_pipeline.py


        return (apta_words, prot_words)

    def _init_aptamer_experiment(self, target: str) -> Aptamer:
        """Initialize the aptamer experiment."""
-        # initialize the aptamer recommendation experiment


why remove this comment?

I think it is redundant since it says the same thing as the method's docstring.

I think we had discussed making the comment the method docstring instead as it provided more clarity?

I think we had discussed making the comment the method docstring instead as it provided more clarity?

Yep, then the inline comment became redundant and I removed it.

Yeah but it shows you removed the comment and did not edit the method docstring.

I am confused. I remember a comment about this method not having a docstring and being suggested to move the inline comment to the docstring. Cannot find it right now though.

Anyway, given what is being shown, it makes sense to me to remove the inline comment as it is redundant, no?

I think what we discussed was making the comment "# initialize the aptamer recommendation experiment" the function docstring instead of "Initialize the aptamer experiment.", but what you have done is remove the comment and keep the function docstring unchanged.

this is not an important line but I was still wondering why it gets removed. I do not mind if it gets removed, but it is just strange that it does.

satvshr · 2025-09-04T20:38:11Z

Can you revert changes that do not concern AptaTrans, and if you want to reorganize utils, do it in another PR? We should probably also discuss whether these changes are necessary.

I agree that the scope of this PR has massively increased and such concerns should be moved to a new PR, this (utils structure) was already discussed though in a weekly, as mentioned above.

I would also recommend making the notebook a separate PR dependent on this PR and making it a priority, as it would help out with benchmarking AptaTrans.

NennoMP · 2025-09-04T20:40:42Z

The neural network code looks great, I am confused about the changes in the utils module. I think we should not be reorganizing existing code in the module in the same PR where we are doing something else.

Can you revert changes that do not concern AptaTrans, and if you want to reorganize utils, do it in another PR? We should probably also discuss whether these changes are necessary.

I now understand how making unnecessary changes (out of the PR scope) such as docstrings or file names makes it hard to follow the changes being made.

I will make sure to avoid making this mistake in the future. Renaming the utilities was discussed during a weekly. Let me know if you want me to revert it and address it in a separate PR.

Just as a comment, I plan to remove the notebook from here and implement it in a separate PR as we discussed during the weekly, as having them running is a top priority.

satvshr · 2025-09-04T20:44:46Z

Let me know if you want me to revert it and address it in a separate PR.

I think doing that in a separate PR (file renames and utils restructure) as a new issue is best practice (and what Franz asked above)

fkiraly · 2025-09-04T22:41:36Z

Can you revert changes that do not concern AptaTrans, and if you want to reorganize utils, do it in another PR? We should probably also discuss whether these changes are necessary.

I now understand how making unnecessary changes (out of the PR scope) such as docstrings or file names makes it hard to follow the changes being made.

That would be appreciated - I hope it is not too difficult to do this - since we are squashing pull requests, it is a matter of copy-paste refactor hopefully.

Let me know if it turns out to be more difficult than I thought.

fkiraly · 2025-09-15T08:33:07Z

@NennoMP, would you mind describing how the content in this PR has been rearranged across PRs (or whether you have decided to keep it in one place)?

satvshr · 2025-09-16T15:06:41Z

@NennoMP Can you use the csv loader from #114 when it is approved? In the weekly meeting today Franz mentioned that converting a csv to a DataFrame is not enough and we should make it similar to how sklearn loaders return data. Reference
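For context, scikit-learn loaders such as `load_iris` return a dictionary-like object with `data` and `target` fields, or a plain `(X, y)` tuple when called with `return_X_y=True`. A minimal sketch of that convention applied to the `aptamer,protein,label` columns of the dummy CSV might look like the following. The function name `load_interactions` and the integer labels are assumptions made for the example; this is not the loader from #114.

```python
import csv
import io


def load_interactions(source, return_X_y=False):
    """Hypothetical sklearn-style loader for aptamer-protein interaction rows."""
    rows = list(csv.DictReader(source))
    X = [(r["aptamer"], r["protein"]) for r in rows]
    y = [int(r["label"]) for r in rows]  # assumes labels are stored as integers
    if return_X_y:
        return X, y
    # Dictionary mimicking the fields of an sklearn Bunch.
    return {"data": X, "target": y, "feature_names": ["aptamer", "protein"]}


# In-memory stand-in for a CSV file with the same header as dummy_data.csv.
demo_csv = io.StringIO("aptamer,protein,label\nACGU,MKV,1\nGGCA,MLS,0\n")
X, y = load_interactions(demo_csv, return_X_y=True)
print(X)  # [('ACGU', 'MKV'), ('GGCA', 'MLS')]
print(y)  # [1, 0]
```

Separating features from targets at load time means downstream code does not need to know the CSV column names.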

fkiraly · 2025-09-16T16:42:59Z

hm, there are a number of conflicts now - did we merge in the wrong sequence?

satvshr · 2025-09-22T18:59:00Z

pyaptamer/datasets/_loaders/_hf_loader.py

+
+    print(f"Downloading {name}...")
+    try:
+        dataset = load_dataset(f"gcos/pyaptamer-{name}")


This will not work for data formats not recognized by Hugging Face. I was trying to make a loader for a fasta file and was going to use this until I noticed that.
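One possible workaround, sketched under assumptions: fetch the raw file separately (for example with `hf_hub_download` from `huggingface_hub`, which downloads a single file from a repository) and parse it locally. The parser below is a hypothetical stdlib-only helper, not part of pyaptamer, and it is exercised on an in-memory string rather than a downloaded file.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


demo = ">seq1 demo\nACGU\nGGCA\n>seq2\nUUAA\n"
sequences = parse_fasta(demo)
print(sequences)  # {'seq1 demo': 'ACGUGGCA', 'seq2': 'UUAA'}
```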

NennoMP · 2025-09-25T20:23:36Z

@NennoMP, would you mind describing how the content in this PR has been rearranged across PRs (or whether you have decided to keep it in one place)?

So, the content of the PR is pretty much the same, except for training AptaTrans' deep neural network. We implemented that using lightning in #130, which was also a "test" to see whether or not we liked such an approach. We also removed the jupyter notebook and we now have a dedicated PR for that.

Data preprocessing and loading, and the overall training schema, are still implemented in this PR. Additionally, I need to implement a lightning wrapper for pre-training AptaTrans' auto-encoders, similarly to what we did in #130. I think this should be done in the scope of this PR and doesn't need a separate one.

I updated the PR description to reflect this.

NennoMP · 2025-09-28T16:23:52Z

Changes to look at from last time:

pyaptamer.aptatrans._model_lightning.py: Added an AptaTransEncoderLightning class for defining the pre-training logic (in lightning) of AptaTrans' encoders. Needed as it differs from when you simply train/fine-tune the deep neural network;
Consequently, updated and/or added a few tests in pyaptamer.aptatrans.tests/.

fkiraly

Reminder of change requests above.

docstrings get changed, shortened or removed. Please revert unless there is a clear reason.
please revert renaming of modules in utils

New requests/questions:

AptaTransPipeline has a parameter prot_words which does not seem properly described, and its description is inconsistent with the example. Further, should there not be a default for this?
I think there should be a predict method as well. Is it just a matter of renaming predict_api?

NennoMP added 2 commits August 10, 2025 19:37

Add dataloaders for csv and hugging face

660a9b5

Update .gitignore

e1d8a6a

NennoMP self-assigned this Aug 10, 2025

NennoMP added the enhancement New feature or request label Aug 10, 2025

NennoMP added 2 commits August 10, 2025 19:46

Update .gitignore

bd7cdf4

Add first version of solvers

c7366bf

NennoMP marked this pull request as ready for review August 11, 2025 22:49

NennoMP marked this pull request as draft August 11, 2025 22:49

NennoMP added 5 commits August 12, 2025 00:55

Keep updated with #63

a5c3e35

Add data preprocessing for RNA pretraining

1ab8c11

Run pre-commit

a0ebbd6

Complete data loading for pretraining, extend example notebook

aa65fd0

Run pre-commit

84da30d

satvshr reviewed Aug 18, 2025

Make all modules private, ready for lightning studio

5c611da

NennoMP added 3 commits August 18, 2025 19:44

Add pretrained weights loading for AptaTrans; add tests

152a79e

Add pretrained weights loading for AptaTrans; add tests

5c4112a

Add pretrained weights loading for AptaTrans; add tests

26ba9e6

NennoMP requested a review from fkiraly August 18, 2025 18:58

NennoMP marked this pull request as ready for review August 18, 2025 18:59

NennoMP removed the request for review from fkiraly August 18, 2025 19:01

satvshr reviewed Aug 19, 2025

satvshr reviewed Aug 22, 2025

NennoMP added 2 commits August 24, 2025 17:50

Merge branch 'main' into feature/49-aptatrans-training-schema

ac6f7f5

Merge

cd90141

fkiraly reviewed Sep 4, 2025

NennoMP added 4 commits September 11, 2025 23:37

Revert renaming of utils

cb21c89

Merge branch 'main' into feature/49-aptatrans-training-schema

3aa03df

Run pre-commit

0453cff

Fix bug in tests

0bff5a9

NennoMP added a commit that referenced this pull request Sep 11, 2025

Stack on #92 and #119

794e7b1

NennoMP mentioned this pull request Sep 11, 2025

AptaTrans jupyter notebook #146

Open

2 tasks

NennoMP added a commit that referenced this pull request Sep 11, 2025

Fix missing dependency from stack on #92

8e967f9

NennoMP added 2 commits September 12, 2025 00:28

Update rna2vec

7d8353b

Fix bug in docstring

443568e

satvshr reviewed Sep 22, 2025

satvshr mentioned this pull request Sep 22, 2025

[DISCUSSION] Hugging face loaders #154

Open

Resolve merge conflicts

1febecf

Add training lightning wrapper for AptaTrans' encoders

12e1397

NennoMP added a commit that referenced this pull request Sep 28, 2025

Update and stack on #92

9b2822d

NennoMP added 2 commits September 29, 2025 16:02

Fix a few bugs

1bd6978

Update rna2vec to handle short/long sequences (padding/truncation)

8b720bd

fkiraly requested changes Sep 29, 2025

Fix AptaTransPipeline docstrings, rename predict method

4e2a16c

NennoMP added a commit that referenced this pull request Sep 29, 2025

Update stack on #92

cda151c
