
Conversation

Satarupa22-SD

@Satarupa22-SD Satarupa22-SD commented Sep 25, 2025

Added loader for aptaDB database from kaggle dataset.
#152

@satvshr
Collaborator

satvshr commented Sep 26, 2025

Kindly add docstrings for all functions (it is hard to review without them), and if the functions are meant to be private (helper functions for the main function), start the function name with an underscore (_function_name). If this PR is still in progress, kindly convert it to a draft PR.

Contributor

@fkiraly fkiraly left a comment

Very nice! Useful dataset.

Could you kindly:

  • ensure public functions (those the users would call) have docstrings, see other parts of the code base?
  • ensure the code passes the linting checks? For this, you can use pre-commit - pip install pre-commit and then pre-commit run (in the root folder)

You can also check code formatting failures by clicking on the "test/code-quality" job below. Using pre-commit is one way to automatically ensure formatting of your local code.

@Satarupa22-SD
Author

@fkiraly @satvshr Please review the changes. Thank you.

@fkiraly fkiraly added the enhancement (New feature or request) label Oct 4, 2025
Contributor

@fkiraly fkiraly left a comment

I think the test file is a copy of the load_aptamer_interactions - did you want to add tests?

Also, can you kindly add a more detailed description in the "returns" parts of all loaders, which columns and rows the user can expect?

Exception
If the download fails for any reason

Notes
Collaborator

GPT tends to generate Notes sections; I don't think we should add additional text to read in a new subsection, but it could be developer preference. What do you feel @fkiraly?

Contributor

I would move it up to the top

)


def _find_csv(directory: Path) -> Path | None:
Collaborator

Could you explain why this function is useful with an example?

Author

This is with reference to the last function, sorry I missed this. In case there are different files in the future with varied filenames, this function helps find the dataset we want, which in this case is aptamer_interactions. Suppose the dataset is extended to complexes as well, with aptamer_interactions.csv and complexes.csv in the same root Kaggle folder; this function would return aptamer_interactions.csv. The reason I added these extended functions is that if the dataset expands, or we get a new dataset with similar data, we can store it under the same Kaggle dataset and not have to repeat the functions; or if we want to narrow the dataset scope, we can store that here too, in which case it would be useful.

target_dir.mkdir(parents=True, exist_ok=True)

# Only download if forced or no CSV files exist
if force_download or not any(target_dir.glob("*.csv")):
Collaborator

I do not understand why the parent directory cannot have other csv files in it. What if the user wants multiple datasets in one directory? Moreover why are we assuming that csv is the only file format one can download from kaggle?

Author

The dataset that we are working with is in CSV format.

Collaborator

Oh I see, can you rename the function to _download_aptadb then? Moreover, your docstring is very generalized, indicating it is meant for all Kaggle datasets; kindly rewrite it to indicate it is for aptaDB only.

Contributor

but the function can download any csv, not just aptadb?

Author

Yes, it's a generic function to download datasets from Kaggle.

Author

The current dataset is a combination of three files from the original aptaDB. This function was created to ensure the correct dataset is selected in case any updates are made to the original dataset (aptaDB), or if we narrow down or expand the scope of the current dataset. Currently it only targets aptamers, but if we include complexes in the future and have a new dataset there, we can update this function as needed.
The default cache structure uses dataset-specific subdirectories (~/.pyaptamer/cache/{dataset_name}/), which automatically isolates different Kaggle datasets in separate folders. When load_aptadb() is called multiple times, it reuses the cached CSV without creating duplicates (unless force_download=True is explicitly used).

Collaborator

The default cache structure

where is this cache structure created?

What if the user wants multiple datasets in one directory?

How is cache coming into play when the user is using our aptadb loader?

Author

If the dataset is not present in the cache, the loader downloads it with the Kaggle API (users have to create their own credentials), saves it to the cache directory, and returns the DataFrame. If the dataset already exists in the cache, the download is skipped and the CSV is loaded from the cache and returned as a DataFrame.
[screenshot attached]

Collaborator

I meant, where is the cache directory coming from? Is this something that is created in the user's workspace once your loader is used?

Author

Yes, it's currently being created in the user's home directory.



def load_aptadb(
dataset_name: str = "satarupadeb/aptamer-interactions",
Collaborator

Reminder: when your PR is ready to be merged, we should move this dataset to an account under gcos.

Author

Yes sure. I think the process is also similar to how we do it on Hugging Face.
@fkiraly does GCOS have an organization on Kaggle?

Collaborator

I could do it once your PR is ready, just remind me please I will get it sorted out 😄

Collaborator

@satvshr satvshr left a comment

It is ok to use AI tools in order to help you with your PRs, but please try understanding the purpose of the functions you are using.

@Satarupa22-SD
Author

@fkiraly I have updated the files, please check. I was editing the docstrings after fixing the linting and seem to have pushed the wrong test file; this is the current version.

@satvshr
Collaborator

satvshr commented Oct 6, 2025

Please address all comments and then re-request a review once they have all been addressed.
[screenshot attached]

@Satarupa22-SD
Author

@fkiraly I have removed the normalise function. I also tested the loader manually.
[screenshot attached]

Contributor

@fkiraly fkiraly left a comment

Thanks! Looks mostly good.
We may have to change things later like location of the datasets.

Can you kindly ensure you comment on my previous requests? And respond to @satvshr's comments?

Other questions: are we sure the tests are run? I do not think the kaggle package is currently installed, so should the tests not fail?

@Satarupa22-SD
Author

Satarupa22-SD commented Oct 9, 2025

@fkiraly the Kaggle package was installed on my system, hence all the tests have passed; there is also an added test case in the test file:

 except ImportError as err:
    raise ImportError(
        "The 'kaggle' package is required to download datasets. "
        "Install it with: pip install kaggle"
    ) from err

Additionally, all of @satvshr's comments were mostly about the docstring structure, which I have already updated; if you notice, the comments are still on the outdated version.
Should I add Kaggle as a dependency?

@Satarupa22-SD Satarupa22-SD requested a review from fkiraly October 9, 2025 17:12
@satvshr
Collaborator

satvshr commented Oct 11, 2025

additionally all of Satvshr's comments were about the docstring structure mostly, which I have already updated. If you notice, the comments are still on the outdated version,

Not all of them are on the outdated version, if you have a look. There is one comment asking about a function you created. Kindly try to address all comments before asking for a re-review.

Should I add Kaggle as a dependency?

Yes.

@satvshr
Collaborator

satvshr commented Oct 11, 2025

@Satarupa22-SD Thanks for the explanations and the PR! You seem to be putting in a lot of "fail-safes" in case we extend the aptadb loader, I would recommend that we do that once we want to extend it. Hence, we should remove code that will not be useful currently in the scope of this PR.

@Satarupa22-SD
Author

@satvshr I understand your point, but the true usability of the dataset will only come into the picture once we start with the experiments. The idea is to have a single dataset repo for easy maintenance; most of these functions are for future modification of this dataset as per our needs. I feel we should keep the CSV functions as they are for now. Let's test the dataset first, and then if we don't need certain functions we can always remove them.

@satvshr
Collaborator

satvshr commented Oct 12, 2025

@Satarupa22-SD Having code that is not being used in the repo is bad design and just "bloat", if that makes sense. I understand that the "true usability of the dataset will only come into picture once we start with the experiments", but it doesn't make sense to be preparing for something that may or may not happen in the future. I am not only talking about your CSV function here; I mean any code that is being put there for future purposes. I think you mentioned it twice above, if I am not wrong, here and here.

@Satarupa22-SD
Author

@satvshr so what are your suggestions? Which functions do you feel need to be removed at this point?

@satvshr
Collaborator

satvshr commented Oct 13, 2025

which functions do you feel need to be removed atp?

Only keep code that is required to load the dataset you want to load, which is a "combination of 3 files" according to you. Which exact functions to remove is known to you (given it's your code), though I would start with the _find_csv function, solely based on the way you described it.
