
Conversation

Satarupa22-SD

@Satarupa22-SD Satarupa22-SD commented Sep 25, 2025

Added loader for aptaDB database from kaggle dataset.
#152

@satvshr
Collaborator

satvshr commented Sep 26, 2025

Kindly add docstrings for all functions (it is hard to review without them), and if the functions are meant to be private (helper functions for the main function), start the function name with an underscore (_function_name). If this PR is still in progress, kindly convert it to a draft PR.

Contributor

@fkiraly fkiraly left a comment

Very nice! Useful dataset.

Could you kindly:

  • ensure public functions (those the users would call) have docstrings, see other parts of the code base?
  • ensure the code passes the linting checks? For this, you can use pre-commit - pip install pre-commit and then pre-commit run (in the root folder)

You can also check code formatting failures by clicking on the "test/code-quality" job below. Using pre-commit is one way to automatically ensure formatting of your local code.

@Satarupa22-SD
Author

@fkiraly @satvshr Please review the changes. Thank you.

@fkiraly fkiraly added the enhancement (New feature or request) label Oct 4, 2025
Contributor

@fkiraly fkiraly left a comment

I think the test file is a copy of the load_aptamer_interactions - did you want to add tests?

Also, can you kindly add a more detailed description in the "returns" parts of all loaders, which columns and rows the user can expect?

Exception
If the download fails for any reason

Notes
Collaborator

GPT tends to generate Notes sections; I don't think we should add additional text to read in a new subsection, but it could be developer preference. What do you feel @fkiraly?

Contributor

I would move it up to the top

)


def _find_csv(directory: Path) -> Path | None:
Collaborator

Could you explain why this function is useful with an example?

Author

This is with reference to the last function, sorry I missed this. In case there are different files in the future with varied filenames, this function helps find the dataset we want, which in this case is aptamer_interactions. Suppose the dataset is extended to complexes as well, with aptamer_interactions.csv and complexes.csv in the same root Kaggle folder; this function would return aptamer_interactions.csv. The reason I added these extended functions is that if the dataset expands, or we get a new dataset with similar data, we can store it under the same Kaggle dataset and not have to repeat the functions; or if we want to narrow the dataset scope, we can store that here too, in which case it would be useful.

target_dir.mkdir(parents=True, exist_ok=True)

# Only download if forced or no CSV files exist
if force_download or not any(target_dir.glob("*.csv")):
Collaborator

I do not understand why the parent directory cannot have other csv files in it. What if the user wants multiple datasets in one directory? Moreover why are we assuming that csv is the only file format one can download from kaggle?

Author

The dataset that we are working with is in CSV format.

Collaborator

Oh I see, can you rename the function to _download_aptadb then? Moreover, your docstring is very generalized, indicating it is meant for all Kaggle datasets; kindly rewrite it to indicate it is for aptaDB only.

Contributor

but the function can download any csv, not just aptadb?

Author

Yes, it's a generic function to download datasets from Kaggle.

Author

The current dataset is a combination of three files from the original aptaDB. This function was created to ensure the correct dataset is selected in case any updates are made to the original dataset (aptaDB), or if we narrow down or expand the scope of the current dataset. Currently it only targets aptamers, but if we include complexes in the future and have a new dataset there, we can update this function as needed.
The default cache structure uses dataset-specific subdirectories (~/.pyaptamer/cache/{dataset_name}/), which automatically isolates different Kaggle datasets in separate folders. When load_aptadb() is called multiple times, it reuses the cached CSV without creating duplicates (unless force_download=True is explicitly used).

Collaborator

The default cache structure

where is this cache structure created?

What if the user wants multiple datasets in one directory?

How is cache coming into play when the user is using our aptadb loader?

Author

If the dataset is not present in the cache, the loader downloads it with the Kaggle API (users have to create their own credentials), saves it to the cache directory, and returns the DataFrame. If the dataset already exists in the cache, the download is skipped and the CSV is loaded from the cache and returned as a DataFrame.
[screenshot attached]

Collaborator

I meant, where is the cache directory coming from? Is this something that is created in the user's workspace once your loader is used?

Author

Yes, it's currently being created in the user's home directory.



def load_aptadb(
dataset_name: str = "satarupadeb/aptamer-interactions",
Collaborator

Reminder: when your PR is ready to be merged, we should move this dataset to an account under gcos.

Author

Yes sure. I think the process is also similar to how we do it on Hugging Face.
@fkiraly does GCOS have an organization on Kaggle?

Collaborator

I could do it once your PR is ready, just remind me please I will get it sorted out 😄

Collaborator

@satvshr satvshr left a comment

It is ok to use AI tools in order to help you with your PRs, but please try understanding the purpose of the functions you are using.

@Satarupa22-SD
Author

@fkiraly I have updated the files, please check. I was editing the docstrings after fixing the linting and seem to have pushed the wrong test file; this is the current version.

@satvshr
Collaborator

satvshr commented Oct 6, 2025

Please address all comments and then re-request a review once they have all been addressed.
[screenshot attached]

@Satarupa22-SD
Author

@fkiraly I have removed the normalise function. I also tested the loader manually.
[screenshot attached]

Contributor

@fkiraly fkiraly left a comment

Thanks! Looks mostly good.
We may have to change things later like location of the datasets.

Can you kindly ensure you comment on my previous requests? And respond to @satvshr's comments?

Other questions: are we sure the tests are run? I do not think the kaggle package is currently installed, so should the tests not fail?

@Satarupa22-SD
Author

Satarupa22-SD commented Oct 9, 2025

@fkiraly the Kaggle package was installed on my system, hence all the tests have passed; there is also an added test case in the test file:

 except ImportError as err:
    raise ImportError(
        "The 'kaggle' package is required to download datasets. "
        "Install it with: pip install kaggle"
    ) from err

Additionally, all of @satvshr's comments were mostly about the docstring structure, which I have already updated; if you notice, the comments are still on the outdated version.
Should I add Kaggle as a dependency?

@Satarupa22-SD Satarupa22-SD requested a review from fkiraly October 9, 2025 17:12
@satvshr
Collaborator

satvshr commented Oct 11, 2025

additionally all of Satvshr's comments were about the docstring structure mostly, which I have already updated. If you notice, the comments are still on the outdated version,

Not all of them are on the outdated version, if you have a look. There is one comment asking about a function you created. Kindly try to address all comments before asking for a re-review.

Should I add Kaggle as a dependency?

Yes.

@satvshr
Collaborator

satvshr commented Oct 11, 2025

@Satarupa22-SD Thanks for the explanations and the PR! You seem to be putting in a lot of "fail-safes" in case we extend the aptadb loader, I would recommend that we do that once we want to extend it. Hence, we should remove code that will not be useful currently in the scope of this PR.

@Satarupa22-SD
Author

@satvshr I understand your point, but the true usability of the dataset will only come into the picture once we start with the experiments. The idea is to have a single dataset repo for easy maintenance; most of these functions are for future modification of this dataset as per our needs. I feel we should keep the CSV functions as they are for now. Let's test the dataset first, and then if we don't need certain functions we can always remove them.

@satvshr
Collaborator

satvshr commented Oct 12, 2025

@Satarupa22-SD Having code that is not being used in the repo is bad design and just "bloat", if that makes sense. I understand that the "true usability of the dataset will only come into picture once we start with the experiments", but it doesn't make sense to be preparing for something that may or may not happen in the future. I am not only talking about your CSV function here; I mean any code that is being put there for future purposes. I think you mentioned it twice above, if I am not wrong, here and here.

@Satarupa22-SD
Author

@satvshr so what are your suggestions? Which functions do you feel need to be removed at this point?

@satvshr
Collaborator

satvshr commented Oct 13, 2025

which functions do you feel need to be removed atp?

Only keep code that is required to load the dataset you want to load, which is a "combination of 3 files" according to you. Which exact functions to remove is known to you (given it's your code), though I would start with the _find_csv function, solely based on the way you described it.
