Add aptadb loader #159
base: main
Changes from all commits: 389dd07, 9eceaaa, 478087d, 1873003, cafe7e5, 8b2982a, 08c1051, 9a5fb63, ba79a32
@@ -0,0 +1,223 @@
__author__ = "Satarupa22-SD"
__all__ = ["load_aptadb", "load_aptamer_interactions", "load_interactions"]

from pathlib import Path

import pandas as pd


def _download_dataset(
    dataset_name: str, target_dir: Path, force_download: bool = False
) -> None:
    """Download a Kaggle dataset to the specified directory and unzip it.

    Parameters
    ----------
    dataset_name : str
        Kaggle dataset identifier like "username/dataset-name".
    target_dir : Path
        Directory to download and extract the dataset into.
    force_download : bool, default False
        If True, download even if CSV files already exist in target_dir.

    Raises
    ------
    ImportError
        If the kaggle package is not installed.
    Exception
        If the download fails for any reason.

    Notes
    -----
    Requires the kaggle package to be installed and configured with API
    credentials.
    """

[Review thread on the "Notes" docstring section]
- GPT tends to generate Notes sections. I don't think we should add additional text to read in a new subsection, but it could be developer preference. What do you feel, @fkiraly?
- I would move it up to the top.

    import kaggle  # avoid import-time auth

    target_dir.mkdir(parents=True, exist_ok=True)

    # Only download if forced or no CSV files exist
    if force_download or not any(target_dir.glob("*.csv")):
[Review thread on the CSV-only check]
- I do not understand why the parent directory cannot have other CSV files in it. What if the user wants multiple datasets in one directory? Moreover, why are we assuming that CSV is the only file format one can download from Kaggle?
- The dataset that we are working with is in CSV format.
- Oh I see, can you rename the function to
- But the function can download any CSV, not just AptaDB?
- Yes, it's a generic function to download datasets from Kaggle.
- The current dataset is a combination of three files from the original AptaDB. This has been created to ensure the correct dataset is selected in case any updates are made to the original dataset (AptaDB), or if we narrow down or expand the scope of the current dataset. Currently it only targets aptamers, but if we include complexes in future and we have a new dataset there, we can update this function as needed.

[Review thread on caching]
- Where is this cache structure created? How is the cache coming into play when the user is using our AptaDB loader?
- I meant: where is the cache directory coming from? Is this something that is created in the user's workspace once your loader is used?
- Yes, it's currently being created in the user's home directory.
        kaggle.api.dataset_download_files(
            dataset_name, path=str(target_dir), unzip=True
        )

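For illustration, a minimal usage sketch of the helper above. It assumes Kaggle API credentials are already configured (~/.kaggle/kaggle.json); the dataset name and target path mirror the defaults used by load_aptadb further down.

    from pathlib import Path

    target = Path.home() / ".pyaptamer" / "cache" / "satarupadeb_aptamer-interactions"
    _download_dataset("satarupadeb/aptamer-interactions", target)  # downloads and unzips
    _download_dataset("satarupadeb/aptamer-interactions", target)  # no-op: a *.csv already exists
    _download_dataset(
        "satarupadeb/aptamer-interactions", target, force_download=True
    )  # re-downloads over the cached copy
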
def _find_csv(directory: Path) -> Path | None:

[Review thread on _find_csv]
- Could you explain why this function is useful, with an example?
- This is with reference to the last function; sorry, I missed this. In case there are different files in future with varied filenames, this function helps to find the dataset we want, which in this case is aptamer_interactions. Suppose the dataset is extended to complexes as well: with aptamer_interactions.csv and complexes.csv in the same root Kaggle folder, this function would return aptamer_interactions.csv. The reason I have added these extended functions is that if the dataset expands, or we get a new dataset with similar data, we can store it under the same Kaggle dataset and don't have to repeat the functions; or if we want to narrow the dataset scope, we can store that here too, in which case it would be useful.

"""Return the most appropriate CSV file path from a directory. | ||
|
||
Parameters | ||
---------- | ||
directory : Path | ||
Directory to look for CSV files. | ||
|
||
Returns | ||
------- | ||
Path or None | ||
Path to CSV file or None if none found. | ||
|
||
Notes | ||
----- | ||
Preference order: | ||
1. If only one CSV, return it. | ||
2. If multiple, prefer files with "aptamer", "interaction", "main", or "data" | ||
in name. | ||
3. Otherwise, return first CSV found. | ||
""" | ||
csv_files = list(directory.glob("*.csv")) | ||
|
||
if not csv_files: | ||
return None | ||
|
||
if len(csv_files) == 1: | ||
return csv_files[0] | ||
|
||
preferred_keywords = ["aptamer", "interaction", "main", "data"] | ||
candidates = [ | ||
f | ||
for f in csv_files | ||
if any(keyword in f.name.lower() for keyword in preferred_keywords) | ||
] | ||
|
||
return candidates[0] if candidates else csv_files[0] | ||
|
||
|
||
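To make the selection rule concrete, here is a small sketch using the filenames from the review thread above; the temporary directory and empty files are purely illustrative.

    import tempfile
    from pathlib import Path

    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "complexes.csv").touch()
        (d / "aptamer_interactions.csv").touch()
        # "aptamer" (and "interaction") match the preferred keywords,
        # so this file wins over complexes.csv.
        print(_find_csv(d).name)  # aptamer_interactions.csv
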
def load_aptamer_interactions(
    path: str | Path,
    *,
    encoding: str | None = None,
    **read_csv_kwargs,
) -> pd.DataFrame:
    """Load aptamer interactions CSV into a pandas DataFrame.

    Tries common encodings automatically for robust loading.

    Parameters
    ----------
    path : str or Path
        Path to CSV file with aptamer interactions.
    encoding : str, optional
        Specific file encoding to use. If None, tries common encodings.
    **read_csv_kwargs
        Additional arguments passed to pandas.read_csv().

    Returns
    -------
    pd.DataFrame
        DataFrame with aptamer interaction data.

    Raises
    ------
    RuntimeError
        If the CSV cannot be read with any attempted encoding.

    Notes
    -----
    Encodings tried (in order): utf-8, utf-8-sig, latin-1, cp1252, windows-1252.
    """
    candidate_encodings = (
        [
            "utf-8",
            "utf-8-sig",
            "latin-1",
            "cp1252",
            "windows-1252",
        ]
        if encoding is None
        else [encoding]
    )

    last_error: Exception | None = None

    for enc in candidate_encodings:
        try:
            df = pd.read_csv(path, encoding=enc, **read_csv_kwargs)
            return df
        except Exception as e:
            last_error = e
            continue

    raise RuntimeError(
        f"Failed to read CSV {path} with encodings {candidate_encodings}: {last_error}"
    )

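A brief sketch of how the encoding fallback is meant to be used; the file path here is hypothetical.

    # Auto-detection: utf-8 is tried first, then utf-8-sig, latin-1, cp1252,
    # windows-1252, stopping at the first encoding that parses.
    df = load_aptamer_interactions("aptamer_interactions.csv")

    # Pin a known encoding and forward extra pandas.read_csv() options.
    df = load_aptamer_interactions(
        "aptamer_interactions.csv", encoding="latin-1", sep=","
    )
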
def load_interactions(
    path: str | Path,
    *,
    encoding: str | None = None,
    **read_csv_kwargs,
) -> pd.DataFrame:
    """Alias for load_aptamer_interactions with the same parameters and return."""
    return load_aptamer_interactions(
        path=path,
        encoding=encoding,
        **read_csv_kwargs,
    )


def load_aptadb(
    dataset_name: str = "satarupadeb/aptamer-interactions",

[Review thread on the default dataset_name]
- Reminder: when your PR is ready to be merged, we should move this dataset to an account under gcos.
- Yes, sure. I think the process is also similar to how we do it on Hugging Face.
- I could do it once your PR is ready; just remind me please and I will get it sorted out 😄

    cache_dir: str | Path | None = None,
    force_download: bool = False,
    *,
    encoding: str | None = None,
    **kwargs,
) -> pd.DataFrame:
    """Download and load the aptamer-interactions dataset from Kaggle as a DataFrame.

    Parameters
    ----------
    dataset_name : str, optional
        Kaggle dataset name.
    cache_dir : str or Path, optional
        Local directory for caching dataset files.
    force_download : bool, default False
        If True, download the dataset even if cached files exist.
    encoding : str, optional
        Encoding for CSV file loading.
    **kwargs
        Additional arguments passed to the CSV loader.

    Returns
    -------
    pd.DataFrame
        Loaded dataset as a pandas DataFrame.

    Raises
    ------
    ImportError
        If the 'kaggle' package is missing.
    RuntimeError
        If the dataset download fails.
    FileNotFoundError
        If no CSV file is found after download.
    """
    if cache_dir is None:
        cache_dir = (
            Path.home() / ".pyaptamer" / "cache" / dataset_name.replace("/", "_")
        )
    else:
        cache_dir = Path(cache_dir)

    csv_file = _find_csv(cache_dir) if cache_dir.exists() else None

    # also enter this branch when force_download is requested, so a cached
    # copy is actually refreshed as the docstring promises
    if csv_file is None or force_download:
        try:
            _download_dataset(dataset_name, cache_dir, force_download=force_download)
        except ImportError as err:
            raise ImportError(
                "The 'kaggle' package is required to download datasets. "
                "Install it with: pip install kaggle"
            ) from err
        except Exception as e:
            raise RuntimeError(
                f"Failed to download dataset '{dataset_name}' from Kaggle: {e}"
            ) from e

        csv_file = _find_csv(cache_dir)
        if csv_file is None:
            raise FileNotFoundError(
                f"No CSV files found in downloaded Kaggle dataset at {cache_dir}"
            )

    return load_aptamer_interactions(path=str(csv_file), encoding=encoding, **kwargs)
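
A hedged end-to-end sketch of the public entry point; the explicit cache_dir value is illustrative, and the first call assumes working Kaggle credentials.

    # First call downloads to ~/.pyaptamer/cache/satarupadeb_aptamer-interactions/
    # and caches the CSV there; later calls reuse the cached copy.
    df = load_aptadb()

    # Explicit cache location and a forced refresh.
    df = load_aptadb(cache_dir="./data/aptadb", force_download=True)
    print(df.shape)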