[RFC] Add module to make datasets IO easier with pandas #152
This is quite interesting! Some initial thoughts, and coming from a place of ignorance:
Also just for clarity, with the 2 options above, if the 2nd is used, you mean that the whole of pandas would be available as a namespaced component? Or just these functions? For the shorter import, I wonder if either
would be more natural (so keeping the original
Thanks @acroz, I thought about this a bit more and considered all the comments above. How about the following?

    from faculty import datasets
    url = datasets.presigned_url("/path/to/any/file")

We can then add a section in the docs (or docstrings) illustrating usage with

As for writing, you can do something like

This satisfies the two requirements:
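As a rough illustration of the presigned-URL idea above: pandas can read a CSV straight from a URL (`pandas.read_csv(url)`), and a write could go the other way by serialising the frame in memory and issuing an HTTP PUT. The sketch below only prepares that PUT request. The helper name `build_csv_upload` is hypothetical, and it assumes the presigned URL accepts uploads, which this thread does not settle for the proposed `datasets.presigned_url`.

```python
import io
import urllib.request


def build_csv_upload(df, put_url):
    """Serialise `df` to CSV in memory and prepare an HTTP PUT to `put_url`.

    Assumption: `put_url` is a presigned URL that accepts uploads.
    `df` only needs a pandas-style `to_csv(buffer, index=...)` method.
    """
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    body = buffer.getvalue().encode("utf-8")
    # Build the request without sending it; the caller decides when to upload.
    return urllib.request.Request(
        put_url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/csv"},
    )
```

Sending the request would then be a matter of `urllib.request.urlopen(request)`. Nothing here touches local disk, which is the main appeal over a download-then-read approach.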
@acroz Also ran a quick test to compare speed for an AWS backend. For a 139M CSV file,
These are very promising because
This needs tests added before merging.
While this is still in draft stage, and prior to implementing tests, I'd like to get review on the API proposed by this PR.
Expected usage looks like:
These closely mirror the pandas API (extra args and kwargs are just passed through), except that the `to_csv` functionality in pandas is a method and not available (AFAICT) as a static function.

## Pandas as an optional dependency
`faculty` does not currently depend on numpy or pandas. It's nice to keep it that way, as the library can be kept lightweight for the majority of applications where the sometimes-expensive installation of numpy is not required. I propose that an optional dependency on pandas be included for this functionality via an `extras_require` entry in `setup.py`.

For the main expected use case (inside the platform), pandas is always expected to be available, so users will rarely encounter the case where it's not available. Managing the case where pandas is not installed could be:

`ModuleNotFoundError`.

I'm interested in input on the above or other options.
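One way to handle a missing pandas is to defer the import until a function is actually called and raise a descriptive error at that point. Here is a minimal sketch of that idea, not the PR's implementation. The helper `require_optional`, the wrapper `read_csv`, and the extras name `pandas` are all assumptions made for illustration.

```python
import importlib

# Assumed setup.py entry (hypothetical extras name):
#     extras_require={"pandas": ["pandas"]}


def require_optional(module_name, extra):
    """Import an optional dependency at call time, or explain how to get it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught here.
        raise ImportError(
            f"{module_name} is required for this functionality; "
            f"install it with: pip install faculty[{extra}]"
        ) from error


def read_csv(path_or_url, *args, **kwargs):
    """Hypothetical wrapper: pandas is imported only when this is called."""
    pandas = require_optional("pandas", extra="pandas")
    # Extra args and kwargs are passed straight through to pandas.
    return pandas.read_csv(path_or_url, *args, **kwargs)
```

With this shape, importing the module never fails on a pandas-free install; only calling a pandas-backed function does, and the error says what to install.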
## Possible aliases
Current recommended style when using `faculty.datasets` is:

People seem to prefer shorter aliases for things (it seems the data science community finds the 5/6 characters of numpy/pandas too lengthy!) so we may want to encourage a particular alias, such as:
Extra ideas welcome.
Alternatively, if we go with option 2 above (import pandas at function call-time), we could import `faculty.datasets.pandas` in `faculty/datasets/__init__.py`, and then the pandas functionality appears as some namespaced components of `faculty.datasets`, e.g.: