
[RFC] Add module to make datasets IO easier with pandas #152

Draft: acroz wants to merge 3 commits into master from datasets-csv

Conversation

@acroz (Member) commented Jan 22, 2020

This needs tests added before merging.

While this is still in the draft stage, and prior to implementing tests, I'd like to get a review of the API proposed by this PR.

Expected usage looks like:

import faculty.datasets.pandas

# Read
df = faculty.datasets.pandas.read_csv("path/to/object.csv")

# Write
faculty.datasets.pandas.to_csv(df, "path/to/object.csv", index=False)

These closely mirror the pandas API (extra args and kwargs are just passed through), except that the to_csv functionality in pandas is a DataFrame method and not available (AFAICT) as a module-level function.

## Pandas as an optional dependency

faculty does not currently depend on numpy or pandas. It's nice to keep it that way, as the library can be kept lightweight for the majority of applications where the sometimes-expensive installation of numpy is not required. I propose that an optional dependency on pandas be included for this functionality via an extras_require entry in setup.py.
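
For illustration, such an entry might look roughly like the fragment below; the extra name "pandas" is an assumption here, not something this PR has settled on:

from setuptools import setup

setup(
    name="faculty",
    # ... other existing setup arguments unchanged ...
    extras_require={
        # `pip install faculty[pandas]` would then pull in pandas
        "pandas": ["pandas"],
    },
)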

For the main expected use case (inside the platform), pandas should always be available, so users will rarely hit the case where it's missing. The missing-pandas case could be handled in one of two ways:

  1. (As implemented) Pandas is only imported in this module, and we make sure this module is not imported by others in the package. In this case, tests would be added to check that other functionality works when pandas is not available.
  2. Pandas is imported at function call time, with a descriptive error message replacing the default ModuleNotFoundError (see the sketch below).

I'm interested in input on the above or other options.
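
To make option 2 concrete, here is a minimal sketch of what the call-time import could look like; the helper name _require_pandas and the install hint in the error message are illustrative only:

def _require_pandas():
    """Import pandas lazily so the rest of the package works without it."""
    try:
        import pandas
    except ImportError as err:
        raise ImportError(
            "pandas is required for faculty.datasets.pandas; "
            "install it with, for example, `pip install faculty[pandas]`"
        ) from err
    return pandas

def read_csv(project_path, *args, **kwargs):
    pandas = _require_pandas()
    # Download project_path from datasets to a local file (elided here),
    # then delegate to pandas.read_csv, passing extra args straight through.
    ...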

## Possible aliases

Current recommended style when using faculty.datasets is:

from faculty import datasets
datasets.ls("prefix")
# etc..

People seem to prefer shorter aliases for things (it seems the data science community finds the five or six characters of numpy/pandas too lengthy!), so we may want to encourage a particular alias, such as:

import faculty.datasets.pandas as faculty_pandas
import faculty.datasets.pandas as datasets_pandas
import faculty.datasets.pandas as ds_pandas
import faculty.datasets.pandas as fdp

Extra ideas welcome.

Alternatively, if we go with option 2 above (importing pandas at function call time), we could import faculty.datasets.pandas in faculty/datasets/__init__.py, and the pandas functionality would then appear as a namespaced component of faculty.datasets, e.g.:

from faculty import datasets
datasets.ls("path/")
df = datasets.pandas.read_csv("path/to/object.csv")

acroz requested review from pbugnion and zblz on February 3, 2020
acroz self-assigned this on February 3, 2020
acroz changed the title from "Add module to make datasets IO easier with pandas" to "[RFC] Add module to make datasets IO easier with pandas" on February 3, 2020
acroz requested a review from imrehg on February 19, 2020
@imrehg (Member) commented Feb 19, 2020

This is quite interesting! Some initial thoughts, coming from a place of ignorance:

  • The namespacing in faculty.datasets.pandas seems good, though do you think people would find it confusing that only some functions (and not all of pandas) are exposed? I guess not really, but it's probably worth documenting in any case, later on.
  • Would we consider adding similar functions for the other read_FORMAT and to_FORMAT variants as well? Looking at the pandas API reference, I can imagine FORMAT being excel, json, HDF, parquet, pickle, and so on. I bet these are used less frequently, but it feels like if we include one, we should include the rest as well (a rough sketch follows this list).
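
Purely as an illustration of that last point, the extra wrappers could be generated rather than hand-written; the sketch below is hypothetical and ignores the lazy-import question for brevity:

import pandas

# Generate thin read_* wrappers for several formats, delegating to the
# corresponding pandas readers. The datasets download step is elided; the
# helper and the format list are illustrative only.
_FORMATS = ["csv", "json", "parquet", "excel", "pickle"]

def _make_reader(fmt):
    pandas_reader = getattr(pandas, "read_" + fmt)

    def reader(project_path, *args, **kwargs):
        local_path = project_path  # placeholder: download from datasets first
        return pandas_reader(local_path, *args, **kwargs)

    reader.__name__ = "read_" + fmt
    return reader

for _fmt in _FORMATS:
    globals()["read_" + _fmt] = _make_reader(_fmt)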

@imrehg (Member) commented Feb 20, 2020

Also, just for clarity: with the two options above, if the second is used, do you mean that the whole of pandas would be available as a namespaced component, or just these functions?

For the shorter import, I wonder if either

import faculty.datasets.pandas as fdpd
import faculty.datasets.pandas as fpd

would be more natural (keeping the original pd convention, but adding some way to highlight the "faculty-ness" of things). Just a thought, no strong preference.

@sbalian (Contributor) commented May 1, 2020

Thanks @acroz, I thought about this a bit more and considered all the comments above. How about the following?

from faculty import datasets

url = datasets.presigned_url("/path/to/any/file")

We can then add a section in the docs (or docstrings) illustrating usage with pd.read_*, and possibly with readers from other libraries that support URL inputs.

As for writing, we could add something like datasets.put_string to take the local data as a string (as returned by pd.DataFrame.to_csv(path_or_buf=None), and also by pd.Series.to_csv), and again illustrate usage in the docs. Or perhaps modify datasets.put so that it can take a string as well as a file path as input. Finally, we could also have the inverse of this, datasets.get_string, or modify datasets.get.
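
A rough sketch of the usage this comment proposes, assuming the suggested names presigned_url and put_string (neither exists yet) and made-up paths:

import pandas as pd
from faculty import datasets

# Reading: pandas readers accept URLs, so a presigned URL is enough.
url = datasets.presigned_url("/path/to/object.csv")
df = pd.read_csv(url)

# Writing: serialise the DataFrame to a string locally, then upload it.
csv_string = df.to_csv(path_or_buf=None, index=False)
datasets.put_string(csv_string, "/path/to/object.csv")

Note that neither side requires pandas inside the faculty library itself.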

This satisfies the two requirements:

  1. No dependence on pandas: this approach is general and removes the burden of including pandas as a dependency.
  2. Makes it easier to deal with datasets IO during development. To me, the main burden is dealing with ObjectClient when I want a presigned URL or when I want to upload data that is not sitting on disk.

@sbalian (Contributor) commented May 1, 2020

@acroz I also ran a quick test to compare speeds against an AWS backend.

For a 139 MB CSV file:

| Method | Time (seconds) |
| --- | --- |
| pandas.read_csv | 3.29 |
| faculty.datasets.pandas.read_csv | 5.52 |
| pandas.DataFrame.to_csv | 10.2 |
| faculty.datasets.pandas.to_csv | 18.7 |

These are very promising: the price ratio of object storage to workspace storage is well below 0.5, and here the workspace speedup over object storage is not even 2x (of course I am ignoring other advantages of workspace).

acroz force-pushed the datasets-csv branch 2 times, most recently from f9d8835 to 8638804, on June 23, 2021