
[RFC] Add module to make datasets IO easier with pandas #152

Draft: acroz wants to merge 3 commits into master from datasets-csv

Conversation

@acroz (Member) commented Jan 22, 2020

This needs tests added before merging.

While this is still in the draft stage, and prior to implementing tests, I'd like to get a review of the API proposed by this PR.

Expected usage looks like:

import faculty.datasets.pandas

# Read
df = faculty.datasets.pandas.read_csv("path/to/object.csv")

# Write
faculty.datasets.pandas.to_csv(df, "path/to/object.csv", index=False)

These closely mirror the pandas API (extra args and kwargs are just passed through), except that the to_csv functionality in pandas is a DataFrame method and not available (AFAICT) as a module-level function.

## Pandas as an optional dependency

faculty does not currently depend on numpy or pandas. It's nice to keep it that way, as the library can be kept lightweight for the majority of applications where the sometimes-expensive installation of numpy is not required. I propose that an optional dependency on pandas be included for this functionality via an extras_require entry in setup.py.
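
For illustration, such an entry might look roughly like the fragment below; the extra name "pandas" is an assumption here, not something this PR has settled on:

from setuptools import setup

setup(
    name="faculty",
    # ... other existing setup arguments unchanged ...
    extras_require={
        # `pip install faculty[pandas]` would then pull in pandas
        "pandas": ["pandas"],
    },
)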

For the main expected use case (inside the platform), pandas should always be available, so users will rarely hit the case where it's missing. The missing-pandas case could be handled in one of two ways:

  1. (As implemented) Pandas is only imported in this module, and we make sure this module is not imported by others in the package. In this case, tests would be added to check that other functionality works when pandas is not available.
  2. Pandas is imported at function call time, with a descriptive error message replacing the default ModuleNotFoundError (see the sketch below).

I'm interested in input on the above or other options.
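
To make option 2 concrete, here is a minimal sketch of what the call-time import could look like; the helper name _require_pandas and the install hint in the error message are illustrative only:

def _require_pandas():
    """Import pandas lazily so the rest of the package works without it."""
    try:
        import pandas
    except ImportError as err:
        raise ImportError(
            "pandas is required for faculty.datasets.pandas; "
            "install it with, for example, `pip install faculty[pandas]`"
        ) from err
    return pandas

def read_csv(project_path, *args, **kwargs):
    pandas = _require_pandas()
    # Download project_path from datasets to a local file (elided here),
    # then delegate to pandas.read_csv, passing extra args straight through.
    ...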

## Possible aliases

Current recommended style when using faculty.datasets is:

from faculty import datasets
datasets.ls("prefix")
# etc..

People seem to prefer shorter aliases for things (it seems the data science community finds the five or six characters of numpy/pandas too lengthy!), so we may want to encourage a particular alias, such as:

import faculty.datasets.pandas as faculty_pandas
import faculty.datasets.pandas as datasets_pandas
import faculty.datasets.pandas as ds_pandas
import faculty.datasets.pandas as fdp

Extra ideas welcome.

Alternatively, if we go with option 2 above (importing pandas at function call time), we could import faculty.datasets.pandas in faculty/datasets/__init__.py, and the pandas functionality would then appear as a namespaced component of faculty.datasets, e.g.:

from faculty import datasets
datasets.ls("path/")
df = datasets.pandas.read_csv("path/to/object.csv")

acroz requested review from pbugnion and zblz on February 3, 2020
acroz self-assigned this on February 3, 2020
acroz changed the title from "Add module to make datasets IO easier with pandas" to "[RFC] Add module to make datasets IO easier with pandas" on February 3, 2020
acroz requested a review from imrehg on February 19, 2020
@imrehg (Member) commented Feb 19, 2020

This is quite interesting! Some initial thoughts, coming from a place of ignorance:

  • The namespacing in faculty.datasets.pandas seems good, though do you think people would find it confusing that only some functions (and not all of pandas) are exposed? I guess not really, but it's probably worth documenting in any case, later on.
  • Would we consider adding similar functions for the other read_FORMAT and to_FORMAT variants as well? Looking at the pandas API reference, I can imagine FORMAT being excel, json, HDF, parquet, pickle, and so on. I bet these are used less frequently, but it feels like if we include one, we should include the rest as well (a rough sketch follows this list).
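
Purely as an illustration of that last point, the extra wrappers could be generated rather than hand-written; the sketch below is hypothetical and ignores the lazy-import question for brevity:

import pandas

# Generate thin read_* wrappers for several formats, delegating to the
# corresponding pandas readers. The datasets download step is elided; the
# helper and the format list are illustrative only.
_FORMATS = ["csv", "json", "parquet", "excel", "pickle"]

def _make_reader(fmt):
    pandas_reader = getattr(pandas, "read_" + fmt)

    def reader(project_path, *args, **kwargs):
        local_path = project_path  # placeholder: download from datasets first
        return pandas_reader(local_path, *args, **kwargs)

    reader.__name__ = "read_" + fmt
    return reader

for _fmt in _FORMATS:
    globals()["read_" + _fmt] = _make_reader(_fmt)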

@imrehg (Member) commented Feb 20, 2020

Also, just for clarity: with the two options above, if the second is used, do you mean that the whole of pandas would be available as a namespaced component, or just these functions?

For the shorter import, I wonder if either

import faculty.datasets.pandas as fdpd
import faculty.datasets.pandas as fpd

would be more natural (keeping the original pd convention, but adding some way to highlight the "faculty-ness" of things). Just a thought, no strong preference.

@sbalian (Contributor) commented May 1, 2020

Thanks @acroz, I thought about this a bit more and considered all the comments above. How about the following?

from faculty import datasets

url = datasets.presigned_url("/path/to/any/file")

We can then add a section in the docs (or docstrings) illustrating usage with pd.read_*, and possibly with readers from other libraries that support URL inputs.

As for writing, we could add something like datasets.put_string to take the local data as a string (as returned by pd.DataFrame.to_csv(path_or_buf=None), and also by pd.Series.to_csv), and again illustrate usage in the docs. Or perhaps modify datasets.put so that it can take a string as well as a file path as input. Finally, we could also have the inverse of this, datasets.get_string, or modify datasets.get.
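
A rough sketch of the usage this comment proposes, assuming the suggested names presigned_url and put_string (neither exists yet) and made-up paths:

import pandas as pd
from faculty import datasets

# Reading: pandas readers accept URLs, so a presigned URL is enough.
url = datasets.presigned_url("/path/to/object.csv")
df = pd.read_csv(url)

# Writing: serialise the DataFrame to a string locally, then upload it.
csv_string = df.to_csv(path_or_buf=None, index=False)
datasets.put_string(csv_string, "/path/to/object.csv")

Note that neither side requires pandas inside the faculty library itself.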

This satisfies the two requirements:

  1. No dependence on pandas: this approach is general and removes the burden of including pandas as a dependency.
  2. Makes it easier to deal with datasets IO during development. To me, the main burden is dealing with ObjectClient when I want a presigned URL or when I want to upload data that is not sitting on disk.

@sbalian (Contributor) commented May 1, 2020

@acroz I also ran a quick test to compare speeds against an AWS backend.

For a 139 MB CSV file:

| Method | Time (seconds) |
| --- | --- |
| pandas.read_csv | 3.29 |
| faculty.datasets.pandas.read_csv | 5.52 |
| pandas.DataFrame.to_csv | 10.2 |
| faculty.datasets.pandas.to_csv | 18.7 |

These are very promising: the price ratio of object storage to workspace storage is well below 0.5, and here the workspace speedup over object storage is not even 2x (of course I am ignoring other advantages of workspace).

acroz force-pushed the datasets-csv branch 2 times, most recently from f9d8835 to 8638804, on June 23, 2021