Example showing inline data saver & loaders #983

Draft: wants to merge 2 commits into base branch main

skrawcz (Collaborator) commented Jun 24, 2024

This is a proof of concept.

import pandas as pd
from sklearn import datasets

from hamilton.htypes import DataLoaderMetadata, DataSaverMetadata

# Returns the data plus the metadata to be tracked. Maybe the return type
# should just be Annotated[pd.DataFrame, dict/DataLoaderMetadata] instead.
def raw_data() -> tuple[pd.DataFrame, DataLoaderMetadata]:
    data = datasets.load_digits()
    df = pd.DataFrame(data.data, columns=[f"feature_{i}" for i in range(data.data.shape[1])])
    return df, DataLoaderMetadata.from_dataframe(df)

# Stitches correctly onto raw_data above: it receives only the DataFrame.
def transformed_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    return raw_data

# Writes the data and returns metadata about what was saved.
def saved_data(transformed_data: pd.DataFrame, filepath: str) -> DataSaverMetadata:
    transformed_data.to_csv(filepath)
    return DataSaverMetadata.from_file_and_dataframe(filepath, transformed_data)

What actually needs to be done:

  1. Ideally, we expand/wrap the function with the data loader type appropriately, to mirror the current process (I think that's what we want).
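
As a rough sketch of the first step, the framework would need to recognise a loader from its return annotation before it can expand the function into separate nodes. The snippet below is plain Python and not Hamilton code: `LoaderMetadata` and `split_loader_annotation` are hypothetical names, and `list` stands in for `pd.DataFrame` so the example has no third-party dependencies.

```python
import typing


class LoaderMetadata(dict):
    """Hypothetical stand-in for the POC's DataLoaderMetadata."""


def split_loader_annotation(fn):
    """Return (data_type, metadata_type) if fn is annotated as a loader, else None."""
    return_type = typing.get_type_hints(fn).get("return")
    if typing.get_origin(return_type) is not tuple:
        return None
    args = typing.get_args(return_type)
    # Assumption: a loader returns exactly (data, dict-like metadata).
    if len(args) != 2 or not (isinstance(args[1], type) and issubclass(args[1], dict)):
        return None
    return args[0], args[1]


def raw_data() -> tuple[list, LoaderMetadata]:
    rows = [1, 2, 3]
    return rows, LoaderMetadata(rows=len(rows))


def transformed_data(raw_data: list) -> list:
    return raw_data


loader_types = split_loader_annotation(raw_data)
plain_types = split_loader_annotation(transformed_data)
```

With the data type in hand, a downstream function such as `transformed_data` could be wired to the first tuple element only, which is the stitching behaviour the example above relies on.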

I made them classes so that it is easy to add from_X functions that create the metadata. Beyond that, I don't type the metadata dictionaries, so maybe we should do that, or provide a way to push people toward putting standard things in them.

The metadata classes should maybe behave more like dictionaries...

Otherwise, I think this is more ergonomic for most people getting started.
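
One possible shape for a metadata class that has from_X constructors, behaves like a dictionary, and nudges people toward standard keys is sketched below. This is an assumption-laden illustration rather than the POC's implementation: `SaverMetadata`, `from_file`, `STANDARD_KEYS`, and the key names are all hypothetical, and only the standard library is used.

```python
import os
import tempfile


class SaverMetadata(dict):
    """Hypothetical dict-like metadata container; not Hamilton's actual class."""

    # Keys every constructor is expected to fill in (names are illustrative).
    STANDARD_KEYS = ("path", "size_bytes")

    @classmethod
    def from_file(cls, path: str, **extra) -> "SaverMetadata":
        """Build metadata from a file on disk, plus any caller-supplied extras."""
        return cls(path=os.path.abspath(path), size_bytes=os.path.getsize(path), **extra)

    def missing_standard_keys(self) -> list:
        """List the standard keys that this instance does not contain."""
        return [key for key in self.STANDARD_KEYS if key not in self]


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out.csv")
    with open(target, "w") as handle:
        handle.write("a,b\n1,2\n")
    metadata = SaverMetadata.from_file(target, rows=1)
```

Subclassing `dict` keeps the object usable anywhere a plain metadata dictionary is accepted, while `missing_standard_keys` gives a cheap way to check hand-built instances.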

(Three screenshots attached, dated 2024-06-24.)

Changes

  • POC

How I tested this

  • locally

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

skrawcz (Collaborator, Author) commented Jun 25, 2024

Okay, let's instead go for the following, if it's simpler to implement:

import pandas as pd

from hamilton.function_modifiers import loader, saver
from hamilton.io import utils as io_utils

@loader  # injects a node to pull out the result
def foo() -> tuple[pd.DataFrame, dict]:
    ...
    metadata = io_utils....(file, df)
    return df, metadata

@saver  # all it does is add the right tags
def write_foo(...) -> dict:
    ...
    metadata = io_utils....(file)
    return metadata
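
To make the intended division of labour concrete, here is a minimal plain-Python sketch of two such decorators. It is not Hamilton's implementation: the `tags` attribute, the `io_role` tag name, and the `extract_data` helper are hypothetical stand-ins for the real tagging and node-injection machinery, and `list` stands in for `pd.DataFrame`.

```python
import functools


def loader(fn):
    """Hypothetical sketch: validate the (data, metadata) tuple and tag the function."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        if not (isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict)):
            raise TypeError(f"{fn.__name__} must return (data, metadata_dict)")
        return result

    wrapper.tags = {"io_role": "loader"}  # illustrative tag name
    # Stand-in for the injected node that pulls the data out of the tuple.
    wrapper.extract_data = lambda result: result[0]
    return wrapper


def saver(fn):
    """Hypothetical sketch: only attach a tag; the function returns the metadata dict."""
    fn.tags = {"io_role": "saver"}
    return fn


@loader
def foo() -> tuple[list, dict]:
    rows = [1, 2, 3]
    return rows, {"rows": len(rows)}


@saver
def write_foo(foo: list) -> dict:
    return {"rows_written": len(foo)}


data = foo.extract_data(foo())
saved = write_foo(data)
```

The loader side does the extra work of splitting the tuple, while the saver side stays a plain function with a tag attached, which matches the comments in the proposal.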

elijahbenizzy (Collaborator) commented

I also find it clearer :)
