Enable Offline Data Check with Jupyter #17

Open
noklam opened this issue Aug 16, 2023 · 0 comments

noklam commented Aug 16, 2023

Description

Enable data checking in Jupyter Notebook.

Quotes from #12:
I wonder if we could introduce a catalog.validate() method which runs the tests declared in the catalog? This is why I've designed the snippet above around a somewhat generic validators structure.
We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...

Interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Note that you can already do something like this (ugly but still easy):

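# Load the dataset, then validate it against the schema attached to its catalog entry.
data = catalog.load("dataset_name")
# Kedro keeps dataset metadata as a plain dict, so the schema is reached by key, not attribute.
catalog._data_sets["dataset_name"].metadata["pandera"]["schema"].validate(data)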

With the same logic, maybe a CLI command such as kedro pandera validate would be helpful too. I guess you sometimes just want to check a new dataset quickly.

Context

This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project: a config file, a DataCatalog, and a pyproject.toml may be enough to make it work.

In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? It may be trivial to support both.
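
On the question of supporting both modes: in pandera, a DataFrameSchema instance and a DataFrameModel subclass each expose a callable validate(data) (an instance method on the former, a classmethod on the latter), so a helper can probably treat them uniformly. The sketch below illustrates that duck-typing idea only. It uses two stand-in schemas and a list of dicts in place of a DataFrame so that it runs without pandera or pandas installed; ObjectSchema, ClassSchema, and run_validation are hypothetical names, not part of pandera or kedro-pandera.

```python
from typing import Any, Dict, Iterable, List

Rows = List[Dict[str, Any]]  # stand-in for a pandas DataFrame


class ObjectSchema:
    """Stand-in for an object-based schema (shaped like pandera's DataFrameSchema)."""

    def __init__(self, required_columns: Iterable[str]):
        self.required_columns = set(required_columns)

    def validate(self, data: Rows) -> Rows:
        # Fail on the first row that lacks a required column.
        for row in data:
            missing = self.required_columns - set(row)
            if missing:
                raise ValueError(f"missing columns: {sorted(missing)}")
        return data


class ClassSchema:
    """Stand-in for a class-based schema (shaped like a DataFrameModel subclass)."""

    required_columns = {"id", "price"}

    @classmethod
    def validate(cls, data: Rows) -> Rows:
        # The class-based form is used without instantiation.
        return ObjectSchema(cls.required_columns).validate(data)


def run_validation(schema: Any, data: Rows) -> Rows:
    """Validate with either a schema instance or a schema class."""
    validate = getattr(schema, "validate", None)
    if not callable(validate):
        raise TypeError(f"{schema!r} has no callable validate()")
    return validate(data)


rows = [{"id": 1, "price": 9.5}]
run_validation(ObjectSchema(["id", "price"]), rows)  # object-based: pass an instance
run_validation(ClassSchema, rows)                    # class-based: pass the class itself
```

If this holds for the real pandera types, a future catalog.validate would not need to distinguish between the two modes beyond accepting either a schema object or a schema class.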

Possible Implementation

It is already possible to validate data against a schema defined in the catalog under the pandera metadata key.

In addition to schema.validate, pandera also supports decorators for pipeline functions.
Using them requires inspecting each function signature and then working out which datasets are registered with a data check. (Out of scope: it only matters when you have a pipeline, and we should start with notebooks first.)

There are a few options:

  1. Monkeypatch a catalog.validate method.
  2. Subclass the current DataCatalog class. This requires a change in settings.py to enable it.
  3. Provide a standalone function, kedro_pandera.validate(catalog, schema).
  4. ??
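
To make options 1 and 3 more concrete, here is a minimal sketch rather than a proposed implementation. It assumes the schema lives under metadata["pandera"]["schema"] on each dataset and that the catalog exposes the private _data_sets mapping used in the snippet above. FakeDataset, FakeCatalog, and NonEmptySchema are hypothetical stand-ins so the example runs without Kedro or pandera, and the validate signature (catalog, name, optional schema) is an illustration, not an agreed API.

```python
from typing import Any, Dict, Optional


class FakeDataset:
    """Stand-in for a Kedro dataset: holds data plus a metadata dict."""

    def __init__(self, data: Any, metadata: Optional[Dict[str, Any]] = None):
        self._data = data
        self.metadata = metadata or {}

    def load(self) -> Any:
        return self._data


class FakeCatalog:
    """Stand-in for a DataCatalog exposing the private _data_sets mapping."""

    def __init__(self, data_sets: Dict[str, FakeDataset]):
        self._data_sets = data_sets

    def load(self, name: str) -> Any:
        return self._data_sets[name].load()


class NonEmptySchema:
    """Stand-in for a pandera schema: anything with validate(data)."""

    def validate(self, data: Any) -> Any:
        if len(data) == 0:
            raise ValueError("dataset is empty")
        return data


def validate(catalog: Any, name: str, schema: Any = None) -> Any:
    """Option 3: a standalone function. An explicit schema overrides the catalog's."""
    if schema is None:
        metadata = getattr(catalog._data_sets[name], "metadata", None) or {}
        schema = metadata.get("pandera", {}).get("schema")
    if schema is None:
        raise KeyError(f"no pandera schema registered for '{name}'")
    return schema.validate(catalog.load(name))


# Option 1: monkeypatch the same function onto the catalog class as a method.
FakeCatalog.validate = validate

catalog = FakeCatalog(
    {
        "orders": FakeDataset([1, 2, 3], {"pandera": {"schema": NonEmptySchema()}}),
        "raw": FakeDataset([4, 5]),  # no schema declared in the catalog
    }
)
catalog.validate("orders")                  # schema taken from the catalog metadata
validate(catalog, "raw", NonEmptySchema())  # prototyping with an explicit schema
```

Option 2 would place the same method on a DataCatalog subclass instead. In Kedro that subclass is registered through the DATA_CATALOG_CLASS setting in settings.py, which is why it needs a full project, whereas options 1 and 3 would also work from a bare notebook.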

Possible Alternatives

TBD

@noklam noklam changed the title Enable Offline Data Check with Jupyter [DRAFT] - Enable Offline Data Check with Jupyter Aug 16, 2023
@noklam noklam changed the title [DRAFT] - Enable Offline Data Check with Jupyter [DRAFT] - Enable Offline Data Check Aug 16, 2023
@noklam noklam changed the title [DRAFT] - Enable Offline Data Check [DRAFT] - Enable Offline Data Check with Jupyter Aug 16, 2023
@Galileo-Galilei Galileo-Galilei changed the title [DRAFT] - Enable Offline Data Check with Jupyter Enable Offline Data Check with Jupyter Sep 10, 2023