Enable Offline Data Check with Jupyter #17

noklam · 2023-08-16T22:45:02Z

Description

Enable data checking in Jupyter Notebook.

Quotes from #12
I wonder if we could introduce a catalog.validate() method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic validators structure
We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...

Interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Note that you can already do something like this (ugly but still easy):

data=catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)

By the same logic, a CLI command such as kedro pandera validate might be helpful too; I guess you sometimes just want to check a new dataset quickly.

Context

This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project: a config file, a DataCatalog, and a pyproject.toml may be enough to make it work.

In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? It may be trivial to support both.
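On supporting both: as far as I know, pandera exposes validate on schema instances (object-based API) and as a classmethod on schema classes (class-based API), so a helper may only need to call schema.validate(data) in either case. A minimal sketch of that idea follows. ObjectSchema, ClassSchema and run_check are hypothetical stand-ins used to keep the example dependency-free; they are not pandera or kedro-pandera API, and a plain dict stands in for a DataFrame.

```python
from typing import Any


class ObjectSchema:
    """Stand-in for an object-based schema: validate() lives on the instance."""

    def __init__(self, required_keys):
        self.required_keys = set(required_keys)

    def validate(self, data):
        missing = self.required_keys - set(data)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return data


class ClassSchema:
    """Stand-in for a class-based schema: validate() is a classmethod."""

    required_keys = {"id", "price"}

    @classmethod
    def validate(cls, data):
        missing = cls.required_keys - set(data)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return data


def run_check(data: Any, schema: Any) -> Any:
    """Hypothetical helper: accept a schema instance or a schema class."""
    validate = getattr(schema, "validate", None)
    if not callable(validate):
        raise TypeError(f"{schema!r} has no callable validate()")
    return validate(data)


# A dict of columns stands in for a DataFrame here.
data = {"id": [1, 2], "price": [9.5, 3.0]}
checked_by_object = run_check(data, ObjectSchema(["id", "price"]))
checked_by_class = run_check(data, ClassSchema)
```

If this holds for the real pandera classes, the choice of which API to support first mostly affects how the schema is declared in the catalog, not how it is called.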

Possible Implementation

It is already possible to validate data against a given schema defined in the catalog with the pandera metadata key.

In addition to schema.validate, pandera also supports decorators for pipelines.
Using them requires inspecting the function signature and then working out which datasets are registered with a data check. (Out of scope: it only matters when you have a pipeline, and we should start with notebooks first.)

There are a few options:

Monkeypatch a catalog.validate method
Inherit from the current DataCatalog class (requires a change in settings.py to enable it)
kedro_pandera.validate(catalog, schema)
??
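To make the first and third options concrete, here is a rough sketch, not a proposed final design. The validate function and its signature are assumptions for discussion. The attribute path to the schema is copied from the workaround quoted in the Description, and FakeCatalog and EchoSchema are hypothetical stand-ins so the snippet runs without a Kedro project.

```python
from types import SimpleNamespace


def validate(catalog, dataset_name, schema=None):
    """Hypothetical helper: load a dataset and validate it.

    If no schema is passed, fall back to the one attached to the dataset,
    using the attribute path from the workaround quoted in this issue.
    """
    data = catalog.load(dataset_name)
    if schema is None:
        schema = catalog._data_sets[dataset_name].metadata.pandera.schema
    return schema.validate(data)


class EchoSchema:
    """Stand-in schema that records what it validated and returns it."""

    def __init__(self):
        self.seen = []

    def validate(self, data):
        self.seen.append(data)
        return data


class FakeCatalog:
    """Stand-in for a DataCatalog exposing load() and _data_sets."""

    def __init__(self, schema):
        metadata = SimpleNamespace(pandera=SimpleNamespace(schema=schema))
        self._data_sets = {"dataset_name": SimpleNamespace(metadata=metadata)}

    def load(self, name):
        return {"id": [1, 2]}


catalog_schema = EchoSchema()
catalog = FakeCatalog(catalog_schema)

# Option 3: standalone function, schema taken from the catalog.
result = validate(catalog, "dataset_name")

# Prototyping workflow: pass a schema explicitly to override the catalog one.
draft_schema = EchoSchema()
validate(catalog, "dataset_name", draft_schema)

# Option 1: monkeypatch the same function onto the catalog class.
FakeCatalog.validate = validate
patched_result = catalog.validate("dataset_name")
```

Writing the helper as a plain function keeps the first and third options compatible: the same callable can be imported directly or attached to the catalog class.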

Possible Alternatives

TBD

noklam changed the title ~~Enable Offline Data Check with Jupyter~~ [DRAFT] - Enable Offline Data Check with Jupyter Aug 16, 2023

noklam changed the title ~~[DRAFT] - Enable Offline Data Check with Jupyter~~ [DRAFT] - Enable Offline Data Check Aug 16, 2023

noklam changed the title ~~[DRAFT] - Enable Offline Data Check~~ [DRAFT] - Enable Offline Data Check with Jupyter Aug 16, 2023

Galileo-Galilei changed the title ~~[DRAFT] - Enable Offline Data Check with Jupyter~~ Enable Offline Data Check with Jupyter Sep 10, 2023

Galileo-Galilei assigned noklam Sep 10, 2023
