Quotes from #12
I wonder if we could introduce a catalog.validate() method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic validators structure
We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...
Interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Note that you can already do something like this (ugly but still easy):
With the same logic, maybe a CLI command kedro pandera validate would be helpful too; I guess you sometimes just want to check a new dataset quickly.
Context
This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project: a config file + DataCatalog + a pyproject.toml may be enough to make it work.
In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? Maybe it's trivial to support both.
Possible Implementation
It is already possible to validate data against a given schema defined in catalog with the pandera metadata key.
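To make the "already possible" path concrete, here is a minimal sketch of looking up a schema under the pandera metadata key and running it on the loaded data. The classes are stand-ins written for this example (not Kedro or pandera classes), and the `metadata["pandera"]["schema"]` layout is an assumption about how the schema is stored; the only thing it relies on from pandera is that a schema exposes `.validate(data)`.

```python
from typing import Any, Dict, Optional


class MinRowsSchema:
    """Hypothetical stand-in for a pandera schema: anything with .validate(data)."""

    def __init__(self, min_rows: int) -> None:
        self.min_rows = min_rows

    def validate(self, data: Any) -> Any:
        # Raise on failure and return the data on success, like schema.validate.
        if len(data) < self.min_rows:
            raise ValueError(f"expected at least {self.min_rows} rows, got {len(data)}")
        return data


class InMemoryDataset:
    """Hypothetical stand-in for a catalog entry with a metadata attribute."""

    def __init__(self, data: Any, metadata: Optional[Dict[str, Any]] = None) -> None:
        self._data = data
        self.metadata = metadata

    def load(self) -> Any:
        return self._data


def validate_from_metadata(datasets: Dict[str, InMemoryDataset], name: str) -> Any:
    """Load one dataset and validate it if a schema is declared in its metadata."""
    dataset = datasets[name]
    metadata = dataset.metadata or {}
    # Assumed layout: metadata = {"pandera": {"schema": <schema object>}}
    schema = metadata.get("pandera", {}).get("schema")
    data = dataset.load()
    if schema is None:
        return data  # nothing declared for this dataset, nothing to check
    return schema.validate(data)
```

Datasets without a declared schema pass through unchanged, which keeps the helper safe to call on any catalog entry from a notebook.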
In addition to schema.validate, pandera also supports decorators, which could be used for pipelines.
That requires inspecting the function signature and then working out which datasets have a data check registered. (Out of scope: it only matters when you have a pipeline, and we should start with the notebook first.)
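Even though the pipeline case is out of scope for now, the signature-inspection step could look roughly like the sketch below. `schemas_for_inputs` and its arguments are hypothetical names invented for this example; it assumes node inputs are given as a positional list of dataset names and that registered schemas live in a plain dict.

```python
import inspect
from typing import Any, Callable, Dict, List


def schemas_for_inputs(
    func: Callable[..., Any],
    input_datasets: List[str],
    registered: Dict[str, Any],
) -> Dict[str, Any]:
    """Map each parameter of func to the schema registered for its dataset.

    input_datasets lists the dataset names bound to func's parameters in
    positional order. Parameters whose dataset has no schema are skipped.
    """
    params = list(inspect.signature(func).parameters)
    if len(params) != len(input_datasets):
        raise ValueError(
            f"{func.__name__} takes {len(params)} parameters "
            f"but {len(input_datasets)} datasets were given"
        )
    return {
        param: registered[dataset]
        for param, dataset in zip(params, input_datasets)
        if dataset in registered
    }


def train_model(features, parameters):
    """Example node function used only to illustrate the mapping."""
    return features, parameters
```

For `train_model` bound to `["model_input", "params:train"]` with a schema registered only for `model_input`, the result contains a single entry keyed by `features`.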
There are a few options:
Monkeypatch a catalog.validate method
Inherit from the current DataCatalog class - requires a change in settings.py to enable it.
kedro_pandera.validate(catalog, schema)
??
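The first three options above can be compared side by side with a small sketch. The `DataCatalog` here is a stand-in written for this example and only models `load()`; the schema is any object with `.validate(data)`, which both pandera API styles provide. The extra dataset-name argument on `validate` is an assumption, since the signature in the list only shows `(catalog, schema)`.

```python
from typing import Any, Dict


class DataCatalog:
    """Stand-in for kedro.io.DataCatalog; only load() is modelled here."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = data

    def load(self, name: str) -> Any:
        return self._data[name]


class NonEmptySchema:
    """Hypothetical stand-in for a pandera schema."""

    def validate(self, data: Any) -> Any:
        if not data:
            raise ValueError("dataset is empty")
        return data


# Option 3: a free function, the least invasive choice.
def validate(catalog: DataCatalog, name: str, schema: Any) -> Any:
    """Load a dataset from the catalog and validate it against schema."""
    return schema.validate(catalog.load(name))


# Option 2: a subclass that adds the method. In a real project it would
# have to be registered as the catalog class in settings.py.
class ValidatingDataCatalog(DataCatalog):
    def validate(self, name: str, schema: Any) -> Any:
        return validate(self, name, schema)


# Option 1: monkeypatch the method onto the existing class.
def patch_catalog(cls: type = DataCatalog) -> None:
    def _validate(self: Any, name: str, schema: Any) -> Any:
        return validate(self, name, schema)

    cls.validate = _validate
```

All three delegate to the same function, so the choice is mostly about how the feature is switched on: the free function needs no setup, the subclass needs a settings change (in Kedro this would presumably be `DATA_CATALOG_CLASS`), and the monkeypatch changes every catalog in the session.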
Possible Alternatives
TBD
noklam changed the title from "Enable Offline Data Check with Jupyter" to "[DRAFT] - Enable Offline Data Check with Jupyter" on Aug 16, 2023
noklam changed the title from "[DRAFT] - Enable Offline Data Check with Jupyter" to "[DRAFT] - Enable Offline Data Check" on Aug 16, 2023
noklam changed the title from "[DRAFT] - Enable Offline Data Check" to "[DRAFT] - Enable Offline Data Check with Jupyter" on Aug 16, 2023
Galileo-Galilei changed the title from "[DRAFT] - Enable Offline Data Check with Jupyter" to "Enable Offline Data Check with Jupyter" on Sep 10, 2023
Description
Enable data checking in Jupyter Notebook.