Fix issue 791: add LazyFrame sink classes for parquet, csv, ipc, ndjson#1653
Fix issue 791: add LazyFrame sink classes for parquet, csv, ipc, ndjson#1653sonalishintre wants to merge 2 commits into
Conversation
jernejfrank
left a comment
There was a problem hiding this comment.
Hi @sonalishintre and welcome. Thanks for taking the time to contribute. In general looks good just a couple of remarks to keep it consistent with the rest of the existing implementations (see below).
If you look at the https://github.com/apache/hamilton/blob/main/hamilton/plugins/polars_post_1_0_0_extensions.py you will see the same implementations for the eager frame and can see the class naming conventions and order in which the classes appear. Could you try to mirror that?
Otherwise looking good and great job!
| return [DATAFRAME_TYPE] | ||
|
|
||
| def save_data(self, data: pl.LazyFrame) -> dict[str, Any]: | ||
| data.sink_parquet(self.path) |
There was a problem hiding this comment.
Could you add the ability to add the lazyframe.sink_parquet kwargs? https://docs.pola.rs/api/python/stable/reference/api/polars.LazyFrame.sink_parquet.html
| return "feather" | ||
|
|
||
| @dataclasses.dataclass | ||
| class PolarsLazyFrameSinkParquet(DataSaver): |
There was a problem hiding this comment.
Can we name this PolarsSinkParquetWriter?
| return [DATAFRAME_TYPE] | ||
|
|
||
| def save_data(self, data: pl.LazyFrame) -> dict[str, Any]: | ||
| data.sink_ipc(self.path) |
There was a problem hiding this comment.
same here adding kwargs: https://docs.pola.rs/api/python/dev/reference/api/polars.LazyFrame.sink_ipc.html
|
|
||
|
|
||
| @dataclasses.dataclass | ||
| class PolarsLazyFrameSinkIPC(DataSaver): |
There was a problem hiding this comment.
Same rename style: PolarsSinkFeatherWriter
|
|
||
|
|
||
| @dataclasses.dataclass | ||
| class PolarsLazyFrameSinkNDJSON(DataSaver): |
There was a problem hiding this comment.
Same rename: PolarsSinkNDJSONWriter
| return [DATAFRAME_TYPE] | ||
|
|
||
| def save_data(self, data: pl.LazyFrame) -> dict[str, Any]: | ||
| data.sink_ndjson(self.path) |
There was a problem hiding this comment.
|
|
||
|
|
||
| @dataclasses.dataclass | ||
| class PolarsLazyFrameSinkCSV(DataSaver): |
There was a problem hiding this comment.
Same rename: PolarsSinkCSVWriter
| return [DATAFRAME_TYPE] | ||
|
|
||
| def save_data(self, data: pl.LazyFrame) -> dict[str, Any]: | ||
| data.sink_csv(self.path) |
There was a problem hiding this comment.
Summary
Closes #791
Adds data sink support for Polars LazyFrames, allowing users to write
LazyFrames directly to disk without calling lf.collect() first, which
is more performant.
Changes
PolarsLazyFrameSinkParquetclassPolarsLazyFrameSinkCSVclassPolarsLazyFrameSinkIPCclassPolarsLazyFrameSinkNDJSONclassregister_data_loaders()How I tested this
Added 4 new tests in
tests/plugins/test_polars_lazyframe_extensions.py:test_polars_lazyframe_sink_parquet✅test_polars_lazyframe_sink_csv✅test_polars_lazyframe_sink_ipc✅test_polars_lazyframe_sink_ndjson✅Each test creates a LazyFrame, sinks it to a temp file, reads it back,
and verifies the data matches.
Notes
sink_ndjsonhas no corresponding loader yet as noted in the issueChecklist