Skip to content

Fix issue 791: add LazyFrame sink classes for parquet, csv, ipc, ndjson#1653

Open
sonalishintre wants to merge 2 commits into
apache:mainfrom
sonalishintre:fix-issue-791
Open

Fix issue 791: add LazyFrame sink classes for parquet, csv, ipc, ndjson#1653
sonalishintre wants to merge 2 commits into
apache:mainfrom
sonalishintre:fix-issue-791

Conversation

@sonalishintre

Copy link
Copy Markdown

Summary

Closes #791

Adds data sink support for Polars LazyFrames, allowing users to write
LazyFrames directly to disk without calling lf.collect() first, which
is more performant.

Changes

  • Added PolarsLazyFrameSinkParquet class
  • Added PolarsLazyFrameSinkCSV class
  • Added PolarsLazyFrameSinkIPC class
  • Added PolarsLazyFrameSinkNDJSON class
  • Registered all new sinks in register_data_loaders()

How I tested this

Added 4 new tests in tests/plugins/test_polars_lazyframe_extensions.py:

  • test_polars_lazyframe_sink_parquet
  • test_polars_lazyframe_sink_csv
  • test_polars_lazyframe_sink_ipc
  • test_polars_lazyframe_sink_ndjson

Each test creates a LazyFrame, sinks it to a temp file, reads it back,
and verifies the data matches.

Notes

  • sink_ndjson has no corresponding loader yet as noted in the issue

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Any change in functionality is tested
  • New functions are documented (with a description and docstring)

@sonalishintre

Copy link
Copy Markdown
Author

Hi @skrawcz,
this is my first contribution to this project.
I've implemented the LazyFrame sink classes requested in #791
(sink_parquet, sink_csv, sink_ipc, sink_ndjson) with tests for each.
Would appreciate a review when you have time!

@jernejfrank jernejfrank left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sonalishintre and welcome. Thanks for taking the time to contribute. In general looks good just a couple of remarks to keep it consistent with the rest of the existing implementations (see below).

If you look at the https://github.com/apache/hamilton/blob/main/hamilton/plugins/polars_post_1_0_0_extensions.py you will see the same implementations for the eager frame and can see the class naming conventions and order in which the classes appear. Could you try to mirror that?

Otherwise looking good and great job!

return [DATAFRAME_TYPE]

def save_data(self, data: pl.LazyFrame) -> dict[str, Any]:
data.sink_parquet(self.path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the ability to add the lazyframe.sink_parquet kwargs? https://docs.pola.rs/api/python/stable/reference/api/polars.LazyFrame.sink_parquet.html

return "feather"

@dataclasses.dataclass
class PolarsLazyFrameSinkParquet(DataSaver):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we name this PolarsSinkParquetWriter?

return [DATAFRAME_TYPE]

def save_data(self, data: pl.LazyFrame) -> dict[str, Any]:
data.sink_ipc(self.path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



@dataclasses.dataclass
class PolarsLazyFrameSinkIPC(DataSaver):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same rename style: PolarsSinkFeatherWriter



@dataclasses.dataclass
class PolarsLazyFrameSinkNDJSON(DataSaver):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same rename: PolarsSinkNDJSONWriter

return [DATAFRAME_TYPE]

def save_data(self, data: pl.LazyFrame) -> dict[str, Any]:
data.sink_ndjson(self.path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



@dataclasses.dataclass
class PolarsLazyFrameSinkCSV(DataSaver):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same rename: PolarsSinkCSVWriter

return [DATAFRAME_TYPE]

def save_data(self, data: pl.LazyFrame) -> dict[str, Any]:
data.sink_csv(self.path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add data source sinks for Polars Lazyframe implementation

2 participants