fix: use PyCapsule Interface instead of Dataframe Interchange Protocol #3782

MarcoGorelli · 2024-11-09T17:40:29Z

closes #3756
closes #3533
I'm hoping that this can supersede #3534

This means that you get support for quite a lot more, e.g.:

DuckDB:

cuDF (their interchange protocol implementation is currently broken anyway, see rapidsai/cudf#17282, "[BUG] pd.api.interchange.from_dataframe fails with simple cuDF dataframe")
Polars: it fixes the issue reported in #3533 ("Try to_pandas rather than erroring if interchanging to pandas doesn't work?"), because the PyCapsule Interface actually supports nested data types.

In addition, this has no effect on existing pandas users, as there's already an early return for pandas https://github.com/MarcoGorelli/seaborn/blob/0bd85071284d45f38cbf419b8cf1efb2179eda24/seaborn/_core/data.py#L284-L285
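For readers who have not seen the diff, the approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `convert_to_pandas` is hypothetical, and the sketch assumes a PyArrow version (14 or later) in which `pyarrow.table` accepts any object implementing `__arrow_c_stream__`.

```python
import pandas as pd


def convert_to_pandas(data):
    """Hypothetical helper: return `data` as a pandas DataFrame."""
    # Early return: pandas input is passed through untouched,
    # so existing pandas users never reach the PyArrow code path.
    if isinstance(data, pd.DataFrame):
        return data

    # The Arrow PyCapsule Interface exposes tabular data via `__arrow_c_stream__`.
    if not hasattr(data, "__arrow_c_stream__"):
        raise TypeError(
            f"Cannot convert {type(data).__name__!r}: "
            "it does not implement `__arrow_c_stream__`."
        )

    # PyArrow is only imported when a non-pandas object needs converting.
    try:
        import pyarrow as pa
    except ImportError as err:
        msg = "PyArrow is required for non-pandas Dataframe support."
        raise ImportError(msg) from err

    # Assumption: pyarrow >= 14, where `pa.table` consumes PyCapsule producers.
    return pa.table(data).to_pandas()
```

With a sketch like this, any library that implements `__arrow_c_stream__` goes through the same branch, which is why no per-library special casing is needed.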

I'm sorry for having introduced the Interchange Protocol in the first place. It's turned out to be fairly problematic, see pandas-dev/pandas#56732 (comment) as the associated discussion for more context

cc @WillAyd for comments

WillAyd · 2024-11-09T17:54:33Z

Implementation-wise I think this looks great. Nice work, @MarcoGorelli.

tests/_core/test_data.py

mwaskom

Cool, thanks for this. I think my one question is about how compatible this will be for users that are currently benefitting from the (seemingly more-or-less built-in) interchange protocol. Do we need to provide backwards compatibility for them?

mwaskom · 2024-11-11T18:11:24Z

seaborn/_core/data.py

+        try:
+            import pyarrow
+        except ImportError as err:
+            msg = "PyArrow is required for non-pandas Dataframe support."


Is this generally a dependency of non-pandas dataframe libraries now? Or could this change introduce a regression for e.g. polars users who are currently leveraging the dataframe interchange protocol?

Thanks for your review!

Polars doesn't depend on PyArrow, but polars.DataFrame.to_pandas always requires PyArrow. So, in practice, anyone working with both dataframe libraries may well have PyArrow installed already

To avoid requiring PyArrow for the cases when it's not necessary, one way could be to do something like:

try using the interchange protocol

if it raises, then fall back to the PyCapsule Interface (which currently requires PyArrow)

This has the upside of not requiring PyArrow in some cases, but the downside of hiding issues where the interchange protocol silently produces invalid results
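The two-step fallback described above could look something like the sketch below. It is only an illustration under assumptions: `to_pandas_with_fallback` is a hypothetical name, `pd.api.interchange.from_dataframe` requires pandas 1.5 or later, and `pa.table` is assumed to accept `__arrow_c_stream__` producers (PyArrow 14 or later).

```python
import pandas as pd


def to_pandas_with_fallback(data):
    """Hypothetical helper: interchange protocol first, PyCapsule Interface second."""
    if isinstance(data, pd.DataFrame):
        return data

    # Step 1: try the dataframe interchange protocol (no PyArrow needed).
    if hasattr(data, "__dataframe__"):
        try:
            return pd.api.interchange.from_dataframe(data)
        except Exception:
            # Only failures that raise end up here. A conversion that
            # silently produces invalid results is returned above unnoticed.
            pass

    # Step 2: fall back to the Arrow PyCapsule Interface, which needs PyArrow.
    if hasattr(data, "__arrow_c_stream__"):
        try:
            import pyarrow as pa
        except ImportError as err:
            msg = "PyArrow is required for non-pandas Dataframe support."
            raise ImportError(msg) from err
        return pa.table(data).to_pandas()

    raise TypeError(f"Cannot convert {type(data).__name__!r} to a pandas DataFrame.")
```

The comment in step 1 marks the downside mentioned above: the fallback is triggered only by an exception, so an interchange implementation that returns wrong data without raising would never reach the PyCapsule path.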

It may be possible to do this PyCapsule Interface conversion in the future without PyArrow but with something lighter instead, like arro3 by @kylebarron (who I'm ccing in case he has comments too)

What would be your preference?

Some polars users may not have pyarrow installed. If seaborn needs to get pandas data, the only production-ready way to do Arrow -> pandas that I know of is using pyarrow.

As Marco mentions I'm working on arro3, which is a minimal library for Arrow in Python, but Pandas interop is not a primary concern, and it's not production-ready today.

FWIW pandas 3.x is going to strongly incentivize users to install PyArrow, although it stops short of outright requiring it. In theory, the only people who shouldn't have PyArrow installed are those operating in space- or resource-constrained environments, probably headless ones like AWS Lambda where seaborn won't be used

Of course up to you how much you want to support non-PyArrow configurations, but the dataframe interchange protocol is relatively buggy and gets very little support, so you may find it easier altogether to force users towards PyArrow

cuDF have said that they will deprecate the interchange format: rapidsai/cudf#17282

Plotly have stopped using it, so Seaborn is the only library left using it

At this point, I think there's a greater risk in keeping it - I don't want to force anything here of course, just making sure you're aware

Could you please clarify what you mean by "if the infra isn't there yet"?

The Arrow C Interface already has quite widespread adoption and I'm not aware of edge cases in its implementations. @WillAyd wrote about switching his Pantab project over to it in "Leveraging the Arrow C Data Interface", and noted:

"Almost immediately my issues went away [...] I felt more confident in the implementation and had to deal with less memory corruption / crashes than before. And, perhaps most importantly, I saved a lot of time."

That was nearly a year ago, and given that he's now suggesting it here in Plotly, I'd say that his experience has stayed just as positive

Regarding the PyArrow dependency, I'll also note that polars.DataFrame.to_pandas requires PyArrow, so any Polars user (such as myself) would already have needed PyArrow installed if they were converting to pandas via the official Polars method

Basically my threshold is "do I need to think about it at all". I'm just not interested in the minutiae of competing Python dataframe libraries or the various attempts to make them work better together. The previous approach was sold as a simple protocol that always works, but it turns out that wasn't the case. Maybe this new way is better; the problem is I have no real way to say for sure without spending a lot of time learning about something that doesn't interest me.

Shall I close and leave you to remove cross-dataframe compatibility altogether?

The problem is then I get issues bugging me about Polars, so I have to think about it anyway :D

😄 that's understandable

I'm aware that you said that using Narwhals was a complete non-starter, but just to showcase that as a possibility:

    import narwhals.stable.v1 as nw
    from narwhals.stable.v1.typing import IntoDataFrame
    import polars as pl
    import pandas as pd

    def convert_dataframe_to_pandas(data: IntoDataFrame) -> pd.DataFrame:
        return nw.from_native(data).to_pandas()

and then leave it up to Narwhals to convert to pandas in the best way for each input library

Altair, Plotly, and Vegafusion are using it as a required dependency now, and Bokeh have a PR in progress to do the same

For completeness: the way the other libraries are using Narwhals is by making the whole logic dataframe-agnostic. In Plotly this resulted in 2-3x better performance for many plots involving group-bys (compared with converting all inputs to pandas), but I understand that you may not be interested in that

mwaskom · 2024-11-11T18:12:19Z

tests/_core/test_data.py

    def test_data_interchange(self, mock_long_df, long_df):
+        pytest.importorskip(


MarcoGorelli force-pushed the pycapsule branch from aa6132d to 7c03e13 Compare November 9, 2024 17:43

MarcoGorelli mentioned this pull request Nov 9, 2024

Use Arrow PyCapsule Interface instead of Dataframe Interchange Protocol #3756

Open

MarcoGorelli force-pushed the pycapsule branch from 7c03e13 to cb86e7a Compare November 9, 2024 17:54

MarcoGorelli force-pushed the pycapsule branch 3 times, most recently from 31146c8 to 9599662 Compare November 9, 2024 18:23

WillAyd reviewed Nov 9, 2024


MarcoGorelli force-pushed the pycapsule branch 2 times, most recently from cf4ce2c to f516630 Compare November 9, 2024 18:39

fix: use PyCapsule Interface instead of Dataframe Interchange Protocol

0bd8507

MarcoGorelli force-pushed the pycapsule branch from f516630 to 0bd8507 Compare November 9, 2024 18:43

MarcoGorelli marked this pull request as ready for review November 9, 2024 18:50

mwaskom reviewed Nov 11, 2024


WillAyd mentioned this pull request Nov 16, 2024

PDEP-15: Reject PDEP-10 pandas-dev/pandas#58623

Open

5 tasks

kylebarron mentioned this pull request Nov 21, 2024

[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Inteface) apache/arrow#39195

Open

8 tasks


