Use the Arrow C and PyCapsule data interfaces to share data with Python #98

Merged: robomics merged 8 commits into main from feature/rework-pyarrow on Oct 20, 2024

@robomics robomics commented Oct 20, 2024

Closes #91.

See:
- https://arrow.apache.org/docs/format/CDataInterface.html
- https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html

This basically allows us to do the following:
- Statically link any version of Arrow into _hictkpy.
  hictk uses Arrow to return interactions through a
  `std::shared_ptr<arrow::Table>`.
- Use the Arrow C data interface (`<arrow/c/abi.h>`) and
  C bridge (`<arrow/c/bridge.h>`) to export data generated with the C++ API
  (which is not ABI stable) through the C ABI (which is stable).
- Wrap the exported data using `PyCapsule`s. We don't even need to
  depend on nanoarrow to do this, as all we need to do is return Python
  objects that expose the `PyCapsule`s through the `__arrow_c_schema__`
  or `__arrow_c_stream__` attributes.
- Use pyarrow's Python API to construct a `pyarrow.Table` given the
  `ArrowSchema` and `ArrowArrayStream` objects returned by the C data
  interface.
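
The Python-facing side of the steps above can be sketched as follows. This is a minimal illustration, not the code from this PR: `ArrowStreamExporter` and `to_pyarrow_table` are hypothetical names, the wrapper simply delegates to another object that already produces an `arrow_array_stream` capsule (in _hictkpy the capsule would come from the C bridge instead), and the use of `pyarrow.table()` as the consumer is an assumption.

```python
class ArrowStreamExporter:
    """Hypothetical stand-in for the object an extension module returns.

    Any object exposing __arrow_c_stream__ can be consumed by Arrow-aware
    libraries without the producer importing pyarrow itself.
    """

    def __init__(self, producer):
        # `producer` is assumed to be any object that already implements
        # __arrow_c_stream__ (here it replaces the C++ exporter).
        self._producer = producer

    def __arrow_c_stream__(self, requested_schema=None):
        # Return the PyCapsule wrapping an ArrowArrayStream struct.
        return self._producer.__arrow_c_stream__(requested_schema)


def to_pyarrow_table(obj):
    """Build a pyarrow.Table from an object exposing __arrow_c_stream__."""
    # Imported lazily so that pyarrow is only needed when a table is requested.
    import pyarrow as pa

    # Assumption: pyarrow.table() is used as the consumer in this sketch.
    return pa.table(obj)
```

With a recent pyarrow installed, `to_pyarrow_table(ArrowStreamExporter(pa.table({"count": [3, 5]})))` would be expected to round-trip the data through the stream capsule. The lazy import keeps pyarrow optional for callers that never request a table.
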

As a bonus, pyarrow is no longer a hard dependency: it is only needed
when fetching pixels or bins as DataFrames.

The only downside of this approach is that pyarrow versions <16 are not
supported, because `pyarrow.Table.from_arrays()` does not recognize
objects exposing the `__arrow_c_stream__` attribute in those versions.
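
A version guard for this constraint could look like the sketch below. It is only an illustration under stated assumptions: `pyarrow_major` and `check_pyarrow` are hypothetical helpers, not functions from hictkpy, and the minimum major version of 16 is taken from the description above.

```python
from importlib import metadata
from typing import Optional

# Minimum supported major version, as stated in the PR description.
MIN_PYARROW_MAJOR = 16


def pyarrow_major(version: str) -> int:
    """Extract the major component from a version string such as '16.1.0'."""
    return int(version.split(".", 1)[0])


def check_pyarrow(version: Optional[str] = None) -> None:
    """Raise ImportError if pyarrow is missing or older than the minimum."""
    if version is None:
        try:
            # Look up the installed version without importing pyarrow.
            version = metadata.version("pyarrow")
        except metadata.PackageNotFoundError as exc:
            raise ImportError("pyarrow is not installed") from exc
    if pyarrow_major(version) < MIN_PYARROW_MAJOR:
        raise ImportError(
            f"pyarrow>={MIN_PYARROW_MAJOR} is required, found {version}"
        )
```

Calling `check_pyarrow()` right before a DataFrame is requested would turn an unsupported installation into a clear error message.
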
@robomics added the enhancement and dependencies labels Oct 20, 2024
@robomics robomics merged commit 0a885d0 into main Oct 20, 2024
57 of 58 checks passed
@robomics robomics deleted the feature/rework-pyarrow branch October 20, 2024 21:14
Linked issue: [packaging] Support multiple versions of pyarrow