Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++] Fix offsets for nullable columns #3611

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented Jan 22, 2025

Issue and/or context:

[sc-62104]

We were adding the offset to the ArrowArray's validity buffer directly which is incorrect because the validity is stored as a bit, not as a uint8_t.

Changes:

  • Add util::bitmap_to_uint8 which takes a pointer to a bitmap, length, and offset, and returns a vector<uint8_t> representation; if the bitmap is a nullptr, then return a nullopt
  • Add helper functions ManagedQuery::_cast_bool_data, formally util::cast_bit_to_uint8, and ManagedQuery::_cast_validity_buffer, which takes the validity buffer of the passed in ArrowArray and returns vector<uint8_t> with the properly shifted offset
  • Update ManagedQuery::setup_write_column and ColumnBuffer::set_data to take in an std::optional<std::vector<uint8_t>> where the validity buffer has already been offset and casted to uint8
  • Add pytest unit testing for nullable columns with offsetting for string, Boolean, enumeration, and the general case (int, float, etc.)

Copy link

codecov bot commented Jan 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.31%. Comparing base (77471b6) to head (5d68d61).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3611      +/-   ##
==========================================
+ Coverage   86.27%   86.31%   +0.04%     
==========================================
  Files          55       55              
  Lines        6375     6380       +5     
==========================================
+ Hits         5500     5507       +7     
+ Misses        875      873       -2     
Flag Coverage Δ
python 86.31% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 86.31% <100.00%> (+0.04%) ⬆️
libtiledbsoma ∅ <ø> (∅)

Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add some additional tests:

  1. Call Sparse/Dense NDArray write with an array that has a validity buffer, and ensure it acts as expected.
  2. For DataFrame, need to verify expected behavior if an index (dimension) column is written with a validity buffer.
  3. In test_dataframe, I'd also test the "invalid" timestamp types (NaT) alongside the other nullable types.

Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing requests only at this point - the C++ code looks OK (I did a quick pass only - would be good to get @XanthosXanthopoulos or @jp-dark to review it in more depth)

apis/python/tests/test_dataframe.py Outdated Show resolved Hide resolved
@nguyenv nguyenv force-pushed the viviannguyen/sc-62104/python-data-corruption-in-validity-layer branch from d89b2f4 to 4271442 Compare January 23, 2025 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants