Skip to content

Conversation

@XanthosXanthopoulos
Copy link
Collaborator

@XanthosXanthopoulos XanthosXanthopoulos commented Nov 13, 2025

Issue and/or context: SOMA-528 SOMA-714 SOMA-688

Changes:
This PR changes the memory management for the read/write operations implemented by ManagedQuery. Specifically:

  • Replaces std::vector backed buffers with C++ arrays wrapped in std::unique_ptr
  • Optimizes null count calculation for nullable columns
  • Makes TileDB buffers to Arrow table conversion multithreaded
  • When possible the writes are now zero copy. Passing temporary object to set the data buffer for writes will crash the program because setting the buffers and writing them to TileDB is not an atomic operation
  • Removes implicit casting of numeric data when writing to TileDB. When passing data to read/write you should use the SOMAArray provided schema to type casts in advance
  • Fix index casting when writing dictionaries
  • Properly write validity buffers for nullable enumerated columns

Notes for Reviewer:

@codecov
Copy link

codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.85%. Comparing base (4f01b4e) to head (1ac1e25).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4311      +/-   ##
==========================================
- Coverage   86.85%   86.85%   -0.01%     
==========================================
  Files         137      138       +1     
  Lines       20705    20712       +7     
  Branches       15       16       +1     
==========================================
+ Hits        17983    17989       +6     
- Misses       2722     2723       +1     
Flag Coverage Δ
python 89.20% <ø> (-0.03%) ⬇️
r 85.62% <87.50%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 89.20% <ø> (-0.03%) ⬇️
libtiledbsoma 76.77% <75.00%> (-0.47%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as ready for review November 14, 2025 17:03
Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any measurements of the time/space impact of this PR?

@XanthosXanthopoulos
Copy link
Collaborator Author

are there any measurements of the time/space impact of this PR?

I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well

Comment on lines 704 to 723
} else {
// Casting is needed and casted data ownership should pass to the column

std::unique_ptr<std::byte[]> data_buffer = std::make_unique_for_overwrite<std::byte[]>(
array->length * sizeof(DiskType));

std::span<UserType> original_data_buffer_view(buf, array->length);
std::span<DiskType> data_buffer_view(reinterpret_cast<DiskType*>(data_buffer.get()), array->length);

for (int64_t i = 0; i < array->length; ++i) {
data_buffer_view[i] = static_cast<DiskType>(original_data_buffer_view[i]);
}

setup_write_column(
schema->name,
array->length,
std::move(data_buffer),
(uint64_t*)nullptr,
_cast_validity_buffer_ptr(array));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just throw. Since arrow/pyarrow/nanoarrow have an explicit schema, we should just require the user pass us data with appropriate types. The only case we want to auto-cast is when the user provides us with a dictionary where the values match the on disk data type of a non-enumerated column.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change may break existing unit tests in Python/R and will definitely require new unit tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this change will require non trivial changes and may break current workflows, should be in a separate PR?


CArrayColumnBuffer() = delete;
CArrayColumnBuffer(const CArrayColumnBuffer&) = delete;
CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the following warning when compiling:

/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:401:5: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
  401 |     CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
      |     ^
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:375:28: note: move constructor of 'CArrayColumnBuffer' is implicitly deleted because base class 'ReadColumnBuffer' has a deleted move constructor


VectorColumnBuffer() = delete;
VectorColumnBuffer(const VectorColumnBuffer&) = delete;
VectorColumnBuffer(VectorColumnBuffer&&) = default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the following warning when compiling:

/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:474:5: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
  474 |     VectorColumnBuffer(VectorColumnBuffer&&) = default;
      |     ^
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:423:28: note: move constructor of 'VectorColumnBuffer' is implicitly deleted because base class 'ReadColumnBuffer' has a deleted move constructor

@XanthosXanthopoulos XanthosXanthopoulos changed the title [c++][WIP] Optimize memory management on write path [c++] Optimize memory management on write path Nov 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants