-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty #48718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ch is empty When writing CSV, if the first record batch was empty, the header would be written twice. This happened because: 1. Header is written to data_buffer_ and flushed during initialization 2. TranslateMinimalBatch returns early for empty batches without modifying data_buffer_ 3. The loop then writes data_buffer_ which still contains the header The fix clears the buffer (resize to 0) when encountering an empty batch, so the subsequent write outputs nothing. Added C++ and Python tests for empty batches at start and in middle of tables. Claude-Generated-By: Claude Code (cli/claude-opus-4-5=1%) Claude-Steers: 2 Claude-Permission-Prompts: 2 Claude-Escapes: 1
|
|
Signed-off-by: Ruiyang Wang <[email protected]>
|
@wgtmac Hi would you mind to also review this? Thanks! |
|
I think we can clear |
Addresses review feedback from HuaHuaY: Instead of clearing the buffer in TranslateMinimalBatch for empty batches, use a WriteAndClearBuffer() helper that writes and clears the buffer in all write paths. This is cleaner because: - Every write follows the same pattern (write -> clear) - Easier to reason about for future write stages - The invariant is explicit: buffer is always clean after flush
|
Clearing after |
|
|
Co-authored-by: Gang Wu <[email protected]>
… tests - Rename WriteAndClearBuffer to FlushToSink (shorter, clearer intent) - Consolidate Python tests into a single parameterized test with 3 cases: empty batch at beginning, middle, and end
Rationale for this change
Fixes #36889
When writing CSV from a table where the first batch is empty, the header gets written twice:
What changes are included in this PR?
The bug happens because:
data_buffer_and flushed duringCSVWriterImplinitializationTranslateMinimalBatchreturns early without modifyingdata_buffer_data_buffer_which still contains stale contentThe fix introduces a
WriteAndClearBuffer()helper that writes the buffer to sink and clears it. This helper is used in all write paths:WriteHeader()WriteRecordBatch()WriteTable()This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
Are these changes tested?
Yes. Added C++ tests in
writer_test.ccand Python tests intest_csv.py:Are there any user-facing changes?
No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.