Skip to content

Conversation

@Vishwanatha-HD
Copy link

@Vishwanatha-HD Vishwanatha-HD commented Nov 20, 2025

Rationale for this change

This PR is intended to enable Parquet DB support on Big-endian (s390x) systems. The fix in this PR fixes "arrow-bit-utility-test" testcase failure.

The "arrow-bit-utility-test" testcase was failing with 1 Test case Failure on Big-endian platforms.

[ FAILED ] 1 test, listed below:
[ FAILED ] BitStreamUtil.ZigZag

What changes are included in this PR?

The fix includes changes to "bit_stream_utils_internal.h" file to address the Abort/Core dump issues.

Are these changes tested?

Yes. The changes are tested on s390x arch to make sure things are working fine. The fix is also tested on x86 arch, to make sure there is no new regression introduced.

Are there any user-facing changes?

No

@Vishwanatha-HD Vishwanatha-HD changed the title GH-48177: [C++][Parquet] Fix arrow-bit-utility-test failure on s390x GH-48194: [C++][Parquet] Fix arrow-bit-utility-test failure on s390x Nov 21, 2025
@github-actions
Copy link

⚠️ GitHub issue #48194 has been automatically assigned in GitHub to PR creator.

}
#else
// For VLQ reading, always read directly from buffer to avoid endianness issues
// with buffered_values_ on big-endian systems like s390x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// with buffered_values_ on big-endian systems like s390x
// with buffered_values_ on big-endian systems like s390x.

#else
// For VLQ reading, always read directly from buffer to avoid endianness issues
// with buffered_values_ on big-endian systems like s390x
// Calculate current position in buffer accounting for bit offset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Calculate current position in buffer accounting for bit offset
// Calculate current position in buffer accounting for bit offset.

const uint8_t* data = NULLPTR;
int max_size = 0;
#if ARROW_LITTLE_ENDIAN
// The data that we will pass to the LEB128 parser
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The data that we will pass to the LEB128 parser
// The data that we will pass to the LEB128 parser.

int max_size = 0;
#if ARROW_LITTLE_ENDIAN
// The data that we will pass to the LEB128 parser
// In all case, we read a byte-aligned value, skipping remaining bits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// In all case, we read a byte-aligned value, skipping remaining bits
// In all case, we read a byte-aligned value, skipping remaining bits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you revert this change?
(This is needless, right?)

Copy link
Author

@Vishwanatha-HD Vishwanatha-HD Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kou.. Yes certainly, this is needless.. I had earlier tried "git restore" and "git reset --hard" to remove "testing" changes.. But I didnt realize that this is a sub-module and I need to do "git submodule update --init --recursive"...
Now, I have dont that and removed the changes.. thanks..

Comment on lines +391 to +396
const int current_byte_offset = byte_offset_ + bit_util::BytesForBits(bit_offset_);
const int bytes_left_in_buffer = max_bytes_ - current_byte_offset;

// Always read from buffer directly to avoid endianness issues
data = buffer_ + current_byte_offset;
max_size = bytes_left_in_buffer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this the same logic as

} else {
max_size = bytes_left();
data = buffer_ + (max_bytes_ - max_size);
}
?

If so, should we reuse it something like the following?

diff --git a/cpp/src/arrow/util/bit_stream_utils_internal.h b/cpp/src/arrow/util/bit_stream_utils_internal.h
index d8c7317fe8..7352312782 100644
--- a/cpp/src/arrow/util/bit_stream_utils_internal.h
+++ b/cpp/src/arrow/util/bit_stream_utils_internal.h
@@ -366,6 +366,9 @@ inline bool BitReader::GetVlqInt(Int* v) {
   const uint8_t* data = NULLPTR;
   int max_size = 0;
 
+#if ARROW_LITTLE_ENDIAN
+  // TODO: Describe why we need this only for little-endian.
+
   // Number of bytes left in the buffered values, not including the current
   // byte (i.e., there may be an additional fraction of a byte).
   const int bytes_left_in_cache =
@@ -377,7 +380,9 @@ inline bool BitReader::GetVlqInt(Int* v) {
     data = reinterpret_cast<const uint8_t*>(&buffered_values_) +
            bit_util::BytesForBits(bit_offset_);
     // Otherwise, we try straight from buffer (ignoring few bytes that may be cached)
-  } else {
+  } else
+#endif
+  {
     max_size = bytes_left();
     data = buffer_ + (max_bytes_ - max_size);
   }

@kou kou changed the title GH-48194: [C++][Parquet] Fix arrow-bit-utility-test failure on s390x GH-48194: [C++] Fix arrow-bit-utility-test failure on s390x Nov 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants