Optimize validity buffer concat. #2626

liurenjie1024 · 2024-11-26T06:24:27Z

Close #2579

liurenjie1024 · 2024-11-26T06:25:31Z

cc @jlowe I ran some benchmarks and didn't notice much performance improvement.

Signed-off-by: liurenjie1024 <[email protected]>

liurenjie1024 · 2024-11-26T06:35:59Z

build

jlowe · 2024-11-26T22:17:49Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

-        // Extract appendCount bits from srcByte, starting from curSrcBitIdx
-        byte mask = (byte) (((1 << appendCount) - 1) & 0xFF);
-        srcByte = (byte) ((srcByte >>> curSrcBitIdx) & mask);
+    int totalRowCount = toIntExact(sliceInfo.getRowCount() + sliceInfo.getValidityBufferInfo().getBeginBit());


The name of this variable makes reading this confusing. It's not a total row count as the name implies. It's the end index or ending row, IIUC.

jlowe · 2024-11-26T22:25:35Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

-        // Sets the bits in destination buffer starting from curDestBitIdx to 0
-        byte destByte = dest.getByte(curDestByteIdx);
-        destByte = (byte) (destByte & ((1 << curDestBitIdx) - 1));
+      if (dest.getLength() >= (curDestOffset + Integer.BYTES)) {


Conditionals should not be in the body of the while loop. The while loop should be as simple as possible, since that's expected to be the hotspot. IMO the code should be structured into three parts similar to the following:

    if (curSrcIdx % 8 != 0) {
      // read an int from the buffer
      // mask off the unused bits
      // count the bits
      // shift and store the bits
    }
    while (whole_ints_left_in_buffer) {
      // read int from buffer
      // count bits
      // shift and store the bits
    }
    if (leftover bits) {
      // read an int from the buffer (leverage padded buffer here)
      // mask off the unused bits
      // count the bits
      // shift and store the bits
    }
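For illustration only, the three-part structure suggested above could look roughly like the following sketch. It is not the code in this PR: `ValidityConcatSketch` and `appendBits` are hypothetical names, `java.nio.ByteBuffer` stands in for the host buffer, and it assumes both buffers are padded to a multiple of 4 bytes, that bits are stored least-significant-bit first, and that destination bits past the appended range may be zeroed. It aligns on 32-bit source words and carries pending bits in a 64-bit accumulator so that the middle loop has no branches.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ValidityConcatSketch {
  /**
   * Appends bitCount bits from src (starting at bit srcBit) to dest (starting at bit destBit)
   * and returns how many of the copied bits were 0 (nulls).
   * Assumption: both buffers have a capacity that is a multiple of 4 bytes.
   */
  static int appendBits(ByteBuffer src, int srcBit, ByteBuffer dest, int destBit, int bitCount) {
    ByteBuffer s = src.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    ByteBuffer d = dest.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    int srcOff = (srcBit >>> 5) * 4;   // byte offset of the current source word
    int destOff = (destBit >>> 5) * 4; // byte offset of the current destination word
    int accBits = destBit & 31;        // number of pending bits held in acc
    // Seed the accumulator with the destination bits that must be preserved.
    long acc = accBits == 0 ? 0L : (d.getInt(destOff) & ((1L << accBits) - 1));
    int remaining = bitCount;
    int nulls = 0;

    // Part 1: consume bits until the source index is aligned to a 32-bit word.
    int srcShift = srcBit & 31;
    if (srcShift != 0 && remaining > 0) {
      int n = Math.min(remaining, 32 - srcShift); // at most 31 bits
      int v = (s.getInt(srcOff) >>> srcShift) & ((1 << n) - 1);
      nulls += n - Integer.bitCount(v);
      acc |= (v & 0xFFFFFFFFL) << accBits;
      accBits += n;
      remaining -= n;
      srcOff += 4;
      if (accBits >= 32) { // flush one full destination word
        d.putInt(destOff, (int) acc);
        acc >>>= 32;
        accBits -= 32;
        destOff += 4;
      }
    }

    // Part 2: whole source words. No conditionals in the loop body.
    while (remaining >= 32) {
      int v = s.getInt(srcOff);
      nulls += 32 - Integer.bitCount(v);
      acc |= (v & 0xFFFFFFFFL) << accBits;
      d.putInt(destOff, (int) acc);
      acc >>>= 32; // accBits is unchanged: 32 bits in, 32 bits out
      srcOff += 4;
      destOff += 4;
      remaining -= 32;
    }

    // Part 3: leftover bits, relying on the 4-byte padding of the source buffer.
    if (remaining > 0) {
      int v = s.getInt(srcOff) & ((1 << remaining) - 1);
      nulls += remaining - Integer.bitCount(v);
      acc |= (v & 0xFFFFFFFFL) << accBits;
      accBits += remaining;
      if (accBits >= 32) {
        d.putInt(destOff, (int) acc);
        acc >>>= 32;
        accBits -= 32;
        destOff += 4;
      }
    }
    // Write any pending bits; the unused high bits of this word become 0.
    if (accBits > 0) {
      d.putInt(destOff, (int) acc);
    }
    return nulls;
  }
}
```

Whether this shape is actually faster than the version in the diff would need to be confirmed with a benchmark.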

jlowe · 2024-11-26T22:29:24Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

-        byte destByte = dest.getByte(curDestByteIdx);
-        destByte = (byte) (destByte & ((1 << curDestBitIdx) - 1));
+      if (dest.getLength() >= (curDestOffset + Integer.BYTES)) {
+        // We have enough room to get an int


Don't we always have enough space to get an integer from the destination buffer because it's at least 4-byte padded?

jlowe · 2024-11-26T22:41:02Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

+        byte[] destBytes = new byte[4];
+        dest.getBytes(destBytes, 0, curDestOffset, destBufRemBytes);
+        int destInt = ByteBuffer.wrap(destBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();


Curious why we're doing byte-at-a-time access and endian handling here when we don't up above? This is still grabbing 4 bytes like the above code. Byte-at-a-time is usually a lot slower; it might be faster to read the int from the buffer (we're loading 4 bytes anyway) and call Integer.reverseBytes (a HotSpot intrinsic candidate) if ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN.
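A minimal sketch of that suggestion, assuming the buffer's own `getInt` performs a native-order load: the word is read in one operation and the bytes are swapped only on a host that is not little-endian. `LittleEndianReadSketch` and `getIntLE` are hypothetical names, and a native-order `java.nio.ByteBuffer` stands in for the host buffer used in the PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LittleEndianReadSketch {
  // Evaluated once; the JIT can treat this as a constant.
  private static final boolean NATIVE_IS_LITTLE_ENDIAN =
      ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;

  /**
   * Reads the 4 bytes at offset as a little-endian int.
   * Assumption: nativeOrderBuf was configured with ByteOrder.nativeOrder().
   */
  static int getIntLE(ByteBuffer nativeOrderBuf, int offset) {
    int raw = nativeOrderBuf.getInt(offset); // single 4-byte load
    return NATIVE_IS_LITTLE_ENDIAN ? raw : Integer.reverseBytes(raw);
  }
}
```

On a little-endian host the swap is never taken, so the read reduces to a plain `getInt`.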

jlowe · 2024-11-26T22:42:05Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

+        ByteBuffer.wrap(destBytes).order(ByteOrder.LITTLE_ENDIAN).putInt(destInt);
+        dest.setBytes(curDestOffset, destBytes, 0, destBufRemBytes);


Similar to the above: we can call setInt here and use ByteOrder to determine whether we need to swap bytes.
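The write side could be sketched the same way. The example below clears every bit at or above a given index in one destination word using a single int read and a single int write, swapping bytes only on a host that is not little-endian. `ClearHighBitsSketch` and `clearFromBit` are hypothetical names, and a native-order `java.nio.ByteBuffer` again stands in for the host buffer, so this is an illustration rather than the PR's code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ClearHighBitsSketch {
  private static final boolean SWAP = ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN;

  /**
   * Zeroes bits [bitIdx, 32) of the little-endian word stored at offset.
   * Assumptions: nativeOrderBuf uses ByteOrder.nativeOrder() and 0 <= bitIdx <= 31.
   */
  static void clearFromBit(ByteBuffer nativeOrderBuf, int offset, int bitIdx) {
    int raw = nativeOrderBuf.getInt(offset);
    int word = SWAP ? Integer.reverseBytes(raw) : raw;
    word &= (1 << bitIdx) - 1; // keep only the bits below bitIdx
    nativeOrderBuf.putInt(offset, SWAP ? Integer.reverseBytes(word) : word);
  }
}
```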

ttnghia · 2024-12-06T05:02:53Z

Please update the PR description. It will be displayed in the commit log. Simply saying "closes XXX" will show only those words, which makes it difficult to track changes through the commit log.

gerashegalov · 2024-12-11T18:14:57Z

> It will be displayed in the commit log

To be fair, we do not (yet) have automation for checking in PRs using the PR description. I follow the convention of copying the PR description into the commit message, but it's not mandated and not widely followed in NVIDIA/spark* repos.

res-life · 2024-12-12T03:07:41Z

Could we copy a long at a time instead of an int? We could avoid using toIntExact and use Long.bitCount instead. This may be faster.
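As a rough illustration of the long-at-a-time idea, the sketch below counts set (valid) bits 64 at a time with `Long.bitCount`. `LongBitCountSketch` and `countSetBits` are hypothetical names, and the sketch assumes the buffer is padded to a multiple of 8 bytes. The thread above only mentions 4-byte padding, so real code would need either wider padding or an int-sized fallback for the tail.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LongBitCountSketch {
  /**
   * Counts the set bits among the first bitCount bits of buf.
   * Assumption: buf's capacity is a multiple of 8 bytes, bits stored LSB-first.
   */
  static long countSetBits(ByteBuffer buf, long bitCount) {
    ByteBuffer b = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    long set = 0;
    int off = 0;
    long remaining = bitCount;
    // Whole 64-bit words.
    while (remaining >= 64) {
      set += Long.bitCount(b.getLong(off));
      off += 8;
      remaining -= 64;
    }
    // Leftover bits: mask off everything past the end before counting.
    if (remaining > 0) {
      long mask = (1L << remaining) - 1;
      set += Long.bitCount(b.getLong(off) & mask);
    }
    return set;
  }
}
```

The null count for a slice would then be `bitCount - countSetBits(buf, bitCount)`.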

ttnghia · 2024-12-13T21:42:04Z

src/main/java/com/nvidia/spark/rapids/jni/kudo/KudoTableMerger.java

-        // Sets the bits in destination buffer starting from curDestBitIdx to 0
-        byte destByte = dest.getByte(curDestByteIdx);
-        destByte = (byte) (destByte & ((1 << curDestBitIdx) - 1) & 0xFF);
+    while (curSrcIdx < totalRowCount) {


So does this concatenate the validity buffers, one word (32 bits) at a time, in a serial manner, in Java?

Can't we use C++/CUDA for accelerating this?

liurenjie1024 requested a review from jlowe November 26, 2024 06:24

Optimize validity buffer concat.

3cff316

Signed-off-by: liurenjie1024 <[email protected]>

liurenjie1024 force-pushed the ray/kudo-opt-concat branch from 04d78bd to 3cff316 Compare November 26, 2024 06:26

jlowe reviewed Nov 26, 2024

liurenjie1024 changed the base branch from branch-24.12 to branch-25.02 November 27, 2024 06:09

ttnghia reviewed Dec 13, 2024
