New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Accelerate StringCoding.hasNegatives for JDK 11, 17, StringCoding.countPositives for JDK 21+ on x86 #21121

Open

dylanjtuttle wants to merge 5 commits into eclipse-openj9:master from dylanjtuttle:countPositivesIntrinsic

Contributor

dylanjtuttle commented Feb 12, 2025 •

edited

Loading

This PR accelerates intrinsic candidates StringCoding.hasNegatives and StringCoding.countPositives on x86, the former on JDK 9-18 and the latter on JDK 19+.

This PR is incremental in a few ways.

The acceleration currently only delivers the desired performance boost for arrays of up to 31 elements. For arrays of 32 elements and longer, the acceleration still outperforms default OpenJ9, but can no longer keep up with HotSpot. For that, I will need to take advantage of larger SIMD instructions.
I have discovered a strange performance anomaly in OpenJ9 that causes it to perform significantly faster than expected for hasNegatives for arrays of 0-8 elements on JDK 19+. It performs so well for these short arrays that implementing my 'acceleration' would actually cause a performance regression there. While this anomaly is investigated, I will not be accelerating hasNegatives on JDK 19+.

In the interest of taking advantage of the performance boost as it currently stands for the 0.51 release, this PR will deliver these changes in their incremental state, with plans for another PR or two down the road to close the aforementioned gaps.

Contributor Author

dylanjtuttle commented Feb 12, 2025

Paging @vijaysun-omr, @0xdaryl for review and @r30shah, @dchopra001 as requested for comparison to acceleration on Z

dylanjtuttle force-pushed the countPositivesIntrinsic branch from 543b8ac to 233cf24 Compare

February 12, 2025 14:49

Contributor

vijaysun-omr commented Feb 12, 2025

Looks fine to me from a quick review. I'll probably defer to @hzongaro for the review since he has much more direct awareness of this work than I do, and can easily review the inliner parts as well (that I skimmed over too).

hzongaro self-assigned this

hzongaro added the comp:jit label

Contributor Author

dylanjtuttle commented Feb 13, 2025

It appears that the crash is due to a recent OMR commit and is therefore unrelated to my changes. I can build successfully with the openj9 branch of the openj9-omr repo, and can now confirm all of the performance testing checks out.

BradleyWood suggested changes

View reviewed changes

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Show resolved Hide resolved

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+                 generateRegRegInstruction(TR::InstOpCode::PTESTRegReg, node, xmmchunk, xmmmask, cg);
+                 // If the result is nonzero (i.e. at least one of the sign bits is set), break and return index
+                 generateLabelInstruction(TR::InstOpCode::JNE4, node, returnIndexLabel, cg);

Member

BradleyWood Feb 13, 2025

at least one of the sign bits is set

Don't you need to keep track of the index of last positive element? This function should find the number of leading positive elements so the location of that negative matters. You cannot return (i - off) as you do in the residue processing because you don't know where in the vector the first negative is.

Contributor Author

dylanjtuttle Feb 14, 2025

The key here was pointed out by Henry early on when I was first trying to figure out how to accelerate countPositives:

    /**
     * Count the number of leading positive bytes in the range.
     *
     * @implSpec the implementation must return len if there are no negative
     *   bytes in the range. If there are negative bytes, the implementation must return
     *   a value that is less than or equal to the index of the first negative byte
     *   in the range.
     */

countPositives only needs to return a value less than or equal to the index of the first negative byte. There are a few places in String.java that call countPositives and then do the work of finding the exact index themselves. I suppose the reason it was designed this way is to take advantage of small time saves in situations where you don't care about the exact index (like when being called by hasNegatives).

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved

runtime/compiler/x/codegen/J9CodeGenerator.cpp Outdated Show resolved Hide resolved

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Show resolved Hide resolved

0xdaryl reviewed

View reviewed changes

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved

BradleyWood reviewed

View reviewed changes

runtime/compiler/x/codegen/J9CodeGenerator.cpp Outdated Show resolved Hide resolved

Contributor Author

dylanjtuttle commented Feb 19, 2025

@0xdaryl @hzongaro I apparently still have a bug in here somewhere because I'm getting a failure in the JCL during the build, but I thought I'd open up for a second round of review in the meantime while I work on squashing it.

BradleyWood reviewed

View reviewed changes

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Show resolved Hide resolved

BradleyWood reviewed

View reviewed changes

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Outdated

+                    case TR::java_lang_StringCoding_countPositives:
+                       {
+                       if (comp->target().cpu.supportsFeature(OMR_FEATURE_X86_SSE2) && feGetEnv("replaceCountPositives") != NULL)

Member

BradleyWood Feb 19, 2025

You already called cg->setSupportsInlineStringCodingCountPositives() so you should be calling cg->getSupportsInlineStringCodingCountPositives()

Contributor Author

dylanjtuttle Feb 21, 2025

I consolidated all of the inlining logic into InlinerTempForJ9, since I was already using it to conditionally only inline countPositives into hasNegatives. I've confirmed that the inlining behaviour is still correct using this method.

BradleyWood reviewed

View reviewed changes

runtime/compiler/x/codegen/J9TreeEvaluator.cpp Show resolved Hide resolved

dylanjtuttle added 5 commits

February 21, 2025 05:19


          base implementation of countPositives and hasNegatives

5f6e2bb


          minor tweaks to hopefully speed things up a bit

bb47060


          disable implementation of hN, enable inlining of both for jdk 19+

84032f7


          disable inlining of countPositives unless its caller is hasNegatives

8e3fd30


          review comments

c31a888

dylanjtuttle force-pushed the countPositivesIntrinsic branch from 8e367bc to c31a888 Compare

February 21, 2025 13:21

Contributor Author

dylanjtuttle commented Feb 21, 2025

I fixed the bug! I squashed all of my review comment changes into one commit, just doing some more performance testing

hzongaro reviewed

View reviewed changes

Member

hzongaro left a comment

Great work! I just have a few suggestions/comments

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

Comment on lines +10426 to +10432

+                 generateRegRegInstruction(TR::InstOpCode::MOV8RegReg, node, limitReg, offsetReg, cg);
+                 generateRegRegInstruction(TR::InstOpCode::ADD8RegReg, node, limitReg, lengthReg, cg);
+                 // loop_limit = (length & -16) + offset
+                 generateRegRegInstruction(TR::InstOpCode::MOV8RegReg, node, loopLimitReg, lengthReg, cg);
+                 generateRegImmInstruction(TR::InstOpCode::AND8RegImm4, node, loopLimitReg, -16, cg);
+                 generateRegRegInstruction(TR::InstOpCode::ADD8RegReg, node, loopLimitReg, offsetReg, cg);

Member

hzongaro Feb 25, 2025

Do these operations calculating limit and loop_limit need to be 64-bit operations or could they be 32-bit operations? Just wondering as these are all type int in the original Java source code.

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+                 generateRegRegInstruction(TR::InstOpCode::PMOVMSKB4RegReg, node, maskReg, xmmMaskReg, cg);
+                 // Check if any negative values exist
+                 generateRegRegInstruction(TR::InstOpCode::TEST8RegReg, node, maskReg, maskReg, cg);

Member

hzongaro Feb 25, 2025

Does this need to be a 64-bit test?

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

Comment on lines +10474 to +10478

+                 // Prepare an 8 byte sign bit mask
+                 generateRegImm64Instruction(TR::InstOpCode::MOV8RegImm64, node, maskReg, 0x8080808080808080, cg);
+                 // Zero out the chunk register
+                 generateRegRegInstruction(TR::InstOpCode::XOR8RegReg, node, chunkReg, chunkReg, cg);

Member

hzongaro Feb 25, 2025

Loading 0x8080808080808080 into maskReg and zeroing chunkReg are not needed for the bytes_left == 0 case. Placing these instructions after

// if bytes_left = 0, return len

might help the zero byte case slightly.

Also, the chunkReg doesn't need to be zeroed out for the 5 to 8 byte case, nor for the 9 to 16 byte case. However, avoiding it for the 5 to 8 byte case would probably be awkward. You might try placing it between these two tests so it can at least be avoided for the 9 to 16 byte case.

    *    if bytes_left > 8
    *       jmp nineOrMoreBytesLabel ----+
    *    if bytes_left > 2               |
    *       jmp threeOrMoreBytesLabel -+ |

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+                  *       if bytes_left > 4            |
+                  *       jmp fiveOrMoreBytesLabel --+ |
+                  *                                  | |
+                  *       laod 3-4 bytes             | |

Member

hzongaro Feb 25, 2025

Typo: load

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+               *   The Code Generator
+               */
+              #if JAVA_SPEC_VERSION < 19
+              static TR::Register* inlineHasNegatives(TR::Node* node, TR::CodeGenerator* cg)

Member

hzongaro Feb 25, 2025

I believe the only parts of inlineHasNegatives and inlineCountPositives that differ are those that deal with returning the position of the first negative or returning true or false. If that's correct, I think it would be better to combine the two methods into one so we can make sure that the common parts won't diverge.

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

Comment on lines +10583 to +10595

+                 // result = length
+                 generateRegRegInstruction(TR::InstOpCode::MOV8RegReg, node, resultReg, lengthReg, cg);
+                 // JMP end
+                 generateLabelInstruction(TR::InstOpCode::JMP1, node, endLabel, cg);
+                 // return_index label
+                 generateLabelInstruction(TR::InstOpCode::label, node, returnIndexLabel, cg);
+                 // result = index - offset
+                 generateRegRegInstruction(TR::InstOpCode::MOV8RegReg, node, resultReg, indexReg, cg);
+                 generateRegRegInstruction(TR::InstOpCode::SUB8RegReg, node, resultReg, offsetReg, cg);

Member

hzongaro Feb 25, 2025

I think this could use indexReg to contain the final result to avoid using another register. It would also avoid one move if a negative byte was found. That is:

returnLenLabel:
   indexReg = lengthReg
   goto endLabel
returnIndexLabel:
   indexReg -= offsetReg

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

Comment on lines +10849 to +10850

		// If the result is nonzero (i.e. at least one of the sign bits is set), return true
		generateLabelInstruction(TR::InstOpCode::JNE1, node, returnTrueLabel, cg);

Member

hzongaro Feb 25, 2025

I think this branch can no longer be reached.

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+                 generateLabelInstruction(TR::InstOpCode::label, node, returnFalseLabel, cg);
+                 // result = 0
+                 generateRegImmInstruction(TR::InstOpCode::MOV8RegImm4, node, resultReg, 0, cg);

Member

hzongaro Feb 25, 2025

This can be done with an XOR of resultReg with itself instead.

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

Comment on lines +10853 to +10857

+                 // return_false label
+                 generateLabelInstruction(TR::InstOpCode::label, node, returnFalseLabel, cg);
+                 // result = 0
+                 generateRegImmInstruction(TR::InstOpCode::MOV8RegImm4, node, resultReg, 0, cg);

Member

hzongaro Feb 25, 2025

Instead of having a falseLabel where resultReg is set to zero, I wonder whether it would be better to set resultReg to zero at the start of the inline code sequence. Then this end sequence would only need to handle the returnTrueLabel case:

returnTrueLabel:
   resultReg = 1
endLabel:

and the if bytes_left == 0 above could branch directly to endLabel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels