-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accelerate StringCoding.hasNegatives for JDK 11, 17, StringCoding.countPositives for JDK 21+ on x86 #21121
base: master
Are you sure you want to change the base?
Accelerate StringCoding.hasNegatives for JDK 11, 17, StringCoding.countPositives for JDK 21+ on x86 #21121
Conversation
Paging @vijaysun-omr, @0xdaryl for review and @r30shah, @dchopra001 as requested for comparison to acceleration on Z |
543b8ac
to
233cf24
Compare
Looks fine to me from a quick review. I'll probably defer to @hzongaro for the review since he has much more direct awareness of this work than I do, and can easily review the inliner parts as well (that I skimmed over too). |
It appears that the crash is due to a recent OMR commit and is therefore unrelated to my changes. I can build successfully with the |
generateRegRegInstruction(TR::InstOpCode::PTESTRegReg, node, xmmchunk, xmmmask, cg); | ||
|
||
// If the result is nonzero (i.e. at least one of the sign bits is set), break and return index | ||
generateLabelInstruction(TR::InstOpCode::JNE4, node, returnIndexLabel, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least one of the sign bits is set
Don't you need to keep track of the index of last positive element? This function should find the number of leading positive elements so the location of that negative matters. You cannot return (i - off) as you do in the residue processing because you don't know where in the vector the first negative is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key here was pointed out by Henry early on when I was first trying to figure out how to accelerate countPositives
:
/**
* Count the number of leading positive bytes in the range.
*
* @implSpec the implementation must return len if there are no negative
* bytes in the range. If there are negative bytes, the implementation must return
* a value that is less than or equal to the index of the first negative byte
* in the range.
*/
countPositives
only needs to return a value less than or equal to the index of the first negative byte. There are a few places in String.java
that call countPositives
and then do the work of finding the exact index themselves. I suppose the reason it was designed this way is to take advantage of small time saves in situations where you don't care about the exact index (like when being called by hasNegatives
).
|
||
case TR::java_lang_StringCoding_countPositives: | ||
{ | ||
if (comp->target().cpu.supportsFeature(OMR_FEATURE_X86_SSE2) && feGetEnv("replaceCountPositives") != NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already called cg->setSupportsInlineStringCodingCountPositives()
so you should be calling cg->getSupportsInlineStringCodingCountPositives()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consolidated all of the inlining logic into InlinerTempForJ9
, since I was already using it to conditionally only inline countPositives
into hasNegatives
. I've confirmed that the inlining behaviour is still correct using this method.
8e367bc
to
c31a888
Compare
I fixed the bug! I squashed all of my review comment changes into one commit, just doing some more performance testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I just have a few suggestions/comments
generateRegRegInstruction(TR::InstOpCode::MOV8RegReg, node, limitReg, offsetReg, cg); | ||
generateRegRegInstruction(TR::InstOpCode::ADD8RegReg, node, limitReg, lengthReg, cg); | ||
|
||
// loop_limit = (length & -16) + offset | ||
generateRegRegInstruction(TR::InstOpCode::MOV8RegReg, node, loopLimitReg, lengthReg, cg); | ||
generateRegImmInstruction(TR::InstOpCode::AND8RegImm4, node, loopLimitReg, -16, cg); | ||
generateRegRegInstruction(TR::InstOpCode::ADD8RegReg, node, loopLimitReg, offsetReg, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these operations calculating limit
and loop_limit
need to be 64-bit operations or could they be 32-bit operations? Just wondering as these are all type int
in the original Java source code.
generateRegRegInstruction(TR::InstOpCode::PMOVMSKB4RegReg, node, maskReg, xmmMaskReg, cg); | ||
|
||
// Check if any negative values exist | ||
generateRegRegInstruction(TR::InstOpCode::TEST8RegReg, node, maskReg, maskReg, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a 64-bit test?
// Prepare an 8 byte sign bit mask | ||
generateRegImm64Instruction(TR::InstOpCode::MOV8RegImm64, node, maskReg, 0x8080808080808080, cg); | ||
|
||
// Zero out the chunk register | ||
generateRegRegInstruction(TR::InstOpCode::XOR8RegReg, node, chunkReg, chunkReg, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loading 0x8080808080808080
into maskReg
and zeroing chunkReg
are not needed for the bytes_left == 0
case. Placing these instructions after
// if bytes_left = 0, return len
might help the zero byte case slightly.
Also, the chunkReg
doesn't need to be zeroed out for the 5 to 8 byte case, nor for the 9 to 16 byte case. However, avoiding it for the 5 to 8 byte case would probably be awkward. You might try placing it between these two tests so it can at least be avoided for the 9 to 16 byte case.
* if bytes_left > 8
* jmp nineOrMoreBytesLabel ----+
* if bytes_left > 2 |
* jmp threeOrMoreBytesLabel -+ |
* if bytes_left > 4 | | ||
* jmp fiveOrMoreBytesLabel --+ | | ||
* | | | ||
* laod 3-4 bytes | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: load
* The Code Generator | ||
*/ | ||
#if JAVA_SPEC_VERSION < 19 | ||
static TR::Register* inlineHasNegatives(TR::Node* node, TR::CodeGenerator* cg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the only parts of inlineHasNegatives
and inlineCountPositives
that differ are those that deal with returning the position of the first negative or returning true or false. If that's correct, I think it would be better to combine the two methods into one so we can make sure that the common parts won't diverge.
// result = length | ||
generateRegRegInstruction(TR::InstOpCode::MOV8RegReg, node, resultReg, lengthReg, cg); | ||
|
||
// JMP end | ||
generateLabelInstruction(TR::InstOpCode::JMP1, node, endLabel, cg); | ||
|
||
|
||
// return_index label | ||
generateLabelInstruction(TR::InstOpCode::label, node, returnIndexLabel, cg); | ||
|
||
// result = index - offset | ||
generateRegRegInstruction(TR::InstOpCode::MOV8RegReg, node, resultReg, indexReg, cg); | ||
generateRegRegInstruction(TR::InstOpCode::SUB8RegReg, node, resultReg, offsetReg, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could use indexReg
to contain the final result to avoid using another register. It would also avoid one move if a negative byte was found. That is:
returnLenLabel:
indexReg = lengthReg
goto endLabel
returnIndexLabel:
indexReg -= offsetReg
// If the result is nonzero (i.e. at least one of the sign bits is set), return true | ||
generateLabelInstruction(TR::InstOpCode::JNE1, node, returnTrueLabel, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this branch can no longer be reached.
generateLabelInstruction(TR::InstOpCode::label, node, returnFalseLabel, cg); | ||
|
||
// result = 0 | ||
generateRegImmInstruction(TR::InstOpCode::MOV8RegImm4, node, resultReg, 0, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done with an XOR of resultReg
with itself instead.
// return_false label | ||
generateLabelInstruction(TR::InstOpCode::label, node, returnFalseLabel, cg); | ||
|
||
// result = 0 | ||
generateRegImmInstruction(TR::InstOpCode::MOV8RegImm4, node, resultReg, 0, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having a falseLabel
where resultReg
is set to zero, I wonder whether it would be better to set resultReg
to zero at the start of the inline code sequence. Then this end sequence would only need to handle the returnTrueLabel
case:
returnTrueLabel:
resultReg = 1
endLabel:
and the if bytes_left == 0
above could branch directly to endLabel
.
This PR accelerates intrinsic candidates
StringCoding.hasNegatives
andStringCoding.countPositives
on x86, the former on JDK 9-18 and the latter on JDK 19+.This PR is incremental in a few ways.
hasNegatives
for arrays of 0-8 elements on JDK 19+. It performs so well for these short arrays that implementing my 'acceleration' would actually cause a performance regression there. While this anomaly is investigated, I will not be acceleratinghasNegatives
on JDK 19+.In the interest of taking advantage of the performance boost as it currently stands for the 0.51 release, this PR will deliver these changes in their incremental state, with plans for another PR or two down the road to close the aforementioned gaps.