8303762: Optimize vector slice operation with constant index using VPALIGNR instruction #24104
jatin-bhateja wants to merge 18 commits into openjdk:master
Conversation
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into `master` will be added to the body of your pull request.

❗ This change is not yet ready to be integrated.
@jatin-bhateja The following labels will be automatically applied to this pull request:

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

/label add hotspot-compiler-dev
@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request.
/keepalive |
@jatin-bhateja The pull request is being re-evaluated and the inactivity timeout has been reset.
@jatin-bhateja this pull request can not be integrated into `master` due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the root of your working copy:

git checkout JDK-8303762
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request.
Performance after AVX2 backend modifications
Hi @XiaohongGong, your comments have been addressed.
XiaohongGong left a comment:

LGTM! Thanks for the update!
Thanks @jatin-bhateja. I will take a look next week.
Hi @sviswa7, please let me know if you have comments on the x86 backend part.

Hi @sviswa7, @merykitty, can you please review the x86 backend implementation?
merykitty left a comment:
I still think it is not a good solution to add an intrinsic method for this operation. We should add constant info to TypeVect and transform the slice pattern into VectorSliceNode. I think it is adequate to add bit info (zeros and ones) to each TypeVect instance so that we can do decent inference without too much additional memory overhead.
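To make the discussed idiom concrete, here is a hypothetical scalar model (plain Java arrays, not the actual Vector API or C2 IR; all names are illustrative) of the subgraph such pattern matching would have to recognize: with a constant origin, a slice is equivalent to two rearranges with a constant shuffle followed by a blend with a constant mask.

```java
// Scalar model of the slice idiom: with a constant origin, both the
// shuffle and the blend mask are compile-time constants, so
// slice(a, b, origin) == blend(rearrange(a, s), rearrange(b, s), m).
class SliceIdiom {
    static int[] rearrange(int[] v, int[] shuffle) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) r[i] = v[shuffle[i]];
        return r;
    }

    static int[] blend(int[] x, int[] y, boolean[] mask) {
        int[] r = new int[x.length];
        for (int i = 0; i < x.length; i++) r[i] = mask[i] ? y[i] : x[i];
        return r;
    }

    // Slice expressed as the rearrange + blend subgraph.
    static int[] sliceViaPattern(int[] a, int[] b, int origin) {
        int n = a.length;
        int[] shuffle = new int[n];      // constant once `origin` is constant
        boolean[] mask = new boolean[n]; // constant once `origin` is constant
        for (int i = 0; i < n; i++) {
            shuffle[i] = (i + origin) % n;
            mask[i] = i + origin >= n;   // true for lanes taken from b
        }
        return blend(rearrange(a, shuffle), rearrange(b, shuffle), mask);
    }

    // Direct semantics: what a dedicated VectorSliceNode would compute.
    static int[] sliceDirect(int[] a, int[] b, int origin) {
        int n = a.length;
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            r[i] = (i + origin < n) ? a[i + origin] : b[i + origin - n];
        }
        return r;
    }
}
```

The point of contention above is whether the compiler should recover `sliceDirect` from the `sliceViaPattern` shape via IR pattern matching, or be told directly via an intrinsic.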
JVMState* new_jvms = inline_cg()->generate(jvms);
// Attempt inlining fallback implementation in case of
// intrinsification failure.
if (new_jvms == nullptr && is_vector_late_inline()) {
This may be problematic if the intrinsification does not succeed because the arguments have not been constant-folded. This is because the order in which methods are processed during incremental inlining is not deterministic.
Hi @merykitty, intrinsification failure due to any such reason is the same with and without this patch. In case of slice intrinsic failure we simply inline the fallback implementation, which is composed of vector APIs; the VectorSupport* entry points of those APIs then go through intrinsification attempts independently and may succeed or fail if their constraints are not met.
CallStaticJavaNode::Ideal will enqueue the call for incremental inlining again when it is invoked. That means that if the method fails to get intrinsified at first and its arguments later constant-fold, the intrinsification may succeed upon retry.
I think there is no harm in inlining the fallback after the first unsuccessful intrinsification attempt for sliceOp: the fallback is composed of vector API calls, and we are giving those an opportunity for intrinsification. This saves costly boxing operations, and performance will be at par with what we have today. WDYT?
But this will affect other intrinsics too, and they are not implemented using other vector API operations.
We only perform fallback inlining on the first intrinsification failure for sliceOp; this is a very localized change.
https://github.com/jatin-bhateja/jdk/blob/1dfff5589c8b6c83dfc9810bddbb676c7982c904/src/hotspot/share/opto/callGenerator.cpp#L455
Thanks for pointing out the change. I think that's more hacky than I expected.
This is a selective enablement of inlining for intrinsic failures whose fallback implementation uses the vector API.
What you are asking for is a bigger, generic change which can be taken up as a separate RFE once this is committed.
I think there is no need to rush this functionality, and this will become unnecessary when …
Slice operations are used in simdjson's UtfValidator, and it's better to push this patch than to hold it for some future optimization.
I don't agree with this change because the benefit is small, the intrinsic will not stand in the future, and the implementation is hacky and non-trivial. I think I can accept this PR if:
Hi @merykitty, thank you for the detailed feedback. I've looked into the alternative you suggested (adding constant/lane info to TypeVect and having the compiler recognize the slice pattern, rearrange + blend with constant shuffle and mask, and replace it with VectorSliceNode, without a dedicated intrinsic), and I'd like to explain why I still believe the hybrid call generator is the more practical choice for this change, while keeping the door open for the pattern-based approach later.

Why the pattern-based approach is still very non-trivial: even after we have constant shuffle and constant mask, recognizing the slice idiom and recovering the slice index from it remains involved. So even with TypeVect constant info, we would still need a very complex, non-trivial pattern match, probably the biggest pattern match in idealization.

Why the hybrid call generator is a better fit, and the proposed path: I think the most practical path is to proceed with the hybrid call generator for this PR so we can ship a reliable slice optimization (including for use cases like simdjson) without taking on the full cost and risk of TypeVect constant info and slice-pattern matching right now. Once this is in, we have a clear baseline: constant-index slice is optimized, and variable-index slice uses the inlined fallback. If we later introduce TypeVect constant info and a more generic framework for recognizing vector idioms (e.g. patterns that consume constant shuffle/mask or that match subgraphs like iotaShuffle), we can revisit slice and add pattern-based recognition that either supplements or eventually replaces the intrinsic. Please note the x86 backend implementation will still be usable if we later remove the intrinsic and use complex pattern matching to deduce VectorSliceNode. Thanks again for the review.
I think you are misunderstanding; when …
Given that vector lanes themselves are variable, applying a constant shuffle will not fold the rearrange, so a complex pattern match to infer VectorSliceNode (with constant index), probably in blend idealization, would still have to deal with a subgraph consisting of a blend with a constant mask, two rearrange nodes with constant shuffles, and some other non-folded nodes. I think it's reasonable to proceed with the hybrid call generator for this PR so we can ship a reliable slice optimization (including for use cases like simdjson) without taking on the full cost and risk of TypeVect constant info and slice-pattern matching right now. In the future, once constant TypeVect is in production, we can replace the existing intrinsification approach with a pattern match and still use the backend implementation as it is. Let's think it over and also seek others' opinions (@sviswa7).
I'm fine with using the more straightforward approach to intrinsify the slice API when the origin is a constant. In my view, this could also benefit other APIs and future optimizations (for example, #28520), since slice is a general vector operation. Relying on pattern matching makes the compiler implementation significantly more complex in my opinion. Regarding inlining of the fallback implementation, I think we do need such a mechanism to handle APIs that fail to inline on the first attempt, given that the current fallback overhead is much heavier and leads to worse performance. And I agree with @merykitty that a more generic solution would be preferable.
Hi @merykitty, @XiaohongGong, based on the feedback received, I have modified the patch to not inline on the first intrinsic failure; instead, I now collect such CallGenerators and only towards the end of incremental inlining do I inline-expand the fallback implementation, along the lines of _string_late_inlines. This gives an opportunity to create a constant context for VectorSlice intrinsification, and if that fails we inline the fallback implementation to avoid any costly boxing penalties.
while (_late_inlines.length() > 0) {
  igvn_worklist()->ensure_empty(); // should be done with igvn
  ...
  while (inline_incrementally_one()) {
Is it possible to generate a _vector_late_inlines candidate again in inline_incrementally_one?
Compile::inline_vector_fallback(PhaseIterGVN& igvn) inlines the fallback implementation; if the fallback is composed of intrinsics, those are recorded in _late_inlines and go through intrinsification attempts. We don't append to _vector_late_inlines while executing the fallback generator:
https://github.com/jatin-bhateja/jdk/blob/444a35685eed8442b721795824cf0a43c7bb02f8/src/hotspot/share/opto/callGenerator.cpp#L722
There are regressions for these two cases. Do you know the root cause?
Hi @XiaohongGong, I observed that there is quite a lot of run-to-run variation in these micros even with the stock JDK. I collected PMU events and found that on an AVX-512 system there are MISALIGNED vector memory operations in the fallback, which cause this variation.
Hi @XiaohongGong, @merykitty, @erifan, let me know if you have other comments.

LGTM, thanks!
I'm still not convinced by this solution. If the pattern-matching method proves to be unreliable, then we can proceed with an intrinsic. Otherwise, we risk introducing a change that will eventually become redundant.
Hi @merykitty, as discussed earlier, your suggestions were incorporated in the latest version of the patch. The idea here is not to hold an optimization in anticipation of a future optimization. The x86 backend changes will still be usable if at a later point we decide to use complex pattern matching once TypeVect carries constant information. What we have currently is generic handling which can inline any fallback after failed intrinsification attempts. Looking forward to your comments on the backend part and any further improvements to the existing handling.

Hi @sviswa7, @iwanowww, may I request you to share your views / comments?
I briefly looked at the patch. First of all, I suggest separating out the logic that handles intrinsification failures. It's not specific to the proposed enhancement and will improve handling of intrinsification failures for vector operations in general. Speaking of the proposed approach, it aligns well with current Vector API implementation practices. I agree it would be nice to automatically detect equivalent IR shapes and transform them accordingly, but if it means hard-coding the shape of …
Thanks @iwanowww. I agree that the approach of inlining on intrinsic failure is generic enough and can benefit other vector operations as well, since it may absorb boxing penalties. For slice and unslice, since the fallback is completely written in vector APIs, it will give the most benefit, and that is the focus of this patch. Looking forward to your other comments on the current implementation.
The patch optimizes the Vector.slice operation with a constant index using the x86 VPALIGNR instruction.
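As a sketch of why (V)PALIGNR fits this operation: for byte vectors, the instruction concatenates two 16-byte sources and extracts a byte-aligned window, which is exactly a slice with a constant origin. Below is a hypothetical scalar model in plain Java (illustrative only, modelling a single 128-bit lane; not the actual backend code):

```java
// Scalar model of PALIGNR on one 128-bit lane: result byte i is byte
// (i + imm) of the 32-byte concatenation hi:lo (lo = bytes 0..15,
// hi = bytes 16..31); bytes shifted in beyond the end are zero.
class PalignrModel {
    static byte[] palignr(byte[] hi, byte[] lo, int imm) {
        byte[] r = new byte[16];
        for (int i = 0; i < 16; i++) {
            int idx = i + imm;
            r[i] = idx < 16 ? lo[idx] : (idx < 32 ? hi[idx - 16] : 0);
        }
        return r;
    }

    // A byte-vector a.slice(origin, b) then maps to a single
    // palignr(hi = b, lo = a, imm = origin) when origin is constant.
    static byte[] slice(byte[] a, byte[] b, int origin) {
        return palignr(b, a, origin);
    }
}
```

This is why a constant index matters: `imm` is an immediate operand of the instruction, so a variable origin cannot use this encoding directly.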
It also adds a new hybrid call generator to facilitate lazy intrinsification, falling back to procedural inlining to prevent call overhead and boxing penalties when the fallback implementation is expected to operate over vectors. The existing vector API based slice implementation is now the fallback code that gets inlined in case intrinsification fails.

The idea here is to add infrastructure support to enable intrinsification of the fast path for selected vector APIs, and otherwise to enable inlining of the fallback implementation if it is based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and invoked during incremental inlining. It also relieves the inline expander from handling slow paths, which can easily be implemented on the library side (in Java).
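The dispatch logic of such a hybrid call generator can be modeled roughly as follows. This is a toy sketch in plain Java with hypothetical names, not HotSpot's actual CallGenerator API: attempt the intrinsic fast path first, and expand the library fallback inline only if intrinsification fails.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Toy model of the hybrid call generator: try the intrinsic (which may
// fail, e.g. when the slice index is not a constant); on failure expand
// the fallback inline instead of leaving a call, avoiding call overhead
// and boxing penalties.
class HybridCallGenerator<T> {
    private final Supplier<Optional<T>> intrinsic; // fast path, may fail
    private final Supplier<T> fallback;            // always succeeds

    HybridCallGenerator(Supplier<Optional<T>> intrinsic, Supplier<T> fallback) {
        this.intrinsic = intrinsic;
        this.fallback = fallback;
    }

    T generate() {
        // Mirrors "if (new_jvms == nullptr) inline the fallback".
        return intrinsic.get().orElseGet(fallback);
    }
}
```

In the actual patch the fallback is itself composed of vector API calls, so its own intrinsics get a further intrinsification opportunity after inlining.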
Vector API jtreg tests pass at AVX level 2; the remaining validation is in progress.
Performance numbers:
Please share your feedback.
Best Regards,
Jatin