Fix: Regression in benchmarks JetStream, ARES, speedometer #1691 #1695
Open
justinmichaud wants to merge 1 commit into
…mForEmbedded#1691

This patch makes some simple fixes to improve 32-bit performance on 2.46.

1) Disable concurrent JIT if we cannot write 64-bit values atomically

Most of the chips we run on have no problem doing 64-bit atomic writes, and in that case we completely recover the perf lost from enabling concurrent JIT (and in fact gain perf). We detect LPAE and, if it is present, skip the store barriers that were previously needed to guarantee we didn't dereference a bad cell.

2) Clear value profiles

The 0 value is not the empty JSValue, which was polluting our profiles. We now clear them manually, and our profiling works much better.

3) Tune various options

I tuned many thresholds based on what worked on my machine. This probably requires a second round of tuning on a smaller device.

4) Turn off SIMD

SIMD::find is a regression on 32-bit due to some fallback paths.

Overall, on my Neoverse N1, after this patch 2.46 is 26% faster than 2.38 when excluding wasm subtests. There are still some regressed subtests worth investigating, with as much as an 11% regression.
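To illustrate the value-profile point in the description, here is a minimal sketch of why zero-filled profile storage can look like real samples when the empty value has a non-zero encoding. Everything in it is hypothetical: `ValueProfileSketch`, `kEmptyTag`, `encode`, and the two-bucket layout are illustrative stand-ins, not the engine's actual types or constants.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 32-bit value encoding: a 32-bit tag in the high word and a
// 32-bit payload in the low word. The tag constant is illustrative only.
constexpr uint32_t kEmptyTag = 0xfffffffau;

constexpr uint64_t encode(uint32_t tag, uint32_t payload)
{
    return (static_cast<uint64_t>(tag) << 32) | payload;
}

constexpr uint64_t kEmptyValue = encode(kEmptyTag, 0);

// Hypothetical profile with two sample buckets.
struct ValueProfileSketch {
    std::array<uint64_t, 2> buckets;

    // A bucket counts as "no sample yet" only if it holds the empty encoding.
    bool isUnsampled(std::size_t i) const { return buckets[i] == kEmptyValue; }

    // Explicitly reset every bucket to the empty encoding.
    void clear() { buckets.fill(kEmptyValue); }
};

// Zero-fills the profile, as freshly zeroed memory would.
inline void zeroFill(ValueProfileSketch& profile)
{
    std::memset(&profile, 0, sizeof(profile));
}

// Counts buckets that would be treated as real samples.
inline int sampledBuckets(const ValueProfileSketch& profile)
{
    int count = 0;
    for (std::size_t i = 0; i < profile.buckets.size(); ++i)
        count += profile.isUnsampled(i) ? 0 : 1;
    return count;
}
```

Under these assumptions, a zero-filled profile reports both buckets as sampled even though nothing was recorded, while a profile reset with `clear()` reports none.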
finish(dst.withOffset(TagOffset));
// CJIT is only enabled when LPAE is enabled (such as for armv8l). In this case,
// 64-bit aligned stores are atomic: https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Application-Level-Memory-Model/Memory-types-and-attributes-and-the-memory-order-model/Atomicity-in-the-ARM-architecture
// > In an implementation that includes the Large Physical Address Extension, LDRD and STRD accesses to 64-bit aligned locations are 64-bit single-copy atomic as seen by translation table walks and accesses to translation tables.
FWIW, I think the part we want to quote here is
> The system designer must ensure that all writable memory locations that might be used to hold translations, such as bulk SDRAM, can be accessed with 64-bit single-copy atomicity.
(later in the same document)
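As a rough illustration of why single-copy atomicity matters for a tag/payload pair, the sketch below models what a concurrent reader could observe if the two 32-bit halves were stored separately, and contrasts it with a single 64-bit store. All names (`pack`, `tornIntermediate`, `storeWhole`) are hypothetical. It also uses `std::atomic` only for portability, whereas the patch, per the comment above, relies on aligned 64-bit stores being atomic when LPAE is present.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical packing of a 32-bit tag (high word) and payload (low word).
constexpr uint64_t pack(uint32_t tag, uint32_t payload)
{
    return (static_cast<uint64_t>(tag) << 32) | payload;
}

// Models the state visible to a concurrent reader halfway through a split
// store in which the new tag has landed but the payload is still the old one.
constexpr uint64_t tornIntermediate(uint64_t oldValue, uint64_t newValue)
{
    return (newValue & 0xffffffff00000000ull) | (oldValue & 0x00000000ffffffffull);
}

// Publishes tag and payload as one 64-bit store, so a reader sees either the
// complete old value or the complete new value, never a mix.
inline void storeWhole(std::atomic<uint64_t>& slot, uint32_t tag, uint32_t payload)
{
    slot.store(pack(tag, payload), std::memory_order_relaxed);
}
```

The torn state pairs a new tag with a stale payload, which is the kind of mismatch that could lead a concurrent compiler thread to treat a stale payload as a cell pointer.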
});
}
}
Looks great to me.
aoikonomopoulos
approved these changes
Jul 3, 2026
93f3188