[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest deno_core
.
#560
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What problem are you trying to solve?
Our
deno_core
is currently pinned to 0.196.0 (from July 2023). Upgrading to 0.205.0 or higher causes segfaults on Linux machines supporting Memory Protection Keys (PKU). With PKU enabled, v8 isolates can only be accessed by threads that were spawned by the thread that originally initialized v8.In production, we currently never explicitly initialize v8, instead letting
deno_core
take care of it transparently for us. In practice, this means that in a rayon pool or a rocket endpoint (managed by a tokio runtime), the thread that actually wins the race to initialize v8 is non-deterministic and isn't guaranteed to be the main thread.Additionally,
cargo test
runs the tests in parallel (never on the main thread), so it's a guarantee that the test will segfault because multiple threads without a parent/child relationship will attempt to access v8's memory.What is your solution?
Initialize v8 explicitly on the "correct" thread. This requires a number of changes:
Design
Typestate pattern ensures a v8 platform is always manually initialized
At compile time, we now guarantee that the v8 platform has been manually initialized before any calls to
deno_core
can happen (and thus, deno_core will never be able to initialize v8 itself). This is implemented by theV8Platform
struct. Only aV8Platform<Initialized>
struct can create DDSA JsRuntimes. (typestate refresher here).Library callers are now in control of JsRuntime allocation + management
This was a good time to refactor and remove the existing "magic" that maintained a thread-local JsRuntime on behalf of the caller. This PR removes the
analyze
function. Now, there is only theanalyze_with
function, which requires an explicit JsRuntime to be passed in (which is nice because it gives the consumer control over allocations and lifetimes).A custom Rayon thread pool is used on the server
JsRuntime
instances, we can guarantee that v8 has been initialized correctly. (This is strangely hard to do in a simple way with just tokio, which Rocket uses under the hood).Unit tests use an "unprotected" v8 platform
As mentioned above, due to how
cargo test
works, we can never initialize v8 on the main thread before tests run, meaning we'll always segfault. To get around this, for unit tests only, we use the "unprotected" v8 platform, which doesn't enforce the same PKU memory access guards as the "default" platform.Side notes
test-rules.py
script to make requests concurrently. Prior, it was making requests sequentially, so almost 100% of the time, the same datadog-static-analyzer-server thread would execute code via v8. This masks potential PKU-related issues, because they only arise when two separate threads attempt to access v8. Thus, this change ensures that our test triggers multiple datadog-static-analyzer-server threads to run v8 at the same time.Verification
Right now, GitHub Actions Linux runners do not support PKU, so they can't be used to verify this fix.
However, our internal GitLab CI instances (which use 3rd generation Intel Xeon processors) do, so internal reviewers can refer to the successful Linux amd64 job. External reviewers can verify this by testing this PR on similar hardware (for example, an m6i EC2 instance).
Side Note
I was getting some segfaults that were hard to reproduce that seemed related to
deno_core::JsRuntime
s that didn't use a snapshot. I changed our initialization so we now always create a snapshot. This can technically leak memory (but never will for our use cases). This deserves a follow-up to confirm if it is necessary, but I would rather include it now to be extra cautious.Alternatives considered
What the reviewer should know