Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest deno_core. #560

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jasonforal
Copy link
Collaborator

What problem are you trying to solve?

Our deno_core is currently pinned to 0.196.0 (from July 2023). Upgrading to 0.205.0 or higher causes segfaults on Linux machines supporting Memory Protection Keys (PKU). With PKU enabled, v8 isolates can only be accessed by threads that were spawned by the thread that originally initialized v8.

In production, we currently never explicitly initialize v8, instead letting deno_core take care of it transparently for us. In practice, this means that in a rayon pool or a rocket endpoint (managed by a tokio runtime), the thread that actually wins the race to initialize v8 is non-deterministic and isn't guaranteed to be the main thread.

Additionally, cargo test runs the tests in parallel (never on the main thread), so it's a guarantee that the test will segfault because multiple threads without a parent/child relationship will attempt to access v8's memory.

What is your solution?

Initialize v8 explicitly on the "correct" thread. This requires a number of changes:

Design

Typestate pattern ensures a v8 platform is always manually initialized
At compile time, we now guarantee that the v8 platform has been manually initialized before any calls to deno_core can happen (and thus, deno_core will never be able to initialize v8 itself). This is implemented by the V8Platform struct. Only a V8Platform<Initialized> struct can create DDSA JsRuntimes. (typestate refresher here).

Library callers are now in control of JsRuntime allocation + management
This was a good time to refactor and remove the existing "magic" that maintained a thread-local JsRuntime on behalf of the caller. This PR removes the analyze function. Now, there is only the analyze_with function, which requires an explicit JsRuntime to be passed in (which is nice because it gives the consumer control over allocations and lifetimes).

A custom Rayon thread pool is used on the server

  • By creating and using specific thread pool for JsRuntime instances, we can guarantee that v8 has been initialized correctly. (This is strangely hard to do in a simple way with just tokio, which Rocket uses under the hood).
  • Additionally, this is just an overall better way to handle threading. A tokio runtime shouldn't be used to handle both I/O intensive and CPU-intensive workloads. Before this PR, we blocked tokio's thread with CPU-intensive analysis jobs -- now we let rayon do what it does best and handle these tasks.

Unit tests use an "unprotected" v8 platform
As mentioned above, due to how cargo test works, we can never initialize v8 on the main thread before tests run, meaning we'll always segfault. To get around this, for unit tests only, we use the "unprotected" v8 platform, which doesn't enforce the same PKU memory access guards as the "default" platform.

Side notes

  • I updated the test-rules.py script to make requests concurrently. Prior, it was making requests sequentially, so almost 100% of the time, the same datadog-static-analyzer-server thread would execute code via v8. This masks potential PKU-related issues, because they only arise when two separate threads attempt to access v8. Thus, this change ensures that our test triggers multiple datadog-static-analyzer-server threads to run v8 at the same time.

Verification

Right now, GitHub Actions Linux runners do not support PKU, so they can't be used to verify this fix.

However, our internal GitLab CI instances (which use 3rd generation Intel Xeon processors) do, so internal reviewers can refer to the successful Linux amd64 job. External reviewers can verify this by testing this PR on similar hardware (for example, an m6i EC2 instance).

Side Note

I was getting some segfaults that were hard to reproduce that seemed related to deno_core::JsRuntimes that didn't use a snapshot. I changed our initialization so we now always create a snapshot. This can technically leak memory (but never will for our use cases). This deserves a follow-up to confirm if it is necessary, but I would rather include it now to be extra cautious.

Alternatives considered

What the reviewer should know

  • The typestate implementation of V8Platform may seem complicated, but by enforcing it at compile time, we remove footguns for future contributors who may not know about v8/PKU.
  • With rayon now on the server, we should eventually do a better job of coordinating thread use between tokio and rayon (e.g. have a server cli flag to control number of threads, and then split them between rayon/tokio)

Required changes:
* Use the latest API for creating extensions.
* Add V8Platform, changing the API to require explicit v8 initialization before `ddsa_lib::JsRuntime` can be created.
* Give callers control over JsRuntime allocation + management.
* Always create v8 snapshots.
* On datadog-static-analyzer-server, create a Rayon pool on the same thread as v8 is initialized.
… that multi-threaded behavior is tested on the server).
Required changes:
* Use the latest API for creating extensions.
Required changes:
* Modify implementation to utilize new snapshot instantiation API.
Required changes:
* The semantics of how deno_core handles the global proxy have changed. Before, freezing the true global was enough to prevent mutations of JavaScript's "globalThis" (i.e. the v8 global proxy). Now, these need to be explicitly frozen individually.
Required changes:
* Modify implementation to utilize new deno_core::JsRuntime v8 initialization API.
@jasonforal jasonforal requested a review from a team as a code owner November 22, 2024 23:47
@DataDog DataDog deleted a comment from datadog-datadog-prod-us1 bot Nov 22, 2024
@DataDog DataDog deleted a comment from datadog-datadog-prod-us1 bot Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant