[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest `deno_core`. #560

jasonforal · 2024-11-22T23:47:30Z

What problem are you trying to solve?

Our deno_core is currently pinned to 0.196.0 (from July 2023). Upgrading to 0.205.0 or higher causes segfaults on Linux machines supporting Memory Protection Keys (PKU). With PKU enabled, v8 isolates can only be accessed by threads that were spawned by the thread that originally initialized v8.

In production, we currently never explicitly initialize v8, instead letting deno_core take care of it transparently for us. In practice, this means that in a rayon pool or a rocket endpoint (managed by a tokio runtime), the thread that actually wins the race to initialize v8 is non-deterministic and isn't guaranteed to be the main thread.

Additionally, cargo test runs the tests in parallel (never on the main thread), so it's a guarantee that the test will segfault because multiple threads without a parent/child relationship will attempt to access v8's memory.

What is your solution?

Initialize v8 explicitly on the "correct" thread. This requires a number of changes:

Design

Typestate pattern ensures a v8 platform is always manually initialized
At compile time, we now guarantee that the v8 platform has been manually initialized before any calls to deno_core can happen (and thus, deno_core will never be able to initialize v8 itself). This is implemented by the V8Platform struct. Only a V8Platform<Initialized> struct can create DDSA JsRuntimes. (typestate refresher here).

Library callers are now in control of JsRuntime allocation + management
This was a good time to refactor and remove the existing "magic" that maintained a thread-local JsRuntime on behalf of the caller. This PR removes the analyze function. Now, there is only the analyze_with function, which requires an explicit JsRuntime to be passed in (which is nice because it gives the consumer control over allocations and lifetimes).

A custom Rayon thread pool is used on the server

By creating and using specific thread pool for JsRuntime instances, we can guarantee that v8 has been initialized correctly. (This is strangely hard to do in a simple way with just tokio, which Rocket uses under the hood).
Additionally, this is just an overall better way to handle threading. A tokio runtime shouldn't be used to handle both I/O intensive and CPU-intensive workloads. Before this PR, we blocked tokio's thread with CPU-intensive analysis jobs -- now we let rayon do what it does best and handle these tasks.

Unit tests use an "unprotected" v8 platform
As mentioned above, due to how cargo test works, we can never initialize v8 on the main thread before tests run, meaning we'll always segfault. To get around this, for unit tests only, we use the "unprotected" v8 platform, which doesn't enforce the same PKU memory access guards as the "default" platform.

Side notes

I updated the test-rules.py script to make requests concurrently. Prior, it was making requests sequentially, so almost 100% of the time, the same datadog-static-analyzer-server thread would execute code via v8. This masks potential PKU-related issues, because they only arise when two separate threads attempt to access v8. Thus, this change ensures that our test triggers multiple datadog-static-analyzer-server threads to run v8 at the same time.

Verification

Right now, GitHub Actions Linux runners do not support PKU, so they can't be used to verify this fix.

However, our internal GitLab CI instances (which use 3rd generation Intel Xeon processors) do, so internal reviewers can refer to the successful Linux amd64 job. External reviewers can verify this by testing this PR on similar hardware (for example, an m6i EC2 instance).

Side Note

I was getting some segfaults that were hard to reproduce that seemed related to deno_core::JsRuntimes that didn't use a snapshot. I changed our initialization so we now always create a snapshot. This can technically leak memory (but never will for our use cases). This deserves a follow-up to confirm if it is necessary, but I would rather include it now to be extra cautious.

Alternatives considered

What the reviewer should know

The typestate implementation of V8Platform may seem complicated, but by enforcing it at compile time, we remove footguns for future contributors who may not know about v8/PKU.
With rayon now on the server, we should eventually do a better job of coordinating thread use between tokio and rayon (e.g. have a server cli flag to control number of threads, and then split them between rayon/tokio)

Required changes: * Use the latest API for creating extensions.

* Add V8Platform, changing the API to require explicit v8 initialization before `ddsa_lib::JsRuntime` can be created. * Give callers control over JsRuntime allocation + management. * Always create v8 snapshots. * On datadog-static-analyzer-server, create a Rayon pool on the same thread as v8 is initialized.

… that multi-threaded behavior is tested on the server).

Required changes: * Use the latest API for creating extensions.

Required changes: * Modify implementation to utilize new snapshot instantiation API.

Required changes: * The semantics of how deno_core handles the global proxy have changed. Before, freezing the true global was enough to prevent mutations of JavaScript's "globalThis" (i.e. the v8 global proxy). Now, these need to be explicitly frozen individually.

Required changes: * Modify implementation to utilize new deno_core::JsRuntime v8 initialization API.

jasonforal added 7 commits November 21, 2024 14:37

Bump deno_core to 0.213.0

ce24f49

Required changes: * Use the latest API for creating extensions.

Modify test-rules.py to send server requests concurrently (ensuring…

69dae69

… that multi-threaded behavior is tested on the server).

Bump deno_core to 0.264.0

038a6a8

Required changes: * Use the latest API for creating extensions.

Bump deno_core to 0.265.0

2baf4e8

Required changes: * Modify implementation to utilize new snapshot instantiation API.

Bump deno_core to 0.321.0

8a1c8df

Required changes: * Modify implementation to utilize new deno_core::JsRuntime v8 initialization API.

jasonforal requested a review from juli1 November 22, 2024 23:47

jasonforal requested a review from a team as a code owner November 22, 2024 23:47

DataDog deleted a comment from datadog-datadog-prod-us1 bot Nov 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest `deno_core`. #560

[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest `deno_core`. #560

jasonforal commented Nov 22, 2024

[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest deno_core. #560

Are you sure you want to change the base?

[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest deno_core. #560

Conversation

jasonforal commented Nov 22, 2024

What problem are you trying to solve?

What is your solution?

Design

Side notes

Verification

Side Note

Alternatives considered

What the reviewer should know

[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest `deno_core`. #560

[STAL-2917] Fix PKU-related segfaults that prevent upgrading to the latest `deno_core`. #560