Skip to content

fix fork cache bug in MemoryAccessor.zig #24376

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

GrayHatter
Copy link
Contributor

wrote this quickly via the web interface, but didn't validate it yet. Submitting the PR for comments and I can fix this up if the approach seems reasonable.

Comment on lines 66 to 69
if (newpid != pid) {
@atomicStore(posix.pid_t, &cached_pid, newpid, .monotonic);
continue :linux -1;
} else unreachable;
Copy link
Member

@ifreund ifreund Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cleaner to write this as

assert(newpid != pid);
@atomicStore(...

.INVAL, .SRCH => unreachable, // own pid is always valid
.INVAL => unreachable, // EINVAL is returned when flags to process_vm_readv is non-zero
.SRCH => {
// process with pid does not exist, but it should be our pid, so we'll try again.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please explain an example of a case in which this can happen (post-fork()) in the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't sure how much detail to add, do you think this is enough?

@alexrp
Copy link
Member

alexrp commented Jul 9, 2025

What if the parent process has execve'd? Or its PID has been recycled? We won't get ESRCH, but using the cached PID is still clearly wrong.

@GrayHatter
Copy link
Contributor Author

GrayHatter commented Jul 9, 2025

What if the parent process has execve'd? Or its PID has been recycled? We won't get ESRCH, but using the cached PID is still clearly wrong.

I agree, I considered both 1) deleting cached_pid completely and 2) exposing it as pub to give callers a work around. I reject both because 1) seem to be more a more controversial change that I wasn't prepared to defend, and 2) it seemed strictly worse than both this change and deleting cached_pid, it also felt like leaving in a foot gun that didn't need to exist, or be someone else's problem.

(I also considered creating a function to invalidate/reset the cache that I would also call from fork(), but enshrining this as desirable felt like a mistake I'd regret later when rambling about tech debt, lol)

@alexrp
Copy link
Member

alexrp commented Jul 10, 2025

  1. deleting cached_pid completely

I feel like this would be the right move. This is stack trace code; getting correct and useful results is more important than getting them slightly faster, IMO. But cc @jacobly0 since IIRC he wrote this code.

@jacobly0
Copy link
Member

jacobly0 commented Jul 10, 2025

Why can't you just move the global to a field and initialize it to -1 every time the struct is initialized?

@alexrp
Copy link
Member

alexrp commented Jul 10, 2025

I guess if I were to be super pedantic, it's possible for a signal to be delivered while we're doing a stack trace, and for the signal handler to fork(). Then, in the child process, when the signal handler returns and the stack trace code resumes, the PID will be wrong.

But this is obviously a fairly ridiculous edge case.

@jacobly0
Copy link
Member

A future revision of the standard is likely to remove fork(2) from the list of async-signal-safe functions.

@GrayHatter
Copy link
Contributor Author

it's possible for a signal to be delivered while we're doing a stack trace, and for the signal handler to fork()

If you fork in a signal handler... I feel like that's an implicit agreement that you enjoy fighting dragons, and then this behavior would be "expected" :P

Why can't you just move the global to a field and initialize it to -1 every time the struct is initialized?

I was assuming there was a reason it was a file var instead of a member field, in part because it's wrapped in an atomic setter. But that might be a cleaner option. Happy to write that commit too. But I assume this pattern is rare enough it's fine to optimize for the fork/exec instead of introducing a syscall in what I assume is a very hot path?

@jacobly0
Copy link
Member

The goal is to delete this file once the actual bugs it is working around are fixed anyway.

@jacobly0
Copy link
Member

jacobly0 commented Jul 10, 2025

Then, in the child process, when the signal handler returns and the stack trace code resumes, the PID will be wrong.

This is also true if you call getpid, get interrupted, get forked, and then when the handler returns you will have the wrong return value. A cache is no different from using the pid directly for literally anything. (Also this whole experiment is only valid in single-threaded programs to begin with.)

@@ -38,8 +38,8 @@ pub fn deinit(ma: *MemoryAccessor) void {

fn read(ma: *MemoryAccessor, address: usize, buf: []u8) bool {
switch (native_os) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also try deleting this switch entirely and seeing if CI still passes. I just did a preliminary test which suggests that the issue this was working around is no longer present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacobly0 I think is what you had in mind?

I didn't scan through the commit history to understand if this would be the simplest revert, and also because I don't have the context around the struct, I'm very much depending on CI + review here to validate this change. Let me know if I missed something obvious?

@GrayHatter GrayHatter force-pushed the patch-1 branch 3 times, most recently from b4b2871 to 59b3ebd Compare July 11, 2025 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants