Revisit testComm waiting on sawcommevfd #29

Open
pwaller opened this issue May 16, 2020 · 0 comments
pwaller commented May 16, 2020

In #25, I accidentally commented out the evwait(sawcommevfd) call in the testComm test. This was unintentional, and is unrelated to the fixes introduced in that PR.

This issue is here as a reminder to revisit this.

perf/record_test.go

Lines 528 to 544 in 4d8e4e5

// TODO(acln): investigate the legitimacy of the following crutch.
//
// Wait for the parent to see that we changed our name, then exit.
//
// If we do not wait here, there is a terrible race condition waiting
// to happen: If we PR_SET_NAME in the child, then immediately exit,
// the other side may not see POLLIN on the comm record: it may see
// POLLHUP directly, even though a comm record was actually written
// to the ring in the meantime. Why we get POLLHUP directly, and not
// POLLIN before it, is unclear. The machinery to deal with this
// eventuality in the poller does not exist yet, and at the time
// when this comment was written, I have found no good solutions to
// this conundrum.
//
// So we live with it, but still try to make our test pass.
// evwait(sawcommevfd)
_ = sawcommevfd
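
The comment above says the poller has no machinery for the case where POLLHUP arrives without a preceding POLLIN. One possible shape for that machinery is sketched below, purely as an illustration and not as what the perf package does: treat POLLHUP as "drain whatever is still in the ring, and only then report the hangup". The ring type, readRecord, and errHangup are hypothetical names invented for this sketch; the ring is a plain slice standing in for the real mmap'd buffer.

```go
package main

import (
	"errors"
	"fmt"
)

// Poll event bits, using the same values as POLLIN and POLLHUP on Linux.
const (
	pollIn  int16 = 0x001
	pollHup int16 = 0x010
)

// errHangup is a hypothetical error reported once the monitored process
// has gone away and no unread records remain.
var errHangup = errors.New("hangup: no more records")

// errNoData is returned when the wakeup carried neither POLLIN nor POLLHUP.
var errNoData = errors.New("no data")

// ring is a hypothetical stand-in for the perf ring buffer: records that
// have been written but not yet consumed by the reader.
type ring struct {
	records []string
}

// pop removes and returns the oldest unread record, if any.
func (r *ring) pop() (string, bool) {
	if len(r.records) == 0 {
		return "", false
	}
	rec := r.records[0]
	r.records = r.records[1:]
	return rec, true
}

// readRecord handles a single poll wakeup. The key point is that POLLHUP
// alone does not end the read: the ring is checked first, so a record
// written just before the child exited is still delivered.
func readRecord(r *ring, revents int16) (string, error) {
	if revents&(pollIn|pollHup) == 0 {
		return "", errNoData
	}
	if rec, ok := r.pop(); ok {
		return rec, nil
	}
	if revents&pollHup != 0 {
		return "", errHangup
	}
	return "", errNoData
}

func main() {
	// Simulate the race: a comm record is in the ring, but the poller
	// only ever observes POLLHUP.
	r := &ring{records: []string{"comm"}}

	rec, err := readRecord(r, pollHup)
	fmt.Printf("first wakeup:  %q, %v\n", rec, err)

	rec, err = readRecord(r, pollHup)
	fmt.Printf("second wakeup: %q, %v\n", rec, err)
}
```

Whether the real ring can be drained safely after POLLHUP is exactly the open question here, so this only shows the control flow, not a fix.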

Currently, if I apply the patch from #26 onto master at 4d8e4e5, then sudo ./perf.test -test.count=1000 -test.run=Record/Comm passes with no failures.

However, if I uncomment the evwait above, then the tests fail. In that case, ReadRecord doesn't return, even if the context deadline is set many seconds into the future. What I see is that the child ends up waiting for the signal sawcommevfd after changing its COMM, and never exits.

So the mystery is why ReadRecord doesn't return when this wait is present. It feels as though the kernel isn't respecting our wakeup events = 1. When the wait is commented out (erroneously by me), then the process exits, and we receive the event.
