-
Notifications
You must be signed in to change notification settings - Fork 395
[PROF-12763] Tweak profiler signal handler tests to avoid flakiness #4992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
BenchmarksBenchmark execution time: 2025-10-21 09:02:10 Comparing candidate commit fa5db6f in PR branch Found 0 performance improvements and 1 performance regressions! Performance is the same for 43 metrics, 2 unstable metrics. scenario:profiling - Allocations ()
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage 🔗 Commit SHA: fa5db6f | Docs | Was this helpful? Give us feedback! |
**What does this PR do?** This PR tweaks the "sample from the signal handler" profiler specs to run with GC disabled. To avoid accidentally running out of memory in `loop_until`, we additionally changed two things: * Use `Process.clock_gettime` (that returns a zero-allocation float) instead of `Time.now` (that returns a new `Time` instance) * Add a `check_condition_every_seconds` to avoid calling the condition checking code too often, as it can do allocations **Motivation:** This change is an attempt at avoiding the flakiness seen in <https://github.com/DataDog/dd-trace-rb/actions/runs/18560146732/job/52906922299>: ``` Failures: 1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start sampling from signal handler when signal handler sampling is disabled does not prepare samples in the signal handler Failure/Error: expect(sample_count).to be_within(20).percent_of(signal_handler_enqueued_sample) expected 15 to be within 20% of 20 # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:1053:in 'block (4 levels) in <top (required)>' # ./spec/spec_helper.rb:272:in 'block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:154:in 'block (2 levels) in <top (required)>' # /usr/local/bundle/gems/webmock-3.24.0/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>' # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in 'ForkableExample#run' Finished in 27.13 seconds (files took 1.14 seconds to load) 787 examples, 1 failure, 19 pending Failed examples: rspec ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:1068 # Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start sampling from signal handler when signal handler sampling is disabled does not prepare samples in the signal handler ``` The assertion that failed checks that the number of times the profiler actually got to sample is close to the number of times it tried to sample. Unfortunately, I could not reproduce the issue locally but I suspect that what made this condition flaky is GC: the profiler is not able to sample during GC and thus if a long-enough-running GC happens at the wrong time, this condition could fail. I am suspicious that GC is the source also because `loop_until` caused lots of allocations previously. Hopefully this was the source of the flakiness; if not, I'll come up with more ideas next time ;) **Additional Notes:** Fixes DataDog/ruby-guild#272 **How to test the change?** If CI is green we're good! The logic on the test itself hasn't changed.
b9ecb06
to
fa5db6f
Compare
What does this PR do?
This PR tweaks the "sample from the signal handler" profiler specs to run with GC disabled.
To avoid accidentally running out of memory in
loop_until
, we additionally changed two things:Process.clock_gettime
(that returns a zero-allocation float) instead ofTime.now
(that returns a newTime
instance)check_condition_every_seconds
to avoid calling the condition checking code too often, as it can do allocationsMotivation:
This change is an attempt at avoiding the flakiness seen in https://github.com/DataDog/dd-trace-rb/actions/runs/18560146732/job/52906922299:
The assertion that failed checks that the number of times the profiler actually got to sample is close to the number of times it tried to sample.
Unfortunately, I could not reproduce the issue locally but I suspect that what made this condition flaky is GC: the profiler is not able to sample during GC and thus if a long-enough-running GC happens at the wrong time, this condition could fail.
I am suspicious that GC is the source also because
loop_until
caused lots of allocations previously.Hopefully this was the source of the flakiness; if not, I'll come up with more ideas next time ;)
Change log entry
None.
Additional Notes:
Fixes https://github.com/DataDog/ruby-guild/issues/272
How to test the change?
If CI is green we're good! The logic on the test itself hasn't changed.