Skip to content

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Oct 20, 2025

What does this PR do?

This PR tweaks the "sample from the signal handler" profiler specs to run with GC disabled.

To avoid accidentally running out of memory in loop_until, we additionally changed two things:

  • Use Process.clock_gettime (that returns a zero-allocation float) instead of Time.now (that returns a new Time instance)
  • Add a check_condition_every_seconds to avoid calling the condition checking code too often, as it can do allocations

Motivation:

This change is an attempt at avoiding the flakiness seen in https://github.com/DataDog/dd-trace-rb/actions/runs/18560146732/job/52906922299:

Failures:

  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start sampling from signal handler when signal handler sampling is disabled does not prepare samples in the signal handler
     Failure/Error: expect(sample_count).to be_within(20).percent_of(signal_handler_enqueued_sample)
       expected 15 to be within 20% of 20
     # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:1053:in 'block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:272:in 'block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:154:in 'block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.24.0/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>'
     # ./spec/support/execute_in_fork.rb:32:in 'ForkableExample#run'

Finished in 27.13 seconds (files took 1.14 seconds to load)
787 examples, 1 failure, 19 pending

Failed examples:

rspec ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:1068 # Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start sampling from signal handler when signal handler sampling is disabled does not prepare samples in the signal handler

The assertion that failed checks that the number of times the profiler actually got to sample is close to the number of times it tried to sample.

Unfortunately, I could not reproduce the issue locally but I suspect that what made this condition flaky is GC: the profiler is not able to sample during GC and thus if a long-enough-running GC happens at the wrong time, this condition could fail.

I am suspicious that GC is the source also because loop_until caused lots of allocations previously.

Hopefully this was the source of the flakiness; if not, I'll come up with more ideas next time ;)

Change log entry

None.

Additional Notes:

Fixes https://github.com/DataDog/ruby-guild/issues/272

How to test the change?

If CI is green we're good! The logic on the test itself hasn't changed.

@ivoanjo ivoanjo requested review from a team as code owners October 20, 2025 09:57
@github-actions github-actions bot added the dev/testing Involves testing processes (e.g. RSpec) label Oct 20, 2025
@pr-commenter
Copy link

pr-commenter bot commented Oct 20, 2025

Benchmarks

Benchmark execution time: 2025-10-21 09:02:10

Comparing candidate commit fa5db6f in PR branch ivoanjo/prof-12763-flaky-sighandler-spec with baseline commit 01568d4 in branch master.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 43 metrics, 2 unstable metrics.

scenario:profiling - Allocations ()

  • 🟥 throughput [-267993.421op/s; -260665.612op/s] or [-8.020%; -7.800%]

@datadog-datadog-prod-us1
Copy link
Contributor

datadog-datadog-prod-us1 bot commented Oct 21, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage
Patch Coverage: 100.00%
Total Coverage: 98.58% (+0.03%)

View detailed report

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fa5db6f | Docs | Was this helpful? Give us feedback!

**What does this PR do?**

This PR tweaks the "sample from the signal handler" profiler specs to
run with GC disabled.

To avoid accidentally running out of memory in `loop_until`, we
additionally changed two things:
* Use `Process.clock_gettime` (that returns a zero-allocation float)
  instead of `Time.now` (that returns a new `Time` instance)
* Add a `check_condition_every_seconds` to avoid calling the
  condition checking code too often, as it can do allocations

**Motivation:**

This change is an attempt at avoiding the flakiness seen in
<https://github.com/DataDog/dd-trace-rb/actions/runs/18560146732/job/52906922299>:

```
Failures:

  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start sampling from signal handler when signal handler sampling is disabled does not prepare samples in the signal handler
     Failure/Error: expect(sample_count).to be_within(20).percent_of(signal_handler_enqueued_sample)
       expected 15 to be within 20% of 20
     # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:1053:in 'block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:272:in 'block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:154:in 'block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.24.0/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>'
     # ./spec/support/execute_in_fork.rb:32:in 'ForkableExample#run'

Finished in 27.13 seconds (files took 1.14 seconds to load)
787 examples, 1 failure, 19 pending

Failed examples:

rspec ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:1068 # Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start sampling from signal handler when signal handler sampling is disabled does not prepare samples in the signal handler
```

The assertion that failed checks that the number of times the
profiler actually got to sample is close to the number of times it
tried to sample.

Unfortunately, I could not reproduce the issue locally but I suspect
that what made this condition flaky is GC: the profiler is not
able to sample during GC and thus if a long-enough-running GC happens
at the wrong time, this condition could fail.

I am suspicious that GC is the source also because `loop_until`
caused lots of allocations previously.

Hopefully this was the source of the flakiness; if not, I'll
come up with more ideas next time ;)

**Additional Notes:**

Fixes DataDog/ruby-guild#272

**How to test the change?**

If CI is green we're good! The logic on the test itself hasn't changed.
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-12763-flaky-sighandler-spec branch from b9ecb06 to fa5db6f Compare October 21, 2025 08:32
@ivoanjo ivoanjo merged commit f8dfc57 into master Oct 21, 2025
425 of 426 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-12763-flaky-sighandler-spec branch October 21, 2025 09:30
@github-actions github-actions bot added this to the 2.23.0 milestone Oct 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dev/testing Involves testing processes (e.g. RSpec)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants