Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use injected now() instead of time.Now() in summary methods #1672

Merged
merged 2 commits into from
Nov 3, 2024

Conversation

imorph
Copy link
Contributor

@imorph imorph commented Nov 2, 2024

Hi!

This PR should fix #1666

How the problem with TestSummaryDecay manifests?

When you run go test -run TestSummaryDecay normally on laptop it is always pass.

=== RUN   TestSummaryDecay
--- PASS: TestSummaryDecay (0.00s)
PASS

But on CI it (sometimes) fails.

--- FAIL: TestSummaryDecay (1.17s)
    summary_test.go:396: 170. got 102.000000, want 80.000000
    summary_test.go:396: 180. got 112.000000, want 90.000000

Looks like it takes very performance constrained environment to reveal problem. Lets recreate one.

How to reproduce locally

Run something like stress -c 14 and while it running execute go test -run TestSummaryDecay again:

=== RUN   TestSummaryDecay
    summary_test.go:396: 220. got 166.000000, want 130.000000
    summary_test.go:396: 230. got 181.000000, want 140.000000
    summary_test.go:396: 240. got 197.000000, want 150.000000
    summary_test.go:396: 250. got 205.000000, want 160.000000
    summary_test.go:396: 260. got 221.000000, want 170.000000
    summary_test.go:396: 270. got 226.000000, want 180.000000
    summary_test.go:396: 280. got 236.000000, want 190.000000
    summary_test.go:396: 290. got 243.000000, want 200.000000
    summary_test.go:396: 300. got 249.000000, want 210.000000
    summary_test.go:396: 310. got 256.000000, want 220.000000
    summary_test.go:396: 320. got 264.000000, want 230.000000
    summary_test.go:396: 330. got 268.000000, want 240.000000
    summary_test.go:396: 340. got 273.000000, want 250.000000
    summary_test.go:396: 820. got 752.000000, want 730.000000
    summary_test.go:396: 830. got 762.000000, want 740.000000
    summary_test.go:396: 840. got 778.000000, want 750.000000
    summary_test.go:396: 850. got 793.000000, want 760.000000
    summary_test.go:396: 860. got 801.000000, want 770.000000
    summary_test.go:396: 870. got 810.000000, want 780.000000
    summary_test.go:396: 880. got 819.000000, want 790.000000
    summary_test.go:396: 890. got 827.000000, want 800.000000
    summary_test.go:396: 900. got 834.000000, want 810.000000
--- FAIL: TestSummaryDecay (1.28s)
FAIL

OK, its reproduces

Root cause analysis and proposed solution

The testis failing due to its dependency on real-time progression and the Summary implementation's direct use of time.Now(). By modifying the Summary implementation to utilize the injected now function from SummaryOpts and adjusting the test to control time progression, we eliminate the flakiness and make the test deterministic.

RCA

  • Dependency on Real Time: The test relied on actual system time, using time.Sleep and time.Ticker to simulate time progression. This made the test susceptible to failures due to variations in execution speed, CPU scheduling, and other external factors beyond our control.

  • Injected now Function Not Utilized: Although SummaryOpts includes a now function intended for testing purposes, the Summary implementation did not use this injected function. Instead, it directly called time.Now(), ignoring the custom time function provided in SummaryOpts.

  • Mismatch Between Test and Implementation: The test attempted to control time progression by advancing a mock now variable. However, since the Summary implementation continued to use the actual system time, there was a discrepancy between the test's simulated time and the Summary's internal time references, causing the test to fail consistently.

Results

Test is now stable and deterministic even if all CPUs are busy.

@imorph
Copy link
Contributor Author

imorph commented Nov 2, 2024

@ArthurSens pls review

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant, thank you! Your "steps to reproduce" was also very intelligent ❤️

Just one comment and let's :shipit:

prometheus/summary_test.go Outdated Show resolved Hide resolved
@ArthurSens ArthurSens merged commit 0c73c1c into prometheus:main Nov 3, 2024
10 checks passed
@ArthurSens
Copy link
Member

Thank you!

@bwplotka
Copy link
Member

bwplotka commented Nov 4, 2024

Nice, amazing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test: TestSummaryDecay
3 participants