Skip to content

Fix two flaky/broken functional tests#9660

Open
spkane31 wants to merge 5 commits intomainfrom
functional-test-crash
Open

Fix two flaky/broken functional tests#9660
spkane31 wants to merge 5 commits intomainfrom
functional-test-crash

Conversation

@spkane31
Copy link
Contributor

What changed?

  1. tests/task_queue_stats_test.go — Remove the t *testing.T field from taskQueueStatsSuite and replace all s.t usages with s.T(). Critically, fix validateDescribeWorkerDeploymentVersion to use a.NoError(err) instead of require.NoError(s.T(), err) inside the callback.
  2. tests/xdc/history_replication_signals_and_updates_test.go — Fix pollWorkflowResult to handle errors from GetWorkflowExecutionHistory gracefully: return the error from the inner function, retry with a nil page token on transient errors (e.g. CurrentBranchChanged after conflict resolution), and abort cleanly when the context expires.

Why?

TestTaskQueueStats panic: Calling require.NoError(s.T(), err) inside an EventuallyWithT callback is unsafe. EventuallyWithT runs its callback in a separate goroutine on a ticker. If the outer test context expires and the test completes while this callback is still running, the require.NoError call invokes t.Fail() on a completed test, which panics the entire test binary with panic: Fail in goroutine after TestXxx has completed. The fix uses the CollectT-scoped a.NoError(err) so assertions are buffered and safe.

TestConflictResolutionGetResult hang/crash: pollWorkflowResult was calling responseInner.History.Events without checking if responseInner was nil. When GetWorkflowExecutionHistory returns an error (e.g. CurrentBranchChanged after conflict resolution resets the branch token), the response is nil, causing a nil pointer dereference that panics the goroutine. Because the goroutine crashed before sending to workflowResultCh, the main test goroutine blocked on <-workflowResultCh for the entire 30-minute CI timeout. The fix returns the error from the inner function and retries with a nil page token on transient errors, which allows the poll to succeed on the updated branch.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

The XDC fix changes pollWorkflowResult to retry indefinitely on any non-context error. If a non-transient error were to occur repeatedly, this could loop until the context expires rather than failing immediately. This is acceptable since the context timeout provides a hard bound, and the test failure message via c.t.s.NoError(ctx.Err(), ...) will make the root cause clear.

// taskQueueStatsSuite encapsulates the test environment and parameters for task queue stats tests.
taskQueueStatsSuite struct {
testcore.Env
usePriMatcher bool
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a lint to prevent embedding testing.T in new suites, this is causing issues for our tests

Copy link
Contributor

@stephanos stephanos Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since #9536 decouples Env and assertions, that might already help a lot.

@spkane31 spkane31 marked this pull request as ready for review March 25, 2026 21:43
@spkane31 spkane31 requested review from a team as code owners March 25, 2026 21:43
@spkane31 spkane31 requested a review from stephanos March 25, 2026 21:43

req.ReportTaskQueueStats = true
resp, err := s.FrontendClient().DescribeWorkerDeploymentVersion(ctx, req)
require.NoError(s.T(), err)
Copy link
Contributor

@stephanos stephanos Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be impossible with #9490 :D

Comment on lines +1088 to +1090
if len(allEvents) == 0 {
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we still need this one if the assertion checks it's 1 already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@spkane31 spkane31 enabled auto-merge (squash) March 25, 2026 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants