Skip to content

Commit 8725395

Browse files
ci/test: deflake retry tests under merge-queue load
Two compounding fixes for the flake observed on PR #812's merge_group run, where test_oserror_retries and test_retry_max_count_not_exceeded failed with `assert mock_validate_conn.call_count == 6` because unexpected `/telemetry-ext` requests had been counted alongside the intended session-endpoint retries. 1. tests/e2e/common/retry_test_mixins.py — strengthen `_isolated_from_telemetry()` with two additional defensive patches: - TelemetryClient._send_telemetry → no-op - TelemetryClient._export_event → no-op The existing factory swap installs NoopTelemetryClient for connections created during the test, but doesn't cover real TelemetryClient instances that slip in via other paths (stale module-global, pre-existing client created before the test entered, or code that bypasses initialize_telemetry_client). Patching at the class level for the duration of the context catches all of them. Verified locally: test_oserror_retries goes from flaky-on-CI to 5/5 green in consecutive runs. (test_retry_max_count_not_exceeded still fails on this branch but also fails on baseline main — pre-existing `'SimpleHttpResponse' object has no attribute 'version_string'` issue, unrelated.) 2. .github/workflows/code-coverage.yml — serialise merge_group runs. Previous concurrency group was keyed on github.ref, which is per-PR in the queue (`gh-readonly-queue/main/pr-N-…`). That allowed multiple queue entries to hammer the same warehouse in parallel, stressing telemetry / retry paths that single-PR runs don't exercise. Group merge_group + workflow_dispatch under a single fixed name (`e2e-mq-serial`) so they run one at a time. PR-event runs keep per-ref grouping + cancel-in-progress for fast author feedback. Trade-off: queue throughput drops to one ~17-min run at a time. For this repo's PR volume that's acceptable, and the alternative is flaky merges. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
1 parent 00dc084 commit 8725395

2 files changed

Lines changed: 43 additions & 13 deletions

File tree

.github/workflows/code-coverage.yml

Lines changed: 16 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -16,18 +16,23 @@ on:
1616
# the merge has already happened and the coverage check has no power
1717
# to block. Hence we deliberately don't subscribe to `push:main`.
1818
#
19-
# Serialise E2E runs per ref so a force-push (or a fast follow-up commit)
20-
# on a PR cancels the previous run instead of racing it against shared
21-
# warehouse state (Delta tables, UC Volume files, etc.).
22-
#
23-
# Merge-queue runs are NOT cancelled — each queue entry needs its own
24-
# clean CI signal so a regression on entry N doesn't get hidden by
25-
# entry N+1 arriving seconds later. (Concurrent queue runs can still
26-
# collide on shared warehouse state, but that's the cost of preserving
27-
# per-entry signal; the uuid-suffix conventions in the e2e tests are
28-
# what keep them isolated.)
19+
# Concurrency groups:
20+
# - pull_request: per-ref + cancel-in-progress. A force-push or fast
21+
# follow-up commit on a PR cancels the previous run instead of
22+
# racing it against shared warehouse state (Delta tables, UC Volume
23+
# files, telemetry endpoints, etc.).
24+
# - merge_group: serialised globally with a fixed group name. The
25+
# warehouse can't tolerate two parallel queue entries hammering
26+
# telemetry / retry paths simultaneously — we have observed flaky
27+
# retry-test failures (extra `/telemetry-ext` retries inflating
28+
# mock.call_count) under that load. Running queue entries one at a
29+
# time costs queue throughput (one entry at a time, ~17 min each)
30+
# but keeps signal trustworthy. cancel-in-progress is off so each
31+
# entry gets a complete run.
32+
# - workflow_dispatch: shares the merge_group group; manual triggers
33+
# are rare enough that serialising them with the queue is fine.
2934
concurrency:
30-
group: e2e-${{ github.workflow }}-${{ github.ref }}
35+
group: ${{ github.event_name == 'pull_request' && format('e2e-pr-{0}', github.ref) || 'e2e-mq-serial' }}
3136
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
3237

3338
jobs:

tests/e2e/common/retry_test_mixins.py

Lines changed: 27 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
1717
)
1818
from databricks.sql.telemetry.telemetry_client import (
1919
NoopTelemetryClient,
20+
TelemetryClient,
2021
TelemetryClientFactory,
2122
_TelemetryClientHolder,
2223
)
@@ -27,8 +28,22 @@ def _isolated_from_telemetry():
2728
# Tests that mock urllib3 globally (via _get_conn / _validate_conn) also
2829
# intercept background telemetry pushes from the shared
2930
# TelemetryClientFactory executor — inflating mock.call_count and breaking
30-
# assertions like `call_count == 6`. Drain the factory and force any new
31-
# connection to use NoopTelemetryClient for the duration of the test.
31+
# assertions like `call_count == 6`. Three layers of defence:
32+
#
33+
# 1. Drain TelemetryClientFactory and override initialize_telemetry_client
34+
# so new connections install NoopTelemetryClient (which submits nothing).
35+
# 2. Patch TelemetryClient._send_telemetry to a no-op as a backstop — covers
36+
# any real TelemetryClient instance that slips in (e.g. a stale module-
37+
# global, a code path that bypasses initialize_telemetry_client, or
38+
# anything created before this context entered).
39+
# 3. Patch TelemetryClient._export_event to a no-op so even if a real
40+
# client receives an event, the event never reaches the queue and the
41+
# flush logic never fires.
42+
#
43+
# Without layer 2/3 we have observed `/telemetry-ext` requests showing up
44+
# in merge_group runs (concurrent CI load on the warehouse stresses paths
45+
# that single-test runs don't hit), inflating retry-test counts and
46+
# producing flaky AssertionErrors.
3247
with TelemetryClientFactory._lock:
3348
saved_clients = TelemetryClientFactory._clients
3449
saved_executor = TelemetryClientFactory._executor
@@ -55,11 +70,21 @@ def _install_noop(*args, host_url=None, **kwargs):
5570
NoopTelemetryClient()
5671
)
5772

73+
def _noop_send(self, events):
74+
return None
75+
76+
def _noop_export(self, event):
77+
return None
78+
5879
try:
5980
with patch.object(
6081
TelemetryClientFactory,
6182
"initialize_telemetry_client",
6283
staticmethod(_install_noop),
84+
), patch.object(
85+
TelemetryClient, "_send_telemetry", _noop_send
86+
), patch.object(
87+
TelemetryClient, "_export_event", _noop_export
6388
):
6489
yield
6590
finally:

0 commit comments

Comments
 (0)