Optimize CI: consolidate workflows, fix caching, speed up e2e tests #771

Closed
vikrantpuppala wants to merge 1 commit into ci/protected-runners-jfrog from ci/optimize-e2e-and-coverage

@vikrantpuppala (Contributor)

Summary

  • Consolidate 3 workflows into 1: Delete integration.yml and daily-telemetry-e2e.yml, since the coverage workflow already runs all e2e tests. Add a push: main trigger to coverage. Run all tests (including telemetry) in a single pytest invocation with --dist=loadgroup for xdist_group isolation.
  • Fix pyarrow cache: Remove cache-path: .venv-pyarrow. Poetry always creates .venv, so the cache was never saved. This alone fixes the 3.14 pyarrow job taking 15+ min (it compiled mypy/pyarrow from source on every run).
  • Fix 3.14 post-test DNS hang: Add enable_telemetry=False to the unit test dummy connection args. Unit tests using server_hostname="foo" were triggering real HTTP calls to fake hosts; on protected runners the DNS timeout caused an 8-min process hang after tests finished.
  • Better xdist distribution: Split TestPySQLLargeQueriesSuite into 3 separate classes so the 6 slowest tests get distributed across workers instead of all landing on one (gw3 was running 40 min while gw0/gw1 sat idle after 5 min).
  • Reduce test sizes: Large result set tests reduced from 300MB to 100MB. test_long_running_query starts at scale_factor=50 instead of 1 to skip ramp-up.
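
To illustrate why starting test_long_running_query at a higher scale_factor saves time, here is a minimal sketch of a ramp-up loop. It is not the actual test: the doubling strategy, the `ramp_up` helper, and the linear timing model (`ms_per_unit`) are all assumptions made for illustration.

```python
def ramp_up(start_scale, min_ms, ms_per_unit):
    """Simulate growing scale_factor until a query runs long enough.

    Assumptions (not taken from the real test): the scale doubles on each
    attempt, and query time grows linearly as scale * ms_per_unit.
    Returns (queries_issued, final_scale_factor).
    """
    scale = start_scale
    queries = 1  # the first attempt at start_scale
    while scale * ms_per_unit < min_ms:
        scale *= 2      # query was too short, try a bigger one
        queries += 1
    return queries, scale


if __name__ == "__main__":
    # Illustrative numbers only: 50 ms per scale unit, 3 s target duration.
    print(ramp_up(1, 3000, 50))
    print(ramp_up(50, 3000, 50))
```

Under these made-up numbers, starting at 1 issues seven queries before one runs long enough, while starting at 50 issues two. The early queries are the "ramp-up iterations that finish instantly" the change skips.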

Expected Impact

| Metric | Before | After |
| --- | --- | --- |
| E2e workflows per PR | 3 (integration + coverage + daily-telemetry) | 1 |
| Coverage wall-clock | ~45 min | ~10-12 min |
| 3.14 pyarrow job (with cache fix) | ~27 min | ~3-5 min |
| 3.14 non-pyarrow job (DNS fix) | ~26 min | ~3 min |
| Total warehouse compute per PR | ~85 min | ~10-12 min |

Test plan

  • Verify all unit tests pass (code-quality-checks workflow)
  • Verify coverage workflow runs all tests including telemetry
  • Verify 3.14 pyarrow cache saves successfully on first run and hits on subsequent runs
  • Verify 3.14 jobs no longer have 8-min post-test DNS hang
  • Verify LargeQueriesSuite tests are distributed across multiple xdist workers

SKIP_COVERAGE_CHECK = CI workflow changes only, no source code coverage impact

This pull request was AI-assisted by Isaac.

Workflow consolidation:
- Delete integration.yml and daily-telemetry-e2e.yml (redundant with
  coverage workflow which already runs all e2e tests)
- Add push-to-main trigger to coverage workflow
- Run all tests (including telemetry) in single pytest invocation with
  --dist=loadgroup to respect xdist_group markers for isolation

Fix pyarrow cache:
- Remove cache-path: .venv-pyarrow from pyarrow jobs. Poetry always
  creates .venv regardless of the cache-path input, so the cache was
  never saved ("Path does not exist" error). The cache-suffix already
  differentiates keys between variants.

Fix 3.14 post-test DNS hang:
- Add enable_telemetry=False to unit test DUMMY_CONNECTION_ARGS that
  use server_hostname="foo". This prevents FeatureFlagsContext from
  making real HTTP calls to fake hosts, eliminating ~8min hang from
  ThreadPoolExecutor threads timing out on DNS on protected runners.
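
A minimal sketch of the shape of this fix, plus a guard that could catch regressions. Only server_hostname="foo" and enable_telemetry=False come from this change; the other keys, the `FAKE_HOSTS` set, and the `is_offline_safe` helper are hypothetical.

```python
# Hypothetical stand-in for the unit tests' DUMMY_CONNECTION_ARGS.
DUMMY_CONNECTION_ARGS = {
    "server_hostname": "foo",      # placeholder host used by unit tests
    "http_path": "dummy_path",     # assumed placeholder value
    "access_token": "tok",         # assumed placeholder value
    "enable_telemetry": False,     # the fix: no telemetry for fake hosts
}

# Hostnames that are known placeholders (assumption for this sketch).
FAKE_HOSTS = {"foo"}


def is_offline_safe(connection_args):
    """Return True unless a placeholder hostname is used with telemetry
    not explicitly disabled."""
    if connection_args.get("server_hostname") not in FAKE_HOSTS:
        return True  # real hosts are out of scope for this check
    return connection_args.get("enable_telemetry") is False
```

A unit test could assert `is_offline_safe(...)` for every dummy argument dict, so a new fixture that omits the flag fails fast instead of hanging on DNS.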

Improve e2e test parallelization:
- Split TestPySQLLargeQueriesSuite into 3 separate classes
  (TestPySQLLargeWideResultSet, TestPySQLLargeNarrowResultSet,
  TestPySQLLongRunningQuery) so xdist distributes them across workers
  instead of all landing on one.
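
The effect of the split can be sketched with a toy scheduling model. This is not pytest-xdist's scheduler: it assumes each class is pinned to one worker as a whole unit (as happens when tests share an xdist_group under --dist=loadgroup), and the durations and the `makespan` helper are invented for illustration.

```python
import heapq


def makespan(group_minutes, workers):
    """Greedy model: hand each group, in order, to the worker that frees
    up first, and return the finish time of the busiest worker."""
    loads = [0] * workers
    heapq.heapify(loads)
    for minutes in group_minutes:
        earliest = heapq.heappop(loads)          # least-loaded worker
        heapq.heappush(loads, earliest + minutes)
    return max(loads)


# Invented durations: 36 min of slow tests plus twelve 1-min tests.
one_big_class = [36] + [1] * 12
three_classes = [18, 11, 7] + [1] * 12

if __name__ == "__main__":
    print(makespan(one_big_class, 4))
    print(makespan(three_classes, 4))
```

With these made-up numbers and four workers, the single class yields a 36-minute run while the three-way split finishes in 18 minutes. The total work is unchanged; only the size of the largest indivisible unit shrinks.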

Speed up slow tests:
- Reduce large result set sizes from 300MB to 100MB (still validates
  large fetches, lz4, chunking, row integrity)
- Start test_long_running_query at scale_factor=50 instead of 1 to
  skip ramp-up iterations that finish instantly

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>