feat(ci): add scripts to run Spark SQL test suites locally #4405

Draft
andygrove wants to merge 3 commits into apache:main from andygrove:ci-spark-sql-local-tests

Member

@andygrove andygrove commented May 22, 2026

Which issue does this PR close?

N/A. This adds local developer tooling and has no associated issue.

Rationale for this change

The spark_sql_test.yml workflow runs Apache Spark's own SQL test suites with Comet enabled, but there is no convenient way to reproduce that run on a developer machine. Debugging a Spark SQL test failure currently means reconstructing the steps by hand: clone Spark at a version tag, apply the Comet diff, build Comet, and run the right build/sbt shard with the right environment.

What changes are included in this PR?

New bash scripts under dev/ci/spark-sql-tests/ that reproduce the spark_sql_test.yml workflow locally:

  • config.sh: per-version configuration and the seven CI module-shard definitions, copied from spark_sql_test.yml.
  • setup-spark.sh: maintains a persistent apache/spark checkout and applies the matching dev/diffs/<version>.diff, preserving Spark's build artifacts across runs.
  • run.sh: builds Comet, runs the selected module shard(s) with build/sbt using the same environment as CI, and prints a PASS/FAIL summary. Supports SKIP_BUILD and SKIP_SPARK_SETUP for fast iteration.
  • README.md: usage, prerequisites, and environment variables.
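The PASS/FAIL summary that run.sh prints could be produced along the following lines. This is only a sketch: `run_shard` and `run_shards` are hypothetical names, and `run_shard` is a stub standing in for the real build/sbt invocation, which is not shown in this description.

```shell
#!/usr/bin/env bash
# Sketch only. run_shard is a hypothetical stand-in for the real
# build/sbt call; here it "fails" for any module name containing "fail".
run_shard() {
  local module="$1"
  [[ "$module" != *fail* ]]
}

# Run every requested shard, remember which ones passed or failed,
# then print a summary. Returns non-zero if any shard failed.
run_shards() {
  local -a passed=() failed=()
  local module
  for module in "$@"; do
    if run_shard "$module"; then
      passed+=("$module")
    else
      failed+=("$module")
    fi
  done
  for module in "${passed[@]}"; do echo "PASS $module"; done
  for module in "${failed[@]}"; do echo "FAIL $module"; done
  (( ${#failed[@]} == 0 ))
}
```

Collecting results and reporting at the end, rather than aborting on the first failure, lets one long run surface every failing shard.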

The Spark version is selected with a SPARK_VERSION env var (default 4.1.1), supporting all four versions in the CI matrix: 3.4.3, 3.5.8, 4.0.2, and 4.1.1. config.sh derives the build profile and CI JDK per version and mirrors the matrix test-group isolation: every version runs with SERIAL_SBT_TESTS=1 except Spark 4.0, which forks a dedicated JVM per leak-prone Parquet/Orc suite. The Spark checkout and logs are namespaced by version so switching versions does not discard build artifacts or overwrite logs.
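As an illustration of the per-version derivation, a minimal sketch might look like the following. The function name, `SPARK_DIR`, `LOG_DIR`, and `USE_SERIAL_SBT_TESTS` are hypothetical, and the `major.minor` form of `SPARK_SHORT` is an assumption; the JDK mapping is omitted because its values are not stated here.

```shell
#!/usr/bin/env bash
# Sketch only: derive per-version settings from SPARK_VERSION.
# derive_spark_config, SPARK_DIR, LOG_DIR and USE_SERIAL_SBT_TESTS are
# hypothetical names used for illustration.
derive_spark_config() {
  local version="${1:-${SPARK_VERSION:-4.1.1}}"

  # Reject anything outside the four versions in the CI matrix.
  case "$version" in
    3.4.3|3.5.8|4.0.2|4.1.1) ;;
    *) echo "unsupported Spark version: $version" >&2; return 1 ;;
  esac

  # Assumed short form: strip the patch component (4.1.1 -> 4.1).
  SPARK_SHORT="${version%.*}"

  # Namespace the checkout and logs by version so that switching
  # versions does not clobber build artifacts or logs.
  SPARK_DIR="apache-spark-${version}"
  LOG_DIR="logs/${version}"

  # Every version runs serially except Spark 4.0.
  if [[ "$SPARK_SHORT" == "4.0" ]]; then
    USE_SERIAL_SBT_TESTS=0
  else
    USE_SERIAL_SBT_TESTS=1
  fi
}
```

Calling `derive_spark_config 3.5.8` would set `SPARK_SHORT=3.5` and `SPARK_DIR=apache-spark-3.5.8` under these assumptions.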

The scripts also point PySpark at a nonexistent interpreter by default. Spark 4.x's DataSourceManager probes for Python data sources during query analysis by spawning a Python worker. The CI container has no python3, so the probe is skipped there. On a developer machine that has python3, the worker can hang indefinitely (the JVM-side read has no idle timeout), stalling suites such as GlobalTempViewSuite. Skipping the probe locally matches CI behavior.
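One way to express an overridable default like this is shell parameter expansion. The sketch below assumes the standard `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` variables; the function name and the `/nonexistent/python3` path are placeholders chosen for illustration, not necessarily what the scripts use.

```shell
#!/usr/bin/env bash
# Sketch only: default PySpark to an interpreter that does not exist,
# while still honoring a value the developer has already exported.
# set_pyspark_defaults and /nonexistent/python3 are hypothetical.
set_pyspark_defaults() {
  export PYSPARK_PYTHON="${PYSPARK_PYTHON:-/nonexistent/python3}"
  export PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-/nonexistent/python3}"
}
```

A developer who wants to run Python-dependent suites could export `PYSPARK_PYTHON` before invoking the scripts, and the `:-` expansion would leave that value untouched.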

How are these changes tested?

These scripts orchestrate a multi-hour external test run, so they are not exercised end-to-end in CI. They were verified with bash -n and shellcheck -x (both clean), with smoke tests of run.sh argument handling (--help, unknown-module rejection, unsupported-version rejection), and by confirming config.sh derives the correct build profile, JDK, ref, and test-group settings for each of the four supported versions. Running GlobalTempViewSuite locally confirmed it passes with the Python probe skipped, where it otherwise hangs indefinitely. The module definitions and build/sbt arguments match spark_sql_test.yml exactly.
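The static checks mentioned above can be repeated with a small helper such as the one below. It is a sketch: `check_scripts` is a hypothetical name, and the helper skips shellcheck when it is not installed rather than failing.

```shell
#!/usr/bin/env bash
# Sketch only: syntax-check and lint each script passed as an argument.
# check_scripts is a hypothetical helper name.
check_scripts() {
  local script status=0
  for script in "$@"; do
    # bash -n parses the script without executing it.
    bash -n "$script" || status=1
    # shellcheck -x also follows files pulled in with `source`.
    if command -v shellcheck >/dev/null 2>&1; then
      shellcheck -x "$script" || status=1
    fi
  done
  return "$status"
}
```

For example, `check_scripts dev/ci/spark-sql-tests/*.sh` would cover config.sh, setup-spark.sh, and run.sh in one pass.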

andygrove added 3 commits May 22, 2026 07:29
Add bash scripts under dev/ci/spark-sql-tests/ that reproduce the
spark_sql_test.yml GitHub Actions workflow on a developer machine for
Apache Spark 4.1. They run Spark's own SQL test suites with Comet
enabled, which is useful for debugging a Spark SQL test failure locally
instead of waiting on CI.

- config.sh: shared configuration and the seven CI module-shard
  definitions, copied from spark_sql_test.yml
- setup-spark.sh: maintains a persistent apache/spark checkout and
  applies dev/diffs/4.1.1.diff, preserving build artifacts across runs
- run.sh: builds Comet, runs the selected module shard(s), and prints a
  PASS/FAIL summary
- README.md: usage, prerequisites, and environment variables

Only Spark 4.1 is supported for now.

[skip ci]

Spark 4.1's DataSourceManager probes for Python data sources during
query analysis by spawning a python3 worker. The CI amd64/rust
container has no python3, so the probe is skipped there. On a developer
machine that has python3 the worker can hang indefinitely, since the
JVM-side read has no idle timeout by default, stalling suites such as
GlobalTempViewSuite.

Point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at a nonexistent
interpreter so the probe is skipped, matching CI. The value is
overridable for developers who want to run the Python-dependent suites.

The local Spark SQL test scripts hardcoded Spark 4.1.1. Select the
version with a SPARK_VERSION env var instead, supporting all four
versions from the spark_sql_test.yml CI matrix: 3.4.3, 3.5.8, 4.0.2,
and 4.1.1 (default 4.1.1).

config.sh derives SPARK_SHORT and the CI JDK per version, and mirrors
the matrix test-group isolation: every version runs with
SERIAL_SBT_TESTS=1 except Spark 4.0, which forks a dedicated JVM per
leak-prone Parquet/Orc suite. run.sh builds the sbt environment as an
array so the 4.0 case omits SERIAL_SBT_TESTS entirely.
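
The array approach might look roughly like this. It is a sketch, not the code from run.sh: `build_sbt_env` and `SBT_ENV` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: build the sbt environment as an array so a variable can
# be left out entirely rather than set to an empty value.
# build_sbt_env and SBT_ENV are hypothetical names.
build_sbt_env() {
  local spark_short="$1"
  SBT_ENV=()
  if [[ "$spark_short" != "4.0" ]]; then
    SBT_ENV+=("SERIAL_SBT_TESTS=1")
  fi
}
```

The array can then be expanded in front of the command, for example `env "${SBT_ENV[@]}" build/sbt ...`, so that in the 4.0 case the variable is never defined instead of being defined as empty.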

The Spark checkout and logs are namespaced by version
(apache-spark-<version>, logs/<version>/) so switching versions does
not reset away each version's build artifacts or overwrite logs.
@andygrove changed the title from "feat(ci): add scripts to run Spark SQL test suite locally for Spark 4.1" to "feat(ci): add scripts to run Spark SQL test suites locally" on May 22, 2026
