Add optional saving of test output #12184

Draft: wants to merge 3 commits into base: main

saintstack
Contributor

@saintstack saintstack commented Jun 5, 2025

This commit adds logging around test runs and an optional archiving feature.

Key Changes:

  1. Log Archival on Failure (correctnessTest.sh, config.py, run.py, summarize.py):

    • Introduced the TH_ARCHIVE_LOGS_ON_FAILURE environment variable.
    • When set, correctnessTest.sh now preserves temporary directories
      (/tmp/th_joshua_output..., /tmp/th_run_temp...) if the Python
      harness (app.py) exits with an error.
    • These preserved directories contain the full joshua.xml, FDB trace
      logs, harness stdout/stderr (python_app_stdout.log,
      python_app_stderr.log), and other per-test-part artifacts.
    • run.py was updated to identify FDB log files for potential archival.
  2. Improved Failure Detection (summarize.py):

    • Summary.done() now more robustly identifies failures, including:
      • Tests that start but don't produce an "ElapsedTime" event.
      • Positive tests with a non-zero exit code but no specific error.
      • Positive tests with a zero exit code that produce stderr output.
    • Enhanced unseed mismatch detection.
  3. Added a new contrib/TestHarness2/README.md.

  4. Adds logging of the correctnessTest.sh environment and the setup for the Python invocation, followed by logging of the Python call itself. Each goes to its own log in /tmp per test (distinguished by test seed).
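
As a rough illustration of the archive-on-failure behavior in item 1, the Python sketch below shows the decision the description implies: the temporary directories are kept only when the harness exited with an error and TH_ARCHIVE_LOGS_ON_FAILURE is set. The function names (archive_enabled, finish_run) are hypothetical, and treating any non-empty value as "set" is an assumption. The actual logic lives in correctnessTest.sh and may differ.

```python
import os
import shutil
import tempfile
from typing import Iterable, List, Mapping


def archive_enabled(env: Mapping[str, str]) -> bool:
    # Assumption: any non-empty value counts as "set"; the real script may
    # parse the variable differently.
    return bool(env.get("TH_ARCHIVE_LOGS_ON_FAILURE", ""))


def finish_run(temp_dirs: Iterable[str], exit_code: int,
               env: Mapping[str, str]) -> List[str]:
    """Delete the temp dirs unless the harness failed and archiving is on.

    Returns the list of directories that were preserved.
    """
    keep = exit_code != 0 and archive_enabled(env)
    preserved: List[str] = []
    for path in temp_dirs:
        if keep:
            preserved.append(path)
        else:
            # Normal cleanup path: remove the directory and its contents.
            shutil.rmtree(path, ignore_errors=True)
    return preserved


if __name__ == "__main__":
    # Stand-in for a directory such as /tmp/th_run_temp...
    demo_dir = tempfile.mkdtemp(prefix="th_run_temp_demo_")
    kept = finish_run([demo_dir], exit_code=1, env=os.environ)
    print("preserved:", kept)
```

With the variable unset, or with a zero exit code, the sketch falls through to the usual cleanup, which matches the "optional" wording above.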

…lity

This commit adds logging and an optional archiving feature to TestHarness2.

Key Changes:

1.  **Log Archival on Failure (`correctnessTest.sh`, `config.py`, `run.py`, `summarize.py`):**
    *   Introduced the `TH_ARCHIVE_LOGS_ON_FAILURE` environment variable.
    *   When set, `correctnessTest.sh` now preserves temporary directories
        (`/tmp/th_joshua_output...`, `/tmp/th_run_temp...`) if the Python
        harness (`app.py`) exits with an error.
    *   These preserved directories contain the full `joshua.xml`, FDB trace
        logs, harness stdout/stderr (`python_app_stdout.log`,
        `python_app_stderr.log`), and other per-test-part artifacts.
    *   `run.py` was updated to identify FDB log files for potential archival.

2.  **Improved Failure Detection (`summarize.py`):**
    *   `Summary.done()` now more robustly identifies failures, including:
        *   Tests that start but don't produce an "ElapsedTime" event.
        *   Positive tests with a non-zero exit code but no specific error.
        *   Positive tests with a zero exit code that produce stderr output.
    *   Enhanced unseed mismatch detection.

3.  Added a new `contrib/TestHarness2/README.md`.
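
The failure cases listed under "Improved Failure Detection" can be sketched as a small classifier. This is only an illustration under stated assumptions: `RunResult`, its fields, and `failure_reasons` are hypothetical names, "positive test" is taken to mean a test that is expected to pass, and the real checks in `Summary.done()` may be structured differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    """Hypothetical summary of one test part; not the harness's own type."""
    started: bool
    saw_elapsed_time: bool   # whether an "ElapsedTime" event was observed
    exit_code: int
    stderr: str
    errors_reported: int     # count of specific errors already recorded
    negative_test: bool = False


def failure_reasons(r: RunResult) -> List[str]:
    """Return the reasons this run should be treated as a failure."""
    reasons: List[str] = []
    # Case 1: the test started but never produced an "ElapsedTime" event.
    if r.started and not r.saw_elapsed_time:
        reasons.append("no ElapsedTime event")
    if not r.negative_test:
        # Case 2: positive test, non-zero exit code, no specific error.
        if r.exit_code != 0 and r.errors_reported == 0:
            reasons.append("non-zero exit code without a specific error")
        # Case 3: positive test, zero exit code, but stderr output.
        if r.exit_code == 0 and r.stderr.strip():
            reasons.append("stderr output despite zero exit code")
    return reasons


if __name__ == "__main__":
    run = RunResult(started=True, saw_elapsed_time=False,
                    exit_code=0, stderr="", errors_reported=0)
    print(failure_reasons(run))
```

Returning a list of reasons rather than a single boolean keeps each case visible in the summary, which is one plausible way to make the detection "more robust" without hiding why a run was flagged.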
@saintstack saintstack marked this pull request as draft June 5, 2025 05:49
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 29997c5
  • Duration 0:25:30
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 29997c5
  • Duration 0:37:03
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 29997c5
  • Duration 0:37:49
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 29997c5
  • Duration 0:46:17
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 29997c5
  • Duration 0:48:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 29997c5
  • Duration 1:00:43
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 29997c5
  • Duration 1:01:20
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: f1d7eb3
  • Duration 0:24:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: f1d7eb3
  • Duration 0:37:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: f1d7eb3
  • Duration 0:45:21
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: f1d7eb3
  • Duration 0:47:03
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: f1d7eb3
  • Duration 0:49:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: f1d7eb3
  • Duration 0:59:59
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: f1d7eb3
  • Duration 1:03:21
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 1c8a80b
  • Duration 0:25:02
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 1c8a80b
  • Duration 0:37:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 1c8a80b
  • Duration 0:38:07
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 1c8a80b
  • Duration 0:46:55
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 1c8a80b
  • Duration 0:49:21
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 1c8a80b
  • Duration 1:00:26
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 1c8a80b
  • Duration 1:01:49
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@@ -21,3 +21,9 @@ We use Joshua to simulate failures modes at the network, machine, and datacenter
For a while, there was an informal competition within the engineering team to design failures that found the toughest bugs and issues the most easily. After a period of one-upsmanship, the reigning champion is called "swizzle-clogging". To swizzle-clog, you first pick a random subset of nodes in the cluster. Then, you "clog" (stop) each of their network connections one by one over a few seconds. Finally, you unclog them in a random order, again one by one, until they are all up. This pattern seems to be particularly good at finding deep issues that only happen in the rarest real-world cases.

Joshua's success has surpassed our expectation and has been vital to our engineering team. It seems unlikely that we would have been able to build FoundationDB without this technology.

* `scripts/`: This directory contains shell scripts that serve as entry points for running tests. Joshua invokes these scripts, which then set up the environment and execute the test runner.
* **`correctnessTest.sh`**: This is the primary script for running correctness tests. It is responsible for invoking the Python-based `TestHarnessV2` and passing it the necessary configuration. It also handles the creation and cleanup of temporary output directories.
Contributor

Suggested change
- * **`correctnessTest.sh`**: This is the primary script for running correctness tests. It is responsible for invoking the Python-based `TestHarnessV2` and passing it the necessary configuration. It also handles the creation and cleanup of temporary output directories.
+ * **`correctnessTest.sh`**: This is the primary script for running correctness tests. It is responsible for invoking the Python-based `TestHarness2` and passing it the necessary configuration. It also handles the creation and cleanup of temporary output directories.

Labels
None yet
Projects
None yet
3 participants