Skip to content

Stop CTST gracefully at a cucumber-level time budget#2457

Open
delthas wants to merge 1 commit into
development/2.15from
improvement/ZENKO-5309/ctst-time-budget
Open

Stop CTST gracefully at a cucumber-level time budget#2457
delthas wants to merge 1 commit into
development/2.15from
improvement/ZENKO-5309/ctst-time-budget

Conversation

@delthas

@delthas delthas commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What

Add an in-process time budget so CTST stops gracefully before the GitHub step timeout would hard-kill it. Touches only tests/functional/ctst/common/hooks.ts and .github/scripts/end2end/run-e2e-ctst.sh.

Why

The only stop for a hung ctst-end2end-sharded run is the step timeout-minutes, which hard-kills the process. cucumber-js v13 installs no signal handler and has no whole-run timeout, so on a kill it never reaches testRunFinishedreport.xml is written empty (Error parsing report.xml: no element found), teardown is skipped, and the archive step has nothing to publish.

How

  • run-e2e-ctst.sh exports CTST_DEADLINE_EPOCH_MS (default 180 min, via CTST_MAX_RUNTIME_MIN) and a marker path, and clears any stale marker.
  • An early Before hook (registered before the expensive @Quotas/count-items hooks) skips any scenario that starts past the deadline — returning 'skipped' short-circuits the remaining hooks/steps — and drops the marker file.
  • cucumber then finishes normally: it writes report.xml/report.html/report.ndjson and runs teardown. The runner script exits 1 if the marker is present, so a timed-out run is still red while keeping a clean report.

When CTST_DEADLINE_EPOCH_MS is unset (local runs), the hook is a no-op.

Relationship to ZENKO-5306

Companion to the step-level timeout. ZENKO-5306 lowers the step timeout-minutes to 190 as a hard backstop: 180 min (graceful, this PR) < 190 min (hard backstop) < the old 360-min job cap.

tsc --build, eslint, and bash -n pass.

Issue: ZENKO-5309

The GitHub step timeout hard-kills the process, so Cucumber never writes
its reports (empty report.xml). Add an in-process deadline instead: the
runner script exports a deadline (default 180m) and a marker path; an
early Before hook skips any scenario starting past the deadline (so the
expensive @Quotas/count-items hooks are short-circuited) and drops the
marker. Cucumber then finishes normally, writing report.xml/html/ndjson
and running teardown; the script fails the run if the marker is present.
The 190m step timeout (ZENKO-5306) remains as a hard backstop.

Issue: ZENKO-5309
@bert-e

bert-e commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@scality scality deleted a comment from bert-e Jun 30, 2026
@bert-e

bert-e commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

# Before hook), so Cucumber finishes normally and writes its reports. The hook
# drops a marker file; we fail the run if it is present.
CTST_MAX_RUNTIME_MIN=${CTST_MAX_RUNTIME_MIN:-180}
export CTST_DEADLINE_EPOCH_MS=$(( $(date +%s%3N) + CTST_MAX_RUNTIME_MIN * 60 * 1000 ))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

best to minimize content of wrapper script : should do this computation in javascript

Suggested change
export CTST_DEADLINE_EPOCH_MS=$(( $(date +%s%3N) + CTST_MAX_RUNTIME_MIN * 60 * 1000 ))
export CTST_MAX_RUNTIME_MIN=${CTST_MAX_RUNTIME_MIN:-180}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even CTST_MAX_RUNTIME_MIN default value could be set directly in the javascript code?

--format message:ctst/reports/report.ndjson || rc=$?

if [ -f "$CTST_TIMEOUT_MARKER" ]; then
echo "::error::CTST exceeded its ${CTST_MAX_RUNTIME_MIN}-minute time budget; remaining scenarios were skipped."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to display a log here ?
we could display the same from within CTST (first time -or everytime- we hit the deadline...), and avoid the whole "marker" handling: which seems overly complex

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need imo, I would display a log in the hook, for each test skipped, we can access the name of the scenario and just write "scenario xxx skipped because of global timeout"

fs.writeFileSync(marker, 'timeout');
} catch { /* best-effort: marker is advisory */ }
}
return 'skipped';

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this "skipped" do?
these (skipped) tests need to be marked as somewhat failed, to ensure we don't pass CI unexpectedly: maybe "skipped" does that, but it must a skip like "I could not execute it" (=potentially failed) rather than "test was intentionally disabled/skipped" (=ignore)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's an official cucumber keyword, but yeah this should probably use fail instead 🤔

if [ -f "$CTST_TIMEOUT_MARKER" ]; then
echo "::error::CTST exceeded its ${CTST_MAX_RUNTIME_MIN}-minute time budget; remaining scenarios were skipped."
if [ "$rc" -eq 0 ]; then
rc=1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of wrapper script, there is already some code to "tweak" the return code (and make it pass always), c.f. tests/functional/ctst/cucumber.config.cjs :

if (process.env.CI_PASS_ON_TEST_FAILURE === 'true') {
    const _exit = process.exit;
    process.exit = function exit(code) {
        _exit(code === 1 ? 0 : code);
    };
    process.on('beforeExit', () => {
        if (process.exitCode === 1) {
            process.exitCode = 0;
        }
    });
}

→ instead of this, you could thus tweak the result to actually return something other than 0 if we want to skip further analysis (e.g. test results merging...)
→ however, why do we need to return error 1 in this case? Timeout is likely an issue in the tests, shouldn't we handle this like all other failed tests (i.e. ignore the result and let followup steps check the report for failure)

@scality scality deleted a comment from bert-e Jun 30, 2026
@bert-e

bert-e commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

// Time budget: scenarios that start after the deadline are skipped so Cucumber
// finishes normally (writing its reports) before the GitHub step timeout would
// hard-kill the process. The marker file lets the runner script fail the run.
Before(() => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would simplify this :

  • hardcode here max duration allowed for all ctst tests
  • send as env variable a timestamp of when we started running ctst (+ verify if maybe this variable is already exposed by cucumber)
  • just verify time.now < timestampStart + maxDurationAllowed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants