Fix batch child jobs created as 'processing' instead of 'pending'#860

Merged
chubes4 merged 1 commit into main from fix/batch-child-job-status
Mar 18, 2026

chubes4 commented Mar 18, 2026

Summary

Fixes #858 — batch fan-out child jobs were immediately set to processing at creation time, before Action Scheduler picked them up. After 2 hours, recover-stuck would mark them as timed out even though they were just waiting in the AS queue.

Root cause

PipelineBatchScheduler::createChildJob()
  1. create_job()     → status = 'pending'  ✓
  2. start_job()      → status = 'processing'  ← BUG: premature
  3. schedule_next_step()  → AS queues it for later
  
  // 2 hours pass, AS hasn't gotten to it yet...
  recover-stuck sees: status='processing', age > 2h → marks FAILED

Same pattern in TaskScheduler::scheduleTask().
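
The failure mode can be reproduced in a few lines. The following is a minimal Python model of the sequence above, not the plugin's PHP code: the `Job` class, its `status_changed_at` field, and the function names are hypothetical stand-ins, and only the 2-hour threshold and the status names come from the description.

```python
from dataclasses import dataclass

STUCK_AFTER_SECONDS = 2 * 60 * 60  # the 2-hour threshold described above


@dataclass
class Job:
    status: str = "pending"
    status_changed_at: float = 0.0  # hypothetical timestamp field


def create_child_job_buggy(now: float) -> Job:
    """Model of the old flow: create, start immediately, then enqueue."""
    job = Job(status="pending", status_changed_at=now)  # create_job()
    job.status = "processing"                           # start_job(), premature
    job.status_changed_at = now
    return job  # schedule_next_step() would enqueue the job here


def recover_stuck(job: Job, now: float) -> None:
    """Fail any job that has sat in 'processing' longer than the threshold."""
    if job.status == "processing" and now - job.status_changed_at > STUCK_AFTER_SECONDS:
        job.status = "failed"


job = create_child_job_buggy(now=0.0)
recover_stuck(job, now=3 * 60 * 60)  # three hours later, still only queued
print(job.status)  # failed
```

The job never ran, yet it is indistinguishable from one that hung mid-execution, because both carry the same status and the same age.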

Fix

Move the pending → processing transition to the actual moment of execution:

| Location | Before | After |
| --- | --- | --- |
| PipelineBatchScheduler::createChildJob() | start_job() after create_job() | Removed; child stays pending |
| ExecuteStepAbility::execute() | No status change | Added start_job() at entry; transitions to processing when AS fires |
| TaskScheduler::scheduleTask() | start_job() after create_job() | Removed; task stays pending |
| TaskScheduler::handleTask() | No status change | Added start_job() at entry; transitions to processing when AS fires |

For parent jobs, the start_job() in ExecuteStepAbility is a no-op (already processing via RunFlowAbility). For child jobs, it's the real transition.
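
Calling start_job() at the execution entry point is only safe if the call is harmless for jobs that are already processing. The sketch below shows one way that could work, in Python rather than the plugin's PHP. It assumes the guard lives inside `start_job()` itself, which the description does not confirm; `Job`, `history`, and `execute_step()` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    status: str = "pending"
    history: list = field(default_factory=list)  # hypothetical audit trail


def start_job(job: Job) -> bool:
    """Move a pending job to processing; return False when nothing changed."""
    if job.status != "pending":
        return False  # already processing (or finished): leave it alone
    job.status = "processing"
    job.history.append("pending->processing")
    return True


def execute_step(job: Job) -> None:
    """Stand-in for the handler that runs when the scheduler fires."""
    start_job(job)  # the transition now happens at execution time
    # ... the step's real work would run here ...


parent = Job(status="processing")  # assumed already started by the flow runner
child = Job()                      # created by the batch scheduler, still pending

execute_step(parent)
execute_step(child)

print(parent.status, parent.history)  # processing []
print(child.status, child.history)    # processing ['pending->processing']
```

With a guard like this, one entry point serves both job kinds, and the parent's status is never rewritten.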

Result

  • recover-stuck only catches genuinely stuck jobs (started processing but never finished)
  • Jobs waiting in the AS queue stay pending and won't be incorrectly timed out
  • No changes to the recover-stuck logic itself — it correctly targets processing status
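
To illustrate why the recovery logic needs no change, here is a small Python sketch of a stuck-job filter run against a mixed queue after the fix. The `Job` fields, job names, and `find_stuck()` are hypothetical; only the status names and the 2-hour threshold are taken from the description.

```python
from dataclasses import dataclass

STUCK_AFTER_SECONDS = 2 * 60 * 60  # 2-hour threshold from the description


@dataclass
class Job:
    name: str
    status: str
    status_since: float  # hypothetical: when the job entered its current status


def find_stuck(jobs, now):
    """Return names of jobs that have been 'processing' longer than the threshold."""
    return [
        job.name
        for job in jobs
        if job.status == "processing" and now - job.status_since > STUCK_AFTER_SECONDS
    ]


now = 3 * 60 * 60
jobs = [
    Job("queued-child", "pending", 0.0),          # waiting in the queue for 3 hours
    Job("hung-child", "processing", 0.0),         # started 3 hours ago, never finished
    Job("active-child", "processing", now - 60),  # started a minute ago
]

stuck = find_stuck(jobs, now)
print(stuck)  # ['hung-child']
```

The queued job is skipped because of its status alone, however long it waits.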

Testing

  • 873/873 tests pass (29 pre-existing failures on main, unchanged)
  • Lint passes with no new findings

Child jobs were immediately set to 'processing' at creation time, before
Action Scheduler picked them up. This caused recover-stuck to mark them
as timed out after 2 hours even though they were just waiting in the AS
queue.

Fix: move the pending→processing transition to the actual moment of
execution:
- PipelineBatchScheduler: remove premature start_job() from createChildJob()
- ExecuteStepAbility: add start_job() at execute() entry (no-op for parent
  jobs already in processing, real transition for child jobs)
- TaskScheduler: same pattern — remove from scheduleTask(), add to handleTask()

Now recover-stuck only catches genuinely stuck jobs (ones that started
processing but never finished), not jobs waiting in the AS queue.

Closes #858

github-actions bot commented Mar 18, 2026

Homeboy Results — data-machine

Lint

⚡ Scope: changed files only

lint (changed files only)

Test

⚡ Scope: changed files only

test (changed files only)

Audit

⚡ Scope: changed files only

audit (changed files only)

Tooling versions
  • Homeboy CLI: homeboy 0.81.1+385399ee
  • Extension: wordpress from https://github.com/Extra-Chill/homeboy-extensions
  • Extension revision: unknown
  • Action: Extra-Chill/homeboy-action@v1

Homeboy Action v1

@chubes4 chubes4 merged commit 425018a into main Mar 18, 2026
3 checks passed
@chubes4 chubes4 deleted the fix/batch-child-job-status branch March 18, 2026 20:58

Development

Successfully merging this pull request may close these issues.

Batch fan-out child jobs created as 'processing' instead of 'pending'
