
fix: use UPSERT to prevent duplicate DLQ entries#6

Merged
ceejbot merged 2 commits into latest from fix/dlq-upsert-duplicate-entries on Dec 17, 2025

Conversation

Collaborator

@rrogers-machinify rrogers-machinify commented Dec 17, 2025

Summary

Two fixes for worker queue health and DLQ reliability:

1. DLQ UPSERT - Prevent duplicate entries

When a job requeued from the DLQ failed again, it created a new DLQ entry instead of updating the existing one. This caused:

  • Duplicate DLQ entries for the same logical job (same job_key)
  • Stale failed_at timestamps - original entry's timestamp was never updated
  • Infinite requeue loops - auto-requeue cron jobs would pick up old entries repeatedly
  • Exponential job growth - duplicates kept multiplying on each requeue cycle

Fix:

  • Add a unique partial index on job_key (WHERE job_key IS NOT NULL)
  • Change INSERT to UPSERT in add_to_dlq() and process_failed_jobs()
  • On conflict, update failed_at=NOW() and increment failure_count
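
The three bullets above could look roughly like the following Postgres migration and upsert. This is a minimal sketch; the table and column names (`dead_letter_queue`, `payload`, etc.) are assumptions, not the actual schema.

```sql
-- Unique partial index: enforce one DLQ row per non-null job_key.
CREATE UNIQUE INDEX IF NOT EXISTS dlq_job_key_unique
    ON dead_letter_queue (job_key)
    WHERE job_key IS NOT NULL;

-- UPSERT: on a repeat failure, refresh failed_at and bump failure_count
-- instead of inserting a duplicate row.
INSERT INTO dead_letter_queue (job_key, payload, failed_at, failure_count)
VALUES ($1, $2, NOW(), 1)
ON CONFLICT (job_key) WHERE job_key IS NOT NULL
DO UPDATE SET
    failed_at = NOW(),
    failure_count = dead_letter_queue.failure_count + 1;
```

Note that the `WHERE job_key IS NOT NULL` clause on `ON CONFLICT` is what lets Postgres infer the partial index, so rows with a null `job_key` still insert normally.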

2. Startup cleanup - Release stale locks and clean up dead jobs

When workers crash or are killed without graceful shutdown, they can leave behind:

  • Stale queue locks - prevent other workers from processing jobs in affected queues
  • Dead jobs - jobs with attempts >= max_attempts that sit in the main queue forever

Fix:

  • Add startup_cleanup() that runs automatically when run_until_cancelled() is called
  • release_stale_queue_locks() - releases queue locks older than 5 minutes (DEFAULT_STALE_LOCK_TIMEOUT)
  • cleanup_permanently_failed_jobs() - deletes jobs that have exhausted retries
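
In SQL terms, the two cleanup helpers likely reduce to queries along these lines. Table and column names here are assumptions for illustration:

```sql
-- release_stale_queue_locks(): release queue locks older than 5 minutes
-- (DEFAULT_STALE_LOCK_TIMEOUT); locks are normally held only briefly.
DELETE FROM queue_locks
WHERE locked_at < NOW() - INTERVAL '5 minutes';

-- cleanup_permanently_failed_jobs(): delete jobs that have exhausted
-- their retries; these should already have a DLQ entry.
DELETE FROM jobs
WHERE attempts >= max_attempts;
```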

Behavior After Fix

  • One DLQ entry per logical job (identified by job_key)
  • failed_at always reflects the most recent failure - cooldown periods work correctly
  • failure_count accumulates across all failure cycles
  • Stale locks auto-released on worker startup
  • Dead jobs cleaned up on worker startup
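
One way to verify the first bullet holds (a sketch; assumes a `dead_letter_queue` table name) is a duplicate check that should now always return zero rows:

```sql
-- Sanity check: no job_key should appear more than once in the DLQ.
SELECT job_key, COUNT(*) AS entries
FROM dead_letter_queue
WHERE job_key IS NOT NULL
GROUP BY job_key
HAVING COUNT(*) > 1;
```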

Testing

All 55 tests pass, including the admin API DLQ tests.

Breaking Changes

None; the change is backwards compatible. Existing DLQ entries without a job_key are unaffected.

🤖 Generated with Claude Code

When a job is requeued from the DLQ and fails again, it was creating
a new DLQ entry instead of updating the existing one. This caused:

1. Duplicate DLQ entries for the same logical job (same job_key)
2. The original entry's failed_at timestamp was never updated
3. Auto-requeue cron jobs would pick up old entries repeatedly
4. Exponential growth of duplicate jobs in the queue

The fix:
- Add a unique partial index on job_key (WHERE job_key IS NOT NULL)
- Change INSERT to UPSERT in add_to_dlq() and process_failed_jobs()
- On conflict, update failed_at=NOW() and increment failure_count

This ensures:
- One DLQ entry per logical job (identified by job_key)
- failed_at always reflects the most recent failure
- Cooldown periods work correctly for auto-requeue
- failure_count accumulates across all failures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@rrogers-machinify force-pushed the fix/dlq-upsert-duplicate-entries branch from f01b3a0 to a8e7e02 on December 17, 2025 at 21:40
Adds automatic cleanup on worker startup to handle stale state left by
crashed workers:

1. **Stale queue lock cleanup** - Releases queue locks older than 5 minutes
   (DEFAULT_STALE_LOCK_TIMEOUT). Queue locks are normally held briefly
   during job selection, so any lock older than this is from a dead worker.

2. **Permanently failed job cleanup** - Deletes jobs with attempts >= max_attempts
   that are still in the main queue. These jobs are "dead" (is_available=false)
   and will never be processed again. They should already be in the DLQ.

The cleanup runs automatically when `run_until_cancelled()` is called, before
the worker starts processing jobs. Failures are logged but don't prevent
the worker from starting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@rrogers-machinify force-pushed the fix/dlq-upsert-duplicate-entries branch from a8e7e02 to 01402ad on December 17, 2025 at 21:40
Owner

@ceejbot ceejbot left a comment


Looks good! Shipping it!

Ok(deleted)
}

/// Run all startup cleanup tasks.
Owner


I like this approach.


// Run startup cleanup to release stale locks and clean up failed jobs
if let Err(e) = self.client.startup_cleanup().await {
log::warn!("Startup cleanup failed (continuing anyway): {}", e);
Owner


The correct thing to do on a failure here, yeah.

@ceejbot ceejbot merged commit ecc4e18 into latest Dec 17, 2025
1 check passed
@ceejbot ceejbot deleted the fix/dlq-upsert-duplicate-entries branch December 17, 2025 21:45