fix: use UPSERT to prevent duplicate DLQ entries #6
Merged
Conversation
When a job is requeued from the DLQ and fails again, it was creating a new DLQ entry instead of updating the existing one. This caused:

1. Duplicate DLQ entries for the same logical job (same `job_key`)
2. The original entry's `failed_at` timestamp was never updated
3. Auto-requeue cron jobs would pick up old entries repeatedly
4. Exponential growth of duplicate jobs in the queue

The fix:

- Add a unique partial index on `job_key` (`WHERE job_key IS NOT NULL`)
- Change `INSERT` to `UPSERT` in `add_to_dlq()` and `process_failed_jobs()`
- On conflict, update `failed_at = NOW()` and increment `failure_count`

This ensures:

- One DLQ entry per logical job (identified by `job_key`)
- `failed_at` always reflects the most recent failure
- Cooldown periods work correctly for auto-requeue
- `failure_count` accumulates across all failures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
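The upsert semantics described in this commit can be sketched with an in-memory model. This is a hedged illustration, not the crate's actual code: the `DlqEntry` fields and the SQL quoted in the comment are assumptions reconstructed from the commit message, and the real work happens in a database `INSERT ... ON CONFLICT` statement.

```rust
use std::collections::HashMap;

// Hypothetical shape of a DLQ row; field names are assumptions
// based on the commit message, not the real schema.
struct DlqEntry {
    failed_at: u64, // stand-in for a NOW() timestamp
    failure_count: u32,
}

// Models roughly what an UPSERT like
//   INSERT ... ON CONFLICT (job_key)
//   DO UPDATE SET failed_at = NOW(),
//                 failure_count = dlq.failure_count + 1
// achieves: one entry per job_key, refreshed on every failure.
fn upsert_dlq(dlq: &mut HashMap<String, DlqEntry>, job_key: &str, now: u64) {
    dlq.entry(job_key.to_string())
        .and_modify(|e| {
            e.failed_at = now;    // most recent failure wins
            e.failure_count += 1; // accumulates across failure cycles
        })
        .or_insert(DlqEntry { failed_at: now, failure_count: 1 });
}

fn main() {
    let mut dlq = HashMap::new();
    upsert_dlq(&mut dlq, "job-42", 100);
    upsert_dlq(&mut dlq, "job-42", 200); // requeued, failed again
    assert_eq!(dlq.len(), 1); // no duplicate entry
    assert_eq!(dlq["job-42"].failed_at, 200); // timestamp refreshed
    assert_eq!(dlq["job-42"].failure_count, 2); // count accumulated
    println!("one entry, failed_at refreshed, failure_count accumulated");
}
```

The unique partial index (`WHERE job_key IS NOT NULL`) is what makes the conflict target valid while leaving key-less entries out of the dedup rule, which is mirrored here by keying the map on `job_key` alone.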
Force-pushed from f01b3a0 to a8e7e02
Adds automatic cleanup on worker startup to handle stale state left by crashed workers:

1. **Stale queue lock cleanup** - Releases queue locks older than 5 minutes (`DEFAULT_STALE_LOCK_TIMEOUT`). Queue locks are normally held briefly during job selection, so any lock older than this is from a dead worker.
2. **Permanently failed job cleanup** - Deletes jobs with `attempts >= max_attempts` that are still in the main queue. These jobs are "dead" (`is_available = false`) and will never be processed again. They should already be in the DLQ.

The cleanup runs automatically when `run_until_cancelled()` is called, before the worker starts processing jobs. Failures are logged but don't prevent the worker from starting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
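The two cleanup conditions above reduce to small predicates. A minimal sketch, assuming the 5-minute constant and semantics from this commit message (the function names here are illustrative, not the crate's API):

```rust
use std::time::Duration;

// 5 minutes, matching DEFAULT_STALE_LOCK_TIMEOUT as described above.
const DEFAULT_STALE_LOCK_TIMEOUT: Duration = Duration::from_secs(300);

// Queue locks are held only briefly during job selection, so a lock
// held longer than the timeout is presumed to belong to a dead worker.
fn lock_is_stale(held_for: Duration) -> bool {
    held_for > DEFAULT_STALE_LOCK_TIMEOUT
}

// A job that has exhausted its retries will never run again; it should
// already be in the DLQ, so it can be deleted from the main queue.
fn job_is_permanently_failed(attempts: u32, max_attempts: u32) -> bool {
    attempts >= max_attempts
}

fn main() {
    assert!(!lock_is_stale(Duration::from_secs(10)));
    assert!(lock_is_stale(Duration::from_secs(600)));
    assert!(job_is_permanently_failed(3, 3));
    assert!(!job_is_permanently_failed(2, 3));
    println!("cleanup predicates behave as described");
}
```

In the real implementation these checks would be `WHERE` clauses in the cleanup queries rather than per-row Rust code; the point is only that both are simple threshold comparisons, which is why running them unconditionally at startup is cheap.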
Force-pushed from a8e7e02 to 01402ad
ceejbot approved these changes on Dec 17, 2025
```rust
    Ok(deleted)
}

/// Run all startup cleanup tasks.
// ...

// Run startup cleanup to release stale locks and clean up failed jobs
if let Err(e) = self.client.startup_cleanup().await {
    log::warn!("Startup cleanup failed (continuing anyway): {}", e);
```
ceejbot (Owner): The correct thing to do on a failure here, yeah.
Summary
Two fixes for worker queue health and DLQ reliability:
1. DLQ UPSERT - Prevent duplicate entries
When a job is requeued from the DLQ and fails again, it was creating a new DLQ entry instead of updating the existing one. This caused:
- Duplicate DLQ entries for the same logical job (same `job_key`)
- Stale `failed_at` timestamps - the original entry's timestamp was never updated

Fix:

- Add a unique partial index on `job_key` (`WHERE job_key IS NOT NULL`)
- Change `INSERT` to `UPSERT` in `add_to_dlq()` and `process_failed_jobs()`
- On conflict, update `failed_at = NOW()` and increment `failure_count`

2. Startup cleanup - Release stale locks and clean up dead jobs
When workers crash or are killed without graceful shutdown, they can leave behind:
- Stale queue locks held by dead workers
- Jobs with `attempts >= max_attempts` that sit in the main queue forever

Fix:

- Add `startup_cleanup()` that runs automatically when `run_until_cancelled()` is called
- `release_stale_queue_locks()` - releases queue locks older than 5 minutes (`DEFAULT_STALE_LOCK_TIMEOUT`)
- `cleanup_permanently_failed_jobs()` - deletes jobs that have exhausted retries

Behavior After Fix

- One DLQ entry per logical job (identified by `job_key`)
- `failed_at` always reflects the most recent failure - cooldown periods work correctly
- `failure_count` accumulates across all failure cycles

Testing
All 55 tests pass including the admin API DLQ tests.
Breaking Changes
None - backwards compatible. Existing DLQ entries without `job_key` are unaffected.

🤖 Generated with Claude Code