## Summary

The inline stalled-job recovery in `reserve-batch.lua` recovers expired jobs (removes them from the `processing` ZSET, re-adds them to the group ZSET, re-adds the group to `ready`) but does not remove the jobId from the per-group active list (`g:{gid}:active`). Because the reserve path gates on `LLEN(groupActiveKey) == 0` (line 72), the orphaned entry permanently blocks the group from ever being reserved again. The dedicated `check-stalled.lua` correctly performs the `LREM` (line 50), but by the time it runs, `reserve-batch.lua` has already cleaned the `processing` ZSET, so `check-stalled.lua` finds no candidates and does nothing.
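The blocking behavior can be illustrated with a small in-memory sketch (names are hypothetical; the real check is the `LLEN(groupActiveKey) == 0` gate in `reserve-batch.lua`):

```javascript
// In-memory stand-in for the per-group active lists (g:{gid}:active).
const activeLists = new Map();

// Mirrors the gate in reserve-batch.lua: a group is only eligible for
// reservation when its active list is empty (LLEN == 0).
function isGroupReservable(groupId) {
  const active = activeLists.get(groupId) ?? [];
  return active.length === 0;
}

// An orphaned entry left behind by the inline recovery blocks the group.
activeLists.set("G", ["job-12345"]);
console.log(isGroupReservable("G")); // false — group is skipped on every reserve

// Once the entry is removed (what LREM would do), the group unblocks.
activeLists.set("G", []);
console.log(isGroupReservable("G")); // true
```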
## Environment
- GroupMQ version: 1.1.0
- Node.js: v20.20.1
- Redis: 7-alpine (Docker)
## Root Cause

There is a race condition between the two stalled-job recovery paths.

### `reserve-batch.lua` — inline recovery (lines 30–54)

This path finds expired jobs and re-queues them, but never touches the per-group active list:
```lua
-- reserve-batch.lua lines 30-54
local expiredJobs = redis.call("ZRANGEBYSCORE", processingKey, 0, now)
if #expiredJobs > 0 then
  for _, jobId in ipairs(expiredJobs) do
    local procKey = ns .. ":processing:" .. jobId
    local procData = redis.call("HMGET", procKey, "groupId", "deadlineAt")
    local gid = procData[1]
    local deadlineAt = tonumber(procData[2])
    if gid and deadlineAt and now > deadlineAt then
      local jobKey = ns .. ":job:" .. jobId
      local jobScore = redis.call("HGET", jobKey, "score")
      if jobScore then
        local gZ = ns .. ":g:" .. gid
        redis.call("ZADD", gZ, tonumber(jobScore), jobId)
        local head = redis.call("ZRANGE", gZ, 0, 0, "WITHSCORES")
        if head and #head >= 2 then
          local headScore = tonumber(head[2])
          redis.call("ZADD", readyKey, headScore, gid)
        end
        redis.call("DEL", ns .. ":lock:" .. gid)
        redis.call("DEL", procKey)
        redis.call("ZREM", processingKey, jobId)
        -- ❌ MISSING: redis.call("LREM", ns .. ":g:" .. gid .. ":active", 1, jobId)
      end
    end
  end
end
```

### `check-stalled.lua` — dedicated checker (lines 48–50)
This path correctly removes the job from the active list:

```lua
-- check-stalled.lua lines 48-50
-- BullMQ-style: Remove from per-group active list
local groupActiveKey = ns .. ":g:" .. groupId .. ":active"
redis.call("LREM", groupActiveKey, 1, jobId)
```

### The conflict

When `reserve-batch.lua` runs first and cleans the `processing` ZSET, the subsequent `check-stalled.lua` invocation finds zero candidates (lines 27–29: `ZRANGEBYSCORE` returns empty) and exits early, never reaching the `LREM` cleanup code.
## Steps to Reproduce

1. A worker reserves a job for group `G` → the job goes into the `processing` ZSET and the `g:G:active` LIST.
2. The process crashes (e.g., an uncaught exception) before completing the job.
3. The process restarts and the worker calls `reserveBatch()`; the inline recovery in `reserve-batch.lua` finds the expired job in `processing`.
4. It removes the job from the `processing` ZSET, adds it back to the group ZSET, and adds the group to `ready`.
5. But it does NOT `LREM` the jobId from `g:G:active`.
6. Later, `check-stalled.lua` runs but finds nothing in `processing` (already cleaned by `reserve-batch.lua`), so the orphaned active-list entry persists forever.
7. All subsequent reserve attempts for group `G` check `LLEN(groupActiveKey)` (lines 71–72), see `activeCount > 0`, and skip the group.
8. Group `G` is permanently blocked.
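The sequence above can be reproduced with an in-memory sketch of the two Lua paths (names and data layout are illustrative; the real state lives in Redis):

```javascript
// Minimal in-memory model of the relevant Redis state for a single group G.
const state = {
  processing: new Map(), // jobId -> deadlineAt     (processing ZSET)
  groupJobs: new Map(),  // groupId -> [jobId, ...] (g:{gid} ZSET)
  ready: new Set(),      // groupIds                (ready ZSET)
  active: new Map(),     // groupId -> [jobId, ...] (g:{gid}:active LIST)
};

// Step 1: a worker reserves job-1 for group G, then the process crashes.
state.processing.set("job-1", Date.now() - 1000); // deadline already passed
state.active.set("G", ["job-1"]);

// Inline recovery as reserve-batch.lua does it today: re-queues the job
// and cleans processing, but never touches the active list (missing LREM).
function inlineRecovery(now) {
  for (const [jobId, deadlineAt] of state.processing) {
    if (now > deadlineAt) {
      state.groupJobs.set("G", [jobId]);
      state.ready.add("G");
      state.processing.delete(jobId); // cleans the "processing ZSET"
      // ❌ no removal from state.active
    }
  }
}

// Dedicated checker as check-stalled.lua does it: only jobs still found in
// processing are cleaned up, including the active-list removal (the LREM).
function checkStalled(now) {
  for (const [jobId, deadlineAt] of state.processing) {
    if (now > deadlineAt) {
      const active = state.active.get("G") ?? [];
      state.active.set("G", active.filter((id) => id !== jobId));
      state.processing.delete(jobId);
    }
  }
}

inlineRecovery(Date.now()); // runs first, empties processing
checkStalled(Date.now());   // finds zero candidates, does nothing

console.log(state.processing.size);  // 0 — already cleaned
console.log(state.active.get("G"));  // still contains "job-1" — orphaned
console.log(state.ready.has("G"));   // true — group looks ready…
// …but LLEN(g:G:active) > 0, so every reserve attempt skips group G.
```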
## Impact
Per-group FIFO ordering is permanently broken for affected groups. In our production system, 8 conversation groups became permanently blocked after a single process crash. These groups accumulated jobs in their group ZSETs but none were ever processed, requiring manual Redis intervention to unblock them.
## Observed State in Redis (production)

```
# Orphaned active list — contains the jobId from the crashed process
> LRANGE gmq:myqueue:g:{conversationId}:active 0 -1
1) "job-12345"

# Processing ZSET is empty — reserve-batch already cleaned it
> ZRANGEBYSCORE gmq:myqueue:processing -inf +inf
(empty array)

# Group is in the ready ZSET (worker sees it)
> ZSCORE gmq:myqueue:ready {conversationId}
"1710000000000"

# But jobs accumulate in the group ZSET, never processed
> ZCARD gmq:myqueue:g:{conversationId}
(integer) 14
```
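This diagnosis can be automated. A sketch of the check (in-memory here; against a live instance you would issue the equivalent `LRANGE`/`ZSCORE` calls through a Redis client — function name and data shapes are hypothetical):

```javascript
// Detect groups whose active list references only jobs that are no longer
// in the processing set — exactly the orphaned state shown above.
function findOrphanedGroups(activeLists, processingJobs) {
  const orphaned = [];
  for (const [groupId, jobIds] of activeLists) {
    const allGone =
      jobIds.length > 0 && jobIds.every((id) => !processingJobs.has(id));
    if (allGone) orphaned.push(groupId);
  }
  return orphaned;
}

// State mirroring the production snapshot: the active list still holds
// job-12345 while processing is empty.
const activeLists = new Map([["conv-1", ["job-12345"]], ["conv-2", []]]);
const processingJobs = new Set();
console.log(findOrphanedGroups(activeLists, processingJobs)); // [ 'conv-1' ]
```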
## Suggested Fix

Add the `LREM` to the inline recovery section in `reserve-batch.lua`, matching what `check-stalled.lua` already does:

```lua
-- In reserve-batch.lua, inline stalled recovery section (after line 48, before DEL procKey):
local groupActiveKey = ns .. ":g:" .. gid .. ":active"
redis.call("LREM", groupActiveKey, 1, jobId)
```

The full corrected block would be:

```lua
if jobScore then
  local gZ = ns .. ":g:" .. gid
  redis.call("ZADD", gZ, tonumber(jobScore), jobId)
  local head = redis.call("ZRANGE", gZ, 0, 0, "WITHSCORES")
  if head and #head >= 2 then
    local headScore = tonumber(head[2])
    redis.call("ZADD", readyKey, headScore, gid)
  end
  -- Clean up per-group active list (matches check-stalled.lua behavior)
  local groupActiveKey = ns .. ":g:" .. gid .. ":active"
  redis.call("LREM", groupActiveKey, 1, jobId)
  redis.call("DEL", ns .. ":lock:" .. gid)
  redis.call("DEL", procKey)
  redis.call("ZREM", processingKey, jobId)
end
```

## Workaround
Manually delete the orphaned active-list keys. Note that the broad version below clears the active lists of every group — including jobs that are legitimately in flight — so run it only while workers are stopped (and prefer `SCAN` over `KEYS` on a busy production instance):

```shell
redis-cli KEYS "gmq:*:g:*:active" | xargs -I{} redis-cli DEL {}
```

Or, more targeted — inspect first, then delete only the affected group:

```shell
redis-cli LRANGE "gmq:myqueue:g:{groupId}:active" 0 -1
redis-cli DEL "gmq:myqueue:g:{groupId}:active"
```
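For completeness, the effect of the suggested fix can be checked with the same kind of in-memory sketch: once recovery also clears the active entry (the added `LREM`), the `LLEN` gate passes again (names are illustrative, not GroupMQ APIs):

```javascript
// g:G:active with the stale entry left behind by a crash.
const active = new Map([["G", ["job-1"]]]);

// Recovery with the suggested fix applied: in addition to re-queuing the
// job, remove it from the per-group active list (the LREM from the patch).
function recoverWithFix(groupId, jobId) {
  const list = active.get(groupId) ?? [];
  const idx = list.indexOf(jobId);
  if (idx !== -1) list.splice(idx, 1); // LREM groupActiveKey 1 jobId
}

recoverWithFix("G", "job-1");
console.log(active.get("G").length === 0); // true — group reservable again
```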