reserve-batch.lua inline stalled recovery does not LREM from per-group active list, causing permanent group blocking #14

@pepinogttv

Description

Summary

The inline stalled job recovery in reserve-batch.lua recovers expired jobs (removes from processing ZSET, re-adds to group ZSET, re-adds group to ready) but does not remove the jobId from the per-group active list (g:{gid}:active). Because the reserve path gates on LLEN(groupActiveKey) == 0 (line 72), the orphaned entry permanently blocks the group from ever being reserved again. The dedicated check-stalled.lua correctly performs LREM (line 50), but by the time it runs, reserve-batch.lua has already cleaned the processing ZSET, so check-stalled.lua finds no candidates and does nothing.
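To see why a single orphaned entry blocks the group forever, here is a minimal pure-Python model of the reserve gate (dicts and lists stand in for the Redis structures; names are illustrative, not GroupMQ's actual API):

```python
# Model of the reserve gate: a group is only reservable when its per-group
# active list is empty, mirroring the LLEN(groupActiveKey) == 0 check.
group_zset = {"g-1": ["job-2", "job-3"]}   # pending jobs per group (FIFO)
group_active = {"g-1": ["job-1"]}          # orphaned entry from a crashed worker

def try_reserve(gid):
    # Mirrors the LLEN(groupActiveKey) == 0 gate in reserve-batch.lua
    if len(group_active.get(gid, [])) > 0:
        return None  # group skipped: it looks busy
    return group_zset[gid].pop(0)

print(try_reserve("g-1"))  # None — the stale "job-1" entry blocks reservation
```

Until something removes `"job-1"` from the active list, `try_reserve` can never hand out `"job-2"` or `"job-3"`, which is exactly the permanent blocking described above.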

Environment

  • GroupMQ version: 1.1.0
  • Node.js: v20.20.1
  • Redis: 7-alpine (Docker)

Root Cause

There is a race condition between the two stalled-job recovery paths:

reserve-batch.lua — inline recovery (lines 30–54)

This path finds expired jobs and re-queues them, but never touches the per-group active list:

-- reserve-batch.lua lines 30-54
local expiredJobs = redis.call("ZRANGEBYSCORE", processingKey, 0, now)
if #expiredJobs > 0 then
  for _, jobId in ipairs(expiredJobs) do
    local procKey = ns .. ":processing:" .. jobId
    local procData = redis.call("HMGET", procKey, "groupId", "deadlineAt")
    local gid = procData[1]
    local deadlineAt = tonumber(procData[2])
    if gid and deadlineAt and now > deadlineAt then
      local jobKey = ns .. ":job:" .. jobId
      local jobScore = redis.call("HGET", jobKey, "score")
      if jobScore then
        local gZ = ns .. ":g:" .. gid
        redis.call("ZADD", gZ, tonumber(jobScore), jobId)
        local head = redis.call("ZRANGE", gZ, 0, 0, "WITHSCORES")
        if head and #head >= 2 then
          local headScore = tonumber(head[2])
          redis.call("ZADD", readyKey, headScore, gid)
        end
        redis.call("DEL", ns .. ":lock:" .. gid)
        redis.call("DEL", procKey)
        redis.call("ZREM", processingKey, jobId)
        -- ❌ MISSING: redis.call("LREM", ns .. ":g:" .. gid .. ":active", 1, jobId)
      end
    end
  end
end

check-stalled.lua — dedicated checker (lines 48–50)

This path correctly removes from the active list:

-- check-stalled.lua lines 48-50
-- BullMQ-style: Remove from per-group active list
local groupActiveKey = ns .. ":g:" .. groupId .. ":active"
redis.call("LREM", groupActiveKey, 1, jobId)

The conflict

When reserve-batch.lua runs first and cleans the processing ZSET, the subsequent check-stalled.lua invocation finds zero candidates (lines 27–29: ZRANGEBYSCORE returns empty) and exits early, never reaching the LREM cleanup code.

Steps to Reproduce

  1. Worker reserves a job for group G → job goes to processing ZSET + g:G:active LIST
  2. Process crashes (e.g., uncaught exception) before completing the job
  3. Process restarts, worker calls reserveBatch()
  4. reserve-batch.lua inline recovery finds the expired job in processing
  5. It removes from processing ZSET, adds job back to group ZSET, adds group to ready
  6. But does NOT LREM the jobId from g:G:active
  7. Later, check-stalled.lua runs but finds nothing in processing (already cleaned by reserve-batch)
  8. The active list entry persists forever
  9. All subsequent reserve attempts for group G check LLEN(groupActiveKey) (lines 71–72), see activeCount > 0, and skip it
  10. Group G is permanently blocked
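The sequence above can be sketched end-to-end as a small simulation (pure Python, with dicts standing in for the Redis keys; function and variable names are illustrative, not GroupMQ's API):

```python
# State after step 1: job-1 reserved for group G, deadline at t=100.
processing = {"job-1": {"groupId": "G", "deadlineAt": 100}}  # processing ZSET + hash
group_zset = {"G": []}           # g:G ZSET (pending jobs)
group_active = {"G": ["job-1"]}  # g:G:active LIST
ready = set()                    # ready ZSET (groups visible to workers)

def reserve_batch_inline_recovery(now):
    # Models reserve-batch.lua lines 30-54: re-queue expired jobs,
    # but (the bug) never touch group_active.
    for job_id, meta in list(processing.items()):
        if now > meta["deadlineAt"]:
            group_zset[meta["groupId"]].append(job_id)
            ready.add(meta["groupId"])
            del processing[job_id]
            # MISSING: group_active[meta["groupId"]].remove(job_id)

def check_stalled(now):
    # Models check-stalled.lua: only jobs still in processing are candidates.
    for job_id, meta in list(processing.items()):
        if now > meta["deadlineAt"]:
            group_active[meta["groupId"]].remove(job_id)  # the LREM cleanup
            del processing[job_id]

reserve_batch_inline_recovery(now=200)  # steps 4-6: runs first, cleans processing
check_stalled(now=200)                  # step 7: finds nothing, LREM never runs
print(group_active["G"])  # ['job-1'] — orphaned forever (steps 8-10)
```

Because the inline recovery empties `processing` before `check_stalled` runs, the only code path that performs the LREM never sees the job, and the orphaned active-list entry survives indefinitely.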

Impact

Affected groups are permanently blocked: no job in them is ever processed again. In our production system, 8 conversation groups became permanently blocked after a single process crash. These groups kept accumulating jobs in their group ZSETs, but none were processed, and manual Redis intervention was required to unblock them.

Observed State in Redis (production)

# Orphaned active list — contains the jobId from the crashed process
> LRANGE gmq:myqueue:g:{conversationId}:active 0 -1
1) "job-12345"

# Processing ZSET is empty — reserve-batch already cleaned it
> ZRANGEBYSCORE gmq:myqueue:processing -inf +inf
(empty array)

# Group is in ready ZSET (worker sees it)
> ZSCORE gmq:myqueue:ready {conversationId}
"1710000000000"

# But jobs accumulate in group ZSET, never processed
> ZCARD gmq:myqueue:g:{conversationId}
(integer) 14

Suggested Fix

Add LREM to the inline recovery section in reserve-batch.lua, matching what check-stalled.lua already does:

-- In reserve-batch.lua, inline stalled recovery section (after line 48, before DEL procKey):
local groupActiveKey = ns .. ":g:" .. gid .. ":active"
redis.call("LREM", groupActiveKey, 1, jobId)

The full corrected block would be:

if jobScore then
  local gZ = ns .. ":g:" .. gid
  redis.call("ZADD", gZ, tonumber(jobScore), jobId)
  local head = redis.call("ZRANGE", gZ, 0, 0, "WITHSCORES")
  if head and #head >= 2 then
    local headScore = tonumber(head[2])
    redis.call("ZADD", readyKey, headScore, gid)
  end
  -- Clean up per-group active list (matches check-stalled.lua behavior)
  local groupActiveKey = ns .. ":g:" .. gid .. ":active"
  redis.call("LREM", groupActiveKey, 1, jobId)
  redis.call("DEL", ns .. ":lock:" .. gid)
  redis.call("DEL", procKey)
  redis.call("ZREM", processingKey, jobId)
end
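As a quick sanity check of the fix, here is a small pure-Python model (dicts standing in for the Redis keys; names are illustrative) showing that adding the LREM step leaves the group reservable after recovery:

```python
# Same starting state as the bug scenario: job-1 stalled, deadline passed.
processing = {"job-1": {"groupId": "G", "deadlineAt": 100}}
group_zset = {"G": []}
group_active = {"G": ["job-1"]}

def recover_with_lrem(now):
    # Corrected inline recovery: same re-queue logic, plus the LREM step.
    for job_id, meta in list(processing.items()):
        if now > meta["deadlineAt"]:
            group_zset[meta["groupId"]].append(job_id)
            group_active[meta["groupId"]].remove(job_id)  # the added LREM
            del processing[job_id]

recover_with_lrem(now=200)
# Active list is now empty, so LLEN(groupActiveKey) == 0 and the group
# passes the reserve gate on the next reserveBatch() call.
print(group_active["G"])  # []
```

With the active list cleared, the re-queued job is picked up on the next reserve, restoring per-group FIFO processing.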

Workaround

Manually delete the orphaned active list keys. Note that this blanket version also clears active lists for jobs that are genuinely in flight, so run it only while workers are stopped (and prefer redis-cli --scan over KEYS, which blocks Redis on large keyspaces):

redis-cli --scan --pattern "gmq:*:g:*:active" | xargs -I{} redis-cli DEL {}

Or, more targeted, inspect a specific group's active list first and delete only confirmed orphans:

redis-cli LRANGE "gmq:myqueue:g:{groupId}:active" 0 -1
redis-cli DEL "gmq:myqueue:g:{groupId}:active"
