reserve-batch.lua inline stalled recovery does not LREM from per-group active list, causing permanent group blocking #14

@pepinogttv

Description

Summary

The inline stalled job recovery in reserve-batch.lua recovers expired jobs (removes from processing ZSET, re-adds to group ZSET, re-adds group to ready) but does not remove the jobId from the per-group active list (g:{gid}:active). Because the reserve path gates on LLEN(groupActiveKey) == 0 (line 72), the orphaned entry permanently blocks the group from ever being reserved again. The dedicated check-stalled.lua correctly performs LREM (line 50), but by the time it runs, reserve-batch.lua has already cleaned the processing ZSET, so check-stalled.lua finds no candidates and does nothing.
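To see why a single orphaned entry blocks the group forever, here is a minimal pure-Python model of the reserve gate (dicts and lists stand in for the Redis structures; names are illustrative, not GroupMQ's actual API):

```python
# Model of the reserve gate: a group is only reservable when its per-group
# active list is empty, mirroring the LLEN(groupActiveKey) == 0 check.
group_zset = {"g-1": ["job-2", "job-3"]}   # pending jobs per group (FIFO)
group_active = {"g-1": ["job-1"]}          # orphaned entry from a crashed worker

def try_reserve(gid):
    # Mirrors the LLEN(groupActiveKey) == 0 gate in reserve-batch.lua
    if len(group_active.get(gid, [])) > 0:
        return None  # group skipped: it looks busy
    return group_zset[gid].pop(0)

print(try_reserve("g-1"))  # None — the stale "job-1" entry blocks reservation
```

Until something removes `"job-1"` from the active list, `try_reserve` can never hand out `"job-2"` or `"job-3"`, which is exactly the permanent blocking described above.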

Environment

  • GroupMQ version: 1.1.0
  • Node.js: v20.20.1
  • Redis: 7-alpine (Docker)

Root Cause

There is a race condition between the two stalled-job recovery paths:

reserve-batch.lua — inline recovery (lines 30–54)

This path finds expired jobs and re-queues them, but never touches the per-group active list:

-- reserve-batch.lua lines 30-54
local expiredJobs = redis.call("ZRANGEBYSCORE", processingKey, 0, now)
if #expiredJobs > 0 then
  for _, jobId in ipairs(expiredJobs) do
    local procKey = ns .. ":processing:" .. jobId
    local procData = redis.call("HMGET", procKey, "groupId", "deadlineAt")
    local gid = procData[1]
    local deadlineAt = tonumber(procData[2])
    if gid and deadlineAt and now > deadlineAt then
      local jobKey = ns .. ":job:" .. jobId
      local jobScore = redis.call("HGET", jobKey, "score")
      if jobScore then
        local gZ = ns .. ":g:" .. gid
        redis.call("ZADD", gZ, tonumber(jobScore), jobId)
        local head = redis.call("ZRANGE", gZ, 0, 0, "WITHSCORES")
        if head and #head >= 2 then
          local headScore = tonumber(head[2])
          redis.call("ZADD", readyKey, headScore, gid)
        end
        redis.call("DEL", ns .. ":lock:" .. gid)
        redis.call("DEL", procKey)
        redis.call("ZREM", processingKey, jobId)
        -- ❌ MISSING: redis.call("LREM", ns .. ":g:" .. gid .. ":active", 1, jobId)
      end
    end
  end
end

check-stalled.lua — dedicated checker (lines 48–50)

This path correctly removes from the active list:

-- check-stalled.lua lines 48-50
-- BullMQ-style: Remove from per-group active list
local groupActiveKey = ns .. ":g:" .. groupId .. ":active"
redis.call("LREM", groupActiveKey, 1, jobId)

The conflict

When reserve-batch.lua runs first and cleans the processing ZSET, the subsequent check-stalled.lua invocation finds zero candidates (lines 27–29: ZRANGEBYSCORE returns empty) and exits early, never reaching the LREM cleanup code.

Steps to Reproduce

  1. Worker reserves a job for group G → job goes to processing ZSET + g:G:active LIST
  2. Process crashes (e.g., uncaught exception) before completing the job
  3. Process restarts, worker calls reserveBatch()
  4. reserve-batch.lua inline recovery finds the expired job in processing
  5. It removes from processing ZSET, adds job back to group ZSET, adds group to ready
  6. But does NOT LREM the jobId from g:G:active
  7. Later, check-stalled.lua runs but finds nothing in processing (already cleaned by reserve-batch)
  8. The active list entry persists forever
  9. All subsequent reserve attempts for group G check LLEN(groupActiveKey) (lines 71–72), see activeCount > 0, and skip it
  10. Group G is permanently blocked
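The sequence above can be sketched end-to-end as a small simulation (pure Python, with dicts standing in for the Redis keys; function and variable names are illustrative, not GroupMQ's API):

```python
# State after step 1: job-1 reserved for group G, deadline at t=100.
processing = {"job-1": {"groupId": "G", "deadlineAt": 100}}  # processing ZSET + hash
group_zset = {"G": []}           # g:G ZSET (pending jobs)
group_active = {"G": ["job-1"]}  # g:G:active LIST
ready = set()                    # ready ZSET (groups visible to workers)

def reserve_batch_inline_recovery(now):
    # Models reserve-batch.lua lines 30-54: re-queue expired jobs,
    # but (the bug) never touch group_active.
    for job_id, meta in list(processing.items()):
        if now > meta["deadlineAt"]:
            group_zset[meta["groupId"]].append(job_id)
            ready.add(meta["groupId"])
            del processing[job_id]
            # MISSING: group_active[meta["groupId"]].remove(job_id)

def check_stalled(now):
    # Models check-stalled.lua: only jobs still in processing are candidates.
    for job_id, meta in list(processing.items()):
        if now > meta["deadlineAt"]:
            group_active[meta["groupId"]].remove(job_id)  # the LREM cleanup
            del processing[job_id]

reserve_batch_inline_recovery(now=200)  # steps 4-6: runs first, cleans processing
check_stalled(now=200)                  # step 7: finds nothing, LREM never runs
print(group_active["G"])  # ['job-1'] — orphaned forever (steps 8-10)
```

Because the inline recovery empties `processing` before `check_stalled` runs, the only code path that performs the LREM never sees the job, and the orphaned active-list entry survives indefinitely.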

Impact

Affected groups are permanently blocked: no job in them is ever processed again. In our production system, 8 conversation groups became permanently blocked after a single process crash. These groups kept accumulating jobs in their group ZSETs, but none were processed, and manual Redis intervention was required to unblock them.

Observed State in Redis (production)

# Orphaned active list — contains the jobId from the crashed process
> LRANGE gmq:myqueue:g:{conversationId}:active 0 -1
1) "job-12345"

# Processing ZSET is empty — reserve-batch already cleaned it
> ZRANGEBYSCORE gmq:myqueue:processing -inf +inf
(empty array)

# Group is in ready ZSET (worker sees it)
> ZSCORE gmq:myqueue:ready {conversationId}
"1710000000000"

# But jobs accumulate in group ZSET, never processed
> ZCARD gmq:myqueue:g:{conversationId}
(integer) 14

Suggested Fix

Add LREM to the inline recovery section in reserve-batch.lua, matching what check-stalled.lua already does:

-- In reserve-batch.lua, inline stalled recovery section (after line 48, before DEL procKey):
local groupActiveKey = ns .. ":g:" .. gid .. ":active"
redis.call("LREM", groupActiveKey, 1, jobId)

The full corrected block would be:

if jobScore then
  local gZ = ns .. ":g:" .. gid
  redis.call("ZADD", gZ, tonumber(jobScore), jobId)
  local head = redis.call("ZRANGE", gZ, 0, 0, "WITHSCORES")
  if head and #head >= 2 then
    local headScore = tonumber(head[2])
    redis.call("ZADD", readyKey, headScore, gid)
  end
  -- Clean up per-group active list (matches check-stalled.lua behavior)
  local groupActiveKey = ns .. ":g:" .. gid .. ":active"
  redis.call("LREM", groupActiveKey, 1, jobId)
  redis.call("DEL", ns .. ":lock:" .. gid)
  redis.call("DEL", procKey)
  redis.call("ZREM", processingKey, jobId)
end
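As a quick sanity check of the fix, here is a small pure-Python model (dicts standing in for the Redis keys; names are illustrative) showing that adding the LREM step leaves the group reservable after recovery:

```python
# Same starting state as the bug scenario: job-1 stalled, deadline passed.
processing = {"job-1": {"groupId": "G", "deadlineAt": 100}}
group_zset = {"G": []}
group_active = {"G": ["job-1"]}

def recover_with_lrem(now):
    # Corrected inline recovery: same re-queue logic, plus the LREM step.
    for job_id, meta in list(processing.items()):
        if now > meta["deadlineAt"]:
            group_zset[meta["groupId"]].append(job_id)
            group_active[meta["groupId"]].remove(job_id)  # the added LREM
            del processing[job_id]

recover_with_lrem(now=200)
# Active list is now empty, so LLEN(groupActiveKey) == 0 and the group
# passes the reserve gate on the next reserveBatch() call.
print(group_active["G"])  # []
```

With the active list cleared, the re-queued job is picked up on the next reserve, restoring per-group FIFO processing.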

Workaround

Manually delete the orphaned active list keys. Note that this blanket version also clears active lists for jobs that are genuinely in flight, so run it only while workers are stopped (and prefer redis-cli --scan over KEYS, which blocks Redis on large keyspaces):

redis-cli --scan --pattern "gmq:*:g:*:active" | xargs -I{} redis-cli DEL {}

Or, more targeted, inspect a specific group's active list first and delete only confirmed orphans:

redis-cli LRANGE "gmq:myqueue:g:{groupId}:active" 0 -1
redis-cli DEL "gmq:myqueue:g:{groupId}:active"
