OOM risk: put_persisted_messages bypasses max_memory_mb limit on startup

## Summary

When a sink accumulates a large number of persisted events in the database (e.g., due to a delivery error), all events are loaded into memory unconditionally on startup/rehydration via `put_persisted_messages`, bypassing `max_memory_mb` and `setting_max_messages` limits. This can cause OOM kills on the pod.

## Reproduction scenario

We experienced this ~~in production~~ with an ~~Elasticsearch~~ OpenSearch sink (`jobs_sink`) that had a `strict_dynamic_mapping_exception` delivery error. While the sink was erroring:

1. **140,416 events** accumulated in `sequin_streams.consumer_events` (partition 68)
2. Each enriched Job message was **~18 KB** in memory (enrichment joins ~12 related tables)
3. When the delivery error was fixed, the SlotMessageStore loaded all persisted messages at once
4. Memory jumped to **1,509 MB** despite `max_memory_mb` being set to **512 MB**
5. On an earlier run with `max_memory_mb=128`, the store oscillated between filling (hitting the limit for new WAL messages) and draining, but persisted messages were always loaded without limit

## Root cause

In `lib/sequin/runtime/slot_message_store_state.ex`:

```elixir
# Line 132-143
def put_persisted_messages(%State{} = state, messages) do
    persisted_message_groups =
      Enum.reduce(messages, state.persisted_message_groups, fn msg, acc ->
        Multiset.put(acc, group_id(msg), {msg.commit_lsn, msg.commit_idx})
      end)

    # This cannot fail because we `skip_limit_check?`
    {:ok, state} =
      put_messages(%{state | persisted_message_groups: persisted_message_groups}, messages, skip_limit_check?: true)

    state
end
```

Both `put_persisted_messages` (line 141) and `put_table_reader_batch` (line 160) call `put_messages` with `skip_limit_check?: true`, meaning:

- `max_memory_mb` is not enforced
- `setting_max_messages` (default 50,000) is not enforced
- There is no upper bound on memory consumption during rehydration

## Impact

- **OOM risk**: A pod with 32 GB RAM could be killed if enough events accumulate. In our case, 140k events at ~18 KB each = ~2.5 GB. A table with larger payloads or more accumulated events could easily exceed pod memory limits.
- **No recovery path**: Once events are persisted, every restart will attempt to load them all, potentially causing repeated OOM crashes.
- **Silent accumulation**: Events pile up silently when a sink has delivery errors. There's no alert or limit on how many events can accumulate in the database.

## Environment

- Sequin self-hosted on Kubernetes (EKS)
- Pod resources: 8 CPU / 32 GB RAM
- 25 ~~Elasticsearch~~ OpenSearch sinks, all using enrichment queries
- Source database: Aurora PostgreSQL 12
- Sink consumer config: `batch_size=1`, `max_memory_mb=128` (later increased to 512), `message_grouping=true`

## Suggested improvements

1. **Batch-load persisted messages**: Instead of loading all persisted messages at once, load them in configurable batches that respect `max_memory_mb`, processing each batch before loading the next.

2. **Enforce a hard memory ceiling**: Even for persisted messages, don't exceed a configurable absolute maximum (e.g., `max_memory_mb * 2` or a separate `hard_max_memory_mb` setting).

3. **Add a persisted event count limit or alert**: When events accumulate beyond a threshold in the database (e.g., due to sustained delivery errors), surface a health check warning or automatically pause the sink before the backlog becomes dangerous.

4. **Consider on-demand loading**: Instead of eagerly loading all persisted messages into memory, load them on-demand as the sink has capacity to deliver, similar to how the ConsumerProducer pulls from the store.

## Workarounds

Currently the only options to prevent OOM are:
- Fix delivery errors quickly before events accumulate
- Manually delete stuck events from `sequin_streams.consumer_events` 
- Ensure pod memory limits are very generous relative to potential event backlogs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OOM risk: put_persisted_messages bypasses max_memory_mb limit on startup #2128

Summary

Reproduction scenario

Root cause

Impact

Environment

Suggested improvements

Workarounds

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

OOM risk: put_persisted_messages bypasses max_memory_mb limit on startup #2128

Description

Summary

Reproduction scenario

Root cause

Impact

Environment

Suggested improvements

Workarounds

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions