Failing to acquire a replication slot puts server into a loop #1960

balegas · 2024-11-09T00:44:16Z

When the server is unable to acquire a replication slot, it enters a tight retry loop that eventually exhausts the CPU. A problem with a replication slot shouldn't bring down the entire instance.

Consider adding an exponential backoff or another approach to handle the situation more gracefully.

00:35:27.572 [error] :gen_statem {Electric.Registry.Processes, {Electric.Postgres.ReplicationClient, :default, "1bbcafee-cc1c-429d-b89a-1a14578ec67a"}} terminating
--
** (Postgrex.Error) ERROR 55006 (object_in_use) replication slot "electric_slot_production_us_east_1" is active for PID 679
(stdlib 6.0.1) gen_statem.erl:3242: :gen_statem.loop_state_callback_result/11
(stdlib 6.0.1) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Queue: [info: :start_streaming]
Postponed: []
00:35:27.574 [info] Postgres server version = 160004, system identifier = 7431946106013976826, timeline_id = 1

alco · 2024-12-13T14:07:34Z

This shouldn't normally happen because Electric acquires an advisory lock before opening a replication connection, so no two Electric instances may try using the same slot simultaneously.

I was able to reproduce this issue only after I had commented out the lock acquisition logic from the code.

Regardless, there shouldn't be a spin loop when opening a replication connection fails for any reason. Looking into that.
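
For illustration, here is a minimal sketch of the kind of capped exponential backoff suggested above. It is written in Python rather than Elixir and is not the code that landed in Electric. The names `SlotUnavailable`, `backoff_delays` and `connect_with_backoff`, along with all default values, are hypothetical and chosen only for this example.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class SlotUnavailable(Exception):
    """Hypothetical error standing in for a failed replication connection."""


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield base, base*factor, base*factor**2, ... with each value capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 8,
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
    jitter: bool = True,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, sleeping longer after each failure.

    Re-raises the last SlotUnavailable once `max_attempts` is reached, so the
    caller can decide what to do instead of spinning forever.
    """
    delays = backoff_delays(base, factor, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except SlotUnavailable:
            if attempt == max_attempts:
                raise
            delay = next(delays)
            if jitter:
                # Randomize the wait so several instances do not retry in lockstep.
                delay = random.uniform(0, delay)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

With the defaults above, and ignoring jitter, the waits between attempts would be 1, 2, 4, 8, 16 and then 30 seconds, so a slot that stays busy costs a handful of retries per minute rather than a CPU-bound loop.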

balegas changed the title ~~Failing to acquire a replication slot in a close loop~~ Failing to acquire a replication slot puts server into a loop Nov 9, 2024

KyleAMathews added the bug label Nov 12, 2024

alco self-assigned this Dec 13, 2024

alco added a commit that referenced this issue Dec 13, 2024

Use exponential backoff when restoring a crashed replication connection

80833b3

Fixes #1960.

alco mentioned this issue Dec 13, 2024

Use exponential backoff when restoring a crashed replication connection #2166

Merged

icehaunter closed this as completed in #2166 Dec 24, 2024

icehaunter closed this as completed in b64c900 Dec 24, 2024
