Failing to acquire a replication slot puts server into a loop #1960

Closed
balegas opened this issue Nov 9, 2024 · 1 comment · Fixed by #2166
Contributor

balegas commented Nov 9, 2024

When the server is unable to acquire a replication slot, it gets into a retry loop that eventually exhausts the CPU. A problem with a replication slot shouldn't bring down the entire instance.

Consider adding an exponential backoff or another approach to handle the situation more gracefully.
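To make the suggestion concrete, here is a minimal, language-agnostic sketch of capped exponential backoff with jitter, written in Python for illustration. It is not Electric's implementation: `backoff_delays`, `connect_with_backoff`, and the `connect` callable are hypothetical names, and the default delays are arbitrary.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield an endless sequence of delays: base, base*factor, ..., never above cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_backoff(connect, max_attempts=8, sleep=time.sleep,
                         jitter=random.random, **backoff_opts):
    """Call `connect` until it succeeds, sleeping between failed attempts.

    `connect` is a hypothetical zero-argument callable that raises
    ConnectionError when the replication slot cannot be acquired.
    """
    delays = backoff_delays(**backoff_opts)
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # "Full jitter": sleep a random fraction of the current delay so
            # that several instances do not retry in lockstep.
            sleep(next(delays) * jitter())
```

With the defaults above, the upper bound on the wait grows as 1 s, 2 s, 4 s, and so on up to 60 s, so a slot that stays busy costs a handful of attempts per minute instead of a spin loop.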

[Screenshot, 2024-11-09 00:44:03]
00:35:27.572 [error] :gen_statem {Electric.Registry.Processes, {Electric.Postgres.ReplicationClient, :default, "1bbcafee-cc1c-429d-b89a-1a14578ec67a"}} terminating
--
** (Postgrex.Error) ERROR 55006 (object_in_use) replication slot "electric_slot_production_us_east_1" is active for PID 679
(stdlib 6.0.1) gen_statem.erl:3242: :gen_statem.loop_state_callback_result/11
(stdlib 6.0.1) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Queue: [info: :start_streaming]
Postponed: []
00:35:27.574 [info] Postgres server version = 160004, system identifier = 7431946106013976826, timeline_id = 1
balegas changed the title from "Failing to acquire a replication slot in a close loop" to "Failing to acquire a replication slot puts server into a loop" on Nov 9, 2024
alco self-assigned this on Dec 13, 2024
Member

alco commented Dec 13, 2024

This shouldn't normally happen because Electric acquires an advisory lock before opening a replication connection, so no two Electric instances may try using the same slot simultaneously.

I was able to reproduce this issue only after I had commented out the lock acquisition logic in the code.
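For readers unfamiliar with the pattern, the lock-before-replication step described above can be sketched roughly as follows. This is an illustration in Python, not Electric's code: the thread does not say how the lock key is derived, so deriving it from the slot name with `zlib.crc32` is an assumption, as are the helper names and the `%s` parameter style of the DB-API cursor. The only PostgreSQL call used is the built-in `pg_advisory_lock(bigint)`.

```python
import zlib


def slot_lock_key(slot_name: str) -> int:
    """Map a slot name to a stable integer key for pg_advisory_lock(bigint).

    Assumption: the key is derived from the slot name. crc32 is used because
    it is deterministic across processes, unlike Python's built-in hash().
    """
    return zlib.crc32(slot_name.encode("utf-8"))


def acquire_slot_lock(cursor, slot_name: str) -> int:
    """Block until this session holds the advisory lock for `slot_name`.

    `cursor` is any DB-API cursor that accepts "%s" placeholders (assumed).
    The lock is session-level, so the connection behind the cursor must stay
    open for as long as the replication connection is in use.
    """
    key = slot_lock_key(slot_name)
    cursor.execute("SELECT pg_advisory_lock(%s)", (key,))
    return key
```

A second instance calling `acquire_slot_lock` with the same slot name would wait inside `pg_advisory_lock` until the first session ends or releases the lock, so it never reaches the point of opening a replication connection against a slot that is still active.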

Regardless, there shouldn't be a spin loop when opening a replication connection fails for any reason. Looking into that.
