Automatic reconnection on network failure #312

@Keruspe (Collaborator)

Now that we have the topology API, here are the steps required towards automatic reconnection:

  • Add an Option<Consumer> to the wrapped basic_consume, set it to None from the public wrapper, store the Option in the state, and use it to restore everything into the given Consumer if we got Some.
  • Introduce an InternalTopology that stores Connection/Channel/Consumer objects alongside the topology items.
  • Add a conversion between InternalTopology and Topology, dropping the associated items.
  • Change the topology methods to return the InternalTopology, and make the current one use that and convert to the public Topology.
  • In the same way, introduce a restore_internal, make restore use it, and pass the Options stored in the InternalTopology to basic_consume and friends.
  • Hook up basic get in InternalTopology and restore_internal.
  • Add an Option<Channel> to channel creation to share internals with the Channel we want to restore; set it to None, but use it when finalizing if it's Some.
  • Add an Option<InternalTopology>, set to None, to the connection process, and pass it to restore_internal when it's Some.
  • Detect network failure from the event loop and, instead of bubbling it up, call topology_internal and reinitiate the connection with Some(InternalTopology).
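
As a rough illustration of the checklist, here is a self-contained Rust sketch of the Option-passing idea. Only the names InternalTopology, Topology, topology_internal, restore_internal and basic_consume come from the steps listed; every other type and method (ConsumerHandle, ConsumerDefinition, the toy Connection, reconnect) is a hypothetical stand-in and not lapin's actual code, and the crate may well have ended up doing something different.

```rust
// Hypothetical sketch: none of these types are lapin's real definitions.

/// Stand-in for a live consumer object that user code keeps holding.
#[derive(Debug, Clone, PartialEq)]
pub struct ConsumerHandle { pub id: u32 }

/// Declarative description of a consumer, as a public topology would expose it.
#[derive(Debug, Clone, PartialEq)]
pub struct ConsumerDefinition { pub queue: String, pub tag: String }

/// Public topology: declarative items only.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Topology {
    pub queues: Vec<String>,
    pub consumers: Vec<ConsumerDefinition>,
}

/// Internal topology: the same items plus the live objects tied to them.
#[derive(Debug, Clone, Default)]
pub struct InternalTopology {
    pub queues: Vec<String>,
    pub consumers: Vec<(ConsumerDefinition, Option<ConsumerHandle>)>,
}

impl From<InternalTopology> for Topology {
    /// Converting to the public form drops the associated live objects.
    fn from(internal: InternalTopology) -> Self {
        Topology {
            queues: internal.queues,
            consumers: internal.consumers.into_iter().map(|(def, _)| def).collect(),
        }
    }
}

/// Toy state standing in for the real connection/channel state.
#[derive(Debug, Default)]
pub struct Connection {
    next_id: u32,
    queues: Vec<String>,
    consumers: Vec<(ConsumerDefinition, ConsumerHandle)>,
}

impl Connection {
    pub fn queue_declare(&mut self, name: &str) {
        self.queues.push(name.to_string());
    }

    /// Public wrapper: always passes `None`, so a fresh handle is created.
    pub fn basic_consume(&mut self, queue: &str, tag: &str) -> ConsumerHandle {
        self.basic_consume_internal(queue, tag, None)
    }

    /// Wrapped version: reuses the caller's existing handle when one is given.
    fn basic_consume_internal(
        &mut self,
        queue: &str,
        tag: &str,
        original: Option<ConsumerHandle>,
    ) -> ConsumerHandle {
        let handle = match original {
            Some(existing) => existing,
            None => ConsumerHandle { id: self.next_id + 1 },
        };
        // Keep the counter ahead of every id seen so far, restored or new.
        self.next_id = self.next_id.max(handle.id);
        let def = ConsumerDefinition { queue: queue.to_string(), tag: tag.to_string() };
        self.consumers.push((def, handle.clone()));
        handle
    }

    /// Snapshot of the declarative items together with their live handles.
    pub fn topology_internal(&self) -> InternalTopology {
        InternalTopology {
            queues: self.queues.clone(),
            consumers: self.consumers.iter()
                .map(|(def, handle)| (def.clone(), Some(handle.clone())))
                .collect(),
        }
    }

    /// Public view: the internal snapshot with the handles dropped.
    pub fn topology(&self) -> Topology {
        self.topology_internal().into()
    }

    /// Replays the snapshot, passing each stored Option to basic_consume_internal.
    fn restore_internal(&mut self, topology: InternalTopology) {
        for queue in &topology.queues {
            self.queue_declare(queue);
        }
        for (def, handle) in topology.consumers {
            self.basic_consume_internal(&def.queue, &def.tag, handle);
        }
    }

    /// Simulates the last step: snapshot, start over, restore.
    pub fn reconnect(self) -> Connection {
        let snapshot = self.topology_internal();
        let mut fresh = Connection::default();
        fresh.restore_internal(snapshot);
        fresh
    }
}
```

The point of threading the Option through is that, after reconnect, the handle the user already holds is the one registered in the new state, while the public basic_consume keeps its signature and simply passes None.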

Activity

robo-corg commented on Mar 17, 2021

Would you be interested in a PR for this?

Keruspe (Collaborator, Author) commented on Mar 17, 2021

Sure.
Otherwise, I plan to work on this over the summer, once 2.0 is out.

kageru commented on Sep 19, 2022

Any update on this?

Automatic reconnects would be really useful for me. I’d even try to contribute if something specific is missing.

Ks89 commented on Dec 29, 2022

I'm also interested in this feature.

Keruspe (Collaborator, Author) commented on Dec 29, 2022

I'd be willing to take some sponsorship to work on this.

carlhoerberg commented on May 24, 2024

> I'd be willing to take some sponsorship to work on this

We're willing to sponsor this, please email me at carl@cloudamqp.com

Keruspe (Collaborator, Author) commented on Jul 11, 2024

Progress is being made on this front; an initial version should be coming this summer.

Keruspe (Collaborator, Author) commented on Aug 4, 2024

Small update on this front:
I've slightly reworked my approach for this now that I could actually spend time on this (thanks to @carlhoerberg and CloudAMQP support).
I fixed a few bugs in the TCP loop that will be required for this to work properly.
I'm working on handling AMQP "soft" errors (e.g. errors local to one channel) first, so that I can properly implement recovery of a single channel and test it more easily.
Once channel recovery is done, I'll move on to AMQP "hard" errors, which are global to the connection, to ensure we properly recover all channels too.
Then the last step will be to trigger the recovery for other errors too (such as TCP errors).
I will create the associated issues, but the Channel part (which is fundamental for the other parts to work properly) should be done before the end of summer. Issuing a passive queue declare for a nonexistent queue on a channel will probably be the easiest way of testing this, as it triggers a channel error.
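
To make the soft/hard distinction concrete, here is a small Rust sketch that maps AMQP 0-9-1 reply codes to the scope of the error and then to a recovery step. The reply codes and their soft/hard classes follow the AMQP 0-9-1 specification; the ErrorScope, Failure and RecoveryAction types and both functions are hypothetical helpers for illustration, not lapin APIs, and treating an I/O failure as "recover the connection" is an assumption based on the plan described here.

```rust
// Hypothetical helpers, not part of lapin.

/// Which object the broker closes when it raises the error.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorScope {
    /// "Soft" error: only the offending channel is closed.
    Channel,
    /// "Hard" error: the whole connection is closed.
    Connection,
}

/// What went wrong, as seen by the client.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Failure {
    /// The broker sent a close with this AMQP reply code.
    Amqp { reply_code: u16 },
    /// The transport failed (for example, a TCP error).
    Io,
}

/// Which recovery layer should handle the failure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryAction {
    RecoverChannel,
    RecoverConnection,
}

/// Classifies an AMQP 0-9-1 reply code; returns None for non-error or unknown codes.
pub fn error_scope(reply_code: u16) -> Option<ErrorScope> {
    match reply_code {
        // content-too-large, no-consumers, access-refused, not-found,
        // resource-locked, precondition-failed
        311 | 313 | 403 | 404 | 405 | 406 => Some(ErrorScope::Channel),
        // connection-forced, invalid-path, frame-error, syntax-error,
        // command-invalid, channel-error, unexpected-frame, resource-error,
        // not-allowed, not-implemented, internal-error
        320 | 402 | 501 | 502 | 503 | 504 | 505 | 506 | 530 | 540 | 541 => {
            Some(ErrorScope::Connection)
        }
        _ => None,
    }
}

/// Picks the recovery layer. Assumption: I/O failures are handled like hard errors.
pub fn recovery_for(failure: Failure) -> Option<RecoveryAction> {
    match failure {
        Failure::Amqp { reply_code } => error_scope(reply_code).map(|scope| match scope {
            ErrorScope::Channel => RecoveryAction::RecoverChannel,
            ErrorScope::Connection => RecoveryAction::RecoverConnection,
        }),
        Failure::Io => Some(RecoveryAction::RecoverConnection),
    }
}
```

A passive queue declare on a missing queue is answered with 404 (not-found), which falls in the channel-scoped group, so it exercises channel recovery without touching the connection.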

conioX commented on Oct 4, 2024

Any news about this?

Keruspe (Collaborator, Author) commented on Oct 12, 2024

I'm sorry about this, the last two months were a lot... rougher than anticipated. All current progress can be tracked in #416.
I'm still focusing on channel recovery first, and I'll get to connection recovery once this is stabilized.
Currently, the publishing part works pretty well and I'm confident in the implementation.
I want to hook up some topology recovery (recreation of temporary queues and so on).
The consumer part is trickier, but parts of it are already there.

Keruspe (Collaborator, Author) commented on Apr 19, 2025

I haven't posted an update here for quite some time, but things have moved forward a lot!
All the ongoing work has been merged as part of the latest 3.0 beta versions.

I'm still targeting only channel reconnection in case of a channel error as a first step... but that's actually most of the work, or at least most of the complexity, protocol-wise.

For the publishing case, I think we're good now.
For the receiving part, there are missing pieces in the consumer handling (some of the work is done, but the finishing touches to make it actually work transparently are missing). Basic-get should properly handle reconnection, but that needs confirmation.
Extensions such as confirm-select are properly supported.
Automatic re-declaration of temporary queues is in progress locally. I want to finish addressing this point and have a first working implementation of consumers, then I'll release 3.0 final. The connection part will come after that.

Implementing this has also led me to fix a few corner-case bugs here and there, which is good news for overall stability.

I cannot promise any deadline, but I'm hoping to release 3.0 in May.

Keruspe (Collaborator, Author) commented on Jun 4, 2025

Good news: I have consumers working properly!
It needs a little bit of cleanup, but the release is coming real soon.

sin-ack commented on Jul 15, 2025

Hi, are there any updates on this? Is the checklist in OP up-to-date? I'd like to help with any remaining missing features.

Keruspe (Collaborator, Author) commented on Jul 15, 2025

The checklist is partially outdated; we took a different approach in the end.

The end of the school year was pretty packed, but I should be able to allocate more time to this soon.

Basically, I have one bug that I'm currently troubleshooting for channel reconnection, then I'll be targeting connection reconnection. I really want channel reconnection to be battle-tested before enabling the next layers, as it's way easier to diagnose issues this way.
Connection recovery should be pretty fast, with all the tooling that got in for channel recovery.
The last step will be triggering it on network errors.

I'd say we're 80-85% there. Connection recovery is ~5% and hooking up the network failures is ~10-15%.

sin-ack commented on Jul 15, 2025

Thanks for the update! I'd like to help get this ready as soon as possible, so my help offer still stands.

Keruspe (Collaborator, Author) commented on Jul 27, 2025

I'll update the checklist to better reflect what has been done and what remains to be done, but basically, the AMQP connection part is done. I need to make a last tweak to the io_loop so that it properly keeps going during reconnection (there's currently a hacky workaround for testing).

I think the TCP part should be mostly done before mid-August. I'll release a 3.1.0 at that point, with the full recovery/reconnection testable. It won't be considered ready for production for a while though, as it's inevitable that there will be corner cases where things can go south: AMQP is stateful and really not designed with recovery in mind.

Keruspe (Collaborator, Author) commented on Jul 31, 2025

I'm keeping this issue open a little longer, but this is now testable as part of 3.1.0 with an unstable feature, as described in the README.

          Automatic reconnection on network failure · Issue #312 · amqp-rs/lapin