Skip to content

CQ: Skip and ignore pending ACKs which no longer exist in RAM and/or DISK when dead-lettering rejected messages #13926

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Ayanda-D
Copy link
Contributor

Proposed Changes

Hello Team 👋

We're seeing the following CQ crash very frequently. NOTE: the crash log below is from a CMQ (on 3.12.x), but the patch we're suggesting is on the rabbit_variable_queue BQ (which we'd still like to have in 4.x. to avoid same problems). When dead-lettering CQ rejected messages, on BQ:ackfold/4 - there are cases where expected pending ACKs no longer exist in BQ state, i.e. ram_pending_ack and/or disk_pending_ack causing the entire queue to crash. In such cases we want to skip and ignore processing the rejected ACK sequence-id, seeing it no longer exists in the BQ state, no need to crash the entire queue, and proceed to processing the next ACK(s):


  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0> ** Reason for termination ==
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0> ** {{badkey,16601},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>     [{erlang,map_get,[16601,#{}],[{error_info,#{module => erl_erts_errors}}]},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>      {rabbit_variable_queue,lookup_pending_ack,2,
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>                             [{file,"rabbit_variable_queue.erl"},{line,2246}]},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>      {rabbit_variable_queue,'-ackfold/4-fun-0-',3,
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>                             [{file,"rabbit_variable_queue.erl"},{line,715}]},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>      {lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>      {rabbit_variable_queue,ackfold,4,
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>                             [{file,"rabbit_variable_queue.erl"},{line,714}]},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>      {rabbit_priority_queue,ackfold,4,
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>                             [{file,"rabbit_priority_queue.erl"},{line,340}]},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>      {rabbit_mirror_queue_master,ackfold,4,
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>                                  [{file,"rabbit_mirror_queue_master.erl"},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>                                   {line,385}]},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>      {rabbit_amqqueue_process,'-dead_letter_rejected_msgs/3-fun-0-',5,
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>                               [{file,"rabbit_amqqueue_process.erl"},
  | 2025-05-20 22:01:23.801104+00:00 [error] <0.19894.0>                                {line,1050}]}]}

Please take a look if OK (crashes are recurring). Thanks

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

… when

dead-lettering rejected messages - to avoid crashing the queue from rejected
message Acks (messages have been rejected already, no further Ack processing
required if not found in ram_pending_ack and disk_pending_ack).
@lhoguin
Copy link
Contributor

lhoguin commented May 21, 2025

Why wouldn't the pending acks exist in the state? I could understand if CMQs were not synchronous or had a bug that could make something like this happen, but now that CMQs are gone I'm not sure there's a point having this in 4.0+? If there's still a crash that exists in 4.0+ then we should seek where the state gets wrong and correct that instead.

@Ayanda-D
Copy link
Contributor Author

Ayanda-D commented May 23, 2025

@lhoguin Thanks for checking! At this point we are not really sure how the CMQ gets into this state.

Can a normal CQ is_msg_in_pending_acks/2 call return false at some point during its runtime:
https://github.com/rabbitmq/rabbitmq-server/blob/5f1bef6141bd451d60dd9a7267c7d07bed26153d/deps/rabbit/src/rabbit_variable_queue.erl#L1141C1-L1144 - if yes, then seems this could be a possibility.

I was also wondering if it is possible (an edge case) for an ACK and REJECT for same AckTags to be received by the queue process? Seeing the ACKs are prioritized, if that affects order of processing. Which can cause pending to not exist in the state when a REJECT is handled. Should REJECTs asynch messages also be prioritized here maybe?

The only issue here is that when we do "let it crash" at this point, we lose the queue entirely (and with non-mirrored CQs impact might be worse, no mirror to take over).

Also on some legacy clusters running 3.7.x, we can see lookup_pending_ack/2 would also fallback on the qi_pending_ack https://github.com/rabbitmq/rabbitmq-server/blob/v3.7.9/src/rabbit_variable_queue.erl#L2084 - I supposed this has changed with CQv2. & JFYI, we also have a custom patch to handle some memory leaks in CMQs which could be affecting this. So not 100% clear at the moment which of the above could be the cause.

There is no urgent need to handle this in newer supported versions (until we see this in newer versions, if ever). I'll just push another commit for documenting purposes, we can always backport the patch in case we ever need to reference in future.

Please feel free to close. Thanks! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants