Skip to content

Conversation

@michalvavrik
Copy link
Member

@michalvavrik michalvavrik changed the title Fix CDI request context deactivation race when Quarkus REST and Reactive Routes are used during the same request Fix CDI request context activation race when Quarkus REST and Reactive Routes are used during the same request Nov 28, 2025
@michalvavrik michalvavrik force-pushed the feature/fix-rest-cdi-req-ctx-active-after-timeout branch 2 times, most recently from bbb997a to 640e603 Compare November 28, 2025 18:36
@michalvavrik
Copy link
Member Author

michalvavrik commented Nov 28, 2025

@geoand I have run the reproducer and the test I added couple of times and this fixes the issue. I tried to run few related modules like RR and Quarkus REST and didn't see anything failing. Let's see what CI does. @mkouba may also be interested.

@geoand
Copy link
Contributor

geoand commented Nov 28, 2025

Cool!

I'll have a look soon

@quarkus-bot

This comment has been minimized.

@quarkus-bot

This comment has been minimized.

Copy link
Contributor

@geoand geoand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

@geoand geoand requested a review from mkouba November 30, 2025 06:13
} finally {
// Deactivate the context, i.e. cleanup the thread locals
requestContext.deactivate();
if (RoutingUtils.isCdiRequestContextOwner(context, REACTIVE_ROUTES_KEY)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this check unnecessary? I mean, at this point the owner must be quarkus-reactive-routes (unless something replaced the CURRENT_CDI_REQUEST_CTX_OWNER)...

Copy link
Member Author

@michalvavrik michalvavrik Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this check unnecessary? I mean, at this point the owner must be quarkus-reactive-routes (unless something replaced the CURRENT_CDI_REQUEST_CTX_OWNER)...

That is not how I read the situation. When routingContext.next() happens, other route handlers are called synchronously until some code is scheduled (like run blocking), at which point this can be invoked. So it can easily be Quarkus REST.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant "synchronously".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this check unnecessary? I mean, at this point the owner must be quarkus-reactive-routes (unless something replaced the CURRENT_CDI_REQUEST_CTX_OWNER)...

That is not how I read the situation. When routingContext.next() happens, other route handlers are called synchronously until some code is scheduled (like run blocking), at which point this can be invoked. So it can easily be Quarkus REST.

Yes, but all those handlers share the same RoutingContext...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that finally is run when Quarkus REST is processing because routingContext.next() is run before.

Yes, but all those handlers share the same RoutingContext...

I don't know how to answer that :-) You will need to elaborate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that finally is run when Quarkus REST is processing because routingContext.next() is run before.

Ok, so we don't want to deactivate the context either. Small note - if Vertx is used then requestContext.deactivate() is a NOOP if duplicated context is used: https://github.com/quarkusio/quarkus/blob/main/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/runtime/VertxCurrentContextFactory.java#L99

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that finally is run when Quarkus REST is processing because routingContext.next() is run before.

Ok, so we don't want to deactivate the context either. Small note - if Vertx is used then requestContext.deactivate() is a NOOP if duplicated context is used: https://github.com/quarkusio/quarkus/blob/main/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/runtime/VertxCurrentContextFactory.java#L99

I didn't know, thanks. I thought that the bug is about the endHandler where is destroy invoked, but during the debugging I saw my breakpoint on deactivate called after Quarkus REST started doing stuff, so I'd keep it here just in case you change how deactivate works in the future.

* @param owner typically a route handler that needs CDI request context active
*/
public static void assumeCdiRequestContext(RoutingContext ctx, String owner) {
ctx.put(CURRENT_CDI_REQUEST_CTX_OWNER, Objects.requireNonNull(owner));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we throw an exception if the owner is already set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we throw an exception if the owner is already set?

The idea is to recognize which route handler is handling request if and only if it is reusing the CDI request context. So what happens right now, for example also due to this PR #40408 is that we are reusing already active CDI request context (I think Martin solved some issue that way).

Copy link
Contributor

@mkouba mkouba Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry but I don't undertand how it's supposed to work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, at this point I'll only repeat myself. I am happy to explain it differently, but I don't know what part you don't understand. assumeCdiRequestContext method is called by Quarkus REST before Reactive Routes tries to destroy our state in the endHandler. Whether finally itself runs depends on what next route handlers do.

I'm sorry but I don't undertand how it's supposed to work.

I have added tests. Maybe you can run them and see what they solve?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assumeCdiRequestContext method is called by Quarkus REST before Reactive Routes tries to destroy our state in the endHandler

So Quarkus REST can intentionally replace the original owner (Reactive Routes)? If that's expected then it might make sense to add some debug logging in assumeCdiRequestContext()..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assumeCdiRequestContext method is called by Quarkus REST before Reactive Routes tries to destroy our state in the endHandler

So Quarkus REST can intentionally replace the original owner (Reactive Routes)?

Exactly. The way it works (sorry if I am repeating myself):

  1. Reactive Routes activate CDI request context for the callback with user code (e.g. @RouteFilter)
  2. user code calls routingContext.next()
  3. some other handlers run depending on the order until Quarkus REST handler runs, it detects active CDI request context and reuses it
  4. next depends on what is done:
  • the finally block in the Reactive Routes may run in some scenarios like when blocking code is scheduled and so on; but you already explained it is NOOP, funny I saw context not active in stacktrace for CDI interceptors where it failed; anyway, this fixed it
  • the endHandler may run before all code in Quarkus REST is finished, as shown by the user reproducer with socket timeout on the client side; we need Quarkus REST to take responsibility on when it is destroys CDI context state

If that's expected then it might make sense to add some debug logging in assumeCdiRequestContext()..

I'll add it, so that we can keep track of it. Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a debug logging to track the ownership.

@mkouba
Copy link
Contributor

mkouba commented Dec 1, 2025

CC @manovotn

@michalvavrik michalvavrik force-pushed the feature/fix-rest-cdi-req-ctx-active-after-timeout branch from 640e603 to a368091 Compare December 1, 2025 16:21
@quarkus-bot

This comment has been minimized.

@michalvavrik
Copy link
Member Author

I don't think that failures are related, I only added a debug log message and CIs fail over OOM in unmodified modules.

@geoand
Copy link
Contributor

geoand commented Dec 2, 2025

Very odd indeed... I restarted those jobs - let's see what happens

@quarkus-bot

This comment has been minimized.

@Ladicek
Copy link
Contributor

Ladicek commented Dec 2, 2025

I don't have the full context and I think I never really looked into how the CDI request context is activated/deactivated during HTTP request processing, but I would hope that we have exactly 1 Vert.x Web router handler with suitably low priority that would activate/deactivate the CDI request context and all other frameworks built on top of that would just not touch the CDI request context at all costs. I'm sure there's a reason for why it's done the way it's done, but is there a reason for why it's not done the way I described?

@michalvavrik
Copy link
Member Author

I don't have the full context and I think I never really looked into how the CDI request context is activated/deactivated during HTTP request processing, but I would hope that we have exactly 1 Vert.x Web router handler with suitably low priority that would activate/deactivate the CDI request context and all other frameworks built on top of that would just not touch the CDI request context at all costs.

CDI request context is only activated when needed. There is no such a handler ATM.

I'm sure there's a reason for why it's done the way it's done, but is there a reason for why it's not done the way I described?

I think that is a question for each stack owner. I remember few arguments mentioned in the past (like you may never need CDI request context, or that you may need your own like WS Next OnMessage callbacks), but I don't want to advocate either.

Unless we make general change as you describe, this PR is pretty uninvasive, it only does something if Reactive Routes activated CDI request context, so for 99 % use cases, it should be NOOP.

@michalvavrik
Copy link
Member Author

@geoand that CI error is strange and I don't think it is related, but I'll rebase on the current main.

@michalvavrik michalvavrik force-pushed the feature/fix-rest-cdi-req-ctx-active-after-timeout branch from a368091 to 2afbd09 Compare December 2, 2025 14:30
@mkouba
Copy link
Contributor

mkouba commented Dec 2, 2025

I don't have the full context and I think I never really looked into how the CDI request context is activated/deactivated during HTTP request processing, but I would hope that we have exactly 1 Vert.x Web router handler with suitably low priority that would activate/deactivate the CDI request context and all other frameworks built on top of that would just not touch the CDI request context at all costs.

CDI request context is only activated when needed. There is no such a handler ATM.

I'm sure there's a reason for why it's done the way it's done, but is there a reason for why it's not done the way I described?

I think that is a question for each stack owner. I remember few arguments mentioned in the past (like you may never need CDI request context, or that you may need your own like WS Next OnMessage callbacks), but I don't want to advocate either.

I believe that there are cases where you don't want to activate the CDI request context (think about static resources or high performance endpoints). But I also think that the main reason is historical development. There were times when Vertx duplicated context was not a thing and threadlocals were used instead, stacks used different thread pool strategies, etc. It was easier to manage the activation separately. Also keep in mind that it's not a common use case to combine the stacks 🤷.

@geoand
Copy link
Contributor

geoand commented Dec 2, 2025

Also keep in mind that it's not a common use case to combine the stacks

Sure, but it doesn't mean we can't make it work, at the end of day, all the HTTP stuff is backed by Vert.x (and one of the reasons was to be able to mix such kinds of things)

@mkouba
Copy link
Contributor

mkouba commented Dec 2, 2025

Also keep in mind that it's not a common use case to combine the stacks

Sure, but it doesn't mean we can't make it work, at the end of day, all the HTTP stuff is backed by Vert.x (and one of the reasons was to be able to mix such kinds of things)

I'm not saying we can't... I was talking about why we don't have a common activation handler.

* @param ctx RoutingContext
* @param newOwner typically a route handler that needs CDI request context active
*/
public static void assumeCdiRequestContext(RoutingContext ctx, String newOwner) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michalvavrik is there any guarantee that the last handler to claim the context as their own is always going to be the last to actually attempt to use it?

I am not familiar enough with the code to be able to tell so maybe I am missing something obvious.
However, my thinking was that I would capture a collection of all handlers when they attempt to activate it. Later on, when a handler attempts to deactivate context, remove the handler from the collection and only the last to be removed actually performs the cleanup.
Then again, I am probably just missing something, so take it with a pinch of salt please :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michalvavrik is there any guarantee that the last handler to claim the context as their own is always going to be the last to actually attempt to use it?

It is for the scenario I am trying to solve, but if we treat this method as SPI that theoretically anyone could reuse, then probably no. IMO there should be always only one route handler actively handling the incoming request, because if there were multiple handlers using the same request, we would be in for whole lot of issues like this one #49468.

So what I meant to solve is:

  • avoid code executed after user callback runs routingContext.next(), such as
    • the finally block running requestContext.deactivate()
    • the endHandler code destroying CDI context state

And if in the future (or in quarkiverse) someone does things like these, we cannot guarantee we will be the last ones.

However, my thinking was that I would capture a collection of all handlers when they attempt to activate it. Later on, when a handler attempts to deactivate context, remove the handler from the collection and only the last to be removed actually performs the cleanup.

I think that would be a solution covering all eventualities, but the only way to keep it safe (be sure it is cleaned) would be synchronization (or other locking) which could have performance impact if it was hot path? I think there is a reason why RoutingContext data are not kept in the synchronized map and not in volatile field, and that is that we don't expect things to happen concurrently. But I can be wrong, not an expert on this matter.

IMO current code in Quarkus REST doesn't expect that someone else is handling active CDI request context and I have seen the same thing in gRPC. I'd keep it simple (as is, or just rename method / adjust javadoc if you have some suggestion) @manovotn , but I can do what you want if you deem it important.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First off, I don't think this should be just a local discussion, all relevant people need to be involved.

I think we've established that there may be multiple users (as in, Quarkus components) that want to access the same request context. We've also established that we want the request context created on demand, not eagerly. (Thanks for those remarks! It also feels like déjà vu to me. And I'm sure it will happen again.)

I'm gonna claim a simple int (or AtomicInteger) stored on a shared location (the Vert.x context probably? What about non-Vert.x threads though?) would be enough. If someone wants to create the request context, two things may happen.

  1. If they don't see the int on the shared location, they are the first to activate the request context. They do so and set the value to 1.
  2. If they do see the int on the shared location, the request context already exist and they just increment the value.

Conversely, if someone wants to destroy the request context, two (or three, depending how thorough we are) things may happen:

  1. They see the int on the shared location, they decrement the value and the result is > 0. They don't do anything more.
  2. They see the int on the shared location, they decrement the value and the result is 0. They destroy the request context.
  3. They don't see the int on the shared location. The request context wasn't activated, so there's no point in destroying it.

I think the above should be enough to ensure that the request context can be activated on-demand and destroyed only after all possible users are done with it.

I think either this could be put into a utility class, probably in ArC, or perhaps this could be done in the ArC implementation of request context itself?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. They see the int on the shared location, they decrement the value and the result is > 0. They don't do anything more.

AFAIU this would not help with the use case @michalvavrik is trying to fix. This sentence from #51299 (comment) is important: "IMO current code in Quarkus REST doesn't expect that someone else is handling active CDI request context and I have seen the same thing in gRPC"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU this would not help with the use case @michalvavrik is trying to fix. This sentence from #51299 (comment) is important: "IMO current code in Quarkus REST doesn't expect that someone else is handling active CDI request context and I have seen the same thing in gRPC"

Yeah, but that assumption is wrong, otherwise the issue and this PR wouldn't exist. We clearly have multiple components assuming they own the request context, which is wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I am fine with this change and its local scope.

I know I'm not the one to decide here, but I'm personally not. This PR is a result of great investigation, but the solution presented is a band-aid applied on a wrong abstraction level.

If we actually want to support multiple "owners" of the same request context, I think something like I describe above is what we should do, and the more I think about it, the more I'm convinced the request context implementation itself should do it. (This is obviously performance sensitive...)

Now, there's a difference between deactivation and destruction and I didn't think through this distinction, so there would be more work to do. Still, it feels better to do it properly once, rather than doing it improperly (at least once, but possibly multiple times) and then having to do it properly and undoing the hacks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly, ATM I don't even have time to carefully read all your comments, but I got a gist that @Ladicek have in mind solid solution and @manovotn agree in principle.

I'd be fine to close this PR if you will handle the linked issue @Ladicek ? You can reuse that test, I wrote a lot of comments about re-running it, but it actually reproduced the issue pretty well from my workstation. I just don't know how stable reproducing will be from your laptop.

Anyway, if noone else handles the issue, I can go back to it by the end of this week, read all your comments and try to come up with something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just talked to @mkouba and @manovotn and we concluded that concurrency in Quarkus is pretty much a mess (in how DuplicatedContext is passed around and duplicated/triplicated/whatever, in how SmallRye Context Propagation works/is integrated, in how Vert.x and non-Vert.x thread are different). I certainly don't aim to fix that.

We also briefly discussed the difference between context deactivation and destruction and concluded that deactivation itself shouldn't "unregister" the owner, only destruction should. We didn't discuss what should constitute the "registration" of the request context owner, but it seems logical to me that it's the call to activate() with no initial state (that is, zero-param activate() or one-param activate() with null argument).

Looking at the implementation of CurrentManagedContext, it seems relatively straightforward to support multiple owners. It should only affect these places:

  • method ContextState activate(ContextState initialState)
  • method void destroy(ContextState state)
  • member class CurrentContextState (we need to track the number of current owners, but that should always be small, so it should fit into the state field -- it seems safe to me to assume that there will never be more than 32 owners, in fact I think we can assume there will never more than 16 owners to leave 1 bit reserved for future use)

I'll try to figure something out.

In the meantime, I'm certainly not blocking the merge of this PR, because both @mkouba and @manovotn are fine with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your comments and expertise everyone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't discuss what should constitute the "registration" of the request context owner, but it seems logical to me that it's the call to activate() with no initial state (that is, zero-param activate() or one-param activate() with null argument).

OK, so that is clearly wrong. (I knew there was a hurdle!) Activating a request context with no initial state is creating a new request context. That only happens once. Registration of an additional owner may only happen when activating with non-null initial state, and at that point, there's no way to know if that means a newly registered owner, or an already registered owner. So my nice and simple idea makes no sense :-)

@manovotn
Copy link
Contributor

manovotn commented Dec 2, 2025

CC @manovotn

I do not know how exactly do routing handlers juggle with CDI contexts but what Michal proposes here seems sensible to me particularly because of his earlier comment:

the endHandler may run before all code in Quarkus REST is finished, as shown by the user reproducer with socket timeout on the client side; we need Quarkus REST to take responsibility on when it is destroys CDI context state

The only thing I added as a comment is related my limited knowledge of whether this solution is "generic enough".

I don't have the full context and I think I never really looked into how the CDI request context is activated/deactivated during HTTP request processing, but I would hope that we have exactly 1 Vert.x Web router handler with suitably low priority that would activate/deactivate the CDI request context and all other frameworks built on top of that would just not touch the CDI request context at all costs. I'm sure there's a reason for why it's done the way it's done, but is there a reason for why it's not done the way I described?

This gives me a bit of a déjà vu feeling, I think we discussed this at some point although I only recall there were some cases where that would be undesirable 🤷
That was a long time ago and much has changed since though (such as using Vert.x duplicated context for CP).

@michalvavrik
Copy link
Member Author

@geoand CI failures in #51314 looks same as the failures we saw in this one.

@geoand
Copy link
Contributor

geoand commented Dec 2, 2025

Agreed, I'll have a closer look tomorrow

@quarkus-bot

This comment has been minimized.

@geoand
Copy link
Contributor

geoand commented Dec 3, 2025

The OpenAPI failure is something that has been happening on CI the last few days, it has nothing to do with this PR

@michalvavrik
Copy link
Member Author

I think this PR is backportable while if I think most of the suggestions, if implemented, should only stay in the main. Considering 3.27 will be around for 5 years, I think this (in some form) should be initial fix and improvements can come later.

@geoand
Copy link
Contributor

geoand commented Dec 5, 2025

Unless anyone objects, I will merge this soon

@quarkus-bot
Copy link

quarkus-bot bot commented Dec 5, 2025

Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit 2afbd09.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

@Ladicek
Copy link
Contributor

Ladicek commented Dec 5, 2025

Even after the 2 or 3 days I spent looking into this, I don't have a satisfactory solution, so I'm retracting any possible objections I might have raised before :-)

I experimented with a single-owner policy and multiple-owners policy, but my implementations are not suitable to implementing what this PR does, which is essentially a handover of the CDI request context ownership from Reactive Routes to RESTEasy Reactive. In order for them to be suitable, I would have to:

  • either introduce the notion of an identity of an owner, which would entail:
    • either more overhead in a performance sensitive piece of code,
    • or at least shared constants (whose number would be severely limited),
  • or implement a single-purpose solution, which would hardly be any better than this PR and would pollute the API.

Too bad :-/

@geoand
Copy link
Contributor

geoand commented Dec 5, 2025

That's great information to have for posterity

@geoand geoand merged commit f96127f into quarkusio:main Dec 5, 2025
124 of 125 checks passed
@quarkus-bot quarkus-bot bot added this to the 3.31 - main milestone Dec 5, 2025
@FroMage
Copy link
Member

FroMage commented Dec 5, 2025

If I had a euro for every time we had a context issue…

@michalvavrik michalvavrik deleted the feature/fix-rest-cdi-req-ctx-active-after-timeout branch December 5, 2025 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ContextNotActiveException in the middle of a request handler

6 participants