Enable query cancellation for MSQE + cancel using client-provided id #14823

albertobastos · 2025-01-15T23:06:48Z

Enables query cancellation for MSQE queries (not supported until now).
Lets the client set a clientQueryId query option, which can then be used with the clientQuery/{clientQueryId} endpoint.
Adds a sleep(ms) function, for now recommended only for testing purposes.

Some refactoring was needed to reuse as much cancellation logic as possible between SSQE and MSQE.


siddharthteotia · 2025-01-16T06:49:07Z

...rc/main/java/org/apache/pinot/broker/requesthandler/BaseSingleStageBrokerRequestHandler.java

+    }
+    String clientQueryId = extractClientQueryId(sqlNodeAndOptions);
+    if (StringUtils.isBlank(clientQueryId)) {
+      return null;


(nit) in general we don't recommend returning NULL as a coding practice
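
One common way to address this kind of nit is to return an Optional instead of null. The sketch below is only illustrative: `ClientQueryIds.extract` and the plain `Map<String, String>` of query options are hypothetical stand-ins, not the `extractClientQueryId(sqlNodeAndOptions)` helper from the PR, and the option key is assumed to be `clientQueryId` as named in the description.

```java
import java.util.Map;
import java.util.Optional;

final class ClientQueryIds {
  private ClientQueryIds() {
  }

  // Hypothetical helper: returns the client-provided id, or Optional.empty()
  // when the option is missing or blank, so callers never see null.
  static Optional<String> extract(Map<String, String> queryOptions) {
    return Optional.ofNullable(queryOptions.get("clientQueryId"))
        .map(String::trim)
        .filter(id -> !id.isEmpty());
  }
}
```

A caller could then write `ClientQueryIds.extract(options).ifPresent(...)` rather than checking for null.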

siddharthteotia · 2025-01-16T06:52:17Z

Is ClientQueryID a new concept? Is it same as requestID ?

How does the support added here improve the existing Query Cancellation (which is also exposed to user IIRC) ?

codecov-commenter · 2025-01-16T07:49:47Z

Codecov Report

Attention: Patch coverage is 27.75120% with 151 lines in your changes missing coverage. Please review.

Project coverage is 63.68%. Comparing base (59551e4) to head (a0e1e83).
Report is 1659 commits behind head on master.

Files with missing lines	Patch %	Lines
...oller/api/resources/PinotRunningQueryResource.java	0.00%	76 Missing ⚠️
.../pinot/query/service/dispatch/QueryDispatcher.java	40.62%	13 Missing and 6 partials ⚠️
...roker/requesthandler/BaseBrokerRequestHandler.java	52.94%	12 Missing and 4 partials ⚠️
...pinot/broker/api/resources/PinotClientRequest.java	0.00%	11 Missing ⚠️
...r/requesthandler/BrokerRequestHandlerDelegate.java	0.00%	9 Missing ⚠️
...sthandler/BaseSingleStageBrokerRequestHandler.java	70.00%	3 Missing and 3 partials ⚠️
...requesthandler/MultiStageBrokerRequestHandler.java	50.00%	5 Missing ⚠️
...common/response/broker/BrokerResponseNativeV2.java	0.00%	3 Missing ⚠️
...roker/requesthandler/TimeSeriesRequestHandler.java	0.00%	2 Missing ⚠️
...inot/common/function/scalar/DateTimeFunctions.java	60.00%	1 Missing and 1 partial ⚠️
... and 2 more

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14823      +/-   ##
============================================
+ Coverage     61.75%   63.68%   +1.93%     
- Complexity      207     1483    +1276     
============================================
  Files          2436     2712     +276     
  Lines        133233   152210   +18977     
  Branches      20636    23518    +2882     
============================================
+ Hits          82274    96936   +14662     
- Misses        44911    47977    +3066     
- Partials       6048     7297    +1249

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.63% <27.75%> (+1.92%)`	⬆️
java-21	`63.57% <27.75%> (+1.95%)`	⬆️
skip-bytebuffers-false	`63.67% <27.75%> (+1.93%)`	⬆️
skip-bytebuffers-true	`63.53% <27.75%> (+35.80%)`	⬆️
temurin	`63.68% <27.75%> (+1.93%)`	⬆️
unittests	`63.68% <27.75%> (+1.93%)`	⬆️
unittests1	`56.23% <44.68%> (+9.34%)`	⬆️
unittests2	`34.02% <19.61%> (+6.29%)`	⬆️


albertobastos · 2025-01-16T18:46:24Z

Is ClientQueryID a new concept? Is it same as requestID ?

How does the support added here improve the existing Query Cancellation (which is also exposed to user IIRC) ?

Hi Siddharth,

AFAIK, the current cancellation feature depends on the internal requestId generated by the broker itself. That request id is not returned until the query completes, so an external user first has to ask for the currently running queries, find in the returned array the requestId assigned to the query they are interested in (by comparing the query body), and finally call the cancel operation to abort it. That's two round trips between the user and the cluster.

With a client-provided requestId they can skip one step, going straight to the cancel operation and using their own id to abort the query.

albertobastos · 2025-01-21T08:54:52Z

As some extra context, the end goal of this is to enable a "Cancel" button in the UI that the customer can use to abort an ongoing query. With a query id provided by the customer or by the UI itself, that can be done without retrieving any internal id.
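
The single-step flow described above can be sketched as follows. This is a minimal illustration, not code from the PR: `ClientCancelSketch`, its method names and the base URL are hypothetical, the option key `clientQueryId` and the `clientQuery/{clientQueryId}` path are taken from this discussion, and the id is assumed to be URL-safe (for example a UUID).

```java
import java.net.URI;

final class ClientCancelSketch {
  private ClientCancelSketch() {
  }

  // Prefixes the query with a SET statement carrying the client-chosen id.
  static String withClientQueryId(String sql, String clientQueryId) {
    return "SET clientQueryId='" + clientQueryId + "'; " + sql;
  }

  // Builds the URI a client would send a DELETE to in order to cancel the query.
  // baseUrl is a hypothetical address; no request is actually sent here.
  static URI cancelUri(String baseUrl, String clientQueryId) {
    return URI.create(baseUrl + "/clientQuery/" + clientQueryId);
  }
}
```

Because the client picks the id before submitting the query, it can build the cancel URI without first listing the running queries.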


gortiz

I'm adding several comments but I wasn't able to read the whole PR.

Although I'm asking for changes, it is a good PR overall. We just need to finish the last mile.

gortiz · 2025-01-30T07:30:04Z

pinot-common/src/main/java/org/apache/pinot/common/function/scalar/DateTimeFunctions.java

+    } catch (InterruptedException e) {
+      //TODO: handle interruption
+      //Thread.currentThread().interrupt();
+    }


we need to fix this TODO before merging

Any suggestion on how we should deal with an interruption here? Just warn it and skip the sleep or propagate the error?
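
One conventional answer, sketched here under the assumption that the function should not propagate a checked exception, is to stop sleeping and restore the thread's interrupt flag so that callers can still observe the interruption. `InterruptibleSleep` and `sleepQuietly` are hypothetical names, not part of the PR.

```java
final class InterruptibleSleep {
  private InterruptibleSleep() {
  }

  // Sleeps for up to millis. Returns true if the full sleep completed and
  // false if the thread was interrupted while sleeping.
  static boolean sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
      return true;
    } catch (InterruptedException e) {
      // Thread.sleep clears the interrupt flag when it throws; set it again
      // so code further up the stack can react to the interruption.
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

Whether to also log a warning or to propagate an error instead remains a design choice for the PR; this sketch only shows the flag-restoring variant.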

gortiz · 2025-01-30T07:31:22Z

pinot-broker/src/main/java/org/apache/pinot/broker/api/resources/PinotClientRequest.java

+  public String cancelClientQuery(
+      @ApiParam(value = "ClientQueryId given by the client", required = true)
+      @PathParam("clientQueryId") String clientQueryId,
+      @ApiParam(value = "Timeout for servers to respond the cancel request") @QueryParam("timeoutMs")
+      @DefaultValue("3000") int timeoutMs,
+      @ApiParam(value = "Return server responses for troubleshooting") @QueryParam("verbose") @DefaultValue("false")
+      boolean verbose) {


Could we use the same endpoint we already have?

We could expand the already existing endpoint by adding a @QueryParam to determine if the provided id is either internal or client-based, being internal as default.

The only drawback here is that internal ids are long whereas client ids are string, so type validation can no longer be done by Jersey and has to be done by the method itself.

47fb9bb786

The Controller scenario is different, though. There the existing endpoint is DELETE /query/{brokerId}/{queryId}, but for client-id-based cancellations we do not want to have to know the exact broker the query landed on, so we need an endpoint such as DELETE /clientQuery/{clientQueryId}. I can't see how to unify these two.
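
The manual type validation mentioned for the unified broker endpoint could look roughly like the sketch below. It is plain Java with no Jersey types; `CancelIdValidator`, `validate` and the `isClientId` flag are hypothetical names used only for illustration.

```java
final class CancelIdValidator {
  private CancelIdValidator() {
  }

  // Validates the raw path value. Internal ids must parse as a long;
  // client ids only need to be non-blank. Returns the trimmed id.
  static String validate(String rawId, boolean isClientId) {
    if (rawId == null || rawId.trim().isEmpty()) {
      throw new IllegalArgumentException("Query id must not be blank");
    }
    String id = rawId.trim();
    if (!isClientId) {
      try {
        Long.parseLong(id);
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Internal query id must be numeric: " + id, e);
      }
    }
    return id;
  }
}
```

An endpoint method would typically translate the `IllegalArgumentException` into a client error response; that mapping is left out here.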

gortiz · 2025-01-30T07:39:23Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

+    boolean enableQueryCancellation =
+        Boolean.parseBoolean(config.getProperty(CommonConstants.Broker.CONFIG_OF_BROKER_ENABLE_QUERY_CANCELLATION));
+    if (enableQueryCancellation) {
+      _queriesById = new ConcurrentHashMap<>();
+      _clientQueryIds = new ConcurrentHashMap<>();
+    } else {
+      _queriesById = null;
+      _clientQueryIds = null;
+    }


It is not something we introduced in this PR, but something I think we need to take care of in the future:

We use BaseBrokerRequestHandler as the root/common state for the broker, probably for historical reasons. But that assumption does not hold: a single broker may have SSE, MSE, GRPC and even TSE queries running at the same time. It would be a better design to have a shared state between them instead of the trick we do with the delegate.

This is something we need to improve in the future

I agree with the shared state refactor, we should write it down somewhere so we actually do it ;-)

gortiz · 2025-01-30T07:48:12Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

@@ -179,6 +198,9 @@ protected abstract BrokerResponse handleRequest(long requestId, String query, Sq
      @Nullable HttpHeaders httpHeaders, AccessControl accessControl)
      throws Exception;

+  protected abstract boolean handleCancel(long queryId, int timeoutMs, Executor executor,


We need javadoc here to explain how it should work. At least we should say that queryId may be a client or pinot generated id.

Actually the queryId received here always refers to a broker-generated internal id. The clientQueryId -> brokerQueryId translation is done by BaseBrokerRequestHandler.cancelQueryByClientId.

Added some minimal javadoc here: 52998d3

I tried to mimic the current code design for handleRequest, but the existence of two handleRequest methods here is a bit confusing:

A public method declared by the interface and called from the endpoint layer, which receives a SqlNodeAndOptions parameter.

A protected method called from the previous one, which already receives a requestId and the query string itself.

To add to the confusion, neither of the two methods has a javadoc.

This could probably be designed better if we move forward with the proposed shared state design.

gortiz · 2025-01-30T07:50:39Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

+      LOGGER.warn("Query cancellation cannot be performed due to unknown client query id: {}", clientQueryId);
+      return false;


Shouldn't we throw here to notify the caller that the query id is incorrect?

I just tried to be consistent with the behavior when we cancel using a broker-generated id: it returns true or false depending on whether the query exists or not. If we do not find a match for the given clientQueryId, isn't that the same as saying that the query does not exist?

PinotClientRequest is the one that receives the false value and decides to raise a WebApplicationException wrapping an HTTP 500 response.
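
The clientQueryId to broker id translation and the "return false when unknown" behavior discussed in these threads can be sketched as below. This is a simplified stand-in, not the PR's implementation: `QueryIdRegistry` and its methods are hypothetical, and real cancellation (notifying servers, timeouts) is reduced to removing the entry from a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class QueryIdRegistry {
  // Broker-generated request id -> query text, for queries currently running.
  private final Map<Long, String> queriesById = new ConcurrentHashMap<>();
  // Client-provided id -> broker-generated request id.
  private final Map<String, Long> requestIdByClientId = new ConcurrentHashMap<>();

  void onQueryStart(long requestId, String query, String clientQueryId) {
    queriesById.put(requestId, query);
    if (clientQueryId != null && !clientQueryId.isEmpty()) {
      requestIdByClientId.put(clientQueryId, requestId);
    }
  }

  // Returns true if a running query with this internal id was found and dropped.
  boolean cancelQuery(long requestId) {
    boolean found = queriesById.remove(requestId) != null;
    // Drop any client-id mapping that points at this request id.
    requestIdByClientId.values().remove(Long.valueOf(requestId));
    return found;
  }

  // Translates the client id to the internal id, then reuses cancelQuery.
  // An unknown client id is treated the same as a query that does not exist.
  boolean cancelQueryByClientId(String clientQueryId) {
    Long requestId = requestIdByClientId.get(clientQueryId);
    if (requestId == null) {
      return false;
    }
    return cancelQuery(requestId);
  }
}
```

With this shape, the endpoint layer decides how to report a false result, which matches the division of responsibility described above.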

gortiz · 2025-01-30T08:07:16Z

...rc/main/java/org/apache/pinot/broker/requesthandler/BaseSingleStageBrokerRequestHandler.java

@@ -810,14 +813,17 @@ protected BrokerResponse handleRequest(long requestId, String query, SqlNodeAndO
        //       can always list the running queries and cancel query again until it ends. Just that such race
        //       condition makes cancel API less reliable. This should be rare as it assumes sending queries out to
        //       servers takes time, but will address later if needed.
-        _queriesById.put(requestId, new QueryServers(query, offlineRoutingTable, realtimeRoutingTable));
-        LOGGER.debug("Keep track of running query: {}", requestId);
+        String clientRequestId = maybeSaveQuery(requestId, sqlNodeAndOptions, query);


Another reason to split this method is that we may want to call brokerResponse.setClientRequestId even if query cancellation is disabled.

Already convinced, no need for more reasons :-)

gortiz · 2025-01-30T08:12:05Z

pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java

+  public Map<Long, String> getRunningQueries() {
+    Preconditions.checkState(isQueryCancellationEnabled(), "Query cancellation is not enabled on broker");
+    return new HashMap<>(_queriesById);
+  }


I think it is safer and better practice to return an immutable view of the map instead. In the delegate we can create a copy we can modify.

Yeah, makes total sense c60b953
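
The difference between the two approaches can be sketched with standard collections. `RunningQueries` and its method names are hypothetical; only `Collections.unmodifiableMap`, `HashMap` and `ConcurrentHashMap` are real JDK APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RunningQueries {
  private final Map<Long, String> queriesById = new ConcurrentHashMap<>();

  void track(long requestId, String query) {
    queriesById.put(requestId, query);
  }

  // Read-only live view: callers cannot modify it, but it reflects later updates.
  Map<Long, String> view() {
    return Collections.unmodifiableMap(queriesById);
  }

  // A caller that needs to modify the result (for example a delegate merging
  // several handlers) makes its own copy of the view.
  static Map<Long, String> mutableCopy(Map<Long, String> view) {
    return new HashMap<>(view);
  }
}
```

The view avoids copying on every call, and the copy is made only where mutation is actually needed.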

gortiz · 2025-01-30T08:30:36Z

pinot-common/src/main/java/org/apache/pinot/common/function/scalar/DateTimeFunctions.java

@@ -547,6 +547,17 @@ public static long now() {
    return System.currentTimeMillis();
  }

+  @ScalarFunction
+  public static long sleep(long millis) {


I don't like the sleep trick. Having a function that waits some milliseconds, or until some epoch, per row is great for tests. But since we didn't find a way to make evaluation lazy (sleep with constant arguments is executed at the optimization phase), we had to call it as sleep(col + constant). That is the trick I don't like.

Instead, I suggest adding an option or something similar that can be understood by different parts of the code, so we can sleep wherever we want (in the broker, in the leaf operator or in SSE).

The sleep function can also be used to mount attacks (see https://www.sqlinjection.net/time-based/). I understand the sleep function is not the topic of this PR but just a utility to test it, so I don't think it is right to block the PR until we have a perfect sleep function. Therefore my suggestion is to at least change the implementation so this function only works if tests are enabled. We can do that by using this horrible Java trick:

boolean assertEnabled = false; assert assertEnabled = true; if (assertEnabled) { Thread.sleep(millis); }

I agree with that minimal compromise so we can move the PR forward.

6042ad2

Ugly but funny trick, didn't think about it.

Personally I think that if we manage to force the lazy evaluation, that would be enough. Without the column+constant hack, it is less painful to the eyes.
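
The assertion trick proposed in this thread can be written out as a small self-contained sketch. `TestOnlySleep` is a hypothetical name, and returning the requested duration (or 0 when skipped) is an assumption made for illustration; the actual return value of the PR's sleep function is not shown in this conversation.

```java
final class TestOnlySleep {
  private TestOnlySleep() {
  }

  // Detects whether the JVM runs with assertions enabled (-ea), which is
  // typically the case under test runners and not in production.
  static boolean assertionsEnabled() {
    boolean enabled = false;
    // The assignment inside the assert only executes when assertions are on.
    assert enabled = true;
    return enabled;
  }

  // Sleeps only when assertions are enabled. Returns the requested duration
  // if the sleep completed, or 0 if it was skipped or interrupted.
  static long sleep(long millis) {
    if (!assertionsEnabled()) {
      return 0L;
    }
    try {
      Thread.sleep(millis);
      return millis;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return 0L;
    }
  }
}
```

Run with `java -ea` the call actually sleeps; without the flag it returns immediately, which is what makes the function harmless outside tests.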

gortiz · 2025-01-30T08:34:41Z

pinot-core/src/test/java/org/apache/pinot/core/data/function/DateTimeFunctionsTest.java

+  public void testSleepFunction() {
+    long startTime = System.currentTimeMillis();
+    testFunction("sleep(500)", Collections.emptyList(), new GenericRow(), result -> {
+      assertTrue((long) result >= 500);
+    });
+    long endTime = System.currentTimeMillis();
+    assertTrue(endTime - startTime >= 500);
+  }


nit: we can reduce time to something in the order of tens of millis

50 ms, final offer

b458a5b

gortiz · 2025-01-30T08:38:00Z

...tion-tests/src/test/java/org/apache/pinot/integration/tests/CancelQueryIntegrationTests.java

+    String clientRequestId = UUID.randomUUID().toString();
+    // tricky query: use sleep with some column data to avoid Calcite from optimizing it on compile time
+    String sqlQuery =
+        "SET " + CommonConstants.Broker.Request.QueryOptionKey.CLIENT_QUERY_ID + "='" + clientRequestId + "'; "


nit: directly use the option name instead of the const to make it easier to read.

Ok, in case we change the const the test will break anyway.

9d0f335

albertobastos

Thanks for the review, @gortiz

Apart from my doubts about how to handle the sleep interruption (is it really necessary now that we only enable it during tests?) and some future tasks and refactors derived from the PR, I believe I have followed your advice on all your suggestions.


albertobastos added 2 commits January 15, 2025 15:05

add cancelClientQuery operation for SingleStageBroker (only numerical identifiers so far)

f7a9488

avoid synchronized BiMap and checkstyle

39a4f94

siddharthteotia reviewed Jan 16, 2025


Merge remote-tracking branch 'origin' into cancel-with-cqid

c969abd

albertobastos added 3 commits January 24, 2025 10:18

Merge branch 'master' into cancel-with-cqid

8162bc6

Merge branch 'master' of github.com:albertobastos/pinot into cancel-with-cqid

aa8c120

add cancel feature (with queryId and clientQueryId) to MSQE, some refactor for code reusal

7a5f713

albertobastos changed the title ~~add clientQueryId and its cancel operation~~ Enable query cancellation for MSQE + cancel using client-provided id Jan 27, 2025

albertobastos added 11 commits January 27, 2025 14:11

set and delete clientRequestId on MSQE

a9d1e49

fix unimplemented method

97e7b5d

fix I/O parameter and related tests

65f73a0

Merge branch 'master' of github.com:albertobastos/pinot into cancel-with-cqid

e3a9a5e

add clientRequestId on response test

fe5c846

Merge branch 'master' of github.com:albertobastos/pinot into cancel-with-cqid

ae3260c

add sleep and random functions for further tests

2eb506e

Merge branch 'master' of github.com:albertobastos/pinot into cancel-with-cqid

5dd5409

override test server conf

e9bbdac

add missing superclass call

9dcc393

add some cancel query test using internal sleep function with a trick

5ac7d9e

albertobastos marked this pull request as ready for review January 29, 2025 14:19

Merge branch 'master' of github.com:albertobastos/pinot into cancel-with-cqid

a46659c

gortiz reviewed Jan 30, 2025


albertobastos added 4 commits January 30, 2025 20:06

bring master

110fb16

reuse same broker endpoint for internal and client-based cancellation

47fb9bb

add javadoc

52998d3

add mapping comments

e2678af

albertobastos added 5 commits January 31, 2025 12:11

refactor base broker methods

c201cf4

return immutable view instead of copy

c60b953

enable sleep(ms) function only during testing

6042ad2

reduce unit test wait time

b458a5b

replace constant with literal on test

9d0f335

albertobastos commented Jan 31, 2025


albertobastos added 3 commits January 31, 2025 13:13

linter

1e00bf6

Merge branch 'master' of github.com:albertobastos/pinot into cancel-with-cqid

d3061ba

remove embarassing npe

a0e1e83
