
Improve GroupIdGenerators performance for high cardinality group by keys #14685

Open
Tracked by #11924
shauryachats opened this issue Dec 19, 2024 · 9 comments

@shauryachats
Contributor
While running some high-volume multi-stage engine queries on Pinot where the join key was high cardinality, we recently observed a disproportionate latency increase when data was increased across both sides of the joins for the following query shape:

SELECT
count(*)
FROM
  table_A
WHERE (
    user_uuid IN (
      SELECT
        user_uuid
      FROM
       table_B
    )
  )   
 AND (
    user_uuid NOT IN (
      SELECT
        user_uuid
      FROM
        table_B
    )
  )
LIMIT
  100 option(useMultistageEngine=true, timeoutMs=120000, useColocatedJoin = true, maxRowsInJoin = 40000000)

Profiling conducted on a server produced the following profile:
[Screenshot: server profile, 2024-12-18]

It turns out that the major contributor to the latency increase is inefficient groupId generation in org/apache/pinot/query/runtime/operator/MultistageGroupByExecutor.generateGroupByKeys, for a few reasons:

  • Object2IntOpenHashMap uses open addressing for collision resolution, which performs poorly for high-cardinality use cases.
  • The default initial size of 16 and default load factor of 0.75 cause a large number of resizes and rehashes of existing keys as the map grows, which contributes a major share of the overall query latency for high-cardinality use cases (see the sketch below).
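For illustration, here is a minimal sketch (not Pinot code) of the difference between the default and a pre-sized fastutil map when assigning dense group ids; the key type and the 20M cardinality are placeholders:

import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

public class GroupIdSizingSketch {
  public static void main(String[] args) {
    int expectedGroups = 20_000_000; // hypothetical per-server group cardinality

    // Default constructor: 16 slots, load factor 0.75. Growing to 20M entries
    // triggers roughly 20 table doublings, each rehashing all existing keys.
    Object2IntOpenHashMap<String> unreserved = new Object2IntOpenHashMap<>();

    // Pre-sized constructor: the table is allocated large enough for
    // expectedGroups entries up front, so inserts never trigger a rehash.
    Object2IntOpenHashMap<String> reserved = new Object2IntOpenHashMap<>(expectedGroups);

    assignGroupIds(unreserved, expectedGroups);
    assignGroupIds(reserved, expectedGroups);
  }

  // Assign a dense, monotonically increasing group id to each distinct key.
  private static void assignGroupIds(Object2IntOpenHashMap<String> groupIdMap, int numKeys) {
    groupIdMap.defaultReturnValue(-1);
    for (int i = 0; i < numKeys; i++) {
      String key = "user-" + i; // stand-in for the group-by key
      if (groupIdMap.getInt(key) == -1) {
        groupIdMap.put(key, groupIdMap.size());
      }
    }
  }
}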

We are considering a few strategies, such as a better hash-map choice (avoiding open addressing for high cardinality) and generating groupIds in batches, and will use benchmarks to select the strategy with the best RoI.

This optimization can help boost performance for both the Pinot v1 and v2 engines, since both engines rely on this logic. cc: @Jackie-Jiang

@Jackie-Jiang
Contributor

cc @bziobrowski who is also working on optimizing the multi-stage group-by engine

@shauryachats
Contributor Author

shauryachats commented Jan 28, 2025

After a deep dive into the current implementation, it turned out that the hash map was being resized many times due to the high cardinality of the group-by column. Benchmarking the latency of computeIfAbsent on reserved vs. unreserved hash maps confirmed this:

Benchmark                                                 (_cardinality)  Mode  Cnt      Score     Error  Units
BenchmarkObjectOpenHashMap.object2IntOpenHashMap                  500000  avgt   20     85.720 ±   9.901  ms/op
BenchmarkObjectOpenHashMap.object2IntOpenHashMap                 1000000  avgt   20    274.285 ±  32.380  ms/op
BenchmarkObjectOpenHashMap.object2IntOpenHashMap                 5000000  avgt   20   2011.841 ±  97.977  ms/op
BenchmarkObjectOpenHashMap.object2IntOpenHashMap                10000000  avgt   20   4918.238 ± 119.845  ms/op
BenchmarkObjectOpenHashMap.object2IntOpenHashMap                20000000  avgt   20  11265.309 ± 524.745  ms/op

BenchmarkObjectOpenHashMap.object2IntReservedOpenHashMap          500000  avgt   20     62.205 ±   6.396  ms/op
BenchmarkObjectOpenHashMap.object2IntReservedOpenHashMap         1000000  avgt   20    195.235 ±  30.416  ms/op
BenchmarkObjectOpenHashMap.object2IntReservedOpenHashMap         5000000  avgt   20   2007.643 ±  35.480  ms/op
BenchmarkObjectOpenHashMap.object2IntReservedOpenHashMap        10000000  avgt   20   4264.061 ± 171.140  ms/op
BenchmarkObjectOpenHashMap.object2IntReservedOpenHashMap        20000000  avgt   20   7771.687 ± 107.806  ms/op
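For reference, a minimal JMH-style sketch of this kind of comparison is below. It is illustrative only: the class, method, and field names mirror the output above, but the exact benchmark source is not reproduced in this thread.

import java.util.concurrent.TimeUnit;
import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class BenchmarkObjectOpenHashMap {
  @Param({"500000", "1000000", "5000000", "10000000", "20000000"})
  public int _cardinality;

  private String[] _keys;

  @Setup(Level.Trial)
  public void setUp() {
    _keys = new String[_cardinality];
    for (int i = 0; i < _cardinality; i++) {
      _keys[i] = "key-" + i;
    }
  }

  // Starts from the fastutil default size (16) and rehashes repeatedly as it grows.
  @Benchmark
  public int object2IntOpenHashMap() {
    return insertAll(new Object2IntOpenHashMap<>());
  }

  // Reserves the full cardinality up front, so no rehash happens during inserts.
  @Benchmark
  public int object2IntReservedOpenHashMap() {
    return insertAll(new Object2IntOpenHashMap<>(_cardinality));
  }

  private int insertAll(Object2IntOpenHashMap<String> map) {
    map.defaultReturnValue(-1);
    for (String key : _keys) {
      if (map.getInt(key) == -1) {
        map.put(key, map.size());
      }
    }
    return map.size();
  }
}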

We can observe a clear ~30% improvement of the reserved over the unreserved open hash map at ~20M distinct groups per server, a common occurrence in Uber use cases. A simple fix inspired by this was to reserve numGroupsLimit in the GroupIdGenerator; we tested it out, and the results were impressive.

[Image: production metrics before and after the fix was deployed]

The two markers indicate the start and the end of the deploy of the fix.

The aforementioned use case is low-QPS, so no OOMs were observed from reserving the larger map, but the possibility exists in high-QPS scenarios. To eliminate it, we suggest introducing a new configuration flag, pinot.server.query.executor.group.generator.reserveMap, which can be toggled to enable/disable reserving the GroupIdGenerator hash map.
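A rough sketch of how the proposed flag could be wired (hypothetical helper; the config key above is a proposal, not an existing Pinot setting):

import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

final class GroupIdMapFactory {
  // reserveMapEnabled would come from the proposed
  // pinot.server.query.executor.group.generator.reserveMap server config.
  static <K> Object2IntOpenHashMap<K> newGroupIdMap(boolean reserveMapEnabled, int numGroupsLimit) {
    if (reserveMapEnabled) {
      // Pre-size to numGroupsLimit: no rehash until the limit is reached.
      return new Object2IntOpenHashMap<>(numGroupsLimit);
    }
    // fastutil default: 16 slots, 0.75 load factor.
    return new Object2IntOpenHashMap<>();
  }
}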

cc: @Jackie-Jiang @ankitsultana

@Jackie-Jiang
Contributor

What is the difference between this and maxInitialResultHolderCapacity? Can you try out that one?

@shauryachats
Contributor Author

shauryachats commented Jan 30, 2025

numGroupsLimit defines the maximum number of distinct groups that can be present in the group-by at the server level, whereas maxInitialResultHolderCapacity sets the initial capacity of the mergeResultHolder, which stores the results after the group-by has happened.

It makes more sense for the new config to reserve numGroupsLimit rather than maxInitialResultHolderCapacity to eliminate redundant resizes. We expect the proposed config (pinot.server.query.executor.group.generator.reserveMap) to be enabled for cases where the group-by key has high cardinality, and maxInitialResultHolderCapacity is typically less than numGroupsLimit, which would leave fewer but still non-zero resizes that contribute significantly to the latency.

Ideally maxInitialResultHolderCapacity should never exceed numGroupsLimit, since the result holder cannot store more than the maximum number of groups specified by that config; hence we should also consider adding a check in Pinot for that (see the sketch below).
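For what it's worth, the check could be as small as this (illustrative only, not existing Pinot code):

final class GroupByCapacitySanity {
  // Cap the configured initial result holder capacity at numGroupsLimit, since the holder
  // never needs room for more groups than the limit allows.
  static int effectiveInitialCapacity(int maxInitialResultHolderCapacity, int numGroupsLimit) {
    return Math.min(maxInitialResultHolderCapacity, numGroupsLimit);
  }
}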

@shauryachats
Contributor Author

Alternatively, instead of introducing pinot.server.query.executor.group.generator.reserveMap to set the GroupIdGenerator initial size to numGroupsLimit, we can add a separate, multi-stage-engine-only configuration to reserve the minimum size of the GroupIdGenerator, namely:

pinot.server.query.executor.multistage.group.id.generator.initial.size

with a default value of, say, 1000 so that it does not impact current multi-stage engine use cases but can still be tuned up for high-cardinality ones.
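For example, a high-cardinality deployment could then override the default in the server config (hypothetical key from this proposal, not an existing Pinot setting):

pinot.server.query.executor.multistage.group.id.generator.initial.size=5000000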

@Jackie-Jiang do let me know if this approach suits you more.

@Jackie-Jiang
Contributor

I wouldn't add too many configs for the same purpose because there is no way to make users understand them. I agree we should cap initialResultHolderCapacity with numGroupsLimit if we are not already doing so, and we might already cap it since we configure its max value.
What concerns me about setting the result holder size directly to the groups limit is that, if a user overrides numGroupsLimit to a very large value (a lot of users simply override it to Integer.MAX_VALUE), the server will run straight out of memory.

@shauryachats
Contributor Author

> I wouldn't add too many configs for the same purpose because there is no way to make users understand them.

I agree with the sentiment, but in this case I feel a new config should be introduced, for the reasons I expand on below.

> I agree we should cap initialResultHolderCapacity with numGroupsLimit if we are not already doing so, and we might already cap it since we configure its max value.

I agree, and we can do it in a separate PR since it is outside the scope of this current change.

> What concerns me about setting the result holder size directly to the groups limit is that, if a user overrides numGroupsLimit to a very large value (a lot of users simply override it to Integer.MAX_VALUE), the server will run straight out of memory.

This is precisely why we need a separate config to enable reserving the GroupIdGenerator hash map: the user can choose the tradeoff between latency and memory. For low-QPS, high-cardinality workloads such as ours, enabling it gives a significant improvement, whereas for high-QPS cases users can leave it off.

Another alternative I suggested above is a separate config solely responsible for setting the initial size of the GroupIdGenerator. That decouples it from numGroupsLimit, so users who set numGroupsLimit to Integer.MAX_VALUE can still set a sensible GroupIdGenerator initial size that reflects their expected cardinality.

Until we have some form of column statistics that can estimate the cardinality of the group-by columns at runtime, we will have to rely on this config manually.

@Jackie-Jiang
Contributor

I know where the confusion is coming from. The initial result holder capacity should just match the initial group map capacity because the map is simply generating the index in the result holder. Do you think we can just reuse the same config for both of them?
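To make that relationship concrete, here is a minimal sketch (not Pinot source; names are placeholders) of how the group id map and the result holder move in lockstep, which is why one capacity config could serve both:

import java.util.Arrays;
import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

final class GroupByIndexSketch {
  private final Object2IntOpenHashMap<Object> _groupIdMap;
  private long[] _countResultHolder;  // stand-in for a GroupByResultHolder

  GroupByIndexSketch(int initialCapacity) {
    // The same capacity sizes both structures, because every id handed out by the map
    // is an index into the result holder.
    _groupIdMap = new Object2IntOpenHashMap<>(initialCapacity);
    _groupIdMap.defaultReturnValue(-1);
    _countResultHolder = new long[initialCapacity];
  }

  void accumulate(Object groupKey) {
    int groupId = _groupIdMap.getInt(groupKey);
    if (groupId == -1) {
      groupId = _groupIdMap.size();
      _groupIdMap.put(groupKey, groupId);
      if (groupId >= _countResultHolder.length) {
        // Both structures grow together; pre-sizing them from the same config avoids both resizes.
        _countResultHolder = Arrays.copyOf(_countResultHolder,
            Math.max(2 * _countResultHolder.length, groupId + 1));
      }
    }
    _countResultHolder[groupId]++;  // e.g. COUNT(*) per group
  }
}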

@ankitsultana
Contributor

@Jackie-Jiang: adding some color here. I agree completely with not having too many configs, but in this case the workload profile is very different for the V1 Engine max initial capacity and the V2 Engine AggregateOperator max initial capacity.

The V1 Engine runs the GroupByOperator for each matching segment, so it may run the operator thousands of times for a single query, whereas the AggregateOperator in the MSE runs (number of agg operators * stageParallelism) times, which is orders of magnitude fewer.

Since each V1 Engine operator instance deals with much less data, a low initial result holder capacity makes sense for it.

The V2 Engine AggregateOperator, however, aggregates over all the groups returned by the leaf operator, which may number in the hundreds of millions per server, so hash-map sizing becomes crucial.

Hence our proposal is to have a separate config for maxInitialResultHolderCapacity for the V2 Engine Group By operators. This means that we should use the new config for sizing both the GroupByResultHolder and the GroupIdGenerator in the V2 Engine AggregateOperator.
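A sketch of the intent (the config key name below is a placeholder for whatever the new V2 setting ends up being called, not an existing Pinot config):

final class MseGroupBySizingSketch {
  // Placeholder key for the proposed V2-only setting; the existing V1 key stays
  // untouched for per-segment group-by.
  static final String MSE_MAX_INITIAL_RESULT_HOLDER_CAPACITY_KEY =
      "pinot.server.query.executor.multistage.max.init.result.holder.capacity";

  // The same value would size both the GroupByResultHolder and the GroupIdGenerator map
  // in the V2 AggregateOperator, capped by numGroupsLimit so a large override cannot
  // over-allocate.
  static int initialCapacity(int configuredMseCapacity, int numGroupsLimit) {
    return Math.min(configuredMseCapacity, numGroupsLimit);
  }
}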

