Support distribution type hint to allow broadcast join #14797

Jackie-Jiang · 2025-01-11T20:54:38Z

Add support for customizing distribution type with join hint.

Allowed distribution types:

LOCAL
HASH
BROADCAST
RANDOM

Added 2 new join hints:

left_distribution_type
right_distribution_type

To achieve broadcast join without shuffling left side:

SELECT /*+ joinOptions(left_distribution_type = 'local', right_distribution_type = 'broadcast') */ a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1 WHERE a.col2 = 'foo' AND b.col2 = 'bar'
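For illustration, a minimal sketch of how the two hint values could be read into typed values. The class `DistributionHints`, the enum `Type`, and the `parse` helper are hypothetical names, not the PR's code. The sketch assumes the join options arrive as a string map and that values are matched case-insensitively (the query above uses lowercase values while the allowed types are listed in uppercase).

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not the PR's actual code.
final class DistributionHints {
  // The four distribution types listed in the PR description.
  enum Type { LOCAL, HASH, BROADCAST, RANDOM }

  static final String LEFT_KEY = "left_distribution_type";
  static final String RIGHT_KEY = "right_distribution_type";

  private DistributionHints() {
  }

  // Returns null when the hint is absent so the caller can fall back to its default.
  // Assumption: hint values are case-insensitive.
  static Type parse(Map<String, String> joinOptions, String key) {
    String value = joinOptions.get(key);
    if (value == null) {
      return null;
    }
    try {
      return Type.valueOf(value.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Unsupported " + key + ": " + value, e);
    }
  }
}
```

With the options from the query above, `parse(options, LEFT_KEY)` would yield `LOCAL` and `parse(options, RIGHT_KEY)` would yield `BROADCAST`; an unknown value fails fast instead of silently falling back.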

Related to #14518

codecov-commenter · 2025-01-11T21:32:09Z

Codecov Report

Attention: Patch coverage is 44.68085% with 52 lines in your changes missing coverage. Please review.

Project coverage is 63.72%. Comparing base (59551e4) to head (c65d1e8).
Report is 1592 commits behind head on master.

Files with missing lines	Patch %	Lines
...ite/rel/rules/PinotJoinExchangeNodeInsertRule.java	27.41%	39 Missing and 6 partials ⚠️
...pache/pinot/calcite/rel/hint/PinotHintOptions.java	71.42%	4 Missing and 2 partials ⚠️
.../org/apache/pinot/query/routing/WorkerManager.java	90.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14797      +/-   ##
============================================
+ Coverage     61.75%   63.72%   +1.97%     
- Complexity      207     1610    +1403     
============================================
  Files          2436     2708     +272     
  Lines        133233   151304   +18071     
  Branches      20636    23364    +2728     
============================================
+ Hits          82274    96422   +14148     
- Misses        44911    47634    +2723     
- Partials       6048     7248    +1200

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.71% <44.68%> (+2.00%)`	⬆️
java-21	`63.61% <44.68%> (+1.99%)`	⬆️
skip-bytebuffers-false	`63.72% <44.68%> (+1.97%)`	⬆️
skip-bytebuffers-true	`63.59% <44.68%> (+35.86%)`	⬆️
temurin	`63.72% <44.68%> (+1.97%)`	⬆️
unittests	`63.72% <44.68%> (+1.97%)`	⬆️
unittests1	`56.31% <44.68%> (+9.41%)`	⬆️
unittests2	`34.02% <0.00%> (+6.29%)`	⬆️

Flags with carried forward coverage won't be shown.

gortiz · 2025-01-14T17:39:27Z

I'll take a look, but I think we need to rethink the names of join strategies. A strategy should not be defined uniquely by what it does on one of the sides of the join. This case, for example, is clear: The broadcast strategy will be applied on the right-hand side, but what will happen with the left one? We already have a strategy where the right-hand side is broadcasted, but the left is randomly shuffled.

gortiz · 2025-01-14T17:40:12Z

cc @bziobrowski @yashmayya @albertobastos

gortiz · 2025-01-14T17:48:40Z

...lanner/src/main/java/org/apache/pinot/calcite/rel/rules/PinotJoinExchangeNodeInsertRule.java

@@ -54,10 +54,15 @@ public void onMatch(RelOptRuleCall call) {
    JoinInfo joinInfo = join.analyzeCondition();
    RelNode newLeft;
    RelNode newRight;
-    if (PinotHintOptions.JoinHintOptions.useLookupJoinStrategy(join)) {
+    String joinStrategyHint = PinotHintOptions.JoinHintOptions.getJoinStrategyHint(join);
+    if (PinotHintOptions.JoinHintOptions.useLookupJoinStrategy(joinStrategyHint)) {


nit: what about having a PinotHintOptions.JoinHintOptions.getJoinStrategy that returns an enum we can use in a switch here? We expect to add at least 3 more strategies (listed in #14518, including random + broadcast, which currently cannot be specified), so the switch syntax may be helpful.
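A rough sketch of what this suggestion could look like. Everything here is hypothetical: the `JoinStrategies` class, the `JoinStrategy` enum, and the fallback to `HASH` when no hint is given are assumptions for illustration. The enum values are the three strategies mentioned elsewhere in this thread (hash, lookup, dynamic_broadcast), and the switch returns a description rather than building real exchange nodes.

```java
import java.util.Locale;

// Hypothetical sketch of the enum-plus-switch idea; names are illustrative.
final class JoinStrategies {
  // Strategies named in this thread as supported before the PR.
  enum JoinStrategy { HASH, LOOKUP, DYNAMIC_BROADCAST }

  private JoinStrategies() {
  }

  // Assumption: a missing hint falls back to HASH.
  static JoinStrategy getJoinStrategy(String hint) {
    if (hint == null) {
      return JoinStrategy.HASH;
    }
    return JoinStrategy.valueOf(hint.trim().toUpperCase(Locale.ROOT));
  }

  // Stand-in for the rule body: one switch branch per strategy.
  static String describeExchanges(JoinStrategy strategy) {
    switch (strategy) {
      case LOOKUP:
        // Matches the comment in the diff: local exchange on the left side.
        return "lookup: local exchange on the left side";
      case DYNAMIC_BROADCAST:
        // Placeholder; the thread does not describe this strategy's exchanges.
        return "dynamic_broadcast: strategy-specific exchanges";
      case HASH:
        return "hash: partition inputs by the join keys";
      default:
        throw new IllegalStateException("Unhandled join strategy: " + strategy);
    }
  }
}
```

Adding a strategy then means adding one enum constant and one `case`, and the `default` branch surfaces any constant that was added without a matching branch.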

gortiz · 2025-01-14T17:59:02Z

pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/InStageStatsTreeBuilder.java

      return recursiveCase(node, MultiStageOperator.Type.LOOKUP_JOIN);
+    } else {
+      // TODO: Consider renaming this operator type. It handles multiple join strategies.


I think we are using names here in a strange way. This is a hash operator because it implements the join using a hash map. The other is a lookup join operator because it implements it using lookup logic.

In parallel, we have join strategies. One of the strategies creates logical partitions at query time based on the values of the columns being joined. These partitions are decided by hash code, so it is called the hash strategy. In the documentation I used Query time partition join strategy because I didn't want to focus too much on the fact that it uses hashes.

Imagine a scenario where we add sorted joins. The type of the join should be sort and the strategy used for the distribution of its inputs may be hash.

TL;DR: I think we need to distinguish between join algorithms (lookup, hash, sorted, nested loop) and distribution strategies (hash/partitioned, local, random, broadcast, etc.). The algorithm will probably change the operator class being used, while the distribution strategy will change the exchange of the children of the join.

I agree that we should separate the type of join and the type of distribution/shuffle. The latter is not unique to joins and could also be used in e.g. aggregations.

Sounds good. Removed this TODO, and we can revisit this when adding the next join operator

gortiz · 2025-01-14T18:13:43Z

We need to add documentation at https://docs.pinot.apache.org/users/user-guide-query/multi-stage-query/join-strategies. Feel free to use my diagrams at https://app.excalidraw.com/s/6rIIm06x9LN/amPNwZicV0. I don't know how to share excalidraw diagrams with edit permissions without giving the write permission to the whole internet.

bziobrowski · 2025-01-14T18:18:05Z

pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java

    }
-    PlanNode childPlanNode = children.get(0).getFragmentRoot();
-    return childPlanNode instanceof MailboxSendNode
-        && ((MailboxSendNode) childPlanNode).getDistributionType() == RelDistribution.Type.SINGLETON;


nit: why not return -1 and avoid allocating an object?

Jackie-Jiang · 2025-01-15T04:35:49Z

I'll take a look, but I think we need to rethink the names of join strategies. A strategy should not be defined uniquely by what it does on one of the sides of the join. This case, for example, is clear: The broadcast strategy will be applied on the right-hand side, but what will happen with the left one? We already have a strategy where the right-hand side is broadcasted, but the left is randomly shuffled.

I feel BROADCAST usually means fixing one side, and broadcasting the other side. Several query engines only support this strategy. We don't have an explicit join strategy for randomly shuffling left side and broadcasting right side, so if we want to add one, we can think of a new name for this less commonly used one.

siddharthteotia · 2025-01-15T07:03:27Z

I think I gave some feedback on the naming a long time ago, when the Multi Stage engine was being developed. I can't seem to find the thread now.

Ideally (this is how OLAP engines typically do), there is a clear distinction between an EXCHANGE type and the implementation of physical operator.

In this case:

Exchange types can be (to name a few)

BROADCAST
HASH_PARTITION (like Spark shuffle)
ROUND_ROBIN
SEND_TO_SINGLE

Exchange itself is implemented as a pair of operators:

Sender operator as the root operator in the sender stage (downstream)
Receiver operator as the leaf operator in the receiver stage (upstream)

Regardless of the exchange, there is some processing done in the receiver stage after the exchange (between 2 stages). In this case, it will be JOIN

Logical Operation - JOIN
Exchange type - BROADCAST
Physical operation (depending on planner / optimizer) - HashJoin, Sort-Merge Join, NLJ etc

Typically, the receiver does a HashJoin after a broadcast exchange but I don't think this is always going to be true

With that being said, BROADCAST is NOT A JOIN Strategy. It is an exchange type (e.g. an exchange between two Major Fragments in Presto / Trino).

So, we should try to build this clear distinction both in code and design.
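One way to picture this separation is the sketch below. It is a hypothetical model, not Pinot code: `JoinPlanSketch`, `PhysicalJoin`, and `withAlgorithm` are made-up names, the exchange types are the ones listed in this comment, and the algorithm names follow the "HashJoin, Sort-Merge Join, NLJ" examples. The point is only that the exchange on each input and the physical join algorithm are independent fields.

```java
// Hypothetical model of the decoupling described above; not Pinot code.
final class JoinPlanSketch {
  // Exchange types as listed in the comment.
  enum ExchangeType { BROADCAST, HASH_PARTITION, ROUND_ROBIN, SEND_TO_SINGLE }

  // Physical join algorithms named in the comment.
  enum JoinAlgorithm { HASH_JOIN, SORT_MERGE_JOIN, NESTED_LOOP_JOIN }

  private JoinPlanSketch() {
  }

  // A logical JOIN described by three independent choices.
  static final class PhysicalJoin {
    final ExchangeType leftExchange;
    final ExchangeType rightExchange;
    final JoinAlgorithm algorithm;

    PhysicalJoin(ExchangeType leftExchange, ExchangeType rightExchange, JoinAlgorithm algorithm) {
      this.leftExchange = leftExchange;
      this.rightExchange = rightExchange;
      this.algorithm = algorithm;
    }

    // Swapping the algorithm leaves both exchanges untouched.
    PhysicalJoin withAlgorithm(JoinAlgorithm newAlgorithm) {
      return new PhysicalJoin(leftExchange, rightExchange, newAlgorithm);
    }

    @Override
    public String toString() {
      return "JOIN[left=" + leftExchange + ", right=" + rightExchange + ", algorithm=" + algorithm + "]";
    }
  }
}
```

Under this model a planner could pick `BROADCAST` for the right input and still choose between `HASH_JOIN` and `SORT_MERGE_JOIN` on the receiver side, which is the flexibility the comment argues for.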

siddharthteotia · 2025-01-15T07:08:45Z

Maybe for this PR it is fine if we are trying to get something going for a large fact to small dimension table JOIN. My recommendation would be to start thinking about revamping / refactoring all of this. It would make future additions more flexible, decouple physical operator implementation from exchange types (which should always be the case), and implement exchanges as operators as well, which will make it easier to optimize a plan with the desired exchange type.

Jackie-Jiang · 2025-01-15T07:47:43Z

@siddharthteotia I like the idea of de-coupling exchange and join algorithm. Currently they are coupled under join strategy, where we support hash, lookup, dynamic_broadcast before this PR. We can start thinking how to organize the hint so that we can combine different exchange type with join algorithm. Exchange type is also useful for other operations such as aggregate.

We can address this as a separate effort. cc @ankitsultana

siddharthteotia · 2025-01-15T07:55:11Z

we can combine different exchange type with join algorithm. Exchange type is also useful for other operations such as aggregate.

Exactly. It will be much more flexible to build an optimal plan with the decoupling where we can choose exchange type based on data characteristics and/or physical operator algorithm.

Jackie-Jiang · 2025-01-15T08:19:21Z

I'm thinking changing the join_strategy hint to exchange_type hint (still under joinOptions). For this particular PR's purpose, I can add local_broadcast to represent left local, right broadcast. @gortiz @siddharthteotia wdyt?

ankitsultana · 2025-01-15T16:31:43Z

@siddharthteotia I am right now working on refactoring the optimizer so that many of the optimizations like colocation, skipping of partial aggregates, etc. will become automatic. Will add you to the Slack channel.

gortiz · 2025-01-15T21:47:02Z

You are right @siddharthteotia. The distribution is a property of the exchange. I was discussing that with @bziobrowski yesterday, and he rightfully mentioned that exchange types also affect aggregates.

I'm thinking changing the join_strategy hint to exchange_type hint (still under joinOptions). For this particular PR's purpose, I can add local_broadcast to represent left local, right broadcast. @gortiz @siddharthteotia wdyt?

+1 to that.

siddharthteotia · 2025-01-15T21:53:26Z

Sounds good.

I'm thinking changing the join_strategy hint to exchange_type hint (still under joinOptions)

I am ok with this for now. It's fine to provide the exchange type as a hint as long as exchange_type is not solely a property of JoinOptions. It should be independent, and JoinOptions or any query type should be able to leverage it, especially if users know what they are doing and are trying to dictate the exchange type via hints.

Jackie-Jiang · 2025-01-16T06:52:30Z

Updated the PR to decouple exchange type from join strategy. Allow customizing exchange type for both left and right side.

gortiz · 2025-01-16T19:16:11Z

pinot-query-runtime/src/test/resources/queries/QueryHints.json

@@ -125,6 +125,14 @@
        "description": "Colocated JOIN with partition column and group by non-partitioned column with stage parallelism",
        "sql": "SET stageParallelism=2; SELECT {tbl1}.name, SUM({tbl2}.num) FROM {tbl1} /*+ tableOptions(partition_function='hashcode', partition_key='num', partition_size='4') */ JOIN {tbl2} /*+ tableOptions(partition_function='hashcode', partition_key='num', partition_size='4') */ ON {tbl1}.num = {tbl2}.num GROUP BY {tbl1}.name"
      },
+      {
+        "description": "Broadcast JOIN without partition hint",
+        "sql": "SELECT /*+ joinOptions(left_exchange_type = 'local', right_exchange_type = 'broadcast') */ {tbl1}.num, {tbl1}.name, {tbl2}.num, {tbl2}.val FROM {tbl1} JOIN {tbl2} ON {tbl1}.num = {tbl2}.num"


Wouldn't it be better to create a new hint for exchanges instead of assigning it to the join?

The way I'm suggesting the query would be something like:

SELECT {tbl1}.num, {tbl1}.name, {tbl2}.num, {tbl2}.val FROM {tbl1} /*+ exchangeOption(type = 'local') */ JOIN {tbl2} /*+ exchangeOption(type = 'broadcast') */ ON {tbl1}.num = {tbl2}.num;

This could also be used for example in aggregates. For example, we could write something like:

SELECT {tbl1}.num, count(*) from {tbl1} /*+ exchangeOption(type = 'local') */ GROUP BY {tbl1}.num;

Good suggestion. Let me try it and see if I can make it work

Hmm, I'm not able to make it work because this is not really a TABLE_SCAN option, or an option applied to any specific RelNode. The left and right side of a JOIN could be any RelNode, and it could be another chained JOIN. Do you see a way to extract this hint?

Are we not using this?
https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/hint/package-summary.html

Calcite has good support for hint extraction and propagation.

siddharthteotia · 2025-01-22T06:42:48Z

...lanner/src/main/java/org/apache/pinot/calcite/rel/rules/PinotJoinExchangeNodeInsertRule.java

    RelNode newLeft;
    RelNode newRight;
    if (PinotHintOptions.JoinHintOptions.useLookupJoinStrategy(join)) {
-      // Lookup join - add local exchange on the left side
-      newLeft = PinotLogicalExchange.create(left, RelDistributions.SINGLETON);
+      // Lookup join


(nit) this function is somewhat less readable. we should consider refactoring

we should also add validation. for example, a query can't have an exchange hint of BROADCAST for the left side and HASH for the right side of the same JOIN op.
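A minimal sketch of such a check, assuming the four distribution types from the PR description. `DistributionHintValidator` and `validate` are hypothetical names, and the only rule encoded is the one combination called out in this comment; any further rules would need to be decided by the planner's authors.

```java
// Hypothetical validation sketch; only the combination named in the review is rejected.
final class DistributionHintValidator {
  // Distribution types from the PR description.
  enum Type { LOCAL, HASH, BROADCAST, RANDOM }

  private DistributionHintValidator() {
  }

  // Throws when the left/right pair is the combination flagged as invalid in review.
  static void validate(Type left, Type right) {
    if (left == Type.BROADCAST && right == Type.HASH) {
      throw new IllegalArgumentException(
          "Unsupported distribution types for JOIN: left=" + left + ", right=" + right);
    }
  }
}
```

Calling `validate(Type.LOCAL, Type.BROADCAST)` passes, which is the combination the PR description uses for a broadcast join, while `validate(Type.BROADCAST, Type.HASH)` throws before any exchange is planned.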

Jackie-Jiang added the feature, release-notes, and multi-stage labels Jan 11, 2025

Jackie-Jiang requested review from xiangfu0 and gortiz January 11, 2025 20:54

Jackie-Jiang force-pushed the broadcast_join branch from 996e7a5 to 77e8914 Compare January 14, 2025 01:48

gortiz reviewed Jan 14, 2025

gortiz approved these changes Jan 14, 2025

bziobrowski reviewed Jan 14, 2025

Jackie-Jiang force-pushed the broadcast_join branch from 77e8914 to 4b7f883 Compare January 15, 2025 04:32

Jackie-Jiang added the documentation label Jan 15, 2025

Jackie-Jiang force-pushed the broadcast_join branch from 4b7f883 to 01309b4 Compare January 15, 2025 05:56

Jackie-Jiang force-pushed the broadcast_join branch from 01309b4 to c00a392 Compare January 16, 2025 06:46

Jackie-Jiang changed the title ~~Support BROADCAST join strategy~~ Support exchange type hint to allow broadcast join Jan 16, 2025

gortiz reviewed Jan 16, 2025

Jackie-Jiang changed the title ~~Support exchange type hint to allow broadcast join~~ Support distribution type hint to allow broadcast join Jan 17, 2025

Support distribution type hint to allow broadcast join

c65d1e8

Jackie-Jiang force-pushed the broadcast_join branch from c00a392 to c65d1e8 Compare January 17, 2025 08:03

siddharthteotia reviewed Jan 22, 2025
