HIVE-28259: Common table expression detection and rewrites using CBO #5249

zabetak · 2024-05-14T14:21:32Z

What changes were proposed in this pull request?

Add applyCteRewriting phase in CalcitePlanner for detecting and using CTEs; ensure rewrite logic is consistent with existing hive.optimize.cte.materialize.* properties.
Model CTEs as materialized views (MVs) and add utility method in HiveMaterializedViewUtils for mapping a CTE to a RelOptMaterialization.
Refactor core MV rewrite logic in CalcitePlanner to use during CTE rewrite and exploit CTEs in a cost-based manner.
Add HiveTableSpool operator to represent CTEs and handle them in the plan using new rules: TableScanToSpoolRule and RemoveUnusedCteRule.
Add TableScanToSpoolRule, and RemoveUnusedCteRule to add/remove spools from the plan.
Enhance/Enrich metadata handlers for handling the Spool operator.
Add AggregatedColumns metadata (and respective handler and metadata query), for controlling if a CTE is a "full aggregate" at the CBO (RelNode) level to ensure consistent behavior with hive.optimize.cte.materialize.full.aggregate.only property.
Add HiveSqlTypeUtil.containsSqlType for detecting and skipping the creation of CTEs with untyped nulls since they are not supported (HIVE-11217).
Add hive.optimize.cte.suggester.class and CommonTableExpressionSuggester interface to provide pluggable CTE detection logic. Given that CTE detection logic can range from basic tree traversal algorithms to complex workload analysis frameworks this part needs to be configurable since there is no one-size-fits-all implementation. The configuration property also allows proprietary algorithms to be integrated in HiveServer2 by implementing the necessary APIs and adding the jars in the classpath.
Add prototype implementation for CTE detection logic in CommonTableExpressionIdentitySuggester using CommonRelSubExprRegisterRule and CommonTableExpressionRegistry. Although the implementation is rather simple it can discover various interesting CTEs as demonstrated by the tests and can be indeed useful in a prod environment.
Map spool(s) to WITH clauses during the RelNode to AST conversion (ASTConverter, ASTBuilder, PlanModifierForASTConv) to exploit existing CTE materialization feature (HIVE-11752).
Modify (slighly) SemanticAnalyzer/CalcitePlanner to enable AST-based CTE materialization (getMetadata) post CBO run.

Why are the changes needed?

Open the road for cost-based CTE optimizations in Hive
Pluggable/Extensible CTE detection logic

Does this PR introduce any user-facing change?

By default no. After tuning the new/existing cte properties query plans may change affecting performance.

Is the change a dependency upgrade?

No

How was this patch tested?

Tests for:

CTE detection logic using the CommonTableExpressionPrintSuggester (TestTezTPCDS30TBPerfCliDriver)
demonstrating (end-to-end) the CTE feature (cte_cbo_rewrite_0.q)
verify coherence of CTE rewrite with hive.optimize.cte.materialize.full.aggregate.only (cte_mat_12.q)
spool JSON serialization (cte_cbo_plan_json.q)

sonarcloud · 2024-05-14T15:12:19Z

Quality Gate passed

Issues
33 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

asolimando

LGTM, pretty elegant approach cleanly implemented. I have left few minor comments but nothing blocking, hence the approval.

I have only skimmed few out files for the TPCDS test suite, I haven't checked them all.

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/CommonRelSubExprRegisterRule.java

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePartitionPruneRule.java

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/RemoveUnusedCteRule.java

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/AggregatedColumns.java

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTBuilder.java

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java

asolimando · 2024-05-14T15:55:18Z

ql/src/test/queries/clientpositive/cte_cbo_plan_json.q

+EXPLAIN FORMATTED CBO
+SELECT name FROM emps e WHERE salary > 50000
+UNION
+SELECT name FROM emps e WHERE salary > 50000;


Nit: missing newline at the end of the file

zabetak · 2024-06-27T17:19:25Z

Hey @asolimando , many thanks for reviewing the PR!! I was busy and didn't manage to reply to your comments. In the meantime the PR got a bit stale since master diverged. Would you mind if I rebase this PR cause merging with master may look messy?

1. Add `applyCteRewriting` phase in `CalcitePlanner` for detecting and using CTEs; ensure rewrite logic is consistent with existing `hive.optimize.cte.materialize.*` properties. 2. Model CTEs as materialized views (MVs) and add utility method in `HiveMaterializedViewUtils` for mapping a CTE to a `RelOptMaterialization`. 3. Refactor core MV rewrite logic in `CalcitePlanner` to use during CTE rewrite and exploit CTEs in a cost-based manner. 4. Add `HiveTableSpool` operator to represent CTEs and handle them in the plan using new rules: `TableScanToSpoolRule` and `RemoveUnusedCteRule`. 5. Add `TableScanToSpoolRule`, and `RemoveUnusedCteRule` to add/remove spools from the plan. 6. Enhance/Enrich metadata handlers for handling the Spool operator. 7. Add `AggregatedColumns` metadata (and respective handler and metadata query), for controlling if a CTE is a "full aggregate" at the CBO (RelNode) level to ensure consistent behavior with `hive.optimize.cte.materialize.full.aggregate.only` property. 8. Add `HiveSqlTypeUtil.containsSqlType` for detecting and skipping the creation of CTEs with untyped nulls since they are not supported (HIVE-11217). 9. Add `hive.optimize.cte.suggester.class` and CommonTableExpressionSuggester interface to provide pluggable CTE detection logic. Given that CTE detection logic can range from basic tree traversal algorithms to complex workload analysis frameworks this part needs to be configurable since there is no one-size-fits-all implementation. The configuration property also allows proprietary algorithms to be integrated in HiveServer2 by implementing the necessary APIs and adding the jars in the classpath. 10. Add prototype implementation for CTE detection logic in CommonTableExpressionIdentitySuggester using CommonRelSubExprRegisterRule and CommonTableExpressionRegistry. Although the implementation is rather simple it can discover various interesting CTEs as demonstrated by the tests and can be indeed useful in a prod environment. 11. Map spool(s) to `WITH` clauses during the RelNode to AST conversion (ASTConverter, ASTBuilder, PlanModifierForASTConv) to exploit existing CTE materialization feature (HIVE-11752). 12. Modify (slighly) `SemanticAnalyzer`/`CalcitePlanner` to enable AST-based CTE materialization (getMetadata) post CBO run. 13. Tests for: * CTE detection logic using the `CommonTableExpressionPrintSuggester` (`TestTezTPCDS30TBPerfCliDriver`) * demonstrating (end-to-end) the CTE feature (cte_cbo_rewrite_0.q) * verify coherence of CTE rewrite with `hive.optimize.cte.materialize.full.aggregate.only` (cte_mat_12.q) * spool JSON serialization (cte_cbo_plan_json.q)

A temporary table is not necessarily a materialized CTE so that condition alone is not enough to perform CTE specific processing. On the other hand, when materializedTable is true then it is certain that we are dealing with materialized CTEs since that field was introduced exactly for this purpose.

…redundancy

If the registry is null then the rule is useless so the check does not make much sense at the moment.

amansinha100

Thanks for the PR for this important performance feature. I have a few comments, some of which are potential future enhancements. Overall, this looks fairly good.

amansinha100 · 2024-07-05T18:32:11Z

ql/src/test/results/clientpositive/perf/tpcds30tb/tez/cbo_query31.q.out

+CTE Suggestion 3:HiveProject(d_date_sk=[$0])
+  HiveFilter(condition=[AND(=($10, 1), =($6, 2000))])
+    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+
+CTE Suggestion 4:HiveProject(d_date_sk=[$0])
+  HiveFilter(condition=[AND(=($10, 3), =($6, 2000))])
+    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+
+CTE Suggestion 5:HiveProject(d_date_sk=[$0])
+  HiveFilter(condition=[AND(=($10, 2), =($6, 2000))])
+    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+


I suppose a more sophisticated suggester would combine CTEs 3, 4, 5 into a single one with an IN clause for the HiveFilter, i.e. $10 IN (1, 2, 3) AND $6 = 2000

Indeed a different suggester could opt to merge these CTEs together by creating a disjunction of the three conditions. At this stage, I wanted to keep CommonTableExpressionIdentitySuggester rather simple and leave more elaborate logic for follow-up tickets.

For instance, in 3ae07ca I had a WIP suggester that tries to create suggestions by combining predicates (with disjunction) over a table that appears multiple times in a query.

amansinha100 · 2024-07-05T19:57:38Z

ql/src/test/queries/clientpositive/perf/cbo_query11.q

@@ -1,3 +1,4 @@
+set hive.optimize.cte.suggester.class=org.apache.hadoop.hive.ql.optimizer.calcite.CommonTableExpressionPrintSuggester;


In terms of the q file changes, this setting is replicated in all the cbo_queryxx files but would it make sense to just have a separate folder perf/cte with targeted set of tpcds tests where this config option is set ? It is possible we may add other cte specific configurations (e.g some limit on the number of CTEs to consider to reduce planning time), and it may be better to have an isolated suite.

I came up with the CommonTableExpressionPrintSuggester as a convenient way to (unit) test the CommonTableExpressionIdentitySuggester on TPC-DS queries. I considered some other alternatives like 0ed2f5c but these required more changes and were less realistic.

In terms of code duplication indeed the set hive.optimize.cte.suggester.class appears in ~100 files/lines. If we go for a separate suite the overall duplication may be larger. With the current design we would need to duplicate the driver class (TestTezTPCDS30TBPerfCliDriver), the respective config (TezTPCDS30TBCliConfig), potentially the queries that we want to test, xml config files, etc. Furthermore, if we want to test CTE specific configurations for certain TPC-DS queries we can always copy the respective file (e.g., cte_cbo_query11.q) and set additional properties as required.

I am OK with all the options. If you feel a new CTE driver would be needed/beneficial for this work I can do the necessary changes.

Yeah, if the driver and config and associated artifacts have to be duplicated, then we are better off to leave it in the current suite, so I am good with what you have in the PR.

amansinha100 · 2024-07-07T20:06:51Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/TableScanToSpoolRule.java

+    super(operand(TableScan.class, none()));
+    this.referenceThreshold = referenceThreshold;
+    this.tableOccurrences = tableOccurrences;
+    assert referenceThreshold > 0;


nit: should this assert for > 1 ? Since CTE should only be applicable if there are more than 1 references.

The Spool will be introduced (resp. CTE will be materialized) when tableOccurences > referenceThreshold (the condition is evaluated in the matches method) thus referenceThreshold = 1 is a perfectly legal value and it means "more than 1 references".

amansinha100 · 2024-07-07T23:32:11Z

.../java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/views/HiveMaterializedViewUtils.java

+        new RelOptHiveTable(null, cluster.getTypeFactory(), fullName, body.getRowType(), hiveTable, columns,
+            Collections.emptyList(), Collections.emptyList(), new HiveConf(), Hive.getThreadLocal(),
+            new QueryTables(true), new HashMap<>(), new HashMap<>(), new AtomicInteger());
+    optTable.setRowCount(cluster.getMetadataQuery().getRowCount(body));


Besides the rowcount, do we want to propagate other properties to the CTE such as distribution and collation ? If these are not expected to be used for the TableSpool, then it does not matter. It could be a future area of improvement.

amansinha100 · 2024-07-07T23:54:43Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/CommonTableExpressionRegistry.java

+   * <p>The expressions may contain internal planning concepts such as {@link org.apache.calcite.plan.hep.HepRelVertex}.
+   * </p>
+   */
+  private final Map<String, RelNode> ctes = new HashMap<>();


Currently, all CTEs are weighted equally. A potentially future enhancement could be one where there is weight/rank associated with it which would feed into the optimizer decisions. This only matters if we want to restrict the number of CTEs to be considered (for shorter compilation time).

amansinha100 · 2024-07-08T00:18:54Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/CommonRelSubExprRegisterRule.java

+ * Rule for saving relational expressions that appear more than once in a query tree to the planner context.
+ */
+public final class CommonRelSubExprRegisterRule extends CommonRelSubExprRule {
+  public static final CommonRelSubExprRegisterRule JOIN = new CommonRelSubExprRegisterRule(operand(Join.class, any()));


Are we supporting any type of joins in the CTE ? IIRC for MVs, there was some restriction around full outer join (need to confirm this).

amansinha100 · 2024-07-08T02:12:03Z

ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java

@@ -1737,6 +1747,14 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema relOptSchema, SchemaPlu
      perfLogger.perfLogBegin(this.getClass().getName(), PerfLogger.POSTJOIN_ORDERING);
      calcitePlan = applyPostJoinOrderingTransform(calcitePlan, mdProvider.getMetadataProvider(), executorProvider);
      perfLogger.perfLogEnd(this.getClass().getName(), PerfLogger.POSTJOIN_ORDERING);
+      // Perform the CTE rewriting near the end of CBO transformations to avoid interference of the new HiveTableSpool 
+      // operator with other rules (especially those related to constant folding and branch pruning).


It would be useful to have a simple q test with branch pruning on both sides of a UNION ALL (both sides have CTE with FALSE predicate).. in that case the branches should be pruned first and the CTE part should be skipped altogether.

Without this change some cte_mat*.q tests throw: SemanticException: Line 0:-1 Invalid table alias or column reference

The integer range is sufficient for holding the occurrences of tables in the context of a single query.

sonarcloud · 2024-07-08T18:20:36Z

Quality Gate passed

Issues
33 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

asolimando · 2024-07-09T13:14:56Z

Approving latest changes, there are 2 diff in qout files but it seems minor

asf-ci-hive added the tests pending label May 14, 2024

asolimando approved these changes May 14, 2024

View reviewed changes

asf-ci-hive added tests unstable and removed tests pending labels May 14, 2024

zabetak added 5 commits July 5, 2024 16:09

HIVE-28259: Print CTE suggestions only in cbo_query*q tests to avoid …

863551f

…redundancy

HIVE-28259: Drop redundant null check in CommonRelSubExprRegisterRule

fc36fbd

If the registry is null then the rule is useless so the check does not make much sense at the moment.

HIVE-28259: Rename RemoveUnusedCteRule to RemoveInfrequentCteRule

c71353e

zabetak force-pushed the cte-clean branch from f42507c to c71353e Compare July 5, 2024 14:09

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Jul 5, 2024

amansinha100 reviewed Jul 8, 2024

View reviewed changes

zabetak added 2 commits July 8, 2024 15:23

HIVE-28259: Consider CTE alias if available in table scan

847fe8f

Without this change some cte_mat*.q tests throw: SemanticException: Line 0:-1 Invalid table alias or column reference

HIVE-28259: Use Integer (instead of Long) to keep table occurrences

a924080

The integer range is sufficient for holding the occurrences of tables in the context of a single query.

asf-ci-hive added tests pending and removed tests unstable labels Jul 8, 2024

asf-ci-hive added tests unstable and removed tests pending labels Jul 8, 2024

asolimando approved these changes Jul 9, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HIVE-28259: Common table expression detection and rewrites using CBO #5249

HIVE-28259: Common table expression detection and rewrites using CBO #5249

zabetak commented May 14, 2024

sonarcloud bot commented May 14, 2024

asolimando left a comment

asolimando May 14, 2024

zabetak commented Jun 27, 2024

amansinha100 left a comment

amansinha100 Jul 5, 2024

zabetak Jul 9, 2024

zabetak Jul 9, 2024

amansinha100 Jul 5, 2024

zabetak Jul 9, 2024

amansinha100 Jul 9, 2024

amansinha100 Jul 7, 2024

zabetak Jul 10, 2024 •

edited

Loading

amansinha100 Jul 7, 2024

amansinha100 Jul 7, 2024

amansinha100 Jul 8, 2024

amansinha100 Jul 8, 2024

sonarcloud bot commented Jul 8, 2024

asolimando commented Jul 9, 2024

		@@ -1,3 +1,4 @@
		set hive.optimize.cte.suggester.class=org.apache.hadoop.hive.ql.optimizer.calcite.CommonTableExpressionPrintSuggester;

HIVE-28259: Common table expression detection and rewrites using CBO #5249

Are you sure you want to change the base?

HIVE-28259: Common table expression detection and rewrites using CBO #5249

Conversation

zabetak commented May 14, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

sonarcloud bot commented May 14, 2024

Quality Gate passed

asolimando left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zabetak commented Jun 27, 2024

amansinha100 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zabetak Jul 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sonarcloud bot commented Jul 8, 2024

Quality Gate passed

asolimando commented Jul 9, 2024

zabetak Jul 10, 2024 •

edited

Loading