From 64c8029518a4c5a463d972af895af90fec6e0506 Mon Sep 17 00:00:00 2001 From: Marcos Antonio Vaz Salles Date: Tue, 5 Sep 2023 15:21:00 +0100 Subject: [PATCH] Revised docs for new group size hints This commit revises the user-facing documentation to refer to the new group size hints, namely AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE, instead of the old syntax with the EXPECTED GROUP SIZE. Even though the old syntax is still available for backwards compatibility, the documentation is being changed to only refer to the new syntax. As we help customers move to the new syntax, we will eventually be in a position where the old syntax can be deprecated. This strategy was proposed in design doc #21500. --- doc/user/content/sql/select.md | 4 ++- .../content/sql/system-catalog/mz_internal.md | 12 ++++----- .../content/transform-data/optimization.md | 26 +++++++++---------- 3 files changed, 22 insertions(+), 20 deletions(-) diff --git a/doc/user/content/sql/select.md b/doc/user/content/sql/select.md index 1fa5b64d8993..2fbab73e450f 100644 --- a/doc/user/content/sql/select.md +++ b/doc/user/content/sql/select.md @@ -124,7 +124,9 @@ The following query hints are valid within the `OPTION` clause. Hint | Value type | Description ------|------------|------------ -`EXPECTED GROUP SIZE` | `int` | How many rows will have the same group key. Materialize can render `min` and `max` expressions, and some [Top K patterns](/guides/top-k), more efficiently with this information. +`AGGREGATE INPUT GROUP SIZE` | `uint8` | How many rows will have the same group key in an aggregation. Materialize can render `min` and `max` expressions more efficiently with this information. +`DISTINCT ON INPUT GROUP SIZE` | `uint8` | How many rows will have the same group key in a `DISTINCT ON` expression. Materialize can render [Top K patterns](/guides/top-k) based on `DISTINCT ON` more efficiently with this information. +`LIMIT INPUT GROUP SIZE` | `uint8` | How many rows will be given as a group to a `LIMIT` restriction. Materialize can render [Top K patterns](/guides/top-k) based on `LIMIT` more efficiently with this information. For examples, see the [Optimization](/transform-data/optimization/#query-hints) page. diff --git a/doc/user/content/sql/system-catalog/mz_internal.md b/doc/user/content/sql/system-catalog/mz_internal.md index c2508e83a3da..384656a91d37 100644 --- a/doc/user/content/sql/system-catalog/mz_internal.md +++ b/doc/user/content/sql/system-catalog/mz_internal.md @@ -965,10 +965,10 @@ The `mz_dataflow_shutdown_durations_histogram` view describes a histogram of the ### `mz_expected_group_size_advice` -The `mz_expected_group_size_advice` view provides advice on opportunities to set the `EXPECTED GROUP SIZE` -[query hint]. This hint is applicable to dataflows maintaining [`MIN`], [`MAX`], or [Top K] query patterns. The -maintainance of these query patterns is implemented inside an operator scope, called a region, through a -hierarchical scheme for either aggregation or Top K computations. +The `mz_expected_group_size_advice` view provides advice on opportunities to set [query hints]. +Query hints are applicable to dataflows maintaining [`MIN`], [`MAX`], or [Top K] query patterns. +The maintainance of these query patterns is implemented inside an operator scope, called a region, +through a hierarchical scheme for either aggregation or Top K computations. | Field | Type | Meaning | @@ -980,7 +980,7 @@ hierarchical scheme for either aggregation or Top K computations. | `levels` | [`bigint`] | The number of levels in the hierarchical scheme implemented by the region. | | `to_cut` | [`bigint`] | The number of levels that can be eliminated (cut) from the region's hierarchy. | | `savings` | [`numeric`] | A conservative estimate of the amount of memory in bytes to be saved by applying the hint. | -| `hint` | [`double precision`] | The hint value for `EXPECTED GROUP SIZE` that will eliminate `to_cut` levels from the regions' hierarchy. | +| `hint` | [`double precision`] | The hint value that will eliminate `to_cut` levels from the region's hierarchy. | ### `mz_message_counts` @@ -1095,7 +1095,7 @@ The `mz_scheduling_parks_histogram` view describes a histogram of [dataflow] wor [`MIN`]: /sql/functions/#min [`MAX`]: /sql/functions/#max [Top K]: /transform-data/patterns/top-k -[query hint]: /sql/select/#query-hints +[query hints]: /sql/select/#query-hints diff --git a/doc/user/content/transform-data/optimization.md b/doc/user/content/transform-data/optimization.md index 423591afa4a0..162413a42478 100644 --- a/doc/user/content/transform-data/optimization.md +++ b/doc/user/content/transform-data/optimization.md @@ -296,9 +296,9 @@ The following are the possible index usage types: ## Query hints -Materialize has at present one important [query hint]: the `EXPECTED GROUP SIZE`. This hint applies to indexed or materialized views that need to incrementally maintain [`MIN`], [`MAX`], or [Top K] queries, as specified, e.g., by `GROUP BY`, `LATERAL`, or `DISTINCT ON`. Maintaining these queries while delivering low latency result updates is demanding in terms of main memory. This is because Materialize builds a hierarchy of aggregations so that data can be physically partitioned into small groups. By having only small groups at each level of the hierarchy, we can make sure that recomputing aggregations is not slowed down by skew in the sizes of the original query groups. +Materialize has at present three important [query hints]: the `AGGREGATE INPUT GROUP SIZE`, the `DISTINCT ON INPUT GROUP SIZE`, and the `LIMIT INPUT GROUP SIZE`. These hints apply to indexed or materialized views that need to incrementally maintain [`MIN`], [`MAX`], or [Top K] queries, as specified by SQL aggregations, `DISTINCT ON`, or `LIMIT`. Maintaining these queries while delivering low latency result updates is demanding in terms of main memory. This is because Materialize builds a hierarchy of aggregations so that data can be physically partitioned into small groups. By having only small groups at each level of the hierarchy, we can make sure that recomputing aggregations is not slowed down by skew in the sizes of the original query groups. -The number of levels needed in the hierarchical scheme is by default set assuming that there may be large query groups in the input data. By specifying the `EXPECTED GROUP SIZE`, it is possible to refine this assumption, allowing Materialize to build a hierarchy with fewer levels and lower memory consumption without sacrificing update latency. +The number of levels needed in the hierarchical scheme is by default set assuming that there may be large query groups in the input data. By specifying the query hints, it is possible to refine this assumption, allowing Materialize to build a hierarchy with fewer levels and lower memory consumption without sacrificing update latency. Consider the previous example with the collection `sections`. Maintenance of the maximum `course_id` per `teacher` can be achieved with a materialized view: @@ -309,7 +309,7 @@ FROM sections GROUP BY teacher_id; ``` -If the largest number of `course_id` values that are allocated to a single `teacher_id` is known, then this number can be provided as the `EXPECTED GROUP SIZE`. For the query above, it is possible to get an estimate for this number by: +If the largest number of `course_id` values that are allocated to a single `teacher_id` is known, then this number can be provided as the `AGGREGATE INPUT GROUP SIZE`. For the query above, it is possible to get an estimate for this number by: ```sql SELECT MAX(course_count) @@ -320,29 +320,29 @@ FROM ( ); ``` -However, the estimate is based only on data that is already present in the system. So taking into account how much this largest number could expand is critical to avoid issues with update latency after tuning the `EXPECTED GROUP SIZE`. +However, the estimate is based only on data that is already present in the system. So taking into account how much this largest number could expand is critical to avoid issues with update latency after tuning the query hint. -For our example, let's suppose that we determined the largest number of courses per teacher to be `1000`. Then, the original definition of `max_course_id_per_teacher` can be revised to include the `EXPECTED GROUP SIZE` query hint as follows: +For our example, let's suppose that we determined the largest number of courses per teacher to be `1000`. Then, the original definition of `max_course_id_per_teacher` can be revised to include the `AGGREGATE INPUT GROUP SIZE` query hint as follows: ```sql CREATE MATERIALIZED VIEW max_course_id_per_teacher AS SELECT teacher_id, MAX(course_id) FROM sections GROUP BY teacher_id -OPTIONS (EXPECTED GROUP SIZE = 1000) +OPTIONS (AGGREGATE INPUT GROUP SIZE = 1000) ``` -The same hint can be provided in [Top K] query patterns specified by `DISTINCT ON` or `LATERAL`. As examples, consider that we wish not to compute the maximum `course_id`, but rather the `id` of the section of this top course. This computation can be incrementally maintained by the following materialized view: +The other two hints can be provided in [Top K] query patterns specified by `DISTINCT ON` or `LIMIT`. As examples, consider that we wish not to compute the maximum `course_id`, but rather the `id` of the section of this top course. This computation can be incrementally maintained by the following materialized view: ```sql CREATE MATERIALIZED VIEW section_of_top_course_per_teacher AS SELECT DISTINCT ON(teacher_id) teacher_id, id AS section_id FROM sections -OPTIONS (EXPECTED GROUP SIZE = 1000) +OPTIONS (DISTINCT ON INPUT GROUP SIZE = 1000) ORDER BY teacher_id ASC, course_id DESC; ``` -In the above examples, we see that the `EXPECTED GROUP SIZE` hint is always positioned in an `OPTIONS` clause after a `GROUP BY` clause, but before an `ORDER BY`, as captured by the [`SELECT` syntax]. However, in the case of Top K using a `LATERAL` subquery and `LIMIT`, the hint is specified in the subquery. For instance, the following materialized view illustrates how to incrementally maintain the top-3 section `id`s ranked by `course_id` for each teacher: +In the above examples, we see that the query hints are always positioned in an `OPTIONS` clause after a `GROUP BY` clause, but before an `ORDER BY`, as captured by the [`SELECT` syntax]. However, in the case of Top K using a `LATERAL` subquery and `LIMIT`, it is important to note that the hint is specified in the subquery. For instance, the following materialized view illustrates how to incrementally maintain the top-3 section `id`s ranked by `course_id` for each teacher: ```sql CREATE MATERIALIZED VIEW sections_of_top_3_courses_per_teacher AS @@ -351,12 +351,12 @@ FROM teachers grp, LATERAL (SELECT id AS section_id FROM sections WHERE teacher_id = grp.id - OPTIONS (EXPECTED GROUP SIZE = 1000) + OPTIONS (LIMIT INPUT GROUP SIZE = 1000) ORDER BY course_id DESC LIMIT 3); ``` -For indexed and materialized views that have already been created without specifying an `EXPECTED GROUP SIZE` query hint, Materialize includes an introspection view, [`mz_internal.mz_expected_group_size_advice`], that can be used to query, for a given cluster, all incrementally maintained [dataflows] where tuning of the `EXPECTED GROUP SIZE` could be beneficial. The introspection view also provides an advice value based on an estimate of how many levels could be cut from the hierarchy. The following query illustrates how to access this introspection view: +For indexed and materialized views that have already been created without specifying query hints, Materialize includes an introspection view, [`mz_internal.mz_expected_group_size_advice`], that can be used to query, for a given cluster, all incrementally maintained [dataflows] where tuning of the above query hints could be beneficial. The introspection view also provides an advice value based on an estimate of how many levels could be cut from the hierarchy. The following query illustrates how to access this introspection view: ```sql SELECT dataflow_name, region_name, levels, to_cut, hint @@ -364,13 +364,13 @@ FROM mz_internal.mz_expected_group_size_advice ORDER BY dataflow_name, region_name; ``` -The column `hint` provides the estimated value to be provided to the `EXPECTED GROUP SIZE`. +The column `hint` provides the estimated value to be provided to the `AGGREGATE INPUT GROUP SIZE` in the case of a `MIN` or `MAX` aggregation or to the `DISTINCT ON INPUT GROUP SIZE` or `LIMIT INPUT GROUP SIZE` in the case of a Top K pattern. ## Learn more Check out the blog post [Delta Joins and Late Materialization](https://materialize.com/blog/delta-joins/) to go deeper on join optimization in Materialize. -[query hint]: /sql/select/#query-hints +[query hints]: /sql/select/#query-hints [arrangements]: /get-started/arrangements/#arrangements [`MIN`]: /sql/functions/#min [`MAX`]: /sql/functions/#max