Commit

Revised docs for new group size hints
This commit revises the user-facing documentation to refer to the new
group size hints, namely AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT
GROUP SIZE, and LIMIT INPUT GROUP SIZE, instead of the old syntax with
the EXPECTED GROUP SIZE. Even though the old syntax is still available
for backwards compatibility, the documentation is being changed to only
refer to the new syntax. As we help customers move to the new syntax,
we will eventually be in a position where the old syntax can be
deprecated. This strategy was proposed in design doc MaterializeInc#21500.
vmarcos committed Sep 5, 2023
1 parent 4c441a4 commit 32943c3
Showing 3 changed files with 22 additions and 20 deletions.
4 changes: 3 additions & 1 deletion doc/user/content/sql/select.md
@@ -124,7 +124,9 @@ The following query hints are valid within the `OPTION` clause.

Hint | Value type | Description
------|------------|------------
`EXPECTED GROUP SIZE` | `int` | How many rows will have the same group key. Materialize can render `min` and `max` expressions, and some [Top K patterns](/guides/top-k), more efficiently with this information.
`AGGREGATE INPUT GROUP SIZE` | `uint8` | How many rows will have the same group key in an aggregation. Materialize can render `min` and `max` expressions more efficiently with this information.
`DISTINCT ON INPUT GROUP SIZE` | `uint8` | How many rows will have the same group key in a `DISTINCT ON` expression. Materialize can render [Top K patterns](/guides/top-k) based on `DISTINCT ON` more efficiently with this information.
`LIMIT INPUT GROUP SIZE` | `uint8` | How many rows will be given as a group to a `LIMIT` restriction. Materialize can render [Top K patterns](/guides/top-k) based on `LIMIT` more efficiently with this information.

For examples, see the [Optimization](/transform-data/optimization/#query-hints) page.

12 changes: 6 additions & 6 deletions doc/user/content/sql/system-catalog/mz_internal.md
@@ -943,10 +943,10 @@ The `mz_dataflow_shutdown_durations_histogram` view describes a histogram of the

### `mz_expected_group_size_advice`

The `mz_expected_group_size_advice` view provides advice on opportunities to set the `EXPECTED GROUP SIZE`
[query hint]. This hint is applicable to dataflows maintaining [`MIN`], [`MAX`], or [Top K] query patterns. The
maintenance of these query patterns is implemented inside an operator scope, called a region, through a
hierarchical scheme for either aggregation or Top K computations.
The `mz_expected_group_size_advice` view provides advice on opportunities to set [query hints].
Query hints are applicable to dataflows maintaining [`MIN`], [`MAX`], or [Top K] query patterns.
The maintenance of these query patterns is implemented inside an operator scope, called a region,
through a hierarchical scheme for either aggregation or Top K computations.

<!-- RELATION_SPEC mz_internal.mz_expected_group_size_advice -->
| Field | Type | Meaning |
@@ -958,7 +958,7 @@ hierarchical scheme for either aggregation or Top K computations.
| `levels` | [`bigint`] | The number of levels in the hierarchical scheme implemented by the region. |
| `to_cut` | [`bigint`] | The number of levels that can be eliminated (cut) from the region's hierarchy. |
| `savings` | [`numeric`] | A conservative estimate of the amount of memory in bytes to be saved by applying the hint. |
| `hint` | [`double precision`] | The hint value for `EXPECTED GROUP SIZE` that will eliminate `to_cut` levels from the regions' hierarchy. |
| `hint` | [`double precision`] | The hint value that will eliminate `to_cut` levels from the region's hierarchy. |

### `mz_message_counts`

@@ -1085,7 +1085,7 @@ The `mz_comments` table stores optional comments (descriptions) for objects in t
[`MIN`]: /sql/functions/#min
[`MAX`]: /sql/functions/#max
[Top K]: /transform-data/patterns/top-k
[query hint]: /sql/select/#query-hints
[query hints]: /sql/select/#query-hints

<!-- RELATION_SPEC_UNDOCUMENTED mz_internal.mz_aggregates -->
<!-- RELATION_SPEC_UNDOCUMENTED mz_internal.mz_arrangement_batches_raw -->
26 changes: 13 additions & 13 deletions doc/user/content/transform-data/optimization.md
@@ -296,9 +296,9 @@ The following are the possible index usage types:
## Query hints
Materialize has at present one important [query hint]: the `EXPECTED GROUP SIZE`. This hint applies to indexed or materialized views that need to incrementally maintain [`MIN`], [`MAX`], or [Top K] queries, as specified, e.g., by `GROUP BY`, `LATERAL`, or `DISTINCT ON`. Maintaining these queries while delivering low latency result updates is demanding in terms of main memory. This is because Materialize builds a hierarchy of aggregations so that data can be physically partitioned into small groups. By having only small groups at each level of the hierarchy, we can make sure that recomputing aggregations is not slowed down by skew in the sizes of the original query groups.
Materialize has at present three important [query hints]: the `AGGREGATE INPUT GROUP SIZE`, the `DISTINCT ON INPUT GROUP SIZE`, and the `LIMIT INPUT GROUP SIZE`. These hints apply to indexed or materialized views that need to incrementally maintain [`MIN`], [`MAX`], or [Top K] queries, as specified by SQL aggregations, `DISTINCT ON`, or `LIMIT`, respectively. Maintaining these queries while delivering low latency result updates is demanding in terms of main memory. This is because Materialize builds a hierarchy of aggregations so that data can be physically partitioned into small groups. By having only small groups at each level of the hierarchy, we can make sure that recomputing aggregations is not slowed down by skew in the sizes of the original query groups.
The number of levels needed in the hierarchical scheme is by default set assuming that there may be large query groups in the input data. By specifying the `EXPECTED GROUP SIZE`, it is possible to refine this assumption, allowing Materialize to build a hierarchy with fewer levels and lower memory consumption without sacrificing update latency.
The number of levels needed in the hierarchical scheme is by default set assuming that there may be large query groups in the input data. By specifying the query hints, it is possible to refine this assumption, allowing Materialize to build a hierarchy with fewer levels and lower memory consumption without sacrificing update latency.
Consider the previous example with the collection `sections`. Maintenance of the maximum `course_id` per `teacher` can be achieved with a materialized view:
@@ -309,7 +309,7 @@ FROM sections
GROUP BY teacher_id;
```

If the largest number of `course_id` values that are allocated to a single `teacher_id` is known, then this number can be provided as the `EXPECTED GROUP SIZE`. For the query above, it is possible to get an estimate for this number by:
If the largest number of `course_id` values that are allocated to a single `teacher_id` is known, then this number can be provided as the `AGGREGATE INPUT GROUP SIZE`. For the query above, it is possible to get an estimate for this number by:

```sql
SELECT MAX(course_count)
@@ -320,29 +320,29 @@ FROM (
);
```

However, the estimate is based only on data that is already present in the system. So taking into account how much this largest number could expand is critical to avoid issues with update latency after tuning the `EXPECTED GROUP SIZE`.
However, the estimate is based only on data that is already present in the system. So taking into account how much this largest number could expand is critical to avoid issues with update latency after tuning the query hint.
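
One way to leave room for growth is to pad the observed maximum before using it as the hint value. The query below is a minimal sketch, assuming the `sections` collection from the example above. The per-teacher count subquery, the `per_teacher` alias, and the headroom factor of `2` are illustrative assumptions, not values recommended by Materialize.

```sql
-- Sketch: pad the largest observed group size to allow for future growth.
-- The factor 2 is an arbitrary assumption; pick one that matches how fast
-- the number of courses per teacher is expected to grow.
SELECT MAX(course_count) * 2 AS padded_group_size
FROM (
  SELECT teacher_id, COUNT(*) AS course_count
  FROM sections
  GROUP BY teacher_id
) AS per_teacher;
```

The padded value would then be supplied as the `AGGREGATE INPUT GROUP SIZE`, so that moderate growth in the largest group does not immediately invalidate the hint.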

For our example, let's suppose that we determined the largest number of courses per teacher to be `1000`. Then, the original definition of `max_course_id_per_teacher` can be revised to include the `EXPECTED GROUP SIZE` query hint as follows:
For our example, let's suppose that we determined the largest number of courses per teacher to be `1000`. Then, the original definition of `max_course_id_per_teacher` can be revised to include the `AGGREGATE INPUT GROUP SIZE` query hint as follows:

```sql
CREATE MATERIALIZED VIEW max_course_id_per_teacher AS
SELECT teacher_id, MAX(course_id)
FROM sections
GROUP BY teacher_id
OPTIONS (EXPECTED GROUP SIZE = 1000)
OPTIONS (AGGREGATE INPUT GROUP SIZE = 1000)
```

The same hint can be provided in [Top K] query patterns specified by `DISTINCT ON` or `LATERAL`. As examples, consider that we wish not to compute the maximum `course_id`, but rather the `id` of the section of this top course. This computation can be incrementally maintained by the following materialized view:
The other two hints can be provided in [Top K] query patterns specified by `DISTINCT ON` or `LIMIT`. As examples, consider that we wish not to compute the maximum `course_id`, but rather the `id` of the section of this top course. This computation can be incrementally maintained by the following materialized view:

```sql
CREATE MATERIALIZED VIEW section_of_top_course_per_teacher AS
SELECT DISTINCT ON(teacher_id) teacher_id, id AS section_id
FROM sections
OPTIONS (EXPECTED GROUP SIZE = 1000)
OPTIONS (DISTINCT ON INPUT GROUP SIZE = 1000)
ORDER BY teacher_id ASC, course_id DESC;
```

In the above examples, we see that the `EXPECTED GROUP SIZE` hint is always positioned in an `OPTIONS` clause after a `GROUP BY` clause, but before an `ORDER BY`, as captured by the [`SELECT` syntax]. However, in the case of Top K using a `LATERAL` subquery and `LIMIT`, the hint is specified in the subquery. For instance, the following materialized view illustrates how to incrementally maintain the top-3 section `id`s ranked by `course_id` for each teacher:
In the above examples, we see that the query hints are always positioned in an `OPTIONS` clause after a `GROUP BY` clause, but before an `ORDER BY`, as captured by the [`SELECT` syntax]. However, in the case of Top K using a `LATERAL` subquery and `LIMIT`, it is important to note that the hint is specified in the subquery. For instance, the following materialized view illustrates how to incrementally maintain the top-3 section `id`s ranked by `course_id` for each teacher:

```sql
CREATE MATERIALIZED VIEW sections_of_top_3_courses_per_teacher AS
@@ -351,26 +351,26 @@ FROM teachers grp,
LATERAL (SELECT id AS section_id
FROM sections
WHERE teacher_id = grp.id
OPTIONS (EXPECTED GROUP SIZE = 1000)
OPTIONS (LIMIT INPUT GROUP SIZE = 1000)
ORDER BY course_id DESC
LIMIT 3);
```

For indexed and materialized views that have already been created without specifying an `EXPECTED GROUP SIZE` query hint, Materialize includes an introspection view, [`mz_internal.mz_expected_group_size_advice`], that can be used to query, for a given cluster, all incrementally maintained [dataflows] where tuning of the `EXPECTED GROUP SIZE` could be beneficial. The introspection view also provides an advice value based on an estimate of how many levels could be cut from the hierarchy. The following query illustrates how to access this introspection view:
For indexed and materialized views that have already been created without specifying query hints, Materialize includes an introspection view, [`mz_internal.mz_expected_group_size_advice`], that can be used to query, for a given cluster, all incrementally maintained [dataflows] where tuning of the above query hints could be beneficial. The introspection view also provides an advice value based on an estimate of how many levels could be cut from the hierarchy. The following query illustrates how to access this introspection view:

```sql
SELECT dataflow_name, region_name, levels, to_cut, hint
FROM mz_internal.mz_expected_group_size_advice
ORDER BY dataflow_name, region_name;
```

The column `hint` provides the estimated value to be provided to the `EXPECTED GROUP SIZE`.
The column `hint` provides the estimated value to be provided to the `AGGREGATE INPUT GROUP SIZE` in the case of a `MIN` or `MAX` aggregation or to the `DISTINCT ON INPUT GROUP SIZE` or `LIMIT INPUT GROUP SIZE` in the case of a Top K pattern.
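
When many regions are listed, it may help to look first at those with the largest estimated benefit. The query below is a minimal sketch that relies on the `savings` column documented for `mz_internal.mz_expected_group_size_advice`. The `LIMIT 10` cutoff is an arbitrary assumption for illustration.

```sql
-- Sketch: rank advice rows by the conservative memory-savings estimate (bytes).
-- LIMIT 10 is an arbitrary cutoff chosen for this example.
SELECT dataflow_name, region_name, levels, to_cut, savings, hint
FROM mz_internal.mz_expected_group_size_advice
ORDER BY savings DESC
LIMIT 10;
```

Because `savings` is described as a conservative estimate, this ordering is best read as a rough prioritization rather than an exact measurement.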

## Learn more

Check out the blog post [Delta Joins and Late Materialization](https://materialize.com/blog/delta-joins/) to go deeper on join optimization in Materialize.

[query hint]: /sql/select/#query-hints
[query hints]: /sql/select/#query-hints
[arrangements]: /get-started/arrangements/#arrangements
[`MIN`]: /sql/functions/#min
[`MAX`]: /sql/functions/#max