-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design doc for ambiguous GROUP SIZE
hints
#21500
Conversation
0c104bf
to
4639d73
Compare
This design doc analyzes the problem of ambiguity when specifying the EXPECTED GROUP SIZE hint in a single SQL query block where multiple constructs map to different instances of reductions and top-k operators. A backwards-compatible solution is proposed based on attaching additional syntax-based hints when disambiguation is necessary.
4639d73
to
fbd53ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some more thought suggest the following names:
DISTINCT ON INPUT GROUP SIZE
AGGREGATE INPUT GROUP SIZE
LIMIT INPUT SIZE
The general structure is <SYNTAX FRAGMENT> INPUT [GROUP] SIZE
. The LIMIT
option does not contain the optional GROUP
modifier because we will always have a single group.
We should update the docs to hammer some key points. More specifically:
- The bot the "input" and the "input group" size should be estimated w.r.t. a fixed outer context. This is important because people often use lateral joins to compute the top-k elements per group.
higher potential for memory savings with minimal changes to their SQL (i.e., by adding an extra | ||
hint to the `OPTIONS` clause). | ||
|
||
## Out of Scope |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @teskje pointed out, the current parser structure is a bit weird because the options are attached at the Select
level, but we want them attached at the Query
level.
We need to double-check with @MaterializeInc/surfaces if there is a fundamental reason not to move this around, and move this if possible. I'm not sure if this should be in scope here (I would prefer if we can squeeze this in because at first glance it doesn't seem to be a lot of work, but I might be wrong).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we attach them at the query level, wouldn't we run into issues with referring to multiple aggregations/topks in nested SELECTs?
My thinking was to leave the OPTIONS
clause attached to SELECTs, but move it at the end instead of expecting this weird position in the middle. And maybe rename OPTIONS
to QUERY HINTS
or something equally expressive. We'd still need to keep supporting the original syntax for compatibility reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we attach them at the query level, wouldn't we run into issues with referring to multiple aggregations/topks in nested SELECTs?
Nested SELECT
syntax is are possible even today (for example in the from
field of the Select
AST node.
My thinking was to leave the OPTIONS clause attached to SELECTs, but move it at the end instead of expecting this weird position in the middle. And maybe rename OPTIONS to QUERY HINTS or something equally expressive. We'd still need to keep supporting the original syntax for compatibility reasons.
but move it at the end instead of expecting this weird position in the middle
Well, this is the problem. The ORDER BY
, LIMIT
and OFFSET
options are not treated as part of the Select
, but of the enclosing Query
.
If we want to allow them, instead of
pub struct Select<T: AstInfo> {
pub distinct: Option<Distinct<T>>,
/// projection expressions
pub projection: Vec<SelectItem<T>>,
/// FROM
pub from: Vec<TableWithJoins<T>>,
/// WHERE
pub selection: Option<Expr<T>>,
/// GROUP BY
pub group_by: Vec<Expr<T>>,
/// HAVING
pub having: Option<Expr<T>>,
/// OPTION
pub options: Vec<SelectOption<T>>,
}
we will need to have the following definition
pub struct Query<T: AstInfo> {
/// WITH (common table expressions, or CTEs)
pub ctes: CteBlock<T>,
/// SELECT or UNION / EXCEPT / INTERSECT
pub body: SetExpr<T>,
/// ORDER BY
pub order_by: Vec<OrderByExpr<T>>,
/// `LIMIT { <N> | ALL }`
/// `FETCH { FIRST | NEXT } <N> { ROW | ROWS } | { ONLY | WITH TIES }`
pub limit: Option<Limit<T>>,
/// `OFFSET <N> { ROW | ROWS }`
pub offset: Option<Expr<T>>,
/// OPTION
pub options: Vec<SelectOption<T>>,
}
so the hints can be parsed at the end of a query block (note that subqueries and CTE values are modeled in the AST as a Query
, not as a Select
).
When handling these options during planning, we will need to make sure that we only apply the hints to the top-level Select
nodes in the body
. I believe this will achieve exactly what you wish, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, maybe I'm confused about what "select" and "query" mean then.
My assumption was that this is a query with three selects:
SELECT
sum(max_revenue) AS sum_max_revenue
FROM (
SELECT max_revenue
FROM (
SELECT l_orderkey,
max(l_extendedprice * (1 - l_discount)) AS max_revenue
FROM lineitem
GROUP BY l_orderkey
)
ORDER BY max_revenue DESC
LIMIT 10
);
... and each select can contain things that we want to supply optimizer hints for, so if you were only able to supply a single OPTIONS
clause at the end of the whole query it would be hard to target constructs in individual selects.
But from your code snippets it looks like the above is instead a query with two nested subqueries and each of the three queries contains a select? If that's the case we are saying the same thing, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But from your code snippets it looks like the above is instead a query with two nested subqueries and each of the three queries contains a select?
Looking at the AST definitions I would expect to see that, yes!
b5b1bf5
to
6b773b6
Compare
6b773b6
to
1815c12
Compare
higher potential for memory savings with minimal changes to their SQL (i.e., by adding an extra | ||
hint to the `OPTIONS` clause). | ||
|
||
## Out of Scope |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we attach them at the query level, wouldn't we run into issues with referring to multiple aggregations/topks in nested SELECTs?
My thinking was to leave the OPTIONS
clause attached to SELECTs, but move it at the end instead of expecting this weird position in the middle. And maybe rename OPTIONS
to QUERY HINTS
or something equally expressive. We'd still need to keep supporting the original syntax for compatibility reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the revision.
@teskje I captured the additional issues (https://github.com/MaterializeInc/database-issues/issues/6472, https://github.com/MaterializeInc/database-issues/issues/6473) and some of your comments regarding docs in a new section "Other Remarks". Just let me know if any more discussion is needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thank you for the reviews and discussions! |
This commit implements the new GROUP SIZE query hint syntax proposed in design doc MaterializeInc#21500. The syntax comprises three new query hints: AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE. These hints apply to the syntax constructs they refer to, as opposed to EXPECTED GROUP SIZE, which applies to all constructs in a query block. The new syntax cannot be mixed with the old EXPECTED GROUP SIZE query hint, so an error is raised in case an invalid combination is detected.
This commit adds an SLT that validates the new syntax of group size hints as well as checks for backwards compatibility requirements with the old syntax. The test scenarios draw heavily from the MVP articulated in design document MaterializeInc#21500.
This commit implements the new GROUP SIZE query hint syntax proposed in design doc MaterializeInc#21500. The syntax comprises three new query hints: AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE. These hints apply to the syntax constructs they refer to, as opposed to EXPECTED GROUP SIZE, which applies to all constructs in a query block. The new syntax cannot be mixed with the old EXPECTED GROUP SIZE query hint, so an error is raised in case an invalid combination is detected.
This commit adds an SLT that validates the new syntax of group size hints as well as checks for backwards compatibility requirements with the old syntax. The test scenarios draw heavily from the MVP articulated in design document MaterializeInc#21500.
This commit implements the new GROUP SIZE query hint syntax proposed in design doc MaterializeInc#21500. The syntax comprises three new query hints: AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE. These hints apply to the syntax constructs they refer to, as opposed to EXPECTED GROUP SIZE, which applies to all constructs in a query block. The new syntax cannot be mixed with the old EXPECTED GROUP SIZE query hint, so an error is raised in case an invalid combination is detected.
This commit adds an SLT that validates the new syntax of group size hints as well as checks for backwards compatibility requirements with the old syntax. The test scenarios draw heavily from the MVP articulated in design document MaterializeInc#21500.
This commit implements the new GROUP SIZE query hint syntax proposed in design doc MaterializeInc#21500. The syntax comprises three new query hints: AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE. These hints apply to the syntax constructs they refer to, as opposed to EXPECTED GROUP SIZE, which applies to all constructs in a query block. The new syntax cannot be mixed with the old EXPECTED GROUP SIZE query hint, so an error is raised in case an invalid combination is detected.
This commit adds an SLT that validates the new syntax of group size hints as well as checks for backwards compatibility requirements with the old syntax. The test scenarios draw heavily from the MVP articulated in design document MaterializeInc#21500.
This commit revises the user-facing documentation to refer to the new group size hints, namely AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE, instead of the old syntax with the EXPECTED GROUP SIZE. Even though the old syntax is still available for backwards compatibility, the documentation is being changed to only refer to the new syntax. As we help customers move to the new syntax, we will eventually be in a position where the old syntax can be deprecated. This strategy was proposed in design doc MaterializeInc#21500.
This commit implements the new GROUP SIZE query hint syntax proposed in design doc MaterializeInc#21500. The syntax comprises three new query hints: AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE. These hints apply to the syntax constructs they refer to, as opposed to EXPECTED GROUP SIZE, which applies to all constructs in a query block. The new syntax cannot be mixed with the old EXPECTED GROUP SIZE query hint, so an error is raised in case an invalid combination is detected.
This commit adds an SLT that validates the new syntax of group size hints as well as checks for backwards compatibility requirements with the old syntax. The test scenarios draw heavily from the MVP articulated in design document MaterializeInc#21500.
This commit revises the user-facing documentation to refer to the new group size hints, namely AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE, instead of the old syntax with the EXPECTED GROUP SIZE. Even though the old syntax is still available for backwards compatibility, the documentation is being changed to only refer to the new syntax. As we help customers move to the new syntax, we will eventually be in a position where the old syntax can be deprecated. This strategy was proposed in design doc MaterializeInc#21500.
This commit revises the user-facing documentation to refer to the new group size hints, namely AGGREGATE INPUT GROUP SIZE, DISTINCT ON INPUT GROUP SIZE, and LIMIT INPUT GROUP SIZE, instead of the old syntax with the EXPECTED GROUP SIZE. Even though the old syntax is still available for backwards compatibility, the documentation is being changed to only refer to the new syntax. As we help customers move to the new syntax, we will eventually be in a position where the old syntax can be deprecated. This strategy was proposed in design doc MaterializeInc#21500.
This design doc analyzes the problem of ambiguity when specifying the
EXPECTED GROUP SIZE
hint in a single SQL query block where multiple constructs map to different instances of reductions and top-k operators. A backwards-compatible solution is proposed based on attaching additional syntax-based hints when disambiguation is necessary.Advances MaterializeInc/database-issues#5578.
Rendered version.
Motivation
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label. N/A