-
Notifications
You must be signed in to change notification settings - Fork 68
feat: optimize truncate(col) == value to startsWith predicate #289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a831452 to
5cbeca0
Compare
5cbeca0 to
ab63064
Compare
|
There are some merge conflicts... |
dfb4504 to
8076d44
Compare
|
Fixed |
src/iceberg/expression/predicate.cc
Outdated
| /// This matches the behavior of TruncateUtils::TruncateUTF8. | ||
| /// | ||
| /// NOTE: This counts code points, not grapheme clusters (user-perceived characters). | ||
| /// Per the Iceberg spec, combining marks count as separate code points. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iceberg spec does not have this. We should not mislead readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
src/iceberg/expression/predicate.cc
Outdated
| /// | ||
| /// \param str The UTF-8 encoded string | ||
| /// \return The number of code points (not bytes, not graphemes) | ||
| inline int32_t CountUTF8CodePoints(std::string_view str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should be moved to string_util.h or truncate_util.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved
| auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get()); | ||
| if (!transform_term) { | ||
| // Should never happen after kind check, but be defensive | ||
| return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term), | ||
| std::move(literal)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get()); | |
| if (!transform_term) { | |
| // Should never happen after kind check, but be defensive | |
| return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term), | |
| std::move(literal)); | |
| } | |
| auto* transform_term = internal::checked_cast<BoundTransform*>(bound_term.get()); |
Let's reuse the existing pattern for cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/iceberg/expression/predicate.cc
Outdated
| !literal.IsNull()) { // Null safety: skip null literals | ||
|
|
||
| // Extract width parameter using type-safe API | ||
| auto width_opt = transform_term->transform()->param(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can greatly simplify this line and below.
We can add transform_func like below.
class ICEBERG_EXPORT BoundTransform : public BoundTerm {
public:
const std::shared_ptr<TransformFunction>& transform_func() const { return transform_func_; }
};Then we just need to call transform_term->transform_func()->Transform(literal) and check if its return value is same as literal. We don't even need CountUTF8CodePoints above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/iceberg/transform.h
Outdated
| /// For non-parameterized transforms (identity, year, etc.), returns std::nullopt. | ||
| /// | ||
| /// \return The parameter if present, otherwise std::nullopt | ||
| std::optional<int32_t> param() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this function? This makes it harder to evolve the variant in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
0aad922 to
1d01343
Compare
Implements optimization to rewrite truncate equality predicates as startsWith for better predicate pushdown and index usage. The optimization applies when: - Operation is equality - Term is a truncate transform on string column - Literal has exactly the truncate width in UTF-8 code points Implementation uses transform_func()->Transform() to validate that: 1. truncate(literal) == literal (literal is compatible) 2. truncate(literal + 'x') == literal (literal has exact width) This approach leverages the transform function without duplicating UTF-8 code point counting logic.
1d01343 to
7c51a61
Compare
Rewrite truncate(col) == "value" predicates to col STARTS_WITH "value" for string columns. This enables better predicate pushdown to storage formats and efficient use of prefix indexes.
The optimization only applies when:
Added tests to verify correct application and edge cases.