Skip to content

Conversation

@shangxinli
Copy link
Contributor

Rewrite truncate(col) == "value" predicates to col STARTS_WITH "value" for string columns. This enables better predicate pushdown to storage formats and efficient use of prefix indexes.

The optimization only applies when:

  • Operation is equality
  • Term is a truncate transform
  • Literal is a string type

Added tests to verify correct application and edge cases.

@shangxinli shangxinli force-pushed the truncate_literal branch 9 times, most recently from a831452 to 5cbeca0 Compare November 3, 2025 18:51
@shangxinli shangxinli changed the title optimize truncate(col) == value to startsWith predicate feat: optimize truncate(col) == value to startsWith predicate Nov 3, 2025
@wgtmac
Copy link
Member

wgtmac commented Nov 4, 2025

There are some merge conflicts...

@shangxinli
Copy link
Contributor Author

Fixed

/// This matches the behavior of TruncateUtils::TruncateUTF8.
///
/// NOTE: This counts code points, not grapheme clusters (user-perceived characters).
/// Per the Iceberg spec, combining marks count as separate code points.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg spec does not have this. We should not mislead readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

///
/// \param str The UTF-8 encoded string
/// \return The number of code points (not bytes, not graphemes)
inline int32_t CountUTF8CodePoints(std::string_view str) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be moved to string_util.h or truncate_util.h

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved

Comment on lines 285 to 290
auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get());
if (!transform_term) {
// Should never happen after kind check, but be defensive
return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term),
std::move(literal));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get());
if (!transform_term) {
// Should never happen after kind check, but be defensive
return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term),
std::move(literal));
}
auto* transform_term = internal::checked_cast<BoundTransform*>(bound_term.get());

Let's reuse the existing pattern for cast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

!literal.IsNull()) { // Null safety: skip null literals

// Extract width parameter using type-safe API
auto width_opt = transform_term->transform()->param();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can greatly simplify this line and below.

We can add transform_func like below.

class ICEBERG_EXPORT BoundTransform : public BoundTerm {
 public:
  const std::shared_ptr<TransformFunction>& transform_func() const { return transform_func_; }
};

Then we just need to call transform_term->transform_func()->Transform(literal) and check if its return value is same as literal. We don't even need CountUTF8CodePoints above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

/// For non-parameterized transforms (identity, year, etc.), returns std::nullopt.
///
/// \return The parameter if present, otherwise std::nullopt
std::optional<int32_t> param() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this function? This makes it harder to evolve the variant in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@shangxinli shangxinli force-pushed the truncate_literal branch 4 times, most recently from 0aad922 to 1d01343 Compare November 11, 2025 17:32
Implements optimization to rewrite truncate equality predicates as
startsWith for better predicate pushdown and index usage.

The optimization applies when:
- Operation is equality
- Term is a truncate transform on string column
- Literal has exactly the truncate width in UTF-8 code points

Implementation uses transform_func()->Transform() to validate that:
1. truncate(literal) == literal (literal is compatible)
2. truncate(literal + 'x') == literal (literal has exact width)

This approach leverages the transform function without duplicating
UTF-8 code point counting logic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants