@fivetran-amrutabhimsenayachit

When STARTS_WITH is used with BLOB/BYTES types that are not literals (e.g., from CAST, table columns, function results), the transpiled DuckDB query fails because DuckDB's starts_with only accepts VARCHAR.

Before:

sqlglot %  bq --project_id fivetran-wild-west query --use_legacy_sql=false "SELECT STARTS_WITH(CAST('foo' AS BYTES), CAST('f' AS BYTES))"                            
+------+
| f0_  |
+------+
| true |
+------+
sqlglot % python3 -c "import sqlglot; print(sqlglot.transpile(\"SELECT STARTS_WITH(CAST('foo' AS BYTES), CAST('f' AS BYTES))\", read='bigquery', write='duckdb')[0])"
SELECT STARTS_WITH(CAST('foo' AS BLOB), CAST('f' AS BLOB))

sqlglot % duckdb -c "SELECT STARTS_WITH(CAST('foo' AS BLOB), CAST('f' AS BLOB))"                                                         
Binder Error:
No function matches the given name and argument types 'starts_with(BLOB, BLOB)'. You might need to add explicit type casts.
        Candidate functions:
        starts_with(VARCHAR, VARCHAR) -> BOOLEAN


LINE 1: SELECT STARTS_WITH(CAST('foo' AS BLOB), CAST('f' AS BLOB))

After:

sqlglot % bq --project_id fivetran-wild-west query --use_legacy_sql=false "SELECT STARTS_WITH(CAST('foo' AS BYTES), CAST('f' AS BYTES))"     
+------+
| f0_  |
+------+
| true |
+------+
sqlglot % python3 -c "import sqlglot; print(sqlglot.transpile(\"SELECT STARTS_WITH(CAST('foo' AS BYTES), CAST('f' AS BYTES))\", read='bigquery', write='duckdb')[0])"
SELECT STARTS_WITH(CAST(CAST('foo' AS BLOB) AS TEXT), CAST(CAST('f' AS BLOB) AS TEXT))

sqlglot % duckdb -c "SELECT STARTS_WITH(CAST(CAST('foo' AS BLOB) AS TEXT), CAST(CAST('f' AS BLOB) AS TEXT))"
┌───────────────────────────────────────────────────────────────────────┐
│ starts_with(CAST('foo'::BLOB AS VARCHAR), CAST('f'::BLOB AS VARCHAR)) │
│                                boolean                                │
├───────────────────────────────────────────────────────────────────────┤
│ true                                                                  │
└───────────────────────────────────────────────────────────────────────┘

@fivetran-amrutabhimsenayachit force-pushed the RD-1050424-transpile-big-querys-starts-with-string-function-to-duck-db branch 2 times, most recently from a99ba1f to 93b7efa (November 7, 2025 15:18)

@georgesittas left a comment

We should mimic the implementation for Lower and Upper. Take a look at _case_conversion_sql.

@fivetran-amrutabhimsenayachit force-pushed the RD-1050424-transpile-big-querys-starts-with-string-function-to-duck-db branch from 82a535d to fd278a0 (November 10, 2025 17:29)

@georgesittas left a comment

Just one comment, looks good otherwise. Thanks!

Comment on lines +1218 to +1228
    # Annotate types if needed for type-based casting
    if not arg.type:
        from sqlglot.optimizer.annotate_types import annotate_types

        annotate_types(arg, dialect=self.dialect)

    # Convert ByteString to String literal before generation
    # ByteStrings get typed as UNKNOWN and would be wrapped in CAST(...AS BLOB) by generator
    if isinstance(arg, exp.ByteString):
        arg.replace(exp.Literal.string(arg.this))
    # Cast non-VARCHAR types to VARCHAR

I don't think this logic is necessary?

  1. We don't have to annotate types here. Transpiling this successfully from BigQuery to DuckDB requires that we run type inference first, so the types should already be available. For DuckDB -> DuckDB we don't need types: the VARCHAR/UNKNOWN branch should suffice, since the type will always be unknown and we won't cast, which preserves the SQL on roundtrip.
  2. It doesn't look like we should treat ByteString as a special case. Although a bit verbose, doing a double cast is probably the safer way to go.
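
The two points above can be illustrated with a minimal sketch. It is a simplified model that works on plain SQL strings, not sqlglot's actual generator code: the names `starts_with_arg_sql`, `starts_with_sql`, and `PASS_THROUGH_TYPES` are hypothetical, and the type names are assumed to arrive as plain strings (or `None` when no type inference has run).

```python
from typing import Optional

# Hypothetical set of types that need no cast: text types, plus UNKNOWN so that
# untyped (e.g. DuckDB -> DuckDB) expressions roundtrip unchanged.
PASS_THROUGH_TYPES = {"VARCHAR", "TEXT", "UNKNOWN"}


def starts_with_arg_sql(arg_sql: str, arg_type: Optional[str]) -> str:
    """Return the argument SQL, wrapped in CAST(... AS TEXT) only when needed."""
    # No type information at all is treated like UNKNOWN: leave the SQL as-is.
    if arg_type is None or arg_type.upper() in PASS_THROUGH_TYPES:
        return arg_sql
    # Any other known type (e.g. BLOB) is cast, producing the double cast
    # seen in the "After" output when the argument is itself a CAST.
    return f"CAST({arg_sql} AS TEXT)"


def starts_with_sql(
    this_sql: str,
    this_type: Optional[str],
    prefix_sql: str,
    prefix_type: Optional[str],
) -> str:
    """Build a STARTS_WITH call with both arguments conditionally cast."""
    left = starts_with_arg_sql(this_sql, this_type)
    right = starts_with_arg_sql(prefix_sql, prefix_type)
    return f"STARTS_WITH({left}, {right})"


if __name__ == "__main__":
    # Typed BLOB arguments get the extra cast.
    print(starts_with_sql("CAST('foo' AS BLOB)", "BLOB", "CAST('f' AS BLOB)", "BLOB"))
    # Untyped arguments are left untouched.
    print(starts_with_sql("'foo'", None, "'f'", None))
```

Under these assumptions there is no special case for byte-string literals and no type annotation inside the function: the caller is expected to have run type inference already, and anything untyped simply passes through.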
