Description
What happened:
When attempting to join tables using a WHERE clause whose condition requires casting a column (i.e. comparing columns of different dtypes), we sometimes fail on the Rust side due to an unexpected join condition, or on the Python side due to improper datatype handling.
What you expected to happen:
I would expect the joins to succeed, with the result filtered properly by the condition on the cast column.
Minimal Complete Verifiable Example:
If we attempt a join that would require a cast operation on the left-hand column, we fail at PyJoin.getCondition():
import pandas as pd
from dask_sql import Context
c = Context()
c.create_table("df1", pd.DataFrame({"a": [0, 1], "b": ["cat", "dog"]}))
c.create_table("df2", pd.DataFrame({"c": ["0", "1"], "d": ["cat", "dog"]}))
# Relevant logical plan (cast applied to the left-hand column):
# Inner Join: CAST(df1.a AS Utf8) = df2.c
# TableScan: df1 projection=[a, b], full_filters=[CAST(df1.a AS Utf8) IS NOT NULL]
# TableScan: df2 projection=[c, d], full_filters=[df2.c IS NOT NULL]
c.sql("select * from df1, df2 where a = c")
# TypeError: "unsupported join condition. Left: CAST(df1.a AS Utf8) - Right: df2.c"
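For reference, the condition in that plan amounts to casting the left-hand column before merging. A minimal pandas sketch of the expected behaviour (the _join_key helper column is made up for illustration; this is not how dask-sql lowers the plan):
import pandas as pd
df1 = pd.DataFrame({"a": [0, 1], "b": ["cat", "dog"]})
df2 = pd.DataFrame({"c": ["0", "1"], "d": ["cat", "dog"]})
# Apply CAST(df1.a AS Utf8) as a temporary key, then join on it
expected = (
    df1.assign(_join_key=df1["a"].astype(str))
       .merge(df2, left_on="_join_key", right_on="c")
       .drop(columns="_join_key")
)
print(expected)  # two rows, matching a=0/c="0" and a=1/c="1"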
Conversely, if the cast operation is required on the right-hand column, we don't handle it properly, which can cause issues on the Python side:
c.create_table("df1", pd.DataFrame({"a": ["0", "1"], "b": ["cat", "dog"]}))
c.create_table("df2", pd.DataFrame({"c": [0, 1], "d": ["cat", "dog"]}))
# Relevant logical plan (cast applied to the right-hand column):
# Inner Join: df1.a = CAST(df2.c AS Utf8)
# TableScan: df1 projection=[a, b], full_filters=[df1.a IS NOT NULL]
# TableScan: df2 projection=[c, d], full_filters=[CAST(df2.c AS Utf8) IS NOT NULL]
c.sql("select * from df1, df2 where a = c")
# ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
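This matches what plain pandas raises when merging columns of mismatched dtypes; applying the plan's CAST(df2.c AS Utf8) before the merge avoids it. A minimal sketch (not the dask-sql implementation):
import pandas as pd
df1 = pd.DataFrame({"a": ["0", "1"], "b": ["cat", "dog"]})
df2 = pd.DataFrame({"c": [0, 1], "d": ["cat", "dog"]})
# df1.merge(df2, left_on="a", right_on="c")  # raises the same ValueError
merged = df1.merge(df2.assign(c=df2["c"].astype(str)), left_on="a", right_on="c")
print(merged)  # succeeds once the right-hand column is cast to string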
Anything else we need to know?:
It looks like this case is part of the query failures currently blocking CI (example of a failing run).
Environment:
- dask-sql version: latest main
- Python version: 3.10
- Operating System: Ubuntu
- Install method (conda, pip, source): source