BETWEEN query might fail to be recognized as candidate for scalar index due to type differences #3311

Open
westonpace opened this issue Dec 27, 2024 · 0 comments

To reproduce:

import lance
import pyarrow as pa

def test_indexed_between(tmp_path):
    dataset = lance.write_dataset(
        pa.table({"u32": pa.array(range(100), pa.uint32())}),
        tmp_path,
    )
    dataset.create_scalar_index("u32", index_type="BTREE")

    scanner = dataset.scanner(
        filter="u32 BETWEEN 10 AND 20",
        columns=[],
        with_row_id=True,
        prefilter=True,
    )
    assert "MaterializeIndex" in scanner.explain_plan()

The problem is that the filter compiles down to the physical expression CAST(u32 as u64) >= 10_u64 AND CAST(u32 as u64) <= 20_u64, and the scalar index parser gets confused by the CAST expressions, so the query is not recognized as a candidate for the scalar index.

If we instead pass the filter as u32 >= 10 AND u32 <= 20, then the SQL parser correctly infers the type of the literals to be u32.

Not sure which is easiest: fixing the SQL parser, inserting some kind of optimizer rule in datafusion (we should always be able to normalize CAST(column) BINARY_OP literal into column BINARY_OP CAST(literal)), or putting a workaround in lance.
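
To make the proposed normalization concrete, here is a minimal sketch on a toy expression tree. None of these classes or functions exist in lance or datafusion; Column, Literal, Cast, BinaryOp, and unwrap_cast are hypothetical names used only for illustration. The sketch also assumes the rewrite is applied only when the cast is widening and the literal fits in the column's type, since otherwise casting the literal down would change the result.

```python
from dataclasses import dataclass

# Value ranges for the integer types in this toy model (assumption: only
# unsigned integers are considered here).
INT_RANGES = {"u32": (0, 2**32 - 1), "u64": (0, 2**64 - 1)}
COMPARISONS = {"=", "!=", "<", "<=", ">", ">="}


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str


@dataclass(frozen=True)
class Literal:
    value: int
    dtype: str


@dataclass(frozen=True)
class Cast:
    expr: object
    to: str


@dataclass(frozen=True)
class BinaryOp:
    left: object
    op: str
    right: object


def fits(value, dtype):
    """Return True if value is representable in dtype."""
    lo, hi = INT_RANGES[dtype]
    return lo <= value <= hi


def is_widening(src, dst):
    """Return True if every value of src is representable in dst."""
    (src_lo, src_hi), (dst_lo, dst_hi) = INT_RANGES[src], INT_RANGES[dst]
    return dst_lo <= src_lo and src_hi <= dst_hi


def unwrap_cast(expr):
    """Rewrite CAST(column) op literal into column op CAST(literal)."""
    if isinstance(expr, BinaryOp) and expr.op in ("AND", "OR"):
        # Recurse into both sides of a boolean combination.
        return BinaryOp(unwrap_cast(expr.left), expr.op, unwrap_cast(expr.right))
    if (
        isinstance(expr, BinaryOp)
        and expr.op in COMPARISONS
        and isinstance(expr.left, Cast)
        and isinstance(expr.left.expr, Column)
        and isinstance(expr.right, Literal)
    ):
        col = expr.left.expr
        # Only rewrite when no information is lost in either direction.
        if is_widening(col.dtype, expr.left.to) and fits(expr.right.value, col.dtype):
            return BinaryOp(col, expr.op, Literal(expr.right.value, col.dtype))
    # Anything else is left untouched.
    return expr


# The physical expression from this issue, built by hand.
u32 = Column("u32", "u32")
between = BinaryOp(
    BinaryOp(Cast(u32, "u64"), ">=", Literal(10, "u64")),
    "AND",
    BinaryOp(Cast(u32, "u64"), "<=", Literal(20, "u64")),
)
normalized = unwrap_cast(between)
```

After the rewrite, normalized has the same shape as the plain u32 >= 10 AND u32 <= 20 filter, which is the form the scalar index parser already handles. A literal that does not fit in u32 is left as-is, so that case would still need separate handling.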
