Datafusion fails to read from LanceDataset #3281

dreverri · 2024-12-20T19:22:31Z

I'm getting the following error when trying to read a LanceDB table with Datafusion:

[2024-12-20T19:10:11Z WARN  lance::dataset::write::insert] No existing dataset at /lance-dataset/data/sample-lancedb/my_table.lance, it will be created
Traceback (most recent call last):
  File "/lance-dataset/hello.py", line 23, in <module>
    main()
    ~~~~^^
  File "/lance-dataset/hello.py", line 19, in main
    df.show()
    ~~~~~~~^^
  File "/lance-dataset/.venv/lib/python3.13/site-packages/datafusion/dataframe.py", line 360, in show
    self.df.show(num)
    ~~~~~~~~~~~~^^^^^
Exception: External error: TypeError: LanceFragment.scanner() takes 1 positional argument but 2 positional arguments (and 3 keyword-only arguments) were given

I'm not sure if this is an issue with LanceDataset or Datafusion or if I am just doing something wrong.

Here is the code:

from datafusion import SessionContext
import lancedb


def main():
    uri = "data/sample-lancedb"
    db = lancedb.connect(uri)

    data = [
        {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
        {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
    ]

    tbl = db.create_table("my_table", data=data, mode="overwrite")

    ctx = SessionContext()
    ctx.register_dataset("my_table", tbl.to_lance())
    df = ctx.table("my_table")
    df.show()


if __name__ == "__main__":
    main()

westonpace · 2024-12-20T20:35:53Z

Looks like you're using DataFusion's pyarrow integration to read from a pyarrow dataset. Lance mimics a pyarrow dataset, which is how DuckDB is able to query it. However, it seems that we don't mimic it faithfully enough 😄 and so DataFusion is getting confused.

I seem to recall digging into this a while back: DataFusion wants to split the dataset into fragments and query each fragment separately, and we never fully fleshed out the pyarrow fragment integration.
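To illustrate the kind of mismatch involved, here is a minimal, library-free sketch. The two classes are hypothetical stand-ins, not the real Lance or pyarrow types, and the calling convention (one positional argument plus three keyword arguments) is only inferred from the error message in the traceback above, not taken from the DataFusion source.

```python
class StrictFragment:
    """Hypothetical stand-in: scanner() accepts keyword arguments only."""

    def scanner(self, *, columns=None, filter=None, batch_size=None):
        return {"columns": columns, "filter": filter, "batch_size": batch_size}


class LenientFragment:
    """Hypothetical stand-in: scanner() also accepts a positional schema."""

    def scanner(self, schema=None, *, columns=None, filter=None, batch_size=None):
        return {"schema": schema, "columns": columns,
                "filter": filter, "batch_size": batch_size}


def scan_like_the_caller(fragment):
    # Assumed calling convention: one positional argument plus three
    # keyword arguments, matching the counts in the reported TypeError.
    return fragment.scanner("schema-placeholder",
                            columns=["item"], filter=None, batch_size=1024)


def try_scan(fragment):
    """Return the TypeError message, or None if the call succeeds."""
    try:
        scan_like_the_caller(fragment)
        return None
    except TypeError as exc:
        return str(exc)


strict_error = try_scan(StrictFragment())    # fails: positional arg rejected
lenient_error = try_scan(LenientFragment())  # succeeds: returns None
print(strict_error)
print(lenient_error)
```

The strict class raises a TypeError of the same shape as the one in the report, while the lenient one accepts the call. That suggests the signature of the fragment's scanner method is where the two interfaces diverge.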

So there are two options for fixing this. First, we could fix up the Python interface to mimic a pyarrow dataset more faithfully. However, pyarrow dataset wasn't really intended to be a standard / protocol, and this approach has a few limitations:

You won't get the proper parallelism on reads
Filters are not pushed down (or maybe they are but only a limited subset are supported)
Some python overhead (not sure if it is per-batch overhead or not but it might be and that could be significant for some queries)

A different approach (now that apache/datafusion-python#823 has merged) would be to do something like this: https://github.com/delta-io/delta-rs/pull/3012/files

That would be limited to newer versions of datafusion python (43.1 and above) but would overcome the above drawbacks and be easier to maintain.
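For anyone wanting to guard on that version requirement, a minimal sketch of a check follows. The helper names are hypothetical, and the 43.1 threshold is simply the one stated above.

```python
# Minimum datafusion-python version mentioned above for the
# table-provider approach (taken from this thread, not verified here).
MIN_TABLE_PROVIDER_VERSION = (43, 1)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '43.1.0' into (43, 1, 0), dropping any non-numeric suffix."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_table_provider(version: str) -> bool:
    """True if the given version string is at least 43.1."""
    return parse_version(version) >= MIN_TABLE_PROVIDER_VERSION
```

In practice the version string could come from `importlib.metadata.version("datafusion")`, assuming the package is installed under that distribution name.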

westonpace · 2024-12-20T20:45:44Z

(to be clear, both approaches will require changes to Lance)
