I'm getting the following error when trying to read a LanceDB table with Datafusion:
[2024-12-20T19:10:11Z WARN lance::dataset::write::insert] No existing dataset at /lance-dataset/data/sample-lancedb/my_table.lance, it will be created
Traceback (most recent call last):
File "/lance-dataset/hello.py", line 23, in <module>
main()
~~~~^^
File "/lance-dataset/hello.py", line 19, in main
df.show()
~~~~~~~^^
File "/lance-dataset/.venv/lib/python3.13/site-packages/datafusion/dataframe.py", line 360, in show
self.df.show(num)
~~~~~~~~~~~~^^^^^
Exception: External error: TypeError: LanceFragment.scanner() takes 1 positional argument but 2 positional arguments (and 3 keyword-only arguments) were given
I'm not sure if this is an issue with LanceDataset or Datafusion or if I am just doing something wrong.
Looks like you're using Datafusion's pyarrow integration to read from a pyarrow dataset. Lance mimics a pyarrow dataset, which is how we are able to be queried from DuckDB. However, it seems that we don't mimic it faithfully enough 😄 and so Datafusion is getting confused.
I seem to recall digging into this a while back: Datafusion wants to split the dataset into fragments and query each fragment separately, and we never really fleshed out the pyarrow fragment integration completely.
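To make the traceback above easier to read, here is a minimal pure-Python sketch of the kind of signature mismatch that produces this `TypeError`. Both fragment classes and the `scan_like_engine` helper are hypothetical stand-ins, not the real Lance, pyarrow, or Datafusion code. The assumption is that the caller passes the schema positionally plus three keyword arguments, which is what the counts in the error message suggest.

```python
class PyArrowStyleFragment:
    """Hypothetical stand-in: accepts the schema as a positional argument."""

    def scanner(self, schema=None, columns=None, filter=None, batch_size=None):
        return {"schema": schema, "columns": columns,
                "filter": filter, "batch_size": batch_size}


class KeywordOnlyFragment:
    """Hypothetical stand-in: everything after `self` is keyword-only."""

    def scanner(self, *, columns=None, filter=None, batch_size=None):
        return {"columns": columns, "filter": filter, "batch_size": batch_size}


def scan_like_engine(fragment, schema):
    # Assumed calling convention: schema positionally, the rest as keywords.
    return fragment.scanner(schema, columns=["id"], filter=None, batch_size=1024)


def try_scan(fragment, schema):
    """Return "ok" if the call succeeds, otherwise the TypeError text."""
    try:
        scan_like_engine(fragment, schema)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    print(try_scan(PyArrowStyleFragment(), "my_schema"))
    print(try_scan(KeywordOnlyFragment(), "my_schema"))
```

The first call succeeds because the positional schema has a parameter to bind to. The second fails because `self` is the only positional parameter, so the extra positional argument is rejected before any scanning happens.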
So there are two options we can take to fix this. First, we could fix up the Python interface to mimic a pyarrow dataset more faithfully. However, the pyarrow dataset wasn't really intended to be a standard or protocol, and there are a few limitations with this approach:
You won't get the proper parallelism on reads
Filters are not pushed down (or maybe they are but only a limited subset are supported)
Some Python overhead (I'm not sure whether it is per-batch overhead, but if it is, that could be significant for some queries)