-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Finalize support for SQLFrame #2038
base: main
Are you sure you want to change the base?
Conversation
validate_column_names: bool, | ||
) -> None: | ||
if validate_column_names: | ||
check_column_names_are_unique(native_dataframe.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While pyspark does it out of the box, sqlframe with duckdb backend doesn't.
I think it is good to have it regardless in order to standardize errors and error messages
) | ||
) | ||
elif self._implementation.is_sqlframe(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same workaround as per duckdb native. As order is not guaranteed, I think that's ok
@@ -436,13 +447,9 @@ def _is_finite(_input: Column) -> Column: | |||
|
|||
def is_in(self: Self, values: Sequence[Any]) -> Self: | |||
def _is_in(_input: Column) -> Column: | |||
return _input.isin(values) | |||
return _input.isin(values) if values else self._F.lit(False) # noqa: FBT003 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to #2031
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it work to use contains
, as per the latest change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have missed a lot this week. Not sure which change you refer to π€
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently in _duckdb/expr.py is_in
uses contains
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I might be missing something - pyspark contains has different behaviour, and for sqlframe I rather not specialize for a specific dialect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok if pyspark's contains is different let's keep it like this π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to expand a bit on this: pyspark.sql.functions.contains would be the equivalent of .str.contains
if any(backend in str(constructor) for backend in ("dask", "modin", "cudf")): | ||
if any( | ||
backend in str(constructor) for backend in ("dask", "modin", "cudf", "sqlframe") | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless having it more generic in line 256, I could not manage to make the test pass with sqlframe.
Might require investigation.
self = <sqlframe.duckdb.session.DuckDBSession object at 0x7fcf53fa0380>
sql = 'WITH "t26313807" AS (SELECT CAST("a" AS MAP(TEXT, TEXT)) AS "a" FROM (VALUES (MAP([\'movie \', \'rating\'], [\'Cars\'...AS DOUBLE)} AS "a" FROM "t26313807") SELECT CAST("a" AS STRUCT("movie" TEXT, "rating" DOUBLE)) AS "a" FROM "t10314570"'
def _execute(self, sql: str) -> None:
> self._last_result = self._cur.execute(sql) # type: ignore
E duckdb.duckdb.BinderException: Binder Error: STRUCT to STRUCT cast must have at least one matching member
.venv/lib/python3.12/site-packages/sqlframe/duckdb/session.py:75: BinderException
CI failure is.... interesting?! π€ @eakmanrq apologies for the direct tag, but I wasn't able to figure out what I am doing wrong.
which I could trace back to pytest runner and threadexception. Could it be related to how we create (and don't stop) the sqlframe session? Do you have any insight on the reason why this is happening? Lines 185 to 193 in f67e767
|
Thanks for the tag and sorry for the delay I didn't see this notification at first. This does look strange. If I'm reading the error correctly, it seems like the first time SQLFrame is used, regardless of test, this warning from numpy is thrown. Purest is seeing this warning as unhandled and therefore considering the test a failure. So I believe the test is actually passing in terms of the logic but pytest isn't happy about the raised numpy warning. I'm trying to figure out why though. One thing I noticed is that you are using DuckDB 1.2.0 here and SQLFrame technically doesn't support that yet. I will be adding it soon but was waiting on a SQLGlot release with a fix. I can't explain though why that matters. Looking through SQLFrame I don't use numpy directly and I'm not using private methods. So I can't really explain this yet. Also it looks like the 3.11 test would pass if it wasn't cancelled so this could be specific to 3.13. Let me add DuckDB 1.2.0 support to SQLFrame and see if that luckily solves this. Otherwise we might need to see if we can repro locally and figure out where exactly this numpy warnings coming from. Edit: I see SQLGlot still doesn't have a release out with 1.2.0 compatibility. Can you do one of the following:
|
Thanks @eakmanrq for your time and help! Let me try to help with some more context:
Locally I can replicate the following:
Yes that seems to be exactly the case - There are cases in which we filter warning in tests, yet this might be a tricky one since:
Yes and yes!
As a fairly minimal repro, this issue is independent of what operation is run and seems associated with pytest. I can run the following to reproduce in python 3.13: repro_test.py from sqlframe.duckdb import DuckDBSession
def test_repro():
session = DuckDBSession()
data = {"foo": [1, 3, 2]}
session.createDataFrame([*zip(*data.values())], schema=[*data.keys()])
Error log
Used versions:
I pinged you in the first place since I noticed that in the sqlframe tests you never initialize the session like this (or I couldn't spot it), so I thought I was doing something wrong |
Thanks @FBruzzesi that repro is very helpful. I indeed could reproduce locally and this is actually a duckdb only repro (doesn't involve SQLFrame):
I actually think the issue is this line in DuckDB which triggers this warning in later versions of numpy: https://github.com/duckdb/duckdb/blob/cd0d0da9b1f475632a21e11a7c00cc726bedaacc/tools/pythonpkg/src/python_udf.cpp#L485 This is the line in SQLFrame that triggers it: https://github.com/eakmanrq/sqlframe/blob/main/sqlframe/duckdb/session.py#L49 Currently SQLFrame just creates a Python UDF for Let me know what you think. |
thanks all for investigations! shall we report this to duckdb, and here just silence the warning in |
Yeah I think silencing on Narwhal end is the right choice. Up to you if you want to open an issue on DuckDB Github. |
@eakmanrq I can't thank you enough for your support! and thanks @MarcoGorelli for reviewing! I committed the change to ignore the warning for now and opened an issue in duckdb (duckdb#16370) |
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below