
chore: Finalize support for SQLFrame #2038

Open
wants to merge 18 commits into base: main
Conversation

FBruzzesi (Member)

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@FBruzzesi added the enhancement (New feature or request) label on Feb 18, 2025
Comment on lines +48 to +51
validate_column_names: bool,
) -> None:
if validate_column_names:
check_column_names_are_unique(native_dataframe.columns)
FBruzzesi (Member, Author):

While pyspark does this out of the box, sqlframe with the duckdb backend doesn't.
I think it is good to have it regardless, in order to standardize errors and error messages.
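For illustration, a hypothetical sketch of what a uniqueness check along the lines of check_column_names_are_unique could do (the helper name comes from the diff above; the exact error type and message are assumptions, not the actual narwhals implementation):

```python
from collections import Counter
from collections.abc import Sequence


def check_column_names_are_unique(columns: Sequence[str]) -> None:
    # Hypothetical sketch: collect every name that appears more than once
    # and raise a single, uniform error for all backends.
    duplicates = sorted(name for name, count in Counter(columns).items() if count > 1)
    if duplicates:
        raise ValueError(f"Expected unique column names, got duplicates: {duplicates}")
```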

)
)
elif self._implementation.is_sqlframe():
FBruzzesi (Member, Author):

This is the same workaround as for native duckdb. As order is not guaranteed, I think that's ok.

@@ -436,13 +447,9 @@ def _is_finite(_input: Column) -> Column:

def is_in(self: Self, values: Sequence[Any]) -> Self:
def _is_in(_input: Column) -> Column:
return _input.isin(values)
return _input.isin(values) if values else self._F.lit(False) # noqa: FBT003
FBruzzesi (Member, Author):

Related to #2031
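As a plain-Python analogy of why the empty-sequence branch is needed (illustrative only, not Spark code):

```python
def is_in(value, values):
    # Membership against an empty sequence is always False; the diff above
    # short-circuits to a literal False rather than calling isin([]),
    # which backends handle inconsistently.
    if not values:
        return False
    return value in values
```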

Member:

does it work to use contains, as per the latest change?

FBruzzesi (Member, Author):

I might have missed a lot this week. Not sure which change you're referring to πŸ€”

Member:

currently in _duckdb/expr.py is_in uses contains

FBruzzesi (Member, Author):

Ok, I might be missing something - pyspark's contains has different behaviour, and for sqlframe I'd rather not specialize for a specific dialect

Member:

ah ok if pyspark's contains is different let's keep it like this πŸ‘

FBruzzesi (Member, Author):

Just to expand a bit on this: pyspark.sql.functions.contains would be the equivalent of .str.contains
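In other words, the two operations answer different questions; a plain-Python analogy (illustrative only, not pyspark code):

```python
# `contains`-style check: is one string a substring of another?
# (what pyspark.sql.functions.contains / .str.contains do)
def str_contains(haystack: str, needle: str) -> bool:
    return needle in haystack


# `isin`-style check: is a value a member of a collection?
# (what Column.isin / is_in do)
def is_in(value, values) -> bool:
    return value in values
```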

Comment on lines -239 to +241
if any(backend in str(constructor) for backend in ("dask", "modin", "cudf")):
if any(
backend in str(constructor) for backend in ("dask", "modin", "cudf", "sqlframe")
):
FBruzzesi (Member, Author):

Regardless of making it more generic at line 256, I could not manage to make the test pass with sqlframe.
This might require further investigation.

self = <sqlframe.duckdb.session.DuckDBSession object at 0x7fcf53fa0380>
sql = 'WITH "t26313807" AS (SELECT CAST("a" AS MAP(TEXT, TEXT)) AS "a" FROM (VALUES (MAP([\'movie \', \'rating\'], [\'Cars\'...AS DOUBLE)} AS "a" FROM "t26313807") SELECT CAST("a" AS STRUCT("movie" TEXT, "rating" DOUBLE)) AS "a" FROM "t10314570"'

    def _execute(self, sql: str) -> None:
>       self._last_result = self._cur.execute(sql)  # type: ignore
E       duckdb.duckdb.BinderException: Binder Error: STRUCT to STRUCT cast must have at least one matching member

.venv/lib/python3.12/site-packages/sqlframe/duckdb/session.py:75: BinderException

@FBruzzesi

FBruzzesi commented Feb 18, 2025

CI failure is.... interesting?! πŸ€”

@eakmanrq apologies for the direct tag, but I wasn't able to figure out what I am doing wrong.
The TL;DR is that in python 3.13, the very first test run for SQLFrame raises the following exception:

pytest.PytestUnraisableExceptionWarning: Exception ignored in PyObject_HasAttrString(); consider using PyObject_HasAttrStringWithError(), PyObject_GetOptionalAttrString() or PyObject_GetAttrString(): None

which I could trace back to pytest runner and threadexception.

Could it be related to how we create (and don't stop) the sqlframe session? Do you have any insight on the reason why this is happening?

narwhals/tests/conftest.py

Lines 185 to 193 in f67e767

def sqlframe_pyspark_lazy_constructor(
    obj: dict[str, Any],
) -> IntoFrame:  # pragma: no cover
    from sqlframe.duckdb import DuckDBSession

    session = DuckDBSession()
    return (  # type: ignore[no-any-return]
        session.createDataFrame([*zip(*obj.values())], schema=[*obj.keys()])
    )

@eakmanrq

eakmanrq commented Feb 22, 2025

Thanks for the tag and sorry for the delay I didn't see this notification at first.

This does look strange. If I'm reading the error correctly, it seems like the first time SQLFrame is used, regardless of test, this warning from numpy is thrown. pytest is seeing this warning as unhandled and therefore considering the test a failure. So I believe the test is actually passing in terms of the logic, but pytest isn't happy about the raised numpy warning.

I'm trying to figure out why though. One thing I noticed is that you are using DuckDB 1.2.0 here and SQLFrame technically doesn't support that yet. I will be adding it soon but was waiting on a SQLGlot release with a fix. I can't explain though why that matters.

Looking through SQLFrame I don't use numpy directly and I'm not using private methods. So I can't really explain this yet. Also it looks like the 3.11 test would pass if it wasn't cancelled so this could be specific to 3.13.

Let me add DuckDB 1.2.0 support to SQLFrame and see if that luckily solves this. Otherwise we might need to see if we can repro locally and figure out where exactly this numpy warning is coming from.

Edit: I see SQLGlot still doesn't have a release out with 1.2.0 compatibility. Can you do one of the following:

  • Update this branch to use DuckDB < 1.2.0 and rule out that being an issue?
  • See if you can reproduce it locally when running 3.13?

@FBruzzesi

FBruzzesi commented Feb 22, 2025

Thanks @eakmanrq for your time and help!

Let me try to help with some more context:

  • See if it can reproduce for you locally when running 3.13?

Locally I can replicate the following:

  • Failure happens only for python 3.13, both for duckdb 1.2 and 1.1.3
  • No issues for python<3.13 with both duckdb versions

This does look strange. If I'm reading the error correctly, it seems like the first time SQLFrame is used, regardless of test, this warning from numpy is thrown. pytest is seeing this warning as unhandled and therefore considering the test a failure. So I believe the test is actually passing in terms of the logic, but pytest isn't happy about the raised numpy warning.

Yes, that seems to be exactly the case - there are cases in which we filter warnings in tests, yet this might be a tricky one since:

  • It depends on order of test collection/execution
  • We are not too happy to filter it globally

Looking through SQLFrame I don't use numpy directly and I'm not using private methods. So I can't really explain this yet. Also it looks like the 3.11 test would pass if it wasn't cancelled so this could be specific to 3.13.

Yes and yes!

Let me add DuckDB 1.2.0 support to SQLFrame and see if that luckily solves this. Otherwise we might need to see if we can repro locally and figure out where exactly this numpy warnings coming from.

As a fairly minimal repro: this issue is independent of what operation is run and seems to be associated with pytest. I can run the following to reproduce in python 3.13:

repro_test.py

from sqlframe.duckdb import DuckDBSession


def test_repro():
    session = DuckDBSession()

    data = {"foo": [1, 3, 2]}
    session.createDataFrame([*zip(*data.values())], schema=[*data.keys()])

Run with: pytest repro_test.py
Error log

E                   pytest.PytestUnraisableExceptionWarning: Exception ignored in PyObject_HasAttrString(); consider using PyObject_HasAttrStringWithError(), PyObject_GetOptionalAttrString() or PyObject_GetAttrString(): None
E                   
E                   Traceback (most recent call last):
E                     File "/home/fbruzzesi/open-source/narwhals/.venv13/lib/python3.13/site-packages/numpy/core/__init__.py", line 31, in __getattr__
E                       _raise_warning(attr_name)
E                       ~~~~~~~~~~~~~~^^^^^^^^^^^
E                     File "/home/fbruzzesi/open-source/narwhals/.venv13/lib/python3.13/site-packages/numpy/core/_utils.py", line 10, in _raise_warning
E                       warnings.warn(
E                       ~~~~~~~~~~~~~^
E                           f"{old_module} is deprecated and has been renamed to {new_module}. "
E                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E                       ...<8 lines>...
E                           stacklevel=3
E                           ^^^^^^^^^^^^
E                       )
E                       ^
E                   DeprecationWarning: numpy.core is deprecated and has been renamed to numpy._core. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.multiarray.

.venv13/lib/python3.13/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning

Used versions:

uv pip freeze | grep -E "duck|sqlframe"                                  

Using Python 3.13.0 environment at: .venv13
duckdb==1.1.3
sqlframe==3.22.0

I pinged you in the first place since I noticed that in the sqlframe tests you never initialize the session like this (or I couldn't spot it), so I thought I was doing something wrong.

@eakmanrq

Thanks @FBruzzesi, that repro is very helpful. I could indeed reproduce it locally, and this is actually a duckdb-only repro (it doesn't involve SQLFrame):

import duckdb
from duckdb.typing import VARCHAR


def test_repro():
    conn = duckdb.connect()
    conn.create_function("blah", lambda x: x, return_type=VARCHAR)

I actually think the issue is this line in DuckDB which triggers this warning in later versions of numpy: https://github.com/duckdb/duckdb/blob/cd0d0da9b1f475632a21e11a7c00cc726bedaacc/tools/pythonpkg/src/python_udf.cpp#L485

This is the line in SQLFrame that triggers it: https://github.com/eakmanrq/sqlframe/blob/main/sqlframe/duckdb/session.py#L49

Currently SQLFrame just creates a Python UDF for soundex support which is a fairly niche function so I could potentially remove it. On the other hand though I may add more Python UDFs in the future if needed in order to get full PySpark compatibility support. So even if I remove this now I could end up adding a UDF in the future. I think the core issue here is that DuckDB should update their code to be better compatible with later versions of numpy.

Let me know what you think.

@MarcoGorelli

thanks all for investigations!

Shall we report this to duckdb, and here just silence the warning in pyproject.toml? I don't think there's anything we can do about it on the narwhals side?

@eakmanrq

Yeah, I think silencing on the Narwhals end is the right choice. Up to you if you want to open an issue on the DuckDB GitHub.

@FBruzzesi FBruzzesi marked this pull request as ready for review February 23, 2025 20:49
@FBruzzesi

@eakmanrq I can't thank you enough for your support! and thanks @MarcoGorelli for reviewing!

I committed the change to ignore the warning for now and opened an issue in duckdb (duckdb#16370).
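The ignore most likely landed as a pytest filterwarnings entry; a sketch of what that could look like in pyproject.toml (assumption: the exact filter string in the merged change may differ):

```toml
[tool.pytest.ini_options]
filterwarnings = [
    # Emitted by numpy when duckdb's Python UDF machinery touches the
    # deprecated numpy.core namespace (see duckdb#16370).
    "ignore:numpy.core is deprecated:DeprecationWarning",
]
```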

3 participants