
chore: Finalize support for SQLFrame #2038

Open
wants to merge 18 commits into base: main
Conversation

FBruzzesi (Member)

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@FBruzzesi added the enhancement (New feature or request) label on Feb 18, 2025
Comment on lines +48 to +51
validate_column_names: bool,
) -> None:
if validate_column_names:
check_column_names_are_unique(native_dataframe.columns)
FBruzzesi (Member, Author):

While pyspark does this out of the box, sqlframe with the duckdb backend doesn't.
I think it is good to have it regardless, in order to standardize errors and error messages.
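For illustration, a hypothetical sketch of what a uniqueness check along the lines of check_column_names_are_unique could do (the helper name comes from the diff above; the exact error type and message are assumptions, not the actual narwhals implementation):

```python
from collections import Counter
from collections.abc import Sequence


def check_column_names_are_unique(columns: Sequence[str]) -> None:
    # Hypothetical sketch: collect every name that appears more than once
    # and raise a single, uniform error for all backends.
    duplicates = sorted(name for name, count in Counter(columns).items() if count > 1)
    if duplicates:
        raise ValueError(f"Expected unique column names, got duplicates: {duplicates}")
```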

)
)
elif self._implementation.is_sqlframe():
FBruzzesi (Member, Author):

This is the same workaround as for native duckdb. As order is not guaranteed, I think that's ok.

@@ -436,13 +447,9 @@ def _is_finite(_input: Column) -> Column:

def is_in(self: Self, values: Sequence[Any]) -> Self:
def _is_in(_input: Column) -> Column:
return _input.isin(values)
return _input.isin(values) if values else self._F.lit(False) # noqa: FBT003
FBruzzesi (Member, Author):

Related to #2031
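As a plain-Python analogy of why the empty-sequence branch is needed (illustrative only, not Spark code):

```python
def is_in(value, values):
    # Membership against an empty sequence is always False; the diff above
    # short-circuits to a literal False rather than calling isin([]),
    # which backends handle inconsistently.
    if not values:
        return False
    return value in values
```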

Member:

does it work to use contains, as per the latest change?

FBruzzesi (Member, Author):

I might have missed a lot this week. Not sure which change you're referring to πŸ€”

Member:

currently in _duckdb/expr.py is_in uses contains

FBruzzesi (Member, Author):

Ok, I might be missing something - pyspark's contains has different behaviour, and for sqlframe I'd rather not specialize for a specific dialect

Member:

ah ok if pyspark's contains is different let's keep it like this πŸ‘

FBruzzesi (Member, Author):

Just to expand a bit on this: pyspark.sql.functions.contains would be the equivalent of .str.contains
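In other words, the two operations answer different questions; a plain-Python analogy (illustrative only, not pyspark code):

```python
# `contains`-style check: is one string a substring of another?
# (what pyspark.sql.functions.contains / .str.contains do)
def str_contains(haystack: str, needle: str) -> bool:
    return needle in haystack


# `isin`-style check: is a value a member of a collection?
# (what Column.isin / is_in do)
def is_in(value, values) -> bool:
    return value in values
```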

Comment on lines -239 to +241
if any(backend in str(constructor) for backend in ("dask", "modin", "cudf")):
if any(
backend in str(constructor) for backend in ("dask", "modin", "cudf", "sqlframe")
):
FBruzzesi (Member, Author):

Regardless of making it more generic at line 256, I could not manage to make the test pass with sqlframe.
This might require further investigation.

self = <sqlframe.duckdb.session.DuckDBSession object at 0x7fcf53fa0380>
sql = 'WITH "t26313807" AS (SELECT CAST("a" AS MAP(TEXT, TEXT)) AS "a" FROM (VALUES (MAP([\'movie \', \'rating\'], [\'Cars\'...AS DOUBLE)} AS "a" FROM "t26313807") SELECT CAST("a" AS STRUCT("movie" TEXT, "rating" DOUBLE)) AS "a" FROM "t10314570"'

    def _execute(self, sql: str) -> None:
>       self._last_result = self._cur.execute(sql)  # type: ignore
E       duckdb.duckdb.BinderException: Binder Error: STRUCT to STRUCT cast must have at least one matching member

.venv/lib/python3.12/site-packages/sqlframe/duckdb/session.py:75: BinderException

@FBruzzesi

FBruzzesi commented Feb 18, 2025

CI failure is.... interesting?! πŸ€”

@eakmanrq apologies for the direct tag, but I wasn't able to figure out what I am doing wrong.
The TL;DR is that in python 3.13, the very first test run for SQLFrame raises the following exception:

pytest.PytestUnraisableExceptionWarning: Exception ignored in PyObject_HasAttrString(); consider using PyObject_HasAttrStringWithError(), PyObject_GetOptionalAttrString() or PyObject_GetAttrString(): None

which I could trace back to pytest runner and threadexception.

Could it be related to how we create (and don't stop) the sqlframe session? Do you have any insight on the reason why this is happening?

narwhals/tests/conftest.py

Lines 185 to 193 in f67e767

def sqlframe_pyspark_lazy_constructor(
    obj: dict[str, Any],
) -> IntoFrame:  # pragma: no cover
    from sqlframe.duckdb import DuckDBSession

    session = DuckDBSession()
    return (  # type: ignore[no-any-return]
        session.createDataFrame([*zip(*obj.values())], schema=[*obj.keys()])
    )

@eakmanrq

eakmanrq commented Feb 22, 2025

Thanks for the tag and sorry for the delay I didn't see this notification at first.

This does look strange. If I'm reading the error correctly, it seems like the first time SQLFrame is used, regardless of test, this warning from numpy is thrown. pytest is seeing this warning as unhandled and therefore considering the test a failure. So I believe the test is actually passing in terms of the logic, but pytest isn't happy about the raised numpy warning.

I'm trying to figure out why though. One thing I noticed is that you are using DuckDB 1.2.0 here and SQLFrame technically doesn't support that yet. I will be adding it soon but was waiting on a SQLGlot release with a fix. I can't explain though why that matters.

Looking through SQLFrame I don't use numpy directly and I'm not using private methods. So I can't really explain this yet. Also it looks like the 3.11 test would pass if it wasn't cancelled so this could be specific to 3.13.

Let me add DuckDB 1.2.0 support to SQLFrame and see if that luckily solves this. Otherwise we might need to see if we can repro locally and figure out where exactly this numpy warning is coming from.

Edit: I see SQLGlot still doesn't have a release out with 1.2.0 compatibility. Can you do one of the following:

  • Update this branch to use DuckDB < 1.2.0 and rule out that being an issue?
  • See if you can reproduce it locally when running 3.13?

@FBruzzesi

FBruzzesi commented Feb 22, 2025

Thanks @eakmanrq for your time and help!

Let me try to help with some more context:

  • See if it can reproduce for you locally when running 3.13?

Locally I can replicate the following:

  • Failure happens only for python 3.13, both for duckdb 1.2 and 1.1.3
  • No issues for python<3.13 with both duckdb versions

This does look strange. If I'm reading the error correctly, it seems like the first time SQLFrame is used, regardless of test, this warning from numpy is thrown. pytest is seeing this warning as unhandled and therefore considering the test a failure. So I believe the test is actually passing in terms of the logic, but pytest isn't happy about the raised numpy warning.

Yes, that seems to be exactly the case - there are cases in which we filter warnings in tests, yet this might be a tricky one since:

  • It depends on order of test collection/execution
  • We are not too happy to filter it globally

Looking through SQLFrame I don't use numpy directly and I'm not using private methods. So I can't really explain this yet. Also it looks like the 3.11 test would pass if it wasn't cancelled so this could be specific to 3.13.

Yes and yes!

Let me add DuckDB 1.2.0 support to SQLFrame and see if that luckily solves this. Otherwise we might need to see if we can repro locally and figure out where exactly this numpy warnings coming from.

As a fairly minimal repro: this issue is independent of what operation is run and seems to be associated with pytest. I can run the following to reproduce in python 3.13:

repro_test.py

from sqlframe.duckdb import DuckDBSession


def test_repro():
    session = DuckDBSession()

    data = {"foo": [1, 3, 2]}
    session.createDataFrame([*zip(*data.values())], schema=[*data.keys()])

Run with: pytest repro_test.py
Error log

E                   pytest.PytestUnraisableExceptionWarning: Exception ignored in PyObject_HasAttrString(); consider using PyObject_HasAttrStringWithError(), PyObject_GetOptionalAttrString() or PyObject_GetAttrString(): None
E                   
E                   Traceback (most recent call last):
E                     File "/home/fbruzzesi/open-source/narwhals/.venv13/lib/python3.13/site-packages/numpy/core/__init__.py", line 31, in __getattr__
E                       _raise_warning(attr_name)
E                       ~~~~~~~~~~~~~~^^^^^^^^^^^
E                     File "/home/fbruzzesi/open-source/narwhals/.venv13/lib/python3.13/site-packages/numpy/core/_utils.py", line 10, in _raise_warning
E                       warnings.warn(
E                       ~~~~~~~~~~~~~^
E                           f"{old_module} is deprecated and has been renamed to {new_module}. "
E                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E                       ...<8 lines>...
E                           stacklevel=3
E                           ^^^^^^^^^^^^
E                       )
E                       ^
E                   DeprecationWarning: numpy.core is deprecated and has been renamed to numpy._core. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.multiarray.

.venv13/lib/python3.13/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning

Used versions:

uv pip freeze | grep -E "duck|sqlframe"                                  

Using Python 3.13.0 environment at: .venv13
duckdb==1.1.3
sqlframe==3.22.0

I pinged you in the first place since I noticed that in the sqlframe tests you never initialize the session like this (or I couldn't spot it), so I thought I was doing something wrong.

@eakmanrq

Thanks @FBruzzesi, that repro is very helpful. I could indeed reproduce it locally, and this is actually a duckdb-only repro (it doesn't involve SQLFrame):

import duckdb
from duckdb.typing import VARCHAR


def test_repro():
    conn = duckdb.connect()
    conn.create_function("blah", lambda x: x, return_type=VARCHAR)

I actually think the issue is this line in DuckDB which triggers this warning in later versions of numpy: https://github.com/duckdb/duckdb/blob/cd0d0da9b1f475632a21e11a7c00cc726bedaacc/tools/pythonpkg/src/python_udf.cpp#L485

This is the line in SQLFrame that triggers it: https://github.com/eakmanrq/sqlframe/blob/main/sqlframe/duckdb/session.py#L49

Currently SQLFrame just creates a Python UDF for soundex support which is a fairly niche function so I could potentially remove it. On the other hand though I may add more Python UDFs in the future if needed in order to get full PySpark compatibility support. So even if I remove this now I could end up adding a UDF in the future. I think the core issue here is that DuckDB should update their code to be better compatible with later versions of numpy.

Let me know what you think.

@MarcoGorelli

thanks all for investigations!

Shall we report this to duckdb, and here just silence the warning in pyproject.toml? I don't think there's anything we can do about it on the narwhals side?

@eakmanrq

Yeah, I think silencing on the Narwhals end is the right choice. Up to you if you want to open an issue on the DuckDB GitHub.

@FBruzzesi FBruzzesi marked this pull request as ready for review February 23, 2025 20:49
@FBruzzesi

@eakmanrq I can't thank you enough for your support! and thanks @MarcoGorelli for reviewing!

I committed the change to ignore the warning for now and opened an issue in duckdb (duckdb#16370).
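The ignore most likely landed as a pytest filterwarnings entry; a sketch of what that could look like in pyproject.toml (assumption: the exact filter string in the merged change may differ):

```toml
[tool.pytest.ini_options]
filterwarnings = [
    # Emitted by numpy when duckdb's Python UDF machinery touches the
    # deprecated numpy.core namespace (see duckdb#16370).
    "ignore:numpy.core is deprecated:DeprecationWarning",
]
```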

3 participants