Filtering LazyFrame using `is_null()` returns rows with non-null values in new streaming engine from parquet scan #21538

kgoodrick-uu · 2025-02-28T18:23:14Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.scan_parquet("bug.parquet").filter(pl.col("value").is_null()).collect(
    new_streaming=True
)

Wrong Output

GitHub will not let me upload the parquet file that reproduces this, so I have uploaded a CSV instead. I have confirmed that the error is still present when the parquet file is generated from it as follows:

pl.read_csv('bug.csv', try_parse_dates=True).write_parquet('bug.parquet')

bug.csv

Log output

_init_credential_provider_builder(): credential_provider_init = None
[ParquetSource] Memory prefetch function: madvise_willneed
polars-stream: updating graph state
polars-stream: running in_memory_sink in subgraph
polars-stream: running parquet_source in subgraph
[ParquetSource]: Config { num_pipelines: 12, metadata_prefetch_size: 24, metadata_decode_ahead_size: 12, row_group_prefetch_size: 128, min_values_per_thread: 16777216 }
[ParquetSource]: 2 / 2 parquet columns to be projected from 1 files
[ParquetSource]: Byte source builder: Mmap
[ParquetSource]: Pre-filtered decode enabled (1 live, 1 non-live)
[ParquetSource]: ideal_morsel_size: 100000
[ParquetSource]: Starting data fetch
parquet row group must be read, statistics not sufficient for predicate.
[parquet_source]: Last data received.
polars-stream: done running graph phase
polars-stream: updating graph state
_init_credential_provider_builder(): credential_provider_init = None
_init_credential_provider_builder(): credential_provider_init = None
_init_credential_provider_builder(): credential_provider_init = None

Issue description

I am trying to use the new streaming engine to collect the rows of a parquet file where a value (normally a float) is null. With the new streaming engine, the result also contains rows where the value is not null. I get the expected result when I am not streaming.

I do not observe this behavior when scanning a CSV.

Expected behavior

Only rows where the value is null are included in the collected DataFrame.

pl.scan_parquet("bug.parquet").filter(pl.col("value").is_null()).collect(
    # new_streaming=True
)

Installed versions

--------Version info---------
Polars:              1.23.0
Index type:          UInt32
Platform:            macOS-15.3-arm64-arm-64bit-Mach-O
Python:              3.13.1 (main, Dec  3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
polars_cloud         <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>


cmdlineluser · 2025-02-28T18:39:04Z

Can reproduce.

Seems related to `parallel="prefiltered"`: I get just nulls with either `row_groups` or `columns`.

pl.scan_parquet("bug.parquet", parallel="row_groups")

coastalwhite · 2025-02-28T19:47:19Z

I am 99% sure this is indeed related to prefiltering, but specifically to ColumnWiseExpressions. At the moment, only the new streaming engine does this, which is why it only reproduces there.

kgoodrick-uu added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 28, 2025

coastalwhite self-assigned this Feb 28, 2025

coastalwhite added accepted Ready for implementation P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Feb 28, 2025
