Filtering LazyFrame using is_null() returns rows with non-null values in new streaming engine from parquet scan #21538

Open
2 tasks done
kgoodrick-uu opened this issue Feb 28, 2025 · 2 comments
Labels: accepted, bug, P-high, python

Comments

@kgoodrick-uu

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.scan_parquet("bug.parquet").filter(pl.col("value").is_null()).collect(
    new_streaming=True
)

Wrong Output

Image

GitHub will not let me upload the parquet file that reproduces this, so I have uploaded a CSV instead. I have confirmed that the error is still present when the parquet file is generated from it as follows:

pl.read_csv('bug.csv', try_parse_dates=True).write_parquet('bug.parquet')

bug.csv

Log output

_init_credential_provider_builder(): credential_provider_init = None
[ParquetSource] Memory prefetch function: madvise_willneed
polars-stream: updating graph state
polars-stream: running in_memory_sink in subgraph
polars-stream: running parquet_source in subgraph
[ParquetSource]: Config { num_pipelines: 12, metadata_prefetch_size: 24, metadata_decode_ahead_size: 12, row_group_prefetch_size: 128, min_values_per_thread: 16777216 }
[ParquetSource]: 2 / 2 parquet columns to be projected from 1 files
[ParquetSource]: Byte source builder: Mmap
[ParquetSource]: Pre-filtered decode enabled (1 live, 1 non-live)
[ParquetSource]: ideal_morsel_size: 100000
[ParquetSource]: Starting data fetch
parquet row group must be read, statistics not sufficient for predicate.
[parquet_source]: Last data received.
polars-stream: done running graph phase
polars-stream: updating graph state
_init_credential_provider_builder(): credential_provider_init = None
_init_credential_provider_builder(): credential_provider_init = None
_init_credential_provider_builder(): credential_provider_init = None

Issue description

I am trying to collect the rows of a parquet file where value (normally a float) is null, using the new streaming engine. With the new streaming engine, the result also contains rows where value is not null. I get the expected result when I am not streaming.

I do not observe this behavior when scanning a CSV.

Expected behavior

Only rows where value is null are included in the collected DataFrame.

pl.scan_parquet("bug.parquet").filter(pl.col("value").is_null()).collect(
    # new_streaming=True
)

Image

Installed versions

--------Version info---------
Polars:              1.23.0
Index type:          UInt32
Platform:            macOS-15.3-arm64-arm-64bit-Mach-O
Python:              3.13.1 (main, Dec  3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
polars_cloud         <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
kgoodrick-uu added the bug, needs triage, and python labels on Feb 28, 2025
@cmdlineluser
Contributor

Can reproduce.

This seems related to parallel="prefiltered": I get only null rows with either parallel="row_groups" or parallel="columns".

pl.scan_parquet("bug.parquet", parallel="row_groups")

coastalwhite self-assigned this on Feb 28, 2025
coastalwhite added the accepted and P-high labels and removed the needs triage label on Feb 28, 2025
@coastalwhite
Collaborator

I am 99% sure this is indeed related to prefiltering, but specifically to ColumnWiseExpressions. At the moment only the new streaming engine does this, which is why it only reproduces there.
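For readers unfamiliar with prefiltering, here is a toy model of the general idea in plain Python. It is not Polars' implementation and says nothing about where the actual bug lies; the function name `prefiltered_read` and the dict-of-lists table are assumptions made for illustration. The predicate column (the "live" column in the log output) is evaluated first, and the resulting mask is then applied to every column.

```python
def prefiltered_read(columns, predicate_column, predicate):
    """Toy model of prefiltered decoding over column-oriented data.

    columns: dict mapping column name -> list of values (None means null).
    """
    # Step 1: look only at the "live" column and evaluate the predicate on it.
    live_values = columns[predicate_column]
    mask = [predicate(v) for v in live_values]

    # Step 2: apply the same mask to every column, live or non-live.
    # The mask must stay aligned with the rows of each column.
    return {
        name: [v for v, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }


table = {"id": [1, 2, 3, 4], "value": [1.5, None, 3.0, None]}
result = prefiltered_read(table, "value", lambda v: v is None)
print(result)  # {'id': [2, 4], 'value': [None, None]}
```

In this model, a correct result contains only the rows for which the predicate held on the live column, which is the behavior the issue expects from `is_null()`.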
