GitHub will not let me upload the parquet file that reproduces this, so I have uploaded a CSV instead. I have confirmed that the error is still present when generating the parquet file as follows:
_init_credential_provider_builder(): credential_provider_init = None
[ParquetSource] Memory prefetch function: madvise_willneed
polars-stream: updating graph state
polars-stream: running in_memory_sink in subgraph
polars-stream: running parquet_source in subgraph
[ParquetSource]: Config { num_pipelines: 12, metadata_prefetch_size: 24, metadata_decode_ahead_size: 12, row_group_prefetch_size: 128, min_values_per_thread: 16777216 }
[ParquetSource]: 2 / 2 parquet columns to be projected from 1 files
[ParquetSource]: Byte source builder: Mmap
[ParquetSource]: Pre-filtered decode enabled (1 live, 1 non-live)
[ParquetSource]: ideal_morsel_size: 100000
[ParquetSource]: Starting data fetch
parquet row group must be read, statistics not sufficient for predicate.
[parquet_source]: Last data received.
polars-stream: done running graph phase
polars-stream: updating graph state
_init_credential_provider_builder(): credential_provider_init = None
_init_credential_provider_builder(): credential_provider_init = None
_init_credential_provider_builder(): credential_provider_init = None
Issue description
I am trying to collect the rows of a parquet file where a value (normally a float) is null, using the new streaming engine. With the new streaming engine, the result also contains rows where the value is not null. I get the expected result when I am not streaming.
I do not observe this behavior when scanning a CSV.
Expected behavior
Only rows where the value is null are included in the collected dataframe.
I am 99% sure this is indeed related to prefiltering, but specifically to ColumnWiseExpressions. This is something only the new streaming engine does at the moment, which is why it only reproduces there.
Checks
Reproducible example
Wrong Output
bug.csv
Log output
Installed versions