scan_parquet push down hive filtering breaks on categorical columns #21532

Open
davidia opened this issue Feb 28, 2025 · 2 comments

Labels: A-io-partitioning (Area: reading/writing (Hive) partitioned files), bug (Something isn't working), python (Related to Python Polars)

Comments

davidia commented Feb 28, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import datetime as dt

import numpy as np
import polars as pl

my_bucket = "s3://we-love-polars"

date_range = pl.datetime_range(
    dt.datetime(2015, 1, 1, 0), dt.datetime(2025, 2, 27, 23), interval="30m", eager=True
)

base_df = pl.DataFrame({"timestamp": date_range}).head(100)

parts = []

# One small frame per key; the key is stored twice so the same data can be
# partitioned both by a String column (str_key) and a Categorical column (cat_key).
for key in "ABCD":
    df = base_df.with_columns(
        data=np.random.normal(size=base_df.height),
        str_key=pl.lit(key),
        cat_key=pl.lit(key),
        year=pl.col("timestamp").dt.year(),
        month=pl.col("timestamp").dt.month(),
        day=pl.col("timestamp").dt.day(),
    )
    parts.append(df)

data = pl.concat(parts).with_columns(pl.col("cat_key").cast(pl.Categorical))

s3_dest = f"{my_bucket}/polars_bug_short2/"
data.write_parquet(s3_dest, partition_by=["str_key", "cat_key", "year", "month", "day"])

# Filtering on the String partition key prunes files; filtering on the
# Categorical partition key scans everything.
print(pl.scan_parquet(s3_dest, hive_partitioning=True).filter(str_key="C").explain())
print(pl.scan_parquet(s3_dest, hive_partitioning=True).filter(cat_key="C").explain())

# Output:

# Filtered (String key):
# Parquet SCAN [s3://.../polars_bug_short2/str_key=C/cat_key=C/year=2015/month=1/day=1/00000000.parquet, ... 2 other sources]
# PROJECT 2/7 COLUMNS
# SELECTION: [(col("str_key")) == (String(C))]

# No filtering (Categorical key):
# Parquet SCAN [s3://.../polars_bug_short2/str_key=A/cat_key=A/year=2015/month=1/day=1/00000000.parquet, ... 11 other sources]
# PROJECT 2/7 COLUMNS
# SELECTION: [(col("cat_key")) == (String(C))]

Log output

_init_credential_provider_builder(): credential_provider_init = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)
[CredentialProviderBuilder]: Begin initialize CredentialProviderAWS @ AutoInitAWS
[CredentialProviderBuilder]: Initialized <polars.io.cloud.credential_provider._providers.CredentialProviderAWS object at 0x7f419496aab0> from CredentialProviderAWS @ AutoInitAWS
[FetchedCredentialsCache]: Call update_func: current_time = 1740756158, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
Parquet SCAN [s3://.../polars_bug_short2/str_key=C/cat_key=C/year=2015/month=1/day=1/00000000.parquet, ... 2 other sources]
PROJECT 2/7 COLUMNS
SELECTION: [(col("str_key")) == (String(C))]
_init_credential_provider_builder(): credential_provider_init = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756158, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756158, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756158, expiry = (never expires)
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
[CredentialProviderBuilder]: Begin initialize CredentialProviderAWS @ AutoInitAWS
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
hive partitioning: skipped 9 files, first file : s3://.../polars_bug_short2/str_key=A/cat_key=A/year=2015/month=1/day=1/00000000.parquet
[CredentialProviderBuilder]: Initialized <polars.io.cloud.credential_provider._providers.CredentialProviderAWS object at 0x7f41949168d0> from CredentialProviderAWS @ AutoInitAWS
[FetchedCredentialsCache]: Call update_func: current_time = 1740756159, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756159, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756159, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756159, expiry = (never expires)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
Parquet SCAN [s3://.../polars_bug_short2/str_key=A/cat_key=A/year=2015/month=1/day=1/00000000.parquet, ... 11 other sources]
PROJECT 2/7 COLUMNS
SELECTION: [(col("cat_key")) == (String(C))]

Issue description

Hive partition filter pushdown does not work on Categorical columns: a predicate on cat_key is not used to prune the hive-partitioned files, so all partitions are scanned.

Expected behavior

Filtering on the Categorical hive column should prune the dataset down to the matching partitions, exactly as the equivalent String filter does.
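
In the meantime, a possible workaround (a hedged sketch, not verified against this dataset) is to express the predicate against a String representation of the partition key: either filter on the parallel str_key column, which the example above shows does prune files, or cast cat_key to String inside the predicate. Whether the cast form is pushed down to the hive partitions has not been confirmed here.

import polars as pl

s3_dest = "s3://we-love-polars/polars_bug_short2/"  # path from the example above

# Filter on the String partition key; the explain output above shows this prunes files.
lf_str = pl.scan_parquet(s3_dest, hive_partitioning=True).filter(str_key="C")

# Or cast the Categorical key to String in the predicate; it is not confirmed
# here whether this form is pushed down to the hive partitions.
lf_cast = pl.scan_parquet(s3_dest, hive_partitioning=True).filter(
    pl.col("cat_key").cast(pl.String) == "C"
)

print(lf_str.explain())
print(lf_cast.explain())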

Installed versions

--------Version info---------
Polars:              1.23.0
Index type:          UInt32
Platform:            Linux-5.15.150.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                1.35.36
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.10.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                1.26.4
openpyxl             <not installed>
pandas               2.1.4
polars_cloud         <not installed>
pyarrow              19.0.0
pydantic             2.10.3
pyiceberg            <not installed>
sqlalchemy           1.4.54
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
davidia (Author) commented Feb 28, 2025
I've updated with a much simpler example

coastalwhite (Collaborator) commented Feb 28, 2025

This is more of a limitation at the moment. We don't do predicate pushdown for Categorical columns on row groups because the categorical ordering does not match the string ordering used in the statistics, and the same expression is used to filter hive partitions as to filter row groups.

Your case might fall under #21477 and thus be fixed already on main, but I am not 100% sure.
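
To illustrate the ordering mismatch mentioned above (a small sketch added for clarity, not part of the original discussion): with the default physical ordering in the Polars version reported here, a Categorical's order follows the order in which values are first seen rather than lexical string order, so min/max-style statistics over the physical values cannot safely be compared against a string literal.

import polars as pl

# Sketch: the physical encoding of a Categorical follows first-seen order.
s = pl.Series(["C", "A", "B"], dtype=pl.Categorical)
print(s.to_physical())  # physical codes 0, 1, 2 -- "C" gets the smallest code
print(s.sort())         # with the default (physical) ordering this stays ["C", "A", "B"]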
