scan_parquet push down hive filtering breaks on categorical columns #21532

Open
davidia opened this issue Feb 28, 2025 · 2 comments

Labels: A-io-partitioning (Area: reading/writing (Hive) partitioned files), bug (Something isn't working), python (Related to Python Polars)

Comments

davidia commented Feb 28, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import datetime as dt

import numpy as np
import polars as pl

my_bucket = "s3://we-love-polars"

date_range = pl.datetime_range(
    dt.datetime(2015, 1, 1, 0), dt.datetime(2025, 2, 27, 23), interval="30m", eager=True
)

base_df = pl.DataFrame({"timestamp": date_range}).head(100)

parts = []

# One small frame per key; the key is stored twice so the same data can be
# partitioned both by a String column (str_key) and a Categorical column (cat_key).
for key in "ABCD":
    df = base_df.with_columns(
        data=np.random.normal(size=base_df.height),
        str_key=pl.lit(key),
        cat_key=pl.lit(key),
        year=pl.col("timestamp").dt.year(),
        month=pl.col("timestamp").dt.month(),
        day=pl.col("timestamp").dt.day(),
    )
    parts.append(df)

data = pl.concat(parts).with_columns(pl.col("cat_key").cast(pl.Categorical))

s3_dest = f"{my_bucket}/polars_bug_short2/"
data.write_parquet(s3_dest, partition_by=["str_key", "cat_key", "year", "month", "day"])

# Filtering on the String partition key prunes files; filtering on the
# Categorical partition key scans everything.
print(pl.scan_parquet(s3_dest, hive_partitioning=True).filter(str_key="C").explain())
print(pl.scan_parquet(s3_dest, hive_partitioning=True).filter(cat_key="C").explain())

# Output:

# Filtered (String key):
# Parquet SCAN [s3://.../polars_bug_short2/str_key=C/cat_key=C/year=2015/month=1/day=1/00000000.parquet, ... 2 other sources]
# PROJECT 2/7 COLUMNS
# SELECTION: [(col("str_key")) == (String(C))]

# No filtering (Categorical key):
# Parquet SCAN [s3://.../polars_bug_short2/str_key=A/cat_key=A/year=2015/month=1/day=1/00000000.parquet, ... 11 other sources]
# PROJECT 2/7 COLUMNS
# SELECTION: [(col("cat_key")) == (String(C))]

Log output

_init_credential_provider_builder(): credential_provider_init = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)
[CredentialProviderBuilder]: Begin initialize CredentialProviderAWS @ AutoInitAWS
[CredentialProviderBuilder]: Initialized <polars.io.cloud.credential_provider._providers.CredentialProviderAWS object at 0x7f419496aab0> from CredentialProviderAWS @ AutoInitAWS
[FetchedCredentialsCache]: Call update_func: current_time = 1740756158, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
Parquet SCAN [s3://.../polars_bug_short2/str_key=C/cat_key=C/year=2015/month=1/day=1/00000000.parquet, ... 2 other sources]
PROJECT 2/7 COLUMNS
SELECTION: [(col("str_key")) == (String(C))]
_init_credential_provider_builder(): credential_provider_init = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756158, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756158, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756158, expiry = (never expires)
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
[CredentialProviderBuilder]: Begin initialize CredentialProviderAWS @ AutoInitAWS
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
hive partitioning: skipped 9 files, first file : s3://.../polars_bug_short2/str_key=A/cat_key=A/year=2015/month=1/day=1/00000000.parquet
[CredentialProviderBuilder]: Initialized <polars.io.cloud.credential_provider._providers.CredentialProviderAWS object at 0x7f41949168d0> from CredentialProviderAWS @ AutoInitAWS
[FetchedCredentialsCache]: Call update_func: current_time = 1740756159, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756159, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756159, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1740756159, expiry = (never expires)
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
Parquet SCAN [s3://.../polars_bug_short2/str_key=A/cat_key=A/year=2015/month=1/day=1/00000000.parquet, ... 11 other sources]
PROJECT 2/7 COLUMNS
SELECTION: [(col("cat_key")) == (String(C))]

Issue description

Hive partition filter pushdown does not work on Categorical columns: a predicate on cat_key is not used to prune the hive-partitioned files, so all partitions are scanned.

Expected behavior

Filtering on the Categorical hive column should prune the dataset down to the matching partitions, exactly as the equivalent String filter does.
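
In the meantime, a possible workaround (a hedged sketch, not verified against this dataset) is to express the predicate against a String representation of the partition key: either filter on the parallel str_key column, which the example above shows does prune files, or cast cat_key to String inside the predicate. Whether the cast form is pushed down to the hive partitions has not been confirmed here.

import polars as pl

s3_dest = "s3://we-love-polars/polars_bug_short2/"  # path from the example above

# Filter on the String partition key; the explain output above shows this prunes files.
lf_str = pl.scan_parquet(s3_dest, hive_partitioning=True).filter(str_key="C")

# Or cast the Categorical key to String in the predicate; it is not confirmed
# here whether this form is pushed down to the hive partitions.
lf_cast = pl.scan_parquet(s3_dest, hive_partitioning=True).filter(
    pl.col("cat_key").cast(pl.String) == "C"
)

print(lf_str.explain())
print(lf_cast.explain())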

Installed versions

--------Version info---------
Polars:              1.23.0
Index type:          UInt32
Platform:            Linux-5.15.150.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                1.35.36
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.10.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                1.26.4
openpyxl             <not installed>
pandas               2.1.4
polars_cloud         <not installed>
pyarrow              19.0.0
pydantic             2.10.3
pyiceberg            <not installed>
sqlalchemy           1.4.54
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
davidia (Author) commented Feb 28, 2025
I've updated with a much simpler example

coastalwhite (Collaborator) commented Feb 28, 2025

This is more of a limitation at the moment. We don't do predicate pushdown for Categorical columns on row groups because the categorical ordering does not match the string ordering used in the statistics, and the same expression is used to filter hive partitions as to filter row groups.

Your case might fall under #21477 and thus be fixed already on main, but I am not 100% sure.
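
To illustrate the ordering mismatch mentioned above (a small sketch added for clarity, not part of the original discussion): with the default physical ordering in the Polars version reported here, a Categorical's order follows the order in which values are first seen rather than lexical string order, so min/max-style statistics over the physical values cannot safely be compared against a string literal.

import polars as pl

# Sketch: the physical encoding of a Categorical follows first-seen order.
s = pl.Series(["C", "A", "B"], dtype=pl.Categorical)
print(s.to_physical())  # physical codes 0, 1, 2 -- "C" gets the smallest code
print(s.sort())         # with the default (physical) ordering this stays ["C", "A", "B"]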
