Closed as not planned
Description
Describe the bug
When I attempt to write both a partitioned Parquet dataset and a non-partitioned Parquet file from the same DataFrame, I encounter a schema mismatch error. This occurs because partitioned writes exclude the partition columns from the Parquet file schema, while non-partitioned writes include them. Performing one write after the other leads to:
Table schema does not match schema used to create file:
table:
[schema without partition keys]
file:
[schema with partition keys]
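For reference, the schema split itself is standard pyarrow behavior (which awswrangler builds on): a partitioned dataset write encodes the partition values in the directory names and drops those columns from the physical file schema, while a plain file write keeps every column. A minimal local sketch, using made-up column names, that shows the difference:

import glob
import tempfile

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"merchant_id": [1, 2], "model_version": ["v1", "v1"]})
table = pa.Table.from_pandas(df, preserve_index=False)

with tempfile.TemporaryDirectory() as root:
    # Partitioned write: model_version is encoded in the directory name
    # (model_version=v1/...) and removed from the file schema.
    pq.write_to_dataset(table, root_path=root, partition_cols=["model_version"])
    part_file = glob.glob(f"{root}/**/*.parquet", recursive=True)[0]
    print(pq.read_schema(part_file).names)  # ['merchant_id']

    # Non-partitioned write: every column stays in the file schema.
    pq.write_table(table, f"{root}/flat.parquet")
    print(pq.read_schema(f"{root}/flat.parquet").names)  # ['merchant_id', 'model_version']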
How to Reproduce
import pandas as pd
import awswrangler as wr

df = pd.DataFrame({
    "merchant_id": [1, 2],
    "payout_type": ["X", "Y"],
    "execution_date": pd.to_datetime("2024-12-16"),
    "model_version": ["v1", "v1"],
})

# First write a non-partitioned file that includes partition keys as normal columns
wr.s3.to_parquet(
    df=df,
    path="s3://mybucket1/non_partitioned_file.parquet",
    dataset=False,
)

# Then try writing a partitioned dataset (which excludes partition columns from the file schema)
wr.s3.to_parquet(
    df=df,
    path="s3://mybucket2/partitioned_dataset/",
    dataset=True,
    partition_cols=["execution_date", "model_version"],
)
The second call fails with a schema mismatch error. Reversing the order of the calls (partitioned first, then non-partitioned) fails as well.
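As a side note, when the partitioned write does succeed on its own (for example, run in a fresh process), the schema split can be confirmed with awswrangler itself; the bucket and path below are just the placeholders from the snippet above:

import awswrangler as wr

# Read back the metadata of the partitioned dataset: partition columns are
# reported separately from the per-file column schema.
columns_types, partitions_types = wr.s3.read_parquet_metadata(
    path="s3://mybucket2/partitioned_dataset/",
    dataset=True,
)
print(columns_types)     # expected: merchant_id, payout_type only
print(partitions_types)  # expected: execution_date, model_version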
Expected behavior
The second call should write the data successfully without a schema mismatch error.
Your project
No response
Screenshots
No response
OS
Docker Container
Python version
3.11.8
AWS SDK for pandas version
3.10.1
Additional context
ChatGPT o1 suggests the following as the likely cause of the bug: