
[BUG] #4304

Open

omarMseddi0 opened this issue Mar 21, 2025 · 0 comments
Labels: bug (Something isn't working)

omarMseddi0 commented Mar 21, 2025


Bug

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Describe the problem

In Delta Lake 3.3.0, I tried the Liquid Clustering feature on a local Kubernetes cluster running Apache Spark 3.5.5. The goal was to test the incremental optimization behavior described in the Delta Lake documentation.


Steps to reproduce

  1. I created a Delta table and enabled Liquid Clustering:

    ALTER TABLE my_table CLUSTER BY (my_column);
  2. I inserted approximately 40 GB of data into the table.

  3. I executed:

    OPTIMIZE my_table;

    ✔️ This successfully clustered the data (as expected).

  4. To test incremental behavior, I appended an additional 1 GB of data.

  5. I then ran OPTIMIZE again on the table. (A consolidated sketch of these steps follows this list.)
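
For reference, here is a consolidated sketch of the steps above in Delta SQL. The table, column, and source names (my_table, my_column, the staging tables) are placeholders standing in for my actual schema:

    -- Step 1: create a Delta table and enable Liquid Clustering
    CREATE TABLE my_table (id BIGINT, my_column STRING, payload STRING)
    USING DELTA;
    ALTER TABLE my_table CLUSTER BY (my_column);

    -- Step 2: load ~40 GB of data (placeholder source)
    INSERT INTO my_table SELECT * FROM staging_initial;

    -- Step 3: first clustering pass; behaved as expected
    OPTIMIZE my_table;

    -- Step 4: append ~1 GB of new data (placeholder source)
    INSERT INTO my_table SELECT * FROM staging_increment;

    -- Step 5: second pass; this rewrote the entire ~41 GB
    OPTIMIZE my_table;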


Observed results

Instead of optimizing only the newly appended 1 GB of data, Spark scanned, shuffled, and rewrote the entire 41 GB, including the 40 GB that had already been clustered.

This behavior contradicts the expected performance benefit of incremental optimization with Liquid Clustering. According to the documentation and the stated goals of Liquid Clustering, previously optimized data should be skipped unless re-clustering is required, which was not the case here.

✔️ I had not changed the clustering column.
❌ Yet, the entire dataset was reprocessed and rewritten.


Expected results

I expected that:

  • Only the newly appended 1 GB of unclustered data would be optimized.
  • The already optimized 40 GB of clustered data would be left untouched.
  • Spark would avoid a full shuffle and rewrite, yielding an incremental performance gain. (A verification sketch follows this list.)
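
For anyone reproducing this, the extent of the rewrite can be quantified from the table history rather than from wall-clock time alone. A sketch; the exact keys inside operationMetrics (e.g. numRemovedFiles, numRemovedBytes) may differ between Delta versions:

    -- Compare the two OPTIMIZE commits in the table history
    DESCRIBE HISTORY my_table LIMIT 5;

    -- For a truly incremental second pass, the second OPTIMIZE commit's
    -- operationMetrics should cover roughly the 1 GB of new data,
    -- not the ~41 GB rewrite that I observed.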

Further details

This behavior occurred consistently. It looks like OPTIMIZE is not incremental, even when using Liquid Clustering. It behaves like a full-table re-optimization regardless of data changes.

Please confirm:

  • Is this behavior expected in Delta Lake 3.3.0?
  • Does Delta track previously clustered ranges internally?
  • Is there a bug or missing configuration that prevents incremental behavior?
  • Is there a way to force optimization to be incremental only? (See the sketch after this list.)
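
On the last question: my reading of the Delta 3.3 docs is that plain OPTIMIZE on a clustered table is supposed to be incremental already, while OPTIMIZE ... FULL is the variant that intentionally rewrites all files. What I observe is plain OPTIMIZE behaving like FULL. A sketch of the distinction, plus a metadata sanity check (DESCRIBE DETAIL exposing a clusteringColumns field is my assumption for this version):

    -- Sanity check: confirm clustering is registered in table metadata;
    -- expect clusteringColumns = ["my_column"]
    DESCRIBE DETAIL my_table;

    -- What I ran: documented as incremental for clustered tables
    OPTIMIZE my_table;

    -- What I did NOT run: documented to rewrite all files
    OPTIMIZE my_table FULL;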

Environment information

  • Delta Lake version: 3.3.0
  • Spark version: 3.5.5
  • Scala version: 2.12
  • Cluster Type: Local Kubernetes cluster
  • Storage: Local disk storage (no cloud/S3)
  • Data Size: Initial 40 GB, then 1 GB appended

Willingness to contribute

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.
