
[HUDI-8371][CHERRYPICK] Fix column stats index with MDT for a few scenarios#18314

Open
vamsikarnika wants to merge 11 commits into apache:release-0.14.2-prep from vamsikarnika:mor_colstats_initializationfix-oss-cp


@vamsikarnika
Collaborator

Describe the issue this Pull Request addresses

  • Support bootstrapping of col stats for MOR table.
  • Fix the clean operation with col stats: even though the stats were nullified, the records themselves were not deleted from the col stats partition.

Summary and Changelog

  • Support bootstrapping of col stats for MOR table.
  • Fix the clean operation with col stats: even though the stats were nullified, the records themselves were not deleted from the col stats partition.

Impact

Col stats can now be enabled for a MOR table at any given state.
I ran into other issues along the way, which I had to fix to get the patch ready:

  • DirectoryInfo was not accounting for files fetched from MDT. When a new MDT partition is initialized, we poll MDT to fetch file info rather than doing FS-based listing. This path had a bug, which is fixed here.
  • When a clean from the data table was applied to MDT, we nullified the stats or marked them as deleted, but the record itself was not deleted from the col stats partition and lingered. This patch fixes that.
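To illustrate the clean fix described above, here is a minimal sketch of the difference between nullifying a stats record and actually deleting it. It does not use Hudi's classes: the `Stats` type, the map standing in for the col stats partition, and both method names are hypothetical, chosen only to show why a nullified record still lingers.

```java
import java.util.HashMap;
import java.util.Map;

public class ColStatsCleanSketch {
    // Hypothetical stand-in for one col stats record (not a Hudi class).
    static final class Stats {
        final Long min;
        final Long max;
        final boolean deleted;

        Stats(Long min, Long max, boolean deleted) {
            this.min = min;
            this.max = max;
            this.deleted = deleted;
        }
    }

    // Behavior described as the bug: the stats are nullified and flagged
    // as deleted, but the record stays in the partition.
    static void applyCleanByNullifying(Map<String, Stats> partition, String key) {
        partition.computeIfPresent(key, (k, v) -> new Stats(null, null, true));
    }

    // Behavior described as the fix: the record is removed from the partition.
    static void applyCleanByDeleting(Map<String, Stats> partition, String key) {
        partition.remove(key);
    }

    public static void main(String[] args) {
        // The map is an assumed model of the col stats partition,
        // keyed by a made-up "file|column" string.
        Map<String, Stats> partition = new HashMap<>();
        partition.put("file-1.parquet|price", new Stats(1L, 9L, false));

        applyCleanByNullifying(partition, "file-1.parquet|price");
        System.out.println("records after nullifying: " + partition.size());

        applyCleanByDeleting(partition, "file-1.parquet|price");
        System.out.println("records after deleting: " + partition.size());
    }
}
```

In this model, a reader of the partition still sees an entry for the cleaned file after nullifying, which is the lingering record the tests in this PR are meant to rule out.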

Tests covered:

  • Bootstrapping of both COW and MOR tables.
  • Covered both partitioned and non-partitioned tables.
  • Ensured log files with delete blocks, partially failed log blocks, and rollback blocks are accounted for in tests.
  • Added tests to validate that clean removes the entry from col stats for both table types and for both partitioned and non-partitioned tables.

Risk Level

low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

vamsikarnika and others added 4 commits March 12, 2026 16:37
…geMetadata

Convert HoodieRecord list to IndexedRecord before calling collectColumnRangeMetadata,
matching the 3-arg signature in 0.14.x (master's version accepted HoodieRecord + Schema).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace Collector wildcard pattern with forEach+map in collectColumnRangeMetadata
  (HoodieTableMetadataUtil) and readRangeFromParquetMetadata (ParquetUtils) to fix
  Java 8 type inference failures
- Replace FileSlice.hasLogFiles() with getLogFiles().findAny().isPresent() since
  hasLogFiles() doesn't exist in 0.14.x

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
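The hasLogFiles() replacement in the commit above can be sketched as follows. `SliceStandIn` is a hypothetical stand-in, not Hudi's FileSlice; the only assumption carried over from the commit message is that getLogFiles() returns a stream, so checking for any element is equivalent to asking whether log files exist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class HasLogFilesSketch {
    // Hypothetical stand-in for a file slice; log files are plain strings here.
    static final class SliceStandIn {
        private final List<String> logFiles = new ArrayList<>();

        void addLogFile(String name) {
            logFiles.add(name);
        }

        // Returns a fresh stream on every call, so it can be consumed repeatedly.
        Stream<String> getLogFiles() {
            return logFiles.stream();
        }
    }

    // Equivalent of a hasLogFiles() helper when only getLogFiles() is available.
    static boolean hasLogFiles(SliceStandIn slice) {
        return slice.getLogFiles().findAny().isPresent();
    }

    public static void main(String[] args) {
        SliceStandIn slice = new SliceStandIn();
        System.out.println("before adding a log file: " + hasLogFiles(slice));
        slice.addLogFile(".log.1");
        System.out.println("after adding a log file: " + hasLogFiles(slice));
    }
}
```

findAny() short-circuits on the first element, so the check does not walk the full list of log files.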
Collect flatMap result to List before grouping to avoid raw type inference
issue where Java 8 loses generic type parameter through the flatMap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
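The pattern in the last commit message, collecting the flatMap result to a List before grouping, might look roughly like this. The `ColumnRange` type and the `groupByColumn` method are hypothetical and only illustrate the shape of the change; the actual code in HoodieTableMetadataUtil is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupAfterFlatMapSketch {
    // Hypothetical per-file column range (not a Hudi class).
    static final class ColumnRange {
        final String column;
        final int min;
        final int max;

        ColumnRange(String column, int min, int max) {
            this.column = column;
            this.min = min;
            this.max = max;
        }
    }

    static Map<String, List<ColumnRange>> groupByColumn(List<List<ColumnRange>> perFileRanges) {
        // Step 1: materialize the flattened stream into a List with an
        // explicit element type, so the generic parameter is pinned down.
        List<ColumnRange> flattened = perFileRanges.stream()
            .flatMap(List::stream)
            .collect(Collectors.toList());

        // Step 2: group the typed list by column name.
        return flattened.stream()
            .collect(Collectors.groupingBy(range -> range.column));
    }

    public static void main(String[] args) {
        List<List<ColumnRange>> perFile = new ArrayList<>();
        perFile.add(Arrays.asList(new ColumnRange("a", 1, 5), new ColumnRange("b", 0, 2)));
        perFile.add(Arrays.asList(new ColumnRange("a", 3, 9)));

        Map<String, List<ColumnRange>> grouped = groupByColumn(perFile);
        System.out.println("ranges for column a: " + grouped.get("a").size());
        System.out.println("ranges for column b: " + grouped.get("b").size());
    }
}
```

Splitting the pipeline costs one intermediate list, but it gives the compiler a concrete `List<ColumnRange>` to infer from instead of a long chained expression.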
@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 12, 2026
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels Mar 13, 2026
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Mar 18, 2026

Labels

size:XL PR with lines of changes > 1000

2 participants