
perf(table-services): Only attempt scheduling log compaction if number of deltacommits is at least LogCompactionBlocksThreshold #18306

Open

kbuci wants to merge 4 commits into apache:master from kbuci:logcompact-schedule

Conversation

Contributor

@kbuci kbuci commented Mar 11, 2026

Describe the issue this Pull Request addresses

Currently, log compaction is always scheduled whenever the operation type is LOG_COMPACT, regardless of how many delta commits have occurred since the last log compaction. This leads to unnecessary log compaction scheduling, wasting resources when only a few delta commits (and therefore most likely only a few log files/blocks) have accumulated. The fix is especially helpful for the metadata table with the record-level index (RLI), where all RLI file groups are updated with new log files/blocks at a roughly equal cadence.

Summary and Changelog

Changes log compaction scheduling to use the LogCompactionBlocksThreshold config as a gating threshold. Instead of unconditionally scheduling log compaction, the scheduler now counts delta commits since the last compaction and the last log compaction, takes the minimum of the two, and only schedules log compaction when that count meets or exceeds the threshold.

  • Added CompactionUtils.getDeltaCommitsSinceLatestLogCompaction() which determines the number of delta commits since the most recent completed log compaction by inspecting the raw active timeline (needed because completed log compaction instants transition from LOG_COMPACTION_ACTION to DELTA_COMMIT_ACTION)
  • Added ScheduleCompactionActionExecutor.getDeltaCommitInfoSinceLogCompaction() which creates a raw active timeline and delegates to the new CompactionUtils method
  • Renamed getLatestDeltaCommitInfo() to getLatestDeltaCommitInfoSinceCompaction() for clarity
  • Updated needCompact() to replace the unconditional return true for LOG_COMPACT with threshold-based logic: Math.min(deltaCommitsSinceCompaction, deltaCommitsSinceLogCompaction) >= logCompactionBlocksThreshold
  • Added unit tests for getDeltaCommitsSinceLatestLogCompaction covering completed log compaction, no log compaction, and empty timeline cases
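The gating condition described above can be sketched as a small self-contained check. This is a minimal illustration of the threshold logic, not Hudi's actual API; the class and method names here are hypothetical:

```java
// Hypothetical sketch of the needCompact() gate for LOG_COMPACT described above.
// Log compaction is only scheduled when the smaller of the two delta-commit
// counts meets the blocks threshold (hoodie.log.compaction.blocks.threshold).
public class LogCompactionGate {
  public static boolean shouldSchedule(int deltaCommitsSinceCompaction,
                                       int deltaCommitsSinceLogCompaction,
                                       int logCompactionBlocksThreshold) {
    // Taking the minimum ensures neither a recent compaction nor a recent
    // log compaction is immediately followed by another log compaction.
    return Math.min(deltaCommitsSinceCompaction, deltaCommitsSinceLogCompaction)
        >= logCompactionBlocksThreshold;
  }
}
```

For example, with the default threshold of 5, a table with 7 delta commits since the last compaction but only 3 since the last log compaction would not schedule a new log compaction.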

Impact

No public API changes. Log compaction will now be scheduled less frequently — only when enough delta commits have accumulated since the last compaction or log compaction to meet the hoodie.log.compaction.blocks.threshold (default: 5). This reduces unnecessary log compaction overhead for tables with frequent small writes.
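For reference, the gate reuses the existing config named in the description; a minimal properties fragment setting it explicitly might look like the following (the value shown is just the documented default, for illustration):

```properties
# Existing Hudi write config, reused by this PR as a scheduling gate.
# Default is 5 per the description above.
hoodie.log.compaction.blocks.threshold=5
```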

Risk Level

Low. The change only affects log compaction scheduling frequency. Regular compaction scheduling is unchanged.

Documentation Update

None. No new configs are introduced; the existing hoodie.log.compaction.blocks.threshold config is now also used to gate scheduling frequency in addition to its existing role in plan generation.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Krishen Bhan added 2 commits March 11, 2026 13:46
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Mar 11, 2026
Contributor

@nsivabalan nsivabalan left a comment


Will review tests in next iteration

    final HoodieActiveTimeline rawActiveTimeline) {
      Option<HoodieInstant> lastLogCompactionInstantOption = Option.fromJavaOptional(
          rawActiveTimeline
              .filterPendingLogCompactionTimeline()
Contributor


How about we introduce filterLogCompactionTimeline() and then process the latest instant from it? That way we can avoid polling the timeline twice, right?
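The reviewer's suggestion amounts to a single filtered scan over the timeline. A minimal self-contained sketch of that idea, with illustrative stand-in types rather than Hudi's actual HoodieActiveTimeline/HoodieInstant API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch: filter the timeline once for log-compaction instants
// and take the latest, instead of scanning the timeline twice.
// "Instant" is a stand-in for HoodieInstant; the action name is illustrative.
public class TimelineScan {
  public static final class Instant {
    final String action;
    final String timestamp; // commit times sort lexicographically
    public Instant(String action, String timestamp) {
      this.action = action;
      this.timestamp = timestamp;
    }
    public String getTimestamp() { return timestamp; }
  }

  public static Optional<Instant> latestLogCompaction(List<Instant> timeline) {
    return timeline.stream()
        .filter(i -> "logcompaction".equals(i.action)) // single pass
        .max(Comparator.comparing(Instant::getTimestamp));
  }
}
```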

Contributor Author


I ended up putting this in a helper method in this class. I didn't want to add any methods for this in the timeline-related classes since I was worried about misuse: the filtering logic only works for a raw active timeline.

      return Option.of(Pair.of(deltaCommitTimeline.findInstantsAfter(
          lastCompletedLogCompactionInstant.requestedTime(), Integer.MAX_VALUE), lastCompletedLogCompactionInstant));
    } else {
      LOG.info("Last log compaction instant {} is in pending state, so returning empty value.", lastLogCompactionTimestamp);
Contributor


Let's be judicious with info logging.
Can you confirm we would only log this occasionally?

Contributor Author


This should only happen if a log compaction was attempted to be scheduled while a pending log compaction still exists in the timeline (from a prior failed attempt, or because the table service platform is still working on it). I kept this here to help with debugging, in case a user is confused about why their job isn't scheduling a new log compaction. If you think it might be too noisy, though, I can remove it.

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@kbuci kbuci requested a review from nsivabalan March 19, 2026 00:12
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Mar 19, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 85.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.50%. Comparing base (39f1f39) to head (95cf1aa).
⚠️ Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
...tion/compact/ScheduleCompactionActionExecutor.java 85.29% 1 Missing and 4 partials ⚠️
...a/org/apache/hudi/common/util/CompactionUtils.java 84.61% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18306       +/-   ##
=============================================
+ Coverage     57.27%   68.50%   +11.22%     
- Complexity    18639    27378     +8739     
=============================================
  Files          1956     2420      +464     
  Lines        107069   132173    +25104     
  Branches      13255    15918     +2663     
=============================================
+ Hits          61324    90542    +29218     
+ Misses        39939    34619     -5320     
- Partials       5806     7012     +1206     
Flag Coverage Δ
common-and-other-modules 44.37% <50.00%> (?)
hadoop-mr-java-client 45.13% <6.66%> (-0.08%) ⬇️
spark-client-hadoop-common 48.33% <41.66%> (?)
spark-java-tests 48.94% <81.66%> (+1.48%) ⬆️
spark-scala-tests 45.12% <48.33%> (-0.45%) ⬇️
utilities 38.68% <6.66%> (?)

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...a/org/apache/hudi/common/util/CompactionUtils.java 93.70% <84.61%> (+12.50%) ⬆️
...tion/compact/ScheduleCompactionActionExecutor.java 87.06% <85.29%> (+0.61%) ⬆️

... and 1270 files with indirect coverage changes

