perf(table-services): Only attempt scheduling log compaction if number of deltacommits is at least LogCompactionBlocksThreshold #18306
Conversation
nsivabalan left a comment:
Will review tests in next iteration
...mon/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
final HoodieActiveTimeline rawActiveTimeline) {
  Option<HoodieInstant> lastLogCompactionInstantOption = Option.fromJavaOptional(
      rawActiveTimeline
          .filterPendingLogCompactionTimeline()
How about we introduce filterLogCompactionTimeline() and then process the latest instant from it? That way we can avoid polling the timeline twice, right?
I ended up putting this in a helper method in this class. I didn't want to add any methods for this to the timeline-related classes, since I was worried about misuse: the filtering logic only works for a raw active timeline.
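To illustrate the filtering logic being discussed, here is a minimal, self-contained sketch. This is not Hudi's actual API: the Instant record, the action strings, and the method name are all hypothetical stand-ins for the real timeline types. It walks a time-ordered raw timeline, finds the latest log compaction instant, and counts completed delta commits after it, returning empty when the latest log compaction is still pending (the "return empty value" branch mentioned in this thread).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class LogCompactionTimelineSketch {
  // Hypothetical, simplified stand-in for a Hudi timeline instant.
  record Instant(String action, boolean completed, String timestamp) {}

  // Count delta commits strictly after the latest COMPLETED log compaction.
  // Returns empty if the latest log compaction is still pending.
  static Optional<Long> deltaCommitsSinceLatestLogCompaction(List<Instant> rawTimeline) {
    Optional<Instant> lastLogCompaction = rawTimeline.stream()
        .filter(i -> i.action().equals("logcompaction"))
        .reduce((a, b) -> b); // timeline is assumed time-ordered; keep the latest
    if (lastLogCompaction.isPresent() && !lastLogCompaction.get().completed()) {
      return Optional.empty(); // pending log compaction: do not schedule another
    }
    String floor = lastLogCompaction.map(Instant::timestamp).orElse("");
    long count = rawTimeline.stream()
        .filter(i -> i.action().equals("deltacommit") && i.completed())
        .filter(i -> i.timestamp().compareTo(floor) > 0)
        .count();
    return Optional.of(count);
  }

  public static void main(String[] args) {
    List<Instant> timeline = Arrays.asList(
        new Instant("deltacommit", true, "001"),
        new Instant("logcompaction", true, "002"),
        new Instant("deltacommit", true, "003"),
        new Instant("deltacommit", true, "004"));
    // Two delta commits ("003", "004") after the completed log compaction at "002".
    System.out.println(deltaCommitsSinceLatestLogCompaction(timeline)); // Optional[2]
  }
}
```

Keeping this in a private helper, as the author describes, avoids exposing a method on the timeline classes that is only valid for a raw active timeline.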
hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java
  return Option.of(Pair.of(deltaCommitTimeline.findInstantsAfter(
      lastCompletedLogCompactionInstant.requestedTime(), Integer.MAX_VALUE), lastCompletedLogCompactionInstant));
} else {
  LOG.info("Last log compaction instant {}, is in pending state so returning empty value.", lastLogCompactionTimestamp);
Let's be judicious with info logging. Can you confirm this is only logged occasionally?
This should only happen if a log compaction was attempted to be scheduled while a pending log compaction still exists in the timeline (from a prior failed attempt, or because the table service platform is still working on it). I kept this here to help with debugging, in case a user is confused about why their job isn't scheduling a new log compaction. If you think it might be too noisy, though, I can remove it.
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
## master #18306 +/- ##
=============================================
+ Coverage 57.27% 68.50% +11.22%
- Complexity 18639 27378 +8739
=============================================
Files 1956 2420 +464
Lines 107069 132173 +25104
Branches 13255 15918 +2663
=============================================
+ Hits 61324 90542 +29218
+ Misses 39939 34619 -5320
- Partials 5806 7012 +1206
Describe the issue this Pull Request addresses
Currently, log compaction is always scheduled whenever the operation type is LOG_COMPACT, regardless of how many delta commits have occurred since the last log compaction. This leads to unnecessary log compaction scheduling, wasting resources when few delta commits (and therefore most likely only a few log files/blocks) have accumulated. This is especially helpful for the metadata table with RLI, where all file groups in RLI are updated with new log files/blocks at a roughly equal cadence.

Summary and Changelog
Changes log compaction scheduling to use the LogCompactionBlocksThreshold config as a gating threshold. Instead of unconditionally scheduling log compaction, the scheduler now counts delta commits since the last compaction and since the last log compaction, takes the minimum of the two, and only schedules log compaction when that count meets or exceeds the threshold.

- Adds CompactionUtils.getDeltaCommitsSinceLatestLogCompaction(), which determines the number of delta commits since the most recent completed log compaction by inspecting the raw active timeline (needed because completed log compaction instants transition from LOG_COMPACTION_ACTION to DELTA_COMMIT_ACTION)
- Adds ScheduleCompactionActionExecutor.getDeltaCommitInfoSinceLogCompaction(), which creates a raw active timeline and delegates to the new CompactionUtils method
- Renames getLatestDeltaCommitInfo() to getLatestDeltaCommitInfoSinceCompaction() for clarity
- Updates needCompact() to replace the unconditional return true for LOG_COMPACT with threshold-based logic: Math.min(deltaCommitsSinceCompaction, deltaCommitsSinceLogCompaction) >= logCompactionBlocksThreshold
- Adds tests for getDeltaCommitsSinceLatestLogCompaction covering the completed log compaction, no log compaction, and empty timeline cases

Impact
No public API changes. Log compaction will now be scheduled less frequently: only when enough delta commits have accumulated since the last compaction and the last log compaction to meet hoodie.log.compaction.blocks.threshold (default: 5). This reduces unnecessary log compaction overhead for tables with frequent small writes.

Risk Level
Low. The change only affects log compaction scheduling frequency. Regular compaction scheduling is unchanged.
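The threshold gate this PR introduces can be sketched as follows. This is a simplified stand-in, not the actual executor code: the method and variable names are illustrative, though hoodie.log.compaction.blocks.threshold is the real config key described in this PR, with a stated default of 5.

```java
import java.util.Properties;

public class LogCompactGateSketch {
  // Schedule LOG_COMPACT only when enough delta commits have accumulated
  // since BOTH the last compaction and the last log compaction.
  static boolean shouldScheduleLogCompaction(int deltaCommitsSinceCompaction,
                                             int deltaCommitsSinceLogCompaction,
                                             int logCompactionBlocksThreshold) {
    return Math.min(deltaCommitsSinceCompaction, deltaCommitsSinceLogCompaction)
        >= logCompactionBlocksThreshold;
  }

  public static void main(String[] args) {
    // The existing config key; 5 is its default per this PR's description.
    Properties props = new Properties();
    props.setProperty("hoodie.log.compaction.blocks.threshold", "5");
    int threshold = Integer.parseInt(
        props.getProperty("hoodie.log.compaction.blocks.threshold"));

    System.out.println(shouldScheduleLogCompaction(7, 4, threshold)); // false: min(7, 4) = 4 < 5
    System.out.println(shouldScheduleLogCompaction(7, 6, threshold)); // true: min(7, 6) = 6 >= 5
  }
}
```

Taking the minimum of the two counts means a recent regular compaction also defers log compaction, which matches the gating behavior described in the changelog above.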
Documentation Update
None. No new configs are introduced; the existing hoodie.log.compaction.blocks.threshold config is now also used to gate scheduling frequency, in addition to its existing role in plan generation.

Contributor's checklist