fix(producer): drop empty trailing chunk slice in distributed render plan #1133

Merged
miguel-heygen merged 1 commit into heygen-com:main from calcarazgre646:fix/distributed-chunk-plan-empty-slice
May 30, 2026

@calcarazgre646
Contributor

The bug

resolveChunkPlan derives chunkCount from the naive count (min(maxParallelChunks, ceil(totalFrames / resolvedChunkSize))), then rounds effectiveChunkSize up to ceil(totalFrames / chunkCount). When that ceil rounds up, the first chunkCount - 1 chunks can already cover every frame, so buildChunkSlices emits a final slice with startFrame >= totalFrames — an empty [n, n) or inverted range.

renderChunk then rejects that slice (framesInChunk = endFrame - startFrame <= 0 → RenderChunkValidationError). Under Step Functions, the chunk exhausts its retries and fails the whole distributed render, even though [0, totalFrames) was already fully covered. The empty slice also violates buildChunkSlices's own documented contract ("the union is exactly [0, totalFrames)").

Reachability

This is reachable straight from the user-facing CLI. --chunk-size and --max-parallel-chunks are first-class flags on hyperframes lambda render that flow into DistributedRenderConfig → resolveChunkPlan:

hyperframes lambda render --chunk-size 10 --max-parallel-chunks 12

on a ~4s / 30fps (121-frame) composition gives resolvedChunkSize=10, chunkCount=min(12, 13)=12, effectiveChunkSize=max(10, ceil(121/12)=11)=11. Slices 0–10 cover [0, 121); slice 11 is [121, 121).
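For readers who want to step through that arithmetic, here is a minimal sketch of the pre-fix behavior. It models only the explicit chunk-size path and assumes slice i spans [i * effectiveChunkSize, min((i + 1) * effectiveChunkSize, totalFrames)), which matches the [121, 121) tail described above. The names preFixPlan, sliceFrames, and FrameSlice are hypothetical and are not the producer's actual code.

```typescript
// Hypothetical model of the pre-fix arithmetic (explicit chunkSize only).
// Names are illustrative; the real resolveChunkPlan / buildChunkSlices may differ.
interface FrameSlice {
  startFrame: number; // inclusive
  endFrame: number; // exclusive
}

function preFixPlan(totalFrames: number, chunkSize: number, maxParallelChunks: number) {
  const naiveCount = Math.ceil(totalFrames / chunkSize);
  const chunkCount = Math.min(maxParallelChunks, naiveCount);
  // Rounding up here is what can make the last chunk redundant.
  const effectiveChunkSize = Math.max(chunkSize, Math.ceil(totalFrames / chunkCount));
  return { naiveCount, chunkCount, effectiveChunkSize };
}

function sliceFrames(
  totalFrames: number,
  chunkCount: number,
  effectiveChunkSize: number,
): FrameSlice[] {
  return Array.from({ length: chunkCount }, (_, i) => ({
    startFrame: i * effectiveChunkSize,
    endFrame: Math.min((i + 1) * effectiveChunkSize, totalFrames),
  }));
}

const buggyPlan = preFixPlan(121, 10, 12);
const buggySlices = sliceFrames(121, buggyPlan.chunkCount, buggyPlan.effectiveChunkSize);
const emptySlices = buggySlices.filter((s) => s.endFrame - s.startFrame <= 0);

console.log(buggyPlan); // naiveCount 13, chunkCount 12, effectiveChunkSize 11
console.log(emptySlices); // a single slice with startFrame 121 and endFrame 121
```

The filter uses the same framesInChunk <= 0 condition that renderChunk is described as rejecting, so the one entry in emptySlices is the slice that would fail.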

The fix

Tighten chunkCount to ceil(totalFrames / effectiveChunkSize) after the size is finalized, so the trailing empty slice is dropped and the union stays exactly [0, totalFrames). This only ever lowers chunkCount in the explicit-small-chunkSize case; the auto-sized and large-chunkSize paths already satisfy ceil(totalFrames / effectiveChunkSize) >= chunkCount, so it's a no-op there (every existing test's asserted chunkCount is unchanged). maxParallelChunks is still respected.
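A rough sketch of the tightening step, again modelling only the explicit chunk-size path and assuming totalFrames, chunkSize, and maxParallelChunks are all at least 1. The function name tightenedPlan and the intermediate cappedCount are hypothetical; the actual change in resolveChunkPlan may be structured differently.

```typescript
// Hypothetical sketch of the post-fix arithmetic; not the producer's actual code.
function tightenedPlan(totalFrames: number, chunkSize: number, maxParallelChunks: number) {
  // Same as before the fix: cap the naive count, then round the size up.
  const cappedCount = Math.min(maxParallelChunks, Math.ceil(totalFrames / chunkSize));
  const effectiveChunkSize = Math.max(chunkSize, Math.ceil(totalFrames / cappedCount));
  // The tightening step: recompute the count from the finalized size.
  // Because effectiveChunkSize >= totalFrames / cappedCount, this never exceeds cappedCount.
  const chunkCount = Math.ceil(totalFrames / effectiveChunkSize);
  return { cappedCount, chunkCount, effectiveChunkSize };
}

console.log(tightenedPlan(121, 10, 12)); // cappedCount 12, chunkCount 11, effectiveChunkSize 11
console.log(tightenedPlan(120, 10, 12)); // cappedCount 12, chunkCount 12, effectiveChunkSize 10
console.log(tightenedPlan(300, 10, 4)); // cappedCount 4, chunkCount 4, effectiveChunkSize 75
```

Only the first call lowers the count. The other two show inputs where the recomputed count equals the capped count, so the step changes nothing.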

Tests

plan.test.ts gains the 121 / 10 / 12 regression case (asserts chunkCount === 11 and the last slice is [110, 121)) plus a grid property test over a range of totalFrames × maxParallelChunks × explicit chunkSize asserting every slice is non-empty, contiguous from 0, and ends exactly at totalFrames.
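The shape of such a grid check can be sketched without a test framework. The grid values below are arbitrary picks for illustration, not the ones in plan.test.ts, and planSlices and coverageError are hypothetical helpers that re-model the tightened arithmetic rather than calling the real resolveChunkPlan and buildChunkSlices.

```typescript
// Hypothetical, framework-free sketch of a grid property check.
type FrameRange = [number, number]; // half-open [start, end)

function planSlices(totalFrames: number, chunkSize: number, maxParallelChunks: number): FrameRange[] {
  const cappedCount = Math.min(maxParallelChunks, Math.ceil(totalFrames / chunkSize));
  const effectiveChunkSize = Math.max(chunkSize, Math.ceil(totalFrames / cappedCount));
  const chunkCount = Math.ceil(totalFrames / effectiveChunkSize); // tightening step
  const slices: FrameRange[] = [];
  for (let i = 0; i < chunkCount; i++) {
    slices.push([i * effectiveChunkSize, Math.min((i + 1) * effectiveChunkSize, totalFrames)]);
  }
  return slices;
}

// Returns null when the slices are non-empty, contiguous from 0, and end at totalFrames.
function coverageError(slices: FrameRange[], totalFrames: number): string | null {
  let cursor = 0;
  for (const [start, end] of slices) {
    if (start !== cursor) return `slice starts at ${start}, expected ${cursor}`;
    if (end <= start) return `empty or inverted slice [${start}, ${end})`;
    cursor = end;
  }
  return cursor === totalFrames ? null : `coverage ends at ${cursor}, expected ${totalFrames}`;
}

const gridFailures: string[] = [];
for (const totalFrames of [1, 29, 120, 121, 240, 1000]) {
  for (const maxParallelChunks of [1, 2, 12, 16, 64]) {
    for (const chunkSize of [1, 7, 10, 30, 500]) {
      const slices = planSlices(totalFrames, chunkSize, maxParallelChunks);
      const label = `${totalFrames}/${chunkSize}/${maxParallelChunks}`;
      const error = coverageError(slices, totalFrames);
      if (error !== null) gridFailures.push(`${label}: ${error}`);
      if (slices.length > maxParallelChunks) gridFailures.push(`${label}: too many chunks`);
    }
  }
}

console.log(`${gridFailures.length} failing combinations`); // prints "0 failing combinations"
```

Checking contiguity with a running cursor covers gaps, overlaps, and a short or long tail in one pass, which is why a single helper is enough for all three assertions.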

…plan

resolveChunkPlan caps chunkCount at maxParallelChunks from the naive
count, then rounds effectiveChunkSize up to ceil(totalFrames /
chunkCount). When that ceil rounds up, the first (chunkCount - 1) chunks
can already cover every frame, so buildChunkSlices emits a final slice
with startFrame >= totalFrames — an empty [n, n) or inverted range.
renderChunk rejects it (framesInChunk <= 0) and, under Step Functions
retries, fails the whole distributed render even though [0, totalFrames)
is fully covered.

This is reachable from the user-facing CLI: `hyperframes lambda render
--chunk-size 10 --max-parallel-chunks 12` on a ~4s/30fps (121-frame)
composition yields chunkCount=12, effectiveChunkSize=11, and a 12th slice
of [121, 121).

Tighten chunkCount to ceil(totalFrames / effectiveChunkSize) after the
size is finalized, so the union stays exactly [0, totalFrames) with no
empty tail. This only lowers chunkCount in the explicit-small-chunkSize
case; the auto-sized and large-chunkSize paths already satisfy
ceil(totalFrames / effectiveChunkSize) >= chunkCount, so it's a no-op
there (existing tests' chunkCount values are unchanged).

Adds a regression test for the 121/10/12 case plus a grid property test
asserting contiguous, non-empty, exact coverage across explicit sizes.
Collaborator

@miguel-heygen left a comment


Clean, correct fix. The root cause is well-explained: effectiveChunkSize rounding up post-cap left a [totalFrames, totalFrames) tail that renderChunk rejects, wedging the whole distributed render. The tighten-after-finalize approach is the right spot to fix it — no change to auto-sized or large-chunkSize paths.

The grid property test covering 150 combinations (6 frame counts × 5 max-parallel × 5 chunk sizes) is exactly the right test for an arithmetic invariant like this. No issues.

@miguel-heygen merged commit 81521a9 into heygen-com:main on May 30, 2026
31 of 41 checks passed