Move `MeshUniform` allocation from the CPU to the GPU. #23662
Open

pcwalton wants to merge 1 commit into bevyengine:main from
Conversation
The goal of GPU-driven rendering is to cache the entire scene graph on the GPU in a form that's efficient for rendering and, for objects that didn't change since the previous frame, to have zero CPU-side overhead. If the scene didn't change, the only CPU overhead should be proportional to the number of multi-draw indirect calls. PR bevyengine#23481 eliminated the CPU loop over every mesh *instance* in rendering, which brought us closer to this ideal, but it didn't fully get us there, because there's still a CPU loop over every *mesh*. Although there are usually many fewer meshes than mesh instances in large scenes, this still represents a potential bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate `MeshUniform`s, which are the data structures that the GPU transform-and-cull stage stores the post-transform data in. Unlike `MeshInputUniform`s, which are scattered throughout memory and allocated using a CPU-side free list, `MeshUniform`s are indexed by *instance ID*. Because of the way multi-draw indirect assigns instance IDs, all instances of a specific mesh must be adjacent to one another. This necessitates a global allocation pass that lays out `MeshUniform`s in memory such that all the instances of a specific mesh end up adjacent to one another. This operation is currently performed on the CPU in the `MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set` method and has overhead proportional to the number of separate meshes (not mesh *instances*) in each batch set.

This PR addresses the problem by moving the sequential loop in that method to the GPU. A new GPU phase known as the *uniform allocation* step has been added. This shader essentially performs a [prefix sum] in order to allocate the `MeshUniform`s corresponding to the batches within a batch set. This isn't the first prefix sum operation that we have in Bevy: PR bevyengine#23036 added a prefix sum for light clustering.
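To make the role of the prefix sum concrete, here is a minimal CPU-side sketch (not the PR's actual code; the function name and shape are illustrative) of how an exclusive prefix sum over per-mesh instance counts yields each mesh's base offset into the contiguous `MeshUniform` array:

```rust
// Hypothetical illustration: an exclusive prefix sum over per-mesh instance
// counts. Each mesh's offset is the total number of instances of all meshes
// before it, so all instances of one mesh land in a contiguous range.
fn allocate_uniform_offsets(instance_counts: &[u32]) -> (Vec<u32>, u32) {
    let mut offsets = Vec::with_capacity(instance_counts.len());
    let mut running_total = 0u32;
    for &count in instance_counts {
        offsets.push(running_total);
        running_total += count;
    }
    // `running_total` is now the total uniform count for the batch set.
    (offsets, running_total)
}

fn main() {
    // Three meshes with 4, 2, and 5 instances respectively.
    let (offsets, total) = allocate_uniform_offsets(&[4, 2, 5]);
    assert_eq!(offsets, vec![0, 4, 6]);
    assert_eq!(total, 11);
    println!("offsets = {:?}, total = {}", offsets, total);
}
```

The sequential dependency in `running_total` is exactly what the GPU pass parallelizes with a scan.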
However, in order to scale better to tens of thousands of meshes in a single batch set (i.e. multi-draw command), the uniform allocation pass added in this PR uses the three-step *scan and fan* process rather than the two-step process that PR bevyengine#23036 uses. The scan and fan algorithm works as follows:

1. *Local allocation*: Perform a [Hillis-Steele scan] on chunks of size equal to the workgroup size (256, in this case), producing a prefix sum for each 256-element block. Write the final sum of each chunk to a *fan buffer*.
2. *Global allocation*: Perform a Hillis-Steele scan on the fan buffer and write the results. Now each chunk can determine the running total leading into that chunk.
3. *Fan*: For each chunk, add the running total leading into that chunk to every one of that chunk's elements.

Note that, if the number of meshes is lower than the workgroup size, we only need step (1) above and can skip steps (2) and (3). Because batch sets rarely contain over 256 meshes, this means that in real-world scenes we typically only need to run step (1).

This patch had to rework the `RenderMultidrawableBatchSet` structure added in PR bevyengine#23481 in order to perform additional bookkeeping necessary to keep the time complexity of adding a mesh instance O(1). The `proptest`-based test suite has been updated and extended significantly to deal with this additional complexity.

For static meshes without skins and morph targets, this PR eliminates the last remaining per-mesh overhead in the render schedules, with the exceptions of (a) the full ECS table scans required for change detection and (b) the overhead of reuploading the various GPU buffers. Change indexes (PR bevyengine#23519) address issue (a), and more use of `SparseBufferVec` (PR bevyengine#23242) will address issue (b).

[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum
[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
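The three steps above can be sketched on the CPU as follows. This is an illustration of the scan-and-fan structure, not the PR's WGSL shader: the chunk size stands in for the workgroup size (256 in the real pass; 4 here for readability), and each chunk is scanned sequentially where the shader would use a parallel Hillis-Steele scan within the workgroup.

```rust
// Chunk size standing in for the GPU workgroup size (256 in the real shader).
const CHUNK: usize = 4;

/// Exclusive prefix sum via the three-step scan-and-fan process.
fn scan_and_fan(values: &[u32]) -> Vec<u32> {
    let mut out = vec![0u32; values.len()];
    let num_chunks = (values.len() + CHUNK - 1) / CHUNK;
    let mut fan = vec![0u32; num_chunks];

    // Step 1: local allocation. Scan within each chunk independently (this
    // is what each workgroup does in parallel on the GPU) and record each
    // chunk's total in the fan buffer.
    for (ci, chunk) in values.chunks(CHUNK).enumerate() {
        let mut sum = 0u32;
        for (i, &v) in chunk.iter().enumerate() {
            out[ci * CHUNK + i] = sum;
            sum += v;
        }
        fan[ci] = sum;
    }

    // Step 2: global allocation. Scan the fan buffer so each entry becomes
    // the running total leading into that chunk.
    let mut running = 0u32;
    for f in fan.iter_mut() {
        let chunk_total = *f;
        *f = running;
        running += chunk_total;
    }

    // Step 3: fan. Add each chunk's incoming running total to its elements.
    for (ci, chunk) in out.chunks_mut(CHUNK).enumerate() {
        for v in chunk {
            *v += fan[ci];
        }
    }
    out
}

fn main() {
    // Nine "meshes" spanning three chunks; the result matches a plain
    // sequential exclusive scan.
    let offsets = scan_and_fan(&[1, 2, 3, 4, 5, 6, 7, 8, 9]);
    assert_eq!(offsets, vec![0, 1, 3, 6, 10, 15, 21, 28, 36]);
    println!("{:?}", offsets);
}
```

Note how the early-out in the PR falls out of this structure: with at most `CHUNK` elements there is a single chunk, the fan buffer holds one zero entry, and steps (2) and (3) are no-ops.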