
Move MeshUniform allocation from the CPU to the GPU. #23662

Open

pcwalton wants to merge 1 commit into bevyengine:main from pcwalton:batch-slabs
Conversation


pcwalton (Contributor) commented Apr 4, 2026

The goal of GPU-driven rendering is to cache the entire scene graph on the GPU in a form that's efficient for rendering and, for objects that didn't change since the previous frame, to have zero CPU-side overhead. If the scene didn't change, the only CPU overhead should be proportional to the number of multi-draw indirect calls. PR #23481 eliminated the CPU loop over every mesh *instance* in rendering, which brought us closer to this ideal, but it didn't fully get us there, because there's still a CPU loop over every *mesh*. Although there are usually many fewer meshes than mesh instances in large scenes, this still represents a potential bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate `MeshUniform`s, which are the data structures that the GPU transform-and-cull stage stores the post-transform data in. Unlike `MeshInputUniform`s, which are scattered throughout memory and allocated using a CPU-side free list, `MeshUniform`s are indexed by *instance ID*. Because of the way multi-draw indirect assigns instance IDs, all instances of a specific mesh must be adjacent to one another. This necessitates a global allocation pass that lays out `MeshUniform`s in memory such that all the instances of a specific mesh end up adjacent to one another. This operation is currently performed on the CPU in the `MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set` method and has overhead proportional to the number of separate meshes (not mesh *instances*) in each batch set.
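The layout requirement above amounts to an exclusive prefix sum over per-mesh instance counts. As a rough illustration (a simplified sketch, not Bevy's actual code; the function and types here are hypothetical), the sequential CPU-side pass being replaced does something like:

```rust
// Hypothetical simplification of the CPU-side layout pass this PR moves to
// the GPU: given the instance count of each mesh in a batch set, assign each
// mesh a contiguous range of instance IDs via an exclusive prefix sum.
fn allocate_uniform_ranges(instance_counts: &[u32]) -> Vec<(u32, u32)> {
    let mut ranges = Vec::with_capacity(instance_counts.len());
    let mut next_free = 0;
    for &count in instance_counts {
        // All instances of this mesh occupy [next_free, next_free + count).
        ranges.push((next_free, count));
        next_free += count;
    }
    ranges
}

fn main() {
    // Three meshes with 3, 1, and 4 instances respectively.
    let ranges = allocate_uniform_ranges(&[3, 1, 4]);
    assert_eq!(ranges, vec![(0, 3), (3, 1), (4, 4)]);
    println!("{:?}", ranges);
}
```

Because each mesh's base offset depends on the running total of all previous meshes, this loop is inherently sequential on the CPU, which is what makes a parallel GPU prefix sum attractive.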

This PR addresses the problem by moving the sequential loop in that method to the GPU. A new GPU phase, known as the *uniform allocation* step, has been added. Its shader essentially performs a [prefix sum](https://en.wikipedia.org/wiki/Prefix_sum) in order to allocate the `MeshUniform`s corresponding to the batches within a batch set. This isn't the first prefix sum operation that we have in Bevy: PR #23036 added a prefix sum for light clustering. However, in order to scale better to tens of thousands of meshes in a single batch set (i.e. multi-draw command), the uniform allocation pass added in this PR uses the three-step *scan and fan* process rather than the two-step process that PR #23036 uses. The scan and fan algorithm works as follows:

  1. *Local allocation*: Perform a [Hillis-Steele scan](https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel) on chunks of size equal to the workgroup size (256, in this case), producing a prefix sum for each 256-element block. Write the final sum of each chunk to a *fan buffer*.

  2. *Global allocation*: Perform a Hillis-Steele scan on the fan buffer and write the results. Now each chunk can determine the running total leading into that chunk.

  3. *Fan*: For each chunk, add the running total leading into that chunk to every one of that chunk's elements.
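The three steps above can be sketched sequentially on the CPU (the actual implementation is a compute shader, and each step is parallel across threads; a small chunk size is used here so the cross-chunk steps are exercised, where the real pass uses the workgroup size, 256):

```rust
// CPU sketch of the three-step scan-and-fan: produces an inclusive prefix
// sum of `values` by scanning fixed-size chunks, then combining across chunks.
const CHUNK: usize = 4;

fn scan_and_fan(values: &mut [u32]) {
    // Step 1: local allocation -- an in-chunk inclusive scan, recording each
    // chunk's total in the fan buffer. (The GPU version does this with a
    // Hillis-Steele scan inside one workgroup.)
    let mut fan: Vec<u32> = Vec::new();
    for chunk in values.chunks_mut(CHUNK) {
        for i in 1..chunk.len() {
            chunk[i] += chunk[i - 1];
        }
        fan.push(*chunk.last().unwrap());
    }
    // Step 2: global allocation -- scan the fan buffer so entry `i` holds the
    // running total of all chunks up to and including chunk `i`.
    for i in 1..fan.len() {
        fan[i] += fan[i - 1];
    }
    // Step 3: fan -- add each chunk's incoming running total to its elements.
    // Chunk 0 has no predecessor, so it is skipped.
    for (chunk_index, chunk) in values.chunks_mut(CHUNK).enumerate().skip(1) {
        let base = fan[chunk_index - 1];
        for v in chunk.iter_mut() {
            *v += base;
        }
    }
}

fn main() {
    let mut values: Vec<u32> = (1..=10).collect();
    scan_and_fan(&mut values);
    // Inclusive prefix sum of 1..=10.
    assert_eq!(values, vec![1, 3, 6, 10, 15, 21, 28, 36, 45, 55]);
    println!("{:?}", values);
}
```

Note that when the input fits in a single chunk, steps (2) and (3) degenerate to no-ops here, mirroring the fast path described below.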

Note that, if the number of meshes is lower than the workgroup size, we only need step (1) above and can skip steps (2) and (3). Because batch sets rarely contain over 256 meshes, this means that in real-world scenes we typically only need to run step (1).

This patch had to rework the `RenderMultidrawableBatchSet` structure added in PR #23481 in order to perform additional bookkeeping necessary to keep the time complexity of adding a mesh instance O(1). The `proptest`-based test suite has been updated and extended significantly to deal with this additional complexity.
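The essence of that O(1) bookkeeping can be sketched as follows (a hypothetical illustration, not Bevy's actual types): insertion only bumps a per-mesh counter, while the prefix sum over those counters is deferred to the GPU allocation pass.

```rust
use std::collections::HashMap;

// Hypothetical sketch of batch-set bookkeeping in which adding a mesh
// instance is O(1): no offsets are recomputed on insert; only a counter for
// the instance's mesh is incremented.
#[derive(Default)]
struct BatchSet {
    // Instance count per mesh, in batch order.
    counts: Vec<u32>,
    // Maps a mesh ID to its index in `counts`.
    mesh_to_batch: HashMap<u64, usize>,
}

impl BatchSet {
    // O(1) amortized: one hash lookup plus a counter increment.
    fn add_instance(&mut self, mesh_id: u64) {
        let index = match self.mesh_to_batch.get(&mesh_id) {
            Some(&i) => i,
            None => {
                let i = self.counts.len();
                self.counts.push(0);
                self.mesh_to_batch.insert(mesh_id, i);
                i
            }
        };
        self.counts[index] += 1;
    }
}

fn main() {
    let mut set = BatchSet::default();
    for id in [7u64, 7, 9, 7] {
        set.add_instance(id);
    }
    // Mesh 7 has 3 instances, mesh 9 has 1.
    assert_eq!(set.counts, vec![3, 1]);
    println!("{:?}", set.counts);
}
```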

For static meshes without skins and morph targets, this PR eliminates the last remaining per-mesh overhead in the render schedules, with the exceptions of (a) the full ECS table scans required for change detection and (b) the overhead of reuploading the various GPU buffers. Change indexes (PR #23519) address issue (a), and more use of `SparseBufferVec` (PR #23242) will address issue (b).

@pcwalton pcwalton added the A-Rendering Drawing game state to the screen label Apr 4, 2026
@github-project-automation github-project-automation bot moved this to Needs SME Triage in Rendering Apr 4, 2026
@pcwalton pcwalton added S-Needs-Review Needs reviewer attention (from anyone!) to move forward C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective. Ask for help! labels Apr 4, 2026
