
Retain bins from frame to frame. #17698

Merged: 7 commits merged into bevyengine:main from retained-bins-2 on Feb 8, 2025
Conversation

@pcwalton (Contributor) commented Feb 5, 2025:

This PR makes Bevy keep entities in bins from frame to frame if they haven't changed. This reduces the time spent in queue_material_meshes and related functions to near zero for static geometry. This patch uses the same change tick technique that #17567 uses to detect when meshes have changed in such a way as to require re-binning.

In order to quickly find the relevant bin for an entity when that entity has changed, we introduce a new type of cache, the bin key cache. This cache stores a mapping from main world entity ID to cached bin key, as well as the tick of the most recent change to the entity. As we iterate through the visible entities in queue_material_meshes, we check the cache to see whether the entity needs to be re-binned. If it doesn't, then we mark it as clean in the valid_cached_entity_bin_keys bit set. If it does, then we insert it into the correct bin, and then mark the entity as clean. At the end, all entities not marked as clean are removed from the bins.
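For illustration, here is a minimal, self-contained sketch of the caching pattern described above. The names (`BinKeyCache`, `queue_visible_entities`, `compute_bin_key`) and the use of a `HashSet` for clean-marking are simplifications invented for this sketch; the actual implementation keys the cache on Bevy's main-world entity IDs, stores real per-phase bin keys, and tracks cleanliness in the `valid_cached_entity_bin_keys` bit set.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-ins; the real types carry much more state.
type MainEntity = u64;
type Tick = u32;
type BinKey = u64;

#[derive(Default)]
struct BinKeyCache {
    /// Main-world entity -> (cached bin key, tick of its last observed change).
    entries: HashMap<MainEntity, (BinKey, Tick)>,
}

#[derive(Default)]
struct Bins {
    bins: HashMap<BinKey, Vec<MainEntity>>,
}

fn queue_visible_entities(
    visible: &[(MainEntity, Tick)], // each visible entity with its current change tick
    compute_bin_key: impl Fn(MainEntity) -> BinKey,
    cache: &mut BinKeyCache,
    bins: &mut Bins,
) {
    // Stand-in for the `valid_cached_entity_bin_keys` bit set.
    let mut clean: HashSet<MainEntity> = HashSet::new();

    for &(entity, change_tick) in visible {
        let cached = cache.entries.get(&entity).copied();
        match cached {
            // Unchanged since it was last binned: keep its existing bin entry.
            Some((_, cached_tick)) if cached_tick >= change_tick => {
                clean.insert(entity);
            }
            // New or changed: drop any stale bin entry, re-bin, and mark clean.
            other => {
                if let Some((old_key, _)) = other {
                    if let Some(bin) = bins.bins.get_mut(&old_key) {
                        bin.retain(|e| *e != entity);
                    }
                }
                let key = compute_bin_key(entity);
                bins.bins.entry(key).or_default().push(entity);
                cache.entries.insert(entity, (key, change_tick));
                clean.insert(entity);
            }
        }
    }

    // Sweep: anything not marked clean this frame is no longer visible,
    // so remove it from its old bin and from the cache.
    for (entity, &(key, _)) in cache.entries.iter() {
        if !clean.contains(entity) {
            if let Some(bin) = bins.bins.get_mut(&key) {
                bin.retain(|e| e != entity);
            }
        }
    }
    cache.entries.retain(|entity, _| clean.contains(entity));
}

fn main() {
    let mut cache = BinKeyCache::default();
    let mut bins = Bins::default();
    // Frame 1: both entities need binning; frame 2: nothing changed, so the
    // bins are left untouched and no re-binning work happens.
    for _frame in 0..2 {
        queue_visible_entities(&[(1, 10), (2, 10)], |e| e % 2, &mut cache, &mut bins);
    }
    assert_eq!(bins.bins[&1], vec![1]);
    assert_eq!(bins.bins[&0], vec![2]);
}
```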

This patch has a dramatic effect on the rendering performance of most benchmarks, as it effectively eliminates queue_material_meshes from the profile. Note, however, that it generally simultaneously regresses batch_and_prepare_binned_render_phase by a bit (not by enough to outweigh the win, however). I believe that's because, before this patch, queue_material_meshes put the bins in the CPU cache for batch_and_prepare_binned_render_phase to use, while with this patch, batch_and_prepare_binned_render_phase must load the bins into the CPU cache itself.

On Caldera, this reduces the time spent in queue_material_meshes from 5+ ms to 0.2ms-0.3ms. Note that benchmarking on that scene is very noisy right now because of #17535.

[Attached screenshot: Screenshot 2025-02-05 153458]

@pcwalton added the A-Rendering (Drawing game state to the screen), C-Performance (A change motivated by improving speed, memory usage or compile times), and S-Needs-Review (Needs reviewer attention (from anyone!) to move forward) labels on Feb 5, 2025
@pcwalton force-pushed the retained-bins-2 branch 2 times, most recently from 1036f81 to 7b2fd74 on February 6, 2025 02:48. The commit message reads:
This PR makes Bevy keep entities in bins from frame to frame if they
haven't changed. This reduces the time spent in `queue_material_meshes`
and related functions to near zero for static geometry. This patch uses
the same change tick technique that bevyengine#17567 uses to detect when meshes have
changed in such a way as to require re-binning.

In order to quickly find the relevant bin for an entity when that entity
has changed, we introduce a new type of cache, the *bin key cache*. This
cache stores a mapping from main world entity ID to cached bin key, as
well as the tick of the most recent change to the entity. As we iterate
through the visible entities in `queue_material_meshes`, we check the
cache to see whether the entity needs to be re-binned. If it doesn't,
then we mark it as clean in the `valid_cached_entity_bin_keys` bitset.
At the end, all bin keys not marked as clean are removed from the bins.

This patch has a dramatic effect on the rendering performance of most
benchmarks, as it effectively eliminates `queue_material_meshes` from
the profile. Note, however, that it generally simultaneously regresses
`batch_and_prepare_binned_render_phase` by a bit (not by enough to
outweigh the win, however). I believe that's because, before this patch,
`queue_material_meshes` put the bins in the CPU cache for
`batch_and_prepare_binned_render_phase` to use, while with this patch,
`batch_and_prepare_binned_render_phase` must load the batches into the
CPU cache itself.
@tychedelia (Contributor) left a comment:
Looks great!

@@ -1378,7 +1380,7 @@ pub fn batch_and_prepare_binned_render_phase<BPI, GFBD>(
                },
            );
            batch = Some(BinnedRenderPhaseBatch {
-               representative_entity: (entity, main_entity),
+               representative_entity: (Entity::PLACEHOLDER, *main_entity),
Contributor:
This makes sense since we are entirely using main entity in the mesh-material 3d stuff but part of me wants to modify the phase item api to declare that you either are using a render entity or a main entity rather than both which is just confusing. A problem for a future refactor.

pcwalton (author):
The API needs an overhaul--much of it is pure legacy.

Contributor:
Yup. And that is definitely a separate PR. :)

crates/bevy_render/src/render_phase/mod.rs (outdated; resolved)
Contributor:
Hmm, out of scope for this PR, but makes me wonder whether we don't want to also demonstrate a "custom" specialization system here.

pcwalton (author):
There's a specialized_mesh_pipeline example, is that what you mean?

Contributor:
No, just that we're eagerly specializing and rebuilding the bins. I still like that this example is relatively minimal (and it's not as if it's necessarily a problem), I was just unsure whether we should demonstrate the cached version.

pcwalton (author):
Perhaps at some point in a follow-up. I'm not super happy with specialized_mesh_pipeline and custom_shader_instancing and we might want to think about what a more comprehensive example would look like.

(A comment from ickshonpe was marked as off-topic.)

@pcwalton (author) commented Feb 6, 2025:

I pushed a fix for deferred_rendering. The problem was that we were only accounting for entities being added to or removed from a phase, not entities changing bins within the same phase.

@pcwalton (author) commented Feb 8, 2025:

All the Pixel Eagle failures look like false positives or pre-existing issues. I think this is good.

@pcwalton (author) commented Feb 8, 2025:

Two of the failures in the Bevy example runner are fixed by #17717 and #17736.

        // from by setting them all to 1.
        let mut block = self.bitset.as_slice()[block_index as usize];
        if bit_pos + 1 < (Block::BITS as isize) {
            block |= (!0) << (bit_pos + 1);
Contributor:
Why !0 instead of just 1? Seems unnecessarily indirect?

pcwalton (author):
! in Rust actually flips all the bits, like ~ in C.

pcwalton (author):
And we need to fill all the bits so that we can set all the bits "past the end" of the bit vector. I don't actually know what value fixedbitset fills them with, but I doubt we can rely on them all being 1s (if anything, they're more likely to be 0s). So I mask them off to be safe.
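For illustration, here is the masking trick on a single 32-bit block. This is a standalone toy (the real code operates on fixedbitset's `Block` type with signed positions), but it shows why the fill value of the spare bits stops mattering once they're OR-ed to 1:

```rust
fn main() {
    // One 32-bit block; we're scanning backwards from `bit_pos` for an unset bit.
    let block: u32 = 0b0000_0101; // bits 0 and 2 set, bit 1 unset
    let bit_pos: u32 = 2;

    // `!0` is all ones (like `~0` in C). OR-ing in `(!0) << (bit_pos + 1)`
    // forces every bit above `bit_pos` to 1, so whatever value those
    // "past the end" bits hold can never be reported as an unset bit.
    let masked = block | ((!0u32) << (bit_pos + 1));
    assert_eq!(masked, 0xFFFF_FFFD);

    // `leading_ones` counts from the MSB down, so convert the count back
    // into a bit index (see the following discussion).
    let pos = 32 - (masked.leading_ones() as i32) - 1;
    assert_eq!(pos, 1); // bit 1 is the next unset bit at or below `bit_pos`
}
```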

        // Search for the next unset bit. Note that the `leading_ones`
        // function counts from the MSB to the LSB, so we need to flip it to
        // get the bit number.
        let pos = (Block::BITS as isize) - (block.leading_ones() as isize) - 1;
Contributor:
I don’t understand why leading_ones() is used here. Doesn’t the algorithm scan from the end of the set backwards to find zeros? Shouldn’t it look at trailing ones?

pcwalton (author):
Yeah, so fixedbitset actually fills each block from LSB to MSB, so bit 0 is (1 << 0), bit 1 is (1 << 1), etc. leading_ones searches from MSB to LSB. So we need to use leading_ones to iterate in reverse. (trailing_ones would search from LSB to MSB, which would be wrong in this case.)
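A quick illustration of that directionality, using an 8-bit value purely for readability (the real block type is wider):

```rust
fn main() {
    // fixedbitset fills each block from the LSB up: bit 0 is (1 << 0),
    // bit 1 is (1 << 1), and so on.
    let block: u8 = 0b1100_0011; // bits 0, 1, 6, 7 set

    // Counts ones starting at the MSB (bit 7) and moving down: bits 7 and 6.
    assert_eq!(block.leading_ones(), 2);
    // Counts ones starting at the LSB (bit 0) and moving up: bits 0 and 1.
    assert_eq!(block.trailing_ones(), 2);

    // To walk the set backwards (highest bit index first), we therefore use
    // `leading_ones` and convert the count into a bit index:
    let pos = 8 - (block.leading_ones() as i32) - 1;
    assert_eq!(pos, 5); // bit 5 is the highest unset bit in the block
}
```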

Contributor:
Ahhh, ok. I guess they did that for simpler indexing? Regardless, fine.

@@ -151,6 +152,8 @@ impl Plugin for SpritePlugin {
                    .ambiguous_with(queue_material2d_meshes::<ColorMaterial>),
                prepare_sprite_image_bind_groups.in_set(RenderSet::PrepareBindGroups),
                prepare_sprite_view_bind_groups.in_set(RenderSet::PrepareBindGroups),
+               sort_binned_render_phase::<Opaque2d>.in_set(RenderSet::PhaseSort),
+               sort_binned_render_phase::<AlphaMask2d>.in_set(RenderSet::PhaseSort),
Contributor:
Were these just accidentally missing before this PR?

pcwalton (author):
Yep, they were accidentally missing. And we need them now.

@superdump (Contributor) left a comment:
Just a couple of questions. Otherwise LGTM. Nice!

@pcwalton requested a review from superdump on February 8, 2025 09:35
@superdump added this pull request to the merge queue on Feb 8, 2025
Merged via the queue into bevyengine:main with commit 7fc122a on Feb 8, 2025
32 checks passed
pcwalton added a commit to pcwalton/bevy that referenced this pull request Feb 10, 2025
Currently, invocations of `batch_and_prepare_binned_render_phase` and
`batch_and_prepare_sorted_render_phase` can't run in parallel because
they write to scene-global GPU buffers. After PR bevyengine#17698,
`batch_and_prepare_binned_render_phase` started accounting for the
lion's share of the CPU time, causing us to be strongly CPU bound on
scenes like Caldera when occlusion culling was on (because of the
overhead of batching for the Z-prepass). Although I eventually plan to
optimize `batch_and_prepare_binned_render_phase`, we can obtain
significant wins now by parallelizing that system across phases.

This commit splits all GPU buffers that
`batch_and_prepare_binned_render_phase` and
`batch_and_prepare_sorted_render_phase` touches into separate buffers
for each phase so that the scheduler will run those phases in parallel.
At the end of batch preparation, we gather the render phases up into a
single resource with a new *collection* phase. Because we already run
mesh preprocessing separately for each phase in order to make occlusion
culling work, this is actually a cleaner separation. For example, mesh
output indices (the unique ID that identifies each mesh instance on GPU)
are now guaranteed to be sequential starting from 0, which will simplify
the forthcoming work to remove them in favor of the compute dispatch ID.

On Caldera, this brings the frame time down to approximately 9.1 ms with
occlusion culling on.
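As a rough sketch of the scheduling idea (splitting per-phase state into separate resources so that systems touching different phases have disjoint access): the type and system names below are invented for illustration, and `Update` stands in for Bevy's render-schedule sets; the real change splits the existing batching buffers rather than adding new types.

```rust
use bevy::prelude::*;
use std::marker::PhantomData;

// Hypothetical per-phase buffer set, standing in for the real GPU buffers.
struct PhaseBuffers<P> {
    instance_data: Vec<u32>,
    _phase: PhantomData<P>,
}

impl<P: Send + Sync + 'static> Resource for PhaseBuffers<P> {}

// Marker types standing in for two render phases.
struct Opaque3dPhase;
struct ShadowPhase;

// One instantiation per phase. Each writes only to its own `PhaseBuffers<P>`
// resource, so the instantiations have disjoint data access and the ECS
// scheduler is free to run them in parallel.
fn batch_and_prepare_phase<P: Send + Sync + 'static>(mut buffers: ResMut<PhaseBuffers<P>>) {
    buffers.instance_data.push(0); // ...fill per-phase batch data here...
}

fn main() {
    App::new()
        .insert_resource(PhaseBuffers::<Opaque3dPhase> {
            instance_data: Vec::new(),
            _phase: PhantomData,
        })
        .insert_resource(PhaseBuffers::<ShadowPhase> {
            instance_data: Vec::new(),
            _phase: PhantomData,
        })
        .add_systems(
            Update,
            (
                batch_and_prepare_phase::<Opaque3dPhase>,
                batch_and_prepare_phase::<ShadowPhase>,
            ),
        )
        .run();
}
```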