-
Notifications
You must be signed in to change notification settings - Fork 3.5k
[webgpu] revise implementation of buffer split support #26429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the WebGPU provider to replace the testing-only boolean flag small_storage_buffer_binding_size_for_testing with a more flexible max_storage_buffer_binding_size parameter that allows users to explicitly set the maximum storage buffer binding size.
Key changes:
- Replaced boolean testing flag with a configurable uint64_t parameter for controlling storage buffer binding size
- Moved segment calculation from ProgramBase member storage to separate computed vectors
- Made ProgramBase const-correct in Run operations to prevent unintended modifications
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| webgpu_provider_options.h | Renamed option key from testing-specific to production-ready name |
| webgpu_provider_factory.cc | Replaced boolean parsing with uint64_t parsing using std::from_chars |
| webgpu_context.h | Changed parameter type from bool to uint64_t with validation logic |
| webgpu_context.cc | Updated Run signature to const and moved indirect dispatch handling |
| shader_helper.h | Added spans for segments and removed FinalizeInputs method |
| shader_helper.cc | Updated to use segment spans instead of ProgramInput/Output members |
| program_manager.h | Changed CalculateSegmentsForInputsAndOutputs to return segments via out parameters |
| program_manager.cc | Implemented segment calculation into separate vectors |
| program_cache_key.h | Updated signature to accept segment spans |
| program_cache_key.cc | Updated to use segments from spans with optimization for default values |
| program.h | Removed segments members from ProgramInput/Output structs and related setters |
| program.cc | Moved indirect dispatch tensor registration to SetIndirectDispatchTensor |
| compute_context.h | Updated RunProgram to accept const reference |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
| } | ||
| } else { | ||
| bind_group_entries.push_back({nullptr, entry_index++, buffer, 0, std::min(kMaxBufferSize, buffer_size), nullptr, nullptr}); | ||
| bind_group_entries.push_back({nullptr, entry_index++, buffer, 0, WGPU_WHOLE_SIZE, nullptr, nullptr}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder that is there a specific reason to use WGPU_WHOLE_SIZE (UINT64_MAX)? Intuitively, using the actual buffer size be more appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason is I try to avoid perform one call to wgpuBufferGetSize, which is trivial in native but can be expensive in web because it involves an inter-op call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it be better to change it only bind the normalized tensor size (align with 16) which is the actual size we need and no extra memory is bound and no wgpuBufferGetSize is called? Anyway, I think it's ok to leave it as WGPU_WHOLE_SIZE. I am just curious whether it helps the performance if we only bind the needed size instead of the buffer size which is the bucket size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will think about how to do this in another PR. Ideally, all necessary information for creating binding group can be determined earlier.
Description
This PR addresses a few concerns:
const ProgramBase&->ProgramBase&: this itself is not doing something wrong but gives much more pressure for who reads the code to understand whether/where the program object is modified. It also can introduce further unexpected modifications to the program object (for example the indirect dispatch code)"ep.webgpuexecutionprovider.smallStorageBufferBindingSizeForTesting"to"ep.webgpuexecutionprovider.maxStorageBufferBindingSize"so now it's possible to set any value in option. (setting to <128MB will cause an assert failure)