[ET-VK] Re-implement split_with_sizes #15793
Conversation
As title. The current implementation of `split_with_sizes` uses functions from the `Copy.[h|cpp]` file, in particular `add_copy_channel_offset_node`. However, the shaders dispatched by this function have a critical bug: the output tensor is bound twice, with different access types, i.e.
```cpp
graph.execute_nodes().emplace_back(new DispatchNode(
    graph,
    VK_KERNEL_FROM_STR(kernel_name),
    global_size,
    local_size,
    // Inputs and Outputs
    {
        {out, vkapi::kWrite}, // output bound for write...
        {out, vkapi::kRead},  // ...and bound again for read
        {in, vkapi::kRead},
    },
    // ...
```
This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly.
To fix this, this diff re-implements the operator from scratch with a dedicated compute shader.
Differential Revision: [D86910642](https://our.internmc.facebook.com/intern/diff/D86910642/)
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15793
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 50 Pending as of commit 7602357 with merge base 3af3385. This comment was automatically generated by Dr. CI and updates every 15 minutes.
#15794) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #15829
* #15796
* #15795
* __->__ #15794
* #15793

As title. Make sure that ops that do not support quantized tensors do not get assigned memory layouts that are intended for quantized tensors.

Differential Revision: [D86910639](https://our.internmc.facebook.com/intern/diff/D86910639/)

Co-authored-by: ssjia <[email protected]>
) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #15829
* #15796
* __->__ #15795
* #15794
* #15793

Title says it all! Add two additional export options:
1. `skip_memory_planning` - skips the memory planning pass, which can be useful for debugging.
2. `small_texture_limits` - sets the default texture limit to (2048, 2048, 2048), which is compatible with more devices (i.e. desktop/laptop GPUs) than the default (16384, 16384, 2048), which is targeted more at mobile GPUs.

Also adds some improvements to the export script that were made while debugging the `YOLO_NAS` model (#15700)

Differential Revision: [D86910640](https://our.internmc.facebook.com/intern/diff/D86910640/)

Co-authored-by: ssjia <[email protected]>
…eMetadata (#15796) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #15829
* __->__ #15796
* #15795
* #15794
* #15793

Title says it all! Motivation: code simplification, and allows these ops to handle high-dim tensors.

Differential Revision: [D86910641](https://our.internmc.facebook.com/intern/diff/D86910641/)

Co-authored-by: ssjia <[email protected]>
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #15829
* #15796
* #15795
* #15794
* #15793

Title says it all! Adds `int32` and `uint8` shader variants to a bunch of operators that don't currently have variants for these dtypes, but should. This should prevent folks from running into dtype crashes at runtime when using the Vulkan delegate.

Differential Revision: [D87082724](https://our.internmc.facebook.com/intern/diff/D87082724/)

Co-authored-by: ssjia <[email protected]>

(cherry picked from commit a6c5921)