[QNN EP] Fuse Gelu pattern into a QNN Gelu Node #26332
Conversation
- ONNX models exported with older opset versions contain the Gelu operator decomposed into multiple operators (Div, Erf, Add, Mul).
- QNN doesn't support the Erf operator but does support the Gelu operator.
- Because Erf is unsupported, graphs containing the Gelu pattern are partitioned between the QNN and CPU EPs, which degrades inference time.
- Identifying and fusing the Gelu pattern into a single QNN Gelu node improves inference time.
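For reference, the decomposed subgraph computes the erf-based form of Gelu, and the constants the pattern relies on follow directly from it:

$$
\mathrm{Gelu}(x) = 0.5 \cdot x \cdot \left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)
$$

i.e. Div by $\sqrt{2}$, then Erf, then Add 1, then Mul by $x$, and finally Mul by $0.5$ (operand order may vary between exporters).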
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
@chilo-ms and @devang-ml
```cpp
#define ValidateOnQnn(qnn_model_wrapper, node_units, root_input, final_output) \
  CreateOrValidateOnQnn((qnn_model_wrapper), (node_units), (root_input), (final_output), true)
#define CreateOnQnn(qnn_model_wrapper, node_units, root_input, final_output) \
  CreateOrValidateOnQnn((qnn_model_wrapper), (node_units), (root_input), (final_output), false)
```
Is there a good reason to make these macros instead of helper functions?
Macros should be named like this to make them easy to identify: VALIDATE_ON_QNN
There is no specific reason for using macros; helper functions work too. I changed them to helper functions.
Unfortunately, something went wrong with rebasing, so I pushed a new PR addressing the comments: #26417
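For reference, a minimal sketch of what the helper-function form could look like; the argument and return types here are assumed from the macro arguments above, not taken from the actual change in #26417:

```cpp
// Sketch only: parameter types are assumptions, not the signatures used in the PR.
Status ValidateOnQnn(QnnModelWrapper& qnn_model_wrapper,
                     gsl::span<const NodeUnit* const> node_units,
                     const NodeUnitIODef& root_input,
                     const NodeUnitIODef& final_output) {
  // Forward to the shared implementation with validation enabled.
  return CreateOrValidateOnQnn(qnn_model_wrapper, node_units, root_input, final_output,
                               /*validate=*/true);
}

Status CreateOnQnn(QnnModelWrapper& qnn_model_wrapper,
                   gsl::span<const NodeUnit* const> node_units,
                   const NodeUnitIODef& root_input,
                   const NodeUnitIODef& final_output) {
  // Forward to the shared implementation with node creation enabled.
  return CreateOrValidateOnQnn(qnn_model_wrapper, node_units, root_input, final_output,
                               /*validate=*/false);
}
```

Unlike the macros, these are type-checked and easier to step through in a debugger.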
```cpp
const NodeUnit* producer_unit = it->second;
if (producer_unit->OpType() == "Mul" &&
    node_unit_to_qnn_node_group.find(producer_unit) == node_unit_to_qnn_node_group.end()) {
  // Check if this Mul has root as one input (no longer checking for constant 0.5)
```
I thought that the GELU calculation requires specific constant values. Where do we check for those?
Yes, we need to check the constant values, and I added the constant value checks.
Unfortunately, something went wrong with rebasing, so I pushed a new PR addressing the comments: #26417
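To illustrate the kind of check involved (a sketch with assumed constant names, not the code from #26417): the erf-based Gelu pattern fixes the Div divisor at $\sqrt{2}$, the Add operand at 1, and the final Mul operand at 0.5, so each scalar initializer can be compared against its expected value within a small tolerance.

```cpp
#include <cmath>

// Expected constants in the decomposed pattern Gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
constexpr float kGeluDivDivisor = 1.4142135f;  // sqrt(2)
constexpr float kGeluAddOperand = 1.0f;
constexpr float kGeluMulOperand = 0.5f;

// Returns true if a scalar initializer value matches the expected Gelu constant.
inline bool MatchesGeluConstant(float actual, float expected, float tolerance = 1e-4f) {
  return std::fabs(actual - expected) <= tolerance;
}
```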
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

@chilo-ms
@quic-tirupath I had some comments, please take a look.
… CPU with 4-bit quantized models (microsoft#26280)
### Description
This submission adds a 4-bit quantized matrix multiplication operator for the Loongson platform. It has passed the internal test checks of ONNX and has been successfully deployed for actual inference on the Loongson platform. It includes five modifications:
(1) **sqnbitgemm_kernel_lasx.cpp**: Accelerates inference for 4-bit quantized models on the LoongArch64 architecture, using the lasx/lsx vector instruction sets;
(2) **sqnbitgemm_kernel_lasx_common.h**: Implements auxiliary functions used by **sqnbitgemm_kernel_lasx.cpp**;
(3) **cmake**: Adds compilation options for **sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture;
(4) **mlasi.h**: Adds the interface for calling the operator in **sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture;
(5) **platform.cpp**: Adds calls to the operators in **sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture.
### Motivation and Context
Loongson has lacked key operators for ONNX quantized model inference, and this change addresses the poor inference performance of 4-bit quantized models on the Loongson platform. In tests with the Deepseek-R1-1.5B model, our operators increased TPS by more than 7x, with the speed of quantized-matrix dequantization improving by up to 3x.
### Pictures
Dequantization acceleration: in the chart, the vertical axis represents time in milliseconds (ms), the horizontal axis represents the number of test matrices, and the quantized matrix size is rows × columns, such as 1536×256.
<img width="4039" height="831" alt="Dequantization acceleration" src="https://github.com/user-attachments/assets/26da1ed9-79ae-4abd-9e6d-cadaea9ee013" />
---------
Co-authored-by: 全都做不队 <[email protected]>
…th (microsoft#26346) ### Description When using `onnxruntime_CUSTOM_DAWN_SRC_PATH`, the folder is assumed to have its dependencies ready. ### Motivation and Context To make customized Dawn src path usage more flexible.
…ft#26334) This pull request updates the FlashAttention WebGPU implementation to improve support for indirect dispatch. The main changes ensure that when indirect dispatch is used, the shader receives the actual workgroup dimensions from an input buffer rather than relying on built-in variables, which avoids duplication overhead in Dawn/WebGPU. See https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/ComputePassEncoder.cpp;l=275. This PR fixes the issue where indirect dispatch is slower than normal dispatch for the same program. With this change, phi4 with graph capture enabled runs at 145 tps, up from 125 tps, on an NV 5080.
### Description Fix quantization in Whisper model export ### Motivation and Context As titled. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description This PR introduces support for building the WebGPU EP with JSPI. When built with `--enable_wasm_jspi`, it does the following: - Use `JSPI` instead of `ASYNCIFY` - Use WebAssembly EH (-fwasm-exceptions) instead of Emscripten exceptions ### Motivation and Context Using JSPI with wasm exceptions helps achieve: - better performance (reduced CPU overhead) - smaller binary size (no extra generated ASYNCIFY code) - faster build time
Bumps [actions/setup-node](https://github.com/actions/setup-node) from 5 to 6. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/actions/setup-node/releases">actions/setup-node's releases</a>.</em></p> <blockquote> <h2>v6.0.0</h2> <h2>What's Changed</h2> <p><strong>Breaking Changes</strong></p> <ul> <li>Limit automatic caching to npm, update workflows and documentation by <a href="https://github.com/priyagupta108"><code>@priyagupta108</code></a> in <a href="https://redirect.github.com/actions/setup-node/pull/1374">actions/setup-node#1374</a></li> </ul> <p><strong>Dependency Upgrades</strong></p> <ul> <li>Upgrade ts-jest from 29.1.2 to 29.4.1 and document breaking changes in v5 by <a href="https://github.com/dependabot"><code>@dependabot</code></a>[bot] in <a href="https://redirect.github.com/actions/setup-node/pull/1336">#1336</a></li> <li>Upgrade prettier from 2.8.8 to 3.6.2 by <a href="https://github.com/dependabot"><code>@dependabot</code></a>[bot] in <a href="https://redirect.github.com/actions/setup-node/pull/1334">#1334</a></li> <li>Upgrade actions/publish-action from 0.3.0 to 0.4.0 by <a href="https://github.com/dependabot"><code>@dependabot</code></a>[bot] in <a href="https://redirect.github.com/actions/setup-node/pull/1362">#1362</a></li> </ul> <p><strong>Full Changelog</strong>: <a href="https://github.com/actions/setup-node/compare/v5...v6.0.0">https://github.com/actions/setup-node/compare/v5...v6.0.0</a></p> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/actions/setup-node/commit/2028fbc5c25fe9cf00d9f06a71cc4710d4507903"><code>2028fbc</code></a> Limit automatic caching to npm, update workflows and documentation (<a href="https://redirect.github.com/actions/setup-node/issues/1374">#1374</a>)</li> <li><a href="https://github.com/actions/setup-node/commit/13427813f706a0f6c9b74603b31103c40ab1c35a"><code>1342781</code></a> Bump actions/publish-action from 0.3.0 to 0.4.0 (<a href="https://redirect.github.com/actions/setup-node/issues/1362">#1362</a>)</li> <li><a href="https://github.com/actions/setup-node/commit/89d709d423dc495668cd762a18dd4a070611be3f"><code>89d709d</code></a> Bump prettier from 2.8.8 to 3.6.2 (<a href="https://redirect.github.com/actions/setup-node/issues/1334">#1334</a>)</li> <li><a href="https://github.com/actions/setup-node/commit/cd2651c46231bc0d6f48d6b34433b845331235fe"><code>cd2651c</code></a> Bump ts-jest from 29.1.2 to 29.4.1 (<a href="https://redirect.github.com/actions/setup-node/issues/1336">#1336</a>)</li> <li>See full diff in <a href="https://github.com/actions/setup-node/compare/v5...v6">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…rosoft#25919) ### Description This is a small change that includes the insertion of the `/* @vite-ignore */` magic comment within the `build.ts` `postProcess` function, alongside `/* webpackIgnore:true */`, for dynamic imports. `/* @vite-ignore */` suppresses a Vite warning: > The above dynamic import cannot be analyzed by Vite. See https://github.com/rollup/plugins/tree/master/packages/dynamic-import-vars#limitations for supported dynamic import formats. If this is intended to be left as-is, you can use the /* @vite-ignore */ comment inside the import() call to suppress this warning. , and does not change other behaviour. Note that it must be included as a separate, distinct comment to match the [regexp](https://github.com/vitejs/vite/blob/bcc31449c0c4f852ccb1eedda1842bc7ded23d01/packages/vite/src/node/plugins/importAnalysis.ts#L88). ### Motivation and Context Resolves microsoft#25918. The Vite magic comment should be included because the dynamic imports are intended 'to be left as-is', and the warning cannot otherwise be disabled by an end-user (myself in an Angular project).
…nce (microsoft#26326) Addresses microsoft#26290 --------- Co-authored-by: Prathik Rao <[email protected]>
…ft#26374) ### Description Port upstream fix for protobuf for s390x. Fix unused variable warnings. ### Motivation and Context These changes fix unnoticed issues which might appear when onnxruntime is compiled on s390x with newer gcc versions.
…oft#26098)
### Description
Free cuFFT plans and associated GPU workspaces when the CuFFTPlanCache is destroyed. A destructor (`~CuFFTPlanCache()`) now calls `Clear()`. `Clear()` destroys cuFFT plans and frees workspace memory (calls `cufftDestroy`), then clears the internal plan map.
### Motivation and Context
When creating and destroying ONNX Runtime sessions that use the CUDA execution provider and cuFFT-based nodes, cuFFT plans and their GPU workspaces could remain allocated across session lifetimes. This can produce an increasing GPU memory footprint when sessions are repeatedly opened and closed. The change ensures internal cuFFT resources are released during cache cleanup, preventing GPU memory leaks in multi-session or repeated create/destroy scenarios.
### How to reproduce (minimal repro)
The following Python script builds a minimal ONNX model in memory (RFFT -> IRFFT round-trip), repeatedly creates and destroys ONNX Runtime CUDA sessions, and prints GPU memory after each session close. Use this to observe a memory increase before the fix and stable memory after the fix.

**Dependencies**
- Python 3.8+
- `onnx`
- `onnxruntime-gpu`
- `cupy` matching your CUDA (example package names: `cupy-cuda12x`, `cupy-cuda11x` depending on CUDA)
- `numpy`

```python
# leak_repro_fft.py
# Minimal repro: build an ONNX model (Rfft -> Irfft round-trip), run many sessions
# and print GPU memory used after each session close.
import gc
import numpy as np
import onnx
import onnx.helper as oh
import onnxruntime as ort

try:
    import cupy as cp
except Exception as e:
    raise RuntimeError("CuPy is required to measure GPU memory. Install cupy for your CUDA version.") from e

# ---------- helpers to create MS Rfft / Irfft nodes ----------
def make_ms_rfft_node(inp, out, signal_ndim=1):
    return oh.make_node(
        "Rfft", [inp], [out], domain="com.microsoft",
        onesided=1, normalized=0, signal_ndim=signal_ndim
    )

def make_ms_irfft_node(inp, out, signal_ndim=1):
    return oh.make_node(
        "Irfft", [inp], [out], domain="com.microsoft",
        onesided=1, normalized=0, signal_ndim=signal_ndim
    )

def build_model_fft_ifft_complex():
    """
    Input:  X_ab [2, N] (float32)
    Graph:  RFFT -> IRFFT (round-trip)
    Output: Y_ab [2, N] (float32)
    """
    X = oh.make_tensor_value_info("X_ab", onnx.TensorProto.FLOAT, [2, None])
    Y = oh.make_tensor_value_info("Y_ab", onnx.TensorProto.FLOAT, [2, None])

    nodes = []
    nodes.append(make_ms_rfft_node("X_ab", "R_ab", signal_ndim=1))   # [2, N//2+1, 2]
    nodes.append(make_ms_irfft_node("R_ab", "Y_ab", signal_ndim=1))  # [2, N]

    graph = oh.make_graph(nodes, "complex_fft_ifft", [X], [Y])
    model = oh.make_model(
        graph,
        opset_imports=[
            oh.make_operatorsetid("", 20),
            oh.make_operatorsetid("com.microsoft", 1),
        ],
        ir_version=10,
        producer_name="leak_repro_complex_fft_ifft"
    )
    return model

# ---------- utility to probe GPU memory ----------
def gpu_used_bytes():
    free, total = cp.cuda.runtime.memGetInfo()
    return int(total - free), int(total)

# ---------- main loop: create/close sessions ----------
def run_repro(iters=20, N=2**22, provider="CUDAExecutionProvider"):
    # prepare input (avoid host reallocation between iterations)
    rng = np.random.default_rng(1234)
    a = rng.standard_normal(N).astype(np.float32)
    b = rng.standard_normal(N).astype(np.float32)
    x_ab = np.stack((a, b), axis=0)  # shape [2, N]

    # check provider availability
    providers = ort.get_available_providers()
    if provider not in providers:
        raise RuntimeError(f"{provider} not available (providers: {providers})")

    model = build_model_fft_ifft_complex()
    model_bytes = model.SerializeToString()

    # baseline
    cp.cuda.Device().synchronize()
    used0, total0 = gpu_used_bytes()
    print(f"Baseline GPU used: {used0/1024**2:8.2f} MB / {total0/1024**2:8.2f} MB total")

    for i in range(1, iters + 1):
        # create session from bytes
        sess = ort.InferenceSession(model_bytes, sess_options=ort.SessionOptions(), providers=[provider])
        # run once
        _ = sess.run(None, {"X_ab": x_ab})
        # ensure device completed
        cp.cuda.Device().synchronize()

        # delete session and force GC
        del sess
        gc.collect()
        cp.cuda.Device().synchronize()

        used, _ = gpu_used_bytes()
        print(f"Iter {i:02d}: GPU used {used/1024**2:8.2f} MB")

    # final baseline
    cp.cuda.Device().synchronize()
    usedf, _ = gpu_used_bytes()
    print(f"Final GPU used: {usedf/1024**2:8.2f} MB")
    print("Done.")

if __name__ == "__main__":
    # tweak iters and N to show leak on your machine
    run_repro(iters=5, N=2**22)
```

```text
Example output (before fix)
Baseline GPU used:  3105.56 MB / 8191.56 MB total
Iter 01: GPU used  3173.56 MB
Iter 02: GPU used  3241.56 MB
Iter 03: GPU used  3309.56 MB
Iter 04: GPU used  3377.56 MB
Iter 05: GPU used  3445.56 MB
Final GPU used:  3445.56 MB
Done.
```
### Description
- Add optrace profiling level
- Add profiling to compose graph
- Add new qnn system profile serializer class
- Add API versioning safeguards
- Use current behavior for QNN API < 2.29
- Use QNN System Profile API for QNN API >= 2.29
- Check for log file at end of profiling unit test
- Ensure system libs are loaded when profiling is enabled

### Motivation and Context
Adds optrace level profiling for debugging purposes. Utilizes the new QNN System Profile API to generate a binary log file compatible with the qnn-profile-viewer executable for all profiling levels when the QNN API is >= 2.29 (i.e. QAIRT 2.39 or later). Current behavior persists regardless of QNN API version (i.e. a .csv file will still be generated). The binary log file name is based on the existing profiling_file_path EP option: the value of profiling_file_path is expected to be a .csv as before, and the binary log name removes ".csv" and appends "_qnn.log". For example, if profiling_file_path is set to "foo.csv", then both "foo.csv" and "foo_qnn.log" will be generated. With optrace enabled, qnn-profile-viewer can then be used to generate and view chrometraces with a user-friendly UI. For more details, see:
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-optrace-profiling
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-analysis-summary-qhas-
---------
Co-authored-by: Calvin Nguyen <[email protected]>
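As a usage illustration of the profiling options described above, here is a minimal sketch of enabling optrace profiling through the QNN EP provider options (the backend library and model path are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>

#include <string>
#include <unordered_map>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "qnn_optrace_example"};

  // QNN EP options; "optrace" is the new profiling level, and foo.csv also
  // yields foo_qnn.log when the QNN API is >= 2.29.
  std::unordered_map<std::string, std::string> qnn_options{
      {"backend_path", "QnnHtp.dll"},  // placeholder: HTP backend library
      {"profiling_level", "optrace"},
      {"profiling_file_path", "foo.csv"},
  };

  Ort::SessionOptions session_options;
  session_options.AppendExecutionProvider("QNN", qnn_options);

  Ort::Session session{env, ORT_TSTR("model.onnx"), session_options};  // placeholder model
  return 0;
}
```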
### Description This patch replaces all `splitted` with `split`, as `splitted` is not a correct word.
react-native pipeline has issues
This pull request refactors the logic for handling the past key/value (KV) cache in the Flash Attention implementation. The main focus is to simplify and clarify the determination of when the past KV cache is used and to remove redundant code paths in the shader. The motivation is to remove the dependency on `parameters.total_sequence_length_` on the CPU side, in preparation for registering `total_seqlen_tensor` on the GPU when graph capture is enabled.
This pull request updates the registration of the `Squeeze` operator for the WebGPU execution provider to support additional ONNX operator set versions. The changes ensure that the `Squeeze` kernel is correctly registered for opset versions 13 through 24, aligning with ONNX's evolving specification and improving compatibility.
microsoft#26297 deleted `seqlen_k` by accident which broke the graph capture in phi4. This PR brings it back.
### Description Removes the redundant `webgpu` from `conv2d_mm_webgpu.cc/h` and `conv_backprop_webgpu.cc/h`. This change ensures consistency and better readability across the codebase. ### Motivation and Context See above.
### Description This refactors the `TransposeKernel` to call `Transpose::DoTranspose` directly. ### Motivation and Context See above.
- Check the constant values on Div, Add, and Mul operators in Gelu Fusion
- Modify the Gelu Fusion unit tests
@chilo-ms Could you please trigger the CI job on #26417?