
Conversation

@quic-tirupath
Contributor

Description

  • ONNX models exported with an older opset version contain the Gelu operator decomposed into multiple operators (Div, Erf, Add, Mul).
  • QNN doesn't support the Erf operator but does support the Gelu operator.
  • Since QNN doesn't support Erf, the Gelu pattern gets partitioned between the QNN and CPU EPs, which degrades inference time.

Motivation and Context

  • Identifying and fusing the Gelu pattern into a single QNN Gelu node improves inference time (a sketch of the decomposed pattern follows below).
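
For reference, a minimal sketch (not code from this PR; names are illustrative) of the decomposed Gelu pattern that older exporters emit, built with the ONNX Python helpers. One common ordering of the Div/Erf/Add/Mul chain is shown.

```python
# Decomposed Gelu: Y = 0.5 * X * (1 + erf(X / sqrt(2))),
# emitted by older exporters as Div -> Erf -> Add -> Mul -> Mul.
import math
import onnx
import onnx.helper as oh

X = oh.make_tensor_value_info("X", onnx.TensorProto.FLOAT, [None])
Y = oh.make_tensor_value_info("Y", onnx.TensorProto.FLOAT, [None])

sqrt2 = oh.make_tensor("sqrt2", onnx.TensorProto.FLOAT, [], [math.sqrt(2.0)])
one = oh.make_tensor("one", onnx.TensorProto.FLOAT, [], [1.0])
half = oh.make_tensor("half", onnx.TensorProto.FLOAT, [], [0.5])

nodes = [
    oh.make_node("Div", ["X", "sqrt2"], ["d"]),  # X / sqrt(2)
    oh.make_node("Erf", ["d"], ["e"]),           # erf(X / sqrt(2))
    oh.make_node("Add", ["e", "one"], ["a"]),    # 1 + erf(...)
    oh.make_node("Mul", ["X", "a"], ["m"]),      # X * (1 + erf(...))
    oh.make_node("Mul", ["m", "half"], ["Y"]),   # 0.5 * X * (1 + erf(...))
]

graph = oh.make_graph(nodes, "decomposed_gelu", [X], [Y],
                      initializer=[sqrt2, one, half])
model = oh.make_model(graph, opset_imports=[oh.make_operatorsetid("", 13)])
onnx.checker.check_model(model)
```

This whole chain is what the QNN EP fusion recognizes and maps to a single Gelu node.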

@chilo-ms
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@quic-tirupath
Contributor Author

@chilo-ms
Could you please merge this now?
After this is merged, we need to rebase the other PR: #26338.

@quic-tirupath
Contributor Author

@chilo-ms and @devang-ml ,
I had to rebase the PR after the merge of PR: #26338.
Could you please re-trigger the CI pipeline and merge?

@quic-tirupath
Contributor Author


@chilo-ms and @devang-ml
Could you please check the pending checks and unblock this PR?

Comment on lines 24 to 27
#define ValidateOnQnn(qnn_model_wrapper, node_units, root_input, final_output) \
  CreateOrValidateOnQnn((qnn_model_wrapper), (node_units), (root_input), (final_output), true)
#define CreateOnQnn(qnn_model_wrapper, node_units, root_input, final_output) \
  CreateOrValidateOnQnn((qnn_model_wrapper), (node_units), (root_input), (final_output), false)
Contributor

Is there a good reason to make these macros instead of helper functions?

Macros should be named like this to make them easy to identify: VALIDATE_ON_QNN.

Contributor Author

There is no specific reason for using macros; we can use helper functions too.
I modified it to use helper functions.
Unfortunately, something went wrong with rebasing.
I pushed a new PR that addresses the comments:
#26417

const NodeUnit* producer_unit = it->second;
if (producer_unit->OpType() == "Mul" &&
    node_unit_to_qnn_node_group.find(producer_unit) == node_unit_to_qnn_node_group.end()) {
  // Check if this Mul has root as one input (no longer checking for constant 0.5)
Contributor

I thought that the GELU calculation requires specific constant values. Where do we check for those?

Contributor Author

Yes, we need to check the constant values.
I added the constant value checks.
Unfortunately, something went wrong with rebasing.
I pushed a new PR that addresses the comments:
#26417
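
For illustration, a minimal sketch (assumed, not the PR's actual C++ code) of the kind of constant-value validation such a fusion needs; the expected values follow from Y = 0.5 * X * (1 + erf(X / sqrt(2))), and the helper name is hypothetical.

```python
# Hypothetical helper: validate the scalar constants of a decomposed Gelu pattern.
import math

def is_gelu_constant_set(div_const: float, add_const: float, mul_const: float,
                         rel_tol: float = 1e-4) -> bool:
    """Return True if the Div/Add/Mul constants match the Gelu decomposition."""
    return (math.isclose(div_const, math.sqrt(2.0), rel_tol=rel_tol)  # X / sqrt(2)
            and math.isclose(add_const, 1.0, rel_tol=rel_tol)         # 1 + erf(...)
            and math.isclose(mul_const, 0.5, rel_tol=rel_tol))        # * 0.5

# Example: constants pulled from a matched Div -> Erf -> Add -> Mul -> Mul chain.
assert is_gelu_constant_set(1.4142135, 1.0, 0.5)
assert not is_gelu_constant_set(2.0, 1.0, 0.5)
```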

@edgchen1
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@quic-tirupath
Contributor Author

@chilo-ms
Can we merge this PR?

@edgchen1
Contributor

@chilo-ms Can we merge this PR?

@quic-tirupath I had some comments, please take a look.

movedancer and others added 18 commits October 27, 2025 14:11
… CPU with 4-bit quantized models (microsoft#26280)

### Description
This submission adds a 4-bit quantized matrix multiplication operator
for the Loongson platform. It has passed the internal test
checks of ONNX and has been successfully deployed for actual inference
on the Loongson platform. It includes five modifications:
(1) **sqnbitgemm_kernel_lasx.cpp**: Acceleration of inference for 4-bit
quantized models on the LoongArch64 architecture, utilizing lasx/lsx
vector instruction sets;
(2) **sqnbitgemm_kernel_lasx_common.h**: Implementation of auxiliary
functions used by **sqnbitgemm_kernel_lasx.cpp**;
(3) **cmake**: Added compilation options for
**sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture;
(4) **mlasi.h**: Added interface for calling the operator in
**sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture;
(5) **platform.cpp**: Added calls to the operators in
**sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture.

### Motivation and Context
Loongson has a critical lack of key operations in ONNX quantized model
inference tasks.
The issue of poor inference performance for 4-bit quantized models on
the Loongson platform has been addressed. In tests using the
Deepseek-R1-1.5B model, our operators have increased TPS by more than 7
times, with the speed of quantized-matrix dequantization improving by
up to 3 times.

### Pictures
Dequantization Acceleration:
In the chart, the vertical axis represents time in milliseconds (ms),
the horizontal axis represents the number of test matrices, and the size
of the quantized matrix is rows × columns, such as the 1536*256.
<img width="4039" height="831" alt="反量化加速"
src="https://github.com/user-attachments/assets/26da1ed9-79ae-4abd-9e6d-cadaea9ee013"
/>

---------

Co-authored-by: 全都做不队 <[email protected]>
…th (microsoft#26346)

### Description

When using onnxruntime_CUSTOM_DAWN_SRC_PATH, the folder is assumed to
have its dependencies ready.


### Motivation and Context

To make customized Dawn src path usage more flexible.
…ft#26334)

This pull request updates the FlashAttention WebGPU implementation to
improve support for indirect dispatch. The main changes ensure that when
indirect dispatch is used, the shader receives the actual workgroup
dimensions from an input buffer rather than relying on built-in
variables, which avoids duplication overhead in Dawn/WebGPU. See
https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/ComputePassEncoder.cpp;l=275.
This PR fixes the issue that indirect dispatch is slower than normal
dispatch for the same program.
With this change, phi4 with graph capture enabled runs at 145 tps, up
from 125 tps, on an NV 5080.
### Description
Fix quantization in Whisper model export



### Motivation and Context
As titled.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

This PR introduces support for building the WebGPU EP with JSPI.
When built with `--enable_wasm_jspi`, it does the following:
- Use `JSPI` instead of `ASYNCIFY`
- Use WebAssembly EH (-fwasm-exceptions) instead of Emscripten
exceptions

### Motivation and Context

Using JSPI with wasm exceptions helps to provide:
- better performance (reduced CPU overhead)
- smaller binary size (no extra generated ASYNCIFY code)
- faster build time
Bumps [actions/setup-node](https://github.com/actions/setup-node) from 5
to 6.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/setup-node/releases">actions/setup-node's
releases</a>.</em></p>
<blockquote>
<h2>v6.0.0</h2>
<h2>What's Changed</h2>
<p><strong>Breaking Changes</strong></p>
<ul>
<li>Limit automatic caching to npm, update workflows and documentation
by <a
href="https://github.com/priyagupta108"><code>@​priyagupta108</code></a>
in <a
href="https://redirect.github.com/actions/setup-node/pull/1374">actions/setup-node#1374</a></li>
</ul>
<p><strong>Dependency Upgrades</strong></p>
<ul>
<li>Upgrade ts-jest from 29.1.2 to 29.4.1 and document breaking changes
in v5 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a>[bot]
in <a
href="https://redirect.github.com/actions/setup-node/pull/1336">#1336</a></li>
<li>Upgrade prettier from 2.8.8 to 3.6.2 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a>[bot]
in <a
href="https://redirect.github.com/actions/setup-node/pull/1334">#1334</a></li>
<li>Upgrade actions/publish-action from 0.3.0 to 0.4.0 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a>[bot]
in <a
href="https://redirect.github.com/actions/setup-node/pull/1362">#1362</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/setup-node/compare/v5...v6.0.0">https://github.com/actions/setup-node/compare/v5...v6.0.0</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/actions/setup-node/commit/2028fbc5c25fe9cf00d9f06a71cc4710d4507903"><code>2028fbc</code></a>
Limit automatic caching to npm, update workflows and documentation (<a
href="https://redirect.github.com/actions/setup-node/issues/1374">#1374</a>)</li>
<li><a
href="https://github.com/actions/setup-node/commit/13427813f706a0f6c9b74603b31103c40ab1c35a"><code>1342781</code></a>
Bump actions/publish-action from 0.3.0 to 0.4.0 (<a
href="https://redirect.github.com/actions/setup-node/issues/1362">#1362</a>)</li>
<li><a
href="https://github.com/actions/setup-node/commit/89d709d423dc495668cd762a18dd4a070611be3f"><code>89d709d</code></a>
Bump prettier from 2.8.8 to 3.6.2 (<a
href="https://redirect.github.com/actions/setup-node/issues/1334">#1334</a>)</li>
<li><a
href="https://github.com/actions/setup-node/commit/cd2651c46231bc0d6f48d6b34433b845331235fe"><code>cd2651c</code></a>
Bump ts-jest from 29.1.2 to 29.4.1 (<a
href="https://redirect.github.com/actions/setup-node/issues/1336">#1336</a>)</li>
<li>See full diff in <a
href="https://github.com/actions/setup-node/compare/v5...v6">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=actions/setup-node&package-manager=github_actions&previous-version=5&new-version=6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
)

### Description
(1) Use new Linux docker images: gcc upgraded from 12 to 14; add Python
3.14.
(2) Use CUDA 12.8 to replace CUDA 12.2 in workflows.
(3) Replace CUDA 11.8 with 13.0 and latest TensorRT version in
alternative configuration (default is still CUDA 12.8).
…rosoft#25919)

### Description
This is a small change that includes the insertion of the `/*
@vite-ignore */` magic comment within the `build.ts` `postProcess`
function, alongside `/* webpackIgnore:true */`, for dynamic imports.

`/* @vite-ignore */` suppresses a Vite warning:

> The above dynamic import cannot be analyzed by Vite. See
https://github.com/rollup/plugins/tree/master/packages/dynamic-import-vars#limitations
for supported dynamic import formats. If this is intended to be left
as-is, you can use the /* @vite-ignore */ comment inside the import()
call to suppress this warning.

It does not change other behaviour. Note that it must be included as
a separate, distinct comment to match the
[regexp](https://github.com/vitejs/vite/blob/bcc31449c0c4f852ccb1eedda1842bc7ded23d01/packages/vite/src/node/plugins/importAnalysis.ts#L88).


### Motivation and Context
Resolves microsoft#25918. The Vite magic comment should be included because the
dynamic imports are intended 'to be left as-is', and the warning cannot
otherwise be disabled by an end-user (myself in an Angular project).
…ft#26374)

### Description
Port upstream fix for protobuf for s390x.
Fix unused variable warnings.

### Motivation and Context
These changes fix unnoticed issues which might appear when onnxruntime
is compiled on s390x with newer gcc versions.
…oft#26098)

### Description
Free cuFFT plans and associated GPU workspaces when the CuFFTPlanCache
is destroyed. A destructor (`~CuFFTPlanCache()`) now calls `Clear()`.
`Clear()` destroys cuFFT plans and frees workspace memory (calls
`cufftDestroy`), then clears the internal plan map.

### Motivation and Context
When creating and destroying ONNX Runtime sessions that use the CUDA
execution provider and cuFFT-based nodes, cuFFT plans and their GPU
workspaces could remain allocated across session lifetimes. This can
produce an increasing GPU memory footprint when sessions are repeatedly
opened and closed. The change ensures internal cuFFT resources are
released during cache cleanup, preventing GPU memory leaks in
multi-session or repeated create/destroy scenarios.

### How to reproduce (minimal repro)
The following Python script builds a minimal ONNX model in memory (RFFT
-> IRFFT round-trip), repeatedly creates and destroys ONNX Runtime CUDA
sessions, and prints GPU memory after each session close. Use this to
observe a memory increase before the fix and stable memory after the
fix.

**Dependencies**
- Python 3.8+
- `onnx`
- `onnxruntime-gpu` 
- `cupy` matching your CUDA (example package names: `cupy-cuda12x`,
`cupy-cuda11x` depending on CUDA)
- `numpy`

```python
# leak_repro_fft.py
# Minimal repro: build an ONNX model (Rfft -> Irfft round-trip), run many sessions
# and print GPU memory used after each session close.

import gc
import numpy as np
import onnx
import onnx.helper as oh
import onnxruntime as ort

try:
    import cupy as cp
except Exception as e:
    raise RuntimeError("CuPy is required to measure GPU memory. Install cupy for your CUDA version.") from e

# ---------- helpers to create MS Rfft / Irfft nodes ----------
def make_ms_rfft_node(inp, out, signal_ndim=1):
    return oh.make_node(
        "Rfft", [inp], [out],
        domain="com.microsoft",
        onesided=1, normalized=0, signal_ndim=signal_ndim
    )

def make_ms_irfft_node(inp, out, signal_ndim=1):
    return oh.make_node(
        "Irfft", [inp], [out],
        domain="com.microsoft",
        onesided=1, normalized=0, signal_ndim=signal_ndim
    )

def build_model_fft_ifft_complex():
    """
    Input: X_ab [2, N] (float32)
    Graph: RFFT -> IRFFT (round-trip)
    Output: Y_ab [2, N] (float32)
    """
    X = oh.make_tensor_value_info("X_ab", onnx.TensorProto.FLOAT, [2, None])
    Y = oh.make_tensor_value_info("Y_ab", onnx.TensorProto.FLOAT, [2, None])

    nodes = []
    nodes.append(make_ms_rfft_node("X_ab", "R_ab", signal_ndim=1))   # [2, N//2+1, 2]
    nodes.append(make_ms_irfft_node("R_ab", "Y_ab", signal_ndim=1))  # [2, N]
    graph = oh.make_graph(nodes, "complex_fft_ifft", [X], [Y])
    model = oh.make_model(
        graph,
        opset_imports=[
            oh.make_operatorsetid("", 20),
            oh.make_operatorsetid("com.microsoft", 1),
        ],
        ir_version=10,
        producer_name="leak_repro_complex_fft_ifft"
    )
    return model

# ---------- utility to probe GPU memory ----------
def gpu_used_bytes():
    free, total = cp.cuda.runtime.memGetInfo()
    return int(total - free), int(total)

# ---------- main loop: create/close sessions ----------
def run_repro(iters=20, N=2**22, provider="CUDAExecutionProvider"):
    # prepare input (avoid host reallocation between iterations)
    rng = np.random.default_rng(1234)
    a = rng.standard_normal(N).astype(np.float32)
    b = rng.standard_normal(N).astype(np.float32)
    x_ab = np.stack((a, b), axis=0)  # shape [2, N]

    # check provider availability
    providers = ort.get_available_providers()
    if provider not in providers:
        raise RuntimeError(f"{provider} not available (providers: {providers})")

    model = build_model_fft_ifft_complex()
    model_bytes = model.SerializeToString()

    # baseline
    cp.cuda.Device().synchronize()
    used0, total0 = gpu_used_bytes()
    print(f"Baseline GPU used: {used0/1024**2:8.2f} MB / {total0/1024**2:8.2f} MB total")

    for i in range(1, iters + 1):
        # create session from bytes
        sess = ort.InferenceSession(model_bytes, sess_options=ort.SessionOptions(), providers=[provider])

        # run once
        _ = sess.run(None, {"X_ab": x_ab})

        # ensure device completed
        cp.cuda.Device().synchronize()

        # delete session and force GC
        del sess
        gc.collect()
        cp.cuda.Device().synchronize()

        used, _ = gpu_used_bytes()
        print(f"Iter {i:02d}: GPU used {used/1024**2:8.2f} MB")

    # final baseline
    cp.cuda.Device().synchronize()
    usedf, _ = gpu_used_bytes()
    print(f"Final GPU used: {usedf/1024**2:8.2f} MB")
    print("Done.")

if __name__ == "__main__":
    # tweak iter and N to show leak on your machine
    run_repro(iters=5, N=2**22)
```

```text
Example output (before fix)
Baseline GPU used:  3105.56 MB /  8191.56 MB total
Iter 01: GPU used  3173.56 MB
Iter 02: GPU used  3241.56 MB
Iter 03: GPU used  3309.56 MB
Iter 04: GPU used  3377.56 MB
Iter 05: GPU used  3445.56 MB
Final GPU used:  3445.56 MB
Done.
```
### Description
 - Add optrace profiling level
 - Add profiling to compose graph
 - Add new qnn system profile serializer class
 - Add API versioning safeguards
 - Use current behavior for QNN API < 2.29
 - Use QNN System Profile API for QNN API >= 2.29
 - Check for log file at end of profiling unit test
 - Ensure system libs are loaded when profiling is enabled



### Motivation and Context
Adds optrace-level profiling for debugging purposes. Utilizes the new QNN
System Profile API to generate a binary log file compatible with the
qnn-profile-viewer executable for all profiling levels when the QNN API is
>= 2.29 (i.e. QAIRT 2.39 or later).

Current behavior will still persist regardless of QNN API version (i.e.
a .csv file will still be generated).

The binary log file naming will be based on the existing
profiling_file_path EP option. The value of profiling_file_path is
expected to be .csv as before, and the binary log name will remove
".csv" and append "_qnn.log". For example, if profiling_file_path is set
to "foo.csv", then both "foo.csv" and "foo_qnn.log" will be generated.

With optrace enablement, qnn-profile-viewer can then be used to generate
and view chrometraces with a user-friendly UI.
For more details, see:

https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-optrace-profiling

https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-analysis-summary-qhas-
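
For illustration, a minimal usage sketch of how these profiling options might be passed to the QNN EP from Python; the option names follow the existing `profiling_level` and `profiling_file_path` EP options, and the exact `optrace` value and backend library name here are assumptions based on the description above.

```python
# Hypothetical usage sketch: enable optrace profiling on the QNN EP.
import onnxruntime as ort

qnn_options = {
    "backend_path": "QnnHtp.dll",        # HTP backend library (platform dependent)
    "profiling_level": "optrace",        # new level added by this PR (assumed value)
    "profiling_file_path": "foo.csv",    # per the PR, foo_qnn.log is also produced
}

sess = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", qnn_options)],
)
```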

---------

Co-authored-by: Calvin Nguyen <[email protected]>
### Description
This patch replaces all occurrences of `splitted` with `split`, as
`splitted` is not a correct word.
react-native pipeline has issues
This pull request refactors the logic for handling the past key/value (KV)
cache in the Flash Attention implementation. The main focus is to simplify
and clarify the determination of when the past KV cache is used and to
remove redundant code paths in the shader. The motivation is to remove the
dependency on parameters.total_sequence_length_ on the CPU side, in
preparation for registering total_seqlen_tensor on the GPU when graph
capture is enabled.
This pull request updates the registration of the `Squeeze` operator for
the WebGPU execution provider to support additional ONNX operator set
versions. The changes ensure that the `Squeeze` kernel is correctly
registered for opset versions 13 through 24, aligning with ONNX's
evolving specification and improving compatibility.
qjia7 and others added 5 commits October 27, 2025 14:11
microsoft#26297 accidentally deleted `seqlen_k`, which broke graph capture
in phi4. This PR brings it back.
### Description
Removes the redundant `webgpu` from `conv2d_mm_webgpu.cc/h` and
`conv_backprop_webgpu.cc/h`.

This change ensures consistency and better readability across the
codebase.

### Motivation and Context
See above.
### Description

This refactors the `TransposeKernel` to call `Transpose::DoTranspose`
directly.

### Motivation and Context

See above.
 - ONNX models exported with an older opset version contain the Gelu operator
   decomposed into multiple operators (Div, Erf, Add, Mul).
 - QNN doesn't support the Erf operator but does support the Gelu operator.
 - Since QNN doesn't support Erf, the Gelu pattern gets partitioned between
   the QNN and CPU EPs, which degrades inference time.
 - Identifying and fusing the Gelu pattern into a single QNN Gelu node
   improves inference time.
 - Check the constant values on Div, Add and Mul operators in Gelu Fusion
 - Modify the Gelu Fusion unit tests
@quic-tirupath
Contributor Author

@chilo-ms
Unfortunately, something went wrong while rebasing my new changes: it added multiple other commits, and I could not find a way to resolve this in this PR.
Hence I uploaded a new PR, #26417, where I addressed all the comments raised in this PR.
We can close this PR and use #26417 for this node group fusion.

Could you please trigger the CI job on #26417?

@quic-tirupath
Contributor Author

quic-tirupath commented Oct 30, 2025

@chilo-ms Can we merge this PR?

@quic-tirupath I had some comments, please take a look.

@chilo-ms
As I commented at #26332 (comment), something went wrong in rebasing the branch. I addressed the comments in a new PR: #26417.

Could you please trigger the CI job on #26417?
