fix(optim): untie batched constant MatMul for OpenVINO GPU by xieofxie · Pull Request #817 · microsoft/winml-cli

xieofxie · 2026-06-05T06:59:11Z

Problem

`winml perf -m cross-encoder/nli-deberta-v3-small --task zero-shot-classification --ep openvino --device gpu` fails to compile:

[GPU] Failed to select implementation for
name:matmul:/deberta/encoder/layer.5/attention/self/MatMul_1
type: gemm
... compile_graph.cpp:59  (shape_type == dynamic_shape || node->selected_impl != nullptr)

Root cause

OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched (rank ≥ 3) MatMul where an operand is a compile-time constant. Verified by isolation against the real OV-GPU EP:

| case | result |
| --- | --- |
| 3D dynamic @ 3D dynamic (content q·kᵀ) | ✅ compiles |
| 3D dynamic @ 3D constant (position terms) | ❌ fails |
| 3D @ 2D constant | ✅ compiles |
| operand converted to runtime input | ✅ compiles |

For a static-shaped node, selected_impl must be non-null; implementation selection returns nothing for the batched-constant gemm, so the assertion fires. DeBERTa hits this because its disentangled-attention position key/query depend only on weights and fold to 3D constants during export (12 such MatMuls, 2 per layer). Disabling torch constant folding doesn't help: OV folds the all-constant subgraph itself.
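The isolation results can be restated as a small predicate. This is only a sketch of the rule as described here (a MatMul is affected when any operand is both constant and rank ≥ 3); `Operand` and `hits_failing_pattern` are hypothetical names for illustration, not code from this PR or from OpenVINO.

```python
from typing import NamedTuple


class Operand(NamedTuple):
    """Hypothetical description of one MatMul input."""
    rank: int
    is_constant: bool


def hits_failing_pattern(lhs: Operand, rhs: Operand) -> bool:
    """True when either operand is a compile-time constant of rank >= 3."""
    return any(op.is_constant and op.rank >= 3 for op in (lhs, rhs))


# The four isolation cases from the table, expressed as operand pairs.
cases = {
    "3D dynamic @ 3D dynamic": (Operand(3, False), Operand(3, False)),
    "3D dynamic @ 3D constant": (Operand(3, False), Operand(3, True)),
    "3D @ 2D constant": (Operand(3, False), Operand(2, True)),
    "operand converted to runtime input": (Operand(3, False), Operand(3, False)),
}
results = {label: hits_failing_pattern(*ops) for label, ops in cases.items()}
for label, affected in results.items():
    print(f"{label}: {'fails' if affected else 'compiles'}")
```

Only the "3D dynamic @ 3D constant" case is flagged, which matches the single failing row above.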

Fix

A new EP-gated surgery transform, untie-constant-batched-matmul, routes each constant operand through Add(const, zero) where zero is a data-dependent runtime [1] tensor (Cast → Reshape(-1) → Slice[0:1] → Sub). This makes the operand runtime-valued so OV's constant folder can't repack it into a gemm weight, while:

- keeping the single batched MatMul (no perf regression; a 2D per-head decomposition also works but explodes into 144 tiny matmuls),
- leaving numerics unchanged (+0).
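To make the "+0" claim concrete, here is a pure-Python sketch of the zero construction on nested lists. It assumes the final Sub subtracts the sliced element from itself, which the chain above implies but does not spell out; `runtime_zero` and `untie` are hypothetical helper names, and the real transform operates on ONNX nodes rather than lists.

```python
def runtime_zero(model_input):
    """Mimic Cast -> Reshape(-1) -> Slice[0:1] -> Sub on a nested list of numbers."""
    def flatten(values):
        for item in values:
            if isinstance(item, list):
                yield from flatten(item)
            else:
                yield item

    as_float = [float(v) for v in flatten(model_input)]  # Cast + Reshape(-1)
    first = as_float[0:1]                                 # Slice[0:1], shape [1]
    # Assumption: Sub(first, first), so the result is [0.0] for any finite input.
    return [a - b for a, b in zip(first, first)]


def untie(constant, zero):
    """Add(const, zero): broadcast the [1]-shaped zero over a 3D constant."""
    return [[[v + zero[0] for v in row] for row in mat] for mat in constant]


input_ids = [[101, 2023, 102]]               # illustrative token ids
position_const = [[[0.5, -1.25], [2.0, 3.0]]]  # illustrative 3D constant

zero = runtime_zero(input_ids)
untied = untie(position_const, zero)
print(zero, untied == position_const)
```

The value of `zero` depends on a graph input, so it is only known at run time, yet adding it leaves every finite element of the constant unchanged.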

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and, gated to Intel IHV + GPU, emits a GraphOptimization opportunity the existing autoconf loop auto-applies. Pattern-based and architecture-agnostic (no model-name hardcoding). The detector doesn't re-fire after surgery, so autoconf converges.
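The convergence argument can be sketched on a toy graph. Everything below (`Node`, `detect`, `untie_pass`, `autoconf`) is a hypothetical stand-in for the real validator and surgery, under the assumption that the detector only looks at direct MatMul inputs: once each constant is routed through an Add, the MatMul no longer has a constant input and the detector stays quiet.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical minimal graph node."""
    op_type: str
    inputs: list
    outputs: list


def detect(nodes, const_shapes):
    """Return MatMul nodes with a direct constant input of rank >= 3."""
    return [
        n for n in nodes
        if n.op_type == "MatMul"
        and any(len(const_shapes.get(name, ())) >= 3 for name in n.inputs)
    ]


def untie_pass(nodes, const_shapes, zero_name="runtime_zero"):
    """Insert Add(const, zero) in front of every flagged MatMul operand."""
    rewritten = []
    for node in nodes:
        if node.op_type == "MatMul":
            new_inputs = []
            for name in node.inputs:
                if len(const_shapes.get(name, ())) >= 3:
                    untied = f"{name}_untied"
                    rewritten.append(Node("Add", [name, zero_name], [untied]))
                    new_inputs.append(untied)
                else:
                    new_inputs.append(name)
            node = Node("MatMul", new_inputs, node.outputs)
        rewritten.append(node)
    return rewritten


def autoconf(nodes, const_shapes, max_rounds=5):
    """Apply the surgery until the detector reports nothing."""
    rounds = 0
    while detect(nodes, const_shapes) and rounds < max_rounds:
        nodes = untie_pass(nodes, const_shapes)
        rounds += 1
    return nodes, rounds


graph = [Node("MatMul", ["q", "pos_key"], ["c2p"])]
shapes = {"pos_key": (12, 64, 512)}  # illustrative 3D constant shape
final_graph, rounds = autoconf(graph, shapes)
print(rounds, [n.op_type for n in final_graph])
```

In this toy model the loop stops after one round, because the rewritten MatMul consumes `pos_key_untied`, which is not a constant.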

Two incidental bugs fixed:

- Model-validator device filter was case-sensitive ("gpu" ≠ "GPU") → made case-insensitive.
- First construction used ReduceMin (no axes), which crashed the static analyzer's reduction input-generator → replaced with ubiquitous analyzer-safe ops.
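For the device-filter bug, a case-insensitive comparison along these lines is one way to make "gpu" and "GPU" match. `device_matches` is a hypothetical helper, not the validator's actual function.

```python
def device_matches(requested: str, supported: set) -> bool:
    """Compare device names without regard to case."""
    wanted = requested.casefold()
    return any(wanted == name.casefold() for name in supported)


print(device_matches("gpu", {"GPU"}))  # lowercase build flag vs. uppercase filter
print(device_matches("npu", {"GPU"}))
```

`str.casefold()` is used rather than `lower()` because it is the comparison-oriented normalisation in Python; for plain ASCII device names the two behave the same.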

Verification

- Original failing command now compiles on OV-GPU and benchmarks (~15.5 ms avg, ~64 samples/sec).
- GPU output matches CPU reference (argmax matches; diff 6e-4, normal fp16/fp32).
- Detector gates correctly (openvino+GPU on; NPU / CPU / DML off).
- New unit tests pass; full optim + analyze unit suites (1923 tests) pass with no regressions.
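A GPU-versus-CPU check of the kind described above could look like the sketch below. The logits are made-up illustrative values, not the PR's measurements, and the `1e-3` tolerance and the `compare_outputs` helper are assumptions.

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def compare_outputs(reference, candidate, atol=1e-3):
    """Return (same_argmax, max_abs_diff, within_tolerance) for two logit vectors."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    same_argmax = argmax(reference) == argmax(candidate)
    return same_argmax, max_abs_diff, max_abs_diff <= atol


cpu_logits = [2.3101, -0.4512, -1.8120]  # illustrative CPU reference
gpu_logits = [2.3107, -0.4509, -1.8124]  # illustrative GPU output

same_argmax, max_abs_diff, within_tol = compare_outputs(cpu_logits, gpu_logits)
print(same_argmax, within_tol)
```

Checking the argmax separately from the absolute difference matters for classification: a small numeric drift is acceptable only if it does not flip the predicted label.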

Note: artifacts are cached by build-config hash, so an existing stale cache needs --rebuild / --ignore-cache to pick up the fix.
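A sketch of why a config-keyed cache goes stale: if the key is derived only from the build configuration, an unchanged configuration maps to the same key before and after the fix, so the old artifact is served. The key derivation and config fields below are assumptions for illustration, not the tool's actual scheme.

```python
import hashlib
import json


def cache_key(build_config):
    """Hypothetical cache key: SHA-256 of the canonical JSON of the config."""
    canonical = json.dumps(build_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


config = {
    "model": "cross-encoder/nli-deberta-v3-small",
    "ep": "openvino",
    "device": "gpu",
}

key_before_fix = cache_key(config)
key_after_fix = cache_key(dict(config))  # same settings, newer tool version

# The cache only knows the key, so the pre-fix artifact is returned.
cache = {key_before_fix: "artifact built without the untie surgery"}
stale_hit = cache.get(key_after_fix)
print(key_before_fix == key_after_fix, stale_hit)
```

Because nothing in the key reflects the tool's own behaviour, a forced rebuild is the only way to replace the stale entry in this model.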

🤖 Generated with Claude Code

OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched (rank >= 3) MatMul where an operand is a compile-time constant; the same gemm with a dynamic operand, and 2D constant gemm, both compile fine. Transformer disentangled-attention position terms (e.g. DeBERTa) fold to 3D constants and fail to compile with:

[GPU] Failed to select implementation for ... type: gemm (compile_graph.cpp:59 selected_impl == nullptr)

Add an EP-gated `untie-constant-batched-matmul` surgery that routes the constant operand through Add(const, zero), where zero is a data-dependent runtime [1] tensor (Cast -> Reshape(-1) -> Slice[0:1] -> Sub). This makes the operand runtime-valued so OV's constant folder cannot repack it into a gemm weight, while keeping the single batched MatMul (no perf regression) and leaving numerics unchanged (+0).

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and, gated to Intel IHV + GPU, emits a GraphOptimization opportunity the existing autoconf loop auto-applies. Pattern-based, architecture-agnostic.

Also makes the model-validator device filter case-insensitive so builds that pass lowercase "gpu" are matched.

xieofxie · 2026-06-05T07:16:29Z

The source for the failing assertion is here: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/graph/graph_optimizer/compile_graph.cpp

I also filed an upstream issue: openvinotoolkit/openvino#36272

xieofxie requested a review from a team as a code owner June 5, 2026 06:59

xieofxie wants to merge 1 commit into main from hualxie/fix_ov_gpu
