fix(optim): untie batched constant MatMul for OpenVINO GPU #817

Open: xieofxie wants to merge 1 commit into main from hualxie/fix_ov_gpu


xieofxie commented Jun 5, 2026

Problem

`winml perf -m cross-encoder/nli-deberta-v3-small --task zero-shot-classification --ep openvino --device gpu` fails to compile:

[GPU] Failed to select implementation for
name:matmul:/deberta/encoder/layer.5/attention/self/MatMul_1
type: gemm
... compile_graph.cpp:59  (shape_type == dynamic_shape || node->selected_impl != nullptr)

Root cause

OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched (rank ≥ 3) MatMul where an operand is a compile-time constant. Verified by isolation against the real OV-GPU EP:

| Case | Result |
| --- | --- |
| 3D dynamic @ 3D dynamic (content q·kᵀ) | ✅ compiles |
| 3D dynamic @ 3D constant (position terms) | ❌ fails |
| 3D @ 2D constant | ✅ compiles |
| operand converted to runtime input | ✅ compiles |

For a static-shaped node selected_impl must be non-null; impl-selection returns nothing for the batched-constant gemm, so the assert fires. DeBERTa hits this because its disentangled-attention position key/query depend only on weights and fold to 3D constants during export (12 such MatMuls — 2 per layer). Disabling torch constant-folding doesn't help: OV folds the all-constant subgraph itself.
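
The failing pattern from the table can be sketched as a small detector. This is a minimal illustration over a toy graph representation, not the `BatchedConstMatMulValidator` from this PR: the `Node` class, `find_batched_const_matmuls`, and the shapes are hypothetical, and it only looks at operands that are already constants, not at all-constant subgraphs that OV would fold later.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Toy stand-in for a graph node (hypothetical, not an ONNX type)."""
    name: str
    op_type: str
    inputs: List[str]


def find_batched_const_matmuls(
    nodes: List[Node], const_shapes: Dict[str, Tuple[int, ...]]
) -> List[Tuple[str, int]]:
    """Return (node name, operand index) for each constant MatMul operand of rank >= 3."""
    hits = []
    for node in nodes:
        if node.op_type != "MatMul":
            continue
        for index, tensor in enumerate(node.inputs):
            shape = const_shapes.get(tensor)  # None means the operand is runtime-valued
            if shape is not None and len(shape) >= 3:
                hits.append((node.name, index))
    return hits


# The first three rows of the isolation table, with made-up names and shapes.
nodes = [
    Node("content", "MatMul", ["q", "k_t"]),       # 3D dynamic @ 3D dynamic
    Node("position", "MatMul", ["q", "pos_key"]),  # 3D dynamic @ 3D constant
    Node("projection", "MatMul", ["q", "w"]),      # 3D @ 2D constant
]
const_shapes = {"pos_key": (12, 64, 512), "w": (64, 64)}

flagged = find_batched_const_matmuls(nodes, const_shapes)
print(flagged)  # [('position', 1)]
```

Only the 3D-constant operand is flagged, which matches the single failing row in the table.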

Fix

A new EP-gated surgery transform, untie-constant-batched-matmul, routes each constant operand through Add(const, zero) where zero is a data-dependent runtime [1] tensor (Cast → Reshape(-1) → Slice[0:1] → Sub). This makes the operand runtime-valued so OV's constant folder can't repack it into a gemm weight, while:

  • keeping the single batched MatMul (no perf regression — a 2D per-head decomposition also works but explodes into 144 tiny matmuls),
  • leaving numerics unchanged (+0).
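
To see why the `+0` leaves values unchanged, the op chain can be emulated on plain Python lists. This is a value-level sketch only, assuming the `Sub` subtracts the sliced element from itself and that the runtime input is finite (for NaN or infinity, `x - x` is not zero). The helper names and sample values are hypothetical.

```python
from typing import List


def flatten(value) -> List[float]:
    """Cast to float and flatten, mirroring Cast -> Reshape(-1)."""
    if isinstance(value, (list, tuple)):
        out: List[float] = []
        for item in value:
            out.extend(flatten(item))
        return out
    return [float(value)]


def runtime_zero(runtime_input) -> List[float]:
    """Take Slice[0:1] of the flattened input and subtract it from itself."""
    first = flatten(runtime_input)[0:1]
    return [a - b for a, b in zip(first, first)]


def add_broadcast(const, zero: List[float]):
    """Add a shape-[1] tensor to a nested-list constant of any rank."""
    if isinstance(const, list):
        return [add_broadcast(item, zero) for item in const]
    return const + zero[0]


input_ids = [[101, 2023, 102]]  # stands in for any finite runtime input
pos_key = [[[0.5, -1.25], [2.0, 0.0]], [[3.5, 1.0], [-0.75, 4.0]]]  # toy 3D constant

zero = runtime_zero(input_ids)
untied = add_broadcast(pos_key, zero)
print(zero, untied == pos_key)
```

The result depends on a runtime input, so it cannot be folded at compile time, yet every element of the operand is numerically identical to the original constant.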

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and, gated to Intel IHV + GPU, emits a GraphOptimization opportunity the existing autoconf loop auto-applies. Pattern-based and architecture-agnostic (no model-name hardcoding). The detector doesn't re-fire after surgery, so autoconf converges.
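
The gating and convergence behaviour described above can be sketched as a small fixed-point loop. This is a toy model under stated assumptions, not the project's autoconf code: the model is reduced to a set of offending operand names, the gate is approximated as `openvino` + `gpu`, and `gate_is_open`, `detect`, `apply_surgery`, and `autoconf` are hypothetical names.

```python
from typing import FrozenSet, Iterable, Tuple

Model = FrozenSet[str]  # toy model: names of batched constant MatMul operands


def gate_is_open(ep: str, device: str) -> bool:
    """Approximate the Intel IHV + GPU gate (assumption: openvino EP on a GPU device)."""
    return ep.casefold() == "openvino" and device.casefold() == "gpu"


def detect(model: Model) -> Tuple[str, ...]:
    """Report every operand that still matches the failing pattern."""
    return tuple(sorted(model))


def apply_surgery(model: Model, operands: Iterable[str]) -> Model:
    """Untied operands are runtime-valued, so they leave the offending set."""
    return model - frozenset(operands)


def autoconf(model: Model, ep: str, device: str, max_rounds: int = 4) -> Tuple[Model, int]:
    """Apply the surgery until the detector stops firing or the gate is closed."""
    rounds = 0
    while gate_is_open(ep, device) and rounds < max_rounds:
        operands = detect(model)
        if not operands:
            break  # nothing left to fix: the loop has converged
        model = apply_surgery(model, operands)
        rounds += 1
    return model, rounds


model = frozenset({"layer.0/pos_key", "layer.0/pos_query"})
fixed, rounds = autoconf(model, ep="openvino", device="GPU")
skipped, skipped_rounds = autoconf(model, ep="openvino", device="NPU")
print(rounds, len(fixed), skipped_rounds, len(skipped))
```

Because the surgery removes exactly what the detector reports, a second detection pass finds nothing and the loop stops after one round.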

Two incidental bugs fixed:

  • Model-validator device filter was case-sensitive ("gpu" did not match "GPU") → made case-insensitive.
  • First construction used ReduceMin (no axes), which crashed the static analyzer's reduction input-generator → replaced with ubiquitous analyzer-safe ops.
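
The device-filter fix in the first bullet amounts to normalizing case before comparing. A minimal sketch, where `device_matches` is a hypothetical helper and not the validator's actual function:

```python
from typing import Iterable


def device_matches(requested: str, supported: Iterable[str]) -> bool:
    """Compare device names case-insensitively so "gpu" matches "GPU"."""
    wanted = requested.casefold()
    return any(wanted == name.casefold() for name in supported)


print(device_matches("gpu", ["GPU"]))         # True
print(device_matches("npu", ["GPU", "CPU"]))  # False
```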

Verification

  • Original failing command now compiles on OV-GPU and benchmarks (~15.5 ms avg, ~64 samples/sec).
  • GPU output matches CPU reference (argmax matches; diff 6e-4, normal fp16/fp32).
  • Detector gates correctly (openvino+GPU on; NPU / CPU / DML off).
  • New unit tests pass; full optim + analyze unit suites (1923 tests) pass — no regressions.
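
The GPU-versus-CPU check can be sketched as an argmax comparison plus a largest-absolute-difference check. The logits below are made-up illustrative values, not the measured outputs, and the `1e-3` tolerance and the `compare_outputs` helper are assumptions for the example.

```python
from typing import Dict, Sequence, Union


def argmax(values: Sequence[float]) -> int:
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def compare_outputs(
    gpu: Sequence[float], cpu: Sequence[float], tolerance: float
) -> Dict[str, Union[bool, float]]:
    """Check that the predicted class agrees and the values stay within tolerance."""
    max_abs_diff = max(abs(g - c) for g, c in zip(gpu, cpu))
    return {
        "argmax_match": argmax(gpu) == argmax(cpu),
        "max_abs_diff": max_abs_diff,
        "within_tolerance": max_abs_diff <= tolerance,
    }


cpu_logits = [2.0, -1.0, 0.5]     # illustrative reference values
gpu_logits = [2.0005, -1.0, 0.5]  # illustrative values with a small fp16-style drift

report = compare_outputs(gpu_logits, cpu_logits, tolerance=1e-3)
print(report["argmax_match"], report["within_tolerance"])
```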

Note: artifacts are cached by build-config hash, so an existing stale cache needs --rebuild / --ignore-cache to pick up the fix.

🤖 Generated with Claude Code

OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched
(rank >= 3) MatMul where an operand is a compile-time constant; the same
gemm with a dynamic operand, and 2D constant gemm, both compile fine.
Transformer disentangled-attention position terms (e.g. DeBERTa) fold to
3D constants and fail to compile with:

  [GPU] Failed to select implementation for ... type: gemm
  (compile_graph.cpp:59 selected_impl == nullptr)

Add an EP-gated `untie-constant-batched-matmul` surgery that routes the
constant operand through Add(const, zero), where zero is a data-dependent
runtime [1] tensor (Cast -> Reshape(-1) -> Slice[0:1] -> Sub). This makes
the operand runtime-valued so OV's constant folder cannot repack it into a
gemm weight, while keeping the single batched MatMul (no perf regression)
and leaving numerics unchanged (+0).

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and,
gated to Intel IHV + GPU, emits a GraphOptimization opportunity the
existing autoconf loop auto-applies. Pattern-based, architecture-agnostic.

Also makes the model-validator device filter case-insensitive so builds
that pass lowercase "gpu" are matched.

xieofxie requested a review from a team as a code owner June 5, 2026 06:59