fix(optim): untie batched constant MatMul for OpenVINO GPU #817
Open
xieofxie wants to merge 1 commit into
OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched (rank >= 3) MatMul where an operand is a compile-time constant; the same gemm with a dynamic operand, and a 2D constant gemm, both compile fine. Transformer disentangled-attention position terms (e.g. DeBERTa) fold to 3D constants and fail to compile with:

    [GPU] Failed to select implementation for ... type: gemm (compile_graph.cpp:59 selected_impl == nullptr)

Add an EP-gated `untie-constant-batched-matmul` surgery that routes the constant operand through Add(const, zero), where zero is a data-dependent runtime [1] tensor (Cast -> Reshape(-1) -> Slice[0:1] -> Sub). This makes the operand runtime-valued so OV's constant folder cannot repack it into a gemm weight, while keeping the single batched MatMul (no perf regression) and leaving numerics unchanged (+0).

Wired via autoconf: BatchedConstMatMulValidator detects the pattern and, gated to Intel IHV + GPU, emits a GraphOptimization opportunity that the existing autoconf loop auto-applies. Pattern-based, architecture-agnostic.

Also makes the model-validator device filter case-insensitive so builds that pass lowercase "gpu" are matched.
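To make the `Cast -> Reshape(-1) -> Slice[0:1] -> Sub` chain concrete, here is a minimal pure-Python sketch of the arithmetic only. It is not the PR's implementation: the helper names `runtime_zero` and `add_broadcast` are hypothetical, nested lists stand in for tensors, and it assumes that `Sub` subtracts the sliced element from itself, which the text above does not spell out.

```python
from typing import List, Union

Nested = Union[float, List["Nested"]]


def runtime_zero(model_input: List[List[int]]) -> List[float]:
    """Derive a shape-[1] zero from a runtime input (hypothetical helper)."""
    # Cast + Reshape(-1): convert to float and flatten to 1-D.
    flat = [float(v) for row in model_input for v in row]
    # Slice[0:1]: keep only the first element, still as a 1-D list.
    first = flat[0:1]
    # Sub(x, x) (assumed): always 0.0 for finite x, but only known at run time.
    return [a - b for a, b in zip(first, first)]


def add_broadcast(const: Nested, zero: List[float]) -> Nested:
    """Add(const, zero) with the shape-[1] zero broadcast over every element."""
    if isinstance(const, list):
        return [add_broadcast(item, zero) for item in const]
    return const + zero[0]


# A small rank-3 "constant" standing in for a folded position term.
position_const = [[[0.5, -1.25], [2.0, 3.0]], [[4.0, 0.0], [-0.75, 1.5]]]
input_ids = [[101, 2023, 102]]  # illustrative runtime input

zero = runtime_zero(input_ids)
untied = add_broadcast(position_const, zero)
```

Because `zero` depends on a graph input, a constant folder cannot evaluate `Add(const, zero)` ahead of time, yet the values fed to the MatMul are unchanged. The self-subtraction only yields zero for finite inputs, which holds for values cast from integer token ids.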
Problem
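`winml perf -m cross-encoder/nli-deberta-v3-small --task zero-shot-classification --ep openvino --device gpu` fails to compile with `[GPU] Failed to select implementation for ... type: gemm` (`compile_graph.cpp:59`, `selected_impl == nullptr`).

Root cause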
OpenVINO GPU's oneDNN gemm cannot select an implementation for a batched (rank ≥ 3) MatMul where an operand is a compile-time constant. Verified by isolation against the real OV-GPU EP: the same gemm with a dynamic operand, and a 2D constant gemm, both compile fine.
For a static-shaped node, `selected_impl` must be non-null; impl-selection returns nothing for the batched-constant gemm, so the assert fires. DeBERTa hits this because its disentangled-attention position key/query depend only on weights and fold to 3D constants during export (12 such MatMuls, 2 per layer). Disabling torch constant-folding doesn't help: OV folds the all-constant subgraph itself.

Fix
A new EP-gated surgery transform, `untie-constant-batched-matmul`, routes each constant operand through `Add(const, zero)`, where `zero` is a data-dependent runtime `[1]` tensor (`Cast → Reshape(-1) → Slice[0:1] → Sub`). This makes the operand runtime-valued so OV's constant folder can't repack it into a gemm weight, while:

- keeping the single batched MatMul (no perf regression);
- leaving numerics unchanged (`+0`).

Wired via autoconf:
`BatchedConstMatMulValidator` detects the pattern and, gated to Intel IHV + GPU, emits a `GraphOptimization` opportunity that the existing autoconf loop auto-applies. Pattern-based and architecture-agnostic (no model-name hardcoding). The detector doesn't re-fire after surgery, so autoconf converges.

Two incidental bugs fixed:

- The model-validator device filter was case-sensitive (`"gpu"` ≠ `"GPU"`) → made case-insensitive.
- `ReduceMin` (no axes), which crashed the static analyzer's reduction input-generator → replaced with ubiquitous analyzer-safe ops.

Verification
6e-4, normal fp16/fp32).

🤖 Generated with Claude Code
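The detection and convergence behaviour described in the Fix section can be sketched on a toy graph. This is only an illustration under stated assumptions, not the code of `BatchedConstMatMulValidator`: the dict-based node format, the function names `find_batched_const_matmuls`, `is_target` and `untie`, the `"intel"` vendor string, and the `runtime_zero` tensor name are all hypothetical, and the subgraph that produces the zero tensor is omitted.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]


def find_batched_const_matmuls(
    nodes: List[Node], initializer_shapes: Dict[str, Tuple[int, ...]]
) -> List[Tuple[str, str]]:
    """Return (node name, operand name) for MatMuls fed by a rank >= 3 constant."""
    hits = []
    for node in nodes:
        if node["op_type"] != "MatMul":
            continue
        for name in node["inputs"]:
            shape = initializer_shapes.get(name)  # None if not a constant
            if shape is not None and len(shape) >= 3:
                hits.append((node["name"], name))
    return hits


def is_target(vendor: str, device: str) -> bool:
    """Gate to Intel + GPU; the comparison is case-insensitive ("gpu" == "GPU")."""
    return vendor.lower() == "intel" and device.lower() == "gpu"


def untie(nodes: List[Node], hits: List[Tuple[str, str]]) -> List[Node]:
    """Route each flagged operand through Add(const, runtime_zero)."""
    targets = set(hits)
    result: List[Node] = []
    for node in nodes:
        new_inputs = []
        for name in node["inputs"]:
            if (node["name"], name) in targets:
                untied = f"{name}_untied"
                # Insert the Add before its consumer to keep topological order.
                result.append({
                    "name": f"{node['name']}_add_{name}",
                    "op_type": "Add",
                    "inputs": [name, "runtime_zero"],
                    "outputs": [untied],
                })
                new_inputs.append(untied)
            else:
                new_inputs.append(name)
        result.append({**node, "inputs": new_inputs})
    return result


initializers = {"pos_key": (12, 64, 512), "w2d": (64, 64)}
graph = [
    {"name": "mm0", "op_type": "MatMul", "inputs": ["hidden", "pos_key"], "outputs": ["a"]},
    {"name": "mm1", "op_type": "MatMul", "inputs": ["hidden", "w2d"], "outputs": ["b"]},
]

hits = find_batched_const_matmuls(graph, initializers)
patched = untie(graph, hits) if is_target("Intel", "gpu") else graph
remaining = find_batched_const_matmuls(patched, initializers)
```

In this sketch only the rank-3 constant feeding `mm0` is flagged; the 2D weight on `mm1` is left alone. After the rewrite the MatMul's operand is the output of an `Add` rather than an initializer, so a second detection pass finds nothing, which is the convergence property the description relies on.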