Implement tensor similarity evaluator by zhenchaoni · Pull Request #805 · microsoft/winml-cli

zhenchaoni · 2026-06-03T05:30:46Z

Fixes #804

Implement tensor similarity evaluator

Summary

Adds a new compare mode to winml eval that compares an ONNX candidate against its HF PyTorch reference on identical random inputs and reports per-output tensor-parity metrics (SQNR, PSNR, cosine similarity, MSE, max absolute diff). This isolates divergence introduced by the build pipeline (optimize / quantize / compile) from data- or pipeline-related differences — there is no labeled dataset, no HF pipeline, and no preprocessor in the loop.
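As a rough illustration of the five per-output metrics named above, the sketch below computes them for a single reference/candidate pair with NumPy. The formulas are the common textbook definitions and are an assumption here: the PR does not spell them out, so this is not claimed to match eval_tensors bit for bit. The names tensor_parity and _decibels, the dictionary keys other than the cosine_similarity prefix, and the use of the reference's largest absolute value as the PSNR peak are all hypothetical.

```python
import math

import numpy as np


def _decibels(signal: float, noise: float) -> float:
    """Return 10 * log10(signal / noise), guarding the zero cases."""
    if noise == 0.0:
        return math.inf  # no error at all, e.g. identical tensors
    if signal == 0.0:
        return -math.inf  # all-zero reference with non-zero error
    return 10.0 * math.log10(signal / noise)


def tensor_parity(reference, candidate) -> dict:
    """Compare one candidate tensor against its reference (textbook definitions)."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    ref, cand = ref.ravel(), cand.ravel()

    err = ref - cand
    mse = float(np.mean(err ** 2))
    signal_power = float(np.mean(ref ** 2))
    peak = float(np.max(np.abs(ref)))  # assumption: peak taken from the reference

    norm_product = float(np.linalg.norm(ref) * np.linalg.norm(cand))
    # Cosine similarity is undefined when either tensor is all zeros.
    cosine = float(np.dot(ref, cand) / norm_product) if norm_product > 0.0 else math.nan

    return {
        "sqnr": _decibels(signal_power, mse),
        "psnr": _decibels(peak ** 2, mse),
        "cosine_similarity": cosine,
        "mse": mse,
        "max_abs_diff": float(np.max(np.abs(err))),
    }
```

Under these definitions an fp16 candidate that tracks its fp32 reference closely would show a cosine similarity near 1 and a small MSE, while identical tensors yield infinite SQNR and PSNR.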

Motivation

Task-level metrics (top-1, mIoU, BLEU, ...) tell us whether an optimized model still works, but not how much the optimize/quantize/compile passes perturbed the raw tensors. Tensor-similarity gives a fast, label-free, dataset-free signal for build-pipeline regressions and for picking quantization configs.

Usage

winml eval --mode compare -m microsoft/resnet-50 --task image-classification --precision fp16 --samples 100

What's new

winml eval --mode {onnx,compare} — new Click option on winml eval. onnx (default) is the existing dataset-driven flow; compare activates the new evaluator.
TensorSimilarityEvaluator (tensor_similarity_evaluator.py) — loads the HF reference on CPU/fp32 via resolve_task_and_model_class, draws inputs from RandomDataset over the candidate's ONNX I/O spec, runs both backends per sample, and aggregates per-output metrics.
TensorSimilarityMetric (tensor_similarity.py) — stateful update / compute / reset metric mirroring MeanIoUMetric. Per-sample math is bit-equivalent to the team-wide eval_tensors reference library on the same .npy pair.
Dispatch — evaluate.py registers "compare-tensor" and get_evaluator_class routes to it when config.mode == "compare"; compare mode bypasses default-dataset resolution and the dataset section of print_config.
Config — WinMLEvaluationConfig.mode: str = "onnx"; to_dict only emits mode when non-default.
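The Config and Dispatch items above can be pictured with a small standalone sketch. It is not the repository's code: EvalConfigSketch and evaluator_key are hypothetical stand-ins for WinMLEvaluationConfig and get_evaluator_class, and falling back to the task name as the registry key in onnx mode is an assumption made only for illustration.

```python
from dataclasses import dataclass

DEFAULT_MODE = "onnx"


@dataclass
class EvalConfigSketch:
    """Hypothetical stand-in for WinMLEvaluationConfig with only two fields."""

    task: str
    mode: str = DEFAULT_MODE

    def to_dict(self) -> dict:
        data = {"task": self.task}
        # Emit mode only when it differs from the default, so configs
        # serialized before the option existed keep the same shape.
        if self.mode != DEFAULT_MODE:
            data["mode"] = self.mode
        return data


def evaluator_key(config: EvalConfigSketch) -> str:
    """Pick a registry key: compare mode wins over the task-based lookup."""
    if config.mode == "compare":
        return "compare-tensor"
    return config.task  # assumption: onnx mode keys evaluators by task


if __name__ == "__main__":
    cfg = EvalConfigSketch(task="image-classification", mode="compare")
    print(cfg.to_dict(), evaluator_key(cfg))
```

Omitting the default value from to_dict keeps serialized configs for the existing dataset-driven flow unchanged.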

Output shape

compute() returns a display-ready flat dict so the existing generic eval report renders without a custom renderer:

{
    f"{metric}_{stat}": {output_name: float},  # 5 metrics × 4 stats = 20 keys
    ...
}

Stats are mean / std / min / max. The renderer prints one row per {metric}_{stat} with output_name=value cells joined across outputs.
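A minimal sketch of how per-sample values could be folded into that 20-key shape follows. ParityAccumulator is a hypothetical class, not TensorSimilarityMetric itself. The metric key names other than cosine_similarity, the use of population standard deviation, and the ValueError raised on an empty state are assumptions.

```python
from collections import defaultdict
from statistics import fmean, pstdev

METRICS = ("sqnr", "psnr", "cosine_similarity", "mse", "max_abs_diff")
# Assumption: std is the population standard deviation (NumPy's default ddof=0).
STATS = {"mean": fmean, "std": pstdev, "min": min, "max": max}


class ParityAccumulator:
    """Stateful update / compute / reset accumulator for per-output metrics."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # output_name -> metric -> list of per-sample values
        self._values = defaultdict(lambda: defaultdict(list))

    def update(self, output_name: str, per_sample: dict) -> None:
        for metric in METRICS:
            self._values[output_name][metric].append(float(per_sample[metric]))

    def compute(self) -> dict:
        if not self._values:
            raise ValueError("compute() called before any update()")
        report = {}
        for metric in METRICS:
            for stat, fn in STATS.items():
                report[f"{metric}_{stat}"] = {
                    name: float(fn(per_output[metric]))
                    for name, per_output in self._values.items()
                }
        return report
```

Because every value is keyed by output name, a model with several shared outputs still produces exactly 20 top-level keys, with one entry per output inside each.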

Notable design choices

Output-name overlap, not strict equality. ONNX and HF output sets can differ (HF often exposes auxiliary tensors). We compute on the intersection and warn on divergence rather than failing.
Composite-model guard. Multi-component models (e.g. BLIP) raise a TypeError with guidance to run compare per sub-component — there is no canonical "one HF reference" for the composite.
int dtype normalization. Narrow int tensors are upcast to int64 before inference so HF embeddings accept them; WinMLSession down-casts to the ORT graph's declared dtype on its side. The same input dict feeds both backends.
Architecture-agnostic. No model-specific names, layer patterns, or hardcoded outputs anywhere in metric or evaluator code.
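Two of the choices above, output-name overlap and int dtype normalization, are easy to sketch in isolation. The helpers shared_outputs and upcast_narrow_ints are hypothetical names, not functions from the PR. Raising ValueError when nothing overlaps and reporting divergence through the warnings module are assumptions about behavior the description leaves open.

```python
import warnings

import numpy as np


def shared_outputs(onnx_names, hf_names) -> list:
    """Return the output names present on both sides, in ONNX order."""
    hf_set = set(hf_names)
    shared = [name for name in onnx_names if name in hf_set]
    if not shared:
        # Assumption: with no common output there is nothing to compare.
        raise ValueError("ONNX and HF outputs have no names in common")
    only_onnx = sorted(set(onnx_names) - hf_set)
    only_hf = sorted(hf_set - set(onnx_names))
    if only_onnx or only_hf:
        warnings.warn(
            f"output sets differ; ONNX-only={only_onnx}, HF-only={only_hf}; "
            f"comparing {shared}"
        )
    return shared


def upcast_narrow_ints(inputs: dict) -> dict:
    """Upcast integer arrays narrower than 64 bits to int64; leave others as-is."""
    return {
        name: arr.astype(np.int64)
        if np.issubdtype(arr.dtype, np.integer) and arr.dtype.itemsize < 8
        else arr
        for name, arr in inputs.items()
    }
```

Keeping the upcast in one place means the same input dict can be handed to both backends, which matches the description's note that WinMLSession handles the down-cast on its own side.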

Tests

test_tensor_similarity_metric.py — 10 unit tests for the metric (numerics, identity, stat shape, reset, empty-state error).
test_tensor_similarity_evaluator.py — 4 unit tests (composite-model guard, output-name overlap, dispatch).
test_eval.py — get_evaluator_class updated to take WinMLEvaluationConfig; "compare-tensor" exempted from the task-schema set check.
tests/e2e/test_eval_e2e.py::test_compare_mode_image_classification — full CLI run on microsoft/resnet-50 fp16: asserts the 20-key flat shape, per-output cosine bounds [-1, 1], and (QNN host only) cosine_similarity_mean >= 0.95.

53 unit tests pass; ruff clean.

Out of scope (follow-ups)

--mode hf (run the HF pipeline on a labeled dataset as the reference for task-level metrics) — placeholder removed from this PR; will land as a separate change.
A custom renderer for compare mode (currently uses the generic table).

xieofxie

Drive-by review focused on correctness, conventions in this repo (e.g. tqdm-is-optional from #788), and a few edge cases worth tightening before merge. Individual suggestions on the lines below.

Implement tensor similarity evaluator

1edda7a

zhenchaoni requested a review from a team as a code owner June 3, 2026 05:30

xieofxie reviewed Jun 3, 2026

Comment thread src/winml/modelkit/commands/eval.py Outdated

vortex-captain reviewed Jun 3, 2026

Comment thread src/winml/modelkit/eval/metrics/tensor_similarity.py

vortex-captain approved these changes Jun 3, 2026

zhenchaoni added 4 commits June 4, 2026 17:04

resolve comments

3d5bb55

Merge remote-tracking branch 'origin/main' into private/zhenni/tensor_eval

4a3cbdb

fix lint

4b44963

fix test

2e891e5

vortex-captain approved these changes Jun 4, 2026

xieofxie reviewed Jun 4, 2026

Resolve comments

0857e26

xieofxie approved these changes Jun 5, 2026

zhenchaoni merged commit d6a5ada into main Jun 5, 2026
9 checks passed

zhenchaoni deleted the private/zhenni/tensor_eval branch June 5, 2026 05:47

zhenchaoni merged 6 commits into main from private/zhenni/tensor_eval

3 participants
