Implement tensor similarity evaluator#805

Merged
zhenchaoni merged 6 commits into main from private/zhenni/tensor_eval
Jun 5, 2026

@zhenchaoni
Member

Fixes #804

Implement tensor similarity evaluator

Summary

Adds a new compare mode to winml eval that compares an ONNX candidate against its HF PyTorch reference on identical random inputs and reports per-output tensor-parity metrics (SQNR, PSNR, cosine similarity, MSE, max absolute diff). This isolates divergence introduced by the build pipeline (optimize / quantize / compile) from data- or pipeline-related differences — there is no labeled dataset, no HF pipeline, and no preprocessor in the loop.
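
As a rough illustration of the five per-output metrics named above, here is a minimal NumPy sketch. It is not the `eval_tensors` implementation: the function name `tensor_parity`, the `eps` guard, and the choice of the reference tensor's largest absolute value as the PSNR peak are assumptions made for this example.

```python
import numpy as np


def tensor_parity(reference, candidate, eps=1e-12):
    """Compare two same-shaped tensors; the reference is treated as ground truth.

    Hypothetical helper: standard textbook definitions are assumed, and the
    PSNR peak is taken to be max(|reference|).
    """
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")

    err = ref - cand
    mse = float(np.mean(err ** 2))
    signal_power = float(np.mean(ref ** 2))
    peak = float(np.max(np.abs(ref)))
    norms = float(np.linalg.norm(ref) * np.linalg.norm(cand))

    return {
        # Ratios are clamped with eps so identical or all-zero tensors stay finite.
        "sqnr_db": 10.0 * np.log10(max(signal_power, eps) / max(mse, eps)),
        "psnr_db": 10.0 * np.log10(max(peak ** 2, eps) / max(mse, eps)),
        "cosine_similarity": float(np.dot(ref, cand)) / max(norms, eps),
        "mse": mse,
        "max_abs_diff": float(np.max(np.abs(err))),
    }


reference = np.array([1.0, 2.0, 3.0, 4.0])
metrics = tensor_parity(reference, reference + 0.1)
```

Higher SQNR, PSNR, and cosine similarity mean a closer match; lower MSE and max absolute diff mean a closer match.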

Motivation

Task-level metrics (top-1, mIoU, BLEU, ...) tell us whether an optimized model still works, but not how much the optimize/quantize/compile passes perturbed the raw tensors. Tensor-similarity gives a fast, label-free, dataset-free signal for build-pipeline regressions and for picking quantization configs.

Usage

winml eval --mode compare -m microsoft/resnet-50 --task image-classification --precision fp16 --samples 100

What's new

  • winml eval --mode {onnx,compare} — new Click option on winml eval. onnx (default) is the existing dataset-driven flow; compare activates the new evaluator.
  • TensorSimilarityEvaluator (tensor_similarity_evaluator.py) — loads the HF reference on CPU/fp32 via resolve_task_and_model_class, draws inputs from RandomDataset over the candidate's ONNX I/O spec, runs both backends per sample, and aggregates per-output metrics.
  • TensorSimilarityMetric (tensor_similarity.py) — stateful update / compute / reset metric mirroring MeanIoUMetric. Per-sample math is bit-equivalent to the team-wide eval_tensors reference library on the same .npy pair.
  • Dispatch: evaluate.py registers "compare-tensor" and get_evaluator_class routes to it when config.mode == "compare"; compare mode bypasses default-dataset resolution and the dataset section of print_config.
  • Config: WinMLEvaluationConfig gains mode: str = "onnx"; to_dict emits mode only when it is non-default.
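
The config and dispatch behaviour in the last two bullets can be pictured with a small stand-in. Everything here is hypothetical: `EvalConfigSketch`, its `task` field, and `pick_evaluator_key` are illustrative names, not the real `WinMLEvaluationConfig` or `get_evaluator_class`.

```python
from dataclasses import dataclass


@dataclass
class EvalConfigSketch:
    """Hypothetical stand-in for the evaluation config (real fields not shown in this PR)."""

    task: str
    mode: str = "onnx"

    def to_dict(self):
        data = {"task": self.task}
        # Emit mode only when it differs from the default, so serialized
        # configs from the existing dataset-driven flow are unchanged.
        if self.mode != "onnx":
            data["mode"] = self.mode
        return data


def pick_evaluator_key(config):
    """Route compare mode to the tensor evaluator; otherwise key on the task."""
    if config.mode == "compare":
        return "compare-tensor"
    return config.task


default_cfg = EvalConfigSketch(task="image-classification")
compare_cfg = EvalConfigSketch(task="image-classification", mode="compare")
```

Omitting the default value from the serialized form keeps previously saved configs byte-for-byte stable when a new field is introduced.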

Output shape

compute() returns a display-ready flat dict so the existing generic eval report renders without a custom renderer:

{
    f"{metric}_{stat}": {output_name: float},  # 5 metrics × 4 stats = 20 keys
    ...
}

Stats are mean / std / min / max. The renderer prints one row per {metric}_{stat} with output_name=value cells joined across outputs.
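
A minimal sketch of how an update / compute / reset metric could produce this flat shape. The class name `RunningTensorStats`, the population standard deviation, the `RuntimeError` on empty state, and the cell separator in `render_row` are assumptions, not details taken from `tensor_similarity.py`.

```python
from collections import defaultdict

import numpy as np


class RunningTensorStats:
    """Hypothetical accumulator: per-sample values in, {metric}_{stat} dict out."""

    def __init__(self):
        self.reset()

    def reset(self):
        # metric name -> output name -> list of per-sample values
        self._values = defaultdict(lambda: defaultdict(list))

    def update(self, output_name, sample_metrics):
        for metric, value in sample_metrics.items():
            self._values[metric][output_name].append(float(value))

    def compute(self):
        if not self._values:
            raise RuntimeError("compute() called before any update()")
        stats = {"mean": np.mean, "std": np.std, "min": np.min, "max": np.max}
        return {
            f"{metric}_{stat}": {name: float(fn(vals)) for name, vals in per_output.items()}
            for metric, per_output in self._values.items()
            for stat, fn in stats.items()
        }


def render_row(key, per_output):
    """One report row: the key followed by output_name=value cells."""
    return f"{key}  " + "  ".join(f"{name}={value:.4f}" for name, value in per_output.items())


metric_names = ["sqnr", "psnr", "cosine_similarity", "mse", "max_abs_diff"]
tracker = RunningTensorStats()
tracker.update("logits", dict.fromkeys(metric_names, 1.0))
tracker.update("logits", dict.fromkeys(metric_names, 3.0))
report = tracker.compute()
```

With five metrics and four stats the result has 20 top-level keys, each mapping output names to a single float.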

Notable design choices

  • Output-name overlap, not strict equality. ONNX and HF output sets can differ (HF often exposes auxiliary tensors). We compute on the intersection and warn on divergence rather than failing.
  • Composite-model guard. Multi-component models (e.g. BLIP) raise a TypeError with guidance to run compare per sub-component — there is no canonical "one HF reference" for the composite.
  • int dtype normalization. Narrow int tensors are upcast to int64 before inference so HF embeddings accept them; WinMLSession down-casts to the ORT graph's declared dtype on its side. The same input dict feeds both backends.
  • Architecture-agnostic. No model-specific names, layer patterns, or hardcoded outputs anywhere in metric or evaluator code.
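
Two of the choices above, output-name overlap and int dtype normalization, can be sketched as follows. The helper names `shared_output_names` and `upcast_narrow_ints` are hypothetical, and treating "narrow" as any integer dtype under 8 bytes is an assumption.

```python
import warnings

import numpy as np


def shared_output_names(onnx_names, hf_names):
    """Return the names both backends produce, in ONNX order, warning on any mismatch."""
    hf_set = set(hf_names)
    onnx_set = set(onnx_names)
    shared = [name for name in onnx_names if name in hf_set]
    only_onnx = sorted(onnx_set - hf_set)
    only_hf = sorted(hf_set - onnx_set)
    if only_onnx or only_hf:
        warnings.warn(
            f"output sets differ; ONNX-only: {only_onnx}, HF-only: {only_hf}",
            stacklevel=2,
        )
    return shared


def upcast_narrow_ints(inputs):
    """Upcast integer arrays narrower than 64 bits to int64; leave other dtypes alone."""
    return {
        name: array.astype(np.int64)
        if np.issubdtype(array.dtype, np.integer) and array.dtype.itemsize < 8
        else array
        for name, array in inputs.items()
    }


feeds = upcast_narrow_ints(
    {
        "input_ids": np.array([[101, 2023, 102]], dtype=np.int32),
        "pixel_values": np.zeros((1, 3, 2, 2), dtype=np.float32),
    }
)
```

Preserving the ONNX ordering in the intersection keeps report columns stable from run to run.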

Tests

  • test_tensor_similarity_metric.py — 10 unit tests for the metric (numerics, identity, stat shape, reset, empty-state error).
  • test_tensor_similarity_evaluator.py — 4 unit tests (composite-model guard, output-name overlap, dispatch).
  • test_eval.py — get_evaluator_class updated to take WinMLEvaluationConfig; "compare-tensor" exempted from the task-schema set check.
  • tests/e2e/test_eval_e2e.py::test_compare_mode_image_classification — full CLI run on microsoft/resnet-50 fp16: asserts the 20-key flat shape, per-output cosine bounds [-1, 1], and (QNN host only) cosine_similarity_mean >= 0.95.

53 unit tests pass; ruff clean.

Out of scope (follow-ups)

  • --mode hf (run the HF pipeline on a labeled dataset as the reference for task-level metrics) — placeholder removed from this PR; will land as a separate change.
  • A custom renderer for compare mode (currently uses the generic table).

@zhenchaoni zhenchaoni requested a review from a team as a code owner June 3, 2026 05:30
Comment thread src/winml/modelkit/commands/eval.py Outdated
Comment thread src/winml/modelkit/eval/metrics/tensor_similarity.py
Contributor

@xieofxie xieofxie left a comment

Drive-by review focused on correctness, conventions in this repo (e.g. tqdm-is-optional from #788), and a few edge cases worth tightening before merge. Individual suggestions on the lines below.

Comment thread src/winml/modelkit/eval/config.py
Comment thread src/winml/modelkit/eval/config.py
Comment thread src/winml/modelkit/eval/tensor_similarity_evaluator.py Outdated
Comment thread src/winml/modelkit/eval/tensor_similarity_evaluator.py Outdated
Comment thread src/winml/modelkit/eval/tensor_similarity_evaluator.py Outdated
Comment thread src/winml/modelkit/eval/tensor_similarity_evaluator.py
Comment thread src/winml/modelkit/eval/tensor_similarity_evaluator.py Outdated
Comment thread src/winml/modelkit/eval/metrics/tensor_similarity.py
Comment thread src/winml/modelkit/eval/evaluate.py
Comment thread tests/e2e/test_eval_e2e.py Outdated
@zhenchaoni zhenchaoni merged commit d6a5ada into main Jun 5, 2026
9 checks passed
@zhenchaoni zhenchaoni deleted the private/zhenni/tensor_eval branch June 5, 2026 05:47
Development

Successfully merging this pull request may close these issues.

feat: support tensor level comparison for model evaluation

3 participants