Commit 27a9aa1 (1 parent: f13bbb6)

Address review comments

Signed-off-by: Kai Xu <kaix@nvidia.com>

2 files changed: 31 additions & 10 deletions

File: .claude/skills/evaluation/SKILL.md (5 additions & 10 deletions)

@@ -1,6 +1,6 @@
 ---
 name: evaluation
-description: Evaluate accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Use when user says "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel", or needs to measure how quantization affects model quality. Handles model deployment, config generation, and evaluation execution.
+description: Evaluate accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Use when user says "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel", or needs to measure how quantization affects model quality. Handles model deployment, config generation, and evaluation execution. Do NOT use for quantizing models (use ptq) or deploying/serving models (use deployment).
 license: Apache-2.0
 # Based on nel-assistant skill from NeMo Evaluator Launcher (commit f1fa073)
 # https://github.com/NVIDIA-NeMo/Evaluator/tree/f1fa073/packages/nemo-evaluator-launcher/.claude/skills/nel-assistant
@@ -76,6 +76,8 @@ Prompt the user with "I'll ask you 5 questions to build the base config we'll ad
 
 DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
 
+> **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
+
 When you have all the answers, run the script to build the base config:
 
 ```bash
@@ -118,14 +120,7 @@ If no `hf_quant_config.json`, also check `config.json` for a `quantization_confi
 
 **Quantization-aware benchmark defaults:**
 
-When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss:
-
-- **Always include**: MMLU (general knowledge — typically shows measurable accuracy loss from quantization)
-- **Recommended**: GSM8K (math reasoning — sensitive to precision loss), ARC-Challenge (reasoning)
-- **Good to add**: HumanEval (code generation — catches subtle degradation), Winogrande (commonsense)
-- **Less useful for quant comparison**: IFEval (instruction following — typically less affected, but worth including for aggressive quantization like FP4)
-
-Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
+When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.
 
 Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.
 

@@ -191,7 +186,7 @@ Print the following commands to the user. Propose to execute them in order to co
 **Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.
 
 ```bash
-# If using pre_cmd or post_cmd:
+# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands):
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1
 
 # If using nemo_skills.* tasks with self-deployment:
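
The `.env` workflow described in the hunk above can be sketched as follows. This is a minimal illustration, not NEL's own tooling: the token values are placeholder assumptions, and the actual `nel run` invocation is omitted since it needs a real config.

```shell
# Create a .env file with the secrets the config needs
# (values here are placeholders, not real tokens).
cat > .env <<'EOF'
HF_TOKEN=hf_example_token
NGC_API_KEY=ngc_example_key
EOF

# `set -a` marks every variable assigned while sourcing for export,
# so HF_TOKEN and NGC_API_KEY become visible to child processes
# such as `nel run`; `set +a` turns that behavior back off.
set -a && source .env && set +a

# Only needed when the config uses pre_cmd/post_cmd:
export NEMO_EVALUATOR_TRUST_PRE_CMD=1
```

Sourcing with `set -a` avoids writing a separate `export` line per variable and keeps secrets out of shell history.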
File: references/quantization-benchmarks.md (new file, 26 additions & 0 deletions)

@@ -0,0 +1,26 @@
+# Quantization-Aware Benchmark Recommendations
+
+When evaluating a quantized checkpoint, prioritize benchmarks that are sensitive to precision loss.
+
+## Sensitivity ranking
+
+| Priority | Benchmarks | Why |
+|----------|-----------|-----|
+| **Always include** | MMLU | General knowledge — typically shows measurable accuracy loss from quantization |
+| **Recommended** | GSM8K, ARC-Challenge | Math reasoning and general reasoning — sensitive to precision loss |
+| **Good to add** | HumanEval, Winogrande | Code generation and commonsense — catches subtle degradation |
+| **Less useful for quant comparison** | IFEval | Instruction following — typically less affected, but worth including for aggressive quantization like FP4 |
+
+## Recommended sets by use case
+
+| Use case | Benchmarks |
+|----------|-----------|
+| Quick sanity check | MMLU |
+| Standard quant validation | MMLU, GSM8K, ARC-Challenge |
+| Thorough evaluation | MMLU, GSM8K, ARC-Challenge, HumanEval, Winogrande |
+| Code-focused model | HumanEval, MBPP, MMLU |
+| Reasoning model | GSM8K, MATH-500, GPQA, MMLU |
+
+## How to use
+
+Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
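
The use-case table in the new file could be wired into a small helper when scripting evaluations. The sketch below is illustrative only, not part of NEL; the function name and the short use-case keys are assumptions.

```shell
# Illustrative helper (not part of NEL): map a use case from the table
# above to its recommended benchmark set, printed space-separated.
suggest_benchmarks() {
  case "$1" in
    quick)     echo "MMLU" ;;
    standard)  echo "MMLU GSM8K ARC-Challenge" ;;
    thorough)  echo "MMLU GSM8K ARC-Challenge HumanEval Winogrande" ;;
    code)      echo "HumanEval MBPP MMLU" ;;
    reasoning) echo "GSM8K MATH-500 GPQA MMLU" ;;
    *)         echo "MMLU" ;;  # unknown use case: fall back to the quick sanity check
  esac
}

suggest_benchmarks standard   # prints: MMLU GSM8K ARC-Challenge
```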
