mlcommons · mkankana · Dec 20, 2024 · Sep 18, 2025 · Sep 18, 2025 · Sep 19, 2025
@@ -0,0 +1,131 @@
+# TEST09 Compliance for E2E-RAG Workload
+
+## Overview
+
+TEST09 verifies that the mean output token length in performance runs stays within expected bounds, so that a submission cannot gain performance by truncating its outputs.
+
+## Statistics from Reference Implementation
+
+Based on 5 production runs (4021 total answer_generator invocations):
+
+| Run | Samples | Avg OSL |
+|-----|---------|---------|
+| 1   | 794     | 221     |
+| 2   | 805     | 258     |
+| 3   | 810     | 273     |
+| 4   | 813     | 211     |
+| 5   | 799     | 214     |
+
+**Weighted Mean OSL:** 235.47 tokens
+
+**TEST09 Thresholds (±10% of the weighted mean):**
+- Minimum mean output tokens: 211.92
+- Maximum mean output tokens: 259.02
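
The weighted mean and the ±10% bounds can be reproduced from the table with a few lines of Python. This is only an arithmetic check, not part of the compliance tooling; it assumes the bounds are taken from the mean after rounding it to two decimals, which matches the published numbers.

```python
# Per-run (sample count, average OSL) pairs copied from the table above.
runs = [(794, 221), (805, 258), (810, 273), (813, 211), (799, 214)]

total_samples = sum(n for n, _ in runs)
weighted_mean = sum(n * osl for n, osl in runs) / total_samples

# Assumption: thresholds are derived from the mean rounded to two decimals.
mean_osl = round(weighted_mean, 2)
min_tokens = round(mean_osl * 0.90, 2)
max_tokens = round(mean_osl * 1.10, 2)

print(total_samples, mean_osl, min_tokens, max_tokens)
# 4021 235.47 211.92 259.02
```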
+
+## Usage
+
+### Automated Workflow (Recommended)
+
+Use the automated compliance test script, which handles setup, the run, verification, and cleanup:
+
+```bash
+cd inference/e2e
+
+# Run TEST09 only
+bash run_compliance_test09.sh
+
+# Or run all compliance tests
+bash run_all_compliance_tests.sh
+```
+
+The script will:
+1. ✓ Copy audit.config to working directory
+2. ✓ Run performance test with LoadGen compliance logging
+3. ✓ Verify output token length thresholds
+4. ✓ Copy results to submission directory
+5. ✓ Clean up audit.config automatically
+
+**Environment Variables (optional):**
+
+```bash
+# Override defaults
+export DATABASE=vector_html_hnsw_len768_ov32_word.db
+export MAX_ASYNC_QUERIES=10
+export MAX_WORKERS=10
+export PERF_CACHE_FILE=logs_result.json  # Use cached LLM responses
+
+bash run_compliance_test09.sh
+```
+
+### Manual Workflow
+
+#### Part I: Setup
+
+Copy the audit.config to your working directory:
+
+```bash
+cd inference/e2e-rag
+cp ../compliance/TEST09/e2e-rag/audit.config ./
+```
+
+#### Part II: Run Performance Test
+
+Run the benchmark as normal. LoadGen automatically detects `audit.config` in the working directory:
+
+```bash
+bash reference_mlperf_perf.sh
+```
+
+Or directly:
+
+```bash
+python3 reference_mlperf.py \
+    --dataset_path data/frames_dataset.tsv \
+    --database vector_html_hnsw_len768_ov32_word.db \
+    --scenario Offline \
+    --log_dir run_output_test09 \
+    --perf_count 824
+```
+
+**Important:** Remove `audit.config` after the test to avoid running in compliance mode unintentionally.
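
One way to make the cleanup hard to forget is to wrap the copy and removal in a context manager. The sketch below is a hypothetical helper, not part of the reference scripts: the name `audit_config` and its arguments are assumptions made for illustration.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def audit_config(source: Path, workdir: Path):
    """Copy audit.config into workdir and remove it again on exit."""
    target = workdir / "audit.config"
    shutil.copy(source, target)
    try:
        yield target
    finally:
        # Runs even if the benchmark raises, so no stale file is left behind.
        target.unlink(missing_ok=True)
```

Inside the `with audit_config(...)` block one would launch the benchmark, for example with `subprocess.run(["bash", "reference_mlperf_perf.sh"], check=True)`; the file is removed when the block exits, whether or not the run succeeded.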
+
+#### Part III: Run Verification
+
+```bash
+python3 inference/e2e/third_party/mlperf-inference/compliance/TEST09/run_verification.py \
+    -c run_output_test09 \
+    -o submission/compliance/e2e-rag/Offline \
+    --audit-config ../compliance/TEST09/e2e-rag/audit.config
+```
+
+### Expected Output
+
+```
+================================================================================
+TEST09: Verify Output Token Length in Performance Mode
+================================================================================
+Output Token Length Statistics
+================================================================================
+Total samples: 824
+Mean output tokens: 235.47
+Min output tokens: 1
+Max output tokens: 2829
+Std deviation: ~150
+
+================================================================================
+Verification Results
+================================================================================
+Mean output tokens: 235.47
+Min threshold: 211.92 -> PASS
+Max threshold: 259.02 -> PASS
+
+Overall: TEST PASS
+```
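
The pass/fail decision shown above comes down to comparing the mean of the logged token counts against the two thresholds. The following is a minimal sketch of that check, not the code in `run_verification.py`; the function name is hypothetical and treating the bounds as inclusive is an assumption.

```python
from statistics import mean


def check_output_tokens(token_counts, min_tokens, max_tokens):
    """Return (mean, passed) for a list of per-sample output token counts."""
    avg = mean(token_counts)
    # Assumption: a mean exactly on a threshold still passes.
    return avg, min_tokens <= avg <= max_tokens


# Illustrative inputs only, not measured data.
print(check_output_tokens([200, 240, 260], 211.92, 259.02)[1])  # True
print(check_output_tokens([50, 60, 70], 211.92, 259.02)[1])     # False
```

Because only the mean is checked, individual samples may be very short or very long (as in the minimum of 1 and maximum of 2829 above) without failing the test.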
+
+## Notes
+
+- **Component Measured:** `answer_generator`, the final answer generation step and the only component that LoadGen logs
+- **Dataset:** Full 824 queries from frames_dataset.tsv
+- **Workload:** Multi-hop RAG with iterative retrieval (max 5 iterations)
+- The thresholds are based on the answer_generator component only, not the intermediate multi-hop retrieval steps
+- Other components (check_sufficiency, generate_search_queries, evaluate_document_relevance) are internal to the SUT and not measured by TEST09
@@ -0,0 +1,39 @@
+# The format of this config file is 'key = value'.
+# The key has the format 'model.scenario.key'. Value is mostly int64_t.
+# Model may be '*' as a wildcard. In that case the value applies to all models.
+# All times are in milliseconds.
+
+# TEST09: Verify output token length in performance mode
+# This test logs ALL samples and verifies mean output token length is within bounds.
+
+# mode dictionary (0 = submission, 1 = accuracy, 2 = performance, 3 = find peak perf)
+*.*.mode = 2
+
+# Use a fixed RNG seed for reproducibility
+*.*.accuracy_log_rng_seed = 720381539243781796
+
+# Log ALL samples - set to a value >= total dataset size (824 samples for e2e-rag)
+# Using a large value ensures all samples are logged regardless of performance
+*.*.accuracy_log_sampling_target = 10000
+
+# Ensure we run through all samples
+*.*.min_query_count = 824
+*.*.min_duration = 0
+
+# Turn off sample concatenation for accurate logging
+*.*.sample_concatenate_permutation = 0
+
+# =============================================================================
+# TEST09 Compliance Thresholds (read by run_verification.py, not by LoadGen)
+# =============================================================================
+# Output token length bounds for compliance verification
+# Each benchmark defines its own thresholds based on reference implementation.
+# For e2e-rag workload:
+# - dataset: 824 queries from frames_dataset.tsv
+# - component: answer_generator (the only component visible to loadgen)
+# - reference mean-OSL: 235.47 (weighted average across 5 runs, 4021 samples in total)
+# - threshold: +/-10%
+# 235.47 * 0.90 = 211.92
+*.*.test09_min_output_tokens = 211.92
+# 235.47 * 1.10 = 259.02
+*.*.test09_max_output_tokens = 259.02
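
For readers who want to inspect the thresholds programmatically, the `key = value` format described at the top of this file is simple to parse. The sketch below is an illustrative parser, not the one used by LoadGen or `run_verification.py`; it assumes model and scenario names contain no dots and that `#` always starts a comment.

```python
def parse_audit_config(text):
    """Parse 'model.scenario.key = value' lines into {(model, scenario, key): value}."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, _, value = line.partition("=")
        model, scenario, key = name.strip().split(".", 2)
        value = value.strip()
        # Keep integers exact (the RNG seed is large); fall back to float otherwise.
        settings[(model, scenario, key)] = (
            int(value) if value.lstrip("-").isdigit() else float(value)
        )
    return settings


sample = """
*.*.mode = 2
# 235.47 * 0.90 = 211.92
*.*.test09_min_output_tokens = 211.92
*.*.test09_max_output_tokens = 259.02
"""
cfg = parse_audit_config(sample)
print(cfg[("*", "*", "test09_min_output_tokens")])  # 211.92
```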
@@ -0,0 +1,42 @@
+*.db
+*.db-*
+output_*
+output/
+run_output/
+subsample_variation_results/
+variance_test_*/
+passages/
+passages_*.json
+data/
+run_container.sh
+.claude/
+config.sh
+*.log
+third_party/
+doc_html/
+wiki_articles/
+mlperf.conf
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.Python
+*.so
+*.egg
+*.egg-info/
+dist/
+build/
+*.emb.pkl
+oracle_checkpoint.pkl
+result_*.json
+*_checkpoint.pkl
+.DS_Store
+*.swp
+*.swo
+*~
+temp_complete_kpi_*.json
+run_output_datasetup_accuracy/
+run_output_datasetup/
+colbert-ir_colbertv2.0/
+frames-benchmark-dataset/
+intfloat_e5-base-v2/