132 commits
9b59d2e
[Automated Commit] Format Codebase
mlcommons-bot Dec 20, 2024
fe51c12
Merge branch 'mlcommons:master' into master
v-shobhit Sep 18, 2025
1f2666c
[Automated Commit] Format Codebase
github-actions[bot] Sep 18, 2025
ec84225
Add download_pdf.py
v-shobhit Sep 19, 2025
a77eb4d
Add artefacts downloading scripts
v-shobhit Sep 22, 2025
9f0db17
Fix downloading script
v-shobhit Sep 22, 2025
b4dcd57
[Automated Commit] Format Codebase
github-actions[bot] Sep 22, 2025
e0f28b2
cleanup read_pdf.py
v-shobhit Sep 22, 2025
d301bff
add env setup files
v-shobhit Sep 22, 2025
daa07f8
add single-shot retrieval
v-shobhit Sep 22, 2025
23aaea6
renmae setup file
v-shobhit Sep 22, 2025
a15bc6d
Add README
v-shobhit Sep 22, 2025
fb5f157
name fix
v-shobhit Sep 22, 2025
94b11d1
Add reranker
v-shobhit Sep 22, 2025
9275aaa
Add README.md
v-shobhit Sep 22, 2025
c5c7213
Save vector db
hans-intel Sep 26, 2025
5142a06
Support Intel XPU for embedding and reranker model
hans-intel Sep 26, 2025
9f5d8c5
Separate tok_k for retrieval and reranking
hans-intel Sep 26, 2025
41cd5c4
save url mapping in passage for evaluation
hans-intel Sep 26, 2025
7b9390f
Implement evaluation
hans-intel Sep 26, 2025
142d41d
Support bm25. RagDB is a superclass of vectordb and bm25db
hans-intel Sep 29, 2025
d1ed96d
Fix a bug in scoring; code cleanup
hans-intel Sep 29, 2025
73d1350
Add BM25 params
hans-intel Sep 29, 2025
9ab82f9
Change url_mapping to exclude file extension to support both pdf and txt
hans-intel Sep 29, 2025
beeb74f
Support txt file to ingest: --passages renamed to --ingest
hans-intel Sep 30, 2025
1f70b6c
Add stemmer to bm25
hans-intel Oct 1, 2025
94a46cb
Add missing ingest function for vectordb
hans-intel Oct 1, 2025
1140a8a
Add metrics recall, precision, F1, MAP (Mean Average Precision)
hans-intel Oct 1, 2025
4fd651c
implment retrieval strategies (top_p, relative, elbow, ...)
hans-intel Oct 2, 2025
140add3
rename evaluate() to evaluate_retrieval()
hans-intel Oct 2, 2025
4a8deb1
support html parsing (bs4)
hans-intel Oct 3, 2025
9710776
code cleanup
hans-intel Oct 3, 2025
1e84b9c
Fix vector db ingestion
hans-intel Oct 3, 2025
e72e2c7
fix r2r variation (torch seed)
hans-intel Oct 3, 2025
35f5bc2
improve parsing (remove wiki metadata, default sentence boundary)
hans-intel Oct 3, 2025
6c6d160
Add perf monitoring feature (--benchmark)
hans-intel Oct 6, 2025
fba871e
Add indexing trend measurement
hans-intel Oct 7, 2025
9396387
Add vector indexing option
hans-intel Oct 8, 2025
bb3737e
fix read_docs measurement
hans-intel Oct 10, 2025
f6b4396
Merge branch 'e2e-rag-eval-adv' into e2e-rag
hans-intel Oct 10, 2025
a385e67
Merge branch 'e2e-rag-r2r-fix' into e2e-rag
hans-intel Oct 10, 2025
9b28f7b
Merge branch 'e2e-rag-bsoup' into e2e-rag
hans-intel Oct 10, 2025
b4b0e02
Merge branch 'e2e-rag-latency-measurement' into e2e-rag
hans-intel Oct 10, 2025
7aee971
Merge branch 'e2e-rag-vector-indexing-option' into e2e-rag
hans-intel Oct 10, 2025
3c8a004
Implment IVF nprobe option
hans-intel Oct 14, 2025
5a9120c
Update README
hans-intel Oct 15, 2025
ac61fc0
Add feature save/load embeddings (--load-embeddings)
hans-intel Oct 15, 2025
b39e8ce
Embedding is done in multiple devices
hans-intel Oct 17, 2025
bfcb570
Centralized parameter management
hans-intel Oct 17, 2025
f5775df
Support top_p in vector db
hans-intel Oct 17, 2025
5ded358
Support parallelization in read_docs (--processes)
hans-intel Oct 17, 2025
1ae5f56
Change default max retrieval to 20 from 100
hans-intel Oct 17, 2025
f47c4f5
fix result display -- deduplicated urls
hans-intel Oct 17, 2025
a004b93
Fix correctly passing retrieval strategy params to filter
hans-intel Oct 17, 2025
ccef3a0
Merge branch 'e2e-rag-fix-vector-top_p' into e2e-rag
hans-intel Oct 18, 2025
464f0d8
Add new metrics (@N where N=#retrieved docs)
hans-intel Oct 18, 2025
9156fc6
Add detailed analysis after eval based on reasoning types and number …
hans-intel Oct 18, 2025
0fedf4f
More detailed analyses for FRAMES prompts
hans-intel Oct 26, 2025
c2e78df
basic multi-shot implementation
hans-intel Oct 27, 2025
2be6029
Implement --difficulty option (target difficult prompts only)
hans-intel Oct 27, 2025
731e101
implement doc grader
hans-intel Oct 27, 2025
0e9e44c
Combined docgrader into query_rewriter
hans-intel Oct 27, 2025
1a58fab
improve query writer prompt (WIP)
hans-intel Oct 29, 2025
a4171c8
Support hpu devices
hans-intel Oct 29, 2025
9d9254f
Query writer prompt WIP
hans-intel Oct 30, 2025
34f5501
Query writer prompt WIP
hans-intel Oct 30, 2025
6c7b862
Query rewriter prompt WIP
hans-intel Oct 30, 2025
9b26053
HPU reranking+embedding workaround (run on CPU)
hans-intel Nov 6, 2025
d966e8b
HPU reranking+embedding workaround (run on CPU)
hans-intel Nov 6, 2025
936d5ab
HPU embedding fix
hans-intel Nov 8, 2025
2a00054
fix read_docs to include .infobox
hans-intel Nov 9, 2025
248e23f
HPU reranking+embedding workaround (run on CPU)
hans-intel Nov 6, 2025
0eedaa6
HPU embedding fix
hans-intel Nov 8, 2025
ea50009
Support hpu devices
hans-intel Oct 29, 2025
3994990
generate LLM answer single_shot_retrieval
hans-intel Nov 6, 2025
eb57581
LLM answer for single shot retrieval and evaluation script
hans-intel Nov 7, 2025
b4da009
change default params to match FRAMES paper (k=5, n=5)
hans-intel Nov 9, 2025
3c35831
HPU reranking+embedding workaround (run on CPU)
hans-intel Nov 6, 2025
f2d4e55
copy url_mapping to output folder
hans-intel Nov 9, 2025
8bfd4ad
Support hpu devices
hans-intel Oct 29, 2025
ab9f902
Saving result of multi-shot retrieval for evaluate.py
hans-intel Nov 9, 2025
1139d0d
Simplified query rewriter prompt as baseline
hans-intel Nov 9, 2025
068d9b3
improve scoring prompt
hans-intel Nov 9, 2025
8b3b155
Fix bm25 for single shot retrieval
hans-intel Nov 9, 2025
8d22e1f
add scripts
hans-intel Nov 10, 2025
1bfebe5
multi hop accuracy improvement with oracle test into the same branch
mkankana Apr 29, 2026
9a038e4
hybrid model usage 120B+20B. removed dead code
mkankana May 6, 2026
b9a73fc
LLM calls to openrouter. logging all LLM calls and results
mkankana May 14, 2026
bbc0e66
added log sample result
mkankana May 17, 2026
6670aa9
added log sample result
mkankana May 17, 2026
3cad55f
Add parallel multi-shot retrieval with per-query threading
hans-intel May 18, 2026
3e8a0b4
Add OUTPUT_DIR and TEMPERATURE as parameters
hans-intel May 18, 2026
b6eacfa
Add retry with backoff for LLM calls and fix silent failures
hans-intel May 19, 2026
ce33ce8
Fix colbert to be used in late interaction way (query, doc embedding …
hans-intel May 21, 2026
46a9d95
Refactor device detection for cross-vendor support
hans-intel May 22, 2026
777c5bf
- OMP_NUM_THREADS derived from sched_getaffinity (respects upstream n…
hans-intel May 22, 2026
25b9b24
Remove hard-coded NUMA config, now in python script (membind cannot b…
hans-intel May 22, 2026
0b5fa9c
Add per-system config.sh + entry-point wrappers
hans-intel May 22, 2026
b701630
Add per-process GPU index allocator
hans-intel May 22, 2026
4c22295
Per-process reranker; per-worker NUMA pinning
hans-intel May 22, 2026
88df0be
add cross-system DB verification scripts
hans-intel May 22, 2026
ff80999
Decouple endpoint routing from OpenRouter
hans-intel May 23, 2026
9527fdd
Update readme
hans-intel May 23, 2026
e025c66
performance test simulated with cached result
mkankana Jun 1, 2026
99835eb
add a script that calculates prefix cache hit rate
hans-intel Jun 2, 2026
32041c5
Fixes to download docs (with delay and proper URL formatting), WARN a…
rpoornac Jun 2, 2026
ed2419b
added missing perf cache result file
mkankana Jun 3, 2026
e811b74
Merge branch 'multi-vendor-support' of https://github.com/hans-intel/…
mkankana Jun 3, 2026
826c784
performance metric added for variation test purpose
mkankana Jun 3, 2026
2ee4fd3
added loadgen integration
mkankana Jun 4, 2026
d7d3dbe
Fix loadgen integration: mask API keys, default to local vLLM, add th…
mkankana Jun 8, 2026
5fbae0c
seperated endpoint parameters for different servers
mkankana Jun 9, 2026
9e0c1dd
added indexing measurments kpi
mkankana Jun 9, 2026
60f8e6d
passing the threads option from reference script
mkankana Jun 10, 2026
16ab751
fixed threading loadgen integration
mkankana Jun 12, 2026
0edc531
added datasetup KPI implementation
mkankana Jun 15, 2026
6635959
Add Apache 2.0 license headers to all Python files
mkankana Jun 15, 2026
c81c1dc
Prepare for MLCommons contribution
mkankana Jun 15, 2026
be83c3f
Merge remote-tracking branch 'upstream/master' into perf_test_with_ca…
mkankana Jun 15, 2026
c81c9ff
Add configurable judge LLM support to accuracy evaluation
mkankana Jun 16, 2026
226c025
Fix data setup KPI measurement script
mkankana Jun 17, 2026
cd961d3
Fix data setup KPI measurement script
mkankana Jun 17, 2026
e8ec8a6
loadgen integration for e2e-datasetup workload
mkankana Jun 18, 2026
793b934
Add TEST09 compliance testing for e2e-rag workload
mkankana Jun 18, 2026
4b25f32
Sort loadgen samples by index for deterministic processing
mkankana Jun 18, 2026
10145f8
Update documentation to use MLCommons-hosted datasets and models
mkankana Jun 18, 2026
f70f7f4
Merge branch 'master' into perf_test_with_cached_output
pgmpablo157321 Jun 22, 2026
6127720
path fixes to match download from mlcommons storage
mkankana Jun 22, 2026
a0b69ab
Rename workload names for MLPerf Loadgen integration
mkankana Jun 22, 2026
63125e6
renamed folder to e2e-rag
mkankana Jun 23, 2026
0ba4553
readme correction
mkankana Jun 23, 2026
3a53758
db workload name minor fix
mkankana Jun 23, 2026
131 changes: 131 additions & 0 deletions compliance/TEST09/e2e-rag/README.md
@@ -0,0 +1,131 @@
# TEST09 Compliance for E2E-RAG Workload

## Overview

TEST09 verifies that the mean output token length measured during a performance run stays within bounds derived from the reference implementation. This prevents a submission from gaining performance by truncating model outputs.

## Statistics from Reference Implementation

Based on 5 production runs (4021 total answer_generator invocations):

| Run | Samples | Avg OSL |
|-----|---------|---------|
| 1 | 794 | 221 |
| 2 | 805 | 258 |
| 3 | 810 | 273 |
| 4 | 813 | 211 |
| 5 | 799 | 214 |

**Weighted Mean OSL:** 235.47 tokens

**TEST09 Thresholds (reference mean ±10%, applied to the mean output token length of a run):**
- Min output tokens: 211.92 (235.47 × 0.90)
- Max output tokens: 259.02 (235.47 × 1.10)
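
The weighted mean and the ±10% bounds can be reproduced from the table above. The following is a minimal Python sketch, assuming the per-run averages are the rounded values shown in the table; the variable names are illustrative and not taken from the reference implementation.

```python
# (samples, average OSL) per run, copied from the table above.
runs = [(794, 221), (805, 258), (810, 273), (813, 211), (799, 214)]

total_samples = sum(n for n, _ in runs)
# Weight each run's average by its number of answer_generator invocations.
weighted_mean = sum(n * avg for n, avg in runs) / total_samples

tolerance = 0.10  # TEST09 allows +/-10% around the reference mean
lower_bound = weighted_mean * (1 - tolerance)
upper_bound = weighted_mean * (1 + tolerance)

print(f"samples={total_samples} mean={weighted_mean:.2f} "
      f"bounds=[{lower_bound:.2f}, {upper_bound:.2f}]")
# prints: samples=4021 mean=235.47 bounds=[211.92, 259.02]
```

Weighting by sample count matters here because the runs differ slightly in size; a plain average of the five per-run values would give a marginally different number.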

## Usage

### Automated Workflow (Recommended)

Use the automated compliance test script that handles setup, run, verification, and cleanup:

```bash
cd inference/e2e-rag

# Run TEST09 only
bash run_compliance_test09.sh

# Or run all compliance tests
bash run_all_compliance_tests.sh
```

The script will:
1. ✓ Copy audit.config to working directory
2. ✓ Run performance test with LoadGen compliance logging
3. ✓ Verify output token length thresholds
4. ✓ Copy results to submission directory
5. ✓ Clean up audit.config automatically

**Environment Variables (optional):**

```bash
# Override defaults
export DATABASE=vector_html_hnsw_len768_ov32_word.db
export MAX_ASYNC_QUERIES=10
export MAX_WORKERS=10
export PERF_CACHE_FILE=logs_result.json # Use cached LLM responses

bash run_compliance_test09.sh
```

### Manual Workflow

#### Part I: Setup

Copy the audit.config to your working directory:

```bash
cd inference/e2e-rag
cp ../compliance/TEST09/e2e-rag/audit.config ./
```

#### Part II: Run Performance Test

Run the benchmark as usual. LoadGen automatically detects `audit.config` in the working directory:

```bash
bash reference_mlperf_perf.sh
```

Or directly:

```bash
python3 reference_mlperf.py \
--dataset_path data/frames_dataset.tsv \
--database vector_html_hnsw_len768_ov32_word.db \
--scenario Offline \
--log_dir run_output_test09 \
--perf_count 824
```

**Important:** Remove `audit.config` after the test to avoid running in compliance mode unintentionally.
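
One way to make sure `audit.config` never lingers after a manual run is to stage it inside a context manager that removes it on exit. The sketch below is an illustration only: `staged_audit_config` is a hypothetical helper, not part of this repository, and it assumes Python 3.8+ for `Path.unlink(missing_ok=True)`.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_audit_config(source: Path, workdir: Path):
    """Copy audit.config into workdir and remove it again on exit."""
    target = workdir / "audit.config"
    shutil.copy(source, target)
    try:
        yield target
    finally:
        # Runs even if the benchmark raises, so no stale file is left behind.
        target.unlink(missing_ok=True)


# Example usage (hypothetical, mirrors Parts I and II above):
#
#   import subprocess
#   src = Path("../compliance/TEST09/e2e-rag/audit.config")
#   with staged_audit_config(src, Path(".")):
#       subprocess.run(["bash", "reference_mlperf_perf.sh"], check=True)
```

The `finally` clause is the point of the design: cleanup happens whether the run succeeds, fails, or is interrupted by an exception.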

#### Part III: Run Verification

```bash
python3 inference/e2e/third_party/mlperf-inference/compliance/TEST09/run_verification.py \
-c run_output_test09 \
-o submission/compliance/e2e-rag/Offline \
--audit-config ../compliance/TEST09/e2e-rag/audit.config
```

### Expected Output

```
================================================================================
TEST09: Verify Output Token Length in Performance Mode
================================================================================
Output Token Length Statistics
================================================================================
Total samples: 824
Mean output tokens: 235.47
Min output tokens: 1
Max output tokens: 2829
Std deviation: ~150

================================================================================
Verification Results
================================================================================
Mean output tokens: 235.47
Min threshold: 211.92 -> PASS
Max threshold: 259.02 -> PASS

Overall: TEST PASS
```
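
The pass/fail decision in the output above depends only on the mean: individual samples (here 1 and 2829 tokens) may lie far outside the thresholds. The following sketch illustrates that check under stated assumptions. It is not the code in `run_verification.py`, and `check_mean_output_tokens` is a hypothetical name.

```python
from statistics import mean


def check_mean_output_tokens(token_counts, min_mean, max_mean):
    """Return (observed_mean, passed) for per-sample output token counts."""
    if not token_counts:
        raise ValueError("no samples were logged")
    observed = mean(token_counts)
    # The bounds apply to the mean; single samples may fall outside them.
    return observed, min_mean <= observed <= max_mean


# Made-up token counts: widely spread, but the mean is inside the bounds.
observed, passed = check_mean_output_tokens([1, 150, 240, 551], 211.92, 259.02)
print(observed, "PASS" if passed else "FAIL")

# Made-up truncated outputs: the mean drops below the lower bound.
observed_short, passed_short = check_mean_output_tokens([100, 120], 211.92, 259.02)
print(observed_short, "PASS" if passed_short else "FAIL")
```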

## Notes

- **Component Measured:** `answer_generator` - this is the only component that LoadGen logs (the final answer generation step)
- **Dataset:** Full 824 queries from frames_dataset.tsv
- **Workload:** Multi-hop RAG with iterative retrieval (max 5 iterations)
- The thresholds are based on the answer_generator component only, not the intermediate multi-shot retrieval steps
- Other components (check_sufficiency, generate_search_queries, evaluate_document_relevance) are internal to the SUT and not measured by TEST09
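
To make the distinction between measured and internal components concrete, here is a rough Python sketch of an iterative retrieval loop. The component names come from the notes above, but the control flow, the function signatures, and the `retrieve` callable are assumptions made for illustration and do not describe the actual SUT.

```python
def answer_query(question, retrieve, generate_search_queries,
                 evaluate_document_relevance, check_sufficiency,
                 answer_generator, max_iterations=5):
    """Hypothetical multi-hop loop; only answer_generator output is measured."""
    context = []
    queries = [question]
    for _ in range(max_iterations):
        for query in queries:
            for doc in retrieve(query):
                # Internal step: keep only new documents judged relevant.
                if doc not in context and evaluate_document_relevance(question, doc):
                    context.append(doc)
        # Internal step: stop early once the context can answer the question.
        if check_sufficiency(question, context):
            break
        # Internal step: rewrite the search queries for the next hop.
        queries = generate_search_queries(question, context)
    # Final step: the only output whose token length TEST09 checks.
    return answer_generator(question, context)
```

In this sketch the intermediate calls can produce any number of tokens without affecting TEST09, because only the return value of `answer_generator` reaches LoadGen.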
39 changes: 39 additions & 0 deletions compliance/TEST09/e2e-rag/audit.config
@@ -0,0 +1,39 @@
# The format of this config file is 'key = value'.
# The key has the format 'model.scenario.key'. Value is mostly int64_t.
# Model may be '*' as a wildcard, in which case the value applies to all models.
# All times are in milliseconds.

# TEST09: Verify output token length in performance mode
# This test logs ALL samples and verifies mean output token length is within bounds.

# mode dictionary (0 = submission, 1 = accuracy, 2 = performance, 3 = find peak perf)
*.*.mode = 2

# Use a fixed RNG seed for reproducibility
*.*.accuracy_log_rng_seed = 720381539243781796

# Log ALL samples - set to a value >= total dataset size (824 samples for e2e-rag)
# Using a large value ensures all samples are logged regardless of performance
*.*.accuracy_log_sampling_target = 10000

# Ensure we run through all samples
*.*.min_query_count = 824
*.*.min_duration = 0

# Turn off sample concatenation for accurate logging
*.*.sample_concatenate_permutation = 0

# =============================================================================
# TEST09 Compliance Thresholds (read by run_verification.py, not by LoadGen)
# =============================================================================
# Output token length bounds for compliance verification
# Each benchmark defines its own thresholds based on reference implementation.
# For e2e-rag workload:
# - dataset: 824 queries from frames_dataset.tsv
# - component: answer_generator (the only component visible to loadgen)
# - reference mean-OSL: 235.47 (weighted average across 5 runs of 4021 total samples)
# - threshold: +/-10%
# 235.47 * 0.90 = 211.92
*.*.test09_min_output_tokens = 211.92
# 235.47 * 1.10 = 259.02
*.*.test09_max_output_tokens = 259.02
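
For readers who want to inspect these values programmatically, the `model.scenario.key = value` format described at the top of the file is simple to parse. The sketch below is an assumption-laden illustration: `parse_audit_config` is a hypothetical helper, not the parser used by LoadGen or by `run_verification.py`.

```python
def _to_number(text):
    """Parse an integer if possible, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_audit_config(text):
    """Map (model, scenario, key) tuples to numeric values."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        full_key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'key = value' line
        model, scenario, key = full_key.strip().split(".", 2)
        settings[(model, scenario, key)] = _to_number(value.strip())
    return settings


sample = """
# TEST09 thresholds
*.*.mode = 2
*.*.test09_min_output_tokens = 211.92
*.*.test09_max_output_tokens = 259.02
"""
settings = parse_audit_config(sample)
print(settings[("*", "*", "test09_min_output_tokens")])
```

Integers are parsed before floats so that large values such as the RNG seed keep their exact value.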
42 changes: 42 additions & 0 deletions e2e-rag/.gitignore
@@ -0,0 +1,42 @@
*.db
*.db-*
output_*
output/
run_output/
subsample_variation_results/
variance_test_*/
passages/
passages_*.json
data/
run_container.sh
.claude/
config.sh
*.log
third_party/
doc_html/
wiki_articles/
mlperf.conf
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.so
*.egg
*.egg-info/
dist/
build/
*.emb.pkl
oracle_checkpoint.pkl
result_*.json
*_checkpoint.pkl
.DS_Store
*.swp
*.swo
*~
temp_complete_kpi_*.json
run_output_datasetup_accuracy/
run_output_datasetup/
colbert-ir_colbertv2.0/
frames-benchmark-dataset/
intfloat_e5-base-v2/