
Commit 285c7cd

Batch Bench (#146)
* add the desc of batch eval
* add usage of batch bench evaluation
1 parent 42239ec commit 285c7cd

4 files changed (+93, -404 lines changed)

Lines changed: 47 additions & 47 deletions

---
title: Model Capability Assessment Pipeline
createTime: 2025/08/30 14:27:02
icon: hugeicons:chart-evaluation
permalink: /en/guide/evaluation-pipeline/
---

# Model Capability Assessment Pipeline

⚠️ Only supports QA pair format evaluation

## Quick Start

```bash
cd DataFlow
pip install -e .[eval]
# ...
dataflow eval init
dataflow eval api / dataflow eval local
```

## Step 1: Install Evaluation Environment

Download the evaluation environment:

```bash
cd DataFlow
pip install -e .[eval]
cd ..
```

## Step 2: Create and Enter DataFlow Working Directory

```bash
mkdir workspace
cd workspace
```

## Step 3: Prepare Evaluation Data and Initialize Configuration Files

Initialize the configuration files:

```bash
dataflow eval init
```

💡 After initialization, the project directory structure becomes:

```bash
Project Root/
├── eval_api.py    # Configuration file for the API model evaluator
└── eval_local.py  # Configuration file for the local model evaluator
```

## Step 4: Prepare Evaluation Data

### Method 1: JSON Format

Please prepare a JSON-format file with a data structure similar to the example below:

```json
[
    {
        ...
    }
]
```

💡 In this example data:

- `input` is the question (it can also be the question and answer choices merged into one input)

- `output` is the standard answer
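
For reference, here is a minimal sketch that writes a file in this format; the questions, answers, and the `qa.json` path are purely illustrative and not part of DataFlow:

```python
import json

# Illustrative QA records: `input` holds the question (optionally with answer
# choices appended), and `output` holds the standard answer.
qa_records = [
    {
        "input": "Which planet is known as the Red Planet? A. Venus  B. Mars  C. Jupiter",
        "output": "B. Mars",
    },
    {
        "input": "What does HTTP stand for?",
        "output": "HyperText Transfer Protocol",
    },
]

# Write the records to qa.json (example path) for the evaluator to read.
with open("qa.json", "w", encoding="utf-8") as f:
    json.dump(qa_records, f, ensure_ascii=False, indent=2)
```

A file produced this way can then be referenced as the `input_file` of a benchmark entry in Step 5.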

### Method 2: Custom Field Mapping

You can also skip data preprocessing (as long as your data has clearly identifiable question and standard-answer fields) and configure the field name mapping through `eval_api.py` and `eval_local.py`:

```python
EVALUATOR_RUN_CONFIG = {
    "input_test_answer_key": "model_generated_answer",  # Field name for model-generated answers
    # ...
}
```

## Step 5: Configure Parameters

### Model Parameter Configuration

If you want to use a local model as the evaluator, please modify the parameters in the `eval_local.py` file.

If you want to use an API model as the evaluator, please modify the parameters in the `eval_api.py` file.

```python
# Target Models Configuration (same as API mode)

TARGET_MODELS = [
    # ...

    # 3. Custom configuration
    # Add more models...
    {
        "name": "qwen_7b",                    # Model name
        "path": "./Qwen2.5-7B-Instruct",      # Model path
        # Each model can use different parameters
        "vllm_tensor_parallel_size": 4,       # Number of GPUs
        "vllm_temperature": 0.1,              # Sampling randomness
        "vllm_top_p": 0.9,                    # Top-p sampling
        "vllm_max_tokens": 2048,              # Maximum number of generated tokens
        "vllm_repetition_penalty": 1.0,       # Repetition penalty
        "vllm_seed": None,                    # Random seed
        "vllm_gpu_memory_utilization": 0.9,   # Maximum GPU memory utilization
        # A custom prompt can be defined for each model
        "answer_prompt": """please answer the following question:"""
    }
]
```
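
Before launching a local run, it can be worth confirming that the configured `vllm_tensor_parallel_size` does not exceed the number of visible GPUs. A minimal optional check, assuming PyTorch is installed (this script is illustrative and not part of DataFlow):

```python
import torch

# Mirror the relevant fields of the TARGET_MODELS entries above (values are examples).
target_models = [
    {"name": "qwen_7b", "vllm_tensor_parallel_size": 4},
]

available_gpus = torch.cuda.device_count()
for model in target_models:
    requested = model["vllm_tensor_parallel_size"]
    if requested > available_gpus:
        raise RuntimeError(
            f"{model['name']} requests {requested} GPUs, "
            f"but only {available_gpus} are visible."
        )
print(f"GPU check passed: {available_gpus} GPU(s) visible.")
```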

### Bench Parameter Configuration

Benchmarks can be configured in batches:

```python
BENCH_CONFIG = [
    {
        "name": "bench_name",                            # Benchmark name
        "input_file": "path_to_your_qa/qa.json",         # Data file
        "question_key": "input",                         # Question field name
        "reference_answer_key": "output",                # Reference answer field name
        "output_dir": "path/bench_name",                 # Output directory
    },
    {
        "name": "other_bench_name",
        "input_file": "path_to_your_qa/other_qa.json",
        "question_key": "input",
        "reference_answer_key": "output",
        "output_dir": "path/other_bench_name",
    }
]
```
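
Because each benchmark entry points at its own data file and field names, a short optional pre-flight check can catch missing files or misnamed keys before a long run. The sketch below is illustrative and not part of DataFlow; it assumes the `BENCH_CONFIG` list shown above:

```python
import json
import os

# Re-use the BENCH_CONFIG structure from eval_local.py / eval_api.py (paths are placeholders).
BENCH_CONFIG = [
    {
        "name": "bench_name",
        "input_file": "path_to_your_qa/qa.json",
        "question_key": "input",
        "reference_answer_key": "output",
        "output_dir": "path/bench_name",
    },
]

for bench in BENCH_CONFIG:
    # The data file must exist and be valid JSON.
    assert os.path.isfile(bench["input_file"]), f"Missing data file: {bench['input_file']}"
    with open(bench["input_file"], "r", encoding="utf-8") as f:
        records = json.load(f)
    # Every record must carry the configured question and reference-answer fields.
    for record in records:
        assert bench["question_key"] in record, f"{bench['name']}: record missing '{bench['question_key']}'"
        assert bench["reference_answer_key"] in record, f"{bench['name']}: record missing '{bench['reference_answer_key']}'"
    os.makedirs(bench["output_dir"], exist_ok=True)  # Create the output directory if needed
    print(f"{bench['name']}: {len(records)} records validated")
```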

## Step 6: Run Evaluation

Run local evaluation:

```bash
dataflow eval local
```

Run API evaluation:

```bash
dataflow eval api
```

docs/en/notes/guide/pipelines/EvalPipeline.md

Lines changed: 0 additions & 154 deletions
This file was deleted.

0 commit comments
