---
title: Model Capability Assessment Pipeline
icon: hugeicons:chart-evaluation
createTime: 2025/10/20 10:41:21
permalink: /en/guide/evaluation-pipeline/
---

# Model Capability Assessment Pipeline

⚠️ Only QA pair format evaluation is supported.

## Quick Start

```bash
cd DataFlow
pip install -e .[eval]
cd ..
mkdir workspace
cd workspace
dataflow eval init
dataflow eval api / dataflow eval local
```

## Step 1: Install Evaluation Environment

Install the evaluation environment:

```bash
cd DataFlow
pip install -e .[eval]
cd ..
```

## Step 2: Create and Enter DataFlow Working Directory

```bash
mkdir workspace
cd workspace
```

## Step 3: Prepare Evaluation Data and Initialize Configuration Files

Initialize the configuration files:

```bash
dataflow eval init
```

💡 After initialization, the project directory structure becomes:
```bash
Project Root/
├── eval_api.py # Configuration file for API model evaluator
└── eval_local.py # Configuration file for local model evaluator
```

## Step 4: Prepare Evaluation Data

### Method 1: JSON Format

Please prepare a JSON file with a data structure similar to the example below:

```json
[
    {
        "input": "your question text (optionally including answer choices)",
        "output": "the reference answer"
    }
]
```

💡 In this example data:
- `input` is the question (can also be question + answer choices merged into one input)

- `output` is the standard answer
### Method 2: Custom Field Mapping

You can also skip data preprocessing (as long as you have clear question and standard answer fields) and configure field name mapping through `eval_api.py` and `eval_local.py`:

```python
EVALUATOR_RUN_CONFIG = {
    "input_test_answer_key": "model_generated_answer",  # Field name for model-generated answers
    # ... mappings for the question and reference-answer field names
}
```
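If your raw data uses other field names, you can either convert it to the default `input`/`output` shape (Method 1) or leave it unchanged and rely on the key mapping above (Method 2). The sketch below is a minimal, hypothetical example of the conversion route; the file name `qa.json` and the sample fields `q`/`a` are assumptions, not part of DataFlow.

```python
import json

# Hypothetical source data with arbitrary field names.
raw = [
    {"q": "What is 2 + 2?", "a": "4"},
    {"q": "Name the largest planet in the solar system.", "a": "Jupiter"},
]

# Method 1 route: rename the fields to the default "input"/"output" keys.
records = [{"input": item["q"], "output": item["a"]} for item in raw]

with open("qa.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Method 2 route: keep "q"/"a" as they are and point the corresponding *_key
# settings in eval_api.py / eval_local.py at those field names instead.
```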

## Step 5: Configure Parameters

### Model Parameter Configuration
If you want to use a local model as the evaluator, please modify the parameters in the `eval_local.py` file.

If you want to use an API model as the evaluator, please modify the parameters in the `eval_api.py` file.

```python
# Target Models Configuration (same as API mode)

TARGET_MODELS = [
    # 1. / 2. Predefined example configurations (omitted here)

    # 3. Custom configuration
    # Add more models...
    # {
    #     "name": "llama_8b",
    #     "path": "meta-llama/Llama-3-8B-Instruct",
    #     "tensor_parallel_size": 2,
    #     "max_tokens": 2048,
    #     "gpu_memory_utilization": 0.9,

    #     # You can customize prompts for each model. If not specified, it defaults to the template in the build_prompt function.
    #     # Default prompt for evaluated models
    #     # IMPORTANT: This is the prompt for models being evaluated, NOT for the judge model!!!
    #     "answer_prompt": """please answer the questions:
    #     question:{question}
    #     answer:"""
    # },

    {
        "name": "qwen_7b",                    # Model name
        "path": "./Qwen2.5-7B-Instruct",      # Model path
        # Large language models can use different parameters
        "vllm_tensor_parallel_size": 4,       # Number of GPUs
        "vllm_temperature": 0.1,              # Randomness
        "vllm_top_p": 0.9,                    # Top-p sampling
        "vllm_max_tokens": 2048,              # Maximum number of tokens
        "vllm_repetition_penalty": 1.0,       # Repetition penalty
        "vllm_seed": None,                    # Random seed
        "vllm_gpu_memory_utilization": 0.9,   # Maximum GPU memory utilization
        # A custom prompt can be defined for each model
        "answer_prompt": """please answer the following question:"""
    }
]
```
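The `answer_prompt` field customizes the prompt sent to each model under evaluation (not to the judge). When the template contains a `{question}` placeholder, as in the commented-out example above, each benchmark question is presumably substituted into it. The snippet below is only a rough illustration of how such a template expands, using plain Python string formatting rather than DataFlow's internal code; the sample question is an assumption.

```python
# Illustration only: how an answer_prompt template with a {question}
# placeholder might expand for one benchmark question.
answer_prompt = """please answer the questions:
question:{question}
answer:"""

question = "What is the capital of France?"  # hypothetical benchmark question
prompt_sent_to_model = answer_prompt.format(question=question)
print(prompt_sent_to_model)
# please answer the questions:
# question:What is the capital of France?
# answer:
```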

### Bench Parameter Configuration
Batch configuration of multiple benchmarks is supported:
```python
BENCH_CONFIG = [
    {
        "name": "bench_name",                       # Benchmark name
        "input_file": "path_to_your_qa/qa.json",    # Data file
        "question_key": "input",                    # Question field name
        "reference_answer_key": "output",           # Reference answer field name
        "output_dir": "path/bench_name",            # Output directory
    },
    {
        "name": "other_bench_name",
        "input_file": "path_to_your_qa/other_qa.json",
        "question_key": "input",
        "reference_answer_key": "output",
        "output_dir": "path/other_bench_name",
    }
]
```
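Because each entry declares its own data file and field names, a typo in `question_key` or `reference_answer_key` would otherwise only surface at run time. The sketch below is an optional pre-flight check, not part of DataFlow; it assumes the JSON layout described in Step 4.

```python
import json
import os

# Re-declared here only for the sketch; in practice this lives in eval_api.py / eval_local.py.
BENCH_CONFIG = [
    {
        "name": "bench_name",
        "input_file": "path_to_your_qa/qa.json",
        "question_key": "input",
        "reference_answer_key": "output",
        "output_dir": "path/bench_name",
    },
]

for bench in BENCH_CONFIG:
    if not os.path.exists(bench["input_file"]):
        raise FileNotFoundError(f"{bench['name']}: missing data file {bench['input_file']}")
    with open(bench["input_file"], encoding="utf-8") as f:
        data = json.load(f)
    for i, record in enumerate(data):
        for key in (bench["question_key"], bench["reference_answer_key"]):
            if key not in record:
                raise KeyError(f"{bench['name']}: record {i} has no field '{key}'")
    print(f"{bench['name']}: {len(data)} records look consistent")
```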

## Step 6: Run Evaluation

Run local evaluation:

```bash
dataflow eval local
```

Run API evaluation:

```bash
dataflow eval api
```