diff --git a/docs/en/notes/guide/model_evaluation/command_eval.md b/docs/en/notes/guide/model_evaluation/command_eval.md
index 9bbdb559e..59c67ffd6 100644
--- a/docs/en/notes/guide/model_evaluation/command_eval.md
+++ b/docs/en/notes/guide/model_evaluation/command_eval.md
@@ -1,17 +1,15 @@
 ---
-title: Model QA Capability Assessment Pipeline
+title: Model Capability Assessment Pipeline
+createTime: 2025/08/30 14:27:02
 icon: hugeicons:chart-evaluation
-createTime: 2025/10/20 10:41:21
-permalink: /en/guide/qmvjcv9o/
+permalink: /en/guide/evaluation-pipeline/
 ---
+# Model Capability Assessment Pipeline

-# Model QA Capability Assessment Pipeline
-
-Only supports QA pair format evaluation
+⚠️Only supports QA pair format evaluation

 ## Quick Start
-
 ```bash
 cd DataFlow
 pip install -e .[eval]
@@ -32,20 +30,15 @@ dataflow eval init
 dataflow eval api / dataflow eval local
 ```

-
-
 ## Step 1: Install Evaluation Environment

 Download evaluation environment
-
 ```bash
 cd DataFlow
 pip install -e .[eval]
 cd ..
 ```

-
-
 ## Step 2: Create and Enter DataFlow Working Directory

 ```bash
@@ -53,34 +46,25 @@ mkdir workspace
 cd workspace
 ```

-
-
 ## Step 3: Prepare Evaluation Data and Initialize Configuration Files

 Initialize configuration files
-
 ```bash
 dataflow eval init
 ```

-After initialization, the project directory structure becomes:
-
+💡After initialization, the project directory structure becomes:
 ```bash
 Project Root/
 ├── eval_api.py    # Configuration file for API model evaluator
 └── eval_local.py  # Configuration file for local model evaluator
 ```

-
-
 ## Step 4: Prepare Evaluation Data

-
-
 ### Method 1: JSON Format

 Please prepare a JSON format file with data structure similar to the example below:
-
 ```json
 [
   {
     "input": "What properties indicate that material PI-1 has excellent processing characteristics during manufacturing processes?",
@@ -90,8 +74,7 @@ Please prepare a JSON format file with data structure similar to the example bel
   }
 ]
 ```

-In this example data:
-
+💡In this example data:
 - `input` is the question (can also be question + answer choices merged into one input)
 - `output` is the standard answer
@@ -101,7 +84,6 @@ In this example data:
 ### Method 2: Custom Field Mapping

 You can also skip data preprocessing (as long as you have clear question and standard answer fields) and configure field name mapping through `eval_api.py` and `eval_local.py`:
-
 ```python
 EVALUATOR_RUN_CONFIG = {
     "input_test_answer_key": "model_generated_answer",  # Field name for model-generated answers
@@ -110,14 +92,11 @@ EVALUATOR_RUN_CONFIG = {
 }
 ```
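+For instance, if your file already uses its own keys, you can map them directly instead of renaming anything. A minimal sketch (the field names `question_text` and `reference_answer` are illustrative, not required by DataFlow):
+```python
+# Hypothetical source data:
+#   [{"question_text": "What is 2 + 2?", "reference_answer": "4"}]
+# Corresponding mapping in eval_api.py / eval_local.py:
+EVALUATOR_RUN_CONFIG = {
+    "input_test_answer_key": "model_generated_answer",  # field where generated answers are stored
+    "input_gt_answer_key": "reference_answer",           # standard-answer field in the original data
+    "input_question_key": "question_text"                # question field in the original data
+}
+```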
- # "answer_prompt": """please answer the questions: - # question:{question} - # answer:""" - # } + +{ + "name": "qwen_7b", # Model name + "path": "./Qwen2.5-7B-Instruct", # Model path + # Large language models can use different parameters + "vllm_tensor_parallel_size": 4, # Number of GPUs + "vllm_temperature": 0.1, # Randomness + "vllm_top_p": 0.9, # Top-p sampling + "vllm_max_tokens": 2048, # Maximum number of tokens + "vllm_repetition_penalty": 1.0, # Repetition penalty + "vllm_seed": None, # Random seed + "vllm_gpu_memory_utilization": 0.9, # Maximum GPU memory utilization + # Custom prompt can be defined for each model + "answer_prompt": """please answer the following question:""" +} + + ] ``` - +### Bench Parameter Configuration +Supports batch configuration of benchmarks +```python +BENCH_CONFIG = [ + { + "name": "bench_name", # Benchmark name + "input_file": "path_to_your_qa/qa.json", # Data file + "question_key": "input", # Question field name + "reference_answer_key": "output", # Reference answer field name + "output_dir": "path/bench_name", # Output directory + }, + { + "name": "other_bench_name", + "input_file": "path_to_your_qa/other_qa.json", + "question_key": "input", + "reference_answer_key": "output", + "output_dir": "path/other_bench_name", + } +] +``` ## Step 6: Run Evaluation Run local evaluation: - ```bash dataflow eval local ``` Run API evaluation: - ```bash -dataflow eval api \ No newline at end of file +dataflow eval api +``` \ No newline at end of file diff --git a/docs/en/notes/guide/pipelines/EvalPipeline.md b/docs/en/notes/guide/pipelines/EvalPipeline.md deleted file mode 100644 index d23127d6f..000000000 --- a/docs/en/notes/guide/pipelines/EvalPipeline.md +++ /dev/null @@ -1,154 +0,0 @@ ---- -title: Model Capability Assessment Pipeline -createTime: 2025/08/30 14:27:02 -icon: hugeicons:chart-evaluation -permalink: /en/guide/evaluation-pipeline/ ---- - -# Model Capability Assessment Pipeline - -⚠️Only supports QA pair format evaluation - -## Quick Start -```bash -cd DataFlow -pip install -e .[eval] - -cd .. -mkdir workspace -cd workspace - -# Place the files you want to evaluate in the working directory - -# Initialize evaluation configuration files -dataflow eval init - -# IMPORTANT: You must modify the configuration files eval_api.py or eval_local.py -# By default, it finds the latest fine-tuned model and compares it with its base model -# Default evaluation method is semantic evaluation -# Evaluation metric is accuracy -dataflow eval api / dataflow eval local -``` - - - -## Step 1: Install Evaluation Environment - -Download evaluation environment -```bash -cd DataFlow -pip install -e .[eval] -cd .. 
-``` - - - -## Step 2: Create and Enter DataFlow Working Directory - -```bash -mkdir workspace -cd workspace -``` - - - -## Step 3: Prepare Evaluation Data and Initialize Configuration Files - -Initialize configuration files -```bash -dataflow eval init -``` - -💡After initialization, the project directory structure becomes: -```bash -Project Root/ -├── eval_api.py # Configuration file for API model evaluator -└── eval_local.py # Configuration file for local model evaluator -``` - - - -## Step 4: Prepare Evaluation Data - -### Method 1: JSON Format - -Please prepare a JSON format file with data structure similar to the example below: -```json -[ - { - "input": "What properties indicate that material PI-1 has excellent processing characteristics during manufacturing processes?", - "output": "Material PI-1 has high tensile strength between 85-105 MPa.\nPI-1 exhibits low melt viscosity below 300 Pa·s indicating good flowability.\n\nThe combination of its high tensile strength and low melt viscosity indicates that it can be easily processed without breaking during manufacturing." - } -] -``` - -💡In this example data: -- `input` is the question (can also be question + answer choices merged into one input) - -- `output` is the standard answer - - - -### Method 2: Custom Field Mapping - -You can also skip data preprocessing (as long as you have clear question and standard answer fields) and configure field name mapping through `eval_api.py` and `eval_local.py`: -```python -EVALUATOR_RUN_CONFIG = { - "input_test_answer_key": "model_generated_answer", # Field name for model-generated answers - "input_gt_answer_key": "output", # Field name for standard answers (from original data) - "input_question_key": "input" # Field name for questions (from original data) -} -``` - - - -## Step 5: Configure Parameters - -If you want to use a local model as the evaluator, please modify the parameters in the `eval_local.py` file. - -If you want to use an API model as the evaluator, please modify the parameters in the `eval_api.py` file. -```python -# Target Models Configuration (same as API mode) - -TARGET_MODELS = [ - # Demonstrating all usage methods - # The following methods can be used in combination - - # 1. Local path - # "./Qwen2.5-3B-Instruct", - - # 2. HuggingFace path - # "Qwen/Qwen2.5-7B-Instruct" - - # 3. Custom configuration - # Add more models... - # { - # "name": "llama_8b", - # "path": "meta-llama/Llama-3-8B-Instruct", - # "tensor_parallel_size": 2, - # "max_tokens": 2048, - # "gpu_memory_utilization": 0.9, - - # # You can customize prompts for each model. If not specified, defaults to the template in build_prompt function. - # # Default prompt for evaluated models - # # IMPORTANT: This is the prompt for models being evaluated, NOT for the judge model!!! 
- # "answer_prompt": """please answer the questions: - # question:{question} - # answer:""" - # } -] -``` - - - -## Step 6: Run Evaluation - -Run local evaluation: -```bash -dataflow eval local -``` - -Run API evaluation: -```bash -dataflow eval api -``` \ No newline at end of file diff --git a/docs/zh/notes/guide/model_evaluation/command_eval.md b/docs/zh/notes/guide/model_evaluation/command_eval.md index 548478a48..32485167b 100644 --- a/docs/zh/notes/guide/model_evaluation/command_eval.md +++ b/docs/zh/notes/guide/model_evaluation/command_eval.md @@ -1,13 +1,12 @@ --- -title: 模型QA能力评估流水线 +title: EvalPipeline +createTime: 2025/10/20 11:30:42 icon: hugeicons:chart-evaluation -createTime: 2025/10/20 10:41:22 -permalink: /zh/guide/2k5wjgls/ +permalink: /zh/guide/cqro9oa8/ --- +# 模型能力评估流水线 -# 模型QA能力评估流水线 - -仅支持QA对形式的评估 +⚠️仅支持QA对形式的评估 ## 快速开始 @@ -41,8 +40,6 @@ pip install -e .[eval] cd .. ``` - - ## 第二步:创建并进入dataflow工作文件夹 ```bash @@ -50,8 +47,6 @@ mkdir workspace cd workspace ``` - - ## 第三步:准备评估数据初始化配置文件 初始化配置文件 @@ -60,7 +55,7 @@ cd workspace dataflow eval init ``` -初始化完成后,项目目录变成: +💡初始化完成后,项目目录变成: ```bash 项目根目录/ @@ -68,15 +63,13 @@ dataflow eval init └── eval_local.py # 评估器为本地模型的配置文件 ``` - - ## 第四步:准备评估数据 ### 方式一: 请准备好json格式文件,数据格式与展示类似 -```python +```json [ { "input": "What properties indicate that material PI-1 has excellent processing characteristics during manufacturing processes?", @@ -85,19 +78,17 @@ dataflow eval init ] ``` -这里示例数据中 +💡这里示例数据中 `input`是问题(也可以是问题+选择的选项合并为一个input) `output`是标准答案 - - ### 方式二: 也可以不处理数据(需要有明确的问题和标准答案这两个字段),通过eval_api.py以及eval_local.py来进行配置映射字段名字 -```python +```bash EVALUATOR_RUN_CONFIG = { "input_test_answer_key": "model_generated_answer", # 模型生成的答案字段名 "input_gt_answer_key": "output", # 标准答案字段名(原始数据的字段) @@ -105,16 +96,14 @@ EVALUATOR_RUN_CONFIG = { } ``` - - ## 第五步:配置参数 +### 模型参数配置 假设想用本地模型作为评估器,请修改`eval_local.py`文件中的参数 假设想用api模型作为评估器,请修改`eval_api.py`文件中的参数 ```python -Target Models Configuration (same as API mode) TARGET_MODELS = [ # 展示所有用法 @@ -125,28 +114,45 @@ TARGET_MODELS = [ # "Qwen/Qwen2.5-7B-Instruct" # 3.单独配置 # 添加更多模型... - # { - # "name": "llama_8b", - # "path": "meta-llama/Llama-3-8B-Instruct", - # "tensor_parallel_size": 2 - # "max_tokens": 2048, - # "gpu_memory_utilization": 0.9, - # 可以为每个模型自定义提示词 不写就为默认模板 即 build_prompt函数中的prompt - # 默认被评估模型提示词 - # 再次提示:该prompt为被评估模型的提示词,请勿与评估模型提示词混淆!!! - # You can customize prompts for each model. If not specified, defaults to the template in build_prompt function. - # Default prompt for evaluated models - # IMPORTANT: This is the prompt for models being evaluated, NOT for the judge model!!! 
- # "answer_prompt": """please answer the questions: - # question:{question} - # answer:""" - # "" - # } - # +{ + "name": "qwen_7b", # 模型名称 + "path": "./Qwen2.5-7B-Instruct", # 模型路径 + # 大模型可以用不同的参数 + "vllm_tensor_parallel_size": 4, # 显卡数量 + "vllm_temperature": 0.1, # 随机性,值越大输出越随机 + "vllm_top_p": 0.9, # 核采样概率阈值,控制候选词的累积概率范围 + "vllm_max_tokens": 2048, # 最大生成token数 + "vllm_repetition_penalty": 1.0, # 重复惩罚系数,大于1时抑制重复内容 + "vllm_seed": None, # 随机种子,设置后可复现结果 + "vllm_gpu_memory_utilization": 0.9, # 最大显存利用率 + # 可以为每个模型自定义提示词 + "answer_prompt": """please answer the following question:""" # 回答提示词模板 +} ] ``` +### Bench参数配置 +支持批量Bench评估 +```python +BENCH_CONFIG = [ + { + "name": "bench_name", # bench名称 + "input_file": "path_to_your_qa/qa.json", # 数据文件 + "question_key": "input", # 问题字段名 + "reference_answer_key": "output", # 答案字段名 + "output_dir": "path//bench_name", # 输出目录 + }, + { + "name": "other_bench_name", + "input_file": "path_to_your_qa/other_qa.json", + "question_key": "input", + "reference_answer_key": "output", + "output_dir":"path/other_bench_name", + } +] + +``` ## 第六步:进行评估 @@ -161,4 +167,4 @@ dataflow eval local ```bash dataflow eval api -``` +``` \ No newline at end of file diff --git a/docs/zh/notes/guide/pipelines/EvalPipeline.md b/docs/zh/notes/guide/pipelines/EvalPipeline.md deleted file mode 100644 index b6974ab67..000000000 --- a/docs/zh/notes/guide/pipelines/EvalPipeline.md +++ /dev/null @@ -1,163 +0,0 @@ ---- -title: EvalPipeline -createTime: 2025/10/20 11:30:42 -icon: hugeicons:chart-evaluation -permalink: /zh/guide/cqro9oa8/ ---- -# 模型能力评估流水线 - -⚠️仅支持QA对形式的评估 - -## 快速开始 - -```bash -cd DataFlow -pip install -e .[eval] - -cd .. -mkdir workspace -cd workspace - -#将想要评估的文件放到工作目录下 - -#初始化评估的配置文件 -dataflow eval init - -#注意 一定要修改配置文件eval_api.py 或者 eval_local.py -#默认找到最新的微调模型与其基础模型对比 -#默认评估方法是语义评估 -#评估指标是准确度 -dataflow eval api / dataflow eval local -``` - -## 第一步:安装评估环境 - -下载评估环境 - -```bash -cd DataFlow -pip install -e .[eval] -cd .. -``` - - - -## 第二步:创建并进入dataflow工作文件夹 - -```bash -mkdir workspace -cd workspace -``` - - - -## 第三步:准备评估数据初始化配置文件 - -初始化配置文件 - -```bash -dataflow eval init -``` - -💡初始化完成后,项目目录变成: - -```bash -项目根目录/ -├── eval_api.py # 评估器为api模型的配置文件 -└── eval_local.py # 评估器为本地模型的配置文件 -``` - - - -## 第四步:准备评估数据 - -### 方式一: - -请准备好json格式文件,数据格式与展示类似 - -```json -[ - { - "input": "What properties indicate that material PI-1 has excellent processing characteristics during manufacturing processes?", - "output": "Material PI-1 has high tensile strength between 85-105 MPa.\nPI-1 exhibits low melt viscosity below 300 Pa·s indicating good flowability.\n\nThe combination of its high tensile strength and low melt viscosity indicates that it can be easily processed without breaking during manufacturing." - }, -] -``` - -💡这里示例数据中 - -`input`是问题(也可以是问题+选择的选项合并为一个input) - -`output`是标准答案 - - - -### 方式二: - -也可以不处理数据(需要有明确的问题和标准答案这两个字段),通过eval_api.py以及eval_local.py来进行配置映射字段名字 - -```bash -EVALUATOR_RUN_CONFIG = { - "input_test_answer_key": "model_generated_answer", # 模型生成的答案字段名 - "input_gt_answer_key": "output", # 标准答案字段名(原始数据的字段) - "input_question_key": "input" # 问题字段名(原始数据的字段) -} -``` - - - -## 第五步:配置参数 - -假设想用本地模型作为评估器,请修改`eval_local.py`文件中的参数 - -假设想用api模型作为评估器,请修改`eval_api.py`文件中的参数 - -```bash -Target Models Configuration (same as API mode) - -TARGET_MODELS = [ - # 展示所有用法 - # 以下用法可混合使用 - # 1.本地路径 - # "./Qwen2.5-3B-Instruct", - # 2.huggingface路径 - # "Qwen/Qwen2.5-7B-Instruct" - # 3.单独配置 - # 添加更多模型... 
+### Bench参数配置
+支持批量Bench评估
+```python
+BENCH_CONFIG = [
+    {
+        "name": "bench_name", # bench名称
+        "input_file": "path_to_your_qa/qa.json", # 数据文件
+        "question_key": "input", # 问题字段名
+        "reference_answer_key": "output", # 答案字段名
+        "output_dir": "path/bench_name", # 输出目录
+    },
+    {
+        "name": "other_bench_name",
+        "input_file": "path_to_your_qa/other_qa.json",
+        "question_key": "input",
+        "reference_answer_key": "output",
+        "output_dir": "path/other_bench_name",
+    }
+]
+
+```

 ## 第六步:进行评估

 运行本地评估

@@ -161,4 +167,4 @@ dataflow eval local

 ```bash
 dataflow eval api
-```
+```
\ No newline at end of file
diff --git a/docs/zh/notes/guide/pipelines/EvalPipeline.md b/docs/zh/notes/guide/pipelines/EvalPipeline.md
deleted file mode 100644
index b6974ab67..000000000
--- a/docs/zh/notes/guide/pipelines/EvalPipeline.md
+++ /dev/null
@@ -1,163 +0,0 @@
----
-title: EvalPipeline
-createTime: 2025/10/20 11:30:42
-icon: hugeicons:chart-evaluation
-permalink: /zh/guide/cqro9oa8/
----
-# 模型能力评估流水线
-
-⚠️仅支持QA对形式的评估
-
-## 快速开始
-
-```bash
-cd DataFlow
-pip install -e .[eval]
-
-cd ..
-mkdir workspace
-cd workspace
-
-#将想要评估的文件放到工作目录下
-
-#初始化评估的配置文件
-dataflow eval init
-
-#注意 一定要修改配置文件eval_api.py 或者 eval_local.py
-#默认找到最新的微调模型与其基础模型对比
-#默认评估方法是语义评估
-#评估指标是准确度
-dataflow eval api / dataflow eval local
-```
-
-## 第一步:安装评估环境
-
-下载评估环境
-
-```bash
-cd DataFlow
-pip install -e .[eval]
-cd ..
-```
-
-
-
-## 第二步:创建并进入dataflow工作文件夹
-
-```bash
-mkdir workspace
-cd workspace
-```
-
-
-
-## 第三步:准备评估数据初始化配置文件
-
-初始化配置文件
-
-```bash
-dataflow eval init
-```
-
-💡初始化完成后,项目目录变成:
-
-```bash
-项目根目录/
-├── eval_api.py # 评估器为api模型的配置文件
-└── eval_local.py # 评估器为本地模型的配置文件
-```
-
-
-
-## 第四步:准备评估数据
-
-### 方式一:
-
-请准备好json格式文件,数据格式与展示类似
-
-```json
-[
-    {
-        "input": "What properties indicate that material PI-1 has excellent processing characteristics during manufacturing processes?",
-        "output": "Material PI-1 has high tensile strength between 85-105 MPa.\nPI-1 exhibits low melt viscosity below 300 Pa·s indicating good flowability.\n\nThe combination of its high tensile strength and low melt viscosity indicates that it can be easily processed without breaking during manufacturing."
-    },
-]
-```
-
-💡这里示例数据中
-
-`input`是问题(也可以是问题+选择的选项合并为一个input)
-
-`output`是标准答案
-
-
-
-### 方式二:
-
-也可以不处理数据(需要有明确的问题和标准答案这两个字段),通过eval_api.py以及eval_local.py来进行配置映射字段名字
-
-```bash
-EVALUATOR_RUN_CONFIG = {
-    "input_test_answer_key": "model_generated_answer", # 模型生成的答案字段名
-    "input_gt_answer_key": "output", # 标准答案字段名(原始数据的字段)
-    "input_question_key": "input" # 问题字段名(原始数据的字段)
-}
-```
-
-
-
-## 第五步:配置参数
-
-假设想用本地模型作为评估器,请修改`eval_local.py`文件中的参数
-
-假设想用api模型作为评估器,请修改`eval_api.py`文件中的参数
-
-```bash
-Target Models Configuration (same as API mode)
-
-TARGET_MODELS = [
-    # 展示所有用法
-    # 以下用法可混合使用
-    # 1.本地路径
-    # "./Qwen2.5-3B-Instruct",
-    # 2.huggingface路径
-    # "Qwen/Qwen2.5-7B-Instruct"
-    # 3.单独配置
-    # 添加更多模型...
-    # {
-    #     "name": "llama_8b",
-    #     "path": "meta-llama/Llama-3-8B-Instruct",
-    #     "tensor_parallel_size": 2
-    #     "max_tokens": 2048,
-    #     "gpu_memory_utilization": 0.9,
-    # 可以为每个模型自定义提示词 不写就为默认模板 即 build_prompt函数中的prompt
-    # 默认被评估模型提示词
-    # 再次提示:该prompt为被评估模型的提示词,请勿与评估模型提示词混淆!!!
-    # You can customize prompts for each model. If not specified, defaults to the template in build_prompt function.
-    # Default prompt for evaluated models
-    # IMPORTANT: This is the prompt for models being evaluated, NOT for the judge model!!!
-    # "answer_prompt": """please answer the questions:
-    # question:{question}
-    # answer:"""
-    # ""
-    # }
-    #
-
-]
-```
-
-
-
-## 第六步:进行评估
-
-运行本地评估
-
-```bash
-dataflow eval local
-```
-
-运行api评估
-
-```bash
-dataflow eval api
-```
\ No newline at end of file