This document explains the complete workflow for processing evaluation results from evaluate.py and calculating final accuracy metrics.
After running evaluate.py to get automatic evaluation results, the following steps are required to generate final accuracy metrics:
- Manual Annotation: REQUIRED - Human experts annotate generation tasks (22 instances in Chinese, 12 in English)
- Result Combination: Combine results from multiple evaluators (deepseek_r1_ls_z2 and gpt_4o)
Note: Manual annotation is mandatory and must be completed before running process_results.py.
Certain evaluation tasks lie beyond the scope of current automatic LLM-based metrics and therefore require manual annotation. Annotations should be performed with reference to the speech generated by the SDM. The dataset comprises 22 instances in Chinese and 12 in English that require manual annotation.
Chinese (result_path/):
- chinese/ambiguity/phonological/generation/pause.json
- chinese/ambiguity/phonological/generation/heteronym.json
- chinese/ambiguity/phonological/generation/tone.json

English (result_path/):
- english/ambiguity/phonological/generation.json
Chinese Format:

```json
{
  "content": "...",
  "annotation": "...",
  "answer": "...",
  "check_answer": "最终答案为:是。" // or "最终答案为:否。" ("The final answer is: yes." / "The final answer is: no.")
}
```

English Format:

```json
{
  "content": "...",
  "notation": "...",
  "answer": "...",
  "check_answer": "The answer is: yes." // or "The answer is: no."
}
```

- Open each JSON file requiring annotation
- For each item, read the `content`, `annotation` (reference answer), and `answer` (model response)
- For each item, listen to the corresponding speech generated by the SDM
- Determine if the model's answer is correct
- Set `check_answer` to the appropriate format based on your judgment
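After the annotation pass, it is worth verifying that no item was skipped and that every `check_answer` uses one of the exact accepted strings. The helper below is a hypothetical sketch (not part of the repository) and assumes each file contains a JSON list of items as shown in the formats above:

```python
import json

# Accepted check_answer values per language (exact strings from the formats above)
VALID_ANSWERS = {
    "chinese": {"最终答案为:是。", "最终答案为:否。"},
    "english": {"The answer is: yes.", "The answer is: no."},
}

def unannotated_items(path, language):
    """Return indices of items whose check_answer is missing or not a valid string."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    return [i for i, item in enumerate(items)
            if item.get("check_answer") not in VALID_ANSWERS[language]]
```

Running this over each of the four annotation files before invoking process_results.py catches missed or mistyped annotations early.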
Use the unified process_results.py script to combine results from multiple evaluators.
```bash
# Combine results and calculate metrics (after manual annotation is completed)
python process_results.py \
    --deepseek_path "/Mooer-Omni/deepseek-r1-ls-z2/english" \
    --gpt_path "/Mooer-Omni/gpt-4o/english" \
    --language english \
    --sdm_name "Mooer-Omni" \
    --output_path "/path/to/output"
```

- `--deepseek_path`: Path to DeepSeek evaluation results (e.g., `/Mooer-Omni/deepseek-r1-ls-z2/english`)
- `--gpt_path`: Path to GPT evaluation results (e.g., `/Mooer-Omni/gpt-4o/english`)
- `--language`: Language of evaluation results (`english` or `chinese`)
- `--sdm_name`: Name of the SDM model being evaluated
- `--output_path`: Base path for output files
The script generates:
- `{output_path}/{SDM_NAME}/combined_{language}_statistic.json`: Combined results from all evaluators
- `{output_path}/{SDM_NAME}/{language}_result_data.json`: Final accuracy metrics
The final metrics file contains:
```json
[
  {
    "category": "overall",
    "dpsk_correct": 150,
    "gpt_correct": 145,
    "total": 200,
    "dpsk_accuracy": 0.75,
    "gpt_accuracy": 0.725
  },
  {
    "category": "/ambiguity/phonological/pause",
    "dpsk_correct": 25,
    "gpt_correct": 23,
    "total": 30,
    "dpsk_accuracy": 0.8333,
    "gpt_accuracy": 0.7667
  }
]
```
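Each accuracy field is simply correct divided by total (e.g. 150 / 200 = 0.75 for the overall DeepSeek row, rounded to four decimals in the file). A small hypothetical consistency check over this file, assuming the list-of-rows structure shown above:

```python
import json

def check_metrics(metrics_path, tol=1e-3):
    """Verify each accuracy field matches correct/total within a rounding tolerance."""
    with open(metrics_path, encoding="utf-8") as f:
        rows = json.load(f)
    for row in rows:
        for prefix in ("dpsk", "gpt"):
            expected = row[f"{prefix}_correct"] / row["total"]
            assert abs(row[f"{prefix}_accuracy"] - expected) < tol, row["category"]
    return rows
```

This is a convenient sanity check that the combination step counted consistently before reporting the numbers.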