This document explains the complete workflow for processing evaluation results from evaluate.py and calculating final accuracy metrics.
After running evaluate.py to get automatic evaluation results, the following steps are required to generate final accuracy metrics:
- Manual Annotation: REQUIRED - Human experts annotate generation tasks (22 instances in Chinese, 12 in English)
- Result Combination: Combine results from multiple evaluators (deepseek_r1_ls_z2 and gpt_4o)
Note: Manual annotation is mandatory and must be completed before running process_results.py.
Certain evaluation tasks lie beyond the scope of current automatic LLM-based metrics and therefore require manual annotation. Annotations should be performed with reference to the speech generated by the SDM. The dataset comprises 22 instances in Chinese and 12 in English that require manual annotation.
Chinese (result_path/):
- chinese/ambiguity/phonological/generation/pause.json
- chinese/ambiguity/phonological/generation/heteronym.json
- chinese/ambiguity/phonological/generation/tone.json

English (result_path/):
- english/ambiguity/phonological/generation.json
Chinese Format:

```json
{
  "content": "...",
  "annotation": "...",
  "answer": "...",
  "check_answer": "最终答案为:是。" // or "最终答案为:否。" ("The final answer is: yes." / "The final answer is: no.")
}
```

English Format:

```json
{
  "content": "...",
  "notation": "...",
  "answer": "...",
  "check_answer": "The answer is: yes." // or "The answer is: no."
}
```

- Open each JSON file requiring annotation
- For each item, read the `content`, `annotation` (reference answer), and `answer` (model response)
- For each item, listen to the corresponding speech generated by the SDM
- Determine if the model's answer is correct
- Set `check_answer` to the appropriate format based on your judgment
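After the annotation pass, it is worth verifying that no item was skipped and that every `check_answer` uses one of the exact accepted strings. The helper below is a hypothetical sketch (not part of the repository) and assumes each file contains a JSON list of items as shown in the formats above:

```python
import json

# Accepted check_answer values per language (exact strings from the formats above)
VALID_ANSWERS = {
    "chinese": {"最终答案为:是。", "最终答案为:否。"},
    "english": {"The answer is: yes.", "The answer is: no."},
}

def unannotated_items(path, language):
    """Return indices of items whose check_answer is missing or not a valid string."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    return [i for i, item in enumerate(items)
            if item.get("check_answer") not in VALID_ANSWERS[language]]
```

Running this over each of the four annotation files before invoking process_results.py catches missed or mistyped annotations early.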
Use the unified process_results.py script to combine results from multiple evaluators.
```bash
# Combine results and calculate metrics (after manual annotation is completed)
python process_results.py \
    --deepseek_path "/Mooer-Omni/deepseek-r1-ls-z2/english" \
    --gpt_path "/Mooer-Omni/gpt-4o/english" \
    --language english \
    --sdm_name "Mooer-Omni" \
    --output_path "/path/to/output"
```

- `--deepseek_path`: Path to DeepSeek evaluation results (e.g., `/Mooer-Omni/deepseek-r1-ls-z2/english`)
- `--gpt_path`: Path to GPT evaluation results (e.g., `/Mooer-Omni/gpt-4o/english`)
- `--language`: Language of evaluation results (`english` or `chinese`)
- `--sdm_name`: Name of the SDM model being evaluated
- `--output_path`: Base path for output files
The script generates:
- `{output_path}/{SDM_NAME}/combined_{language}_statistic.json`: Combined results from all evaluators
- `{output_path}/{SDM_NAME}/{language}_result_data.json`: Final accuracy metrics
The final metrics file contains:
```json
[
  {
    "category": "overall",
    "dpsk_correct": 150,
    "gpt_correct": 145,
    "total": 200,
    "dpsk_accuracy": 0.75,
    "gpt_accuracy": 0.725
  },
  {
    "category": "/ambiguity/phonological/pause",
    "dpsk_correct": 25,
    "gpt_correct": 23,
    "total": 30,
    "dpsk_accuracy": 0.8333,
    "gpt_accuracy": 0.7667
  }
]
```
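Each accuracy field is simply correct divided by total (e.g. 150 / 200 = 0.75 for the overall DeepSeek row, rounded to four decimals in the file). A small hypothetical consistency check over this file, assuming the list-of-rows structure shown above:

```python
import json

def check_metrics(metrics_path, tol=1e-3):
    """Verify each accuracy field matches correct/total within a rounding tolerance."""
    with open(metrics_path, encoding="utf-8") as f:
        rows = json.load(f)
    for row in rows:
        for prefix in ("dpsk", "gpt"):
            expected = row[f"{prefix}_correct"] / row["total"]
            assert abs(row[f"{prefix}_accuracy"] - expected) < tol, row["category"]
    return rows
```

This is a convenient sanity check that the combination step counted consistently before reporting the numbers.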