
Accuracy Calculation Guide

This document explains the complete workflow for processing evaluation results from evaluate.py and calculating final accuracy metrics.

Overview

After running evaluate.py to get automatic evaluation results, the following steps are required to generate final accuracy metrics:

  1. Manual Annotation: REQUIRED - Human experts annotate generation tasks (22 instances in Chinese, 12 in English)
  2. Result Combination: Combine results from multiple evaluators (deepseek_r1_ls_z2 and gpt_4o)

Note: Manual annotation is mandatory and must be completed before running process_results.py.

Step-by-Step Process

Step 1: Manual Annotation

Certain evaluation tasks lie beyond the scope of current automatic LLM-based metrics and therefore require manual annotation, performed with reference to the speech generated by the SDM. In total, 22 Chinese instances and 12 English instances require manual annotation.

Files Requiring Manual Annotation

Chinese (result_path/):

  • chinese/ambiguity/phonological/generation/pause.json
  • chinese/ambiguity/phonological/generation/heteronym.json
  • chinese/ambiguity/phonological/generation/tone.json

English (result_path/):

  • english/ambiguity/phonological/generation.json

Annotation Format

Chinese Format:

{
  "content": "...",
  "annotation": "...",
  "answer": "...",
  "check_answer": "最终答案为:是。"  // i.e. "The final answer is: yes."; use "最终答案为:否。" for no
}

English Format:

{
  "content": "...",
  "notation": "...",
  "answer": "...",
  "check_answer": "The answer is: yes."  // or "The answer is: no."
}

Manual Annotation Process

  1. Open each JSON file requiring annotation
  2. For each item, read the content, annotation (reference answer), and answer (model response)
  3. For each item, listen to the corresponding speech generated by the SDM
  4. Determine if the model's answer is correct
  5. Set check_answer to the appropriate format based on your judgment (a validation sketch follows this list)
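
Before moving on to Step 2, it is worth verifying that every item has been annotated and that each check_answer uses the exact expected string. The following is a minimal sketch, assuming each of the files listed above is a JSON array of objects in the formats shown; the helper name check_annotations and the result_path placeholder are illustrative, not part of the repository.

import json
from pathlib import Path

# Accepted check_answer strings per language (see the annotation formats above).
ACCEPTED = {
    "chinese": {"最终答案为:是。", "最终答案为:否。"},
    "english": {"The answer is: yes.", "The answer is: no."},
}

# Files requiring manual annotation, relative to result_path.
FILES = {
    "chinese": [
        "chinese/ambiguity/phonological/generation/pause.json",
        "chinese/ambiguity/phonological/generation/heteronym.json",
        "chinese/ambiguity/phonological/generation/tone.json",
    ],
    "english": [
        "english/ambiguity/phonological/generation.json",
    ],
}

def check_annotations(result_path):
    # Report items whose check_answer is missing or malformed.
    for language, rel_paths in FILES.items():
        for rel_path in rel_paths:
            items = json.loads((Path(result_path) / rel_path).read_text(encoding="utf-8"))
            for i, item in enumerate(items):
                value = item.get("check_answer", "")
                if value not in ACCEPTED[language]:
                    print(f"{rel_path}[{i}]: unannotated or malformed check_answer: {value!r}")

check_annotations("/path/to/result_path")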

Step 2: Result Combination

Use the unified process_results.py script to combine results from multiple evaluators.

Command Usage

# Combine results and calculate metrics (after manual annotation is completed)
python process_results.py \
  --deepseek_path "/Mooer-Omni/deepseek-r1-ls-z2/english" \
  --gpt_path "/Mooer-Omni/gpt-4o/english" \
  --language english \
  --sdm_name "Mooer-Omni" \
  --output_path "/path/to/output"
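
To process both languages for the same SDM, the script can be invoked once per language. Below is a minimal driver sketch, assuming the directory layout from the example command above; the base paths are placeholders to adapt to your setup.

import subprocess

SDM_NAME = "Mooer-Omni"
for language in ("english", "chinese"):
    subprocess.run(
        [
            "python", "process_results.py",
            "--deepseek_path", f"/{SDM_NAME}/deepseek-r1-ls-z2/{language}",
            "--gpt_path", f"/{SDM_NAME}/gpt-4o/{language}",
            "--language", language,
            "--sdm_name", SDM_NAME,
            "--output_path", "/path/to/output",
        ],
        check=True,  # stop immediately if either run fails
    )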

Parameters

  • --deepseek_path: Path to DeepSeek evaluation results (e.g., /Mooer-Omni/deepseek-r1-ls-z2/english)
  • --gpt_path: Path to GPT evaluation results (e.g., /Mooer-Omni/gpt-4o/english)
  • --language: Language of evaluation results (english or chinese)
  • --sdm_name: Name of the SDM model being evaluated
  • --output_path: Base path for output files

Output Files

The script generates:

  • {output_path}/{SDM_NAME}/combined_{language}_statistic.json: Combined results from all evaluators
  • {output_path}/{SDM_NAME}/{language}_result_data.json: Final accuracy metrics

The final metrics file contains one entry per category (plus an overall entry); the values below are illustrative:

[
  {
    "category": "overall",
    "dpsk_correct": 150,
    "gpt_correct": 145,
    "total": 200,
    "dpsk_accuracy": 0.75,
    "gpt_accuracy": 0.725
  },
  {
    "category": "/ambiguity/phonological/pause",
    "dpsk_correct": 25,
    "gpt_correct": 23,
    "total": 30,
    "dpsk_accuracy": 0.8333,
    "gpt_accuracy": 0.7667
  }
]
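
Each accuracy field is the correct count divided by the total for that category (for example, 150/200 = 0.75 for the overall DeepSeek accuracy above). A small sketch for inspecting the final metrics, assuming the output layout described above with placeholder paths:

import json
from pathlib import Path

metrics_path = Path("/path/to/output/Mooer-Omni/english_result_data.json")
for entry in json.loads(metrics_path.read_text(encoding="utf-8")):
    print(
        f"{entry['category']}: "
        f"dpsk {entry['dpsk_correct']}/{entry['total']} = {entry['dpsk_accuracy']:.4f}, "
        f"gpt {entry['gpt_correct']}/{entry['total']} = {entry['gpt_accuracy']:.4f}"
    )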