
[Question/Feature Request] Clarification on Evaluation Metrics and Request for Execution Time Logging #6

@zhoukai83

Description


Hi Team,

I am currently using run_eval.py to evaluate agent performance and have a few questions regarding the output metrics and data logging.

1. Metric Calculation Methodology

Currently, run_eval.py generates all_results.json. I would like to confirm whether the intended calculation of the following metrics matches my current understanding:

  • Partial Completion: calculated by averaging the scores across all attempts for each task, and then taking the mean of those per-task averages across all tasks.
    • Insight: does this accurately reflect the "granularity" of the agent's performance (i.e., progress made even on failed tasks)?

  • Success Rate (Aggregate): calculated as (Total Successful Attempts) / (Total Number of Attempts).
    • Insight: this seems to measure "average quality stability" rather than just the Pass@1 rate. Is this the intended interpretation?

  • Pass@3: a task is considered a success if at least one of its first three attempts achieves a perfect score (1.0).
    • Insight: this reflects the "upper bound", i.e., the solvability ceiling of the agent.

Does the project currently support automated output of these values, or is the user expected to post-process all_results.json manually?
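
For reference, this is the post-processing I am doing at the moment. It assumes all_results.json is a list of per-attempt records with task_id and score fields (score in [0, 1]) and that attempts are stored in order; both the field names and the ordering are my own assumptions about the schema, not something I have confirmed against the code.

```python
import json
from collections import defaultdict

# Assumed schema: a list of per-attempt records like
# {"task_id": "...", "score": 0.75, ...}
with open("all_results.json") as f:
    results = json.load(f)

attempts_by_task = defaultdict(list)
for record in results:
    attempts_by_task[record["task_id"]].append(record["score"])

# Partial Completion: mean over tasks of the per-task mean score.
task_avgs = [sum(scores) / len(scores) for scores in attempts_by_task.values()]
partial_completion = sum(task_avgs) / len(task_avgs)

# Success Rate (aggregate): successful attempts over all attempts.
all_scores = [s for scores in attempts_by_task.values() for s in scores]
success_rate = sum(1 for s in all_scores if s == 1.0) / len(all_scores)

# Pass@3: a task counts as solved if any of its first three attempts scores 1.0.
pass_at_3 = sum(
    1 for scores in attempts_by_task.values() if any(s == 1.0 for s in scores[:3])
) / len(attempts_by_task)

print(f"Partial Completion: {partial_completion:.3f}")
print(f"Success Rate:       {success_rate:.3f}")
print(f"Pass@3:             {pass_at_3:.3f}")
```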

2. Logging Execution Latency

I noticed that the script does not explicitly record the execution time (latency) for each agent run.

  • Do you recommend a specific format for recording this (e.g., adding an elapsed_time field to the results JSON)? A minimal sketch of what I have in mind follows below this list.
  • Are there plans to include avg_time per task in the standard evaluation output?
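
On the first point, this is roughly what I have in mind; timed_attempt, run_agent, and the elapsed_time field name are only placeholders for whatever run_eval.py actually calls and writes per attempt:

```python
import time
from typing import Any, Callable, Dict

def timed_attempt(run_agent: Callable[..., Dict[str, Any]], *args, **kwargs) -> Dict[str, Any]:
    """Run one agent attempt and attach its wall-clock latency to the result record.

    run_agent stands in for the per-attempt callable used by run_eval.py;
    the elapsed_time field name is only a suggestion.
    """
    start = time.monotonic()
    result = run_agent(*args, **kwargs)
    result["elapsed_time"] = time.monotonic() - start
    return result
```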

3. Proposed Metric Summary Output

It would be highly beneficial if the evaluation script could print a summary table similar to the following upon completion:

| Metric | Calculation Basis | Current Value |
| --- | --- | --- |
| Partial Completion | Mean(Task_Avg_Scores) | TBD |
| Success Rate | Total_Success / Total_Attempts | TBD |
| Pass@3 | Any(Score==1.0) in 3 tries | TBD |
| Avg. Latency | Mean(Execution_Time) | TBD |
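
If it helps, a minimal sketch of how such a table could be printed at the end of the run; the print_summary helper is purely illustrative, and the values would come from the calculations above (placeholders are used here):

```python
def print_summary(metrics: dict[str, tuple[str, float]]) -> None:
    """Print the proposed summary table: metric name, calculation basis, value."""
    print(f"{'Metric':<20} {'Calculation Basis':<34} {'Current Value':>13}")
    for name, (basis, value) in metrics.items():
        print(f"{name:<20} {basis:<34} {value:>13.3f}")

# Placeholder values; in practice these would be computed from all_results.json.
print_summary({
    "Partial Completion": ("Mean(Task_Avg_Scores)", 0.0),
    "Success Rate": ("Total_Success / Total_Attempts", 0.0),
    "Pass@3": ("Any(Score==1.0) in 3 tries", 0.0),
    "Avg. Latency": ("Mean(Execution_Time)", 0.0),
})
```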

Looking forward to your guidance on whether these calculations match the internal logic of the project.

Best regards,
