
[Question/Feature Request] Clarification on Evaluation Metrics and Request for Execution Time Logging #6

@zhoukai83

Description


Hi Team,

I am currently using run_eval.py to evaluate agent performance and have a few questions regarding the output metrics and data logging.

1. Metric Calculation Methodology

Currently, run_eval.py generates all_results.json. I would like to confirm whether the intended calculation of the following metrics matches my current understanding:

  • Partial Completion: calculated by averaging the scores across all attempts for each task, and then taking the mean of those per-task averages across all tasks.
    • Insight: does this accurately reflect the "granularity" of the agent's performance (i.e., progress made even on failed tasks)?

  • Success Rate (Aggregate): calculated as (Total Successful Attempts) / (Total Number of Attempts).
    • Insight: this seems to measure "average quality stability" rather than just the Pass@1 rate. Is this the intended interpretation?

  • Pass@3: a task is considered a success if at least one of its first three attempts achieves a perfect score (1.0).
    • Insight: this reflects the "upper bound", i.e., the solvability ceiling of the agent.

Does the project currently support automated output of these values, or is the user expected to post-process all_results.json manually?
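
For reference, this is the post-processing I am doing at the moment. It assumes all_results.json is a list of per-attempt records with task_id and score fields (score in [0, 1]) and that attempts are stored in order; both the field names and the ordering are my own assumptions about the schema, not something I have confirmed against the code.

```python
import json
from collections import defaultdict

# Assumed schema: a list of per-attempt records like
# {"task_id": "...", "score": 0.75, ...}
with open("all_results.json") as f:
    results = json.load(f)

attempts_by_task = defaultdict(list)
for record in results:
    attempts_by_task[record["task_id"]].append(record["score"])

# Partial Completion: mean over tasks of the per-task mean score.
task_avgs = [sum(scores) / len(scores) for scores in attempts_by_task.values()]
partial_completion = sum(task_avgs) / len(task_avgs)

# Success Rate (aggregate): successful attempts over all attempts.
all_scores = [s for scores in attempts_by_task.values() for s in scores]
success_rate = sum(1 for s in all_scores if s == 1.0) / len(all_scores)

# Pass@3: a task counts as solved if any of its first three attempts scores 1.0.
pass_at_3 = sum(
    1 for scores in attempts_by_task.values() if any(s == 1.0 for s in scores[:3])
) / len(attempts_by_task)

print(f"Partial Completion: {partial_completion:.3f}")
print(f"Success Rate:       {success_rate:.3f}")
print(f"Pass@3:             {pass_at_3:.3f}")
```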

2. Logging Execution Latency

I noticed that the script does not explicitly record the execution time (latency) for each agent run.

  • Do you recommend a specific format for recording this (e.g., adding an elapsed_time field to the results JSON)? A minimal sketch of what I have in mind follows below this list.
  • Are there plans to include avg_time per task in the standard evaluation output?
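
On the first point, this is roughly what I have in mind; timed_attempt, run_agent, and the elapsed_time field name are only placeholders for whatever run_eval.py actually calls and writes per attempt:

```python
import time
from typing import Any, Callable, Dict

def timed_attempt(run_agent: Callable[..., Dict[str, Any]], *args, **kwargs) -> Dict[str, Any]:
    """Run one agent attempt and attach its wall-clock latency to the result record.

    run_agent stands in for the per-attempt callable used by run_eval.py;
    the elapsed_time field name is only a suggestion.
    """
    start = time.monotonic()
    result = run_agent(*args, **kwargs)
    result["elapsed_time"] = time.monotonic() - start
    return result
```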

3. Proposed Metric Summary Output

It would be highly beneficial if the evaluation script could print a summary table similar to the following upon completion:

| Metric | Calculation Basis | Current Value |
| --- | --- | --- |
| Partial Completion | Mean(Task_Avg_Scores) | TBD |
| Success Rate | Total_Success / Total_Attempts | TBD |
| Pass@3 | Any(Score==1.0) in 3 tries | TBD |
| Avg. Latency | Mean(Execution_Time) | TBD |
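
If it helps, a minimal sketch of how such a table could be printed at the end of the run; the print_summary helper is purely illustrative, and the values would come from the calculations above (placeholders are used here):

```python
def print_summary(metrics: dict[str, tuple[str, float]]) -> None:
    """Print the proposed summary table: metric name, calculation basis, value."""
    print(f"{'Metric':<20} {'Calculation Basis':<34} {'Current Value':>13}")
    for name, (basis, value) in metrics.items():
        print(f"{name:<20} {basis:<34} {value:>13.3f}")

# Placeholder values; in practice these would be computed from all_results.json.
print_summary({
    "Partial Completion": ("Mean(Task_Avg_Scores)", 0.0),
    "Success Rate": ("Total_Success / Total_Attempts", 0.0),
    "Pass@3": ("Any(Score==1.0) in 3 tries", 0.0),
    "Avg. Latency": ("Mean(Execution_Time)", 0.0),
})
```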

Looking forward to your guidance on whether these calculations match the internal logic of the project.

Best regards,
