Hi Team,
I am currently using `run_eval.py` to evaluate agent performance and have a few questions regarding the output metrics and data logging.
1. Metric Calculation Methodology
Currently, `run_eval.py` generates `all_results.json`. I would like to clarify whether the intended calculation for the following metrics aligns with my current understanding:
- Partial Completion: Calculated by averaging the scores across all attempts for each task, and then taking the mean of those averages across all tasks.
  - Insight: Does this accurately reflect the "granularity" of the agent's performance (i.e., progress made even on failed tasks)?
- Success Rate (Aggregate): Calculated as (Total Successful Attempts) / (Total Number of Attempts).
  - Insight: This seems to measure the "average quality stability" rather than just the Pass@1 rate. Is this the intended interpretation?
- Pass@3: A task is considered a success if at least one of the first three attempts achieves a perfect score (1.0).
  - Insight: This reflects the "upper bound" or solvability ceiling of the agent.
Does the project currently support an automated output for these specific values, or is the user expected to post-process `all_results.json` manually?
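For context, here is a minimal sketch of how I am currently post-processing `all_results.json` by hand. It assumes the file maps each task ID to a list of per-attempt scores in [0, 1], and that an attempt counts as a success only at a score of exactly 1.0; please correct me if the actual schema or success criterion differs.

```python
import json

# Assumed schema: {"task_id": [score_attempt_1, score_attempt_2, ...], ...}
# with each score in [0.0, 1.0]. This may not match the real layout.
with open("all_results.json") as f:
    results = json.load(f)

# Partial Completion: mean over tasks of the per-task mean attempt score.
task_avgs = [sum(scores) / len(scores) for scores in results.values()]
partial_completion = sum(task_avgs) / len(task_avgs)

# Success Rate (Aggregate): successful attempts over all attempts.
all_scores = [s for scores in results.values() for s in scores]
success_rate = sum(1 for s in all_scores if s == 1.0) / len(all_scores)

# Pass@3: a task passes if any of its first three attempts scores 1.0.
pass_at_3 = sum(
    1 for scores in results.values() if any(s == 1.0 for s in scores[:3])
) / len(results)

print(f"Partial Completion: {partial_completion:.3f}")
print(f"Success Rate:       {success_rate:.3f}")
print(f"Pass@3:             {pass_at_3:.3f}")
```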
2. Logging Execution Latency
I noticed that the script does not explicitly record the execution time (latency) for each agent run.
- Do you recommend a specific format for recording this (e.g., adding an `elapsed_time` field to the results JSON)? A rough sketch of what I had in mind follows this list.
- Are there plans to include `avg_time` per task in the standard evaluation output?
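For illustration, the kind of wrapper I am picturing is below; `run_attempt` is a hypothetical stand-in for whatever callable `run_eval.py` invokes per attempt, and the `elapsed_time` key is only a suggested name, not an existing field.

```python
import time

def run_with_timing(run_attempt, *args, **kwargs):
    """Run a single agent attempt and attach its wall-clock latency.

    `run_attempt` is a placeholder for the per-attempt callable in
    run_eval.py; it is assumed to return a dict-like result record.
    """
    start = time.perf_counter()
    result = run_attempt(*args, **kwargs)
    result["elapsed_time"] = time.perf_counter() - start  # seconds
    return result
```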
3. Proposed Metric Summary Output
It would be highly beneficial if the evaluation script could print a summary table similar to the following upon completion:
| Metric | Calculation Basis | Current Value |
| --- | --- | --- |
| Partial Completion | Mean(Task_Avg_Scores) | TBD |
| Success Rate | Total_Success / Total_Attempts | TBD |
| Pass@3 | Any(Score==1.0) in 3 tries | TBD |
| Avg. Latency | Mean(Execution_Time) | TBD |
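As a rough illustration rather than a request for this exact implementation, the printout could be as simple as the following, assuming the metric values have already been computed (e.g., as in the sketch under section 1) and `avg_latency` is derived from the proposed `elapsed_time` fields:

```python
def print_summary(partial_completion, success_rate, pass_at_3, avg_latency):
    """Print the proposed end-of-run metric summary as a plain-text table."""
    rows = [
        ("Partial Completion", "Mean(Task_Avg_Scores)", f"{partial_completion:.3f}"),
        ("Success Rate", "Total_Success / Total_Attempts", f"{success_rate:.3f}"),
        ("Pass@3", "Any(Score==1.0) in 3 tries", f"{pass_at_3:.3f}"),
        ("Avg. Latency", "Mean(Execution_Time)", f"{avg_latency:.2f}s"),
    ]
    print(f"{'Metric':<20} {'Calculation Basis':<32} {'Current Value':>14}")
    for name, basis, value in rows:
        print(f"{name:<20} {basis:<32} {value:>14}")
```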
Looking forward to your guidance on whether these calculations match the internal logic of the project.
Best regards,