We curated a single-step UI grounding test dataset, GroundUI-18K, and a multi-step trajectory test dataset. For efficient benchmarking, we sampled subsets from both, forming GroundUI-1K and TrajectoryLite. All datasets can be obtained from our project page.
Using these two datasets, we created a leaderboard covering three core agent abilities: UI grounding, learning from videos, and success detection. The four benchmark datasets (GroundUI-1K, IDM-Single, IDM-Multiple, and SuccessDetection) are downloaded automatically from HuggingFace, with no additional steps required. The GroundUI-18K dataset is also hosted on HuggingFace.
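If you want to inspect a HuggingFace-hosted split manually, it can be loaded with the datasets library. This is a minimal sketch assuming the repo id matches the --data_path values used in the commands below (e.g., agent-studio/GroundUI-1K); the evaluation scripts perform this download automatically, so the snippet is only for exploration.

from datasets import load_dataset

# Optional manual inspection; the benchmark scripts download these splits on their own.
# The repo id is assumed to match the --data_path used later (agent-studio/GroundUI-1K).
ground_ui_1k = load_dataset("agent-studio/GroundUI-1K")
print(ground_ui_1k)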
If you download from the Google Drive link on our project page (recommended; it makes report generation easier):
Extract the downloaded GroundUI dataset:
tar -xvf gui_grounding.tar.gz
The file structure:
eval_agent_desiderata/
└── datasets/
    └── gui_grounding/
        ├── images/
        ├── metadata_1k.jsonl
        ├── metadata_raw_1k.jsonl
        ├── metadata_raw.jsonl
        └── metadata.jsonl
There are 13,522 screenshots under images/. Both raw and recaptioned instructions are provided in metadata_1k.jsonl and metadata.jsonl. The raw files metadata_raw_1k.jsonl and metadata_raw.jsonl contain only raw instructions. More details on recaptioning can be found at the bottom of this page.
Extract the downloaded TrajectoryLite dataset:
tar -xvf trajectory_lite.tar.gz
The file structure:
eval_agent_desiderata/
└── datasets/
    └── trajectory_lite/
        ├── images/
        ├── metadata_idm.jsonl
        ├── metadata_idmn2n.jsonl
        └── metadata_success_detection.jsonl
For downloading and processing the raw data, see the detailed instructions.
We use GPT-4o to recaption GroundUI-1K.
python eval_agent_desiderata/re_caption_gui_grounding_data.py --model gpt-4o-2024-05-13 --data_path eval_agent_desiderata/datasets/gui_grounding/metadata_raw_1k.jsonl
We use CogVLM2 to recaption GroundUI-18K.
python eval_agent_desiderata/re_caption_gui_grounding_data.py --model /PATH/TO/cogvlm2-llama3-chat-19B --data_path eval_agent_desiderata/datasets/gui_grounding/metadata_raw.jsonl
The --model values tested are: gpt-4o-2024-05-13, gpt-4-turbo-2024-04-09, gemini-pro-vision, gemini-1.5-pro-001, gemini-1.5-flash-001 (Vertex AI), claude-3-5-sonnet-20240620, claude-3-5-sonnet@20240620 (Vertex AI), claude-3-5-sonnet-20241022, /PATH/TO/SeeClick, /PATH/TO/cogvlm2-llama3-chat-19B, /PATH/TO/Qwen-VL-Chat, /PATH/TO/cogagent-chat-hf, /PATH/TO/paligemma-3b-mix-448, /PATH/TO/paligemma-3b-pt-896, and /PATH/TO/MiniCPM-Llama3-V-2_5.
To use the latest models with Vertex AI, see the Vertex AI documentation. To test other models, add them to the MODEL_PROVIDER_MAPPING in agent_studio/llm/__init__.py. You can also pass --num_workers to speed up evaluation for API-based models.
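The exact structure of MODEL_PROVIDER_MAPPING may differ from the sketch below, which assumes it is a plain dict mapping a model-name prefix to a provider key; verify against agent_studio/llm/__init__.py before copying.

# Hypothetical sketch of registering a new model in agent_studio/llm/__init__.py.
# Assumes MODEL_PROVIDER_MAPPING is a dict from model-name prefix to provider key;
# check the actual file for the real structure and the supported provider keys.
MODEL_PROVIDER_MAPPING = {
    "gpt-4o": "openai",
    "gemini": "gemini",
    "claude": "anthropic",
    # Add your model here, e.g. a new OpenAI-compatible endpoint:
    "my-new-model": "openai",
}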
Example evaluation commands:
# If using local data downloaded from Google Drive
python eval_agent_desiderata/main.py --model gpt-4o-2024-05-13 --eval_type gui_grounding --data_path eval_agent_desiderata/datasets/gui_grounding/metadata_1k.jsonl
# If using HuggingFace dataset
python eval_agent_desiderata/main.py --model claude-3-5-sonnet-20241022 --eval_type gui_grounding --data_path agent-studio/GroundUI-1K
# You need to specify `--tokenizer` for some open-source models such as SeeClick; otherwise, the tokenizer is loaded automatically from the model path.
python eval_agent_desiderata/main.py --model /PATH/TO/SeeClick --tokenizer /PATH/TO/Qwen-VL-Chat --eval_type gui_grounding --data_path eval_agent_desiderata/datasets/gui_grounding/metadata_1k.jsonl
After running experiments, you can generate a report with metrics using the following command. The --result_path is the path where the evaluation results were saved, shown in the last log line of the evaluation run, e.g., results/gui_grounding/gpt-4o-2024-05-13.jsonl. Example script for gathering results:
python eval_agent_desiderata/make_report.py --image_path eval_agent_desiderata/datasets/gui_grounding/images --result_path results/gui_grounding/claude-3-5-sonnet-20241022.jsonl
The following commands ablate model performance on raw instructions.
python eval_agent_desiderata/main.py --model gpt-4o-2024-05-13 --eval_type gui_grounding --data_path eval_agent_desiderata/datasets/gui_grounding/metadata_raw_1k.jsonl
python eval_agent_desiderata/main.py --model /PATH/TO/SeeClick --tokenizer /PATH/TO/Qwen-VL-Chat --eval_type gui_grounding --data_path eval_agent_desiderata/datasets/gui_grounding/metadata_raw_1k.jsonl
python eval_agent_desiderata/main.py --model /PATH/TO/cogvlm2-llama3-chat-19B --eval_type gui_grounding --data_path eval_agent_desiderata/datasets/gui_grounding/metadata_raw_1k.jsonl
python eval_agent_desiderata/main.py --model /PATH/TO/SeeClick --tokenizer /PATH/TO/Qwen-VL-Chat --eval_type gui_grounding --data_path eval_agent_desiderata/datasets/gui_grounding/metadata.jsonl
IDM-Single: Predict a single action between two neighboring states (images).
# If using local data downloaded from Google Drive
python eval_agent_desiderata/main.py --model gpt-4o-2024-05-13 --eval_type idm --data_path eval_agent_desiderata/datasets/trajectory_lite/metadata_idm.jsonl
# If using HuggingFace dataset
python eval_agent_desiderata/main.py --model gpt-4o-2024-05-13 --eval_type idm --data_path agent-studio/IDM-Single
python eval_agent_desiderata/main.py --model /PATH/TO/Qwen-VL-Chat --eval_type idm --data_path eval_agent_desiderata/datasets/trajectory_lite/metadata_idm.jsonl
Example script for gathering results:
python eval_agent_desiderata/make_report.py --image_path eval_agent_desiderata/datasets/trajectory_lite/images --result_path results/idm/gpt-4o-2024-05-13.jsonl
IDM-Multiple: Predict all actions given a trajectory.
# If using local data downloaded from Google Drive
python eval_agent_desiderata/main.py --model gpt-4o-2024-05-13 --eval_type idmn2n --data_path eval_agent_desiderata/datasets/trajectory_lite/metadata_idmn2n.jsonl
# If using HuggingFace dataset
python eval_agent_desiderata/main.py --model gpt-4o-2024-05-13 --eval_type idmn2n --data_path agent-studio/IDM-Multiple
python eval_agent_desiderata/main.py --model /PATH/TO/Qwen-VL-Chat --eval_type idmn2n --data_path eval_agent_desiderata/datasets/trajectory_lite/metadata_idmn2n.jsonl
Example script for gathering results:
python eval_agent_desiderata/make_report.py --image_path eval_agent_desiderata/datasets/trajectory_lite/images --result_path results/idmn2n/gpt-4o-2024-05-13.jsonl
Example script for evaluation (success_detection or success_detection_actionless):
# If using local data downloaded from Google Drive
python eval_agent_desiderata/main.py --model gpt-4o-2024-05-13 --eval_type success_detection --data_path eval_agent_desiderata/datasets/trajectory_lite/metadata_success_detection.jsonl
# If using HuggingFace dataset
python eval_agent_desiderata/main.py --model gpt-4o-2024-05-13 --eval_type success_detection --data_path agent-studio/SuccessDetection
python eval_agent_desiderata/main.py --model /PATH/TO/Qwen-VL-Chat --eval_type success_detection --data_path eval_agent_desiderata/datasets/trajectory_lite/metadata_success_detection.jsonl
Example script for gathering results:
python eval_agent_desiderata/make_report.py --image_path eval_agent_desiderata/datasets/trajectory_lite/images --result_path results/success_detection/gpt-4o-2024-05-13.jsonl
Run the following command to see the help message:
python eval_agent_desiderata/online_benchmark_analysis.py
For example:
# Calculate the trajectory length statistics of gpt-4o-2024-08-06
python eval_agent_desiderata/online_benchmark_analysis.py traj_length_stat --model gpt-4o-2024-08-06