
Conversation

@MajikalExplosions (Collaborator) commented Nov 20, 2025

Summary

Implements multimodal (image + text) fine-tuning support, enabling 9 datasets to preserve visual information for training vision-language models instead of converting images to text descriptions.

Implementation

schema/observation/image.py: Schema enhancements that:

  • Add optional fields to ImageAnnotation: content_description, clickable, editable
  • Maintain backward compatibility (all fields default to None); see the sketch below
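For reference, a minimal sketch of the enhanced annotation, assuming a Pydantic-style model (only the new fields are shown; everything else is unchanged):

```python
from pydantic import BaseModel


class ImageAnnotation(BaseModel):
    # ...existing fields unchanged...
    # New optional fields; all default to None, so existing data still validates.
    content_description: str | None = None  # human-readable description of the element
    clickable: bool | None = None  # whether the element accepts clicks
    editable: bool | None = None  # whether the element accepts text input
```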

agents/openhands/std_to_sft.py: Multimodal SFT conversion that:

  • Inserts <image> tokens in conversation text at appropriate positions
  • Tracks image paths via internal _image_path metadata
  • Converts annotations to human-readable format with interactivity indicators (e.g., [clickable], [editable])
  • Generates LLaMA Factory-compatible output with a separate images array (see the example record after this list)
  • Supports both file paths and base64-encoded images (even though the schema explicitly requires paths)
  • Handles nested images in WebObservations
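A hedged example of one converted record, assuming LLaMA Factory's ShareGPT-style multimodal format (the field names follow that format; the paths and turn contents below are illustrative, not taken from this PR):

```python
sample = {
    "conversations": [
        # Each <image> token pairs positionally with one entry in "images".
        {"from": "human", "value": "<image>\nClick the search button."},
        {"from": "gpt", "value": 'click("search_btn")'},
    ],
    "images": ["datasets/openhands/screenshots/traj_001/step_0000.png"],
}
```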

Dataset converters (raw_to_standardized.py): Updated 9 image datasets.

  • android_in_the_wild: Intelligent two-pass conversion that analyzes action sequences to infer clickable/editable UI elements (from the source repo), then generates the trajectory with enriched annotations
  • androidcontrol: Populates content_description with resource ID/hint/tooltip, and clickable/editable from raw data fields
  • llava_plus: Regenerated samples only
  • omniact: Moves semantic labels from text to content_description, marks all elements as clickable=True
  • webarena_successful: Regenerated samples only
  • weblinx: Regenerated samples only
  • wonderbread: Populates content_description with XPath information
  • go-browse-wa: Regenerated samples only
  • openhands: Converts base64 screenshots to files, saving them under the datasets/openhands/screenshots/{trajectory_id}/ directory structure (a sketch follows this list)
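To illustrate the openhands conversion above, here is a minimal sketch of decoding a base64 screenshot into the stated directory layout; the function name and step-numbering scheme are hypothetical:

```python
import base64
from pathlib import Path


def save_screenshot(b64_data: str, trajectory_id: str, step: int) -> str:
    """Decode a base64 screenshot and write it under
    datasets/openhands/screenshots/{trajectory_id}/; returns the
    relative path to record in the SFT output."""
    out_dir = Path("datasets/openhands/screenshots") / trajectory_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"step_{step:04d}.png"  # hypothetical naming scheme
    out_path.write_bytes(base64.b64decode(b64_data))
    return str(out_path)
```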

Testing

In progress; some verification has been done on the output.

Notes

  • Output matches LLaMA Factory's multimodal requirements (count of <image> tokens == length of the images array); a quick validation sketch follows below
  • Tests are forthcoming.
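A simple way to check that invariant on each converted record (a sketch assuming the record shape from the example above):

```python
def check_multimodal_record(sample: dict) -> None:
    # LLaMA Factory expects exactly one image path per <image> token.
    n_tokens = sum(t["value"].count("<image>") for t in sample["conversations"])
    n_images = len(sample.get("images", []))
    assert n_tokens == n_images, f"{n_tokens} <image> tokens vs {n_images} image paths"
```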

@MajikalExplosions MajikalExplosions marked this pull request as draft November 20, 2025 18:18
@neubig (Contributor) commented Nov 21, 2025

@MajikalExplosions could you check the failing pre-commit checks and Python unit tests? Thanks!

@neubig (Contributor) left a comment:

submitting a review comment so I pop it off my review stack, but please re-request review when tests are passing!

@neubig (Contributor) left a comment:

A few comments!

scroll(-50.2, -100.5)
fill(bid: str, value: str)
fill(bid: str, value: str, enable_autocomplete_menu: bool = False)
Contributor:

  1. I'm a little bit skeptical that we actually need this; could you explain why it's necessary?
  2. If it is necessary, please document it in the docstring.

Collaborator Author:

It was changed because this API comes from browsergym, which added that parameter in 0.14.2 (the latest release). The alternative, which I just implemented, is simply to pin the requirement to <0.14.2. Let me know if you prefer one or the other.
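For reference, the pin would look something like this (assuming the dependency is declared in a requirements-style file):

```
browsergym<0.14.2
```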

axtree = generate_axtree.last_xtree
prompt = get_web_user_message("", event.url, axtree, PREV_BID)
return {"from": "human", "value": prompt}

Contributor:

Could you add a test demonstrating how this works? This will also help prevent regressions.

Collaborator Author:

Added in an upcoming commit.


api_env = "browser"
if not browsergym_id[0] == browsergym_id[-1] == '"':
# Fix: Add None check before accessing browsergym_id indices
Contributor:

Suggested change (delete this line):
# Fix: Add None check before accessing browsergym_id indices

Collaborator Author:

Removed.

return {"from": event.source, "value": event.content}

elif hasattr(event, "__class__") and event.__class__.__name__ == "ImageObservation":
elif isinstance(event, ImageObservation):
Contributor:

Similarly, it'd be good to have a test in the tests directory demonstrating behavior here.

Collaborator Author:

Added in the same file as the other tests; let me know if it makes sense.

return output_line


# Keep the old main function for backward compatibility
Contributor:

We can maybe just remove this function if it's not used anywhere.

Collaborator Author:

Sounds good!

parser.add_argument(
"--output_format",
type=str,
choices=["default", "finetune"],
Contributor:

These are not named very well; could we think of better names?

Collaborator Author:

Changed to export_for, with choices "explicit" and "training": "explicit" is for tests and human readability, while "training" is for LLaMA Factory compatibility.
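Based on that rename, the argument declaration would now read roughly as follows (the help text is a paraphrase of the reply above, not the actual string in the PR):

```python
parser.add_argument(
    "--export_for",
    type=str,
    choices=["explicit", "training"],
    help='"explicit" for tests and human readability; "training" for LLaMA Factory compatibility.',
)
```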

@@ -0,0 +1,74 @@
# PowerShell script to convert all 9 dataset samples: Raw -> Standardized -> SFT (OpenHands format)
Contributor:

I think we generally work on Mac with bash, so it'd be better to have a bash script, if any. You can also use it on Windows via WSL.

Collaborator Author:

Removed; this was a test script that I forgot to remove from the commit.

@openhands-ai (bot) commented Dec 7, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit Checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #158 at branch `android-in-the-wild-images`

Feel free to include any additional details that might help me get this PR into a better state.
