
Conversation

@MajikalExplosions (Collaborator) commented Nov 20, 2025

Summary

Implements multimodal (image + text) fine-tuning support, enabling 9 datasets to preserve visual information for training vision-language models instead of converting images to text descriptions.

Implementation

schema/observation/image.py: Schema enhancements that:

  • Add optional fields to ImageAnnotation: content_description, clickable, editable
  • Maintain backward compatibility (all fields default to None); see the sketch below
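For reference, a minimal sketch of the enhanced annotation, assuming a Pydantic-style model (only the new fields are shown; everything else is unchanged):

```python
from pydantic import BaseModel


class ImageAnnotation(BaseModel):
    # ...existing fields unchanged...
    # New optional fields; all default to None, so existing data still validates.
    content_description: str | None = None  # human-readable description of the element
    clickable: bool | None = None  # whether the element accepts clicks
    editable: bool | None = None  # whether the element accepts text input
```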

agents/openhands/std_to_sft.py: Multimodal SFT conversion that:

  • Inserts <image> tokens in conversation text at appropriate positions
  • Tracks image paths via internal _image_path metadata
  • Converts annotations to human-readable format with interactivity indicators (e.g., [clickable], [editable])
  • Generates LLaMA Factory-compatible output with a separate images array (see the example record after this list)
  • Supports both file paths and base64-encoded images (even though the schema explicitly requires paths)
  • Handles nested images in WebObservations
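A hedged example of one converted record, assuming LLaMA Factory's ShareGPT-style multimodal format (the field names follow that format; the paths and turn contents below are illustrative, not taken from this PR):

```python
sample = {
    "conversations": [
        # Each <image> token pairs positionally with one entry in "images".
        {"from": "human", "value": "<image>\nClick the search button."},
        {"from": "gpt", "value": 'click("search_btn")'},
    ],
    "images": ["datasets/openhands/screenshots/traj_001/step_0000.png"],
}
```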

Dataset converters (raw_to_standardized.py): Updated 9 image datasets.

  • android_in_the_wild: Intelligent two-pass conversion that analyzes action sequences to infer clickable/editable UI elements (from the source repo), then generates the trajectory with enriched annotations
  • androidcontrol: Populates content_description with resource ID/hint/tooltip, and clickable/editable from raw data fields
  • llava_plus: Regenerated samples only
  • omniact: Moves semantic labels from text to content_description, marks all elements as clickable=True
  • webarena_successful: Regenerated samples only
  • weblinx: Regenerated samples only
  • wonderbread: Populates content_description with XPath information
  • go-browse-wa: Regenerated samples only
  • openhands: Converts base64 screenshots to files, saving them under the datasets/openhands/screenshots/{trajectory_id}/ directory structure (a sketch follows this list)
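To illustrate the openhands conversion above, here is a minimal sketch of decoding a base64 screenshot into the stated directory layout; the function name and step-numbering scheme are hypothetical:

```python
import base64
from pathlib import Path


def save_screenshot(b64_data: str, trajectory_id: str, step: int) -> str:
    """Decode a base64 screenshot and write it under
    datasets/openhands/screenshots/{trajectory_id}/; returns the
    relative path to record in the SFT output."""
    out_dir = Path("datasets/openhands/screenshots") / trajectory_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"step_{step:04d}.png"  # hypothetical naming scheme
    out_path.write_bytes(base64.b64decode(b64_data))
    return str(out_path)
```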

Testing

In progress; some verification has been done on the output.

Notes

  • Output matches LLaMA Factory's multimodal requirements (count of <image> tokens == length of the images array); a quick validation sketch follows below
  • Tests are forthcoming.
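A simple way to check that invariant on each converted record (a sketch assuming the record shape from the example above):

```python
def check_multimodal_record(sample: dict) -> None:
    # LLaMA Factory expects exactly one image path per <image> token.
    n_tokens = sum(t["value"].count("<image>") for t in sample["conversations"])
    n_images = len(sample.get("images", []))
    assert n_tokens == n_images, f"{n_tokens} <image> tokens vs {n_images} image paths"
```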

@MajikalExplosions MajikalExplosions marked this pull request as draft November 20, 2025 18:18
@neubig (Contributor) commented Nov 21, 2025

@MajikalExplosions could you check the failing pre-commit checks and Python unit tests? Thanks!

@neubig (Contributor) left a comment:

submitting a review comment so I pop it off my review stack, but please re-request review when tests are passing!

@neubig (Contributor) left a comment:

A few comments!

scroll(-50.2, -100.5)
fill(bid: str, value: str)
fill(bid: str, value: str, enable_autocomplete_menu: bool = False)
Contributor:

  1. I'm a little bit skeptical that we actually need this; could you explain why it's necessary?
  2. If it is necessary, please document it in the docstring.

Collaborator Author:

It was changed because this API comes from browsergym, which added that parameter in 0.14.2 (the latest release). The alternative, which I just implemented, is simply to pin the requirement to <0.14.2. Let me know if you prefer one or the other.
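For reference, the pin would look something like this (assuming the dependency is declared in a requirements-style file):

```
browsergym<0.14.2
```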

axtree = generate_axtree.last_xtree
prompt = get_web_user_message("", event.url, axtree, PREV_BID)
return {"from": "human", "value": prompt}

Contributor:

Could you add a test demonstrating how this works? This will also help prevent regressions.

Collaborator Author:

Added in an upcoming commit.


api_env = "browser"
if not browsergym_id[0] == browsergym_id[-1] == '"':
# Fix: Add None check before accessing browsergym_id indices
Contributor:

Suggested change (delete this line):
# Fix: Add None check before accessing browsergym_id indices

Collaborator Author:

Removed.

return {"from": event.source, "value": event.content}

elif hasattr(event, "__class__") and event.__class__.__name__ == "ImageObservation":
elif isinstance(event, ImageObservation):
Contributor:

Similarly, it'd be good to have a test in the tests directory demonstrating behavior here.

Collaborator Author:

Added in the same file as the other tests; let me know if it makes sense.

return output_line


# Keep the old main function for backward compatibility
Contributor:

We can maybe just remove this function if it's not used anywhere.

Collaborator Author:

Sounds good!

parser.add_argument(
"--output_format",
type=str,
choices=["default", "finetune"],
Contributor:

These are not named very well; could we think of better names?

Collaborator Author:

Changed to export_for, with choices "explicit" and "training": "explicit" is for tests and human readability, while "training" is for LLaMA Factory compatibility.
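Based on that rename, the argument declaration would now read roughly as follows (the help text is a paraphrase of the reply above, not the actual string in the PR):

```python
parser.add_argument(
    "--export_for",
    type=str,
    choices=["explicit", "training"],
    help='"explicit" for tests and human readability; "training" for LLaMA Factory compatibility.',
)
```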

@@ -0,0 +1,74 @@
# PowerShell script to convert all 9 dataset samples: Raw -> Standardized -> SFT (OpenHands format)
Contributor:

I think we generally work on Mac with bash, so it'd be better to have a bash script, if any. You can also use it on Windows via WSL.

Collaborator Author:

Removed; this was a test script that I forgot to remove from the commit.

@openhands-ai (bot) commented Dec 7, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit Checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #158 at branch `android-in-the-wild-images`

Feel free to include any additional details that might help me get this PR into a better state.
