-
Notifications
You must be signed in to change notification settings - Fork 8
Implement pipeline and new schema for image datasets #158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@MajikalExplosions could you check the failing pre-commit checks and Python unit tests? Thanks! |
neubig
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
submitting a review comment so I pop it off my review stack, but please re-request review when tests are passing!
neubig
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments!
| scroll(-50.2, -100.5) | ||
| fill(bid: str, value: str) | ||
| fill(bid: str, value: str, enable_autocomplete_menu: bool = False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm a little bit skeptical that we actually need this, could you explain why it's necessary?
- If it is necessary, please document it in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was changed because this API is from browsergym, which added that parameter in 0.14.2 (the latest release). The alternative, which I just implemented, is simply to pin the requirement to <0.14.2 . Let me know if you prefer one or the other.
| axtree = generate_axtree.last_xtree | ||
| prompt = get_web_user_message("", event.url, axtree, PREV_BID) | ||
| return {"from": "human", "value": prompt} | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test demonstrating how this works? This will also help prevent regressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, in upcoming commit.
agents/openhands/std_to_sft.py
Outdated
|
|
||
| api_env = "browser" | ||
| if not browsergym_id[0] == browsergym_id[-1] == '"': | ||
| # Fix: Add None check before accessing browsergym_id indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # Fix: Add None check before accessing browsergym_id indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
| return {"from": event.source, "value": event.content} | ||
|
|
||
| elif hasattr(event, "__class__") and event.__class__.__name__ == "ImageObservation": | ||
| elif isinstance(event, ImageObservation): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, it'd be good to have a test in the tests directory demonstrating behavior here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in the same file as the other tests; let me know if it makes sense.
agents/openhands/std_to_sft.py
Outdated
| return output_line | ||
|
|
||
|
|
||
| # Keep the old main function for backward compatibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can maybe just remove this function if it's not used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
agents/openhands/std_to_sft.py
Outdated
| parser.add_argument( | ||
| "--output_format", | ||
| type=str, | ||
| choices=["default", "finetune"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not named very well, could we think of better names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to export_for, with choices "explicit", "training"; "explicit" is for tests and human readability, while "training" is for Llama Factory compatibility.
convert_samples.ps1
Outdated
| @@ -0,0 +1,74 @@ | |||
| # PowerShell script to convert all 9 dataset samples: Raw -> Standardized -> SFT (OpenHands format) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally we work on mac with bash, so it'd be better to have a bash script if any. You can also use it on windows in WSL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed; this was a test script that I forgot to remove from the commit.
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
Summary
Implements multimodal (image + text) fine-tuning support, enabling 9 datasets to preserve visual information for training vision-language models instead of converting images to text descriptions.
Implementation
schema/observation/image.py: Schema enhancements that:
ImageAnnotation:content_description,clickable,editableNone)agents/openhands/std_to_sft.py: Multimodal SFT conversion that:
<image>tokens in conversation text at appropriate positions_image_pathmetadata[clickable],[editable])imagesarrayDataset converters (raw_to_standardized.py): Updated 9 image datasets.
content_descriptionwith resource ID/hint/tooltip, andclickable/editablefrom raw data fieldstexttocontent_description, marks all elements asclickable=Truecontent_descriptionwith XPath informationdatasets/openhands/screenshots/{trajectory_id}/directory structureTesting
In progress. Some verification is done on the output.
Notes
count(<image> tokens) == len(images array))