Trying to use it on Windows in Firefox browser, it fails to do even basic actions #56

RaiaN · 2024-12-12T23:30:41Z

Hi there,
I'm trying to use your app (gpt-4o + ShowUI), but it fails to perform even basic actions. I'm on Windows with a 2560x1440 monitor, and Firefox is open.

I ask it to open a new tab for me, and it fails to do so. The mouse is not moved to the correct location, and clicks do not occur.

How do I debug your system?

h-siyuan · 2024-12-16T08:57:30Z

Could you try the screenshot grounding on our Hugging Face space: https://huggingface.co/spaces/showlab/ShowUI? You can post the results here and we will investigate :)

tristayunsub · 2024-12-31T00:20:11Z

You had better lower the resolution to 1028 x 728.

FringeNet · 2025-01-14T18:31:20Z

Even at a lower resolution, it is still "unaware" of what is going on.
It tries to open software that's already open and displayed on the screen.
It also hallucinates having successfully completed steps.

E.g. (1366x768, Win 11, ShowUI + GPT4o, RTX 4070):
Modifications to loop.py in order to actually get it working:

Planner tries to:

1: Click File Menu
2: Create a new project
3: Enter the name for the project
4: Create Minecraft entity as instructed

Actually does:

1: Clicks File Menu
2: Clicks New (which has a submenu of project types)
3: Hallucinates typing the name after clicking the submenu
4: Hallucinates interacting with the application, despite seeing the same screenshot over and over where it is stuck.
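
For anyone hitting the same stuck state, here is a minimal sketch of how a repeated screenshot could be detected before the planner claims progress. It is not part of the project: `StuckDetector`, `max_repeats`, and the idea of passing raw screenshot bytes are all assumptions for illustration. It only catches byte-identical frames, so a blinking cursor or a clock on screen would defeat it.

```python
import hashlib


class StuckDetector:
    """Hypothetical helper: flags when the same screenshot repeats too often."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self._last_digest = None
        self._repeats = 0

    def update(self, screenshot_bytes):
        """Record one screenshot; return True once it has repeated max_repeats times in a row."""
        digest = hashlib.sha256(screenshot_bytes).hexdigest()
        if digest == self._last_digest:
            self._repeats += 1
        else:
            # A different frame resets the counter.
            self._last_digest = digest
            self._repeats = 1
        return self._repeats >= self.max_repeats


if __name__ == "__main__":
    detector = StuckDetector(max_repeats=3)
    for frame in (b"frame-a", b"frame-a", b"frame-a"):
        print(detector.update(frame))
```

Something like this could be called once per loop iteration with the encoded screenshot, and the loop could stop or re-plan when it returns True.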

Tail of Console output:

```
_render_message: **VLMPlanner**:
I need to adjust the cube to form the base shape of the Shardling's semi-transparent
_render_message: **VLMPlanner** sending action to **<span style="color:rgb(106, 158, 210)">S</span><span style="color
_render_message: Screenshot for **<span style="color:rgb(106, 158, 210)">S</span><span style="color:rgb(111, 163, 82)
Output Text: [{'action': 'CLICK', 'value': None, 'position': [0.23, 0.09]}]
Parsed Output: [{'action': 'CLICK', 'value': None, 'position': [0.23, 0.09]}]
Action Item: {'action': 'CLICK', 'value': None, 'position': [0.23, 0.09]}
Parsed Action List: [{'action': 'mouse_move', 'text': None, 'coordinate': (314, 69)}, {'action': 'left_click', 'text': None, 'coordinate': None}]
_render_message: **<span style="color:rgb(106, 158, 210)">S</span><span style="color:rgb(111, 163, 82)">h</span><span
Converted Action: {'action': 'mouse_move', 'text': None, 'coordinate': (314, 69)}
sync_call: computer {'action': 'mouse_move', 'text': None, 'coordinate': (314, 69)}
action: mouse_move, text: None, coordinate: (314, 69)
mouse move to 314, 69
_render_message: **<span style="color:rgb(106, 158, 210)">S</span><span style="color:rgb(111, 163, 82)">h</span><span
Converted Action: {'action': 'left_click', 'text': None, 'coordinate': None}
sync_call: computer {'action': 'left_click', 'text': None, 'coordinate': None}
action: left_click, text: None, coordinate: None
```
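
In the log above, the normalized position `[0.23, 0.09]` becomes the pixel coordinate `(314, 69)`, which matches scaling by the 1366x768 screen size. A minimal sketch of that conversion, useful for checking by hand whether a click landed where the model intended, might look like the following. The function name `to_pixels` and the use of rounding are assumptions; the project's own conversion may truncate or be structured differently.

```python
def to_pixels(position, screen_width, screen_height):
    """Scale a normalized [x, y] position (each in 0..1) to integer pixel coordinates.

    Hypothetical helper for illustration; rounding is an assumption.
    """
    x_norm, y_norm = position
    if not (0.0 <= x_norm <= 1.0 and 0.0 <= y_norm <= 1.0):
        raise ValueError(f"position out of range: {position!r}")
    return round(x_norm * screen_width), round(y_norm * screen_height)


if __name__ == "__main__":
    # Position from the log above on the 1366x768 screen it was captured on.
    print(to_pixels([0.23, 0.09], 1366, 768))
    # Same position on the 2560x1440 monitor from the original report.
    print(to_pixels([0.23, 0.09], 2560, 1440))
```

If the pixel coordinate computed this way does not line up with the intended UI element on the actual screen, the mismatch is in the grounding output or the screen size being assumed, not in the click itself.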
