
ShowUI

Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.


   📑 Paper    | 🤗 Hugging Face Models   |    🤗 Spaces Demo    |    🕹️ OpenBayes贝式计算 Demo   
🤗 Datasets   |   💬 X (Twitter)   |    🖥️ Computer Use    |    📖 GUI Paper List   

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
Show Lab @ National University of Singapore, Microsoft

🔥 Update

  • [2024.12.28] Release GPT-4o annotation recaptioning scripts.
  • [2024.12.27] Release training code and instructions.
  • [2024.12.23] Update showui with the UI-guided token selection implementation.
  • [2024.12.15] ShowUI received the Outstanding Paper Award at the NeurIPS 2024 Open-World Agents workshop.
  • [2024.12.9] Support int8 quantization.
  • [2024.12.5] Major update: ShowUI is integrated into OOTB for local runs!
  • [2024.12.1] Support iterative refinement to improve grounding accuracy. Try it in the HF Spaces demo.
  • [2024.11.27] Release the arXiv paper, HF Spaces demo, and ShowUI-desktop-8K.
  • [2024.11.16] showlab/ShowUI-2B is available on Hugging Face.

🖥️ Computer Use

See Computer Use OOTB for using ShowUI to control your PC.

(Demo video: computer_use_with_showui-en-s.mp4)

⭐ Quick Start

See Quick Start for model usage.
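ShowUI's grounding predictions are click points normalized to [0, 1], so an agent must scale them to the screenshot's pixel size before acting. A minimal sketch of that conversion (the helper name `to_pixels` is ours, not part of the released API):

```python
def to_pixels(norm_xy, width, height):
    """Convert a normalized [x, y] grounding output in [0, 1]
    to integer pixel coordinates on a width x height screenshot."""
    x, y = norm_xy
    return round(x * width), round(y * height)

# e.g. a predicted click at the center of a 1920x1080 screenshot
print(to_pixels([0.5, 0.5], 1920, 1080))  # (960, 540)
```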

🤗 Local Gradio

See Gradio for installation.

🚀 Training

Our training codebase supports:

  • DeepSpeed Zero1, Zero2, Zero3
  • Full-tuning (FP32, FP16, BF16), LoRA, QLoRA
  • SDPA, Flash Attention 2
  • Multiple datasets mixed training
  • Interleaved data streaming
  • Random image resizing (crop, pad)

See Train for training setup.
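Mixed-dataset training with interleaved streaming can be sketched as weighted sampling over several endless dataset iterators. This is a simplified illustration of the idea, not the training code itself; the function name and cycling behavior are our assumptions:

```python
import random
from itertools import islice

def interleave(datasets, weights, seed=0):
    """Endlessly stream samples from several datasets, picking the
    source of each sample by the given mixing weights; a dataset is
    restarted when exhausted (toy sketch of interleaved streaming)."""
    rng = random.Random(seed)
    iters = [iter(ds) for ds in datasets]
    indices = list(range(len(datasets)))
    while True:
        i = rng.choices(indices, weights=weights)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            iters[i] = iter(datasets[i])  # cycle the exhausted dataset
            yield next(iters[i])

# Two toy "datasets" mixed roughly 70/30
web = [("web", k) for k in range(3)]
desktop = [("desktop", k) for k in range(2)]
batch = list(islice(interleave([web, desktop], [0.7, 0.3]), 8))
print(batch)
```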

🕹️ UI-Guided Token Selection

Try test.ipynb, which seamlessly supports Qwen2VL models.

(a) Screenshot: 1,296 patches. (b) After applying the UI graph: 167 UI components.
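The patch-to-component reduction above can be illustrated as building a graph over the patch grid and merging visually identical neighbors into one component. This toy union-find sketch (our simplification; the real implementation operates on actual screenshot patches) shows how redundant patches collapse:

```python
def count_components(grid):
    """Count connected components of identical values in a 2D grid,
    merging 4-connected neighbors that share the same value -- a toy
    version of collapsing redundant screenshot patches into UI
    components via a UI graph."""
    h, w = len(grid), len(grid[0])
    parent = list(range(h * w))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for r in range(h):
        for c in range(w):
            if c + 1 < w and grid[r][c] == grid[r][c + 1]:
                union(r * w + c, r * w + c + 1)
            if r + 1 < h and grid[r][c] == grid[r + 1][c]:
                union(r * w + c, (r + 1) * w + c)
    return len({find(i) for i in range(h * w)})

# 3x4 "screenshot": 12 patches collapse to 3 UI components
patches = [
    ["bg", "bg", "btn", "btn"],
    ["bg", "bg", "btn", "btn"],
    ["txt", "txt", "txt", "txt"],
]
print(count_components(patches))  # 3
```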

✍️ Annotate your own data

Try recaption.ipynb, where we provide instructions for recaptioning the original annotations with GPT-4o.
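At its core, recaptioning means sending each screenshot crop plus its raw annotation to GPT-4o and asking for a richer caption. A sketch of assembling such a request payload (the prompt wording and function name are illustrative, not the ones used in recaption.ipynb; the notebook is the source of truth):

```python
import base64

def build_recaption_request(image_bytes, original_caption):
    """Build a GPT-4o chat-completion payload that asks the model to
    rewrite a raw UI annotation into a descriptive caption. The image
    is passed inline as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Recaption this UI element (original annotation: "
                         f"{original_caption!r}) with a concise, descriptive phrase."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# Build (but do not send) a request for a dummy image
payload = build_recaption_request(b"\x89PNG", "icon_23")
print(payload["model"])  # gpt-4o
```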

❤ Acknowledgement

We extend our gratitude to SeeClick for providing their code and datasets.

Special thanks to Siyuan for assistance with the Gradio demo and OOTB support.

🎓 BibTeX

If you find our work helpful, please consider citing our paper.

@misc{lin2024showui,
      title={ShowUI: One Vision-Language-Action Model for GUI Visual Agent}, 
      author={Kevin Qinghong Lin and Linjie Li and Difei Gao and Zhengyuan Yang and Shiwei Wu and Zechen Bai and Weixian Lei and Lijuan Wang and Mike Zheng Shou},
      year={2024},
      eprint={2411.17465},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17465}, 
}

If you like our project, please give us a star ⭐ on GitHub to stay up to date.

Star History Chart