Run this program with `python main.py`.

Before that, please visit GroundingDINO and SAM2 and prepare the environment according to their instructions. Also download the checkpoints for SAM2 and GroundingDINO separately: place the SAM2 checkpoint at `/checkpoints/sam2.1_hiera_large.pt` and the GroundingDINO checkpoint at `/weights/groundingdino_swinb_cogcoor.pth`.
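As a quick sanity check before launching, the expected checkpoint files can be verified with a short script. This is a sketch only: the relative `checkpoints/` and `weights/` directories and the `missing_checkpoints` helper are assumptions for illustration, not part of this project's code.

```python
from pathlib import Path

# Checkpoint locations from this README, checked relative to the current
# directory (an assumption -- adjust to wherever your project root is).
CHECKPOINTS = [
    Path("checkpoints/sam2.1_hiera_large.pt"),
    Path("weights/groundingdino_swinb_cogcoor.pth"),
]

def missing_checkpoints(paths=CHECKPOINTS):
    """Return the checkpoint paths that are not present on disk."""
    return [p for p in paths if not p.is_file()]

for p in missing_checkpoints():
    print(f"missing checkpoint: {p}")
```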
In addition, the only extra dependency the UI needs is Gradio (`pip install gradio`). Downloading `bert-base-uncased` from Hugging Face into the root directory of this project is also recommended.
Avoid uploading long videos, as they lead to very long inference times: a video of 100 frames takes about 8 minutes. This is primarily a tool for semantic labeling jobs.
Use `keyword,keyword,...` as your text prompt rather than a long sentence.
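For example, a prompt like `person,red car,dog` is naturally handled by splitting on commas so each keyword can be sent to GroundingDINO as its own query. A minimal sketch, where the `split_prompt` helper is illustrative and not this project's actual API:

```python
def split_prompt(prompt: str) -> list[str]:
    """Split a comma-separated text prompt into individual keywords.

    Empty entries and surrounding whitespace are discarded, so
    "person, red car,dog," yields ["person", "red car", "dog"].
    """
    return [kw.strip() for kw in prompt.split(",") if kw.strip()]

print(split_prompt("person, red car,dog,"))
```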
The official Grounded-SAM-2 project provides only a simple implementation of this pipeline. This project optimizes it in the following aspects:
- GroundingDINO runs a single keyword per inference, significantly reducing missed detections
- Grounded objects are searched for in all frames, and instances detected by GroundingDINO are tracked across the entire video
- Better mask post-processing
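The exact mask post-processing used here is not specified in this section. As an illustration of one common cleanup step, small spurious connected components can be removed from a binary mask; the sketch below uses a pure-Python BFS and is an assumption, not the project's actual code (real pipelines often use `scipy.ndimage.label` or OpenCV instead).

```python
import numpy as np
from collections import deque

def remove_small_components(mask: np.ndarray, min_area: int) -> np.ndarray:
    """Remove 4-connected foreground components smaller than min_area.

    mask: 2-D boolean array. Returns a cleaned copy; the input is untouched.
    """
    mask = mask.astype(bool)
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                # BFS to collect one connected component
                comp, queue = [], deque([(i, j)])
                seen[i, j] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                # Keep the component only if it is large enough
                if len(comp) >= min_area:
                    for y, x in comp:
                        out[y, x] = True
    return out
```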
Experienced developers are welcome to collaborate with me on this project. If you are interested, please send an email to [email protected].