A CLI and GUI for visual question answering (answering questions about an image) using dandelin's Vision-and-Language Transformer (ViLT) model
- Clone this repo
git clone https://github.com/Dafterfly/Quick_Vilt_Cli.git
- Navigate into the repo
cd Quick_Vilt_Cli
- Install requirements
pip install -r requirements.txt
You are now ready to use the script.
Using this image from the COCO dataset as an example:
Direct URL: https://farm4.staticflickr.com/3076/5868604848_680662062a_z.jpg
COCO dataset link: https://cocodataset.org/#explore?id=18633
Note: the first time that you run either the CLI or GUI, the ViLT model will automatically be downloaded onto your computer. This download is 449 MB.
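Under the hood, both scripts run a dandelin ViLT checkpoint through the Hugging Face transformers library; the download above is that checkpoint being cached. Here is a minimal sketch of the inference step, assuming the dandelin/vilt-b32-finetuned-vqa checkpoint (the actual code in quick_vilt.py may differ):

```python
from PIL import Image
import requests
from transformers import ViltProcessor, ViltForQuestionAnswering

# The first from_pretrained call downloads and caches the ~449 MB checkpoint
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "https://farm4.staticflickr.com/3076/5868604848_680662062a_z.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "how many dogs are there?"

# Encode the image/question pair and take the highest-scoring answer class
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```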
To use the command line interface, call the script and pass these 2 arguments:
--image or -i: either an image URL from the web or a path to a locally stored image
--question or -q: the question you'd like to ask
- Examples
- Image from URL
python quick_vilt.py -i https://farm4.staticflickr.com/3076/5868604848_680662062a_z.jpg -q "how many dogs are there?"
Output
Predicted answer: 2
- Image from local storage
python quick_vilt.py -i 5868604848_680662062a_z.jpg -q "how many dogs are there?"
Output
Predicted answer: 2
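As the two examples show, -i accepts either form, which comes down to telling a web URL apart from a local path. A small argparse sketch of how the flags could be wired up (the actual flag handling in quick_vilt.py may differ):

```python
import argparse
from PIL import Image
import requests

parser = argparse.ArgumentParser(description="ViLT visual question answering")
parser.add_argument("-i", "--image", required=True,
                    help="image URL or path to a local image file")
parser.add_argument("-q", "--question", required=True,
                    help="the question to ask about the image")
args = parser.parse_args()

# Treat anything that looks like a web address as a URL, otherwise a local path
if args.image.startswith(("http://", "https://")):
    image = Image.open(requests.get(args.image, stream=True).raw).convert("RGB")
else:
    image = Image.open(args.image).convert("RGB")
```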
Alternatively, you can use the graphical user interface by calling:
python quick_vilt_gui.py
You can pick an image from local storage using the file dialog that appears when you click 'Browse', or type a local path or web URL directly into the box.
You can tick or untick the 'Preview image' box to show or hide the selected image.
Click 'Run Prediction' to get the predicted answer to your question.