This page will guide you to evaluate your trained LLaVA model on VIMABench.
Evaluate models trained on inBC / D-inBC
:
python3 eval-llara.py [evaluation name] --model-path [MODEL_PATH]
Evaluate models trained on RT-2
style datasets:
python3 eval-rt2.py [evaluation name] --model-path [MODEL_PATH]
Usage:
python3 eval-llara.py [-h] [--model-path MODEL_PATH] [--output-path OUTPUT_PATH] [--prompt-mode PROMPT_MODE] [--prompt-id PROMPT_ID] [--seed SEED] [--num-env NUM_ENV] [--max-length MAX_LENGTH] [--partition PARTITION] [--detector DETECTOR] [--detector-thre DETECTOR_THRE] filename
filename
: Name of the output file.OUTPUT_PATH
: Path to the output directory (default:../results/
).MODEL_PATH
: Path to LLaVA checkpoint.PROMPT_MODE
: Set of operational flags:-
h
: Enable action history.
-
s
: Query VLM for each observation and perform a single action step no matter how many steps the VLM generates.
-
d
: Enable object detection using VLM.
-
e
: Enable object detection using MaskRCNN.
-
o
: Enable oracle object detection.
PROMPT_ID
: Which prompt to use when genearte actions. What's this?-
- -1 (or any negative values) : Randomly selected from 15 options (default)
-
- from 0 to 14 (inclusive) : Fixed prompt at the index you set (0-index)
-
- 100 (or any number strictly greater than 14) : The prompt will be omitted
SEED
: Random seed for reproducibility.NUM_ENV
: Number of episodes per task.MAX_LENGTH
: Maximum steps per episode; episodes exceeding this limit are marked as failed.PARTITION
: Specific partition of VIMABench to test; tests all partitions (L1 - L4) if unspecified.DETECTOR
: Path to the MaskRCNN checkpoint, used only ife
is enabled inPROMPT_MODE
.DETECTOR_THRE
: Minimum score for an object detection proposal to be considered valid (default: 0.6).
For models trained on datasets without object detection features, use PROMPT_MODE
as hs
.
For models starting with D-
, set PROMPT_MODE
as hsd
, hse
, or hso
to enable different object detectors.
Note: Both eval-llara.py
and eval-rt2.py
scripts accept the same arguments.