MobileVLA: Vision Language Action Model for Mobile Devices

License: Apache-2.0 (see LICENSE)

MobileVLA aims to serve as a smart assistant that can be deployed on mobile devices.

Usage

Initializing a MobileVLA model

We support pre-trained vision encoders from the OpenCLIP package, which includes OpenAI's pre-trained models. We also support pre-trained language models from the transformers package, such as LLaMA models.

from mobilevla.factory import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="PATH/TO/LLM/DIR",
    tokenizer_path="PATH/TO/LLM/DIR",
    cross_attn_every_n_layers=1,
    decoder_type='lstm',
)

The cross_attn_every_n_layers argument controls how often cross-attention layers are applied and should be consistent with the VLM. The decoder_type argument selects the action decoder; we currently support lstm, fc, diffusion (the dataloader for it still has bugs), and gpt.
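For instance, switching from the LSTM head to the fully-connected one only changes decoder_type. The value spellings below are an assumption based on the lowercase 'lstm' used above:

```python
# Same factory call as above, but with the fully-connected action decoder.
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="PATH/TO/LLM/DIR",
    tokenizer_path="PATH/TO/LLM/DIR",
    cross_attn_every_n_layers=1,
    decoder_type='fc',  # assumed spelling; one of 'lstm', 'fc', 'diffusion', 'gpt'
)
```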

Performance

We report results on the CALVIN benchmark. Columns 1-5 give the success rate of completing at least that many tasks in a row within a single rollout; Avg Len is the average number of consecutive tasks completed.

| Method | Training Data | Test Split | 1 | 2 | 3 | 4 | 5 | Avg Len |
|---|---|---|---|---|---|---|---|---|
| MCIL | ABCD (Full) | D | 0.373 | 0.027 | 0.002 | 0.000 | 0.000 | 0.40 |
| HULC | ABCD (Full) | D | 0.889 | 0.733 | 0.587 | 0.475 | 0.383 | 3.06 |
| HULC (retrained) | ABCD (Lang) | D | 0.892 | 0.701 | 0.548 | 0.420 | 0.335 | 2.90 |
| RT-1 (retrained) | ABCD (Lang) | D | 0.844 | 0.617 | 0.438 | 0.323 | 0.227 | 2.45 |
| RoboFlamingo | ABCD (Lang) | D | 0.964 | 0.896 | 0.824 | 0.740 | 0.660 | 4.09 |
| Ours | | | | | | | | |
| MCIL | ABC (Full) | D | 0.304 | 0.013 | 0.002 | 0.000 | 0.000 | 0.31 |
| HULC | ABC (Full) | D | 0.418 | 0.165 | 0.057 | 0.019 | 0.011 | 0.67 |
| RT-1 (retrained) | ABC (Lang) | D | 0.533 | 0.222 | 0.094 | 0.038 | 0.013 | 0.90 |
| RoboFlamingo | ABC (Lang) | D | 0.824 | 0.619 | 0.466 | 0.331 | 0.235 | 2.48 |
| Ours | | | | | | | | |
| HULC | ABCD (Full) | D (Enrich) | 0.715 | 0.470 | 0.308 | 0.199 | 0.130 | 1.82 |
| RT-1 (retrained) | ABCD (Lang) | D (Enrich) | 0.494 | 0.222 | 0.086 | 0.036 | 0.017 | 0.86 |
| RoboFlamingo | ABCD (Lang) | D (Enrich) | 0.720 | 0.480 | 0.299 | 0.211 | 0.144 | 1.85 |
| RoboFlamingo (freeze-emb) | ABCD (Lang) | D (Enrich) | 0.737 | 0.530 | 0.385 | 0.275 | 0.192 | 2.12 |
| Ours | | | | | | | | |
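As a quick sanity check on the table, Avg Len is simply the sum of the five per-column success rates (the expected number of consecutive tasks completed per rollout); small mismatches for some rows come from rounding in the published numbers. A minimal sketch:

```python
# "Avg Len" on CALVIN: expected number of consecutive tasks completed per rollout,
# i.e. the sum of the success rates for completing >= 1, 2, ..., 5 tasks in a row.
def avg_len(success_rates):
    return sum(success_rates)

# RT-1 (retrained), ABC (Lang) -> D row from the table above:
print(avg_len([0.533, 0.222, 0.094, 0.038, 0.013]))  # ~0.90, matching the reported Avg Len
```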

Step 0

Follow the instructions in the MobileVLM and CALVIN repositories to download the necessary datasets and pretrained VLM models.

Download the CALVIN dataset

$ cd $CALVIN_ROOT/dataset
$ sh download_data.sh D | ABC | ABCD | debug

Download the released MobileVLM model:

MobileVLM Family

| Model | LLM | GQA | SQA<sup>I</sup> | VQA<sup>T</sup> | POPE | MME<sup>P</sup> | MMB<sup>dev</sup> | Avg. |
|---|---|---|---|---|---|---|---|---|
| MobileVLM 1.7B | MobileLLaMA 1.4B | 56.1 | 57.3 | 41.5 | 84.5 | 1196.2 | 53.2 | 58.7 |
| MobileVLM V2 1.7B | MobileLLaMA 1.4B | 59.3 | 66.7 | 52.1 | 84.3 | 1302.8 | 57.7 | 64.2 |
| MobileVLM-3B | MobileLLaMA 2.7B | 59.0 | 61.2 | 47.5 | 84.9 | 1288.9 | 59.6 | 62.8 |
| MobileVLM V2 3B | MobileLLaMA 2.7B | 61.1 | 70.0 | 57.5 | 84.7 | 1440.5 | 63.2 | 68.1 |
| MobileVLM V2 7B | Vicuna-7B | 62.6 | 74.8 | 62.3 | 85.3 | 1560.7 | 69.2 | 72.1 |
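The MobileVLM V2 checkpoints are published on the Hugging Face Hub. The repo id below (mtgv/MobileVLM_V2-1.7B) is taken from the upstream MobileVLM release and is an assumption here, so adjust it to the model you actually want:

```python
from huggingface_hub import snapshot_download

# Download the MobileVLM V2 1.7B weights into a local directory that can then be
# passed as the LLM/tokenizer path in the commands below.
local_dir = snapshot_download(
    repo_id="mtgv/MobileVLM_V2-1.7B",           # assumed repo id from the upstream release
    local_dir="checkpoints/MobileVLM_V2-1.7B",  # any local path works
)
print(local_dir)
```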

Step 1

Clone this repo

git clone -b robo-vlm https://github.com/Evan-wyl/MobileVLA.git

Install the required packages:

cd MobileVLA
conda create -n MobileVLA python=3.8
source activate MobileVLA
pip install -r requirements.txt
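Before launching the multi-GPU run below, it can help to confirm that PyTorch sees your GPUs (a generic check, not specific to this repo):

```python
import torch

# The training command below assumes 8 visible GPUs (--nproc_per_node=8).
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available(), "| GPUs:", torch.cuda.device_count())
```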

Training the model (using DDP):

torchrun --nnodes=1 --nproc_per_node=8 --master_port=6042 mobilevla/train/train_calvin.py \
    --report_to_wandb \
    --llm_name mobilellama-1.4b \
    --use_gripper \
    --fusion_mode post \
    --gripper_pad 4 \
    --precision fp32 \
    --num_epochs 5 \
    --batch_size_calvin 6 \
    --run_name MobileVLA-LSTM \
    --calvin_dataset ${calvin_dataset_path} \
    --lm_path ${lm_path} \
    --tokenizer_path ${tokenizer_path} \
    --mm_projector_type ${mm_projector_type} \
    --cross_attn_every_n_layers 4 \
    --loss_multiplier_calvin 1.0 \
    --lr_scheduler constant \
    --warmup_steps 5000 \
    --learning_rate 1e-4 \
    --from_scratch \
    --window_size 12 > ${log_file} 2>&1

${calvin_dataset_path} is the path to the CALVIN dataset;

${lm_path} is the path to the pre-trained LLM;

${tokenizer_path} is the path to the VLM tokenizer;

${openflamingo_checkpoint} is the path to the OpenFlamingo pre-trained model;

${log_file} is the path to the log file.

We also provide mobilevla/pt_run_gripper_post_ws_12_traj_aug_mobilellama_1b.bash to launch the training.

Evaluating the model on the CALVIN benchmark

python eval_ckpts.py

After you add the checkpoint name and directory to eval_ckpts.py, the script automatically loads the model and evaluates it. For example, to evaluate the checkpoint at 'your-checkpoint-path', set the ckpt_dir and ckpt_names variables in eval_ckpts.py; the evaluation results are then saved to 'logs/your-checkpoint-prefix.log'.
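A minimal sketch of that edit (the variable names come from the description above; the checkpoint file name is a placeholder):

```python
# In eval_ckpts.py -- illustrative values only.
ckpt_dir = "your-checkpoint-path"              # directory containing the checkpoint(s)
ckpt_names = ["MobileVLA-LSTM_epoch_4.pth"]    # placeholder file name(s); one entry per checkpoint
```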

Co-finetune with both robot data (CALVIN) and vision-language data (COCO caption, VQA)

The results below indicate that co-training can preserve most of the VLM backbone's ability on vision-language tasks, at the cost of a slight drop in performance on robot tasks.

Use

bash robot_flamingo/pt_run_gripper_post_ws_12_traj_aug_mobilellama_1b_co_train.bash

to launch co-training with COCO, VQAv2, and CALVIN. Update the COCO and VQA paths in get_coco_dataset and get_vqa_dataset in mobilevla/data/data.py (see the sketch below).
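The bodies of those two functions are not reproduced here, so the snippet below is only a sketch of the kind of change involved: pointing the COCO and VQAv2 loaders at local copies of the data (every path is a placeholder):

```python
# Inside mobilevla/data/data.py -- illustrative placeholders for get_coco_dataset / get_vqa_dataset.
coco_image_dir = "/path/to/coco/train2017"
coco_ann_file = "/path/to/coco/annotations/captions_train2017.json"
vqa_question_file = "/path/to/vqav2/v2_OpenEnded_mscoco_train2014_questions.json"
vqa_ann_file = "/path/to/vqav2/v2_mscoco_train2014_annotations.json"
```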

Results on the CALVIN benchmark:

| Split | SR 1 | SR 2 | SR 3 | SR 4 | SR 5 | Avg Len |
|---|---|---|---|---|---|---|
| Co-Train ABC->D | | | | | | |
| Fine-tune ABC->D | | | | | | |
| Co-Train ABCD->D | | | | | | |
| Fine-tune ABCD->D | | | | | | |
| Co-Train ABCD->D (Enrich) | | | | | | |
| Fine-tune ABCD->D (Enrich) | | | | | | |

Results on VL tasks:

(BLEU-1 through SPICE are COCO captioning metrics; Acc is VQA accuracy.)

| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE_L | CIDEr | SPICE | Acc |
|---|---|---|---|---|---|---|---|---|---|
| Fine-tune (3B, zero-shot) | | | | | | | | | |
| Fine-tune (3B, 4-shot) | | | | | | | | | |
| Co-Train (3B, zero-shot) | | | | | | | | | |
| Original Flamingo (80B, fine-tuned) | | | | | | | | | |

Acknowledgment

CALVIN

Original: https://github.com/mees/calvin (License: MIT)

Meituan MobileVLM

Original: https://github.com/Meituan-AutoML/MobileVLM (License: Apache-2.0)

ByteDance RoboFlamingo

Original: https://github.com/RoboFlamingo/RoboFlamingo (License: MIT)
