Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang

FAIR, Meta · The Chinese University of Hong Kong
- [2025-05-22] We release the paper of Multi-SpatialMLLM and the code of our data engine.
To set up the Conda environment for our data engine, please follow these steps:
- Ensure you have Anaconda or Miniconda installed.
- Clone this repository.
  ```bash
  git clone https://github.com/facebookresearch/Multi-SpatialMLLM.git
  cd Multi-SpatialMLLM
  ```
- Create the Conda environment using the provided YAML file:
  ```bash
  conda env create -f requirements/data_engine.yaml
  ```
- Activate the newly created environment:
  ```bash
  conda activate data_engine
  ```
Please follow spatial_engine/utils/scannet_utils/README.md to download and process the ScanNet data.
Follow TAPVid-3D to download the data. We only use the ADT and PStudio subsets. You need to download the version with camera extrinsics annotation according to this.
Here are some notes for your reference, but simply following the official script is enough.
- Change the download URL in `tapnet/tapnet/tapvid3d/annotation_generation/gcs_utils.py` from `https://storage.googleapis.com/dm-tapnet/tapvid3d/release_files/rc4` to `https://storage.googleapis.com/dm-tapnet/tapvid3d/release_files/rc5`. Also, modify `tapnet/tapnet/tapvid3d/annotation_generation/adt_utils.py` to store `extrinsics_w2c` in the npz file like below.

  ```python
  # also add this for warning
  sequence_path = os.path.join(input_adt_path, adt_v2_name)
  # if the sequence_path does not exist, write to a warning file and exit
  if not os.path.exists(sequence_path):
      with open(f"adt_warning.txt", "a") as f:
          f.write(f"Sequence {seq_name} does not exist.")
      return

  ...

  queries_xyt = in_npz["queries_xyt"]
  trajectories = in_npz["tracks_XYZ"]
  visibilities = in_npz["visibility"]
  extrinsics_w2c = in_npz["extrinsics_w2c"]  # add this

  # Verify video means.
  video_means = np.stack([np.mean(x, axis=(0, 1)) for x in rgb_ims], axis=0)
  assert np.allclose(video_means, in_npz["video_means"], atol=1e-3)

  example = {
      "images_jpeg_bytes": rgb_jpegs,
      "queries_xyt": queries_xyt,
      "tracks_XYZ": trajectories,
      "visibility": visibilities,
      "fx_fy_cx_cy": np.array(
          [FOCAL_LENGTH, FOCAL_LENGTH, WIDTH / 2, HEIGHT / 2]
      ),
      "extrinsics_w2c": extrinsics_w2c,  # add this
  }
  ```
- Each npz file from TAPVid-3D contains the following keys (a small loading sketch is shown after the directory layout below):

  ```
  images_jpeg_bytes: tensor of shape [# of frames, height, width, 3], each frame stored as JPEG bytes that must be decoded
  intrinsics: (fx, fy, cx, cy) camera intrinsics of the video
  tracks_XYZ: tensor of shape (# of frames, # of point tracks, 3), representing the 3D point trajectories; the last dimension is the (x, y, z) point position in meters, in camera coordinates
  visibility: tensor of shape (# of frames, # of point tracks), representing the visibility of each point along its trajectory
  queries_xyt: tensor of shape (# of point tracks, 3), representing the query point used in the benchmark as the initial given point to track; the last dimension is (x, y, t), where x, y are the pixel location of the query point and t is the query frame
  extrinsics_w2c: tensor of shape (# of frames, 4, 4)
  ```
- For PStudio, after running the official script, you will have a `tmp` folder inside, which is used to store the original videos (images) from the PStudio dataset. You can simply omit this folder.
- For ADT:
  - Download the original files from the Project Aria website and place the data in `data/projectaria_tools_adt_data`.

    ```bash
    pip install projectaria-tools'[all]'

    # get an adt_download_urls.json from the official website
    mkdir data/projectaria_tools_adt_data
    mv adt_download_urls.json data/projectaria_tools_adt_data

    # download the data with all the types; it costs 1.4 TB in total
    aria_dataset_downloader -c data/projectaria_tools_adt_data/adt_download_urls.json -o data/projectaria_tools_adt_data/ -l all
    ```
  - Then run the official script to download the query points and post-process them to store the image info inside the npz files.

    ```bash
    cd tapnet
    ADT_OUTPUT_DIRECTORY="tapvid3d_dataset/adt/"
    mkdir -p $ADT_OUTPUT_DIRECTORY
    PYTHON_DEBUG="False"
    conda activate projectaria  # if applicable, use a new env
    python3 -m tapnet.tapvid3d.annotation_generation.generate_adt --output_dir=$ADT_OUTPUT_DIRECTORY --debug=$PYTHON_DEBUG --split=all --adt_base_path data/projectaria_tools_adt_data
    ```

    Specifically, in `tapnet/tapnet/tapvid3d/annotation_generation/generate_adt.py`, `gcs_utils.download_tapvid3d_files(tmp_adt_dir, _SPLIT.value, "adt", _DEBUG.value)` downloads the npz files from the given URL to `tmp` (which costs about 11 GB), and `generate_adt_npz(_ADT_BASE_PATH.value, tmp_adt_dir, _OUTPUT_DIR.value)` then merges the images/videos from `adt_base_path` into the npz files.
Finally, we assume the data is structured as follows:
```
data/tapvid3d_dataset
├── adt
│   └── "scene_id".npz
└── pstudio
    └── "scene_id".npz
```
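After downloading, a quick sanity check like the following confirms that a scene file contains the keys listed above. This is only a minimal sketch: the file name is a placeholder, and `allow_pickle=True` is used because the per-frame JPEG bytes are stored as an object array.

```python
import io

import numpy as np
from PIL import Image

# Placeholder path; point it at any downloaded scene file.
npz_path = "data/tapvid3d_dataset/adt/scene_id.npz"

with np.load(npz_path, allow_pickle=True) as data:
    print("keys:", data.files)
    print("tracks_XYZ:", data["tracks_XYZ"].shape)          # (frames, tracks, 3)
    print("visibility:", data["visibility"].shape)          # (frames, tracks)
    print("queries_xyt:", data["queries_xyt"].shape)        # (tracks, 3)
    print("extrinsics_w2c:", data["extrinsics_w2c"].shape)  # (frames, 4, 4)

    # Frames are stored as JPEG bytes and must be decoded individually.
    first_frame = Image.open(io.BytesIO(data["images_jpeg_bytes"][0]))
    print("first frame size (W, H):", first_frame.size)
```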
We generate the data in the conversation format of InternVL. You can easily convert the generated jsonl files to your own format.
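Before adapting the annotations, it can help to look at one generated record. The sketch below prints the first entry of a produced jsonl file; the path is a placeholder, and the exact field names depend on the specific engine, so inspect rather than assume.

```python
import json

# Placeholder path; substitute any jsonl produced by the engines below.
jsonl_path = "path/to/generated_train.jsonl"

with open(jsonl_path, "r") as f:
    first_record = json.loads(f.readline())

print("fields:", list(first_record.keys()))
print(json.dumps(first_record, indent=2)[:1000])  # preview the InternVL-style sample
```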
- Run `python spatial_engine/camera_movement/calculate_frames_relations.py` to calculate the spatial relations between frames, e.g., their overlap ratios. After running this script, a parquet file containing this spatial information will be generated in `training_data/camera_movement` and `evaluation_data/camera_movement` (see the loading sketch after this list).
- Then run `python spatial_engine/camera_movement/camera_movement_engine_train_val.py` to generate the training and evaluation data.
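To inspect the frame-relation parquet (for example, to look at the overlap ratios), something like the sketch below works. The file name and column names are not assumed here; check what `calculate_frames_relations.py` actually writes.

```python
import pandas as pd

# Placeholder path; use the parquet file written to training_data/camera_movement
# (or evaluation_data/camera_movement) by calculate_frames_relations.py.
relations = pd.read_parquet("training_data/camera_movement/<generated_file>.parquet")

print(relations.columns.tolist())  # e.g. frame identifiers and overlap ratios
print(relations.head())
```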
- Run `python spatial_engine/utils/scannet_utils/make_visibility_info.py` to compute the visibility information for each frame. Note that when using this information, loading the file takes a long time, about several minutes.
- Run `python spatial_engine/depth_perception/depth_estimation_dot_engine.py` to generate the training and evaluation data for visual-based depth estimation (a depth-lookup sketch follows this list).
- Run `python spatial_engine/depth_perception/depth_estimation_coor_engine.py` to generate the training and evaluation data for coordinate-based depth estimation.
- Run `python spatial_engine/depth_perception/depth_comparison_dot_engine.py` to generate the training and evaluation data for visual-based depth comparison.
- Run `python spatial_engine/depth_perception/depth_comparison_coor_engine.py` to generate the training and evaluation data for coordinate-based depth comparison.
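For reference, the quantity behind both depth tasks is the per-pixel depth of the processed ScanNet frames. A minimal sketch, assuming the standard ScanNet export where depth frames are 16-bit PNGs in millimeters (0 means no measurement); the path below is hypothetical and depends on how you processed the data.

```python
import numpy as np
from PIL import Image

# Hypothetical path to a processed ScanNet depth frame.
depth_path = "data/scannet/scene0000_00/depth/000000.png"

# Assumed ScanNet convention: 16-bit PNG storing depth in millimeters.
depth_mm = np.array(Image.open(depth_path))
depth_m = depth_mm.astype(np.float32) / 1000.0  # convert to meters

x, y = 320, 240  # example pixel coordinates (column, row)
print(f"depth at ({x}, {y}): {depth_m[y, x]:.3f} m")
```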
- Run `python spatial_engine/visual_correspondence/visual_correspondence_qa_engine_dot_2_multichoice.py` to generate the training and evaluation data for visual correspondence in dot-based multichoice format.
- Run `python spatial_engine/visual_correspondence/visual_correspondence_qa_engine_coor_2_coor.py` to generate the training and evaluation data for visual correspondence in coordinate-based format.
- Run `python spatial_engine/object_perception/compute_object_visibility.py` to compute the visibility information for each object. After running this script, a pkl file containing this visibility information will be saved to `training_data/object_perception` and `evaluation_data/object_perception` (see the loading sketch after this list).
- Run `bash find_object_coverage.sh` to compute the coverage information for each object in each scene. You could check `spatial_engine/object_perception/single_object_coverage_finder.py` to modify the parameters and run it with several processes.
- After generating all the coverage information for each scene, run `python spatial_engine/object_perception/merge_object_coverage.py` to merge the coverage information.
- Run `python spatial_engine/object_perception/single_object_perception_engine.py` to generate the training and evaluation data for object perception.
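The visibility (and merged coverage) files are pickle files; a hedged sketch for inspecting them is below. The file name is a placeholder, and the internal structure should be checked rather than assumed.

```python
import pickle

# Placeholder path; use the pkl written to training_data/object_perception
# (or evaluation_data/object_perception) by compute_object_visibility.py.
pkl_path = "training_data/object_perception/<object_visibility_file>.pkl"

with open(pkl_path, "rb") as f:
    visibility_info = pickle.load(f)

print(type(visibility_info))
if isinstance(visibility_info, dict):
    first_key = next(iter(visibility_info))
    print("example entry:", first_key, "->", visibility_info[first_key])
```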
- Run `python spatial_engine/object_movement/single_object_movement_engine_coord.py` to generate the training and evaluation data for object movement in coordinate-based format. After running this script, images will be extracted from the npz files and saved to `data/my_tapvid3d_images` (a minimal extraction sketch follows this list).
- Run `python spatial_engine/object_movement/single_object_movement_engine_dot.py` to generate the training and evaluation data for object movement in dot-based format.
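For reference, the image extraction performed by the engine amounts to writing out the JPEG bytes stored in each TAPVid-3D npz file. The sketch below shows this under example paths; the engine script already does it for you.

```python
import os

import numpy as np

# Example paths only; the engine script handles extraction automatically.
npz_path = "data/tapvid3d_dataset/adt/scene_id.npz"
out_dir = "data/my_tapvid3d_images/scene_id"
os.makedirs(out_dir, exist_ok=True)

with np.load(npz_path, allow_pickle=True) as data:
    for i, jpeg_bytes in enumerate(data["images_jpeg_bytes"]):
        # Each frame is already JPEG-encoded, so write the bytes directly.
        with open(os.path.join(out_dir, f"{i:06d}.jpg"), "wb") as f:
            f.write(bytes(jpeg_bytes))
```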
We use the InternVL-2 models for the experiments in our paper. You can follow their official instructions to fine-tune the models with the generated data and reproduce our results; other VLMs can also be used. Below are some of the training details used in our experiments, and more can be found in our paper.
- All images should be resized to `H*W = 1296*968` for training (a minimal resize sketch follows this list).
- Different from the original InternVL setting of dynamically allocating 12 image tiles to all images, we make sure each image can use up to 6 image tiles for training and evaluation. Please change this line to `max_num=self.max_dynamic_patch`. Pay attention to GPU OOM issues, and you may change the `--max_seq_length` to `8192`.
- The training config used for our main paper is in `data/configs/mix3M.json`. Note that this config only uses 3M training samples, and we use LoRA training for research efficiency. You could use more data and fully fine-tune the whole model to get much better performance.
- To preserve the original ability of the model, some general instruction-following data should be added to the training data.
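A minimal resize sketch with PIL, assuming the ScanNet native orientation of 1296 (width) by 968 (height); the paths are placeholders, and you should double-check the orientation against your extracted frames before batch-converting.

```python
from PIL import Image

# Placeholder paths; adapt to your own preprocessing pipeline.
src = "path/to/frame.jpg"
dst = "path/to/frame_resized.jpg"

img = Image.open(src)
# PIL's resize takes (width, height); 1296x968 matches ScanNet's native color resolution.
img.resize((1296, 968), Image.BICUBIC).save(dst)
```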
If you find our work and this codebase helpful, please consider starring this repo and citing:
@article{xu2025multi,
title={Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models},
author={Xu, Runsen and Wang, Weiyao and Tang, Hao and Chen, Xingyu and Wang, Xiaodong and Chu, Fu-Jen and Lin, Dahua and Feiszli, Matt and Liang, Kevin J.},
journal={arXiv preprint arXiv:2505.17015},
year={2025}
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
See contributing and the code of conduct.