
[ICCV 2025] InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes


InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes

InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes,
Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, Hujun Bao
ICCV 2025

(Teaser video: teaser.mp4)

Pipeline

Installation

  • Installation of Scene Decomposition.
conda create -n instascene python=3.9 -y
conda activate instascene 

pip install torch==2.1.0+cu118 torchvision==0.16.0+cu118 torchaudio==2.1.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

pip install --extra-index-url=https://pypi.nvidia.com "cudf-cu11==24.2.*" "cuml-cu11==24.2.*"

pip install -r requirements.txt

Install CropFormer for instance-level segmentation.

cd semantic_modules/CropFormer
cd mask2former/modeling/pixel_decoder/ops
sh make.sh
cd ../../../../
git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
pip install -e .
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/mcordts/cityscapesScripts.git
cd ..
pip install -r requirements.txt
pip install -U openmim
mim install mmcv
mkdir ckpts

Manually download the CropFormer checkpoint into semantic_modules/CropFormer/ckpts.

  • Installation of in-situ generation (code not yet released; see ToDos).

Data Preprocessing

Please follow the steps below to process your custom dataset, or directly download our preprocessed datasets.

1. Run instance-level segmentation.

  • It's fine to use other 2D segmentation models, but make sure the input masks don't exhibit overly complex hierarchical relationships; otherwise, our method defaults to the finest level.
cd semantic_modules/CropFormer
bash run_segmentation.sh "$DATA_DIR"
cd ../..

2. Training 2DGS.

Follow the original repository to train the 2DGS model.

python train.py -s data/3dovs/bed -m output/3dovs/bed/train_2dgs

An optional monocular normal prior (StableNormal) is available to enhance reconstruction quality.

## Prepare Normal Priors
cd semantic_modules
git clone https://github.com/Stable-X/StableNormal && cd StableNormal
pip install -r requirements.txt
mv ../inference_stablenormal.py ./
python inference_stablenormal.py "$DATA_DIR"
cd ../..

## Training 2DGS with Normal Priors 
python train.py -s data/3dovs/bed --w_normal_prior stablenormal_normals -m output/3dovs/bed/train_2dgs
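To give an intuition for how such a normal prior typically supervises training, here is a hedged sketch of a prior loss combining an L1 term with a cosine (angular) term between rendered and predicted normal maps. This is illustrative only; the actual loss used in train.py may differ, and the function name is our own.

```python
# Illustrative sketch (not the repository's exact loss): penalize both the
# per-channel L1 difference and the angular deviation between the rendered
# normal map and the StableNormal prediction.
import numpy as np

def normal_prior_loss(rendered, prior):
    """rendered, prior: (H, W, 3) arrays of unit normals."""
    # L1 term: absolute per-channel difference, averaged over pixels.
    l1 = np.abs(rendered - prior).sum(axis=-1).mean()
    # Cosine term: 1 - dot product, zero when normals agree exactly.
    cos = (1.0 - (rendered * prior).sum(axis=-1)).mean()
    return l1 + cos
```

Identical normal maps give a loss of zero; flipped or noisy normals increase both terms.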

Put the trained point_cloud.ply file into the $DATA_DIR directory. After successfully executing the above steps, the data directory should be structured as follows:

data
   |——————3D_OVS
   |   |——————bed
   |      |——————point_cloud.ply
   |      |——————images
   |         |——————00.jpg
   |         ...
   |      |——————sam
   |         |——————mask
   |            |——————00.png
   |            ...
   |      |——————sparse
   |         |——————0
   |            |——————cameras.bin
   |            ...
   |      |——————(optional) stablenormal_normals
   |         |——————00.png
   |         ...
   |     ...
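Before launching training, it can save time to verify that a scene folder matches the layout above. The following is a minimal sketch using only the paths listed in this README; `check_scene` is a helper name of our own, not part of the repository.

```python
# Sketch: verify a scene folder matches the expected layout before training.
from pathlib import Path

REQUIRED = [
    "point_cloud.ply",
    "images",
    "sam/mask",
    "sparse/0/cameras.bin",
]
OPTIONAL = ["stablenormal_normals"]  # normal priors are optional

def check_scene(scene_dir):
    """Return the list of required entries missing from one scene folder."""
    scene = Path(scene_dir)
    return [rel for rel in REQUIRED if not (scene / rel).exists()]

if __name__ == "__main__":
    import tempfile
    # Build a minimal fake scene and confirm nothing is reported missing.
    with tempfile.TemporaryDirectory() as tmp:
        scene = Path(tmp) / "bed"
        (scene / "images").mkdir(parents=True)
        (scene / "sam" / "mask").mkdir(parents=True)
        (scene / "sparse" / "0").mkdir(parents=True)
        (scene / "point_cloud.ply").touch()
        (scene / "sparse" / "0" / "cameras.bin").touch()
        print(check_scene(scene))  # -> []
```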

Training with Spatial Contrastive Learning

Note that for simple scenes such as 3D-OVS (simple, object-centered, without overlap), there is no need to use spatial relationships to obtain robust semantic priors, as shown in our supplementary material. Single-view contrastive learning is sufficient to achieve strong performance.
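To make the single-view contrastive idea concrete, here is a hedged sketch (not the repository's implementation): features of pixels sharing a 2D instance mask are pulled toward their mask's mean feature, while mean features of different masks are pushed apart up to a margin. Function and variable names are illustrative.

```python
# Illustrative single-view contrastive loss over per-pixel features.
import numpy as np

def contrastive_loss(feats, mask_ids, margin=1.0):
    """feats: (N, D) pixel features; mask_ids: (N,) instance id per pixel."""
    ids = np.unique(mask_ids)
    # One mean ("center") feature per instance mask.
    centers = np.stack([feats[mask_ids == i].mean(axis=0) for i in ids])
    # Pull term: distance of each pixel feature to its own mask center.
    pull = np.mean([
        np.linalg.norm(feats[mask_ids == i] - c, axis=1).mean()
        for i, c in zip(ids, centers)
    ])
    # Push term: penalize pairs of mask centers closer than `margin`.
    push, pairs = 0.0, 0
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            d = np.linalg.norm(centers[a] - centers[b])
            push += max(0.0, margin - d)
            pairs += 1
    return pull + push / max(pairs, 1)
```

When every pixel already sits exactly on its mask center and centers are farther apart than the margin, the loss is zero.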

We train the model on an NVIDIA Tesla A100 GPU (40 GB) for 10,000 iterations, which takes about 20 minutes and uses less than 8 GB of GPU memory.

  • Reduce GPU memory usage and training time with a smaller --sample_batchsize (e.g., 8 * 1024) or with -r 2.
  • Use --gram_feat_3d for a more robust feature field in complex scenes.
  • It's normal for training to pause at the DBSCAN filter stage, since the background Gaussian points may be divided into multiple regions.
  • Use --consider_negative_labels to suppress floaters during background segmentation.
python train_semantic.py -s data/lerf/waldo_kitchen \
                         -m train_semanticgs \
                         --use_seg_feature --iterations 10000 \
                         --load_filter_segmap --consider_negative_labels

After training completes, we provide a GUI modified from Omniseg3D for real-time interactive segmentation. The point_cloud.ply in our preprocessed datasets already contains pretrained semantic features.

python semantic_gui.py \
  --ply_path data/lerf/waldo_kitchen/point_cloud.ply \
  --interactive_note lerf_waldo_kitchen \
  --use_colmap_camera \
  --source_path data/lerf/waldo_kitchen --resolution 1
  • Left Mouse for changing the rendering view
  • Click Mode + 0.9 Threshold + Right Mouse for segmenting the clicked instance
  • Clear Edit for clearing the segmentation cache
  • Delete 3D for removing the chosen Gaussians
  • Segment 3D for keeping only the chosen Gaussians
  • Reload Data for reloading the Gaussian model
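Conceptually, the click-plus-threshold selection compares the feature of the clicked Gaussian against all per-Gaussian semantic features and selects those whose cosine similarity exceeds the threshold. Here is a hedged sketch of that idea; the names are illustrative, not the GUI's internals.

```python
# Sketch: threshold-based selection by cosine similarity to a clicked Gaussian.
import numpy as np

def select_by_click(features, clicked_idx, threshold=0.9):
    """features: (N, D) per-Gaussian features. Returns a boolean selection."""
    # Normalize so the dot product equals cosine similarity.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f[clicked_idx]
    return sim > threshold
```

Raising the threshold toward 1.0 tightens the selection to near-identical features; lowering it grows the selected region.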
(Demo videos: Screencast.2025-07-24.13_31_27.mp4, Feishu20250723-192829.mp4)

ToDos

🔥 Feel free to raise any requests, including support for additional datasets or broader applications of segmentation~

  • Release project page and paper.
  • Release scene decomposition code.
  • Release in-situ generation code.

Acknowledgements

Some code is adapted from Omniseg3D, MaskClustering, and 2DGS++; thanks to the authors for their valuable work.

Citation

If you find this code useful for your research, please use the following BibTeX entry.

@inproceedings{yang2025instascene,
    title={InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes},
    author={Yang, Zesong and Yang, Bangbang and Dong, Wenqi and Cao, Chenxuan and Cui, Liyuan and Ma, Yuewen and Cui, Zhaopeng and Bao, Hujun},
    booktitle={ICCV},
    year={2025}
}
