Jinkun Hao1*,
Naifu Liang2*,
Zhen Luo3,4*,
Xudong Xu2‡,
Weipeng Zhong2,
Ran Yi1,
Yichen Jin5,
Zhaoyang Lyu2,
Feng Zheng4,
Lizhuang Ma1✉️,
Jiangmiao Pang2
1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3SII,
4Southern University of Science and Technology, 5Peking University
* equal contribution, ‡ project lead, ✉️ corresponding author
NeurIPS 2025 Spotlight
The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realism and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts.
- ✅ Release MesaTask-10K layout dataset
- ✅ Release inference code
- ⬜ Release training code and training dataset
- ⬜ Release sim-ready scene generation pipeline
This section provides a quick-start guide for setting up the environment and running the demo: installing the required dependencies, downloading the pretrained models, and preparing the datasets.
- Create and activate conda environment
# Create conda environment with Python 3.10
conda create -n MesaTask python=3.10
conda activate MesaTask
- Install the remaining dependencies
# Install PyTorch
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
# Install PyTorch3D
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
# Install remaining requirements
pip install -r requirements.txt
- Download Blender. We tested our code with Blender 4.3.2.
wget https://download.blender.org/release/Blender4.3/blender-4.3.2-linux-x64.tar.xz
tar -xvJf blender-4.3.2-linux-x64.tar.xz
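After completing the steps above, you can optionally sanity-check the setup. The snippet below is a minimal sketch (not part of the official instructions): it verifies that PyTorch and PyTorch3D import correctly, reports CUDA visibility, and confirms that the Blender binary extracted above is runnable, assuming the tarball was unpacked in the current directory.

```python
# Optional sanity check for the environment set up above (a sketch).
import subprocess

import torch
import pytorch3d

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pytorch3d:", pytorch3d.__version__)

# Path assumes Blender 4.3.2 was extracted into the current directory as shown above.
blender_bin = "./blender-4.3.2-linux-x64/blender"
subprocess.run([blender_bin, "--version"], check=True)  # should report Blender 4.3.2
```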
We host our dataset on Hugging Face. The current version of the layout dataset contains 3D assets in GLB format only; a version with URDF (PartNet-Mobility) assets will be released soon.
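One convenient way to fetch the dataset is with `huggingface_hub`, as in the sketch below; the repository ID is a placeholder, so substitute the actual ID from the dataset's Hugging Face page.

```python
# Download the MesaTask-10K dataset from Hugging Face (a sketch).
# DATASET_REPO_ID is a hypothetical placeholder -- use the real repo ID.
from huggingface_hub import snapshot_download

DATASET_REPO_ID = "<org>/MesaTask-10K"
snapshot_download(
    repo_id=DATASET_REPO_ID,
    repo_type="dataset",
    local_dir="MesaTask-10K",
)
```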
The dataset structure is as follows:
MesaTask-10K/
|-- MesaTask_model
|-- Asset_annotation.json
|-- sbert_text_features.pkl
|-- Assets_library/
|-- {uid}.glb
|-- ...
|-- Layout_info/
|-- bathroom_vanity/
|-- bathroom_vanity_0000/
|-- front.png
|-- layout.json
|-- bathroom_vanity_0001/
|-- ...
|-- coffee_table/
|-- dining_table/
|-- dressing_table/
|-- kitchen_counter/
|-- office_table/
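As a quick orientation, the sketch below walks the `Layout_info/` folders shown above and counts the available scenes per table type (it assumes the dataset was downloaded to `./MesaTask-10K`).

```python
# Count scenes per table type by walking the Layout_info/ directory (a sketch).
from pathlib import Path

layout_root = Path("MesaTask-10K/Layout_info")
for table_type in sorted(p for p in layout_root.iterdir() if p.is_dir()):
    n_scenes = sum(1 for s in table_type.iterdir() if (s / "layout.json").exists())
    print(f"{table_type.name}: {n_scenes} scenes")
```

To visualize a single layout from the dataset, use the `vis_single.py` script: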
cd dataset
python vis_single.py path/to/layout.json --output_dir vis_data
# python vis_single.py MesaTask-10K/Layout_info/office_table/office_table_0001/layout.json --output_dir vis_data
Our pretrained MesaTask model is available on Hugging Face.
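If the pretrained weights are hosted as a standalone model repository (rather than shipped inside the dataset snapshot above), they can be fetched the same way; `MODEL_REPO_ID` below is a placeholder, and the local directory matches the model path used in the inference commands later in this README.

```python
# Fetch the pretrained MesaTask weights (a sketch; MODEL_REPO_ID is a placeholder).
from huggingface_hub import snapshot_download

MODEL_REPO_ID = "<org>/MesaTask_model"
snapshot_download(repo_id=MODEL_REPO_ID, local_dir="MesaTask-10K/MesaTask_model")
```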
MesaTask provides a two-step inference pipeline:
- Generate task information from a task instruction
- Generate the 3D scene layout from the task information and render the scene
First, generate task information from a task description and table type:
python get_task_info.py \
--task_name "Organize books and magazines on the table" \
--table_type "Nightstand" \
--api_key "your_api_key" \
--model "gpt-4o" \
--output_dir "output"
This will:
- Create a new task folder (e.g., `output/task_001/`)
- Save the generated task information to `task_info.json` in that folder
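For batch workflows, this step can also be scripted by looping over several instructions and invoking `get_task_info.py` with the same flags as above. The sketch below is illustrative: the second task and its table type are made-up examples, so use values your setup actually supports.

```python
# Generate task information for several task instructions in a loop (a sketch).
import subprocess

tasks = [
    ("Organize books and magazines on the table", "Nightstand"),
    ("Set up a desk for writing a report", "Office Table"),  # hypothetical example
]
for task_name, table_type in tasks:
    subprocess.run(
        [
            "python", "get_task_info.py",
            "--task_name", task_name,
            "--table_type", table_type,
            "--api_key", "your_api_key",
            "--model", "gpt-4o",
            "--output_dir", "output",
        ],
        check=True,
    )
```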
Then, generate and render the 3D scene based on the task information:
python inference.py \
--input_file output/task_001/task_info.json \
--mesatask_model_path path/to/model \
--rendering
# (Optional) Convert GLB assets to OBJ format for physics-based layout optimization
python tools/layoutopt/glb2obj.py \
--glb_dir ./MesaTask-10K/Assets_library \
--obj_dir ./MesaTask-10K/Assets_library_obj \
--max_workers 16
# Run inference with physics-based optimization enabled
python inference.py \
--input_file output/task_001/task_info.json \
--mesatask_model_path ./MesaTask-10K/MesaTask_model \
--physical_optimization \
--rendering
The output structure will be:
output/task_001/
├── task_info.json # Task information
└── scene_001/
├── scene_layout.txt # Generated scene layout
├── scene_processed_scene.json # Processed scene with object retrieval
├── scene_reconstructed_bpy.glb # 3D scene file
├── rendered_views/ # Basic rendered views
├── optimized_scene/ # (optional)
│ ├── scene_optimized.json
│ ├── scene_optimized_reconstructed_bpy.glb
│ └── rendered_views/
└── scene_retrieval_results.json # Object retrieval details
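To take a quick look at a generated scene without opening Blender, one option is to load the GLB with `trimesh`, as in the sketch below (`trimesh` is an extra dependency not listed in this README's requirements).

```python
# Inspect a generated scene GLB outside Blender (a sketch; requires `trimesh`).
import trimesh

scene = trimesh.load("output/task_001/scene_001/scene_reconstructed_bpy.glb")
print("objects in scene:", len(scene.geometry))
print("scene bounds (min, max):\n", scene.bounds)
```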
If you find this work useful, please consider citing our paper:
This work is licensed under the Apache License.