Teaching language models to think in 3D
AlphaSpace lets language models understand and manipulate objects in 3D space without needing eyes.
Traditional robots need computer vision to see objects. We've created a different approach: structured tokens that encode spatial information, allowing language models to reason about position, orientation, and physical relationships between objects.
AlphaSpace uses a hierarchical position-based token system:
- Global position tokens: Represent a 25x25 grid (<|row-col|>, e.g., <|5-10|>)
- Local position tokens: Provide 4x4 fine-grained positioning within each cell (<|local-row-col|>, e.g., <|local-2-3|>)
- Object attribute tokens: Encode object properties (<|color|><|object|>, e.g., <|red|><|cube|>)
- State tokens: Special indicators like <|empty|>, <|origin|>, and <|target|>
Our specialized training teaches models to manipulate these tokens to solve spatial problems.
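To make the encoding concrete, here is a minimal sketch of how a single object could be serialized into these tokens. It assumes a 100x100 workspace in which each of the 25x25 global cells is subdivided into 4x4 local positions; the repository's actual serializer (tokenize_desk in utils) may differ in detail.

def position_tokens(x: int, y: int) -> str:
    # Assumed mapping: 25 global rows/cols, each split into 4 local rows/cols (25 * 4 = 100)
    global_row, local_row = divmod(y, 4)
    global_col, local_col = divmod(x, 4)
    return f"<|{global_row}-{global_col}|><|local-{local_row}-{local_col}|>"

def object_tokens(color: str, shape: str, x: int, y: int) -> str:
    # Attribute tokens come first, followed by the hierarchical position tokens
    return f"<|{color}|><|{shape}|>" + position_tokens(x, y)

print(object_tokens("red", "cube", 51, 43))
# <|red|><|cube|><|10-12|><|local-3-3|>

Under this assumed mapping, the two-level scheme keeps the vocabulary small (625 global cells plus 16 local offsets) while still addressing every position in the workspace.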
- No cameras needed: Pure language-based spatial reasoning
- Precise manipulation: Robots can position objects at exact coordinates
- Natural instructions: Tell robots what to do in plain language
- Generalizes well: Works with novel objects and spatial arrangements
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from utils import tokenize_desk, SYSTEM_PROMPT

# Path to the AlphaSpace checkpoint and the device to run on (placeholders; set these for your setup)
model_path = "path/to/alphaspace-checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Define your workspace: each entry maps a color-shape label to that object's coordinates on the table
objects = [
    {"red-cube": [51, 43, 17]},
    {"black-cube": [44, 58, 17]},
    {"purple-cube": [74, 59, 17]},
    {"green-cube": [65, 82, 17]},
]
# Give a natural-language instruction referencing objects in the workspace
instruction = "Throw the red cube on top of the green cube"
# Serialize the workspace into spatial tokens and build the full prompt
desk, object_height = tokenize_desk(objects)
final_instruction = SYSTEM_PROMPT.format(object_height=object_height, instruction=instruction, TABLE_MAP=desk)
chat = [
    {"role": "user", "content": final_instruction.strip()}
]
tokenized_chat = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, use_system_prompt=False, return_tensors="pt")
generated_ids = model.generate(
    tokenized_chat.to(device),
    max_new_tokens=2048,
    do_sample=False,  # greedy decoding; sampling parameters such as temperature only apply when do_sample=True
)
# Get the solution
result = tokenizer.decode(generated_ids[0][tokenized_chat.shape[1]:], skip_special_tokens=True)
print(result)
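If you need numeric workspace coordinates back from the generated text, a small post-processing helper along the following lines can recover them. This is a sketch: it assumes the output still contains hierarchical position tokens (decode with skip_special_tokens=False if they are registered as special tokens) and uses the same assumed 25x25 / 4x4 mapping as the earlier sketch.

import re

def decode_position_tokens(text: str):
    # Pair each global token <|r-c|> with the <|local-r-c|> token that immediately follows it
    pattern = re.compile(r"<\|(\d+)-(\d+)\|><\|local-(\d+)-(\d+)\|>")
    coords = []
    for g_row, g_col, l_row, l_col in pattern.findall(text):
        y = int(g_row) * 4 + int(l_row)  # invert the assumed global/local split
        x = int(g_col) * 4 + int(l_col)
        coords.append((x, y))
    return coords

print(decode_position_tokens("<|10-12|><|local-3-3|>"))  # [(51, 43)]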
We're working on:
- Dynamic scene understanding
- Multi-step manipulation planning
- Physics-aware spatial reasoning
We're looking for collaborators and plan to expand the model's capabilities to additional spatial tasks. Check out the issues to get started.
@misc{dao2025alphaspaceenablingroboticactions,
      title={AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning},
      author={Alan Dao and Dinh Bach Vu and Bui Quang Huy},
      year={2025},
      eprint={2503.18769},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.18769},
}
AlphaSpace was created by Menlo Research. If you use it in your research, please cite our paper.