CodeBotler is a system that converts natural language task descriptions into robot-agnostic programs that can be executed by general-purpose service mobile robots. It includes a benchmark (RoboEval) designed for evaluating Large Language Models (LLMs) in the context of code generation for mobile robot service tasks.
This project consists of two key components:
- CodeBotler: a web interface for generating general-purpose service mobile robot programs, along with a ROS (Robot Operating System) Action client for deploying these programs on a robot. You can explore CodeBotler's code generation in two ways: as a standalone system without a robot, as illustrated in the figure above, or deployed on a real robot.
- RoboEval: a code generation benchmark comprising 16 user task descriptions, each with 5 paraphrases of the prompt. It includes a symbolic simulator and a temporal trace evaluator, designed to assess LLMs on their ability to generate code for service mobile robot tasks.
Project website: https://amrl.cs.utexas.edu/codebotler
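To give a concrete sense of what these generated programs look like, below is a minimal, self-contained sketch in the style CodeBotler targets. The skill names used here (`get_current_location`, `go_to`, `ask`, `say`) are illustrative assumptions rather than the project's exact API, and the stub implementations exist only so the sketch runs standalone; on a real deployment the skills are provided by the robot interface.

```python
# Illustrative sketch of a robot-agnostic program in the style CodeBotler targets.
# The skill names (get_current_location, go_to, ask, say) are assumptions for
# illustration only; stubs are included so the sketch runs standalone.

def get_current_location() -> str:
    return "start"                                 # stub: pretend the robot starts at "start"

def go_to(location: str) -> None:
    print(f"[robot] navigating to {location}")     # stub navigation

def ask(person: str, question: str, options: list[str]) -> str:
    print(f"[robot] asking {person}: {question} {options}")
    return options[0]                              # stub: always pick the first option

def say(message: str) -> None:
    print(f"[robot] {message}")                    # stub speech

def task_program():
    # Task: "Ask Alice in her office if she wants her mail delivered, then report back."
    start_loc = get_current_location()
    go_to("Alice's office")
    response = ask("Alice", "Would you like your mail delivered?", ["Yes", "No"])
    go_to(start_loc)
    if response == "Yes":
        say("Alice would like her mail delivered.")
    else:
        say("Alice does not need her mail delivered.")

if __name__ == "__main__":
    task_program()
```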
We provide a conda environment to run our code. To create and activate the environment:
conda create -n codebotler python=3.10
conda activate codebotler
pip install -r requirements.txt
After setting up the conda environment, install PyTorch from PyTorch's official website, choosing the build that matches your CUDA version (note: do not install the CPU-only version).
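For example, at the time of writing, a CUDA 11.8 build can be installed with the command below; check PyTorch's website for the exact command matching your CUDA version.

```bash
# Example only: the cu118 suffix assumes CUDA 11.8; use the build matching your system
pip3 install torch --index-url https://download.pytorch.org/whl/cu118
```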
Language Model Options
- To use an OpenAI model, you will need an OpenAI API key, either saved in a file named `.openai_api_key` or in the `OPENAI_API_KEY` environment variable (an example is shown after this list).
- To use a PaLM model, you will need a Google Generative AI API key, either saved in a file named `.palm_api_key` or in the `PALM_API_KEY` environment variable.
- You can use any pretrained model compatible with the HuggingFace AutoModel interface, including open-source models from the HuggingFace repository such as StarCoder. Note that some models, including StarCoder, require you to agree to the HuggingFace terms of use, and you must be logged in using `huggingface-cli login`.
- You can also use a HuggingFace Inference Endpoint.
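For example, either of the following makes an OpenAI key available (a minimal sketch; replace the placeholder with your actual key, and note that the key file is assumed to be read from the directory you launch CodeBotler from):

```bash
# Option 1: environment variable (replace the placeholder with your actual key)
export OPENAI_API_KEY="sk-..."

# Option 2: key file (assumed to be read from the launch directory)
echo "sk-..." > .openai_api_key
```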
To run the web interface for CodeBotler-Deploy using the default options (OpenAI's `gpt-4` model), run:

python3 codebotler.py

This will start the server on `localhost:8080`. You can then open the interface by navigating to http://localhost:8080/ in your browser.
List of arguments:
- `--ip`: The IP address to host the server on (default is `localhost`).
- `--port`: The port to host the server on (default is `8080`).
- `--ws-port`: The port to host the websocket server on (default is `8190`).
- `--model-type`: The type of model to use. Options are `openai-chat` (default) or `openai` for OpenAI, `palm` for PaLM, and `automodel` for AutoModel.
- `--model-name`: The name of the model to use. Recommended options are `gpt-4` for GPT-4 (default), `text-davinci-003` for GPT-3.5, `models/text-bison-001` for PaLM, and `bigcode/starcoder` for AutoModel.
- `--robot`: Flag to indicate whether a robot is available (default is `False`).
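For example, a hypothetical invocation that exposes the interface on all network interfaces and uses the PaLM model (all values except the `0.0.0.0` address come from the list above) would look like:

```bash
# Hypothetical example combining the flags listed above
python3 codebotler.py --ip 0.0.0.0 --port 8080 --ws-port 8190 \
    --model-type palm --model-name "models/text-bison-001"
```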
Instructions for deploying on real robots are included in `robot_interface/README.md`.
The instructions below demonstrate how to run the benchmark using the open-source StarCoder model.
- Run code generation for the benchmark tasks using the following command:

  python3 roboeval.py --generate --generate-output completions/starcoder \
      --model-type automodel --model-name "bigcode/starcoder"

  This will generate the programs for the benchmark tasks and save them as a Python file in the output directory `completions/starcoder`. It assumes the default values for temperature (0.2), top-p (0.9), and num-completions (20), generating 20 programs for each task, which is sufficient for pass@1 evaluation.

  If you would rather not re-run inference, we have included saved output from every model in the `completions/` directory as a zip file. You can simply run:

  cd completions
  unzip -d <MODEL_NAME> <MODEL_NAME>.zip

  For example, you can run:

  cd completions
  unzip -d gpt4 gpt4.zip
- Evaluate the generated programs using the following command:

  python3 roboeval.py --evaluate --generate-output <Path-To-Program-Completion-Directory> --evaluate-output <Path-To-Evaluation-Result-File-Name>

  For example:

  python3 roboeval.py --evaluate --generate-output completions/gpt4/ --evaluate-output benchmark/evaluations/gpt4

  This will evaluate the generated programs from the previous step and save all the evaluation results in a Python file.

  If you would rather not re-run evaluation, we have included saved evaluation output from every model in the `benchmark/evaluations` directory.
- Finally, you can compute the pass@1 score for every task:

  python3 evaluate_pass1.py --llm codellama --tasks all

  or for a subset of tasks:

  python3 evaluate_pass1.py --llm codellama --tasks CountSavory WeatherPoll
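For reference, pass@1 is conventionally computed with the unbiased pass@k estimator of Chen et al. (2021) over the n sampled completions per task, of which c pass. The snippet below is a minimal sketch of that estimator; it is not the implementation in `evaluate_pass1.py`, which may differ in detail.

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# Shown only to illustrate the metric; not taken from evaluate_pass1.py.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n completions (of which c are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 20 completions per task (the default above), 7 of which pass.
print(pass_at_k(20, 7, 1))  # pass@1 = 7/20 = 0.35
```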