Screenshot from the Unity viewport.
Ollama is used to manage locally installed LLM models.
- Download ollama from https://ollama.com/download.
- `ollama pull gemma:2b-instruct`. Pull the model file, e.g. gemma:2b-instruct.
- Verification:
  - `ollama show gemma:2b-instruct --modelfile`. Inspect the model file data.
  - `ollama run gemma:2b-instruct`. Open a chat in the console to check everything is OK.
Requires Python version >3.9 and <=3.11. The TTS library does not work with the latest Python 3.12.
- `python3.11 -m venv .venv`. Create a virtual environment to not pollute global Python packages.
  - You can also use conda.
  - If you have Python <=3.11 alongside the latest 3.12, use it instead, e.g. `C:/programs/install/Python310/python.exe -m venv .venv`.
- `source .venv/Scripts/Activate`. Activate the virtual environment.
- `pip install -r requirements.txt`. Install dependencies.
- (Optional) Install PyTorch with CUDA for GPU acceleration for TTS: `pip install torch==2.2.2+cu118 -f https://download.pytorch.org/whl/torch_stable.html`.
- (Optional) Install DeepSpeed (see below). However, I would first test the app without it.
- `python.exe main.py serve --config "config_xtts.yaml"`. Start the server. The first run will also download the XTTS v2.0 model.
- Verification:
  - http://localhost:8080/index.html should open the control panel.
You can find other commands in the makefile:
- `make curl_prompt_get` and `make curl_prompt_post`. Send a prompt remotely through the `/prompt` endpoint. You can also use it in your scripts (see the Python example below).
- `tts --list_models` or `make tts-list-models`. List models available in the TTS Python package.
- `make xtts-list-speakers`. List speakers available for the XTTS v2.0 model.
- `make xtts-create-speaker-samples`. Write speaker samples for the XTTS v2.0 model into the `out_speaker_samples` directory. You will have 55+ .wav files (one per speaker) that say the same test sentence. Use them to select your preferred speaker.
- `make xtts-speak-test`. Speak the test sentence and write the result to `out_speak_result.wav`. Uses the same configuration as the app server.
- `make xtts-clone-test`. Speak the test sentence using voice cloning based on the `voice_to_clone.wav` file (not provided in the repo). Write the result to `out_speak_result.wav`.
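If you prefer a script over curl, here is a minimal sketch of hitting the `/prompt` endpoint from Python. It assumes the server runs locally on the default port; `requests` may need to be installed separately, and the response body is not inspected since the answer is delivered to the connected client.

```python
# Minimal sketch: send a prompt to the running server from your own script.
# Assumes the default local address; `requests` is an extra install.
import requests

resp = requests.get(
    "http://localhost:8080/prompt",
    params={"value": "Who is Michael Jordan?"},  # same query as the curl example
)
print("Prompt accepted:", resp.status_code)
```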
Import the Unity project from `unity-project`. Open the `OutdoorsScene` (found in the `Project` window under `Assets`). You should see the 3D model in the viewport and all objects in the `Hierarchy` window (just like in the image above). Click Unity's run button. The Unity client should automatically connect to the Python server. At this point, go ahead and ask the character your question. Remember that the first question after a server restart takes longer (it loads the AI models into VRAM).
Read "Oculus Lipsync for Unity Development" beforehand (requires Windows/macOS). Their documentation also contains the "Download and Import" section in case of any problems. Make sure to accept their licensing.
I've added Oculus Lipsync's source code to this repo as it required some extra fixes inside C# scripts. Tested on Windows.
To create a production build follow official docs: Publishing Builds.
Once everything is working, try adding DeepSpeed and TTS streaming for better performance (see below).
See config.example.yaml for all configuration options. If you followed the instructions above, you have already used config_xtts.yaml. See server/config.py for default values.
This depends on whether your selected LLM works with Ollama or not. If it does:

- `ollama pull <model_name>`.
- Update the app's config file: `llm.model: <model_name>` (see the `config.example.yaml` file).
- Update `server\app_logic.py`:
  - Rewrite the `GemmaChatContext` class to generate a prompt based on: the user's query, past messages, and the LLM's expected format (see the sketch after this list).
  - In the `AppLogic` class, there is an `_exec_llm()` function. It might need adjusting, e.g. to call `await self.llm.chat()` instead of `await self.llm.generate()`. Depends on the model.
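For illustration, a hypothetical replacement for the prompt-building class might look like the sketch below. The class name, method names, and the tag-based template are all assumptions; adapt them to whatever format your model expects and to the actual interface in `server/app_logic.py`.

```python
# Hypothetical sketch of a chat-context class for a different LLM prompt format.
# None of these names come from the repo; only the responsibilities do.
class MyChatContext:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.history: list[tuple[str, str]] = []  # (role, text) pairs of past messages

    def add_message(self, role: str, text: str) -> None:
        self.history.append((role, text))

    def build_prompt(self, user_query: str) -> str:
        # Format the conversation the way *your* LLM expects it.
        lines = [f"<|system|>{self.system_prompt}"]
        lines += [f"<|{role}|>{text}" for role, text in self.history]
        lines += [f"<|user|>{user_query}", "<|assistant|>"]
        return "\n".join(lines)
```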
If you want to connect a model that is not available in Ollama, rewrite `AppLogic`'s `_exec_llm()` function. You get the chat history, the current message, and the config file values. The function is async and returns a string. I assume this is the API you would expect.
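Under those assumptions, a replacement could be as simple as an async HTTP call. The sketch below targets a hypothetical OpenAI-style endpoint and uses made-up parameter names; only the "async function that returns a string" contract comes from the repo.

```python
# Sketch of a custom _exec_llm() for a model not served by Ollama.
# The endpoint, payload shape, and parameter names are assumptions.
import aiohttp

class AppLogic:
    async def _exec_llm(self, query: str, chat_history: list[dict], cfg) -> str:
        payload = {
            "model": cfg.llm.model,  # value taken from the YAML config
            "messages": [*chat_history, {"role": "user", "content": query}],
        }
        async with aiohttp.ClientSession() as session:
            async with session.post(
                "http://localhost:5000/v1/chat/completions", json=payload
            ) as resp:
                data = await resp.json()
        return data["choices"][0]["message"]["content"]  # plain answer text
```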
With the TTS library, you can select from many available models. You get e.g. Bark, tortoise-tts, Glow/VITS, and more. I went with XTTS v2.0 as it performed best when I did a blind test on Hugging Face. Each model usually supports many languages and speakers. Not only can you select the model itself, but you also choose which of many male/female voices suits your avatar best.
Everything is controlled from the config file. See config.example.yaml for details. In "other commands" I've listed a few scripts that make it easier to choose. E.g. generating a sample .wav file for each speaker available in a selected model.
To go beyond the TTS library, check the `_exec_tts()` function inside `server/app_logic.py`. I recommend splitting the text into separate sentences and streaming them to the client one by one.
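The sentence-splitting idea can be as simple as the sketch below; `synthesize` and `send_to_client` are placeholder callables, not functions from the repo.

```python
# Rough sketch of "split into sentences, stream one by one".
# `synthesize` and `send_to_client` are placeholders, not repo functions.
import re

def split_sentences(text: str) -> list[str]:
    # Naive split on ., ! and ? followed by whitespace; good enough for TTS chunking.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

async def speak_streamed(text: str, synthesize, send_to_client) -> None:
    for sentence in split_sentences(text):
        wav_bytes = synthesize(sentence)   # synthesize just this sentence
        await send_to_client(wav_bytes)    # the client starts playing before the rest is ready
```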
Set `tts.sample_of_cloned_voice_wav` in `config_xtts.yaml` to point to the voice sample file. Unfortunately, this usually prolongs the inference time for TTS. I found one of the built-in voices good enough. If you use XTTS v2.0 with either DeepSpeed or streaming, you get voice cloning for free. It's part of XTTS v2.0.
That's not the only way to customize the voice. You can always apply a filter to the output. Shift pitch, speed up, trim, cut etc. Given the wide range of available base voices, it should be enough to create something custom. This has a smaller runtime cost than stacking another neural net.
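As an example, a quick offline filter pass over a generated sample could look like this. librosa and soundfile are not project dependencies (install them yourself), and the output file name is made up.

```python
# Example post-processing of a generated sample; librosa/soundfile are extra installs.
import librosa
import soundfile as sf

y, sr = librosa.load("out_speak_result.wav", sr=None)   # file produced by `make xtts-speak-test`
y = librosa.effects.pitch_shift(y, sr=sr, n_steps=-2)   # slightly deeper voice
y = librosa.effects.time_stretch(y, rate=1.05)          # ~5% faster speech
y, _ = librosa.effects.trim(y, top_db=35)               # cut leading/trailing silence
sf.write("out_speak_filtered.wav", y, sr)               # made-up output name
```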
DeepSpeed is a library that can speed up TTS inference ~2x. You must satisfy the following conditions:
- Use the CUDA version of PyTorch.
- In the config, set both `tts.deepspeed_enabled` and `tts.use_gpu` to True (which are the defaults).
- The TTS model is `tts_models/multilingual/multi-dataset/xtts_v2` (which is the default).
- DeepSpeed is installed.
To install DeepSpeed on a non-Windows machine, use `pip install deepspeed`. For Windows, the official readme suggests manually building it instead. Fortunately, the community has done this for us:

- Find a pre-compiled wheel based on your Python, CUDA, and PyTorch versions. Check the following repositories:
- Download the .whl file.
- Install: `pip install {deep-speed-wheel-file-name-here}`.
  - If you want to uninstall it later, use `pip uninstall deepspeed`.
I'm using Python 3.10. I've installed PyTorch with CUDA using `pip install torch==2.2.2+cu118 -f https://download.pytorch.org/whl/torch_stable.html` (PyTorch 2.2.2, CUDA 11.8). From daswer123/deepspeed-windows-wheels I've downloaded `deepspeed-0.13.1+cu118-cp310-cp310-win_amd64.whl`. The next time you start the server, you should see the confirmation in the console.
Activating DeepSpeed replaces the TTS class with my custom FakeTTSWithRawXTTS2. It's just a thin wrapper around raw XTTS v2.0 that has the same API. It also enables voice cloning for free.
Streaming means that we split the generated text into smaller chunks. There is a crossfade to mask chunk transitions. A small first chunk means fast time-to-first-sound. It's disabled by default, as I don't know if your GPU is fast enough to handle this task. Choppy audio makes for a bad first impression. To enable:
- Use the CUDA version of PyTorch.
- In the config, set `tts.streaming_enabled` and `tts.use_gpu` to True.
The config file allows you to adjust both chunk size and crossfade. If you hear frequent 'popping' sounds, increase both chunk size and crossfade. If the sound is interrupted, lower the crossfade.
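For intuition, the crossfade boils down to an overlap-add with a linear fade, roughly like the sketch below. This is not the app's actual implementation; the names and the fade shape are illustrative.

```python
# Conceptual crossfade between consecutive audio chunks (illustrative, not the app's code).
import numpy as np

def append_with_crossfade(out: np.ndarray, chunk: np.ndarray, crossfade: int) -> np.ndarray:
    """Append `chunk` to `out`, blending the overlapping samples with a linear fade."""
    crossfade = min(crossfade, len(out), len(chunk))
    if crossfade == 0:
        return np.concatenate([out, chunk])
    fade_in = np.linspace(0.0, 1.0, crossfade)
    overlap = out[-crossfade:] * (1.0 - fade_in) + chunk[:crossfade] * fade_in
    return np.concatenate([out[:-crossfade], overlap, chunk[crossfade:]])
```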
Streaming replaces the TTS class with my custom FakeTTSWithRawXTTS2. It's just a thin wrapper around raw XTTS v2.0 that has the same API. It also enables voice cloning for free.
Usually caused by a lack of VRAM.
- Check that there are no other apps that have loaded models on GPU (video games, stable diffusion, etc.). Even if they don't do anything ATM, they still take VRAM.
- Close Ollama.
- Make sure VRAM usage is at 0.
- Start Ollama.
- Restart the app.
- Ask a question to load all models into VRAM.
- Check you are not running out of VRAM.
TL;DR: Restart Ollama, check VRAM.
This feature is not available, but you can easily add it yourself. There is a `/prompt` endpoint (either as GET or POST) used to send a query: `curl "http://localhost:8080/prompt?value=Who%20is%20Michael%20Jordan%3F"`.
The simplest way is a separate script that wraps around a speech-to-text model. ATM Whisper models are popular: faster-whisper, insanely-fast-whisper, etc.
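A minimal wrapper could transcribe a recorded file with faster-whisper and forward the text to the existing `/prompt` endpoint, roughly like this (the audio file name and model size are placeholders):

```python
# Sketch: transcribe a recording and forward it to the avatar's /prompt endpoint.
from faster_whisper import WhisperModel
import requests

model = WhisperModel("base", device="cpu", compute_type="int8")  # pick a larger model if you have the VRAM
segments, _info = model.transcribe("question.wav")               # placeholder: your recorded audio
text = " ".join(segment.text.strip() for segment in segments)

requests.get("http://localhost:8080/prompt", params={"value": text})
```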
The robotic feeling you get is when the mouth just lerps between predefined shape keys. It's more natural to slur the shapes. Oculus lipsync allows you to adjust this tolerance. This setting is usually adjusted based on feeling and not any objective metric. I've chosen a conservative value to preserve a closed mouth on 'm', 'b', and 'p' sounds. From what I've perceived, this is the most important shape.
The Unity client opens a WebSocket connection to the Python server. Add a new `onJsonMessage` handler to the existing object that has `WebSocketClientBehaviour`. For particle effects, I've already done that inside `WebSocketMsgHandler.cs`. Its `OnMessage()` function contains a switch based on the JSON object's `type` field.
Example flow when an action is triggered from the web browser:

- The user clicks on the ParticleSystemsRow radio button.
- The browser sends the `{ type: 'play-vfx', vfx: '<vfx-name>' }` JSON through the WebSocket to the Python server.
- The Python server forwards the `"play-vfx"` message to the Unity client (see the sketch after this list).
- In Unity, my object with the `WebSocketClientBehaviour` component receives the message and determines it to be a string. This component is responsible for low-level WebSocket operations as well as differentiating between JSON and WAV audio messages. The `WebSocketClientBehaviour.onJsonMessage()` delegates are called, which includes my object with `WebSocketMsgHandler`.
- `WebSocketMsgHandler.OnMessage()` parses the message's `"type"` field. It can then trigger the corresponding action.
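Conceptually, the server-side forwarding step is just a re-broadcast of the browser's JSON to the connected Unity client(s). The sketch below is illustrative only and does not reflect the repo's actual server code; `unity_clients` and `client.send` are assumptions.

```python
# Illustrative forwarding step (not the repo's actual implementation).
import json

async def on_browser_message(message: str, unity_clients: set) -> None:
    """Re-broadcast browser JSON messages (e.g. 'play-vfx') to every connected Unity client."""
    data = json.loads(message)
    if data.get("type") == "play-vfx":
        for client in unity_clients:
            await client.send(message)  # the Unity side parses the same JSON in WebSocketMsgHandler
```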