Showing 4 changed files with 176 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
--- | ||
title: Speech LLMs for Conversations | ||
date: 2024-05-09 | ||
tags: [llm, speech, conversations] | ||
categories: [Machine Learning] | ||
layout: post | ||
authors: [Shangeth, lepisma] | ||
latex: True | ||
--- | ||
|
||
With LLMs, building conversational systems has become easier. You no longer need to | ||
focus on the low-level details of categorizing semantics and designing | ||
responses. Instead, you can concentrate on controlling high-level behaviors via | ||
an LLM. This is the trend that we see most of the world moving towards as | ||
products are using vendor combinations of ASR, LLM, and TTS with some dialog | ||
management stitched in between. While this is going to be the norm soon, we want | ||
to keep exploring areas from where the next set of quality improvements will | ||
come. | ||
|
||
[Earlier](/speech-first-conversational-ai-revisited/) we discussed how spoken | ||
conversations are richer than pure text and how the gap would not be bridged by | ||
LLMs working purely on transcriptions. In one of our recent experiments, we built | ||
an efficient multi-modal LLM that takes speech directly to provide a better | ||
conversational experience. For production usage, the constraint here is that | ||
this should happen without losing the flexibility that you get in a text-only | ||
LLM around writing prompts, making changes, evaluating, and debugging. | ||
|
||
Below is a conversation with our recent in-house Speech LLM-based conversational | ||
system. Notice that because of the extra information in speech, some micro | ||
personalizations can happen, like the usage of gendered pronouns[^1]. You also get | ||
a lower impact of transcription errors and, in general, better responses to | ||
non-speech signals. With access to both speech and text domains, the model | ||
allows for more fluent turn-taking, though not demonstrated in the current | ||
conversation. In addition, our approach also reduces the combined model size | ||
(<2B) for taking speech to response, leading to lower compute latency as | ||
compared to larger systems. | ||
|
||
<style> | ||
.webvtt-player .media { | ||
display: unset; | ||
} | ||
|
||
.webvtt-player .container { | ||
width: unset; | ||
} | ||
|
||
.webvtt-player { | ||
font-family: sans-serif; | ||
font-size: 0.8em; | ||
} | ||
</style> | ||
|
||
<div id="webvtt-player" | ||
data-audio="../assets/audios/posts/speech-conversational-llms/audio.m4a" | ||
data-transcript="../assets/audios/posts/speech-conversational-llms/transcript.vtt" | ||
data-metadata="../assets/audios/posts/speech-conversational-llms/metadata.vtt" /> | ||
|
||
<script src="https://umd-mith.github.io/webvtt-player/webvtt-player.js"></script> | ||
|
||
The model above doesn't yet control speech synthesis beyond the textual markers | ||
it can generate, but that's something to be added soon (you might have noticed | ||
erratic pitch shifts in the call above since TTS vendors don't contextualize | ||
based on past conversations). Stay tuned for more details on how we take this | ||
and similar research areas forward. | ||
|
||
[^1]: Of course, concerns around paralinguistic prediction accuracy are | ||
extremely important when taking something like this to production. |
Binary file not shown.
34 changes: 34 additions & 0 deletions
assets/audios/posts/speech-conversational-llms/metadata.vtt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
WEBVTT | ||
00:00:04.940 --> 00:00:06.880 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "female, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:00:17.530 --> 00:00:20.840 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "female, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:00:29.080 --> 00:00:33.400 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:00:47.420 --> 00:00:52.020 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:01:07.190 --> 00:01:11.510 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, oceania", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:01:20.900 --> 00:01:22.530 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:01:32.440 --> 00:01:37.700 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:01:44.770 --> 00:01:48.200 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:01:52.680 --> 00:01:53.790 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:02:01.050 --> 00:02:09.710 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} | ||
|
||
00:02:33.770 --> 00:02:36.590 | ||
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} |
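Each cue in the metadata track above pairs a time range with a JSON payload whose `keywords` field carries four comma-separated labels. A minimal sketch of reading those labels per cue follows. The helper names (`to_seconds`, `parse_metadata_cues`) are hypothetical, the sample payloads are shortened to two fields, and the sketch assumes the payload always sits on the single line after the timing line, as it does in this file. What each label position means (for example speaker gender or accent region) is not stated in the commit, so the code leaves the labels uninterpreted.

```python
import json
import re

# Matches a WebVTT cue timing line such as "00:00:04.940 --> 00:00:06.880".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the string fields of a VTT timestamp to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_metadata_cues(vtt_text):
    """Return (start, end, keywords) tuples for every cue in a metadata track."""
    cues = []
    lines = vtt_text.splitlines()
    for index, line in enumerate(lines):
        match = TIMING.match(line.strip())
        if not match or index + 1 >= len(lines):
            continue
        # Assumption: the JSON payload is the single line after the timing line.
        payload = json.loads(lines[index + 1])
        keywords = [k.strip() for k in payload.get("keywords", "").split(",") if k.strip()]
        fields = match.groups()
        cues.append((to_seconds(*fields[:4]), to_seconds(*fields[4:]), keywords))
    return cues


# Two cues copied from the track above, with the empty payload fields trimmed.
sample = """WEBVTT

00:00:04.940 --> 00:00:06.880
{"keywords": "female, middle-aged, neutral, america", "title": ""}

00:00:29.080 --> 00:00:33.400
{"keywords": "male, middle-aged, neutral, america", "title": ""}
"""

for start, end, keywords in parse_metadata_cues(sample):
    print(f"{start:.2f}-{end:.2f}: {keywords}")
```

The timestamps in this track line up with the non-agent turns in `transcript.vtt`, so the same `(start, end)` pair could be used to join a cue's labels to the matching caption.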
75 changes: 75 additions & 0 deletions
assets/audios/posts/speech-conversational-llms/transcript.vtt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
WEBVTT | ||
Kind: captions | ||
Language: en | ||
00:00:00.000 --> 00:00:03.040 | ||
<Speaker 1>Hello, am I speaking with Tom Williams? | ||
|
||
00:00:04.940 --> 00:00:06.880 | ||
<Speaker 2>May I ask who is calling? | ||
|
||
00:00:07.740 --> 00:00:15.750 | ||
<Speaker 1>Hello Ma'am. This is the digital agent of FirstCar Finance on a recorded line. Could I please speak with Tom regarding an important financial matter? | ||
|
||
00:00:17.530 --> 00:00:20.840 | ||
<Speaker 2>Um, hold on. I'll give him the phone. | ||
|
||
00:00:21.760 --> 00:00:23.280 | ||
<Speaker 1>Sure Ma'am. No problem. | ||
|
||
00:00:29.080 --> 00:00:33.400 | ||
<Speaker 3>Yeah yeah, uh, it's Tom. And um who is this? | ||
|
||
00:00:35.240 --> 00:00:45.830 | ||
<Speaker 1>Hi, Tom. This is the digital agent of FirstCar Finance on a recorded line. For safety, could you please confirm if 895 East Elm Street, Springfield, Illinois is your address? | ||
|
||
00:00:47.420 --> 00:00:52.020 | ||
<Speaker 3>Uh uh yeah, yeah. That'll be my address. | ||
|
||
00:00:53.420 --> 00:01:05.870 | ||
<Speaker 1>Thank you Tom, for confirming. I am giving you a call about the 2017 Honda Civic under your name. You have an overdue payment of $1345.75. Could you clear this today? | ||
|
||
00:01:07.190 --> 00:01:11.510 | ||
<Speaker 3>Um, I was planning to pay this, but I can't pay that all at once. | ||
|
||
00:01:12.750 --> 00:01:19.290 | ||
<Speaker 1>I understand, Tom. How does a partial payment of $1700 sound? You can pay the remaining amount later. | ||
|
||
00:01:20.900 --> 00:01:22.530 | ||
<Speaker 3>Yeah, I can do that. | ||
|
||
00:01:23.490 --> 00:01:31.060 | ||
<Speaker 1>Great. Tom, how would you like to do this? You can pay this amount via an SMS link or we can connect you to a human agent over the phone. | ||
|
||
00:01:32.440 --> 00:01:37.700 | ||
<Speaker 3>Uh I guess a payment link would be yeah I guess easier. | ||
|
||
00:01:38.830 --> 00:01:43.700 | ||
<Speaker 1>Alright. Can I send it to your mobile number ending in 7985? | ||
|
||
00:01:44.770 --> 00:01:48.200 | ||
<Speaker 3>Yeah, that should um yeah, that should work just fine. | ||
|
||
00:01:49.120 --> 00:01:51.980 | ||
<Speaker 1>Thank you Tom. I am sending the message right away. | ||
|
||
00:01:52.680 --> 00:01:53.790 | ||
<Speaker 3>Yeah okay. | ||
|
||
00:01:54.750 --> 00:02:00.130 | ||
<Speaker 1>Brilliant. I’ve sent the payment link to your phone. Could you please let me know once you've received it? | ||
|
||
00:02:01.050 --> 00:02:09.710 | ||
<Speaker 3>Um, yeah, just got it. Please give me a minute to complete the payment. | ||
|
||
00:02:10.950 --> 00:02:18.730 | ||
<Speaker 1>Of course, Tom. Take your time and hit the pound key once you've completed the payment. If you come across any issues, hit the star key. | ||
|
||
00:02:25.660 --> 00:02:32.710 | ||
<Speaker 1>Thank you, Tom. We will verify your payment on our end. Please reach out if you need further assistance. Have a nice day. | ||
|
||
00:02:33.770 --> 00:02:36.590 | ||
<Speaker 3>Okay, thanks. Goodbye. | ||
|
||
00:02:37.530 --> 00:02:38.590 | ||
<Speaker 1>Goodbye Tom. |
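Since the post mentions response latency, the caption timings above give a rough way to look at it: the gap between the end of a caller turn and the start of the agent's next turn. The sketch below is only an illustration. The function names are hypothetical, `Speaker 1` is taken to be the agent based on the transcript, and the measured gap also includes endpointing, network, and synthesis time plus any caption-alignment error, so it is not a measure of model compute alone.

```python
import re

# One cue: a timing line followed by a "<Speaker N>text" line.
CUE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})\n"
    r"<(Speaker \d+)>(.*)"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the string fields of a VTT timestamp to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_turns(vtt_text):
    """Return one dict per caption cue with start, end, speaker, and text."""
    turns = []
    for match in CUE.finditer(vtt_text):
        fields = match.groups()
        turns.append({
            "start": to_seconds(*fields[0:4]),
            "end": to_seconds(*fields[4:8]),
            "speaker": fields[8],
            "text": fields[9].strip(),
        })
    return turns


def agent_response_gaps(turns, agent="Speaker 1"):
    """Seconds between the end of a non-agent turn and the agent's next turn."""
    gaps = []
    for previous, current in zip(turns, turns[1:]):
        if current["speaker"] == agent and previous["speaker"] != agent:
            gaps.append(round(current["start"] - previous["end"], 3))
    return gaps


# The first three cues of the transcript above.
sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:03.040
<Speaker 1>Hello, am I speaking with Tom Williams?

00:00:04.940 --> 00:00:06.880
<Speaker 2>May I ask who is calling?

00:00:07.740 --> 00:00:15.750
<Speaker 1>Hello Ma'am. This is the digital agent of FirstCar Finance on a recorded line. Could I please speak with Tom regarding an important financial matter?
"""

gaps = agent_response_gaps(parse_turns(sample))
print(gaps)
```

Running the same function over the full transcript would give one gap per caller-to-agent handover, which can be summarized however suits the comparison being made.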