diff --git a/_posts/2024-05-09-speech-conversational-llms.md b/_posts/2024-05-09-speech-conversational-llms.md new file mode 100644 index 00000000..32ee3e5a --- /dev/null +++ b/_posts/2024-05-09-speech-conversational-llms.md @@ -0,0 +1,67 @@ +--- +title: Speech LLMs for Conversations +date: 2024-05-09 +tags: [llm, speech, conversations] +categories: [Machine Learning] +layout: post +authors: [Shangeth, lepisma] +latex: True +--- + +With LLMs, building conversational systems has become easier. You no longer need to +focus on the low-level details of categorizing semantics and designing +responses. Instead, you can concentrate on controlling high-level behaviors via +an LLM. This is the trend we see most of the world moving towards, as +products combine vendor ASR, LLM, and TTS components with some dialog +management stitched in between. While this is going to be the norm soon, we want +to keep exploring areas where the next set of quality improvements will +come from. + +[Earlier](/speech-first-conversational-ai-revisited/) we discussed how spoken +conversations are richer than pure text and how the gap would not be bridged by +LLMs working purely on transcriptions. In one of our recent experiments we built +an efficient multi-modal LLM that takes speech directly to provide a better +conversational experience. For production usage, the constraint here is that +this should happen without losing the flexibility that you get in a text-only +LLM around writing prompts, making changes, evaluating, and debugging. + +Below is a conversation with our recent in-house Speech LLM-based conversational +system. Notice that, because of the extra information in speech, some +micro-personalizations can happen, like the use of gendered pronouns[^1]. You also get a +lower impact of transcription errors and, in general, better responses to +non-speech signals. 
With access to both speech and text domains, the model +allows for more fluent turn-taking, though this is not demonstrated in the current +conversation. Our approach also reduces the combined model size +(<2B parameters) for going from speech to response, leading to lower compute latency +compared to larger systems. + + + +
+ + + +The model above doesn't yet control speech synthesis beyond the textual markers +it can generate, but that's something to be added soon (you might have noticed +erratic pitch shifts in the call above since TTS vendors don't contextualize +based on past conversations). Stay tuned for more details on how we take this +and similar research areas forward. + +[^1]: Of course, concerns around paralinguistic prediction accuracy must be + addressed before taking something like this to production. diff --git a/assets/audios/posts/speech-conversational-llms/audio.m4a b/assets/audios/posts/speech-conversational-llms/audio.m4a new file mode 100644 index 00000000..a9739a88 Binary files /dev/null and b/assets/audios/posts/speech-conversational-llms/audio.m4a differ diff --git a/assets/audios/posts/speech-conversational-llms/metadata.vtt b/assets/audios/posts/speech-conversational-llms/metadata.vtt new file mode 100644 index 00000000..0fa14575 --- /dev/null +++ b/assets/audios/posts/speech-conversational-llms/metadata.vtt @@ -0,0 +1,34 @@ +WEBVTT + +00:00:04.940 --> 00:00:06.880 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "female, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:00:17.530 --> 00:00:20.840 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "female, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:00:29.080 --> 00:00:33.400 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", 
"hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:00:47.420 --> 00:00:52.020 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:01:07.190 --> 00:01:11.510 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, oceania", "title_alt": "", "subjects_alt": ""} + +00:01:20.900 --> 00:01:22.530 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:01:32.440 --> 00:01:37.700 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:01:44.770 --> 00:01:48.200 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:01:52.680 
--> 00:01:53.790 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:02:01.050 --> 00:02:09.710 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} + +00:02:33.770 --> 00:02:36.590 +{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""} \ No newline at end of file diff --git a/assets/audios/posts/speech-conversational-llms/transcript.vtt b/assets/audios/posts/speech-conversational-llms/transcript.vtt new file mode 100644 index 00000000..12eeaa68 --- /dev/null +++ b/assets/audios/posts/speech-conversational-llms/transcript.vtt @@ -0,0 +1,75 @@ +WEBVTT +Kind: captions +Language: en + +00:00:00.000 --> 00:00:03.040 +Hello, am I speaking with Tom Williams? + +00:00:04.940 --> 00:00:06.880 +May I ask who is calling? + +00:00:07.740 --> 00:00:15.750 +Hello Ma'am. This is the digital agent of FirstCar Finance on a recorded line. Could I please speak with Tom regarding an important financial matter? + +00:00:17.530 --> 00:00:20.840 +Um, hold on. I'll give him the phone. + +00:00:21.760 --> 00:00:23.280 +Sure Ma'am. No problem. + +00:00:29.080 --> 00:00:33.400 +Yeah yeah, uh, it's Tom. And um who is this? + +00:00:35.240 --> 00:00:45.830 +Hi, Tom. 
This is the digital agent of FirstCar Finance on a recorded line. For safety, could you please confirm if 895 East Elm Street, Springfield, Illinois is your address? + +00:00:47.420 --> 00:00:52.020 +Uh uh yeah, yeah. That'll be my address. + +00:00:53.420 --> 00:01:05.870 +Thank you Tom, for confirming. I am giving you a call about the 2017 Honda Civic under your name. You have an overdue payment of $1345.75. Could you clear this today? + +00:01:07.190 --> 00:01:11.510 +Um, I was planning to pay this, but I can't pay that all at once. + +00:01:12.750 --> 00:01:19.290 +I understand, Tom. How does a partial payment of $1700 sound? You can pay the remaining amount later. + +00:01:20.900 --> 00:01:22.530 +Yeah, I can do that. + +00:01:23.490 --> 00:01:31.060 +Great. Tom, how would you like to do this? You can pay this amount via an SMS link or we can connect you to a human agent over the phone. + +00:01:32.440 --> 00:01:37.700 +Uh I guess a payment link would be yeah I guess easier. + +00:01:38.830 --> 00:01:43.700 +Alright. Can I send it to your mobile number ending in 7985? + +00:01:44.770 --> 00:01:48.200 +Yeah, that should um yeah, that should work just fine. + +00:01:49.120 --> 00:01:51.980 +Thank you Tom. I am sending the message right away. + +00:01:52.680 --> 00:01:53.790 +Yeah okay. + +00:01:54.750 --> 00:02:00.130 +Brilliant. I’ve sent the payment link to your phone. Could you please let me know once you've received it? + +00:02:01.050 --> 00:02:09.710 +Um, yeah, just got it. Please give me a minute to complete the payment. + +00:02:10.950 --> 00:02:18.730 +Of course, Tom. Take your time and hit the pound key once you've completed the payment. If you come across any issues, hit the star key. + +00:02:25.660 --> 00:02:32.710 +Thank you, Tom. We will verify your payment on our end. Please reach out if you need further assistance. Have a nice day. + +00:02:33.770 --> 00:02:36.590 +Okay, thanks. Goodbye. + +00:02:37.530 --> 00:02:38.590 +Goodbye Tom. 
\ No newline at end of file