add speech conversational llm post

skit-ai · May 9, 2024 · ecbb540
1 parent 121824a
commit ecbb540
Showing 4 changed files with 176 additions and 0 deletions.
diff --git a/_posts/2024-05-09-speech-conversational-llms.md b/_posts/2024-05-09-speech-conversational-llms.md
@@ -0,0 +1,67 @@
+---
+title: Speech LLMs for Conversations
+date: 2024-05-09
+tags: [llm, speech, conversations]
+categories: [Machine Learning]
+layout: post
+authors: [Shangeth, lepisma]
+latex: True
+---
+
+With LLMs, making conversational systems has become easier. You no longer need to
+focus on the low-level details of categorizing semantics and designing
+responses. Instead, you can concentrate on controlling high-level behaviors via
+an LLM. This is the trend we see most of the world moving towards, as
+products use vendor combinations of ASR, LLM, and TTS with some dialog
+management stitched in between. While this is going to be the norm soon, we want
+to keep exploring the areas where the next set of quality improvements will
+come from.
+
+[Earlier](/speech-first-conversational-ai-revisited/) we discussed how spoken
+conversations are richer than pure text and how the gap would not be bridged by
+LLMs working purely on transcriptions. In one of our recent experiments we built
+an efficient multi-modal LLM that takes speech directly to provide a better
+conversational experience. For production usage, the constraint here is that
+this should happen without losing the flexibility that you get in a text-only
+LLM around writing prompts, making changes, evaluating, and debugging.
+
+Below is a conversation with our recent in-house Speech LLM-based conversational
+system. Notice that because of the extra information in speech, some micro
+personalizations can happen, like the use of gendered pronouns[^1]. You also get
+a lower impact of transcription errors and, in general, better responses to
+non-speech signals. With access to both speech and text domains, the model
+allows for more fluent turn-taking, though not demonstrated in the current
+conversation. Our approach also reduces the combined model size
+(<2B) for taking speech to response, leading to lower compute latency
+compared to larger systems.
+
+<style>
+.webvtt-player .media {
+  display: unset;
+}
+
+.webvtt-player .container {
+  width: unset;
+}
+
+.webvtt-player {
+  font-family: sans-serif;
+  font-size: 0.8em;
+}
+</style>
+
+<div id="webvtt-player"
+     data-audio="../assets/audios/posts/speech-conversational-llms/audio.m4a"
+     data-transcript="../assets/audios/posts/speech-conversational-llms/transcript.vtt"
+     data-metadata="../assets/audios/posts/speech-conversational-llms/metadata.vtt" />
+
+<script src="https://umd-mith.github.io/webvtt-player/webvtt-player.js"></script>
+
+The model above doesn't yet control speech synthesis beyond the textual markers
+it can generate, but that's something to be added soon (you might have noticed
+erratic pitch shifts in the call above since TTS vendors don't contextualize
+based on past conversations). Stay tuned for more details on how we take this
+and similar research areas forward.
+
+[^1]: Of course, concerns around paralinguistic prediction accuracy are
+    extremely important when taking something like this to production.
diff --git a/assets/audios/posts/speech-conversational-llms/audio.m4a b/assets/audios/posts/speech-conversational-llms/audio.m4a
diff --git a/assets/audios/posts/speech-conversational-llms/metadata.vtt b/assets/audios/posts/speech-conversational-llms/metadata.vtt
@@ -0,0 +1,34 @@
+WEBVTT
+
+00:00:04.940 --> 00:00:06.880
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "female, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:00:17.530 --> 00:00:20.840
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "female, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:00:29.080 --> 00:00:33.400
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:00:47.420 --> 00:00:52.020
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:01:07.190 --> 00:01:11.510
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, oceania", "title_alt": "", "subjects_alt": ""}
+
+00:01:20.900 --> 00:01:22.530
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:01:32.440 --> 00:01:37.700
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:01:44.770 --> 00:01:48.200
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:01:52.680 --> 00:01:53.790
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:02:01.050 --> 00:02:09.710
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
+
+00:02:33.770 --> 00:02:36.590
+{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
diff --git a/assets/audios/posts/speech-conversational-llms/transcript.vtt b/assets/audios/posts/speech-conversational-llms/transcript.vtt
@@ -0,0 +1,75 @@
+WEBVTT
+Kind: captions
+Language: en
+
+00:00:00.000 --> 00:00:03.040
+<Speaker 1>Hello, am I speaking with Tom Williams?
+
+00:00:04.940 --> 00:00:06.880
+<Speaker 2>May I ask who is calling?
+
+00:00:07.740 --> 00:00:15.750
+<Speaker 1>Hello Ma'am. This is the digital agent of FirstCar Finance on a recorded line. Could I please speak with Tom regarding an important financial matter?
+
+00:00:17.530 --> 00:00:20.840
+<Speaker 2>Um, hold on. I'll give him the phone.
+
+00:00:21.760 --> 00:00:23.280
+<Speaker 1>Sure Ma'am. No problem.
+
+00:00:29.080 --> 00:00:33.400
+<Speaker 3>Yeah yeah, uh, it's Tom. And um who is this?
+
+00:00:35.240 --> 00:00:45.830
+<Speaker 1>Hi, Tom. This is the digital agent of FirstCar Finance on a recorded line. For safety, could you please confirm if 895 East Elm Street, Springfield, Illinois is your address?
+
+00:00:47.420 --> 00:00:52.020
+<Speaker 3>Uh uh yeah, yeah. That'll be my address.
+
+00:00:53.420 --> 00:01:05.870
+<Speaker 1>Thank you Tom, for confirming. I am giving you a call about the 2017 Honda Civic under your name. You have an overdue payment of $1345.75. Could you clear this today?
+
+00:01:07.190 --> 00:01:11.510
+<Speaker 3>Um, I was planning to pay this, but I can't pay that all at once.
+
+00:01:12.750 --> 00:01:19.290
+<Speaker 1>I understand, Tom. How does a partial payment of $1700 sound? You can pay the remaining amount later.
+
+00:01:20.900 --> 00:01:22.530
+<Speaker 3>Yeah, I can do that.
+
+00:01:23.490 --> 00:01:31.060
+<Speaker 1>Great. Tom, how would you like to do this? You can pay this amount via an SMS link or we can connect you to a human agent over the phone.
+
+00:01:32.440 --> 00:01:37.700
+<Speaker 3>Uh I guess a payment link would be yeah I guess easier.
+
+00:01:38.830 --> 00:01:43.700
+<Speaker 1>Alright. Can I send it to your mobile number ending in 7985?
+
+00:01:44.770 --> 00:01:48.200
+<Speaker 3>Yeah, that should um yeah, that should work just fine.
+
+00:01:49.120 --> 00:01:51.980
+<Speaker 1>Thank you Tom. I am sending the message right away.
+
+00:01:52.680 --> 00:01:53.790
+<Speaker 3>Yeah okay.
+
+00:01:54.750 --> 00:02:00.130
+<Speaker 1>Brilliant. I’ve sent the payment link to your phone. Could you please let me know once you've received it?
+
+00:02:01.050 --> 00:02:09.710
+<Speaker 3>Um, yeah, just got it. Please give me a minute to complete the payment.
+
+00:02:10.950 --> 00:02:18.730
+<Speaker 1>Of course, Tom. Take your time and hit the pound key once you've completed the payment. If you come across any issues, hit the star key.
+
+00:02:25.660 --> 00:02:32.710
+<Speaker 1>Thank you, Tom. We will verify your payment on our end. Please reach out if you need further assistance. Have a nice day.
+
+00:02:33.770 --> 00:02:36.590
+<Speaker 3>Okay, thanks. Goodbye.
+
+00:02:37.530 --> 00:02:38.590
+<Speaker 1>Goodbye Tom.