refactor: Enable partial transcription with a latency of 1000ms (#141)
* refactor: Enable partial transcription with a latency of 1000ms

* refactor: Update CMakePresets.json and buildspec.json

- Remove the "QT_VERSION" variable from CMakePresets.json for all platforms
- Update the "version" of "obs-studio" and "prebuilt" dependencies in buildspec.json
- Update the "version" of "qt6" dependency in buildspec.json
- Update the "version" of the project to "0.3.3" in buildspec.json
- Update the "version" of the project to "0.3.3" in CMakePresets.json
- Remove unused code in whisper-processing.cpp

* refactor: Add -Wno-error=deprecated-declarations option to compilerconfig.cmake

* refactor: Update language codes in translation module
royshil authored Jul 19, 2024
1 parent 19017ca commit b3e4bfa
Showing 16 changed files with 299 additions and 216 deletions.
1 change: 0 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -16,7 +16,6 @@
!LICENSE
!README.md
!/vendor
!patch_libobs.diff

# Exclude lock files
*.lock.json
12 changes: 4 additions & 8 deletions CMakePresets.json
@@ -26,9 +26,8 @@
"rhs": "Darwin"
},
"generator": "Xcode",
"warnings": { "dev": true, "deprecated": true },
"warnings": {"dev": true, "deprecated": true},
"cacheVariables": {
"QT_VERSION": "6",
"CMAKE_OSX_DEPLOYMENT_TARGET": "11.0",
"CODESIGN_IDENTITY": "$penv{CODESIGN_IDENT}",
"CODESIGN_TEAM": "$penv{CODESIGN_TEAM}"
@@ -57,9 +56,8 @@
},
"generator": "Visual Studio 17 2022",
"architecture": "x64",
"warnings": { "dev": true, "deprecated": true },
"warnings": {"dev": true, "deprecated": true},
"cacheVariables": {
"QT_VERSION": "6",
"CMAKE_SYSTEM_VERSION": "10.0.18363.657"
}
},
@@ -84,9 +82,8 @@
"rhs": "Linux"
},
"generator": "Ninja",
"warnings": { "dev": true, "deprecated": true },
"warnings": {"dev": true, "deprecated": true},
"cacheVariables": {
"QT_VERSION": "6",
"CMAKE_BUILD_TYPE": "RelWithDebInfo"
}
},
@@ -112,9 +109,8 @@
"rhs": "Linux"
},
"generator": "Ninja",
"warnings": { "dev": true, "deprecated": true },
"warnings": {"dev": true, "deprecated": true},
"cacheVariables": {
"QT_VERSION": "6",
"CMAKE_BUILD_TYPE": "RelWithDebInfo"
}
},
24 changes: 13 additions & 11 deletions README.md
@@ -12,12 +12,13 @@

## Introduction

LocalVocal live-streaming AI assistant plugin allows you to transcribe, locally on your machine, audio speech into text and perform various language processing functions on the text using AI / LLMs (Large Language Models). ✅ No GPU required, ✅ no cloud costs, ✅ no network and ✅ no downtime! Privacy first - all data stays on your machine.
LocalVocal lets you transcribe, locally on your machine, speech into text and simultaneously translate to any language. ✅ No GPU required, ✅ no cloud costs, ✅ no network and ✅ no downtime! Privacy first - all data stays on your machine.

If this free plugin has been valuable to you consider adding a ⭐ to this GH repo, rating it [on OBS](https://obsproject.com/forum/resources/localvocal-live-stream-ai-assistant.1769/), subscribing to [my YouTube channel](https://www.youtube.com/@royshilk) where I post updates, and supporting my work on [GitHub](https://github.com/sponsors/royshil) or [Patreon](https://www.patreon.com/RoyShilkrot) 🙏
If this free plugin has been valuable, consider adding a ⭐ to this GH repo, rating it [on OBS](https://obsproject.com/forum/resources/localvocal-live-stream-ai-assistant.1769/), subscribing to [my YouTube channel](https://www.youtube.com/@royshilk) where I post updates, and supporting my work on [GitHub](https://github.com/sponsors/royshil), [Patreon](https://www.patreon.com/RoyShilkrot) or [OpenCollective](https://opencollective.com/occ-ai) 🙏

Internally the plugin is running a neural network ([OpenAI Whisper](https://github.com/openai/whisper)) locally to predict in real time the speech and provide captions.
Internally the plugin runs [OpenAI's Whisper](https://github.com/openai/whisper) to process speech in real time and predict a transcription.
It's using the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) project from [ggerganov](https://github.com/ggerganov) to run the Whisper network efficiently on CPUs and GPUs.
Translation is done with [CTranslate2](https://github.com/OpenNMT/CTranslate2).

## Usage

@@ -45,9 +46,10 @@ Current Features:
- Sync'ed captions with OBS recording timestamps
- Send captions on a RTMP stream to e.g. YouTube, Twitch
- Bring your own Whisper model (any GGML)
- Translate captions in real time to major languages (both Whisper built-in translation as well as NMT models with [CTranslate2](https://github.com/OpenNMT/CTranslate2))
- Translate captions in real time to major languages (both Whisper built-in translation as well as NMT models)
- CUDA, OpenCL, Apple Arm64, AVX & SSE acceleration support
- Filter out or replace any part of the produced captions
- Partial transcriptions for a streaming-captions experience

Roadmap:
- More robust built-in translation options
@@ -57,22 +59,22 @@ Roadmap:
Check out our other plugins:
- [Background Removal](https://github.com/occ-ai/obs-backgroundremoval) removes background from webcam without a green screen.
- [Detect](https://github.com/occ-ai/obs-detect) will detect and track >80 types of objects in real-time inside OBS
- [CleanStream](https://github.com/occ-ai/obs-cleanstream) for real-time filler word (uh,um) and profanity removal from live audio stream
- [CleanStream](https://github.com/occ-ai/obs-cleanstream) for real-time filler word (uh,um) and profanity removal from a live audio stream
- [URL/API Source](https://github.com/occ-ai/obs-urlsource) that allows fetching live data from an API and displaying it in OBS.
- [Polyglot](https://github.com/occ-ai/obs-polyglot) translation AI plugin for real-time, local translation to hundreds of languages
- [Squawk](https://github.com/occ-ai/obs-squawk) adds lifelike local text-to-speech capabilities built into OBS

## Download
Check out the [latest releases](https://github.com/occ-ai/obs-localvocal/releases) for downloads and install instructions.

### Models
The plugin ships with the Tiny.en model, and will autonomoously download other bigger Whisper models through a dropdown.
However there's an option to select an external model file if you have it on disk.
The plugin ships with the Tiny.en model, and will autonomously download other Whisper models through a dropdown.
There's also an option to select an external GGML Whisper model file if you have it on disk.

Get more models from https://ggml.ggerganov.com/ and follow [the instructions on whisper.cpp](https://github.com/ggerganov/whisper.cpp/tree/master/models) to create your own models or download others such as distilled models.
Get more models from https://ggml.ggerganov.com/ and [HuggingFace](https://huggingface.co/ggerganov/whisper.cpp/tree/main), and follow [the instructions on whisper.cpp](https://github.com/ggerganov/whisper.cpp/tree/master/models) to create your own models or download others such as distilled models.

## Building

The plugin was built and tested on Mac OSX (Intel & Apple silicon), Windows (with and without Nvidia CUDA) and Linux.
The plugin was built and tested on Mac OSX (Intel & Apple silicon), Windows (with and without Nvidia CUDA) and Linux.

Start by cloning this repo to a directory of your choice.

@@ -172,7 +174,7 @@ The build should exist in the `./release` folder off the root. You can manually
LocalVocal will now build with CUDA support automatically through a prebuilt binary of Whisper.cpp from https://github.com/occ-ai/occ-ai-dep-whispercpp. The CMake scripts will download all necessary files.
To build with cuda add `CPU_OR_CUDA` as an environment variable (with `cpu`, `12.2.0` or `11.8.0`) and build regularly
To build with CUDA, add `CPU_OR_CUDA` as an environment variable (with `cpu`, `clblast`, `12.2.0` or `11.8.0`) and build regularly
```powershell
> $env:CPU_OR_CUDA="12.2.0"
```
22 changes: 11 additions & 11 deletions buildspec.json
@@ -1,33 +1,33 @@
{
"dependencies": {
"obs-studio": {
"version": "30.0.2",
"version": "30.1.2",
"baseUrl": "https://github.com/obsproject/obs-studio/archive/refs/tags",
"label": "OBS sources",
"hashes": {
"macos": "be12c3ad0a85713750d8325e4b1db75086223402d7080d0e3c2833d7c5e83c27",
"windows-x64": "970058c49322cfa9cd6d620abb393fed89743ba7e74bd9dbb6ebe0ea8141d9c7"
"macos": "490bae1c392b3b344b0270afd8cb887da4bc50bd92c0c426e96713c1ccb9701a",
"windows-x64": "c2dd03fa7fd01fad5beafce8f7156da11f9ed9a588373fd40b44a06f4c03b867"
}
},
"prebuilt": {
"version": "2023-11-03",
"version": "2024-03-19",
"baseUrl": "https://github.com/obsproject/obs-deps/releases/download",
"label": "Pre-Built obs-deps",
"hashes": {
"macos": "90c2fc069847ec2768dcc867c1c63b112c615ed845a907dc44acab7a97181974",
"windows-x64": "d0825a6fb65822c993a3059edfba70d72d2e632ef74893588cf12b1f0d329ce6"
"macos": "2e9bfb55a5e0e4c1086fa1fda4cf268debfead473089df2aaea80e1c7a3ca7ff",
"windows-x64": "6e86068371526a967e805f6f9903f9407adb683c21820db5f07da8f30d11e998"
}
},
"qt6": {
"version": "2023-11-03",
"version": "2024-03-19",
"baseUrl": "https://github.com/obsproject/obs-deps/releases/download",
"label": "Pre-Built Qt6",
"hashes": {
"macos": "ba4a7152848da0053f63427a2a2cb0a199af3992997c0db08564df6f48c9db98",
"windows-x64": "bc57dedf76b47119a6dce0435a2f21b35b08c8f2948b1cb34a157320f77732d1"
"macos": "694f1e639c017e3b1f456f735330dc5afae287cbea85757101af1368de3142c8",
"windows-x64": "72d1df34a0ef7413a681d5fcc88cae81da60adc03dcd23ef17862ab170bcc0dd"
},
"debugSymbols": {
"windows-x64": "fd8ecd1d8cd2ef049d9f4d7fb5c134f784836d6020758094855dfa98bd025036"
"windows-x64": "fbddd1f659c360f2291911ac5709b67b6f8182e6bca519d24712e4f6fd3cc865"
}
}
},
@@ -38,7 +38,7 @@
},
"name": "obs-localvocal",
"displayName": "OBS Localvocal",
"version": "0.3.2",
"version": "0.3.3",
"author": "Roy Shilkrot",
"website": "https://github.com/occ-ai/obs-localvocal",
"email": "[email protected]",
19 changes: 16 additions & 3 deletions cmake/BuildWhispercpp.cmake
@@ -80,7 +80,8 @@ elseif(WIN32)
FetchContent_Declare(
whispercpp_fetch
URL ${WHISPER_CPP_URL}
URL_HASH SHA256=${WHISPER_CPP_HASH})
URL_HASH SHA256=${WHISPER_CPP_HASH}
DOWNLOAD_EXTRACT_TIMESTAMP TRUE)
FetchContent_MakeAvailable(whispercpp_fetch)

add_library(Whispercpp::Whisper SHARED IMPORTED)
@@ -104,8 +105,20 @@ elseif(WIN32)

# glob all dlls in the bin directory and install them
file(GLOB WHISPER_DLLS ${whispercpp_fetch_SOURCE_DIR}/bin/*.dll)
install(FILES ${WHISPER_DLLS} DESTINATION "obs-plugins/64bit")

foreach(FILE ${WHISPER_DLLS})
file(RELATIVE_PATH REL_FILE ${whispercpp_fetch_SOURCE_DIR}/bin ${FILE})
set(DEST_DIR "${CMAKE_SOURCE_DIR}/release/${CMAKE_BUILD_TYPE}/obs-plugins/64bit")
set(DEST_FILE "${DEST_DIR}/${REL_FILE}")

if(NOT EXISTS ${DEST_DIR})
file(MAKE_DIRECTORY ${DEST_DIR})
endif()

if(NOT EXISTS ${DEST_FILE} OR ${FILE} IS_NEWER_THAN ${DEST_FILE})
message(STATUS "Copying ${FILE} to ${DEST_FILE}")
file(COPY ${FILE} DESTINATION ${DEST_DIR})
endif()
endforeach()
else()
set(Whispercpp_Build_GIT_TAG "v1.6.2")
set(WHISPER_EXTRA_CXX_FLAGS "-fPIC")
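The `foreach` loop above replaces the previous one-shot `install(FILES ...)` with a copy-only-when-newer step, so incremental builds skip DLLs that are already up to date. The same guard, sketched outside CMake with C++17 `std::filesystem` (hypothetical paths and function name; the real build relies on CMake's `IS_NEWER_THAN`):

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Copy `src` into `dest_dir` only when the destination is missing or older,
// mirroring the NOT EXISTS / IS_NEWER_THAN guard in the CMake hunk above.
bool copy_if_newer(const fs::path &src, const fs::path &dest_dir)
{
	fs::create_directories(dest_dir); // like file(MAKE_DIRECTORY ...)
	const fs::path dest = dest_dir / src.filename();
	if (fs::exists(dest) &&
	    fs::last_write_time(dest) >= fs::last_write_time(src))
		return false; // destination is current, skip the copy
	fs::copy_file(src, dest, fs::copy_options::overwrite_existing);
	return true;
}
```

The point of the guard is that repeated builds avoid touching files that have not changed, which keeps the `release` tree stable between runs.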
2 changes: 1 addition & 1 deletion cmake/macos/compilerconfig.cmake
@@ -55,4 +55,4 @@ else()
endif()

add_compile_definitions($<$<CONFIG:DEBUG>:DEBUG> $<$<CONFIG:DEBUG>:_DEBUG> SIMDE_ENABLE_OPENMP)
add_compile_options(-Wno-error=newline-eof)
add_compile_options(-Wno-error=newline-eof -Wno-error=deprecated-declarations -Wno-deprecated-declarations)
2 changes: 2 additions & 0 deletions data/locale/en-US.ini
@@ -83,3 +83,5 @@ log_group="Logging"
advanced_group="Advanced Configuration"
buffered_output_parameters="Buffered Output Configuration"
file_output_info="Note: Translation output will be saved to a file in the same directory with the target language added to the name, e.g. 'output_es.srt'."
partial_transcription="Enable Partial Transcription"
partial_transcription_info="Partial transcription will increase processing load on your machine to transcribe content in real-time, which may impact performance."
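The `partial_transcription_info` string above warns about added processing load: the new `partial_latency` default of 1000 ms bounds how often partial decodes run, so lowering it increases CPU use. A hypothetical sketch of such a time-based gate (the plugin's actual scheduling lives in the whisper processing loop, which is outside this diff):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical gate: emit a partial transcription at most once per
// `partial_latency_ms`, so lowering the latency raises processing load.
struct PartialGate {
	int partial_latency_ms = 1000; // default introduced by this commit
	int64_t last_partial_ms = 0;

	bool should_run_partial(int64_t now_ms)
	{
		if (now_ms - last_partial_ms >= partial_latency_ms) {
			last_partial_ms = now_ms;
			return true;
		}
		return false;
	}
};
```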
20 changes: 0 additions & 20 deletions patch_libobs.diff

This file was deleted.

22 changes: 17 additions & 5 deletions src/transcription-filter-callbacks.cpp
@@ -188,7 +188,8 @@ void set_text_callback(struct transcription_filter_data *gf,
const DetectionResultWithText &resultIn)
{
DetectionResultWithText result = resultIn;
if (!result.text.empty() && result.result == DETECTION_RESULT_SPEECH) {
if (!result.text.empty() && (result.result == DETECTION_RESULT_SPEECH ||
result.result == DETECTION_RESULT_PARTIAL)) {
gf->last_sub_render_time = now_ms();
gf->cleared_last_sub = false;
}
@@ -231,7 +232,10 @@ void set_text_callback(struct transcription_filter_data *gf,
str_copy = translated_sentence;
} else {
if (gf->buffered_output) {
gf->translation_monitor.addSentence(translated_sentence);
if (result.result == DETECTION_RESULT_SPEECH) {
// buffered output - add the sentence to the monitor
gf->translation_monitor.addSentence(translated_sentence);
}
} else {
// non-buffered output - send the sentence to the selected source
send_caption_to_source(gf->translation_output, translated_sentence,
@@ -241,17 +245,20 @@ void set_text_callback(struct transcription_filter_data *gf,
}

if (gf->buffered_output) {
gf->captions_monitor.addSentence(str_copy);
if (result.result == DETECTION_RESULT_SPEECH) {
gf->captions_monitor.addSentence(str_copy);
}
} else {
// non-buffered output - send the sentence to the selected source
send_caption_to_source(gf->text_source_name, str_copy, gf);
}

if (gf->caption_to_stream) {
if (gf->caption_to_stream && result.result == DETECTION_RESULT_SPEECH) {
send_caption_to_stream(result, str_copy, gf);
}

if (gf->save_to_file && gf->output_file_path != "") {
if (gf->save_to_file && gf->output_file_path != "" &&
result.result == DETECTION_RESULT_SPEECH) {
send_sentence_to_file(gf, result, str_copy, translated_sentence);
}
};
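The hunks above all gate on `DETECTION_RESULT_SPEECH`. A condensed standalone sketch of that routing (hypothetical types and names, not the plugin's actual API): partial results only refresh the live caption, while finalized speech results are also buffered, streamed, and written to file.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified model of set_text_callback()'s dispatch after this commit.
enum class Detection { Speech, Partial };

struct Sinks {
	std::vector<std::string> caption, buffered, stream, file;
};

void dispatch(Sinks &s, Detection d, const std::string &text,
	      bool buffered_output, bool caption_to_stream, bool save_to_file)
{
	if (buffered_output) {
		// buffered output: only finalized sentences enter the monitor
		if (d == Detection::Speech)
			s.buffered.push_back(text);
	} else {
		// non-buffered output: partials update the caption directly
		s.caption.push_back(text);
	}
	if (caption_to_stream && d == Detection::Speech)
		s.stream.push_back(text);
	if (save_to_file && d == Detection::Speech)
		s.file.push_back(text);
}
```

The design choice here is that transient partial text never pollutes persistent outputs (stream captions, SRT files, the buffered monitor), which would otherwise record every intermediate guess.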
@@ -291,8 +298,10 @@ void reset_caption_state(transcription_filter_data *gf_)
{
if (gf_->captions_monitor.isEnabled()) {
gf_->captions_monitor.clear();
gf_->translation_monitor.clear();
}
send_caption_to_source(gf_->text_source_name, "", gf_);
send_caption_to_source(gf_->translation_output, "", gf_);
// flush the buffer
{
std::lock_guard<std::mutex> lock(gf_->whisper_buf_mutex);
@@ -326,13 +335,15 @@ void media_started_callback(void *data_, calldata_t *cd)
gf_->active = true;
reset_caption_state(gf_);
}

void media_pause_callback(void *data_, calldata_t *cd)
{
UNUSED_PARAMETER(cd);
transcription_filter_data *gf_ = static_cast<struct transcription_filter_data *>(data_);
obs_log(gf_->log_level, "media_pause");
gf_->active = false;
}

void media_restart_callback(void *data_, calldata_t *cd)
{
UNUSED_PARAMETER(cd);
@@ -341,6 +352,7 @@ void media_restart_callback(void *data_, calldata_t *cd)
gf_->active = true;
reset_caption_state(gf_);
}

void media_stopped_callback(void *data_, calldata_t *cd)
{
UNUSED_PARAMETER(cd);
2 changes: 2 additions & 0 deletions src/transcription-filter-data.h
@@ -81,6 +81,8 @@ struct transcription_filter_data {
bool enable_audio_chunks_callback = false;
bool source_signals_set = false;
bool initial_creation = true;
bool partial_transcription = false;
int partial_latency = 1000;

// Last transcription result
std::string last_text;
22 changes: 20 additions & 2 deletions src/transcription-filter-properties.cpp
@@ -46,8 +46,9 @@ bool advanced_settings_callback(obs_properties_t *props, obs_property_t *propert
UNUSED_PARAMETER(property);
// If advanced settings is enabled, show the advanced settings group
const bool show_hide = obs_data_get_int(settings, "advanced_settings_mode") == 1;
for (const std::string &prop_name : {"whisper_params_group", "buffered_output_group",
"log_group", "advanced_group", "file_output_enable"}) {
for (const std::string &prop_name :
{"whisper_params_group", "buffered_output_group", "log_group", "advanced_group",
"file_output_enable", "partial_group"}) {
obs_property_set_visible(obs_properties_get(props, prop_name.c_str()), show_hide);
}
translation_options_callback(props, NULL, settings);
@@ -457,6 +458,22 @@ void add_general_group_properties(obs_properties_t *ppts)
}
}

void add_partial_group_properties(obs_properties_t *ppts)
{
// add a group for partial transcription
obs_properties_t *partial_group = obs_properties_create();
obs_properties_add_group(ppts, "partial_group", MT_("partial_transcription"),
OBS_GROUP_CHECKABLE, partial_group);

// add text info
obs_properties_add_text(partial_group, "partial_info", MT_("partial_transcription_info"),
OBS_TEXT_INFO);

// add slider for partial latency
obs_properties_add_int_slider(partial_group, "partial_latency", MT_("partial_latency"), 500,
3000, 50);
}

obs_properties_t *transcription_filter_properties(void *data)
{
struct transcription_filter_data *gf =
@@ -480,6 +497,7 @@ obs_properties_t *transcription_filter_properties(void *data)
add_buffered_output_group_properties(ppts);
add_advanced_group_properties(ppts, gf);
add_logging_group_properties(ppts);
add_partial_group_properties(ppts);
add_whisper_params_group_properties(ppts);

// Add a informative text about the plugin
