Merge pull request #45 from savbell/pyqt5-conversion
🔨 Major Rewrite: PyQt5 Conversion!
savbell committed May 28, 2024
2 parents 2b37eb4 + 783d43a commit 932b4bf
Showing 20 changed files with 1,218 additions and 504 deletions.
4 changes: 0 additions & 4 deletions .env.example

This file was deleted.

3 changes: 3 additions & 0 deletions .gitignore
@@ -127,3 +127,6 @@ dmypy.json

# Pyre type checker
.pyre/

# Configuration files
config.yaml
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -6,10 +6,18 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [Unreleased]
### Added
- New settings window to configure WhisperWriter.
- New main window to either start the keyboard listener or open the settings window.
- New continuous recording mode ([Issue #40](https://github.com/savbell/whisper-writer/issues/40)).
- New option to play a sound when transcription finishes ([Issue #40](https://github.com/savbell/whisper-writer/issues/40)).

### Changed
- Upgraded to latest versions of OpenAI API and faster-whisper, including support for local API ([Issue #32](https://github.com/savbell/whisper-writer/issues/32))
- Migrated the status window from `tkinter` to `PyQt5`.
- Migrated configuration settings from JSON to YAML.
- Upgraded to latest versions of `openai` and `faster-whisper`, including support for local API ([Issue #32](https://github.com/savbell/whisper-writer/issues/32)).

### Removed
- No longer using the `keyboard` package to listen for key presses.

## [1.0.1] - 2024-01-28
### Added
146 changes: 47 additions & 99 deletions README.md
@@ -3,19 +3,22 @@
![version](https://img.shields.io/badge/version-1.0.1-blue)

<p align="center">
<img src="./assets/ww-demo-image.gif" alt="WhisperWriter demo gif">
<img src="./assets/ww-demo-image-02.gif" alt="WhisperWriter demo gif" width="340" height="136">
</p>

WhisperWriter is a small speech-to-text app that uses [OpenAI's Whisper model](https://openai.com/research/whisper) to auto-transcribe recordings from a user's microphone.
**Update (2024-05-28):** I've just merged in a major rewrite of WhisperWriter! We've migrated from using `tkinter` to using `PyQt5` for the UI, added a new settings window for configuration, a new continuous recording mode, support for a local API, and more! Please be patient as I work out any bugs that may have been introduced in the process. If you encounter any problems, please [open a new issue](https://github.com/savbell/whisper-writer/issues)!

Once started, the script runs in the background and waits for a keyboard shortcut to be pressed (`ctrl+shift+space` by default). When the shortcut is pressed, the app starts recording from your microphone. There are three options to stop recording:
- `voice_activity_detection` that stops recording once it detects a long enough pause in your speech.
- `press_to_toggle` that stops recording when the activation key is pressed again.
- `hold_to_record` that stops recording when the activation key is released.
WhisperWriter is a small speech-to-text app that uses [OpenAI's Whisper model](https://openai.com/research/whisper) to auto-transcribe recordings from a user's microphone to the active window.

You can change the activation key and recording mode in the [Configuration Options](#configuration-options). While recording and transcribing, a small status window is displayed that shows the current stage of the process (but this can be turned off). Once the transcription is complete, the transcribed text will be automatically written to the active window.
Once started, the script runs in the background and waits for a keyboard shortcut to be pressed (`ctrl+shift+space` by default). When the shortcut is pressed, the app starts recording from your microphone. There are four recording modes to choose from (a rough sketch of their stop conditions follows this list):
- `continuous` (default): Recording will stop after a long enough pause in your speech. The app will transcribe the text and then start recording again. To stop listening, press the keyboard shortcut again.
- `voice_activity_detection`: Recording will stop after a long enough pause in your speech. Recording will not start until the keyboard shortcut is pressed again.
- `press_to_toggle`: Recording will stop when the keyboard shortcut is pressed again. Recording will not start until the keyboard shortcut is pressed again.
- `hold_to_record`: Recording will continue until the keyboard shortcut is released. Recording will not start until the keyboard shortcut is held down again.
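
How the modes differ is easiest to see as code. Below is a rough, illustrative sketch of the stop-and-restart logic; all names here are hypothetical, not WhisperWriter's actual implementation:

```python
# Illustrative sketch only -- function and variable names are hypothetical.
def should_stop_recording(mode, shortcut_pressed_again, shortcut_released,
                          silence_ms, silence_duration_ms=900):
    """Return True when the current recording should stop."""
    if mode in ('continuous', 'voice_activity_detection'):
        # Both stop on a long enough pause in speech...
        return silence_ms >= silence_duration_ms
    if mode == 'press_to_toggle':
        return shortcut_pressed_again
    if mode == 'hold_to_record':
        return shortcut_released
    raise ValueError(f'Unknown recording mode: {mode}')


def should_record_again(mode, shortcut_pressed_again):
    # ...but only `continuous` automatically starts a new recording after
    # transcribing, until the shortcut is pressed a second time.
    return mode == 'continuous' and not shortcut_pressed_again
```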

The transcription can either be done locally through the [faster-whisper Python package](https://github.com/SYSTRAN/faster-whisper/) or through a request to [OpenAI's API](https://platform.openai.com/docs/guides/speech-to-text). By default, the app will use a local model, but you can change this in the [Configuration Options](#configuration-options). If you choose to use the API, you will need to provide your OpenAI API key in a `.env` file, or change the base URL endpoint.
You can change the keyboard shortcut (`activation_key`) and recording mode in the [Configuration Options](#configuration-options). While recording and transcribing, a small status window is displayed that shows the current stage of the process (but this can be turned off). Once the transcription is complete, the transcribed text will be automatically written to the active window.

The transcription can either be done locally through the [faster-whisper Python package](https://github.com/SYSTRAN/faster-whisper/) or through a request to [OpenAI's API](https://platform.openai.com/docs/guides/speech-to-text). By default, the app will use a local model, but you can change this in the [Configuration Options](#configuration-options). If you choose to use the API, you will need to either provide your OpenAI API key or change the base URL endpoint.
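
For illustration, here is a minimal sketch of what the two back-ends look like in code, assuming the `openai` (v1) and `faster-whisper` packages; the file path and option values are placeholders:

```python
# Minimal sketch of the two transcription back-ends (placeholder values throughout).
def transcribe_via_api(audio_path, api_key, base_url='https://api.openai.com/v1'):
    from openai import OpenAI
    client = OpenAI(api_key=api_key, base_url=base_url)  # base_url may point at a local API
    with open(audio_path, 'rb') as f:
        return client.audio.transcriptions.create(model='whisper-1', file=f).text


def transcribe_locally(audio_path, model_name='base'):
    from faster_whisper import WhisperModel
    model = WhisperModel(model_name, device='auto', compute_type='auto')
    segments, _info = model.transcribe(audio_path)
    return ''.join(segment.text for segment in segments)
```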

**Fun fact:** Almost the entirety of the initial release of the project was pair-programmed with [ChatGPT-4](https://openai.com/product/gpt-4) and [GitHub Copilot](https://github.com/features/copilot) using VS Code. Practically every line, including most of this README, was written by AI. After the initial prototype was finished, WhisperWriter was used to write a lot of the prompts as well!

@@ -60,111 +63,56 @@ venv\Scripts\activate
pip install -r requirements.txt
```

#### 4. Switch between a local model and the OpenAI API:
To switch between running Whisper locally and using the OpenAI API, you need to modify the `src\config.json` file:

- If you prefer using the OpenAI API, set `"use_api"` to `true`. You will also need to either set up your OpenAI API key or change the base URL in the next step.
- If you prefer using a local Whisper model, set `"use_api"` to `false`. You may also want to change the device that the model uses; see the [Model Options](#model-options). Note that you need to have the [NVIDIA libraries installed](https://github.com/SYSTRAN/faster-whisper/#gpu) to run the model on your GPU.

```
{
"use_api": false, // Change this value to true to use the OpenAI API
...
}
```

#### 5. If using the OpenAI API, configure the environment variables:

Copy the ".env.example" file to a new file named ".env":
```
# For Linux and macOS
cp .env.example .env
# For Windows
copy .env.example .env
```
Open the ".env" file and add in your OpenAI API key:
```
OPENAI_API_KEY=<your_openai_key_here>
```
You can find your API key on the [OpenAI dashboard](https://platform.openai.com/api-keys). You will need to have available credits to use the API.

Alternatively, you can set the base URL endpoint to use a local API such as [LocalAI](https://localai.io/):
```
OPENAI_API_BASE_URL=<your_custom_url_here>
```

#### 6. Run the Python code:
#### 4. Run the Python code:

```
python run.py
```

### Configuration Options

WhisperWriter uses a configuration file to customize its behaviour. To set up the configuration, modify the [`src\config.json`](src\config.json) file:

```json
{
"use_api": false,
"api_options": {
"model": "whisper-1",
"language": null,
"temperature": 0.0,
"initial_prompt": null
},
"local_model_options": {
"model": "base",
"device": "auto",
"compute_type": "auto",
"language": null,
"temperature": 0.0,
"initial_prompt": null,
"condition_on_previous_text": true,
"vad_filter": false
},
"activation_key": "ctrl+shift+space",
"recording_mode": "voice_activity",
"sound_device": null,
"sample_rate": 16000,
"silence_duration": 900,
"writing_key_press_delay": 0.005,
"noise_on_completion": false,
"remove_trailing_period": false,
"add_trailing_space": true,
"remove_capitalization": false,
"print_to_terminal": true
}
```
WhisperWriter uses a configuration file to customize its behaviour. To set up the configuration, open the Settings window:

<p align="center">
<img src="./assets/ww-settings-demo.gif" alt="WhisperWriter Settings window demo gif" width="350" height="350">
</p>

#### Model Options
- `use_api`: Set to `true` to use the OpenAI API for transcription. Set to `false` to use a local Whisper model. (Default: `false`)
- `api_options`: Contains options for the OpenAI API. See the [API reference](https://platform.openai.com/docs/api-reference/audio/create?lang=python) for more details.
- `model`: The model to use for transcription. Currently only `whisper-1` is available. (Default: `"whisper-1"`)
- `language`: The language code for the transcription in [ISO-639-1 format](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). (Default: `null`)
- `temperature`: Controls the randomness of the transcription output. Lower values (e.g., 0.0) make the output more focused and deterministic. (Default: `0.0`)
- `initial_prompt`: A string used as an initial prompt to condition the transcription. [Here's some info on how it works](https://platform.openai.com/docs/guides/speech-to-text/prompting). Set to null for no initial prompt. (Default: `null`)
- `local_model_options`: Contains options for the local Whisper model. See the [function definition](https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L52-L108) for more details.
- `model`: The model to use for transcription. See [available models and languages](https://github.com/openai/whisper#available-models-and-languages). (Default: `"base"`)
- `device`: The device to run the local Whisper model on. Options include `cuda` for NVIDIA GPUs, `cpu` for CPU-only processing, or `auto` to let the system automatically choose the best available device. (Default: `auto`)
- `compute_type`: The compute type to use for the local Whisper model. [More information can be found here.](https://opennmt.net/CTranslate2/quantization.html) (Default: `auto`)
- `language`: The language code for the transcription in [ISO-639-1 format](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). (Default: `null`)
- `temperature`: Controls the randomness of the transcription output. Lower values (e.g., 0.0) make the output more focused and deterministic. (Default: `0.0`)
- `initial_prompt`: A string used as an initial prompt to condition the transcription. [Here's some info on how it works](https://platform.openai.com/docs/guides/speech-to-text/prompting). Set to null for no initial prompt. (Default: `null`)
- `use_api`: Toggle to choose whether to use the OpenAI API or a local Whisper model for transcription. (Default: `false`)
- `common`: Options common to both API and local models.
- `language`: The language code for the transcription in [ISO-639-1 format](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). (Default: `null`)
- `temperature`: Controls the randomness of the transcription output. Lower values make the output more focused and deterministic. (Default: `0.0`)
- `initial_prompt`: A string used as an initial prompt to condition the transcription. More info: [OpenAI Prompting Guide](https://platform.openai.com/docs/guides/speech-to-text/prompting). (Default: `null`)

- `api`: Configuration options for the OpenAI API. See the [OpenAI API documentation](https://platform.openai.com/docs/api-reference/audio/create?lang=python) for more information.
- `model`: The model to use for transcription. Currently, only `whisper-1` is available. (Default: `whisper-1`)
- `base_url`: The base URL for the API. Can be changed to use a local API endpoint, such as [LocalAI](https://localai.io/). (Default: `https://api.openai.com/v1`)
- `api_key`: Your API key for the OpenAI API. Required for non-local API usage. (Default: `null`)

- `local`: Configuration options for the local Whisper model (a usage sketch follows this list).
- `model`: The model to use for transcription. The larger models provide better accuracy but are slower. See [available models and languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). (Default: `base`)
- `device`: The device to run the local Whisper model on. Use `cuda` for NVIDIA GPUs, `cpu` for CPU-only processing, or `auto` to let the system automatically choose the best available device. (Default: `auto`)
- `compute_type`: The compute type to use for the local Whisper model. [More information on quantization here](https://opennmt.net/CTranslate2/quantization.html). (Default: `default`)
- `condition_on_previous_text`: Set to `true` to use the previously transcribed text as a prompt for the next transcription request. (Default: `true`)
- `vad_filter`: Set to `true` to use [a voice activity detection (VAD) filter](https://github.com/snakers4/silero-vad) to remove silence from the recording. (Default: `false`)
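
As a rough sketch of how the `local` options might map onto a `faster-whisper` call (an assumption based on the option names, not necessarily how the app wires them up):

```python
# Hedged sketch: passing the `local` options through to faster-whisper.
from faster_whisper import WhisperModel

local = {  # example values mirroring the defaults documented above
    'model': 'base', 'device': 'auto', 'compute_type': 'default',
    'language': None, 'temperature': 0.0, 'initial_prompt': None,
    'condition_on_previous_text': True, 'vad_filter': False,
}

model = WhisperModel(local['model'], device=local['device'],
                     compute_type=local['compute_type'])
segments, _info = model.transcribe(
    'recording.wav',  # placeholder path
    language=local['language'],
    temperature=local['temperature'],
    initial_prompt=local['initial_prompt'],
    condition_on_previous_text=local['condition_on_previous_text'],
    vad_filter=local['vad_filter'],
)
text = ''.join(segment.text for segment in segments)
```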
#### Customization Options
- `activation_key`: The keyboard shortcut to activate the recording and transcribing process. (Default: `"ctrl+shift+space"`)
- `recording_mode`: The recording mode to use. Options include `voice_activity_detection` (use voice activity detection to determine when to stop recording), `press_to_toggle` (start and stop recording by pressing the activation key), and `hold_to_record` (record while the activation key is held down). (Default: `"voice_activity"`)
- `sound_device`: The name of the sound device to use for recording. Set to `null` to let the system automatically choose the default device. To find a device number, run `python -m sounddevice`. (Default: `null`)

#### Recording Options
- `activation_key`: The keyboard shortcut to activate the recording and transcribing process. Separate keys with a `+`. (Default: `ctrl+shift+space`)
- `recording_mode`: The recording mode to use. Options include `continuous` (auto-restart recording after pause in speech until activation key is pressed again), `voice_activity_detection` (stop recording after pause in speech), `press_to_toggle` (stop recording when activation key is pressed again), `hold_to_record` (stop recording when activation key is released). (Default: `continuous`)
- `sound_device`: The numeric index of the sound device to use for recording. To find device numbers, run `python -m sounddevice` (or use the snippet below this list). (Default: `null`)
- `sample_rate`: The sample rate in Hz to use for recording. (Default: `16000`)
- `silence_duration`: The duration in milliseconds to wait for silence before stopping the recording. (Default: `900`)
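
If you prefer to query devices from Python rather than the command line, the equivalent `sounddevice` calls are:

```python
# List all audio devices and their numeric indices (same output as `python -m sounddevice`).
import sounddevice as sd

print(sd.query_devices())              # every device, with its index
print(sd.query_devices(kind='input'))  # the current default input device
```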

#### Post-processing Options
- `writing_key_press_delay`: The delay in seconds between each key press when writing the transcribed text. (Default: `0.005`)
- `noise_on_completion`: Set to `true` to play a sound when the transcription is complete. (Default: `false`)
- `remove_trailing_period`: Set to `true` to remove the trailing period from the transcribed text. (Default: `false`)
- `add_trailing_space`: Set to `true` to add a trailing space to the transcribed text. (Default: `true`)
- `add_trailing_space`: Set to `true` to add a space to the end of the transcribed text. (Default: `true`)
- `remove_capitalization`: Set to `true` to convert the transcribed text to lowercase. (Default: `false`)
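
A minimal sketch of how these options could be applied to a finished transcription (illustrative only; `pynput` is an assumption about the typing library, and the function names are hypothetical):

```python
import time

# Illustrative sketch -- not WhisperWriter's actual code. Option names mirror
# the config keys above.
def post_process(text, remove_trailing_period=False,
                 add_trailing_space=True, remove_capitalization=False):
    text = text.strip()
    if remove_trailing_period and text.endswith('.'):
        text = text[:-1]
    if remove_capitalization:
        text = text.lower()
    if add_trailing_space:
        text += ' '
    return text


def write_text(text, writing_key_press_delay=0.005):
    from pynput.keyboard import Controller
    keyboard = Controller()
    for char in text:
        keyboard.type(char)                  # simulate typing one character
        time.sleep(writing_key_press_delay)  # pause between key presses
```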

#### Miscellaneous Options
- `print_to_terminal`: Set to `true` to print the script status and transcribed text to the terminal. (Default: `true`)
- `hide_window`: Set to `true` to hide the status window.
- `hide_status_window`: Set to `true` to hide the status window during operation. (Default: `false`)
- `noise_on_completion`: Set to `true` to play a noise after the transcription has been typed out. (Default: `false`)

If any of the configuration options are invalid or not provided, the program will use the default values.
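
A sketch of what that fallback could look like, assuming a PyYAML-based loader; the `config.yaml` file name matches this README, but the keys and structure here are illustrative:

```python
# Sketch of defaulting behaviour for missing or invalid options (illustrative).
import yaml

DEFAULTS = {
    'activation_key': 'ctrl+shift+space',
    'recording_mode': 'continuous',
    'sample_rate': 16000,
    'silence_duration': 900,
}

def load_config(path='config.yaml'):
    try:
        with open(path) as f:
            user = yaml.safe_load(f) or {}
    except (FileNotFoundError, yaml.YAMLError):
        user = {}  # unreadable or invalid file: fall back entirely to defaults
    config = {}
    for key, default in DEFAULTS.items():
        value = user.get(key, default)
        # reject values of the wrong type, keeping the documented default
        config[key] = value if isinstance(value, type(default)) else default
    return config
```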

@@ -174,12 +122,12 @@ You can see all reported issues and their current status in our [Issue Tracker](

## Roadmap
Below are features I am planning to add in the near future:
- [ ] Restructuring configuration options to reduce redundancy
- [x] Restructuring configuration options to reduce redundancy
- [x] Update to use the latest version of the OpenAI API
- [ ] Additional post-processing options:
- [ ] Simple word replacement (e.g. "gonna" -> "going to" or "smiley face" -> "😊")
- [ ] Using GPT for instructional post-processing
- [ ] Updating GUI
- [x] Updating GUI
- [ ] Creating standalone executable file

Below are features not currently planned:
Binary file added assets/ww-demo-image-02.gif
Binary file added assets/ww-settings-demo.gif
Binary file modified requirements.txt
Binary file not shown.
5 changes: 2 additions & 3 deletions run.py
@@ -1,9 +1,8 @@
import os
import sys
import subprocess

# Disabling output buffering so that the status window can be updated in real time
os.environ['PYTHONUNBUFFERED'] = '1'
from dotenv import load_dotenv

print('Starting WhisperWriter...')
load_dotenv()
subprocess.run([sys.executable, os.path.join('src', 'main.py')])
31 changes: 0 additions & 31 deletions src/config.json

This file was deleted.
