A user-friendly audio transcription web application built using Gradio and OpenAI's Whisper model. This app allows users to upload audio files, transcribe them to text, and even save the output for later use.
- Model Selection: Choose from a variety of Whisper models based on your performance and accuracy needs.
- Hint Support: Optionally provide context for better transcription accuracy.
- Audio Upload: Supports audio files in various formats.
- Automatic Formatting: Converts audio to the WAV format (16kHz, mono) for compatibility with the Whisper model.
- Accurate Transcription: Uses OpenAI's Whisper model for speech-to-text transcription.
- Subtitle Generation: Create SRT subtitle files directly from the transcription.
- Translate Feature: "Translate" button for translating transcriptions. Available for specific models (`medium`, `large`, `small`, and `base`) and hidden for unsupported models (`tiny` and `turbo`).
- Downloadable Results: Save your transcription and subtitles with a single click.
- Intuitive Web UI: Built using Gradio for a smooth and interactive user interface.
- Direct Programmatic Usage: Use `audio_print.py` as a standalone utility to transcribe audio files without the GUI.
To use the app, ensure the following are installed on your system:
- Python 3.10+
- `ffmpeg` (cross-platform multimedia framework, required for audio processing)
- Python packages:
  - openai-whisper
  - gradio
  - ffmpeg-python (optional, if used)
  - torch (PyTorch): make sure to install the version compatible with your CUDA version. You can check the official PyTorch compatibility table to find the correct version.
Example:

```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
Clone this repository to your local system:

```shell
git clone https://github.com/loglux/FlexAudioPrint.git
cd FlexAudioPrint
```
Install the required Python packages using the `requirements.txt` file:

```shell
pip install -r requirements.txt
```
- For Windows, download `ffmpeg` from FFmpeg.org and add it to your system's PATH.
- For Linux/macOS, install it via your package manager:

```shell
sudo apt install ffmpeg   # Debian/Ubuntu
brew install ffmpeg       # macOS (Homebrew)
```
To enable GPU acceleration for Whisper, you need to install a PyTorch build that supports your CUDA version. Use the following command, replacing `cu118` with your specific CUDA version:

```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
By default, FlexAudioPrint uses the large model of OpenAI's Whisper for transcription, as it provides the highest accuracy for most use cases. However, the model can be replaced with other available Whisper models depending on your resource constraints or accuracy needs. The choice of model can significantly impact both performance and accuracy.
| Model Name | Parameters | English-only Model | Multilingual Model | Required VRAM | Relative Speed |
|---|---|---|---|---|---|
| `tiny` | 39M | `tiny.en` | `tiny` | ~1 GB | ~10x |
| `base` | 74M | `base.en` | `base` | ~1 GB | ~7x |
| `small` | 244M | `small.en` | `small` | ~2 GB | ~4x |
| `medium` | 769M | `medium.en` | `medium` | ~5 GB | ~2x |
| `large` (default) | 1550M | N/A | `large` | ~10 GB | ~1x |
| `turbo` | 809M | N/A | `turbo` | ~6 GB | ~8x |
- Models like `tiny`, `base`, and `small` can run efficiently on a CPU, making them suitable for systems without a GPU.
- While the `turbo` model is faster and more resource-efficient, it has a noticeable drawback: it tends to "swallow" small or less distinct words during transcription. This issue is specific to `turbo` and does not occur with other models like `base`, `small`, or `medium`. Even these smaller models often provide more reliable results, making them preferable over `turbo` when accuracy matters. Personally, I prefer the `large` model over `turbo` due to its superior accuracy, despite its higher resource requirements.
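If you are unsure which model fits your hardware, the VRAM figures in the table above can be turned into a simple chooser. This is an illustrative helper, not part of FlexAudioPrint; the function name and the "exclude `turbo` when accuracy matters" policy are assumptions drawn from the notes above.

```python
# Approximate VRAM requirements in GB, taken from the model comparison table.
VRAM_REQUIRED_GB = {
    "tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large": 10,
}

def pick_model(available_vram_gb: float, prefer_accuracy: bool = True) -> str:
    """Return the most accurate Whisper model that fits in the given VRAM."""
    # Ordered from most to least accurate. turbo is skipped when accuracy is
    # preferred, since it tends to drop small words (see the note above).
    candidates = ["large", "medium", "small", "base", "tiny"]
    if not prefer_accuracy:
        candidates = ["turbo"] + candidates
    for name in candidates:
        if VRAM_REQUIRED_GB[name] <= available_vram_gb:
            return name
    return "tiny"  # smallest fallback; also runs fine on CPU

print(pick_model(6))                          # medium: largest accurate fit in 6 GB
print(pick_model(6, prefer_accuracy=False))   # turbo: fits exactly in 6 GB
```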
Note: Larger models require more computational resources (RAM, GPU, etc.). Make sure your system meets the requirements for the selected model.
When the Gradio interface is refreshed, the default model is reloaded automatically, ensuring consistency across sessions. Users can select other models from the dropdown menu during their session, but these changes will not persist after a refresh.
1. Run the Gradio app:

   ```shell
   python app.py
   ```

2. Open the URL provided by Gradio (e.g., `http://127.0.0.1:7860/`) in your browser and interact with the web interface.
The "Translate" button is dynamically shown or hidden depending on the selected model:

- Visible for: `base`, `small`, `medium`, `large`
- Hidden for: `tiny`, `turbo`
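The show/hide rule boils down to a membership check. The sketch below is illustrative logic, not the app's actual handler; the names `TRANSLATE_SUPPORTED` and `translate_button_visible` are assumptions, and in a Gradio app this would typically drive a `gr.update(visible=...)` on the model dropdown's change event.

```python
# Models for which translation is offered, per the lists above.
TRANSLATE_SUPPORTED = {"base", "small", "medium", "large"}

def translate_button_visible(model_name: str) -> bool:
    """Return True if the "Translate" button should be shown for this model."""
    return model_name in TRANSLATE_SUPPORTED

# In Gradio this decision would feed something like:
#   model_dropdown.change(
#       lambda m: gr.update(visible=translate_button_visible(m)),
#       inputs=model_dropdown, outputs=translate_button)
```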
To run the transcription process directly from the command line or from Python scripts, you can use `audio_print.py`.

If you have an audio file and want to transcribe its contents without the GUI:

```shell
python audio_print.py
```
Modify the paths in `audio_print.py` (`input_audio` and `output_text_file`) to point to the audio file you'd like to transcribe and the file where the transcription should be saved.
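If you would rather pass the paths on the command line than edit the file, a thin wrapper along these lines could work. This wrapper is hypothetical and not part of the repository; it only assumes the `AudioTranscriber(model_name=...)` constructor and `process_audio(..., output_text_path=...)` call shown in the programmatic example below.

```python
# Hypothetical CLI wrapper around audio_print.py (not shipped with the project).
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Transcribe an audio file with Whisper.")
    parser.add_argument("input_audio", help="path to the audio file to transcribe")
    parser.add_argument("output_text_file", help="path for the saved transcription")
    parser.add_argument("--model", default="base", help="Whisper model name")
    return parser

def main() -> None:
    args = build_parser().parse_args()
    from audio_print import AudioTranscriber  # imported lazily, needs the repo
    transcriber = AudioTranscriber(model_name=args.model)
    result = transcriber.process_audio(
        args.input_audio, output_text_path=args.output_text_file)
    print("Recognized Text:", result)

if __name__ == "__main__":
    main()
```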
Users can upload an audio file in the Gradio UI. Supported formats include `.wav`, `.mp3`, `.aac`, and others.
The app converts the uploaded audio to a `.wav` file with a 16 kHz sampling rate and a mono channel for compatibility with Whisper.
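This conversion step amounts to a single ffmpeg invocation. The sketch below shows what that invocation looks like; the helper names are illustrative and the app itself may wire this up differently (e.g. via ffmpeg-python). Running it requires `ffmpeg` on the PATH.

```python
# Sketch of the audio normalisation step: any input format -> 16 kHz mono WAV.
import subprocess

def build_ffmpeg_cmd(input_path: str, output_path: str) -> list[str]:
    """ffmpeg arguments producing the 16 kHz mono WAV that Whisper expects."""
    return [
        "ffmpeg", "-y",      # -y: overwrite the output file if it exists
        "-i", input_path,
        "-ar", "16000",      # resample to 16 kHz
        "-ac", "1",          # downmix to mono
        output_path,
    ]

def convert_to_wav(input_path: str, output_path: str) -> None:
    subprocess.run(build_ffmpeg_cmd(input_path, output_path), check=True)
```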
The `AudioTranscriber` class uses Whisper's speech-to-text model to transcribe the audio into text and generates subtitles in SRT format.
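Generating SRT from Whisper output is mostly a matter of formatting each segment's start/end timestamps. The sketch below shows the idea with illustrative helper names; it is not the `AudioTranscriber` internals, and it only assumes the standard Whisper segment shape (`start`, `end`, `text`).

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Turn Whisper-style segments (dicts with start/end/text) into SRT text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": " Hello world"}]))
```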
The transcribed text is displayed in the Gradio interface, and the user can save it as a file to download.
If you want to test the transcription programmatically, you can use `audio_print.py`:

```python
from audio_print import AudioTranscriber

# Path to your audio file
input_audio = "example_audio.mp3"

# Path for saving the text transcription
output_text_file = "transcription.txt"

# Create an instance of the AudioTranscriber class
audio_transcriber = AudioTranscriber(model_name="base")

# Process the audio file and save the transcription
result = audio_transcriber.process_audio(input_audio, output_text_path=output_text_file)
print("Recognized Text:", result)
```
This project is licensed under the MIT License. You are free to use, modify, and distribute this project according to the terms of the license.
Contributions are welcome! Feel free to open an issue or submit a pull request for any improvements or bug fixes.
- Gradio: For building an effortless web UI for machine learning models.
- OpenAI Whisper: For the incredible speech-to-text transcription model.
- FFmpeg: For reliable audio processing and conversion.