A modern web application that transforms spoken audio into synchronized text transcriptions with AI-generated visual descriptions.
- Real-time Audio Transcription: Convert spoken words to text using OpenAI's Whisper model
- Synchronized Text Display: View transcriptions that animate in sync with audio playback
- AI Visual Descriptions: Experience generated visual descriptions that represent the content
- Streaming Architecture: Process audio in chunks for a responsive experience
- Intuitive Interface: Simple and elegant design for easy interaction
- Python 3.10+ for the backend
- Node.js 18+ for the frontend
- Whisper for speech-to-text
- Ollama with the LLaMA3 model for visual descriptions
- Navigate to the backend directory: `cd back`
- Create and activate a virtual environment:
  `python -m venv venv`
  `source venv/bin/activate` (on Windows: `venv\Scripts\activate`)
- Install the required packages: `pip install django django-cors-headers openai-whisper ffmpeg-python`
- Start the Django server: `python manage.py runserver`
- Navigate to the frontend directory: `cd front`
- Install dependencies: `npm install`
- Start the development server: `npm run dev`
- Install Ollama
- Pull the LLaMA3 model: `ollama pull llama3`
- Ensure the Ollama service is running at http://localhost:11434
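To confirm the service is reachable before starting the backend, you can query Ollama's local REST API. This is a minimal sketch, assuming the default Ollama port and that `ollama pull llama3` has already been run; it is not part of the project code:

```python
import requests

# Ask the local Ollama service which models it has available.
# Assumes Ollama's default port (11434).
response = requests.get("http://localhost:11434/api/tags", timeout=5)
response.raise_for_status()

models = [m["name"] for m in response.json().get("models", [])]
if any(name.startswith("llama3") for name in models):
    print("Ollama is running and llama3 is available:", models)
else:
    print("llama3 not found; run `ollama pull llama3` first.")
```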
- Open the application in your web browser (typically at http://localhost:5173)
- Click on "Choose Audio File" to upload an audio recording
- Wait for the transcription process to begin
- Watch as the text appears synchronized with the audio playback
- Read the AI-generated visual descriptions that accompany each part of the content
The application is structured with:
- Frontend: React with Vite for a fast and responsive UI
- Backend: Django for robust server-side processing
- Speech-to-Text: OpenAI's Whisper model for high-quality transcription
- Visual Descriptions: LLaMA3 through Ollama for generating descriptive imagery
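For reference, the synchronized display depends on the per-segment timestamps Whisper produces. The sketch below shows roughly how the backend might obtain them; the model size and file path are placeholders, not taken from the project code:

```python
import whisper

# Load a Whisper model; "base" is a placeholder size, the project may use another.
model = whisper.load_model("base")

# Transcribe an audio file; Whisper returns the full text plus timestamped segments.
result = model.transcribe("recording.mp3")

for segment in result["segments"]:
    # Each segment carries start/end times (in seconds) and its text,
    # which is what lets the frontend sync the words to audio playback.
    print(f'{segment["start"]:6.2f} -> {segment["end"]:6.2f}  {segment["text"].strip()}')
```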
- Audio is uploaded from the client to the Django backend
- The backend processes the audio in chunks using Whisper
- Transcription results are streamed back to the frontend
- Each chunk is sent to Ollama for visual description generation
- The frontend synchronizes the display with audio playback
- Visual descriptions update as the audio progresses
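The project's actual view code is not reproduced here, but the flow above could look roughly like this on the Django side. Treat it as a simplified sketch under assumptions: it streams per-segment results rather than true chunked decoding, and the helper names (`describe_segment`, `save_upload`), prompt wording, and model size are illustrative, not the project's real identifiers:

```python
import json
import requests
import whisper
from django.http import StreamingHttpResponse

model = whisper.load_model("base")  # placeholder model size

def describe_segment(text):
    """Ask the local Ollama service for a short visual description of one chunk."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": f"Describe a visual scene that represents: {text}",
            "stream": False,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def transcribe_view(request):
    """Transcribe the uploaded audio and stream results back chunk by chunk."""
    audio_path = save_upload(request.FILES["audio"])  # hypothetical helper

    def event_stream():
        result = model.transcribe(audio_path)
        for segment in result["segments"]:
            yield json.dumps({
                "start": segment["start"],
                "end": segment["end"],
                "text": segment["text"],
                "description": describe_segment(segment["text"]),
            }) + "\n"

    # Newline-delimited JSON lets the frontend render each chunk as it arrives.
    return StreamingHttpResponse(event_stream(), content_type="application/x-ndjson")
```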
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Multiple language support
- Custom visual styling options
- User accounts for saving and sharing transcriptions
- Integration with Stable Diffusion for actual image generation
- Mobile application support