A production-ready, multimodal video retrieval system that matches videos to natural language queries using powerful vision-language models and LLMs. This system enables semantic video search by leveraging frame-level visual summaries and enriched audio transcripts, all indexed in a vector database for blazing-fast similarity search.
- 🔍 Natural Language Video Search: Find relevant videos using everyday language.
 - 🧠 Multimodal Understanding: Combines visual and audio information for richer context.
 - 🖼️ Visual Summarization: Extracts key frames (1 FPS) and generates meaningful summaries using a vision-language model.
 - 🎙️ Audio Transcript Enrichment: Converts speech to text and integrates it with visual data.
 - ⚡ Fast Retrieval: Stores embeddings and metadata in a vector database for efficient semantic search.
 - 🧪 Scalable & Production-Ready: Clean architecture ready for deployment and scaling.
 
| Component | Technology | 
|---|---|
| Language | Python | 
| Visualsummary Model | BLIP | 
| Vector DB | Chroma | 
| Audio Processing | ffmpeg, pydub | 
| Frame Extraction | OpenCV | 
| Deployment | Streamlit | 
RUN run.py using |streamlit run| command in CLI using seperate environmet to get streamlit UI as local webhost.