An AI tool that transforms any URL into a structured knowledge source by extracting content with Crawl4AI, vectorizing and summarizing the data, running Retrieval-Augmented Generation (RAG) for deep information discovery, and enabling a smart chatbot for interactive Q&A.
WebMaster is a powerful AI-driven tool that transforms any URL into a structured knowledge source. Built using Crawl4AI, Ollama, DeepSeek, and Streamlit, it enables you to extract, vectorize, and summarize web content—and interact with it through a smart chatbot. Perfect for researchers, analysts, and AI enthusiasts, WebMaster isn’t just another coding exercise; it’s a real-world solution to information overload.
In today’s fast-paced digital age, countless websites contain valuable data and insights—but manually extracting and understanding this content is time-consuming and error-prone.
- For researchers and analysts: Sifting through lengthy articles and disparate data is inefficient.
- For businesses: Making sense of scattered online information can hinder strategic decisions.
WebMaster addresses these challenges by:
- Extracting Web Content: Automatically crawling and gathering text from any URL.
- Structuring Information: Vectorizing and summarizing data to present clear, concise insights.
- Deep Information Discovery: Employing Retrieval-Augmented Generation (RAG) to uncover deeper, contextual details.
- Interactive Q&A: Offering a chatbot interface that lets you query and interact with the extracted content in real time.
- Website Extraction: Uses Crawl4AI to efficiently crawl and extract content from web pages.
- Summarization: Generates detailed summaries of the extracted content, ideal for long articles or complex websites.
- Embeddings & Retrieval: Indexes content embeddings with FAISS for intelligent document retrieval, working around the limited context windows of open-source models.
- Chatbot Interface: Provides a conversational agent for interactive Q&A, letting you explore your content seamlessly.
- Dual AI Engine Support: Choose between Closed Source (OpenAI) and Open Source (Ollama) engines for both summarization and conversation to suit your needs.
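The retrieval idea behind the Embeddings & Retrieval feature can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the linear scan in `top_k` stands in for a FAISS index, and the function names and the `max_words`/`overlap` parameters are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word windows so each piece fits a small context."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query (linear scan in place of FAISS)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Only the top-ranked chunks are passed to the model, which is how retrieval keeps long pages within a limited context window.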
- Real-World Problem Solving: Rather than being just a coding exercise, WebSage is designed as a business tool, for instance helping freelancers manage data or enabling researchers to efficiently analyze academic content.
- Quantifiable Benefits:
  - Time Savings: Automates extraction and summarization, potentially reducing manual analysis time by up to 35%.
  - Enhanced Insight: The RAG approach enables deeper, context-aware retrieval of information.
  - Flexibility & Cost-Efficiency: Supports both open and closed source AI engines, allowing for tailored, budget-friendly solutions.
- Python 3.8+
- Required packages as listed in requirements.txt
- API keys or access tokens for AI engines (if using Closed Source models)
Clone the repository and install dependencies:
git clone https://github.com/yourusername/websage.git
cd websage
pip install -r requirements.txt
Edit the config.yaml file to set your preferred options:
- AI Engine Selection: Choose between OpenAI (Closed Source) and Ollama (Open Source) for summarization and chat.
- FAISS Vector Database: Configure local vector database settings.
- Other Parameters: Set URL input, output format, etc.
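As a rough illustration of the engine selection described above, the sketch below validates a parsed configuration before the app starts. The key names (`summarizer_engine`, `chat_engine`, `openai_api_key`) and the defaults are assumptions made for this example and may not match the real config.yaml; the input is a plain dictionary, such as one produced by a YAML parser.

```python
from dataclasses import dataclass, fields

SUPPORTED_ENGINES = {"openai", "ollama"}


@dataclass
class EngineConfig:
    # Hypothetical keys; the actual config.yaml may use different names.
    summarizer_engine: str = "ollama"
    chat_engine: str = "ollama"
    openai_api_key: str = ""

    def validate(self):
        """Reject unknown engines and a missing key for the closed-source engine."""
        for name in (self.summarizer_engine, self.chat_engine):
            if name not in SUPPORTED_ENGINES:
                raise ValueError(f"unsupported engine: {name!r}")
        uses_openai = "openai" in (self.summarizer_engine, self.chat_engine)
        if uses_openai and not self.openai_api_key:
            raise ValueError("an API key is required when an OpenAI engine is selected")
        return self


def load_engine_config(raw):
    """Build a validated EngineConfig from a parsed config dictionary."""
    known = {f.name for f in fields(EngineConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return EngineConfig(**raw).validate()
```

Failing early on a bad engine name or a missing key gives a clearer error than a failed API call in the middle of a chat session.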
Launch the Streamlit interface to start using WebSage:
streamlit run app.py
This opens a browser window where you can:
- Enter a URL: Trigger content extraction.
- View Summaries: Read concise, AI-generated summaries.
- Chat with the Bot: Ask follow-up questions and explore your content interactively.
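For the chat step, a RAG-style bot typically folds the retrieved passages into the prompt sent to the model. The sketch below shows one way that assembly could look; the function name `build_prompt`, the prompt wording, and the `max_chars` budget are illustrative assumptions, not the project's actual code.

```python
def build_prompt(question, passages, max_chars=2000):
    """Assemble a grounded prompt from retrieved passages within a character budget."""
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before the context outgrows the budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts) if context_parts else "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )
```

The resulting string can be sent to whichever engine is configured. Numbering the passages lets the model's answer point back to the source text it used.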
websage/
├── app.py              # Streamlit web app entry point
├── config.yaml         # Configuration file for API keys, DB settings, etc.
├── crawlers/           # Content extraction using Crawl4AI
├── summarizer/         # Modules for text summarization and embeddings creation
├── chatbot/            # Chatbot interface using RAG for Q&A
├── requirements.txt    # Python dependencies
└── README.md           # Project documentation
We welcome contributions! To get involved:
- Fork the repository.
- Create a feature branch.
- Submit a pull request with your changes.
For major contributions, please open an issue to discuss your ideas first.
This project is open-source and available under the MIT License.
For early-career developers aiming to add meaningful projects to their GitHub profiles, WebMaster demonstrates not only coding ability but also strong problem-solving skills. Focus on impact, not just output: one impactful project can be far more valuable than hundreds of clone apps.