Skip to content

AI tool to transforms any URL into a structured knowledge source by: extracting content using Crawl4AI ,vectorizing and summarizing data , running Retrieval-Augmented Generation (RAG) for deep information discovery, enabling a smart chatbot for interactive Q&A.

Notifications You must be signed in to change notification settings

balaji1233/WEB_MASTER

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WEB_MASTER

AI tool transforms any URL into a structured knowledge source by: extracting content using Crawl4AI ,vectorizing and summarizing data , running Retrieval-Augmented Generation (RAG) for deep information discovery, enabling a smart chatbot for interactive Q&A.

WebMaster is a powerful AI-driven tool that transforms any URL into a structured knowledge source. Built using Crawl4AI, Ollama, DeepSeek, and Streamlit, it enables you to extract, vectorize, and summarize web content—and interact with it through a smart chatbot. Perfect for researchers, analysts, and AI enthusiasts, WebMaster isn’t just another coding exercise; it’s a real-world solution to information overload.


🚀 Why WebMaster?

The Problem

In today’s fast-paced digital age, countless websites contain valuable data and insights—but manually extracting and understanding this content is time-consuming and error-prone.

  • For researchers and analysts: Sifting through lengthy articles and disparate data is inefficient.
  • For businesses: Making sense of scattered online information can hinder strategic decisions.

Our Solution

WebMaster addresses these challenges by:

  • Extracting Web Content: Automatically crawling and gathering text from any URL.
  • Structuring Information: Vectorizing and summarizing data to present clear, concise insights.
  • Deep Information Discovery: Employing Retrieval-Augmented Generation (RAG) to uncover deeper, contextual details.
  • Interactive Q&A: Offering a chatbot interface that lets you query and interact with the extracted content in real time.

🔑 Key Features

  • Website Extraction:
    Uses Crawl4AI to efficiently crawl and extract content from web pages.

  • Summarization:
    Generates detailed summaries of the extracted content—ideal for long articles or complex websites.

  • Embeddings & Retrieval:
    Creates embeddings using FAISS for intelligent document retrieval, overcoming open-source context window limitations.

  • Chatbot Interface:
    Provides a conversational agent for interactive Q&A, letting you explore your content seamlessly.

  • Dual AI Engine Support:
    Choose between Closed Source (OpenAI) and Open Source (Ollama) engines for both summarization and conversation to suit your needs.


🎯 Impact & Value

  • Real-World Problem Solving:
    Rather than being just a coding exercise, WebSage is designed as a business tool—for instance, helping freelancers manage data or enabling researchers to efficiently analyze academic content.

  • Quantifiable Benefits:

    • Time Savings: Automates extraction and summarization, potentially reducing manual analysis time by up to 35%.
    • Enhanced Insight: The RAG approach enables deeper, context-aware retrieval of information.
    • Flexibility & Cost-Efficiency: Supports both open and closed source AI engines, allowing for tailored, budget-friendly solutions.

🛠️ How to Use WebMaster

Prerequisites

  • Python 3.8+
  • Required packages as listed in requirements.txt
  • API keys or access tokens for AI engines (if using Closed Source models)

Installation

Clone the repository and install dependencies:

git clone https://github.com/yourusername/websage.git
cd webmaster
pip install -r requirements.txt

Configuration

Edit the config.yaml file to set your preferred options:

  • AI Engine Selection:
    Choose between OpenAI (Closed Source) and Ollama (Open Source) for summarization and chat.

  • FAISS Vector Database:
    Configure local vector database settings.

  • Other Parameters:
    Set URL input, output format, etc.


Running the Application

Launch the Streamlit interface to start using WebSage:

streamlit run app.py

This opens a browser window where you can:

  • Enter a URL: Trigger content extraction.
  • View Summaries: Read concise, AI-generated summaries.
  • Chat with the Bot: Ask follow-up questions and explore your content interactively.

Project Structure

websage/
├── app.py                # Streamlit web app entry point
├── config.yaml           # Configuration file for API keys, DB settings, etc.
├── crawlers/             # Content extraction using Crawl4AI
├── summarizer/           # Modules for text summarization and embeddings creation
├── chatbot/              # Chatbot interface using RAG for Q&A
├── requirements.txt      # Python dependencies
└── README.md             # Project documentation

💬 Contributing

We welcome contributions! To get involved:

  • Fork the repository.
  • Create a feature branch.
  • Submit a pull request with your changes.

For major contributions, please open an issue to discuss your ideas first.

📄 License

This project is open-source and available under the MIT License.

🙌 Final Thoughts

For early-career developers aiming to add meaningful projects to your GitHub profile, WebMaster demonstrates not only coding ability but also strong problem-solving skills. Focus on impact, not just output—one impactful project can be far more valuable than hundreds of clone apps.

References

Feel free to modify any section to suit your project's specifics or update links and images as needed.

About

AI tool to transforms any URL into a structured knowledge source by: extracting content using Crawl4AI ,vectorizing and summarizing data , running Retrieval-Augmented Generation (RAG) for deep information discovery, enabling a smart chatbot for interactive Q&A.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published