This repository implements a custom Retrieval-Augmented Generation (RAG) server using Express.js, OpenAI, Pinecone, and a dedicated Docling microservice for document preprocessing. Its modular architecture gives you complete control over the LLM APIs, allowing you to choose and switch between different models for embedding and generation while using a single vector database for retrieval.
A RAG system answers questions by retrieving the most relevant documents and then using a language model to generate accurate, context-aware responses. The process is divided into two main phases:
- **Preparation Phase:**
  - **Document Upload & Preprocessing:** Users can upload various document types (PDF, DOCX, HTML, TXT) via the `/api/upload` endpoint. Documents are then sent to the Docling microservice, which extracts and structures the text into a JSON format that preserves sections, titles, and paragraphs.
  - **Embedding & Indexing:** The structured text is divided into chunks and converted into numerical embeddings using OpenAI's `text-embedding-3-small` model. These embeddings are stored in Pinecone for fast retrieval.
- **Usage Phase:**
  - **Query Processing:** The user's query is embedded and compared against the indexed document chunks.
  - **Response Generation:** Retrieved contexts are used to create a prompt for an LLM (such as GPT-4 or a specialized model) that generates the final response.
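To make the flow concrete, here is a minimal sketch of both phases using the official `openai` and `@pinecone-database/pinecone` Node clients. The function names, the `gpt-4o` generation model, and the metadata shape are illustrative assumptions, not the repository's exact implementation:

```js
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI();                 // reads OPENAI_API_KEY
const pc = new Pinecone();                   // reads PINECONE_API_KEY
const index = pc.index(process.env.PINECONE_INDEX_NAME);

// Preparation phase: embed one chunk and store it in Pinecone.
async function indexChunk(id, text) {
  const emb = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  await index.upsert([{ id, values: emb.data[0].embedding, metadata: { text } }]);
}

// Usage phase: embed the query, retrieve the top matches, generate an answer.
async function answer(query) {
  const emb = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });
  const { matches } = await index.query({
    vector: emb.data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });
  const context = matches.map(m => m.metadata.text).join('\n---\n');
  const chat = await openai.chat.completions.create({
    model: 'gpt-4o', // illustrative; swap in any chat model
    messages: [
      { role: 'system', content: `Answer using only this context:\n${context}` },
      { role: 'user', content: query },
    ],
  });
  return chat.choices[0].message.content;
}
```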
- **Express.js & Node.js:** Handles the API endpoints and orchestrates the processing workflow.
- **OpenAI API:** Provides embedding (using models like `text-embedding-3-small`) and text generation capabilities. The modular design allows you to switch models easily, for example using GPT-4 for generation or specialized models for other tasks.
- **Pinecone:** A vector database that stores and retrieves embeddings efficiently.
- **Docling Microservice:** Preprocesses uploaded documents and extracts structured text.
- **Crawl4AI:** Handles advanced web scraping from public URLs. Converts webpages into Markdown, then sanitizes and filters the resulting content so that only meaningful chunks are embedded into Pinecone.
- **Multer:** Manages file uploads on the server (a wiring sketch follows this list).
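As a rough sketch of how Express and Multer wire the upload route together (the route internals and port are assumptions; the real handler forwards the file to Docling and then embeds the result):

```js
import express from 'express';
import multer from 'multer';

const app = express();
const upload = multer({ dest: 'uploads/' }); // Multer stores incoming files here

// Multer parses the multipart form; the handler then has the file on disk,
// ready to forward to the Docling microservice for structured extraction.
app.post('/api/upload', upload.single('file'), async (req, res) => {
  res.json({ received: req.file.originalname, path: req.file.path });
});

app.listen(3000); // port is illustrative
```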
- **API Flexibility:** You can independently manage the embedding and generation endpoints, selecting different OpenAI models based on task requirements (e.g., a lightweight model for embeddings and a high-capacity model for response generation).
- **Separation of Concerns:** Document processing (handled by Docling), embedding, and vector storage are decoupled from the text generation module. This separation allows you to update or swap out components without overhauling the entire system.
- **Scalability:** The modular architecture makes it easy to add new features or integrate additional services (like real-time TTS or other specialized agents) without disrupting existing workflows.
- **Cost and Resource Optimization:** By tailoring models to specific tasks (e.g., smaller models for embeddings and more advanced ones for generating responses), you can optimize both performance and cost.
- **Specialized Agents:** The system supports mini agents or specialized modules that operate on different indices within the DB, enabling highly targeted retrieval and processing (for instance, real-time TTS agents for voice-based applications).
```
📦 rag-backend
├── 📂 crawl-microservice
│   ├── 📄 app.py
│   ├── 📄 Dockerfile
│   └── 📄 requirements.txt
├── 📂 docling-microservice
│   ├── 📄 app.py
│   ├── 📄 Dockerfile
│   └── 📄 requirements.txt
├── 📄 LICENSE
├── 📄 package-lock.json
├── 📄 package.json
├── 📄 README.md
├── 📂 server
│   ├── 📄 config.js
│   ├── 📄 embed_and_upload.js
│   ├── 📄 index.js
│   └── 📂 site
│       ├── 📄 completeUpload.html
│       └── 📄 upload.html
├── 📂 uploads
└── 📂 scripts
    └── 📄 htmlToFolder.py
```
Create a `.env` file in the root directory:

```
OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX_NAME=your_index_name
```
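A minimal `server/config.js` could then expose these values, assuming the `dotenv` package is used to load the file (the repository's actual config may differ):

```js
import 'dotenv/config'; // loads .env into process.env

export const config = {
  openaiApiKey: process.env.OPENAI_API_KEY,
  pineconeApiKey: process.env.PINECONE_API_KEY,
  pineconeIndexName: process.env.PINECONE_INDEX_NAME,
};
```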
```bash
npm install
npm run start
```
This command will:
- build and run the Docling microservice (using Docker)
- start the Express server with the upload and RAG endpoints
- `GET /`: provides an HTML form for quick document uploads (local testing).
- `POST /api/upload`: uploads and processes documents using the Docling microservice. The processed, structured text is then embedded using OpenAI's API and stored in Pinecone.
- `POST /api/rag`: accepts text queries, retrieves relevant document chunks from Pinecone, and generates context-based responses using an OpenAI model.
Example: uploading a document from the browser:

```js
const formData = new FormData();
formData.append('file', fileInput.files[0]);

fetch('/api/upload', {
  method: 'POST',
  body: formData
})
  .then(res => res.json())
  .then(data => console.log(data));
```
Example: querying the RAG endpoint:

```js
fetch('/api/rag', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: "Your question?" })
})
  .then(res => res.json())
  .then(data => console.log(data.answer));
```
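Server-side, the `/api/rag` handler reduces to the retrieve-then-generate flow sketched earlier. A hedged sketch, reusing the hypothetical `answer` helper from above:

```js
app.use(express.json());

app.post('/api/rag', async (req, res) => {
  try {
    // `answer` is the hypothetical retrieve-then-generate helper sketched above.
    const result = await answer(req.body.query);
    res.json({ answer: result }); // matches the `data.answer` read by the client
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});
```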
- Freedom to Choose Different Models: Manage and switch between different OpenAI models for embedding and generation independently. For example, you can use a lightweight model for embeddings and a more powerful one for generating responses.
- Separation of Indexing and Generation Processes: Documents are processed and converted to vectors separately from the generative module, allowing you to update or replace components without rebuilding the entire index.
- Scalability and Flexibility: The modular design makes it easy to add new functionalities or integrate additional services (such as real-time TTS agents or specialized mini agents for different DB indices) without disrupting existing workflows.
- Cost and Resource Optimization: Tailor models to specific tasks (e.g., smaller models for embeddings and more advanced models for generation) to optimize performance and manage costs effectively.
- Specialized Agents: Enables the creation of mini agents, each focused on a specific index within the DB, providing highly targeted retrieval and processing. This can be particularly useful for applications such as real-time TTS or domain-specific query handling.
Docling is a microservice dedicated to converting complex documents into a structured JSON format.
- **Key Features:**
  - **Text Extraction:** Automatically extracts text from documents while preserving the original structure.
  - **Structured Output:** The output JSON includes detailed metadata such as section headers, paragraphs, and other text elements.
- **Benefits:**
  - **Context Preservation:** Keeping the document structure intact allows the RAG system to perform more intelligent chunking and weighting during retrieval (see the sketch after this list).
  - **Modularity:** By isolating document preprocessing, you can update or tweak this component without affecting the rest of the system.
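As an illustration of structure-aware chunking, the sketch below assumes a simplified JSON shape (`sections` with `title` and `paragraphs`), which is not Docling's exact schema, and carries each section title into the chunk text and its metadata:

```js
// Hypothetical simplified input shape:
// { sections: [{ title: 'Intro', paragraphs: ['...', '...'] }, ...] }
function chunkStructuredDoc(doc, maxChars = 1000) {
  const chunks = [];
  for (const section of doc.sections) {
    let buffer = '';
    for (const para of section.paragraphs) {
      // Flush the buffer when adding this paragraph would exceed the limit.
      if (buffer && buffer.length + para.length > maxChars) {
        chunks.push({ text: `${section.title}\n${buffer}`, section: section.title });
        buffer = '';
      }
      buffer += (buffer ? '\n' : '') + para;
    }
    if (buffer) chunks.push({ text: `${section.title}\n${buffer}`, section: section.title });
  }
  return chunks;
}
```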
Crawl4AI is an advanced asynchronous web scraping library built on Playwright. In this project, it is used as a dedicated microservice to extract clean, structured markdown content from web pages.
- **Key Features:**
  - Asynchronous scraping with full support for dynamic JavaScript-rendered websites.
  - Outputs well-structured Markdown for consistent processing.
  - Integrated via a FastAPI endpoint (`/crawl/`), enabling direct crawling from a user-provided URL (an example call follows this list).
  - Seamlessly fits into the document ingestion pipeline, alongside PDF, DOCX, and HTML uploads.
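Triggering a crawl from the Express side might look like the following; the microservice host/port and the request and response body shapes (`url` in, `markdown` out) are assumptions:

```js
// Assumed host/port for the crawl-microservice; adjust to your deployment.
const crawlRes = await fetch('http://localhost:8000/crawl/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com/docs' }), // assumed body shape
});
const { markdown } = await crawlRes.json(); // assumed response field
```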
- **Smart Content Sanitization:** To ensure high-quality indexing and retrieval, all crawled content undergoes text cleaning and normalization (sketched below), including:
  - Removal of excessive line breaks and spacing.
  - Trimming of empty lines and HTML artifacts.
  - Whitespace collapsing to avoid bloated or meaningless text.
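A sanitization pass along those lines might look like this sketch (the exact rules in the repository may differ):

```js
function sanitize(markdown) {
  return markdown
    .replace(/\r/g, '')              // normalize line endings
    .replace(/<[^>]+>/g, '')         // strip leftover HTML artifacts
    .split('\n')
    .map(line => line.trim())        // trim leading/trailing spaces per line
    .join('\n')
    .replace(/\n{3,}/g, '\n\n')      // collapse excessive line breaks
    .replace(/[ \t]{2,}/g, ' ');     // collapse runs of whitespace
}
```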
- **Chunk Filtering Logic:** Before embeddings are generated, every chunk is evaluated (see the sketch below). We discard chunks that:
  - Are too short or contain only whitespace, newlines, or special characters.
  - Lack meaningful semantic content.
  - Duplicate empty or template-based blocks.

This helps keep your vector store clean and relevant, reducing noise and improving query performance. Clean input = smart output.
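A filtering pass along these lines could look like the sketch below; the length threshold and duplicate check are illustrative, and `allChunks` stands in for the chunk list produced earlier:

```js
const seen = new Set();

function isMeaningfulChunk(text) {
  const normalized = text.trim();
  if (normalized.length < 30) return false;           // too short to be useful
  if (!/[a-zA-Z0-9]/.test(normalized)) return false;  // only whitespace/special chars
  if (seen.has(normalized)) return false;             // duplicate or template block
  seen.add(normalized);
  return true;
}

const chunksToEmbed = allChunks.filter(isMeaningfulChunk);
```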
**License & Attribution**

This project uses Crawl4AI for web data extraction. It is distributed under the Apache License 2.0 with an attribution clause. Please refer to the LICENSE file for full license details and obligations.
This repository offers a highly flexible and modular solution for building RAG systems, allowing you to:
- Freely manage and switch between different OpenAI models for embedding and generation.
- Maintain a separation between vector storage (Pinecone) and LLM components.
- Easily integrate additional functionalities like real-time TTS or specialized agents for different database indices.
By preserving the structured output from Docling, you gain the flexibility to implement advanced retrieval strategies that leverage the document's inherent hierarchy, leading to more accurate and context-aware responses.