Enterprise RAG is a workspace-scoped document intelligence platform for uploading PDFs, extracting text, indexing chunks with embeddings, and answering questions with grounded citations.
The repository is organized as a monorepo with three application surfaces:
- `server`: FastAPI API and core RAG orchestration
- `worker`: Redis/RQ background jobs for extraction, indexing, and maintenance
- `client`: React + Vite frontend with Supabase authentication
This README describes the current codebase: what the system does today, how the services interact, and how the architecture is designed to scale as an enterprise application.
- Why This Exists
- What The System Does
- System Architecture
- How It Works End to End
- Repository Layout
- API Surface
- Data Model
- Limits and Controls
- Local Development
- Environment Variables
- Operational Notes
- Current Status
- Roadmap
Typical RAG demos stop at a single script that embeds documents and sends a prompt to an LLM. That is not enough for a production system.
This project is built around the concerns that matter in enterprise environments:
- strict workspace isolation
- authenticated access with bearer tokens
- asynchronous ingestion so uploads do not block API requests
- token budget enforcement with reservation and commit semantics
- query logging and observability
- repeatable local development with Docker Compose
- clean separation between API, workers, storage, and UI
At a high level, the platform supports this flow:
- A user signs in with Supabase Auth.
- The user creates a workspace.
- The client requests a signed upload URL from the API.
- A PDF is uploaded to Supabase Storage.
- The API confirms the upload and enqueues background jobs.
- Workers extract page text, split it into chunks, and generate embeddings.
- The document becomes queryable.
- The user asks a question against a document.
- The API embeds the question, retrieves the most relevant chunks, calls the LLM with grounded context, and returns an answer with citations.
- Usage, latency, and errors are recorded for observability.
```mermaid
flowchart LR
    User[User] --> Client[React Client]
    Client --> Auth[Supabase Auth]
    Client -->|JWT| API[FastAPI API]
    API --> DB[(PostgreSQL + pgvector)]
    API --> Redis[(Redis)]
    API --> Storage[Supabase Storage]
    API --> OpenAI[OpenAI API]
    API -->|enqueue| ExtractQ[RQ ingest_extract]
    API -->|enqueue| IndexQ[RQ ingest_index]
    ExtractQ --> ExtractWorker[Extraction Worker]
    IndexQ --> IndexWorker[Indexing Worker]
    ExtractWorker --> Storage
    ExtractWorker --> DB
    ExtractWorker --> Redis
    IndexWorker --> DB
    IndexWorker --> OpenAI
    API --> Client
```
```
+--------------------+          +---------------------+
|   React Client     |          |      Supabase       |
|   Vite app         |          |   Auth + Storage    |
+---------+----------+          +----------+----------+
          |                                ^
          | JWT / signed upload flow       |
          v                                |
+---------+-------------------------------+---------+
|                  FastAPI Server                   |
|  - auth validation                                |
|  - workspace-scoped APIs                          |
|  - query orchestration                            |
|  - token budget checks                            |
+---------+----------------------+------------------+
          |                      |
          | SQL                  | enqueue jobs
          v                      v
+---------+----------+  +-------+----------------------+
|    PostgreSQL      |  |          Redis / RQ          |
|    pgvector        |  | rate limiting + job queues   |
+---------+----------+  +-------+----------------------+
          ^                     |
          |                     v
          |             +-------+----------------------+
          |             |      Worker Processes        |
          |             |  - extract PDF text          |
          |             |  - chunk pages               |
          |             |  - create embeddings         |
          |             |  - cleanup reservations      |
          |             +------------------------------+
          |
          +------------ OpenAI embeddings / chat model
```
- API and worker responsibilities are separated.
- Document ingestion is asynchronous and queue-backed.
- Every major operation is scoped by `workspace_id`.
- Token usage is tracked centrally by day.
- Redis is used both for rate limiting and job execution.
- Vector retrieval stays in Postgres with `pgvector` instead of introducing another data store.
- The frontend is a separate deployable artifact.
The client authenticates with Supabase and sends the bearer token to the API. The server validates the token and derives the current user. Workspace-scoped endpoints then resolve the user's workspace before accessing documents or usage records.
Primary files:
- `client/src/lib/supabase.ts`
- `server/app/api/deps.py`
- `server/app/core/auth.py`
- `server/app/api/workspaces.py`
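The token check can be sketched with stdlib-only HS256 verification, assuming Supabase's symmetric JWT signing (`SUPABASE_JWT_SECRET`). The real code in `server/app/core/auth.py` likely uses a JWT library and checks more claims; `mint_token` exists only to make the sketch self-contained, since in the real flow Supabase Auth issues the token.

```python
import base64
import hashlib
import hmac
import json


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(seg: str) -> bytes:
    return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))


def mint_token(claims: dict, secret: str) -> str:
    # Demo-only helper; Supabase Auth issues real tokens.
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_token(token: str, secret: str) -> dict:
    # Recompute the HMAC over header.payload and compare in constant time.
    header, payload, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig)):
        raise ValueError("invalid token signature")
    return json.loads(_b64url_decode(payload))
```

Once the claims are verified, the server derives the current user from `sub` and resolves their workspace before touching any data.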
The upload pipeline is designed so the API never has to receive the PDF bytes directly.
```mermaid
sequenceDiagram
    participant C as Client
    participant A as API
    participant S as Supabase Storage
    participant R as Redis/RQ
    participant W1 as Extract Worker
    participant W2 as Index Worker
    participant D as PostgreSQL
    C->>A: POST /documents/upload-prepare
    A->>D: create placeholder document row
    A-->>C: signed upload URL + storage path
    C->>S: upload PDF directly
    C->>A: POST /documents/upload-complete
    A->>R: enqueue extract job
    R->>W1: ingest_extract
    W1->>S: download PDF
    W1->>D: write document_pages
    W1->>R: enqueue ingest_index
    R->>W2: ingest_index
    W2->>D: write chunks
    W2->>D: write chunk_embeddings
    W2->>D: mark document ready/indexed
```
What happens in practice:
- `upload-prepare` validates file size, content type, workspace limits, and idempotency.
- The API stores a placeholder document record and returns a signed storage URL.
- `upload-complete` confirms the object exists in storage and enqueues extraction.
- `ingest_extract` downloads the PDF and writes extracted page text into `document_pages`.
- `ingest_index` chunks page text, generates embeddings, stores vectors, and marks the document ready.
Primary files:
- `server/app/api/documents.py`
- `server/app/core/storage.py`
- `worker/jobs/ingest_extract.py`
- `worker/jobs/ingest_index.py`
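The chunking step in `ingest_index` can be approximated with a character-window splitter. The window and overlap sizes below are illustrative assumptions, not the values the worker actually uses.

```python
def chunk_page_text(pages, max_chars=1000, overlap=200):
    """Split per-page text into overlapping, page-bounded chunks.

    `pages` is a list of (page_number, text) pairs, mirroring rows in
    document_pages. Chunks never span page boundaries, which keeps each
    citation tied to a single page.
    """
    chunks = []
    for page_number, text in pages:
        start = 0
        while start < len(text):
            end = min(start + max_chars, len(text))
            chunks.append({"page": page_number, "start": start, "text": text[start:end]})
            if end == len(text):
                break
            start = end - overlap  # step back so adjacent chunks share context
    return chunks
```

Each chunk row would then be embedded and written to `chunk_embeddings` alongside its `chunks` record.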
The query flow is grounded retrieval, not free-form generation.
```mermaid
sequenceDiagram
    participant C as Client
    participant A as API
    participant D as PostgreSQL/pgvector
    participant O as OpenAI
    C->>A: POST /query or POST /query/stream
    A->>A: validate workspace, document, limits
    A->>O: embed question
    A->>D: retrieve top-k chunks by vector similarity
    A->>A: reserve token budget
    A->>O: generate grounded answer
    A->>A: commit actual usage and release remainder
    A->>D: write query log
    A-->>C: answer + citations + usage
```
What the server does:
- embeds the question with `text-embedding-3-small`
- retrieves top chunks from `chunk_embeddings` and `chunks`
- builds a grounded prompt using retrieved content
- reserves the estimated token budget before the LLM call
- commits actual usage after the response returns
- logs citations, latency, and token usage
Primary files:
- `server/app/api/query.py`
- `server/app/api/query_stream.py`
- `server/app/core/retrieval.py`
- `server/app/core/embeddings.py`
- `server/app/core/llm.py`
- `server/app/core/token_budget.py`
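In production the similarity search runs inside Postgres via `pgvector`; the same ranking can be sketched in plain Python. The row shape here is an assumption standing in for `chunk_embeddings` joined to `chunks`.

```python
import math


def cosine_distance(a, b):
    # pgvector's cosine distance is 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def retrieve_top_k(question_embedding, chunk_rows, k=5):
    # Rank candidate chunks by ascending cosine distance and keep the top k,
    # matching the system's retrieval depth of top_k = 5.
    ranked = sorted(chunk_rows, key=lambda r: cosine_distance(question_embedding, r["embedding"]))
    return ranked[:k]
```

The retrieved chunk texts become the grounded context for the prompt, and their identifiers become the citations returned with the answer.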
The system exposes both real-time usage and an observability summary.
Current observability coverage includes:
- daily token usage and remaining budget
- total query count
- 24-hour query volume and error rate
- latency statistics
- document status summary
- top queried documents
- recent query failures
Primary files:
- `server/app/api/usage.py`
- `server/app/db/models.py`
- `worker/jobs/maintenance.py`
```
enterprise-rag/
├── client/                  # React + Vite frontend
├── server/                  # FastAPI API, core logic, DB layer
├── worker/                  # Redis/RQ workers and maintenance jobs
├── scripts/                 # DB bootstrap and utility scripts
├── infrastructure/          # Infrastructure placeholders
├── docker-compose.yml       # Full local stack
├── docker-compose.prod.yml  # Production-style compose file
├── AGENTS.md                # Architecture contract and implementation notes
└── README.md
```
- `server/app/api`: REST and streaming endpoints
- `server/app/core`: auth, retrieval, embeddings, prompts, token budget
- `server/app/db`: SQLAlchemy models and DB session setup
- `server/app/schemas`: request and response models
- `worker/jobs`: extraction, indexing, maintenance jobs
- `client/src/pages`: authenticated and public application pages
- `client/src/components`: UI modules for upload, chat, usage, and layout
- `GET /health`
- `GET /auth/me`
- `POST /workspaces`
- `GET /workspaces/me`
- `GET /documents`
- `GET /documents/{document_id}`
- `GET /documents/{document_id}/pages/{page_number}`
- `POST /documents/upload-prepare`
- `POST /documents/upload-complete`
- `POST /documents/{document_id}/retry`
- `POST /documents/{document_id}/reindex`
- `DELETE /documents/{document_id}`
- `POST /query`
- `POST /query/stream`
- `GET /citations/{chunk_id}`
- `GET /queries`
- `GET /queries/{query_id}`
- `POST /chats/sessions`
- `PATCH /chats/sessions/{session_id}`
- `GET /chats/sessions`
- `GET /chats/sessions/{session_id}`
- `GET /usage/today`
- `GET /usage/observability`
Core tables in the current implementation:
- `workspaces`: tenant root for all user content
- `documents`: uploaded PDF metadata and pipeline status
- `document_pages`: extracted page text
- `chunks`: page-bounded text chunks used for retrieval
- `chunk_embeddings`: vector embeddings stored in `pgvector`
- `workspace_daily_usage`: daily token accounting with reserved and used buckets
- `query_logs`: query history, citations, latency, and token metrics
- `chat_sessions`: persisted chat metadata and messages
Document lifecycle in the current codebase:
```
pending_upload/uploading -> uploaded -> extracting -> indexing -> ready/indexed
                                                              \-> failed
```
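The lifecycle above implies a transition guard somewhere in the pipeline. This is a minimal sketch; the exact edge set (notably where retry and reindex re-enter the pipeline) is an assumption, not a transcription of the server's state machine.

```python
# Transitions inferred from the document lifecycle diagram. Any state in the
# pipeline can fall to "failed"; retry and reindex re-enter at "extracting"
# and "indexing" respectively (assumed).
ALLOWED_TRANSITIONS = {
    "pending_upload": {"uploading", "failed"},
    "uploading": {"uploaded", "failed"},
    "uploaded": {"extracting", "failed"},
    "extracting": {"indexing", "failed"},
    "indexing": {"ready", "failed"},
    "failed": {"extracting"},   # POST /documents/{document_id}/retry
    "ready": {"indexing"},      # POST /documents/{document_id}/reindex
}


def can_transition(current: str, target: str) -> bool:
    """Return True when the status change is a legal pipeline step."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```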
Current enforced limits from the application config and rate limiter:
- `1` workspace per user
- up to `100` documents per workspace
- maximum file size: `20 MB`
- supported upload type: `application/pdf`
- maximum query length: `500` characters
- retrieval depth: `top_k = 5`
- LLM max output tokens: `2000`
- daily token limit: `100000` tokens per workspace
- upload prepare rate limit: `10` requests per minute per workspace
- upload complete rate limit: `20` requests per minute per workspace
- query rate limit: `100` requests per minute per workspace
The token budget is managed with reservation semantics so concurrent requests do not overspend the daily allowance.
Flow:
- Estimate query embedding + prompt + max output cost.
- Reserve the estimated tokens.
- Execute the LLM call.
- Commit actual tokens used.
- Release any unused reservation.
- Periodically clean stale reservations.
This logic is implemented in `server/app/core/token_budget.py` and `worker/jobs/maintenance.py`.
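The reserve/commit/release cycle can be sketched as in-memory accounting. The real implementation persists these buckets in `workspace_daily_usage`, updates them atomically, and expires stale reservations via `RESERVATION_TTL_SECONDS`; the class below only illustrates the arithmetic.

```python
class DailyTokenBudget:
    """Reservation-based token accounting for one workspace-day (sketch)."""

    def __init__(self, daily_limit: int = 100_000):
        self.limit = daily_limit
        self.used = 0       # tokens actually consumed
        self.reserved = 0   # tokens held for in-flight requests

    def reserve(self, estimate: int) -> int:
        # Admit the request only if committed + reserved + estimate fits,
        # so concurrent requests cannot collectively overspend.
        if self.used + self.reserved + estimate > self.limit:
            raise RuntimeError("daily token budget exceeded")
        self.reserved += estimate
        return estimate

    def commit(self, reservation: int, actual: int) -> None:
        # Release the whole reservation and record only what was spent.
        self.reserved -= reservation
        self.used += actual

    @property
    def remaining(self) -> int:
        return self.limit - self.used - self.reserved
```

Stale reservations (requests that died before committing) are the reason the maintenance job exists: without cleanup, leaked reservations would slowly eat the daily allowance.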
- Docker and Docker Compose
- Node.js 20+ if running the client outside Docker
- Python 3.11 if running the API or worker outside Docker
- A Supabase project
- An OpenAI API key for embeddings and answer generation
- Create your environment file.

  ```bash
  cp .env.example .env
  ```

- Fill in at least these values:

  ```
  SUPABASE_URL=
  SUPABASE_SERVICE_ROLE_KEY=
  SUPABASE_ANON_KEY=
  SUPABASE_JWT_SECRET=
  OPENAI_API_KEY=
  DATABASE_URL=postgresql://postgres:postgres@localhost:5432/enterprise_rag
  REDIS_URL=redis://localhost:6379/0
  ```

- Start the stack.

  ```bash
  docker-compose up --build
  ```

- Open the services:
  - client: http://localhost:5173
  - api: http://localhost:8000
  - rq dashboard: http://localhost:9181
```bash
# start everything
docker-compose up

# rebuild and start
docker-compose up --build

# stop services
docker-compose down

# stop and remove volumes
docker-compose down -v

# run DB migrations from the server container
docker-compose exec server alembic upgrade head

# view server logs
docker-compose logs -f server

# view worker logs
docker-compose logs -f worker-extract
docker-compose logs -f worker-index
```

Server:
```bash
cd server
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Client:

```bash
cd client
npm install
npm run dev -- --host 0.0.0.0
```

Worker:

```bash
cd worker
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
QUEUE_NAME=ingest_extract python worker.py
```

Root `.env.example` is the primary template for local development.
```
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
SUPABASE_ANON_KEY=your-anon-key
SUPABASE_JWT_SECRET=your-jwt-secret
SUPABASE_STORAGE_BUCKET=documents
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/enterprise_rag
REDIS_URL=redis://localhost:6379/0
ENVIRONMENT=development
API_HOST=0.0.0.0
API_PORT=8000
DAILY_TOKEN_LIMIT=100000
RESERVATION_TTL_SECONDS=600
LOG_EACH_QUERY=false
EMBEDDING_MODEL=text-embedding-3-small
VITE_API_URL=http://localhost:8000
VITE_SUPABASE_URL=https://your-project.supabase.co
VITE_SUPABASE_ANON_KEY=your-anon-key
```

The system is designed around `workspace_id` as the isolation boundary. Document access, usage tracking, retrieval, and query logs are all scoped to a workspace.
PDF binaries live in Supabase Storage. Extracted text, chunks, metadata, and embeddings live in Postgres.
Embeddings are stored in `chunk_embeddings.embedding` using `pgvector`. Retrieval uses cosine distance and returns the most relevant chunk candidates for a single document.
- failed documents can be retried with `POST /documents/{document_id}/retry`
- already processed documents can be reindexed with `POST /documents/{document_id}/reindex`
- stale token reservations can be cleared by the maintenance job
- document deletion removes metadata first, then attempts storage cleanup
Current routed pages in the client:
- `/login`
- `/signup`
- `/workspace`
- `/app/upload`
- `/app/chat`
- `/app/observability`
- `/app/workspace`
The repository is more than a scaffold. These capabilities are already present in code:
- JWT-backed auth integration with Supabase
- one-workspace-per-user model
- signed upload preparation and upload completion
- background extraction and indexing jobs
- page text persistence
- chunk persistence and vector embedding storage
- grounded query endpoint
- streaming query endpoint using SSE
- citation source retrieval
- query history APIs
- chat session APIs
- usage and observability endpoints
- Docker-based local runtime
Known gaps or areas still being hardened:
- not every table in the architecture contract is represented yet in SQLAlchemy models
- `server/app/core/chunking.py` remains a placeholder while worker-side chunking is active
- production deployment still needs full operational hardening, secrets handling, and CI maturity
- test coverage is still light for end-to-end ingestion and retrieval
Near-term improvements that fit the current architecture:
- Move chunking into a shared core module so API and workers use one implementation.
- Expand integration tests around upload, extraction, indexing, and query behavior.
- Add stronger metrics, worker lifecycle hooks, and scheduled maintenance execution.
- Harden migration coverage for all current tables and status transitions.
- Expand query scope from single-document search to selected multi-document search where needed.
- Improve production deployment docs and CI/CD validation.
- `AGENTS.md`: architecture contract and implementation guidance
- `server/README.md`: server-specific notes
- `worker/README.md`: worker-specific notes
- `client/README.md`: client-specific notes