An open-source RAG evaluation engine. Benchmark your retrieval pipeline before it ships.
Most RAG apps ship without anyone knowing why they fail. RagBench gives you the instrumentation to find out: swap chunking strategies, retrieval modes, and LLM providers, then measure the impact with the five RAG Triad metrics (Faithfulness, Contextual Precision, Contextual Recall, Answer Relevancy, and Contextual Relevancy).
One command to run. No config files to write. Seed data included.
Prerequisites: Docker + Docker Compose, and at least one LLM API key (OpenAI, Gemini, or Anthropic). Evaluation scoring needs an OpenAI key, because DeepEval uses OpenAI by default.
git clone https://github.com/paramjeetn/ragbench.git
cd ragbench
cp .env.example .env
# Edit .env — add OPENAI_API_KEY, GEMINI_API_KEY, or ANTHROPIC_API_KEY
make up

| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| API Docs (Swagger) | http://localhost:8000/docs |
| Qdrant Dashboard | http://localhost:6333/dashboard |
Seed data loads automatically — the dashboard shows real eval results on first launch.
make logs # tail all services
make down # stop
make clean # stop + wipe volumes (fresh start)

- Upload documents: PDF, Markdown, or plain text. The ingestion pipeline parses, chunks, embeds, and indexes them into Qdrant automatically.
- Chat with your documents — streaming Q&A with source citations and token/cost metadata per response.
- Run evaluations — pick a question set, run it against your current pipeline config, get per-question RAG Triad scores.
- Compare runs side-by-side — radar charts showing how a config change moved each metric. Know if your change helped or hurt.
- Tune from the UI — change chunking strategy, embedding model, retrieval mode, reranker, LLM — all from a settings panel, no restarts.
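
A side-by-side comparison comes down to per-metric deltas between two runs. The sketch below is not RagBench's implementation: the function names, the snake_case metric keys, the 0.01 tolerance, and the scores are all assumptions made for illustration.

```python
# Metric names follow the README; the snake_case keys are an assumption.
METRICS = [
    "faithfulness",
    "contextual_precision",
    "contextual_recall",
    "answer_relevancy",
    "contextual_relevancy",
]


def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for each metric, rounded to 3 places."""
    return {m: round(candidate[m] - baseline[m], 3) for m in METRICS}


def verdict(delta: float, tolerance: float = 0.01) -> str:
    """Label a delta; changes within the (hypothetical) tolerance count as noise."""
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "regressed"
    return "unchanged"


# Made-up scores for two hypothetical runs.
baseline = {"faithfulness": 0.80, "contextual_precision": 0.70, "contextual_recall": 0.60,
            "answer_relevancy": 0.90, "contextual_relevancy": 0.50}
candidate = {"faithfulness": 0.85, "contextual_precision": 0.70, "contextual_recall": 0.75,
             "answer_relevancy": 0.88, "contextual_relevancy": 0.50}

deltas = compare_runs(baseline, candidate)
for metric, delta in deltas.items():
    print(f"{metric:22s} {delta:+.3f}  {verdict(delta)}")
```

A single scalar "better/worse" hides trade-offs, so the sketch keeps the deltas per metric, which is also what a radar chart displays.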
Documents → Parse → Chunk → Embed → Qdrant
                                      │
Query → Embed → Dense Search ──┐      │
              → BM25 Sparse ───┴─ RRF Fusion (top 20)
                                      │
                        Cross-Encoder Rerank (top 5)
                                      │
                                LLM Generator
                                      │
                             Answer + Citations
                                      │
                           Evaluate with RAG Triad
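
The RRF Fusion step in the diagram merges the dense and BM25 rankings into one list. Here is a minimal sketch of Reciprocal Rank Fusion in plain Python. The function name `rrf_fuse`, the chunk IDs, and the constant `k = 60` (a commonly used default for RRF) are assumptions; the README does not state which constant RagBench uses, and in the real pipeline fusion may happen inside Qdrant.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Ties are broken by chunk ID for determinism.
    """
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical chunk IDs returned by the two retrievers, best first.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]

fused = rrf_fuse([dense, sparse])
print([chunk_id for chunk_id, _ in fused])
```

RRF uses only rank positions, so dense similarity scores and BM25 scores never need to be normalized onto a common scale.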
Every stage is swappable:
| Stage | Options |
|---|---|
| Chunking | Fixed-size · Recursive · Semantic · Document-aware |
| Embedding | OpenAI · Gemini · SentenceTransformers (local) |
| Retrieval | Dense · Sparse (BM25) · Hybrid (RRF) |
| Reranking | Cross-encoder ms-marco-MiniLM-L-12-v2 · None |
| Generation | OpenAI · Anthropic · Gemini · Ollama (local) |
| Evaluation | DeepEval RAG Triad · GEval |
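
To make the simplest chunking option concrete, here is a rough sketch of fixed-size chunking with overlap. It splits on characters, and `chunk_size` and `overlap` are hypothetical parameters; RagBench's own chunker may count tokens and use different names and defaults.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing chunk
        # is emitted that only repeats text already covered.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2))
```

Overlap trades index size for a lower chance that a relevant sentence is cut in half at a chunk boundary.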
RagBench uses the RAG Triad framework. Each metric maps to a specific part of the pipeline, so a low score tells you exactly what to fix.
| Metric | What It Measures | Low Score Means |
|---|---|---|
| Contextual Precision | Reranker quality | Irrelevant chunks ranked too high |
| Contextual Recall | Embedding coverage | Missing relevant information |
| Contextual Relevancy | Chunk size / top-K tuning | Too much noise in retrieved context |
| Answer Relevancy | Prompt template quality | Answer doesn't address the question |
| Faithfulness | LLM groundedness | Hallucination — answer goes beyond context |
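
The table can be read as a lookup from a weak metric to the pipeline stage worth inspecting first. The sketch below encodes that lookup. The `diagnose` helper, the snake_case keys, and the 0.7 threshold are assumptions for illustration, not part of RagBench or DeepEval.

```python
# metric -> (pipeline area to inspect, what a low score suggests), per the table above
DIAGNOSIS = {
    "contextual_precision": ("reranker", "irrelevant chunks ranked too high"),
    "contextual_recall": ("embedding coverage", "relevant information is missing"),
    "contextual_relevancy": ("chunk size / top-K", "too much noise in retrieved context"),
    "answer_relevancy": ("prompt template", "answer does not address the question"),
    "faithfulness": ("LLM groundedness", "answer goes beyond the context"),
}


def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return one hint per metric below the (hypothetical) threshold, worst first."""
    hints = []
    for metric, score in sorted(scores.items(), key=lambda kv: kv[1]):
        if score < threshold and metric in DIAGNOSIS:
            area, symptom = DIAGNOSIS[metric]
            hints.append(f"{metric}={score:.2f}: check {area} ({symptom})")
    return hints


# Made-up scores from a hypothetical eval run.
scores = {"faithfulness": 0.92, "contextual_recall": 0.41, "contextual_precision": 0.65}
hints = diagnose(scores)
for hint in hints:
    print(hint)
```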
| Layer | Technology |
|---|---|
| Frontend | Next.js 16 · TypeScript · Tailwind CSS · shadcn/ui |
| Backend | Python 3.12 · FastAPI · SQLAlchemy (async) · Pydantic |
| Vector DB | Qdrant (HNSW indexing, hybrid search) |
| Database | PostgreSQL 16 |
| Embeddings | OpenAI · Gemini · sentence-transformers |
| Reranking | cross-encoder/ms-marco-MiniLM-L-12-v2 |
| Evaluation | DeepEval (RAG Triad + GEval) |
| Infra | Docker Compose · multi-stage builds · Make |
ragbench/
├── Makefile # make up / down / clean / seed / logs
├── docker-compose.yml
├── .env.example
│
├── backend/
│ ├── main.py # FastAPI entry + lifespan
│ ├── config.py # Pipeline config schema
│ ├── ingestion/ # Parser + chunking strategies
│ ├── embedding/ # Multi-provider embedder
│ ├── retrieval/ # Dense, sparse, hybrid, reranker
│ ├── generation/ # LLM providers + prompt templates
│ ├── evaluation/ # RAG Triad runner + comparison
│ ├── database/ # SQLAlchemy models + repository
│ ├── vectorstore/ # Qdrant client wrapper
│ ├── seed/ # Sample docs + idempotent seed loader
│ └── api/ # REST route handlers
│
└── frontend/src/
├── app/ # Pages: dashboard, chat, documents, evaluate, compare
├── components/ # UI components (shadcn/ui)
├── context/ # Chat + eval state
└── lib/ # API client + types
POST /api/ingest Upload and process documents
GET /api/documents List ingested documents
POST /api/query Ask a question (sync)
POST /api/query/stream Ask a question (SSE streaming)
POST /api/eval/run Run evaluation suite
GET /api/eval/runs/{id} Get results for an eval run
GET /api/eval/compare Compare two eval runs
GET /api/config Get current pipeline config
PUT /api/config Update pipeline config
GET /health Health check
Full interactive docs at http://localhost:8000/docs.
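
For scripted use, the sync query endpoint can be called from Python's standard library. This is a sketch under assumptions: the README documents only the route, so the `question` field in the request body is a guess, and the helper names are hypothetical. Check the Swagger docs for the actual schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # backend URL from the services table


def build_query_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /api/query request. The 'question' field name is an assumption."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = BASE_URL) -> dict:
    """Send the question and return the parsed JSON response (requires a running stack)."""
    with urllib.request.urlopen(build_query_request(question, base_url), timeout=60) as resp:
        return json.load(resp)
```

With the stack running, `ask("What do the seed documents cover?")` should return the parsed response body. Its exact shape is not described in this README.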
make up # Start all services (postgres → qdrant → backend → seed → frontend)
make down # Stop all services
make build # Rebuild containers
make logs # Tail all logs
make logs-backend # Tail backend logs only
make logs-frontend # Tail frontend logs only
make seed # Re-run seed data loader
make clean # Stop + wipe all volumes
make help # Show all commands

Copy .env.example to .env and fill in at least one LLM key:
# LLM — at least one required
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GEMINI_API_KEY=
# Evaluation: RAG Triad scoring needs OPENAI_API_KEY (set it above), because DeepEval uses OpenAI by default
# Infra — defaults work for local Docker
POSTGRES_URL=postgresql+asyncpg://ragbench:ragbench@postgres:5432/ragbench
QDRANT_HOST=qdrant
QDRANT_PORT=6333

Pull requests are welcome. For significant changes, open an issue first to discuss what you'd like to change.
- Fork the repo
- Create a feature branch: `git checkout -b feat/your-feature`
- Commit with conventional commits: `git commit -m "feat: add X"`
- Open a PR against `main`
MIT © Paramjeet
