Complete code examples for the Medium article series on RAG chunking strategies. Each article has its own folder with working implementations, benchmarks, and practical examples.
- Why Your AI Keeps Forgetting Context - Understanding the chunking problem
- Chunking Fundamentals - Basic strategies and implementations
- Hierarchy-Based Chunking - Respecting document structure
- Semantic & Recursive Chunking - Advanced meaning-aware strategies
- Chunking Face-off - Performance comparisons and benchmarks
- File Format Challenges - PDFs, tables, images, and more
- Production Pipelines - Building robust systems
# Clone the repository
git clone https://github.com/Pritish053/rag-chunking-code.git
cd rag-chunking-code
# Install dependencies
pip install -r requirements.txt
# Run the examples for article 1
python 01-why-ai-forgets/naive_chunking_problems.py.
├── 01-why-ai-forgets/ # Code for article 1
│ ├── naive_chunking_problems.py
│ ├── quick_fixes.py
│ └── examples/
├── 02-chunking-fundamentals/ # Code for article 2
│ ├── fixed_size_chunking.py
│ ├── sentence_based_chunking.py
│ └── paragraph_chunking.py
├── 03-hierarchy-based/ # Code for article 3
│ ├── markdown_chunker.py
│ ├── html_chunker.py
│ └── document_structure.py
├── 04-semantic-recursive/ # Code for article 4
│ ├── semantic_chunker.py
│ ├── recursive_splitter.py
│ └── embedding_analysis.py
├── 05-chunking-faceoff/ # Code for article 5
│ ├── benchmark_suite.py
│ ├── evaluation_metrics.py
│ └── results/
├── 06-file-formats/ # Code for article 6
│ ├── pdf_chunker.py
│ ├── table_processor.py
│ └── multimodal_chunking.py
├── 07-production-pipelines/ # Code for article 7
│ ├── pipeline.py
│ ├── monitoring.py
│ └── deployment/
├── utils/ # Shared utilities
│ ├── tokenizer.py
│ ├── embeddings.py
│ └── evaluation.py
├── examples/ # Sample documents and data
│ ├── documents/
│ └── test_data/
└── tests/ # Unit tests
└── test_*.py
- Python 3.8+
- pip or conda
pip install -r requirements.txtFor development:
pip install -r requirements-dev.txtfrom chunking_strategies import NaiveChunker
# This is what NOT to do
chunker = NaiveChunker(chunk_size=1000)
chunks = chunker.chunk(document_text)
# See the problems it creates
from chunking_strategies import analyze_chunks
problems = analyze_chunks(chunks)
print(f"Found {len(problems)} issues with naive chunking")from chunking_strategies import OverlapChunker
# A quick improvement
chunker = OverlapChunker(chunk_size=1000, overlap=100)
chunks = chunker.chunk(document_text)from chunking_strategies import SentenceChunker
# Respects sentence boundaries
chunker = SentenceChunker(target_size=1000)
chunks = chunker.chunk(document_text)Run the benchmark suite to compare different strategies:
python 05-chunking-faceoff/benchmark_suite.py --input examples/documents/technical_manual.pdfRun all tests:
pytest tests/Run specific test suite:
pytest tests/test_semantic_chunking.py -vEach chunking strategy is evaluated on:
- Semantic Coherence: How well meaning is preserved
- Retrieval Accuracy: How often the right chunks are found
- Processing Speed: Time to chunk documents
- Memory Usage: RAM requirements
- Token Efficiency: Optimal use of model context
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Examples inspired by real-world RAG implementations at Xcelcode
- Built with LangChain, OpenAI, and other excellent tools
- Special thanks to the open-source community
- Author: Pritish Maheta
- Medium: Follow the series
- LinkedIn: Connect with me
If you find this helpful, please star the repository!
Part of the "Mastering RAG Chunking Strategies" series on Medium