This is a production-ready document conversion and processing engine, implemented primarily as a wrapper around other tools. It uses open-source libraries to convert common file formats (PDF, DOCX, etc.) and web content to Markdown, a format that works well with LLMs and embedding models.
- Multi-format Support: Converts PDFs, DOCX files, and more to Markdown using Docling.
- LLM Integration: Optionally uses a vision-capable LLM (GPT-4o) via Zerox to process visually complex documents.
- Web Content Scraping: Converts webpages to Markdown using Jina AI Reader.
- Metadata Extraction: Detects document language and calculates token counts for popular tokenizers (`cl100k_base` and `o200k_base`).
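As a rough illustration of the web scraping and token counting features, the sketch below fetches a page as Markdown through the public Jina AI Reader endpoint (which works by prefixing the target URL with `https://r.jina.ai/`) and counts tokens for the two encodings named above. The helper names `build_reader_url`, `fetch_markdown`, and `count_tokens` are hypothetical and are not taken from this project's code. It also assumes, without confirmation from the source, that token counts come from the `tiktoken` package.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

JINA_READER_PREFIX = "https://r.jina.ai/"  # public Jina AI Reader endpoint


def build_reader_url(page_url: str) -> str:
    """Return the Jina Reader URL for a page (hypothetical helper)."""
    parts = urlsplit(page_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {page_url!r}")
    return JINA_READER_PREFIX + page_url


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch a page as Markdown through Jina Reader (needs network access)."""
    request = Request(
        build_reader_url(page_url),
        headers={"User-Agent": "doc-stack-example"},  # arbitrary example value
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def count_tokens(text, encodings=("cl100k_base", "o200k_base"), get_encoding=None):
    """Return {encoding name: token count} for the given text."""
    if get_encoding is None:
        # Assumption: tiktoken is installed; the project may count tokens differently.
        import tiktoken
        get_encoding = tiktoken.get_encoding
    return {name: len(get_encoding(name).encode(text)) for name in encodings}
```

With network access and `tiktoken` installed, `count_tokens(fetch_markdown("https://example.com"))` would return a dictionary with one count per encoding. The `get_encoding` parameter exists only so that the tokenizer can be swapped out, for example in tests.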
- Python: 3.12 or higher
- Libraries: Refer to the `pyproject.toml` file for a complete list.
- Docker: (Optional) For containerized deployment.
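The requirements above can be checked before installing anything. The following is a minimal sketch, not part of the project: `python_ok` and `preflight` are hypothetical helper names, and the check for `uv` assumes you plan to use it as your environment manager.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)  # minimum version from the requirements list


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= minimum


def preflight() -> dict:
    """Report which prerequisites are available on this machine."""
    return {
        "python": python_ok(),
        "uv": shutil.which("uv") is not None,          # assumed environment manager
        "docker": shutil.which("docker") is not None,  # optional
    }
```

Running `print(preflight())` shows at a glance which tools are missing. A `False` for `docker` is harmless unless you want a containerized deployment.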
- Clone the Repository:

      git clone https://github.com/yourusername/modern-doc-processing-stack.git
      cd modern-doc-processing-stack

- Create and Configure Environment Variables:

      cp .env.example .env

- Set Up Python Environment: Use uv or your preferred environment manager. For example:

      uv sync

- Run the Application:

      uv run hypercorn src/main:app --bind 0.0.0.0:8000
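Once the server is started, a quick way to confirm that it is accepting connections is a plain TCP check against the bound port. This sketch deliberately makes no assumptions about the application's routes, which are not described here. `is_listening` is a hypothetical helper, and `127.0.0.1` is used because binding to `0.0.0.0` makes the server reachable on the loopback interface as well.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `is_listening()` with the defaults matches the `--bind 0.0.0.0:8000` command above. It only shows that something is listening on the port, not that the application itself is healthy.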