The Modern Document Processing Stack

This is a production-ready document conversion and processing engine. It is primarily a wrapper around other tools: it uses open-source libraries to convert common file formats (PDF, DOCX, etc.) and web content to Markdown, a format that works well with LLMs and embedding models.

Features

Multi-format Support: Converts PDFs, DOCX, and more to Markdown (thanks to Docling)
LLM Integration: Optionally uses a vision-language model (GPT-4o) via Zerox to process visually complex documents.
Web Content Scraping: Converts webpages to Markdown using Jina AI Reader.
Metadata Extraction: Detects document language and calculates token counts for popular tokenizers (cl100k_base & o200k_base).
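
As a rough illustration of the web scraping and token counting features above, here is a minimal sketch. It is not taken from this repository's source: the function names (`reader_url`, `fetch_markdown`, `token_counts`) are hypothetical, and it assumes the Jina AI Reader is reached by prefixing the target URL with `https://r.jina.ai/` and that token counts come from the `tiktoken` package.

```python
from typing import Callable, Iterable, Optional
from urllib.request import urlopen

# Jina AI Reader returns a Markdown rendering of the page whose URL
# follows this prefix.
JINA_READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the Reader URL for a page (hypothetical helper)."""
    return JINA_READER_PREFIX + page_url


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch a page as Markdown through the Reader (needs network access)."""
    with urlopen(reader_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


def token_counts(
    text: str,
    encodings: Iterable[str] = ("cl100k_base", "o200k_base"),
    get_encoding: Optional[Callable] = None,
) -> dict[str, int]:
    """Count tokens in `text` for each named encoding.

    By default the encodings are loaded with tiktoken, which is imported
    lazily so the rest of this sketch works without it installed.
    """
    if get_encoding is None:
        import tiktoken  # third-party dependency, assumed here

        get_encoding = tiktoken.get_encoding
    return {name: len(get_encoding(name).encode(text)) for name in encodings}
```

With network access and `tiktoken` installed, `token_counts(fetch_markdown("https://example.com"))` would return one count per tokenizer. The `get_encoding` parameter exists only so the counting logic can be exercised with a stand-in tokenizer.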

Requirements

Python: 3.12 or higher
Libraries: Refer to the pyproject.toml file for a complete list.
Docker: (Optional) For containerized deployment.

Installation

Local Setup

Clone the Repository:

```
git clone https://github.com/yourusername/modern-doc-processing-stack.git
cd modern-doc-processing-stack
```

Create and Configure Environment Variables:
```
cp .env.example .env
```
Set Up Python Environment:

Use uv or your preferred environment manager. For example:
```
uv sync
```

Run the Application:

```
uv run hypercorn src/main:app --bind 0.0.0.0:8000
```

marcelmarais/modern-doc-processing-stack