A text extraction library supporting PDFs, images, office documents and more

Kreuzberg

Kreuzberg is a Python library for text extraction from documents. It provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs.

Why Kreuzberg?

  • Simple and Hassle-Free: Clean API that just works, without complex configuration
  • Local Processing: No external API calls or cloud dependencies required
  • Resource Efficient: Lightweight processing without GPU requirements
  • Format Support: Comprehensive support for documents, images, and text formats
  • Multiple OCR Engines: Support for Tesseract, EasyOCR, and PaddleOCR
  • Metadata Extraction: Get document metadata alongside text content
  • Table Extraction: Extract tables from documents using the excellent GMFT library
  • Modern Python: Built with async/await, type hints, and a functional-first approach
  • Permissive OSS: MIT licensed with permissively licensed dependencies

Quick Start

pip install kreuzberg

Install the system dependencies (Tesseract OCR and pandoc):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install -y tesseract pandoc
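
Before running an extraction, it can help to confirm that the system tools are actually on the PATH. The sketch below is not part of Kreuzberg: it uses only the Python standard library, and the helper name `missing_system_tools` is hypothetical. It assumes the executables are called `tesseract` and `pandoc`, matching the package names in the install commands above.

```python
import shutil


def missing_system_tools(tools=("tesseract", "pandoc")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_system_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system tools found.")
```

An empty result means both tools were found. If Tesseract is reported missing, OCR with the default engine will likely not work until it is installed.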

Tesseract is the default OCR engine. If you prefer not to use it, you can either switch to one of the two alternative OCR engines or disable OCR entirely.

Alternative OCR engines

# Install with EasyOCR support
pip install "kreuzberg[easyocr]"

# Install with PaddleOCR support
pip install "kreuzberg[paddleocr]"
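
To see which OCR engines are usable in the current environment, something like the following may help. It is a minimal sketch, not a Kreuzberg API: the function name `available_ocr_engines` is hypothetical, and it assumes the extras install the importable packages `easyocr` and `paddleocr` and that Tesseract is exposed as a `tesseract` executable.

```python
import importlib.util
import shutil


def available_ocr_engines():
    """List the OCR engines that appear to be installed in this environment."""
    engines = []
    # Tesseract is a system binary, so look for it on the PATH.
    if shutil.which("tesseract") is not None:
        engines.append("tesseract")
    # EasyOCR and PaddleOCR are Python packages; check whether they can be
    # imported without actually importing them (which could be slow).
    for module_name in ("easyocr", "paddleocr"):
        if importlib.util.find_spec(module_name) is not None:
            engines.append(module_name)
    return engines


if __name__ == "__main__":
    print("Available OCR engines:", available_ocr_engines() or "none")
```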

Quick Example

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract text from a PDF
    result = await extract_file("document.pdf")
    print(result.content)

    # Extract text from an image
    result = await extract_file("scan.jpg")
    print(result.content)

    # Extract text from a Word document
    result = await extract_file("report.docx")
    print(result.content)

asyncio.run(main())
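
Because `extract_file` is awaited, several files can be processed concurrently instead of one after another. The following is a minimal sketch under that assumption, using `asyncio.gather`. The helper `extract_many` and its `extract` parameter are hypothetical: the parameter exists only so the helper can be tried with a stand-in extractor, and it defaults to `kreuzberg.extract_file`.

```python
import asyncio


async def extract_many(paths, extract=None):
    """Extract text from several files concurrently.

    Returns a dict mapping each path to its extracted text, or to the
    exception that was raised while processing that path.
    """
    if extract is None:
        # Imported lazily so the helper can be exercised with a stand-in extractor.
        from kreuzberg import extract_file as extract

    results = await asyncio.gather(
        *(extract(path) for path in paths),
        return_exceptions=True,  # a failing file does not abort the others
    )
    return {
        path: result if isinstance(result, Exception) else result.content
        for path, result in zip(paths, results)
    }
```

With Kreuzberg installed, a call could look like `asyncio.run(extract_many(["document.pdf", "scan.jpg", "report.docx"]))`. Collecting exceptions per file is a design choice for this sketch; re-raising the first error may suit other workflows better.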

Documentation

For comprehensive documentation, visit our GitHub Pages site.

Supported Formats

Kreuzberg supports a wide range of document formats:

  • Documents: PDF, DOCX, DOC, RTF, TXT, EPUB, etc.
  • Images: JPG, PNG, TIFF, BMP, GIF, etc.
  • Spreadsheets: XLSX, XLS, CSV, etc.
  • Presentations: PPTX, PPT, etc.
  • Web Content: HTML, XML, etc.
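
When working through a folder of mixed files, it can be handy to sort them into the categories listed above before extraction. The sketch below is illustrative only: `CATEGORY_BY_SUFFIX` and `group_by_category` are hypothetical names, not part of Kreuzberg, and the mapping covers just the extensions named in the list, which is not exhaustive.

```python
from collections import defaultdict
from pathlib import Path

# Extensions taken from the list above; Kreuzberg supports more ("etc.").
CATEGORY_BY_SUFFIX = {
    **dict.fromkeys((".pdf", ".docx", ".doc", ".rtf", ".txt", ".epub"), "Documents"),
    **dict.fromkeys((".jpg", ".png", ".tiff", ".bmp", ".gif"), "Images"),
    **dict.fromkeys((".xlsx", ".xls", ".csv"), "Spreadsheets"),
    **dict.fromkeys((".pptx", ".ppt"), "Presentations"),
    **dict.fromkeys((".html", ".xml"), "Web Content"),
}


def group_by_category(paths):
    """Group paths by format category, using a case-insensitive suffix match."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        category = CATEGORY_BY_SUFFIX.get(path.suffix.lower(), "Other")
        groups[category].append(path)
    return dict(groups)
```

Files that land in "Other" are not necessarily unsupported; they are simply outside the short list this sketch knows about.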

OCR Engines

Kreuzberg supports multiple OCR engines:

  • Tesseract (Default): Lightweight, fast startup, requires system installation
  • EasyOCR: Good for many languages, pure Python, but downloads models on first use
  • PaddleOCR: Excellent for Asian languages, pure Python, but downloads models on first use

For comparison and selection guidance, see the OCR Backends documentation.

Contribution

This library is open to contributions. Feel free to open issues or submit PRs. To avoid disappointment, discuss an issue before submitting a PR for it.

Local Development

  1. Clone the repo
  2. Install the system dependencies
  3. Install the full dependencies with uv sync
  4. Install the pre-commit hooks with: pre-commit install && pre-commit install --hook-type commit-msg
  5. Make your changes and submit a PR

License

This library is released under the MIT license.