fix: create symlink for Kreuzberg Tesseract cache path by Kaiohz · Pull Request #30 · SoluDevTech/mcp-raganything

Kaiohz · 2026-04-27T11:39:44Z

Problem

Kreuzberg was looking for Tesseract language data at /io/.tesseract-cache/linux-aarch64/tessdata/ regardless of the actual architecture, causing OCR failures:

Error opening data file /io/.tesseract-cache/linux-aarch64/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your tessdata directory.
Failed loading language eng
Tesseract couldn't load any languages!

Solution

Create the expected directory structure and symlink to the system Tesseract data files:

Set TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata environment variable
Create /io/.tesseract-cache/linux-aarch64/tessdata symlink pointing to system tessdata
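
The two steps above can be sketched as a small shell helper. The Dockerfile change itself is not reproduced on this page, so this is an illustration under assumptions rather than the merged code: the function name link_tessdata and its argument order are hypothetical, and only the paths come from the description.

```shell
#!/bin/sh
# link_tessdata TESSDATA_DIR CACHE_DIR   (hypothetical helper name)
# Creates CACHE_DIR if needed, then makes CACHE_DIR/tessdata a symlink
# pointing at TESSDATA_DIR, which is where Kreuzberg is reported to look.
link_tessdata() {
    tessdata_dir=$1
    cache_dir=$2
    mkdir -p "$cache_dir" || return 1
    ln -s "$tessdata_dir" "$cache_dir/tessdata"
}

# In the image this would presumably be called with the paths from the
# description (needs write access to /io):
#   export TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
#   link_tessdata "$TESSDATA_PREFIX" /io/.tesseract-cache/linux-aarch64
```

Passing both paths as arguments keeps the helper testable against a scratch directory instead of the real /io tree.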

Testing

Build the new Docker image
Deploy to cluster
Test OCR with French PDF documents
Verify no more Tesseract language errors in logs
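
Before deploying, the last check above can be approximated inside the container by confirming that the language files resolve through the symlinked path. This is a minimal sketch: check_traineddata is a hypothetical helper, and the eng and fra language codes are taken from the error messages quoted in this PR.

```shell
#!/bin/sh
# check_traineddata DIR LANG...   (hypothetical helper name)
# Returns 0 if DIR/LANG.traineddata is readable for every LANG given,
# otherwise prints each missing file to stderr and returns 1.
check_traineddata() {
    dir=$1
    shift
    missing=0
    for lang in "$@"; do
        if [ ! -r "$dir/$lang.traineddata" ]; then
            echo "missing: $dir/$lang.traineddata" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call against the path Kreuzberg is reported to use:
#   check_traineddata /io/.tesseract-cache/linux-aarch64/tessdata eng fra
# Running `tesseract --list-langs` is a complementary check of what
# Tesseract itself can load.
```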

Notes

The linux-aarch64 in the path is misleading - it appears Kreuzberg hardcodes this path regardless of the actual architecture (x86_64 in our case). This workaround ensures Tesseract can find its language data files.

Kreuzberg hardcodes the Tesseract data path to /io/.tesseract-cache/linux-aarch64/tessdata regardless of the actual architecture. This commit creates the expected directory structure and symlinks to the system Tesseract data files. Fixes 'Error opening data file' and 'Failed loading language eng/fra' errors.

- Set TESSDATA_PREFIX environment variable
- Create /io/.tesseract-cache/linux-aarch64/tessdata symlink to system tessdata
- Tesseract OCR with French language support now works correctly

Kaiohz

🔍 Code Review

Score: 7/10

✅ Strengths:

Minimal, targeted fix that does exactly what is needed
Clear explanatory comment on the workaround
Good use of ${TESSDATA_PREFIX} to avoid hardcoded paths
Detailed PR description with context on the problem

⚠️ Points to improve:

Symlink robustness

# Current
ln -s ${TESSDATA_PREFIX} /io/.tesseract-cache/linux-aarch64/tessdata

Add -f to avoid errors if the symlink already exists:

ln -sf ${TESSDATA_PREFIX} /io/.tesseract-cache/linux-aarch64/tessdata

Documentation of the workaround
- Add a link to the Kreuzberg issue, if one exists
- Document in a README or wiki why this symlink is necessary
Architecture confusion
- linux-aarch64 is hardcoded even though we run on x86_64
- A future maintainer could be confused
- The current comment is fine, but could be more explicit: # Kreuzberg hardcodes linux-aarch64 path regardless of actual arch (x86_64)

💡 Optional suggestion:
If Kreuzberg is configurable via an environment variable, explore that option before resorting to the symlink.

Verdict: The fix works and is well structured. A few minor adjustments would improve robustness and maintainability. Ready to merge, but I suggest switching to ln -sf before merging.
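
One caveat on the ln -sf suggestion: when the link name is already a symlink to a directory, ln follows it and creates the new link inside the target directory instead of replacing the existing link. Adding -n (treat the link name as a plain file) avoids this. The sketch below shows a re-runnable variant; relink is a hypothetical helper name, not something from the PR.

```shell
#!/bin/sh
# relink SRC LINK   (hypothetical helper name)
# Points LINK at SRC, replacing LINK if it is already a symlink.
#   -s  create a symbolic link
#   -f  remove an existing destination
#   -n  do not follow LINK when it is a symlink to a directory
relink() {
    ln -sfn "$1" "$2"
}

# Applied to the path from this PR, this would read:
#   relink "$TESSDATA_PREFIX" /io/.tesseract-cache/linux-aarch64/tessdata
```

Running the helper twice leaves a single link pointing at the most recent source, which is the behaviour an image rebuild on top of an existing layer would need.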


Kaiohz merged commit 01b7e96 into main Apr 27, 2026
1 check passed

Kaiohz merged 1 commit into main from fix/tesseract-tessdata-prefix