fix: create symlink for Kreuzberg Tesseract cache path #30
Merged
Kreuzberg hardcodes the Tesseract data path to /io/.tesseract-cache/linux-aarch64/tessdata regardless of the actual architecture. This commit creates the expected directory structure and symlinks it to the system Tesseract data files. Fixes the 'Error opening data file' and 'Failed loading language eng/fra' errors.
- Set TESSDATA_PREFIX environment variable
- Create /io/.tesseract-cache/linux-aarch64/tessdata symlink to system tessdata
- Tesseract OCR with French language support now works correctly
Kaiohz
commented
Apr 27, 2026
Collaborator
Author
🔍 Code Review
Score: 7/10
✅ Positives:
- Minimal, targeted fix; exactly what is needed
- Clear explanatory comment on the workaround
- Good use of ${TESSDATA_PREFIX} to avoid hardcoded paths
- Detailed PR description with context on the problem
- Symlink robustness
  # Current
  ln -s ${TESSDATA_PREFIX} /io/.tesseract-cache/linux-aarch64/tessdata
  Add -f to avoid errors if the symlink already exists:
  ln -sf ${TESSDATA_PREFIX} /io/.tesseract-cache/linux-aarch64/tessdata
- Documentation of the workaround
  - Add a link to the Kreuzberg issue, if one exists
  - Document in a README or wiki why this symlink is needed
- Architecture confusion
  - linux-aarch64 is hardcoded even though we are running on x86_64
  - A future maintainer could be confused
  - The current comment is good, but could be more explicit:
    # Kreuzberg hardcodes linux-aarch64 path regardless of actual arch (x86_64)
💡 Optional suggestion:
If Kreuzberg is configurable via an environment variable, explore that option before resorting to the symlink.
Verdict: The fix works and is well structured. A few minor adjustments would improve robustness and maintainability. Ready to merge, but I suggest adding ln -sf before merging.
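A note on the `-f` suggestion: when the link already exists and points to a directory, `ln -sf` on GNU and BSD systems follows it and creates a nested link inside the target directory instead of replacing the link. The sketch below uses `ln -sfn` to avoid that. The `link_tessdata` helper and the scratch directory are hypothetical, introduced only for illustration; they are not part of this PR.

```shell
#!/bin/sh
set -eu

# link_tessdata TARGET LINK: (re)create LINK as a symlink to TARGET.
# Hypothetical helper for illustration; not taken from the PR.
link_tessdata() {
    target=$1
    link=$2
    # Make sure the parent directory of the link exists.
    mkdir -p "$(dirname "$link")"
    # -s: symbolic link, -f: replace an existing entry,
    # -n: treat an existing symlink-to-directory as the entry to replace
    #     instead of descending into the directory it points to.
    ln -sfn "$target" "$link"
}

# Demonstration in a scratch directory so nothing under /io is touched.
demo=$(mktemp -d)
mkdir -p "$demo/system/tessdata"
link="$demo/io/.tesseract-cache/linux-aarch64/tessdata"

link_tessdata "$demo/system/tessdata" "$link"
link_tessdata "$demo/system/tessdata" "$link"   # second run replaces the link cleanly

readlink "$link"
```

In the real image the call would presumably look like `link_tessdata "$TESSDATA_PREFIX" /io/.tesseract-cache/linux-aarch64/tessdata`, with the variable quoted so that a path containing spaces does not split.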
Problem
Kreuzberg was looking for Tesseract language data at /io/.tesseract-cache/linux-aarch64/tessdata/ regardless of the actual architecture, causing OCR failures ('Error opening data file' and 'Failed loading language eng/fra').
Solution
Create the expected directory structure and symlink to the system Tesseract data files:
- Set the TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata environment variable
- Create the /io/.tesseract-cache/linux-aarch64/tessdata symlink pointing to system tessdata
Testing
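One way to sanity-check the workaround, offered as a sketch and not as the checks the author actually ran: confirm that the expected language files are readable through the symlinked path. Tesseract stores each language as `<lang>.traineddata`; the `check_langs` helper and the scratch layout with placeholder files are assumptions made for this example.

```shell
#!/bin/sh
set -eu

# check_langs DIR LANG...: succeed only if DIR/LANG.traineddata is readable
# for every LANG given. Hypothetical helper; not part of the PR.
check_langs() {
    dir=$1
    shift
    missing=0
    for lang in "$@"; do
        if [ -r "$dir/$lang.traineddata" ]; then
            echo "ok: $lang"
        else
            echo "missing: $lang" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Scratch layout that mimics the fix: empty placeholder language files in a
# "system" directory, reached through a linux-aarch64/tessdata symlink.
demo=$(mktemp -d)
mkdir -p "$demo/system" "$demo/cache/linux-aarch64"
: > "$demo/system/eng.traineddata"
: > "$demo/system/fra.traineddata"
ln -s "$demo/system" "$demo/cache/linux-aarch64/tessdata"

check_langs "$demo/cache/linux-aarch64/tessdata" eng fra
```

Inside the container, the same helper could be pointed at /io/.tesseract-cache/linux-aarch64/tessdata, and `tesseract --list-langs` gives an independent view of which languages Tesseract itself can see.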
Notes
The linux-aarch64 in the path is misleading: it appears Kreuzberg hardcodes this path regardless of the actual architecture (x86_64 in our case). This workaround ensures Tesseract can find its language data files.