fix: create symlink for Kreuzberg Tesseract cache path#30

Merged
Kaiohz merged 1 commit into main from fix/tesseract-tessdata-prefix
Apr 27, 2026

@Kaiohz Kaiohz commented Apr 27, 2026

Problem

Kreuzberg was looking for Tesseract language data at /io/.tesseract-cache/linux-aarch64/tessdata/ regardless of the actual architecture, causing OCR failures:

Error opening data file /io/.tesseract-cache/linux-aarch64/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your tessdata directory.
Failed loading language eng
Tesseract couldnt load any languages!

Solution

Create the expected directory structure and symlink to the system Tesseract data files:

  • Set the TESSDATA_PREFIX environment variable to /usr/share/tesseract-ocr/5/tessdata
  • Create a symlink at /io/.tesseract-cache/linux-aarch64/tessdata pointing to the system tessdata directory
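
The two steps above can be sketched as a small POSIX shell function. This is a minimal sketch under stated assumptions, not the actual change from this PR: the function name `create_tessdata_link` is hypothetical, and both paths are passed as arguments so the snippet can be tried outside the image.

```shell
# Hypothetical helper: create the directory layout Kreuzberg looks for
# and point its leaf at the system tessdata directory.
#   $1 = system tessdata directory (the value of TESSDATA_PREFIX)
#   $2 = path Kreuzberg expects to find
create_tessdata_link() {
    tessdata_dir=$1
    link_path=$2
    # Create the parent directories (e.g. /io/.tesseract-cache/linux-aarch64).
    mkdir -p "$(dirname "$link_path")"
    # Create the symlink itself; this fails if link_path already exists.
    ln -s "$tessdata_dir" "$link_path"
}

# Intended call inside the image (paths taken from this PR):
#   export TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
#   create_tessdata_link "$TESSDATA_PREFIX" /io/.tesseract-cache/linux-aarch64/tessdata
```

Quoting the variables keeps the commands safe if a path ever contains spaces.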

Testing

  1. Build the new Docker image
  2. Deploy to cluster
  3. Test OCR with French PDF documents
  4. Verify no more Tesseract language errors in logs
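
Before deploying, it may help to confirm that the language files actually resolve through the link. The following is a minimal sketch, assuming eng and fra are the models needed (both appear in the errors this PR fixes); `check_traineddata` is a hypothetical helper, not part of the PR.

```shell
# Hypothetical helper: report any <lang>.traineddata file that is not
# readable under the given tessdata directory.
#   $1 = tessdata directory to check; remaining args = language codes
check_traineddata() {
    dir=$1
    shift
    missing=0
    for lang in "$@"; do
        if [ ! -r "$dir/$lang.traineddata" ]; then
            echo "missing: $dir/$lang.traineddata" >&2
            missing=1
        fi
    done
    # Returns 0 when every file was found, 1 otherwise.
    return "$missing"
}

# Intended call inside the container:
#   check_traineddata /io/.tesseract-cache/linux-aarch64/tessdata eng fra
```

Running `tesseract --list-langs` inside the container is another quick way to see which languages Tesseract itself can find.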

Notes

The linux-aarch64 in the path is misleading - it appears Kreuzberg hardcodes this path regardless of the actual architecture (x86_64 in our case). This workaround ensures Tesseract can find its language data files.

Kreuzberg hardcodes the Tesseract data path to /io/.tesseract-cache/linux-aarch64/tessdata
regardless of the actual architecture. This commit creates the expected directory structure
and symlinks to the system Tesseract data files.

Fixes 'Error opening data file' and 'Failed loading language eng/fra' errors.

- Set TESSDATA_PREFIX environment variable
- Create /io/.tesseract-cache/linux-aarch64/tessdata symlink to system tessdata
- Tesseract OCR with French language support now works correctly

@Kaiohz Kaiohz left a comment

🔍 Code Review

Score: 7/10

✅ Strengths:

  • Minimal, targeted fix that does exactly what is needed
  • Clear explanatory comment on the workaround
  • Good use of ${TESSDATA_PREFIX} to avoid hardcoded paths
  • Detailed PR description with context on the problem

⚠️ Points to improve:

  1. Symlink robustness

    # Current
    ln -s ${TESSDATA_PREFIX} /io/.tesseract-cache/linux-aarch64/tessdata

    Add -f to avoid errors if the symlink already exists:

    ln -sf ${TESSDATA_PREFIX} /io/.tesseract-cache/linux-aarch64/tessdata
  2. Documenting the workaround

    • Add a link to the Kreuzberg issue, if one exists
    • Document in a README or wiki why this symlink is needed
  3. Architecture confusion

    • linux-aarch64 is hardcoded even though we run on x86_64
    • A future maintainer could be confused
    • The current comment is good, but could be more explicit: # Kreuzberg hardcodes linux-aarch64 path regardless of actual arch (x86_64)
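
One caveat on point 1, as far as GNU ln behaves: when the destination already exists as a symlink to a directory, `ln -sf` follows it and creates the new link inside the target directory instead of replacing the link. Adding `-n` avoids that. The following is a sketch only; `relink_tessdata` is a hypothetical helper name, and the paths are passed as arguments so it can be tried outside the image.

```shell
# Hypothetical helper: create or replace the tessdata symlink so that
# running it repeatedly always leaves exactly one link at link_path.
#   $1 = system tessdata directory, $2 = path Kreuzberg expects
relink_tessdata() {
    tessdata_dir=$1
    link_path=$2
    mkdir -p "$(dirname "$link_path")"
    # Kreuzberg hardcodes the linux-aarch64 path regardless of actual arch (x86_64).
    # -s symbolic, -f replace an existing destination,
    # -n do not follow link_path if it is already a symlink to a directory.
    ln -sfn "$tessdata_dir" "$link_path"
}

# Intended call inside the image:
#   relink_tessdata "$TESSDATA_PREFIX" /io/.tesseract-cache/linux-aarch64/tessdata
```

In a Docker build the layer usually starts from a clean filesystem, so the difference mostly matters when the command is re-run in an existing container or on a mounted volume.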

💡 Optional suggestion:
If Kreuzberg can be configured through an environment variable, explore that option before resorting to the symlink.


Verdict: The fix works and is well structured. A few minor adjustments would improve robustness and maintainability. Ready to merge, but I suggest switching to ln -sf before merging.

@Kaiohz Kaiohz merged commit 01b7e96 into main Apr 27, 2026
1 check passed