
DocumentSplitter with NLTK download breaks AWS Lambda deployments #8836

Open
chrisk314 opened this issue Feb 10, 2025 · 2 comments
Describe the bug
#8755 mentions that the DocumentSplitter argument split_by='sentence' now uses NLTK. When running in AWS Lambda with Haystack updated from 2.8.0 to 2.9.0, which includes this change, the Lambda exits with an error because it cannot download some NLTK files. AWS Lambda deployments have a read-only filesystem, and the updated DocumentSplitter now triggers a download of NLTK data at runtime, causing the Lambda to fail.

Is there a way to download these NLTK files ahead of time, i.e., during the Docker build of the AWS Lambda image, so they do not need to be downloaded at runtime?

Error message

[nltk_data] Downloading package punkt_tab to
[nltk_data] /home/sbx_user1051/nltk_data...
[Errno 30] Read-only file system: '/home/sbx_user1051'

Expected behavior
Updating Haystack from one minor release to another should not cause otherwise unchanged code to break due to changes in the Haystack API.

System:

  • OS: AWS Lambda
  • Haystack version (commit or version number): 2.9.0
chrisk314 changed the title from "DocumentSpitter with NLTK download breaks AWS Lambda deployments" to "DocumentSplitter with NLTK download breaks AWS Lambda deployments" on Feb 10, 2025
sjrl (Contributor) commented Feb 11, 2025

Sorry to hear that this has caused issues for you. There are a few things you can do:

  • Change to a different split_by. NLTK tries to download punkt_tab only when split_by="sentence" is specified.
  • If you'd like to continue splitting by sentence, you can run the following during the Docker build. This is what we call here if the local files don't already exist:
import nltk
nltk.download("punkt_tab")
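For a Lambda container image specifically, one way to bake the data in at build time is a sketch like the following (the /opt/nltk_data path is an assumption, not something Haystack requires; NLTK reads the NLTK_DATA environment variable to locate pre-downloaded data):

```dockerfile
# Hypothetical Dockerfile fragment: download punkt_tab at build time
# into a writable image layer, then point NLTK at it via NLTK_DATA.
ENV NLTK_DATA=/opt/nltk_data
RUN python -c "import nltk; nltk.download('punkt_tab', download_dir='/opt/nltk_data')"
```

At runtime NLTK should then find punkt_tab under /opt/nltk_data and skip the download attempt against the read-only filesystem.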

anakin87 (Member) commented
If you were happy about the former behavior of split_by="sentence", you can now use split_by="period", which does not require nltk.
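To illustrate the trade-off, here is a naive period-based splitter approximating what split_by="period" does (this is an illustration only, not Haystack's actual implementation):

```python
# Illustration: split text on "." without any NLTK model.
def split_by_period(text: str) -> list[str]:
    # Keep the trailing period with each fragment; drop empty trailing parts.
    return [part + "." for part in text.split(".") if part.strip()]

chunks = split_by_period("Dr. Smith arrived. He sat down.")
# → ["Dr.", " Smith arrived.", " He sat down."]
```

Note the trade-off: unlike NLTK's sentence tokenizer, a plain period split treats abbreviations like "Dr." as sentence boundaries, but it needs no downloaded data.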
