
DocumentSplitter with NLTK download breaks AWS Lambda deployments #8836

Open
chrisk314 opened this issue Feb 10, 2025 · 2 comments
Describe the bug
#8755 mentions that the DocumentSplitter argument split_by='sentence' now uses NLTK. When running in AWS Lambda with Haystack updated from 2.8.0 to 2.9.0, which includes this change, the Lambda exits with an error because it cannot download some NLTK files. AWS Lambda deployments have a read-only filesystem, and the updated DocumentSplitter now triggers a download of NLTK data at runtime, causing the Lambda to fail.

Is there a way to download these NLTK files ahead of time, i.e., during the Docker build of the AWS Lambda image, so they do not need to be downloaded at runtime?

Error message

[nltk_data] Downloading package punkt_tab to
[nltk_data] /home/sbx_user1051/nltk_data...
[Errno 30] Read-only file system: '/home/sbx_user1051'

Expected behavior
Updating Haystack from one minor release to another should not cause otherwise unchanged code to break due to changes in the Haystack API.

System:

  • OS: AWS Lambda
  • Haystack version (commit or version number): 2.9.0
chrisk314 changed the title from "DocumentSpitter with NLTK download breaks AWS Lambda deployments" to "DocumentSplitter with NLTK download breaks AWS Lambda deployments" on Feb 10, 2025
sjrl (Contributor) commented Feb 11, 2025

Sorry to hear that this has caused issues for you. There are a few things you can do:

  • Change to a different split_by. NLTK tries to download punkt_tab only when split_by="sentence" is specified.
  • If you'd like to continue splitting by sentence, you can run the following during the Docker build. This is what we call here if the local files don't already exist:
import nltk
nltk.download("punkt_tab")
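For a Lambda container image specifically, one way to bake the data in at build time is a sketch like the following (the /opt/nltk_data path is an assumption, not something Haystack requires; NLTK reads the NLTK_DATA environment variable to locate pre-downloaded data):

```dockerfile
# Hypothetical Dockerfile fragment: download punkt_tab at build time
# into a writable image layer, then point NLTK at it via NLTK_DATA.
ENV NLTK_DATA=/opt/nltk_data
RUN python -c "import nltk; nltk.download('punkt_tab', download_dir='/opt/nltk_data')"
```

At runtime NLTK should then find punkt_tab under /opt/nltk_data and skip the download attempt against the read-only filesystem.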

anakin87 (Member) commented
If you were happy about the former behavior of split_by="sentence", you can now use split_by="period", which does not require nltk.
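To illustrate the trade-off, here is a naive period-based splitter approximating what split_by="period" does (this is an illustration only, not Haystack's actual implementation):

```python
# Illustration: split text on "." without any NLTK model.
def split_by_period(text: str) -> list[str]:
    # Keep the trailing period with each fragment; drop empty trailing parts.
    return [part + "." for part in text.split(".") if part.strip()]

chunks = split_by_period("Dr. Smith arrived. He sat down.")
# → ["Dr.", " Smith arrived.", " He sat down."]
```

Note the trade-off: unlike NLTK's sentence tokenizer, a plain period split treats abbreviations like "Dr." as sentence boundaries, but it needs no downloaded data.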
