Describe the bug
#8755 mentions that the DocumentSplitter argument split_by='sentence' now uses NLTK. When running in AWS Lambda with an updated Haystack (2.8.0 -> 2.9.0), which includes this change, the Lambda exits with an error because it cannot download some NLTK files. AWS Lambda deployments have a read-only filesystem, and the updated DocumentSplitter now triggers a download of some NLTK files, causing the Lambda to fail.
Is there a way to download these NLTK files ahead of time, i.e., during the Docker build of the AWS Lambda image, so they do not need to be downloaded at runtime?
Expected behavior
Updating Haystack from one minor release to another will not result in otherwise unchanged code breaking due to changes in the Haystack API.
System:
OS: AWS Lambda
Haystack version (commit or version number): 2.9.0
chrisk314 changed the title from "DocumentSpitter with NLTK download breaks AWS Lambda deployments" to "DocumentSplitter with NLTK download breaks AWS Lambda deployments" on Feb 10, 2025
Sorry to hear that this has caused issues for you. There are a few things you can do:
Change to a different split_by. NLTK only tries to download punkt_tab when split_by="sentence" is specified.
If you'd like to continue splitting by sentence, you can use the following in the Docker setup. This is what we call here if the local files don't already exist.
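For the second option, here is a minimal sketch of a build-time prefetch script, not necessarily the exact snippet referred to above. It assumes NLTK is installed in the image and that the resource needed is punkt_tab, as mentioned in this thread. The helper names (resolve_data_dir, prefetch_punkt_tab) and the /opt/nltk_data default are hypothetical; nltk.data.find, nltk.download and the NLTK_DATA environment variable are standard NLTK.

```python
import os

# Hypothetical default location inside the image; any directory that ends up
# on NLTK's search path at runtime should work.
DEFAULT_DATA_DIR = "/opt/nltk_data"


def resolve_data_dir(environ=None):
    """Return the first directory listed in NLTK_DATA, or the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NLTK_DATA", "")
    first = raw.split(os.pathsep)[0] if raw else ""
    return first or DEFAULT_DATA_DIR


def prefetch_punkt_tab(data_dir):
    """Download punkt_tab into data_dir unless NLTK can already find it.

    Returns True if a download happened, False if the files were present.
    """
    import nltk  # imported lazily so the path helper works without NLTK

    # Make sure the lookup below also searches the target directory.
    if data_dir not in nltk.data.path:
        nltk.data.path.insert(0, data_dir)

    try:
        nltk.data.find("tokenizers/punkt_tab")
        return False
    except LookupError:
        # nltk.download returns False when the download fails.
        if not nltk.download("punkt_tab", download_dir=data_dir):
            raise RuntimeError(f"could not download punkt_tab into {data_dir}")
        return True
```

Calling prefetch_punkt_tab(resolve_data_dir()) from a RUN step during the image build, and setting NLTK_DATA to the same directory in the image, should let the lookup succeed at runtime so that no download is attempted on the read-only filesystem.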