Reading (Not Downloading) Large spaCy Model from S3 #6838
tarskiandhutch started this conversation in New Features & Project Ideas
-
Hi,

You can serialize the model before uploading to S3:

```python
import pickle

import boto3
import spacy


def create_session():
    session = boto3.Session(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
        region_name=REGION_NAME,
    )
    return session


def create_client():
    return create_session().client('s3')


nlp = spacy.load('my_model')

# Bundle the pipeline's config together with its serialized weights
payload = {
    "config": nlp.config,
    "bytes_data": nlp.to_bytes(),
}

pkl_object = pickle.dumps(payload)
client = create_client()
client.put_object(Body=pkl_object, Bucket=YOUR_BUCKET, Key=YOUR_KEY)
```

Then load it from there:

```python
import pickle

import smart_open
import spacy


def load_model_from_s3(s3_path):
    session = create_session()
    with smart_open.open(s3_path, 'rb', transport_params={'session': session}) as f:
        pkl_object = pickle.loads(f.read())
    config = pkl_object["config"]
    bytes_data = pkl_object["bytes_data"]
    # Rebuild an empty pipeline of the right language from the config,
    # then restore the trained weights from the serialized bytes
    lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
    nlp = lang_cls.from_config(config)
    nlp.from_bytes(bytes_data)
    return nlp
```
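To see that the `config` + `to_bytes()` / `from_config()` + `from_bytes()` cycle really does round-trip a pipeline entirely in memory, here is a minimal sketch with no S3 involved; `spacy.blank("en")` stands in for a real packaged model like `en_core_web_lg`:

```python
import pickle

import spacy

# A blank English pipeline as a stand-in for a trained model
nlp = spacy.blank("en")

# Serialize config + weights to a single in-memory pickle, as above
payload = pickle.dumps({"config": nlp.config, "bytes_data": nlp.to_bytes()})

# Rebuild the pipeline purely from those bytes
restored = pickle.loads(payload)
lang_cls = spacy.util.get_lang_class(restored["config"]["nlp"]["lang"])
nlp2 = lang_cls.from_config(restored["config"])
nlp2.from_bytes(restored["bytes_data"])

doc = nlp2("This pipeline was rebuilt entirely from in-memory bytes.")
print(doc[0].text)  # "This"
```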
-
We have a Python app running on an EC2 instance that makes use of a variety of ML and NLP packages. For several such packages, reading large language models from an S3 bucket directly into memory is straightforward: e.g., Hugging Face Transformers, whose models can be read buffer-style as in this project's `load_model_from_s3()` method.

The problem is that, as far as we can tell, there is no way to read a spaCy model (e.g., `en_core_web_lg`) into an object in memory and then use `spacy.load()` to spin that object up into an NLP pipeline as usual. The model has to be stored and then loaded from a file path. The only solutions to similar problems we've seen are:

- download the model to a local temp directory, which obviously does not solve the problem (from GitHub, 2016)
- mount an S3 bucket as a local directory using a FUSE binding like S3FS, which does not appear to have spaCy integration (from GitHub, 2018)
- download the model to an EFS volume and read from there, which only sidesteps the issue by outsourcing storage (from Towards Data Science, 2020)
In the intervening months/years, has anyone devised a solution for storing a spaCy model on S3, reading it as a buffer, and loading it directly into a pipeline from memory? Or does spaCy intend to enable this behavior in a future release?
(Being able to do this could help simplify/streamline app architecture for NLP projects hosted on both ECS and AWS Lambda.)