Reading (Not Downloading) Large spaCy Model from S3 #6838
tarskiandhutch started this conversation in New Features & Project Ideas
-
Hi,

You can serialize the model before uploading to S3:

```python
import pickle

import boto3
import spacy


def create_session():
    session = boto3.Session(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
        region_name=REGION_NAME,
    )
    return session


def create_client():
    return create_session().client('s3')


nlp = spacy.load('my_model')

# Bundle the pipeline's config together with its serialized weights
payload = {
    "config": nlp.config,
    "bytes_data": nlp.to_bytes(),
}

pkl_object = pickle.dumps(payload)
client = create_client()
client.put_object(Body=pkl_object, Bucket=YOUR_BUCKET, Key=YOUR_KEY)
```

Then load it from there:

```python
import pickle

import smart_open
import spacy


def load_model_from_s3(s3_path):
    session = create_session()
    with smart_open.open(s3_path, 'rb', transport_params={'session': session}) as f:
        pkl_object = pickle.loads(f.read())
    config = pkl_object["config"]
    bytes_data = pkl_object["bytes_data"]
    # Rebuild an empty pipeline of the right language from the config,
    # then restore the trained weights from the serialized bytes
    lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
    nlp = lang_cls.from_config(config)
    nlp.from_bytes(bytes_data)
    return nlp
```
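To see that the `config` + `to_bytes()` / `from_config()` + `from_bytes()` cycle really does round-trip a pipeline entirely in memory, here is a minimal sketch with no S3 involved; `spacy.blank("en")` stands in for a real packaged model like `en_core_web_lg`:

```python
import pickle

import spacy

# A blank English pipeline as a stand-in for a trained model
nlp = spacy.blank("en")

# Serialize config + weights to a single in-memory pickle, as above
payload = pickle.dumps({"config": nlp.config, "bytes_data": nlp.to_bytes()})

# Rebuild the pipeline purely from those bytes
restored = pickle.loads(payload)
lang_cls = spacy.util.get_lang_class(restored["config"]["nlp"]["lang"])
nlp2 = lang_cls.from_config(restored["config"])
nlp2.from_bytes(restored["bytes_data"])

doc = nlp2("This pipeline was rebuilt entirely from in-memory bytes.")
print(doc[0].text)  # "This"
```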
-
We have a Python app running on an EC2 instance that makes use of a variety of ML and NLP packages. For several such packages, reading large language models from an S3 bucket directly into memory is straightforward: e.g., Hugging Face Transformers, whose models can be read buffer-style as in this project's `load_model_from_s3()` method.

The problem is that, as far as we can tell, there is no way to read a spaCy model (e.g., `en_core_web_lg`) into an object in memory and then use `spacy.load()` to spin that object up into an NLP pipeline as usual. The model has to be stored and then loaded from a file path. The only solutions to similar problems we've seen are:

- download the model to a local temp directory, which obviously does not solve the problem (from GitHub, 2016)
- mount an S3 bucket as a local directory using a FUSE binding like S3FS, which does not appear to have spaCy integration (from GitHub, 2018)
- download the model to an EFS volume and read from there, which only sidesteps the issue by outsourcing storage (from Towards Data Science, 2020)
In the intervening months/years, has anyone devised a solution for storing a spaCy model on S3, reading it as a buffer, and loading it directly into a pipeline from memory? Or does spaCy intend to enable this behavior in a future release?
(Being able to do this could help simplify/streamline app architecture for NLP projects hosted on both ECS and AWS Lambda.)