Memory efficient Example objects #7806

Joozty · 2021-04-15T20:45:49Z


When I try to train my model, it uses too much memory. It seems that most of the memory is consumed by Example objects. I am not sure about the internal structure, but when I print an Example object I can see many sparse lists. I understand that those lists are filled once I enable the right pipeline component. However, wouldn't it be better to generate a sparse list on the fly when its pipeline component is disabled, or to store it more cleverly (maybe that's done already)?

import spacy
from spacy.training import Example
nlp = spacy.load("en_core_web_md", exclude=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
doc = nlp("test " * 10)
print(Example(doc, doc))

{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'links': {}}, 'token_annotation': {'ORTH': ['test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test'], 'SPACY': [True, True, True, True, True, True, True, True, True, True], 'TAG': ['', '', '', '', '', '', '', '', '', ''], 'LEMMA': ['', '', '', '', '', '', '', '', '', ''], 'POS': ['', '', '', '', '', '', '', '', '', ''], 'MORPH': ['', '', '', '', '', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 'DEP': ['', '', '', '', '', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}

polm · 2021-04-16T04:34:21Z


An Example is basically just two Doc objects with various convenience methods. The output you get when printing it is the result of calling to_dict, not the way it's stored in memory. So those lists are already generated on the fly.

If you're not using Transformers or other user attributes, the size of a Doc should be pretty constant with regard to its length, since most fields are pointers to the string store in Cython. So I'm not sure there's much you can do about this.

One limitation of training at the moment is that we assume you can load the whole training set in memory, though we are working on supporting streaming corpora. The PR for that (#7208) has already been merged and should be in the next release.

Also, as a note, rather than disabling everything in a pipeline you can just use nlp.make_doc(text) to get the bare minimum processing.
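A rough sketch of the two points above: an Example wraps two Doc objects, and nlp.make_doc runs only the tokenizer. It assumes spaCy v3 is installed; spacy.blank("en") stands in for en_core_web_md here only so that no model download is needed.

```python
import spacy
from spacy.tokens import Doc
from spacy.training import Example

# Assumption: a blank English pipeline instead of en_core_web_md,
# so the sketch runs without downloading a trained model.
nlp = spacy.blank("en")

# make_doc only tokenizes; no pipeline components are applied.
doc = nlp.make_doc("test " * 10)

# An Example holds a predicted Doc and a reference Doc.
example = Example(doc, doc)
both_are_docs = isinstance(example.predicted, Doc) and isinstance(example.reference, Doc)

# The dict shown when printing an Example is built on request by to_dict().
as_dict = example.to_dict()
print(len(doc), both_are_docs, sorted(as_dict))
```

Calling nlp.make_doc avoids having to list every component in exclude, which is the main reason to prefer it when only tokenization is wanted.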

3 replies

Joozty Apr 16, 2021
Author

I see. That's good to know, thanks. I am aware of the streaming training set option, which is an alternative I am considering.

Nonetheless, if Example objects are not the bottleneck, what else is? My dataset, serialized via DocBin.to_disk, takes 0.5GB on disk. When I load it for training it requires ~40GB of RAM, which is IMHO an insane amount of resources, isn't it?

polm Apr 16, 2021

The DocBin is gzipped so the fact that it's much bigger in memory than on disk is not surprising.
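To illustrate the general effect, here is a minimal sketch using Python's standard zlib module as a stand-in for whatever compression DocBin applies. The repetitive token data is made up for the example, and the sketch says nothing about spaCy's actual in-memory layout; it only shows that repetitive data can shrink a lot on disk.

```python
import zlib

# Hypothetical, highly repetitive data, loosely imitating a corpus
# in which the same tokens occur over and over.
raw = " ".join(["test"] * 100_000).encode("utf8")

# Compress the bytes, as a serializer might do before writing to disk.
compressed = zlib.compress(raw)

# How many times larger the uncompressed bytes are than the compressed ones.
ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.0f}x")
```

Real corpora are far less repetitive than this, so the ratio is only an upper-end illustration; loaded Python objects also carry per-object overhead that raw bytes do not.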

Joozty Apr 16, 2021
Author

But still, is that much of a difference not surprising?
