Memory leak issue #10015

Jan 10, 2022 · 10 comments · 34 replies

The memory usage increases slightly during processing because the pipeline vocab in nlp.vocab is not static. The lexeme cache (nlp.vocab) and string store (nlp.vocab.strings) grow as texts are processed and new tokens are seen. The lexeme cache is just for speed, but the string store is needed to map the string hashes back to strings.
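As a minimal sketch of that growth (assuming the en_core_web_sm pipeline is installed; `texts` stands in for your own input):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print(len(nlp.vocab), len(nlp.vocab.strings))  # baseline cache sizes

texts = ["Apple is looking at buying a U.K. startup.", "She sells seashells."]
for doc in nlp.pipe(texts):
    pass  # any processing; new tokens are cached as a side effect

# Both counts are higher now: the lexeme cache and the string store
# have grown to cover the newly seen tokens.
print(len(nlp.vocab), len(nlp.vocab.strings))
```

If this growth matters for a long-running process, one common workaround is to reload the pipeline periodically, which resets both caches to their initial state.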

If you're saving Doc objects directly for future processing, you'd need the string store to know which token strings were in the docs, since a Doc object stores only the hashes (ints) for the tokens, not the strings themselves.
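For this case, spaCy's DocBin is a convenient container: it serializes a collection of docs together with the strings they reference, so the hashes can be resolved again in a fresh vocab later. A sketch (the pipeline name is an assumption):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")

doc_bin = DocBin()
for doc in nlp.pipe(["Some text to keep.", "More text."]):
    doc_bin.add(doc)

# to_bytes() includes the strings the stored docs reference,
# so the docs can be reconstructed without the original vocab.
data = doc_bin.to_bytes()

# Later, e.g. in a fresh process with a fresh string store:
nlp2 = spacy.load("en_core_web_sm")
docs = list(DocBin().from_bytes(data).get_docs(nlp2.vocab))
```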

If you're saving the resulting annotation/analysis in some other form, then you probably don't need to keep track of the full string store.
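For example, continuing the first sketch above, materializing the annotations as plain Python strings means nothing depends on `nlp.vocab.strings` afterwards:

```python
# Copy the analysis out as plain data; once the strings are
# materialized here, the vocab caches can be discarded.
results = [
    [(token.text, token.pos_, token.dep_) for token in doc]
    for doc in nlp.pipe(texts)
]
```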

Labels: feat / vectors (Feature: Word vectors and similarity), perf / memory (Performance: memory use)
This discussion was converted from issue #10012 on January 10, 2022 11:30.