I have encountered a critical bug that makes continuous tokenization with the Japanese tokenizer close to impossible. It's all due to a memory leak in MorphAnalysis.
How to reproduce the behaviour
import gc
import spacy
import tracemalloc

tracemalloc.start()
tokenizer = spacy.blank("ja")
tokenizer.add_pipe("sentencizer")

for _ in range(1000):
    text = " ".join(["a"] * 1000)
    snapshot = tracemalloc.take_snapshot()
    with tokenizer.memory_zone():
        doc = tokenizer(text)
    tokenizer.max_length = len(text) + 10
    gc.collect()
    snapshot2 = tracemalloc.take_snapshot()
    # Compare the two snapshots
    p_stats = snapshot2.compare_to(snapshot, "lineno")
    # Pretty print the top 10 differences
    print("[ Top 10 ]")
    # Stop here with pdb
    for stat in p_stats[:10]:
        if stat.size_diff > 0:
            print(stat)
Run this script and observe how memory keeps growing:
It all happens due to this line: token.morph = MorphAnalysis(self.vocab, morph). I have checked the implementation itself, and there is neither deallocation code nor support for memory_zone.
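Until this is addressed, here is a minimal workaround sketch, assuming the goal is only to bound memory growth: periodically re-create the blank "ja" pipeline so that the MorphAnalysis objects accumulated in its Vocab can be garbage-collected once nothing references the old pipeline. The RELOAD_EVERY interval and the synthetic texts generator are illustrative placeholders, not part of spaCy's API.

import gc
import spacy

RELOAD_EVERY = 10_000  # illustrative interval, tune to your workload

def make_nlp():
    nlp = spacy.blank("ja")
    nlp.add_pipe("sentencizer")
    return nlp

nlp = make_nlp()
texts = (" ".join(["a"] * 1000) for _ in range(100_000))  # same synthetic input as the repro
for i, text in enumerate(texts):
    doc = nlp(text)
    # ... consume doc here; do not keep long-lived references to it ...
    if (i + 1) % RELOAD_EVERY == 0:
        # Drop the old pipeline (and its Vocab) so the accumulated
        # MorphAnalysis objects become collectable, then start fresh.
        del nlp
        gc.collect()
        nlp = make_nlp()

This only sidesteps the leak from the outside; it does not fix the missing deallocation or memory_zone support in MorphAnalysis itself.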