I have encountered a critical bug that makes continuous tokenization with the Japanese tokenizer close to impossible. It's all due to a memory leak in MorphAnalysis.
How to reproduce the behaviour
import gc
import spacy
import tracemalloc

tracemalloc.start()
tokenizer = spacy.blank("ja")
tokenizer.add_pipe("sentencizer")

for _ in range(1000):
    text = " ".join(["a"] * 1000)
    snapshot = tracemalloc.take_snapshot()
    with tokenizer.memory_zone():
        doc = tokenizer(text)
    tokenizer.max_length = len(text) + 10
    gc.collect()
    snapshot2 = tracemalloc.take_snapshot()
    # Compare the two snapshots
    p_stats = snapshot2.compare_to(snapshot, "lineno")
    # Pretty print the top 10 differences
    print("[ Top 10 ]")
    # Stop here with pdb
    for stat in p_stats[:10]:
        if stat.size_diff > 0:
            print(stat)
Run this script and observe how memory keeps growing:
It all happens due to this line: token.morph = MorphAnalysis(self.vocab, morph). I have checked the implementation itself, and there is neither deallocation code nor support for memory_zone.
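Until this is addressed, here is a minimal workaround sketch, assuming the goal is only to bound memory growth: periodically re-create the blank "ja" pipeline so that the MorphAnalysis objects accumulated in its Vocab can be garbage-collected once nothing references the old pipeline. The RELOAD_EVERY interval and the synthetic texts generator are illustrative placeholders, not part of spaCy's API.

import gc
import spacy

RELOAD_EVERY = 10_000  # illustrative interval, tune to your workload

def make_nlp():
    nlp = spacy.blank("ja")
    nlp.add_pipe("sentencizer")
    return nlp

nlp = make_nlp()
texts = (" ".join(["a"] * 1000) for _ in range(100_000))  # same synthetic input as the repro
for i, text in enumerate(texts):
    doc = nlp(text)
    # ... consume doc here; do not keep long-lived references to it ...
    if (i + 1) % RELOAD_EVERY == 0:
        # Drop the old pipeline (and its Vocab) so the accumulated
        # MorphAnalysis objects become collectable, then start fresh.
        del nlp
        gc.collect()
        nlp = make_nlp()

This only sidesteps the leak from the outside; it does not fix the missing deallocation or memory_zone support in MorphAnalysis itself.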