PII Modifier fails to load on worker sporadically raising cannot reshape array of size #424

Open
praateekmahajan opened this issue Dec 11, 2024 · 0 comments
Labels
bug Something isn't working

praateekmahajan (Collaborator) commented Dec 11, 2024

In my experience running the PII Modifier, if you start from a fresh Docker container and run deidentify --device gpu ... the job may fail while constructing the PiiDeidentifier on each worker node, which calls spacy.load(..). A second run usually succeeds.
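
Since a second attempt usually succeeds, one stopgap is to retry the load a few times before giving up. The following is only a sketch: `load_with_retry` and its parameters are hypothetical names, not part of NeMo Curator or spaCy, and it assumes the failure is transient within a single process, which this issue has not confirmed.

```python
import time


def load_with_retry(loader, attempts=3, delay_s=1.0, retry_on=(ValueError,)):
    """Call loader() until it succeeds or the attempts are used up.

    loader   -- zero-argument callable, e.g. lambda: spacy.load("en_core_web_lg")
    retry_on -- exception types treated as transient (assumption: the
                "cannot reshape array" ValueError is one of them)
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return loader()
        except retry_on as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s)  # give other workers time to finish
    raise last_error
```

A retry hides the symptom rather than explaining it, so it is best paired with logging of each failed attempt.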

Possible fixes

  1. Explicitly download the model before running the job: python -m spacy download en_core_web_lg
  2. Upgrade spaCy to v3
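
For the first fix, a pre-flight check can confirm the model package is present before any worker starts. The stack trace shows en_core_web_lg being loaded as an ordinary package from site-packages, so a standard-library lookup is enough for a rough check. The helper names below are hypothetical, and this only tests that the package can be found, not that its files are complete.

```python
import importlib.util


def model_is_installed(package_name: str) -> bool:
    """Return True if the package can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None


def missing_models(package_names):
    """Return the subset of package_names that cannot be found."""
    return [name for name in package_names if not model_is_installed(name)]


if __name__ == "__main__":
    missing = missing_models(["en_core_web_lg"])
    if missing:
        print("Install these before starting workers:")
        for name in missing:
            print(f"  python -m spacy download {name}")
```

Running this once in the container image build or at job start-up would surface a missing model early instead of inside a worker.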

Related issues

  1. ValueError: cannot reshape array of size ... when loading model (explosion/spaCy#13262)
  2. Run into 'cannot reshape array' problem when spacy.load triggered multiple times (explosion/spaCy#8867)
  3. Cannot load en_core_web_lg on Linux (explosion/spaCy#3402)

Stacktrace

  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modifiers/pii_modifier.py", line 79, in modify_document
    deidentifier = load_object_on_worker("deidentifier", self.load_deidentifier, {})
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 790, in load_object_on_worker
    obj = load_object_function(**load_object_kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modifiers/pii_modifier.py", line 102, in load_deidentifier
    deidentifier: PiiDeidentifier = PiiDeidentifier(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/pii/algorithm.py", line 99, in __init__
    self.analyzer = AnalyzerEngine(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/presidio_analyzer/analyzer_engine.py", line 71, in __init__
    self.nlp_engine.load()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/pii/custom_nlp_engine.py", line 48, in load
    self.nlp[model["lang_code"]] = spacy.load(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/util.py", line 465, in load_model
    return load_model_from_package(name, **kwargs)  # type: ignore[arg-type]
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/util.py", line 501, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, enable=enable, exclude=exclude, config=config)  # type: ignore[attr-defined]
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/en_core_web_lg/__init__.py", line 10, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/util.py", line 682, in load_model_from_init_py
    return load_model_from_path(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/util.py", line 547, in load_model_from_path
    return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/language.py", line 2209, in from_disk
    util.from_disk(path, deserializers, exclude)  # type: ignore[arg-type]
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/util.py", line 1390, in from_disk
    reader(path / key)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/language.py", line 2185, in deserialize_vocab
    self.vocab.from_disk(path, exclude=exclude)
  File "spacy/vocab.pyx", line 515, in spacy.vocab.Vocab.from_disk
    self.vectors.from_disk(path, exclude=["strings"])
  File "spacy/vectors.pyx", line 718, in spacy.vectors.Vectors.from_disk
    util.from_disk(path, serializers, exclude)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/spacy/util.py", line 1390, in from_disk
    reader(path / key)
  File "spacy/vectors.pyx", line 703, in spacy.vectors.Vectors.from_disk.load_vectors
    self.data = ops.xp.load(str(path))
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/_io/npz.py", line 64, in load
    obj = numpy.load(file, mmap_mode, allow_pickle)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numpy/lib/npyio.py", line 456, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numpy/lib/format.py", line 839, in read_array
    array.shape = shape
ValueError: cannot reshape array of size 23068640 into shape (514157,300)
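
The sizes in the final ValueError can be checked with a little arithmetic. The sketch below uses only the two numbers from the error message. The element size (float32, 4 bytes) and the 128-byte .npy header are assumptions, not something the trace states. Under those assumptions the data read comes to exactly 88 MiB, which would be consistent with a vectors file that was only partly on disk when it was read.

```python
import math

# Numbers taken directly from the ValueError above.
expected_shape = (514157, 300)
actual_elements = 23068640

expected_elements = math.prod(expected_shape)
fraction_read = actual_elements / expected_elements

# Assumptions (not stated in the trace): float32 vectors, 128-byte .npy header.
ITEM_SIZE = 4
HEADER_BYTES = 128
bytes_read = actual_elements * ITEM_SIZE + HEADER_BYTES

print(expected_elements)        # 154247100
print(f"{fraction_read:.1%}")   # 15.0%
print(bytes_read / 2**20)       # 88.0
```

Only about 15% of the expected elements were available, and 23068640 is not a multiple of 300, so the data could not form whole rows either.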

praateekmahajan added the bug (Something isn't working) label Dec 11, 2024