explosion / catalogue

Super lightweight function registries for your library

RuntimeError: dictionary changed size during iteration with spacy.load()

akapeletzis opened this issue

I am loading a spaCy model as part of a step in my Dataflow streaming pipeline. To load the pre-downloaded spaCy model for a specific language I am using nlp_model = spacy.load(SPACY_KEYS[lang]) where SPACY_KEYS is a dictionary containing the names of the models for each language (e.g. 'en': 'en_core_web_sm').

This works without any issues for the majority of the jobs run by the pipeline, but for a few iterations I am getting the following error, which seems to be coming from catalogue:

Error message from worker: generic::unknown: Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 752, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 870, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/usr/local/lib/python3.7/site-packages/submodules/entities_and_pii_removal.py", line 259, in entities_and_PII
    nlp_model = spacy.load(SPACY_KEYS[lang])  # load spacy model
  File "/usr/local/lib/python3.7/site-packages/spacy/__init__.py", line 52, in load
    name, vocab=vocab, disable=disable, exclude=exclude, config=config
  File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 420, in load_model
    return load_model_from_package(name, **kwargs)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 453, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, exclude=exclude, config=config)  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.7/site-packages/de_core_news_sm/__init__.py", line 10, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 621, in load_model_from_init_py
    config=config,
  File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 489, in load_model_from_path
    return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 2042, in from_disk
    util.from_disk(path, deserializers, exclude)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 1299, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 2037, in <lambda>
    p, exclude=["vocab"]
  File "spacy/pipeline/trainable_pipe.pyx", line 343, in spacy.pipeline.trainable_pipe.TrainablePipe.from_disk
  File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 1299, in from_disk
    reader(path / key)
  File "spacy/pipeline/trainable_pipe.pyx", line 333, in spacy.pipeline.trainable_pipe.TrainablePipe.from_disk.load_model
  File "spacy/pipeline/trainable_pipe.pyx", line 334, in spacy.pipeline.trainable_pipe.TrainablePipe.from_disk.load_model
  File "/usr/local/lib/python3.7/site-packages/thinc/model.py", line 593, in from_bytes
    return self.from_dict(msg)
  File "/usr/local/lib/python3.7/site-packages/thinc/model.py", line 624, in from_dict
    loaded_value = deserialize_attr(default_value, value, attr, node)
  File "/usr/local/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/usr/local/lib/python3.7/site-packages/thinc/model.py", line 804, in deserialize_attr
    return srsly.msgpack_loads(value)
  File "/usr/local/lib/python3.7/site-packages/srsly/_msgpack_api.py", line 27, in msgpack_loads
    msg = msgpack.loads(data, raw=False, use_list=use_list)
  File "/usr/local/lib/python3.7/site-packages/srsly/msgpack/__init__.py", line 76, in unpackb
    for decoder in msgpack_decoders.get_all().values():
  File "/usr/local/lib/python3.7/site-packages/catalogue/__init__.py", line 110, in get_all
    for keys, value in REGISTRY.items():
RuntimeError: dictionary changed size during iteration

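Hi, I think we need to replace REGISTRY.items() with REGISTRY.copy().items() in get_all, so that the loop iterates over a snapshot of the registry and doesn't fail with this kind of runtime error if the registry is modified in the meantime.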
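To illustrate the idea, here is a minimal, self-contained sketch. It does not use catalogue itself: the `registry` dict, the `on_item` callback and both function names are hypothetical stand-ins, and the callback only simulates "something else added an entry while the loop was running" in a deterministic, single-threaded way.

```python
from typing import Any, Callable, Dict, Tuple

Registry = Dict[Tuple[str, ...], Any]


def get_all_unsafe(registry: Registry, on_item: Callable[[], None]) -> Registry:
    """Iterate the live dict, as in the traceback above."""
    result: Registry = {}
    for keys, value in registry.items():
        on_item()  # stand-in for a registration happening mid-iteration
        result[keys] = value
    return result


def get_all_snapshot(registry: Registry, on_item: Callable[[], None]) -> Registry:
    """Iterate a shallow copy, as proposed in the comment above."""
    result: Registry = {}
    for keys, value in registry.copy().items():
        on_item()  # the live dict may grow; the copy being iterated does not
        result[keys] = value
    return result


if __name__ == "__main__":
    live: Registry = {("srsly", "msgpack_decoders", "first"): "decoder_a"}

    def add_entry() -> None:
        # Hypothetical late registration.
        live[("srsly", "msgpack_decoders", "second")] = "decoder_b"

    try:
        get_all_unsafe(live, add_entry)
    except RuntimeError as err:
        print("unsafe:", err)

    live = {("srsly", "msgpack_decoders", "first"): "decoder_a"}
    print("snapshot:", get_all_snapshot(live, add_entry))
```

The snapshot version returns only the entries that existed when the copy was taken; an entry registered during the loop shows up on the next call.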

This kind of error is hard to reproduce on our end, so it would help us if you could install the modified version of catalogue from #29 and check whether it resolves the problem for you:

pip install https://github.com/adrianeboyd/catalogue/archive/refs/heads/bugfix/copy-items-iteration.zip