nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.

Home Page: https://parser.kitaev.io/


Serializing the output

veronica320 opened this issue

Hi there, thanks for putting together this awesome repo!

I'm wondering if it's possible to save the parser's output to disk (e.g., with pickle or spaCy's serialization methods).

For example:

import benepar, spacy

nlp = spacy.load('en_core_web_md')
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

doc = nlp("The time for action is now. It's never too late to do something.")

fwn = "output.spacy"
doc.to_disk(fwn)

would yield the error:

/envs/lib/python3.7/site-packages/torch/distributions/distribution.py:46: UserWarning: <class 'torch_struct.distributions.TreeCRF'> does not define `arg_constraints`. Please set `arg_constraints = {}` or initialize the distribution with `validate_args=False` to turn off validation.
  'with `validate_args=False` to turn off validation.')
Traceback (most recent call last):
  File "/envs/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3552, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-acc55f65a9e7>", line 1, in <module>
    runfile('/parsing/serialization.py', wdir='/parsing')
  File "/home//.pycharm_helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home//.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/parsing/serialization.py", line 15, in <module>
    doc.to_disk(fwn)
  File "spacy/tokens/doc.pyx", line 1270, in spacy.tokens.doc.Doc.to_disk
  File "spacy/tokens/doc.pyx", line 1271, in spacy.tokens.doc.Doc.to_disk
  File "spacy/tokens/doc.pyx", line 1298, in spacy.tokens.doc.Doc.to_bytes
  File "spacy/tokens/doc.pyx", line 1357, in spacy.tokens.doc.Doc.to_dict
  File "/envs/lib/python3.7/site-packages/spacy/util.py", line 1263, in to_dict
    serialized[key] = getter()
  File "spacy/tokens/doc.pyx", line 1354, in spacy.tokens.doc.Doc.to_dict.lambda19
  File "/envs/lib/python3.7/site-packages/srsly/_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/envs/lib/python3.7/site-packages/srsly/msgpack/__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly/msgpack/_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly/msgpack/_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'ConstituentData' object

Is there any workaround? This would be really useful when working with large datasets. Thanks for any guidance!