microsoft / Simplify-Docx

Simplify DOCX files to JSON


AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'

LukeALee opened this issue · comments

When I open my .docx file, which was saved from a .doc file using python-docx, I get this error:
C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\iterators\generic.py:194: UnexpectedElementWarning: Skipping unexpected tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}background
  UnexpectedElementWarning)
C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\iterators\generic.py:194: UnexpectedElementWarning: Skipping unexpected tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pict
  UnexpectedElementWarning)
Traceback (most recent call last):
  File "E:/_master/硕士论文/data/data_preprocess/temp.py", line 27, in <module>
    db_json = simplify(db)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\paragraph.py", line 142, in to_json
    out: Dict[str, Any] = super(paragraph, self).to_json(doc, options, super_iter)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\paragraph.py", line 27, in to_json
    for elt in run_iterator:
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\base.py", line 62, in __iter__
    self.__iter_name__ if self.__iter_name__ else self.__type__):
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\iterators\generic.py", line 167, in xml_iter
    for elt in xml_iter(current, handlers.TAGS_TO_NEST[current.tag], _msg):
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\iterators\generic.py", line 156, in xml_iter
    yield handlers.TAGS_TO_YIELD[current.tag](current)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\form.py", line 106, in __init__
    super(fldChar, self).__init__(x)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\base.py", line 36, in __init__
    self.props[prop] = getattr(x, prop)
AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'
What should I do to solve it?

Thank you for your question! This error indicates that the .docx file is not valid. Specifically, the Office Open XML specification requires the fldCharType attribute on fldChar elements, and the error indicates that this required attribute is missing.

Since you saved this file with python-docx (from the original .doc file), you might want to raise an issue with python-docx.
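
To check whether the file really contains fldChar elements without the attribute, something like the following may help. It is a minimal sketch using only the standard library. The function names are made up for illustration, and it assumes the main document part lives at word/document.xml, which is the usual location but not guaranteed for every file.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, as it appears in the warnings above.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_fldchars(document_xml):
    """Return (total, missing): all w:fldChar elements, and those lacking w:fldCharType."""
    root = ET.fromstring(document_xml)
    fld_chars = list(root.iter(f"{{{W_NS}}}fldChar"))
    missing = [e for e in fld_chars if e.get(f"{{{W_NS}}}fldCharType") is None]
    return len(fld_chars), len(missing)


def count_fldchars_in_docx(path):
    """Hypothetical helper: read word/document.xml from a .docx archive and count."""
    with zipfile.ZipFile(path) as archive:
        return count_fldchars(archive.read("word/document.xml"))


# Small in-memory sample: one well-formed fldChar and one without the attribute.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    '<w:r><w:fldChar/></w:r>'
    '</w:p></w:body></w:document>'
)
print(count_fldchars(SAMPLE))
```

If the second number is zero for your file, the attribute is present in the XML and the problem is more likely in how the element is being read than in the document itself.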

commented

I had the same error using the library with *.docx files that were saved from a *.dotx template file. Is there an easy workaround you can think of? The documents have a lot of checkboxes, tables, and dropdown fields. This library looked like the solution in my tests, until I hit this missing attribute on these documents. Thanks.

commented

Thanks for the response. Here is a complete example, including a sample docx that triggers the error: https://github.com/turbomanson/docx-scrape/blob/master/simplify_minimal_example.ipynb. I can't reproduce the error with a minimal document. I would appreciate it if you could take a look, but I understand this is on your own time.

I had the same error with some tables in the docx. Is there a particular docx format that simplify_docx expects? In other words, which attributes in docx files can simplify_docx not deal with?
Thanks.

Excuse me, has this problem been solved?

According to the README:

This project relies on the python-docx package which can be installed via
pip install python-docx. However, as of this writing, if you wish to
scrape documents which contain (A) form fields such as drop down lists,
checkboxes and text inputs or (B) nested documents (subdocs, altChunks,
etc.), you'll need to clone this fork of the python-docx package

You need a modified version of python-docx to support parsing forms. Otherwise python-docx hands fldChar back as a plain lxml element with no fldCharType property, which causes this error.
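
The last frame of the traceback fails in getattr(x, prop) on a plain 'lxml.etree._Element', which is a Python attribute lookup rather than a read of the XML attribute. The sketch below illustrates the difference. It uses the standard library's ElementTree as a stand-in for lxml and python-docx, so the element class is an assumption, but the distinction between the two kinds of access is the same.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FLD_CHAR_TYPE = f"{{{W_NS}}}fldCharType"

# A valid fldChar element: the required attribute is present in the XML.
elem = ET.fromstring(f'<w:fldChar xmlns:w="{W_NS}" w:fldCharType="begin"/>')

# Reading the XML attribute by its namespaced name works on any element object.
xml_value = elem.get(FLD_CHAR_TYPE)

# Python attribute access only works if the element's class defines that property.
try:
    getattr(elem, "fldCharType")
    python_attr_ok = True
except AttributeError:
    python_attr_ok = False

print(xml_value, python_attr_ok)
```

So the AttributeError can appear even when the document is valid, if the installed python-docx does not provide an element class for fldChar with that property.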

I ran into this error recently upon parsing a docx file, and using the forked python-docx didn't solve it for me. Instead I got this error:

/usr/local/lib/python3.9/site-packages/simplify_docx/elements/paragraph.py in to_json(self, doc, options, super_iter)
    140         """Coerce a container object to JSON
    141         """
--> 142         out: Dict[str, Any] = super(paragraph, self).to_json(doc, options, super_iter)
    143 
    144         if options.get("remove-leading-white-space", True):

/usr/local/lib/python3.9/site-packages/simplify_docx/elements/paragraph.py in to_json(self, doc, options, super_iter)
     28 
     29                 if _fldChar is not None:
---> 30                     finished: bool = _fldChar.update(elt)
     31                     if finished:
     32                         _fldchar_json = _fldChar.to_json(doc, options)

/usr/local/lib/python3.9/site-packages/simplify_docx/elements/form.py in update(self, other)
    255         if isinstance(other, fldChar):
    256             if other.props["fldCharType"] == "begin":
--> 257                 raise RuntimeError("Unhandled nesting of data fields")
    258 
    259             if other.props["fldCharType"] == "separate":

RuntimeError: Unhandled nesting of data fields

I'd be happy to share a source file if anyone has the bandwidth to look into it.