Error parsing document with bullet list

sam2k13 opened this issue 4 years ago

I am trying to use this library to extract the indentation level from a Word document. I get the following error when the Word document includes a bullet point list:

Traceback (most recent call last):
  File "/home/src/main.py", line 9, in <module>
    text_groups = convert_document_to_text_groups(DOCUMENT_NAME, PATH_TO_DOCUMENT)
  File "/home/src/modules/document_converter.py", line 7, in convert_document_to_text_groups
    my_doc_as_json = simplify(document)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/paragraph.py", line 167, in to_json
    _indent = get_paragraph_ind(self.fragment, doc)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/utils/paragrapy_style.py", line 56, in get_paragraph_ind
    num_style.pPr is not None and \
AttributeError: 'lxml.etree._Element' object has no attribute 'pPr'

I have nothing in my Word document except these two lines.
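For readers hitting the same traceback: the last frame suggests that `num_style` is a plain `lxml.etree._Element`, which has no `pPr` attribute, so the attribute access fails before the `is not None` check can run. A minimal sketch of a more defensive lookup is shown here. It is an illustration only, not the fix that was applied to simplify_docx: `get_level_indent` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example runs on its own.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in .docx XML parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def get_level_indent(lvl):
    """Return the w:ind attributes of a numbering level as a dict, or None.

    Hypothetical helper: it never assumes the element exposes a `pPr`
    attribute, so a plain XML element does not raise AttributeError.
    """
    # Use the attribute if the element class provides one ...
    pPr = getattr(lvl, "pPr", None)
    if pPr is None:
        # ... otherwise fall back to a namespace-qualified child lookup.
        pPr = lvl.find("w:pPr", NS)
    if pPr is None:
        return None
    ind = pPr.find("w:ind", NS)
    if ind is None:
        return None
    # Attribute keys look like "{namespace}left"; keep only the local name.
    return {key.split("}")[1]: value for key, value in ind.attrib.items()}


# A bullet-list level similar to what Word writes into numbering.xml.
SAMPLE_LVL = f"""
<w:lvl xmlns:w="{W_NS}" w:ilvl="0">
  <w:numFmt w:val="bullet"/>
  <w:pPr><w:ind w:left="720" w:hanging="360"/></w:pPr>
</w:lvl>
"""

lvl = ET.fromstring(SAMPLE_LVL)
indent = get_level_indent(lvl)
print(indent)
```

The point of the fallback is that a missing `pPr` yields `None` instead of an exception, so a caller can treat the paragraph as having no list indentation.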

Jason Thorpe · Answer 1 · Fri May 15 2020 10:08:38 GMT+0800 (China Standard Time)

Thanks for this bug report. I'll try and get to this in the next few days, or feel free to submit a PR. (I maintain this in my "free" time)

bighaidao · Answer 2 · Mon May 18 2020 16:20:22 GMT+0800 (China Standard Time)

I have the same error. I hope it can be resolved.

  File "/Users/haidao/anaconda3/lib/python3.7/site-packages/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/Users/haidao/anaconda3/lib/python3.7/site-packages/simplify_docx/elements/paragraph.py", line 167, in to_json
    _indent = get_paragraph_ind(self.fragment, doc)
  File "/Users/haidao/anaconda3/lib/python3.7/site-packages/simplify_docx/utils/paragrapy_style.py", line 56, in get_paragraph_ind
    num_style.pPr is not None and \
AttributeError: 'lxml.etree._Element' object has no attribute 'pPr'

Deleted user · Answer 3 · Thu Jun 11 2020 00:29:39 GMT+0800 (China Standard Time)

Hello!

I had the same problem and solved it by installing the version of python-docx that is specified in the README.md:

This project relies on the python-docx package which can be installed via pip install python-docx. However, as of this writing, if you wish to scrape documents which contain (A) form fields such as drop down lists, checkboxes and text inputs or (B) nested documents (subdocs, altChunks, etc.), you'll need to clone this fork of the python-docx package.
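Before switching packages, it can help to confirm which python-docx build is actually installed in the environment that raises the error. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); `installed_version` is a hypothetical helper name, and the check reports only the installed version string, not whether it matches the fork mentioned in the README.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report what the current interpreter would import for each dependency.
for name in ("python-docx", "simplify-docx"):
    print(name, "->", installed_version(name) or "not installed")
```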

Jason Thorpe · Answer 4 · Thu Jun 11 2020 00:36:58 GMT+0800 (China Standard Time)

Thanks @bastiansg. I'm glad that worked for you, though in principle that shouldn't be required for nested lists. However, it tells me that python-docx probably had a minor change since I made that fork, and I should update this repo to adapt. I'll try to get to it in the next day or so.

Hien Hoang · Answer 5 · Tue Oct 19 2021 17:27:09 GMT+0800 (China Standard Time)

Hi @jdthorpe, this issue is similar to #12.
I have already provided a fix for this in PR #13.

Hope this helps.

Jason Thorpe · Answer 6 · Wed Mar 09 2022 05:35:48 GMT+0800 (China Standard Time)

Closing this issue as it appears to be identical to issue #12.