microsoft / Simplify-Docx

Simplify DOCX files to JSON

Error parsing document with bullet list

sam2k13 opened this issue

I am trying to use this library to gather the indentation level of paragraphs in a Word document. I get the following error when the document includes a bulleted list:

Traceback (most recent call last):
  File "/home/src/main.py", line 9, in <module>
    text_groups = convert_document_to_text_groups(DOCUMENT_NAME, PATH_TO_DOCUMENT)
  File "/home/src/modules/document_converter.py", line 7, in convert_document_to_text_groups
    my_doc_as_json = simplify(document)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/paragraph.py", line 167, in to_json
    _indent = get_paragraph_ind(self.fragment, doc)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/utils/paragrapy_style.py", line 56, in get_paragraph_ind
    num_style.pPr is not None and \
AttributeError: 'lxml.etree._Element' object has no attribute 'pPr'

My Word document contains nothing except these two lines.

[image: screenshot of the Word document]

Thanks for this bug report. I'll try and get to this in the next few days, or feel free to submit a PR. (I maintain this in my "free" time)

I have the same error. I hope it gets resolved.

  File "/Users/haidao/anaconda3/lib/python3.7/site-packages/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/Users/haidao/anaconda3/lib/python3.7/site-packages/simplify_docx/elements/paragraph.py", line 167, in to_json
    _indent = get_paragraph_ind(self.fragment, doc)
  File "/Users/haidao/anaconda3/lib/python3.7/site-packages/simplify_docx/utils/paragrapy_style.py", line 56, in get_paragraph_ind
    num_style.pPr is not None and
AttributeError: 'lxml.etree._Element' object has no attribute 'pPr'

Hello!

I had the same problem and solved it by installing the version of python-docx that is specified in the README.md:

This project relies on the python-docx package which can be installed via pip install python-docx. However, as of this writing, if you wish to scrape documents which contain (A) form fields such as drop down lists, checkboxes and text inputs or (B) nested documents (subdocs, altChunks, etc.), you'll need to clone this fork of the python-docx package.
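Since the workaround above depends on which python-docx build is installed, it can help to check what the environment actually has before and after switching. The following is a minimal sketch using only the standard library (Python 3.8+); `installed_version` is a hypothetical helper name, not part of either package. Note that a fork installed from a clone may report the same version number as the PyPI release, so this only confirms that a distribution is present and what version it claims to be.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a distribution, or None.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the package itself.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "python-docx" is the distribution name used by `pip install python-docx`.
print("python-docx:", installed_version("python-docx"))
```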

Thanks @bastiansg. I'm glad that worked for you, though in principle that shouldn't be required for nested lists. However, it tells me that python-docx probably had a minor change since I made that fork, and I should update this repo to adapt. I'll try to get to it in the next day or so.
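For readers hitting the same traceback, here is a minimal sketch of the failure mode and one defensive way around it. It is not the fix from the PR mentioned in this thread. The assumption is that python-docx exposes children such as `pPr` as attributes only on elements it has a custom class for, so an element that comes back as a plain `lxml.etree._Element` has no `.pPr`. The sketch uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so it runs without extra packages; `find_ppr` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the "w:" prefix in .docx XML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_ppr(element):
    """Return the <w:pPr> child of `element`, or None if there is none.

    Hypothetical helper: tries attribute access first (which works on
    elements that expose `pPr` as a property), then falls back to a
    namespaced child lookup that works on any plain element.
    """
    ppr = getattr(element, "pPr", None)
    if ppr is not None:
        return ppr
    return element.find(f"{{{W_NS}}}pPr")


# A numbering level like the ones a bulleted list refers to.
lvl = ET.fromstring(
    f'<w:lvl xmlns:w="{W_NS}">'
    '<w:pPr><w:ind w:left="720" w:hanging="360"/></w:pPr>'
    "</w:lvl>"
)

# Direct attribute access fails on a plain element, as in the traceback.
error_message = None
try:
    lvl.pPr
except AttributeError as exc:
    error_message = str(exc)

# The fallback lookup still finds the paragraph properties.
ppr = find_ppr(lvl)
ind = ppr.find(f"{{{W_NS}}}ind")
left = ind.get(f"{{{W_NS}}}left")
print(error_message)
print("left indent:", left)
```

With lxml the same fallback would use `element.find` with the same namespaced tag; whether that matches what the library needs at that line is an assumption.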

Hi @jdthorpe, this issue is similar to #12. I have already provided a fix for it in PR #13.

Hope this helps.

Closing this issue as it appears to be identical to issue #12.