python-pptx not extracting all the text from the slide
vignesh0710 opened this issue
I am trying to use python-pptx to extract all the text in a given slide, but it is missing the text from some text boxes.
code:
from pptx import Presentation

def getPptContent(path):
    prs = Presentation(path)
    text_runs = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return text_runs
Some text is missing from the result, yet when I print the contents of the slide from slide._element I can see the missing text in the XML.
Any suggestions would be helpful.
A shape can be a group shape which can contain other shapes.
So you need to recursively descend into group shapes (which can be nested multiple layers deep).
Have a Google for "python-pptx traverse group shapes" and you should find what you need. This one, for example: https://stackoverflow.com/questions/51701626/how-to-extract-text-from-a-text-shape-within-a-group-shape-in-powerpoint-using
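A minimal sketch of that recursive descent, in case it helps. The function names (iter_leaf_shapes, collect_text_runs, get_ppt_content) are made up for illustration. It assumes that group shapes are the only shapes exposing a `shapes` collection; checking `shape.shape_type == MSO_SHAPE_TYPE.GROUP` is the more explicit alternative. Like the original code it only collects run text, so anything that lives outside a text frame is still skipped.

```python
def iter_leaf_shapes(shapes):
    """Yield every non-group shape, descending into group shapes recursively."""
    for shape in shapes:
        # Assumption: only group shapes expose a `shapes` collection.
        children = getattr(shape, "shapes", None)
        if children is not None:
            # Groups can be nested several layers deep, so recurse.
            yield from iter_leaf_shapes(children)
        else:
            yield shape


def collect_text_runs(shapes):
    """Return the text of every run in every text-bearing shape."""
    text_runs = []
    for shape in iter_leaf_shapes(shapes):
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                text_runs.append(run.text)
    return text_runs


def get_ppt_content(path):
    """Collect run text from every slide of the presentation at `path`."""
    # Imported here so the helpers above can be used without python-pptx installed.
    from pptx import Presentation

    prs = Presentation(path)
    text_runs = []
    for slide in prs.slides:
        text_runs.extend(collect_text_runs(slide.shapes))
    return text_runs
```

Calling get_ppt_content on the path to your .pptx file should then return the runs from grouped text boxes as well as the top-level ones.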