python pptx not extracting all the text from the slide

Question

vignesh0710 opened this issue 3 months ago

I am trying to use python-pptx to extract all the text in a given slide, but it misses the text in some of the text boxes.

code:

from pptx import Presentation

def getPptContent(path):
    prs = Presentation(path)
    text_runs = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return text_runs

Some text is missing from the result. However, when I print the contents of the slide from slide._element, I can see the missing text in the XML.

Any suggestions would be helpful.

Steve Canny · Answer 1 · Thu Mar 14 2024 08:35:54 GMT+0800 (China Standard Time)

A shape can be a group shape, which can contain other shapes.

So you need to recursively descend into group shapes (which can be nested multiple layers deep).

Search for "python-pptx traverse group shapes" and you should find what you need. This one, for example: https://stackoverflow.com/questions/51701626/how-to-extract-text-from-a-text-shape-within-a-group-shape-in-powerpoint-using
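
For reference, the recursive descent described above might look roughly like the sketch below. It is not taken from the linked answer. The names iter_shapes and get_ppt_content are made up for illustration and are not part of python-pptx, and the sketch assumes a group shape can be recognized by its .shapes collection (comparing shape.shape_type against MSO_SHAPE_TYPE.GROUP is another common way to test for a group).

```python
def iter_shapes(shapes):
    """Yield every non-group shape in `shapes`, descending into groups at any depth."""
    for shape in shapes:
        # Assumption: only group shapes expose a `.shapes` collection of children.
        children = getattr(shape, "shapes", None)
        if children is not None:
            # Recurse so that groups nested inside groups are also covered.
            yield from iter_shapes(children)
        else:
            yield shape


def get_ppt_content(path):
    """Collect the text of every run, including runs inside group shapes."""
    # Imported here so that iter_shapes itself has no dependency on python-pptx.
    from pptx import Presentation

    prs = Presentation(path)
    text_runs = []
    for slide in prs.slides:
        for shape in iter_shapes(slide.shapes):
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return text_runs
```

The only change from the code in the question is that the inner loop iterates over iter_shapes(slide.shapes) instead of slide.shapes, so shapes inside groups are visited along with the top-level ones.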