scanny / python-pptx

Create Open XML PowerPoint documents in Python

Removing video results in corrupted file.

shoang22 opened this issue

Hello,

I'm trying to remove all movies on each slide with the following:

import pptx
from pptx.shapes.picture import Movie


def remove_movie(file_path: str):
    prs = pptx.Presentation(file_path)
    for slide in prs.slides:
        # Copy the shapes into a list so removal doesn't disturb the iteration.
        for shape in list(slide.shapes):
            if isinstance(shape, Movie):
                vid = shape._element
                vid.getparent().remove(vid)
    prs.save(file_path.rpartition(".")[0] + "_no_movies.pptx")

The code executes successfully, but when I try to open the output file, I get the following error:

PowerPoint found a problem with content in blank_presentation_no_movies.pptx.
PowerPoint can attempt to repair the presentation.

If you trust the source of this presentation, click Repair.

Is there something that I'm doing wrong?

There's probably rather more to deleting a movie than removing a chunk of XML.

@shoang22 you're going to want to remove the relationship from the slide (package) part to the part containing the movie (Media part maybe?). Otherwise I expect PowerPoint isn't going to like seeing the orphaned movie. I'm not sure that's the whole problem; unfortunately, the repair error gives us no idea of what it considers a "problem with content".

And what would be the strategy for doing this - in Python code? I ask, @scanny, because this logic is probably common to other removals.

Basically dig out the relationship and delete it.

The relationship(s) would be identified by an r:embed or r:link attribute with a value like "rId{N}", I believe; dumping the XML for the movie shape would give you an idea.

Then you need to get to the slide part because that's the source side of the relationship, so something like:

slide_part = slide.part
# "rIdN" is a placeholder for the actual relationship id found in the shape XML.
slide_part.drop_rel("rIdN")

Somebody can dig through and refine that with actual code if they have a mind to :)
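As a starting point for the "dig out the relationship" step, here is a minimal sketch that lists every relationship id a shape's XML refers to. It uses only the standard library and works on an XML string; the function name `referenced_rids` and the trimmed sample shape are hypothetical, made up for illustration rather than taken from python-pptx.

```python
import xml.etree.ElementTree as ET

# Namespace of the r:embed, r:link and r:id attributes in Open XML parts.
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def referenced_rids(shape_xml: str) -> list[str]:
    """Return the sorted relationship ids referenced anywhere in a shape's XML."""
    root = ET.fromstring(shape_xml)
    rids = set()
    for elm in root.iter():
        for name, value in elm.attrib.items():
            # ElementTree spells namespaced attributes as "{namespace}local".
            if name.startswith("{" + REL_NS + "}"):
                rids.add(value)
    return sorted(rids)


# Hypothetical, heavily trimmed movie shape, for illustration only.
SAMPLE = """
<p:pic xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
       xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
       xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <p:nvPicPr>
    <p:cNvPr id="4" name="movie"/>
    <p:nvPr><a:videoFile r:link="rId2"/></p:nvPr>
  </p:nvPicPr>
  <p:blipFill><a:blip r:embed="rId5"/></p:blipFill>
</p:pic>
"""

print(referenced_rids(SAMPLE))  # ['rId2', 'rId5']
```

With a real presentation, the XML string could come from something like `etree.tostring(shape._element).decode()` using lxml. Note that an id found this way may also be used by another shape on the same slide (a shared image, say), so it is worth checking before dropping it.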

Something like this?

import pptx
from lxml import etree
from pptx.shapes.picture import Movie


def remove_movie(file_path: str) -> None:
    prs = pptx.Presentation(file_path)
    for slide in prs.slides:
        # Copy the shapes into a list so removal doesn't disturb the iteration.
        for shape in list(slide.shapes):
            if isinstance(shape, Movie):
                p = slide.part
                # Dump the slide's relationships before the change.
                x = etree.fromstring(p.rels.xml)
                before = etree.tostring(x, pretty_print=True)
                print(before.decode())
                # Remove the movie shape from the slide XML.
                vid = shape.element
                vid.getparent().remove(vid)
                # "rId2" is the video relationship in this particular file.
                p.rels.pop("rId2")
                # Dump the relationships again to confirm the removal.
                y = etree.fromstring(p.rels.xml)
                after = etree.tostring(y, pretty_print=True)
                print(after.decode())

    prs.save(file_path.rpartition(".")[0] + "_no_movies.pptx")

Prints:

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.microsoft.com/office/2007/relationships/media" Target="../media/media1.mp4"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/video" Target="../media/media1.mp4"/>
  <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout1.xml"/>
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/notesSlide" Target="../notesSlides/notesSlide1.xml"/>
  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image1.png"/>
  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image2.jpeg"/>
</Relationships>

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.microsoft.com/office/2007/relationships/media" Target="../media/media1.mp4"/>
  <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout1.xml"/>
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/notesSlide" Target="../notesSlides/notesSlide1.xml"/>
  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image1.png"/>
  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image2.jpeg"/>
</Relationships>

But I'm still getting the same error when I attempt to open the file.
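Two things stand out in the dumps above: the media relationship rId1 is still present after the change, and the dumps say nothing about the slide XML itself. Slides with video normally carry a p:timing section whose p:spTgt elements name the movie shape by its id, so a leftover entry there is a plausible reason for the repair prompt. That is an assumption, not something confirmed in this thread. A minimal standard-library sketch for spotting such dangling references follows; `dangling_shape_targets` and the trimmed sample slide are hypothetical.

```python
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"


def dangling_shape_targets(slide_xml: str) -> list[str]:
    """Return shape ids named by p:spTgt elements that no shape on the slide has."""
    root = ET.fromstring(slide_xml)
    # Every shape (and the shape tree itself) carries its id on a p:cNvPr element.
    shape_ids = {e.get("id") for e in root.iter("{%s}cNvPr" % P_NS)}
    # Timing nodes point at shapes through the spid attribute of p:spTgt.
    targets = {
        e.get("spid") for e in root.iter("{%s}spTgt" % P_NS) if e.get("spid")
    }
    return sorted(targets - shape_ids)


# Hypothetical, trimmed slide: the movie shape (id 4) is gone, its timing node is not.
SAMPLE_SLIDE = """
<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <p:cSld>
    <p:spTree>
      <p:nvGrpSpPr><p:cNvPr id="1" name=""/></p:nvGrpSpPr>
      <p:sp><p:nvSpPr><p:cNvPr id="2" name="Title"/></p:nvSpPr></p:sp>
    </p:spTree>
  </p:cSld>
  <p:timing>
    <p:tnLst>
      <p:video>
        <p:cMediaNode>
          <p:tgtEl><p:spTgt spid="4"/></p:tgtEl>
        </p:cMediaNode>
      </p:video>
    </p:tnLst>
  </p:timing>
</p:sld>
"""

print(dangling_shape_targets(SAMPLE_SLIDE))  # ['4']
```

If this reports anything for a slide in the saved file, the timing entries that point at the removed shape are candidates for removal too.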

Okay, so a couple possible approaches:

  1. Do the repair and save the result to a separate file. Then compare the XML from the original to the repaired version to see how PowerPoint "fixes" the presentation.
  2. Extract the original PowerPoint file to a directory ($ unzip original.pptx). Then make the changes by hand, re-zip the presentation into a PPTX file and keep trying things until it works.
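For the first approach, a rough way to see where the original and the repaired file differ is to compare the two packages part by part. This is only a sketch using the standard library; `diff_packages` and `_make_package` are hypothetical helper names, and the in-memory packages are made-up stand-ins so the example runs without a real presentation.

```python
import io
import zipfile


def diff_packages(original, repaired):
    """Compare two .pptx files (paths or file-like objects) part by part."""
    with zipfile.ZipFile(original) as a, zipfile.ZipFile(repaired) as b:
        names_a, names_b = set(a.namelist()), set(b.namelist())
        # A byte-for-byte comparison; reserialized XML also shows up as changed.
        changed = sorted(n for n in names_a & names_b if a.read(n) != b.read(n))
        return {
            "only_in_original": sorted(names_a - names_b),
            "only_in_repaired": sorted(names_b - names_a),
            "changed": changed,
        }


def _make_package(parts):
    """Build an in-memory zip from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in parts.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf


# Made-up contents, only to exercise the function.
original = _make_package({
    "ppt/slides/slide1.xml": b"<sld><timing/></sld>",
    "ppt/media/media1.mp4": b"\x00\x01",
})
repaired = _make_package({"ppt/slides/slide1.xml": b"<sld/>"})

result = diff_packages(original, repaired)
print(result)
```

PowerPoint tends to rewrite many parts when it saves, so the "changed" list can be long on real files. The parts that exist on only one side, and the slide parts themselves, are usually the first places to look.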

The opc-diag tool was built for this kind of exploration:

  • You'll need to install from the develop branch on GitHub for it to work with Python 3: https://github.com/python-openxml/opc-diag/commits/develop/. Pretty sure it's something like: pip install -U git+https://github.com/python-openxml/opc-diag.git@develop
  • Documentation is here: https://opc-diag.readthedocs.io/en/latest/index.html
  • The diff, extract, and repackage subcommands are most useful for this work. In particular, just unzipping a PPTX leaves all the content in any of the XML files on a single line, which of course is hard to edit. opc-diag automatically reformats that nicely for you.
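If opc-diag is not available, the single-line XML problem can also be worked around with the standard library. The sketch below reads one part out of a package and re-indents it; `pretty_part` is a hypothetical helper, and the tiny in-memory package stands in for a real .pptx file.

```python
import io
import zipfile
from xml.dom import minidom


def pretty_part(pptx_file, part_name: str) -> str:
    """Return one XML part of a .pptx package, re-indented for reading."""
    with zipfile.ZipFile(pptx_file) as zf:
        raw = zf.read(part_name)
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Stand-in package holding a single-line XML part, for illustration only.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("ppt/slides/slide1.xml", b"<a><b/><c>text</c></a>")
demo.seek(0)

pretty = pretty_part(demo, "ppt/slides/slide1.xml")
print(pretty)
```

This is meant for reading and diffing. For the edit->repackage->try cycle, the opc-diag subcommands mentioned above remain the more convenient route.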

You might want to mix these two approaches. The diff approach is good when you have no clue what changes are required. The edit->repackage->try cycle is best when you have a pretty good idea of what changes to try.