scanny / python-pptx

Create Open XML PowerPoint documents in Python

Changing Text without editing the formatting

powahftw opened this issue

To edit some text, I currently use paragraphs and runs and re-apply all the styles I'm interested in. It would be useful to have a way to change just the text of an existing paragraph, leaving the formatting untouched.

@powahftw Character formatting (font characteristics) is specified at the Run level. A Paragraph object contains one or more (usually more) runs. When you assign to Paragraph.text, all the runs in the paragraph are replaced with a single new run. This is why the text formatting disappears: the runs that contained that formatting are gone.
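To see this in a particular file, it can help to list the runs of a paragraph before and after assigning to Paragraph.text. The following is a minimal sketch; `describe_runs` is a hypothetical helper name, not part of python-pptx, and it only reads `paragraph.runs`, `run.text`, and a few `run.font` properties:

```python
def describe_runs(paragraph):
    """Return one (text, bold, italic, size) tuple per run in `paragraph`.

    `paragraph` is assumed to be a python-pptx paragraph object. A font
    value of None means the property is not set on the run itself and is
    inherited from the paragraph, placeholder, or theme.
    """
    return [
        (run.text, run.font.bold, run.font.italic, run.font.size)
        for run in paragraph.runs
    ]
```

For a paragraph with mixed inline formatting this should return several tuples. After something like `paragraph.text = 'foobar'`, it should return a single tuple whose font values are all None.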

Although it would not work for all cases one might want, a useful behavior would be to replace the text in a paragraph, retaining the formatting present in the first run. This could be accomplished like this:

def replace_paragraph_text_retaining_initial_formatting(paragraph, new_text):
    p = paragraph._p  # the lxml element containing the `<a:p>` paragraph element
    # remove all but the first run
    for idx, run in enumerate(paragraph.runs):
        if idx == 0:
            continue
        p.remove(run._r)
    paragraph.runs[0].text = new_text

paragraph = textframe.paragraphs[0]  # or wherever you get the paragraph from
new_text = 'foobar'
replace_paragraph_text_retaining_initial_formatting(paragraph, new_text)

I haven't tested this, so please report back any mistakes if you try it out, but I think it gives the gist.

This would be roughly how such a feature would be implemented.

@scanny Just tried your suggested function and it works perfectly - thanks!

wow... this is perfect. I had written a loop of try/except that stored the attributes of the first run and then re-applied them after changing the text. Feel like a barbarian.

Hi guys,

Thanks @scanny for the snippet! It was really helpful.
I did encounter something strange though. For some unknown reason, some cells in my file were split into a number of runs despite looking like a single line of text.

So, I had to modify your snippet into this

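...
# join without a separator so no extra spaces appear at run boundaries
whole_text = "".join(r.text for r in paragraph.runs)
# replacement_string is used as a regular-expression pattern here
whole_text = re.sub(replacement_string, new_text, whole_text)
p = paragraph._p
# remove all but the first run
for run in paragraph.runs[1:]:
    p.remove(run._r)
paragraph.runs[0].text = whole_text
...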

Hope this helps... or well, I just wanted to share the solution to the whole morning of frustration...

This really helps us.... thank you very much @scanny

Hi @scanny ,

Thanks for the snippet! Question though - is changing the text on the paragraph really supposed to clear the formatting or is that a bug?

Thanks

@franz-see Paragraph.text is a convenience property. There is no general-case way to change the text while preserving the formatting. For example, if you wanted to replace:

The quick, brown fox.

with:

The lazy yellow dog.

How would you do that with something like Paragraph.text? So assigning text to Paragraph.text replaces all the runs in the paragraph with a single run containing the assigned text with no special formatting.

Character formatting provided by the paragraph-style is preserved, and generally this produces the best possible result. If you need to apply "inline" character formatting, then you need to do it yourself, run-by-run.

One way to do this is to assign "" to paragraph.text to "clear" the existing text and then add runs to the paragraph to suit.
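A minimal sketch of that clear-and-rebuild approach, assuming `paragraph` is a python-pptx paragraph. The helper name `rebuild_paragraph` and the `(text, bold, italic)` segment layout are made up for illustration:

```python
def rebuild_paragraph(paragraph, segments):
    """Replace the paragraph's content with one run per segment.

    `segments` is a list of (text, bold, italic) tuples. Use None for
    bold or italic to leave that property inherited.
    """
    # Assigning "" clears the existing text, as described above.
    paragraph.text = ""
    for text, bold, italic in segments:
        run = paragraph.add_run()
        run.text = text
        run.font.bold = bold
        run.font.italic = italic
    return paragraph
```

For the example above, a call might look like `rebuild_paragraph(paragraph, [("The ", None, None), ("lazy", True, None), (" yellow dog.", None, None)])`. Which words carry which formatting is an assumption here, since only the caller knows that.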

That's really useful, thanks @scanny.

Is there a way to keep the formatting when replacing text in tables (cells), too?
To be honest, I don't fully understand what your above code does, so I'd appreciate any help.

def replace_text(self, replacements: dict, shapes: list):
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_table:
                for row in shape.table.rows:
                    for cell in row.cells:
                        if match in cell.text:
                            new_text = cell.text.replace(str(match), str(replacement))
                            cell.text = new_text

from: https://stackoverflow.com/questions/37924808/python-pptx-power-point-find-and-replace-text-ctrl-h

I think I solved it. It was just a matter of finding how to access the run level for cells. I'll leave the solution here in case someone encounters a similar problem.

Thank you for this great module!

    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_table:
                for row in shape.table.rows:
                    for cell in row.cells:
                        if match in cell.text:                                                        
                            for paragraph in cell.text_frame.paragraphs:
                               for run in paragraph.runs:
                                   p = paragraph._p  # the lxml element containing the `<a:p>` paragraph element
                                   # remove all but the first run
                                   for idx, run in enumerate(paragraph.runs):
                                       if idx == 0:
                                           continue
                                       p.remove(run._r)
                                   cur_text = run.text
                                   new_text = cur_text.replace(str(match), str(replacement))
                                   run.text = new_text

Same code, slightly refactored for length and indent level. I think there was a bug in there too: you deleted runs before capturing the text they contain.

def iter_table_cells(shapes):
    for shape in shapes:
        if not shape.has_table:
            continue
        for row in shape.table.rows:
            for cell in row.cells:
                yield cell


for cell in iter_table_cells(shapes):
    for match, replacement in replacements.items():
        for paragraph in cell.text_frame.paragraphs:
            if match not in paragraph.text:
                continue
            orig_text = paragraph.text
            # --- the lxml element containing the `<a:p>` paragraph element ---
            p = paragraph._p
            # --- remove all but the first run ---
            for run in paragraph.runs[1:]:
                p.remove(run._r)
            run = paragraph.runs[0]
            run.text = orig_text.replace(str(match), str(replacement))

You are very kind. Thanks again.

I am trying to highlight a specific word in red color in a pptx file using the below function.

def highlight_word_in_text(paragraph,highlight_word):
    p = paragraph._p
    paratext = p.text
    if highlight_word in paratext:
        for idx, run in enumerate(paragraph.runs):
            if idx == 0:
                continue
            p.remove(run._r)
        paragraph.runs[0].text = paratext[0:paratext.index(highlight_word)]
        run = paragraph.add_run()
        run.text = paratext[paratext.index(highlight_word):paratext.index(highlight_word)+len(highlight_word)]
        run.font.color.rgb = RGBColor(255, 0, 0)
        run = paragraph.add_run()
        run.text = paratext[paratext.index(highlight_word)+len(highlight_word):]

While it works, it loses the formatting and also adds some unusual characters like '_x000B' to some words where it finds the highlighted word. Could you please tell me what I am missing?

Full code below:

from pptx import Presentation
from pptx.dml.color import RGBColor

def highlight_word_in_text(paragraph,highlight_word):
    p = paragraph._p
    paratext = p.text
    if highlight_word in paratext:
        for idx, run in enumerate(paragraph.runs):
            if idx == 0:
                continue
            p.remove(run._r)
        paragraph.runs[0].text = paratext[0:paratext.index(highlight_word)]
        run = paragraph.add_run()
        run.text = paratext[paratext.index(highlight_word):paratext.index(highlight_word)+len(highlight_word)]
        run.font.color.rgb = RGBColor(255, 0, 0)
        run = paragraph.add_run()
        run.text = paratext[paratext.index(highlight_word)+len(highlight_word):]
        
prs2 = Presentation('test2.pptx')
for slide in prs2.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            highlight_word_in_text(paragraph,'business')
            
prs2.save('test3.pptx')
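
On the '_x000B' characters, one possible explanation, going by the python-pptx documentation: a paragraph's text reports each line break (`<a:br>`) as a vertical-tab character ("\v", 0x0B), while assigning to `run.text` escapes control characters other than tab and newline into the `_xHHHH_` form. Copying paragraph text that contains a line break into a run would therefore produce `_x000B_`. The sketch below is one way around that, assuming `paragraph` is a python-pptx paragraph and `rgb` is something like `RGBColor(255, 0, 0)`. Both function names are hypothetical. It rebuilds the paragraph with real line breaks, but it does not carry over run-level formatting, which would still have to be copied from the original runs.

```python
def tokenize_for_highlight(text, word):
    """Split text into ("text" | "highlight" | "break", value) tokens.

    Each "\v" in `text` becomes a "break" token, so it is never written
    into a run. A word that spans a line break is not matched.
    """
    if not word:
        raise ValueError("word must be a non-empty string")
    tokens = []
    for line_no, line in enumerate(text.split("\v")):
        if line_no > 0:
            tokens.append(("break", ""))
        before, match, after = line.partition(word)
        while match:
            if before:
                tokens.append(("text", before))
            tokens.append(("highlight", match))
            before, match, after = after.partition(word)
        if before:
            tokens.append(("text", before))
    return tokens


def rewrite_with_highlight(paragraph, word, rgb):
    """Rebuild `paragraph` so every occurrence of `word` is colored `rgb`.

    Returns True if the paragraph was rewritten, False if `word` was absent.
    Existing runs (and their inline formatting) are discarded.
    """
    text = paragraph.text
    if word not in text:
        return False
    tokens = tokenize_for_highlight(text, word)
    paragraph.clear()  # removes runs and line breaks, keeps paragraph properties
    for kind, value in tokens:
        if kind == "break":
            paragraph.add_line_break()
            continue
        run = paragraph.add_run()
        run.text = value
        if kind == "highlight":
            run.font.color.rgb = rgb
    return True
```

In the loop above, the call would become `rewrite_with_highlight(paragraph, 'business', RGBColor(255, 0, 0))`.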