Hussam1 / translate-pptx

Using Selenium and DeepL to translate powerpoint files with python-pptx

How to translate ppt files?

Read my Medium article to discover how the library was built!

Purpose

Free online translators of PowerPoint files have 2 main issues:

  • The translation APIs are often robust neither to short half-sentences (very common in PowerPoint files) nor to long texts
  • The structure of PowerPoint presentations is very complex (lots of unordered shapes), and after modification a nicely laid-out presentation often ends up with misplaced shapes

This project aims to solve these problems and to automate the translation of *.pptx files, keeping the same rendering as the original and producing well-translated sentences and expressions.

This repo contains materials to:

  • Translate texts using Selenium on the DeepL translation website.
  • Extract and modify PowerPoint texts from different objects with the powerful python-pptx library

4 scripts are available in the src folder:

  • default_selenium.py: the defaultSelenium class contains the basics needed to connect to the Selenium API and launch a website
  • deepL_selenium.py: seleniumDeepL inherits from the previous class and contains all the interactions specific to the DeepL website
  • ppt_interaction.py: contains functions to inspect a presentation: from the presentation, to slides, to shapes, to their text_frame properties.
  • ppt_translation.py: uses both the functions from ppt_interaction.py and seleniumDeepL to accomplish the final task: translating files.
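The presentation → slides → shapes → text_frame walk described for ppt_interaction.py can be sketched as below. This is a minimal illustration, not the repo's actual code: the function names iter_text_runs and collect_corpus are hypothetical, and only attributes that python-pptx exposes (slides, shapes, has_text_frame, text_frame.paragraphs, runs, text) are used. Grouped shapes and tables are not covered here.

```python
def iter_text_runs(presentation):
    """Yield (slide_index, shape_name, run) for every text run in the deck.

    `presentation` is expected to behave like the object returned by
    pptx.Presentation("deck.pptx") in python-pptx.
    """
    for slide_index, slide in enumerate(presentation.slides):
        for shape in slide.shapes:
            # Pictures, connectors, etc. have no text frame: skip them.
            if not getattr(shape, "has_text_frame", False):
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    yield slide_index, shape.name, run


def collect_corpus(presentation):
    """Return the unique, non-blank run texts in order of appearance."""
    corpus = []
    for _, _, run in iter_text_runs(presentation):
        if run.text.strip() and run.text not in corpus:
            corpus.append(run.text)
    return corpus

# Typical use, assuming python-pptx is installed:
#   from pptx import Presentation
#   corpus = collect_corpus(Presentation("deck.pptx"))
```

The resulting list has the shape the translator expects as a corpus: a list of strings with duplicates removed.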

Running the translator

The translation object uses a corpus concept. Text must be given as a list of strings (each string is one sentence; the maximum number of characters in a single sentence is 4900 due to the limits of DeepL's web page). A translation example is provided.
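To make the corpus constraint concrete, here is a small sketch that checks a corpus against the 4900-character limit before handing it to the translator. The helper name check_corpus is hypothetical and is not part of the repo.

```python
MAX_CHARS = 4900  # per-sentence limit stated above for DeepL's web page


def check_corpus(corpus, max_chars=MAX_CHARS):
    """Return (ok, too_long): sentences within the limit and those above it."""
    if isinstance(corpus, str):
        corpus = [corpus]  # a single string is treated as a one-sentence corpus
    ok, too_long = [], []
    for sentence in corpus:
        if len(sentence) <= max_chars:
            ok.append(sentence)
        else:
            too_long.append(sentence)
    return ok, too_long
```

Sentences returned in too_long would need to be shortened or split by hand before translation.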

There are 5 steps to run the translation on a corpus.

  1. Clone the repo

git clone https://github.com/ThibaudLamothe/translate-pptx.git

  2. Download the Selenium chromedriver and place it at the project's root. Google Chrome needs to be installed as well.

  3. Go to the src folder

cd src/

  4. Install the necessary libraries

pip install -r requirements.txt

  5. Run the deepL_selenium.py file

python deepL_selenium.py

The output looks like this:

[image: drawing]

Translator's features

Initiating the translator launches the Selenium driver, which needs a driver executable to run correctly. Its location has to be specified with the driver_path argument. The loglevel can also be set (error/warning/information/debug) depending on how much information should be tracked. See the previous picture.

deepL = seleniumDeepL(driver_path='../chromedriver', loglevel='debug')

Running that command opens an empty browser window. We can now start the translation process.

Functions available

The seleniumDeepL class contains multiple methods, but only 4 are useful for the translation process. The others are internal processing steps.

deepL.run_translation( see next part for parameters )

This is the main function. It takes the corpus, transforms it to better suit DeepL's website, performs the translation and stores the results in a dictionary.

deepL.get_translated_corpus()

It returns the dictionary of translated sentences. Keys are the original sentences or groups of words; values are their translations.

deepL.save_translations(json_path as str)

The translations can be stored as a JSON file using this function. It takes a single argument: the path to the JSON file, as a string.

deepL.load_translations(json_path as str)

During the translation process, a sentence that has already been translated is not translated a second time. Translations from a previous run can be reloaded with this function. It takes the path to a JSON file as a string.
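The save/load behaviour described above amounts to a JSON-backed cache of {original: translation} pairs. The sketch below illustrates that idea with standalone functions. The names load_cache, save_cache and pending_sentences are hypothetical, and the real methods of seleniumDeepL may be implemented differently.

```python
import json
import os


def load_cache(json_path):
    """Return the stored {original: translation} dict, or {} if no file exists."""
    if not os.path.exists(json_path):
        return {}
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def save_cache(translations, json_path):
    """Write the {original: translation} dict as UTF-8 JSON."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(translations, f, ensure_ascii=False, indent=2)


def pending_sentences(corpus, translations):
    """Keep only the sentences that have no stored translation yet."""
    return [sentence for sentence in corpus if sentence not in translations]
```

Only the sentences returned by pending_sentences would have to be sent to the website on a second run.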

Running the translation

So far we've seen the 4 useful functions of seleniumDeepL. deepL.run_translation() is the most important one. We'll now see how to use and parameterize it correctly.

  • corpus (as str or list, default : 'Hello, World!')

The corpus is the text to be translated. It can be a string or a list of strings. Since translating a single sentence does not necessarily need automation, the list option is the more interesting one.

  • destination_language (as str, default : 'en')

The supported language codes are listed in the class:

self.available_languages = ['fr', 'en', 'de', 'es', 'pt', 'it', 'nl', 'pl', 'ru', 'ja', 'zh']

  • joiner (as str, default : '\n____\n')

  • quit_web (as boolean, default : True)

  • time_to_translate (as integer, default : 10)

  • time_batch_rest (as integer, default : 2)

  • raise_error (as boolean, default : False)

  • load_at (as string, default : None)

  • store_at (as string, default : None)

  • load_and_store_at (as string, default : None)
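The joiner parameter is not documented above. One plausible reading, and it is only an assumption, is that several sentences are glued together with the joiner so that they can be pasted into the web page as one batch, then split apart again after translation. The sketch below shows that idea. The names make_batches and split_batch are hypothetical, and the approach only works if the translator leaves the joiner string untouched.

```python
JOINER = "\n____\n"  # default joiner listed above
MAX_CHARS = 4900     # page limit mentioned earlier in this README


def make_batches(corpus, joiner=JOINER, max_chars=MAX_CHARS):
    """Group sentences into joined strings that each stay within max_chars.

    A single sentence longer than max_chars still gets its own batch.
    """
    batches, current = [], []
    for sentence in corpus:
        candidate = joiner.join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current batch.
            batches.append(joiner.join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        batches.append(joiner.join(current))
    return batches


def split_batch(translated_batch, joiner=JOINER):
    """Recover the individual sentences from a translated batch."""
    return translated_batch.split(joiner)
```

Batching in this way reduces the number of page interactions, which is presumably why waiting times such as time_to_translate and time_batch_rest are configurable.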

PPT Insertion

  • Replacing text without modifying its look
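In python-pptx, character formatting (font, size, color) is attached to runs, so one way to replace text without changing its look is to assign to run.text rather than to text_frame.text, which replaces the frame's existing paragraphs. The sketch below follows that approach. The function name apply_translations is hypothetical and this is not necessarily how ppt_translation.py does it.

```python
def apply_translations(presentation, translations):
    """Replace run texts in place and return the number of runs changed.

    `presentation` is expected to behave like a python-pptx Presentation;
    `translations` maps original run text to translated text.
    """
    changed = 0
    for slide in presentation.slides:
        for shape in slide.shapes:
            if not getattr(shape, "has_text_frame", False):
                continue  # pictures and similar shapes carry no text
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    new_text = translations.get(run.text)
                    if new_text is not None:
                        # Only the text changes; the run keeps its font settings.
                        run.text = new_text
                        changed += 1
    return changed

# Typical use, assuming python-pptx is installed:
#   from pptx import Presentation
#   prs = Presentation("deck.pptx")
#   apply_translations(prs, {"Bonjour": "Hello"})
#   prs.save("deck_en.pptx")
```

A limitation of this sketch: a sentence that PowerPoint has split across several runs will not match a single dictionary key.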

Good to know

NB: the project was developed on macOS, with Selenium driving Google Chrome.

Resources

TODO

  • Deal with bigger texts. Idea: split long sentences on \n's and reassemble them after translation. Do it at the reception and delivery of the corpus, so that no modifications are made in the batch_corpus creation?
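The idea in this TODO could look roughly like the sketch below: over-long entries are split on newlines when the corpus is received, and the translated pieces are stitched back together when it is delivered. Nothing here exists in the repo yet; the names explode and reassemble are hypothetical.

```python
def explode(corpus, max_chars=4900):
    """Split over-long entries on newlines and return (pieces, counts).

    counts[i] is the number of pieces that entry i of the corpus produced.
    """
    pieces, counts = [], []
    for text in corpus:
        parts = text.split("\n") if len(text) > max_chars else [text]
        pieces.extend(parts)
        counts.append(len(parts))
    return pieces, counts


def reassemble(translated_pieces, counts):
    """Inverse of explode: rejoin translated pieces into one text per entry."""
    result, start = [], 0
    for n in counts:
        result.append("\n".join(translated_pieces[start:start + n]))
        start += n
    return result
```

A piece that is still too long after splitting on newlines would need a further strategy, which this sketch does not address.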
