st-vincent1 / opensubtitles_parser

opensubtitles_parser

Code to download and parse OpenSubtitles, specifically for MTCue (ACL2023).

Installation

A few Python packages are required to run the parser. Install them, preferably inside a conda environment:

pip install pycld3 mosestokenizer tqdm
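As a quick sanity check after installation, the sketch below reports which of the required modules cannot be found in the active environment. The import names are assumptions (in particular, pycld3 is assumed to be imported as `cld3`), and `missing_modules` is a hypothetical helper that is not part of this repository.

```python
import importlib.util

# Assumed import names: the pycld3 package is imported as `cld3`.
REQUIRED_MODULES = ["cld3", "mosestokenizer", "tqdm"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, re-running the pip command above inside the same environment is the first thing to try.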

To download the OpenSubtitles XML files, run:

bash src/download_os_xml.sh

By default, this downloads the files needed for four language pairs: English to and from Polish, German, French and Russian. Comment out any languages you do not need in the script.

To extract context files, you must obtain an API key from OMDb by subscribing to its Patreon (the Basic tier is the minimum). It costs only $1 and grants access to the API.
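To confirm that a key works before running the extraction, a request to OMDb can be sketched as follows. This assumes a lookup by IMDb ID through the API's `i` parameter; how `src/extract_bitext.py` itself queries OMDb is not shown here, and `build_omdb_url` and `fetch_metadata` are hypothetical helpers.

```python
import json
import urllib.parse
import urllib.request

OMDB_ENDPOINT = "https://www.omdbapi.com/"


def build_omdb_url(api_key, imdb_id):
    """Build an OMDb query URL for one title, identified by its IMDb ID."""
    query = urllib.parse.urlencode({"apikey": api_key, "i": imdb_id})
    return f"{OMDB_ENDPOINT}?{query}"


def fetch_metadata(api_key, imdb_id):
    """Fetch the metadata for one title and return it as a dict (needs network access)."""
    url = build_omdb_url(api_key, imdb_id)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

Calling `fetch_metadata("<your key>", "tt0133093")` should return a JSON object describing the title; a rejected key is the most likely cause if the request fails.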

Once the files are downloaded, run the following, choosing one value for each bracketed option:

python src/extract_bitext.py --language [de/fr/pl/ru] --split_set [train/dev/test] --apikey [OMDb API Key]

The relevant files will be saved under data/en-[de/fr/pl/ru]. Context files will be saved under data/en-[de/fr/pl/ru]/context.
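To process every language and split in one go, the command above can be wrapped in a small driver. This is a sketch under the assumption that the script is invoked from the repository root and that each language and split is run separately; `build_command`, `output_dirs` and `run_all` are hypothetical names, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

LANGUAGES = ["de", "fr", "pl", "ru"]
SPLITS = ["train", "dev", "test"]


def build_command(language, split_set, api_key):
    """Assemble the extract_bitext.py invocation for one language and split."""
    return [
        sys.executable, "src/extract_bitext.py",
        "--language", language,
        "--split_set", split_set,
        "--apikey", api_key,
    ]


def output_dirs(language):
    """Return the bitext directory and the context directory for a language."""
    base = Path("data") / f"en-{language}"
    return base, base / "context"


def run_all(api_key):
    """Run the extraction for every language and split, stopping on the first failure."""
    for language in LANGUAGES:
        for split_set in SPLITS:
            subprocess.run(build_command(language, split_set, api_key), check=True)
        bitext_dir, context_dir = output_dirs(language)
        print(f"{language}: bitext in {bitext_dir}, context in {context_dir}")
```

Calling `run_all("<your OMDb API key>")` runs all twelve combinations in sequence; trim `LANGUAGES` to match the languages left enabled in the download script.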

About

License: MIT License


Languages

Perl 50.3%, Ruby 33.4%, Python 15.5%, Shell 0.8%