This parser extracts parallel subtitles for any language pair (listed below) from the OpenSubtitles18 corpus. The dataset is a scrape of TV and movie subtitles available at http://www.opensubtitles.org/.
Guidelines for selecting subtitles are aligned with Voita et al. (2018) and consist of the following:
- Each sentence pair is coupled with the sentence that provides context for it, i.e. the immediately preceding sentence. Voita et al. (2018) use only source-side context (English, in their case), but for completeness the present script extracts both source- and target-side context.
- The dataset is cleaned of uncertain alignments and subtitle-sentence breaks. Subtitles are extracted according to the alignment file (usually called `align-$src-$tgt.xml`), so that there is always a sentence-to-sentence mapping rather than a many-to-many one. The alignment file provides an overlap statistic for each link, and alignments with overlap below 0.9 are not extracted.
- For context, pairs of consecutive sentences separated by a break of more than 7 seconds are not considered.
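The two filters above (overlap ≥ 0.9 with one-to-one links, and a maximum 7-second gap between a sentence and its context) can be sketched as follows. This is a minimal illustration, assuming OPUS-style alignment links carrying `xtargets` and `overlap` attributes and subtitle start times in seconds; it is not the actual code of `extract_subtitles.py`:

```python
import xml.etree.ElementTree as ET

MIN_OVERLAP = 0.9      # alignments below this overlap are discarded
MAX_GAP_SECONDS = 7.0  # context pairs with a longer break are dropped


def one_to_one_links(align_xml):
    """Yield (src_id, tgt_id) for confident sentence-to-sentence links."""
    root = ET.fromstring(align_xml)
    for link in root.iter("link"):
        # xtargets looks like "3;4" or "2 3;2" (many-to-one)
        src_ids, tgt_ids = (part.split() for part in link.get("xtargets").split(";"))
        overlap = float(link.get("overlap", 0.0))
        if len(src_ids) == 1 and len(tgt_ids) == 1 and overlap >= MIN_OVERLAP:
            yield src_ids[0], tgt_ids[0]


def context_pairs(sentences):
    """sentences: ordered list of (start_time_seconds, text).
    Yield (context, current) pairs whose time gap is small enough."""
    for (prev_t, prev_s), (cur_t, cur_s) in zip(sentences, sentences[1:]):
        if cur_t - prev_t <= MAX_GAP_SECONDS:
            yield prev_s, cur_s
```

For example, a link with `xtargets="2 3;2"` is a two-to-one mapping and is skipped regardless of its overlap score.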
- Clone the repository.
- Navigate into the repository.
- Run `./run.sh en fr`. The order of the languages doesn't matter.
- The script runs `download.sh`, which downloads the subtitles from the website; `extract_subtitles.py`, which extracts the subtitles from the XML files, then aligns and filters them; and `prepare_dataset.py`, which compiles them into a usable train/dev/test split.
- The procedure is inherently bidirectional: at train time you may specify either language as source or target and adjust the context files accordingly (e.g. if you want to use only source-side context).
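The final stage, compiling the filtered pairs into a train/dev/test split, might look like the following sketch. The split fractions, shuffling, and function name are illustrative assumptions, not the actual logic of `prepare_dataset.py`:

```python
import random


def split_dataset(pairs, dev_frac=0.01, test_frac=0.01, seed=42):
    """Shuffle aligned (src, tgt, src_context, tgt_context) tuples and
    split them into train/dev/test portions (fractions are assumptions)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed for a reproducible split
    n_dev = int(len(pairs) * dev_frac)
    n_test = int(len(pairs) * test_frac)
    dev = pairs[:n_dev]
    test = pairs[n_dev:n_dev + n_test]
    train = pairs[n_dev + n_test:]
    return train, dev, test
```

Shuffling before splitting matters here because consecutive subtitle lines come from the same film; without it, dev and test would not be representative samples.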
- Once the script is done, the files are saved in the following hierarchy:

```
OpenSubtitles
|-xml
| |-en
| |-fr
|-en-fr
| |-en-fr.xml
| |-parsed
| | |-raw sentences...
| |-cxt_dataset
| | |-train, dev, test files...
```
The scripts should work for any language pair available for download on the OpenSubtitles18 corpus page.
Lison, P. and Tiedemann, J. (2016) 'OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles', in Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).
Voita, E. et al. (2018) 'Context-aware neural machine translation learns anaphora resolution', in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 1264-1274.