biBooks - Automatic Creation of Bilingual eBooks

This repository creates an eBook in which text passages alternate between two languages, for the purpose of language learning. The two texts are matched using vecalign, which in turn relies on LASER, Facebook's language-agnostic sentence representations.

Installation

It is recommended to clone the repository, create a virtual environment and then install all requirements.

git clone https://github.com/pschonev/biBooks.git
cd biBooks
virtualenv -p python3 bibook_env
source bibook_env/bin/activate

And then to install everything run

pip install -e .[full]

Next run python src/download_models.py to get the necessary LASER models.

Finally install Calibre as this will be used to convert our generated HTML files to the desired eBook format.
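
Because the final conversion depends on Calibre's ebook-convert command being reachable on your PATH, it can help to check for it before a long run. The following is a minimal sketch and not part of this repository; the function names find_ebook_convert and build_convert_command are hypothetical.

```python
import shutil
import subprocess


def find_ebook_convert():
    """Return the path to Calibre's ebook-convert, or None if it is not on PATH."""
    return shutil.which("ebook-convert")


def build_convert_command(html_file, ebook_file):
    """Build the argument list for: ebook-convert [HTML-file] [ebook-file]."""
    return ["ebook-convert", str(html_file), str(ebook_file)]


if __name__ == "__main__":
    if find_ebook_convert() is None:
        print("ebook-convert not found; install Calibre and make sure it is on PATH")
    else:
        # Printing the version confirms that the tool actually starts.
        subprocess.run(["ebook-convert", "--version"], check=True)
```

The command list returned by build_convert_command can be passed to subprocess.run once an HTML file exists; the output format is chosen by the extension of the second argument.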

Basic Usage

The two books have to be provided as two files, each containing a list of sentences separated by newlines (see the books folder for examples). Several tools can split a text into sentences (sentence tokenization) and produce this format; I used pySBD.
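
As an illustration of the expected input format, here is a minimal sketch that writes one sentence per line. It is not the code used in this repository: write_sentence_file and the file names in the comment are hypothetical, and the segmenter is passed in as a plain callable so that any sentence tokenizer can be used.

```python
from pathlib import Path


def write_sentence_file(text, out_path, segment):
    """Split `text` with `segment` and write one sentence per line.

    `segment` is any callable that maps a string to a list of sentences.
    """
    sentences = []
    for sentence in segment(text):
        # Collapse internal whitespace (including newlines) so that each
        # sentence occupies exactly one line in the output file.
        flat = " ".join(sentence.split())
        if flat:
            sentences.append(flat)
    Path(out_path).write_text("\n".join(sentences) + "\n", encoding="utf-8")
    return sentences


# With pySBD installed (pip install pysbd), the segmenter could be plugged in as:
#   import pysbd
#   segment = pysbd.Segmenter(language="en", clean=False).segment
#   raw = Path("book_en_raw.txt").read_text(encoding="utf-8")  # hypothetical file
#   write_sentence_file(raw, "book_en.txt", segment)
```

Keeping the segmenter as a parameter means the same helper works for both languages; only the language code given to the tokenizer changes.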

Using these two files, you can simply run python bilingual_books.py and provide arguments via a config file with -c [config_path] (see configs for examples) or via the command line. The result will be a finished eBook file.

The steps to go through are described below. Note that steps 1-3 have examples in the notebooks folder and steps 4-9 are done automatically when running bilingual_books.py.

Steps

1. Get the text data in two files, e.g. by web scraping (see examples in notebooks) or by converting an eBook to txt using Calibre: ebook-convert [ebook_file] [output.txt]
2. Clean the text data
3. Run sentence tokenization, e.g. using pySBD
4. Possible overlaps of n (e.g. 10) sentences are created with vecalign/overlap.py
5. These overlapped sentences are then embedded using LASER, making them comparable independent of their language
6. All 6 files (original sentences, overlaps and embeddings for each language) are then fed to the main vecalign algorithm to determine matching text passages
7. The resulting alignment file indicates which lines of the sentence-tokenized original texts match each other. It can now be used to create a combined tab-separated (.tsv) file of matching text passages
8. This .tsv file is then converted into HTML format and can be accompanied by a .css file for styling
9. With Calibre installed, run ebook-convert [HTML-file] [ebook-file] to get an eBook file in the format of your choice (epub, mobi, etc.)
10. Finally, you may want to open the eBook in Calibre to fix any issues or add extras such as a cover picture
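
To illustrate the overlap step above: the overlaps are concatenations of up to n consecutive sentences, which lets the aligner match one sentence in one language against several in the other. The sketch below shows only this idea. It is not the code in vecalign/overlap.py, which may differ in details such as text normalization, and the function name sentence_overlaps is hypothetical.

```python
def sentence_overlaps(sentences, max_size):
    """Return the unique concatenations of 1..max_size consecutive sentences."""
    overlaps = set()
    for size in range(1, max_size + 1):
        # Slide a window of `size` sentences over the text.
        for start in range(len(sentences) - size + 1):
            overlaps.add(" ".join(sentences[start:start + size]))
    return sorted(overlaps)


example = ["A b.", "C d.", "E f."]
for line in sentence_overlaps(example, 2):
    print(line)
```

Each of these strings is later embedded once, so the aligner can look up the vector for any candidate group of consecutive sentences.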

Optional 6a: For Russian, use UDAR (https://github.com/reynoldsnlp/udar) to add stress marks to the sentence-tokenized source file, and use the stressed file in place of the unstressed one
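
The alignment-to-TSV and TSV-to-HTML steps above can be sketched as follows. This is an illustration under stated assumptions rather than the script used in this repository: it assumes the alignment file has one line per match in the form [source indices]:[target indices]:cost (for example [1, 2]:[1]:0.3), and all function names and the CSS class names are hypothetical.

```python
import ast
import html


def read_alignments(lines):
    """Parse lines such as '[0, 1]:[0]:0.25' into (src_ids, tgt_ids) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        src, tgt = line.split(":")[:2]  # the third field (cost) is ignored here
        pairs.append((ast.literal_eval(src), ast.literal_eval(tgt)))
    return pairs


def aligned_rows(src_sents, tgt_sents, pairs):
    """Join the sentences of each aligned group into one (source, target) row."""
    rows = []
    for src_ids, tgt_ids in pairs:
        src_text = " ".join(src_sents[i] for i in src_ids)
        tgt_text = " ".join(tgt_sents[i] for i in tgt_ids)
        rows.append((src_text, tgt_text))
    return rows


def rows_to_tsv(rows):
    """One aligned passage pair per line, separated by a tab."""
    return "\n".join(f"{src}\t{tgt}" for src, tgt in rows) + "\n"


def rows_to_html(rows):
    """Emit alternating paragraphs; the class names are hypothetical CSS hooks."""
    parts = ['<html><head><meta charset="utf-8"></head><body>']
    for src, tgt in rows:
        parts.append(f'<p class="source">{html.escape(src)}</p>')
        parts.append(f'<p class="target">{html.escape(tgt)}</p>')
    parts.append("</body></html>")
    return "\n".join(parts)
```

An empty index list on one side (a sentence with no counterpart) simply yields an empty cell or paragraph for that language. The resulting HTML is what ebook-convert would then turn into the final eBook.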

Background Info and Credits

While researching how to handle this problem, I came across a forum post from 2016 explaining how to do the same thing with hunalign instead of vecalign. When I tried hunalign, however, the results were very poor and the dictionary creation seemed tedious. The forum post was still helpful for the overall procedure, and it linked to a helpful blog post where I found the HTML conversion script. So credit goes to the users slex and doviende, the latter of whom also let me use his script.

To-Do

better handling of paragraphs (e.g. keep paragraphs together up to a certain length or ensure a newline after a paragraph is over in the final document)
dynamic layout inspired by Doppeltext
automatically process files in eBook format (converting and cleaning newlines)
create a MOBI dictionary from Wiktionary (https://github.com/nyg/wiktionary-to-kindle)

About

This allows automatic creation of bilingual e-books with two translations of a text as input using an alignment of language agnostic sentence vectors.

MIT License
