lemire / unicode_lipsum

unicode_lipsum

Test files encoded in UTF-8, UTF-16LE, and UTF-32LE.

By convention, all UTF-8 files end with .utf8.txt, all UTF-16LE files end with .utf16.txt, and all UTF-32LE files end with .utf32.txt.
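The naming convention makes it possible to pick a decoder from the file name alone. The following is a minimal Python sketch of that idea, not something shipped with the repository: the helper names `codec_for` and `read_text` are hypothetical, and the codec names are those of Python's standard library.

```python
from pathlib import Path

# Suffix-to-codec table following the naming convention described above.
# The codec names are Python standard-library codec names.
SUFFIX_TO_CODEC = {
    ".utf8.txt": "utf-8",
    ".utf16.txt": "utf-16-le",
    ".utf32.txt": "utf-32-le",
}


def codec_for(path):
    """Return the codec implied by the file name, or None if no suffix matches."""
    name = Path(path).name
    for suffix, codec in SUFFIX_TO_CODEC.items():
        if name.endswith(suffix):
            return codec
    return None


def read_text(path):
    """Read a test file as a Python string, choosing the decoder from its suffix."""
    codec = codec_for(path)
    if codec is None:
        raise ValueError(f"no known encoding suffix in {path!r}")
    # Read raw bytes and decode explicitly, so no newline translation is applied.
    return Path(path).read_bytes().decode(codec)
```

Note that the explicit little-endian codecs do not consume a byte-order mark: if a file happened to start with one, it would appear as a leading U+FEFF character in the decoded string.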

A small number of files are encoded using Latin 1 (ISO-8859-1): esperanto.latin1.txt, french.latin1.txt, german.latin1.txt, and portuguese.latin1.txt in the wikipedia_mars directory. They are not exactly equivalent to the corresponding Unicode files: the original Unicode files cannot be reproduced from the Latin 1 files. However, we provide modified Unicode files recovered from the Latin 1 files, with the suffixes .utflatin8.txt (UTF-8 recovered from Latin 1), .utflatin16.txt (UTF-16LE recovered from Latin 1), and .utflatin32.txt (UTF-32LE recovered from Latin 1).
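To illustrate why the Latin 1 files are not equivalent to the original Unicode files, here is a hedged Python sketch. It assumes, without confirmation from the repository, that "recovered from Latin 1" means decoding the Latin 1 bytes and re-encoding the result; the function names `latin1_to_unicode` and `fits_in_latin1` are hypothetical.

```python
def latin1_to_unicode(latin1_bytes):
    """Decode Latin 1 bytes and re-encode them as UTF-8, UTF-16LE and UTF-32LE.

    This direction never fails: every byte value 0x00..0xFF maps to the
    code point with the same number (U+0000..U+00FF).
    """
    text = latin1_bytes.decode("latin-1")
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),
        "utf-32-le": text.encode("utf-32-le"),
    }


def fits_in_latin1(text):
    """Return True if every character of text can be represented in Latin 1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        # At least one code point is above U+00FF.
        return False


# "café" in Latin 1 is four bytes; é becomes two bytes in UTF-8.
recovered = latin1_to_unicode(b"caf\xe9")
print(recovered["utf-8"])      # b'caf\xc3\xa9'
print(fits_in_latin1("café"))  # True
print(fits_in_latin1("ĉ"))     # False: U+0109 is outside Latin 1
```

Any character above U+00FF, such as the Esperanto letter ĉ, has no Latin 1 representation, so a text containing it cannot survive a trip through Latin 1 unchanged.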

The wikipedia_mars files are derived from the Wikipedia article on Mars in different languages. Wikipedia is licensed under a Creative Commons license. The html2text Python program is used to convert the articles to plain text by stripping the HTML markup.

The lipsum files come from the package https://github.com/rusticstuff/simdutf8 by Hans Kratz (licensed under both MIT and Apache).

These files are provided for research purposes.
