joelthe1 / wildebeest

Text normalization and cleaning; analysis of types of characters used, encoding issues


wildebeest

normalize.py

Script repairs common encoding errors, normalizes characters into their canonical form, maps digits and some punctuation to ASCII, deletes many non-printable characters, and performs other repair, normalization, and cleaning steps. A few steps are specific to Pashto, Farsi, or Devanagari (Hindi etc.). The script applies the normalization modules listed below; the --skip argument lets users specify any normalization modules they want to skip.

Usage

CLI to normalize a file: python -m wildebeest or its alias wb-norm
python -m wildebeest  [-h] [-i INPUT-FILENAME] [-o OUTPUT-FILENAME] [--lc LANGUAGE-CODE] [--skip NORM-STEPS] [-v] [--version]
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  --skip NORM-STEPS     comma-separated list of normalization/cleaning steps to be skipped:
                        repair-encodings-errors,del-surrogate,del-ctrl-char,del-arabic-diacr,del-hebrew-diacr,
                        core-compat,pres-form,ligatures,signs-and-symbols,cjk,width,font,small,vertical,
                        enclosure,hangul,repair-combining,combining-compose,combining-decompose,punct,
                        punct-dash,punct-arabic,punct-cjk,punct-greek,punct-misc-f,space,digit,arabic-char,
                        farsi-char,pashto-char,georgian-char,look-alike,repair-xml,repair-url-escapes,
                        repair-token (default: nothing skipped)
  -v, --verbose         write change log etc. to STDERR
  --version             show program's version number and exit

Example:

python -m wildebeest -i corpus-raw.txt -o corpus-wb.txt --lc eng --skip punct-dash,enclosure,del-arabic-diacr

Note: Please make sure that your $PYTHONPATH includes the directory in which this README file resides.

Note: For robustness regarding input files that do not fully conform to UTF8, please use -i (rather than STDIN), as it includes UTF8-encoding error handling.

norm_clean_string (Python function call to normalize a string)
from wildebeest.normalize import Wildebeest
wb = Wildebeest()
ht = {}                             # dictionary sets/resets steps to be skipped (default: not skipped)
# ht['SKIP-punct-dash'] = 1         # optionally skip normalization of ndash, mdash etc. to ASCII hyphen-minus.
# ht['SKIP-enclosure'] = 1          # optionally skip 'enclosure' normalization
# ht['SKIP-del-arabic-diacr'] = 1   # optionally skip 'delete arabic diacritic' normalization
wb.load_look_alike_file()           # optional
print(wb.norm_clean_string('🄐…25km²', ht, lang_code='eng'))
print(wb.norm_clean_string('೧೯೨೩', ht, lang_code='kan'))
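
In the first call, the enclosure step decomposes 🄐 to (A) and the punct step maps the ellipsis … to three ASCII periods; in the second call, the digit step maps the Kannada digits ೧೯೨೩ to ASCII 1923 (see the list of normalization steps below).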


Installation
# from PyPI (after public release)
pip install wildebeest

# Latest master branch: either https or git/ssh 
pip install git+https://github.com/uhermjakob/wildebeest.git

# For editing/development
git clone https://github.com/uhermjakob/wildebeest.git
cd wildebeest
pip install --editable .   # run it from dir having setup.py

To call wildebeest after installation, run python -m wildebeest or its alias wb-norm.

List of Normalization Steps

repair-encodings-errors

The script generally expects input encoded in UTF8. However, it will recognize and repair some common text encoding errors:

  • (Some) text is still encoded in Windows1252 or Latin1. Any byte that is not part of a well-formed UTF8 character will be interpreted as a Windows1252 character (and mapped to UTF8). This includes printable Latin1 characters as a subset.
  • Text in Windows1252 was incorrectly converted to UTF8 by a Latin1-to-UTF8 converter. This maps Windows1252 characters \x80-\x9F to \u0080-\u009F, which is the Unicode block of C1 control characters. These C1 control characters are extremely rare, so our script interprets them as ill-converted Windows1252 characters, as do many major software applications such as Google Chrome, Microsoft Outlook, Github (text files) and PyCharm (where they are often displayed in a slightly different form).
  • Text in Windows1252 or Latin1 was converted twice, using some combination of a Latin1-to-UTF8 converter and a Windows1252-to-UTF8 converter; or a file already in UTF8 was incorrectly subjected to another conversion. Sample wildebeest repair:
    • Input: Donâ��t tell your â��fiancéâ�� â�� Schöne GrüÃ�e aus Mährenâ�¦ â�� Ma sÅ�ur trouve ça «bête». ¡Coño! â�¬50 â�¢ 25km² â�¢ ½µm
    • Output: Don’t tell your “fiancé” — Schöne Grüße aus Mähren… – Ma sœur trouve ça «bête». ¡Coño! €50 • 25km² • ½µm
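
For intuition, this class of error can be reproduced with a few lines of standard-library Python. This is an independent illustration of the encode/decode round trip behind such mojibake, not wildebeest's internal repair logic:

# UTF-8 bytes misread as Windows-1252 yield classic mojibake:
s = "Don’t"                                   # contains U+2019 RIGHT SINGLE QUOTATION MARK
garbled = s.encode('utf-8').decode('cp1252')
print(garbled)                                # Donâ€™t

# The generic repair inverts the misreading:
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired)                               # Don’t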

Other normalization modules

  • del-surrogate (deletes surrogate characters (representing non-UTF8 characters in input), alternative/backup to windows-1252)
  • del-ctrl-char (deletes control characters (except tab and linefeed), zero-width characters, byte order mark, directional marks, join marks, variation selectors, Arabic tatweel)
  • core-compat (normalizes Hangul Compatibility characters to Unicode standard Hangul characters)
  • arabic-char (to Arabic canonical forms, e.g. maps Farsi kaf/yeh to Arabic versions)
  • farsi-char (to Farsi canonical forms, e.g. maps Arabic yeh, kaf to Farsi versions)
  • pashto-char (to Pashto canonical forms, e.g. maps Arabic kaf to Farsi version)
  • georgian-char (to Georgian canonical forms, e.g. to standard script, map archaic characters)
  • pres-form (e.g. maps from presentation form (isolated, initial, medial, final) to standard form)
  • ligatures (e.g. decomposes non-Arabic ligatures (e.g. ij, ffi, DŽ, ﬓ))
  • signs-and-symbols (e.g. maps symbols (e.g. kappa symbol) and signs (e.g. micro sign µ))
  • cjk (e.g. CJK square composites (e.g. ㋀㏾))
  • width (e.g. maps fullwidth and halfwidth characters to ASCII, e.g. Ａ to A)
  • font (maps font-variations characters such as ℂ, ℹ, 𝒜 to regular characters)
  • small (maps small versions of characters to normal versions, such as small ampersand ﹠ to regular &)
  • vertical (maps vertical versions of punctuation characters to their normal horizontal versions, such as vertical em-dash ︱ to horizontal em-dash —)
  • enclosure (decomposes circled, squared and parenthesized characters, e.g. 🄐 to (A))
  • hangul (combines Hangul jamos into Hangul syllables)
  • repair-combining (e.g. order of nukta/vowel-sign)
  • combining-compose (e.g. applies combining-modifiers to preceding character, e.g. ö (o + ̈) -> ö)
  • combining-decompose (e.g. for some Indian characters, splits off Nukta)
  • del-arabic-diacr (e.g. deletes optional Arabic diacritics such as fatha, damma, kasra)
  • del-hebrew-diacr (e.g. deletes Hebrew points)
  • digit (e.g. maps decimal-system digits of 54 scripts to ASCII digits)
  • punct (e.g. maps ellipsis … to periods ... and two-dot-lead ‥ to ..; a few math symbols ∭; ⒛ 🄆 )
  • punct-dash (e.g. maps various dashes, hyphens, minus signs to ASCII hyphen-minus)
  • punct-arabic (e.g. Arabic exclamation mark etc. to ASCII equivalent)
  • punct-cjk (e.g. Chinese Ideographic Full Stop etc. to ASCII equivalent)
  • punct-greek (e.g. Greek question mark etc. to ASCII equivalent)
  • punct-misc-f (e.g. Tibetan punctuation to ASCII equivalent)
  • space (e.g. maps non-zero spaces to normal space)
  • look-alike (normalizes Latin/Cyrillic/Greek look-alike characters, e.g. Latin character A to Greek Α (capital alpha) in otherwise Greek word)
  • repair-xml (e.g. repairs multi-escaped tokens such as &quot; or &#x200C;)
  • repair-url-escapes (e.g. repairs multi-escaped url substrings such as Jo%25C3%25ABlle_Aubron)
  • repair-token (e.g. splits +/-/*/digits off Arabic words; maps not-sign inside Arabic to token-separating hyphen)
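
Several of the steps above (core-compat, pres-form, ligatures, cjk, width, font, small, enclosure) cover ground similar to Unicode's NFKC compatibility normalization, though wildebeest is more selective and adds repairs that NFKC does not attempt. The standard-library sketch below is independent of wildebeest and only shows the flavor of such mappings:

import unicodedata

# NFKC covers some of the same compatibility mappings
# (ligatures, fullwidth forms, font variants, squared CJK composites):
for s in ['ﬃ', 'Ａ', 'ℂ', '²', '㎢']:
    print(s, '->', unicodedata.normalize('NFKC', s))
# ﬃ -> ffi
# Ａ -> A
# ℂ -> C
# ² -> 2
# ㎢ -> km2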

wb_analysis.py

Script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, zero-width characters, letters/numbers/punctuation/letter-modifiers from various scripts, tokens with letters from different scripts, XML tokens, tokens with certain punctuation of interest, orphan letter modifiers, and non-canonical character combinations.

usage: wb_analysis.py [-h] [-i INPUT-FILENAME] [--batch BATCH_DIR] [-s] [-o OUTPUT-FILENAME] [-j JSON-OUTPUT-FILENAME] [--file_id FILE_ID]
                      [--lc LANGUAGE-CODE] [-v] [-pb] [-n MAX_CASES] [-x MAX_EXAMPLES] [-r REF-FILENAME] [--version]

Analyzes a given text for a wide range of anomalies

options:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  --batch BATCH_DIR     Directory with batch of input files (BATCH_DIR/*.txt)
  -s, --summary         single summary line per file
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  -j JSON-OUTPUT-FILENAME, --json JSON-OUTPUT-FILENAME
                        (default: None)
  --file_id FILE_ID
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  -v, --verbose         write change log etc. to STDERR
  -pb, --progress_bar   Show progress bar
  -n MAX_CASES, --max_cases MAX_CASES
                        max number of cases per group
  -x MAX_EXAMPLES, --max_examples MAX_EXAMPLES
                        max number of examples per line
  -r REF-FILENAME, --ref_id_file REF-FILENAME
                        (optional file with sentence reference IDs)
  --version             show program's version number and exit

Sample calls:

wb_analysis.py --help
echo 'Hеllο!' | wb_analysis.py
wb_analysis.py -i test/data/hello.txt
wb_analysis.py -i test/data/wildebeest-test.txt -o test/data/wildebeest-test-out
wb_analysis.py --batch test/data/phrasebook -s -o test/data/phrasebook-dir-out
wb_analysis.py -i test/data/phrasebook/deu.txt -r test/data/phrasebook/eng.txt -o test/data/phrasebook-deu-out
wb_analysis.py -i test/data/wildebeest-test-invalid-utf8.txt
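
The second sample call above pipes in 'Hеllο!', which looks like a Latin word but mixes in a Cyrillic е and a Greek ο; flagging such mixed-script tokens is one of the analyzer's checks. Below is a rough standard-library illustration of that kind of check, not wb_analysis.py's actual implementation:

import unicodedata

def rough_scripts(text):
    """Group alphabetic characters by the script named in their Unicode name."""
    scripts = {}
    for ch in text:
        if ch.isalpha():
            script = unicodedata.name(ch, 'UNKNOWN').split(' ')[0]
            scripts.setdefault(script, []).append(ch)
    return scripts

print(rough_scripts('Hеllο!'))
# {'LATIN': ['H', 'l', 'l'], 'CYRILLIC': ['е'], 'GREEK': ['ο']}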

wb-analysis.pl

This older Perl script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, non-ASCII punctuation, characters from a variety of language groups, very long tokens, unsplit 's, unsplit punctuation, script mixing; split URLs, email addresses, filenames, XML tokens.

It will report the number of instances in each category and give examples.

Currently available: wildebeest_analysis.pl (Perl) v2.6 (April 28, 2021)
