Kyubyong / mtp

Multi-lingual Text Processing

This is for my tech talk at Naver on September 6, 2018.

Why Multi-lingual Text Processing?

Yes! Modeling is fancy. Data processing is tedious. You don't want to do it. I know. But in my experience, it's often data processing rather than modeling that determines the performance of your experiment. If you can't avoid it, you'd better do it right.

Image processing techniques are available from many other sources. More importantly, I'm not an expert in them. Let me focus on text, which, along with sound, is one of the two most typical modalities for handling language.

If you only work with a single language, say English, that's fine. But if for some reason you have to handle a language you're not familiar with, you'll need some knowledge of it.

Basic Text Processing

(Main source: Lecture slides from the Stanford Coursera course)

Regular Expressions

  • Syntax for processing strings
  • LIBRARY regex (third-party): Supports Unicode property expressions such as '\p{Han}' for Chinese characters and '\p{Latin}' for the Latin script.
  • ONLINE https://regexr.com/
  • SOFTWARE PowerGrep

Tokenization

  • Token: a unit such as a character, subword (e.g., BPE), word, MWE, or sentence.
  • Character
    • Simple (😄)
    • Small vocabulary (< 100) (😄)
    • Robust to rare words (😄)
    • Long sequence (😭)
  • Subword
    • Best performance in machine translation (😄)
    • Robust to rare words (😄)
    • Not intuitive (😭)
    • Data-dependent (😭)
  • Word
    • Usually simple (😄)
    • Short sequence (😄)
    • Transfer learning (😄)
    • Large vocabulary (> 10000) (😭)
    • Weak in rare words (😭)
  • MWE (Multi-word expression)
    • Idioms e.g., ‘kick the bucket’
    • Compounds e.g., ‘San Francisco’
    • Phrasal verbs e.g. ‘get … across’
    • PROJECT Multiword Expression Project
  • Sentence
    • Usually identified by a sentence-ending symbol (.!?)
    • Period (.) is sometimes ambiguous:
      • Abbreviations like Inc. or Dr.
      • Numbers like .02% or 4.3
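
The period ambiguity described above can be illustrated with a small rule-based splitter. This is only a minimal sketch, not a production tokenizer: the `ABBREVIATIONS` set and the `split_sentences` name are assumptions made for the example, and it covers just the two cases listed (abbreviations and numbers).

```python
import re

# Hypothetical, deliberately tiny abbreviation list for this example.
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split on . ! ? followed by whitespace, skipping known abbreviations."""
    sentences = []
    start = 0
    # A candidate boundary is a run of .!? followed by whitespace.
    # Periods inside numbers (4.3, .02%) are not followed by whitespace,
    # so they never become candidates.
    for match in re.finditer(r"[.!?]+\s+", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr." does not end the sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = ("Dr. Kim works at Acme Inc. in Seoul. "
          "Sales rose 4.3% last year! Did costs fall by .02%?")
for sentence in split_sentences(sample):
    print(sentence)
```

The sketch still fails when an abbreviation really does end a sentence ("... apples, pears, etc. Then ..."), which is one reason trained sentence tokenizers are usually preferred.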

Normalization

Lemmatization

  • Lemma: the canonical or dictionary form of a set of words
    • E.g., produce, produced, production -> produce
  • WHY? Dictionary lookup
  • HOW? Linguistic knowledge
  • LIBRARY nltk wordnet lemmatizer

Stemming

  • Stem: the part of the word that never changes even when morphologically inflected
    • E.g., produce, produced, production -> produc-
  • WHY? Query-document match
  • HOW? Sequence of rules
  • LIBRARY nltk stemmers
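
The "sequence of rules" idea can be sketched with a toy suffix stripper. This is not the Porter algorithm or any nltk stemmer; the rule list and the `toy_stem` name are assumptions chosen only to reproduce the produce / produced / production example above.

```python
# Hypothetical rule list: suffixes are tried in order, first match wins.
SUFFIX_RULES = ["tion", "ing", "ed", "es", "e", "s"]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["produce", "produced", "production"]]
print(stems)  # all three map to "produc"
```

Because the stem only has to be consistent, not a real word, "produc" is good enough for matching a query against documents.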

Unicode Normalization

(Main source: unicode.org)

  • Canonical equivalence: a fundamental equivalency between characters which represent the same abstract character
    • E.g., combining sequence: Ç = C+◌̧
    • E.g., ordering of combining marks: q+◌̇+◌̣ = q+◌̣+◌̇
  • Compatibility equivalence: a weaker type of equivalence between characters which represent the same abstract character, but which may have distinct visual appearances or behaviors
    • E.g., circled variants: ① → 1
    • E.g., width variants: ｶ → カ
  • NFD: Canonical Decomposition
  • NFKD: Compatibility Decomposition
  • NFC: NFD + Canonical Composition
  • NFKC: NFKD + Canonical Composition
  • Typically NFC is desirable for string matching.
  • NFKC is useful if you don't want to distinguish compatibility-equivalent characters like full- and half-width characters.
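  • Strip diacritics: reduce accented letters to their base characters (e.g., é -> e)
import unicodedata

def strip_diacritics(text):
    # Decompose (NFD), then drop nonspacing combining marks (category 'Mn').
    return ''.join(char for char in unicodedata.normalize('NFD', text)
                   if unicodedata.category(char) != 'Mn')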
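
The four normalization forms can be compared directly with the standard library's `unicodedata.normalize`. The sketch below reuses the examples from the list above (Ç, ①, half-width katakana); the `codepoints` helper is just an illustrative name.

```python
import unicodedata

def codepoints(text):
    """Render a string as a list of U+XXXX code points."""
    return [f"U+{ord(ch):04X}" for ch in text]

composed = "\u00C7"      # Ç as a single code point
decomposed = "C\u0327"   # C followed by COMBINING CEDILLA

# Canonical equivalence: the raw strings differ,
# but they agree after NFC or NFD.
assert composed != decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# Compatibility equivalence: only the K forms fold these variants.
circled_one = "\u2460"   # circled digit one
halfwidth_ka = "\uFF76"  # half-width katakana ka
nfc_circled = unicodedata.normalize("NFC", circled_one)    # unchanged
nfkc_circled = unicodedata.normalize("NFKC", circled_one)  # "1"
nfkc_ka = unicodedata.normalize("NFKC", halfwidth_ka)      # full-width ka

print(codepoints(unicodedata.normalize("NFD", composed)))
print(nfc_equal, nfd_equal, nfkc_circled, codepoints(nfkc_ka))
```

Whichever form you pick, apply it to both sides of a comparison (query and document, lexicon and text), otherwise visually identical strings may fail to match.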

Writing Systems

(Main source: omniglot)

Alphabets

  • Each letter corresponds to one or more phonemes.
  • Latin alphabet (AaBbCc), Cyrillic alphabet (кириллица), Hangul (한글)
  • Hangul

  • There is a fixed order.
  • Consonants and vowels stand alone.
  • Desirable for computer processing.

Abjads (= Consonant alphabets)

  • Each letter stands for a consonant, leaving the reader to supply the vowel.
  • "Cn y ndrstnd ths?"
  • Arabic script (عربى), Hebrew script (עִברִית)
  • 'book' in Arabic (= 'kitaab')

Abugidas

  • Consonants (Primary) + Vowels (Secondary)
  • Devanagari (देवनागरी), Tamil (தமிழ்)
  • Devanagari compounds

Syllabaries

  • Each symbol corresponds to a syllable that is not further decomposed.
  • Hiragana (ひらがな), Katakana (カタカナ)
  • Phonemic transcription is often useful.
    • E.g., かわいい -> ka wa i i

Logographs

  • Each character represents an abstract concept rather than a sound.
  • Chinese characters
  • Many distinct characters
  • Challenging for processing
  • Phonemic transcription is often useful.
    • E.g., 我爱你 -> wǒ ài nǐ

IPA (International Phonetic Alphabet)

  • Universal alphabet
  • IPA Chart
  • Each distinctive sound is represented as a single letter. (/sh/ -> /ʃ/, /th/ -> /θ/, /ng/ -> /ŋ/)
  • Slashes (/ /) for phonemic transcription (e.g., 'pin' /pɪn/ vs. 'spin' /spɪn/)
  • Square brackets ([ ]) for phonetic transcription (e.g., 'pin' [pʰɪn] vs. 'spin' [spɪn])

ARPABET

Languages

(Main sources: Relevant Wiki pages)

Arabic

  • CHAR SET [ \p{Arabic}.؟!،0-9]
  • Written from right to left
  • Cursive
  • No distinct upper and lower case letter forms
  • Comma (،) and question mark (؟) are different from those of English.
  • Many dialects with varying orthographies exist.
  • Clitics are attached to a stem without any orthographic mark such as an apostrophe. (See Fahad Alotaiby et al.)
    • مستواك "your level" -> ك "your" + مستوى "level"
  • TOOL Stanford Arabic Segmenter

Dutch

  • CHAR SET [ A-Za-z0-9.!?'-]
    • Digraph 'ij' is considered the same as 'y'. (See this)

English

  • CHAR SET [ A-Za-z0-9.!?'-]
  • Diacritics are optional.
    • E.g., naïve = naive, façade = facade, résumé = resume
  • Period (.) is used at the end of a sentence or for abbreviations.
    • E.g., etc., i.e., e.g.
  • Most hyphens in compounds can be replaced with a space.
    • E.g., state-of-the-art = state of the art
  • Apostrophe (') can construct clitics.
    • E.g., I'm (= I am), we've (= we have)
  • The closing quotation mark (’) and apostrophe (') are often mixed up. (Read this)
  • Many words have more than one spelling. (E.g., gray / grey)
  • Graphemes and phonemes are not directly linked. In other words, it's not always possible to infer the pronunciation of a word from its spelling. Therefore in speech synthesis a preprocessor that converts graphemes to phonemes is often used. (Check English g2p)
  • Compared to languages such as Chinese, Japanese, or Thai, tokenization is less of a problem. You can simply split text into sentences on [.!?] and into words on whitespace, at some cost in accuracy. (Check nltk tokenize)
  • Identifying multi-word expressions is not always easy.
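
One way to use a CHAR SET entry is to drop everything that falls outside it. The sketch below does this for the English set with the standard `re` module; the hyphen is placed last in the class so that it is read as a literal character rather than as a range. The `keep_english_charset` name is made up for the example.

```python
import re

# Negated character class: anything that is NOT a space, an ASCII letter,
# a digit, or one of . ! ? ' - is removed.
OUT_OF_SET = re.compile(r"[^ A-Za-z0-9.!?'-]")

def keep_english_charset(text):
    """Remove characters outside the English char set, then squeeze spaces."""
    cleaned = OUT_OF_SET.sub("", text)
    return re.sub(r" {2,}", " ", cleaned).strip()

print(keep_english_charset("It's state-of-the-art (really)! 😄 Cost: $4.30"))
```

Accented letters such as ï also fall outside this set, so strip diacritics first if "naïve" should survive as "naive".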

French

  • CHAR SET [ A-Za-zçÉéÀàÈèÙùÂâÊêÎîÔôÛûœæ0-9.!?'-]
  • Diacritics on capital letters are often omitted.
  • The two ligatures 'œ' and 'æ' are mostly equivalent to 'oe' and 'ae', respectively.
  • Hyphen (-) is used before a pronoun in imperative sentences.
    • Donne-les-moi ! "Give them to me!"
  • Clitics with an apostrophe (')
    • E.g., je t'aime "I love you"

German

  • CHAR SET [ A-Za-zÄäÖöÜüẞß0-9.!?'-]
  • Nouns are capitalized.
  • No space for compound nouns (Check compound splitter)
    • E.g., Rinderwahnsinn "mad cow syndrome"
  • 'ß' and 'ss' are often interchangeable (Swiss Standard German, for example, always writes 'ss').
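
For matching that treats 'ß' and 'ss' alike, Python's built-in `str.casefold()` is one option, since Unicode case folding maps 'ß' to 'ss' while `str.lower()` does not. A small sketch; the helper name `same_german_word` is an assumption.

```python
def same_german_word(a, b):
    """Compare two strings after Unicode case folding (ß -> ss)."""
    return a.casefold() == b.casefold()

folded_match = same_german_word("Straße", "STRASSE")
lower_match = "Straße".lower() == "strasse"
print(folded_match, lower_match)
```

Umlauts are left alone by case folding, so 'ü' versus 'ue' would still need a separate rule.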

Greek

  • CHAR SET [ \p{Greek}0-9.!;'-]
  • β (beta), θ (theta), and χ (chi) are used as phonetic symbols in the IPA.
  • The letter sigma 'Σ' has two different lowercase forms, 'σ' and 'ς'. 'ς' is used in word-final position and 'σ' elsewhere. (Read this)
  • Semicolon (;) is used as a question mark.
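
The two lowercase sigmas matter for string matching. As a sketch of how the standard library behaves: `str.lower()` applies the final-sigma rule, and `str.casefold()` maps 'ς' back to 'σ', so folded strings compare equal. The helper name is an assumption.

```python
def sigma_insensitive_equal(a, b):
    """Compare Greek strings while ignoring case and the σ/ς distinction."""
    return a.casefold() == b.casefold()

lowered = "ΟΔΟΣ".lower()  # word-final Σ is lowercased to ς
print(lowered)
print(sigma_insensitive_equal("οδος", "ΟΔΟΣ"))
```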

Hindi

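  • CHAR SET [ \p{Devanagari}0-9।|?!]
  • The danda (।), sometimes typed as a vertical line (|), is used at the end of a sentence.
  • The Indian numbering system groups digits differently: the rightmost three digits form one group and the rest are grouped in twos.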
    • E.g., 1,00,00,00,000
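
The grouping rule can be sketched in a few lines of plain Python. `format_indian` is a hypothetical helper written for this example, not a library function.

```python
def format_indian(n):
    """Group digits Indian-style: last three digits, then groups of two."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right end of the remaining digits.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])

print(format_indian(1_000_000_000))  # 1,00,00,00,000
```

Going the other way is simpler: removing the commas before calling `int()` works for both Indian and Western grouping.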

Japanese

  • CHAR SET [\p{Hiragana}\p{Katakana}\p{Han}A-Za-z0-9０-９。、？！]
  • No space between words
  • Both full- and half-width Arabic numerals are used.
  • Note that period, comma, question mark, and exclamation mark are different from English ones.
  • Often people depend on Romanization to input Japanese in the digital setting. Romanization to Japanese conversion is very important. (Check this)
  • A morphological analyzer functions as a tokenizer and a grapheme-to-phoneme converter. (Check MeCab)
  • When は /ha/ is used as a topic marker it is pronounced as /wa/.

Korean

  • CHAR SET [ \p{Hangul}A-Za-z.!?0-9]
  • Consonants and vowels, called 'jamo' in Korean, combine to form a syllable, which has an independent code point.
    • E.g., ㅎ (U+314E) + ㅏ (U+314F) + ㄴ (U+3134) -> 한 (U+D55C)
  • There are two types of jamo: Hangul Compatibility Jamo and Hangul Jamo.
    • Hangul Compatibility Jamo (U+3130-U+318F)
      • Composes a syllable
      • In computer keyboards
      • The consonants in the onset and the coda are identical.
    • Hangul Jamo (U+1100-U+11FF)
      • Used mostly when representing old Hangul
      • The consonants in the onset and the coda are NOT identical.
      • If you need to decompose Hangul syllables, Hangul Jamo is better than Hangul Compatibility Jamo. (Check this)
  • Orthography is notoriously difficult, so you can't expect informal writing to obey the rules.
  • A grammar checker is hard to make. (But surprisingly there is a decent one. Check this)
  • Like German, many compounds are created by merging two words without a space.
    • E.g., 점심시간 "lunch time" (= 점심 "lunch" + 시간 "time")
  • Hangul is phonetic, but the current orthography respects the origin of words rather than reflecting the sound itself. As a result, the actual pronunciation of some words differs from their spelling.
    • E.g., 독립 dok rip (spelling) -> /dong nip/ (pronunciation) "independence"
  • TOOL Python-jamo: Hangul syllable decomposition and synthesis library
  • TOOL KoG2P
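
The relation between syllables and Hangul Jamo (the U+1100 block) can be sketched with the standard library alone. NFD decomposes a precomposed syllable, and the arithmetic in `compose` follows the Hangul syllable composition formula from the Unicode Standard. The function names are illustrative and are not the Python-jamo API.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def decompose(syllable):
    """Split one precomposed Hangul syllable into Hangul Jamo via NFD."""
    return list(unicodedata.normalize("NFD", syllable))

def compose(onset, vowel, coda=None):
    """Build a syllable from Hangul Jamo (not Compatibility Jamo) by arithmetic."""
    l_index = ord(onset) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(coda) - T_BASE if coda else 0  # 0 means "no coda"
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

jamo = decompose("\uD55C")                 # the syllable 한
print([f"U+{ord(j):04X}" for j in jamo])   # onset, vowel, coda
print(compose(*jamo))
```

Note that the onset ㄴ (U+1102) and the coda ㄴ (U+11AB) are different code points here, which is what makes Hangul Jamo convenient for decomposition.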

Mandarin

  • CHAR SET [\p{Han}。、，！？0-9]
  • There are two types of commas: ， and 、. The ideographic comma (、) is used when enumerating items in a list (e.g., 红色、白色、黄色 "red, white, and yellow").
  • No space between words
  • Pinyin, the standard Romanization system for Mandarin, is used.
  • There are 5 different tones. Four are marked by diacritics in pinyin; the neutral tone is left unmarked.
    • mā (high level)
    • má (rising)
    • mǎ (falling and rising)
    • mà (falling)
    • ma (neutral)
  • There are two types of characters: simplified and traditional. The former is used in mainland China, whereas the latter is used in Taiwan and Korea.
  • Check this to see the list of characters that are used differently in Chinese, Japanese, and Korean.
  • Typically people type pinyin to input Chinese characters in the digital setting. The pinyin to Chinese conversion is very important. (Check this)
  • TOOL pypinyin: a Python project for getting pinyin for Chinese words or sentences
  • TOOL Jieba: Chinese text segmentation module
  • TOOL hanziconv: a tool that converts between simplified and traditional Chinese characters
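
Since pinyin tones are written as diacritics, Unicode decomposition is enough to turn them into tone numbers, which are often easier to process. This is a minimal standard-library sketch and does not use pypinyin; writing the neutral tone as 5 is a convention assumed here (some systems use 0 or nothing), and `pinyin_to_numbered` is a made-up name.

```python
import unicodedata

# Combining marks used for the four marked tones.
TONE_MARKS = {
    "\u0304": "1",  # macron: high level
    "\u0301": "2",  # acute: rising
    "\u030C": "3",  # caron: falling and rising
    "\u0300": "4",  # grave: falling
}

def pinyin_to_numbered(syllable):
    """Convert one pinyin syllable with a tone diacritic to numbered form."""
    tone = "5"  # assumed convention for the unmarked neutral tone
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # Recompose so that ü (u + diaeresis) stays a single character.
    return unicodedata.normalize("NFC", "".join(kept)) + tone

print([pinyin_to_numbered(s) for s in ["mā", "má", "mǎ", "mà", "ma"]])
```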

Persian

  • CHAR SET [ \p{Arabic}.؟!،0-9]
  • Check Arabic
  • When a Zero-Width Non-Joiner (ZWNJ) is used between two characters, it forces a final form on the preceding character. (See this)

Portuguese

  • CHAR SET [ \p{Latin}0-9.?!'-]
  • The hyphen (-) is used to make compound words and to attach clitic pronouns.
    • E.g., levaria + vos + os = levar-vo-los-ia "I would take them to you"

Russian

  • CHAR SET [ \p{Cyrillic}0-9.!?'-]

Spanish

  • CHAR SET [ \p{Latin}0-9.!¡?¿'-]
  • ¿ is used at the beginning of an interrogative sentence, pairing with ?.
  • ¡ is used at the beginning of an exclamatory sentence, pairing with !.

Thai

Vietnamese

  • CHAR SET [ \p{Latin}0-9.!?'-]
  • There are 6 different tones. Five are marked by diacritics; the mid-level tone is unmarked.
    • a (mid level)
    • à (low falling)
    • ả (mid falling)
    • ã (glottalized rising)
    • á (high rising)
    • ạ (glottalized falling)
  • Spaces are used to separate syllables, not words.
    • E.g., thuế thu nhập cá nhân -> thuế "tax" + thu_nhập "income" + cá_nhân "individual"
  • INFO word segmentation tools
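
Because spaces separate syllables, word segmentation means deciding which neighboring syllables belong together. A greedy longest-match over a lexicon is one simple baseline. The sketch below uses a two-entry toy lexicon invented for the example; real segmenters rely on much larger dictionaries or trained models.

```python
# Hypothetical toy lexicon of multi-syllable words, stored as syllable tuples.
LEXICON = {("thu", "nhập"), ("cá", "nhân")}
MAX_WORD_LEN = max(len(entry) for entry in LEXICON)

def segment(text):
    """Greedy longest-match: join syllables that form a lexicon word with '_'."""
    syllables = text.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest span first and fall back to a single syllable.
        for size in range(min(MAX_WORD_LEN, len(syllables) - i), 0, -1):
            span = tuple(syllables[i:i + size])
            if size == 1 or span in LEXICON:
                words.append("_".join(span))
                i += size
                break
    return words

print(segment("thuế thu nhập cá nhân"))
```

In practice both the lexicon and the input should be normalized to the same Unicode form first, since Vietnamese vowels can carry two stacked diacritics that may be encoded in more than one way.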
