rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

add support for romanisation

maxbachmann opened this issue · comments

As described in #7, metrics like the Levenshtein distance only make sense for languages like Chinese if there is support for romanisation.
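To illustrate the point, here is a minimal pure-Python Levenshtein distance (a simplified stand-in for RapidFuzz's own optimised implementation). Two Chinese words whose pinyin is identical look maximally different at the character level:

```python
# Minimal Levenshtein distance, for illustration only --
# RapidFuzz ships its own optimised implementation.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "北京" (Beijing) and "背景" (background) share the pinyin "beijing",
# but at the character level they are completely different:
print(levenshtein("北京", "背景"))        # 2 -- maximally distant
print(levenshtein("beijing", "beijing"))  # 0 -- identical after romanisation
```

Without romanisation the metric sees no similarity at all, even though the words sound the same.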

@mrtolkien @lingvisa I opened this new issue to track support for romanisation. Note that:

  1. I am unsure how this would optimally be implemented to help the largest number of people
  2. I do not think I have time to implement this myself soon, but someone else might want to pick this up. Especially now that there is a Python-only mode, it would be enough to implement this in pure Python for now (I can port it to C++ later for better performance)

This should be implemented as a separate preprocessing function, similar to the current default_process method.
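As a sketch of what such a preprocessing function could look like from the user's side: the `TOY_PINYIN` table and `romanise` function below are hypothetical stand-ins for a real romanisation library, shown only to illustrate the shape of the interface.

```python
# Hypothetical sketch of a user-supplied preprocessing function.
# TOY_PINYIN is a stand-in: a real implementation would use a proper
# romanisation library rather than a hand-written table.
TOY_PINYIN = {"北": "bei", "京": "jing", "上": "shang", "海": "hai"}

def romanise(text):
    # Map each known character to its romanisation; pass others through.
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)

print(romanise("上海"))  # "shanghai"
```

Like `default_process`, such a callable could then be passed to any scorer via the `processor` keyword, e.g. `fuzz.ratio("北京", "背景", processor=romanise)`.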

Note that I am unsure how simple/hard romanisation is depending on the language, since I have zero experience with languages that need this sort of preprocessing. So any solution making it into RapidFuzz would need to be:

  1. simple enough for even me to maintain
  2. not generate tons of issues due to suboptimal romanisation in some cases (which depending on the language are probably going to occur)

Depending on the amount of work this requires, it might make sense to make this a separate project. This is really not an integral step of the matching but a preprocessing step, which is likely helpful to users in and of itself (probably some projects for this already exist).
Note that I have a C-API for preprocessing functions, which would even allow you to achieve this without any performance loss compared to a built-in implementation.

I would be happy to mention these solutions in my documentation to help users coming from a language benefiting from romanisation.

> Depending on the amount of work this requires, it might make sense to make this a separate project.

This feels out of scope for RapidFuzz, because transcribing non-Roman languages is a totally separate problem space. I think users should just do it separately and pass the results to RapidFuzz; that gives them complete freedom of implementation, since there are many ways to transcribe, each with different tradeoffs, and none are perfect.

I'll give an example for Japanese, but a similar approach could be taken for Chinese.

Getting the pronunciation of Japanese text

Getting the phonetic transcriptions for Japanese is a straightforward process, but you'll need some pretty heavy dependencies for it.

Installation

  • fugashi is a morphological analyser for Japanese. It's just a Python wrapper around MeCab.
  • unidic is a very large (770 MB) dictionary file that provides MeCab with the token data needed to segment Japanese text.
```
pip install fugashi
pip install unidic

# Warning: the download for UniDic is around 770 MB!
python -m unidic download
```

Usage

```python
from fugashi import GenericTagger
import unidic

tagger = GenericTagger('-d "{}"'.format(unidic.DICDIR))


def get_pronunciation(text, tagger):
    # UniDic stores the pronunciation (in katakana) at feature index 9.
    acc = ""
    pron_index = 9
    for word in tagger(text):
        pron = (
            word.feature[pron_index]
            if len(word.feature) > pron_index
            else word.surface
        )
        if pron == "*":
            # "*" means no pronunciation data; fall back to the surface form.
            pron = word.surface
        acc += pron
    return acc


print(get_pronunciation("東京に住む。", tagger))
# "トーキョーニスム。"
```

From there, you'd need a separate library to map the phonetic (katakana) characters to Roman characters – but actually just getting them as far as phonetic characters could be enough for your purposes.
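A naive mapping shows both the idea and why "none are perfect": a character-by-character table mishandles digraphs (キョ should become "kyo") and the long-vowel mark (ー lengthens the preceding vowel). The table below is a deliberately incomplete toy; real converters such as pykakasi or cutlet handle these cases properly.

```python
# Toy katakana-to-romaji table -- deliberately naive, to show why
# character-by-character mapping is not enough on its own.
KATA_TO_ROMAJI = {"ト": "to", "ー": "o", "キ": "ki", "ョ": "yo",
                  "ニ": "ni", "ス": "su", "ム": "mu", "。": "."}

def to_romaji(kana):
    # Map each known katakana to its romaji; pass unknown characters through.
    return "".join(KATA_TO_ROMAJI.get(ch, ch) for ch in kana)

print(to_romaji("トーキョーニスム。"))
# "tookiyoonisumu." -- usable for matching, but not proper romaji
# (the correct transcription would be "Tōkyō ni sumu.")
```

For fuzzy matching this imperfection may be acceptable: as long as both inputs are transcribed the same (imperfect) way, the distances remain meaningful.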

For Japanese, cutlet runs on top of fugashi and could probably be used in a preprocessing function. It's a bit heavy, needing unidic or unidic-lite, but maybe an example in the documentation would be enough?

I think a documentation section on romanisation options for different languages would make sense. It is a fairly common thing people run into when matching non-Roman languages, so having some documentation for this would be useful.