rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

add support for romanisation

maxbachmann opened this issue · comments

As described in #7, metrics like the Levenshtein distance only make sense for languages like Chinese if there is support for romanisation.
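To illustrate the point, here is a minimal pure-Python Levenshtein distance (a simplified stand-in for RapidFuzz's own optimised implementation). Two Chinese words whose pinyin is identical look maximally different at the character level:

```python
# Minimal Levenshtein distance, for illustration only --
# RapidFuzz ships its own optimised implementation.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "北京" (Beijing) and "背景" (background) share the pinyin "beijing",
# but at the character level they are completely different:
print(levenshtein("北京", "背景"))        # 2 -- maximally distant
print(levenshtein("beijing", "beijing"))  # 0 -- identical after romanisation
```

Without romanisation the metric sees no similarity at all, even though the words sound the same.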

@mrtolkien @lingvisa I opened this new issue to track support for romanisation. Note that:

  1. I am unsure how this would optimally be implemented to help the largest number of people
  2. I do not think I have time to implement this myself soon, but someone else might want to pick this up. Especially now that there is a Python-only mode, it would be enough to implement this in pure Python for now (I can port it to C++ later for better performance)

This should be implemented as a separate preprocessing function, similar to the current default_process method.
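As a sketch of what such a preprocessing function could look like from the user's side: the `TOY_PINYIN` table and `romanise` function below are hypothetical stand-ins for a real romanisation library, shown only to illustrate the shape of the interface.

```python
# Hypothetical sketch of a user-supplied preprocessing function.
# TOY_PINYIN is a stand-in: a real implementation would use a proper
# romanisation library rather than a hand-written table.
TOY_PINYIN = {"北": "bei", "京": "jing", "上": "shang", "海": "hai"}

def romanise(text):
    # Map each known character to its romanisation; pass others through.
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)

print(romanise("上海"))  # "shanghai"
```

Like `default_process`, such a callable could then be passed to any scorer via the `processor` keyword, e.g. `fuzz.ratio("北京", "背景", processor=romanise)`.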

Note that I am unsure how simple/hard romanisation is depending on the language, since I have zero experience with languages that need this sort of preprocessing. So any solution making it into RapidFuzz would need to be:

  1. simple enough for even me to maintain
  2. not generate tons of issues due to suboptimal romanisation in some cases (which depending on the language are probably going to occur)

Depending on the amount of work this requires, it might make sense to make this a separate project. This is really not an integral step of the matching but a preprocessing step, which is likely helpful to users in and of itself (probably some projects for this already exist).
Note that I have a C-API for preprocessing functions, which would even allow you to achieve this without any performance loss compared to a built-in implementation.

I would be happy to mention these solutions in my documentation to help users coming from a language benefiting from romanisation.

> Depending on the amount of work this requires, it might make sense to make this a separate project.

This feels out of scope for RapidFuzz, because transcribing non-Roman languages is a totally separate problem space. I think users should just do it separately and pass the results to RapidFuzz; that gives them complete freedom of implementation, since there are many ways to transcribe, each with different tradeoffs, and none are perfect.

I'll give an example for Japanese, but a similar approach could be taken for Chinese.

Getting the pronunciation of Japanese text

Getting the phonetic transcriptions for Japanese is a straightforward process, but you'll need some pretty heavy dependencies for it.

Installation

  • fugashi is a morphological analyser for Japanese. It's just a Python wrapper around MeCab.
  • unidic is a very large (770 MB) dictionary file that provides MeCab with the token data needed to segment Japanese text.
```
pip install fugashi
pip install unidic

# Warning: the download for UniDic is around 770 MB!
python -m unidic download
```

Usage

```python
from fugashi import GenericTagger
import unidic

tagger = GenericTagger('-d "{}"'.format(unidic.DICDIR))


def get_pronunciation(text, tagger):
    # UniDic stores the pronunciation (in katakana) at feature index 9.
    acc = ""
    pron_index = 9
    for word in tagger(text):
        pron = (
            word.feature[pron_index]
            if len(word.feature) > pron_index
            else word.surface
        )
        if pron == "*":
            # "*" means no pronunciation data; fall back to the surface form.
            pron = word.surface
        acc += pron
    return acc


print(get_pronunciation("東京に住む。", tagger))
# "トーキョーニスム。"
```

From there, you'd need a separate library to map the phonetic (katakana) characters to Roman characters – but actually just getting them as far as phonetic characters could be enough for your purposes.
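A naive mapping shows both the idea and why "none are perfect": a character-by-character table mishandles digraphs (キョ should become "kyo") and the long-vowel mark (ー lengthens the preceding vowel). The table below is a deliberately incomplete toy; real converters such as pykakasi or cutlet handle these cases properly.

```python
# Toy katakana-to-romaji table -- deliberately naive, to show why
# character-by-character mapping is not enough on its own.
KATA_TO_ROMAJI = {"ト": "to", "ー": "o", "キ": "ki", "ョ": "yo",
                  "ニ": "ni", "ス": "su", "ム": "mu", "。": "."}

def to_romaji(kana):
    # Map each known katakana to its romaji; pass unknown characters through.
    return "".join(KATA_TO_ROMAJI.get(ch, ch) for ch in kana)

print(to_romaji("トーキョーニスム。"))
# "tookiyoonisumu." -- usable for matching, but not proper romaji
# (the correct transcription would be "Tōkyō ni sumu.")
```

For fuzzy matching this imperfection may be acceptable: as long as both inputs are transcribed the same (imperfect) way, the distances remain meaningful.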

For Japanese, cutlet runs on top of fugashi and could probably be used in a preprocessing function. It's a bit heavy, needing unidic or unidic-lite, but maybe an example in the documentation would be enough?

I think a documentation section on romanisation options for different languages would make sense. It is a fairly common thing people run into when matching non-Roman languages, so having some documentation for this would be useful.