CUNY-CL / wikipron

Massively multilingual pronunciation mining


Common characters and _detect_best_script_name

agutkin opened this issue

While scraping Makasar (PR #415), I've encountered the following default behavior:

  1. We currently have strict mode turned on by default in _detect_best_script_name, which is sensible: it makes sure that all the characters in a word come from the same script (see the sketch after this list).
  2. Because of (1), we are failing to detect about half of the Makasar words (200+) in the Latin orthography, where the apostrophe ' signifies a glottal stop.
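
To make the failure concrete, here is a minimal sketch of a strict same-script check along these lines, using unicodedataplus. The helper name and the example word are made up for illustration; this is not the actual wikipron code.

```python
# Hypothetical sketch of a strict same-script check; not the actual
# wikipron implementation, just an illustration of the behavior above.
from typing import Optional

import unicodedataplus


def detect_script_strict(word: str) -> Optional[str]:
    """Returns the script name only if every character shares one script."""
    scripts = {unicodedataplus.script(char) for char in word}
    if len(scripts) == 1:
        return scripts.pop()
    return None  # Mixed scripts: the word is rejected in strict mode.


print(unicodedataplus.script("a"))   # "Latin"
print(unicodedataplus.script("'"))   # "Common": the ASCII apostrophe
print(detect_script_strict("ana'"))  # None, because {"Latin", "Common"} is mixed
```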

I was under the impression that the plan was to allow a configurable (per-language) set of common characters inside _detect_best_script_name. Perhaps I misunderstood the original plan?

I suppose the current behavior may be too restrictive as there are plenty of languages where the common Latin apostrophe is used.

I thought we handled apostrophes used to mark glottal stops. Maybe we didn't. @cgibson6279 can perhaps weigh in.

Yeah, running common_characters.py should extend the regex in split.py to accommodate apostrophe characters in all scripts, so I'm not sure why that would be an issue. I'm pretty sure common_characters.py is run as part of the postprocessing script for scraping data, so if that was run, apostrophes shouldn't be a problem. But it also looks like Lucas may have done some work on, or added some things to, those files as part of the reorganization effort, so I'm not one hundred percent sure what exactly changed there if that's the case.
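
For readers following along, here is a rough sketch, using the third-party regex module, of what extending a per-script pattern with common characters like the apostrophe looks like. The pattern names are invented and this is not the actual split.py regex.

```python
# Hypothetical illustration of extending a per-script word pattern with
# "common" characters such as apostrophes; not the actual split.py regex.
import regex  # Third-party module with Unicode script property support.

# A strict Latin-only pattern rejects the apostrophe...
strict_latin = regex.compile(r"^\p{Latin}+$")
# ...whereas an extended pattern also admits a small whitelist of common
# characters (here the ASCII apostrophe and the right single quotation mark).
extended_latin = regex.compile(r"^[\p{Latin}'\u2019]+$")

print(bool(strict_latin.match("ana'")))    # False
print(bool(extended_latin.match("ana'")))  # True
```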

Running the postprocessing removed all the pronunciations with apostrophes from the original TSV, so I guess something has changed. But, on the other hand, correct me if I am wrong: shouldn't this logic be incorporated into _detect_best_script_name? It fails precisely because of the strict mode, which is not permissive of common characters, and if _detect_best_script_name fails, the words in the dictionary are not updated correctly.

That logic sounds correct to me.

The solution might be as simple as setting that parameter to False.

I think disabling strict mode entirely would be too permissive: I suppose we don't want to allow, say, a mix of Cyrillic and Latin characters in a single word.
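
To illustrate the concern, here is a hypothetical non-strict variant that just takes the majority script per word; it would happily assign a script to a Cyrillic/Latin mixture instead of rejecting it. This is not proposed or existing wikipron code, only a demonstration.

```python
# Hypothetical majority-vote detector; shows why simply turning strict
# mode off (or voting per character) is too permissive.
import collections

import unicodedataplus


def detect_script_majority(word: str) -> str:
    """Returns the most frequent script among the word's characters."""
    counts = collections.Counter(unicodedataplus.script(char) for char in word)
    return counts.most_common(1)[0][0]


# Cyrillic "с" and "о" plus a Latin "p": a mixed-script word that a
# majority vote still labels as Cyrillic rather than rejecting.
print(detect_script_majority("\u0441\u043ep"))  # "Cyrillic"
```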

That was considered, but I don't remember what the solution was... I'll have to look into it further and report back.

I think our problem is line 62 here:
https://github.com/kylebgorman/wikipron/blob/5026d95d2a71b0823cd633b08c4350ad86ef0422/data/scrape/lib/common_characters.py#L59-L67
Should this be if unicodedataplus.script(char) == "Common": ?
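For reference, a quick look at the script property values being discussed; this is a standalone snippet, not code taken from common_characters.py.
```python
# The Unicode script property values in question.
import unicodedataplus

print(unicodedataplus.script("a"))       # "Latin"
print(unicodedataplus.script("'"))       # "Common": the ASCII apostrophe
print(unicodedataplus.script("\u0301"))  # "Inherited": combining acute accent
```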
I don't think _detect_best_script_name is the problem here. It was moved into languages_update.py and doesn't interact with the Common/Inherited regex extension stuff.

One potential problem with our approach: if there were no words in a language that all came from a single script (for example, if every Makasar word contained an apostrophe), then we wouldn't be able to assign a script, and I'm not sure what would happen after that.
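
A sketch of that edge case, under the assumption that the per-word detections get tallied and the most frequent script wins; the helper and the word list below are made up for illustration.

```python
# Made-up illustration of the edge case: if every word mixes scripts,
# a strict detector yields nothing to tally, so no script gets assigned.
import collections

import unicodedataplus


def strict_script(word):
    scripts = {unicodedataplus.script(char) for char in word}
    return scripts.pop() if len(scripts) == 1 else None


words = ["ana'", "bara'", "lompo'"]  # Illustrative apostrophe-final forms.
tallies = collections.Counter(
    script for word in words if (script := strict_script(word)) is not None
)
print(tallies.most_common(1))  # []: no single-script words, nothing to vote on.
```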