CUNY-CL / wikipron

Massively multilingual pronunciation mining


Common characters and _detect_best_script_name

agutkin opened this issue

While scraping Makasar (PR #415), I've encountered the following default behavior:

  1. We currently have strict mode turned on by default in _detect_best_script_name, which is sensible: it makes sure that all the characters in a word come from the same script (see the sketch after this list).
  2. Because of (1), we are failing to detect about half of the Makasar words (200+) in the Latin orthography, where the apostrophe ' signifies a glottal stop.
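
To make the failure concrete, here is a minimal sketch of a strict same-script check along these lines, using unicodedataplus. The helper name and the example word are made up for illustration; this is not the actual wikipron code.

```python
# Hypothetical sketch of a strict same-script check; not the actual
# wikipron implementation, just an illustration of the behavior above.
from typing import Optional

import unicodedataplus


def detect_script_strict(word: str) -> Optional[str]:
    """Returns the script name only if every character shares one script."""
    scripts = {unicodedataplus.script(char) for char in word}
    if len(scripts) == 1:
        return scripts.pop()
    return None  # Mixed scripts: the word is rejected in strict mode.


print(unicodedataplus.script("a"))   # "Latin"
print(unicodedataplus.script("'"))   # "Common": the ASCII apostrophe
print(detect_script_strict("ana'"))  # None, because {"Latin", "Common"} is mixed
```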

I was under the impression that the plan was to allow a configurable (per-language) set of common characters inside _detect_best_script_name. Perhaps I misunderstood the original plan?

I suppose the current behavior may be too restrictive as there are plenty of languages where the common Latin apostrophe is used.

I thought we handled apostrophes used to mark glottal stops. Maybe we didn't. @cgibson6279 can perhaps weigh in.

Yeah, running common_characters.py should extend the regex in split.py to accommodate apostrophe characters in all scripts, so I'm not sure why that would be an issue. I'm pretty sure common_characters.py is run as part of the postprocessing script for scraping data, so if that was run, apostrophes shouldn't be a problem. But it also looks like Lucas may have done some work on, or added some things to, those files as part of the reorganization effort, so I'm not one hundred percent sure what exactly changed there if that's the case.
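
For readers following along, here is a rough sketch, using the third-party regex module, of what extending a per-script pattern with common characters like the apostrophe looks like. The pattern names are invented and this is not the actual split.py regex.

```python
# Hypothetical illustration of extending a per-script word pattern with
# "common" characters such as apostrophes; not the actual split.py regex.
import regex  # Third-party module with Unicode script property support.

# A strict Latin-only pattern rejects the apostrophe...
strict_latin = regex.compile(r"^\p{Latin}+$")
# ...whereas an extended pattern also admits a small whitelist of common
# characters (here the ASCII apostrophe and the right single quotation mark).
extended_latin = regex.compile(r"^[\p{Latin}'\u2019]+$")

print(bool(strict_latin.match("ana'")))    # False
print(bool(extended_latin.match("ana'")))  # True
```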

Running the postprocessing removed all the pronunciations with apostrophes from the original TSV, so I guess something has changed. But, on the other hand, correct me if I am wrong: shouldn't this logic be incorporated into _detect_best_script_name? It fails precisely because of the strict mode, which is not permissive of common characters, and if _detect_best_script_name fails, the words in the dictionary are not updated correctly.

That logic sounds correct to me.

The solution might be as simple as setting that parameter to False.

I think disabling strict mode entirely would be too permissive: I suppose we don't want to allow, say, a mix of Cyrillic and Latin characters in a single word.
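
To illustrate the concern, here is a hypothetical non-strict variant that just takes the majority script per word; it would happily assign a script to a Cyrillic/Latin mixture instead of rejecting it. This is not proposed or existing wikipron code, only a demonstration.

```python
# Hypothetical majority-vote detector; shows why simply turning strict
# mode off (or voting per character) is too permissive.
import collections

import unicodedataplus


def detect_script_majority(word: str) -> str:
    """Returns the most frequent script among the word's characters."""
    counts = collections.Counter(unicodedataplus.script(char) for char in word)
    return counts.most_common(1)[0][0]


# Cyrillic "с" and "о" plus a Latin "p": a mixed-script word that a
# majority vote still labels as Cyrillic rather than rejecting.
print(detect_script_majority("\u0441\u043ep"))  # "Cyrillic"
```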

That was considered, but I don't remember what the solution was... I'll have to look into it further and report back.

I think our problem is line 62 here:
https://github.com/kylebgorman/wikipron/blob/5026d95d2a71b0823cd633b08c4350ad86ef0422/data/scrape/lib/common_characters.py#L59-L67
Should this be if unicodedataplus.script(char) == "Common": ?
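For reference, a quick look at the script property values being discussed; this is a standalone snippet, not code taken from common_characters.py.
```python
# The Unicode script property values in question.
import unicodedataplus

print(unicodedataplus.script("a"))       # "Latin"
print(unicodedataplus.script("'"))       # "Common": the ASCII apostrophe
print(unicodedataplus.script("\u0301"))  # "Inherited": combining acute accent
```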
I don't think _detect_best_script_name is the problem here. It was moved into languages_update.py and doesn't interact with the Common/Inherited regex extension stuff.

One potential problem with our approach: if there were no words in a language that all came from a single script (for example, if every Makasar word contained an apostrophe), then we wouldn't be able to assign a script, and I'm not sure what would happen after that.
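
A sketch of that edge case, under the assumption that the per-word detections get tallied and the most frequent script wins; the helper and the word list below are made up for illustration.

```python
# Made-up illustration of the edge case: if every word mixes scripts,
# a strict detector yields nothing to tally, so no script gets assigned.
import collections

import unicodedataplus


def strict_script(word):
    scripts = {unicodedataplus.script(char) for char in word}
    return scripts.pop() if len(scripts) == 1 else None


words = ["ana'", "bara'", "lompo'"]  # Illustrative apostrophe-final forms.
tallies = collections.Counter(
    script for word in words if (script := strict_script(word)) is not None
)
print(tallies.most_common(1))  # []: no single-script words, nothing to vote on.
```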