CUNY-CL / wikipron

Massively multilingual pronunciation mining

Audit zero-width characters

kylebgorman opened this issue · comments

Several languages use ZERO WIDTH SPACE and ZERO WIDTH NON JOINER, which, as the name suggests, aren't real characters. Let's look into why and see whether that's a bug upstream or an issue for Wikipron.

Let's look into why

It seems like sometimes ZERO WIDTH NON JOINER (U+200C) is used as a "space" character. For example, line 463 of fas_arab_broad.tsv is for the word "اتم‌ها", which has two parts ("اتم" and "ها") which are connected with a U+200C character. In this case, I think it might be better for wikipron if we replaced them with spaces (which are more consistent and easier to deal with).

However, there are a lot of other cases where ZERO WIDTH SPACE (U+200B) and U+200C are used at the end of words for no apparent reason. For example, line 4439 of mya_mymr_broad.tsv uses U+200B at the end of the word "အဓိကရုဏ်း​", and line 33 of new_deva_narrow.tsv uses U+200C at the end of "उसाँय्‌". These seem like pointless extra characters that could be removed.

Also, a few other languages (Malayalam, Punjabi, Marathi, Thai, Yamphu, Nepali) use ZERO WIDTH JOINER (U+200D). This character is used the same way as U+200B and U+200C -- as either a space or whitespace at the end of a word -- so I think we should treat it the same way as we treat U+200B or U+200C.

see whether that's a bug upstream

I'm pretty sure this is an issue on Wiktionary's end. In the web version of the Wiktionary entry for any of the words that use these characters (such as this one), all instances of the word on the page include the special characters in them, so it's definitely Wiktionary that's choosing to include these.

an issue for Wikipron

I don't think it's an issue for Wikipron, but it's probably still a good idea to either replace these characters with a space or filter them out when we're first processing the data from Wiktionary in scrape.py. Lmk if you think this would be a good idea.

Thanks for the nice report.

It seems like sometimes ZERO WIDTH NON JOINER (U+200C) is used as a "space" character. For example, line 463 of fas_arab_broad.tsv is for the word "اتم‌ها", which has two parts ("اتم" and "ها") which are connected with a U+200C character. In this case, I think it might be better for wikipron if we replaced them with spaces (which are more consistent and easier to deal with).

So I think that's a proper use of the character. It prevents the [ha] portion (which is the plural suffix here) from attaching to the base, which I assume is just Persian orthographic convention. However, because this is "ZERO WIDTH" it takes up no space.

This suggests a heuristic: if the ZERO WIDTH NON JOINER character is word-internal, it's valid.
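To make that heuristic concrete, here is a minimal Python sketch that labels each zero-width character in a word as leading, internal, or trailing. The function name `classify_zero_width` and the three-way labels are made up for illustration; this is not code from WikiPron itself.

```python
import unicodedata

# The three characters discussed in this thread:
# ZERO WIDTH SPACE, ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d"}


def classify_zero_width(word):
    """Returns (index, Unicode name, position) for each zero-width character.

    Hypothetical helper: position is "internal" if the character sits between
    two ordinary characters, otherwise "leading" or "trailing".
    """
    # Indices of the ordinary (non-zero-width) characters.
    core = [i for i, ch in enumerate(word) if ch not in ZERO_WIDTH]
    # If the word has no ordinary characters, everything counts as leading.
    first = core[0] if core else len(word)
    last = core[-1] if core else -1
    findings = []
    for i, ch in enumerate(word):
        if ch not in ZERO_WIDTH:
            continue
        if i < first:
            position = "leading"
        elif i > last:
            position = "trailing"
        else:
            position = "internal"
        findings.append((i, unicodedata.name(ch), position))
    return findings
```

Under the heuristic, only occurrences labeled "internal" would be presumed valid; the rest would be candidates for a closer look rather than automatic removal.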

However, there are a lot of other cases where ZERO WIDTH SPACE (U+200B) and U+200C are used at the end of words for no apparent reason. For example, line 4439 of mya_mymr_broad.tsv uses U+200B at the end of the word "အဓိကရုဏ်း​", and line 33 of new_deva_narrow.tsv uses U+200C at the end of "उसाँय्‌". These seem like pointless extra characters that could be removed.

Yes, these both seem like errors, ones that we probably want to fix "upstream".

Also, a few other languages (Malayalam, Punjabi, Marathi, Thai, Yamphu, Nepali) use ZERO WIDTH JOINER (U+200D). This character is used the same way as U+200B and U+200C -- as either a space or whitespace at the end of a word -- so I think we should treat it the same way as we treat U+200B or U+200C.

I'm pretty sure this is an issue on Wiktionary's end. In the web version of the Wiktionary entry for any of the words that use these characters (such as this one), all instances of the word on the page include the special characters in them, so it's definitely Wiktionary that's choosing to include these.

an issue for Wikipron

I don't think it's an issue for Wikipron, but it's probably still a good idea to either replace these characters with a space or filter them out when we're first processing the data from Wiktionary in scrape.py. Lmk if you think this would be a good idea.

So I see two potential action items:

  1. WikiPron (the base library, probably, not just the code under data/) should remove (trailing?) ZERO WIDTH NON JOINER and ZERO WIDTH JOINER characters from entries.
  2. We make a list of all the cases where these characters are used incorrectly and we fix them upstream and then rescrape.

Of these I think (2) makes more sense. It's unlikely they'll be added back in if we do this.
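For reference, both action items can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper names are hypothetical, nothing here is taken from WikiPron's code, and the TSV lines are assumed to hold the word in the first tab-separated column.

```python
# Characters named in action item (1): ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
JOINERS = "\u200c\u200d"
# For the audit in action item (2), ZERO WIDTH SPACE is included as well.
SUSPECT = set("\u200b" + JOINERS)


def strip_trailing_zero_width(word, chars=JOINERS):
    """Action item (1), hypothetical: drop trailing joiners, keep internal ones."""
    return word.rstrip(chars)


def find_suspect_entries(lines):
    """Action item (2), hypothetical: yields (line number, word) for TSV lines
    whose word ends in a zero-width character, as candidates to fix upstream."""
    for number, line in enumerate(lines, start=1):
        # Assumption: the word is the first tab-separated field.
        word = line.rstrip("\n").split("\t", 1)[0]
        if word and word[-1] in SUSPECT:
            yield number, word
```

Running `find_suspect_entries` over an open TSV file would give the list needed for (2); word-internal uses, like the Persian example, are left alone by both helpers.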

Having revisited @sonofthomp's notes on this, I think it is reasonable to assume these zero-width characters are being used correctly in at least some cases, and we can't be sure they're not being used correctly in others. I could imagine some day we'd want to strip leading and trailing ones, but I'm not confident enough to do it now, so I will close this for now. Thanks @sonofthomp.