facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

About the function creep and 9 improvements to the file est_Latn_twl.txt

coder037 opened this issue · comments

Listed here are some of the more serious problems with the Estonian "toxic" wordlist.
The specific improvement items for the toxicity list (ref: est_Latn_twl.txt) are below, right after the introduction. Warning: because these improvements concern the "toxicity list", explicitly naming certain body parts is unavoidable.

A major problem with META's understanding of "toxicity" is that its scope and rules are undefined. What are the conditions and limitations for using this censorship list? When and for what is it used? If usage scenarios are not defined, your list can seriously harm translation cases outside of your primary focus (due to the function creep phenomenon). The issue of substring matching belongs to the same category. You can read more about function creep in the MSc dissertation of Manon Jacobs: “Function Creep in Surveillance Situations: Identifying control paradoxes through agency and power relations using ANT” (2016).

Your disclaimer says "The primary purpose of such lists is to help with translation model safety by monitoring for hallucinated toxicity. By hallucinated toxicity, we mean the presence of toxic items in the translated text when no such toxic items can be found in the source text." However, this context is entirely insufficient for the purpose and will result in misuse (at a minimum, due to function creep). Censorship is an invasive technology. Making an invasive technology freely available without pinpointing the accompanying issues is unethical.

I really do not understand how the business logic can decide, e.g., that "mdv" in an Estonian target text is toxic due to a hallucination when the original Finnish "mitä vittu" is missing from your lists. You are unable to trace this origin and could wrongly mark "mdv" as an AI MT hallucination, which it is not. I feel your "hallucination criteria" may be wrong, and the "hallucination" concept looks like a bad pretext for applied commercial censorship.
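
To make the concern concrete, here is a minimal sketch of a list-based check that follows the definition quoted from the disclaimer (toxic item in the target, none found in the source). The function names and the toy wordlists are my own assumptions for illustration; I do not know how the actual pipeline is implemented.

```python
def contains_listed_item(text, wordlist):
    """Return True if any wordlist entry occurs as a whole word or phrase."""
    # Normalise case and whitespace, then pad so entries match on word boundaries.
    padded = " " + " ".join(text.lower().split()) + " "
    return any(" " + entry + " " in padded for entry in wordlist)


def is_hallucinated_toxicity(source, target, source_list, target_list):
    """Flag the target when it has a listed item and the source has none."""
    return contains_listed_item(target, target_list) and not contains_listed_item(
        source, source_list
    )


# Toy lists (assumptions): the Estonian list has "mdv",
# while the Finnish list lacks "mitä vittu".
est_list = {"mdv"}
fin_list_incomplete = set()
fin_list_complete = {"mitä vittu"}

# A faithful translation is flagged only because the source list is incomplete.
flagged = is_hallucinated_toxicity("mitä vittu", "mdv", fin_list_incomplete, est_list)
not_flagged = is_hallucinated_toxicity("mitä vittu", "mdv", fin_list_complete, est_list)
```

Under these assumptions, the same translation is reported as a hallucination or not depending solely on whether the source-language list happens to contain the original phrase.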

Furthermore, your disclaimer does not appear to be fully correct about the purpose, because another source (https://ai.facebook.com/blog/nllb-200-high-quality-machine-translation/) reveals "25 billion translations served every day" and the scope being "to spot harmful content and misinformation" (not at all saving the world from "hallucinated toxicity"). At a minimum, you should state and justify the purpose consistently across all of your texts.

The summary:

A. For languages like Estonian, no wordlist-based censorship can guarantee results, while it will certainly damage the original pragmatics of the source text.

B. Publishing a censorship list (and technology) without indicating its limitations, while blurring its true purpose, is unethical.

In general, bringing the 56 African languages into the Internet community is a beautiful idea.

ref: est_Latn_twl.txt

  1. The username for the toxicity-200 resource file is huihuifan.
    In file est_Latn_twl.txt, the word "hui" is listed on line 57.
    Should the uploader now be blocked twice on GitHub? Would the name actually be blocked in FB instant messaging with an Estonian? (The word originates from a Russian word denoting the male member; Slavic curse words are popular and often included in Estonian speech and instant messaging.)

  2. Due to the toxicity lists, translating the Christian Bible is no longer possible, because it requires the concept that the words on lines 3-56 express (see James 2:25 about Rahab). This is quite unexpected, considering that the baseline level of the toxicity project was claimed to be calibrated against the Christian Bible corpus.

  3. Lines 60-62 contain multiple words separated by a space. With wrong tokenization, errors may arise. "ime" means both "suck" and "miracle" in Estonian. The issue with the space character seems to be wider; see also lines 83-84 and, e.g., 231. Although there are some disclaimers on your site, censorship lists are very often misused through function creep. That kind of unintended use should be explicitly warned against, which has not been done.

  4. For the word on lines 64-79, the singular nominative form is not present. The question arises why declining a word is prohibited while its base form is allowed. Plural forms are also missing. What is the logic behind this approach? The full paradigm of an Estonian noun is 28 forms (14 cases, each in singular and plural), plus there are occasional "short forms" (for certain cases) that you do not seem to account for at all. Also, every generation is known to have its own profanities. More classical profanities (besides the instant-messaging ones) seem to be missing from the list.

  5. The word on line 80, "krt" - why is this prohibited? It is a very mild abbreviation for the Devil. I do not see the Devil listed under all his names, only on lines 80-82. All other case forms (remember - 28 in the full paradigm) are missing for this category, as well as a couple of synonyms.

  6. It is unclear whether or not substrings count. The words on lines 183 and 187 are substrings of the word "politsei" (the police). The word on line 109 is a substring of the Estonian word "trollibusse" - "trolley bus" in Plural Accusative case. There are more casualties, e.g. the word on line 586 is a substring of the Estonian equivalent of "thyratrons". Then, the word on line 477 is a substring of the Estonian equivalent of "closing ceremony". There are more examples. This matters because of compound words and possible function creep usage of the list.

  7. As a native speaker, I do not understand line 235. In the Business Register of Estonia, e.g., a company "MDV Ekspress OÜ" exists (https://www.teatmik.ee/et/personlegal/11718798-MDV-Ekspress-O%C3%9C). That means the abbreviation MDV is absolutely legal. The origin of "mdv" is Finnish, while the corresponding "mitä vittu" is missing from the file fin_Latn_twl.zip. The reviewer bias is obvious.

  8. The word on line 278 means "nazi". Why should this word be suppressed? It is a legal word. Israel cannot continue with their holocaust propaganda if you apply this. Why aren't you suppressing the "communists"?

  9. The word on line 558 is required to explain both biology (calyx) and armour (sheath). Its use is in no way limited to the topic of toxicity.
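
The substring question from item 6 (and the casing question from item 7) can be illustrated with a minimal sketch. It uses only strings named in this issue ("hui", "huihuifan", "mdv", "MDV Ekspress OÜ"); the two matching functions are hypothetical and are not claimed to reflect how the list is actually applied.

```python
import re


def substring_hits(text, wordlist):
    """Entries found anywhere in the text, including inside longer words."""
    lowered = text.lower()
    return sorted(entry for entry in wordlist if entry in lowered)


def whole_word_hits(text, wordlist):
    """Entries that match a complete token of the text (case-insensitive)."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return sorted(entry for entry in wordlist if entry in tokens)


# Toy list (assumption): two entries taken from the items above.
toy_list = {"hui", "mdv"}

username = "huihuifan"
company = "MDV Ekspress OÜ"

username_substring = substring_hits(username, toy_list)   # ['hui']
username_whole_word = whole_word_hits(username, toy_list)  # []
company_whole_word = whole_word_hits(company, toy_list)    # ['mdv']
```

Whole-word matching avoids the hit inside the username, but a case-insensitive whole-word match still flags the registered company name, so neither strategy is safe without a documented usage scope.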

I leave the action to your discretion.

Thank you for submitting this issue and contributing these nine proposed improvements. We're going to discuss them with our translators.
You can find more information about the ethical aspects of this research work in this paper, more specifically Section 7.3, which should clarify what we mean by monitoring (i.e. monitoring in the sense of flagging/detecting translation model errors rather than surveillance/censorship).