feature request: add Japanese

micuat opened this issue 3 years ago · comments

I'd like to participate, but I don't write in English, so I'd like to do something in Japanese (which breaks the word count).

John Ohno · Answer 1 · Sat Nov 06 2021 22:32:27 GMT+0800 (China Standard Time)

We have had a lot of entries in languages other than English, but none in Japanese that I know of. It should be exciting! Do you know how you're going to count words?

Naoto Hieda · Answer 2 · Wed Nov 17 2021 16:58:16 GMT+0800 (China Standard Time)

I'm super behind but trying to catch up with the research... Since I use JavaScript (I don't want to use Python), I might remix a project like this:
https://github.com/kylestetz/metaphorpsum/blob/master/routes/index.js#L170
The way of using template sentences would be an easy start for conversion between English and Japanese

Naoto Hieda · Answer 3 · Fri Nov 19 2021 15:24:27 GMT+0800 (China Standard Time)

some weird things I tried: https://github.com/micuat/metaphorpsum/tree/nngm

en: 'In recent years, a pump is a seedy twist. This could be, or perhaps a cheese is a pudgy Sunday. If this was somewhat unclear, the first glary clave is, in its own way, a lotion. However, a postage is a dimming title. ',
ja: "近年、パンプス(ひもや金具がなく,甲のあいた靴)は (果物などが)種の多い〈糸・なわなど〉‘を'『よる』,より合わせる(糸・なわなどに)…‘を'よる《+名+into(in)+名》である。恐らく、強いて言えば『チーズ』は小さくてずんぐりした『日曜日』(キリスト教の安息日で週の第1日;《略》Sun.)である。それが不明瞭であれば、初めてのGLアRY cleaveの過去形は、ある意味では外用水薬;化粧水,ローションだ。しかし、『郵便料金』,郵送料は『薄暗い』,ほの暗い〈C〉(…の)『題名』,題目,題《+of(to)+名》である。"

First, I modified metaphorpsum so that it can simply print a random text to the console. Then I added Japanese translations of the template sentences. By overriding Sentencer's actions, the randomly chosen nouns and adjectives are stored on a stack and then translated into Japanese using ejdict.
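The record-then-translate idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the linked repository: `fillEnglish`, `fillJapanese`, the word lists, and the tiny `dictionary` object are hypothetical stand-ins for Sentencer and ejdict, and the recorded words are consumed in first-in, first-out order for simplicity.

```javascript
// Hypothetical stand-ins for Sentencer's word lists and for ejdict lookups.
const nouns = ["pump", "cheese"];
const adjectives = ["seedy", "pudgy"];
const dictionary = {
  pump: "パンプス",
  cheese: "チーズ",
  seedy: "種の多い",
  pudgy: "ずんぐりした",
};

function pick(list, random) {
  return list[Math.floor(random() * list.length)];
}

// Fill {{ noun }} / {{ adjective }} slots and record each chosen word in order.
function fillEnglish(template, random = Math.random) {
  const picked = [];
  const text = template.replace(/\{\{\s*(noun|adjective)\s*\}\}/g, (_, kind) => {
    const word = pick(kind === "noun" ? nouns : adjectives, random);
    picked.push(word);
    return word;
  });
  return { text, picked };
}

// Fill the Japanese template by consuming the recorded words in the same order.
function fillJapanese(template, picked) {
  const queue = [...picked];
  return template.replace(/\{\{\s*(noun|adjective)\s*\}\}/g, () => {
    const word = queue.shift();
    // Fall back to the English word when the dictionary has no entry.
    return dictionary[word] ?? word;
  });
}

const en = fillEnglish("In recent years, a {{ noun }} is a {{ adjective }} {{ noun }}.");
const ja = fillJapanese("近年、{{ noun }}は{{ adjective }}{{ noun }}である。", en.picked);
console.log(en.text);
console.log(ja);
```

Passing the random source as a parameter keeps the sketch easy to check with a fixed sequence.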

Challenges:

Because ejdict simply outputs the text of an English-Japanese dictionary, I cannot get clean results (in the case above, pump should translate to パンプス (well, I don't think that's right, though...), but instead it outputs パンプス(ひもや金具がなく,甲のあいた靴)). This actually looks funny and I like it, but the resulting text is way longer than the English text.
When a word is not in the dictionary, what should happen? I tried using hepburn to romanize the word, but it failed (e.g., GLアRY in the text above). I don't know if I'll keep it like this, because it's a bit too reminiscent of Superdry.
When the English template sentence contains the same type of action two or more times (e.g., "the {{ adjective }} {{ noun }} comes from {{ an_adjective }} {{ noun }}"), the occurrences have to be distinguished from each other, because the order might be flipped in the Japanese translation (but after all, who cares...)
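One possible way to handle the last challenge is to number the slots so that both language templates can refer to the same word regardless of order. The sketch below assumes a made-up slot syntax such as `{{ noun_1 }}`; it is not part of Sentencer, and `fillSlots` is a hypothetical helper.

```javascript
// Hypothetical slot syntax: {{ noun_1 }}, {{ adjective_2 }}; not part of Sentencer.
const SLOT = /\{\{\s*(noun|adjective)_(\d+)\s*\}\}/g;

// Replace each numbered slot with the word registered under the same key.
function fillSlots(template, words) {
  return template.replace(SLOT, (match, kind, index) => {
    const key = `${kind}_${index}`;
    return key in words ? words[key] : match; // leave unknown slots untouched
  });
}

const enWords = { noun_1: "postage", adjective_1: "dimming", noun_2: "title" };
const jaWords = { noun_1: "郵便料金", adjective_1: "薄暗い", noun_2: "題名" };

// The Japanese template mentions noun_2 before noun_1, yet each slot gets the right word.
console.log(fillSlots("the {{ noun_1 }} comes from a {{ adjective_1 }} {{ noun_2 }}", enWords));
console.log(fillSlots("{{ adjective_1 }}{{ noun_2 }}から{{ noun_1 }}が来る", jaWords));
```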

Next steps:

Add more templates. Even some kind of "scenes"... the choreography of text.
Slightly clean up the output from ejdict. Also if there are several interpretations, the result can be randomly chosen (currently I just take the first sentence).
Adding sentiment analysis or even simple word2vec may be fun because I don't think it's often done with Japanese.

Naoto Hieda · Answer 4 · Tue Nov 23 2021 19:18:40 GMT+0800 (China Standard Time)

here are my (close to final) results: english | japanese

I looked into the English-Japanese dictionary (ejdict) further. The output of ejdict looks like this

make
----
 …‘を'『作る』,製造する,建造する
 …‘を'『整える』,用意する
 …‘を'生じさせる,もたらす,引き起こす
 〈金など〉‘を'得る,もうける,〈財産など〉‘を'作る
 《行為・動作を表す名詞を目的語にして》…‘を'『する』,行う
 (ある状態・形態に)…‘を'『する』
 《『make』+『名』+do》〈人・動物など〉‘に'強制して(…)させる

Since the output is very cluttered, it is difficult to simply replace an English word with it, so I started writing regular expressions to clean it up:

https://github.com/micuat/metaphorpsum/blob/8f4d502330ae284fdfeabb0d92a2fd260f0e91a8/app.js#L183-L202
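As a rough illustration of this kind of cleanup (not the regular expressions in the linked file, which may differ), the sketch below strips the markers visible in the sample output above and keeps only the first synonym of one definition line. `cleanDefinition` is a hypothetical name, and the set of markers handled is an assumption based on the sample.

```javascript
// Illustrative cleanup of a single ejdict-style definition line.
function cleanDefinition(line) {
  return line
    .replace(/《[^》]*》/g, "")                        // grammar notes such as 《+of(to)+名》
    .replace(/〈[^〉]*〉/g, "")                        // typical objects such as 〈金など〉
    .replace(/[(\uFF08][^)\uFF09]*[)\uFF09]/g, "")    // parenthetical glosses (half- or full-width)
    .replace(/\u2026?\u2018[^'\u2019]*['\u2019]/g, "") // particle markers such as …‘を'
    .replace(/[『』\u2026]/g, "")                     // emphasis brackets and leftover ellipses
    .split(/[,\uFF0C;\uFF1B]/)[0]                     // keep only the first synonym
    .trim();
}

console.log(cleanDefinition("パンプス(ひもや金具がなく,甲のあいた靴)"));
console.log(cleanDefinition("〈C〉(\u2026の)『題名』,題目,題《+of(to)+名》"));
```

The order matters here: the 《…》 notes are removed first because they can themselves contain parentheses.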

I spent an hour or so on the regular expressions, and the result is still not perfect. Then I thought: what if I build a feature vector for each English word based on this process? For example, one flag is turned on if the entry contains …, and another if it matches 《.*》. The vector effectively represents how cluttered the word's entry is in the English-Japanese dictionary. (I had read an issue about word2vec, ml5js/ml5-library#1238, and was looking for an alternative way to find related words.) This is how the program chooses a word: it stores the last word's feature vector, randomly picks a few words into a pool, and selects the word in the pool whose feature vector is closest. I increased the pool size with every chapter, so I expect the first chapter to look more random and the later chapters to contain words that are similar in how cluttered their entries are in the E-J dictionary. Note that only nouns and adjectives are randomized in the sentences; the rest comes from the template.
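The selection step can be sketched as below. The thread does not list the exact flags or the distance measure, so the `FLAGS` patterns and the Hamming distance used here are assumptions, and `featureVector`, `distance`, and `chooseWord` are hypothetical names.

```javascript
// Hypothetical flags: each pattern marks one kind of clutter in a raw dictionary entry.
const FLAGS = [/\u2026/, /《.*》/, /〈.*〉/, /\(.*\)/, /『.*』/];

// One 0/1 value per flag: 1 if the entry contains that marker.
function featureVector(entry) {
  return FLAGS.map((flag) => (flag.test(entry) ? 1 : 0));
}

// Number of positions where two flag vectors differ (Hamming distance).
function distance(a, b) {
  return a.reduce((sum, value, i) => sum + (value !== b[i] ? 1 : 0), 0);
}

// Draw `poolSize` random candidates and keep the one closest to the previous word's vector.
// `entries` maps an English word to its raw dictionary text.
function chooseWord(entries, previousVector, poolSize, random = Math.random) {
  const words = Object.keys(entries);
  let best = null;
  let bestDistance = Infinity;
  for (let i = 0; i < poolSize; i++) {
    const word = words[Math.floor(random() * words.length)];
    const d = distance(featureVector(entries[word]), previousVector);
    if (d < bestDistance) {
      best = word;
      bestDistance = d;
    }
  }
  return best;
}
```

With a pool size of 1 the choice is purely random; a larger pool makes it more likely that the chosen word's entry resembles the previous one, which matches the idea of growing the pool chapter by chapter.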

Currently the number of sentence templates is very small, so you can see a lot of repetition. I might work on that, but it won't be the core of the project. What interests me now, as a Japanese person, is this: since most English education in Japan is based on reading, we are asked to look things up in dictionaries a lot. How does that shape Japanese people's competency in English, and how can I intervene in it?

Naoto Hieda · Answer 5 · Sat Nov 27 2021 14:37:34 GMT+0800 (China Standard Time)

For now, here are the final results, with more sentence templates added:

english | japanese