feature request: add Japanese
micuat opened this issue
I'd like to participate, but I don't write in English, so I'd like to do something in Japanese (and break the word count).
I'm super behind but trying to catch up with the research... Since I use JavaScript (I don't want to use Python), I might remix a project like this:
https://github.com/kylestetz/metaphorpsum/blob/master/routes/index.js#L170
Its use of template sentences would be an easy starting point for converting between English and Japanese.
some weird things I tried: https://github.com/micuat/metaphorpsum/tree/nngm
en: 'In recent years, a pump is a seedy twist. This could be, or perhaps a cheese is a pudgy Sunday. If this was somewhat unclear, the first glary clave is, in its own way, a lotion. However, a postage is a dimming title. ',
ja: "近年、 パンプス(ひもや金具がなく,甲のあいた靴)は (果物などが)種の多い 〈糸・なわなど〉‘を'『よる』,より合わせる(糸・なわなどに)…‘を'よる《+名+into(in)+名》 である。恐らく、強いて言えば 『チーズ』は 小さくてずんぐりした 『日曜日』(キリスト教の安息日で週の第1日;《略》Sun.)である。それが不明瞭であれば、初めてのGLアRY cleaveの過去形は、ある意味では 外用水薬;化粧水,ローションだ。しかし、 『郵便料金』,郵送料は 『薄暗い』,ほの暗い 〈C〉(…の)『題名』,題目,題《+of(to)+名》 である。"
First, I modified metaphorpsum so that it simply outputs a random text on the console. Then I added Japanese translations of the template sentences. By overriding Sentencer's actions, I store the random nouns and adjectives on a stack and translate them into Japanese using ejdict.
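The record-then-translate step can be sketched roughly as follows. This is a minimal stand-in and not the actual metaphorpsum or Sentencer code: `fillTemplate`, `pickers`, and the tiny `dictionary` map are hypothetical names, and the real project looks words up in ejdict rather than in a hard-coded map. Leaving unknown words in English is just one arbitrary fallback.

```javascript
// Hypothetical stand-in for a Sentencer-style template filler.
// Each placeholder calls a picker, and every picked word is pushed
// onto a stack so that it can be translated afterwards.
function fillTemplate(template, pickers, stack) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, action) => {
    const pick = pickers[action];
    if (!pick) return match; // leave unknown placeholders untouched
    const word = pick();
    stack.push({ action, word });
    return word;
  });
}

// Toy dictionary for illustration only; the real lookups go through ejdict.
const dictionary = new Map([
  ["pump", "パンプス"],
  ["seedy", "種の多い"],
]);

// Fixed pickers keep the example reproducible; real ones would be random.
const pickers = {
  noun: () => "pump",
  adjective: () => "seedy",
};

const stack = [];
const en = fillTemplate("a {{ noun }} is a {{ adjective }} twist.", pickers, stack);

// Translate the recorded words in the order they were picked,
// falling back to the English word when there is no entry.
const translated = stack.map(({ word }) => dictionary.get(word) ?? word);

console.log(en); // a pump is a seedy twist.
console.log(translated); // [ 'パンプス', '種の多い' ]
```

Keeping the picked words on a separate stack means the Japanese template can be filled afterwards from the same words instead of drawing new random ones.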
Challenges:
- Because ejdict simply outputs the text of an English-Japanese dictionary entry, I cannot get clean results (in the example above, pump should translate to パンプス (well, I don't think that's right, though...), but instead it outputs パンプス(ひもや金具がなく,甲のあいた靴)). This actually looks funny and I like it, but the resulting text is much longer than the English text.
- When the word is not in the dictionary, what should happen? I tried using hepburn to romanize the word, but it failed (e.g., GLアRY in the text above). I don't know if I'll keep it like this, because it's a bit too reminiscent of Superdry.
- When the English template sentence uses the same type of action two or more times (e.g.,
the {{ adjective }} {{ noun }} comes from {{ an_adjective }} {{ noun }}
), the occurrences have to be distinguished from each other, because the order might be flipped in the Japanese translation (but after all, who cares...)
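One possible way to tell repeated placeholders apart is to number them, so that the Japanese template can refer to the same words in a different order. The sketch below assumes a made-up `{{ noun_1 }}` / `{{ noun_2 }}` naming scheme, which is not Sentencer syntax. `fillIndexed`, the slot values, and the Japanese template are illustrative only, and the a/an handling of `an_adjective` is left out.

```javascript
// Replace each {{ name }} with the value stored under that name.
// Unknown names are left as they are.
function fillIndexed(template, slots) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(slots, name) ? slots[name] : match
  );
}

// The same four picks, once in English and once translated.
const enSlots = { adjective_1: "seedy", noun_1: "pump", adjective_2: "pudgy", noun_2: "cheese" };
const jaSlots = { adjective_1: "種の多い", noun_1: "パンプス", adjective_2: "ずんぐりした", noun_2: "チーズ" };

const enTemplate = "the {{ adjective_1 }} {{ noun_1 }} comes from a {{ adjective_2 }} {{ noun_2 }}";
// In Japanese the source ("from ...") comes first, so the slot order is flipped.
const jaTemplate = "{{ adjective_2 }}{{ noun_2 }}から{{ adjective_1 }}{{ noun_1 }}が来る";

console.log(fillIndexed(enTemplate, enSlots)); // the seedy pump comes from a pudgy cheese
console.log(fillIndexed(jaTemplate, jaSlots)); // ずんぐりしたチーズから種の多いパンプスが来る
```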
Next steps:
- Add more templates. Even some kind of "scenes"... the choreography of text.
- Slightly clean up the output from ejdict. Also if there are several interpretations, the result can be randomly chosen (currently I just take the first sentence).
- Adding sentiment analysis or even simple word2vec may be fun because I don't think it's often done with Japanese.
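Choosing randomly among several interpretations could be as small as the sketch below. It assumes that each interpretation sits on its own line of the entry. `pickSense` and its `rng` parameter are hypothetical names, not part of ejdict.

```javascript
// Pick one interpretation from a multi-line dictionary entry.
// `rng` defaults to Math.random and can be swapped for reproducible output.
function pickSense(entry, rng = Math.random) {
  const senses = entry
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (senses.length === 0) return null; // nothing to choose from
  return senses[Math.floor(rng() * senses.length)];
}

const sampleEntry = [
  "…‘を'『作る』,製造する,建造する",
  "…‘を'『整える』,用意する",
  "…‘を'生じさせる,もたらす,引き起こす",
].join("\n");

console.log(pickSense(sampleEntry)); // one of the three lines, chosen at random
```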
here are my (close to final) results: english | japanese
I looked into the English-Japanese dictionary (ejdict) further. The output of ejdict looks like this:
make
----
…‘を'『作る』,製造する,建造する
…‘を'『整える』,用意する
…‘を'生じさせる,もたらす,引き起こす
〈金など〉‘を'得る,もうける,〈財産など〉‘を'作る
《行為・動作を表す名詞を目的語にして》…‘を'『する』,行う
(ある状態・形態に)…‘を'『する』
《『make』+『名』+do》〈人・動物など〉‘に'強制して(…)させる
Since the output is very cluttered, and it's difficult to simply replace an English word with the output of ejdict, I started writing regular expressions to clean it up.
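A cleanup pass of that kind might look roughly like this. The patterns are my guesses based on the sample entries in this thread, not the regular expressions actually used in the project, and `cleanGloss` is a hypothetical name. It strips the annotations and keeps only the first gloss of a line.

```javascript
// Reduce one line of ejdict output to a single short gloss.
function cleanGloss(line) {
  const cleaned = line
    .replace(/《[^》]*》/g, "") // usage notes such as 《+名+into(in)+名》
    .replace(/〈[^〉]*〉/g, "") // typical objects such as 〈金など〉
    .replace(/[(（][^)）]*[)）]/g, "") // parenthesised explanations
    .replace(/…?‘[^'’]*['’]/g, "") // particle markers such as …‘を'
    .replace(/[『』]/g, "") // emphasis brackets (the word inside is kept)
    .trim();
  // Several glosses are separated by commas or semicolons; keep the first.
  return cleaned.split(/[,，;；]/)[0].trim();
}

console.log(cleanGloss("…‘を'『作る』,製造する,建造する")); // 作る
console.log(cleanGloss("パンプス(ひもや金具がなく,甲のあいた靴)")); // パンプス
console.log(cleanGloss("〈金など〉‘を'得る,もうける,〈財産など〉‘を'作る")); // 得る
```

The 《...》 notes are removed first because they can contain parentheses of their own. Some entries still come out oddly: the 《『make』+『名』+do》 line above becomes 強制してさせる.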
I spent an hour or so with regular expressions (and the result is still not perfect). Then I thought: what if I make a feature vector for each English word based on this process? For example, if the entry contains …, turn on a flag, and turn on another flag for 《.*》. The vector effectively represents how cluttered the word's entry is in the English-Japanese dictionary (I had read an issue about word2vec, ml5js/ml5-library#1238, so I was looking for an alternative way to find words). This is how the program chooses a word: it stores the last word's feature vector, randomly picks a few words into a pool, and finds the word in the pool with the closest feature vector. I increased the size of the pool with every chapter, so I expect the first chapter to look more random and the later chapters to have similar words, based on how cluttered each word is in the E-J dictionary (note that only nouns and adjectives are randomized in the sentences; the rest comes from the template).
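The selection step can be sketched as below. Only the … and 《.*》 flags come from the description above. The other three patterns, the squared-distance metric, and the names `featureVector`, `distance`, and `pickNext` are assumptions made for illustration.

```javascript
// One flag per kind of annotation found in a dictionary entry.
// The first two are from the description; the rest are guesses.
const FEATURE_PATTERNS = [/…/, /《.*》/, /〈.*〉/, /『.*』/, /[(（].*[)）]/];

function featureVector(entry) {
  return FEATURE_PATTERNS.map((pattern) => (pattern.test(entry) ? 1 : 0));
}

// Squared Euclidean distance; on 0/1 flags this counts the differing flags.
function distance(a, b) {
  return a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0);
}

// Draw `poolSize` random candidates and keep the one whose features
// are closest to those of the previous word.
function pickNext(words, entries, previousVector, poolSize, rng = Math.random) {
  let best = null;
  let bestDistance = Infinity;
  for (let i = 0; i < poolSize; i++) {
    const word = words[Math.floor(rng() * words.length)];
    const d = distance(featureVector(entries[word]), previousVector);
    if (d < bestDistance) {
      best = word;
      bestDistance = d;
    }
  }
  return best;
}

const entries = {
  make: "…‘を'『作る』,製造する,建造する",
  pump: "パンプス(ひもや金具がなく,甲のあいた靴)",
  title: "〈C〉(…の)『題名』,題目,題《+of(to)+名》",
};
const words = Object.keys(entries);

console.log(featureVector(entries.make)); // [ 1, 0, 0, 1, 0 ]
console.log(pickNext(words, entries, featureVector(entries.title), 3));
```

With a pool of one the pick is purely random. A larger pool gives the nearest-vector rule more candidates to choose from, which is why growing `poolSize` per chapter should make later chapters more uniform.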
Currently the number of sentence templates is very small, so you can see a lot of repetition. I might work on that, but it won't be the core of the project. What interests me now, as a Japanese person, is this: since most English education in Japan is based on reading, we are asked to look things up in dictionaries a lot. How does that shape Japanese people's competency in English, and how can I intervene in it?