NaNoGenMo / 2021

National Novel Generation Month, 2021 edition.

feature request: add Japanese

micuat opened this issue · comments

I'd like to participate, but I don't write in English, so I'd like to do something in Japanese (and break the word count).

I'm super behind but trying to catch up with the research... Since I use JavaScript (I don't want to use Python), I might remix a project like this:
https://github.com/kylestetz/metaphorpsum/blob/master/routes/index.js#L170
Its template sentences would be an easy starting point for converting between English and Japanese.

some weird things I tried: https://github.com/micuat/metaphorpsum/tree/nngm

en: 'In recent years, a pump is a seedy twist. This could be, or perhaps a cheese is a pudgy Sunday. If this was somewhat unclear, the first glary clave is, in its own way, a lotion. However, a postage is a dimming title. ',
ja: "近年、 パンプス(ひもや金具がなく,甲のあいた靴)は (果物などが)種の多い 〈糸・なわなど〉‘を'『よる』,より合わせる(糸・なわなどに)…‘を'よる《+名+into(in)+名》 である。恐らく、強いて言えば 『チーズ』は 小さくてずんぐりした 『日曜日』(キリスト教の安息日で週の第1日;《略》Sun.)である。それが不明瞭であれば、初めてのGLアRY cleaveの過去形は、ある意味では 外用水薬;化粧水,ローションだ。しかし、 『郵便料金』,郵送料は 『薄暗い』,ほの暗い 〈C〉(…の)『題名』,題目,題《+of(to)+名》 である。"

First, I modified metaphorpsum so that it simply prints a random text to the console. Then I added Japanese translations of the template sentences. By overriding Sentencer's actions, the random nouns and adjectives are stored on a stack and then translated into Japanese using ejdict.
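The record-then-translate idea can be sketched roughly like this. It is not the code in the repository linked above: `fillTemplate`, `translatePair`, the word lists, and the tiny `dict` object are hypothetical stand-ins for Sentencer's actions and for the ejdict lookup.

```javascript
// Hypothetical stand-ins for Sentencer's word lists and for an ejdict lookup.
const words = { noun: ['pump', 'cheese'], adjective: ['seedy', 'pudgy'] };
const dict = { pump: 'パンプス', cheese: 'チーズ', seedy: '種の多い', pudgy: 'ずんぐりした' };

// Fill {{ noun }} / {{ adjective }} slots, calling pick(kind) for each slot in order.
function fillTemplate(template, pick) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_, kind) => pick(kind));
}

// Generate the English sentence first and record every random word on a stack,
// then replay the stack (in the same order) into the Japanese template.
function translatePair(pair, random = Math.random) {
  const stack = [];
  const en = fillTemplate(pair.en, (kind) => {
    const list = words[kind];
    const word = list[Math.floor(random() * list.length)];
    stack.push(word);
    return word;
  });
  let i = 0;
  const ja = fillTemplate(pair.ja, () => {
    const word = stack[i++];
    return dict[word] || word; // fall back to the English word if it is missing
  });
  return { en, ja };
}

const pair = {
  en: 'a {{ noun }} is a {{ adjective }} {{ noun }}.',
  ja: '{{ noun }}は{{ adjective }}{{ noun }}である。',
};
console.log(translatePair(pair));
```

Replaying the stack in order only works while both templates list their slots in the same order.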

Challenges:

  • Because ejdict simply outputs the text of an English-Japanese dictionary entry, I cannot get clean results. In the case above, pump should translate into パンプス (well, I don't think that's right, though...), but instead it outputs パンプス(ひもや金具がなく,甲のあいた靴). This actually looks funny and I like it, but the resulting text is way longer than the English text.
  • When the word is not in the dictionary, what should happen? I tried using hepburn to romanize the word, but it failed (e.g., GLアRY in the text above). I don't know if I'll keep it like this, because it's a bit too reminiscent of Superdry.
  • When an English template sentence has the same type of action two or more times (e.g., "the {{ adjective }} {{ noun }} comes from {{ an_adjective }} {{ noun }}"), the occurrences have to be distinguished from each other, because the order might be flipped in the Japanese translation (but after all, who cares...).
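One possible way to handle the flipped-order problem in the last point is to number the slots so that both templates can refer to the same word. This is only a sketch: the `{{ kind:index }}` syntax and the `fillIndexed` helper are hypothetical and not part of Sentencer.

```javascript
// Hypothetical slot syntax {{ kind:index }}; Sentencer itself does not define it.
const SLOT = /\{\{\s*(\w+):(\d+)\s*\}\}/g;

// Replace each numbered slot with the value stored under "kind:index".
function fillIndexed(template, values) {
  return template.replace(SLOT, (match, kind, index) => {
    const key = `${kind}:${index}`;
    return key in values ? values[key] : match; // leave unknown slots untouched
  });
}

const enTemplate = 'the {{ noun:1 }} comes from a {{ noun:2 }}';
const jaTemplate = '{{ noun:2 }}から{{ noun:1 }}が生まれる'; // word order flipped

const enValues = { 'noun:1': 'cheese', 'noun:2': 'postage' };
const jaValues = { 'noun:1': 'チーズ', 'noun:2': '郵便料金' };

console.log(fillIndexed(enTemplate, enValues)); // the cheese comes from a postage
console.log(fillIndexed(jaTemplate, jaValues)); // 郵便料金からチーズが生まれる
```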

Next steps:

  • Add more templates. Even some kind of "scenes"... the choreography of text.
  • Slightly clean up the output from ejdict. Also if there are several interpretations, the result can be randomly chosen (currently I just take the first sentence).
  • Adding sentiment analysis or even simple word2vec may be fun because I don't think it's often done with Japanese.

here are my (close to final) results: english | japanese

I looked further into the English-Japanese dictionary (ejdict). Its output looks like this:

make
----
 …‘を'『作る』,製造する,建造する
 …‘を'『整える』,用意する
 …‘を'生じさせる,もたらす,引き起こす
 〈金など〉‘を'得る,もうける,〈財産など〉‘を'作る
 《行為・動作を表す名詞を目的語にして》…‘を'『する』,行う
 (ある状態・形態に)…‘を'『する』
 《『make』+『名』+do》〈人・動物など〉‘に'強制して(…)させる

Since the output is very cluttered, and it's difficult to simply replace an English word with it, I started writing regular expressions to clean it up:

https://github.com/micuat/metaphorpsum/blob/8f4d502330ae284fdfeabb0d92a2fd260f0e91a8/app.js#L183-L202
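A cleanup pass of this kind might look like the sketch below. It is not the code at the link above: `cleanEntry` is a hypothetical helper, and its patterns are assumptions based only on the marks visible in the `make` sample and the パンプス example.

```javascript
// Reduce one ejdict-style sense line to a single plain word (a rough sketch).
function cleanEntry(entry) {
  return entry
    .replace(/《[^》]*》/g, '')        // usage notes such as 《+名+into(in)+名》
    .replace(/〈[^〉]*〉/g, '')        // object hints such as 〈金など〉
    .replace(/\([^)]*\)/g, '')        // half-width parenthesised glosses
    .replace(/（[^）]*）/g, '')        // full-width parenthesised glosses
    .replace(/…?‘[^'’]*['’]/g, '')    // particle markers such as …‘を'
    .replace(/[『』]/g, '')            // emphasis brackets, keeping the word inside
    .split(/[,，;；]/)[0]              // keep only the first alternative
    .trim();
}

console.log(cleanEntry(" …‘を'『作る』,製造する,建造する"));
console.log(cleanEntry('パンプス(ひもや金具がなく,甲のあいた靴)'));
```

Always keeping the first alternative is the simplest choice; picking one at random would only need a different index after the `split`.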

I spent an hour or so on the regular expressions (and the result is still not perfect). Then I thought: what if I make a feature vector for each English word based on this process? For example, if the entry contains a given pattern, turn on a flag, and turn on another flag for 《.*》. The vector effectively represents how cluttered the word is in an English-Japanese dictionary (I had read an issue about word2vec at ml5js/ml5-library#1238 and was looking for an alternative way to find words). This is how the program chooses a word: it stores the last word's feature vector, randomly picks a few words into a pool, and takes the word in the pool whose feature vector is closest. I increase the size of the pool with every chapter, so I expect the first chapter to look more random and the later chapters to contain words that are similar in how cluttered they are in the E-J dictionary (note that only nouns and adjectives are randomized in the sentences; the rest comes from the template).
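The selection step can be sketched as follows. Only the 《.*》 flag comes from the description above; the other patterns, the helper names (`featureVector`, `distance`, `pickNext`), and the sample entries are assumptions for illustration.

```javascript
// Hypothetical clutter features: one 0/1 flag per pattern found in the raw entry.
const PATTERNS = [/《.*》/, /〈.*〉/, /『.*』/, /\(.*\)/];

function featureVector(entry) {
  return PATTERNS.map((re) => (re.test(entry) ? 1 : 0));
}

// Squared Euclidean distance between two equal-length vectors.
function distance(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// Draw poolSize random candidates and keep the one closest to the previous word.
function pickNext(candidates, lastVector, poolSize, random = Math.random) {
  let best = null;
  for (let i = 0; i < poolSize; i++) {
    const c = candidates[Math.floor(random() * candidates.length)];
    const d = distance(featureVector(c.entry), lastVector);
    if (best === null || d < best.d) best = { word: c.word, entry: c.entry, d };
  }
  return best;
}

const candidates = [
  { word: 'make', entry: "《『make』+『名』+do》〈人・動物など〉‘に'強制して(…)させる" },
  { word: 'cheese', entry: '『チーズ』' },
  { word: 'lotion', entry: '外用水薬;化粧水,ローション' },
];

// A pool of 1 is a purely random pick; a larger pool pulls the choice
// toward words whose entries look like the previous one.
const last = featureVector('『日曜日』');
console.log(pickNext(candidates, last, 1).word);
console.log(pickNext(candidates, last, 10).word);
```

Growing `poolSize` per chapter then gives the effect described above: early chapters stay close to random, later ones drift toward similarly cluttered words.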

Currently the number of sentence templates is very small, so you can see a lot of repetition. I might work on it, but it won't be the core of the project. As a Japanese person, what interests me now is this: since most English education in Japan is based on reading, we are asked to look things up in dictionaries a lot. How does that shape Japanese people's competency in English, and how can I intervene in it?

for now here are the final results, adding more sentence templates

english | japanese