robinhad / ukrainian-tts

Ukrainian TTS (text-to-speech) using ESPNET

Home Page: https://huggingface.co/spaces/robinhad/ukrainian-tts

Add stress for words from the RID app

acmpo6ou opened this issue

There is an app for learning Ukrainian words called RID. Unfortunately, it's not very well maintained and will be removed in June. I was able to reverse-engineer it, though, and download all 9580 words from its servers. You can download the data from this repo: https://github.com/acmpo6ou/rid-words

Here is an example of a word file:

{
    "id":11,
    "title":"Талалай",
    "description":"Той, хто багато, беззмістовно говорить. «— Досі, — впав у річ сповідальник, — ти мені здавався більше талалаєм, ніж чистобрехою, а втім, не знаю, за кого тебе мати надалі» (Мігель де Сервантес «Премудрий гідальго Дон Кіхот з Ламанчі», перекл.\t Микола Лукаш).",
    "html_description":"\u003cp\u003eТой, хто багато, беззмістовно говорить.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cem\u003e\u0026laquo;\u0026mdash; Досі, \u0026mdash; впав у річ сповідальник, \u0026mdash; ти мені здавався більше \u003cstrong\u003eталалаєм\u003c/strong\u003e, ніж чистобрехою, а втім, не знаю, за кого тебе мати надалі\u0026raquo; (Мігель де Сервантес \u0026laquo;Премудрий гідальго Дон Кіхот з Ламанчі\u0026raquo;, перекл. Микола Лукаш).\u003c/em\u003e\u003c/p\u003e\r\n",
    "word_category_id":2,
    "stresses":[
        6
    ],
    "word_images":[
        "/uploads/word_image/photo/16772/crop_version_ok-4946387_960_720.webp"
    ],
    "done":false,
    "favorite":false,
    "shared_link":"http://rid.ck.ua/sharing/talalaj"
}

The interesting fields are title, which is the word itself, and stresses, which is an array of stress positions for the word (a word can have multiple stresses).
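To show how these two fields could be used, here is a minimal Python sketch that reads one word record and marks the stressed letters. It rests on an assumption inferred only from the sample above: a value in stresses is the 1-based position of the stressed letter in title (6 points at the last "а" in "Талалай"), so inserting a mark at that string offset places it right after the stressed vowel. The function names and the choice of the combining acute accent (U+0301) as the stress mark are mine, for illustration only; the dictionary in this repo may expect a different notation.

```python
import json

# Combining acute accent, placed after the stressed vowel (illustrative choice).
COMBINING_ACUTE = "\u0301"


def load_entry(raw_json):
    """Return (title, stresses) from one RID word record given as a JSON string."""
    record = json.loads(raw_json)
    return record["title"], record.get("stresses", [])


def mark_stresses(title, stresses):
    """Insert a stress mark after each stressed letter.

    Assumption: each value in `stresses` is the 1-based position of the
    stressed letter, so it is also the offset at which the mark is inserted.
    """
    marked = title
    # Work from right to left so earlier offsets are not shifted by insertions.
    for position in sorted(set(stresses), reverse=True):
        marked = marked[:position] + COMBINING_ACUTE + marked[position:]
    return marked


# Trimmed-down version of the sample record above.
sample = '{"id": 11, "title": "Талалай", "stresses": [6]}'
title, stresses = load_entry(sample)
print(mark_stresses(title, stresses))
```

If the positions turn out to be 0-based or to count something other than letters, only the offset arithmetic in mark_stresses would need to change.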

Using this data, you can expand your dictionary with more words and their stresses. I would contribute a PR myself, but I'm not sure how. I found a file, stress.trie, that probably stores all the stresses, but I'm not sure how to edit it.

Unfortunately, I can't use it because it was obtained through reverse engineering.

What? Really? I don't think they would mind, tbh. RID is a volunteer project, and I think if you asked for their permission, they would give it. I tried emailing them myself, but they didn't reply (so I decided to do the reverse engineering). They have an Instagram, so maybe asking there would work better (I don't use Instagram, though).

But I do think the database would help: it has some nice, fancy words along with their stresses.