Add more languages corpora, tools and research

Question

Add more languages corpora, tools and research

NirantK opened this issue 6 years ago · comments

Indic

Asian

Chinese
Korean
Japanese

We should be able to add content regarding Indian/Indic languages as well, keeping in mind the growth of India Stack and need for Indic tools.

Nirant · Answer 1 · Sun Feb 11 2018 17:27:27 GMT+0800 (China Standard Time)

I will be working on this issue for a brief duration.

I'd love to assist you @the-ethan-hunt if you are up to take the lead on this.

Dhruv Apte · Answer 2 · Sun Feb 11 2018 20:38:50 GMT+0800 (China Standard Time)

@NirantK , I would be happy to be assisted by you! But the pond is too large and the fish too small. 😅

Nirant · Answer 3 · Sun Feb 11 2018 20:48:41 GMT+0800 (China Standard Time)

Let's start small, let's get Hindi rolling with some work on datasets and pre trained models. What do you think?

…

On 11 Feb 2018 6:08 pm, "Dhruv Apte" ***@***.***> wrote: @NirantK <https://github.com/nirantk> , I would be happy to be assisted by you! But the pond is too large and the fish too small. 😅 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADGaPUZAnlhPAAVOfNcpeSGW38t9soMcks5tTt9dgaJpZM4SBRZl> .

Dhruv Apte · Answer 4 · Sun Feb 11 2018 20:50:41 GMT+0800 (China Standard Time)

You are right!

Dhruv Apte · Answer 5 · Sun Feb 11 2018 20:57:34 GMT+0800 (China Standard Time)

@NirantK , a simple GitHub search is leading nowhere. Any leads to start this? 😅

Nirant · Answer 6 · Sun Feb 11 2018 21:01:32 GMT+0800 (China Standard Time)

Go old school. Search Google scholar for work by keywords like Hindi language parsing, modeling etc. Skip everything which says translation. That's often lazy work.

…

On 11 Feb 2018 6:27 pm, "Dhruv Apte" ***@***.***> wrote: @NirantK <https://github.com/nirantk> , a simple GitHub search is leading nowhere. Any leads to start this? 😅 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADGaPcRO8ft0QWc_8UBBi2-QzF7zpnxGks5tTuPBgaJpZM4SBRZl> .

Dhruv Apte · Answer 7 · Mon Feb 12 2018 20:51:32 GMT+0800 (China Standard Time)

@NirantK , I went old school and discovered two good papers worth to be mentioned in this list:

A POS tagger and chunker system for Hindi language using Maximum Entropy Markov Model. Here is the paper
A lightweight stemmer for Hindi link
IMHO, there has been negligible research conducted for NLP in Tamil, Telugu, Marathi and other Indian languages

Nirant · Answer 8 · Tue Feb 13 2018 13:59:27 GMT+0800 (China Standard Time)

Welcome @arpitabatra to the thread. She did her thesis on Hindi Text Processing. Some of the cool stuff she mentioned:

English Hindi Parallel Corpus http://www.cfilt.iitb.ac.in/iitb_parallel/
Hindi Wordnet http://www.cfilt.iitb.ac.in/wordnet/webhwn/
Sanskrit Mini Corpora from JNU: http://sanskrit.jnu.ac.in/corpora/tagset.jsp
Indo Wordnet, no download option, but supports Gujarati and other Indic languages http://www.cfilt.iitb.ac.in/indowordnet/index.jsp
Hindi Treebank http://ltrc.iiit.ac.in/treebank_H2014/

Thanks for the search @the-ethan-hunt, they look good. Let's go a little wide in the beginning and then we can trim down. Sounds good?

Dhruv Apte · Answer 9 · Tue Feb 13 2018 14:24:10 GMT+0800 (China Standard Time)

@NirantK , sure!
Thanks for the stuff Arpita! And welcome to awesome-nlp! 😄

arpitabatra · Answer 10 · Thu Feb 15 2018 13:57:30 GMT+0800 (China Standard Time)

POS tagging related papers.

Morphological Richness Offsets Resource Demand- Experiences in Constructing a POS Tagger for Hindi
link
Building Feature Rich POS Tagger for Morphologically Rich Languages: Experiences in Hindi
link
Hindi POS Tagger Using Naive Stemming : Harnessing Morphological Information Without Extensive Linguistic Knowledge
link

arpitabatra · Answer 11 · Thu Feb 15 2018 14:04:31 GMT+0800 (China Standard Time)

@NirantK and @the-ethan-hunt : shall we explore some papers which are not only statistics based but also uses some linguistic cues? Since the datasets of large size are unavailable for Hindi, it will become difficult to train the models.

Nirant · Answer 12 · Thu Feb 15 2018 18:24:49 GMT+0800 (China Standard Time)

Sure @arpitabatra, that's a good insight. We should definitely look into those. Please do help us around that.

Sidenote: If there are any glaring holes in Hindi Text Processing, please mention them as well, they can become research avenues for people after us. We can note and detail those challenges in a separate repository/markdown file as well.

@the-ethan-hunt, I think we both can focus on Gujarati/other Indic languages as @arpitabatra has been kind enough to share her expertise with us on Hindi. What do you think?

Edit: I've added Gujarati in the task list above keeping in mind the comment by @the-ethan-hunt stating that prima facie, no good work was found for Tamil, Telegu and Bengali.

Dhruv Apte · Answer 13 · Thu Feb 15 2018 20:54:54 GMT+0800 (China Standard Time)

@NirantK , yes I agree with you. And while tinkering around, I found this. There are huge treebanks of several languages(both Indian and foreign).
Should I make a PR for this?

Nirant · Answer 14 · Thu Feb 15 2018 21:29:20 GMT+0800 (China Standard Time)

Hey, @the-ethan-hunt that's a good find.

Let's link to Hindi specific work for now.

Maybe we need to look into more tooling, datasets and academic work beyond treebanks, POS taggers and actually compile the best from what is out there?

If I was starting looking into Hindi NLP, the above list of work is not even 20-30% of what I'd need to get started.

Dhruv Apte · Answer 15 · Fri Feb 16 2018 10:53:52 GMT+0800 (China Standard Time)

@NirantK , there is also this treebank prepared by several American universities. The thing here is, that both the mentioned treebanks are annotated ones; this would largely help linguistics experts and NLP scientists to stop using their time annotating their corpora.

Nirant · Answer 16 · Fri Feb 16 2018 11:07:54 GMT+0800 (China Standard Time)

That is already in the list from @arpitabatra :)

Dhruv Apte · Answer 17 · Sat Feb 17 2018 14:28:21 GMT+0800 (China Standard Time)

@NirantK , any points we can start working? Like the Universal Dependencies thing? 😄

Nirant · Answer 18 · Sat Feb 17 2018 15:20:04 GMT+0800 (China Standard Time)

@the-ethan-hunt
Why we need data to work in NLP?

Dictionaries and WordNets are useful for syntactic tasks
Large text corpus is useful for lot of tasks such as text classification, text embeddings, and so on

Then, in terms of data, we need the following:

Dictionaries e.g. Gujarati to English and vice versa
Large News corpus similar to CNN or DailyMail ones

I hope that this helps us streamline our efforts. I will look into large news corpus, if unavailable at least list them down a few major websites which we can use to generate that dataset.

Nirant · Answer 19 · Sat Feb 17 2018 15:22:49 GMT+0800 (China Standard Time)

Hey @arpitabatra @the-ethan-hunt, please go ahead and raise one PR for Hindi datasets (excluding the work on POS, Stemming etc) as soon as you have sometime?

The work isn't quite sufficient to get started in terms of tools, but I think we should share the datasets atleast as we've done for Spanish.

Dhruv Apte · Answer 20 · Tue Sep 18 2018 21:10:14 GMT+0800 (China Standard Time)

@NirantK , regarding the shift of the language section to NLP-Progress as discussed in this thread, should I raise PRs for new resources here or at NLP-Progress?
Does it sound alright, @sebastianruder ? 😅

Nirant · Answer 21 · Tue Sep 18 2018 21:14:44 GMT+0800 (China Standard Time)

@the-ethan-hunt

If there are performance numbers available, or high user trust in that lib - raise it directly at NLP-Progress.
If not, raise them there here for now.

There is a lot of work which does not have results. E.g. datasets, Python libs in Arabic/Hindi etc.
They are good enough for programmers quite often. We can discuss and sort those edge cases out.

Sebastian Ruder · Answer 22 · Tue Sep 18 2018 21:16:34 GMT+0800 (China Standard Time)

Just to be clear: awesome-nlp should stay awesome, so we shouldn't remove anything from here from now and awesome-nlp should still be the place where libraries and tools, etc. are collected.
As @NirantK mentions, anything with reported results and standard evaluation setups can be added to nlpprogress.

Pawan Sasanka Ammanamanchi · Answer 23 · Fri Sep 28 2018 13:38:50 GMT+0800 (China Standard Time)

@NirantK i think this library can be added as tool for indic languages
http://anoopkunchukuttan.github.io/indic_nlp_library/

Nirant · Answer 24 · Fri Sep 28 2018 13:45:08 GMT+0800 (China Standard Time)

Thanks for the hunt! That was not included here because - The core lib has not been updated in 4+ years, is still Py2.7 for some reason - Is not Unicode compliant for some reason? I had some trouble installing it as well on Linux and Windows. In other words, this is not awesome.

…

On Fri, 28 Sep 2018 at 11:08, Pawan Sasanka Ammanamanchi < ***@***.***> wrote: @NirantK <https://github.com/NirantK> i think this library can be added as tool for indic languages http://anoopkunchukuttan.github.io/indic_nlp_library/ — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADGaPYrByQ02rwe6QSd7c7mklgS5lKh3ks5ufbXxgaJpZM4SBRZl> .

Pawan Sasanka Ammanamanchi · Answer 25 · Fri Sep 28 2018 13:45:16 GMT+0800 (China Standard Time)

also @NirantK I think ACL 2018 highlights by Sebastian ruder should be added to the research trends and summaries

Nirant · Answer 26 · Fri Sep 28 2018 13:46:46 GMT+0800 (China Standard Time)

@Shashi456 please raise a MR for Ruder's highlights with a 1 line explanation and we'll review the same?

Pawan Sasanka Ammanamanchi · Answer 27 · Fri Sep 28 2018 13:47:17 GMT+0800 (China Standard Time)

@NirantK do you know of any Indic libraries other than that, i've been scourging the internet for some but have found none satisfiable

Nirant · Answer 28 · Fri Sep 28 2018 13:53:27 GMT+0800 (China Standard Time)

One, I recall seeing few more like the one you mentioned above. The recurring challenge being that almost none of them are Unicode compliant and poorly maintained. 2 of them did not have a setup.py for installation. Two, I have mostly worked with Hindi (e.g. https://github.com/NirantK/hindi2vec/ ) - simple tokenization is solved with 'space' splits and simple rules around word suffixes. Three, spaCy (https://github.com/explosion/spaCy) works for me in most cases. Please consider contributing to the same if you want to build a NLP toolkit for Indic scripts.

…

On Fri, 28 Sep 2018 at 11:17, Pawan Sasanka Ammanamanchi < ***@***.***> wrote: @NirantK <https://github.com/NirantK> do you know of any Indic libraries other than that, i've been scourging the internet for some but have found none satisfiable — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#113 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADGaPbuYSamnS7gbWT3RShwb0vbyFOayks5ufbfvgaJpZM4SBRZl> .

Guillaume Chevalier · Answer 29 · Sat Oct 13 2018 16:18:41 GMT+0800 (China Standard Time)

Hello @NirantK, I made this project for clustering/topic extraction: https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA

It also contains a tutorial explaining the architecture: https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/Stemming-words-from-multiple-languages.ipynb

It also has unit tests.

All those languages are supported:

Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Porter
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish

I was hesitating whether or not to add a new section, such as "Many languages". My question is: what would you do? Where would you add this?

Thank you!

Nirant · Answer 30 · Sat Oct 13 2018 21:35:20 GMT+0800 (China Standard Time)

@guillaume-chevalier that should go under Libraries -> Python. Please raise a PR. Great to see a multi-lingual clustering toolkit!

Nero · Answer 31 · Thu Mar 28 2019 00:43:56 GMT+0800 (China Standard Time)

Hi @NirantK, I can support Traditional Chinese translation and I am currently working on it.
This is really awesome repo, hope more people can see.

Nirant · Answer 32 · Thu Mar 28 2019 11:19:39 GMT+0800 (China Standard Time)

Wow, this is fantastic @NeroCube! Can you please raise a PR with your translation?

Gaurav Arora · Answer 33 · Tue Jun 11 2019 21:26:14 GMT+0800 (China Standard Time)

@NirantK Do you think adding links to repos NLP for Hindi, NLP for Punjabi , NLP for Sanskrit, NLP for Gujarati, NLP for Kannada, NLP for Malayalam, NLP for Nepali, NLP for Odia, NLP for Marathi, NLP for Bengali, NLP for Tamil, NLP for Urdu under the Indic Languages section would be helpful? All these repos contain Language Models, Classifiers and Tokenizers, along with the dataset used to train models, for their respective languages and are being used in iNLTK

Nirant · Answer 34 · Tue Jun 11 2019 23:16:22 GMT+0800 (China Standard Time)

We already have iNLTK, which in turn links to all of the above.

Maybe not add all of them? This might get spammy.

Gaurav Arora · Answer 35 · Tue Jun 11 2019 23:48:58 GMT+0800 (China Standard Time)

Yes okay! That seems right! Thanks!

Nirant · Answer 36 · Mon Sep 09 2019 16:26:18 GMT+0800 (China Standard Time)

Thank you everyone who has contributed to the multiple languages work, here on Awesome-NLP. While we continue to welcome the contributions along similar lines, we have some sort of coverage now.

I'm closing this issue for now. We will open new issues to encourage specific languages.