wooorm / franc

Natural language detection

Home page: https://wooorm.com/franc/

Reference of source document

DonaldTsang opened this issue · comments

It seems that NONE of the languages have sources for the data.json 3-gram model.
Is it possible to provide the document sources for each language, so that we can review the material
and possibly generate 2-gram and 4-gram (or 2/3, 3/4, or 2/3/4-gram combination) models?
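For concreteness, here is roughly what generating such a profile looks like (illustrative TypeScript, not franc's actual code; the function name is made up). The same loop yields 2-, 3-, or 4-gram models by changing `n`:

```ts
// Sketch: building an n-gram frequency profile from raw text, in the spirit
// of franc's trigram models. Names here are illustrative, not franc's API.
function ngramProfile(text: string, n = 3): Map<string, number> {
  // Rough normalization: lower-case, collapse punctuation/digits/whitespace
  // into single spaces, and pad both ends with a space.
  const clean =
    ' ' +
    text
      .toLowerCase()
      .replace(/[\u0021-\u0040]+/g, ' ')
      .replace(/\s+/g, ' ')
      .trim() +
    ' ';

  const counts = new Map<string, number>();
  for (let i = 0; i + n <= clean.length; i++) {
    const gram = clean.slice(i, i + n);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Example: a 3-gram profile of a short English sentence.
console.log(ngramProfile('All human beings are born free and equal.', 3));
```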

commented

Franc is built from udhr, which has this.
You’ll have to read the source code of franc and the other projects, but I made sure everything does one thing well, to allow for exactly these things!
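For orientation, the detection API itself is small; a minimal usage sketch, assuming a recent ESM release of franc:

```ts
// Minimal usage sketch, assuming a recent (ESM) release of franc.
import {franc, francAll} from 'franc';

// Most probable ISO 639-3 code for the input.
console.log(franc('Alle menneske er fødde til fridom')); // e.g. 'nno'

// Ranked [language code, weight] pairs, useful for comparing candidates.
console.log(francAll('Alle menneske er fødde til fridom').slice(0, 3));
```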

Here are two questions:

  1. One problem with using the UDHR is that biases in its linguistic characteristics could skew the data one way or another. Has this been fully researched?
  2. Are there any other datasets that cover as many languages as possible? I would like to try, test, and compare, as I really want to apply this to vocabulary and short texts, which require larger n-gram datasets.
commented
  • This is a software project, not a research project. If you want to use it, use it; if you don’t, don’t.
  • UDHR is used because it allows for detecting the most languages possible; the readme describes what’s good and not so good about that. Supporting many languages is a goal of franc.
  • There are many data sets out there, but none that support as many languages as the UDHR.

@wooorm Maybe it is the choice of words, but when I say "researched" I mean "optimized", as in whether using the UDHR is the best route to the highest accuracy.

There are many data sets out there, but none that support as many languages as the UDHR.

Do you know of any examples that have at least 75, 100, or 125 languages? Maybe 400 languages is a bit too "extreme", but I would like to know whether you have already encountered such data sets that people could share.

commented

When I say "researched" I mean "optimized", as in whether using the UDHR is the best route to the highest accuracy.

UDHR definitely does not give the highest accuracy, but it does support the most languages.

Do you know of any examples that have at least 75 or 100 languages

I don’t. You could look into the Bible. There have been several issues over the years with conversations going in similar directions to this one, e.g., #76 and #75. You can read through the closed issues to find out more.

commented

BTW, I think the non-n-gram (CLD2) approach is often better than n-grams.

@wooorm Yes. So does CLD2 use codepoint filtering for detecting languages? I might need a primer on how it works, because codepoint filtering is something that I would like to see data on.
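To make the idea concrete, codepoint filtering can be pictured as a script-level pre-filter, something like the sketch below. This mirrors the script-detection step franc's readme describes, not CLD2's internals; the script table is illustrative and incomplete.

```ts
// Sketch of codepoint-based script filtering: count how much of the input
// falls in each Unicode script and keep only languages written in the
// dominant script. Illustrative only; not CLD2's actual implementation.
const scripts: Record<string, RegExp> = {
  Latin: /\p{Script=Latin}/gu,
  Cyrillic: /\p{Script=Cyrillic}/gu,
  Arabic: /\p{Script=Arabic}/gu,
  Han: /\p{Script=Han}/gu,
};

function dominantScript(text: string): string | undefined {
  let best: string | undefined;
  let bestCount = 0;
  for (const [name, pattern] of Object.entries(scripts)) {
    const count = (text.match(pattern) ?? []).length;
    if (count > bestCount) {
      best = name;
      bestCount = count;
    }
  }
  return best;
}

console.log(dominantScript('Придёт серенький волчок')); // 'Cyrillic'
```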

Also, wow, machine learning for https://github.com/google/cld3 and https://github.com/ropensci/cld3.

commented

I don't know; I maintain this project and give it away for free 🤷‍♂️

Okay, thanks for the help.

BTW, I think we could improvise with any collection of fictional and religious books, given a tool to remove proper nouns from such works and leave only the common word structures. Problem: copyright. See: the Bible.
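As a rough illustration of that proper-noun idea (a naive, English-only heuristic, not an actual tool; real works would need per-language capitalization rules):

```ts
// Naive sketch: drop capitalized words that do not start a sentence.
// Only an illustration; capitalization conventions differ across languages.
function stripLikelyProperNouns(text: string): string {
  const tokens = text.split(/\s+/);
  const kept: string[] = [];
  let sentenceStart = true;
  for (const token of tokens) {
    const capitalized = /^[A-Z][a-z]/.test(token);
    if (!capitalized || sentenceStart) kept.push(token);
    sentenceStart = /[.!?]$/.test(token);
  }
  return kept.join(' ');
}

console.log(
  stripLikelyProperNouns('In the beginning God created the heavens. Then Moses spoke.')
);
// -> 'In the beginning created the heavens. Then spoke.'
```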

BTW, here is something unique: https://github.com/pemistahl/lingua#4--how-good-is-it-top-
Also, these three use Wikipedia as a base:

There are also others that use http://wortschatz.uni-leipzig.de/en/download/; more exotically, https://github.com/google/corpuscrawler; and, with tweets, https://github.com/mitjat/langid_eval.

https://github.com/davidjurgens/equilid#model-details is even more comprehensive.
But https://github.com/landrok/language-detector basically has a hidden dataset.