wooorm / franc

Natural language detection

Home page: https://wooorm.com/franc/

Reference of source document

DonaldTsang opened this issue · comments

It seems that NONE of the languages have sources for the data.json 3-gram model.
Is it possible to provide the document sources for each language, so that we can review the material
and possibly generate 2-gram and 4-gram (or 2/3, 3/4, or 2/3/4-gram combination) models?
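For concreteness, here is roughly what generating such a profile looks like (illustrative TypeScript, not franc's actual code; the function name is made up). The same loop yields 2-, 3-, or 4-gram models by changing `n`:

```ts
// Sketch: building an n-gram frequency profile from raw text, in the spirit
// of franc's trigram models. Names here are illustrative, not franc's API.
function ngramProfile(text: string, n = 3): Map<string, number> {
  // Rough normalization: lower-case, collapse punctuation/digits/whitespace
  // into single spaces, and pad both ends with a space.
  const clean =
    ' ' +
    text
      .toLowerCase()
      .replace(/[\u0021-\u0040]+/g, ' ')
      .replace(/\s+/g, ' ')
      .trim() +
    ' ';

  const counts = new Map<string, number>();
  for (let i = 0; i + n <= clean.length; i++) {
    const gram = clean.slice(i, i + n);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Example: a 3-gram profile of a short English sentence.
console.log(ngramProfile('All human beings are born free and equal.', 3));
```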

commented

Franc is built from udhr, which has this.
You’ll have to read the source code of franc and the other projects, but I made sure everything does one thing well, to allow for exactly these things!
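For orientation, the detection API itself is small; a minimal usage sketch, assuming a recent ESM release of franc:

```ts
// Minimal usage sketch, assuming a recent (ESM) release of franc.
import {franc, francAll} from 'franc';

// Most probable ISO 639-3 code for the input.
console.log(franc('Alle menneske er fødde til fridom')); // e.g. 'nno'

// Ranked [language code, weight] pairs, useful for comparing candidates.
console.log(francAll('Alle menneske er fødde til fridom').slice(0, 3));
```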

Here are two questions:

  1. One problem with using the UDHR is that biases in its linguistic characteristics could skew the data one way or another. Has this been fully researched?
  2. Are there any other datasets that cover as many languages as possible? I would like to try, test, and compare, as I really want to apply this to vocabulary and short texts, which require larger n-gram datasets.
commented
  • This is a software project, not a research project. If you want to use it, use it; if you don’t, don’t.
  • UDHR is used because it allows for detecting the most languages possible; the readme describes what’s good and not so good about that. Supporting many languages is a goal of franc.
  • There are many data sets out there, but none that support as many languages as the UDHR.

@wooorm Maybe it is the choice of words, but when I say "researched" I mean "optimized", as in whether using the UDHR is the best route to the highest accuracy.

There are many data sets out there, but none that support as many languages as the UDHR.

Do you know of any examples that have at least 75, 100, or 125 languages? Maybe 400 languages is a bit too "extreme", but I would like to know whether you have already encountered such data sets that people could share.

commented

When I say "researched" I mean "optimized", as in whether using the UDHR is the best route to the highest accuracy.

UDHR definitely does not give the highest accuracy, but it does support the most languages.

Do you know of any examples that have at least 75 or 100 languages

I don’t. You could look into the Bible. There have been several issues over the years with conversations going in similar directions to this one, e.g., #76 and #75. You can read through the closed issues to find out more.

commented

BTW, I think the non-n-gram (CLD2) approach is often better than n-grams.

@wooorm Yes. So does CLD2 use codepoint filtering for detecting languages? I might need a primer on how it works, because codepoint filtering is something that I would like to see data on.
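To make the idea concrete, codepoint filtering can be pictured as a script-level pre-filter, something like the sketch below. This mirrors the script-detection step franc's readme describes, not CLD2's internals; the script table is illustrative and incomplete.

```ts
// Sketch of codepoint-based script filtering: count how much of the input
// falls in each Unicode script and keep only languages written in the
// dominant script. Illustrative only; not CLD2's actual implementation.
const scripts: Record<string, RegExp> = {
  Latin: /\p{Script=Latin}/gu,
  Cyrillic: /\p{Script=Cyrillic}/gu,
  Arabic: /\p{Script=Arabic}/gu,
  Han: /\p{Script=Han}/gu,
};

function dominantScript(text: string): string | undefined {
  let best: string | undefined;
  let bestCount = 0;
  for (const [name, pattern] of Object.entries(scripts)) {
    const count = (text.match(pattern) ?? []).length;
    if (count > bestCount) {
      best = name;
      bestCount = count;
    }
  }
  return best;
}

console.log(dominantScript('Придёт серенький волчок')); // 'Cyrillic'
```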

Also, wow, machine learning for https://github.com/google/cld3 and https://github.com/ropensci/cld3.

commented

I don't know; I maintain this project and give it away for free 🤷‍♂️

Okay, thanks for the help.

BTW, I think we could improvise with any collection of fictional and religious books, given a tool to remove proper nouns from such works and leave only the common word structures. Problem: copyright. See: the Bible.
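As a rough illustration of that proper-noun idea (a naive, English-only heuristic, not an actual tool; real works would need per-language capitalization rules):

```ts
// Naive sketch: drop capitalized words that do not start a sentence.
// Only an illustration; capitalization conventions differ across languages.
function stripLikelyProperNouns(text: string): string {
  const tokens = text.split(/\s+/);
  const kept: string[] = [];
  let sentenceStart = true;
  for (const token of tokens) {
    const capitalized = /^[A-Z][a-z]/.test(token);
    if (!capitalized || sentenceStart) kept.push(token);
    sentenceStart = /[.!?]$/.test(token);
  }
  return kept.join(' ');
}

console.log(
  stripLikelyProperNouns('In the beginning God created the heavens. Then Moses spoke.')
);
// -> 'In the beginning created the heavens. Then spoke.'
```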

BTW, here is something unique: https://github.com/pemistahl/lingua#4--how-good-is-it-top-
Also, these three use Wikipedia as a base:

There are also others that use http://wortschatz.uni-leipzig.de/en/download/; more exotically, https://github.com/google/corpuscrawler; and, with tweets, https://github.com/mitjat/langid_eval.

https://github.com/davidjurgens/equilid#model-details is even more comprehensive.
But https://github.com/landrok/language-detector basically has a hidden dataset.