Minimum Text Length Threshold for Reliable Language Detection in Langdetect

Question

Chetan-Yeola opened this issue 8 months ago

What is considered a 'short text' in langdetect, and is there a specific minimum text length threshold for reliable language detection?

Jean-Baptiste Bertrand · Answer 1 · Thu May 02 2024 17:14:21 GMT+0800 (China Standard Time)

@Chetan-Yeola According to the presentation page of this other library, langdetect performs poorly on texts about the length of Twitter messages ("For very short text snippets such as Twitter messages, they do not provide adequate results."). That suggests anything under 280 characters (the Twitter message limit) might give poor results, assuming the page does not exaggerate the problem. However, the page is vague, and the threshold, if there is one, might be higher than 280 characters. It probably also depends on the language: I guess some languages are much easier to detect than others. For example, Hebrew has its own distinctive script, whereas Spanish is very similar to other Romance languages.

But you could test this automatically with a large sample of short texts taken from the Wikipedia editions in various languages, and check whether the error rate is acceptable for your requirements. The page mentioned above does not give the classification error rate behind its statement, so if your own error-rate requirements are very liberal, it may be worth taking the time to test.
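A minimal sketch of such a test, in case it helps. It cuts random snippets of a fixed length out of texts whose language is already known and reports the fraction that a detector gets wrong at each length. The helper names (`sample_snippets`, `error_rate_by_length`, `make_langdetect_detector`) are made up for this example and are not part of langdetect; only `detect`, `DetectorFactory.seed` and `LangDetectException` come from the library. Counting a detection failure as a wrong answer is an assumption you may want to change.

```python
import random


def sample_snippets(texts, length, count, rng):
    """Cut `count` random windows of `length` characters out of `texts`."""
    long_enough = [t for t in texts if len(t) >= length]
    snippets = []
    for _ in range(count if long_enough else 0):
        text = rng.choice(long_enough)
        start = rng.randrange(len(text) - length + 1)
        snippets.append(text[start:start + length])
    return snippets


def error_rate_by_length(detect, texts_by_lang, lengths, per_lang=200, seed=0):
    """Return {length: fraction of snippets whose detected language is wrong}.

    `detect` is any callable mapping a string to a language code (or None).
    A length with no usable snippets maps to None.
    """
    rng = random.Random(seed)
    rates = {}
    for length in lengths:
        wrong = total = 0
        for lang, texts in texts_by_lang.items():
            for snippet in sample_snippets(texts, length, per_lang, rng):
                total += 1
                if detect(snippet) != lang:
                    wrong += 1
        rates[length] = wrong / total if total else None
    return rates


def make_langdetect_detector():
    """Wrap langdetect.detect so that a failed detection returns None."""
    # Imported lazily so the harness above also works with other detectors.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded

    def safe_detect(text):
        try:
            return detect(text)
        except LangDetectException:
            return None  # assumption: a failure counts as a wrong answer

    return safe_detect
```

To use it, build a dictionary such as `{"en": [...], "es": [...]}` from your own Wikipedia extracts, keyed by the codes langdetect returns, then call `error_rate_by_length(make_langdetect_detector(), texts_by_lang, [20, 50, 100, 280])`. The length at which the error rate drops below what you can tolerate is your practical threshold for that set of languages.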