akihikodaki / cld3-ruby

cld3-ruby is an interface of Compact Language Detector v3 (CLD3) for Ruby.

Empty and nonsense strings are detected as being in a language

hult opened this issue

Hi,

It may be the underlying CLD3 library rather than your wrapper, but:

> cld3 = CLD3::NNetLanguageIdentifier.new(0)
> cld3.find_language("")
=> #<struct Struct::Result language=:ja, probability=0.7837570905685425, :reliable?=true, proportion=1.0>
> cld3.find_language("123")
=> #<struct Struct::Result language=:ja, probability=0.7837570905685425, :reliable?=true, proportion=1.0>

You can get rid of this specific error by requiring at least one byte of data, but still:

> cld3 = CLD3::NNetLanguageIdentifier.new(1)
> cld3.find_language("a")
=> #<struct Struct::Result language=:lb, probability=0.9725591540336609, :reliable?=true, proportion=1.0>

tl;dr

  • The underlying library gives no signal you could use to reject such unreliable results.
  • Choose a minimum length requirement based on the reliability you need.

It may be just fine that it returns an arbitrary language if a string is too short. However, it says the probability is very high and the result is reliable. That's not good.

I skimmed the source code of the underlying library. Here is the relevant probability calculation code:

  EmbeddingNetwork::Vector scores;
  network_.ComputeFinalScores(features, &scores);
  int prediction_id = -1;
  float max_val = -std::numeric_limits<float>::infinity();
  for (size_t i = 0; i < scores.size(); ++i) {
    if (scores[i] > max_val) {
      prediction_id = i;
      max_val = scores[i];
    }
  }

  // Compute probability.
  Result result;
  float diff_sum = 0.0;
  for (size_t i = 0; i < scores.size(); ++i) {
    diff_sum += exp(scores[i] - max_val);
  }
  const float log_sum_exp = max_val + log(diff_sum);
  result.probability = exp(max_val - log_sum_exp);

In short, it does not take the length of the string into account at all. The probability is a softmax over the scores: it is higher when fewer features contradict the winning language, and lower when more features do. Because a short string yields few features, the probability stays high.
The reliability flag is derived from the probability, so you cannot rely on that value either.
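To make the point concrete, here is a minimal Ruby transcription of the probability step quoted above. It is only a sketch: `top_probability` is a hypothetical helper, and the score vectors are made-up numbers, not output from the real model. Note that the only input is the score vector; nothing about the input length enters the calculation.

```ruby
# Illustrative transcription of the quoted C++ probability step.
# `scores` stands in for the network's final per-language scores.
def top_probability(scores)
  max_val = scores.max
  # Same log-sum-exp form as the C++ code; subtracting the max keeps exp() stable.
  diff_sum = scores.sum { |s| Math.exp(s - max_val) }
  log_sum_exp = max_val + Math.log(diff_sum)
  Math.exp(max_val - log_sum_exp)
end

# Hypothetical score vectors (assumptions, not real model output).
peaked = [4.0, 0.5, 0.2, 0.1]  # one language clearly ahead
flat   = [1.2, 1.0, 0.9, 1.1]  # several languages compete

puts top_probability(peaked)  # high, roughly 0.93
puts top_probability(flat)    # low, roughly 0.29
```

A peaked score vector produces a high probability whether it came from one character or from a whole paragraph, which is why the short inputs in the report above look "reliable".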

So how can we get a reliable result? We need a string long enough to yield multiple features. The default minimum, which Chromium also uses, is 140 bytes. Chromium also sets the minimum to 0 bytes (no requirement) in some cases. Choose the value according to the reliability you require.
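If you want to enforce the length requirement yourself, one option is a small guard in front of the detector. The sketch below is an assumption-laden example, not part of cld3-ruby: `detect_or_nil`, `MIN_BYTES`, and `FakeDetector` are hypothetical names, and the extra "contains at least one letter" check is my own addition to cover inputs such as "123". It works with any object that responds to `find_language`, for example the `CLD3::NNetLanguageIdentifier.new(0)` instance from the report.

```ruby
# Hypothetical guard; MIN_BYTES is an assumed threshold, tune it to your needs.
MIN_BYTES = 140

def detect_or_nil(detector, text, min_bytes: MIN_BYTES)
  # Compare bytes, matching the unit of the detector's own minimum.
  return nil if text.bytesize < min_bytes
  # Assumed extra check: skip inputs without any letter, such as "123".
  return nil unless text.match?(/\p{L}/)

  detector.find_language(text)
end

# Stand-in detector so the sketch runs without the gem installed.
FakeDetector = Struct.new(:answer) do
  def find_language(_text)
    answer
  end
end

detector = FakeDetector.new(:en)
p detect_or_nil(detector, "")                # nil: too short
p detect_or_nil(detector, "1" * 200)         # nil: no letters
p detect_or_nil(detector, "word " * 40)      # :en (200 bytes, contains letters)
```

Returning `nil` instead of a low-confidence result forces the caller to handle the "cannot tell" case explicitly, which the probability and reliability fields alone do not allow.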