google / cld3


Python binding forks and different fixes

bact opened this issue · comments

Update: CLD3 now has a Python binding code from Google themselves: gcld3

PyPI: https://pypi.org/project/gcld3/

GitHub: https://github.com/google/cld3/tree/master/gcld3


This issue documents some of the Python binding forks, in the hope that fixes can be merged upstream as much as possible:

Official CLD3: https://github.com/google/cld3
--> [based on google] First Python binding: https://github.com/jbaiter/cld3 by @jbaiter
----> [based on @jbaiter] Remove Chromium repo dependency (see #11) + PyPI: https://github.com/Elizafox/cld3 by @Elizafox
------> [based on @Elizafox] Fix res.language casting error (in Cython): https://github.com/RNogales94/cld3, https://github.com/PythonNut/cld3, https://github.com/houp/cld3 by @RNogales94 @PythonNut @houp
------> [based on @Elizafox] Include protobuf headers and bodies (to get around #13): https://github.com/houp/cld3 by @houp
------> [based on @Elizafox] Fix memory leak; Introduce reuse of language model for faster performance https://github.com/iamthebot/cld3 by @iamthebot
--------> [based on @iamthebot] Fix res.language comparison; Provide easy pip install under pycld3 name https://github.com/bsolomon1124/pycld3 by @bsolomon1124


Python Binding Documentation

(based on the documentation from https://github.com/Elizafox/cld3 )

Usage:

Here are some examples:

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> cld3.get_frequent_languages("This piece of text is in English. Този текст е на Български.", 5)
[LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592), LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)]

In short:

  • get_language returns the most likely language as a LanguagePrediction named tuple. proportion is always 1.0 when called this way.
  • get_frequent_languages returns the top guesses, up to a maximum you specify (5 in the example). The maximum is mandatory. proportion is set to the fraction of the input bytes detected as that language.

In the normal cld3 library, "und" may be returned as the language for unknown text (with no other stats given). This library filters that result out as extraneous; if the language couldn't be detected, nothing is returned. As a consequence, get_frequent_languages may return fewer results than you asked for, or none at all.
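The filtering behavior described above can be sketched in plain Python. This is a hypothetical illustration (the names `filter_unknown` and the sample data are mine, not the binding's actual internals):

```python
from collections import namedtuple

# Mirrors the named tuple the binding returns.
LanguagePrediction = namedtuple(
    "LanguagePrediction",
    ["language", "probability", "is_reliable", "proportion"],
)

def filter_unknown(predictions):
    """Drop 'und' (unknown-language) results, as the binding does."""
    return [p for p in predictions if p.language != "und"]

raw = [
    LanguagePrediction("en", 0.99, True, 0.6),
    LanguagePrediction("und", 0.0, False, 0.4),
]
filtered = filter_unknown(raw)
# Only the English prediction survives, so a caller that asked for
# up to two languages gets just one back.
```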

I have been testing the Elizafox/cld3 Python binding and ran into severe memory issues. The more sentences I detect, the more memory is used. I don't know if this is an issue in cld3 itself or in the Python binding specifically.

Given that I cannot open an issue in any of the Python binding forks, I thought I would report it here.

@Ipla I've fixed these memory leaks in my fork of CLD3. Basically, the elizafox version creates a new model object on each call to get_language and, on top of that, never cleans it up. My fork has both the original functions (but cleans up the objects) and a class called LanguageIdentifier that permits reuse of the model for faster performance.

The fork is iamthebot/cld3

Hi @jasonriesa and @akihiroota87: do the maintainers of google/cld3 have any interest in incorporating Python bindings within this repo, by reviewing and combining the various forks mentioned above?

Tangentially related: as part of those forks, the Chromium dependency was removed. If it hadn't been, the logical solution might be a git submodule, but since the C++ source itself has changed in the forks, that becomes difficult.

@iamthebot

I believe there's still a small error in your fork.

You use the comparison:

if str(res.language) != ident.kUnknown:

This is not doing what you think it is.

Originally, res.language is a C++ string, while ident.kUnknown is a const char array (with value "und").

However, str(res.language) does not do the correct coercion in the same way that str(b"hello") does not decode the string; it just makes a str representation of that bytes object.

>>> str(b"hello")
"b'hello'"
>>> str(b"hello") == "hello"  # No!
False

What is needed here is:

if <bytes> res.language != <bytes> ident.kUnknown:

You can prove this for yourself by throwing this into get_language():

cdef string tst = b"und" 
print(tst)
print(str(tst) == ident.kUnknown)
print(tst.decode("utf-8") == ident.kUnknown)

Then

python3 setup.py build_ext --inplace --quiet && python3 -c 'import cld3; cld3.get_language("hello there!")'

will produce False, False.
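The same pitfall can be reproduced in plain Python, where the C++ string and the char array both surface as bytes objects (the variable names here are illustrative):

```python
res_language = b"und"   # what the C++ std::string surfaces as
k_unknown = b"und"      # what the const char array surfaces as

# Broken: str() on bytes builds a repr ("b'und'"), it does not decode.
broken = str(res_language) == "und"

# Correct alternatives:
bytes_equal = res_language == k_unknown              # bytes vs bytes
decoded_equal = res_language.decode("utf-8") == "und"  # str vs str
```

The bytes-to-bytes cast in the fix above works because both sides end up as the same type before comparison.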

Using the work of everyone here (thank you everyone!) I've tried to combine the change sets into one clean set of commits and put a shiny new wrapper on things, which also sits on PyPI as pycld3.

https://github.com/bsolomon1124/pycld3

Reviews appreciated. Again, I've made my best effort to make sure the incremental changes across different forks are picked up and put together.

Thanks @bsolomon1124! I actually just copied that part from the elizafox cld3 fork so I guess many of us had been using this in its broken form for a while lol. The new wrapper looks great and we'll switch to using it soon.