google / cld3


Python binding forks and different fixes

bact opened this issue · comments

Update: CLD3 now has a Python binding code from Google themselves: gcld3

PyPI: https://pypi.org/project/gcld3/

GitHub: https://github.com/google/cld3/tree/master/gcld3


This issue documents some of the Python binding forks, in the hope that fixes can be merged upstream as much as possible:

Official CLD3: https://github.com/google/cld3
--> [based on google] First Python binding: https://github.com/jbaiter/cld3 by @jbaiter
----> [based on @jbaiter] Remove Chromium repo dependency (see #11) + PyPI: https://github.com/Elizafox/cld3 by @Elizafox
------> [based on @Elizafox] Fix res.language casting error (in Cython): https://github.com/RNogales94/cld3, https://github.com/PythonNut/cld3, https://github.com/houp/cld3 by @RNogales94 @PythonNut @houp
------> [based on @Elizafox] Include protobuf headers and bodies (to get around #13): https://github.com/houp/cld3 by @houp
------> [based on @Elizafox] Fix memory leak; Introduce reuse of language model for faster performance https://github.com/iamthebot/cld3 by @iamthebot
--------> [based on @iamthebot] Fix res.language comparison; Provide easy pip install under pycld3 name https://github.com/bsolomon1124/pycld3 by @bsolomon1124


Python Binding Documentation

(based on the documentation from https://github.com/Elizafox/cld3 )

Usage:

Here are some examples:

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> cld3.get_frequent_languages("This piece of text is in English. Този текст е на Български.", 5)
[LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592), LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)]

In short:

  • get_language returns the most likely language as a LanguagePrediction named tuple. proportion is always 1.0 when called this way.
  • get_frequent_languages returns the top guesses, up to a maximum you specify (5 in the example). The maximum is mandatory. proportion is set to the fraction of the input bytes detected as that language.

In the normal cld3 library, "und" may be returned as the language for unknown text (with no other stats given). This library filters that result out as extraneous; if the language couldn't be detected, nothing is returned. As a consequence, get_frequent_languages may return fewer results than you asked for, or none at all.
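The filtering behavior described above can be sketched in plain Python. This is a hypothetical illustration (the names `filter_unknown` and the sample data are mine, not the binding's actual internals):

```python
from collections import namedtuple

# Mirrors the named tuple the binding returns.
LanguagePrediction = namedtuple(
    "LanguagePrediction",
    ["language", "probability", "is_reliable", "proportion"],
)

def filter_unknown(predictions):
    """Drop 'und' (unknown-language) results, as the binding does."""
    return [p for p in predictions if p.language != "und"]

raw = [
    LanguagePrediction("en", 0.99, True, 0.6),
    LanguagePrediction("und", 0.0, False, 0.4),
]
filtered = filter_unknown(raw)
# Only the English prediction survives, so a caller that asked for
# up to two languages gets just one back.
```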

I have been testing the Elizafox/cld3 Python binding and ran into severe memory issues. The more sentences I detect, the more memory is used. I don't know if this is an issue in cld3 itself or in the Python binding specifically.

Given that I cannot open an issue in any of the Python binding forks, I thought I would report it here.

@Ipla I've fixed these memory leaks in my fork of CLD3. Basically, the elizafox version creates a new model object on each call to get_language and, on top of that, never cleans it up. My fork has both the original functions (but cleans up the objects) and a class called LanguageIdentifier that permits reuse of the model for faster performance.

The fork is iamthebot/cld3

Hi @jasonriesa and @akihiroota87: do the maintainers of google/cld3 have any interest in incorporating Python bindings within this repo, by reviewing and combining the various forks mentioned above?

Tangentially related: as part of those forks, the Chromium dependency was removed. If it hadn't been, the logical solution might be a git submodule, but since the C++ source itself has changed in the forks, that becomes difficult.

@iamthebot

I believe there's still a small error in your fork.

You use the comparison:

if str(res.language) != ident.kUnknown:

This is not doing what you think it is.

Originally, res.language is a C++ string, while ident.kUnknown is a const char array (with value "und").

However, str(res.language) does not do the correct coercion in the same way that str(b"hello") does not decode the string; it just makes a str representation of that bytes object.

>>> str(b"hello")
"b'hello'"
>>> str(b"hello") == "hello"  # No!
False

What is needed here is:

if <bytes> res.language != <bytes> ident.kUnknown:

You can prove this for yourself by throwing this into get_language():

cdef string tst = b"und" 
print(tst)
print(str(tst) == ident.kUnknown)
print(tst.decode("utf-8") == ident.kUnknown)

Then

python3 setup.py build_ext --inplace --quiet && python3 -c 'import cld3; cld3.get_language("hello there!")'

will produce False, False.
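The same pitfall can be reproduced in plain Python, where the C++ string and the char array both surface as bytes objects (the variable names here are illustrative):

```python
res_language = b"und"   # what the C++ std::string surfaces as
k_unknown = b"und"      # what the const char array surfaces as

# Broken: str() on bytes builds a repr ("b'und'"), it does not decode.
broken = str(res_language) == "und"

# Correct alternatives:
bytes_equal = res_language == k_unknown              # bytes vs bytes
decoded_equal = res_language.decode("utf-8") == "und"  # str vs str
```

The bytes-to-bytes cast in the fix above works because both sides end up as the same type before comparison.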

Using the work of everyone here (thank you everyone!) I've tried to combine the change sets into one clean set of commits and put a shiny new wrapper on things, which also sits on PyPI as pycld3.

https://github.com/bsolomon1124/pycld3

Reviews appreciated. Again, I've made my best effort to make sure the incremental changes across different forks are picked up and put together.

Thanks @bsolomon1124! I actually just copied that part from the elizafox cld3 fork so I guess many of us had been using this in its broken form for a while lol. The new wrapper looks great and we'll switch to using it soon.