nielstron / quantulum3

Library for unit extraction - fork of quantulum for python3

Celsius does not work when using unit C

infered5 opened this issue · comments

Describe the bug
In a typical written conversation, it's common to use F/C (or f/c) as temperature units, instead of writing out the full Fahrenheit or Celsius. This library errors when attempting to parse sentences like "The hotend is 200C".

To Reproduce
Steps to reproduce the behavior:
from quantulum3 import parser
quants = parser.parse('it is set to 200c')
quants

Expected behavior
[Quantity(200, "Unit(name="degree Celsius", entity=Entity("temperature"), uri=Celsius)")]

Actual Behavior
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andrew/redenv/lib/python3.10/site-packages/quantulum3/parser.py", line 564, in parse
    unit, unit_shortening = get_unit(item, text, lang, classifier_path)
  File "/home/andrew/redenv/lib/python3.10/site-packages/quantulum3/parser.py", line 365, in get_unit
    base = dis.disambiguate_unit(unit_surface, text, lang, classifier_path)
  File "/home/andrew/redenv/lib/python3.10/site-packages/quantulum3/disambiguate.py", line 18, in disambiguate_unit
    base = clf.disambiguate_unit(unit_surface, text, lang, classifier_path).name
  File "/home/andrew/redenv/lib/python3.10/site-packages/quantulum3/classifier.py", line 319, in disambiguate_unit
    scores = classifier_.classifier.predict_proba(transformed).tolist()[0]
  File "/home/andrew/redenv/lib/python3.10/site-packages/sklearn/utils/_available_if.py", line 40, in __get__
    self._check(obj, owner=owner)
  File "/home/andrew/redenv/lib/python3.10/site-packages/sklearn/utils/_available_if.py", line 31, in _check
    raise AttributeError(attr_err_msg) from e

Additional information:

  • Python Version: 3.10
  • Classifier activated / sklearn installed: yes
  • OS: Ubuntu 22.04.4 LTS (Jammy)

Additional context
Writing the full name Celsius works as intended. C or F causes errors.

This looks to me like an issue with sklearn (or, more generally, with the trained classifier that is used to disambiguate ambiguous statements).

I have pushed a new version 0.9.1, which should be released soon. Can you please retry with that version?

This has solved the issue, at least for C, F and f. Lowercase c, however, appears to be parsed as the currency unit centavo (one hundredth of a Mexican peso, and of the currencies of a few other countries). This is sufficient for me, but may not be for others.
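
For anyone who needs lowercase c to be read as a temperature, one possible workaround is to expand the abbreviation before parsing, since writing out the full name Celsius is reported above to work. The following is a minimal sketch, not part of quantulum3: the helper name expand_temperature_abbreviations and the regular expression are hypothetical, and it assumes that a bare C or F directly after a number always means a temperature, which will be wrong for some texts.

```python
import re

# Hypothetical helper, not part of quantulum3.
# Maps the one-letter abbreviation to the full unit name.
_TEMPERATURE_NAMES = {"c": "Celsius", "f": "Fahrenheit"}

# A number, an optional degree sign, then a standalone C or F (any case).
# The trailing \b keeps units such as "cm" or "ft" from matching.
_ABBREVIATED_TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*°?\s*([cCfF])\b")


def expand_temperature_abbreviations(text):
    """Rewrite e.g. '200c' or '72 F' as '200 Celsius' or '72 Fahrenheit'."""

    def _expand(match):
        number, letter = match.groups()
        return f"{number} {_TEMPERATURE_NAMES[letter.lower()]}"

    return _ABBREVIATED_TEMPERATURE.sub(_expand, text)
```

The expanded string can then be passed to the parser, for example parser.parse(expand_temperature_abbreviations('it is set to 200c')). Because the rewrite is unconditional, it should only be applied to text where temperatures are expected.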

The disambiguator is context-dependent and might therefore occasionally produce incorrect disambiguations. I think it could be improved a lot by drawing on recent LLM developments; the current method is relatively simple.
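
To illustrate what context-dependent means here, the toy sketch below picks a reading for "c" by counting hint words in the surrounding sentence. It is not the method quantulum3 uses (the traceback above shows a trained sklearn classifier); the hint word lists and the function name disambiguate_c are made up for illustration only.

```python
import re

# Made-up hint words for each candidate reading of "c"; purely illustrative.
CONTEXT_HINTS = {
    "degree Celsius": {"temperature", "hotend", "heat", "oven", "weather", "degrees"},
    "centavo": {"price", "cost", "peso", "pesos", "pay", "coin"},
}


def disambiguate_c(text):
    """Return the reading whose hint words overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {unit: len(words & hints) for unit, hints in CONTEXT_HINTS.items()}
    best = max(scores, key=scores.get)
    # With no hint word present, there is no basis for a decision.
    return best if scores[best] > 0 else None
```

A sentence such as 'it is set to 200c' contains no hint words at all, so a scheme like this has nothing to go on. That gives a rough idea of why short inputs can end up with an unexpected unit.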