nielstron / quantulum3

Library for unit extraction - fork of quantulum for python3

Specify an order for preferred interpretations as low-key disambiguation

EdwardChamberlain opened this issue

commented

Describe the bug
The shorthand notation for inch (") is detected but is parsed as second of arc. While that reading is technically valid, the more common use of " is to mean inch.

To Reproduce
Steps to reproduce the behaviour:

>>> from quantulum3 import parser
>>> s = 'Its about 24" long'
>>> quants = parser.parse(s)
>>> print(s)
Its about 24" long
>>> print(quants)
[Quantity(24, "Unit(name="second of arc", entity=Entity("angle"), uri=Minute_and_second_of_arc)")]

Expected behavior
I would expect it to default to " meaning inches rather than seconds.

Screenshots
N/A

Additional information:

  • Python Version: 3.7
  • Classifier activated/ sklearn installed: yes
  • OS: linux
  • Version: 0.7.3

Additional context
Is there any way to force an override on this?

the more common use of " is to mean inch.
Do you have some source for this claim? Since this tool should be as general as possible, I would prefer that the choice between ambiguous units remain arbitrary when the disambiguation system is not used.

Relatedly, for disambiguation there is a pretrained classifier included in the system. However, without any context, it is not really likely that it will correctly determine the appropriate unit.

Note taken: A way to pass in an ordering for preferred/less preferred interpretations of some symbols could be included.
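To make the idea concrete, here is a minimal sketch of what such an ordering could look like. It does not use quantulum3 at all: the `PREFERRED` table, the `pick_preferred` function, and the unit-name strings are hypothetical and only illustrate the selection logic, assuming the parser can supply the candidate unit names for an ambiguous symbol.

```python
from typing import Dict, List, Sequence

# Hypothetical preference table: symbol -> unit names, most preferred first.
# Nothing here is part of quantulum3; the names are for illustration only.
PREFERRED: Dict[str, List[str]] = {'"': ["inch", "second of arc"]}


def pick_preferred(
    symbol: str,
    candidates: Sequence[str],
    preferred: Dict[str, List[str]] = PREFERRED,
) -> str:
    """Return the candidate unit name ranked highest for this symbol.

    If the symbol has no configured ordering, or none of the preferred
    names is among the candidates, fall back to the first candidate.
    """
    if not candidates:
        raise ValueError("no candidate units given")
    for name in preferred.get(symbol, []):
        if name in candidates:
            return name
    return candidates[0]
```

With this table, `pick_preferred('"', ["second of arc", "inch"])` selects "inch", while a symbol without an entry keeps whatever candidate comes first.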

commented

Do you have some source for this claim?

Sure:

The inch (abbreviation: in or ″)

Source: https://en.wikipedia.org/wiki/Inch (first line)

It sometimes seems to pick up inch correctly when using ", but I'm not sure how to reproduce that yet.

Yes of course " is an abbreviation for inch, but I rather wanted to know whether there is a source for the claim that " more frequently means "inch" than "second" :)

The tool knows that " is an abbreviation for inch just as it knows that " is an abbreviation for seconds. However, there is no order of preference for which interpretation to choose. If it picks up " as inch, that might be related to the context of the sentence, but it may also be due to (something very close to) pure luck, especially if disambiguation is not enabled.
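Until such an ordering exists, one possible workaround is to rewrite the text before parsing so that the symbol is no longer ambiguous. The sketch below is plain Python and not part of quantulum3; `spell_out_inches` is a hypothetical helper that assumes every " or ″ directly following a digit means inch. That assumption breaks for text such as `"version 2"`, where a closing quotation mark follows a digit.

```python
import re

# A straight double quote or a double prime (U+2033) that follows a digit,
# optionally separated from it by whitespace.
_INCH_MARK = re.compile(r'(?<=\d)\s*(?:"|\u2033)')


def spell_out_inches(text: str) -> str:
    """Replace an inch mark after a number with the word 'inch'."""
    return _INCH_MARK.sub(" inch", text)


print(spell_out_inches('Its about 24" long'))  # Its about 24 inch long
```

The rewritten string can then be passed to `parser.parse` as in the report above.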