nielstron / quantulum3

Library for unit extraction - fork of quantulum for python3

Compound statements of the same entity but different magnitudes are parsed separately

psychemedia opened this issue

Using the current package installed from PyPI (0.7.2), if I do something like:

from quantulum3 import parser
parser.parse('it weighs four hundred and twenty kilograms.')

it is correctly parsed as: Quantity(420, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")

With time, however, different magnitude time elements are not composed. For example:

parser.parse('It took 4 years and 6 months, more or less.')

is parsed as:

[Quantity(4, "Unit(name="year", entity=Entity("time"), uri=Year)"),
 Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")]

Similarly:

parser.parse('I measured it as 2 meters and 30 centimeters.')

is parsed as:

[Quantity(2, "Unit(name="metre", entity=Entity("length"), uri=Metre)"),
 Quantity(30, "Unit(name="centimetre", entity=Entity("length"), uri=Centimetre)")]

If quantities of the same entity but different magnitude are separated by a connector such as "and", should they be summed?

Comparing with:

parser.parse('I measured it as 2.3m.')

which returns: [Quantity(2.3, "Unit(name="metre", entity=Entity("length"), uri=Metre)")]

I might expect a parse of "2 meters and 30 centimeters" to return that same (SI unit-ed) quantity?

Another example, this time dimensionless:

parser.parse('three million, two hundred & forty, you say?')

returns:

[Quantity(3e+06, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)"),
 Quantity(200, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)"),
 Quantity(40, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]

whereas without the comma and the ampersand:

parser.parse('three million two hundred and forty miles, you say?')

we correctly get: [Quantity(3.00024e+06, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]

(It also works without the miles unit.)

different magnitude time elements are not composed

This is true, and it is partly because quantulum does not do unit conversion. Currently, nothing in the tool encodes how the known units relate to each other (and maybe this should never be implemented here, but left to a quantity conversion package).
On the other hand, I see your point: to rigorously output the quantities the speaker meant, they should be merged somehow. I do not currently have a proposal or concept for how to handle this.
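
For illustration only, here is a minimal sketch of what such a merge could look like as a post-processing step outside the library. It does not use quantulum3's own classes: `ParsedQuantity`, `TO_BASE`, `BASE_UNIT` and `merge_same_entity` are hypothetical names, and the conversion factors are assumptions made for the example. It also merges any consecutive quantities of the same entity, without checking that the text between them is a connector such as "and", which a real implementation would have to do.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a parsed quantity; not quantulum3's Quantity class.
@dataclass
class ParsedQuantity:
    value: float
    unit: str
    entity: str


# Assumed conversion factors to one base unit per entity (illustrative only).
TO_BASE = {
    "length": {"metre": 1.0, "centimetre": 0.01},
    "time": {"month": 1.0, "year": 12.0},
}
BASE_UNIT = {"length": "metre", "time": "month"}


def merge_same_entity(quantities):
    """Sum consecutive quantities of the same entity, expressed in the base unit."""
    merged = []
    for q in quantities:
        factors = TO_BASE.get(q.entity)
        if factors is None or q.unit not in factors:
            merged.append(q)  # no known conversion: leave the quantity untouched
            continue
        base_value = q.value * factors[q.unit]
        base_unit = BASE_UNIT[q.entity]
        last = merged[-1] if merged else None
        if last is not None and last.entity == q.entity and last.unit == base_unit:
            # Same entity as the previous (already converted) quantity: add up.
            merged[-1] = ParsedQuantity(last.value + base_value, base_unit, q.entity)
        else:
            merged.append(ParsedQuantity(base_value, base_unit, q.entity))
    return merged


parts = [ParsedQuantity(4, "year", "time"), ParsedQuantity(6, "month", "time")]
print(merge_same_entity(parts))
# [ParsedQuantity(value=54.0, unit='month', entity='time')]
```

Whether a table like `TO_BASE` belongs in the parser or in a separate conversion package is exactly the open design question raised above.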

parser.parse('three million, two hundred & forty, you say?')

This is a simpler problem, not necessarily connected to the first one. Here we "only" have to fix or extend the parsing regular expression to allow commas and ampersands. Feel free to look at the existing regexes to see how to extend them :)
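
As a rough sketch of the idea (not the regex quantulum3 actually uses), the input could be normalised before parsing, so that commas and ampersands between spelled-out number words are rewritten into the form that already parses correctly. `NUMBER_WORDS` and `normalise_number_separators` are hypothetical names, and the word list is a deliberately small subset.

```python
import re

# Illustrative subset of number words; a real parser needs the full vocabulary.
NUMBER_WORDS = (
    "one|two|three|four|five|six|seven|eight|nine|ten|twenty|thirty|forty|"
    "fifty|sixty|seventy|eighty|ninety|hundred|thousand|million|billion"
)

# A number word, a comma, then another number word (checked via lookahead).
_COMMA = re.compile(
    rf"\b({NUMBER_WORDS}),\s*(?=(?:{NUMBER_WORDS})\b)", re.IGNORECASE
)
# A number word, an ampersand, then another number word.
_AMPERSAND = re.compile(
    rf"\b({NUMBER_WORDS})\s*&\s*(?=(?:{NUMBER_WORDS})\b)", re.IGNORECASE
)


def normalise_number_separators(text):
    """Drop commas and spell out '&' as 'and' between number words."""
    text = _COMMA.sub(r"\1 ", text)
    return _AMPERSAND.sub(r"\1 and ", text)


print(normalise_number_separators("three million, two hundred & forty, you say?"))
# three million two hundred and forty, you say?
```

One caveat of this approach: a genuine enumeration such as "one, two, three" would also be collapsed, so a fix inside the parsing regex itself would need some way to tell the two cases apart.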