spencermountain / compromise

modest natural-language processing

Home Page: http://compromise.cool

Tagging mixed number as #Value

track0x1 opened this issue

Mixed numbers are a common way to express a value such as ‘1-1/2 cups’, sometimes written without the hyphen separator as ‘1 1/2 cups’. With compromise v11 I was able to write a plugin that used a regex to tag these as #Value, but it doesn’t seem to work in the latest release. Because this is so common, should it be tagged out of the box?
My purpose here is to match all types of values (including mixed number values) for capturing.
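
For readers with the same goal, one option that does not depend on compromise internals is a plain regex pass over the raw string. The sketch below is only an illustration: `MIXED_NUMBER` and `findMixedNumbers` are hypothetical names, not part of compromise, and it assumes a whole number followed by exactly one space or hyphen and a simple fraction. It makes no attempt to decide whether '1-1/2' was meant as a mixed number or as a subtraction, so it is best treated as opt-in.

```javascript
// Hypothetical helper, not part of compromise: finds mixed numbers such as
// '1-1/2' or '1 1/2' in a plain string and returns their decimal values.
const MIXED_NUMBER = /\b(\d+)[ -](\d+)\/(\d+)\b/g

function findMixedNumbers(text) {
  const found = []
  for (const m of text.matchAll(MIXED_NUMBER)) {
    const whole = Number(m[1])
    const numerator = Number(m[2])
    const denominator = Number(m[3])
    // skip malformed fractions like '1 1/0'
    if (denominator === 0) continue
    found.push({ text: m[0], value: whole + numerator / denominator })
  }
  return found
}

console.log(findMixedNumbers('1-1/2 cups flour and 2 3/4 cups sugar'))
```

The matched text could then be handed to whatever tagging or capturing step comes next; how that step hooks into compromise is left open here.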

hey Tom, yep - if I remember right, we still do some of this number-range stuff out of the box, but shied away from the parts that resemble algebra or subtraction. This is a real doozie, and I agree it's a cool thing to opt in to, and we should support any unambiguous 'and a half' stuff as much as we can.

You can see some of the fractions tests we pass, and the ones we avoid, here. PRs welcome if you can improve on it in any way.

ps i enjoyed your blog.
cheers

@spencermountain Thank you Spencer! I just realized something that looks like a bug. When 15-ounce is wrapped in parentheses, it's tagged as a single term and, as a result, gets the wrong tags.

> nlp('15-ounce (15-ounce)').debug()

  ┌─────────
   '15'       - Value, Cardinal, NumericValue, Hyphenated
   'ounce'    - Noun, Unit, Singular, Hyphenated
   '15-ounce'  - Infinitive, Verb, PresentTense

sidebar: is there a way we can convert verbose number ranges (2 to 3) to hyphenated number ranges (2-3)? that would enable me to tap into the same #NumberRange tag for a match.

> nlp('2 to 3 people').debug()

  ┌─────────
   '2'        - Value, Cardinal, NumericValue
   'to'       - Conjunction
   '3'        - Value, Cardinal, NumericValue
   'people'   - Noun, Plural, Actor

> nlp('2-3 people').debug()

  ┌─────────
   '[2]'      - Value, Cardinal, NumericValue, NumberRange
   '[to]'     - Conjunction, NumberRange
   '[3]'      - Value, Cardinal, NumericValue, NumberRange
   'people'   - Noun, Plural, Actor

edit: also happy to split these concerns into separate issues/discussions if you prefer

hey Tom, apologies for the delay.
yeah, there's an ugly way:

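import nlp from 'compromise'

let doc = nlp('2 to 3 people')
// capture the first number and the 'to' as named groups
let { before, prep } = doc.match('[<before>#Value] [<prep>to] #Value').groups()
before.post('') // remove the whitespace after '2'
doc.match(prep).replaceWith('-').post('') // swap 'to' for '-' and remove the whitespace after it
console.log(doc.text()) // 2-3 people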

in short, some of this is weird. You may benefit from using replace() with some term methods like @hasDash or @hasHyphen
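
If the goal is only to reach the same #NumberRange tagging, another option is to normalise the raw string before it goes to nlp(). This is a minimal sketch under that assumption: `hyphenateRanges` is a hypothetical helper, not a compromise method, and it only handles digit ranges joined by 'to'. It will also rewrite phrases where 'to' is not a range (for example 'gave 2 to 3 people'), so it may need a narrower pattern for real input.

```javascript
// Hypothetical helper, not part of compromise: rewrites 'N to M' as 'N-M'
// in a plain string, so the hyphenated form is what the parser sees.
function hyphenateRanges(text) {
  return text.replace(/\b(\d+)\s+to\s+(\d+)\b/g, '$1-$2')
}

console.log(hyphenateRanges('2 to 3 people')) // 2-3 people
```

The result can then be passed to nlp() as usual, for example `nlp(hyphenateRanges(input))`.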

This nlp('15-ounce (15-ounce)').debug() one is a doozie. Haven't got it yet, but will.

hey @track0x1, this is fixed in 14.12.0:

let doc = nlp('10-ounce (12-ounce)')
doc.terms().length // 4

cheers

You're the best! Thank you