spencermountain / compromise

modest natural-language processing

Home Page:http://compromise.cool


[Bug]: Syntax / Matching Parsing Issue.

MarketingPip opened this issue · comments

Reporting this because, as far as I know, it is a bug.

This code:

const doc = nlp('Lucas Oil Raceway is a famous motorsports complex.');
const motorplexes = doc.match('#Person (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)').out('array');

outputs the following: Lucas Oil Raceway

but when used like this:

const motorplexes = doc.match('(#Person+) (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)').out('array');

you cannot capture the full result. And when the group contains multiple alternatives, the results vary as you play around with it.

Feel free to play with it and see what you get when switching tags. (I'm hoping I am wrong and just using the parser incorrectly late at night, and that this is not a major bug.) PS: I'm hoping we can add these rules once this is figured out.

import nlp from "https://esm.sh/compromise"

function findMotorplex(text) {
  const doc = nlp(text);
  const motorplexes = doc.match('(#Person+| #Place+|#Organization|#Noun) (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)').out('array');

  return motorplexes;
}

// Test the function with an expanded test list
const testList = [
  'I live at the Motorplex and I am hosting an event this weekend.',
  'I visited Brisbane Dragway last summer.',
  'Lucas Oil Raceway is a famous motorsports complex.',
  'Sydney Dragway hosts the Nitro Thunder event.',
  'Bandimere Speedway is known for the NHRA Mile-High Nationals.',
  'Santa Pod Raceway in Wellingborough is a popular drag racing venue.',
  'Perth Motorplex features drag racing, speedway, and dirt track events.',
  'Maple Grove Raceway hosts the NHRA Nationals.',
  'Gulfport Dragway is a drag racing facility in Mississippi.',
  'South Georgia Motorsports Park is a versatile motorsports facility.',
];

testList.forEach((test, index) => {
  const result = findMotorplex(test);
  console.log(`Test ${index + 1}: ${result.length > 0 ? result : 'No motorplex found.'}`);
});

hey Jared, parentheses in the match syntax are for OR matches, like (a|b|c) - I'm not sure what (#Person+) is intended to do.
Maybe you can describe the match you're looking for, and I can help you create it.
cheers

@spencermountain - hopefully this makes sense.

import nlp from "https://esm.sh/compromise"

function findMotorplex(text) {
  const doc = nlp(text);
  const motorplexes = doc.match('#Person+ (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)').out('array');

  return motorplexes;
}

// Test the function with an expanded test list
const testList = [
  'Lucas Oil Raceway is a famous motorsports complex.',
];

testList.forEach((test, index) => {
  const result = findMotorplex(test);
  console.log(`Test ${index + 1}: ${result.length > 0 ? result : 'No motorplex found.'}`);
});

Outputs:
"Test 1: Lucas Oil Raceway"

Now, when I use a match like this, trying to handle all cases and match ALL names in this list for common patterns found with dragways, raceways, etc. (to hopefully help improve compromise's rule sets for finding orgs):

import nlp from "https://esm.sh/compromise"

function findMotorplex(text) {
  const doc = nlp(text);
  const motorplexes = doc.match('(#Place+|#Organization|#Noun|#Person+) (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)').out('array');

  return motorplexes;
}

// Test the function with an expanded test list
const testList = [
  'I live at the Motorplex and I am hosting an event this weekend.',
  'I visited Brisbane Dragway last summer.',
  'Lucas Oil Raceway is a famous motorsports complex.',
  'Sydney Dragway hosts the Nitro Thunder event.',
  'Bandimere Speedway is known for the NHRA Mile-High Nationals.',
  'Santa Pod Raceway in Wellingborough is a popular drag racing venue.',
  'Perth Motorplex features drag racing, speedway, and dirt track events.',
  'Maple Grove Raceway hosts the NHRA Nationals.',
  'Gulfport Dragway is a drag racing facility in Mississippi.',
  'South Georgia Motorsports Park is a versatile motorsports facility.',
];

testList.forEach((test, index) => {
  const result = findMotorplex(test);
  console.log(`Test ${index + 1}: ${result.length > 0 ? result : 'No motorplex found.'}`);
});

The match for Lucas Oil only returns "Oil Raceway". Again, I'm hoping this is just a brain fart on my regex skills right now and not an issue with the parser, lol. But I'm hoping you can play with that code and try changing the order of the alternatives in the first group, as it seems the results were off (or I am just losing my mind).

And, oddly, this (just playing with the parser - I know groups are meant for different matches lol)

const motorplexes = doc.match('(#Person+) (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)').out('array');

will only return Sydney Raceway. (Which confused me even more) lol.

@spencermountain - I think I've somewhat found the issue (it has to do with #Person in the match, I think) - hoping to get your thoughts.

import nlp from "https://esm.sh/compromise"

function findMotorplex(text) {
  const doc = nlp(text);
  const motorplexes = doc.match('(#Place+|#Person #Person|#Organization|#Noun) (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)').out('array');

  return motorplexes;
}

// Test the function with an expanded test list
const testList = [
  'I live at the Motorplex and I am hosting an event this weekend.',
  'I visited Brisbane Dragway last summer.',
  'Lucas Oil Raceway is a famous motorsports complex.',
  'Sydney Dragway hosts the Nitro Thunder event.',
  'Bandimere Speedway is known for the NHRA Mile-High Nationals.',
  'Santa Pod Raceway in Wellingborough is a popular drag racing venue.',
  'Perth Motorplex features drag racing, speedway, and dirt track events.',
  'Maple Grove Raceway hosts the NHRA Nationals.',
  'Gulfport Dragway is a drag racing facility in Mississippi.',
  'South Georgia Motorsports Park is a versatile motorsports facility.',
];

testList.forEach((test, index) => {
  const result = findMotorplex(test);
  console.log(`Test ${index + 1}: ${result.length > 0 ? result : 'No motorplex found.'}`);
});

This matches all results properly, except for Maple Grove and Santa Pod, which only get partial matches (I'm not sure of the best solution for those yet). But I should not need to use #Person #Person; #Person+ alone should be enough.
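For the partial matches mentioned above (Maple Grove, Santa Pod), one possible stopgap is a plain regular expression that ignores tags entirely and looks only at capitalization. This is a swapped-in technique, not compromise's matcher and not a proposed rule; `VENUE_WORDS`, `VENUE_PATTERN`, and `findVenueNames` are hypothetical names used only for this sketch.

```javascript
// Plain-regex fallback. It does NOT use compromise or its part-of-speech tags;
// it only checks for capitalized words in front of a venue word.
const VENUE_WORDS = ['Motorplex', 'Dragway', 'Raceway', 'Motorsports', 'Racetrack', 'Speedway'];

// One or more capitalized words, then one of the venue words (case-sensitive).
const VENUE_PATTERN = new RegExp(
  `\\b(?:[A-Z][A-Za-z'-]*\\s+)+(?:${VENUE_WORDS.join('|')})\\b`,
  'g'
);

// Hypothetical helper: returns every venue-like name found in the text.
function findVenueNames(text) {
  return text.match(VENUE_PATTERN) || [];
}

// Multi-word names are captured whole:
console.log(findVenueNames('Maple Grove Raceway hosts the NHRA Nationals.'));
console.log(findVenueNames('Santa Pod Raceway in Wellingborough is a popular drag racing venue.'));

// A bare "the Motorplex" has no capitalized word in front, so nothing is returned:
console.log(findVenueNames('I live at the Motorplex and I am hosting an event this weekend.'));
```

The trade-off is that capitalization is a crude signal: a capitalized sentence-initial word directly before a name (for example "Yesterday Brisbane Dragway ...") would be swallowed into the match, which is the kind of case tag-based matching is meant to handle.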

yep, looks good to me

@spencermountain - are you sure? Shouldn't "#Person+" grab multiple words? And if not, why does it do so when used by itself?

I'm still confused why this #Person+ (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway) works for Lucas Oil Speedway...?

But I'm required to use #Person #Person to match it properly.

As this:

(#Place+|#Person #Person|#Organization|#Noun) (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)

as far as I know SHOULD have worked with

(#Place+|#Person+|#Organization|#Noun) (Motorplex|Dragway|Raceway|Motorsports|Racetrack|Speedway)

with the same results.

(Again, I'm hoping you can clarify this for me, as I don't want to touch a rule set until I am clear on this lol)

I'm hoping we can go through a list one weekend and make some more rules for common patterns of organizations etc. found in human language.

hey, ya if you do this:

nlp('Lucas Oil Speedway').debug()

you'll see it's mistakenly tagged as firstname-lastname.

if you want to drill down into why, add nlp.verbose('tagger') before you run it, and it will show Lucas is tagged as a first name, then the firstname-titlecase matcher mistakenly tags the next word as a lastname.
cheers

for 14.10.1 i added a bunch of org/person tagging changes, provoked by some of the issues you've found.

It's a bit of a mess - you can see it in ./src/2-two/preTagger/compute/tagger/3rd-pass/
The main concern was that doing #TitleCase (library|theatre|airport|.....) works for a handful of OR matches, but not two hundred. It starts to slow down the library considerably. I added the placeWords/orgWords and a bunch of loops, to reproduce this in a faster way, but it's not very nice.

It would be great if you could find more issues with both - particularly false-positives, which IMO are a much more important problem than missing an organization here or there.

For example, if you found that it tags 'park my car' as an Org, or something - that would be lovely.
thanks
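The trade-off described in the 14.10.1 comment above (a long OR match versus a word list checked in a loop) might be sketched roughly as follows. This is a toy illustration on plain strings, not the library's code: `PLACE_WORDS` is a made-up sample list (the real placeWords/orgWords contents are not shown in this thread), and `isTitleCase` and `findPlacePairs` are hypothetical helpers.

```javascript
// Made-up sample list, standing in for a list of a couple hundred words.
const PLACE_WORDS = new Set(['library', 'theatre', 'airport', 'raceway', 'speedway', 'park']);

// Crude stand-in for a #TitleCase tag: capital letter followed by a lowercase one.
const isTitleCase = (word) => /^[A-Z][a-z]/.test(word);

// Rough equivalent of '#TitleCase (library|theatre|airport|...)':
// returns every "TitleCaseWord placeWord" pair found in the text.
function findPlacePairs(text) {
  const words = text
    .split(/\s+/)
    .map((w) => w.replace(/[.,!?;:]+$/, '')) // strip trailing punctuation
    .filter(Boolean);
  const pairs = [];
  for (let i = 1; i < words.length; i += 1) {
    // A single Set lookup per word, instead of trying each alternative in turn.
    if (PLACE_WORDS.has(words[i].toLowerCase()) && isTitleCase(words[i - 1])) {
      pairs.push(`${words[i - 1]} ${words[i]}`);
    }
  }
  return pairs;
}

console.log(findPlacePairs('We met near Bandimere Speedway.')); // one pair: 'Bandimere Speedway'
console.log(findPlacePairs('park my car')); // no pairs
console.log(findPlacePairs('Please park my car')); // one pair: 'Please park' (a false positive)
```

The last call shows how easily this kind of rule produces the false positives the comment asks about: a capitalized sentence-initial word in front of "park" is enough to trigger it.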

@spencermountain - so I was not going crazy then! I was purposely trying to match Lucas Oil as a person, as I had seen in the debugger that the tags were "#FirstName #LastName" (I was using it previously to see wtf was going on).

nlp('Lucas Oil Speedway').debug()

by itself worked fine. (strangely, in a list too)....?

But when I used the WHOLE list it wouldn't catch that. (And it appeared to have the same tags, as far as I recall)

I will try to report some false positives. Tho this is one of those things that is kinda just a mind fuck (if that makes sense lol).

Tho I am concerned about this and hoping it is not a MAJOR issue, as it will obviously be affecting the rule set right now (that we think is correct / passing tests....)

That said, if we are on the same page, should this issue be opened back up again? (if so, please do so, so that I and others are not confused). Mostly for me tho right now, as I'm still confused lol 😿

@spencermountain - should this be re-opened or....?

hey jared, can you reproduce it in a dumbed-down format, because i'm a big dumb guy

nlp('foo bar').match('(foo+) bar') //failing

that helps, thanks.