spencermountain / compromise

modest natural-language processing

Home Page: http://compromise.cool

Feature Request: Add data to a term

thegoatherder opened this issue

commented

I'm looking for some kind of method on a View which will enable me to attach some data to words, the same way that I can attach tags.
Does anything like this already exist?

Here's a rough sketch:

const doc = nlp('Simon says see you next September')
const match = doc.match('#Month')
const someObj = doSomethingComplexWithMyMatch(match)
match.tag('SomeTag')
// we add some custom data to our Compromise View instance
match.addData(someObj)

// ... somewhere else in the app

const match2 = doc.match('#SomeTag')
// recall the data somehow
const myData = match2.out('data')

I think this could be an extremely powerful feature for our project and I'm sure many others...!

Example Use Case:

  • An app that extracts all dates from a doc in various formats and maps them to standard JS date objects (e.g. using spacetime or some lib)
  • Given some base date, calculates how long before or after that base date each date found in the document falls
  • Stores the durations in an object against the terms in the document
  • That data is now recallable anywhere else in the app with access to the nlp doc, without recalculating.
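
A rough sketch of the duration step in this use case, in plain JavaScript. It assumes the dates have already been resolved to JS Date objects and are keyed by term id. The helper name daysFromBase, the second term id, and all the calendar dates are made up for illustration, and the built-in Date stands in for spacetime to keep the snippet self-contained.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Hypothetical helper: signed whole-day offset of each date from the base date.
// Negative means before the base date, positive means after it.
function daysFromBase(base, datesByTermId) {
  const durations = {}
  for (const [termId, date] of Object.entries(datesByTermId)) {
    durations[termId] = Math.round((date.getTime() - base.getTime()) / MS_PER_DAY)
  }
  return durations
}

// Invented example data: the base date and resolved dates are assumptions
const baseDate = new Date(Date.UTC(2023, 0, 1))
const resolved = {
  'september|00800003V': new Date(Date.UTC(2023, 8, 1)),
  'some-other-term-id': new Date(Date.UTC(2022, 11, 25)),
}

const durations = daysFromBase(baseDate, resolved)
console.log(durations['september|00800003V']) // 243
console.log(durations['some-other-term-id'])  // -7
```

The resulting object is the kind of data that would then need to be stored against the terms in the document.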

ya, really cool idea.
Agree that tags are limited as data, and have taken us really far - (probably too far).
I also like the idea for storing captured metadata, like date metadata, within reach of compromise somehow.

Imagine if we could do something in match queries with the json like:

let doc=nlp('paul, john lennon and ringo starr')
doc.match('ringo starr').payload({roles:['drummer', 'singer'], hair:'long'})

//then later...
doc.match("and {roles:'drummer'}") //or something

Been stuck, forever, on this same dilemma - where to store information about groups of words.
The good news is that they are just javascript objects, and we can stick stuff anywhere.

View objects are transient. Every method returns a new one, and would need to marshal any data payload around, with every interaction. Old views would have stale payloads. I don't think it's the right place for this.
Putting payloads in Term objects would also be the wrong place - 'ringo' and 'starr' would need dangled or duped data between them.
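
A toy sketch of both problems in plain JavaScript. ToyView and its methods are made up for illustration and are not compromise's actual View class. The only assumption carried over from the text above is that every method returns a new object.

```javascript
// Toy stand-in for a View: every method returns a fresh object
class ToyView {
  constructor(terms, payload) {
    this.terms = terms
    this.payload = payload
  }
  match(word) {
    // the payload has to be carried over by hand on every call
    return new ToyView(this.terms.filter(t => t === word), this.payload)
  }
  withPayload(payload) {
    return new ToyView(this.terms, payload)
  }
}

const first = new ToyView(['ringo', 'starr'], { hair: 'long' })
const second = first.withPayload({ hair: 'short' })
console.log(first.payload.hair)  // long (the old view is now stale)
console.log(second.payload.hair) // short

// Term-level storage: a two-word match needs a copy of the data on each term
const terms = [{ text: 'ringo' }, { text: 'starr' }]
const data = { roles: ['drummer'] }
terms.forEach(t => { t.payload = { ...data } })
terms[0].payload.roles = ['singer']
console.log(terms[1].payload.roles) // [ 'drummer' ] (the copies have drifted apart)
```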

Open to it, just haven't got it clear yet.

commented

@spencermountain

Just throwing some ideas around in case they offer any inspiration... far from a solution...!

What if there was some new layer like compromise/four with a method like .commit() that could commit a View and store it separately in the document?

const someObj = {} // my payload
const view = nlp('See you next September').match('next #Month').commit() 
view.payload(someObj)

.commit() could hash the Term IDs to generate a deterministic ID for the View. This would ensure that a committed View can later be updated with new data if needed.
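
One possible shape for that hashing step, as a minimal sketch. The function name viewKey is hypothetical, and the choice of 32-bit FNV-1a (applied to UTF-16 code units) is only an assumption; any stable hash would do. The ids are copied from the .json() output further down in this comment.

```javascript
// Hypothetical helper: derive a stable key from the ordered term ids of a match
function viewKey(termIds) {
  // the NUL separator keeps ['ab', 'c'] and ['a', 'bc'] from producing the same input
  const input = termIds.join('\u0000')
  let hash = 0x811c9dc5 // FNV-1a 32-bit offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // multiply by the FNV prime, keep 32 bits
  }
  return hash.toString(16).padStart(8, '0')
}

const keyA = viewKey(['next|00700002C', 'september|00800003V'])
const keyB = viewKey(['next|00700002C', 'september|00800003V'])
console.log(keyA === keyB) // true: the same terms always give the same key
```

A 32-bit hash can collide on a large document, so a real implementation might prefer a longer digest or the joined ids themselves.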

doc: {
  commits: {
    "somehash1": {
      terms: [],  // list of Terms
      payload: {} // the payload data
    }
  }
}

This would allow Terms to hold different data in different contexts. For example, a match of next #Month and a match of #Month could both attach data to the Term September, but independently. A user could then:

const payload1 = { a: 1 } 
const payload2 = { a: 2 } 
const doc = nlp('See you next September')
doc.match('next #Month').commit().payload(payload1)
doc.match('#Month').commit().payload(payload2)

// ... later in the app
doc.match('next #Month').payload()  // Generate checksum for this match and use it to lookup payload1 data from the commit
doc.match('#Month').payload()  // Generate checksum for this match and use it to lookup payload2 data from the commit
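
To check that overlapping matches really can keep independent payloads, here is a minimal stand-alone sketch of the store. CommitStore and its methods are hypothetical and not part of compromise. The matches are represented by their term-id lists, and a joined string stands in for the hash, assuming term ids never contain spaces.

```javascript
// Hypothetical store: one entry per committed match, keyed by its ordered term ids
class CommitStore {
  constructor() {
    this.commits = new Map()
  }
  static key(termIds) {
    return termIds.join(' ')
  }
  commit(termIds, payload) {
    // committing the same terms again overwrites the earlier payload
    this.commits.set(CommitStore.key(termIds), { terms: [...termIds], payload })
  }
  payload(termIds) {
    const found = this.commits.get(CommitStore.key(termIds))
    return found ? found.payload : undefined
  }
}

const store = new CommitStore()
const next = 'next|00700002C'
const september = 'september|00800003V'
store.commit([next, september], { a: 1 }) // stands in for match('next #Month')
store.commit([september], { a: 2 })       // stands in for match('#Month')

console.log(store.payload([next, september])) // { a: 1 }
console.log(store.payload([september]))       // { a: 2 }
```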

The data could also be output by the .json() function:

doc.match('next #Month').json()
[
  {
    "text": "next september",
    "terms": [
      {
        "text": "next",
        "pre": "",
        "post": " ",
        "tags": [
          "Adjective"
        ],
        "normal": "next",
        "index": [
          0,
          2
        ],
        "id": "next|00700002C",
        "dirty": true,
        "chunk": "Noun"
      },
      {
        "text": "september",
        "pre": "",
        "post": "",
        "tags": [
          "Date",
          "Noun",
          "Month"
        ],
        "normal": "september",
        "index": [
          0,
          3
        ],
        "id": "september|00800003V",
        "chunk": "Noun",
        "dirty": true
      }
    ],
    "payload": {}   // <-- my payload
  }
]

I think, but am not sure, that this might also support your (excellent!) suggestion of a new match syntax based on payloads:

doc.match("and {roles:'drummer'}") //or something

When the matcher sees {roles:'drummer'}, it could find all committed views that carry that data, return their term IDs, and use those to complete the match, for example: and ringo|00012ABC starr|0A11A00B
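
A small sketch of that lookup step only, not of the match syntax itself. The function findByPayload and the record shape are hypothetical, and the id for paul is invented for the example. An array-valued field is assumed to match when it contains the wanted value.

```javascript
// Hypothetical commit records: term ids plus the payload attached to them
const commits = [
  {
    terms: ['ringo|00012ABC', 'starr|0A11A00B'],
    payload: { roles: ['drummer', 'singer'], hair: 'long' },
  },
  {
    terms: ['paul|0000001A'], // invented id
    payload: { roles: ['bassist', 'singer'] },
  },
]

// Return the term-id sequence of every commit whose payload satisfies the query
function findByPayload(records, query) {
  return records
    .filter(({ payload }) =>
      Object.entries(query).every(([field, wanted]) => {
        const actual = payload[field]
        return Array.isArray(actual) ? actual.includes(wanted) : actual === wanted
      })
    )
    .map(({ terms }) => terms)
}

const hits = findByPayload(commits, { roles: 'drummer' })
console.log(hits) // [ [ 'ringo|00012ABC', 'starr|0A11A00B' ] ]
```

The matcher would then treat each returned id sequence as a fixed run of terms in the rest of the pattern.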

check out the compromise-payload plugin