NFD form combining characters not picked up as part of word
retorquere opened this issue
```js
function show(s) {
  return s.replace(/[^\x00-\x7F]/g, c => "\\u" + ("0000" + c.charCodeAt(0).toString(16)).slice(-4))
}

var nlp = require("compromise/one")
var doc = nlp('Poincare\u0301')
for (const term of doc.json({ offset: true })[0].terms) {
  console.log(show(JSON.stringify(term, null, 2)))
}
```
logs
```json
{
  "text": "Poincare",
  "pre": "",
  "post": "\u0301",
  "tags": [],
  "normal": "poincare",
  "index": [
    0,
    0
  ],
  "id": "poincare|002000009",
  "offset": {
    "index": 0,
    "start": 0,
    "length": 8
  }
}
```
Normalizing to NFC does work, but not every combining-character sequence has a precomposed NFC form (e.g. `'Poincare\u0301 E\u0300\u0304'.normalize('NFC')`).
hey, good catch! Yeah, I agree that compromise should not split these inline unicode forms. Happy to add a guard for this in the next release.
cheers
hey, just double-checking something: your example `Poincare\u0301` seems to end in a punctuation symbol '́', which arguably should be considered non-word whitespace, maybe.
Can you generate an example where the NFD character is more word-like? I agree it rubs up against the javascript normalize feature, and maybe our supporting it would just complicate things.
lemme know,
cheers
It's just the Combining Acute Accent:
```js
const show = obj => JSON.stringify(obj, null, 2).replace(/[\u007F-\uFFFF]/g, chr => `\\u${(`0000${chr.charCodeAt(0).toString(16)}`).substr(-4)}`)
console.log(show(`e\u0301`.normalize('NFC')))
```

shows

```
"\u00e9"
```
It's easy enough to normalize the input before passing it into tokenization, but that would then become a design constraint, and as mentioned, some combining-character sequences have no single-code-point NFC form.