How to get words without punctuation marks?
feuerste opened this issue
Hi @akahuku thank you for this library. When using getWords, is there any way to distinguish punctuation marks (like .,;()"') and white spaces from actual words in the results? The returned type isn't clear to me. Thank you!
getWords returns an array of objects, each of which has a 'type' property. This property indicates the type of the first letter of each word. Valid values for type are defined in:
Lines 348 to 376 in 7250d76
These values correspond to word boundaries defined in http://unicode.org/reports/tr29/#Word_Boundaries and are listed in http://www.unicode.org/Public/9.0.0/ucd/auxiliary/WordBreakProperty.txt
Therefore, if you want to pick out whitespace, for example, you can write code like this:
var result = Unistring.getWords('hello, world');
for (var i = 0; i < result.length; i++) {
    if (result[i].type == Unistring.WBP.Space) {
        // whitespace
    }
}
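The same idea can be turned around to keep only the actual words and drop punctuation and whitespace. The following is a minimal sketch, not part of the library: filterByType is a hypothetical helper, the 'text' property on each entry is an assumption, and the WBP table below is a stand-in with made-up numeric values so the snippet runs on its own. ALetter, Numeric and MidNum are Word_Break property names from WordBreakProperty.txt; whether Unistring.WBP exposes them under exactly these keys should be checked against the definitions referenced above.

```javascript
// Hypothetical helper: keep only the entries whose 'type' is in keepTypes.
// Undefined entries in keepTypes are ignored, so a misspelled WBP key
// simply matches nothing instead of matching entries without a type.
function filterByType(words, keepTypes) {
    var keep = keepTypes.filter(function (t) {
        return t !== undefined;
    });
    return words.filter(function (w) {
        return keep.indexOf(w.type) !== -1;
    });
}

// Stand-in for Unistring.WBP. The numeric values are invented for this sketch.
var WBP = { Other: 0, ALetter: 1, Numeric: 2, MidNum: 3, Space: 4 };

// Stand-in for the result of Unistring.getWords('hello, world').
// The 'text' property name is an assumption.
var sample = [
    { text: 'hello', type: WBP.ALetter },
    { text: ',', type: WBP.MidNum },
    { text: ' ', type: WBP.Space },
    { text: 'world', type: WBP.ALetter }
];

var wordsOnly = filterByType(sample, [WBP.ALetter, WBP.Numeric]);
var texts = wordsOnly.map(function (w) {
    return w.text;
});
console.log(texts.join(' ')); // prints "hello world"
```

With the real library, the call would presumably look like filterByType(Unistring.getWords('hello, world'), [Unistring.WBP.ALetter, Unistring.WBP.Numeric]), extended with whichever other letter-like types your text needs.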
Ok, thanks a lot!