akahuku / unistring

javascript library to handle "unicode string" easily and correctly

How to get words without punctuation marks?

feuerste opened this issue

Hi @akahuku thank you for this library. When using getWords, is there any way to distinguish punctuation marks (like .,;()"') and white spaces from actual words in the results? The returned type isn't clear to me. Thank you!

getWords returns an array of objects, each of which has a 'type' property. This property indicates the type of the first character of each word. Valid values for type are defined in:

unistring/unistring.js

Lines 348 to 376 in 7250d76

const WBP = {
  /* ` */'Other': 0,
  /* a */'SOT': 1,
  /* b */'EOT': 2,
  /* c */'Double_Quote': 3,
  /* d */'Single_Quote': 4,
  /* e */'Hebrew_Letter': 5,
  /* f */'CR': 6,
  /* g */'LF': 7,
  /* h */'Newline': 8,
  /* i */'Extend': 9,
  /* j */'Regional_Indicator': 10,
  /* k */'Format': 11,
  /* l */'ALetter': 12,
  /* m */'MidLetter': 13,
  /* n */'MidNum': 14,
  /* o */'MidNumLet': 15,
  /* p */'Numeric': 16,
  /* q */'ExtendNumLet': 17,
  /* r */'E_Base': 18,
  /* s */'E_Modifier': 19,
  /* t */'ZWJ': 20,
  /* u */'Glue_After_Zwj': 21,
  /* v */'E_Base_GAZ': 22,
  /* w */'Katakana': 23,
  /* x */'Hiragana': 24,
  /* y */'KanaExtension': 25,
  /* z */'Space': 26
};

These values correspond to word boundaries defined in http://unicode.org/reports/tr29/#Word_Boundaries and are listed in http://www.unicode.org/Public/9.0.0/ucd/auxiliary/WordBreakProperty.txt

Therefore, if you want to pick out whitespace, for example, you can write code like this:

var result = Unistring.getWords('hello, world');
for (var i = 0; i < result.length; i++) {
  if (result[i].type == Unistring.WBP.Space) {
    // whitespace
  }
}
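
The same 'type' check can be turned around to keep only the actual words, which is what the question asks for. The following is a rough sketch, not part of the library: the choice of which types count as "words" is an assumption, the wordsOnly helper is hypothetical, and the sample array is a hand-written stand-in for a getWords result in which the 'text' field name is assumed (only 'type' is confirmed above). The numeric values are copied from the WBP listing.

```javascript
// Subset of the WBP table quoted above (values copied from the listing).
// Assumption: these are the types that should count as "actual words".
const WORD_WBP = {
  Hebrew_Letter: 5,
  ALetter: 12,
  Numeric: 16,
  Katakana: 23,
  Hiragana: 24,
  KanaExtension: 25
};

const WORD_TYPES = new Set(Object.values(WORD_WBP));

// Hypothetical helper: keep only entries whose first character
// has one of the word-like types listed above.
function wordsOnly(entries) {
  return entries.filter(function (entry) {
    return WORD_TYPES.has(entry.type);
  });
}

// Hand-written stand-in for a getWords result.
// The `text` field name is an assumption made for this illustration.
const sample = [
  { text: 'hello', type: 12 }, // ALetter
  { text: ',', type: 14 },     // MidNum
  { text: ' ', type: 26 },     // Space
  { text: 'world', type: 12 }  // ALetter
];

const kept = wordsOnly(sample).map(function (entry) {
  return entry.text;
});
console.log(kept); // [ 'hello', 'world' ]
```

With the real library, the result of Unistring.getWords(...) would be passed to wordsOnly in place of the sample, and the numbers could be read from Unistring.WBP rather than copied. Since 'type' describes only the first character, and anything not covered by the table falls under Other, this filter is an approximation and the set of accepted types may need adjusting for a given text.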

Ok, thanks a lot!