akahuku / unistring

Hi @akahuku thank you for this library. When using getWords, is there any way to distinguish punctuation marks (like .,;()"') and white spaces from actual words in the results? The returned type isn't clear to me. Thank you!

getWords returns an array of objects, each with a 'type' property that indicates the type of the first character of that word. The valid values for type are defined in:

unistring/unistring.js

Lines 348 to 376 in 7250d76

    
const WBP = {
	/* ` */'Other': 0,
	/* a */'SOT': 1,
	/* b */'EOT': 2,
	/* c */'Double_Quote': 3,
	/* d */'Single_Quote': 4,
	/* e */'Hebrew_Letter': 5,
	/* f */'CR': 6,
	/* g */'LF': 7,
	/* h */'Newline': 8,
	/* i */'Extend': 9,
	/* j */'Regional_Indicator': 10,
	/* k */'Format': 11,
	/* l */'ALetter': 12,
	/* m */'MidLetter': 13,
	/* n */'MidNum': 14,
	/* o */'MidNumLet': 15,
	/* p */'Numeric': 16,
	/* q */'ExtendNumLet': 17,
	/* r */'E_Base': 18,
	/* s */'E_Modifier': 19,
	/* t */'ZWJ': 20,
	/* u */'Glue_After_Zwj': 21,
	/* v */'E_Base_GAZ': 22,
	/* w */'Katakana': 23,
	/* x */'Hiragana': 24,
	/* y */'KanaExtension': 25,
	/* z */'Space': 26
};

These values correspond to word boundaries defined in http://unicode.org/reports/tr29/#Word_Boundaries and are listed in http://www.unicode.org/Public/9.0.0/ucd/auxiliary/WordBreakProperty.txt

So, for example, if you want to pick out whitespace, you can write code like this:

var result = Unistring.getWords('hello, world');
for (var i = 0; i < result.length; i++) {
  if (result[i].type == Unistring.WBP.Space) {
    // whitespace
  }
}
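
Since the type is just a number, it can help to turn it back into its property name while exploring results. The following is only a sketch: it works on a local copy of the WBP table quoted above so that it runs on its own, and `typeName` is a hypothetical helper, not something unistring provides.

```javascript
// Local copy of the WBP table quoted above. With the library loaded,
// Unistring.WBP could presumably be used in its place.
const WBP = {
  Other: 0, SOT: 1, EOT: 2, Double_Quote: 3, Single_Quote: 4,
  Hebrew_Letter: 5, CR: 6, LF: 7, Newline: 8, Extend: 9,
  Regional_Indicator: 10, Format: 11, ALetter: 12, MidLetter: 13,
  MidNum: 14, MidNumLet: 15, Numeric: 16, ExtendNumLet: 17,
  E_Base: 18, E_Modifier: 19, ZWJ: 20, Glue_After_Zwj: 21,
  E_Base_GAZ: 22, Katakana: 23, Hiragana: 24, KanaExtension: 25,
  Space: 26
};

// Invert the table: numeric code -> property name.
const WBP_NAMES = Object.fromEntries(
  Object.entries(WBP).map(([name, code]) => [code, name])
);

// Hypothetical helper (not part of unistring): name for a type code.
function typeName(type) {
  return WBP_NAMES[type] ?? 'Unknown';
}

console.log(typeName(12)); // prints "ALetter"
console.log(typeName(26)); // prints "Space"
```

Logging `typeName(result[i].type)` inside the loop above should make it easier to see which category each returned item falls into.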

Ok, thanks a lot!

How to get words without punctuation marks?
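
One possible approach, following the whitespace example earlier in the thread, is to keep only items whose type is a letter-like or numeric category. This is a sketch rather than the library's own method: the choice of which categories count as "words" is an assumption, `wordsOnly` is a hypothetical helper, the `text` property on each item is assumed, and the sample array is hand-written to stand in for a getWords result.

```javascript
// Codes copied from the WBP table quoted earlier in this thread.
const WBP = {
  Hebrew_Letter: 5, ALetter: 12, MidNum: 14, Numeric: 16,
  Katakana: 23, Hiragana: 24, KanaExtension: 25, Space: 26
};

// Assumption: these categories are the ones to treat as "actual words".
const WORD_TYPES = new Set([
  WBP.Hebrew_Letter, WBP.ALetter, WBP.Numeric,
  WBP.Katakana, WBP.Hiragana, WBP.KanaExtension
]);

// Hypothetical helper: drop punctuation, whitespace and everything else
// whose first character is not in one of the word-like categories.
function wordsOnly(words) {
  return words.filter((w) => WORD_TYPES.has(w.type));
}

// Hand-written stand-in for Unistring.getWords('hello, world');
// the `text` property name is an assumption.
const sample = [
  { text: 'hello', type: WBP.ALetter },
  { text: ',', type: WBP.MidNum },
  { text: ' ', type: WBP.Space },
  { text: 'world', type: WBP.ALetter }
];

console.log(wordsOnly(sample).map((w) => w.text)); // prints [ 'hello', 'world' ]
```

Note that in the Unicode data, characters such as Han ideographs fall under 'Other', so a filter like this would drop them along with most punctuation. If that matters for your text, the set of accepted types would need adjusting.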