REGEX used to split accented characters

Question

sergiomajluf opened this issue 5 years ago

As always, thanks for sharing all this code!

While using it with a Spanish text, I found that it didn't work because of special characters. In concordance.js, this is the code that splits the text into an array of tokens:

// Splitting up the text
        split(text) {
            // Split into array of tokens
            return text.split(/\W+/);
        }

Unfortunately, \W treats accented (diacritic) characters as non-word characters too, because it only recognizes A-Z, a-z, 0-9 and underscore as word characters. So words like "selección" and "niño" get chopped into "selecci", "n", "ni", "o" by that regex.
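The behavior is easy to reproduce in a browser console or Node. This is only a minimal sketch; the variable names `sample` and `pieces` are made up for the illustration and do not come from concordance.js.

```javascript
// \W is the complement of [A-Za-z0-9_], so accented letters act as separators.
const sample = "selección niño";
const pieces = sample.split(/\W+/);
console.log(pieces); // [ 'selecci', 'n', 'ni', 'o' ]

// The accented letters themselves match \W:
const accentIsNonWord = /\W/.test("ó") && /\W/.test("ñ");
console.log(accentIsNonWord); // true
```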

I found a workaround, using match instead of split:

var re = /\S+\s*/g;
tokens = allwords.match(re);

This of course required me to also change the earlier loading code a bit, so that all lines are joined into a single string:

txt = loadStrings('preguntas/todas.txt');
allwords = txt.join("\n");
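To show what the tokens look like at this point, here is a small sketch. The hard-coded `lines` array is an assumption standing in for whatever loadStrings returns, and the names `allText` and `rawTokens` are hypothetical.

```javascript
// Hard-coded lines standing in for the array of strings loaded from the file
const lines = ["La selección (sub-20).", "El niño juega."];
const allText = lines.join("\n");

// Each match is a run of non-whitespace plus any whitespace that follows it
const rawTokens = allText.match(/\S+\s*/g);
console.log(rawTokens);
// [ 'La ', 'selección ', '(sub-20).\n', 'El ', 'niño ', 'juega.' ]
```

The accented words survive intact, but every token still carries its punctuation and trailing whitespace.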

My proposed solution is not very good either: after splitting into tokens, I still had to clean out many other non-alphanumeric characters, whitespace, line breaks, etc. by hand. That was done while "sanitizing" each word before adding it to keys and counts. For example:

for (var i = 0; i < tokens.length; i++) {
        var word = tokens[i].toLowerCase();

        // Clean some more: strip every parenthesis and period, not just the first one
        word = word.replace(/[().]/g, "");
        // finBlanco is defined elsewhere in the sketch (not shown here)
        word = word.replace(finBlanco, "");
        // Remove line breaks and tabs
        word = word.replace(/[\r\n\t]/g, "");

        if (!/\d+/.test(word)) {                     // is not a number
            if (sw.indexOf(word) == -1) {            // is not a stop word within a custom sw array
                if (counts[word] === undefined) {    // is a new word
                    counts[word] = 1;
                    keys.push(word);
                } else {
                    counts[word]++;
                }
            }
        }
    }
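A possible alternative that avoids most of the hand cleaning is to match runs of letters directly with Unicode property escapes, which need the `u` flag and an ES2018-capable engine. The following is only a sketch under those assumptions: `countWords` and `stopWords` are hypothetical names, and this is not how concordance.js does it.

```javascript
// Count word frequencies, treating any Unicode letter as part of a word.
function countWords(text, stopWords = []) {
  // Null-prototype object so words like "constructor" do not hit inherited keys
  const counts = Object.create(null);
  const keys = [];
  // NFC composes "o" + combining accent into the single code point "ó"
  const normalized = text.normalize("NFC").toLowerCase();
  // \p{L} = any letter, \p{M} = combining marks; digits and punctuation never match
  const tokens = normalized.match(/[\p{L}\p{M}]+/gu) || [];
  for (const word of tokens) {
    if (stopWords.includes(word)) continue; // skip stop words
    if (counts[word] === undefined) {       // new word
      counts[word] = 1;
      keys.push(word);
    } else {
      counts[word]++;
    }
  }
  return { counts, keys };
}

const result = countWords(
  "La selección (sub-20). El niño, la NIÑA y el niño.",
  ["la", "el", "y"]
);
console.log(result.keys);           // [ 'selección', 'sub', 'niño', 'niña' ]
console.log(result.counts["niño"]); // 2
```

One difference from the loop above: a token such as "sub-20" is not discarded as a whole. Only its letters are kept, so it contributes "sub". Whether that is desirable depends on the text.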