REGEX used to split accented characters
sergiomajluf opened this issue
As always, thanks for sharing all this code!
While using it with a Spanish text, I found that it didn't work because of special characters. In concordance.js this code splits the text into an array of tokens:
// Splitting up the text
split(text) {
  // Split into array of tokens
  return text.split(/\W+/);
}
Unfortunately, in JavaScript \W matches every character outside [A-Za-z0-9_], so accented (diacritic) letters are treated as non-word characters too. With that regex, words like "selección" and "niño" get chopped into "selecci", "n", "ni", "o".
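To make the problem concrete, here is a minimal sketch that reproduces the broken split and contrasts it with a Unicode-aware one. It assumes a JavaScript engine that supports Unicode property escapes (`\p{L}`, `\p{N}` with the `u` flag); `splitUnicode` is a hypothetical helper name, not something that exists in concordance.js.

```javascript
// ASCII-only split: \W treats "ó" and "ñ" as separators.
const asciiSplit = "selección niño".split(/\W+/);
// asciiSplit is ["selecci", "n", "ni", "o"]

// Hypothetical helper (not part of concordance.js): split on runs of
// characters that are neither Unicode letters, digits, nor underscore.
function splitUnicode(text) {
  return text
    .normalize("NFC") // compose "n" + combining tilde into a single "ñ"
    .split(/[^\p{L}\p{N}_]+/u)
    .filter((token) => token.length > 0); // drop empty strings at the edges
}

const unicodeSplit = splitUnicode("La selección del niño, ¿sí?");
// unicodeSplit is ["La", "selección", "del", "niño", "sí"]
```

The `normalize("NFC")` call is a precaution: if the input stores accents as separate combining marks, `\p{L}` alone would not match them and the words would still be cut.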
I found a workaround: using match instead of split.
var re = /\S+\s*/g;
tokens = allwords.match(re);
This also required a small change to the earlier code, so that the text is joined into a single string before matching:
txt = loadStrings('preguntas/todas.txt');
allwords = txt.join("\n");
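For reference, a small sketch of what `match` with `/\S+\s*/g` returns on a short sample (the sample string is made up for illustration). Each token is a run of non-whitespace plus any whitespace that follows it, so punctuation, trailing spaces, and line breaks all stay attached to the words.

```javascript
// Made-up sample text standing in for the joined file contents.
const sample = "La selección (final).\nEl niño";

// match() returns null when nothing matches, so fall back to an empty array.
const rawTokens = sample.match(/\S+\s*/g) || [];
// rawTokens is ["La ", "selección ", "(final).\n", "El ", "niño"]

const emptyTokens = "".match(/\S+\s*/g) || [];
// emptyTokens is []
```

The accented words survive intact, but every token still needs cleaning before it can be used as a key.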
My proposed solution is not very good either: the tokens it produces still carry punctuation, whitespace, line breaks, and other non-alphanumeric characters, so I had to clean those by hand. I did that while "sanitizing" each word before adding it to keys and counts. For example:
for (var i = 0; i < tokens.length; i++) {
  var word = tokens[i].toLowerCase();
  // Clean some more: strip every parenthesis and period, not only the first one
  word = word.replace(/[().]/g, "");
  // finBlanco is a pattern defined elsewhere in the sketch (not shown here)
  word = word.replace(finBlanco, "");
  // Remove line breaks and tabs
  word = word.replace(/(\r\n|\n|\t|\r)/gm, "");
  if (!/\d+/.test(word)) { // contains no digits
    if (sw.indexOf(word) == -1) { // is not a stop word within a custom sw array
      if (counts[word] === undefined) { // is a new word
        counts[word] = 1;
        keys.push(word);
      } else {
        counts[word]++;
      }
    }
  }
}
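One possible way to avoid most of the hand cleaning is to match the words directly rather than splitting and sanitizing afterwards. The following is only a sketch under some assumptions: it relies on Unicode property escapes being available, `countWords` and its `stopWords` parameter are hypothetical names, and it is not the approach used in concordance.js. Note one behavioural difference from the loop above: a token that mixes letters and digits (such as "abc123") contributes its letters here, where the loop above would skip it entirely.

```javascript
// Hypothetical helper: count word frequencies in text with accented letters.
function countWords(text, stopWords = []) {
  // Null-prototype object so words like "constructor" are not mistaken
  // for inherited properties.
  const counts = Object.create(null);
  const keys = [];

  // Match runs of Unicode letters only; punctuation, digits, and
  // whitespace never make it into a token.
  const tokens = text.normalize("NFC").toLowerCase().match(/\p{L}+/gu) || [];

  for (const word of tokens) {
    if (stopWords.includes(word)) continue; // skip stop words
    if (counts[word] === undefined) {
      counts[word] = 1; // first time this word is seen
      keys.push(word);
    } else {
      counts[word]++;
    }
  }
  return { counts, keys };
}

const result = countWords("El niño y la selección. ¡El niño (otra vez)!", ["el", "y", "la"]);
// result.keys is ["niño", "selección", "otra", "vez"]
// result.counts["niño"] is 2
```

Because the pattern accepts only letters, hyphenated words and words with apostrophes are split into their parts; the character class would need extending if that matters for a given text.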