Add tokenization for languages not using spaces
jacopofar opened this issue
The space-based tokenization cannot properly tokenize Chinese and Japanese (among others), since those languages do not delimit words with spaces. For these languages, add a proper tokenizer to be applied in the import script.
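A minimal sketch of what such a hook could look like, assuming the import script is Python. The `tokenize` function name is illustrative, and jieba (Chinese) and janome (Japanese) are just two possible segmentation backends, not a decision on what the project should use:

```python
# Sketch: language-aware tokenization for the import script.
# jieba and janome are example backends; any segmenter with a
# similar interface would do.
import jieba
from janome.tokenizer import Tokenizer

_janome = Tokenizer()

def tokenize(text: str, lang: str) -> list[str]:
    """Split text into tokens, falling back to whitespace splitting."""
    if lang == "zh":
        # jieba.cut returns a generator of segmented words
        return [tok for tok in jieba.cut(text) if tok.strip()]
    if lang == "ja":
        # wakati=True yields surface forms only, no morphological info
        return list(_janome.tokenize(text, wakati=True))
    # default: keep the existing space-based behaviour
    return text.split()
```

For example, `tokenize("我爱自然语言处理", "zh")` would yield word-level tokens instead of one unsplittable string, while English input still goes through plain `str.split()`.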
References: