Add tokenization for languages not using spaces
jacopofar opened this issue
The space-based tokenization cannot properly tokenize Chinese and Japanese (among others), since those languages do not delimit words with spaces. For these languages, add a proper tokenizer to be applied in the import script.
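A minimal sketch of what such a hook could look like, assuming the import script is Python. The `tokenize` function name is illustrative, and jieba (Chinese) and janome (Japanese) are just two possible segmentation backends, not a decision on what the project should use:

```python
# Sketch: language-aware tokenization for the import script.
# jieba and janome are example backends; any segmenter with a
# similar interface would do.
import jieba
from janome.tokenizer import Tokenizer

_janome = Tokenizer()

def tokenize(text: str, lang: str) -> list[str]:
    """Split text into tokens, falling back to whitespace splitting."""
    if lang == "zh":
        # jieba.cut returns a generator of segmented words
        return [tok for tok in jieba.cut(text) if tok.strip()]
    if lang == "ja":
        # wakati=True yields surface forms only, no morphological info
        return list(_janome.tokenize(text, wakati=True))
    # default: keep the existing space-based behaviour
    return text.split()
```

For example, `tokenize("我爱自然语言处理", "zh")` would yield word-level tokens instead of one unsplittable string, while English input still goes through plain `str.split()`.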
References: