RubixML / ML

A high-level machine learning and deep learning library for the PHP language.

Home Page: https://rubixml.com

Multi Language Tokenization Support

andrewdalpino opened this issue

I'm hoping that we can get to the point where we fully support the following languages.

  • English
  • Spanish
  • German
  • French
  • Russian
  • Japanese
  • Hindi
  • Farsi
  • Chinese
  • Arabic

I started adding unit tests for these languages for a few tokenizers here: https://github.com/RubixML/ML/tree/master/tests/Tokenizers. However, it doesn't look like we support all of the languages. I only speak English, so it's hard for me to tell. Could we get some help from the community to verify that our Tokenizers support all of these languages and, if not, contribute a fix?

https://github.com/RubixML/ML/tree/master/src/Tokenizers

Thank you!

How can I join the multi-language development effort? I am fluent in Chinese and English.

Hi @taotecode, thanks for your interest in contributing to the project! Here are the unit tests for the Tokenizers implemented in the library.

https://github.com/RubixML/ML/tree/master/tests/Tokenizers

We need help from native language speakers to ensure that we have test coverage for different languages and that the current tests are correct.
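
To illustrate why per-language coverage matters, here is a minimal sketch in plain PHP. It does not use the library's Tokenizer classes; `whitespaceTokens()` is a hypothetical helper written only for this example, and the sample sentences are assumptions chosen for illustration. It shows that splitting on whitespace, which works for English, returns a whole Chinese sentence as a single token because Chinese does not separate words with spaces.

```php
<?php

/**
 * Hypothetical helper (not part of Rubix ML): split text on runs of
 * whitespace, with the /u modifier so the input is treated as UTF-8.
 *
 * @return list<string>
 */
function whitespaceTokens(string $text): array
{
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$english = 'I would like a coffee';
$chinese = '我想要一杯咖啡'; // roughly "I would like a cup of coffee"

// English words are separated by spaces, so each word is its own token.
var_dump(count(whitespaceTokens($english))); // int(5)

// The Chinese sentence contains no whitespace, so it comes back as one token.
var_dump(count(whitespaceTokens($chinese))); // int(1)
```

A test written by a native speaker would replace the token counts above with the exact list of words expected for each sentence, which is the kind of check that is hard to get right without knowing the language.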