🌿 Konoha: Simple wrapper of Japanese Tokenizers

Konoha is a Python library for providing easy-to-use integrated interface of various Japanese tokenziers, which enables you to switch a tokenizer and boost your pre-processing.

Supported tokenizers

Also, konoha provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.

Quick Start with Docker

Simply run followings on your computer:

docker run --rm -p 8000:8000 -t himkt/konoha  # from DockerHub

Or you can build image on your machine:

git clone https://github.com/himkt/konoha  # download konoha
cd konoha && docker-compose up --build  # build and launch contaier

Tokenization is done by posting a json object to localhost:8000/api/v1/tokenize. You can also batch tokenize by passing texts: ["１つ目の入力", "２つ目の入力"] to the server.

(API documentation is available on localhost:8000/redoc, you can check it using your web browser)

Send a request using curl on you terminal. Note that a path to an endpoint is changed in v4.6.4. Please check our release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).

$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
    -d '{"tokenizer": "mecab", "text": "これはペンです"}'

{
  "tokens": [
    [
      {
        "surface": "これ",
        "part_of_speech": "名詞"
      },
      {
        "surface": "は",
        "part_of_speech": "助詞"
      },
      {
        "surface": "ペン",
        "part_of_speech": "名詞"
      },
      {
        "surface": "です",
        "part_of_speech": "助動詞"
      }
    ]
  ]
}

Installation

I recommend you to install konoha by pip install 'konoha[all]' or pip install 'konoha[all_with_integrations]'. (all_with_integrations will install AllenNLP)

Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)].
Install konoha with a specific tokenizer and AllenNLP integration: pip install 'konoha[(tokenizer_name),allennlp].
Install konoha with a specific tokenzier and remote file support: pip install 'konoha[(tokenizer_name),remote]'

If you want to install konoha with a tokenizer, please install konoha with a specific tokenizer (e.g. konoha[mecab], konoha[sudachi], ...etc) or install tokenizers individually.

Example

Word level tokenization

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]

For more detail, please see the example/ directory.

Remote files

Konoha supports dictionary and model on cloud storage (currently supports Amazon S3). It requires installing konoha with the remote option, see Installation.

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))

Sentence level tokenization

from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']

AllenNLP integration

Konoha provides AllenNLP integration, it enables users to specify konoha tokenizer in a Jsonnet config file. By running allennlp train with --include-package konoha, you can train a model using konoha tokenizer!

For example, konoha tokenizer is specified in xxx.jsonnet like following:

{
  "dataset_reader": {
    "lazy": false,
    "type": "text_classification_json",
    "tokenizer": {
      "type": "konoha",  // <-- konoha here!!!
      "tokenizer_name": "janome",
    },
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true,
      },
    },
  },
  ...
  "model": {
  ...
  },
  "trainer": {
  ...
  }
}

After finishing other sections (e.g. model config, trainer config, ...etc), allennlp train config/xxx.jsonnet --include-package konoha --serialization-dir yyy works! (remember to include konoha by --include-package konoha)

For more detail, please refer my blog article (in Japanese, sorry).

Test

python -m pytest

Article

Introducing Konoha (in Japanese): トークナイザをいい感じに切り替えるライブラリ konoha を作った
Implementing AllenNLP integration (in Japanese): 日本語解析ツール Konoha に AllenNLP 連携機能を実装した

Acknowledgement

Sentencepiece model used in test is provided by @yoheikikuta. Thanks!

chigichan24 / konoha