mnpk / mecab-bind

Binding MeCab Tagger to Python3 and TensorFlow

mecab-bind

Binding MeCab Tagger to Python and TensorFlow

Installation

  • Python binding: pip install mecab-bind
  • TensorFlow binding: pip install mecab-tf

Compatible TensorFlow version

mecab-tf    tensorflow version    python version
2.4.0       2.4.x                 3.6, 3.7, 3.8
2.5.0       2.5.x                 3.6, 3.7, 3.8, 3.9
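
The table can be read as a lookup keyed on the TensorFlow major.minor release. A minimal sketch of that lookup follows; `COMPAT` and `pick_mecab_tf` are hypothetical names used only for illustration and are not part of mecab-bind or mecab-tf.

```python
# Hypothetical helper that mirrors the compatibility table above.
COMPAT = {
    # (TensorFlow major, minor): (mecab-tf version, supported Python versions)
    (2, 4): ("2.4.0", ("3.6", "3.7", "3.8")),
    (2, 5): ("2.5.0", ("3.6", "3.7", "3.8", "3.9")),
}


def pick_mecab_tf(tf_version, python_version):
    """Return the mecab-tf version for a TensorFlow/Python pair, or None."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    entry = COMPAT.get((major, minor))
    if entry is None:
        return None  # TensorFlow release not listed in the table
    mecab_tf_version, pythons = entry
    return mecab_tf_version if python_version in pythons else None


print(pick_mecab_tf("2.5.1", "3.9"))  # 2.5.0
print(pick_mecab_tf("2.4.3", "3.9"))  # None (Python 3.9 is not listed for 2.4.x)
```

The returned version can then be pinned at install time, for example `pip install mecab-tf==2.5.0`.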

Usage

Python Binding

>>> import mecab
>>> tagger = mecab.Tagger(mecab.get_model_args("./test-data/dic")) # pass dictionary path instead of "./test-data/dic"
>>> dic_infos = tagger.get_dictionary_info()
>>> tagger.get_dictionary_info()
[<DictionaryInfo filename=./test-data/dic/sys.dic, charset=UTF-8, size=4335, type=0, lsize=346, rsize=346, version=102>]
>>> tagger.parse_node_with_lattice("シリーズ中、カンフーシーンが一番多い。")
[
    <Node surface="", feature="BOS/EOS,*,*,*,*,*,*,*,*">,
    <Node surface="シリーズ", feature="名詞,一般,*,*,*,*,*">,
    <Node surface="中", feature="接頭詞,数接続,*,*,*,*,中,ナカ,ナカ">,
    <Node surface="、", feature="記号,読点,*,*,*,*,、,、,、">,
    <Node surface="カンフーシーン", feature="名詞,一般,*,*,*,*,*">,
    <Node surface="が", feature="助詞,格助詞,一般,*,*,*,が,ガ,ガ">,
    <Node surface="一番", feature="名詞,副詞可能,*,*,*,*,一番,イチバン,イチバン">,
    <Node surface="多い", feature="形容詞,自立,*,*,形容詞・アウオ段,基本形,多い,オオイ,オーイ">,
    <Node surface="。", feature="記号,句点,*,*,*,*,。,。,。">,
    <Node surface="", feature="BOS/EOS,*,*,*,*,*,*,*,*">
]
>>> tagger.parse_nbest_with_lattice("シリーズ中、カンフーシーンが一番多い。", 10)
[
    [
        <Node surface="", feature="BOS/EOS,*,*,*,*,*,*,*,*">,
        <Node surface="シリーズ", feature="名詞,一般,*,*,*,*,*">,
        <Node surface="中", feature="接頭詞,数接続,*,*,*,*,中,ナカ,ナカ">,
        ...
    ],
    [
        <Node surface="", feature="BOS/EOS,*,*,*,*,*,*,*,*">,
        <Node surface="シリーズ", feature="名詞,一般,*,*,*,*,*">,
        <Node surface="中", feature="接頭詞,数接続,*,*,*,*,中,ナカ,ナカ">,
        <Node surface="、", feature="記号,読点,*,*,*,*,、,、,、">,
        ...
    ],
    ...
]
>>> print(tagger.parse("シリーズ中、カンフーシーンが一番多い。"))
シリーズ        名詞,一般,*,*,*,*,*
中      接頭詞,数接続,*,*,*,*,中,ナカ,ナカ
、      記号,読点,*,*,*,*,、,、,、
カンフーシーン  名詞,一般,*,*,*,*,*
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
一番    名詞,副詞可能,*,*,*,*,一番,イチバン,イチバン
多い    形容詞,自立,*,*,形容詞・アウオ段,基本形,多い,オオイ,オーイ
。      記号,句点,*,*,*,*,。,。,。
EOS
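
The text returned by `parse` can be split back into tokens when the structured node API is not needed. The sketch below assumes MeCab's default output format, in which each line is a surface form and a comma-separated feature string separated by a tab, and the sentence ends with an `EOS` line. `parse_mecab_output` is a hypothetical helper, not part of mecab-bind, and the sample string stands in for real `tagger.parse(...)` output.

```python
def parse_mecab_output(text):
    """Split MeCab's default text output into (surface, feature_fields) pairs."""
    tokens = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        # Assumption: surface and features are separated by a single tab.
        surface, _, feature = line.partition("\t")
        tokens.append((surface, feature.split(",")))
    return tokens


# Stand-in for the string returned by tagger.parse(...)
sample = (
    "シリーズ\t名詞,一般,*,*,*,*,*\n"
    "一番\t名詞,副詞可能,*,*,*,*,一番,イチバン,イチバン\n"
    "EOS\n"
)

for surface, fields in parse_mecab_output(sample):
    # fields[0] is the part of speech, fields[1] its subcategory
    print(surface, fields[0], fields[1])
```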

Bound commands

  • mecab-dict-index
  • mecab-dict-gen
  • mecab-system-eval
  • mecab-cost-train
  • mecab-test-gen
  • mecab
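
To check which of these commands are available in the current environment, a short sketch using only the standard library is shown below. It assumes that installing the package places the commands on `PATH`, which the source does not state explicitly. `locate_bound_commands` is a hypothetical helper name.

```python
import shutil

# Command names taken from the list above.
BOUND_COMMANDS = [
    "mecab-dict-index",
    "mecab-dict-gen",
    "mecab-system-eval",
    "mecab-cost-train",
    "mecab-test-gen",
    "mecab",
]


def locate_bound_commands():
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in BOUND_COMMANDS}


for name, path in locate_bound_commands().items():
    print(f"{name}: {path or 'not found on PATH'}")
```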

TensorFlow Binding

>>> import tensorflow as tf
>>> from mecab_tf.python.ops.mecab_ops import MecabTagger
>>> tagger = MecabTagger("./test-data/dic")
2021-05-20 05:35:48.759933: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> surfaces, features = tagger.tag(["シリーズ中、カンフーシーンが一番多い。", "※撮影中に、ジェット・リーが失踪。"])
>>> surfaces.shape
TensorShape([2, None])
>>> features.shape
TensorShape([2, None])
>>> for surface, feature in zip(surfaces[0], features[0]):  # print first sentence
...     print(surface.numpy().decode('utf8'), feature.numpy().decode('utf8'))
...
 BOS/EOS,*,*,*,*,*,*,*,*
シリーズ 名詞,一般,*,*,*,*,*
中 接頭詞,数接続,*,*,*,*,中,ナカ,ナカ
、 記号,読点,*,*,*,*,、,、,、
カンフーシーン 名詞,一般,*,*,*,*,*
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
一番 名詞,副詞可能,*,*,*,*,一番,イチバン,イチバン
多い 形容詞,自立,*,*,形容詞・アウオ段,基本形,多い,オオイ,オーイ
。 記号,句点,*,*,*,*,。,。,。
 BOS/EOS,*,*,*,*,*,*,*,*
>>> # you can pass any shape of string tensor
>>> _ = tagger.tag("シリーズ中、カンフーシーンが一番多い。")
>>> _ = tagger.tag([["シリーズ中、カンフーシーンが一番多い。", "※撮影中に、ジェット・リーが失踪。"]])

Note: If you export this module in SavedModel format, pass model_path as an absolute path. Only the model_path string is serialized; the dictionary data is not, so the dictionary must exist at that path wherever the model is loaded.
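
One way to follow this recommendation is to resolve the dictionary directory to an absolute path before constructing the tagger. This is a minimal sketch using only the standard library; `absolute_model_path` is a hypothetical helper, and the `MecabTagger` call is left as a comment so that the snippet runs without TensorFlow installed.

```python
import os


def absolute_model_path(model_path):
    """Expand '~' and resolve a dictionary directory to an absolute path."""
    return os.path.abspath(os.path.expanduser(model_path))


dic_path = absolute_model_path("./test-data/dic")
print(dic_path)

# The resolved path can then be passed to the tagger as in the usage above:
# tagger = MecabTagger(dic_path)
```

Because the path is resolved against the current working directory at export time, the dictionary still has to be present at the same absolute location on the machine that loads the SavedModel.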

Prebuilt dictionaries

About

License: GNU General Public License v3.0


Languages

C++ 32.6%, Python 31.2%, Shell 20.0%, Starlark 16.2%