evalphobia / go-jp-text-ripper

tokenize text and separate it into words for Japanese

go-jp-text-ripper

go-jp-text-ripper separates long Japanese text into words and puts spaces between the words.

Quick Usage

Install

# install
$ go get github.com/evalphobia/go-jp-text-ripper

# or clone and build
# $ git clone --depth 1 https://github.com/evalphobia/go-jp-text-ripper.git
# $ cd ./go-jp-text-ripper
# $ make build
$ go-jp-text-ripper -h

Commands:

  help   show help
  rip    Separate japanese text into words from CSV/TSV file
  rank   Show ranking of the word frequency

Subcommands

rip

The rip command separates Japanese text into words, reading from the --input file.

$ go-jp-text-ripper rip -h

Separate japanese text into words from CSV/TSV file

Options:

  -h, --help            display help information
  -c, --column          target column name in input file
      --columnn         target column index in input file (1st col=1)
  -i, --input          *input file path --input='/path/to/input.csv'
  -o, --output          output file path --output='./my_result.csv'
      --dic             custom dictionary path (mecab ipa dictionaly)
      --stopword        stop word list file path
      --show            print separated words to console
      --original        output original form of word
      --noun            output 'noun' type of word
      --verb            output 'verb' type of word
      --adjective       output 'adjective' type of word
      --neologd         use prefilter for neologd
      --progress[=30]   print current progress (sec)
      --min[=1]         minimum letter size for output
      --quote           columns to add double-quotes (separated by comma)
      --prefix          prefix name for new columns
  -r, --replace         replace from text column data to output result
      --debug           print debug result to console
      --dropempty       remove empty result from output
      --stoptop         use ranking from top as stopword
      --stoptopp        use ranking from top by percent as stopword (0.0 ~ 1.0)
      --stoplast        use ranking from last as stopword
      --stoplastp       use ranking from last by percent as stopword (0.0 ~ 1.0)
      --stopunique      use ranking stopword as unique per line

For example, to separate words in the example TSV file, run the commands below.

# check the file contents
$ head -n 2 ./example/aozora_bunko.tsv

id	author	title	url	exerpt
1	夏目 漱石	吾輩は猫である	https://www.aozora.gr.jp/cards/000148/card789.html	一 吾輩は猫である。名前はまだ無い。 ...


# run rip command
$ go-jp-text-ripper rip \
    --input ./example/aozora_bunko.tsv \
    --column exerpt \
    --output ./output.tsv

[INFO]	[Run]	read and write lines...
[INFO]	[Run]	finish process

# check the results
$ head -n 2 ./output.tsv
id	author	title	url	exerpt	op_text	op_word_count	op_non_word_count	op_raw_char_count
1	夏目 漱石	吾輩は猫である	https://www.aozora.gr.jp/cards/000148/card789.html	一 吾輩は猫である。名前はまだ無い。...	一 吾輩 猫 名前 無い どこ 生れ 見当 つか 何 薄暗い し 所 ニャーニャー 泣い いた事 記憶 ...	562	719	2000

Advanced options

# `--columnn` selects the target column by index (1st col=1)
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --show \
    --columnn 5

# `--dic` uses a custom dictionary for kagome (https://github.com/ikawaha/kagome)
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --dic /opt/data/neologd.dic

# `--stopword` sets the path of a custom stopword file; the listed words are ignored
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --stopword ./stopwords.txt

# `--show` prints the result to the console
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt \
    --show

# `--original` outputs the original (base) form (i.e. 原形) of each word.
# in Python (MeCab) code, this corresponds to `node.feature.split(",")[6]`
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --original

# `--noun` includes nouns in the results.
# `--verb` includes verbs in the results.
# `--adjective` includes adjectives in the results.
# (by default, nouns, verbs, and adjectives are all included)
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --noun --verb  # in this example, only nouns and verbs are output

# `--neologd` uses a special prefilter for neologd to normalize the text
# ref: https://github.com/evalphobia/go-jp-text-ripper/blob/master/prefilter/neologd.go
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --neologd

# `--progress` sets the interval, in seconds, at which the current progress is printed
# default is '30' seconds
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --progress 5

# `--min` sets the minimum number of letters a word needs to be output
# if you set '2', one-letter words (e.g. 'お', 'の', '犬', '猫', '嵐') are ignored
# default is '1'
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --min 3

# `--prefix` sets the prefix for the new columns
# default is 'op_'
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --output ./output.tsv \
    --prefix n_

# `--replace` overwrites the target column with the result
# default is false, which writes the result to the new column 'op_text'
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --replace

# `--dropempty` removes the empty result row
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --dropempty

# `--stoptop`, `--stoptopp`, `--stoplast`, `--stoplastp` use the rank command result as stopwords
# `--stoptop` and `--stoptopp` treat high-frequency words as stopwords
# `--stoplast` and `--stoplastp` treat low-frequency words as stopwords
# if you use both `--stoptop` and `--stoptopp` (or `--stoplast` and `--stoplastp`), filtering continues until both conditions are met.
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --stoptop 300 \
    --stoptopp 0.1  # whichever is bigger, 300 words or 10% of words

# `--stopunique` is used together with the `--stop[top/last]` options
# it counts a word only once per row when computing the frequency
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --stoptop 300 \
    --stopunique
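
To make the "whichever is bigger" rule for `--stoptop` and `--stoptopp` concrete, here is a minimal sketch of how such a cutoff could be computed from word counts. It is not the tool's actual implementation: the function name `topStopWords` is hypothetical, and the sketch assumes the percentage is taken over the number of distinct words, which the tool may define differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// topStopWords (hypothetical) returns the most frequent words as stopwords.
// The cutoff is the larger of n words and ratio * (number of distinct words).
// ASSUMPTION: the ratio applies to distinct words; the real tool may differ.
func topStopWords(counts map[string]int, n int, ratio float64) []string {
	words := make([]string, 0, len(counts))
	for w := range counts {
		words = append(words, w)
	}
	// Sort by count (descending), then by word so the order is deterministic.
	sort.Slice(words, func(i, j int) bool {
		if counts[words[i]] != counts[words[j]] {
			return counts[words[i]] > counts[words[j]]
		}
		return words[i] < words[j]
	})

	cutoff := n
	if byRatio := int(math.Ceil(ratio * float64(len(words)))); byRatio > cutoff {
		cutoff = byRatio
	}
	if cutoff > len(words) {
		cutoff = len(words)
	}
	return words[:cutoff]
}

func main() {
	counts := map[string]int{"し": 52, "の": 41, "い": 31, "吾輩": 18}
	// n=1 would pick one word, but 50% of 4 distinct words is 2, so 2 wins.
	fmt.Println(topStopWords(counts, 1, 0.5))
}
```

The same idea applies to `--stoplast` and `--stoplastp` with the sort order reversed.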

rank

The rank command builds a word frequency ranking from the --input file.

$ go-jp-text-ripper rank -h

Show ranking of the word frequency

Options:

  -h, --help            display help information
  -c, --column          target column name in input file
      --columnn         target column index in input file (1st col=1)
  -i, --input          *input file path --input='/path/to/input.csv'
  -o, --output          output file path --output='./my_result.csv'
      --dic             custom dictionary path (mecab ipa dictionaly)
      --stopword        stop word list file path
      --show            print separated words to console
      --original        output original form of word
      --noun            output 'noun' type of word
      --verb            output 'verb' type of word
      --adjective       output 'adjective' type of word
      --neologd         use prefilter for neologd
      --progress[=30]   print current progress (sec)
      --min[=1]         minimum letter size for output
      --top             rank from top by count
      --topp            rank from top by percent (0.0 ~ 1.0)
      --last            rank from last by count
      --lastp           rank from last by percent (0.0 ~ 1.0)
  -u, --unique          count as one word if the same word exists in a line

For example, to get a word frequency ranking from the example TSV file, run the commands below.

# check the file contents
$ head -n 2 ./example/aozora_bunko.tsv

id	author	title	url	exerpt
1	夏目 漱石	吾輩は猫である	https://www.aozora.gr.jp/cards/000148/card789.html	一 吾輩は猫である。名前はまだ無い。 ...


# run rank command
$ go-jp-text-ripper rank \
    --input ./example/aozora_bunko.tsv \
    --column exerpt \
    --output ./output_rank.tsv \
    --stopword ./stopwords.txt

[INFO]	[DoWithProgress]	read lines...
[INFO]	[Do]	Total Words:1041
[INFO]	[DoWithProgress]	finish process

# check the results
$ head -n 10 ./output_rank.tsv
type	rank	word	countN	countP
top	1	し	52	0.02802
top	2	の	41	0.02209
top	3	い	31	0.01670
top	4	いる	23	0.01239
top	5	吾輩	18	0.00970
top	6	ゐる	16	0.00862
top	7	政治	14	0.00754
top	8	人間	13	0.00700
top	9	れ	12	0.00647
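
The sample figures above are consistent with `countP` being `countN` divided by the total number of word occurrences (for instance, 52 / 0.02802 is roughly 1856), although the README does not state this explicitly. The sketch below illustrates that reading with already-separated lines, including a per-line unique mode similar in spirit to `--unique`. The names `rankEntry` and `rank` are hypothetical, and this is not the tool's own code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rankEntry (hypothetical) holds one row of a frequency ranking.
type rankEntry struct {
	Word   string
	CountN int     // number of occurrences
	CountP float64 // ASSUMPTION: CountN divided by all counted occurrences
}

// rank counts space-separated words and sorts them by frequency.
// When unique is true, a word is counted at most once per line.
func rank(lines []string, unique bool) []rankEntry {
	counts := map[string]int{}
	total := 0
	for _, line := range lines {
		seen := map[string]bool{}
		for _, w := range strings.Fields(line) {
			if unique && seen[w] {
				continue
			}
			seen[w] = true
			counts[w]++
			total++
		}
	}

	entries := make([]rankEntry, 0, len(counts))
	for w, n := range counts {
		entries = append(entries, rankEntry{w, n, float64(n) / float64(total)})
	}
	// Highest count first; ties are broken by word for a stable order.
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].CountN != entries[j].CountN {
			return entries[i].CountN > entries[j].CountN
		}
		return entries[i].Word < entries[j].Word
	})
	return entries
}

func main() {
	lines := []string{"吾輩 猫 吾輩", "猫 名前"}
	for i, e := range rank(lines, false) {
		fmt.Printf("top\t%d\t%s\t%d\t%.5f\n", i+1, e.Word, e.CountN, e.CountP)
	}
}
```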

Custom Go App

Import go-jp-text-ripper and add plugins to the config. You can also add your own custom plugins.

package main

import (
	"log"
	"strconv"

	"github.com/evalphobia/go-jp-text-ripper/plugin"
	"github.com/evalphobia/go-jp-text-ripper/postfilter" // path assumed to mirror prefilter
	"github.com/evalphobia/go-jp-text-ripper/prefilter"
	"github.com/evalphobia/go-jp-text-ripper/ripper"
)

// cli entry point
func main() {
	common := ripper.CommonConfig{}

	// prefilters to normalize raw text
	common.PreFilters = []*ripper.PreFilter{
		prefilter.Neologd,
	}

	// plugins
	common.Plugins = []*ripper.Plugin{
		plugin.KanaCountPlugin,
		plugin.AlphaNumCountPlugin,
		plugin.CharTypeCountPlugin,
		plugin.MaxCharCountPlugin,
		plugin.MaxWordCountPlugin,
		plugin.SymbolCountPlugin,
		plugin.NounNameCountPlugin,
		plugin.NounHasFullNamePlugin,
		plugin.NounNumberCountPlugin,
		plugin.KanaNumberLikeCountPlugin,
		plugin.KanaAlphabetLikeCountPlugin,
		plugin.NounLocationCountPlugin,
		plugin.NounOrganizationCountPlugin,
		// MyCustomPlugin,
		&ripper.Plugin{
			// column title for this plugin's result
			Title: "proper_noun_count",
			// counts the words tagged as proper nouns (固有名詞)
			Fn: func(text *ripper.TextData) string {
				return strconv.Itoa(text.GetWords().CountFeatures("固有名詞"))
			},
		},
	}

	// postfilters run after all of the plugins have been processed
	common.PostFilters = []*ripper.PostFilter{
		postfilter.RatioJP,
		postfilter.RatioAlphaNum,
	}

	err := ripper.DoRip(ripper.RipConfig{
		CommonConfig:      common,
		DropEmpty:         true,
		StopWordTopNumber: 300,
	})
	if err != nil {
		log.Fatal(err)
	}
}

Then, build and run!
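
A custom plugin's `Fn` returns a string that ends up in a new output column, so the interesting part is usually a small text-to-value function. As an illustration only, the sketch below counts hiragana and katakana characters with the standard library. The function name `kanaCount` is hypothetical and this is not how `plugin.KanaCountPlugin` is necessarily implemented. How to obtain the raw text from `ripper.TextData` is not shown in this README, so the sketch works on a plain string.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode"
)

// kanaCount (hypothetical) returns the number of hiragana and katakana
// characters in text, formatted as a string like a plugin result would be.
func kanaCount(text string) string {
	n := 0
	for _, r := range text {
		if unicode.In(r, unicode.Hiragana, unicode.Katakana) {
			n++
		}
	}
	return strconv.Itoa(n)
}

func main() {
	fmt.Println(kanaCount("吾輩は猫である。名前はまだ無い。"))
}
```

Logic like this could then be called from the `Fn` of a `ripper.Plugin`, in the same way as the `proper_noun_count` example above.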

License

Apache License, Version 2.0

Credit

This project depends on several awesome libraries, including kagome (https://github.com/ikawaha/kagome).
