koron / wpwordtool

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Wikipedia Word Tool

PkgGoDev Actions/Go Go Report Card

Wikipedia の記事見出しを java-ugmatcha-suite で取り扱える形式に変換するツール。

詳細は https://github.com/koron/java-ugmatcha-suite/blob/master/doc/algorithm.md および https://github.com/koron/java-ugmatcha-suite/blob/master/wikidict/README.md を参照してください。

Getting started

Install or Update

$ go install github.com/koron/wpwordtool@latest

Convert sub command

$ wpwordtool convert

Load jawiki-latest-all-titles-in-ns0.gz (default of -ja) and enwiki-latest-all-titles-in-ns0.gz (default of -en), then save converted dictionary files tmp/wikiwords.stt and tmp/wikiwords.stw. Basename of output files is tmp/wikiwords which is default of -out.

Abstract sub command

Extract all abstracts from wikipedia's article XML files. The pattern of names is jawiki-YYYYMMDD-abstractX.xml.gz.

How to use:

$ wpwordtool abstract jawiki-20190701-abstract*.xml.gz > abstract.txt
2019/07/24 11:58:09 extracting from jawiki-20190701-abstract1.xml.gz
2019/07/24 11:58:11 extracting from jawiki-20190701-abstract2.xml.gz
2019/07/24 11:58:13 extracting from jawiki-20190701-abstract3.xml.gz
2019/07/24 11:58:14 extracting from jawiki-20190701-abstract4.xml.gz
2019/07/24 11:58:15 extracting from jawiki-20190701-abstract5.xml.gz
2019/07/24 11:58:16 extracting from jawiki-20190701-abstract6.xml.gz

$ wc -l -c abstract.txt
  1157686 141950858 abstract.txt

Dictionaries

Download these files.

About

License:MIT License


Languages

Language:Go 93.0%Language:Makefile 7.0%