soniccat / go-wiktionary-parse

Tool to parse wiktionary dump into a sqlite database file

go-wiktionary-parse

This is a tool to parse language dumps from Wiktionary and store the results into a Sqlite database.

Quickstart

git clone https://github.com/macdub/go-wiktionary-parse
cd go-wiktionary-parse
wget https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2
bzip2 -d enwiktionary-latest-pages-articles.xml.bz2
go install .
go-wiktionary-parse -file enwiktionary-latest-pages-articles.xml -threads 20 -database test.db
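The decompressed file passed via -file is a MediaWiki XML export, in which each entry is a `<page>` element holding a `<title>` and a `<revision>` with the wiki `<text>`. As a rough illustration of how such a file can be streamed with Go's standard `encoding/xml` package, here is a minimal sketch. The `Page` type, `readPages`, `readPagesFromString`, and `sampleDump` are hypothetical names for this example and are not taken from the tool's source.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Page holds the two fields of a dump entry this sketch cares about.
// Hypothetical type; the real tool may model pages differently.
type Page struct {
	Title string `xml:"title"`
	Text  string `xml:"revision>text"`
}

// sampleDump is a tiny hand-written stand-in for a real dump file.
const sampleDump = `<mediawiki>
  <page>
    <title>dog</title>
    <revision>
      <text>==English==
===Noun===
# A domesticated mammal.</text>
    </revision>
  </page>
</mediawiki>`

// readPages streams r token by token and calls handle for every <page>,
// so the whole dump never has to be held in memory at once.
func readPages(r io.Reader, handle func(Page)) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "page" {
			continue
		}
		var p Page
		if err := dec.DecodeElement(&p, &start); err != nil {
			return err
		}
		handle(p)
	}
}

// readPagesFromString is a convenience wrapper used for the demo below.
func readPagesFromString(s string) ([]Page, error) {
	var pages []Page
	err := readPages(strings.NewReader(s), func(p Page) { pages = append(pages, p) })
	return pages, err
}

func main() {
	pages, err := readPagesFromString(sampleDump)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range pages {
		fmt.Printf("%s: %d bytes of wiki text\n", p.Title, len(p.Text))
	}
}
```

For a real dump, `readPages` would be given an `*os.File` opened on the decompressed XML instead of a string reader.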

Usage

Usage of wiktionary-parser:
    -cache_file string
        Use this as the cache file (default "xmlCache.gob")
    -database string
        Database file to use (default "database.db")
    -file string
        XML file to parse
    -lang string
        Language to target for parsing (default "English")
    -log_file string
        Log to this file
    -make_cache
        Make a cache file of the parsed XML
    -threads int
        Set the number of threads to use for parsing (default 5)
    -use_cache
        Use a 'gob' of the parsed XML file
    -purge
        Purge the existing database provided by the database flag
    -verbose
        Use verbose logging
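
The help text above resembles what Go's standard `flag` package prints. As an illustration only, the options could be declared roughly as follows. The `Options` struct and the `parseOptions` helper are hypothetical; only the flag names, defaults, and descriptions are copied from the usage text.

```go
package main

import (
	"flag"
	"fmt"
)

// Options collects the command-line settings listed in the usage text.
// Hypothetical struct; the tool itself may store these differently.
type Options struct {
	CacheFile string
	Database  string
	File      string
	Lang      string
	LogFile   string
	MakeCache bool
	Threads   int
	UseCache  bool
	Purge     bool
	Verbose   bool
}

// parseOptions parses args (without the program name) into Options.
func parseOptions(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("wiktionary-parser", flag.ContinueOnError)
	fs.StringVar(&o.CacheFile, "cache_file", "xmlCache.gob", "Use this as the cache file")
	fs.StringVar(&o.Database, "database", "database.db", "Database file to use")
	fs.StringVar(&o.File, "file", "", "XML file to parse")
	fs.StringVar(&o.Lang, "lang", "English", "Language to target for parsing")
	fs.StringVar(&o.LogFile, "log_file", "", "Log to this file")
	fs.BoolVar(&o.MakeCache, "make_cache", false, "Make a cache file of the parsed XML")
	fs.IntVar(&o.Threads, "threads", 5, "Set the number of threads to use for parsing")
	fs.BoolVar(&o.UseCache, "use_cache", false, "Use a 'gob' of the parsed XML file")
	fs.BoolVar(&o.Purge, "purge", false, "Purge the existing database provided by the database flag")
	fs.BoolVar(&o.Verbose, "verbose", false, "Use verbose logging")
	err := fs.Parse(args)
	return o, err
}

func main() {
	// Same arguments as the Quickstart invocation.
	o, err := parseOptions([]string{
		"-file", "enwiktionary-latest-pages-articles.xml",
		"-threads", "20",
		"-database", "test.db",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```

Flags that are not given on the command line keep the defaults shown in the usage text, for example `-lang English` and `-cache_file xmlCache.gob`.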

Build

Dependencies

Build

$ go build -o wiktionary-parser main.go

Current Limitations

  • It only looks at 14 lemmas
  • Definitions are not cleaned, so they are stored as raw wiki markup. This will be fixed in the near future.
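
Until definitions are cleaned by the tool, the raw wiki markup can be post-processed by the consumer. The following is a minimal sketch of one possible approach, not the tool's planned implementation: it drops `{{...}}` templates, keeps the visible label of `[[...]]` links, and removes bold/italic quote runs. The `cleanDefinition` function is a hypothetical helper. Note that dropping templates also discards any labels they carry.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Innermost {{template|args}} with no nested braces inside.
	templateRE = regexp.MustCompile(`\{\{[^{}]*\}\}`)
	// [[target]] or [[target|label]]; group 1 is the visible text.
	linkRE = regexp.MustCompile(`\[\[(?:[^\]|]*\|)?([^\]]*)\]\]`)
	// Runs of two or more apostrophes mark italic/bold text.
	quoteRE = regexp.MustCompile(`'{2,}`)
)

// cleanDefinition strips a small subset of wiki markup from s.
// Hypothetical helper; it handles only the common constructs named above.
func cleanDefinition(s string) string {
	// Remove templates repeatedly so nested ones disappear from the inside out.
	for {
		next := templateRE.ReplaceAllString(s, "")
		if next == s {
			break
		}
		s = next
	}
	s = linkRE.ReplaceAllString(s, "$1")
	s = quoteRE.ReplaceAllString(s, "")
	// Collapse the whitespace left behind by removed markup.
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	raw := "{{lb|en|informal}} A '''[[dog|domestic dog]]''' or [[wolf]]."
	fmt.Println(cleanDefinition(raw))
}
```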

Database

Structure

  • table name: dictionary

    COLUMN          TYPE
    id              integer
    word            text
    lemma           text
    etymology_no    integer
    definition_no   integer
    definition      text

  • The primary key is on id
  • An index is set up over word, lemma, etymology_no, definition_no
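
For readers who want to work with the database from Go, the structure described here can be written out as SQL statements and a matching row type. This is a sketch under assumptions: the column names and types come from the list above, while the `Entry` struct, the index name `idx_dictionary_lookup`, and the lookup query are hypothetical. Executing the statements needs `database/sql` plus a third-party SQLite driver, which is left out so the example stays standard-library only.

```go
package main

import (
	"fmt"
	"strings"
)

// Entry mirrors one row of the dictionary table (hypothetical type).
type Entry struct {
	ID           int64
	Word         string
	Lemma        string
	EtymologyNo  int
	DefinitionNo int
	Definition   string
}

// columnNames lists the columns in the order given in the README.
var columnNames = []string{
	"id", "word", "lemma", "etymology_no", "definition_no", "definition",
}

// createTableSQL reflects the documented columns and primary key.
const createTableSQL = `CREATE TABLE IF NOT EXISTS dictionary (
    id            INTEGER PRIMARY KEY,
    word          TEXT,
    lemma         TEXT,
    etymology_no  INTEGER,
    definition_no INTEGER,
    definition    TEXT
);`

// createIndexSQL covers the four indexed columns; the index name is an assumption.
const createIndexSQL = `CREATE INDEX IF NOT EXISTS idx_dictionary_lookup
    ON dictionary (word, lemma, etymology_no, definition_no);`

// selectByWordSQL builds a parameterized lookup that can use the index above.
func selectByWordSQL() string {
	return "SELECT " + strings.Join(columnNames, ", ") +
		" FROM dictionary WHERE word = ?" +
		" ORDER BY lemma, etymology_no, definition_no"
}

func main() {
	fmt.Println(createTableSQL)
	fmt.Println(createIndexSQL)
	fmt.Println(selectByWordSQL())
}
```

Because `word` is the leading column of the index, a lookup by word alone can still use it.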

Statistics

  • The database file that is built (20200506) is ~127 MB (51 MB compressed)
    • 914,799 words
    • 1,098,087 definitions
    • 14 lemmas
