j-min / Easy-Namuwiki-Extractor

Easy NamuWiki Extractor

A simple Namuwiki extractor, built as an extension of Namu Wiki Extractor.

This module strips namu mark (Namuwiki's markup syntax) from Namuwiki documents and extracts only their plain text.
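
To make "stripping the namu mark" concrete, here is a minimal sketch of the idea in Python. It is not the extractor's actual code: the function name strip_namu_mark and the rule list are hypothetical, and only a small, commonly seen subset of namu mark (bold, italic, strikethrough, links, headings) is handled.

```python
import re

# Hypothetical, illustrative subset of namu mark rules.
# The real extractor may cover far more syntax (tables, macros, footnotes, ...).
_RULES = [
    # [[target|label]] -> label, [[target]] -> target
    (re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]"), r"\1"),
    # '''bold''' must be handled before ''italic''
    (re.compile(r"'''(.*?)'''"), r"\1"),
    (re.compile(r"''(.*?)''"), r"\1"),
    # ~~strikethrough~~
    (re.compile(r"~~(.*?)~~"), r"\1"),
    # == Heading == -> Heading (one heading per line)
    (re.compile(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE), r"\1"),
]


def strip_namu_mark(text: str) -> str:
    """Remove the markup covered by _RULES and keep the visible text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(strip_namu_mark("== Overview ==\n'''Namuwiki''' is a [[wiki|Korean wiki]]."))
```

A rule table like this keeps each piece of syntax in one place, so adding a new rule does not require touching the function body.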

Environment

Usage

  • Clone this repo: git clone https://github.com/j-min/Easy-Namuwiki-Extractor

  • Download a Namuwiki JSON dump into the repo directory: wget http://file2.unofficialnis.ga/namuwiki_161031.json

  • You can find the latest dumps here

  • Run the extractor: python Run_extractor.py -i input_json_file -o outputfile_name

  • Options:

--input (-i) : input filename
--output (-o) : output filename
--multiprocess (-m) : run using the multiprocessing module
--title (-t) : include titles of documents while extracting
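
For readers who want to see how options like these are typically declared, the following is a rough sketch using Python's standard argparse module. It is not taken from Run_extractor.py: the function name build_parser is hypothetical, and treating -m and -t as on/off switches and -i and -o as required are assumptions based on the descriptions above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options listed above."""
    parser = argparse.ArgumentParser(description="Extract plain text from a Namuwiki JSON dump")
    # Assumption: both filenames are mandatory.
    parser.add_argument("-i", "--input", required=True, help="input filename")
    parser.add_argument("-o", "--output", required=True, help="output filename")
    # Assumption: these two behave as boolean switches that default to off.
    parser.add_argument("-m", "--multiprocess", action="store_true",
                        help="run using the multiprocessing module")
    parser.add_argument("-t", "--title", action="store_true",
                        help="include titles of documents while extracting")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.output, args.multiprocess, args.title)
```

Under these assumptions, a run with titles and multiprocessing enabled would look like: python Run_extractor.py -i namuwiki_161031.json -o output.txt -m -t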

What the Namuwiki JSON looks like

(Screenshot of the Namuwiki JSON dump)
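
As a rough guide to working with the dump, here is a small Python sketch that reads it and yields each document's raw text. The layout is an assumption, not something this README states: the sketch assumes the file is a single JSON array of objects with "title" and "text" keys. The function name iter_documents is hypothetical.

```python
import json


def iter_documents(path, include_title=False):
    """Yield the raw text of each document in the dump.

    Assumption: the dump is one JSON array of objects that carry
    "title" and "text" keys. Missing keys fall back to an empty string.
    """
    with open(path, encoding="utf-8") as f:
        # json.load reads the whole file into memory, so a large dump needs plenty of RAM.
        documents = json.load(f)
    for doc in documents:
        text = doc.get("text", "")
        if include_title:
            yield doc.get("title", "") + "\n" + text
        else:
            yield text


if __name__ == "__main__":
    for raw in iter_documents("namuwiki_161031.json", include_title=True):
        print(raw[:200])
        break
```

The yielded text still contains namu mark, so it would be passed through the markup-stripping step before being written to the output file.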

Sample Output

(Screenshot of sample output)

Languages

Language: Python 100.0%