TAbdiukov / mwscrape2slob

mwscrape2slob

This tool creates a slob file from the content of a MediaWiki CouchDB database created by mwscrape.

Installation

mwscrape2slob depends on lxml and slob.py. To install them and the tool itself:

Install lxml (consult your operating system documentation for installation instructions). For example, on Ubuntu 14.04:

sudo apt-get install python3-lxml

Create a Python 3 virtual environment and install slob.py as described at https://github.com/itkach/slob/.

In this virtual environment, run:

pip install git+https://github.com/itkach/mwscrape2slob.git
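
As a quick sanity check after the steps above, the following minimal sketch reports whether lxml can be imported and whether the mwscrape2slob command is on the PATH of the active environment. The helper name check_install is hypothetical and is not part of mwscrape2slob; it only uses the Python standard library.

```python
import importlib.util
import shutil


def check_install():
    """Return which prerequisites are visible from the current environment.

    check_install is a hypothetical helper for illustration only.
    """
    return {
        # True if the lxml package can be located by the import system
        "lxml": importlib.util.find_spec("lxml") is not None,
        # True if an executable named mwscrape2slob is found on PATH
        "mwscrape2slob": shutil.which("mwscrape2slob") is not None,
    }


if __name__ == "__main__":
    for name, found in check_install().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Run it with the virtual environment activated; a "missing" entry suggests the corresponding installation step did not take effect in that environment.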

Usage

usage: mwscrape2slob [-h] [-s STARTKEY] [-e ENDKEY] [-k KEY [KEY ...]]
                     [-K KEY_FILE] [-f FILTER_FILE [FILTER_FILE ...]]
                     [-F FILTER [FILTER ...]] [-o OUTPUT_FILE]
                     [-c {lzma2,zlib}] [-b BIN_SIZE] [-u URI]
                     [-l LICENSE_NAME] [-L LICENSE_URL]
                     [-ll LANGLINKS [LANGLINKS ...]] [-a CREATED_BY]
                     [-w WORK_DIR] [--filter-dir FILTER_DIR]
                     [--html-encoding HTML_ENCODING]
                     [--remove-embedded-bg REMOVE_EMBEDDED_BG]
                     couch_url

positional arguments:
  couch_url             URL of CouchDB created by mwscrape to be used as input

optional arguments:
  -h, --help            show this help message and exit
  -s STARTKEY, --startkey STARTKEY
                        Skip articles with titles before this one when sorted
  -e ENDKEY, --endkey ENDKEY
                        Stop processing when this title is reached
  -k KEY [KEY ...], --key KEY [KEY ...]
                        Process specified keys only
  -K KEY_FILE, --key-file KEY_FILE
                        Process only keys specified in file
  -f FILTER_FILE [FILTER_FILE ...], --filter-file FILTER_FILE [FILTER_FILE ...]
                        Name of filter file. Filter file consists of CSS
                        selectors (see cssselect documentation for description
                        of supported selectors), one selector per line.
  -F FILTER [FILTER ...], --filter FILTER [FILTER ...]
                        CSS selectors for elements to exclude (see cssselect
                        documentation for description of supported selectors)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Name of output slob file
  -c {lzma2,zlib}, --compression {lzma2,zlib}
                        Name of compression to use. Default: lzma2
  -b BIN_SIZE, --bin-size BIN_SIZE
                        Minimum storage bin size in kilobytes. Default: 384
  -u URI, --uri URI     Value for uri tag. Slob-specific article URLs such as
                        bookmarks can be migrated to another slob based on
                        matching "uri" tag values
  -l LICENSE_NAME, --license-name LICENSE_NAME
                        Value for license.name tag. This should be name under
                        which the license is commonly known.
  -L LICENSE_URL, --license-url LICENSE_URL
                        Value for license.url tag. This should be a URL for
                        license text
  -ll LANGLINKS [LANGLINKS ...], --langlinks LANGLINKS [LANGLINKS ...]
                        Include article titles from Wikipedia language links
                        for these languages if available
  -a CREATED_BY, --created-by CREATED_BY
                        Value for created.by tag. Identifier (e.g. name or
                        email) for slob file creator
  -w WORK_DIR, --work-dir WORK_DIR
                        Directory for temporary files created during
                        compilation. Default: .
  --filter-dir FILTER_DIR
                        Directory where filter files are located. Default:
                        filters directory in this package
  --html-encoding HTML_ENCODING
                        HTML text encoding. Default: utf-8
  --remove-embedded-bg REMOVE_EMBEDDED_BG
                        Comma separated list of CSS selectors. Background will
                        be removed from matching element's style attribute.
                        For example, to remove background from all elements
                        with style attribute specify selector [style]. Default:
  --no-math
                        Do not include MathJax resources into dictionary 
                        (articles do not use math markup).
  --article-namespace
                        Treat specified MediaWiki namespaces as articles
  --ensure-ext-image-urls
                        Convert internal image URLs to external URLs 
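
The help text above describes a filter file as a list of CSS selectors, one per line. The sketch below writes and reads such a file to illustrate the format. The selectors shown, the file name custom, and both helper functions are illustrative assumptions. Skipping blank lines is also an assumption of this sketch, not confirmed behavior of mwscrape2slob.

```python
import os
import tempfile
from pathlib import Path

# Example selectors for elements to exclude (illustrative only)
SELECTORS = [".navbox", "#toc", "table.ambox"]


def write_filter_file(path, selectors):
    """Write one CSS selector per line, as the -f help text describes."""
    Path(path).write_text("\n".join(selectors) + "\n", encoding="utf-8")


def read_filter_file(path):
    """Read selectors back, ignoring blank lines (an assumption of this sketch)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as filter_dir:
        filter_path = os.path.join(filter_dir, "custom")
        write_filter_file(filter_path, SELECTORS)
        print(read_filter_file(filter_path))
```

A file written this way could then be passed with -f, or its selectors given directly on the command line with -F.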

For example, assuming a CouchDB server is running on localhost at port 5984 and has an mwscrape database named simple-m-wikipedia-org, the following command creates a slob using the common and wiki content filters:

mwscrape2slob http://127.0.0.1:5984/simple-m-wikipedia-org -f common wiki
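
When the conversion is driven from a script, the same command line can be assembled in Python. This is a minimal sketch: build_command is a hypothetical helper, and it only uses options listed in the usage text above (-f, -o, -c). The CouchDB URL is placed first, as in the example, so that it is not consumed as another value of the multi-value -f option.

```python
import shlex


def build_command(couch_url, filters=(), output_file=None, compression=None):
    """Assemble an argument list for mwscrape2slob (hypothetical helper)."""
    argv = ["mwscrape2slob", couch_url]
    if filters:
        argv += ["-f", *filters]       # one or more filter file names
    if output_file:
        argv += ["-o", output_file]    # name of output slob file
    if compression:
        argv += ["-c", compression]    # lzma2 or zlib per the usage text
    return argv


if __name__ == "__main__":
    argv = build_command(
        "http://127.0.0.1:5984/simple-m-wikipedia-org",
        filters=["common", "wiki"],
    )
    # Print the shell-quoted form of the command for inspection
    print(shlex.join(argv))
```

To actually run the conversion, the list can be passed to subprocess.run(argv, check=True) from the standard library, which raises an error if the tool exits with a non-zero status.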
