zim_wikireader

Python scripts to extract html from zim file to feed into wikireader build process

Supplied :- extract.py filter_out.pl duplicate.pl remove.pl

==========

Download wikipedia_en_all_nopic_*zim from https://download.kiwix.org/zim/wikipedia/
Seems to lag behind database backup by 1 to 2 months, I assume the processing required is immense given they do all the other zims too.

mkdir OUTPUT2

This chokes so that only 70 processes extract.py run at once, and suspends the others
I have a x99 board with 72 processes on it, so it runs lovely... ymmv
I need to implement python queuing when I've learnt it

perl watcher.pl &

So this takes the zim file, and sequentially processes each page, each page is html, but with lots of extra formatting that is not needed and bloats/breaks the wikireader build
so it filters it out the best it can
Apologies for awful python code

python3 extract.py wikipedia_en_all_nopic_2024-04.zim 1024 2024-04

Parameters ZIM file Number of workers File_name

This creates 1024 "chunks" of files, because they get bigger when processed sequentially - due to how origianl wikipedia was built -
ends up with 1 or 2 processes taking many more times than the average. Hence the "sort -R" below makes it random and mostly makes each stream roughly the same size.
The below assembles it into on ~50gb file

cd OUTPUT2

cat `ls *.wiki | sort -R | xargs echo` > full.txt

Now do some tidying, as some articles still "break" wikireader build (usually to do with utf8 corruption). Seems to vary between extracts

cat full.txt | grep -v "^ This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files." | perl filter_out.pl | perl duplicate.pl | perl remove.pl >> enwiki.xml

Now feed enwiki.xml into the wikireader-master build process as normal

i.e.

export PATH=$PATH:/u01/WIKI/BUILD/wikireader-master/scripts/

./scripts/Run --verbose --machines=1 --parallel=64 --farm=1 en:::NO::::

make WORKDIR=work DESTDIR=image combine install

geoffwolf98 / zim_wikireader

zim_wikireader

About

Languages