zim_wikireader

Python scripts to extract HTML from a ZIM file to feed into the wikireader build process.

Supplied: extract.py, filter_out.pl, duplicate.pl, remove.pl

==========

  • Download wikipedia_en_all_nopic_*zim from https://download.kiwix.org/zim/wikipedia/ (a download sketch follows this list)
  • The dump seems to lag behind the database backup by 1 to 2 months; I assume the processing required is immense, given they build all the other ZIMs too.
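If you'd rather script the download than grab it by hand, here is a minimal Python sketch using only the standard library. The filename is just the example used in the extract step below; check the directory listing for the current one.

    # Minimal sketch: fetch one dump with the standard library.
    # The filename is an example; check the Kiwix listing for the latest.
    import urllib.request

    name = "wikipedia_en_all_nopic_2024-04.zim"
    url = "https://download.kiwix.org/zim/wikipedia/" + name
    urllib.request.urlretrieve(url, name)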

mkdir OUTPUT2

  • watcher.pl throttles execution so that only 70 extract.py processes run at once, suspending the others
  • I have an X99 board with 72 hardware threads, so it runs lovely... YMMV
  • I need to implement Python queuing when I've learnt it (see the sketch after the command below)

perl watcher.pl &
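A minimal sketch of the in-Python queuing mentioned above, using the standard library's multiprocessing.Pool to cap concurrency instead of suspending excess processes externally. process_chunk and the chunk count of 1024 are stand-ins for the real per-stream work.

    # Minimal sketch of in-Python queuing (an alternative to watcher.pl).
    # process_chunk is a hypothetical stand-in for one extraction stream.
    from multiprocessing import Pool

    MAX_WORKERS = 70  # matches the 70-process throttle described above

    def process_chunk(chunk_id):
        ...  # extract and filter one stream of pages
        return chunk_id

    if __name__ == "__main__":
        with Pool(processes=MAX_WORKERS) as pool:
            # all 1024 chunks are queued; only MAX_WORKERS run at any moment
            for _ in pool.imap_unordered(process_chunk, range(1024)):
                pass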

  • extract.py takes the ZIM file and sequentially processes each page; each page is HTML, but with lots of extra formatting that is not needed and that bloats or breaks the wikireader build
  • so it filters that out as best it can (see the sketch after the parameter list below)
  • Apologies for the awful Python code

python3 extract.py wikipedia_en_all_nopic_2024-04.zim 1024 2024-04

Parameters: &lt;ZIM file&gt; &lt;number of workers&gt; &lt;file name&gt;
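For a feel of what each worker does, a minimal sketch assuming the python-libzim Archive API and the standard library's HTMLParser. The entry path is a hypothetical example, and the real extract.py does far more filtering than this.

    # Minimal sketch: read one page's HTML from the ZIM and strip its tags.
    # Assumes the python-libzim package; the entry path is hypothetical.
    from html.parser import HTMLParser
    from libzim.reader import Archive

    class TextOnly(HTMLParser):
        # collect text content, discarding tags and attributes
        def __init__(self):
            super().__init__()
            self.parts = []
        def handle_data(self, data):
            self.parts.append(data)

    zim = Archive("wikipedia_en_all_nopic_2024-04.zim")
    entry = zim.get_entry_by_path("A/Example_article")  # hypothetical path
    html = bytes(entry.get_item().content).decode("utf-8", errors="replace")

    stripper = TextOnly()
    stripper.feed(html)
    print("".join(stripper.parts))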

  • This creates 1024 "chunks" of files. Because the pages get bigger when processed sequentially (due to how the original Wikipedia dump was built), 1 or 2 processes end up taking many times longer than the average. Hence the "sort -R" below randomises the order, which makes each stream roughly the same size (a Python equivalent follows the command below)

  • The commands below assemble everything into one ~50 GB file

cd OUTPUT2

cat `ls *.wiki | sort -R | xargs echo` > full.txt
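The same shuffle-and-concatenate step as a minimal Python sketch; the random order spreads the few oversized chunks evenly across the streams.

    # Minimal sketch: shuffle the chunk files and concatenate them,
    # the Python equivalent of `cat $(ls *.wiki | sort -R) > full.txt`.
    import glob
    import random
    import shutil

    files = glob.glob("*.wiki")
    random.shuffle(files)  # same effect as `sort -R`

    with open("full.txt", "wb") as out:
        for name in files:
            with open(name, "rb") as f:
                shutil.copyfileobj(f, out)  # stream, don't load 50 GB into RAM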

  • Now do some tidying, as some articles still "break" the wikireader build (usually due to UTF-8 corruption); it seems to vary between extracts (a byte-level filter sketch follows the command below)

cat full.txt | grep -v "^ This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files." | perl filter_out.pl | perl duplicate.pl | perl remove.pl >> enwiki.xml
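If you need to chase the UTF-8 corruption yourself, a minimal sketch of an extra filter in the same spirit as the Perl scripts above; the script name is hypothetical, and it simply drops any line that fails to decode.

    # utf8_filter.py (hypothetical name): drop lines that are not valid UTF-8,
    # in the same spirit as the Perl clean-up filters in the pipeline above.
    import sys

    for raw in sys.stdin.buffer:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # skip the corrupt line rather than break the build
        sys.stdout.buffer.write(raw)

Run as, e.g., python3 utf8_filter.py < full.txt, slotted into the pipeline before the Perl filters.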

  • Now feed enwiki.xml into the wikireader-master build process as normal

i.e.

export PATH=$PATH:/u01/WIKI/BUILD/wikireader-master/scripts/

./scripts/Run --verbose --machines=1 --parallel=64 --farm=1 en:::NO::::

make WORKDIR=work DESTDIR=image combine install
