Dumb web to disk tool; html, markdown / md / text, epub
Python 3 and 2.7
python3 -m venv py3venv # optional...
# TODO better way than directly from command line list
python -m pip install --upgrade markdownify readability-lxml git+https://github.com/clach04/pypub.git git+https://github.com/clach04/w2d.git # Python 2 or 3 - without trafilatura
python -m pip install --upgrade markdownify readability-lxml trafilatura git+https://github.com/clach04/pypub.git git+https://github.com/clach04/w2d.git # Python 3 only
python -m pip install -e git+https://github.com/clach04/w2d.git#egg=w2d
w2d
w2d https://en.wikipedia.org/wiki/EPUB
w2d local_file.html
TODO document Debian packages that can be installed
git clone https://github.com/clach04/w2d.git
cd w2d
python3 -m venv py3venv
. py3venv/bin/activate
python -m pip install -r requirements.txt
python setup.py develop # optional to have w2d binary
python -m w2d
python -m w2d https://en.wikipedia.org/wiki/EPUB
python -m w2d local_file.html
# if setup.py ran in install or develop mode
w2d
w2d https://en.wikipedia.org/wiki/EPUB
w2d local_file.html
set W2D_OUTPUT_FORMAT=epub
export W2D_OUTPUT_FORMAT=epub
python -m w2d https://en.wikipedia.org/wiki/EPUB
Then read with a standards-compliant epub reader, e.g. https://addons.mozilla.org/en-US/firefox/addon/epubreader/
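The set/export lines above differ between Windows and Unix shells. As a hedged alternative, the same per-run setting can be passed from Python. This is a minimal sketch, not part of w2d: the helper names `w2d_env`, `w2d_command`, and `run_w2d` are hypothetical, and it assumes w2d is installed for the interpreter running the script.

```python
import os
import subprocess
import sys


def w2d_env(output_format, base_env=None):
    """Return a copy of the environment with W2D_OUTPUT_FORMAT set."""
    env = dict(os.environ if base_env is None else base_env)
    env["W2D_OUTPUT_FORMAT"] = output_format
    return env


def w2d_command(url):
    """Build the `python -m w2d URL` command for the current interpreter."""
    return [sys.executable, "-m", "w2d", url]


def run_w2d(url, output_format="epub"):
    """Run w2d in a child process and return its exit code.

    Hypothetical helper: requires w2d to be importable by sys.executable.
    """
    return subprocess.call(w2d_command(url), env=w2d_env(output_format))
```

Calling `run_w2d("https://en.wikipedia.org/wiki/EPUB")` should then behave like the export plus `python -m w2d` pair above, without changing the environment of the calling shell.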
html
export W2D_EXTRACTOR=postlight
export MP_URL=http://localhost:3000/parser
export MP_URL=http://username:password@localhost:3000/parser
export W2D_OUTPUT_FORMAT=html
env W2D_OUTPUT_FORMAT=html python -m w2d https://en.wikipedia.org/wiki/EPUB
python -m w2d https://en.wikipedia.org/wiki/EPUB
Alternative config
cat .env
W2D_EXTRACTOR=postlight
MP_URL=http://localhost:3000/parser
export W2D_OUTPUT_FORMAT=md
python -m w2d https://en.wikipedia.org/wiki/EPUB
env W2D_OUTPUT_FORMAT=html W2D_EXTRACTOR=raw python -m w2d http://localhost:8000/one.html
env W2D_OUTPUT_FORMAT=md W2D_EXTRACTOR=raw python -m w2d http://localhost:8000/one.html # needs either Pandoc binary in path or markdownify library available
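For readers scripting around the .env style of configuration shown above, here is a minimal sketch of reading KEY=VALUE lines into the environment. How w2d itself loads a .env file is not described here, so this is a generic illustration: the function names are hypothetical, quoting rules are not handled, and letting already-set variables win is an assumption.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines; blank lines and # comments are skipped."""
    settings = {}
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as URLs may contain "=".
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def load_env_file(path=".env", environ=None):
    """Copy settings from a .env file into environ, keeping existing values.

    Assumption: variables already set in the environment take precedence.
    """
    environ = os.environ if environ is None else environ
    with open(path) as env_file:
        for key, value in parse_env_lines(env_file).items():
            environ.setdefault(key, value)
    return environ
```

With the two-line .env shown above, `parse_env_lines` yields a dictionary holding W2D_EXTRACTOR and MP_URL.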
- right now there is no command-line argument processing other than the list of URLs
- really intended to be used as a library; the main user/consumer is https://github.com/clach04/whatabagacack
- no control over output format - use operating system environment variable W2D_OUTPUT_FORMAT (may be set to html, md, epub, and all)
- no control over epub tool/processing - use operating system environment variable W2D_EPUB_TOOL (may be set to pypub or pandoc - NOTE needs pandoc exe in path)
- no control over intermediate format - use operating system environment variable W2D_INTERMEDIATE_FORMAT (may be set to html or md)
- no control over whether readability extract is performed or not (it always performs an extract) - see environment variable W2D_EXTRACTOR (may be set to readability, postlight, postlight_exe, or raw - if postlight is used also see/set MP_URL)
- no control over disk cache contents, all pages are cached.
  - cache location is controlled via operating system environment variable W2D_CACHE_DIR, if not set defaults to scrape_cache in current directory
  - cache name is md5sum in hex of the URL, same root URL with different parameters (or href shortcuts #id_marker) will cause new cache entry to be pulled down
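The environment variables and the cache naming rule in the notes above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not w2d's actual code: the function names are hypothetical, rejecting unlisted values is an assumption about desired behaviour, and the sketch assumes the URL is hashed as UTF-8 and that the cache entry name is the bare digest with no extension.

```python
import hashlib
import os

# Allowed values as listed in the notes above.
ALLOWED_VALUES = {
    "W2D_OUTPUT_FORMAT": ("html", "md", "epub", "all"),
    "W2D_EPUB_TOOL": ("pypub", "pandoc"),
    "W2D_INTERMEDIATE_FORMAT": ("html", "md"),
    "W2D_EXTRACTOR": ("readability", "postlight", "postlight_exe", "raw"),
}


def read_settings(environ=None):
    """Return the w2d variables that are set, rejecting unlisted values."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, allowed in ALLOWED_VALUES.items():
        value = environ.get(name)
        if value is None:
            continue  # unset: w2d's own default applies (not documented here)
        if value not in allowed:
            raise ValueError("%s=%r is not one of %r" % (name, value, allowed))
        settings[name] = value
    return settings


def cache_key(url):
    """Hex MD5 digest of the URL (assumes UTF-8 encoding before hashing)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def cache_path(url, environ=None):
    """Join the cache directory (default scrape_cache) with the cache key."""
    environ = os.environ if environ is None else environ
    cache_dir = environ.get("W2D_CACHE_DIR", "scrape_cache")
    return os.path.join(cache_dir, cache_key(url))
```

Because the whole URL string is hashed, `cache_key("https://en.wikipedia.org/wiki/EPUB")` and the same URL with `#id_marker` appended give different keys, which is why such variants are fetched and cached separately.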
This project builds on a number of other tools to perform the heavy lifting:
- https://github.com/matthewwithanm/python-markdownify - for outputting Markdown with python-readability
- https://github.com/clach04/pypub is based on https://github.com/wcember/ original work for outputting epub2 files
- https://github.com/buriy/python-readability - used to extract main content from html pages, in turn based on https://github.com/timbertson/ work, which is in turn based on arc90's readability bookmarklet https://web.archive.org/web/20130519040221/http://www.readability.com/
- https://github.com/adbar/trafilatura - has great metadata extraction support
- Postlight (née Mercury) parser
- Windows 10 - Python 3.10
(py310venv) C:\code\py\w2d>pip list
Package          Version
---------------- ---------
beautifulsoup4   4.9.3
certifi          2023.7.22
chardet          3.0.4
courlan          0.5.0
cssselect        1.2.0
dateparser       1.1.8
htmldate         0.8.1
idna             2.8
Jinja2           2.11.3
jusText          3.0.0
langcodes        3.3.0
lxml             4.9.3
markdownify      0.11.6
MarkupSafe       1.1.1
pip              22.0.4
pypub            1.5
python-dateutil  2.8.2
pytz             2023.3
readability      0.3.1
readability-lxml 0.8.1
regex            2023.6.3
requests         2.22.0
setuptools       58.1.0
six              1.16.0
soupsieve        2.4.1
tld              0.13
trafilatura      0.8.2
tzdata           2023.3
tzlocal          5.0.1
urllib3          1.25.11
- Windows 10 - Python 2.7.18
(py210venv) C:\code\py\w2d>pip list
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Package                       Version
----------------------------- ---------
backports.functools-lru-cache 1.6.6
beautifulsoup4                4.9.3
certifi                       2021.10.8
chardet                       3.0.4
idna                          2.8
Jinja2                        2.11.3
lxml                          4.9.3
markdownify                   0.11.6
MarkupSafe                    1.1.1
pip                           20.3.4
pypub                         1.6
readability                   0.3.1
requests                      2.22.0
setuptools                    44.1.1
six                           1.16.0
soupsieve                     1.9.6
urllib3                       1.25.11
wheel                         0.37.1
- Linux Ubuntu 18.04.6 LTS (Bionic Beaver) - Python 3.6.9
Without trafilatura:
(py3venv) clach04@fugly:/tmp$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
beautifulsoup4 (4.12.2)
certifi (2023.7.22)
chardet (5.0.0)
charset-normalizer (2.0.12)
cssselect (1.1.0)
idna (3.4)
Jinja2 (3.0.3)
lxml (4.9.3)
markdownify (0.11.6)
MarkupSafe (2.0.1)
pip (9.0.1)
pkg-resources (0.0.0)
pypub (1.6)
readability-lxml (0.8.1)
requests (2.27.1)
setuptools (39.0.1)
six (1.16.0)
soupsieve (2.3.2.post1)
urllib3 (1.26.16)
w2d (0.0.1)
With trafilatura:
(py3venv) :~/w2d$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
backports-datetime-fromisoformat (2.0.0)
backports.zoneinfo (0.2.1)
beautifulsoup4 (4.12.2)
certifi (2023.7.22)
chardet (5.0.0)
charset-normalizer (3.0.1)
courlan (0.9.3)
cssselect (1.1.0)
dateparser (1.1.3)
htmldate (1.4.3)
idna (3.4)
importlib-resources (5.4.0)
Jinja2 (3.0.3)
jusText (3.0.0)
langcodes (3.3.0)
lxml (4.9.3)
markdownify (0.11.6)
MarkupSafe (2.0.1)
pip (9.0.1)
pkg-resources (0.0.0)
pypub (1.6)
python-dateutil (2.8.2)
pytz (2023.3)
pytz-deprecation-shim (0.1.0.post0)
readability-lxml (0.8.1)
regex (2022.3.2)
requests (2.27.1)
setuptools (39.0.1)
six (1.16.0)
soupsieve (2.3.2.post1)
tld (0.12.6)
trafilatura (1.6.1)
tzdata (2023.3)
tzlocal (4.2)
urllib3 (1.26.16)
zipp (3.6.0)
- https://github.com/danburzo/percollate - JavaScript based
- https://github.com/dullage/url2kindle - Python based; does almost the same thing with regard to Postlight/Mercury parsing, with embedded images