cwilvx / html2md

A Python script that reads the Debian wiki news page and outputs a Markdown file that renders the same page.

html2md

Read the thought process

Running it

Clone this repo locally, create a virtual environment, install dependencies and run main.py.

git clone https://github.com/mungai-njoroge/html2md.git

cd html2md

If you have Poetry installed:

poetry install

# run main.py
poetry run python main.py

Without Poetry:

# create virtual environment
python -m venv venv

# activate it
source venv/bin/activate

# install dependencies
pip install -r requirements.txt

# run script
python main.py

Libraries used

requests, BeautifulSoup, markdownify and pytest (for the tests).

How it works

The page is fetched using the requests package and then parsed into a tree structure using BeautifulSoup.
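As a rough illustration of this step, the fetch and the parse could look like the sketch below. The URL, the function names and the timeout are assumptions made for the example; the README does not show the script's actual code.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL: the README only says "the Debian wiki news page".
NEWS_URL = "https://wiki.debian.org/News"


def fetch_html(url: str = NEWS_URL, timeout: float = 10.0) -> str:
    """Download the page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_html(html: str) -> BeautifulSoup:
    """Parse an HTML string into a navigable tree (hypothetical helper)."""
    # "html.parser" is the parser bundled with Python; no extra install needed.
    return BeautifulSoup(html, "html.parser")
```

Calling `parse_html(fetch_html())` would then give a tree that the later steps can search. Keeping the fetch and the parse separate makes the parsing step easy to test with a fixed HTML string.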

Metadata that a wiki engine can use is extracted from the page and stored as front matter in the final Markdown file.
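The README does not say which fields are extracted, so the sketch below only illustrates the idea. The choice of fields (page title and language) and both function names are hypothetical.

```python
import json

from bs4 import BeautifulSoup


def extract_front_matter(soup: BeautifulSoup) -> dict:
    """Collect a few page-level fields (an assumed selection, not the script's)."""
    meta = {}
    if soup.title and soup.title.string:
        meta["title"] = soup.title.string.strip()
    html_tag = soup.find("html")
    if html_tag and html_tag.get("lang"):
        meta["lang"] = html_tag["lang"]
    return meta


def render_front_matter(meta: dict) -> str:
    """Render the fields as a front matter block delimited by '---' lines."""
    lines = ["---"]
    for key, value in meta.items():
        # json.dumps produces a double-quoted string, which is also valid YAML.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

The rendered block would be written at the top of the Markdown file, ahead of the converted page content.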

The relevant section of the webpage is inside the element with id content. This element is selected using the BeautifulSoup.find method. Unneeded elements inside it are identified and removed using the BeautifulSoup.decompose method. The markdownify package is then used to generate Markdown from what remains.

Running tests

Tests are defined in the test_main.py file. Run them with pytest, which is installed as a dependency.

python -m pytest

With Poetry:

poetry run python -m pytest
