peduncle

very very very simple DOM based HTML content extraction tool, get rid of boilerplate dressing of a web page[1].

easy but useable

work with python 3.7+

[1] the word comes from dragnet.

install

pip install peduncle

usage

import requests
from peduncle.peduncle import extract_text

# obtain the raw html
url="https://blog.rust-lang.org/2023/05/29/RustConf.html"
html = requests.get(url).text

# extract
print(extract_text(html))

benchmark

data

benchmark data comes from dragnet_data, which contains 1381 web pages.

result

	similarity	95%hit_rate	avg_length_gap(char)	length_gap_std
a=0.01	0.5767456743946341	0.22	-4673.118	15343.704819895227
a=025	0.8451692708814662	0.548	-2082.988	14502.183923390849
a=0.5	0.8226224698726087	0.47	-368.696	8452.075615349402
a=0.99	0.7527591593485807	0.292	1614.306	7917.618208044891

a: alpha, control how much the content extractor tens to extract larger content piece
similarity: cosine similarity between sparse vectors of answer and extracted text
95hit rate: percentage of similarity larger than 95%
length gap: extracted text length - answer text length
std: std

algorithm

Node grading is based on several key features:

tag name: The tag name is a crucial determinant of a node's potential to contain the "main content". Nodes tagged with <content>, <article>, or <main> are more likely to house the main content than those tagged with <menu>, <nav>, or <aside>.
children tags: The distribution of a node's child tags can also suggest its likelihood of being the main content. Nodes with a higher percentage of <p> tags among child tags are scored favorably.
text - children ratio: Nodes with an excessively high or low number of children, or those with too much or too little text, are less likely to contain the main content — they're located too high or too low in the HTML tree. We thus use the text-to-children ratio to assess the suitability of a node, aiming for nodes that contain only a few sizable blocks of text.

We use the following equation to calculate this ratio:

$$\frac{t\times(1+c/n)}{c/n}$$

Here, $t$ is the text length, $c$ is the total number of child nodes, and $n$ is a variable used to gauge what counts as "too few children". The $n$ variable is also instrumental in adjusting whether we want the chosen node to be closer to the root or leaf.

We recursively grade each node in the HTML tree and ultimately select one node per document as the main content.

midstreeeam / peduncle