zhtsh / body_text_extraction

DOM tree based HTML body text extraction

# BodyTextExtraction

A DOM-based heuristic algorithm for body text extraction from HTML.

ref: DOM Based Content Extraction via Text Density
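
The text-density idea named in the reference can be illustrated with a rough sketch. This is not the code in this repository: the names `Node`, `TreeBuilder`, `text_density` and `densest_element` are hypothetical, and the sketch assumes text density means the number of characters in an element's subtree divided by the number of tags in that subtree (treated as 1 when the element has no child tags). It uses only the standard-library `html.parser` module.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose contents are not visible page text.
SKIP_TAGS = {"script", "style"}


class Node:
    """Hypothetical minimal DOM node: a tag, its children, and its own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text chunks that sit directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            # Strip surrounding whitespace so indentation does not count as text.
            self.current.text.append(data.strip())


def counts(node):
    """Return (characters, descendant tags) for the subtree rooted at node."""
    chars = sum(len(chunk) for chunk in node.text)
    tags = 0
    for child in node.children:
        child_chars, child_tags = counts(child)
        chars += child_chars
        tags += child_tags + 1  # the child itself plus its descendants
    return chars, tags


def text_density(node):
    """Characters per tag; an element with no child tags counts as one tag."""
    chars, tags = counts(node)
    return chars / max(tags, 1)


def densest_element(html):
    """Return (node, density) for the element with the highest text density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_density = None, -1.0
    stack = list(builder.root.children)
    while stack:
        node = stack.pop()
        density = text_density(node)
        if density > best_density:
            best, best_density = node, density
        stack.extend(node.children)
    return best, best_density
```

Link-heavy blocks such as navigation bars contain many tags and little text, so they score low, while a long paragraph scores high. Picking the single densest element is a simplification made for this sketch; a practical extractor would more likely combine density scores over a subtree and apply a threshold.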

usage

from body_text_extraction import BodyTextExtraction

# html is assumed to be a string holding the page's HTML source
bte = BodyTextExtraction()
text = bte.extract(html)
