pmatigakis / article-extraction

Extract the article content from a page

Article extraction library.

article-extraction is a package that can be used to extract the article content from an HTML page.

Installation

Use Poetry to install the library directly from GitHub:

poetry add "git+https://github.com/pmatigakis/article-extraction.git"

Usage

Download a page and pass its HTML to MSSArticleExtractor to extract the article content:

from urllib.request import urlopen

from articles.mss.extractors import MSSArticleExtractor

# Download the raw HTML of the page.
document = urlopen("https://www.bbc.com/sport/formula1/64983451").read()
# Create the extractor and run it on the downloaded document.
article_extractor = MSSArticleExtractor()
article = article_extractor.extract_article(document)
print(article)
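
For repeated use, the fetch-and-extract steps above could be wrapped in a small helper. The following is only a sketch: `extract_from_url`, its `timeout` parameter, and the User-Agent string are hypothetical names that are not part of article-extraction. It assumes only what the example above shows, namely that the extractor exposes an `extract_article` method that accepts the raw bytes returned by `read()`.

```python
from urllib.request import Request, urlopen


def extract_from_url(url, extractor, timeout=10.0):
    """Fetch `url` and pass the raw response body to `extractor`.

    `extractor` is any object with an `extract_article(document)` method,
    such as the MSSArticleExtractor used in the example above.
    This helper is illustrative and not part of the library.
    """
    # Some sites reject urllib's default User-Agent, so send a custom one
    # (the value here is an arbitrary placeholder).
    request = Request(url, headers={"User-Agent": "article-extraction-example/0.1"})
    # The timeout keeps the call from hanging on an unresponsive server.
    with urlopen(request, timeout=timeout) as response:
        document = response.read()
    # Hand the raw bytes to the extractor, as in the README example.
    return extractor.extract_article(document)
```

With the library installed, the helper might be called as `extract_from_url("https://www.bbc.com/sport/formula1/64983451", MSSArticleExtractor())`. Passing the extractor in as an argument lets a single instance be reused across several pages.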

License: MIT License


Languages

Python 88.9%, HTML 11.1%