adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Home Page: https://trafilatura.readthedocs.io

Weird XML extraction

fortyfourforty opened this issue

Example url:
https://www.dummies.com/article/home-auto-hobbies/home-improvement-appliances/electrical/how-to-replace-a-light-switch-185346/

Command:
trafilatura.extract(page_source, output_format='xml', include_comments=False)

Problem:
The output does not read like regular XML.
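
To narrow a report like this down, it can help to first check whether the returned string is well-formed XML at all, or whether it parses fine and only the extracted content is off. The snippet below is a minimal sketch using only the standard library; the helper name `is_well_formed_xml` is made up for illustration and is not part of trafilatura.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if `text` parses as a single well-formed XML document.

    Hypothetical helper for triage; not part of trafilatura.
    """
    if not text:
        # extract() can return None when nothing is extracted
        return False
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True
```

If this returns True for the output of the command above, the markup itself is valid and the problem is more likely in what was extracted than in the XML serialization.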

Yes, there is something wrong with the extraction here.

The main extractor is not affected: the readability_lxml fallback extracts the wrong content. I will implement a quick fix.
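
Until a fix lands, one way to confirm that the fallback is responsible is to run the same extraction twice, once with the default settings and once with the fallback extractors switched off, and compare the results. The sketch below is an assumption-laden illustration: `compare_fallback` is a hypothetical helper, and the keyword that disables fallbacks depends on the installed trafilatura version (`no_fallback=True` is assumed here; check the documentation for your release).

```python
from typing import Callable, Dict, Optional


def compare_fallback(
    html: str,
    extract: Callable[..., Optional[str]],
    skip_fallback_kwargs: Dict[str, object],
) -> Dict[str, Optional[str]]:
    """Run `extract` with and without fallback extractors.

    Hypothetical helper: `extract` is expected to behave like
    trafilatura.extract, and `skip_fallback_kwargs` holds whatever
    keyword disables the fallbacks in the installed version.
    """
    common = {"output_format": "xml", "include_comments": False}
    return {
        "with_fallback": extract(html, **common),
        "main_only": extract(html, **common, **skip_fallback_kwargs),
    }


# Assumed usage (requires trafilatura; the keyword name is an assumption):
# import trafilatura
# results = compare_fallback(page_source, trafilatura.extract, {"no_fallback": True})
# print(results["with_fallback"] == results["main_only"])
```

If the two outputs differ and only the `main_only` result looks right, that is consistent with the fallback being the source of the wrong content.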