weird xml extraction

Question

fortyfourforty opened this issue 3 months ago

Example url:
https://www.dummies.com/article/home-auto-hobbies/home-improvement-appliances/electrical/how-to-replace-a-light-switch-185346/

Command:
trafilatura.extract(page_source, output_format='xml', include_comments=False)

Problem:
The output does not read like regular XML.
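
A first step in narrowing this down is to check whether the returned string parses as XML at all. The following is a minimal sketch using only the standard library. The helper name `xml_problem` is made up for illustration, and it only tests well-formedness, not whether the right content was extracted.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def xml_problem(output: Optional[str]) -> Optional[str]:
    """Return None if `output` parses as XML, else a short description of the problem."""
    if output is None:
        # The extractor produced nothing to check.
        return "no output (extraction returned None)"
    try:
        ET.fromstring(output)
    except ET.ParseError as err:
        return f"not well-formed: {err}"
    return None
```

If this returns None for the page above, the XML is syntactically fine and the issue lies in which content was selected.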

Adrien Barbaresi · Answer 1 · Thu Jun 27 2024 18:50:04 GMT+0800 (China Standard Time)

Yes, there is something wrong with the extraction here.

Adrien Barbaresi · Answer 2 · Thu Jun 27 2024 19:42:29 GMT+0800 (China Standard Time)

The main extractor is not affected; readability_lxml extracts the wrong content. I will implement a quick fix.
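
Until a fix lands, the diagnosis can be checked by running the same extraction twice, once with fallback extractors and once without. This is a sketch under assumptions: the keyword that disables fallbacks is taken to be `no_fallback=True`, as accepted by trafilatura 1.x releases (other releases may name it differently, so check the installed version). `compare_fallback`, `extract` and `disable_kwargs` are hypothetical names introduced here so the helper can also be tried with a stand-in callable.

```python
from typing import Callable, Dict, Optional


def compare_fallback(
    page_source: str,
    extract: Optional[Callable[..., Optional[str]]] = None,
    disable_kwargs: Optional[Dict[str, object]] = None,
) -> Dict[str, Optional[str]]:
    """Run the same XML extraction with and without fallback extractors."""
    if extract is None:
        # Imported lazily so the helper also works with a stand-in callable.
        from trafilatura import extract
    if disable_kwargs is None:
        # Assumption: trafilatura 1.x keyword; adjust for the installed release.
        disable_kwargs = {"no_fallback": True}
    # Same options as the command in the question.
    common = {"output_format": "xml", "include_comments": False}
    return {
        "with_fallback": extract(page_source, **common),
        "without_fallback": extract(page_source, **common, **disable_kwargs),
    }
```

If the two results differ for the URL above, the fallback step is the likely source of the odd output.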