Weird XML extraction
fortyfourforty opened this issue
fortyfourforty commented
Example URL:
https://www.dummies.com/article/home-auto-hobbies/home-improvement-appliances/electrical/how-to-replace-a-light-switch-185346/
Command:
trafilatura.extract(page_source, output_format='xml', include_comments=False)
Problem:
The output does not read like regular XML.
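A quick way to narrow down this kind of report is to check whether the returned string is well-formed XML at all and which element tags it contains. The sketch below is only an illustration and uses nothing but the standard library; the helper names `is_well_formed_xml` and `tag_counts` are hypothetical and not part of trafilatura.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if `text` parses as a single well-formed XML document."""
    if not text:  # extract() can return None when nothing is extracted
        return False
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def tag_counts(text):
    """Count element tags so that unexpected structure is easy to spot."""
    counts = {}
    for element in ET.fromstring(text).iter():
        counts[element.tag] = counts.get(element.tag, 0) + 1
    return counts
```

For example, the result of the extract() call above could be passed to `is_well_formed_xml` first, and then to `tag_counts` to see whether the element names look like the expected document structure or like leftover markup from the page.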
Adrien Barbaresi commented
Yes, there is something wrong with the extraction here.
Adrien Barbaresi commented
The main extractor is not affected; the problem is that readability_lxml extracts the wrong content. I will implement a quick fix.