<main> Content gets missed out
alroythalus opened this issue · comments
Alroy commented
For this website
https://www.enpass.io/privacy-notice/
the entire content within the <main> tag is ignored. Only content from the <div> tags that come before the <main> tag is extracted.
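from trafilatura import extract

# web_content holds the downloaded HTML; config is my trafilatura
# configuration object (its definition is not shown here).
web_content = "".join(
    extract(
        web_content,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
        favor_recall=True,
        config=config,
    )
)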
This is all that gets scraped, @adbar.
Adrien Barbaresi commented
I cannot reproduce the bug. I just tried on the command line, and both the basic extraction and your options work for me:
trafilatura -u "https://www.enpass.io/privacy-notice/"
trafilatura -u "https://www.enpass.io/privacy-notice/" --formatting --no-comments --xml --recall
The problem is probably related to your connection or to other settings on your machine. Can you try again?
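One way to check whether the download is the culprit is to look at the HTML your machine actually received before it reaches the extractor. Below is a minimal sketch that uses only the Python standard library; `main_text_stats` is a hypothetical helper name (not part of trafilatura), and it simply counts text characters inside and outside `<main>`, ignoring scripts and styles.

```python
from html.parser import HTMLParser


class _MainTextCounter(HTMLParser):
    """Count text characters inside and outside <main> elements."""

    def __init__(self):
        super().__init__()
        self.has_main = False
        self.inside = 0       # characters of text found within <main>
        self.outside = 0      # characters of text found elsewhere
        self._main_depth = 0  # how many <main> tags are currently open
        self._skip_depth = 0  # how many <script>/<style> tags are open

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.has_main = True
            self._main_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self._main_depth:
            self._main_depth -= 1
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # script and style bodies are not page text
        n = len(data.strip())
        if self._main_depth:
            self.inside += n
        else:
            self.outside += n


def main_text_stats(html):
    """Hypothetical helper: summarize where the text of a page lives."""
    parser = _MainTextCounter()
    parser.feed(html)
    parser.close()
    return {
        "has_main": parser.has_main,
        "chars_in_main": parser.inside,
        "chars_outside_main": parser.outside,
    }


if __name__ == "__main__":
    sample = "<html><body><div>Menu</div><main><p>Privacy text</p></main></body></html>"
    print(main_text_stats(sample))
```

If `has_main` is false or `chars_in_main` is close to zero for the HTML you pass to `extract` (for instance the string returned by `trafilatura.fetch_url`), the page was probably not downloaded in full, for example because a consent or block page was served instead. That would point to the connection rather than to the extraction step.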