<main> Content gets missed out

Question

alroythalus opened this issue 5 months ago

For this website:

https://www.enpass.io/privacy-notice/

the entire content within the <main> tag is ignored. Only content from the div tags before the <main> tag is extracted.

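    from trafilatura import extract

    # web_content (the downloaded HTML) and config are defined elsewhere.
    # extract() already returns a single string (or None), so no join is needed.
    web_content = extract(
        web_content,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
        favor_recall=True,
        config=config,
    )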

This is all that gets scraped @adbar

Adrien Barbaresi · Answer 1 · Mon May 06 2024 23:49:37 GMT+0800 (China Standard Time)

I cannot reproduce the bug. I just tried on the command line, and both the basic extraction and your options work for me:

    trafilatura -u "https://www.enpass.io/privacy-notice/"
    trafilatura -u "https://www.enpass.io/privacy-notice/" --formatting --no-comments --xml --recall

The problem is probably related to your connection or other settings on your machine. Can you try again?
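
One way to tell a download problem from an extraction problem is to check whether the HTML that actually reached your script contains any text inside a <main> element. The following is a minimal sketch using only the Python standard library; the names `MainTextCounter` and `main_text_length` are hypothetical helpers made up for this illustration and are not part of trafilatura.

```python
from html.parser import HTMLParser


class MainTextCounter(HTMLParser):
    """Count text characters that appear inside <main> elements."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0   # how many <main> tags are currently open
        self.skip_depth = 0   # how many <script>/<style> tags are currently open
        self.main_chars = 0   # characters of text seen inside <main>

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth > 0:
            self.main_depth -= 1
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only count text inside <main> and outside script/style blocks,
        # ignoring leading and trailing whitespace of each text chunk.
        if self.main_depth > 0 and self.skip_depth == 0:
            self.main_chars += len(data.strip())


def main_text_length(html):
    """Return the number of text characters found inside <main> (0 if none)."""
    parser = MainTextCounter()
    parser.feed(html or "")
    parser.close()
    return parser.main_chars
```

Calling `main_text_length(web_content)` on the raw HTML, before it is passed to `extract`, gives a quick hint: a result of 0 suggests the downloaded page itself lacks the <main> content, which would point to the connection or local settings rather than to the extraction step. A large value with missing output would instead point back at the extraction options.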