adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Home Page: https://trafilatura.readthedocs.io

<main> Content gets missed out

alroythalus opened this issue · comments

For this website

https://www.enpass.io/privacy-notice/

the entire content within the <main> tag is ignored. Only content from the <div> tags that precede the <main> tag is extracted.

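    from trafilatura import extract

    # web_content (the page HTML) and config are assumed to be defined
    # earlier in the script; they are not shown in this report.
    # extract() returns a single string, or None if nothing is extracted,
    # so wrapping the call in "".join() is unnecessary.
    web_content = extract(
        web_content,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
        favor_recall=True,
        config=config,
    )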

[Image: Screenshot (804)]
This is all that gets scraped, @adbar.

I cannot reproduce the bug. I just tried on the command line, and both the basic extraction and your options work for me:

  • trafilatura -u "https://www.enpass.io/privacy-notice/"
  • trafilatura -u "https://www.enpass.io/privacy-notice/" --formatting --no-comments --xml --recall

The problem is probably related to your connection or to other settings on your machine. Can you try again?
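One way to narrow this down is to check whether the HTML that was actually downloaded contains any text inside <main>. The following is only a minimal sketch using the Python standard library, not part of trafilatura: the class `MainTextCounter` and the function `main_text_stats` are hypothetical names introduced here, and the sketch assumes the downloaded page is available as a string.

```python
from html.parser import HTMLParser


class MainTextCounter(HTMLParser):
    """Counts non-whitespace text characters found inside <main> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth of currently open <main> tags
        self.skip = 0           # nesting depth of open <script>/<style> tags
        self.main_seen = False  # True once any <main> start tag is encountered
        self.chars = 0          # text characters counted inside <main>

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1
            self.main_seen = True
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.depth > 0:
            self.depth -= 1
        elif tag in ("script", "style") and self.skip > 0:
            self.skip -= 1

    def handle_data(self, data):
        # Only count text that sits inside <main> and outside script/style.
        if self.depth > 0 and self.skip == 0:
            self.chars += len("".join(data.split()))


def main_text_stats(html):
    """Return (main_seen, chars) for an HTML string."""
    parser = MainTextCounter()
    parser.feed(html)
    parser.close()
    return parser.main_seen, parser.chars


if __name__ == "__main__":
    sample = "<html><body><div>nav</div><main><p>Privacy notice</p></main></body></html>"
    print(main_text_stats(sample))
```

If this reports no <main> element, or zero characters inside it, for the HTML passed to extract(), the page was likely not downloaded completely (or its content is filled in by JavaScript), which would point to the download step rather than to the extraction.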