adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Home Page: https://trafilatura.readthedocs.io

some extraction duplicated in xml

fortyfourforty opened this issue

Hi,

I was setting up a test site and playing with trafilatura when I found a weird bug.

Site URL:
https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/
This test site is only available for 2 days, so I have also attached the simple Gutenberg block code below for you to replicate.

Command:

import trafilatura
url = "https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/"
html = trafilatura.fetch_url(url, no_ssl=True)
ts = trafilatura.extract(html, output_format='xml', include_comments=False)

The WordPress Gutenberg HTML is below:

<!-- wp:paragraph -->
<p>this is sample intro</p>
<!-- /wp:paragraph -->

<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">intro 2</h3>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>table below</p>
<!-- /wp:paragraph -->

<!-- wp:table -->
<figure class="wp-block-table"><table><tbody><tr><td>a</td><td>b</td><td></td></tr><tr><td>f</td><td>s</td><td>s</td></tr><tr><td>g</td><td></td><td>b</td></tr></tbody></table></figure>
<!-- /wp:table -->

<!-- wp:paragraph -->
<p>header table below</p>
<!-- /wp:paragraph -->

<!-- wp:table -->
<figure class="wp-block-table"><table><thead><tr><th>b</th><th>s</th><th>h</th></tr></thead><tbody><tr><td>a</td><td>b</td><td></td></tr><tr><td>f</td><td>s</td><td>s</td></tr><tr><td>g</td><td></td><td>b</td></tr></tbody></table></figure>
<!-- /wp:table -->

<!-- wp:paragraph -->
<p>list below</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><!-- wp:list-item -->
<li>this is 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 3</li>
<!-- /wp:list-item --></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>numbered list below</p>
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
<ol><!-- wp:list-item -->
<li>this is 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 3</li>
<!-- /wp:list-item --></ol>
<!-- /wp:list -->

It is a very simple extraction, but some elements are extracted twice.
The elements from "this is sample intro" onward appear twice, but not all of them: the "intro 2" heading and the list items only show up once.

See the extraction below:

<doc sitename="milkfriends.s1-tastewp.com" title="ok this" author="Admin" date="2024-06-27" url="https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/" hostname="s1-tastewp.com" fingerprint="f69d7033beefe32d">
  <main>
    <p>this is sample intro</p>
    <head rend="h3">intro 2</head>
    <p>table below</p>
    <table>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>header table below</p>
    <table>
      <row span="3">
        <cell role="head">b</cell>
        <cell role="head">s</cell>
        <cell role="head">h</cell>
      </row>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>list below</p>
    <list rend="ul">
      <item>this is 1</item>
      <item>this is 2</item>
      <item>this is 3</item>
    </list>
    <p>numbered list below</p>
    <list rend="ol">
      <item>this is 1</item>
      <item>this is 2</item>
      <item>this is 3</item>
    </list>
    <p>this is sample intro</p>
    <p>table below</p>
    <table>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>header table below</p>
    <table>
      <row span="3">
        <cell role="head">b</cell>
        <cell role="head">s</cell>
        <cell role="head">h</cell>
      </row>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>list below</p>
    <p>numbered list below</p>
  </main>
</doc>
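
To pin down which blocks are repeated in an output like the one above, a small standard-library sketch can help. It is not part of trafilatura; the helper name `repeated_blocks` is made up for illustration, and it assumes the XML has a single `<main>` element whose direct children are the extracted blocks.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def repeated_blocks(xml_text):
    """Return the serialized direct children of <main> that occur more than once.

    Hypothetical helper for inspecting extraction output, not a trafilatura API.
    """
    main = ET.fromstring(xml_text).find("main")
    if main is None:
        return []
    # Serialize each block and collapse whitespace so that indentation
    # differences do not hide otherwise identical blocks.
    serialized = [
        " ".join(ET.tostring(child, encoding="unicode").split())
        for child in main
    ]
    counts = Counter(serialized)
    repeated, seen = [], set()
    for block in serialized:
        # Report each repeated block once, in order of first appearance.
        if counts[block] > 1 and block not in seen:
            seen.add(block)
            repeated.append(block)
    return repeated


sample = """<doc>
  <main>
    <p>this is sample intro</p>
    <head rend="h3">intro 2</head>
    <list rend="ul">
      <item>this is 1</item>
    </list>
    <p>this is sample intro</p>
  </main>
</doc>"""

print(repeated_blocks(sample))  # ['<p>this is sample intro</p>']
```

Run against the full output, this would list the paragraphs and tables that occur twice, while the heading and the two lists would not be reported.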

I'm not sure what happens here, but this is odd indeed. Note that you can use a web archive so that the errors can be reproduced later.

In general, duplicated elements can be easily tackled by using the integrated deduplication filters and setting the right threshold.
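
As a rough sketch of that suggestion, the `deduplicate` argument of `trafilatura.extract` can be switched on as shown below. The wrapper name `extract_xml` is only illustrative, the import is guarded so the snippet loads even when the library is missing, and threshold tuning through the settings file is not shown. Whether this removes the repeated blocks in this particular case is not confirmed here.

```python
try:
    import trafilatura
except ImportError:  # keep the sketch importable without the library
    trafilatura = None


def extract_xml(html, deduplicate=False):
    """Extract XML from an HTML string, optionally with deduplication enabled.

    Illustrative wrapper; only `trafilatura.extract` is the library's own call.
    """
    if trafilatura is None:
        raise RuntimeError("trafilatura is not installed")
    return trafilatura.extract(
        html,
        output_format="xml",
        include_comments=False,
        deduplicate=deduplicate,  # turn on the integrated duplicate filter
    )
```

Calling `extract_xml(html)` and `extract_xml(html, deduplicate=True)` on the same page makes it possible to compare both outputs side by side.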

Sorry, I forgot about archive.is. Noted.

I don't think using deduplicate=True is a valid workaround, as some pages do have the exact same text segments on the same page.

@fortyfourforty The integrated deduplication does prevent identical text segments on the same page.