some extraction duplicated in xml

Question

fortyfourforty opened this issue 3 months ago

hi,

I was setting up a test site to play with trafilatura and found a weird bug.

site URL:
https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/
This test site is only available for 2 days, so I have also attached the simple Gutenberg block code below so you can replicate the issue.

Command:

import trafilatura
url = "https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/"
html = trafilatura.fetch_url(url, no_ssl=True)
ts = trafilatura.extract(html, output_format='xml', include_comments=False)

The WordPress Gutenberg HTML is below:

<!-- wp:paragraph -->
<p>this is sample intro</p>
<!-- /wp:paragraph -->

<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">intro 2</h3>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>table below</p>
<!-- /wp:paragraph -->

<!-- wp:table -->
<figure class="wp-block-table"><table><tbody><tr><td>a</td><td>b</td><td></td></tr><tr><td>f</td><td>s</td><td>s</td></tr><tr><td>g</td><td></td><td>b</td></tr></tbody></table></figure>
<!-- /wp:table -->

<!-- wp:paragraph -->
<p>header table below</p>
<!-- /wp:paragraph -->

<!-- wp:table -->
<figure class="wp-block-table"><table><thead><tr><th>b</th><th>s</th><th>h</th></tr></thead><tbody><tr><td>a</td><td>b</td><td></td></tr><tr><td>f</td><td>s</td><td>s</td></tr><tr><td>g</td><td></td><td>b</td></tr></tbody></table></figure>
<!-- /wp:table -->

<!-- wp:paragraph -->
<p>list below</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><!-- wp:list-item -->
<li>this is 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 3</li>
<!-- /wp:list-item --></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>numbered list below</p>
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
<ol><!-- wp:list-item -->
<li>this is 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>this is 3</li>
<!-- /wp:list-item --></ol>
<!-- /wp:list -->

It is a very simple extraction, but some elements are extracted twice.
The elements from "this is sample intro" onward appear twice, but not all of them: some elements, such as the list items, show up only once.

See the extraction below:

<doc sitename="milkfriends.s1-tastewp.com" title="ok this" author="Admin" date="2024-06-27" url="https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/" hostname="s1-tastewp.com" fingerprint="f69d7033beefe32d">
  <main>
    <p>this is sample intro</p>
    <head rend="h3">intro 2</head>
    <p>table below</p>
    <table>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>header table below</p>
    <table>
      <row span="3">
        <cell role="head">b</cell>
        <cell role="head">s</cell>
        <cell role="head">h</cell>
      </row>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>list below</p>
    <list rend="ul">
      <item>this is 1</item>
      <item>this is 2</item>
      <item>this is 3</item>
    </list>
    <p>numbered list below</p>
    <list rend="ol">
      <item>this is 1</item>
      <item>this is 2</item>
      <item>this is 3</item>
    </list>
    <p>this is sample intro</p>
    <p>table below</p>
    <table>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>header table below</p>
    <table>
      <row span="3">
        <cell role="head">b</cell>
        <cell role="head">s</cell>
        <cell role="head">h</cell>
      </row>
      <row span="3">
        <cell>a</cell>
        <cell>b</cell>
      </row>
      <row span="3">
        <cell>f</cell>
        <cell>s</cell>
        <cell>s</cell>
      </row>
      <row>
        <cell>g</cell>
        <cell>b</cell>
      </row>
    </table>
    <p>list below</p>
    <p>numbered list below</p>
  </main>
</doc>
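
To make the duplication in output like the above easier to inspect, one option is to count how often each direct child of `<main>` occurs. The following is a minimal sketch using only the Python standard library; the helper name `count_repeated_blocks` and the shortened `sample` string are made up for illustration and are not part of trafilatura.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_repeated_blocks(xml_string):
    """Return {(tag, text): count} for children of <main> seen more than once."""
    root = ET.fromstring(xml_string)
    main = root.find("main")
    if main is None:
        return {}
    counts = Counter()
    for child in main:
        # Collapse whitespace so indentation differences do not matter
        text = " ".join("".join(child.itertext()).split())
        counts[(child.tag, text)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Shortened, hypothetical stand-in for the XML output shown above
sample = """<doc>
  <main>
    <p>this is sample intro</p>
    <head rend="h3">intro 2</head>
    <p>table below</p>
    <list rend="ul"><item>this is 1</item></list>
    <p>this is sample intro</p>
    <p>table below</p>
  </main>
</doc>"""

for (tag, text), n in count_repeated_blocks(sample).items():
    print(f"<{tag}> {text!r} appears {n} times")
```

On the full output from the issue, this kind of check would report which paragraphs and tables occur twice, while elements that were emitted once would not be listed.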

Adrien Barbaresi · Answer 1 · Thu Jun 27 2024 19:03:42 GMT+0800 (China Standard Time)

I'm not sure what happens here, but this is odd indeed. Note that you can use a web archive so that the errors can be reproduced later.

In general, duplicated elements can be easily tackled by using the integrated deduplication filters and setting the right threshold.
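
To illustrate what a threshold-based deduplication filter does in general, here is a minimal standard-library sketch. It is not trafilatura's implementation: the function `filter_repeated_segments` and its parameters `min_length` and `max_repetitions` are assumptions chosen for this example. In trafilatura itself, deduplication is switched on with the `deduplicate` argument of `extract`, and the actual thresholds live in the library's settings.

```python
from collections import Counter


def filter_repeated_segments(segments, min_length=100, max_repetitions=2):
    """Drop a segment once it has been seen more than max_repetitions times.

    Segments shorter than min_length are never filtered.
    Parameter names and defaults are illustrative assumptions.
    """
    seen = Counter()
    kept = []
    for segment in segments:
        text = " ".join(segment.split())  # normalize whitespace
        if len(text) >= min_length:
            seen[text] += 1
            if seen[text] > max_repetitions:
                continue  # over the threshold: skip this repeat
        kept.append(segment)
    return kept


long_segment = "x" * 120
segments = ["table below", long_segment, "table below", long_segment, long_segment]

# Strict thresholds: even short segments are filtered after one occurrence
strict = filter_repeated_segments(segments, min_length=5, max_repetitions=1)
# Lenient thresholds: short segments always pass, long ones may repeat twice
lenient = filter_repeated_segments(segments)
print(len(strict), len(lenient))
```

The sketch shows the trade-off behind choosing a threshold: strict values remove the unwanted repeats but also remove legitimate repetitions, while lenient values leave short repeated segments untouched.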

fortyfourforty · Answer 2 · Thu Jun 27 2024 19:09:43 GMT+0800 (China Standard Time)

sorry, I forgot about archive.is. Noted.

I don't think using deduplicate=True is a valid workaround, as some pages legitimately contain the exact same text segment more than once.

Adrien Barbaresi · Answer 3 · Thu Jul 25 2024 19:58:45 GMT+0800 (China Standard Time)

@fortyfourforty The integrated deduplication does prevent identical text segments on the same page.