yahoo / gryffin

Gryffin is a large scale web security scanning platform.

Efficiency of page deduplication

mutantzombie opened this issue · comments

(Issue #6 focuses on drawbacks for the test and audit phase. This one focuses on applicability to crawling.)

This approach doesn't look very robust to improve crawling against real-world web apps. It requires a link to be requested before the crawler can decide whether the requested link was redundant.

The comparison also seems overly sensitive to tags that do not affect structure or navigation. For example, false positives can come from text nodes with large variations in formatting tags like <b> and <i>, pages that display tables with different numbers of rows, or articles with varying numbers of comments (where a comment may be in its own <div>).

For the flickr examples, the distance doesn't seem to consistently reflect duplicate content. For example, it appears to produce a 100% match for a user's /albums/ and /groups/ content, even though the /groups/ clearly points to additional navigation links that would be important to crawl.

How does the dedupe work for pages that dynamically create content? Is the HTML taken from the HTTP response, or from the version rendered in a browser? If it renders from a browser, at what point is the page considered settled versus an "infinite scroll" or dynamically refreshing page?

Are link+page combinations labeled with an authentication state? The content for a link can change significantly depending on whether the user is logged in.

Thanks for the long write-up!

This approach doesn't look very robust to improve crawling against real-world web apps. It requires a link to be requested before the crawler can decide whether the requested link was redundant.

Good catch. We have a feedback loop in our Python version, v1; the GitHub version is Go, v2. In v1, we could train a URL pattern model based on the similarity matches. That module was a bit error prone, and getting it running on the Go version is on my long TODO list.
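
For illustration only, here is a minimal sketch of that kind of feedback loop (not the actual v1 module; the names `patternOf`, `dupeTracker`, and the threshold of 3 are made-up assumptions). The idea is that once several pages fetched under the same URL pattern have come back as near-duplicates, further links matching that pattern can be skipped before they are requested.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// looksVariable matches path segments that are likely record IDs
// (purely numeric or long hex strings) -- a crude stand-in for what
// a learned URL pattern model would generalize over.
var looksVariable = regexp.MustCompile(`^(\d+|[0-9a-f]{8,})$`)

// patternOf collapses variable-looking path segments into a placeholder,
// so /photos/123 and /photos/456 map to the same pattern /photos/{var}.
func patternOf(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	segs := strings.Split(strings.Trim(u.Path, "/"), "/")
	for i, s := range segs {
		if looksVariable.MatchString(strings.ToLower(s)) {
			segs[i] = "{var}"
		}
	}
	return u.Host + "/" + strings.Join(segs, "/")
}

// dupeTracker counts how often pages fetched under a pattern turned out
// to be near-duplicates; once a pattern reaches the threshold, links
// matching it are skipped without being fetched.
type dupeTracker struct {
	dupes     map[string]int
	threshold int
}

func (t *dupeTracker) recordDuplicate(rawURL string) {
	t.dupes[patternOf(rawURL)]++
}

func (t *dupeTracker) shouldSkip(rawURL string) bool {
	return t.dupes[patternOf(rawURL)] >= t.threshold
}

func main() {
	t := &dupeTracker{dupes: map[string]int{}, threshold: 3}
	for _, u := range []string{
		"https://example.com/photos/101",
		"https://example.com/photos/102",
		"https://example.com/photos/103",
	} {
		t.recordDuplicate(u) // pretend these fetches came back as near-duplicates
	}
	fmt.Println(t.shouldSkip("https://example.com/photos/999")) // true
}
```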

The comparison also seems overly sensitive to tags that do not affect structure or navigation. For example, false positives can come from text nodes with large variations in formatting tags like <b> and <i>, pages that display tables with different numbers of rows, or articles with varying numbers of comments (where a comment may be in its own <div>).

Simhash seems to take good care of that, based on what we have observed! There can be false positives, but the number of them has been fairly insignificant.

For the flickr examples, the distance doesn't seem to consistently reflect duplicate content. For example, it appears to produce a 100% match for a user's /albums/ and /groups/ content, even though the /groups/ clearly points to additional navigation links that would be important to crawl.

Do you have the specific Flickr URLs? The html-distance module and the simhash algorithm have a couple of parameters, and I may be able to adjust them a bit.
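
For context, a rough sketch of the simhash idea the comparison is based on, using tag-name shingles as features (the shingle size and the Hamming-distance threshold are exactly the kind of parameters that can be tuned). This is an illustration with assumed feature extraction, not the actual html-distance code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

// simhash folds the 64-bit hashes of all features into one fingerprint:
// each bit of the fingerprint is set if more features voted 1 than 0
// at that bit position.
func simhash(features []string) uint64 {
	var votes [64]int
	for _, f := range features {
		h := fnv.New64a()
		h.Write([]byte(f))
		v := h.Sum64()
		for i := uint(0); i < 64; i++ {
			if v&(uint64(1)<<i) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i := uint(0); i < 64; i++ {
		if votes[i] > 0 {
			fp |= uint64(1) << i
		}
	}
	return fp
}

// hamming counts differing bits; pages whose fingerprints are within a
// small, tunable distance are treated as near-duplicates.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	// Illustrative shingles of HTML tag names (shingle size 2 is an
	// assumption here; the real feature extraction differs).
	pageA := []string{"html head", "head body", "body div", "div table", "table tr"}
	pageB := []string{"html head", "head body", "body div", "div table", "table td"}
	d := hamming(simhash(pageA), simhash(pageB))
	fmt.Printf("hamming distance: %d (a threshold like <= 3 would mark these as duplicates)\n", d)
}
```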

How does the dedupe work for pages that dynamically create content? Is the HTML taken from the HTTP response, or from the version rendered in a browser? If it renders from a browser, at what point is the page considered settled versus an "infinite scroll" or dynamically refreshing page?

Gryffin takes the HTML from the rendered DOM (document.body), via render.js.
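
To make the "rendered DOM, not raw HTTP response" point concrete: the sketch below is not Gryffin's render.js (which drives a headless browser), but a Go illustration of the same idea using the third-party chromedp package, with an arbitrary fixed wait standing in for a "settled page" heuristic. The URL and the two-second wait are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start a headless browser session.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var bodyHTML string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/"),
		// Crude "settled" heuristic: wait a fixed interval after load so
		// scripts can finish mutating the DOM. Infinite-scroll or
		// self-refreshing pages never truly settle, so some cutoff is needed.
		chromedp.Sleep(2*time.Second),
		// Take the HTML of the rendered body, not the raw HTTP response.
		chromedp.OuterHTML("body", &bodyHTML),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("rendered body length:", len(bodyHTML))
}
```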

Are link+page combinations labeled with an authentication state? The content for a link can change significantly depending on whether the user is logged in.

Whether handling different user agents or initial session state is in scope for Gryffin is still under discussion. The workaround is to create multiple scans, each with a different user agent or initial session.

If we decide that Gryffin should have the intelligence to determine supported user agents and differentiate authenticated sessions, that could be done via a pre-crawl phase. Another TODO item.
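
A minimal sketch of the workaround mentioned above, assuming hypothetical names (`scanConfig`, `runScan`, the cookie value): each identity gets its own scan, seeded with its own user agent and session cookie, so authenticated and anonymous versions of a page are crawled and deduplicated independently rather than compared against each other.

```go
package main

import (
	"fmt"
	"net/http"
)

// scanConfig describes one scan: its own user agent and initial session.
type scanConfig struct {
	name      string
	userAgent string
	cookies   []*http.Cookie
	seedURL   string
}

// runScan builds the seed request for a scan. A real scan would hand this
// request to the crawler; here we only show that each scan starts from its
// own authentication state.
func runScan(cfg scanConfig) error {
	req, err := http.NewRequest("GET", cfg.seedURL, nil)
	if err != nil {
		return err
	}
	req.Header.Set("User-Agent", cfg.userAgent)
	for _, c := range cfg.cookies {
		req.AddCookie(c)
	}
	fmt.Printf("scan %q seeded with %d cookie(s)\n", cfg.name, len(cfg.cookies))
	return nil
}

func main() {
	scans := []scanConfig{
		{name: "anonymous", userAgent: "gryffin-scan/desktop", seedURL: "https://example.com/"},
		{name: "logged-in", userAgent: "gryffin-scan/desktop", seedURL: "https://example.com/",
			cookies: []*http.Cookie{{Name: "session", Value: "REPLACE_ME"}}},
	}
	for _, s := range scans {
		if err := runScan(s); err != nil {
			fmt.Println("scan failed:", err)
		}
	}
}
```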

The project is being archived.