PuerkitoBio / gocrawl

Polite, slim and concurrent web crawler.

Redirect + normalization problem

goodsign opened this issue

Hi Martin!

Currently I'm running into a problem, but I'm not sure what I should focus on, or whether it's actually several problems combined. I'll try to explain what I'm encountering, and I'd be grateful for any comments, because maybe it's not even a bug.

Okay, for example, let's crawl 'http://golang.org': if you look at the golang.org source code, you'll see links like /pkg/, /doc/, etc.

These links get resolved to absolute URLs and normalized by gocrawl, so for /pkg/, for example, I get 'http://golang.org/pkg' (the default purell flag is FlagsAllGreedy, so I lose the trailing slash).
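
Here's a minimal sketch of what I mean, assuming plain net/url resolution followed by purell normalization (which is my understanding of what gocrawl does under the hood):

package main

import (
	"fmt"
	"net/url"

	"github.com/PuerkitoBio/purell"
)

func main() {
	base, _ := url.Parse("http://golang.org/")
	rel, _ := url.Parse("/pkg/")
	abs := base.ResolveReference(rel) // http://golang.org/pkg/

	// FlagsAllGreedy includes FlagRemoveTrailingSlash, so the slash is dropped.
	fmt.Println(purell.NormalizeURL(abs, purell.FlagsAllGreedy)) // http://golang.org/pkg
}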

If you visit 'http://golang.org/pkg' (even just in your browser), you'll see that it redirects you to '/pkg/' (right where the initial link pointed).

First problem

And here is the first problem, illustrated by a piece of the gocrawl log (I removed the unnecessary parts):

enqueue: http://golang.org/pkg
...
worker 1 - popped: http://golang.org/pkg
...
worker 1 - redirect to /pkg/
...
receive url /pkg/
ignore on absolute policy: /pkg

So it seems that the redirected URL doesn't get resolved to an absolute one like the original was. I checked your code and, if I'm not mistaken, saw resolving logic only in worker.processLinks. So it seems that resolution is missing somewhere in the redirect logic.
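
Just to illustrate what I mean, something along these lines would do it (a hypothetical helper, not your actual code): resolve a possibly relative Location header against the URL that was just fetched.

package main

import "net/url"

// resolveRedirect resolves a possibly relative Location header against the
// URL that was just fetched, so "/pkg/" becomes "http://golang.org/pkg/".
func resolveRedirect(reqURL *url.URL, location string) (*url.URL, error) {
	loc, err := url.Parse(location)
	if err != nil {
		return nil, err
	}
	return reqURL.ResolveReference(loc), nil
}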

Second problem

Even if the redirect URL were resolved to an absolute one, it would still get normalized unless I change URLNormalizationFlags. So the trailing slash would still always be removed (we can see in the log that we 'receive /pkg/' and 'ignore /pkg'), and we would be redirected infinitely: golang.org/pkg redirects to golang.org/pkg/, which after normalization becomes golang.org/pkg again, which redirects to ... and so on.

My temporary solution

I've temporarily worked around this by avoiding any slash-related logic, so I've set

opts.URLNormalizationFlags = purell.FlagsAllGreedy & (^purell.FlagRemoveTrailingSlash)

and everything went fine.
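
For completeness, here is roughly how that looks in my setup. This is only a sketch: I'm using the stock DefaultExtender here, and the exact constructor and option names may differ between gocrawl versions.

package main

import (
	"github.com/PuerkitoBio/gocrawl"
	"github.com/PuerkitoBio/purell"
)

func main() {
	opts := gocrawl.NewOptions(new(gocrawl.DefaultExtender))
	// Keep the greedy normalization, but leave trailing slashes alone.
	opts.URLNormalizationFlags = purell.FlagsAllGreedy & (^purell.FlagRemoveTrailingSlash)
	c := gocrawl.NewCrawlerWithOptions(opts)
	c.Run("http://golang.org")
}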

Fix proposal and discussion

Maybe some other normalization flag should be chosen as the default?

Or maybe it would be even better to change the strategy a bit:

  • Pass the normalized URL to Filter,
  • After Filter returns 'true', fetch the original URL as-is.

Personally I prefer the latter, because it rules out the situation where normalization changes the URL and the website serves something different for the modified one (like a redirect back to the original).
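
As a rough sketch of what I have in mind (hypothetical names, not a patch): carry both forms of each URL, use the normalized one only as the key for Filter and the visited set, and fetch the original untouched.

package sketch

import (
	"net/url"

	"github.com/PuerkitoBio/purell"
)

// crawlURL pairs the URL exactly as found in the page with its normalized
// form, which is used only for filtering and deduplication.
type crawlURL struct {
	original   *url.URL // fetched as-is, exactly as written in the page
	normalized string   // key for Filter and the visited set
}

func newCrawlURL(u *url.URL, flags purell.NormalizationFlags) crawlURL {
	return crawlURL{
		original:   u,
		normalized: purell.NormalizeURL(u, flags),
	}
}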

What do you think? Tell me if I'm missing something here.

Hi,

Thanks for the detailed information. I did run into something similar (normalization made the request fail because the website did not allow non-www), and I used a different normalization to make it work, but that website did not redirect back to the original (non-normalized) URL, so I didn't think about this possible circular problem. Your proposal makes sense as far as I'm concerned.

As for the redirect, I wrongly assumed that the new location was always absolute.

Let me check this all in context in the coming days, but this feels right.

Thanks!