Crawler processes out-of-scope page after redirect

spekulatius opened this issue 4 years ago

Something I noticed while working on my crawler project for Rankletter.com (and tested there):

301 redirects aren't picked up unless the follow-redirects flag (RequestOptions::ALLOW_REDIRECTS) is passed to the crawler (expected).
With the flag activated, the crawler also processes the external page, regardless of whether it is in scope or not.

In my case, I redirect /contact (rankletter.com/contact) to my blog, peterthaleikis.com/blog (as a more or less temporary solution). The crawler then picks up the relative links on that page as if they were on the domain I'm crawling (here rankletter.com).
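To illustrate the effect, here is a minimal sketch in Python (not the crawler's own PHP code) of how the same relative links resolve differently depending on which URL is used as the base. The https scheme and the two hrefs are assumptions made up for the example, and resolve_links is a hypothetical helper, not part of any library.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Resolve each href against base_url, the way a crawler would."""
    return [urljoin(base_url, href) for href in hrefs]


# The URL that was requested, and where the 301 redirect ended up.
requested_url = "https://rankletter.com/contact"
final_url = "https://peterthaleikis.com/blog"

# Hypothetical relative links found on the page behind the redirect.
hrefs = ["/about", "posts/hello"]

# Resolving against the requested URL attributes the links to rankletter.com.
resolved_against_requested = resolve_links(requested_url, hrefs)

# Resolving against the final URL attributes them to the blog's domain.
resolved_against_final = resolve_links(final_url, hrefs)
```

Resolving against the requested URL makes the blog's links look like pages on rankletter.com, which matches what I'm seeing.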

From a brief look over the code, it seems the CrawlRequestFulfilled class needs to be extended with a check for this.
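The kind of check I have in mind could look roughly like the sketch below. It is written in Python for illustration only and is not the library's API: host_of, is_in_scope, and should_extract_links are hypothetical names, and comparing hosts exactly is an assumption about what "in scope" means.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or an empty string if missing."""
    return (urlsplit(url).hostname or "").lower()


def is_in_scope(crawl_base_url, candidate_url):
    """Treat a URL as in scope only if its host equals the crawl's host."""
    return host_of(crawl_base_url) == host_of(candidate_url)


def should_extract_links(crawl_base_url, redirect_chain):
    """Decide whether to parse links from a response.

    redirect_chain lists every URL visited for the request, in order;
    the last entry is where the response actually came from.
    """
    if not redirect_chain:
        return False
    return is_in_scope(crawl_base_url, redirect_chain[-1])
```

With this, a request for rankletter.com/contact that ends on peterthaleikis.com/blog would be skipped instead of having its links extracted. Note that an exact host comparison treats www.rankletter.com and rankletter.com as different hosts, so a real check might need to be more lenient.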

Just thought I'd let you know and check whether this is known (and maybe the reason for switching redirects off in the first place?).

Cheers,
Peter

Freek Van der Herten · Answer 1 · Wed Dec 09 2020 05:59:13 GMT+0800 (China Standard Time)

To be honest, it's been a while since I coded that part up, and I don't know anymore if it's intended or not 😬

Peter Thaleikis · Answer 2 · Wed Dec 09 2020 06:22:06 GMT+0800 (China Standard Time)

That's fair enough. I totally get the "Ehm, yeah, maybe, maybe not. I can't remember" feeling 😄