spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

Crawler processes out of scope page after redirect

spekulatius opened this issue

Hello @freekmurze

Something I noticed while working on my crawler project for Rankletter.com (and tested on it):

  • 301 redirects aren't followed unless the follow-redirects flag (RequestOptions::ALLOW_REDIRECTS) is passed to the crawler (expected).
  • With the flag enabled, the crawler also processes the redirect target, regardless of whether it is in scope or not.

In my case I redirect rankletter.com/contact to my blog at peterthaleikis.com/blog (as a more or less temporary solution). The crawler then picks up the relative links on that page as if they were on the domain I'm crawling (here rankletter.com).
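To make the misattribution concrete, here is a minimal PHP sketch. The helper `resolveRootRelative()` is hypothetical and not part of spatie/crawler. The `https` scheme and the `/about` link are assumptions for the example, and the helper only handles root-relative paths (no ports, no `../` segments).

```php
<?php

/**
 * Hypothetical helper (not part of spatie/crawler): resolve a root-relative
 * path such as "/about" against the scheme and host of a page URL.
 */
function resolveRootRelative(string $pageUrl, string $path): string
{
    $parts = parse_url($pageUrl);

    return $parts['scheme'] . '://' . $parts['host'] . $path;
}

// URL that was requested, and the URL the redirect actually ended on.
// The https scheme is an assumption; the issue only gives host and path.
$requested = 'https://rankletter.com/contact';
$effective = 'https://peterthaleikis.com/blog';

// Resolved against the requested URL, the link looks like it is in scope.
echo resolveRootRelative($requested, '/about'), PHP_EOL;

// Resolved against the effective URL, it turns out to be on the blog domain.
echo resolveRootRelative($effective, '/about'), PHP_EOL;
```

The same relative link yields two different absolute URLs depending on which base is used, which matches the behaviour described above.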

From a brief look at the code, it seems the CrawlRequestFulfilled class needs to be extended with a check for this.
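One way such a check could look, as a rough sketch only: compare the host of the URL the response actually came from (after redirects) with the host being crawled, and skip link extraction on a mismatch. `isInScope()` is a hypothetical helper, not an existing method in the library, and how the effective URL would be obtained inside CrawlRequestFulfilled is left open here. The `https` URLs are assumed for the example.

```php
<?php

/**
 * Hypothetical scope check (an assumption, not the library's code):
 * returns true only when the effective URL is on the same host as the
 * base URL of the crawl. Host comparison is case-insensitive.
 */
function isInScope(string $effectiveUrl, string $baseUrl): bool
{
    $effectiveHost = parse_url($effectiveUrl, PHP_URL_HOST);
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);

    // parse_url() returns null or false when no host can be extracted.
    if (!is_string($effectiveHost) || !is_string($baseHost)) {
        return false;
    }

    return strcasecmp($effectiveHost, $baseHost) === 0;
}

$baseUrl = 'https://rankletter.com';

// Stayed on the crawled domain: links on this page may be processed.
var_dump(isInScope('https://rankletter.com/contact', $baseUrl));

// Redirected to another domain: links on this page should be skipped.
var_dump(isInScope('https://peterthaleikis.com/blog', $baseUrl));
```

This sketch treats subdomains as out of scope. Whether they should count as in scope is a separate decision.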

Just thought I'd let you know and check whether this is known (and maybe the reason for switching the redirects off in the first place?).

Cheers,
Peter

To be honest, it's been a while since I coded that part, and I no longer know whether it's intended or not 😬

That's fair enough. I totally get the "Ehm, yeah, maybe, maybe not. I can't remember" feeling 😄