johntitus / node-horseman

Run PhantomJS from Node

So many 'failed to GET url'

minotaurrr opened this issue · comments

I'm just doing horseman.open('https://www.google.com') for testing, but I keep getting 'failed to GET url' at random times. Roughly 7 out of 10 attempts fail.

Any idea why?

I kicked the tires on this library by following the project docs and saw a similar thing. Both the Twitter and Google examples failed to run.

horseman v3.3.0
node v8.9.1

I tried on multiple hosts and noticed that the failure frequency varies, but I still get the same error eventually.

Bumping this topic, the same is happening to me.

Bumping this as well. I'm getting the error repeatedly, and I can't catch it either.

minotaurrr, Google detects scrapers and bans your IP address very quickly.
That means you can only call horseman.open('http://google.com') ONCE every 5 minutes. If you want to scrape it more than once per 5 minutes, you need to:

  • set up a proxy in the horseman options
  • clear cookies with horseman.cookies()
  • change the User-Agent in horseman
  • vary the value you pass to horseman.wait(value). If the interval between your requests is always the same, Google will flag it.
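
The User-Agent and wait-time suggestions above could be sketched roughly as follows. This is only an illustration, not the library's recommended approach: the helper names (jitteredDelay, pickRandom, openPolitely), the User-Agent strings, and the 2 to 8 second range are all made up for the example, and it assumes node-horseman's userAgent(), wait(), open(), title() and close() behave as described in its README. Cookie handling is left out.

```javascript
// Hypothetical helpers; names and values are illustrative, not part of node-horseman.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5',
];

// Return a random integer in [minMs, maxMs] so that waits are not evenly spaced.
function jitteredDelay(minMs, maxMs, random = Math.random) {
  return minMs + Math.floor(random() * (maxMs - minMs + 1));
}

// Pick one element of a non-empty array at random.
function pickRandom(items, random = Math.random) {
  return items[Math.floor(random() * items.length)];
}

// Open a URL once, with a random User-Agent and a random pause beforehand.
// Assumes node-horseman and PhantomJS are installed; the module is loaded
// lazily so the helpers above can be used without it.
function openPolitely(url) {
  const Horseman = require('node-horseman');
  const horseman = new Horseman({ timeout: 15000 });
  return horseman
    .userAgent(pickRandom(USER_AGENTS))
    .wait(jitteredDelay(2000, 8000))
    .open(url)
    .title()
    .finally(() => horseman.close());
}
```

Calling openPolitely('https://www.google.com').then(console.log).catch(console.error) would then print the page title or the error. Whether this is enough to avoid being blocked depends on the site.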

Google has most likely banned your IP. Add a delay between GET requests, or set up a list of proxies and cycle through them at random.
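
A minimal sketch of the proxy cycling idea, assuming the proxy and proxyType options listed in the node-horseman README. The proxy addresses, the PROXIES list, and the function names randomProxyOptions and openViaRandomProxy are placeholders invented for this example.

```javascript
// Hypothetical proxy addresses; replace them with proxies you actually control.
const PROXIES = ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080'];

// Build Horseman constructor options that point at one randomly chosen proxy.
function randomProxyOptions(proxies, random = Math.random) {
  const proxy = proxies[Math.floor(random() * proxies.length)];
  return { proxy: proxy, proxyType: 'http' };
}

// Start a fresh Horseman instance behind a random proxy for each request.
// Assumes node-horseman and PhantomJS are installed (loaded lazily here).
function openViaRandomProxy(url) {
  const Horseman = require('node-horseman');
  const horseman = new Horseman(randomProxyOptions(PROXIES));
  return horseman
    .open(url)
    .title()
    .finally(() => horseman.close());
}
```

Creating a new instance per request keeps the example simple, since the proxy is fixed when PhantomJS starts.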