Kevenyf / web-crawlers

several py web crawlers

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

four crawlers, targeted at four different sites.

####Difficulties

  • douban
    • captcha
    • block ip
  • zhihu
    • dynamic page
  • weibo
    • post data has random id
  • songtaste

####Solutions

  • douban
    • catch the captcha and enter the characters manually
    • set a interval for each request, or use a proxy
  • zhihu
    • use selenium2 and phantomjs instead of urllib2
  • weibo
    • catch the random id
  • songtaste
    • the simplest one

####Gains

  • Beautiful Soup is really awesome, and it has a interesting name.
  • Though extracting information is a relatively easy part in a research, a efficient crawler will still be very helpful.

About

several py web crawlers


Languages

Language:Python 100.0%