Kevenyf / web-crawlers

Several Python web crawlers.

Four crawlers, each targeting a different site.

####Difficulties

douban
- captcha
- IP blocking
zhihu
- dynamically rendered pages
weibo
- POST data includes a random ID
songtaste

####Solutions

douban
- capture the captcha and enter the characters manually
- set an interval between requests, or use a proxy
zhihu
- use Selenium 2 and PhantomJS instead of urllib2
weibo
- capture the random ID and include it in the POST data
songtaste
- the simplest of the four
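
The douban approach of spacing out requests, optionally through a proxy, could look roughly like the sketch below. It is not the code from this repository: it assumes Python 3's `urllib.request` rather than `urllib2`, and the names `Throttle`, `make_opener`, and `fetch_all`, the 2-second default interval, and the proxy URL are all hypothetical choices made for illustration.

```python
import time
import urllib.request


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep requests `interval` seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def make_opener(proxy_url=None):
    """Return a urllib opener, optionally routed through a proxy."""
    handlers = []
    if proxy_url:
        # Route both HTTP and HTTPS traffic through the same proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)


def fetch_all(urls, interval=2.0, proxy_url=None):
    """Fetch each URL in turn, pausing between requests."""
    throttle = Throttle(interval)
    opener = make_opener(proxy_url)
    pages = []
    for url in urls:
        throttle.wait()
        with opener.open(url, timeout=10) as response:
            pages.append(response.read())
    return pages
```

Calling `fetch_all(urls, interval=2.0)` would fetch the pages one at a time with at least two seconds between requests; passing `proxy_url` sends the same requests through a proxy instead of the local IP. The right interval depends on the site and would need tuning.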

####Gains

Beautiful Soup is really awesome, and it has an interesting name.
Though extracting information is a relatively easy part of research, an efficient crawler is still very helpful.
