Scraping ajax enabled webpages
SorataAragaki opened this issue · comments
When I scrape pages that use a lot of Ajax and JS with Mechanize, some information is lost compared with the original pages. Mechanize doesn't have a JS implementation, and the watir-webdriver gem is really, really slow. Are there any good solutions?
You could try PhantomJS with something like selenium-webdriver; at the very least this gives you a headless option. The alternative is to figure out what the underlying JS request is and convert it to something Mechanize can use.
It's not too difficult to figure out what the XMLHttpRequests are. If you use a proxy server like Charles you can inspect all the calls the page makes and then usually mimic them with Mechanize.
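Mimicking an XHR mostly comes down to sending the same URL and headers the browser did. A sketch using only the standard library (the endpoint and header values are hypothetical — substitute whatever you see in the proxy; with Mechanize you'd pass the same headers as the fourth argument to `agent.get`):

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint discovered by watching the page in Charles.
uri = URI('https://example.com/api/items?page=1')

req = Net::HTTP::Get.new(uri)
# Headers that make the server treat this like the browser's own XHR.
req['X-Requested-With'] = 'XMLHttpRequest'
req['Accept']           = 'application/json'
req['Referer']          = 'https://example.com/items'

# Send with: Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
```

The response is usually JSON or an HTML fragment, which is often easier to parse than the full rendered page anyway.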
This doesn't, however, give you the excellent (easy-to-read) output that Mechanize produces, and you can't interact with the resulting DOM. I'd love to see Mechanize's DSL and output built on top of something like PhantomJS so you could execute JS, but I suspect that would be a huge (and unlikely) change.