Scraping ajax enabled webpages
SorataAragaki opened this issue · comments
When I scrape pages that use a lot of Ajax and JS with Mechanize, some information is lost compared with the original pages. Mechanize doesn't have a JS implementation, and the watir-webdriver gem is really, really slow. Are there any good solutions?
You could try PhantomJS with something like selenium-webdriver; at the very least this gives you a headless option. The alternative is to figure out what the underlying JS request is and convert it to something Mechanize can use.
It's not too difficult to figure out what the XMLHttpRequests are. If you use a proxy server like Charles you can inspect all the calls the page makes and then usually mimic them with Mechanize.
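Mimicking an XHR mostly comes down to sending the same URL and headers the browser did. A sketch using only the standard library (the endpoint and header values are hypothetical — substitute whatever you see in the proxy; with Mechanize you'd pass the same headers as the fourth argument to `agent.get`):

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint discovered by watching the page in Charles.
uri = URI('https://example.com/api/items?page=1')

req = Net::HTTP::Get.new(uri)
# Headers that make the server treat this like the browser's own XHR.
req['X-Requested-With'] = 'XMLHttpRequest'
req['Accept']           = 'application/json'
req['Referer']          = 'https://example.com/items'

# Send with: Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
```

The response is usually JSON or an HTML fragment, which is often easier to parse than the full rendered page anyway.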
This doesn't, however, give you the excellent (easy-to-read) output that Mechanize produces, and you can't interact with the resulting DOM. I'd love to see Mechanize's DSL and output built on top of something like PhantomJS so you could execute JS, but I suspect that would be a huge (and unlikely) change.