medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.

Home Page: http://medialab.github.io/sandcrawler/

PhantomJS library integration - phridge/node-phantomjs

moshewe opened this issue

I tried to follow how the library works with the PhantomJS process, and I got lost. Are you using a library to do that?

I am using bothan for that, and bothan uses the phantomjs dependency directly.

OK, now I understand what's going on.
I really want to use this library, but I'm a little hesitant: not using a common PhantomJS-to-Node.js bridge is a hard decision to defend when explaining this to our CTO... How does bothan differ from the libs mentioned above?

+1 for the Star Wars reference, btw :)

I don't use a bridge such as the ones you mention because they do not let me use phantom the way I really need to. Bothan provides low-level access to the phantomjs child process, so you can really script for phantomjs and not for node. Phantomjs has many issues, such as memory leaks, that I wouldn't be able to contain (as much as possible) with other, higher-level bridges.

But keep in mind that all the code here is quite experimental and will be rewritten soon enough. One of the major problems with phantomjs is that it does not scale well: you constantly need to kill and respawn the phantomjs processes to avoid the serious leaks that are inherent to phantomjs (less so with version 2.1, but still).

I totally get you on that. I found myself switching from XPath to CSS selectors because XPath would leak most of the time... I assume managing a pool of phantom spawns might do the trick, using each spawn for two or three pages or so.
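To make the pool idea concrete, here is a minimal sketch of a recycling pool in plain Node.js. It is not sandcrawler's or bothan's code: `RecyclingPool`, `spawnWorker`, `size`, and `maxUses` are hypothetical names, and the worker is assumed to be any object exposing `run(job)` and `kill()`. Each worker serves at most `maxUses` jobs (pages) and is then killed and replaced.

```javascript
// Sketch only: a pool whose workers are retired after a fixed number of jobs.
// All names are hypothetical; nothing here is part of sandcrawler or bothan.
class RecyclingPool {
  constructor(spawnWorker, { size = 2, maxUses = 3 } = {}) {
    this.spawnWorker = spawnWorker; // assumed shape: () => ({ run(job), kill() })
    this.maxUses = maxUses;
    this.idle = [];    // slots ready to take a job
    this.waiting = []; // resolvers of callers waiting for a free slot
    for (let i = 0; i < size; i++) {
      this.idle.push({ worker: spawnWorker(), uses: 0 });
    }
  }

  // Borrow a slot, waiting if every worker is busy.
  acquire() {
    if (this.idle.length > 0) return Promise.resolve(this.idle.pop());
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  // Give a slot back; replace its worker once the use budget is spent.
  release(slot) {
    if (slot.uses >= this.maxUses) {
      slot.worker.kill();
      slot = { worker: this.spawnWorker(), uses: 0 };
    }
    const next = this.waiting.shift();
    if (next) next(slot);
    else this.idle.push(slot);
  }

  // Run one job (for example, scraping one page) on a pooled worker.
  async run(job) {
    const slot = await this.acquire();
    slot.uses += 1;
    try {
      return await slot.worker.run(job);
    } finally {
      this.release(slot);
    }
  }

  // Kill all idle workers; call this once no jobs are in flight.
  close() {
    for (const slot of this.idle) slot.worker.kill();
    this.idle = [];
  }
}
```

In a real setup, `spawnWorker` would presumably wrap a phantomjs child process (for instance one started with Node's `child_process.spawn`) and `run` would send it a page to load; that wiring is left out here because it depends on how the phantomjs-side script communicates.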

Please note the original phantomjs dep package is deprecated and has moved to something-prebuilt.

Yup. I just need time to rework all of this soon.

I will also add an electron engine.

Never heard of Electron before, thanks! Looks interesting!

There is also the jsdom option, which can work for some simple cases.