mikeal / spider

Programmable spidering of web sites with node.js and jQuery

process out of memory

gmarcus opened this issue · comments

I am creating a spider to walk the iTunes App Store.

It hits about 80 or so pages, then I get the following message:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

You can find the spider script, and the console log here:
https://gist.github.com/934787

This should be fixed in the latest GitHub version.
We now default to NoCache. MemoryCache was kind of a placeholder for a real persistent cache store; it doesn't really do anything except eat up all your memory at the moment :)

Hey Mikeal,
I am already running the latest.

main.js:99: this.cache = options.cache || new NoCache();

I am seeing huge memory usage. Can you try the script out at https://gist.github.com/934787 and see if you also see it leak?

Best,
Glenn

We learned today that jsdom is leaking badly.

Any thoughts of switching to node-soupselect and node-htmlparser?

Actually, I'm going to wait a bit for admc's jellyfish to be ready (which should be around JSConf).
If we can suitably abstract over jellyfish, we could swap out the underlying browser-like environment for any number of things, including real browsers. We would probably still default to the jsdom-like environment.

Process out of memory again!

https://github.com/admc/jellyfish
jellyfish is ready. So it's time :)

I'll talk to admc today about the best integration point. We might need to opt out of doing the HTTP requests on our own for some browsers, which would put the caching responsibility on the browser.

I think the most interesting part of this spider module is the router. I would like to fork this module to make a more scalable/flexible spider, which can choose whether to parse the HTML itself or use its own engine/plugin.
Any ideas about this?

"Their own engine/plugin", like what?
Jellyfish seeks to support all "browser-like environments", so hooking spider into that seems like the best route to supporting additional environments.

If I just want to handle a raw XML/JSON response, would jellyfish fit?

So you don't want a browser env, you just want the raw HTTP response string?

Yeah, why not? Sometimes the response string is tiny, which doesn't justify spinning up a full browser env.

I wasn't suggesting it was negative; I was just trying to fully understand your use case.
Part of the jellyfish integration will need to be a setting for which env to use, which will of course change the callback args for some environments. I'm not opposed to a "raw" env which just passes in the response body.

Hi, I had a problem with memory leaks too. The process runs out of memory very quickly (for me, after about 300 page loads). The problem was that jQuery calls setInterval, which keeps a reference to the window, so the window is never gc'd. There's a long discussion about it here: nodejs/node-v0.x-archive#1007 (comment).

The solution is easy though: call window.close() at the end of each .route callback.

Hey guys,

I had a memory leak problem too, and after 2 days I found that the problem is related to jsdom. jsdom has a big memory leak bug, and because spider uses jsdom, it has this problem too. I use cheerio now and don't have any memory leak problems.

Thanks

The issue with jsdom "leaking memory" is programmer error. The programmer using the environment must call window.close() on completion to destroy the environment; otherwise the virtual browser window remains open.

I think this issue can be closed.