mikeal / spider

Programmable spidering of web sites with node.js and jQuery

process out of memory

gmarcus opened this issue · comments

I am creating a spider to walk the iTunes App Store.

It hits about 80 or so pages, then I get the following message:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

You can find the spider script, and the console log here:
https://gist.github.com/934787

This should be fixed in the latest GitHub version.
We now default to NoCache. MemoryCache was kind of a placeholder for a real persistent cache store; it doesn't really do anything except eat up all your memory at the moment :)

Hey Mikeal,
I am already running the latest.

main.js:99: this.cache = options.cache || new NoCache();

I am seeing huge memory usage. Can you try the script out at https://gist.github.com/934787 and see if you also see it leak?

Best,
Glenn

We learned today that jsdom is leaking badly.

Any thoughts of switching to node-soupselect and node-htmlparser?

Actually, I'm going to wait a bit for admc's jellyfish to be ready (which should be around JSConf).
If we can suitably abstract over jellyfish, we could swap out the underlying browser-like environment for any number of things, including real browsers. We would probably still default to the jsdom-like environment.

Process out of memory again!

https://github.com/admc/jellyfish
jellyfish is ready. So it's time :)

I'll talk to admc today about the best integration point. We might need to opt out of doing the HTTP requests on our own for some browsers, which would put the caching responsibility on the browser.

I think the most interesting part of this spider module is the router. I would like to fork this module to make a more scalable/flexible spider, which can choose whether to parse the HTML itself or use its own engine/plugin.
Any ideas about this?

"Their own engine/plugin", like what?
Jellyfish seeks to support all "browser-like environments", so hooking spider into that seems like the best route to supporting additional environments.

If I just want to handle a raw XML/JSON response, would jellyfish fit?

So you don't want a browser env, you just want the raw HTTP response string?

Yeah, why not? Sometimes the response string is tiny, which doesn't justify spinning up a full browser env.

I wasn't suggesting it was negative; I was just trying to fully understand your use case.
Part of the jellyfish integration will need to be a setting for which env to use, which will of course change the callback args for some environments. I'm not opposed to a "raw" env which just passes in the response body.

Hi, I had a problem with memory leaks too. The process runs out of memory very quickly (for me, after about 300 page loads). The problem was that jQuery calls setInterval, which keeps a reference to the window, so the window is never gc'd. There's a long discussion about it here: nodejs/node-v0.x-archive#1007 (comment).

The solution is easy though: call window.close() at the end of each .route callback.

Hey guys,

I had a memory leak problem too, and after 2 days I found that the problem is related to jsdom. jsdom has a big memory leak bug, and because spider uses jsdom, it has this problem too. I use cheerio now and don't have any memory leak problems.

Thanks

The issue with jsdom "leaking memory" is programmer error. The programmer using the environment must call window.close() on completion to destroy the environment; otherwise the virtual browser window remains open.

I think this issue can be closed.