omegahat / RCurl

getURLAsynchronous file descriptor exhaustion

asieira opened this issue · comments

I am running a large number (32k) of GET requests using getURLAsynchronous, 120 URLs at a time. This is what the call looks like within the loop:

ret = getURLAsynchronous(cfg$access.point,
                         httpheader = c("X-Api-Key" = cfg$key),
                         perform = 16, verbose = FALSE, header = FALSE,
                         binary = FALSE, ssl.verifypeer = FALSE)
gc(FALSE)  # explicitly trigger garbage collection after each batch
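
For context, here is roughly what the surrounding loop looks like (a sketch; urls, cfg and batch.size are placeholders for my actual objects):

library(RCurl)

# Sketch of the batching loop: 32k URLs, 120 per call to getURLAsynchronous
batch.size = 120
batches = split(urls, ceiling(seq_along(urls) / batch.size))
responses = lapply(batches, function(batch) {
  ret = getURLAsynchronous(batch,
                           httpheader = c("X-Api-Key" = cfg$key),
                           perform = 16, verbose = FALSE, header = FALSE,
                           binary = FALSE, ssl.verifypeer = FALSE)
  gc(FALSE)  # rely on garbage collection to release the handles created inside
  ret
})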

After this batch is executed, showConnections() shows no entries other than stdin, stdout, and stderr.
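
Note that showConnections() only lists R-level connections; the sockets libcurl keeps open are not R connections and don't appear there. A quick way to see the process's actual descriptor usage (a sketch assuming lsof is available, as it is on OS X) is:

# Approximate count of file descriptors held by the current R process
fds = system(paste("lsof -p", Sys.getpid()), intern = TRUE)
length(fds)  # includes one header line from lsof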

Still, when I try to use mclapply to process all the combined responses in parallel, I get the following error:

Error in mcfork() (from util.R#160) : unable to create a pipe
Calls: <redacted> ... unlist -> lapply -> mclapply -> lapply -> FUN -> mcfork

If I run the same mclapply without calling getURLAsynchronous first, with test data of the same size, no such error happens.
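
For reference, the processing step is roughly of this shape (parse.response is a hypothetical stand-in for my actual per-response function):

library(parallel)

# Roughly the shape of the processing step that fails after the downloads
parsed = mclapply(unlist(responses), parse.response, mc.cores = detectCores())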

I tried searching the RCurl documentation for any steps I could include in my code to close connections and/or release resources, but was unable to find any. So I'm assuming that happens when the handles are garbage collected, hence the explicit call to gc() after each batch.
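
Along the same lines, the handles can be created explicitly so their lifetime is under the caller's control and they can be dropped between batches (a sketch; whether this actually releases the sockets depends on the finalizer issue discussed below):

# Sketch: manage the curl handles explicitly and drop them after the batch
curl = getCurlHandle()
multi = getCurlMultiHandle(curl)
ret = getURLAsynchronous(cfg$access.point,
                         httpheader = c("X-Api-Key" = cfg$key),
                         curl = curl, multiHandle = multi,
                         perform = 16, ssl.verifypeer = FALSE)
rm(curl, multi)  # drop the only references to the handles
gc(FALSE)        # collection only helps if the handles register cleanup finalizers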

Am I missing something here?

BTW, I saw references in the documentation to a function called curlGlobalCleanup, but that function is simply not found by R when I try to call it.

Also, it's worth mentioning that I am running 64-bit R 3.0.3 on Mac OS X 10.9.2. sessionInfo() tells me I am running RCurl_1.95-4.1.

The handle and multi-handle objects seem to be missing finalizers (as per http://cran.r-project.org/doc/manuals/R-exts.html#External-pointers-and-weak-references) that would call the libcurl cleanup functions to release resources.
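
For reference, the pattern described in that section of the manual looks roughly like this from the R side (a sketch with placeholder names, not RCurl's actual internals):

# Sketch of the finalizer pattern: cleanup runs when the handle is garbage collected
makeHandleWithCleanup = function() {
  h = new.env()  # placeholder for the object wrapping the libcurl handle
  reg.finalizer(h, function(e) {
    # this is where curl_easy_cleanup()/curl_multi_cleanup() would be invoked at C level
    message("handle finalized, libcurl resources released")
  }, onexit = TRUE)
  h
}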

Found a workaround: if I use forbid.reuse=TRUE in the call to getURLAsynchronous, the problem does not happen.
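
Concretely, the only difference is the added option (forbid.reuse maps to libcurl's CURLOPT_FORBID_REUSE, which closes each connection as soon as its transfer is done):

ret = getURLAsynchronous(cfg$access.point,
                         httpheader = c("X-Api-Key" = cfg$key),
                         perform = 16, verbose = FALSE, header = FALSE,
                         binary = FALSE, ssl.verifypeer = FALSE,
                         forbid.reuse = TRUE)  # close connections instead of reusing them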

This, however, confirms that proper cleanup of the handles (which would cause their connections to be closed once the handles are collected) would solve the problem in the general case, without giving up the connection reuse that forbid.reuse sacrifices.