psf / requests

A simple, yet elegant, HTTP library.

Home Page: https://requests.readthedocs.io/en/latest/


requests leaves a hanging connection for each request

Bluehorn opened this issue · comments

This was already noted in #458, but deserves an extra ticket: When using the simple API (via requests.head/get/post etc.), the requests module leaves a connection open until it is garbage collected. This eats up resources on the server side and is really impolite.

How to reproduce:

Run this script (can somebody please tell me how one can embed a gist in markdown? The help page shows only the result of embedding...):

https://gist.github.com/2234274
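For reference, a reconstruction of that script, pieced together from the diff quoted later in this thread and from the output below (the exact lsof invocation is an assumption):

import gc, os, subprocess, requests

gc.disable()         # Just to make sure - I have seen this with gc enabled

for x in range(20):
    requests.head("http://www.google.com/")

print("Open sockets after 20 head requests:")
subprocess.call(["lsof", "-a", "-p", str(os.getpid()), "-i", "TCP"])

print("Garbage collection result: %d" % gc.collect())
print("Open sockets after garbage collection:")
subprocess.call(["lsof", "-a", "-p", str(os.getpid()), "-i", "TCP"])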

Result on my system:

torsten@sharokan:~/workspace/loco2-git$ python demo.py
Open sockets after 20 head requests:
COMMAND   PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python  13928 torsten    3u  IPv4 616968      0t0  TCP sharokan.fritz.box:50072->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten    4u  IPv4 616972      0t0  TCP sharokan.fritz.box:50073->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten    5u  IPv4 616976      0t0  TCP sharokan.fritz.box:50074->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten    6u  IPv4 616980      0t0  TCP sharokan.fritz.box:50075->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten    7u  IPv4 616984      0t0  TCP sharokan.fritz.box:50076->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten    8u  IPv4 616988      0t0  TCP sharokan.fritz.box:50077->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten    9u  IPv4 616992      0t0  TCP sharokan.fritz.box:50078->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   10u  IPv4 616996      0t0  TCP sharokan.fritz.box:50079->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   11u  IPv4 617000      0t0  TCP sharokan.fritz.box:50080->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   12u  IPv4 617004      0t0  TCP sharokan.fritz.box:50081->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   13u  IPv4 617008      0t0  TCP sharokan.fritz.box:50082->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   14u  IPv4 617012      0t0  TCP sharokan.fritz.box:50083->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   15u  IPv4 617016      0t0  TCP sharokan.fritz.box:50084->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   16u  IPv4 617020      0t0  TCP sharokan.fritz.box:50085->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   17u  IPv4 617024      0t0  TCP sharokan.fritz.box:50086->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   18u  IPv4 617028      0t0  TCP sharokan.fritz.box:50087->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   19u  IPv4 617032      0t0  TCP sharokan.fritz.box:50088->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   20u  IPv4 617036      0t0  TCP sharokan.fritz.box:50089->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   21u  IPv4 617040      0t0  TCP sharokan.fritz.box:50090->bk-in-f94.1e100.net:www (ESTABLISHED)
python  13928 torsten   22u  IPv4 617044      0t0  TCP sharokan.fritz.box:50091->bk-in-f94.1e100.net:www (ESTABLISHED)
Garbage collection result: 1700
Open sockets after garbage collection:

I think the sockets should be closed immediately when the session is unrefed. FYI, I found this when testing a local webservice by running it in CherryPy which defaults to 10 threads - after 10 requests, the web server was blocked.

Digging further, this seems due to a reference loop and is really a problem inherited from httplib: The HTTPResponse keeps a reference to the file object as _fileobject. I would argue that that reference should be dropped after the complete response was read (it should not be possible to read from the socket of a pooled connection using some old response).

Okay, and the reason this shows up is because there is a reference cycle between the request and response object.
Otherwise, when the response object is dropped, the file would be closed. Due to the ref cycle, this can only happen on the next garbage collection.

Is this the same issue as in #239?

@shazow any thoughts?

Are we talking about urllib3 request/response objects, or requests request/response objects?

urllib3 cleans up in the response.release_conn method. The connection gets None'd for the exact reason that @Bluehorn mentioned. We could also None the self._fp but that's more for custom interfaces; I don't think httplib uses it.
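For context, the urllib3-side pattern looks roughly like this; preload_content and release_conn are real urllib3 names, the rest of the snippet is just illustrative:

from urllib3 import HTTPConnectionPool

pool = HTTPConnectionPool('www.google.com')
resp = pool.urlopen('GET', '/', preload_content=False)
data = resp.read()      # read the body in full...
resp.release_conn()     # ...then hand the connection back to the pool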

We can definitely be more aggressive with closing connections in the cleanup of Requests's context manager. Right now, to manually close connections, you'd need to iterate over each pool, pop its queue of connections, and close them if they're open. If that's what we want to do, then I can definitely add a helper in urllib3 for "closing pools" which will do that in one call.
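A sketch of that manual dance for a single HTTPConnectionPool (pool.pool is a urllib3 internal and may differ between versions):

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def drain_pool(pool):
    # Pop every pooled connection off the pool's internal queue and close it.
    while True:
        try:
            conn = pool.pool.get(block=False)
        except queue.Empty:
            break
        if conn is not None:   # None entries are just free slots
            conn.close()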

Another approach would be to break the hard cycle between response and request by changing req.response to a weakref proxy before returning it from the session: http://dpaste.org/GOqzA/.
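The dpaste link may no longer resolve; the idea, with illustrative names, is simply:

import weakref

def attach_response(req, resp):
    # Make one edge of the request <-> response cycle weak, so dropping the
    # response lets plain refcounting reclaim both objects without a GC pass.
    req.response = weakref.proxy(resp)   # instead of req.response = resp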

@Bluehorn's test continues to reproduce in trunk for me. Thanks!

Here's something else interesting. If I change the test like so:

diff --git a/conntest.py b/conntest.py
index 3eac6c7..ec30d3f 100644
--- a/conntest.py
+++ b/conntest.py
@@ -2,8 +2,10 @@ import gc, os, subprocess, requests

 gc.disable()         # Just to make sure - I have seen this with gc enabled

+s = requests.session()
+
 for x in range(20):
-    requests.head("http://www.google.com/")
+    s.head("http://www.google.com/")

I get the same result, 20 leaked ESTABLISHED connections. Are we sure that keep-alive is working?

Sorry, documentation-reading fail. I wasn't consuming the content of the response objects. If I do that, things work as expected, connections get reused, and in the end only one is left alive.
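In other words, something like this (a sketch matching the behaviour just described):

import requests

s = requests.session()
for x in range(20):
    r = s.head("http://www.google.com/")
    r.content   # consuming the body releases the connection for reuse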

Here's my understanding of this issue so far:

  1. @Bluehorn and @gsakkis are right: the root problem is the reference cycle between requests.Request and requests.Response.
  2. Breaking this cycle in a crude way, e.g., removing the request member of the Response class, fixes @Bluehorn's test case.
  3. Although it might be possible to improve the way Requests uses urllib3, that can't fix the test case, because by default Requests returns the Response object with the socket still open and the body unread. Therefore, we rely on garbage collection to close the socket when clients don't read all the data. (Technically, since the test case is doing a HEAD request, the implementation of requests.head could be changed to close the socket. But the problem would still exist for the analogous test case calling requests.get.)

So here are the three viable approaches that occur to me:

  1. Replace Response.request with a weakref.proxy, as @gsakkis suggested. This breaks the following test: r = requests.get(httpbin('get')); assert r.request.sent (because r.request has been garbage-collected already, so we get "ReferenceError: weakly-referenced object no longer exists"). Anyway, this involves a smallish API breakage.
  2. Remove Response.request. This has the advantage of not using weak references. This breaks the above test and also breaks HTTPDigestAuth, which uses the response hook and expects the Response object it operates on to have a request member. We could work around this by changing the hook API, but either way, it seems like a larger API break.
  3. Enable prefetch and disable keep_alive by default, and then ensure that all the data is read and the socket is closed before we return to the client.

Random thought: it seems like keep-alive should be off by default. Outside the context of a session, we have no way to reuse the connection anyway.

Heh, we could remove Response.request in general, but add it back in for the duration of the hook invocation. This makes option 2 look more like option 1, in terms of API breakage.

https://gist.github.com/2621434
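The gist isn't embedded here, but the shape of the idea is roughly this (dispatch_hook is the existing requests hook dispatcher; the attach/detach around it is the illustrative part):

from requests.hooks import dispatch_hook

def run_response_hooks(request, response):
    # Attach the back-reference only while the 'response' hooks run, then
    # drop it so no request <-> response cycle survives.
    response.request = request
    try:
        return dispatch_hook('response', request.hooks, response)
    finally:
        del response.request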

In favor of option 3: any solution that relies on eager garbage collection won't work with PyPy.

Once upon a time prefetch was on by default; we went to significant lengths to fix that. :)

Also it's pretty rare that you'll only make 1 request to a host in a script, so having keep_alive on by default is a nice feature. Otherwise we'd be content using urllib{1,2} with a thin API wrapper.

@shazow

Can you clarify the reasons for defaulting prefetch to off instead of on?

So --- defaulting keep_alive=True in the context of a session makes sense. But in the context of the vanilla API, that is to say, a plain requests.get or requests.post, there doesn't seem to be any mechanism for reusing the keep-alive connections, because an ephemeral Session object is created and it is given an ephemeral PoolManager in init_poolmanager().

Perhaps the original reason was that it made urllib3 and requests perform better in naive benchmarks. Upon further consideration, it also made sense to fetch only the data we're actually using, especially for large responses when we only really care about headers. Someday in the future when we add ETAG-based caching, this will be a necessity. Overall, lots of upsides, no real downsides (until now).

I'm not super familiar with the Requests code in this context, but I'd imagine we'd want a reused static PoolManager instance for these things.

A static or module-level mechanism for sharing and reusing the connections would make sense. On the other hand, it would significantly worsen the current "gotcha" with leaked connections, because if a client doesn't call .content, those connections will remain reserved in the static pool forever --- not even GC will be able to reclaim them.

What if we made an explicit del call on the connection when the LRU invalidates the object?

Then we'd just leak connections up to the size of the LRU cache (10 by default?).

My vote would be to enable prefetch=True, keep keep_alive=True, and add a module-level connection pool for reusing connections. I don't think this would break a future implementation of ETAG caching; we could just check the ETAG before prefetching and hang up the phone if the data is cached.

Is an O(1) leak really a leak? ;)

I'm -0 on prefetch=True just because I feel the current settings cover the +80% use case, but I wouldn't shed any tears. Up to @kennethreitz, I'd say.

In the real world, everything is a constant factor ;-)

Module-level connection pool doesn't sound like a good idea to me. Setting prefetch=True just because that's easier :) doesn't sound good, either. Just saying.

@piotr-dobrogost if we don't prefetch, I don't see a good solution that works on all platforms, including those like PyPy that lack eager garbage collection.

For comparison, here's something that always infuriated me about urllib2:

fileobj = urllib2.urlopen(MY_URL)
try:
    text = fileobj.read()
finally:
    fileobj.close()

Seems like the 80%, batteries-included thing is to have an API that reads the whole page in a fire-and-forget way.
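That caller-side ideal, for comparison (MY_URL stands in for the same placeholder as above):

import requests

MY_URL = "http://www.google.com/"
text = requests.get(MY_URL).text   # whole body read up front; nothing left to close by hand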

A module-level connection pool? Absolutely not. That's what session is for.

That makes sense, as long as we turn off keep_alive by default. I think:

  1. Default prefetch=True
  2. Default keep_alive=False (unless we're in an actual Session)
  3. Ensure the connections are being closed

This will resolve the connection leak. Optionally, we could also break the reference cycle, as a memory-usage optimization for CPython.

All requests are made within a session.

Right. For "actual Session", read "non-ephemeral Session", i.e., "Session created by the client with the expectation of reuse."

But I guess this doesn't answer all the questions about expected behavior of Session objects. I think something like this:

  1. A non-ephemeral Session should send Connection: keep-alive by default, and retain connections in a pool
  2. On CPython, the Session object should not be involved in any reference cycles, and those connections should be disposed of when the object goes out of scope
  3. For PyPy etc., there should be a Session.dispose() and/or a context manager that manually closes the connections

It shouldn't matter if it's ephemeral or not. When the session dies, so should the connection pool (and the connections therein).

There should be no specific behavior for different platforms.

Well, we wouldn't special-case behavior inside Requests itself based on platforms. CPython would do the right thing whether or not the client called dispose() (which is what CPython users probably expect --- instantaneous scope-based GC). PyPy users would have to ensure that they called dispose().

As for why it would matter whether the session is ephemeral or not, @craigholm makes a good point on #458: a client that does not support persistent connections "must" send Connection: close. If we're creating the session just to destroy it, that's essentially not supporting persistent connections. (Technically, you can get the session as resp.request.session and reuse it, but if we get rid of the reference cycle, that would also go away.)

Session objects actually already have a context manager API, it just doesn't do anything. Let's put it to work :)
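From the caller's side, putting it to work would look like this (a sketch; the point is that leaving the block would close the pooled connections):

import requests

with requests.session() as s:
    r = s.get("http://www.google.com/")
    body = r.content
# __exit__ closes the session's connection pools here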

@slingamn keep-alive should be on, we're using it for 301s and the like.

Why are we keeping request.session again? That doesn't seem useful...

Oh, OK, that makes sense. Redirection is a good case for keeping keep_alive=True as the default.

Yeah, request.session does not seem particularly useful.

Here's my understanding of the current state of the discussion:

  1. Default keep_alive=True (the current behavior)
  2. Default prefetch=True; when prefetching, ensure that the socket gets closed before we return to the client
  3. The code that creates an ephemeral Session should create it using a context manager that ensures the connection pool is emptied by the time we return to the client. (If prefetch is on, this is unproblematic. If prefetch is off, this involves detaching the connection from the Session pool and attaching it to the Response object, which will then close it when the client calls .content. If the client disables prefetch, it is the client's responsibility to read .content, and if they don't, the connection will leak until GC, as it does now.)
  4. The default behavior of the Session object will be to maintain a persistent connection pool. On CPython, Session should not be involved in reference cycles and the pool should be destroyed on GC. On PyPy, clients can explicitly use the context manager to get the dispose behavior.
  5. (optional) break the reference cycle between Request and Response by removing Response.request (we can add it back in for the duration of the response hook).
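Concretely, step 3 would make the requests.api shortcuts look roughly like this (a sketch under that plan; later versions of requests/api.py do essentially this):

from requests import sessions

def request(method, url, **kwargs):
    # Create a throwaway session; the context manager closes its connection
    # pool before the response is handed back to the caller.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)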

Wait, item 2 as stated is wrong. We should default prefetch=True, but the prefetch implementation itself will not close the socket --- that will be done by the upstream API code that created the Session object holding the connection.

This plan does not necessarily remove Request.session; we can retain it, only by default the session will be returned to the client empty of connections. So there is no API break specific to steps 1-4. Step 5 does involve an API break.

break the reference cycle between Request and Response by removing Response.request

Why not break it at the other end? Having a reference from the response to the request is nice.

@piotr-dobrogost my take is that the client already "knows" what was in the request, i.e., has a local-variable reference to the URL they requested, or knows that they are using a proxy, etc., etc. Also, the async idiom is "create a request, submit it, expect the response to be available as a member of the request object."

I don't have super strong feelings about this, though; probably my real motivation is that removing Response.request seems to involve changing fewer lines of code ;-)

I'm not sure this is the right solution, but have you considered using weak references? http://docs.python.org/library/weakref.html

my take is that the client already "knows" what was in the request

Sure, but it doesn't change the fact that it's natural to have a pointer to the request in the response. Also, the whole history is kept in the response. As an example, the Qt framework has QNetworkRequest and QNetworkReply classes, and there's a link from the reply to the request but not the other way around.

Also, the async idiom is "create a request, submit it, expect the response to be available as a member of the request object."

I don't see it this way:

req = async.get(url)
response = async.map([req])[0]

@laurentb yeah, @gsakkis suggested this. One problem is outlined in https://github.com/kennethreitz/requests/issues/520#issuecomment-5534515 --- in the common case, the Request object will in fact be garbage-collected and the weak reference will dangle.

Request reference is pretty key. Why not just get rid of the session reference? I don't see the use of it.

AFAICT, getting rid of the session reference is orthogonal:

  1. If we get rid of the session reference, but don't do steps 1-4 (i.e., enabling prefetch), the socket will still be retained by the reference cycle between Request and Response (which will own the socket so the client code can read from it).
  2. If we do steps 1-4, the socket will not be retained, but Request and Response will still form a reference cycle (this holds even if we remove Request.session)
  3. If we do steps 1-5, the socket will not be retained and there will be no reference cycle, even if we keep Request.session

Did I get that right?

Keeping sockets open is fine, as long as they need to be kept open. We need to release the connection once the content is read and not hold on to the sessions. Seems simple enough to me.

Don't we want to protect clients from neglecting to read the content? That's what's happening in the original test case.

There are a variety of situations in which this could happen --- simple carelessness, divergent code paths, exception handling that doesn't do:

finally:
    resp.content

Reading the original test case, it's far from obvious that it should leak connections. But it will, as long as there is (a) a reference cycle between Request and Response and (b) no default prefetching. Fixing (b) will fix the problem on CPython and PyPy. Fixing (a) will fix the problem on CPython only.

BTW: I have no problem if not reading the reply keeps the connection alive, as long as this does not apply to head requests.
My original code actually does POST requests and reads the replies in full; I only reduced it to head requests so that the code is minimal. I am using requests to drive a SOAP service. Originally I did not use sessions, and introducing them worked around the problem for me (as a side effect, I need sessions in any case to optimize away HTTPS connects per request).

@Bluehorn that's fair. I think the test case points towards other gotchas that aren't fixed by doing a HEAD. For example:

  1. The client doing a POST and then reading only the status code
  2. The client doing a GET and then reading the content on a 200 but not on a 404.
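For example, the second case is code like this (hypothetical client code):

import requests

url = "http://www.google.com/does-not-exist"   # illustrative URL
r = requests.get(url)
if r.status_code == 200:
    data = r.content    # body read, connection released
# on any other status the body is never read, so the connection lingers until GC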

Essentially, some servers don't support HEAD requests. That's why prefetch isn't default.

Intuitively, a property access is not supposed to have side effects; @property should transparently abstract away the computation of a "derived member", or transparently implement lazy evaluation. Closing a socket is quite a serious side effect.

Changing a default or creating even a small API break would definitely be unfortunate, but I'm having trouble seeing another way to protect the client from socket leaks.

A user of Requests shouldn't be even remotely concerned with sockets. The solution's quite simple — don't leave references to sockets/sessions that aren't needed.

I agree completely: no one should have to think about sockets. But if we don't prefetch, then a reference to the open socket must be retained --- whether the Response object or the Session object retains it is just an implementation detail, either way the socket has to be retained so the client can read the page body from it.

So when do we clean up this reference? The possible answers seem to be:

  1. When the Response goes out of scope. This does the right thing on CPython, but not on PyPy.
  2. When the user accesses Response.content or Response.text. This is nontransparent, and the user has to be aware of the side effect.

Let's do both. There's no side effect worth consideration, because that's all handled by connectionpool.

Fixed!

I'm running into this with requests==2.0.0. The code is using a timeout; might that be causing a problem?

I believe I have confirmed this to be a problem with sessions + timeout. When using the etcd-py package, a new connection is created any time this call times out; no dangling connection when the request returns data:

stuff = etcd.watch('message', timeout=30.0)

I believe this problem is in urllib3, and will be incidentally fixed by urllib3/urllib3#252.

@Lukasa thanks for the tip. After more research I am pretty sure this is a problem with the server side code in etcd.

Thanks!

The fix from this issue was reverted here: kennethreitz@92d5703? Any reason why? I am especially talking about the changes in requests/api.py.

And can someone please explain what the guideline is for using ephemeral sessions, like most of the requests/api.py shortcuts do? A new session means a new connection pool, which really doesn't make any sense if you just want to do a requests.get(...), especially if the connections in the pool (only one, after all) are not closed afterward.

@mouadino The commit you're talking about was the rewrite that took place for V1.0.0. This was basically a total rewrite of the library, so the fix was not 'reverted', it was just no longer necessary.

I'm not entirely clear on what you're asking in the last part, but you've got a substantial misunderstanding of how Requests behaves. When we create a connection pool we don't open all of those connections at once: that would be ridiculous, and HTTP doesn't work that way anyway. We open one connection when you make your first request, and then reuse it as much as possible.

There is an inefficiency when using requests.get() and friends, but that inefficiency is acceptable to get that clean API. People who don't want the inefficiency can simply use a Session object as described in the documentation.

@Lukasa Thanks for the quick response, and you misunderstood my question, sorry for not being clear; what I meant is actually what is described in this bug report.

If I have a call like requests.get(...) in my script, this call will create a new Session object, which creates a new connection pool, and from the latter a connection will be created/borrowed to send my GET request. This connection will not be closed by the server (because of keep-alive, assuming the server understands it), so we now have one connection in the pool that is still open and ready to be reused, right?

But what if I do another requests.get(...)? This will also create another session, which means another connection pool, and like before the result will be one connection from this new pool that is still open and ready to be reused. But as you can see, they will not be reused, because in the end the session object is not reused.

So the result now is that we have two connections open to our server which will not be closed until my program stops (cleaned up by the kernel!?) or the server cleans them up (if it supports a keep-alive timeout). You can imagine what will happen if I have a lot of requests.get(...) calls (as an example).

First of all, does this make sense? If so, what is the guideline for using ephemeral sessions, given that they will not clean up after themselves unless I call session.close() (which will close the open connections in the pool)? And if the close call is needed, why isn't it part of the API shortcuts? Wouldn't it make more sense to disable connection pooling in a use case like this one?

Ah, that makes more sense. =)

The answer is no, you won't leak connections. This is because eventually the Session, the Connection Pool it contains, and all of its active connections will get garbage collected, and when they do, any open connections are closed by the garbage collector.

Of course, this doesn't matter because the server will close them well beforehand. Most HTTP servers don't keep the connection alive for more than a few seconds because it's a severe resource overhead for that server.

The guide for using ephemeral Sessions is simple: use them if you don't care about efficiency. If you use them you're accepting that you don't care whether your resources are cleaned up right away or not. If you believe that cleaning up after yourself is important then you should use an explicit Session. =)
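In other words, when deterministic cleanup matters:

import requests

s = requests.Session()
try:
    r = s.get("http://example.com/")
    data = r.content
finally:
    s.close()   # closes the pooled connections right away instead of waiting for GC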

Aha, because underneath they are actually sockets, which are closed automatically when they are garbage collected. But relying on the GC for my use case is not really good, and worse, I am dealing with an HTTP server that doesn't clean up open connections (or maybe I haven't figured out yet how to make it do so). But I can see that this is not something that requests can fix :)

Thanks again for your response :)

What about a streaming requests session? How do we ensure that is closed?