sanic-org / sanic

Accelerate your web app development | Build fast. Run fast.

Home Page: https://sanic.dev

Sanic drops part of HTTP response data

xbeastx opened this issue

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

The bug is that Sanic closes the connection before all of the response data has been transferred (see the example below).

From my point of view this is a very critical bug. For a reason unknown to me, it reproduces only when the Sanic server is started inside a Docker container (i.e. it is perhaps somehow related to the network mode or to delays introduced by Docker).

After bisecting the commits, we found that the bug was introduced in 1310684 and does not reproduce on d1fc867.

Code snippet

So we have a really simple server returning some JSON, demo.py:

from sanic import Sanic
from sanic.response import json

app = Sanic("Demo")

@app.route("/")
async def handle(request):
    return json({'some_val': 'A'*743000})

if __name__ == "__main__":
    app.run(host="0.0.0.0")

Running in a Docker container:

FROM python:3.11

COPY demo.py demo.py
RUN pip install sanic==23.12.1
ENTRYPOINT python3 demo.py

$ docker build -t demo .
$ docker run -p 127.0.0.1:8000:8000 --rm -it demo

and the client.py:

import socket
import sys


def getsock(ip, port, req):
    s = socket.socket(2, 1, 6)  # AF_INET, SOCK_STREAM, IPPROTO_TCP
    s.connect((ip, port))
    s.send(req)
    return s


REQ = (
    b'GET / HTTP/1.1\r\n'
    b'User-Agent: Mozilla/5.0\r\n'
    b'Connection: close\r\n'
    b'\r\n'
)

s = getsock('127.0.0.1', 8000, REQ)

headers = s.recv(94).decode()  # 94 bytes = the exact length of the response headers

print(f'Headers: {headers!r}')

total = 0
by_len = int(sys.argv[1])
while (data := s.recv(by_len)):
    total += len(data)

print(f'Total length: {total}')

s.close()

I was not able to reproduce it with curl; maybe it reads too fast. But in a real setup it reproduces with an Nginx proxy in front of Sanic as the upstream.

So now, if you run it a hundred times, you will get something like this:

$ for i in {1..100}; do python3 client.py 4096; done
...
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 652954
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 650058
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 652954
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
...

So the length should be 743015, but Sanic returns only 586346-652954 bytes.

client.py must be run outside the Docker container, e.g. on the host. If you run it inside Docker, the issue will not reproduce.

Expected Behavior

Return all the data from response.

How do you run Sanic?

Sanic CLI

Operating System

Linux

Sanic Version

v23.12.1

Additional context

No response

I can confirm this problem, and it is a very critical bug.

I believe the issue lies within the close() method. Instead of directly calling self.abort(), it should be replaced with:

                timeout = self.app.config.GRACEFUL_SHUTDOWN_TIMEOUT
                self.loop.call_later(timeout, self.abort)

By making this adjustment, the connection functions correctly. Otherwise, nginx reports the error "upstream prematurely closed connection while reading upstream".

I encountered the same issue. When utilizing the Python packages requests or aiohttp to send requests to a Sanic server, everything works fine. However, when using Nginx as a reverse proxy, Nginx reports an error: 'upstream prematurely closed connection while reading upstream.' Downgrading to sanic 23.6.0 resolves this error.

@ahopkins any chance you can confirm that the fix from @makeev is all that is needed?

If so, can we please do a point release for the 23.12 LTS branch?


This code in sanic/server/protocols/http_protocol.py:242 appears to have been introduced in #2831. Ping @gluhar2006, can you have a look? This asyncio transport code is very complicated and easily broken by small details; also, the docstring of that function is not quite understandable.

The graceful shutdown timeout normally has a different meaning in Sanic (roughly speaking: how long to wait for handlers to finish), not that of just closing a TCP connection. This needs a deeper look at what exactly is being fixed here and what the proper approach is, plus tests for those cases.
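For readers following the thread, GRACEFUL_SHUTDOWN_TIMEOUT is an application-level config value; a minimal illustration of how it is normally set (the 15-second value here is just an example, not a recommendation):

from sanic import Sanic

app = Sanic("Demo")

# How long a stopping server waits for in-flight handlers to finish,
# not how long an individual TCP connection is given to close.
app.config.GRACEFUL_SHUTDOWN_TIMEOUT = 15.0  # seconds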

Possibly related also to #2531 (Nginx failures).


It appears that the incorrect behavior of my changes was due to my misunderstanding of the meaning of the graceful shutdown timeout. I tried to look deeper, but so far without success. The main problem here is that I use Keep Alive 99% of the time, so I haven't encountered this problem when testing my changes.
I'll keep researching and come back if I find something.

As for now, I still cannot reproduce such problems with using keep-alive.


@Tronic I think they are just using the shutdown timer here as a further delay of the response timeout.



@robd003 No. That is an incorrect use of graceful timeout. But also, I am not sure why we would want to further delay abort. Needs investigation.


@xbeastx Shouldn't this just be easily solvable by increasing the response timeout?


@ahopkins I just came across this issue upgrading from 23.6.0 to 23.12.1. My response timeout is configured to 60 seconds, which is not reached before the response data is truncated (in my case always at 109kb), so to your question about increasing response_timeout: I don't think so.

There's also a thread on Discord which seems to be the same issue: https://discord.com/channels/812221182594121728/1209575840203939880

Any workarounds for this problem?

Unfortunately, I haven't found an explanation yet that fits my understanding of this case. Maybe it's time to revert the changes.

Working on resolving in upcoming release and will port back to the LTS.

Does anyone have some good minimal reproducible cases to add to the test suite?

So, to recap, I was able to reproduce as follows:

# server.py
from sanic import Sanic, json

app = Sanic("TEST")


@app.route("")
async def handle(_):
    return json({"A": "A" * 7430000})
# nginx.conf
events {}

http {
    server {
        listen 80;

        location / {
            proxy_pass http://<MY LOCAL IP>:9999;
            proxy_set_header Host $host;

            proxy_read_timeout 1s;
            proxy_connect_timeout 1s;
            proxy_send_timeout 1s;
        }
    }
}

# docker-compose.yml
services:
  nginx:
    image: nginx:latest
    container_name: nginx-proxy
    ports:
      - "8888:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    networks:
      - proxy-network

networks:
  proxy-network:
    driver: bridge

# client.py
import socket
import sys


def getsock(ip, port, req):
    s = socket.socket(2, 1, 6)
    s.connect((ip, port))
    s.send(req)
    return s


REQ = (
    b"GET / HTTP/1.1\r\n"
    b"User-Agent: Mozilla/5.0\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)

s = getsock("127.0.0.1", 9999, REQ)

headers = s.recv(94).decode()


total = 0
by_len = int(sys.argv[1])
while data := s.recv(by_len):
    total += len(data)

expected = 7430020
if total != expected:
    print(f"Expected {expected}, got {total}")

s.close()

for i in {1..100}; do python3 client.py 4096; done

I saw a few misses. After making the change in #2966, I saw none, even when bumping the workload much higher. I will continue to verify the solution, but @makeev's approach (excluding the graceful shutdown) does appear to solve the proxy problem. I am not convinced this is the complete story, though; it may just be masking it.

@ahopkins @Tronic It's been a while since I've played with the Sanic codebase, but this bug caught my interest.

From what I can see, the issue is relatively straightforward.

Before the release of v23.12, when an HttpProtocol instance was closed, it used the close() implementation in the parent class SanicProtocol, shown here:

def close(self, timeout: Optional[float] = None):
    """
    Attempt close the connection.
    """
    # Cause a call to connection_lost where further cleanup occurs
    if self.transport:
        self.transport.close()
        if timeout is None:
            timeout = self.app.config.GRACEFUL_SHUTDOWN_TIMEOUT
        self.loop.call_later(timeout, self.abort)

As you can see, it schedules abort() with call_later, using the GRACEFUL_SHUTDOWN_TIMEOUT. (Not sure if that is the correct use of GRACEFUL_SHUTDOWN_TIMEOUT, but it has worked like that for a long time without issue.)

In PR #2831, a new close() implementation was added to HttpProtocol that behaves differently from the one in SanicProtocol. That was released in Sanic v23.12, and the abort() issues started popping up.
https://github.com/gluhar2006/sanic/blob/6771fac8375f98fb2d26aa7f6730cd5219de7c73/sanic/server/protocols/http_protocol.py#L231-L242
I can see three bugs in that replacement implementation:

  1. If the close() fn is called with a manually-specified timeout argument, it short-circuits to calling the super().close() function and skips cancelling the instance's scheduled tasks, defeating the point of the PR (reducing memory usage).
  2. If self._callback_check_timeouts is None for any reason (e.g. the connection is made but the connect_task setup fails, so the connection is live but _callback_check_timeouts is never set), then the close() routine completely skips the transport close() and abort() steps. That leaves dangling connections open, which is bad.
  3. Unlike the parent protocol's close implementation, there is no call_later on abort(). That causes the transport to be killed before it has finished closing, before the remaining data is flushed. This is what is directly causing the issues described in this thread. (A paraphrased sketch of the offending method follows below.)
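For reference, here is a paraphrased sketch of the replacement close() being described, reconstructed from the three points above rather than copied verbatim from the source, so treat the exact lines as approximate:

def close(self, timeout: Optional[float] = None):
    """
    Requires to prevent checking timeouts for closed connections
    """
    if timeout is not None:
        # Bug 1: an explicit timeout short-circuits straight to the parent close()
        # and never cancels the scheduled timeout checks.
        super().close(timeout=timeout)
        return
    if self._callback_check_timeouts:
        self._callback_check_timeouts.cancel()
    else:
        # Bug 2: if the callback was never set, the transport close() and abort()
        # below are skipped entirely, leaving the connection dangling.
        return
    if self.transport:
        self.transport.close()
        # Bug 3: abort() fires immediately instead of via loop.call_later(),
        # killing the transport before buffered response data has been flushed.
        self.abort()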

That is why @makeev suggested using the GRACEFUL_SHUTDOWN_TIMEOUT in #2921 (comment): because that is exactly what the parent protocol (and previously the default behavior) does.

I think the issue here is that there is a misunderstanding of exactly what transport.abort() is for.
It force-kills the TCP stream and is supposed to be a fallback only. transport.close() is supposed to gracefully close the TCP connection, but that can take time: it needs to flush all the pending data first (and in Sanic, when using streaming responses, it seems there can often be a lot of pending data), and it also needs to call the connection_lost() callbacks that do cleanup operations, all before finally cleanly shutting down the TCP connection.
That is why calling transport.abort() immediately after calling transport.close() can cut off the rest of the response.
See the implementation of transport.close() and transport.abort() in the UVLoop codebase here.
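To see the mechanism outside of Sanic, here is a self-contained sketch using plain asyncio (no Sanic or uvloop required): one run writes a large payload and only closes gracefully, the other aborts right after closing. Whether the abort actually truncates on a given machine depends on socket buffer sizes, so treat it as an illustration of the mechanism rather than a guaranteed reproduction.

import asyncio

PAYLOAD = b"A" * 5_000_000  # larger than a typical socket send buffer


class OneShot(asyncio.Protocol):
    def __init__(self, abort_immediately: bool):
        self.abort_immediately = abort_immediately

    def connection_made(self, transport):
        transport.write(PAYLOAD)  # may only partially hit the wire; the rest stays buffered
        transport.close()         # graceful: flush the buffer, then close
        if self.abort_immediately:
            transport.abort()     # force-kill: discards whatever is still buffered


async def run(abort_immediately: bool) -> int:
    loop = asyncio.get_running_loop()
    server = await loop.create_server(
        lambda: OneShot(abort_immediately), "127.0.0.1", 0
    )
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    try:
        data = await reader.read(-1)  # read until EOF (or until the reset cuts us off)
    except ConnectionResetError:
        data = b""
    writer.close()
    server.close()
    await server.wait_closed()
    return len(data)


async def main():
    print("close() only      :", await run(False), "bytes received")
    print("close() + abort() :", await run(True), "bytes received")


asyncio.run(main())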

Calling abort() with loop.call_later allows the transport to cleanly close itself before it is force-killed. However, that is not ideal either: abort() should only be called if the close() has not completed after a certain period of time.

I don't know if the use of GRACEFUL_SHUTDOWN_TIMEOUT in the superclass SanicProtocol close() implementation is correct either, but it works. (EDIT: You can blame me!) I suppose the ideal solution would be to have a new GRACEFUL_TCP_CLOSE_TIMEOUT value to use instead (default about 1 second?), and if the transport has already closed successfully before that timeout expires, then there is no need to call abort() at all.
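A rough sketch of that idea. GRACEFUL_TCP_CLOSE_TIMEOUT and _abort_if_still_open are hypothetical names, the other attributes follow the protocol code quoted in this thread, and it assumes connection_lost() clears self.transport once the transport has fully gone away (if it does not, a flag set in connection_lost() would serve the same purpose):

def close(self, timeout: Optional[float] = None):
    """Stop timeout checks, close gracefully, and abort only if the close stalls."""
    if self._callback_check_timeouts:
        self._callback_check_timeouts.cancel()
    if self.transport:
        self.transport.close()
        if timeout is None:
            # Hypothetical new setting; ~1 second default as suggested above.
            timeout = getattr(self.app.config, "GRACEFUL_TCP_CLOSE_TIMEOUT", 1.0)
        self.loop.call_later(timeout, self._abort_if_still_open)

def _abort_if_still_open(self):
    # If connection_lost() has already run, the graceful close finished in time
    # and there is nothing left to abort.
    if self.transport:
        self.transport.abort()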

@ahopkins Can you please try your reproduction with this version of HttpProtocol.close()? I haven't tested it, but it fixes the three issues outlined above and always calls the superclass close(), which ensures it adheres to the pre-v23.12 behavior.

def close(self, timeout: Optional[float] = None):
    """
    Requires to prevent checking timeouts for closed connections
    """
    if self._callback_check_timeouts:
        self._callback_check_timeouts.cancel()
    super().close(timeout=timeout)

Looking further at the implementation of transport.close() in UVLoop
https://github.com/MagicStack/uvloop/blob/6c770dc3fbdd281d15c2ad46588c139696f9269c/uvloop/handles/basetransport.pyx#L252-L262

I think I see another contributor to this issue. Notice that transport.close() closes the read stream, but if there is still data in the transport's write buffer, it does not close the connection. That is why Sanic has always relied on a fallback that subsequently calls transport.abort() to kill the transport if close() left it open.

Perhaps a better solution would be to check get_write_buffer_size() on close; if there is still data in the write buffer, do not attempt transport.close() yet and instead force a flush somehow.
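One way that check might be wired up, as a standalone helper rather than actual Sanic code: instead of aborting on a fixed timer, poll the transport's write buffer and only force-close once it has drained or a deadline passes. In a real patch the polling would also need to stop once connection_lost() fires.

import asyncio


def abort_when_drained(
    transport: asyncio.WriteTransport,
    loop: asyncio.AbstractEventLoop,
    deadline: float,
    poll_interval: float = 0.05,
) -> None:
    """Force-close the transport only after pending writes drain or a deadline expires."""
    if transport.is_closing() and transport.get_write_buffer_size() == 0:
        # The graceful close has flushed everything; no abort needed.
        return
    if loop.time() >= deadline:
        transport.abort()
        return
    loop.call_later(poll_interval, abort_when_drained, transport, loop, deadline, poll_interval)


# Example usage after transport.close():
#     abort_when_drained(transport, loop, deadline=loop.time() + 1.0)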

Just looking through the flow-control code in the UVLoop transport stream code, I can see how there may be some situations where there is still data in the transport send buffer after the final write() operation.
E.g., if a very large chunk of data is passed to write() and the transport tries to send it all but cannot send the whole chunk, it keeps the remainder in its buffer. If the buffer is above the high-water mark, it pauses the protocol, so Sanic cannot send more data. It runs an async send with the rest of its buffer and calls transport._on_write() when the async transfer is done, which unpauses the transport.
But two situations arise:

  1. Sanic still has data waiting to be sent that it cannot send because write-buffer flow control has paused the protocol, and the transport is killed while it is waiting, so Sanic cannot send the rest of the payload.
  2. Sanic thinks the initial write was successful and closes the transport, then subsequently aborts it, so uv_write() never completes its asynchronous write of the rest of the buffer.

I notice that the transport._on_write() callback in UVLoop checks whether the transport is in a closing state; if it is, it completes the second half of the close() routine that was skipped above, namely it schedules the call to the protocol's connection_lost() callback and then actually closes the connection.

I suppose the issue here boils down to two false assumptions:

  1. When we write to the TCP transport using transport.write(), it seems like a synchronous operation that sends the whole buffer. In reality it can send only part of the buffer, in which case it becomes an asynchronous operation. The only way to tell is to check get_write_buffer_size(), which can even tell you whether there is still data in the async libuv stream buffer.
  2. When we close the TCP transport using transport.close(), it seems like a synchronous operation, and it usually is, but if there is a pending async send then close becomes an asynchronous operation, and the only way to tell is by watching for the connection_lost() callback to be triggered. (A sketch of treating close this way follows below.)
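A small sketch of what taking assumption 2 seriously could look like, using plain asyncio classes rather than Sanic's: record the connection_lost() signal and only fall back to abort() if the graceful close has not completed within a grace period.

import asyncio
from typing import Optional


class GracefulCloseProtocol(asyncio.Protocol):
    def __init__(self) -> None:
        self.transport = None
        self.closed = asyncio.Event()  # set once the transport has really gone away

    def connection_made(self, transport) -> None:
        self.transport = transport

    def connection_lost(self, exc: Optional[Exception]) -> None:
        self.closed.set()  # the only reliable signal that close() has completed

    async def close_gracefully(self, grace: float = 1.0) -> None:
        if self.transport is None:
            return
        self.transport.close()  # starts an asynchronous flush-then-close
        try:
            await asyncio.wait_for(self.closed.wait(), timeout=grace)
        except asyncio.TimeoutError:
            self.transport.abort()  # fall back only if the graceful close stalled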

I still don't know why this only occurs with NGINX. There must be something special about the NGINX TCP implementation that causes UVLoop to be unable to send the whole buffer on each write, which then triggers the uvloop async write operation. It's probably NGINX's flow control fighting with UVLoop's flow control, or perhaps NGINX does it deliberately to force the sender into async writes to reduce blocking writes (and blocking reads on the NGINX side).

@ahopkins I'm using your docker-compose, nginx, client, and server snippets above to try to reproduce this for the purposes of creating a real fix.
When pointing the client at the nginx server, I'm getting the error:

request without "Host" header while reading client request headers

The client is not sending a request Host header, so nginx cannot deal with it.

You did not see this in your testing because of another error: in your client connector you are connecting to port 9999, which is your local Sanic server, not the nginx proxy. So I suspect you were not actually running the reproduction test through nginx.

Edit: After changing the reproduction code to hit the NGINX proxy, I'm seeing this issue reproduce 100% of the time (no need to hammer it with a for loop; it's every 1 out of 1 attempts). I'm seeing the same as @xbeastx: when the data size is above 700kb, Sanic always returns 586346-652954 bytes.