fffonion / lua-resty-sniproxy

SNI Proxy based on stream-lua-nginx-module

Home Page: https://yooooo.us/2016/lua-resty-sniproxy


don't check size for streaming

fcicq opened this issue · comments

commented

The size checking is not actually required as long as no changes are applied to the data in _upl() and _dwn().

Currently I don't have a working environment for OpenResty. I guess the core loop should look something like this:

while not self.exit_flag do
  -- cosocket receive() returns data, err, partial
  local buf, _, partial = ssock:receive(4096)
  if not buf then
    -- flush whatever partial data arrived before the error, then stop
    if partial then
      rsock:send(partial)
    end
    break
  end
  local bytes = rsock:send(buf)
  if not bytes then
    break
  end
end

commented

And I think self.exit_flag is unnecessary: "while true do" would behave the same and still be able to break out of the loop on error.

There will be problems if the client reuses the same connection to issue multiple requests and the requests are not continuous. Imagine the following scenario between a client and our server (the _upl thread):

|-------------------4096 bytes----------------------|-------------------4096 bytes----------------------|
|----req1----|----req2----|----req3----|--req4--|-----req5-----|----------------------------req6------------------|

Suppose the client finishes several HTTP requests ending with req5 and then stops sending data to our server, but wants to reuse the connection later, so the TCP connection is not closed. In this situation, our server will not read req5 and proxy it to the remote until the client sends req6, which means req5 cannot be completed before the client sends req6. The situation is similar for the responses between the upstream server and our server (the _dwn thread).

This problem does not exist if the connection is only used once or the requests are issued continuously.

The self.exit_flag part makes sense and I'll leave this issue open to discuss the buffer size part.

commented

You are misunderstanding the semantics of recv().
As man 2 recv: "The receive calls normally return any data available, up to the requested amount, rather than waiting for receipt of the full amount requested."

I'm not sure how it would be possible to reuse the connection in forwarding mode. Please point me to the relevant RFC(s) if there are any.

The receive() API of the ngx cosocket is built on top of nginx's own event loop and does not behave like the standard recv() call.

According to the documentation, "This method will not return until it reads exactly this size of data or an error occurs."

I've set up an echo server with the following code:

server {
    listen 18444;
    content_by_lua_block {
        local s = ngx.req.socket()
        local b = s:receive(10)  -- will not return until exactly 10 bytes are read
        s:send(b)
    }
}

You can try it with telnet 172.245.255.135 18444 and type 10 characters. The server will not respond until you type the 10th byte.
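
To see the same thing from the client side, here is a minimal sketch; it assumes the echo server above is running, that the resty command-line utility is available, and the sleep is only there to make the stall visible:

local sock = ngx.socket.tcp()
assert(sock:connect("172.245.255.135", 18444))  -- the address from the example above
sock:send("hello")        -- only 5 bytes: the server is still blocked inside receive(10)
ngx.sleep(2)              -- nothing is echoed back during this pause
sock:send("world")        -- the 10th byte arrives and receive(10) finally returns
print(sock:receive(10))   -- prints "helloworld"
sock:close()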

commented

@fffonion It is surprising that an echo server cannot be implemented in an easy way; the current receive() behavior will largely limit the usefulness of stream-lua-nginx-module. :(

I'm sorry I didn't notice that in the first place. So it becomes clear that this issue depends on an API change in the upstream project.

commented

a "partial" argument can be added to the receive() call like the read_bytes() in tornado.iostream module.

The current receive() implementation waits for (bytes >= (ssize_t) u->rest) in ngx_stream_lua_socket_read_chunk.

            u->input_filter = ngx_stream_lua_socket_read_chunk;
            u->length = (size_t) bytes;
            u->rest = u->length;

A new input_filter function that does the following should work; u->rest should be 0:

        u->buf_in->buf->last += bytes;  /* take everything that has arrived so far */
        b->pos += bytes;                /* consume it from the read buffer */
        return NGX_OK;                  /* return immediately instead of waiting for u->rest */

commented

Please test the patch for stream-lua-nginx-module so receive('*p') can be introduced.
https://gist.github.com/fcicq/82e1c6d0c85cbc2d3f8e9f1523bfd1d1

I don't think that follows the principle of writing coroutines. I suppose an echo server would be implemented like this:

while true do
    local buf = sock:receive("*p")
    if buf then
        -- do something or yield some data
    end
end

The code above will block an nginx worker process if the client is not sending any data. And ngx_lua has no callback mechanism like tornado's (and shouldn't have one).

commented

It should block in the same way as line reading (*l) does, but return data as soon as it arrives. Do you mean *l will block the process?

I misread your code before; that makes sense to me.
It would be nice if you sent a PR to the upstream 😄

commented

Hmm, that patch makes it possible to write echo servers and other kinds of proxy servers in OpenResty.
I wonder why other people haven't used OpenResty in this way.

Actually the same API issue originates from luasocket (buffer.c) and was later ported into the OpenResty project.
The same method can be applied to lua-nginx-module with a little modification as well.
A careful discussion with the luasocket author(s) may be needed to get the change merged; OpenResty should then follow.

Anyway, thanks for your project, which revealed these issues in the upstream projects, so some future contributions can be expected.
Please do take advantage of the patch to build more great projects.

Just to share some results I found after running several tests:
The bottleneck of this proxy at present, or to be specific of the stream forwarding part, is not in len = lshift(byte(hd, 4), 8) + byte(hd, 5) but in Lua buffer copying.
See the flamegraph: flame.zip
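
For context, here is a minimal sketch of the receive(len) path being profiled, assuming ssock/rsock cosockets as in the earlier loop; the actual variable names in the project may differ:

local bit = require "bit"
local byte, lshift = string.byte, bit.lshift

-- read the 5-byte TLS record header: type (1 byte), version (2), length (2)
local hd = ssock:receive(5)
if hd then
    local len = lshift(byte(hd, 4), 8) + byte(hd, 5)  -- 16-bit record length, at most ~16k
    local payload = ssock:receive(len)                -- blocks until the whole record has arrived
    if payload then
        rsock:send(hd)
        rsock:send(payload)
    end
end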

This can be confirmed by comparing the CPU time of the current receive(len) method against the receive("*p") method (8.35s versus 9.3s with the default lua_socket_buffer_size of 4k), since receive("*p") will not read more than 4k of data at a time. In my test, a download from dl-ssl.google.com had a maximum TLS packet of 16408 bytes, roughly 16k of payload; by default a TLS payload has a maximum length of 16k. After changing lua_socket_buffer_size to 16k or 32k, we saw a decrease in the CPU time of the receive("*p") approach.

This leads to a problem: if the server or client has a slow connection to our proxy, receive("*p") will be outperformed by receive(len), since more time is wasted copying smaller buffers. What's the general solution for handling such situations?

commented

Hmm... the ultimate method would be a native forwarding loop, like ngx_stream_proxy_process in ngx_stream_proxy_module.c,
but I'm not sure whether the socket can be detached from the Lua object or not.

I can accept the performance result, but the semantics are different.
The latency would be quite different with the "read size, then forward" method: a larger length means a larger buffer, and if the sender is slow, the chunk will not be forwarded until it has been fully received.
And I'm happy with the possibility that encryption or other transforms can be applied later (without knowing the length first, i.e. streaming).

A native module will definitely be a better solution.

While keeping the implementation in Lua land, whether to use stream forwarding or read-length forwarding seems to be a tradeoff 🌚.
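
For comparison, a rough sketch of the streaming variant, assuming the patched receive("*p") from the gist above follows the usual cosocket convention of returning data, err, partial; the socket names are illustrative:

-- each read returns at most lua_socket_buffer_size bytes (4k by default, tunable as noted above)
while true do
    local buf, err, partial = ssock:receive("*p")
    local data = buf or partial
    if data and #data > 0 then
        local bytes = rsock:send(data)
        if not bytes then
            break  -- the other side is gone, stop forwarding
        end
    end
    if not buf then
        break      -- "closed", timeout or another read error
    end
end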

commented

I think that only reflects the difference in the number of string copy (or build?) operations.

Yes, and that leads to a dilemma between "more performance" and "less system resources". At the current state I think both implementations are acceptable :smile:

commented

I don't have enough time to implement a forwarder that bypasses the Lua string.

Well, I'm not asking for anything; it's just a discussion.
I'll close this issue if you are OK with it.

commented

Please leave this open. The CPU usage issue comes from the luaL_addlstring implementation in LuaJIT (lib_aux.c): it calls luaL_addchar once for every character of the string.

The luaL_Buffer struct in LuaJIT can only store strings with length up to LUAL_BUFFERSIZE, while Lua 5.3 can use luaL_prepbuffsize to prepare the buffer, so its luaL_addlstring uses memcpy directly.
Hopefully there will be a way to reduce the running complexity of luaL_addlstring.

commented

As defined in src/lauxlib.h, we can memcpy up to the lesser of the remaining length and bufffree(B), then do an emptybuffer(B) && adjuststack(B).
It still allocates once every LUAL_BUFFERSIZE bytes, but I do think it will be faster thanks to the memcpy() usage.

#define luaL_addchar(B,c) \
  ((void)((B)->p < ((B)->buffer+LUAL_BUFFERSIZE) || luaL_prepbuffer(B)), \
   (*(B)->p++ = (char)(c)))

commented

Please attach a new flamegraph for your workload. I'm also considering a new method similar to select(); it should match the raw forwarding speed if implemented.

Thanks for your work, and it would be even more helpful if you could send a PR to LuaJIT. 😄