zonyitoo / coio-rs

Coroutine I/O for Rust

Connections might be silently not served

redbaron opened this issue · comments

On current master (7d12511), the simple echo server might "forget" to serve responses to certain clients; no errors are reported on the server side, and the TCP connection is not dropped either.
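
For reference, the server under test is essentially the README-style coio echo server. A minimal sketch (names such as Scheduler::with_workers, coio::spawn, and the accept() signature follow the README example of that era and may differ from the actual benchmark binary):

extern crate coio;

use std::io::{Read, Write};

use coio::net::TcpListener;
use coio::{spawn, Scheduler};

fn main() {
    // Two worker threads, matching the -t 2 flag used below.
    Scheduler::new().with_workers(2).run(|| {
        let listener = TcpListener::bind("127.0.0.1:3000").unwrap();
        loop {
            let (mut stream, _addr) = listener.accept().unwrap();
            // One coroutine per connection: echo until EOF or error.
            spawn(move || {
                let mut buf = [0u8; 4096];
                loop {
                    match stream.read(&mut buf) {
                        Ok(0) => break, // peer closed the connection
                        Ok(n) => stream.write_all(&buf[..n]).unwrap(),
                        Err(e) => {
                            println!("read error: {:?}", e);
                            break;
                        }
                    }
                }
            });
        }
    }).unwrap();
}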

This is very racy and you might need to play with the params to trigger it more or less reliably on your machine. The params I am using on my Core i3 M370 (2 cores + 2 HT):

server:

cd benchmark; cargo run --bin coio-tcp-echo-server --release -- --bind 127.0.0.1:3000 -t 2

client:

cd benchmark; GOMAXPROCS=1 go run tcp_bench.go -c 3 -t 10 -a "127.0.0.1:3000" -l 4096 -d 10s

The new -d param specifies the read/write timeout and was added in #58.

Output:

2016/07/10 14:01:52 read tcp 127.0.0.1:40988->127.0.0.1:3000: i/o timeout
2016/07/10 14:01:52 read tcp 127.0.0.1:40990->127.0.0.1:3000: i/o timeout
Benchmarking: 127.0.0.1:3000
3 clients, running 4096 bytes, 30 sec.

Speed: 18879 request/sec, 18879 response/sec
Requests: 566395
Responses: 566393

It is not happening on every run, but I found that it happens more often when the client opens fewer connections. (Note 566395 requests vs 566393 responses above: the two missing responses correspond to the two connections that timed out.)

UPD1: it is not a performance issue. Set the timeout to "60s" and you'll see that the test completes and no more active communication happens, yet a few connections remain waiting to read data back indefinitely.

commented

Hmm, bug confirmed. But when I tried to run it in debug mode, I found it works pretty well:

Benchmarking: 127.0.0.1:3000
3 clients, running 4096 bytes, 10 sec.

Speed: 11870 request/sec, 11870 response/sec
Requests: 118703
Responses: 118703

and if I run it with --release, then all of them end up timed out:

2016/07/10 23:05:51 read tcp 127.0.0.1:49317->127.0.0.1:3000: i/o timeout
2016/07/10 23:05:51 read tcp 127.0.0.1:49318->127.0.0.1:3000: i/o timeout
2016/07/10 23:05:51 read tcp 127.0.0.1:49316->127.0.0.1:3000: i/o timeout
Benchmarking: 127.0.0.1:3000
3 clients, running 4096 bytes, 10 sec.

Speed: 4400 request/sec, 4399 response/sec
Requests: 44000
Responses: 43997

Update: Oh! It will also happen in debug mode!

commented

Interestingly... The problem is: read() returns WouldBlock, and then Mio does not trigger any events until tcp_bench hits its timeouts:

## Skips...
TRACE:coio::net: GenericEvented(Token(1)): read() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(1)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): read() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(1)): read() => WouldBlock
TRACE:coio::net: GenericEvented(Token(1)): wait(Readable)     <--- Client 1 starts to be blocked
TRACE:coio::net: GenericEvented(Token(2)): read() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(3)): read() => Ok(4096)
## Skips...
TRACE:coio::net: GenericEvented(Token(3)): read() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(3)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(3)): read() => WouldBlock
TRACE:coio::net: GenericEvented(Token(3)): wait(Readable) <--- Client 3 starts to be blocked
TRACE:coio::net: GenericEvented(Token(2)): read() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): read() => WouldBlock
## Skips ....
TRACE:coio::net: GenericEvented(Token(2)): read() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): read() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): read() => WouldBlock
TRACE:coio::net: GenericEvented(Token(2)): wait(Readable) <--- Client 2 starts to be blocked

###  Nothing happens here, and then tcp_bench reports its timeouts

TRACE:coio::net: GenericEvented(Token(1)): read() => Ok(4096)  <-- Client 1 is woken up
TRACE:coio::net: GenericEvented(Token(1)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(1)): read() => Ok(0)
TRACE:coio::net: GenericEvented(Token(3)): read() => Ok(4096) <-- Client 3 is woken up
TRACE:coio::net: GenericEvented(Token(3)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(3)): read() => Ok(0)
TRACE:coio::net: GenericEvented(Token(2)): read() => Ok(4096) <-- Client 2 is woken up
TRACE:coio::net: GenericEvented(Token(2)): write() => Ok(4096)
TRACE:coio::net: GenericEvented(Token(2)): read() => Ok(0)

It could be either:

  1. Mio didn't trigger the event when the socket became readable (not likely)
  2. The coroutine wasn't woken up when the socket became readable

Note the final read() => Ok(0) lines: they are EOF, i.e. tcp_bench had already timed out and closed these connections by the time the coroutines finally woke up.
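For reference, the retry loop behind those trace lines has roughly this shape (a simplified sketch with illustrative names, not coio's actual internals):

use std::io::{ErrorKind, Read};
use std::net::TcpStream;

// Nonblocking read with parking: on WouldBlock the coroutine registers
// interest and parks until the event loop reports the socket readable.
// That parking step is the wait(Readable) line in the trace; if the
// readiness notification is dropped, the coroutine sleeps forever.
fn read_or_park(stream: &mut TcpStream,
                buf: &mut [u8],
                wait_readable: impl Fn()) -> std::io::Result<usize> {
    loop {
        match stream.read(buf) {
            Err(ref e) if e.kind() == ErrorKind::WouldBlock => wait_readable(),
            other => return other,
        }
    }
}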

Tried with https://github.com/carllerche/mio.git#c3069801; the problem still exists.

Although, after making this change:

diff --git a/benchmarks/Cargo.toml b/benchmarks/Cargo.toml
index cf65c7f..20006bf 100644
--- a/benchmarks/Cargo.toml
+++ b/benchmarks/Cargo.toml
@@ -7,7 +7,7 @@ authors = ["Y. T. Chung <zonyitoo@gmail.com>"]
 bytes = "0.3"
 clap = "2.1"
 #hyper = "*"
-mio = "0.5"
+mio = { git =  "https://github.com/carllerche/mio.git" }
 mioco = "0.7"

I see multiple versions of many crates in the output of cargo run:

   Compiling bitflags v0.4.0
   Compiling bitflags v0.3.3
   Compiling bitflags v0.5.0
   Compiling nix v0.5.0-pre (https://github.com/carllerche/nix-rust?rev=c4257f8a76#c4257f8a)
   Compiling nix v0.6.0
   Compiling nix v0.5.1
   Compiling slab v0.2.0
   Compiling slab v0.1.3
   Compiling slab v0.1.3 (https://github.com/carllerche/slab.git?rev=44f9f41a1680e69db7d370d1912898fb0f90b1f8#44f9f41a)
   Compiling mio v0.5.0 (https://github.com/carllerche/mio.git#c3069801)
   Compiling mio v0.5.1

So I am not sure whether mio master was used at all :)
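
One way to force the whole graph onto a single mio would be to override the crates.io source from the workspace root. With modern Cargo that is a [patch] section (a sketch; at the time of this thread, Cargo's [replace] mechanism served the same purpose):

# Workspace-root Cargo.toml: redirect every mio dependency in the
# graph to the git checkout, not just the benchmark's own copy.
[patch.crates-io]
mio = { git = "https://github.com/carllerche/mio.git" }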

commented

I found a bug in the event notifier (Condvar): it drops the event notification if no coroutine is waiting on it. Please try again; this bug should be fixed.
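
In other words, it was a classic lost wakeup. For illustration only, here is the pattern with std's Condvar standing in for our notifier (a sketch, not the actual patch): the event must be recorded in shared state before notifying, and the waiter must re-check that state around every wait.

use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));

    let notifier = Arc::clone(&pair);
    thread::spawn(move || {
        let (lock, cvar) = &*notifier;
        *lock.lock().unwrap() = true; // record the event first...
        cvar.notify_one();            // ...then notify
    });

    // Simulate the waiter arriving after the notification was sent.
    thread::sleep(Duration::from_millis(50));

    let (lock, cvar) = &*pair;
    let mut ready = lock.lock().unwrap();
    // Without the shared flag, this wait would block forever even
    // though the event already happened: exactly the hang above.
    while !*ready {
        ready = cvar.wait(ready).unwrap();
    }
    println!("event observed, coroutine resumes");
}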

commented

I can confirm that no timeouts are happening anymore.

Performance is also superb, with the slowdown being just 8%:

mio 84000 req/s
coio(1 thread) 78000 req/s

Although it comes at the cost of much higher CPU utilisation:

Average:      UID       PID    %usr %system  %guest    %CPU   CPU  Command
Average:     1000     24915   19.23  112.68    0.00  131.91     -  coio-tcp-echo-s
Average:     1000     25130    7.55   88.89    0.00   96.44     -  mio-tcp-echo-se

commented

There is still lots of work to do :)

commented

That is kind of a worst-case scenario - no userspace work is done here. In more realistic scenarios the relative overhead will go down, while the benefits of multicore processing will shoot up.

Bugs like this are scary though: they can be missed easily, and we were lucky to be able to trigger this one relatively easily. Would it be possible to create a test case for problems like that? Custom condvars, spinlocks, notifies - all of that might break simply because only 2 pairs of eyes are looking at them, which doesn't add confidence, I admit :(
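
A brute-force regression test could look like this hypothetical sketch, with std's Condvar standing in for the custom notifier: many rounds of one notifier racing one waiter. Run against a notifier that drops notifications, the waiter hangs, which CI can catch with a timeout.

use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    const ROUNDS: usize = 100_000;
    // Shared state: a round counter guarded by a mutex, plus a condvar.
    let state = Arc::new((Mutex::new(0usize), Condvar::new()));

    let notifier = {
        let state = Arc::clone(&state);
        thread::spawn(move || {
            for i in 1..=ROUNDS {
                let (lock, cvar) = &*state;
                *lock.lock().unwrap() = i; // publish the event first
                cvar.notify_one();         // then wake any waiter
            }
        })
    };

    let (lock, cvar) = &*state;
    for i in 1..=ROUNDS {
        let mut cur = lock.lock().unwrap();
        // Re-checking shared state around wait() is what makes a
        // notification impossible to lose.
        while *cur < i {
            cur = cvar.wait(cur).unwrap();
        }
    }
    notifier.join().unwrap();
    println!("all {} rounds completed without hanging", ROUNDS);
}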

EDIT: grammar

commented

Hmm, the current implementation is just a temporary solution. We wanted to make it work just like Go's Cond implementation, but it failed to compile. So I committed a version that compiles and works, without any self-tests. That's my fault.

We will continue working on this, but currently we are focusing more on #56.