zonyitoo / coio-rs

Coroutine I/O for Rust

Performance regression since 78687e7

lhecker opened this issue · comments

Up until e63031c (inclusive) the current benchmark for coio-tcp-echo-server showed (subjectively) good results:

Speed: 75640 request/sec, 75640 response/sec
Requests: 2269208
Responses: 2269208

Since then performance has dropped significantly:

Speed: 1526 request/sec, 1526 response/sec
Requests: 45801
Responses: 45801

Between 78687e7..f747ab8 all builds fail the benchmark for major or minor reasons: earlier builds (before the change to the more recent mio version) abort with socket already registered or No such file or directory errors, and later ones panic because the slab is overfilled, or because using too many semaphores leads to a Too many open files panic. All of this made it hard for me to pin down the cause, but I suspect it's related to the switch to a newer mio version and the shared event loop etc.

Any ideas?

commented

Indeed, I have noticed this problem. I thought about it yesterday, but I still don't know where the bottleneck of the current implementation is. Regarding the problems you mentioned above:

  1. "socket already registered": Because the latest version of Mio wants to support IOCP on Windows, which makes it impossible to move FDs from one loop to another. In the current implementation of coio, it is a little bit hard to control the I/O objects for when I should call register, reregister and deregister on it. So register the I/O object everytime when the I/O is going to block, and deregister it when I got notified by Mio.

    But as you can see in the current AppVeyor build log, it still panicked with the "socket already registered" message. That's because Mio right now can only register a SOCKET object once per EventLoop: once you have registered it into an EventLoop, you cannot register it again even if you have already called deregister on it (you cannot reregister it even with a different Token!).

  2. "No such file or directory" and insert into filled slab: This is the problem we are working on in #19 . A coroutine is resumed before it is blocked.

  3. "Too many open files" panics for using too many semaphores? That's news.

So here is my opinion:

#16 has to be done, because if we want to support IOCP we cannot keep the one-loop-per-thread strategy.

Before profiling, I suspect the real problem is one of these:

  1. Here. Right now I use the channel provided by Mio for registering events. My original idea was to avoid blocking the worker thread when it wants to register an event with the EventLoop.
  2. Here. When the event is ready, Mio calls this callback, which then sends the coroutine pointer through an mpsc queue back to the worker thread. If that thread is busy, those coroutines may starve in the channel, because they cannot be stolen by the other worker threads. I have tried to get rid of the deque crate, which is a work-stealing queue, and replace it with a bounded MPMC queue, but I couldn't see any measurable difference. Check the mpmc branch.
  3. The Mutex carried by the Coroutine. Many spots have the same problem as described in #19 , so it is very likely that one worker thread is running a Coroutine while another worker thread grabs the lock and tries to resume it. It can't because of the Mutex, so the second worker is stuck (see the sketch after this list). This could be verified by checking whether performance is higher with only one worker thread.
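
A stripped-down illustration of point 3, using plain std threads in place of coio's workers: thread A holds the coroutine's Mutex while running it, and thread B sits blocked on the same lock instead of doing useful work:

    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Duration;

    fn main() {
        // Stand-in for the coroutine and the Mutex it carries.
        let coroutine = Arc::new(Mutex::new(()));

        let a = coroutine.clone();
        let worker_a = thread::spawn(move || {
            let _running = a.lock().unwrap();          // worker A is running the coroutine
            thread::sleep(Duration::from_millis(50));  // ... for a while
        });

        let b = coroutine.clone();
        let worker_b = thread::spawn(move || {
            // Worker B only wants to resume the coroutine, but blocks here
            // and cannot run anything else in the meantime.
            let _resume = b.lock().unwrap();
        });

        worker_a.join().unwrap();
        worker_b.join().unwrap();
    }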

The speed drop from 75640 to 1526 requests/sec is definitely a bug.

commented

CPU usage is not 100% (about 10% actually) when running the benchmarks. So something must be blocking the workers somewhere, and that is the cause of this issue.

@zonyitoo I finally found the line causing this regression and I'm sure that your reaction will be just like mine: "OMG". 😄
You said that the CPU utilization is about 10% right? Well look at this commit: 4f4a1f1
Yeah... I guess sleeping for 1/10th of a second could really be the reason for this...
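
For the record, the math checks out: a scheduler loop that sleeps between polls, roughly like the reconstruction below (this is not the actual commit, and poll_ready_events is a made-up name), can only wake up about ten times per second:

    use std::thread;
    use std::time::Duration;

    fn poll_ready_events() { /* hypothetical stand-in: handle whatever I/O is ready */ }

    fn scheduler_loop() {
        loop {
            poll_ready_events();
            thread::sleep(Duration::from_millis(100)); // caps the scheduler at ~10 wakeups/sec
        }
    }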

If I remove this, performance goes back to 76k req/s with coio-tcp-echo-server using -t 1. But it drops significantly again to about 13k req/s with -t 4. I guess the main reason is the rapid registration and deregistration with mio, huh? Because the mio channel uses heap allocations, and the tokens for the registrations are stored in a hashmap (profiling suggests that this alone causes about 20% (!) of the CPU load with -t 1).

commented

OOOOOMG......

Yes, with I/O-bound applications wait_event is the performance bottleneck. Mio's channel is actually a lock-free MPMC queue, so the queue itself should not be the most significant cause of the lower performance. And there may be no way around the two queues (worker -> Mio, Mio -> worker).

It might also be worth a try to separate mio's EventLoop from coio. Since e.g. sockets are Sync (right?) you could share them across the coroutines. The state would then be held by a "manager" object which sits between the EventLoop and coio. Whenever a state is updated (and stored persistently, so that we are not forced to use level triggering but can use edge triggering instead), the manager checks whether a coroutine is currently parked in a (e.g. socket) method and wakes it up. Thus you would almost never touch the EventLoop except for the creation of new sockets etc. Furthermore the manager could be guarded by a Mutex, which is (contrary to popular belief) really fast as long as you do not have any lock contention, and you wouldn't have any here, because the probability of the EventLoop updating a manager at the same moment a socket reads from it should be quite low (assuming a 1:n relationship between EventLoop:Manager and 1:1 between Manager:Socket).
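
A rough sketch of that manager idea; every type and name here is hypothetical (Token is just an index, CoroutineHandle stands in for whatever coio would use to resume a parked coroutine):

    use std::collections::HashMap;
    use std::sync::Mutex;

    type Token = usize;                 // hypothetical: slab index of a socket
    struct CoroutineHandle;             // hypothetical: resumable coroutine reference
    impl CoroutineHandle {
        fn wake(self) { /* hand the coroutine back to a worker */ }
    }

    struct SocketState {
        readable: bool,                 // readiness persists here, so edge
        writable: bool,                 // triggering is sufficient
        parked: Option<CoroutineHandle>,
    }

    struct Manager {
        sockets: Mutex<HashMap<Token, SocketState>>,
    }

    impl Manager {
        // Called from the EventLoop side whenever an (edge-triggered) event fires.
        fn on_ready(&self, token: Token, readable: bool, writable: bool) {
            let mut sockets = self.sockets.lock().unwrap(); // short critical section,
            if let Some(state) = sockets.get_mut(&token) {  // so contention stays low
                state.readable |= readable;
                state.writable |= writable;
                if let Some(co) = state.parked.take() {
                    co.wake();
                }
            }
        }
    }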

Alternatively you might consider reverting to the "one EventLoop per thread" design. You would need to turn the scheduler into a hybrid work-stealing one: coroutines that are parked in methods of objects which are not sendable (i.e. all socket methods, since those FDs cannot be re-registered with another EventLoop in another thread) would be marked as not "stealable". Such non-stealable coroutines would have to stay in the same Processor until it resumes them. I think this "might" offer better performance and might be easier to achieve than optimizing a shared EventLoop.
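
The non-stealable marker could be as simple as a flag that the stealing side checks before taking a coroutine from a victim's queue (again only a sketch; Coroutine and the Vec-based queue are placeholders):

    // Placeholder for coio's coroutine type.
    struct Coroutine {
        stealable: bool, // false while parked on a non-Send I/O object
        // ... stack, state, etc.
    }

    // In the stealing worker: skip coroutines that are pinned to their Processor.
    fn try_steal(victim: &mut Vec<Box<Coroutine>>) -> Option<Box<Coroutine>> {
        if victim.first().map_or(false, |co| co.stealable) {
            Some(victim.remove(0))
        } else {
            None
        }
    }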

BTW: I looked at mio's source code and you're right: it doesn't allocate any memory (which is only possible because the channel is bounded; unbounded channels require linked lists and thus Box). The register() and deregister() operations remain heavy though, with their 20% CPU usage alone...

commented

Then nearly all coroutines could only run on their own worker thread, which is not a work-stealing implementation anymore, just one-eventloop-per-thread plus a coroutine pool.

A shared EventLoop is widely used in many libraries, so I don't think the shared EventLoop is the main problem.

Well, actually this would only affect Windows, while UNIX platforms could run without any kind of locks, since you can freely share the fd between EventLoops there. I reckon that implementing the semi-stealing scheduler is a lot easier than tuning a shared loop. Why? Because lock contention (e.g. even "lock-free" queues, as used to communicate with the shared loop, block at a certain point) is basically the death of multithreading…
And yeah… shared loops are probably more widely used than per-thread loops, but I think that's more a side effect of wanting a stable and easy-to-write cross-platform program than of being optimal.

Oh and I got another hint for you... 😊 Try removing rand::random() - it uses a cryptographically secure RNG, which is probably a bit over the top for the Processor::scheduler method. If you replace it with rand::weak_rng() and its XorShiftRng, you get a hefty performance boost of about 400% (!) with -t 4. This is one of the biggest contention points when using coio with multiple threads, and as you can see it alone (nearly) turns relative_performance = 100% / thread_count into relative_performance = 100% - (2% * thread_count). 😊
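
In code, assuming the rand 0.3-era API (the modulo call site is only for illustration); note that in Processor::scheduler the weak rng should be created once and reused, not recreated per call:

    extern crate rand;
    use rand::Rng;

    fn main() {
        // Before: every call pays for a cryptographically secure RNG.
        let slow: usize = rand::random::<usize>() % 4;

        // After: a cheap XorShiftRng, seeded once and reused.
        let mut rng = rand::weak_rng();
        let fast: usize = rng.gen_range(0, 4);

        println!("{} {}", slow, fast);
    }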

commented

Excellent!

But by the way, the current implementation of Mio forbids fds from being moved from one EventLoop to another one. So right now we can only use a shared EventLoop.

Ah, I just saw it... It would be great if one could create mio sockets with an EventFd instead of having to use the std::net types...

The second biggest performance hog is, btw, the work-stealing queue, which is far from well engineered... It causes a lot of implicit syscalls and/or memory barriers. 😟

Reason being: they use sequentially consistent ordering EVERYWHERE: https://github.com/kinghajj/deque/blob/master/src/lib.rs#L253-L277
This one method alone causes 11% of the whole CPU usage on OS X.
And that's what makes it so sad to see, because knowing the differences between the memory orderings of atomic operations is essential if you claim to write a performant, or arguably "good", lock-free library... It's so sad because this extreme misuse of atomics can be seen literally everywhere. 😞 I wonder if I should write a PR...

(In case you don't know what's so bad about SeqCst ordering: to guarantee sequential consistency, SeqCst stores have to be compiled to full memory barriers on x86, which are extremely costly compared to the other orderings like acquire/release, or even relaxed. You can read more about it here: http://en.cppreference.com/w/cpp/atomic/memory_order)
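
A tiny example of the difference: acquire/release is already enough for the usual publish/consume pattern, and on x86 it compiles to plain loads and stores, while a SeqCst store becomes an xchg (a full barrier):

    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

    static DATA: AtomicUsize = AtomicUsize::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    // Producer: the Release store guarantees DATA is visible before READY.
    fn publish() {
        DATA.store(42, Ordering::Relaxed);
        READY.store(true, Ordering::Release); // plain mov on x86
        // With Ordering::SeqCst this store would become an xchg (a full
        // barrier), which is exactly what makes the deque crate so slow.
    }

    // Consumer: the Acquire load pairs with the Release store above.
    fn consume() -> Option<usize> {
        if READY.load(Ordering::Acquire) { // plain mov on x86
            Some(DATA.load(Ordering::Relaxed))
        } else {
            None
        }
    }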

I'd still argue that one loop per thread might scale better than a shared one. The select/kevent/etc. calls are syscalls, which probably scale a lot better if more than one thread is busy performing them. Furthermore you avoid the overhead of using queues to communicate.

This just depends on whether it's possible to determine when a coroutine must be pinned to a specific Processor and when it can be traded among them via work stealing.

If that's not possible, then I really wonder why -t 1 is still faster than -t 4, even though it uses fewer threads.

commented

The deque crate was originally created by Alex, one of the collaborators on the Rust project. They used this queue to implement the original version of Rust's coroutine library, libgreen. I think they just wanted to make everything work, not to optimize it.

Also, I am still wondering how to make good use of the deque with the current shared EventLoop strategy: when IoHandler::ready is called, it just pushes the coroutine into an mpsc queue to one specific worker thread, so those coroutines cannot be stolen by the other workers.

So many problems have arisen with the move to Mio v0.5. This crate needs a proper refactor!

I know the memory_order semantics from C++11. The documentation says the Ordering in Rust is exactly the same as in C++.

commented

To draw a conclusion, all the possible optimizations so far:

  • Hybrid work-stealing algorithm, which allows coroutines to be pinned to a specific worker thread, so that every worker thread can have its own EventLoop.
  • Replace rand::random() with a cheaper RNG.
  • Optimize the deque implementation.
  • Lower the cost of register and deregister.
  • Use a bounded MPMC queue as the task queue instead of deque, to eliminate the mpsc channel (Mio -> Worker); see the sketch below.

Will be extended by further discussion.
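
As a sketch of the last bullet: one shared run queue that the Mio handler pushes into and all workers pop from, instead of deque-per-worker plus an mpsc back-channel. A Mutex plus VecDeque stands in for a real lock-free bounded MPMC queue here; it shows the topology, not the performance:

    use std::collections::VecDeque;
    use std::sync::{Arc, Condvar, Mutex};

    struct RunQueue<T> {
        queue: Mutex<VecDeque<T>>,
        ready: Condvar,
    }

    impl<T> RunQueue<T> {
        fn new() -> Arc<Self> {
            Arc::new(RunQueue { queue: Mutex::new(VecDeque::new()), ready: Condvar::new() })
        }

        // Called from the Mio handler: the woken coroutine goes straight
        // into the shared queue, where *any* worker can pick it up.
        fn push(&self, task: T) {
            self.queue.lock().unwrap().push_back(task);
            self.ready.notify_one();
        }

        // Called from the workers: no per-worker channel, no starvation.
        fn pop(&self) -> T {
            let mut q = self.queue.lock().unwrap();
            loop {
                if let Some(task) = q.pop_front() {
                    return task;
                }
                q = self.ready.wait(q).unwrap();
            }
        }
    }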

I'm going to optimize the deque first, which should give a performance boost of at least 3-5%: kinghajj/deque#8

Today I thought a bit more about the first point (the "hybrid work-stealing algorithm"). In hindsight you might be right that a shared event loop is better, because I only had cases in mind where network operations are equally balanced between connections (which is the case for the project I'm currently using coio for). But if the workload is unbalanced, worker starvation kicks in and we get the usual problems of simple coroutine schedulers. So yeah... maybe you've been right all along. I still think that a loop per thread is faster (meaning: we should keep it on our radar), but it won't be faster unless we find a solution for making work stealing possible in the general use case with mio (meaning: it should have the lowest priority, huh?).

If you'd like me to work on some part of this, or implement some of the solutions we already spoke about, just say so. :)

P.S.: If you've ever wondered why I'm so active on this project: I would like to learn Rust and simply picked this project for it, because I consider coroutines highly interesting. (mioco wasn't really an option for me, because while it's probably a lot more stable than coio, it's really not that well written, in my humble opinion.)

commented

Let me explain my priority goals:

  1. Usability and availability are the primary goals, which means that this library should at least work without errors and support the most common use cases, such as network I/O and basic synchronization primitives.
  2. When the APIs are stabilized, we should focus on performance. It should at least have performance comparable to Go's runtime.
  3. Add the other features, such as file I/O, select, and so on.

So the coio project is still at step 1. I just managed to make it work with Mio v0.5, and as you can see, a lot of problems still remain. So if you want to help, I suggest you first focus on making it work well with the current version of Mio (a shared EventLoop is the only option), and treat performance as a secondary concern for now.

I have been quite busy these days, so I won't be committing much lately. Please feel free to comment; I will reply as soon as possible.

I want to make coio one of the best projects in Rustland.

BTW, if you are interested, please take a look at the context-rs project and see if there is anything that could be optimized.

It's great to hear that you're planning to take coio so far. 😊 And yes, I'd really like to make coio as stable as possible by investigating solutions to the current bugs, but I think there should be some kind of coordination where you say what you'd like to work on and what I can/should work on, so we don't fix the same problem twice. I'm soon going to get a bit more busy though, but I'll try to spend at least some time every day writing code for coio and its related projects.

I know it's a lot to ask, but let's just assume that I'm going to continue supporting this project (which I'm planning to): it would probably really simplify some things if you could grant me push rights (i.e. make me a collaborator) as soon as you trust me with it, because there are a lot of things I have in mind for this and other projects.

For instance: I could reeeeaaaally use some support for SO_REUSEADDR right now (for a university assignment which is due soon). To add this properly I would need to write the code (about 20% of the effort) and then send proper PRs to mio (which has a year-old issue for this) and to this project (to bridge the API over). But I also can't do it manually in my project using the C APIs, because you accidentally (?) made the AsRawFd accessor private. Thus: 80% effort.

commented

Sure, I'll add you as a collaborator. :) Happy hacking!

And I'm actually not working on anything right now, so you can choose your target freely.

I discovered just now that

src/bin/server.rs:71:13: 71:39 error: trait `AsRawFd` is private
src/bin/server.rs:71         use std::sys::ext::io::AsRawFd;
                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
src/bin/server.rs:73:18: 73:34 error: source trait is private
src/bin/server.rs:73         let fd = sock.as_raw_fd();
                                      ^~~~~~~~~~~~~~~~

does not in fact mean that the trait itself is "private", but that the use clause at the beginning of the file points at the wrong, private path: rust-lang/rust#22050
And people complain about C++ compilers being hard to understand... 😐

I also took a look at context-rs. Your reasoning behind how context switches work is really easy to understand. But I do think that using Boost.Context's assembly is going to be much better in the long run, because their code is surely a lot more fleshed out, thanks to all the man-hours spent on it over the years and the testing on all those different platforms (e.g. they have asm files in both AT&T and Intel syntax and can thus be compiled seamlessly on Windows). I don't understand how their assembly works yet though ("why do they push/pop all the things?"), but I'm sure I'll get there. Currently it's probably a bit more important to improve coio though. But I'll try to send you a PR with a branch that uses Boost.Context's asm; this might even solve a couple of issues along the way.

Oh and thanks for making me a collaborator. 😊

commented

AsRawFd should be imported as

use std::os::unix::io::AsRawFd;
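
For example, the failing snippet from src/bin/server.rs compiles once the import points at the right module:

    use std::net::TcpListener;
    use std::os::unix::io::AsRawFd; // not std::sys::ext::io::AsRawFd

    fn main() {
        let sock = TcpListener::bind("127.0.0.1:8080").unwrap();
        let fd = sock.as_raw_fd(); // raw fd, usable with libc::setsockopt etc.
        println!("listening on fd {}", fd);
    }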

Porting Boost.Context is a great idea; you can find another person already working on that in context-rs. But please also keep inline assembly in mind, because using external asm files requires an extra assembler to compile them.

Yeah, I found the solution to the AsRawFd problem after fiddling around for a while. Since I'm still new to Rust as a language, learning things like this is probably my new normal now, huh? Thanks anyway! 😅

And while my knowledge of Rust and coroutine implementations is definitely lacking, I'm quite sure that you can't safely use inline assembly for the context swaps etc. without naked functions. Those might be coming soon though:

commented

Of course. I am still waiting for the #[naked] attribute to be added to Rust nightly.

commented

Thread parking strategies will be discussed in #27 .