inko-lang / inko

A language for building concurrent software with confidence

Home Page: http://inko-lang.org/

Consider backing Inko processes by OS threads

yorickpeterse opened this issue

Description

Inko's approach to concurrency is similar to that of Erlang and Go: M Inko processes are mapped onto N OS threads. For sockets we use non-blocking IO, and for files we use blocking operations coupled with a backup thread pool in case these operations take too long.

This setup is based on the general belief that M:N scheduling combined with non-blocking IO leads to better performance compared to 1:1 scheduling with blocking operations. These benefits, however, are debatable, highly dependent on the type of workload, and come with their own set of non-trivial trade-offs.

A benefit of green threads with an M:N scheduler is that spawning tasks is fast and efficient, such that you can spawn many of them rapidly. On paper this seems beneficial, but in practice it remains to be seen if it truly is. For example, in a typical transactional application (basically any web application), the amount of concurrency is limited not by how many or how fast you can spawn your tasks, but by the concurrency supported by the external services (e.g. a database) the transaction relies upon. This means it doesn't really matter that you're able to spawn 10 000 processes with ease, if you're still limited to running only 20 concurrently due to your database pool being limited to 20 concurrent connections.

Even if your system somehow supported unbounded/unlimited concurrency, you really don't want that in a production setting as planning around unbounded concurrency is impossible, and bound to lead to problems. In contrast, it's much easier to deal with a system that's limited to for example 32 concurrent tasks.

Even if you could somehow solve this, green threading poses additional problems such as:

  1. Additional overhead the scheduler introduces to ensure tasks are run fairly
  2. Additional system calls and locking that comes with the use of non-blocking IO
  3. Poor C interop
  4. Platform specific assembly to support stack swapping, making it more difficult to support multiple platforms (= one of the reasons we don't support Windows at this time)
  5. Poor support for C libraries that require thread-local storage or thread pinning (such as most GUI libraries), unless we develop a way of pinning Inko processes to OS threads, which in turn adds scheduler complexity
  6. The code complexity that comes with supporting all this

There are usually two reasons one might want to avoid the typical thread-per-request approach and instead opt for M:N scheduling as described above:

  1. Spawning OS threads is more expensive than spawning green threads
  2. The cost of OS thread context switching is greater than that of green threads

The cost of context switching only really matters in systems where we have fully isolated transactions that don't depend on a fixed-size pool of some sort, i.e. tasks that are purely CPU bound. But for such workloads I suspect that 1:1 scheduling is in fact better, because you don't have the cost of the additional bookkeeping.

The cost of spawning threads is something one should be able to mitigate (or at least improve upon) by reusing threads: you maintain a pool of reusable threads, initially at size zero. When threads are needed, we check the pool and reuse a thread if any is present. If not, we spawn a new one. When threads finish, they enter the reusable pool for up to N seconds, after which they stop. Given a sufficiently large upper limit (e.g. 1000), the cost of spawning threads is amortized over time, with the best-case cost being roughly that of locking a mutex and popping from a queue.
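To make that concrete, here's a rough Rust sketch of such a pool. The ReusablePool and Job names and the keep_alive parameter are purely illustrative (none of this is Inko runtime code), and recv_timeout stands in for the "reusable for up to N seconds" behaviour; races such as a worker expiring right after being handed a job are glossed over:

use std::collections::VecDeque;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

// Illustrative pool: idle workers park on their own channel for up to
// `keep_alive`; if no work arrives in that window, the thread exits.
struct ReusablePool {
    idle: Arc<Mutex<VecDeque<Sender<Job>>>>,
    keep_alive: Duration,
}

impl ReusablePool {
    fn new(keep_alive: Duration) -> Self {
        ReusablePool {
            idle: Arc::new(Mutex::new(VecDeque::new())),
            keep_alive,
        }
    }

    fn spawn(&self, job: Job) {
        // Try to hand the job to an idle thread first.
        let job = if let Some(worker) = self.idle.lock().unwrap().pop_front() {
            match worker.send(job) {
                Ok(()) => return,
                // The worker timed out and exited in the meantime; recover
                // the job and fall through to spawning a fresh thread.
                Err(err) => err.0,
            }
        } else {
            job
        };

        let (send, recv) = channel::<Job>();
        let idle = Arc::clone(&self.idle);
        let keep_alive = self.keep_alive;
        let self_send = send.clone();

        thread::spawn(move || loop {
            match recv.recv_timeout(keep_alive) {
                Ok(job) => {
                    job();
                    // Done: offer this thread up for reuse.
                    idle.lock().unwrap().push_back(self_send.clone());
                }
                // No work within the keep-alive window: let the thread exit.
                Err(_) => break,
            }
        });

        send.send(job).unwrap();
    }
}

fn main() {
    let pool = ReusablePool::new(Duration::from_secs(10));

    for i in 0..4 {
        pool.spawn(Box::new(move || println!("job {} running", i)));
    }

    // Give the (detached) worker threads a moment to run.
    thread::sleep(Duration::from_millis(100));
}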

The cost of OS context switching still applies when using M:N scheduling: the kernel decides when to switch and we have no control over it. In certain scenarios this can even make things worse, such as when a process is rescheduled only for the kernel to swap out the OS thread it runs on. In other words, M:N scheduling doesn't remove this cost, it just makes it less common.

I've been thinking about this over the years, but the more I think about it, and the more challenges I encounter with the M:N scheduler, the more I think we should move to a 1:1 scheduler with the above thread reuse mechanism. The benefits are numerous:

  1. We get to remove a ton of code from the compiler and runtime library
  2. We no longer need a special mechanism to deal with thread pinning, thread-local state, etc, making it easier to interact with C libraries that need this
  3. We can get rid of epoll/kqueue/etc and just use blocking IO and let the kernel handle things. Linux is perfectly capable of handling tens of thousands of threads blocking on IO. Even on my laptop I can easily run 100 000 threads or so without needing additional work (#540)
  4. We can (and should) still set the thread stack sizes to something smaller than the default 8 MiB of virtual memory, just as we do now, minus the need to manually reuse stack memory
  5. We can work towards supporting Windows again more easily, as we no longer need the platform specific assembly used for swapping processes and stacks
  6. Types such as Channel could be simplified, as we can now just use a regular condition variable and mutex for blocking processes on channels (see the sketch after this list)
  7. No more primary and blocking thread pools
  8. Sockets can be made smaller as we no longer need to track additional state used by the network poller
  9. Debuggers and profilers (e.g. Valgrind) should work better with Inko, as these can get confused when stacks are switched
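To illustrate point 6, here's a rough sketch of what a condition variable plus mutex based channel could look like once every process is backed by its own OS thread. The Channel type below is just an illustration in Rust, not Inko's actual Channel implementation:

use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Illustrative blocking channel: a queue guarded by a mutex, plus a
// condition variable to wake up receivers blocked on an empty queue.
struct Channel<T> {
    queue: Mutex<VecDeque<T>>,
    ready: Condvar,
}

impl<T> Channel<T> {
    fn new() -> Arc<Self> {
        Arc::new(Channel {
            queue: Mutex::new(VecDeque::new()),
            ready: Condvar::new(),
        })
    }

    fn send(&self, value: T) {
        self.queue.lock().unwrap().push_back(value);
        // Wake up a receiver blocked in receive(), if there is one.
        self.ready.notify_one();
    }

    fn receive(&self) -> T {
        let mut queue = self.queue.lock().unwrap();

        // The OS thread backing the process simply blocks here until a
        // value arrives; no network poller or scheduler bookkeeping needed.
        while queue.is_empty() {
            queue = self.ready.wait(queue).unwrap();
        }

        queue.pop_front().unwrap()
    }
}

fn main() {
    let chan = Channel::new();
    let sender = Arc::clone(&chan);

    let handle = thread::spawn(move || sender.send("hello"));

    println!("{}", chan.receive());
    handle.join().unwrap();
}

A sender pushes a value and notifies; a receiver that finds the queue empty just blocks its OS thread on the condition variable until woken up.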

Of course at the language level nothing would change: processes would still be lightweight processes (they remain more lightweight than OS processes), and the way you use channels/etc would remain the same. You'd also still spawn processes per transaction where possible, it's just that now each process is backed by a dedicated OS thread. In other words, the use of 1:1 scheduling is just an implementation detail transparent to the language.

Related work

Issues we could close

Assuming we drop the use of green threading, the following issues could be closed due to no longer being relevant:

  • #617: this could probably all be thread-local state managed by generated code
  • #583: not needed
  • #344: not needed as we can just use regular blocking IO

Somewhere in the last two years I did hack together a small PoC that replaced the scheduler with a 1:1 scheduler. At the time this resulted in a small increase in execution times for the test suite, but this was when we were still using an interpreter. This setup also didn't reuse any threads, so I suspect most of the extra time was spent just starting threads.

Here's a simple and admittedly poorly implemented example of amortizing the thread spawn cost by reusing threads:

use std::sync::mpsc::channel;
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

// Spawn a new OS thread for every task, measuring the time between sending
// the start Instant and the spawned thread observing it.
fn naive() {
    let mut i = 0;
    let mut fastest = Duration::from_secs(100);

    while i < 50_000 {
        let (input_send, input_rec) = channel();
        let (output_send, output_rec) = channel();

        input_send.send(Instant::now()).unwrap();
        thread::spawn(move || {
            output_send
                .send(input_rec.recv().unwrap().elapsed())
                .unwrap();
        });

        let time = output_rec.recv().unwrap();

        if time < fastest {
            fastest = time;
        }

        i += 1;
    }

    println!("naive: {:?}", fastest);
}

// Same measurement, but reuse threads through a pool of channel endpoints
// instead of spawning a new thread for every task.
fn reused() {
    let mut i = 0;
    let mut fastest = Duration::from_secs(100);
    let reusable = Mutex::new(Vec::with_capacity(32));

    while i < 50_000 {
        let (input, output) = {
            let mut threads = reusable.lock().unwrap();

            if let Some(res) = threads.pop() {
                res
            } else {
                let (input_send, input_rec) = channel::<Instant>();
                let (output_send, output_rec) = channel::<Duration>();

                thread::spawn(move || loop {
                    if let Ok(t) = input_rec.recv() {
                        let _ = output_send.send(t.elapsed());
                    } else {
                        break;
                    }
                });

                (input_send, output_rec)
            }
        };

        input.send(Instant::now()).unwrap();

        let time = output.recv().unwrap();

        reusable.lock().unwrap().push((input, output));

        if time < fastest {
            fastest = time;
        }

        i += 1;
    }

    println!("reused: {:?}", fastest);
}

fn main() {
    naive();
    reused();
}

In the reused case you can't use join to get the thread results, so in the interest of comparing apples to apples both examples use channels for their input and output.

Running this with cargo run --release yields the following on my laptop:

naive: 14.089µs
reused: 653ns

The "reused" time varies a bit between 500 nsec and 1 µsec, but it highlights how easily you can reduce the spawn cost by just reusing threads. Assuming a real and accurate implementation (the above version only ever spawns a single thread and always reuses it) might need some extra bookkeeping, we'd still be looking at a 10x improvement at least.

The context switch cost remains, but I'm willing to bet that for 95% of the applications out there this is a non-issue to begin with.

Another point to consider: green threads typically come with smaller growable stacks, such that the initial amount of (virtual) memory they need is smaller. However, Inko's stack sizes are fixed to 1 MiB by default, as resizing stacks comes with its own overhead and complicates code generation (= you have to ensure the stack size check always comes first in every function).
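For reference, requesting a smaller stack when spawning an OS thread is trivial with the standard thread APIs. A Rust sketch (the 1 MiB value simply mirrors Inko's current default):

use std::thread;

fn main() {
    // Ask for a 1 MiB stack instead of the platform default, which on Linux
    // is typically 8 MiB of virtual memory.
    let handle = thread::Builder::new()
        .stack_size(1024 * 1024)
        .spawn(|| {
            // This thread runs with the smaller stack.
            println!("running with a 1 MiB stack");
        })
        .unwrap();

    handle.join().unwrap();
}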

An argument against one thread per Inko process is a less consistent experience: running many OS threads requires tuning of various /sys settings to not run into errors. In addition, macOS applies a limit on the number of threads you can spawn per process, and IIRC that limit is around 2000. In contrast, Inko's scheduler doesn't require any tuning whether you spawn 1 or 100 000 processes.

Another argument against OS threads in the context of FFI:

Pinning an Inko process to an OS thread isn't a great approach to handling C libraries that require running on the same thread, but it's also not that big of a deal. We could also change the scheduler such that the main process always runs on the same thread, and not offer a generic pinning mechanism. This is easy enough to implement and sufficient for using libraries that must run on the same thread.