tidwall / neco

Concurrency library for C (coroutines)


Proposal: c.neco_start_parked - to avoid thundering herd of coroutine switching?

deckarep opened this issue

Hi @tidwall,

So far I'm not seeing a way to avoid this behavior, even with different strategies like using a WaitGroup or using a suspend/resume.

My program creates coroutines at runtime: hundreds of them each time the user clicks the screen. The spawning happens inside a mouse-click event; after that code runs, a single c.neco_yield() gives all coroutines a chance to run, then the render frame draws and the cycle repeats.

Fundamentally this all works, but the problem I'm seeing is in my mouse-click event: when I spawn 100 coroutines, the sheer act of calling c.neco_start also does a coroutine switch internally. So for 100 coroutines, a switch is called... within neco, all alive coroutines are given a chance to run, and then this repeats for the next started coroutine in the loop.

The problem gets worse as more coroutines are alive: the frame rate drops significantly. Now, I could be wrong about the math here, but here's what I'm trying to describe:

  • Assume my application is already running and currently has 1000 coroutines
  • User clicks (spawning 100 new coroutines)
  • Spawn the 0th coroutine (in the 100 iter loop)
    • neco_start called (switching to newly minted coroutine)
    • internally causing a switch to each alive coroutine (1000)
    • Total: 1000 + 1 = 1001 switches
  • Spawn next coroutine (1st)
    • neco_start called (switching to newly minted coroutine)
    • switch to each alive coroutine (1001 now alive) + 1 newly minted
    • Total: 1001 + 1 = 1002 switches

Continuing on to handle the remaining 98, you end up with 1001 + 1002 + ... + 1100 = 105050 context switches in total.

Then, after this work is done, my application will do one c.neco_yield() per game loop update.

Sorry for the wall of text; I know this might be a lot to unpack or consider. If there is already another way around this, then please, by all means, let me know.

Otherwise, I think what could really help here is a neco_start_parked method which creates the coroutine and adds it to the neco internal list, but immediately returns to the call site (no switch occurs).

In my case, 100 coroutines would be created, registered, and given the chance to run on the next call to neco_yield. That one yield would take care of running all coroutines, and the cost would scale linearly with the app. A rough sketch of the call site I have in mind is below.
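To make the proposal concrete, here's how I'd imagine using it. neco_start_parked doesn't exist; the signature just mirrors the real neco_start:

#include "neco.h"

// Hypothetical API: same signature as neco_start(), but it only registers
// the coroutine with the scheduler and returns immediately (no switch).
int neco_start_parked(void (*coroutine)(int argc, void *argv[]), int argc, ...);

void cobunny(int argc, void *argv[]) {
    while (1) {
        // ... update one bunny ...
        neco_yield();
    }
}

void on_click(void) {
    for (int i = 0; i < 100; i++) {
        // Returns right away; all 100 coroutines get their first run
        // on the next neco_yield() in the game loop.
        neco_start_parked(cobunny, 0);
    }
}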

Ok, if you got this far on this post, I applaud you and appreciate any response. If you would consider this, I can also try to contribute the change... but I'd need to get into the weeds of the coroutine magic.

-@deckarep

As an aside: I went down the path of seeing whether it would make a difference to bootstrap all my coroutines at application startup in a suspended state. Then, in my case, when the user clicks the screen to spawn 100 bunnies, I just initialize the bunnies and call neco_resume.

Unfortunately the same thundering herd happens, because neco_resume also yields per coroutine, which makes sense.

However, I created a neco_resume_later, and in that code I simply don't call sco_yield(). This has the net effect of registering the coroutine to resume in the future, from the yield that occurs later in my code. This scales linearly.

This worked beautifully and now my application is super fluid.
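Roughly, the pre-spawn plus resume-later pattern looks like this. neco_suspend, neco_resume, and neco_getid are the stock APIs; neco_resume_later is my local patch, not part of neco:

#include "neco.h"

#define MAX_BUNNIES 1000

int64_t bunny_ids[MAX_BUNNIES];

// My local addition: like neco_resume() but only queues the wakeup.
int neco_resume_later(int64_t id);

void cobunny(int argc, void *argv[]) {
    int slot = *(int*)argv[0];        // copy before the parent moves on
    bunny_ids[slot] = neco_getid();
    neco_suspend();                   // parked here until a click wakes us
    while (1) {
        // ... animate this bunny ...
        neco_yield();
    }
}

// At startup: pre-spawn every bunny in a suspended state.
void bootstrap(void) {
    for (int i = 0; i < MAX_BUNNIES; i++) {
        neco_start(cobunny, 1, &i);   // cobunny copies i on its first run
    }
}

// On click: wake n bunnies without switching into any of them. All n get
// their first run together on the next neco_yield() in the game loop.
void click(int next, int n) {
    for (int i = 0; i < n; i++) {
        neco_resume_later(bunny_ids[next + i]);
    }
}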

So I'm curious to hear your thoughts.

Hi, I whipped up an example program of the issue that you are describing.

#include "neco.h"

#define N 100

long long total = 0;
long long steps = 0;

void costats(int argc, void *argv[]) {
    while (1) {
        printf("\e[1;34m(stats) %lld steps %lld coroutines\e[0m\n", steps, total);
        neco_sleep(NECO_SECOND);
    }
}

void cochild(int argc, void *argv[]) {
    total++;
    while (1) {
        steps++;
        neco_yield();
    }
}

void click() {
    int64_t start = neco_now();
    for (int i = 0; i < N; i++) {
        neco_start(cochild, 0);
    }
    printf("Started %d coroutines (%lld total) in %.2f ms\n", 
        N, total, (neco_now()-start) / 1e6);
}

int neco_main(int argc, char *argv[]) {
    printf("Press enter to start %d coroutines...\n", N);
    neco_start(costats, 0);
    while (1) {
        char c = 0;
        while (c != '\n') {
            if (neco_read(0, &c, 1) != 1) {
                return 1;
            }
        }
        click();
    }
    return 0;
}

If you run this you will see that every time the Enter key is pressed, 100 coroutines are started. The amount of time that it takes to start all of them increases linearly with the number of coroutines that are currently yielded. I believe this is consistent with the thundering herd problem that you are describing.

You are right about the reason this is happening: all yielded coroutines must run prior to the newly started coroutine, and the parent coroutine that started the new one must be scheduled last. This ensures that all coroutines are fairly scheduled and that all arguments are still in scope when the new coroutine starts.

The biggest concern is keeping the arguments in scope, because losing them could lead to memory bugs. The super-duper "fair" scheduler is probably a lesser concern.
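To illustrate the scope concern: the variadic arguments to neco_start live on the caller's stack, so the new coroutine must take its first run before the caller's frame can change or unwind:

#include <stdio.h>
#include "neco.h"

void coworker(int argc, void *argv[]) {
    // argv[0] points into spawn()'s stack frame. This read is only safe
    // because the scheduler guarantees the child runs before spawn()
    // can return (or even advance its loop).
    int id = *(int*)argv[0];
    printf("worker %d started\n", id);
    while (1) {
        neco_yield();
    }
}

void spawn(void) {
    for (int i = 0; i < 100; i++) {
        // If neco_start returned without running coworker first, &i
        // could change or go out of scope before the child read it.
        neco_start(coworker, 1, &i);
    }
}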

I made a small change, which you can find in the quickstart branch, that alters the sco scheduler so that when a coroutine is started it runs immediately and the parent coroutine is scheduled ahead of all the currently running and yielding coroutines.

It makes for very quick coroutine starts and is argument safe. When run against the example above, you will see that there is no slowdown at all, no matter how many coroutines are currently running.
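Conceptually, the difference in schedule order looks something like this (a sketch of the described behavior, not the actual sco code):

/*
 * main branch: parent P starts child C while [A, B, ...] are yielded.
 *   neco_start(C): run A, run B, ..., then run C, then P resumes last.
 *   One start costs a pass over every live coroutine.
 *
 * quickstart branch:
 *   neco_start(C): C runs immediately; when C yields, P resumes ahead
 *   of [A, B, ...]. One start costs O(1), and the arguments are still
 *   in scope because C ran before P advanced.
 */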

But a side effect is that a couple of tests fail because they expect a specific schedule ordering.

> However, I created a neco_resume_later and in that code I simply don't call sco_yield();

I added the change you suggested to the same branch (within the same neco_resume function), but it caused a deadlock with the generator code in the current tests.

Additional testing is needed to determine whether either solution could work for real, but I think the first one is promising.

Thanks for your in-depth analysis and for the branch. If the behavior I'm seeking can be achieved with no API changes or new additions, that's quite alright with me.

I'll give your branch a shot with my codebase to see how it affects things on my end.

Also, I completely understand the need to maintain a fair scheduler and, more importantly, to ensure the lifetime of the args is correct in order to avoid the bugs you mention.

With my code, I made sure to immediately take stack copies of the dereferenced args, so that even if the calling code's stack frame unwinds, my coroutines can continue working. This won't work, however, if the args need to stay as pointers...
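For example, the idea looks like this (a trimmed-down sketch in C; my real code is the Zig linked below):

#include "neco.h"

typedef struct { float x, y, dx, dy; } bunny;

void cobunny(int argc, void *argv[]) {
    // First thing: copy the args onto this coroutine's own stack, so
    // nothing breaks when the spawning frame unwinds.
    bunny b = *(bunny*)argv[0];
    while (1) {
        b.x += b.dx;
        b.y += b.dy;
        neco_yield();
    }
}

void spawn_bunny(float x, float y) {
    bunny b = { x, y, 1.0f, 1.0f };
    // &b is a stack pointer, but cobunny copies it during its first
    // run, before this function can return.
    neco_start(cobunny, 1, &b);
}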

I'll follow up!

OK, I confirmed that your quickstart branch is solid. Also, the nice thing with this solution: I don't have to pre-spawn the coroutines... I can just go back to creating them on demand, and I see no impact on the framerate as a result. The app still runs super-smooth.

Now that you see the difference, I'm really hoping we can get to a robust solution that can be merged eventually.

Code for your reference: https://github.com/deckarep/zig-notebook/blob/main/coro/src/coro_tacular.zig

I've been experimenting with the quickstart method and I think that it could work broadly for all neco_start operations. I have a little more testing to do to verify 100%. The change will technically be made in the sco project and propagate to neco.
The biggest concern I have atm is finding a way to quickly start a ton of coroutines without starving the coroutines that are currently in the scheduled run list. But I think I may have a working solution locally.

Excellent, thanks for the deep dive on this. Feel free to ping and I'm happy to try any branch out.

I just pushed a working solution to the main branch.

@tidwall - oh great, I will be trying this today. I briefly glanced at your changes and I'm glad it didn't look like a big scary change. 😄

@tidwall - everything is looking great on my end. The performance of my version of the bunnymark is buttery smooth. I will close this issue.

Thank you for tackling this and knocking it out. Your library is inspiring...so are your other repos.

Thanks for the kind words. I'm glad to hear it's working well for you. :)