tokio-rs / tokio-uring

An io_uring backed runtime for Rust

rt: Hang on too small completion queue

ollie-etl opened this issue

Similar to #145, and again exercised by #144. If the size of the completion queue is smaller than the number of concurrent writers to the submission queue, the runtime just hangs.

Probably not committing?

Yes, the driver may be making overly strict assumptions about when to submit and when to check the cqueue.

The driver isn't flushing the squeue and handling the cqueue as often as it could, and it could use the overflow bit to take an extra loop before returning to the caller.

And, as another issue and commit found and handled, it is not enough to submit only when the thread is going idle; there will be times we want submit and tick to run without going idle first. Whether we wait for the squeue to be full or not is debatable. The squeue drains completely on any call to submit, so it is probably okay to wait for it to fill, but the app decides the size of the squeue, and if it is made too large it can starve the kernel of work it could already be processing. Some day I want to ask the uring author what they think the right sizing is, but maybe this crate's benchmarking abilities will already have given us some good ideas.

I believe my earlier thinking, and my pseudo fix, was that any time the squeue is found full we should run a tick first, but we don't want to recurse and call a tick again if a completion handler inside that tick tried to make its own submission entry and was also thwarted by a full squeue. We also don't want to force the kernel into overflow mode if we can help it, because that creates a slower path and burns cycles that can often be avoided by checking the cqueue periodically.
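A rough sketch of that push-then-tick idea, written against the io-uring crate's public API rather than the actual tokio-uring driver internals (the function name and error handling here are illustrative only):

use io_uring::{squeue, IoUring};

fn push_or_tick(ring: &mut IoUring, entry: squeue::Entry) -> std::io::Result<()> {
    // SAFETY: the buffers/fd referenced by `entry` must stay valid until its
    // completion is seen; that invariant is the caller's to uphold.
    if unsafe { ring.submission().push(&entry) }.is_ok() {
        return Ok(());
    }

    // The squeue was full: flush it to the kernel and run one "tick" that
    // dispatches whatever completions are already available.
    ring.submit()?;
    for cqe in ring.completion() {
        let _ = cqe.user_data(); // the real driver would complete the op keyed by this
    }

    // Retry the push exactly once; deliberately no recursion if a completion
    // handler filled the squeue again in the meantime.
    unsafe { ring.submission().push(&entry) }
        .map_err(|_| std::io::Error::new(std::io::ErrorKind::Other, "squeue still full after tick"))
}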

And there is also the no-syscall mode, which the driver doesn't have an option for yet, and which would let the two sides run in parallel even more efficiently.

Great ways are being found to stress the system reliably. That will make it easier to push performance and to detect when regressions happen.

I think that we should probably take an approach here similar to what tokio does when it calls epoll_wait. This would mean getting a hook into tokio that lets us run a callback to flush the queue and dispatch completions during the maintenance loop. Tokio's maintenance loop runs every set number of scheduler ticks (a scheduler tick is basically the polling of a task). This would give us a good way to periodically flush the queue.

@Noah-Kennedy Are you saying there is already a hook in tokio we can use, or that you're going to try to get one added?

If that hook is new, then perhaps before it lands we can implement our own maintenance countdown in op creation to trigger the same uring maintenance.

I'm going to try to get one added. In the meantime, I'm wondering if it is best to address this in the op creation or to do a "duct tape and zip ties" fix by having an always-ready task that no-ops and yields back most of the time, like this (pardon my pseudocode):

async fn maintenance_hack() {
    loop {
        for _ in 0..16 {
            // yield back to the runtime 16 times so other tasks make progress
            tokio::task::yield_now().await;
        }

        // flush the squeue and dispatch completions
        // (I'm acting like the thread-local driver handle is a static here
        // for simplicity's sake; this task would be spawned at runtime startup)
        DRIVER.tick();
    }
}

This will basically run once every 16 full passes through the scheduler.

Crap, that won't work. It will cause us to busy loop instead of blocking on epoll_wait.

Yeah, we should probably try and do this like @FrankReh was suggesting for the time being.

And long-term move to a hook in tokio.

The more I think about it, this probably can't be solved in op submission without removing the ability to batch SQE submissions. I'm wondering if it might make sense for us to do something like my suggestion, but with some added logic so the task only runs when there are unflushed submission queue entries.

Ah, I was thinking of a different issue here. Disregard my earlier comments; they are important to do but won't fix this issue. We should definitely dispatch CQEs first any time we submit SQEs.
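A minimal sketch of that ordering, again using the io-uring crate directly rather than the driver's real submit path (the function name is mine):

fn dispatch_then_submit(ring: &mut io_uring::IoUring) -> std::io::Result<usize> {
    // Drain and dispatch whatever is already sitting in the cqueue so new
    // completions have room to land.
    for cqe in ring.completion() {
        let _ = (cqe.user_data(), cqe.result()); // hand the result to the waiting op
    }
    // Only then hand the queued SQEs to the kernel.
    ring.submit()
}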

We should also be using NODROP, but I don't think that is in tokio-rs/io-uring yet. We would need to add that flag first and get a release out over there.

I forgot that NODROP is a feature, not a flag...
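For reference, a quick check of the feature bit; this assumes the io-uring crate's Parameters::is_feature_nodrop() accessor:

use io_uring::IoUring;

fn main() -> std::io::Result<()> {
    let ring = IoUring::new(8)?;
    if ring.params().is_feature_nodrop() {
        // With NODROP the kernel buffers completions it cannot place in the
        // cqueue instead of dropping them, and flags the ring as overflown
        // until they are flushed.
        println!("IORING_FEAT_NODROP is available");
    }
    Ok(())
}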

Does this occur with the NODROP feature enabled?

I've got buffer ring groups working in one fork and streaming multi-accept working in another. I think I'd like to get those in before we try to address how to handle a flood of submissions or a flood of completions, because they make it easier to create the floods, and then we can see all the reasons for handling them more easily (I think).

But my streaming multi-accept has a cancel ability in it that may not fit with your plans for cancellation; I'd like it to get a little more light because it has a nice aspect or two.

Does this occur with the NODROP feature enabled?

NODROP just means the kernel will queue completions up in extra kernel memory while waiting for us to drain the cq. So it doesn't help if our problem is that the kernel isn't seeing our sq entries, or that we aren't pulling from the cq because we thought we had drained it but never asked the kernel to flush more into it. That's what the OVERFLOW bit is for.
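A hedged sketch of what overflow-aware draining looks like at the io-uring crate level (not the actual driver change; it assumes the SubmissionQueue::cq_overflow() accessor for the SQ-ring overflow flag):

fn drain_completions(ring: &mut io_uring::IoUring) -> std::io::Result<()> {
    loop {
        // Dispatch everything currently visible in the cqueue.
        for cqe in ring.completion() {
            let _ = cqe.user_data(); // complete the op that owns this entry
        }
        // If the kernel flagged an overflow, it is holding completions back
        // until there is room; re-enter it asking for events so it can flush
        // them into the now-drained cqueue, then loop and drain again.
        if ring.submission().cq_overflow() {
            ring.submit_and_wait(1)?;
        } else {
            return Ok(());
        }
    }
}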

Ah, I see the issue. I didn't realize that the problem was that we had extra completions we didn't notice due to overflow. Sorry for the confusion!

Once I started handling the OVERFLOW bit, my test hangs went away.

My streaming fork is waiting for the slab linked list work by @ollie-etl to be merged. I think that PR was left in his court but I'll double check...

Yes, I did a review for him yesterday and he'll also need to resolve conflicts with the new master.

It's not a blocker, but I want to get that writev PR committed: the author put good work into it, and we should be able to clear it off our table with no other dependencies.

I understand your earlier comments now. Once we get through the current PR queue, we can get this fix in and do a release.

How about a 4.0 release tomorrow or Friday, regardless of what else is done? And none of our planned work is a breaking change, so maybe a 4.1 about a week later? I'll work on change log stuff too. What's the best way to share the change log with you? Just a regular PR?

I meant 0.4.0 and 0.4.1.

@FrankReh we should have done this a while ago TBH, but we need a CHANGELOG.md file where we can document changes. We will format it the same way tokio does, and we can use it to add release notes for the release tags.

We should probably list this as a "known issue"

We should probably list this as a "known issue"

But it's good to see if it can be addressed while the OP has cycles and the ability to easily reproduce from their end. With your PR in, let's see if @ollie-etl can reproduce the problem.

I can make it my priority to get something that handles OVERFLOW in for @ollie-etl to check out. This may well be solvable relatively easily, so let's see about getting it closed quickly.

I could create a self-contained minimal reproducer that hangs at exactly the size of the completion queue, just as the title predicts.
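For illustration, a self-contained sketch of that kind of reproducer, using the io-uring crate directly (the 4-entry sizes are arbitrary; depending on kernel version the final wait hangs or a submit fails with EBUSY, but either way it shows what happens once completions exceed the cqueue without being drained):

use io_uring::{opcode, IoUring};

fn main() -> std::io::Result<()> {
    // 4 submission entries and a deliberately small 4-entry completion queue.
    let mut ring = IoUring::builder().setup_cqsize(4).build(4)?;

    // Two batches of 4 no-ops: 8 completions aimed at a 4-entry cqueue,
    // with nothing ever drained in between.
    for batch in 0..2u64 {
        for i in 0..4u64 {
            let nop = opcode::Nop::new().build().user_data(batch * 4 + i);
            unsafe { ring.submission().push(&nop).expect("squeue full") };
        }
        ring.submit()?;
    }

    // Waiting for all 8 at once never succeeds: only 4 fit in the ring at a
    // time, and the rest sit in the kernel's overflow backlog (or are dropped
    // on pre-NODROP kernels) until the cqueue is drained.
    ring.submit_and_wait(8)?;
    Ok(())
}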

My, you guys have been busy! I'll try to take a look at these this morning.

I have a solution that solves it for my test case. It just involves a change to the io-uring crate. The tokio-uring crate can be left at master.

@Noah-Kennedy, @ollie-etl, would you like to change your Cargo.toml to use my fork/branch and report whether it fixes the problem you are seeing too?

io-uring = { git = "https://github.com/frankreh/io-uring", branch = "frankreh/cqueue-overflow", features = ["unstable"] }

@FrankReh what did you change?

I changed the overflow bit handling in submit. I think I sent the idea out last night in one of these other issues.

How did you change it though, and do you think that this would be appropriate for a PR in io-uring?

Yes, I think it's appropriate. Just easier if someone else can vouch for it making a difference too.

How did you change it though, and do you think that this would be appropriate for a PR in io-uring?

I don't understand where this question is coming from. Isn't it easier for you to look at the diff of the commit I put into that branch?

Sorry, I was on mobile earlier and not able to easily see, should have mentioned that. Looking at your change rn, this looks fine.

This should be solved once tokio-rs/io-uring#152 or its equivalent is merged and this crate's version of io-uring is appropriately bumped.

Does that change alone fix this, or is #152 also needed?

Nevermind, I just saw the PR in here.