Unexplained race condition in v0.16 causing "runtime dropped the dispatch task"

Question

Nikita240 opened this issue 3 months ago

After upgrading from bollard v0.15 to v0.16 I started encountering a race condition in my unit tests. I believe this is likely related to the upgrade to hyper v1.1, but I can't quite pin down what's happening.

Here is the test setup to replicate:

//! [dependencies]
//! bollard = "0.16.0"
//! tokio = { version = "1.24.2", features = ["rt-multi-thread", "macros", "fs"] }
//! once_cell = "1.19.0"
use bollard::{image::ListImagesOptions, Docker};
use once_cell::sync::OnceCell;

static DOCKER: OnceCell<Docker> = OnceCell::new();
fn get_docker() -> Result<&'static Docker, bollard::errors::Error> {
    DOCKER.get_or_try_init(Docker::connect_with_socket_defaults)
}

#[tokio::test(flavor = "multi_thread")]
async fn test_runtime() {
    run_test(10).await;
}

#[tokio::test(flavor = "multi_thread")]
async fn test_runtime_2() {
    run_test(10).await;
}

#[tokio::test(flavor = "multi_thread")]
async fn test_runtime_3() {
    run_test(100).await;
}

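async fn run_test(count: usize) {
    // Every test shares the same statically stored client.
    let docker = get_docker().unwrap();
    for _ in 0..count {
        // Only the success of the call matters here; the image list is discarded.
        docker
            .list_images(Some(ListImagesOptions::<String> {
                all: true,
                ..Default::default()
            }))
            .await
            .unwrap();
    }
}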

Here is what the error looks like:

running 3 tests
test test_runtime ... ok
test test_runtime_3 ... FAILED
test test_runtime_2 ... ok

failures:

---- test_runtime_3 stdout ----
thread 'test_runtime_3' panicked at tests/bollard.rs:33:14:
called `Result::unwrap()` on an `Err` value: HyperLegacyError { err: Error { kind: SendRequest, source: Some(hyper::Error(User(DispatchGone), "runtime dropped the dispatch task")) } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    test_runtime_3

The test failures are random and inconsistent.

rustc 1.76.0 (07dca489a 2024-02-04)

Do you have any ideas how to root-cause this?

Niel Drummond · Answer 1 · Tue Mar 19 2024 04:36:12 GMT+0800 (China Standard Time)

Thanks for the report. I see that hyper v1.2.0 and hyper-util v0.1.3 are out, so let's see if we can reproduce this on those versions.

Niel Drummond · Answer 2 · Wed Mar 20 2024 04:28:32 GMT+0800 (China Standard Time)

I actually can't reproduce this problem. Can you give more detail on your system, and maybe any dockerd logs you find? You can turn on debug logging in the daemon with the following configuration in /etc/docker/daemon.json:

{
	"debug": true,
	"raw-logs": true
}

Nikita Rushmanov · Answer 3 · Wed Mar 20 2024 04:46:32 GMT+0800 (China Standard Time)

That's very strange. I'm able to replicate this on two different machines running different docker versions.

Nikita Rushmanov · Answer 4 · Wed Mar 20 2024 05:04:44 GMT+0800 (China Standard Time)

I think the issue here is caused by the statically stored Docker instance (static DOCKER: OnceCell<Docker>).

Even with flavor = "multi_thread", each #[tokio::test] spawns its own runtime, and the test harness runs the tests concurrently.

As of bollard@0.16, somehow, the Docker instance "absorbs" the first tokio runtime it sees, and if that runtime is dropped while someone else is making a request, you get the error "runtime dropped the dispatch task".
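
The failure mode described above can be illustrated with a small model that uses only the standard library. This is a sketch of the suspected mechanism, not hyper's or bollard's actual implementation: MiniRuntime, Client, and demo are hypothetical names, and a channel receiver stands in for the connection's dispatch task. The shared client parks its dispatch task on the first runtime it sees, so once that runtime is dropped, later requests from any other runtime fail.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for a tokio runtime: it owns the tasks spawned on it,
/// and dropping it drops those tasks.
struct MiniRuntime {
    // Each "dispatch task" is modelled by the receiving end of a channel.
    tasks: Vec<Receiver<u32>>,
}

/// Hypothetical stand-in for a shared client (like the static Docker instance).
/// Requests are handed to a dispatch task through a channel.
struct Client {
    dispatch: Option<Sender<u32>>,
}

impl Client {
    fn request(&mut self, rt: &mut MiniRuntime, payload: u32) -> Result<(), &'static str> {
        // The dispatch task is created lazily, on whichever runtime makes
        // the first request, and is then reused for every later request.
        let tx = self.dispatch.get_or_insert_with(|| {
            let (tx, rx) = channel();
            rt.tasks.push(rx);
            tx
        });
        // Sending fails once the receiving side has been dropped.
        tx.send(payload)
            .map_err(|_| "runtime dropped the dispatch task")
    }
}

/// Two "tests" share one client, each with its own runtime.
fn demo() -> (Result<(), &'static str>, Result<(), &'static str>) {
    let mut client = Client { dispatch: None };
    let mut rt_a = MiniRuntime { tasks: Vec::new() };
    let mut rt_b = MiniRuntime { tasks: Vec::new() };

    // The first request parks the dispatch task on rt_a.
    let first = client.request(&mut rt_a, 1);
    // The first test finishes and its runtime is torn down.
    drop(rt_a);
    // The second test still reuses the dispatch task that lived on rt_a.
    let second = client.request(&mut rt_b, 2);
    (first, second)
}
```

Calling demo() returns Ok(()) for the first request and an Err for the second. In the real test suite the ordering depends on which test finishes first, which would be consistent with the failures being random.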

Niel Drummond · Answer 5 · Wed Mar 20 2024 06:08:20 GMT+0800 (China Standard Time)

Ah yes, I see it now when the tests are all run together.

Niel Drummond · Answer 6 · Wed Mar 20 2024 16:21:10 GMT+0800 (China Standard Time)

I put this test scenario into bollard's CI system, and it seems to fail on all connectors (http / ssl / named pipe / unix socket), which rules out a problem with any individual connector. I also checked locally against the latest master branch of hyper, and it still fails (albeit less often).

Niel Drummond · Answer 7 · Fri Mar 22 2024 01:44:56 GMT+0800 (China Standard Time)

I did find a fix: #390. If you have the time, I'd appreciate it if you could check whether it works for you.

This is related to hyperium/hyper#2312.

Nikita Rushmanov · Answer 8 · Tue Mar 26 2024 01:39:05 GMT+0800 (China Standard Time)

I just got around to testing your fix. I can confirm it works!

Thank you so much for your support on this ❤️