TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust

Scaling issue when there are more workers than cores.

ryzhyk opened this issue

We have a CI job that submits many tiny transactions to 4 workers. We notice that when we run the job in a container with one CPU core, it runs 100x slower than when it has 4 cores available to it. Here is a minimal repro.

// `Input` provides `new_collection` on timely scopes.
use differential_dataflow::input::Input;

fn main() {
    let mut args = std::env::args();
    args.next();
    // First argument: number of input rounds to load.
    let iterations: u32 = args.next().unwrap().parse().unwrap();

    // Remaining arguments (e.g. `-w 4`) are passed on to timely.
    timely::execute_from_args(std::env::args().skip(2), move |worker| {
        let (mut input, probe) = worker.dataflow::<u32,_,_>(|scope| {

            let (input, values) = scope.new_collection::<_,i32>();
            let out = values.map(|x| x);
            let probe = out.probe();

            (input, probe)
        });

        for x in 1 .. iterations {
            input.update(x, 1);
            input.advance_to(x);
            input.flush();
            // Busy-wait until this round has been fully processed.
            worker.step_while(|| probe.less_than(input.time()));
        }
    }).unwrap();
}

When I run this program with 4 CPU cores using the taskset command:

taskset --cpu-list 1,2,3,4  cargo run --example test 5000 -w 4

it completes in 1.5s. But when I run it on one core:

taskset --cpu-list 1  cargo run --example test 5000 -w 4

it takes 115s.

I realize this is a pathological example, since timely/DD are not optimized for tiny transactions, and normally the number of workers should not exceed the number of CPUs. But I can imagine scenarios where workload fluctuations reduce the number of cores available to DD, and ideally that should not lead to such dramatic slowdowns. So I was wondering what causes this, and whether it is expected behavior or a performance bug.

I think it is not unexpected. Each of the worker threads runs continually, and as long as it thinks it has work to do it will not yield the core. You can have them yield periodically, manually, of course. I definitely recommend not oversubscribing the CPUs and then relying on the OS scheduler, as it has much less information than timely does and will not move between operators as quickly as timely can.
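As a rough sketch of that manual yielding (assuming the repro's inner loop, and driving the worker with explicit steps):

// Sketch only: yield the core after each step so that, when CPUs are
// oversubscribed, the OS gets a chance to run the other worker threads.
while probe.less_than(input.time()) {
    worker.step();
    std::thread::yield_now();
}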

You can also increase the concurrency. Your

worker.step_while(|| probe.less_than(input.time()));

call prevents concurrent work from being loaded, effectively introducing a barrier, when you could instead continue to load data (in this simple example, but that is also generally how TD works best). The coarse granularity of OS scheduling is exacerbated by the small amount of work performed before you ask the workers to synchronize.
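As a rough sketch of that idea in the repro (the batch size of 100 is an arbitrary choice for illustration), you could keep loading updates and only flush and wait every so often:

// Sketch only: load a batch of updates before synchronizing, so each
// round of waiting amortizes more work. BATCH is an arbitrary constant.
const BATCH: u32 = 100;

for x in 1 .. iterations {
    input.update(x, 1);
    input.advance_to(x);
    if x % BATCH == 0 {
        input.flush();
        worker.step_while(|| probe.less_than(input.time()));
    }
}
// Drain whatever remains after the last full batch.
input.flush();
worker.step_while(|| probe.less_than(input.time()));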

You can also just avoid worker.step_while(), which is the source of the busy waiting. If you instead had a loop that called worker.step_or_park(None), the thread would yield when it runs out of work.
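A minimal sketch of that change in the repro's inner loop:

// Sketch only: park the worker thread instead of busy-waiting.
// `None` means park with no timeout; the thread is woken again
// when new data or progress messages arrive for it.
while probe.less_than(input.time()) {
    worker.step_or_park(None);
}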

Fair enough. Each worker uses up its entire time slice without making any progress, so even a trivial transaction requires at least one time slice per worker, and probably more than that. step_or_park does solve the problem. Thanks!

Yup. Perhaps there should be a clearer warning on step_while(), as it can get you into trouble in ways that step() and step_or_park() can't as easily.

In hindsight, it's pretty obvious. I feel stupid for not figuring this out myself. Thanks again for the quick response!

No worries. When I wrote the first response, it hadn't occurred to me either that step_or_park was what you wanted.