coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

ConcatAlong is very slow

opfromthestart opened this issue · comments

Using (a,b).concat_along(Axis::<1>), each inference takes about 300ms, whereas if I manually convert the tensors to vectors, concatenate them myself, and convert the result back into a tensor, it only takes about 6ms. I checked that both approaches produce the same output, so I'm not missing anything. I'm not sure what the problem is, but I do not think this is acceptable.
This is using the CUDA version.
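Roughly, the manual workaround I mean looks like this (a sketch with placeholder shapes, assuming the as_vec / tensor_from_vec APIs; my real network uses different dimensions):

use dfdx::prelude::*;

// Sketch of the manual workaround: pull both tensors back to host Vecs,
// concatenate on the CPU, and upload the result as a single tensor.
// For (1, N) row tensors, appending the element vectors is equivalent
// to concatenating along axis 1.
fn manual_concat(
    dev: &AutoDevice,
    a: Tensor<Rank2<1, 4>, f32, AutoDevice>,
    b: Tensor<Rank2<1, 2>, f32, AutoDevice>,
) -> Tensor<Rank2<1, 6>, f32, AutoDevice> {
    let mut data = a.as_vec(); // copy a's elements to the host
    data.extend(b.as_vec());   // append b's elements
    dev.tensor_from_vec(data, (Const::<1>, Const::<6>)) // re-upload as one tensor
}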

Can you give the shapes of the a & b tensors? I'll take a look

Also, is this after repeated calls? ConcatAlong JIT compiles the kernel on the first execution, which probably takes somewhere in that range

Yeah this is from JIT

I get these timings from this simple example:

use std::time::Instant;

use dfdx::prelude::*;

fn main() {
    let dev: AutoDevice = Default::default();
    // a is (2, 2) and b is (2, 4); concatenating along axis 1 yields (2, 6)
    let a: Tensor<(Const<2>, usize), f32, _> = dev.zeros_like(&(Const, 2));
    let b: Tensor<(Const<2>, usize), f32, _> = dev.zeros_like(&(Const, 4));
    for _ in 0..10 {
        let start = Instant::now();
        let _: Tensor<Rank2<2, 6>, f32, _> =
            (a.clone(), b.clone()).concat_along(Axis::<1>).realize();
        println!("{:?}", start.elapsed());
    }
}
375.891925ms
20.9µs
11.16µs
11.64µs
9.771µs
9.75µs
9.27µs
9.43µs
9.73µs
9.68µs

Going to close this for now - If you want to investigate if there are ways to speed up JIT compilation times feel free to open a different issue!

The shapes I am concatenating are (Const::<1>, Const::<6240>) with (Const::<1>, Const::<6>). Even in a loop, it appears to always incur this cost, so I think it is either recompiling something every time or never caching the compiled kernel.
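Adapting your snippet to those shapes (following the same zeros_like / realize pattern; this is just a sketch, not my actual network code) would look roughly like:

use std::time::Instant;

use dfdx::prelude::*;

fn main() {
    let dev: AutoDevice = Default::default();
    // same pattern as the snippet above, but with the (1, 6240) and (1, 6) shapes
    let a: Tensor<(Const<1>, usize), f32, _> = dev.zeros_like(&(Const, 6240));
    let b: Tensor<(Const<1>, usize), f32, _> = dev.zeros_like(&(Const, 6));
    for _ in 0..10 {
        let start = Instant::now();
        let _: Tensor<Rank2<1, 6246>, f32, _> =
            (a.clone(), b.clone()).concat_along(Axis::<1>).realize();
        println!("{:?}", start.elapsed());
    }
}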

Running the same test you did, but with my network, it consistently takes about the same time regardless of the number of calls. Is it possible the function was cleared out of the GPU cache, and how would I prevent that?
Network times:

310.782126ms
295.890484ms
291.42786ms
292.147541ms
291.059603ms
303.778666ms
293.815021ms
310.522738ms
301.337522ms
308.999231ms
322.108602ms
321.610593ms
329.548947ms
312.1137ms
344.742673ms
296.267473ms
292.671337ms
291.348365ms
300.508978ms
328.355806ms

Times for CPU hover around 16ms.

I added some debug statements to the ConcatAlong kernel, and

        if !self.dev.has_func(&module_name, "fwd") {

always returns false, so it recompiles the kernel on every single iteration. Is there a way to prevent this?

I added my own caching using a once_cell::sync::Lazy and got an iteration down to 5ms. I think some transparency in how kernels are loaded and unloaded would help.

Ah hmm good info. Definitely should not be compiled every time. What dtype are you using? And can you send a simple snippet that reproduces the behavior?

Are you recreating the device inside the loop (or whatever calls concat_along)? Each device instantiation will need to recompile the kernels
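I.e. a pattern like this (hypothetical sketch, reusing the shapes from above) hits the JIT compiler on every iteration, since the kernel cache lives on the device:

use dfdx::prelude::*;

fn main() {
    for _ in 0..20 {
        // constructing a fresh device every iteration discards its kernel cache,
        // so ConcatAlong has to JIT compile its "fwd" function again each time
        let dev: AutoDevice = Default::default();
        let a: Tensor<(Const<1>, usize), f32, _> = dev.zeros_like(&(Const, 6240));
        let b: Tensor<(Const<1>, usize), f32, _> = dev.zeros_like(&(Const, 6));
        let _: Tensor<Rank2<1, 6246>, f32, _> =
            (a, b).concat_along(Axis::<1>).realize();
    }
    // hoisting `let dev: AutoDevice = Default::default();` above the loop lets
    // every iteration after the first reuse the already-compiled kernel
}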

I was unaware of that. I was creating a new Cuda object every iteration of the loop.
I had thought it was a zero-sized/marker type; I didn't know it mattered.
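That explains it. Combined with the once_cell approach above, the fix is just to construct the device once and reuse it; something like this sketch (assuming AutoDevice can live in a static, i.e. is Send + Sync):

use dfdx::prelude::*;
use once_cell::sync::Lazy;

// Sketch: build the device once and reuse it for every call, so the
// JIT-compiled ConcatAlong kernel stays in the device's cache.
static DEVICE: Lazy<AutoDevice> = Lazy::new(|| AutoDevice::default());

fn inference_step() {
    // reuse the shared device instead of constructing a new one per call
    let a: Tensor<(Const<1>, usize), f32, _> = DEVICE.zeros_like(&(Const, 6240));
    let b: Tensor<(Const<1>, usize), f32, _> = DEVICE.zeros_like(&(Const, 6));
    let _: Tensor<Rank2<1, 6246>, f32, _> =
        (a, b).concat_along(Axis::<1>).realize();
}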

Should probably move to that TBH; this is a fairly common misconception.