rayon-rs / rayon

Rayon: A data parallelism library for Rust

Matrix multiplication with Rayon doesn't see perf improvements

oliverhu opened this issue · comments

Hello, I'm trying to build a quick matrix multiplication using Rayon, but I don't see any performance improvement when switching from the default iter() to par_iter()...

use rayon::prelude::*;

/// w -> (d, n), x -> (n, 1)
fn matmul(w: &Vec<Vec<f32>>, x: &Vec<f32>) -> Vec<f32> {
    w.par_iter()
        .map(|n| (n, x).into_par_iter().map(|(x, y)| x * y).sum())
        .collect()
}

fn matmul_default(w: Vec<Vec<f32>>, x: Vec<f32>) -> Vec<f32> {
    w.iter()
        .map(|n| n.iter().zip(x.iter()).map(|(x, y)| x * y).sum())
        .collect()
}

When I benchmark a [2000, 2000] @ [2000, 1] multiplication, the parallel version is actually a bit slower...

Never mind, it is faster after all... there was an error in my benchmark code.

You might also want to try it with parallelism only on the outer loop, and leave it sequential inside. There's no inherent problem with nesting parallelism in rayon, but you're unlikely to benefit when the workload is already well balanced, and it will add some overhead to split the inner loop for nothing.
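For instance, a minimal sketch of that shape (untested here; matmul_outer is just an illustrative name), with only the outer map parallel and a plain sequential zip for each dot product:

use rayon::prelude::*;

/// Parallel over the rows of `w` only; each row's dot product with `x`
/// runs sequentially on whichever worker thread picked up that row.
fn matmul_outer(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.par_iter()
        .map(|row| row.iter().zip(x.iter()).map(|(a, b)| a * b).sum())
        .collect()
}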

@cuviper yeah, you're right. After removing the inner parallelism, it is actually 2x faster:

use rayon::prelude::*;

// Parallel over the rows of `w` only; the inner dot product stays sequential.
fn matmul(w: &Vec<Vec<f32>>, x: &Vec<f32>) -> Vec<f32> {
    w.into_par_iter()
        .map(|n| n.iter().enumerate().map(|(id, j)| j * x[id]).sum())
        .collect()
}

// Nested parallelism: parallel over the rows and over each dot product.
fn matmul_2(w: &Vec<Vec<f32>>, x: &Vec<f32>) -> Vec<f32> {
    w.into_par_iter()
        .map(|n| (n, x).into_par_iter().map(|(x, y)| x * y).sum())
        .collect()
}

// Nested parallelism with an indexed inner loop instead of a parallel zip.
fn matmul_3(w: &Vec<Vec<f32>>, x: &Vec<f32>) -> Vec<f32> {
    w.into_par_iter()
        .map(|n| n.into_par_iter().enumerate().map(|(id, j)| j * x[id]).sum())
        .collect()
}

// Sequential baseline.
fn matmul_default(w: Vec<Vec<f32>>, x: Vec<f32>) -> Vec<f32> {
    w.iter()
        .map(|n| n.iter().zip(x.iter()).map(|(x, y)| x * y).sum())
        .collect()
}

Benchmark (1000, 1000) @ (1000, 1) ->

Time elapsed in matmul is: 3.781083ms
Time elapsed in matmul2 is: 10.111667ms
Time elapsed in matmul3 is: 7.760333ms
Time elapsed in matmul_default is: 15.889958ms
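For reference, timings like these could be produced with a simple std::time::Instant harness along the following lines (a hypothetical sketch; the actual benchmark code isn't shown in this issue, and timing a single run like this is noisy):

use std::time::Instant;

fn main() {
    // 1000 x 1000 matrix and 1000-element vector of dummy data.
    let w: Vec<Vec<f32>> = vec![vec![1.0; 1000]; 1000];
    let x: Vec<f32> = vec![1.0; 1000];

    let start = Instant::now();
    let _ = matmul(&w, &x);
    println!("Time elapsed in matmul is: {:?}", start.elapsed());

    let start = Instant::now();
    let _ = matmul_2(&w, &x);
    println!("Time elapsed in matmul2 is: {:?}", start.elapsed());

    let start = Instant::now();
    let _ = matmul_3(&w, &x);
    println!("Time elapsed in matmul3 is: {:?}", start.elapsed());

    let start = Instant::now();
    let _ = matmul_default(w, x); // takes ownership, so run it last
    println!("Time elapsed in matmul_default is: {:?}", start.elapsed());
}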