rayon-rs / rayon

Rayon: A data parallelism library for Rust

Matrix multiplication with Rayon doesn't see perf improvements

oliverhu opened this issue · comments

Hello, I'm trying to build a quick matrix multiplication using Rayon, but I don't see any performance improvement when switching from the default iter() to par_iter()...

use rayon::prelude::*;

/// w -> (d, n), x -> (n, 1)
fn matmul(w: &Vec<Vec<f32>>, x: &Vec<f32>) -> Vec<f32> {
    w.par_iter()
        .map(|n| (n, x).into_par_iter().map(|(x, y)| x * y).sum())
        .collect()
}

fn matmul_default(w: Vec<Vec<f32>>, x: Vec<f32>) -> Vec<f32> {
    w.iter()
        .map(|n| n.iter().zip(x.iter()).map(|(x, y)| x * y).sum())
        .collect()
}

When I benchmark a [2000, 2000] @ [2000, 1] multiplication, the parallel version is actually a bit slower...

Never mind, it is faster after all... there was an error in my benchmark code.

You might also want to try it with parallelism only on the outer loop, and leave it sequential inside. There's no inherent problem with nesting parallelism in rayon, but you're unlikely to benefit when the workload is already well balanced, and it will add some overhead to split the inner loop for nothing.
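For instance, a minimal sketch of that shape (untested here; matmul_outer is just an illustrative name), with only the outer map parallel and a plain sequential zip for each dot product:

use rayon::prelude::*;

/// Parallel over the rows of `w` only; each row's dot product with `x`
/// runs sequentially on whichever worker thread picked up that row.
fn matmul_outer(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.par_iter()
        .map(|row| row.iter().zip(x.iter()).map(|(a, b)| a * b).sum())
        .collect()
}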

@cuviper yeah, you're right. After removing the inner parallelism, it is actually 2x faster:

use rayon::prelude::*;

// Parallel over the rows of `w` only; the inner dot product stays sequential.
fn matmul(w: &Vec<Vec<f32>>, x: &Vec<f32>) -> Vec<f32> {
    w.into_par_iter()
        .map(|n| n.iter().enumerate().map(|(id, j)| j * x[id]).sum())
        .collect()
}

// Nested parallelism: parallel over the rows and over each dot product.
fn matmul_2(w: &Vec<Vec<f32>>, x: &Vec<f32>) -> Vec<f32> {
    w.into_par_iter()
        .map(|n| (n, x).into_par_iter().map(|(x, y)| x * y).sum())
        .collect()
}

// Nested parallelism with an indexed inner loop instead of a parallel zip.
fn matmul_3(w: &Vec<Vec<f32>>, x: &Vec<f32>) -> Vec<f32> {
    w.into_par_iter()
        .map(|n| n.into_par_iter().enumerate().map(|(id, j)| j * x[id]).sum())
        .collect()
}

// Sequential baseline.
fn matmul_default(w: Vec<Vec<f32>>, x: Vec<f32>) -> Vec<f32> {
    w.iter()
        .map(|n| n.iter().zip(x.iter()).map(|(x, y)| x * y).sum())
        .collect()
}

Benchmark (1000, 1000) @ (1000, 1) ->

Time elapsed in matmul is: 3.781083ms
Time elapsed in matmul2 is: 10.111667ms
Time elapsed in matmul3 is: 7.760333ms
Time elapsed in matmul_default is: 15.889958ms
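For reference, timings like these could be produced with a simple std::time::Instant harness along the following lines (a hypothetical sketch; the actual benchmark code isn't shown in this issue, and timing a single run like this is noisy):

use std::time::Instant;

fn main() {
    // 1000 x 1000 matrix and 1000-element vector of dummy data.
    let w: Vec<Vec<f32>> = vec![vec![1.0; 1000]; 1000];
    let x: Vec<f32> = vec![1.0; 1000];

    let start = Instant::now();
    let _ = matmul(&w, &x);
    println!("Time elapsed in matmul is: {:?}", start.elapsed());

    let start = Instant::now();
    let _ = matmul_2(&w, &x);
    println!("Time elapsed in matmul2 is: {:?}", start.elapsed());

    let start = Instant::now();
    let _ = matmul_3(&w, &x);
    println!("Time elapsed in matmul3 is: {:?}", start.elapsed());

    let start = Instant::now();
    let _ = matmul_default(w, x); // takes ownership, so run it last
    println!("Time elapsed in matmul_default is: {:?}", start.elapsed());
}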