klauspost / reedsolomon

Reed-Solomon Erasure Coding in Go

Does not scale well with multiple goroutines

1a1a11a opened this issue

I restrict each coder to one goroutine and launch 8 goroutines decoding different data, but the performance does not scale well. Here are my test cases: 3 data chunks, 1 parity chunk, chunk sizes 128 B to 16 KB.

Using an AWS c5.2xlarge (with AVX-512 support), testing decoding only.

If the chunk size is 128 bytes, single-thread throughput is 635 MB/s, while 8 goroutines give around 921 MB/s (aggregated throughput).

If the chunk size is 1024 bytes, single-thread throughput is around 3704 MB/s, while 8 goroutines give around 5730 MB/s (aggregated throughput).

If the chunk size is 16384 bytes, single-thread throughput is around 9370 MB/s, while 8 goroutines give around 17466 MB/s (aggregated throughput).

First of all, when you restrict it to one goroutine, you are the one doing the scaling, not the reedsolomon package.

Second of all, a c5.2xlarge has 4 physical cores, plus some added bonus from the hyperthreads.

With small sizes, all of these are heavily dominated by setup time. As you can see, bigger blocks give more throughput. At some point you also run into memory bandwidth limitations, since the memory bus is shared between all cores.

So 1) use bigger blocks and/or 2) let the package choose its concurrency level.
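For example, a minimal sketch of option 2 using the package's WithAutoGoroutines option (the 3+1 shard counts are just the ones from this issue):

package main

import (
    "log"

    "github.com/klauspost/reedsolomon"
)

func main() {
    // Let the encoder tune its own internal concurrency for the
    // shard size we expect to process (here: 16 KB shards).
    enc, err := reedsolomon.New(3, 1, reedsolomon.WithAutoGoroutines(16384))
    if err != nil {
        log.Fatal(err)
    }
    _ = enc // use enc.Encode / enc.Reconstruct as usual
}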

Hi @klauspost, thank you for your quick reply! I agree with some of your points; however, a few more observations:

  1. Yes, it is effectively 4 cores with hyper-threading, so I tried 1, 2, 4, 8 and 16 goroutines; 2, 4 and 16 goroutines all perform worse than 8, which is why I reported 8.
  2. I tried the auto-goroutine option, which gives similar results. (I also tried other combinations of options; so far this one gives the best throughput.)
  3. I also tried multiprocessing (limiting each process to one goroutine and launching 2/4/8 benchmark processes); the aggregated throughput is much higher that way (at least 3 times).
  4. My application will have small blocks. Maybe I can do batching, but I am curious why the multi-goroutine scheme does not scale while multiprocessing does (even if not linearly). Is there a lock somewhere?

My guess is that most of it is related to caching. Having to context switch and set up a lot of smaller calculations, or switch between them, will drastically reduce cache efficiency.

Details like whether you are sharing cache lines across goroutines in your input/output can also negatively affect multicore scaling. So maybe inserting some "space" between your different inputs can help, as in the sketch below.
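A minimal sketch of that spacing idea (the helper name and the 64-byte line size are assumptions, not from this thread): carve each goroutine's shards out of its own backing array, rounding the stride up to a cache-line multiple so adjacent shards do not share a line.

package main

const cacheLine = 64

// paddedShards carves n shards of shardSize bytes out of one allocation,
// spacing them so adjacent shards do not end and begin inside the same
// cache line. The full slice expression caps capacity so appends cannot
// spill into a neighbour's region.
func paddedShards(n, shardSize int) [][]byte {
    stride := (shardSize + cacheLine - 1) / cacheLine * cacheLine
    backing := make([]byte, n*stride)
    shards := make([][]byte, n)
    for i := range shards {
        shards[i] = backing[i*stride : i*stride+shardSize : i*stride+shardSize]
    }
    return shards
}

func main() {
    _ = paddedShards(4, 1024) // one padded shard set per goroutine
}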

There are no locks once you have created the Encoder.

Thank you for your reply! It is good to know there is no lock. If it is related to caching, why does multiprocessing (each binary limited to one goroutine, but launched multiple times) have much higher throughput?

The only real difference would be the placement of data in memory.

If your destination slices share cachelines across goroutines there will be a significant overhead. Again, without seeing your benchmarks I'm just guessing.

I just merged an optimization that should help in your setup for small blocks.

Also, I just realized that you are doing 3:1, which honestly could just as well be done with a simple XOR - no need for Reed-Solomon whatsoever.

To construct: parity := d0 ^ d1 ^ d2
To reconstruct: d1 := d0 ^ d2 ^ parity

Example: https://play.golang.org/p/P-XQiLl_Zxv

Creating assembler functions for this is also pretty trivial.
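For reference, a minimal sketch of that 3:1 XOR scheme over byte shards (illustrative helper names, separate from the playground example above):

package main

import "fmt"

// xorInto writes the byte-wise XOR of a, b and c into dst.
// All four slices must have the same length.
func xorInto(dst, a, b, c []byte) {
    for i := range dst {
        dst[i] = a[i] ^ b[i] ^ c[i]
    }
}

func main() {
    d0, d1, d2 := []byte{1, 2}, []byte{3, 4}, []byte{5, 6}

    // Construct: parity = d0 ^ d1 ^ d2.
    parity := make([]byte, len(d0))
    xorInto(parity, d0, d1, d2)

    // Reconstruct: any lost shard is the XOR of the other three.
    recovered := make([]byte, len(d0)) // pretend d1 was lost
    xorInto(recovered, d0, d2, parity)
    fmt.Println(recovered) // [3 4]
}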

Hi @klauspost, the destination does not have any sharing; I have attached my code below. Thank you for mentioning the new optimization and the possibility of using XOR! That is helpful; we are still investigating which schemes to use, and one parity shard is just one possibility.

package main

import (
    "fmt"
    "github.com/klauspost/reedsolomon"
    "log"
    "math"
    "math/rand"
    "runtime"
    "sort"
    "sync"
    "time"
)

func encode(coder reedsolomon.Encoder, k, chunkSize int) (shards [][]byte) {
    var data = make([]byte, chunkSize*k)
    rand.Read(data)

    shards, _ = coder.Split(data)
    _ = coder.Encode(shards)
    ok, _ := coder.Verify(shards)
    if !ok {
        log.Fatal("shard verification failed")
    }
    return shards
}

func loseData(n, k int, shards [][]byte) {
    if n-k == 1 {
        shards[rand.Intn(n)] = nil
    } else {
        for i := 0; i < n-k; i++ {
            // Pick one index and reuse it for both the check and the
            // assignment, so exactly n-k distinct shards are dropped.
            idx := rand.Intn(n)
            if shards[idx] == nil {
                i-- // already lost; retry
            } else {
                shards[idx] = nil
            }
        }
    }
}

func decode(coder reedsolomon.Encoder, n, k int, shards [][]byte) (fixedShards [][]byte) {
    _ = coder.Reconstruct(shards) // reconstructs in place; error ignored in this benchmark
    return shards
}

func Benchmark(n, k int, thrptMap map[int]float64, mtx *sync.Mutex, wg *sync.WaitGroup) {
    defer wg.Done()
    rand.Seed(time.Now().UnixNano())

    // coder, _ := reedsolomon.New(k, n-k)
    coder, _ := reedsolomon.New(k, n-k, reedsolomon.WithMaxGoroutines(1))
    // coder, _ := reedsolomon.New(k, n-k, reedsolomon.WithMaxGoroutines(1), reedsolomon.WithCauchyMatrix())
    // coder, _ := reedsolomon.New(n-k, k, reedsolomon.WithMaxGoroutines(4), reedsolomon.WithCauchyMatrix(), reedsolomon.WithMinSplitSize(1024))

    for chunkSize := 16; chunkSize < int(math.Pow(2, 24)); chunkSize *= 4 {
        // coder, _ := reedsolomon.New(k, n-k, reedsolomon.WithAutoGoroutines(chunkSize), reedsolomon.WithCauchyMatrix())
        encodedData := encode(coder, k, chunkSize)

        startTs := time.Now()
        turnAroundBytes := 0

        for int(time.Since(startTs).Seconds()) < 2 {
            for i := 0; i < 1024; i++ {
                loseData(n, k, encodedData)
                decode(coder, n, k, encodedData)
                turnAroundBytes += chunkSize * k
            }
        }
        thrpt := float64(turnAroundBytes) / 1e6 / time.Since(startTs).Seconds() // MB/s; float seconds avoid integer truncation

        mtx.Lock()
        if v, ok := thrptMap[chunkSize]; ok {
            thrptMap[chunkSize] = v + thrpt
        } else {
            thrptMap[chunkSize] = thrpt
        }
        mtx.Unlock()

        // fmt.Printf("%d\t%d\t%d\t%.4f\n", n, k, chunkSize, thrpt)
    }
}


func BenchmarkParallel(n, k, nThreads int) {
    runtime.GOMAXPROCS(nThreads)
    fmt.Println(n, k, nThreads)
    thrptMap := make(map[int]float64)
    mtx := &sync.Mutex{}
    var wg = sync.WaitGroup{}

    for i := 0; i < nThreads; i++ {
        wg.Add(1)
        go Benchmark(n, k, thrptMap, mtx, &wg)
    }
    wg.Wait()

    keys := make([]int, 0)
    for k := range thrptMap {
        keys = append(keys, k)
    }
    sort.Ints(keys)

    for _, k := range keys {
        fmt.Printf("%d \t %.4f\n", k, thrptMap[k])
    }
}



func main() {
    numcpu := runtime.NumCPU()
    runtime.GOMAXPROCS(numcpu)

    BenchmarkParallel(4, 3, 1)
    BenchmarkParallel(4, 3, 4)
}

Thanks. You do know that rand.Intn() takes a global mutex? So your very small sets will have very serious contention on it. Be careful that you are not benchmarking that instead.
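A sketch of one way around that global lock (my illustration, assuming a per-goroutine source is acceptable for a benchmark): give each benchmark goroutine its own *rand.Rand, which is never contended.

package main

import (
    "math/rand"
    "sync"
    "time"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // Private source per goroutine: Intn takes no global lock here.
            // rand.NewSource is not safe for concurrent use, which is fine
            // because each goroutine owns its own.
            rng := rand.New(rand.NewSource(time.Now().UnixNano() + int64(id)))
            _ = rng.Intn(4)
        }(i)
    }
    wg.Wait()
}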

Hmm, nice catch! I didn't realize that, let me benchmark again without this.

Another (minor) thing you can do is use shards[x] = shards[x][:0] to reset your slice. This prevents allocating a new slice for the output and takes GC mostly out of the equation.
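Applied to the loseData helper above, that would look something like this (a sketch; it assumes the library treats a zero-length shard with spare capacity as missing but reusable):

// loseShard marks shard idx as missing while keeping its backing array,
// so a later Reconstruct can refill it without a fresh allocation.
func loseShard(shards [][]byte, idx int) {
    shards[idx] = shards[idx][:0]
}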

Hi @klauspost, I have redone the benchmark (on a different instance type), but the results seem pretty similar. For large block sizes the scalability is fine, but for small block sizes it is really bad.

Left column is block size in bytes, right is aggregated throughput in MB/s.

One process 
128      613.4170
512      2180.7759
1024     3678.9289
4096     6543.1142
16384    9210.6916
131072   10267.6562
Eight goroutines in one process, each coder limited to one goroutine
128      923.2712
512      3698.5897
1024     6425.1494
4096     13061.0627
16384    17389.5844
131072   28991.0292
Launching the one-goroutine benchmark 8 times in parallel (separate processes)
128      2776
512      9640
1024     16480
4096     29992
16384    41672
131072   45096

My bad, the results in the last post were from the old code; here are the new ones. The scalability is much more reasonable now. Feel free to close the issue. Thank you!

One process 
128 	 724.1073
512 	 2681.7331
1024 	 4938.7930
4096 	 3177.1853
16384 	 6744.4408
131072 	 18119.3933
Eight goroutines in one process (this instance only has 4 physical cores / 8 hyperthreads)
128 	 1471.0211
512 	 7730.6266
1024 	 14769.1930
4096 	 7323.2548
16384 	 23932.6986
131072 	 67645.7349