klauspost / reedsolomon

Reed-Solomon Erasure Coding in Go

Does not scale well with multiple goroutines

1a1a11a opened this issue

I restrict each coder to one goroutine and launch 8 goroutines decoding different data, but the performance does not scale well. Here are my test cases: 3 data chunks, 1 parity chunk, chunk sizes 128 B to 16 KB.

Using an AWS c5.2xlarge (with AVX-512 support), testing decoding only.

If the chunk size is 128 bytes, single-thread throughput is 635 MB/s, while 8 goroutines give around 921 MB/s (aggregated throughput).

If the chunk size is 1024 bytes, single-thread throughput is around 3704 MB/s, while 8 goroutines give around 5730 MB/s (aggregated throughput).

If the chunk size is 16384 bytes, single-thread throughput is around 9370 MB/s, while 8 goroutines give around 17466 MB/s (aggregated throughput).

First of all, when you restrict it to one goroutine, you are the one doing the scaling, not the reedsolomon package.

Second of all, a c5.2xlarge has 4 physical cores, plus some added bonus from the hyperthreads.

With small sizes, all of these are heavily dominated by setup time. As you can see, bigger blocks give more throughput. At some point you also run into memory bandwidth limitations, since the memory bus is shared between all cores.

So 1) use bigger blocks and/or 2) let the package choose its concurrency level.
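For example, a minimal sketch of option 2 using the package's WithAutoGoroutines option (the 3+1 shard counts are just the ones from this issue):

package main

import (
    "log"

    "github.com/klauspost/reedsolomon"
)

func main() {
    // Let the encoder tune its own internal concurrency for the
    // shard size we expect to process (here: 16 KB shards).
    enc, err := reedsolomon.New(3, 1, reedsolomon.WithAutoGoroutines(16384))
    if err != nil {
        log.Fatal(err)
    }
    _ = enc // use enc.Encode / enc.Reconstruct as usual
}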

Hi @klauspost, thank you for your quick reply! I agree with some of your points; however, a few more observations:

  1. Yes, it is effectively 4 cores with hyper-threading, so I tried 1, 2, 4, 8 and 16 goroutines; 2, 4 and 16 goroutines all perform worse than 8, which is why I reported 8.
  2. I tried the auto-goroutine option, which gives similar results. (I also tried other combinations of options; so far this one gives the best throughput.)
  3. I also tried multiprocessing (limiting each process to one goroutine and launching 2/4/8 benchmark processes); the aggregated throughput is much higher that way (at least 3 times).
  4. My application will have small blocks. Maybe I can do batching, but I am curious why the multi-goroutine scheme does not scale while multiprocessing does (even if not linearly). Is there a lock somewhere?

My guess is that most of it is related to caching. Having to context switch and set up a lot of smaller calculations, or switch between them, will drastically reduce cache efficiency.

Details like whether you are sharing cache lines across goroutines in your input/output can also negatively affect multicore scaling. So maybe inserting some "space" between your different inputs can help, as in the sketch below.
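A minimal sketch of that spacing idea (the helper name and the 64-byte line size are assumptions, not from this thread): carve each goroutine's shards out of its own backing array, rounding the stride up to a cache-line multiple so adjacent shards do not share a line.

package main

const cacheLine = 64

// paddedShards carves n shards of shardSize bytes out of one allocation,
// spacing them so adjacent shards do not end and begin inside the same
// cache line. The full slice expression caps capacity so appends cannot
// spill into a neighbour's region.
func paddedShards(n, shardSize int) [][]byte {
    stride := (shardSize + cacheLine - 1) / cacheLine * cacheLine
    backing := make([]byte, n*stride)
    shards := make([][]byte, n)
    for i := range shards {
        shards[i] = backing[i*stride : i*stride+shardSize : i*stride+shardSize]
    }
    return shards
}

func main() {
    _ = paddedShards(4, 1024) // one padded shard set per goroutine
}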

There are no locks once you have created the Encoder.

Thank you for your reply! It is good to know there is no lock. If it is related to caching, why does multiprocessing (each binary limited to one goroutine, but launched multiple times) have much higher throughput?

The only real difference would be the placement of data in memory.

If your destination slices share cachelines across goroutines there will be a significant overhead. Again, without seeing your benchmarks I'm just guessing.

I just merged an optimization that should help in your setup for small blocks.

Also, I just realized that you are doing 3:1, which honestly could just as well be done with a simple XOR - no need for Reed-Solomon whatsoever.

To construct: parity := d0 ^ d1 ^ d2
To reconstruct: d1 := d0 ^ d2 ^ parity

Example: https://play.golang.org/p/P-XQiLl_Zxv

Creating assembler functions for this is also pretty trivial.
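For reference, a minimal sketch of that 3:1 XOR scheme over byte shards (illustrative helper names, separate from the playground example above):

package main

import "fmt"

// xorInto writes the byte-wise XOR of a, b and c into dst.
// All four slices must have the same length.
func xorInto(dst, a, b, c []byte) {
    for i := range dst {
        dst[i] = a[i] ^ b[i] ^ c[i]
    }
}

func main() {
    d0, d1, d2 := []byte{1, 2}, []byte{3, 4}, []byte{5, 6}

    // Construct: parity = d0 ^ d1 ^ d2.
    parity := make([]byte, len(d0))
    xorInto(parity, d0, d1, d2)

    // Reconstruct: any lost shard is the XOR of the other three.
    recovered := make([]byte, len(d0)) // pretend d1 was lost
    xorInto(recovered, d0, d2, parity)
    fmt.Println(recovered) // [3 4]
}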

Hi @klauspost, the destination does not have any sharing; I have attached my code below. Thank you for mentioning the new optimization and the possibility of using XOR! That is helpful; we are still investigating which schemes to use, and one parity shard is just one possibility.

package main

import (
    "fmt"
    "github.com/klauspost/reedsolomon"
    "log"
    "math"
    "math/rand"
    "runtime"
    "sort"
    "sync"
    "time"
)

func encode(coder reedsolomon.Encoder, k, chunkSize int) (shards [][]byte) {
    var data = make([]byte, chunkSize*k)
    rand.Read(data)

    shards, _ = coder.Split(data)
    _ = coder.Encode(shards)
    ok, _ := coder.Verify(shards)
    if !ok {
        log.Fatal("shard verification failed")
    }
    return shards
}

func loseData(n, k int, shards [][]byte) {
    if n-k == 1 {
        shards[rand.Intn(n)] = nil
    } else {
        for i := 0; i < n-k; i++ {
            // Pick one index and reuse it for both the check and the
            // assignment, so exactly n-k distinct shards are dropped.
            idx := rand.Intn(n)
            if shards[idx] == nil {
                i-- // already lost; retry
            } else {
                shards[idx] = nil
            }
        }
    }
}

func decode(coder reedsolomon.Encoder, n, k int, shards [][]byte) (fixedShards [][]byte) {
    _ = coder.Reconstruct(shards) // reconstructs in place; error ignored in this benchmark
    return shards
}

func Benchmark(n, k int, thrptMap map[int]float64, mtx *sync.Mutex, wg *sync.WaitGroup) {
    defer wg.Done()
    rand.Seed(time.Now().UnixNano())

    // coder, _ := reedsolomon.New(k, n-k)
    coder, _ := reedsolomon.New(k, n-k, reedsolomon.WithMaxGoroutines(1))
    // coder, _ := reedsolomon.New(k, n-k, reedsolomon.WithMaxGoroutines(1), reedsolomon.WithCauchyMatrix())
    // coder, _ := reedsolomon.New(n-k, k, reedsolomon.WithMaxGoroutines(4), reedsolomon.WithCauchyMatrix(), reedsolomon.WithMinSplitSize(1024))

    for chunkSize := 16; chunkSize < int(math.Pow(2, 24)); chunkSize *= 4 {
        // coder, _ := reedsolomon.New(k, n-k, reedsolomon.WithAutoGoroutines(chunkSize), reedsolomon.WithCauchyMatrix())
        encodedData := encode(coder, k, chunkSize)

        startTs := time.Now()
        turnAroundBytes := 0

        for int(time.Since(startTs).Seconds()) < 2 {
            for i := 0; i < 1024; i++ {
                loseData(n, k, encodedData)
                decode(coder, n, k, encodedData)
                turnAroundBytes += chunkSize * k
            }
        }
        thrpt := float64(turnAroundBytes) / 1e6 / time.Since(startTs).Seconds() // MB/s; float seconds avoid integer truncation

        mtx.Lock()
        if v, ok := thrptMap[chunkSize]; ok {
            thrptMap[chunkSize] = v + thrpt
        } else {
            thrptMap[chunkSize] = thrpt
        }
        mtx.Unlock()

        // fmt.Printf("%d\t%d\t%d\t%.4f\n", n, k, chunkSize, thrpt)
    }
}


func BenchmarkParallel(n, k, nThreads int) {
    runtime.GOMAXPROCS(nThreads)
    fmt.Println(n, k, nThreads)
    thrptMap := make(map[int]float64)
    mtx := &sync.Mutex{}
    var wg = sync.WaitGroup{}

    for i := 0; i < nThreads; i++ {
        wg.Add(1)
        go Benchmark(n, k, thrptMap, mtx, &wg)
    }
    wg.Wait()

    keys := make([]int, 0)
    for k := range thrptMap {
        keys = append(keys, k)
    }
    sort.Ints(keys)

    for _, k := range keys {
        fmt.Printf("%d \t %.4f\n", k, thrptMap[k])
    }
}



func main() {
    numcpu := runtime.NumCPU()
    runtime.GOMAXPROCS(numcpu)

    BenchmarkParallel(4, 3, 1)
    BenchmarkParallel(4, 3, 4)
}

Thanks. You do know that rand.Intn() takes a global mutex? So your very small sets will have very serious contention on it. Be careful that you are not benchmarking that instead.
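A sketch of one way around that global lock (my illustration, assuming a per-goroutine source is acceptable for a benchmark): give each benchmark goroutine its own *rand.Rand, which is never contended.

package main

import (
    "math/rand"
    "sync"
    "time"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // Private source per goroutine: Intn takes no global lock here.
            // rand.NewSource is not safe for concurrent use, which is fine
            // because each goroutine owns its own.
            rng := rand.New(rand.NewSource(time.Now().UnixNano() + int64(id)))
            _ = rng.Intn(4)
        }(i)
    }
    wg.Wait()
}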

Hmm, nice catch! I didn't realize that, let me benchmark again without this.

Another (minor) thing you can do is use shards[x] = shards[x][:0] to reset your slice. This prevents allocating a new slice for the output and takes GC mostly out of the equation.
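Applied to the loseData helper above, that would look something like this (a sketch; it assumes the library treats a zero-length shard with spare capacity as missing but reusable):

// loseShard marks shard idx as missing while keeping its backing array,
// so a later Reconstruct can refill it without a fresh allocation.
func loseShard(shards [][]byte, idx int) {
    shards[idx] = shards[idx][:0]
}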

Hi @klauspost, I have redone the benchmark (on a different instance type), but the results seem pretty similar. For large block sizes the scalability is fine, but for small block sizes it is really bad.

Left column is block size in bytes, right is aggregated throughput in MB/s.

One process 
128      613.4170
512      2180.7759
1024     3678.9289
4096     6543.1142
16384    9210.6916
131072   10267.6562
Eight goroutines in one process, each coder limited to one goroutine
128      923.2712
512      3698.5897
1024     6425.1494
4096     13061.0627
16384    17389.5844
131072   28991.0292
Launching the one-goroutine benchmark 8 times in parallel (separate processes)
128      2776
512      9640
1024     16480
4096     29992
16384    41672
131072   45096

My bad, the results in the last post were from the old code; here are the new ones. The scalability is much more reasonable now. Feel free to close the issue. Thank you!

One process 
128 	 724.1073
512 	 2681.7331
1024 	 4938.7930
4096 	 3177.1853
16384 	 6744.4408
131072 	 18119.3933
Eight goroutines in one process (this instance only has 4 physical cores / 8 hyperthreads)
128 	 1471.0211
512 	 7730.6266
1024 	 14769.1930
4096 	 7323.2548
16384 	 23932.6986
131072 	 67645.7349