Performance issue or misuse?

Question

kostyfisik opened this issue 3 years ago

Konstantin Ladutenko commented 3 years ago

I've added Fastor to the dot-product benchmark by 2b-t (in the fastor branch here) to compare it against OpenMP SIMD on std::vector. I use Fastor this way:

// Fastor
    std::cout << " -C++ Fastor:            ";
    Fastor::Tensor<double,length> x_tensor(x_vec);
    Fastor::Tensor<double,length> y_tensor(y_vec);
    Timer stopwatch;
    stopwatch.Start();

    for (size_t i = 0; i < it; ++i)
    {
      auto res = Fastor::einsum<Fastor::Index<0>,Fastor::Index<0> > (x_tensor, y_tensor);
      Fastor::unused(res);
    }

    stopwatch.Stop();
    std::cout << " runtime: " << stopwatch.GetRuntime() << ", result: " << Fastor::einsum<Fastor::Index<0>,Fastor::Index<0> > (x_tensor, y_tensor) << std::endl;

The results on my laptop (i7-8550U) look like this:

~/github/OMP-AVX-intrinsics-dotprod$ make clean && make  && ./bin/main.GCC 
Compiled src/main.cpp successfully!
g++  obj/main.o  -O3 -flto -lgomp -o bin/main.GCC
Linking complete!

DATA ALIGNMENT
Vector (100000 elements):
 first_element%cache_line: 16
 length%cache_line:        0
Array (100000 elements):
 first_element%cache_line: 0
 length%cache_line:        0

STARTING BENCHMARKS with 100000 iterations
 -C++ Vector OMP SIMD:    runtime: 1.654, result: 25004.864
 -C++ Span   OMP SIMD:    runtime: 1.839, result: 25004.864
 -C++ Array  OMP SIMD:    runtime: 0.930, result: 25004.864
 -C++ Array  AVX2 OMP:    runtime: 1.011, result: 25004.864
 -C++ Fastor:             runtime: 2.251, result: 25004.863542789789790

... another run ...
STARTING BENCHMARKS with 100000 iterations
 -C++ Vector OMP SIMD:    runtime: 1.650, result: 25004.864
 -C++ Span   OMP SIMD:    runtime: 1.827, result: 25004.864
 -C++ Array  OMP SIMD:    runtime: 0.859, result: 25004.864
 -C++ Array  AVX2 OMP:    runtime: 0.914, result: 25004.864
 -C++ Fastor:             runtime: 2.154, result: 25004.863542789789790

Note that to get the best performance from OMP, I had to set omp_threads to twice the number of physical cores on my CPU at the beginning of main.cpp.
It looks like Fastor's performance is far from the maximum in this case. Am I doing something wrong?
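
For context, the OpenMP baselines above boil down to a threaded, vectorized reduction loop of roughly the following shape. This is only a minimal sketch, not the exact code from the benchmark repository; the function name dot_omp_simd is hypothetical and used purely for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the benchmark repository):
// dot product of two vectors using an OpenMP reduction.
double dot_omp_simd(const std::vector<double>& x, const std::vector<double>& y)
{
    // Only iterate over the common length so mismatched sizes stay in bounds.
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    const double* px = x.data();
    const double* py = y.data();

    double sum = 0.0;
    // Split the loop across threads and vectorize each chunk;
    // the partial sums are combined by the reduction clause.
    #pragma omp parallel for simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
    {
        sum += px[i] * py[i];
    }
    return sum;
}
```

With GCC or Clang this needs -fopenmp at compile time to take effect; without that flag the pragma is ignored and the loop simply runs serially. Because the threaded version pays a fork/join cost on every call, it would plausibly lose to a single-threaded kernel on short vectors, which may be part of what the 1e3 case shows.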

Note that if I reduce the size of the vector from 1e5 to 1e3, then Fastor really shines:

STARTING BENCHMARKS with 10000000 iterations
 -C++ Vector OMP SIMD:    runtime: 15.115, result: 257.232
 -C++ Span   OMP SIMD:    runtime: 16.009, result: 257.232
 -C++ Array  OMP SIMD:    runtime: 14.667, result: 257.232
 -C++ Array  AVX2 OMP:    runtime: 14.410, result: 257.232
 -C++ Fastor:             runtime: 0.927, result: 257.232150633568438

Best regards,
Konstantin Ladutenko