Performance issue or misuse?
kostyfisik opened this issue
Konstantin Ladutenko commented
Dear @romeric,
I've added Fastor to the test by 2b-t, in the fastor branch, to compare it against OpenMP SIMD for std::vector. I use Fastor this way:
// Fastor
std::cout << " -C++ Fastor: ";
Fastor::Tensor<double,length> x_tensor(x_vec);
Fastor::Tensor<double,length> y_tensor(y_vec);
Timer stopwatch;
stopwatch.Start();
for (size_t i = 0; i < it; ++i)
{
    auto res = Fastor::einsum<Fastor::Index<0>,Fastor::Index<0> > (x_tensor, y_tensor);
    Fastor::unused(res);
}
stopwatch.Stop();
std::cout << " runtime: " << stopwatch.GetRuntime() << ", result: " << Fastor::einsum<Fastor::Index<0>,Fastor::Index<0> > (x_tensor, y_tensor) << std::endl;
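For context, the "Vector OMP SIMD" baseline that the Fastor version is compared against could look roughly like the sketch below. This is not the actual code from the benchmark; the function name dot_omp_simd is hypothetical, and the sketch shows only the SIMD reduction, without any multi-threading the real test may add.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the benchmark): dot product of two
// std::vector<double> using an OpenMP SIMD reduction.
// If the compiler is run without -fopenmp or -fopenmp-simd, the pragma is
// ignored and the loop runs as ordinary code.
double dot_omp_simd(const std::vector<double>& x, const std::vector<double>& y)
{
    double sum = 0.0;
    const std::size_t n = x.size();  // assumes y.size() >= x.size()
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
    {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Calling dot_omp_simd(x_vec, y_vec) inside the same timing loop would give a like-for-like comparison with the einsum call above.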
The results on my laptop (i7-8550U) look like this:
~/github/OMP-AVX-intrinsics-dotprod$ make clean && make && ./bin/main.GCC
Compiled src/main.cpp successfully!
g++ obj/main.o -O3 -flto -lgomp -o bin/main.GCC
Linking complete!
DATA ALIGNMENT
Vector (100000 elements):
first_element%cache_line: 16
length%cache_line: 0
Array (100000 elements):
first_element%cache_line: 0
length%cache_line: 0
STARTING BENCHMARKS with 100000 iterations
-C++ Vector OMP SIMD: runtime: 1.654, result: 25004.864
-C++ Span OMP SIMD: runtime: 1.839, result: 25004.864
-C++ Array OMP SIMD: runtime: 0.930, result: 25004.864
-C++ Array AVX2 OMP: runtime: 1.011, result: 25004.864
-C++ Fastor: runtime: 2.251, result: 25004.863542789789790
... another run ...
STARTING BENCHMARKS with 100000 iterations
-C++ Vector OMP SIMD: runtime: 1.650, result: 25004.864
-C++ Span OMP SIMD: runtime: 1.827, result: 25004.864
-C++ Array OMP SIMD: runtime: 0.859, result: 25004.864
-C++ Array AVX2 OMP: runtime: 0.914, result: 25004.864
-C++ Fastor: runtime: 2.154, result: 25004.863542789789790
Note that to get the best performance from OMP, I had to set omp_threads at the beginning of main.cpp to twice the number of physical cores on my CPU.
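One way to arrive at such a thread count without hard-coding it is sketched below. It assumes a CPU with two hardware threads per physical core, so that the logical core count equals twice the physical count; configure_omp_threads is a hypothetical helper and not how main.cpp in the benchmark sets omp_threads.

```cpp
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: request one OpenMP thread per logical core.
// Assumption: with two hardware threads per core, this is twice the
// number of physical cores.
unsigned configure_omp_threads()
{
    unsigned logical = std::thread::hardware_concurrency();
    if (logical == 0)
    {
        logical = 1;  // hardware_concurrency() may return 0 when unknown
    }
#ifdef _OPENMP
    // Only has an effect when compiled with OpenMP enabled (e.g. -fopenmp).
    omp_set_num_threads(static_cast<int>(logical));
#endif
    return logical;
}
```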
It looks like Fastor's performance is far from maximum in this case. Am I doing something wrong?
Note that if I reduce the size of the vector from 1e5 to 1e3, then Fastor really shines:
STARTING BENCHMARKS with 10000000 iterations
-C++ Vector OMP SIMD: runtime: 15.115, result: 257.232
-C++ Span OMP SIMD: runtime: 16.009, result: 257.232
-C++ Array OMP SIMD: runtime: 14.667, result: 257.232
-C++ Array AVX2 OMP: runtime: 14.410, result: 257.232
-C++ Fastor: runtime: 0.927, result: 257.232150633568438
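When comparing the two sets of numbers, it may help to normalize by the amount of work done. The small sketch below does that arithmetic; the function names are hypothetical, and the runtime unit is whatever the benchmark's timer reports, which the output above does not state.

```cpp
// Total number of element multiplications performed in one benchmark run.
double total_elements(double iterations, double length)
{
    return iterations * length;
}

// Runtime per element, in the (unstated) unit reported by the benchmark.
double runtime_per_element(double runtime, double iterations, double length)
{
    return runtime / total_elements(iterations, length);
}
```

Both configurations process the same total number of elements: total_elements(1e5, 1e5) and total_elements(1e7, 1e3) each give 1e10, so the reported runtimes of the two runs can be compared directly.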
Best regards,
Konstantin Ladutenko