Used cache blocking, parallelizing, loop unrolling, register blocking, loop ordering, and SSE instructions to optimize the multiplication of large matrices to 55 gFLOPS
Used cache blocking, parallelizing, loop unrolling, register blocking, loop ordering, and SSE instructions to optimize the multiplication of large matrices to 55 gFLOPS