flame / blis

BLAS-like Library Instantiation Software Framework

Excuse me, is the performance evaluation of small/skinny matrices by the following program correct?

ProgrammerWLY opened this issue

		if ( bli_does_notrans( transa ) )
			bli_obj_create( dt, m, k, rs_a, cs_a, &a );
		else
			bli_obj_create( dt, k, m, cs_a, rs_a, &a );

		if ( bli_does_notrans( transb ) )
			bli_obj_create( dt, k, n, rs_b, cs_b, &b );
		else
			bli_obj_create( dt, n, k, cs_b, rs_b, &b );

		if ( bli_does_notrans( transc ) )
			bli_obj_create( dt, m, n, rs_c, cs_c, &c );
		else
			bli_obj_create( dt, n, m, cs_c, rs_c, &c );

		bli_randm( &a );
		bli_randm( &b );
		bli_randm( &c );

		// warm up
		for(i = 0 ; i < 2; i++)
			bli_gemm( &alpha, &a, &b, &beta, &c);

		// loop = 50
		start = dclock();
		for(i = 0 ; i < loop; i++)
			bli_gemm( &alpha, &a, &b, &beta, &c);
		cost = (dclock() - start)/loop;

		printf("blis sup : M=%d , N= %d, K =%d, Gflops= %lf, effic = %lf%\n", 
			m, n, k, ops/cost, ops/cost /8.8 * 100);

		bli_obj_free( &a );
		bli_obj_free( &b );
		bli_obj_free( &c );

A few comments.

  1. We don't typically advocate for honoring a trans_t parameter on the output matrix.
  2. This code leaves out the setting of alpha and beta (and other details).
  3. Whether small/skinny execution takes place (vs. the conventional code path) is decided by comparing the problem dimensions to hardware-specific thresholds. Thus, I can't really say from the above code whether the sup (small/unpacked) code path would execute at all.
  4. Your code calculates the average execution time. There's nothing inherently wrong with this. That said, we almost always report the fastest of n_repeat executions. (Note: for us, n_repeat == 3 is typical, though there is nothing special about that number other than being low enough to let a suite of experiments finish in a reasonable time.)
  5. BLIS does not export a dclock() function, so I'm assuming you are defining that on your own (or obtaining it elsewhere). Instead, we have bli_clock() and its helper function, bli_clock_min_diff(), which we use for reporting the fastest of multiple trials, as mentioned above. (See test/test_gemm.c for an example of how this is used in a simple setting; a sketch of that pattern follows this list.)
  6. We don't usually perform any "warm up" executions, in part because we report the fastest rather than the average time, but also because there is no guarantee that the warmed-up data will still reside in the core's local caches by the time the measured trials begin. Why? The OS scheduler may have since migrated the process to another core. You can remove that uncertainty, however, by binding threads to cores, as described in our Multithreading.md document. (That said, while there is no guarantee that the process will stay in one place, in practice it's probably unlikely to move in that short a time.)
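
For concreteness, here is a minimal sketch of the pattern from items 2, 4, and 5, loosely modeled on test/test_gemm.c. It is not a drop-in benchmark: the problem sizes, the n_repeat value, and the peak_gflops constant used for the efficiency figure are placeholder assumptions you would substitute with your own, and passing 0 strides to bli_obj_create() simply lets BLIS pick its default storage.

	#include <stdio.h>
	#include <float.h>
	#include "blis.h"

	int main( void )
	{
		num_t  dt = BLIS_DOUBLE;
		dim_t  m = 8, n = 8, k = 1000;    // placeholder problem sizes
		int    n_repeat = 3;              // report the fastest of n_repeat trials
		obj_t  a, b, c, c_save, alpha, beta;
		double dtime, dtime_save = DBL_MAX;

		// Passing 0 strides lets BLIS choose its default storage.
		bli_obj_create( dt, m, k, 0, 0, &a );
		bli_obj_create( dt, k, n, 0, 0, &b );
		bli_obj_create( dt, m, n, 0, 0, &c );
		bli_obj_create( dt, m, n, 0, 0, &c_save );

		// Create alpha and beta as 1x1 objects and set their values explicitly.
		bli_obj_create_1x1( dt, &alpha );
		bli_obj_create_1x1( dt, &beta );
		bli_setsc( 1.0, 0.0, &alpha );
		bli_setsc( 1.0, 0.0, &beta );

		bli_randm( &a );
		bli_randm( &b );
		bli_randm( &c );
		bli_copym( &c, &c_save );

		for ( int r = 0; r < n_repeat; ++r )
		{
			// Restore c so every trial performs identical work.
			bli_copym( &c_save, &c );

			dtime      = bli_clock();
			bli_gemm( &alpha, &a, &b, &beta, &c );
			dtime_save = bli_clock_min_diff( dtime_save, dtime );
		}

		// 2mnk flops for gemm; peak_gflops is a machine-specific assumption.
		double gflops      = ( 2.0 * m * n * k ) / ( dtime_save * 1.0e9 );
		double peak_gflops = 8.8;
		printf( "m=%d n=%d k=%d  GFLOPS=%7.2f  efficiency=%5.1f%%\n",
		        (int)m, (int)n, (int)k, gflops, 100.0 * gflops / peak_gflops );

		bli_obj_free( &a );
		bli_obj_free( &b );
		bli_obj_free( &c );
		bli_obj_free( &c_save );
		bli_obj_free( &alpha );
		bli_obj_free( &beta );

		return 0;
	}

bli_clock_min_diff( dtime_save, dtime ) returns the smaller of dtime_save and the time elapsed since dtime, so the loop naturally keeps the fastest trial. Compile and link against your BLIS installation (e.g. -lblis) and adjust peak_gflops to your machine.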