eembc / coremark

CoreMark® is an industry-standard benchmark that measures the performance of central processing units (CPUs) and embedded microcontrollers (MCUs).


Question regarding "Must execute for at least 10 secs for a valid result"

yuxianch opened this issue

When building and running coremark.exe with gcc on Red Hat 8.2, I sometimes get the error "ERROR! Must execute for at least 10 secs for a valid result!" because the total run time is less than 10 secs.
What does this message mean? Why does the test need to run for at least 10 secs? Could you please help explain? Thanks a lot!

make command

make CC="gcc" PORT_CFLAGS="-O0 -g"

the output of the binary

$ ./coremark.exe  0x0 0x0 0x66 0 7 1 2000
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 14733
Total time (secs): 14.733000
Iterations/Sec   : 4072.490328
Iterations       : 60000
Compiler version : GCC10.2.0
Compiler flags   : -O0 -g    -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xbd59
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 4072.490328 / GCC10.2.0 -O0 -g    -lrt / Heap

$ ./coremark.exe  0x3415 0x3415 0x66 0 7 1 2000
2K validation run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 9786
Total time (secs): 9.786000
Iterations/Sec   : 4087.471899
ERROR! Must execute for at least 10 secs for a valid result!
Iterations       : 40000
Compiler version : GCC10.2.0
Compiler flags   : -O0 -g    -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0x18f2
[0]crclist       : 0xe3c1
[0]crcmatrix     : 0x0747
[0]crcstate      : 0x8d84
[0]crcfinal      : 0xbe81
Errors detected

If a benchmark does not run for a long enough period, especially if you are running on an operating system (vs. bare metal), noise from the system can interfere with the measurement. Think about it this way: if you try to time a single loop and the OS briefly interrupts the process, the measured time of that one loop could vary by 1-10x. Running more iterations and extending the benchmark runtime amortizes that noise over the measurement period, resulting in more stable measurements (less run-to-run variation). Given the number of instructions executed in a single loop of CoreMark, 10 seconds was deemed sufficiently greater than the noise in the majority of deployments. However, since you specified arg4 = 0, it should automatically determine the number of iterations (see lines 242-263 in core_main.c) and pick a count that hits 10 seconds by increasing the number of iterations 10x each time. Odd that it missed the target by ~214 ms. I'm curious as to why that happened: very unusual.
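To make the amortization effect concrete, here is a minimal illustrative sketch (not CoreMark code; the workload and names are hypothetical) that times the same work at growing iteration counts with clock_gettime. On a noisy OS, the per-iteration time typically stabilizes as N grows:

    #include <stdio.h>
    #include <time.h>

    static volatile unsigned long sink; /* defeat optimization */

    /* Hypothetical stand-in for one benchmark iteration. */
    static void one_iteration(void)
    {
        unsigned long acc = 0;
        for (int i = 0; i < 100000; i++)
            acc += (unsigned long)i * 2654435761u;
        sink = acc;
    }

    static double seconds_now(void)
    {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return (double)t.tv_sec + (double)t.tv_nsec / 1e9;
    }

    int main(void)
    {
        /* OS noise (interrupts, scheduling) is roughly constant per
         * measurement, so its share of the total shrinks as N grows
         * and the per-iteration time settles. */
        for (unsigned long n = 1; n <= 10000; n *= 10)
        {
            double t0 = seconds_now();
            for (unsigned long i = 0; i < n; i++)
                one_iteration();
            double dt = seconds_now() - t0;
            printf("N=%-6lu total=%.6f s per-iteration=%.9f s\n", n, dt, dt / n);
        }
        return 0;
    }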

Thanks a lot for your comment! I tried building the binary with clang and icc, and also tried another machine; they all show the same issue.

If I were you, I would verify how the macro GETMYTIME is implemented and that it is using the correct TIMER_RES_DIVIDER. I would also instrument the code at lines 242-263. What platform are you running on?

  1. GETMYTIME(_t) expands to clock_gettime(0, _t) and TIMER_RES_DIVIDER is set to 1000000, so collecting the time itself should work (see the sketch of this timing path after the list below).
  2. I printed the value of secs_passed after line 253 and found that the actual run time is not always proportional to the number of iterations.

    coremark/core_main.c, lines 241 to 263 at commit 1541482:

    /* automatically determine number of iterations if not set */
    if (results[0].iterations == 0)
    {
        secs_ret secs_passed = 0;
        ee_u32   divisor;
        results[0].iterations = 1;
        while (secs_passed < (secs_ret)1)
        {
            results[0].iterations *= 10;
            start_time();
            iterate(&results[0]);
            stop_time();
            secs_passed = time_in_secs(get_time());
        }
        /* now we know it executes for at least 1 sec, set actual run time at
         * about 10 secs */
        divisor = (ee_u32)secs_passed;
        if (divisor == 0) /* some machines cast float to int as 0 since this
                             conversion is not defined by ANSI, but we know at
                             least one second passed */
            divisor = 1;
        results[0].iterations *= 1 + 10 / divisor;
    }

    For example, when iterations=10, the run time of iterate(&results[0]) is 0.001s. When iterations=100, the run time is 0.017s, about 17 times the first run time (0.001s). When iterations=1000, the run time is 0.276s, about 16 times the second (0.017s). When iterations=10000, the run time is 3.144s, about 11 times the third (0.276s). The code then sets iterations to 40000 (divisor = (ee_u32)3.144 = 3, so iterations = 10000 * (1 + 10/3) = 40000), and we would expect a run time of about 3.144s * 4 = 12.576s, which is more than 10 secs. However, there is no guarantee that the last probe time of 3.144s is representative, so the run of 40000 iterations can end up taking less or more than 10 secs.
secs_passed: 0.001000
secs_passed: 0.017000
secs_passed: 0.276000
secs_passed: 3.144000
  3. I ran on a Xeon(R) CPU E5-2680 v3 with Red Hat 8.2.
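For reference, here is a paraphrased sketch of the timing path from point 1 (a hypothetical reconstruction based only on the values reported above; the actual code lives in the port's core_portme files and may differ):

    #include <time.h>

    typedef unsigned int ee_u32;
    typedef double       secs_ret;

    #define NSECS_PER_SEC     1000000000
    #define TIMER_RES_DIVIDER 1000000 /* value reported above */
    /* 1000 ticks/sec, matching "Total ticks: 14733" vs "Total time: 14.733" */
    #define EE_TICKS_PER_SEC  (NSECS_PER_SEC / TIMER_RES_DIVIDER)
    #define GETMYTIME(_t)     clock_gettime(0, _t) /* 0 == CLOCK_REALTIME */

    static struct timespec start_time_val, stop_time_val;

    void start_time(void) { GETMYTIME(&start_time_val); }
    void stop_time(void)  { GETMYTIME(&stop_time_val); }

    /* Elapsed "ticks" = elapsed nanoseconds / TIMER_RES_DIVIDER. */
    ee_u32 get_time(void)
    {
        long long ns =
            (long long)(stop_time_val.tv_sec - start_time_val.tv_sec)
                * NSECS_PER_SEC
            + (stop_time_val.tv_nsec - start_time_val.tv_nsec);
        return (ee_u32)(ns / TIMER_RES_DIVIDER);
    }

    secs_ret time_in_secs(ee_u32 ticks)
    {
        return (secs_ret)ticks / EE_TICKS_PER_SEC;
    }

Worth noting: with a 1 ms tick resolution, the 0.001 s reading for 10 iterations is at the limit of the timer, so the first ratio above carries little information.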

@yuxianch

The first half of what you report makes sense, the second half does not.

First: yes, as the number of iterations increases from ~10 to ~10,000, the iterations/sec (IPS) will increase. On a Xeon machine running Linux, one loop of CoreMark is well within the OS noise, and much faster than the resolution of the measurement function. As you increase the number of iterations, the fraction of the measured time spent on OS noise and on the clock-reading code itself goes to zero. (Note: you could conceivably measure a single loop of CoreMark on a Xeon, but you would need to turn off interrupts and power management, run at Ring 0, warm the cache, and use RDTSC as the timing instruction, which counts core clock ticks. This would be "bare metal".)
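Purely as an illustration, here is a minimal sketch of that style of measurement on x86 (assuming GCC/Clang and the __rdtsc intrinsic; disabling interrupts and power management is not possible from user space, so this only approximates a true bare-metal setup):

    #include <stdio.h>
    #include <x86intrin.h> /* __rdtsc: read the core's time-stamp counter */

    static volatile unsigned long sink;

    /* Hypothetical stand-in for one benchmark loop. */
    static void one_loop(void)
    {
        unsigned long acc = 0;
        for (int i = 0; i < 1000; i++)
            acc += (unsigned long)i * 40503u;
        sink = acc;
    }

    int main(void)
    {
        one_loop(); /* warm the cache first */

        unsigned long long t0 = __rdtsc();
        one_loop();
        unsigned long long t1 = __rdtsc();

        /* TSC ticks, not wall time; divide by the TSC frequency for seconds. */
        printf("single loop: %llu TSC ticks\n", t1 - t0);
        return 0;
    }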

Second: as the number of iterations increases, the IPS should become constant. Since this is not happening, it makes me think we need to back up a bit. CoreMark does not run at Ring 0, which means it can be interrupted by the OS. If you are doing something on your machine while the benchmark is running, you will interfere with it and collect invalid measurements. You must make sure every non-essential process is terminated. And don't move any windows in the GUI or interact with the machine in any way.

The IPS at 100k, 150k, and 200k iterations should be roughly the same. If the IPS is not stabilizing, your computer is doing something else during the benchmark and interfering with it.
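For example (same argument layout as your runs above; per my earlier comment, the fourth argument is the iteration count, and a non-zero value skips the automatic sizing):

$ ./coremark.exe 0x0 0x0 0x66 100000 7 1 2000
$ ./coremark.exe 0x0 0x0 0x66 150000 7 1 2000
$ ./coremark.exe 0x0 0x0 0x66 200000 7 1 2000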

Strange problem. This timing loop is pretty simple and has been in use for 12 years; we have scores from 2 MHz to 3000 MHz on ~500 platforms, so I'm pretty sure this has something to do with OS activity on your machine.

One thing I am sure of is that there is no other heavy process running while CoreMark runs. I can only see some light processes occupying 0% CPU and 0% MEM.

Any update or can we close this?