Implement Thumb-2 optimized memcpy/memset
jserv opened this issue
Directory kernel/lib contains the implementations of memcpy and memset, but they are too generic. We can exploit several ARM Cortex-M3/M4 specific features to optimize:
- Thumb-2
- apply a 32-bit aligned copy in the inner loop; this is not strictly necessary on Cortex-M3/M4, but it can speed up external memory accesses, depending on the memory controller
- unaligned memory access
- the PLD instruction to preload the cache with the source data
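As a rough illustration of the aligned inner-loop idea, here is a minimal portable C sketch (the function name memcpy_word is hypothetical; the real optimization would be hand-written Thumb-2, e.g. LDM/STM pairs plus PLD, not a C loop):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: copy word-at-a-time when src and dst share 4-byte
 * alignment, and fall back to a byte loop otherwise.  A Thumb-2
 * version would replace the inner loop with LDM/STM (and PLD). */
void *memcpy_word(void *dst, const void *src, size_t len)
{
    char *d = dst;
    const char *s = src;

    if (((uintptr_t) d & 3) == 0 && ((uintptr_t) s & 3) == 0) {
        uint32_t *dw = (uint32_t *) d;
        const uint32_t *sw = (const uint32_t *) s;
        while (len >= 4) {          /* 32-bit aligned inner loop */
            *dw++ = *sw++;
            len -= 4;
        }
        d = (char *) dw;
        s = (const char *) sw;
    }
    while (len--)                   /* tail (or unaligned) bytes */
        *d++ = *s++;
    return dst;
}
```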
lk implements arm-m optimized memcpy and memset routines in git commit littlekernel/lk@33b94d9
That looks odd. Can you explain?
@jserv The implementation is in this branch:
https://github.com/gapry/f9-kernel/blob/benchmark_memcpy/benchmark/benchmark.c
My approach is to measure each case, aligned and unaligned, five times and take the average time. Assuming the methodology is correct, the data suggest that, after the optimization, the unaligned case performs better than the aligned case on the STM32F407.
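The averaging step of that methodology can be sketched as follows (NRUNS and average_cycles are illustrative names, not symbols from the benchmark):

```c
#include <stdint.h>

#define NRUNS 5  /* each case is measured five times, per the comment above */

/* Sketch: average NRUNS cycle-count samples (e.g. DWT CYCCNT deltas). */
uint32_t average_cycles(const uint32_t samples[NRUNS])
{
    uint64_t sum = 0;              /* 64-bit accumulator avoids overflow */
    for (int i = 0; i < NRUNS; i++)
        sum += samples[i];
    return (uint32_t) (sum / NRUNS);
}
```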
@gapry In order to quantify the performance gain, please compare the optimized memcpy routines with a plain byte-oriented C version.
The simplest (and least efficient) byte-oriented implementation of memcpy:

void *memcpy(void *dst, const void *src, size_t len)
{
    char *d = (char *) dst;
    const char *s = (const char *) src;
    while (len--)
        *d++ = *s++;
    return dst;
}
@jserv For now, I use the DWT cycle counter to measure elapsed clock cycles. You can check the commit gapry@33e58df
and the complete implementation: https://github.com/gapry/f9-kernel/blob/benchmark_memcpy/benchmark/benchmark.c
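For reference, DWT-based cycle counting on a Cortex-M typically follows a sequence like this sketch (register addresses per the ARMv7-M architecture; the macro and function names are illustrative, not CMSIS or the benchmark's actual symbols, and dwt_init must only run on the target MCU):

```c
#include <stdint.h>

/* ARMv7-M debug registers (memory-mapped; target-only accesses). */
#define DEMCR      (*(volatile uint32_t *) 0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *) 0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *) 0xE0001004u)

/* Enable the cycle counter; call once at startup, on the MCU only. */
void dwt_init(void)
{
    DEMCR |= (1u << 24);   /* TRCENA: enable the DWT/ITM blocks */
    DWT_CYCCNT = 0;        /* reset the counter */
    DWT_CTRL |= 1u;        /* CYCCNTENA: start counting cycles */
}

/* Difference of two CYCCNT samples; unsigned wraparound makes this
 * correct even if the 32-bit counter overflowed once in between. */
uint32_t dwt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```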