32 bit div

Question

32 bit div

GoogleCodeExporter opened this issue 9 years ago · comments

Google Code Exporter commented 9 years ago

Hi,


32 bit lib seems very expensive in terms of number of cycles and accuracy. 

Is there any one working on it ?

Original issue reported on code.google.com by birdla...@gmail.com on 3 Mar 2011 at 3:23

Google Code Exporter · Answer 1 · Sun Feb 07 2016 23:10:46 GMT+0800 (China Standard Time)

There are only 2 people on this project at the minute since it's very new. I've 
implemented all the 32-bit stuff so far, but I only have so much time to work 
on it amongst full-time work many other projects and commitments. If you want 
better/faster algorithms and are interested in implementing them then I'd be 
happy to give you committer rights.

As the division goes, division will always be a slow operation, there aren't 
that many optimizations I can think to make to the algorithm I used (not 
without assembler anyway) but that doesn't mean that there isn't a 
fundamentally better algorithm out there that could be used.

The inaccuracy must be due to a small error somewhere in the code, the fact 
that it's such a small error indicates that it's likely to be a rounding error 
or an error in some extreme case where the result overflows.

Original comment by Flatmush@googlemail.com on 3 Mar 2011 at 3:35

Added labels: Performance

Google Code Exporter · Answer 2 · Sun Feb 07 2016 23:10:46 GMT+0800 (China Standard Time)

I'd be interested to hear what you're using the library for birdland (if you 
can tell me), I might be able to help you achieve what you want better that way 
with the limited time I have to work on this project.

Original comment by Flatmush@googlemail.com on 3 Mar 2011 at 3:45

Google Code Exporter · Answer 3 · Sun Feb 07 2016 23:10:46 GMT+0800 (China Standard Time)

hi 
I need these fixed point functions in the baseband processing of some wireless 
communication protocol such as WLAN channel estimation. Timing rquirements are 
very strict. Please run the following code and see how many cycles division is 
taking. Please which processor you are runing on. Please tell me also what 
value of d you get ? 


#include <stdio.h>
#include <stdlib.h>


typedef __int32_t fix16_t;

fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1);
fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1);
volatile int d;

int main(){

    int x = 5063265; //77.2593;
    int y = -6338360;//-96.7157;

    int a;
    d = fix16_div(x, y);
    printf("%d \n",d);
    //printf("%d \n",a);

}

fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1) {
int neg = ((inArg0 < 0) != (inArg1 < 0));
        inArg0 = (inArg0 < 0 ? -inArg0 : inArg0);
        inArg1 = (inArg1 < 0 ? -inArg1 : inArg1);

        while(((inArg0 | inArg1) & 1) == 0) {
                inArg0 >>= 1;
                inArg1 >>= 1;
        }

        __uint32_t r_hi = (inArg0 / inArg1);

        __uint32_t n_lo = (inArg0 % inArg1);
        __uint32_t n_hi = (n_lo >> 16);
        n_lo <<= 16;

        __uint32_t i, arg;
        for(i = 1, arg = inArg1; ((n_lo | arg) & 1) == 0; i <<= 1) {
                n_lo = ((n_lo >> 1) | (n_hi << 31));
                n_hi =  (n_hi >> 1);
                arg >>= 1;
        }

        __uint32_t res = 0;
        if(n_hi) {
                __uint32_t arg_lo, arg_hi;
                for(arg_lo = inArg1; (arg_lo >> 31) == 0; arg_lo <<= 1, i <<= 1);
                for(arg_hi = (arg_lo >> 31), arg_lo <<= 1, i <<= 1; arg_hi < n_hi; arg_hi = (arg_hi << 1) | (arg_lo >> 31), arg_lo <<= 1, i <<= 1);

                do {
                        arg_lo = (arg_lo >> 1) | (arg_hi << 31);
                        arg_hi = (arg_hi >> 1);
                        i >>= 1;
                        if(arg_hi < n_hi) {
                                n_hi -= arg_hi;
                                if(arg_lo > n_lo)
                                        n_hi--;
                                n_lo -= arg_lo;
                                res += i;
                        } else if((arg_hi == n_hi) && (arg_lo <= n_lo)) {
                                n_hi -= arg_hi;
                                n_lo -= arg_lo;
                                res += i;
                        }
                } while(n_hi);
        }

        res += (n_lo / inArg1);
        #ifndef FIXMATH_NO_ROUNDING
        if((n_lo % inArg1) >= (inArg1 >> 1))
                res++;
        #endif
        res += (r_hi << 16);

        return (neg ? -res : res);
}

fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1) {
    __int16_t hi[2] = { (inArg0 >> 16),    (inArg1 >> 16)    };
    __uint16_t lo[2] = { (inArg0 & 0xFFFF), (inArg1 & 0xFFFF) };

     __int32_t r_hi = hi[0] * hi[1];
     __int32_t r_md = (hi[0] * lo[1]) + (hi[1] * lo[0]);
    __uint32_t r_lo = lo[0] * lo[1];
    #ifndef FIXMATH_NO_ROUNDING
    r_lo += 0xFFFF;
    #endif

    r_md += (r_hi & 0xFFFF) << 16;
    r_md += (r_lo >> 16);

    return r_md;

}

Original comment by birdla...@gmail.com on 3 Mar 2011 at 5:04

Google Code Exporter · Answer 4 · Sun Feb 07 2016 23:10:46 GMT+0800 (China Standard Time)

Running the code I get -49552.  This is using gcc 4.3.1.  I don't have time 
right now to look into the excecution speed, but ill take a look later today

Original comment by joe.scha...@gmail.com on 3 Mar 2011 at 5:35

Google Code Exporter · Answer 5 · Sun Feb 07 2016 23:10:46 GMT+0800 (China Standard Time)

Yes I get the same value but the accurate one is -52352. cycle count seems very 
high for this code. 

let see what you get for cycle count

Original comment by birdla...@gmail.com on 3 Mar 2011 at 6:29

Google Code Exporter · Answer 6 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

How are you measuring cycle count

Original comment by joe.scha...@gmail.com on 3 Mar 2011 at 6:36

Google Code Exporter · Answer 7 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

Birdland, are you targetting ARM, if so then we could make a much faster 
assembler optimized version, especially if it has a clz instruction.
This implementation is very generic and aimed at accuracy, there's a similar 
divide function to divide a 64-bit number by a 32-bit number in the newly 
submitted int64.h header which may work better, though it is largely the same.

Original comment by Flatmush@googlemail.com on 3 Mar 2011 at 7:52

Google Code Exporter · Answer 8 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

Ok so I have never run a profiler before so I am not entirely sure what I am 
doing here, but here it goes.

so as a test I took birdlands code and copied it into fix16.c appending an _b 
to the function name.  
The standard fix16_div is using 64 bit math.
I threw in mul for good measure
I modified the main code to be:

    int x = 5063265; //77.2593;
    int y = -6338360;//-96.7157;
    d = fix16_div_b(x, y);
    k = fix16_div(x,y);
    t = fix16_mul(x,y);
    printf("%d \n",d);
    printf("%d \n",k);
    printf("%d \n",t);
    return 0;

compiled it with gcc 4.2.1 with the -g option. Then used valgrind to look at 
the number of instructions per function. Here is what I came up with.

#############################################################################
Results for Core Duo @ 2.0Ghz(32 bit)


valgrind --tool=callgrind --toggle-collect=fix16_div_b ./main
No optimization =  267
O3                        = 198
Os                         = 209

valgrind --tool=callgrind --toggle-collect=fix16_div ./main
No optimization = 2341
O3                        = 2334
Os                         = 2334

valgrind --tool=callgrind --toggle-collect=fix16_mul ./main
No optimization = 56
O3                         = 10
Os                         = 10

#############################################################################
Results for Core 2 Duo @ 2Ghz(64 bit)

valgrind --tool=callgrind --toggle-collect=fix16_div_b ./main
No optimization = 268
O3                         = 188
Os                         = 200

valgrind --tool=callgrind --toggle-collect=fix16_div ./main
No optimization = 23
O3                        = 14
Os                         = 14

valgrind --tool=callgrind --toggle-collect=fix16_mul ./main
No optimization = 18
O3                         = 10
Os                         = 10

#############################################################################
Results for Arm Cortex M0 @ 48 Mhz (32 bit)
(this uses a timer, which is running at the system clock speed) since it is 
single threaded 
it always runs in the same amount of time.
Also I only compiled with Os, because the code grows too large without it.

also keep in mind that this processor has NO division instruction.  So some 
magic is happening
which is probably why the 32bit and 64 bit functions are similar

fix16_div_b
Os                     = 880

fix16_div
Os                     = 1104

fix16_mul 
Os                     = 145

#############################################################################

Wow.  So the fix16_div routine is very slow for 32 bit systems even if the 
compiler can handle 64 bit numbers
The ARM doesnt take as large of a hit, but that is because the div_b is slow as 
it is, mostly due to the lack of a division routine

Hope all of this makes sense...

Original comment by joe.scha...@gmail.com on 4 Mar 2011 at 2:26

Google Code Exporter · Answer 9 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

It's expected that 64-bit division would be slow (though concrete results are 
very nice to have) because there is no trivial way to perform division unlike 
multiplication.
I think that the algorithm I wrote could do with some work, but I'm not sure 
there is going to be a very fast way of doing it.

Original comment by Flatmush@googlemail.com on 4 Mar 2011 at 9:27

Google Code Exporter · Answer 10 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

I've found a good way to do fixed point division on arm: 
http://me.henri.net/fp-div.html
It'll be very useful for you birdland, I don't want to steal it to put it in 
the SDK though so I'm going to have to contact him and check if he wants to 
join in or if he's happy for his work to go into our project.

Original comment by Flatmush@googlemail.com on 4 Mar 2011 at 12:27

Google Code Exporter · Answer 11 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

Well we have Henri's permission (and encouragement) to use his code as part of 
our library, so I'll add it when I get time (unless one of you want to).

Henri:
"By all means please use the code. It took a lot of work and research to figure 
out, and at the time, I didn't want it all to go to waste - so I wrote it all 
down in that article - but no-one was interested in publishing it at the time. 
For this reason I would be very happy to see it part of an open source library. 
That being said, I don't do ARM work any more - I don't even have the build 
tools installed, so if you can use or adapt the code 'as-is' please let that be 
my contribution."

Original comment by Flatmush@googlemail.com on 4 Mar 2011 at 2:09

Google Code Exporter · Answer 12 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

What I did with my application now is that I saw the range of my numbers in my 
application and saw that 2^12 gives me 3 decimal places precision. So at the 
moment just using simple integer division with precision upto 3 decimal places. 

input1 = A*2^12;
input2 = B*2^12;


int div(int input1, int input2){
input1 = input1 << 12;
return input1/input2;
}

I need to keep track of my precision in my application. 

Now I am stuck again in sqrt :( looking for some faster implementation.

Original comment by birdla...@gmail.com on 4 Mar 2011 at 2:10

Google Code Exporter · Answer 13 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

currently using the following sqrt. I have not tested its speed but it should 
be varying depending on the input value. Any one knows more faster approach ?

int isqrt32 (int value) 
{
    int g = 0;
    int bshft = 12;
    int b = 1<<bshft;
    do {
        int temp = (g+g+b)<<bshft;
        if (value >= temp) {
            g += b;
            value -= temp;
        }
        b>>=1;
    } while (bshft--);

    return g;
}

Original comment by birdla...@gmail.com on 4 Mar 2011 at 2:57

Google Code Exporter · Answer 14 · Sun Feb 07 2016 23:10:47 GMT+0800 (China Standard Time)

With sqrt functions the faster approaches tend to be approximations, but this 
is meant to be an accurate library (nothing saying we can't use macros and have 
both though). Implementing the function in 32-bit or using ARM assembler would 
be a good bet, though I'm thinking this sqrt stuff should be a new issue.

Original comment by Flatmush@googlemail.com on 4 Mar 2011 at 3:10

Google Code Exporter · Answer 15 · Sun Feb 07 2016 23:10:48 GMT+0800 (China Standard Time)

Birdland are you using an ARM.  If so I have an assembly fixed point division 
algorithm that I know will run on an ARM thumb architecture and I think should 
run on any other ARM architecture.  

I will commit it but would like someone with something other than what I have 
test it out

fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1)
{
    register int32_t quotient;
    asm(
    "num     .req %[inArg0]      @ Map Register Equates\n\t"
    "den     .req %[inArg1]\n\t"
    "mod     .req r2\n\t"
    "cnt     .req r3\n\t"
    "quo     .req r4\n\t"
    "notden  .req r12\n\t"
    "sign  .req r11\n\t"    


    "@ Make sure that we are not dividing by zero\n\t"
    "cmp den, #0\n\t"
    "bne .skip_div0\n\t"
    "@ If we are dividing by zero return all -1\n\t"
    "mov num, #1\n\t"
    "neg num, num\n\t"
    "b .quit\n\t"
    ".skip_div0:\n\t"

    "@ Store the sign in the sign register for later\n\t"
    "@ Sign is computed by exoring the numerator and denominator\n\t"
    "@ and checking if the result is positive or negative\n\t"
    "movs mod, num\n\t"
    "eor mod,mod,den\n\t"
    "movs sign, mod\n\t"

    "@ Negate the denominator if necessary\n\t"
    "@ The algorithm requires the denominator to be signed negative\n\t"
    "cmp den, #0\n\t"
    "bmi .skip_negate_den\n\t"
    "neg den,den\n\t"
    ".skip_negate_den:\n\t"

    "@ Negate the numerator if necessary\n\t"
    "@ The algoright requires the numerator to be signed positive\n\t"
    "cmp num, #0\n\t"
    "bpl .skip_negate_num\n\t"
    "neg num,num\n\t"
    ".skip_negate_num:\n\t"

    "@ Begin the actual division algorithm\n\t"
    "neg mod, den\n\t"
    "movs notden, mod\n\t"
    "movs mod, #0\n\t"
    "add num, num, num\n\t"
    ".rept 48\n\t"
    "    adc mod, mod, mod\n\t"
    "    add mod, den, mod\n\t"
    "    bcs .+4\n\t"
    "    add mod, mod, notden\n\t"
    "    adc quo, quo, quo\n\t"
    "    add num, num, num\n\t"
    ".endr\n\t"
    "movs num,quo\n\t"

    "@ Check the sign to determine if we need to negate the answer\n\t"
    "movs quo,sign\n\t"
    "cmp quo,#0\n\t"
    "bpl .skip_negate_quo\n\t"
    "neg num,num\n\t"
    ".skip_negate_quo:\n\t"

    ".quit:\n\t"
    /* output registers */
    : [quotient] "=r" (quotient)
    /* input registers */
    : [inArg0] "0" (inArg0), [inArg1] "r" (inArg1)
    /* clobbered registers */
    : "cc", "r2", "r3", "r4", "r11", "r12");
    return quotient;
}

Original comment by joe.scha...@gmail.com on 22 Jun 2011 at 12:20

Google Code Exporter · Answer 16 · Sun Feb 07 2016 23:10:48 GMT+0800 (China Standard Time)

It would be interesting to see code run on multiple flavors of ARM Cortex-M 
family: Cortex-M0 vs Cortex-M3 vs Cortex-M4 vs Cortex-M4F.  See instruction 
differences at  http://en.wikipedia.org/wiki/ARM_Cortex-M#Overview

I added links to this project at 
http://en.wikipedia.org/wiki/STM32#Development_tools
http://en.wikipedia.org/wiki/ARM_Cortex-M#Development_tools

Keep up the good work!
http://en.wikipedia.org/wiki/User:Sbmeirow

Original comment by sbmeirow on 10 Dec 2011 at 6:19

Google Code Exporter · Answer 17 · Sun Feb 07 2016 23:10:48 GMT+0800 (China Standard Time)

Comparing cores does not make sense ...
Moreover, M3, M4 and M4F are exactly identical excepting DSP (+FPU) on M4(F) 
which is not used by libfixmath since we're working on integer operations ... 
you will measure exactly the same timings.

Nevertheless, I agree it could be interesting to compare toolchains.

Best regards,
Vincent.

Original comment by vincent....@gmail.com on 10 Dec 2011 at 7:21

Google Code Exporter · Answer 18 · Sun Feb 07 2016 23:10:48 GMT+0800 (China Standard Time)

The Cortex-M0 / M0+ / M1 only has 32-bit multiply instructions with a 
lower-32-bit result (32bit x 32bit = 32bit), 

where as the Cortex-M3 / M4 / M7 includes additional 32-bit multiply 
instructions with 64-bit results (32bit x 32bit = 64bit). 

The Cortex-M4 / M7 also include DSP instructions for (16bit x 16bit) and (32bit 
x 16bit) multiplication.

Original comment by sbmeirow on 19 Feb 2015 at 9:38