Fastest Multi-precision Division Routines?

jpanhalt · Sep 16, 2017

A recent project required a 32-bit by 16-bit division. My previous default method was that by Hemsley ( ). It takes about 350Tcy, but is limited to quotients of 16-bits or less.

I became attracted to the so-called Kenyan (or Russian) method that doubles the divisor until it is just larger than the dividend. That is followed by sequential subtractions by the divisor. After each subtraction, the divisor is divided by 2. A subtraction without borrow results in a 1 being rotated into the quotient, and one with borrow results in a zero being rotated into the quotient. See here for a better description: Reichborn-Kjennerud ( ).

Working with that description, I got some procedures that worked, were a little faster than Hemsley, but were lengthy. Then using indirect addressing, I shortened the code a little for a small cost in time.

Presently, I have two working algorithms. The slower one just uses byte shifts to initiate the sequential subtractions but has fewer instructions. The faster one manipulates bits in the divisor and takes slightly more code. Timings for the same problem set used with the Hemsley method average 498 and 482 Tcy, respectively. I have attached the byte-shift method as a file. I have not tried to optimize it yet.

My question:
My test problem set is 12 problems. They are not randomly picked. Division by small numbers tends to be slowest, as expected from the algorithm. Is anyone here willing to do 3 of those test divisions in C just to see how close I am to what is standard. I have no property interest in either method I wrote. They are pretty much blunt-force approaches. Of course, I will also share the whole problem set, but the three problems I have picked are pretty representative:

1) 0x9C40000 ÷ 0x1F4 ; expected answer = 0x50000, byte-method = 573 Tcy
2) 0x9C40000 ÷ 0x1388 ; expected answer = 0x8000, byte-method = 443 Tcy
3) 0xC40000 ÷ 0x3100 ; expected answer = 0x400, byte-method = 394 Tcy

Thanks in advance.

John

Pommie · Sep 19, 2017

I just added the following code to a project in XC8.

Code:

    T1CON=0b00000000;
    uint32 var32=0x9C40000;
    uint16 var16=0x1F4;
    TMR1=0;
    TMR1ON=1;
    var32/=var16;
    TMR1ON=0;

I get the following results,
1) 0x9C40000 ÷ 0x1F4 ; expected answer = 0x50000, = 1168Tcy
2) 0x9C40000 ÷ 0x1388 ; expected answer = 0x8000, = 960 Tcy
3) 0xC40000 ÷ 0x3100 ; expected answer = 0x400, = 945 Tcy

So, your routines are at least twice as fact as the standard C routines. However, I haven't got the optimizing version so they may be deliberately crippled by Microchip.

Mike.
Ian Rogers, fancy trying the above with optimization?

jpanhalt · Sep 19, 2017

Thanks Mike for testing the problems. I am a little surprised C was that much slower than the Hemsley routine that averages about 682 Tcy.

Regards, John

Pommie · Sep 20, 2017

The C code may be casting to 32 bit and doing a 32x32 divide. Stupic C.

Mike.

jpanhalt · Sep 20, 2017

And in all fairness to C, of the methods I have mentioned, only the Hemsley is truly 32-bit/16-bit. The other methods are 31-bit/16-bit. The reason being that as soon as the 32nd bit is set, it is frequently necessary to have a 33rd bit so that doublings of the divisor will be greater than the dividend. The division by 0x1388 tests for that. One could work around that limitation, but for most applications it does not seem to be a great limitation. The limit to a 16-bit quotient in the fastest of the alternative procedures (Golovchenko's) is a greater limitation, at least in my opinion.

John

Ian Rogers · Sep 20, 2017

Pommie said:
Ian Rogers, fancy trying the above with optimization?

Nope... I had tried to implement really fast division, but even the boffs at XC8 headquarters use ASM..

The fastest routines are the Egyptian and Chinese algorithms, and that is what is used here.. I wouldn't attempt these in C..
Here is C18's 32x32 division... They reckon they saved 256 instruction cycles... I haven't worked out the Tcy's

Code:

/*** Unsigned Integer Division: 32-bit by 32-bit
 ***
 *** Optimized: Dec. 21, 2000
 ***        by: Daniel R. Madill, Quanser Consulting Inc.
 ***       for: Saved (for the worst case) at least 8*32 = 256 instruction cycles
 ***            over the code supplied with MCC18 v1.00.12
 ***/
void  FXD3232U(void/* ulong aarg, ulong barg */)
{
  // use INDF1 for the counter...
    _asm
    // REM = 0
    clrf __REMB0, 0
    clrf __REMB1, 0
    clrf __REMB2, 0
    clrf __REMB3, 0
    // INDF1 = 32
    movlw 32
    movwf INDF1, 0
    // Clear the carry
    bcf STATUS, 0, 0
loop:
 //AARG32 <<= 1; The carry is always clear at the top of the loop.
 rlcf __AARGB3, 1, 0
 rlcf __AARGB2, 1, 0
 rlcf __AARGB1, 1, 0
 rlcf __AARGB0, 1, 0
 //REM32 = (REM32 << 1) | (AARG32 >> 32)
 rlcf __REMB3, 1, 0
 rlcf __REMB2, 1, 0
 rlcf __REMB1, 1, 0
 rlcf __REMB0, 1, 0
 
 //if (PROD >= BARG32)
 movf __BARGB3, 0, 0
 subwf __REMB3, 0, 0
 movf __BARGB2, 0, 0
 subwfb __REMB2, 0, 0
 movf __BARGB1, 0, 0
 subwfb __REMB1, 0, 0
 movf __BARGB0, 0, 0
 subwfb __REMB0, 0, 0
 bnc _false
    //{
     //REM32-= BARG32;
     movf __BARGB3, 0, 0
     subwf __REMB3, 1, 0
     movf __BARGB2, 0, 0
     subwfb __REMB2, 1, 0
     movf __BARGB1, 0, 0
     subwfb __REMB1, 1, 0
     movf __BARGB0, 0, 0
     subwfb __REMB0, 1, 0
     //++AARG32; Since AARG32 was shift to the left above, we only need to set
        //          the lowest bit. Use incf so that the carry flag will also be cleared.
        //          Thus, the carry will always be clear at the top of the loop.
     incf __AARGB3, 1, 0
    //}
_false:
    decfsz INDF1, 1, 0    // does not affect the carry bit
    bra loop
  /* result in AARG already... */
  _endasm
}

This is pretty much the same as the others..

jpanhalt · Sep 20, 2017

Agreed. Even Golovchenko's method can be called a shift-and-add (or subtract) algorithm (https://en.wikipedia.org/wiki/Multiplication_algorithm#Shift_and_add). One real time waster is the need to re-build the subtrahend after a failed subtraction (i.e., a "0" in the result). So one needs to chose whether to use addition to rebuild it (favors 1's) or do a trial subtraction (favors 0's). One might assume that 1's are more common, as all binary numbers start with a 1. However, since the first subtraction in these algorithms must fail, I chose to do the trial subtraction.

Time savings seem to be in how you get to the subtraction process. My code sacrifices instructions for speed. Others make the opposite choice.

Thanks for showing your code. I will try to re-cast for an enhanced 16F in the next few days

John

jpanhalt · Sep 21, 2017

Ian Rogers said:
This is pretty much the same as the others..

Easily adapted to 16F. Needed to add: bcf STATUS,0 after "incf __AARGB3, 1, 0" because incf doesn't affect carry bit in the 16F processors. Also got rid of the bnc, which is a pseudo instruction for 16F's (added a Tcy).

I got the following results using MPLab SIM:
1) 0x9C40000 ÷ 0x1F4 ; without adjustment = 694 ; w/ 16F ajustments = 728
2) 0x9C40000 ÷ 0x1388 ; without adjustment = 686 ; w/ 16F ajustments = 719
3) 0xC40000 ÷ 0x3100 ; without adjustment = 686 ; w/ 16F ajustments = 719

Agreed, it may be a little faster than Pommie's routine but is quite similar to the Hemsley and other routines.

John

BobW · Sep 30, 2017

I ran into a situation a few years ago when I was implementing the math routines from Microchip's appnote AN526 which was written for the older 16Cxxx series. There was a difference in status bit behaviour between the 16C series and the 16F midrange series that I was using. I ended up having to add a few additional instructions to handle this. There might have been a different implementation that would have been more efficient for the midrange PICs, but what I ended up with was fast enough for my application.

I'm surprised at how many versions of shift & subtract division routines exist, all with different names, and only minor variations between them. Still, one variation may be much more efficient than the others for a particular processor implementation, because it may be better suited to that processor's instruction set. In any event the only possible general method that might give significant improvement over shift & subtract would likely have to involve lookup tables. Interestingly, even the CORDIC division routine ends up being nothing more than shift & subtract. So, it doesn't gain anything, even with its lookup table abilities.

jpanhalt · Sep 30, 2017

I was tempted to call the procedure CORDIC, but Wikipedia seems to describe CORDIC as a type of shift and add/subtract method.

I have been tuning the procedure I posted earlier to chop a few nanoseconds and was just reviewing my most recent data. Possibly up to 32-bit (I may have solved the 31-bit limit) divided by up to 16-bit with unlimited result seems to average about 450 Tcy +/- .

The most significant change I have made is to avoid doing a trial subtraction or immediate add-back to reconstruct the minuend. Instead, I do the add-back in the next step and just live with the negative numbers. Thus, there are no wasted additions or subtractions. I refer to that section as the Loop. Pre-loop manipulations are what I have been focused on, when I can, for the past few days. Those who embed Assembly in their C may find that loop section helpful with a little modification.

I should have something to post in a few days. My biggest fear is whether I am addressing obvious glitches, as I cannot to test all potential combinations. For example, an exception occurs with one of the routines if the dividend and divisor are both single bytes, the divisor is odd, and it is greater than the (dividend/2) +1, i.e., the quotient is 1. Correct result is obtained, but the divisor ends up one less than the original. Most other procedures preserve the divisor. The work around is not very costly (it rotates right into a flag and then rotates left from the flag), but my concern is whether I am catching the really important glitches.

John

Ian Rogers · Oct 1, 2017

jpanhalt said:
I was tempted to call the procedure CORDIC

I have tried CORDIC but it was developed mainly for trig ( which was the main reason for my trials) needless to say I failed to get a decent speed... It uses floats and floats cost dearly in ram.. I should imagine you could do it with fixed point, but a whole heap of work I didn't need..

BobW · Oct 1, 2017

It should also be remembered that if speed is the primary concern, and in the special case where you are dividing by a constant, then you can hard code the division by that specific constant, and avoid all bit testing, making things extremely fast. This is often the case with the division operations that I need to do in my PIC applications. A recent project has been to replace an obsolete 360 count rotary encoder with a newer 4096 count encoder in an industrial machine. The code has to be fast enough to do the conversion: output=input*360/4096 on the fly. I was able to get sufficient speed using a combination of lookup tables and some hardcoded division (actually multiplication by the reciprocal of the divisor). In the process of working on that project I discovered a number of neat tricks for doing this efficiently.

jpanhalt · Oct 2, 2017

My reference calculator made an error!

Hi Bob,

I'm still working on validating the code. Speed is not the only goal.

I thought I had discovered a "glitch" this morning that had me worried. I had tested 0xFF FF FF FF ÷ 0xFF , and the calculation worked fine. Then, I tried
0xFE FF FF FF ÷ 0xFF and got 0xFF FF FF. My reference calculator (Sharp EL-531x) showed 0x1 00 00 00. That did not look right by inspection, so I tried the Hemsley method and an online calculator for large binary and hex numbers. Both confirmed my original answer. I re-entered the problem into the Sharp calculator several times to rule out an entry error and kept getting the same, wrong result. The error is also reproduced on another of my Sharp EL-531x calculators.

For other problems, e.g., 0xFE FF FF FF ÷ 0xF, all four methods agreed.

A quick search with Google did not show that to be a known fault of that calculator -- but I didn't spend much time looking.

John

BobW · Oct 4, 2017

Not a good thing when you can't trust your calculator.

Sometimes I'll test out my assembly language routines by emulating the same operations in a higher level language. For example I write a lot of programs in Xojo (unfortunate name for a high level BASIC-like programming IDE), which can do all of the bitwise operations. That way, I can feed the test routine massive amounts of test data, and compare the output to the language's regular math functions, to make sure there are no glitches.

atferrari · Oct 6, 2017

Hola John,

After close to 2 years doing nothing on these matters, today I managed to retrieve my math routines, reinstall MPLAB to see what its performance is.

After a cursory reading it is not clear to me if a 16-bit limited quotient is good for you. My routines seem to be all good for a quotient limited to 16 bits.

jpanhalt · Oct 6, 2017

Oh, I think there will be many times when a 16-bit quotient will suffice. That would be OK even for the project that sparked this tangent. All I needed was to look at 500.00 ± 10% or so decimal. Summer doesn't leave much time for playing with electronics. This was something I could start and stop as required.

For a 16-bit quotient, it is hard to beat the Golovchenko method mentioned earlier. There is also the question of whether you need a valid remainder. Anyway, I have been away from this tangent most of the week, but started again as it is raining today.

It is easy to be quicker. It is much harder to be quicker with comparable amount of code.

John

atferrari · Oct 6, 2017

Insert code function, not working for me today. Sorry.

Having moved to the 18F family long time ago, I find hard to revisit my 16F routines.

Attached is the 32/16 unsigned routine with limited to 16 bits quotient.

Keep in mind it is for the 18F family. Simulated, including the CALL instruction took 263 Tcys

jpanhalt · Oct 7, 2017

Hi Austin,

Nice routine.

Translated to enhanced midrange, converted the "pseudo instructions," and changed the labels to be consistent with my problem sets. Changing the pseudo-instructions is not necessary, as the Assembler handles that. Changed the first step of the Loop to use LSLF, which avoids the need to clear the carry bit. Knocks a few Tcy off the timing. Timings reported below are before that change. After that change, the last problem is 347 Tcy.

The following are all in hex (all calculations were correct):
9C40000/1388 = 8000, remainder=0, Tcy=299
2/1 =2, remainder=0, Tcy=299
AAAA AAAA/C4C4 = DE0B, remainder=3E3E, Tcy=320
9C40000/5106 = 1EDA, remainder=4CE4, Tcy=363

Here's a version of the translated code:

Code:

; Quotient is in T1:T0
; Remainder is in T3:T2
; Divisor is unchanged


     movlw     16
     movwf     counter

Loop
     lslf      T0,f
     rlf       T1,f
     rlf       T2,f
     rlf       T3,f

     btfsc     STATUS,0
     bra       Subtract
     movf      B1,w
     subwf     T3,w

     btfss     STATUS,2
     bra       TestSubtract
     movf      B0,w
     subwf     T2,w

     btfsc     STATUS,0
     bra       Subtract
   
SetCount
     decfsz    counter,f
     bra       Loop
     bra       Done       ;job done or "return" if called

Subtract
     movf      B0,w
     subwf     T2,f
     movf      B1,w
     subwfb    T3,f
     bsf       T0,0       ;flag dividend that divisor subtracted
     bra       SetCount
 
TestSubtract
     btfss     STATUS,0
     bra       SetCount
     bra       Subtract
Done
     nop
;*********
     end

Thanks for sharing.

John

BobW · Oct 7, 2017

Thanks guys. That's nice and simple. I'll be adding it to my library of math routines.

atferrari · Oct 8, 2017

Hola John

In red, results I got with your quantities (no tweaking).

The following are all in hex (all calculations were correct):
9C40000/1388 = 8000, remainder=0, Tcy=299 - 253
2/1 =2, remainder=0, Tcy=299 - 253
AAAA AAAA/C4C4 = DE0B, remainder=3E3E, Tcy=320 - 269
9C40000/5106 = 1EDA, remainder=4CE4, Tcy=363 - 317

Comments:

I was surprised by the LSFL instruction (did not recognize it). When the enhanced 16F started to become common I already moved to the 18F.

Wondering what tweaking I could do to my routine. Maybe it would demand a full rethink of it. The archives, with the flow diagram and comments, are now on my desk dutifully on paper as it was common for me in the past.

I suspect, only suspect, that being a RISC set of instructions the desire of short & fast is not possible. Short OR fast.

I suppose that in the Z80 such a thing could be attempted but not nothing I would try for many reasons. Or maybe with the 8051.

I wrote the original routine in the hotel at Belem (Brazil) while waiting for a vessel to come alongside. It was based on the explanation given in the book Microcomputer Math, by William Barden Jr, translated to Portuguese. Still have with me the flow diagram and graph paper sheet where I worked out the manual test of my routine. I enjoyed it.

Welcome to our site!

Electro Tech is an online community (with over 170,000 members) who enjoy talking about and building electronic circuits, projects and gadgets. To participate you need to register. Registration is free. Click here to register now.

Fastest Multi-precision Division Routines?

Well-Known Member

Attachments

Well-Known Member

Well-Known Member

Well-Known Member

Well-Known Member

User Extraordinaire

Well-Known Member

Well-Known Member

Active Member

Well-Known Member

User Extraordinaire

Active Member

Well-Known Member

Active Member

Well-Known Member

Well-Known Member

Well-Known Member

Attachments

Well-Known Member

Active Member

Well-Known Member

Similar threads

New Articles From Microcontroller Tips