How to divide fast by immediates

In almost all assembly books you’ll find some nice tricks to do fast multiplications. E.g. instead of “imul eax, ebx, 3” you can do “lea eax, [ebx+ebx*2]” (ignoring flag effects). It’s pretty clear how this works. But how can we speed up, say, a division by 3? This is quite important since division is still a really slow operation. If you never thought or heart about this problem before, get pen and paper and try a little bit. It’s an interesting problem.

The non-obvious trick here is: We can multiply by the inverse (i.e. the reciprocal). Maybe this sounds a little bit strange, since the inverse of 3 is what, 1/3? Yes, it is, over the rational numbers. But here we calculate over the finite ring Z/(232)Z (for the non-mathematical readers: We wrap around after 232). And in Z/(232)Z all odd elements have a multiplicative inverse. We can find the inverse with the extended euclidean algorithm (gcdex):

For numbers a, b gcdex(a, b) returns [g, x, y] where

  • g is the greatest common divisor
  • x, y are some numbers,
  • such that x*a + y*b = g

E.g. gcdex(3, 232) = [1, -1431655765, 1]

That means:
-1431655765 * 3 + 1*232 = 1

Since we are using the modulo 232 math (we have 32 bit registers), this is exactly what we need:

-1431655765*3 + 232 = 0xaaaa_aaab * 3 = 1 (mod 232)
So 1/3 = 0xaaaa_aaab (mod 232)

Now, let’s look what we have so far:
For numbers divisible by 3 this is already our solution. We just have to multiply them by 0xaaaa_aaab, since this is the inverse of 3. But — in the general case — we want this to work with all values (the remainder should be ignored, but with our solution it destroys the result).

We need the another trick: We look at the 64 bit result and count the “overflows”. This is somehow similar to fixed point arithmetic. Look at this:

3 * 0xaaaa_aaab = 0x2_0000_0001, and that means
(n*3) * 0xaaaa_aaab = n * 233 + n

This is cool, because

(n*3 + c) * 0xaaaa_aaab = (n*233 + n) + c*0xaaaa_aaab

So, with c in {0, 1, 2} we have the nice identity:

(n * 0xaaaa_aaab) >> 33 = n div 3, for all 0 <= n < 232

Last but not least, our final solution for dividing by 3 is:

[IN: eax dividend]
mov ecx, 0xaaaa_aaab // 1/3
mul ecx // multiply with inverse
shr edx, 1 // shift by 33
[OUT: edx = eax div 3]

Of course, this is only half of the story; this is not the general solution for all divisors. And it doesn’t handle signed division. For futher reading see this paper or this gcc bug report. It also has its chapter in the AMD optimization manual.

BTW: This is part of reason why you have to learn all that math stuff in computer science. It’s useful! :-)

18 thoughts on “How to divide fast by immediates”

  1. Actually, his reminds me of how to quickly mod an arbitrarily big int value by 10.

    So, let’s say we have an array of bytes, starting at (%eax) with a length of %ecx, in increasing significant order. (The first byte is thus the least-significant byte)

    So, if we look at this value, we see that we can trivially mathematically decompose the total big int to (c_0 + c_1 * 256 + c_2 * 256^2 + … + c_n * 256^n), So we get the following: sigma ( c_i * 256^i) from i=0 to n.

    Now, here’s where the creative stuff works. If you perform a mod of a sum, you can distribute the mod over the sum, as long as you keep the mod outside of the sum. So, (x + y) % c = (x%c + y%c) % c. Ok, so like, we haven’t saved ANY time here at all, have we? Let’s just do MORE work, right? But wait! we get a nifty thing our mod over the above sum is now: (sigma (c_i * 256^i % 10) from i = 0 to n) % 10

    So, the advantage here, is that we’ve broken the math down so that we don’t have to add up a massively large number, we can just keep track of the lower byte, we really don’t need anything more, it all overflows out, and won’t effect the final mod by 10! So, now if we just have an easy way to calculate (c_i * 256^i % 10) But we do! Because you can distribute a mod over multiplication the same as you can over addition. So, now we simply need to calculate ((c_i % 10) * (256^i % 10)) % 10

    Now, let’s take a look at 256^i % 10. If you notice the last term is 6, and 6 times 6 is… 36. It turns out that (256^i % 10) = 6 for all cases except zero, where it is simply 1. So, here we have something really messy, but fairly easy to calculate:

    (c_0 % 10 + sigma ((c_i % 10 * 6) % 10) for i=1 to n) % 10. Now, let’s undistribute the mod 10 in the middle, and we get (c_0 % 10 + sigma (c_i * 6 % 10) for i=1 to n) % 10, and hey! we can distribute out the mod 10 again! (c_0 + sigma (c_i * 6) for i=1 to n) % 10, and then distribute out the 6, (c_0 + 6 * sigma(c_i) for i = 1 to n) % 10 Which has reduced everything down to n additions, one multiply, and one mod!

    And here’s the better thing, since 256^i % 10 = 6, we can use any size of int available, from 16-bit to 32-bit and 64-bit, they will all produce the same results!

    So, here’s the code:

    # IN: %eax address starting the arbitrarily large integer
    # IN: %ecx length of the arbitrary integer
    mod_10:
    xorl %edx, %edx
    pushl %eax
    incl %eax
    decl %ecx

    loop:
    addl (%eax), %edx
    incl %eax
    decl %ecx
    jnz loop

    leal (%edx, %edx,1), %edx # %edx = %edx * 3
    addl %edx, %edx # %edx = %edx * 2

    popl %eax
    addl (%eax), %edx # %edx = total sum
    pushl %edx

    movl $0xcccc_cccd, %ecx # 1/5
    leal (%edx,%edx), %eax # %eax = total sum * 2
    mull %ecx # %edx = (total sum * 2) / 5
    # %edx = total sum * (2 / 5)
    # %edx = total sum * (4 / 10)
    shrl $3, %edx # %edx = total sum * ( 4 / 10) / 4
    # %edx = total sum * (1 / 10)
    # %edx = val = total sum / 10

    leal (%edx, %edx, 4), %edx # %edx = %edx * 5 = (val * 5)
    addl %edx, %edx # %edx = %edx * 2 = (val * 10)

    popl %eax
    subl %edx, %eax # %eax = total sum – ((total sum / 10) * 10)
    # %eax = total sum % 10
    ret
    #OUT: %eax = modulo 10 of the big int

  2. Cassy, your math is correct but your code is wrong. I can’t read asm but I assume it looks something like this:

    unsigned mod10(const uint8 *p, size_t n)
    {
    infinite_integer sum = 0;

    for (size_t i = 1; i != n; ++i)
    sum = (sum + p[i]) % (1 + 2**32);

    sum = (sum * 6) % (1 + 2**32);
    sum = (sum + p[0]) % (1 + 2**32);

    return sum % 10;
    }

    with all the implicit modulos in C made explicit, you can see why it doesn’t work because 1 + 2**32 isn’t a multiple of 10.

  3. Yes, I realized that myself.

    Switch:

    loop:
    addl (%eax), %edx
    incl %eax
    decl %ecx
    jnz loop

    for:

    loop:
    addl (%eax), %edx
    jnc skip

    loop2:
    incl %edx
    jc loop2

    skip:
    addl $4, %eax
    subl $4, %ecx
    jnz loop

    Basically, we’re calculating the digital root of all of the bytes/words after the first one. In doing this, if there is a carry out, then we simply need to add that back into the value (fortunately, it’s impossible to setup a situation where incrementing by one will continuously generate a carry out)

    While calculating this sum we only care about the modulo 10 value stored in the sum, so that when we go to multiply it by 6, we get the right answer, (which cannot overflow, since we only care about the modulo 10 value times 6)

    Then we add in the original value, which won’t interfere with the modulo 10 value, then modulo the result by 10.

    Hm. It might be necessary to modulo by 10 explicitly before doing the multiply. The issue here is that you have to be careful to design it so that only the relevant information keeps in the result.

  4. Interesting. You see something similar when you modulate 8-bit colors (say, for alpha-compositing), since you have to do a fast divide-by-255. From memory, you end up with something like:

    unsigned char mul8(unsigned char a, unsigned char b) {
    unsigned short ret = a;
    ret *= b;
    ret += 128;
    return (ret + (ret >> 8)) >> 8;
    }

    Which technically only requires 16-bit arithmetic, so you can do the SWAR thing. Of course, most compilers are going to generate the full 32-bit imul.

    -W

  5. After thinking about it, replace:

    loop:
    addl (%eax), %edx
    jnc skip

    loop2:
    incl %edx
    jc loop2

    skip:
    addl $4, %eax
    subl $4, %ecx
    jnz loop

    with the following…

    loop:
    addl (%eax), %edx
    jnc skip
    incl %edx

    skip:
    addl $4, %eax
    subl $4, %ecx
    jnz loop

    This is because you can’t add two numbers, generate a carry, and then generate a carry in an increment. (Working with bytes, if you add 0xFF, and 0xFF, the maximum values, you get 0xFE with carry. If you then increment that you get 0xFF, with no carry.)

  6. Pingback: buy cialis
  7. Don’t you just multiply by (1/n)*2^32? At least on x86 this works:

    mov edx,X
    mov eax,(1/n)*2^32
    mul edx
    mov Y,edx

    ; Y=X/n

    It doesn’t work for all values of X and N, but with some shifts and/or rounding it will – always faster than DIV. I’m amazed at how fast DIV is getting in newer processors, though.

  8. Hi Won,

    Your formula to divide by 255 is not correct, it always gives an error for multiples of 255. I fixed the formula and now it shows the exact value:

    int r2 = (i+1 + (i >> 8)) >> 8;

    Here’s the test i used to confirm it:

    int diff=0;
    for (int i = 0; i > 8)) >> 8;
    if (r1 != r2)
    {
    diff++;
    System.out.println(i+”: “+r1+” != “+r2+” – “+temp);
    }
    }
    System.out.println(diff);

    With your formula, it shows 256 differences. With my formula, it shows 0 differences.

    HTH,

    guich

Leave a Comment