In almost all assembly books you’ll find some nice tricks to do fast multiplications. E.g. instead of “imul eax, ebx, 3” you can do “lea eax, [ebx+ebx*2]”Â (ignoring flag effects). It’s pretty clear how this works. But how can we speed up, say, a division by 3? This is quite important since division is still a really slow operation. If you never thought or heart about this problem before, get pen and paper and try a little bit. It’s an interesting problem.

The non-obvious trick here is: We can multiply by the inverse (i.e. the reciprocal). Maybe this sounds a little bit strange, since the inverse of 3 is what, 1/3? Yes, it is, over the rational numbers. But here we calculate over the finite ring Z/(2^{32})Z (for the non-mathematical readers: We wrap around after 2^{32}). And in Z/(2^{32})Z all odd elements have a multiplicative inverse. We can find the inverse with the extended euclidean algorithm (gcdex):

For numbers a, b gcdex(a, b) returns [g, x, y] where

- g is the greatest common divisor
- x, y are some numbers,
- such that x*a + y*b = g

E.g. gcdex(3, 2^{32}) = [1, -1431655765, 1]

That means:

-1431655765 * 3 + 1*2^{32}= 1

Since we are using the modulo 2^{32} math (we have 32 bit registers), this is exactly what we need:

-1431655765*3 + 2

^{32}= 0xaaaa_aaab * 3 = 1 (mod 2^{32})

So 1/3 = 0xaaaa_aaab (mod 2^{32})

Now, let’s look what we have so far:

For numbers divisible by 3 this is already our solution. We just have to multiply them by 0xaaaa_aaab, since this is the inverse of 3. But — in the general case — we want this to work with all values (the remainder should be ignored, but with our solution it destroys the result).

We need the another trick: We look at the 64 bit result and count the “overflows”. This is somehow similar to fixed point arithmetic. Look at this:

3 * 0xaaaa_aaab = 0x2_0000_0001, and that means

(n*3) * 0xaaaa_aaab = n * 2^{33}+ n

This is cool, because

(n*3 + c) * 0xaaaa_aaab = (n*2

^{33}+ n) + c*0xaaaa_aaab

So, with c in {0, 1, 2} we have the nice identity:

(n * 0xaaaa_aaab) >> 33 = n div 3, for all 0 <= n < 2

^{32}

Last but not least, our final solution for dividing by 3 is:

[IN: eax dividend]

mov ecx, 0xaaaa_aaab// 1/3mul ecx

// multiply with inverseshr edx, 1

// shift by 33[OUT: edx = eax div 3]

Of course, this is only half of the story; this is not the general solution for all divisors. And it doesn’t handle signed division. For futher reading see this paper or this gcc bug report. It also has its chapter in the AMD optimization manual.

BTW: This is part of reason why you have to learn all that math stuff in computer science. It’s useful! :-)

Actually, his reminds me of how to quickly mod an arbitrarily big int value by 10.

So, let’s say we have an array of bytes, starting at (%eax) with a length of %ecx, in increasing significant order. (The first byte is thus the least-significant byte)

So, if we look at this value, we see that we can trivially mathematically decompose the total big int to (c_0 + c_1 * 256 + c_2 * 256^2 + … + c_n * 256^n), So we get the following: sigma ( c_i * 256^i) from i=0 to n.

Now, here’s where the creative stuff works. If you perform a mod of a sum, you can distribute the mod over the sum, as long as you keep the mod outside of the sum. So, (x + y) % c = (x%c + y%c) % c. Ok, so like, we haven’t saved ANY time here at all, have we? Let’s just do MORE work, right? But wait! we get a nifty thing our mod over the above sum is now: (sigma (c_i * 256^i % 10) from i = 0 to n) % 10

So, the advantage here, is that we’ve broken the math down so that we don’t have to add up a massively large number, we can just keep track of the lower byte, we really don’t need anything more, it all overflows out, and won’t effect the final mod by 10! So, now if we just have an easy way to calculate (c_i * 256^i % 10) But we do! Because you can distribute a mod over multiplication the same as you can over addition. So, now we simply need to calculate ((c_i % 10) * (256^i % 10)) % 10

Now, let’s take a look at 256^i % 10. If you notice the last term is 6, and 6 times 6 is… 36. It turns out that (256^i % 10) = 6 for all cases except zero, where it is simply 1. So, here we have something really messy, but fairly easy to calculate:

(c_0 % 10 + sigma ((c_i % 10 * 6) % 10) for i=1 to n) % 10. Now, let’s undistribute the mod 10 in the middle, and we get (c_0 % 10 + sigma (c_i * 6 % 10) for i=1 to n) % 10, and hey! we can distribute out the mod 10 again! (c_0 + sigma (c_i * 6) for i=1 to n) % 10, and then distribute out the 6, (c_0 + 6 * sigma(c_i) for i = 1 to n) % 10 Which has reduced everything down to n additions, one multiply, and one mod!

And here’s the better thing, since 256^i % 10 = 6, we can use any size of int available, from 16-bit to 32-bit and 64-bit, they will all produce the same results!

So, here’s the code:

# IN: %eax address starting the arbitrarily large integer

# IN: %ecx length of the arbitrary integer

mod_10:

xorl %edx, %edx

pushl %eax

incl %eax

decl %ecx

loop:

addl (%eax), %edx

incl %eax

decl %ecx

jnz loop

leal (%edx, %edx,1), %edx # %edx = %edx * 3

addl %edx, %edx # %edx = %edx * 2

popl %eax

addl (%eax), %edx # %edx = total sum

pushl %edx

movl $0xcccc_cccd, %ecx # 1/5

leal (%edx,%edx), %eax # %eax = total sum * 2

mull %ecx # %edx = (total sum * 2) / 5

# %edx = total sum * (2 / 5)

# %edx = total sum * (4 / 10)

shrl $3, %edx # %edx = total sum * ( 4 / 10) / 4

# %edx = total sum * (1 / 10)

# %edx = val = total sum / 10

leal (%edx, %edx, 4), %edx # %edx = %edx * 5 = (val * 5)

addl %edx, %edx # %edx = %edx * 2 = (val * 10)

popl %eax

subl %edx, %eax # %eax = total sum – ((total sum / 10) * 10)

# %eax = total sum % 10

ret

#OUT: %eax = modulo 10 of the big int

Cassy, your math is correct but your code is wrong. I can’t read asm but I assume it looks something like this:

unsigned mod10(const uint8 *p, size_t n)

{

infinite_integer sum = 0;

for (size_t i = 1; i != n; ++i)

sum = (sum + p[i]) % (1 + 2**32);

sum = (sum * 6) % (1 + 2**32);

sum = (sum + p[0]) % (1 + 2**32);

return sum % 10;

}

with all the implicit modulos in C made explicit, you can see why it doesn’t work because 1 + 2**32 isn’t a multiple of 10.

Yes, I realized that myself.

Switch:

loop:

addl (%eax), %edx

incl %eax

decl %ecx

jnz loop

for:

loop:

addl (%eax), %edx

jnc skip

loop2:

incl %edx

jc loop2

skip:

addl $4, %eax

subl $4, %ecx

jnz loop

Basically, we’re calculating the digital root of all of the bytes/words after the first one. In doing this, if there is a carry out, then we simply need to add that back into the value (fortunately, it’s impossible to setup a situation where incrementing by one will continuously generate a carry out)

While calculating this sum we only care about the modulo 10 value stored in the sum, so that when we go to multiply it by 6, we get the right answer, (which cannot overflow, since we only care about the modulo 10 value times 6)

Then we add in the original value, which won’t interfere with the modulo 10 value, then modulo the result by 10.

Hm. It might be necessary to modulo by 10 explicitly before doing the multiply. The issue here is that you have to be careful to design it so that only the relevant information keeps in the result.

Interesting. You see something similar when you modulate 8-bit colors (say, for alpha-compositing), since you have to do a fast divide-by-255. From memory, you end up with something like:

unsigned char mul8(unsigned char a, unsigned char b) {

unsigned short ret = a;

ret *= b;

ret += 128;

return (ret + (ret >> 8)) >> 8;

}

Which technically only requires 16-bit arithmetic, so you can do the SWAR thing. Of course, most compilers are going to generate the full 32-bit imul.

-W

ugh, I mean 2**32 not 1 + 2**32, I always get an off-by-one error there.

After thinking about it, replace:

loop:

addl (%eax), %edx

jnc skip

loop2:

incl %edx

jc loop2

skip:

addl $4, %eax

subl $4, %ecx

jnz loop

with the following…

loop:

addl (%eax), %edx

jnc skip

incl %edx

skip:

addl $4, %eax

subl $4, %ecx

jnz loop

This is because you can’t add two numbers, generate a carry, and then generate a carry in an increment. (Working with bytes, if you add 0xFF, and 0xFF, the maximum values, you get 0xFE with carry. If you then increment that you get 0xFF, with no carry.)

Hello,

Very Informative Posts.

I really like what you have going on here. I’ll be back soon

Pretty Cool Place.

I like your style too.

Cya again,

Joey

I’m not quite understanding what all

this is supposed to be about?

Must be me or something…

How green is the grass on the other side of the fence?

Not much. Don’t believe it I tell you.

Jerry

Hey,

Really nice site you got here.

I’ll come back more often and check it out.

Peace!

helloy

Hi Jim. Photos i received. Thanks

Hello! Good Site! Thanks you! chmnkceyjlznc

Don’t you just multiply by (1/n)*2^32? At least on x86 this works:

mov edx,X

mov eax,(1/n)*2^32

mul edx

mov Y,edx

; Y=X/n

It doesn’t work for all values of X and N, but with some shifts and/or rounding it will – always faster than DIV. I’m amazed at how fast DIV is getting in newer processors, though.

Hi Won,

Your formula to divide by 255 is not correct, it always gives an error for multiples of 255. I fixed the formula and now it shows the exact value:

int r2 = (i+1 + (i >> 8)) >> 8;

Here’s the test i used to confirm it:

int diff=0;

for (int i = 0; i > 8)) >> 8;

if (r1 != r2)

{

diff++;

System.out.println(i+”: “+r1+” != “+r2+” – “+temp);

}

}

System.out.println(diff);

With your formula, it shows 256 differences. With my formula, it shows 0 differences.

HTH,

guich

note: for some obscure reason, part of the code i posted was truncated! So, just use the formula i posted.