It’s time for a puzzle again! (submitted by sheepmaster)

Calculate in x86 assembly language the arithmetic mean (a+b)/2 of two signed integers a and b. If you think this is trivial, consider the following edge cases: a = b = 0x7FFFFFFF and a = 0x7FFFFFFF, b = 0x80000000.

Again, the standard rules apply: a and b are in general purpose registers of your choice, the result can be in any register, and any register can be overwritten.

Oh, and by the way, using larger data types isn’t allowed either (that would make things too easy, wouldn’t it?).

As always I’m not fluent in 0x86 – but I speak some ARM and therefore here is a solution I thought up. First the idea:

Shift both registers right and add them. If both registers are of same sign and the zero bit was set, add/substract one. end.

a is in R0, b in R1, result will be in R2 and register R0 and R1 will be killed.

MOV R2,R0, ASR#1

ADD R2,R2,R1,ASR#1

AND R0,#&80000001

AND R1,#&80000001

ADD R0,R0,R1

ADD R2,R2,R0,ASR#1

which would have been a neat idea, but I just realized that asr doesn’t do a correct div 2 in the negative case, and so my code doesn’t work. Will think about what to do with that, and write another komment.

SHR EAX

SHR EBX

ADD EAX,EBX

I know, it can be off by one, but who cares π

–Blerik

So, fixing the ASR != DIV 2 problem took me another 3 instructions to give a total of nine:

ADD R3,R0,R0 LSR#31

MOV R2,R3, ASR#1

ADD R3,R1,R1 LSR#31

ADD R2,R2,R3,ASR#1

AND R0,#&80000001

AND R1,#&80000001

ADD R0,R0,R1

ADD R0,R0,R0 LSR#31

ADD R2,R2,R0,ASR#1

now it should work correctly, i hope π

@blerik: at least do a sar please π

@Dominik: I mistranslated my Java code to assembly, probably because it has been 15 years since the last time I used assembly :).

And what is the problem with ASR? Is it different than SAR? SAR does division by two perfectly for negative (2s compement) numbers.

–Blerik

@Dominik: What do you get when R0 = -1 and R1 = 1? On paper I got 1, but that could just be me not being as fluent in ARM…

I had a similar idea, but didn’t need to handle negative numbers separately, as ASR always rounds

down, so you have to add 1 when both numbers are odd.What are the highest-value bits in your AND Rx,#&80000001 for?

@Sheepmaster:

R0 = -1 and R1 = 1

R3 = 0

R2 = 0

R3 = 1

R2 = 0

R0 = &80000001

R1 = 1

R0 = &80000001 + 1

@Sheepmaster:

R0 = -1 and R1 = 1

R3 = 0

R2 = 0

R3 = 1

R2 = 0

R0 = &80000001

R1 = 1

R0 = &80000001 + 1 <- here’s the error

… damn…

ASR always rounds to the lower numbers, this is why i add the sign-bit all the time (ADD rd, rs, rs LSR#31) – so the ASR rounds correctly to zero instead of lower numbers.

Oh shit – now I see the bug, the AND s are for determining if i need to adjust the resulting value. My initial solution did compare the two instead of adding them. We only need adjusting when both the sign and the lowest bit are set. But my next step is wrong. since &80000001 is nowhere near -1 (which I assumed in my disordered mind), adding doesn’t work. Arg.

Okay, now it got ugly. Just doing branches to determine if we have to adjust and in which direction. And the solution isn’t beautiful anymore at all π :

— first the addition and correct ASR

ADD R3,R0,R0 LSR#31

MOV R2,R3, ASR#1

ADD R3,R1,R1 LSR#31

ADD R2,R2,R3,ASR#1

— check first and last bit and if they are equal

AND R0,#&80000001

AND R1,#&80000001

— if they aren’t we jump because we are finished

CMP R0,R1

BNE .end

— if the zero bit isn’t set we are finished

TST R0,#1

BEQ .end

— now test the sign and add or substract our adjustment

TST R0,#&80000000

SUBEQ R2,R2,#1

ADDNE R2,R2,#1

.end

The boring x86 solution (uses CDQ, so not exactly correct wrt. “no larger data types” rule…); eax = a, ebx = b, result in eax, clobbers ebx and edx:

add eax, ebx

mov ebx, 2

cdq

idiv ebx

… and now I notice that it doesn’t work on the edge cases. D’oh.

What do you think of this (input: eax, ebx – output: eax)?

add eax, ebx

jo oh_no_bad_thing

sar eax, 1

jmp done

oh_no_bad_thing:

rcr eax, 1

done:

@blerik:

To get rid of the off-by-one problem, I think you must add the value of a AND b AND 1 to your result.

Okay, after a quick bit of reviewing some algorithms, I’ve come up with this code (equivalent to (a & b) + ((a ^ b) >> 1) ):

mov ecx, eax

xor ecx, ebx

sar ecx, 1

and eax, ebx

add eax, ecx

This is the floor of the average. You didn’t specify how to round, so I assume this is what you wanted…

An old friend of mine, Risc OS ARM Assembler Guru, came up with this reduction:

MOV R2,#0 ; Startwert

MOVS R0,R0,ASR #1 ; R0=R0/2, Rest in Carry-Flag

ADC R2,R2,#0 ; R2=R2+0+Carry-Flag

MOVS R1,R1,ASR #1 ; R1=R1/2, Rest in Carry-Flag

ADC R2,R2,#0 ; R2=R2+0+Carry-Flag

MOV R2,R2,ASR #1 ; R2=R2/2 (wir haben jetzt also das Mittel aus den

; beiden Inhalten von Bit 0)

ADD R2,R2,R0 ; Zum Schluss:

ADD R2,R2,R1 ; Einfach alles aufsummieren

Tests:

R0=&7fffffff, R1=&7fffffff -> R2=&7fffffff

R0=&80000000, R1=&80000000 -> R2=&80000000

R0=&7fffffff, R1=&80000000 -> R2=&ffffffff

Reduced to:

AND R2,R0,#1

MOVS R1,R1,ASR #1

ADC R2,R2,#0

ADD R2,R1,R2,ASR #1

ADD R2,R2,R0,ASR #1

and finally to a fantastic 4 ARM Risc Instruction Photo Finish:

AND R2,R0,R1

AND R2,R2,#1

ADD R2,R2,R0,ASR #1

ADD R2,R2,R1,ASR #1

π – did I mention that I suck at assembler?

@DrV: That’s a nice solution. Mine would have been the assembly equivalent of (a >> 1) + (b >> 1) + (a & b & 1).

Γ’ΒΕwhich would look like yours in ARM assembly, Dominik π

@DrV: (a & b) + ((a ^ b) >> 1) rounds wrongly for negative numbers. e.g. -3,-4 = -4 instead of -3

@sheepmaster: sadly it wasn’t mine π

Okay, maybe I should have specified how the result should be rounded. For me, always rounding down would be acceptable, as this is what an arithmetic shift right does, and the goal is not to implement some sophisticated rounding method in assembler (although it could be an interesting taskΓ’ΒΕ) π

If you’re trying to do this on an x86, and you’re using more than two instructions, you’ve already gone too far.

addl %ebx, %eax

# %eax holds 32 bits, and the carry flag holds the 9th bit

rcrl $1, %eax

# rotate %eax and the carry flag as a 33-bit number, right by one

@Cassy: No, that doesn’t work for a = 0x7FFFFFFF and b = 0x80000000. You have to check if there’s an

overflowfirst, and if not (which would be the case for a = 0x7FFFFFFF and b = 0x80000000), do an arithmetic shift right.@sheepmaster: π I think you’re right. The problem there is that while the “shift” right is 33-bit, the addition wouldn’t be sign extended to 33-bits. *sigh* Well, just to keep my totally hackish untranslatable to C example going:

addl %eax, %ebx

seto %bl

; %bl is either set to 1, or -1, either way the LSbit is 1

pushl %ebx

popf

; We put %ebx into the flags register

; Since CF is the LSbit, it now has the value of OF

rcrl $1, %eax

; This works correctly for signed values now.

OK, I broke this into three cases:

– both positive (5,3), (0x7fffffff,0x7fffffff)

– one neg, one pos (0xfffffff,1), (0x80000000,0x10000000),(0x80000000,0x7fffffff),(0x80000000,1)

– both negative (0x8000000, 0x8000000), (0xfffffff,0xfffffff)

I think this works for all three cases: arithmetic shift right each number by 1, add, and add in bitwise OR of lowest bits in the original number.

mov edx,eax

or edx,ebx

and edx,1

asr eax,1

asr ebx,1

add eax,ebx

add eax,edx

Oh yeah except asr=sar. Not the smallest code perhaps, but nice and simple.

Cassy: “seto bl” means set bl=01 if overflow, bl=00 if not.

@myria: from what I’ve heard, “SETcc” should not be trusted to set the register to just 1, because I’ve heard at times that it could possibly be set to -1 (all bits set)

Either way, my hack version still works, because the least significant bit is the one that gets moved into the CF, which is then moved into the register with the RCR. Note that all but 8-bits of the EFLAGS end up undefined from pushing EBX then popping it into EFLAGS.

Of course, my friend said he had to laugh at my solution, because it was just such an incredible hack. π Of course, this is also the reason why I love assembly so much π

Cassy: Your code doesn’t work for the test case 0x7FFFFFFF, even after switching around eax and ebx in the first instruction.

I mean, 0x7FFFFFFF with 0x7FFFFFFF.

Matt Parks’s code appears correct… I have a modification of his that is 1 instruction shorter:

mov eax,ecx

or ecx,edx

sar eax,1

sar edx,1

shr ecx,1

adc eax,edx

This would need to be profiled against Matt’s code. It’s hard to tell which one will be better; the shr/adc can be expensive. It likely has terrible performance on Pentium 4’s, which hate “adc” and “sbb”.

My code is Visual C++’s __fastcall: parameters in ecx, edx; return value in eax; ecx, edx volatile.

Ick, stupid dest/src flipflopping π

Hm… I’m working up a hack that actually works, but it’s taking me some time.

Write an optimal program in y86 assembly language for the following problem.

In a particular application, one of the key operational function takes in a

32 bit unsigned value as arguement. This value has 3 parts.

bits 0-15 -> value 1.

bits 17-31 -> value 2.

bit 16 -> operation.

=0 => value 1 + value 2

=1 => value 1 – value 2

Both values are unsigned values.

The function reads in the 32 bit value, performs the operation and

returns a 16bit value.

If the values are signed values, will there be any change in the code? if so, what are the changes to be made? if not, Justify.

If instead of bit 16, the operatoin bit is bit 15 (with value 1 = bit 0-14 and value 2 = bits 16-31), what are the changes to be made in the above code?

Hello! Good Site! Thanks you! jknjqjonwriuma

How about this one:?

I modified the code from Goplat, http://forums.thedailywtf.com/forums/t/5470.aspx

add eax,edx

setl cl

shr cl,1

rcr eax,1

setc cl // c is odd?

cdq // edx = -1 if r=0

add cl,dl

adc eax,0 // inc if c is odd and r> 1); // arithmetic shift!

return (((int)(r<0)) & c & 1) + r;

}

My entry has been garbled, so I repost it in html…

How about this one:?

I modified the code from Goplat, http://forums.thedailywtf.com/forums/t/5470.aspx

add eax,edx

setl cl

shr cl,1

rcr eax,1

setc cl // c is odd?

cdq // edx = -1 if r<0, 0 if r>=0

add cl,dl

adc eax,0 // inc if c is odd and r<0, i.e. cl=1 and dl=-1

or alternatively this one?:

mov ecx,eax

xor eax,edx

and edx,ecx

mov ecx,eax

sar eax,1

add eax,edx

sets dl

and ecx,edx

and ecx,1

add eax,ecx

These functions want the parameters in eax and edx, give back the result in eax, and modifies only ecx.

int avg(int a, int b)

{

int c,r;

c = a ^ b;

r = (a & b) + (c >> 1); // arithmetic shift!

return (((int)(r<0)) & c & 1) + r;

}

Goplat’s magic lies in the setl after the add command.

By replacing the shr cl,1 by neg cl, the first function gets even better (I replaced cl by dl as well):

add eax,edx

setl dl

neg dl // put “less” into carry, and set dl to -1 (r<0) or 0 (r>=0)

rcr eax,1

add dl,0

adc eax,0 // inc if c is odd and r<0, i.e. carry of rcr and dl=-1

Unfortunately all commands depend on each other, i.e. nothing can be done in parallel on a P4.

By the way: Myria’s (6. August 2006 at 22:43) and Matt Parks’ (3. August 2006 at 17:39) code fail for e.g. -1 and 2 where the result should be 0 instead of 1.

Uh oh, in the code above the add after the rcr should have been a adc. This error came by manually copying the code.

Here is the corrected code:

add eax,edx

setl dl

neg dl

rcr eax,1

adc dl,0

adc eax,0

Here is yet another sequence, albeit only a gain on Core2:

add eax,edx

setl dl

shrd eax,edx,1

add dl,-2

adc eax,0

Argh! The same mistake again!

Here is the corrected sequence:

add eax,edx

setl dl

shrd eax,edx,1

adc dl,-2

adc eax,0

I would add 0x80000000 to each input, making them unsigned, then add, shift right with carry to average them, and subtract 0x80000000 from the answer.

That would round toward negative infinity which is not what was wanted.

Example: -2 / -1 => -2 instead of -1

Here’s another method for the “round towards zero” challenge:

add eax,edx

jae here1

cmc

here1: jnl here2

sbb eax,-1

here2: rcr eax,1

It’s one instruction more than Jaspers, but in my tests on an E2200 ‘dual core’, it’s significantly faster. It’s also shorter in real terms (12 bytes compared to 15), and it doesn’t destroy edx.

Your solution appears to be correct.

It is smaller and could even be the faster on most processors but it contains jumps and more commands.

The original problem specification is missing the goal which solution is considered “best”, i.e. smallest, least number of commands, or fastest (which processor?).

Regarding Mysoft’s code. jae is equivalent to jnc. So jae/cmc just ensures that the carry is reset. Therefore the code can be modified to:

add eax,edx

clc

jnl here2

sub eax,-1

here2: rcr eax,1

Eliminating the first jump, and changing the sbb to sub, which in theory should be faster code. I ran a test changing it to al/dl (8 bit version) so I could quickly iterate over all possible signed byte values of a and b, and it appears to work fine.

I spent a while thinking about eliminating the second jump, I was able to do it in this way (8 bit test code again):

asm

mov al, [a]

add al, [b]

clc

setl bl

neg bl

sub al, bl

rcr al, 1

mov [function], al

end asm

When extended to 32 bits, it will also require a xor ebx, ebx before usage. Mysoft believes this way will be slower on many CPUs, I didn’t run any tests on speed though.

I could find an even better variant of Pete Black’s code. The CLC instruction is superfluous, and DEC is one byte smaller than NEG (the SET instruction must be inverted):

xor ecx,ecx

add eax,edx

setnl cl

dec ecx

sub eax,ecx

rcr eax,1