Arithmetic mean of two signed integers

It’s time for a puzzle again! (submitted by sheepmaster)

Calculate in x86 assembly language the arithmetic mean (a+b)/2 of two signed integers a and b. If you think this is trivial, consider the following edge cases: a = b = 0x7FFFFFFF and a = 0x7FFFFFFF, b = 0x80000000.

Again, the standard rules apply: a and b are in general purpose registers of your choice, the result can be in any register, and any register can be overwritten.

Oh, and by the way, using larger data types isn’t allowed either (that would make things too easy, wouldn’t it?).

43 thoughts on “Arithmetic mean of two signed integers”

  1. As always I’m not fluent in 0x86 – but I speak some ARM and therefore here is a solution I thought up. First the idea:
    Shift both registers right and add them. If both registers are of same sign and the zero bit was set, add/substract one. end.
    a is in R0, b in R1, result will be in R2 and register R0 and R1 will be killed.
    MOV R2,R0, ASR#1
    ADD R2,R2,R1,ASR#1
    AND R0,#&80000001
    AND R1,#&80000001
    ADD R0,R0,R1
    ADD R2,R2,R0,ASR#1

    which would have been a neat idea, but I just realized that asr doesn’t do a correct div 2 in the negative case, and so my code doesn’t work. Will think about what to do with that, and write another komment.

  2. So, fixing the ASR != DIV 2 problem took me another 3 instructions to give a total of nine:
    ADD R3,R0,R0 LSR#31
    MOV R2,R3, ASR#1
    ADD R3,R1,R1 LSR#31
    ADD R2,R2,R3,ASR#1
    AND R0,#&80000001
    AND R1,#&80000001
    ADD R0,R0,R1
    ADD R0,R0,R0 LSR#31
    ADD R2,R2,R0,ASR#1

    now it should work correctly, i hope πŸ˜‰

  3. @Dominik: I mistranslated my Java code to assembly, probably because it has been 15 years since the last time I used assembly :).

    And what is the problem with ASR? Is it different than SAR? SAR does division by two perfectly for negative (2s compement) numbers.


  4. @Dominik: What do you get when R0 = -1 and R1 = 1? On paper I got 1, but that could just be me not being as fluent in ARM…

    I had a similar idea, but didn’t need to handle negative numbers separately, as ASR always rounds down, so you have to add 1 when both numbers are odd.

    What are the highest-value bits in your AND Rx,#&80000001 for?

  5. @Sheepmaster:
    R0 = -1 and R1 = 1
    R3 = 0
    R2 = 0
    R3 = 1
    R2 = 0
    R0 = &80000001
    R1 = 1
    R0 = &80000001 + 1 <- here’s the error
    … damn…

    ASR always rounds to the lower numbers, this is why i add the sign-bit all the time (ADD rd, rs, rs LSR#31) – so the ASR rounds correctly to zero instead of lower numbers.

    Oh shit – now I see the bug, the AND s are for determining if i need to adjust the resulting value. My initial solution did compare the two instead of adding them. We only need adjusting when both the sign and the lowest bit are set. But my next step is wrong. since &80000001 is nowhere near -1 (which I assumed in my disordered mind), adding doesn’t work. Arg.

  6. Okay, now it got ugly. Just doing branches to determine if we have to adjust and in which direction. And the solution isn’t beautiful anymore at all πŸ™ :

    — first the addition and correct ASR
    ADD R3,R0,R0 LSR#31
    MOV R2,R3, ASR#1
    ADD R3,R1,R1 LSR#31
    ADD R2,R2,R3,ASR#1

    — check first and last bit and if they are equal
    AND R0,#&80000001
    AND R1,#&80000001
    — if they aren’t we jump because we are finished
    CMP R0,R1
    BNE .end
    — if the zero bit isn’t set we are finished
    TST R0,#1
    BEQ .end
    — now test the sign and add or substract our adjustment
    TST R0,#&80000000
    SUBEQ R2,R2,#1
    ADDNE R2,R2,#1

  7. The boring x86 solution (uses CDQ, so not exactly correct wrt. “no larger data types” rule…); eax = a, ebx = b, result in eax, clobbers ebx and edx:

    add eax, ebx
    mov ebx, 2
    idiv ebx

  8. What do you think of this (input: eax, ebx – output: eax)?

    add eax, ebx
    jo oh_no_bad_thing
    sar eax, 1
    jmp done
    rcr eax, 1

    To get rid of the off-by-one problem, I think you must add the value of a AND b AND 1 to your result.

  9. Okay, after a quick bit of reviewing some algorithms, I’ve come up with this code (equivalent to (a & b) + ((a ^ b) >> 1) ):

    mov ecx, eax
    xor ecx, ebx
    sar ecx, 1
    and eax, ebx
    add eax, ecx

    This is the floor of the average. You didn’t specify how to round, so I assume this is what you wanted…

  10. An old friend of mine, Risc OS ARM Assembler Guru, came up with this reduction:

    MOV R2,#0 ; Startwert

    MOVS R0,R0,ASR #1 ; R0=R0/2, Rest in Carry-Flag
    ADC R2,R2,#0 ; R2=R2+0+Carry-Flag

    MOVS R1,R1,ASR #1 ; R1=R1/2, Rest in Carry-Flag
    ADC R2,R2,#0 ; R2=R2+0+Carry-Flag

    MOV R2,R2,ASR #1 ; R2=R2/2 (wir haben jetzt also das Mittel aus den
    ; beiden Inhalten von Bit 0)

    ADD R2,R2,R0 ; Zum Schluss:
    ADD R2,R2,R1 ; Einfach alles aufsummieren


    R0=&7fffffff, R1=&7fffffff -> R2=&7fffffff

    R0=&80000000, R1=&80000000 -> R2=&80000000

    R0=&7fffffff, R1=&80000000 -> R2=&ffffffff

    Reduced to:

    AND R2,R0,#1
    MOVS R1,R1,ASR #1
    ADC R2,R2,#0
    ADD R2,R1,R2,ASR #1
    ADD R2,R2,R0,ASR #1

    and finally to a fantastic 4 ARM Risc Instruction Photo Finish:

    AND R2,R0,R1
    AND R2,R2,#1
    ADD R2,R2,R0,ASR #1
    ADD R2,R2,R1,ASR #1

    πŸ˜‰ – did I mention that I suck at assembler?

  11. Okay, maybe I should have specified how the result should be rounded. For me, always rounding down would be acceptable, as this is what an arithmetic shift right does, and the goal is not to implement some sophisticated rounding method in assembler (although it could be an interesting taskΓ’Β€Εš) πŸ˜‰

  12. If you’re trying to do this on an x86, and you’re using more than two instructions, you’ve already gone too far.

    addl %ebx, %eax
    # %eax holds 32 bits, and the carry flag holds the 9th bit
    rcrl $1, %eax
    # rotate %eax and the carry flag as a 33-bit number, right by one

  13. @Cassy: No, that doesn’t work for a = 0x7FFFFFFF and b = 0x80000000. You have to check if there’s an overflow first, and if not (which would be the case for a = 0x7FFFFFFF and b = 0x80000000), do an arithmetic shift right.

  14. @sheepmaster: πŸ˜› I think you’re right. The problem there is that while the “shift” right is 33-bit, the addition wouldn’t be sign extended to 33-bits. *sigh* Well, just to keep my totally hackish untranslatable to C example going:

    addl %eax, %ebx
    seto %bl
    ; %bl is either set to 1, or -1, either way the LSbit is 1
    pushl %ebx
    ; We put %ebx into the flags register
    ; Since CF is the LSbit, it now has the value of OF
    rcrl $1, %eax
    ; This works correctly for signed values now.

  15. OK, I broke this into three cases:
    – both positive (5,3), (0x7fffffff,0x7fffffff)
    – one neg, one pos (0xfffffff,1), (0x80000000,0x10000000),(0x80000000,0x7fffffff),(0x80000000,1)
    – both negative (0x8000000, 0x8000000), (0xfffffff,0xfffffff)

    I think this works for all three cases: arithmetic shift right each number by 1, add, and add in bitwise OR of lowest bits in the original number.

    mov edx,eax
    or edx,ebx
    and edx,1
    asr eax,1
    asr ebx,1
    add eax,ebx
    add eax,edx

  16. @myria: from what I’ve heard, “SETcc” should not be trusted to set the register to just 1, because I’ve heard at times that it could possibly be set to -1 (all bits set)

    Either way, my hack version still works, because the least significant bit is the one that gets moved into the CF, which is then moved into the register with the RCR. Note that all but 8-bits of the EFLAGS end up undefined from pushing EBX then popping it into EFLAGS.

    Of course, my friend said he had to laugh at my solution, because it was just such an incredible hack. πŸ™‚ Of course, this is also the reason why I love assembly so much πŸ˜‰

  17. Cassy: Your code doesn’t work for the test case 0x7FFFFFFF, even after switching around eax and ebx in the first instruction.

  18. Matt Parks’s code appears correct… I have a modification of his that is 1 instruction shorter:

    mov eax,ecx
    or ecx,edx
    sar eax,1
    sar edx,1
    shr ecx,1
    adc eax,edx

    This would need to be profiled against Matt’s code. It’s hard to tell which one will be better; the shr/adc can be expensive. It likely has terrible performance on Pentium 4’s, which hate “adc” and “sbb”.

    My code is Visual C++’s __fastcall: parameters in ecx, edx; return value in eax; ecx, edx volatile.

  19. Write an optimal program in y86 assembly language for the following problem.

    In a particular application, one of the key operational function takes in a
    32 bit unsigned value as arguement. This value has 3 parts.

    bits 0-15 -> value 1.
    bits 17-31 -> value 2.
    bit 16 -> operation.
    =0 => value 1 + value 2
    =1 => value 1 – value 2
    Both values are unsigned values.
    The function reads in the 32 bit value, performs the operation and
    returns a 16bit value.
    If the values are signed values, will there be any change in the code? if so, what are the changes to be made? if not, Justify.
    If instead of bit 16, the operatoin bit is bit 15 (with value 1 = bit 0-14 and value 2 = bits 16-31), what are the changes to be made in the above code?

  20. My entry has been garbled, so I repost it in html…

    How about this one:?
    I modified the code from Goplat,

    add eax,edx
    setl cl
    shr cl,1
    rcr eax,1
    setc cl  // c is odd?
    cdq  // edx = -1 if r<0, 0 if r>=0
    add cl,dl
    adc eax,0  // inc if c is odd and r<0, i.e. cl=1 and dl=-1

    or alternatively this one?:

    mov ecx,eax
    xor eax,edx
    and edx,ecx
    mov ecx,eax
    sar eax,1
    add eax,edx
    sets dl
    and ecx,edx
    and ecx,1
    add eax,ecx

    These functions want the parameters in eax and edx, give back the result in eax, and modifies only ecx.

    int avg(int a, int b)
      int c,r;
      c = a ^ b;
      r = (a & b) + (c >> 1);  // arithmetic shift!
      return (((int)(r<0)) & c & 1) + r;

  21. Goplat’s magic lies in the setl after the add command.
    By replacing the shr cl,1 by neg cl, the first function gets even better (I replaced cl by dl as well):

    add eax,edx
    setl dl
    neg dl  // put “less” into carry, and set dl to -1 (r<0) or 0 (r>=0)
    rcr eax,1
    add dl,0
    adc eax,0  // inc if c is odd and r<0, i.e. carry of rcr and dl=-1

    Unfortunately all commands depend on each other, i.e. nothing can be done in parallel on a P4.

  22. By the way: Myria’s (6. August 2006 at 22:43) and Matt Parks’ (3. August 2006 at 17:39) code fail for e.g. -1 and 2 where the result should be 0 instead of 1.

  23. Uh oh, in the code above the add after the rcr should have been a adc. This error came by manually copying the code.

    Here is the corrected code:
    add eax,edx
    setl dl
    neg dl
    rcr eax,1
    adc dl,0
    adc eax,0

    Here is yet another sequence, albeit only a gain on Core2:
    add eax,edx
    setl dl
    shrd eax,edx,1
    add dl,-2
    adc eax,0

  24. Argh! The same mistake again!

    Here is the corrected sequence:
    add eax,edx
    setl dl
    shrd eax,edx,1
    adc dl,-2
    adc eax,0

  25. I would add 0x80000000 to each input, making them unsigned, then add, shift right with carry to average them, and subtract 0x80000000 from the answer.

  26. That would round toward negative infinity which is not what was wanted.
    Example: -2 / -1 => -2 instead of -1

  27. Here’s another method for the “round towards zero” challenge:

    add eax,edx
    jae here1
    here1: jnl here2
    sbb eax,-1
    here2: rcr eax,1

    It’s one instruction more than Jaspers, but in my tests on an E2200 ‘dual core’, it’s significantly faster. It’s also shorter in real terms (12 bytes compared to 15), and it doesn’t destroy edx.

  28. Your solution appears to be correct.
    It is smaller and could even be the faster on most processors but it contains jumps and more commands.
    The original problem specification is missing the goal which solution is considered “best”, i.e. smallest, least number of commands, or fastest (which processor?).

  29. Regarding Mysoft’s code. jae is equivalent to jnc. So jae/cmc just ensures that the carry is reset. Therefore the code can be modified to:

    add eax,edx
    jnl here2
    sub eax,-1
    here2: rcr eax,1

    Eliminating the first jump, and changing the sbb to sub, which in theory should be faster code. I ran a test changing it to al/dl (8 bit version) so I could quickly iterate over all possible signed byte values of a and b, and it appears to work fine.

    I spent a while thinking about eliminating the second jump, I was able to do it in this way (8 bit test code again):

    mov al, [a]
    add al, [b]
    setl bl
    neg bl
    sub al, bl
    rcr al, 1
    mov [function], al
    end asm

    When extended to 32 bits, it will also require a xor ebx, ebx before usage. Mysoft believes this way will be slower on many CPUs, I didn’t run any tests on speed though.

  30. I could find an even better variant of Pete Black’s code. The CLC instruction is superfluous, and DEC is one byte smaller than NEG (the SET instruction must be inverted):

    xor ecx,ecx
    add eax,edx
    setnl cl
    dec ecx
    sub eax,ecx
    rcr eax,1


Leave a Comment