Category Archives: puzzle

Aggressive Tail Call Optimization

In some i386/x86_64 assembly code my coworker was working on, there was a macro like this:

#define ENDFUNC leave 
                ret

Having forgotten about the exact implementation of the macro, he then wrote a function that ended like this:

        leave
        ENDFUNC

Now this obviously ended the function with

        leave
        leave
        ret

and he got very interesting results. What was happening?

The Infinite Loop Mystery

Today’s puzzle is about some code behaving horribly wrong.

Recently, I was working on some operating system project and hacking on the code to switch between privileged and non-privileged mode. I could switch modes successfully and intercept traps when in non-privileged mode.

Then I wanted to check whether I could handle timer interrupts correctly, so I added this to my non-privileged code, to give the timer interrupt a chance to fire:

    volatile int i;
    for (i=0; i<10000; i++);

Timer interrupts were handled correctly and eventually returned to the non-privileged code – but the delay code turned into an infinite loop!

I changed to loop to count only to 10, and I changed it to count down instead of up, but the result remained the same. I looked at the generated assembly. It looked like this:

    movl    $10, 0xfc(%ebp)	// i = 10
    jmp     1f			// goto 1
2:
    movl    0xfc(%ebp), %eax	// %eax = i
    decl    %eax		// %eax--
    movl    %eax,0xfc(%ebp)	// i = %eax
1:
    movl    0xfc(%ebp), %eax	// %eax = i
    testl   %eax, %eax		// if (%eax > 0)
    jg      2b			// goto 2

It looked fine. On every timer interrupt, I dumped %eax, and it was stuck at 10. I debugged my pusha/popa code to save and restore registers between modes, and it was okay. I debugged my flag handing code, and flags were fine.

Then I replaced my C code with the generated assembly code and added instructions that copied the value of %eax before the “decl” into %ebx, and after the “decl” into %ecx and added a trap instruction right after that to have privileged mode print out the values of the three registers.

    movl    $10, 0xfc(%ebp)	// i = 10
    jmp     1f			// goto 1
2:
    movl    0xfc(%ebp), %eax	// %eax = i
    movl    %eax, %ebx          // value before
    decl    %eax		// %eax--
    movl    %eax, %ecx          // value after
    TRAP
    movl    %eax,0xfc(%ebp)	// i = %eax
1:
    movl    0xfc(%ebp), %eax	// %eax = i
    testl   %eax, %eax		// if (%eax > 0)
    jg      2b			// goto 2

The result was %eax = %ebx = %ecx = 10. This is when I understood what was going on.

Please share your comments below. 🙂

Readable and Maintainable Bitfields in C

Bitfields are very common in low level C programming. You can use them for efficient storage of a data structure with lots of flags, or to pass a set of flags between functions. Let us look at the different ways of doing this.

The straightforward way to deal with bitfields is to do the Boolean logic by hand:

Boolean Magic

#define FLAG_USER   (1 << 0)
#define FLAG_ZERO   (1 << 1)
#define FLAG_FORCE  (1 << 2)
/* bits 3-30 reserved */
#define FLAG_COMPAT (1 << 31)

int
create_object(int flags)
{
        int is_compat = (flags & FLAG_COMPAT);

        if (is_compat)
                flags &= ~FLAGS_FORCE;

        if (flags & FLAG_FORCE) {
                [...]
        }
        [...]
}

int
create_object_zero(int flags)
{
	create_object(flags | FLAGS_ZERO);
}

void
caller()
{
        create_object(FLAG_FORCE | FLAG_COMPAT);
}

You can see code like this everywhere, so most C programmers can probably read and understand this quite easily. But unfortunately, this method is very error-prone. Mixing up "&" and "&&" and omitting the "~" when doing "&=" are common oversights, and since the compiler only sees "int" types, this also doesn't give you any kind of type-safety.

Bitfields

Let us look at the same code using bitfields instead:

typedef unsigned int boolean_t;
#define FALSE 0
#define TRUE !FALSE
typedef union {
        struct {
                boolean_t user:1;
                boolean_t zero:1;
                boolean_t force:1;
                int :28;                /* unused */
                boolean_t compat:1;     /* bit 31 */
        };
        int raw;
} flags_t;

int
create_object(flags_t flags)
{
        boolean_t is_compat = flags.compat;

        if (is_compat)
                flags.force = FALSE;

        if (flags.force) {
                [...]
        }
        [...]
}

int
create_object_zero(flags_t flags)
{
	flags.zero = TRUE;
	create_object(flags);
}

void
caller()
{
        create_object((flags_t) { { .force = TRUE, .compat = TRUE } });
}

Flags can just be used like any variables. The compiler abstracts the Boolean logic away. The only downside is that the code with the static initializer requires some advanced syntax.

Endianness

Bitfields in C always start at bit 0. While this is the least significant bit (LSB) on Little Endian (bit 0 has a weight of 2^0), most compilers on Big Endian systems inconveniently consider the most significant bit (MSB) bit 0 (bit 0 has a weight of 2^31, 2^63 etc. depending on the word size), so in case your bitfield needs to be binary-compatible across machines with different endianness, you will need to define two versions of the bitfield.

The Raw Bitfield

Did you notice the "int raw" in the union? It lets us conveniently (and type-safely) export the raw bit value without having to use a cast.

	printf("raw flags: 0x%xn", flags.raw);

We can use this to reconstruct the FLAG_* constants in the original example, in case some code needs it:

#define FLAG_USER   (((flags_t) { { .user   = TRUE } }).raw)
#define FLAG_ZERO   (((flags_t) { { .zero   = TRUE } }).raw)
#define FLAG_FORCE  (((flags_t) { { .force  = TRUE } }).raw)
#define FLAG_COMPAT (((flags_t) { { .compat = TRUE } }).raw)

This code constructs a temporary instance of the bitfield, sets one bit, and converts it into a raw integer - all at compile time.

Bitfield Access from Assembly

With the same trick, you can also access your bitfield from assembly, for example if the bitfield is part of the Thread Control Block in your operating system kernel, and the low level context switch code needs to access one of the bits. The "int raw" can be used to statically convert a flag into the corresponding raw mask:

typedef unsigned int boolean_t;

typedef union {
	struct {
		boolean_t bit0:1;
		boolean_t bit1:1;
		int :19;
		boolean_t bit31:1;
	};
	int raw;
} bitfield_t;


int test()
{
	int param = -1;
	int result;

	__asm__ volatile (
		"test    %2, %1    n"
		"xor     %0, %0    n"
		"setcc   %0        n"
		: "=r" (result)
		: "r" (param),
		  "i" (((bitfield_t) { { .bit31 = TRUE } }).raw)
	);
	return result;
}

The corresponding x86 assembly code looks like this:

	.text
	.align	4,0x90
	.globl	_test
_test:
	pushl	%ebp
	movl	%esp, %ebp
	movl	$-1, %eax
	## InlineAsm Start
	test	$0x80000000, %eax
	xor	%eax, %eax
	setcc	%eax
	## InlineAsm End
	popl	%ebp
	ret

	.subsections_via_symbols

This works fine with LLVM, but unfortunately GCC (4.2.1) has problems detecting that the raw value is available at compile time, so the "i" has to be replaced with an "r": GCC will then pre-assign a register with the raw value instead of being able to use an immediate with the "test" instruction.

How to Not Do It

I have seen C++ code doing this:

enum {
	FLAG_USER,
	FLAG_ZERO,
	FLAG_FORCE,
	FLAG_COMPAT = 31
}

int
create_object(bitfield_t flags)
{
        bool is_compat = flags.is_set(FLAG_COMPAT);

        if (is_compat)
                flags -= FLAGS_FORCE;

        if (flags.is_set(FLAG_FORCE)) {
                [...]
        }
        [...]
}

int
create_object_zero(int flags)
{
	create_object(flags + FLAGS_ZERO);
}

void
caller()
{
        create_object(((bitfield_t)FLAG_FORCE) + FLAG_COMPAT);
}

This all looks quite weird. The constants are bit index values, and they are added and subtracted. The reason is C++ operator overloading:

class bitmask_t
{
    word_t      maskvalue;

public:
    [...]
    inline bitmask_t operator -= (int n)
        {
            maskvalue = maskvalue & ~(1UL << n);
            return (bitmask_t) maskvalue;
        }
    [...]
}

This is horrible. The code that uses this class makes no sense unless you read and understand the implementation of the class. And you have to be very careful: While it is possible to "add" a flag to an existing bitfield, you cannot just add two flags - it would do the arithmetic and add the two values.

Mapping the setting and clearing of bits onto the addition and subtraction operators is clearly wrong in the first place: Flags in a bitfield are equivalent to elements in a set. Setting a flag is equivalent to the "union" operation, which even in Mathematics has its own symbol instead of overloading the "+" operator.

Question

If you compile code that does something like "((bitfield_t) { { .bit31 = TRUE } }).raw" with GCC in C++ mode, it fails. Why?

A Lot of Security

I happened to drive through Cupertino, CA, USA last Wednesday and ended up in this situation:

Oh-oh, they got me. But they were not after me, they escorted two vans onto some company’s campus.


Five police cars, two police motorcycles, and lots of people with suits and sunglasses. For some reason, this outfit doesn’t have the same effect on me any more since “The Matrix”.



The people in the vans went into the building through a side door:




Here are some details:



A blonde woman in white, a woman with a red dress, a man in a brown uniform with a suitcase, and lots of more men in suits. The vans had license places from Maryland.

The question of today’s security puzzle is: Who is the very important person?

"ROR" in Microsoft BASIC for 6502

If you disassemble any version of Microsoft BASIC for 6502, you’ll find this code in a function that normalizes the (simulated) floating point accumulator:

NORMALIZE_FAC6:
	inc	FAC		; MANTISSA CARRIED, SO SHIFT RIGHT
	beq	OVERFLOW	; OVERFLOW IF EXPONENT TOO BIG
	ror	FAC+1
	ror	FAC+2
	ror	FAC+3
	ror	FAC+4
	ror	FACEXTENSION
	rts

Well, not any BASIC. All versions of

  • Commodore BASIC (all versions, since 1977)
  • AppleSoft BASIC (all versions, since 1977)
  • Microsoft BASIC for the OHIO Scientific (all versions, since 1977)
  • Microsoft BASIC for the rare Mattel Intellivision Keyboard Component (1980)

use this code, but if you look at the disassembly of

  • Microsoft BASIC for the MOS KIM-1 (1977)
  • Microsoft BASIC for the Tangerine Microtan 65 (1979)

you will see this code instead:

NORMALIZE_FAC6:
        inc     FAC
        beq     OVERFLOW
        lda     #$00
        bcc     @1
        lda     #$80
@1:
        lsr     FAC+1
        ora     FAC+1
        sta     FAC+1
        lda     #$00
        bcc     @2
        lda     #$80
@2:
        lsr     FAC+2
        ora     FAC+2
        sta     FAC+2
        lda     #$00
        bcc     @3
        lda     #$80
@3
        lsr     FAC+3
        ora     FAC+3
        sta     FAC+3
        lda     #$00
        bcc     @4
        lda     #$80
@4:
        lsr     FAC+4
        ora     FAC+4
        sta     FAC+4
        lda     #$00
        bcc     @5
        lda     #$80
@5:
        lsr     FACEXTENSION
        ora     FACEXTENSION
        sta     FACEXTENSION
        rts

(Actually, the OHIO Scientific and Intellivision versions work on a 3 byte (“6 digit”) instead of a 4 byte (“9 digit”) mantissa, so the “FAC+4” part is missing.)

Similar replacement has happened in other parts of the floating point library. It seems to be a compile-time option of the assembly source code.

Todays puzzle is to find out why there are two versions of this code, and why the different computer vendors chose to use one version or another.

See comments for solution.

How retiring segmentation in AMD64 long mode broke VMware

UNIX, Windows NT, and all the operating systems in their class rely on virtual memory, or paging, in order to provide every process on the system a complete address space of its own. An easier way to protect processes from each other is segmentation: The 4 GB address space of a 32 bit CPU is divided into segments (consisting of a physical base address and a limit), one for each process, and every process may only access their own segment. This is what the 286 did.

The 386 then introduced virtual memory, but segmentation was still possible, either instead of, or on top of the paged virtual address space. Today, no modern operating system for the x86 uses segmentation any more, so for every process, the base for the code and data segments is set to 0, and the limit is set to 0xFFFFFFFF.

The AMD64 architecture, while still being fully compatible in 32 bit mode, retired a lot of legacy functionality in the new 64 bit long mode, including most of segmentation. The CS (code), DS (data 1), ES (data 2) and SS (stack) segment registers are practically gone, and the FS and GS segments still support a base (which can be used in tricks to quickly access data at a constant position, like the TCB), but the limit is no longer enforced. Now operating systems don’t have to save and restore most of these segment registers any more when switching contexts, making these switches faster.

But this broke VMware. While VMware could still virtualize 32 bit operating systems on AMD64 CPUs, they could not virtualize 64 bit operating systems, because they required segment limits.

In a nutshell, this is how VMware works: All user mode code of the guest runs in exactly the environment it expects; VMware makes sure the page mappings of the user mode address spaces are correct. All kernel mode code of the guest will be run in user mode, and again, VMware must layout memory as the guest kernel expects it to be. In both modes of operation, there can be exceptions, like system calls (by guest user mode code) or page table modifications (by guest kernel mode code). These have to be trapped by the virtual machine monitor, and the respective functionality has to be carried out in a modified way, so that they still seem to have the correct effect to the guest, but don’t interfere with the host operating system.

The virtual machine monitor’s trap handler must reside in the guest’s address space, because an exception cannot switch address spaces. So VMware’s trap handler sits at the very top of the every guest’s address space, which is unused by all major operating systems. According to Popek and Goldberg’s definition of virtualization, there must be no way for code inside a virtual machine to escape, and modify the host’s state in any way not directly controlled by the monitor. Therefore, it must be made sure that the guest code cannot write to the trap handler code. VMware does this using segment limits: The limits of all segment registers are set to something like 0xFFFFEFFF to protect the uppermost 4 KB of the address space where the trap handler resides.

With no segment limits any more for 64 bit code, this way to protect the trap handler was impossible. Unable to comply with Popek and Goldberg’s security requirement, VMware chose not to support 64 bit virtualization until AMD reintroduced (optional) segment limits on later models of their Opteron and Athlon 64 CPUs. Intel never implemented 64 bit segment limits on their EM64T/Intel64 CPUs, because their 64 bit processors soon implemented VT/Vanderpool, which also worked around the problem. So this is why VMware requires a certain model and stepping of the AMD CPU line or a VT-enabled Intel CPU in order to support 64 bit virtualization.

Now the question is: Why don’t they protect the uppermost page using the permission bits in the page table? This is how all operating systems protect themselves from user mode processes. If you have an answer on this, or otherwise have thoughts, please comment on this post. 🙂

References: 1, 2, 3, 4, 5

Arithmetic mean of two signed integers

It’s time for a puzzle again! (submitted by sheepmaster)

Calculate in x86 assembly language the arithmetic mean (a+b)/2 of two signed integers a and b. If you think this is trivial, consider the following edge cases: a = b = 0x7FFFFFFF and a = 0x7FFFFFFF, b = 0x80000000.

Again, the standard rules apply: a and b are in general purpose registers of your choice, the result can be in any register, and any register can be overwritten.

Oh, and by the way, using larger data types isn’t allowed either (that would make things too easy, wouldn’t it?).