Monthly Archives: June 2006

FFREEP – the assembly instruction that never existed

Due to simplified instruction decoding in the Intel 80287, this FPU had opcode aliases for instructions like FXCH and FSTP, i.e. there were some additional encodings that did the same as the originals defined by the 8087. As a side effect, a new instruction, FFREEP, appeared, although it was not intended by Intel.

The “Intel 80287 Programmer’s Reference Manual” (at least the revised 1987 version) listed FFREEP in the opcode list and explained what it does:

DF 1101 1111 1100 0REG (6)
The marked encodings are not generated by the language translators. If, however, the 80287 encounters one of these encodings in the instruction stream, it will execute it as follows:
FFREE ST(i) and pop stack

FFREEP was not documented in the instruction reference, but since the instruction was now somewhat official, Intel had to keep it in all future CPUs, and AMD, Cyrix and most other cloners implemented it as well, at least since the 387-class FPUs. But nobody documented it.
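To make the bit pattern quoted from the manual concrete, here is a minimal sketch in C that builds the two opcode bytes for a given stack register. The function names are my own, and the FFREE encoding (DD C0+i) is only included for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Encode FFREEP ST(i) following the pattern "1101 1111 1100 0REG",
 * i.e. the two bytes DF C0+i. The name ffreep_encode is illustrative. */
void ffreep_encode(unsigned i, uint8_t out[2])
{
    assert(i < 8);                 /* the x87 stack has eight registers */
    out[0] = 0xDF;                 /* 1101 1111 */
    out[1] = (uint8_t)(0xC0 | i);  /* 1100 0REG */
}

/* For comparison: the documented FFREE ST(i) is DD C0+i. */
void ffree_encode(unsigned i, uint8_t out[2])
{
    assert(i < 8);
    out[0] = 0xDD;
    out[1] = (uint8_t)(0xC0 | i);
}
```

With an assembler that does not know the mnemonic, the same bytes could presumably be emitted by hand, for example as a raw byte directive containing 0xDF, 0xC0 for FFREEP ST(0).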

FFREEP made a brief appearance again in the 1997 “Intel Architecture Optimization Manual” for the Pentium Pro. This CPU was Intel’s first x86 processor that translated x86 instructions into RISC-like micro-ops, so the optimization manual listed the number of micro-ops necessary for each instruction, including the otherwise undocumented FFREEP.

In 2002, AMD finally documented FFREEP – not. They dedicated a whole page in the “AMD Athlon Processor x86 Code Optimization Guide” to this instruction, describing what it does, and how it can be used. They even state:

Note that the FFREEP instruction, although insufficiently documented in the past, is supported by all 32-bit x86 processors.

This is not entirely correct, as the Nexgen 586PF did not support FFREEP – AMD obviously interprets “all 32-bit x86 processors” as “all Intel and AMD (and possibly Cyrix) 32-bit x86 processors”. Oh, and please note that even after this, AMD does not list FFREEP in its x86/AMD64 instruction reference.

Although FFREEP has now been retroactively documented, has existed in all P6-class and later CPUs, and actually serves a purpose, it is still hardly used, even though most disassemblers (objdump, HT) and i386 emulators (Bochs, QEMU) support it. The GCC toolchain seems to be the only one that ever emits code using FFREEP, and it only does so when tuning for AMD K8 CPUs.


Virtualization: The elegant way and the x86 way

Virtualization means running one or more complete operating systems (at the same time) on one machine, possibly on top of another operating system. VMware, VirtualPC, Parallels etc. support, for example, running a complete GNU/Linux OS on top of Windows. For virtualization, the Virtual Machine Monitor (VMM) must be more powerful than the kernel mode code of the guest: The guest’s kernel mode code must not be allowed to change the global state of the machine, but it must not notice that its attempts fail, as it was designed for kernel mode. The VMM as the arbiter must be able to control the guest completely.

Architectures like the PowerPC made virtualization easy from the beginning. There are no assembly instructions that work differently in kernel mode than in user mode. An instruction either works the same in both modes, or it throws an exception when used in user mode. Virtualizing an operating system is then as easy as running the kernel mode part of the guest in user mode and emulating all instructions that throw exceptions. When the guest OS wants to set up a page table, the VMM notices this, intercepts the instruction, and changes its own page tables, so that the guest OS works as it is supposed to, but the VMM and other guests cannot be affected.

On the x86 platform, there are several instructions that just behave differently in kernel mode and in user mode. If we run kernel mode code in user mode, some sensitive instructions might not throw exceptions, but instead return incorrect (compared to kernel mode) results. VMware, VirtualPC, Parallels and friends therefore have to scan all kernel mode code and replace these sensitive instructions with explicit calls to the VMM. This effectively steals about 100 MHz of computing power per VM running.
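The difference between the two designs can be sketched with a toy model in C. Everything here is hypothetical (the struct, the function names, the single interrupt flag). The "silent" variant is modelled on POPF, which in user mode drops a change to the interrupt flag without raising an exception.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy CPU state: a privilege flag and an interrupt-enable flag. */
struct toy_cpu {
    bool kernel_mode;
    bool interrupts_enabled;
};

enum outcome { EXECUTED, TRAPPED, SILENTLY_IGNORED };

/* "Elegant" design: the sensitive instruction traps in user mode,
 * so a VMM can catch it and emulate it. */
enum outcome set_if_trapping(struct toy_cpu *cpu, bool value)
{
    if (!cpu->kernel_mode)
        return TRAPPED;
    cpu->interrupts_enabled = value;
    return EXECUTED;
}

/* x86-style design: in user mode the write is dropped without any
 * exception, so the VMM never gets a chance to intervene. */
enum outcome set_if_silent(struct toy_cpu *cpu, bool value)
{
    if (!cpu->kernel_mode)
        return SILENTLY_IGNORED;
    cpu->interrupts_enabled = value;
    return EXECUTED;
}

/* A guest kernel, deprivileged to user mode, tries to disable interrupts.
 * *virtual_if is the interrupt flag the VMM keeps on behalf of the guest.
 * Returns the value of that virtual flag afterwards. */
bool guest_disable_interrupts(enum outcome (*set_if)(struct toy_cpu *, bool),
                              bool *virtual_if)
{
    struct toy_cpu cpu = { .kernel_mode = false, .interrupts_enabled = true };
    if (set_if(&cpu, false) == TRAPPED)
        *virtual_if = false;          /* the VMM emulates the instruction */
    assert(cpu.interrupts_enabled);   /* the real flag is never touched */
    return *virtual_if;
}
```

With set_if_trapping the virtual flag ends up cleared, as the guest expects. With set_if_silent it stays set, which is exactly the kind of incorrect result that forces the code scanning described above.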

Intel fixed it with its “Virtualization Technology” (VT), formerly known as “Vanderpool”, but not by adding a global switch that makes all sensitive instructions throw exceptions in user mode – but by adding yet another mode of execution. The new “root mode” is more powerful than standard kernel mode. The host OS and the VMM run in root mode, and the VMM switches to “non-root” mode into the guest OS, after telling the CPU which instructions and events should make it leave non-root mode and return to the VMM. This sounds complex – but it therefore fits nicely into the x86 architecture. ;-)
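As a very rough sketch of that control flow, here is a toy model in C: the VMM passes a mask of events that should cause an exit, and the guest runs until it hits one of them. The event names and the bitmask are made up for illustration. Real VT configures this through a control structure with far more detail, which this sketch does not try to reproduce.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds a guest might produce. */
enum event { EV_ARITHMETIC = 0, EV_WRITE_CR3 = 1, EV_IO_PORT = 2, EV_HALT = 3 };

/* Run guest events in "non-root mode", starting at index start, until one
 * of them is selected in exit_mask. Returns the index of the event that
 * caused the exit, or n if the guest ran to the end without exiting. */
size_t run_until_exit(const enum event *guest, size_t n, size_t start,
                      unsigned exit_mask)
{
    assert(start <= n);
    for (size_t pc = start; pc < n; pc++) {
        if (exit_mask & (1u << guest[pc]))
            return pc;   /* leave non-root mode and return to the VMM */
        /* otherwise the guest handles the event itself, at full speed */
    }
    return n;
}
```

A VMM loop would call run_until_exit, handle the event at the returned index, and then re-enter the guest at the following index.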

Although AMD’s Pacifica is incompatible, it’s the same design. But it’s more powerful: Pacifica allows 16-bit as well as non-paged applications in non-root mode, whereas VT restricts the VM to 32/64-bit paged mode.

I know that I simplified the whole issue a lot, but if you have corrections or any other comments, please do add them.

The funny page table terminology on AMD64

What’s the next word in this sequence: PT, PD, PDP, …?

As you probably know, “AX” means “A extended”, and therefore “EAX” means “extended AX extended”. With the 64 bit extensions of the 8086 architecture, AMD chose “RAX”, not adding another “extended”…

Something similar happened with page tables since the i386. The i386 (1985) could (theoretically) map 4 GB of virtual memory to 4 GB of physical memory, so it needed two levels of page tables. One single 4 KB “page directory table” (PD) had 1024 32 bit page directory entries (PDE), pointing to 1024 4 KB “page tables” (PT), each of which had 1024 page table entries (PTE).
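In terms of bits, 1024 entries per level means 10 bits of index each, plus 12 bits of offset into the 4 KB page. A small sketch in C of that split (the struct and function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a 32 bit virtual address with i386 two-level paging. */
struct i386_va {
    unsigned pd_index;  /* bits 31..22: which PDE in the page directory */
    unsigned pt_index;  /* bits 21..12: which PTE in the page table */
    unsigned offset;    /* bits 11..0:  byte within the 4 KB page */
};

struct i386_va i386_split(uint32_t va)
{
    struct i386_va r;
    r.pd_index = (va >> 22) & 0x3FF;
    r.pt_index = (va >> 12) & 0x3FF;
    r.offset   = va & 0xFFF;
    assert(r.pd_index < 1024 && r.pt_index < 1024);
    return r;
}
```

For example, the address 0xC0101234 splits into PD index 768, PT index 257 and offset 0x234.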

The Pentium Pro (1995) implemented a hack called “PAE” (Physical Address Extension) that allowed a total of up to 64 GB of RAM, without changing the 4 GB limit per address space. For this, page table entries now had to be 64 bits wide, and only 512 entries fit into a 4 KB page table. The same was true for the page directory: It could now point to page tables above 4 GB, so entries there had to be 64 bits wide as well, and again only 512 entries fit now. One page directory therefore covered only 1 GB (512 × 512 × 4 KB), so a third level of page tables had to be introduced: Intel called it the “page directory pointer table” (PDP), and it only contains 4 (64 bit) entries pointing to the four page directories, so that every virtual address space could still be 4 GB. (The register CR3, which now points to the PDP, also got the alternate name PDPTR: “page directory pointer table register”.)
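With PAE the same 32 bit address is cut differently: 2 bits select one of the four PDP entries, then 9 bits each for the page directory and the page table, then the 12 bit offset. A sketch along the same lines, again with names of my own choosing:

```c
#include <assert.h>
#include <stdint.h>

/* 64 bit entries in a 4 KB table: 4096 / 8 = 512 entries per table. */
enum { PAE_ENTRIES_PER_TABLE = 4096 / sizeof(uint64_t) };

/* The four fields of a 32 bit virtual address with PAE paging. */
struct pae_va {
    unsigned pdp_index; /* bits 31..30: one of 4 PDP entries */
    unsigned pd_index;  /* bits 29..21: one of 512 PDEs */
    unsigned pt_index;  /* bits 20..12: one of 512 PTEs */
    unsigned offset;    /* bits 11..0:  byte within the 4 KB page */
};

struct pae_va pae_split(uint32_t va)
{
    struct pae_va r;
    r.pdp_index = (va >> 30) & 0x3;
    r.pd_index  = (va >> 21) & 0x1FF;
    r.pt_index  = (va >> 12) & 0x1FF;
    r.offset    = va & 0xFFF;
    assert(r.pd_index < PAE_ENTRIES_PER_TABLE);
    assert(r.pt_index < PAE_ENTRIES_PER_TABLE);
    return r;
}
```

The address 0xC0101234 now lands in PDP entry 3, PD index 0, PT index 257, with the same offset 0x234.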

When in 2003 AMD introduced the AMD64 64 bit extensions to the i386 architecture, page tables had to be extended once more: In the implementation currently on the market (and copied by Intel), the CPUs can map 48 bit virtual addresses to 52 bit physical addresses. Using all 512 entries of the page directory pointer table (instead of just 4) only allows 39 bits of virtual addresses (512 GB), so another level of page tables was introduced. (They could have introduced more extra levels, but 256 TB of address space seemed to be enough for now – another level can be introduced at any time with new CPUs by just changing the OS, and without having to change application programs.)

The interesting fact is now what AMD called it… “page directory pointer-pointer” (PDPP)? “page directory pointer directory” (PDPD)? No, they understood that numbering the page table levels was a better idea, as they all have the same format anyway. The (single) 4th level page table is called “page map level 4” (PML4). The other levels are still named PDP, PD and PT in the documentation, though (also in Intel’s), probably to make it easier for developers familiar with i386/PAE.
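Since all four levels have the same format, the address split becomes very regular: 9 bits per level on top of the 12 bit page offset. Here is a sketch in C (struct and function names are assumptions of mine) that also reproduces the 39 bit and 48 bit figures from above:

```c
#include <assert.h>
#include <stdint.h>

/* Virtual address bits covered by a given number of 512-entry levels. */
unsigned va_bits(unsigned levels)
{
    return 12 + 9 * levels;  /* 3 levels: 39 bits, 4 levels: 48 bits */
}

/* The five fields of a 48 bit virtual address with 4-level paging. */
struct amd64_va {
    unsigned pml4_index; /* bits 47..39 */
    unsigned pdp_index;  /* bits 38..30 */
    unsigned pd_index;   /* bits 29..21 */
    unsigned pt_index;   /* bits 20..12 */
    unsigned offset;     /* bits 11..0  */
};

struct amd64_va amd64_split(uint64_t va)
{
    struct amd64_va r;
    r.pml4_index = (unsigned)((va >> 39) & 0x1FF);
    r.pdp_index  = (unsigned)((va >> 30) & 0x1FF);
    r.pd_index   = (unsigned)((va >> 21) & 0x1FF);
    r.pt_index   = (unsigned)((va >> 12) & 0x1FF);
    r.offset     = (unsigned)(va & 0xFFF);
    assert(r.pml4_index < 512);
    return r;
}
```

Adding a fifth level would, by the same arithmetic, give va_bits(5) = 57 bits.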

The C ! operator

In C, the ! (“logical NOT”) operator used on a value x evaluates to 0 when x is not 0, and 1 when x is 0. In other words, it’s equivalent to the following C:

(x == 0) ? 1 : 0

How should this be implemented in x86 assembly language, when “x” is already in a register? The target register can either be the same one, or it can be a different one. I didn’t try too hard and got 7 bytes; it can probably be made better. On other CPUs, it can be done in a single instruction. For example, in MIPS: “sltiu dest, src, 1”.

Note that this is about the case where the compiler doesn’t know how the result of the ! is used, as in “return !x;” in a non-inlined function. Cases like “if (!x)” are simpler.
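To have something to check candidates against, here is a small C sketch with three equivalent formulations. The second mirrors the MIPS instruction. The third models one possible x86 sequence (neg, sbb, inc) step by step. That sequence is only an example I am assuming here, not the 7 byte solution mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics of !x. */
int not_reference(uint32_t x)
{
    return (x == 0) ? 1 : 0;
}

/* Mirrors "sltiu dest, src, 1": unsigned x < 1 holds only for x == 0. */
int not_sltiu(uint32_t x)
{
    return x < 1u;
}

/* Models one possible x86 sequence:
 *     neg eax        ; CF = (eax != 0)
 *     sbb eax, eax   ; eax = eax - eax - CF = -CF
 *     inc eax        ; eax = 1 - CF
 */
int not_neg_sbb_inc(uint32_t x)
{
    uint32_t cf  = (x != 0);  /* carry flag after NEG */
    uint32_t eax = 0u - cf;   /* result of SBB eax, eax */
    eax += 1u;                /* INC eax */
    return (int)eax;
}

/* Checks that all three formulations agree for a given input. */
void check_not(uint32_t x)
{
    assert(not_sltiu(x) == not_reference(x));
    assert(not_neg_sbb_inc(x) == not_reference(x));
}
```

Calling check_not with 0, 1 and 0xFFFFFFFF covers the interesting cases.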

(If you want to share how easily it can be done on *your* favorite CPU, please post a comment as well!)

The real reason for driver signing in Vista x64

In Windows Vista x64, drivers are required to be signed by someone holding a VeriSign code certificate or they won’t load. There is no way to (permanently) disable this signing even if you are Administrator. The F8 startup menu has an option to disable it, but you must select it every time you boot up. Microsoft’s claimed reason for this is that it prevents Trojans from installing kernel-mode rootkits. That is a load of crap.