A while ago, an engineer from a respectable company for low-level solutions (no names without necessity!) claimed that a certain company’s new 4-way SMP system had broken CPUs or at least broken firmware that didn’t set up some CPU features correctly: While on the older 2-way system, all CPUs returned the same features (using CPUID), on the 4-way system, two of the CPUs would return bogus data. read more

Why is there no CR1 – and why are control registers such a mess anyway?

If you want to enable protected mode or paging on the i386/x86_64 architecture, you use CR0, which is short for control register 0. Makes sense. These are important system settings. But if you want to switch the pagetable format, you have to change a bit in CR4 (CR1 does not exist and CR2 and CR3 don’t hold control bits), if you want to switch to 64 bit mode, you have to change a bit in an MSR, oh, and if you want to turn on single stepping, that’s actually in your FLAGS. Also, have I mentioned that CR5 through CR15 don’t exist – except for CR8, of course? read more

Intel VT VMCS Layout

I understand that there might be a good reason for Intel to add virtualization extensions to their CPU architecture. Instead of fixing the x86 architecture to (optionally) make it Popek-Goldberg compliant and have all critial instructions trap if not run in Ring 0, they added non-root mode, a very big hammer that allows me to switch my CPU state completely to that of the guest and switches back to my original host state on a certain event in the guest. Well, it’s a great toy for people who want to play with CPU internals. read more

How retiring segmentation in AMD64 long mode broke VMware

UNIX, Windows NT, and all the operating systems in their class rely on virtual memory, or paging, in order to provide every process on the system a complete address space of its own. An easier way to protect processes from each other is segmentation: The 4 GB address space of a 32 bit CPU is divided into segments (consisting of a physical base address and a limit), one for each process, and every process may only access their own segment. This is what the 286 did. read more

How to divide fast by immediates

In almost all assembly books you’ll find some nice tricks to do fast multiplications. E.g. instead of “imul eax, ebx, 3” you can do “lea eax, [ebx+ebx*2]” (ignoring flag effects). It’s pretty clear how this works. But how can we speed up, say, a division by 3? This is quite important since division is still a really slow operation. If you never thought or heart about this problem before, get pen and paper and try a little bit. It’s an interesting problem. read more

Shift oddities

Most of the x86 instructions will automatically alter the flags depending on the result. Sometimes this is rather frustrating because you actually what to preserve the flags as long as possible, and sometimes you miss a “mov eax, ecx” which alters the flags. But at least it’s guaranteed that an instruction either sets the flags or it doesn’t touch them, independent of the actual operation… Or is it? read more

Redundant SSE instructions

As we all know the x86-ISA has a lot of redundant instructions (ie. instructions with the same semantic but different opcodes). Sometimes this is unavoidable, sometimes it looks like bad design. But with SSE it gets really weird. Let’s say we want to perform xmm0 <- xmm0 & xmm1 (ie. bitwise and). Not an uncommon operation; but we have 3 different ways do archive this:

  • andps xmm0, xmm1 (0f 54 c1)
  • andpd xmm0, xmm1 (66 0f 54 c1)
  • pand xmm0, xmm1 (66 0f db c1)

(Note that andpd/pand are SSE2 instructions)
Regarding the result in xmm0 these are really the same instructions. Now, why did Intel do this? First we’re going to inspect andps/andpd. Looking at the optimization manuals we get a hint: The ps/pd mark the target register to contain singles or doubles, so they should match the actual data you are operating on. read more