Monthly Archives: July 2006

Redundant SSE instructions

As we all know, the x86 ISA has a lot of redundant instructions (i.e. instructions with the same semantics but different opcodes). Sometimes this is unavoidable, sometimes it looks like bad design. But with SSE it gets really weird. Let’s say we want to perform xmm0 <- xmm0 & xmm1 (i.e. bitwise AND). Not an uncommon operation, but we have three different ways to achieve this:

  • andps xmm0, xmm1 (0f 54 c1)
  • andpd xmm0, xmm1 (66 0f 54 c1)
  • pand xmm0, xmm1 (66 0f db c1)

(Note that andpd and pand are SSE2 instructions.)
As far as the result in xmm0 is concerned, these really are the same instruction. Now, why did Intel do this? First, let’s look at andps/andpd. The optimization manuals give a hint: the ps/pd suffix marks the target register as containing singles or doubles, so it should match the data you are actually operating on.
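The claim that all three variants leave the same bits in the destination can be checked from C with compiler intrinsics. The following is a minimal sketch, assuming an x86 compiler with SSE2 enabled; the function name and_variants_agree is made up for this example. Keep in mind that a compiler is free to pick whichever encoding it likes for each intrinsic, so this only checks the results, not which opcode ends up in the binary.

```c
#include <xmmintrin.h>  /* SSE:  _mm_and_ps */
#include <emmintrin.h>  /* SSE2: _mm_and_pd, _mm_and_si128, casts */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the andps, andpd and pand
 * intrinsics all produce the same 128 bits as a plain scalar AND. */
int and_variants_agree(const uint32_t a[4], const uint32_t b[4])
{
    __m128i ia = _mm_loadu_si128((const __m128i *)a);
    __m128i ib = _mm_loadu_si128((const __m128i *)b);

    /* pand: integer domain */
    __m128i via_pand = _mm_and_si128(ia, ib);
    /* andps: the casts only reinterpret the bits, they convert nothing */
    __m128i via_andps = _mm_castps_si128(
        _mm_and_ps(_mm_castsi128_ps(ia), _mm_castsi128_ps(ib)));
    /* andpd: same idea for the double-precision variant */
    __m128i via_andpd = _mm_castpd_si128(
        _mm_and_pd(_mm_castsi128_pd(ia), _mm_castsi128_pd(ib)));

    uint32_t r_pand[4], r_andps[4], r_andpd[4], expected[4];
    _mm_storeu_si128((__m128i *)r_pand, via_pand);
    _mm_storeu_si128((__m128i *)r_andps, via_andps);
    _mm_storeu_si128((__m128i *)r_andpd, via_andpd);

    for (int i = 0; i < 4; i++)
        expected[i] = a[i] & b[i];

    return memcmp(r_pand, expected, sizeof expected) == 0
        && memcmp(r_andps, expected, sizeof expected) == 0
        && memcmp(r_andpd, expected, sizeof expected) == 0;
}
```

Calling and_variants_agree with any two arrays of four 32-bit values should return 1, including bit patterns that would be NaNs when read as floats, since a bitwise AND never interprets the data.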

It looks like the processor internally handles floats in some “unpacked” structure, and the ps/pd suffix is a sort of hint as to whether it has to repack the number again. Or something like that; at least this is only an optimization issue. But that’s stupid: if the processor already knows the internal format, one “andp” instruction would be sufficient, and the processor could perform andps or andpd anyway, depending on which is faster in the situation. Or look at the MMX case: there we have no pandb, pandw, pandd, pandq etc. The same applies to “movapd/movdqa memory, xmm”: damn, it’s the processor that knows better than I do how to achieve this the fastest way.

Finally, let’s look at pand. After Intel recognized that MMX was a complete mess, they opened up the MMX instructions to the xmm registers (0x66 prefix). And now? We have a third way to do the AND… And it somehow looks like they never had SSE2 in mind when they designed the SSE1 instructions.

How Itanium messed up Intel's CPUID family IDs

Assigning internal version/family/model IDs to products is a non-trivial task, especially if there are several different families/architectures on your roadmap, and if the marketing names and target markets have no real correlation to the internal architecture.

With Windows, Microsoft’s versioning scheme was quite adventurous: After Windows 95 (internal version number 4.0) and Windows 98 (version 4.10), Microsoft chose the version number 4.90 for Windows ME, the last operating system of the Win9X line, which was supposed to be replaced by Windows NT version 5.0, a.k.a. Windows 2000. After all, Windows ME had a user interface and Win32 API version close to those of Windows 2000, so it was somewhere between 98 (4.10) and 2000 (5.0). It was not until Windows XP (version 5.10) that consumers actually switched to the NT line, but the numbering was still consistent. Everything went right.

At Intel, everything went wrong. On every modern x86 processor, the CPUID instruction returns, among other things, the family code of the CPU. The i486 (1989) is family 4, Pentium (1993) and Pentium MMX (1997) are family 5, Pentium Pro (1995), Pentium 2 (1997), Pentium 3 (1999) and Pentium M (2003) are family 6 (“P6 microarchitecture”).

Since 1997 or so, it seems to have been clear to Intel that the x86 line of CPUs (retroactively named “IA-32”) was to be eventually replaced by IA-64. Itanium, the first IA-64 CPU, was supposed to be released around 1998. IA-64 also supported the CPUID instruction, and the Itanium was specified to return family 7.

But Itanium was 3 years late, and Intel introduced a successor to the P6 line of CPUs, which was a complete redesign, to compete against the very strong AMD Athlon in the x86 market. The Pentium 4 (“NetBurst architecture”) needed a family code. It had no relation to family 6 (P6) and none to family 7 (Itanium). So they chose family 15, which was the highest number that could be represented in the 4-bit family field of CPUID, and defined it to mean: “check the (previously undefined) bits 20 to 27 for the information you are looking for”. All NetBurst CPUs had “0” as the extended family ID, so the effective family was “15/0”. Then came the Itanium 2 (2002): For some reason, Intel didn’t use family 8 for it, but family 15 and extended family 1 (“15/1”; later Itanium 2 CPUs had “15/2”).
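To make the bit fields concrete, here is a small C sketch that picks apart the 32-bit signature that CPUID returns in EAX (leaf 1 on x86). The struct and function names are made up for this example, and the signature values mentioned afterwards are only sample inputs; the sketch decodes a number you hand it and does not execute CPUID itself.

```c
#include <stdint.h>

/* Hypothetical container for the fields discussed in the text. */
struct cpu_signature {
    unsigned stepping;         /* bits 0-3 */
    unsigned model;            /* bits 4-7, only 4 bits wide */
    unsigned family;           /* bits 8-11, only 4 bits wide */
    unsigned extended_family;  /* bits 20-27, consulted when family == 15 */
};

/* Split a raw signature value into its fields. */
struct cpu_signature decode_signature(uint32_t eax)
{
    struct cpu_signature s;
    s.stepping        = eax & 0xF;
    s.model           = (eax >> 4) & 0xF;
    s.family          = (eax >> 8) & 0xF;
    s.extended_family = (eax >> 20) & 0xFF;
    return s;
}
```

For instance, decoding 0x00000F29 gives family 15 with extended family 0 (the “15/0” of NetBurst), while 0x000006F6 gives family 6, model 15.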

Today we know that IA-64 will not replace IA-32, in particular because Microsoft ruled out the possibility of supporting IA-64 on desktop Windows. In 2006, Intel introduced the Core and Core 2 CPU lines, which replace the Pentium 4 – and they are IA-32.

So what is the family ID of the Core CPUs? “15/3”, because it is the next free ID? “8”, because numbers 8 to 14 are not taken yet? No, Core and Core 2 are family 6: These CPUs are direct successors of the Pentium 3, and thus based on the P6 microarchitecture. The model ID encoded in CPUID is “13” for the last Pentium M (“Dothan”), “14” for the Core, and “15” for the Core 2 and the Core 2 based Xeon. Now the problem is that the model ID bit field is only 4 bits wide, so “15” is the highest model ID that can be represented. I think we are all curious how Intel is going to encode Core 3…

* 486 (1989): family 4
* Pentium (1993): family 5
* Pentium Pro (1995): family 6, models 0 and 1
* Pentium 2 (1997): family 6, models 3, 5 and 6
* Pentium 3 (1999): family 6, models 7, 8, 10, 11
* Itanium (2001): family 7
* Pentium 4 (2000): family 15/0
* Itanium 2 (2002): family 15/1 and 15/2
* Pentium M (2003): family 6, models 9 and 13
* Core (2006): family 6, model 14
* Core 2 (2006): family 6, model 15


Win32's MulDiv

In Win32, there is an API call called “MulDiv”:

The MulDiv function multiplies two 32-bit values and then divides the 64-bit result by a third 32-bit value. The return value is rounded up or down to the nearest integer.

int WINAPI MulDiv(
    int nNumber,
    int nNumerator,
    int nDenominator
);

If a divide overflow (including divide by zero) occurs, MulDiv returns -1. (Stupidly, there’s no way to actually check whether the result truly is -1 instead of an error.)
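To pin down what the function is supposed to compute, here is a minimal C sketch of the documented behavior using 64-bit arithmetic. This is not the official implementation: the name muldiv_sketch is made up, a 32-bit int is assumed, and rounding halves away from zero is an assumption too, since the description above only says “nearest integer”.

```c
#include <stdint.h>

/* Hypothetical reference version of the documented MulDiv behavior.
 * Assumes int is 32 bits wide. Returns -1 on divide by zero or when
 * the rounded quotient does not fit in an int. */
int muldiv_sketch(int number, int numerator, int denominator)
{
    if (denominator == 0)
        return -1;

    /* The product of two 32-bit values always fits in 64 bits. */
    int64_t product = (int64_t)number * numerator;
    int negative = (product < 0) != (denominator < 0);

    /* Work on magnitudes so the rounding is symmetric around zero. */
    uint64_t p = (uint64_t)(product < 0 ? -product : product);
    uint64_t d = (uint64_t)(denominator < 0 ? -(int64_t)denominator
                                            : (int64_t)denominator);

    /* Adding d/2 before dividing rounds to nearest, halves away from
     * zero (an assumption, see above). */
    uint64_t q = (p + d / 2) / d;

    /* Negative results may reach 2^31, positive ones only 2^31 - 1. */
    if (q > (uint64_t)INT32_MAX + (negative ? 1u : 0u))
        return -1;

    return negative ? (int)(-(int64_t)q) : (int)q;
}
```

With this sketch, muldiv_sketch(10, 3, 4) yields 8 (7.5 rounded), muldiv_sketch(1, 1, 0) yields -1 as an error, and muldiv_sketch(-1, 1, 1) yields -1 as a perfectly valid result, which is exactly the ambiguity complained about above.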

How do we implement this in x86-32? The official version does not use structured exception handling to simply catch the exception and handle it. MulDiv actually checks for overflow beforehand and never causes an exception.

The official implementation is several pages long, and I think we could do much better.

If this were the unsigned case, the entire function would be simple:

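    mov eax, [esp+4]            ; eax = nNumber
    mul dword ptr [esp+8]       ; edx:eax = nNumber * nNumerator
    cmp edx, [esp+12]           ; quotient fits in 32 bits only if edx < nDenominator
    jae overflow                ; also catches nDenominator == 0
    div dword ptr [esp+12]      ; eax = quotient
    ret 12
overflow:
    or eax, -1                  ; return -1
    ret 12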
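The same check can be written in C, which may make it clearer why comparing the upper half of the product with the divisor is enough. This is only a sketch of the unsigned case with a made-up name, umuldiv_sketch; like div, it truncates instead of rounding.

```c
#include <stdint.h>

/* Hypothetical C counterpart of the unsigned assembly version. */
uint32_t umuldiv_sketch(uint32_t number, uint32_t numerator,
                        uint32_t denominator)
{
    uint64_t product = (uint64_t)number * numerator;   /* edx:eax */
    uint32_t high = (uint32_t)(product >> 32);         /* edx */

    /* The quotient fits in 32 bits exactly when high < denominator.
     * A zero denominator always fails this test, so no division by
     * zero can happen below. */
    if (high >= denominator)
        return 0xFFFFFFFFu;                            /* -1 */

    return (uint32_t)(product / denominator);
}
```

For example, umuldiv_sketch(10, 3, 4) gives 7, and umuldiv_sketch(0x80000000, 2, 1) gives the error value because the product 2^32 does not fit in 32 bits.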