As we all know, the x86 ISA has a lot of redundant instructions (i.e. instructions with the same semantics but different opcodes). Sometimes this is unavoidable, sometimes it looks like bad design. But with SSE it gets really weird. Let’s say we want to perform xmm0 <- xmm0 & xmm1 (i.e. a bitwise AND). Not an uncommon operation, but we have 3 different ways to achieve this:
- andps xmm0, xmm1 (0f 54 c1)
- andpd xmm0, xmm1 (66 0f 54 c1)
- pand xmm0, xmm1 (66 0f db c1)
(Note that andpd/pand are SSE2 instructions)
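To make the difference between the three encodings concrete, here is a minimal C sketch that assembles the bytes listed above. The names sse_op, SSE_ANDS and encode_xmm_xmm are made up for this illustration, and the encoder only handles the register-to-register form with xmm0..xmm7 (no REX prefix, no memory operands).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One SSE register-to-register instruction: an optional 0x66 prefix,
   the 0x0F escape byte, and the opcode byte. (Hypothetical helper type.) */
typedef struct {
    const char *mnemonic;
    int has_66_prefix;
    uint8_t opcode;
} sse_op;

static const sse_op SSE_ANDS[3] = {
    { "andps", 0, 0x54 },   /*    0f 54 /r */
    { "andpd", 1, 0x54 },   /* 66 0f 54 /r */
    { "pand",  1, 0xDB },   /* 66 0f db /r */
};

/* Encode "op xmm<dst>, xmm<src>" for registers 0..7.
   Writes at most 4 bytes to out and returns the number written. */
static size_t encode_xmm_xmm(const sse_op *op, unsigned dst, unsigned src,
                             uint8_t out[4])
{
    size_t n = 0;
    assert(dst < 8 && src < 8);
    if (op->has_66_prefix)
        out[n++] = 0x66;
    out[n++] = 0x0F;
    out[n++] = op->opcode;
    /* ModRM byte: mod=11 (register direct), reg=dst, rm=src */
    out[n++] = (uint8_t)(0xC0 | (dst << 3) | src);
    return n;
}
```

Calling encode_xmm_xmm(&SSE_ANDS[i], 0, 1, buf) for i = 0, 1, 2 yields the three byte sequences from the list. andps and andpd differ only in the 0x66 prefix, while andpd and pand differ only in the opcode byte.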
Regarding the result in xmm0, these are really the same instruction: all three leave the same 128 bits in the register. Now, why did Intel do this? First we’re going to inspect andps/andpd. Looking at the optimization manuals we get a hint: the ps/pd suffix marks the target register as containing packed singles or packed doubles, so the instruction should match the data you are actually operating on.
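The claim that all three variants leave the same bits in the register can be checked from C with the intrinsics that correspond to them (_mm_and_ps, _mm_and_pd, _mm_and_si128). This is only a sketch: it assumes an x86 compiler with SSE2 enabled, and_three_ways is a name invented for this post, and the compiler is free to emit any of the three instructions for any of the intrinsics, so it demonstrates the equivalence of the results, not which opcode ends up in the binary.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE1 ones */
#include <stdint.h>
#include <string.h>

/* ANDs two 16-byte blocks three ways (pand, andps, andpd).
   Stores the pand result in out and returns 1 if all three
   results are bit-identical, 0 otherwise. */
static int and_three_ways(const uint8_t a[16], const uint8_t b[16],
                          uint8_t out[16])
{
    __m128i ai = _mm_loadu_si128((const __m128i *)a);
    __m128i bi = _mm_loadu_si128((const __m128i *)b);

    /* The casts only reinterpret the 128 bits; they convert nothing. */
    __m128i r_pand  = _mm_and_si128(ai, bi);
    __m128i r_andps = _mm_castps_si128(
        _mm_and_ps(_mm_castsi128_ps(ai), _mm_castsi128_ps(bi)));
    __m128i r_andpd = _mm_castpd_si128(
        _mm_and_pd(_mm_castsi128_pd(ai), _mm_castsi128_pd(bi)));

    uint8_t o_ps[16], o_pd[16];
    _mm_storeu_si128((__m128i *)out,  r_pand);
    _mm_storeu_si128((__m128i *)o_ps, r_andps);
    _mm_storeu_si128((__m128i *)o_pd, r_andpd);

    return memcmp(out, o_ps, 16) == 0 && memcmp(out, o_pd, 16) == 0;
}
```

Bit patterns that happen to be NaNs or denormals when read as floats make no difference here: a bitwise AND never interprets its operands as numbers.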
It looks like the processor internally handles the floats in some “unpacked” structure, and the ps/pd is a sort of hint as to whether it has to repack the number again. Or something like that; at least this is only an optimization issue. But that’s stupid: if the processor already knows the internal format, one “andp” instruction would be sufficient, and the processor could perform andps or andpd anyway, depending on which would be faster in the situation. Or look at the MMX case: there we have no pandb, pandw, pandd, pandq etc. The same applies to “movapd/movdqa memory, xmm”: damn, it’s the processor that knows better than me how to achieve this the fastest way.
Finally, let’s look at pand. After Intel recognized that MMX is a complete mess, they opened up the MMX instructions to the xmm registers (via the 0x66 prefix). And now? We have a third way to do the AND… And it somehow looks like they never had SSE2 in mind when they designed the SSE1 instructions.