Redundant SSE instructions

As we all know, the x86 ISA has a lot of redundant instructions (i.e. instructions with the same semantics but different opcodes). Sometimes this is unavoidable, sometimes it looks like bad design. But with SSE it gets really weird. Let’s say we want to perform xmm0 <- xmm0 & xmm1 (i.e. a bitwise AND). Not an uncommon operation, but we have three different ways to achieve this:

andps xmm0, xmm1 (0f 54 c1)
andpd xmm0, xmm1 (66 0f 54 c1)
pand xmm0, xmm1 (66 0f db c1)

(Note that andpd and pand are SSE2 instructions.)
As far as the result in xmm0 is concerned, these are really the same instruction. So why did Intel do this? Let’s look at andps/andpd first. The optimization manuals give a hint: the ps/pd suffix marks the target register as containing singles or doubles, so it should match the data you are actually operating on.
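The claim that all three variants leave the same bits in the register can be checked from C. This is a minimal sketch, assuming an x86 compiler with SSE2 enabled and the standard emmintrin.h intrinsics; the helper names are made up for this illustration. Note that a compiler is free to emit a different AND variant than the intrinsic's name suggests, so this checks the result, not the opcode.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE1 ones */
#include <stdint.h>
#include <string.h>

/* Bitwise AND through the single-precision intrinsic (nominally andps). */
static __m128i and_as_ps(__m128i a, __m128i b)
{
    return _mm_castps_si128(_mm_and_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* Bitwise AND through the double-precision intrinsic (nominally andpd). */
static __m128i and_as_pd(__m128i a, __m128i b)
{
    return _mm_castpd_si128(_mm_and_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));
}

/* Bitwise AND through the integer intrinsic (nominally pand). */
static __m128i and_as_int(__m128i a, __m128i b)
{
    return _mm_and_si128(a, b);
}

/* Returns 1 if all three variants agree with a plain byte-wise AND, else 0. */
int and_variants_agree(const uint8_t a[16], const uint8_t b[16])
{
    uint8_t expected[16], r_ps[16], r_pd[16], r_int[16];
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    for (int i = 0; i < 16; i++)
        expected[i] = (uint8_t)(a[i] & b[i]);

    _mm_storeu_si128((__m128i *)r_ps,  and_as_ps(va, vb));
    _mm_storeu_si128((__m128i *)r_pd,  and_as_pd(va, vb));
    _mm_storeu_si128((__m128i *)r_int, and_as_int(va, vb));

    return memcmp(expected, r_ps, 16) == 0
        && memcmp(expected, r_pd, 16) == 0
        && memcmp(expected, r_int, 16) == 0;
}
```

The casts between __m128, __m128d and __m128i only reinterpret the register; they do not convert values, which is why the comparison is done on raw bytes.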

It looks like the processor internally keeps floats in some “unpacked” structure, and the ps/pd suffix is a hint as to whether it has to repack the number. Or something like that; either way, this is only an optimization issue. But that’s stupid: if the processor already knows the internal format, a single “andp” instruction would be sufficient, and the processor could perform andps or andpd anyway, depending on which is faster in the situation. Compare the MMX case, where we have no pandb, pandw, pandd, pandq, etc. The same applies to “movapd/movdqa memory, xmm”: damn it, the processor knows better than I do how to achieve this the fastest way.

Finally, let’s look at pand. After Intel recognized that MMX was a complete mess, they opened up the MMX instructions to the xmm registers (via the 0x66 prefix). And now? We have a third way to do the AND… It somehow looks like they never had SSE2 in mind when they designed the SSE1 instructions.

3 thoughts on “Redundant SSE instructions”

Yes, this always bugged me, too. It’s like, why can’t YOU figure out the best silicon path to send it down? I know I won’t know best.

Of course the fun thing is if you do something like this:

movdqa addr1, xmm0
movapd addr2, xmm1
xorps xmm0, xmm1

It’s like, BAM! Gotcha, no one can figure out the right thing to do in this case. I’ve dealt a lot with SSE and AltiVec, and AltiVec just seems so much better thought out to me.

Is there ever any reason to choose one of these instructions rather than another? Is one of them faster?

They’re being internally handled as IEEE 754 singles or doubles. What you’re looking at is the result of typed execution domains in the microarchitecture (Nehalem and later processors) without devoting silicon to tracking the tags. Before Nehalem, go ahead and always use the *PS variant of such instructions (including MOVAPS, MOVUPS, etc.), since they’re generally available on SSE1-only processors and are a byte shorter to encode. On Nehalem and later, generate the correctly-typed instruction.
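To illustrate what "correctly typed" means in practice, here is a small sketch of clearing the sign bit (absolute value) with the AND variant that matches the data. It assumes an x86 compiler with SSE2 and the standard emmintrin.h intrinsics; the function names are hypothetical, and the compiler may still pick a different encoding than the intrinsic's name implies.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE1 ones */

/* |x| for two doubles: clear the sign bits with the double-typed AND (andpd). */
void abs2_double(const double in[2], double out[2])
{
    const __m128d mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));
    _mm_storeu_pd(out, _mm_and_pd(_mm_loadu_pd(in), mask));
}

/* |x| for four floats: the same trick with the single-typed AND (andps). */
void abs4_float(const float in[4], float out[4])
{
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    _mm_storeu_ps(out, _mm_and_ps(_mm_loadu_ps(in), mask));
}
```

Both functions compute the same kind of bitwise AND; only the type of the intrinsic differs, which keeps the operation in the same domain as the surrounding floating-point code.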
