Redundant SSE instructions

As we all know, the x86 ISA has a lot of redundant instructions (i.e. instructions with the same semantics but different opcodes). Sometimes this is unavoidable, sometimes it looks like bad design. But with SSE it gets really weird. Let’s say we want to perform xmm0 <- xmm0 & xmm1 (i.e. bitwise AND). Not an uncommon operation, but we have three different ways to achieve this:

  • andps xmm0, xmm1 (0f 54 c1)
  • andpd xmm0, xmm1 (66 0f 54 c1)
  • pand xmm0, xmm1 (66 0f db c1)
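
To make the byte sequences above less opaque, here is a small Python sketch that rebuilds them from their parts: an optional 0x66 prefix, the two opcode bytes, and a ModRM byte naming the two registers. The helper names (OPCODES, modrm_reg_reg, encode) are made up for this example, and it only covers the register-to-register form shown in the list.

```python
# Sketch: rebuild the three encodings listed above from their parts.
# The helper names are hypothetical; only the reg-reg form is handled.

XMM = {f"xmm{i}": i for i in range(8)}

# mnemonic -> (prefix bytes, opcode bytes), taken from the list above
OPCODES = {
    "andps": (b"", b"\x0f\x54"),
    "andpd": (b"\x66", b"\x0f\x54"),
    "pand": (b"\x66", b"\x0f\xdb"),
}

def modrm_reg_reg(dst, src):
    # mod = 0b11 (register operand), reg field = destination, rm field = source
    return bytes([0b11 << 6 | XMM[dst] << 3 | XMM[src]])

def encode(mnemonic, dst, src):
    prefix, opcode = OPCODES[mnemonic]
    return prefix + opcode + modrm_reg_reg(dst, src)

for name in OPCODES:
    print(f"{name:5} xmm0, xmm1 -> {encode(name, 'xmm0', 'xmm1').hex(' ')}")
```

Seen this way, andps and andpd differ only in the 0x66 prefix, and andpd and pand differ only in the last opcode byte.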

(Note that andpd and pand are SSE2 instructions; andps is plain SSE.)

Regarding the result in xmm0, these are really the same instruction. Now, why did Intel do this? First we’re going to inspect andps/andpd. Looking at the optimization manuals we get a hint: the ps/pd suffix marks the target register as containing singles or doubles, so it should match the actual data you are operating on.
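
That the result in xmm0 does not depend on the variant can be checked without any SSE hardware. The following Python sketch models a 128-bit register as 16 bytes and ANDs it three ways: as four 32-bit lanes (the ps view), as two 64-bit lanes (the pd view), and as one 128-bit integer (the pand view). It is only a model of the arithmetic, not of what the processor does internally, and the function names are invented for the example.

```python
import struct

def and128(a: bytes, b: bytes) -> bytes:
    """AND two 16-byte values as a single 128-bit integer."""
    x = int.from_bytes(a, "little") & int.from_bytes(b, "little")
    return x.to_bytes(16, "little")

def and_lanes(a: bytes, b: bytes, fmt: str) -> bytes:
    """AND lane by lane; fmt is '<4I' (32-bit lanes) or '<2Q' (64-bit lanes)."""
    la, lb = struct.unpack(fmt, a), struct.unpack(fmt, b)
    return struct.pack(fmt, *(x & y for x, y in zip(la, lb)))

# Example data: clear the sign bit of four singles, a typical use of andps.
values = struct.pack("<4f", -1.5, 2.0, -0.0, -3.25)
mask = struct.pack("<4I", *[0x7FFFFFFF] * 4)

as_ps = and_lanes(values, mask, "<4I")   # single-precision view
as_pd = and_lanes(values, mask, "<2Q")   # double-precision view
as_int = and128(values, mask)            # integer view

print(as_ps == as_pd == as_int)
print(struct.unpack("<4f", as_ps))
```

A bitwise AND never carries between bit positions, so the lane width cannot change the outcome; whatever difference exists between the three instructions has to be about performance, not about the value produced.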

It looks like the processor internally handles the floats in some “unpacked” structure, and the ps/pd suffix is a sort of hint whether it has to repack the number again. Or something like that; at least this is only an optimization issue. But that’s stupid: if the processor already knows the internal format, one “andp” instruction would be sufficient, and the processor could perform andps or andpd anyway, depending on which would be faster in the situation. Or look at the MMX case: there we have no pandb, pandw, pandd, pandq etc. The same applies to “movapd/movdqa memory, xmm”: damn, it’s the processor that knows better than I do how to achieve this the fastest way.

Finally, let’s look at pand. After Intel recognized that MMX is a complete mess, they opened the MMX instructions up to the xmm registers (0x66 prefix). And now? We have a third way to do the AND… And it somehow looks like they never had SSE2 in mind when they designed the SSE1 instructions.

3 thoughts on “Redundant SSE instructions”

  1. Cassy Foesch

    Yes, this always bugged me, too. It’s like, why can’t YOU figure out the best silicon path to send it down? I know I won’t know best.

    Of course the fun thing is if you do something like this:

    movdqa addr1, xmm0
    movapd addr2, xmm1
    xorps xmm0, xmm1

    It’s like, BAM! Gotcha, no one can figure out the right thing to do in this case. I’ve dealt a lot with SSE and AltiVec, and AV just seems so much more well thought out to me.

  2. Nick Black

    They’re being internally handled as IEEE754 singles or doubles. What you’re looking at is the result of typed execution domains in the microarchitecture (Nehalem and future processors) without devoting silicon to tracking the tags. Before Nehalem, go ahead and always use the *S variant of such instructions (including MOVAP, MOVUP etc) since they’re generally available on SSE1-only processors, and a byte shorter to encode. On Nehalem and later, generate the correctly-typed instruction.

