Having Fun with Branch Delay Slots

Branch Delay Slots are one of the awkward features of RISC architectures. RISC CPUs are pipelined by definition, so while the current instruction is in execution, the following instruction(s) will be in the pipeline already. If there is for example a conditional branch in the instruction stream, the CPU cannot know whether the next instruction is the one following the branch or the instruction at the target location until it has evaluated the branch. This would cause a bubble in the pipeline; therefore some RISC architectures have a branch delay slot: The instruction after the branch will always be executed, no matter whether the branch is taken or not.

So in practice, you can put the instruction that would be before the branch right after the branch, if this instruction is independent of the branch instruction, i.e. doesn’t access the same registers. Otherwise, you can fill it with a NOP. Out-of-order architectures can do this reordering at runtime, so there would be no need for a delay slot. Nevertheless, the delay slot is a feature of the architecture, not the implementation.

Some RISCs like PowerPC and ARM do not have a delay slot, but for example MIPS, SPARC, PA-RISC have it. But there are some variations: MIPS and PA-RISC have an annihilation/nullify/likely bit in the instruction, so the programmer can choose that the instruction in the delay slot only gets executed if the branch is taken.

Other CPUs, like SPARC, PA-RISC or the ill-fated Motorola M88K have optional delay slots: The programmer can set a bit in the opcode if he cannot come up with a good instruction for the delay slot, and save the wasted NOP in the program code this way – the CPU will put a bubble into the pipeline.

Now the interesting question is what happens if the branch and the delay instruction are not independent. What if the delay instruction writes r5 and the branch jumps to r5? What if it’s a branch-and-link, and the delay instruction modifies the link register? On MIPS, this is illegal, and undefined.

In practice, MIPS won’t halt and catch fire though. As you would expect from the design of a CPU pipeline, the CPU basically executes the branch and the delay instruction in order, as they are stored in the instruction stream, and it only delays the write to PC, i.e. the actual jump until after the delay instruction. So, for example, if you modify a register that the branch depends on, it will not influence the branch, but be in effect after the jump.

On the aforementioned Motorola M88K, this behaviour is documented, and GCC even makes use of it:

820:   7d ad 00 08   cmp   r13,r13,0x08    ; compare

824:   d4 6d 00 05   bb0.n 0x03,r13,0x834  ; cond. branch
828:   63 df 00 00   addu  r30,r31,0       ; delay slot

82c:   cc 00 00 7f   bsr.n 0xa28           ; function call
830:   60 21 01 ac   addu  r1,r1,0x1ac     ; delay slot: fix up link

834:   00 00 00 00   nop

The first three instructions are a compare/branch sequence. If the branch is taken, execution will continue at 0x834, otherwise at 0x82c, after the delay slot. The delay instruction is independent of the branch, nothing special here yet.

But now look at the following two instructions: 0x82c and 0x830 are not independent. r1 is the M88K’s link register, so the “bsr” writes the addres of the following instruction after the delay slot (0x834) into r1. The delay slot also writes into r1: It adds 0x1ac to it. These instructions are executed in order, but the actual branch to 0xa28 will only be done after the delay instruction.

So what these two instructions effectively do is call a function, but set up the return address to skip the next 0x1ac bytes (107 instructions) after return. If the conditional branch at 0x824 is taken, the code at 0x834 will be executed, otherwise, 0xa28 is called, and the “taken” case of the conditional branch is skipped. This trick can be used whenever you have C code with an “if” statement of which one case is a single function call:

if (...) {
    call_something();
} else {
    [...]
}

13 thoughts on “Having Fun with Branch Delay Slots”

  1. Yep, I remember using this exact same trick on the exact same 88K, back at Encore. It’s worth adding that on many OOO architectures the explicit delay slot might not be there but the pipeline bubble still is. In other words, if you mispredict a branch then you have to wait for the pipeline to empty and then fill again at the new location. On a deeply pipelined processor this would often take longer than the typical number of instructions between branches, so you’d be completely stalled. At least with branch delays a clever compiler would have had a chance to fill the delay slots with something useful.

    Another fun feature of the 88K (and others) was load delays. Issue a load into a register and it *will not* be there for the next N cycles so you can continue to use the old values if you know what you’re doing. Who needs prefetch? ;)

    • I want to know that if there is no independent instruction to put in delay slot then how many Nop’s we can put there ????????

  2. I almost used something similar on the PS2 VU unit. It was a branch with another branch in the delay slot. The effect was to execute a single instruction at the destination of the first branch, then continue from the destination of the second.

    It would have saved a useful amount of instruction memory on a processor that didn’t have much. Unfortunately, branches are in the set of instructions that were documented to give undefined behaviour when put in the delay slot. It worked, but I wasn’t allowed to use it.

    On the 88K, what happens to the load delay if there’s an interrupt?

  3. Ha, great entry, I hadn’t thought about branch delay slots in quite a while.

    My evolution was 6502->68K->ARM->PowerPC->MIPS with a few other oddballs thrown in there (dabbled in 88K & SPARC in addition to some DSP stuff).

    When I first learned about MIPS, I actually preferred it to ARM, it seemed (seems?) simpler to write assembly code. By the time I was proficient in MIPS assembly, as I anticipated a branch in the next few instructions, I was already thinking about the BDS.

    Thankfully compilers nowadays seem to have this covered for us. I still get under the hood every now & then, but most of my C & C++ code is blissfully unaware of the BDS. That, and the fact that probably only 10% of my code these days runs on a CPU with a BDS… :-)

  4. Some years ago John Mashey (one of the original MIPS architects) posted in comp.arch on what were in his opinion the top MIPS design mistakes, and branch delay slots were one of them. As I recall it, his main objection was that they made the later OOO implementations way more complex and not worth the performance gain of the original design.

    I would not agree that delay slots are not an implementation feature. In my opinion they are an implementation artefact that becomes codified in the architecture for compatibility reasons. I do not think any architect would include delay slots in their design if they had the resources to eliminate the pipeline bubble in some other fashion.

    I’m not certain about this, but I believe the first MIPS implementations (R2000 and R3000) would do something strange if you put an illegal instruction in the delay slot, as they omitted the pipeline interlocks on purpose in order to make them smaller.

    On SuperH the PC is written before the branch delay slot instruction is executed, so a PC-relative load or store is performed relative to the branch target instruction. I can’t come up with a useful trick for this detail, but it makes eg. automatic literal pool generation a bit more “interesting”.

  5. >MIPS and PA-RISC have an annihilation/nullify/likely bit in the instruction,
    That was a MIPS II addition, after the realisation that filling the BDS can be hard and MIPS I code wasted non-trivial amounts of code space on NOPs

  6. Another interesting question is what happens when the delay slot contains a delayed branch. On some architectures (MIPS) it is disallowed with the behaviour undefined in the architecture. On SPARC (at least in V9) it is allowed: the architecture has two program counters, PC and next-PC, normally chasing each other.

    This gives us a fun way to run single instructions – in a debugger, or anywhere else:

    1 jmp A ; run single instruction at A
    2 jmp 3 ; but get right back afterwards!
    3 : …

    A: some instruction

    Of course, this doesn’t work if the single instruction itself changes the control flow.

  7. I see that my comment above largely duplicated the one previously posted by John – sorry, I should have read it more closely.

    It is interesting to contrast the SPARC approach of having two distinct program counters, with MIPS having only one. For example, if an exception occurs in a delay slot on MIPS, the branch instruction is typically re-executed on resumption. This is all right because MIPS branches have no side-effects other than the control flow change.

    Load delays were present in early MIPS implementations but not part of the architecture – the assembler would actually rearrange instructions to protect against this hazard, instead of having a hardware interlock. Later implementations added an interlock, but the meddling assembler remained.

    DSPs frequently have exposed pipelines with delays in all sorts of instructions – branches, loads/stores, even arithmetic, and sometimes the delays are longer than one instruction.

  8. Abusing the load delay or branch delay on MIPS-I (and probably later chips too) is only safe if there is no way an interrupt can occur.

    When an interrupt occurs in a branch delay slot, the cpu subtracts 4 from the PC. However if you have a branch in a branch delay slot, then the CPU will subtract 4 from the address the second branch pointed at. Which would not normally be executed.

    I’m pretty sure the CPU has to wait for all loads to be complete when executing an interrupt. Otherwise you could never guarantee that the correct value would be loaded. So when you return you will now have the new value, when you didn’t expect it.

    lwl/lwr are quite interesting, they pull the register update out of the pipeline. So all the time you run lwl/lwr instructions, the register won’t get updated.

  9. Very glad I found your site…

    I’ve been puzzling over the following (gcc-generated) MIPS assembly code that has been produced from a flat disassembly of a closed-source binary.

    Both IDA Pro and mips-linux-objdump produce essentially the same code, so it’s not likely a disassembly error..

    li $s6, 0x1
    li $v0, 0xFFFFFFFF
    bne $s8, $v0, some_location
    move $v0, $s6

    when does the branch take place?!

    is it when $s8 != 0xFFFFFFFF or
    is it when $s8 != 0x1 ?

    In See MIPS Run Linux (2nd edition), the author, Dominic Sweetman, writes…

    “Quite often, the instruction that would otherwise have been placed before the branch can be moved into the delay slot.

    This can be a bit tricky on a conditional branch, where the branch delay instruction must be (at least) harmless on both paths. Where nothing useful can be done, the delay slot is filled with a nop instruction.”

    That tells us that the above code is illegal, but it doesn’t explain what happens when the branch delay instruction does conflict with the test in the conditional branch..

    But I think you’ve answered that question with great clarity! Thank you.

  10. hi,I have encountered a problem about branch delay slot.

    I do not want CPU to excute the instruction in branch delay slot ,and also,code-segment cannot be modified

    I know one solution is use GCC marco -fno-delayed-branch to forbidden compiler insert other instructions to delay slot but NOP,

    but such solution need to re-compile code,it’s unacceptable in my project

    Thank you

  11. While this article is probably too old, I have a precise question regarding the exact behavior of a branch instruction inside the delay slot of another branch in this example:

    1: BEQ $v0, $v1, N
    2: BEQ $zr, $zr, 4
    3: ADDIU $a0, $a0, 1
    4:

    N: ADDIU $a1, $a1, 1

    Let us suppose $v0 and $v1 are different and $a0 and $a1 set to 0, what would be the order of the executed instructions ?

    What about the delay slot in location 3? would it be executed? before/after N? and so on…

    Best regards

Leave a Reply to Anonymous Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.