Clockslide: How to waste an exact number of clock cycles on the 6502

by Sven Oliver ‘SvOlli’ Moll; the original German language version has been simultaneously posted on his blog.

This article is about the 6502 processor and one specific problem: how to “waste” a number of clock cycles given in a register, in this case the X register. The principle is simple: you have a run of instructions that do close to nothing, and you jump somewhere into it. The closer to the “front” you enter, the more clock cycles pass before the CPU reaches the actual code. The closer to the “end” you enter, the sooner the CPU gets to the code in question.

This nice theory won’t work directly on the 6502, because every instruction takes at least two clock cycles to execute. Getting down to a precision of one cycle is more difficult. I found the first half of this trick in code by Eckhard Stollberg, one of the people who pioneered homebrew on the Atari 2600 VCS. There, I found some strange bytes:

C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C5 EA

The disassembly looks like this:

; CODE1
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP $EA  ; 3

Running through this code takes 15 clock cycles, and nothing changes except for some flags in the status register. If the code is entered at an offset of one byte, this code is executed instead:

; CODE2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C5 ; 2
NOP      ; 2

This makes 14 clock cycles, and again only the status register is changed. If the code is entered at an offset of two bytes, execution starts at the second instruction of the CODE1 segment. Add another byte and you get to the second instruction of the CODE2 segment, and so on. This way it is possible to specify the exact number of clock cycles to be “wasted”, with one exception. To be more specific, the slide wastes 2 + n clock cycles, with n ranging from 0 to 13: the shortest path through it is the lone NOP at 2 cycles, so there is no way to waste exactly one clock cycle.

Now we need a way to select the “entry” of our “slide”. On a C=64 this would be done using self-modifying code: the operand of a JMP $XXXX instruction is replaced with the calculated address. This is not possible on systems like the Atari 2600, since the code runs from ROM. One option would be to use JMP ($0080) after writing the entry point to $0080 and $0081.
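
For illustration, the JMP ($0080) variant could look roughly like the sketch below. It is not taken from the original code. It assumes that X holds the number of slide bytes to skip, uses a made-up label `slide`, and uses $80/$81 only because the text names them. The high byte is computed with the carry, so in this variant the slide may straddle a page boundary.

```
; sketch: enter the slide through an indirect JMP
; assumption: X = number of slide bytes to skip (0..14)
        TXA
        CLC
        ADC #<slide     ; low byte of the entry point
        STA $80
        LDA #>slide
        ADC #0          ; add the carry if the low byte wrapped
        STA $81
        JMP ($0080)     ; constant cost (5 cycles on an NMOS 6502)
slide:                  ; hypothetical label
        CMP #$C9
        CMP #$C9
        CMP #$C9
        CMP #$C9
        CMP #$C9
        CMP #$C9
        CMP $EA
        ; the real code continues here
```

Because JMP does not add 1 to the target the way RTS does, the pointer holds the entry point itself.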

My approach differs a bit from the usual way. RAM is scarce on the Atari, and I don’t want to permanently tie up two of the 128 bytes available when there is another way. When the CPU executes a JSR $XXXX (jump to subroutine) instruction, it pushes the current address onto the stack. To be more specific, it is the address of the JSR instruction + 2, which is the return address - 1. And this is what I do: I push my entry point - 1 onto the stack and use the RTS (return from subroutine) instruction to jump into the clock slide. So I’m still using two bytes of RAM, but only for a short time, and without needing to work out which two bytes are free at this point.

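; the X register specifies how many of the
; 15 possible clock cycles should be skipped
LDA #>clockslide  ; high byte of the return address is pushed first
PHA
TXA
CLC
ADC #<clockslide  ; low byte: clockslide + X (must not wrap past $FF)
PHA
STA WSYNC ; <= this syncs to start of next scanline
clockslide:
RTS               ; pulls the address, adds 1, jumps into the slide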
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP $EA
realcode:
; and here the real code continues

This approach still has one problem: between “clockslide” and “realcode”, no page crossing may occur. If this were the case, I’d have to increase the high byte on the stack by one. But since the position of the code segments is under my control, I left this out as an exercise for the reader. ;-)
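
One possible solution to that exercise is sketched here. It is not the author's code. The idea is to compute the low byte first, let the carry ripple into the high byte, and then push high byte and low byte in the order RTS expects. The sketch assumes that Y is free to be used as scratch space and that WSYNC is defined as in the listing above.

```
; sketch: page-crossing-safe setup (clobbers A and Y)
; X = number of slide bytes to skip, as before
        TXA
        CLC
        ADC #<clockslide ; low byte of clockslide + X, carry set on wrap
        TAY              ; park the low byte; TAY leaves the carry alone
        LDA #>clockslide
        ADC #0           ; high byte plus carry
        PHA              ; high byte goes onto the stack first
        TYA
        PHA              ; then the low byte
        STA WSYNC        ; sync to the start of the next scanline
clockslide:
        RTS              ; continues at (clockslide + X) + 1
        ; the CMP slide follows here unchanged
```

The extra instructions all run before the write to WSYNC, so they do not affect the cycle count after the sync point.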

11 thoughts on “Clockslide: How to waste an exact number of clock cycles on the 6502”

  1. Very interesting. I had a similar cycle wasting problem on the Z-80. I tried various clockslide techniques but at a minimum of 4 cycles per instruction I couldn’t figure out how to make it work effectively. A series of $21 bytes might be a start ( ld hl,$2121 ), but can’t see how it could be capped off.

    Here’s my cycle waster for Z-80: http://members.shaw.ca/gp2000/beamhack3.html

    I use it more for wasting out the rest of a frame so the high overhead is OK when hundreds or thousands of cycles are to be bled off.

  2. WOW, way cool. I have been trying to get a simple clock slide for the C64 that does not waste a ton of code space and this is a perfect way to do so. Now with some minor modification I will be able to do much better (and faster) VIC II tricks… Now the intro screen for my homebrew C64 clone will be amazing (do you know how long it took to design a 100% VIC II compatible video chip with extended modes).

  3. Waited too long for another interesting read on your blog! Cool read, but anyway, I would never try to say it’s impossible for the C64 to get down to 1 cycle ;) Nice idea to use RTS, indeed.

  4. In fact I analyzed this and you can create any of 8 cycles of delay in a code fragment of 3 bytes, but this would require jumping into code fragments. There is a much better method than yours and probably the densest way to do it.
    Starting with the delay in A (17 bytes):
    ;Delay in A, 1-7
    eor #7 ;A=7-A so jitter will be 0…6 in A
    sta corr+1 ;self-writing code, the bpl jump-address = A
    corr bpl *+2 ;the jump to timer (A) dependent byte
    cmp #$c9 ;if A=0, cmp#$c9; if A=1, cmp #$c9 again 2 cycles later
    cmp #$c9 ;if A=2, cmp#$c9, if A=3, CMP #$EA 2 cycles later
    bit $ea24 ;if A=4,bit$ea24; if A=5, bit $ea, if A=6, only NOP
    bit $ea24 ;IMPORTANT to handle 8th cycle jitter

  5. Oops, last line is invalid. Here’s another way to explain yours (with c64 mod). C64 uses a 1-8 delay to fix jitter when raster interrupt occurs. The jitter is from finishing the current instruction, which has a variable time.

    ;A=1..8
    *=$1000
    clc
    adc #$ff-8;A=8-A so result will be 7…0 in A
    eor #$ff
    sta corr+1 ;self-writing code, the bpl jump-address = A
    corr bpl *+2 ;the jump to (A) dependent byte (13 cycles so far)
    cmp #$c9 ;A=8->A=0->BPL +2
    cmp #$c9 ;
    cmp #$c9 ;
    cmp $ea ;3 =9 (13+9=22 max delay)

    Start Address
    $1000     $1001     $1002     $1003     $1004     $1005     $1006     $1007     $1008
    --------  --------  --------  --------  --------  --------  --------  --------  --------
    cmp #$c9  cmp #$c9  cmp #$c9  cmp #$c9  cmp #$c9  cmp #$c5  cmp $ea   nop
    cmp #$c9  cmp #$c9  cmp #$c9  cmp #$c5  cmp $ea   nop
    cmp #$c9  cmp #$c5  cmp $ea   nop
    cmp $ea   nop
    --------  --------  --------  --------  --------  --------  --------  --------  --------
    9         8         7         6         5         4         3         2         0
    Cycles

  6. Another method (which admittedly has no advantage over the one mentioned in the article :-)). The single cycle delay is achieved by a branch to the following instruction. Enter with # of cycles to delay in A:

    eor #$ff ;invert delay value:
    sec
    adc #maxdelay ;now A=max. delay – actual delay
    lsr ;A/2 because each nop below takes 2 cycles
    sta .branch+1 ;selfmodify branch distance
    bcc .branch ;2 or 3 cycles, depending on delay being odd/even
    .branch bpl * ;branch always, skipping delay/2 of the NOPs below
    nop
    nop
    nop
    … ;at least max. delay/2 nops needed in total

  7. @Oliver. What a coincidence, I just visited this site on the same day as you (after half a year) and thought the same ;) Hey what’s up in here, I demand more geeky stuff on one of the best blogs ever!

  8. You should also look at using RTI to come back from a subroutine; if my (old) memory serves me right, you can save a clock cycle there. That’s what I did in the 80’s, or was it the 70’s :)

  9. Just an FYI that the 6502 code for the T800 Terminator in your ccc club presentation (which was very good by the way) was the “Apple dos rom” v2.3, I think :)
