by Sven Oliver ‘SvOlli’ Moll; the original German language version has been simultaneously posted on his blog.
This article is about the 6502 processor and one specific problem: how to “waste” a number of clock cycles given in a register, in this case the X register. The principle is simple: you have a run of instructions that do close to nothing. The closer to the “front” you jump into this run, the more clock cycles it takes to reach the actual code. The closer to the “end” you jump in, the sooner the CPU reaches the code in question.
This nice theory won’t work directly on the 6502, because every instruction takes at least two clock cycles to execute. Getting down to a precision of one cycle is more difficult. I found the first half of this trick in code by Eckhard Stollberg, one of the people who pioneered homebrew on the Atari 2600 VCS. There, I found some strange bytes:
C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C5 EA
The disassembly looks like this:
; CODE1
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP $EA  ; 3
Running through this code takes 15 clock cycles, and nothing changes except for some flags in the status register. If the code is entered with an offset of one byte, this code is executed instead:
; CODE2
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP #$C9 ; 2
  CMP #$C5 ; 2
  NOP      ; 2
This makes 14 clock cycles, and again only the status register is changed. If the code is entered with an offset of two bytes, execution starts at the second instruction of the CODE1 segment. Add another byte and you get to the second instruction of the CODE2 segment, and so on. This way it is possible to specify the exact number of clock cycles to be “wasted”. With one exception: the smallest nonzero delay is the 2 cycles of the final NOP, so the slide wastes either 0 or 2 to 15 clock cycles. There is no way to waste exactly one clock cycle.
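The cycle counts can be checked off the hardware with a few lines of Python. This is only a sketch: it models just the three opcodes that occur in the slide (with their standard 6502 lengths and cycle counts), and the names `SLIDE`, `cycles_from` and `slide_table` are made up for this illustration.

```python
# Minimal sketch: decode the 14-byte slide from every entry offset and
# add up the cycles. Only the three opcodes that occur here are modelled.
SLIDE = bytes([0xC9] * 12 + [0xC5, 0xEA])

# opcode -> (mnemonic, length in bytes, cycles)
OPCODES = {
    0xC9: ("CMP #imm", 2, 2),
    0xC5: ("CMP zp", 2, 3),
    0xEA: ("NOP", 1, 2),
}


def cycles_from(offset, code=SLIDE):
    """Return the cycles spent when execution enters `code` at `offset`."""
    pc = offset
    total = 0
    while pc < len(code):
        _, length, cycles = OPCODES[code[pc]]
        if pc + length > len(code):
            raise ValueError("instruction would run past the end of the slide")
        total += cycles
        pc += length
    return total


def slide_table(code=SLIDE):
    """Cycles wasted for every entry offset, including 'skip everything'."""
    return [cycles_from(offset, code) for offset in range(len(code) + 1)]


if __name__ == "__main__":
    for offset, cycles in enumerate(slide_table()):
        print(f"offset {offset:2d}: {cycles:2d} cycles")
```

Each extra byte of offset removes one cycle, down to 2 cycles at offset 13. Offset 14 skips the slide entirely, so a delay of exactly 1 never appears in the table.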
Now we need a way to specify the “entry” of our “slide”. On a C=64 this would be done using self-modifying code: the operand of a JMP $XXXX instruction is replaced with the calculated address. This is not possible on systems like the Atari 2600, since the code runs from ROM. One option would be to use JMP ($0080) after writing the entry point to $0080 and $0081.
My approach differs a bit from the usual way. RAM is scarce on the Atari, and I don’t want to use up two of the 128 bytes available when there is another way. When the CPU executes a JSR $XXXX (jump to subroutine) instruction, it pushes the current address onto the stack. To be more specific, it is the address of the JSR instruction + 2, which is the return address - 1. And this is what I do: I push my entry point - 1 onto the stack and use the RTS (return from subroutine) instruction to jump into the clock slide. So I’m still using two bytes of RAM, but only for a short time, and without having to work out which two bytes are available at this point.
; the X register specifies how many of the
; 15 clock cycles possible should be skipped
  LDA #>clockslide
  PHA
  TXA
  CLC
  ADC #<clockslide
  PHA
  STA WSYNC ; <= this syncs to start of next scanline
clockslide:
  RTS
  CMP #$C9
  CMP #$C9
  CMP #$C9
  CMP #$C9
  CMP #$C9
  CMP #$C9
  CMP $EA
realcode:
  ; and here the real code continues
This approach still has one problem: between “clockslide” and “realcode”, no page crossing may occur. If this were the case, I’d have to increase the high byte on the stack by one. But since the position of the code segments is under my control, I left this out as an exercise for the reader. :-)
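The page-crossing constraint comes from the 8-bit addition: ADC only adjusts the low byte, and a carry into the high byte is simply lost. A minimal Python sketch of that arithmetic follows. The addresses are made up for illustration, and `rts_entry` and `carry_free` are hypothetical helper names, not part of the original code.

```python
def rts_entry(clockslide, x):
    """Model the LDA/PHA/TXA/CLC/ADC/PHA sequence and the final RTS.

    `clockslide` is the 16-bit address of the RTS instruction (an assumed
    example value supplied by the caller), `x` the number of cycles to skip.
    Returns (high byte pushed, low byte pushed, address RTS jumps to).
    """
    high = (clockslide >> 8) & 0xFF               # LDA #>clockslide / PHA
    total = (clockslide & 0xFF) + (x & 0xFF)      # TXA / CLC / ADC #<clockslide
    low = total & 0xFF                            # PHA: the carry is dropped
    target = (((high << 8) | low) + 1) & 0xFFFF   # RTS pulls the address, adds 1
    return high, low, target


def carry_free(clockslide, max_x=14):
    """True if the low-byte addition never carries for x in 0..max_x."""
    return (clockslide & 0xFF) + max_x <= 0xFF


if __name__ == "__main__":
    for base in (0xF080, 0xF0F8):
        _, _, target = rts_entry(base, 14)
        print(f"clockslide=${base:04X} x=14 -> ${target:04X}, "
              f"carry free: {carry_free(base)}")
```

With the slide at $F080, every X from 0 to 14 lands where it should. At $F0F8, X = 14 wraps around to $F007 instead of reaching $F107, which is the case the text excludes.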
Very interesting. I had a similar cycle-wasting problem on the Z-80. I tried various clockslide techniques, but at a minimum of 4 cycles per instruction I couldn’t figure out how to make it work effectively. A series of $21 bytes might be a start ( ld hl,$2121 ), but I can’t see how it could be capped off.
Here’s my cycle waster for Z-80: http://members.shaw.ca/gp2000/beamhack3.html
I use it more for wasting out the rest of a frame so the high overhead is OK when hundreds or thousands of cycles are to be bled off.
WOW, way cool. I have been trying to get a simple clock slide for the C64 that does not waste a ton of code space and this is a perfect way to do so. Now with some minor modification I will be able to do much better (and faster) VIC II tricks… Now the intro screen for my homebrew C64 clone will be amazing (do you know how long it took to design a 100% VIC II compatible video chip with extended modes).
Waited too long for another interesting read on your blog! Cool read, but anyway, I would never try to say it’s impossible for the C64 to get down to 1 cycle :-) Nice idea to use RTS, indeed.
In fact I analyzed this and you can create any of 8 cycles of delay in a code fragment of 3 bytes, but this would require jumping into code fragments. There is a much better method than yours and probably the densest way to do it.
Starting with the delay in A (17 bytes):
;Delay in A, 1-7
eor #7 ;A=7-A so jitter will be 0…6 in A
sta corr+1 ;self-writing code, the bpl jump-address = A
corr bpl *+2 ;the jump to timer (A) dependent byte
cmp #$c9 ;if A=0, cmp#$c9; if A=1, cmp #$c9 again 2 cycles later
cmp #$c9 ;if A=2, cmp#$c9, if A=3, CMP #$EA 2 cycles later
bit $ea24 ;if A=4,bit$ea24; if A=5, bit $ea, if A=6, only NOP
bit $ea24 ;IMPORTANT to handle 8th cycle jitter
Oops, the last line is invalid. Here’s another way to explain yours (with a C64 mod). The C64 uses a 1-8 cycle delay to fix jitter when a raster interrupt occurs. The jitter comes from finishing the current instruction, which takes a variable time.
;A=1..8
*=$1000
clc
adc #$ff-8 ;A=8-A so result will be 7…0 in A
eor #$ff
sta corr+1 ;self-writing code, the bpl jump-address = A
corr bpl *+2 ;the jump to (A) dependent byte (13 cycles so far)
cmp #$c9 ;A=8->A=0->BPL +2
cmp #$c9 ;
cmp #$c9 ;
cmp $ea ;3 =9 (13+9=22 max delay)
Start Address
$1000     $1001     $1002     $1003     $1004     $1005     $1006     $1007     $1008
--------  --------  --------  --------  --------  --------  --------  --------  --------
cmp #$c9  cmp #$c9  cmp #$c9  cmp #$c9  cmp #$c9  cmp #$c5  cmp $ea   nop
cmp #$c9  cmp #$c9  cmp #$c9  cmp #$c5  cmp $ea   nop
cmp #$c9  cmp #$c5  cmp $ea   nop
cmp $ea   nop
--------  --------  --------  --------  --------  --------  --------  --------  --------
9         8         7         6         5         4         3         2         0
Cycles
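The accumulator arithmetic and the cycle row of the table can be cross-checked with a short Python model. This is a sketch under stated assumptions: 8-bit wraparound is emulated with a mask, the slide is the 8-byte sequence from the listing, and the function names are invented for this check.

```python
# cmp #$c9 three times, then cmp $ea: C9 C9 C9 C9 C9 C9 C5 EA
SLIDE = bytes([0xC9] * 6 + [0xC5, 0xEA])

# opcode -> (length in bytes, cycles)
CYCLES = {0xC9: (2, 2), 0xC5: (2, 3), 0xEA: (1, 2)}


def branch_offset(a):
    """Model clc / adc #$ff-8 / eor #$ff on an 8-bit accumulator."""
    a = (a + (0xFF - 8)) & 0xFF   # clc, adc #$f7
    return a ^ 0xFF               # eor #$ff, giving 8 - a for a in 1..8


def slide_cycles(offset):
    """Cycles spent in the slide when the branch lands `offset` bytes in."""
    pc, total = offset, 0
    while pc < len(SLIDE):
        length, cycles = CYCLES[SLIDE[pc]]
        total += cycles
        pc += length
    return total


if __name__ == "__main__":
    for a in range(1, 9):
        off = branch_offset(a)
        print(f"A={a}: branch offset {off}, {slide_cycles(off)} cycles in slide")
```

For A = 1 to 8 the branch offset runs from 7 down to 0, and the slide itself then takes A + 1 cycles, matching the 2 to 9 range in the table.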
Another method (which admittedly has no advantage over the one mentioned in the article :-)). The single cycle delay is achieved by a branch to the following instruction. Enter with # of cycles to delay in A:
eor #$ff ;invert delay value:
sec
adc #maxdelay ;now A=max. delay – actual delay
lsr ;A/2 because each nop below takes 2 cycles
sta .branch+1 ;selfmodify branch distance
bcc .branch ;2 or 3 cycles, depending on delay being odd/even
.branch bpl * ;branch always, skipping delay/2 of the NOPs below
nop
nop
nop
… ;at least max. delay/2 nops needed in total
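Why this gives single-cycle steps can be seen by adding up the cycles from the bcc onwards. The Python sketch below does that under a few assumptions: no branch crosses a page boundary (which would cost an extra cycle), and `maxdelay = 8` with four NOPs are example values that do not come from the listing.

```python
def delay_cycles(delay, maxdelay, nops):
    """Cycles from the bcc through the last nop, page crossings ignored."""
    d = (maxdelay - delay) & 0xFF  # eor #$ff / sec / adc #maxdelay
    carry = d & 1                  # lsr shifts bit 0 into the carry
    skip = d >> 1                  # stored as the bpl branch distance
    if skip > nops:
        raise ValueError("not enough nops for this delay")
    bcc = 2 if carry else 3        # bcc is taken (3 cycles) when carry is clear
    bpl = 3                        # always taken: lsr leaves the N flag clear
    return bcc + bpl + 2 * (nops - skip)


if __name__ == "__main__":
    for delay in range(9):
        print(delay, delay_cycles(delay, maxdelay=8, nops=4))
```

The NOPs only move the total in steps of two. The odd step comes from the bcc, which costs 3 cycles when taken and 2 when not, while both paths continue at the same bpl.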
Hey guys, I miss you :-)
@Oliver. What a coincidence, I just visited this site on the same day as you (after half a year) and thought the same :-) Hey what’s up in here, I demand more geeky stuff on one of the best blogs ever!
More blog posts please!
You should also look at using RTI to come back from a subroutine; if my (old) memory serves me right, you can save a clock cycle there. That’s what I did in the 80’s, or was it the 70’s :-)
Just an FYI that the 6502 code for the T800 Terminator in your CCC presentation (which was very good by the way) was the “Apple dos rom” v2.3 I think :-)