{"id":568,"date":"2011-02-07T23:58:22","date_gmt":"2011-02-08T07:58:22","guid":{"rendered":"http:\/\/www.pagetable.com\/?p=568"},"modified":"2011-02-07T23:58:22","modified_gmt":"2011-02-08T07:58:22","slug":"a-256-byte-autostart-fast-loader-for-the-commodore-64","status":"publish","type":"post","link":"https:\/\/www.pagetable.com\/?p=568","title":{"rendered":"A 256 Byte Autostart Fast Loader for the Commodore 64"},"content":{"rendered":"<p><i>Update<\/i>: The source is available at <a href=\"https:\/\/github.com\/mist64\/fastboot1541\">github.com\/mist64\/fastboot1541<\/a><\/p>\n<p>Platforms like the Commodore 64 are still a lot of fun to work with, not only because the limitations make certain tasks a real challenge, but also because it is possible to use many interesting tricks on a bit- and cycle-level &#8211; after all, the system is well-understood and practically all setups were identical.<\/p>\n<p>This article presents a C64 &#8220;fast bootloader&#8221;: A small program that auto-starts when loaded into memory and chain-loads e.g. a game, but replacing the slow disk transfer routines in ROM with much faster ones &#8211; and all this fits into a single 256 bytes sector.<\/p>\n<h2>The C64 and the Drive 1541<\/h2>\n<p>The C64 is an 8-bit computer released in 1982 that is powered by a 1 MHz 6502-based CPU and has 64 KB of RAM. A typical C64 setup also consists of a Commodore 1541 disk drive (5.25&#8243; SS\/DD media, 170 KB per side) which is a 1 MHz 6502 computer with 2 KB of RAM itself.<\/p>\n<h2>The IEC Bus<\/h2>\n<p>The computer and the drive are connected through a serial cable that carries three lines from the computer to the drive (ATN, CLK, DATA) and two from the drive to the computer (CLK, DATA). 
This IEC bus supports several daisy-chained disk drives (and printers), and the computer uses the ATN line to arbitrate the bus.<\/p>\n<p>IEC was introduced in the C64&#8217;s predecessor VIC-20 and its floppy drive 1540, as a cheaper alternative to the parallel IEEE-488 bus. Each device had a 6522 &#8220;VIA&#8221; I\/O controller that could do a simple serial protocol in hardware &#8211; in theory. The serial port in the VIAs never worked, so Commodore decided to work around the issue by just implementing the protocol in software; after all, both devices had programmable CPUs. The C64, running basically the same system software as the VIC-20, and the 1541, a slightly updated disk drive for the C64, inherited this design.<\/p>\n<p>As a result of every single bit having to go through a software handshake, the ROM code in the C64 and the drive could only transfer about 400 bytes per second. A game that fills all of RAM would take more than two minutes to load. Another issue with the IEC code in the C64 ROM is that it turns off interrupts every once in a while, making it impossible to properly play music in the background while loading data from disk.<\/p>\n<h2>Fast Loaders<\/h2>\n<p>Practically every game and every demo therefore only had a small boot program that was loaded by the original ROM code, typically less than a kilobyte in size, which contained more efficient serial code for the computer side, as well as corresponding code for the drive, which was uploaded into the drive&#8217;s RAM using the original serial protocol.<\/p>\n<h2>The Protocol<\/h2>\n<p>The original IEC protocol uses one clock and one data line in each direction. The sender alternates the clock line on every bit of data, and the receiver has to acknowledge the receipt by alternating its clock line. 
The idea of a faster protocol is to send a whole byte, bit by bit without a clock signal: Whenever both devices are undisturbed by interrupts and device DMA, we can assume the clocks of both devices to be enough in sync for the duration of transmitting a byte. (In fact, the 1541 is clocked a little bit faster than a PAL C64, so there is one extra 1541 cycle for every 67 C64 cycles.) Since the clock line is now not necessary for the handshake, it can be used for data, so we can transmit two bits at a time, and transfer a byte in four steps.<\/p>\n<h2>Receiving a Byte<\/h2>\n<p>The IEC bus is controlled through port A of the C64&#8217;s second 6526 &#8220;CIA&#8221; I\/O chip, which is accessible through the MMIO address $DD00:<\/p>\n<table border=\"1\">\n<tr>\n<td>7<\/td>\n<td>IEC DATA IN<\/td>\n<\/tr>\n<tr>\n<td>6<\/td>\n<td>IEC CLK IN<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>IEC DATA OUT<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>IEC CLK OUT<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>IEC ATN Signal OUT<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>RS-232<\/td>\n<\/tr>\n<tr>\n<td>1-0<\/td>\n<td>VIC Select<\/td>\n<\/tr>\n<\/table>\n<p>We can signal the drive when we are ready to receive a byte through bits 4 and 5, and the drive can send data through bits 7 and 6. 
We need to be careful not to change bits 0 and 1, since these select the 16 KB memory bank that the video chip fetches its data from &#8211; typically, these bits are both 1.<\/p>\n<p>The fastest way to receive a byte from the serial bus (given the sender is fast enough) is to repeatedly read two bits from $DD00, shift them down, then read the next two bits, and repeat all this four times:<\/p>\n<pre>\n    lda $DD00 ; get 2 bits into bits 6-7\n    lsr\n    lsr       ; move down into bits 4-5\n    eor $DD00 ; get 2 more bits\n    lsr\n    lsr       ; move everything down to bits 2-5\n    eor $DD00 ; get 2 more bits\n    lsr\n    lsr       ; move everything down to bits 0-5\n    eor $DD00 ; get last 2 bits\n<\/pre>\n<p>The trick here is to XOR the new bits onto the already received and shifted bits, this way we avoid shifting, ANDing and ORing. An absolute load taking four cycles and a shift instruction two, receiving a byte takes 28 cycles total. We need to make sure the sending side has the same timing.<\/p>\n<h2>Sending a Byte<\/h2>\n<p>The 1541 disk drive has the IEC bus exposed through port A of its first &#8220;VIA&#8221; I\/O chip, which is mapped at $1800:<\/p>\n<table border=\"1\">\n<tr>\n<td>7<\/td>\n<td>IEC ATN IN<\/td>\n<\/tr>\n<tr>\n<td>6-5<\/td>\n<td>Device number jumper<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>IEC ATN ACK OUT<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>IEC CLOCK OUT<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>IEC CLOCK IN<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>IEC DATA OUT<\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>IEC DATA IN<\/td>\n<\/tr>\n<\/table>\n<p>The bits to write the data into are number 1 and number 3, which are not next to each other, so sending is a little more complicated than receiving. 
The following code assumes that the low 4 bits of the data byte are in register A, and the high 4 bits are in register Y:<\/p>\n<pre>\n    sta $1800 ; send bits 1 and 3 of A\n    asl       ; bits 0 and 2 become bits 1 and 3\n    and #$0F  ; mask off bit #4\n    sta $1800 ; send bits 0 and 2 of A\n    tya\n    nop\n    sta $1800 ; send bits 1 and 3 of Y\n    asl       ; bits 0 and 2 become bits 1 and 3\n    and #$0F  ; mask off bit #4\n    sta $1800 ; send bits 0 and 2 of Y\n<\/pre>\n<p>The idea is to first just write the low 4 bits into the output port, therefore sending bits 1 and 3. Shifting the value left by one will put bits 0 and 2 into positions 1 and 3 and we can send them by writing them into the output port. Then we repeat this with the upper four bits.<\/p>\n<p>The absolute stores take 4 cycles each, and &#8220;asl&#8221;, &#8220;and&#8221;, &#8220;tya&#8221; and &#8220;nop&#8221; all take 2 cycles each, so this code has exactly the same timing as the receiver side.<\/p>\n<p>But unfortunately, this code sends the bits in the wrong order, so we need to correct it, either on the sending or on the receiving side. A simple lookup in a 256 byte table (on either side) would do the job, but storing the table in the fast loader would increase its size significantly &#8211; and therefore the time to load the fast loader with the original Commodore serial code. 
It would be possible to generate this table at runtime, but a good tradeoff between size and performance is this code (34 bytes; with a few bytes more, it could be converted into a generator for the table):<\/p>\n<pre>\n    eor #3        ; fix up bits 0-1 (VIC bank)\n    pha           ; save original\n    lsr\n    lsr\n    lsr\n    lsr           ; get high nybble\n    tax           ; to X\n    ldy enc_tab,x ; super-encoded high nybble in Y\n    pla\n    and #$0F      ; lower nybble\n    tax\n    lda enc_tab,x ; super-encoded low nybble in A\n\nenc_tab:\n    .byte %1111, %0111, %1101, %0101, %1011, %0011, %1001, %0001\n    .byte %1110, %0110, %1100, %0100, %1010, %0010, %1000, %0000\n<\/pre>\n<p>First, we invert the lowest two bits of the value to send, because the receiving side always reads back &#8220;11&#8221; in the lowest bits of $DD00. Then we encode both the high and the low 4 bits using a 16 byte table, putting the results in the A (low) and Y (high) registers. This table interleaves the four bits so that 0123 becomes 3120 and inverts every bit: An interesting property of the lines from the drive to the computer is that all lines arrive inverted at the computer side. This is not the case for the lines from the computer to the drive. So after this conversion, the send code above can blindly send the bits, which will show up as the original value on the receiver side.<\/p>\n<h2>Handshake<\/h2>\n<p>But it is not enough to just bang the bits on the bus &#8211; after every byte, we need to do a handshake. In a perfect world, we could just send a complete sector (256 bytes on a 1541) in a go, maybe even in an unrolled loop for extra speed, but there are several reasons against it. One reason is that the floppy is clocked about 2% faster, so we&#8217;re off by one cycle every 67 cycles. The fastest possible send loop is about 50 cycles per byte (including the byte encoding), and the time between two pieces of data on the bus (i.e. 
$1800 writes\/$DD00 reads) is 8 cycles, so after 8*67 = 536 cycles = 10 transferred bytes we have missed two bits. Getting the timing correct across such a long time becomes very tricky then.<\/p>\n<p>Another problem is the fact that the video chip in the C64 (&#8220;VIC-II&#8221;) requires 40 cycles for DMA on every 8th line of the visible screen that it sends to the display, completely stopping the CPU, which would mess up all timing. One raster line on a PAL C64 is exactly 63 cycles long, so these CPU stalls (&#8220;badlines&#8221;) happen every 504 cycles. If we start a transfer of several bytes just after one of these badlines, we have time for a maximum of about 9 bytes ((504-40)\/50). Or we could only transfer data while the VIC is outside the visible area, but this is only in 112 of the 312 lines, so we would be wasting about 64% of the processing power. Most fast loaders just turn off the screen, so the VIC doesn&#8217;t do these fetches any more, but we don&#8217;t want to take this shortcut!<\/p>\n<p>Badlines happen every time the vertical raster location (readable in register $D012 of the VIC) is between 50 and 249 and the lowest 3 bits reach the value of 3. (Actually, this value is variable and corresponds to the lowest 3 bits in register $D011.) So the first thing to check is whether we are below 50 or above 249 (i.e. between 250 and 311). $D012 only holds the lowest 8 bits of the raster register (the MSB is stored in the MSB of $D011), but just checking for $D012 being below 50 already means that the raster register is between 0-49 or 256-305 &#8211; this easier check with some false positives is preferable to a more complicated but slower check. So if we are in the visible area, we must watch out whenever we are in a line whose lowest 3 bits equal &#8220;2&#8221;, because a badline will happen some time within the next 63 cycles. 
In every other case, a badline won&#8217;t happen within at least 63 cycles, so we are safe to spend our 28 cycles in the receiver code. The following code does this:<\/p>\n<pre>\nwait_raster:\n    lda $D012           ; vertical raster position (bits 0-7)\n    cmp #50             ; between 0-49 or 256-305?\n    bcc wait_raster_end ; yes, so it's safe\n    and #$07            ; lowest 3 bits\n    cmp #$02            ; are we in the line before a badline?\n    beq wait_raster     ; yes, then wait until we are not\nwait_raster_end:\n<\/pre>\n<p>Visible sprites also require extra DMA from the VIC, but we assume that no sprites are active. If in doubt, writing 0 into VIC register $D015 makes sure they are all turned off.<\/p>\n<p>Now the actual handshake is rather easy. The protocol is this: At the beginning, both the computer and the drive set their handshake flags to &#8220;not ready&#8221;. When the drive has data in its buffer and is ready to send a byte, it sets its flag to &#8220;ready&#8221;. Then the C64 makes sure it is not in danger of a badline and sets its &#8220;ready&#8221; flag. Just after the transfer of the byte, both devices set their flags to &#8220;not ready&#8221; again. The drive signals readiness with CLK=0, and the computer does so with both CLK=0 and DATA=0, so the code looks like this:<\/p>\n<pre>\n;-----\n; C64 at initialization time\n    lda #VIC_OUT | DATA_OUT ; CLK=0 DATA=1\n    sta $DD00               ; we're not ready to receive\n\n;-----\n; drive at initialization time\n    lda #F_CLK_OUT          ; CLK=1 DATA=0\n    sta $1800               ; drive code running, we're not ready to send\n\n;-----\n; C64 waiting for drive code running\nwait_fast:\n    bit $DD00\n    bvs wait_fast           ; wait for CLK=1, i.e. drive code running\n\n;-----\n; drive when it is ready to send (i.e. 
byte in buffer and converted)\n    lda #0                  ; CLK=0 DATA=0\n    sta $1800               ; we're ready to send\n\n;-----\n; C64 waiting for drive ready to send\nwait_byte:\n    bit $DD00\n    bvc wait_byte           ; wait for CLK=0, i.e. drive ready to send\n\n;-----\n; C64 when it is ready to receive (i.e. not in danger of badline)\n    lda #VIC_OUT            ; CLK=0 DATA=0\n    sta $DD00               ; we're ready, start sending!\n\n;-----\n; drive waiting for C64 ready to receive\nwait_c64:\n    ldx $1800\n    bne wait_c64            ; needs all 0\n\n;-----\n; C64 after receiving a byte\n    lda #VIC_OUT | DATA_OUT ; CLK=0 DATA=1\n    sta $DD00               ; not ready any more, don't start sending\n\n;-----\n; drive after sending a byte\n    jsr $E9AE               ; CLK=1 (use ROM code to optimize for size)\n<\/pre>\n<p>Please note that the logic on all $DD00 reads on the computer side looks backwards, because the bits get inverted.<\/p>\n<h2>Timing after the Handshake<\/h2>\n<p>The send and the receive code have exactly the same timing, but we need to make sure that they also start at the same time. The C64 code cannot start reading data from the bus directly after telling the drive that it is ready to receive, because the drive is testing for the C64&#8217;s readiness in this loop:<\/p>\n<pre>\n;-----\n; drive waiting for C64 ready to receive\nwait_c64:\n    ldx $1800\n    bne wait_c64            ; needs all 0\n<\/pre>\n<p>The load takes 4 cycles, and the branch 2 or 3, depending on whether it is taken. The actual bus access for the read from $1800 takes place in the third cycle, so in the worst case, the computer signals that it&#8217;s ready exactly in the fourth cycle: In this case, the LDX has read the old value and the branch is taken, the LDX reads the value again, gets the right value now, and doesn&#8217;t take the branch. 
So the maximum time until the 1541 reacts is 10 cycles: 1 for the last cycle in the LDX, 3 for the taken branch, 4 for another LDX, and 2 for the final non-taken branch. This is the code that delays for 10 cycles between the ready signalling and the reading of the first 2 bits:<\/p>\n<pre>\n    lda #VIC_OUT ; CLK=0 DATA=0\n    sta $DD00    ; we're ready, start sending!\n    pha          ; 3 cycles\n    pla          ; 4 cycles\n    bit $00      ; 3 cycles\n    lda $DD00    ; get 2 bits into bits 6&7\n<\/pre>\n<h2>Reading Sectors<\/h2>\n<p>Now that we have the transfer code for one byte in place, we can easily construct a loop on both sides that repeats the transfer 256 times for a full sector. But we also need code running inside the drive that reads sectors from disk in the first place. The easiest way to do this is to use the ROM code. It will happily position the read head for us, wait for the sector to come by, read it, decode the on-disk bit encoding (5-to-4 Group Code Recording, GCR) and put it into a buffer:<\/p>\n<pre>\n    lda #TRACK\n    sta $06\n    lda #SECTOR\n    sta $07\n    lda #0\n    sta $f9     ; buffer at $0300\n    cli\n    jsr $D586   ; read sector\n    sei\n<\/pre>\n<p>The 1541 has 5 buffers, numbered 0 through 4, from $0300-$03FF to $0700-$07FF. The track and sector for buffer 0 are stored in zero page addresses 6 and 7, the ones for buffer 1 in addresses 8 and 9 and so on. Note that during the whole process of loading data into the C64, we have interrupts disabled (&#8220;SEI&#8221;), so we need to reenable them while reading from disk for the timers to work properly.<\/p>\n<p>More advanced fastloaders don&#8217;t use the ROM code for reading, but implement a more optimized version, achieving another minor speedup. Bigger speedups can be achieved by changing the algorithm of reading completely: Tracks on a 1541 disk hold up to 21 sectors of 256 bytes each, but the 1541 RAM is only 2 KB, so it cannot read a whole track into memory. 
Therefore it reads one sector, transfers it, reads another one and so on. After reading a sector, it needs to be decoded and sent, during which time the disk continues spinning, causing the drive to miss a few sectors. So it would be a bad idea to store files on consecutive sectors, since this would mean it has to wait for a whole turn of the disk (one fifth of a second) for that sector to arrive under the head again.<\/p>\n<p>Instead, files are stored in an interleaved fashion, typically with an interleave factor of 4, meaning a file is for example stored on sectors 0, 4, 8, 12 etc. When a fast loader is used, it would typically require a different interleave factor for optimal performance, but unfortunately the interleave factor is a property of the already written disk. A very advanced method of fast loading is therefore to always read the sector that comes by next and transfer it, unless it has been transferred before, until the complete track is in the C64. The C64 then sorts the sectors into the correct order. For smaller files, this does not work too well, since this method always reads and transfers complete tracks.<\/p>\n<p>Other fast loaders (like Heureka Sprint, used by Turrican and some other Rainbow Arts titles) require the data on disk to be encoded differently, making decoding more efficient. Some copy programs, like &#8220;Master Copy&#8221;, don&#8217;t decode the sector data at all &#8211; but they can only do this because they write the same encoding, and the actual payload data is never required.<\/p>\n<p>But in order to keep the implementation really small (custom read code is on the order of 600 bytes) let&#8217;s stay with the code in ROM.<\/p>\n<h2>Uploading the Code into the Drive<\/h2>\n<p>Let&#8217;s consider both pieces of code on the C64 and the 1541 side finished now, but what&#8217;s still missing is code to upload the drive code into the RAM of the 1541 and run it. 
There are several ways of doing this: The 1541 operating system, reached over the original IEC protocol, has a &#8220;memory-write&#8221; (&#8220;M-W&#8221;) command, allowing us to upload up to 36 bytes at a time, and a &#8220;memory-execute&#8221; (&#8220;M-E&#8221;) command that makes the CPU jump to the address we specify. Our 1541 code is about 100 bytes, which would take about a quarter of a second to upload with the original protocol.<\/p>\n<p>But there is a way to avoid this cost: All code and data in the C64 came originally from disk, so why would we download it to the C64 and upload it to the 1541 again? We can just instruct the 1541 to read a sector and execute it. This can be done with block-read (&#8220;B-R&#8221;) and &#8220;memory-execute&#8221;, or with the specialized instruction &#8220;block-execute&#8221; (&#8220;B-E&#8221;). Unfortunately, &#8220;block-execute&#8221; does not work on the concept of buffers, but on the concept of channels which abstract buffers, making this more complex than it would have to be.<\/p>\n<p>A common trick is to upload minimal 6502 code to the drive that reads a sector and jumps to it, and then execute that code. And it&#8217;s even possible to avoid the &#8220;memory-write&#8221; command: When sending the &#8220;memory-execute&#8221;, we can send trailing bytes, for a command that is up to 42 bytes long. 
The code would just travel with the &#8220;memory-execute&#8221; command, and the execution address would point to this very code in the temporary command buffer:<\/p>\n<pre>\n    lda #$0f\n    sta $b9   ; secondary address\n    sta $b8   ; logical file number\n    ldx #&lt;cmd\n    ldy #&gt;cmd\n    lda #cmd_end - cmd\n    jsr $fdf9 ; filnam\n    jsr $f34a ; open\n    brk\n\ncmd:\n    .byte \"M-E\"\n    .word $0200 + cmd_code - cmd\ncmd_code:\n    lda #18   ; track 18, sector 18\n    sta $08\n    sta $09\n    lda #1    ; buffer at $0400\n    sta $f9\n    jsr $d586 ; read sector\n    jmp $0400 ; jump to the code we loaded\ncmd_end:\n<\/pre>\n<p>The command buffer in the 1541 is located at $0200, so the &#8220;memory-execute&#8221; jumps to the first byte just after the command itself, at $0205. We choose to store the floppy code on track 18, sector 18: Track 18 is dedicated to directory entries, so unless the disk has 144 files on it, it is unlikely all sectors of track 18 are in use. Reading the code from track 18 also means the head does not have to move if we store the C64 loader there, too.<\/p>\n<h2>Fitting C64 and Drive Code Into a Single Sector<\/h2>\n<p>But there is an even simpler and faster solution: If we manage to fit both the C64 code and the floppy code into a single sector, we don&#8217;t have to read another sector; we can just send a &#8220;memory-execute&#8221; into the buffer that the block was loaded into:<\/p>\n<pre>\ncmd:\n    .byte \"M-E\"\n    .word $0482\ncmd_end:\n<\/pre>\n<p>The default buffer is #1 at $0400 in the drive&#8217;s memory, so after the start program got loaded into the C64, the sector can still be found at $0400.<\/p>\n<p>The default 1541 file system does not support random file access; therefore, there is no central data structure that allows a lookup of the sector number following the current one. 
Instead, the link to the next track and sector is stored in the first two bytes of every sector, reducing the usable space in a sector to 254 bytes (and making seeks in a file very expensive). So the first byte in a sector is the track (1-35) and the second byte is the sector (0-20) of the following block. If it is the last block of a file, the track number is zero and the sector field contains the number of valid bytes in the block; all bytes afterwards will be ignored and not transferred. This allows files that are not a multiple of 254 bytes in size.<\/p>\n<p>So the trick is to create a file that is about half a sector in size and contains the C64 code, and we store the drive code in the unused half of the sector. The reason we always optimized for code size when choosing algorithms before is that we really need to fit everything in 256 bytes!<\/p>\n<h2>Header<\/h2>\n<p>Now what is the executable file format, you may ask? What are the headers, how are they structured? How much data is used for headers? It is complicated.<\/p>\n<p>The shell of the C64 was Commodore BASIC, a derivative of <a href=\"http:\/\/www.pagetable.com\/?p=43\">Microsoft<\/a> <a href=\"http:\/\/www.pagetable.com\/?p=46\">BASIC<\/a> <a href=\"http:\/\/www.pagetable.com\/?p=45\">for 6502<\/a>. So you would load BASIC programs from disk with the &#8220;LOAD&#8221; command, you could have them printed on the screen with &#8220;LIST&#8221; and edit them; and if you wanted to run them, you would type &#8220;RUN&#8221;. This concept wasn&#8217;t really meant for programs not written in BASIC, but it was enough to have a small BASIC header in front of your assembly program, like this:<\/p>\n<pre>\n10 SYS2061\n<\/pre>\n<p>BASIC programs get loaded to $0801, so the machine code is stored directly after this small BASIC header which tells the interpreter to run machine code at 2061 = $080D. 
But this wastes 12 bytes and requires the user to type &#8220;RUN&#8221; after the program is loaded.<\/p>\n<h2>Autostart<\/h2>\n<p>It is much nicer to have the program autostart directly after the &#8220;LOAD&#8221; command. The trick here is to have the program load not into BASIC RAM, but into a region where it overwrites vectors &#8211; it&#8217;s basically a buffer exploit! Here is a rough memory map of the C64:<\/p>\n<table border=\"1\">\n<tr>\n<td>$0000-$00FF<\/td>\n<td>BASIC and KERNAL variables<\/td>\n<\/tr>\n<tr>\n<td>$0100-$01FF<\/td>\n<td>Stack<\/td>\n<\/tr>\n<tr>\n<td>$0200-$0258<\/td>\n<td>BASIC input buffer<\/td>\n<\/tr>\n<tr>\n<td>$0259-$02FF<\/td>\n<td>BASIC and KERNAL variables<\/td>\n<\/tr>\n<tr>\n<td>$0300-$033B<\/td>\n<td>System vectors<\/td>\n<\/tr>\n<tr>\n<td>$033C-$03FF<\/td>\n<td>I\/O buffer<\/td>\n<\/tr>\n<tr>\n<td>$0400-$07FF<\/td>\n<td>Screen RAM<\/td>\n<\/tr>\n<tr>\n<td>$0800-$9FFF<\/td>\n<td>BASIC RAM<\/td>\n<\/tr>\n<tr>\n<td>$A000-$BFFF<\/td>\n<td>BASIC ROM<\/td>\n<\/tr>\n<tr>\n<td>$C000-$CFFF<\/td>\n<td>RAM<\/td>\n<\/tr>\n<tr>\n<td>$D000-$DFFF<\/td>\n<td>Device MMIO<\/td>\n<\/tr>\n<tr>\n<td>$E000-$FFFF<\/td>\n<td>KERNAL ROM<\/td>\n<\/tr>\n<\/table>\n<p>Commonly, autostart programs would overwrite the system vectors at $0300:<\/p>\n<table border=\"1\">\n<tr>\n<td>$0314-$0315<\/td>\n<td>IRQ vector<\/td>\n<\/tr>\n<tr>\n<td>$0316-$0317<\/td>\n<td>BRK vector<\/td>\n<\/tr>\n<tr>\n<td>$0318-$0319<\/td>\n<td>NMI vector<\/td>\n<\/tr>\n<tr>\n<td>$031A-$031B<\/td>\n<td>OPEN vector<\/td>\n<\/tr>\n<tr>\n<td>$031C-$031D<\/td>\n<td>CLOSE vector<\/td>\n<\/tr>\n<tr>\n<td>$031E-$031F<\/td>\n<td>CHKIN vector<\/td>\n<\/tr>\n<tr>\n<td>$0320-$0321<\/td>\n<td>CHKOUT vector<\/td>\n<\/tr>\n<tr>\n<td>$0322-$0323<\/td>\n<td>CLRCHN vector<\/td>\n<\/tr>\n<tr>\n<td>$0324-$0325<\/td>\n<td>CHRIN vector<\/td>\n<\/tr>\n<tr>\n<td>$0326-$0327<\/td>\n<td>CHROUT vector<\/td>\n<\/tr>\n<tr>\n<td>$0328-$0329<\/td>\n<td>STOP 
vector<\/td>\n<\/tr>\n<tr>\n<td>$032A-$032B<\/td>\n<td>GETIN vector<\/td>\n<\/tr>\n<tr>\n<td>$032C-$032D<\/td>\n<td>CLALL vector<\/td>\n<\/tr>\n<tr>\n<td>$032E-$032F<\/td>\n<td>unused<\/td>\n<\/tr>\n<tr>\n<td>$0330-$0331<\/td>\n<td>LOAD vector<\/td>\n<\/tr>\n<tr>\n<td>$0332-$0333<\/td>\n<td>SAVE vector<\/td>\n<\/tr>\n<tr>\n<td>$0334-$033B<\/td>\n<td>unused<\/td>\n<\/tr>\n<tr>\n<td>$033C-$03FB<\/td>\n<td>Tape buffer<\/td>\n<\/tr>\n<\/table>\n<p>Your program would load to $0326, for example, overwriting the CHROUT vector as well as the 5 following vectors, and your code would be loaded into the tape buffer starting at $033C. When loading is finished, the BASIC interpreter wants to print &#8220;READY.&#8221;, jumping through the CHROUT vector at $0326 and therefore into your code.<\/p>\n<p>The problem with this solution is that we have to preserve the values of some of the vectors between $0328 and $033B, because the original LOAD code in ROM calls the STOP vector to test whether the user pressed the STOP key. So our file would have to contain the original values, not only wasting 12 bytes, but also introducing potential incompatibilities if the user has a cartridge like the Final Cartridge III or the Action Replay VI attached &#8211; these devices were practically ROM extensions and hooked some of these vectors to provide improved functionality.<\/p>\n<p>(Actually, overwriting the STOP vector is useful in a different scenario: This way, we can catch execution during the load operation as opposed to after it and continue loading the same file with a replacement bus protocol.)<\/p>\n<p>A different way to gain control after loading is to load into the stack and overwrite the address returned to after the LOAD is finished. The stack on the 6502 is always located between $0100 and $01FF, so if we overwrite this complete area with a value of 2, we would fill the stack with &#8220;$0202&#8221; return addresses, catching execution as soon as the inner ROM LOAD code returns to its caller. 
Since the 6502 increments the return address after it fetches it from the stack, our payload would live at $0203, which is still pretty much directly after the stack area. But of course overwriting the complete stack is a waste: Experimentation shows that the one vector on the stack that actually counts is located at $01F8\/$01F9.<\/p>\n<h2>Laying Out the Code<\/h2>\n<p>The problem with the payload starting at $0203 is that we can only use the memory up to $0258 (86 bytes) &#8211; this is the buffer for a BASIC input line. Unfortunately, this is not enough, since our code is more like 110 bytes. We can put the payload before the vector we overwrite, i.e. onto the stack. But we must be careful, because the LOAD code in ROM uses some stack, overwriting the area between $01ED and $01F7. So let&#8217;s have our code start somewhere in the stack area, going up to $01EC, and put a JMP to that code at $0203 to catch the stack return.<\/p>\n<p>The 11 bytes at $01ED-$01F7 (stack that gets overwritten while loading) and the 9 bytes at $01FA-$0202 (area between the vector on the stack we overwrite and our first instruction at $0203) seem wasted &#8211; but not quite. We can use $01FE-$0202 to store our 5 byte &#8220;M-E&#8221; string, and we just fill all bytes from $01ED to $01FD with &#8220;2&#8221;. This gives us extra safety that our code will work on machines with replacement ROMs or extended ROM routines that use a slightly different stack layout &#8211; as long as they don&#8217;t use more stack and overwrite our code.<\/p>\n<h2>Final Words<\/h2>\n<p>Fast loaders and autostart bootloaders have been around for almost as long as the C64. Fast loaders have used the stack trick before, and 26 cycle drive transfer code with the screen turned on has been in use before as well. So what&#8217;s really novel about the bootloader described in this article is the combination of the most optimized tricks into a single-block (256 byte) program. 
That&#8217;s the beauty of programming for the C64: Practically everything is implicitly open source, since the best algorithms fit in a few hundred bytes of code, and an experienced C64 hacker can reverse-engineer existing code and incorporate it into his own. That&#8217;s how it has always been done.<\/p>\n<h2>The Code<\/h2>\n<p>Here is the complete code, which can be assembled with the ca65 assembler of the <a href=\"http:\/\/www.cc65.org\/\">cc65 compiler suite<\/a>.<\/p>\n<pre>\n\nTARGET := $0400\nTRACK := 18\n\nDATA_OUT := $20 ; bit 5\nCLK_OUT  := $10 ; bit 4\nVIC_OUT  := $03 ; bits need to be on to keep VIC happy\n\nseccnt = 2\n\n;----------------------------------------------------------------------\n; Hack to generate .PRG file with load address as first word\n;----------------------------------------------------------------------\n.segment \"LOADADDR\"\n.addr *\n\n;----------------------------------------------------------------------\n; Send an \"M-E\" to the 1541 that jumps to floppy code.\n; Then receive one block and run it.\n; This code lives around $0190.\n;----------------------------------------------------------------------\n.segment \"PART2\"\nmain:\n    lda #$0f\n    sta $b9\n    sta $b8\n    ldx #&lt;memory_execute\n    ldy #&gt;memory_execute\n    lda #memory_execute_end - memory_execute\n    jsr $fdf9 ; filnam\n    jsr $f34a ; open\n\n    sei\n    lda #VIC_OUT | DATA_OUT ; CLK=0 DATA=1\n    sta $DD00 ; we're not ready to receive\n\n; wait until floppy code is active\nwait_fast:\n    bit $DD00\n    bvs wait_fast ; wait for CLK=1 (inverted read!)\n\n    lda #sector_table_end - sector_table ; number of sectors\n    sta seccnt\n    ldy #0\nget_rest_loop:\n    bit $DD00\n    bvc get_rest_loop ; wait for CLK=0 (inverted read!)\n\n; wait for raster\nwait_raster:\n    lda $D012\n    cmp #50\n    bcc wait_raster_end\n    and #$07\n    cmp #$02\n    beq wait_raster\nwait_raster_end:\n\n    lda #VIC_OUT ; CLK=0 DATA=0\n    sta $DD00 ; we're ready, 
start sending!\n    pha ; 3 cycles\n    pla ; 4 cycles\n    bit $00 ; 3 cycles\n    lda $DD00 ; get 2 bits into bits 6&7\n    lsr\n    lsr ; move down by 2 (bits 4&5)\n    eor $DD00 ; get 2 more bits\n    lsr\n    lsr ; move everything down (bits 2-5)\n    eor $DD00; get 2 more bits\n    lsr\n    lsr ; move everything down (bits 0-5)\n    eor $DD00 ; get last 2 bits, now 0-7 are populated\n\n    ldx #VIC_OUT | DATA_OUT ; CLK=0 DATA=1\n    stx $DD00 ; not ready any more, don't start sending\n\nselfmod1:\n    sta TARGET,y\n    iny\n    bne get_rest_loop\n\n    inc selfmod1+2\n    dec seccnt\n    bne get_rest_loop\n\ninf:\n    jmp inf\n\n.segment \"VECTOR\"\n; these bytes will be overwritten by the KERNAL stack while loading\n; let's set them all to \"2\" so we have a chance that this will work\n; on a modified KERNAL\n    .byte 2,2,2,2,2,2,2,2,2,2,2\n; This is the vector to the start of the code; RTS will jump to $0203\n    .byte 2,2\n; These bytes are on top of the return value on the stack. 
We could use\n; them for data; or, fill them with \"2\" so different versions of KERNAL\n; might work\n    .byte 2,2,2,2\n\n.segment \"CMD\"\nmemory_execute:\n     .byte \"M-E\"\n     .word $0480 + 2\nmemory_execute_end:\n\n;----------------------------------------------------------------------\n; Jump to code that receives data.\n;----------------------------------------------------------------------\n.segment \"START\"\n    jmp main\n\n;----------------------------------------------------------------------\n;----------------------------------------------------------------------\n; C64 -> Floppy: direct\n; Floppy -> C64: inverted\n;----------------------------------------------------------------------\n;----------------------------------------------------------------------\n\n.segment \"FCODE\"\n\nF_DATA_OUT := $02\nF_CLK_OUT  := $08\n\nsec_index := $05\n\nstart1541:\n    lda #F_CLK_OUT\n    sta $1800 ; fast code is running!\n\n    lda #0 ; sector\n    sta sec_index\n    sta $f9 ; buffer $0300 for the read\n    lda #TRACK\n    sta $06\nread_loop:\n    ldx sec_index\n    lda sector_table,x\n    inc sec_index\n    bmi end\n    sta $07\n    cli\n    jsr $D586       ; read sector\n    sei\n\nsend_loop:\n; we can use $f9 as the byte counter, since we'll return it to 0\n; so it holds the correct buffer number \"0\" when we read the next sector\n    ldx $f9\n    lda $0300,x\n\n; first encode\n    eor #3 ; fix up for receiver side (VIC bank!)\n    pha ; save original\n    lsr\n    lsr\n    lsr\n    lsr ; get high nybble\n    tax ; to X\n    ldy enc_tab,x ; super-encoded high nybble in Y\n    ldx #0\n    stx $1800 ; DATA=0, CLK=0 -> we're ready to send!\n    pla\n    and #$0F ; lower nybble\n    tax\n    lda enc_tab,x ; super-encoded low nybble in A\n; then wait for C64 to be ready\nwait_c64:\n    ldx $1800\n    bne wait_c64; needs all 0\n\n; then send\n    sta $1800\n    asl\n    and #$0F\n    sta $1800\n    tya\n    nop\n    sta $1800\n    asl\n    and #$0F\n    sta 
$1800\n\n    jsr $E9AE ; CLK=1 10 cycles later\n\n    inc $f9\n    bne send_loop\n    beq read_loop\n\nend:\n    jmp *\n\nenc_tab:\n    .byte %1111, %0111, %1101, %0101, %1011, %0011, %1001, %0001\n    .byte %1110, %0110, %1100, %0100, %1010, %0010, %1000, %0000\n\nsector_table:\n    .byte 0,1,2,3,$FF\nsector_table_end:\n<\/pre>\n<p>This is the linker script:<\/p>\n<pre>\nMEMORY {\n    # hack to get the load address as the first 2 bytes into the .PRG\n    LOADADDR: start = $0188, size = 2;\n\n    # the receive code, filled with $02s that overwrite the top few bytes of\n    # the stack and make the KERNAL loader return to $0203\n    PART2:    start = $0188, size = $0065, fill = yes, fillval = $FF, file = %O;\n\n    VECTOR:   start = $01ED, size = $0011, fill = yes, fillval = $FF, file = %O;\n\n    CMD:      start = $01FE, size = $0005, fill = yes, fillval = $FF, file = %O;\n\n    # entry point $0203 due to stack overwritten with $02s\n    # code that transfers M-E\n    START:    start = $0203, size = $0003, fill = yes, fillval = $ff, file = %O;\n\n    FCODE:    start = $482, size = $007E, fill = yes, fillval = $ff, file = %O;\n}\n\nSEGMENTS {\n    LOADADDR:   load = LOADADDR,    type = ro;\n    START:      load = START,       type = ro;\n    PART2:      load = PART2,       type = ro;\n    CMD:        load = CMD,         type = ro;\n    VECTOR:     load = VECTOR,      type = ro;\n    FCODE:      load = FCODE,       type = ro;\n}\n<\/pre>\n<p>This is the script for the c1541 tool, which puts the code into a disk image:<\/p>\n<pre>\nformat autostart,01\nwrite \"start.prg\"\n<\/pre>\n<p>And this is the shell script that builds the whole thing:<\/p>\n<pre>\nca65 start.s &&\nld65 -C start.cfg start.o -o start.prg &&\ndd if=\/dev\/zero of=autostart.d64 bs=256 count=683 &&\nc1541 autostart.d64 < c1541script.txt\n<\/pre>\n<p>Note that the c1541 tool creates a file with the whole block on disk, so in practice, the 1541 code will be loaded into the C64 as well, but never used. 
So the two link bytes of the block would have to be manually changed to decrease its size to achieve maximum speed.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Update: The source is available at github.com\/mist64\/fastboot1541 Platforms like the Commodore 64 are still a lot of fun to work with, not only because the limitations make certain tasks a real challenge, but also because it is possible to use many interesting tricks on a bit- and cycle-level &#8211; after all, the system is well-understood &#8230; <a title=\"A 256 Byte Autostart Fast Loader for the Commodore 64\" class=\"read-more\" href=\"https:\/\/www.pagetable.com\/?p=568\" aria-label=\"Read more about A 256 Byte Autostart Fast Loader for the Commodore 64\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[41,8,13],"tags":[],"class_list":["post-568","post","type-post","status-publish","format-standard","hentry","category-c64","category-commodore","category-floppy-disks"],"_links":{"self":[{"href":"https:\/\/www.pagetable.com\/index.php?rest_route=\/wp\/v2\/posts\/568","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pagetable.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pagetable.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pagetable.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pagetable.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=568"}],"version-history":[{"count":0,"href":"https:\/\/www.pagetable.com\/index.php?rest_route=\/wp\/v2\/posts\/568\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.pagetable.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=568"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pagetable.com\/index.php?rest_rout
e=%2Fwp%2Fv2%2Fcategories&post=568"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pagetable.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=568"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}