Monthly Archives: July 2008

How MOS 6502 Illegal Opcodes really work

The original NMOS version of the MOS 6502, used in computers like the Commodore 64, the Apple II and the Nintendo Entertainment System (NES), is well-known for its illegal opcodes: Out of 256 possible opcodes, 151 are defined by the architecture, but many of the remaining 105 undefined opcodes do useful things.

Many articles have been written to test and document these, but I am not aware of any article that tries to explain where exactly they come from. I’ll do this here.

The Block Diagram

Every 6502 data sheet comes with a block diagram, but these are of no use, because they are oversimplified, partially incorrect, and don’t explain how instruction decoding works. The following more detailed diagram is a lot more useful:



(Original from Apple II things)

The Decode ROM (PLA)

There is no need to understand the whole diagram. The important part is on the left: The instruction register, which holds the opcode, and the current clock cycle within the instruction (T0 to T6) get fed into a 130×21 bit decode ROM, i.e. a ROM with 130 lines of 21 bits each. On the die shot, this is the green area on the bottom.



(Original from Molecular Expressions)

While other CPUs from the same era used microcode to interpret the instruction, the 6502 had this 130×21 bit PLA. All lines of the PLA compare the instruction and the current clock cycle, and if they match, the line fires. A little simplified, every line looks like this:

ON bits OFF bits timing
7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 T6 T5 T4 T3 T2 T1

(See the diagrams at http://impulzus.sch.bme.hu/6502/ for details; partial English translation of the website here).

  • “ON bits” specifies, which bits need to be set for this line to fire.
  • “OFF bits” specifies, which bits need to be clear for this line to fire.

The opcode table of the 6502 is laid out in a way that you can find easy rules to generalize the effects of similar opcodes. For example, the branch opcodes are encoded like this:

%aab10000

where “aa” is the condition (00=N, 01=V, 10=C, 11=Z) and “b” decides whether the branch is taken on a set or a clear flag.

So the following line would fire on the first cycle of any branch:

ON bits OFF bits timing
7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 T6 T5 T4 T3 T2 T1
0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1

From now on, let’s write it differently, so that it’s more readable:

mask cycle description
XXX10000 T1 T1 of Bcc: fetch branch offset

If a line fires, it outputs a “1″. The “Random Control Logic” that can seen in the diagram then AND/OR-combines some lines and feeds the result into various components of the CPU: In the case of a branch, this would result in fetching the branch offset, for example.

One line can fire for several opcodes that are similar in their encoding and thus their behavior: For example, “LDA abs”, “ORA abs” and “AND abs” all do the same thing (fetch the low byte of the address) in T1, so there can be a line that matches all these opcodes and causes a memory fetch and a PC increment. Also, multiple lines can fire at the same time for any given cycle within an instruction, which will have the combined effect of the single lines.

LDA and LDX becomes LAX

Now there are many undefined opcodes. The designers of the 6502 have not created any specific PLA lines for them, but since their opcodes are similar to well-defined opcodes, there might be lines that fire nevertheless.

Let’s take opcode $AF for example, which is “LAX absolute”. It loads a value from an absolute address in memory and stores it in A and X at the same time. This is somewhat the combination of opcodes $AD (LDA abs) and $AE (LDX abs).

The instructions “LDA/LDX/LDY abs” ($AC/$AD/$AE) consist of four cycles:

  • The first cycle fetches the low byte of the address.
  • The second cycle fetches the hgh byte of the address.
  • The third cycle fetches the address from memory and stores it in A/X/Y.
  • The fourth cycle fetches the next instruction.

Cycles T1, T2 and T4 are identical for all three of them, and they are encoded smilarly, so the following three PLA lines can be used to detect these instructions and signal the rest of the CPU to carry out the specific tasks:

mask cycle description
101011XX T1 T1 of $AC/$AD/$AE: fetch addr/lo
101011XX T2 T2 of $AC/$AD/$AE: fetch addr/lo
101011XX T4 T4 of $AC/$AD/$AE: fetch next opcode

The mask %101011XX doesn’t only fire for $AC/$AD/$AE, but also for the undefined opcode $AF: So $AF (LAX) behaves the same as LDA/LDX/LDY in T1/T2/T4, i.e. it fetches a 16 bit address and in the end fetches the next opcode.

T3 differs in all three cases, so it has to be handled by one separate line per case:

mask cycle description
10101100 T3 T3 of $AC: read into Y
101011X1 T3 T3 of $AD: read into A
1010111X T3 T3 of $AE: read into X

(Actually, the lines in the actual PLA might be less specific, i.e. contain more X bits, since there are similar instructions like “ORA absolute” that might share this line.)

The line for $AC is only true for the exact value of $AC, but the $AD and $AE lines have one “don’t care” bit each. The bitfield of $AF, which is %10101111, is true for both masks, so in T3 of $AF, both the $AD and the $AE lines fire.

In T3, LDA/LDX/LDY have in common that they all read from memory and put the result onto the internal “SB” bus. “LDA” also sets the “SB->AC” control line to “1″, which will make the accumulator read its value from SB. Likewise, LDX causes “SB->X” to be “1″ and makes X to read from the SB bus, and LDY reads SB into the Y register.

Since both the LDA and the LDX lines fire, both the accumulator and the X register will be sent the command to load their values from the SB bus, so $AF is effectively an LAX: Load Accumulator and X.

The KIL Opcodes

There are many “KIL” opcodes, i.e. opcodes that stop the CPU, so that it can only recover using a RESET, and not even an IRQ or an NMI.

In order to understand this, let’s look at the different states an instruction can be in. After the instruction fetch, the CPU is in cycle T1. It will feed the opcode and the cycle number into the PLA and cause the rest of the CPU to carry out whatever has to be done in this cycle, according to the PLA. Then it will shift the T bitfield left by one, so the T2 line will be “1″, then line T3 and so on. There are seven T lines total, T1 to T7. At the end of each instruction, the PLA causes the T bitfield to reset, so that the next instruction starts with T1=1 again.

But what happens if T does not get reset? This can happen if in all seven states of T, no line fires that actually belongs to an instruction that ends at this cycle. T gets shifted left until state T7, in which another shift left will just shift the 1 bit out of T – all bits of T will be zero then, so no PLA line can fire any more.

All interrupt and NMI requests are always delayed until the current instruction is finished, i.e. until T gets reset. But since T never gets reset, all interrupts and NMIs are effectively disabled.

What’s next?

There are many illegal opcodes, some with very weird behavior, and some that have been documented as unstable. Studying all these can reveal many interesting details about the internal design of the 6502.

Apple I BASIC as a Mac OS X Scripting Language

Update: Commodore BASIC as a Scripting Language for UNIX and Windows – now Open Source

Recently, we reconstructed a perfect copy of Apple I BASIC, the first piece of software Apple ever sold – in 1976. Now it is time to make something useful out of it. Wouldn’t it be nice if you could use Apple I BASIC to replace, say, Perl? Wouldn’t it be nice if you could do this:

$ cat reverse.bas
#!/usr/bin/apple1basic
10 DIM A$(100)
20 INPUT A$
30 FOR I = LEN(A$) TO 1 STEP -1
40 PRINT A$(I,I);
50 NEXT I
60 PRINT
70 END
$ chmod a+x reverse.bas
$ echo MICHAEL STEIL | ./reverse.bas
LIETS LEAHCIM
$

Here is Apple I BASIC as a scripting language for Mac OS X Intel:

apple1basic_osx.zip

Just yet another Apple I emulator for Mac? No, it is not. There are some very important differences:

  • The “apple1basic” executable is a statically recompiled version of the original binary. All code is running natively.
  • “apple1basic” plugs right into UNIX stdin and stdout.
  • You can pass “apple1basic” the filename of a BASIC program to run.
  • You can run BASIC programs like shell scripts.

Let’s play with it for a bit. First, copy “apple1basic” to /usr/bin:

$ sudo cp apple1basic /usr/bin

Let’s try direct mode first:

$ apple1basic
>PRINT"HELLO WORLD!"
HELLO WORLD!
>

Now let’s write a small program:

$ apple1basic
>10 FOR I = 1 TO 10
>20 TAB I: PRINT "HELLO WORLD!"
>30 NEXT I
>40 END
>RUN
HELLO WORLD!
 HELLO WORLD!
  HELLO WORLD!
   HELLO WORLD!
    HELLO WORLD!
     HELLO WORLD!
      HELLO WORLD!
       HELLO WORLD!
        HELLO WORLD!
         HELLO WORLD!
>

We can tell apple1basic to run a BASIC program from a file, too:

$ cat hello.bas
10 FOR I = 1 TO 10
20 TAB I: PRINT "HELLO WORLD!"
30 NEXT I
40 END
$ apple1basic hello.bas
HELLO WORLD!
 HELLO WORLD!
  HELLO WORLD!
   HELLO WORLD!
    HELLO WORLD!
     HELLO WORLD!
      HELLO WORLD!
       HELLO WORLD!
        HELLO WORLD!
         HELLO WORLD!
$

apple1basic can be interactive:

$ cat name.bas
10 DIM N$(20)
20 PRINT "WHAT IS YOUR NAME";
30 INPUT N$
40 PRINT "HELLO, "; N$; "!"
50 END
$ apple1basic name.bas
WHAT IS YOUR NAME?MICHAEL
HELLO, MICHAEL!
$

apple1basic supports redirection of stdin and stdout. Note that if stdin is a pipe, “INPUT” doesn’t print the “?”:

$ cat reverse.bas
10 DIM A$(100)
20 INPUT A$
30 FOR I = LEN(A$) TO 1 STEP -1
40 PRINT A$(I,I);
50 NEXT I
60 PRINT
70 END
$ echo MICHAEL STEIL | apple1basic reverse.bas
LIETS LEAHCIM
$

Which brings us back to our first example: You can even use apple1basic as a UNIX script interpreter by adding the hashbang as the first line:

$ cat reverse.bas
#!/usr/bin/apple1basic
10 DIM A$(100)
20 INPUT A$
30 FOR I = LEN(A$) TO 1 STEP -1
40 PRINT A$(I,I);
50 NEXT I
60 END
$ chmod a+x reverse.bas
$ echo MICHAEL STEIL | ./reverse.bas
LIETS LEAHCIM
$

Some more programs

These games, written for the Apple I in the 1970s, have been taken from this and other sites and converted from tokenized hexcode into ASCII.

You can also download hanoi-apple1.bas, a fixed version of Amit Singh’s BASIC program from his Hanoimania project.

How it is done

My static recompiler takes the Apple I BASIC binary as an input and produces a native executable. This file does not contain the original 6502 code and runs 100% natively. On a 2 GHz machine, it runs about 2000 times faster than a 1 MHz Apple I, so it can run one 6502 clock cycle per native cycle. (In the best case, an interpreter can only get up to 1/10 of an emulated cycle per native cycle.)

Accesses to the Apple I terminal (keyboard and screen) are handled by functions that implement a few hacks. Depending on the program counter and the stack, I can discard some output (like terminal echo, 80 column forced line wrap) and redirect files and stdin to keyboard input, depending on context. For example, if there is a filename on the command line, keyboard emulation passes these characters into BASIC, and adds the “RUN” command. All following input will be read from stdin, and as soon as the code to print the “>” prompt is detected, the output handler quits the application.

Commodore BASIC / Microsoft BASIC 6502

I have done the same thing to the 9 KB Microsoft BASIC for 6502 taken from the Commodore 64 ROM.

cbmbasic_osx.zip

$ cbmbasic

    **** COMMODORE 64 BASIC V2 ****

 64K RAM SYSTEM  38911 BASIC BYTES FREE

READY.
_

It has many of the same features as Apple I BASIC: It discards the banner and the “READY.” prompt output when runnig in non-interactive mode, accepts a BASIC file as a parameter, and can be used as a script interpreter.

But… why??

I love the idea of reusing old code. Emulation is nice, but it rarely integrates well into modern systems. Most code out there has either a very lightweight (or well-understood) interface to the hardware (like the Apple I) or the opertaing system (like Commodore BASIC, as well as all programs from the GUI era), so hooking it up with modern operting system services can work out very nicely.

Also, emulation of vintage systems rarely cares too much about performance any more, since it is usually fast enough, i.e. as fast as or faster than the original system. But computers have always been slow, and running very old code can have the advantage of working with really fast software.

And yes, I think there is a lot of useful vintage code out there.

Links

Optimized Flags Code for 6502

When I disassembled Steve Wozniak’s Apple I BASIC, I found a 6502 trick that I had never seen before, although I had read a lot of 6502 code, including the very well-written Commodore BASIC (i.e. Microsoft BASIC for 6502).

What is the most optimized way (shortest code) to set, clear and test a flag?

Normally, you would do this (“flag” is a byte in the zero page):

function code bytes cycles
set
lda #1
sta flag
4 5
clear
lda #0
sta flag
4 5
test
lda flag
beq cleared
4 5/6

Woz did it like this:

function code bytes cycles
set
lda #$80
sta flag
4 5
clear
lsr flag
2 5
test
bit flag
bpl cleared
4 5/6

The trick is to store the flag in bit #7. Clearing bit #7 is easy: Just do a logical shift right, which will always write a zero into the MSB; we don’t care about the contents of the other bits. The “lsr zp” is the same speed as the “lda/sta zp” combination, but only occupies 2 bytes instead of four.

The other advantage of this method is that it is possible to test the flag without destroying a register. The “bit” instruction will just copy the MSB into the negative flag, while the “lda” would have overwritten the accumulator.

I can’t see why this couldn’t be adapted to other (CISC, RMW-capable) CPUs as well. If you are fluent in any other assembly language, please post a comment.

1200 Baud Archeology: Reconstructing Apple I BASIC from a Cassette Tape

The audio file that was posted two weeks ago is indeed a very important artifact of computer history: It is a recording of the “Apple I BASIC” cassette tape that came with the Apple I. It is the first piece of Software ever sold by Apple (not counting computer firmware).

Here is the first confirmed perfect dump of the 4096 bytes: apple1basic.bin

The Apple I is extremely rare. Only 200 were built, and less than 100 are believed to be in existence. Neither Steve nor Woz own an Apple I any more, and neither does Apple Inc. (Update: Woz still owns some.) The cassettes are even rarer, as not every Apple I came with one. There has not been a dump of the tape until 2002, when Achim Breidenbach of Boinx Software got an MP3 recording of an original Apple 1 BASIC tape by Larry Nelson, an Apple I owner, and, with a little help from his father (who worked with an Apple I back in 1976), managed to decode it by writing a program – in GFA BASIC on an Atari ST. Here is a screenshot of the visualization this program could provide:

Achim wrote an Apple I emulator, included a commented disassembly of his Apple I BASIC dump, and published it on the internet. Other people continued working on the disassembly and changed instructions that they thought were mistakes in the dump. The only dump that can be easily found today includes these changes. It was time to analyze the tape again and get an authoritative dump of the 4096 bytes.

So here is how to decode the signal. Let us first open the audio file in Audacity and look at the waveform. The signal is mono, and as it turns out that the quality of the left channel is better, let us delete the other channel. This is what the whole ~35 second recording looks like:

There is silence at the beginning and at the end – but we can just as well hear that. We need to zoom in to see the actual waveform. From about 2 seconds to 12.4 seconds, when we hear a continuous beep, the signal signal consists of uniform waves:

Afterwards, during what sounds like noise, the waveform looks very different: There are shorter waves and loger waves.

The original output of the Apple I when writing the tape was a square wave, but the signal was filtered by the properties of the magnetic tape, so high frequencies were removed, and the signal became rounder. On the way back in from tape, the op-amp of the Cassette Interface converted the signal back into a square wave by converting all samples above a certain value into 1, and all samples below into 0. While the threshold in the op-amp was fixed, it could be effectively manipulated by changing the output volume of the cassette player. Effect->Amplify in Audacity gives us the following picture:

It is now time to write a small program to measure and dump the width of the pulses. We need to be able to parametrize the threshold (the volume knob) to be able to try different settings. In Audacity, we have to save the file as “WAV” first, because it is easy to decode this file format: We just skip the 44 byte header, and every following byte is a signed 16 bit value representing the sample.

First runs show us that all pulses in the first part of the file are around 56 samples long, while the rest of the file contains pulses which have a length of either around 22 or around 45. It seems the first part is a sync signal, so that the reader knows when the data starts. The two pulse lengths afterwards represent ones and zeros. It turns out that the shorter pulses are zeros; this way we get readable 6502 assembly code and some ASCII data (though with bit #7 set).

Here is a hexdump of the first few bytes:

0000000 4c b0 e2 ad 11 d0 10 fb ad 10 d0 60 8a 29 20 f0
0000010 23 a9 a0 85 e4 4c c9 e3 a9 20 c5 24 b0 0c a9 8d
0000020 a0 07 20 c9 e3 a9 a0 88 d0 f8 a0 00 b1 e2 e6 e2
0000030 d0 02 e6 e3 60 20 15 e7 20 76 e5 a5 e2 c5 e6 a5

In order to verify that our decoded data is correct, we must make sure that after the sync, there are only pulses that fall into the 22 or 45 category. If we play with the volume and find a volume region (i.e. different but similar volume settings) that has no pulse length errors and decodes into the same data, we can be confident that the data is correct. According to what is printed on the cassette, Apple 1 BASIC is supposed to reside in RAM at 0xE000 to 0xEFFF, so
the length of the decoded data should be 4096 bytes.

The following program is a possibly minimal implementation of the required algorithm:

apple1basic-decode.c

#include <stdio.h>

#define DIVISOR 30

int main() {
    int index = 0, last = 0, direction = 1, syncstate = 0, bitindex = 0;
    int distance;
    unsigned char outbyte;
    signed short sample;

    while (!feof(stdin)) {
        sample = getchar() | getchar()<<8;
        if (!direction) {
            if (sample>(32768/DIVISOR)) {
                distance = index-last;
                if (distance<50) {
                    if (syncstate == 2) {
                        outbyte = outbyte << 1 | (distance<32? 0:1);
                        if (!((++bitindex)&7)) putchar(outbyte);
                    }
                    if (syncstate == 1) syncstate++;
                } else if ((distance<70) && !syncstate)
                    syncstate++;
                last = index;
                direction++;
            }
        } else
            if (sample<-(32768/DIVISOR))
                direction--;
        index++;
    }

    return bitindex/8;
}

Run it like this:

cat apple1basic-mono.wav | ./apple1basic-decode > apple1basic.bin

You can optionally specify a parameter to the program which overrides the default value of 30 for "DIVISOR". Values between ~25 and ~95 will reconstruct the data correctly.

apple1basic.bin

If you can read 6502 assembly, you should definitely have a look at the disassembly, as this is very high quality (in the 1970s, this meant: very tightly compressed) code.

Update: You can run Apple I BASIC as a scripting language on Mac OS X now!