Archive for September, 2010

Internals of BRK/IRQ/NMI/RESET on a MOS 6502

Wednesday, September 29th, 2010

After 35 years of measuring the behaviour of the MOS 6502 CPU to better understand what is going on, the Visual6502 simulator finally allows us insight into the chip, so we can understand what the CPU does internally. One interesting thing here is the question how the 6502 handles BRK, IRQ, NMI and RESET.

The Specification

Let’s revisit the documented part first. The 6502 knows three vectors at the top of its address space:

Signal Vector
NMI $FFFA/$FFFB
RESET $FFFC/$FFFD
IRQ/BRK $FFFE/$FFFF
  • On a RESET, the CPU loads the vector from $FFFC/$FFFD into the program counter and continues fetching instructions from there.
  • On an NMI, the CPU pushes the low byte and the high byte of the program counter as well as the processor status onto the stack, disables interrupts and loads the vector from $FFFA/$FFFB into the program counter and continues fetching instructions from there.
  • On an IRQ, the CPU does the same as in the NMI case, but uses the vector at $FFFE/$FFFF.
  • On a BRK instruction, the CPU does the same as in the IRQ case, but sets bit #4 (B flag) in the copy of the status register that is saved on the stack.

The four operations are very similar, they only differ in the location of the vector, whether they actually push data onto the stack, and whether they set the B flag.

Signal Vector Push PC and P Set B Flag
NMI $FFFA/$FFFB yes no
RESET $FFFC/$FFFD no no
IRQ $FFFE/$FFFF yes no
BRK $FFFE/$FFFF yes yes

BRK

Ignoring opcode fetches, the PLA ROM defines the following cycles of the BRK instruction (6502 Programming Manual, page 131):

  • store PC(hi)
  • store PC(lo)
  • store P
  • fetch PC(lo) from $FFFE
  • fetch PC(hi) from $FFFF

IRQ

An IRQ does basically the same thing as a BRK, but it clears the B flag in the pushed status byte. The CPU goes through the same sequence of cycles as in the BRK case, which is done like this:

If there is an IRQ pending and the current instruction has just finished, the interrupt logic in the 6502 forces the instruction register (“IR”) to “0″, so instead of executing the next instruction, the PLA will decode the instruction with the opcode 0×00 – which is BRK! Of course it has to kick in a few cycles later again to make sure a B value of 0 is pushed, but otherwise, it’s just the BRK instruction executing.

NMI

Not surprisingly, NMI is done the same way: “0″ is injected into the instruction stream, but this time, some extra logic makes sure that the addresses $FFFA/$FFFB are put onto the address bus when fetching the vector.

RESET

RESET also runs through the same sequence, but it is the most different of the four cases, since it does not write the current PC and status onto the stack – but this was hacked trivially: The bus cycles exist, but the read/write line is not set to “write”, but “read” instead. The following trace was created with the transistor data from the Visual 6502 project and shows the first nine cycles after letting go of RESET:

#0 AB:00FF D:00 R/W:1 PC:00FF A:AA X:00 Y:00 SP:00 P:02 IR:00  READ $00FF = $00

Cycle 0: When a 6502 is turned on, the stack pointer is initialized with zero. The BRK/IRQ/NMI/RESET sequence pulls the instruction register (IR) to 0.

#1 AB:00FF D:00 R/W:1 PC:00FF A:AA X:00 Y:00 SP:00 P:02 IR:00  READ $00FF = $00
#2 AB:00FF D:00 R/W:1 PC:00FF A:AA X:00 Y:00 SP:00 P:02 IR:00  READ $00FF = $00
#3 AB:0100 D:00 R/W:1 PC:00FF A:AA X:00 Y:00 SP:00 P:02 IR:00  READ $0100 = $00

Cycle 3: The first stack access happens at address $0100 – a push first stores the value at $0100 + SP, then decrements SP. In the BRK/IRQ/NMI case, this would have stored the high-byte of the PC. But for RESET, it is a read cycle, not a write cycle, and the result is discarded.

#4 AB:01FF D:00 R/W:1 PC:00FF A:AA X:00 Y:00 SP:00 P:02 IR:00  READ $01FF = $00

Cycle 4: SP is now 0xFF (even if the internal state does not reflect that), so the second stack access (which would have been the low-byte of PC) targets 0x01FF. Again, the result is discarded, and SP decremented.

#5 AB:01FE D:00 R/W:1 PC:00FF A:AA X:00 Y:00 SP:00 P:02 IR:00  READ $01FE = $00

Cycle 5: SP is now 0xFE, and the third stack access, (the status register) happens at 0x01FE. SP is decremented again.

#6 AB:FFFC D:E2 R/W:1 PC:00FF A:AA X:00 Y:00 SP:FD P:06 IR:00  READ $FFFC = $E2

Cycle 6: The internal state of the CPU now shows that SP is 0xFD, because it got decremented 3 times for the three fake push operations. The low-byte of the vector is read.

#7 AB:FFFD D:FC R/W:1 PC:00FF A:AA X:00 Y:00 SP:FD P:16 IR:00  READ $FFFD = $FC

Cycle 7: The high-byte of the vector is read.

#8 AB:FCE2 D:A2 R/W:1 PC:FCE2 A:AA X:00 Y:00 SP:FD P:16 IR:00  READ $FCE2 = $A2

Cycle 8: The first actual instruction is fetched.

Since RESET is not timing critical, it doesn’t matter whether a few cycles are wasted by doing the fake stack cycles.

Measuring the ROR Bug in the Early MOS 6502

Tuesday, September 28th, 2010

The MOS 6502 CPU was introduced in September of 1975, and while the documentation described the three shift/rotate instructions ASL, LSR and ROL, the ROR instruction was missing – the documentation said that ROR would be available in chips starting in June 1976. In fact, the reason for this omission was that the instruction, while being present, didn’t behave correctly. Only few 6502s with the defect are in existence, and nobody seemed to have checked what was actually going on in these chips.

Simon C got my KIM-1 working again, which has a 6502 from week 51 of 1975. There are 512 possible inputs to ROR (8 bit A plus 1 bit C; assuming it doesn’t have dependencies on other registers), and roughly two bytes of output: the 8 bit result and the processor status (flags) register. We ran the following programs on the KIM-1 – note that we had to split the task into several programs, because the KIM-1 doesn’t have enough RAM to hold all results.

Program 1a: for A=00..FF, with C=0, execute “ROR A”, save result into array

.C:0200   A2 00      LDX #$00
.C:0202   8A         TXA
.C:0203   18         CLC
.C:0204   6A         ROR A
.C:0205   9D 00 03   STA $0300,X
.C:0208   E8         INX
.C:0209   D0 F7      BNE $0202
.C:020B   00         BRK

Program 1b: for A=00..FF, with C=1, execute “ROR A”, save result into array

.C:0200   A2 00      LDX #$00
.C:0202   8A         TXA
.C:0203   38         SEC
.C:0204   6A         ROR A
.C:0205   9D 00 03   STA $0300,X
.C:0208   E8         INX
.C:0209   D0 F7      BNE $0202
.C:020B   00         BRK

Program 2a: for A=00..FF, with C=0, execute “ROR A”, save flags into array

.C:0200   A2 00      LDX #$00
.C:0202   8A         TXA
.C:0203   38         SEC
.C:0204   6A         ROR A
.C:0205   08         PHP
.C:0206   68         PLA
.C:0207   9D 00 03   STA $0300,X
.C:020a   E8         INX
.C:020b   D0 F5      BNE $0202
.C:020d   00         BRK

Program 2b: for A=00..FF, with C=1, execute “ROR A”, save flags into array


.C:0200   A2 00      LDX #$00
.C:0202   8A         TXA
.C:0203   38         SEC
.C:0204   6A         ROR A
.C:0205   08         PHP
.C:0206   68         PLA
.C:0207   9D 00 03   STA $0300,X
.C:020a   E8         INX
.C:020b   D0 F5      BNE $0202
.C:020d   00         BRK

These are the results:

Program 1a and 1b (result with C=0 and C=1) produced the same output!

00 -> 00
01 -> 02
02 -> 04
03 -> 06
04 -> 08
05 -> 0A
[...]
FF -> FE

This is a shift left (!!!) of the input value; and it is independent of the carry flag – just like ASL.

Program 2a (status when C=0) produced the following output:

00:     7E (Z=1, N=0, C=0)
01..3F: 7C (Z=0, N=0, C=0)
40..7F: FC (Z=0, N=1, C=0)
80:     7E (Z=1, N=0, C=0)
81..BF: 7C (Z=0, N=0, C=0)
C0..FF: FC (Z=0, N=1, C=0)

Program 2b (status when C=1) produced the following output:

00:     7F (Z=1, N=0, C=1)
01..3F: 7D (Z=0, N=0, C=1)
40..7F: FD (Z=0, N=1, C=1)
80:     7F (Z=1, N=0, C=1)
81..BF: 7D (Z=0, N=0, C=1)
C0..FF: FD (Z=0, N=1, C=1)

These are the correct flags corresponding to the incorrect results in A. The carry flag is the same as the input carry flag, i.e. it is unmodified.

Our preliminary summary is this:

  ROR          Broken ROR on pre-June '76 CPU (Memory or Accumulator)
                   +-+-+-+-+-+-+-+-+
  Operation:       |7|6|5|4|3|2|1|0| <- 0
                   +-+-+-+-+-+-+-+-+                    N Z C I D V
                                                        / / _ _ _ _

So ROR on early 6502s does three things wrong:

  1. It shifts left, instead of right (behaves like ASL)
  2. It shifts a zero in, instead of C (behaves like ASL)
  3. It doesn't update C (as if it wasn't a rotate instruction)

All three problems are flags that are sent to the ALU: the shift direction, the input bit, and the carry writeback.

Unresolved questions:

  • We only tested ROR A; other addressing modes of ROR might behave differently. Other addressing modes might even be working - but I doubt that, since MOS would certainly have documented the working ones then.
  • ROR might have more dependencies than A and C.
  • What is it in the chip that causes the bug? I'm sure the fine guys at visual6502.org will be able to figure this one out soon. It is unlikely to be a bug in the PLA ROM, because a bug there would not affect different addressing modes of the same instruction with very different timings. It is more likely that it is in the "Random Control Logic" part.

Measuring the Entropy of the MOS 6502 CPU

Sunday, September 26th, 2010

Everything can be expressed in bits. It takes 4 kilobits to decompress ZIP data, 25 kilobits to kill a human, 43 megabits for a working Mac OS X kernel, and 10^120 bits to describe our universe. What is the entropy/complexity of the 6502 CPU, you might wonder?

You have probably already seen Visual 6502, a simulator of the 6502 that operates at the transistor level and visualizes what’s going on inside the CPU. The data the program operates on was derived by converting a high-resolution photograph of the 6502 die into polygons, and converting these polygons into the node and transistor configuration.

If you ignore the polygon data, which is only needed for visualization, the data to fully describe the 6502 consists of a set of nodes and a set of transistors.

There are 1725 nodes in the 6502, and 3510 transistors. Every transistor has one node as an input that turns the transistor on or off. If the transistor is on, it connects two other nodes, and if it is off, it separates them. Every node can either be neutral, or be a pullup – the latter means that the node will behave as if it was connected to VCC whenever the path through this node is otherwise floating, i.e. not connected to VCC or GND (or an external pulldown).

So the raw information that describes the 6502 is a set of nodes, for which there is one bit of information, whether it is a pullup node or a neutral node, as well as a set of transistors, which is a three-tuple consisting of gate, c1 and c2.

The file segdefs.js in visual6502 contains (next to the polygon information) the node state: “+” means pullup, “-” (or a missing node description) means neutral. The file transdefs.js contains the transistor tuples.

Stripped from irrelevant data, the 6502 description would look like this: The node states:

BOOL nodes[] = {
    /*  0 */ 1,
    /*  1 */ 0,
    /*  2 */ 0,
    /*  3 */ 1,
    /*  4 */ 1,
    /*  5 */ 1,
    /*  6 */ 1,
    /*  7 */ 0,
    /*  8 */ 1,
    /*  9 */ 0,
    /* 10 */ 1,
    [...]
}

…and the transistors:

struct {
	int gate;
	int c1;
	int c2;
} transistors[] = {
    {357, 558, 217},
    {1608, 657, 349},
    {412, 558, 1146},
    {558, 558, 943},
    {826, 230, 657},
    [...]
}

A quick estimate would be 1 bit per node (pullup or neutral), and 3 times 11 bits (3 node numbers from 1 to 1725) per transistor, resulting in about 117 kilobits; but there is still a lot of redundancy in this representation.

The transistor tuples can be represented in a much more compact way. First, there is no ordering for the transistors, so if we sort them by gate, we can encode just the difference between the current gate and the previous gate:

struct {
	int gate;
	int c1;
	int c2;
} transistors[] = {
    {1, 890, 558},
    {4, 558, 11},
    {5, 558, 146},
    {6, 558, 282},
    {6, 874, 657},
    {7, 657, 1591},
    {7, 657, 1591},
    {7, 657, 1591},
    {8, 150, 558},
    [...]
}

In practice, the difference between two gate nodes in the transistor list is a number between 0 and 4, so we could use 3 bits (instead of 11) for the gate. But in fact, it is enough to store a single bit: The nodes have no order either, so we can renumber them so that all nodes that are connected to a gate of a transistor are numbered from 0, and all nodes that are not connected to gates will be numbered above the ones with gates:

struct {
	int gate;
	int c1;
	int c2;
} transistors[] = {
    {0, 890, 558},
    {1, 558, 11},
    {2, 558, 146},
    {3, 558, 282},
    {3, 874, 657},
    {4, 657, 1591},
    {4, 657, 1591},
    {4, 657, 1591},
    {5, 150, 558},
    [...]
}

This way, there are no holes in the list of gates – a new transistor has either the next node number as its gate, or the same one again (i.e. a node is gate of several transistors). It is enough to store a single bit for the gate.

The example above already shows that the nodes 657 and 558 show up a lot: These are VCC and GND, respectively. In fact, 234 transistors connect to VCC, and 2493 connect to GND. We could Huffman-encode the c1 and c2 nodes, but in practice, all other nodes except these two are relatively uniformly distributed, so let’s just special case them.

There is also no ordering between the two nodes the transistor connects, and no transistor will ever connect VCC/GND to VCC/GND, so we only need to do the special VCC/GND encoding for c1, and leave c2 as it is. Let’s use a single “1″ bit to represent GND, “01″ as VCC, and all other 11 bit node numbers will be prefixed with “00″, making them occupy 13 bits.

This way, the 3510 values of c1 can be described in 2493 * 1 (the GNDs) + 234 * 2 (the VCCs) + 783 * (2 + 11) (the others) bits = 13140 bits. The c2 values occupy 3510 * 11 bits = 38610 bits. If you add the bit per transistor for the gate (i.e. 3510 bits), we’re at 55260 bits for the transistor description.

Using arithmetic coding, we can describe a node number in about 10.75 bits instead of 11. This way, we save (783 + 3510) * 0.25 bits = 1073 bits. The transistor description thus fits into 54187 bits.

As stated previously, we also need one bit per node for the pullup information. About half the nodes are pullup, and we don’t have the freedom of sorting and renaming the nodes any more (we’ve done that already for the transistor gates), so we’ll have to go with the 1725 raw bits.

So the final number is 54187 bits + 1725 bits = 55912 bits. That’s 6989 bytes, or roughly 56 kilobits, and about twice as big as H1N1.

With this number, we can estimate the complexity of a Commodore 64 to be about 200 kilobits (CPU, VIC, SID, 2xCIA, PLA) – plus another 160 kilobits of ROM.

But wait: Did you see the duplicate transistors in the sorted list? Yes, there are some duplicates, in fact, there are only 3288 unique transistors. And when looking at it on a higher level, the 130 entry instruction decode ROM has duplicates too: Only 115 are unique, leading to redundant paths in the node/transistor graph. The reason for this redundancy is routing: There are only two dimensions in a computer chip (plus layers), so you cannot just connect anything to anything.

The homework for today, dear reader, is to develop an algorithm that minimizes the description of the 6502 by removing duplicate transistors as well as equivalent paths.

CPUID on all CPUs (HOWNOTTO)

Tuesday, September 14th, 2010

A while ago, an engineer from a respectable company for low-level solutions (no names without necessity!) claimed that a certain company’s new 4-way SMP system had broken CPUs or at least broken firmware that didn’t set up some CPU features correctly: While on the older 2-way system, all CPUs returned the same features (using CPUID), on the 4-way system, two of the CPUs would return bogus data.

I asked for his test code. It ran in kernel mode and looked roughly like this:

    int cpu_features[4];

    for (int i = 0; i < 100000; i++) {
        cpu_features[get_cur_cpu_number()] = cpuid_get_features();
        usleep(100);
    }

    for (int j = 0; j < 4; j++)
        printf("CPU %d features: %x\n", j, cpu_features[j]);

Questions to the reader:

  1. What was the original idea, what is the algorithm?
  2. Why did this work on a 2-way system, but not on a 4-way system?
  3. Which two changes would at least make this code correct (albeit still horrible)?
  4. How would you do it correctly?
  5. Would you buy software from this company?

Playstation 3 Hacking – Linux Is Inevitable

Monday, September 6th, 2010

In the talk “Why Silicon Security is still that hard” by Felix Domke at the 24th Chaos Communication Congress in 2007 (in which he described how he hacked the Xbox 360, and bushing had a cameo at the end explaining how they hacked the Wii), I had a little part, in which I argued that “Linux Is Inevitable”: If you lock down a system, it will eventually get hacked. In the light of the recent events happening with PlayStation 3 hacking, let’s revisit them.

This is the original slide from 2007:

device

y

security

hacked

for

effect

PS2

1999

?

?

piracy

-

dbox2

2000

signed kernel

3 months

Linux

pay TV decoding

GameCube

2001

encrypted boot

12 months

Homebrew

piracy

Xbox

2001

encrypted/signed bootup, signed executables

4 months

Linux

Homebrew

piracy

iPod

2001

checksum

<12 months

Linux

-

DS

2004

signed/encrypted executables

6 months

Homebrew

piracy

PSP

2004

signed bootup/executables

2 months

Homebrew

piracy

Xbox 360

2005

encrypted/signed bootup,encrypted/signed executables, encrypted RAM, hypervisor, eFuses

12 months

Linux

Homebrew

leaked keys

PS3

2006

encrypted/signed bootup,encrypted/signed executables, hypervisor, eFuses, isolated SPU

not yet

-

-

Wii

2006

encrypted bootup

1 month

Linux

piracy

AppleTV

2007

signed bootloader

2 weeks

Linux

Front Row piracy

iPhone

2007

?

1 month

Homebrew

international

SIM-Lock revenue

The table shows the relationship between the quality of a device’s security system and the time it took to hack it, as well as the original motivation for hacking and the side effects (collateral damage) it caused.

Correlation security/time to hack

There is a pretty clear correlation betwen the quality of the security system and the time required for hacking it – with the notable exception being the GameCube, which had rather weak security, but since its release coincided with the much more powerful Xbox, much of the hacker community neglected the GameCube until the Xbox was done. What can also be seen is that recently, devices tend to get hacked more quickly; probably simply because there are more and more people interested in hacking.

Correlation Linux/time to hack

The other exception is the PlayStation 3, which was not hacked until about three and a half years after its introduction. I argued that this was because there was only very little motivation to hack it: Sony shipped the devices with the “Other OS” option and even sponsored a port of Linux to it, allowing any user to install Linux if they wanted. Although Linux was running on top of a hypervisor and did not have access to all of the features of the device, it seems to have been enough to take the enough motivation to hack it out of the hacker community.

Linux/homebrew is the primary motivation

This is supported by the by the fact that the motivation for hacking every system in the table was either homebrew (i.e. running unautorized hobbyist applications) or Linux. Hackers seem to love to convert their devices into Linux computers to run a big library of existing software, or to hack the device to make it possible to run versions of existing emulators and games on the native OS.

Piracy is a side effect

None of the hacks in the table was done with the motivation to allow running copied games – but whenever the point of the security system was to prevent piracy, hacking it inevitably enabled piracy as a side effect. Some security systems protected other things like pay TV keys and SIM-locks; these also fell as side effects.

2010 update

In September 2009, Sony started shipping the “slim” model of the PlayStation 3, with the “Other OS” feature removed. With firmware 3.21 in April 2010, the feature was also removed from existing original models that users chose to upgrade – which was required for using any of the online features. The missing “Other OS” feature on the slim model motivated George Hotz (geohot) to hack into hypervisor mode (Jan 2010), but this approach did not lead to a working hack of the security system. In August 2010, the Australian company OzMods announced the commercial “PSJailbreak” USB dongle that hacks into non-hypervisor mode, allowing piracy and homebrew (“Backup Manager” says “backups and homebrew”).

Although this is the first time that a commercial company is first to hack a system, and the first time that piracy seems to have been a key motivation, removal of “Other OS” might have been another motivation, and geohot’s previous attempts might have helped as an entry point for this hack.

Usually, an open hacker community develops a hack, and commercial companies convert them into modchips. This time, a company developed a hack and a modchip, and the community reverse engineered it and ported the exploit code onto several other devices, allowing people to hack the PlayStation 3 without a dedicated device. And I’m sure Linux will be adapted soon to run in the new environment.

Conclusion

What do we learn from this? Linux is inevitable. Or maybe it should be “Homebrew is inevitable”. In the history of mankind, there has yet to be a popular system that is locked down to only allow certain software to run, but does not get hacked to run arbitrary code. I still dare to say that if Sony had not removed “Other OS”, the PlayStation 3 would have been the first system to not get hacked. At all.

(Here is an updated 2010 version of the table:)

device

y

security

hacked

for

effect

PS2

1999

?

?

piracy

-

dbox2

2000

signed kernel

3 months

Linux

pay TV decoding

GameCube

2001

encrypted boot

12 months

Homebrew

piracy

Xbox

2001

encrypted/signed bootup, signed executables

4 months

Linux

Homebrew

piracy

iPod

2001

checksum

<12 months

Linux

-

DS

2004

signed/encrypted executables

6 months

Homebrew

piracy

PSP

2004

signed bootup/executables

2 months

Homebrew

piracy

Xbox 360

2005

encrypted/signed bootup,encrypted/signed executables, encrypted RAM, hypervisor, eFuses

12 months

Linux

Homebrew

leaked keys

PS3

2006

encrypted/signed bootup,encrypted/signed executables, hypervisor, eFuses, isolated SPU

4 years

Piracy

Homebrew

-

Wii

2006

encrypted bootup

1 month

Linux

piracy

AppleTV

2007

signed bootloader

2 weeks

Linux

Front Row piracy

iPhone

2007

signed/encrypted bootup/executables

11 days

Homebrew

SIM-Lock

piracy

iPad

2010

signed/encrypted bootup/executables

1 day

Homebrew

piracy