Using the OS X 10.10 Hypervisor Framework: A Simple DOS Emulator

Since Version 10.10 (Yosemite), OS X contains Hypervisor.framework, which provides a thin user mode abstraction of the Intel VT features. It enables apps to use virtualization without the need of a kernel extension (KEXT) – which makes them compatible with the OS X App Store guidelines.

The idea is that the OS takes care of memory management (including nested paging) as well as scheduling virtual CPUs like normal threads. All we have to do is create a virtual CPU (or more!), set up all its state, assign it some memory, and run it… and then handle all “VM exits” – Intel lingo for hypervisor traps.

There is no real documentation, but the headers contain a decent amount of information. Here are some declarations from Hypervisor/hv.h:


/*!
 * @function   hv_vm_create
 * @abstract   Creates a VM instance for the current task
 * @param      flags  RESERVED
 * @result     0 on success or error code
 */
extern hv_return_t hv_vm_create(hv_vm_options_t flags) __HV_10_10;

/*!
 * @function   hv_vm_map
 * @abstract   Maps a region in the virtual address space of the current task
 *             into the guest physical address space of the VM
 * @param      uva    Page aligned virtual address in the current task
 * @param      gpa    Page aligned address in the guest physical address space
 * @param      size   Size in bytes of the region to be mapped
 * @param      flags  READ, WRITE and EXECUTE permissions of the region
 * @result     0 on success or error code
 */
extern hv_return_t hv_vm_map(hv_uvaddr_t uva, hv_gpaddr_t gpa, size_t size,
	hv_memory_flags_t flags) __HV_10_10;

/*!
 * @function   hv_vcpu_create
 * @abstract   Creates a vCPU instance for the current thread
 * @param      vcpu   Pointer to the vCPU ID (written on success)
 * @param      flags  RESERVED
 * @result     0 on success or error code
 */
extern hv_return_t hv_vcpu_create(hv_vcpuid_t *vcpu,
	hv_vcpu_options_t flags) __HV_10_10;

/*!
 * @function   hv_vcpu_run
 * @abstract   Executes a vCPU
 * @param      vcpu  vCPU ID
 * @result     0 on success or error code
 * @discussion
 *             Call blocks until the next VMEXIT of the vCPU
 *
 *             Must be called by the owning thread
 */
extern hv_return_t hv_vcpu_run(hv_vcpuid_t vcpu) __HV_10_10;

So let’s create a virtual machine that runs simple DOS applications in 16 bit real mode, and trap all “int” DOS system calls – similar to DOSBox.

First, we need to create a VM:

hv_vm_create(HV_VM_DEFAULT);

This creates a VM for the current Mach task (i.e. UNIX process). The VM is implicit: there is one per task, so the call returns only a status code and no handle. Then we allocate some memory and assign it to the VM:

#define VM_MEM_SIZE (1 * 1024 * 1024)
void *vm_mem = valloc(VM_MEM_SIZE);
hv_vm_map(vm_mem, 0, VM_MEM_SIZE, HV_MEMORY_READ |
                                  HV_MEMORY_WRITE |
                                  HV_MEMORY_EXEC);

And we need to create a virtual CPU:

hv_vcpuid_t vcpu;
hv_vcpu_create(&vcpu, HV_VCPU_DEFAULT);

Now comes the annoying part: Set up the CPU state. If the state is illegal or inconsistent, the CPU will refuse to run. You will need to refer to the Intel Manual 3C for all the context. Luckily, most virtual machines start from 16 bit real mode, and mode changes will be done by the boot loader or operating system inside the VM, so you won’t have to worry about setting up any other state than real mode state. Real mode state setup looks something like this:

hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_CS_SELECTOR, 0);
hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_CS_LIMIT, 0xffff);
hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_CS_ACCESS_RIGHTS, 0x9b);
hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_CS_BASE, 0);

hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_DS_SELECTOR, 0);
hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_DS_LIMIT, 0xffff);
hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_DS_ACCESS_RIGHTS, 0x93);
hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_DS_BASE, 0);


hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_CR0, 0x20);
hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_CR3, 0x0);
hv_vmx_vcpu_write_vmcs(vcpu, VMCS_GUEST_CR4, 0x2000);

After that, we should populate RAM with the code we want to execute:

FILE *f = fopen(argv[1], "rb");
if (!f) { perror(argv[1]); return 1; }  /* bail out if the .COM file is missing */
fread((char *)vm_mem + 0x100, 1, 64 * 1024, f);
fclose(f);

…and assign the GPRs the proper initial state – including the instruction pointer, which will point to the code:

hv_vcpu_write_register(vcpu, HV_X86_RIP, 0x100);
hv_vcpu_write_register(vcpu, HV_X86_RFLAGS, 0x2);
hv_vcpu_write_register(vcpu, HV_X86_RSP, 0x0);
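
Why do the load address vm_mem + 0x100 and the initial RIP of 0x100 line up? In real mode the CPU forms a linear address from the segment base plus the offset, and since the CS base is 0 and guest physical memory is mapped starting at 0, the linear address is also the offset into vm_mem. Here is a minimal sketch of that arithmetic. The helper name real_mode_linear is made up for illustration and is not part of Hypervisor.framework:

```c
#include <stdint.h>

/* Hypothetical helper, not a Hypervisor.framework call:
 * real mode linear address = segment * 16 + offset. */
static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

With segment 0 and offset 0x100 this yields 0x100, which is exactly where the fread() above placed the first byte of the program.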

The virtual CPU is now fully set up, so we can run it:

hv_vcpu_run(vcpu);

This call runs the virtual CPU (while blocking the calling thread) until its time slice expires or a “VM exit” happens. A VM exit is a hypervisor-class exception, i.e. an event in the VM that the hypervisor wants to trap. We can trap events like exceptions, certain privileged instructions (CPUID, HLT, RDTSC, RDMSR, …) and control register (CR0, CR2, CR3, CR4, …) accesses.

After hv_vcpu_run() returns, we need to read the exit reason and act upon it, and run the virtual CPU again. Here is a minimal loop to handle VM exits:

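for (;;) {
	hv_vcpu_run(vcpu);

	/* the VMCS read hands the value back through a pointer */
	uint64_t exit_reason;
	hv_vmx_vcpu_read_vmcs(vcpu, VMCS_EXIT_REASON, &exit_reason);

	switch (exit_reason) {
		case EXIT_REASON_EXT_INTR:
		case EXIT_REASON_EPT_FAULT:
			break; /* nothing to do, just run again */
		case EXIT_REASON_EXCEPTION:
			break; /* emulate the DOS call here */
	}
}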

EXIT_REASON_EXT_INTR is caused by host interrupts (usually it means that the time slice is up), so we will just ignore it. EXIT_REASON_EPT_FAULT happens every time the guest accesses a page for the first time, or when the guest accesses an unmapped page – this way we can emulate MMIO. In our case, we can also ignore those.

For emulating DOS, we are catching EXIT_REASON_EXCEPTION, which is caused by the int instruction (if caught). We can get the number of the interrupt from the virtual CPU state without decoding instructions:

uint64_t info; /* the VMCS read hands the value back through a pointer */
hv_vmx_vcpu_read_vmcs(vcpu, VMCS_IDT_VECTORING_INFO, &info);
uint8_t interrupt_number = info & 0xFF;

…and emulate the system call. We can read and write GPRs using the hv_vcpu_read_register() and hv_vcpu_write_register() calls.
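
To make the "emulate the system call" step a bit more concrete, here is a small sketch that is deliberately independent of Hypervisor.framework. The struct dos_regs, the enum and both function names are invented for this example. A real handler would fill the struct with hv_vcpu_read_register() and write changes back with hv_vcpu_write_register(). Only two well-known INT 21h functions (AH=02h, write character, and AH=4Ch, terminate) plus INT 20h are handled, and the output goes into a buffer rather than to the terminal:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical snapshot of the guest registers the handler needs. */
struct dos_regs {
    uint16_t ax, bx, cx, dx;
};

enum dos_action { DOS_CONTINUE, DOS_EXIT, DOS_UNHANDLED };

/* Bits 7:0 of the IDT-vectoring information field hold the vector. */
static uint8_t vector_from_info(uint64_t info)
{
    return (uint8_t)(info & 0xFF);
}

/* Emulate a tiny subset of the DOS interrupts. Characters are appended
 * to out (capacity cap, current length *len) so the sketch has no
 * side effects on the host. */
static enum dos_action emulate_int(uint8_t vector, const struct dos_regs *r,
                                   char *out, size_t cap, size_t *len)
{
    if (vector == 0x20)                 /* INT 20h: terminate program */
        return DOS_EXIT;
    if (vector != 0x21)                 /* anything else: not ours */
        return DOS_UNHANDLED;

    switch (r->ax >> 8) {               /* function number lives in AH */
    case 0x02:                          /* write the character in DL */
        if (*len < cap)
            out[(*len)++] = (char)(r->dx & 0xFF);
        return DOS_CONTINUE;
    case 0x4C:                          /* terminate, return code in AL */
        return DOS_EXIT;
    default:
        return DOS_UNHANDLED;
    }
}
```

The VM exit loop would call emulate_int() in its EXIT_REASON_EXCEPTION case and leave the loop on DOS_EXIT.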

hvdos – a simple DOS Emulator for OS X

The full source of hvdos, a simple DOS emulator using the OS X Hypervisor framework, is available at

It contains an adapted version of the libcpu DOS system call library and manages to run (parts of) some .COM files. A good demo is the ZIP decompression tool.

Creating your own Hypervisor

hvdos can serve as a template for your own Hypervisor.framework experiments. It contains wrapper functions for error handling, a header that defines all Intel VT constants (taken from FreeBSD), complete 16 bit real mode initialization, as well as a few helper functions to set up the fields VMCS_PIN_BASED_CTLS, VMCS_PRI_PROC_BASED_CTLS, VMCS_SEC_PROC_BASED_CTLS and VMCS_ENTRY_CTLS properly. These are needed to define, among other things, which events cause VM exits.

You can easily add more CPUs by creating one POSIX thread per virtual CPU. For every thread, you create a virtual CPU and run a VM exit main loop.

You can for example start writing an IBM PC emulator by running Bochs BIOS and trapping I/O accesses, or running MS-DOS without BIOS by trapping BIOS int calls.

Or you could bridge an existing open source solution (QEMU, QEMU+KVM, VirtualBox, DOSBox, …) to use Hypervisor.framework…

Fully Commented Commodore 64 BASIC ROM Disassembly – based on Applesoft!

In our series about C64 ROM commentaries (English version by Lee Davison, German version by Data Becker), I’m now presenting a most unusual C64 ROM commentary – based on a commented disassembly of the Apple II ROM.

“S-C DocuMentor for Applesoft” is a commented disassembly of the BASIC ROM of the Apple II computer. Like Commodore BASIC, “Applesoft” BASIC is based on Microsoft BASIC for 6502, but on an older revision. Since the two BASIC interpreters are almost the same instruction for instruction (modulo some command extensions on both sides), the commentary translated over very nicely.

The cross-referenced HTML version of the “S-C C64 BASIC Disassembly” is available here at

The raw txt files of all commentaries are maintained at Fixes and additions happily accepted!

Fully Commented Commodore 64 ROM Disassembly (English)

After last week’s German C64 ROM disassembly from the “64 intern” book, I have now also converted Lee Davison’s commented disassembly into the same format.

The cross-referenced HTML version is available here at

The raw txt files of both the German and the English commented disassemblies are maintained at The two files seem to have been independently developed, which gives us the opportunity to compare, find mistakes, and merge missing information.

I will happily accept additions and corrections to either file – let’s create the one true source of C64 ROM information!

Fully Commented Commodore 64 ROM Disassembly (German)

Whenever I need to look up some code in the ROM of the Commodore 64, I have the choice of the commented disassembly by Marko Mäkelä, the one by Ninja/The Dreams, or the one by Lee Davison – or I can just use my paper copy of “Das neue Commodore-64-intern-Buch“, an excellent line-by-line commentary in German.

That’s why I scanned, OCRed, cleaned up and cross-referenced it.

The raw txt file is maintained at Corrections, additions and translations welcome.

The cross-referenced HTML version is available here at

Wikileaks Movie “The Fifth Estate” pirated my “Xbox Hacking” Slides

Xbox hacking has made it to the silver screen, and Felix Domke and I (Michael Steil) are movie stars! …and so are at least 14 of my presentation slides!

This is a picture from the Julian Assange and Wikileaks movie The Fifth Estate (2013), starring Benedict Cumberbatch and Daniel Brühl, directed by Bill Condon:

“Linux is Inevitable”? Sounds like something I would say. In fact, looks like a slide from my presentation at the 24th Chaos Communication Congress in Berlin in December 2007:

Coincidence? Let’s get some context. The scene in the movie is indeed set at the 24th Chaos Communication Congress (24C3), where Julian Assange (Cumberbatch) presents his vision about Wikileaks in the break between two talks. From the movie’s screenplay (ironically leaked by Wikileaks):

           I'm afraid the small conference
           rooms are all booked.

                 (off the schedule)
           What about the auditorium? It's
           empty 'til the X-Box Security talk.

At the 24C3, Felix Domke and I indeed presented Why Silicon-Based Security is still that hard: Deconstructing Xbox 360 Security on day 2 at 16:00 in the main auditorium. (In reality, Julian Assange did not present in the main auditorium, but in a workshop area.)

 To one side of the stage, THE NEXT SPEAKER sets up a few
 deconstructed X-BOX 360s beside a CORKBOARD covered with
 exhibits for his talk: 'Deconstructing Xbox 360 Security.'

To be clear: These are pictures from the movie, not actual pictures from the conference. They really deconstructed an Xbox 360 for the movie!

 Daniel, also on stage, watches as Julian pulls out a WAD of
 TWINE, moves to the corkboard.

 The X-Box guy looks up, CONCERNED, as Julian wraps the twine
 around a PUSH PIN holding up part of the X-Box exhibit,
 STRINGING THE TWINE to another pin holding up another part.

 Julian POINTS to the two pins and the twine. ILLUSTRATING.

           Two people and a secret. The
           beginning of any conspiracy, of all
           corruption. As it grows...

 Daniel watches, RIVETED, as Julian STRETCHES the twine to
 another pin. And another. And another...

           More people... and more secrets.

 ensnaring more of the exhibit in his web. It's MESMERIZING.

           But. If we can find one moral man,
           one whistleblower...

 Julian focuses on a PIN at the CENTER of his web of twine.

           Someone willing to expose these
           secrets --


           That man... can topple the most
           repressive of regimes.


                       X BOX GUY 
           Was zur hoelle!

           And there's the problem.

                       X BOX GUY 
           Otto, my talk is in ten minutes.

 The X-Box guy, PISSED, looks to Otto who's with a CUTE

In the movie, the “Xbox Guy” actually shouts “What the hell!” with a German accent.

I assume the person with the voltmeter is supposed to be Felix, and the angry German with the long hair (called “Game Console Hacker” in the credits, played by Christoph Franken) is supposed to be me.

On the corkboard in the movie, there are about two dozen printed slides.

This is a reconstructed version created from many individual frames of the scene:

It looks like the “Xbox Guy” is named “Denis Schnegg”, and the name of the talk is “Hacking Game Consoles”.

Most slides I could decipher are direct copies from slides from either our 24C3 talk, or our Google Tech Talk in 2008. Here are the screen captures of the slides in the movie, and the corresponding slides from our talks:

1 24c3, slide 6
2 Google, slide 6
3 24c3, slide 8
4 24c3, slide 7
5 Google, slide 24
6 Google, slide 26
7 Google, slide 28
8 Google, slide 36
9 Google, slide 8
10 Google, slide 9
11 Google, slide 15
12 Google, slide 11
13 Google, slide 14
25 24c3, slide 2

For reference, here are the full presentations the scene in the movie is based on:

Why Silicon-Based Security is still that hard: Deconstructing Xbox 360 Security (24C3)

The Xbox 360 Security System and its Weaknesses (Google TechTalk)

And since the producers of the movie consider it fair use to copy 14 of my slides without giving me credit, it must also be fair use to quote the scene of the movie here:

Rhapsody Developer’s Guide [PDF, 1997]

Feiler, Jesse.
Rhapsody Developer’s Guide.
Boston: AP Professional, 1997.
ISBN 0-12-251334-7

(528 pages, 13.3 MB PDF)

Rhapsody Developer’s Guide provides a road map to Rhapsody technology and the ways it can be used. Based on a modern microkernel, Rhapsody runs on PowerPC and Intel processors, and supports traditional Mac OS applications (in the Blue Box) as well as modern applications in the Yellow Box. Totally object-oriented, the Yellow Box platform offers an unparalleled development environment that permits rapid implementation of functionality ranging from traditional personal computer applications to media-rich, Internet-enabled, and database-driven applications for the next century.

This book describes the architecture of Rhapsody, including its cross-platform implementation on PowerPC and Intel. It details the Yellow Box platform (based on OpenStep) and provides a complete description of the core API, as well as a description of the architecture that will be enriched in the future with additional functionality from Apple. The languages of Rhapsody are discussed, and the API is presented in a language-neutral way that will be convenient for C++ developers, classic and modern Objective-C users, and Java programmers. Throughout, there is an emphasis on how Rhapsody relates to existing investments in code and programming expertise. Screen shots and code samples from products shipping today using Rhapsody technology provide opportunities and challenges to new Rhapsody developers.

About the Author

Jesse Feiler is software director of the Philmont Software Mill. He is also the author of Cyberdog and Real World Apple Guide. He has served as a consultant, author, and speaker for many prestigious businesses, including the Federal Reserve Bank of New York, Prodigy, Kodak, Young & Rubicam, and Apple Computer, Inc.

Why is my TI-99/4A in Black and White?

by James Abbatiello

My first computer was a Texas Instruments TI-99/4A. Longtime readers may remember a previous article where we implemented TI-99/4A BASIC as a Scripting Language for modern computers. Recently I got nostalgic for the actual hardware so I got my 99 out of the closet where it had been for a decade or more. I hooked it up to the TV and turned it on. I was expecting to see the TI-99/4A title screen.

Instead I was greeted by this:

Well, that’s not right! It is in black and white. And what’s with all these vertical black lines? Clearly something’s wrong but what could it be?

New Meets Old

At first I suspected that it was my TV, which is a fairly new LCD. Old computers or game consoles sometimes played a bit fast and loose with the NTSC standard and it seemed unlikely that a new TV would ever have been tested with something as old as a TI-99/4A. Perhaps the TV just couldn’t interpret output from the 99. So I tried with a CRT TV:

Well the black bars are gone (or at least not as apparent) but it is still in black and white. Something must be wrong with the computer itself.

All About Video Signals

The output from the back of the computer is a composite video signal but using a 5-pin DIN connector (that also carries audio and power) instead of the usual RCA jack. Back when this computer was new that signal would usually go to an RF modulator which was then connected to a TV via a 300-ohm connector. Nowadays you can still do the same thing but since most TVs don’t have screw terminals on the back anymore it can be more convenient to take the composite video signal and hook it directly into the composite input on the TV. All that is required is a simple adapter cable that you can make yourself or purchase online.

I thought that perhaps the video circuitry was generating separate Luminance (Y) and Chrominance (C) signals and then combining them into the final composite output. If this were the case then it would suggest something was wrong in the C amplifier or the final combining stage. It turns out that this is not the case. The video chip in the TI-99/4A is referred to as the Video Display Processor (VDP) and is a TMS9918A, TMS9928A or TMS9929A depending on the region the computer was originally intended for and the television standard in use there (e.g. NTSC or PAL). My computer was made for the US market and outputs NTSC signals using the TMS9918A. This chip has a single video output pin that supplies composite video directly with the Y and C already mixed. So if something was wrong with just the C generation circuitry then it was something broken inside the VDP and my only recourse would be to try to find a replacement chip.

Mad Scientist Equipment

The VDP still seemed to work correctly in all other respects so I was hopeful that the true problem lay elsewhere. I thought I’d take a look at the signal on an oscilloscope. We’d expect to see the NTSC colorburst and if it was missing that would explain why no color was showing up on the TVs. Here’s what it looked like:

And here’s a closeup of the interesting portion:

I’m no expert but that looks like a horizontal sync pulse followed by a colorburst to me. But there was still no color on the TV.

The composite video signal that the VDP generates is sent to a simple 2-transistor amplifier and then to the output jack. I didn’t think it was likely but perhaps something in the amplifier had given out and Y was still strong enough to get picked up by the TV but C wasn’t. To test this I took the computer apart and tapped the signal right as it came out of the chip and before it went through the amplifier. It was still black-and-white. This suggested that the problem was not in the amplifier.

The Healing Power of Crystals

At this point I knew that the VDP was mostly working correctly. It generated the right pattern on the screen so it must be able to communicate with both the video RAM and the CPU. That accounts for most of the pins on the VDP, the ones handling digital signals. The remaining pins are mostly for power and the connection to the quartz crystal that provides the timing. I checked the power and that seemed fine. So let’s take a closer look at the crystal:

The crystal is the gray-colored component in the middle. To the right is the VDP, covered in thermal paste. Just behind the crystal is a variable inductor.

A variable inductor: now that’s interesting! It is connected to the crystal and apparently used for fine tuning the frequency. Could the fix be as simple as turning an adjustment screw?

Alas, no. I turned it as far as it would go in both directions with no improvement to the video output. If the frequency was off it was beyond the ability of this adjustment to correct. I don’t have any equipment to allow precision measuring of the actual frequency this crystal was producing, but I do have the internet. A little Googling brought me to this post on the TI-99/4A mailing list. Yes, there’s still an active mailing list for a computer that hasn’t been manufactured in almost 30 years!

That post describes the symptoms that I was experiencing and indicated that the solution was to replace the crystal. This was somewhat surprising to me. I’d heard of electrolytic capacitors going bad in old equipment but a quartz crystal? They’re usually quite reliable. But you can’t argue with real-world experience.

The VDP takes the frequency of this crystal (10.738635 MHz) and divides it by 3 to produce the NTSC colorburst frequency (3.579545 MHz). If the frequency of the crystal was off then the generated colorburst would also be off and the TV wouldn’t be able to sync to it. Without seeing a valid colorburst the TV isn’t going to produce any color. That would certainly explain our symptoms!
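
To put numbers on this, here is a small sketch of the arithmetic. The two function names are made up for the example, and the "drifted" crystal value in the usage note is purely hypothetical: I never measured the actual frequency of the old crystal, and I don't know how far off a TV tolerates the colorburst to be.

```c
/* NTSC color subcarrier as derived by the VDP: crystal frequency / 3. */
static double colorburst_hz(double crystal_hz)
{
    return crystal_hz / 3.0;
}

/* Deviation of a measured frequency from the nominal one,
 * in parts per million. */
static double ppm_error(double measured_hz, double nominal_hz)
{
    return (measured_hz - nominal_hz) / nominal_hz * 1e6;
}
```

colorburst_hz(10738635.0) gives the nominal 3579545 Hz. Because the division is linear, a crystal that is off by some number of ppm produces a colorburst that is off by the same number of ppm. For example, a crystal running 107.38635 Hz fast would put the colorburst about 10 ppm high.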

So after deciding to replace this crystal we have to actually find a replacement part. We want a crystal that runs at exactly 10.738635 MHz. We also need it rated for the proper “load capacitance”. Running a crystal with the wrong capacitance will shift the frequency from the rated frequency. That would be bad since our entire goal is to get the frequency back to the ideal. The original crystal was rated for a load capacitance of 32pF (you can just make out the 32 in the above picture although it is partially obscured by the blue wire). So we want a replacement crystal that’s also rated for 32pF.

Let’s go internet shopping for 10.738635 MHz crystals. Jameco doesn’t carry any. Digikey has some but didn’t have any in stock with a 32pF load capacitance. Luckily Mouser came through for me! A few days later and I had a replacement crystal:

And after a little surgery on the motherboard:

Now for the moment of truth:

Success! Now to play some Parsec!

Bonus Oscilloscope Image

If you’re wondering what the colorburst looks like with the new crystal then wonder no longer.

Looks pretty similar to my eyes but apparently it makes a world of difference to a TV.

Clockslide: How to waste an exact number of clock cycles on the 6502

by Sven Oliver ‘SvOlli’ Moll; the original German language version has been simultaneously posted on his blog.

This article is about the 6502 processor and how to “waste” a number of clock cycles given in a register, in this case the X register. The principle is simple: you have a run of operations that do close to nothing. The closer to the “front” you jump into this code, the more clock cycles it takes to reach the actual code. The closer to the “end” you jump in, the sooner the CPU gets to the code in question.

This nice theory won’t work directly on the 6502, because every instruction takes at least two clock cycles to execute. Getting down to a precision of one cycle is more difficult. I found the first half of this trick in code by Eckhard Stollberg, one of the people who pioneered homebrew on the Atari 2600 VCS. There, I found some strange bytes:

C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C5 EA

The disassembly looks like this:

CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP $EA  ; 3

Running through this code takes 15 clock cycles, and nothing changes except for some flags in the status register. If the code is entered with an offset of one byte, this code is executed instead:

CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C5 ; 2
NOP      ; 2

This makes 14 clock cycles, and again only the status register is changed. If the code is entered with an offset of two bytes, execution starts at the second instruction of the first listing. Add another byte and you get to the second instruction of the second listing, and so on. This way it is possible to specify the exact number of clock cycles to be “wasted”. With one exception: to be more specific, it is 2 + X clock cycles that are wasted, so there is no way to waste exactly one clock cycle.
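
The claim can be checked with a few lines of C. The sketch below is not an emulator: it only knows the three opcodes that occur in the slide (CMP immediate, CMP zero page, NOP) with their documented lengths and cycle counts. The names slide and slide_cycles are made up for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* The 14 bytes of the clock slide quoted above. */
static const uint8_t slide[] = {
    0xC9, 0xC9, 0xC9, 0xC9, 0xC9, 0xC9,
    0xC9, 0xC9, 0xC9, 0xC9, 0xC9, 0xC9,
    0xC5, 0xEA
};

/* Count the cycles spent from slide[offset] until execution falls off
 * the end of the slide. Returns -1 for an offset past the end or for
 * an opcode this sketch does not know. */
static int slide_cycles(size_t offset)
{
    int cycles = 0;
    size_t pc = offset;

    if (offset > sizeof slide)
        return -1;
    while (pc < sizeof slide) {
        switch (slide[pc]) {
        case 0xC9: pc += 2; cycles += 2; break; /* CMP #imm, 2 bytes */
        case 0xC5: pc += 2; cycles += 3; break; /* CMP zp,   2 bytes */
        case 0xEA: pc += 1; cycles += 2; break; /* NOP,      1 byte  */
        default:   return -1;
        }
    }
    return cycles;
}
```

Every entry offset from 0 to 13 ends exactly at the first byte after the slide, and each additional byte of offset saves exactly one cycle, from 15 cycles at offset 0 down to 2 cycles at offset 13.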

Now we need a way to specify the “entry” of our “slide”. On a C=64 this would be done using self-modifying code. The operand of a JMP $XXXX instruction will be replaced with the calculated address. This is not possible on systems like the Atari 2600, since the code is run in ROM. One option for example would be to use JMP ($0080) after writing the entry point to $0080 and $0081.

My approach differs a bit from the usual way. RAM is scarce on the Atari, and I don’t want to permanently “waste” two of the 128 bytes available when there is another way. When the CPU executes a JSR $XXXX (jump to subroutine) instruction, it writes the current address to the stack. To be more specific, it is the address of the JSR instruction + 2, which is the return address – 1. And this is what I do: I write my entry point – 1 to the stack and use the RTS (return from subroutine) instruction to jump into the clock slide. So I’m still using two bytes of RAM, but only for a short time, and without having to work out which two bytes are free at this point.

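; the X register specifies how many of the
; 15 clock cycles possible should be skipped
  LDA #>(clockslide-1) ; high byte of entry point - 1
  PHA
  TXA
  CLC
  ADC #<(clockslide-1) ; low byte plus number of bytes to skip
  PHA
  STA WSYNC ; <= this syncs to start of next scanline
  RTS       ; pulls the address, adds 1 and jumps into the slide
clockslide:
  CMP #$C9
  CMP #$C9
  CMP #$C9
  CMP #$C9
  CMP #$C9
  CMP #$C9
  CMP $EA
realcode:
; and here the real code continues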

This approach still has one problem: between “clockslide” and “realcode”, no page crossing may occur. If this were the case, I’d have to increase the high byte on the stack by one. But since the position of the code segments is under my control, I left this out as an exercise for the reader. ;-)
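
For readers who want to check the address arithmetic before doing the exercise, here is a sketch in C. The function names and the example addresses are hypothetical. rts_bytes() does the full 16 bit calculation, and needs_high_byte_fixup() tells you whether an 8 bit addition of the low byte alone would have produced a carry that has to go into the high byte.

```c
#include <stdint.h>

/* The two bytes to push before RTS: high byte first, then low byte. */
struct rts_target {
    uint8_t hi, lo;
};

/* RTS pulls an address from the stack and adds 1, so the value to
 * push is (entry point - 1), where entry point = slide_base + skip. */
static struct rts_target rts_bytes(uint16_t slide_base, uint8_t skip)
{
    uint16_t entry_minus_1 = (uint16_t)(slide_base + skip - 1);
    struct rts_target t;

    t.hi = (uint8_t)(entry_minus_1 >> 8);
    t.lo = (uint8_t)(entry_minus_1 & 0xFF);
    return t;
}

/* Nonzero if adding skip to the low byte of (slide_base - 1) carries
 * into the high byte, i.e. if the page crossing case applies. */
static int needs_high_byte_fixup(uint16_t slide_base, uint8_t skip)
{
    uint16_t base_minus_1 = (uint16_t)(slide_base - 1);

    return ((base_minus_1 & 0xFF) + skip) > 0xFF;
}
```

With a slide at $F010 and 3 bytes skipped, the bytes to push are $F0 and $12 and no fixup is needed. With a slide at $F0FA and 13 bytes skipped, the low byte wraps around to $06 and the high byte has to become $F1.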

Assembly Evolution Part 1: Accessing Memory and the strange case of the Intel 4004

by Julien Oster, reprinted with permission.

While it has become far less relevant for non-system developers to write assembly than it was a few decades ago, by now CPUs have nevertheless made it much more comfortable to do so. Today we are used to a lot of things: fancy indirect addressing modes with scale, a galore of general purpose registers instead of an accumulator and maybe one or two crippled index registers, condition codes for nearly every instruction (on ARM)…

But also the basics themselves have evolved. Let’s take a look at what past programmers had to put up with in entirely simple, everyday things. We’ll start with the most trivial: writing to memory.

Our goal is to write a single immediate value of 3 into the memory location 5. In light of paging, segmenting and bank switching, we’ll use whatever is convenient as a definition for “memory location”. Also, we’ll let the CPU decide what the word size should be. Since you only need 2 bits to represent a 3, it should fit with every CPU’s word size (except for 1 bit CPUs, which actually existed, but that’s a story for another posting). If we have the choice, we’ll just take the smallest.

We’ll work backwards, from the present to the past, to explore the wonders of direct addressing in Intel CPUs. (One precautionary warning though: I only really tested the 4004 code in an emulator, and my habits are highly tainted by current Intel CPUs. So if I made some mistake somewhere, kindly point it out and I’ll fix it!)


On a modern x86 CPU, it is of course fairly easy to write the value 3 to memory cell 5. You just do it:

mov byte [5], 3

A single instruction, simple and obvious. I cheated a bit by not using a segment prefix, nor did I set up any segment registers/selectors beforehand. But assuming a nowadays common OS environment in protected mode, you probably don’t want to fiddle with those selectors anyway.


The Intel 8085 is somewhat of a direct predecessor to the 8086, the first in the line of the excessively successful x86 processors. While the 8086 has a 16 bit data bus, the 8085 only has 8 bit. The address bus is already full 16 bit, but its 16 bit capabilities are limited. Specifically, there is no immediate 16 bit addressing (except for branches), leaving us no way to specify our memory location in the instruction that actually performs the move.

Memory is instead addressed with a pseudo register called M. This pseudo register is in reality just backed by the registers H and L paired together, each 8 bit wide, and accessing it accesses the memory location they point at (you may take a guess which register receives the High byte, and which the Low byte of the address).

Luckily, there are a few simple 16bit instructions for moving immediate values, so all in all we can write our byte with:

LXI H, 0005h ; unlucky syntax, as this actually means HL instead of just H

MVI M, 03h   ; MOV only copies between registers and M, immediates need MVI

By the way, bonus points if you are somehow able to find out just when the address in HL is available on the address bus. The same applies to the 8080 and 8008. Does the CPU copy the register pair’s content to the address bus pins only when actual memory operations take place, or are the address bus pins somehow directly connected to H and L itself? Is that even feasible? I’d really like to find out…


We continue going further back, skipping the 8080 because it was identical in that regard, and arrive at its direct predecessor instead, the Intel 8008. The 8080 and 8085 were source compatible with the 8008 (which, mind you, is not the same as binary compatible… also it may or may not have required some light automated translation), but in the downward direction we have something vital taken from us: While the 8008 already used 16bit addresses (with only a 14bit address bus, resulting in 16k memory, though), the only instructions allowed to contain 16bit immediate values at all were jumps and branches. Consequently, we are left with no way to completely specify our destination address in one instruction!

Instead, we have to access H and L, together forming pseudo register M’s address, one at a time:

LHI 00h

LLI 05h

LMI 03h ; store the immediate 3 at the location H and L point at



It’s hardly possible to go back further than the Intel 4004, at least if you are only considering single chip CPUs (at the time of its conception in the early 70s, there were already famous multi-chip CPUs with comfortable orthogonal instruction sets, notably the PDPs). Indeed, it was the first widely available single chip CPU. This little thing was a 4-bit CPU with some strange quirks, which we will explore further. Overall, it bears little to no resemblance to its successor in name, the Intel 8008 (except for the internal stack, which both had–I will cover that in another posting).

But let’s just look at the code for writing a value of 3 into the memory location at 5 first:

FIM P0, 5; load address 05h into pair R0,R1

SRC P0   ; set address bus to contents of R0,R1

LDM 3    ; load 3 into accumulator

WRM      ; write accumulator content to memory

That looks a bit strange.

As a 4 bit CPU, the 4004 has 4 bit wide registers and addresses 4 bit nibbles as words in memory. It has only one accumulator on which the majority of operations is performed, but sixteen index registers (R0-R15).

Those index registers are handy for accessing memory: Besides loading values directly from ROM, an instruction exists to load data indirectly, which sets the address bus to the ROM cell’s content. Another instruction performs an indirect jump instead. Other than that, you can just increment index registers, albeit there is the interesting “ISZ” instruction that not only increments, but also branches if the result is not 0.

Because the 4004 uses 8 bits to address the 4 bit nibbles, every two consecutive index registers form a pair, which is then used for memory references.

Note that I explicitly said ROM above. This is because in the 4004 architecture, ROM and RAM are actually vastly different beasts, at least from the assembly programmer’s perspective. You can not directly access RAM. It always involves index register pairs, manually sending their content to the address bus (with a strangely named instruction “SRC”, which for some reason spells out send register control) and then issuing another instruction which transfers from or to the accumulator.
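
To see how the four instructions from the listing play together, here is a toy model in C. It is only a sketch under simplifying assumptions: data RAM is modelled as one flat array of 256 nibbles, bank selection, status characters and I/O ports are ignored, and the struct and function names are made up (they merely mirror the mnemonics).

```c
#include <stdint.h>

/* Toy model of the parts of the 4004 that the example touches. */
struct toy4004 {
    uint8_t r[16];     /* index registers R0-R15, 4 bits each */
    uint8_t acc;       /* accumulator, 4 bits */
    uint8_t addr;      /* 8 bit address last sent out by SRC */
    uint8_t ram[256];  /* data RAM, flattened to 256 nibbles */
};

/* FIM: load an 8 bit immediate into a register pair,
 * high nibble into the even register, low nibble into the odd one. */
static void fim(struct toy4004 *c, int pair, uint8_t imm8)
{
    c->r[2 * pair]     = imm8 >> 4;
    c->r[2 * pair + 1] = imm8 & 0x0F;
}

/* SRC: send the contents of a register pair out as the address. */
static void src(struct toy4004 *c, int pair)
{
    c->addr = (uint8_t)((c->r[2 * pair] << 4) | c->r[2 * pair + 1]);
}

/* LDM: load a 4 bit immediate into the accumulator. */
static void ldm(struct toy4004 *c, uint8_t imm4)
{
    c->acc = imm4 & 0x0F;
}

/* WRM: write the accumulator to the previously addressed RAM nibble. */
static void wrm(struct toy4004 *c)
{
    c->ram[c->addr] = c->acc;
}
```

Calling fim(&cpu, 0, 5), src(&cpu, 0), ldm(&cpu, 3) and wrm(&cpu) in this order on a zeroed struct leaves R0 = 0, R1 = 5 and the value 3 in nibble 5 of the modelled RAM.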

Interestingly, accessing regular RAM nibbles is not your only choice among the transfer instructions. You can also read from and write to I/O ports. But the CPU does not have any direct I/O port; instead, they are available on both the RAM and ROM chips! You can also read and write “RAM status characters”, which to me look like plain regular RAM cells within another namespace. If someone knows, I’d love to hear what they were used for (and whether they behaved differently from normal RAM).

Take a look at the data sheet. Within its mere 9 pages, the instruction set is depicted on pages 4 and 5. Especially given that fairly reasonable orthogonal instruction sets appear to have been available in multi-chip CPUs, this first single-chip CPU is clearly a strange specialization towards the desk calculator it was meant for (the Busicom 141-PF). It has the aforementioned index register-centered RAM access, separate ROM (although there is a transfer instruction which strangely refers to some optional “read/write program memory”), a three-level internal stack which is almost useless for general purpose programming and a lone special purpose instruction for “keyboard process” (KBP).

Original 4004 CPUs go for anything from a few to a few thousand dollars on eBay, depending on their packaging and revision. If you’d like to, you can instead play around with a virtual one in this JavaScript-based, fully fledged assembler, disassembler and emulator, or read the rescued source code of the Busicom 141-PF calculator. There are lots more schematics, data sheets and other resources on the Intel 4004 anniversary project page.

That is, if you are brave enough.

The story of 15 Second Copy for the C-64

by Mike Pall, published with permission.

[This is a follow-up to Thomas Tempelmann’s Story of FCopy for the C-64.]

Ok, I have to make a confession … more than 25 years late:

I’ve reverse-engineered Thomas Tempelmann’s code, added various improvements and spread them around. I guess I’m at least partially responsible for the slew of fast-loaders, fast-copys etc. that circulated in the German C64 scene and beyond. Uh, oh …

I’ve only published AFLG (auto-fast-loader-generator) under my real name in the German “RUN” magazine. It owes quite a bit to TT’s original ideas. I guess I have to apologize to Thomas for not giving proper credit. But back then in the 80s, intellectual property matters weren’t exactly something a kid like me was overly concerned with.

Later on, everyone was soldering parallel-transfer cables to the VIA #1 of the 1541 and plugging them into the C64 userport. This provided extra bandwidth compared to the standard serial cable. It allowed much faster loading of programs with a tiny parallel loader (a file named “!”, that was prepended on all disks). Note that the commercial kits with cables, custom EPROMs and silly dongles followed only much later.

So I wrote “15 second copy”, which worked with a plain parallel cable. Yes, it copied a full 35 track disk in 15 seconds! There was only one downside: this was only the time for reading/writing from and to disk. You had to swap the floppies seven times (!) and that usually took quite a bit of extra time! ;-)

It worked by transferring the “live” GCR-encoded data from the 1541’s disk head to the C64 and simultaneously doing a fast checksum. Part of the checksumming was done on the 1541, part was done on the C64. There simply weren’t enough cycles left on either side! Most of the transfer happened asynchronously by adjusting for the slightly different CPU frequencies and with only a minimum number of handshakes. This meant meticulous cycle counting and use of some odd tricks.

The raw GCR took up more space (684*324 bytes) in the C64 RAM, which is why it required 4 passes. Other copy programs fully decoded the GCR and required only 3 passes. But GCR decoding was rather time-consuming, so they had to skip some sectors and read every track multiple times. OTOH my program was able to read/write at the full 300 rpm, i.e. 5 tracks per second plus stepper time, which boils down to 2x ~7.5 seconds for read and write. Yep, you had to swap the floppies every 2 seconds …
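The numbers above can be checked with a few lines of arithmetic. This Python sketch only uses the figures quoted in the text; the helper names are invented, and the full 64 KB address space is used as a generous upper bound for the buffer (the real usable amount was certainly smaller).

```python
RAW_GCR_BYTES = 684 * 324        # figure quoted above: 221,616 bytes
C64_ADDRESS_SPACE = 64 * 1024    # upper bound on any RAM buffer


def bytes_per_pass(passes):
    # Ceiling division: the buffer size each pass has to hold.
    return -(-RAW_GCR_BYTES // passes)


def pass_seconds(tracks, rpm=300):
    # One revolution per track; stepper time is not included.
    return tracks / (rpm / 60)


three = bytes_per_pass(3)   # 73,872 bytes: more than 64 KB
four = bytes_per_pass(4)    # 55,404 bytes: fits
read_time = pass_seconds(35)
```

Three passes would need 73,872 bytes per pass, which cannot fit, while four passes need 55,404 bytes. Reading 35 tracks at 5 revolutions per second takes 7 seconds before stepper time, which matches the ~7.5 seconds per direction. Four passes with a source and a destination disk each also explain the seven swaps: eight disk insertions, minus the first one.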

Ok, so I spread the program. For free. I even made a 40 track version, which took 17 seconds. Only to see these coming back in various mutations, with the original credits ripped out, decorated with multiple intros, different groups pretending they wrote it or cracked it (it was free, there was nothing to crack!). The only thing they left alone was the copy routines, probably because they were extremely fragile and hard to understand. So it was really easy to recognize my own code. Some of the commercial parallel-cable + ROM kits even bragged about “Backups in 15 seconds!”. These were blatant rip-offs: they basically changed the screen colors and added a check for their dongles. Duh.

Let’s just say this rather frustrating experience taught me a lot and that’s why I’m doing open source today.

So I shelved my plans to write an enhanced version which would try to compress the memory to reduce the number of passes. Ah, yes … I wrote quite a few packers, too … but I’ll save that story for another time.

I still have the disks with the source code somewhere in my basement. But I’m not so sure I’ll be able to read them anymore. They weren’t of high quality to begin with … and I’d have to find my homegrown toolchain, too. ;-)

But I took the time to reverse-engineer my own code from one of the copies that are floating around on the net. For a better understanding of the C64/1541 handshake issues, refer to this article. If you’re wondering about the weird bvc * loops: the 6502 CPU of the 1541 has an SO pin, which is triggered by a full shift register for the data from the disk head. This directly sets the overflow flag in the CPU and allows reading the contents from the shift register with very low latency.

Yes, there’s a lot more weird code in there. For the sake of brevity, here are only the inner loops of the I/O routines for the read, write and verify pass for the C64 and the 1541 side. Enjoy!

  ;--- 1541: Read ---
  ldy #$20
  bvc *        ; Wait for disk shift register to fill
  lda $1c01    ; Load data from disk
  sta $1801    ; Send byte to C64 via parallel cable
  inc $1800    ; Toggle serial pin
  eor $80      ; Compute checksum for 1st GCR byte in $80
  sta $80
  bvc *
  lda $1c01    ; Load data from disk
  sta $1801    ; Send byte to C64 via parallel cable
  dec $1800    ; Toggle serial pin
  eor $81      ; Compute checksum for 2nd GCR byte in $81
  sta $81
  ; ...
  ; Copy and checksum to $82 $83 $84
  ; And another time for $80 $81 $82 $83 $84 with inverted toggles
  ; ...
  beq f_read_end
  jmp f_read
  ; Copy the remaining 4 bytes and checksum to $80 $81 $82
  ; Lots of bit-shifting and xoring to indirectly verify
  ; the sector checksum from the 5 byte xor of the raw GCR data

  ;--- C64: Read ---
  ; Setup ($5d) and ($5f) to point to GCR buffer
  ldy #$00
  bit $dd00    ; Wait for serial pin to toggle
  bpl *-3
  lda $dd01    ; Read incoming data (from 1541)
  sta ($5d),y  ; Store to buffer
  bit $dd00    ; Wait for serial pin to toggle
  bmi *-3
  lda $dd01    ; Read incoming data (from 1541)
  sta ($5d),y  ; Store to buffer
  bne c_read
  bit $dd00    ; Wait for serial pin to toggle
  bpl *-3
  lda $dd01    ; Read incoming data (from 1541)
  sta ($5d),y  ; Store to buffer
  bit $dd00    ; Wait for serial pin to toggle
  bmi *-3
  lda $dd01    ; Read incoming data (from 1541)
  sta ($5d),y  ; Store to buffer
  cpy #$44
  bne c_read2

  ;--- C64: Write ---
  ; Setup ($5d) and ($5f) to point to GCR buffer
  ldy #$00
  eor ($5d),y  ; Load from buffer and compute checksum
  bit $dd00    ; Wait for serial pin to toggle
  bpl *-3
  sta $dd01    ; Store xor'ed outgoing data (to 1541)
  eor ($5d),y  ; Load from buffer and compute checksum
  bit $dd00    ; Wait for serial pin to toggle
  bmi *-3
  sta $dd01    ; Store xor'ed outgoing data (to 1541)
  bne c_write
  eor ($5f),y  ; Load from buffer and compute checksum
  bit $dd00    ; Wait for serial pin to toggle
  bpl *-3
  sta $dd01    ; Store xor'ed outgoing data (to 1541)
  eor ($5f),y  ; Load from buffer and compute checksum
  bit $dd00    ; Wait for serial pin to toggle
  bmi *-3
  sta $dd01    ; Store xor'ed outgoing data (to 1541)
  cpy #$44
  bne c_write2
  ldx $5b
  sta $0200,x  ; Store checksum for verify pass
  stx $5b

  ;--- 1541: Write ---
  ldy #$a2
  lda #$00
  bvc *        ; Wait for disk shift register to clear
  eor $1801    ; Xor with incoming data (from C64)
  sta $1c01    ; Write data to disk shift register
  dec $1800    ; Toggle serial pin
  lda $1801    ; Reload data to undo xor for next byte
  bvc *        ; Wait for disk shift register to clear
  eor $1801    ; Xor with incoming data (from C64)
  sta $1c01    ; Write data to disk shift register
  inc $1800    ; Toggle serial pin
  lda $1801    ; Reload data to undo xor for next byte
  bne f_write

  ;--- 1541: Verify ---
  ; Get checksum computed by c_write on the C64 side
  ldy #$a2
  bvc *        ; Wait for disk shift register to fill
  eor $1c01    ; Xor with data from disk
  bvc *        ; Wait for disk shift register to fill
  eor $1c01    ; Xor with data from disk
  bne f_verify
  ; Verify is ok if checksum is zero
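The write and verify loops above share one trick that is easy to miss: the C64 never sends the plain data, it sends the running XOR, and the 1541 recovers each byte by XORing with the previous byte it received. Here is a minimal Python sketch of that idea as I read it from the listing. The function names are made up, and it assumes the C64 accumulator starts at zero, which the excerpt does not show.

```python
def c64_encode(buffer):
    """Mimic the C64 write loop: send the running XOR of the buffer."""
    acc, sent = 0, []            # assumption: accumulator starts at 0
    for byte in buffer:
        acc ^= byte              # eor ($5d),y
        sent.append(acc)         # sta $dd01
    return sent, acc             # acc doubles as the checksum


def drive_decode(sent):
    """Mimic the 1541 write loop: XOR each byte with the previous one."""
    prev, written = 0, []        # lda #$00
    for byte in sent:
        written.append(prev ^ byte)   # eor $1801 / sta $1c01
        prev = byte                   # lda $1801 (undo xor for next byte)
    return written


def drive_verify(checksum, disk_bytes):
    """Mimic the verify loop: XOR everything, expect zero."""
    acc = checksum
    for byte in disk_bytes:
        acc ^= byte              # eor $1c01
    return acc == 0


data = [0x55, 0xAA, 0x12, 0xFF]
sent, checksum = c64_encode(data)
restored = drive_decode(sent)
```

The point of the design is that the checksum falls out for free: after the last byte, the C64's accumulator already holds the XOR of the whole block, so the verify pass on the 1541 needs nothing more than one `eor` per byte.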