Category Archives: tricks

Clockslide: How to waste an exact number of clock cycles on the 6502

by Sven Oliver ‘SvOlli’ Moll; the original German language version has been simultaneously posted on his blog.

This is an article about the 6502 processor about the topic: how to “waste” a number of clock cycles stated in a register, in this case the X register. The principle is simple: you have a number of operations that do close to nothing. The more the code is jumped to at the “front”, the more clock cycles are needed to get to the actual code. If the code is jumped to more at the “end”, the CPU gets to the code in question more quickly.

This nice theory won’t work directly on the 6502, because every instruction takes at least two clock cycles to execute. If you want to get it down to the precision of one cycle, this is getting more difficult. The first half of this trick I found in code of Eckhard Stollberg, who is one of the guys that pionieered homebrew on the Atari 2600 VCS. There, I found some strange bytes:

C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C5 EA

The disassembly looks like this:

CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP $EA  ; 3

To run through the code, you’ll need 15 clock cycles, and nothing changes except for some state registers. If the code is called with an offset of one byte, this code will be processed:

CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C5 ; 2
NOP      ; 2

This makes 14 clock cycles, and only the status register will be changed. If the code is called with an offset of two bytes, it is started at the CODE1 segment at the second instruction. Add another one, you’ll get to the second instruction of the CODE2 segment, and so on. This way it is possible to specify the exact number of clock cycles to be “wasted”. With on exception: to be more specific there are 2 + X clock cycles that are wasted. There is no way to waste exactly one clock cycle.

Now we need a way to specify the “entry” of our “slide”. On a C=64 this would be done using self-modifying code. The operand of a JMP $XXXX instruction will be replaced with the calculated address. This is not possible on systems like the Atari 2600, since the code is run in ROM. One option for example would be to use JMP ($0080) after writing the entry point to $0080 and $0081.

My approach differs a bit from the usual way. RAM is scarce on the Atari, and I don’t want to “waste” up two of the 128 bytes available, when there is another way. When the CPU executes a JSR $XXXX (jump to subroutine) command, it writes the current address to the stack. To be more specific, it is the address of the JSR command + 2 which is the return address – 1. And this is what I do: I write my entry point – 1 to the stack and use the command RTS (return from subroutine) to jump into the clock slide. So, I’m still using two bytes of RAM, but only for a short time, without the need to evaluate which two bytes are available at this point.

; the X register specifies how many of the
; 15 clock cycles possible should be skipped
LDA #>clockslide
ADC #<clockslide
STA WSYNC ; <= this syncs to start of next scanline
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
; and here the real code continues

This approach still has one problem: between “clockslide” and “realcode”, no page crossing may occur. If this were the case, I’d have to increase the high byte on the stack by one. But since the position of the code segments is under my control, I left this out as an exercise for the reader. ;-)

Chaosradio Express #177: Commodore 64

(This article is about a German-language podcast episode on the C64.)

Im Februar hat mich Tim Pritlove auf der Durchreise in Frankfurt abgefangen, wo ich mit einem Koffer voll mit zwölf Commodore 64 Motherboards in einem Hotelzimmer saß, und mit mir eine 2 Stunden und 42 Minuten lange Episode für Chaosradio Express aufgenommen.

Chaosradio Express #177: Commodore 64

Hier also nochmal der Hinweis auf die Folge, die jetzt schon ein paar Monate zurückliegt, für all diejenigen, die sie nicht schon anderweitig entdeckt haben. :-)

HFS+ File System Analysis and Forensics with fileXray

Modern filesystems are highly optimized database systems that are a core function of modern operating systems. They allow concurrent access by many CPUs, they keep locality up and fragementation down, and they can recover from crashes guaranteeing consistent data structures.

If you want to learn about the internals of modern filesystems, you can either read up on theory (e.g. Dominic Giampaolo’s excellent free book Practical File System Design), or you can dissect one using a tool like fileXray, which

  • allows you to examine data structures on your Mac OS X HFS+ disk, like the volume header, the journal, regular and special files, etc., down to the B-tree level
  • gives you insight into statistics like fragmentation and hot file clustering
  • lets you find out where all these deleted files really go, and what forensic analysis can tell someone about your disks!

Having Fun with Branch Delay Slots

Branch Delay Slots are one of the awkward features of RISC architectures. RISC CPUs are pipelined by definition, so while the current instruction is in execution, the following instruction(s) will be in the pipeline already. If there is for example a conditional branch in the instruction stream, the CPU cannot know whether the next instruction is the one following the branch or the instruction at the target location until it has evaluated the branch. This would cause a bubble in the pipeline; therefore some RISC architectures have a branch delay slot: The instruction after the branch will always be executed, no matter whether the branch is taken or not.

So in practice, you can put the instruction that would be before the branch right after the branch, if this instruction is independent of the branch instruction, i.e. doesn’t access the same registers. Otherwise, you can fill it with a NOP. Out-of-order architectures can do this reordering at runtime, so there would be no need for a delay slot. Nevertheless, the delay slot is a feature of the architecture, not the implementation.

Some RISCs like PowerPC and ARM do not have a delay slot, but for example MIPS, SPARC, PA-RISC have it. But there are some variations: MIPS and PA-RISC have an annihilation/nullify/likely bit in the instruction, so the programmer can choose that the instruction in the delay slot only gets executed if the branch is taken.

Other CPUs, like SPARC, PA-RISC or the ill-fated Motorola M88K have optional delay slots: The programmer can set a bit in the opcode if he cannot come up with a good instruction for the delay slot, and save the wasted NOP in the program code this way – the CPU will put a bubble into the pipeline.

Now the interesting question is what happens if the branch and the delay instruction are not independent. What if the delay instruction writes r5 and the branch jumps to r5? What if it’s a branch-and-link, and the delay instruction modifies the link register? On MIPS, this is illegal, and undefined.

In practice, MIPS won’t halt and catch fire though. As you would expect from the design of a CPU pipeline, the CPU basically executes the branch and the delay instruction in order, as they are stored in the instruction stream, and it only delays the write to PC, i.e. the actual jump until after the delay instruction. So, for example, if you modify a register that the branch depends on, it will not influence the branch, but be in effect after the jump.

On the aforementioned Motorola M88K, this behaviour is documented, and GCC even makes use of it:

820:   7d ad 00 08   cmp   r13,r13,0x08    ; compare

824:   d4 6d 00 05   bb0.n 0x03,r13,0x834  ; cond. branch
828:   63 df 00 00   addu  r30,r31,0       ; delay slot

82c:   cc 00 00 7f   bsr.n 0xa28           ; function call
830:   60 21 01 ac   addu  r1,r1,0x1ac     ; delay slot: fix up link

834:   00 00 00 00   nop

The first three instructions are a compare/branch sequence. If the branch is taken, execution will continue at 0×834, otherwise at 0x82c, after the delay slot. The delay instruction is independent of the branch, nothing special here yet.

But now look at the following two instructions: 0x82c and 0×830 are not independent. r1 is the M88K’s link register, so the “bsr” writes the addres of the following instruction after the delay slot (0×834) into r1. The delay slot also writes into r1: It adds 0x1ac to it. These instructions are executed in order, but the actual branch to 0xa28 will only be done after the delay instruction.

So what these two instructions effectively do is call a function, but set up the return address to skip the next 0x1ac bytes (107 instructions) after return. If the conditional branch at 0×824 is taken, the code at 0×834 will be executed, otherwise, 0xa28 is called, and the “taken” case of the conditional branch is skipped. This trick can be used whenever you have C code with an “if” statement of which one case is a single function call:

if (...) {
} else {

A Standalone printf() for Early Bootup

A while ago, I complained about operating systems with overly complicated startup code that spends too much time in assembly and does hot have printf() or framebuffer access until very late.

This second post is about printf(): Many systems use POST codes (on i386/x86_64, i.e. writes to port 0×80) or debug LED for debugging, or have complicated and cumbersome implementations for puts() and print_hex() – and printf() is only available very late, because it has some special requirement, like console channels being set up. But printf() is not rocket science: Al it needs is C and a stack. Whatever system you are bringing up on whatever platform: Having printf() as early as possible will prove very useful.

Today, I am presenting a full-featured standalone version of printf() that can be added to arbitrary 32 or 64 bit C code. The code has been taken from FreeBSD (sys/kern/subr_prf.c) and is therefore BSD-licensed. Unnecessary functions have been removed and all typedefs required have been added.

 * Copyright (c) 1986, 1988, 1991, 1993
 *	The Regents of the University of California.  All rights reserved.
 * (c) UNIX System Laboratories, Inc.
 * All or some portions of this file are derived from material licensed
 * to the University of California by American Telephone and Telegraph
 * Co. or Unix System Laboratories, Inc. and are reproduced herein with
 * the permission of UNIX System Laboratories, Inc.
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *	@(#)subr_prf.c	8.3 (Berkeley) 1/21/94

typedef unsigned long size_t;
typedef long ssize_t;
#ifdef __64BIT__
typedef unsigned long long uintmax_t;
typedef long long intmax_t;
typedef unsigned int uintmax_t;
typedef int intmax_t;
typedef unsigned char u_char;
typedef unsigned int u_int;
typedef unsigned long u_long;
typedef unsigned short u_short;
typedef unsigned long long u_quad_t;
typedef long long quad_t;
typedef unsigned long uintptr_t;
typedef long ptrdiff_t;
#define NULL ((void*)0)
#define NBBY    8               /* number of bits in a byte */
char const hex2ascii_data[] = "0123456789abcdefghijklmnopqrstuvwxyz";
#define hex2ascii(hex)  (hex2ascii_data[hex])
#define va_list __builtin_va_list
#define va_start __builtin_va_start
#define va_arg __builtin_va_arg
#define va_end __builtin_va_end
#define toupper(c)      ((c) - 0x20 * (((c) >= 'a') && ((c) <= 'z')))
static size_t
strlen(const char *s)
	size_t l = 0;
	while (*s++)
	return l;

/* Max number conversion buffer length: a u_quad_t in base 2, plus NUL byte. */
#define MAXNBUF	(sizeof(intmax_t) * NBBY + 1)

 * Put a NUL-terminated ASCII number (base <= 36) in a buffer in reverse
 * order; return an optional length and a pointer to the last character
 * written in the buffer (i.e., the first character of the string).
 * The buffer pointed to by `nbuf' must have length >= MAXNBUF.
static char *
ksprintn(char *nbuf, uintmax_t num, int base, int *lenp, int upper)
	char *p, c;

	p = nbuf;
	*p = '';
	do {
		c = hex2ascii(num % base);
		*++p = upper ? toupper(c) : c;
	} while (num /= base);
	if (lenp)
		*lenp = p - nbuf;
	return (p);

 * Scaled down version of printf(3).
 * Two additional formats:
 * The format %b is supported to decode error registers.
 * Its usage is:
 *	printf("reg=%bn", regval, "*");
 * where  is the output base expressed as a control character, e.g.
 * 10 gives octal; 20 gives hex.  Each arg is a sequence of characters,
 * the first of which gives the bit number to be inspected (origin 1), and
 * the next characters (up to a control character, i.e. a character <= 32),
 * give the name of the register.  Thus:
 *	kvprintf("reg=%bn", 3, "102BITTWO1BITONEn");
 * would produce output:
 *	reg=3
 * XXX:  %D  -- Hexdump, takes pointer and separator string:
 *		("%6D", ptr, ":")   -> XX:XX:XX:XX:XX:XX
 *		("%*D", len, ptr, " " -> XX XX XX XX ...
kvprintf(char const *fmt, void (*func)(int, void*), void *arg, int radix, va_list ap)
#define PCHAR(c) {int cc=(c); if (func) (*func)(cc,arg); else *d++ = cc; retval++; }
	char nbuf[MAXNBUF];
	char *d;
	const char *p, *percent, *q;
	u_char *up;
	int ch, n;
	uintmax_t num;
	int base, lflag, qflag, tmp, width, ladjust, sharpflag, neg, sign, dot;
	int cflag, hflag, jflag, tflag, zflag;
	int dwidth, upper;
	char padc;
	int stop = 0, retval = 0;

	num = 0;
	if (!func)
		d = (char *) arg;
		d = NULL;

	if (fmt == NULL)
		fmt = "(fmt null)n";

	if (radix < 2 || radix > 36)
		radix = 10;

	for (;;) {
		padc = ' ';
		width = 0;
		while ((ch = (u_char)*fmt++) != '%' || stop) {
			if (ch == '')
				return (retval);
		percent = fmt - 1;
		qflag = 0; lflag = 0; ladjust = 0; sharpflag = 0; neg = 0;
		sign = 0; dot = 0; dwidth = 0; upper = 0;
		cflag = 0; hflag = 0; jflag = 0; tflag = 0; zflag = 0;
reswitch:	switch (ch = (u_char)*fmt++) {
		case '.':
			dot = 1;
			goto reswitch;
		case '#':
			sharpflag = 1;
			goto reswitch;
		case '+':
			sign = 1;
			goto reswitch;
		case '-':
			ladjust = 1;
			goto reswitch;
		case '%':
		case '*':
			if (!dot) {
				width = va_arg(ap, int);
				if (width < 0) {
					ladjust = !ladjust;
					width = -width;
			} else {
				dwidth = va_arg(ap, int);
			goto reswitch;
		case '0':
			if (!dot) {
				padc = '0';
				goto reswitch;
		case '1': case '2': case '3': case '4':
		case '5': case '6': case '7': case '8': case '9':
				for (n = 0;; ++fmt) {
					n = n * 10 + ch - '0';
					ch = *fmt;
					if (ch < '0' || ch > '9')
			if (dot)
				dwidth = n;
				width = n;
			goto reswitch;
		case 'b':
			num = (u_int)va_arg(ap, int);
			p = va_arg(ap, char *);
			for (q = ksprintn(nbuf, num, *p++, NULL, 0); *q;)

			if (num == 0)

			for (tmp = 0; *p;) {
				n = *p++;
				if (num & (1 << (n - 1))) {
					PCHAR(tmp ? ',' : '<');
					for (; (n = *p) > ' '; ++p)
					tmp = 1;
				} else
					for (; *p > ' '; ++p)
			if (tmp)
		case 'c':
			PCHAR(va_arg(ap, int));
		case 'D':
			up = va_arg(ap, u_char *);
			p = va_arg(ap, char *);
			if (!width)
				width = 16;
			while(width--) {
				PCHAR(hex2ascii(*up >> 4));
				PCHAR(hex2ascii(*up & 0x0f));
				if (width)
					for (q=p;*q;q++)
		case 'd':
		case 'i':
			base = 10;
			sign = 1;
			goto handle_sign;
		case 'h':
			if (hflag) {
				hflag = 0;
				cflag = 1;
			} else
				hflag = 1;
			goto reswitch;
		case 'j':
			jflag = 1;
			goto reswitch;
		case 'l':
			if (lflag) {
				lflag = 0;
				qflag = 1;
			} else
				lflag = 1;
			goto reswitch;
		case 'n':
			if (jflag)
				*(va_arg(ap, intmax_t *)) = retval;
			else if (qflag)
				*(va_arg(ap, quad_t *)) = retval;
			else if (lflag)
				*(va_arg(ap, long *)) = retval;
			else if (zflag)
				*(va_arg(ap, size_t *)) = retval;
			else if (hflag)
				*(va_arg(ap, short *)) = retval;
			else if (cflag)
				*(va_arg(ap, char *)) = retval;
				*(va_arg(ap, int *)) = retval;
		case 'o':
			base = 8;
			goto handle_nosign;
		case 'p':
			base = 16;
			sharpflag = (width == 0);
			sign = 0;
			num = (uintptr_t)va_arg(ap, void *);
			goto number;
		case 'q':
			qflag = 1;
			goto reswitch;
		case 'r':
			base = radix;
			if (sign)
				goto handle_sign;
			goto handle_nosign;
		case 's':
			p = va_arg(ap, char *);
			if (p == NULL)
				p = "(null)";
			if (!dot)
				n = strlen (p);
				for (n = 0; n < dwidth && p[n]; n++)

			width -= n;

			if (!ladjust && width > 0)
				while (width--)
			while (n--)
			if (ladjust && width > 0)
				while (width--)
		case 't':
			tflag = 1;
			goto reswitch;
		case 'u':
			base = 10;
			goto handle_nosign;
		case 'X':
			upper = 1;
		case 'x':
			base = 16;
			goto handle_nosign;
		case 'y':
			base = 16;
			sign = 1;
			goto handle_sign;
		case 'z':
			zflag = 1;
			goto reswitch;
			sign = 0;
			if (jflag)
				num = va_arg(ap, uintmax_t);
			else if (qflag)
				num = va_arg(ap, u_quad_t);
			else if (tflag)
				num = va_arg(ap, ptrdiff_t);
			else if (lflag)
				num = va_arg(ap, u_long);
			else if (zflag)
				num = va_arg(ap, size_t);
			else if (hflag)
				num = (u_short)va_arg(ap, int);
			else if (cflag)
				num = (u_char)va_arg(ap, int);
				num = va_arg(ap, u_int);
			goto number;
			if (jflag)
				num = va_arg(ap, intmax_t);
			else if (qflag)
				num = va_arg(ap, quad_t);
			else if (tflag)
				num = va_arg(ap, ptrdiff_t);
			else if (lflag)
				num = va_arg(ap, long);
			else if (zflag)
				num = va_arg(ap, ssize_t);
			else if (hflag)
				num = (short)va_arg(ap, int);
			else if (cflag)
				num = (char)va_arg(ap, int);
				num = va_arg(ap, int);
			if (sign && (intmax_t)num < 0) {
				neg = 1;
				num = -(intmax_t)num;
			p = ksprintn(nbuf, num, base, &tmp, upper);
			if (sharpflag && num != 0) {
				if (base == 8)
				else if (base == 16)
					tmp += 2;
			if (neg)

			if (!ladjust && padc != '0' && width
			    && (width -= tmp) > 0)
				while (width--)
			if (neg)
			if (sharpflag && num != 0) {
				if (base == 8) {
				} else if (base == 16) {
			if (!ladjust && width && (width -= tmp) > 0)
				while (width--)

			while (*p)

			if (ladjust && width && (width -= tmp) > 0)
				while (width--)

			while (percent < fmt)
			 * Since we ignore an formatting argument it is no
			 * longer safe to obey the remaining formatting
			 * arguments as the arguments will no longer match
			 * the format specs.
			stop = 1;
#undef PCHAR

static void
putchar(int c, void *arg)
	/* add your putchar here */

printf(const char *fmt, ...)
	/* */
	va_list ap;

	va_start(ap, fmt);
	kvprintf(fmt, putchar, NULL, 10, ap);

One thing you will have to do is #define or #undef __64__BIT to enable or disable support for printing 64 bit values. If you are on a 64 bit architecture, you will need this to print addresses properly. If you are on 32 bit, this support will require external library functions to do 64 bit arithmetic, so you probably want to turn this off.

On ARM, which doesn't have an assembly statement for division, will always need a library routine. You can use FreeBSD's at sys/libkern/arm/divsi3.S, if you to add the following #defines to make the file compile:

#define __FBSDID(a)
#define RET bx lr
#define ENTRY_NP(a) _##a: .global _##a
#define _KERNEL

All you need to do now to get printf() working is to fill the putchar() function, for example with a call to a serial_putc() implementation, if you have a serial port. On a PC, serial_putc() could look like this:

static void outb(short p, unsigned char a)
        asm volatile("outb %b0, %w1" : : "a" (a), "Nd" (p));
static unsigned char inb(short p)
        unsigned char a;
        asm volatile("inb %w1, %b0" : "=a" (a) : "Nd" (p));
        return a;
#define SERIAL_PORT 0x3f8
static void
serial_putc(unsigned char c) {
	while (!(inb(SERIAL_PORT + 5) & 0x20));
	outb(SERIAL_PORT, c);

On an "ARM Integrator/CP" board, the default for QEMU emulating ARM, the follwing serial_putc() would apply (make sure you are running the MMU off, or with the serial port MMIO region mapped):

#define SERIAL_PORT 0x16000000
static void
serial_putc(unsigned char c) {
	while ((*(volatile unsigned int *)(SERIAL_PORT + 0x18)) & 0x20);
	*(volatile unsigned int *)(SERIAL_PORT) = (c);

Unfortunately, sometimes you are bringing up or debugging a system that does not have a serial port - or at least you haven't found one: Apple Macintosh and various game systems come to mind. In this case, you might be able to set up a framebuffer and print to that. In a later post, I will present a small standalone framebuffer library.

Minimizing the Assembly needed for Machine Initialization

In many operating systems, I have seen overly complicated startup code. Too much is done in assembly, and printf() and framebuffer access is only available very late. In the next three blog posts, I will show how this can be avoided.

In this post, I am showing how little assembly code is needed for startup. Minimizing the assembly makes your code significantly more maintainable. Everything that really needs to be done is setting up the CPU state to support 64 bit (or 32 bit) C code running at physical addresses, and everything else, including setting up the final machine state, can be done in C with a little inline assembly. The following example switches from 16 bit (real mode) or 32 bit mode (flat protected mode) into 64 bit mode on an x86_64 CPU (NASM syntax):

PAGE_SIZE	equ	0x1000
STACK_SIZE	equ	16*1024

PML2		equ	0x1000




	[BITS 16]

	; clear 3 pages of pagetables
	mov edi, PML2
	xor eax, eax
	mov ecx, 3*4096/4
	rep stosd

	; set up pagetables
	mov dword [PML2], 0x87		; single 4 MB id mapping PML2
	mov dword [PML3], PML2 | 7	; single entry at PML3
	mov dword [PML4], PML3 | 7	; single entry at PML4

	; load the GDT
	lgdt [gdt_desc]

	; set PSE,  PAE
	mov eax, 0x30
	mov cr4, eax

	; long mode
	mov ecx, 0xc0000080
	or eax, 0x100

	; enable pagetables
	mov eax, PML4
	mov cr3, eax

	; turn on long mode and paging
	mov eax, 0x80010001
	mov cr0, eax

	jmp SEL_CS:code64

	[BITS 64]
	mov ax, SEL_DS
	mov ds, ax
	mov es, ax
	mov fs, ax
	mov gs, ax
	mov ss, ax

	mov sp, STACK_TOP
	call start

inf:	jmp inf

	dw  GDT_LEN-1
	dd  gdt
	align 8
gdt	db 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ; 0x00 dummy
gdt_cs	db 0xff, 0xff, 0x00, 0x00, 0x00, 0x9b, 0xaf, 0x00 ; 0x08 code64
gdt_ds	db 0xff, 0xff, 0x00, 0x00, 0x00, 0x93, 0xaf, 0x00 ; 0x18 data64

GDT_LEN equ $ - gdt
SEL_CS equ gdt_cs - gdt
SEL_DS equ gdt_ds - gdt

While switching into 32 bit flat mode is trivial, switching into 64 bit mode requires setting up pagetables. This code sets up 4 MB of identity-mapped memory starting at address 0.

The code is designed to switch from 16 bit mode into 64 bit mode, but since 16 and 32 bit flat mode on i386 are assembly source compatible, you can replace “[BITS 16]” with “[BITS 32]“, and the code will switch from 32 bit to 64 bit mode. (Yes, it is possible to switch from real mode directly into 64 bit mode: Forums)

If you use this code, be careful about the memory layout. The example leaves the first page in memory untouched (in case you need the original real mode BIOS IDT for something later), and occupies the next few pages for the pagetables and the stack. Your code should be above that, but, on a BIOS system, not be between 640 KB and 1 MB (this might be device memory and ROM), and also not above 1 MB before you have enabled the A20 gate.

While this code is enough to support 64 bit C code, this is not enough to set up the machine to support all aspects of an operating system. You probably want to set up your own GDT that has entries for 32 bit code and data too, you want to set up an IDT in order to be able to catch exceptions and interrupts, and you will need real pagetables to support virtual memory. Also, you will have to move your stack pointer once you have your final memory layout.

But it is possible to construct the overly complicated GDT, IDT and pagetable structures using readable C code, and the “lidt”, “lgdt” etc. instructions can be done in inline assembly. While this is not portable C code, it is possible to reuse large parts of the machine initialization for a 32 bit (i386) and a 64 bit (x86_64) version of the operating system, which is not as easy to get right in assembly.

In my next post, I am going to show how easy it is to get printf() working as soon as you reach C, so you don’t have to mess around with puts()- and print_hex()-like functions in early machine initialization.


Commodore computers up to BASIC 2.0 (like the Commodore 64, the VIC-20 and the PET 2001) only had a very basic understanding of mass storage: There were physical device numbers that were mapped to the different busses, and the “KERNAL” library had “open”, “read”, “write” and “close” functions that worked on these devices. There were also higher-level “load” and “save” functions that could load and save arbitrary regions of memory: The first two bytes of the file would be the (little endian) start address of the memory block.

With no special knowledge of “block storage” devices like disk drives, BASIC 2.0, which was not only a programming laguage but basically the shell of Commodore computers, could not express commands like “format a disk”, “delete a file” or “show the directory”. All this functionality, as well as the file system implementation, was part of the firmware of the disk drives.

Sending a Command

Sending commands to the drive was done by using the “open” call with a “secondary address” of 15: The computer’s KERNAL just sent the file name and the secondary address over the IEC bus as if it were to open a file, but the floppy drive understood secondary address 15 as the command channel. So for example, deleting a file from BASIC looked like this:

OPEN 1,8,15,"S:FOO": CLOSE 1

“1″ is the KERNAL’s file descriptor, “8″ the device number and “15″ the secondary address. Experts omitted the close, because it blocked on the completion of the operation.

Getting Data Back

While the “OPEN” line for disk commands was pretty verbose, it was still doable. Getting the error message of the last operation back was more tricky: It required a loop in BASIC that read bytes from channel 15 until EOF was reached.

Getting a directory listing would be in the same class of problem, since it requires the computer to send a command (and a file name mask) to the floppy and receive the data. Neither BASIC nor KERNAL knew how to do this, and since this was such a common operation, it wouldn’t have been possible to have the user type in a 4 line BASIC program just to dump the directory contents.

The BASIC Program Hack

Here comes the trick: If the program to load started with a “$” (followed by an optional mask), the floppy drive just returned the directory listing – formatted as a BASIC program. The user could then just “LOAD” the directory and “LIST” it if it were a BASIC program:



0 "TEST DISK       " 23 2A
20   "FOO"               PRG
3    "BAR"               PRG

In this example, “TEST DISK” is the disk name, “23″ the disk ID and “2A” the filesystem format/version (always 2A on 1540/1541/1551/1570/1571 – but this was only a redundant copy of the version information which was never read and could be changed). There are two files, 20 and 3 blocks in size respecively (a block is a 256 byte allocation unit on disk – since blocks are stored as linked lists there are only 254 bytes of payload), and both are of the “PRG” type.

Encoding of Commodore BASIC Programs

The floppy was aware of the encoding that Commodore BASIC (a derivative of Microsoft BASIC for 6502) used and prepared the directory listing in that way. A BASIC program in memory is a linked list of lines. Every line starts with a 2-byte pointer to the next line. A 0-pointer marks the end of the program. The next two bytes are the line number, followed by the zero-terminated encoded line contents.

The LIST command decodes a BASIC program in memory by following the linked list from the start of BASIC RAM. It prints the line number, a space, and the line contents. These contents have BASIC keywords encoded as 1-byte tokens starting at 0×80. Character below 0×80 are printed verbatim. Here is what 10 PRINT"HELLO WORLD!" would look like:

0801  0E 08    - next line starts at 0x080E
0803  0A 00    - line number 10
0805  99       - token for PRINT
0806  "HELLO!" - ASCII text of line
080D  00       - end of line
080E  00 00    - end of program

The example directory listing from above would be encoded by the floppy like this:

0801  21 08    - next line starts at 0x0821
0803  00 00    - line number 0
0805  '"TEST DISK       " 23 2A '
0820  00       - end of line
0821  21 08    - next line starts at 0x0821
0823  14 00    - line number 20
0825  '  "FOO"               PRG '
0840  00       - end of line

A couple of things are interesting here:

  • The line with the disk name and the ID is actually printed in inverted letters, which is done by having the “revert” character code as the first character of the first line, i.e. the floppy makes the assumption that the computer understands this convention.
  • BASIC will print the file sizes as variable-with line numbers, so the floppy adds extra spaces to the beginning of the line contents to have all file names aligned.
  • The floppy needs to populate the next line pointers for the linked list.

The Link Pointer

The obvious question here is: How can the floppy know where in the computer’s memory the BASIC program will live? The answer is: It doesn’t. The BASIC interpreter supports having its program anywhere in memory, and loading programs that were saved from other locations on memory – or possibly other Microsoft BASIC compatible computers with a different memory layout. The VIC-20 had BASIC RAM at 0×0401, the C64 at 0×0801 and the C128 at 0x1C01. Therefore, BASIC “rebinds” a program on load, searching for the zero-terminator of the lines and filling the (redundant) link pointers.

The floppy therefore only has to send non-zero values as the link pointers for BASIC to accept the directory listing as a program. In fact, a 1541 sends the directory with a 0×0401-base, which would be valid on a VIC-20. The reason for this is that the 1541 is only a 1540 with minor timing fixes for C64 support, and the 1540 is the floppy drive that was designed for the VIC-20.

Therefore, if you do LOAD"$",8,1 on a C64, the extra “,1″ will be interpreted by the KERNAL LOAD code to load the file at its original address (as opposed to the beginning of BASIC RAM), and since there is screen RAM at 0×0400 on the C64, garbage will appear on the screen, because the character encoding of screen ram is incompatible with BASIC character encoding.

Directory Code in 61 Bytes

There are two problems with this “directory listing is a BASIC program” hack: Listing the directory overwrites a BASIC program in RAM, and listing the directory from inside an application is non-trivial.

Therefore, many many implementations to show a directory listing exist on the C64 – and I want to present my own one here, which is, to my knowledge, the shortest existing (and maybe shorted possible?) version. It is based on a 70 byte version published in “64′er Magazin” some time in the 80s, and I managed to get it down to 61 bytes.

,C000:  A9 01     LDA #$01     ; filename length
,C002:  AA        TAX
,C003:  A0 E8     LDY #$E8     ; there is a "$" at $E801 in ROM
,C005:  20 BD FF  JSR $FFBD    ; set filename
,C008:  A9 60     LDA #$60
,C00A:  85 B9     STA $B9      ; set secondary address
,C00C:  20 D5 F3  JSR $F3D5    ; OPEN (IEC bus version)
,C00F:  20 19 F2  JSR $F219    ; set default input device
,C012:  A0 04     LDY #$04     ; skip 4 bytes (load address and link pointer)
,C014:  20 13 EE  JSR $EE13    ; read byte
,C017:  88        DEY
,C018:  D0 FA     BNE $C014    ; loop
,C01A:  A5 90     LDA $90
,C01C:  D0 19     BNE $C037    ; check end of file
,C01E:  20 13 EE  JSR $EE13    ; read byte (block count low)
,C021:  AA        TAX
,C022:  20 13 EE  JSR $EE13    ; read byte (block count high)
,C025:  20 CD BD  JSR $BDCD    ; print 16 bit integer
,C028:  20 13 EE  JSR $EE13    ; read character
,C02B:  20 D2 FF  JSR $FFD2    ; print character to stdout
,C02E:  D0 F8     BNE $C028    ; loop until zero
,C030:  20 D7 AA  JSR $AAD7    ; print carriage return character
,C033:  A0 02     LDY #$02
,C035:  D0 DD     BNE $C014    ; skip 2 bytes next time (link pointer)
,C037:  20 42 F6  JSR $F642    ; CLOSE
,C03A:  4C F3 F6  JMP $F6F3    ; reset default input device

(There is a similar implementation here.)

There are two limitations of this code though: It omits the extra space between the block number and the filename, leading to a slightly different output, and it cannot be interrupted.

Readable and Maintainable Bitfields in C

Bitfields are very common in low level C programming. You can use them for efficient storage of a data structure with lots of flags, or to pass a set of flags between functions. Let us look at the different ways of doing this.

The straightforward way to deal with bitfields is to do the Boolean logic by hand:

Boolean Magic

#define FLAG_USER   (1 << 0)
#define FLAG_ZERO   (1 << 1)
#define FLAG_FORCE  (1 << 2)
/* bits 3-30 reserved */
#define FLAG_COMPAT (1 << 31)

create_object(int flags)
        int is_compat = (flags & FLAG_COMPAT);

        if (is_compat)
                flags &= ~FLAGS_FORCE;

        if (flags & FLAG_FORCE) {

create_object_zero(int flags)
	create_object(flags | FLAGS_ZERO);

        create_object(FLAG_FORCE | FLAG_COMPAT);

You can see code like this everywhere, so most C programmers can probably read and understand this quite easily. But unfortunately, this method is very error-prone. Mixing up "&" and "&&" and omitting the "~" when doing "&=" are common oversights, and since the compiler only sees "int" types, this also doesn't give you any kind of type-safety.


Let us look at the same code using bitfields instead:

typedef unsigned int boolean_t;
#define FALSE 0
#define TRUE !FALSE
typedef union {
        struct {
                boolean_t user:1;
                boolean_t zero:1;
                boolean_t force:1;
                int :28;                /* unused */
                boolean_t compat:1;     /* bit 31 */
        int raw;
} flags_t;

create_object(flags_t flags)
        boolean_t is_compat = flags.compat;

        if (is_compat)
                flags.force = FALSE;

        if (flags.force) {

create_object_zero(flags_t flags)
{ = TRUE;

        create_object((flags_t) { { .force = TRUE, .compat = TRUE } });

Flags can just be used like any variables. The compiler abstracts the Boolean logic away. The only downside is that the code with the static initializer requires some advanced syntax.


Bitfields in C always start at bit 0. While this is the least significant bit (LSB) on Little Endian (bit 0 has a weight of 2^0), most compilers on Big Endian systems inconveniently consider the most significant bit (MSB) bit 0 (bit 0 has a weight of 2^31, 2^63 etc. depending on the word size), so in case your bitfield needs to be binary-compatible across machines with different endianness, you will need to define two versions of the bitfield.

The Raw Bitfield

Did you notice the "int raw" in the union? It lets us conveniently (and type-safely) export the raw bit value without having to use a cast.

	printf("raw flags: 0x%xn", flags.raw);

We can use this to reconstruct the FLAG_* constants in the original example, in case some code needs it:

#define FLAG_USER   (((flags_t) { { .user   = TRUE } }).raw)
#define FLAG_ZERO   (((flags_t) { { .zero   = TRUE } }).raw)
#define FLAG_FORCE  (((flags_t) { { .force  = TRUE } }).raw)
#define FLAG_COMPAT (((flags_t) { { .compat = TRUE } }).raw)

This code constructs a temporary instance of the bitfield, sets one bit, and converts it into a raw integer - all at compile time.

Bitfield Access from Assembly

With the same trick, you can also access your bitfield from assembly, for example if the bitfield is part of the Thread Control Block in your operating system kernel, and the low level context switch code needs to access one of the bits. The "int raw" can be used to statically convert a flag into the corresponding raw mask:

typedef unsigned int boolean_t;

typedef union {
	struct {
		boolean_t bit0:1;
		boolean_t bit1:1;
		int :19;
		boolean_t bit31:1;
	int raw;
} bitfield_t;

int test()
	int param = -1;
	int result;

	__asm__ volatile (
		"test    %2, %1    n"
		"xor     %0, %0    n"
		"setcc   %0        n"
		: "=r" (result)
		: "r" (param),
		  "i" (((bitfield_t) { { .bit31 = TRUE } }).raw)
	return result;

The corresponding x86 assembly code looks like this:

	.align	4,0x90
	.globl	_test
	pushl	%ebp
	movl	%esp, %ebp
	movl	$-1, %eax
	## InlineAsm Start
	test	$0x80000000, %eax
	xor	%eax, %eax
	setcc	%eax
	## InlineAsm End
	popl	%ebp


This works fine with LLVM, but unfortunately GCC (4.2.1) has problems detecting that the raw value is available at compile time, so the "i" has to be replaced with an "r": GCC will then pre-assign a register with the raw value instead of being able to use an immediate with the "test" instruction.

How to Not Do It

I have seen C++ code doing this:

enum {

create_object(bitfield_t flags)
        bool is_compat = flags.is_set(FLAG_COMPAT);

        if (is_compat)
                flags -= FLAGS_FORCE;

        if (flags.is_set(FLAG_FORCE)) {

create_object_zero(int flags)
	create_object(flags + FLAGS_ZERO);

        create_object(((bitfield_t)FLAG_FORCE) + FLAG_COMPAT);

This all looks quite weird. The constants are bit index values, and they are added and subtracted. The reason is C++ operator overloading:

class bitmask_t
    word_t      maskvalue;

    inline bitmask_t operator -= (int n)
            maskvalue = maskvalue & ~(1UL << n);
            return (bitmask_t) maskvalue;

This is horrible. The code that uses this class makes no sense unless you read and understand the implementation of the class. And you have to be very careful: While it is possible to "add" a flag to an existing bitfield, you cannot just add two flags - it would do the arithmetic and add the two values.

Mapping the setting and clearing of bits onto the addition and subtraction operators is clearly wrong in the first place: Flags in a bitfield are equivalent to elements in a set. Setting a flag is equivalent to the "union" operation, which even in Mathematics has its own symbol instead of overloading the "+" operator.


If you compile code that does something like "((bitfield_t) { { .bit31 = TRUE } }).raw" with GCC in C++ mode, it fails. Why?

Optimized Flags Code for 6502

When I disassembled Steve Wozniak’s Apple I BASIC, I found a 6502 trick that I had never seen before, although I had read a lot of 6502 code, including the very well-written Commodore BASIC (i.e. Microsoft BASIC for 6502).

What is the most optimized way (shortest code) to set, clear and test a flag?

Normally, you would do this (“flag” is a byte in the zero page):

function code bytes cycles
lda #1
sta flag
4 5
lda #0
sta flag
4 5
lda flag
beq cleared
4 5/6

Woz did it like this:

function code bytes cycles
lda #$80
sta flag
4 5
lsr flag
2 5
bit flag
bpl cleared
4 5/6

The trick is to store the flag in bit #7. Clearing bit #7 is easy: Just do a logical shift right, which will always write a zero into the MSB; we don’t care about the contents of the other bits. The “lsr zp” is the same speed as the “lda/sta zp” combination, but only occupies 2 bytes instead of four.

The other advantage of this method is that it is possible to test the flag without destroying a register. The “bit” instruction will just copy the MSB into the negative flag, while the “lda” would have overwritten the accumulator.

I can’t see why this couldn’t be adapted to other (CISC, RMW-capable) CPUs as well. If you are fluent in any other assembly language, please post a comment.