Archive for the ‘tricks’ Category

Clockslide: How to waste an exact number of clock cycles on the 6502

Monday, February 6th, 2012

by Sven Oliver ‘SvOlli’ Moll; the original German language version has been simultaneously posted on his blog.

This is an article about the 6502 processor about the topic: how to “waste” a number of clock cycles stated in a register, in this case the X register. The principle is simple: you have a number of operations that do close to nothing. The more the code is jumped to at the “front”, the more clock cycles are needed to get to the actual code. If the code is jumped to more at the “end”, the CPU gets to the code in question more quickly.

This nice theory won’t work directly on the 6502, because every instruction takes at least two clock cycles to execute. If you want to get it down to the precision of one cycle, this is getting more difficult. The first half of this trick I found in code of Eckhard Stollberg, who is one of the guys that pionieered homebrew on the Atari 2600 VCS. There, I found some strange bytes:

C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C9 C5 EA

The disassembly looks like this:

; CODE1
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP $EA  ; 3

To run through the code, you’ll need 15 clock cycles, and nothing changes except for some state registers. If the code is called with an offset of one byte, this code will be processed:

; CODE2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C9 ; 2
CMP #$C5 ; 2
NOP      ; 2

This makes 14 clock cycles, and only the status register will be changed. If the code is called with an offset of two bytes, it is started at the CODE1 segment at the second instruction. Add another one, you’ll get to the second instruction of the CODE2 segment, and so on. This way it is possible to specify the exact number of clock cycles to be “wasted”. With on exception: to be more specific there are 2 + X clock cycles that are wasted. There is no way to waste exactly one clock cycle.

Now we need a way to specify the “entry” of our “slide”. On a C=64 this would be done using self-modifying code. The operand of a JMP $XXXX instruction will be replaced with the calculated address. This is not possible on systems like the Atari 2600, since the code is run in ROM. One option for example would be to use JMP ($0080) after writing the entry point to $0080 and $0081.

My approach differs a bit from the usual way. RAM is scarce on the Atari, and I don’t want to “waste” up two of the 128 bytes available, when there is another way. When the CPU executes a JSR $XXXX (jump to subroutine) command, it writes the current address to the stack. To be more specific, it is the address of the JSR command + 2 which is the return address – 1. And this is what I do: I write my entry point – 1 to the stack and use the command RTS (return from subroutine) to jump into the clock slide. So, I’m still using two bytes of RAM, but only for a short time, without the need to evaluate which two bytes are available at this point.

; the X register specifies how many of the
; 15 clock cycles possible should be skipped
LDA #>clockslide
PHA
TXA
CLC
ADC #<clockslide
PHA
STA WSYNC ; <= this syncs to start of next scanline
clockslide:
RTS
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP #$C9
CMP $EA
realcode:
; and here the real code continues

This approach still has one problem: between “clockslide” and “realcode”, no page crossing may occur. If this were the case, I’d have to increase the high byte on the stack by one. But since the position of the code segments is under my control, I left this out as an exercise for the reader. ;-)

Chaosradio Express #177: Commodore 64

Tuesday, July 5th, 2011

(This article is about a German-language podcast episode on the C64.)

Im Februar hat mich Tim Pritlove auf der Durchreise in Frankfurt abgefangen, wo ich mit einem Koffer voll mit zwölf Commodore 64 Motherboards in einem Hotelzimmer saß, und mit mir eine 2 Stunden und 42 Minuten lange Episode für Chaosradio Express aufgenommen.

Chaosradio Express #177: Commodore 64

Hier also nochmal der Hinweis auf die Folge, die jetzt schon ein paar Monate zurückliegt, für all diejenigen, die sie nicht schon anderweitig entdeckt haben. :-)

HFS+ File System Analysis and Forensics with fileXray

Monday, November 1st, 2010

Modern filesystems are highly optimized database systems that are a core function of modern operating systems. They allow concurrent access by many CPUs, they keep locality up and fragementation down, and they can recover from crashes guaranteeing consistent data structures.

If you want to learn about the internals of modern filesystems, you can either read up on theory (e.g. Dominic Giampaolo’s excellent free book Practical File System Design), or you can dissect one using a tool like fileXray, which

  • allows you to examine data structures on your Mac OS X HFS+ disk, like the volume header, the journal, regular and special files, etc., down to the B-tree level
  • gives you insight into statistics like fragmentation and hot file clustering
  • lets you find out where all these deleted files really go, and what forensic analysis can tell someone about your disks!

Having Fun with Branch Delay Slots

Sunday, November 22nd, 2009

Branch Delay Slots are one of the awkward features of RISC architectures. RISC CPUs are pipelined by definition, so while the current instruction is in execution, the following instruction(s) will be in the pipeline already. If there is for example a conditional branch in the instruction stream, the CPU cannot know whether the next instruction is the one following the branch or the instruction at the target location until it has evaluated the branch. This would cause a bubble in the pipeline; therefore some RISC architectures have a branch delay slot: The instruction after the branch will always be executed, no matter whether the branch is taken or not.

So in practice, you can put the instruction that would be before the branch right after the branch, if this instruction is independent of the branch instruction, i.e. doesn’t access the same registers. Otherwise, you can fill it with a NOP. Out-of-order architectures can do this reordering at runtime, so there would be no need for a delay slot. Nevertheless, the delay slot is a feature of the architecture, not the implementation.

Some RISCs like PowerPC and ARM do not have a delay slot, but for example MIPS, SPARC, PA-RISC have it. But there are some variations: MIPS and PA-RISC have an annihilation/nullify/likely bit in the instruction, so the programmer can choose that the instruction in the delay slot only gets executed if the branch is taken.

Other CPUs, like SPARC, PA-RISC or the ill-fated Motorola M88K have optional delay slots: The programmer can set a bit in the opcode if he cannot come up with a good instruction for the delay slot, and save the wasted NOP in the program code this way – the CPU will put a bubble into the pipeline.

Now the interesting question is what happens if the branch and the delay instruction are not independent. What if the delay instruction writes r5 and the branch jumps to r5? What if it’s a branch-and-link, and the delay instruction modifies the link register? On MIPS, this is illegal, and undefined.

In practice, MIPS won’t halt and catch fire though. As you would expect from the design of a CPU pipeline, the CPU basically executes the branch and the delay instruction in order, as they are stored in the instruction stream, and it only delays the write to PC, i.e. the actual jump until after the delay instruction. So, for example, if you modify a register that the branch depends on, it will not influence the branch, but be in effect after the jump.

On the aforementioned Motorola M88K, this behaviour is documented, and GCC even makes use of it:

820:   7d ad 00 08   cmp   r13,r13,0x08    ; compare

824:   d4 6d 00 05   bb0.n 0x03,r13,0x834  ; cond. branch
828:   63 df 00 00   addu  r30,r31,0       ; delay slot

82c:   cc 00 00 7f   bsr.n 0xa28           ; function call
830:   60 21 01 ac   addu  r1,r1,0x1ac     ; delay slot: fix up link

834:   00 00 00 00   nop

The first three instructions are a compare/branch sequence. If the branch is taken, execution will continue at 0×834, otherwise at 0x82c, after the delay slot. The delay instruction is independent of the branch, nothing special here yet.

But now look at the following two instructions: 0x82c and 0×830 are not independent. r1 is the M88K’s link register, so the “bsr” writes the addres of the following instruction after the delay slot (0×834) into r1. The delay slot also writes into r1: It adds 0x1ac to it. These instructions are executed in order, but the actual branch to 0xa28 will only be done after the delay instruction.

So what these two instructions effectively do is call a function, but set up the return address to skip the next 0x1ac bytes (107 instructions) after return. If the conditional branch at 0×824 is taken, the code at 0×834 will be executed, otherwise, 0xa28 is called, and the “taken” case of the conditional branch is skipped. This trick can be used whenever you have C code with an “if” statement of which one case is a single function call:

if (...) {
    call_something();
} else {
    [...]
}

A Standalone printf() for Early Bootup

Monday, September 7th, 2009

A while ago, I complained about operating systems with overly complicated startup code that spends too much time in assembly and does hot have printf() or framebuffer access until very late.

This second post is about printf(): Many systems use POST codes (on i386/x86_64, i.e. writes to port 0×80) or debug LED for debugging, or have complicated and cumbersome implementations for puts() and print_hex() – and printf() is only available very late, because it has some special requirement, like console channels being set up. But printf() is not rocket science: Al it needs is C and a stack. Whatever system you are bringing up on whatever platform: Having printf() as early as possible will prove very useful.

Today, I am presenting a full-featured standalone version of printf() that can be added to arbitrary 32 or 64 bit C code. The code has been taken from FreeBSD (sys/kern/subr_prf.c) and is therefore BSD-licensed. Unnecessary functions have been removed and all typedefs required have been added.

/*-
 * Copyright (c) 1986, 1988, 1991, 1993
 *	The Regents of the University of California.  All rights reserved.
 * (c) UNIX System Laboratories, Inc.
 * All or some portions of this file are derived from material licensed
 * to the University of California by American Telephone and Telegraph
 * Co. or Unix System Laboratories, Inc. and are reproduced herein with
 * the permission of UNIX System Laboratories, Inc.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	@(#)subr_prf.c	8.3 (Berkeley) 1/21/94
 */

typedef unsigned long size_t;
typedef long ssize_t;
#ifdef __64BIT__
typedef unsigned long long uintmax_t;
typedef long long intmax_t;
#else
typedef unsigned int uintmax_t;
typedef int intmax_t;
#endif
typedef unsigned char u_char;
typedef unsigned int u_int;
typedef unsigned long u_long;
typedef unsigned short u_short;
typedef unsigned long long u_quad_t;
typedef long long quad_t;
typedef unsigned long uintptr_t;
typedef long ptrdiff_t;
#define NULL ((void*)0)
#define NBBY    8               /* number of bits in a byte */
char const hex2ascii_data[] = "0123456789abcdefghijklmnopqrstuvwxyz";
#define hex2ascii(hex)  (hex2ascii_data[hex])
#define va_list __builtin_va_list
#define va_start __builtin_va_start
#define va_arg __builtin_va_arg
#define va_end __builtin_va_end
#define toupper(c)      ((c) - 0x20 * (((c) >= 'a') && ((c) <= 'z')))
static size_t
strlen(const char *s)
{
	size_t l = 0;
	while (*s++)
		l++;
	return l;
}

/* Max number conversion buffer length: a u_quad_t in base 2, plus NUL byte. */
#define MAXNBUF	(sizeof(intmax_t) * NBBY + 1)

/*
 * Put a NUL-terminated ASCII number (base <= 36) in a buffer in reverse
 * order; return an optional length and a pointer to the last character
 * written in the buffer (i.e., the first character of the string).
 * The buffer pointed to by `nbuf' must have length >= MAXNBUF.
 */
static char *
ksprintn(char *nbuf, uintmax_t num, int base, int *lenp, int upper)
{
	char *p, c;

	p = nbuf;
	*p = '\0';
	do {
		c = hex2ascii(num % base);
		*++p = upper ? toupper(c) : c;
	} while (num /= base);
	if (lenp)
		*lenp = p - nbuf;
	return (p);
}

/*
 * Scaled down version of printf(3).
 *
 * Two additional formats:
 *
 * The format %b is supported to decode error registers.
 * Its usage is:
 *
 *	printf("reg=%b\n", regval, "*");
 *
 * where  is the output base expressed as a control character, e.g.
 * \10 gives octal; \20 gives hex.  Each arg is a sequence of characters,
 * the first of which gives the bit number to be inspected (origin 1), and
 * the next characters (up to a control character, i.e. a character <= 32),
 * give the name of the register.  Thus:
 *
 *	kvprintf("reg=%b\n", 3, "\10\2BITTWO\1BITONE\n");
 *
 * would produce output:
 *
 *	reg=3
 *
 * XXX:  %D  -- Hexdump, takes pointer and separator string:
 *		("%6D", ptr, ":")   -> XX:XX:XX:XX:XX:XX
 *		("%*D", len, ptr, " " -> XX XX XX XX ...
 */
int
kvprintf(char const *fmt, void (*func)(int, void*), void *arg, int radix, va_list ap)
{
#define PCHAR(c) {int cc=(c); if (func) (*func)(cc,arg); else *d++ = cc; retval++; }
	char nbuf[MAXNBUF];
	char *d;
	const char *p, *percent, *q;
	u_char *up;
	int ch, n;
	uintmax_t num;
	int base, lflag, qflag, tmp, width, ladjust, sharpflag, neg, sign, dot;
	int cflag, hflag, jflag, tflag, zflag;
	int dwidth, upper;
	char padc;
	int stop = 0, retval = 0;

	num = 0;
	if (!func)
		d = (char *) arg;
	else
		d = NULL;

	if (fmt == NULL)
		fmt = "(fmt null)\n";

	if (radix < 2 || radix > 36)
		radix = 10;

	for (;;) {
		padc = ' ';
		width = 0;
		while ((ch = (u_char)*fmt++) != '%' || stop) {
			if (ch == '\0')
				return (retval);
			PCHAR(ch);
		}
		percent = fmt - 1;
		qflag = 0; lflag = 0; ladjust = 0; sharpflag = 0; neg = 0;
		sign = 0; dot = 0; dwidth = 0; upper = 0;
		cflag = 0; hflag = 0; jflag = 0; tflag = 0; zflag = 0;
reswitch:	switch (ch = (u_char)*fmt++) {
		case '.':
			dot = 1;
			goto reswitch;
		case '#':
			sharpflag = 1;
			goto reswitch;
		case '+':
			sign = 1;
			goto reswitch;
		case '-':
			ladjust = 1;
			goto reswitch;
		case '%':
			PCHAR(ch);
			break;
		case '*':
			if (!dot) {
				width = va_arg(ap, int);
				if (width < 0) {
					ladjust = !ladjust;
					width = -width;
				}
			} else {
				dwidth = va_arg(ap, int);
			}
			goto reswitch;
		case '0':
			if (!dot) {
				padc = '0';
				goto reswitch;
			}
		case '1': case '2': case '3': case '4':
		case '5': case '6': case '7': case '8': case '9':
				for (n = 0;; ++fmt) {
					n = n * 10 + ch - '0';
					ch = *fmt;
					if (ch < '0' || ch > '9')
						break;
				}
			if (dot)
				dwidth = n;
			else
				width = n;
			goto reswitch;
		case 'b':
			num = (u_int)va_arg(ap, int);
			p = va_arg(ap, char *);
			for (q = ksprintn(nbuf, num, *p++, NULL, 0); *q;)
				PCHAR(*q--);

			if (num == 0)
				break;

			for (tmp = 0; *p;) {
				n = *p++;
				if (num & (1 << (n - 1))) {
					PCHAR(tmp ? ',' : '<');
					for (; (n = *p) > ' '; ++p)
						PCHAR(n);
					tmp = 1;
				} else
					for (; *p > ' '; ++p)
						continue;
			}
			if (tmp)
				PCHAR('>');
			break;
		case 'c':
			PCHAR(va_arg(ap, int));
			break;
		case 'D':
			up = va_arg(ap, u_char *);
			p = va_arg(ap, char *);
			if (!width)
				width = 16;
			while(width--) {
				PCHAR(hex2ascii(*up >> 4));
				PCHAR(hex2ascii(*up & 0x0f));
				up++;
				if (width)
					for (q=p;*q;q++)
						PCHAR(*q);
			}
			break;
		case 'd':
		case 'i':
			base = 10;
			sign = 1;
			goto handle_sign;
		case 'h':
			if (hflag) {
				hflag = 0;
				cflag = 1;
			} else
				hflag = 1;
			goto reswitch;
		case 'j':
			jflag = 1;
			goto reswitch;
		case 'l':
			if (lflag) {
				lflag = 0;
				qflag = 1;
			} else
				lflag = 1;
			goto reswitch;
		case 'n':
			if (jflag)
				*(va_arg(ap, intmax_t *)) = retval;
			else if (qflag)
				*(va_arg(ap, quad_t *)) = retval;
			else if (lflag)
				*(va_arg(ap, long *)) = retval;
			else if (zflag)
				*(va_arg(ap, size_t *)) = retval;
			else if (hflag)
				*(va_arg(ap, short *)) = retval;
			else if (cflag)
				*(va_arg(ap, char *)) = retval;
			else
				*(va_arg(ap, int *)) = retval;
			break;
		case 'o':
			base = 8;
			goto handle_nosign;
		case 'p':
			base = 16;
			sharpflag = (width == 0);
			sign = 0;
			num = (uintptr_t)va_arg(ap, void *);
			goto number;
		case 'q':
			qflag = 1;
			goto reswitch;
		case 'r':
			base = radix;
			if (sign)
				goto handle_sign;
			goto handle_nosign;
		case 's':
			p = va_arg(ap, char *);
			if (p == NULL)
				p = "(null)";
			if (!dot)
				n = strlen (p);
			else
				for (n = 0; n < dwidth && p[n]; n++)
					continue;

			width -= n;

			if (!ladjust && width > 0)
				while (width--)
					PCHAR(padc);
			while (n--)
				PCHAR(*p++);
			if (ladjust && width > 0)
				while (width--)
					PCHAR(padc);
			break;
		case 't':
			tflag = 1;
			goto reswitch;
		case 'u':
			base = 10;
			goto handle_nosign;
		case 'X':
			upper = 1;
		case 'x':
			base = 16;
			goto handle_nosign;
		case 'y':
			base = 16;
			sign = 1;
			goto handle_sign;
		case 'z':
			zflag = 1;
			goto reswitch;
handle_nosign:
			sign = 0;
			if (jflag)
				num = va_arg(ap, uintmax_t);
			else if (qflag)
				num = va_arg(ap, u_quad_t);
			else if (tflag)
				num = va_arg(ap, ptrdiff_t);
			else if (lflag)
				num = va_arg(ap, u_long);
			else if (zflag)
				num = va_arg(ap, size_t);
			else if (hflag)
				num = (u_short)va_arg(ap, int);
			else if (cflag)
				num = (u_char)va_arg(ap, int);
			else
				num = va_arg(ap, u_int);
			goto number;
handle_sign:
			if (jflag)
				num = va_arg(ap, intmax_t);
			else if (qflag)
				num = va_arg(ap, quad_t);
			else if (tflag)
				num = va_arg(ap, ptrdiff_t);
			else if (lflag)
				num = va_arg(ap, long);
			else if (zflag)
				num = va_arg(ap, ssize_t);
			else if (hflag)
				num = (short)va_arg(ap, int);
			else if (cflag)
				num = (char)va_arg(ap, int);
			else
				num = va_arg(ap, int);
number:
			if (sign && (intmax_t)num < 0) {
				neg = 1;
				num = -(intmax_t)num;
			}
			p = ksprintn(nbuf, num, base, &tmp, upper);
			if (sharpflag && num != 0) {
				if (base == 8)
					tmp++;
				else if (base == 16)
					tmp += 2;
			}
			if (neg)
				tmp++;

			if (!ladjust && padc != '0' && width
			    && (width -= tmp) > 0)
				while (width--)
					PCHAR(padc);
			if (neg)
				PCHAR('-');
			if (sharpflag && num != 0) {
				if (base == 8) {
					PCHAR('0');
				} else if (base == 16) {
					PCHAR('0');
					PCHAR('x');
				}
			}
			if (!ladjust && width && (width -= tmp) > 0)
				while (width--)
					PCHAR(padc);

			while (*p)
				PCHAR(*p--);

			if (ladjust && width && (width -= tmp) > 0)
				while (width--)
					PCHAR(padc);

			break;
		default:
			while (percent < fmt)
				PCHAR(*percent++);
			/*
			 * Since we ignore an formatting argument it is no
			 * longer safe to obey the remaining formatting
			 * arguments as the arguments will no longer match
			 * the format specs.
			 */
			stop = 1;
			break;
		}
	}
#undef PCHAR
}

static void
putchar(int c, void *arg)
{
	/* add your putchar here */
}

void
printf(const char *fmt, ...)
{
	/* http://www.pagetable.com/?p=298 */
	va_list ap;

	va_start(ap, fmt);
	kvprintf(fmt, putchar, NULL, 10, ap);
	va_end(ap);
}

One thing you will have to do is #define or #undef __64__BIT to enable or disable support for printing 64 bit values. If you are on a 64 bit architecture, you will need this to print addresses properly. If you are on 32 bit, this support will require external library functions to do 64 bit arithmetic, so you probably want to turn this off.

On ARM, which doesn't have an assembly statement for division, will always need a library routine. You can use FreeBSD's at sys/libkern/arm/divsi3.S, if you to add the following #defines to make the file compile:

#define __FBSDID(a)
#define RET bx lr
#define ENTRY_NP(a) _##a: .global _##a
#define _KERNEL

All you need to do now to get printf() working is to fill the putchar() function, for example with a call to a serial_putc() implementation, if you have a serial port. On a PC, serial_putc() could look like this:

static void outb(short p, unsigned char a)
{
        asm volatile("outb %b0, %w1" : : "a" (a), "Nd" (p));
}
static unsigned char inb(short p)
{
        unsigned char a;
        asm volatile("inb %w1, %b0" : "=a" (a) : "Nd" (p));
        return a;
}
#define SERIAL_PORT 0x3f8
static void
serial_putc(unsigned char c) {
	while (!(inb(SERIAL_PORT + 5) & 0x20));
	outb(SERIAL_PORT, c);
}

On an "ARM Integrator/CP" board, the default for QEMU emulating ARM, the follwing serial_putc() would apply (make sure you are running the MMU off, or with the serial port MMIO region mapped):

#define SERIAL_PORT 0x16000000
static void
serial_putc(unsigned char c) {
	while ((*(volatile unsigned int *)(SERIAL_PORT + 0x18)) & 0x20);
	*(volatile unsigned int *)(SERIAL_PORT) = (c);
}

Unfortunately, sometimes you are bringing up or debugging a system that does not have a serial port - or at least you haven't found one: Apple Macintosh and various game systems come to mind. In this case, you might be able to set up a framebuffer and print to that. In a later post, I will present a small standalone framebuffer library.

Minimizing the Assembly needed for Machine Initialization

Monday, August 10th, 2009

In many operating systems, I have seen overly complicated startup code. Too much is done in assembly, and printf() and framebuffer access is only available very late. In the next three blog posts, I will show how this can be avoided.

In this post, I am showing how little assembly code is needed for startup. Minimizing the assembly makes your code significantly more maintainable. Everything that really needs to be done is setting up the CPU state to support 64 bit (or 32 bit) C code running at physical addresses, and everything else, including setting up the final machine state, can be done in C with a little inline assembly. The following example switches from 16 bit (real mode) or 32 bit mode (flat protected mode) into 64 bit mode on an x86_64 CPU (NASM syntax):

PAGE_SIZE	equ	0x1000
STACK_SIZE	equ	16*1024

PML2		equ	0x1000
PML3		equ	PML2+PAGE_SIZE
PML4		equ	PML3+PAGE_SIZE

STACK_BOTTOM	equ	PML4+PAGE_SIZE
STACK_TOP	equ	STACK_BOTTOM+STACK_SIZE

CODE_START	equ	STACK_TOP

	org CODE_START

	[BITS 16]
	cli

	; clear 3 pages of pagetables
	mov edi, PML2
	xor eax, eax
	mov ecx, 3*4096/4
	rep stosd

	; set up pagetables
	mov dword [PML2], 0x87		; single 4 MB id mapping PML2
	mov dword [PML3], PML2 | 7	; single entry at PML3
	mov dword [PML4], PML3 | 7	; single entry at PML4

	; load the GDT
	lgdt [gdt_desc]

	; set PSE,  PAE
	mov eax, 0x30
	mov cr4, eax

	; long mode
	mov ecx, 0xc0000080
	rdmsr
	or eax, 0x100
	wrmsr

	; enable pagetables
	mov eax, PML4
	mov cr3, eax

	; turn on long mode and paging
	mov eax, 0x80010001
	mov cr0, eax

	jmp SEL_CS:code64

code64:
	[BITS 64]
	mov ax, SEL_DS
	mov ds, ax
	mov es, ax
	mov fs, ax
	mov gs, ax
	mov ss, ax

	mov sp, STACK_TOP
	call start

inf:	jmp inf

gdt_desc:
	dw  GDT_LEN-1
	dd  gdt
	align 8
gdt	db 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ; 0x00 dummy
gdt_cs	db 0xff, 0xff, 0x00, 0x00, 0x00, 0x9b, 0xaf, 0x00 ; 0x08 code64
gdt_ds	db 0xff, 0xff, 0x00, 0x00, 0x00, 0x93, 0xaf, 0x00 ; 0x18 data64

GDT_LEN equ $ - gdt
SEL_CS equ gdt_cs - gdt
SEL_DS equ gdt_ds - gdt

While switching into 32 bit flat mode is trivial, switching into 64 bit mode requires setting up pagetables. This code sets up 4 MB of identity-mapped memory starting at address 0.

The code is designed to switch from 16 bit mode into 64 bit mode, but since 16 and 32 bit flat mode on i386 are assembly source compatible, you can replace “[BITS 16]” with “[BITS 32]“, and the code will switch from 32 bit to 64 bit mode. (Yes, it is possible to switch from real mode directly into 64 bit mode: osdev.org Forums)

If you use this code, be careful about the memory layout. The example leaves the first page in memory untouched (in case you need the original real mode BIOS IDT for something later), and occupies the next few pages for the pagetables and the stack. Your code should be above that, but, on a BIOS system, not be between 640 KB and 1 MB (this might be device memory and ROM), and also not above 1 MB before you have enabled the A20 gate.

While this code is enough to support 64 bit C code, this is not enough to set up the machine to support all aspects of an operating system. You probably want to set up your own GDT that has entries for 32 bit code and data too, you want to set up an IDT in order to be able to catch exceptions and interrupts, and you will need real pagetables to support virtual memory. Also, you will have to move your stack pointer once you have your final memory layout.

But it is possible to construct the overly complicated GDT, IDT and pagetable structures using readable C code, and the “lidt”, “lgdt” etc. instructions can be done in inline assembly. While this is not portable C code, it is possible to reuse large parts of the machine initialization for a 32 bit (i386) and a 64 bit (x86_64) version of the operating system, which is not as easy to get right in assembly.

In my next post, I am going to show how easy it is to get printf() working as soon as you reach C, so you don’t have to mess around with puts()- and print_hex()-like functions in early machine initialization.