Category Archives: whines

Racism in Monstropolis

Sometimes, freeze-frame fun provides not fun, but sadness.

In the Pixar movie Monsters Inc., you can see the following file of a child at 12 min 40 sec:

Monsters scare children at night, and this is how they keep track of them. The file comes with a blueprint of the room, a list of date stamps, business-critical notes like “scared of snakes”, and the standard data like name, gender, age, and… uh, race??

Albert Lozano, age 8, seems to be “hispanic”, and for monsters, this is apparently a feature that is important to them.

Oh well, that’s Monstropolis, a world inhabited by monsters that scare little children. Modern societies, on the other hand, have understood that “race” is a detail that is just as useful to track as shoe size. Oh wait.


Leave security to security code. Or: Stop fixing bugs to make your software secure!

If you read about operating system security, it seems to be all about how many holes are discovered and how quickly they are fixed. If you look inside an OS vendor, you see lots of code auditing taking place. This assumes that all security holes can be found and fixed, and that they can be eliminated more quickly than new ones are added. Stop fixing bugs already, and take security seriously!

Recently, German publisher heise.de interviewed Felix “fefe” von Leitner, a security expert and CCC spokesperson, on Mac OS X security:

heise.de: Apple has put protection mechanisms like Data Execution Prevention, Address Space Layout Randomization and Sandboxing into OS X. Shouldn’t that be enough?

Felix von Leitner: All these are mitigations that make exploiting the holes harder but not impossible. And: The underlying holes are still there! They have to close these holes, and not just make exploiting them harder. (Das sind alles Mitigations, die das Ausnutzen von Lücken schwerer, aber nicht unmöglich machen. Und: Die unterliegenden Lücken sind noch da! Genau die muss man schließen und nicht bloß das Ausnutzen schwieriger machen.)

Security mechanisms make certain bugs impossible to exploit

A lot is wrong with this statement. First of all, the term “harder” is used incorrectly in this context. Making “exploiting holes harder but not impossible” would mean that an attacker has to put more effort into writing exploit code for a certain security-relevant bug, but achieves the same in the end. This is true for some special cases (with DEP on, certain code execution exploits can be converted into ROP exploits), but the whole point of mechanisms like DEP, ASLR and sandboxing is to make certain bugs impossible to exploit (e.g. directory traversal bugs can be made impossible with proper sandboxing) – while other bugs are unaffected (DEP can’t help against trashing of globals through an integer exploit). So mechanisms like DEP, ASLR and sandboxing make it harder to find exploitable bugs, not harder to exploit existing bugs. In other words: Every one of these mechanisms makes certain bugs non-exploitable, effectively decreasing the number of exploitable bugs in the system.

As a consequence, it does not matter whether the underlying bug is still there. It cannot be exploited. Imagine you have an application in a sandbox that restricts all file system accesses to /tmp – is it a bug if the application doesn’t check all user filenames redundantly? Does the US President have to lock the bedroom door in the White House, or can he trust the building to be secure? Of course, a point can be made for multiple layers of barriers in high-security systems where a single breach can be disastrous and fixing a hole is expensive (think: Xbox), but if you have to set priorities, it is smarter for the President to have security around the White House than to lock every door behind himself.

Symmetric and asymmetric security work

When an operating system company has to decide how to allocate its resources, it needs to be aware of symmetric and asymmetric work. Finding and fixing bugs is symmetric work: You are as efficient at finding and fixing bugs as attackers are at finding and exploiting them. For every hour you spend fixing bugs, attackers have to spend roughly one more hour searching for them. Adding mechanisms like ASLR is asymmetric work: It may take you 1000 hours to implement, but over time, it will waste more than 1000 hours of your attackers’ time – or make the attackers realize that it’s too much work and not worth attacking the system.

Leave security to security code

Divide your code into security code and non-security code. Security code needs to be written by people with a security background, who keep the design and implementation simple and maintainable, and who are aware of common security pitfalls. Non-security code is code that never deals with security; it can be written by anyone. If a non-security project requires a small module that deals with security (e.g. one that verifies a login), push it into a different process – which is then security code.

Imagine for example a small server application that just serves some data from your disk publicly. Attackers have exploited it to serve anything from disk or spread malware. Should you fix your application? Mind you, your application by itself has nothing to do with security. Why spend time on adding a security infrastructure to it, fixing some of the holes, ignoring others, and adding more, instead of properly partitioning responsibilities and having everyone do what they can do best: The kernel can sandbox the server so it can only access a single subdirectory, and it can’t write to the filesystem. And the server can stay simple as opposed to being bloated with security checks.
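For illustration, here is roughly what that partitioning could look like on a Unix-like system – a minimal sketch, with a made-up path and the classic chroot/setuid approach standing in for whatever sandboxing mechanism the kernel actually provides (seccomp, pledge, Seatbelt, …):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Before serving anything, the server locks itself into its data
     * directory and drops root. Everything after this point can stay
     * simple and security-unaware: even a directory traversal bug can
     * no longer reach anything outside /srv/public. */
    int main(void) {
        if (chroot("/srv/public") != 0 || chdir("/") != 0) {
            perror("chroot");
            return EXIT_FAILURE;
        }
        if (setuid(65534) != 0) { /* "nobody" */
            perror("setuid");
            return EXIT_FAILURE;
        }
        /* ... simple serving loop, no filename checks needed ... */
        return 0;
    }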

How many bugs in 1000 lines of code?

A lot of people seem to assume they can find and fix all bugs. But every non-trivial program contains at least one bug. The highly security-critical first-stage bootloader of the original Xbox was only 512 bytes in size and consisted of about 200 hand-written assembly instructions. It contained one design bug in the crypto system, as well as two code execution bugs, one of which could be exploited in two different ways. In the revised version, one of the code execution bugs was fixed, and the crypto system had been replaced with one that had a different exploitable bug. Now extrapolate this to a 10+ MB binary like an HTML5 runtime (a.k.a. web browser) and think about whether looking for holes and fixing them makes a lot of sense. And keep in mind that a web browser is not all security-critical assembly carefully handwritten by security professionals.

Conclusion

So stop looking for and fixing holes; it won’t make an impact. If the hackers find one, then instead of researching “how this could happen” and educating the programmers responsible for it, construct a system that mitigates these attacks without the worker code having to be security-aware. Leave security to security code.

Comparing Digital Video Downloads of Interlaced TV Shows

In the days of CRT monitors, TV shows used to be broadcast in interlaced mode, which is unsupported by modern flat-panel displays. All online streaming services and video stores provide progressive video, so they must deinterlace the data first. This article compares the deinterlacing strategies of Apple iTunes, Netflix, Microsoft Zune, Amazon VoD and Hulu by comparing their respective encodings of a Futurama episode.

If you have dealt with video formats before, you probably know about interlacing, a 1930s trick to achieve both high spatial and high temporal resolution at half the (analog) data rate: In NTSC countries, there are 60 fields per second (PAL: 50), and every field has half the vertical resolution of a full frame. When film footage at 24 frames per second has to be played at 30 fps (NTSC), every frame has to be shown 1.25 times – in other words, every fourth frame has to be shown twice. This introduces jerky motion (judder), but it can be improved by using the 60 Hz temporal resolution: Frame A gets shown for 2 fields, frame B for 3 fields, frame C for 2 fields, and so on. This way, every source frame gets shown for 2.5 fields on average, i.e. 1.25 frames – this method is called 2:3 pulldown (telecine).
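To make the pattern concrete, here is a minimal sketch (not from any real encoder) that prints which 24 fps source frame ends up in which of the 60 NTSC fields per second:

    #include <stdio.h>

    /* 2:3 pulldown: frame A is shown for 2 fields, B for 3, C for 2,
     * D for 3 -- 4 source frames become 10 fields, so 24 frames/sec
     * become exactly 60 fields/sec. */
    int main(void) {
        const int fields_per_frame[4] = { 2, 3, 2, 3 };
        int field = 0;
        for (int frame = 0; frame < 8; frame++)
            for (int i = 0; i < fields_per_frame[frame % 4]; i++)
                printf("field %2d: source frame %c\n",
                       field++, 'A' + frame % 4);
        return 0;
    }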

A lot of TV material is produced at 24 fps and telecined, for several reasons: Standard movie cameras can be used instead of TV cameras, 24 fps can be converted to 25 fps PAL more easily than 30 fps NTSC, and for cartoons, this means that only 24 (or 12) frames have to be drawn for every second.

Unfortunately, interlacing only works with ancient CRT TVs – modern LCD screens can only show progressive video. And while DVDs are specified to encode interlaced video, more modern formats like MPEG-4/H.264 and VC-1 usually carry progressive data. So when playing DVDs, the DVD player or the TV has to deal with the interlacing problem, and in the case of modern file formats, it is the job of the converter/encoder.

The naive way of converting an interlaced source to progressive video is to combine every two fields into a frame. This works great if the original source material was 30 fps progressive (which is rare for NTSC but common for PAL), but for telecined video, two out of every five frames combine fields from two different source frames (the PPPII pattern mentioned below), which leads to ugly combing effects.

If the source material was 24 fps, an inverse telecine can be done, recovering the original 24 frames per second. Unfortunately, it is not always this easy, since interlaced video may switch between methods, and sometimes use different methods at the same time, e.g. overlaying 30 fps interlaced captions on top of a 24 fps telecined picture, or compositing two telecined streams with a different phase. “Star Trek: The Next Generation” is a famous offender in this category – just single-step through the title…

In the following paragraphs, let us look at an episode of Futurama and how the deinterlacing was done by the different providers of the show. Futurama was produced in 24 fps and telecined. Some of the editing seems to have been done on the resulting interlaced video, so the telecine pattern is not 100% consistent.

NTSC DVD

The NTSC DVD is basically just an MPEG-2-compressed version of the US CCIR 601 broadcast master. It encodes 720×480 anamorphic pixels (which can be displayed as 640×480 or 720×540) and has all the original interlacing intact. This is a frame at 640×480 and properly inverse telecined:

DVD

Hulu

Hulu

Hulu (480p version) took the original image without doing any cropping on the sides. You can clearly see this picture has only half the vertical resolution, meaning one of the fields got discarded. This seems to have been Hulu’s deinterlacing strategy, since throughout the complete video, everything is at half the vertical resolution, whether there is motion or not. This also keeps the video at 30 fps, effectively showing every fourth frame twice and introducing stronger judder.

iTunes

iTunes

iTunes crops the picture to get rid of the black pixels in the overscan area and scales it to 640×480. They run a full-blown 60 Hz deinterlace filter on the video. Such a filter is meant to take a live television signal as an input, with a temporal resolution of 60 Hz. While this looks fine on frames with no or little motion, vertical resolution is halved as soon as there is motion. Basically, it is the wrong filter. Like Hulu, iTunes preserves the 30 fps, introducing a stronger judder. (The video encoding is H.264 at 1500 kbit/sec.)

Netflix

Netflix

Netflix seems to do the same as iTunes – maybe they even got the data from iTunes? The image is cropped and scaled to 640×480, they run a deinterlace filter and retain the 30 fps, leading to halved resolution when there is motion, and stronger judder.

Amazon Video on Demand

Amazon Video on Demand

Amazon Video on Demand with its horribly inconvenient Unbox Player (Windows only, requires 1 GB of extra downloads and two reboots) did a better job. Like Netflix and iTunes, they cropped the picture and scaled it to 640×480, but they actually did a real inverse telecine. In some segments (like the end credits), the algorithm failed because of inconsistencies of the original telecine, so it reverted to half the vertical resolution. And like the others, Amazon also encodes at 30 fps, i.e. judder. (The video encoding is VC-1 at 2600 kbit/sec.)

Zune

Zune

Microsoft’s Zune Store provides a cropped video at 640×480 at the original 24 fps and with a bitrate of 1500 kbit/sec (VC-1). Stepping through it frame by frame reveals that they used a brilliant detelecine/deinterlace algorithm. On the DVD, the panning at the beginning of the “Robot Hell” song is very tricky: It breaks the standard telecine pattern (PPPIIPPPII becomes PPPIPPPI) – it seems every fifth frame was removed.




The pan consists of a pattern of three progressive frames, and then one interlaced frame, which is composed of the previous frame and the current frame. Consequently, every fourth frame has half its resolution wasted by the repeated lines of the previous frame, i.e. every fourth frame only exists at half resolution in the DVD master material.

Hulu discards half the vertical resolution for every frame anyway, and the deinterlacing algorithms of iTunes and Netflix discard half the resolution whenever there is motion. The Amazon algorithm does a good job when the telecine pattern is correct, but in this case, it gets confused and encodes all frames of the pan in half resolution. The Zune algorithm does a brilliant job here: The progressive frames stay at full resolution, and it extracts the half-resolution picture out of every fourth frame:




This is the fourth picture at full size – you can see half the vertical resolution is missing (it was never there in the first place!), but the algorithm did a very good interpolation job:

Robot Hell (Zune)

The Zune video is almost perfect. It recombines all fields correctly, recovers all single fields, and scales them up so that it is hardly visible that information is missing. If you ignore the 720 vs. 640 horizontal pixels, the resulting 24 fps video contains all the information of the DVD version, but with all interlacing removed, and with zero judder. Too bad it’s not H.264, but DRMed and only plays on Windows (XP+), Zune and Windows Phone 7.

Summary

Provider    Cropping  Resolution  Deinterlacing      fps  Encoder     Bitrate (kbit/sec)
NTSC DVD    no        720×480     none               30   MPEG-2      6500
Hulu        no        640×480     discard            30   H.264?      ?
iTunes      yes       640×480     30 Hz deinterlace  30   H.264       1500
Netflix     yes       640×480     30 Hz deinterlace  30   H.264/VC-1  ?
Amazon VoD  yes       640×480     detelecine+decomb  30   VC-1        2600
Zune        yes       640×480     fuzzy detelecine   24   VC-1        1500

Note: H.264 and VC-1 compress significantly better than MPEG-2; a rule of thumb is to divide the MPEG-2 bitrate by 2.3 to get a comparable H.264/VC-1 bitrate. So the Amazon bitrate is fine, and the video is about the same quality (sharp picture, no compression artifacts) as the DVD, but the iTunes and Zune versions are not (artifacts can be seen on single frames).

It is scary how little effort seems to go into video conversion/encoding at major players like iTunes, Netflix and Hulu. Amazon did a kind-of-okay job converting the source material, and only Microsoft did an excellent job. The NTSC DVDs still give you the maximum quality – but of course, if you watch them on an LCD, the burden of deinterlacing is on your side. Handbrake with “detelecine” (for the bulk of it) and “decomb” (for exceptions) turned on, and with a target framerate of “same as source”, will generate a rather good MP4 video similar to Amazon’s, but without the judder.

Are there any stores I missed? Can someone check the PAL DVD as well as digital PAL and NTSC broadcasts? What is the magical detelecine/deinterlace program Microsoft uses?

See also: Comparing Bittorrent Files of Interlaced TV Shows

The Intel 80376 – a Legacy-Free i386 (with a Twist!)

25 years after the introduction of the 32 bit Intel i386 CPU, all Intel compatibles still start up (and wake up!) in 16 bit stone-age mode, and they have to be switched into 32/64 bit mode to be usable.

Wouldn’t it be nice if a modern i386/x86_64 CPU started at least in 32 bit protected mode? Can’t they make a legacy-free CPU that does not support 16 bit mode at all? Such a CPU exists – well, existed: the Intel 80376 (1989–2001), an embedded version of the Intel i386.

The datasheet describes all the interesting differences. The 80376 does not support any 16 bit mode, so the “D” bit in segment descriptors must be set to 1 (page 25), forcing 32 bit code and data segments. 286-style descriptors are not supported either (page 27). (The 0x66 and 0x67 opcode prefixes still exist, so code can work on 16 bit registers and generate 16 bit addresses (page 14), just like an i386 in 32 bit mode.)

Since the CPU does not support 16 bit modes, it cannot do real mode, so CR0.PE is always 1. Consequently, an 80376 starts up in 32 bit protected mode, but otherwise, startup is just like on the i386 (page 19): EIP is 0x0000FFF0, CS is 0xF000, CS.BASE is 0xFFFF0000, CS.LIMIT is 0xFFFF, and the other segment registers are 0x0000, with a base of 0x00000000 and a limit of 0xFFFF. No GDT is set up, and in order to get the system into a sane state, loading a GDT and reloading the segment registers is still necessary. Too bad they didn’t set all bases to 0, all limits to 0xFFFFFFFF and EIP to 0xFFFFFFF0.

The 80376 is designed to be forward-compatible with the i386, so unsupported features are documented as “reserved” or “must be 0/1”, and legacy properties like the garbled segment descriptors are unchanged. All (properly written) 80376 software should also run on an i386 (page 1) – except for the first few startup instructions of course. Intel provides the following code sequence (page 20) that is to be executed directly after RESET to distinguish between the 80376 and the i386:

smsw bx        ; store the Machine Status Word (the lower 16 bits of CR0) in BX
test bl, 1     ; bit 0 is PE
jnz is_80376   ; hardwired to 1 on the 80376, 0 after RESET on an i386

This tests for CR0.PE, which is hardcoded to 1 on the 80376 and is 0 on RESET on an i386. The three instructions are bitness agnostic, i.e. the encoding is identical in 16 and 32 bit mode.

Sounds like the perfect CPU? Well, here comes the catch: The 80376 doesn’t do paging. CR2 and CR3 don’t exist (it is undocumented whether accessing them causes an exception), CR0.PG is hardcoded to 0 (page 8) and the #PF exception does not exist (page 17). A man can dream though… a man can dream.

For Lisa, the World Ended in 1995

If you try to set the clock in Lisa OS 3.1 to 2010, you’re out of luck:

You can only enter years from 1981 to 1995. That’s a span of 15 years – why? And what happens if the clock runs past the end of 1995?

Well, it wraps around to 1 Jan 1980.

But why does it not allow entering 1980 then? Here is why:

Whenever the clock is set to 1980, Lisa OS thinks the clock has not been set up properly. So the year is apparently stored as a 4 bit offset from 1980: the value 0 (1980) means “not set”, which leaves 1981 through 1995 as valid years. Too bad – a 5 bit counter could have made it into 2011, and we all know that’s way more than ever needed.
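Here is a minimal sketch of that presumed encoding (my reconstruction, not actual Lisa OS code):

    #include <stdio.h>

    /* Assumption: the year is stored as a 4 bit offset from 1980,
     * and the value 0 doubles as the "clock not set" marker. */
    int main(void) {
        for (unsigned int counter = 0; counter < 16; counter++) {
            if (counter == 0)
                printf("%2u -> 1980 (clock not set)\n", counter);
            else
                printf("%2u -> %u\n", counter, 1980 + counter);
        }
        return 0;
    }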

CPUID on all CPUs (HOWNOTTO)

A while ago, an engineer from a respectable company for low-level solutions (no names without necessity!) claimed that a certain company’s new 4-way SMP system had broken CPUs or at least broken firmware that didn’t set up some CPU features correctly: While on the older 2-way system, all CPUs returned the same features (using CPUID), on the 4-way system, two of the CPUs would return bogus data.

I asked for his test code. It ran in kernel mode and looked roughly like this:

    int cpu_features[4];

    for (int i = 0; i < 100000; i++) {
        cpu_features[get_cur_cpu_number()] = cpuid_get_features();
        usleep(100);
    }

    for (int j = 0; j < 4; j++)
        printf("CPU %d features: %xn", j, cpu_features[j]);

Questions to the reader:

  1. What was the original idea, what is the algorithm?
  2. Why did this work on a 2-way system, but not on a 4-way system?
  3. Which two changes would at least make this code correct (albeit still horrible)?
  4. How would you do it correctly?
  5. Would you buy software from this company?

Why is there no CR1 – and why are control registers such a mess anyway?

If you want to enable protected mode or paging on the i386/x86_64 architecture, you use CR0, which is short for control register 0. Makes sense. These are important system settings. But if you want to switch the pagetable format, you have to change a bit in CR4 (CR1 does not exist and CR2 and CR3 don’t hold control bits), if you want to switch to 64 bit mode, you have to change a bit in an MSR, oh, and if you want to turn on single stepping, that’s actually in your FLAGS. Also, have I mentioned that CR5 through CR15 don’t exist – except for CR8, of course?

Like many (but unfortunately not all) quirks of the i386/x86_64 architecture, this mess can be explained with history.

8086 – FLAGS

x86 history typically starts with the 16 bit 8086: although it was not binary compatible with its predecessor, it was nevertheless a rather straightforward assembly-level compatible 16 bit extension of the 8 bit Intel 8080, with some ideas from the Zilog Z80. The 8086 was still a classic “home computer class” CPU, not meant for modern operating systems: It had no MMU of any kind, and no concept of privileged and unprivileged modes. Therefore, control bits that we see as system state today were encoded into the 16 bit FLAGS register: The interrupt enable bit and the trap flag (which causes a software interrupt after the next instruction and thus lets you single-step) sit right next to the ALU’s flags like Zero and Carry.

80286 – Machine Status Word

The 80286 then came with a simple form of memory management that allowed more sophisticated (but not yet “modern”) operating systems to run – like the original versions of OS/2. The 16 bit “Machine Status Word” was created to host the big switch between legacy mode (real mode) and the new memory-managed mode (protected mode), and a program could access it using the new instructions “lmsw” and “smsw”. The 80286 had more system state than just this bit: The GDT, the IDT and the TSS had their own registers and dedicated instructions to access them (“lgdt”/“sgdt”, “lidt”/“sidt”, “ltr”/“str”).

i386 – Control Registers

The i386 finally had a real MMU that allowed paging and thus modern operating systems. The MMU required two more registers in the system state, one for the base address of the pagetables, and one to read a fault address from. Intel decided against adding more special-purpose registers with dedicated accessor instructions, and instead introduced eight indexed 32 bit wide “control registers” CR0 to CR7. The new accessors “mov crN, r32”/“mov r32, crN” allowed copying between registers and control registers and had the 3 bit CR index encoded in the opcode.
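In C, accessing a control register looks roughly like this – a minimal sketch using GCC inline assembly (ring 0 only; the accessor names are mine):

    /* Since the CR index is encoded in the opcode, every control
     * register needs its own pair of accessors. */
    static inline unsigned long read_cr0(void) {
        unsigned long val;
        asm volatile("mov %%cr0, %0" : "=r"(val));
        return val;
    }

    static inline void write_cr0(unsigned long val) {
        asm volatile("mov %0, %%cr0" : : "r"(val));
    }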

The old MSW was wired into the lower 16 bits of CR0, and CR0 was extended with new bits like the switch to turn on paging. CR1 was kept reserved, presumably as a second control register for miscellaneous control bits, and CR2 and CR3 were used for the aforementioned fault address and pagetable base pointer. Accessing a reserved control register generated an “invalid opcode” fault, making it possible for Intel to reuse the opcodes later, should the control registers stay unused.

i486 – CR4

The i486 added a few more control bits, and some of them went into CR0. But instead of overflowing the new bits into CR1, Intel decided to skip it and open up CR4 instead – for unknown reasons.

Pentium – MSRs

On the Pentium, Intel added for the first time control bits that were a property of the implementation as opposed to the architecture, i.e. bits that are microarchitecture-specific and will therefore only work on certain CPUs and not necessarily be supported on later CPUs – like caching details and debug settings. In order not to waste the valuable CR space with throw-away control bits, Intel introduced the Model-specific Registers (MSRs). The MSR address space is 32 bits, and every MSR is 64 bits wide. The two new instructions “rdmsr” and “wrmsr” copy between an ECX-indexed MSR and the EDX:EAX registers.
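A minimal sketch of the accessors (GCC inline assembly, ring 0 only; the function names are mine):

    /* rdmsr/wrmsr address the MSR by its index in ECX and transfer
     * the 64 bit value through EDX:EAX. */
    static inline unsigned long long rdmsr(unsigned int msr) {
        unsigned int lo, hi;
        asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return ((unsigned long long)hi << 32) | lo;
    }

    static inline void wrmsr(unsigned int msr, unsigned long long val) {
        asm volatile("wrmsr" : : "c"(msr),
                     "a"((unsigned int)val), "d"((unsigned int)(val >> 32)));
    }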

Pentium II – SYSENTER MSR

The SYSENTER instruction that was introduced on the Pentium II is a fast way to switch between unprivileged and privileged mode. Instead of looking up the destination segment, instruction pointer and stack pointer in memory, the CPU holds this information in three special-purpose system registers. CR space is valuable, so Intel decided against filling up CR5, CR6 and CR7 and put it into the MSR address space instead – at 0x174 through 0x176. This was practically an abuse of the MSR concept.

AMD K6 – EFER MSR

Who can blame AMD for doing similar things, then? With the K6, which was introduced at the same time as the Pentium II, AMD diverged from just copying Intel for the first time and actually added features of their own: They added the SYSCALL instruction, and with it, a control bit that turns it on and off, and an extra control register with the target location. Afraid of colliding with Intel extensions they didn’t know about, they put the extra system registers into the MSR space: the control register “EFER” (Extended Feature Enable Register) at 0xC000_0080 and the Syscall Target Register (STAR) at 0xC000_0081. Intel had been nicely lining up MSRs counting up from 0, so AMD decided to start counting at 0xC000_0080. Understandable as this is, it is basically the same abuse of the MSR concept as Intel’s with SYSENTER.

A very similar thing happened in the CPUID space, by the way: While Intel encoded all its feature bits in leaf 0x0000_0001, AMD defined leaf 0x8000_0001 for its features.
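As an aside, this is how software queries the AMD leaf – a minimal sketch; the best-known bit in it is the long mode flag (EDX bit 29 of leaf 0x8000_0001):

    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        /* leaf 0x8000_0001: AMD's extended feature flags */
        asm volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(0x80000001u));
        printf("long mode %ssupported\n",
               (edx & (1u << 29)) ? "" : "not ");
        return 0;
    }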

x86_64 – Chaos!

So far everything looked like it was getting a little more controlled. Both Intel and AMD were only adding new control registers in the MSR space, and since this is a big address space and the two extend it at opposite ends, it all looked nicer. But then came x86_64: For the first time, Intel was copying a feature that AMD introduced, and it needed to be compatible with all its details. AMD had encoded the availability of x86_64 in its own CPUID leaf 0x8000_0001, so Intel had to support this leaf as well. And since Long Mode was turned on in the EFER MSR, Intel had to support an MSR in the AMD space at 0xC000_0000. Long mode also required supporting SYSCALL, so Intel also supported the STAR MSR.

Since x86_64 introduced the REX prefix to double the number of available general-purpose registers, AMD decided to allow this prefix also for “mov cr”, doubling the number of control registers and therefore introducing CR8 through CR15 – also doubling their width. And since AMD introduced them, they owned them, and decided to use CR8 for the “Task Priority Register” feature.

VMX and SVM

The architecture is messy, sure, but does it matter? Maybe not… as long as CPUs didn’t have virtualization extensions! Both Intel VMX and AMD SVM are designed so that they can automatically switch the complete privileged machine state, including control registers and certain MSRs. Intel, for example, special-cases CR0, CR3, CR4 and CR8 and leaves CR2 to the user. AMD on the other hand has 16 fields for all CRs in its switcher. And because of the two different starting points of the MSR space, Intel VMX requires a whitelist bitmap for 8192 MSRs starting at 0x0000_0000 and another one for 8192 MSRs starting at 0xC000_0000 – and of course SYSENTER_CS, EFER, STAR and friends are special-cased. If you want to have a lot of fun, read the VMCS layout reference in Intel’s manual 3B!

Future?

  • CR1 and CR5 to CR7 are still “owned” by Intel. AMD has shown that they don’t want to use them – and even Intel has not added a control register since 1989.
  • CR9 through CR15 are technically owned by AMD, since they introduced them with x86_64 and decided to use CR8. Intel adopted the reserved ones when adopting x86_64, but it is unlikely that Intel will ever adopt smaller changes to the architecture from AMD, and AMD is unlikely to use them if they won’t be part of the architecture, so these will probably never be used either. On the other hand, AMD added these to the auto-switcher list of their SVM Virtual Machine Control Block (VMCB), showing that they haven’t given up on them yet.
  • The MSR space is de facto partitioned: Intel continues adding MSRs at 0 and AMD at 0xC000_0000 – but MSRs already lost their model-specificness back in 1997. MSRs are the new CRs.

Dear Intel, dear AMD: I like the control registers, and I hate to see them wasted. Why don’t you finally define CR1 and give it a few control bits in the future? If you’re scared about collisions, I will be happy to be the arbiter. Ah, whatever: Intel, you get to define all even bits in CR1, and AMD, you get to define all odd bits. Okay? Cool.

Intel VT VMCS Layout

I understand that there might be a good reason for Intel to add virtualization extensions to their CPU architecture. Instead of fixing the x86 architecture to (optionally) make it Popek-Goldberg compliant and have all critical instructions trap if not run in Ring 0, they added non-root mode, a very big hammer that allows me to switch my CPU state completely to that of the guest and switches back to my original host state on a certain event in the guest. Well, it’s a great toy for people who want to play with CPU internals.

Therefore Intel had to add the VMCS, a 4 KB block in memory that holds the complete CPU state of both the host and the guest (segment registers, GDT and IDT pointer, certain MSRs etc.) as well as some control bits (for example, when to exit).

I also understand that Intel doesn’t allow me to just read and write memory in the VMCS, but abstracts accessing the virtualization state using a vmread/vmwrite interface. This way, the actual layout of this 4 KB page is an implementation detail and can be changed on later CPUs. It also allows for field indexes that are more spread out and encode what kind of field it is.

So I understand very well why Intel encodes into the VMCS field index whether it’s a control field (0), a read-only field (1), part of the guest state (2) or part of the host state (3), and whether it’s a 16 bit (0), 32 bit (2), 64 bit (1) or native-sized (3) field. This way, for example, all 16 bit guest state fields (like the guest’s CS) have indexes starting at 0x0800, and all 64 bit host state fields (like the host’s EFER MSR) start at 0x2C00.
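Expressed in C, the encoding boils down to something like this (a sketch derived from the scheme above; the two example constants happen to match the real encodings of the guest CS selector and the host EFER field):

    /* width: 0 = 16 bit, 1 = 64 bit, 2 = 32 bit, 3 = natural size
     * type:  0 = control, 1 = read-only, 2 = guest, 3 = host
     * (bit 0 selects the high half of 64 bit fields) */
    #define VMCS_FIELD(width, type, index) \
        (((width) << 13) | ((type) << 10) | ((index) << 1))

    #define GUEST_CS_SELECTOR VMCS_FIELD(0, 2, 1)  /* 0x0802 */
    #define HOST_EFER         VMCS_FIELD(1, 3, 1)  /* 0x2C02 */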

Now what I don’t understand is why it was so hard to stay consistent with this convention (Intel Manual 3B, Appendix H).

  • VMCS Link Pointer (0x2800): In the first revision of VT, it had already been decided that there should be a mechanism for having a second 4 KB page in case later versions of VT need more than 4 KB of state. For this, there is the “VMCS Link Pointer”, which is a 64 bit physical address. Guess what category this belongs to? Guest state.
  • “Guest Address Space Size” bit in the “VM Entry Controls” Field (0x4012): This is clearly guest state and not a control field.
  • “Host Address Space Size” bit in the “VM Exit Controls” Field (0x400C): This is clearly host state and not a control field.
  • VMX-preemption timer value (0x482E): This timer controls after how many ticks execution of the guest should end and control should be returned to the hypervisor. Intel put this into the “guest state” bucket: All other guest state fields are properties of the i386/x86_64 architecture that need to be switched, but not this one. This should really be a control field.

And here is another favorite of mine: the “Primary Execution Controls” field. The 32 bits specify which events in the guest will exit guest execution and trap into the hypervisor (Table 21-6). These events are, among others:

  • exit on HLT
  • exit on INVLPG
  • exit on MOV CR3
  • exit on PAUSE

Setting these bits to 1 enables the traps. So if you set all bits to 0, you basically have an unrestricted guest, and if you set all bits to 1, you have the most controlled guest, and you get a notification about every event in the guest. Or so you might think. Actually, there are two bits in the field that don’t work like this:

  • Use MSR bitmaps
  • Use I/O bitmaps

If these bits are set to 1, a whitelist bitmap decides whether a certain MSR or I/O access traps. If they are set to 0, all MSR and I/O accesses trap. Compared to all other bits, that’s backwards. Oh great.

Since Steve Jobs seems to be happy to explain his personal opinion on everything lately, I wrote him an email asking him about this, and he replied:

Return-path: <sjobs@apple.com>
Received: from bulkin002-bge351000.mac.com ([unknown] [10.150.69.129])
 by ms231.mac.com
 (Sun Java(tm) System Messaging Server 7u3-12.01 64bit (built Oct 15 2009))
 with ESMTP id <0L2X00HTAZ3Q6GF1@ms231.mac.com> for XXX@mac.com; Mon,
 24 May 2010 13:47:50 -0700 (PDT)
Original-recipient: rfc822;XXX@mac.com
Received: from relay13.apple.com ([17.128.113.29])
 by bulkin002.mac.com (Sun Java(tm) System Messaging Server 6.3-7.02 (built Jun
 27 2008; 32bit)) with ESMTP id <0L2X001EVZ3QKED0@bulkin002.mac.com> for
 XXX@mac.com (ORCPT XXX@mac.com); Mon, 24 May 2010 13:47:50 -0700 (PDT)
X-AuditID: 1180721d-b7c17fe00000693e-19-4bfae5f6545a
Received: from [17.201.27.84]
	(using TLS with cipher AES128-SHA (AES128-SHA/128 bits))
	(Client did not present a certificate)	by relay13.apple.com (Apple SCV relay)
 with SMTP id DB.14.26942.6F6EAFB4; Mon, 24 May 2010 13:47:50 -0700 (PDT)
From: Steve Jobs <sjobs@apple.com>
Content-type: text/plain
Content-transfer-encoding: 7bit
Subject: Re: Intel VT VMCS Layout
Date: Mon, 24 May 2010 13:47:48 -0700
Message-id: <3E789F1B-7E13-FFD2-80F6-8E8D4CDDE7FB@apple.com>
To: Michael Steil <XXX@mac.com>
MIME-version: 1.0 (Apple Message framework v1077)
X-Mailer: Apple Mail (2.1077)
X-Brightmail-Tracker: AAAAAQAAAZE=

The whole VMCS is a big mess, I hate it.

> Hi Steve, what do you think about the ordering of the VMCS fields in
> Intel's VT extenions?
>
>   Michael

PCEPTPDPTE

Here is a new pagetable entry.

I like Intel. I told you before how Intel messed up the x86 register nomenclature by extending A to AX (A extended) and then to EAX (extended A extended). Then AMD came and extended the register once more, giving it a more sane name: RAX.

I also told you before how Intel messed up the x86 pagetable nomenclature: There were pagetables (PT, level 1) and page directories (PD, level 2) on the i386, and for the Pentium Pro, they added page directory pointers (PDP, level 3). Then AMD came and extended it once more, giving it a more sane name: page map level 4 (PML4).

With the advent of virtualization, both Intel and AMD added a feature to get rid of the slow software shadow pagetables, and added hardware support for nested pagetables, i.e. the guest has 4 levels of pagetables, and the host has another 4 levels.

AMD called these – surprise, surprise! – nested pagetables, NPT. Intel was more creative. With a history of extending architectures, they went with the big E: extended pagetables, EPT.

Let’s practice a bit: A PD is a page directory, a PDE is a page directory entry. You can also call it a PDPTE, a page directory pagetable entry (level 2 PTE), because after all, the entries on all levels are PTEs – they share the same format. A PDPPTE is a page directory pointer pagetable entry, aka a level 3 entry.

If we use nested paging – excuse me – extended paging on Intel, we need to prepend EPT to our nice little abbreviations. An EPTPTE is a level 1 entry, an EPTPDPTE is level 2, not to be confused with an EPTPDPPTE, which is level 3, and a level 4 entry is an EPTPML4PTE.

It gets even better. Oracle/Sun/Innotek VirtualBox uses Hungarian notation for its variable names, prepending “P” for pointer and “C” for constant. So what would you call a variable that is a pointer to a constant level 2 EPT entry?

Of course, PCEPTPDPTE.

/** Pointer to a const EPT Page Directory Pointer Entry. */
typedef const EPTPDPTE *PCEPTPDPTE;

I thought about this for a while, and considered patenting this brilliant idea of mine, but here it is, free of patents and free for everyone to use: Michael’s nomenclature for Intel/AMD pagetables:

new name  description                     old name
P4        pagetable level 4 page          PML4
P3        pagetable level 3 page          PDP
P2        pagetable level 2 page          PD
P1        pagetable level 1 page          PT
P4E       pagetable level 4 entry         PML4E/PML4PTE
P3E       pagetable level 3 entry         PDPE/PDPPTE
P2E       pagetable level 2 entry         PDE/PDPTE
P1E       pagetable level 1 entry         PTE
NP4       nested pagetable level 4 page   EPTPML4
NP3       nested pagetable level 3 page   EPTPDP
NP2       nested pagetable level 2 page   EPTPD
NP1       nested pagetable level 1 page   EPTPT
NP4E      nested pagetable level 4 entry  EPTPML4E/EPTPML4PTE
NP3E      nested pagetable level 3 entry  EPTPDPE/EPTPDPPTE
NP2E      nested pagetable level 2 entry  EPTPDE/EPTPDPTE
NP1E      nested pagetable level 1 entry  EPTPTE
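VirtualBox’s declaration from above would then shrink to something like this (hypothetical, of course):

    /** Pointer to a const nested pagetable level 2 entry. */
    typedef const NP2E *PCNP2E;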

You are welcome.