In almost all assembly books you’ll find some nice tricks to do fast multiplications. E.g. instead of “imul eax, ebx, 3” you can do “lea eax, [ebx+ebx*2]”Â (ignoring flag effects). It’s pretty clear how this works. But how can we speed up, say, a division by 3? This is quite important since division is still a really slow operation. If you never thought or heart about this problem before, get pen and paper and try a little bit. It’s an interesting problem.
25 years after the introduction of the 32 bit Intel i386 CPU, all Intel compatibles still start up (and wake up!) in 16 bit stone-age mode, and they have to be switched into 32/64 bit mode to be usable.
A while ago, an engineer from a respectable company for low-level solutions (no names without necessity!) claimed that a certain company’s new 4-way SMP system had broken CPUs or at least broken firmware that didn’t set up some CPU features correctly: While on the older 2-way system, all CPUs returned the same features (using CPUID), on the 4-way system, two of the CPUs would return bogus data.
If you want to enable protected mode or paging on the i386/x86_64 architecture, you use CR0, which is short for control register 0. Makes sense. These are important system settings. But if you want to switch the pagetable format, you have to change a bit in CR4 (CR1 does not exist and CR2 and CR3 don’t hold control bits), if you want to switch to 64 bit mode, you have to change a bit in an MSR, oh, and if you want to turn on single stepping, that’s actually in your FLAGS. Also, have I mentioned that CR5 through CR15 don’t exist – except for CR8, of course?
I understand that there might be a good reason for Intel to add virtualization extensions to their CPU architecture. Instead of fixing the x86 architecture to (optionally) make it Popek-Goldberg compliant and have all critial instructions trap if not run in Ring 0, they added non-root mode, a very big hammer that allows me to switch my CPU state completely to that of the guest and switches back to my original host state on a certain event in the guest. Well, it’s a great toy for people who want to play with CPU internals.
Here is a new pagetable entry.
In many operating systems, I have seen overly complicated startup code. Too much is done in assembly, and printf() and framebuffer access is only available very late. In the next three blog posts, I will show how this can be avoided.
In some i386/x86_64 assembly code my coworker was working on, there was a macro like this:
Today’s puzzle is about some code behaving horribly wrong.
Try this in kernel mode:
UNIX, Windows NT, and all the operating systems in their class rely on virtual memory, or paging, in order to provide every process on the system a complete address space of its own. An easier way to protect processes from each other is segmentation: The 4 GB address space of a 32 bit CPU is divided into segments (consisting of a physical base address and a limit), one for each process, and every process may only access their own segment. This is what the 286 did.
Intel used some strange opcodes for the SSE3 instructions. All MMX/SSE opcodes use the 0x0f prefix (former “pop cs”). They soon noticed the the 0x0f area gets full, so they used the 0x66, 0xf2, 0xf3 prefix as modifiers. The basic rule is:
Most of the x86 instructions will automatically alter the flags depending on the result. Sometimes this is rather frustrating because you actually what to preserve the flags as long as possible, and sometimes you miss a “mov eax, ecx” which alters the flags. But at least it’s guaranteed that an instruction either sets the flags or it doesn’t touch them, independent of the actual operation… Or is it?
As we all know the x86-ISA has a lot of redundant instructions (ie. instructions with the same semantic but different opcodes). Sometimes this is unavoidable, sometimes it looks like bad design. But with SSE it gets really weird. Let’s say we want to perform xmm0 <- xmm0 & xmm1 (ie. bitwise and). Not an uncommon operation; but we have 3 different ways do archive this:
- andps xmm0, xmm1 (0f 54 c1)
- andpd xmm0, xmm1 (66 0f 54 c1)
- pand xmm0, xmm1 (66 0f db c1)
(Note that andpd/pand are SSE2 instructions)
Regarding the result in xmm0 these are really the same instructions. Now, why did Intel do this? First we’re going to inspect andps/andpd. Looking at the optimization manuals we get a hint: The ps/pd mark the target register to contain singles or doubles, so they should match the actual data you are operating on.