CPUID on all CPUs (HOWNOTTO)

A while ago, an engineer from a respectable company for low-level solutions (no names without necessity!) claimed that a certain company’s new 4-way SMP system had broken CPUs or at least broken firmware that didn’t set up some CPU features correctly: While on the older 2-way system, all CPUs returned the same features (using CPUID), on the 4-way system, two of the CPUs would return bogus data.

I asked for his test code. It ran in kernel mode and looked roughly like this:

    int cpu_features[4];

    for (int i = 0; i < 100000; i++) {
        cpu_features[get_cur_cpu_number()] = cpuid_get_features();
        usleep(100);
    }

    for (int j = 0; j < 4; j++)
        printf("CPU %d features: %xn", j, cpu_features[j]);

Questions to the reader:

  1. What was the original idea, what is the algorithm?
  2. Why did this work on a 2-way system, but not on a 4-way system?
  3. Which two changes would at least make this code correct (albeit still horrible)?
  4. How would you do it correctly?
  5. Would you buy software from this company?

21 thoughts on “CPUID on all CPUs (HOWNOTTO)”

  1. His idea was to keep running until he had a good chance of being scheduled on each cpu, and querying the current cpu each time. This is never going to work due to TOCTOU, if it ever worked on any non-up system then I’m impressed.

  2. Similar to what Anon says, I think he was trying to get the features of each CPU by pure chance, counting on usleep to get his scheduler to deschedule the thread, and (hopefully) resume it on a different CPU.

    I’m not that low-level of a programmer, but if his kernel is preemptable, then he has a race condition between when he gets his current CPU number and when he gets that CPU’s features.

    So, even if he managed to fill all four entries in the array, there’s no guarantee that the features noted at an array’s index correspond to the CPU that index was supposed to represent.

    As for why it works with 2 CPUs, and not 4? I don’t know what TOCTOU is, but I could venture a guess that the scheduler may be favoring rescheduling on the same CPU in a multicore setting. (i.e. if this ‘4-way’ system is a dual-proc, dual-core system, the scheduler may be favoring bouncing his thread between the same two cores.)

    The only way I can see this working is if he explicitly gave CPU affinity to separate threads, and grabbed the features once the thread was running.

    Faster, too; you wouldn’t spend 10 seconds waiting on results.

  3. I can’t believe someone would think the CPU is wrong after the results of that code and not thoroughly question the code they wrote first, that seems a little arrogant.

    He’s my stab at it, I know, it not pretty, and it might not even really work properly, I’ll admit, I didn’t test it. As long as this code is allowed to be distributed across all CPUs by the OS, it should work, as long as the while wont create a time-of-check-to-time-of-use (TOCTTOU) condition itself, which it may, in that case, I missed the boat on this lesson.

    #define NUM_CPU 4

    int cpu_features[NUM_CPU];

    for (int i = 0; i < NUM_CPU; i++) {
    while (get_cur_cpu_number() == i) {
    cpu_features[i] = cpuid_get_features();
    }
    }

    for (int j = 0; j < NUM_CPU; j++)
    printf("CPU %d features: %xn", j, cpu_features[j]);

  4. My implementaiton would look like this:

    int cpu_features[4] = {0}; /* Initializing helps */

    for (int i = 0; i < NUM_CPU; i++) {
    set_proc_priority(1 << i);
    schedule();
    cpu_features[i] = cpuid_get_features();
    }

    for (int j = 0; j < NUM_CPU; j++)
    printf("CPU %d features: %xn", j, cpu_features[j]);

  5. I’m also not a very low-level programmer (or at least, not very good at it, when complicated things like concurrency come into play anyway), so I was glad to learn a new acronym – TOCTTOU!

    However I’d suspect that Adam’s code is pretty similar to the original in that at the while condition test you check the CPU number but can be pre-empted and end up executing on another CPU when calling cpuid_get_features()?

    Presumably the way to fix this is to use some kind of multi-CPU-locking instruction (affinity mask?) to ensure that getting the CPU number and querying the CPUID information happens on the same processor?

  6. Haha. I have no idea what the functions does except from its name suggest. But i’ll try.

    I have no idea a for loop can be run in parallel without some intend from the programmer. Questions like what if the data affected in the loop is dependent on the previous, how does resolve such a situation come to my mind but I assume that that’s not a problem as the code doesn’t seems portrait a need to solve this problem.

    1. The original idea is to find out all the cpu (feature) available to the system. The sleep is important because it cause the cpu to sleep dimming it unavailable and the system will select a available cpu instead. The algorithm should be trying to activate each cpu and retrieve the cpu feature. Something like what @Seed is doing.

    2. Maybe because this implicitly let the system decide how many cpu to use, the system always try to use more than one cpu on loop. so it always work for 2 way system. On a 4 way system, the system maybe decides that the code doesn’t need to activate all 4 cpu and thus didn’t activate it? :P

    3. My guess is increase the sleeping time? :P Initialize the variable? Hmmmp… Maybe not using for loop. For loop doesn’t seems parallel to me. Or maybe make the loop call a function instead of inline code. as this give a hint to the system it can be parallel? :P Hehe okay I’m being greedy here. :P If I have to choose 2, I’ll take the last 2 answer. The first is kinda stupid. :P

    4. I find @Seed answer the more reasonable among the current available though. I personally would have no idea on how to accomplish this though.

    5. I don’t see the bad of the code thought, except that it might not be a good test code and it has magic number. But I don’t fully understand the situation so I can’t comment much. If a hard answer is need I’ll go for, I’m probably will still buy.

  7. 4. The kernel probably has a function that can provide the information. I would look for this function before trying to implement by own.

  8. My 2c:
    (2) I’d guess the default scheduler behaviour is to assign processor affinity to cores sharing the same L1 cache. Maybe in a 2-way system the cache is shared and thus there is a reasonable chance the thread would indeed execute on both cores. In the 4-way SMP (the guess continues) there are 2 separate caches, each assigned to 2 processors. Hence, the loop resides only on half the cores – and half the cpu_info never gets populated.
    (3) The shortest code-route to the original (questionable) idea would probably be to parallelize the loop, e.g. by openMP pragmas, to increase the chance of it using all available processors. Say –
    #pragma parallel
    for (int i = 0; i < NUM_CPU; i++) {
    cpu_features[get_cur_cpu_number()] = cpuid_get_features();
    }
    (4) I'd say the right way to go is to iteratively create worker threads, explicitly assign them affinity masks that cover each processor, and have them query the cpuid features.
    (5) If you're saying its a respectable company, there's a good chance I already do…

  9. Just to clarify, my thought with the while() was that the while condition would be checked again before becoming re-entrant after being pre-empted. I guess I was giving the compiler / OS too much credit, its not that smart.

  10. for (int i = 0; i < 100000; i++) {
    cpu_features[get_cur_cpu_number()] = cpuid_get_features();
    usleep(100);
    }

    funny.
    the 100.000 is a nice trick to maybe get called on each CPU, i agree.
    But who tells you, that the Sheduler does not send you to a different CPU between get_cur_cpu_number() and cpuid_get_features(); ?
    i think this is the Reason for usleep(100) — to make the shelduler gives you on passing this a new timeslice which gives you a higer chanche of doing it "correct" but not 100%

    I think the only correct way would be like this:

    for (int i = 0; i < 100000; i++) {

    int cpunum = get_cur_cpu_number();
    if (cpu_features[cpunum].IsEmtpy) {
    OS.stopSheduler(); // single USer Mode or so
    int validate = get_cur_cpu_number();
    if (validate == cpunum) {
    cpu_features[cpunum] = cpuid_get_features();
    }
    OS.startSheduler();
    }
    usleep(100);
    }

  11. Algorithm: Monte Carlo method for getting cpu-features for every core because thread synchronisation and core delegation wasn’t avariable or some logic under drug influences ????.

    Why it is working for 2-core cpu’s:
    The internal cores of Intels dualcore cpu’s are identical, so every CPUID opcode returning the same values regardless of sheduling (only a guess) ?

    Better handling:
    – use a probitary low-level vm with AOT or JIT compiler for platform independent pseudocode, replacing the OS and let this vm handling the cpu detection [so it’s not your fault and you can use all ressources with much less efficiency] ;)
    – realize Nirvana, so the problem isn’t of importance once and for all.

    – Probably better a functionality of the boot-loader
    – Probably better handled at assembler level because no ISA independent functionality is envolved
    – executing related code as synchronized core-depend thread if possible would help.

    That are my really serious remarks ;)

  12. Here’s my guess:

    1. He wanted to discover the features of each core by probabilistically scheduling his code on each core.

    2. Besides the obvious race, my guess is hyperthreading or some other form of SMT is doubling the number of logical CPUs, overflowing the cpu_features array and scribbling over some data.

    3, 4. Detect the actual number of (logical?) CPUs, lock and synchronize the critical section, and properly pin the process to each core in turn. Also the second operand to printf is signed, but the format specifier %x wants unsigned.

    5. hell no.

  13. queue_work_on() allows you to run some code on a particular cpu on linux.
    ex:
    http://lxr.free-electrons.com/source/kernel/stop_machine.c?v=2.6.28#L115

    131 get_cpu();
    132 for_each_online_cpu(i) {
    133 sm_work = percpu_ptr(stop_machine_work, i);
    134 INIT_WORK(sm_work, stop_cpu);
    135 queue_work_on(i, stop_machine_wq, sm_work);
    136 }
    137 /* This will release the thread on our CPU. */
    138 put_cpu();
    139 flush_workqueue(stop_machine_wq);
    140 ret = active.fnret;
    141 mutex_unlock(&lock);

  14. Without knowing anything about how the thread will be scheduled between CPUs, there’s still an obvious bug that even beginning programmers should catch: The cpu_features table is not initialized to known state before the loop, and hence you can’t even detect whether a particular core has been checked. Also, trusting something like this to random chance of scheduling is plain stupid.

  15. Warning: fopen() [function.fopen]: open_basedir restriction in effect. File(/tmp/cas_0gG7f5) is not within the allowed path(s): (/data/web/kunden/hessi/domains/asm/) in /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/custom_anti_spam.php on line 226

    Warning: fopen(/tmp/cas_0gG7f5) [function.fopen]: failed to open stream: Operation not permitted in /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/custom_anti_spam.php on line 226

    Warning: fwrite(): supplied argument is not a valid stream resource in /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/custom_anti_spam.php on line 227

    Warning: fclose(): supplied argument is not a valid stream resource in /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/custom_anti_spam.php on line 228

    Warning: file_exists() [function.file-exists]: open_basedir restriction in effect. File(/tmp/cas_0gG7f51.wav) is not within the allowed path(s): (/data/web/kunden/hessi/domains/asm/) in /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/custom_anti_spam.php on line 244

    Warning: file_get_contents() [function.file-get-contents]: open_basedir restriction in effect. File(/tmp/cas_0gG7f5) is not within the allowed path(s): (/data/web/kunden/hessi/domains/asm/) in /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/custom_anti_spam.php on line 248

    Warning: file_get_contents(/tmp/cas_0gG7f5) [function.file-get-contents]: failed to open stream: Operation not permitted in /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/custom_anti_spam.php on line 248

    Warning: Cannot modify header information – headers already sent by (output started at /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/custom_anti_spam.php:226) in /data/web/kunden/hessi/domains/asm/wp-content/plugins/custom-anti-spam/sounds/wav_join.php on line 82
    Some data has already been output to browser, can’t send wav file

  16. @hamtitampti:

    This is user-mode; you don’t get to stop the OS scheduler. If you can, get a better OS that won’t (this would be hilariously abusable in practice). You can ask the OS to put you on a specific CPU and it will oblige, and if you’re trying to get the CPUID of every core you’re probably trying to do something involving separate threads on different cores anyway (which uses the scheduler process affinity calls anyway).

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.