How to know which CPU the current thread is running on in C [duplicate] - c

Presumably there is a library or simple asm blob that can get me the number of the current CPU that I am executing on.

Use sched_getcpu to determine the CPU on which the calling thread is running. See man getcpu (the system call) and man sched_getcpu (a library wrapper). However, note what it says:
The information placed in cpu is only guaranteed to be current at the time of the call: unless the CPU affinity has been fixed using sched_setaffinity(2), the kernel might change the CPU at any time. (Normally this does not happen because the scheduler tries to minimize movements between CPUs to keep caches hot, but it is possible.) The caller must be prepared to handle the situation when cpu and node are no longer the current CPU and node.
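For reference, a minimal sketch of the call (Linux, glibc 2.6 or later; as the quote says, the result may already be stale by the time you act on it):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    int cpu = sched_getcpu();       /* -1 on error, otherwise the CPU number */
    if (cpu == -1)
        perror("sched_getcpu");
    else
        printf("currently running on CPU %d\n", cpu);
    return 0;
}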

You need to do something like:
Call sched_getaffinity and identify the CPU bits
Iterate over the CPUs, doing sched_setaffinity to each one
(I'm not sure if after sched_setaffinity you're guaranteed to be on that CPU, or need to yield explicitly?)
Execute CPUID (asm instruction)... there is a way of getting a unique per-core ID out of one of its outputs (see Intel docs). I vaguely recall it's the "APIC ID". (A rough sketch follows after this list.)
Build a table (a std::map ?) from APIC IDs to a CPU number or affinity mask or something.
If you did this on your main thread, don't forget to use sched_setaffinity to set the mask back to all CPUs!
Now you can CPUID again whenever you need to and look up which core you're on.
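A rough sketch of the CPUID step with GCC's <cpuid.h> (leaf 1 returns the initial APIC ID in bits 24-31 of EBX; on hyper-threaded parts this identifies a logical processor, and you would still need the affinity steps above to build the table):

#include <cpuid.h>    /* GCC/clang wrapper around the CPUID instruction */
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 not supported\n");
        return 1;
    }
    unsigned int apic_id = ebx >> 24;   /* initial APIC ID of the executing logical CPU */
    printf("initial APIC ID: %u\n", apic_id);
    return 0;
}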
But I'd query why you need to do this; normally you want to take control via sched_setaffinity rather than finding out which core you're on (and even that's a pretty rare thing to want/need). (That's why I don't know the crucial detail of what to pull out of CPUID exactly, sorry!)
Update: Just learned about sched_getcpu from litb's response here. Much better! (my Debian/etch libc is too old to have it though).

I don't know of anything to get your current core id. With kernel level task/process migration, you wouldn't be guaranteed that it would remain constant for any length of time, unless you were running in some form of real-time mode.
If you want to be on a specific core, you can use the sched_setaffinity() function or the taskset command to launch your program. I believe that these need elevated permissions to work, though. In your program, you could then run sched_getaffinity() to see the mask that was set earlier and use that as a best guess at the core on which you are executing.
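For example, a minimal sketch of reading the mask back (Linux; pid 0 means the calling thread):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_getaffinity");
        return 1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf("allowed to run on CPU %d\n", cpu);
    return 0;
}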

sysconf(_SC_NPROCESSORS_ONLN);  /* number of processors currently online, not the current CPU */

Related

Fork()ing and running on specific set of CPUs

I have a parent process, which I use to spawn a series of child processes, each of which runs its own program sequentially. Each of these programs changes a file over time; I want to read the data from this file and see how it changes as each program runs.
I need two sets of data for this to work: the value of the file at some set interval (I haven't decided on the interval yet), and the time each program takes to run. There are other variables that can influence the execution times of these programs, which I want to see as well.
So I figured that, to get more accurate timing of the child processes while still reading from the file, I could run them on different cores. I have 8 cores; I would like to run the parent process on 0-3, then fork the children to run on 4-7. I'm not sure if this is possible within C, though, and a search around hasn't yielded any answers, which makes me think it isn't.
Within Linux, outside of a program, I can use taskset to do this.
I plan on setting aside 4 of the cores using the kernel parameter isolcpus. I want as little noise as possible while running the child programs.
Asking the kernel to associate CPU cores with threads or processes is also known as setting the "affinity" between the core and the process/thread.
Under Linux, there exists a set of functions that provide this capability. Take a look at the manual page for one of the functions...
man pthread_setaffinity_np
This family of API calls might be able to give you what you need.
That man page has a "see also" section that links to the other functions in this family.
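As a rough sketch of how this family is typically used (it is a GNU extension, hence the _np suffix and _GNU_SOURCE; here the calling thread is restricted to the cores 4-7 from the question; compile with -pthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 4; cpu <= 7; cpu++)   /* the child set from the question */
        CPU_SET(cpu, &set);

    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
        return 1;
    }
    printf("now restricted to CPUs 4-7, currently on CPU %d\n", sched_getcpu());
    return 0;
}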
Typically with features such as these that deal with kernel process and thread scheduling, it is entirely dependent on what mood the kernel is in at the time as to whether your requests are met or ignored. Your mileage may vary due to system load or the number of available cores. Even if a system has 16 cores, these features may be disabled in the kernel compilation settings (think virtual machines). Equally, you may find that there are some additional options that you may be able to add to your kernel to get better results than the defaults.

Is there a way to suspend OS scheduling for the duration of a program?

I have an assignment where I am analyzing the runtime of various sorting algorithms. I have written the code but I think it's an unfair comparison.
My code basically grabs the clock time before and after the sorting is finished to compute the elapsed time. However, what if the OS decides to interrupt more frequently during the runtime of a specific sorting algorithm, or if it rather decides that some other background application should be given more of the time domain when its thread comes back up?
I am not a CS major so I may not be entirely correct here, but from what I've read previously I was concerned this might have an impact on the results.
I also realize that if OS scheduling is suspended and the program hangs then there might be a serious problem; I am just wondering if it is possible.
Normally, there's no real reason for it. The scheduler will slightly increase the execution time, but if the code runs for a few seconds, the change will be tiny.
So unless you're running heavy applications on the same computer, the amount of noise this will add to your tests is negligible.
In Linux, you can use the isolcpus kernel parameter to mark CPUs that won't be used by the scheduler. You can find information here. I'm not sure what the minimal kernel version is.
If you use it, you'll need to use sched_setaffinity to put your thread on an isolated CPU, because the scheduler won't put it there on its own.
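A sketch of that step, assuming CPU 3 was put in the isolcpus= list at boot:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                     /* assumed isolated via isolcpus=3 */

    if (sched_setaffinity(0, sizeof(set), &set) == -1) {   /* pid 0 = calling process */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run the code being benchmarked here ... */
    return 0;
}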
It is not possible, not in user space code. Otherwise, any malicious process could steal the CPU from others.
If you want precise time counting for your process only, I suggest using the time command. You can read about it here: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Quick answer: you are most likely interested in user time, assuming your code doesn't make heavy use of syscalls (which would be rather strange for a sorting algorithm).
On an up-to-date POSIX system (basically Linux) you can use clock_gettime with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID if you make sure the process doesn't migrate between CPUs (you can set its affinity for example).
The difference between the times returned by clock_gettime with those arguments gives the exact time the process/thread spent executing. The only pitfall, as I mentioned, is process migration, as the man page says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
This means that you don't really need to suspend all other processes just to measure the execution time of your program.
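A minimal sketch of the measurement, with a stand-in workload in place of the sorting algorithm under test (on glibc older than 2.17, link with -lrt):

#include <stdio.h>
#include <time.h>

/* stand-in workload; substitute the sorting algorithm being measured */
static void workload(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; i++)
        x += i;
}

int main(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    workload();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    double cpu_seconds = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("CPU time consumed: %.6f s\n", cpu_seconds);
    return 0;
}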

Measuring CPU clocks consumed by a process

I have written a program in C. It's a program created as a result of research. I want to compute the exact number of CPU cycles which the program consumes. The exact number of cycles.
Any idea how I can find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
No you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It uses various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of its use in the article, Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
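For what it's worth, a sketch using the compiler intrinsic rather than raw assembly (GCC/clang, <x86intrin.h>), subject to the migration and serialisation caveats above:

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    unsigned long long start = __rdtsc();

    /* ... block of code to measure; a dummy loop here ... */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        x += i;

    unsigned long long end = __rdtsc();
    printf("elapsed TSC ticks: %llu\n", end - start);
    return 0;
}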
Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing an interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).

How can you find the processor number a thread is running on?

I have a memory heap manager which partitions the heap into different segments based on the number of processors on the system. Memory can only be allocated on the partition that goes with the currently running thread's processor. This should allow different processors to keep running even if two of them want to allocate memory at the same time, or at least I believe so.
I have found the function GetCurrentProcessorNumber() for Windows, but this only works on Windows Vista and later. Is there a method that works on Windows XP?
Also, can this be done with pthreads on a POSIX system?
From output of man sched_getcpu:
NAME
sched_getcpu - determine CPU on which the calling thread is running
SYNOPSIS
#define _GNU_SOURCE
#include <utmpx.h>
int sched_getcpu(void);
DESCRIPTION
sched_getcpu() returns the number of the CPU
on which the calling thread is currently executing.
RETURN VALUE
On success, sched_getcpu() returns a non-negative CPU number.
On error, -1 is returned and errno is set to indicate the error.
SEE ALSO
getcpu(2)
Unfortunately, this is Linux specific. I doubt there is a portable way to do this.
For XP, a quick google has revealed this:
https://www.cs.tcd.ie/Jeremy.Jones/GetCurrentProcessorNumberXP.htm Does this help?
In addition to Antony Vennard's answer and the code on the cited site, here is code that will work for Visual C++ x64 as well (no inline assembler):
#include <windows.h>
#include <intrin.h>   /* __cpuid intrinsic */

DWORD GetCurrentProcessorNumberXP() {
    int CPUInfo[4];
    __cpuid(CPUInfo, 1);
    // CPUInfo[3] is EDX; bit 9 indicates an on-chip APIC
    if ((CPUInfo[3] & (1 << 9)) == 0) return (DWORD)-1; // no APIC on chip
    // CPUInfo[1] is EBX, bits 24-31 are the (initial) APIC ID
    return (unsigned)CPUInfo[1] >> 24;
}
A short look at the implementation of GetCurrentProcessorNumber() on Win7 x64 shows that they use a different mechanism to get the processor number, but in my (few) tests the results were the same for my home-brewed and the official function.
If all you want to do is avoid contention, you don't need to know the current CPU. You could just randomly pick a heap. Or you could have a heap per thread. Although you may get more or less contention that way, you would avoid the overhead of polling the current CPU, which may or may not be significant. Also check out Intel Threading Building Blocks' scalable_allocator, which may have already solved that problem better than you will.
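For the "heap per thread" idea, a minimal sketch using thread-local storage to pick a partition without ever asking which CPU you are on (HEAP_COUNT and the partition numbering are hypothetical; C11, compile with -std=c11):

#include <stdatomic.h>
#include <stdio.h>

#define HEAP_COUNT 8                      /* hypothetical number of heap partitions */

static atomic_uint next_heap;             /* round-robin assignment counter */
static _Thread_local int my_heap = -1;    /* partition chosen once per thread */

static int heap_index(void)
{
    if (my_heap < 0)
        my_heap = (int)(atomic_fetch_add(&next_heap, 1) % HEAP_COUNT);
    return my_heap;
}

int main(void)
{
    printf("this thread allocates from partition %d\n", heap_index());
    return 0;
}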
This design smells bad to me. You seem to be making the assumption that a thread will stay associated with a specific CPU. That is not guaranteed. Yes, a thread may normally stay on a single CPU, but it doesn't have to, and eventually your program will have a thread that switches CPUs. It may not happen often, but eventually it will. If your design doesn't take this into account, then you will most likely eventually hit some sort of hard-to-trace bug.
Let me ask this question, what happens if memory is allocated on one CPU and freed on another? How will your heap handle that?

How does sched_setaffinity() work?

I am trying to understand how the Linux syscall sched_setaffinity() works. This is a follow-on from my question here.
I have this guide, which explains how to use the syscall and has a pretty neat (working!) example.
So I downloaded the Linux 2.6.27.19 kernel sources.
I did a 'grep' for lines containing that syscall, and I got 91 results. Not promising.
Ultimately, I'm trying to understand how the kernel is able to set the instruction pointer for a specific core (or processor.)
I am familiar with how single-core-single-thread programs work. One might issue a 'jmp foo' instruction, and this basically sets the IP to the memory address of the 'foo' label. But when one has multiple cores, one has to say "fetch the next instruction at memory address foo, and set the instruction pointer for core number 2 to begin execution there."
Where, in the assembly code, are we specifying which core performs that operation?
Back to the kernel code: what is important here? The file 'kernel/sched.c' has a function called sched_setaffinity(), but it returns type "long", which is inconsistent with its manual page. So what is important here? Which of these modules shows the assembly instructions issued? What module is reading the 'task_struct', looking at the 'cpus_allowed' member, and then translating that into an instruction? (I've also thumbed through the glibc source, but I think it just makes a call to the kernel code to accomplish this task.)
sched_setaffinity() simply tells the scheduler which CPUs that process/thread is allowed to run on, then calls for a re-schedule.
The scheduler actually runs on each one of the CPUs, so it gets a chance to decide what task to execute next on that particular CPU.
If you're interested in how you can actually call some code on other CPUs, I suggest you take a look at smp_call_function_single(). In case we want to call something on another CPU, this calls generic_exec_single(). The latter simply adds the function to the target CPU's call queue and forces a reschedule through some IPI stuff (if the queue was empty).
The bottom line is: there is no actual SMP variant of the jmp instruction. Instead, code running on other CPUs cooperates in order to accomplish the task.
I think the thing you are not understanding is that the kernel is running on all the CPU cores. At every timer interrupt (~1000 per second), the scheduler runs on each CPU and chooses a process to run. There is no one CPU that somehow tells the others to start running a process. sched_setaffinity() works by just setting flags on the process. The scheduler reads these flags and will not run that process on its CPU if it is set not to.
Where, in the assembly code, are we specifying which core performs that operation?
There is no assembly involved here. Every task (thread) is assigned to a single CPU (or core, in your terms) at a time. To stop running on a given CPU and resume on another, the task has to "migrate". When a task migrates from one CPU to another, the scheduler picks the CPU that is most idle among the CPUs allowed by sched_setaffinity().
There are no magic assembly instructions issued. The kernel has a more low-level view of the hardware: each CPU is a separate object, which is very different from how it looks to user-space processes (in user space, CPUs are almost invisible).
Check this out: B Operating System Programming Guidelines
