I'm writing a program in C on Windows that needs to run as many threads as there are available cores, but I don't know how to get the number of cores. Any ideas?
You can call the GetSystemInfo WinAPI function; it fills in a SYSTEM_INFO struct whose dwNumberOfProcessors member gives the number of logical processors (which matches the number of cores on a multi-core system without hyperthreading).
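A minimal sketch of that call (assuming a Windows build environment):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    /* dwNumberOfProcessors is the number of logical processors */
    printf("Logical processors: %lu\n", si.dwNumberOfProcessors);
    return 0;
}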
You can read the NUMBER_OF_PROCESSORS environment variable.
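A quick sketch of reading it from C (Windows sets this variable for every process):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *s = getenv("NUMBER_OF_PROCESSORS");
    int n = s ? atoi(s) : 1;   /* fall back to 1 if the variable is missing */
    printf("NUMBER_OF_PROCESSORS = %d\n", n);
    return 0;
}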
Type "cmd" on windows startup and open "cmd.exe".
Now type in the following command:
WMIC CPU Get /Format:List
You will find entries like "NumberOfCores" and "NumberOfLogicalProcessors". Typically the extra logical processors are provided by hardware threading (hyperthreading), so the relation would typically be:
NumberOfLogicalProcessors = NumberOfCores * Number-of-Threads-per-Core
In other words, each core is one physical processing unit, and hardware threading makes each core appear as several logical processing units.
More info here.
Even though that question deals with .NET and yours with C, the basic responses should help:
Detecting the number of processors
As @Changming-Sun mentioned in a comment above, GetSystemInfo returns the number of logical processors, which is not always the same as the number of processor cores. On machines that support hyperthreading (including most modern Intel CPUs) more than one thread can run on the same core (technically, more than one thread will have its thread context loaded on the same core). Getting the number of processor cores requires a call to GetLogicalProcessorInformation and a little bit of coding work. Basically, you get back a list of SYSTEM_LOGICAL_PROCESSOR_INFORMATION entries, and you have to count the number of entries with RelationProcessorCore set. A good example of how to code this is given in the GetLogicalProcessorInformation documentation provided by Microsoft:
https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformation
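A hedged sketch of that pattern (call once to learn the buffer size, allocate, call again, then count the RelationProcessorCore entries; error handling kept minimal):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    DWORD i, count, cores = 0;
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *buf;

    GetLogicalProcessorInformation(NULL, &len);      /* ask for the required size */
    buf = malloc(len);
    if (buf == NULL || !GetLogicalProcessorInformation(buf, &len)) {
        fprintf(stderr, "GetLogicalProcessorInformation failed\n");
        return 1;
    }

    count = len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
    for (i = 0; i < count; i++)
        if (buf[i].Relationship == RelationProcessorCore)
            cores++;

    printf("Physical cores: %lu\n", cores);
    free(buf);
    return 0;
}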
Related
I am looking into how a dual core (especially in embedded systems) could be beneficial. I would like to compare two targets: one with a dual-core ARM Cortex-A9 (925 MHz), and the other with a single-core ARM Cortex-A8.
I have some ideas (please see below), but I am not sure whether they will actually use the dual core features:
My questions are:
1- How to execute several threads on different cores (without OpenMP, because it didn't work on my target and it isn't compatible with VxWorks)?
2- How does the kernel execute code on a dual core with shared memory: how does it allocate stack, heap, and memory for global and static variables?
3- Is it possible to add C flags in order to indicate the number of CPU cores, so we will be able to use the dual core features?
4- How does the kernel handle program execution (with a lot of threads) on a dual core?
Some tests to compare the two architectures regarding the OS and dual core / single core:
Dual core VS single core:
Create three threads that execute some routines and depend on each other's results (like a matrix multiplication). Afterwards, measure the time taken once on the dual core and once on the single core (how is this possible without OpenMP?).
Ping pong:
Two processes repeatedly pass a message back and forth: one process sends a message to the other, which sends it straight back. We should investigate how the time taken varies with the size of the message for each architecture.
One to all:
A process with rank 0 sends the same message to all other processes in the program and then receives a message of the same length from all other processes. How does the time taken vary with the size of the messages and with the number of processes?
Short answers wrt. Linux only:
how to execute several threads on different cores
Use multiple processes, or pthreads within a single process.
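A minimal pthread sketch (the worker and the thread count are illustrative; the kernel is free to schedule each thread on any available core, so this alone already exercises both cores). Compile with gcc -pthread:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2

static void *worker(void *arg)
{
    printf("thread %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}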
How does the kernel execute code on a dual core with shared memory: how does it allocate stack, heap, and memory for global and static variables?
In Linux, they all belong to the process. (You can declare thread-local variables, though, with for example the __thread keyword.)
In other words, if a thread allocates some memory, that memory is immediately visible to all other threads in the same process. Even if that thread exits, nothing happens to the memory. It is perfectly normal for some other thread to free the memory later on.
Each thread does get their own stack, which by default is quite large. (With pthreads, this is easy to control using a pthread_attr_t that specifies a smaller stack size.)
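For instance, a sketch of the pthread_attr_t approach (the 64 KiB figure is only an example; pick a size that covers your deepest call chain):

#include <pthread.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);   /* 64 KiB instead of the default */
    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}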
In general, there is no difference in memory handling or program execution, between a single-threaded and a multi-threaded process, on the kernel side. (Userspace code needs to use memory and threads properly, of course; the kernel does not try to stop stupid userspace code from shooting itself in the head, at all.)
Is it possible to add C flags in order to indicate the number of CPU cores, so we will be able to use the dual core features?
Yes, for example by examining the /proc/cpuinfo pseudofile.
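A rough sketch of doing that from C, counting the "processor" entries (on glibc, sysconf(_SC_NPROCESSORS_ONLN) gives the same number more directly):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    int cpus = 0;

    if (f == NULL)
        return 1;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "processor", 9) == 0)   /* one "processor" line per online CPU */
            cpus++;
    fclose(f);
    printf("online CPUs: %d\n", cpus);
    return 0;
}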
However, it is much more common to leave such details for the system administrator. Services like Apache etc. let the administrator configure the number, either in a configuration file, or in the command line. Even make supports a -j JOBS parameter, allowing multiple compilation tasks to run in parallel.
I very warmly recommend you forget about any detection magic, and instead let the user specify the number of threads used. (If you insist, you could use detection magic for the default, but then only if the user/admin has not given you any hints about the number of threads to be used.)
It is also possible to set the CPU mask for each thread, specifying the set of cores that thread may run on. Other than in benchmarks, where this is done more for repeatability than anything else, or in dedicated processes designed to hog all the resources on a machine, it is very rare.
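If you do need it, a hedged sketch using the GNU extension pthread_setaffinity_np() to pin the calling thread to core 0 (_GNU_SOURCE is required for the cpu_set_t macros):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* allow core 0 only */
    if (pthread_setaffinity_np(pthread_self(), sizeof set, &set) != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed\n");
        return 1;
    }
    printf("pinned to core 0\n");
    return 0;
}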
How does the kernel handle program execution (with a lot of threads) on a dual core?
The exact same way as it handles a lot of simultaneous processes.
There are some controls (control group stuff, cpu masks) and maybe resource accounting that are handled differently between separate processes and threads within the same process, but in general terms, separate threads in a single process are executed the same way as separate processes are.
Presumably there is a library or simple asm blob that can get me the number of the current CPU that I am executing on.
Use sched_getcpu to determine the CPU on which the calling thread is running. See man getcpu (the system call) and man sched_getcpu (a library wrapper). However, note what it says:
The information placed in cpu is only guaranteed to be current at the time of the call: unless the CPU affinity has been fixed using sched_setaffinity(2), the kernel might change the CPU at any time. (Normally this does not happen because the scheduler tries to minimize movements between CPUs to keep caches hot, but it is possible.) The caller must be prepared to handle the situation when cpu and node are no longer the current CPU and node.
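A minimal usage sketch (glibc provides sched_getcpu; _GNU_SOURCE is needed for the declaration):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    int cpu = sched_getcpu();
    if (cpu == -1) {
        perror("sched_getcpu");
        return 1;
    }
    /* Only a snapshot: the scheduler may move this thread right after the call. */
    printf("running on CPU %d\n", cpu);
    return 0;
}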
You need to do something like the following (a rough sketch in C is given after the list):
Call sched_getaffinity and identify the CPU bits
Iterate over the CPUs, doing sched_setaffinity to each one
(I'm not sure whether, after sched_setaffinity, you're guaranteed to be on that CPU, or whether you need to yield explicitly?)
Execute CPUID (asm instruction)... there is a way of getting a unique per-core ID out of one of its outputs (see Intel docs). I vaguely recall it's the "APIC ID".
Build a table (a std::map ?) from APIC IDs to a CPU number or affinity mask or something.
If you did this on your main thread, don't forget to use sched_setaffinity to set the mask back to all CPUs!
Now you can CPUID again whenever you need to and look up which core you're on.
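A rough sketch of that procedure in C on Linux/x86 with GCC. The assumption here is that the initial APIC ID sits in bits 31..24 of EBX from CPUID leaf 1; double-check that against the Intel docs:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <cpuid.h>

static unsigned apic_id(void)
{
    unsigned eax, ebx, ecx, edx;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    return ebx >> 24;                              /* initial APIC ID */
}

int main(void)
{
    cpu_set_t orig, one;
    int cpu;

    sched_getaffinity(0, sizeof orig, &orig);      /* remember the original mask */
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &orig))
            continue;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        sched_setaffinity(0, sizeof one, &one);    /* pin ourselves to this CPU */
        sched_yield();                             /* give the scheduler a chance to migrate us */
        printf("CPU %d -> APIC ID %u\n", cpu, apic_id());
    }
    sched_setaffinity(0, sizeof orig, &orig);      /* restore the full mask */
    return 0;
}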
But I'd query why you need to do this; normally you want to take control via sched_setaffinity rather than finding out which core you're on (and even that's a pretty rare thing to want/need). (That's why I don't know the crucial detail of what to pull out of CPUID exactly, sorry!)
Update: Just learned about sched_getcpu from litb's response here. Much better! (my Debian/etch libc is too old to have it though).
I don't know of anything to get your current core id. With kernel level task/process migration, you wouldn't be guaranteed that it would remain constant for any length of time, unless you were running in some form of real-time mode.
If you want to be on a specific core, you can use the sched_setaffinity() function or the taskset command to launch your program. I believe these need elevated permissions when applied to processes you don't own, though. In your program, you could then run sched_getaffinity() to see the mask that was set earlier and use that as a best guess at the core on which you are executing.
sysconf(_SC_NPROCESSORS_ONLN);
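For completeness, a tiny usage sketch (_SC_NPROCESSORS_ONLN reports the CPUs currently online; _SC_NPROCESSORS_CONF reports all configured CPUs):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online processors: %ld\n", n);
    return 0;
}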
I have 4 processors and am compiling a processor-hungry application. I read that using make with the -j4 switch was recommended for OpenCV; should I instead use -j8? What is the advantage of making for multiple processors?
The answers above are all mostly correct. However, the details are a bit misleading. For example, there's no need to add an extra job for a "managing thread" (note: make is not actually multithreaded). make never counts itself as a job for the purposes of -j, so, as Huygens says above, if you say -j5 you'll get 5 compile jobs running, not 4 plus make.
The reason most people use [number of cores] + [some padding] has nothing to do with make or what it needs, but rather with the nature of the compiler. A compiler is really just a very complicated text translation tool: it reads in text in one form and converts it to "text" (binary) in another form. A lot of this (especially as your language gets more complex, like C++) requires a lot of CPU, but it also requires a lot of disk I/O. Disk I/O is slow, so while one compile job is waiting for data from the disk, the kernel schedules other jobs to run. That is why you can usefully have more compile jobs running at the same time than you have cores.
Exactly how large you can get -j before you start seeing diminishing returns (your builds actually start going slower, at some point, with more -j) depends completely on your hardware, the kinds of builds you're doing, etc. The only way to know for sure is experimentation.
However, [number of cores]+[a few] is typically a good approximation.
As you say, the -j flag tells make how many 'threads' (jobs) it is allowed to spawn. Ideally each job is executed on its own core/CPU, so your multi-core/CPU environment is used to its fullest.
make itself does not compile the source files. This is done by a compiler (gcc). The Makefile (input for make) contains a set of targets. Each target has a set of dependencies (on other targets) and rules how to build the target. make reads the Makefile(s) and manages all targets, dependencies, and build rules. Besides compiling source files you can use make to perform any task that can be described by shell commands.
If you set the allowed number of threads too high, it is not possible to schedule each thread on its own core. Additional scheduling (context) switches are required to let all threads execute, and this additional resource usage obviously results in lower performance.
There are multiple rules of thumb, but I guess that setting the total amount to <number of cores> + 1 is the most common. The idea behind this is that all cores have their own thread, and there is one additional managing thread that handles the targets and decides which is next to be built.
One CPU per thread plus one manager/loader. Since a thread that does disk operations is technically almost idle from the CPU's point of view, add one to the total number of cores.
If the CPU uses hyperthreading, you can safely count each core as two and double the number of threads, so a quad-core Intel Core i7 should get -j9 (eight virtual cores plus the manager). On a quad-core AMD, use -j5.
The -j option is only used to speed up the application build; it determines how many jobs make can spawn for the build. You can set -j<nb cores>, or even higher, -j<nb cores * 1.5>, so that compilation can happen in parallel.
It has no impact on the compiled code.
For a 4 core system, you could try make -j6. If make can run parallel builds, it will launch up to 6 simultaneous compilation processes (e.g. 6 calls to gcc).
How can I get a process's CPU usage in C?
I need the CPU usage of every process and thread.
Please give me an example.
Thanks!
In plain C, this is not possible, but since the question is also tagged "Windows":
CPU usage is CPU time divided by real time. The GetThreadTimes and GetProcessTimes functions give you that information (there are other options, such as the performance counters Joachim Pileborg mentioned above, but I think this one is probably easier).
You probably also want to use CreateToolhelp32Snapshot first to know what processes and threads exist at all. You'll need to translate thread/process IDs to handles, but I guess that won't be a big hurdle (i.e. OpenProcess).
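A hedged sketch of that approach: sample the process's kernel+user time twice, one second apart, and divide by the elapsed wall time (FILETIME values are in 100 ns units; the PID below is just a placeholder for one obtained from the Toolhelp32 snapshot):

#include <windows.h>
#include <stdio.h>

static ULONGLONG ft_to_u64(FILETIME ft)
{
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart;
}

static ULONGLONG proc_cpu_time(HANDLE h)
{
    FILETIME createTime, exitTime, kernelTime, userTime;
    GetProcessTimes(h, &createTime, &exitTime, &kernelTime, &userTime);
    return ft_to_u64(kernelTime) + ft_to_u64(userTime);
}

int main(void)
{
    DWORD pid = 1234;                 /* placeholder: use a PID from the snapshot */
    HANDLE h = OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, FALSE, pid);
    FILETIME wall1, wall2;
    ULONGLONG cpu1, cpu2;

    if (h == NULL)
        return 1;
    GetSystemTimeAsFileTime(&wall1);
    cpu1 = proc_cpu_time(h);
    Sleep(1000);
    GetSystemTimeAsFileTime(&wall2);
    cpu2 = proc_cpu_time(h);

    /* CPU time over wall time; divide by the processor count for a per-core figure */
    printf("CPU usage: %.1f %%\n",
           100.0 * (cpu2 - cpu1) / (double)(ft_to_u64(wall2) - ft_to_u64(wall1)));
    CloseHandle(h);
    return 0;
}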
In C, total CPU usage can be determined using Performance Counters (there is a small typo in the example code: sleep has to be changed to Sleep).
In C++, C#, Delphi etc., I would recommend using WMI.
== EDIT ==
I found an approach to get the per-process CPU usage. For example, in order to get the CPU load of Microsoft Outlook, change the counter path in the above example to this:
PdhAddCounter(query, TEXT("\\Process(OUTLOOK)\\% Processor Time"), 0, &counter);
If you have multiple instances of the same executable running, you may use indexes. This MSDN example is also very useful.
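A hedged sketch of wiring that counter path into a complete query (two PdhCollectQueryData calls are needed because "% Processor Time" is a rate counter; link with pdh.lib; "OUTLOOK" is just the instance name from the example above):

#include <windows.h>
#include <pdh.h>
#include <stdio.h>

int main(void)
{
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    PDH_FMT_COUNTERVALUE value;

    PdhOpenQuery(NULL, 0, &query);
    PdhAddCounter(query, TEXT("\\Process(OUTLOOK)\\% Processor Time"), 0, &counter);

    PdhCollectQueryData(query);            /* first sample */
    Sleep(1000);
    PdhCollectQueryData(query);            /* second sample, one second later */

    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
    printf("OUTLOOK CPU: %.1f %%\n", value.doubleValue);

    PdhCloseQuery(query);
    return 0;
}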
I am trying to understand how the linux syscall sched_setaffinity() works. This is a follow-on from my question here.
I have this guide, which explains how to use the syscall and has a pretty neat (working!) example.
So I downloaded the Linux 2.6.27.19 kernel sources.
I did a 'grep' for lines containing that syscall, and I got 91 results. Not promising.
Ultimately, I'm trying to understand how the kernel is able to set the instruction pointer for a specific core (or processor.)
I am familiar with how single-core-single-thread programs work. One might issue a 'jmp foo' instruction, and this basically sets the IP to the memory address of the 'foo' label. But when one has multiple cores, one has to say "fetch the next instruction at memory address foo, and set the instruction pointer for core number 2 to begin execution there."
Where, in the assembly code, are we specifying which core performs that operation?
Back to the kernel code: what is important here? The file 'kernel/sched.c' has a function called sched_setaffinity(), but it returns type "long", which is inconsistent with its manual page. So what is important here? Which of these modules shows the assembly instructions issued? What module is reading the 'task_struct', looking at the 'cpus_allowed' member, and then translating that into an instruction? (I've also thumbed through the glibc source, but I think it just makes a call to the kernel code to accomplish this task.)
sched_setaffinity() simply tells the scheduler which CPUs that process/thread is allowed to run on, then calls for a re-schedule.
The scheduler actually runs on each one of the CPUs, so it gets a chance to decide what task to execute next on that particular CPU.
If you're interested in how you can actually call some code on other CPUs, I suggest you take a look at smp_call_function_single(). In case we want to call something on another CPU, this calls generic_exec_single(). The latter simply adds the function to the target CPU's call queue and forces a reschedule through some IPI stuff (if the queue was empty).
Bottom line is: there is no actual SMP variant of the jmp instruction. Instead, code running on other CPUs cooperates in order to accomplish the task.
I think the thing you are not understanding is that the kernel is running on all the CPU cores. At every timer interrupt (~1000 per second), the scheduler runs on each CPU and chooses a process to run. There is no one CPU that somehow tells the others to start running a process. sched_setaffinity() works by just setting flags on the process. The scheduler reads these flags and will not run that process on its CPU if it is set not to.
Where, in the assembly code, are we specifying which core performs that operation?
There is no assembly involved here. Every task (thread) is assigned to a single CPU (or core, in your terms) at a time. To stop running on a given CPU and resume on another, the task has to "migrate". When a task migrates from one CPU to another, the scheduler picks the CPU which is most idle among the CPUs allowed by sched_setaffinity().
There are no magic assembly instructions issued. The kernel has a more low-level view of the hardware: each CPU is a separate object, which is very different from how it looks to user-space processes (in user space, CPUs are almost invisible).
Check this out: B Operating System Programming Guidelines