How to know the number of existing OpenMP threads - C

I have an OpenMP program running with, say, 6 threads on an 8-core machine. How can I extract this information (num_threads = 6) from another program (a non-OpenMP, plain C program)? Can I get this info from the underlying kernel?
I was using run-queue lengths from "sar -q 1 0", but this doesn't yield consistent results: sometimes it gives 8, other times more or less.

In Linux, threads are processes (see the first post here), so you can ask for a list of running threads with ps -eLf. However, if the machine has 8 cores, it is possible that OpenMP created 8 threads (even though it currently uses only 6 of them for your computation); in that case, your code itself must store the number of threads it is using somewhere (e.g. a file or a FIFO).
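As a minimal sketch (assuming you already know the PID of the OpenMP program and that a Linux /proc filesystem is available), another plain C program can read the Threads: line from /proc/<pid>/status:

#include <stdio.h>
#include <sys/types.h>

/* Return the number of threads of a running process, or -1 on error.
   Reads the "Threads:" line from /proc/<pid>/status (Linux only).
   Note: the count includes the main thread and any helper threads the
   OpenMP runtime keeps around, so it is an upper bound on the threads
   actually doing your computation. */
int count_threads(pid_t pid)
{
    char path[64], line[256];
    int threads = -1;

    snprintf(path, sizeof path, "/proc/%d/status", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof line, f))
        if (sscanf(line, "Threads: %d", &threads) == 1)
            break;

    fclose(f);
    return threads;
}

Counting the directories under /proc/<pid>/task/ gives the same number.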

Related

How many threads can I create inside a C program and how does it relate to the number of threads my CPU has?

My CPU is an i5-8400, which has 6 cores and 6 threads. What does this mean? Initially, I thought it meant that I had 6 threads available per core, which totals 36 threads.
Say I'm making a C program, where I create pthreads, does that mean I can only have 6 threads in that program, as its process will run on a single CPU core? If that's the case, what would happen if I tried creating a seventh thread?
When I go to Task Manager (Windows), I see thousands of threads, which means my understanding was wrong.
So my questions are:
How does my CPU's number of threads relate to how many threads I can create in a process? Say I create a C program; how many threads can I create in its process?
What happens if I try to create a thread, and there are no more threads available?
An Intel CPU has multiple cores, and each core has multiple execution units. A mainboard can have several CPUs.
For example, my system has 2 Intel Xeon E5620 processors. Each Xeon E5620 has 4 cores, and each core can execute 2 threads (that is the hyperthreading feature). On my system, a total of 2x4x2 = 16 threads can truly execute simultaneously.
The difference between the number of threads and the number of cores exists because a hyperthreaded core is not two complete cores: it can execute multiple threads, but each of them performs worse than a thread on a dedicated core. Put another way, 8 single-threaded cores are faster than 4 cores each running two threads.
When we talk about the number of threads in a CPU context, it means that this many threads can really execute in parallel. When you look at Task Manager, you see the total number of thread objects in the system. At any given moment most of them are sleeping (for example, waiting for I/O or a mutex); only as many threads as the CPUs provide can actually be executing instructions.
If you create more threads than the CPUs can run at once, some of them simply wait for their turn to execute. A hardware thread executes the existing software threads one after the other. The operating system has a scheduler that determines which thread is ready to run.
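A minimal POSIX sketch of that last point (the thread count of 100 is just an illustration): creating far more software threads than hardware threads simply makes them take turns, and pthread_create only fails when a resource limit is reached.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *work(void *arg) { (void)arg; sleep(1); return NULL; }

int main(void)
{
    long hw_threads = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t tid[100];
    int created = 0;

    /* Create many more software threads than there are hardware threads.
       The OS scheduler simply time-slices them; pthread_create only fails
       (typically with EAGAIN) when a resource limit is reached. */
    for (int i = 0; i < 100; i++) {
        int err = pthread_create(&tid[i], NULL, work, NULL);
        if (err != 0) {
            fprintf(stderr, "pthread_create failed (%d) after %d threads\n", err, i);
            break;
        }
        created++;
    }

    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);

    printf("%ld hardware threads, %d software threads created\n", hw_threads, created);
    return 0;
}

On Windows the equivalents would be CreateThread and GetSystemInfo, but the scheduling behaviour is the same.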
Interesting readings:
https://learn.microsoft.com/en-us/windows/win32/procthread/scheduling
https://www.intel.com/content/www/us/en/gaming/resources/hyper-threading.html

Parallel efficiency drops inconsistently

My question is probably of a trivial nature. I parallelised a CFD code using MPI libraries and now I am trying to investigate my parallel efficiency. To start with, I created a case which would provide equal loads among the ranks and a constant ratio of volume of calculations over transferred data. Thus, my expectation was that as I increase the ranks, any runtime changes would be attributed to communication delays only. However, I realised that the subroutines that do not invoke rank communication (so they only do domain calculations, hence they deal with the same load for all ranks) contribute significantly, actually the most, to the runtime increases. What am I missing here? Does this even make sense?
Does this even make sense?
Yes!
The more processes you create (every process has a rank), the closer you come to the limit of your system's ability to execute processes in a truly parallel manner.
Your system (e.g. your computer) can run a certain number of processes in parallel; when this limit is surpassed, some processes wait to be executed (thus not all processes run in parallel), which harms performance.
For example, assuming a computer has 4 cores and you create 4 processes, every core can execute one process, so performance is harmed only by the communication between the processes, if any.
Now, on the same computer, you create 8 processes. What will happen?
4 of the processes will start executing in parallel, but the other 4 will wait for a core to become available so that they can run too. This is not truly parallel execution (some processes will execute serially). Moreover, depending on the OS scheduling policy, some processes may be interleaved, causing overhead at every context switch.
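A minimal MPI sketch of such an oversubscription check (the program is hypothetical, and it only inspects the node where rank 0 happens to run, so on a multi-node cluster each node would need its own check):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Cores visible on the node where rank 0 happens to run. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);

    if (rank == 0 && size > cores)
        fprintf(stderr,
                "Warning: %d ranks but only %ld cores on this node; oversubscribed\n",
                size, cores);

    MPI_Finalize();
    return 0;
}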

Multi-threading on ARM cortex A9 dual core (Linux or VxWorks)

I am working on how a dual core (especially in embedded systems) could be beneficial. I would like to compare two targets: one with a dual-core ARM Cortex-A9 (925 MHz), and the other with a single-core ARM Cortex-A8.
I have some ideas (please see below), but I am not sure they will use the dual-core features:
My questions are:
1- How do I execute several threads on different cores (without OpenMP, because it didn't work on my target and it isn't compatible with VxWorks)?
2- How does the kernel execute code on a dual core with shared memory: how does it allocate stack, heap, and memory for global and static variables?
3- Is it possible to add C flags to indicate the number of CPU cores so that we can use the dual-core features?
4- How does the kernel handle program execution (with a lot of threads) on a dual core?
Some tests to compare the two architectures regarding OS and dual core / single core:
Dual core vs. single core:
Create three threads that execute some routines and depend on each other's results (like matrix multiplication). Afterwards, measure the time taken on the dual core and then on the single core (how is this possible without OpenMP?).
Ping pong:
Two processes repeatedly pass a message back and forth: one process sends a message to the other, which sends it back. We should investigate how the time taken varies with the size of the message on each architecture.
One to all:
A process with rank 0 sends the same message to all other processes in the program and then receives a message of the same length from all other processes. How does the time taken vary with the size of the messages and with the number of processes?
Short answers wrt. Linux only:
how to execute several threads on different cores
Use multiple processes, or pthreads within a single process.
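For question 1, a minimal pthread sketch (matrix size and thread count are illustrative) that splits a matrix multiplication across worker threads without OpenMP; the same approach works on Linux, and VxWorks offers a POSIX layer as well as native tasks:

#include <pthread.h>
#include <stdio.h>

#define N 512
#define NTHREADS 2          /* illustrative: two workers for a dual core */

static double a[N][N], b[N][N], c[N][N];

struct range { int row0, row1; };

/* Each thread multiplies its own band of rows, so no locking is needed. */
static void *multiply(void *arg)
{
    struct range *r = arg;
    for (int i = r->row0; i < r->row1; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        r[t].row0 = t * N / NTHREADS;
        r[t].row1 = (t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, multiply, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("done\n");
    return 0;
}

On a dual core the kernel is free to run the two workers on different cores; no special flags are needed.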
How does the kernel execute code on a dual core with shared memory: how does it allocate stack, heap, and memory for global and static variables?
In Linux, they all belong to the process. (You can declare thread-local variables, though, with for example the __thread keyword.)
In other words, if a thread allocates some memory, that memory is immediately visible to all other threads in the same process. Even if that thread exits, nothing happens to the memory. It is perfectly normal for some other thread to free the memory later on.
Each thread does get their own stack, which by default is quite large. (With pthreads, this is easy to control using a pthread_attr_t that specifies a smaller stack size.)
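For example, a minimal sketch of setting a smaller stack (the 64 KiB figure is just an illustration):

#include <pthread.h>

static void *worker(void *arg) { (void)arg; return NULL; }

int main(void)
{
    pthread_t tid;
    pthread_attr_t attr;

    /* Give the worker a 64 KiB stack instead of the default
       (often 8 MiB on Linux). */
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}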
In general, there is no difference in memory handling or program execution, between a single-threaded and a multi-threaded process, on the kernel side. (Userspace code needs to use memory and threads properly, of course; the kernel does not try to stop stupid userspace code from shooting itself in the head, at all.)
Is it possible to add C flags to indicate the number of CPU cores so that we can use the dual-core features?
Yes, for example by examining the /proc/cpuinfo pseudofile.
However, it is much more common to leave such details for the system administrator. Services like Apache etc. let the administrator configure the number, either in a configuration file, or in the command line. Even make supports a -j JOBS parameter, allowing multiple compilation tasks to run in parallel.
I very warmly recommend you forget about any detection magic, and instead let the user specify the number of threads used. (If you insist, you could use detection magic for the default, but then only if the user/admin has not given you any hints about the number of threads to be used.)
It is also possible to set the CPU mask for each thread, specifying the set of cores that thread may run on. Other than in benchmarks, where this is done more for repeatability than anything else, or in dedicated processes designed to hog all the resources on a machine, it is very rare.
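If you do want to pin a thread, a minimal Linux-specific sketch using pthread_setaffinity_np (the core number chosen is arbitrary):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) { (void)arg; return NULL; }

int main(void)
{
    pthread_t tid;
    cpu_set_t set;

    pthread_create(&tid, NULL, worker, NULL);

    /* Pin the new thread to core 1; the core number is an arbitrary choice. */
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    if (pthread_setaffinity_np(tid, sizeof set, &set) != 0)
        fprintf(stderr, "pthread_setaffinity_np failed\n");

    pthread_join(tid, NULL);
    return 0;
}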
How does the kernel handle program execution (with a lot of threads) on a dual core?
The exact same way as it handles a lot of simultaneous processes.
There are some controls (control group stuff, cpu masks) and maybe resource accounting that are handled differently between separate processes and threads within the same process, but in general terms, separate threads in a single process are executed the same way as separate processes are.

MPI and 2-socket node nonuniform memory access

I use a cluster which contains several nodes. Each of them has 2 processors with 8 cores each. I use Open MPI with SLURM.
My tests show that MPI Send/Recv data transfer rate is the following: between MPI process with rank 0 and MPI process 1 it's about 9 GB/sec, but between process 0 and process 2 it's 5 GB/sec. I assume that this happens because our processes execute on different processors.
I'd like to avoid non-local memory access. The recommendations I found here did not help. So the question is, is it possible to run 8 MPI processes - all on THE SAME processor? If it is - how do I do it?
Thanks.
The following set of command-line options to mpiexec should do the trick with versions of Open MPI before 1.7:
--by-core --bind-to-core --report-bindings
The last option will pretty-print the actual binding for each rank. Binding also activates some NUMA-awareness in the shared-memory BTL module.
Starting with Open MPI 1.7, processes are distributed round-robin over the available sockets and bound to a single core by default. To replicate the above command line, one should use:
--map-by core --bind-to core --report-bindings
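For example, assuming a binary called ./my_app (a placeholder name), a full launch of 8 ranks could look like:
mpiexec -np 8 --map-by core --bind-to core --report-bindings ./my_app
The --report-bindings output then shows on which socket and core each rank was placed.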
It appears to be possible. The Process Binding and Rankfiles sections of the Open MPI mpirun man page look promising. I would try some of the options shown with the --report-bindings option set, so you can verify that process placement is what you intend and see whether you get the performance improvement you expect out of your code.
You should look at the hostfile / rankfile documentation for your MPI library. Open MPI and MPICH both use different formats, but both will give you what you want.
Keep in mind that you will have performance issues if you oversubscribe your processor too heavily. Running more than 8 ranks on an 8 core processor will cause you to lose the performance benefits you gain from having locally shared memory.
With Slurm, set:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=8
to have all cores allocated on the same socket (CPU die), provided Slurm is correctly configured.

How to get number of cores in Win32?

I'm writing a program in C on Windows that needs to run as many threads as there are available cores. But I don't know how to get the number of cores. Any ideas?
You can call the GetSystemInfo WinAPI function; it fills in a SYSTEM_INFO struct, which contains the number of processors (which is the number of cores on a system with multi-core CPUs).
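A minimal sketch of that call (note that it reports logical processors; see the answer further down for counting physical cores):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;

    /* GetSystemInfo fills in, among other things, the number of
       logical processors visible to the current process. */
    GetSystemInfo(&si);
    printf("Logical processors: %lu\n", si.dwNumberOfProcessors);
    return 0;
}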
You can read the NUMBER_OF_PROCESSORS environment variable.
Type "cmd" on windows startup and open "cmd.exe".
Now type in the following command:
WMIC CPU Get /Format:List
You will find entries like "NumberOfCores" and "NumberOfLogicalProcessors". Typically the extra logical processors come from hardware threading (hyperthreading), so the relation is:
NumberOfLogicalProcessors = NumberOfCores * Threads-per-Core.
Each core is a physical processing unit; threading exposes additional logical processing units on top of it.
More info here.
Even though the question deals with .NET and yours with C, the basic responses should help:
Detecting the number of processors
As #Changming-Sun mentioned in a comment above, GetSystemInfo returns the number of logical processors, which is not always the same as the number of processor cores. On machines that support hyperthreading (including most modern Intel CPUs) more than one thread can run on the same core (technically, more than one thread will have its thread context loaded on the same core). Getting the number of processor cores requires a call to GetLogicalProcessorInformation and a little bit of coding work. Basically, you get back a list of SYSTEM_LOGICAL_PROCESSOR_INFORMATION entries, and you have to count the number of entries with RelationProcessorCore set. A good example of how to code this is given in the GetLogicalProcessorInformation documentation provided by Microsoft:
https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformation
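A minimal sketch of that approach (error handling kept to a bare minimum):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;

    /* First call only asks for the required buffer size. */
    GetLogicalProcessorInformation(NULL, &len);
    if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
        return 1;

    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *buf = malloc(len);
    if (!buf || !GetLogicalProcessorInformation(buf, &len))
        return 1;

    /* Count the entries that describe a physical core. */
    DWORD count = len / sizeof(*buf);
    DWORD cores = 0;
    for (DWORD i = 0; i < count; i++)
        if (buf[i].Relationship == RelationProcessorCore)
            cores++;

    printf("Physical cores: %lu\n", cores);
    free(buf);
    return 0;
}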
