Behavior of omp_get_max_threads in parallel regions - c

I compile this bit of code on Snow Leopard and Linux and I get different results. On Snow Leopard, the first call of omp_get_max_threads returns 2, which is my number of cores, while the second returns 1. On Linux, both calls return 4, which is my number of cores. I think Linux has the correct behavior, am I right? Are both correct and I just have a misunderstanding of this function?
#include <stdio.h>
#include <omp.h>

int main() {
    printf(" In a serial region; max threads are : %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        #pragma omp master
        printf(" In a parallel region; max threads are : %d\n", omp_get_max_threads());
    }
}
Mac output:
In a serial region; max threads are : 2
In a parallel region; max threads are : 1
Linux output:
In a serial region; max threads are : 4
In a parallel region; max threads are : 4

This call is well specified in the OpenMP spec, and Linux has the correct behavior here.
That said, you are in a master region, which is effectively serial and executed by the main thread, so the result of such a thread-count query there is at least explainable. If you aren't tied to pure C, I would encourage you to look at the C++ TBB library, and particularly its PPL subset; you will find more generality and composability there, e.g. for nested parallelism.
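To make the distinction concrete, here is a minimal sketch (mine, not from the original answer) contrasting omp_get_num_threads, which reports the size of the team currently executing, with omp_get_max_threads, which reports the upper bound for a new team and per the spec should not collapse to 1 inside a parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp master
        {
            // Size of the team currently executing this region:
            printf("num threads = %d\n", omp_get_num_threads());
            // Upper bound for a *new* team; should match the
            // serial-region value rather than dropping to 1:
            printf("max threads = %d\n", omp_get_max_threads());
        }
    }
    return 0;
}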

With Apple-supplied gcc 4.2 [gcc version 4.2.1 (Apple Inc. build 5566)] on Leopard, I get the same results as you (except that my MacBook has fewer cores).
In a serial region; max threads are : 2
In a parallel region; max threads are : 1
Ditto for 4.3.4 from MacPorts.
However, with gcc 4.4.2 and 4.5.0 20091231 (experimental) from MacPorts, on the same computer I get:
In a serial region; max threads are : 2
In a parallel region; max threads are : 2
It looks like this isn't a Mac versus Linux issue, but due to the gcc version.
P.S. OpenMP can do nested parallelism.
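As a minimal sketch of that (assuming an OpenMP 3.0-capable compiler, since omp_get_ancestor_thread_num first appeared there), nested parallelism only needs to be switched on:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1); // enable nested parallelism (off by default)
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            #pragma omp critical
            printf("outer thread %d, inner thread %d\n",
                   omp_get_ancestor_thread_num(1), omp_get_thread_num());
        }
    }
    return 0;
}

With nesting enabled this prints four lines; with it disabled, each inner region runs with a single thread.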

Just a reminder: there is a forum devoted just to OpenMP, read by the developers of OpenMP as well as OpenMP experts worldwide. It's over at the official OpenMP website: http://openmp.org/forum
It's a great place to ask questions like this, and to find a lot of other resources at openmp.org.

Weird. I always get the expected behaviour (using 4.2.1, build 5646.1) with OS X 10.6.2:
On my Mac Pro
In a serial region; max threads are : 8
In a parallel region; max threads are : 8
and on my iMac
In a serial region; max threads are : 2
In a parallel region; max threads are : 2
Something else must be going on here. Are you compiling with just
gcc fname.c -fopenmp

Related

C OpenMP : libgomp: Thread creation failed: Resource temporarily unavailable

I was trying to do a basic project and it seems like I have run out of threads. Do you know how I can fix the problem?
Here is the code:
#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(2150);
    #pragma omp parallel
    {
        printf("%d\n", omp_get_thread_num());
    }
    return 0;
}
and here is the global compiler setting I have written under "other compiler options" in Code::Blocks:
-fopenmp
I am getting the error of:
libgomp: Thread creation failed: Resource temporarily unavailable
I have seen similar threads on the site, but I have not found an answer or solution yet.
Specs:
Intel i5 6400
2x8 GB RAM
Windows 10 64 bit
The problem is
omp_set_num_threads(2150);
The OS imposes a limit on the number of threads a process can create. The limit may be indirect, for example via the stack size: each thread needs its own stack, so 2150 of them can exhaust the available address space or hit a resource limit. Creating 2150 threads exceeds those limits.
You mention that you've got the Intel i5 6400, which is a quad-core chip. Try setting the number of threads to something more reasonable. In your case:
omp_set_num_threads(4);
For numerical processing, performance will likely suffer when using more than 4 threads on a 4-core system.
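As a minimal sketch (assuming, on my part, that you simply want one thread per processor without hard-coding the count), you can ask the runtime itself:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // One thread per available processor instead of a hard-coded 2150
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel
    {
        printf("%d\n", omp_get_thread_num());
    }
    return 0;
}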

OpenMP on Kirin 650

Basically, the Kirin 650 is the ARM SoC in the Honor 5C phone from Huawei. I am trying to benchmark C++ code directly on the rooted phone I have. This code executes NEON instructions and has OpenMP parallelization. When I call the OpenMP query functions on the phone, they return:
Num Procs = 8
Num Threads = 1
Max Threads = 8
Num Devices = 0
Default Device = 0
Is Initial Device = 1
This information seems coherent, since the Kirin 650 has 8 cores. However, the phone's technical details also specify that the Kirin has 4 cores at up to 1.7 GHz (power-saving cores) and 4 cores at up to 2 GHz (performance cores).
How does OpenMP handle these, given that the two clusters run asynchronously as I understand it? In my benchmarks I see speedups when I compare no OpenMP against OpenMP with 2 threads and with 4 threads, but my 8-thread results are a disaster.
Do these cores share the same clock (I'm using clock_gettime, via the Hayai library, to benchmark everything)? Do you have any advice for running OpenMP code on these kinds of platforms?
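One hedged experiment (my suggestion, not from the thread): pin the OpenMP team to a single cluster and compare timings. With libgomp the affinity can be set from the environment; which CPU numbers map to the performance cluster is device-specific, so the 4-7 below is an assumption you can check against /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq:

export OMP_NUM_THREADS=4
export GOMP_CPU_AFFINITY="4-7"   # assumption: CPUs 4-7 are the 2 GHz cluster
./your_benchmark                 # hypothetical binary name

If 4 threads pinned to the big cluster are stable while mixed 8-thread runs are not, the slowdown is likely threads landing on (or waiting for) the slower cores.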

How is CPU allocation done in Linux? Thread level or process level? [closed]

I am trying to understand how the CPU is distributed among different processes with different numbers of threads. I have two programs, Program1 and Program2.
Program1 has 5 threads, whereas Program2 has ONLY the main thread.
SCENARIO-1:
terminal-1 : ./Program1
terminal-2 : ./Program2
When I run Program1 in one terminal and Program2 in another terminal, the CPU allocation is 50% for Program1 and 50% for Program2. Each thread of Program1 gets 10% (cumulatively 50% for Program1).
This suggests that, no matter how many threads a process has, every process gets an equal share of the CPU, i.e. that CPU allocation is done at the process level.
pstree shows
├─bash───P1───5*[{P1}]
├─bash───P2───{P2}
SCENARIO-2:
terminal-1 : ./Program1 & ./Program2
When I run both Program1 and Program2 in the SAME terminal, the CPU is allocated equally to Program2 and to each thread of Program1. Each thread of Program1 gets almost 17% (cumulatively Program1 gets 83%) and Program2 also gets 17%. This suggests CPU allocation is done at the thread level.
pstree shows
├─bash─┬─P1───5*[{P1}]
│ └─P2
I am using Ubuntu 12.04.4 LTS, kernel 3.11.0-15-generic. I have also tried Ubuntu 14.04.4, kernel 3.16.x, and got similar results.
Can anyone explain how the CPU scheduler of the LINUX KERNEL distinguishes SCENARIO-1 from SCENARIO-2?
I think the scheduler must be distinguishing the two scenarios somewhere before allocating the CPU. To find out where, I downloaded the Linux kernel source code; however, I haven't found the place where it distinguishes them.
It would be great if anyone could point me to the source code or function where the CPU scheduler distinguishes SCENARIO-1 from SCENARIO-2.
Thanks in advance.
NOTE: Although Ubuntu is based on Debian, surprisingly, on Debian 8 (kernel 3.16.0-4-686-pae) CPU allocation is done at the thread level in both SCENARIOs: each thread of Program1 gets almost 17% (cumulatively Program1 gets 83%) and Program2 also gets 17%.
Here is the code:
Program1 (with 5 threads)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>

// A global variable that the threads will modify
int g = 0;

// The function executed by all threads
void *myThreadFun(void *vargp)
{
    // Recover the integer argument passed to this thread
    int myid = (int)(intptr_t)vargp;

    // A static variable so we can observe its changes across threads
    static int s = 0;

    // Change the static and global variables
    ++s; ++g;

    // Print the argument, static and global variables
    printf("Thread ID: %d, Static: %d, Global: %d\n", myid, ++s, ++g);

    while (1); // Representing CPU-bound work
}

int main()
{
    int i;
    pthread_t tid[5];

    // Create five threads
    for (i = 0; i < 5; i++)
        pthread_create(&tid[i], NULL, myThreadFun, (void *)(intptr_t)i);

    for (i = 0; i < 5; i++)
        pthread_join(tid[i], NULL);

    return 0;
}
Program2 (with only the main thread)
#include <stdio.h>
#include <stdlib.h>

int main()
{
    while (1); // Representing CPU-bound work
}
To disable all optimization by gcc, I used the -O0 option while compiling both programs.
gcc -O0 program1.c -o p1 -lpthread
gcc -O0 program2.c -o p2
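To reproduce the per-thread percentages quoted above, one way (my suggestion, not from the original post) is to watch both programs with top in thread mode:

top -H -p "$(pidof p1),$(pidof p2)"

The -H flag makes top list each thread of p1 separately, so the per-thread and cumulative CPU shares can be read off directly.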
UPDATE: As per ninjalj's explanation, in Scenario-1 CPU allocation is done at the control-group level. Since I am using two different terminals (i.e. two different sessions), there are 2 different control groups, and each control group gets 50% of the CPU. This is because autogrouping is enabled by default.
As Program2 has ONLY one thread and Program1 has more threads, I want to run both programs in separate terminals (different sessions) and still get more CPU for Program1 (as in Scenario-2, where Program1 gets 83% of the CPU compared to 17% for Program2). Is it possible in any way for the CPU allocation of Scenario-1 to be the same as Scenario-2 on Ubuntu?
It is also surprising to me that, although Ubuntu is based on Debian, Debian and Ubuntu behave differently: on Debian, Program1 gets more CPU in both scenarios.
The Linux kernel does not distinguish processes from threads in scheduling.
Threads are processes that just happen to share most of their memory. Beyond that, they are treated equally by the scheduler.
You can have 50 processes and 30 threads. That's 80 "things", and the kernel will schedule them without regard to whether they are processes or threads.
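A hedged follow-up on the autogrouping mentioned in the question's update (not part of the original answer): on kernels built with CONFIG_SCHED_AUTOGROUP, the per-session grouping can be toggled at runtime, which should make Scenario-1 behave like Scenario-2:

cat /proc/sys/kernel/sched_autogroup_enabled   # 1 = autogrouping active
sudo sysctl kernel.sched_autogroup_enabled=0   # disable it

With autogrouping off, all runnable threads compete individually regardless of which session they started in.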

MPI calls are slow in an OpenMP section

I am attempting to write a hybrid MPI + OpenMP linear solver within the PETSc framework. I am currently running this code on 2 nodes, with 2 sockets per node, and 8 cores per socket.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact
mpirun -np 4 --bysocket --bind-to-socket ./program
I have checked that this gives me a nice NUMA-friendly thread distribution.
Each MPI process creates 8 threads: 1 should perform MPI communications while the remaining 7 perform computations. Later, I may try to oversubscribe the sockets with 9 threads each.
I currently do it like this:
omp_set_nested(1);
#pragma omp parallel sections num_threads(2)
{
    // COMMUNICATION THREAD
    #pragma omp section
    {
        while (!stop)
        {
            // Vector scatter with MPI Send/Recv
            // Check stop criteria
        }
    }
    // COMPUTATION THREAD(S)
    #pragma omp section
    {
        while (!stop)
        {
            #pragma omp parallel for num_threads(7) schedule(static)
            for (i = 0; i < n; i++)
            {
                // do some computation
            }
        }
    }
}
My problem is that the MPI communications take an exceptional amount of time, just because I placed them in the OpenMP section. The vector scatter takes approximately 0.024 seconds inside the OpenMP section, and less than 0.0025 seconds (10 times faster) if it is done outside of the OpenMP parallel region.
My two theories are:
1) MPI/OpenMP is performing extra thread-locking to ensure my MPI calls are safe, even though it's not needed. I have tried requesting MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED and MPI_THREAD_MULTIPLE to see if I can convince MPI that it's already safe, but this had no effect. Is there something I'm missing?
2) My computation threads update values used by the communications (it's actually a deliberate race condition, as if this wasn't awkward enough already!). It could be that I'm facing memory bottlenecks. It could also be cache thrashing, but I'm not forcing any OpenMP flushes, so I don't think it's that.
As a bonus question: is an OpenMP flush operation clever enough to only flush to the shared cache if all the threads are on the same socket?
Additional information: the vector scatter is done with the PETSc functions VecScatterBegin() and VecScatterEnd(). A "raw" MPI implementation might not have these problems, but it's a lot of work to re-implement the vector scatter to find out, and I'd rather not do that yet. From what I can tell, it's an efficient loop of MPI Send/Irecv calls beneath the surface.
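For reference, a minimal sketch (mine, not from the question) of how the thread-support level in theory 1 is normally requested: it must be asked for at initialization via MPI_Init_thread, and since the communicating thread here is an OpenMP section thread rather than necessarily the initial thread, MPI_THREAD_SERIALIZED (or MPI_THREAD_MULTIPLE) is the level that actually matches this design:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    // One thread at a time makes MPI calls, but not necessarily the main thread
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    if (provided < MPI_THREAD_SERIALIZED)
        fprintf(stderr, "warning: MPI only provides level %d\n", provided);

    // ... hybrid MPI + OpenMP work ...

    MPI_Finalize();
    return 0;
}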

Measure CPU frequency with turboboost in code

I am profiling some code on three different computers with three different frequencies. I need the frequencies to measure GFLOPs/s. I have some code which does this, but it does not account for Turbo Boost. For example, on my 2600K CPU it reports 3.4 GHz, but when I run CPU-Z I can see that my CPU is running at 4.3 GHz (overclocked) for my code which uses all cores.
#include "stdint.h"
#include "stdio.h"
#include "omp.h"
int main() {
int64_t cycles = rdtsc(); double dtime = omp_get_wtime();
//run some code which uses all cores for a while (few ms)
dtime = omp_get_wtime() - dtime;
cycles = rdtsc() - cycles;
double freq = (double)cycles/dtime*1E-9;
printf("freq %.2f GHz\n", freq);
}
__int64 rdtsc() {
#ifdef _WIN32
return __rdtsc();
#else
uint64_t t;
asm volatile ("rdtsc" : "=A"(t));
return t;
#endif
}
I know this question has been asked various times with various answers, but it's still not clear to me whether this can be done. I don't care about hackers trying to change timers; this code is only for myself. Is it possible to get the actual frequency in code? How is this done on Linux? Every example I have found on Linux gives the base frequency (or maybe the max) but not the operating frequency under load, as CPU-Z does.
Edit:
I found a program for Linux, PowerTOP, which appears to show the actual operating frequency. Since its source code is available, maybe it's possible to figure out how to get the actual frequency in my own code.
I finally solved this problem. It is possible to measure the actual operating frequency in code without needing device drivers or reading special counters.
Basically, you time a loop containing an operation with a loop-carried dependency that always takes the same latency. For example:
for (int i = 0; i < spinCount; i++) {
    x = _mm_add_ps(x, _mm_set1_ps(1.0f));
}
You run this loop in threads bound to each physical (not logical) core. It requires that no threads in the system other than these take any significant CPU time, so the method won't always give the correct answer, but in my case it works quite well. I get results that deviate less than 0.5% from the correct turbo frequency, for one thread and for many threads, on Nehalem, Ivy Bridge, and Haswell, on single-socket and multi-socket systems. I described the details at how-can-i-programmatically-find-the-cpu-frequency-with-c, so I won't repeat all of them here.
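As a self-contained sketch of the idea (single-threaded, and assuming a dependent _mm_add_ps latency of 3 cycles, which holds on many Intel cores but must be checked per microarchitecture):

#include <stdio.h>
#include <immintrin.h>
#include <omp.h>

#define ADD_LATENCY_CYCLES 3.0   // assumption: dependent addps latency

int main(void)
{
    const long spinCount = 300000000L;
    __m128 x = _mm_setzero_ps();

    double t = omp_get_wtime();
    for (long i = 0; i < spinCount; i++)
        x = _mm_add_ps(x, _mm_set1_ps(1.0f)); // each add waits on the last
    t = omp_get_wtime() - t;

    // spinCount dependent adds at ADD_LATENCY_CYCLES cycles each
    double ghz = (double)spinCount * ADD_LATENCY_CYCLES / t * 1e-9;

    volatile float sink = _mm_cvtss_f32(x); // keep the loop from being removed
    (void)sink;
    printf("estimated core frequency: %.2f GHz\n", ghz);
    return 0;
}

Compile without -ffast-math so the compiler cannot reassociate the adds and break the dependency chain; binding the thread and extending to all physical cores follows the recipe described above.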
