C OpenMP: libgomp: Thread creation failed: Resource temporarily unavailable

I was trying to do a basic project and it seems like I have run out of threads. Do you guys know how I can fix the problem?
Here is the code:
#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(2150);
    #pragma omp parallel
    {
        printf("%d\n", omp_get_thread_num());
    }
    return 0;
}
and here is the global compiler setting I have written on "other compiler options" on CodeBlocks:
-fopenmp
I am getting the error of:
libgomp: Thread creation failed: Resource temporarily unavailable
I have seen similar threads on the site, but I have not found an answer or solution yet.
Specs:
Intel i5 6400
2x8GB ram
Windows 10 64 bit

The problem is
omp_set_num_threads(2150);
The OS imposes a limit on the number of threads a process can create. The limit may be indirect, for example through the per-thread stack size. Creating 2150 threads exceeds it.
You mention that you've got the Intel i5 6400, which is a quad-core chip. Try setting the number of threads to something more reasonable. In your case:
omp_set_num_threads(4);
For numerical processing, performance will likely suffer when using more than 4 threads on a 4-core system.
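As a minimal sketch (not part of the original answer), you could also ask the OpenMP runtime how many logical processors are available instead of hard-coding a count; omp_get_num_procs() is the standard call for that:

#include <stdio.h>
#include <omp.h>

int main()
{
    // Use the number of logical processors reported by the runtime
    // rather than an oversized hard-coded value such as 2150.
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel
    {
        printf("%d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}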


How is CPU allocation done in Linux? Thread level or process level? [closed]

I am trying to understand how CPU time is distributed among processes with different numbers of threads. I have two programs, Program1 and Program2.
Program1 has 5 threads, whereas Program2 has ONLY the main thread.
SCENARIO -1 :
terminal-1 : ./Program1
terminal-2 : ./Program2
When I run Program1 in one terminal and Program2 in another terminal, the CPU is allocated 50% to Program1 and 50% to Program2. Each thread of Program1 gets 10% (cumulatively 50% for Program1).
This suggests that, no matter how many threads a process has, every process gets an equal share of the CPU, i.e. CPU allocation is done at the process level.
pstree shows
├─bash───P1───5*[{P1}]
├─bash───P2───{P2}
SCENARIO -2 :
terminal-1 : ./Program1 & ./Program2
When I run both Program1 and Program2 in the SAME terminal, the CPU is allocated equally among all threads of Program1 and Program2: each thread of Program1 gets almost 17% (cumulatively Program1 gets 83%) and Program2 also gets 17%. This suggests CPU allocation is done at the thread level.
pstree shows
├─bash─┬─P1───5*[{P1}]
│      └─P2
I am using Ubuntu 12.04.4 LTS with kernel 3.11.0-15-generic. I have also tried Ubuntu 14.04.4 with kernel 3.16.x and got similar results.
Can anyone explain how the LINUX KERNEL's CPU scheduler distinguishes SCENARIO-1 from SCENARIO-2?
I think the CPU scheduler distinguishes the two SCENARIOs somewhere before allocating CPU.
To understand how, I downloaded the Linux kernel source code.
However, I haven't found where in the source code it distinguishes SCENARIO-1 from SCENARIO-2.
It would be great if anyone could point me to the source file or function where the CPU scheduler makes this distinction.
Thanks in advance.
NOTE: Although Ubuntu is based on Debian, surprisingly, in Debian 8 (kernel 3.16.0-4-686-pae) CPU allocation is done at the thread level in BOTH SCENARIOs: each thread of Program1 gets almost 17% (cumulatively Program1 gets 83%) and Program2 also gets 17%.
Here is the code :
Program1(with 5 threads)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>

// Let us create a global variable to change it in threads
int g = 0;

// The function to be executed by all threads
void *myThreadFun(void *vargp)
{
    // Store the value argument passed to this thread
    int myid = (int)(intptr_t)vargp;

    // Let us create a static variable to observe its changes
    static int s = 0;

    // Change static and global variables
    ++s; ++g;

    // Print the argument, static and global variables
    printf("Thread ID: %d, Static: %d, Global: %d\n", myid, ++s, ++g);

    while (1); // Representing CPU-bound work
}

int main()
{
    int i;
    pthread_t tid[5];

    // Let us create five threads
    for (i = 0; i < 5; i++)
        pthread_create(&tid[i], NULL, myThreadFun, (void *)(intptr_t)i);

    for (i = 0; i < 5; i++)
        pthread_join(tid[i], NULL);

    return 0;
}
Program2 ( with only main thread)
#include <stdio.h>
#include <stdlib.h>

int main()
{
    while (1); // Representing CPU-bound work
}
To disable all optimization by gcc, I used the -O0 option when compiling both programs.
gcc -O0 program1.c -o p1 -lpthread
gcc -O0 program2.c -o p2
UPDATE: As per ninjalj's explanation, in Scenario-1 CPU allocation is done at the control-group level: since I am using two different terminals (i.e. two different sessions), there are two different control groups and each gets 50% of the CPU. The reason is that autogrouping is enabled by default.
Since Program2 has ONLY one thread and Program1 has more threads, I want to run both programs in separate terminals (different sessions) and still get more CPU for Program1 (as in Scenario-2, where Program1 gets 83% compared to Program2's 17%). Is there any way to make the CPU allocation of Scenario-1 the same as Scenario-2 on Ubuntu?
It also surprises me that, although Ubuntu is based on Debian, Debian and Ubuntu behave differently: on Debian, Program1 gets more CPU in both Scenarios.
The Linux kernel does not distinguish between processes and threads in scheduling.
Threads are processes that just happen to share most of their memory. Beyond that, they are treated equally by the scheduler.
You can have 50 processes and 30 threads. That's 80 "things", and the kernel will schedule them without regard to whether they are processes or threads.

Regarding CPU utilization

Considering the below piece of C code, I expected the CPU utilization to go up to 100% as the processor would try to complete the job (endless in this case) given to it. On running the executable for 5 mins, I found the CPU to go up to a max. of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; Compiler: GCC 4.1.
int i = 10;
while (1) {
    i = i * 5;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I added a "printf()" inside the loop the CPU hits 88%. I understand that in this case, the processor also has to write to the standard output stream hence the sharp rise in usage.
Has this got something to do with the amount of job assigned to the processor per unit time?
Regards,
Ven.
You have a multicore processor and you are in a single-thread scenario, so you will use only one core at full throttle... Why do you expect the overall processor usage to go to 100% in such a context?
Run two copies of your program at the same time. These will use both cores of your "Core 2 Duo" CPU and overall CPU usage will go to 100%.
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
printf sends some characters to the terminal/screen. Sending the information, displaying it, and updating the screen are handled by code outside your executable, and this is likely to run on another thread. But displaying a few characters does not need 100% of such a thread. That is why you see 100% on core 1 and 76% on core 2, which results in the overall CPU usage of 88% that you observe.
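For comparison, here is a rough pthread sketch (my own example, not from the original answer) that runs one endless busy loop per core; with two such threads on a Core 2 Duo, overall CPU usage should approach 100%:

#include <pthread.h>

// Endless CPU-bound loop; volatile keeps the compiler from removing it.
static void *spin(void *arg)
{
    (void)arg;
    volatile unsigned long x = 1;
    for (;;)
        x = x * 5;
    return NULL; // never reached
}

int main(void)
{
    // One busy thread per core of a dual-core CPU.
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, spin, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Compile with something like gcc -std=c99 spin.c -o spin -lpthread.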

Time difference for same code of multithreading on different processors?

Hypothetical Question.
I wrote a multithreaded program that creates 8 threads, processes the data on the different threads, and completes. I am also using a semaphore in the code. It gives me different execution times on different machines, which is OBVIOUS!!
Execution time for same code:
On Intel(R) Core(TM) i3 CPU Machine: 36 sec
On AMD FX(tm)-8350 Eight-Core Processor Machine : 32 sec
On Intel(R) Core(TM) i5-2400 CPU Machine : 16.5 sec
So, my question is,
Is there any kind of setting/variable/command/switch I am missing that could be enabled on the faster machine but not on the slower one, making its execution time shorter? Or is the time difference due to the processor alone?
Any kind of help/suggestions/comments will be helpful.
Operating System: Linux (Centos5)
Multi-threading benchmarks should be performed with significant statistical sampling (e.g. around 50 experiments per machine; a rough sketch of this appears after the list below). Furthermore, the environment in which the program runs matters too (e.g. was Firefox running at the same time or not).
Also, runtimes can vary depending on resource consumption. In other words, without a more complete picture of your experimental conditions, it's impossible to answer your question.
Some observations from my personal experience:
Huge memory consumption can alter the results depending on the swapping settings of the machine.
Two "identical" machines with the same OS installed under the same conditions can show different results.
When the total runtime is small compared to 5 minutes, results appear pretty random.
etc.
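As a rough illustration of the statistical-sampling point (my own sketch, not the answerer's code; run_workload() is a hypothetical stand-in for the real 8-thread job), you could repeat the measurement and report the mean and standard deviation:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <math.h>
#include <time.h>

// Hypothetical stand-in for the real multithreaded job being benchmarked.
static void run_workload(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; i++)
        x += i;
}

int main(void)
{
    enum { RUNS = 50 };
    double t[RUNS], mean = 0.0, var = 0.0;

    for (int r = 0; r < RUNS; r++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        run_workload();
        clock_gettime(CLOCK_MONOTONIC, &b);
        t[r] = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        mean += t[r];
    }
    mean /= RUNS;
    for (int r = 0; r < RUNS; r++)
        var += (t[r] - mean) * (t[r] - mean);

    printf("mean %.3f s, stddev %.3f s over %d runs\n",
           mean, sqrt(var / (RUNS - 1)), RUNS);
    return 0;
}

On older systems (e.g. CentOS 5) you may need to link with -lrt for clock_gettime, and with -lm for sqrt.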
I once had a problem with time measurement: the measured time for the multithreaded version was larger than for the single-threaded one. The problem turned out to be that I was measuring the time inside each thread and summing the results, instead of measuring around all the threads from the outside. For example:
Wrong measure (timing inside each thread and summing the results):
int main(void)
{
    // create_thread();
    // join_thread();
    // sum the per-thread times
}
void thread(void *arg)
{
    // measure time inside the thread
}
Right measure (wall-clock time around creating and joining the threads):
int main(void)
{
    // record start time
    // create_thread();
    // join_thread();
    // record end time
    // calculate the diff
}
void thread(void *arg)
{
    // just do the work; no timing inside the thread
}
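To make the "right measure" concrete, here is a minimal sketch (my wording, using a dummy CPU-bound worker as a stand-in for the real per-thread job) that times the wall clock around thread creation and joining with clock_gettime:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define NTHREADS 8

// Dummy CPU-bound worker standing in for the real per-thread job.
static void *worker(void *arg)
{
    (void)arg;
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 50000000UL; i++)
        x += i;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);   // record start time
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);     // record end time

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;  // calculate the diff
    printf("wall-clock time: %.3f s\n", secs);
    return 0;
}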

Behavior of omp_get_max_threads in parallel regions

I compile this bit of code on Snow Leopard and Linux and I get different results. On Snow Leopard, the first call of omp_get_max_threads returns 2, which is my number of cores, while the second returns 1. On Linux, both calls return 4, which is my number of cores. I think Linux has the correct behavior, am I right? Are both correct and I just have a misunderstanding of this function?
#include <stdio.h>
#include <omp.h>

int main() {
    printf(" In a serial region; max threads are : %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        #pragma omp master
        printf(" In a parallel region; max threads are : %d\n", omp_get_max_threads());
    }
}
Mac output:
In a serial region; max threads are : 2
In a parallel region; max threads are : 1
Linux output:
In a serial region; max threads are : 4
In a parallel region; max threads are : 4
This call is well specified in the OpenMP spec. Linux has the correct behavior here.
That being said, you are in a master region, which is effectively serial and runs on the main thread, so the result of the call is explainable. If you aren't tied to pure C, I would encourage you to look at the C++ TBB library, and particularly the PPL subset; you will find more generality and composability there, e.g. for nested parallelism. I'm on my phone, so I apologize for typos here.
With Apple-supplied gcc 4.2 [gcc version 4.2.1 (Apple Inc. build 5566)] on Leopard, I get the same results as you (except that my MacBook has fewer cores).
In a serial region; max threads are : 2
In a parallel region; max threads are : 1
Ditto for 4.3.4 from MacPorts.
However, with gcc 4.4.2 and 4.5.0 20091231 (experimental) from MacPorts, on the same computer I get:
In a serial region; max threads are : 2
In a parallel region; max threads are : 2
It looks like this isn't a Mac versus Linux issue, but due to the gcc version.
P.S. OpenMP can do nested parallelism.
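Since nested parallelism came up, here is a minimal sketch of it (my own example, not taken from the answers above); omp_set_nested(1) enables nesting in the classic API:

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_nested(1);       // allow nested parallel regions
    omp_set_num_threads(2);

    #pragma omp parallel     // outer team of 2 threads
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)   // each outer thread forks an inner team
        {
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}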
Just a reminder, there is a forum devoted just to OpenMP, read by the developers of OpenMP as well as OpenMP experts worldwide. It's over at the official OpenMP website: http://openmp.org/forum
Great place to ask questions like this, and to find a lot of other resources at openmp.org
Weird. I always get the expected behaviour (using 4.2.1, build 5646 dot 1) with OS X 10.6.2:
On my Mac Pro
In a serial region; max threads are : 8
In a parallel region; max threads are : 8
and on my iMac
In a serial region; max threads are : 2
In a parallel region; max threads are : 2
Must be something else going on here. Are you compiling with just the following?
gcc fname.c -fopenmp

Why would one CPU core run slower than the others?

I was benchmarking a large scientific application, and found that it would sometimes run 10% slower given the same inputs. After much searching, I found that the slowdown only occurred when it was running on core #2 of my quad-core CPU (specifically, an Intel Q6600 running at 2.4 GHz). The application is single-threaded and spends most of its time in CPU-intensive matrix math routines.
Now that I know one core is slower than the others, I can get accurate benchmark results by setting the processor affinity to the same core for all runs. However, I still want to know why one core is slower.
I tried several simple test cases to determine the slow part of the CPU, but the test cases ran with identical times, even on slow core #2. Only the complex application showed the slowdown. Here are the test cases that I tried:
Floating point multiplication and addition:
accumulator = accumulator*1.000001 + 0.0001;
Trigonometric functions:
accumulator = sin(accumulator);
accumulator = cos(accumulator);
Integer addition:
accumulator = accumulator + 1;
Memory copy while trying to make the L2 cache miss:
int stride = 4*1024*1024 + 37; // L2 cache size + small prime number
for (long iter = 0; iter < iterations; ++iter) {
    for (int offset = 0; offset < stride; ++offset) {
        for (i = offset; i < array_size; i += stride) {
            array1[i] = array2[i];
        }
    }
}
The Question: Why would one CPU core be slower than the others, and what part of the CPU is causing that slowdown?
EDIT: More testing showed some Heisenbug behavior. When I explicitly set the processor affinity, then my application does not slow down on core #2. However, if it chooses to run on core #2 without an explicitly set processor affinity, then the application runs about 10% slower. That explains why my simple test cases did not show the same slowdown, as they all explicitly set the processor affinity. So, it looks like there is some process that likes to live on core #2, but it gets out of the way if the processor affinity is set.
Bottom Line: If you need to have an accurate benchmark of a single-threaded program on a multicore machine, then make sure to set the processor affinity.
You may have applications that have opted to be attached to that same processor (CPU affinity).
Operating systems often like to keep running on the same processor, since all their data stays cached in that core's L1 cache. If you happen to run your process on the same core that your OS is doing a lot of its work on, you could see a slowdown in your CPU performance.
It sounds like some process wants to stick to the same CPU. I doubt it's a hardware issue.
It doesn't necessarily have to be your OS doing the work; some other background daemon could be doing it.
Most modern CPUs throttle each core separately due to overheating or power-saving features. You may try to turn off power saving or improve cooling. Or maybe your CPU is bad. On my i7 I see about 2-3 degrees of difference between the core temperatures of the 8 reported cores in "sensors". At full load there is still variation.
Another possibility is that the process is being migrated from one core to another while running. I'd suggest setting the CPU affinity to the 'slow' core and see if it's just as fast that way.
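On Linux, for example, a process can pin itself to a core with sched_setaffinity (a hedged sketch of that approach follows; on other systems the mechanism differs, e.g. SetProcessAffinityMask on Windows, or the taskset command from a shell):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                          // pin to core #2

    // pid 0 means "the calling process"
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    // ... run the benchmark here; the process now stays on core #2 ...
    return 0;
}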
Years ago, before the days of multicore, I bought myself a dual-socket Athlon MP for 'web development'. Suddenly my Plone/Zope/Python web servers slowed to a crawl. A Google search turned up that the CPython interpreter has a global interpreter lock, but Python threads are backed by OS threads. OS threads were evenly distributed among the CPUs, but only one thread can hold the lock at a time, so all the others had to wait.
Setting Zope's CPU affinity to any CPU fixed the problem.
I've observed something similar on my Haswell laptop. The system was quiet, no X running, just the terminal. Executing the same code with different numactl --physcpubind options gave exactly the same results on all cores except one. I changed the frequency of the cores to Turbo and to other values; nothing helped. All cores ran at the expected speed except one, which was always slower than the others. That effect survived a reboot.
I rebooted the computer and turned off Hyper-Threading in the BIOS. When it came back online it was fine again. I then turned Hyper-Threading back on and it has been fine ever since.
Bizarre. No idea what that could be.
