openmp not utilizing all threads - c

I've written parallel program in C using OpenMP.
I want to control number of threads program is using.
I'm using system with:
CentOS release 6.5 (Final)
icc version 14.0.1 (gcc version 4.4.7 compatibility)
2x Intel(R) Xeon(R) CPU E5-2620 0 # 2.00GHz
Program that I run:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
double t1[TABLE_SIZE];
double t2[TABLE_SIZE];
int main(int argc, char** argv) {
omp_set_dynamic(0);
omp_set_nested(0);
omp_set_num_threads(NUM_OF_THREADS);
#pragma omp parallel for default(none) shared(t1, t2) private(i)
for(i=0; i<TABLE_SIZE; i++) {
t1[i] = rand();
t2[i] = rand();
}
for(i=0; i<NUM_OF_REPETITION; i++) {
test1(t1, t2);
}
}
void test1(double t1[], double t2[]) {
int i;
double result;
#pragma omp parallel for default(none) shared(t1, t2) private(i) reduction(+:result)
for(i=0; i<TABLE_SIZE; i++) {
result += t1[i]*t2[i];
}
}
I'm running script that sets TABLE_SIZE(2500, 5000, 100000, 1000000), NUM_OF_THREADS(1-24), NUM_OF_REPETITION(50000 as 50k, 100000 as 100k, 1000000 as 1M) at compile time.
The problem is that computer is not utilizing all the threads that are offered all the time.
It seems that problem is dependent on TABLE_SIZE.
For example when I compile the code with TABLE_SIZE=2500 all is fine till NUM_OF_THREADS=20. Then some weird things happen. When I set NUM_OF_THREADS=21 the program is utilizing only 18 threads(I observe htop to see how many threads are running). When I set NUM_OF_THREADS=23 and NUM_OF_REPETITION=100k it's using 18 threads, but if I change NUM_OF_REPETITION to 1M at NUM_OF_THREADS=23 it's using 19 threads.
When I change TABLE_SIZE to 5000 the anomally starts at 18 threads. I set NUM_OF_THREADS=18 and at NUM_OF_REPETITION=1M the program uses only 17 threads. When I set NUM_OF_THREADS=19 and NUM_OF_REPETITION=100k or 1M it uses only 17 threads. If I change NUM_OF_THREADS to 24 the program is using 20 threads at NUM_OF_REPETITION=50k, 22 threads at NUM_OF_REPETITION=100k and 23 threads at NUM_OF_REPETITION=1M.
This sort of inconsistency is going on and on with increasing TABLE_SIZE. The bigger the TABLE_SIZE the faster(at lower NUM_OF_THREADS) the inconsistency occours.
At this(OpenMP set_num_threads() is not working) post I read that omp_set_num_threads() sets the upper limit of threads that can be used by the program. And as you can see I've disabled dynamic teams and program is still not using all the threads. It doesn't help if I set environment variables OMP_NUM_THREADS and OMP_DYNAMIC either.
So I went and read some of OpenMP specification 3.1. And it says program should use the number of threads it is set by omp_set_num_threads(). Also omp_get_max_threads() function returns 24 available threads.
Any help would be greatly appreciated.

I finally found a solution. I set the KMP_AFFINITY environment variable. It doesn't matter if I set variable to "compact" or "scatter"(I'm just interested in using all threads for now).
This is what documentation has to say(https://software.intel.com/en-us/articles/openmp-thread-affinity-control):
There are 2 considerations for OpenMP threading and affinity: First, determine the number of threads to utilize, and secondly, how to bind threads to specific processor cores.
If you do not set a value for KMP_AFFINITY, the OpenMP runtime is allowed to choose affinity for you. The value chosen depends on the CPU architecture and may change depending on what affinity is deemed most efficient FOR A VARIETY OF APPLICATIONS for that architecture.
Another source (https://software.intel.com/en-us/node/522691):
Affinity Types:
type = none (default)
Does not bind OpenMP* threads to particular thread contexts; however, if the operating system supports affinity, the compiler still uses the OpenMP* thread affinity interface to determine machine topology.
So I guess because I did not have KMP_AFFINITY set, the OpenMP runtime set most efficient affinity to its knowledge. Please correct me if I'm wrong.

Related

OpenMP: Is a barrier inside conditional code valid?

In the OpenMP Specification, the following restriction is posed for a barrier construct: (see p. 259, lines 30-31):
Each barrier region must be encountered by all threads in a team or by
none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
Just for completeness, the definition of a region by OpenMP specification is as follows (cf. p.5, lines 9 ff.):
region
All code encountered during a specific instance of
the execution of a given construct, structured block sequence or
OpenMP library routine. A region includes any code in called routines
as well as any implementation code. [...]
I came up with a very simple example and I am asking myself whether it is at all valid, because the barriers are placed inside if-conditions (and not every barrier is "seen" by each thread). Nevertheless, the number of barriers is identical for each thread and experiments with two compilers show that the code works as expected.
#include <stdio.h>
#include <unistd.h>
#include <stdarg.h>
#include <sys/time.h>
#include "omp.h"
double zerotime;
double gettime(void) {
struct timeval t;
gettimeofday(&t, NULL);
return t.tv_sec + t.tv_usec * 1e-6;
}
void print(const char *format, ...) {
va_list args;
va_start (args, format);
#pragma omp critical
{
fprintf(stdout, "Time = %1.1lfs ", gettime() - zerotime);
vfprintf (stdout, format, args);
}
va_end (args);
}
void barrier_test_1(void) {
for (int i = 0; i < 5; i++) {
if (omp_get_thread_num() % 2 == 0) {
print("Path A: Thread %d waiting\n", omp_get_thread_num());
#pragma omp barrier
} else {
print("Path B: Thread %d waiting\n", omp_get_thread_num());
sleep(1);
#pragma omp barrier
}
}
}
int main() {
zerotime = gettime();
#pragma omp parallel
{
barrier_test_1();
}
return 0;
}
For four threads I get the following output:
Time = 0.0s Path B: Thread 1 waiting
Time = 0.0s Path B: Thread 3 waiting
Time = 0.0s Path A: Thread 0 waiting
Time = 0.0s Path A: Thread 2 waiting
Time = 1.0s Path B: Thread 1 waiting
Time = 1.0s Path B: Thread 3 waiting
Time = 1.0s Path A: Thread 2 waiting
Time = 1.0s Path A: Thread 0 waiting
Time = 2.0s Path B: Thread 1 waiting
Time = 2.0s Path B: Thread 3 waiting
Time = 2.0s Path A: Thread 0 waiting
Time = 2.0s Path A: Thread 2 waiting
...
which shows that all the threads nicely wait for the slow Path B operation and pair up even though they are not placed in the same branch.
However, I am still confused from the specification, whether my code is at all valid.
Contrast this e.g. with CUDA where the following statement is given regarding the related __syncthreads() routine:
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block,
otherwise the code execution is likely to hang or produce unintended
side effects.
Thus, in CUDA, such code as written above in terms of __syncthreads() would be invalid, because the condition omp_get_thread_num() % 2 == 0 evaluates differently depending on the thread.
Follow-up Question:
While I am quite ok with the conclusion that the code above is not following the specification, a slight modification of the code could be as follows, where barrier_test_1() is replaced by barrier_test_2():
void call_barrier(void) {
#pragma omp barrier
}
void barrier_test_2(void) {
for (int i = 0; i < 5; i++) {
if (omp_get_thread_num() % 2 == 0) {
print("Path A: Thread %d waiting\n", omp_get_thread_num());
call_barrier();
} else {
print("Path B: Thread %d waiting\n", omp_get_thread_num());
sleep(1);
call_barrier();
}
}
}
We recognize, that we have only a single barrier placed inside the code and this one is visited by all threads in the team. While the above code would be still invalid in the CUDA case, I am still unsure about OpenMP. I think it boils down to the question what actually constitutes the barrier region, is it just the line in the code or is it all code which has been traversed between subsequent barriers? This is also the reason, why I looked up the definition of a region in the specification. More precisely, as far as I can see there is no code encountered during a specific instance of the execution of <the barrier construct>, which is due to the statement about stand-alone directives in the spec (p.45, lines 3+5)
Stand-alone directives are executable directives that have no
associated user code.
and
Stand-alone directives do not have any associated executable user
code.
and since (p.258 line 9)
The barrier construct is a stand-alone directive.
Maybe the following part of the spec is also of interest (p.259, lines 32-33):
The sequence of worksharing regions and barrier regions encountered
must be the same for every thread in a team.
Preliminary Conclusion:
We can wrap a barrier into a single function as above and replace all barriers by a call to the wrapper function which causes:
All threads either continue executing user code or wait at the barrier
If we call the wrapper only by a subset of threads, this will cause a deadlock but will not lead to undefined behavior
Between calls to the wrapper, the number of met barriers is identical among the threads
Essentially this means, we can safely synchronize and cut through different execution paths by the use of such wrapper
Am I correct?
In the OpenMP Specification, the following restriction is posed for a
barrier construct: (see p. 259, lines 30-31):
Each barrier region must be encountered by all threads in a team or by
none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
That description is a bit problematic because barrier is a stand-alone directive. That means it has no associated code other than the directive itself, and therefore there is no such thing as a "barrier region".
Nevertheless, I think the intent is clear, both from the wording itself and from the conventional behavior of barrier implementations: absent any cancellation, if any thread in a team executing the innermost parallel region containing a given barrier construct reaches that barrier, then all threads in the team must reach that same barrier construct. Different barrier constructs represent different barriers, each requiring all threads to arrive before any proceed past.
However, I am still confused from the specification, whether my code is at all valid.
I see that the behavior of your test code suggests that the two barriers are being treated as a single one. This is irrelevant to interpreting the specification, however, because your code indeed does not satisfy the requirement you asked about. The spec does not require the program to fail in any particular way in this case, but it certainly does not require the behavior you observe, either. You might well find that the program behaves differently with a different version of the compiler or a different OpenMP implementation. The compiler is entitled to assume that your OpenMP code conforms to the OpenMP spec.
Of course, in the case of your particular example, the solution is to replace the two barrier constructs in the different conditional branches with a single one immediately following the else block.

Optimizing thread level parallelism - C

I'm currently writing a contrived producer/consumer program that uses a bounded buffer and a number of worker threads to multiply matrices (for the sake of exposition). While trying to assure that the concurrency of my program is optimal I'm observing some unusual behavior. Namely, when executing the program I am only able to achieve up to 100% CPU usage (observed in top), despite having 6 cores. Using shift-i to view the relative percentage changes the upper bound to ~%16.7 and I can clearly see when viewing the usage of different cores by pressing 1 that either only one core is fully maxed out or the load of one core is distributed amongst all six.
No matter what stress tests I run (I tried the stress program and a simple stress test that created multiple threads that spun) I cannot get a single process to use more than 100% (or ~%16.7 relative to all available cores) CPU, so I presume the parallelism is bound to a single core. This behavior is observed on an Ubuntu LTS 20.10 running within VirtualBox on a Mac host with a 2.3 GHz 8-Core Intel Core i9 processor. Is there some way I must enable multi-core parallelism in VirtualBox or is this perhaps just an idiosyncrasy of the setup?
For reference here is the simple stress test I used
void *prod_worker(void *arg) {
while (1) {
printf("...");
}
}
int main (int argc, char * argv[])
{
printf("pid: %lun\n", getpid());
getc(stdin);
int numw = atoi(argv[1]);
pthread_t *prod_threads = malloc(sizeof(pthread_t) * numw);
for (int i = 0; i < numw; i++) {
pthread_t prod;
int rcp;
rcp = pthread_create(&prod, NULL, prod_worker, NULL);
if (rcp == THREAD_CREATE_SUCCESS) {
prod_threads[i] = prod;
} else {
printf("Failed to create producer and consumer thread #%d...\n", i);
printf("Error codes: prod = %d\n", rcp);
printf("Retrying...\n");
i--;
}
}
for (int i = 0; i < numw; i++) {
pthread_join(prod_threads[i], NULL);
}
return 0;
}
printf is thread safe, which typically means there is a mutex of some kind on the stdout stream, such that only one thread can print at a time. Thus even though all your threads are "running", at any given time, all but one of them is likely to be waiting to take ownership of the stream, thus doing nothing useful and using no CPU.
You might want to instead try a test where the threads do some computation, instead of just I/O.

Pinning a large number of threads to a single CPU causes utilization spike on all cores

I've written a little test program that spawns a large number of threads (in my case 32 threads on a computer with 4 cores) and pins them all to one core with the pthread_setaffinity_np syscall.
These threads run in a loop in which they report the result of the sched_getcpu call via stdout and then sleep for a short time. What I wanted to see is, how strictly the OS adheres to a user's thread pinning settings (even if they don't make sense as in my case).
All threads report to be running on the core I've pinned them to, which is what I would have expected.
However, I've noticed, that while the program is running, cpu utilization on all 4 cores is around 100% (normally it's between 0% and 25%). Could someone enlighten me as to why this would be the case?
I would have expected the utilization on the pinned core to be maximal with it being maybe a little higher on the other cores to compensate.
I can append my code if necessary, but I figured it's pretty straightforward and thus not really necessary. I did the test on a fairly old PC with Ubuntu 18.04.
Update
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#define THREADS 32
#define PINNED 3
#define MINUTE 60
#define MILLISEC 1000
void thread_main(int id);
int main(int argc, char** argv) {
int i;
pthread_t pthreads[THREADS];
printf("%d threads will be pinned to cpu %d\n", THREADS, PINNED);
for (i = 0; i < THREADS; ++i) {
pthread_create(&pthreads[i], NULL, &thread_main, i);
}
sleep(MINUTE);
return 0;
}
void thread_main(int id) {
printf("thread %d: inititally running on cpu %d\n", id, sched_getcpu());
pthread_t pthread = pthread_self();
cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(PINNED, &cpu_set);
assert(0 == pthread_setaffinity_np(pthread, sizeof(cpu_set_t), &cpu_set));
while (1) {
printf("thread %d: running on cpu %d\n", id, sched_getcpu());
//usleep(MILLISEC);
}
}
When I close all background activity utilization is not quite 100%, but definitely affects all 4 cores to a significant degree.
#caf
If you're running these in a pseudo-terminal, then another process is receiving all of > that printf output and processing it, which requires CPU time as well. That process
(your terminal, likely also Xorg) is going to show up heavily in profiles. Consider > that graphically rendering that text output is going to be far more CPU-intensive than the printf() that generates it. Try running your test process with output redirected to /dev/null.
This is the correct answer, thanks.
With the output directed to /dev/null the CPU usage spikes are restricted to the CPU that has all the threads pinned to it.

Not seeing any speedup using openmp

I am very new in openmp and am trying to understand its constructs..
Here is a simple code I wrote... (square of the number)..
#include <omp.h>
#include <stdio.h>
#define SIZE 20000
#define NUM_THREADS 50
int main(){
int id;
int output[SIZE];
omp_set_num_threads(NUM_THREADS);
double start = omp_get_wtime();
#pragma omp parallel for
//{
//id = omp_get_thread_num();
for (int i=0; i<SIZE;i++){
id = omp_get_thread_num();
//printf("current thread :%d of %d threads\n", id, omp_get_num_threads());
output[i] = i*i;
}
//}
double end = omp_get_wtime();
printf("time elapsed: %f for %d threads\n", end-start, NUM_THREADS);
}
Now, changing number of threads should decrease the time.. but actually it is increasing the time?
What am i doing wrong?
This is most likely due to your choice of problem to inspect. Lets look at your parallel loop:
#pragma omp parallel for
for (int i=0; i<SIZE;i++){
id = omp_get_thread_num();
output[i] = i*i;
}
You have specified 50 threads and stated you have 16 cores.
The serial case ignores the OMP directive and can perform aggressive optimization of the loop. Each element i is i*i, a simple multiplication dependent on nothing but the loop index. id can be optimized out completely. This probably gets completely vectorized and if your processor is modern it can probably do 4 multiplies in a single instruction (SIMD) meaning for size=2000, you are looking at 500 SIMD multiplications (with no data fetch overhead and a cache friendly store). This is going to be very fast.
Alternatively, lets look at the parallel version. You are initializing 50 threads -- expensive!. You are introducing many context switches as even if you have processor affinity, you have oversubscribed your cores. Each of the 50 threads is going to run 40 iterations of your loop. If you are lucky the compiler unrolled the loop a bit so it could instead do 10 iterations of a SIMD multiply. The multiplies, whether SIMD or not, are still going to be fast. What you end up with is the same amount of real work, so each processor has 1/16th of the work but the overhead of 50 threads being created and destroyed creates more work than the parallel gain. This is a good example of something that doesn't benefit from parallelization.
The first thing you want to do is limit your number of threads to your actual cores. You are not going to gain anything by adding needless context switches to your execution time. More threads than cores is generally not going to make it go faster.
The second thing you want to do is to do something more complicated in your loop, and do it many times (google for examples, there are many). When constructing your work loop you will also want to keep cache performance in mind, as badly constructed loops don't speedup well.
When you change your work to be more complex than the thread overhead, embarassingly parallel and great cache performance you can start to see a real benefit to OpenMP. The last thing you'll want to do is benchmark your loop from serial to 16 threads. e.g.:

How many threads to launch in OpenMP?

I am new to OpenMP Programming and I have executed several open-mp sample programs on GCC . I wanted to know how will I decide on how many threads to launch (i.e how to decide the parameter of omp_set_num_threads() function) to get the better performance on dual core intel processor .
*This is my sample program*
#include<math.h>
#include<omp.h>
#include<stdio.h>
#include<time.h>
#define CHUNKSIZE 10
#define N 100000
#define num_t 10
void main ()
{
int runTime;
int i, chunk;
int a[N], b[N], c[N],threads[num_t];
int thread_one=0,thread_two=0;
clock_t start,end;
omp_set_num_threads(num_t);
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i + 2.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,chunk,threads) private(i)
{
#pragma omp for schedule(dynamic,chunk)
for (i=0; i < N; i++)
{
c[i] = pow((a[i] * b[i]),10);
threads[omp_get_thread_num()]++;
}
} /* end of parallel section */
for(i=-1;i<num_t;i++)
printf("Thread no %d : %d\n",i,threads[i]);
}
As a rule of thumb, set for a first try your threads number to the number of cores of your machine. Then try to decrease this number to see if any improvement occurs.
By the way, rather than using omp_set_num_threads, setting OMP_NUM_THREADS environment variable is way more convenient to do such tests
My advice: don't bother. If it's a computationally intensive app (which openmp is mainly used for and what you have here) then the library itself will do a good job of managing everything.
The optimal number of threads depends on many parameters and it is hard to devise a general rule of a thumb.
For compute intensive tasks with low fetch/compute ratio, it would be best to set the number of threads to be equal to the number of CPU cores.
For heavy memory-bound tasks increasing the number of threads might saturate the memory bandwidth way before the number of threads becomes equal to the number of cores. Loop vectorisation can affect the memory bandwidth for a single thread significantly. In some cases threads share lots of data in the CPU cache, but in some - they don't and increasing their number decreases the available cache space. Also NUMA systems usually provide better bandwidth than SMP ones.
In some cases best performance could be achieved with more threads than cores - true when lots of blocking waiting is observed within each task. Sometimes SMT or HyperThreading can hide memory latency, sometimes it can't, depending on the kind of memory access being performed.
Unless you can model your code performance and make an educated guess on the best number of threads to run with, just experiment with several values.

Resources