I am new to OpenMP programming and have run several OpenMP sample programs with GCC. I want to know how to decide how many threads to launch (i.e. how to choose the argument of omp_set_num_threads()) to get the best performance on a dual-core Intel processor.
This is my sample program:
#include <math.h>
#include <omp.h>
#include <stdio.h>
#define CHUNKSIZE 10
#define N 100000
#define num_t 10
int main(void)
{
    int i, chunk;
    int a[N], b[N], c[N];
    int threads[num_t] = {0};   /* per-thread iteration counters */
    omp_set_num_threads(num_t);
    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i + 2;
    chunk = CHUNKSIZE;
    #pragma omp parallel shared(a, b, c, chunk, threads) private(i)
    {
        #pragma omp for schedule(dynamic, chunk)
        for (i = 0; i < N; i++)
        {
            c[i] = pow(a[i] * b[i], 10);
            threads[omp_get_thread_num()]++;
        }
    } /* end of parallel section */
    for (i = 0; i < num_t; i++)
        printf("Thread no %d : %d\n", i, threads[i]);
    return 0;
}
As a rule of thumb, for a first try set the number of threads to the number of cores on your machine. Then try decreasing this number to see whether any improvement occurs.
By the way, rather than calling omp_set_num_threads(), setting the OMP_NUM_THREADS environment variable is much more convenient for such tests.
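For example, assuming the sample above is saved as sample.c and you drop the omp_set_num_threads() call (which would otherwise override the environment variable), you can try different thread counts without recompiling:

$ gcc -fopenmp sample.c -o sample
$ OMP_NUM_THREADS=2 ./sample   # start with the number of cores
$ OMP_NUM_THREADS=1 ./sample   # then vary the count and compare timings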
My advice: don't bother. If it's a computationally intensive app (which is what OpenMP is mainly used for, and what you have here), then the runtime itself will do a good job of managing everything.
The optimal number of threads depends on many parameters and it is hard to devise a general rule of thumb.
For compute-intensive tasks with a low fetch/compute ratio, it is best to set the number of threads equal to the number of CPU cores.
For heavily memory-bound tasks, increasing the number of threads might saturate the memory bandwidth well before the number of threads equals the number of cores. Loop vectorisation can significantly affect the memory bandwidth consumed by a single thread. In some cases threads share lots of data in the CPU cache, but in others they don't, and increasing their number shrinks the cache space available to each. Also, NUMA systems usually provide better aggregate bandwidth than SMP ones.
In some cases the best performance is achieved with more threads than cores, typically when lots of blocking waits occur within each task. Sometimes SMT or HyperThreading can hide memory latency and sometimes it can't, depending on the kind of memory access being performed.
Unless you can model your code's performance and make an educated guess at the best number of threads, just experiment with several values.
I am using MacBook Air M1 2020, Apple M1 7-Core GPU, RAM 8GB.
The problem:
I am comparing pairs of arrays, which takes around 11 minutes when executed sequentially. Strangely, the more threads I put to work, the longer it takes to finish (even when NOT using the mutex). So far I have tried running it with 2 and then 4 threads.
What could be the problem? I assumed using 4 threads would be more efficient, since I have 7 cores available and the execution time seems (to me) long enough to compensate for the overhead of handling multiple threads.
This is part of the code which I find relevant to this question:
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

int const xylen = 1024;
static uint8_t voxelGroups[321536][xylen];
int threadCount = 4;

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
unsigned long identicalVoxelGroupsCount = 0;

bool areVoxelGroupsIdentical(uint8_t firstArray[xylen], uint8_t secondArray[xylen]){
    return memcmp(firstArray, secondArray, xylen*sizeof(uint8_t)) == 0;
}

void* getIdenticalVoxelGroupsCount(void* threadNumber){
    for(int i = (int)(intptr_t)threadNumber - 1; i < 321536-1; i += threadCount){
        for(int j = i+1; j < 321536; j++){
            if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j])){
                pthread_mutex_lock(&mutex);
                identicalVoxelGroupsCount++;
                pthread_mutex_unlock(&mutex);
            }
        }
    }
    return 0;
}

int main(){
    // some code
    pthread_t thread1, thread2, thread3, thread4;
    pthread_create(&thread1, NULL, getIdenticalVoxelGroupsCount, (void *)1);
    pthread_create(&thread2, NULL, getIdenticalVoxelGroupsCount, (void *)2);
    pthread_create(&thread3, NULL, getIdenticalVoxelGroupsCount, (void *)3);
    pthread_create(&thread4, NULL, getIdenticalVoxelGroupsCount, (void *)4);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);
    pthread_join(thread4, NULL);
    // some code
}
First of all, the lock serializes all the identicalVoxelGroupsCount increments. Using more threads will not speed up this part. On the contrary, it will be slower because of cache-line bouncing: the cache lines containing the lock and the counter move from one core to another serially (see: cache coherence protocols). This is generally much slower than doing all the work sequentially, because moving a cache line from one core to another incurs a fairly large latency. You do not need a lock here. You can instead increment a local variable and perform a final reduction only once (e.g. by adding the local count to an atomic variable at the end of getIdenticalVoxelGroupsCount).
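A minimal sketch of that change to the question's function (assuming C11 atomics are available):

#include <stdatomic.h>

static atomic_long identicalVoxelGroupsCount;   /* replaces the mutex + plain counter */

void* getIdenticalVoxelGroupsCount(void* threadNumber){
    long localCount = 0;    /* private to this thread: no sharing, no bouncing */
    for(int i = (int)(intptr_t)threadNumber - 1; i < 321536 - 1; i += threadCount){
        for(int j = i + 1; j < 321536; j++){
            if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j]))
                localCount++;
        }
    }
    /* one atomic update per thread instead of one lock round-trip per match */
    atomic_fetch_add(&identicalVoxelGroupsCount, localCount);
    return 0;
}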
Moreover, the interleaving of the loop iterations is not efficient, because most of the cache lines containing voxelGroups will be shared between threads. This is not as critical as the first point, because the threads only read those cache lines. Still, it can increase the memory traffic and become a bottleneck. A much more efficient approach is to split the iterations into relatively large contiguous chunks. It could be even better to split the work into medium-grained tiles to use the cache more efficiently (although this optimisation is independent of the parallelisation strategy).
Note that you can use OpenMP to do such kind of operation easily and efficiently in C.
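For reference, a sketch of the same loop in OpenMP (the reduction clause implements the per-thread counters, and schedule(static) gives each thread one contiguous block of rows; a dynamic schedule may balance the triangular workload better):

long identicalVoxelGroupsCount = 0;
#pragma omp parallel for schedule(static) reduction(+:identicalVoxelGroupsCount)
for(int i = 0; i < 321536 - 1; i++){
    for(int j = i + 1; j < 321536; j++){
        if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j]))
            identicalVoxelGroupsCount++;
    }
}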
I am using pthreads with gcc. This simple code example takes the number of threads "N" as user-supplied input. It splits a long array into N roughly equally sized sub-blocks, and each sub-block is written into by an individual thread.
The dummy processing in this example just sleeps for a fixed amount of time per array index and then writes a number into that array location.
Here's the code:
/******************************************************************************
* FILE: threaded_subblocks_processing
* DESCRIPTION:
*   We have a bunch of parallel processing to do and store the results in a
*   large array. Let's try to use threads to speed it up.
******************************************************************************/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BIG_ARR_LEN 10000

typedef struct thread_data{
    int start_idx;
    int end_idx;
    int id;
} thread_data_t;

int big_result_array[BIG_ARR_LEN] = {0};

void* process_sub_block(void *td)
{
    struct thread_data *current_thread_data = (struct thread_data*)td;
    printf("[%d] Hello World! It's me, thread #%d!\n", current_thread_data->id, current_thread_data->id);
    printf("[%d] I'm supposed to work on indexes %d through %d.\n", current_thread_data->id,
           current_thread_data->start_idx,
           current_thread_data->end_idx-1);
    for(int i=current_thread_data->start_idx; i<current_thread_data->end_idx; i++)
    {
        /* sleep 1000 us per element: 10 s of "work" in total */
        int retval = usleep(1000.0*1000.0*10.0/BIG_ARR_LEN);
        if(retval)
        {
            printf("sleep failed\n");
        }
        big_result_array[i] = i;
    }
    printf("[%d] Thread #%d done, over and out!\n", current_thread_data->id, current_thread_data->id);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    if (argc!=2)
    {
        printf("usage: ./a.out number_of_threads\n");
        return(1);
    }
    int NUM_THREADS = atoi(argv[1]);
    if (NUM_THREADS<1)
    {
        printf("usage: ./a.out number_of_threads (where number_of_threads is at least 1)\n");
        return(1);
    }
    pthread_t *threads = malloc(sizeof(pthread_t)*NUM_THREADS);
    thread_data_t *thread_data_array = malloc(sizeof(thread_data_t)*NUM_THREADS);
    int block_size = BIG_ARR_LEN/NUM_THREADS;
    for(int i=0; i<NUM_THREADS-1; i++)
    {
        thread_data_array[i].start_idx = i*block_size;
        thread_data_array[i].end_idx = (i+1)*block_size;
        thread_data_array[i].id = i;
    }
    /* the last thread picks up any remainder left by the integer division */
    thread_data_array[NUM_THREADS-1].start_idx = (NUM_THREADS-1)*block_size;
    thread_data_array[NUM_THREADS-1].end_idx = BIG_ARR_LEN;
    thread_data_array[NUM_THREADS-1].id = NUM_THREADS-1;
    int ret_code;
    long t;
    for(t=0;t<NUM_THREADS;t++){
        printf("[main] Creating thread %ld\n", t);
        ret_code = pthread_create(&threads[t], NULL, process_sub_block, (void *)&thread_data_array[t]);
        if (ret_code){
            printf("[main] ERROR; return code from pthread_create() is %d\n", ret_code);
            exit(-1);
        }
    }
    printf("[main] Joining threads to wait for them.\n");
    void* status;
    for(int i=0; i<NUM_THREADS; i++)
    {
        pthread_join(threads[i], &status);
    }
    pthread_exit(NULL);
}
and I compile it with
gcc -pthread threaded_subblock_processing.c
and then I call it from command line like so:
$ time ./a.out 4
I see a speed up when I increase the number of threads. With 1 thread the process takes just a little over 10 seconds. This makes sense because I sleep for 1000 usec per array element, and there are 10,000 array elements. Next when I go to 2 threads, it goes down to a little over 5 seconds, and so on.
What I don't understand is that I get a speed-up even after my number of threads exceeds the number of cores on my computer! I have 4 cores, so I was expecting no speed-up for >4 threads. But, surprisingly, when I run
$ time ./a.out 100
I get a 100x speedup and the processing completes in ~0.1 seconds! How is this possible?
Some general background
A program's progress can be slowed by many things, but, in general, you can divide slow spots (otherwise known as hot spots) into two categories:
CPU bound: In this case, the processor is doing some heavy number crunching (like trigonometric functions). If all the CPU's cores are engaged in such tasks, other processes must wait.
Memory bound: In this case, the processor is waiting for information to be retrieved from the hard disk or RAM. Since these are typically orders of magnitude slower than the processor, from the CPU's perspective this takes forever.
But you can also imagine other situations in which a process must wait, such as for a network response.
In many of these memory-/network-bound situations, it is possible to put a thread "on hold" while the memory crawls towards the CPU and do other useful work in the meantime. If this is done well then a multi-threaded program can well out-perform its single-threaded equivalent. Node.js makes use of such asynchronous programming techniques to achieve good performance.
Your question
Now, getting back to your question: you have multiple threads going, but they are performing neither CPU-intensive nor memory-intensive work: there's not much there to take up time. In fact, the sleep function is essentially telling the operating system that no work is being done. In this case, the OS can do work in other threads while your threads sleep. So, naturally, the apparent performance increases significantly.
Note that for low-latency applications, such as MPI, busy waiting is sometimes used instead of sleeping. In that case the program sits in a tight loop and repeatedly checks a condition. Externally, the effect looks similar, but sleeping uses no CPU while the busy wait uses ~100% of a core.
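A rough sketch of the difference (the ready flag is hypothetical, set by some other thread):

#include <stdatomic.h>
#include <unistd.h>

atomic_bool ready;   /* hypothetical condition set elsewhere */

void wait_sleeping(void){
    while(!atomic_load(&ready))
        usleep(100);   /* yields the CPU; wake-up latency is high */
}

void wait_busy(void){
    while(!atomic_load(&ready))
        ;              /* burns a whole core, but reacts almost instantly */
}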
I am very new to OpenMP and am trying to understand its constructs.
Here is a simple code I wrote (squaring numbers):
#include <omp.h>
#include <stdio.h>

#define SIZE 20000
#define NUM_THREADS 50

int main(){
    int id;
    int output[SIZE];
    omp_set_num_threads(NUM_THREADS);
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i=0; i<SIZE; i++){
        id = omp_get_thread_num();
        //printf("current thread :%d of %d threads\n", id, omp_get_num_threads());
        output[i] = i*i;
    }
    double end = omp_get_wtime();
    printf("time elapsed: %f for %d threads\n", end-start, NUM_THREADS);
}
Now, increasing the number of threads should decrease the time, but it actually increases it.
What am I doing wrong?
This is most likely due to your choice of test problem. Let's look at your parallel loop:
#pragma omp parallel for
for (int i=0; i<SIZE; i++){
    id = omp_get_thread_num();
    output[i] = i*i;
}
You have specified 50 threads and stated you have 16 cores.
The serial case ignores the OMP directive and can aggressively optimize the loop. Each element i is just i*i, a simple multiplication dependent on nothing but the loop index, and id can be optimized out completely. The loop probably gets fully vectorized, and if your processor is modern it can likely do 4 multiplies in a single instruction (SIMD), meaning for SIZE=20000 you are looking at about 5000 SIMD multiplications (with no data-fetch overhead and a cache-friendly store). This is going to be very fast.
Alternatively, let's look at the parallel version. You are initializing 50 threads -- expensive! You are introducing many context switches, because even with processor affinity you have oversubscribed your cores. Each of the 50 threads is going to run 400 iterations of your loop; if you are lucky the compiler unrolled the loop a bit so each could instead do about 100 SIMD multiplies. The multiplies, whether SIMD or not, are still going to be fast. What you end up with is the same amount of real work, so each core has roughly 1/16th of it, but the overhead of creating and destroying 50 threads costs more than the parallel gain. This is a good example of something that doesn't benefit from parallelization.
The first thing you want to do is limit the number of threads to your actual core count. You are not going to gain anything by adding needless context switches to your execution time; more threads than cores is generally not going to make the program faster.
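For example, OpenMP can report the available processor count at run time, so a one-line sketch of that fix is to size the team from it instead of a hard-coded 50:

omp_set_num_threads(omp_get_num_procs());   /* one thread per available core */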
The second thing you want to do is something more complicated in your loop, and do it many times (google for examples; there are many). When constructing your work loop, also keep cache performance in mind, as badly constructed loops don't speed up well.
When your work is more expensive than the thread overhead, embarrassingly parallel, and cache friendly, you can start to see a real benefit from OpenMP. The last thing you'll want to do is benchmark your loop from serial up to 16 threads, e.g.:
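Something along these lines (a sketch; it assumes the code above is saved as square.c, that omp_set_num_threads() has been removed so the environment variable takes effect, and that your shell is bash):

$ gcc -fopenmp square.c -o square
$ for t in 1 2 4 8 16; do
      OMP_NUM_THREADS=$t ./square
  done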
I've written a parallel program in C using OpenMP.
I want to control the number of threads the program uses.
I'm using system with:
CentOS release 6.5 (Final)
icc version 14.0.1 (gcc version 4.4.7 compatibility)
2x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Program that I run:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double t1[TABLE_SIZE];
double t2[TABLE_SIZE];

void test1(double t1[], double t2[]);

int main(int argc, char** argv) {
    int i;
    omp_set_dynamic(0);
    omp_set_nested(0);
    omp_set_num_threads(NUM_OF_THREADS);
    #pragma omp parallel for default(none) shared(t1, t2) private(i)
    for(i=0; i<TABLE_SIZE; i++) {
        t1[i] = rand();
        t2[i] = rand();
    }
    for(i=0; i<NUM_OF_REPETITION; i++) {
        test1(t1, t2);
    }
    return 0;
}

void test1(double t1[], double t2[]) {
    int i;
    double result = 0.0;
    #pragma omp parallel for default(none) shared(t1, t2) private(i) reduction(+:result)
    for(i=0; i<TABLE_SIZE; i++) {
        result += t1[i]*t2[i];
    }
}
I'm running script that sets TABLE_SIZE(2500, 5000, 100000, 1000000), NUM_OF_THREADS(1-24), NUM_OF_REPETITION(50000 as 50k, 100000 as 100k, 1000000 as 1M) at compile time.
The problem is that the computer is not utilizing all the threads it is offered all the time.
It seems that problem is dependent on TABLE_SIZE.
For example when I compile the code with TABLE_SIZE=2500 all is fine till NUM_OF_THREADS=20. Then some weird things happen. When I set NUM_OF_THREADS=21 the program is utilizing only 18 threads(I observe htop to see how many threads are running). When I set NUM_OF_THREADS=23 and NUM_OF_REPETITION=100k it's using 18 threads, but if I change NUM_OF_REPETITION to 1M at NUM_OF_THREADS=23 it's using 19 threads.
When I change TABLE_SIZE to 5000, the anomaly starts at 18 threads. I set NUM_OF_THREADS=18 and at NUM_OF_REPETITION=1M the program uses only 17 threads. When I set NUM_OF_THREADS=19 and NUM_OF_REPETITION=100k or 1M, it uses only 17 threads. If I change NUM_OF_THREADS to 24, the program uses 20 threads at NUM_OF_REPETITION=50k, 22 threads at NUM_OF_REPETITION=100k and 23 threads at NUM_OF_REPETITION=1M.
This sort of inconsistency goes on and on as TABLE_SIZE increases. The bigger the TABLE_SIZE, the sooner (i.e. at lower NUM_OF_THREADS) the inconsistency occurs.
In this post (OpenMP set_num_threads() is not working) I read that omp_set_num_threads() sets the upper limit of threads that can be used by the program. As you can see, I've disabled dynamic teams and the program is still not using all the threads. Setting the OMP_NUM_THREADS and OMP_DYNAMIC environment variables doesn't help either.
So I went and read some of the OpenMP 3.1 specification. It says the program should use exactly the number of threads set by omp_set_num_threads(). Also, omp_get_max_threads() returns 24 available threads.
Any help would be greatly appreciated.
I finally found a solution: setting the KMP_AFFINITY environment variable. It doesn't matter whether I set the variable to "compact" or "scatter" (I'm just interested in using all the threads for now).
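For example (both values use all the threads; they differ only in how threads are placed on cores):

$ export KMP_AFFINITY=compact   # pack threads onto cores that are close together
$ ./a.out
$ export KMP_AFFINITY=scatter   # distribute threads as evenly as possible
$ ./a.out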
This is what the documentation has to say (https://software.intel.com/en-us/articles/openmp-thread-affinity-control):
There are 2 considerations for OpenMP threading and affinity: First, determine the number of threads to utilize, and secondly, how to bind threads to specific processor cores.
If you do not set a value for KMP_AFFINITY, the OpenMP runtime is allowed to choose affinity for you. The value chosen depends on the CPU architecture and may change depending on what affinity is deemed most efficient FOR A VARIETY OF APPLICATIONS for that architecture.
Another source (https://software.intel.com/en-us/node/522691):
Affinity Types:
type = none (default)
Does not bind OpenMP* threads to particular thread contexts; however, if the operating system supports affinity, the compiler still uses the OpenMP* thread affinity interface to determine machine topology.
So I guess that because I did not have KMP_AFFINITY set, the OpenMP runtime chose whatever affinity it deemed most efficient. Please correct me if I'm wrong.
So I've been playing around with pthreads, specifically trying to calculate the product of two matrices. My code is extremely messy because it was just meant to be a quick fun project for myself, but the thread theory I used was very similar to:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define M 3
#define K 2
#define N 3
#define NUM_THREADS 10

int A [M][K] = { {1,4}, {2,5}, {3,6} };
int B [K][N] = { {8,7,6}, {5,4,3} };
int C [M][N];

struct v {
    int i; /* row */
    int j; /* column */
};

void *runner(void *param); /* the thread */

int main(int argc, char *argv[]) {
    int i,j, count = 0;
    for(i = 0; i < M; i++) {
        for(j = 0; j < N; j++) {
            //Assign a row and column for each thread
            struct v *data = (struct v *) malloc(sizeof(struct v));
            data->i = i;
            data->j = j;
            /* Now create the thread passing it data as a parameter */
            pthread_t tid;       //Thread ID
            pthread_attr_t attr; //Set of thread attributes
            //Get the default attributes
            pthread_attr_init(&attr);
            //Create the thread
            pthread_create(&tid,&attr,runner,data);
            //Make sure the parent waits for all thread to complete
            pthread_join(tid, NULL);
            count++;
        }
    }
    //Print out the resulting matrix
    for(i = 0; i < M; i++) {
        for(j = 0; j < N; j++) {
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }
}

//The thread will begin control in this function
void *runner(void *param) {
    struct v *data = param; // the structure that holds our data
    int n, sum = 0;         //the counter and sum
    //Row multiplied by column
    for(n = 0; n < K; n++){
        sum += A[data->i][n] * B[n][data->j];
    }
    //assign the sum to its coordinate
    C[data->i][data->j] = sum;
    //Exit the thread
    pthread_exit(0);
}
source: http://macboypro.com/blog/2009/06/29/matrix-multiplication-in-c-using-pthreads-on-linux/
For the non-threaded version, I used the same setup (3 two-dimensional matrices, dynamically allocated structs to hold row/column) and added a timer. First trials indicated that the non-threaded version was faster. My first thought was that the dimensions were too small to notice a difference and that thread creation was taking longer. So I upped the dimensions to about 50x50, randomly filled, ran it, and I'm still not seeing any performance upgrade with the threaded version.
What am I missing here?
Unless you're working with very large matrices (many thousands of rows/columns), you are unlikely to see much improvement from this approach. Setting up a thread on a modern CPU/OS is actually pretty expensive in relative terms of CPU time, much more than a few multiply operations.
Also, it's usually not worthwhile to set up more than one thread per CPU core that you have available. If you have, say, only two cores and you set up 2500 threads (for 50x50 matrices), then the OS is going to spend all its time managing and switching between those 2500 threads rather than doing your calculations.
If you were to set up two threads beforehand (still assuming a two-core CPU), keep those threads alive and waiting for work, and supply them with the 2500 dot products you need to calculate through some kind of synchronised work queue, then you might start to see an improvement. Even then, however, it won't ever be more than twice as fast as using only one core.
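To make that concrete, here is a hedged sketch of the pattern (the names are invented; an atomic row counter stands in for the synchronised work queue):

#include <pthread.h>
#include <stdatomic.h>

enum { DIM = 50 };                 /* 50x50 matrices, as in the experiment */
int A[DIM][DIM], B[DIM][DIM], C[DIM][DIM];

static atomic_int next_row;        /* minimal "work queue": hands out row indices */

static void *worker(void *arg){
    (void)arg;
    int i;
    /* each long-lived thread grabs the next free row until none remain */
    while((i = atomic_fetch_add(&next_row, 1)) < DIM){
        for(int j = 0; j < DIM; j++){
            int sum = 0;
            for(int k = 0; k < DIM; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
    return NULL;
}

/* in main, on a two-core machine, two workers are enough:
 *   pthread_t w[2];
 *   for(int n = 0; n < 2; n++) pthread_create(&w[n], NULL, worker, NULL);
 *   for(int n = 0; n < 2; n++) pthread_join(w[n], NULL);
 */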
I'm not entirely sure I understand the source code, but here's what it looks like: You have a loop that runs M*N times. Each time through the loop, you create a thread that fills in one number in the result matrix. But right after you launch the thread, you wait for it to complete. I don't think that you're ever actually running more than one thread.
Even if you were running more than one thread, the thread is doing a trivial amount of work. Even if K was large (you mention 50), 50 multiplications isn't much compared to the cost of starting the thread in the first place. The program should create fewer threads--certainly no more than the number of processors--and assign more work to each.
You don't allow much parallel execution: you wait for each thread immediately after creating it, so there is almost no way for your program to use additional CPUs (it can never use a third CPU/core). Try to allow more threads to run concurrently (probably up to the number of cores you have).
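A sketch of that restructuring, reusing the question's struct v and runner (launch everything first, join afterwards):

pthread_t tids[M][N];
for(i = 0; i < M; i++) {
    for(j = 0; j < N; j++) {
        struct v *data = (struct v *) malloc(sizeof(struct v));
        data->i = i;
        data->j = j;
        pthread_create(&tids[i][j], NULL, runner, data); /* launch, don't wait yet */
    }
}
/* only after every thread has been started do we wait for them all */
for(i = 0; i < M; i++)
    for(j = 0; j < N; j++)
        pthread_join(tids[i][j], NULL);

As the other answers point out, one thread per matrix element is still far too many for real workloads; this only removes the accidental serialisation.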
If you have a processor with two cores, you should just divide the work into two halves and give each thread one half. The same principle applies with 3, 4, or 5 cores. The optimal design will always match the number of threads to the number of available cores (by available I mean cores that aren't already being heavily used by other processes).
One other thing you have to consider is that each thread's data must be contiguous and independent from the data of the other threads. Otherwise, cache misses will significantly slow down the processing.
To better understand these issues, I'd recommend the book Patterns for Parallel Programming
http://astore.amazon.com/amazon-books-20/detail/0321228111
Although its code samples are more directed at OpenMP & MPI, and you're using pthreads, the first half of the book is very rich in fundamental concepts and the inner workings of multithreaded environments, and is very useful for avoiding most of the performance bottlenecks you'll encounter.
Provided the code parallelizes correctly (I haven't checked it), performance likely only improves when the code is parallelized in hardware, i.e. when threads really run in parallel (multiple cores, multiple CPUs, or other technologies) rather than merely appearing to (in a time-sliced, "multitasking" way). Just an idea; I am not sure this is the case here.