How to assign threads to different cores in C? - c

I wrote a program that multiplies 8 numbers in pairs using 4 threads and then adds up the products. How can I ensure that each thread runs on a separate core for maximum performance? I am new to pthreads, so I really don't know how to use them properly. Please keep answers as simple as possible.
My code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>
int global[9];
/* Each worker multiplies one pair of inputs and returns the product
   packed into its void* return value. */
void *sum_thread(void *arg)
{
    int *args_array = arg;
    int n1,n2,sum;
    n1=args_array[0];
    n2=args_array[1];
    sum = n1*n2;
    printf("N1 * N2 = %d\n",sum);
    return (void*)(intptr_t) sum;
}
void *sum_thread1(void *arg)
{
    int *args_array = arg;
    int n3,n4,sum2;
    n3=args_array[2];
    n4=args_array[3];
    sum2=n3*n4;
    printf("N3 * N4 = %d\n",sum2);
    return (void*)(intptr_t) sum2;
}
void *sum_thread2(void *arg)
{
    int *args_array = arg;
    int n5,n6,sum3;
    n5=args_array[4];
    n6=args_array[5];
    sum3=n5*n6;
    printf("N5 * N6 = %d\n",sum3);
    return (void*)(intptr_t) sum3;
}
void *sum_thread3(void *arg)
{
    int *args_array = arg;
    int n8,n7,sum4;
    n7=args_array[6];
    n8=args_array[7];
    sum4=n7*n8;
    printf("N7 * N8 = %d\n",sum4);
    return (void*)(intptr_t) sum4;
}
int main()
{
    int sum3,sum2,sum,sum4;
    int prod;
    void *ret; // receives each thread's return value
    global[0]=9220; global[1]=1110; global[2]=1120; global[3]=2320; global[4]=5100; global[5]=6720; global[6]=7800; global[7]=9290;// the input
    pthread_t tid_sum;
    pthread_create(&tid_sum,NULL,sum_thread,global);
    pthread_join(tid_sum,&ret); // blocks until this thread has finished
    sum=(int)(intptr_t)ret;
    pthread_t tid_sum1;
    pthread_create(&tid_sum1,NULL,sum_thread1,global);
    pthread_join(tid_sum1,&ret);
    sum2=(int)(intptr_t)ret;
    pthread_t tid_sum2;
    pthread_create(&tid_sum2,NULL,sum_thread2,global);
    pthread_join(tid_sum2,&ret);
    sum3=(int)(intptr_t)ret;
    pthread_t tid_sum3;
    pthread_create(&tid_sum3,NULL,sum_thread3,global);
    pthread_join(tid_sum3,&ret);
    sum4=(int)(intptr_t)ret;
    prod=sum+sum2+sum3+sum4;
    printf("The sum of the products is: %d\n", prod);
    return 0;
}

You don't have to, don't want to, and shouldn't manage hardware resources at such a low level (I don't know whether you even can). That is a job for your OS and, in part, for the standard libraries: they have been properly tested, optimized and standardized.
I doubt you can do better. If you try what you describe, either you are an expert hardware/OS programmer or you are throwing away decades of work :).
Also consider this: if you could index the cores manually, your code would no longer be portable, since it would depend on the number of cores in your machine.
On the other hand, multithreaded programs should work (sometimes even better) on a single core. An example is when one of the threads has nothing to do until an event happens: that thread can go to "sleep" so that only the other threads use the CPU, and it runs again when the event happens. A single-threaded program would generally poll instead, which burns CPU time doing nothing.
Also, as @yano said, your multithreaded program is not really parallel in this case, since you create each thread and then wait for it to finish with pthread_join before starting the next one.
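To illustrate the last point, here is a minimal sketch of the same computation with all threads created first and joined afterwards, so the OS is free to run them on different cores. The names pair_task, multiply_pair and sum_of_products are made up for this example and are not from the question.

```c
#include <pthread.h>

#define NUM_PAIRS 4

/* Hypothetical helper type: one pair of inputs plus a slot for its product. */
typedef struct {
    int a, b;
    long long product;
} pair_task;

static void *multiply_pair(void *arg)
{
    pair_task *t = arg;
    t->product = (long long)t->a * t->b;
    return NULL;
}

/* Multiplies the four pairs in in[0..7] on four threads and returns the
   sum of the products, or -1 if a thread could not be created. */
long long sum_of_products(const int in[2 * NUM_PAIRS])
{
    pthread_t tid[NUM_PAIRS];
    pair_task task[NUM_PAIRS];
    long long total = 0;
    int started = 0;

    /* Start every thread first ... */
    for (int i = 0; i < NUM_PAIRS; i++) {
        task[i].a = in[2 * i];
        task[i].b = in[2 * i + 1];
        if (pthread_create(&tid[i], NULL, multiply_pair, &task[i]) != 0)
            break;
        started++;
    }
    /* ... and only then wait for them, so they can overlap in time. */
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += task[i].product;
    }
    return started == NUM_PAIRS ? total : -1;
}
```

Calling sum_of_products(global) with the inputs from the question should give the same total as the original program. For work this small, the cost of creating the threads will still dominate.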

Related

pthread is slower than the "default" version

SITUATION
I want to see the advantage of using pthreads. If I'm not wrong, threads allow me to execute given parts of a program in parallel.
So here is what I am trying to accomplish: I want to make a program that takes a number (let's say n) and outputs the sum of [0..n].
code
#include <stdio.h>
#define MAX 1000000000
int
main() {
long long n = 0;
for (long long i = 1; i < MAX; ++i)
n += i;
printf("\nn: %lld\n", n);
return 0;
}
time: 0m2.723s
To my understanding, I could simply take that number MAX, divide it by 2, and let 2 threads do the job.
code
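#include <stdio.h>
#include <pthread.h>
#define MAX 1000000000
#define MAX_THREADS 2
#define STRIDE (MAX / MAX_THREADS) /* parenthesized so that i * STRIDE expands as intended */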
typedef struct {
long long off;
long long res;
} arg_t;
void*
callback(void *args) {
arg_t *arg = (arg_t*)args;
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
arg->res += i;
pthread_exit(0);
}
int
main() {
pthread_t threads[MAX_THREADS];
arg_t results[MAX_THREADS];
for (int i = 0; i < MAX_THREADS; ++i) {
results[i].off = i * STRIDE;
results[i].res = 0;
pthread_create(&threads[i], NULL, callback, (void*)&results[i]);
}
for (int i = 0; i < MAX_THREADS; ++i)
pthread_join(threads[i], NULL);
long long result;
result = results[0].res;
for (int i = 1; i < MAX_THREADS; ++i)
result += results[i].res;
printf("\nn: %lld\n", result);
return 0;
}
time: 0m8.530s
PROBLEM
The version with pthread runs slower. Logically this version should run faster, but maybe creation of threads is more expensive.
Can someone suggest a solution or show what I'm doing/understanding wrong here?
Your problem is cache-line contention (false sharing) combined with a lack of optimization (I bet you're compiling without it).
The naive (-O0) code for
for (long long i = arg->off; i < arg->off + STRIDE; ++i)
arg->res += i;
accesses the memory of *arg on every iteration. With your results array defined the way it is, that memory sits right next to the memory of the next arg, so the two threads fight over the same cache line, which makes caching very ineffective.
If you compile with -O1, the loop should use a register instead and write to memory only at the end. Then you should get better performance with threads (higher optimization levels on gcc seem to optimize the loop out completely).
Another (better) option is to align arg_t on a cache line:
typedef struct {
_Alignas(64) /*typical cache line size*/ long long off;
long long res;
} arg_t;
Then you should get better performance with threads regardless of whether or not you turn optimization on.
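Both suggestions can be combined. The sketch below assumes a 64-byte cache line and uses made-up names (range_t, sum_range, parallel_sum, MAX_WORKERS). Each worker owns a cache-line-aligned slot and accumulates into a local variable, so it stores to shared memory only once.

```c
#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE 64   /* assumed line size; check your CPU */
#define MAX_WORKERS 16  /* hypothetical upper bound for this sketch */

typedef struct {
    alignas(CACHE_LINE) long long begin; /* first value to add */
    long long end;                       /* one past the last value */
    long long res;                       /* partial sum, written once */
} range_t;

static void *sum_range(void *p)
{
    range_t *r = p;
    long long acc = 0;   /* local accumulator, not shared with other threads */
    for (long long i = r->begin; i < r->end; ++i)
        acc += i;
    r->res = acc;        /* single store to the shared slot */
    return NULL;
}

/* Sums the integers in [0, n) on nthreads workers; returns -1 on error. */
long long parallel_sum(long long n, int nthreads)
{
    pthread_t tid[MAX_WORKERS];
    range_t part[MAX_WORKERS];
    long long total = 0;
    int started = 0;

    if (n < 0 || nthreads < 1 || nthreads > MAX_WORKERS)
        return -1;
    long long stride = n / nthreads;
    for (int i = 0; i < nthreads; ++i) {
        part[i].begin = i * stride;
        /* the last worker also takes the remainder of the division */
        part[i].end = (i == nthreads - 1) ? n : (i + 1) * stride;
        part[i].res = 0;
        if (pthread_create(&tid[i], NULL, sum_range, &part[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; ++i) {
        pthread_join(tid[i], NULL);
        total += part[i].res;
    }
    return started == nthreads ? total : -1;
}
```

With optimization enabled the compiler may reduce the loop to a closed form, so measure with a workload it cannot fold away.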
Good cache utilization is generally very important in multithreaded programming (Ulrich Drepper has much to say on the topic in his well-known paper What Every Programmer Should Know About Memory).
Creating a whole bunch of threads is very unlikely to be quicker than simply adding numbers. The CPU can add an awfully large number of integers in the time it takes the kernel to set up and tear down a thread. To see the benefit of multithreading, you really need each thread to be doing a significant task -- significant compared to the overhead in creating the thread, anyway. Alternatively, you need to keep a pool of threads running, and assign them work according to some allocation strategy.
Multi-threading works best when an application consists of tasks that are somewhat independent, that would otherwise be waiting on one another to complete. It isn't a magic way to get more throughput.

Unable to figure out where the race condition is occurring in an OpenMP program in C

I am trying to integrate sin(x) from 0 to pi, but every time I run the program I get a different output. I know this is caused by a race condition, but I am unable to figure out where the problem lies.
This is my code:
#include<stdio.h>
#include<stdlib.h>
#include<omp.h>
#include<math.h>
#include<time.h>
#define NUM_THREADS 4
static long num_steps= 10000000;
float rand_generator(float a )
{
//srand((unsigned int)time(NULL));
return ((float)rand()/(float)(RAND_MAX)) * a;
}
int main(int argc, char *argv[])
{
// srand((unsigned int)time(NULL));
omp_set_num_threads(NUM_THREADS);
float result;
float sum[NUM_THREADS];
float area=3.14;
int nthreads;
#pragma omp parallel
{
int id,nthrds;
id=omp_get_thread_num();
sum[id]=0.0;
printf("%d\n",id );
nthrds=omp_get_num_threads();
printf("%d\n",nthrds );
//if(id==0)nthreads=nthrds;
for (int i = id; i < num_steps; i=i+nthrds)
{
//float y=rand_generator(1);
//printf("%f\n",y );
float x=rand_generator(3.14);
sum[id]+=sin(x);
}
//printf(" sum is: %lf\n", sum);
//float p=(float)sum/num_steps*area;
}
float p=0.0;
for (int i = 0; i <NUM_THREADS; ++i)
{
p+=(sum[i]/num_steps)*area;
}
printf(" p is: %lf\n",p );
}
I tried adding pragma atomic but it also doesn't help.
Any help will be appreciated :).
The problem comes from the use of rand(), which is not thread safe: it keeps one shared state for all calls and is therefore sensitive to races (see "Using stdlib's rand() from multiple threads").
There is a thread-safe random generator called rand_r(). Instead of storing the generator state in a hidden global variable, it takes the state as a parameter, so the state can be made thread local.
You can use it like this:
float rand_generator_r(float a,unsigned int *state )
{
//srand((unsigned int)time(NULL));
return ((float)rand_r(state)/(float)(RAND_MAX)) * a;
}
In your parallel block, add:
unsigned int rand_state=(id+1)*time(NULL); // or any other thread-dependent seed
and in your loop call
float x=rand_generator_r(3.14,&rand_state);
and it should work.
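As a self-contained sketch of the idea: because each caller owns its state, two generators never interfere, and a given seed always yields the same sequence however the threads interleave. The helper thread_seed and its multiplier are assumptions made for this example. Any scheme that gives each thread a different seed would do.

```c
#define _POSIX_C_SOURCE 200809L /* exposes rand_r() under strict -std= modes */
#include <stdlib.h>

/* Returns a value in [0, a], advancing only the caller-owned state. */
float rand_generator_r(float a, unsigned int *state)
{
    return ((float)rand_r(state) / (float)RAND_MAX) * a;
}

/* Hypothetical helper: derives a distinct seed for each thread id. */
unsigned int thread_seed(unsigned int base, int id)
{
    return base + 2654435761u * (unsigned int)(id + 1);
}
```

Inside the parallel block each thread would declare its own unsigned int state = thread_seed(base, id); and pass &state on every call.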
By the way, I suspect there is false sharing in your code that will slow it down.
float sum[NUM_THREADS];
It is modified by all threads and is very likely to be stored in a single cache line. Every store (and there are many) invalidates that line in all other caches, which may slow your program down significantly.
You should ensure that the values are in different cache lines with:
#define CACHE_LINE_SIZE 64
struct {
float s;
char padding[CACHE_LINE_SIZE - sizeof(float)];
} sum_nofalse_sharing[NUM_THREADS];
and in your code, accumulate in sum_nofalse_sharing[id].s
Alternatively, create a local sum in the parallel block and write its value to sum[id] at the end.
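The two remedies can be sketched together as follows. The function sum_values and its array argument are invented for the example, and the 64-byte line size is an assumption. The OpenMP calls are guarded so the code also compiles, and runs serially, without -fopenmp.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define NUM_THREADS 4
#define CACHE_LINE_SIZE 64 /* assumed line size */

typedef struct {
    float s;
    char padding[CACHE_LINE_SIZE - sizeof(float)];
} padded_sum;

/* Adds values[0..n); each thread accumulates locally and stores once
   into its own padded slot. */
float sum_values(const float *values, long n)
{
    padded_sum partial[NUM_THREADS] = {{0}};

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = 0, nthrds = 1;
#ifdef _OPENMP
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
#endif
        float local = 0.0f;               /* private to this thread */
        for (long i = id; i < n; i += nthrds)
            local += values[i];
        partial[id].s = local;            /* one store per thread */
    }

    float total = 0.0f;
    for (int i = 0; i < NUM_THREADS; ++i)
        total += partial[i].s;
    return total;
}
```

Since every slot is 64 bytes wide, no two s fields can fall in the same 64-byte line, even if the array itself does not start on a line boundary.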

What is the best way to create periodic Linux threads in C

For my application I need accurate periodic threads with relatively low cycle times (500 µs).
Specifically, the application is the runtime system of a PLC.
Its purpose is to run an application developed by the PLC user.
Such applications are organised into programs and periodic tasks, each task with its own cycle time and priority.
Usually the application runs on systems with a real-time OS (e.g. VxWorks or Linux with the RT patch).
Currently the periodic tasks are implemented via clock_nanosleep.
Unfortunately, the actual sleep time of clock_nanosleep is disturbed by other threads, even ones with lower priority.
Once every second, the sleep time is exceeded by about 50 ms.
I've observed this on Debian 9.5, on a Raspberry Pi, and on an ARM Linux with Preempt-RT.
Here's a sample, which shows this behavior:
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
typedef void* ThreadFun(void* param);
#define SCHEDULER_POLICY SCHED_FIFO
#define CLOCK CLOCK_MONOTONIC
#define INTERVAL_NS (10 * 1000 * 1000)
static long tickCnt = 0;
static long calcTimeDiff(struct timespec const* t1, struct timespec const* t2)
{
long diff = t1->tv_nsec - t2->tv_nsec;
diff += 1000000000 * (t1->tv_sec - t2->tv_sec);
return diff;
}
static void updateWakeTime(struct timespec* time)
{
uint64_t nanoSec = time->tv_nsec;
struct timespec currentTime;
clock_gettime(CLOCK, &currentTime);
while (calcTimeDiff(time, &currentTime) <= 0)
{
nanoSec = time->tv_nsec;
nanoSec += INTERVAL_NS;
time->tv_nsec = nanoSec % 1000000000;
time->tv_sec += nanoSec / 1000000000;
}
}
static void* tickThread(void *param)
{
struct timespec sleepStart;
struct timespec currentTime;
struct timespec wakeTime;
long sleepTime;
long wakeDelay;
clock_gettime(CLOCK, &wakeTime);
wakeTime.tv_sec += 2;
wakeTime.tv_nsec = 0;
while (1)
{
clock_gettime(CLOCK, &sleepStart);
clock_nanosleep(CLOCK, TIMER_ABSTIME, &wakeTime, NULL);
clock_gettime(CLOCK, &currentTime);
sleepTime = calcTimeDiff(&currentTime, &sleepStart);
wakeDelay = calcTimeDiff(&currentTime, &wakeTime);
if (wakeDelay > INTERVAL_NS)
{
printf("sleep req=%-ld.%-ld start=%-ld.%-ld curr=%-ld.%-ld sleep=%-ld delay=%-ld\n",
(long) wakeTime.tv_sec, (long) wakeTime.tv_nsec,
(long) sleepStart.tv_sec, (long) sleepStart.tv_nsec,
(long) currentTime.tv_sec, (long) currentTime.tv_nsec,
sleepTime, wakeDelay);
}
tickCnt += 1;
updateWakeTime(&wakeTime);
}
}
static void* workerThread(void *param)
{
while (1)
{
}
}
static int createThread(char const* funcName, ThreadFun* func, int prio)
{
pthread_t tid = 0;
pthread_attr_t threadAttr;
struct sched_param schedParam;
printf("thread create func=%s prio=%d\n", funcName, prio);
pthread_attr_init(&threadAttr);
pthread_attr_setschedpolicy(&threadAttr, SCHEDULER_POLICY);
pthread_attr_setinheritsched(&threadAttr, PTHREAD_EXPLICIT_SCHED);
schedParam.sched_priority = prio;
pthread_attr_setschedparam(&threadAttr, &schedParam);
if (pthread_create(&tid, &threadAttr, func, NULL) != 0)
{
return -1;
}
printf("thread created func=%s prio=%d\n", funcName, prio);
return 0;
}
#define CREATE_THREAD(func,prio) createThread(#func,func,prio)
int main(int argc, char*argv[])
{
int minPrio = sched_get_priority_min(SCHEDULER_POLICY);
int maxPrio = sched_get_priority_max(SCHEDULER_POLICY);
int prioRange = maxPrio - minPrio;
CREATE_THREAD(tickThread, maxPrio);
CREATE_THREAD(workerThread, minPrio + prioRange / 4);
sleep(10);
printf("%ld ticks\n", tickCnt);
}
Is something wrong in my code sample?
Is there a better (more reliable) way to create periodic threads?
"For my application I have the requirement of accurate periodic threads with relative low cycle times (500 µs)"
That is probably too strong a requirement. Linux is not a hard real-time OS.
I would suggest having fewer threads (perhaps a small fixed set of only 2 or 3, organized in a thread pool, remembering that a Raspberry Pi 3B+ has only 4 cores). You might even prefer a single thread (think of a design around an event loop, inspired by continuation-passing style).
You probably don't need periodic threads. You need some periodic activity, and it might all happen in the same thread. (The kernel reschedules tasks perhaps every 50 or 100 ms, even though it is capable of sleeping for a shorter time, and if tasks get rescheduled very frequently, e.g. every millisecond, that scheduling has a cost.)
So read time(7) carefully.
Consider using timer_create(2) or, even better, timerfd_create(2) used in an event loop around poll(2).
On a Raspberry Pi, you won't get guaranteed 500 µs delays. This is probably impossible (the hardware might not be powerful enough, and Linux is not hard real-time). I feel your expectations are not reasonable.
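A minimal, Linux-only sketch of the timerfd_create(2) plus poll(2) suggestion follows. The function name run_periodic and its parameters are invented for this example. It shows the mechanism and makes no claim about the jitter you will actually observe.

```c
#define _POSIX_C_SOURCE 200809L /* for CLOCK_MONOTONIC under strict -std= modes */
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Waits for `ticks` timer periods of `period_ns` nanoseconds each.
   Returns the number of expirations seen (missed ones included),
   or -1 on error. */
long run_periodic(long period_ns, long ticks)
{
    if (period_ns <= 0 || ticks <= 0)
        return -1;

    int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0)
        return -1;

    struct itimerspec spec;
    spec.it_interval.tv_sec = period_ns / 1000000000L;
    spec.it_interval.tv_nsec = period_ns % 1000000000L;
    spec.it_value = spec.it_interval;   /* first expiry after one period */
    if (timerfd_settime(fd, 0, &spec, NULL) < 0) {
        close(fd);
        return -1;
    }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    long seen = 0;
    while (seen < ticks) {
        int ready = poll(&pfd, 1, -1);  /* an event loop could watch more fds here */
        if (ready < 0 && errno == EINTR)
            continue;
        if (ready <= 0) {
            close(fd);
            return -1;
        }
        uint64_t expirations = 0;       /* expiries since the last read */
        if (read(fd, &expirations, sizeof expirations) != (ssize_t)sizeof expirations) {
            close(fd);
            return -1;
        }
        seen += (long)expirations;
        /* the periodic work would go here */
    }
    close(fd);
    return seen;
}
```

For example, run_periodic(500000L, 2000) would wait for 2000 periods of 500 µs. A value read from the descriptor that is greater than 1 means the loop fell behind and missed at least one period.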

MPI with pthreads slow performance [duplicate]

I'm trying out the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up the square roots of some random numbers.
#include <iostream>
#include <thread>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <cmath>
#include <ctime>      // time()
#include <functional> // std::ref
double add_single(int N) {
double sum=0;
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand()/RAND_MAX);
}
return sum/N;
}
void add_multi(int N, double& result) {
double sum=0;
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand()/RAND_MAX);
}
result = sum/N;
}
int main() {
srand (time(NULL));
int N = 1000000;
// single-threaded
auto t1 = std::chrono::high_resolution_clock::now();
double result1 = add_single(N);
auto t2 = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
std::cout << "time single: " << time_elapsed << std::endl;
// multi-threaded
std::vector<std::thread> th;
int nr_threads = 3;
double partual_results[] = {0,0,0};
t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < nr_threads; ++i)
th.push_back(std::thread(add_multi, N/nr_threads, std::ref(partual_results[i]) ));
for(auto &a : th)
a.join();
double result_multicore = 0;
for(double result:partual_results)
result_multicore += result;
result_multicore /= nr_threads;
t2 = std::chrono::high_resolution_clock::now();
time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
std::cout << "time multi: " << time_elapsed << std::endl;
return 0;
}
Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is
time single: 33
time multi: 565
So the multithreaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and less prone to compiler optimizations, so I'm out of ideas.
edit:
This problem scales for larger N, so the problem is not the short runtime
The time for creating the threads is not the problem. Excluding it does not change the result significantly
Wow, I found the problem. It was indeed rand(). I replaced it with a C++11 equivalent and now the runtime scales perfectly. Thanks everyone!
On my system the behavior is the same, but as Maxim mentioned, rand is not thread safe. When I change rand to rand_r, the multithreaded code is faster, as expected.
void add_multi(int N, double& result) {
double sum=0;
unsigned int seed = time(NULL);
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand_r(&seed)/RAND_MAX);
}
result = sum/N;
}
As you discovered, rand is the culprit here.
For those who are curious, it's possible that this behavior comes from your implementation of rand using a mutex for thread safety.
For example, eglibc defines rand in terms of __random, which is defined as:
long int
__random ()
{
int32_t retval;
__libc_lock_lock (lock);
(void) __random_r (&unsafe_state, &retval);
__libc_lock_unlock (lock);
return retval;
}
This kind of locking would force multiple threads to run serially, resulting in lower performance.
The time needed to execute the program is very small (33 ms). This means that the overhead of creating and handling several threads may outweigh the real benefit. Try programs that need longer to execute (e.g., 10 s).
To make this faster, use a thread pool pattern.
This will let you enqueue tasks in other threads without the overhead of creating a std::thread each time you want to use more than one thread.
Don't count the overhead of setting up the queue in your performance metrics, just the time to enqueue and extract the results.
Create a set of threads and a queue of tasks (a structure containing a std::function<void()>) to feed them. The threads wait on the queue for new tasks to do, do them, then wait on new tasks.
The tasks are responsible for communicating their "done-ness" back to the calling context, such as via a std::future<>. The code that lets you enqueue functions into the task queue might do this wrapping for you, ie this signature:
template<typename R=void>
std::future<R> enqueue( std::function<R()> f ) {
std::packaged_task<R()> task(f);
std::future<R> retval = task.get_future();
this->add_to_queue( std::move( task ) ); // if we had move semantics, could be easier
return retval;
}
which turns a naked std::function returning R into a nullary packaged_task, then adds that to the task queue. Note that the task queue needs to be move-aware, because packaged_task is move-only.
Note 1: I am not all that familiar with std::future, so the above could be in error.
Note 2: If tasks put into the above described queue are dependent on each other for intermediate results, the queue could deadlock, because no provision to "reclaim" threads that are blocked and execute new code is described. However, "naked computation" non-blocking tasks should work fine with the above model.
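For comparison, the same worker-and-queue structure can be sketched in plain C with pthreads. This is a simplified illustration under stated assumptions. The names pool_t, pool_start, pool_submit, pool_finish and square_task are invented. The queue has a fixed capacity. Completion is signalled by joining the workers in pool_finish, not through a future.

```c
#include <pthread.h>

#define POOL_THREADS 3
#define QUEUE_CAP 64

typedef struct { void (*fn)(void *); void *arg; } task_t;

typedef struct {
    pthread_t workers[POOL_THREADS];
    task_t queue[QUEUE_CAP];          /* ring buffer of pending tasks */
    int head, count, stopping, nworkers;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
} pool_t;

static void *worker_main(void *p)
{
    pool_t *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->count == 0 && !pool->stopping)
            pthread_cond_wait(&pool->not_empty, &pool->lock);
        if (pool->count == 0) {       /* stopping and queue drained */
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        task_t t = pool->queue[pool->head];
        pool->head = (pool->head + 1) % QUEUE_CAP;
        pool->count--;
        pthread_mutex_unlock(&pool->lock);
        t.fn(t.arg);                  /* run the task outside the lock */
    }
}

/* Starts the workers; returns 0 if at least one thread is running. */
int pool_start(pool_t *pool)
{
    pool->head = pool->count = pool->stopping = pool->nworkers = 0;
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->not_empty, NULL);
    for (int i = 0; i < POOL_THREADS; i++) {
        if (pthread_create(&pool->workers[i], NULL, worker_main, pool) != 0)
            break;
        pool->nworkers++;
    }
    return pool->nworkers > 0 ? 0 : -1;
}

/* Enqueues a task; returns -1 if the queue is full or the pool is stopping. */
int pool_submit(pool_t *pool, void (*fn)(void *), void *arg)
{
    int ok = -1;
    pthread_mutex_lock(&pool->lock);
    if (!pool->stopping && pool->count < QUEUE_CAP) {
        int tail = (pool->head + pool->count) % QUEUE_CAP;
        pool->queue[tail].fn = fn;
        pool->queue[tail].arg = arg;
        pool->count++;
        pthread_cond_signal(&pool->not_empty);
        ok = 0;
    }
    pthread_mutex_unlock(&pool->lock);
    return ok;
}

/* Lets the workers drain the queue, then joins them. */
void pool_finish(pool_t *pool)
{
    pthread_mutex_lock(&pool->lock);
    pool->stopping = 1;
    pthread_cond_broadcast(&pool->not_empty);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->nworkers; i++)
        pthread_join(pool->workers[i], NULL);
    pthread_mutex_destroy(&pool->lock);
    pthread_cond_destroy(&pool->not_empty);
}

/* Example task for the sketch: squares the long that arg points to. */
void square_task(void *arg)
{
    long *v = arg;
    *v = *v * *v;
}
```

As in Note 2, tasks that block waiting on other queued tasks can deadlock this pool. Independent, non-blocking tasks are the intended use.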

How to pass a sequential counter by reference to pthread start routine?

Below is my C code to print an increasing global counter, one increment per thread.
#include <stdio.h>
#include <pthread.h>
static pthread_mutex_t pt_lock = PTHREAD_MUTEX_INITIALIZER;
int count = 0;
int *printnum(int *num) {
pthread_mutex_lock(&pt_lock);
printf("thread:%d ", *num);
pthread_mutex_unlock(&pt_lock);
return NULL;
}
int main() {
int i, *ret;
pthread_t pta[10];
for(i = 0; i < 10; i++) {
pthread_mutex_lock(&pt_lock);
count++;
pthread_mutex_unlock(&pt_lock);
pthread_create(&pta[i], NULL, (void *(*)(void *))printnum, &count);
}
for(i = 0; i < 10; i++) {
pthread_join(pta[i], (void **)&ret);
}
}
I want each thread to print one increment of the global counter, but they miss increments and sometimes two threads read the same value of the global counter. How can I make the threads access the global counter sequentially?
Sample Output:
thread:2
thread:3
thread:5
thread:6
thread:7
thread:7
thread:8
thread:9
thread:10
thread:10
Edit
Blue Moon's answer solves this question. An alternative approach is available in MartinJames's comment.
A simple-but-useless way to ensure that thread 1 prints 1, thread 2 prints 2, and so on is to join each thread immediately:
pthread_create(&pta[i], NULL, printnum, &count);
pthread_join(pta[i], (void **)&ret);
But this totally defeats the purpose of multithreading, because only one thread can make progress at a time.
Note that I removed the superfluous casts and changed the thread function to take a void * argument.
A saner approach would be to pass the loop counter i by value, so that each thread prints a different value and you can see threading in action, i.e. the numbers 1-10 may be printed in any order, but each thread prints a unique value.
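A minimal sketch of that approach: the counter value travels inside the pointer argument via intptr_t, so no thread ever reads a variable that main is still changing. The seen array and the function run_numbered_threads are additions made for this example, so the result can be checked.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 10

static int seen[NTHREADS + 1];   /* seen[k] counts how often k was printed */

static void *printnum(void *arg)
{
    int num = (int)(intptr_t)arg;  /* the value itself, not a shared address */
    printf("thread:%d\n", num);
    seen[num]++;                   /* each thread touches only its own slot */
    return NULL;
}

/* Starts NTHREADS threads numbered 1..NTHREADS and returns how many
   distinct numbers were printed exactly once. */
int run_numbered_threads(void)
{
    pthread_t pta[NTHREADS];
    int started = 0, distinct = 0;

    for (int i = 1; i <= NTHREADS; i++)
        seen[i] = 0;
    for (int i = 0; i < NTHREADS; i++) {
        /* pass i + 1 by value, packed into the void* argument */
        if (pthread_create(&pta[i], NULL, printnum, (void *)(intptr_t)(i + 1)) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(pta[i], NULL);
    for (int i = 1; i <= NTHREADS; i++)
        if (seen[i] == 1)
            distinct++;
    return distinct;
}
```

The lines may appear in any order, but every number from 1 to 10 is printed exactly once.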
