Segmentation fault for pthreads in a recursive call - c

Given the code below, I get a segmentation fault if I run it with n>16.
I think it has something to do with the stack, but I can't figure it out. Could anyone give me a hand? The code is not mine, and really not important. I would just like someone to give me a hand with what is happening. This SO question is very similar, but there's not enough information (the person who posts the answer briefly talks about the problem, but then goes on to talk about a different language). Besides, notice that with two gigs and no recursion, I can (if I'm doing it right) successfully create more than 16000 threads (though the OS only creates about 500 and runs about 300). Anyway, where am I getting the seg fault here and why? Thanks.
#include <pthread.h>
#include <stdio.h>
static void* fibonacci_thread( void* arg ) {
    int n = (int)arg, fib;
    pthread_t th1, th2;
    void* pvalue; /* Holds the value */

    switch (n) {
        case 0: return (void*)0;
        case 1: /* Fallthru, Fib(1)=Fib(2)=1 */
        case 2: return (void*)1;
        default: break;
    }

    pthread_create(&th1, NULL, fibonacci_thread, (void*)(n-1));
    pthread_create(&th2, NULL, fibonacci_thread, (void*)(n-2));
    pthread_join(th1, &pvalue);
    fib = (int)pvalue;
    pthread_join(th2, &pvalue);
    fib += (int)pvalue;
    return (void*)fib;
}

int main(int argc, char *argv[])
{
    int n = 15;
    printf("%d\n", (int)fibonacci_thread((void*)n));
    return 0;
}

This is not a good way to do a Fibonacci sequence :-)
Your first thread starts two others, each of those starts two others and so forth. So when n > 16, you may end up with a very large number of threads (in the thousands) (a).
Unless your CPU has way more cores than mine, you'll be wasting your time running thousands of threads for a CPU-bound task like this. For purely CPU bound tasks, you're better off having as many threads as there are physical execution engines (cores or CPUs) available to you. Obviously that changes where you're not purely CPU bound.
If you want an efficient recursive (non-threaded) Fibonacci calculator, you should use something like (pseudo-code):
def fib(n, n ranges from 1 to infinity):
    if n is 1 or 2:
        return 1
    return fib(n-1) + fib(n-2)
Fibonacci isn't even really that good for non-threaded recursion, since the problem doesn't reduce very fast. By that, I mean calculating fib(1000) will use 1000 stack frames. Compare this with a recursive binary search over 1000 items, where only about ten stack frames are needed. That's because the former only removes 1/1000 of the search space for each stack frame, while the latter removes one half of the remaining search space.
The best way to do Fibonacci is with iteration:
def fib(n, n ranges from 1 to infinity):
    if n is 1 or 2:
        return 1
    last2 = 1, last1 = 1
    for i ranges from 3 to n:
        last0 = last2 + last1
        last2 = last1
        last1 = last0
    return last0
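In C, the iterative version might look like this (a minimal sketch; an unsigned long long still overflows somewhere past fib(93), so treat the usable range as limited):

#include <stdio.h>

/* Iterative Fibonacci: O(n) time, O(1) space, no recursion, no threads. */
static unsigned long long fib(unsigned int n) {
    if (n == 0)
        return 0;
    if (n <= 2)
        return 1;

    unsigned long long last2 = 1, last1 = 1, last0 = 0;
    for (unsigned int i = 3; i <= n; i++) {
        last0 = last2 + last1;   /* next value in the sequence */
        last2 = last1;
        last1 = last0;
    }
    return last0;
}

int main(void) {
    printf("%llu\n", fib(15));   /* prints 610 */
    return 0;
}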
Of course, if you want a blindingly fast Fibonacci generator, you write a program to generate all the ones you can store (in, for example, a long value) and write out a C structure to contain them. Then incorporate that output into your C program and your runtime "calculations" will blow any other method out of the water. This is your standard "trade off space for time" optimisation method:
long fib (size_t n) {
    static long fibs[] = {0, 1, 1, 2, 3, 5, 8, 13, ...};
    if (n >= sizeof(fibs) / sizeof(*fibs))
        return -1;
    return fibs[n];
}
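If you went that route, a small throwaway generator along these lines could emit the table for you (a sketch; it just prints a C array of every value that fits in a long):

#include <stdio.h>
#include <limits.h>

/* Table generator sketch: prints a C array of every Fibonacci number
 * that fits in a long, ready to paste into the real program. */
int main(void) {
    long a = 0, b = 1;
    printf("static const long fibs[] = {\n    0, 1");
    while (b <= LONG_MAX - a) {   /* stop before a + b would overflow */
        long next = a + b;
        a = b;
        b = next;
        printf(",\n    %ld", b);
    }
    printf("\n};\n");
    return 0;
}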
These guidelines apply to most situations where the search space doesn't reduce that fast (not just Fibonacci).
(a) Originally, I thought this would be 2^16 but, as the following program shows (and thanks to Nemo for setting me straight), it's not quite that bad - I didn't take into account the reducing nature of the spawned threads as you approached fib(0):
#include <stdio.h>
#include <stdlib.h>

static int count = 0;

static void fib(int n) {
    if (n <= 2) return;
    count++; fib(n-1);
    count++; fib(n-2);
}

int main (int argc, char *argv[]) {
    fib(atoi(argv[1]));
    printf("%d\n", count);
    return 0;
}
This is equivalent to the code you have, but it simply increments a counter for each spawned thread rather than actually spawning them. The number of threads for various input values are:
 N      Threads
---   ---------
  1           0
  2           0
  3           2
  4           4
  5           8
  6          14
  :
 14         752
 15       1,218
 16       1,972
  :
 20      13,528
  :
 26     242,784
  :
 32   4,356,616
Now note that, while I said it wasn't as bad, I didn't say it was good :-) Even two thousand threads is a fair load on a system, with each one having its own kernel structures and stack. And you can see that, while the increases start small, they quickly accelerate to the point where they're unmanageable. It's not as if the 32nd Fibonacci number itself is large - it's only a smidgen over two million - yet computing it this way spawns over four million threads.
So the bottom line still stands: use recursion where it makes sense (where you can reduce the search space relatively quickly so as not to run out of stack space), and use threads where it makes sense (where you don't end up running so many that you overload the resources of the operating system).

Heck with it, might as well make this an answer.
First, check the return values of pthread_create and pthread_join. (Always, always, always check for errors. Just assert they are returning zero if you are feeling lazy, but never ignore them.)
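For example, the lazy-but-honest version might look like this (a self-contained sketch, not the question's code):

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the "lazy but honest" variant: every pthread call is checked,
 * so a failed create (e.g. EAGAIN) aborts instead of joining a garbage handle. */
static void *task(void *arg) {
    return arg;   /* stand-in for the real work */
}

int main(void) {
    pthread_t th;
    void *result;

    int rc = pthread_create(&th, NULL, task, (void *)42);
    if (rc != 0) {                       /* e.g. EAGAIN when out of resources */
        fprintf(stderr, "pthread_create failed: %d\n", rc);
        exit(EXIT_FAILURE);
    }

    rc = pthread_join(th, &result);
    assert(rc == 0);                     /* the lazy version, for joins */

    printf("thread returned %ld\n", (long)result);
    return 0;
}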
Second, I could have sworn Linux glibc allocates something like 2 megabytes of stack per thread by default (configurable via pthread_attr_setstacksize). Sure, that is only virtual memory, but on a 32-bit system that still limits you to ~2000 threads total.
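If you ever did want to fit more threads into the address space, pthread_attr_setstacksize is the knob to turn; a minimal sketch, with 64 KiB as an arbitrary example value (it must stay at or above PTHREAD_STACK_MIN):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;          /* keep the stack usage tiny */
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* 64 KiB instead of the multi-megabyte default; pick a value that
     * comfortably covers your deepest call chain. */
    pthread_attr_setstacksize(&attr, 64 * 1024);

    pthread_t th;
    if (pthread_create(&th, &attr, worker, NULL) == 0)
        pthread_join(th, NULL);

    pthread_attr_destroy(&attr);
    return 0;
}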
Finally, I believe the correct estimate for the number of threads this will spawn is basically fib(n) itself (how nicely recursive). Or roughly phi^n, where phi is (1+sqrt(5))/2. So the number of threads here is closer to 2000 than to 65000, which is consistent with my estimate for where a 32-bit system will run out of VM.
[edit]
To determine the default stack size for new threads on your system, run this program:
#include <pthread.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    size_t stacksize;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stacksize);
    pthread_attr_destroy(&attr);
    printf("Default stack size = %zu\n", stacksize);
    return 0;
}
[edit 2]
To repeat: This is nowhere near 2^16 threads.
Let f(n) be the number of threads spawned when computing fib(n).
When n=16, one thread spawns two new threads: One to compute fib(15) and another to compute fib(14). So f(16) = f(15) + f(14) + 1.
And in general f(n) = f(n-1) + f(n-2) + 1.
As it turns out, the solution to this recurrence is that f(n) is just the sum of the first n Fibonacci numbers:
  1 + 1 + 2 + 3 + 5 + 8        // f(6)
+ 1 + 1 + 2 + 3 + 5            // + f(5)
+ 1                            // + 1
= 1 + 1 + 2 + 3 + 5 + 8 + 13   // = f(7)
This is (very) roughly phi^(n+1), not 2^n. Total for f(16) is still measured in the low thousands, not tens of thousands.
[edit 3]
Ah, I see, the crux of your question is this (hoisted from the comments):
Thanks Nemo for a detailed answer. I did a little test and
pthread_created ~10,000 threads with just a while(1) loop inside so
they don't terminate... and it did! True that the OS was smart enough
to only create about 1000 and run an even smaller number, but it
didn't run out of stack. Why do I not get a segfault when I generate
lots more than THREAD_MAX, but I do when I do it recursively?
Here is my guess.
You only have a few cores. At any time, the kernel has to decide which threads are going to run. If you have (say) 2 cores and 500 threads, then any particular thread is only going to run 1/250 of the time. So your main loop spawning new threads is not going to run very often. I am not even sure whether the kernel's scheduler is "fair" with respect to threads within a single process, so it is at least conceivable that with 1000 threads the main thread never gets to run at all.
At the very least, each thread doing while (1); is going to run for 1/HZ on its core before giving up its time slice. This is probably 1ms, but it could be as high as 10ms depending on how your kernel was configured. So even if the scheduler is fair, your main thread will only get to run around once a second when you have thousands of threads.
Since only the main thread is creating new threads, the rate of thread creation slows to a crawl and possibly even stops.
Try this. Instead of while (1); for the child threads in your experiment, try while (1) pause();. (pause is from unistd.h.) This will keep the child threads blocked and should allow the main thread to keep grinding away creating new threads, leading to your crash.
And again, please check what pthread_create returns.
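A sketch of that modified experiment (the 10,000 figure is just illustrative):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative only: spawn threads that block in pause() instead of spinning,
 * and report the first pthread_create failure instead of ignoring it. */
static void *sleeper(void *arg) {
    (void)arg;
    while (1)
        pause();        /* blocks until a signal arrives; uses no CPU */
    return NULL;
}

int main(void) {
    for (int i = 0; i < 10000; i++) {
        pthread_t th;
        int rc = pthread_create(&th, NULL, sleeper, NULL);
        if (rc != 0) {
            printf("pthread_create failed at thread %d (error %d)\n", i, rc);
            return 1;
        }
    }
    printf("created 10000 threads\n");
    pause();            /* keep the process alive so the threads persist */
    return 0;
}

If the failure (or crash) reappears once the children actually stay blocked, that supports the scheduling explanation above.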

first thing i would do is run a statement like printf("%i", PTHREAD_THREADS_MAX); and see what the value is; i don't think that the max threads of the OS is necessarily the same as the max number of pthreads, although i do see that you say you can successfully achieve 16000 threads with no recursion so i'm just mentioning it as something i would check in general.
should PTHREAD_THREADS_MAX be significantly greater than the number of threads you are achieving, i would start checking the return values of the pthread_create() calls to see if you are getting EAGAIN. my suspicion is that the answer to your question is that you're getting a segfault from attempting to use an uninitialized thread in a join...
also, as paxdiablo mentioned, you're talking about a couple of thousand threads at n=16 here (fewer than a naive 2^16 estimate, since many of them finish before the last are created); i would probably try to keep a log to see in what order each was created. probably the easiest thing would be just to use the (n-1) and (n-2) values as your log items; otherwise you would have to use a semaphore or similar to protect a counter...
printf might bog down - in fact, i wouldn't be surprised if that actually affected things by allowing more threads to finish before new ones were started - so i'd probably just log using file write(); it can be a simple file, and you should be able to get a feel for what's going on by looking at the patterns of numbers there. (that assumes file ops are thread safe; i think they are; it's been a while.)
also, once checking for EAGAIN, you could try sleeping a bit and retrying; perhaps it will ramp up over time, and the system is just being overwhelmed by the sheer number of thread requests rather than permanently out of resources; this would verify whether just waiting and restarting could get you where you want to be.
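something like this, maybe (just a sketch of the retry idea):

#include <errno.h>
#include <pthread.h>
#include <unistd.h>

/* sketch: retry thread creation with a short sleep when the system is
 * temporarily out of resources (EAGAIN); give up on any other error */
int create_with_retry(pthread_t *th, void *(*fn)(void *), void *arg)
{
    int rc;
    while ((rc = pthread_create(th, NULL, fn, arg)) == EAGAIN)
        usleep(100 * 1000);   /* wait 100 ms and try again */
    return rc;                /* 0 on success, another error code on hard failure */
}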
finally; i might try to re-write the function as a fork() (i know fork() is evil or whatever ;)) and see whether you have better luck there.
:)

Related

Accuracy of clock_gettime() in a context switch scenario

I'm trying to 'roughly' calculate the time of a thread context switch in a Linux system. I've written a program that uses pipes and multi-threading to achieve this. When running the program, the calculated time is clearly wrong (see output below). I am unsure if this is due to me using the wrong clock_id for this procedure, or perhaps my implementation being flawed.
I have implemented sched_setaffinity() so as to only have the program run on core 0. I've tried to leave as much fluff out of the code so as to only measure the time of a context switch, so the child thread only writes a single character to the pipe and the parent does a 0-byte read.
I have a parent thread that creates one child thread with a one-way pipe between them to pass data; the child thread runs a simple function to write to a pipe.
void* thread_1_function(void* arg)
{
    write(fd2[1], "", sizeof(""));
    return NULL;
}
Meanwhile, the parent thread creates the child thread, starts the time counter and then calls a read on the pipe that the child thread writes to.
int main(int argc, char *argv[])
{
    //time struct declaration
    struct timespec start, end;

    //sets program to only use core 0
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    CPU_SET(0, &cpu_set);

    if((sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set) < 1))
    {
        int nproc = sysconf(_SC_NPROCESSORS_ONLN);
        int k;
        printf("Processor used: ");
        for(k = 0; k < nproc; ++k)
        {
            printf("%d ", CPU_ISSET(k, &cpu_set));
        }
        printf("\n");

        if(pipe(fd1) == -1)
        {
            printf("fd1 pipe error");
            return 1;
        }
        //fail on file descriptor 2 fail
        if(pipe(fd2) == -1)
        {
            printf("fd2 pipe error");
            return 1;
        }

        pthread_t thread_1;
        pthread_create(&thread_1, NULL, &thread_1_function, NULL);
        pthread_join(thread_1, NULL);

        int i;
        uint64_t sum = 0;
        for(i = 0; i < iterations; ++i)
        {
            //initialize clock start
            clock_gettime(CLOCK_MONOTONIC, &start);
            //wait for child thread to write to pipe
            read(fd2[0], input, 0);
            //record clock end
            clock_gettime(CLOCK_MONOTONIC, &end);

            write(fd1[1], "", sizeof(""));

            uint64_t diff;
            diff = billion * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
            diff = diff;
            sum += diff;
        }
The results I get while running this are typically like this:
3000
3000
4000
2000
12000
3000
5000
and so forth. When I inspect the time returned in the start and end timespec structs, I see that tv_nsec seems to be a 'rounded' number as well:
start.tv_nsec: 714885000, end.tv_nsec: 714888000
Would this be caused by CLOCK_MONOTONIC not being precise enough for what I'm attempting to measure, or some other problem that I'm overlooking?
i see that tv_nsec seems to be a 'rounded' number as well:
2626, 714885000, 2626, 714888000
Would this be caused by a clock_monotonic not being precise enough for
what im attempting to measure, or some other problem that i'm
overlooking?
Yes, that's a possibility. Every clock supported by the system has a fixed resolution. struct timespec is capable of supporting clocks with nanosecond resolution, but that does not mean that you can expect every clock to actually have such resolution. It looks like your CLOCK_MONOTONIC might have a resolution of 1 microsecond (1000 nanoseconds), but you can check that via the clock_getres() function.
If it is available to you, then you might try CLOCK_PROCESS_CPUTIME_ID. It is possible that that would have higher resolution than CLOCK_MONOTONIC for you, but do note that single-microsecond resolution is pretty precise -- that's on the order of one tick per 3000 CPU cycles on a modern machine.
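For instance, a quick check of what your clocks actually report (a small sketch):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Print the resolution the system reports for a couple of clocks. */
static void show_res(const char *name, clockid_t id)
{
    struct timespec res;
    if (clock_getres(id, &res) == 0)
        printf("%s resolution: %ld ns\n", name,
               res.tv_sec * 1000000000L + res.tv_nsec);
    else
        perror(name);
}

int main(void)
{
    show_res("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
    show_res("CLOCK_PROCESS_CPUTIME_ID", CLOCK_PROCESS_CPUTIME_ID);
    return 0;
}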
Even so, I see several possible problems with your approach:
Although you set your process to have affinity for a single CPU, that does not prevent the system from scheduling other processes on that CPU, too. Thus, unless you've taken additional measures, you can't be certain -- it's not even likely -- that every context switch away from one of your program's threads is to the other thread.
You start your second thread and then immediately join it. There is no more context switching between your threads after that, because your second thread no longer exists after being successfully joined.
read() with a count of 0 may or may not check for errors, and it certainly does not transfer any data. It is totally unclear to me why you identify the time for that call with the time for a context switch.
If a context switch does occur in the space you're timing, then at least two need to occur there -- away from your program and back to it. Also, you're measuring the time consumed by whatever else runs in the other context as well, not just the switch time. The 1000-nanosecond steps may thus reflect time slices, rather than switching time.
Your main thread is writing null characters to the write end of a pipe, but there does not appear to be anything reading them. If indeed there isn't then this will eventually fill up the pipe's buffer and block. The purpose is lost on me.

Multi-threads program architecture C pthread

I'd like to create a multi-threaded program in C (Linux) with:
Infinite loop with infinite number of tasks
One thread per one task
Limit the total number of threads: if, for instance, the total number of threads is more than MAX_THREADS_NUMBER, do sleep() until the total number of threads becomes less than MAX_THREADS_NUMBER, then continue.
To sum up: I need to do an infinite number of tasks (one task per thread) and I'd like to know how to implement it using pthreads in C.
Here is my code:
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_THREADS 50

pthread_t thread[MAX_THREADS];
int counter;
pthread_mutex_t lock;

void* doSomeThing(void *arg)
{
    pthread_mutex_lock(&lock);
    counter += 1;
    printf("Job %d started\n", counter);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    int i = 0;
    int ret;
    if (pthread_mutex_init(&lock, NULL) != 0)
    {
        printf("\n mutex init failed\n");
        return 1;
    }
    for (i = 0; i < MAX_THREADS; i++) {
        ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
        if (ret != 0)
            printf("\ncan't create thread :[%s]", strerror(ret));
    }
    // Wait all threads to finish
    for (i = 0; i < MAX_THREADS; i++) {
        pthread_join(thread[i], NULL);
    }
    pthread_mutex_destroy(&lock);
    return 0;
}
How to make this loop infinite?
for (i = 0; i < MAX_THREADS; i++) {
    ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
    if (ret != 0)
        printf("\ncan't create thread :[%s]", strerror(ret));
}
I need something like this:
while (1) {
    if (thread_number > MAX_THREADS_NUMBER)
        sleep(1);
    ret = pthread_create(...);
    if (ret != 0)
        printf("\ncan't create thread :[%s]", strerror(ret));
}
Your current program is based on a simple dispatch design: the initial thread creates worker threads, assigning each one a task to perform. Your question is, how you make this work for any number of tasks, any number of worker threads. The answer is, you don't: your chosen design makes such a modification basically impossible.
Even if I were to answer your stated questions, it would not make the program behave the way you'd like. It might work after a fashion, but it'd be like a bicycle with square wheels: not very practical, nor robust -- not even fun after you stop laughing at how silly it looks.
The solution, as I wrote in a comment to the original question, is to change the underlying design: from a simple dispatch to a thread pool approach.
Implementing a thread pool requires two things. The first is to change your viewpoint: instead of starting a thread and having it perform a task, each thread in the "pool" grabs a task to perform and returns to the "pool" after it has performed it. Understanding this is the hard part. The second part, implementing a way for each thread to grab a new task, is simple: this typically centers around a data structure, protected with locks of some sort. The exact data structure does depend on what the actual work to do is, however.
Let's assume you wanted to parallelize the calculation of the Mandelbrot set (or rather, the escape time, or the number of iterations needed before a point can be ruled to be outside the set; the Wikipedia page contains pseudocode for exactly this). This is one of the "embarrassingly parallel" problems; those where the sub-problems (here, each point) can be solved without any dependencies.
Here's how I'd do the core of the thread pool in this case. First, the escape time or iteration count needs to be recorded for each point. Let's say we use an unsigned int for this. We also need the number of points (it is a 2D array), a way to calculate the complex number that corresponds to each point, plus some way to know which points have either been computed, or are being computed. Plus mutually exclusive locking, so that only one thread will modify the data structure at once. So:
typedef struct {
    int x_size, y_size;
    size_t stride;
    double r_0, i_0;
    double r_dx, i_dx;
    double r_dy, i_dy;
    unsigned int *iterations;
    sem_t done;
    pthread_mutex_t mutex;
    int x, y;
} fractal_work;
When an instance of fractal_work is constructed, x_size and y_size are the number of columns and rows in the iterations map. The number of iterations (or escape time) for point x,y is stored in iterations[x+y*stride]. The real part of the complex coordinate for that point is r_0 + x*r_dx + y*r_dy, and imaginary part is i_0 + x*i_dx + y*i_dy (which allows you to scale and rotate the fractal freely).
When a thread grabs the next available point, it first locks the mutex, and copies the x and y values (for itself to work on). Then, it increases x. If x >= x_size, it resets x to zero, and increases y. Finally, it unlocks the mutex, and calculates the escape time for that point.
However, if x == 0 && y >= y_size, the thread posts on the done semaphore and exits, letting the initial thread know that the fractal is complete. (The initial thread just needs to call sem_wait() once for each thread it created.)
The thread worker function is then something like the following:
void *fractal_worker(void *data)
{
    fractal_work *const work = (fractal_work *)data;
    int x, y;
    while (1) {
        pthread_mutex_lock(&(work->mutex));
        /* No more work to do? */
        if (work->x == 0 && work->y >= work->y_size) {
            sem_post(&(work->done));
            pthread_mutex_unlock(&(work->mutex));
            return NULL;
        }
        /* Grab this task (point), advance to next. */
        x = work->x;
        y = work->y;
        if (++(work->x) >= work->x_size) {
            work->x = 0;
            ++(work->y);
        }
        pthread_mutex_unlock(&(work->mutex));
        /* z.r = work->r_0 + (double)x * work->r_dx + (double)y * work->r_dy;
           z.i = work->i_0 + (double)x * work->i_dx + (double)y * work->i_dy;
           TODO: implement the fractal iteration,
           and count the iterations (say, n);
           save the escape time (number of iterations)
           in the work->iterations array, e.g.
           work->iterations[(size_t)x + work->stride*(size_t)y] = n;
        */
    }
}
The program first creates the fractal_work data structure for the worker threads to work on, initializes it, then creates some number of threads giving each thread the address of that fractal_work structure. It can then call fractal_worker() itself too, to "join the thread pool". (This pool automatically "drains", i.e. threads will return/exit, when all points in the fractal are done.)
Finally, the main thread calls sem_wait() on the done semaphore, as many times as it created worker threads, to ensure all the work is done.
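Putting that together, the setup in main() might look roughly like this (a sketch that leans on the fractal_work and fractal_worker definitions above; the field values and NUM_WORKERS are placeholders):

#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

#define NUM_WORKERS 4   /* placeholder: usually the number of cores */

int main(void)
{
    fractal_work work = {
        .x_size = 640, .y_size = 480, .stride = 640,
        .r_0 = -2.0, .i_0 = -1.2,
        .r_dx = 3.0 / 640.0, .i_dx = 0.0,
        .r_dy = 0.0, .i_dy = 2.4 / 480.0,
        .x = 0, .y = 0,
    };
    work.iterations = calloc((size_t)work.x_size * work.y_size,
                             sizeof *work.iterations);
    sem_init(&work.done, 0, 0);
    pthread_mutex_init(&work.mutex, NULL);

    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, fractal_worker, &work);

    /* Wait until every worker has posted the semaphore, then reap them. */
    for (int i = 0; i < NUM_WORKERS; i++)
        sem_wait(&work.done);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);

    /* ... render or save work.iterations here ... */
    free(work.iterations);
    return 0;
}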
The exact fields in the fractal_work structure above do not matter. However, it is at the very core of the thread pool. Typically, there is at least one mutex or rwlock protecting the work details, so that each worker thread gets unique work details, as well as some kind of flag or condition variable or semaphore to let the original thread know that the task is now complete.
In a multithreaded server, there is usually only one instance of the structure (or variables) describing the work queue. It may even contain things like minimum and maximum number of threads, allowing the worker threads to control their own number to dynamically respond to the amount of work available. This sounds magical, but is actually simple to implement: when a thread has completed its work, or is woken up in the pool with no work, and is holding the mutex, it first examines how many queued jobs there are, and what the current number of worker threads is. If there are more than the minimum number of threads, and no work to do, the thread reduces the number of threads, and exits. If there are less than the maximum number of threads, and there is a lot of work to do, the thread first creates a new thread, then grabs the next task to work on. (Yes, any thread can create new threads into the process. They are all on equal footing, too.)
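In rough outline, that decision logic is just a few comparisons made while holding the mutex; here is a sketch in which every name (struct pool, pool_worker, and so on) is made up purely for illustration:

#include <pthread.h>

/* Everything here is hypothetical and exists only to illustrate the
 * bookkeeping described above; error handling is omitted. */
enum { WORKER_EXIT, WORKER_CONTINUE };

struct pool {
    pthread_mutex_t mutex;
    int jobs_queued;                     /* work items currently waiting */
    int num_threads, min_threads, max_threads;
};

void *pool_worker(void *arg);            /* the usual worker loop, not shown */

/* Called by a worker that holds pool->mutex, right after finishing a task. */
int adjust_pool_size(struct pool *pool)
{
    if (pool->jobs_queued == 0 && pool->num_threads > pool->min_threads) {
        pool->num_threads--;             /* no work: let this worker exit */
        return WORKER_EXIT;
    }
    if (pool->jobs_queued > pool->num_threads &&
        pool->num_threads < pool->max_threads) {
        pthread_t helper;
        pool->num_threads++;             /* lots of work: recruit a helper */
        pthread_create(&helper, NULL, pool_worker, pool);
    }
    return WORKER_CONTINUE;              /* grab the next task as usual */
}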
A lot of the code in a practical multithreaded application using one or more thread pools to do work is some sort of bookkeeping. The thread pool approach very much concentrates on the data, and on the computation that needs to be performed on the data. I'm sure there are much better examples of thread pools out there somewhere; the hard part is to think of a good task for the application to perform, as the data structures are so task-dependent, and many computations are so simple that parallelizing them makes no sense (since creating new threads does have a small computational cost, it'd be silly to waste time creating threads when a single thread does the same work in the same or less time).
Many tasks that benefit from parallelization, on the other hand, require information to be shared between workers, and that requires a lot of thinking to implement correctly. (For example, although solutions exist for parallelizing molecular dynamics simulations efficiently, most simulators still calculate and exchange data in separate steps, rather than at the same time. It's just that hard to do right, you see.)
All this means that you cannot expect to be able to write the code unless you understand the concept. Indeed, truly understanding the concepts is the hard part: writing the code is comparatively easy.
Even in the above example, there are certain tripping points: Does the order of posting the semaphore and releasing the mutex matter? (Well, it depends on what the thread that is waiting for the fractal to complete does -- and indeed, if it is waiting yet.) If it was a condition variable instead of a semaphore, it would be essential that the thread that is interested in the fractal completion is waiting on the condition variable, otherwise it would miss the signal/broadcast. (This is also why I used a semaphore.)

How to call same rank in MPI using C language?

I am trying to learn MPI programming and have written the following program. It adds up an entire row of an array and outputs the sum. At rank 0 (or process 0), it calls all its slave ranks to do the calculation. I want to do this using only two other slave ranks/processes. Whenever I try to send to the same rank twice, as shown in the code below, my code just hangs in the middle and doesn't finish. If I don't send to the same rank twice, the code works correctly.
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int tag2 = 1;
    int arr[30] = {0};
    MPI_Request request;
    MPI_Status status;

    printf ("\n--Current Rank: %d\n", world_rank);

    int index;
    int source = 0;
    int dest;

    if (world_rank == 0)
    {
        int i;
        printf("* Rank 0 excecuting\n");

        index = 0;
        dest = 1;
        for ( i = 0; i < 30; i++ )
        {
            arr[ i ] = i + 1;
        }
        MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);

        index = 0;
        dest = 2;
        for ( i = 0; i < 30; i++ )
        {
            arr[ i ] = 0;
        }
        MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);

        index = 0;
        dest = 2; //Problem happens here when I try to call the same destination (or rank 2) twice
                  //If I change this dest value to 3 and run using: mpirun -np 4 test, this code will work correctly
        for ( i = 0; i < 30; i++ )
        {
            arr[ i ] = 1;
        }
        MPI_Send(&arr[0], 30, MPI_INT, dest, tag2, MPI_COMM_WORLD);
    }
    else
    {
        int sum = 0;
        int i;
        MPI_Irecv(&arr[0], 30, MPI_INT, source, tag2, MPI_COMM_WORLD, &request);
        MPI_Wait (&request, &status);
        for(i = 0; i < 30; i++)
        {
            sum = arr[i] + sum;
        }
        printf("\nSum is: %d at rank: %d\n", sum, world_rank);
    }

    MPI_Finalize();
}
Result when using: mpirun -np 3 test
--Current Rank: 2
Sum is: 0 at rank: 2
--Current Rank: 0
* Rank 0 excecuting
--Current Rank: 1
Sum is: 524800 at rank: 1
//Program hangs here and wouldn't show sum of 30
Please let me know how I can send to the same rank twice, for example when I only have two other slave processes to work with.
Please show an example if possible.
In MPI, each process executes the same code and, as you're doing, you differentiate the different processes primarily through checking rank in if/else statements. The master process with rank 0 is doing 3 sends: a send to process 1, then two sends to process 2. The slave processes each do only one receive, which means that rank 1 receives its first message and rank 2 receives its first message. When you call the third MPI_Send on process 0, there is not and will not be any slave waiting to receive the message after that point, as the slaves have finished executing their else block. The program gets blocked as the master waits to send the final message.
In order to fix that, you have to make sure the slave of rank 2 performs two receives, either by adding a loop for that process only, or by repeating, for that process only (using an if(world_rank == 2) check), the block of code
sum = 0; //resetting sum
MPI_Irecv(&arr[0], 30, MPI_INT, source, tag2, MPI_COMM_WORLD, &request);
MPI_Wait (&request, &status);
for(i = 0; i < 30; i++)
{
    sum = arr[i] + sum;
}
printf("\nSum is: %d at rank: %d\n", sum, world_rank);
TL;DR: Just a remark that the master/slave approach isn't to be favoured and might durably shape a programmer's habits, leading to poor code when put into production.
Although Clarissa is perfectly correct and her answer is very clear, I'd like to add a few general remarks, not about the code itself, but about parallel computing philosophy and good habits.
First a quick preamble: when one wants to parallelise one's code, it can be for two main reasons: making it faster and/or permitting it to handle larger problems by overcoming the limitations (like memory limitation) found on a single machine. But in all cases, performance matters and I will always assume that MPI (or generally speaking parallel) programmers are interested and concerned by performance. So the rest of my post will suppose that you are.
Now the main reason for this post: over the past few days, I've seen a few questions here on SO about MPI and parallelisation, obviously coming from people eager to learn MPI (or OpenMP for that matter). This is great! Parallel programming is great and there'll never be enough parallel programmers. So I'm happy (and I'm sure many SO members are too) to answer questions helping people to learn how to program in parallel. And in the context of learning how to program in parallel, you have to write some simple codes, doing simple things, in order to understand what the API does and how it works. These programs might look stupid from a distance and very ineffective, but that's fine, that's how learning works. Everybody learned this way.
However, you have to keep in mind that these programs you write are only that: API learning exercises. They're not the real thing and they do not reflect the philosophy of what an actual parallel program is or should be. And what drives my answer here is that I've seen, here and in other questions and answers, the notion of a "master" process and "slave" processes put forward recurrently. And that is wrong, fundamentally wrong! Let me explain why:
As Clarissa perfectly pinpointed, "in MPI, each process executes the same code". The idea is to find a way of making several processes to interact for working together in solving a (possibly larger) problem (hopefully faster). But amongst these processes, none gets any special status, they are all equal. They are given an id to be able to address them, but rank 0 is no better than rank 1 or rank 1025... By artificially deciding that process #0 is the "master" and the others are its "slaves", you break this symmetry and it has consequences:
Now that rank #0 is the master, it commands, right? That's what a master does. So it will be the one getting the information necessary to run the code, will distribute its share to the workers, will instruct them to do the processing. Then it will wait for the processing to be concluded (possibly getting itself busy in-between, but more likely just waiting or poking the workers, since that's what a master does), collect the results, do a bit of reassembling and output it. Job done! What's wrong with that?
Well, the following are wrong:
During the time the master gets the data, slaves are sitting idle. This is sequential and ineffective processing...
Then the distribution of the data and the work to do implies a lot of transfers. This takes time, and since it is solely between process #0 and all the others, it might create a lot of congestion on a single network link.
While workers do their work, what should the master do? Working as well? If yes, then it might not be readily available to handle requests from the slaves when they come, delaying the whole parallel processing. Waiting for these requests? Then it wastes a lot of computing power by sitting idle... Ultimately, there is no good answer.
Then points 1 and 2 are repeated in reverse order, with the gathering of the results and the outputting of results. That's a lot of data transfers and sequential processing, which will badly damage the global scalability, effectiveness and performance.
So I hope you now see why the master/slave approach is (usually, not always, but very often) wrong. And the danger I see from the questions and answers I've read over the past days is that you might get your mind formatted in this approach as if it were the "normal" way of thinking in parallel. Well, it is not! Parallel programming is symmetry. It is handling the problem globally, in all places at the same time. You have to think parallel from the start and see your code as a global parallel entity, not just a bunch of processes that need to be instructed on what to do. Each process is its own master, dealing with its peers on an equal footing. Each process should (as much as possible) acquire its data by itself (making that a parallel processing step); decide what to do based on the number of peers involved in the processing and its id; exchange information with its peers when necessary, be it locally (point-to-point communications) or globally (collective communications); and issue its own share of the result (again leading to parallel processing)...
OK, that's a bit extreme a requirement for people just starting to learn parallel programming, and I by no means want to tell you that your learning exercises should be like this. But keep the goal in mind and don't forget that API learning examples are only API learning examples, not reduced models of actual codes. So keep on experimenting with MPI calls to understand what they do and how they work, but try to slowly tend towards a symmetrical approach in your examples. That can only be beneficial for you in the long term.
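As a tiny illustration of that symmetric style, here is a sketch in which every rank computes its own share and a collective combines the results (the numbers and the 1..30 sum are just an example):

#include <mpi.h>
#include <stdio.h>

/* Sketch: every rank sums its own slice of 1..N, then a collective
 * reduction combines the partial sums. No master/slave asymmetry. */
int main(int argc, char *argv[])
{
    const int N = 30;
    int rank, size, i;
    long local = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank generates and sums only its own slice of the data. */
    for (i = rank; i < N; i += size)
        local += i + 1;

    MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)                       /* printing is the only asymmetry */
        printf("Sum of 1..%d is %ld\n", N, total);

    MPI_Finalize();
    return 0;
}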
Sorry for this lengthy and somewhat off topic answer, and good luck with your parallel programming.

Linux - force single-core execution and debug multi-threading with pthread

I'm debugging a multi-threaded problem with C, pthread and Linux. On my MacOS 10.5.8 (C2D) it runs fine; on my Linux computers (2-4 cores) it produces undesired output.
I'm not experienced, therefore I attached my code. It's rather simple: each new thread creates two more threads until a maximum is reached. So no big deal... as I thought until a couple of days ago.
Can I force single-core execution to prevent my bugs from occurring?
I profiled the program's execution, instrumenting with Valgrind:
valgrind --tool=drd --read-var-info=yes --trace-mutex=no ./threads
I get a couple of conflicts in the BSS segment, which are caused by my global structs and thread counter variables. However, I think I could mitigate these conflicts with forced single-core execution, because I believe the concurrent scheduling on my 2-4 core test systems is responsible for my errors.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_THR 12
#define NEW_THR 2

int wait_time = 0;                                // log global wait time
int num_threads = 0;                              // how many threads there are
pthread_t threads[MAX_THR];                       // global array to collect threads
pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;  // sync

struct thread_data
{
    int nr;   // nr of thread, serves as id
    int time; // wait time from rand()
};

struct thread_data thread_data_array[MAX_THR+1];

void
*PrintHello(void *threadarg)
{
    if(num_threads < MAX_THR){
        // using the argument
        pthread_mutex_lock(&mut);
        struct thread_data *my_data;
        my_data = (struct thread_data *) threadarg;
        // updates
        my_data->nr = num_threads;
        my_data->time = rand() % 10 + 1;
        printf("Hello World! It's me, thread #%d and sleep time is %d!\n",
               my_data->nr,
               my_data->time);
        pthread_mutex_unlock(&mut);
        // counter
        long t = 0;
        for(t = 0; t < NEW_THR; t++){
            pthread_mutex_lock(&mut);
            num_threads++;
            wait_time += my_data->time;
            pthread_mutex_unlock(&mut);
            pthread_create(&threads[num_threads], NULL, PrintHello, &thread_data_array[num_threads]);
            sleep(1);
        }
        printf("Bye from %d thread\n", my_data->nr);
        pthread_exit(NULL);
    }
    return 0;
}

int
main (int argc, char *argv[])
{
    long t = 0;
    // srand(time(NULL));
    if(num_threads < MAX_THR){
        for(t = 0; t < NEW_THR; t++){
            // -> 2 threads entry point
            pthread_mutex_lock(&mut);
            // rand time
            thread_data_array[num_threads].time = rand() % 10 + 1;
            // update global wait time variable
            wait_time += thread_data_array[num_threads].time;
            num_threads++;
            pthread_mutex_unlock(&mut);
            pthread_create(&threads[num_threads], NULL, PrintHello, &thread_data_array[num_threads]);
            pthread_mutex_lock(&mut);
            printf("In main: creating initial thread #%ld\n", t);
            pthread_mutex_unlock(&mut);
        }
    }
    for(t = 0; t < MAX_THR; t++){
        pthread_join(threads[t], NULL);
    }
    printf("Bye from program, wait was %d\n", wait_time);
    pthread_exit(NULL);
}
I hope that code isn't too bad; I haven't done much C for a rather long time. :) The problem is:
printf("Bye from %d thread\n", my_data->nr);
my_data->nr sometimes resolves to garbage integer values:
In main: creating initial thread #0
Hello World! It's me, thread #2 and sleep time is 8!
In main: creating initial thread #1
[...]
Hello World! It's me, thread #11 and sleep time is 8!
Bye from 9 thread
Bye from 5 thread
Bye from -1376900240 thread
[...]
I don't know any more ways to profile and debug this.
If I debug it, it works - sometimes. Sometimes it doesn't :(
Thanks for reading this long question. :) I hope I didn't share too much of my currently unresolvable confusion.
Since this program seems to be just an exercise in using threads, with no actual goal, it is difficult to suggest how to treat your problem rather than treat the symptom. I believe you can actually pin a process or thread to a processor in Linux, but doing so for all threads removes most of the benefit of using threads, and I don't actually remember how to do it. Instead I'm going to talk about some things wrong with your program.
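For reference, pinning the whole process (and every thread it creates afterwards) to a single CPU can be done with sched_setaffinity(), or from the shell with taskset -c 0 ./threads; a minimal sketch:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch: restrict the calling process, and all threads it creates
 * afterwards, to CPU 0 so everything runs on a single core. Linux-specific. */
int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now restricted to CPU 0\n");
    /* ... run the threaded test from here ... */
    return 0;
}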
C compilers often make a lot of assumptions when they are doing optimizations. One of those assumptions is that unless the code currently being examined looks like it might change some variable, that variable does not change (this is a very rough approximation, and a more accurate explanation would take a very long time).
In this program you have variables which are shared and changed by different threads. If a variable is only read by threads (either const, or effectively const after the threads that look at it are created), then you don't have much to worry about (and by "read by threads" I'm including the main original thread): since the variable doesn't change, it makes no difference whether the compiler generates code to read it once (remembering it in a local temporary variable) or over and over again - the value is always the same, so calculations based on it always come out the same.
To force the compiler not do this you can use the volatile keyword. It is affixed to variable declarations just like the const keyword, and tells the compiler that the value of that variable can change at any instant, so reread it every time its value is needed, and rewrite it every time a new value for it is assigned.
NOTE that for pthread_mutex_t (and similar) variables you do not need volatile. If it were needed on the type(s) that make up pthread_mutex_t on your system, volatile would have been used within the definition of pthread_mutex_t. Additionally, the functions that access this type take its address and are specially written to do the right thing.
I'm sure now you are thinking that you know how to fix your program, but it is not that simple. You are doing math on a shared variable. Doing math on a variable using code like:
x = x + 1;
requires that you know the old value to generate the new value. If x is global then you have to conceptually load x into a register, add 1 to that register, and then store that value back into x. On a RISC processor you actually have to do all 3 of those instructions, and being 3 instructions I'm sure you can see how another thread accessing the same variable at nearly the same time could end up storing a new value in x just after we have read our value -- making our value old, so our calculation and the value we store will be wrong.
If you know any x86 assembly then you probably know that it has instructions that can do math on values in RAM (both getting from and storing the result in the same location in RAM all in one instruction). You might think that this instruction could be used for this operation on x86 systems, and you would almost be right. The problem is that this instruction is still executed in the steps that the RISC instruction would be executed in, and there are several opportunities for another processor to change this variable at the same time as we are doing our math on it. To get around this on x86 there is a lock prefix that may be applied to some x86 instructions, and I believe that glibc header files include atomic macro functions to do this on architectures that can support it, but this can't be done on all architectures.
To work right on all architectures you are going to need to:
int local_thread_count;
int create_a_thread;

pthread_mutex_lock(&count_lock);
local_thread_count = num_threads;
if (local_thread_count < MAX_THR) {
    num_threads = local_thread_count + 1;
    pthread_mutex_unlock(&count_lock);
    thread_data_array[local_thread_count].nr = local_thread_count;
    /* moved this into the creator since getting it
     * in the child will likely get the wrong value. */
    pthread_create(&threads[local_thread_count], NULL, PrintHello,
                   &thread_data_array[local_thread_count]);
} else {
    pthread_mutex_unlock(&count_lock);
}
Now, since you would have changed num_threads to volatile, you can atomically test and increment the thread count in all threads. At the end of this, local_thread_count should be usable as an index into the array of threads. Note that I create only one thread in this code, while yours was supposed to create several. I did this to make the example clearer, but it should not be too difficult to change it to add NEW_THR to num_threads - though if NEW_THR is 2 and MAX_THR - num_threads is 1 (somehow), then you have to handle that case correctly somehow.
Now, all of that being said, there may be another way to accomplish similar things by using semaphores. Semaphores are like mutexes, but they have a count associated with them. You would not get a value to use as the index into the array of threads (the function to read a semaphore count won't really give you this), but I thought that it deserved to be mentioned since it is very similar.
man 3 semaphore.h
will tell you a little bit about it.
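For example, a counting semaphore can cap how many worker threads are alive at once (a hypothetical sketch using the question's MAX_THR limit, not something from the answer above):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define MAX_THR 12          /* same limit as in the question */

static sem_t slots;         /* counts how many thread "slots" are free */

static void *worker(void *arg)
{
    printf("thread %ld running\n", (long)arg);
    sem_post(&slots);       /* give the slot back when the work is done */
    return NULL;
}

int main(void)
{
    sem_init(&slots, 0, MAX_THR);
    for (long i = 0; i < 100; i++) {
        pthread_t th;
        sem_wait(&slots);   /* blocks while MAX_THR workers are still alive */
        pthread_create(&th, NULL, worker, (void *)i);
        pthread_detach(th); /* nobody joins these, so detach them */
    }
    for (int i = 0; i < MAX_THR; i++)
        sem_wait(&slots);   /* crude: wait until every slot is free again */
    return 0;
}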
num_threads should at least be marked volatile, and preferably marked atomic too (although I believe that a plain int is practically fine), so that at least there is a higher chance that the different threads are seeing the same values. You might want to view the assembler output to see when the writes of num_threads to memory are actually supposed to take place.
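For example, with C11 <stdatomic.h> the counter could be declared and bumped like this (a sketch; GCC's older __sync_fetch_and_add builtin behaves similarly):

#include <stdatomic.h>
#include <stdio.h>

/* Sketch: the increment is a single indivisible read-modify-write, so two
 * threads can never both read the same old value and lose an update. */
static atomic_int num_threads;

int claim_thread_slot(void)
{
    return atomic_fetch_add(&num_threads, 1);   /* returns the previous value */
}

int main(void)
{
    int slot = claim_thread_slot();
    printf("claimed slot %d, count is now %d\n", slot, atomic_load(&num_threads));
    return 0;
}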
https://computing.llnl.gov/tutorials/pthreads/#PassingArguments
that seems to be the problem. you need to malloc the thread_data struct.
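i.e. something along these lines (just a sketch against the question's code, not a full fix):

/* sketch: give every thread its own heap-allocated argument block instead
 * of letting creator and child race on the same global array slot */
struct thread_data *arg = malloc(sizeof *arg);
arg->nr = num_threads;
arg->time = rand() % 10 + 1;
pthread_create(&threads[num_threads], NULL, PrintHello, arg);
/* ...and inside PrintHello, free(threadarg) when finished with it */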

Matrix Multiplication with Threads: Why is it not faster?

So I've been playing around with pthreads, specifically trying to calculate the product of two matrices. My code is extremely messy because it was just supposed to be a quick little fun project for myself, but the thread theory I used was very similar to:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define M 3
#define K 2
#define N 3
#define NUM_THREADS 10

int A [M][K] = { {1,4}, {2,5}, {3,6} };
int B [K][N] = { {8,7,6}, {5,4,3} };
int C [M][N];

struct v {
    int i; /* row */
    int j; /* column */
};

void *runner(void *param); /* the thread */

int main(int argc, char *argv[]) {
    int i, j, count = 0;
    for(i = 0; i < M; i++) {
        for(j = 0; j < N; j++) {
            //Assign a row and column for each thread
            struct v *data = (struct v *) malloc(sizeof(struct v));
            data->i = i;
            data->j = j;
            /* Now create the thread passing it data as a parameter */
            pthread_t tid;       //Thread ID
            pthread_attr_t attr; //Set of thread attributes
            //Get the default attributes
            pthread_attr_init(&attr);
            //Create the thread
            pthread_create(&tid, &attr, runner, data);
            //Make sure the parent waits for all thread to complete
            pthread_join(tid, NULL);
            count++;
        }
    }
    //Print out the resulting matrix
    for(i = 0; i < M; i++) {
        for(j = 0; j < N; j++) {
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }
}

//The thread will begin control in this function
void *runner(void *param) {
    struct v *data = param; // the structure that holds our data
    int n, sum = 0; //the counter and sum
    //Row multiplied by column
    for(n = 0; n < K; n++){
        sum += A[data->i][n] * B[n][data->j];
    }
    //assign the sum to its coordinate
    C[data->i][data->j] = sum;
    //Exit the thread
    pthread_exit(0);
}
source: http://macboypro.com/blog/2009/06/29/matrix-multiplication-in-c-using-pthreads-on-linux/
For the non-threaded version, I used the same setup (3 2-d matrices, dynamically allocated structs to hold r/c), and added a timer. First trials indicated that the non-threaded version was faster. My first thought was that the dimensions were too small to notice a difference, and it was taking longer to create the threads. So I upped the dimensions to about 50x50, randomly filled, and ran it, and I'm still not seeing any performance upgrade with the threaded version.
What am I missing here?
Unless you're working with very large matrices (many thousands of rows/columns), then you are unlikely to see much improvement from this approach. Setting up a thread on a modern CPU/OS is actually pretty expensive in relative terms of CPU time, much more time than a few multiply operations.
Also, it's usually not worthwhile to set up more than one thread per CPU core that you have available. If you have, say, only two cores and you set up 2500 threads (for 50x50 matrices), then the OS is going to spend all its time managing and switching between those 2500 threads rather than doing your calculations.
If you were to set up two threads beforehand (still assuming a two-core CPU), keep those threads available all the time waiting for work to do, and supply them with the 2500 dot products you need to calculate in some kind of synchronised work queue, then you might start to see an improvement. However, it still won't ever be more than 50% better than using only one core.
I'm not entirely sure I understand the source code, but here's what it looks like: You have a loop that runs M*N times. Each time through the loop, you create a thread that fills in one number in the result matrix. But right after you launch the thread, you wait for it to complete. I don't think that you're ever actually running more than one thread.
Even if you were running more than one thread, the thread is doing a trivial amount of work. Even if K was large (you mention 50), 50 multiplications isn't much compared to the cost of starting the thread in the first place. The program should create fewer threads--certainly no more than the number of processors--and assign more work to each.
You don't allow much parallel execution: you wait for the thread immediately after creating it, so there is almost no way for your program to use additional CPUs (i.e. it can never use a third CPU/core). Try to allow more threads to run (probably up to the count of cores you have).
If you have a processor with two cores, then you should just divide the work to be done into two halves and give each thread one half. The same principle applies if you have 3, 4, or 5 cores. The optimal performance design will always match the number of threads to the number of available cores (by available I mean cores that aren't already being heavily used by other processes).
One other thing you have to consider is that each thread must have its data contiguous and independent from the data for the other threads. Otherwise, cache misses will significantly slow down the processing.
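For example, splitting the rows of the result matrix across a small fixed number of threads might look like this (a self-contained sketch with placeholder sizes, not the original poster's code):

#include <pthread.h>
#include <stdio.h>

#define M 4            /* rows of A / C   (placeholder sizes) */
#define K 3            /* cols of A, rows of B */
#define N 4            /* cols of B / C */
#define NTHREADS 2     /* roughly the number of cores */

static int A[M][K], B[K][N], C[M][N];

struct slice { int row_begin, row_end; };   /* half-open range of rows */

static void *multiply_slice(void *arg)
{
    const struct slice *s = arg;
    for (int i = s->row_begin; i < s->row_end; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < K; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    struct slice slices[NTHREADS];

    /* ... fill A and B here ... */

    for (int t = 0; t < NTHREADS; t++) {      /* divide the rows evenly */
        slices[t].row_begin = t * M / NTHREADS;
        slices[t].row_end   = (t + 1) * M / NTHREADS;
        pthread_create(&th[t], NULL, multiply_slice, &slices[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    printf("C[0][0] = %d\n", C[0][0]);
    return 0;
}

Each thread writes only to its own contiguous band of rows in C, which also addresses the data-contiguity point above.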
To better understand these issues, I'd recommend the book Patterns for Parallel Programming
http://astore.amazon.com/amazon-books-20/detail/0321228111
Although its code samples are more directed to OpenMP & MPI, and you're using PThreads, still the first half of the book is very rich in fundamental concepts & inner working of multithreading environments, very useful to avoid most of the performance bottlenecks you'll encounter.
Provided the code parallelizes correctly (I won't check it), performance likely only improves when the code is parallelized in hardware, i.e. when the threads really run in parallel (multiple cores, multiple CPUs, ... other technologies...) and are not merely apparently parallel (in the "multitasking" way). Just an idea, I am not sure this is the case here.
