How to achieve parallelism in C? - c

This might be a dumb question; I'm very sorry if that's the case. But I'm struggling to take advantage of the multiple cores in my quad-core MacBook to perform multiple computations at the same time. This is not for any particular project, just a general question, since I want to learn for when I eventually do need to do this kind of thing.
I am aware of threads, but they seem to run on the same core, so I don't seem to gain any performance using them for compute-bound operations (they are very useful for socket-based stuff, though!).
I'm also aware of processes, which can be created with fork, but I'm not sure they are guaranteed to use more CPU, or whether they, like threads, just help with IO-bound operations.
Finally, I'm aware of CUDA, which allows parallelism on the GPU (and I think OpenCL and compute shaders also allow my code to run in parallel on the CPU), but I'm currently looking for something that will let me take advantage of the multiple CPU cores that my computer has.
In Python, I'm aware of the multiprocessing module, which seems to provide an API very similar to threads, but there I do seem to gain an edge by running multiple functions performing computations in parallel. I'm looking into how I could get this same advantage in C, but I don't seem to be able to.
Any help pointing me in the right direction would be very much appreciated.
Note: I'm trying to achieve true parallelism, not concurrency.
Note 2: I'm only aware of threads and multiple processes in C. With threads I don't seem to get the performance boost I want, and I'm not very familiar with processes, so I'm still not sure whether running multiple processes is guaranteed to give me the advantage I'm looking for.

A simple program to heat up your CPU (100% utilization of all available cores).
Hint: the thread start function never returns; exit the program with [CTRL + C].
#include <pthread.h>
void* func(void *arg)
{
while (1);
}
int main()
{
#define NUM_THREADS 4 //use the number of cores (if known)
pthread_t threads[NUM_THREADS];
for (int i=0; i < NUM_THREADS; ++i)
pthread_create(&threads[i], NULL, func, NULL);
for (int i=0; i < NUM_THREADS; ++i)
pthread_join(threads[i], NULL);
return 0;
}
Compilation:
gcc -pthread -o thread_test thread_test.c
If I start ./thread_test, all cores are at 100%.
A word on fork and pthread_create:
fork creates a new process (the current process image is copied and executed in parallel), while pthread_create creates a new thread, sometimes called a lightweight process.
Both processes and threads run in parallel with the process or thread that created them.
Whether to use a child process or a thread depends on the situation: e.g. a child can replace its process image (via the exec family) and has its own address space, while threads share the address space of the process that created them.
There are of course a lot more differences; for those I recommend studying the following pages:
man fork
man pthreads
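On the NUM_THREADS comment in the program above: if the number of cores is not known in advance, it can usually be queried at run time. A minimal sketch, assuming a system whose sysconf() supports _SC_NPROCESSORS_ONLN (glibc, macOS and the BSDs do; it is a common extension, not part of POSIX itself). The helper name online_cores is mine, not a library function:

```c
#include <unistd.h>

/* Number of cores currently online, or 1 if the query is not supported.
 * Assumption: _SC_NPROCESSORS_ONLN is available on this platform. */
int online_cores(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}
```

The result can then size the threads array instead of the hard-coded 4.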

I am aware of threads, but they seem to run on the same core, so I don't seem to gain any performance using them for compute-bound operations (they are very useful for socket-based stuff, though!).
No, they don't. Unless your threads block, you will see all of them running. Just try this program (beware: it consumes all your CPU time), which starts 16 threads, each counting in a busy loop for 60 s. You will see all of them running and bringing your cores to their knees (it only runs for a minute, then everything ends):
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 16 /* had 16 cores, so I used this. Put here as many
* threads as cores you have. */
struct thread_data {
pthread_t thread_id; /* the thread id */
struct timespec end_time; /* time to get out of the tunnel */
int id; /* the array position of the thread */
unsigned long result; /* number of times looped */
};
void *thread_body(void *data)
{
struct thread_data *p = data;
p->result = 0UL;
clock_gettime(CLOCK_REALTIME, &p->end_time);
p->end_time.tv_sec += 60; /* 60 s. */
struct timespec now;
do {
/* just get the time */
clock_gettime(CLOCK_REALTIME, &now);
p->result++;
/* if you call printf() you will see them slowing, as there's a
* common buffer that forces all thread to serialize their outputs
*/
/* check if we are over */
} while ( now.tv_sec < p->end_time.tv_sec
|| (now.tv_sec == p->end_time.tv_sec
&& now.tv_nsec < p->end_time.tv_nsec));
return p;
} /* thread_body */
int main()
{
struct thread_data thrd_info[N];
for (int i = 0; i < N; i++) {
struct thread_data *d = &thrd_info[i];
d->id = i;
d->result = 0;
printf("Starting thread %d\n", d->id);
int res = pthread_create(&d->thread_id,
NULL, thread_body, d);
if (res != 0) { /* pthread functions return the error number; they do not set errno */
fprintf(stderr, "pthread_create: error %d\n", res);
exit(EXIT_FAILURE);
}
printf("Thread %d started\n", d->id);
}
printf("All threads created, waiting for all to finish\n");
for (int i = 0; i < N; i++) {
struct thread_data *joined;
int res = pthread_join(thrd_info[i].thread_id,
(void **)&joined);
if (res != 0) { /* pthread functions return the error number; they do not set errno */
fprintf(stderr, "pthread_join: error %d\n", res);
exit(EXIT_FAILURE);
}
printf("PTHREAD %d ended, with value %lu\n",
joined->id, joined->result);
}
} /* main */
Linux and all multithreaded systems work the same way: they create a new execution unit, and the available processors are given to each execution unit as necessary. If two execution units don't share a virtual address space, they are separate processes (not exactly so, but this explains the main difference between a process and a thread). Threads are normally encapsulated inside processes: they share the process ID and the virtual memory (older Linux thread implementations gave each thread its own process ID; with the current NPTL implementation, threads share one). Processes each run in a separate virtual address space, so they can only share things through system resources (files, shared memory, communication sockets/pipes, etc.).
The problem with your test case (you don't show it, so I have to guess) is probably that all your threads run a loop in which they try to print something. If so, each thread probably spends most of its time blocked on I/O (in printf()).
Stdio FILEs have the problem that all threads printing to the same FILE share one buffer, and the kernel serializes all the write(2) system calls to the same file descriptor. So if most of the time in the loop is spent blocked in a write, the kernel (and stdio) end up serializing all the calls to print, making it appear that only one thread is working at a time (all the threads are blocked by the one that is doing the I/O). The busy loop above makes all the threads run in parallel and shows you how the CPU gets saturated.
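To see threads help with actual computation rather than a busy loop, here is a minimal sketch that splits a sum across threads with no I/O inside the loop. The names (struct slice, sum_slice, parallel_sum_squares) and the MAX_WORKERS cap are my own choices for illustration. Each thread writes only to its own slot, so no locking is needed:

```c
#include <pthread.h>
#include <stdint.h>

#define MAX_WORKERS 64   /* arbitrary cap for this sketch */

struct slice {
    uint64_t begin, end;  /* half-open range [begin, end) */
    uint64_t sum;         /* private result, written only by its own thread */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    uint64_t acc = 0;
    for (uint64_t i = s->begin; i < s->end; i++)
        acc += i * i;     /* pure computation: no printf(), no locks */
    s->sum = acc;
    return NULL;
}

/* Sum of i*i for i in [0, n), computed by up to nthreads threads. */
uint64_t parallel_sum_squares(uint64_t n, int nthreads)
{
    pthread_t tid[MAX_WORKERS];
    struct slice part[MAX_WORKERS];
    int started[MAX_WORKERS];
    uint64_t total = 0;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_WORKERS) nthreads = MAX_WORKERS;

    for (int t = 0; t < nthreads; t++) {
        part[t].begin = n * (uint64_t)t / (uint64_t)nthreads;
        part[t].end = n * (uint64_t)(t + 1) / (uint64_t)nthreads;
        started[t] = pthread_create(&tid[t], NULL, sum_slice, &part[t]) == 0;
        if (!started[t])
            sum_slice(&part[t]);  /* could not start a thread: do it here */
    }
    for (int t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += part[t].sum;
    }
    return total;
}
```

With a large n, timing the call with 1 thread and then with as many threads as you have cores should show the difference, provided nothing in the loop blocks.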

Parallelism in C can also be achieved with the fork() function. fork() does not create a thread: it creates a child process that is a copy of the calling process, and both continue executing from the point where fork() returned (it returns 0 in the child and the child's PID in the parent). The operating system is free to schedule the parent and its children on different cores, so several processes can do compute-bound work at the same time.
Unlike threads, the parent and the child do not share data: each has its own copy of the address space, so results have to be passed back through a pipe, shared memory, a file, or the exit status. Use wait() or waitpid() in the parent to block until a child has terminated before relying on whatever it produced.
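A sketch of compute-bound work split across processes with fork(). Since the children have their own address spaces, each one sends its partial result back through a pipe, and the parent reaps them with waitpid(). The function name fork_sum_squares and the MAX_PROCS cap are made up for this example:

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 64   /* arbitrary cap for this sketch */

/* Sum of i*i for i in [0, n), split across nprocs child processes.
 * Stores the sum in *out and returns 0, or returns -1 on any error. */
int fork_sum_squares(uint64_t n, int nprocs, uint64_t *out)
{
    int fd[2];
    pid_t pids[MAX_PROCS];
    int started = 0, failed = 0;
    uint64_t total = 0;

    if (nprocs < 1 || nprocs > MAX_PROCS || pipe(fd) != 0)
        return -1;

    for (int p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0) { failed = 1; break; }
        if (pid == 0) {
            /* Child: compute one slice and write the result to the pipe. */
            uint64_t begin = n * (uint64_t)p / (uint64_t)nprocs;
            uint64_t end = n * (uint64_t)(p + 1) / (uint64_t)nprocs;
            uint64_t acc = 0;
            close(fd[0]);
            for (uint64_t i = begin; i < end; i++)
                acc += i * i;
            /* An 8-byte write to a pipe is atomic, so results never interleave. */
            _exit(write(fd[1], &acc, sizeof acc) == (ssize_t)sizeof acc ? 0 : 1);
        }
        pids[started++] = pid;
    }
    close(fd[1]);   /* parent only reads */

    for (int p = 0; p < started; p++) {
        uint64_t part;
        if (read(fd[0], &part, sizeof part) == (ssize_t)sizeof part)
            total += part;
        else
            failed = 1;
    }
    close(fd[0]);

    for (int p = 0; p < started; p++) {
        int status;
        if (waitpid(pids[p], &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }
    if (failed)
        return -1;
    *out = total;
    return 0;
}
```

This is roughly what Python's multiprocessing does for you behind its thread-like API: separate processes plus some channel to carry the results back.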

Related

Only one thread ever acquiring semaphore

I have a program in which multiple threads are in a loop where they acquire a binary semaphore and then increase a global counter. However, by printing out the thread IDs, I notice that only one thread ever acquires the semaphore. Here's my MRE:
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>
#define NUM_THREADS 10
#define MAX_COUNTER 100
struct threadCtx {
sem_t sem;
unsigned int counter;
};
static void *
threadFunc(void *args)
{
struct threadCtx *ctx = args;
pthread_t self;
bool done = false;
self = pthread_self();
while (!done) {
sem_wait(&ctx->sem);
if ( ctx->counter == MAX_COUNTER ) {
done = true;
}
else {
sleep(1);
printf("Thread %u increasing the counter to %u\n", (unsigned int)self, ++ctx->counter);
}
sem_post(&ctx->sem);
}
return NULL;
}
int main() {
pthread_t threads[NUM_THREADS];
struct threadCtx ctx = {.counter = 0};
sem_init(&ctx.sem, 0, 1);
for (int k=0; k<NUM_THREADS; k++) {
pthread_create(threads+k, NULL, threadFunc, &ctx);
}
for (int k=0; k<NUM_THREADS; k++) {
pthread_join(threads[k], NULL);
}
sem_destroy(&ctx.sem);
return 0;
}
The output is
Thread 1004766976 increasing the counter to 1
Thread 1004766976 increasing the counter to 2
Thread 1004766976 increasing the counter to 3
...
If I remove the call to sleep, the behavior is closer to what I would expect (i.e., the threads being woken up in a seemingly indeterminate manner). Why would this be?
David Schwartz's answer explains what is happening at a low level. That is to say, he's looking at it from the perspective of an OS developer or a hardware designer. Nothing wrong with that, but let's look at your program from the perspective of a Software Architect:
You've got multiple threads all executing the same loop. The loop locks the mutex,* it does some "work," and then it releases the mutex. OK, but what does it do next? Almost the very next thing that your loop does after releasing the mutex is it locks the mutex again. Your loop spends practically 100% of its time doing "work" with the mutex locked.
So, what's the point of running that same loop in multiple threads when there's never any opportunity for two or more threads to work at the same time?
If you want to use threads to do a parallel computation, you need to find or invent safe ways for the threads to do most of their work with the mutex unlocked. They should lock a mutex only for just long enough to post a result or to take another assignment.
Sometimes that means writing code that is less efficient than single threaded code would be. But suppose that program (A) has a single thread that makes almost 100% use of a CPU, while program (B) uses eight CPUs but only uses them with 50% efficiency. Which program is going to win?
* I know, your example uses a sem_t (semaphore) object. But "semaphore" is what you are using. "Mutex" is the role in which you are using it.
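To make "lock only long enough to post a result or take another assignment" concrete, here is a rough sketch that counts primes. The task, the struct layout, the chunk size of 1000 and all names are my own choices for illustration; the point is only where the lock and unlock calls sit:

```c
#include <pthread.h>

struct job {
    pthread_mutex_t lock;
    unsigned next;     /* next unassigned number; protected by lock */
    unsigned limit;    /* exclusive upper bound */
    unsigned chunk;    /* how many numbers one assignment covers */
    unsigned primes;   /* running total; protected by lock */
};

static int is_prime(unsigned n)
{
    if (n < 2) return 0;
    for (unsigned d = 2; d <= n / d; d++)
        if (n % d == 0) return 0;
    return 1;
}

static void *worker(void *arg)
{
    struct job *j = arg;
    unsigned found = 0;
    for (;;) {
        /* Short critical section: post the last result, take the next chunk. */
        pthread_mutex_lock(&j->lock);
        j->primes += found;
        unsigned begin = j->next;
        unsigned end = (j->limit - begin < j->chunk) ? j->limit : begin + j->chunk;
        j->next = end;
        pthread_mutex_unlock(&j->lock);
        if (begin >= end)
            return NULL;   /* nothing left to hand out */

        /* The expensive part runs with the mutex unlocked. */
        found = 0;
        for (unsigned n = begin; n < end; n++)
            found += (unsigned)is_prime(n);
    }
}

/* Counts the primes below limit using up to nthreads extra threads. */
unsigned count_primes(unsigned limit, int nthreads)
{
    struct job j = { .next = 0, .limit = limit, .chunk = 1000, .primes = 0 };
    pthread_t tid[16];
    int started = 0;

    pthread_mutex_init(&j.lock, NULL);
    if (nthreads > 16) nthreads = 16;
    for (int t = 0; t < nthreads; t++)
        if (pthread_create(&tid[started], NULL, worker, &j) == 0)
            started++;
    worker(&j);   /* the calling thread takes chunks too */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    pthread_mutex_destroy(&j.lock);
    return j.primes;
}
```

Compare with the loop in the question: there, the sleep (the "work") happens while the semaphore is held, so only one thread can ever be working.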
Why would this be?
Context switches are expensive and your implementation is, wisely, minimizing them. Your threads are all fighting over the same resource; scheduling them in close alternation would make performance much worse, probably for the entire system.
Since the thread that keeps getting the semaphore never uses up its timeslice, it will keep getting the resource. It is your responsibility to write code to do the work that you want done. It's the implementation's responsibility to execute your code as efficiently as it can, and that's what it's doing.
Most likely, what's going on under the hood is this:
The thread that keeps getting the semaphore can always make forward progress except when it is sleeping. But when it is sleeping, no other thread that needs the semaphore can make forward progress.
The thread that keeps getting the semaphore never exhausts its timeslice because it sleeps before that happens.
So there is no reason for the implementation to ever block this thread other than when it is sleeping, meaning that no other thread can get the semaphore. If you don't want this thread to keep sleeping with the semaphore and blocking other threads, then write different code.

Multi-threads program architecture C pthread

I'd like to create a multithreaded program in C (Linux) with:
An infinite loop with an infinite number of tasks
One thread per task
A limit on the total number of threads: if, for instance, the total number of threads is more than MAX_THREADS_NUMBER, sleep() until it becomes less than MAX_THREADS_NUMBER, then continue.
Summary: I need to do an infinite number of tasks (one task per thread), and I'd like to know how to implement this using pthreads in C.
Here is my code:
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#define MAX_THREADS 50
pthread_t thread[MAX_THREADS];
int counter;
pthread_mutex_t lock;
void* doSomeThing(void *arg)
{
pthread_mutex_lock(&lock);
counter += 1;
printf("Job %d started\n", counter);
pthread_mutex_unlock(&lock);
return NULL;
}
int main(void)
{
int i = 0;
int ret;
if (pthread_mutex_init(&lock, NULL) != 0)
{
printf("\n mutex init failed\n");
return 1;
}
for (i = 0; i < MAX_THREADS; i++) {
ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
// Wait all threads to finish
for (i = 0; i < MAX_THREADS; i++) {
pthread_join(thread[i], NULL);
}
pthread_mutex_destroy(&lock);
return 0;
}
How to make this loop infinite?
for (i = 0; i < MAX_THREADS; i++) {
ret = pthread_create(&(thread[i]), NULL, &doSomeThing, NULL);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
I need something like this:
while (1) {
if (thread_number > MAX_THREADS_NUMBER)
sleep(1);
ret = pthread_create(...);
if (ret != 0)
printf("\ncan't create thread :[%s]", strerror(ret));
}
Your current program is based on a simple dispatch design: the initial thread creates worker threads, assigning each one a task to perform. Your question is, how you make this work for any number of tasks, any number of worker threads. The answer is, you don't: your chosen design makes such a modification basically impossible.
Even if I were to answer your stated questions, it would not make the program behave the way you'd like. It might work after a fashion, but it'd be like a bicycle with square wheels: not very practical, nor robust -- not even fun after you stop laughing at how silly it looks.
The solution, as I wrote in a comment to the original question, is to change the underlying design: from a simple dispatch to a thread pool approach.
Implementing a thread pool requires two things: First, is to change your viewpoint from starting a thread and having it perform a task, to each thread in the "pool" grabbing a task to perform, and returning to the "pool" after they have performed it. Understanding this is the hard part. The second part, implementing a way for each thread to grab a new task, is simple: this typically centers around a data structure, protected with locks of some sort. The exact data structure does depend on what the actual work to do is, however.
Let's assume you wanted to parallelize the calculation of the Mandelbrot set (or rather, the escape time, or the number of iterations needed before a point can be ruled to be outside the set; the Wikipedia page contains pseudocode for exactly this). This is one of the "embarrassingly parallel" problems; those where the sub-problems (here, each point) can be solved without any dependencies.
Here's how I'd do the core of the thread pool in this case. First, the escape time or iteration count needs to be recorded for each point. Let's say we use an unsigned int for this. We also need the number of points (it is a 2D array), a way to calculate the complex number that corresponds to each point, plus some way to know which points have either been computed, or are being computed. Plus mutually exclusive locking, so that only one thread will modify the data structure at once. So:
typedef struct {
int x_size, y_size;
size_t stride;
double r_0, i_0;
double r_dx, i_dx;
double r_dy, i_dy;
unsigned int *iterations;
sem_t done;
pthread_mutex_t mutex;
int x, y;
} fractal_work;
When an instance of fractal_work is constructed, x_size and y_size are the number of columns and rows in the iterations map. The number of iterations (or escape time) for point x,y is stored in iterations[x+y*stride]. The real part of the complex coordinate for that point is r_0 + x*r_dx + y*r_dy, and imaginary part is i_0 + x*i_dx + y*i_dy (which allows you to scale and rotate the fractal freely).
When a thread grabs the next available point, it first locks the mutex, and copies the x and y values (for itself to work on). Then, it increases x. If x >= x_size, it resets x to zero, and increases y. Finally, it unlocks the mutex, and calculates the escape time for that point.
However, if x == 0 && y >= y_size, the thread posts on the done semaphore and exits, letting the initial thread know that the fractal is complete. (The initial thread just needs to call sem_wait() once for each thread it created.)
The thread worker function is then something like the following:
void *fractal_worker(void *data)
{
fractal_work *const work = (fractal_work *)data;
int x, y;
while (1) {
pthread_mutex_lock(&(work->mutex));
/* No more work to do? */
if (work->x == 0 && work->y >= work->y_size) {
sem_post(&(work->done));
pthread_mutex_unlock(&(work->mutex));
return NULL;
}
/* Grab this task (point), advance to next. */
x = work->x;
y = work->y;
if (++(work->x) >= work->x_size) {
work->x = 0;
++(work->y);
}
pthread_mutex_unlock(&(work->mutex));
/* z.r = work->r_0 + (double)x * work->r_dx + (double)y * work->r_dy;
z.i = work->i_0 + (double)x * work->i_dx + (double)y * work->i_dy;
TODO: implement the fractal iteration,
and count the iterations (say, n)
save the escape time (number of iterations)
in the work->iterations array; e.g.
work->iterations[(size_t)x + work->stride*(size_t)y] = n;
*/
}
}
The program first creates the fractal_work data structure for the worker threads to work on, initializes it, then creates some number of threads giving each thread the address of that fractal_work structure. It can then call fractal_worker() itself too, to "join the thread pool". (This pool automatically "drains", i.e. threads will return/exit, when all points in the fractal are done.)
Finally, the main thread calls sem_wait() on the done semaphore, as many times as it created worker threads, to ensure all the work is done.
The exact fields in the fractal_work structure above do not matter. However, it is at the very core of the thread pool. Typically, there is at least one mutex or rwlock protecting the work details, so that each worker thread gets unique work details, as well as some kind of flag or condition variable or semaphore to let the original thread know that the task is now complete.
In a multithreaded server, there is usually only one instance of the structure (or variables) describing the work queue. It may even contain things like minimum and maximum number of threads, allowing the worker threads to control their own number to dynamically respond to the amount of work available. This sounds magical, but is actually simple to implement: when a thread has completed its work, or is woken up in the pool with no work, and is holding the mutex, it first examines how many queued jobs there are, and what the current number of worker threads is. If there are more than the minimum number of threads, and no work to do, the thread reduces the number of threads, and exits. If there are less than the maximum number of threads, and there is a lot of work to do, the thread first creates a new thread, then grabs the next task to work on. (Yes, any thread can create new threads into the process. They are all on equal footing, too.)
A lot of the code in a practical multithreaded application that uses one or more thread pools to do work is some sort of bookkeeping. The thread pool approach concentrates very much on the data, and on the computation that needs to be performed on the data. I'm sure there are much better examples of thread pools out there somewhere; the hard part is to think of a good task for the application to perform, as the data structures are so task-dependent, and many computations are so simple that parallelizing them makes no sense (since creating new threads does have a small computational cost, it'd be silly to waste time creating threads when a single thread does the same work in the same or less time).
Many tasks that benefit from parallelization, on the other hand, require information to be shared between workers, and that requires a lot of thinking to implement correctly. (For example, although solutions exist for parallelizing molecular dynamics simulations efficiently, most simulators still calculate and exchange data in separate steps, rather than at the same time. It's just that hard to do right, you see.)
All this means that you cannot expect to be able to write the code unless you understand the concept. Indeed, truly understanding the concepts is the hard part: writing the code is comparatively easy.
Even in the above example, there are certain tripping points: Does the order of posting the semaphore and releasing the mutex matter? (Well, it depends on what the thread that is waiting for the fractal to complete does -- and indeed, if it is waiting yet.) If it was a condition variable instead of a semaphore, it would be essential that the thread that is interested in the fractal completion is waiting on the condition variable, otherwise it would miss the signal/broadcast. (This is also why I used a semaphore.)
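For completeness, here is one way the TODO in fractal_worker() could be filled in, as a compact sketch rather than the definitive version. Assumptions and simplifications on my part: the structure is trimmed (no stride, no rotation terms), the region is fixed to [-2, 1] x [-1, 1], the names escape_time and render_mandelbrot are invented, and pthread_join() stands in for the done semaphore so the sketch does not depend on unnamed semaphores being available:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct {
    int x_size, y_size;
    double r_0, i_0;        /* complex coordinate of point (0, 0) */
    double r_dx, i_dy;      /* step per column and per row */
    unsigned max_iter;
    unsigned *iterations;   /* x_size * y_size results */
    pthread_mutex_t mutex;
    int x, y;               /* next point to hand out; protected by mutex */
} fractal_work;

/* Iterations of z = z*z + c until |z| > 2, capped at max_iter. */
static unsigned escape_time(double cr, double ci, unsigned max_iter)
{
    double zr = 0.0, zi = 0.0;
    unsigned n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        n++;
    }
    return n;
}

static void *fractal_worker(void *data)
{
    fractal_work *const work = data;
    for (;;) {
        pthread_mutex_lock(&work->mutex);
        if (work->y >= work->y_size) {          /* no more work to do */
            pthread_mutex_unlock(&work->mutex);
            return NULL;
        }
        int x = work->x, y = work->y;           /* grab this point */
        if (++work->x >= work->x_size) {        /* advance to the next */
            work->x = 0;
            ++work->y;
        }
        pthread_mutex_unlock(&work->mutex);

        /* The actual computation happens outside the lock. */
        work->iterations[(size_t)x + (size_t)work->x_size * (size_t)y] =
            escape_time(work->r_0 + x * work->r_dx,
                        work->i_0 + y * work->i_dy, work->max_iter);
    }
}

/* Fills out[x + y * x_size]; returns 0 on success, -1 on bad sizes. */
int render_mandelbrot(unsigned *out, int x_size, int y_size,
                      unsigned max_iter, int nthreads)
{
    if (x_size < 1 || y_size < 1)
        return -1;
    fractal_work work = {
        .x_size = x_size, .y_size = y_size,
        .r_0 = -2.0, .i_0 = -1.0,
        .r_dx = 3.0 / x_size, .i_dy = 2.0 / y_size,
        .max_iter = max_iter, .iterations = out, .x = 0, .y = 0
    };
    pthread_t tid[16];
    int started = 0;

    pthread_mutex_init(&work.mutex, NULL);
    if (nthreads > 16) nthreads = 16;
    for (int t = 0; t < nthreads; t++)
        if (pthread_create(&tid[started], NULL, fractal_worker, &work) == 0)
            started++;
    fractal_worker(&work);            /* the initial thread joins the pool */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);   /* the pool has drained */
    pthread_mutex_destroy(&work.mutex);
    return 0;
}
```

Handing out one point per lock acquisition keeps the sketch close to the description above; handing out a whole row at a time would cut the locking overhead considerably.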

4 Process 4 way synchronization using semaphores (In a C Programming, UNIX environment)

I have a question about synchronizing 4 processes in a UNIX environment. It is very important that no process runs their main functionality without first waiting for the others to "be on the same page", so to speak.
Specifically, they should all not go into their loops without first synchronizing with each other. How do I synchronize 4 processes in a 4 way situation, so that none of them get into their first while loop without first waiting for the others? Note that this is mainly a logic problem, not a coding problem.
To keep things consistent between environments let's just say we have a pseudocode semaphore library with the operations semaphore_create(int systemID), semaphore_open(int semaID), semaphore_wait(int semaID), and semaphore_signal(int semaID).
Here is my attempt and subsequent thoughts:
Process1.c:
int main() {
//Synchronization area (relevant stuff):
int sem1 = semaphore_create(123456); //123456 is an arbitrary ID for the semaphore.
int sem2 = semaphore_create(78901); //78901 is an arbitrary ID for the semaphore.
semaphore_signal(sem1);
semaphore_wait(sem2);
while(true) {
//...do main functionality of process, etc (not really relevant)...
}
}
Process2.c:
int main() {
//Synchronization area (relevant stuff):
int sem1 = semaphore_open(123456);
int sem2 = semaphore_open(78901);
semaphore_signal(sem1);
semaphore_wait(sem2);
while(true) {
//...do main functionality of process etc...
}
}
Process3.c:
int main() {
//Synchronization area (relevant stuff):
int sem1 = semaphore_open(123456);
int sem2 = semaphore_open(78901);
semaphore_signal(sem1);
semaphore_wait(sem2);
while(true) {
//...do main functionality of process etc...
}
}
Process4.c:
int main() {
//Synchronization area (relevant stuff):
int sem1 = semaphore_open(123456);
int sem2 = semaphore_open(78901);
semaphore_signal(sem2);
semaphore_signal(sem2);
semaphore_signal(sem2);
semaphore_wait(sem1);
semaphore_wait(sem1);
semaphore_wait(sem1);
while(true) {
//...do main functionality of process etc...
}
}
We run Process1 first, and it creates all of the semaphores in system memory that are used by the other processes (the other processes simply call semaphore_open to gain access to those semaphores). Then, all 4 processes perform a signal operation followed by a wait. The signal operation causes Process1, Process2, and Process3 to each increment the value of sem1 by 1, so its resulting maximum value is 3 (depending on the order in which the operating system decides to run these processes). Process1, 2, and 3 are then all waiting on sem2, and Process4 is waiting on sem1 as well. Process4 signals sem2 3 times to bring its value back up to 0, and waits on sem1 3 times. Since sem1 reached a maximum of 3 from the signalling in the other processes (again, depending on what order they ran in), this brings its value back down to 0, and Process4 continues running. Thus, all processes will be synchronized.
So yea, not super confident on my answer. I feel that it depends heavily on what order the processes ran in, which is the whole point of synchronization -- that it shouldn't matter what order they run in, they all synchronize correctly. Also, I am doing a lot of work in Process4. Maybe it would be better to solve this using more than 2 semaphores? Wouldn't this also allow for more flexibility within the loops in each process, if I want to do further synchronization?
My question: Please explain why the above logic will or will not work, and/or a solution on how to solve this problem of 4 way synchronization. I'd imagine this is a very common thing to have to think about depending on the industry (eg. banking and synching up bank accounts). I know it is not very difficult, but I have never worked with semaphores before, so I'm kind of confused on how they work.
The precise semantics of your model semaphore library are not clear enough to answer your question definitively. However, if the difference between semaphore_create() and semaphore_open() is that the latter requires the specified semaphore to already exist, whereas the former requires it to not exist, then yes, the whole thing will fall down if process1 does not manage to create the needed semaphores before any of the other processes attempt to open them. (Probably it falls down in different ways if other semantics hold.)
That sort of issue can be avoided in a threading scenario because with threads there is necessarily an initial single-threaded segment wherein the synchronization structures can be initialized. There is also shared memory by which the various threads can communicate with one another. The answer @Dark referred to depends on those characteristics.
The essential problem with a barrier for multiple independent processes -- or for threads that cannot communicate via shared memory and that are not initially synchronized -- is that you cannot know which process needs to erect the barrier. It follows that each one needs to be prepared to do so. That can work in your model library if semaphore_create() can indicate to the caller which result was achieved, one of
semaphore successfully created
semaphore already exists
(or error)
In that case, all participating processes (whose number you must know) can execute the same procedure, maybe something like this:
void process_barrier(int process_count) {
sem_t *sem1, *sem2, *sem3;
int result = semaphore_create(123456, &sem1);
int counter;
switch (result) {
case SEM_SUCCESS:
/* I am the controlling process */
/* Finish setting up the barrier */
semaphore_create(78901, &sem2);
semaphore_create(23432, &sem3);
/* let (n - 1) other processes enter the barrier... */
for (counter = 1; counter < process_count; counter += 1) {
semaphore_signal(sem1);
}
/* ... and wait for those (n - 1) processes to do so */
for (counter = 1; counter < process_count; counter += 1) {
semaphore_wait(sem2);
}
/* let all the (n - 1) waiting processes loose */
for (counter = 1; counter < process_count; counter += 1) {
semaphore_signal(sem3);
}
/* and I get to continue, too */
break;
case SEM_EXISTS_ERROR:
/* I am NOT the controlling process */
semaphore_open(123456, &sem1);
/* wait, if necessary, for the barrier to be initialized */
semaphore_wait(sem1);
semaphore_open(78901, &sem2);
semaphore_open(23432, &sem3);
/* signal the controlling process that I have reached the barrier */
semaphore_signal(sem2);
/* wait for the controlling process to allow me to continue */
semaphore_wait(sem3);
break;
}
}
Obviously, I have taken some minor liberties with your library interface, and I have omitted error checks except where they bear directly on the barrier's operation.
The three semaphores involved in that example serve distinct, well-defined purposes. sem1 guards the initialization of the synchronization constructs and allows the processes to choose which among them takes responsibility for controlling the barrier. sem2 serves to count how many processes have reached the barrier. sem3 blocks the non-controlling processes that have reached the barrier until the controlling process releases them all.

98th call to pthread_create() fails

I'm running the following program. It simply creates threads that die straight away.
What I have found is that after 93 to 98 (it varies slightly) successful calls, every next call to pthread_create() fails with error 11: Resource temporarily unavailable. I think I'm closing the thread correctly so it should give up on any resources it has but some resources become unavailable.
The first parameter of the program allows me to set the interval between calls to pthread_create() but testing with different values, I've learned that the interval does not matter (well, I'll get the error earlier): the number of successful calls will be the same.
The second parameter of the program allows me to set a sleep interval after a failed call but the length of the interval does not seem to make any difference.
Which ceiling am I hitting here?
EDIT: found the error in doSomething(): change lock to unlock and the program runs fine. The question remains: what resource is depleted with the error uncorrected?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <pthread.h>
#include <errno.h>
pthread_mutex_t doSomethingLock;
void milliSleep(unsigned int milliSeconds)
{
struct timespec ts;
ts.tv_sec = floorf(((float)milliSeconds / 1000));
ts.tv_nsec = ((((float)milliSeconds / 1000) - ts.tv_sec)) * 1000000000;
nanosleep(&ts, NULL);
}
void *doSomething(void *args)
{
pthread_detach(pthread_self());
pthread_mutex_lock(&doSomethingLock);
pthread_exit(NULL);
}
int main(int argc, char **argv)
{
pthread_t doSomethingThread;
pthread_mutexattr_t attr;
int threadsCreated = 0;
if (argc != 3)
{
fprintf(stderr, "usage: demo <interval between pthread_create() in ms> <time to wait after fail in ms>\n");
exit(1);
}
pthread_mutexattr_init(&attr);
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE_NP);
pthread_mutex_init(&doSomethingLock, &attr);
while (1)
{
pthread_mutex_lock(&doSomethingLock);
if (pthread_create(&doSomethingThread, NULL, doSomething, NULL) != 0)
{
fprintf(stderr, "%d pthread_create(): error %d, %m\n", threadsCreated, errno);
milliSleep(atoi(argv[2]));
}
else threadsCreated++;
milliSleep(atoi(argv[1]));
}
}
If you are on a 32 bit distro, you are probably hitting address space limits. The last I checked, glibc will allocate about 13MB for stack space in every thread created (this is just the size of the mapping, not allocated memory). With 98 threads, you will be pushing past a gigabyte of address space of the 3G available.
You can test this by freezing your process after the error (e.g. sleep(1000000) or whatever) and looking at its address space with pmap.
If that is the problem, then try setting a smaller stack size with pthread_attr_setstacksize() on the pthread_attr_t you pass to pthread_create. You will have to be the judge of your stack requirements obviously, but often even complicated code can run successfully in only a few kilobytes of stack.
Your program does not "create threads that simply die away". It does not do what you think it does.
First, pthread_mutex_unlock() only unlocks a pthread_mutex_t that has been locked by the same thread. This is how mutexes work: they can only be unlocked by the same thread that locked them. If you want the behaviour of a semaphore, use a semaphore.
Your example code creates a recursive mutex, which the doSomething() function tries to lock. Because it is held by the original thread, it blocks (waits for the mutex to become free in the pthread_mutex_lock() call). Because the original thread never releases the lock, you just pile up new threads on top of the doSomethingLock mutex.
Recursivity with respect to mutexes just means a thread can lock it more than once; it must unlock it the same number of times for the mutex to be actually released.
If you change the pthread_mutex_lock() in doSomething() to pthread_mutex_unlock(), then you're trying to unlock a mutex not held by that thread. The call fails, and then the threads do die immediately.
Assuming you fix your program, you'll next find that you cannot create more than a hundred or so threads (depending on your system and available RAM).
The reason is well explained by Andy Ross: the fixed size stacks (getrlimit(RLIMIT_STACK, (struct rlimit *)&info) tells you how much, unless you set it via thread attributes) eat up your available address space.
The original stack given to the process is resized automatically, but for all other threads, the stack size is fixed. By default, it is very large; on my system, 8388608 bytes (8 megabytes).
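If you want to check the numbers on your own system, a small sketch along these lines should do. The function names are mine; it assumes that a freshly initialized pthread_attr_t reports the default stack size, which is how glibc behaves.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/resource.h>

/* Stack size a thread gets when no size is set explicitly; 0 on failure.
 * Assumption: a freshly initialized attribute object reports the default. */
size_t default_thread_stack_size(void)
{
    pthread_attr_t attrs;
    size_t size = 0;

    if (pthread_attr_init(&attrs) != 0)
        return 0;
    if (pthread_attr_getstacksize(&attrs, &size) != 0)
        size = 0;
    pthread_attr_destroy(&attrs);
    return size;
}

/* Stores the soft RLIMIT_STACK limit in *out (it may be RLIM_INFINITY).
 * Returns 0 on success, -1 on failure. */
int stack_soft_limit(rlim_t *out)
{
    struct rlimit info;

    if (getrlimit(RLIMIT_STACK, &info) != 0)
        return -1;
    *out = info.rlim_cur;
    return 0;
}
```

Printing both values (cast to unsigned long long for printf) shows whether the thread default follows the resource limit on your system.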
I personally create threads with very small stacks, usually 65536 bytes, which is more than enough unless your functions use local arrays or large structures, or do insanely deep recursion:
#ifndef THREAD_STACK_SIZE
#define THREAD_STACK_SIZE 65536
#endif
pthread_attr_t attrs;
pthread_t thread[N];
int i, result;
/* Create a thread attribute for the desired stack size. */
pthread_attr_init(&attrs);
pthread_attr_setstacksize(&attrs, THREAD_STACK_SIZE);
/* Create any number of threads.
* The attributes are only a guide to pthread_create(),
* they are not "consumed" by the call. */
for (i = 0; i < N; i++) {
result = pthread_create(&thread[i], &attrs, some_func, (void *)(intptr_t)i); /* intptr_t is from <stdint.h> */
if (result) {
/* strerror(result) describes the error */
break;
}
}
/* You should destroy the attributes when you know
* you won't be creating any further threads anymore. */
pthread_attr_destroy(&attrs);
The minimum stack size should be available as PTHREAD_STACK_MIN, and the size you request should be a multiple of sysconf(_SC_PAGESIZE). Currently PTHREAD_STACK_MIN == 16384, but I recommend using a larger power of two. (Page sizes are always powers of two on any binary architecture.)
It is only the minimum, and the pthread library is free to use any larger value it sees fit, but in practice the stack size seems to be what you set it to, plus a fixed value depending on the architecture, kernel, and the pthread library version. Using a compile-time constant works well for almost all cases, but if your application is complex enough to have a configuration file, it might be a good idea to let the user override the compile-time constant if they want to, in the config file.

Linux - force single-core execution and debug multi-threading with pthread

I'm debugging a multi-threaded problem with C, pthread and Linux. On my MacOS 10.5.8, C2D, it runs fine; on my Linux computers (2-4 cores) it produces undesired outputs.
I'm not experienced, therefore I attached my code. It's rather simple: each new thread creates two more threads until a maximum is reached. So no big deal... as I thought until a couple of days ago.
Can I force single-core execution to prevent my bugs from occurring?
I profiled the program execution, instrumenting with Valgrind:
valgrind --tool=drd --read-var-info=yes --trace-mutex=no ./threads
I get a couple of conflicts in the BSS segment, which are caused by my global structs and thread counter variables. However, I hope I could mitigate these conflicts with forced single-core execution, because I think the concurrent scheduling on my 2-4 core test systems is responsible for my errors.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define MAX_THR 12
#define NEW_THR 2
int wait_time = 0; // log global wait time
int num_threads = 0; // how many threads there are
pthread_t threads[MAX_THR]; // global array to collect threads
pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER; // sync
struct thread_data
{
int nr; // nr of thread, serves as id
int time; // wait time from rand()
};
struct thread_data thread_data_array[MAX_THR+1];
void
*PrintHello(void *threadarg)
{
if(num_threads < MAX_THR){
// using the argument
pthread_mutex_lock(&mut);
struct thread_data *my_data;
my_data = (struct thread_data *) threadarg;
// updates
my_data->nr = num_threads;
my_data->time= rand() % 10 + 1;
printf("Hello World! It's me, thread #%d and sleep time is %d!\n",
my_data->nr,
my_data->time);
pthread_mutex_unlock(&mut);
// counter
long t = 0;
for(t = 0; t < NEW_THR; t++){
pthread_mutex_lock(&mut);
num_threads++;
wait_time += my_data->time;
pthread_mutex_unlock(&mut);
pthread_create(&threads[num_threads], NULL, PrintHello, &thread_data_array[num_threads]);
sleep(1);
}
printf("Bye from %d thread\n", my_data->nr);
pthread_exit(NULL);
}
return 0;
}
int
main (int argc, char *argv[])
{
long t = 0;
// srand(time(NULL));
if(num_threads < MAX_THR){
for(t = 0; t < NEW_THR; t++){
// -> 2 threads entry point
pthread_mutex_lock(&mut);
// rand time
thread_data_array[num_threads].time = rand() % 10 + 1;
// update global wait time variable
wait_time += thread_data_array[num_threads].time;
num_threads++;
pthread_mutex_unlock(&mut);
pthread_create(&threads[num_threads], NULL, PrintHello, &thread_data_array[num_threads]);
pthread_mutex_lock(&mut);
printf("In main: creating initial thread #%ld\n", t);
pthread_mutex_unlock(&mut);
}
}
for(t = 0; t < MAX_THR; t++){
pthread_join(threads[t], NULL);
}
printf("Bye from program, wait was %d\n", wait_time);
pthread_exit(NULL);
}
I hope that code isn't too bad. I didn't do too much C for a rather long time. :) The problem is:
printf("Bye from %d thread\n", my_data->nr);
my_data->nr sometimes resolves to garbage integer values:
In main: creating initial thread #0
Hello World! It's me, thread #2 and sleep time is 8!
In main: creating initial thread #1
[...]
Hello World! It's me, thread #11 and sleep time is 8!
Bye from 9 thread
Bye from 5 thread
Bye from -1376900240 thread
[...]
I don't know any more ways to profile and debug this.
If I debug this, it works - sometimes. Sometimes it doesn't :(
Thanks for reading this long question. :) I hope I didn't share too much of my currently unresolvable confusion.
Since this program seems to be just an exercise in using threads, with no actual goal, it is difficult to suggest how to treat your problem rather than the symptom. I believe you can actually pin a process or thread to a processor in Linux, but doing so for all threads removes most of the benefit of using threads, and I don't actually remember how to do it. Instead I'm going to talk about some things wrong with your program.
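For completeness, pinning on Linux can be done with sched_setaffinity(). The sketch below is Linux-specific and needs _GNU_SOURCE; pin_to_single_cpu is a name I made up. It restricts the calling thread to the first CPU it is currently allowed to run on, and threads created afterwards inherit that mask, so it would have to be called at the top of main() before any pthread_create(). Note that this only makes the races less likely to show up; it does not remove them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for sched_setaffinity() and the CPU_* macros */
#endif
#include <sched.h>

/* Restrict the calling thread to the first CPU it may currently use.
 * Threads created afterwards inherit the mask.
 * Returns the CPU number on success, -1 on failure. Linux only. */
int pin_to_single_cpu(void)
{
    cpu_set_t allowed, single;
    int cpu;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_ZERO(&single);
            CPU_SET(cpu, &single);
            if (sched_setaffinity(0, sizeof(single), &single) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;   /* no allowed CPU found */
}
```

Starting with the currently allowed set, rather than hard-coding CPU 0, keeps this working inside containers or cpusets where CPU 0 may not be available.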
C compilers often make a lot of assumptions when they are doing optimizations. One of the assumptions is that a variable does not change unless the code currently being examined looks like it might change it (this is a very rough approximation; a more accurate explanation would take a very long time).
In this program you have variables which are shared and changed by different threads. If a variable is only read by threads (either const, or effectively const once the threads that look at it have been created), then you don't have much to worry about (and in "read by threads" I'm including the original main thread). Since the variable doesn't change, it makes no difference whether the compiler generates code to read it only once (remembering it in a local temporary variable) or to read it over and over: the value is always the same, so calculations based on it always come out the same.
To force the compiler not to do this you can use the volatile keyword. It is affixed to variable declarations just like the const keyword, and tells the compiler that the value of that variable can change at any instant, so it must reread it every time its value is needed, and rewrite it every time a new value is assigned.
NOTE that for pthread_mutex_t (and similar) variables you do not need volatile. If it were needed on the type(s) that make up pthread_mutex_t on your system, volatile would have been used within the definition of pthread_mutex_t. Additionally, the functions that access this type take its address and are specially written to do the right thing.
I'm sure now you are thinking that you know how to fix your program, but it is not that simple. You are doing math on a shared variable. Doing math on a variable using code like:
x = x + 1;
requires that you know the old value to generate the new value. If x is global then you have to conceptually load x into a register, add 1 to that register, and then store that value back into x. On a RISC processor you actually have to do all 3 of those instructions, and being 3 instructions I'm sure you can see how another thread accessing the same variable at nearly the same time could end up storing a new value in x just after we have read our value -- making our value old, so our calculation and the value we store will be wrong.
If you know any x86 assembly then you probably know that it has instructions that can do math on values in RAM (both getting from and storing the result in the same location in RAM all in one instruction). You might think that this instruction could be used for this operation on x86 systems, and you would almost be right. The problem is that this instruction is still executed in the steps that the RISC instruction would be executed in, and there are several opportunities for another processor to change this variable at the same time as we are doing our math on it. To get around this on x86 there is a lock prefix that may be applied to some x86 instructions, and I believe that glibc header files include atomic macro functions to do this on architectures that can support it, but this can't be done on all architectures.
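On compilers that support C11, the portable way to get such an atomic read-modify-write is <stdatomic.h> rather than glibc-internal macros. This is a swapped-in technique, not what the paragraph above refers to, and the names below (shared_counter, bump, run_atomic_counter) and the WORKERS/INCREMENTS values are arbitrary. A minimal sketch:

```c
#include <pthread.h>
#include <stdatomic.h>

#define WORKERS    4        /* arbitrary values for the demonstration */
#define INCREMENTS 100000

static atomic_int shared_counter;

/* Each worker adds 1 to the shared counter INCREMENTS times. */
static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&shared_counter, 1);  /* indivisible load + add + store */
    return NULL;
}

/* Runs the workers and returns the final counter value. */
int run_atomic_counter(void)
{
    pthread_t workers[WORKERS];
    int started = 0;

    atomic_store(&shared_counter, 0);
    for (int i = 0; i < WORKERS; i++) {
        if (pthread_create(&workers[i], NULL, bump, NULL) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(workers[i], NULL);

    return atomic_load(&shared_counter);
}
```

With a plain int and x = x + 1 in place of atomic_fetch_add(), the final value would typically fall short of WORKERS * INCREMENTS on a multi-core machine, because updates get lost exactly as described above.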
To work right on all architectures you are going to need to do something like this:
int local_thread_count;
int create_a_thread;
pthread_mutex_lock(&count_lock);
local_thread_count = num_threads;
if (local_thread_count < MAX_THR) {
num_threads = local_thread_count + 1;
pthread_mutex_unlock(&count_lock);
thread_data_array[local_thread_count].nr = local_thread_count;
/* moved this into the creator
* since getting it in the
* child will likely get the
* wrong value. */
pthread_create(&threads[local_thread_count], NULL, PrintHello,
&thread_data_array[local_thread_count]);
} else {
pthread_mutex_unlock(&count_lock);
}
Now, because the test and the increment both happen while count_lock is held, every thread tests and increments the thread count atomically (it is the mutex that guarantees this; marking num_threads volatile alone would not). At the end of this, local_thread_count should be usable as an index into the array of threads. Note that I created only one thread in this code, while yours was supposed to create several. I did this to make the example clearer; it should not be too difficult to change it to add NEW_THR to num_threads instead, but if NEW_THR is 2 and MAX_THR - num_threads is 1 (somehow), you have to handle that case correctly.
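Here is a self-contained variation of the same idea that you can run to convince yourself that no index is handed out twice. count_lock, num_threads and MAX_THR mirror the fragment above; claim_thread_slot, claimant, times_claimed and the number of racing threads are my own additions for the test.

```c
#include <pthread.h>

#define MAX_THR 12
#define RACERS  4    /* arbitrary number of competing threads */

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int num_threads = 0;
static int times_claimed[MAX_THR];  /* each entry is touched only by its claimant */

/* Test and increment under the lock.
 * Returns the claimed index, or -1 when all slots are taken. */
int claim_thread_slot(void)
{
    int slot = -1;

    pthread_mutex_lock(&count_lock);
    if (num_threads < MAX_THR)
        slot = num_threads++;
    pthread_mutex_unlock(&count_lock);
    return slot;
}

/* Keeps claiming slots until none are left. */
static void *claimant(void *arg)
{
    int slot;

    (void)arg;
    while ((slot = claim_thread_slot()) >= 0)
        times_claimed[slot]++;
    return NULL;
}

/* Returns 1 if every slot was claimed exactly once, 0 otherwise. */
int every_slot_claimed_once(void)
{
    pthread_t racers[RACERS];
    int started = 0, i;

    for (i = 0; i < RACERS; i++) {
        if (pthread_create(&racers[i], NULL, claimant, NULL) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(racers[i], NULL);

    for (i = 0; i < MAX_THR; i++)
        if (times_claimed[i] != 1)
            return 0;
    return 1;
}
```

No volatile is needed here: locking and unlocking the mutex already make the updates of num_threads visible to the other threads, and the results are only read after pthread_join().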
Now, all of that being said, there may be another way to accomplish similar things by using semaphores. Semaphores are like mutexes, but they have a count associated with them. You would not get a value to use as the index into the array of threads (the function to read a semaphore count won't really give you this), but I thought that it deserved to be mentioned since it is very similar.
man 7 sem_overview
will tell you a little bit about it.
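As a rough sketch of the semaphore variant: initialize an unnamed POSIX semaphore to MAX_THR and let every would-be creator take one unit before calling pthread_create(). The function names are hypothetical, and, as said, this only limits the number of threads; it does not hand out an array index.

```c
#include <errno.h>
#include <semaphore.h>

#define MAX_THR 12

static sem_t thread_slots;   /* counts how many threads may still be created */

/* Returns 0 on success, -1 on failure (errno is set by sem_init). */
int slots_init(void)
{
    return sem_init(&thread_slots, 0, MAX_THR);
}

/* Take one slot without blocking.
 * Returns 1 if a slot was taken, 0 if none are left. */
int slot_try_claim(void)
{
    while (sem_trywait(&thread_slots) != 0) {
        if (errno != EINTR)
            return 0;        /* EAGAIN: the count is already zero */
    }
    return 1;
}

/* Give a slot back, e.g. when a thread finishes. */
void slot_release(void)
{
    sem_post(&thread_slots);
}
```

A creator would call slot_try_claim() and only call pthread_create() when it returns 1, calling slot_release() again if the creation fails.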
num_threads should at least be marked volatile, and preferably atomic too (although I believe that a plain int is fine in practice), so that there is at least a higher chance that the different threads see the same values. You might want to look at the assembler output to see where the writes of num_threads to memory actually take place.
https://computing.llnl.gov/tutorials/pthreads/#PassingArguments
That seems to be the problem: you need to malloc the thread_data struct.
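A sketch of what that could look like, assuming the same struct thread_data as in the question. start_worker and worker are hypothetical names, and the value the worker returns only serves to show that it saw the data it was given.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct thread_data {
    int nr;     /* id of the thread */
    int time;   /* wait time */
};

/* The thread owns its argument and frees it when done. */
static void *worker(void *threadarg)
{
    struct thread_data *my_data = threadarg;
    intptr_t summary = (intptr_t)my_data->nr * 100 + my_data->time;

    free(my_data);
    return (void *)summary;   /* handed back through pthread_join() */
}

/* Allocate and fill a private thread_data, then start the thread.
 * Returns 0 on success, otherwise an error code. */
int start_worker(pthread_t *thread, int nr, int sleep_time)
{
    struct thread_data *data = malloc(sizeof *data);
    int err;

    if (data == NULL)
        return ENOMEM;
    data->nr = nr;            /* filled in by the creator, before the thread runs */
    data->time = sleep_time;

    err = pthread_create(thread, NULL, worker, data);
    if (err != 0)
        free(data);           /* the thread never started, so we still own it */
    return err;
}
```

Because the creator fills in nr before the thread exists and nobody else holds a pointer to that block, the thread cannot see a half-initialized or shared struct.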

Resources