Unsure why this is generating so many threads - c

I'm running some tests on a short program I've written. It runs another program which performs some file operations based on the inputs I give it. The whole purpose of this program is to break a large packet of work into smaller packets to increase performance (sending 10 smaller packets to 10 versions of the program instead of waiting for the one larger one to execute, simple divide and conquer).
The problem lies in the fact that, while I believe I have limited the number of threads that will be created, the testing messages I have set up indicate that there are many more threads running than there should be. I'm really uncertain what I did wrong here.
Code snippet:
if (finish != start){
if (sizeOfBlock != 0){
num_threads = (finish - start)/sizeOfBlock + 1;
}
else{
num_threads = (finish-start) + 1;
}
if (num_threads > 10){ // this should limit threads to 10 at most
num_threads == 10;
}
else if (finish == start){
num_threads = 1;
}
}
threads = (pthread_t *) malloc(num_threads * sizeof(pthread_t));
for (i = 0; i < num_threads; i++){
printf("Creating thread %d\n", i);
s = pthread_create(&threads[i], NULL, thread, &MaxNum);
if (s != 0)
printf("error in pthread_create\n");
if (s==0)
activethreads++;
}
while (activethreads > 0){
//printf("active threads: %d\n", activethreads);
}
pthread_exit(0);

This code is useless:
if (num_threads > 10){ // this should limit threads to 10 at most
num_threads == 10
}
num_threads == 10 compares num_threads to 10 and then throws that away. You want assignment instead:
if (num_threads > 10){ // this should limit threads to 10 at most
num_threads = 10;
}
Also, there are numerous ; missing in your code, in the future, please try to provide a self contained example of code that compiles.

Related

How to create n threads, each one creates n - 1 threads

I am working with a large project and I am trying to create a test that does the following thing: first, create 5 threads. Each one of this threads will create 4 threads, which in turn each one creates 3 other threads. All this happens until 0 threads.
I have _ThreadInit() function used to create a thread:
status = _ThreadInit(mainThreadName, ThreadPriorityDefault, &pThread, FALSE);
where the 3rd parameter is the output(the thread created).
What am I trying to do is start from a number of threads that have to be created n = 5, the following way:
for(int i = 0; i < n; i++){
// here call the _ThreadInit method which creates a thread
}
I get stuck here. Please, someone help me understand how it should be done. Thanks^^
Building on Eugene Sh.'s comment, you could create a function that takes a parameter, which is the number of threads to create, that calls itself recursively.
Example using standard C threads:
#include <stdbool.h>
#include <stdio.h>
#include <threads.h>
int MyCoolThread(void *arg) {
int num = *((int*)arg); // cast void* to int* and dereference
printf("got %d\n", num);
if(num > 0) { // should we start any threads at all?
thrd_t pool[num];
int next_num = num - 1; // how many threads the started threads should start
for(int t = 0; t < num; ++t) { // loop and create threads
// Below, MyCoolThread creates a thread that executes MyCoolThread:
if(thrd_create(&pool[t], MyCoolThread, &next_num) != thrd_success) {
// We failed to create a thread, set `num` to the number of
// threads we actually created and break out.
num = t;
break;
}
}
int result;
for(int t = 0; t < num; ++t) { // join all the started threads
thrd_join(pool[t], &result);
}
}
return 0;
}
int main() {
int num = 5;
MyCoolThread(&num); // fire it up
}
Statistics from running:
1 thread got 5
5 threads got 4
20 threads got 3
60 threads got 2
120 threads got 1
120 threads got 0

Synchronization of p_threads using globals

I'm pretty new to C, so I'm not sure where even to start digging about my problem.
I'm trying to port python number-crunching algos to C, and since there's no GIL in C (woohoo), I can change whatever I want in memory from threads, for as long as I make sure there are no races.
I did my homework on mutexes, however, I cannot wrap my head around use of mutexes in case of continuously running threads accessing the same array over and over.
I'm using p_threads in order to split the workload on a big array a[N].
Number crunching algorithm on array a[N] is additive, so I'm splitting it using a_diff[N_THREADS][N] array, writing changes to be applied to a[N] array from each thread to a_diff[N_THREADS][N] and then merging them all together after each step.
I need to run the crunching on different versions of array a[N], so I pass them via global pointer p (in the MWE, there's only one a[N])
I'm synchronizing threads using another global array SYNC_THREADS[N_THREADS] and make sure threads quit when I need them to by setting END_THREADS global (I know, I'm using too many globals - I don't care, code is ~200 lines). My question is in regard to this synchronization technique - is it safe to do so and what is cleaner/better/faster way to achieve that?
MWEe:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define N_THREADS 3
#define N 10000000
#define STEPS 3
double a[N]; // main array
double a_diff[N_THREADS][N]; // diffs array
double params[N]; // parameter used for number-crunching
double (*p)[N]; // pointer to array[N]
// structure for bounds for crunching the array
struct bounds {
int lo;
int hi;
int thread_num;
};
struct bounds B[N_THREADS];
int SYNC_THREADS[N_THREADS]; // for syncing threads
int END_THREADS = 0; // signal to terminate threads
static void *crunching(void *arg) {
// multiple threads run number-crunching operations according to assigned low/high bounds
struct bounds *data = (struct bounds *)arg;
int lo = (*data).lo;
int hi = (*data).hi;
int thread_num = (*data).thread_num;
printf("worker %d started for bounds [%d %d] \n", thread_num, lo, hi);
int i;
while (END_THREADS != 1) { // END_THREADS tells threads to terminate
if (SYNC_THREADS[thread_num] == 1) { // SYNC_THREADS allows threads to start number-crunching
printf("worker %d working... \n", thread_num );
for (i = lo; i <= hi; ++i) {
a_diff[thread_num][i] += (*p)[i] * params[i]; // pretend this is an expensive operation...
}
SYNC_THREADS[thread_num] = 0; // thread disables itself until SYNC_THREADS is back to 1
printf("worker %d stopped... \n", thread_num );
}
}
return 0;
}
int i, j, th,s;
double joiner;
int main() {
// pre-fill arrays
for (i = 0; i < N; ++i) {
a[i] = i + 0.5;
params[i] = 0.0;
}
// split workload between workers
int worker_length = N / N_THREADS;
for (i = 0; i < N_THREADS; ++i) {
B[i].thread_num = i;
B[i].lo = i * worker_length;
if (i == N_THREADS - 1) {
B[i].hi = N;
} else {
B[i].hi = i * worker_length + worker_length - 1;
}
}
// pointer to parameters to be passed to worker
struct bounds **data = malloc(N_THREADS * sizeof(struct bounds*));
for (i = 0; i < N_THREADS; i++) {
data[i] = malloc(sizeof(struct bounds));
data[i]->lo = B[i].lo;
data[i]->hi = B[i].hi;
data[i]->thread_num = B[i].thread_num;
}
// create thread objects
pthread_t threads[N_THREADS];
// disallow threads to crunch numbers
for (th = 0; th < N_THREADS; ++th) {
SYNC_THREADS[th] = 0;
}
// launch workers
for(th = 0; th < N_THREADS; th++) {
pthread_create(&threads[th], NULL, crunching, data[th]);
}
// big loop of iterations
for (s = 0; s < STEPS; ++s) {
for (i = 0; i < N; ++i) {
params[i] += 1.0; // adjust parameters
// zero diff array
for (i = 0; i < N; ++i) {
for (th = 0; th < N_THREADS; ++th) {
a_diff[th][i] = 0.0;
}
}
p = &a; // pointer to array a
// allow threads to process numbers and wait for threads to complete
for (th = 0; th < N_THREADS; ++th) { SYNC_THREADS[th] = 1; }
// ...here threads started by pthread_create do calculations...
for (th = 0; th < N_THREADS; th++) { while (SYNC_THREADS[th] != 0) {} }
// join results from threads (number-crunching is additive)
for (i = 0; i < N; ++i) {
joiner = 0.0;
for (th = 0; th < N_THREADS; ++th) {
joiner += a_diff[th][i];
}
a[i] += joiner;
}
}
}
// join workers
END_THREADS = 1;
for(th = 0; th < N_THREADS; th++) {
pthread_join(threads[th], NULL);
}
return 0;
}
I see that workers don't overlap time-wise:
worker 0 started for bounds [0 3333332]
worker 1 started for bounds [3333333 6666665]
worker 2 started for bounds [6666666 10000000]
worker 0 working...
worker 1 working...
worker 2 working...
worker 2 stopped...
worker 0 stopped...
worker 1 stopped...
worker 2 working...
worker 0 working...
worker 1 working...
worker 1 stopped...
worker 0 stopped...
worker 2 stopped...
worker 2 working...
worker 0 working...
worker 1 working...
worker 1 stopped...
worker 2 stopped...
worker 0 stopped...
Process returned 0 (0x0) execution time : 1.505 s
and I make sure worker's don't get into each other workspaces by separating them via a_diff[thead_num][N] sub-arrays, however, I'm not sure that's always the case and that I'm not introducing hidden races somewhere...
I didn't realize what was the question :-)
So, the question is if you're thinking well with your SYNC_THREADS and END_THREADS synchronization mechanism.
Yes!... Almost. The problem is that threads are burning CPU while waiting.
Conditional Variables
To make threads wait for an event you have conditional variables (pthread_cond). These offer a few useful functions like wait(), signal() and broadcast():
wait(&cond, &m) blocks a thread in a given condition variable. [note 2]
signal(&cond) unlocks a thread waiting in a given condition variable.
broadcast(&cond) unlocks all threads waiting in a given condition variable.
Initially you'd have all the threads waiting [note 1]:
while(!start_threads)
pthread_cond_wait(&cond_start);
And, when the main thread is ready:
start_threads = 1;
pthread_cond_broadcast(&cond_start);
Barriers
If you have data dependencies between iterations, you'd want to make sure threads are executing the same iteration at any given moment.
To synchronize threads at the end of each iteration, you'll want to take a look at barriers (pthread_barrier):
pthread_barrier_init(count): initializes a barrier to synchronize count threads.
pthread_barrier_wait(): thread waits here until all count threads reach the barrier.
Extending functionality of barriers
Sometimes you'll want the last thread reaching a barrier to compute something (e.g. to increment the counter of number of iterations, or to compute some global value, or to check if execution should stop). You have two alternatives
Using pthread_barriers
You'll need to essentially have two barriers:
int rc = pthread_barrier_wait(&b);
if(rc != 0 && rc != PTHREAD_BARRIER_SERIAL_THREAD)
if(shouldStop()) stop = 1;
pthread_barrier_wait(&b);
if(stop) return;
Using pthread_conds to implement our own specialized barrier
pthread_mutex_lock(&mutex)
remainingThreads--;
// all threads execute this
executedByAllThreads();
if(remainingThreads == 0) {
// reinitialize barrier
remainingThreads = N;
// only last thread executes this
if(shouldStop()) stop = 1;
pthread_cond_broadcast(&cond);
} else {
while(remainingThreads > 0)
pthread_cond_wait(&cond, &mutex);
}
pthread_mutex_unlock(&mutex);
Note 1: why is pthread_cond_wait() inside a while block? May seem a bit odd. The reason behind it is due to the existence of spurious wakeups. The function may return even if no signal() or broadcast() was issued. So, just to guarantee correctness, it's usual to have an extra variable to guarantee that if a thread suddenly wakes up before it should, it runs back into the pthread_cond_wait().
From the manual:
When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.
(...)
If a signal is delivered to a thread waiting for a condition variable, upon return from the signal handler the thread resumes waiting for the condition variable as if it was not interrupted, or it shall return zero due to spurious wakeup.
Note 2:
An noted by Michael Burr in the comments, you should hold a companion lock whenever you modify the predicate (start_threads) and pthread_cond_wait(). pthread_cond_wait() will release the mutex when called; and re-acquires it when it returns.
PS: It's a bit late here; sorry if my text is confusing :-)

Squaring numbers w/ multiple threads

I'm trying to write a program that squares numbers 1-10,000 by creating 8 threads and each thread will take turns squaring ONE NUMBER EACH. Meaning that one thread will square 1, another will square 2, etc until all threads square a number. Then one thread will square 9, etc, all the way to 10,000. My code is below:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <pthread.h>
#include <sys/types.h>
#define NUMBER_OF_THREADS 8
#define START_NUMBER 1
#define END_NUMBER 10000
FILE *f;
void *sqrtfunc(void *tid) { //function for computing squares
int i;
for (i = START_NUMBER; i<= END_NUMBER; i++){
if ((i % NUMBER_OF_THREADS) == pthread_self()){ //if i%8 == thread id
fprintf(f, "%lu squared = %lu\n", i, i*i); //then that thread should do this
}
}
}
int main(){
//Do not modify starting here
struct timeval start_time, end_time;
gettimeofday(&start_time, 0);
long unsigned i;
f = fopen("./squared_numbers.txt", "w");
//Do not modify ending here
pthread_t mythreads[NUMBER_OF_THREADS]; //thread variable
long mystatus;
for (i = 0; i < NUMBER_OF_THREADS; i++){ //loop to create 8 threads
mystatus = pthread_create(&mythreads[i], NULL, sqrtfunc, (void *)i);
if (mystatus != 0){ //check if pthread_create worked
printf("pthread_create failed\n");
exit(-1);
}
}
for (i = 0; i < NUMBER_OF_THREADS; i++){
if(pthread_join(mythreads[i], NULL)){
printf("Thread failed\n");
}
}
exit(1);
//Do not modify starting here
fclose(f);
gettimeofday(&end_time, 0);
float elapsed = (end_time.tv_sec-start_time.tv_sec) * 1000.0f + \
(end_time.tv_usec-start_time.tv_usec) / 1000.0f;
printf("took %0.2f milliseconds\n", elapsed);
//Do not modify ending here
}
I am not sure where my error is. I create my 8 threads in main, and then depending on their thread id (tid), I want that thread to square a number. As of right now, nothing is being printed into the output file and I can't figure out why. Is my tid comparison not doing anything? Any tips are appreciated. Thanks guys.
First, you intentionally pass a parameter to each thread so it know which thread it is (from 0 to 7) That is good, but you then don't use it anymore inside the thread (this leads to one of the possible confussions you have)
Second, as you say in the explanation of how the algorithm should go, you say each thread must square a different set of numbers, but all of them do square the same set of numbers (indeed the whole set of numbers)
You have two approaches to this: Let each thread square the number, and go for the next, eight places further (so the algorithm is the one described in your explanation) or you give different sets (each 1250 consecutive numbers) and let each thread act on is own separate interval.
That said, you have to reconstruct your for loop to do one of two:
for (i = parameter; i < MAX; i += 8) ...
or
for (i = 1250*parameter; i < 1250*(parameter+1); i++) ...
that way, you'll get each thread run with a different set of input numbers.

user time increase in multi-cpu job

I am running the following code:
when I run this code with 1 child process: i get the following timing information:
(I run using /usr/bin/time ./job 1)
5.489u 0.090s 0:05.58 99.8% (1 job running)
when I run with 6 children processes: i get following
74.731u 0.692s 0:12.59 599.0% (6 jobs running in parallel)
The machine I am running the experiment on has 6 cores, 198 GB of RAM and nothing else is running on that machine.
I was expecting the user time reporting to be 6 times in case of 6 jobs running in parallel. But it is much more than that (13.6 times). My questions is from where this increase in user time comes from? Is it because multiple cores are jumping from one memory location to another more frequently in case of 6 jobs running in parallel? Or there is something else I am missing.
Thanks
#define MAX_SIZE 7000000
#define LOOP_COUNTER 100
#define simple_struct struct _simple_struct
simple_struct {
int n;
simple_struct *next;
};
#define ALLOCATION_SPLIT 5
#define CHAIN_LENGTH 1
void do_function3(void)
{
int i = 0, j = 0, k = 0, l = 0;
simple_struct **big_array = NULL;
simple_struct *temp = NULL;
big_array = calloc(MAX_SIZE + 1, sizeof(simple_struct*));
for(k = 0; k < ALLOCATION_SPLIT; k ++) {
for(i =k ; i < MAX_SIZE; i +=ALLOCATION_SPLIT) {
big_array[i] = calloc(1, sizeof(simple_struct));
if((CHAIN_LENGTH-1)) {
for(l = 1; l < CHAIN_LENGTH; l++) {
temp = calloc(1, sizeof(simple_struct));
temp->next = big_array[i];
big_array[i] = temp;
}
}
}
}
for (j = 0; j < LOOP_COUNTER; j++) {
for(i=0 ; i < MAX_SIZE; i++) {
if(big_array[i] == NULL) {
big_array[i] = calloc(1, sizeof(simple_struct));
}
big_array[i]->n = i * 13;
temp = big_array[i]->next;
while(temp) {
temp->n = i*13;
temp = temp->next;
}
}
}
}
int main(int argc, char **argv)
{
int i, no_of_processes = 0;
pid_t pid, wpid;
int child_done = 0;
int status;
if(argc != 2) {
printf("usage: this_binary number_of_processes");
return 0;
}
no_of_processes = atoi(argv[1]);
for(i = 0; i < no_of_processes; i ++) {
pid = fork();
switch(pid) {
case -1:
printf("error forking");
exit(-1);
case 0:
do_function3();
return 0;
default:
printf("\nchild %d launched with pid %d\n", i, pid);
break;
}
}
while(child_done != no_of_processes) {
wpid = wait(&status);
child_done++;
printf("\nchild done with pid %d\n", wpid);
}
return 0;
}
Firstly, your benchmark is a bit unusual. Normally, when benchmarking concurrent applications, one would compare two implementations:
A single thread version solving a problem of size S;
A multi-thread version with N threads, solving cooperatively the problem of size S; in your case, each solving a problem of size S/N.
Then you divide the execution times to obtain the speedup.
If your speedup is:
Around 1: the parallel implementation has similar performance as the single thread implementation;
Higher than 1 (usually between 1 and N), parallelizing the application increases performance;
Lower than 1: parallelizing the application hurts performance.
The effect on performance depends on a variety of factors:
How well your algorithm can be parallelized. See Amdahl's law. Does not apply here.
Overhead in inter-thread communication. Does not apply here.
Overhead in inter-thread synchronization. Does not apply here.
Contention for CPU resources. Should not apply here (since the number of threads is equal to the number of cores). However HyperThreading might hurt.
Contention for memory caches. Since the threads do not share memory, this will decrease performance.
Contention for accesses to main memory. This will decrease performance.
You can measure the last 2 with a profiler. Look for cache misses and stalled instructions.

Simulating round robin with pthreads

The task is to have 5 threads present at the same time, and the user assigns each a burst time. Then a round robin algorithm with a quantum of 2 is used to schedule the threads. For example, if I run the program with
$ ./m 1 2 3 4 5
The output should be
A 1
B 2
C 2
D 2
E 2
C 1
D 2
E 2
E 1
But for now my output shows only
A 1
B 2
C 2
Since the program errs where one thread does not end for the time being, I think the problem is that this thread cannot unlock to let the next thread grab the lock. My sleep() does not work, either. But I have no idea how to modify my code in order to fix them. My code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
double times[5];
char process[] = {'A', 'B', 'C', 'D', 'E'};
int turn = 0;
void StartNext(int tid) //choose the next thread to run
{
int i;
for(i = (tid + 1) % 5; times[i] == 0; i = (i + 1) % 5)
if(i == tid) //if every thread has finished
return;
turn = i;
}
void *Run(void *tid) //the thread function
{
int i = (int)tid;
while(times[i] != 0)
{
while(turn != i); //busy waiting till it is its turn
if(times[i] > 2)
{
printf("%c 2\n", process[i]);
sleep(2); //sleep is to simulate the actual running time
times[i] -= 2;
}
else if(times[i] > 0 && times[i] <= 2) //this thread will have finished after this turn
{
printf("%c %lf\n", process[i], times[i]);
sleep(times[i]);
times[i] = 0;
}
StartNext(i); //choose the next thread to run
}
pthread_exit(0);
}
int main(int argc, char **argv)
{
pthread_t threads[5];
int i, status;
if(argc == 6)
{
for(i = 0; i < 5; i++)
times[i] = atof(argv[i + 1]); //input the burst time of each thread
for(i = 0; i < 5; i++)
{
status = pthread_create(&threads[i], NULL, Run, (void *)i); //Create threads
if(status != 0)
{
printf("While creating thread %d, pthread_create returned error code %d\n", i, status);
exit(-1);
}
pthread_join(threads[i], 0); //Join threads
}
}
return 0;
}
The program is directly runnable. Could anyone help me figure it out? Thanks!
Some things I've figured out reading your code:
1. At the beginning of the Run function, you convert tid (which is a pointer to void) directly to int. Shouldn't you dereference it?
It is better to make int turn volatile, so that the compiler won't make any assumptions about its value not changing.
When you call the function sleep the second time, you pass a parameter that has type double (times[i]), and you should pass an unsigned int parameter. A direct cast like (unsigned int) times[i] should solve that.
You're doing the pthread_join before creating the other threads. When you create thread 3, and it enters its busy waiting state, the other threads won't be created. Try putting the joins after the for block.

Resources