C <pthread.h>: passing local variables to a thread is broken

#include <stdio.h>
#include <pthread.h>
typedef struct {
int threadNum;
}thread_args;
void*thread_func(void*vargp){
thread_args*id=(thread_args*)vargp;
printf("%i\n",id->threadNum);
return NULL;
}
int main() {
for(int i=0;i<20;i++) {
pthread_t id;
thread_args args;
args.threadNum=i;
pthread_create(&id,NULL,thread_func,(void*)&args);
}
pthread_exit(NULL);
return 0;
}
Adapted from https://www.geeksforgeeks.org/multithreading-c-2/.
So this is expected to output:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
But shuffled in a random order to account for the concurrency of the threads.
The issue here is that it actually prints out this:
4
9
10
5
11
12
13
8
4
4
17
6
18
7
15
19
6
14
19
16
As you can see, there are duplicate numbers and 0-3 are just plain skipped.
I have done concurrency in other frameworks before, and I have seen similar issues: what is happening here is that i is being passed by reference (I think!), so when the for loop increments i, it is incremented in all thread argument variables.
How can I avoid this?
NOTE: Everything is linking 100% properly and I'm on macOS.
PS: Sorry if this is a duplicate, I'm not very experienced with this.

Your for loop has undefined behavior (UB). Each iteration creates a variable called args, assigns it a value, and passes its address to the new thread for later use; the variable is then destroyed at the end of the iteration. The next iteration does the same thing, possibly overwriting the same region of memory while an earlier thread is still reading it.
To solve that problem, I suggest this modification:
int main() {
thread_args args[20] = {0};
pthread_t id[20] = {0};
for(int i=0;i<20;i++) {
args[i].threadNum=i;
pthread_create(&id[i],NULL,thread_func,(void*)&args[i]);
}
for(int i = 0; i < 20; i++)
pthread_join(id[i], NULL);
return 0;
}

This is, in fact, a race condition. You pass a void pointer to the argument struct, but (likely) the same memory address is reused for each argument struct. Therefore, when you later access it, you are likely to read modified memory. Try this:
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
typedef struct {
int threadNum;
}thread_args;
void* thread_func(void* vargp){
thread_args* id = (thread_args*)vargp;
printf("%i\n", id->threadNum);
free(vargp);
return NULL;
}
int main() {
for(int i=0;i<20;i++) {
pthread_t id;
thread_args* args = malloc(sizeof(thread_args));
args->threadNum = i;
pthread_create(&id, NULL, thread_func, (void*)args);
}
pthread_exit(NULL);
return 0;
}
Thanks to Kamil Cuk for pointing out another race condition.
Note that this snippet might still leak because the code never joins the threads, so the free() might never be called.
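A minimal sketch of how the heap-allocated arguments could be combined with pthread_join, so that every allocation is freed in one known place. The names run_threads and NUM_THREADS are hypothetical and not part of the original answers; here each thread hands its argument struct back through its return value and the joining code frees it.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 20  /* hypothetical name for the thread count used in the question */

typedef struct {
    int threadNum;
} thread_args;

/* Prints its own private copy of the number and returns the struct to the joiner. */
static void *thread_func(void *vargp) {
    thread_args *args = vargp;
    printf("%i\n", args->threadNum);
    return args;
}

/* Starts the threads, joins them all, and returns the sum of the numbers seen. */
int run_threads(void) {
    pthread_t ids[NUM_THREADS];
    int started = 0;
    int sum = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        thread_args *args = malloc(sizeof *args);  /* one allocation per thread */
        if (args == NULL) break;
        args->threadNum = i;
        if (pthread_create(&ids[started], NULL, thread_func, args) != 0) {
            free(args);
            break;
        }
        started++;
    }

    for (int i = 0; i < started; i++) {
        void *ret = NULL;
        if (pthread_join(ids[i], &ret) == 0 && ret != NULL) {
            thread_args *args = ret;
            sum += args->threadNum;
            free(args);  /* freed exactly once, after the thread has finished */
        }
    }
    return sum;
}
```

Calling run_threads() from main prints the numbers 0 to 19 in some order, each exactly once.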

Related

C: for loop with two nested for loops stops working after first cycle

Can someone help me figure out why the for loop with the variable v doesn't execute after the first cycle?
#include <stdio.h>
#include <stdlib.h>
int main()
{
int x[100],n,h,s,v,k,l;
s=0;
scanf("%d",&n);
h=n;
for(int j=0;j<n;j++)
scanf("%d",&x[j]);
for(v=0;v<n;v++)
{
for(k=0;k<n;k++)
if(h%x[k]==0) x[k]=0;
for(l=0;l<n;l++)
if(x[l]==0) h--;
}
for(int m=0;m<n;m++)
s=s+x[m];
printf("%d",s);
return 0;
}
EDIT (this is a copy of the comment below!)
The input is 10 1 2 3 4 5 6 7 8 9 10. The expected result is 24, but the actual result is nothing, because the program just stops after the first cycle of v.
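Regarding:
if(h%x[k]==0) x[k]=0;
The modulo operator % divides its left operand by its right operand and returns the remainder.
So when x[k] is zero, the expression divides by zero, which is undefined behavior in C and typically crashes the program. With the input above, that happens on the second cycle of v, because the first cycle has already set some elements of x to 0.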
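One way to avoid the crash is to test the divisor before applying %. The sketch below uses a hypothetical helper named divides; it only removes the division by zero and does not claim to produce the result the asker expects.

```c
/* Returns 1 if d is non-zero and divides h evenly, 0 otherwise.
   Because && short-circuits, h % d is never evaluated when d == 0. */
int divides(int h, int d) {
    return d != 0 && h % d == 0;
}
```

In the question's loop, the test would then read if(divides(h, x[k])) x[k]=0;.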

Standard C function is slower at the first call, how to solve this properly?

I want to make timing tests for learning how to benchmark using "time.h". But I noticed the first test is always longer.
0 1 2 3 4 5 6 7 8 9
time 0.000138
0 1 2 3 4 5 6 7 8 9
time 0.000008
0 1 2 3 4 5 6 7 8 9
time 0.000007
If I want to run several tests in the same main() function, the results will be unreliable.
Here is the stupid code that prints the output above.
#include <stdio.h>
#include <time.h>
const int COUNT = 10;
void test() {
clock_t start = clock();
for(int i = 0; i < COUNT; i++) {
printf("%d ", i);
}
printf("\ntime %lf\n", (double)(clock() - start) / (double)CLOCKS_PER_SEC );
}
int main() {
test();
test();
test();
return 0;
}
I solved this by ignoring the first "test" function. Also, adding a first "printf" that prints some integer before the tests works too. But I guess it's not a proper solution.
The CPU has caches. When code and data are not yet in the cache, the code takes longer to run.
It's standard practice to discard the result of the first run (or the first few runs) when measuring performance. This is sometimes called "cache warmup".
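The warm-up idea can be sketched as follows. The names WARMUP_RUNS, MEASURED_RUNS, workload, time_once and benchmark are hypothetical; the workload is the same loop as in the question, and the counts are arbitrary choices.

```c
#include <stdio.h>
#include <time.h>

#define WARMUP_RUNS 1    /* runs whose timings are thrown away */
#define MEASURED_RUNS 3  /* runs whose timings are averaged */

/* The work being timed; same loop as in the question. */
static void workload(void) {
    for (int i = 0; i < 10; i++) {
        printf("%d ", i);
    }
    printf("\n");
}

/* Times one call of workload() and returns the CPU time in seconds. */
static double time_once(void) {
    clock_t start = clock();
    workload();
    return (double)(clock() - start) / (double)CLOCKS_PER_SEC;
}

/* Runs the warm-up passes, then returns the average of the measured runs. */
double benchmark(void) {
    for (int i = 0; i < WARMUP_RUNS; i++) {
        (void)time_once();  /* result discarded on purpose */
    }
    double total = 0.0;
    for (int i = 0; i < MEASURED_RUNS; i++) {
        total += time_once();
    }
    return total / MEASURED_RUNS;
}
```

For a workload this short, clock() has limited resolution, so repeating the workload many times inside each measured run would likely give steadier numbers.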

Non-deterministic CUDA C kernel

I'm still a beginner with CUDA and I have been trying to write a simple kernel to perform a parallel prime sieve on the GPU. Originally I had written my code in C but I wanted to investigate the speed up on a GPU so I rewrote it:
41.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define B 1024
#define T 256
#define N (B*T)
#define checkCudaErrors(error) {\
if (error != cudaSuccess) {\
printf("CUDA Error - %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(error));\
exit(1);\
}\
}
__global__ void prime_sieve(int *primes) {
unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
primes[i] = i;
primes[0] = primes[1] = 0;
if (i > 1 && i<N) {
for (int j=2; j<N/2; j++) {
if (i*j < N) {
primes[i*j] = 0;
}
}
}
}
int main() {
int *h_primes=(int*)malloc(N * sizeof(int));
int *d_primes;
checkCudaErrors(cudaMalloc( (void**)&d_primes, N*sizeof(int)));
checkCudaErrors(cudaMemcpy(d_primes,h_primes,N*sizeof(int),cudaMemcpyHostToDevice));
prime_sieve<<<B,T>>>(d_primes);
checkCudaErrors(cudaMemcpy(h_primes,d_primes,N*sizeof(int),cudaMemcpyDeviceToHost));
checkCudaErrors(cudaFree(d_primes));
int size = 0;
int total = 0;
for (int i=2; i<N; i++) {
if (h_primes[i]) {
size++;
}
total++;
}
printf("\n");
printf("Length = %d\tPrimes = %d\n",total,size);
free(h_primes);
return 0;
}
I run the program on Ubuntu 16.04 (4.4.0-83-generic) and compile it with nvcc 41.cu -o 41.o -arch=sm_30 under version 8.0.61. The program runs on a GeForce GTX 780 Ti, but every time it runs it produces non-deterministic results:
Length = 262142 Primes = 49477
Length = 262142 Primes = 49486
Length = 262142 Primes = 49596
Length = 262142 Primes = 49589
There were no errors reported back. At first I thought it was a race condition but cuda-memcheck didn't report back any hazards for racecheck,initcheck or synccheck and I couldn't think of any problems with my assumptions. I was thinking this could be a synchronisation problem?
This non-deterministic behaviour only occurs when I increase the block size and thread size as seen in the code. When I tried a block size and thread size of say 16, then there were no problems (as far as I could tell). It seems that not all threads get the chance to execute? I was planning to run this on very large array sizes (< 1 billion integers) but I am stuck at this point.
What am I doing wrong here?
There is a giant race condition.
primes[i] > 0 means that i is prime, while primes[i] = 0 means that i is composite.
Each thread executes primes[i] = i; as its first update to primes. Keep this in mind.
Now let's see what happens when thread 16 executes. It sets primes[16] = 16 and then zeroes all multiples of 16, something like the following:
primes[32] = primes[48] = ... = primes[k*16] = 0
Imagine that thread 48 gets scheduled just after thread 16 has completed its job (or once j > 3 in thread 16's loop).
Thread 48 sets primes[48] = 48. You have lost the update made by thread 16.
That is a race condition.
When coding in CUDA, you should make sure that the correctness of your code does not depend on a particular scheduling of warps.
Think of the order of execution as non-deterministic.
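One possible way to remove the dependence on scheduling is to split the work into two phases: first initialize every slot, then only write zeros. The sketch below shows that structure in plain, sequential C rather than CUDA, so it illustrates the ordering and is not a GPU implementation; count_primes_two_phase is a hypothetical name. On a GPU, each phase would presumably be its own kernel launch, since kernels launched on the same stream run one after the other.

```c
#include <stdlib.h>

/* Counts the primes below n using the two-phase structure.
   Returns -1 if the allocation fails. */
int count_primes_two_phase(int n) {
    if (n < 2) return 0;
    int *primes = malloc((size_t)n * sizeof *primes);
    if (primes == NULL) return -1;

    /* Phase 1: every slot gets its initial value before any marking starts. */
    for (int i = 0; i < n; i++) {
        primes[i] = i;
    }
    primes[0] = primes[1] = 0;

    /* Phase 2: only zeros are written, so the order in which the
       values of i are processed no longer affects the result. */
    for (int i = 2; i < n; i++) {
        for (long j = 2; (long)i * j < n; j++) {
            primes[(long)i * j] = 0;
        }
    }

    int count = 0;
    for (int i = 2; i < n; i++) {
        if (primes[i]) count++;
    }
    free(primes);
    return count;
}
```

Because phase 2 never stores a non-zero value, a late write can no longer undo an earlier marking, which is the lost update described above.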

Why does srand(time(NULL)) give me a segmentation fault in main?

Need some help here.
I want to understand what's happening in this code.
I'm trying to generate random numbers as tickets for the TCB_t struct created inside the ccreate function.
The problem is, every time I executed this code WITHOUT the srand(time(NULL)) it returned the same sequence of "random" numbers over and over, for example:
TID: 0 | TICKET : 103
TID: 1 | TICKET : 198
So I seeded it with time, to generate really random numbers.
When I put the seed inside the newTicket function, it brings different numbers in every execution, but the same numbers for every thread. Here is an example of output:
Execution 1:
TID: 0 | TICKET : 148
TID: 1 | TICKET : 148
Execution 2:
TID: 0 | TICKET : 96
TID: 1 | TICKET : 96
So, after some research, I found out that I shouldn't seed it every time I call rand, but only once, at the beginning of the program. Now, after putting the seed inside the main function, it gives me a segmentation fault, and I have NO IDEA why.
This might be a stupid question, but I really want to understand what's happening.
Is the seed screwing anything, somehow?
Am I missing something?
Should I generate random number in another way?
#include <ucontext.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#define MAX_TICKET 255
#define STACK_SIZE 32000
typedef struct s_TCB {
int threadId;
int ticket;
ucontext_t context;
} TCB_t;
void test();
int newTicket();
int newThreadId();
int ccreate (void* (*start)(void*), void *arg);
int threadId = 0;
int main(){
srand(time(NULL)); //<<<============== HERE = SEGMENTATION FAULT
ccreate((void*)&test, 0);
ccreate((void*)&test, 0);
}
int ccreate (void* (*start)(void*), void *arg){
if(start == NULL) return -1;
ucontext_t threadContext;
getcontext(&threadContext);
makecontext(&threadContext, (void*)start, 0);
threadContext.uc_stack.ss_sp = malloc(STACK_SIZE);
threadContext.uc_stack.ss_size = STACK_SIZE;
TCB_t * newThread = malloc(sizeof(TCB_t));
if (newThread == NULL) return -1;
int threadThreadId = newThreadId();
newThread->threadId = threadThreadId;
newThread->ticket = newTicket();
printf("TID: %d | TICKET : %d\n", newThread->threadId, newThread->ticket);
return threadThreadId;
}
int newThreadId(){
int newThreadId = threadId;
threadId++;
return newThreadId;
}
int newTicket(){
//srand(time(NULL)); //<<<============== HERE = IT PARTIALLY WORKS
return (rand() % (MAX_TICKET+1));
}
void test(){
printf("this is a test function");
}
Thanks to everyone who lends me a hand here.
And sorry if the code is too ugly to read. Tried to simplify it as much as I could.
The problem is not with srand(time(NULL)), but with makecontext.
You can run your code through a sanitizer to confirm:
gcc-6 -fsanitize=undefined -fsanitize=address -fsanitize=leak -fsanitize-recover=all -fuse-ld=gold -o main main.c
./main
ASAN:DEADLYSIGNAL
=================================================================
==8841==ERROR: AddressSanitizer: SEGV on unknown address 0x7fc342ade618 (pc 0x7fc340aad235 bp 0x7ffd1b945950 sp 0x7ffd1b9454f8 T0)
#0 0x7fc340aad234 in makecontext (/lib/x86_64-linux-gnu/libc.so.6+0x47234)
#1 0x400d2f in ccreate (/home/malko/Desktop/main+0x400d2f)
#2 0x400c19 in main (/home/malko/Desktop/main+0x400c19)
#3 0x7fc340a87f44 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21f44)
#4 0x400b28 (/home/malko/Desktop/main+0x400b28)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x47234) in makecontext
==8841==ABORTING
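makecontext uses the stack described by uc_stack when it sets up the context, so the stack must be assigned before the call. You can solve the problem by setting the stack pointer and size before making the context: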
char stack[20000];
threadContext.uc_stack.ss_sp = stack;
threadContext.uc_stack.ss_size = sizeof(stack);
makecontext(&threadContext, (void*)start, 0);
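Note that the stack must stay valid for as long as the context can run, so a local array like this is only suitable if the context is used before the enclosing function returns. Unrelated, but make sure you also free the malloc'd memory in your sample code.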
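For reference, a minimal sketch of the full ordering with a heap-allocated stack: getcontext, then the uc_stack and uc_link fields, then makecontext, then swapcontext. The names run_once, entry, main_ctx and the 64 KiB STACK_SIZE are assumptions made for this illustration; it targets Linux with glibc, and ucontext is deprecated on some other platforms.

```c
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)  /* assumed size for this sketch */

static ucontext_t main_ctx;  /* where to resume once entry() returns */
static int ran = 0;

/* Function executed on the new context's stack. */
static void entry(void) {
    ran = 1;
}

/* Runs entry() once in its own context. Returns 1 on success, -1 on error. */
int run_once(void) {
    ucontext_t ctx;
    ran = 0;

    if (getcontext(&ctx) == -1) return -1;

    void *stack = malloc(STACK_SIZE);
    if (stack == NULL) return -1;

    /* The stack fields must be filled in BEFORE makecontext is called. */
    ctx.uc_stack.ss_sp = stack;
    ctx.uc_stack.ss_size = STACK_SIZE;
    ctx.uc_link = &main_ctx;  /* resume here when entry() returns */

    makecontext(&ctx, entry, 0);

    if (swapcontext(&main_ctx, &ctx) == -1) {
        free(stack);
        return -1;
    }

    free(stack);  /* safe: the context has finished running */
    return ran;
}
```

With this ordering, calling srand(time(NULL)) once at the start of main should be unrelated to whether the context setup crashes.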

How to pass a sequential counter by reference to pthread start routine?

Below is my C code to print an increasing global counter, one increment per thread.
#include <stdio.h>
#include <pthread.h>
static pthread_mutex_t pt_lock = PTHREAD_MUTEX_INITIALIZER;
int count = 0;
int *printnum(int *num) {
pthread_mutex_lock(&pt_lock);
printf("thread:%d ", *num);
pthread_mutex_unlock(&pt_lock);
return NULL;
}
int main() {
int i, *ret;
pthread_t pta[10];
for(i = 0; i < 10; i++) {
pthread_mutex_lock(&pt_lock);
count++;
pthread_mutex_unlock(&pt_lock);
pthread_create(&pta[i], NULL, (void *(*)(void *))printnum, &count);
}
for(i = 0; i < 10; i++) {
pthread_join(pta[i], (void **)&ret);
}
}
I want each thread to print one increment of the global counter, but they miss increments, and sometimes two threads print the same value of the global counter. How can I make the threads access the global counter sequentially?
Sample Output:
thread:2
thread:3
thread:5
thread:6
thread:7
thread:7
thread:8
thread:9
thread:10
thread:10
Edit
Blue Moon's answer solves this question. An alternative approach is available in MartinJames's comment.
A simple-but-useless way to ensure that thread 1 prints 1, thread 2 prints 2, and so on is to join each thread immediately after creating it:
pthread_create(&pta[i], NULL, printnum, &count);
pthread_join(pta[i], (void **)&ret);
But this totally defeats the purpose of multi-threading, because only one thread can make progress at a time.
Note that I removed the superfluous casts and changed the thread function to take a void * argument.
A saner approach would be to pass the loop counter i (or the current value of count) by value, so that each thread prints a different value and you can see threading in action: the numbers could be printed in any order, and each thread would print a unique value.
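A sketch of passing the counter by value, assuming the value is smuggled through the void * argument with an intptr_t cast (a common idiom, not something the original answer spells out). The seen array, NUM_THREADS and run_counter_threads are hypothetical names added so the result can be checked.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 10

static pthread_mutex_t pt_lock = PTHREAD_MUTEX_INITIALIZER;
static int seen[NUM_THREADS + 1];  /* seen[v] counts how often value v was printed */

/* Receives the number itself, not a pointer to a shared counter. */
static void *printnum(void *arg) {
    int num = (int)(intptr_t)arg;
    pthread_mutex_lock(&pt_lock);
    printf("thread:%d\n", num);
    seen[num]++;
    pthread_mutex_unlock(&pt_lock);
    return NULL;
}

/* Starts the threads and returns how many of the values 1..NUM_THREADS
   were printed exactly once. */
int run_counter_threads(void) {
    pthread_t pta[NUM_THREADS];
    int started = 0;

    for (int v = 0; v <= NUM_THREADS; v++) {
        seen[v] = 0;
    }

    for (int i = 0; i < NUM_THREADS; i++) {
        /* i + 1 is copied into the pointer-sized argument, so later
           iterations cannot change what this thread sees. */
        void *arg = (void *)(intptr_t)(i + 1);
        if (pthread_create(&pta[started], NULL, printnum, arg) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(pta[i], NULL);
    }

    int distinct = 0;
    for (int v = 1; v <= NUM_THREADS; v++) {
        if (seen[v] == 1) distinct++;
    }
    return distinct;
}
```

The threads may still print in any order, but each value from 1 to 10 appears exactly once because no thread reads shared state that the loop is still modifying.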
