I have recently been learning about threads. In a small experiment I used pthreads to compute the product of two matrices with multiple threads, and I found that the multithreaded version costs even more time than the single-threaded one. I tried enlarging the matrices, but the single-threaded version still performs better.
Here are my test code and the results:
Single Thread:
#include <stdio.h>
#include <pthread.h>
#include <time.h>   /* clock() is declared here */

#define M 3
#define K 2
#define N 3

int A[M][K] = {{1,4},{2,5},{3,6}};
int B[K][N] = {{8,7,6},{5,4,3}};
int C[M][N];

int main()
{
    clock_t begin = clock();
    int result = 0;
    int i, j, m;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++) {
            for (m = 0; m < K; m++) {
                result += A[i][m] * B[m][j];
            }
            C[i][j] = result;
            result = 0;
        }
    clock_t end = clock();
    printf("time cost:%d\n", (int)(end - begin));
    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }
}
result:
time cost:1
28 23 18
41 34 27
54 45 36
Multithread:
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>  /* for malloc() */
#include <time.h>    /* for clock() */

#define M 3
#define K 2
#define N 3

/* structure for passing data to a thread */
struct v
{
    int i;  /* row */
    int j;  /* column */
};

void create_and_pass(struct v *data);
void *runner(void *param);

int A[M][K] = {{1,4},{2,5},{3,6}};
int B[K][N] = {{8,7,6},{5,4,3}};
int C[M][N];

int main(int argc, char *argv[])
{
    /* We have to create M*N worker threads */
    clock_t begin = clock();
    int i, j;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++) {
            struct v *data = (struct v *)malloc(sizeof(struct v));
            data->i = i;
            data->j = j;
            /* Now create the thread, passing it data as a parameter */
            create_and_pass(data);
        }
    clock_t end = clock();
    printf("time cost:%d\n", (int)(end - begin));
    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }
}

void create_and_pass(struct v *data)
{
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_create(&tid, &attr, runner, (void *)data);
    pthread_join(tid, NULL);
}

void *runner(void *param)
{
    struct v *data = param;
    int result = 0;
    int m;
    for (m = 0; m < K; m++)
        result += A[data->i][m] * B[m][data->j];
    C[data->i][data->j] = result;
    return NULL;
}
result:
time cost:1163
28 23 18
41 34 27
54 45 36
Please help, thanks.
The main thread creates and starts a worker thread and then immediately joins it. Joining is a blocking operation, meaning that no other thread is started until the current one finishes. The execution is effectively sequential, but with all the overhead of memory allocation, thread creation, and so on added on top.
It is also unlikely that you would see any gain on such a small data set.
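For illustration, here is a minimal sketch of how the thread handling could be restructured so that the workers actually run concurrently: create every thread first, and only then join them. It reuses the globals A, B, C, struct v, and runner from the question; treat it as a sketch rather than a drop-in replacement.
/* Sketch only: start all M*N workers before waiting on any of them. */
pthread_t tids[M][N];
int i, j;
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++) {
        struct v *data = malloc(sizeof *data);
        data->i = i;
        data->j = j;
        pthread_create(&tids[i][j], NULL, runner, data);
    }
/* Every worker is now running; join them all afterwards. */
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        pthread_join(tids[i][j], NULL);
Even with this change, a 3x3 matrix is far too small for the threading overhead to pay off, which is the second point above.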
Related
I'm relatively new to multithreaded programming. I wrote a program that calculates the squares of 0 to 10000 and saves them into an array. The sequential program runs much faster than the parallel one. In my parallel program I divided the loop among 8 threads (my machine has 8 cores), but it is much slower! Does anyone have an idea why this is the case? I have added screenshots of the execution times.
/* Here is the normal program: */
#define ARRAYSIZE 10000

int main(void) {
    int array[ARRAYSIZE];
    int i;
    for (i = 0; i < ARRAYSIZE; i++)
    {
        array[i] = i * i;
    }
    return 0;
}
/* Here is the parallelized calculation. Used from http://ramcdougal.com/threads.html */
#include <stdio.h>
#include <pthread.h>

#define ARRAYSIZE 10000
#define NUMTHREADS 8    /* because I have 8 cores on my machine */

struct ThreadData {
    int start;
    int stop;
    int *array;
};

void *squarer(void *td);

/* puts i^2 into array positions i=start to stop-1 */
void *squarer(void *td)
{
    struct ThreadData *data = (struct ThreadData *) td;
    int start = data->start;
    int stop = data->stop;
    int *array = data->array;
    int i;
    for (i = start; i < stop; i++)
    {
        array[i] = i * i;
    }
    return NULL;
}

int main(void) {
    int array[ARRAYSIZE];
    pthread_t thread[NUMTHREADS];
    struct ThreadData data[NUMTHREADS];
    int i;
    int tasksPerThread = (ARRAYSIZE + NUMTHREADS - 1) / NUMTHREADS;

    /* Divide work for threads, prepare parameters */
    /* This means in my example I divide the loop into 8 regions: 0..1250, 1250..2500, 2500..3750, etc. */
    for (i = 0; i < NUMTHREADS; i++)
    {
        data[i].start = i * tasksPerThread;
        data[i].stop = (i + 1) * tasksPerThread;
        data[i].array = array;
        data[NUMTHREADS-1].stop = ARRAYSIZE;
    }

    for (i = 0; i < NUMTHREADS; i++)
    {
        pthread_create(&thread[i], NULL, squarer, &data[i]);
    }

    for (i = 0; i < NUMTHREADS; i++)
    {
        pthread_join(thread[i], NULL);
    }
    return 0;
}
You want to have a garden party. In preparation, you must move 8 chairs from the house into the garden. You call a moving company and ask them to send 8 movers. They arrive from across town and quickly complete the task, one chair each. The 8 movers drive back to the other end of the town. When they return, they call you and tell you that the task has been completed.
Question: Would the whole process have gone faster if you had moved the 8 chairs yourself?
Answer: Yes, the actual task (moving 8 chairs a short distance) is far too small to involve a moving company. The time spent on transport back and forth far exceeds the time spent on the task itself.
The example above is similar to what your code does.
Starting 8 threads is equivalent to driving from the other end of town to your house.
Stopping 8 threads is equivalent to returning back.
There is far too much wasted time compared to the size of the task to be solved.
Lesson: Only use multi-threading when the task is sufficiently big.
So for your test, you should increase ARRAYSIZE (a lot). Further, you have to add some code that prevents the compiler from doing optimizations that bypass the array assignments.
Try the code below (it's the OP's code with a few changes).
Single thread
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAYSIZE 1000000000
unsigned array[ARRAYSIZE];

int main(void) {
    unsigned i;
    for (i = 0; i < ARRAYSIZE; i++)
    {
        array[i] = i * i;
    }
    srand(time(NULL));
    return array[rand() % ARRAYSIZE] > 10000;
}
My result: 1.169 s
Multi thread
#include <stdio.h>
#include <stdlib.h>
#include <time.h>     /* for time() */
#include <pthread.h>

#define ARRAYSIZE 1000000000
unsigned array[ARRAYSIZE];

#define NUMTHREADS 8    /* because I have 8 cores on my machine */

struct ThreadData {
    unsigned start;
    unsigned stop;
    unsigned *array;
};

/* puts i^2 into array positions i=start to stop-1 */
void *squarer(void *td)
{
    struct ThreadData *data = (struct ThreadData *) td;
    unsigned start = data->start;
    unsigned stop = data->stop;
    unsigned *array = data->array;
    unsigned i;
    for (i = start; i < stop; i++)
    {
        array[i] = i * i;
    }
    return NULL;
}

int main(void) {
    pthread_t thread[NUMTHREADS];
    struct ThreadData data[NUMTHREADS];
    int i;
    int tasksPerThread = (ARRAYSIZE + NUMTHREADS - 1) / NUMTHREADS;

    /* Divide work for threads, prepare parameters */
    /* This divides the loop into NUMTHREADS roughly equal regions */
    for (i = 0; i < NUMTHREADS; i++)
    {
        data[i].start = i * tasksPerThread;
        data[i].stop = (i + 1) * tasksPerThread;
        data[i].array = array;
        data[NUMTHREADS-1].stop = ARRAYSIZE;
    }

    for (i = 0; i < NUMTHREADS; i++)
    {
        pthread_create(&thread[i], NULL, squarer, &data[i]);
    }

    for (i = 0; i < NUMTHREADS; i++)
    {
        pthread_join(thread[i], NULL);
    }

    srand(time(NULL));
    return array[rand() % ARRAYSIZE] > 10000;
}
My result: 0.192 s
First time asking a question, hope it will be productive :)
I have 10 threads running, and I need main to print two things:
A value as it is returned from a thread.
When all the threads are finished, a vector of all the values, in the same order as they were sent to the threads.
Right now the program prints the "--->" lines from inside the thread function, which shows that the thread finished, but I need main to print them.
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>
#include <pthread.h>
#include <stdio.h>

//sem_t mutex;
void *myThread(void *args)
{
    int argptr = do123(*(int*)args);
    printf("--->%d\n", argptr);
    // sem_wait(&mutex);
    //*(int*)args=do123((int)args);
    return (void*)argptr;
}

int main()
{
    int nums[10] = {17,65,34,91,92,93,33,16,22,75};
    int TemPnums[10] = {17,65,34,91,92,93,33,16,22,75};
    int res[10] = {0};
    //pthread_t t1,t2,t3,t4,t5,t6,t7,t8,t9,t10;
    pthread_t theads[10];
    for (int i = 0; i < 10; i++) {
        res[i] = nums[i];
        pthread_create(&theads[i], NULL, myThread, &TemPnums[i]);
    }
    // pthread_join(&theads[10], &status);
    for (int i = 0; i < 10; i++) {
        void *status;
        pthread_join(theads[i], &status);
        res[i] = (int)status;
    }
    for (int i = 0; i < 10; i++) {
        printf("%d\n", res[i]);
    }
}

int do123(int num)
{
    int k = 0;
    while (num != 1) {
        if (num % 2 == 1) {
            num = num*3 + 1;
            k++;
        } else {
            num = num/2;
            k++;
        }
    }
    return k;
}
OutPut:
--->12
--->92
--->27
--->13
--->17
--->17
--->26
--->14
--->4
--->15
12
27
13
92
17
17
26
4
15
14
The order in which threads execute (and therefore the order in which they print from inside the thread function) is decided by the scheduler; it is not influenced by the order in which you create or join them. That means the execution order of a pool of 10 threads on my system can differ from the order on yours. For example, using this modified version of your code (see the bottom of the post for change notes):
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>
#include <stdint.h>

//sem_t mutex;
int do123(int); // Added (1)

void *myThread(void *args)
{
    size_t argptr = do123(*(int *)args);
    printf("--->%zu\n", argptr);
    // sem_wait(&mutex);
    //*(int*)args=do123((int)args);
    return (void *)argptr;
}

int main()
{
    int nums[10] = {17, 65, 34, 91, 92, 93, 33, 16, 22, 75};
    int TemPnums[10] = {17, 65, 34, 91, 92, 93, 33, 16, 22, 75};
    int res[10] = {0};
    //pthread_t t1,t2,t3,t4,t5,t6,t7,t8,t9,t10;
    pthread_t theads[10];
    for (int i = 0; i < 10; i++)
    {
        res[i] = nums[i];
        pthread_create(&theads[i], NULL, myThread, &TemPnums[i]);
    }
    // pthread_join(&theads[10], &status);
    for (int i = 0; i < 10; i++)
    {
        void *status;
        pthread_join(theads[i], &status);
        res[i] = (size_t)status;
    }
    for (int i = 0; i < 10; i++)
    {
        printf("%d\n", res[i]);
    }
}

int do123(int num)
{
    int k = 0;
    while (num != 1)
    {
        if (num % 2 == 1)
        {
            num = num * 3 + 1;
            k++;
        }
        else
        {
            num = num / 2;
            k++;
        }
    }
    return k;
}
I get the output:
--->12
--->27
--->13
--->92
--->17
--->17
--->26
--->4
--->15
--->14
12
27
13
92
17
17
26
4
15
14
If your goal is to make sure that the values land in the array in main in the same order in which they are computed in the helper function, I suggest implementing a way to block subsequent threads until the current thread has had its value assigned. To do this, you can build a scheme using semaphores or mutexes.
Documentation on semaphores: https://www.tutorialspoint.com/how-to-use-posix-semaphores-in-c-language
Documentation on mutexes: https://www.tutorialspoint.com/deadlock-with-mutex-locks
In short, the flow should be: when one thread enters do123(), lock all other threads out of the function. Let that thread finish its work, return from the function, and have its result assigned to its respective index in the array. After this, unlock the next thread and repeat; a rough sketch follows.
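For illustration only, here is that flow with a single semaphore (the semaphore name and its initialization are my additions, not part of the original code):
/* Sketch: serialize entry into do123() with one semaphore.        */
/* In main, before creating the threads: sem_init(&order, 0, 1);   */
sem_t order;

void *myThread(void *args)
{
    sem_wait(&order);                  /* block other threads while this one works */
    int steps = do123(*(int *)args);
    printf("--->%d\n", steps);
    sem_post(&order);                  /* let the next waiting thread proceed */
    return (void *)(size_t)steps;
}
Keep in mind that this only serializes the work; it does not by itself guarantee a particular order, and it removes any parallelism, so it is mainly useful for understanding the mechanism.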
I suggest giving those a read to better understand how threading works. Good luck.
Notes on changes:
(1) You have to add the function declaration before using the function in your code. You had the definition of the function below where you call it. The compiler does not know about this function as it looks at your code from "top down".
You are losing precision by casting a void* to an int, because the sizes of the two types depend on your platform (32-bit, 64-bit, etc.). I changed those casts to size_t, an unsigned integer type that is wide enough to hold the value on common platforms, which avoids both negative values and the truncation.
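As an aside (this snippet is mine, not part of the answer): when you only need to smuggle a small integer through the void* argument and return value, the usual portable pattern is to round-trip it through a pointer-sized integer type such as intptr_t from <stdint.h>:
/* Illustration: packing a small int into void* and back via intptr_t. */
#include <stdint.h>

void *worker(void *arg)
{
    int n = (int)(intptr_t)arg;           /* unpack the argument */
    return (void *)(intptr_t)(n * 2);     /* pack the result the same way */
}

/* caller:
 *   pthread_create(&tid, NULL, worker, (void *)(intptr_t)21);
 *   void *status;
 *   pthread_join(tid, &status);
 *   int doubled = (int)(intptr_t)status;   -- doubled is 42
 */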
My goal is to create a program that evaluates the performance gains from increasing the number of threads the program can use. I evaluate the performance by using the Monte Carlo method to calculate pi. Each thread should create one random coordinate (x,y) and check whether that coordinate is within the circle. If it is, the inCircle counter should increase. Pi is calculated as 4 * inCircle / tries. Using pthread_join, there are no performance gains in a problem that should benefit from multiple threads. Is there some way to allow multiple threads to increase a counter without having to wait for each individual thread?
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <stdbool.h>
#include <pthread.h>

#define nPoints 10000000
#define NUM_THREADS 16

int inCircle = 0;
int count = 0;
double x,y;
pthread_mutex_t mutex;

bool isInCircle(double x, double y){
    if(x*x+y*y <= 1){
        return true;
    }
    else{
        return false;
    }
}

void *piSlave(){
    int myCount = 0;
    time_t now;
    time(&now);
    srand((unsigned int)now);
    for(int i = 1; i <= nPoints/NUM_THREADS; i++) {
        x = (double)rand() / (double)RAND_MAX;
        y = (double)rand() / (double)RAND_MAX;
        if(isInCircle(x,y)){
            myCount++;
        }
    }
    pthread_mutex_lock(&mutex);
    inCircle += myCount;
    pthread_mutex_unlock(&mutex);
    pthread_exit(0);
}

double piMaster()
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for(t = 0; t < NUM_THREADS; t++){
        printf("Creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, piSlave, (void *)t);
        if (rc){
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
        //pthread_join(threads[t], NULL);
    }
    //wait(NULL);
    return 4.0*inCircle/nPoints;
}

int main()
{
    printf("%f\n", piMaster());
    return(0);
}
There are a few issues with the code.
Wait for Thread Termination
The piMaster() function should wait for the threads it created. We can do this by simply running pthread_join() in a loop:
for (t = 0; t < NUM_THREADS; t++)
pthread_join(threads[t], NULL);
Avoid Locks
We can simply increase the inCircle counter atomically at the end of the loop, so no locks are needed. The variable must be declared with the _Atomic keyword, as described in the Atomic operations C reference:
_Atomic long inCircle = 0;
void *piSlave(void *arg)
{
[...]
inCircle += myCount;
[...]
}
This will generate the correct CPU instructions to atomically increase the variable. For example, on the x86 architecture a lock prefix appears, as we can confirm in the disassembly:
29 inCircle += myCount;
0x0000000100000bdb <+155>: lock add %rbx,0x46d(%rip) # 0x100001050 <inCircle>
Avoid Slow and Thread Unsafe rand()
Instead, we can simply scan the whole circle in a loop as described on Approximations of Pi Wikipedia page:
for (long x = -RADIUS; x <= RADIUS; x++)
for (long y = -RADIUS; y <= RADIUS; y++)
myCount += isInCircle(x, y);
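(As an aside, if you would rather keep the random sampling instead of scanning a grid, a thread-safe option is to give each thread its own generator state and use the reentrant rand_r(). The sketch below is mine, not part of the changes above; it assumes the question's nPoints and NUM_THREADS macros, the _Atomic inCircle counter, and that <stdlib.h> and <time.h> are included.)
/* Sketch: per-thread PRNG state keeps the Monte Carlo version thread-safe. */
void *piSlave(void *arg)
{
    long tid = (long)arg;
    unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)(tid * 2654435761u);
    long myCount = 0;
    for (long i = 0; i < nPoints / NUM_THREADS; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;   /* rand_r only touches our own seed */
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            myCount++;
    }
    inCircle += myCount;        /* still relies on the _Atomic counter above */
    pthread_exit(0);
}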
So here is the code after the changes above:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define RADIUS 10000L
#define NUM_THREADS 10

_Atomic long inCircle = 0;

static inline long isInCircle(long x, long y)
{
    return x * x + y * y <= RADIUS * RADIUS ? 1 : 0;
}

void *piSlave(void *arg)
{
    long myCount = 0;
    long tid = (long)arg;
    for (long x = -RADIUS + tid; x <= RADIUS + tid; x += NUM_THREADS)
        for (long y = -RADIUS; y <= RADIUS; y++)
            myCount += isInCircle(x, y);
    printf("\tthread %ld count: %ld\n", tid, myCount);
    inCircle += myCount;
    pthread_exit(0);
}

double piMaster()
{
    pthread_t threads[NUM_THREADS];
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("Creating thread %ld...\n", t);
        if (pthread_create(&threads[t], NULL, piSlave, (void *)t)) {
            perror("Error creating pthread");
            exit(-1);
        }
    }
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    return (double)inCircle / (RADIUS * RADIUS);
}

int main()
{
    printf("Result: %f\n", piMaster());
    return (0);
}
And here is the output:
Creating thread 0...
Creating thread 1...
Creating thread 2...
Creating thread 3...
Creating thread 4...
Creating thread 5...
Creating thread 6...
Creating thread 7...
Creating thread 8...
Creating thread 9...
thread 7 count: 31415974
thread 5 count: 31416052
thread 1 count: 31415808
thread 3 count: 31415974
thread 0 count: 31415549
thread 4 count: 31416048
thread 2 count: 31415896
thread 9 count: 31415808
thread 8 count: 31415896
thread 6 count: 31416048
Result: 3.141591
I'm writing a simple program taking a double alpha and integer deg that prints a matrix mat as computed by create_basis. Below is the code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>   /* for uint64_t */
#include <math.h>

#define MAX 30

void create_basis(uint64_t mat[][MAX], double alpha, int deg);

void create_basis(uint64_t mat[][MAX], double alpha, int deg){
    int i;
    int j;
    for(i = 0; i < deg+1; i++){
        for(j = 0; j < deg+2; j++)
            mat[i][j] = 0;
    }
    for(i = 0; i < deg+1; i++){
        mat[i][deg+1] = floor(pow(alpha,i)*pow(10,16));
        mat[i][i] = 1;
    }
}

int main(){
    int deg;
    double alpha;
    int i;
    int j;
    printf("Enter number:\n");
    scanf("%lf", &alpha);
    printf("Enter degree:\n");
    scanf("%d", &deg);
    uint64_t mat[deg+1][deg+2];
    create_basis(mat, alpha, deg);
    printf("Matrix basis=\n\n");
    for(i = 0; i < deg+1; i++){
        for(j = 0; j < deg+2; j++){
            if(j == 0)
                printf("[%llu ", mat[i][j]);
            else if(j == deg+1)
                printf("%llu]", mat[i][j]);
            else
                printf("%llu ", mat[i][j]);
        }
        printf("\n");
    }
    return 0;
}
However, when I run, there seems to be an issue when I call create_basis in main because it is giving an abort trap 6 error, which I presume to mean I'm trying to access memory I don't have. However, the dimensions of mat seem to agree with what I'm trying to access. Am I calling create_basis incorrectly? Any ideas are greatly appreciated!
void create_basis(uint64_t mat[][MAX],double alpha,int deg){
change to
void create_basis(int deg, uint64_t mat[deg+1][deg+2],double alpha){
As @SteveSummit already explained, the reason is that the declared dimensions of the two-dimensional array parameter do not match the dimensions of the array that is actually passed, so create_basis indexes the wrong memory.
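Note that reordering the parameters also means the call in main has to change to match. A minimal sketch of the corrected interface and call (illustrative only, using the question's variable names):
/* Corrected prototype: deg comes first so the VLA dimensions can refer to it
   and match the array the caller actually passes.                            */
void create_basis(int deg, uint64_t mat[deg + 1][deg + 2], double alpha);

/* ... inside main, after reading alpha and deg ... */
uint64_t mat[deg + 1][deg + 2];
create_basis(deg, mat, alpha);   /* arguments reordered to match the new prototype */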
EDIT TO QUESTION: Is it possible to have thread safe access to a bit array? My implementation below seems to require mutex locks which defeats the purpose of parallelizing.
I've been tasked with creating a parallel implementation of a twin prime generator using pthreads. I decided to use the Sieve of Eratosthenes and to divide the work of marking the multiples of known primes, staggering which multiples each thread gets.
For example, if there are 4 threads:
thread one marks multiples 3, 11, 19, 27...
thread two marks multiples 5, 13, 21, 29...
thread three marks multiples 7, 15, 23, 31...
thread four marks multiples 9, 17, 25, 33...
I skipped the even multiples as well as the even base numbers. I've used a bitarray, so I run it up to INT_MAX. The problem I have is at max value of 10 million, the result varies by about 5 numbers, which is how much error there is compared to a known file. The results vary all the way down to about max value of 10000, where it changes by 1 number. Anything below that is error-free.
At first I didn't think there was a need for communication between processes. When I saw the results, I added a pthread barrier to let all the threads catch up after each set of multiples. This didn't make any change. Adding a mutex lock around the mark() function did the trick, but that slows everything down.
Here is my code. Hoping someone might see something obvious.
#include <pthread.h>
#include <stdio.h>
#include <sys/times.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <string.h>
#include <limits.h>
#include <getopt.h>

#define WORDSIZE 32

struct t_data{
    int *ba;
    unsigned int val;
    int num_threads;
    int thread_id;
};

pthread_mutex_t mutex_mark;

/* original version of mark(), without the lock (kept for reference only):
void mark( int *ba, unsigned int k )
{
    ba[k/32] |= 1 << (k%32);
}
*/

/* current version, with the mutex added (see the EDIT at the bottom) */
void mark( int *ba, unsigned int k )
{
    pthread_mutex_lock(&mutex_mark);
    ba[k/32] |= 1 << (k%32);
    pthread_mutex_unlock(&mutex_mark);
}

void initBa(int **ba, unsigned int val)
{
    *ba = calloc((val/WORDSIZE)+1, sizeof(int));
}
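/* Note: isMarked() is used below but its definition was omitted from the post;
   this is the presumed counterpart of mark(), reading the same bit layout.    */
int isMarked( int *ba, unsigned int k )
{
    return (ba[k/32] >> (k%32)) & 1;
}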
void getPrimes(int *ba, unsigned int val)
{
    int i, p;
    p = -1;
    for(i = 3; i <= val; i += 2){
        if(!isMarked(ba, i)){
            if(++p == 8){
                printf(" \n");
                p = 0;
            }
            printf("%9d", i);
        }
    }
    printf("\n");
}

void markTwins(int *ba, unsigned int val)
{
    int i;
    for(i = 3; i <= val; i += 2){
        if(!isMarked(ba, i)){
            if(isMarked(ba, i+2)){
                mark(ba, i);
            }
        }
    }
}
void *setPrimes(void *arg)
{
    int *ba, thread_id, num_threads, status;
    unsigned int val, i, p, start;
    struct t_data *data = (struct t_data*)arg;
    ba = data->ba;
    thread_id = data->thread_id;
    num_threads = data->num_threads;
    val = data->val;
    start = (2*(thread_id+2))-1; // stagger threads
    i = 3;
    for(i = 3; i <= sqrt(val); i += 2){
        if(!isMarked(ba, i)){
            p = start;
            while(i*p <= val){
                mark(ba, (i*p));
                p += (2*num_threads);
            }
        }
    }
    return 0;
}
void usage(char *filename)
{
    printf("Usage: \t%s [option] [arg]\n", filename);
    printf("\t-q generate #'s internally only\n");
    printf("\t-m [size] maximum size twin prime to calculate\n");
    printf("\t-c [threads] number of threads\n");
    printf("Defaults:\n\toutput results\n\tsize = INT_MAX\n\tthreads = 1\n");
}

int main(int argc, char **argv)
{
    int *ba, i, num_threads, opt, output;
    unsigned int val;
    output = 1;
    num_threads = 1;
    val = INT_MAX;
    while ((opt = getopt(argc, argv, "qm:c:")) != -1){
        switch (opt){
        case 'q': output = 0;
            break;
        case 'm': val = atoi(optarg);
            break;
        case 'c': num_threads = atoi(optarg);
            break;
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
        }
    }
    struct t_data data[num_threads];
    pthread_t thread[num_threads];
    pthread_attr_t attr;
    pthread_mutex_init(&mutex_mark, NULL);
    initBa(&ba, val);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    for(i = 0; i < num_threads; i++){
        data[i].ba = ba;
        data[i].thread_id = i;
        data[i].num_threads = num_threads;
        data[i].val = val;
        if(0 != pthread_create(&thread[i],
                               &attr,
                               setPrimes,
                               (void*)&data[i])){
            perror("Cannot create thread");
            exit(EXIT_FAILURE);
        }
    }
    for(i = 0; i < num_threads; i++){
        pthread_join(thread[i], NULL);
    }
    markTwins(ba, val);
    if(output)
        getPrimes(ba, val);
    free(ba);
    return 0;
}
EDIT: I got rid of the barrier and added a mutex_lock to the mark function. Output is accurate now, but now more than one thread slows it down. Any suggestions on speeding it up?
Your current implementation of mark is correct, but the locking is extremely coarse-grained: there is only one lock for the entire array, so your threads are constantly contending for it.
One way of improving performance is to make the lock finer-grained: each 'mark' operation only requires exclusive access to a single integer within the array, so you could have a mutex for each array entry:
struct bitarray
{
    int *bits;
    pthread_mutex_t *locks;
};

struct t_data
{
    struct bitarray ba;
    unsigned int val;
    int num_threads;
    int thread_id;
};

void initBa(struct bitarray *ba, unsigned int val)
{
    const size_t array_size = val / WORDSIZE + 1;
    size_t i;
    ba->bits = calloc(array_size, sizeof ba->bits[0]);
    ba->locks = calloc(array_size, sizeof ba->locks[0]);
    for (i = 0; i < array_size; i++)
    {
        pthread_mutex_init(&ba->locks[i], NULL);
    }
}

void mark(struct bitarray ba, unsigned int k)
{
    const unsigned int entry = k / 32;
    pthread_mutex_lock(&ba.locks[entry]);
    ba.bits[entry] |= 1 << (k%32);
    pthread_mutex_unlock(&ba.locks[entry]);
}
Note that your algorithm has a race-condition: consider the example where num_threads = 4, so Thread 0 starts at 3, Thread 1 starts at 5 and Thread 2 starts at 7. It is possible for Thread 2 to execute fully, marking every multiple of 7 and then start again at 15, before Thread 0 or Thread 1 get a chance to mark 15 as a multiple of 3 or 5. Thread 2 will then do useless work, marking every multiple of 15.
Another alternative, if your compiler supports Intel-style atomic builtins, is to use those instead of a lock:
void mark(int *ba, unsigned int k)
{
    __sync_or_and_fetch(&ba[k/32], 1U << k % 32);
}
Your mark() function is not thread-safe: if two threads try to set bits within the same int location, one may overwrite with 0 a bit that was just set by the other thread.
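One more option, in case your compiler supports C11: the <stdatomic.h> fetch-or makes the read-modify-write on each word indivisible, so no mutex is needed around mark(). This is my sketch rather than something from the answers above, and it requires declaring the bit-array words as an atomic type.
/* Sketch: C11 atomic OR; the array must be allocated as atomic_uint words. */
#include <stdatomic.h>

void mark(atomic_uint *ba, unsigned int k)
{
    atomic_fetch_or(&ba[k / 32], 1u << (k % 32));
}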