I'm relatively new to multithreaded programming. I wrote a program that calculates the squares of the numbers from 0 to 10000 and saves them into an array. The sequential program runs much faster than the parallel one. In my parallel program I divided the loop across 8 threads (my machine has 8 cores), but it is much slower! Does anyone have an idea why this is the case? I have added screenshots of the execution times.
/*Here is the normal program:*/
#define ARRAYSIZE 10000
int main(void) {
int array[ARRAYSIZE];
int i;
for (i=0; i<ARRAYSIZE; i++)
{
array[i]=i*i;
}
return 0;
}
/*Here is the parallelized calculation. Used from http://ramcdougal.com/threads.html*/
#include <stdio.h>
#include <pthread.h>
#define ARRAYSIZE 10000
#define NUMTHREADS 8 /* Because I have 8 cores on my machine */
struct ThreadData {
int start;
int stop;
int* array;
};
void* squarer (struct ThreadData* td);
/* puts i^2 into array positions i=start to stop-1 */
void* squarer (struct ThreadData* td)
{
struct ThreadData* data = (struct ThreadData*) td;
int start=data->start;
int stop=data->stop;
int* array=data->array;
int i;
for(i= start; i<stop; i++)
{
array[i]=i*i;
}
return NULL;
}
int main(void) {
int array[ARRAYSIZE];
pthread_t thread[NUMTHREADS];
struct ThreadData data[NUMTHREADS];
int i;
int tasksPerThread= (ARRAYSIZE + NUMTHREADS - 1)/ NUMTHREADS;
/* Divide work for threads, prepare parameters */
/* This means in my example I divide the loop into 8 regions: 0..1250, 1250..2500, ..., 8750..10000 */
for(i=0; i<NUMTHREADS;i++)
{
data[i].start=i*tasksPerThread;
data[i].stop=(i+1)*tasksPerThread;
data[i].array=array;
data[NUMTHREADS-1].stop=ARRAYSIZE;
}
for(i=0; i<NUMTHREADS;i++)
{
pthread_create(&thread[i], NULL, squarer, &data[i]);
}
for(i=0; i<NUMTHREADS;i++)
{
pthread_join(thread[i], NULL);
}
return 0;
}
You want to have a garden party. In preparation, you must move 8 chairs from the house into the garden. You call a moving company and ask them to send 8 movers. They arrive from across town and quickly complete the task, one chair each. The 8 movers drive back to the other end of the town. When they return, they call you and tell you that the task has been completed.
Question: Would the whole process have gone faster if you had moved the 8 chairs yourself?
Answer: Yes, the actual task (moving 8 chairs a short distance) is far too small to involve a moving company. The time spent on transport back and forth far exceeds the time spent on the task itself.
The example above is similar to what your code does.
Starting 8 threads is equivalent to driving from the other end of town to your house.
Stopping 8 threads is equivalent to returning back.
There is far too much wasted time compared to the size of the task to be solved.
Lesson: Only use multi-threading when the task is sufficiently big.
So for your test you should increase ARRAYSIZE (a lot). Further, you have to add some code that prevents the compiler from optimizing away the array assignments.
Try the code below (it's the OP's code with a few changes).
Single thread
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAYSIZE 1000000000
unsigned array[ARRAYSIZE];
int main(void) {
unsigned i;
for (i=0; i<ARRAYSIZE; i++)
{
array[i]=i*i;
}
/* Read a random element so the compiler can't optimize away the array assignments */
srand(time(NULL));
return array[rand() % ARRAYSIZE] > 10000;
}
My result: 1.169 s
Multi thread
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
#define ARRAYSIZE 1000000000
unsigned array[ARRAYSIZE];
#define NUMTHREADS 8 /* Because I have 8 cores on my machine */
struct ThreadData {
unsigned start;
unsigned stop;
unsigned* array;
};
/* puts i^2 into array positions i=start to stop-1 */
void* squarer (void* td)
{
struct ThreadData* data = (struct ThreadData*) td;
unsigned start=data->start;
unsigned stop=data->stop;
unsigned* array=data->array;
unsigned i;
for(i= start; i<stop; i++)
{
array[i]=i*i;
}
return NULL;
}
int main(void) {
pthread_t thread[NUMTHREADS];
struct ThreadData data[NUMTHREADS];
int i;
int tasksPerThread= (ARRAYSIZE + NUMTHREADS - 1)/ NUMTHREADS;
/* Divide work for threads, prepare parameters */
/* With ARRAYSIZE = 1000000000 and 8 threads this divides the loop into 8 regions of 125000000 elements each: 0..125000000, 125000000..250000000, etc. */
for(i=0; i<NUMTHREADS;i++)
{
data[i].start=i*tasksPerThread;
data[i].stop=(i+1)*tasksPerThread;
data[i].array=array;
data[NUMTHREADS-1].stop=ARRAYSIZE;
}
for(i=0; i<NUMTHREADS;i++)
{
pthread_create(&thread[i], NULL, squarer, &data[i]);
}
for(i=0; i<NUMTHREADS;i++)
{
pthread_join(thread[i], NULL);
}
/* Read a random element so the compiler can't optimize away the array assignments */
srand(time(NULL));
return array[rand() % ARRAYSIZE] > 10000;
}
My result: 0.192 s
Related
The following program sorts a large array of random numbers using heapsort. The output of the program is the total execution time of the recursive heapSort function (in microseconds). The size of the input array is defined by SIZE.
The program works fine for SIZE up to 1 million (1000000). But when I try to execute it with SIZE 10 million (10000000), the program generates a segmentation fault (core dumped).
Note: I have already tried increasing the soft and hard limits of the stack with the ulimit -s command on Linux (128 MB). The SEGFAULT still persists.
Please suggest any alterations to the code or any method that will overcome the SEGFAULT without having to declare the array dynamically or as global/static.
/* Program to implement Heap-Sort algorithm */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
long SIZE = 10000000; // Or #define SIZE 10000000
long heapSize;
void swap(long *p, long *q)
{
long temp = *p;
*p = *q;
*q = temp;
}
void heapify(long A[], long i)
{
long left, right, index_of_max;
left = 2*i + 1;
right = 2*i + 2;
if(left<heapSize && A[left]>A[i])
index_of_max = left;
else
index_of_max = i;
if(right<heapSize && A[right]>A[index_of_max])
index_of_max = right;
if(index_of_max != i)
{
swap(&A[index_of_max], &A[i]);
heapify(A, index_of_max);
}
}
void buildHeap(long A[])
{
long i;
for(i=SIZE/2; i>=0 ; i--)
heapify(A,i);
}
void heapSort(long A[])
{
long i;
buildHeap(A);
for(i=SIZE-1 ; i>=1 ; i--)
{
swap(&A[i], &A[0]);
heapSize--;
heapify(A, 0);
}
}
int main()
{
long i, A[SIZE];
heapSize = SIZE;
struct timespec start, end;
srand(time(NULL));
for(i = 0; i < SIZE; i++)
A[i] = rand() % SIZE;
/*printf("Unsorted Array is:-\n");
for(i = 0; i < SIZE; i++)
printf("%li\n", A[i]);
*/
clock_gettime(CLOCK_MONOTONIC_RAW, &start);//start timer
heapSort(A);
clock_gettime(CLOCK_MONOTONIC_RAW, &end);//end timer
//To find time taken by heapsort by calculating difference between start and stop time.
unsigned long delta_us = (end.tv_sec - start.tv_sec) * 1000000 \
+ (end.tv_nsec - start.tv_nsec) / 1000;
/*printf("Sorted Array is:-\n");
for(i = 0; i < SIZE; i++)
printf("%li\n", A[i]);
*/
printf("Heapsort took %lu microseconds for sorting of %li elements\n",delta_us, SIZE);
return 0;
}
So, once you plan to stick with the stack-only approach, you have to understand who the main consumers of your stack space are.
Player #1: the array A[] itself. Depending on the OS/build, it consumes approx. 40 or 80 MB of stack (10 million longs at 4 or 8 bytes each), one time only.
Player #2: beware recursion! In your case this is the heapify() function. Each call consumes a decent chunk of stack for the calling convention, stack frame, alignment, etc. Do that millions of times in a tree-like pattern and tens of megabytes can be spent here too. So you can try to re-implement this function non-recursively to reduce the stack pressure.
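For illustration, here is a minimal non-recursive sketch of heapify() (my own sketch, not the original poster's code); it reuses the swap() helper and the global heapSize from the question and sifts down with a loop instead of recursion, so it needs only O(1) extra stack:
/* Iterative sift-down: same logic as the recursive heapify(), but as a loop */
void heapify_iterative(long A[], long i)
{
    for (;;) {
        long left = 2*i + 1;
        long right = 2*i + 2;
        long index_of_max = i;
        if (left < heapSize && A[left] > A[index_of_max])
            index_of_max = left;
        if (right < heapSize && A[right] > A[index_of_max])
            index_of_max = right;
        if (index_of_max == i)
            break;                      /* heap property restored */
        swap(&A[index_of_max], &A[i]);
        i = index_of_max;               /* continue down the affected subtree */
    }
}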
I'm new to MPI and I'm trying to develop a non-blocking program (with Isend and Irecv). The functionality is very basic (it's educational):
There is one process (rank 0) that is the master and receives messages from the slaves (ranks 1 to P-1). The master only receives results.
The slaves generate an array of N random numbers between 0 and R and then do some operations with those numbers (again, it's just for educational purposes; the operations don't make any sense).
This whole process (operations + sending data) is done M times (this is just for comparing different implementations, blocking and non-blocking).
I get a segmentation fault in the master process when I call the MPI_Waitall() function.
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <math.h>
#include <time.h>
#define M 1000 //Number of times
#define N 2000 //Quantity of random numbers
#define R 1000 //Max value of random numbers
double SumaDeRaices (double*);
int main(int argc, char* argv[]) {
int yo; /* rank of process */
int p; /* number of processes */
int dest; /* rank of receiver */
/* Start up MPI */
MPI_Init(&argc, &argv);
/* Find out process rank */
MPI_Comm_rank(MPI_COMM_WORLD, &yo);
/* Find out number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Request reqs[p-1];
MPI_Status stats[p-1];
if (yo == 0) {
int i,j;
double result;
clock_t inicio,fin;
inicio = clock();
for(i = 0; i<M; i++){ //M times
for(j = 1; j<p; j++){ //for every slave
MPI_Irecv(&result, sizeof(double), MPI_DOUBLE, j, i, MPI_COMM_WORLD, &reqs[j-1]);
}
MPI_Waitall(p-1,reqs,stats); //wait all slaves (SEG_FAULT)
}
fin = clock()-inicio;
printf("Tiempo total de ejecucion %f segundos \n", ((double)fin)/CLOCKS_PER_SEC);
}
else {
double* numAleatorios = (double*) malloc( sizeof(double) * ((double) N) ); //array with numbers
int i,j;
double resultado;
dest=0;
for(i=0; i<M; i++){ //again, M times
for(j=0; j<N; j++){
numAleatorios[j] = rand() % R ;
}
resultado = SumaDeRaices(numAleatorios);
MPI_Isend(&resultado,sizeof(double), MPI_DOUBLE, dest, i, MPI_COMM_WORLD,&reqs[p-1]); //send result to master
}
}
/* Shut down MPI */
MPI_Finalize();
exit(0);
} /* main */
double SumaDeRaices (double* valores){
int i;
double sumaTotal = 0.0;
//Square roots of the values and their sum
for(i=0; i<N; i++){
sumaTotal = sqrt(valores[i]) + sumaTotal;
}
return sumaTotal;
}
There are several issues with your code. First and foremost, in your Isend you pass &resultado several times without waiting until the previous non-blocking operation has finished. You are not allowed to reuse the buffer you pass to Isend before you have made sure the operation is complete.
Instead, I recommend using a normal Send: in contrast to a synchronous send (Ssend), a normal blocking send returns as soon as the buffer can be reused.
Second, there is no need to use message tags. I recommend just setting the tag to 0; in terms of performance it is simply faster.
Third, the result shouldn't be a single variable but an array of size at least (p-1).
Fourth, I do not recommend allocating arrays such as MPI_Request and MPI_Status on the stack if the size is not a known small number. Here the array size is (p-1), so you are better off using malloc for this data structure.
Fifth, if you do not check the statuses, use MPI_STATUSES_IGNORE.
Also, instead of sizeof(double) you should specify the number of items (1).
But of course the best option is simply to use MPI_Gather.
Moreover, there is generally no reason not to run the computations on the root node as well.
Here is a slightly rewritten example:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <math.h>
#include <time.h>
#define M 1000 //Number of times
#define N 2000 //Quantity of random numbers
#define R 1000 //Max value of random numbers
double SumaDeRaices (double* valores)
{
int i;
double sumaTotal = 0.0;
//Square roots of the values and their sum
for(i=0; i<N; i++) {
sumaTotal = sqrt(valores[i]) + sumaTotal;
}
return sumaTotal;
}
int main(int argc, char* argv[]) {
int yo; /* rank of process */
int p; /* number of processes */
/* Start up MPI */
MPI_Init(&argc, &argv);
/* Find out process rank */
MPI_Comm_rank(MPI_COMM_WORLD, &yo);
/* Find out number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &p);
double *result;
clock_t inicio, fin;
double *numAleatorios;
if (yo == 0) {
inicio = clock();
}
numAleatorios = (double*) malloc( sizeof(double) * ((double) N) ); //array with numbers
result = (double *) malloc(sizeof(double) * p);
for(int i = 0; i<M; i++){ //M times
for(int j=0; j<N; j++) {
numAleatorios[j] = rand() % R ;
}
double local_result = SumaDeRaices(numAleatorios);
MPI_Gather(&local_result, 1, MPI_DOUBLE, result, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD); //send result to master
}
if (yo == 0) {
fin = clock()-inicio;
printf("Tiempo total de ejecucion %f segundos \n", ((double)fin)/CLOCKS_PER_SEC);
}
free(numAleatorios);
/* Shut down MPI */
MPI_Finalize();
} /* main */
I have the following code that fills an array using multiple threads:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define MAX_ITEMS 67108864
#define LINES_PER_THREAD 8388608
#define THREADS 8
static int *array;
static pthread_t pids[THREADS];
static int args[THREADS];
static void init_array_line(int *line) {
int i, max;
i = *line;
max = i + LINES_PER_THREAD;
for (i; i < max; i++)
array[i] = rand() % 10000 + 1;
}
static void init_array() {
int i;
for ( i = 0; i < THREADS; i++) {
args[i]=i* LINES_PER_THREAD;
pthread_create(pids + i, NULL, &init_array_line, args + i);;
}
}
static wait_all() {
for (int i = 0; i < THREADS; i++) {
pthread_join(pids[i], NULL);
}
}
int
main(int argc, char **argv)
{
array = (int *)malloc(MAX_ITEMS * sizeof(int));
init_array();
wait_all();
}
I am giving each thread 1/8 of the array (LINES_PER_THREAD entries) to fill, but it seems to take longer than filling it sequentially. Any suggestions as to why this might be?
I suspect the main bottleneck is the calls to rand(). rand() is not required to be thread-safe, so it can't be used safely in a multi-threaded program when multiple threads could call rand() concurrently. The glibc implementation uses an internal lock to protect against such use. This effectively serializes the calls to rand() across all threads and thus severely hurts the multi-threaded nature of your program. Instead, use rand_r(), which doesn't maintain any internal state (the callers do) and can at least solve this aspect of your problem.
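For illustration, here is a minimal sketch of how the line-filling thread function could look with rand_r() and a per-thread seed (an assumption on my part, not your exact code; the seed choice is arbitrary):
/* Each thread owns its seed, so rand_r() involves no shared state or hidden lock */
static void *init_array_line(void *arg)
{
    int start = *(int *)arg;
    unsigned int seed = (unsigned int)start + 1;   /* hypothetical per-thread seed */
    for (int i = start; i < start + LINES_PER_THREAD; i++)
        array[i] = rand_r(&seed) % 10000 + 1;
    return NULL;
}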
In general, if the threads don't do sufficient work, the thread creation/synchronization overhead can outweigh the concurrency gains that would otherwise be available on a multi-core system.
I am currently learning about threads. In a little experiment I used pthreads to compute the product of two matrices with multiple threads, and I found that the multithreaded version takes even more time than the single-threaded one. I have tried enlarging the matrices; the single-threaded version still performs better.
Here are my test code and the results:
Single Thread:
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#define M 3
#define K 2
#define N 3
int A[M][K]={{1,4},{2,5},{3,6}};
int B[K][N]={{8,7,6},{5,4,3}};
int C[M][N];
int main()
{
int begin = clock();
int result = 0;
int i,j,m;
for(i=0;i<M;i++)
for(j=0;j<N;j++){
for(m=0;m<K;m++){
result+=A[i][m]*B[m][j];
}
C[i][j]=result;
result = 0;
}
int end = clock();
printf("time cost:%d\n",end-begin);
for(i=0;i<M;i++){
for(j=0;j<N;j++){
printf("%d ", C[i][j]);
}
printf("\n");
}
}
result:
time cost:1
28 23 18
41 34 27
54 45 36
Multithread:
#include <stdio.h>
#include <pthread.h>
#include <malloc.h>
#include <sys/time.h>
#define M 3
#define K 2
#define N 3
/*structure for passing data to thread*/
struct v
{
int i;
/*row*/
int j;
/*column*/
};
void create_and_pass(struct v *data);
void* runner(void *param);
int A[M][K]={{1,4},{2,5},{3,6}};
int B[K][N]={{8,7,6},{5,4,3}};
int C[M][N];
int main(int argc, char* argv[])
{
/*We have to create M*N worker threads*/
int begin = clock();
int i,j;
for(i=0;i<M;i++)
for(j=0;j<N;j++){
struct v *data = (struct v *)malloc(sizeof(struct v));
data->i=i;
data->j=j;
/*Now create the thread passing it data as a parameter*/
create_and_pass(data);
}
int end = clock();
printf("花费时间:%d\n",end-begin);
for(i=0;i<M;i++){
for(j=0;j<N;j++){
printf("%d ",C[i][j]);
}
printf("\n");
}
}
void create_and_pass(struct v *data)
{
pthread_t tid;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_create(&tid,&attr,runner,(void *)data);
pthread_join(tid,NULL);
}
void* runner(void *param)
{
struct v *data = param;
int result = 0;
int m;
for(m=0;m<K;m++)
result+=A[data->i][m]*B[m][data->j];
C[data->i][data->j]=result;
return NULL;
}
result:
time cost:1163
28 23 18
41 34 27
54 45 36
Please help, thanks.
The main thread creates and starts the worker thread, and then immediately joins it. Joining is a blocking operation, meaning that no other thread is started until this one finishes. Effectively, the execution is sequential, with all the overhead of memory allocation, thread creation, etc. added on top.
It is also unlikely that you would see any gain on such a small data set.
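For comparison, here is a rough sketch of how the main loop could be restructured so the workers actually run concurrently: create all M*N threads first, then join them afterwards. It reuses runner() and struct v from the question and is only a sketch, not a drop-in replacement:
/* Create every worker first ... */
pthread_t tids[M][N];
int i, j;
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++) {
        struct v *data = malloc(sizeof(struct v));
        data->i = i;
        data->j = j;
        pthread_create(&tids[i][j], NULL, runner, data);
    }
/* ... and only then wait for all of them */
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        pthread_join(tids[i][j], NULL);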
EDIT TO QUESTION: Is it possible to have thread-safe access to a bit array? My implementation below seems to require mutex locks, which defeats the purpose of parallelizing.
I've been tasked with creating a parallel implementation of a twin prime generator using pthreads. I decided to use the Sieve of Eratosthenes and to divide the work of marking the multiples of known primes, staggering which multiples each thread gets.
For example, if there are 4 threads:
thread one marks multiples 3, 11, 19, 27...
thread two marks multiples 5, 13, 21, 29...
thread three marks multiples 7, 15, 23, 31...
thread four marks multiples 9, 17, 25, 33...
I skipped the even multiples as well as the even base numbers. I've used a bit array, so I run it up to INT_MAX. The problem I have is that at a max value of 10 million the result differs by about 5 numbers from a known-good file, and the discrepancy persists down to a max value of about 10000, where it is off by 1 number. Anything below that is error-free.
At first I didn't think there was a need for communication between the threads. When I saw the results, I added a pthread barrier to let all the threads catch up after each set of multiples. This didn't make any difference. Adding a mutex lock around the mark() function did the trick, but that slows everything down.
Here is my code. Hoping someone might see something obvious.
#include <pthread.h>
#include <stdio.h>
#include <sys/times.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <string.h>
#include <limits.h>
#include <getopt.h>
#define WORDSIZE 32
struct t_data{
int *ba;
unsigned int val;
int num_threads;
int thread_id;
};
pthread_mutex_t mutex_mark;
/* Original mark(), no locking (kept for reference):
void mark( int *ba, unsigned int k )
{
ba[k/32] |= 1 << (k%32);
}
*/
/* EDIT: mark() with a mutex lock around the bit update */
void mark( int *ba, unsigned int k )
{
pthread_mutex_lock(&mutex_mark);
ba[k/32] |= 1 << (k%32);
pthread_mutex_unlock(&mutex_mark);
}
/* Bit-test counterpart of mark(), used by the functions below */
int isMarked( int *ba, unsigned int k )
{
return (ba[k/32] >> (k%32)) & 1;
}
void initBa(int **ba, unsigned int val)
{
*ba = calloc((val/WORDSIZE)+1, sizeof(int));
}
void getPrimes(int *ba, unsigned int val)
{
int i, p;
p = -1;
for(i = 3; i<=val; i+=2){
if(!isMarked(ba, i)){
if(++p == 8){
printf(" \n");
p = 0;
}
printf("%9d", i);
}
}
printf("\n");
}
void markTwins(int *ba, unsigned int val)
{
int i;
for(i=3; i<=val; i+=2){
if(!isMarked(ba, i)){
if(isMarked(ba, i+2)){
mark(ba, i);
}
}
}
}
void *setPrimes(void *arg)
{
int *ba, thread_id, num_threads, status;
unsigned int val, i, p, start;
struct t_data *data = (struct t_data*)arg;
ba = data->ba;
thread_id = data->thread_id;
num_threads = data->num_threads;
val = data->val;
start = (2*(thread_id+2))-1; // stagger threads
i=3;
for(i=3; i<=sqrt(val); i+=2){
if(!isMarked(ba, i)){
p=start;
while(i*p <= val){
mark(ba, (i*p));
p += (2*num_threads);
}
}
}
return 0;
}
void usage(char *filename)
{
printf("Usage: \t%s [option] [arg]\n", filename);
printf("\t-q generate #'s internally only\n");
printf("\t-m [size] maximum size twin prime to calculate\n");
printf("\t-c [threads] number of threads\n");
printf("Defaults:\n\toutput results\n\tsize = INT_MAX\n\tthreads = 1\n");
}
int main(int argc, char **argv)
{
int *ba, i, num_threads, opt, output;
unsigned int val;
output = 1;
num_threads = 1;
val = INT_MAX;
while ((opt = getopt(argc, argv, "qm:c:")) != -1){
switch (opt){
case 'q': output = 0;
break;
case 'm': val = atoi(optarg);
break;
case 'c': num_threads = atoi(optarg);
break;
default:
usage(argv[0]);
exit(EXIT_FAILURE);
}
}
struct t_data data[num_threads];
pthread_t thread[num_threads];
pthread_attr_t attr;
pthread_mutex_init(&mutex_mark, NULL);
initBa(&ba, val);
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for(i=0; i < num_threads; i++){
data[i].ba = ba;
data[i].thread_id = i;
data[i].num_threads = num_threads;
data[i].val = val;
if(0 != pthread_create(&thread[i],
&attr,
setPrimes,
(void*)&data[i])){
perror("Cannot create thread");
exit(EXIT_FAILURE);
}
}
for(i = 0; i < num_threads; i++){
pthread_join(thread[i], NULL);
}
markTwins(ba, val);
if(output)
getPrimes(ba, val);
free(ba);
return 0;
}
EDIT: I got rid of the barrier and added a mutex lock to the mark function. The output is accurate now, but using more than one thread slows it down. Any suggestions on speeding it up?
Your current implementation of mark is correct, but the locking is extremely coarse-grained: there is only one lock for your entire array. This means that your threads are constantly contending for that lock.
One way of improving performance is to make the lock finer-grained: each 'mark' operation only requires exclusive access to a single integer within the array, so you could have a mutex for each array entry:
struct bitarray
{
int *bits;
pthread_mutex_t *locks;
};
struct t_data
{
struct bitarray ba;
unsigned int val;
int num_threads;
int thread_id;
};
void initBa(struct bitarray *ba, unsigned int val)
{
const size_t array_size = val / WORDSIZE + 1;
size_t i;
ba->bits = calloc(array_size, sizeof ba->bits[0]);
ba->locks = calloc(array_size, sizeof ba->locks[0]);
for (i = 0; i < array_size; i++)
{
pthread_mutex_init(&ba->locks[i], NULL);
}
}
void mark(struct bitarray ba, unsigned int k)
{
const unsigned int entry = k / 32;
pthread_mutex_lock(&ba.locks[entry]);
ba.bits[entry] |= 1 << (k%32);
pthread_mutex_unlock(&ba.locks[entry]);
}
Note that your algorithm has a race-condition: consider the example where num_threads = 4, so Thread 0 starts at 3, Thread 1 starts at 5 and Thread 2 starts at 7. It is possible for Thread 2 to execute fully, marking every multiple of 7 and then start again at 15, before Thread 0 or Thread 1 get a chance to mark 15 as a multiple of 3 or 5. Thread 2 will then do useless work, marking every multiple of 15.
Another alternative, if your compiler supports Intel-style atomic builtins, is to use those instead of a lock:
void mark(int *ba, unsigned int k)
{
__sync_or_and_fetch(&ba[k/32], 1U << k % 32);
}
Your mark() function is not thread-safe: if two threads try to set bits within the same int location, one might overwrite with 0 a bit that was just set by the other thread.
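For what it's worth, here is a minimal sketch of a lock-free fix using C11 atomics, equivalent in spirit to the __sync builtin shown in the other answer. It assumes a C11 compiler with <stdatomic.h> and that the bit array is declared as atomic_uint instead of int:
#include <stdatomic.h>

/* The fetch-or makes the read-modify-write of the word atomic,
   so concurrent marks in the same word cannot overwrite each other */
void mark(atomic_uint *ba, unsigned int k)
{
    atomic_fetch_or(&ba[k/32], 1u << (k%32));
}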