Why is multithreading slower than a single thread? (Linux, C, using pthread)

I have a question about multithreading.
This code is a simple example comparing a single thread against multiple threads
(summing 0 to 400,000,000 with a single thread vs. four threads).
//Single
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   /* for clock_gettime / struct timespec */
#define NUM_THREAD 4
#define MY_NUM 100000000
void* calcThread(void* param);
double total = 0;
double sum[NUM_THREAD] = { 0, };

int main() {
    long p[NUM_THREAD] = { MY_NUM, MY_NUM * 2, MY_NUM * 3, MY_NUM * 4 };
    int i;
    long total_nstime;
    struct timespec begin, end;
    pthread_t tid[NUM_THREAD];
    pthread_attr_t attr[NUM_THREAD];

    clock_gettime(CLOCK_MONOTONIC, &begin);
    for (i = 0; i < NUM_THREAD; i++) {
        calcThread((void*)p[i]);   /* run the four chunks one after another */
    }
    for (i = 0; i < NUM_THREAD; i++) {
        total += sum[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("total = %lf\n", total);
    total_nstime = (end.tv_sec - begin.tv_sec) * 1000000000 + (end.tv_nsec - begin.tv_nsec);
    printf("%.3fs\n", (float)total_nstime / 1000000000);
    return 0;
}

void* calcThread(void* param) {
    int i;
    long to = (long)(param);
    int from = to - MY_NUM + 1;   /* each chunk covers MY_NUM consecutive integers */
    int th_num = from / MY_NUM;   /* 0..3: which slot of sum[] this chunk uses */

    for (i = from; i <= to; i++)
        sum[th_num] += i;
    return NULL;
}
I wanted to change it to use four threads, so I changed main to run the calculation function on multiple threads:
...
int main() {
    ...
    //create threads
    for (i = 0; i < NUM_THREAD; i++) {
        pthread_attr_init(&attr[i]);
        pthread_create(&tid[i], &attr[i], calcThread, (void*)p[i]);
    }
    //wait for all threads to finish
    for (i = 0; i < NUM_THREAD; i++) {
        pthread_join(tid[i], NULL);
    }
    for (i = 0; i < NUM_THREAD; i++) {
        total += sum[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    ...
}
Result (in Ubuntu): the timing output shows the multithreaded version taking longer.
But it's slower than the single-threaded code, and I know multithreading is faster.
I have no idea what's wrong :( Could you give me some advice? Thanks a lot!

"I know MultiThread is faster"
This isn't always the case, as generally you would be CPU bound in some way, whether that be due to core count, how it is scheduled at the OS level, and hardware level.
It is a balance how many threads is worth giving to a process, as you may run into an old Linux problem where you would be spending more time scheduling the processes than actually running them.
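A quick way to sanity-check the thread count against the hardware is to ask how many cores are actually online (a minimal sketch using POSIX sysconf; compare the result against NUM_THREAD):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* cores currently online; running many more compute-bound threads
       than this mostly buys scheduling overhead */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online cores: %ld\n", cores);
    return 0;
}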
As this is very hardware- and OS-dependent, it is difficult to say exactly what the issue is, but make sure you have the appropriate microcode for your CPU installed (generally installed by default in Ubuntu). Just in case, try:
sudo apt-get install intel-microcode
Otherwise, look at what other processes are running; a lot of other things may be scheduled onto the same cores your threads are allocated.
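One more thing worth checking, since it comes up again in a later question on this page: all four threads write to adjacent elements of sum[], which likely share a cache line, so each write invalidates the other cores' copies (false sharing). A minimal sketch of calcThread accumulating into a local variable instead (same types as in the question):

void* calcThread(void* param) {
    int i;
    long to = (long)(param);
    int from = to - MY_NUM + 1;
    int th_num = from / MY_NUM;
    double local = 0;        /* private accumulator: no shared cache line in the loop */

    for (i = from; i <= to; i++)
        local += i;
    sum[th_num] = local;     /* a single store at the end */
    return NULL;
}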

Related

Improving a simple function using threading

I have written a simple function with the following code that calculates the minimum number from a one-dimensional array:
uint32_t get_minimum(const uint32_t* matrix) {
    int min = 0;
    min = matrix[0];
    for (ssize_t i = 0; i < g_elements; i++){
        if (min > matrix[i]){
            min = matrix[i];
        }
    }
    return min;
}
However, I wanted to improve the performance of this function and was advised to use threads, so I modified it to the following:
struct minargument{
    const uint32_t* matrix;
    ssize_t tid;
    long long results;
};

static void *minworker(void *arg){
    struct minargument *argument = (struct minargument *)arg;
    const ssize_t start = argument->tid * CHUNK;
    const ssize_t end = argument->tid == THREADS - 1 ? g_elements : (argument->tid + 1) * CHUNK;
    long long result = argument->matrix[0];
    for(ssize_t i = start; i < end; i++){
        for(ssize_t x = 0; x < g_elements; x++){
            if(result > argument->matrix[i]){
                result = argument->matrix[i];
            }
        }
    }
    argument->results = result;
    return NULL;
}

uint32_t get_minimum(const uint32_t* matrix) {
    struct minargument *args = malloc(sizeof(struct minargument) * THREADS);
    long long min = 0;
    for(ssize_t i = 0; i < THREADS; i++){
        args[i] = (struct minargument){
            .matrix = matrix,
            .tid = i,
            .results = min,
        };
    }
    pthread_t thread_ids[THREADS];
    for(ssize_t i = 0; i < THREADS; i++){
        if(pthread_create(thread_ids + i, NULL, minworker, args + i) != 0){
            perror("pthread_create failed");
            return 1;
        }
    }
    for(ssize_t i = 0; i < THREADS; i++){
        if(pthread_join(thread_ids[i], NULL) != 0){
            perror("pthread_join failed");
            return 1;
        }
    }
    for(ssize_t i = 0; i < THREADS; i++){
        min = args[i].results;
    }
    free(args);
    return min;
}
However this seems to be slower than the first function.
Am I correct in using threads to make the first function run faster? And if so, how do I modify the second function so that it is faster than the first function?
Having more threads than cores available to run them on is always going to be slower than a single thread due to the overhead of creating them, scheduling them and waiting for them all to finish.
The example you provide is unlikely to benefit from any optimisation beyond what the compiler will already do for you, as it is a short and simple operation. If you were doing something more complicated on a multi-core system, such as multiplying two huge matrices or running a correlation algorithm on high-speed real-time data, then multi-threading may be the solution.
A more abstract answer to your question is another question: do you really need to be optimising it at all? Unless you know for a fact that there are performance issues, your time would be better spent adding more functionality to your program than fixing a problem that doesn't really exist.
Edit - Comparison
I just ran (a representative version of) the OP's code on a 16-bit ARM microcontroller with a 40 MHz instruction clock. The code was compiled using GCC with no optimisation.
Finding the minimum of 20,000 32-bit integers took a little over 25 milliseconds.
With a 40 kByte page (enough to hold half of a 20,000-element array of 4-byte values) and threads running on different cores of a dual Intel 5150 processor clocked at 2.67 GHz, it takes nearly 50 ms just to do the context-switch and paging operations!
A simple, single-threaded microcontroller implementation takes half as long, in real-time terms, as a multi-threaded desktop implementation.
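For what it's worth, the threaded version in the question also has two bugs that dwarf any threading overhead: the inner x loop repeats every comparison g_elements times, and the final loop in get_minimum overwrites min instead of keeping the smallest result. A minimal sketch of a corrected worker and reduction (assuming THREADS, CHUNK and g_elements are defined as in the question and every chunk is non-empty):

static void *minworker(void *arg){
    struct minargument *a = (struct minargument *)arg;
    const ssize_t start = a->tid * CHUNK;
    const ssize_t end = a->tid == THREADS - 1 ? g_elements : start + CHUNK;
    uint32_t best = a->matrix[start];   /* scan this thread's chunk exactly once */

    for(ssize_t i = start + 1; i < end; i++)
        if(a->matrix[i] < best)
            best = a->matrix[i];
    a->results = best;
    return NULL;
}

/* ...and in get_minimum, after joining, reduce instead of overwriting: */
long long min = args[0].results;
for(ssize_t i = 1; i < THREADS; i++)
    if(args[i].results < min)
        min = args[i].results;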

clock() returns 0 even with roughly a 5 second pause when using C

Given the code below:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

void firstSequence()
{
    int columns = 999999;
    int rows = 400000;
    int **matrix;
    int j;
    int counter = 0;
    matrix = (int **)malloc(columns*sizeof(int*));
    for(j = 0; j < columns; j++)
    {
        matrix[j] = (int*)malloc(rows*sizeof(int));
    }
    for(counter = 1; counter < columns; counter++)
    {
        free(matrix[counter]);
    }
}

void secondSequence()
{
    int columns = 111;
    int rows = 600000;
    int **matrix;
    int j;
    matrix = (int **)malloc(columns*sizeof(int*));
    for(j = 0; j < columns; j++)
    {
        matrix[j] = (int*)malloc(rows*sizeof(int));
    }
}

int main()
{
    long t1;
    long t2;
    long diff;
    t1 = clock();
    firstSequence();
    t2 = clock();
    diff = (t2-t1) * 1000.0 / CLOCKS_PER_SEC;
    printf("%f",t2);
    t1 = clock();
    secondSequence();
    t2 = clock();
    diff = (t2-t1) * 1000.0 / CLOCKS_PER_SEC;
    printf("%f",diff);
    return(0);
}
I need to be able to see how long it takes for both sequence one and sequence two to run. However, in both cases I get 0 as the time elapsed. From looking online I have seen that this can be an issue, but I do not know how to fix it.
You display the time incorrectly, so even if your functions take more than 0 ms, the call to printf() invokes undefined behaviour.
printf("%f",diff);
%f is used to display doubles. You probably want to use %ld.
If your functions really do take 0 ms to execute, then a simple way to measure the time of one call is to call the function multiple times, enough to be measurable, and take the average of the total elapsed time.
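For instance (a sketch; N and work() stand in for a repetition count and the function under test):

#include <stdio.h>
#include <time.h>

#define N 1000   /* enough repetitions to be measurable */

static void work(void) { /* stand-in for the function under test */ }

int main(void)
{
    clock_t t1 = clock();
    for (int i = 0; i < N; i++)
        work();
    clock_t t2 = clock();
    /* average CPU time per call, in milliseconds */
    printf("%f ms per call\n", (double)(t2 - t1) * 1000.0 / CLOCKS_PER_SEC / N);
    return 0;
}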
clock() is not a suitable function for measuring how long a program takes.
You should use clock_gettime instead (see a detailed explanation of clock_gettime).
Simple usage:
struct timespec start, end;
clock_gettime(CLOCK_REALTIME, &start);
for (int i = 0; i < 10000; i++) {
    f1();
}
clock_gettime(CLOCK_REALTIME, &end);
/* elapsed time in nanoseconds */
printf("time elapsed = %f\n",
       (double)((end.tv_sec - start.tv_sec) * 1000000000LL + (end.tv_nsec - start.tv_nsec)));
PS: when compiling on Linux, remember to link with -lrt.
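For example (prog.c being a placeholder file name):

gcc -Wall prog.c -lrt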

Not sure if I need a mutex or not

I'm new to concurrent programming, so be nice. I have a basic sequential program (which is for homework) and I'm attempting to turn it into a multithreaded program. I'm not sure if I need a lock for my second shared variable: the threads will modify it but never read it. The only time count is read is after the loop that spawns all of my threads has finished distributing keys.
#define ARRAYSIZE 50000
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

void binary_search(int *array, int key, int min, int max);
int count = 0;              // count of intersections
int l_array[ARRAYSIZE * 2]; // array to check for intersection

int main(void)
{
    int r_array[ARRAYSIZE]; // array of keys
    int ix = 0;
    struct timeval start, stop;
    double elapsed;

    for(ix = 0; ix < ARRAYSIZE; ix++)
    {
        r_array[ix] = ix;
    }
    for(ix = 0; ix < ARRAYSIZE * 2; ix++)
    {
        l_array[ix] = ix + 500;
    }
    gettimeofday(&start, NULL);
    for(ix = 0; ix < ARRAYSIZE; ix++)
    {
        //this is where I will spawn off separate threads
        binary_search(l_array, r_array[ix], 0, ARRAYSIZE * 2 - 1);
    }
    //wait for all threads to finish computation, then proceed.
    fprintf(stderr, "%d\n", count);
    gettimeofday(&stop, NULL);
    elapsed = ((stop.tv_sec - start.tv_sec) * 1000000 + (stop.tv_usec - start.tv_usec)) / 1000000.0;
    printf("time taken is %f seconds\n", elapsed);
    return 0;
}

void binary_search(int *array, int key, int min, int max)
{
    int mid = 0;
    if (max < min)
        return;
    mid = (min + max) / 2;
    if (array[mid] > key)
        binary_search(array, key, min, mid - 1);
    else if (array[mid] < key)
        binary_search(array, key, mid + 1, max);
    else
    {
        //this is where I'm not sure if I need a lock or not
        count++;
    }
}
As you suspect, count++; requires synchronization. This is actually not something you should try to "get away with" not doing. Sooner or later a second thread will read count after the first thread reads it but before it increments it. Then you will miss a count. It is impossible to predict how often it will happen. It could happen once in a blue moon or thousands of times a second.
Actually, the code as you've written it does both read and modify the variable. If you were to look at the machine code that gets generated for a line like
count++
you'd see that it consists of something like
fetch count into register
increment register
store count
So yes, you should use a mutex there. (And even if you could get away without doing so, why not take the chance to practice?)
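A minimal sketch of the locked increment (matching the question's global count; the helper name is just for illustration):

#include <pthread.h>

int count = 0;
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* call this instead of the bare count++ in binary_search */
void increment_count(void)
{
    pthread_mutex_lock(&count_lock);
    count++;
    pthread_mutex_unlock(&count_lock);
}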
If you simply want accurate increments of count across multiple threads, single-value updates like this are precisely what the interlocked (atomic, memory-barrier) functions are for.
For this I would use __sync_add_and_fetch if you're using gcc. There are a host of different interlocked operations you can do, most of them platform-specific, so check your documentation. For updating counters like this, however, they can save a heap of hassle. Other examples include InterlockedIncrement under Windows, OSAtomicIncrement32 on OS X, etc.
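For example (a sketch, with count declared as in the question):

/* atomically performs count += 1 and returns the new value;
   no mutex is needed for a simple counter */
__sync_add_and_fetch(&count, 1);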

Why is multithreading slower than sequential programming in my case?

I'm new to multithreading and am trying to learn it through a simple program, which adds 1 to n and returns the sum. In the sequential case, main calls the sumFrom1 function twice, for n = 10000 and 20000; in the multithreaded case, two threads are created using pthread_create and the two sums are calculated in separate threads. The multithreaded version is much slower than the sequential version (see results below). I run this on a 12-CPU platform and there is no communication between threads.
Multithreaded:
Thread 1 returns: 0
Thread 2 returns: 0
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 156 seconds
Sequential:
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 56 seconds
When I add -O2 in compilation, the time of the multithreaded version (9 s) is less than that of the sequential version (11 s), but not by as much as I expected. I can always have the -O2 flag on, but I'm curious about the low speed of multithreading in the unoptimized case. Should it be slower than the sequential version? If not, what can I do to make it faster?
The code:
#include <stdio.h>
#include <pthread.h>
#include <time.h>

typedef struct my_struct
{
    int n;
    int sum;
} my_struct_t;

void *sumFrom1(void* sit)
{
    my_struct_t* local_sit = (my_struct_t*) sit;
    int i;
    int nsim = 500000;  // Loops for consuming time
    int j;
    for(j = 0; j < nsim; j++)
    {
        local_sit->sum = 0;
        for(i = 0; i <= local_sit->n; i++)
            local_sit->sum += i;
    }
}

int main(int argc, char *argv[])
{
    pthread_t thread1;
    pthread_t thread2;
    my_struct_t si1;
    my_struct_t si2;
    int iret1;
    int iret2;
    time_t t1;
    time_t t2;
    si1.n = 10000;
    si2.n = 20000;
    if(argc == 2 && atoi(argv[1]) == 1)  // Use "./prog 1" to test the time of the multithreaded version
    {
        t1 = time(0);
        iret1 = pthread_create(&thread1, NULL, sumFrom1, (void*)&si1);
        iret2 = pthread_create(&thread2, NULL, sumFrom1, (void*)&si2);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
        t2 = time(0);
        printf("Thread 1 returns: %d\n", iret1);
        printf("Thread 2 returns: %d\n", iret2);
        printf("sum of 1..%d: %d\n", si1.n, si1.sum);
        printf("sum of 1..%d: %d\n", si2.n, si2.sum);
        printf("time: %d seconds", t2 - t1);
    }
    else  // Use "./prog" to test the time of the sequential version
    {
        t1 = time(0);
        sumFrom1((void*)&si1);
        sumFrom1((void*)&si2);
        t2 = time(0);
        printf("sum of 1..%d: %d\n", si1.n, si1.sum);
        printf("sum of 1..%d: %d\n", si2.n, si2.sum);
        printf("time: %d seconds", t2 - t1);
    }
    return 0;
}
UPDATE1:
After a little googling on "false sharing" (Thanks, #Martin James!), I think it is the main cause. There are (at least) two ways to fix it:
The first way is inserting a buffer zone between the two structs (Thanks, #dasblinkenlight):
my_struct_t si1;
char memHolder[4096];
my_struct_t si2;
Without -O2, the run time drops from ~156 s to ~38 s.
The second way is to avoid frequently updating sit->sum, which can be done with a temporary variable in sumFrom1 (as #Jens Gustedt replied):
int sum = 0;
for(int j = 0; j < nsim; j++)
{
    sum = 0;
    for(i = 0; i <= local_sit->n; i++)
        sum += i;
}
local_sit->sum = sum;
Without -O2, the run time drops from ~156 s to ~35 s or ~109 s (it has two peaks! I don't know why). With -O2, the run time stays around ~8 s.
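A third way (just a sketch, assuming 64-byte cache lines) is to give each struct its own cache line with C11's alignas instead of padding by hand:

#include <stdalign.h>

/* in main(): each struct now starts on its own 64-byte cache line,
   so the two threads' stores can no longer hit the same line */
alignas(64) my_struct_t si1;
alignas(64) my_struct_t si2;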
By modifying your code to
typedef struct my_struct
{
    size_t n;
    size_t sum;
} my_struct_t;

void *sumFrom1(void* sit)
{
    my_struct_t* local_sit = sit;
    size_t nsim = 500000;  // Loops for consuming time
    size_t n = local_sit->n;
    size_t sum = 0;
    for(size_t j = 0; j < nsim; j++)
    {
        for(size_t i = 0; i <= n; i++)
            sum += i;
    }
    local_sit->sum = sum;
    return 0;
}
the phenomenon disappears. The problems you had:
Using int as the datatype is completely wrong for such a test: your figures were such that the sum overflowed, and overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
Having the bound and the summation variable behind an indirection buys you additional loads and stores, which with -O0 are really performed as such, with all the implications of false sharing and the like.
Your code also has other errors:
a missing include for atoi
superfluous casts to and from void*
printing of time_t as int
Please compile your code with -Wall before posting.
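For example (prog.c standing in for your source file):

gcc -Wall -O2 -pthread prog.c -o prog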

Getting timing consistency in Linux

I can't seem to get a simple program (with lots of memory access) to achieve consistent timing in Linux. I'm using a 2.6 kernel, and the program is being run on a dual-core processor with realtime priority. I'm trying to disable cache effects by declaring the memory arrays as volatile. Below are the results and the program. What are some possible sources of the outliers?
Results:
Number of trials: 100
Range: 0.021732s to 0.085596s
Average Time: 0.058094s
Standard Deviation: 0.006944s
Extreme Outliers (2 SDs away from mean): 7
Average Time, excluding extreme outliers: 0.059273s
Program:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sched.h>
#include <sys/time.h>

#define NUM_POINTS 5000000
#define REPS 100

unsigned long long getTimestamp() {
    unsigned long long usecCount;
    struct timeval timeVal;
    gettimeofday(&timeVal, 0);
    usecCount = timeVal.tv_sec * (unsigned long long) 1000000;
    usecCount += timeVal.tv_usec;
    return (usecCount);
}

double convertTimestampToSecs(unsigned long long timestamp) {
    return (timestamp / (double) 1000000);
}

int main(int argc, char* argv[]) {
    unsigned long long start, stop;
    double times[REPS];
    double sum = 0;
    double scale, avg, newavg, median;
    double stddev = 0;
    double maxval = -1.0, minval = 1000000.0;
    int i, j, freq, count;
    int outliers = 0;
    struct sched_param sparam;

    sched_getparam(getpid(), &sparam);
    sparam.sched_priority = sched_get_priority_max(SCHED_FIFO);
    sched_setscheduler(getpid(), SCHED_FIFO, &sparam);

    volatile float* data;
    volatile float* results;
    data = calloc(NUM_POINTS, sizeof(float));
    results = calloc(NUM_POINTS, sizeof(float));

    for (i = 0; i < REPS; ++i) {
        start = getTimestamp();
        for (j = 0; j < NUM_POINTS; ++j) {
            results[j] = data[j];
        }
        stop = getTimestamp();
        times[i] = convertTimestampToSecs(stop - start);
    }

    free(data);
    free(results);

    for (i = 0; i < REPS; i++) {
        sum += times[i];
        if (times[i] > maxval)
            maxval = times[i];
        if (times[i] < minval)
            minval = times[i];
    }
    avg = sum/REPS;

    for (i = 0; i < REPS; i++)
        stddev += (times[i] - avg)*(times[i] - avg);
    stddev /= REPS;
    stddev = sqrt(stddev);

    for (i = 0; i < REPS; i++) {
        if (times[i] > avg + 2*stddev || times[i] < avg - 2*stddev) {
            sum -= times[i];
            outliers++;
        }
    }
    newavg = sum/(REPS-outliers);

    printf("Number of trials: %d\n", REPS);
    printf("Range: %fs to %fs\n", minval, maxval);
    printf("Average Time: %fs\n", avg);
    printf("Standard Deviation: %fs\n", stddev);
    printf("Extreme Outliers (2 SDs away from mean): %d\n", outliers);
    printf("Average Time, excluding extreme outliers: %fs\n", newavg);
    return 0;
}
Make sure you have no other processes taking CPU time. Watch out in particular for screen savers and anything that regularly does GUI updates (e.g. a clock or similar). Try setting CPU affinity for your benchmark process to lock it onto one core (e.g. with taskset from the command line). Make sure your benchmark process is not paging: typically you want an outer loop that runs N times, and you time only the last N-1 executions.
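For example, pinning to core 0, either from the shell with taskset or from inside the program with the Linux-specific sched_setaffinity (a minimal sketch; the program name is a placeholder):

taskset -c 0 ./benchmark

or:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* restrict this process to core 0 */
    if (sched_setaffinity(0, sizeof set, &set) != 0)
        perror("sched_setaffinity");
    /* ... run the benchmark loop here ... */
    return 0;
}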
