User time increase in multi-CPU job - C

I am running the following code:
When I run it with 1 child process (using /usr/bin/time ./job 1), I get the following timing information:
5.489u 0.090s 0:05.58 99.8% (1 job running)
When I run it with 6 child processes, I get:
74.731u 0.692s 0:12.59 599.0% (6 jobs running in parallel)
The machine I am running the experiment on has 6 cores and 198 GB of RAM, and nothing else is running on it.
I was expecting the user time to be about 6 times higher with 6 jobs running in parallel, but it is much more than that (13.6 times). My question is: where does this increase in user time come from? Is it because multiple cores jump from one memory location to another more frequently when 6 jobs run in parallel, or is there something else I am missing?
Thanks
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define MAX_SIZE 7000000
#define LOOP_COUNTER 100

#define simple_struct struct _simple_struct
simple_struct {
    int n;
    simple_struct *next;
};

#define ALLOCATION_SPLIT 5
#define CHAIN_LENGTH 1

void do_function3(void)
{
    int i = 0, j = 0, k = 0, l = 0;
    simple_struct **big_array = NULL;
    simple_struct *temp = NULL;

    big_array = calloc(MAX_SIZE + 1, sizeof(simple_struct *));
    for (k = 0; k < ALLOCATION_SPLIT; k++) {
        for (i = k; i < MAX_SIZE; i += ALLOCATION_SPLIT) {
            big_array[i] = calloc(1, sizeof(simple_struct));
            if (CHAIN_LENGTH - 1) {
                for (l = 1; l < CHAIN_LENGTH; l++) {
                    temp = calloc(1, sizeof(simple_struct));
                    temp->next = big_array[i];
                    big_array[i] = temp;
                }
            }
        }
    }

    for (j = 0; j < LOOP_COUNTER; j++) {
        for (i = 0; i < MAX_SIZE; i++) {
            if (big_array[i] == NULL) {
                big_array[i] = calloc(1, sizeof(simple_struct));
            }
            big_array[i]->n = i * 13;
            temp = big_array[i]->next;
            while (temp) {
                temp->n = i * 13;
                temp = temp->next;
            }
        }
    }
}
int main(int argc, char **argv)
{
    int i, no_of_processes = 0;
    pid_t pid, wpid;
    int child_done = 0;
    int status;

    if (argc != 2) {
        printf("usage: this_binary number_of_processes");
        return 0;
    }
    no_of_processes = atoi(argv[1]);

    for (i = 0; i < no_of_processes; i++) {
        pid = fork();
        switch (pid) {
        case -1:
            printf("error forking");
            exit(-1);
        case 0:
            do_function3();
            return 0;
        default:
            printf("\nchild %d launched with pid %d\n", i, pid);
            break;
        }
    }

    while (child_done != no_of_processes) {
        wpid = wait(&status);
        child_done++;
        printf("\nchild done with pid %d\n", wpid);
    }
    return 0;
}

Firstly, your benchmark is a bit unusual. Normally, when benchmarking concurrent applications, one would compare two implementations:
A single-threaded version solving a problem of size S;
A multi-threaded version with N threads cooperatively solving the same problem of size S; in your case, each process solving a sub-problem of size S/N.
Then you divide the execution times to obtain the speedup.
If your speedup is:
Around 1: the parallel implementation has similar performance as the single thread implementation;
Higher than 1 (usually between 1 and N), parallelizing the application increases performance;
Lower than 1: parallelizing the application hurts performance.
The effect on performance depends on a variety of factors:
How well your algorithm can be parallelized. See Amdahl's law. Does not apply here.
Overhead in inter-process communication. Does not apply here.
Overhead in inter-process synchronization. Does not apply here.
Contention for CPU resources. This should not apply here, since the number of processes equals the number of cores; however, Hyper-Threading might hurt.
Contention for memory caches. Since the processes do not share memory, each needs its own copy of the data in cache, so they compete for cache space; this will decrease performance.
Contention for accesses to main memory. This will also decrease performance.
You can measure the last 2 with a profiler. Look for cache misses and stalled instructions.

Related

Analyzing overhead created by context-switching process using time-slice

I'm trying to make a C program in Linux environment, which requires to implement the function of calculating the total number of matrix operations performed by each process.
The input should be as follows:
./(file name).c 3 3
The output should be as follows:
Creating Process: #0
Creating Process: #1
Creating Process: #2
Process #0X Count = XXX XXXX
Process #0X Count = XXX XXXX
Process #0X Count = XXX XXXX
...
The first problem I'm facing is merging the following two pieces of code:
1. RR scheduling code
2. Actual matrix execution code
RR Scheduling code:
int main() {
    int result;
    struct sched_attr attr;

    /* reset */
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(struct sched_attr);
    attr.sched_priority = 10;
    attr.sched_policy = SCHED_RR;

    /* Set the scheduling policy. Note: glibc provides no sched_setattr()
       wrapper; it has to be invoked via syscall(SYS_sched_setattr, ...). */
    result = sched_setattr(getpid(), &attr, 0);
    if (result == -1) {
        perror("Error calling sched_setattr.");
    }
}
Actual Matrix execution code
#define ROW (100)
#define COL ROW

static int cpuid;   /* these globals were not shown in the snippet */
static long count;

int calc(int time, int cpu) {
    int matrixA[ROW][COL];
    int matrixB[ROW][COL];
    int matrixC[ROW][COL];
    int i, j, k;

    cpuid = cpu;
    while (1) {
        for (i = 0; i < ROW; i++) {
            for (j = 0; j < COL; j++) {
                for (k = 0; k < COL; k++) {
                    matrixC[i][j] += matrixA[i][k] * matrixB[k][j];
                }
            }
        }
        count++;
        if (count % 100 == 0) printf("PROCESS #%02d count = %02ld\n", cpuid, count);
    }
    return 0;
}
This is my first time using a Linux environment and programming in C, and the code above is the best I've got so far.
Any help would be much appreciated!
To summarize, I wrote two programs:
one with the RT-RR scheduling code,
another with a simple matrix execution,
and I want to merge these two.

Improving a simple function using threading

I have written a simple function with the following code that calculates the minimum number from a one-dimensional array:
uint32_t get_minimum(const uint32_t* matrix) {
    uint32_t min = matrix[0];
    for (ssize_t i = 0; i < g_elements; i++) {
        if (min > matrix[i]) {
            min = matrix[i];
        }
    }
    return min;
}
However, I wanted to improve the performance of this function and was advised to use threads, so I modified it to the following:
struct minargument {
    const uint32_t* matrix;
    ssize_t tid;
    long long results;
};

static void *minworker(void *arg) {
    struct minargument *argument = (struct minargument *)arg;
    const ssize_t start = argument->tid * CHUNK;
    const ssize_t end = argument->tid == THREADS - 1 ? g_elements : (argument->tid + 1) * CHUNK;
    long long result = argument->matrix[0];
    for (ssize_t i = start; i < end; i++) {
        for (ssize_t x = 0; x < g_elements; x++) {
            if (result > argument->matrix[i]) {
                result = argument->matrix[i];
            }
        }
    }
    argument->results = result;
    return NULL;
}

uint32_t get_minimum(const uint32_t* matrix) {
    struct minargument *args = malloc(sizeof(struct minargument) * THREADS);
    long long min = 0;
    for (ssize_t i = 0; i < THREADS; i++) {
        args[i] = (struct minargument){
            .matrix = matrix,
            .tid = i,
            .results = min,
        };
    }
    pthread_t thread_ids[THREADS];
    for (ssize_t i = 0; i < THREADS; i++) {
        if (pthread_create(thread_ids + i, NULL, minworker, args + i) != 0) {
            perror("pthread_create failed");
            return 1;
        }
    }
    for (ssize_t i = 0; i < THREADS; i++) {
        if (pthread_join(thread_ids[i], NULL) != 0) {
            perror("pthread_join failed");
            return 1;
        }
    }
    for (ssize_t i = 0; i < THREADS; i++) {
        min = args[i].results;
    }
    free(args);
    return min;
}
However, this seems to be slower than the first function.
Am I correct in using threads to make the first function run faster? And if so, how do I modify the second function so that it is faster than the first?
Having more threads than cores available to run them on is usually going to be slower than a single thread, due to the overhead of creating them, scheduling them and waiting for them all to finish.
The example you provide is unlikely to benefit from any optimisation beyond what the compiler will do for you, as it is a short and simple operation. If you were doing something more complicated on a multi-core system, such as multiplying two huge matrices or running a correlation algorithm on high-speed real-time data, then multi-threading may be the solution.
A more abstract answer to your question is another question: do you really need to optimise it at all? Unless you know for a fact that there are performance issues, your time would be better spent adding more functionality to your program than fixing a problem that doesn't really exist.
Edit - Comparison
I just ran (a representative version of) the OP's code on a 16-bit ARM microcontroller running with a 40 MHz instruction clock. The code was compiled using GCC with no optimisation.
Finding the minimum of 20,000 32-bit integers took a little over 25 milliseconds.
With a 40 kByte page size (to hold half of an array of 20,000 4-byte values) and threads running on different cores of a dual Intel 5150 processor clocked at 2.67 GHz, it takes nearly 50 ms just to do the context switching and paging!
A simple, single-threaded microcontroller implementation takes half as long in real time as a multi-threaded desktop implementation.
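For completeness: if the threaded version is to be correct at all, each worker should scan only its own chunk in a single pass (the inner loop over g_elements in the question makes the work quadratic), and the final loop must combine the partial minima instead of overwriting them. A sketch with fixed, made-up THREADS and G_ELEMENTS values standing in for the globals that were not shown:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define THREADS 4            /* stand-ins for the globals that were */
#define G_ELEMENTS 1000000   /* not shown in the question           */
#define CHUNK (G_ELEMENTS / THREADS)

struct minarg {
    const uint32_t *data;
    size_t tid;
    uint32_t result;
};

static void *minworker(void *p)
{
    struct minarg *a = p;
    size_t start = a->tid * CHUNK;
    size_t end = (a->tid == THREADS - 1) ? G_ELEMENTS : start + CHUNK;
    uint32_t m = a->data[start];             /* seed from this chunk, not element 0 */
    for (size_t i = start + 1; i < end; i++) /* one pass; no inner loop over g_elements */
        if (a->data[i] < m)
            m = a->data[i];
    a->result = m;
    return NULL;
}

uint32_t get_minimum(const uint32_t *data)
{
    pthread_t tids[THREADS];
    struct minarg args[THREADS];
    for (size_t i = 0; i < THREADS; i++) {
        args[i] = (struct minarg){ .data = data, .tid = i };
        pthread_create(&tids[i], NULL, minworker, &args[i]);
    }
    for (size_t i = 0; i < THREADS; i++)
        pthread_join(tids[i], NULL);
    uint32_t m = args[0].result;             /* combine the partial minima */
    for (size_t i = 1; i < THREADS; i++)
        if (args[i].result < m)
            m = args[i].result;
    return m;
}
```

Even with these fixes, the point of the answer stands: for a single linear scan, thread creation and joining will usually dominate unless the array is very large.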

Unsure why this is generating so many threads

I'm running some tests on a short program I've written. It runs another program which performs some file operations based on the inputs I give it. The whole purpose of this program is to break a large packet of work into smaller packets to increase performance (sending 10 smaller packets to 10 versions of the program instead of waiting for the one larger one to execute, simple divide and conquer).
The problem lies in the fact that, while I believe I have limited the number of threads that will be created, the testing messages I have set up indicate that there are many more threads running than there should be. I'm really uncertain what I did wrong here.
Code snippet:
if (finish != start){
    if (sizeOfBlock != 0){
        num_threads = (finish - start)/sizeOfBlock + 1;
    }
    else{
        num_threads = (finish - start) + 1;
    }
    if (num_threads > 10){ // this should limit threads to 10 at most
        num_threads == 10;
    }
    else if (finish == start){
        num_threads = 1;
    }
}
threads = (pthread_t *) malloc(num_threads * sizeof(pthread_t));
for (i = 0; i < num_threads; i++){
    printf("Creating thread %d\n", i);
    s = pthread_create(&threads[i], NULL, thread, &MaxNum);
    if (s != 0)
        printf("error in pthread_create\n");
    if (s == 0)
        activethreads++;
}
while (activethreads > 0){
    //printf("active threads: %d\n", activethreads);
}
pthread_exit(0);
This code is useless:
if (num_threads > 10){ // this should limit threads to 10 at most
num_threads == 10
}
num_threads == 10 compares num_threads with 10 and throws the result away. You want an assignment instead:
if (num_threads > 10){ // this should limit threads to 10 at most
num_threads = 10;
}
Also, there are numerous ; missing from your code. In the future, please try to provide a self-contained example that compiles.

Dividing processes evenly among threads

I am trying to come up with an algorithm to divide a number of processes as evenly as possible over a number of threads. Each process takes the same amount of time.
The number of processes can vary, from 1 to 1 million. The threadCount is fixed, and can be anywhere from 4 to 48.
The code below divides all the work evenly, except for the last thread, which gets whatever is left over.
Is there a way to fix this so that the work is spread more evenly?
int main(void)
{
    int processBegin[100];
    int processEnd[100];
    int activeProcessCount = 6243;
    int threadCount = 24;
    int processsInBundle = (int) (activeProcessCount / threadCount);
    int processBalance = activeProcessCount - (processsInBundle * threadCount);

    for (int i = 0; i < threadCount; ++i)
    {
        processBegin[ i ] = i * processsInBundle;
        processEnd[ i ] = (processBegin[ i ] + processsInBundle) - 1;
    }
    processEnd[ threadCount - 1 ] += processBalance;

    FILE *debug = fopen("s:\\data\\testdump\\debug.csv", WRITE); /* WRITE defined elsewhere */
    for (int i = 0; i < threadCount; ++i)
    {
        int processsInBucket = (i == threadCount - 1) ? processsInBundle + processBalance : processBegin[i+1] - processBegin[i];
        fprintf(debug, "%d,start,%d,stop,%d,processsInBucket,%d\n", activeProcessCount, processBegin[i], processEnd[i], processsInBucket);
    }
    fclose(debug);
    return 0;
}
Give each of the first activeProcessCount % threadCount threads processInBundle + 1 processes, and give the others processInBundle each.
int processInBundle = (int) (activeProcessCount / threadCount);
int processSoFar = 0;
for (int i = 0; i < activeProcessCount % threadCount; i++){
    processBegin[i] = processSoFar;
    processSoFar += processInBundle + 1;
    processEnd[i] = processSoFar - 1;
}
for (int i = activeProcessCount % threadCount; i < threadCount; i++){
    processBegin[i] = processSoFar;
    processSoFar += processInBundle;
    processEnd[i] = processSoFar - 1;
}
That's the same problem as trying to divide 5 pennies among 3 people: it's impossible unless you can cut the pennies in half.
Also, even if all processes need the same amount of theoretical runtime, that doesn't mean they will execute in the same amount of time, due to kernel scheduling, cache performance and various other hardware-related factors.
To suggest some performance optimisations:
Use dynamic scheduling, i.e. split your work into batches (which can be of size 1) and have your threads take one batch at a time: run it, then take the next. This way the threads will always be working until all the batches are gone.
A more advanced variant is to start with a big batch size (commonly numwork/numthreads) and decrease it each time a thread takes work out of the pool. OpenMP refers to this as guided scheduling.
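The dynamic-scheduling suggestion can be sketched with C11 atomics: a shared counter hands out fixed-size batches of indices, and each thread keeps pulling batches until none remain. NWORK, NTHREADS and BATCH are made-up values (NWORK matching the question's activeProcessCount):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NWORK 6243     /* e.g. activeProcessCount from the question */
#define NTHREADS 4     /* made-up thread count */
#define BATCH 64       /* fixed batch size; guided scheduling would shrink it */

static atomic_int next_item;
static atomic_long done_items;

/* Each thread grabs the next batch of indices until the work runs out. */
static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        int start = atomic_fetch_add(&next_item, BATCH);
        if (start >= NWORK)
            return NULL;
        int end = start + BATCH < NWORK ? start + BATCH : NWORK;
        for (int i = start; i < end; i++) {
            /* ... process work item i here ... */
            atomic_fetch_add(&done_items, 1);
        }
    }
}

/* Run the pool once and return how many items were processed. */
long run_pool(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&done_items);
}
```

Because atomic_fetch_add hands each batch to exactly one thread, no static split is needed and a slow thread simply takes fewer batches.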

Why is multithreading slower than sequential programming in my case?

I'm new to multithreading and am trying to learn it through a simple program, which adds 1 to n and returns the sum. In the sequential case, main calls the sumFrom1 function twice, for n = 1e4 and 2e4; in the multithreaded case, two threads are created using pthread_create and the two sums are calculated in separate threads. The multithreaded version is much slower than the sequential version (see results below). I run this on a 12-CPU platform and there is no communication between threads.
Multithreaded:
Thread 1 returns: 0
Thread 2 returns: 0
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 156 seconds
Sequential:
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 56 seconds
When I compile with -O2, the multithreaded version's time (9 s) is less than the sequential version's (11 s), but not by as much as I expect. I can always keep the -O2 flag on, but I'm curious about the low speed of multithreading in the unoptimized case. Should it be slower than the sequential version? If not, what can I do to make it faster?
The code:
#include <stdio.h>
#include <pthread.h>
#include <time.h>

typedef struct my_struct
{
    int n;
    int sum;
} my_struct_t;

void *sumFrom1(void* sit)
{
    my_struct_t* local_sit = (my_struct_t*) sit;
    int i;
    int nsim = 500000; // Loops for consuming time
    int j;
    for(j = 0; j < nsim; j++)
    {
        local_sit->sum = 0;
        for(i = 0; i <= local_sit->n; i++)
            local_sit->sum += i;
    }
}

int main(int argc, char *argv[])
{
    pthread_t thread1;
    pthread_t thread2;
    my_struct_t si1;
    my_struct_t si2;
    int iret1;
    int iret2;
    time_t t1;
    time_t t2;

    si1.n = 10000;
    si2.n = 20000;

    if(argc == 2 && atoi(argv[1]) == 1) // Use "./prog 1" to test the time of the multithreaded version
    {
        t1 = time(0);
        iret1 = pthread_create(&thread1, NULL, sumFrom1, (void*)&si1);
        iret2 = pthread_create(&thread2, NULL, sumFrom1, (void*)&si2);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
        t2 = time(0);
        printf("Thread 1 returns: %d\n", iret1);
        printf("Thread 2 returns: %d\n", iret2);
        printf("sum of 1..%d: %d\n", si1.n, si1.sum);
        printf("sum of 1..%d: %d\n", si2.n, si2.sum);
        printf("time: %d seconds", t2 - t1);
    }
    else // Use "./prog" to test the time of the sequential version
    {
        t1 = time(0);
        sumFrom1((void*)&si1);
        sumFrom1((void*)&si2);
        t2 = time(0);
        printf("sum of 1..%d: %d\n", si1.n, si1.sum);
        printf("sum of 1..%d: %d\n", si2.n, si2.sum);
        printf("time: %d seconds", t2 - t1);
    }
    return 0;
}
UPDATE1:
After a little googling on "false sharing" (Thanks, #Martin James!), I think it is the main cause. There are (at least) two ways to fix it:
The first way is inserting a buffer zone between the two structs (Thanks, #dasblinkenlight):
my_struct_t si1;
char memHolder[4096];
my_struct_t si2;
Without -O2, the time consumed decreases from ~156 s to ~38 s.
The second way is to avoid frequently updating sit->sum, which can be done with a temporary variable in sumFrom1 (as @Jens Gustedt replied):
int sum;
for(j = 0; j < nsim; j++)
{
    sum = 0;
    for(i = 0; i <= local_sit->n; i++)
        sum += i;
}
local_sit->sum = sum;
Without -O2, the time consumed decreases from ~156 s to ~35 s or ~109 s (it has two peaks; I don't know why). With -O2, the time stays at ~8 s.
By modifying your code to
typedef struct my_struct
{
size_t n;
size_t sum;
}my_struct_t;
void *sumFrom1(void* sit)
{
my_struct_t* local_sit = sit;
size_t nsim = 500000; // Loops for consuming time
size_t n = local_sit->n;
size_t sum = 0;
for(size_t j = 0; j < nsim; j++)
{
for(size_t i = 0; i <= n; i++)
sum += i;
}
local_sit->sum = sum;
return 0;
}
the phenomenon disappears. The problems you had:
Using int as a datatype is completely wrong for such a test. Your figures were such that the sum overflowed; overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
Having the bound and the summation variable behind an indirection buys you additional loads and stores, which with -O0 are really performed as such, with all the implications of false sharing and the like.
Your code also had other errors:
a missing include for atoi
superfluous casts to and from void*
printing a time_t as an int
Please compile your code with -Wall before posting.
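An alternative to the hand-inserted memHolder buffer from UPDATE1 is to give each struct its own cache line with C11 alignas; a sketch assuming 64-byte cache lines (run_two and padded_t are illustrative names, not from the original code):

```c
#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE 64   /* assumed line size; 64 bytes is typical on x86 */

typedef struct {
    alignas(CACHE_LINE) long n;  /* alignas pads the struct to a full line */
    long sum;
} padded_t;

/* Sum 1..n, accumulating in a local and writing back once at the end. */
static void *sum_worker(void *p)
{
    padded_t *s = p;
    long sum = 0;
    for (long i = 1; i <= s->n; i++)
        sum += i;
    s->sum = sum;
    return NULL;
}

/* Run the two sums in parallel; si[0] and si[1] land on different lines. */
void run_two(long n1, long n2, long *out1, long *out2)
{
    static padded_t si[2];
    si[0].n = n1;
    si[1].n = n2;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_worker, &si[0]);
    pthread_create(&t2, NULL, sum_worker, &si[1]);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    *out1 = si[0].sum;
    *out2 = si[1].sum;
}
```

This combines both remedies: the local accumulator removes the per-iteration stores, and the alignment guarantees the two structs never share a cache line even if the surrounding layout changes.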
