Analyzing overhead created by context-switching process using time-slice - c

I'm trying to write a C program on Linux that calculates the total number of matrix operations performed by each process.
The input should be as follows:
./(file name).c 3 3
The output should be as follows:
Creating Process: #0
Creating Process: #1
Creating Process: #2
Process #0X Count = XXX XXXX
Process #0X Count = XXX XXXX
Process #0X Count = XXX XXXX
...
The first problem I'm facing is merging the following two pieces of code:
1. RR scheduling code
2. Actual matrix execution code
RR Scheduling code:
int main() {
int result;
struct sched_attr attr;
//reset
memset(&attr, 0, sizeof(attr));
attr.size = sizeof(struct sched_attr);
attr.sched_priority = 10;
attr.sched_policy = SCHED_RR;
//Scheduling property
result = sched_setattr(getpid(), &attr, 0);
if (result == -1) {
perror("Error calling sched_setattr.");
}
}
Actual Matrix execution code
#define ROW (100)
#define COL ROW
int calc(int time, int cpu) {
int matrixA[ROW][COL];
int matrixB[ROW][COL];
int matrixC[ROW][COL];
int i, j, k;
cpuid = cpu;
while(1) {
for (i=0; i < ROW; i++) {
for (j=0; j < COL; j++) {
for (k=0; k < COL; k++) {
matrixC[i][j] += matrixA[i][k] * matrixB[k][j];
}
}
}
count++;
if(count%100 == 0) printf("PROCESS #%02d count = %02ld\n", cpuid, count);
}
return 0;
}
This is my first time using Linux and programming in C, and the code above is the best I have so far.
Any help would be much appreciated!
To summarize, I have written two programs:
one with the RT-RR scheduling code
another with a simple matrix execution
I want to merge these two programs.

Related

"Primary job terminated normally but 1 process returned a non-zero exit code" only in some cases

Good afternoon. I've developed a 2D FFT in MPI for scientific purposes.
Everything worked until I introduced MPI_Scatterv.
Since then something odd has been happening: as long as I stay below 64 modes I have no problems, but when I go above that I get the message:
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
>--------------------------------------------------------------------------
>mpiexec noticed that process rank 0 with PID 0 on node MacBook-Pro-di-Mirco
>exited on signal 11 (Segmentation fault: 11).
I can't figure out where the mistake is, but I'm pretty sure it is in MPI_Scatterv.
Could anyone help me, please?
/********************************** Setup factors for scattering **********************************/
// Alloc the arrays
int* displs = (int *)malloc(size*sizeof(int));
int* scounts = (int *)malloc(size*sizeof(int));
int* receive = (int *)malloc(size*sizeof(int));
// Setup matrix
int modes_per_proc[size];
for (int i = 0; i < size; i++){
modes_per_proc[i] = 0;
}
// Set modes per processor
cores_handler( nx*nz, size, modes_per_proc);
// Scattering parameters
for (int i=0; i<size; ++i) {
scounts[i] = modes_per_proc[i]*ny*2;
receive[i] = scounts[i];
displs[i] = displs[i-1] + modes_per_proc[i-1] *ny*2; // *2 to handle complex numbers
if (i == 0 ) displs[0] = 0;
}
/************************************************ Data scattering ***********************************************/
MPI_Scatterv(U, scounts, displs, MPI_DOUBLE, u, receive[rank] , MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
The cores_handler function:
void cores_handler( int modes, int size, int modes_per_proc[size]) {
int rank =0;
int check=0;
for (int i = 0; i < modes; i++) {
modes_per_proc[rank] = modes_per_proc[rank]+1;
rank = rank+1;
if (rank == size ) rank = 0;
}
for (int i = 0; i < size; i++){
//printf("%d modes on rank %d\n", modes_per_proc[i], i);
check = check+modes_per_proc[i];
}
if ( (int)(check - modes) != 0 ) {
printf("[ERROR] check - modes = %d!!\nUnable to scatter modes properly\nAbort... \n", check - modes);
}
}

Looping a Function in C

I have a school project that requires me to simulate first come, first served (FCFS) scheduling using these variables:
Users Input:
Number of Process: 3
Process 1 Arrives at 0 time and requires 5 'resources'
3
1,5,0
2,5,4
3,1,8
However, I can't seem to get past the first 5 'resources'. I'm trying to figure out how to increase the PID and repeat while keeping the time increasing across all of these resources. I've already written this program, but it only works for this specific input, and I'm trying to make it more versatile so I can choose any number of processes and any number of resources (units) needed.
#include <stdio.h>
main() {
int n;
printf("Enter the Amount of processes: ");
scanf("%d",&n);
//Variables
int process[n], unit[n], at[n];
int i,time,PID = 1;
int awt, atat,sum,counter;
int x = n;
//Takes and stores the users input into process unit and at
for(i=0;i<n;i++)
{
scanf("%d,%d,%d", &process[i], &unit[i], &at[i]);
}
sum = sum_array(unit,n);
printf("%d\n", sum);
printf("FCFS\n");
printf("Time PID");
for(counter = 0; counter < x; counter++, PID++){
FCFS(time,n,unit,PID);
}
}
int sum_array(int at[], int num_elements){
int x, sum = 0;
for(x=0; x<num_elements;x++){
sum = sum + at[x];
}
return(sum);
}
int FCFS(int time,int n,int unit[], int PID){
for(time = 0, n = 0 ; unit[n] >0 ;time++, unit[n]--){
printf("\n%d ", time);
printf("%d", PID);
}
return;
}
Sample Output:
FCFS
TIME PID
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
Your problems are mostly related to the FCFS function and the loop where you call it.
Try the following:
Initialize time = 0 in the main function
Pass counter instead of n to FCFS in the loop
Return the updated time from FCFS
Don't reset the time and n parameter inside FCFS
Call to FCFS inside for loop:
time = FCFS(time, counter, unit, PID);
Updated FCFS code:
int FCFS(int time,int n,int unit[], int PID)
{
for( ; unit[n] >0 ;time++, unit[n]--)
{
printf("\n%d ", time);
printf("%d", PID);
}
return time;
}
Beyond that, there are a number of other issues with your code, but listing them all wouldn't really fit into this Q/A, so I'll stick to what is necessary to get your code running for valid example input.
Since this is a homework question, I would encourage you to solve it on your own. Since you have put some effort into solving this, I am posting the answer below as a spoiler (note that indentation does not work in spoilers). However, before looking at the answer, here are a few suggestions to fix your program:
As mentioned above, passing n does absolutely nothing. Please use a different variable inside the FCFS function.
No need to increment and pass the PID. Since you are putting it in an array, try to get the value from the array.
Instead of n pass counter to the function so that you can index the two arrays.
The for loop inside FCFS makes no sense. It should be for(i=0; i<unit[counter]; i++). time can just be incremented inside the loop.
time needs to be returned to increment properly
And my code:
int time = 0;
int cur_index = 0;
while (cur_index < n) {
int pid = -1;
if (at[cur_index] <= time) {
pid = process[cur_index];
} else {
printf("%d %d\n", time, pid);
time++;
continue;
}
if (pid != -1) {
int r = 0;
for (r = 0; r < unit[cur_index]; r++) {
printf("%d %d\n", time, pid);
time++;
}
cur_index++; // move on to the next process; without this the while loop never ends
}
}

user time increase in multi-cpu job

I am running the following code:
When I run this code with 1 child process, I get the following timing information:
(I run it using /usr/bin/time ./job 1)
5.489u 0.090s 0:05.58 99.8% (1 job running)
When I run it with 6 child processes, I get the following:
74.731u 0.692s 0:12.59 599.0% (6 jobs running in parallel)
The machine I am running the experiment on has 6 cores and 198 GB of RAM, and nothing else is running on it.
I was expecting the reported user time to be 6 times higher with 6 jobs running in parallel, but it is much more than that (13.6 times). My question is: where does this increase in user time come from? Is it because the cores are jumping from one memory location to another more frequently when 6 jobs run in parallel? Or is there something else I am missing?
Thanks
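#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#define MAX_SIZE 7000000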
#define LOOP_COUNTER 100
#define simple_struct struct _simple_struct
simple_struct {
int n;
simple_struct *next;
};
#define ALLOCATION_SPLIT 5
#define CHAIN_LENGTH 1
void do_function3(void)
{
int i = 0, j = 0, k = 0, l = 0;
simple_struct **big_array = NULL;
simple_struct *temp = NULL;
big_array = calloc(MAX_SIZE + 1, sizeof(simple_struct*));
for(k = 0; k < ALLOCATION_SPLIT; k ++) {
for(i =k ; i < MAX_SIZE; i +=ALLOCATION_SPLIT) {
big_array[i] = calloc(1, sizeof(simple_struct));
if((CHAIN_LENGTH-1)) {
for(l = 1; l < CHAIN_LENGTH; l++) {
temp = calloc(1, sizeof(simple_struct));
temp->next = big_array[i];
big_array[i] = temp;
}
}
}
}
for (j = 0; j < LOOP_COUNTER; j++) {
for(i=0 ; i < MAX_SIZE; i++) {
if(big_array[i] == NULL) {
big_array[i] = calloc(1, sizeof(simple_struct));
}
big_array[i]->n = i * 13;
temp = big_array[i]->next;
while(temp) {
temp->n = i*13;
temp = temp->next;
}
}
}
}
int main(int argc, char **argv)
{
int i, no_of_processes = 0;
pid_t pid, wpid;
int child_done = 0;
int status;
if(argc != 2) {
printf("usage: this_binary number_of_processes");
return 0;
}
no_of_processes = atoi(argv[1]);
for(i = 0; i < no_of_processes; i ++) {
pid = fork();
switch(pid) {
case -1:
printf("error forking");
exit(-1);
case 0:
do_function3();
return 0;
default:
printf("\nchild %d launched with pid %d\n", i, pid);
break;
}
}
while(child_done != no_of_processes) {
wpid = wait(&status);
child_done++;
printf("\nchild done with pid %d\n", wpid);
}
return 0;
}
Firstly, your benchmark is a bit unusual. Normally, when benchmarking concurrent applications, one would compare two implementations:
A single-threaded version solving a problem of size S;
A multithreaded version with N threads, cooperatively solving the problem of size S; in your case, each solving a problem of size S/N.
Then you divide the execution times to obtain the speedup.
If your speedup is:
Around 1: the parallel implementation has similar performance to the single-threaded implementation;
Higher than 1 (usually between 1 and N): parallelizing the application increases performance;
Lower than 1: parallelizing the application hurts performance.
The effect on performance depends on a variety of factors:
How well your algorithm can be parallelized. See Amdahl's law. Does not apply here.
Overhead in inter-thread communication. Does not apply here.
Overhead in inter-thread synchronization. Does not apply here.
Contention for CPU resources. Should not apply here (since the number of threads is equal to the number of cores). However HyperThreading might hurt.
Contention for memory caches. Since the threads do not share memory, this will decrease performance.
Contention for accesses to main memory. This will decrease performance.
You can measure the last 2 with a profiler. Look for cache misses and stalled instructions.

Multithreading pthread errors

I'm trying to create a multithreaded application in C for Linux with the pthreads library that approximates pi using an infinite series with N+1 terms. The values N and T are passed on the command line. I am using the Nilakantha approximation formula for pi. N is the upper limit of the number sequence to sum and T is the number of child threads that calculate that sum. For example, if I run the command "./pie 100 4", the parent thread will create 4 child threads indexed 0 to 3. I have a global variable called vsum, a double array allocated dynamically with malloc, to hold the values. So with 4 threads and 100 as the upper bound, my program should compute:
Thread 0 computes the partial sum for i going from 0 to 24 stored to an element vsum[0]
Thread 1 computes the partial sum for i going from 25 to 49 stored to an element vsum[1]
Thread 2 computes the partial sum for i going from 50 to 74 stored to an element vsum[2]
Thread 3 computes the partial sum for i going from 75 to 99 stored to an element vsum[3]
After each thread has finished its calculation, the main thread will compute the total by adding together all values from vsum[0] to vsum[T-1].
I'm just starting to learn about threads and processes. Any help or advice would be appreciated. Thank you.
Code I wrote so far:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
double *vsum;
int N, T;
void *PI(void *sum) //takes param sum and gets close to pi
{
int upper = (int)sum;
double pi = 0;
int k = 1;
for (int i = (N/T)*upper; i <= (N/T)*(upper+1)-1; i++)
{
pi += k*4/((2*i)*(2*i+1)*(2*i+2));
if(i = (N/T)*(upper+1)-1)
{
vsum[upper] = pi;
}
k++;
}
pthread_exit(0);
}
int main(int argc, char*argv[])
{
T = atoi(argv[2]);
N = atoi(argv[1]);
if (N<T)
{
fprintf(stderr, "Upper bound(N) < # of threads(T)\n");
return -1;
}
int pie = 0;
pthread_t tid[T]; //thread identifier
pthread_attr_t attr; //thread attributes
vsum = (double *)malloc(sizeof(double));//creates dyn arr
//Initialize vsum to [0,0...0]
for (int i = 0; i < T; i++){
{
vsum[i] = 0;
}
if(argc!=2) //command line does not give proper # of values
{
fprintf(stderr, "usage: commandline error <integer values>\n");
return -1;
}
if (atoi(argv[1]) <0) //if its is negative/sum error
{
fprintf(stderr, "%d must be >=0\n", atoi(argv[1]));
return -1;
}
//CREATE A LOOP THAT MAKES PARAM N #OF THREADS
pthread_attr_init(&attr);
for(int j =0; j < T;j++)
{
int from = (N/T)*j;
int to = (N/T)*(j+1)-1;
//CREATE ARRAY VSUM TO HOLD VALUES FOR PI APPROX.
pthread_create(&tid[j],&attr,PI,(void *)j);
printf("Thread %d computes the partial sum for i going from %d to %d stored to an element vsum[%d]\n", j, from, to, j);
}
//WAITS FOR THREADS TO FINISH
for(int j =0; j <T; i++)
{
pthread_join(tid[j], NULL);
}
//LOOP TO ADD ALL THE vsum array values to get pi approximation
for(int i = 0; i < T; i++)
{
pie += vsum[i];
}
pie = pie +3;
printf("pi computed with %d terms in %d threads is %d\n",N,T,pie);
vsum = realloc(vsum, 0);
pthread_exit(NULL);
return 0;
}
Here is the error I get from my program that I don't understand. What am I missing here?
^
pie.c:102:1: error: expected declaration or statement at end of input
}
When I try to run my program I get the following:
./pie.c: line 6: double: command not found
./pie.c: line 7: int: command not found
./pie.c: line 8: int: command not found
./pie.c: line 10: syntax error near unexpected token `('
./pie.c: line 10: `void *PI(void *sum) //takes param sum and gets close to pi'
I haven't looked at the logic of your code, but I see the following programming errors.
Change
pthread_create(&tid[j],&attr,PI,j);
to
pthread_create(&tid[j],&attr,PI,(void *)j);
pthread_create() takes 4th param as void * which is passed to the thread function.
Also fix your thread function PI to use passed parameter as int like
void *PI(void *sum) //takes param sum and gets close to pi
{
int upper = (int)sum; //don't use `atoi` as passed param is int.
...
//your existing code
}
The third error is in the line
realloc(vsum, 0);
By passing 0 to realloc, you are effectively just freeing vsum, so you can just use free(vsum). If you do want to reallocate, you should keep the newly allocated memory returned by the function, something like vsum = realloc(vsum, 0);
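The syntax of pthread_create is
pthread_create(threadId, threadAttribute, callingMethodName, parameters of calling method);
Ex:
#include <stdio.h>
#include <pthread.h>
void *printLetter(void *p) // a thread function must return void *
{
int i = 0;
char c = *(char *)p; // dereference the pointer that was passed to pthread_create
while (i < 10000)
{
printf("%c", c);
i++; // without this increment the loop never terminates
}
return NULL;
}
int main()
{
pthread_t thread_id;
char c = 'x';
pthread_create(&thread_id, NULL, printLetter, &c);
pthread_join(thread_id, NULL);
return 0;
}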

Why is multithreading slower than sequential programming in my case?

I'm new to multithreading and am trying to learn it through a simple program that sums the integers from 1 to n and returns the sum. In the sequential case, main calls the sumFrom1 function twice, for n = 1e5 and 2e5; in the multithreaded case, two threads are created using pthread_create and the two sums are calculated in separate threads. The multithreaded version is much slower than the sequential version (see results below). I run this on a 12-CPU platform and there is no communication between threads.
Multithreaded:
Thread 1 returns: 0
Thread 2 returns: 0
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 156 seconds
Sequential:
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 56 seconds
When I add -O2 at compilation, the multithreaded version (9s) is faster than the sequential version (11s), but not by as much as I expected. I can always leave the -O2 flag on, but I'm curious about the low speed of multithreading in the unoptimized case. Should it be slower than the sequential version? If not, what can I do to make it faster?
The code:
#include <stdio.h>
#include <pthread.h>
#include <time.h>
typedef struct my_struct
{
int n;
int sum;
}my_struct_t;
void *sumFrom1(void* sit)
{
my_struct_t* local_sit = (my_struct_t*) sit;
int i;
int nsim = 500000; // Loops for consuming time
int j;
for(j = 0; j < nsim; j++)
{
local_sit->sum = 0;
for(i = 0; i <= local_sit->n; i++)
local_sit->sum += i;
}
}
int main(int argc, char *argv[])
{
pthread_t thread1;
pthread_t thread2;
my_struct_t si1;
my_struct_t si2;
int iret1;
int iret2;
time_t t1;
time_t t2;
si1.n = 10000;
si2.n = 20000;
if(argc == 2 && atoi(argv[1]) == 1) // Use "./prog 1" to test the time of multithreaded version
{
t1 = time(0);
iret1 = pthread_create(&thread1, NULL, sumFrom1, (void*)&si1);
iret2 = pthread_create(&thread2, NULL, sumFrom1, (void*)&si2);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
t2 = time(0);
printf("Thread 1 returns: %d\n",iret1);
printf("Thread 2 returns: %d\n",iret2);
printf("sum of 1..%d: %d\n", si1.n, si1.sum);
printf("sum of 1..%d: %d\n", si2.n, si2.sum);
printf("time: %d seconds", t2 - t1);
}
else // Use "./prog" to test the time of sequential version
{
t1 = time(0);
sumFrom1((void*)&si1);
sumFrom1((void*)&si2);
t2 = time(0);
printf("sum of 1..%d: %d\n", si1.n, si1.sum);
printf("sum of 1..%d: %d\n", si2.n, si2.sum);
printf("time: %d seconds", t2 - t1);
}
return 0;
}
UPDATE1:
After a little googling on "false sharing" (Thanks, @Martin James!), I think it is the main cause. There are (at least) two ways to fix it:
The first way is inserting a buffer zone between the two structs (Thanks, @dasblinkenlight):
my_struct_t si1;
char memHolder[4096];
my_struct_t si2;
Without -O2, the running time decreases from ~156s to ~38s.
The second way is to avoid updating sit->sum frequently, which can be done with a temporary variable in sumFrom1 (as @Jens Gustedt replied):
int sum = 0; // declared outside the loop so it is still in scope afterwards
for(j = 0; j < nsim; j++)
{
sum = 0;
for(i = 0; i <= local_sit->n; i++)
sum += i;
}
local_sit->sum = sum;
Without -O2, the running time decreases from ~156s to ~35s or ~109s (It has two peaks! I don't know why.). With -O2, the running time stays at ~8s.
By modifying your code to
typedef struct my_struct
{
size_t n;
size_t sum;
}my_struct_t;
void *sumFrom1(void* sit)
{
my_struct_t* local_sit = sit;
size_t nsim = 500000; // Loops for consuming time
size_t n = local_sit->n;
size_t sum = 0;
for(size_t j = 0; j < nsim; j++)
{
for(size_t i = 0; i <= n; i++)
sum += i;
}
local_sit->sum = sum;
return 0;
}
the phenomenon disappears. The problems you had:
using int as a datatype is completely wrong for such a test. Your figures were such that the sum overflowed. Overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
having bounds and summation variables with indirection buys you additional loads and stores, which at -O0 are really performed as such, with all the implications of false sharing and the like.
Your code also has other errors:
a missing include for atoi
superfluous casts to and from void*
printing of time_t as int
Please compile your code with -Wall before posting.
