Debugging a simplified version of Lamport's Bakery Algorithm in C

I'm trying to implement a simplified version of Lamport's Bakery Algorithm in C before I attempt to use it to solve a more complex problem. The simplification I am making is that the lock is shared by only two threads instead of N.
I set up two threads (via OpenMP to keep things simple) and they loop, attempting to increment a shared counter within their critical section. If everything goes according to plan, then the final counter value should be equal to the number of iterations. However, here's some example output:
count: 9371470 (expected: 10000000)
Doh! Something is broken, but what? My implementation is pretty textbook (for reference), so perhaps I'm misusing memory barriers? Did I forget to mark something as volatile?
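For reference, the textbook N-thread version I am specialising from looks roughly like this (a sketch only: memory barriers and volatile qualifiers are omitted, and NTHREADS is an assumed compile-time constant):
enum { NTHREADS = 4 };
static int choosing[NTHREADS];
static uint32_t number[NTHREADS];

static void bakery_lock(int i)
{
    uint32_t max = 0;
    int j;

    choosing[i] = 1;
    for (j = 0; j < NTHREADS; ++j)
        if (number[j] > max) max = number[j];
    number[i] = 1 + max;                     /* take a ticket */
    choosing[i] = 0;

    for (j = 0; j < NTHREADS; ++j) {
        while (choosing[j]) ;                /* wait while j is still picking a ticket */
        while (number[j] != 0 &&
               (number[j] < number[i] ||
                (number[j] == number[i] && j < i)))
            ;                                /* smaller ticket (or lower id on a tie) goes first */
    }
}

static void bakery_unlock(int i)
{
    number[i] = 0;
}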
My code:
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
#include <omp.h>

typedef struct
{
    volatile bool entering[2];
    volatile uint32_t number[2];
} SimpleBakeryLock_t;

inline void mb() { __sync_synchronize(); }

inline void lock(SimpleBakeryLock_t* l, int id)
{
    int i = id, j = !id;
    uint32_t ni, nj;

    l->entering[i] = true;
    mb();
    ni = 1 + l->number[j];
    l->number[i] = ni;
    mb();
    l->entering[i] = false;
    mb();
    while (l->entering[j]) {
        mb();
    }
    nj = l->number[j];
    mb();
    while ((nj != 0) && (nj < ni || (nj == ni && j < i)))
    {
        nj = l->number[j]; // re-read
        mb();
    }
}

inline void unlock(SimpleBakeryLock_t* l, int id)
{
    l->number[id] = 0;
    mb();
}
SimpleBakeryLock_t x;
int main(void)
{
    const uint32_t iterations = 10000000;
    uint32_t count = 0;
    bool once = false;
    int i;

    memset((void*)&x, 0, sizeof(x));
    mb();

    // set OMP_NUM_THREADS=2 in your environment!
    #pragma omp parallel for schedule(static, 1) private(once, i)
    for (uint32_t dummy = 0; dummy < iterations; ++dummy)
    {
        if (!once)
        {
            i = omp_get_thread_num();
            once = true;
        }
        lock(&x, i);
        {
            count = count + 1;
            mb();
        }
        unlock(&x, i);
    }

    printf("count: %u (expected: %u)\n", count, iterations);
    return 0;
}
To compile and run (on Linux), do:
$ gcc -O3 -fopenmp bakery.c
$ export OMP_NUM_THREADS=2
$ ./a.out
I intend to chain simple Bakery locks into a binary tree (tournament style) to achieve mutual exclusion among N threads.

I tracked down two problems and the code now works. Issues:
__sync_synchronize() was not generating the mfence instruction on my platform (Apple's GCC 4.2.1). Replacing __sync_synchronize() with an explicit mfence resolves this issue.
I was doing something wrong with the OpenMP private variables (still not entirely sure what, though private copies are uninitialized on entry to the parallel region, so a leftover nonzero value in 'once' would skip the omp_get_thread_num() call entirely). Sometimes the two threads entered the lock with the same identity (e.g. both might say they were thread 0). Recomputing 'i' with 'omp_get_thread_num' on every iteration seems to do the trick.
Here is the corrected code for completeness:
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <omp.h>

#define cpu_relax() asm volatile ("pause":::"memory")
#define mb()        asm volatile ("mfence":::"memory")

/* Simple Lamport bakery lock for two threads. */
typedef struct
{
    volatile uint32_t entering[2];
    volatile uint32_t number[2];
} SimpleBakeryLock_t;

void lock(SimpleBakeryLock_t* l, int id)
{
    int i = id, j = !id;
    uint32_t ni, nj;

    l->entering[i] = 1;
    mb();
    ni = 1 + l->number[j];
    l->number[i] = ni;
    mb();
    l->entering[i] = 0;
    mb();
    while (l->entering[j]) {
        cpu_relax();
    }
    do {
        nj = l->number[j];
    } while ((nj != 0) && (nj < ni || (nj == ni && j < i)));
}

void unlock(SimpleBakeryLock_t* l, int id)
{
    mb(); /* prevent critical section writes from leaking out over unlock */
    l->number[id] = 0;
    mb();
}
SimpleBakeryLock_t x;
int main(void)
{
    const int32_t iterations = 10000000;
    int32_t dummy;
    uint32_t count = 0;

    memset((void*)&x, 0, sizeof(x));
    mb();

    // set OMP_NUM_THREADS=2 in your environment!
    #pragma omp parallel for schedule(static, 1)
    for (dummy = 0; dummy < iterations; ++dummy)
    {
        int i = omp_get_thread_num();
        lock(&x, i);
        count = count + 1;
        unlock(&x, i);
    }

    printf("count: %u (expected: %u)\n", count, iterations);
    return 0;
}
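As for the tournament tree I mentioned in the question, a rough sketch of how the two-thread locks could be chained (illustrative only; it assumes the thread count N is a power of two and a zero-initialized global array of locks, with index 0 unused):
/* Internal nodes 1..N-1 of a binary-heap layout are two-thread locks;
   thread t conceptually starts at leaf N+t and locks its way up to the root. */
enum { N = 8 };                        /* number of threads, power of two */
SimpleBakeryLock_t tree[N];            /* index 0 unused, zero-initialized */

void tree_lock(int t)
{
    for (int c = N + t; c > 1; c /= 2)
        lock(&tree[c / 2], c & 1);     /* act as child 0 or 1 at the parent node */
}

void tree_unlock(int t)
{
    int nodes[32], ids[32], depth = 0;
    for (int c = N + t; c > 1; c /= 2) {   /* recompute the same path */
        nodes[depth] = c / 2;
        ids[depth]   = c & 1;
        ++depth;
    }
    while (depth--)                        /* release in reverse order, root first */
        unlock(&tree[nodes[depth]], ids[depth]);
}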

OpenMP: Access violation and other errors

Preface
Recently, I introduced OpenMP into our group's project code. main consists of two nested for loops: the outer controls the 'run', while the inner controls the 'generation'. Generations from different runs are completely independent, but generations within the same run depend on one another.
The idea is to parallelize the outer 'run' loop, letting each thread carry the evolution of generations forward for whichever run number it was assigned.
The Problem
When setting OMP_THREADS = 1 , i.e. letting the program run with only one thread, it runs without a hitch. If this number is any higher, I get the following error:
Unhandled exception at 0x00F5C4C3 in projectc.exe: 0xC0000005: Access violation writing location 0x00000072.
with the "Autos" window of Visual Studio showing t, t->active_cells, and t->cellx in "error red" while the rest are white when I get this error.
If I change default(none) to default(shared) in the #pragma right above the outer loop, and remove t, s, and bn from threadprivate (these are structures initialized in external files), then the program runs normally for one generation on each thread before freezing (though CPU activity shows that both threads are still running with the same intensity as before).
Attempts at Solutions
I cannot figure out what is going wrong. A simple #pragma omp parallel for outside the outer loop of course doesn't work, but I have also tried declaring all of main as #pragma omp parallel and the outer loop as #pragma omp for, along with a few other subtle variations. This leads me to conclude that it must be something to do with how the variables are shared between threads. Because all runs (and therefore threads) are independent, really all of the variables could be made private, though there is some overlap, which you see reflected in shared(..).
The code is attached below.
main.c
/* General Includes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <omp.h>
/* Project Includes */
#include "main.h"
#include "randgen.h"
#include "board7.h"
#include "tissue.h"
#include "io.h"
#define BitFlp(arg,posn) ((arg) ^ (1L << (posn)))
#define BitClr(arg,posn) ((arg) & ~(1L << (posn)))
#define display_dbg 1 //Controls whether print statements in main.c are displayed.
#define display_time 1 //Controls whether timing print statements are executed.
#define BILLION 1000000000L;
#define num_runs 10 //Controls number of runs per simulation
#define num_gens 4000//Controls number of generations per run
#define OMP_THREADS 1 // Max number of threads used if OpenMP is enabled
int n, i, r, j, z, x, sxa, y, flagb, m;
int j1, j2;
char a;
int max_fit_gen, collect_data, lb_run, w, rn, sx;
float f, max_fitness;
tissuen *fx;
input_vec dx;
calookup ra;
#pragma omp threadprivate(n, r, j, x, z, sxa, y, flagb, m, \
j1, j2, a, max_fit_gen, collect_data, lb_run, w, \
rn, sx, f, max_fitness, fx, dx, ra, run_data, t, s, bn)
int main(int argc, char *argv[])
{
int* p = 0x00000000; // pointer to NULL
char sa[256];
char ss[10];
long randn;
boardtable ba;
srand((unsigned)time(NULL));
init_mm();
randn = number_range(1, 100);
#ifdef OS_WINDOWS
// Timing parameters
LARGE_INTEGER clk_freq;
LARGE_INTEGER t1, t2, t3;
#endif
#ifdef OS_UNIX
struct timespec clk_freq, t1, t2, t3;
#endif
double avg_gen_time, avg_run_time, run_time, sim_time, est_run_time, est_sim_time;
// File System and IO Parameters
char cwd[FILENAME_MAX];
getcwd(&cwd, sizeof(cwd));
char curState[FILENAME_MAX];
char recState[FILENAME_MAX];
char recMode[FILENAME_MAX];
char curGen[FILENAME_MAX];
char curRun[FILENAME_MAX];
char genTmp[FILENAME_MAX];
strcpy(curState, cwd);
strcpy(recState, cwd);
strcpy(recMode, cwd);
strcpy(curGen, cwd);
strcpy(curRun, cwd);
strcpy(genTmp, cwd);
#ifdef OS_WINDOWS
strcat(curState, "\\current.txt");
strcat(recState, "\\recover.txt");
strcat(recMode, "\\recovermode.txt");
strcat(curGen, "\\gen.txt");
strcat(curRun, "\\run");
strcat(genTmp, "\\tmp\\gentmp");
#endif
#ifdef OS_UNIX
strcat(curState, "/current.txt");
strcat(recState, "/recover.txt");
strcat(recMode, "/recovermode.txt");
strcat(curGen, "/gen.txt");
strcat(curRun, "/run");
strcat(genTmp, "/tmp/gentmp");
#endif
//Read current EA run variables (i.e. current run number, generation, recover mode status)
z = readorcreate(curState);
x = readorcreate(recState);
sxa = readorcreate(recMode);
y = readorcreate(curGen);
//Initialize simulation parameters
s.count = 0;
s.x[0] = 0;
s.y[0] = 0;
s.addvec[0] = 0;
s.bestnum = 0;
s.countb = 0;
s.count = 0;
initialize_sim_param(&s, 0, 200);
collect_data = 0;
//Build a collection of experiment initial conditions
buildboardcollection7(&bn);
//Determine clock frequency.
#ifdef OS_WINDOWS
if (display_time) get_frequency(&clk_freq);
#endif
#ifdef OS_UNIX
if (display_time) get_frequency(CLOCK_REALTIME, &clk_freq);
#endif
//Start simulation timer
#ifdef OS_WINDOWS
if (display_time) read_clock(&t1);
#endif
#ifdef OS_UNIX
if (display_time) read_clock(CLOCK_REALTIME, &t1);
#endif
#pragma omp parallel for schedule(static) default(none) num_threads(OMP_THREADS) \
private(sa, ss, randn, ba, t2, t3, avg_gen_time, avg_run_time, sim_time, \
run_time, est_run_time, est_sim_time) \
shared(i, cwd, recMode, curRun, curGen, curState, genTmp, clk_freq, t1)
for (i = z; i < num_runs; i++)
{
// randomly initialize content of tissue population
initialize_tissue_pop_s2(&(t.tgen[0]), &s);
initialize_tissue_pop_s2(&(t.tgen[1]), &s);
max_fit_gen = 0;
max_fitness = 0.0;
flagb = 0;
if ((i == z) && (x == 1))
{
w = y;
}
else
{
w = 0;
}
rn = 200;
j1 = 0;
s.run_num = i;
s.maxfitness = 0.0;
//Start run timer
#ifdef OS_WINDOWS
if (display_time) read_clock(&t2);
#endif
#ifdef OS_UNIX
if (display_time) read_clock(CLOCK_REALTIME, &t2);
#endif
#if defined(_OPENMP)
printf("\n ======================================= \n");
printf(" OpenMP Status Message \n");
printf("\n --------------------------------------- \n");
printf("| RUN %d : \n", i);
printf("| New Thread Process (Thread %d) \n", omp_get_thread_num());
printf("| Available Threads: %d of %d \n", omp_get_num_threads(), omp_get_max_threads());
printf(" ======================================= \n\n");
#endif
for (j = w; j < num_gens; j++)
{
// Flips on lightboard data collection. See board7.h.
if (enable_collection == 1) {
if ((i >= run_collect) && (j >= gen_collect)) { collect_data = 1; }
}
sx = readcurrent(recMode);
// Pseudo loop code. Uses bit flipping to cycle through boards.
j2 = ~(j1)& 1;
if (display_dbg) printf("start evaluation...\n");
// evaluate tissue
// Most of the problems in the code happen here.
evaluatepopulation_tissueb(&(t.tgen[j1]), &ra, &bn, &s, j, i);
if (display_dbg) printf("\n");
// display fitness stats to screen
printmaxfitness(&(t.tgen[j1]), i, j, j1, &cwd);
if (display_dbg) printf("start tournament...\n");
// Perform tournament selection and have children ready for evaluation
// Rarely have to touch. Figure out best parents. Crossover operator.
// Create a subgroup. Randomly pick individuals from the population.
// Pick fittest individuals out of the random group.
// 2 parents and 2 children. Children replace parents.
tournamentsel_tissueb(&(t.tgen[j1]), &(t.tgen[j2]), &s);
printf("Tournament selection complete.\n");
// keep track of best fitness during run
if (t.tgen[j1].fit_max > max_fitness)
{
max_fitness = t.tgen[j1].fit_max;
max_fit_gen = j;
}
if ((t.tgen[j1].fit_max > 99.0) && (flagb == 0))
{
flagb = 1;
run_data.fit90[i] = t.tgen[j1].fit_max;
run_data.gen90[i] = j;
}
sa[0] = 0;
strcat(sa, curRun);
sprintf(ss, "%d", i);
strcat(sa, ss);
strcat(sa, ".txt");
printf("Write fitness epc...\n");
// write fitness stats to file
writefitnessepc(sa, &(t), j1, j);
printf("Write fitness complete.\n");
// trunk for saving population to disk
if (sx != 0)
{
sa[0] = 0;
strcat(sa, genTmp);
sprintf(ss, "%d", 1);
strcat(sa, ss);
strcat(sa, ".txt");
if (display_dbg) printf("Saving Current Run\n");
}
//update current generation to file
writecurrent(curGen, j + 1);
if (display_time && j > 0 && (j % 10 == 0 || j % (num_gens - 1) == 0))
{
#ifdef OS_WINDOWS
read_clock(&t3);
sim_time = (t3.QuadPart - t1.QuadPart) / clk_freq.QuadPart;
run_time = (t3.QuadPart - t2.QuadPart) / clk_freq.QuadPart;
#endif
#ifdef OS_UNIX
read_clock(CLOCK_REALTIME, &t3);
sim_time = (double)(t3.tv_sec - t1.tv_sec);
run_time = (double)(t3.tv_sec - t2.tv_sec);
#endif
avg_gen_time = run_time / (j + 1);
est_run_time = avg_gen_time * (num_gens - j);
avg_run_time = est_run_time + run_time;
est_sim_time = (est_run_time * (num_runs - i)) / (i + 1);
printf("\n============= Timing Data =============\n");
printf("Time in Simulation: %.2fs\n", sim_time);
printf("Time in Run: %.2fs\n", run_time);
printf("Est. Time to Complete Run: %.2fs\n", est_run_time);
printf("Est. Time to Complete Simulation: %.2fs\n\n", est_sim_time);
printf("Average Time Per Generation: %.2fs/gen\n", avg_gen_time);
printf("Average Time Per Run: %.2fs/run\n", avg_run_time);
printf("=======================================\n\n");
if (j % (num_gens - 1) == 0) {
}
}
//Display Position Board
//displayboardl(&bn.board[0]);
j1 = j2;
}
}
}
Structures
typedef struct boardcollectionn
{
boardtable board[boardnumb];
} boardcollection;
boardcollection bn;
typedef struct tissue_gent
{
tissue_population tgen[2];
} tissue_genx;
typedef struct sim_paramt //struct for storing simulation parameters
{
int penalty;
int addnum[cell_numz];
int x[9];
int y[9];
uint8_t addvec[9];
uint8_t parenta[50];
uint8_t parentb[50];
int errorstatus;
int ones[outputnum][5000];
int zeros[outputnum][5000];
int probcount;
int num;
int numb;
int numc;
int numd;
int nume;
int numf;
int bestnum;
int count;
int col_flag;
int behaviour[outputnum];
int memm[4];
int sel;
int seldecnum;
int seldec[200];
int selx[200];
int sely[200];
int selz[200];
int countb;
float maxfitness;
float oldmaxfitness;
int run_num;
int collision;
} sim_param;
tissue_genx t;
sim_param s;
The code is too big for proper testing, and the use of global variables really doesn't help in figuring out the data dependencies. However, I can make a few remarks:
i is declared shared even though it is the index of the parallelised loop. This is wrong! If there is one variable that you really want to be private in an omp for loop, it is the loop index. I didn't find anything clear about this in the OpenMP standard for C and C++, whereas for Fortran the loop index (and those of all enclosed loops) is implicitly privatised. Nonetheless, the Intel compiler gives an error when such an index is explicitly declared shared:
sharedi.cc(11): warning #2555: static control variable for parallel loop
for ( i=0; i<10; i++ ) {
^
sharedi.cc(10): error: index variable "i" of for statement following an OpenMP for pragma must be private
#pragma omp parallel for shared(i) schedule(static)
^
compilation aborted for sharedi.cc (code 2)
In the meantime, gcc version 5.1.0 doesn't emit any warning or error for the same code, and acts as if the variable had been declared private... I tend to find Intel's compiler's behaviour more reasonable, but I'm not 100% sure which one is correct. What I do know is that declaring i shared is definitely a very, very bad idea (and even a bug AFAIC). So I feel this is a grey area where your compiler may or may not do a sensible job, which could all by itself explain most of your problems.
You seem to output your data into files whose names might conflict across threads. Be careful with that, as you might end up with a big mess...
Your printing is very likely to come out all interleaved. I don't know how much importance you attach to that, but it won't be pretty the way it is written now.
In summary, your code is just too tangled for me to get a clear view of what's happening. Try to address at least the first two points I mentioned; it might be enough to get it to "work". Beyond that, I cannot encourage you enough to clean the code up and get rid of your global variables. Likewise, try to declare your variables as late in the source as possible, since this reduces the need to declare them private for OpenMP and greatly improves readability.
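To make that last point concrete, here is a minimal sketch (purely illustrative, reusing a few names from your code such as num_runs, OMP_THREADS, sim_param and initialize_sim_param, plus a hypothetical per-run file name) of a loop where the index and the per-iteration temporaries are declared locally, so nothing needs to be listed in a private clause at all:
#pragma omp parallel for schedule(static) num_threads(OMP_THREADS)
for (int run = 0; run < num_runs; run++)   /* index declared here: implicitly private */
{
    char filename[FILENAME_MAX];           /* per-iteration temporaries live inside the loop */
    sim_param local_s;                     /* one private copy per iteration, no threadprivate needed */

    memset(&local_s, 0, sizeof(local_s));
    initialize_sim_param(&local_s, 0, 200);

    /* one output file per run also avoids name clashes between threads */
    snprintf(filename, sizeof(filename), "run%d.txt", run);

    /* ... evolve all generations of this run using only local state ... */
}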
Good luck with your debugging.

Multithreaded program outputs different results every time it runs

I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use per-thread (private) variables, or make the update atomic, to get around that.
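For instance, one minimal way to make that fetch-and-increment a single step in OpenMP (3.1 or later) is an atomic capture; a sketch against the loop body from the question:
if (i % 5 == 0 || i % 3 == 0)
{
    int idx;
    #pragma omp atomic capture
    idx = j++;       // read the old j and increment it as one atomic operation
    NUMS[idx] = i;   // each thread then writes to its own reserved slot
}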
In your code j is a shared induction variable. You can't rely on using shared induction variables efficiently with multiple threads (using atomic every iteration is not efficient).
You could find a special solution that does not use induction variables (for example wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables like this
#pragma omp parallel
{
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory however. A better solution would use something like std::vector from C++ which you could implement for example using realloc in C but I'm not going to do that for you.
Edit:
Here is a special solution which does not use shared induction variables, using wheel factorization
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 15*i + wheel[k];
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without a critical section (from the ordered clause). This does memcpy in parallel and also makes better use of memory at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a kernel synchronization object to guarantee the atomicity of the desired operation (incrementing the value of the variable j in your case). It could be a mutex, a semaphore or an event object, depending on the operating system you're working on. But whatever your development environment, to provide atomicity the fundamental flow should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on the Windows operating system, there are some special synchronization mechanisms provided by the environment (e.g. InterlockedIncrement or CreateCriticalSection). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. Actually, all of these synchronization mechanisms stem from the concept of the semaphore, invented by Edsger W. Dijkstra in the early 1960s.
Here's some basic examples below:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that the threads don't necessarily execute in order, so the last thread to write may not have read the latest value, and you end up overwriting data. OpenMP has an option to make the threads in a loop combine their partial results when they finish: the reduction clause. You have to write something like this to use it.
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* Fin del computo */
gettimeofday(&fin,NULL);
All you have to do is read the result from "sum"; this is from some old code of mine that computes a sum.
The other option you have is the dirty one: somehow make the threads wait and take turns by using a call into the OS. This is easier than it looks. Something like this would be a solution.
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
But I recommend you read through the OpenMP options fully.

Measuring execution time using rdtsc within kernel

I am writing code to measure the time consumed by a sequence of code in the kernel, by loading the code as a kernel module. I use the common rdtsc routine to calculate the time. Interestingly, a similar routine running in user mode gives normal values, whereas the result is always 0 when running in kernel mode, no matter how many lines of code I add to the time_count function. The calculation I use here is a common matrix product function, and the cycle count should increase rapidly with the matrix dimension. Can anyone point out the mistake in my code, i.e. why I cannot measure the cycle count in the kernel?
#include <linux/init.h>
#include <linux/module.h>
int matrix_product(){
int array1[500][500], array2[500][500], array3[500][500];
int i, j, k, sum;
for(i = 0; i < 50000; i++){
for(j = 0; j < 50000; j++){
array1[i][j] = 5*i + j;
array2[i][j] = 5*i + j;
}
}
for(i = 0; i < 50000; i++){
for(j = 0; j < 50000; j++){
for(k = 0; k < 50000; k++)
sum += array1[i][k]*array2[k][j];
array3[i][j] = sum;
sum = 0;
}
}
return 0;
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long hi, lo;
__asm__ __volatile__ ("xorl %%eax,%%eax\ncpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ((unsigned long long)lo) | (((unsigned long long)hi)<<32) ;
}
static int my_init(void)
{
unsigned long str, end, curr, best, tsc, best_curr;
long i, t;
#define time_count(codes) for(i=0; i<120000; i++){str=rdtsc(); codes; end=rdtsc(); curr=end-str; if(curr<best)best=curr;}
best = ~0;
time_count();
tsc = best;
best = ~0;
time_count(matrix_product());
best_curr = best;
printk("<0>matrix product: %lu ticks\n", best_curr-tsc);
return 0;
}
static void my_exit(void){
return;
}
module_init(my_init);
module_exit(my_exit);
Any help is appreciated! Thanks.
rdtsc is not guaranteed to be available on every CPU, or to run at a constant rate, or be consistent between different cores.
You should use a reliable and portable function like getrawmonotonic unless you have special requirements for the timestamps.
If you really want to use cycles directly, the kernel already defines get_cycles and cpuid functions for this.
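For example, a minimal module sketch built around get_cycles() might look like the following (illustrative only; the exact headers, and whether you reach for getrawmonotonic() or the newer ktime_* interfaces for wall-clock timestamps, depend on your kernel version):
#include <linux/init.h>
#include <linux/module.h>
#include <linux/timex.h>    /* get_cycles(), cycles_t */

MODULE_LICENSE("GPL");

static int __init timing_init(void)
{
    cycles_t c0, c1;

    c0 = get_cycles();
    /* ... code under test goes here ... */
    c1 = get_cycles();

    pr_info("code under test: %llu cycles\n", (unsigned long long)(c1 - c0));
    return 0;
}

static void __exit timing_exit(void) { }

module_init(timing_init);
module_exit(timing_exit);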

Segmentation Fault when using OpenMP when creating an array

I'm having a Segmentation Fault when accessing an array inside a for loop.
What I'm trying to do is to generate all subsequences of a DNA string.
It was happening when I created the array inside the for loop. After reading around for a while, I found out that OpenMP limits the stack size, so it would be safer to use the heap instead. So I changed the code to use malloc, but the problem persists.
This is the full code:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#define DNA_SIZE 26
#define DNA "AGTC"
static char** powerset(int argc, char* argv)
{
unsigned int i, j, bits, i_max = 1U << argc;
if (argc >= sizeof(i) * CHAR_BIT) {
fprintf(stderr, "Error: set too large\n");
exit(1);
}
omp_set_num_threads(2);
char** subsequences = malloc(i_max*sizeof(char*));
#pragma omp parallel for shared(subsequences, argv)
for (i = 0; i < i_max ; ++i) {
//printf("{");
int characters = 0;
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
//This is the line where the error is happening.
char *ss = malloc(characters+1 * sizeof(char)*16);//the *16 is just to save the cache line
int ssindex = 0;
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
//char a = argv[j];
ss[ssindex++] = argv[j] ;
}
}
ss[ssindex] = '\0';
subsequences[i] = ss;
}
return subsequences;
}
char* getdna()
{
int i;
char *dna = (char *)malloc((DNA_SIZE+1) * sizeof(char));
for(i = 0; i < DNA_SIZE; i++)
{
int randomDNA = rand() % 4;
dna[i] = DNA[randomDNA];
}
dna[DNA_SIZE] = '\0';
return dna;
}
void printResult(char** ss, int size)
{
//PRINTING THE SUBSEQUENCES
printf("SUBSEQUENCES FOUND:\r\n");
int i;
for(i = 0; i < size; i++)
{
printf("%i.\t{ %s } \r\n",i+1 , ss[i]);
free(ss[i]);
}
free(ss);
}
int main(int argc, char* argv[])
{
srand(time(NULL));
double starttime, stoptime;
starttime = omp_get_wtime();
char* a = getdna();
printf("%s\r\n", a);
int size = pow(2, DNA_SIZE);
printf("number of subsequences: %i\r\n", size);
char** subsequences = powerset(DNA_SIZE, a);
//todo: make it optional printing to the stdout or saving to a file
//printResult(subsequences, size);
stoptime = omp_get_wtime();
printf("Tempo de execucao: %3.2f segundos\n\n", stoptime-starttime);
printf("Numero de sequencias geradas: %i\n\n", size);
free(a);
return 0;
}
I also tried to make the malloc line critical with the #pragma omp critical which didn't help.
Also I tried to compile with -mstackrealign which also didn't work.
Appreciate all the help.
You should use more efficient thread-safe memory management.
Applications can use either malloc() and free() explicitly, or implicitly in the compiler-generated code for dynamic/allocatable arrays, vectorized intrinsics, and so on.
The thread-safe malloc() and free() in some libc implementations carry a high synchronization overhead caused by internal locking. Faster allocators for multi-threaded applications exist. For instance, on Solaris multithreaded applications should be linked with the "MT-hot" allocator mtmalloc, (i.e., link with -lmtmalloc to use mtmalloc instead of the default libc allocator). glibc, used on Linux and some OpenSolaris and FreeBSD distributions with GNU userlands, uses a modified ptmalloc2 allocator, which is based on Doug Lea's dlmalloc. It uses multiple memory arenas to achieve near lock-free behavior. It can also be configured to use per-thread arenas and some distributions, notably RHEL 6 and derivates, have that feature enabled.
static char** powerset(int argc, char* argv)
{
int i, j, bits, i_max = 1U << argc;
if (argc >= sizeof(i) * CHAR_BIT) {
fprintf(stderr, "Error: set too large\n");
exit(1);
}
omp_set_num_threads(2);
char** subsequences = malloc(i_max*sizeof(char*));
int characters = 0;
for (i = 0; i < i_max ; ++i)
{
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
subsequences[i] = malloc(characters+1 * sizeof(char)*16);
characters = 0;
}
#pragma omp parallel for shared(subsequences, argv) private(j,bits)
for (i = 0; i < i_max; ++i)
{
int ssindex = 0;
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
subsequences[i][ssindex++] = argv[j] ;
}
}
subsequences[i][ssindex] = '\0';
}
return subsequences;
}
I create (and allocate) the desired data before the parallel region, and then do the remaining calculations in parallel. The version above, running with 12 threads on a 24-core machine, takes "Tempo de execucao: 9.44 segundos".
However, when I try to parallelize the following code:
#pragma omp parallel for shared(subsequences) private(bits,characters)
for (i = 0; i < i_max ; ++i)
{
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
subsequences[i] = malloc(characters+1 * sizeof(char)*16);
characters = 0;
}
it take "Tempo de execucao: 10.19 segundos"
As you can see calling malloc in parallel leads to slower times.
In any case, you would eventually have had problems with the fact that each sub-malloc was trying to allocate (characters+1*DNA_SIZE*sizeof(char)) rather than ((characters+1)*DNA_SIZE*sizeof(char)), and the multiplication by a cache-line factor is not necessary inside the parallel section, if I understand what you were trying to avoid.
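Applied to the allocation in the question, that simply means parenthesizing the sum (with the cache-line padding dropped, as suggested above; this is an illustrative corrected line, not code from the original post):
/* without parentheses, 'characters+1 * sizeof(char)*16' requests characters + 16
   bytes instead of the intended (characters + 1) * 16 */
char *ss = malloc((characters + 1) * sizeof(char));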
There also seems to be some issue with this piece of code:
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
//char a = argv[j];
ss[ssindex++] = argv[j] ;
}
}
With this code, j sometimes hits DNA_SIZE or DNA_SIZE+1, resulting in reading argv[j] going off the end of the array. (Also, using argc and argv as names for arguments in this function is somewhat confusing.)
The problem is here with dna[DNA_SIZE] = '\0';. So far you have allocated memory for 26 characters (say), and you are trying to access the 27th character. Always remember array index starts from 0.

Why is my parallel code slower than sequential code?

I have implemented parallel merge sort in C using OpenMP. It runs in 3.9 seconds, which is slower than the sequential version of the same code (which takes 3.6 seconds). I am trying to optimise the code as best I can but can't improve the speedup. Can you please help with this? Thanks.
void partition(int arr[],int arr1[],int low,int high,int thread_count)
{
int tid,mid;
#pragma omp if
if(low<high)
{
if(thread_count==1)
{
mid=(low+high)/2;
partition(arr,arr1,low,mid,thread_count);
partition(arr,arr1,mid+1,high,thread_count);
sort(arr,arr1,low,mid,high);
}
else
{
#pragma omp parallel num_threads(thread_count)
{
mid=(low+high)/2;
#pragma omp parallel sections
{
#pragma omp section
{
partition(arr,arr1,low,mid,thread_count/2);
}
#pragma omp section
{
partition(arr,arr1,mid+1,high,thread_count/2);
}
}
}
sort(arr,arr1,low,mid,high);
}
}
}
As was correctly noted, there are several mistakes in your code that prevent its correct execution, so I would first suggest to review these errors.
Anyhow, taking into account only how OpenMP performance scales with threads, maybe an implementation based on the task directive would fit better, as it overcomes the limit already pointed out by a previous answer:
Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause
You can find a sketch of such an implementation below:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
void getTime(double *t) {
struct timeval tv;
gettimeofday(&tv, 0);
*t = tv.tv_sec + (tv.tv_usec * 1e-6);
}
int compare( const void * pa, const void * pb ) {
const int a = *((const int*) pa);
const int b = *((const int*) pb);
return (a-b);
}
void merge(int * array, int * workspace, int low, int mid, int high) {
int i = low;
int j = mid + 1;
int l = low;
while( (l <= mid) && (j <= high) ) {
if( array[l] <= array[j] ) {
workspace[i] = array[l];
l++;
} else {
workspace[i] = array[j];
j++;
}
i++;
}
if (l > mid) {
for(int k=j; k <= high; k++) {
workspace[i]=array[k];
i++;
}
} else {
for(int k=l; k <= mid; k++) {
workspace[i]=array[k];
i++;
}
}
for(int k=low; k <= high; k++) {
array[k] = workspace[k];
}
}
void mergesort_impl(int array[],int workspace[],int low,int high) {
const int threshold = 1000000;
if( high - low > threshold ) {
int mid = (low+high)/2;
/* Recursively sort on halves */
#ifdef _OPENMP
#pragma omp task
#endif
mergesort_impl(array,workspace,low,mid);
#ifdef _OPENMP
#pragma omp task
#endif
mergesort_impl(array,workspace,mid+1,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
/* Merge the two sorted halves */
#ifdef _OPENMP
#pragma omp task
#endif
merge(array,workspace,low,mid,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
} else if (high - low > 0) {
/* Coarsen the base case */
qsort(&array[low],high-low+1,sizeof(int),compare);
}
}
void mergesort(int array[],int workspace[],int low,int high) {
#ifdef _OPENMP
#pragma omp parallel
#endif
{
#ifdef _OPENMP
#pragma omp single nowait
#endif
mergesort_impl(array,workspace,low,high);
}
}
const size_t largest = 100000000;
const size_t length = 10000000;
int main(int argc, char *argv[]) {
int * array = NULL;
int * workspace = NULL;
double start,end;
printf("Largest random number generated: %d \n",RAND_MAX);
printf("Largest random number after truncation: %d \n",largest);
printf("Array size: %d \n",length);
/* Allocate and initialize random vector */
array = (int*) malloc(length*sizeof(int));
workspace = (int*) malloc(length*sizeof(int));
for( int ii = 0; ii < length; ii++)
array[ii] = rand()%largest;
/* Sort */
getTime(&start);
mergesort(array,workspace,0,length-1);
getTime(&end);
printf("Elapsed time sorting: %g sec.\n", end-start);
/* Check result */
for( int ii = 1; ii < length; ii++) {
if( array[ii] < array[ii-1] ) printf("Error:\n%d %d\n%d %d\n",ii-1,array[ii-1],ii,array[ii]);
}
free(array);
free(workspace);
return 0;
}
Notice that if you seek performance, you also have to guarantee that the base case of your recursion is coarse enough to avoid substantial overhead due to recursive function calls. Other than that, I would suggest profiling your code so you have a good hint of which parts are really worth optimizing.
It took some figuring out, which is a bit embarrassing, since once you see it, the answer is so simple.
As it stands in the question, the program doesn't work correctly: on some runs it randomly duplicates some numbers and loses others. This appears to be a purely parallel error that doesn't arise when the program is run with thread_count == 1.
The pragma "parallel sections" is a combined parallel and sections directive, which in this case means that it starts a second parallel region inside the previous one. Parallel regions inside other parallel regions are fine, but I think most implementations don't give you extra threads when they encounter a nested parallel region.
The fix is to replace
#pragma omp parallel sections
with
#pragma omp sections
After this fix, the program starts to give correct answers, and on a two-core system with a million numbers I get the following timings.
One thread:
time taken: 0.378794
Two threads:
time taken: 0.203178
Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause, so change num_threads(thread_count) -> num_threads(2)
But because at least the two implementations I tried are not able to spawn new threads for nested parallel regions, the program as it stands doesn't scale beyond two threads.
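Putting the two changes together, the parallel branch of partition() would end up looking roughly like this (a sketch based on the code in the question, with mid computed before the region so that every thread isn't redundantly writing a shared variable):
else
{
    mid = (low + high) / 2;              /* compute once, outside the parallel region */
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections             /* was: parallel sections */
        {
            #pragma omp section
            partition(arr, arr1, low, mid, thread_count / 2);
            #pragma omp section
            partition(arr, arr1, mid + 1, high, thread_count / 2);
        }
    }
    sort(arr, arr1, low, mid, high);
}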
