I'm currently learning about pthreads in C and came across the issue of False Sharing. I think I understand the concept of it and I've tried experimenting a bit.
Below is a short program that I've been playing around with. Eventually I'm going to change it into a program to take a large array of ints and sum it in parallel.
#include <stdio.h>
#include <pthread.h>
#define THREADS 4
#define NUMPAD 14
struct s
{
int total; // 4 bytes
int my_num; // 4 bytes
int pad[NUMPAD]; // 4 * NUMPAD bytes
} sum_array[THREADS];
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
for (int i = 0; i < 10; ++i) {
sum_array[curr_ind].total += sum_array[curr_ind].my_num;
}
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
int main(void) {
int args[THREADS] = { 0, 1, 2, 3 };
pthread_t thread_ids[THREADS];
for (size_t i = 0; i < THREADS; ++i) {
sum_array[i].total = 0;
sum_array[i].my_num = i + 1;
pthread_create(&thread_ids[i], NULL, worker, &args[i]);
}
for (size_t i = 0; i < THREADS; ++i) {
pthread_join(thread_ids[i], NULL);
}
}
My question is: is it possible to prevent false sharing without using padding? Here struct s has a size of 64 bytes, so each struct sits on its own cache line (assuming a 64-byte cache line). I'm not sure how else I can achieve parallelism without padding.
Also, if I were to sum an array of varying size, between 1,000 and 50,000 bytes, how could I prevent false sharing? Would I be able to pad it out using a similar program? My current thought is to put each int from the big array into an array of struct s and then sum it in parallel, but I'm not sure whether that is the optimal solution.
Partition the problem: In worker(), sum into a local variable, then add the local variable to the array:
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
int localsum = 0;
for (int i = 0; i < 10; ++i) {
localsum += sum_array[curr_ind].my_num;
}
sum_array[curr_ind].total += localsum;
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
This may still have false sharing after the loop, but only once per thread; thread creation overhead is much more significant than a single cache miss. Of course, you probably want a loop that actually does something time-consuming, as your current code can be optimized to:
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
int localsum = 10 * sum_array[curr_ind].my_num;
sum_array[curr_ind].total += localsum;
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
Its runtime is then dominated by thread creation and the synchronization inside printf().
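Applied to the larger question of summing a big array, the same idea scales: give each thread a private accumulator and write to shared memory exactly once. Here is a minimal sketch under assumed names (big_array, partial, and sum_worker are illustrative, not from the code above):
#include <pthread.h>
#include <stdio.h>
#define THREADS 4
static int  big_array[40000];     /* the data to sum; contents and size are placeholders */
static long partial[THREADS];     /* one result slot per thread */
struct range { int lo, hi, tid; };
static void *sum_worker(void *p) {
    const struct range *r = p;
    long local = 0;               /* private accumulator: nothing shared inside the loop */
    for (int i = r->lo; i < r->hi; ++i)
        local += big_array[i];
    partial[r->tid] = local;      /* adjacent slots may share a cache line, but this is
                                     a single write per thread, so it is negligible */
    return NULL;
}
int main(void) {
    const int n = 40000;
    pthread_t tid[THREADS];
    struct range args[THREADS];
    for (int t = 0; t < THREADS; ++t) {
        args[t] = (struct range){ t * n / THREADS, (t + 1) * n / THREADS, t };
        pthread_create(&tid[t], NULL, sum_worker, &args[t]);
    }
    long total = 0;
    for (int t = 0; t < THREADS; ++t) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("%ld\n", total);
}
If you would rather keep a padded layout, note that with GCC/Clang an __attribute__((aligned(64))) on struct s forces sizeof(struct s) up to a multiple of 64, so each array element lands on its own cache line without hand-counted pad ints; the memory cost is the same as the manual pad array.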
Related
I am working on an assignment where the goal is to speed up a quicksort by creating multiple threads. However, I cannot figure out how to get any speedup: I apply the allowed threads at the start, but it only seems to slow the program down.
Basically, the goal is to sort a simple array using the traditional recursive quicksort. But, as I stated previously, threading only seems to slow it down when I use clock() to time its performance. Any suggestions, or is there something else I need to do with the threads? I will upload my full source code here:
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <errno.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <math.h>
#include <sys/wait.h>
#include <assert.h>
#include <time.h>
static int maxThreads = 4;
#define SORT_THRESHOLD 40
//#includes==========
void *blin();
pthread_mutex_t mutex;
void *send(void *);
static int used = 0;
static int reached = 0;
int doit = 0;
int threadNo = 0;
typedef struct _sortParams {
char** array;
int left;
int right;
} SortParams;
static void insertSort(char** array, int left, int right) {
int i, j;
for (i = left + 1; i <= right; i++) {
char* pivot = array[i];
j = i - 1;
while (j >= left && (strcmp(array[j],pivot) > 0)) {
array[j + 1] = array[j];
j--;
}
array[j + 1] = pivot;
}
}
int going = 0;
int blins = 0;
void *send(void * p) {
SortParams* params = (SortParams*) p;
char** array = params->array;
int left = params->left;
int right = params->right;
int i = left, j = right;
if (j - i > SORT_THRESHOLD) {
/* if the sort range is substantial, use quick sort */
int m = (i + j) >> 1; /* pick pivot as median of */
char* temp, *pivot; /* first, last and middle elements */
if (strcmp(array[i],array[m]) > 0) {
temp = array[i]; array[i] = array[m]; array[m] = temp;
}
if (strcmp(array[m],array[j]) > 0) {
temp = array[m]; array[m] = array[j]; array[j] = temp;
if (strcmp(array[i],array[m]) > 0) {
temp = array[i]; array[i] = array[m]; array[m] = temp;
}
}
pivot = array[m];
for (;;) {
while (strcmp(array[i],pivot) < 0) i++;
/* move i down to first element greater than or equal to pivot */
while (strcmp(array[j],pivot) > 0) j--;
/* move j up to first element less than or equal to pivot */
if (i < j) {
char* temp = array[i]; /* if i and j have not passed each other */
array[i++] = array[j]; /* swap their respective elements and */
array[j--] = temp; /* advance both i and j */
} else if (i == j) {
i++; j--;
} else break; /* if i > j, this partitioning is done */
}
if (blins < 1) {
blins++;
SortParams first; first.array = array; first.left = left; first.right = j;
int ex;
pthread_t thred[2];
pthread_create(&thred[0], NULL, send, &first);
pthread_join(thred[0], NULL);
SortParams second; second.array = array; second.left = i; second.right = right;
pthread_create(&thred[1], NULL, send, &second);
pthread_join(thred[1], NULL);
} else {
SortParams first; first.array = array; first.left = left; first.right = j;
send(&first); /* sort the left partition */
SortParams second; second.array = array; second.left = i; second.right = right;
send(&second); /* sort the right partition */
}
} else insertSort(array,i,j); /* for a small range use insert sort */
    return NULL; /* a void * thread function must return a value */
}
int main() {
int count = 100000;
char * array[count];
char * random[10] = {"asdfs", "wesasd", "asded", "aaddsdaa", "dsfs", "av", "bb",
"zz", "das", "efdxse"};
int r = 0;
for(int ni = 0; ni < count; ni++) {
r = (rand() % 4);
char string[100];
strcpy(string, "");
int b = (rand() % 50)+1;
for (int bb = 0; bb < b; bb++) {
r = (rand() % 4);
if (r == 0) {
strcat(string, "a");
}
if (r == 1) {
strcat(string, "b");
}
if (r == 2) {
strcat(string, "c");
}
if (r == 3) {
strcat(string, "d");
}
if (r == 4) {
strcat(string, "e");
}
}
array[ni] = malloc(sizeof(string));
strcpy(array[ni], string);
}
clock_t t;
t = clock();
SortParams parameters; // declare structure
parameters.array = array; parameters.left = 0; parameters.right = count - 1;
//sleep(5);
send(&parameters);
t = clock() - t;
double total = ((double)t)/CLOCKS_PER_SEC;
printf("%f \n", total);
char ** jink = parameters.array;
for (int ni = 0; ni < count/10; ni++) {
printf("%s \n", jink[ni]);
}
// */
for (int ni = 0; ni < count; ni++) {
free(array[ni]);
} printf("%f \n", total);
return 0;
}
You should be able to simply copy/paste this and it should work, but as you can see, I have created two threads and it's slower than with no threads.
Let me explain the main problem with an analogy.
You have a pile of sheets, each with a number on it. The pile is unordered and you need to sort it.
Sequential
You split the pile into two, order each half, and then combine the ordered piles into one big pile. You cannot put all the sheets on your table, as it is small, so you can only handle, say, 20 sheets at once. When you need to sort a pile larger than 20 sheets, you temporarily store whatever does not fit on the table in a box.
Parallel
Every time you split the pile into two piles, you call a courier service. It arrives in, say, 15 minutes, and you send the two piles to two friends who live on the other side of town. Your friends sort the piles (and if the piles are large enough, they also split them and send parts on to their friends, and so on), then they send the sorted piles back to you by courier. You have to wait until all the piles arrive back before you can combine and sort them.
You can see that if you have just 10 sheets to sort (or even 1,000) it will be much quicker to do it alone. The overhead of coordination and of sending data across town is very large, even though your friends (and their friends) can do the actual work (sorting sheets) in parallel.
To get any visible speedup, your pile must be large enough that the gain from parallelization outweighs the cost of starting new threads and synchronizing their work (the delay the courier service introduces).
Also, your current implementation does not even do the work in parallel: starting a thread and then joining it immediately is like sending the first pile to one friend and waiting for the result before you send the second pile to the other friend. Doing it all yourself would be much quicker for any size of pile.
What to do?
First of all, you need to test this on data large enough to show any gain.
Secondly, there is no sense in creating more threads than you have executors, that is, cores in your system. Once that many threads have been created, each should just sort what it has without spawning new ones.
You really need to make sure that threads do the work in parallel. This is the only factor that works in your favour and the only source of speedup in the parallel algorithm; everything else works against you.
A minimal improvement in your code is to sort both parts in parallel and only then join both threads:
...
if (blins < 1) {
blins++;
SortParams first; first.array = array; first.left = left; first.right = j;
int ex;
pthread_t thred[2];
pthread_create(&thred[0], NULL, send, &first); // start first thread
SortParams second; second.array = array; second.left = i; second.right = right;
pthread_create(&thred[1], NULL, send, &second); // start second thread
pthread_join(thred[0], NULL); // wait for first thread
pthread_join(thred[1], NULL); // wait for second thread
} else {
...
In my tests the gain is between 30% and 50%. But using threads inside a recursive routine is brave. I would instead fix the number of worker threads (the number of cores or hardware threads, minus one?), split the array into that many chunks, and give each chunk its own thread. Then you only need to merge the sorted chunks, as in the sketch below.
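A minimal sketch of that fixed-chunk approach, reusing qsort for the per-chunk sort; NCHUNKS, parallel_sort, and the helpers are illustrative names, not from the code above:
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#define NCHUNKS 4   /* e.g. the number of cores */
typedef struct { char **array; int left; int right; } Chunk;
static int cmp(const void *a, const void *b) {
    return strcmp(*(char * const *)a, *(char * const *)b);
}
static void *sort_chunk(void *p) {
    Chunk *c = p;
    qsort(c->array + c->left, c->right - c->left + 1, sizeof(char *), cmp);
    return NULL;
}
/* merge two adjacent sorted ranges [l..m] and [m+1..r] through a temp buffer */
static void merge(char **a, int l, int m, int r) {
    int n = r - l + 1, i = l, j = m + 1, k = 0;
    char **tmp = malloc(n * sizeof(char *));
    while (i <= m && j <= r)
        tmp[k++] = (strcmp(a[i], a[j]) <= 0) ? a[i++] : a[j++];
    while (i <= m) tmp[k++] = a[i++];
    while (j <= r) tmp[k++] = a[j++];
    memcpy(a + l, tmp, n * sizeof(char *));
    free(tmp);
}
void parallel_sort(char **array, int count) {
    pthread_t tid[NCHUNKS];
    Chunk chunks[NCHUNKS];
    for (int t = 0; t < NCHUNKS; t++) {
        chunks[t] = (Chunk){ array, t * count / NCHUNKS,
                             (t + 1) * count / NCHUNKS - 1 };
        pthread_create(&tid[t], NULL, sort_chunk, &chunks[t]);
    }
    for (int t = 0; t < NCHUNKS; t++)
        pthread_join(tid[t], NULL);
    /* merge sorted chunks pairwise until one sorted run remains */
    for (int step = 1; step < NCHUNKS; step *= 2)
        for (int t = 0; t + step < NCHUNKS; t += 2 * step) {
            int last = t + 2 * step - 1;
            if (last >= NCHUNKS) last = NCHUNKS - 1;
            merge(array, chunks[t].left, chunks[t + step - 1].right,
                  chunks[last].right);
        }
}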
I'm working through a thread exercise in C; it's a typical thread-scheduling example that many schools teach. A basic version can be seen here, and my code is basically the same except for my altered runner method:
http://webhome.csc.uvic.ca/~wkui/Courses/CSC360/pthreadScheduling.c
What I'm doing is altering the runner part so that my code prints an array of random numbers within a certain range, instead of just printing some words. My runner code is here:
void *runner(void *param) {
int i, j, total = 0; /* total must start at zero before accumulating */
int threadarray[100];
for (i = 0; i < 100; i++)
threadarray[i] = rand() % ((199 + modifier*100) + 1 - (100 + modifier*100)) + (100 + modifier*100);
/* prints array and add to total */
for (j = 0; j < 100; j += 10) {
printf("%d,%d,%d,%d,%d,%d,%d,%d,%d,%d\n", threadarray[j], threadarray[j+1], threadarray[j+2], threadarray[j+3], threadarray[j+4], threadarray[j+5], threadarray[j+6], threadarray[j+7], threadarray[j+8], threadarray[j+9]);
total = total + threadarray[j] + threadarray[j+1] + threadarray[j+2] + threadarray[j+3] + threadarray[j+4] + threadarray[j+5] + threadarray[j+6] + threadarray[j+7] + threadarray[j+8] + threadarray[j+9];
}
printf("Thread %d finished running, total is: %d\n", pthread_self(), total);
pthread_exit(0);
}
My question lies in the first for loop, where I'm assigning random numbers to my array. I want the modifier to change based on which thread is running: for example, the first thread's range would be 100-199, the second's 200-299, and so on. I have tried assigning a value to an int before pthread_create and using it as the modifier in runner, but since there are 5 concurrent threads they all end up with the same modifier.
So I'm looking for an approach that gives each individual thread its own value instead of sharing one. I have tried changing the parameters to something like (void *param, int modifier), but then I have no idea how to reference runner, since by default it's referenced like pthread_create(&tid[i],&attr,runner,NULL);
You want to make param point to a data structure or variable whose lifetime is longer than the thread's lifetime. And you cast the void* parameter back to the actual data type it was allocated as.
Easy example:
struct thread_data
{
    int thread_index;
    int start;
    int end;
};
struct thread_info
{
    struct thread_data data;
    pthread_t thread;
};
struct thread_info threads[10];
for (int x = 0; x < 10; x++)
{
struct thread_data* pData = (struct thread_data*)malloc(sizeof(struct thread_data)); // never pass a stack variable as a thread parameter. Always allocate it from the heap.
pData->thread_index = x;
pData->start = 100 * x + 1;
pData->end = 100*(x+1) - 1;
pthread_create(&(threads[x].thread), NULL, runner, pData);
}
Then your runner:
void *runner(void *param)
{
struct thread_data* data = (struct thread_data*)param;
int modifier = data->thread_index;
int i, j, total;
int threadarray[100];
for (i = 0; i < 100; i++)
{
threadarray[i] = ...
    }
    free(data);   /* the creator allocated the params with malloc, so the thread frees them */
    return NULL;
}
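To complete the picture, the main thread would then wait for the workers, using the threads array declared above:
for (int x = 0; x < 10; x++)
    pthread_join(threads[x].thread, NULL);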
I have written a simple function with the following code that calculates the minimum number from a one-dimensional array:
uint32_t get_minimum(const uint32_t* matrix) {
uint32_t min = matrix[0];
for (ssize_t i = 0; i < g_elements; i++){
if (min > matrix[i]){
min = matrix[i];
}
}
return min;
}
However, I wanted to improve the performance of this function and was advised using threads so I have modified it to the following:
struct minargument{
const uint32_t* matrix;
ssize_t tid;
long long results;
};
static void *minworker(void *arg){
struct minargument *argument = (struct minargument *)arg;
const ssize_t start = argument -> tid * CHUNK;
const ssize_t end = argument -> tid == THREADS - 1 ? g_elements : (argument -> tid + 1) * CHUNK;
long long result = argument -> matrix[0];
for(ssize_t i = start; i < end; i++){
    if(result > argument->matrix[i]){
        result = argument->matrix[i];
    }
}
argument -> results = result;
return NULL;
}
uint32_t get_minimum(const uint32_t* matrix) {
struct minargument *args = malloc(sizeof(struct minargument) * THREADS);
long long min = matrix[0];
for(ssize_t i = 0; i < THREADS; i++){
args[i] = (struct minargument){
.matrix = matrix,
.tid = i,
.results = min,
};
}
pthread_t thread_ids[THREADS];
for(ssize_t i =0; i < THREADS; i++){
if(pthread_create(thread_ids + i, NULL, minworker, args + i) != 0){
perror("pthread_create failed");
return 1;
}
}
for (ssize_t i = 0; i < THREADS; i++){
if(pthread_join(thread_ids[i], NULL) != 0){
perror("pthread_join failed");
return 1;
}
}
for(ssize_t i =0; i < THREADS; i++){
    if(args[i].results < min)
        min = args[i].results;
}
free(args);
return min;
}
However this seems to be slower than the first function.
Am I correct in using threads to make the first function run faster? And if so, how do I modify the second function so that it is faster than the first function?
Having more threads than cores available to run them on is always going to be slower than a single thread due to the overhead of creating them, scheduling them and waiting for them all to finish.
The example you provide is unlikely to benefit from any optimisation beyond what the compiler will do for you, as it is a short and simple operation. If you were doing something more complicated on a multi-core system, such as multiplying two huge matrices or running a correlation algorithm on high-speed real-time data, then multi-threading may be the solution.
A more abstract answer to your question is another question: do you really need to be optimising it at all? Unless you know for a fact that there are performance issues, then your time would be better spent adding more functionality to your program than fixing a problem that doesn't really exist.
Edit - Comparison
I just ran (a representative version of) the OP's code on a 16 bit ARM microcontroller running with a 40 MHz instruction clock. Code compiled using GCC with no optimisation.
Finding the minimum of 20,000 32-bit integers took a little over 25 milliseconds.
With a 40 kByte page size (to hold half of a 20,000-element array of 4-byte values), with threads running on different cores of a dual Intel 5150 processor clocked at 2.67 GHz, it takes nearly 50 ms just to do the context switch and paging operation!
A simple, single-threaded microcontroller implementation takes half as long in real time terms as a multi-threaded desktop implementation.
I have the following piece of code
#include "stdio.h"
#include "stdlib.h"
#include <string.h>
#define MAXBINS 8
void swap_long(unsigned long int **x, unsigned long int **y){
unsigned long int *tmp;
tmp = x[0];
x[0] = y[0];
y[0] = tmp;
}
void swap(unsigned int **x, unsigned int **y){
unsigned int *tmp;
tmp = x[0];
x[0] = y[0];
y[0] = tmp;
}
void truncated_radix_sort(unsigned long int *morton_codes,
unsigned long int *sorted_morton_codes,
unsigned int *permutation_vector,
unsigned int *index,
int *level_record,
int N,
int population_threshold,
int sft, int lv){
int BinSizes[MAXBINS] = {0};
unsigned int *tmp_ptr;
unsigned long int *tmp_code;
level_record[0] = lv; // record the level of the node
if(N<=population_threshold || sft < 0) { // Base case. The node is a leaf
memcpy(permutation_vector, index, N*sizeof(unsigned int)); // Copy the pernutation vector
memcpy(sorted_morton_codes, morton_codes, N*sizeof(unsigned long int)); // Copy the Morton codes
return;
}
else{
// Find which child each point belongs to
int j = 0;
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
BinSizes[ii]++;
}
// scan prefix
int offset = 0, i = 0;
for(i=0; i<MAXBINS; i++){
int ss = BinSizes[i];
BinSizes[i] = offset;
offset += ss;
}
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
permutation_vector[BinSizes[ii]] = index[j];
sorted_morton_codes[BinSizes[ii]] = morton_codes[j];
BinSizes[ii]++;
}
//swap the index pointers
swap(&index, &permutation_vector);
//swap the code pointers
swap_long(&morton_codes, &sorted_morton_codes);
/* Call the function recursively to split the lower levels */
offset = 0;
for(i=0; i<MAXBINS; i++){
int size = BinSizes[i] - offset;
truncated_radix_sort(&morton_codes[offset],
&sorted_morton_codes[offset],
&permutation_vector[offset],
&index[offset], &level_record[offset],
size,
population_threshold,
sft-3, lv+1);
offset += size;
}
}
}
I tried to make this block
int j = 0;
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
BinSizes[ii]++;
}
parallel by substituting it with the following
int rc,j;
pthread_t *thread = (pthread_t *)malloc(NTHREADS*sizeof(pthread_t));
belong *belongs = (belong *)malloc(NTHREADS*sizeof(belong));
pthread_mutex_init(&bin_mtx, NULL);
for (j = 0; j < NTHREADS; j++){
belongs[j].n = NTHREADS;
belongs[j].N = N;
belongs[j].tid = j;
belongs[j].sft = sft;
belongs[j].BinSizes = BinSizes;
belongs[j].mcodes = morton_codes;
rc = pthread_create(&thread[j], NULL, belong_wrapper, (void *)&belongs[j]);
}
for (j = 0; j < NTHREADS; j++){
rc = pthread_join(thread[j], NULL);
}
and defining these outside the recursive function
typedef struct{
int n, N, tid, sft;
int *BinSizes;
unsigned long int *mcodes;
}belong;
pthread_mutex_t bin_mtx;
void * belong_wrapper(void *arg){
int n, N, tid, sft, j;
int *BinSizes;
unsigned int ii;
unsigned long int *mcodes;
n = ((belong *)arg)->n;
N = ((belong *)arg)->N;
tid = ((belong *)arg)->tid;
sft = ((belong *)arg)->sft;
BinSizes = ((belong *)arg)->BinSizes;
mcodes = ((belong *)arg)->mcodes;
for (j = tid; j<N; j+=n){
ii = (mcodes[j] >> sft) & 0x07;
pthread_mutex_lock(&bin_mtx);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx);
}
    return NULL; /* a void * thread function must return a value */
}
However it takes a lot more time than the serial one to execute... Why is this happening? What should I change?
Since you're using a single mutex to guard updates to the BinSizes array, you're still ultimately doing all the updates to this array sequentially: only one thread can call BinSizes[ii]++ at any given time. Basically you're still executing your function in sequence but incurring the extra overhead of creating and destroying threads.
There are several options I can think of for you (there are probably more):
- Do as #Chris suggests and make each thread update one portion of BinSizes. This might not be viable depending on the properties of the calculation you're using to compute ii.
- Create multiple mutexes representing different partitions of BinSizes. For example, if BinSizes has 10 elements, you could create one mutex for elements 0-4 and another for elements 5-9, then use them in your thread something like so:
if (ii < 5) {
mtx_index = 0;
} else {
mtx_index = 1;
}
pthread_mutex_lock(&bin_mtx[mtx_index]);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx[mtx_index]);
You could generalize this idea to any size of BinSizes and any range: potentially you could have a different mutex for each array element. Of course, then you're opening yourself up to the overhead of creating each of these mutexes, and to the possibility of deadlock if someone tries to lock several of them at once, etc...
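A sketch of that partitioning with a fixed number of locks (NUM_LOCKS and lock_for_bin are illustrative names, not from your code; initialize each mutex with pthread_mutex_init() before starting the threads):
#define NUM_LOCKS 4
pthread_mutex_t bin_mtx[NUM_LOCKS];   /* one mutex per partition of BinSizes */
/* map a bin index to the mutex guarding its partition (partition size rounds up) */
static int lock_for_bin(int ii, int nbins) {
    int bins_per_lock = (nbins + NUM_LOCKS - 1) / NUM_LOCKS;
    return ii / bins_per_lock;
}
/* in the worker, instead of the single bin_mtx: */
int m = lock_for_bin(ii, MAXBINS);
pthread_mutex_lock(&bin_mtx[m]);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx[m]);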
Finally, you could abandon the idea of parallelizing this block altogether: as other users have mentioned using threads this way is subject to some level of diminishing returns. Unless your BinSizes array is very large, you might not see a huge benefit to parallelization even if you "do it right".
tl;dr - adding threads isn't a trivial fix for most problems. Yours isn't embarrassingly parallelizable, and this code has hardly any actual concurrency.
You spin a mutex for every (cheap) integer operation on BinSizes. This will crush any parallelism, because all your threads are serialized on this.
The few instructions you can run concurrently (the for loop and a couple of operations on the morton code array) are much cheaper than (un)locking a mutex: even using an atomic increment (if available) would be more expensive than the un-synchronized part.
One fix would be to give each thread its own output array, and combine them after all tasks are complete.
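A sketch of that per-thread approach (count_bins, HistArg, and hist_worker are illustrative names, not from your code):
#include <pthread.h>
#define MAXBINS  8
#define NTHREADS 4
typedef struct {
    int tid, nthreads, N, sft;
    unsigned long *mcodes;
    int local_bins[MAXBINS];   /* private to this thread: no locking needed */
} HistArg;
static void *hist_worker(void *arg) {
    HistArg *a = arg;
    for (int j = a->tid; j < a->N; j += a->nthreads)
        a->local_bins[(a->mcodes[j] >> a->sft) & 0x07]++;
    return NULL;
}
void count_bins(unsigned long *mcodes, int N, int sft, int *BinSizes) {
    pthread_t tid[NTHREADS];
    HistArg args[NTHREADS];   /* adjacent elements may share cache lines;
                                 pad HistArg to 64 bytes if that ever shows up */
    for (int t = 0; t < NTHREADS; t++) {
        args[t] = (HistArg){ .tid = t, .nthreads = NTHREADS,
                             .N = N, .sft = sft, .mcodes = mcodes };
        pthread_create(&tid[t], NULL, hist_worker, &args[t]);
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        for (int b = 0; b < MAXBINS; b++)
            BinSizes[b] += args[t].local_bins[b];   /* combine once, after the join */
    }
}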
Also, you create and join multiple threads per call. Creating threads is relatively expensive compared to computation, so it's generally recommended to create a long-lived pool of them to spread that cost.
Even if you do this, you need to tune the number of threads according to how many (free) cores you have. If you do this in a recursive function, how many threads exist at the same time? Creating more threads than you have cores to schedule them on is pointless.
Oh, and you're leaking memory.
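As for the long-lived pool mentioned above, here is a sketch of its work-sharing half: workers claim task indices from a shared atomic counter instead of spawning one thread per task. A full pool would also keep the workers alive between batches (e.g., with a condition variable); pool_worker, run_tasks, and the globals are illustrative names:
#include <pthread.h>
#include <stdatomic.h>
#define POOL_THREADS 4
#define NTASKS       64
typedef void (*task_fn)(int);
static task_fn    g_task;   /* the work to perform for each task index */
static atomic_int g_next;   /* next unclaimed task index */
static void *pool_worker(void *unused) {
    (void)unused;
    int i;
    /* threads race to claim indices until the task range is exhausted */
    while ((i = atomic_fetch_add(&g_next, 1)) < NTASKS)
        g_task(i);
    return NULL;
}
void run_tasks(task_fn fn) {
    pthread_t tid[POOL_THREADS];
    g_task = fn;
    atomic_store(&g_next, 0);
    for (int t = 0; t < POOL_THREADS; t++)
        pthread_create(&tid[t], NULL, pool_worker, NULL);
    for (int t = 0; t < POOL_THREADS; t++)
        pthread_join(tid[t], NULL);
}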
I'm getting a segmentation fault when accessing an array inside a for loop.
What I'm trying to do is generate all subsequences of a DNA string.
The fault was happening when I created the array inside the for loop. After reading for a while, I found out that OpenMP limits the stack size, so it would be safer to use the heap instead. So I changed the code to use malloc, but the problem persists.
This is the full code:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#define DNA_SIZE 26
#define DNA "AGTC"
static char** powerset(int argc, char* argv)
{
unsigned int i, j, bits, i_max = 1U << argc;
if (argc >= sizeof(i) * CHAR_BIT) {
fprintf(stderr, "Error: set too large\n");
exit(1);
}
omp_set_num_threads(2);
char** subsequences = malloc(i_max*sizeof(char*));
#pragma omp parallel for shared(subsequences, argv)
for (i = 0; i < i_max ; ++i) {
//printf("{");
int characters = 0;
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
//This is the line where the error is happening.
char *ss = malloc(characters+1 * sizeof(char)*16);//the *16 is just to pad to a cache line
int ssindex = 0;
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
//char a = argv[j];
ss[ssindex++] = argv[j] ;
}
}
ss[ssindex] = '\0';
subsequences[i] = ss;
}
return subsequences;
}
char* getdna()
{
int i;
char *dna = (char *)malloc((DNA_SIZE+1) * sizeof(char));
for(i = 0; i < DNA_SIZE; i++)
{
int randomDNA = rand() % 4;
dna[i] = DNA[randomDNA];
}
dna[DNA_SIZE] = '\0';
return dna;
}
void printResult(char** ss, int size)
{
//PRINTING THE SUBSEQUENCES
printf("SUBSEQUENCES FOUND:\r\n");
int i;
for(i = 0; i < size; i++)
{
printf("%i.\t{ %s } \r\n",i+1 , ss[i]);
free(ss[i]);
}
free(ss);
}
int main(int argc, char* argv[])
{
srand(time(NULL));
double starttime, stoptime;
starttime = omp_get_wtime();
char* a = getdna();
printf("%s\r\n", a);
int size = pow(2, DNA_SIZE);
printf("number of subsequences: %i\r\n", size);
char** subsequences = powerset(DNA_SIZE, a);
//todo: make it optional printing to the stdout or saving to a file
//printResult(subsequences, size);
stoptime = omp_get_wtime();
printf("Tempo de execucao: %3.2f segundos\n\n", stoptime-starttime);
printf("Numero de sequencias geradas: %i\n\n", size);
free(a);
return 0;
}
I also tried making the malloc line critical with #pragma omp critical, which didn't help.
I also tried compiling with -mstackrealign, which also didn't work.
I appreciate all the help.
You should use a more efficient thread-safe memory allocator.
Applications can use malloc() and free() either explicitly, or implicitly in compiler-generated code for dynamic/allocatable arrays, vectorized intrinsics, and so on.
The thread-safe malloc() and free() in some libc implementations carry a high synchronization overhead caused by internal locking. Faster allocators for multi-threaded applications exist. For instance, on Solaris multithreaded applications should be linked with the "MT-hot" allocator mtmalloc (i.e., link with -lmtmalloc to use mtmalloc instead of the default libc allocator). glibc, used on Linux and on some OpenSolaris and FreeBSD distributions with GNU userlands, uses a modified ptmalloc2 allocator, which is based on Doug Lea's dlmalloc. It uses multiple memory arenas to achieve near lock-free behavior. It can also be configured to use per-thread arenas, and some distributions, notably RHEL 6 and derivatives, have that feature enabled.
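As an illustration of tuning the glibc side (hedged: mallopt and M_ARENA_MAX are glibc-specific, not portable):
#include <malloc.h>   /* glibc-specific header for mallopt */
int main(void)
{
    /* Cap the number of malloc arenas; glibc also honors the
       MALLOC_ARENA_MAX environment variable for the same purpose. */
    mallopt(M_ARENA_MAX, 8);
    /* ... start the threads that allocate heavily ... */
    return 0;
}
The revised powerset below takes the other route entirely: do all the allocation up front, outside the parallel region: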
static char** powerset(int argc, char* argv)
{
int i, j, bits, i_max = 1U << argc;
if (argc >= sizeof(i) * CHAR_BIT) {
fprintf(stderr, "Error: set too large\n");
exit(1);
}
omp_set_num_threads(2);
char** subsequences = malloc(i_max*sizeof(char*));
int characters = 0;
for (i = 0; i < i_max ; ++i)
{
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
subsequences[i] = malloc(characters+1 * sizeof(char)*16);
characters = 0;
}
#pragma omp parallel for shared(subsequences, argv) private(j,bits)
for (i = 0; i < i_max; ++i)
{
int ssindex = 0;
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
subsequences[i][ssindex++] = argv[j] ;
}
}
subsequences[i][ssindex] = '\0';
}
return subsequences;
}
I create (and allocate) the desired data before the parallel region and then do the remaining calculations in parallel. The version above, running with 12 threads on a 24-core machine, takes "Tempo de execucao: 9.44 segundos".
However, when I try to parallelize the following code:
#pragma omp parallel for shared(subsequences) private(bits,characters)
for (i = 0; i < i_max ; ++i)
{
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
subsequences[i] = malloc(characters+1 * sizeof(char)*16);
characters = 0;
}
it take "Tempo de execucao: 10.19 segundos"
As you can see calling malloc in parallel leads to slower times.
Eventually, you would have had problems with the fact that each sub-malloc was asking for characters + 1*sizeof(char)*16 bytes (that is, characters + 16) rather than (characters+1) * sizeof(char) * 16, because of operator precedence; and multiplying by a cache-line factor is not necessary inside the parallel section, if I understand what you were trying to avoid.
There also seems to be some issue with this piece of code:
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
//char a = argv[j];
ss[ssindex++] = argv[j] ;
}
}
With this code, j sometimes hits DNA_SIZE or DNA_SIZE+1, which makes argv[j] read past the end of the array. (Also, using argc and argv as names for the arguments of this function is somewhat confusing.)
The problem is with dna[DNA_SIZE] = '\0';: if you have allocated memory for 26 characters (say), then you are trying to access the 27th character. Remember that array indices start from 0.