Segmentation Fault when using OpenMP when creating an array

I'm getting a segmentation fault when accessing an array inside a for loop.
What I'm trying to do is to generate all subsequences of a DNA string.
It was happening when I created the array inside the for loop. After reading for a while, I found out that OpenMP limits the stack size, so it would be safer to use the heap instead. So I changed the code to use malloc, but the problem persists.
This is the full code:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>
#define DNA_SIZE 26
#define DNA "AGTC"
static char** powerset(int argc, char* argv)
{
unsigned int i, j, bits, i_max = 1U << argc;
if (argc >= sizeof(i) * CHAR_BIT) {
fprintf(stderr, "Error: set too large\n");
exit(1);
}
omp_set_num_threads(2);
char** subsequences = malloc(i_max*sizeof(char*));
#pragma omp parallel for shared(subsequences, argv)
for (i = 0; i < i_max ; ++i) {
//printf("{");
int characters = 0;
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
//This is the line where the error is happening.
char *ss = malloc(characters+1 * sizeof(char)*16);//the *16 is just to save the cache line
int ssindex = 0;
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
//char a = argv[j];
ss[ssindex++] = argv[j] ;
}
}
ss[ssindex] = '\0';
subsequences[i] = ss;
}
return subsequences;
}
char* getdna()
{
int i;
char *dna = (char *)malloc((DNA_SIZE+1) * sizeof(char));
for(i = 0; i < DNA_SIZE; i++)
{
int randomDNA = rand() % 4;
dna[i] = DNA[randomDNA];
}
dna[DNA_SIZE] = '\0';
return dna;
}
void printResult(char** ss, int size)
{
//PRINTING THE SUBSEQUENCES
printf("SUBSEQUENCES FOUND:\r\n");
int i;
for(i = 0; i < size; i++)
{
printf("%i.\t{ %s } \r\n",i+1 , ss[i]);
free(ss[i]);
}
free(ss);
}
int main(int argc, char* argv[])
{
srand(time(NULL));
double starttime, stoptime;
starttime = omp_get_wtime();
char* a = getdna();
printf("%s\r\n", a);
int size = pow(2, DNA_SIZE);
printf("number of subsequences: %i\r\n", size);
char** subsequences = powerset(DNA_SIZE, a);
//todo: make it optional printing to the stdout or saving to a file
//printResult(subsequences, size);
stoptime = omp_get_wtime();
printf("Tempo de execucao: %3.2f segundos\n\n", stoptime-starttime);
printf("Numero de sequencias geradas: %i\n\n", size);
free(a);
return 0;
}
I also tried making the malloc line a critical section with #pragma omp critical, which didn't help.
I also tried compiling with -mstackrealign, which didn't work either.
I appreciate all the help.

You should use more efficient, thread-safe memory management.
Applications can call malloc() and free() explicitly, or implicitly through compiler-generated code for dynamic/allocatable arrays, vectorized intrinsics, and so on.
The thread-safe malloc() and free() in some libc implementations carry a high synchronization overhead caused by internal locking. Faster allocators for multi-threaded applications exist. For instance, on Solaris, multithreaded applications should be linked with the "MT-hot" allocator mtmalloc (i.e., link with -lmtmalloc to use mtmalloc instead of the default libc allocator). glibc, used on Linux and on some OpenSolaris and FreeBSD distributions with GNU userlands, uses a modified ptmalloc2 allocator, which is based on Doug Lea's dlmalloc. It uses multiple memory arenas to achieve near lock-free behavior. It can also be configured to use per-thread arenas, and some distributions, notably RHEL 6 and derivatives, have that feature enabled.
static char** powerset(int argc, char* argv)
{
int i, j, bits, i_max = 1U << argc;
if (argc >= sizeof(i) * CHAR_BIT) {
fprintf(stderr, "Error: set too large\n");
exit(1);
}
omp_set_num_threads(2);
char** subsequences = malloc(i_max*sizeof(char*));
int characters = 0;
for (i = 0; i < i_max ; ++i)
{
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
subsequences[i] = malloc(characters+1 * sizeof(char)*16);
characters = 0;
}
#pragma omp parallel for shared(subsequences, argv) private(j,bits)
for (i = 0; i < i_max; ++i)
{
int ssindex = 0;
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
subsequences[i][ssindex++] = argv[j] ;
}
}
subsequences[i][ssindex] = '\0';
}
return subsequences;
}
I create (and allocate) the desired data before the parallel region, and then do the remaining calculations in parallel. The version above, running with 12 threads on a 24-core machine, takes "Tempo de execucao: 9.44 segundos" (execution time: 9.44 seconds).
However, when I try to parallelize the following code:
// characters must be firstprivate: a private copy would start out uninitialized
#pragma omp parallel for shared(subsequences) private(bits) firstprivate(characters)
for (i = 0; i < i_max ; ++i)
{
for (bits=i; bits ; bits>>=1)
if (bits & 1)
++characters;
subsequences[i] = malloc(characters+1 * sizeof(char)*16);
characters = 0;
}
it takes "Tempo de execucao: 10.19 segundos" (execution time: 10.19 seconds).
As you can see, calling malloc in parallel leads to slower times.
Eventually, you would have had problems with the fact that each sub-malloc was trying to allocate (characters+1*sizeof(char)*16) bytes rather than ((characters+1)*sizeof(char)*16): because * binds tighter than +, the first expression is just characters + 16. Also, if I understand what you were trying to avoid, multiplying by a cache-line factor is not necessary inside the parallel section.
There also seems to be some issue with this piece of code:
for (bits = i, j=0; bits; bits >>= 1, ++j) {
if (bits & 1) {
//char a = argv[j];
ss[ssindex++] = argv[j] ;
}
}
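With this code, j sometimes hits DNA_SIZE or DNA_SIZE+1, so argv[j] reads past the end of the array. This happens because j and bits are declared outside the parallel loop and are therefore shared between all threads, so one thread's shifts and increments interfere with another's. (Also, using argc and argv as names for arguments in this function is somewhat confusing.)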

The problem is here with dna[DNA_SIZE] = '\0';. So far you have allocated memory for 26 characters (say), and you are trying to access the 27th character. Always remember array index starts from 0.

Related

My program crashes if the vector is too large?

When I test my program with large vectors, i.e. more than 12 elements, it crashes (I get an lldb error). However, it works fine for small vectors. I think it's trying to access memory it shouldn't, but I have no idea how to fix it.
The program is supposed to print out the vectors whose elements sum to the "target".
Also, is there a different way to express if (i & (1 << j))?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int c = 0;
/* find subsets of a given set */
void findSubsets(int *value, int n, int i) {
int j;
if (i < 0)
return;
for (j = 0; j < n; j++) {
/*
* checking jth bit is set in i. If
* it is set, then fetch the element at
* jth index in value array
*/
if (i & (1 << j)) {
suma = suma + value[j];
}
}
/* recursive call */
findSubsets(value, n, i - 1);
return;
}
int main() {
/* 2^n - indicates the possible no of subsets */
int count = pow(2, size);
/* finds the subsets of the given set */
findSubsets(vector, size, count - 1);
return 0;
}
I would like to be able to use this program for large vectors (up to about 20 elements).

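The problem is that findSubsets recurses once per subset: starting from count - 1, each call calls itself with i - 1, so for n elements the recursion is about 2^n calls deep (over a million for 20 elements). This will cause a stack overflow. Instead of recursion, try iteration: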
for (int i = 0; i < count; i++) {
findSubsets(vector, size, i);
}
And remove the recursive call within findSubsets.

error with array size

I am trying to make a program that counts the prime numbers that don't exceed a given integer, using the sieve of Eratosthenes. While my program works fine (and fast) for small numbers, after a certain number (46337) I get a "command terminated by signal 11" error, which I suppose has to do with the array size. I tried to use malloc() but I didn't get it quite right. What should I do for big numbers (up to 5 billion)?
#include <stdio.h>
#include<stdlib.h>
int main(){
signed long int x,i, j, prime = 0;
scanf("%ld", &x);
int num[x];
for(i=2; i<=x;i++){
num[i]=1;
}
for(i=2; i<=x;i++){
if(num[i] == 1){
for(j=i*i; j<=x; j = j + i){
num[j] = 0;
}
//printf("num[%d]\n", i);
prime++;
}
}
printf("%ld", prime);
return 0;
}

Your array
int num[x];
is on the stack, where only small arrays can be accommodated. For a large array you'll have to allocate memory dynamically. You can save memory by using the char type, because you only need a status flag.
char *num = malloc(x+1); // allow for indexing by [x]
if(num == NULL) {
// deal with allocation error
}
//... the sieve code
free(num);
I also suggest checking that i*i does not overflow the integer type, by using
if(num[i] == 1){
if (x / i >= i){ // make sure i*i won't break
for(j=i*i; j<=x; j = j + i){
num[j] = 0;
}
}
}
Lastly, you want to go to 5 billion, which is outside the range of uint32_t (which is what unsigned long int is on my system): its maximum is about 4.29 billion. If that limit is acceptable, change the int definitions to unsigned, watching out that your loop controls don't wrap around; that is, use unsigned x = UINT_MAX - 1;
If you don't have 5 GB of memory available, store one bit per number instead, as suggested by @BoPersson.

The following code checks for errors, has been tested with values up to 5000000000, correctly outputs the final count of primes, and uses malloc to avoid overrunning the available stack space.
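#include <stdio.h>
#include <stdlib.h>
int main()
{
unsigned long int x,i, j;
unsigned prime = 0;
if (scanf("%lu", &x) != 1)
{
fprintf(stderr, "invalid input\n");
exit(EXIT_FAILURE);
}
char *num = malloc(x + 1); // one flag for each value 0..x, so x itself is tested
if( NULL == num)
{
perror( "malloc failed");
exit(EXIT_FAILURE);
}
for(i=0; i<=x;i++)
{
num[i]=1;
}
for(i=2; i<=x;i++)
{
if(num[i] == 1)
{
// start at i*i only if it is within the table; this also keeps i*i from overflowing
if (i <= x / i)
{
for(j=i*i; j<=x; j = j + i)
{
num[j] = 0;
}
}
//printf("num[%lu]\n", i);
prime++;
}
}
printf("%u\n", prime);
free(num);
return 0;
}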

Prevent False Sharing without using padding

I'm currently learning about pthreads in C and came across the issue of False Sharing. I think I understand the concept of it and I've tried experimenting a bit.
Below is a short program that I've been playing around with. Eventually I'm going to change it into a program to take a large array of ints and sum it in parallel.
#include <stdio.h>
#include <pthread.h>
#define THREADS 4
#define NUMPAD 14
struct s
{
int total; // 4 bytes
int my_num; // 4 bytes
int pad[NUMPAD]; // 4 * NUMPAD bytes
} sum_array[THREADS];
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
for (int i = 0; i < 10; ++i) {
sum_array[curr_ind].total += sum_array[curr_ind].my_num;
}
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
int main(void) {
int args[THREADS] = { 0, 1, 2, 3 };
pthread_t thread_ids[THREADS];
for (size_t i = 0; i < THREADS; ++i) {
sum_array[i].total = 0;
sum_array[i].my_num = i + 1;
pthread_create(&thread_ids[i], NULL, worker, &args[i]);
}
for (size_t i = 0; i < THREADS; ++i) {
pthread_join(thread_ids[i], NULL);
}
}
My question is, is it possible to prevent false sharing without using padding? Here struct s has a size of 64 bytes so that each struct is on its own cache line (assuming that the cache line is 64 bytes). I'm not sure how else I can achieve parallelism without padding.
Also, if I were to sum an array of varying size, between 1000 and 50,000 bytes, how could I prevent false sharing? Would I be able to pad it out using a similar program? My current thought is to put each int from the big array into an array of struct s and then use parallelism to sum it. However, I'm not sure if this is the optimal solution.

Partition the problem: In worker(), sum into a local variable, then add the local variable to the array:
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
int localsum = 0;
for (int i = 0; i < 10; ++i) {
localsum += sum_array[curr_ind].my_num;
}
sum_array[curr_ind].total += localsum;
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
This may still have false sharing after the loop, but that happens only once per thread. Thread creation overhead is much more significant than a single cache miss. Of course, you probably want to have a loop that actually does something time-consuming, as your current code can be optimized to:
static void *worker(void * ind) {
const int curr_ind = *(int *) ind;
int localsum = 10 * sum_array[curr_ind].my_num;
sum_array[curr_ind].total += localsum;
printf("%d\n", sum_array[curr_ind].total);
return NULL;
}
Its runtime is definitely dominated by thread creation and by the synchronization inside printf().

Finding cyclic single transposition vector in C

I have the input as array A = [ 2,3,4,1]
The output is simply all the permutations of the elements of A that can be obtained by a single transposition (swapping two neighbouring elements). So the output is:
[3,2,4,1],[ 2,4,3,1],[2,3,1,4],[1,3,4,2]
Circular transposition is allowed, so [2,3,4,1] ==> [1,3,4,2] is a valid output.
How to do it in C?
EDIT
In Python, it would be done as follows:
def Transpose(alist):
    leveloutput = []
    n = len(alist)
    for i in range(n):
        x = alist[:]
        x[i], x[(i+1)%n] = x[(i+1)%n], x[i]
        leveloutput.append(x)
    return leveloutput

This solution uses dynamic memory allocation, so it works for an array of any length, passed as size.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int *swapvalues(const int *const array, size_t size, int left, int right)
{
int *output;
int tmp;
output = malloc(size * sizeof(int));
if (output == NULL) /* check for success */
return NULL;
/* copy the original values into the new array */
memcpy(output, array, size * sizeof(int));
/* swap the requested values */
tmp = output[left];
output[left] = output[right];
output[right] = tmp;
return output;
}
int **transpose(const int *const array, size_t size)
{
int **output;
int i;
int j;
/* generate a swapped copy of the array. */
output = malloc(size * sizeof(int *));
if (output == NULL) /* check success */
return NULL;
j = 0;
for (i = 0 ; i < size - 1 ; ++i)
{
/* allocate space for `size` ints */
output[i] = swapvalues(array, size, j, 1 + j);
if (output[i] == NULL)
goto cleanup;
/* in the next iteration swap the next two values */
j += 1;
}
/* do the same to the first and last element now */
output[i] = swapvalues(array, size, 0, size - 1);
if (output[i] == NULL)
goto cleanup;
return output;
cleanup: /* some malloc call returned NULL, clean up and exit. */
if (output == NULL)
return NULL;
for (j = i ; j >= 0 ; j--)
free(output[j]);
free(output);
return NULL;
}
int main()
{
int array[4] = {2, 3, 4, 1};
int i;
int **permutations = transpose(array, sizeof(array) / sizeof(array[0]));
if (permutations != NULL)
{
for (i = 0 ; i < 4 ; ++i)
{
int j;
fprintf(stderr, "[ ");
for (j = 0 ; j < 4 ; ++j)
{
fprintf(stderr, "%d ", permutations[i][j]);
}
fprintf(stderr, "] ");
free(permutations[i]);
}
fprintf(stderr, "\n");
}
free(permutations);
return 0;
}
Although some people think goto is evil, this is a very good use for it. Don't use it to control the flow of your program (for instance, to create a loop); that is confusing. But for the exit point of a function that has to do several things before returning, I think it's a nice use. That's my opinion: for me it makes the code easier to understand, but I might be wrong.

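Have a look at this code I have written, with an example:
#include <stdio.h>
/* print every single transposition of arr, including the cyclic one */
void transpose(void) {
int arr[] = {3, 5, 8, 1};
int l = sizeof (arr) / sizeof (arr[0]);
int i, j, k;
for (i = 0; i < l; i++) {
j = (i + 1) % l; // neighbour index, wraps around to 0 for the last element
int copy[l];
for (k = 0; k < l; k++)
copy[k] = arr[k];
int t = copy[i];
copy[i] = copy[j];
copy[j] = t;
printf("{%d, %d, %d, %d}\n", copy[0], copy[1], copy[2], copy[3]);
}
}
int main(void) {
transpose();
return 0;
}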
Sample Output :
{5, 3, 8, 1}
{3, 8, 5, 1}
{3, 5, 1, 8}
{1, 5, 8, 3}

A few notes:
a single memory block is preferred to, say, an array of pointers because of better locality and less heap fragmentation;
there is only one cyclic transposition, so it can be done separately, avoiding the overhead of the modulo operator in each iteration.
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int *single_transposition(const int *a, unsigned int n) {
if (n == 0)
return NULL;
// Output size is known, can use a single allocation
int *out = malloc(n * n * sizeof(int));
if (out == NULL)
return NULL;
// Perform the non-cyclic transpositions
int *dst = out;
for (unsigned int i = 0; i < n - 1; ++i) {
memcpy(dst, a, n * sizeof (int));
int t = dst[i];
dst[i] = dst[i + 1];
dst[i + 1] = t;
dst += n;
}
// Perform the cyclic transposition, no need to impose the overhead
// of the modulo operation in each of the above iterations.
memcpy(dst, a, n * sizeof (int));
int t = dst[0];
dst[0] = dst[n-1];
dst[n-1] = t;
return out;
}
int main() {
int a[] = { 2, 3, 4, 1 };
const unsigned int n = sizeof a / sizeof a[0];
int *b = single_transposition(a, n);
if (b == NULL)
return 1;
for (unsigned int i = 0; i < n * n; ++i)
printf("%d%c", b[i], (i % n) == n - 1 ? '\n' : ' ');
free(b);
return 0;
}

There are many ways to tackle this problem, and the most important questions are how you're going to consume the output and how variable the size of the array is. You've already said the array is going to be very large, so I assume memory, not CPU, will be the biggest bottleneck here.
If the output is going to be used only a few times (especially just once), it may be best to use a functional approach: generate every transposition on the fly, and never have more than one in memory at a time. For this approach many high-level languages would work as well as (maybe sometimes even better than) C.
If the size of the array is fixed or semi-fixed (e.g. a few sizes known at compile time), you can define structures, ideally using C++ templates.
If the size is dynamic and you still want to have every transposition in memory, then you should allocate one huge memory block and treat it as a contiguous array of arrays. This is very simple and straightforward at the machine level. Unfortunately, it's best tackled using pointer arithmetic, a feature of C/C++ that is renowned for being difficult to understand. (It isn't if you learn C from the basics, but people coming down from high-level languages have a proven track record of getting it completely wrong the first time.)
The other approach is to have a big array of pointers to smaller arrays, which results in a double pointer (**), which is even more terrifying to newcomers.
Sorry for the long post, which is not a real answer, but IMHO there are too many open questions to choose the best solution, and I feel you need a bit more basic C knowledge to manage them on your own.
/edit:
As other solutions have already been posted, here's a solution with a minimal memory footprint. This is the most limiting approach: it uses the same single buffer over and over, and you must be sure that your code is finished with the first transposition before moving on to the next one. On the bright side, it'll still work just fine when other solutions would require a terabyte of memory. It's also so undemanding that it might as well be implemented in a high-level language. I insisted on using C++ in case you would like to have more than one matrix at a time (e.g. comparing them, or running several threads concurrently).
#include <stdio.h>
#include <string.h>
#define NO_TRANSPOSITION -1
class Transposable1dMatrix
{
private:
int * m_pMatrix;
int m_iMatrixSize;
int m_iCurrTransposition;
//transposition N means that elements N and N+1 are swapped
//transposition -1 means no transposition
//transposition (size-1) means the cyclic transposition
//as usual in C, (size-1) is the last valid index
public:
Transposable1dMatrix(int MatrixSize)
{
m_iMatrixSize = MatrixSize;
m_pMatrix = new int[m_iMatrixSize];
m_iCurrTransposition = NO_TRANSPOSITION;
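}
~Transposable1dMatrix()
{
delete[] m_pMatrix;
}
//copying would delete m_pMatrix twice, so forbid it
Transposable1dMatrix(const Transposable1dMatrix&) = delete;
Transposable1dMatrix& operator=(const Transposable1dMatrix&) = delete;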
int* GetCurrentMatrix()
{
return m_pMatrix;
}
bool IsTransposed()
{
return m_iCurrTransposition != NO_TRANSPOSITION;
}
void ReturnToOriginal()
{
if(!IsTransposed())//already in original state, nothing to do here
return;
//apply the same transposition again to go back to the original
TransposeInternal(m_iCurrTransposition);
m_iCurrTransposition = NO_TRANSPOSITION;
}
void TransposeTo(int TranspositionIndex)
{
if(IsTransposed())
ReturnToOriginal();
TransposeInternal(TranspositionIndex);
m_iCurrTransposition = TranspositionIndex;
}
private:
void TransposeInternal(int TranspositionIndex)
{
int Swap1 = TranspositionIndex;
int Swap2 = TranspositionIndex+1;
if(Swap2 == m_iMatrixSize)
Swap2 = 0;//this is the cyclic one
int tmp = m_pMatrix[Swap1];
m_pMatrix[Swap1] = m_pMatrix[Swap2];
m_pMatrix[Swap2] = tmp;
}
};
int main()
{
int arr[] = {2, 3, 4, 1};
int size = 4;
//allocate
Transposable1dMatrix* test = new Transposable1dMatrix(size);
//fill data
memcpy(test->GetCurrentMatrix(), arr, size * sizeof (int));
//run test
for(int x = 0; x<size;x++)
{
test->TransposeTo(x);
int* copy = test->GetCurrentMatrix();
printf("{%d, %d, %d, %d}\n", copy[0], copy[1], copy[2], copy[3]);
}
delete test;
return 0;
}

Why does my sieve get SIGSEGV

Alright, so, just for fun, I was working on the sieve of Eratosthenes.
It was working fine initially, so I set out to improve its runtime complexity, and now, I don't know why, but I'm getting a segmentation fault.
Here's the code:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int* check = malloc(1000000000 * sizeof(int));
long long int i;
for(i = 0;i < 1000000000;i++)
{
check[i] = 0;
}
int j = 0;
for(i = 2;i <= 1000000002;i++)
{
if(check[i] == 0)
{
printf("%lld\n", i);
for(j = 1;j < (1000000001/i);j++)
{
check[j*i] == 1;
}
}
}
return 0;
}
Any help as to why it fails would be appreciated.

Your code has multiple errors, any of which could explain a segfault. First, you have not checked the return value of malloc, which may be NULL, even when you are totally sure it couldn't be.
Second, you are exceeding the bounds of the array you've allocated when you iterate i from 2 to 1000000002. With so many zeros it's hard to eyeball, so here are your figures with separators:
Initial allocation: 1,000,000,000
Range of i: 2 to 1,000,000,002 inclusive
At the end of that loop you are accessing memory past the end of your array.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#if 1
static const size_t N = 1000 * 1000 * 1000;
#else
static const size_t N = 1000;
#endif
Don't use a magic number; define it as a constant. 1000000000 is also hard to read, and your C compiler can do the calculation for you at compile time. You should also have started with a small number: if you change #if 1 into #if 0, the #else clause defining N as 1,000 takes effect.
int main(void)
{
char* check = malloc(N + 3);
When you essentially use check as a boolean array, it doesn't have to be of type int. int typically occupies 4 bytes, whereas char occupies only 1 byte.
if (NULL == check) {
perror("malloc");
abort();
}
malloc silently returns NULL when it fails to find a memory chunk of the requested size. But if you work with a 64-bit OS and compiler, I don't think it's likely to fail...
long long int i;
memset(check, 0, sizeof(check[0]) * (N + 3));
memset fills an array with the value of the 2nd parameter (here 0). The third parameter is the number of BYTES to fill, so I used sizeof(check[0]) (this is not necessary for a char array because sizeof(char) == 1, but I always stick to this practice).
int j = 0;
for(i = 2;i <= N+2;i++)
{
if(check[i] == 0)
{
printf("%lld\n", i);
for(j = 1;j < ((N+1)/i);j++)
{
check[j*i] = 1;
You wrote check[j*i] == 1, but that is an equality test whose result has no effect.
}
}
}
free(check);
It is good practice to always free the memory you allocated with malloc, whether or not it is strictly necessary (here it isn't, because your program exits right after the sieve calculation), at least until you become really fluent with C.
return 0;
}

Develop Reference
