I am trying to play around with the following quicksort algorithm in parallel:
#include <stdio.h>
#include <stdlib.h>
#define MAX_UNFINISHED 1000 /* Maximum number of unsorted sub-arrays */
struct {
int first; /* Low index of unsorted sub-array */
int last; /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED]; /* Stack */
int unfinished_index; /* Index of top of stack */
float *A; /* Array of elements to be sorted */
int n; /* Number of elements in A */
void swap (float *x, float *y)
{
    float tmp;
    tmp = *x;
    *x = *y;
    *y = tmp;
}
int partition (int first, int last)
{
    int i, j;
    float x;
    x = A[last];
    i = first - 1;
    for (j = first; j < last; j++)
        if (A[j] <= x) {
            i++;
            swap (&A[i], &A[j]);
        }
    swap (&A[i+1], &A[last]);
    return (i+1);
}
void quicksort (void)
{
    int first;
    int last;
    int my_index;
    int q;
    while (unfinished_index >= 0) {
#pragma omp critical
        {
            my_index = unfinished_index;
            unfinished_index--;
            first = unfinished[my_index].first;
            last = unfinished[my_index].last;
        }
        while (first < last) {
            q = partition (first, last);
            if ((unfinished_index+1) >= MAX_UNFINISHED) {
                printf ("Stack overflow\n");
                exit (-1);
            }
#pragma omp critical
            {
                unfinished_index++;
                unfinished[unfinished_index].first = q+1;
                unfinished[unfinished_index].last = last;
                last = q-1;
            }
        }
    }
}
int verify_sorted (float *A, int n)
{
    int i;
    for (i = 0; i < n-1; i++)
        if (A[i] > A[i+1])
            return 0;
    return 1;
}
int main (int argc, char *argv[])
{
    int i;
    int seed; /* Seed component input by user */
    unsigned short xi[3]; /* Random number seed */
    if (argc != 3) {
        printf ("Command-line syntax: %s <n> <seed>\n", argv[0]);
        exit (-1);
    }
    seed = atoi (argv[2]);
    xi[0] = xi[1] = xi[2] = seed;
    n = atoi (argv[1]);
    A = (float *) malloc (n * sizeof(float));
    for (i = 0; i < n; i++)
        A[i] = erand48(xi);
    unfinished[0].first = 0;
    unfinished[0].last = n-1;
    unfinished_index = 0;
#pragma omp parallel
    quicksort ();
    if (verify_sorted (A, n)) printf ("Elements are sorted\n");
    else printf ("ERROR: Elements are NOT sorted\n");
    return 0;
}
Adding the critical sections in the quicksort() function causes a segmentation fault 11, why is that? From my basic understanding, such an error occurs when the program tries to access memory that it doesn't have access to or that doesn't exist, but I can't see where that would happen. Putting a critical section around the entire while() loop fixes it, but that would be slow.
Your strategy to parallelize the QuickSort looks overcomplicated and prone to race conditions; the typical way to parallelize that algorithm is to use OpenMP tasks. You can have a look at the following SO threads
Adding the critical sections in the quicksort() function causes a
segmentation fault 11, why is that?
Your code has several issues, namely:
There is a race-condition between the read of unfinished_index in while (unfinished_index >= 0) and the updates of that variable by other threads;
The segmentation fault 11 happens because threads can access positions out of bounds of the array unfinished, including negative positions.
Even with the critical region, multiple threads can execute:
unfinished_index--;
which eventually leads to unfinished_index < 0 and consequently:
my_index = unfinished_index;
first = unfinished[my_index].first; <-- problem
last = unfinished[my_index].last; <-- problem
accessing a negative position of the unfinished array. The same applies to the upper bound as well. All threads may pass this check:
if ((unfinished_index+1) >= MAX_UNFINISHED) {
    printf ("Stack overflow\n");
    exit (-1);
}
and then simply
#pragma omp critical
{
    unfinished_index++;
    unfinished[unfinished_index].first = q+1;
    unfinished[unfinished_index].last = last;
    last = q-1;
}
increment unfinished_index so much that it indexes positions outside of the array boundaries.
To solve those problems you can do the following:
void quicksort (void)
{
    int first;
    int last;
    int my_index;
    int q;
    int keep_working = 1;
    while (keep_working) {
        #pragma omp critical
        {
            /* Pop only while the index is inside the stack bounds. */
            my_index = unfinished_index;
            if(my_index >= 0 && my_index < MAX_UNFINISHED)
                unfinished_index--;
            else
                keep_working = 0;
        }
        if (keep_working) {
            first = unfinished[my_index].first;
            last = unfinished[my_index].last;
            while (first < last && keep_working)
            {
                q = partition (first, last);
                #pragma omp critical
                {
                    /* Reserve a slot and remember which one this thread owns. */
                    unfinished_index++;
                    my_index = unfinished_index;
                }
                if (my_index < MAX_UNFINISHED){
                    unfinished[my_index].first = q+1;
                    unfinished[my_index].last = last;
                    last = q-1;
                }
                else
                    keep_working = 0;
            }
        }
    }
}
Bear in mind, however, that the code above works for 2 threads. With more than that, you might sometimes get "the array not sorted". I will leave that for you to fix. However, I would suggest using the task approach instead, because it is faster, simpler, and relies on more modern OpenMP constructs.
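For reference, the task approach mentioned above could look roughly like the sketch below. It is not the code from the question: the names quicksort_tasks, partition_range, sort_floats and the TASK_CUTOFF value are assumptions made for illustration, and the array is passed as a parameter instead of being a global. Each recursive call becomes a task working on its own disjoint sub-array, so no shared stack (and no critical section) is needed.

```c
#include <assert.h>

/* Hypothetical cutoff: smaller sub-arrays are sorted without deferring a task. */
#define TASK_CUTOFF 1000

static void swap_float(float *x, float *y)
{
    float tmp = *x;
    *x = *y;
    *y = tmp;
}

/* Lomuto partition of a[first..last] around a[last]; returns the pivot's final index. */
static int partition_range(float *a, int first, int last)
{
    float pivot = a[last];
    int i = first - 1;
    for (int j = first; j < last; j++) {
        if (a[j] <= pivot) {
            i++;
            swap_float(&a[i], &a[j]);
        }
    }
    swap_float(&a[i + 1], &a[last]);
    return i + 1;
}

static void quicksort_tasks(float *a, int first, int last)
{
    if (first >= last)
        return;
    int q = partition_range(a, first, last);
    /* a, first, q and last are firstprivate in each task by default, and the
       two halves never overlap, so the tasks do not race with each other. */
    #pragma omp task if (last - first > TASK_CUTOFF)
    quicksort_tasks(a, first, q - 1);
    #pragma omp task if (last - first > TASK_CUTOFF)
    quicksort_tasks(a, q + 1, last);
}

/* Sorts a[0..n-1]; one thread creates the root call, the whole team runs the tasks. */
void sort_floats(float *a, int n)
{
    assert(n >= 0);
    #pragma omp parallel
    #pragma omp single
    quicksort_tasks(a, 0, n - 1);
    /* The implicit barrier at the end of the parallel region waits for all tasks. */
}
```

Compiled without OpenMP support, the pragmas are ignored and the same code runs as an ordinary sequential quicksort, which makes it easy to compare results.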
I'll get right into my problem. What I want to do is generate arrays of random numbers of different sizes: 10,000, 50,000, 100,000, 500,000, 600,000, etc. Then I would sort each one using quicksort and print the sorted array to the screen. Additionally, the time taken for the sort would be recorded and printed as well. The only part I'm having problems with, however, is generating the array. For some reason, generating more than 500,000 random numbers does not work and returns this:
Process exited after 2.112 seconds with return value 3221225725
Press any key to continue . . .
This is my code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
void randNums(int array[], int range) {
    int i, num;
    for (i = 0; i < range; i++) {
        num = rand() % range;
        array[i] = num;
    }
}
//prints elements of given array
void display(int array[], int size) {
    int i;
    for (i = 0; i < size; i++) {
        printf("#%d. %d\n", i, array[i]);
    }
}
//displays time taken for sorting algorithm to run
void timeTaken(char sortingAlgo[], int size, clock_t start, clock_t end) {
    double seconds = end - start;
    double milliseconds = seconds / 1000;
    printf("Time taken for %s Sort to sort %d numbers was %f milliseconds or %f seconds",
        sortingAlgo, size, milliseconds, seconds);
}
//quick sort
void quickSort(int array[], int first, int last) {
    int i, j, pivot, temp;
    if (first < last) {
        pivot = first;
        i = first;
        j = last;
        while (i < j) {
            while (array[i] <= array[pivot] && i < last)
                i++;
            while (array[j] > array[pivot])
                j--;
            if (i < j) {
                temp = array[i];
                array[i] = array[j];
                array[j] = temp;
            }
        }
        temp = array[pivot];
        array[pivot] = array[j];
        array[j] = temp;
        quickSort(array, first, j - 1);
        quickSort(array, j + 1, last);
    }
}
int main() {
    int size = 600000;
    int myArray[size];
    time_t end, start;
    int first, last;
    randNums(myArray, size);
    first = myArray[0];
    last = sizeof(myArray) / sizeof(myArray[0]);
    quickSort(myArray, first, last);
    display(myArray, size);
    timeTaken("Quick", size, start, end);
    return 0;
}
Any help would be greatly appreciated, Thank you!
There are a lot of little bugs in this code that aren't too difficult to resolve. I'll try to break them down in this refactoring and cleanup:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
void randNums(int* array, int range) {
    // Declare iterator variables like `i` within the scope of the iterator.
    for (int i = 0; i < range; i++) {
        // No need for a single-use variable here, just assign directly.
        array[i] = rand() % range;
    }
}
void display(int* array, int size) {
    // for is not a function, it's a control flow mechanism, so
    // it is expressed as `for (...)` with a space. `for()` implies
    // it is a function, which it isn't.
    for (int i = 0; i < size; i++) {
        printf("#%d. %d\n", i, array[i]);
    }
}
void timeTaken(char* sortingAlgo, int size, clock_t start, clock_t end) {
    // Time calculation here needs to account for the fact that clock_t
    // does not use seconds as units, it must be converted
    // https://en.cppreference.com/w/c/chrono/clock_t
    printf(
        "Time taken for %s Sort to sort %d numbers was %.6f seconds",
        sortingAlgo,
        size,
        ((double) (end - start)) / CLOCKS_PER_SEC
    );
}
void quickSort(int* array, int first, int last) {
    // Establish a guard condition. Rest of the function is no longer
    // nested in a control flow structure, so it simplifies the code.
    if (first >= last) {
        return;
    }
    int pivot = first;
    int i = first;
    int j = last;
    // Use `while (...)` as it's also a control flow structure.
    while (i < j) {
        // Adding space around operators improves clarity considerably. Unspaced
        // elements like `a->b()` are supposed to stand out and not be confused
        // with visually similar `a>>b()` which does something very different.
        while (array[i] <= array[pivot] && i < last) {
            // Use surrounding braces on all blocks, even single-line ones, as this
            // can avoid a whole class of errors caused by flawed assumptions.
            // while (...) { ... }
            i++;
        }
        while (array[j] > array[pivot]) {
            j--;
        }
        if (i < j) {
            int temp = array[i];
            array[i] = array[j];
            array[j] = temp;
        }
    }
    int temp = array[pivot];
    array[pivot] = array[j];
    array[j] = temp;
    quickSort(array, first, j - 1);
    quickSort(array, j + 1, last);
}
int main(int argc, char** argv) {
    int size = 600000;
    // If an argument was given...
    if (argc > 1) {
        // ...use that as the size parameter instead.
        size = atol(argv[1]);
    }
    // Allocate an array of sufficient size
    int* numbers = calloc(size, sizeof(int));
    randNums(numbers, size);
    // time_t has at best second-level precision, it's very inaccurate.
    // Use clock_t which gives far more fidelity.
    clock_t start = clock();
    // This function takes *offsets*, not values.
    quickSort(numbers, 0, size - 1);
    clock_t end = clock();
    display(numbers, size);
    timeTaken("Quick", size, start, end);
    return 0;
}
The number one bug here was calling quickSort() incorrectly:
// Represents first *value* in the array
first = myArray[0]; // Should be: 0
// Rough calculation of the size of the array, but this is off by one
last = sizeof(myArray)/sizeof(myArray[0]); // Should be: size - 1
quickSort(myArray, first, last);
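A side note on the crash itself: the return value 3221225725 from the question is 0xC00000FD in hexadecimal, which is the Windows status code for a stack overflow. A 600,000-element int array declared inside main() needs roughly 2.4 MB (assuming a 4-byte int), which is more than the default stack usually provides, and that is why the refactored version allocates on the heap instead. The sketch below shows that idea in isolation; make_random_array and elapsed_seconds are hypothetical helper names, not part of the code above.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: returns a heap-allocated array of `size` values in
   [0, size), or NULL if the allocation fails. The caller must free() it. */
int *make_random_array(int size)
{
    assert(size > 0);
    int *array = malloc((size_t) size * sizeof *array);
    if (array == NULL) {
        return NULL;
    }
    for (int i = 0; i < size; i++) {
        array[i] = rand() % size;
    }
    return array;
}

/* Hypothetical helper: converts an interval measured with clock() to seconds. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double) (end - start) / CLOCKS_PER_SEC;
}
```

A caller would check the result of make_random_array() for NULL before sorting, take clock() readings immediately before and after the sort, and call free() on the array once it is done with it.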
I'm implementing quicksort with different methods of selecting the pivot. I don't see any problems in my code: the functions work when tested separately, but when I put them together, they fail most of the time.
I've tried moving the files to another path, and changing the way I access the array.
void quick_sort(uint32_t arr[], int first, int last, int pivot_opt)
{
    int i, j;
    uint32_t pivot = pivot_select(arr, last, pivot_opt);
    i = first;
    j = last;
    do {
        while (arr[i] < pivot) i++; // Counting elements smaller than pivot
        while (arr[j] > pivot) j--; // Counting elements greater than pivot
        if (i <= j)
            swap(&arr[i++], &arr[j--]); // Placing smaller elements in the left, and greater elements in the right without touching the pivot
    } while (i <= j);
    if (first < j)
        quick_sort(arr, first, j, pivot_opt); // Sorting smaller elements of array
    if (i < last)
        quick_sort(arr, i, last, pivot_opt); // Sorting greater elements of array
}
uint32_t pivot_select(uint32_t arr[], int last, int pivot_opt)
{
    uint32_t pivot = 0;
    int random_index = 0;
    switch (pivot_opt)
    {
    case 0:
        pivot = arr[last]; // Choosing the pivot as the last element in the array
        break;
    case 1:
        random_index = rand()%(last); // Choosing the pivot as a random element of array
        pivot = arr[random_index];
        break;
    case 2:
        pivot = median(arr, last); // Choosing the pivot as the median of three random indexes of the array
        break;
    }
    return pivot;
}
uint32_t median(uint32_t arr[], int n)
{
    if (n <= 3)
        return arr[0]; // If the array has 3 or fewer elements, choose the first element as pivot
    int index[3] = {0}; // Index of 3 elements of original array
    int last_index = 0; // Last chosen index, to verify if index was selected
    int i = 0;
    while(i < 3) // Selecting 3 random indexes
    {
        int current_index = (rand()%(n));
        if (current_index == last_index)
            continue;
        index[i++] = current_index;
        last_index = current_index;
    }
    uint32_t array[3] = {arr[index[0]], arr[index[1]], arr[index[2]]}; // Creating array with the elements on random indexes
    insertion_sort(array, 3); // Sorting the array
    return array[1]; // Returning the pivot as the middle element of array
}
I'm getting this error on median function
Program received signal SIGSEGV, Segmentation fault.
0x0000555555555546 in median (arr=<error reading variable: Cannot access memory at address 0x7fffff7fefd8>, n=<error reading variable: Cannot access memory at address 0x7fffff7fefd4>) at /media/storage/Codes/Data Structure/Recursive_Sorting/main.c:107
107 {
I put all the libraries I'm using so I don't miss one.
Code for testing:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>
#define size 1000
uint32_t comparations_count;
uint32_t exchanges_count;
int main(int argc, char** argv)
{
    uint32_t array[size];
    fill(array, size);
    permute_array(array, size);
    comparations_count = 0;
    exchanges_count = 0;
    quick_sort(array, 0, size-1, 2);
    return 0;
}
void fill(uint32_t arr[], uint32_t n)
{
    for (size_t i = 0; i < n; i++) // Filling the array in ascending order
        arr[i] = i;
}
void swap(uint32_t *a, uint32_t *b)
{
    // Swapping two elements
    uint32_t t = *a;
    *a = *b;
    *b = t;
}
void permute_array(uint32_t a[], size_t n)
{
    // Adapted from:
    // https://www.geeksforgeeks.org/shuffle-a-given-array-using-fisher-yates-shuffle-algorithm/
    srand(time(NULL)); // Init random seed
    for (size_t i = n - 1; i > 0; i--) // Permute array
    {
        size_t j = rand() % (i+1); // Pick a random index from 0 to i
        swap(&a[i], &a[j]); // Swap a[i] with the element at random index
    }
}
void insertion_sort(uint32_t arr[], size_t n)
{
    uint32_t current_index = 0;
    uint32_t current_value = 0;
    for (size_t i = 1; i < n; i++) {
        current_index = i;
        current_value = arr[i];
        while (current_index > 0) {
            if (current_value < arr[current_index - 1]) {
                swap(&arr[current_index], &arr[current_index-1]);
                current_index--;
            }else { break; }
        }
    }
}
Let's start with this one:
while (arr[i] < pivot) i++;
What if all the elements are less than pivot? Then your i will go out of bounds. Change the condition to while(arr[i] < pivot && i <= j) i++;
Consider this one:
while (arr[j] > pivot) j--;
What if all the elements are greater than pivot? Then your j will go out of bounds (a negative number). Change the condition here too.
In my opinion, the areas mentioned above are causing the problems.
Happy debugging!
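A minimal sketch of the guarded scans is shown below. It rests on two assumptions that go beyond the answer above: the indexes are checked against first and last (rather than against each other) and before the element is read, and the pivot is always taken from inside arr[first..last], which the question's pivot_select(arr, last, ...) does not guarantee. The names quick_sort_checked, pick_pivot and swap_u32 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static void swap_u32(uint32_t *a, uint32_t *b)
{
    uint32_t t = *a;
    *a = *b;
    *b = t;
}

/* Hypothetical pivot selection: a random element of arr[first..last],
   so the pivot value is guaranteed to occur inside the sub-array. */
static uint32_t pick_pivot(const uint32_t arr[], int first, int last)
{
    assert(first <= last);
    return arr[first + rand() % (last - first + 1)];
}

void quick_sort_checked(uint32_t arr[], int first, int last)
{
    if (first >= last)
        return;
    uint32_t pivot = pick_pivot(arr, first, last);
    int i = first;
    int j = last;
    do {
        /* Test the index before reading the element, so neither scan
           can leave the range [first, last]. */
        while (i <= last && arr[i] < pivot) i++;
        while (j >= first && arr[j] > pivot) j--;
        if (i <= j)
            swap_u32(&arr[i++], &arr[j--]);
    } while (i <= j);
    if (first < j)
        quick_sort_checked(arr, first, j);
    if (i < last)
        quick_sort_checked(arr, i, last);
}
```

Because the pivot occurs in the sub-array, the first pass always performs at least one swap, so both recursive calls receive strictly smaller ranges and the recursion terminates.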
I created a struct, called ArrayCount, that contains a double array and an integer that should count how often an array occurs.
If the size of the double-array is n, the idea is, to create an array of the struct ArrayCount of the size n! (n! is called m in my code).
The idea is to save each permutation in the ArrayCount array and count the occurrences of each permutation for a given algorithm. But that's just background information and not part of the problem.
I am having issues while freeing the memory that was allocated for the double-Arrays.
Oddly enough, ~ 1/10 times my code compiles without an error message and sometimes different error messages appear.
error message:
munmap_chunk(): invalid pointer
Aborted (core dumped)
error message:
free(): invalid size
Aborted (core dumped)
error message:
Segmentation fault (core dumped)
Part of the code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
double* array_copy(const double* a, int n) {
    double* copy = calloc(n, sizeof(double));
    for(int i = 0; i < n; i++) {
        copy[i] = a[i];
    }
    return copy;
}
void shuffle(double* a, int n) {
    for(int i = n - 1; i >= 0; i--) {
        time_t t;
        /* Initializes random number generator */
        srand((unsigned) time(&t));
        double* copy = array_copy(a, i + 1);
        //Generates random numbers in the closed interval [0,i].
        int random = rand() % (i + 1);
        a[i] = a[random];
        a[random] = copy[i];
    }
}
// Refers to a double array and counts how often this array has
// occurred yet.
typedef struct {
    double* array;
    int counter;
} ArrayCount;
// Computes the factorial of n: n!.
int factorial(int n) {
    int result = 1;
    for (int i = 2; i <= n; i++) {
        result *= i;
    }
    return result;
}
/* Saves all permutations in array_counts, for a given double array of
   the length n and counts how often each permutation occurs.
   (Hint given by our supervisor: Save a copy of a in array_counts) */
void update_array_counts(/*INOUT*/ ArrayCount* array_counts, int m,
                         /*IN*/ const double* a, int n) {
    double* copy_a = array_copy(a, n);
    //Increases the counter by 1, if a is already listed in array_counts
    for(int i = 1; i <= m; i++) {
        int count = 0;
        for(int j = 0; j < n; j++) {
            if(array_counts[i].array[j] == a[j]) count++;
        }
        if(count == n) {
            array_counts[i].counter++;
            return;
        }
    }
    //Saves a in array_counts and sets the counter to 1, if a is not
    //listed in array_counts, yet
    for(int i = 1; i <= m; i++) {
        int count = 0;
        for(int j = 0; j < n; j++) {
            if(array_counts[i].array[j] == 0) count++;
        }
        if(count == n) {
            for(int j = 0; j < n; j++) {
                array_counts[i].array[j] = a[j];
            }
            array_counts[i].counter = 1;
            return;
        }
    }
}
// Prints the frequency of the different permutations of an array
// of length n.
void shuffle_frequency(int n) {
    double a[n];
    for (int i = 0; i < n; i++) {
        a[i] = i;
    }
    int m = factorial(n);
    ArrayCount* array_counts = calloc(m, sizeof(ArrayCount));
    for(int i = 1; i <= m; i++){
        array_counts[i].array = calloc(n, sizeof(double));
    }
    for (int i = 0; i < 1000 * m; i++) {
        shuffle(a, n);
        update_array_counts(array_counts, m, a, n);
    }
    for (int i = 1; i <= m; i++) {
        printf("%4d%8d ", i, array_counts[i].counter);
    }
    //The next free-statement is causing problems.
    for(int i = 1; i <= m; i++) {
        printf("i = %d\n", i);
        free(array_counts[i].array);
    }
}
int main(void) {
    return 0;
}
What am I doing wrong?
I am having issues while freeing the memory that was allocated for the
double-Arrays. Oddly enough, ~ 1/10 times my code compiles without an
error message and sometimes different error messages appear.
Compiles without an error message, or runs without an error message? I see runtime errors (segfault or abort signals, to be exact), not compile-time errors.
for (int i = 1; i <= m; i++) {
The correct way to iterate through an array of m elements is
for(int i=0; i < m; i++){
As pointed out in the comments, offsets start at 0 and go to m-1, not m. In the last iteration, free(array_counts[i].array) therefore becomes free(array_counts[m].array). What's at array_counts[m]? It could be various things, which might be deterministic or nondeterministic at runtime, but it is outside the memory you allocated. The behavior of free is undefined in this case, as it is whenever it is passed an address that wasn't allocated with malloc and friends.
Consider http://man7.org/linux/man-pages/man3/malloc.3.html, a copy of the manpage for free:
The free() function frees the memory space pointed to by ptr, which
must have been returned by a previous call to malloc(), calloc(), or
realloc(). Otherwise, or if free(ptr) has already been called
before, undefined behavior occurs.
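To make the 0 to m-1 indexing concrete, here is a small sketch of allocating and releasing such an array of structs. The helper names alloc_array_counts and free_array_counts are hypothetical; only the ArrayCount struct is taken from the question.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    double *array;
    int counter;
} ArrayCount;

/* Hypothetical helper: frees the inner arrays of entries 0..count-1,
   then the outer array itself. */
void free_array_counts(ArrayCount *array_counts, int count)
{
    if (array_counts == NULL)
        return;
    for (int i = 0; i < count; i++)
        free(array_counts[i].array);
    free(array_counts);
}

/* Hypothetical helper: allocates m entries, each holding a zeroed array of
   n doubles. Valid indexes are 0..m-1. Returns NULL on allocation failure. */
ArrayCount *alloc_array_counts(int m, int n)
{
    assert(m > 0 && n > 0);
    ArrayCount *array_counts = calloc((size_t) m, sizeof *array_counts);
    if (array_counts == NULL)
        return NULL;
    for (int i = 0; i < m; i++) {
        array_counts[i].array = calloc((size_t) n, sizeof(double));
        if (array_counts[i].array == NULL) {
            /* Release only the entries that were actually allocated. */
            free_array_counts(array_counts, i);
            return NULL;
        }
    }
    return array_counts;
}
```

Every pointer passed to free() here was returned by calloc(), and every loop stops at m-1, so none of the calls touches memory outside the allocation.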
Hello everyone, I have got a program (from the net) that I intend to speed up by converting it into its parallel version with the use of pthreads. But surprisingly though, it runs slower than the serial version. Below is the program:
# include <stdio.h>
//fast square root algorithm
double asmSqrt(double x)
{
    __asm__ ("fsqrt" : "+t" (x));
    return x;
}
//test if a number is prime
bool isPrime(int n)
{
    if (n <= 1) return false;
    if (n == 2) return true;
    if (n%2 == 0) return false;
    int sqrtn,i;
    sqrtn = asmSqrt(n);
    for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
    return true;
}
//number generator iterated from 0 to n
int main()
{
    int n = 1000000; //maximum number
    int k = 0, j;
    for (j = 0; j<= n; j++)
    {
        if(isPrime(j) == 1) k++;
        if(j == n) printf("Count: %d\n",k);
    }
    return 0;
}
First attempt for parallelization
I let the pthread manage the for loop
# include <stdio.h>
int main()
{
    //----->pthread code here<----
    for (j = 0; j<= n; j++)
    {
        if(isPrime(j) == 1) k++;
        if(j == n) printf("Count: %d\n",k);
    }
    return 0;
}
Well, it runs slower than the serial one
Second attempt
I divided the for loop into two threads and run them in parallel using pthreads
However, it still runs slower. I expected it to run about twice as fast, or at least noticeably faster, but it does not!
This is my parallel code, by the way:
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 2
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int k = 0;
double asmSqrt(double x)
{
    __asm__ ("fsqrt" : "+t" (x));
    return x;
}
struct arg_struct
{
    int initialPrime;
    int nextPrime;
};
bool isPrime(int n)
{
    if (n <= 1) return false;
    if (n == 2) return true;
    if (n%2 == 0) return false;
    int sqrtn,i;
    sqrtn = asmSqrt(n);
    for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
    return true;
}
void *parallel_launcher(void *arguments)
{
    struct arg_struct *args = (struct arg_struct *)arguments;
    int j = args -> initialPrime;
    int n = args -> nextPrime - 1;
    for (j = 0; j<= n; j++)
    {
        if(isPrime(j) == 1)
        {
            printf("This is prime: %d\n",j);
            pthread_mutex_lock( &mutex1 );
            k++;
            pthread_mutex_unlock( &mutex1 );
        }
        if(j == n) printf("Count: %d\n",k);
    }
    return NULL;
}
int main()
{
    int f = 100000000;
    int m;
    pthread_t thread_id[NTHREADS];
    struct arg_struct args;
    int rem = (f+1)%NTHREADS;
    int n = floor((f+1)/NTHREADS);
    for(int h = 0; h < NTHREADS; h++)
    {
        if(rem > 0)
        {
            m = n + 1;
            rem-= 1;
        }
        else if(rem == 0)
        {
            m = n;
        }
        args.initialPrime = args.nextPrime;
        args.nextPrime = args.initialPrime + m;
        pthread_create(&thread_id[h], NULL, &parallel_launcher, (void *)&args);
        pthread_join(thread_id[h], NULL);
    }
    // printf("Count: %d\n",k);
    return 0;
}
OS: Fedora 21 x86_64,
Compiler: gcc-4.4,
Processor: Intel Core i5 (2 physical core, 4 logical),
Mem: 6 Gb,
HDD: 340 Gb,
You need to split the range you are examining for primes up into n parts, where n is the number of threads.
The code that each thread runs becomes:
typedef struct start_end {
int start;
int end;
} start_end_t;
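void *find_primes_in_range(void *in) {
    start_end_t *start_end = (start_end_t *) in;
    int num_primes = 0;
    for (int j = start_end->start; j <= start_end->end; j++) {
        if (isPrime(j) == 1)
            num_primes++;
    }
    // intptr_t (from <stdint.h>) carries the count through the void * return value
    pthread_exit((void *) (intptr_t) num_primes);
}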
The main routine first starts all the threads which call find_primes_in_range, then calls pthread_join for each thread. It sums all the values returned by find_primes_in_range. This avoids locking and unlocking a shared count variable.
This will parallelize the work, but the amount of work per thread will not be equal. This can be addressed but is more complicated.
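The main routine described above could be sketched as follows. This is an illustration under assumptions, not the answer's exact code: each thread stores its private count in its own argument struct instead of returning it through pthread_exit, the primality test uses plain integer arithmetic, and the names count_primes, count_primes_in_range, prime_range_t and the NTHREADS value are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NTHREADS 4   /* hypothetical number of worker threads */

typedef struct {
    int start;   /* first number to test (inclusive) */
    int end;     /* last number to test (inclusive) */
    int count;   /* primes found; written only by the owning thread */
} prime_range_t;

static bool is_prime(int n)
{
    if (n <= 1) return false;
    if (n == 2) return true;
    if (n % 2 == 0) return false;
    for (int i = 3; (long long) i * i <= n; i += 2)
        if (n % i == 0) return false;
    return true;
}

static void *count_primes_in_range(void *in)
{
    prime_range_t *range = in;
    int local = 0;                      /* private counter: no mutex needed */
    for (int j = range->start; j <= range->end; j++)
        if (is_prime(j)) local++;
    range->count = local;
    return NULL;
}

/* Counts the primes in [0, limit]; returns -1 if a thread could not be created. */
int count_primes(int limit)
{
    assert(limit >= 0);
    pthread_t threads[NTHREADS];
    prime_range_t ranges[NTHREADS];
    int chunk = limit / NTHREADS + 1;
    int started = 0;
    /* Start every thread first ... */
    for (int h = 0; h < NTHREADS; h++) {
        ranges[h].start = h * chunk;
        ranges[h].end = (limit - ranges[h].start < chunk)
                            ? limit : ranges[h].start + chunk - 1;
        ranges[h].count = 0;
        if (pthread_create(&threads[h], NULL, count_primes_in_range, &ranges[h]) != 0)
            break;
        started++;
    }
    /* ... and only then join them, summing the per-thread counts. */
    int total = 0;
    for (int h = 0; h < started; h++) {
        pthread_join(threads[h], NULL);
        total += ranges[h].count;
    }
    return (started == NTHREADS) ? total : -1;
}
```

As noted above, equal-sized ranges do not mean equal work, since testing larger numbers takes longer; the sketch makes no attempt to balance that.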
The main design flaw: you must let each thread have its own private counter variable instead of using the shared one. Otherwise they will spend far more time waiting on and handling that mutex, than they will do on the actual calculation. You are essentially forcing the threads to execute in serial.
Instead, sum everything up with a private counter variable and once a thread is done with its work, return the counter variable and sum them up in main().
Also, you should not call printf() from inside the threads. If there is a context switch in the middle of a printf call, you'll end up with crappy output such as This is This is prime: 2. In which case you must synchronize the printf calls between threads, which will slow the program down again. Also, the printf() calls themselves are likely 90% of the work that the thread is doing. So some sort of re-design of who does the printing might be a good idea, depending on what you want to do with the results.
Indeed, using pthreads did speed up my code. The flaws in my program were placing pthread_join right after the first pthread_create and using a common counter. After fixing these, I tested my parallel code by determining the primality of 100 million numbers and compared its processing time with that of the serial code. Below are the results.
http://i.stack.imgur.com/gXFyk.jpg (I could not attach the image as I don't have much reputation yet, instead, I am including a link)
I conducted three trials of each to account for the variations caused by other OS activity. Parallel programming with pthreads did give us a speedup. What is surprising is that the pthread code running in ONE thread was a bit faster than the purely serial code. I cannot explain this; nevertheless, pthreads are surely worth a try.
Here is the corrected parallel version of the code (gcc-c++):
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 4
double asmSqrt(double x)
{
    __asm__ ("fsqrt" : "+t" (x));
    return x;
}
struct start_end_f
{
    int start;
    int end;
};
//test if a number is prime
bool isPrime(int n)
{
    if (n <= 1) return false;
    if (n == 2) return true;
    if (n%2 == 0) return false;
    int sqrtn = asmSqrt(n);
    for (int i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
    return true;
}
//executes the tests for prime in a certain range, other threads will test the next range and so on..
void *find_primes_in_range(void *in)
{
    int k = 0;
    struct start_end_f *start_end_h = (struct start_end_f *)in;
    for (int j = start_end_h->start; j < (start_end_h->end +1); j++)
    {
        if(isPrime(j) == 1) k++;
    }
    int *t = new int;
    *t = k;
    pthread_exit(t);
}
int main()
{
    int f = 100000000; //maximum number to be tested for prime
    pthread_t thread_id[NTHREADS];
    struct start_end_f start_end[NTHREADS];
    int rem = (f+1)%NTHREADS;
    int n = (f+1)/NTHREADS;
    int rem_change = rem;
    int m;
    if(rem>0) m = n+1;
    else if(rem == 0) m = n;
    //distributes task 'evenly' to the number of parallel threads requested
    for(int h = 0; h < NTHREADS; h++)
    {
        if(rem_change > 0)
        {
            start_end[h].start = m*h;
            start_end[h].end = start_end[h].start+m-1;
            rem_change -= 1;
        }
        else if(rem_change<= 0)
        {
            start_end[h].start = m*(h+rem_change)-rem_change*n;
            start_end[h].end = start_end[h].start+n-1;
            rem_change -= 1;
        }
        pthread_create(&thread_id[h], NULL, find_primes_in_range, &start_end[h]);
    }
    //retrieving returned values
    int *t;
    int c = 0;
    for(int h = 0; h < NTHREADS; h++)
    {
        pthread_join(thread_id[h], (void **)&t);
        int b = *((int *)t);
        c += b;
        b = 0;
        delete t; //release the counter allocated by the thread
    }
    printf("\nNumber of Primes: %d\n",c);
    return 0;
}
I am trying to use OpenMP to parallelize QuickSort, both in the partition part and in the recursive QuickSort part. My C code is as follows:
#include "stdlib.h"
#include "stdio.h"
#include "omp.h"
// parallel partition
int ParPartition(int *a, int p, int r) {
    int b[r-p];
    int key = *(a+r); // use the last element in the array as the pivot
    int lt[r-p]; // mark 1 at the position where its element is smaller than the key, else 0
    int gt[r-p]; // mark 1 at the position where its element is bigger than the key, else 0
    int cnt_lt = 0; // count 1 in the lt array
    int cnt_gt = 0; // count 1 in the gt array
    int j=p;
    int k = 0; // the position of the pivot
    // deal with gt and lt array
    #pragma omp parallel for
    for ( j=p; j<r; ++j) {
        b[j-p] = *(a+j);
        if (*(a+j) < key) {
            lt[j-p] = 1;
            gt[j-p] = 0;
        } else {
            lt[j-p] = 0;
            gt[j-p] = 1;
        }
    }
    // calculate the new position of the elements
    for ( j=0; j<(r-p); ++j) {
        if (lt[j]) {
            ++cnt_lt;
            lt[j] = cnt_lt;
        } else
            lt[j] = cnt_lt;
        if (gt[j]) {
            ++cnt_gt;
            gt[j] = cnt_gt;
        } else
            gt[j] = cnt_gt;
    }
    // move the pivot
    k = lt[r-p-1];
    *(a+p+k) = key;
    // move elements to their new position
    #pragma omp parallel for
    for ( j=p; j<r; ++j) {
        if (b[j-p] < key)
            *(a+p+lt[j-p]-1) = b[j-p];
        else if (b[j-p] > key)
            *(a+k+gt[j-p]) = b[j-p];
    }
    return (k+p);
}
void ParQuickSort(int *a, int p, int r) {
    int q;
    if (p<r) {
        q = ParPartition(a, p, r);
        #pragma omp parallel sections
        {
            #pragma omp section
            ParQuickSort(a, p, q-1);
            #pragma omp section
            ParQuickSort(a, q+1, r);
        }
    }
}
int main() {
    int a[10] = {5, 3, 8, 4, 0, 9, 2, 1, 7, 6};
    ParQuickSort(a, 0, 9);
    int i=0;
    for (; i!=10; ++i)
        printf("%d\t", a[i]);
    return 0;
}
For the example in the main function, the sorting result is:
0 9 9 2 2 2 6 7 7 7
I used gdb to debug. In the early recursions, all went well. But in some later recursions, things suddenly went wrong and elements started to be duplicated, which eventually produced the result above.
Can someone help me figure out where the problem is?
I decided to post this answer because:
the accepted answer is wrong, and the user seems inactive these days. There is a race-condition on
#pragma omp parallel for
for(i = p; i < r; i++){
    if(a[i] < a[r]){
        lt[lt_n++] = a[i]; //<- race condition lt_n is shared
    }else{
        gt[gt_n++] = a[i]; //<- race condition gt_n is shared
    }
}
Nonetheless, even if it was correct, the modern answer to this question is to use OpenMP tasks instead of sections.
I am providing the community with a full runnable example of that approach, including tests and profiling.
#include <assert.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#define TASK_SIZE 100
unsigned int rand_interval(unsigned int min, unsigned int max)
{
    // https://stackoverflow.com/questions/2509679/
    int r;
    const unsigned int range = 1 + max - min;
    const unsigned int buckets = RAND_MAX / range;
    const unsigned int limit = buckets * range;
    do
    {
        r = rand();
    }
    while (r >= limit);
    return min + (r / buckets);
}
void fillupRandomly (int *m, int size, unsigned int min, unsigned int max){
    for (int i = 0; i < size; i++)
        m[i] = rand_interval(min, max);
}
void init(int *a, int size){
    for(int i = 0; i < size; i++)
        a[i] = 0;
}
void printArray(int *a, int size){
    for(int i = 0; i < size; i++)
        printf("%d ", a[i]);
}
int isSorted(int *a, int size){
    for(int i = 0; i < size - 1; i++)
        if(a[i] > a[i + 1])
            return 0;
    return 1;
}
int partition(int * a, int p, int r)
{
    int lt[r-p];
    int gt[r-p];
    int i;
    int j;
    int key = a[r];
    int lt_n = 0;
    int gt_n = 0;
    for(i = p; i < r; i++){
        if(a[i] < a[r]){
            lt[lt_n++] = a[i];
        }else{
            gt[gt_n++] = a[i];
        }
    }
    for(i = 0; i < lt_n; i++){
        a[p + i] = lt[i];
    }
    a[p + lt_n] = key;
    for(j = 0; j < gt_n; j++){
        a[p + lt_n + j + 1] = gt[j];
    }
    return p + lt_n;
}
void quicksort(int * a, int p, int r)
{
    int div;
    if(p < r){
        div = partition(a, p, r);
        #pragma omp task shared(a) if(r - p > TASK_SIZE)
        quicksort(a, p, div - 1);
        #pragma omp task shared(a) if(r - p > TASK_SIZE)
        quicksort(a, div + 1, r);
    }
}
int main(int argc, char *argv[])
{
    int N = (argc > 1) ? atoi(argv[1]) : 10;
    int print = (argc > 2) ? atoi(argv[2]) : 0;
    int numThreads = (argc > 3) ? atoi(argv[3]) : 2;
    int *X = malloc(N * sizeof(int));
    int *tmp = malloc(N * sizeof(int));
    omp_set_dynamic(0); /** Explicitly disable dynamic teams **/
    omp_set_num_threads(numThreads); /** Use numThreads threads for all parallel regions **/
    // Dealing with failed memory allocation
    if(!X || !tmp)
    {
        if(X) free(X);
        if(tmp) free(tmp);
        return (EXIT_FAILURE);
    }
    fillupRandomly (X, N, 0, 5);
    double begin = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        quicksort(X, 0, N - 1); // r is the index of the last element, not the size
    }
    double end = omp_get_wtime();
    printf("Time: %f (s) \n",end-begin);
    assert(1 == isSorted(X, N));
    if(print)
    {
        printArray(X, N);
    }
    free(X);
    free(tmp);
    return (EXIT_SUCCESS);
}
How to run:
This program accepts three parameters:
The size of the array;
Print or not the array, 0 for no, otherwise yes;
The number of Threads to run in parallel.
Mini Benchmark
In a 4 core machine : Input 100000 with
1 Thread -> Time: 0.784504 (s)
2 Threads -> Time: 0.424008 (s) ~ speedup 1.85x
4 Threads -> Time: 0.282944 (s) ~ speedup 2.77x
I apologize for my first comment; it was not relevant to your problem. I have not found the actual cause of your problem (maybe the code that moves the elements is at fault). Following your idea, I wrote a similar program, and it works fine. (I am also new to OpenMP.)
#include <stdio.h>
#include <stdlib.h>
int partition(int * a, int p, int r)
{
    int lt[r-p];
    int gt[r-p];
    int i;
    int j;
    int key = a[r];
    int lt_n = 0;
    int gt_n = 0;
    #pragma omp parallel for
    for(i = p; i < r; i++){
        if(a[i] < a[r]){
            lt[lt_n++] = a[i];
        }else{
            gt[gt_n++] = a[i];
        }
    }
    for(i = 0; i < lt_n; i++){
        a[p + i] = lt[i];
    }
    a[p + lt_n] = key;
    for(j = 0; j < gt_n; j++){
        a[p + lt_n + j + 1] = gt[j];
    }
    return p + lt_n;
}
void quicksort(int * a, int p, int r)
{
    int div;
    if(p < r){
        div = partition(a, p, r);
        #pragma omp parallel sections
        {
            #pragma omp section
            quicksort(a, p, div - 1);
            #pragma omp section
            quicksort(a, div + 1, r);
        }
    }
}
int main(void)
{
    int a[10] = {5, 3, 8, 4, 0, 9, 2, 1, 7, 6};
    int i;
    quicksort(a, 0, 9);
    for(i = 0;i < 10; i++){
        printf("%d\t", a[i]);
    }
    return 0;
}
I've implemented parallel quicksort in a production environment, although with concurrent processes (i.e. fork() and join()) and not OpenMP. I also found a pretty good pthread solution, but a concurrent process solution was the best in terms of worst-case runtime. Let me start by saying that it doesn't seem like you're making copies of your input array for each thread, so you'll definitely encounter race conditions which can corrupt your data.
Essentially, what is happening is that you have created an array a in shared memory, and when you do a #pragma omp parallel sections, you're spawning as many worker threads as there are #pragma omp section's. Each time a worker thread tries to access and modify an element of a, it will execute a series of instructions: "read the n'th value of a from the given address", "modify the n'th value of a", "write the n'th value of a back to the given address". Since you have multiple threads with no locking or synchronization, the read, modify, and write instructions may be executed in any order by multiple processors, so the threads may overwrite each other's modifications or read a non-updated value.
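The read, modify, write pattern is easiest to see on a single shared counter. The sketch below is an illustration only (count_less_than is a hypothetical helper, not code from the question): it counts the elements smaller than a key, and uses an OpenMP reduction so that each thread updates a private copy instead of racing on one shared variable.

```c
#include <assert.h>

/* Hypothetical helper: counts the elements of a[0..n-1] that are smaller than key. */
int count_less_than(const int *a, int n, int key)
{
    assert(n >= 0);
    int count = 0;
    /* reduction(+:count) gives every thread its own private count and adds
       the private copies together at the end of the loop, so no increment
       is lost. A plain shared count++ here would be an unsynchronized
       read-modify-write and could drop updates. */
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < n; i++) {
        if (a[i] < key)
            count++;
    }
    return count;
}
```

The same reasoning applies to any index that several threads increment while writing into a shared buffer: either each thread needs its own copy, or the update has to be synchronized.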
The best solution that I found (after many weeks of testing and benchmarking many solutions that I came up with) is to subdivide the list log2(n) times, where n is the number of processors. For example, if you have a quad-core machine (n = 4), subdivide the list 2 times (log2(4) = 2), choosing pivots that are the medians of the data set. It is important that the pivots are medians, because otherwise you can end up with a case where a poorly chosen pivot causes the lists to be distributed unevenly amongst processes. Then each process does quicksort on its local subarray, then merges its results with the results of other processes. This is called "hyperquicksort", and from an initial github search, I found this. I can't vouch for the code in there, and can't publish any of the code that I wrote since it is protected under an NDA.
By the way, one of the best parallel sorting algorithms is PSRS (Parallel Sorting by Regular Sampling), which keeps list sizes more balanced amongst processes, doesn't unnecessarily communicate keys between processes, and can work on an arbitrary number of concurrent processes (the number doesn't have to be a power of 2).