After a lot of searching for an implementation of parallel quicksort in C, I'm about to dive in and code it myself. (I need to sort an array of about 1 million text strings.) It seems that all the implementations I have found divide the work inside the qsort function itself, which creates a huge amount of overhead in partitioning the relatively small amount of work per thread.
Would it not be much faster to divide the 1 million strings by the number of threads (in my case, 24 threads), and have them each work on a section, and then do a mergesort? Granted, this has the theoretical disadvantage that it is not an in-place sort, but with gobs of memory available it is not a problem. The machine this runs on has 12 (very fast) physical/24 logical cores and 192 GB (yes, gigabytes) of memory. Currently, even on this machine, the sort takes almost 8 minutes!
Would it not be much faster to divide the 1 million strings by the number of threads (in my case, 24 threads), and have them each work on a section, and then do a mergesort?
It's a good idea.
But you can make some useful observations by writing toy programs for quicksort and mergesort and taking advantage of their algorithmic and run-time behavior.
For example, quicksort sorts while dividing (the pivot element ends up in its final place at the end of each partitioning pass), whereas mergesort sorts while merging (the sorting is done only after the whole working set has been broken down into units small enough to be compared directly, with == or strcmp()).
Mixing algorithms based on the nature of the working set is a good idea.
With respect to parallel sorting, here is my parallel mergesort to get you started.
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>

#define NOTHREADS 2

/*
   Compile with: gcc -ggdb parallel-mergesort.c -lpthread

   NOTE:
   The mergesort boils down to this: given two sorted arrays, how do we
   merge them?  We need a new array to hold the result of the merge;
   otherwise it is not possible to do it in place with an array, so we
   would need a linked list.
*/

int a[] = {10, 8, 5, 2, 3, 6, 7, 1, 4, 9};

typedef struct node {
    int i;
    int j;
} NODE;

void merge(int i, int j)
{
    int mid = (i + j) / 2;
    int ai = i;
    int bi = mid + 1;
    int newa[j - i + 1], newai = 0;

    while (ai <= mid && bi <= j) {
        if (a[ai] > a[bi])
            newa[newai++] = a[bi++];
        else
            newa[newai++] = a[ai++];
    }

    while (ai <= mid) {
        newa[newai++] = a[ai++];
    }

    while (bi <= j) {
        newa[newai++] = a[bi++];
    }

    for (ai = 0; ai < (j - i + 1); ai++)
        a[i + ai] = newa[ai];
}

void *mergesort(void *arg)
{
    NODE *p = (NODE *)arg;
    NODE n1, n2;
    int mid = (p->i + p->j) / 2;
    pthread_t tid1, tid2;
    int ret;

    n1.i = p->i;
    n1.j = mid;

    n2.i = mid + 1;
    n2.j = p->j;

    if (p->i >= p->j)
        return NULL;

    ret = pthread_create(&tid1, NULL, mergesort, &n1);
    if (ret) {
        printf("%d %s - unable to create thread - ret - %d\n", __LINE__, __FUNCTION__, ret);
        exit(1);
    }

    ret = pthread_create(&tid2, NULL, mergesort, &n2);
    if (ret) {
        printf("%d %s - unable to create thread - ret - %d\n", __LINE__, __FUNCTION__, ret);
        exit(1);
    }

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    merge(p->i, p->j);

    pthread_exit(NULL);
}

int main(void)
{
    int i;
    NODE m;
    pthread_t tid;
    int ret;

    m.i = 0;
    m.j = 9;

    ret = pthread_create(&tid, NULL, mergesort, &m);
    if (ret) {
        printf("%d %s - unable to create thread - ret - %d\n", __LINE__, __FUNCTION__, ret);
        exit(1);
    }

    pthread_join(tid, NULL);

    for (i = 0; i < 10; i++)
        printf("%d ", a[i]);
    printf("\n");

    return 0;
}
Good luck!
Quicksort involves an initial pass over a list, which sorts the list into sections that are higher and lower than the pivot.
Why not do that in one thread, and then spawn another thread and delegate it to one half while the extant thread takes the other half, and so on and so forth?
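A rough sketch of that scheme in C, for an array of strings: the current thread partitions, hands one half to a new thread, recurses on the other, and stops spawning new threads below a fixed depth. The names (qjob, par_qsort) and the depth cutoff are illustrative choices, not a tuned implementation; compile with -lpthread.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char **a;
    int lo, hi;      /* inclusive bounds of the section to sort */
    int depth;       /* remaining levels allowed to spawn a thread */
} qjob;

static void swap(char **a, int i, int j) { char *t = a[i]; a[i] = a[j]; a[j] = t; }

/* Lomuto partition around the last element, comparing with strcmp(). */
static int partition(char **a, int lo, int hi)
{
    char *pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; j++)
        if (strcmp(a[j], pivot) < 0)
            swap(a, i++, j);
    swap(a, i, hi);
    return i;
}

static void *par_qsort(void *arg)
{
    qjob *job = arg;
    if (job->lo >= job->hi)
        return NULL;

    int p = partition(job->a, job->lo, job->hi);
    qjob left  = { job->a, job->lo, p - 1, job->depth - 1 };
    qjob right = { job->a, p + 1, job->hi, job->depth - 1 };

    if (job->depth > 0) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, par_qsort, &left) == 0) {
            par_qsort(&right);           /* this thread takes the other half */
            pthread_join(tid, NULL);
            return NULL;
        }
    }
    par_qsort(&left);                    /* no more threads: plain recursion */
    par_qsort(&right);
    return NULL;
}

int main(void)
{
    char *words[] = { "pear", "apple", "kiwi", "fig", "banana", "cherry" };
    int n = sizeof words / sizeof *words;
    qjob root = { words, 0, n - 1, 3 };  /* 3 levels of spawning */

    par_qsort(&root);
    for (int i = 0; i < n; i++)
        printf("%s\n", words[i]);
    return 0;
}

Because the spawning thread joins its child before returning, the stack-allocated qjob structs stay valid for the child's whole lifetime.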
Have you considered using a sorting algorithm specifically designed to sort strings?
It seems like it might be a better idea than trying to implement a custom quicksort. The specific choice of algorithms probably depends on the length of the strings and how different they are but a radix sort probably isn't a bad bet.
A quick Google search turned up an article about sorting strings. I haven't read it, but Sedgewick and Bentley really know their stuff. According to the abstract, their algorithm is an amalgam of quicksort and radix sort.
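To give a feel for that idea, here is a small sketch of a three-way radix (multikey) quicksort for C strings. This illustrates the general technique rather than reproducing the authors' code, and the helper names are made up.

#include <stdio.h>

static void swap_ptr(char **a, int i, int j) { char *t = a[i]; a[i] = a[j]; a[j] = t; }

/* Sort a[lo..hi] by the character at position d, then recurse. */
static void str_qsort3(char **a, int lo, int hi, int d)
{
    if (hi <= lo)
        return;
    int lt = lo, gt = hi, i = lo + 1;
    int v = (unsigned char)a[lo][d];        /* pivot character */
    while (i <= gt) {
        int c = (unsigned char)a[i][d];
        if (c < v)       swap_ptr(a, lt++, i++);
        else if (c > v)  swap_ptr(a, i, gt--);
        else             i++;
    }
    str_qsort3(a, lo, lt - 1, d);           /* smaller character at position d */
    if (v != '\0')
        str_qsort3(a, lt, gt, d + 1);       /* equal at d: look at the next character */
    str_qsort3(a, gt + 1, hi, d);           /* larger character at position d */
}

int main(void)
{
    char *words[] = { "she", "sells", "seashells", "by", "the", "sea", "shore" };
    int n = sizeof words / sizeof *words;

    str_qsort3(words, 0, n - 1, 0);
    for (int i = 0; i < n; i++)
        printf("%s\n", words[i]);
    return 0;
}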
Another possible solution is to wrap a parallel sorting algorithm from C++. GNU's STL implementation has a parallel mode, which contains a parallel quicksort implementation.
This is probably the easiest solution.
To make multi-threaded quicksort feasible, memory accesses need to be optimized so that most of the sorting work is performed inside the non-shared caches (L1 and L2). My bet is that single-threaded quicksort will be faster than multi-threaded unless you're prepared to put in copious amounts of work.
One approach to test could be one thread to sort the upper half and another to sort the lower.
As to a special string-adapted sorting routine, the concept sounds strange to me. There aren't many cases where sorting a vector of only strings (or integers) is especially useful. Usually the data will be organized in a table with columns and rows, and you will want to sort the rows by one column containing letters and, if those are equal, by an additional column containing a time stamp or a ranking or something else. So the sort routine should be able to handle a multi-level sort rule set that can specify any type of data (boolean, integer, date, string, floating point, etc.) in any direction (ascending or descending) present in the table's columns.
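With qsort, that kind of rule set usually ends up as a chained comparison function. A minimal sketch with a made-up row type (sort by name ascending, then by timestamp descending); a real table sorter would build the comparator from a runtime rule list instead of hard-coding it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical row: level 1 is the name, level 2 breaks ties by timestamp. */
typedef struct {
    char name[32];
    time_t stamp;
} row;

static int cmp_rows(const void *pa, const void *pb)
{
    const row *a = pa, *b = pb;
    int c = strcmp(a->name, b->name);    /* level 1: name, ascending */
    if (c != 0)
        return c;
    if (a->stamp != b->stamp)            /* level 2: timestamp, descending */
        return a->stamp > b->stamp ? -1 : 1;
    return 0;
}

int main(void)
{
    row rows[] = {
        { "smith", 1700000000 },
        { "adams", 1600000000 },
        { "smith", 1800000000 },
    };
    size_t n = sizeof rows / sizeof *rows;

    qsort(rows, n, sizeof *rows, cmp_rows);
    for (size_t i = 0; i < n; i++)
        printf("%s %lld\n", rows[i].name, (long long)rows[i].stamp);
    return 0;
}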
I have three sorting algorithms: bubble, insertion, selection.
I need to experimentally test them for stability. I understand that I need to somehow compare the same numbers before sorting and after, but I don't know how to do it in C.
I need some advice on how this can be implemented.
As chqrlie has rightly pointed out, the following approach does NOT prove the absence of stability mistakes for all inputs. It is, however, something of a smoke test, which actually has a chance of detecting implementation mistakes. That is at least a small step ahead of testing with plain sequences of numbers, which does not allow detecting ANY stability mistake.
In order to test stability you therefore need to do sorting on elements which have the same sorting key AND can still be told apart outside of the sorting.
One way would be to sort elements with two keys. One is non-unique, i.e. there are pairs of elements with the same first key. The other, second key is unique, i.e. no two elements have the same second key.
Then create several pairs of elements with same first key and different second key.
Then sort.
Then verify that the relative position of elements with same first key have not changed, use the non-identical second key to do so.
Then sort again and verify that the elements with same first key have again not changed (relative) location. This is for paranoia, the algorithm (strictly speaking the tested implementation) might be stable from an unsorted list, but unstable on a sorted one.
Then restart with unsorted list/array, similar to before, but with reversed relative location of elements with identical first key.
Then sort and verify unchanged relative location AND reverse relative location compared to first sorting.
Then sort again and verify for paranoia.
As mentioned this will not detect ALL stability mistakes, but at least there are two improvements over a simple "do and verify" test here.
test both unsorted and sorted input (described as paranoia above); this will detect all toggle-based stability mistakes, i.e. any mechanism that always changes the relative order of same-keyed pairs
test both "directions" of same-keyed pairs; this at least ensures detection of all stability mistakes caused by some kind of "preferred order", by making sure the non-preferred order is among the test inputs and would get sorted into the preferred order
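A minimal sketch of the procedure described above, assuming an element type with a non-unique key and a unique id that records the original input position. The insertion sort here is just a stand-in for whichever implementation you want to test, and the sketch only covers one input direction plus the sort-the-sorted-array paranoia step; repeating it with the same-keyed pairs in reversed order is left as described above.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int key; int id; } elem;   /* key: non-unique, id: unique input position */

static void insertion_sort(elem *a, size_t n)  /* sorts by key only; algorithm under test */
{
    for (size_t i = 1; i < n; i++) {
        elem tmp = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1].key > tmp.key) { a[j] = a[j - 1]; j--; }
        a[j] = tmp;
    }
}

/* Stable iff every run of equal keys keeps ascending original ids. */
static int is_stable(const elem *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (a[i - 1].key == a[i].key && a[i - 1].id > a[i].id)
            return 0;
    return 1;
}

int main(void)
{
    /* several pairs with identical first key and distinct second key */
    elem a[] = { {2, 0}, {1, 1}, {2, 2}, {1, 3}, {2, 4} };
    size_t n = sizeof a / sizeof *a;

    insertion_sort(a, n);
    printf("first sort:  %s\n", is_stable(a, n) ? "stable so far" : "NOT stable");

    insertion_sort(a, n);       /* paranoia: sort the already sorted array again */
    printf("second sort: %s\n", is_stable(a, n) ? "stable so far" : "NOT stable");
    return 0;
}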
If one is going to test it probabilistically, one needs, at a minimum, a key and a monotonic value to check whether equal-keyed items are in the same order before and after. There are several ways one can do this. I would think that one of the simpler, more effective ways is to sort by oddness: the key is { even, odd } and the value is the integer itself. Having only two key values increases the chance that a non-stable sort puts some item in the wrong spot. Here, the qsort function is tested.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

static int cmp(const int *const a, const int *const b)
    { return (*a & 1) - (*b & 1); }
static int void_cmp(const void *a, const void *b) { return cmp(a, b); }

int main(void) {
    unsigned a[10], counter = 0;
    size_t i;
    const size_t a_size = sizeof a / sizeof *a;
    const unsigned seed = (unsigned)clock();

    /* http://c-faq.com/lib/randrange.html */
    printf("Seed %u.\n", seed), srand(seed);
    for(i = 0; i < a_size; i++)
        a[i] = (counter += 1 + rand() / (RAND_MAX / (2 - 1 + 1) + 1));
    qsort(a, a_size, sizeof *a, &void_cmp);
    for(i = 0; i < a_size; i++) printf("%s%u", i ? ", " : "", a[i]);
    printf("\n");

    /* Even numbers. */
    for(counter = 0, i = 0; i < a_size && !(a[i] & 1) && counter < a[i];
        counter = a[i], i++);
    if(i == a_size) goto pass;
    if(!(a[i] & 1)) goto fail; /* Not stable by `counter >= a[i]`. */
    /* Odd numbers. */
    for(counter = 0; i < a_size && counter < a[i]; counter = a[i], i++);
    if(i == a_size) goto pass;
fail:
    printf("Not stable.\n");
    return EXIT_FAILURE;
pass:
    printf("Possibly stable.\n");
    return EXIT_SUCCESS;
}
Note that this doesn't prove that an algorithm is stable, but after sorting a large number of elements, the chance that an unstable sort leaves them in the right order by accident is exponentially small.
I am using this program written in C to determine the permutations of size 10 of a regular alphabet.
When I run the program it only uses 36% of my 3GHz CPU leaving 50% free. It also only uses 7MB of my 8GB of RAM.
I would like to use at least 70-80% of my computer's performance and not just this misery. This limitation is making the procedure very time-consuming, and I don't know how many days it will take to get the complete output. I need help resolving this issue in the shortest possible time, whether by improving the source code or through other possibilities.
Any help is welcome, even if the solution means abandoning C for another language that gives me better performance when running the program.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static int count = 0;

void print_permutations(char arr[], char prefix[], int n, int k) {
    int i, j, l = strlen(prefix);
    char newprefix[l + 2];

    if (k == 0) {
        printf("%d %s\n", ++count, prefix);
        return;
    }

    for (i = 0; i < n; i++) {
        //Concatenation of currentPrefix + arr[i] = newPrefix
        for (j = 0; j < l; j++)
            newprefix[j] = prefix[j];
        newprefix[l] = arr[i];
        newprefix[l + 1] = '\0';
        print_permutations(arr, newprefix, n, k - 1);
    }
}

int main() {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";

    print_permutations(arr, "", n, k);
    system("pause");
    return 0;
}
There are fundamental problems with your approach:
What are you trying to achieve?
If you want to enumerate the permutations of size 10 of a regular alphabet, your program is flawed, as it enumerates all combinations of 10 letters from the alphabet. Your program will produce 26^10 combinations, a huge number: 141167095653376, about 141,167 billion! Ignoring the numbering, which will exceed the range of type int, that's more than 1.5 petabytes, unlikely to fit on your storage space. Writing this at a top speed of 100MB/s would take more than 20 days.
The number of permutations, that is arrangements of 10 distinct letters from the 26-letter alphabet, is not quite as large: 26! / 16!, which is still huge: 19275223968000, 7 times less than the previous result. That is still more than 212 terabytes of storage and roughly 25 days at 100MB/s.
Storing these permutations is therefore impractical. You could change your program to just count the permutations, measure how long that takes, and check that the count matches the expected value. The first step, of course, is to correct your program so it produces the correct set.
Test on smaller sets to verify correctness
Given the expected size of the problem, you should first test for smaller values such as enumerating permutations of 1, 2 and 3 letters to verify that you get the expected number of results.
Once you have correctness, only then focus on performance
By trying different output methods, from printf("%d %s\n", ++count, prefix); to ++count; puts(prefix); to just ++count;, you will see that most of the time is spent producing the output. Once you stop producing output, you may see that strlen() consumes a significant fraction of the execution time, which is wasteful since you can pass the prefix length from the caller. Further improvements may come from using a common array for the current prefix, removing the need to copy it at each recursive step.
Using multiple threads each producing its own output, for example each with a different initial letter, will not improve the overall time as the bottleneck is the bandwidth of the output device. But if you reduce the program to just enumerate and count the permutations, you might get faster execution with multiple threads, one per core, thereby increasing the CPU usage. But this should be the last step in your development.
Memory use is no measure of performance
Using as much memory as possible is not a goal in itself. Some problems may require a tradeoff between memory and time, where faster solving times are achieved using more core memory, but this one does not. 8MB is actually much more than your program's actual needs: this count includes the full stack space assigned to the program, of which only a tiny fraction will be used.
As a matter of fact, using less memory may improve overall performance as the CPU will make better use of its different caches.
Here is a modified program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long count;

void print_permutations(char arr[], int n, char used[], char prefix[], int pos, int k) {
    if (pos == k) {
        prefix[k] = '\0';
        ++count;
        //printf("%llu %s\n", count, prefix);
        //puts(prefix);
        return;
    }
    for (int i = 0; i < n; i++) {
        if (!used[i]) {
            used[i] = 1;
            prefix[pos] = arr[i];
            print_permutations(arr, n, used, prefix, pos + 1, k);
            used[i] = 0;
        }
    }
}

int main(int argc, char *argv[]) {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";
    char used[27] = { 0 };
    char perm[27];
    unsigned long long expected_count;
    clock_t start, elapsed;

    if (argc >= 2)
        k = strtol(argv[1], NULL, 0);
    if (argc >= 3)
        n = strtol(argv[2], NULL, 0);

    start = clock();
    print_permutations(arr, n, used, perm, 0, k);
    elapsed = clock() - start;

    expected_count = 1;
    for (int i = n; i > n - k; i--)
        expected_count *= i;

    printf("%llu permutations, expected %llu, %.0f permutations per second\n",
           count, expected_count, count / ((double)elapsed / CLOCKS_PER_SEC));
    return 0;
}
Without output, this program enumerates 140 million combinations per second on my slow laptop; at that rate it would take about 1.5 days to enumerate the 19275223968000 10-letter permutations from the 26-letter alphabet. It uses almost 100% of a single core, but the CPU is still 63% idle as I have a dual-core hyper-threaded Intel Core i5 CPU. Using multiple threads should yield increased performance, but the program must be changed to no longer use a global count variable.
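One possible way to do that, sketched below and not benchmarked: split the work by first letter, give each thread its own counter, and add the per-thread counts in main(). The worker()/thread_arg names and the static split are illustrative; lower K for a quick test, since K = 10 still takes many hours even when parallelized.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 26
#define K 10          /* lower this for a quick test */
#define NTHREADS 4

typedef struct {
    int first_lo, first_hi;        /* range of first letters this thread owns */
    unsigned long long count;      /* per-thread counter, no sharing */
} thread_arg;

/* Count permutations of K distinct letters, positions pos..K-1 still free. */
static void enumerate(char used[], int pos, unsigned long long *count) {
    if (pos == K) { ++*count; return; }
    for (int i = 0; i < N; i++) {
        if (!used[i]) {
            used[i] = 1;
            enumerate(used, pos + 1, count);
            used[i] = 0;
        }
    }
}

static void *worker(void *p) {
    thread_arg *t = p;
    char used[N] = { 0 };
    for (int i = t->first_lo; i < t->first_hi; i++) {
        used[i] = 1;                       /* fix the first letter */
        enumerate(used, 1, &t->count);
        used[i] = 0;
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    thread_arg args[NTHREADS];
    unsigned long long total = 0;

    for (int t = 0; t < NTHREADS; t++) {
        args[t].first_lo = t * N / NTHREADS;
        args[t].first_hi = (t + 1) * N / NTHREADS;
        args[t].count = 0;
        pthread_create(&tid[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += args[t].count;
    }
    printf("%llu permutations\n", total);
    return 0;
}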
There are multiple reasons for your bad experience:
Your metric:
Your metric is fundamentally flawed. Peak CPU% is an imprecise measurement of "how much work does my CPU do", which normally isn't what you're really interested in anyway. You can inflate this number by doing more work (like starting another thread that doesn't contribute to the output at all).
Your proper metric would be items per second: How many different strings will be printed or written to a file per second. To measure that, start a test run with a smaller size (like k=4), and measure how long it takes.
Your problem: Your problem is hard. Printing or writing down all 26^10 ~1.4e+14 different words with exactly 10 letters will take some time. Even if you changed it to all permutations - which your program doesn't do - it's still ~1.9e13. The resulting file will be about 1.4 petabytes, which is most likely more than your hard drive will hold. Also, even at 100% CPU usage and one thousand cycles per word, it'd take 1.5 years. And 1000 cycles per word is about the best you can hope for while still printing the result, as printf alone usually takes around 1000 cycles to complete.
Your output: Writing to stdout is slow compared to writing to a file; see https://stackoverflow.com/a/14574238/4838547.
Your program: There are issues with your program that could be a problem for your performance. However, they are dominated by the other problems stated here. With my setup, this program uses 93.6% of its runtime in printf. Therefore, optimizing this code won't yield satisfying results.
I managed to roll off an insertion sort routine as shown:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct{
    int n;
    char l;
    char z;
} dat;

void sortx(dat* y){
    char tmp[sizeof(dat)+1];
    dat *sp=y;
    while(y->l){
        dat *ip=y;
        while(ip>sp && ip->n < (ip-1)->n){
            memcpy(tmp,ip,sizeof(dat));
            memcpy(ip,ip-1,sizeof(dat));
            memcpy(ip-1,tmp,sizeof(dat));
            ip--;
        }
        y++;
    }
}

void printa(dat* y){
    while(y->l){
        printf("%c %d,",y->l,y->n);
        y++;
    }
    printf("\n");
}

int main(int argc,char* argv[]){
    const long sz=10000;
    dat* new=calloc(sz+2,sizeof(dat));
    dat* randx=new;
    int i;

    //fill struct array with random values
    for (i = 0 ; i < sz ; i++) {
        randx->l = (unsigned char)(65+(rand() % 25));
        randx->n = (rand() % 1000);
        randx++;
    }

    //sort - takes forever
    sortx(new);
    printa(new);
    free(new);
    return 0;
}
My sorting routine was partly derived from: http://www.programmingsimplified.com/c/source-code/c-program-insertion-sort
but because I am dealing with sorting the array based on the numeric value in the struct, memcpy works for me so far.
The computer I'm using to execute this code has a 1.6GHz Pentium processor, and when I change sz in the main function to at least 20000, I notice I have to wait about two seconds to see the results on the screen.
The reason I'm testing large numbers is that I want to process server logs in C and will be sorting information by timestamps. The logs can become very large, and I don't want to put too much strain on the CPU, as it is already running other processes such as Apache.
Is there any way I can improve this code so I don't have to wait two seconds to see 20000 structs sorted?
There is already a function that does this, built into the C standard library: qsort. You just have to provide a suitable comparison function.
This function has to return a negative value if the item passed as the left argument should be placed earlier in the desired order, a positive value if it should be placed later, or 0 if qsort should consider the items equal.
int dat_sorter(const void* l, const void* r)
{
    const dat* left = (const dat*)l;
    const dat* right = (const dat*)r;

    if(left->n > right->n)
        return 1;
    else if(left->n < right->n)
        return -1;
    else
        return 0;
}

void sortx(dat* y)
{
    /* find the length */
    dat* it = y;
    size_t count = 0;
    while(it->l)
    {
        count++;
        it++;
    }

    /* do the sorting */
    qsort(y, count, sizeof(dat), dat_sorter);
}
If you want to speed it up even more, you can make the sortx function take the length of the array, so it won't need to figure it out on its own.
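For example, a variant where the caller passes the element count it already knows (sz in main), so no sentinel scan is needed; it reuses the dat type and dat_sorter comparator from above, and the name sortx_n is just illustrative:

/* Same as sortx(), but the caller supplies the length it already knows. */
void sortx_n(dat *y, size_t count)
{
    qsort(y, count, sizeof(dat), dat_sorter);
}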
Use quicksort, heapsort, or bottom-up merge sort. Wikipedia has examples of these in its articles, and there are typically more complete examples on each article's talk page.
Insertion sort has O(n^2) time complexity, and there are other algorithms out there that will give you O(n log n) time complexity, like mergesort, quicksort, and heapsort. It looks like you are sorting by an integer key, so you might also want to consider LSD radix sort, which runs in O(n) time for fixed-width keys.
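If you want to experiment with that, here is a small LSD radix sort sketch for the question's dat records, keyed on the n field, one byte per pass. It assumes n is non-negative (true in the question, where n is rand() % 1000); the function name is made up and this is an illustration of the technique, not drop-in production code. Each counting pass is stable, so the overall result is stable too.

#include <stdlib.h>
#include <string.h>

typedef struct {      /* same layout as the question's dat */
    int n;
    char l;
    char z;
} dat;

void radix_sortx(dat *a, size_t count)
{
    dat *tmp = malloc(count * sizeof *tmp);
    if (tmp == NULL)
        return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t start[256] = { 0 };
        for (size_t i = 0; i < count; i++)             /* histogram of this byte */
            start[(a[i].n >> shift) & 0xFF]++;
        for (size_t b = 0, sum = 0; b < 256; b++) {    /* prefix sums -> offsets */
            size_t c = start[b];
            start[b] = sum;
            sum += c;
        }
        for (size_t i = 0; i < count; i++)             /* stable scatter */
            tmp[start[(a[i].n >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, count * sizeof *a);
    }
    free(tmp);
}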
A Killer Adversary for Quicksort claims to have a method of reducing any quicksort implementation to quadratic time. I take this to mean that it can always produce an input that takes O(n^2) time to sort. This is saying something because, even though quicksort has an O(n^2) worst case, it typically runs in O(n log n). The author claims that this still works even when the array is randomly shuffled before calling quicksort. How is this possible? I don't know C, but here are the prerequisites and the code of the program:
The quicksort will be vulnerable provided only that it satisfies some
mild assumptions that are met by every implementation I have seen:
1. The implementation is single-threaded.
2. Pivot-choosing takes O(1) comparisons; all other comparisons are for partitioning.
3. The comparisons of the partitioning phase are contiguous and involve the pivot value.
4. The only data operations performed are comparison and copying.
5. Comparisons involve only input data values or copies thereof.
#include <stdlib.h>

int *val;      /* item values */
int ncmp;      /* number of comparisons */
int nsolid;    /* number of solid items */
int candidate; /* pivot candidate */
int gas;       /* gas value */

#define freeze(x) val[x] = nsolid++

int cmp(const void *px, const void *py) /* per C standard */
{
    const int x = *(const int*)px;
    const int y = *(const int*)py;

    ncmp++;
    if(val[x]==gas && val[y]==gas)
    {
        if(x == candidate)
            freeze(x);
        else
            freeze(y);
    }
    if(val[x] == gas)
        candidate = x;
    else if(val[y] == gas)
        candidate = y;
    return val[x] - val[y]; /* only the sign matters */
}

int antiqsort(int n, int *a)
{
    int i;
    int *ptr = malloc(n*sizeof(*ptr));

    val = a;
    gas = n - 1;
    nsolid = ncmp = candidate = 0;
    for(i=0; i<n; i++) {
        ptr[i] = i;
        val[i] = gas;
    }
    qsort(ptr, n, sizeof(*ptr), cmp);
    free(ptr);
    return ncmp;
}
The general method works against any implementation of quicksort–even a randomizing one–that satisfies certain very mild and realistic assumptions.
By randomizing does he mean random pivot selection or randomizing the input structure?
The model the paper uses assumes that the attacker has control not only of the data, but the comparison function. Essentially, the comparison function lies about the data, pretending that the quicksort always happened to choose a really awful pivot. This is a very powerful capability for the adversary to have.
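A small driver sketch to see the effect: run antiqsort() from the listing above against the C library's qsort for a few sizes and print the comparison counts. The sizes are arbitrary; against a qsort that really is a quicksort meeting the paper's assumptions, the count should grow roughly quadratically (some libraries implement qsort as a merge sort, in which case it won't).

#include <stdio.h>
#include <stdlib.h>

int antiqsort(int n, int *a);   /* from the listing above */

int main(void)
{
    for (int n = 1000; n <= 16000; n *= 2) {
        int *a = malloc(n * sizeof *a);
        if (a == NULL)
            return 1;
        printf("n = %6d  comparisons = %d\n", n, antiqsort(n, a));
        free(a);
    }
    return 0;
}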
A colleague of mine asked me to write a homework assignment for him. Although this wasn't too ethical, I did it; I plead guilty.
This is how the problem goes:
Write a program in C where the sum 1^2 + 2^2 + ... + n^2 is calculated.
Assume that n is multiple of p and p is the number of threads.
This is what I wrote:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define SQR(X) ((X) * (X))

int n, p = 10, total_sum = 0;
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

/* Function prototype */
void *do_calc(void *arg);

int main(int argc, char** argv)
{
    int i;
    pthread_t *thread_array;

    printf("Type number n: ");
    fscanf(stdin, "%d", &n);

    if (n % p != 0 ) {
        fprintf(stderr, "Number must be multiple of 10 (number of threads)\n");
        exit(-1);
    }

    thread_array = (pthread_t *) malloc(p * sizeof(pthread_t));

    for (i = 0; i < p; i++)
        pthread_create(&thread_array[i], NULL, do_calc, (void *) i);
    for (i = 0; i < p; i++)
        pthread_join(thread_array[i], NULL);

    printf("Total sum: %d\n", total_sum);
    pthread_exit(NULL);
}

void *do_calc(void *arg)
{
    int i, local_sum = 0;
    int thr = (int) arg;

    pthread_mutex_lock(&mtx);
    for (i = thr * (n / p); i < ((thr + 1) * (n / p)); i++)
        local_sum += SQR(i + 1);
    total_sum += local_sum;
    pthread_mutex_unlock(&mtx);

    pthread_exit(NULL);
}
Aside from the logical/syntactic point of view, I was wondering:
how the respective non-multithreaded program would perform
how could I test/see their performance
what would be the program without using threads
Thanks in advance and I’m looking forward to reading your thoughts
You are acquiring the mutex before the calculations. You should acquire it only when adding the local sum to the shared total:
pthread_mutex_lock(&mtx);
total_sum += local_sum;
pthread_mutex_unlock(&mtx);
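Put together, do_calc() would look something like this sketch (it still relies on the globals n, p, total_sum, mtx and the SQR macro from the question's program):

/* Sketch: the loop runs unlocked; only the final accumulation is guarded. */
void *do_calc(void *arg)
{
    int i, local_sum = 0;
    int thr = (int) arg;

    for (i = thr * (n / p); i < ((thr + 1) * (n / p)); i++)
        local_sum += SQR(i + 1);

    pthread_mutex_lock(&mtx);
    total_sum += local_sum;
    pthread_mutex_unlock(&mtx);

    pthread_exit(NULL);
}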
This would depend on how many CPUs you have. With a single CPU core, a computation-bound program will never run faster with multiple threads.
Moreover, since you're doing all the work with the lock held, you'll end up with only a single thread running at any time, so it's effectively single threaded anyway.
Don't bother with threading etc. In fact, don't do any additions in a loop at all. Just use this formula:
∑ r^2 for r = 1 to n  =  n(n + 1)(2n + 1) / 6   [1]
[1] http://thesaurus.maths.org/mmkb/entry.html?action=entryById&id=1539
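In code that is a one-liner; a small sketch (the intermediate product overflows 64-bit unsigned arithmetic for n beyond roughly two million):

/* Closed-form sum of squares: n(n + 1)(2n + 1) / 6, always an exact integer. */
unsigned long long sum_of_squares(unsigned long long n)
{
    return n * (n + 1) * (2 * n + 1) / 6;
}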
As your code is serialised by a mutex in the actual calculation, it will be slower than a non-threaded version. Of course, you could easily have tested this for yourself.
I would try to see how long those calculations take. If it's a very small fraction of the total time, then I would probably go for a single-threaded model, since spawning a thread for each calculation involves some overhead by itself.
To compare performance, just record the system time at program start, run it with n = 1000, and read the system time at the end; then compare with the non-threaded program's result.
As bdonlan said, the non-threaded version will run faster.
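A tiny timing sketch along those lines, using clock_gettime(CLOCK_MONOTONIC, ...) for wall-clock time (link with -lrt on older glibc if needed); the now_seconds() helper name is just illustrative:

#include <stdio.h>
#include <time.h>

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double t0 = now_seconds();

    /* ... run the computation under test here, e.g. with n = 1000 ... */

    printf("elapsed: %.6f s\n", now_seconds() - t0);
    return 0;
}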
1) Single threaded would probably perform a bit better than this, because all calculations are done within a lock and the overhead of locking will add to the total time. You are better off only locking when adding the local sums to the total sum, or storing the local sums in an array and calculating the total sum in the main thread.
2) Use timing statements in your code to measure elapsed time during the algorithm. In the multithreaded case, only measure elapsed time on the main thread.
3) Derived from your code:
int i, total_sum = 0;
for (i = 0; i < n; i++)
total_sum += SQR(i + 1);
A much larger consideration is scheduling. The easiest way for kernel-side threading to be implemented is for each thread to get equal time regardless. Processes are just threads with their own memory space. If all threads get equal time, adding a thread takes you from 1/n of the time to 2/(n + 1) of the time, which is obviously better given > 0 other threads that aren't yours.
Actual implementations may and do vary wildly though.
Off-topic a bit, but maybe avoid the mutex by having each thread write its result into an array element (so allocate "results = calloc(p, sizeof(int))" - by the way, "p" is an awful name for the variable holding the number of threads - and set results[thr] = local_sum), and have the joining thread (well, main()) do the summing of the results. That way each thread is responsible only for calculating its own total: only main(), which orchestrates the threads, joins their data together. Separation of concerns.
For extra credit (:p), use the arg passed to do_calc() as a way to pass the thread ID and the location to write the result to rather than relying on a global array.
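A sketch of that mutex-free variant, with made-up names (task, worker, NTHREADS): each thread gets a struct telling it which index range to sum and where to store its partial result, and main() adds the partial sums after joining.

#include <pthread.h>
#include <stdio.h>

#define SQR(X) ((X) * (X))
#define NTHREADS 10

typedef struct {
    int first, last;              /* half-open range [first, last) */
    long long partial;            /* this thread's result */
} task;

static void *worker(void *arg)
{
    task *t = arg;
    long long sum = 0;
    for (int i = t->first; i < t->last; i++)
        sum += SQR((long long)(i + 1));
    t->partial = sum;             /* no lock needed: each thread owns its slot */
    return NULL;
}

int main(void)
{
    int n = 1000;                 /* assumed to be a multiple of NTHREADS */
    pthread_t tid[NTHREADS];
    task tasks[NTHREADS];
    long long total = 0;

    for (int t = 0; t < NTHREADS; t++) {
        tasks[t].first = t * (n / NTHREADS);
        tasks[t].last = (t + 1) * (n / NTHREADS);
        pthread_create(&tid[t], NULL, worker, &tasks[t]);
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += tasks[t].partial;
    }
    printf("Total sum: %lld\n", total);
    return 0;
}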