Hoare quicksort in C

Like other users, I am using this Wikipedia algorithm. However, I have tried to reimplement it using pointer arithmetic, and I'm having difficulty finding where I've gone wrong.
I think that this if statement is probably the cause, but I'm not sure.
...
if (left >= right) {
ret = (right - ptr);
return ret;
}
temp = *left;
*left = *right;
*right = temp;
/* sortstuff.h */
extern void quicksort(const size_t n, int * ptr);
/* sortstuff.c */
size_t quicksortpartition(const size_t n, int * ptr);
void quicksort(const size_t n, int * ptr) {
int* end = ptr + n - 1;
// for debug purposes
if (original_ptr == NULL) {
original_ptr = ptr;
original_count = n;
}
if (n > 1) {
size_t index = quicksortpartition(n, ptr);
quicksort(index, ptr);
quicksort(n - index - 1, ptr + index + 1);
}
return;
}
size_t quicksortpartition(const size_t n, int * ptr) {
int* right = ptr + n - 1;
int* pivot = ptr + (n - 1) / 2;
int* left = ptr;
int temp;
size_t ret = NULL;
while (1) {
while (*left <= *pivot && left < pivot) {
++left;
}
while (*right > *pivot) {
--right;
}
if (left >= right) {
ret = (right - ptr);
return ret;
}
temp = *left;
*left = *right;
*right = temp;
//print_arr();
}
}
int main(void) {
}
/* main.c */
int array0[] = {5, 22, 16, 3, 1, 14, 9, 5};
const size_t array0_count = sizeof(array0) / sizeof(array0[0]);
int main(void) {
quicksort(array0_count, array0);
printf("array out: ");
for (size_t i = 0; i != array0_count; ++i) {
printf("%d ", array0[i]);
}
puts("");
}
I don't think there are any off-by-one errors.

The code you have presented does not accurately implement the algorithm you referenced. Consider in particular this loop:
while (*left <= *pivot && left < pivot) {
++left;
}
The corresponding loop in the algorithm description has no analog of the left < pivot loop-exit criterion, and its analog of *left <= *pivot uses strict less-than (<), not (<=).
It's easy to see that the former discrepancy must constitute an implementation error. The final sorted position of the pivot is where the left and right pointers meet, but the condition prevents the left pointer from ever advancing past the initial position of the pivot. Thus, if the correct position is to the right of the initial position, the partition function cannot return the correct value. It takes a more careful analysis to realize that the partition function is also prone to looping infinitely in that case, though I think that is somewhat data-dependent.
The latter discrepancy constitutes a provisional error. It risks overrunning the end of the array in the event that the selected pivot value happens to be the largest value in the array, but that's based in part on the fact that the left < pivot condition is erroneous and must be removed. You could replace the latter with left < right to resolve that issue, but although you could form a working sort that way, it probably would not be an improvement on the logic details presented in the algorithm description.
Note, however, that with the <= variation, either quicksortpartition() needs to do extra work (not presently provided for) to ensure that the pivot value ends up at the computed pivot position, or else the quicksort function needs to give up its assumption that that will happen. The former is more practical, supposing you want your sort to be robust.
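A minimal sketch of a partition that follows the referenced scheme more closely while keeping the count-and-pointer interface from the question. The names hoare_partition and quicksort_fixed are hypothetical. The pivot is copied by value, both scans use strict comparisons, and the recursion does not assume the pivot ends up at the returned index. Starting the pointers on the first and last elements and stepping them after each swap is a choice made for this sketch, not the exact form of the algorithm description:

```c
#include <stddef.h>

/* Hoare-style partition of ptr[0..n-1], for n >= 2.
   Returns an index j such that every element of ptr[0..j] is <= every
   element of ptr[j+1..n-1]. The pivot is NOT guaranteed to sit at j. */
static size_t hoare_partition(size_t n, int *ptr) {
    int pivot = ptr[(n - 1) / 2];   /* copy the value; swaps may move the element */
    int *left = ptr;
    int *right = ptr + n - 1;
    for (;;) {
        while (*left < pivot) ++left;     /* strict comparison, no extra bound */
        while (*right > pivot) --right;
        if (left >= right) return (size_t)(right - ptr);
        int temp = *left;
        *left = *right;
        *right = temp;
        ++left;                           /* step past the swapped pair */
        --right;
    }
}

void quicksort_fixed(size_t n, int *ptr) {
    if (n > 1) {
        size_t index = hoare_partition(n, ptr);
        quicksort_fixed(index + 1, ptr);                  /* ptr[0..index] */
        quicksort_fixed(n - index - 1, ptr + index + 1);  /* the rest */
    }
}
```

Note that the first recursive call covers index + 1 elements, because with this scheme the element at the returned index still belongs to the left part.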

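Pivot needs to be an int, not a pointer: the swaps can move the element the pointer refers to, so *pivot could change in the middle of the partition. Also, to follow the Wiki algorithm more closely, the parameters should be two pointers, not a count and a pointer. I moved the partition logic into the quick sort function.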
void QuickSort(int *lo, int *hi)
{
int *i, *j;
int p, t;
if(lo >= hi)
return;
p = *(lo + (hi-lo)/2);
i = lo - 1;
j = hi + 1;
while (1){
while (*(++i) < p);
while (*(--j) > p);
if (i >= j)
break;
t = *i;
*i = *j;
*j = t;
}
QuickSort(lo, j);
QuickSort(j+1, hi);
}
The call would be:
QuickSort(array0, array0+array0_count-1);

Related

How to write a C function to detect cycles in a void* array

I'm trying to implement the C function int contains_cycle(void *const array[], size_t length) to detect if there are any "cycles" in an array of void pointers. All elements of this array either point to an address within this array or are NULL. Pointers still quite overwhelm me and I've got no idea where to start.
To clarify what I mean by cycle, here are some examples. For illustration, assume the first element is always at address 0x1 and pointers are 1 byte in size.
{NULL, 0x3, 0x2} -> should return 1, cycle between array[1] and array [2]
{0x2, 0x3, 0x1} -> should return 1, cycle between all the elements
{0x2, 0x3, NULL} -> should return 0, no cycle
I would appreciate any help and if my goal is still not quite clear, I am happy to explain more.
My idea would be to iterate over the array and somehow "follow" the pointers to see if I end up at the starting point again. If that's the case for at least one element, I've found a cycle.
Yes. You just "follow the pointers", but you need to know whether you followed to a pointer that you already hit.
So my idea to solve your problem is to make a struct that contains an index instead of a pointer because this makes life so much easier...
typedef struct {
size_t toIndex;
bool marked;
} Entry;
Then I create a new array of all these entries with the same length as the original. I calculate the toIndex that I store in the struct using the current element's pointer minus the address of the array's beginning.
bool contains_cycle(void* array[], size_t length) {
Entry newArray[length];
for(size_t i = 0; i < length; ++i) {
size_t toIndex = ((size_t) array[i] - (size_t) &array[0] ) / sizeof *array;
newArray[i] = (Entry) { toIndex, false };
}
After that I look for the first index where the pointer is not null
size_t index = 0;
for(size_t i = 0; i < length; ++i) {
if (array[i] == NULL) continue;
index = i;
break;
}
Now we let a loop run until we hit an index that is out of bounds (this implicitly detects a NULL element) and check on each step whether the current element is already marked. If so, return true.
while(index < length) {
if (newArray[index].marked) return true;
newArray[index].marked = true;
index = newArray[index].toIndex;
}
If the loop exits without a return, you know that no cycle is reachable from that starting index. You still need to check whether a cycle is reachable from any other index that you haven't marked yet. But I'm too lazy to implement that now. Go try this yourself :)
For now I just return false
return false;
}
I tried to replicate your examples in the main function.
#include <stdio.h>
#include <stdbool.h>
typedef struct {
size_t toIndex;
bool marked;
} Entry;
bool contains_cycle(void* array[], size_t length) {
Entry newArray[length];
for(size_t i = 0; i < length; ++i) {
size_t toIndex = ((size_t) array[i] - (size_t) &array[0] ) / sizeof *array;
newArray[i] = (Entry) { toIndex, false };
}
size_t index = 0;
for(size_t i = 0; i < length; ++i) {
if (array[i] == NULL) continue;
index = i;
break;
}
while(index < length) {
if (newArray[index].marked) return true;
newArray[index].marked = true;
index = newArray[index].toIndex;
}
return false;
}
int main() {
void* example1[3];
void* example2[3];
void* example3[3];
example1[0] = NULL;
example1[1] = &example1[2];
example1[2] = &example1[1];
example2[0] = &example2[1];
example2[1] = &example2[2];
example2[2] = &example2[0];
example3[0] = &example3[1];
example3[1] = &example3[2];
example3[2] = NULL;
printf("%d ", contains_cycle(example1, 3));
printf("%d ", contains_cycle(example2, 3));
printf("%d ", contains_cycle(example3, 3));
}
I'm certain that there can be a faster way, but the one above does work with your examples.
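The step left open in this answer, starting a walk from every index, can be sketched as follows. This is one possible approach, not the answer's own code: it needs no marker array, because a walk that has not reached NULL after more than length steps must be inside a cycle. The helper target_index is hypothetical, and the sketch assumes uintptr_t is available and that comparing addresses as integers behaves as expected on the target platform:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: maps a stored pointer to the index of the slot it
   points at, or to `length` if it is NULL or lies outside the array. */
static size_t target_index(void *const array[], size_t length, const void *p) {
    if (p == NULL) return length;
    uintptr_t base = (uintptr_t)(const void *)array;
    uintptr_t addr = (uintptr_t)p;
    if (addr < base) return length;
    size_t idx = (size_t)((addr - base) / sizeof array[0]);
    return idx < length ? idx : length;
}

/* Returns 1 if following the pointers from any slot never terminates. */
int contains_cycle(void *const array[], size_t length) {
    for (size_t start = 0; start < length; ++start) {
        size_t idx = start;
        /* A walk without a cycle reaches NULL within `length` steps. */
        for (size_t steps = 0; steps <= length; ++steps) {
            idx = target_index(array, length, array[idx]);
            if (idx == length) break;      /* walked off the end: no cycle here */
        }
        if (idx != length) return 1;       /* still walking: must be a cycle */
    }
    return 0;
}
```

This costs O(n^2) pointer follows in the worst case but uses constant extra memory; a marker array as in the answer above trades memory for a linear-time walk.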

Quick sort and bubble sort give different results

I'm facing a peculiar issue:
I have a big table containing many pairs of numbers, which I have to sort in descending order. I wrote a BubbleSort procedure that works fine but is very slow. So I used a QuickSort procedure and... the data inside the array changes after the sort!
So I tried with a sample table, with similar dimensions and "easy-to-write" content, basically a loop which assigns
table[i][0]=i*3
and
table[i][1]=i*5
and... Works fine.
The code used is the following:
typedef struct MATCHES {
short size;
unsigned short values[10000][2];
} MATCHES;
int partition(MATCHES **data, int left, int right, int pivot, int col){
int temp;
int i;
int storeIndex = left;
int pivotVal = (**data).values[pivot][col];
(**data).values[pivot][col] = (**data).values[right][col];
(**data).values[right][col] = pivotVal;
for(i = left; i < right; i++){
if ((**data).values[i][col] >= pivotVal){ //Change this to greater then and BOOM we're done
temp = (**data).values[i][col];
(**data).values[i][col] = (**data).values[storeIndex][col];
(**data).values[storeIndex][col] = temp;
storeIndex++;
}
}
temp = (**data).values[storeIndex][col];
(**data).values[storeIndex][col] = (**data).values[right][col];
(**data).values[right][col] = temp;
return storeIndex;
}
void quickSort(MATCHES **vec, int left, int right, int col) {
int r;
if (right > left) {
r = partition(vec, left, right, right+1/2, col);
quickSort(vec, left, r - 1, col);
quickSort(vec, r + 1, right, col);
}
}
void sorter(MATCHES *table) {
quickSort(&table, 0, (*table).size-1, 0);
quickSort(&table, 0, (*table).size-1, 1);
}
int main () {
MATCHES table;
table.size=10000;
int i;
for (i=0; i<table.size; i++) {
table.values[i][0]=i*3;
table.values[i][1]=i*5;
}
printf("Unsorted\n");
for (i=0; i<table.size; i++)
printf("%d %d\n",table.values[i][0],table.values[i][1]);
sorter(&table);
printf("Sorted\n");
for (i=0; i<table.size; i++)
printf("%d %d\n",table.values[i][0],table.values[i][1]);
return 0;
}
As another test, I put the data I actually need to sort into this program, and the result is wrong again.
I'll link the code, since it is very long due to the initialization vector.
http://pastebin.com/Ztwu6iUP
Thanks in advance for any help!
EDIT:
I found a partial solution. Instead of using quickSort, which is unstable, I used mergeSort. Now, when I sort the table the second time, for every value that appears two or three times in the second column, the data in the first column is sorted in ascending order.
The code is the following:
void merge(MATCHES *v, int i1, int i2, int fine, int col, MATCHES *vout) {
int i=i1, j=i2, k=i1;
while (i<=i2-1 && j<=fine) {
if ((*v).values[i][col]>(*v).values[j][col]) {
(*vout).values[k][0]=(*v).values[i][0];
(*vout).values[k][1]=(*v).values[i][1];
i++;
}
else {
(*vout).values[k][0]=(*v).values[j][0];
(*vout).values[k][1]=(*v).values[j][1];
j++;
}
k++;
}
while (i<=i2-1){
(*vout).values[k][0]=(*v).values[i][0];
(*vout).values[k][1]=(*v).values[i][1];
i++;
k++;
}
while (j<=fine){
(*vout).values[k][0]=(*v).values[j][0];
(*vout).values[k][1]=(*v).values[j][1];
j++;
k++;
}
for (i=i1; i<=fine; i++) {
(*v).values[i][0]=(*vout).values[i][0];
(*v).values[i][1]=(*vout).values[i][1];
}
}
void mergeSort(MATCHES *v, int iniz, int fine, int col, MATCHES *vout) {
int mid;
if(iniz<fine){
mid=(fine+iniz)/2;
mergeSort(v, iniz, mid, col, vout);
mergeSort(v, mid+1, fine, col, vout);
merge(v, iniz, mid+1, fine, col, vout);
}
}
Any hint for this?
In order to use quicksort to get stability, you need to answer the following question.
Can I tell the difference between a1 and a2?
If a1 and a2 differ because they have a secondary field, then there is a 'stable' solution with quick sort.
If a1 and a2 differ because they were added at different times (a field which doesn't matter), then the sort is unstable and will sometimes have a1 before a2 and sometimes after.
In your question, it is not clear if these numbers are linked
1,9
5,8
3,7
4,6
Should that go to :-
1,6
3,7
4,8
5,9
or
4,6
3,7
5,8
1,9
Are there 2 independent sorts? or is it a secondary field sort.
The merge code looks like a secondary field sort.
Sort on a secondary field
To sort on a secondary field, the comparison needs to be like :-
int compare( Atype* lhs, Atype * rhs )
{
if( lhs->field1 < rhs->field1 ) return -1;
if( lhs->field1 > rhs->field1 ) return 1;
if( lhs->field2 < rhs->field2 ) return -1;
if( lhs->field2 > rhs->field2 ) return 1;
/* more fields can be added here */
return 0;
}
Instead of sorting columns independently
quickSort(&table, 0, (*table).size-1, 0);
quickSort(&table, 0, (*table).size-1, 1);
Try the following.
Combining the sort into one go :-
quickSort(&table, 0, (*table).size-1 );
Change the comparison to take base array
int compare( const unsigned short * lhs, const unsigned short * rhs ) /* sort by 1 then 0; unsigned short matches MATCHES.values */
{
if( lhs[1] < rhs[1] ) return -1;
if( lhs[1] > rhs[1] ) return 1;
if( lhs[0] < rhs[0] ) return -1;
if( lhs[0] > rhs[0] ) return 1;
return 0;
}
Partition becomes
int partition(MATCHES **data, int left, int right, int pivot, int col){
int temp;
int i;
int storeIndex = left;
unsigned short pivotVal[2]; /* same element type as values[][] */
pivotVal[0] = (**data).values[pivot][0];
pivotVal[1] = (**data).values[pivot][1];
/* here you were jumbling pivot value - not keeping [0,1] together */
(**data).values[pivot][0] = (**data).values[right][0];
(**data).values[pivot][1] = (**data).values[right][1];
(**data).values[right][0] = pivotVal[0];
(**data).values[right][1] = pivotVal[1];
for(i = left; i < right; i++){
if ( compare( (**data).values[i] , pivotVal ) >= 0){ //Change this to greater then and BOOM we're done
temp = (**data).values[i][0];
(**data).values[i][0] = (**data).values[storeIndex][0];
(**data).values[storeIndex][0] = temp;
temp = (**data).values[i][1];
(**data).values[i][1] = (**data).values[storeIndex][1];
(**data).values[storeIndex][1] = temp;
storeIndex++;
}
}
temp = (**data).values[storeIndex][0];
(**data).values[storeIndex][0] = (**data).values[right][0];
(**data).values[right][0] = temp;
temp = (**data).values[storeIndex][1];
(**data).values[storeIndex][1] = (**data).values[right][1];
(**data).values[right][1] = temp;
return storeIndex;
}
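For comparison, here is a compact, self-contained sketch of the same idea: each row is moved as a unit, and one two-key comparison decides the order. The names compare_rows, swap_rows and sort_rows_desc are hypothetical, and the function works on a bare two-column array rather than the MATCHES struct, which is a simplification made for this sketch:

```c
/* Orders rows by column 1, then by column 0 (negative, zero or positive). */
static int compare_rows(const unsigned short *lhs, const unsigned short *rhs) {
    if (lhs[1] != rhs[1]) return lhs[1] < rhs[1] ? -1 : 1;
    if (lhs[0] != rhs[0]) return lhs[0] < rhs[0] ? -1 : 1;
    return 0;
}

/* Swaps both columns, so a pair is never split apart. */
static void swap_rows(unsigned short *a, unsigned short *b) {
    for (int c = 0; c < 2; ++c) {
        unsigned short t = a[c];
        a[c] = b[c];
        b[c] = t;
    }
}

/* Quicksort of rows[left..right] into descending order. */
void sort_rows_desc(unsigned short rows[][2], int left, int right) {
    if (left >= right) return;
    int pivot = left + (right - left) / 2;
    swap_rows(rows[pivot], rows[right]);        /* park the pivot row at the end */
    int store = left;
    for (int i = left; i < right; ++i) {
        if (compare_rows(rows[i], rows[right]) > 0) {   /* larger rows go first */
            swap_rows(rows[i], rows[store]);
            ++store;
        }
    }
    swap_rows(rows[store], rows[right]);        /* pivot row into its final slot */
    sort_rows_desc(rows, left, store - 1);
    sort_rows_desc(rows, store + 1, right);
}
```

Because ties in column 1 are broken by column 0, the result does not depend on whether the underlying sort is stable.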

Max in array and its frequency

How do you write a function that finds max value in an array as well as the number of times the value appears in the array?
We have to use recursion to solve this problem.
So far I am thinking it should be something like this:
int findMax(int[] a, int head, int last)
{
int max = 0;
if (head == last) {
return a[head];
}
else if (a[head] < a[last]) {
count ++;
return findMax(a, head + 1, last);
}
}
I am not sure if this will return the absolute highest value, though, and I'm not exactly sure how to change what I have.
Setting the initial value of max to INT_MIN solves a number of issues. @Rerito
But the approach OP uses iterates through each member of the array and incurs a recursive call for each element. So if the array had 1000 int there would be about 1000 nested calls.
A divide and conquer approach:
If the array length is 0 or 1, handle it directly. Otherwise, find the max answer for the first and second halves and combine the results as appropriate. Because each level halves the length, the stack depth for a 1000-element array stays at roughly log2(1000), about 10 nested calls.
Note: In either approach, the number of calls is the same. The difference lies in the maximum degree of nesting. Using recursion where a simple for() loop would suffice is questionable. To conquer a more complex assessment is recursion's strength, hence this approach.
To find the max and its frequency using O(log2(length)) stack depth usage:
#include <limits.h> // INT_MIN
#include <stddef.h>
typedef struct {
int value;
size_t frequency; // `size_t` is better to use than `int` for large arrays.
} value_freq;
value_freq findMax(const int *a, size_t length) {
value_freq vf;
if (length <= 1) {
if (length == 0) {
vf.value = INT_MIN; // Degenerate value if the array was size 0.
vf.frequency = 0;
} else {
vf.value = *a;
vf.frequency = 1;
}
} else {
size_t length1sthalf = length / 2;
vf = findMax(a, length1sthalf);
value_freq vf1 = findMax(&a[length1sthalf], length - length1sthalf);
if (vf1.value > vf.value)
return vf1;
if (vf.value == vf1.value)
vf.frequency += vf1.frequency;
}
return vf;
}
You are not that far off.
In order to save the frequency and the max you can keep a pointer to a structure, then just pass the pointer to the start of your array, the length you want to go through, and a pointer to this struct.
Keep in mind that you should use INT_MIN in limits.h as your initial max (see reset(maxfreq *) in the code below), as int can carry negative values.
The following code does the job recursively:
#include <limits.h>
typedef struct {
int max;
int freq;
} maxfreq;
void reset(maxfreq *mfreq){
mfreq->max = INT_MIN;
mfreq->freq = 0;
}
void findMax(int* a, int length, maxfreq *mfreq){
if(length>0){
if(*a == mfreq->max)
mfreq->freq++;
else if(*a > mfreq->max){
mfreq->freq = 1;
mfreq->max = *a;
}
findMax(a+1, length - 1, mfreq);
}
}
A call to findMax calls itself once per element, plus one final call with length 0, each time advancing the pointer and processing the corresponding element. So this basically just goes through all of the elements of a once, with no splitting.
this works fine with me :
#include <stdio.h>
#include <string.h>
// define a struct that contains the (max, freq) information
struct arrInfo
{
int max;
int count;
};
struct arrInfo maxArr(int * arr, int max, int size, int count)
{
int maxF;
struct arrInfo myArr;
if(size == 0) // to return from recursion we check the size left
{
myArr.max = max; // prepare the struct to output
myArr.count = count;
return(myArr);
}
if(*arr > max) // new maximum found
{
maxF = *arr; // update the max
count = 1; // initialize the frequency
}
else if (*arr == max) // same max encountered another time
{
maxF = max; // keep track of same max
count ++; // increase frequency
}
else // nothing changes
maxF = max; // keep track of max
arr++; // move the pointer to next element
size --; // decrease size by 1
return(maxArr(arr, maxF, size, count)); // recursion
}
int main()
{
struct arrInfo info; // return of the recursive function
// define an array
int arr[] = {8, 4, 8, 3, 7};
info = maxArr(arr, 0, 5, 1); // call with max=0 size=5 freq=1
printf("max = %d count = %d\n", info.max, info.count);
return 0;
}
When run, it outputs:
max = 8 count = 2
Notice
In my code example I assumed the numbers to be positive (hence max is initialized to 0). I don't know your requirements, but you can adapt this as needed.
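One way to drop the positivity assumption is to seed the recursion with the first element instead of 0. The sketch below is a variation on the code above, not the answer's own version: the names max_count, max_count_from and max_with_count are hypothetical, and the array is assumed to hold at least one element:

```c
#include <stddef.h>

struct max_count {
    int max;
    size_t count;
};

/* Recursive helper: carries the running maximum and its frequency. */
static struct max_count max_count_from(const int *arr, size_t size,
                                       int max, size_t count) {
    if (size == 0) {
        struct max_count result = { max, count };
        return result;
    }
    if (*arr > max) {            /* new maximum: restart the count */
        max = *arr;
        count = 1;
    } else if (*arr == max) {    /* same maximum seen again */
        ++count;
    }
    return max_count_from(arr + 1, size - 1, max, count);
}

/* Entry point; requires size >= 1. Seeds the search with arr[0]. */
struct max_count max_with_count(const int *arr, size_t size) {
    return max_count_from(arr + 1, size - 1, arr[0], 1);
}
```

Seeding with the first element also works for arrays that contain only negative values, without needing INT_MIN.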
The requirements in your assignment are at least questionable. Just for reference, here is how this should be done in real code (to solve your assignment, refer to the other answers):
#include <limits.h> // for INT_MIN
int findMax(int length, int* array, int* maxCount) {
int trash;
if(!maxCount) maxCount = &trash; //make sure we ignore it when a NULL pointer is passed in
*maxCount = 0;
int result = INT_MIN;
for(int i = 0; i < length; i++) {
if(array[i] > result) {
*maxCount = 1;
result = array[i];
} else if(array[i] == result) {
(*maxCount)++;
}
}
return result;
}
Always do things as straightforwardly as you can.

Sorting an array of coordinates by their distance from origin

The code should take an array of coordinates from the user, then sort that array, putting the coordinates in order of their distance from the origin. I believe my problem lies in the sorting function (I have used a quicksort).
I am trying to write the function myself to get a better understanding of it, which is why I'm not using qsort().
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define MAX_SIZE 64
typedef struct
{
double x, y;
}POINT;
double distance(POINT p1, POINT p2);
void sortpoints(double distances[MAX_SIZE], int firstindex, int lastindex, POINT data[MAX_SIZE]);
void printpoints(POINT data[], int n_points);
int main()
{
int n_points, i;
POINT data[MAX_SIZE], origin = { 0, 0 };
double distances[MAX_SIZE];
printf("How many values would you like to enter?\n");
scanf("%d", &n_points);
printf("enter your coordinates\n");
for (i = 0; i < n_points; i++)
{
scanf("%lf %lf", &data[i].x, &data[i].y);
distances[i] = distance(data[i], origin); //data and distances is linked by their index number in both arrays
}
sortpoints(distances, 0, i, data);
return 0;
}
double distance(POINT p1, POINT p2)
{
return sqrt(pow((p1.x - p2.x), 2) + pow((p1.y - p2.y), 2));
}
void printpoints(POINT *data, int n_points)
{
int i;
printf("Sorted points (according to distance from the origin):\n");
for (i = 0; i < n_points; i++)
{
printf("%.2lf %.2lf\n", data[i].x, data[i].y);
}
}
//quicksort
void sortpoints(double distances[MAX_SIZE], int firstindex, int lastindex, POINT data[MAX_SIZE])
{
int indexleft = firstindex;
int indexright = lastindex;
int indexpivot = (int)((lastindex + 1) / 2);
int n_points = lastindex + 1;
double left = distances[indexleft];
double right = distances[indexright];
double pivot = distances[indexpivot];
POINT temp;
if (firstindex < lastindex) //this will halt the recursion of the sorting function once all the arrays are 1-size
{
while (indexleft < indexpivot || indexright > indexpivot) //this will stop the sorting once both selectors reach the pivot position
{
//reset the values of left and right for the iterations of this loop
left = distances[indexleft];
right = distances[indexright];
while (left < pivot)
{
indexleft++;
left = distances[indexleft];
}
while (right > pivot)
{
indexright--;
right = distances[indexright];
}
distances[indexright] = left;
distances[indexleft] = right;
temp = data[indexleft];
data[indexleft] = data[indexright];
data[indexright] = temp;
}
//recursive sorting to sort the sublists
sortpoints(distances, firstindex, indexpivot - 1, data);
sortpoints(distances, indexpivot + 1, lastindex, data);
}
printpoints(data, n_points);
}
Thanks for your help, I have been trying to debug this for hours, even using a debugger.
Ouch! You call sortpoints() with i as argument. That argument, according to your prototype and code, should be the last index, and i is not the last index, but the last index + 1.
int indexleft = firstindex;
int indexright = lastindex; // indexright is pointing to a non-existent element.
int indexpivot = (int)((lastindex + 1) / 2);
int n_points = lastindex + 1;
double left = distances[indexleft];
double right = distances[indexright]; // now right is an undefined value, or segfault.
To fix that, call your sortpoints() function as:
sortpoints(distances, 0, n_points - 1, data);
The problem is in your sortpoints function: the first while loop loops infinitely. To test whether it is an infinite loop, place a printf statement
printf("Testing first while loop\n");
in your first while loop. You have to fix that.
There are quite a number of problems, but one of them is:
int indexpivot = (int)((lastindex + 1) / 2);
The cast is unnecessary, but that's trivia. Much more fundamental is that if you are sorting a segment from, say, 48..63, you will be pivoting on element 32, which is not in the range you are supposed to be working on. You need to use:
int indexpivot = (lastindex + firstindex) / 2;
or perhaps:
int indexpivot = (lastindex + firstindex + 1) / 2;
For the example range, these will pivot on element 55 or 56, which is at least within the range.
I strongly recommend:
Creating a print function similar to printpoints() but with the following differences:
Takes a 'tag' string to identify what it is printing.
Takes and prints the distance array too.
Takes the arrays and a pair of offsets.
Use this function inside the sort function before recursing.
Use this function inside the sort function before returning.
Use this function in the main function after you've read the data.
Use this function in the main function after the data is sorted.
Print key values — the pivot distance, the pivot index, at appropriate points.
This allows you to check that your partitioning is working correctly (it isn't at the moment).
Then, when you've got the code working, you can remove or disable (comment out) the printing code in the sort function.
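For reference while debugging, here is a minimal sketch of a partitioning scheme that keeps the two arrays in lockstep and only ever looks inside the range it was given. It is not the asker's code. The name sort_by_distance is hypothetical, and the arguments are the first and last valid indices, both inclusive:

```c
typedef struct {
    double x, y;
} POINT;

/* Sorts distances[first..last] in ascending order and applies every swap
   to data[] as well, so each point stays paired with its distance. */
void sort_by_distance(double distances[], int first, int last, POINT data[]) {
    if (first >= last) return;                 /* zero or one element */
    double pivot = distances[first + (last - first) / 2];  /* inside the range */
    int i = first;
    int j = last;
    while (i <= j) {
        while (distances[i] < pivot) i++;
        while (distances[j] > pivot) j--;
        if (i <= j) {
            double d = distances[i];
            distances[i] = distances[j];
            distances[j] = d;
            POINT p = data[i];
            data[i] = data[j];
            data[j] = p;
            i++;
            j--;
        }
    }
    sort_by_distance(distances, first, j, data);   /* left part */
    sort_by_distance(distances, i, last, data);    /* right part */
}
```

With the variables from the question, the call would be sort_by_distance(distances, 0, n_points - 1, data), followed by a single printpoints(data, n_points) in main rather than inside the recursive function.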

Removing Duplicates in an array in C

The question is a little complex. The problem is to get rid of duplicates and save the unique elements of an array into another array in their original sequence.
For example:
If the input entered is b a c a d t
the result should be b a c d t, in exactly the order in which the input was entered.
So sorting the array and then checking doesn't work, since I lose the original sequence. I was advised to use an array of indices, but I don't know how to do that. What is your advice?
For those who are willing to answer the question I wanted to add some specific information.
char** finduni(char *words[100],int limit)
{
//
//Methods here
//
}
is my function. The array whose duplicates should be removed and stored in a different array is words[100], so the processing will be done on this. I first thought about copying all the elements of words into another array and sorting that array, but that didn't work after some tests. Just a reminder for solvers :).
Well, here is a version for char types. Note it doesn't scale.
#include "stdio.h"
#include "string.h"
void removeDuplicates(char *string)
{
unsigned char allCharacters [256] = { 0 };
size_t length = strlen(string); // measure once, before the string is rewritten
size_t lookAt;
size_t writeTo = 0;
for(lookAt = 0; lookAt < length; lookAt++)
{
unsigned char c = (unsigned char) string[lookAt]; // a plain char may be negative, so convert before indexing
if(allCharacters[c] == 0)
{
allCharacters[c] = 1; // mark it seen
string[writeTo++] = string[lookAt]; // copy it
}
}
string[writeTo] = '\0';
}
int main()
{
char word[] = "abbbcdefbbbghasdddaiouasdf";
removeDuplicates(word);
printf("Word is now [%s]\n", word);
return 0;
}
The following is the output:
Word is now [abcdefghsiou]
Is that something like what you want? You can modify the method if there are spaces between the letters, but if you use int, float, double or char * as the types, this method won't scale at all.
EDIT
I posted and then saw your clarification, where it's an array of char *. I'll update the method.
I hope this isn't too much code. I adapted this QuickSort algorithm and basically added index memory to it. The algorithm is O(n log n), as the 3 steps below are additive and that is the worst case complexity of 2 of them.
Sort the array of strings, but every swap should be reflected in the index array as well. After this stage, the i'th element of originalIndices holds the original index of the i'th element of the sorted array.
Remove duplicate elements in the sorted array by setting them to NULL, and setting the index value to elements, which is the highest any can be.
Sort the array of original indices, and make sure every swap is reflected in the array of strings. This gives us back the original array of strings, except the duplicates are at the end and they are all NULL.
For good measure, I return the new count of elements.
Code:
#include "stdio.h"
#include "string.h"
#include "stdlib.h"
void sortArrayAndSetCriteria(char **arr, int elements, int *originalIndices)
{
#define MAX_LEVELS 1000
char *piv;
int beg[MAX_LEVELS], end[MAX_LEVELS], i=0, L, R;
int idx, cidx;
for(idx = 0; idx < elements; idx++)
originalIndices[idx] = idx;
beg[0] = 0;
end[0] = elements;
while (i>=0)
{
L = beg[i];
R = end[i] - 1;
if (L<R)
{
piv = arr[L];
cidx = originalIndices[L];
if (i==MAX_LEVELS-1)
return;
while (L < R)
{
while (strcmp(arr[R], piv) >= 0 && L < R) R--;
if (L < R)
{
arr[L] = arr[R];
originalIndices[L++] = originalIndices[R];
}
while (strcmp(arr[L], piv) <= 0 && L < R) L++;
if (L < R)
{
arr[R] = arr[L];
originalIndices[R--] = originalIndices[L];
}
}
arr[L] = piv;
originalIndices[L] = cidx;
beg[i + 1] = L + 1;
end[i + 1] = end[i];
end[i++] = L;
}
else
{
i--;
}
}
}
int removeDuplicatesFromBoth(char **arr, int elements, int *originalIndices)
{
// now remove duplicates
int i = 1, newLimit = 1;
char *curr = arr[0];
while (i < elements)
{
if(strcmp(curr, arr[i]) == 0)
{
arr[i] = NULL; // free this if it was malloc'd
originalIndices[i] = elements; // place it at the end
}
else
{
curr = arr[i];
newLimit++;
}
i++;
}
return newLimit;
}
void sortArrayBasedOnCriteria(char **arr, int elements, int *originalIndices)
{
#define MAX_LEVELS 1000
int piv;
int beg[MAX_LEVELS], end[MAX_LEVELS], i=0, L, R;
int idx;
char *cidx;
beg[0] = 0;
end[0] = elements;
while (i>=0)
{
L = beg[i];
R = end[i] - 1;
if (L<R)
{
piv = originalIndices[L];
cidx = arr[L];
if (i==MAX_LEVELS-1)
return;
while (L < R)
{
while (originalIndices[R] >= piv && L < R) R--;
if (L < R)
{
arr[L] = arr[R];
originalIndices[L++] = originalIndices[R];
}
while (originalIndices[L] <= piv && L < R) L++;
if (L < R)
{
arr[R] = arr[L];
originalIndices[R--] = originalIndices[L];
}
}
arr[L] = cidx;
originalIndices[L] = piv;
beg[i + 1] = L + 1;
end[i + 1] = end[i];
end[i++] = L;
}
else
{
i--;
}
}
}
int removeDuplicateStrings(char *words[], int limit)
{
int *indices = (int *)malloc(limit * sizeof(int));
int newLimit;
sortArrayAndSetCriteria(words, limit, indices);
newLimit = removeDuplicatesFromBoth(words, limit, indices);
sortArrayBasedOnCriteria(words, limit, indices);
free(indices);
return newLimit;
}
int main()
{
char *words[] = { "abc", "def", "bad", "hello", "captain", "def", "abc", "goodbye" };
int newLimit = removeDuplicateStrings(words, 8);
int i = 0;
for(i = 0; i < newLimit; i++) printf(" Word # %d = %s\n", i, words[i]);
return 0;
}
Traverse through the items in the array - O(n) operation
For each item, add it to another sorted-array
Before adding it to the sorted array, check if the entry already exists - O(log n) operation
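Overall, this is an O(n log n) operation in comparisons; note that inserting into a sorted plain array also shifts up to O(n) elements per insertion.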
I think that in C you can create a second array, then copy an element from the original array only if it is not already in the second array.
This also preserves the order of the elements.
If you read the elements one by one, you can discard a duplicate before inserting it into the original array, which could speed up the process.
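A minimal sketch of this second-array idea for strings. The function name find_unique and its signature are hypothetical (they do not match the finduni prototype from the question), and the caller is assumed to provide an output array with room for count pointers:

```c
#include <stddef.h>
#include <string.h>

/* Copies the first occurrence of each string in words[0..count-1] into
   out[], preserving the original order. Returns the number of unique
   strings. Only the pointers are copied, not the characters. */
size_t find_unique(char *const words[], size_t count, char *out[]) {
    size_t unique = 0;
    for (size_t i = 0; i < count; ++i) {
        size_t k = 0;
        /* linear search through what has been kept so far */
        while (k < unique && strcmp(out[k], words[i]) != 0) {
            ++k;
        }
        if (k == unique) {            /* not seen before: keep it */
            out[unique++] = words[i];
        }
    }
    return unique;
}
```

This is O(n^2) string comparisons in the worst case, which is usually acceptable for a 100-element array like the one in the question.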
As Thomas suggested in a comment, if each element of the array is guaranteed to be from a limited set of values (such as a char) you can achieve this in O(n) time.
Keep an array of 256 bool (or int if your compiler doesn't support bool) or however many different discrete values could possibly be in the array. Initialize all the values to false.
Scan the input array one-by-one.
For each element, if the corresponding value in the bool array is false, add it to the output array and set the bool array value to true. Otherwise, do nothing.
You know how to do it for char type, right?
You can do same thing with strings, but instead of using array of bools (which is technically an implementation of "set" object), you'll have to simulate the "set"(or array of bools) with a linear array of strings you already encountered. I.e. you have an array of strings you already saw, for each new string you check if it is in array of "seen" strings, if it is, then you ignore it (not unique), if it is not in array, you add it to both array of seen strings and output. If you have a small number of different strings (below 1000), you could ignore performance optimizations, and simply compare each new string with everything you already saw before.
With large number of strings (few thousands), however, you'll need to optimize things a bit:
1) Every time you add a new string to an array of strings you already saw, sort the array with insertion sort algorithm. Don't use quickSort, because insertion sort tends to be faster when data is almost sorted.
2) When checking if string is in array, use binary search.
If number of different strings is reasonable (i.e. you don't have billions of unique strings), this approach should be fast enough.
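The optimized variant described above might look like the following sketch. It keeps a separate sorted array of the strings seen so far, checks it with a binary search, and inserts each new string at its sorted position (one insertion-sort step). The names lower_bound and find_unique_sorted are hypothetical, and returning (size_t)-1 on allocation failure is a convention chosen only for this sketch:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Binary search in the sorted array seen[0..n-1]. Returns the position of
   the first entry that is not less than key and sets *found to 1 if that
   entry equals key, otherwise to 0. */
static size_t lower_bound(char *const seen[], size_t n, const char *key, int *found) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(seen[mid], key) < 0) lo = mid + 1;
        else hi = mid;
    }
    *found = (lo < n && strcmp(seen[lo], key) == 0);
    return lo;
}

/* Writes the first occurrence of each string to out[] in original order.
   Returns the number of unique strings, or (size_t)-1 if malloc fails. */
size_t find_unique_sorted(char *const words[], size_t count, char *out[]) {
    char **seen = malloc(count * sizeof *seen);
    if (seen == NULL && count > 0) return (size_t)-1;
    size_t unique = 0;
    for (size_t i = 0; i < count; ++i) {
        int found;
        size_t pos = lower_bound(seen, unique, words[i], &found);
        if (found) continue;                       /* duplicate: skip it */
        /* shift the tail right by one slot to keep seen[] sorted */
        memmove(&seen[pos + 1], &seen[pos], (unique - pos) * sizeof *seen);
        seen[pos] = words[i];
        out[unique++] = words[i];
    }
    free(seen);
    return unique;
}
```

The lookups cost O(log n) comparisons each, while the memmove still shifts up to n pointers per new string, so the gain is mainly in the number of strcmp calls.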

Resources