Remove duplicates from array of strings in C

I have an array of strings in C, each around 3000 characters long. I wanted to hash them for faster lookups and settled on perfect hashing. The problem is that a perfect hash needs a set of unique strings to build the hash function, whereas my data set inevitably contains duplicates.
So now I need a very fast way of removing duplicates from an array of strings in C.
Kindly suggest the fastest way to do this.

My first thought, without researching, was to potentially create some kind of basic hash for each string and only check the complete strings for equality if the hashes match. This should allow for speeding up the algorithm slightly, at a small cost to how straightforward the whole algorithm is. There should be a better solution than this, but it should help in a pinch.
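A minimal sketch of that idea, taken one step further into a small hash set: each string is hashed once, probes compare the stored hashes first, and strcmp runs only when the hashes are equal. The function names, the FNV-1a hash and the linear-probing table are illustrative assumptions, not something from the question.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a, a simple non-cryptographic string hash (an illustrative choice). */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

/*
 * Removes duplicates from arr in place, keeping the first occurrence of
 * each string and preserving order. Returns the new count, or n unchanged
 * if the temporary tables cannot be allocated.
 */
size_t dedup_hashed(char *arr[], size_t n)
{
    size_t cap = 16, out = 0;
    while (cap < 2 * n)              /* power of two, load factor <= 0.5 */
        cap *= 2;

    /* slot[p] holds (index + 1) of a kept string; 0 means the slot is empty. */
    size_t *slot = calloc(cap, sizeof *slot);
    uint64_t *hash = malloc((n ? n : 1) * sizeof *hash);
    if (!slot || !hash) {
        free(slot);
        free(hash);
        return n;
    }

    for (size_t i = 0; i < n; i++) {
        uint64_t h = fnv1a(arr[i]);
        size_t pos = (size_t)h & (cap - 1);
        int seen = 0;

        /* Linear probing: compare hashes first, strcmp only on a hash match. */
        while (slot[pos] != 0) {
            size_t k = slot[pos] - 1;
            if (hash[k] == h && strcmp(arr[k], arr[i]) == 0) {
                seen = 1;
                break;
            }
            pos = (pos + 1) & (cap - 1);
        }
        if (!seen) {
            arr[out] = arr[i];
            hash[out] = h;
            slot[pos] = out + 1;
            out++;
        }
    }
    free(slot);
    free(hash);
    return out;
}
```

Calling dedup_hashed(arr, n) returns the number of strings kept at the front of arr. Dropped pointers are simply overwritten, so if the strings were allocated with malloc the caller would need to free the duplicates separately.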

These are the data structures that can help:
array
Add each item to an array, then qsort the result.
Output the result, skipping any string equal to the previous one. This is what Unix sort | uniq does.
binary tree
Hold the strings in a binary search tree (see Wikipedia, binary tree). Before adding each string, search the tree, and add the string only if it is not already there.
hash table
Keep a hash table keyed on a hash of each string. Collisions are checked with strcmp, and duplicates are not added.
trie
See Wikipedia, trie. A trie stores common prefixes only once, so duplicates are 'lost' automatically.
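The array option can be sketched as follows, assuming the original order of the strings does not need to be kept. The names cmp_str and sort_unique are made up for this example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort passes pointers to the array elements, i.e. pointers to char *. */
static int cmp_str(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strcmp(sa, sb);
}

/* Sorts arr, drops adjacent duplicates, and returns the number of unique strings. */
size_t sort_unique(char *arr[], size_t n)
{
    if (n == 0)
        return 0;
    qsort(arr, n, sizeof arr[0], cmp_str);

    size_t out = 1;                       /* arr[0] is always kept */
    for (size_t i = 1; i < n; i++) {
        if (strcmp(arr[i], arr[out - 1]) != 0)
            arr[out++] = arr[i];
    }
    return out;
}
```

This costs O(n log n) string comparisons for the sort plus one linear pass, compared with the quadratic number of comparisons in a nested-loop scan.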

#include <string.h>
#include <stdio.h>
/**
 * Removes duplicate strings from the array and shifts items left.
 * Returns the number of items in the modified array.
 *
 * Parameters:
 * n_items - number of items in the array.
 * arr - an array of strings with possible duplicates.
 */
int remove_dups(int n_items, char *arr[])
{
    int i, j = 1, k = 1;
    for (i = 0; i < n_items; i++)
    {
        for (j = i + 1, k = j; j < n_items; j++)
        {
            /* If strings don't match... */
            if (strcmp(arr[i], arr[j]))
            {
                arr[k] = arr[j];
                k++;
            }
        }
        n_items -= j - k;
    }
    return n_items;
}

Related

Compare two arrays and create new array with equal elements in C

The problem is to check two arrays for the same integer value and put matching values in a new array.
Let say I have two arrays
a[n] = {2,5,2,7,8,4,2}
b[m] = {1,2,6,2,7,9,4,2,5,7,3}
Each array can be a different size.
I need to check if the arrays have matching elements and put them in a new array. The result in this case should be:
array[] = {2,2,2,5,7,4}
And I need to do it in O(n.log(n) + m.log(m)).
I know there is a way to do with merge sorting or put one of the array in a hash array but I really don't know how to implement it.
I will really appreciate your help, thanks!!!
As you have already figured out you can use merge sort (implementing it is beyond the scope of this answer, I suppose you can find a solution on wikipedia or searching on Stack Overflow) so that you can get nlogn + mlogm complexity supposing n is the size of the first array and m is the size of another.
Let's call the first array a (with the size n) and the second one b (with size m). First sort these arrays (merge sort would give us nlogn + mlogm complexity). And now we have:
a[n] // {2,2,2,4,5,7,8} and b[m] // {1,2,2,2,3,4,5,6,7,7,9}
Supposing n <= m, we can simply iterate over both arrays simultaneously, comparing corresponding values.
But first let's allocate an array int c[n]; to store the results (you can print to the console instead of storing them if you prefer). And now the loop itself:
int k = 0; // store the new size of c array!
for (int i = 0, j = 0; i < n && j < m; )
{
if (a[i] == b[j])
{
// match found, store it
c[k] = a[i];
++i; ++j; ++k;
}
else if (a[i] > b[j])
{
// current value in a is leading, go to next in b
++j;
}
else
{
// the last possibility is a[i] < b[j] - b is leading
++i;
}
}
Note: the loop itself is O(n+m) at worst, which is less than the cost of sorting, so the overall complexity is O(n log n + m log m). Now you can iterate over the c array (its allocated size is n, but the number of elements in it is k) and do what you need with those numbers.
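Put together, the sort and the merge-style loop might look like the sketch below. It uses the standard qsort instead of a hand-written merge sort, which is a substitution made here for brevity (qsort's worst-case complexity is not specified by the C standard). The names cmp_int and intersect_sorted are illustrative.

```c
#include <stddef.h>
#include <stdlib.h>

static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);      /* avoids the overflow risk of a - b */
}

/*
 * Sorts a and b in place, then writes their common elements (with
 * multiplicity) into c. c must have room for min(n, m) values.
 * Returns the number of values written.
 */
size_t intersect_sorted(int *a, size_t n, int *b, size_t m, int *c)
{
    qsort(a, n, sizeof *a, cmp_int);
    qsort(b, m, sizeof *b, cmp_int);

    size_t i = 0, j = 0, k = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) {
            c[k++] = a[i];         /* match found, consume one from each side */
            i++;
            j++;
        } else if (a[i] < b[j]) {
            i++;                   /* a is behind, advance it */
        } else {
            j++;                   /* b is behind, advance it */
        }
    }
    return k;
}
```

For the arrays in the question this yields the six values 2, 2, 2, 4, 5, 7, i.e. the same elements as the expected result but in sorted order.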
From the way that you explain it the way to do this would be to loop over the shorter array and check it against the longer array. Let us assume that A is the shorter array and B the longer array. Create a results array C.
Loop over each element in A, call it I
If I is found in B, remove it from B and put it in C, break out of the test loop.
Now go to the next element in A.
This means that if a number I is found twice in A and three times in B, then I will only appear twice in C. Once you finish, then every number found in both arrays will appear in C the number of times that it actually appears in both.
I am carefully not putting in suggested code as your question is about a method that you can use. You should figure out the code yourself.
I would be inclined to take the following approach:
1) Sort array B. There are many well published sort algorithms to do this, as well as several implementations in various generally available libraries.
2) Loop through array A and for each element do a binary search (or other suitable algorithm) on array B for a match. If a match is found, remove the element from array B (to avoid future matches) and add it to the output array.

Logic challenge: sorting arrays alphabetically in C

I'm new to programming, currently learning C. I've been working at this problem for a week now, and I just can't seem to get the logic straight. This is straight from the book I'm using:
Build a program that uses an array of strings to store the following names:
"Florida"
"Oregon"
"Califoria"
"Georgia"
Using the preceding array of strings, write your own sort() function to display each state's name in alphabetical order using the strcmp() function.
So, let's say I have:
char *statesArray[4] = {"Florida", "Oregon", "California", "Georgia"};
Should I do nested for loops, like strcmp(string[x], string[y])...? I've hacked and hacked away. I just can't wrap my head around the algorithm required to solve this even somewhat efficiently. Help MUCH appreciated!!!
Imagine you had to sort the array by hand: think of each state written on a card. How would you sort the cards into order? There are many ways of doing it. Each one is called an algorithm.
One way is to find the first state by looking at every card and keeping track in your head of the lowest one you have seen. After looking at each card you will have the lowest one. Put that in a new pile. Now repeat, trying to find the lowest of the ones you have left.
Repeat until no cards are left in the original pile.
This is a well-known, simple but slow algorithm. It's the one I would do first.
There are other ones too.
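The card procedure above is essentially selection sort. A minimal sketch in C, done in place instead of with a second pile (the function name selection_sort is made up here):

```c
#include <stddef.h>
#include <string.h>

/* Sorts an array of string pointers into ascending strcmp order. */
void selection_sort(const char *names[], size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        /* Look at every remaining "card" and remember the lowest one. */
        size_t lowest = i;
        for (size_t j = i + 1; j < n; j++) {
            if (strcmp(names[j], names[lowest]) < 0)
                lowest = j;
        }
        /* Move the lowest card to the front of the unsorted part. */
        const char *tmp = names[i];
        names[i] = names[lowest];
        names[lowest] = tmp;
    }
}
```

Only the pointers are swapped, so the strings themselves never move.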
Yes, you can sort by using nested for loops. After you understand how strcmp() works it should be fairly straight forward:
strcmp(char *string1, char *string2)
if Return value is < 0 then it indicates string1 is less than string2
if Return value is > 0 then it indicates string2 is less than string1
if Return value is = 0 then it indicates string1 is equal to string2
You can then choose any of the sorting methods once from this point
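One detail worth keeping in mind: only the sign of the strcmp() result is specified, not its magnitude, so code should test against 0 and never against 1 or -1. A small sketch (order_of is a hypothetical helper) that normalizes the result:

```c
#include <string.h>

/* Returns -1, 0 or +1 depending on how s1 orders relative to s2. */
int order_of(const char *s1, const char *s2)
{
    int r = strcmp(s1, s2);
    return (r > 0) - (r < 0);
}
```

With this helper, order_of("Florida", "Oregon") is -1 because 'F' comes before 'O', so inside a sort "Florida" belongs before "Oregon".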
This site has a ton of great graphical examples of various sorts being performed and includes the pseudo code for the given algorithms.
Do you need "any" sorting algorithm, or an "efficient" sorting algorithm?
For simplicity, I can show you how to implement an easy, but not efficient, sorting algorithm.
It's the double for method!!
Then, with the same ideas, you can modify it to any other efficient algorithm (like shell, or quicksort).
For numbers, you could put arrays in order as follows (as you probably know):
#include <stdio.h>
int intcmp(int a, int b) {
    return (a < b)? -1: ((a > b)? +1: 0);
}
int main(void) {
    int a[5] = {3, 4, 22, -13, 9};
    for (int i = 0; i < 5; i++) {
        /* After the inner loop, a[i] holds the smallest remaining value. */
        for (int j = i+1; j < 5; j++)
            if (intcmp(a[i], a[j]) > 0) {
                int temp = a[i];
                a[i] = a[j];
                a[j] = temp;
            }
        printf("%d ", a[i]);
    }
    return 0;
}
The only thing that has changed now is that you have strings instead of integers.
So, you have to consider an array of strings:
char *a[] = {"Florida", "Oregon", "California", "Georgia"};
Then, you have to change the type of temp to char*,
and finally you use the function strcmp() instead of intcmp().
The function strcmp(s1, s2) (from <string.h>)
returns a number < 0 if s1 is a string "less than" s2, == 0 if s1 is
"equal to" s2, and > 0 otherwise.
The program looks like this:
#include <stdio.h>
#include <string.h>
int main(void) {
    char *a[] = {"Florida", "Oregon", "California", "Georgia"};
    for (int i = 0; i < 4; i++) {
        /* After the inner loop, a[i] points to the smallest remaining string. */
        for (int j = i+1; j < 4; j++)
            if (strcmp(a[i], a[j]) > 0) {
                char* temp = a[i];
                a[i] = a[j];
                a[j] = temp;
            }
        printf("%s ", a[i]);
    }
    getchar();
    return 0;
}
Note that in the printf() statement, we have changed "%d " to "%s ", in order to print strings properly.
Final comment: When you program a better algorithm, like quicksort, it is enough to change the comparison function, because the algorithm is the same regardless of the type of data you are comparing.
Remark: I have used a small trick. As you can see, I have declared the variable a as an array of pointers to char, with its size taken from the initializer. Each element points to a string literal, so a can be safely indexed as an array of exactly 4 pointers to strings.
That is why the swap works in the double-for algorithm: the memory addresses are swapped instead of the entire strings.
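To illustrate the point about only the comparison function changing, here is a sketch using the standard library's qsort for both cases. The wrapper names sort_ints and sort_names are made up for this example; the sorting call is identical and only the comparator differs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Comparator for int elements. */
static int by_value(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Comparator for char * elements: qsort hands us pointers to the pointers. */
static int by_name(const void *x, const void *y)
{
    return strcmp(*(const char *const *)x, *(const char *const *)y);
}

void sort_ints(int *v, size_t n)
{
    qsort(v, n, sizeof v[0], by_value);
}

void sort_names(const char *names[], size_t n)
{
    qsort(names, n, sizeof names[0], by_name);
}
```

Note that by_name must dereference its arguments once before calling strcmp; passing strcmp directly to qsort for an array of char * is a common mistake.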
Steps you likely should take:
Populate array with state names
Create method to swap two states in place in the array
At this point you have all the tools necessary to use strcmp to implement any sorting algorithm you choose
Most sorting methods rely on two things.
Being able to rearrange a list (i.e. swap)
Being able to compare items in list to see if they should be swapped
I would work on getting those two things working correctly and the rest should just be learning a particular sorting algorithm
Beware of one annoying pitfall: strcmp compares strings by the numeric values of their characters (ASCII on most systems), so if you sort alphabetically like this, any capital letter will come before any lowercase letter, e.g. "alpha", "beta", "gamma", "Theta" will be sorted as:
Theta, alpha, beta, gamma
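If case should be ignored, one option is a comparison function that lowercases each character before comparing. The sketch below is a hand-written helper (cmp_ignore_case is a made-up name, not a standard function) and it assumes plain single-byte characters.

```c
#include <ctype.h>

/* Compares two strings while ignoring letter case. */
int cmp_ignore_case(const char *s1, const char *s2)
{
    /* Advance while both characters match after lowercasing. */
    while (*s1 && tolower((unsigned char)*s1) == tolower((unsigned char)*s2)) {
        s1++;
        s2++;
    }
    return tolower((unsigned char)*s1) - tolower((unsigned char)*s2);
}
```

Using this in place of strcmp() puts "Theta" after "gamma", since 't' comes after 'g' once case is ignored.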
When it comes to the sample array you have listed here, the simple algorithm mentioned earlier might actually be the most efficient. The algorithm I'm referring to is the one where you start with the first element, compare it to the others and substitute it with the smallest you find, then go to the next element and do the same, only without comparing it to the already sorted elements.
This algorithm has an execution time of O(n^2), where n is the number of elements in the array, but for smaller arrays it will usually be faster than something like quicksort (execution time O(n*log(n))). The reason is that quicksort has more overhead. For larger arrays quicksort will serve you better than the other method, which if memory serves me right is called substitution sort (it is more commonly known as selection sort), although I see it mentioned as "double for" in a different answer.

C - Returning the most repeated/occurring string in an array of char pointers

I have almost completed the code for this problem, which I shall state as under:
Given:
Array of length 'n' (say n = 10000) declared as below,
char **records = malloc(10000*sizeof(*records));
Each record[i] is a char pointer and points to a non-empty string.
records[i] = malloc(11);
The strings are of fixed length (10 chars + '\0').
Requirement:
Return the most frequently occurring string in the above array.
But now, I am interested in obtaining a slightly less brutal algorithm than the primitive one which I have currently, which is to sift through the entire array in two for loops :(, storing strings encountered by the two loops in a temporary array of similar size ('n' - in case all are unique strings) for comparison with the next strings. The inner loop iterates from 'outer loop position + 1' to 'n'. At the same time, I have an integer array, of similar size - 'n', for counting repeat occurrences, with each i th element corresponding to the i th (unique) string in the comparison array. Then find the largest integer and use its index in the comparison array to return the most frequently occurring string.
I hope I am clear enough. I am quite ashamed of the algo myself, but it had to be done. I am sure there is a much smarter way to do this in C.
Have a great Sunday,
Cheers!
Without being good at nice algorithms (Google, Wikipedia and Stackoverflow are good enough for me), one solution that comes out at the top of my head is to sort the array, then use a single loop to go through the entries. As long as the current string is the same as the previous, increase a counter for that string. When done you have a "list" of strings and their occurrence, which can then be sorted if needed.
In most languages, the usual approach would be to construct a hashtable, mapping strings to counts. This has O(N) complexity.
For example, in Python (although usually you would use collections.Counter for this, and even this code can be made more concise using more specialised Python knowledge, but I've made it explicit for demonstration).
def most_common(strings):
counts = {}
for s in strings:
if s not in counts:
counts[s] = 0
counts[s] += 1
return max(counts, key=counts.get)
But in C, you don't have a hashtable in the standard library (although in C++ you can use std::unordered_map), so a sort and scan can be done instead. It's O(N.log(N)) complexity, which is worse than optimal, but quite practical.
Here's some C (actually C99) code that implements this.
#include <stdlib.h>
#include <string.h>
/* qsort passes pointers to the array elements, so each argument is really a const char **. */
int compare_strings(const void*s0, const void*s1) {
    return strcmp(*(const char *const *)s0, *(const char *const *)s1);
}
const char *most_common(const char **records, size_t n) {
    qsort(records, n, sizeof(records[0]), compare_strings);
    const char *best = 0; // The most common string found so far.
    size_t max = 0; // The longest run found.
    size_t run = 0; // The length of the current run.
    for (size_t i = 0; i < n; i++) {
        // records[i - run] is the first string of the current run.
        if (!strcmp(records[i], records[i - run])) {
            run += 1;
        } else {
            run = 1;
        }
        if (run > max) {
            best = records[i];
            max = run;
        }
    }
    return best;
}

What is the bug in this code?

Based on the logic given in an answer on SO to a different (but similar) question, for removing repeated numbers from an array in O(N) time, I implemented that logic in C as shown below. But my code does not return only unique numbers. I tried debugging but could not work out what is wrong with the logic.
int remove_repeat(int *a, int n)
{
int i, k;
k = 0;
for (i = 1; i < n; i++)
{
if (a[k] != a[i])
{
a[k+1] = a[i];
k++;
}
}
return (k+1);
}
main()
{
int a[] = {1, 4, 1, 2, 3, 3, 3, 1, 5};
int n;
int i;
n = remove_repeat(a, 9);
for (i = 0; i < n; i++)
printf("a[%d] = %d\n", i, a[i]);
}
1] What is incorrect in above code to remove duplicates.
2] Any other O(N) or O(NlogN) solution for this problem. Its logic?
Heap sort in O(n log n) time.
Iterate through in O(n) time replacing repeating elements with a sentinel value (such as INT_MAX).
Heap sort again in O(n log n) to distil out the repeating elements.
Still bounded by O(n log n).
Your code only checks whether an item in the array is the same as its immediate predecessor.
If your array starts out sorted, that will work, because all instances of a particular number will be contiguous.
If your array isn't sorted to start with, that won't work because instances of a particular number may not be contiguous, so you have to look through all the preceding numbers to determine whether one has been seen yet.
To do the job in O(N log N) time, you can sort the array, then use the logic you already have to remove duplicates from the sorted array. Obviously enough, this is only useful if you're all right with rearranging the numbers.
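A sketch of that sort-then-scan approach, reusing the adjacent-duplicate pass from the question after a call to the standard qsort. The wrapper name sort_and_remove_repeat is made up here, and a guard for empty input has been added.

```c
#include <stddef.h>
#include <stdlib.h>

static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Removes adjacent repeats; correct only when equal values are contiguous. */
int remove_repeat(int *a, int n)
{
    int i, k = 0;
    if (n <= 0)
        return 0;
    for (i = 1; i < n; i++) {
        if (a[k] != a[i])
            a[++k] = a[i];
    }
    return k + 1;
}

/* Sorting first makes all equal values contiguous. */
int sort_and_remove_repeat(int *a, int n)
{
    if (n <= 0)
        return 0;
    qsort(a, (size_t)n, sizeof *a, cmp_int);
    return remove_repeat(a, n);
}
```

For the array in the question this leaves 1, 2, 3, 4, 5 at the front and returns 5.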
If you want to retain the original order, you can use something like a hash table or bit set to track whether a number has been seen yet or not, and only copy each number to the output when/if it has not yet been seen. To do this, we change your current:
if (a[k] != a[i])
a[k+1] = a[i];
to something like:
if (!hash_find(hash_table, a[i])) {
hash_insert(hash_table, a[i]);
a[k+1] = a[i];
}
If your numbers all fall within fairly narrow bounds or you expect the values to be dense (i.e., most values are present) you might want to use a bit-set instead of a hash table. This would be just an array of bits, set to zero or one to indicate whether a particular number has been seen yet.
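A minimal sketch of the bit-set variant, assuming every input value lies in the range 0 to MAX_VALUE - 1. MAX_VALUE and the function name are assumptions chosen for this example; values outside that range would index past the end of the bit set.

```c
#include <limits.h>

#define MAX_VALUE 1024   /* assumed exclusive upper bound on the input values */

/*
 * Removes repeated values in place, keeping the first occurrence of each
 * and preserving order. Returns the new length.
 */
int remove_repeat_bitset(int *a, int n)
{
    unsigned char seen[(MAX_VALUE + CHAR_BIT - 1) / CHAR_BIT] = {0};
    int k = 0;

    for (int i = 0; i < n; i++) {
        unsigned v = (unsigned)a[i];
        unsigned char mask = (unsigned char)(1u << (v % CHAR_BIT));

        if (!(seen[v / CHAR_BIT] & mask)) {
            seen[v / CHAR_BIT] |= mask;   /* mark the value as seen */
            a[k++] = a[i];                /* keep its first occurrence */
        }
    }
    return k;
}
```

Each value is looked up and marked in constant time, so the whole pass is O(N) and the original order is preserved.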
On the other hand, if you're more concerned with the upper bound on complexity than the average case, you could use a balanced tree-based collection instead of a hash table. This will typically use more memory and run more slowly, but its expected complexity and worst case complexity are essentially identical (O(N log N)). A typical hash table degenerates from constant complexity to linear complexity in the worst case, which will change your overall complexity from O(N) to O(N^2).
Your code would appear to require that the input is sorted. With unsorted inputs as you are testing with, your code will not remove all duplicates (only adjacent ones).
You can get an O(N) solution if the range of the integers is known up front and is smaller than the amount of memory you have :). Make one pass to determine the unique integers you have using auxiliary storage, then another to output the unique values.
Code below is in Java, but hopefully you get the idea.
int[] removeRepeats(int[] a) {
// Assume the integers are in the range 0..999
Boolean[] v = new Boolean[1000]; // A lazy way of getting a tri-state var (false, true, null)
for (int i=0;i<a.length;++i) {
v[a[i]] = Boolean.TRUE;
}
// v[i] = null => number not seen
// v[i] = true => number seen
int[] out = new int[a.length];
int ptr = 0;
for (int i=0;i<a.length;++i) {
if (v[a[i]] != null && v[a[i]].equals(Boolean.TRUE)) {
out[ptr++] = a[i];
v[a[i]] = Boolean.FALSE;
}
}
// Out now doesn't contain duplicates, order is preserved and ptr represents how
// many elements are set.
return out;
}
You are going to need two loops, one to go through the source and one to check each item in the destination array.
You are not going to get O(N).
[EDIT]
The article you linked to suggests a sorted output array which means the search for duplicates in the output array can be a binary search...which is O(LogN).
Your logic is just wrong, so the code is wrong too. Work through your logic by yourself before coding it.
I suggest a O(NlnN) way with a modification of heapsort.
With heapsort, we join from a[i] to a[n], find the minimum and replace it with a[i], right?
So now is the modification, if the minimum is the same with a[i-1] then swap minimum and a[n], reduce your array item's number by 1.
It should do the trick in O(NlnN) way.
Your code will work only in particular cases. You're only checking adjacent values, but duplicate values can occur anywhere in the array. Hence, it's wrong for unsorted input.

How do I remove duplicate strings from an array in C?

I have an array of strings in C and an integer indicating how many strings are in the array.
char *strarray[MAX];
int strcount;
In this array, the highest index (where 10 is higher than 0) is the most recent item added and the lowest index is the most distant item added. The order of items within the array matters.
I need a quick way to check the array for duplicates, remove all but the highest index duplicate, and collapse the array.
For example:
strarray[0] = "Line 1";
strarray[1] = "Line 2";
strarray[2] = "Line 3";
strarray[3] = "Line 2";
strarray[4] = "Line 4";
would become:
strarray[0] = "Line 1";
strarray[1] = "Line 3";
strarray[2] = "Line 2";
strarray[3] = "Line 4";
Index 1 of the original array was removed and indexes 2, 3, and 4 slid downwards to fill the gap.
I have one idea of how to do it. It is untested and I am currently attempting to code it but just from my faint understanding, I am sure this is a horrendous algorithm.
The algorithm presented below would be run every time a new string is added to the strarray.
For the interest of showing that I am trying, I will include my proposed algorithm below:
Search entire strarray for match to str
If no match, do nothing
If match found, put str in strarray
Now we have a strarray with a max of 1 duplicate entry
Add highest index strarray string to lowest index of temporary string array
Continue downwards into strarray and check each element
If duplicate found, skip it
If not, add it to the next highest index of the temporary string array
Reverse temporary string array and copy to strarray
Once again, this is untested (I am currently implementing it now). I just hope someone out there will have a much better solution.
The order of items is important and the code must utilize the C language (not C++). The lowest index duplicates should be removed and the single highest index kept.
Thank you!
The typical efficient unique function is to:
Sort the given array.
Collapse each consecutive run of equal items so that only one remains.
I believe you can use qsort in combination with strcmp to accomplish the first part; writing an efficient remove would be all on you though.
Unfortunately I don't have specific ideas here; this is kind of a grey area for me because I'm usually using C++, where this would be a simple:
std::vector<std::string> src;
std::sort(src.begin(), src.end());
src.erase(std::unique(src.begin(), src.end()), src.end());
I know you can't use C++, but the implementation should essentially be the same.
Because you need to save the original order, you can have something like:
typedef struct
{
int originalPosition;
char * string;
} tempUniqueEntry;
Do your first sort with respect to string (breaking ties on originalPosition, since qsort is not guaranteed to be stable), remove the duplicates from each run of equal strings in the sorted set, then re-sort with respect to originalPosition. This way you still get O(n lg n) performance, yet you don't lose the original order.
EDIT2:
Simple C implementation example of std::unique:
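tempUniqueEntry* unique ( tempUniqueEntry * first, tempUniqueEntry * last )
{
    /* An empty range has nothing to collapse. */
    if (first == last)
        return last;
    tempUniqueEntry *result=first;
    while (++first != last)
    {
        /* Keep an entry only when its string differs from the last kept one. */
        if (strcmp(result->string,first->string))
            *(++result)=*first;
    }
    return ++result;
}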
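The whole pipeline might be sketched as below. To keep the highest-index duplicate, as the question requires, the first sort breaks ties by descending originalPosition so that the first entry of each run is the one to keep. The comparator names and dedup_keep_last are assumptions made for this example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int originalPosition;
    char *string;
} tempUniqueEntry;

/* Orders by string; among equal strings the highest position comes first. */
static int by_string(const void *a, const void *b)
{
    const tempUniqueEntry *x = a, *y = b;
    int r = strcmp(x->string, y->string);
    if (r != 0)
        return r;
    return (y->originalPosition > x->originalPosition)
         - (y->originalPosition < x->originalPosition);
}

/* Orders by ascending original position. */
static int by_position(const void *a, const void *b)
{
    const tempUniqueEntry *x = a, *y = b;
    return (x->originalPosition > y->originalPosition)
         - (x->originalPosition < y->originalPosition);
}

/* Returns the new count, or strcount unchanged if allocation fails. */
int dedup_keep_last(char *strarray[], int strcount)
{
    if (strcount <= 0)
        return 0;
    tempUniqueEntry *tmp = malloc((size_t)strcount * sizeof *tmp);
    if (tmp == NULL)
        return strcount;

    for (int i = 0; i < strcount; i++) {
        tmp[i].originalPosition = i;
        tmp[i].string = strarray[i];
    }
    qsort(tmp, (size_t)strcount, sizeof *tmp, by_string);

    /* Keep the first entry of each run of equal strings. */
    int kept = 1;
    for (int i = 1; i < strcount; i++) {
        if (strcmp(tmp[kept - 1].string, tmp[i].string) != 0)
            tmp[kept++] = tmp[i];
    }

    /* Restore the original relative order of the survivors. */
    qsort(tmp, (size_t)kept, sizeof *tmp, by_position);
    for (int i = 0; i < kept; i++)
        strarray[i] = tmp[i].string;

    free(tmp);
    return kept;
}
```

On the example from the question this produces "Line 1", "Line 3", "Line 2", "Line 4" and returns 4.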
I don't quite understand your proposed algorithm (I don't understand what it means to add a string to an index in step 5), but what I would do is:
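unsigned int i;
/* n is the number of strings in strarray (strcount in the question). */
for (i = n; i > 0; i--)
{
    unsigned int j;
    if (strarray[i - 1] == NULL)
    {
        continue;
    }
    for (j = i - 1; j > 0; j--)
    {
        /* Skip entries already cleared as duplicates; strcmp must not be given NULL. */
        if (strarray[j - 1] != NULL &&
            strcmp(strarray[i - 1], strarray[j - 1]) == 0)
        {
            strarray[j - 1] = NULL;
        }
    }
}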
Then you just need to filter the null pointers out of your array (which I'll leave as an exercise).
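For completeness, the filtering step could look like this sketch (compact_nulls is a made-up name). It slides the surviving pointers down in order and clears the unused tail.

```c
#include <stddef.h>

/* Removes NULL entries in place, preserving order. Returns the new count. */
size_t compact_nulls(char *strarray[], size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (strarray[i] != NULL)
            strarray[out++] = strarray[i];
    }
    /* Clear the tail so no stale pointers remain past the new end. */
    for (size_t i = out; i < n; i++)
        strarray[i] = NULL;
    return out;
}
```

The return value would then be assigned back to strcount.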
A different approach would be to iterate backwards over the array and to insert each item into a (balanced) binary search tree as you go. If the item is already in the binary search tree, flag the array item (such as setting the array element to NULL) and move on. When you've processed the entire array, filter out the flagged elements as before. This would have slightly more overhead and would consume more space, but its running time would be O(n log n) instead of O(n^2).
Can you control the input as it is going into the array? If so, just do something like this:
#include <stdlib.h>
#include <string.h>
int addToArray(const char * toadd, char * strarray[], int strcount)
{
    const size_t toaddlen = strlen(toadd);
    // Add new string to end.
    // Remember to add one for the \0 terminator.
    strarray[strcount] = malloc(sizeof(char) * (toaddlen + 1));
    if (strarray[strcount] == NULL)
        return strcount; // Out of memory: nothing was added.
    strncpy(strarray[strcount], toadd, toaddlen + 1);
    // Search for a duplicate.
    // Note that we are cutting the new array short by one.
    for(int i = 0; i < strcount; ++i)
    {
        if (strcmp(strarray[i], toadd) == 0)
        {
            // Found duplicate.
            // Remove it and compact.
            // Note use of new array size here.
            free(strarray[i]);
            // Slide every later entry down by one slot.
            for(int k = i + 1; k < strcount + 1; ++k)
                strarray[k - 1] = strarray[k];
            strarray[strcount] = NULL;
            return strcount;
        }
    }
    // No duplicate found.
    return (strcount + 1);
}
You can always use the above function looping over the elements of an existing array, building a new array without duplicates.
PS: If you are doing this type of operation a lot, you should move away from an array as your storage structure and use a linked list instead. Linked lists are much more efficient for removing elements from a location other than the end.
Sort the array with a function like qsort (run man 3 qsort in the terminal to see how it should be used) and then use strcmp to compare neighbouring strings and find duplicates.
If you want to maintain the original order, you could use an O(N^2) algorithm with two nested for loops: the outer loop picks each element in turn, and the inner loop scans the rest of the array to find out whether the chosen element has a duplicate.
