Fastest ways to find duplicates - c

So my interviewer showed me the following code:
struct test {
    uint8_t inuse;
    int32_t val;
};
#define MAX_LIST_SIZE 100
struct test list[MAX_LIST_SIZE];

int checkAndAdd(int val) {
    for (int i = 0; i < MAX_LIST_SIZE; i++) {
        if (list[i].inuse && list[i].val == val)
            return DUPLICATE;
    }
    for (int i = 0; i < MAX_LIST_SIZE; i++) {
        if (!list[i].inuse) {
            list[i].inuse = 1;
            list[i].val = val;
            return ADDED;
        }
    }
    return EA_FAIL;
}
and asked me the following questions.
How to make that function faster?
What are the other fastest methods to find duplicates in an array?
My answers were:
1.
int checkAndAdd(int val) {
    int32_t addedIndex = -1;
    for (int i = 0; i < MAX_LIST_SIZE; i++) {
        if (list[i].inuse && list[i].val == val) {
            if (addedIndex != -1) {
                list[addedIndex].inuse = 0;
                list[addedIndex].val = 0;
            }
            return DUPLICATE;
        } else if (!list[i].inuse && (addedIndex == -1)) {
            list[i].inuse = 1;
            list[i].val = val;
            addedIndex = i;
        }
    }
    if (addedIndex != -1)
        return ADDED;
    return EA_FAIL;
}
2. You can't have a faster duplicate check than O(n).
Were my answers correct? Please suggest any other good approaches, and an answer to question 2. Thanks.

The fastest, most general way to find duplicates is with a hash table. That gives essentially constant-time access (with just a little added overhead to handle hash collisions).
If the set of integers is sufficiently dense, you could alternatively use an array indexed by value, but this is only practical if the percentage of holes is sufficiently small.
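For the question's checkAndAdd, a minimal open-addressing sketch along those lines might look like this (my sketch, not the interviewer's code; the DUPLICATE/ADDED/EA_FAIL values are placeholders because their real definitions aren't shown, and the table size just needs to be a power of two comfortably larger than the 100-entry capacity):
#include <stdbool.h>
#include <stdint.h>

/* Placeholder return codes; the question's actual definitions are not shown. */
#define DUPLICATE  1
#define ADDED      0
#define EA_FAIL   -1

#define MAX_ENTRIES 100   /* same capacity as MAX_LIST_SIZE in the question */
#define TABLE_SIZE  256   /* power of two, > 2 * MAX_ENTRIES, keeps probe chains short */

static struct { bool inuse; int32_t val; } table[TABLE_SIZE];
static int entry_count;

int checkAndAdd(int32_t val)
{
    /* cheap multiplicative mixing, then mask down to a slot index */
    uint32_t slot = ((uint32_t)val * 2654435761u) & (TABLE_SIZE - 1);

    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        uint32_t i = (slot + (uint32_t)probes) & (TABLE_SIZE - 1);
        if (!table[i].inuse) {              /* empty slot: val is not present */
            if (entry_count >= MAX_ENTRIES)
                return EA_FAIL;
            table[i].inuse = true;
            table[i].val = val;
            entry_count++;
            return ADDED;
        }
        if (table[i].val == val)            /* occupied by the same value */
            return DUPLICATE;
    }
    return EA_FAIL;                          /* table full (cannot happen at this load factor) */
}
Because the table never gets more than about 40% full, the expected cost per call is O(1); deletions would need tombstones, but the question's code never deletes.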

It's somewhat dependent on data patterns, but on sorted, roughly uniformly distributed data, interpolation search runs in O(log log n) on average.
Using binary search on sorted data gives you O(log n), and for n <= 100 that's a maximum of 7 search steps, which makes interpolation search probably not worthwhile here.
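For illustration, a binary-search duplicate check might look like the sketch below; it assumes (beyond anything stated in the question) that the in-use values are additionally kept in a sorted array sorted_vals of length count, maintained on every insert:
#include <stdbool.h>
#include <stdint.h>

/* Returns true if val is already present in sorted_vals[0..count-1]. */
static bool already_present(const int32_t *sorted_vals, int count, int32_t val)
{
    int lo = 0, hi = count - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted_vals[mid] == val)
            return true;
        if (sorted_vals[mid] < val)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}
Keeping the array sorted makes each insert O(n) because of the shifting, so this only pays off when lookups greatly outnumber insertions.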
Edit to add side note: it's curious that the struct has int32_t but the argument to the function is plain int. Probably not broken (are there any ILP64 systems out there?), but seems a bit sloppy.

If you need to be able to insert, delete, and avoid duplicates quickly, what you want is a set, probably implemented with a hash table in which the key and the value both refer to the same data.
Hash tables inherently cannot have duplicates. They're on average O(1) for inserts, deletions, and lookups, and O(n) in space. The only downside is that there is no inherent order to the values. Since your original code does not appear to preserve order, that would be fine.

If I gave you that as an interview question, I might want to discuss hashes and so on, and that would be a good sign, since hashes are fundamental data structures, but what I'd really be looking for is whether you could merge the two loops, and you did that. Your second answer is correct for lists, i.e. O(n), but incorrect in general, because the bound depends on the data structure: if you use a hash, the check is O(1).
Please note that O(1) can quite often turn out to be slower than O(n) in real life: by the time you've hashed the key, done the lookup, and traversed the hash bucket's collision list, a plain linked list of five items might already have given you the one you were looking for.

Related

Can someone suggest a better algorithm than this to check if there is at least one duplicate value in an array?

An unsorted integer array nums and its size numsSize are given as arguments to the function containsDuplicate, and we have to return the boolean value true if at least one duplicate value is present, otherwise false.
For this task I chose to compare every element with all the elements after it, up to the second-to-last element; if any two are equal I return true, otherwise false.
bool containsDuplicate(int* nums, int numsSize){
    for(int i = 0; i < numsSize-1; i++)
    {
        for(int j = i+1; j < numsSize; j++)
        {
            if(nums[i] == nums[j])
            {
                return true;
            }
        }
    }
    return false;
}
To minimize run time, I return as soon as a duplicate is found, but my code still performs poorly on large arrays. I'm hoping for an algorithm with O(n) time complexity if possible. Also, is there any way to skip values that are duplicates of previously examined values?
I've seen all other solutions, but I couldn't find a better solution in C.
Your algorithm is O(n^2). But if you sort first, which can be done in less than O(n^2), then determining if there is a duplicate in the array is O(n).
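A sketch of that sort-then-scan idea, using the standard library's qsort (note that it reorders the caller's array):
#include <stdbool.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);   /* avoids overflow of x - y */
}

bool containsDuplicate(int *nums, int numsSize)
{
    qsort(nums, (size_t)numsSize, sizeof *nums, cmp_int);
    for (int i = 1; i < numsSize; i++)
        if (nums[i] == nums[i - 1])   /* after sorting, duplicates are adjacent */
            return true;
    return false;
}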
You could maintain a lookup table to determine whether each value has been previously seen, which would run in O(n) time, but unless the potential range of values stored in the array is relatively small, this has prohibitive memory usage.
For instance, if you know the values in the array will range from 0 to 127:
int contains_dupes(int *arr, size_t n) {
    char seen[128] = {0};            /* one flag per possible value */
    for (size_t i = 0; i < n; i++) {
        if (seen[arr[i]]) return 1;  /* already seen: duplicate */
        seen[arr[i]] = 1;
    }
    return 0;                        /* no duplicates */
}
But if we assume int is 4 bytes, and the values in the array can be any int, and we use char for our lookup table, then your lookup table would have to be 4GB in size.
O(n) time, O(n) space: use a set or map. Parse your array, checking each element in turn for membership in your set or map. If it's present then you've found a duplicate; if not, then add it.
If O(n) space is too expensive, you can get away with far less by doing a first pass using a cuckoo hash, which is a space efficient data structure that guarantees no false negatives, but can have false positives. Use the same approach as above but with the cuckoo hash instead of a set or map. Any duplicates you detect may be false positives, so will need to be checked.
Then, parse the array a second time, using the approach described in the first paragraph, but skip past anything that isn't in your set of candidates.
This is still O(n) time.
https://en.wikipedia.org/wiki/Cuckoo_hashing

How to remove certain elements from an array using a conditional test in C?

I am writing a program that goes through an array of ints and calculates stdev to identify outliers in the data. From here, I would like to create a new array with the identified outliers removed in order to recalculate the avg and stdev. Is there a way that I can do this?
There is a pretty simple solution to the problem that involves switching your mindset in the if statement (which isn't actually in a for loop it seems... might want to fix that).
float dataMinusOutliers[n];
int indexTracker = 0;
for (int i = 0; i < n; i++) {
    if (data[i] >= (-2*stdevfinal) && data[i] <= (2*stdevfinal)) {
        dataMinusOutliers[indexTracker] = data[i];
        indexTracker += 1;
    }
}
Note that this isn't particularly scalable, and the dataMinusOutliers array is potentially going to have quite a few unused slots at the end. You can always use indexTracker to record how many values were actually kept, though, and create yet another array into which you copy just those values from dataMinusOutliers (a two-pass version along those lines is sketched below). Is there likely a more elegant solution? Yes. Does this work given your requirements, though? Yup.
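A hedged sketch of that two-pass idea, assuming outliers are defined as values more than two standard deviations from the mean, with avg and stdevfinal already computed by the caller (the names are taken from or modeled on the question):
#include <math.h>
#include <stdlib.h>

/* Returns a tightly-sized, malloc'd copy of data with outliers removed;
   the number of kept values is written to *out_n. */
float *remove_outliers(const float *data, int n, float avg, float stdevfinal, int *out_n)
{
    int kept = 0;
    for (int i = 0; i < n; i++)                        /* first pass: count survivors */
        if (fabsf(data[i] - avg) <= 2.0f * stdevfinal)
            kept++;

    float *result = malloc((size_t)kept * sizeof *result);
    if (result == NULL) {
        *out_n = 0;
        return NULL;
    }

    int j = 0;
    for (int i = 0; i < n; i++)                        /* second pass: copy survivors */
        if (fabsf(data[i] - avg) <= 2.0f * stdevfinal)
            result[j++] = data[i];

    *out_n = kept;
    return result;
}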

Solving Lights out for AI Course

So I was given the following task: given that all lights in a 5x5 version of the game Lights Out are turned on, write an algorithm using UCS / A* / BFS / greedy best-first search that finds a solution.
What I did first was realize that UCS would be unnecessary, as the cost of moving from one state to another is 1 (pressing a button flips it and its neighbours). So I wrote BFS instead. It turned out that it runs too long and fills up the queue, even though I was careful to delete parent nodes once I was finished with them so as not to exhaust memory. It would run for around 5-6 minutes and then crash because of memory.
Next, I wrote DFS (even though it was not mentioned as one of the possibilities), and it did find a solution in 123 seconds, at depth 15 (I used depth-limited search because I knew there was a solution at depth 15).
What I am wondering now is: am I missing something? Is there a good heuristic for solving this problem with A* search? I came up with absolutely nothing in the way of heuristics, because it doesn't seem at all trivial to find one for this problem.
Thanks very much. Looking forward to some help from you guys
Here is my source code (I think it's pretty straightforward to follow):
#include <cstdio>
#include <cstdlib>
#include <queue>
using std::queue;

struct state
{
    bool board[25];
    bool clicked[25];
    int cost;
    int h;
    struct state* from;
};

int visited[1 << 25];
int dx[3] = {0, 5, -5};
int MAX_DEPTH = 1 << 30;
bool found = false;

struct state* MakeStartState()
{
    struct state* noviCvor = new struct state();
    for(int i = 0; i < 25; i++) noviCvor->board[i] = false, noviCvor->clicked[i] = false;
    noviCvor->cost = 0;
    //h=...
    noviCvor->from = NULL;
    return noviCvor;
}

struct state* MakeNextState(struct state* temp, int press_pos)
{
    struct state* noviCvor = new struct state();
    for(int i = 0; i < 25; i++) noviCvor->board[i] = temp->board[i], noviCvor->clicked[i] = temp->clicked[i];
    noviCvor->clicked[press_pos] = true;
    noviCvor->cost = temp->cost + 1;
    //h=...
    noviCvor->from = temp;
    int temp_pos;
    for(int k = 0; k < 3; k++)
    {
        temp_pos = press_pos + dx[k];
        if(temp_pos >= 0 && temp_pos < 25)
        {
            noviCvor->board[temp_pos] = !noviCvor->board[temp_pos];
        }
    }
    if( ((press_pos+1) % 5 != 0) && (press_pos+1) < 25 )
        noviCvor->board[press_pos+1] = !noviCvor->board[press_pos+1];
    if( (press_pos % 5 != 0) && (press_pos-1) >= 0 )
        noviCvor->board[press_pos-1] = !noviCvor->board[press_pos-1];
    return noviCvor;
}

bool CheckFinalState(struct state* temp)
{
    for(int i = 0; i < 25; i++)
    {
        if(!temp->board[i]) return false;
    }
    return true;
}

int bijection_mapping(struct state* temp)
{
    int temp_pow = 1;
    int mapping = 0;
    for(int i = 0; i < 25; i++)
    {
        if(temp->board[i])
            mapping += temp_pow;
        temp_pow *= 2;
    }
    return mapping;
}

void BFS()
{
    queue<struct state*> Q;
    struct state* start = MakeStartState();
    Q.push(start);
    struct state* temp;
    visited[ bijection_mapping(start) ] = 1;
    while(!Q.empty())
    {
        temp = Q.front();
        Q.pop();
        visited[ bijection_mapping(temp) ] = 2;
        for(int i = 0; i < 25; i++)
        {
            if(!temp->clicked[i])
            {
                struct state* next = MakeNextState(temp, i);
                int mapa = bijection_mapping(next);
                if(visited[ mapa ] == 0)
                {
                    if(CheckFinalState(next))
                    {
                        printf("NADJENO RESENJE\n"); // "solution found"
                        exit(0);
                    }
                    visited[ mapa ] = 1;
                    Q.push(next);
                }
            }
        }
        delete temp;
    }
}
PS: Since I am no longer using a map for visited states (I switched to an array), my DFS solution improved from 123 secs to 54 secs, but BFS still crashes.
First of all, you may already recognize that in Lights Out you never have to flip the same switch more than once, and it doesn't matter in which order you flip the switches. You can thus describe the current state in two distinct ways: either in terms of which lights are on, or in terms of which switches have been flipped. The latter, together with the starting pattern of lights, gives you the former.
To employ a graph-search algorithm to solve the problem, you need a notion of adjacency. That follows more easily from the second characterization: two states are adjacent if there is exactly one switch about which they differ. That characterization also directly encodes the length of the path to each node (= the number of switches that have been flipped), and it reduces the number of subsequent moves that need to be considered for each state considered, since all possible paths to each node are encoded in the pattern of switches.
You could use that in a breadth-first search relatively easily (and this may be what you in fact tried). BFS is equivalent to Dijkstra's algorithm in that case, even without using an explicit priority queue, because you enqueue new nodes to explore in priority (path-length) order.
You can also convert that to an A* search with addition of a suitable heuristic. For example, since each move turns off at most five lights, one could take as the heuristic the number of lights still on after each move, divided by 5. Though that's a bit crude, I'm inclined to think that it would be of some help. You do need a real priority queue for that alternative, however.
As far as implementation goes, do recognize that you can represent both the pattern of lights currently on and the pattern of switches that have been pressed as bit vectors. Each pattern fits in a 32-bit integer, and a list of visited states requires 2^25 bits, which is well within the capacity of modern computing systems. Even if you use that many bytes, instead, you ought to be able to handle it. Moreover, you can perform all needed operations using bitwise arithmetic operators, especially XOR. Thus, this problem (at its given size) ought to be computable relatively quickly.
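For concreteness, here is a minimal sketch of that bit-vector idea (my assumptions, not the question's code: cells are numbered 0-24 row by row, and bit i of a uint32_t stands for cell i):
#include <stdint.h>

static uint32_t move_mask[25];   /* lights toggled by pressing each cell */

static void init_move_masks(void)
{
    for (int pos = 0; pos < 25; pos++) {
        int row = pos / 5, col = pos % 5;
        uint32_t m = 1u << pos;              /* the pressed cell itself */
        if (row > 0) m |= 1u << (pos - 5);   /* cell above */
        if (row < 4) m |= 1u << (pos + 5);   /* cell below */
        if (col > 0) m |= 1u << (pos - 1);   /* cell to the left */
        if (col < 4) m |= 1u << (pos + 1);   /* cell to the right */
        move_mask[pos] = m;
    }
}

/* Pressing switch pos toggles its whole neighbourhood with a single XOR. */
static uint32_t press(uint32_t lights, int pos)
{
    return lights ^ move_mask[pos];
}
Starting from all lights on, the state is 0x1FFFFFF, and the goal test is simply lights == 0 (or the reverse, depending on which convention you pick); the visited set is then one bit per possible light pattern.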
Update:
As I mentioned in comments, I decided to solve the problem for myself, with -- it seemed to me -- very good success. I used a variety of techniques to achieve good performance and minimize memory usage, and in this case, those mostly were complementary. Here are some of my tricks:
I represented each whole-system state with a single uint64_t. The top 32 bits contain a bitmask of which switches have been flipped, and the bottom 32 contain a bitmask of which lights are on as a result. I wrapped these in a struct along with a single pointer to link them together as elements of a queue. A given state can be tested as a solution with one bitwise-and operation and one integer comparison.
I created a pre-initialized array of 25 uint64_t bitmasks representing the effect of each move. One bit set among the top 32 of each represents the switch that is flipped, and between three and five bits set among the bottom 32 represent the lights that are toggled as a result. The effect of flipping one switch can then be computed simply as new_state = old_state ^ move[i].
I implemented plain breadth-first search instead of A*, in part because I was trying to put something together quickly, and in particular because that way I could use a regular queue instead of a priority queue.
I structured my BFS in a way that naturally avoided visiting the same state twice, without having to actually track which states had ever been enqueued. This was based on some insight into how to efficiently generate distinct bit patterns without repeating, with those having fewer bits set generated before those having more bits set. The latter criterion was satisfied fairly naturally by the queue-based approach required anyway for BFS.
I used a second (plain) queue to recycle dynamically-allocated queue nodes after they were removed from the main queue, to minimize the number of calls to malloc().
Overall code was a bit less than 200 lines, including blank and comment lines, data type declarations, I/O, queue implementation (plain C, no STL) -- everything.
Note, by the way, that the priority queue employed in standard Dijkstra and in A* is primarily about finding the right answer (shortest path), and only secondarily about doing so efficiently. Enqueueing and dequeueing from a standard queue can both be O(1), whereas those operations on a priority queue are O(log m) in the number of elements in the queue. A* and BFS both have worst-case queue size upper bounds of O(n) in the total number of states. Thus, BFS will scale better than A* with problem size; the only question is whether the former reliably gives you the right answer, which in this case, it does.

Expand hash table without rehash?

I am looking for a hash table data structure that does not require rehashing for expansion and shrinking.
Rehashing is a CPU-consuming effort. I was wondering if it is possible to design a hash table data structure in a way that does not require rehashing at all. Have you heard of such a data structure before?
That depends on what you call "rehash":
If you simply mean that the table-level rehash shouldn't reapply the hash function to each key during resizing, then that's easy with most libraries: e.g. wrap the key and its raw (pre-modulo-table-size) hash value together a la struct X { size_t hash_; Key key_; };, supply the hashtable library with a hash function that returns hash_, but a comparison function that compares the key_ members (depending on the complexity of key_ comparison, you may be able to use hash_ to optimise, e.g. lhs.hash_ == rhs.hash_ && lhs.key_ == rhs.key_).
This will help most if the hashing of keys was particularly time consuming (e.g. cryptographic strength on longish keys). For very simple hashing (e.g. passthrough of ints) it'll slow you down and waste memory.
If you mean the table-level operation of increasing or decreasing memory storage and reindexing all stored values, then yes - it can be avoided - but to do so you have to fundamentally change the way the hash table works, and the normal performance profile. Discussed below.
As just one example, you could leverage a more typical hashtable implementation (let's call it H) by having your custom hashtable (C) have an H** p that - up to an initial size limit - will have p[0] be the only instance of H, and simply ferry operations/results through. If the table grows beyond that, you keep p[0] referencing the existing H, while creating a second H hashtable to be tracked by p[1]. Then things start getting dicey:
to search or erase in C, your implementation needs to search p[1] then p[0] and report any match from either
to insert a new value in C, your implementation must confirm it's not in p[0], then insert to p[1]
with each insert (and potentially even for other operations), it could optionally migrate any matching entry - or an arbitrary p[0] entry - to p[1], so that p[0] gradually empties; you can easily guarantee p[0] will be empty before p[1] is so full that a larger table is needed. When p[0] is empty you may want to do p[0] = p[1]; p[1] = NULL; to keep the simple mental model of what's where - lots of options.
Some existing hash table implementations are very efficient at iterating over elements (e.g. GNU C++ std::unordered_set), as there's a singly linked list of all the values, and the hash table is really only a collection of pointers (in C++ parlance, iterators) into the linked list. This can mean that if your utilisation falls below some threshold (e.g. 10% load factor) for your only/larger hash table, you know you can very efficiently migrate the remaining elements to a smaller table.
These kind of tricks are used by some hash tables to avoid a sudden heavy cost during rehashing, and instead spread the pain more evenly over a number of subsequent operations, avoiding a possibly nasty spike in latency.
Some of the implementation options only make sense for either an open or a closed hashing implementation, or are only useful when the keys and/or values are small or large and depending on whether the table embeds them or points to them. Best way to learn about it is to code....
It depends what you want to avoid. Rehashing implies recomputing the hash values. You can avoid that by storing the hash values in the hash structures. Redispatching the entries into the reallocated hashtable may be less expensive (typically a single modulo or masking operation) and is hardly avoidable for simple hashtable implementations.
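As a minimal illustration of "store the hash value in the hash structures" (an assumption-laden sketch of mine, not any particular library's layout):
#include <stddef.h>
#include <stdint.h>

/* Each entry carries the full hash computed once at insert time, so a resize
   only has to recompute bucket indices, never the hash function itself. */
struct cached_entry {
    uint64_t    full_hash;   /* output of whatever hash function you use */
    const void *key;
    void       *value;
};

/* Redispatching after a resize is then one mask (or modulo) per entry. */
static size_t bucket_of(const struct cached_entry *e, size_t bucket_count)
{
    return (size_t)(e->full_hash & (bucket_count - 1));   /* bucket_count: power of two */
}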
Assuming you actually do need this: it is possible. Here I'll give a trivial example you can build on.
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Basic types we deal with
typedef uint32_t key_t;
typedef void * value_t;

typedef struct
{
    key_t key;
    value_t value;
} hash_table_entry_t;

typedef struct
{
    uint32_t initialSize;
    uint32_t size;  // current max entries
    uint32_t count; // current filled entries
    hash_table_entry_t *entries;
} hash_table_t;

// Hash function depends on the size of the table
key_t hash(value_t value, uint32_t size)
{
    // Simple hash function that just does modulo hash table size
    return *(key_t*)&value % size;
}

void init(hash_table_t *pTable, uint32_t initialSize)
{
    pTable->initialSize = initialSize;
    pTable->size = initialSize;
    pTable->count = 0;
    pTable->entries = malloc(pTable->size * sizeof(*pTable->entries));
    /// #todo handle null return;
    // Set to ~0 to signal invalid keys.
    memset(pTable->entries, ~0, pTable->size * sizeof(*pTable->entries));
}

void insert(hash_table_t *pTable, value_t val)
{
    key_t key = hash(val, pTable->size);
    for (key_t i = key; i != (key-1); i = (i+1) % pTable->size)
    {
        if (pTable->entries[i].key == ~0)
        {
            pTable->entries[i].key = key;
            pTable->entries[i].value = val;
            pTable->count++;
            break;
        }
    }
    // Expand when 50% full
    if (pTable->count > pTable->size/2)
    {
        pTable->size *= 2;
        pTable->entries = realloc(pTable->entries, pTable->size * sizeof(*pTable->entries));
        /// #todo handle null return;
        // Only the newly added half needs to be marked invalid.
        memset(pTable->entries + pTable->size/2, ~0, (pTable->size/2) * sizeof(*pTable->entries));
    }
}

_Bool contains(hash_table_t *pTable, value_t val)
{
    // Try current size first
    uint32_t sizeToTry = pTable->size;
    do
    {
        key_t key = hash(val, sizeToTry);
        for (key_t i = key; i != (key-1); i = (i+1) % pTable->size)
        {
            if (pTable->entries[i].key == ~0)
                break;
            if (pTable->entries[i].key == key && pTable->entries[i].value == val)
                return true;
        }
        // Try all previous sizes we had. Only report failure if found for none.
        sizeToTry /= 2;
    } while (sizeToTry >= pTable->initialSize);
    return false;
}
The idea is that the hash function depends on the size of the table. When you change the size of the table, you don't rehash current entries. You add new ones with the new hash function. When reading the entries, you try all the hash functions that have ever been used on this table.
This way, get()/contains() and similar operations take longer the more times you expanded your table, but you don't have the huge spike of rehashing. I can imagine some systems where this would be a requirement.

What is the bug in this code?

Based on this logic, given as an answer on SO to a different (similar) question about removing repeated numbers from an array in O(N) time complexity, I implemented the logic in C as shown below. But my code does not return only unique numbers. I tried debugging, but could not work out the logic well enough to fix it.
#include <stdio.h>

int remove_repeat(int *a, int n)
{
    int i, k;
    k = 0;
    for (i = 1; i < n; i++)
    {
        if (a[k] != a[i])
        {
            a[k+1] = a[i];
            k++;
        }
    }
    return (k+1);
}

int main(void)
{
    int a[] = {1, 4, 1, 2, 3, 3, 3, 1, 5};
    int n;
    int i;
    n = remove_repeat(a, 9);
    for (i = 0; i < n; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}
1] What is incorrect in the above code for removing duplicates?
2] Is there any other O(N) or O(N log N) solution for this problem, and what is its logic?
Heap sort in O(n log n) time.
Iterate through in O(n) time replacing repeating elements with a sentinel value (such as INT_MAX).
Heap sort again in O(n log n) to distil out the repeating elements.
Still bounded by O(n log n).
Your code only checks whether an item in the array is the same as its immediate predecessor.
If your array starts out sorted, that will work, because all instances of a particular number will be contiguous.
If your array isn't sorted to start with, that won't work because instances of a particular number may not be contiguous, so you have to look through all the preceding numbers to determine whether one has been seen yet.
To do the job in O(N log N) time, you can sort the array, then use the logic you already have to remove duplicates from the sorted array. Obviously enough, this is only useful if you're all right with rearranging the numbers.
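For instance, a small wrapper along those lines, reusing the remove_repeat function from the question above (the comparison helper is mine, not part of the original code):
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort first so equal values become adjacent; the existing adjacent-compare
   logic in remove_repeat is then valid. O(N log N) overall. */
int remove_repeat_unsorted(int *a, int n)
{
    qsort(a, (size_t)n, sizeof *a, cmp_int);
    return remove_repeat(a, n);
}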
If you want to retain the original order, you can use something like a hash table or bit set to track whether a number has been seen yet or not, and only copy each number to the output when/if it has not yet been seen. To do this, we change your current:
if (a[k] != a[i])
    a[k+1] = a[i];
to something like:
if (!hash_find(hash_table, a[i])) {
    hash_insert(hash_table, a[i]);
    a[k+1] = a[i];
}
If your numbers all fall within fairly narrow bounds or you expect the values to be dense (i.e., most values are present) you might want to use a bit-set instead of a hash table. This would be just an array of bits, set to zero or one to indicate whether a particular number has been seen yet.
On the other hand, if you're more concerned with the upper bound on complexity than the average case, you could use a balanced tree-based collection instead of a hash table. This will typically use more memory and run more slowly, but its expected complexity and worst case complexity are essentially identical (O(N log N)). A typical hash table degenerates from constant complexity to linear complexity in the worst case, which will change your overall complexity from O(N) to O(N^2).
Your code would appear to require that the input is sorted. With unsorted inputs as you are testing with, your code will not remove all duplicates (only adjacent ones).
You are able to get an O(N) solution if the range of the integers is known up front and small enough to fit in the amount of memory you have :). Make one pass to determine the unique integers you have using auxiliary storage, then another to output the unique values.
Code below is in Java, but hopefully you get the idea.
int[] removeRepeats(int[] a) {
    // Assume the values are integers between 0 and 999
    Boolean[] v = new Boolean[1000]; // A lazy way of getting a tri-state var (false, true, null)
    for (int i = 0; i < a.length; ++i) {
        v[a[i]] = Boolean.TRUE;
    }
    // v[i] = null => number not seen
    // v[i] = true => number seen
    int[] out = new int[a.length];
    int ptr = 0;
    for (int i = 0; i < a.length; ++i) {
        if (v[a[i]] != null && v[a[i]].equals(Boolean.TRUE)) {
            out[ptr++] = a[i];
            v[a[i]] = Boolean.FALSE;
        }
    }
    // out now doesn't contain duplicates, order is preserved, and ptr represents how
    // many elements are set.
    return out;
}
You are going to need two loops, one to go through the source and one to check each item in the destination array.
You are not going to get O(N).
[EDIT]
The article you linked to suggests a sorted output array, which means the search for duplicates in the output array can be a binary search... which is O(log N).
Your logic is just wrong, so the code is wrong too. Work the logic out by hand before coding it.
I suggest an O(N ln N) approach with a modification of heapsort.
With heapsort, at each step we look at a[i] through a[n], find the minimum, and place it at a[i], right?
Now for the modification: if that minimum is the same as a[i-1], swap the minimum with a[n] and reduce the array size by one, discarding the duplicate.
It should do the trick in O(N ln N).
Your code will work only in particular cases. You're only checking adjacent values, but duplicate values can occur anywhere in the array. Hence, it's wrong in general.
