The pseudo codes:
S = {};
Loop 10000 times:
u = unsorted_fixed_size_array_producer();
S = sort(S + u);
I need an efficient implementation of sort, which takes a sorted array and an unsorted one, then sort them all. But here we know after a few iterations, size(S) will be much bigger than size(u), that's a prior.
Update: There's another prior: the size of u is known, say 10 or 20, and the looping times is also known.
Update: I implemented the algorithm that #Dukelnig advised in C https://gist.github.com/blackball/bd7e5619a1e83bd985a3 which fits for my needs. Thanks!
Sort u, then merge S and u.
Merging simply involves iterating through two sorted arrays at the same time, and picking the smaller element and incrementing that iterator at each step.
The running time is O(|u| log |u| + |S|).
This is very similar to what merge sort does, so that it would result in a sorted array can be derived from there.
Some Java code for merge, derived from Wikipedia: (the C code wouldn't look all that different)
static void merge(int S[], int u[], int newS[])
{
int iS = 0, iu = 0;
for (int j = 0; j < S.length + u.length; j++)
if (iS < S.length && (iu >= u.length || S[iS] <= u[iu]))
newS[j] = S[iS++]; // Increment iS after using it as an index
else
newS[j] = u[iu++]; // Increment iu after using it as an index
}
This can also be done in-place (in S, assuming it has enough additional space) by going from the back.
Here's some working Java code that does this:
static void mergeInPlace(int S[], int SLength, int u[])
{
int iS = SLength-1, iu = u.length-1;
for (int j = SLength + u.length - 1; j >= 0; j--)
if (iS >= 0 && (iu < 0 || S[iS] >= u[iu]))
S[j] = S[iS--];
else
S[j] = u[iu--];
}
public static void main(String[] args)
{
int[] S = {1,5,9,13,22, 0,0,0,0}; // 4 additional spots reserved here
int[] u = {0,10,11,15};
mergeInPlace(S, 5, u);
// prints [0, 1, 5, 9, 10, 11, 13, 15, 22]
System.out.println(Arrays.toString(S));
}
To reduce the number of comparisons, we can also use binary search (although the time complexity would remain the same - this can be useful when comparisons are expensive).
// returns the first element in S before SLength greater than value,
// or returns SLength if no such element exists
static int binarySearch(int S[], int SLength, int value) { ... }
static void mergeInPlaceBinarySearch(int S[], int SLength, int u[])
{
int iS = SLength-1;
int iNew = SLength + u.length - 1;
for (int iu = u.length-1; iu >= 0; iu--)
{
if (iS >= 0)
{
int index = binarySearch(S, iS+1, u[iu]);
for ( ; iS >= index; iS--)
S[iNew--] = S[iS];
}
S[iNew--] = u[iu];
}
// assert (iS != iNew)
for ( ; iS >= 0; iS--)
S[iNew--] = S[iS];
}
If S doesn't have to be an array
The above assumes that S has to be an array. If it doesn't, something like a binary search tree might be better, depending on how large u and S are.
The running time would be O(|u| log |S|) - just substitute some values to see which is better.
If you really really have to use a literal array for S at all times, then the best approach would be to individually insert the new elements into the already sorted S. I.e. basically use the classic insertion sort technique for each element in each new batch. This will be expensive in a sense that insertion into an array is expensive (you have to move the elements), but that's the price of having to use an array for S.
So if the size of S is much more than the size of u, isn't what you want simply an efficient sort for a mostly sorted array? Traditionally this would be insertion sort. But you will only know the real answer by experimentation and measurement - try different algorithms and pick the best one. Without actually running your code (and perhaps more importantly, with your data), you cannot reliably predict performance, even with something as well studied as sorting algorithms.
Say we have a big sorted list of size n and a little sorted list of size k.
Binary search, starting from the end (position n-1, n-2, n-4, &c) for the insertion point for the largest element of the smaller list. Shift the tail end of the larger list k elements to the right, insert the largest element of the smaller list, then repeat.
So if we have the lists [1,2,4,5,6,8,9] and [3,7], we will do:
[1,2,4,5,6, , ,8,9]
[1,2,4,5,6, ,7,8,9]
[1,2, ,4,5,6,7,8,9]
[1,2,3,4,5,6,7,8,9]
But I would advise you to benchmark just concatenating the lists and sorting the whole thing before resorting to interesting merge procedures.
Related
Taken from the google interview question here
Suppose that you have a sorted array of integers (positive or negative). You want to apply a function of the form f(x) = a * x^2 + b * x + c to each element x of the array such that the resulting array is still sorted. Implement this in Java or C++. The input are the initial sorted array and the function parameters (a, b and c).
Do you think we can do it in-place with less than O(n log(n)) time where n is the array size (e.g. apply a function to each element of an array, after that sort the array)?
I think this can be done in linear time. Because the function is quadratic it will form a parabola, ie the values decrease (assuming a positive value for 'a') down to some minimum point and then after that will increase. So the algorithm should iterate over the sorted values until we reach/pass the minimum point of the function (which can be determined by a simple differentiation) and then for each value after the minimum it should just walk backward through the earlier values looking for the correct place to insert that value. Using a linked list would allow items to be moved around in-place.
The quadratic transform can cause part of the values to "fold" over the others. You will have to reverse their order, which can easily be done in-place, but then you will need to merge the two sequences.
In-place merge in linear time is possible, but this is a difficult process, normally out of the scope of an interview question (unless for a Teacher's position in Algorithmics).
Have a look at this solution: http://www.akira.ruc.dk/~keld/teaching/algoritmedesign_f04/Artikler/04/Huang88.pdf
I guess that the main idea is to reserve a part of the array where you allow swaps that scramble the data it contains. You use it to perform partial merges on the rest of the array and in the end you sort back the data. (The merging buffer must be small enough that it doesn't take more than O(N) to sort it.)
If a is > 0, then a minimum occurs at x = -b/(2a), and values will be copied to the output array in forward order from [0] to [n-1]. If a < 0, then a maximum occurs at x = -b/(2a) and values will be copied to the output array in reverse order from [n-1] to [0]. (If a == 0, then if b > 0, do a forward copy, if b < 0, do a reverse copy, If a == b == 0, nothing needs to be done). I think the sorted array can be binary searched for the closest value to -b/(2a) in O(log2(n)) (otherwise it's O(n)). Then this value is copied to the output array and the values before (decrementing index or pointer) and after (incrementing index or pointer) are merged into the output array, taking O(n) time.
static void sortArray(int arr[], int n, int A, int B, int C)
{
// Apply equation on all elements
for (int i = 0; i < n; i++)
arr[i] = A*arr[i]*arr[i] + B*arr[i] + C;
// Find maximum element in resultant array
int index=-1;
int maximum = -999999;
for (int i = 0; i< n; i++)
{
if (maximum < arr[i])
{
index = i;
maximum = arr[i];
}
}
// Use maximum element as a break point
// and merge both subarrays usin simple
// merge function of merge sort
int i = 0, j = n-1;
int[] new_arr = new int[n];
int k = 0;
while (i < index && j > index)
{
if (arr[i] < arr[j])
new_arr[k++] = arr[i++];
else
new_arr[k++] = arr[j--];
}
// Merge remaining elements
while (i < index)
new_arr[k++] = arr[i++];
while (j > index)
new_arr[k++] = arr[j--];
new_arr[n-1] = maximum;
// Modify original array
for (int p = 0; p < n ; p++)
arr[p] = new_arr[p];
}
I have a linked list of objects each containing a 32-bit integer (and provably fewer than 232 such objects) and I want to efficiently choose an integer that's not present in the list, without using any additional storage (so copying them to an array, sorting the array, and choosing the minimum value not in the array would not be an option). However, the definition of the structure for list elements is under my control, so I could add (within reason) additional storage to each element as part of solving the problem. For example, I could add an extra set of prev/next pointers and merge-sort the list. Is this the best solution? Or is there a simpler or more efficient way to do it?
Given the conditions that you outline in the comments, especially your expectation of many identical values, you must expect a sparse distribution of used values.
Consequently, it might actually be best to just guess a value randomly and then check whether it coincides with a value in the list. Even if half the available value range were used (which seems extremely unlikely from your comments), you would only traverse the list twice on average. And you can drastically decrease this factor by simultaneously checking a number of guesses in one pass. Done correctly, the factor should always be close to one.
The advantage of such a probabilistic approach is that you are immune to bad sequences of values. Such sequences are always possible with range based approaches: If you calculate the min and max of the data, you run the risk, that the data contains both 0 and 2^32-1. If you sequentially subdivide an interval, you run the risk of always getting values in the middle of the interval, which can shrink it to zero in 32 steps. With a probabilistic approach, these sequences can't hurt you.
I think, I would use something like four guesses for very small lists, and crank it up to roughly 16 as the size of the list approaches the limit. The high starting value is due to the fact that any such algorithm will be memory bound, i. e. your CPU has ample amounts of time to check a value while it waits for the next values to arrive from memory, so you better make good use of that time to reduce the number of passes required.
A further optimization would instantly replace a busted guess with a new one and keep track of where the replacement happened, so that you can avoid a complete second pass through the data. Also, move the busted guess to the end of the list of guesses, so that you only need to check against the start position of the first guess in your loop to stop as early as possible.
If you can spare one pointer in each object, you get an O(n) worst-case algorithm easily (standard divide-and-conquer):
Divide the range of possible IDs equally.
Make a singly-linked list covering each subrange.
If one subrange is empty, choose any id in it.
Otherwise repeat with the elements of the subrange with fewest elements.
Example code using two sub-ranges per iteration:
unsigned getunusedid(element* h) {
unsigned start = 0, stop = -1;
for(;h;h = h->mainnext)
h->next = h->mainnext;
while(h) {
element *l = 0, *r = 0;
unsigned cl = 0, cr = 0;
unsigned mid = start + (stop - start) / 2;
while(h) {
element* next = h->next;
if(h->id < mid) {
h->next = l;
cl++;
l = h;
} else {
h->next = r;
cr++;
r = h;
}
h = next;
}
if(cl < cr) {
h = l;
stop = mid - 1;
} else {
h = r;
start = mid;
}
}
return start;
}
Some more remarks:
Beware of bugs in the above code; I have only proved it correct, not tried it.
Using more buckets (best keep to a power of 2 for easy and efficient handling) each iteration might be faster due to better data-locality (though only try and measure if it's not fast enough otherwise), as #MarkDickson rightly remarks.
Without those extra-pointers, you need full sweeps each iteration, raising the bound to O(n*lg n).
An alternative would be using 2+ extra-pointers per element to maintain a balanced tree. That would speed up id-search, at the expense of some memory and insertion/removal time overhead.
If you don't mind an O(n) scan for each change in the list and two extra bits per element, whenever an element is inserted or removed, scan through and use the two bits to represent whether an integer (element + 1) or (element - 1) exists in the list.
For example, inserting the element, 2, the extra bits for each 3 and 1 in the list would be updated to show that 3-1 (in the case of 3) and 1+1 (in the case of 1) now exist in the list.
Insertion/deletion time can be reduced by adding a pointer from each element to the next element with the same integer.
I am supposing that integers have random values not controlled by your code.
Add two unsigned integers in your list class:
unsigned int rangeMinId = 0;
unsigned int rangeMaxId = 0xFFFFFFFF ;
Or if not possible to change the List class add them as global variables.
When the list is empty you will always know that the range if free. When you add a new item in the list check if its ID is between rangeMinId and rangeMaxId and if so change the nearest of them to this ID.
It may happen after a lot of time that rangeMinId to become equal to rangeMaxId-1, then you need a simple function which traverses the whole list and search for another free range. But this will not happens very frequently.
Other solutions are more complex and involves using of sets, binary trees or sorted arrays.
Update:
The free range search function can be done in O(n*log(n)). An example of such function is given below(I have not extensively tested it). The example is for integer array but easily can be adapted for a list.
int g_Calls = 0;
bool _findFreeRange(const int* value, int n, int& left, int& right)
{
g_Calls ++ ;
int l=left, r=right,l2,r2;
int m = (right + left) / 2 ;
int nl=0, nr=0;
for(int k = 0; k < n; k++)
{
const int& i = value[k] ;
if(i > l && i < r)
{
if(i-l < r-i)
l = i;
else
r = i;
}
if(i < m)
nl ++ ;
else
nr ++ ;
}
if ( (r - l) > 1 )
{
left = l;
right = r;
return true ;
}
if( nl < nr)
{
// check first left then right
l2 = left;
r2 = m;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
l2 = m;
r2 = right;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
}
else
{
// check first right then left
l2 = m;
r2 = right;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
l2 = left;
r2 = m;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
}
return false;
}
bool findFreeRange(const int* value, int n, int& left, int& right, int maxx)
{
g_Calls = 1;
left = 0;
right = maxx;
if(!_findFreeRange(value, n, left, right))
return false ;
left++;
right--;
return (right - left) >= 0 ;
}
If it returns false list is filled and there is no free range (very least possible), maxm is the maximal limit of the range in this case 0xFFFFFFFF.
The idea is first to search the biggest range of the list and then if no free hole is found to recursively search the subranges for holes which may have been left during the first pass. If the list is sparsely filled it is very least probable that function will be called more than once. However when the list become almost completely filled it can happen the range search to take longer. Thus in this most worst case scenario, when the list becomes closed to filled, its better to start keeping all free ranges in a list.
This reminds me of the book Programming Pearls, and in particular the very first column, "Cracking the Oyster". What is the real problem you are trying to solve?
If your list is small, then a simple linear search to find max/min would work and it would work quickly.
When your list gets large and linear search becomes unwieldy, you can build a bitmap to represent the unused numbers for much less memory than adding 2 extra pointers at each node in the linked list. In fact, it would only be 2^(32-8) = 16KB of RAM compared to your linked list being potentially >10GB.
Then to find an unused number, you can just traverse the bitmap one machine-word at a time, checking if it's non-zero. If it is, then at least one number in that 32- or 64- bit block is unused, and you can inspect the word to find out exactly which bit is set. As you add numbers to the list, all you have to do is clear the corresponding bit in the bitmap.
One possible solution is to take the min and max of the list with a simple O(n) iteration, then pick a number between max and min + (1 << 32). This is simple to do since overflow/underflow behavior is well-defined for unsigned integers:
uint32_t min, max;
// TODO: compute min and max here
// exclude max from choice space (min will be an exclusive upper bound)
max++;
uint32_t choice = rand32() % (min - max) + max; // where rand32 is a random unsigned 32-bit integer
Of course, if it doesn't need to be random, then you can just use one more than the maximum of the list.
Note: the only case where this fails is if min is 0 and max is UINT32_MAX (aka 4294967295).
Ok. Here is one really simple solution. Some of the answers have become too theoretical and complicated for optimization. If you need a quick solution do this:
1.In your List add a member:
unsigned int NextFreeId = 1;
add also an std::set<unsigned int> ids
When you add item in the list add also the integer in the set and keep track of the NextFreeId:
int insert(unsigned int id)
{
ids.insert(id);
if (NextFreeId == id) //will not happen too frequently
{
unsigned int TheFreeId ;
unsigned int nextid = id+1, previd = id-1;
while(true )
{
if(nextid < 0xFFFFFFF && !ids.count(nextid))
{
NextFreeId = nextid ;
break ;
}
if(previd > 0 && !ids.count(previd))
{
NextFreeId = previd ;
break ;
}
if(prevId == 0 && nextid == 0xFFFFFFF)
break; // all the range is filled, there is no free id
nextid++ ;
previd -- ;
}
}
return 1;
}
Sets are very efficient to check if a value is contained so the complexity will be O(log(N)). It is quick to implement. Also set is searched not each time but only when the NextFreeId is filled. List is not traversed at all.
I'm looking for an algorithm to find the first number less than X in an array of integers. Actually I'm using linear search,but I think that binary search may be better(as I already have seen sometime ago) but I don't know how to implement it myself(not implement a modified version to find the first-less than X). If there is something better than bin search,please tell me. I need of it because this array is so-much accessed and modified while the program is running.
Here's the current(trival) implementation:
int findmin(int *arr,int n,int size)
{
int i;
for(i = 0; i < size && arr[i] < n; i++)
;
return i-1;
}
This index is used the parameter input to function that insert N-value in a specific index. The index of this function and insert a number in array but still make it sorted without a sort() call at each time a new number is inserted.
It is a relevant file text parsing,parse much files and I a number considerable of characters. I need to make some effort to make the things more fast as possible(in my context and knowlege).
EDIT: The array is sorted always will be,even after insections of new numbers.
Binary search is no problem here. The difficult is how to check the boundary conditions and what value should be returned. Moreover, you have to take care of duplicate values. The following function will work.
int binarySearch(int a[], int length, int value)
{
if (a[0] > value)
return -1;
int low = 0;
int high = length-1;
while (low<=high)
{
int mid = low+((high-low)/2);
if(a[mid] >= value)
high = mid-1;
else
low = mid+1;
}
return low-1;
}
Test: (with duplicates)
int arr[] = {1, 3, 4, 6, 6, 10};
int length = sizeof(arr)/sizeof(arr[0]);
int index = binarySearch(arr, length, 6); // will be 2
You have to have the array of N elements already sorted.
Set "high" to N and "low" to -1.
"Probe" element N/2 (rounding down to an int).
If the probed element is < your target, set "high" to N/2. If >, set "low" to N/2. (If ==, of course, then you have your answer.)
Repeat, probing halfway between "low" and "high".
There are boundary conditions you need to worry about, but not too complicated.
I need the fastest and simple algorithm which finds the duplicate numbers in an array, also should be able to know the number of duplicates.
Eg: if the array is {2,3,4,5,2,4,6,2,4,7,3,8,2}
I should be able to know that there are four 2's, two 3's and three 4's.
Make a hash table where the key is array item and value is counter how many times the corresponding array item has occurred in array. This is efficient way to do it, but probably not the fastest way.
Something like this (in pseudo code). You will find plenty of hash map implementations for C by googling.
hash_map = create_new_hash_map()
for item in array {
if hash_map.contains_key(item){
counter = hash_map.get(item)
} else {
counter = 0
}
counter = counter + 1
hash_map.put(item, counter)
}
This can be solved elegantly using Linq:
public static void Main(string[] args)
{
List<int> list = new List<int> { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };
var grouping = list
.GroupBy(x => x)
.Select(x => new { Item = x.Key, Count = x.Count()});
foreach (var item in grouping)
Console.WriteLine("Item {0} has count {1}", item.Item, item.Count);
}
Internally it probably uses hashing to partition the list, but the code hides the internal details - here we are only telling it what to calculate. The compiler / runtime is free to choose how to calculate it, and optimize as it sees fit. Thanks to Linq this same code will run efficiently whether run an a list in memory, or if the list is in a database. In real code you should use this, but I guess you want to know how internally it works.
A more imperative approach that demonstrates the actual algorithm is as follows:
List<int> list = new List<int> { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };
Dictionary<int, int> counts = new Dictionary<int, int>();
foreach (int item in list)
{
if (!counts.ContainsKey(item))
{
counts[item] = 1;
}
else
{
counts[item]++;
}
}
foreach (KeyValuePair<int, int> item in counts)
Console.WriteLine("Item {0} has count {1}", item.Key, item.Value);
Here you can see that we iterate over the list only once, keeping a count for each item we see on the way. This would be a bad idea if the items were in a database though, so for real code, prefer to use the Linq method.
here's a C version that does it with standard input; it's as fast as the length of the input (beware, the number of parameters on the command line is limited...) but should give you an idea on how to proceed:
#include <stdio.h>
int main ( int argc, char **argv ) {
int dups[10] = { 0 };
int i;
for ( i = 1 ; i < argc ; i++ )
dups[atoi(argv[i])]++;
for ( i = 0 ; i < 10 ; i++ )
printf("%d: %d\n", i, dups[i]);
return 0;
}
example usage:
$ gcc -o dups dups.c
$ ./dups 0 0 3 4 5
0: 2
1: 0
2: 0
3: 1
4: 1
5: 1
6: 0
7: 0
8: 0
9: 0
caveats:
if you plan to count also the number of 10s, 11s, and so on -> the dups[] array must be bigger
left as an exercise is to implement reading from an array of integers and to determine their position
The more you tell us about the input arrays the faster we can make the algorithm. For example, for your example of single-digit numbers then creating an array of 10 elements (indexed 0:9) and accumulating number of occurrences of number in the right element of the array (poorly worded explanation but you probably catch my drift) is likely to be faster than hashing. (I say likely to be faster because I haven't done any measurements and won't).
I agree with most respondents that hashing is probably the right approach for the most general case, but it's always worth thinking about whether yours is a special case.
If you know the lower and upper bounds, and they are not too far apart, this would be a good place to use a Radix Sort. Since this smells of homework, I'm leaving it to the OP to read the article and implement the algorithm.
If you don't want to use hash table or smtg like that, just sort the array then count the number of occurrences, something like below should work
Arrays.sort(array);
lastOne=array's first element;
count=0,
for(i=0; i <array's length; i++)
{
if(array[i]==lastOne)
increment count
else
print(array[i] + " has " + count + " occurrences");
lastOne=array[i+1];
}
If the range of the numbers is known and small, you could use an array to keep track of how many times you've seen each (this is a bucket sort in essence). IF it's big you can sort it and then count duplicates as they will be following each other.
option 1: hash it.
option 2: sort it and then count consecutive runs.
You can use hash tables to store each element value as a key. Then increment +1 each time a key already exists.
Using hash tables / associative arrays / dictionaries (all the same thing but the terminology changes between programming environments) is the way to go.
As an example in python:
numberList = [1, 2, 3, 2, 1, ...]
countDict = {}
for value in numberList:
countDict[value] = countDict.get(value, 0) + 1
# Now countDict contains each value pointing to their count
Similar constructions exist in most programming languages.
> I need the fastest and simple algorithm which finds the duplicate numbers in an array, also should be able to know the number of duplicates.
I think the fastest algorithm is counting the duplicates in an array:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <assert.h>
typedef int arr_t;
typedef unsigned char dup_t;
const dup_t dup_t_max=UCHAR_MAX;
dup_t *count_duplicates( arr_t *arr, arr_t min, arr_t max, size_t arr_len ){
assert( min <= max );
dup_t *dup = calloc( max-min+1, sizeof(dup[0]) );
for( size_t i=0; i<arr_len; i++ ){
assert( min <= arr[i] && arr[i] <= max && dup[ arr[i]-min ] < dup_t_max );
dup[ arr[i]-min ]++;
}
return dup;
}
int main(void){
arr_t arr[] = {2,3,4,5,2,4,6,2,4,7,3,8,2};
size_t arr_len = sizeof(arr)/sizeof(arr[0]);
arr_t min=0, max=16;
dup_t *dup = count_duplicates( arr, min, max, arr_len );
printf( " value count\n" );
printf( " -----------\n" );
for( size_t i=0; i<(size_t)(max-min+1); i++ ){
if( dup[i] ){
printf( "%5i %5i\n", (int)(i+min), (int)(dup[i]) );
}
}
free(dup);
}
Note: You can not use the fastest algorithm on every array.
The code first sorts the array and then moves unique elements to the front, keeping track of the number of elements. It's slower than using bucket sort, but more convenient.
#include <stdio.h>
#include <stdlib.h>
static int cmpi(const void *p1, const void *p2)
{
int i1 = *(const int *)p1;
int i2 = *(const int *)p2;
return (i1 > i2) - (i1 < i2);
}
size_t make_unique(int values[], size_t count, size_t *occ_nums)
{
if(!count) return 0;
qsort(values, count, sizeof *values, cmpi);
size_t top = 0;
int prev_value = values[0];
if(occ_nums) occ_nums[0] = 1;
size_t i = 1;
for(; i < count; ++i)
{
if(values[i] != prev_value)
{
++top;
values[top] = prev_value = values[i];
if(occ_nums) occ_nums[top] = 1;
}
else ++occ_nums[top];
}
return top + 1;
}
int main(void)
{
int values[] = { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };
size_t occ_nums[sizeof values / sizeof *values];
size_t unique_count = make_unique(
values, sizeof values / sizeof *values, occ_nums);
size_t i = 0;
for(; i < unique_count; ++i)
{
printf("number %i occurred %u time%s\n",
values[i], (unsigned)occ_nums[i], occ_nums[i] > 1 ? "s": "");
}
}
There is an "algorithm" that I use all the time to find duplicate lines in a file in Unix:
sort file | uniq -d
If you implement the same strategy in C, then it is very difficult to beat it with a fancier strategy such as hash tables. Call a sorting algorithm, and then call your own function to detect duplicates in the sorted list. The sorting algorithm takes O(n*log(n)) time and the uniq function takes linear time. (Southern Hospitality makes a similar point, but I want to emphasize that what he calls "option 2" seems both simpler and faster than the more popular hash tables suggestion.)
Counting sort is the answer to the above question.If you see the algorithm for counting sort you will find that there is an array that is kept for keeping the count of an element i present in the original array.
Here is another solution but it takes O(nlogn) time.
Use Divide and Conquer approach to sort the given array using merge sort.
During combine step in merge sort, find the duplicates by comparing the elements in the two sorted sub-arrays.
An interesting interview question that a colleague of mine uses:
Suppose that you are given a very long, unsorted list of unsigned 64-bit integers. How would you find the smallest non-negative integer that does not occur in the list?
FOLLOW-UP: Now that the obvious solution by sorting has been proposed, can you do it faster than O(n log n)?
FOLLOW-UP: Your algorithm has to run on a computer with, say, 1GB of memory
CLARIFICATION: The list is in RAM, though it might consume a large amount of it. You are given the size of the list, say N, in advance.
If the datastructure can be mutated in place and supports random access then you can do it in O(N) time and O(1) additional space. Just go through the array sequentially and for every index write the value at the index to the index specified by value, recursively placing any value at that location to its place and throwing away values > N. Then go again through the array looking for the spot where value doesn't match the index - that's the smallest value not in the array. This results in at most 3N comparisons and only uses a few values worth of temporary space.
# Pass 1, move every value to the position of its value
for cursor in range(N):
target = array[cursor]
while target < N and target != array[target]:
new_target = array[target]
array[target] = target
target = new_target
# Pass 2, find first location where the index doesn't match the value
for cursor in range(N):
if array[cursor] != cursor:
return cursor
return N
Here's a simple O(N) solution that uses O(N) space. I'm assuming that we are restricting the input list to non-negative numbers and that we want to find the first non-negative number that is not in the list.
Find the length of the list; lets say it is N.
Allocate an array of N booleans, initialized to all false.
For each number X in the list, if X is less than N, set the X'th element of the array to true.
Scan the array starting from index 0, looking for the first element that is false. If you find the first false at index I, then I is the answer. Otherwise (i.e. when all elements are true) the answer is N.
In practice, the "array of N booleans" would probably be encoded as a "bitmap" or "bitset" represented as a byte or int array. This typically uses less space (depending on the programming language) and allows the scan for the first false to be done more quickly.
This is how / why the algorithm works.
Suppose that the N numbers in the list are not distinct, or that one or more of them is greater than N. This means that there must be at least one number in the range 0 .. N - 1 that is not in the list. So the problem of find the smallest missing number must therefore reduce to the problem of finding the smallest missing number less than N. This means that we don't need to keep track of numbers that are greater or equal to N ... because they won't be the answer.
The alternative to the previous paragraph is that the list is a permutation of the numbers from 0 .. N - 1. In this case, step 3 sets all elements of the array to true, and step 4 tells us that the first "missing" number is N.
The computational complexity of the algorithm is O(N) with a relatively small constant of proportionality. It makes two linear passes through the list, or just one pass if the list length is known to start with. There is no need to represent the hold the entire list in memory, so the algorithm's asymptotic memory usage is just what is needed to represent the array of booleans; i.e. O(N) bits.
(By contrast, algorithms that rely on in-memory sorting or partitioning assume that you can represent the entire list in memory. In the form the question was asked, this would require O(N) 64-bit words.)
#Jorn comments that steps 1 through 3 are a variation on counting sort. In a sense he is right, but the differences are significant:
A counting sort requires an array of (at least) Xmax - Xmin counters where Xmax is the largest number in the list and Xmin is the smallest number in the list. Each counter has to be able to represent N states; i.e. assuming a binary representation it has to have an integer type (at least) ceiling(log2(N)) bits.
To determine the array size, a counting sort needs to make an initial pass through the list to determine Xmax and Xmin.
The minimum worst-case space requirement is therefore ceiling(log2(N)) * (Xmax - Xmin) bits.
By contrast, the algorithm presented above simply requires N bits in the worst and best cases.
However, this analysis leads to the intuition that if the algorithm made an initial pass through the list looking for a zero (and counting the list elements if required), it would give a quicker answer using no space at all if it found the zero. It is definitely worth doing this if there is a high probability of finding at least one zero in the list. And this extra pass doesn't change the overall complexity.
EDIT: I've changed the description of the algorithm to use "array of booleans" since people apparently found my original description using bits and bitmaps to be confusing.
Since the OP has now specified that the original list is held in RAM and that the computer has only, say, 1GB of memory, I'm going to go out on a limb and predict that the answer is zero.
1GB of RAM means the list can have at most 134,217,728 numbers in it. But there are 264 = 18,446,744,073,709,551,616 possible numbers. So the probability that zero is in the list is 1 in 137,438,953,472.
In contrast, my odds of being struck by lightning this year are 1 in 700,000. And my odds of getting hit by a meteorite are about 1 in 10 trillion. So I'm about ten times more likely to be written up in a scientific journal due to my untimely death by a celestial object than the answer not being zero.
As pointed out in other answers you can do a sort, and then simply scan up until you find a gap.
You can improve the algorithmic complexity to O(N) and keep O(N) space by using a modified QuickSort where you eliminate partitions which are not potential candidates for containing the gap.
On the first partition phase, remove duplicates.
Once the partitioning is complete look at the number of items in the lower partition
Is this value equal to the value used for creating the partition?
If so then it implies that the gap is in the higher partition.
Continue with the quicksort, ignoring the lower partition
Otherwise the gap is in the lower partition
Continue with the quicksort, ignoring the higher partition
This saves a large number of computations.
To illustrate one of the pitfalls of O(N) thinking, here is an O(N) algorithm that uses O(1) space.
for i in [0..2^64):
if i not in list: return i
print "no 64-bit integers are missing"
Since the numbers are all 64 bits long, we can use radix sort on them, which is O(n). Sort 'em, then scan 'em until you find what you're looking for.
if the smallest number is zero, scan forward until you find a gap. If the smallest number is not zero, the answer is zero.
For a space efficient method and all values are distinct you can do it in space O( k ) and time O( k*log(N)*N ). It's space efficient and there's no data moving and all operations are elementary (adding subtracting).
set U = N; L=0
First partition the number space in k regions. Like this:
0->(1/k)*(U-L) + L, 0->(2/k)*(U-L) + L, 0->(3/k)*(U-L) + L ... 0->(U-L) + L
Find how many numbers (count{i}) are in each region. (N*k steps)
Find the first region (h) that isn't full. That means count{h} < upper_limit{h}. (k steps)
if h - count{h-1} = 1 you've got your answer
set U = count{h}; L = count{h-1}
goto 2
this can be improved using hashing (thanks for Nic this idea).
same
First partition the number space in k regions. Like this:
L + (i/k)->L + (i+1/k)*(U-L)
inc count{j} using j = (number - L)/k (if L < number < U)
find first region (h) that doesn't have k elements in it
if count{h} = 1 h is your answer
set U = maximum value in region h L = minimum value in region h
This will run in O(log(N)*N).
I'd just sort them then run through the sequence until I find a gap (including the gap at the start between zero and the first number).
In terms of an algorithm, something like this would do it:
def smallest_not_in_list(list):
sort(list)
if list[0] != 0:
return 0
for i = 1 to list.last:
if list[i] != list[i-1] + 1:
return list[i-1] + 1
if list[list.last] == 2^64 - 1:
assert ("No gaps")
return list[list.last] + 1
Of course, if you have a lot more memory than CPU grunt, you could create a bitmask of all possible 64-bit values and just set the bits for every number in the list. Then look for the first 0-bit in that bitmask. That turns it into an O(n) operation in terms of time but pretty damned expensive in terms of memory requirements :-)
I doubt you could improve on O(n) since I can't see a way of doing it that doesn't involve looking at each number at least once.
The algorithm for that one would be along the lines of:
def smallest_not_in_list(list):
bitmask = mask_make(2^64) // might take a while :-)
mask_clear_all (bitmask)
for i = 1 to list.last:
mask_set (bitmask, list[i])
for i = 0 to 2^64 - 1:
if mask_is_clear (bitmask, i):
return i
assert ("No gaps")
Sort the list, look at the first and second elements, and start going up until there is a gap.
We could use a hash table to hold the numbers. Once all numbers are done, run a counter from 0 till we find the lowest. A reasonably good hash will hash and store in constant time, and retrieves in constant time.
for every i in X // One scan Θ(1)
hashtable.put(i, i); // O(1)
low = 0;
while (hashtable.get(i) <> null) // at most n+1 times
low++;
print low;
The worst case if there are n elements in the array, and are {0, 1, ... n-1}, in which case, the answer will be obtained at n, still keeping it O(n).
You can do it in O(n) time and O(1) additional space, although the hidden factor is quite large. This isn't a practical way to solve the problem, but it might be interesting nonetheless.
For every unsigned 64-bit integer (in ascending order) iterate over the list until you find the target integer or you reach the end of the list. If you reach the end of the list, the target integer is the smallest integer not in the list. If you reach the end of the 64-bit integers, every 64-bit integer is in the list.
Here it is as a Python function:
def smallest_missing_uint64(source_list):
the_answer = None
target = 0L
while target < 2L**64:
target_found = False
for item in source_list:
if item == target:
target_found = True
if not target_found and the_answer is None:
the_answer = target
target += 1L
return the_answer
This function is deliberately inefficient to keep it O(n). Note especially that the function keeps checking target integers even after the answer has been found. If the function returned as soon as the answer was found, the number of times the outer loop ran would be bound by the size of the answer, which is bound by n. That change would make the run time O(n^2), even though it would be a lot faster.
Thanks to egon, swilden, and Stephen C for my inspiration. First, we know the bounds of the goal value because it cannot be greater than the size of the list. Also, a 1GB list could contain at most 134217728 (128 * 2^20) 64-bit integers.
Hashing part
I propose using hashing to dramatically reduce our search space. First, square root the size of the list. For a 1GB list, that's N=11,586. Set up an integer array of size N. Iterate through the list, and take the square root* of each number you find as your hash. In your hash table, increment the counter for that hash. Next, iterate through your hash table. The first bucket you find that is not equal to it's max size defines your new search space.
Bitmap part
Now set up a regular bit map equal to the size of your new search space, and again iterate through the source list, filling out the bitmap as you find each number in your search space. When you're done, the first unset bit in your bitmap will give you your answer.
This will be completed in O(n) time and O(sqrt(n)) space.
(*You could use use something like bit shifting to do this a lot more efficiently, and just vary the number and size of buckets accordingly.)
Well if there is only one missing number in a list of numbers, the easiest way to find the missing number is to sum the series and subtract each value in the list. The final value is the missing number.
int i = 0;
while ( i < Array.Length)
{
if (Array[i] == i + 1)
{
i++;
}
if (i < Array.Length)
{
if (Array[i] <= Array.Length)
{//SWap
int temp = Array[i];
int AnoTemp = Array[temp - 1];
Array[temp - 1] = temp;
Array[i] = AnoTemp;
}
else
i++;
}
}
for (int j = 0; j < Array.Length; j++)
{
if (Array[j] > Array.Length)
{
Console.WriteLine(j + 1);
j = Array.Length;
}
else
if (j == Array.Length - 1)
Console.WriteLine("Not Found !!");
}
}
Here's my answer written in Java:
Basic Idea:
1- Loop through the array throwing away duplicate positive, zeros, and negative numbers while summing up the rest, getting the maximum positive number as well, and keep the unique positive numbers in a Map.
2- Compute the sum as max * (max+1)/2.
3- Find the difference between the sums calculated at steps 1 & 2
4- Loop again from 1 to the minimum of [sums difference, max] and return the first number that is not in the map populated in step 1.
public static int solution(int[] A) {
if (A == null || A.length == 0) {
throw new IllegalArgumentException();
}
int sum = 0;
Map<Integer, Boolean> uniqueNumbers = new HashMap<Integer, Boolean>();
int max = A[0];
for (int i = 0; i < A.length; i++) {
if(A[i] < 0) {
continue;
}
if(uniqueNumbers.get(A[i]) != null) {
continue;
}
if (A[i] > max) {
max = A[i];
}
uniqueNumbers.put(A[i], true);
sum += A[i];
}
int completeSum = (max * (max + 1)) / 2;
for(int j = 1; j <= Math.min((completeSum - sum), max); j++) {
if(uniqueNumbers.get(j) == null) { //O(1)
return j;
}
}
//All negative case
if(uniqueNumbers.isEmpty()) {
return 1;
}
return 0;
}
As Stephen C smartly pointed out, the answer must be a number smaller than the length of the array. I would then find the answer by binary search. This optimizes the worst case (so the interviewer can't catch you in a 'what if' pathological scenario). In an interview, do point out you are doing this to optimize for the worst case.
The way to use binary search is to subtract the number you are looking for from each element of the array, and check for negative results.
I like the "guess zero" apprach. If the numbers were random, zero is highly probable. If the "examiner" set a non-random list, then add one and guess again:
LowNum=0
i=0
do forever {
if i == N then leave /* Processed entire array */
if array[i] == LowNum {
LowNum++
i=0
}
else {
i++
}
}
display LowNum
The worst case is n*N with n=N, but in practice n is highly likely to be a small number (eg. 1)
I am not sure if I got the question. But if for list 1,2,3,5,6 and the missing number is 4, then the missing number can be found in O(n) by:
(n+2)(n+1)/2-(n+1)n/2
EDIT: sorry, I guess I was thinking too fast last night. Anyway, The second part should actually be replaced by sum(list), which is where O(n) comes. The formula reveals the idea behind it: for n sequential integers, the sum should be (n+1)*n/2. If there is a missing number, the sum would be equal to the sum of (n+1) sequential integers minus the missing number.
Thanks for pointing out the fact that I was putting some middle pieces in my mind.
Well done Ants Aasma! I thought about the answer for about 15 minutes and independently came up with an answer in a similar vein of thinking to yours:
#define SWAP(x,y) { numerictype_t tmp = x; x = y; y = tmp; }
int minNonNegativeNotInArr (numerictype_t * a, size_t n) {
int m = n;
for (int i = 0; i < m;) {
if (a[i] >= m || a[i] < i || a[i] == a[a[i]]) {
m--;
SWAP (a[i], a[m]);
continue;
}
if (a[i] > i) {
SWAP (a[i], a[a[i]]);
continue;
}
i++;
}
return m;
}
m represents "the current maximum possible output given what I know about the first i inputs and assuming nothing else about the values until the entry at m-1".
This value of m will be returned only if (a[i], ..., a[m-1]) is a permutation of the values (i, ..., m-1). Thus if a[i] >= m or if a[i] < i or if a[i] == a[a[i]] we know that m is the wrong output and must be at least one element lower. So decrementing m and swapping a[i] with the a[m] we can recurse.
If this is not true but a[i] > i then knowing that a[i] != a[a[i]] we know that swapping a[i] with a[a[i]] will increase the number of elements in their own place.
Otherwise a[i] must be equal to i in which case we can increment i knowing that all the values of up to and including this index are equal to their index.
The proof that this cannot enter an infinite loop is left as an exercise to the reader. :)
The Dafny fragment from Ants' answer shows why the in-place algorithm may fail. The requires pre-condition describes that the values of each item must not go beyond the bounds of the array.
method AntsAasma(A: array<int>) returns (M: int)
requires A != null && forall N :: 0 <= N < A.Length ==> 0 <= A[N] < A.Length;
modifies A;
{
// Pass 1, move every value to the position of its value
var N := A.Length;
var cursor := 0;
while (cursor < N)
{
var target := A[cursor];
while (0 <= target < N && target != A[target])
{
var new_target := A[target];
A[target] := target;
target := new_target;
}
cursor := cursor + 1;
}
// Pass 2, find first location where the index doesn't match the value
cursor := 0;
while (cursor < N)
{
if (A[cursor] != cursor)
{
return cursor;
}
cursor := cursor + 1;
}
return N;
}
Paste the code into the validator with and without the forall ... clause to see the verification error. The second error is a result of the verifier not being able to establish a termination condition for the Pass 1 loop. Proving this is left to someone who understands the tool better.
Here's an answer in Java that does not modify the input and uses O(N) time and N bits plus a small constant overhead of memory (where N is the size of the list):
int smallestMissingValue(List<Integer> values) {
BitSet bitset = new BitSet(values.size() + 1);
for (int i : values) {
if (i >= 0 && i <= values.size()) {
bitset.set(i);
}
}
return bitset.nextClearBit(0);
}
def solution(A):
index = 0
target = []
A = [x for x in A if x >=0]
if len(A) ==0:
return 1
maxi = max(A)
if maxi <= len(A):
maxi = len(A)
target = ['X' for x in range(maxi+1)]
for number in A:
target[number]= number
count = 1
while count < maxi+1:
if target[count] == 'X':
return count
count +=1
return target[count-1] + 1
Got 100% for the above solution.
1)Filter negative and Zero
2)Sort/distinct
3)Visit array
Complexity: O(N) or O(N * log(N))
using Java8
public int solution(int[] A) {
int result = 1;
boolean found = false;
A = Arrays.stream(A).filter(x -> x > 0).sorted().distinct().toArray();
//System.out.println(Arrays.toString(A));
for (int i = 0; i < A.length; i++) {
result = i + 1;
if (result != A[i]) {
found = true;
break;
}
}
if (!found && result == A.length) {
//result is larger than max element in array
result++;
}
return result;
}
An unordered_set can be used to store all the positive numbers, and then we can iterate from 1 to length of unordered_set, and see the first number that does not occur.
int firstMissingPositive(vector<int>& nums) {
unordered_set<int> fre;
// storing each positive number in a hash.
for(int i = 0; i < nums.size(); i +=1)
{
if(nums[i] > 0)
fre.insert(nums[i]);
}
int i = 1;
// Iterating from 1 to size of the set and checking
// for the occurrence of 'i'
for(auto it = fre.begin(); it != fre.end(); ++it)
{
if(fre.find(i) == fre.end())
return i;
i +=1;
}
return i;
}
Solution through basic javascript
var a = [1, 3, 6, 4, 1, 2];
function findSmallest(a) {
var m = 0;
for(i=1;i<=a.length;i++) {
j=0;m=1;
while(j < a.length) {
if(i === a[j]) {
m++;
}
j++;
}
if(m === 1) {
return i;
}
}
}
console.log(findSmallest(a))
Hope this helps for someone.
With python it is not the most efficient, but correct
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import datetime
# write your code in Python 3.6
def solution(A):
MIN = 0
MAX = 1000000
possible_results = range(MIN, MAX)
for i in possible_results:
next_value = (i + 1)
if next_value not in A:
return next_value
return 1
test_case_0 = [2, 2, 2]
test_case_1 = [1, 3, 44, 55, 6, 0, 3, 8]
test_case_2 = [-1, -22]
test_case_3 = [x for x in range(-10000, 10000)]
test_case_4 = [x for x in range(0, 100)] + [x for x in range(102, 200)]
test_case_5 = [4, 5, 6]
print("---")
a = datetime.datetime.now()
print(solution(test_case_0))
print(solution(test_case_1))
print(solution(test_case_2))
print(solution(test_case_3))
print(solution(test_case_4))
print(solution(test_case_5))
def solution(A):
A.sort()
j = 1
for i, elem in enumerate(A):
if j < elem:
break
elif j == elem:
j += 1
continue
else:
continue
return j
this can help:
0- A is [5, 3, 2, 7];
1- Define B With Length = A.Length; (O(1))
2- initialize B Cells With 1; (O(n))
3- For Each Item In A:
if (B.Length <= item) then B[Item] = -1 (O(n))
4- The answer is smallest index in B such that B[index] != -1 (O(n))