What is the most efficient (fastest) way to find an N number of the largest integers in an array in C?

Let's have an array of size 8
Let's have N be 3
With an array:
1 3 2 17 19 23 0 2
Our output should be:
23, 19, 17
Explanation: The three largest numbers from the array, listed in descending order.
I have tried this:
int array[8];
int largest[N] = {0, 0, 0};
for (int i = 1; i < N; i++) {
for (int j = 0; j < SIZE_OF_ARRAY; j++) {
if (largest[i] > array[j]) {
largest[i] = array[j];
array[j] = 0;
}
}
}
Additionally, let the constraint be as such:
integers in the array should be 0 <= i <= 1 000
N should be 1 <= N <= SIZE_OF_ARRAY - 1
SIZE_OF_ARRAY should be 2 <= SIZE_OF_ARRAY <= 1 000 000
My way of implementing it is very inefficient, as it scrubs the entire array an N number of times. With huge arrays, this can take several minutes to do.
What would be the fastest and most efficient way to implement this in C?

You should look at the histogram algorithm. Since the values have to be between 0 and 1000, you just allocate an array for each of those values:
#define MAX_VALUE 1000
int occurrences[MAX_VALUE+1];
int largest[N];
int i, j;
// Mark every slot as "not filled".
for (i=0; i<N; i++)
    largest[i] = -1;
for (i=0; i<=MAX_VALUE; i++)
    occurrences[i] = 0;
// Count how often each value occurs.
for (i=0; i<SIZE_OF_ARRAY; i++)
    occurrences[array[i]]++;
// Step through the occurrences array backward to find the N largest values.
for (i=MAX_VALUE, j=0; i>=0 && j<N; i--)
    if (occurrences[i] > 0)
        largest[j++] = i;
Note that this yields only one element in largest for each unique value; modify the insertion accordingly if you want all occurrences to appear in largest. As a consequence, some elements may remain -1 if there weren't enough unique values to fill the largest array. Finally, the results in largest are sorted from largest to smallest. If you want ascending order instead, just fill the largest array from right to left.
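The modification mentioned above, where every occurrence counts, might look like the following minimal sketch. The function name n_largest_hist and its parameter list are assumptions made for illustration; the counting idea is the same as in the snippet above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VALUE 1000

/* Writes the n largest values of array[0..size-1] into largest[],
 * in descending order, repeating a value as often as it occurs.
 * Assumes every element is in 0..MAX_VALUE and n <= size. */
void n_largest_hist(const int *array, size_t size, int *largest, size_t n)
{
    size_t occurrences[MAX_VALUE + 1] = {0};
    assert(n <= size);

    /* One pass over the input to count each value. */
    for (size_t i = 0; i < size; i++) {
        assert(array[i] >= 0 && array[i] <= MAX_VALUE);
        occurrences[array[i]]++;
    }

    /* Walk the counts from the top and emit each value once per occurrence. */
    size_t filled = 0;
    for (int v = MAX_VALUE; v >= 0 && filled < n; v--) {
        for (size_t c = occurrences[v]; c > 0 && filled < n; c--)
            largest[filled++] = v;
    }
}
```

Because n <= size, the output array is always filled completely, so no -1 placeholders are needed in this variant.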

The fastest way is to recognize that data doesn't just appear (it either exists at compile time; or arrives by IO - from files, from network, etc); and therefore you can find the 3 highest values when the data is created (at compile time; or when you're parsing and sanity checking and then storing data received by IO - from files, from network, etc). This is likely to be the fastest possible way (because you're either doing nothing at run-time, or avoiding the need to look at all the data a second time).
However; in this case, if the data is modified after it was created then you'd need to update the "3 highest values" at the same time as the data is modified; which is easy if a lower value is replaced by a higher value (you just check if the new value becomes one of the 3 highest values) but involves a search if a "previously highest" value is being replaced with a lower value.
If you need to search; then it can be done with a single loop, like:
/* INT_MIN comes from <limits.h> */
int firstHighest = INT_MIN;
int secondHighest = INT_MIN;
int thirdHighest = INT_MIN;
for (int i = 0; i < SIZE_OF_ARRAY; i++) {
    if(array[i] > thirdHighest) {
        if(array[i] > secondHighest) {
            if(array[i] > firstHighest) {
                thirdHighest = secondHighest;
                secondHighest = firstHighest;
                firstHighest = array[i];
            } else {
                thirdHighest = secondHighest;
                secondHighest = array[i];
            }
        } else {
            thirdHighest = array[i];
        }
    }
}
Note: The exact code will depend on what you want to do with duplicates. As written, equal values are kept as separate entries, so the numbers 1, 2, 3, 4, 4, 4, 4 give the answer 4, 4, 4. If you want distinct values only (4, 3, 2), skip any array[i] that is equal to firstHighest, secondHighest or thirdHighest before doing the comparisons.
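The same single-loop idea can be generalized from 3 to N values by keeping a small buffer sorted in descending order. This is only a sketch under that assumption; n_largest_scan and its parameters are hypothetical names, and the approach is sensible mainly when N is small compared to the array size.

```c
#include <assert.h>
#include <stddef.h>

/* Keeps the n largest values seen so far in top[], sorted in descending
 * order. Duplicates are kept as separate entries. Requires 0 < n <= size. */
void n_largest_scan(const int *array, size_t size, int *top, size_t n)
{
    assert(n > 0 && n <= size);
    size_t filled = 0;
    for (size_t i = 0; i < size; i++) {
        int v = array[i];
        if (filled == n && v <= top[n - 1])
            continue;   /* not larger than the current n-th largest value */
        /* Start at the free slot (or overwrite the smallest entry),
         * then shift smaller entries down to keep top[] sorted. */
        size_t pos = (filled < n) ? filled++ : n - 1;
        while (pos > 0 && top[pos - 1] < v) {
            top[pos] = top[pos - 1];
            pos--;
        }
        top[pos] = v;
    }
}
```

Most elements are rejected by the first comparison, so the inner shifting loop runs rarely on typical data.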
For large amounts of data it can be accelerated with SIMD and/or multiple threads. For example; if SIMD can do "bundles of 8 integers" and you have 4 CPUs (and 4 threads); then you can split it into quarters then treat each quarter as columns of 8 elements; find the highest 3 values in each column in each quarter; then determine the highest 3 values from the "highest 3 values in each column in each quarter". In this case you will probably want to add padding (dummy values set to INT_MIN) to the end of the array to ensure that the array's total size is a multiple of SIMD width and number of CPUs.
For small amounts of data the extra overhead of setting up SIMD and/or coordinating multiple threads is going to cost more than it saves; and the "simple loop" version is likely to be as fast as it gets.
For unknown/variable amounts of data you could provide multiple alternatives (simple loop, SIMD with single thread, and SIMD with a variable number of threads) and decide which method to use (and how many threads to use) at run-time based on the amount of data.

One method I can think of is to sort the array in descending order and return the first N numbers. Since the array is sorted, those N numbers are the N largest in the array. This takes O(n log n) time, where n is the number of elements in the given array, which is a reasonable baseline for this problem.
Another approach is to use a max-heap. Build a max-heap from the given array and call pop() (or extract, or whatever you call it) N times; each pop returns the largest element remaining in the heap.
The time complexity of this approach is better than the first one: O(n + N log n), where n is the number of elements in the array and N is the number of largest elements to be found. Building the heap takes O(n), and each of the N pops takes O(log n), which sums to O(n + N log n), better than O(n log n) whenever N is much smaller than n.
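A minimal sketch of the max-heap approach in C, assuming the input array may be reordered in place. The names sift_down and n_largest_heap are illustrative and not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at index i. */
static void sift_down(int *heap, size_t size, size_t i)
{
    for (;;) {
        size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l] > heap[largest]) largest = l;
        if (r < size && heap[r] > heap[largest]) largest = r;
        if (largest == i) return;
        int tmp = heap[i]; heap[i] = heap[largest]; heap[largest] = tmp;
        i = largest;
    }
}

/* Reorders array[] in place (it becomes a heap) and writes the
 * n largest values to out[] in descending order. Requires n <= size. */
void n_largest_heap(int *array, size_t size, int *out, size_t n)
{
    assert(n <= size);
    /* Bottom-up heap construction: O(size). */
    for (size_t i = size / 2; i-- > 0; )
        sift_down(array, size, i);
    /* Each pop moves the last leaf to the root and costs O(log size). */
    for (size_t k = 0; k < n; k++) {
        out[k] = array[0];
        array[0] = array[--size];
        sift_down(array, size, 0);
    }
}
```

If the input must stay untouched, the caller would need to pass a copy, which costs O(n) extra space.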

Related

Cycling through interval in C efficiently

I have dynamically allocated array consisting of a lot of numbers (200 000+) and I have to find out, if (and how many) these numbers are contained in given interval. There can be duplicates and all the numbers are in random order.
Example of numbers I get at the beginning:
{1,2,3,1484984,48941651,489416,1816,168189161,6484,8169181,9681916,121,231,684979,795641,231484891,...}
Given interval:
<2;150000>
I created a simple algorithm with 2 for loops cycling through all numbers:
for( int j = 0; j <= numberOfRepeats; j++){
for( int i = 0; i < arraySize; i++){
if(currentNumber == array[i]){
counter++;
}
}
currentNumber++;
}
printf(" -> %d\n", counter);
}
This algorithm is too slow for my task. Is there more efficient way for me to implement my solution? Could sorting the arrays by value help in this case / wouldn't that be too slow?
Example of working program:
{ 1, 7, 22, 4, 7, 5, 11, 9, 1 }
<4;7>
-> 4
The problem was simple, and the single comment on my question answered it: there was no reason for the second loop. A single loop can do it alone.
My changed code:
for(int i = 0; i <= arraySize-1; i++){
    if(array[i] <= endOfInterval && array[i] >= startOfInterval){
        counter++;
    }
}
"This algorithm is too slow for my task. Is there more efficient way for me to implement my solution? Could sorting the arrays by value help in this case / wouldn't that be too slow?"
Of course it is slow. A single pass over the array suffices: test each element against the interval (for example n[i] >= lower_bound && n[i] <= upper_bound, or a similar test) and increment a counter when it passes.
Only if you need to handle duplicates specially (e.g. not counting them) do you have to keep track of which values you have already seen. In that case the sorting solution will be faster: a qsort(3) call is O(n*log(n)) against the O(n*n) your double loop is doing, so it runs in almost linear time. You then make a second pass over the data, bringing the complexity to O(n*log(n) + n), still lower than O(n*n) for the large amount of data you have.
Sorting has the advantage that it puts all the repeated key values together, so you only have to check whether the last element you read is the same as the one you are processing now; if it is different, count it only if it is in the specified range.
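The sort-then-scan idea for counting each distinct value only once could be sketched as follows. The function name count_distinct_in_range is hypothetical, and the sketch assumes the array may be sorted in place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Three-way comparison for qsort; avoids the overflow of x - y. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts array[] in place, then counts each distinct value in [lo, hi] once. */
size_t count_distinct_in_range(int *array, size_t size, int lo, int hi)
{
    assert(lo <= hi);
    qsort(array, size, sizeof array[0], cmp_int);
    size_t count = 0;
    for (size_t i = 0; i < size; i++) {
        if (i > 0 && array[i] == array[i - 1])
            continue;           /* duplicate of the previous element */
        if (array[i] >= lo && array[i] <= hi)
            count++;
    }
    return count;
}
```

For the question's example { 1, 7, 22, 4, 7, 5, 11, 9, 1 } with the interval <4;7>, this counts the distinct values 4, 5 and 7, so it returns 3 where the duplicate-counting loop returns 4.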
One final note: reading a set of 200,000 integers into an array in order to filter them by some criterion is normally a bad, non-scalable way to solve a problem. Your problem (selecting the elements that belong to a given interval) allows a more scalable solution by streaming: you read a number, check whether it is in the interval, then output it, or count it, or do whatever you like with it, without using a large amount of memory to hold all the numbers before starting. That is a far better way to solve the problem, as it allows you to read a truly unbounded set of numbers (coming e.g. from a file) and produce an output based on it:
#include <stdio.h>
#define A (2)
#define B (150000)
int main(void)
{
    int the_number;
    size_t count = 0;
    /* scanf returns the number of items converted; stop on EOF or bad input */
    while (scanf("%d", &the_number) == 1) {
        if (the_number >= A && the_number <= B)
            count++;
    }
    printf("%zu numbers fitted in the range\n", count);
}
With this example you can give the program 1.0E26 numbers (assuming that you have a file system large enough to hold a file of that size) and it will be able to process them, whereas you cannot create an array with the capacity to hold 10^26 values.

Algorithm to find k smallest numbers in an array in same order using O(1) auxiliary space

For example if the array is arr[] = {4, 2, 6, 1, 5},
and k = 3, then the output should be 4 2 1.
It can be done in O(nk) steps and O(1) space.
Firstly, find the kth smallest number in kn steps: find the minimum; store it in a local variable min; then find the second smallest number, i.e. the smallest number that is greater than min; store it in min; and so on... repeat the process from i = 1 to k (each time it's a linear search through the array).
Having this value, browse through the array and print all elements that are smaller or equal to min. This final step is linear.
Care has to be taken if there are duplicate values in the array. In such a case we have to increment i several times if duplicate min values are found in one pass. Additionally, besides min variable we have to have a count variable, which is reset to zero with each iteration of the main loop, and is incremented each time a duplicate min number is found.
In the final scan through the array, we print all values smaller than min, and up to count values exactly min.
The algorithm in C would look like this:
int min = MIN_VALUE, local_min;
int count;
int i, j;
i = 0;
while (i < k) {
    local_min = MAX_VALUE;
    count = 0;
    for (j = 0; j < n; j++) {
        if ((arr[j] > min || min == MIN_VALUE) && arr[j] < local_min) {
            local_min = arr[j];
            count = 1;
        }
        else if ((arr[j] > min || min == MIN_VALUE) && arr[j] == local_min) {
            count++;
        }
    }
    min = local_min;
    i += count;
}
if (i > k) {
    count = count - (i - k);
}
for (i = 0, j = 0; i < n; i++) {
    if (arr[i] < min) {
        printf("%d ", arr[i]);
    }
    else if (arr[i] == min && j < count) {
        printf("%d ", arr[i]);
        j++;
    }
}
where MIN_VALUE and MAX_VALUE can be some arbitrary values such as -infinity and +infinity, or MIN_VALUE = arr[0] and MAX_VALUE is set to be maximal value in arr (the max can be found in an additional initial loop).
Single pass solution - O(k) space (for O(1) space see below).
The order of the items is preserved (i.e. stable).
// Pseudo code
if ( arr.size <= k )
handle special case
array results[k]
int i = 0;
// init
for ( ; i < k; i++) { // or use memcpy()
results[i] = arr[i]
}
int max_val = max of results
for( ; i < arr.size; i++) {
if( arr[i] < max_val ) {
remove largest in results // move the remaining up / memmove()
add arr[i] at end of results // i.e. results[k-1] = arr[i]
max_val = new max of results
}
}
// for larger k you'd want some optimization to get the new max
// and maybe keep track of the position of max_val in the results array
Example:
4 6 2 3 1 5
4 6 2 // init
4 2 3 // remove 6, add 3 at end
2 3 1 // remove 4, add 1 at end
// or the original:
4 2 6 1 5
4 2 6 // init
4 2 1 // remove 6, add 1 -- if max is last, just replace
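The pseudo code above could be turned into C roughly as follows. This is a sketch: k_smallest_stable and index_of_max are illustrative names, and the maximum is simply rescanned after every replacement, which is fine for small k.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the largest element in a[0..n-1]; on ties the last one wins,
 * so the most recently added duplicate is evicted first. */
static size_t index_of_max(const int *a, size_t n)
{
    size_t m = 0;
    for (size_t i = 1; i < n; i++)
        if (a[i] >= a[m]) m = i;
    return m;
}

/* Copies the k smallest elements of arr[0..n-1] into results[],
 * preserving their original relative order. Requires 0 < k <= n. */
void k_smallest_stable(const int *arr, size_t n, int *results, size_t k)
{
    assert(k > 0 && k <= n);
    memcpy(results, arr, k * sizeof arr[0]);    /* init with the first k */
    size_t max_pos = index_of_max(results, k);

    for (size_t i = k; i < n; i++) {
        if (arr[i] < results[max_pos]) {
            /* Remove the current maximum by closing the gap ... */
            memmove(results + max_pos, results + max_pos + 1,
                    (k - 1 - max_pos) * sizeof results[0]);
            /* ... and append the new element at the end. */
            results[k - 1] = arr[i];
            max_pos = index_of_max(results, k);
        }
    }
}
```

Running it on 4 6 2 3 1 5 with k = 3 reproduces the trace above and leaves 2 3 1 in results.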
Optimization:
If a few extra bytes are allowed, you can optimize for larger k:
create an array size k of objects {value, position_in_list}
keep the items sorted on value:
new value: drop last element, insert the new at the right location
new max is the last element
sort the end result on position_in_list
for really large k use binary search to locate the insertion point
O(1) space:
If we're allowed to overwrite the data, the same algorithm can be used, but instead of using a separate array[k], use the first k elements of the list (and you can skip the init).
If the data has to be preserved, see my second answer with good performance for large k and O(1) space.
First find the Kth smallest number in the array.
Look at https://www.geeksforgeeks.org/kth-smallestlargest-element-unsorted-array-set-2-expected-linear-time/
The link above shows how you can use randomized quickselect to find the kth smallest element in O(n) time on average.
Once you have the Kth smallest element, loop through the array and print all elements that are less than or equal to the Kth smallest number.
int small={Kth smallest number in the array}
for(int i=0;i<array.length;i++){
if(array[i]<=small){
System.out.println(array[i]+ " ");
}
}
A baseline (complexity at most 3n-2 for k=3):
find the min M1 from the end of the list and its position P1 (store it in out[2])
redo it from P1 to find M2 at P2 (store it in out[1])
redo it from P2 to find M3 (store it in out[0])
It can undoubtedly be improved.
Solution with O(1) space and large k (for example 100,000) with only a few passes through the list.
In my first answer I presented a single pass solution using O(k) space with an option for single pass O(1) space if we are allowed to overwrite the data.
For data that cannot be overwritten, ciamej provided a O(1) solution requiring up to k passes through the data, which works great.
However, for large lists (n) and large k we may want a faster solution. For example, with n=100,000,000 (distinct values) and k=100,000 we would have to check 10 trillion items with a branch on each item + an extra pass to get those items.
To reduce the passes over n we can create a small histogram of ranges. This requires a small storage space for the histogram, but since O(1) means constant space (i.e. not depending on n or k) I think we're allowed to do that. That space could be as small as an array of 2 * uint32. Histogram size should be a power of two, which allows us to use bit masking.
To keep the following example small and simple, we'll use a list containing 16-bit positive integers and a histogram of uint32[256] - but it will work with uint32[2] as well.
First, find the k-th smallest number - only 2 passes required:
uint32 hist[256];
First pass: group (count) by multiples of 256 - no branching besides the loop
loop:
hist[(arr[i] & 0xff00) >> 8]++;
Now we have a count for each range and can calculate which bucket our k is in.
Save the total count up to that bucket and reset the histogram.
Second pass: fill the histogram again,
now masking the lower 8 bits and only for the numbers belonging in that range.
The range check can also be done with a mask
After this last pass, all values represented in the histogram are unique
and we can easily calculate where our k-th number is.
If the count in that slot (which represents our max value after restoring
with the previous mask) is higher than one, we'll have to remember that
when printing out the numbers.
This is explained in ciamej's post, so I won't repeat it here.
---
With hist[4] (2 bits per pass) and a list of 32-bit integers we would need 16 passes.
The algorithm can easily be adjusted for signed integers.
Example:
k = 7
uint32_t hist[256]; // can be as small as hist[2]
uint16_t arr[]:
88
258
4
524
620
45
440
112
380
580
88
178
Fill histogram with:
hist[(arr[i] & 0xff00) >> 8]++;
hist count
0 (0-255) 6
1 (256-511) 3 -> k
2 (512-767) 3
...
k is in hist[1] -> (256-511)
Clear histogram and fill with range (256-511):
Fill histogram with:
if ((arr[i] & 0xff00) == 0x0100)
    hist[arr[i] & 0xff]++;
Numbers in this range are:
258 & 0xff = 2
440 & 0xff = 184
380 & 0xff = 124
hist count
0 0
1 0
2 1 -> k
... 0
124 1
... 0
184 1
... 0
k - 6 (first pass) = 1
k is in hist[2], which is 2 + 256 = 258
Loop through arr[] to display the numbers <= 258 in preserved order.
Take care of possible duplicates of the highest number (that is, when hist[2] > 1 in this case);
we can easily calculate how many of those we have to print.
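The two passes of the 16-bit example can be sketched in C as follows. The function name kth_smallest_u16 and the extra output parameter are assumptions made for this illustration; the final "print in preserved order" loop is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the k-th smallest value (k is 1-based) of arr[0..n-1] using two
 * passes and a 256-entry histogram. *extra receives how many copies of that
 * value belong to the k smallest. Requires 1 <= k <= n. */
uint16_t kth_smallest_u16(const uint16_t *arr, size_t n, size_t k, size_t *extra)
{
    uint32_t hist[256] = {0};
    assert(k >= 1 && k <= n);

    /* Pass 1: count by high byte. */
    for (size_t i = 0; i < n; i++)
        hist[(arr[i] & 0xff00) >> 8]++;

    size_t below = 0;       /* elements in buckets before the chosen one */
    unsigned hi = 0;
    while (below + hist[hi] < k)
        below += hist[hi++];

    /* Pass 2: count by low byte, only for values in the chosen range. */
    memset(hist, 0, sizeof hist);
    for (size_t i = 0; i < n; i++)
        if ((unsigned)(arr[i] >> 8) == hi)
            hist[arr[i] & 0xff]++;

    unsigned lo = 0;
    while (below + hist[lo] < k)
        below += hist[lo++];

    *extra = k - below;     /* copies of the k-th value to print */
    return (uint16_t)((hi << 8) | lo);
}
```

With the example list and k = 7 this yields 258 with one copy to print; the caller then prints every element below 258 plus that many elements equal to 258.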
Further optimization:
If we can expect k to be in the lower ranges, we can even optimize this further by using the log2 values instead of fixed ranges:
There is a single CPU instruction to count the leading zero bits (or one bits)
so we don't have to call a standard log() function
but can call an intrinsic function instead.
This would require hist[65] for a list with 64-bit (positive) integers.
We would then have something like:
hist[ 64 - n_leading_zero_bits ]++;
This way the ranges we have to use in the following passes would be smaller.
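A sketch of the log2-based bucketing, written with a portable bit-counting loop so that it does not depend on a particular compiler. In practice the loop would be replaced by a count-leading-zeros intrinsic (for example __builtin_clzll on GCC and Clang, which is undefined for an input of 0 and therefore needs a separate zero check). The names bit_length_u64 and log2_histogram are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of significant bits in x: 0 for x == 0, otherwise floor(log2(x)) + 1.
 * Equivalent to 64 minus the number of leading zero bits. */
static unsigned bit_length_u64(uint64_t x)
{
    unsigned bits = 0;
    while (x != 0) {
        bits++;
        x >>= 1;
    }
    return bits;
}

/* Fills hist[0..64] so that hist[b] counts the values with bit length b,
 * i.e. bucket b >= 1 covers the range [2^(b-1), 2^b - 1]. */
void log2_histogram(const uint64_t *arr, size_t n, uint32_t hist[65])
{
    assert(hist != NULL);
    for (unsigned b = 0; b < 65; b++)
        hist[b] = 0;
    for (size_t i = 0; i < n; i++)
        hist[bit_length_u64(arr[i])]++;
}
```

The low buckets cover very narrow ranges (bucket 1 holds only the value 1, bucket 2 holds 2 and 3), which is why this helps when k is expected to fall among small values.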

Codility: MaxZeroProduct - complexity issues

My solution scored 100% correctness, but 0% Performance.
I just can't figure out how to minimize time complexity.
Problem:
Write a function:
int solution(int A[], int N);
that, given an array of N positive integers, returns the maximum number of trailing zeros of the number obtained by multiplying three different elements from the array. Numbers are considered different if they are at different positions in the array.
For example, given A = [7, 15, 6, 20, 5, 10], the function should return 3 (you can obtain three trailing zeros by taking the product of numbers 15, 20 and 10 or 20, 5 and 10).
For another example, given A = [25, 10, 25, 10, 32], the function should return 4 (you can obtain four trailing zeros by taking the product of numbers 25, 25 and 32).
Assume that:
N is an integer within the range [3..100,000];
each element of array A is an integer within the range [1..1,000,000,000].
Complexity:
expected worst-case time complexity is O(N*log(max(A)));
expected worst-case space complexity is O(N) (not counting the storage required for input arguments).
Solution:
the idea:
factorize each element into pairs of 5's and 2's
sum each 3 pairs into one pair - this costs O(N^3)
find the pair whose minimum coordinate value is the biggest
return that minimum coordinate value
the code:
int solution(int A[], int N) {
int fives = 0, twos = 0, max_zeros = 0;
int(*factors)[2] = calloc(N, sizeof(int[2])); //each item (x,y) represents the amount of 5's and 2's of the corresponding item in A
for (int i = 0; i< N; i++) {
factorize(A[i], &fives, &twos);
factors[i][0] = fives;
factors[i][1] = twos;
}
//O(N^3)
for (int i = 0; i<N; i++) {
for (int j = i + 1; j<N; j++) {
for (int k = j + 1; k<N; k++) {
int x = factors[i][0] + factors[j][0] + factors[k][0];
int y = factors[i][1] + factors[j][1] + factors[k][1];
max_zeros = max(max_zeros, min(x, y));
}
}
}
return max_zeros;
}
void factorize(int val, int* fives, int* twos) {
int tmp = val;
*fives = 0, *twos = 0;
if (val == 0) return;
while (val % 5 == 0) { //factors of 5
val /= 5;
(*fives)++;
}
while (val % 2 == 0) { //factors of 2
val /= 2;
(*twos)++;
}
}
I can't figure out how else i can iterate over the N-sized array in order to find the optimal 3 items in time O(N*log(max(A))).
Since 2^30 > 1e9 and 5^13 > 1e9, there's a limit of 30 * 13 = 390 different pairs of factors of 2 and 5 in the array, no matter how large the array. This is an upper bound (the actual number is 213).
Discard all but three representatives from the array for each pair, and then your O(N^3) algorithm is probably fast enough.
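The "three representatives per pair" reduction might be sketched like this. It is one possible reading of the suggestion above, not a verified Codility submission; the constants MAX_TWOS and MAX_FIVES and the kept table are names introduced here.

```c
#include <assert.h>

#define MAX_TWOS  30   /* exponents 0..29, since 2^30 > 1e9 */
#define MAX_FIVES 13   /* exponents 0..12, since 5^13 > 1e9 */

int solution(int A[], int N)
{
    /* kept[t][f] = how many elements with exactly t twos and f fives
     * we keep, capped at 3 because a triple never needs more. */
    unsigned char kept[MAX_TWOS][MAX_FIVES] = {{0}};
    int twos[MAX_TWOS * MAX_FIVES * 3], fives[MAX_TWOS * MAX_FIVES * 3];
    int m = 0;

    assert(N >= 3);
    for (int i = 0; i < N; i++) {
        int v = A[i], t = 0, f = 0;
        assert(v >= 1 && v <= 1000000000);
        while (v % 2 == 0) { v /= 2; t++; }
        while (v % 5 == 0) { v /= 5; f++; }
        if (kept[t][f] < 3) {
            kept[t][f]++;
            twos[m] = t;
            fives[m] = f;
            m++;
        }
    }

    /* Brute force over the (at most 1170) representatives. */
    int best = 0;
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < m; j++)
            for (int k = j + 1; k < m; k++) {
                int t = twos[i] + twos[j] + twos[k];
                int f = fives[i] + fives[j] + fives[k];
                int zeros = t < f ? t : f;
                if (zeros > best) best = zeros;
            }
    return best;
}
```

The factorization pass is O(N*log(max(A))), and the cubic loop runs over a set whose size no longer depends on N.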
If it's still not fast enough, you can continue by applying dynamic programming, computing P[i,j], the largest product of factors of 2s and 5s of pairs of elements with index <=i of the form x * 2^y * 5^(y+j) (where x is divisible by neither 2 nor 5). This table can then be used in a second dynamic programming pass to find the product of three numbers with the most 0's.
In the real world I don't like such meta-thinking, but still, we are faced with an artificial problem with some artificial restrictions...
Since space complexity is O(N), we can't afford dynamic programming based on the initial input. We can't even make a map of N*factors. Well, we can afford a map of N*2 anyway, but that's mostly all we can do.
Since time complexity is O(N*log(max(A))), we can allow ourselves to factorize the items and do some simple one-way reduction. Probably we can sort the items with counting sort - it's a bit more like N*log^2(max(A)) for 2-index sorting, but big O will even it out.
If my spider sense is right, we should simply pick something out of this counts array and polish it with one run through the array. Something like the best count for 2, then the best for 5, and then we can enumerate the rest of the elements to find the best overall product. It's just a heuristic, but dimensions don't lie!
Just my 2 cents

Effective Algorithms for selecting the top k ( in percent) items from a datastream:

I have to repeatedly sort an array containing 300 random elements. But I have to do a special kind of sort: I need the 5% smallest values from a subset of the array, then some value is calculated and the subset is increased. Then the value is calculated again and the subset is increased again, and so on until the subset contains the whole array.
The subset starts with the first 10 elements and is increased by 10 elements after each step.
i.e. :
subset-size k=ceil(5%*subset)
10 1 (so just the smallest element)
20 1 (so also just the smallest)
30 2 (smallest and second smallest)
...
The calculated value is basically a sum of all elements smaller than k and the specially weighted k smallest element.
In code:
k = ceil(0.05 * subset) -1; // -1 because array index starts with 0...
temp = 0.0;
int i;
for( i=0; i<k; i++)
    temp += smallestElements[i];
temp += b * smallestElements[i]; // i == k here: the weighted k-th smallest element
I have implemented myself a selection sort based algorithm (code at the end of this post). I use MAX(k) pointers to keep track of the k smallest elements. Therefore I unnecessarily sort all elements smaller than k :/
Furthermore I know selection sort is bad for performance, which is unfortunately crucial in my case.
I tried figuring out a way how I could use some quick- or heapsort based algorithm. I know that quickselect or heapselect are perfect for finding the k smallest elements if k and the subset is fixed.
But because my subset is more like an input stream of data, I think that quicksort-based algorithms drop out.
I know that heapselect would be perfect for a data stream if k is fixed. But I haven't managed to adjust heapselect for dynamic k's without big performance drops, so it ends up less effective than my selection-sort based version :( Can anyone help me to modify heapselect for dynamic k's?
If there is no better algorithm, maybe you can find a different/faster approach for my selection sort implementation. Here is a minimal example of my implementation; the calculated variable isn't used in this example, so don't worry about it. (In my real program I have just unrolled some loops manually for better performance.)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define ARRAY_SIZE 300
#define STEP_SIZE 10
float sortStream( float* array, float** pointerToSmallest, int k_max){
int i,j,k,last = k_max-1;
float temp=0.0;
// init first two pointers
if( array[0] < array[1] ){
pointerToSmallest[0] = &array[0];
pointerToSmallest[1] = &array[1];
}else{
pointerToSmallest[0] = &array[1];
pointerToSmallest[1] = &array[0];
}
// Init remaining pointers until i= k_max
for(i=2; i< k_max;++i){
if( *pointerToSmallest[i-1] < array[i] ){
pointerToSmallest[i] = &array[i];
}else{
pointerToSmallest[i] = pointerToSmallest[i-1];
for(j=0; j<i-1 && *pointerToSmallest[i-2-j] > array[i];++j)
pointerToSmallest[i-1-j] = pointerToSmallest[i-2-j];
pointerToSmallest[i-1-j]=&array[i];
}
if((i+1)%STEP_SIZE==0){
k = ceil(0.05 * i)-1;
for(j=0; j<k; j++)
temp += *pointerToSmallest[j];
temp += 2 * (*pointerToSmallest[k]);
}
}
// Selection sort remaining elements
for( ; i< ARRAY_SIZE; ++i){
if( *pointerToSmallest[ last ] > array[i] ) {
for(j=0; j != last && *pointerToSmallest[ last-1-j] > array[i];++j)
pointerToSmallest[last-j] = pointerToSmallest[last-1-j];
pointerToSmallest[last-j] = &array[i];
}
if( (i+1)%STEP_SIZE==0){
k = ceil(0.05 * i)-1;
for(j=0; j<k; j++)
temp += *pointerToSmallest[j];
temp += 2 * (*pointerToSmallest[k]);
}
}
return temp;
}
int main(void){
int i,k_max = ceil( 0.05 * ARRAY_SIZE );
float* array = (float*)malloc ( ARRAY_SIZE * sizeof(float));
float** pointerToSmallest = (float**)malloc( k_max * sizeof(float*));
for( i=0; i<ARRAY_SIZE; i++)
array[i]= rand() / (float)RAND_MAX*100-50;
// just return a, so that the compiler doens't drop the function call
float a = sortStream(array,pointerToSmallest, k_max);
return (int)a;
}
Thank you very much
By using two heaps to store all items from the stream, you can:
find the top p% elements in O(1)
update the data structure (two heaps) in O(log N)
Assume we now have N elements and k = p% * N:
a min heap (LargerPartHeap) stores the top k items
a max heap (SmallerPartHeap) stores the other (N - k) items.
All items in SmallerPartHeap are less than or equal to the minimum item of LargerPartHeap (the top item of LargerPartHeap).
1. For the query "what are the top p% elements?", simply return LargerPartHeap.
2. For the update "new element x from stream":
2.a Check the new k' = (N + 1) * p%; if k' = k + 1, move the top of SmallerPartHeap to LargerPartHeap. - O(log N)
2.b If x is larger than the top element (min element) of LargerPartHeap, insert x into LargerPartHeap and move the top of LargerPartHeap to SmallerPartHeap; otherwise, insert x into SmallerPartHeap. - O(log N)
I believe heap sort is far too complicated for this particular problem, even though that or other priority queue algorithms are well suited to get N minimum or maximum items from a stream.
The first thing to notice is the constraint 0.05 * 300 = 15. That is the maximum amount of data that has to be sorted at any moment. Also, during each iteration one has to add 10 elements. The overall operation, in place, could be:
for (i = 0; i < 30; i++)
{
if (i != 1)
qsort(input + i*10, 10, sizeof(input[0]), cmpfunc);
else
qsort(input, 20, sizeof(input[0]), cmpfunc);
if (i > 1)
merge_sort15(input, 15, input + i*10, 10, cmpfunc);
}
When i==1, one could also merge sort input and input+10 to produce completely sorted array of 20 inplace, since that has lower complexity than the generic sort. Here the "optimizing" is also on minimizing the primitives of the algorithm.
Merge_sort15 would only consider the first 15 elements of the first array and the first 10 elements of the next one.
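The merge step could look roughly like the sketch below. It is not the in-place merge_sort15 from the outline above: merge_keep_smallest is a hypothetical helper that writes to a separate output buffer and drops the comparison callback for brevity, assuming float data as in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Merges two ascending runs a[0..na-1] and b[0..nb-1] and writes only the
 * `keep` smallest values to out[] (ascending). out must not overlap a or b.
 * Requires keep <= na + nb. */
void merge_keep_smallest(const float *a, size_t na,
                         const float *b, size_t nb,
                         float *out, size_t keep)
{
    assert(keep <= na + nb);
    size_t i = 0, j = 0;
    for (size_t n = 0; n < keep; n++) {
        /* Take from a when b is exhausted or a's head is not larger. */
        if (j >= nb || (i < na && a[i] <= b[j]))
            out[n] = a[i++];
        else
            out[n] = b[j++];
    }
}
```

For the parameters in the question it would be called with the 15 smallest values so far, a freshly sorted block of 10, and keep = 15, so each step touches at most 15 output slots.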
EDIT The parameters of the problem will have a considerable effect in choosing the right algorithm; here selecting 'sort 10 items' as basic unit will allow one half of the problem to be parallelized, namely sorting 30 individual blocks of 10 items each -- a problem which can be efficiently solved with fixed pipeline algorithm using sorting networks. With different parametrization such an approach may not be feasible.

Find the Smallest Integer Not in a List

An interesting interview question that a colleague of mine uses:
Suppose that you are given a very long, unsorted list of unsigned 64-bit integers. How would you find the smallest non-negative integer that does not occur in the list?
FOLLOW-UP: Now that the obvious solution by sorting has been proposed, can you do it faster than O(n log n)?
FOLLOW-UP: Your algorithm has to run on a computer with, say, 1GB of memory
CLARIFICATION: The list is in RAM, though it might consume a large amount of it. You are given the size of the list, say N, in advance.
If the datastructure can be mutated in place and supports random access then you can do it in O(N) time and O(1) additional space. Just go through the array sequentially and for every index write the value at the index to the index specified by value, recursively placing any value at that location to its place and throwing away values > N. Then go again through the array looking for the spot where value doesn't match the index - that's the smallest value not in the array. This results in at most 3N comparisons and only uses a few values worth of temporary space.
def smallest_missing(array):
    N = len(array)
    # Pass 1, move every value to the position of its value
    for cursor in range(N):
        target = array[cursor]
        while target < N and target != array[target]:
            new_target = array[target]
            array[target] = target
            target = new_target
    # Pass 2, find first location where the index doesn't match the value
    for cursor in range(N):
        if array[cursor] != cursor:
            return cursor
    return N
Here's a simple O(N) solution that uses O(N) space. I'm assuming that we are restricting the input list to non-negative numbers and that we want to find the first non-negative number that is not in the list.
Find the length of the list; lets say it is N.
Allocate an array of N booleans, initialized to all false.
For each number X in the list, if X is less than N, set the X'th element of the array to true.
Scan the array starting from index 0, looking for the first element that is false. If you find the first false at index I, then I is the answer. Otherwise (i.e. when all elements are true) the answer is N.
In practice, the "array of N booleans" would probably be encoded as a "bitmap" or "bitset" represented as a byte or int array. This typically uses less space (depending on the programming language) and allows the scan for the first false to be done more quickly.
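The four steps above could be sketched in C as follows, using one byte per flag for simplicity rather than a packed bitmap. The function name smallest_missing and the UINT64_MAX error marker are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns the smallest non-negative integer not present in list[0..n-1].
 * Uses n bytes of scratch space (a bitmap would use n bits).
 * On allocation failure returns UINT64_MAX as an error marker (an
 * assumption of this sketch; a real answer is always <= n). */
uint64_t smallest_missing(const uint64_t *list, size_t n)
{
    assert(list != NULL || n == 0);
    bool *seen = calloc(n ? n : 1, sizeof *seen);   /* all false */
    if (seen == NULL)
        return UINT64_MAX;

    /* Values >= n cannot be the answer, so they are ignored. */
    for (size_t i = 0; i < n; i++)
        if (list[i] < n)
            seen[list[i]] = true;

    /* The first unset flag is the answer; if all are set, it is n. */
    size_t answer = 0;
    while (answer < n && seen[answer])
        answer++;

    free(seen);
    return answer;
}
```

The input list is only read, never modified, which is the main difference from the in-place variant shown earlier.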
This is how / why the algorithm works.
Suppose that the N numbers in the list are not distinct, or that one or more of them is greater than or equal to N. This means that there must be at least one number in the range 0 .. N - 1 that is not in the list. So the problem of finding the smallest missing number must therefore reduce to the problem of finding the smallest missing number less than N. This means that we don't need to keep track of numbers that are greater than or equal to N ... because they won't be the answer.
The alternative to the previous paragraph is that the list is a permutation of the numbers from 0 .. N - 1. In this case, step 3 sets all elements of the array to true, and step 4 tells us that the first "missing" number is N.
The computational complexity of the algorithm is O(N) with a relatively small constant of proportionality. It makes two linear passes through the list, or just one pass if the list length is known to start with. There is no need to hold the entire list in memory, so the algorithm's asymptotic memory usage is just what is needed to represent the array of booleans; i.e. O(N) bits.
(By contrast, algorithms that rely on in-memory sorting or partitioning assume that you can represent the entire list in memory. In the form the question was asked, this would require O(N) 64-bit words.)
@Jorn comments that steps 1 through 3 are a variation on counting sort. In a sense he is right, but the differences are significant:
A counting sort requires an array of (at least) Xmax - Xmin counters where Xmax is the largest number in the list and Xmin is the smallest number in the list. Each counter has to be able to represent N states; i.e. assuming a binary representation it has to have an integer type (at least) ceiling(log2(N)) bits.
To determine the array size, a counting sort needs to make an initial pass through the list to determine Xmax and Xmin.
The minimum worst-case space requirement is therefore ceiling(log2(N)) * (Xmax - Xmin) bits.
By contrast, the algorithm presented above simply requires N bits in the worst and best cases.
However, this analysis leads to the intuition that if the algorithm made an initial pass through the list looking for a zero (and counting the list elements if required), it would give a quicker answer using no space at all if it found the zero. It is definitely worth doing this if there is a high probability of finding at least one zero in the list. And this extra pass doesn't change the overall complexity.
EDIT: I've changed the description of the algorithm to use "array of booleans" since people apparently found my original description using bits and bitmaps to be confusing.
Since the OP has now specified that the original list is held in RAM and that the computer has only, say, 1GB of memory, I'm going to go out on a limb and predict that the answer is zero.
1GB of RAM means the list can have at most 134,217,728 numbers in it. But there are 2^64 = 18,446,744,073,709,551,616 possible numbers. So the probability that zero is in the list is 1 in 137,438,953,472.
In contrast, my odds of being struck by lightning this year are 1 in 700,000. And my odds of getting hit by a meteorite are about 1 in 10 trillion. So I'm about ten times more likely to be written up in a scientific journal due to my untimely death by a celestial object than the answer not being zero.
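The figures in that estimate can be checked with a few lines of Python. This is only a sanity check of the arithmetic, assuming 1GB means 2^30 bytes and each number occupies 8 bytes; the variable names are mine.

```python
# Assumption: 1GB = 2**30 bytes, each list entry is a 64-bit (8-byte) integer.
BYTES_IN_1GB = 2**30
BYTES_PER_NUMBER = 8

max_list_length = BYTES_IN_1GB // BYTES_PER_NUMBER      # 2**27 entries
possible_values = 2**64                                 # distinct 64-bit values
odds_denominator = possible_values // max_list_length   # 2**37

print(max_list_length)    # 134217728
print(possible_values)    # 18446744073709551616
print(odds_denominator)   # 137438953472
```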
As pointed out in other answers you can do a sort, and then simply scan up until you find a gap.
You can improve the algorithmic complexity to O(N) and keep O(N) space by using a modified QuickSort where you eliminate partitions which are not potential candidates for containing the gap.
On the first partition phase, remove duplicates.
Once the partitioning is complete look at the number of items in the lower partition
Is this value equal to the value used for creating the partition?
If so then it implies that the gap is in the higher partition.
Continue with the quicksort, ignoring the lower partition
Otherwise the gap is in the lower partition
Continue with the quicksort, ignoring the higher partition
This saves a large number of computations.
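Here is a rough Python sketch of that partition-and-discard idea. It is my own reading of the steps above, not tested on huge inputs: it builds new lists instead of partitioning in place, picks a random pivot, and ignores negative values. The function name is made up for illustration.

```python
import random

def smallest_missing_by_partition(values):
    # Remove duplicates up front (and ignore negatives, which cannot matter).
    items = list({v for v in values if v >= 0})
    lo = 0  # invariant: every integer in [0, lo) is present
    while items:
        pivot = random.choice(items)
        lower = [v for v in items if v < pivot]
        if len(lower) == pivot - lo:
            # [lo, pivot) is completely filled and pivot itself is present,
            # so the gap must be in the higher partition.
            lo = pivot + 1
            items = [v for v in items if v > pivot]
        else:
            # Something in [lo, pivot) is missing: keep only the lower partition.
            items = lower
    return lo

print(smallest_missing_by_partition([0, 1, 2, 4, 5]))  # 3
```

Each round throws away at least the pivot and, on average, a constant fraction of the remaining items, which is where the expected linear running time comes from.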
To illustrate one of the pitfalls of O(N) thinking, here is an O(N) algorithm that uses O(1) space.
for i in [0..2^64):
    if i not in list: return i
print "no 64-bit integers are missing"
Since the numbers are all 64 bits long, we can use radix sort on them, which is O(n). Sort 'em, then scan 'em until you find what you're looking for.
if the smallest number is zero, scan forward until you find a gap. If the smallest number is not zero, the answer is zero.
For a space-efficient method, provided all values are distinct, you can do it in O(k) space and O(k*log(N)*N) time. There is no data movement, and all operations are elementary (adding, subtracting).
1. set U = N; L = 0
2. First partition the number space in k regions. Like this:
0->(1/k)*(U-L) + L, 0->(2/k)*(U-L) + L, 0->(3/k)*(U-L) + L ... 0->(U-L) + L
3. Find how many numbers (count{i}) are in each region. (N*k steps)
4. Find the first region (h) that isn't full. That means count{h} < upper_limit{h}. (k steps)
5. if h - count{h-1} = 1 you've got your answer
6. set U = count{h}; L = count{h-1}
7. goto 2
This can be improved using hashing (thanks to Nic for this idea).
same
First partition the number space in k regions. Like this:
L + (i/k)->L + (i+1/k)*(U-L)
inc count{j} using j = (number - L)/k (if L < number < U)
find first region (h) that doesn't have k elements in it
if count{h} = 1 h is your answer
set U = maximum value in region h L = minimum value in region h
This will run in O(log(N)*N).
I'd just sort them then run through the sequence until I find a gap (including the gap at the start between zero and the first number).
In terms of an algorithm, something like this would do it:
def smallest_not_in_list(list):
    sort(list)
    if list[0] != 0:
        return 0
    for i = 1 to list.last:
        if list[i] != list[i-1] + 1:
            return list[i-1] + 1
    if list[list.last] == 2^64 - 1:
        assert ("No gaps")
    return list[list.last] + 1
Of course, if you have a lot more memory than CPU grunt, you could create a bitmask of all possible 64-bit values and just set the bits for every number in the list. Then look for the first 0-bit in that bitmask. That turns it into an O(n) operation in terms of time but pretty damned expensive in terms of memory requirements :-)
I doubt you could improve on O(n) since I can't see a way of doing it that doesn't involve looking at each number at least once.
The algorithm for that one would be along the lines of:
def smallest_not_in_list(list):
    bitmask = mask_make(2^64) // might take a while :-)
    mask_clear_all (bitmask)
    for i = 0 to list.last:
        mask_set (bitmask, list[i])
    for i = 0 to 2^64 - 1:
        if mask_is_clear (bitmask, i):
            return i
    assert ("No gaps")
Sort the list, look at the first and second elements, and start going up until there is a gap.
We could use a hash table to hold the numbers. Once all numbers are done, run a counter from 0 till we find the lowest. A reasonably good hash will hash and store in constant time, and retrieves in constant time.
for every i in X // One scan, Θ(n)
    hashtable.put(i, i); // O(1)
low = 0;
while (hashtable.get(low) <> null) // at most n+1 times
    low++;
print low;
The worst case is when there are n elements in the array and they are {0, 1, ... n-1}, in which case the answer will be obtained at n, still keeping it O(n).
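For what it's worth, the same idea is only a few lines in Python, using the built-in set as the hash table. A minimal sketch; the function name is made up for illustration.

```python
def smallest_missing_hashed(values):
    seen = set(values)   # one scan, expected O(1) per insert
    low = 0
    while low in seen:   # runs at most len(values) + 1 times
        low += 1
    return low

print(smallest_missing_hashed([0, 1, 2, 4, 5]))  # 3
```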
You can do it in O(n) time and O(1) additional space, although the hidden factor is quite large. This isn't a practical way to solve the problem, but it might be interesting nonetheless.
For every unsigned 64-bit integer (in ascending order) iterate over the list until you find the target integer or you reach the end of the list. If you reach the end of the list, the target integer is the smallest integer not in the list. If you reach the end of the 64-bit integers, every 64-bit integer is in the list.
Here it is as a Python function:
def smallest_missing_uint64(source_list):
    the_answer = None
    target = 0
    while target < 2**64:
        target_found = False
        for item in source_list:
            if item == target:
                target_found = True
        if not target_found and the_answer is None:
            the_answer = target
        target += 1
    return the_answer
This function is deliberately inefficient to keep it O(n). Note especially that the function keeps checking target integers even after the answer has been found. If the function returned as soon as the answer was found, the number of times the outer loop ran would be bound by the size of the answer, which is bound by n. That change would make the run time O(n^2), even though it would be a lot faster.
Thanks to egon, swilden, and Stephen C for my inspiration. First, we know the bounds of the goal value because it cannot be greater than the size of the list. Also, a 1GB list could contain at most 134217728 (128 * 2^20) 64-bit integers.
Hashing part
I propose using hashing to dramatically reduce our search space. First, square root the size of the list. For a 1GB list, that's N=11,586. Set up an integer array of size N. Iterate through the list, and take the square root* of each number you find as your hash. In your hash table, increment the counter for that hash. Next, iterate through your hash table. The first bucket you find that is not equal to its max size defines your new search space.
Bitmap part
Now set up a regular bit map equal to the size of your new search space, and again iterate through the source list, filling out the bitmap as you find each number in your search space. When you're done, the first unset bit in your bitmap will give you your answer.
This will be completed in O(n) time and O(sqrt(n)) space.
(*You could use something like bit shifting to do this a lot more efficiently, and just vary the number and size of buckets accordingly.)
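A possible Python sketch of the two phases follows. Note what I changed and assumed: buckets are chosen by integer division rather than a square root (in the spirit of the footnote), the values are assumed to be distinct and non-negative (duplicates would throw off the bucket counts), and a plain list of booleans stands in for a packed bitmap. The function name is hypothetical.

```python
from math import isqrt

def smallest_missing_two_pass(values):
    # Assumption: values are distinct, so the answer lies in [0, n].
    n = len(values)
    width = isqrt(n) + 1                  # about sqrt(n) numbers per bucket
    buckets = [0] * (n // width + 1)      # about sqrt(n) buckets cover [0, n]

    # Hashing part: count how many values fall into each bucket.
    for v in values:
        if 0 <= v <= n:
            buckets[v // width] += 1

    # The first bucket that is not full defines the new search space.
    for k, count in enumerate(buckets):
        start = k * width
        capacity = min(width, n + 1 - start)
        if count < capacity:
            break

    # Bitmap part: mark the values that land in that bucket.
    seen = [False] * capacity
    for v in values:
        if start <= v < start + capacity:
            seen[v - start] = True
    return start + seen.index(False)

print(smallest_missing_two_pass([0, 1, 2, 4, 5]))  # 3
```

Both helper arrays hold on the order of sqrt(n) entries, and the list is scanned twice.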
Well if there is only one missing number in a list of numbers, the easiest way to find the missing number is to sum the series and subtract each value in the list. The final value is the missing number.
int i = 0;
while ( i < Array.Length)
{
    if (Array[i] == i + 1)
    {
        i++;
    }
    if (i < Array.Length)
    {
        if (Array[i] <= Array.Length)
        {//SWap
            int temp = Array[i];
            int AnoTemp = Array[temp - 1];
            Array[temp - 1] = temp;
            Array[i] = AnoTemp;
        }
        else
            i++;
    }
}
for (int j = 0; j < Array.Length; j++)
{
    if (Array[j] > Array.Length)
    {
        Console.WriteLine(j + 1);
        j = Array.Length;
    }
    else
    if (j == Array.Length - 1)
        Console.WriteLine("Not Found !!");
}
}
Here's my answer written in Java:
Basic Idea:
1- Loop through the array throwing away duplicate positive, zeros, and negative numbers while summing up the rest, getting the maximum positive number as well, and keep the unique positive numbers in a Map.
2- Compute the sum as max * (max+1)/2.
3- Find the difference between the sums calculated at steps 1 & 2
4- Loop again from 1 to the minimum of [sums difference, max] and return the first number that is not in the map populated in step 1.
public static int solution(int[] A) {
    if (A == null || A.length == 0) {
        throw new IllegalArgumentException();
    }
    int sum = 0;
    Map<Integer, Boolean> uniqueNumbers = new HashMap<Integer, Boolean>();
    int max = A[0];
    for (int i = 0; i < A.length; i++) {
        if(A[i] < 0) {
            continue;
        }
        if(uniqueNumbers.get(A[i]) != null) {
            continue;
        }
        if (A[i] > max) {
            max = A[i];
        }
        uniqueNumbers.put(A[i], true);
        sum += A[i];
    }
    int completeSum = (max * (max + 1)) / 2;
    for(int j = 1; j <= Math.min((completeSum - sum), max); j++) {
        if(uniqueNumbers.get(j) == null) { //O(1)
            return j;
        }
    }
    //All negative case
    if(uniqueNumbers.isEmpty()) {
        return 1;
    }
    return 0;
}
As Stephen C smartly pointed out, the answer must be a number smaller than the length of the array. I would then find the answer by binary search. This optimizes the worst case (so the interviewer can't catch you in a 'what if' pathological scenario). In an interview, do point out you are doing this to optimize for the worst case.
The way to use binary search is to subtract the number you are looking for from each element of the array, and check for negative results.
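One way to read this is as a binary search over the answer itself, not over the array: for a candidate mid, count how many elements fall in [0, mid] and compare with mid + 1. Below is a hedged Python sketch of that reading. It assumes the values are distinct (otherwise the count test is not valid) and the function name is made up.

```python
def smallest_missing_binary_search(values):
    # Assumption: values are distinct, so the answer is in [0, len(values)].
    lo, hi = 0, len(values)
    while lo < hi:
        mid = (lo + hi) // 2
        # One O(n) pass per step: how many values lie in [0, mid]?
        count = sum(1 for v in values if 0 <= v <= mid)
        if count == mid + 1:
            lo = mid + 1     # [0, mid] is full, so the answer is above mid
        else:
            hi = mid         # something in [0, mid] is missing
    return lo

print(smallest_missing_binary_search([0, 1, 2, 4, 5]))  # 3
```

This takes O(n log n) time in every case and O(1) extra space, and it never modifies the input.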
I like the "guess zero" approach. If the numbers were random, zero is highly probable. If the "examiner" set a non-random list, then add one and guess again:
LowNum=0
i=0
do forever {
    if i == N then leave /* Processed entire array */
    if array[i] == LowNum {
        LowNum++
        i=0
    }
    else {
        i++
    }
}
display LowNum
The worst case is n*N with n=N, but in practice n is highly likely to be a small number (eg. 1)
I am not sure if I got the question. But if for list 1,2,3,5,6 and the missing number is 4, then the missing number can be found in O(n) by:
(n+2)(n+1)/2-(n+1)n/2
EDIT: sorry, I guess I was thinking too fast last night. Anyway, The second part should actually be replaced by sum(list), which is where O(n) comes. The formula reveals the idea behind it: for n sequential integers, the sum should be (n+1)*n/2. If there is a missing number, the sum would be equal to the sum of (n+1) sequential integers minus the missing number.
Thanks for pointing out the fact that I was putting some middle pieces in my mind.
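As a small worked example of the corrected formula, in Python. This is a sketch that only applies when the list holds n of the n+1 consecutive integers 1..n+1, i.e. exactly one number is missing; the function name is mine.

```python
def missing_from_run(numbers):
    # Assumption: numbers holds n of the n+1 consecutive integers 1..n+1.
    n = len(numbers)
    expected_total = (n + 2) * (n + 1) // 2   # sum of 1..n+1
    return expected_total - sum(numbers)      # the O(n) part is sum()

print(missing_from_run([1, 2, 3, 5, 6]))  # 4
```

For the list 1,2,3,5,6 we have n = 5, so the expected total is 7*6/2 = 21, the actual sum is 17, and the difference is 4.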
Well done Ants Aasma! I thought about the answer for about 15 minutes and independently came up with an answer in a similar vein of thinking to yours:
#define SWAP(x,y) { numerictype_t tmp = x; x = y; y = tmp; }
int minNonNegativeNotInArr (numerictype_t * a, size_t n) {
    int m = n;
    for (int i = 0; i < m;) {
        if (a[i] >= m || a[i] < i || a[i] == a[a[i]]) {
            m--;
            SWAP (a[i], a[m]);
            continue;
        }
        if (a[i] > i) {
            SWAP (a[i], a[a[i]]);
            continue;
        }
        i++;
    }
    return m;
}
m represents "the current maximum possible output given what I know about the first i inputs and assuming nothing else about the values until the entry at m-1".
This value of m will be returned only if (a[i], ..., a[m-1]) is a permutation of the values (i, ..., m-1). Thus if a[i] >= m or if a[i] < i or if a[i] == a[a[i]] we know that m is the wrong output and must be at least one element lower. So after decrementing m and swapping a[i] with a[m] we can recurse.
If this is not true but a[i] > i then knowing that a[i] != a[a[i]] we know that swapping a[i] with a[a[i]] will increase the number of elements in their own place.
Otherwise a[i] must be equal to i in which case we can increment i knowing that all the values of up to and including this index are equal to their index.
The proof that this cannot enter an infinite loop is left as an exercise to the reader. :)
The Dafny fragment from Ants' answer shows why the in-place algorithm may fail. The requires pre-condition describes that the values of each item must not go beyond the bounds of the array.
method AntsAasma(A: array<int>) returns (M: int)
    requires A != null && forall N :: 0 <= N < A.Length ==> 0 <= A[N] < A.Length;
    modifies A;
{
    // Pass 1, move every value to the position of its value
    var N := A.Length;
    var cursor := 0;
    while (cursor < N)
    {
        var target := A[cursor];
        while (0 <= target < N && target != A[target])
        {
            var new_target := A[target];
            A[target] := target;
            target := new_target;
        }
        cursor := cursor + 1;
    }
    // Pass 2, find first location where the index doesn't match the value
    cursor := 0;
    while (cursor < N)
    {
        if (A[cursor] != cursor)
        {
            return cursor;
        }
        cursor := cursor + 1;
    }
    return N;
}
Paste the code into the validator with and without the forall ... clause to see the verification error. The second error is a result of the verifier not being able to establish a termination condition for the Pass 1 loop. Proving this is left to someone who understands the tool better.
Here's an answer in Java that does not modify the input and uses O(N) time and N bits plus a small constant overhead of memory (where N is the size of the list):
int smallestMissingValue(List<Integer> values) {
    BitSet bitset = new BitSet(values.size() + 1);
    for (int i : values) {
        if (i >= 0 && i <= values.size()) {
            bitset.set(i);
        }
    }
    return bitset.nextClearBit(0);
}
def solution(A):
    index = 0
    target = []
    A = [x for x in A if x >=0]
    if len(A) ==0:
        return 1
    maxi = max(A)
    if maxi <= len(A):
        maxi = len(A)
    target = ['X' for x in range(maxi+1)]
    for number in A:
        target[number]= number
    count = 1
    while count < maxi+1:
        if target[count] == 'X':
            return count
        count +=1
    return target[count-1] + 1
Got 100% for the above solution.
1)Filter negative and Zero
2)Sort/distinct
3)Visit array
Complexity: O(N) or O(N * log(N))
using Java8
public int solution(int[] A) {
    int result = 1;
    boolean found = false;
    A = Arrays.stream(A).filter(x -> x > 0).sorted().distinct().toArray();
    //System.out.println(Arrays.toString(A));
    for (int i = 0; i < A.length; i++) {
        result = i + 1;
        if (result != A[i]) {
            found = true;
            break;
        }
    }
    if (!found && result == A.length) {
        //result is larger than max element in array
        result++;
    }
    return result;
}
An unordered_set can be used to store all the positive numbers, and then we can iterate from 1 to length of unordered_set, and see the first number that does not occur.
int firstMissingPositive(vector<int>& nums) {
    unordered_set<int> fre;
    // storing each positive number in a hash.
    for(int i = 0; i < nums.size(); i +=1)
    {
        if(nums[i] > 0)
            fre.insert(nums[i]);
    }
    int i = 1;
    // Iterating from 1 to size of the set and checking
    // for the occurrence of 'i'
    for(auto it = fre.begin(); it != fre.end(); ++it)
    {
        if(fre.find(i) == fre.end())
            return i;
        i +=1;
    }
    return i;
}
Solution through basic javascript
var a = [1, 3, 6, 4, 1, 2];
function findSmallest(a) {
    var m = 0;
    for (var i = 1; i <= a.length; i++) {
        var j = 0; m = 1;
        while (j < a.length) {
            if (i === a[j]) {
                m++;
            }
            j++;
        }
        if (m === 1) {
            return i;
        }
    }
    // every value from 1 to a.length is present
    return a.length + 1;
}
console.log(findSmallest(a))
Hope this helps for someone.
With python it is not the most efficient, but correct
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import datetime
# write your code in Python 3.6
def solution(A):
    MIN = 0
    MAX = 1000000
    possible_results = range(MIN, MAX)
    for i in possible_results:
        next_value = (i + 1)
        if next_value not in A:
            return next_value
    return 1
test_case_0 = [2, 2, 2]
test_case_1 = [1, 3, 44, 55, 6, 0, 3, 8]
test_case_2 = [-1, -22]
test_case_3 = [x for x in range(-10000, 10000)]
test_case_4 = [x for x in range(0, 100)] + [x for x in range(102, 200)]
test_case_5 = [4, 5, 6]
print("---")
a = datetime.datetime.now()
print(solution(test_case_0))
print(solution(test_case_1))
print(solution(test_case_2))
print(solution(test_case_3))
print(solution(test_case_4))
print(solution(test_case_5))
def solution(A):
    A.sort()
    j = 1
    for i, elem in enumerate(A):
        if j < elem:
            break
        elif j == elem:
            j += 1
            continue
        else:
            continue
    return j
this can help:
0- A is [5, 3, 2, 7];
1- Define B With Length = A.Length; (O(1))
2- initialize B Cells With 1; (O(n))
3- For Each Item In A:
if (0 <= item < B.Length) then B[item] = -1 (O(n))
4- The answer is smallest index in B such that B[index] != -1 (O(n))
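In Python the steps above might look roughly like this. It is a sketch under my own assumptions: I read the bounds test in step 3 as "mark the item only if it is a valid index into B", and I return the length of B when every cell ends up marked. The function name is made up.

```python
def smallest_missing_marked(a):
    b = [1] * len(a)                 # steps 1 and 2: B has A's length, cells set to 1
    for item in a:                   # step 3: mark items that fit inside B
        if 0 <= item < len(b):
            b[item] = -1
    for index, mark in enumerate(b): # step 4: first cell never marked
        if mark != -1:
            return index
    return len(b)                    # assumption: A contained all of 0..n-1

print(smallest_missing_marked([5, 3, 2, 7]))  # 0
```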

Resources