Limit input data to achieve a better Big O complexity - arrays

You are given an unsorted array of n integers, and you would like to find if there are any duplicates in the array (i.e. any integer appearing more than once).
Describe an algorithm (implemented with two nested loops) to do this.
The question that I am stuck on is:
How can you limit the input data to achieve a better Big O complexity? Describe an algorithm for handling this limited data to find if there are any duplicates. What is the Big O complexity?
Your help will be greatly appreciated. This is not related to my coursework or assignments; it's from a previous year's exam paper and I am doing some self-study, but I seem to be stuck on this question. The only possible solution that I could come up with is:
If we limit the data and use nested loops to perform operations to find if there are duplicates, the complexity would be O(n), simply because the amount of time the operations take is proportional to the data size.
If my answer makes no sense, then please ignore it, and if you could, please suggest possible solutions or working for this question.
If someone could help me solve this, I would be grateful, as I have attempted countless possible solutions, none of which seems to be the correct one.
Edit: another possible solution (if effective!):
We could implement a loop to sort the array (from lowest integer to highest), so that the duplicates end up right next to each other, making them easier and faster to identify.
The Big O complexity would still be O(n^2).
The first loop would iterate n-1 times: we take the value at the current index in the array and store it in a variable named 'current'.
The outer loop advances 'current' by one position on each iteration. Within it, we write another loop to compare the current number to the next number; if it equals the next number, we can report it with a printf statement, otherwise we return to the outer loop, advance 'current' to the next value in the array, and update the 'next' variable to hold the value after 'current'.

You can do it in linear time (O(n)) for any input if you use a hash table (which has expected constant look-up time).
However, this is not what you are being asked about.
By limiting the possible values in the array, you can achieve linear performance.
E.g., if your integers have range 1..L, you can allocate a bit array of length L, initialize it to 0, and iterate over your input array, checking and flipping the appropriate bit for each input; if a bit is already set, you have found a duplicate.
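For illustration, a minimal C++ sketch of that bit-array idea (the range limit L and the sample data are assumptions for the example):

#include <iostream>
#include <vector>

// Returns true if any value occurs more than once.
// Assumes every value is in the range 1..L.
bool hasDuplicate(const std::vector<int>& input, int L)
{
    std::vector<bool> seen(L + 1, false); // one bit per possible value
    for (int v : input)
    {
        if (seen[v]) return true; // bit already set: duplicate
        seen[v] = true;           // flip the bit for this value
    }
    return false;
}

int main()
{
    std::vector<int> input = {3, 1, 4, 1, 5}; // example data (assumption)
    std::cout << std::boolalpha << hasDuplicate(input, 10) << '\n'; // true
}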

A variant of bucket sort will do. This gives you a complexity of O(n), where n is the number of input elements.
But there is one restriction: the maximum value. You should know the maximum value your integer array can take; let's call it m.
The idea is to create a boolean array of size m+1 (all initialized to false), then iterate over your array. As you visit an element, set bucket[elem] to true. If it is already true, then you've encountered a duplicate.
Java code:
// alternatively, you can iterate over the array to find maxVal, which again is O(n).
public boolean findDup(int[] arr, int maxVal)
{
    // Java initializes all boolean array elements to false by default.
    boolean[] bucket = new boolean[maxVal + 1]; // +1 so that the value maxVal itself is a valid index
    for (int elem : arr)
    {
        if (bucket[elem])
        {
            return true; // a duplicate found
        }
        bucket[elem] = true;
    }
    return false;
}
But the constraint here is the space. You need O(maxVal) space.

Nested loops get you O(N*M) or O(N*log(M)); for O(N) you cannot use nested loops!
I would do it with a histogram instead:
DWORD in[N] = { ... };  // input data ... values are from the range [0, M)
DWORD his[M] = { ... }; // histogram of in[]
int i, j;

// compute histogram O(N)
for (i = 0; i < M; i++) his[i] = 0;    // this can also be done with memset ...
for (i = 0; i < N; i++) his[in[i]]++;  // if the range of values does not start at 0 then shift it ...

// remove duplicates O(N)
for (i = 0, j = 0; i < N; i++)
{
    his[in[i]]--;              // count down remaining occurrences
    in[j] = in[i];             // copy item
    if (his[in[i]] <= 0) j++;  // keep only the last occurrence of each value
}
// now j holds the new in[] array size
[Notes]
If the value range is too big, with sparse areas, then you need to convert his[]
to a dynamic list with two values per item:
one is the value from in[] and the second is its occurrence count,
but then you need a nested loop -> O(N*M),
or with binary search -> O(N*log(M)).


Testing whether or not an array is distinct in O(N) time and O(1) extra space - is it possible?

So I found this purported interview question,(1) which looks something like this:
Given an array of length n of integers with unknown range, find in O(n) time and O(1) extra space whether or not it contains any duplicate terms.
There are no additional conditions and restrictions given. Assume that you can modify the original array. If it helps, you can restrict the datatype of the integers to ints (the original wording was a bit ambiguous) - although try not to use a variable with 2^(2^32) bits to represent a hash map.
I know there is a solution for a similar problem, where the maximum integer in the array is restricted to n-1. I am aware that problems like
1. Count frequencies of all elements in array in O(1) extra space and O(n) time
2. Find the maximum repeating number in O(n) time and O(1) extra space
3. Algorithm to determine if array contains n…n+m?
exist and either have solutions, or answers saying that it is impossible. However, for 1. and 2. the problems are stronger than this one, and for 3. I'm fairly sure the solution offered there would require the additional n-1 constraint to be adapted for the task here.
So is there any solution to this, or is this problem unsolvable? If so, is there a proof that it is not solvable in O(n) time and O(1) extra space?
(1) I say purported - I can't confirm whether or not it is an actual interview question, so I can't confirm that anyone thought it was solvable in the first place.
We can sort integer arrays in O(N) time! Therefore, sort and run the well-known adjacent-equality check for duplicates.
bool has_duplicate(int array[], size_t n)
{
    if (n > 0xFFFFFFFF)
        return true; // Pigeonhole: more elements than distinct 32-bit values
    else if (n > 0x7FFFFFFF)
        radix_sort(array, n); // Yup, O(N) sort
    else
        heapsort(array, n); // N is small enough that heapsort's O(N log N) is smaller than radix_sort's O(32N) after constant adjustment
    for (size_t i = 1; i < n; i++)
        if (array[i] == array[i - 1])
            return true;
    return false;
}
You can do this in expected linear time by using the original array like a hash table...
Iterate through the array, and for each item, let item, index be the item and its index, and let hash(item) be a value in [0,n). Then:
If hash(item) == index, then just leave the item there and move on. Otherwise,
If item == array[hash(item)] then you've found a duplicate and you're all done. Otherwise,
If item < array[hash(item)] or hash(array[hash(item)]) != hash(item), then swap those and repeat with the new item at array[index]. Otherwise,
Leave the item and move on.
Now you can discard all the array elements where hash(item) == index. These are guaranteed to be the smallest items that hash to their target indexes, and they are guaranteed not to be duplicates.
Move all the remaining items to the front of the array and repeat with the new, smaller, subarray.
Each step takes O(N) time and on average removes some significant proportion of the remaining elements, leading to O(N) expected time overall. We can speed things up by taking advantage of all the free slots we're creating in the array, but that doesn't improve the overall complexity.
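For the curious, here is a best-effort C++ sketch of that procedure. The mixing function slot(), the reseeding between passes, and the sample data are my assumptions for illustration; treat it as a sketch of the idea above, not a definitive implementation:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// Hypothetical mixing function mapping an item into [0, n); any decent
// hash works here. The constants are arbitrary (an assumption).
static std::size_t slot(std::uint64_t seed, int item, std::size_t n)
{
    std::uint64_t x = seed + static_cast<std::uint32_t>(item);
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
    return static_cast<std::size_t>(x % n);
}

// Expected O(n) duplicate test that uses the array itself as the table.
bool hasDuplicate(std::vector<int>& a)
{
    std::uint64_t seed = 0x9e3779b97f4a7c15ULL;
    std::size_t n = a.size();
    while (n > 1)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            for (;;)
            {
                std::size_t h = slot(seed, a[i], n);
                if (h == i) break;             // item already in its home slot
                if (a[i] == a[h]) return true; // collided with an equal item
                // Leave the item only if its slot is held by a smaller item
                // that is itself home; otherwise evict the occupant and retry.
                if (slot(seed, a[h], n) == h && a[h] < a[i]) break;
                std::swap(a[i], a[h]);
            }
        }
        // Items sitting in their home slot are unique; discard them and
        // repeat on the compacted remainder with a fresh hash.
        std::size_t j = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (slot(seed, a[i], n) != i) a[j++] = a[i];
        n = j; // each pass parks at least one item, so n shrinks
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    return false;
}

int main()
{
    std::vector<int> a = {9, 4, 7, 4, 12, 3};               // example data (assumption)
    std::cout << std::boolalpha << hasDuplicate(a) << '\n'; // true
}

The swap rule mirrors the step list above: an item yields its slot only to a smaller item that already sits in its own home slot, so every pass parks at least one item, which is then discarded as provably unique.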

given 3 arrays check if there is any common number

I have 3 arrays a[1...n], b[1...n], c[1...n] which contain integers.
It is not mentioned whether the arrays are sorted or whether each array contains duplicates.
The task is to check if there is any common number in the given arrays and return true or false.
For example, the arrays a=[3,1,5,10], b=[4,2,6,1], c=[5,3,1,7] have one common number: 1
I need to write an algorithm with time complexity O(n^2).
Let the current element traversed in a[] be x, in b[] be y, and in c[] be z. Inside the loop I have the following case: if x, y and z are the same, I can simply return true and stop the program, something like:
for (x = 1; x <= n; x++)
    for (y = 1; y <= n; y++)
        for (z = 1; z <= n; z++)
            if (a[x] == b[y] && b[y] == c[z])
                return true;
But this algorithm has time complexity O(n^3) and I need O(n^2). Any suggestions?
There is a pretty simple and efficient solution for this.
Sort a and b. Complexity = O(NlogN)
For each element in c, use binary search to check if it exists in both a and b. Complexity = O(NlogN).
That'll give you a total complexity of O(NlogN), better than O(N^2).
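A sketch of that approach in C++, using std::sort and std::binary_search on the question's sample arrays:

#include <algorithm>
#include <iostream>
#include <vector>

// Returns true if some value appears in all three arrays.
bool haveCommon(std::vector<int> a, std::vector<int> b, const std::vector<int>& c)
{
    std::sort(a.begin(), a.end()); // O(NlogN)
    std::sort(b.begin(), b.end()); // O(NlogN)
    for (int x : c)                // N binary searches, O(NlogN) total
        if (std::binary_search(a.begin(), a.end(), x) &&
            std::binary_search(b.begin(), b.end(), x))
            return true;
    return false;
}

int main()
{
    std::vector<int> a = {3, 1, 5, 10}, b = {4, 2, 6, 1}, c = {5, 3, 1, 7};
    std::cout << std::boolalpha << haveCommon(a, b, c) << '\n'; // true
}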
Create a new array and save in it the elements common to arrays a and b. Then find the elements common to this array and array c.
Python solution:
def find_d(a, b):
    # collect the elements common to a and b
    d = []
    for i in a:
        for j in b:
            if i == j:
                d.append(i)
    return d

def find_all_common(c, d):
    # collect the elements of d that also appear in c
    e = []
    for i in c:
        for j in d:
            if i == j:
                e.append(i)
                break
    return e

a = [3, 1, 5, 10]
b = [4, 2, 6, 1]
c = [5, 3, 1, 7]

d = find_d(a, b)
e = find_all_common(c, d)
print("True" if e else "False")
Since I haven't seen a solution based on sets, I suggest looking up how sets are implemented in your language of choice and doing the equivalent of this:
set(a).intersection(b).intersection(c) != set([])
This evaluates to True if there is a common element, False otherwise. It runs in expected O(n) time.
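In a language without a built-in set type literal, the equivalent might look like this C++ sketch using std::unordered_set (expected O(n), with an early exit on the third array):

#include <iostream>
#include <unordered_set>
#include <vector>

bool haveCommon(const std::vector<int>& a, const std::vector<int>& b,
                const std::vector<int>& c)
{
    std::unordered_set<int> inA(a.begin(), a.end());
    std::unordered_set<int> inBoth;
    for (int x : b)
        if (inA.count(x)) inBoth.insert(x); // intersection of a and b
    for (int x : c)
        if (inBoth.count(x)) return true;   // intersect with c, stop early
    return false;
}

int main()
{
    std::cout << std::boolalpha
              << haveCommon({3,1,5,10}, {4,2,6,1}, {5,3,1,7}) << '\n'; // true
}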
All solutions so far either require O(n) additional space (creating a new array/set) or change the order of the arrays (sorting).
If you want to solve the problem in O(1) additional space and without changing the original arrays, you indeed can't do better than O(n^2) time:
foreach (var x in a) {                               // n iterations
    if (b.Contains(x) && c.Contains(x)) return true; // at most 2n checks per iteration
}                                                    // O(n^2) overall
return false;
A suggestion:
Combine the three arrays into a single array z whose length is the sum of the lengths of the individual arrays. Keep track of how many entries came from Array 1, Array 2 and Array 3.
For each entry Z, traverse the combined array to see how many duplicates of it there are and where they sit. For entries with 2 or more duplicates (i.e. 3 or more of the same number), check that the locations of those duplicates correspond to having been in different arrays to begin with (ruling out duplicates within the original arrays). If your number Z has 2 or more duplicates and they are all in different arrays (checked through their positions in the combined array), then store that number Z in a result array.
Report the result array.
You will traverse the entire combined array once and then almost (no need to check if Z is a duplicate of itself) traverse it again for each Z, hence n^2 complexity.
It is also worth noting that the time complexity is now a function of the total number of entries and not of the number of arrays (your nested loops would become n^4 with 4 arrays; this stays n^2).
You could make it more efficient by always checking the already-found duplicates before checking a new Z: if the new Z was already found as a duplicate of an earlier Z, you need not traverse the array for that number again. This helps more the more duplicates there are; with few duplicates, the reduction in traversals is probably not worth the extra complexity.
Of course, you could also do this without actually combining the values into a single array; you would just need to make sure that your traversal routine looks through the arrays and keeps track of what it finds in the right order.
Edit
Thinking about it, the above does way more than you want: it would let you report doubles, quadruples etc. as well.
If you just want triples, then it is much easier/quicker. Since a triple needs to be in all 3 arrays, you can start by finding the numbers which are in 2 of the arrays (if they have different lengths, compare the two shortest arrays first) and then check any doublets found against the third array. I am not sure what that brings the complexity down to, but it will be less than n^2...
There are many ways to solve this. Here are a few selected ones, sorted by complexity (descending), assuming n is the average size of your individual arrays:
Brute force O(n^3)
It's basically the same as what you do: test every triplet combination with 3 nested for loops.
for (x = 1; x <= n; x++)
    for (y = 1; y <= n; y++)
        for (z = 1; z <= n; z++)
            if (a[x] == b[y] && b[y] == c[z])
                return true;
return false;
Slightly optimized brute force O(n^2)
Simply check if each element from a is in b, and if yes, check if it is also in c, which is O(n*(n+n)) = O(n^2), as the b and c loops are not nested anymore:
for (x = 1; x <= n; x++)
{
    for (ret = false, y = 1; y <= n; y++)
        if (a[x] == b[y])
            { ret = true; break; }
    if (!ret) continue;
    for (ret = false, z = 1; z <= n; z++)
        if (a[x] == c[z])
            { ret = true; break; }
    if (ret) return true;
}
return false;
Exploit sorting O(n.log(n))
Simply sort all arrays, O(n.log(n)), and then traverse all 3 arrays together, testing whether the current elements match and incrementing the index of the array holding the smallest element (a single for loop). This can also be done with binary search, as one of the other answers suggests, but that is slower, though still not exceeding n.log(n). Here is the O(n) traversal:
for (x = 1, y = 1, z = 1; (x <= n) && (y <= n) && (z <= n); )
{
    if ((a[x] == b[y]) && (b[y] == c[z])) return true;
    if ((a[x] <= b[y]) && (a[x] <= c[z])) x++; // advance the array holding the smallest element
    else if ((b[y] <= a[x]) && (b[y] <= c[z])) y++;
    else z++;
}
return false;
However, this needs to change the contents of the arrays, or else needs additional index arrays for sorting (so O(n) space).
Histogram-based O(n+m)
This can be used only if the range of elements in your arrays is not too big. Say the arrays can hold numbers 1..m; then you build a (modified) histogram holding a set bit per array in which the value is present, and simply check whether a value is present in all 3:
int h[m + 1];                      // histogram; +1 so that value m is a valid index
for (i = 1; i <= m; i++) h[i] = 0; // clear histogram
for (x = 1; x <= n; x++) h[a[x]] |= 1;
for (y = 1; y <= n; y++) h[b[y]] |= 2;
for (z = 1; z <= n; z++) h[c[z]] |= 4;
for (i = 1; i <= m; i++) if (h[i] == 7) return true;
return false;
This needs O(m) space ...
So you clearly want option #2.
Beware, all the code is just your code copy-pasted and modified directly in the answer editor, so there might be typos or syntax errors I do not see right now...

find pair of numbers whose difference is an input value 'k' in an unsorted array

As mentioned in the title, I want to find the pairs of elements whose difference is k.
Example: k=4 and a[] = {7, 6, 23, 19, 10, 11, 9, 3, 15}
The output should be:
7,11
7,3
6,10
19,23
15,19
15,11
I have read the previous SO post "find pair of numbers in array that add to given sum".
How much time does an efficient solution take? Is the time complexity O(nlogn) or O(n)?
I tried to do this with a divide and conquer technique, but I'm not getting any clue of an exit condition...
If an efficient solution involves sorting the input array and manipulating the elements with two pointers, then I think it takes at minimum O(nlogn)...
Is there any math-related technique which brings the solution down to O(n)? Any help is appreciated.
You can do it in expected O(n) with a hash table. Put all the numbers in the hash for O(n), then go through them all again looking for number[i]+k. The hash table answers "yes" or "no" in O(1), and you need to go through all the numbers, so the total is O(n). Any set structure with O(1) insertion and O(1) membership checking will work instead of a hash table.
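A sketch of that idea in C++ with std::unordered_set (assuming k > 0 and, for simplicity, distinct array elements; the pairs come out in array order rather than the order shown in the question):

#include <iostream>
#include <unordered_set>
#include <vector>

// Prints all pairs (x, x+k); expected O(n) with a hash set.
void printPairs(const std::vector<int>& a, int k)
{
    std::unordered_set<int> seen(a.begin(), a.end()); // O(n) inserts
    for (int x : a)
        if (seen.count(x + k))                        // expected O(1) lookup
            std::cout << x << ", " << x + k << '\n';
}

int main()
{
    printPairs({7, 6, 23, 19, 10, 11, 9, 3, 15}, 4);
}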
A simple solution in O(n*Log(n)) is to sort your array and then go through your array with this function:
void find_pairs(int n, int array[], int k) // array must be sorted; assumes k > 0
{
    int first = 0;
    int second = 0;
    while (second < n)
    {
        while (second < n && array[second] < array[first] + k)
            second++;
        if (second < n && array[second] == array[first] + k)
            printf("%d, %d\n", array[first], array[second]);
        first++;
    }
}
This solution does not use extra space unlike the solution with a hashtable.
This can also be done using indexing in O(n):
Take a boolean array arr indexed by the values of the input list.
For each integer i in the input list, set arr[i] = true.
Traverse the entire arr from the lowest index to the highest as follows:
whenever you find a true at index i, note down this index;
see if arr[i+k] is also true. If yes, then i and i+k are a required pair;
else continue with the next integer, i+1. A sketch of this idea follows below.
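A rough C++ sketch of this indexing idea. It assumes the values are non-negative and bounded by a known maxVal, and it runs in O(n + maxVal) rather than strictly O(n) when the value range exceeds n:

#include <iostream>
#include <vector>

// Assumes values are in 0..maxVal and k > 0.
void printPairs(const std::vector<int>& input, int k, int maxVal)
{
    std::vector<bool> present(maxVal + 1, false);
    for (int v : input) present[v] = true; // mark present values, O(n)
    for (int i = 0; i + k <= maxVal; ++i)  // scan the flag array, O(maxVal)
        if (present[i] && present[i + k])
            std::cout << i << ", " << i + k << '\n';
}

int main()
{
    printPairs({7, 6, 23, 19, 10, 11, 9, 3, 15}, 4, 23);
}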

what's an efficient way to filter an array

I am programming in C on Linux and I have a big integer array. How do I filter it, i.e. find values that satisfy some condition, e.g. value > 1789 && value < 2031? What's the efficient way to do this? Do I need to sort the array first?
I've read the answers, and thank you all, but I need to perform such a filtering operation many times on this big array, not only once. So is iterating over it one by one every time the best way?
If the only thing you want to do with the array is to get the values that match this criterion, it would be faster to just iterate over the array and check each value against the condition (O(n) vs. O(nlogn)). If, however, you are going to perform multiple operations on this array, then it's better to sort it.
Sort the array first. Then on each query do 2 binary searches. I'm assuming queries will be like -
Find integers x such that a < x < b
The first binary search finds the index i such that Array[i-1] <= a < Array[i], and the second finds the index j such that Array[j] < b <= Array[j+1]. Then your desired range is [i, j].
This algorithm's complexity is O(NlogN) for preprocessing, then O(number of matches) per query if you want to iterate over all the matching elements, and O(logN) per query if you just want to count them.
Let me know if you need help implementing binary search in C. There is the library function bsearch() in C, and lower_bound() and upper_bound() in the C++ STL.
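To make the preprocess-then-query pattern concrete, here is a small C++ sketch using std::lower_bound and std::upper_bound (the data and the query bounds 1789/2031 are just the example values from the question):

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> a = {1700, 2100, 1800, 1950, 2400, 2031, 1789}; // example data
    std::sort(a.begin(), a.end()); // preprocess once, O(NlogN)

    // Query: values x with 1789 < x < 2031, repeated cheaply per query.
    auto lo = std::upper_bound(a.begin(), a.end(), 1789); // first x > 1789
    auto hi = std::lower_bound(a.begin(), a.end(), 2031); // first x >= 2031

    std::printf("%zu matching values\n", static_cast<std::size_t>(hi - lo)); // count in O(logN)
    for (auto it = lo; it != hi; ++it)
        std::printf("%d\n", *it); // listing them costs O(matches)
}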
You could use a max-heap implemented as an array of the same size as the source array. Initialize it with min-1 values and insert values into the max-heap as the numbers come in. The first check is whether the number to be inserted is greater than the first element: if it is not, discard it; if it is, insert it into the array. To get the list of numbers back, read all numbers in the new array until you hit min-1.
To filter the array, you'll have to look at each element once. There's no need to look at any element more than once, so a simple linear search of the array for items matching your criteria is going to be as efficient as you can get.
Sorting the array would end up looking at some elements more than once, which is not necessary for your purpose.
If you can spare some more memory, then you can scan your array once, get the indices of matching values, and store them in another array. This new array will be significantly shorter, since it holds only the indices of values which match the pattern! Something like this:
int original_array[SOME_SIZE];
int new_array[LESS_THAN_SOME_SIZE];
for (int i = 0, j = 0; i < SOME_SIZE; i++)
{
    if (original_array[i] > LOWER_LIMIT && original_array[i] < HIGHER_LIMIT)
    {
        new_array[j++] = i;
    }
}
You need to do the above once, and from now on:
for (int i = 0; i < LESS_THAN_SOME_SIZE; i++)
{
    if (original_array[new_array[i]] > LOWER_LIMIT && original_array[new_array[i]] < HIGHER_LIMIT)
    {
        printf("Success! Found value %d\n", original_array[new_array[i]]);
    }
}
So at the cost of some memory, you can save a considerable amount of time. Even if you invest time in sorting, you have to scan the sorted array on every query. This method minimizes the array length to traverse as well as the sorting time (at the cost of extra memory, of course :) )
Try this library: http://code.google.com/p/boolinq/
It is iterator-based and as fast as can be; there is no overhead. But it needs the C++11 standard. Your code will be written in a declarative way:
int arr[] = {1,2,3,4,5,6,7,8,9};
auto items = boolinq::from(arr).where([](int a){ return a > 3 && a < 6; });
while (!items.empty())
{
    int item = items.front();
    ...
}
Only a multithreaded scan can be faster than an iterator-based scan...

(Algorithm) Find if two unsorted arrays have any common elements in O(n) time without sorting?

We have two unsorted arrays, each of length n. These arrays contain random integers in the range 0..n^100. How can we find whether these two arrays have any common elements in O(n)/linear time? Sorting is not allowed.
A hash table will save you. Really, it's like a Swiss army knife for algorithms.
Just put all the values from the first array into it, and then check whether any value from the second array is present.
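A minimal C++ sketch of that (expected O(n); it assumes the values fit in a machine word, which, as the next answer points out, the range 0..n^100 does not):

#include <iostream>
#include <unordered_set>
#include <vector>

// Expected O(n): insert the first array, probe with the second.
bool haveCommon(const std::vector<long long>& a, const std::vector<long long>& b)
{
    std::unordered_set<long long> seen(a.begin(), a.end());
    for (long long x : b)
        if (seen.count(x)) return true;
    return false;
}

int main()
{
    std::cout << std::boolalpha
              << haveCommon({5, 40, 7}, {12, 7, 30}) << '\n'; // true
}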
You have not defined the model of computation. Assuming you can only read O(1) bits in O(1) time (anything else would be a rather exotic model of computation), there can be no algorithm solving the problem in O(n) worst-case time complexity.
Proof Sketch:
Each number in the input takes O(log(n^100)) = O(100 log n) = O(log n) bits. The entire input is therefore O(n log n) bits, which cannot be read in O(n) time. Any O(n) algorithm therefore cannot read the entire input, and hence cannot react if the unread bits matter.
Answering Neil:
Since you know your N at the start (two arrays of size N), you can create a hash table with an array size of 2*N*some_ratio (for example, some_ratio = 1.5). With this size, almost all simple hash functions will provide a good spread of the entries.
You can also implement find_or_insert to search for an existing entry or insert a new one in the same action, which reduces the hash function and comparison calls. (C++ STL find_or_insert is not good enough since it doesn't tell you whether the item was there before or not.)
Linearity Test
Using the Mathematica hash function and arbitrary-length integers.
Tested up to n=2^20, generating random numbers up to (2^20)^100 (approx. 10^602).
Just in case ... the program is:
k = {};
For[t = 1, t < 21, t++,
  i = 2^t;
  Clear[a, b];
  Table[a[RandomInteger[i^100]] = 1, {i}];
  b = Table[RandomInteger[i^100], {i}];
  Contains = False;
  AppendTo[k,
    {i, First@Timing@For[j = 2, j <= i, j++,
      Contains = Contains || (NumericQ[a[b[[j]]]]);
    ]}]];
ListLinePlot[k, PlotRange -> All, AxesLabel -> {"n", "Time(secs)"}]
Put the elements of the first array into a hash table, and check for existence while scanning the second array. This gives you a solution that is O(N) in the average case.
If you want a truly O(N) worst-case solution, then instead of a hash table use a linear array over the range 0..n^100. Note that you need just a single bit per entry.
If storage is not important, then scrap the hash table in favor of an array indexed by value. Flag an entry true when you come across that number in the first array. In the pass through the second array, if you find any of them to be true, you have your answer. O(n).
Define largeArray(n)

// First pass
for (element i in firstArray)
    largeArray[i] = true;

// Second pass
Define hasFound = false;
for (element i in secondArray)
    if (largeArray[i] == true)
    {
        hasFound = true;
        break;
    }
Have you tried a counting sort? It is simple to implement, uses O(n) space and also has Θ(n) time complexity.
Based on the ideas posted to date: we can store the integer elements of one array in a hash map. The maximum number of different integers can be held in RAM. The hash map will contain only unique integer values; duplicates are ignored.
Here is an implementation in Perl.
#!/usr/bin/perl
use strict;
use warnings;

# prints the elements common to two unsorted arrays, passed as references
sub find_common_elements {
    my ($arr1, $arr2) = @_;
    my %hash; # hash map to store the values of one array

    # preparing the hash map is O(n); duplicate elements are simply overwritten,
    # so the map holds only unique integer values (at most 2^32 or 2^64 keys,
    # depending on the platform), which fits in RAM
    foreach my $ele (@$arr1) {
        $hash{$ele} = 1;
    }

    # O(n) to traverse the second array; each hash lookup is O(1),
    # even if all integers of the array are the same
    foreach my $ele (@$arr2) {
        print "\n$ele is common to both arrays\n" if exists $hash{$ele};
    }
}
I hope it helps.
