What's an efficient way to filter an array - C

I am programming in C on Linux and I have a big integer array. How do I filter it, i.e. find the values that fit some condition, e.g. value > 1789 && value < 2031? What's an efficient way to do this? Do I need to sort the array first?
I've read the answers, and thank you all, but I need to perform such filtering operations on this big array many times, not just once. So is iterating over it one element at a time every time still the best way?

If the only thing you want to do with the array is get the values that match this criterion, it is faster just to iterate over the array and check each value against the condition (O(n) vs. O(nlogn)). If, however, you are going to perform multiple operations on this array, then it's better to sort it.
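A minimal sketch of that single pass in C (the function name and parameters are illustrative):

```c
/* Copy every value v with lo < v < hi into out; returns the match count.
   One O(n) pass over the data: no sorting needed for a one-off query. */
int filter_range(const int *in, int n, int lo, int hi, int *out)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (in[i] > lo && in[i] < hi)
            out[count++] = in[i];
    return count;
}
```

For the query in the question, filter_range(data, n, 1789, 2031, out) collects every element strictly between those bounds.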

Sort the array first. Then, on each query, do two binary searches. I'm assuming queries look like:
Find integers x such that a < x < b
The first binary search finds the index i such that Array[i-1] <= a < Array[i], and the second finds the index j such that Array[j] < b <= Array[j+1]. Your desired range is then [i, j].
This algorithm costs O(NlogN) in preprocessing; each query is O(logN + K) if you want to iterate over the K matching elements, or just O(logN) if you only want to count the filtered elements.
Let me know if you need help implementing binary search in C. The C standard library has bsearch() (which only tells you whether a match exists, not its boundary position), and the C++ STL has lower_bound() and upper_bound().
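Since C's standard bsearch() reports only whether a match exists rather than a boundary position, here is a hedged sketch of the two bound-finding searches, hand-rolled (the function names are made up for illustration):

```c
/* First index i with a[i] > x, i.e. an upper bound (a must be sorted). */
static int upper_bound_int(const int *a, int n, int x)
{
    int lo = 0, hi = n;               /* search the half-open range [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] <= x) lo = mid + 1;
        else             hi = mid;
    }
    return lo;
}

/* First index i with a[i] >= x, i.e. a lower bound. */
static int lower_bound_int(const int *a, int n, int x)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < x) lo = mid + 1;
        else            hi = mid;
    }
    return lo;
}

/* Number of sorted elements x with a < x < b: O(log n) per query. */
int count_in_open_range(const int *arr, int n, int a, int b)
{
    int i = upper_bound_int(arr, n, a);   /* first element > a  */
    int j = lower_bound_int(arr, n, b);   /* first element >= b */
    return j > i ? j - i : 0;
}
```

To enumerate the matches instead of counting them, iterate over arr[i..j-1].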

You could use a max-heap implemented as an array of the same size as the source array. Initialize it with a sentinel value (min-1) and insert values into the max-heap as the numbers come in. The first check is whether the number to be inserted is greater than the root element; if it's not, discard it, and if it is larger, insert it into the array. To get the list of numbers back, read the new array until you reach the sentinel (min-1).

To filter the array, you'll have to look at each element once. There's no need to look at any element more than once, so a simple linear search of the array for items matching your criteria is going to be as efficient as you can get.
Sorting the array would end up looking at some elements more than once, which is not necessary for your purpose.

If you can spare some extra memory, you can scan your array once, collect the indices of matching values, and store them in another array. This new array will be significantly shorter, since it holds only the indices of values that match the condition! Something like this:
int original_array[SOME_SIZE];
int new_array[LESS_THAN_SOME_SIZE];
for (int i = 0, j = 0; i < SOME_SIZE; i++)
{
    if (original_array[i] > LOWER_LIMIT && original_array[i] < HIGHER_LIMIT)
    {
        new_array[j++] = i;
    }
}
You need to do the above only once; from now on:
for (int i = 0; i < LESS_THAN_SOME_SIZE; i++)
{
    if (original_array[new_array[i]] > LOWER_LIMIT && original_array[new_array[i]] < HIGHER_LIMIT)
    {
        printf("Success! Found value %d\n", original_array[new_array[i]]);
    }
}
So at the cost of some memory, you can save a considerable amount of time. Even if you invest time in sorting, you still have to walk the sorted array on every query; this method minimizes the length of the array you have to walk (at the cost of extra memory, of course :) )

Try this library: http://code.google.com/p/boolinq/
It is iterator-based and as fast as can be; there is no overhead. But it needs the C++11 standard. Your code will be written in a declarative way:
int arr[] = {1,2,3,4,5,6,7,8,9};
auto items = boolinq::from(arr).where([](int a){ return a > 3 && a < 6; });
while (!items.empty())
{
    int item = items.front();
    ...
}
Only a multithreaded scan could be faster than an iterator-based scan...

Related

can someone suggest a better algorithm than this to check if there is at least one duplicate value in an array?

An unsorted integer array nums and its size numsSize are given as arguments to the function containsDuplicate, and we have to return the boolean value true if at least one duplicate value is there, otherwise false.
For this task I chose to check whether each element and the elements after it are equal, until the second-to-last element is reached; if a pair is equal I return true, otherwise false.
bool containsDuplicate(int* nums, int numsSize){
    for (int i = 0; i < numsSize - 1; i++)
    {
        for (int j = i + 1; j < numsSize; j++)
        {
            if (nums[i] == nums[j])
            {
                return true;
            }
        }
    }
    return false;
}
To minimize run time, I return as soon as a duplicate is found, but my code is still not performing well on large arrays. I'm hoping for an algorithm with a time complexity of O(n), if possible. And is there any way we can skip values that are duplicates of previously checked values?
I've seen all the other solutions, but I couldn't find a better one in C.
Your algorithm is O(n^2). But if you sort first, which can be done in O(n log n), then determining whether there is a duplicate in the array is O(n).
You could maintain a lookup table to determine whether each value has been previously seen, which would run in O(n) time; but unless the potential range of the values stored in the array is relatively small, this has prohibitive memory usage.
For instance, suppose you know the values in the array will range from 0 to 127:
int contains_dupes(int *arr, size_t n) {
    char seen[128] = {0};
    for (size_t i = 0; i < n; i++) {
        if (seen[arr[i]]) return 1;  /* duplicate found */
        seen[arr[i]] = 1;
    }
    return 0;  /* no duplicates */
}
But if we assume int is 4 bytes and the values in the array can be any int, then even using char for the lookup table's elements, the table would have to be 4GB in size.
O(n) time, O(n) space: use a set or map. Parse your array, checking each element in turn for membership in your set or map. If it's present then you've found a duplicate; if not, then add it.
If O(n) space is too expensive, you can get away with far less by doing a first pass using a cuckoo filter (built on cuckoo hashing): a space-efficient data structure that guarantees no false negatives but can produce false positives. Use the same approach as above, but with the cuckoo filter instead of a set or map. Any duplicates you detect may be false positives, so they will need to be checked.
Then, parse the array a second time, using the approach described in the first paragraph, but skip past anything that isn't in your set of candidates.
This is still O(n) time.
https://en.wikipedia.org/wiki/Cuckoo_hashing
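A sketch of the set-based approach from the first paragraph in C, using a small open-addressing hash set with linear probing (the sizing, the hash constant, and the names are illustrative choices, not a tuned library):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Expected O(n) duplicate check: insert each value into a hash set,
   reporting a duplicate as soon as an insert finds the value present. */
bool contains_duplicate(const int *nums, int n)
{
    size_t cap = 1;
    while (cap < (size_t)n * 2) cap <<= 1;    /* keep load factor <= 0.5 */
    int  *slot = malloc(cap * sizeof *slot);
    bool *used = calloc(cap, sizeof *used);
    bool found = false;

    for (int i = 0; i < n && !found; i++) {
        size_t h = ((unsigned)nums[i] * 2654435761u) & (cap - 1);
        while (used[h]) {
            if (slot[h] == nums[i]) { found = true; break; }
            h = (h + 1) & (cap - 1);          /* linear probing */
        }
        if (!found) { slot[h] = nums[i]; used[h] = true; }
    }
    free(slot);
    free(used);
    return found;
}
```

(Error handling for malloc/calloc failure is omitted for brevity.)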

Compare two arrays and create new array with equal elements in C

The problem is to check two arrays for the same integer value and put matching values in a new array.
Let's say I have two arrays:
a[n] = {2,5,2,7,8,4,2}
b[m] = {1,2,6,2,7,9,4,2,5,7,3}
Each array can be a different size.
I need to check if the arrays have matching elements and put them in a new array. The result in this case should be:
array[] = {2,2,2,5,7,4}
And I need to do it in O(n.log(n) + m.log(m)).
I know there is a way to do it with merge sort, or by putting one of the arrays into a hash table, but I really don't know how to implement it.
I will really appreciate your help, thanks!!!
As you have already figured out, you can use merge sort (implementing it is beyond the scope of this answer; you can find a solution on Wikipedia or by searching Stack Overflow) to get nlogn + mlogm complexity, where n is the size of the first array and m the size of the other.
Let's call the first array a (with size n) and the second one b (with size m). First sort these arrays (merge sort gives us nlogn + mlogm complexity). Now we have:
a[n] // {2,2,2,4,5,7,8} and b[m] // {1,2,2,2,3,4,5,6,7,7,9}
Supposing n <= m, we can simply iterate over both simultaneously, comparing corresponding values.
But first let's allocate an array int c[n]; to store the results (you can print to the console instead of storing, if you need). And now the loop itself:
int k = 0; // tracks the number of results stored in c!
for (int i = 0, j = 0; i < n && j < m; )
{
    if (a[i] == b[j])
    {
        // match found, store it
        c[k] = a[i];
        ++i; ++j; ++k;
    }
    else if (a[i] > b[j])
    {
        // current value in a is ahead, go to next in b
        ++j;
    }
    else
    {
        // the last possibility is a[i] < b[j]: b is ahead, go to next in a
        ++i;
    }
}
Note: the loop itself is O(n+m) at worst (remember the n <= m assumption), which is less than the sorting cost, so the overall complexity is nlogn + mlogm. Now you can iterate over the c array (its capacity is n, as allocated, but the number of elements in it is k) and do whatever you need with those numbers.
From the way you explain it, the way to do this would be to loop over the shorter array and check it against the longer array. Let us assume that A is the shorter array and B the longer one. Create a results array C.
Loop over each element in A; call it I.
If I is found in B, remove it from B, put it in C, and break out of the inner search loop.
Now go to the next element in A.
This means that if a number I is found twice in A and three times in B, then I will appear exactly twice in C. Once you finish, every number found in both arrays will appear in C the number of times that it actually appears in both (i.e. the minimum of its two counts).
I am carefully not putting in suggested code as your question is about a method that you can use. You should figure out the code yourself.
I would be inclined to take the following approach:
1) Sort array B. There are many well published sort algorithms to do this, as well as several implementations in various generally available libraries.
2) Loop through array A and for each element do a binary search (or other suitable algorithm) on array B for a match. If a match is found, remove the element from array B (to avoid future matches) and add it to the output array.
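A hedged sketch of step 2 in C, under the assumptions above (identifiers are made up; note that b is sorted in place, and the skip over already-consumed duplicates can degrade when one value repeats very often):

```c
#include <stdlib.h>

static int cmp_int(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Intersection with multiplicity: sort b, then for each element of a,
   binary-search b and mark the matched slot so it cannot be matched twice.
   Returns the number of matches written to out (out needs room for n ints). */
int intersect(const int *a, int n, int *b, int m, int *out)
{
    char *used = calloc(m, 1);
    int count = 0;
    qsort(b, m, sizeof *b, cmp_int);

    for (int i = 0; i < n; i++) {
        int lo = 0, hi = m;                 /* lower bound: first b[j] >= a[i] */
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (b[mid] < a[i]) lo = mid + 1; else hi = mid;
        }
        while (lo < m && b[lo] == a[i] && used[lo])
            lo++;                           /* skip already-consumed copies */
        if (lo < m && b[lo] == a[i]) {
            used[lo] = 1;
            out[count++] = a[i];
        }
    }
    free(used);
    return count;
}
```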

Limit input data to achieve a better Big O complexity

You are given an unsorted array of n integers, and you would like to find if there are any duplicates in the array (i.e. any integer appearing more than once).
Describe an algorithm (implemented with two nested loops) to do this.
The question that I am stuck at is:
How can you limit the input data to achieve a better Big O complexity? Describe an algorithm for handling this limited data to find if there are any duplicates. What is the Big O complexity?
Your help will be greatly appreciated. This is not related to my coursework or assignments; it's from a previous year's exam paper and I am doing some self-study, but I seem to be stuck on this question. The only possible solution that I could come up with is:
If we limit the data and use nested loops to check for duplicates, the complexity would be O(n), simply because the amount of time the operations take is proportional to the data size.
If my answer makes no sense, then please ignore it and if you could, then please suggest possible solutions/ working out to this answer.
If someone could help me solve this answer, I would be grateful as I have attempted countless possible solution, all of which seems to be not the correct one.
Edited part, again. Another possible solution (if effective!):
We could sort the array (from lowest integer to highest), so that duplicates end up right next to each other, making them easier and faster to identify.
The big O complexity would still be O(n^2).
The outer loop would iterate n-1 times, taking the value at each index (in the first iteration, for instance, the first element) and storing it in a variable named 'current'.
Each time through the iteration, the outer loop advances 'current' by one position; within it, an inner loop compares the current number to the next number. If they are equal, we can report it with a printf statement; otherwise we move back to the outer loop, advance 'current' to the next value in the array, and update the 'next' variable to hold the value after it.
You can do it linearly (O(n)) for any input if you use hash tables (which have constant expected look-up time).
However, this is not what you are being asked about.
By limiting the possible values in the array, you can achieve linear performance.
E.g., if your integers have range 1..L, you can allocate a bit array of length L, initialize it to 0, and iterate over your input array, checking and flipping the appropriate bit for each input.
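A sketch of that bit-array idea in C, with a plain bool array for clarity (the bound L is an arbitrary illustrative choice; a real bit array would use one bit rather than one byte per value):

```c
#include <stdbool.h>
#include <string.h>

#define L 1000   /* assumed value range: 1..L */

/* O(n) duplicate check when all values are known to lie in 1..L. */
bool has_duplicate(const int *a, int n)
{
    bool seen[L + 1];
    memset(seen, 0, sizeof seen);
    for (int i = 0; i < n; i++) {
        if (seen[a[i]]) return true;   /* already flagged: duplicate */
        seen[a[i]] = true;
    }
    return false;
}
```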
A variant of Bucket Sort will do. This gives you complexity of O(n), where n is the number of input elements.
But there is one restriction: the maximum value. You should know the maximum value your integer array can take; let's call it m.
The idea is to create a bool array of size m+1 (all initialized to false), then iterate over your array. As you visit each element e, check bucket[e]: if it is already true, you've encountered a duplicate; otherwise set bucket[e] to true.
Java code:
// alternatively, you can iterate over the array to find maxVal, which again is O(n).
public boolean findDup(int[] arr, int maxVal)
{
    // Java initializes boolean array elements to false by default.
    boolean bucket[] = new boolean[maxVal + 1]; // +1 so the value maxVal itself fits
    for (int elem : arr)
    {
        if (bucket[elem])
        {
            return true; // a duplicate found
        }
        bucket[elem] = true;
    }
    return false;
}
But the constraint here is the space. You need O(maxVal) space.
Nested loops get you O(N*M) or O(N*log(M)); for O(N) you cannot use nested loops!
I would do it with a histogram instead:
DWORD in[N] = { ... };  // input data ... values are from < 0 , M )
DWORD his[M] = { ... }; // histogram of in[]
int i, j;

// compute histogram O(N)
for (i = 0; i < M; i++) his[i] = 0;    // this can also be done with memset ...
for (i = 0; i < N; i++) his[in[i]]++;  // if the value range does not start at 0 then shift it ...

// remove duplicates O(N)
for (i = 0, j = 0; i < N; i++)
{
    his[in[i]]--;               // count down duplicates
    in[j] = in[i];              // copy item
    if (his[in[i]] <= 0) j++;   // keep it only when no further copies remain
}
// now j holds the new in[] array size
[Notes]
If the value range is too big, with sparse areas, then you need to convert his[] to a dynamic list with two values per item: the value from in[] and its occurrence count. But then you need a nested loop -> O(N*M), or with binary search -> O(N*log(M)).

find pair of numbers whose difference is an input value 'k' in an unsorted array

As mentioned in the title, I want to find the pairs of elements whose difference is k,
for example k=4 and a[] = {7, 6, 23, 19, 10, 11, 9, 3, 15}.
The output should be:
7,11
7,3
6,10
19,23
15,19
15,11
I have read the previous SO posts on "find pair of numbers in array that add to given sum".
How much time does an efficient solution take: is the time complexity O(nlogn), or O(n)?
I tried to do this with a divide and conquer technique, but I'm not getting any clue about the exit condition...
If an efficient solution involves sorting the input array and manipulating elements using two pointers, then I think it should take at minimum O(nlogn)...
Is there any math-related technique that brings the solution down to O(n)? Any help is appreciated.
You can do it in O(n) with a hash table. Put all numbers in the hash for O(n), then go through them all again looking for number[i]+k. Hash table returns "Yes" or "No" in O(1), and you need to go through all numbers, so the total is O(n). Any set structure with O(1) setting and O(1) checking time will work instead of a hash table.
A simple solution in O(n*Log(n)) is to sort your array and then go through your array with this function:
void find_pairs(int n, int array[], int k)
{
    int first = 0;
    int second = 0;
    while (second < n)
    {
        /* bound checks added so second cannot run past the array */
        while (second < n && array[second] < array[first] + k)
            second++;
        if (second < n && array[second] == array[first] + k)
            printf("%d, %d\n", array[first], array[second]);
        first++;
    }
}
This solution does not use extra space unlike the solution with a hashtable.
One thing that may be done, using indexing, in O(n):
Take a boolean array arr indexed by the values in the input list.
For each integer i in the input list, set arr[i] = true.
Traverse the entire arr from the lowest integer to the highest as follows:
whenever you find a true at the ith index, note down this index;
see whether arr[i+k] is true. If yes, then the numbers i and i+k are a required pair;
else continue with the next integer, i+1.
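A sketch of this indexing idea in C, assuming non-negative values bounded by an illustrative constant MAXV (note a plain presence array collapses repeated values, so each pair is reported once):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAXV 100   /* assumed upper bound on the input values */

/* Print every pair (v, v+k) whose members both occur in the list;
   O(n + MAXV) time.  Returns the number of pairs found. */
int print_diff_pairs(const int *a, int n, int k)
{
    bool present[MAXV + 1];
    memset(present, 0, sizeof present);
    for (int i = 0; i < n; i++)
        present[a[i]] = true;

    int pairs = 0;
    for (int v = 0; v + k <= MAXV; v++)
        if (present[v] && present[v + k]) {
            printf("%d, %d\n", v, v + k);
            pairs++;
        }
    return pairs;
}
```

For the example input {7, 6, 23, 19, 10, 11, 9, 3, 15} with k = 4, this reports the six pairs listed in the question.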

Compare two integer arrays with same length

[Description] Given two integer arrays with the same length. Design an algorithm which can judge whether they're the same. The definition of "same" is that, if these two arrays were in sorted order, the elements in corresponding position should be the same.
[Example]
<1 2 3 4> = <3 1 2 4>
<1 2 3 4> != <3 4 1 1>
[Limitation] The algorithm should require constant extra space, and O(n) running time.
(Probably too complex for an interview question.)
(You can use O(N) time to check the min, max, sum, sumsq, etc. are equal first.)
Use no-extra-space radix sort to sort the two arrays in-place. O(N) time complexity, O(1) space.
Then compare them using the usual algorithm. O(N) time complexity, O(1) space.
(Provided (max − min) of the arrays is O(N^k) for some finite k.)
You can try a probabilistic approach: convert each array into a number in some huge base B and mod by some prime P, for example sum B^a_i over all i, mod some big-ish prime P. If both arrays come out to the same number, try again with as many primes as you want. If the hashes differ on any attempt, then the arrays are not equal. If they pass enough challenges, then they are equal with high probability.
There's a trivial proof for B > N, P > biggest number: in that case, unequal arrays must fail some challenge. This is actually the deterministic approach, though the complexity analysis might be more difficult, depending on whether you view the complexity in terms of the size of the input values (as opposed to just the number of elements).
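A sketch of this probabilistic check in C (the base, the primes, and the names are arbitrary illustrative choices; the primes are kept below 2^30 so 64-bit products cannot overflow):

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return a * b % p;   /* safe: p < 2^30, so a*b < 2^60 */
}

/* Hash an array as sum of base^a[i] mod p (fast exponentiation per term).
   Negative values cast to huge exponents, which is fine: the hash only
   needs to be consistent for equal values. */
static uint64_t sig(const int *a, int n, uint64_t base, uint64_t p)
{
    uint64_t s = 0;
    for (int i = 0; i < n; i++) {
        uint64_t r = 1, b = base, e = (uint64_t)a[i];
        while (e) {
            if (e & 1) r = mulmod(r, b, p);
            b = mulmod(b, b, p);
            e >>= 1;
        }
        s = (s + r) % p;
    }
    return s;
}

/* Equal multisets always agree; unequal ones collide only with
   small probability per prime. */
bool probably_same(const int *a, const int *b, int n)
{
    static const uint64_t primes[] = {1000000007ULL, 998244353ULL};
    for (int i = 0; i < 2; i++)
        if (sig(a, n, 131, primes[i]) != sig(b, n, 131, primes[i]))
            return false;
    return true;
}
```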
I claim that unless the range of the input is specified, it is IMPOSSIBLE to solve in constant extra space and O(n) running time.
I will be happy to be proven wrong, so that I can learn something new.
Insert all elements from the first array into a hashtable.
Try to insert all elements from the second array into the same hashtable: for each insertion, the element should already be there.
OK, this is not constant extra space, but it is the best I could come up with at the moment :-). Are there any other constraints imposed on the question, for example on the biggest integer that may be included in the array?
A few answers are basically correct, even though they don't look like it. The hash table approach (for one example) has an upper limit based on the range of the type involved rather than the number of elements in the arrays. At least by most definitions, that makes the (upper limit on the) space a constant, although the constant may be quite large.
In theory, you could change that from an upper limit to a true constant amount of space. Just for example, if you were working in C or C++, and it was an array of char, you could use something like:
size_t counts[UCHAR_MAX];
Since UCHAR_MAX is a constant, the amount of space used by the array is also a constant.
Edit: I'd note for the record that a bound on the ranges/sizes of items involved is implicit in nearly all descriptions of algorithmic complexity. Just for example, we all "know" that Quicksort is an O(N log N) algorithm. That's only true, however, if we assume that comparing and swapping the items being sorted takes constant time, which can only be true if we bound the range. If the range of items involved is large enough that we can no longer treat a comparison or a swap as taking constant time, then its complexity would become something like O(N log N log R), where R is the range, so log R approximates the number of bits necessary to represent an item.
Is this a trick question? If the authors assumed integers to be within a given range (2^32 etc.) then "extra constant space" might simply be an array of size 2^32 in which you count the occurrences in both lists.
If the integers are unranged, it cannot be done.
You could add each element into a hashmap<Integer, Integer>, with the following rules: Array A is the adder, array B is the remover. When inserting from Array A, if the key does not exist, insert it with a value of 1. If the key exists, increment the value (keep a count). When removing, if the key exists and is greater than 1, reduce it by 1. If the key exists and is 1, remove the element.
Run through array A followed by array B using the rules above. If at any time during the removal phase array B does not find an element, you can immediately return false. If after both the adder and remover are finished the hashmap is empty, the arrays are equivalent.
Edit: The size of the hashtable will be equal to the number of distinct values in the array. Does this fit the definition of constant space?
I imagine the solution will require some sort of transformation that is both associative and commutative and guarantees a unique result for a unique set of inputs. However I'm not sure if that even exists.
public static boolean match(int[] array1, int[] array2) {
    int x, y = 0;
    for (x = 0; x < array1.length; x++) {
        y = x;
        while (array1[x] != array2[y]) {
            if (y + 1 == array1.length)
                return false;
            y++;
        }
        int swap = array2[x];
        array2[x] = array2[y];
        array2[y] = swap;
    }
    return true;
}
For each array, use the counting sort technique to build the count of elements less than or equal to each particular value, then compare the two auxiliary arrays at every index: if they are equal at every index, the arrays are equal; otherwise they are not. Counting sort requires O(n), and the index-by-index comparison is again O(n), so the total is O(n); the space required is proportional to the value range (plus the two auxiliary arrays). Here is a link to counting sort: http://en.wikipedia.org/wiki/Counting_sort
Given that the ints are in the range -n..+n, a simple way to check for equality may be the following (pseudocode):
// a & b are the arrays
accumulator = 0
arraysize = size(a)
for (i = 0; i < arraysize; ++i) {
    accumulator = accumulator + a[i] - b[i]
    if abs(accumulator) > ((arraysize - i) * n) { return FALSE }
}
return (accumulator == 0)
The accumulator must be able to store integers in the range +- (arraysize * n).
How 'bout this - XOR all the numbers in both the arrays. If the result is 0, you got a match.
