Counting unique elements in a large array - arrays

One of my colleagues was asked this question in an interview.
Given a huge array of unsigned ints with length 100000000, find an efficient way to count how many times each distinct element occurs in the array.
E.g. arr = {2,34,5,6,7,2,2,5,1,34,5}
O/p: Count of 2 is 3, Count of 34 is 2 and so on.
What are efficient algorithms to do this? My first thought was a dictionary/hash table, but since the array is so large that seems inefficient. Is there a better way?

Heap sort is O(n log n) and in-place. In-place matters when dealing with data sets this large. Once the array is sorted you can make one pass through it, tallying occurrences of each value; because the array is sorted, once a value changes you know you've seen all occurrences of the previous value.
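
A minimal sketch of the sort-then-scan idea in C++ (using std::sort here for brevity; std::make_heap followed by std::sort_heap would be the strictly heapsort variant):
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<unsigned> arr = {2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5};
    std::sort(arr.begin(), arr.end());                  // O(n log n), in place

    // One pass: each run of equal values is contiguous after sorting.
    for (std::size_t i = 0; i < arr.size(); ) {
        std::size_t j = i;
        while (j < arr.size() && arr[j] == arr[i]) ++j;  // find the end of the run
        std::printf("Count of %u is %zu\n", arr[i], j - i);
        i = j;
    }
}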

Many other posters have suggested sorting the data and then finding the number of adjacent values, but no one has mentioned using radix sort yet to get the runtime to be O(n lg U) (where U is the maximum value in the array) instead of O(n lg n). Since lg U = O(lg n), assuming that integers take up one machine word, this approach is asymptotically faster than heapsort.
Non-comparison sorts are always fun in interviews. :-)
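
To make that concrete, here is a rough byte-wise LSD radix sort for 32-bit unsigned values, which is one common way to get the O(n lg U) bound; unlike heapsort it is not in-place (it needs an O(n) scratch buffer). The same single counting pass as in the sorted-array answers then gives the per-value counts.
#include <array>
#include <cstdint>
#include <vector>

// LSD radix sort on bytes: four O(n) counting passes for 32-bit keys,
// i.e. O(n * lg(U) / 8) overall. Needs an O(n) temporary buffer.
void radix_sort(std::vector<std::uint32_t>& a) {
    std::vector<std::uint32_t> tmp(a.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> offset{};                     // bucket start positions
        for (std::uint32_t x : a) ++offset[((x >> shift) & 0xFF) + 1];
        for (int b = 0; b < 256; ++b) offset[b + 1] += offset[b];  // prefix sums
        for (std::uint32_t x : a) tmp[offset[(x >> shift) & 0xFF]++] = x;
        a.swap(tmp);                                               // even number of passes: result ends in 'a'
    }
}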

Sort it, then scan it from the beginning to determine the counts for each item.
This approach requires no additional storage, and can be done in O(n log n) time (for the sort).

If the range of the int values is limited, then you may allocate an array, which serves to count the occurrences for each possible value. Then you just iterate through your huge array and increment the counters.
foreach x in huge_array {
    counter[x]++;
}
Thus you find the solution in linear time (O(n)), but at the expense of memory consumption. That is, if your ints span the whole range allowed by 32-bit ints, you would need to allocate an array of 4G ints, which is impractical...
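
A sketch of the counting-array idea, assuming purely for illustration that all values are known to be below 1,000,000 (so the counter array fits comfortably in memory):
#include <cstdio>
#include <vector>

int main() {
    const unsigned MAX_VALUE = 1000000;            // assumed bound on the values
    std::vector<unsigned> huge_array = {2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5};

    std::vector<std::size_t> counter(MAX_VALUE + 1, 0);
    for (unsigned x : huge_array) ++counter[x];    // one O(n) pass

    for (unsigned v = 0; v <= MAX_VALUE; ++v)      // O(MAX_VALUE) pass over the counters
        if (counter[v] > 0)
            std::printf("Count of %u is %zu\n", v, counter[v]);
}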

How about using a Bloom filter implementation like http://code.google.com/p/java-bloomfilter/ ?
First do bloom.contains(element); if true, continue; if false, do bloom.add(element).
At the end, count the number of elements added. A Bloom filter needs roughly 125 MB of memory to store 100000000 elements at 10 bits per element.
The problem is that false positives are possible in Bloom filters and can only be made less likely by increasing the number of bits per element. This could be addressed by two Bloom filters with different hashing that both need to agree.

Hashing in this case is not inefficient. The cost will be approximately O(N) (O(N) for iterating over the array and ~O(N) for iterating over the hashtable). Since you need to look at every element anyway, O(N) is as good as it gets.
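
For what it's worth, a minimal sketch of the hash-table approach with std::unordered_map; reserving up front avoids repeated rehashing on an input this size:
#include <cstdio>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<unsigned> arr = {2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5};

    std::unordered_map<unsigned, std::size_t> freq;
    freq.reserve(arr.size());                    // avoid rehashing while inserting
    for (unsigned x : arr) ++freq[x];            // expected O(1) per element, O(n) total

    std::printf("Distinct values: %zu\n", freq.size());
    for (const auto& kv : freq)
        std::printf("Count of %u is %zu\n", kv.first, kv.second);
}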

Sorting is a good idea. However, the type of sort depends on the range of possible values: for a small range, counting sort would be good. When dealing with such a big array it would also be efficient to use multiple cores; radix sort might be a good fit.

Here is a variation of the problem that might help you find the number of distinct elements; it maintains the count of distinct values while the array is updated by queries.
#include <bits/stdc++.h>
using namespace std;
#define ll long long int
#define ump unordered_map

void file_i_o()
{
    ios_base::sync_with_stdio(0);
    cin.tie(0);
    cout.tie(0);
#ifndef ONLINE_JUDGE
    freopen("input.txt", "r", stdin);
    freopen("output.txt", "w", stdout);
#endif
}

int main() {
    file_i_o();
    ll t;
    cin >> t;
    while (t--)
    {
        int n, q;
        cin >> n >> q;
        ump<int, int> num;             // value -> frequency; num.size() is the distinct count
        int x;
        vector<int> arr(n + 1);        // 1-indexed current contents of the array
        int a, b;
        for (int i = 1; i <= n; i++)
        {
            cin >> x;
            arr[i] = x;
            num[x]++;
        }
        for (int i = 0; i < q; i++)    // each query replaces arr[a] with b
        {
            cin >> a >> b;
            num[arr[a]]--;
            if (num[arr[a]] == 0)
            {
                num.erase(arr[a]);     // old value no longer present
            }
            arr[a] = b;
            num[b]++;
            cout << num.size() << "\n";   // distinct elements after the update
        }
    }
    return 0;
}

Related

What kind of drawbacks are there, performance-wise, if I sort an array by using hashing?

I wrote a simple function to sort an array int a[] using a hash.
For that I stored the frequency of every element in a new array hash1[] and then put the elements back into the original array, in order, in linear time.
#include<bits/stdc++.h>
using namespace std;

int hash1[10000];                     // frequency table; assumes all values are in [0, 10000)

void sah(int a[], int n)
{
    int maxo = -1;
    for (int i = 0; i < n; i++)
    {
        hash1[a[i]]++;                // count each value
        if (maxo < a[i]) { maxo = a[i]; }
    }
    int i = 0, freq = 0, idx = 0;
    while (i < maxo + 1)              // write each value back, in order, freq times
    {
        freq = hash1[i];
        if (freq > 0)
        {
            while (freq > 0)
            {
                a[idx++] = i;
                freq--;
            }
        }
        i++;
    }
}

int main()
{
    int a[] = {6, 8, 9, 22, 33, 59, 12, 5, 99, 12, 57, 7};
    int n = sizeof(a) / sizeof(a[0]);
    sah(a, n);
    for (int i = 0; i < n; i++)
    {
        printf("%d ", a[i]);
    }
}
This algorithm runs in O(max_element). What kinds of disadvantages am I facing here, considering only performance (time and space)?
The algorithm you've implemented is called counting sort. Its runtime is O(n + U), where n is the total number of elements and U is the maximum value in the array (assuming the numbers go from 0 to U), and its space usage is Θ(U). Your particular implementation assumes that U = 10,000. Although you've described your approach as "hashing," this really isn't so much a hash (computing some function of the elements and using that to put them into buckets) as a distribution (spreading elements around according to their values).
If U is a fixed constant - as it is in your case - then the runtime is O(n) and the space usage is O(1), though remember that big-O talks about long-term growth rates and that if U is large the runtime can be pretty high. This makes it attractive if you're sorting very large arrays with a restricted range of values. However, if the range of values can be large, this is not a particularly good approach. Interestingly, you can think of radix sort as an algorithm that repeatedly runs counting sort with U = 10 (if using the base-10 digits of the numbers) or U = 2 (if going in binary) and has a runtime of O(n log U), which is strongly preferable for large values of U.
You can clean up this code in a number of ways. For example, you have an if statement and a while loop with the same condition, which can be combined together into a single while loop. You also might want to put in some assert checks to make sure all the values are in the range from 0 to 9,999, inclusive, since otherwise you'll have a bounds error. Additionally, you could consider making the global array either a local variable (though watch your stack usage) or a static local variable (to avoid polluting the global namespace). You could alternatively have the user pass in a parameter specifying the maximum size or could calculate it yourself.
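
As an illustration only, a cleaned-up version along those lines might look like the following; it computes the maximum from the data instead of hard-coding 10,000, asserts that values are non-negative, and folds the if/while pair into a single loop:
#include <cassert>
#include <cstdio>
#include <vector>

// Counting sort: O(n + U) time, O(U) space, valid for non-negative ints.
void counting_sort(std::vector<int>& a) {
    if (a.empty()) return;
    int maxo = a[0];
    for (int x : a) {
        assert(x >= 0);                  // required: values must be non-negative
        if (x > maxo) maxo = x;
    }
    std::vector<int> count(maxo + 1, 0); // sized from the data, not a fixed 10,000
    for (int x : a) ++count[x];

    std::size_t idx = 0;
    for (int v = 0; v <= maxo; ++v)
        while (count[v]-- > 0)           // the if + while collapsed into one loop
            a[idx++] = v;
}

int main() {
    std::vector<int> a = {6, 8, 9, 22, 33, 59, 12, 5, 99, 12, 57, 7};
    counting_sort(a);
    for (int x : a) std::printf("%d ", x);
    std::printf("\n");
}
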
Issues you may consider:
Input validation: what if the user enters -10 or a very large value?
If the maximum element is large, you will at some point take a performance hit once the L1 cache is exhausted. The hash1 array will compete for memory bandwidth with the a array. When I implemented radix sorting in the past, I found that 8 bits per iteration was fastest.
The time complexity is actually O(max_element + number_of_elements). E.g. what if you sorted 2 million ones or zeros? That is not as fast as sorting 2 ones or zeros.

Finding no. of shifts in Insertion sort for large inputs in C

I'm trying to write a program that counts the number of swaps made by insertion sort. My program works on small inputs, but produces the wrong answer on large inputs. I'm also not sure how to use the long int type.
This problem came up in a setting described at https://drive.google.com/file/d/0BxOMrMV58jtmNF9EcUNQNGpreDQ/edit?usp=sharing
Input is given as
The first line contains the number of test cases T. T test cases follow.
The first line for each case contains N, the number of elements to be sorted.
The next line contains N integers a[1],a[2]...,a[N].
The code I used is:
#include <stdio.h>
#include <stdlib.h>

int insertionSort(int ar_size, int *ar)
{
    int i, j, temp, count;
    count = 0;
    int n = ar_size;
    for (i = 0; i < n - 1; i++)
    {
        j = i;
        while (j >= 0 && ar[j + 1] < ar[j])   /* stop at the front of the array */
        {
            temp = ar[j + 1];
            ar[j + 1] = ar[j];
            ar[j] = temp;
            j--;
            count++;                          /* one count per swap/shift */
        }
    }
    return count;
}

int main()
{
    int _ar_size, tc, i, _ar_i;
    scanf("%d", &tc);
    int sum = 0;
    for (i = 0; i < tc; i++)
    {
        scanf("%d", &_ar_size);
        int *_ar = (int *)malloc(sizeof(int) * _ar_size);
        for (_ar_i = 0; _ar_i < _ar_size; _ar_i++)
        {
            scanf("%d", &_ar[_ar_i]);
        }
        sum = insertionSort(_ar_size, _ar);
        printf("%d\n", sum);
        free(_ar);
    }
    return 0;
}
There are two issues that I currently see with the solution you have.
First, there's an issue brought up in the comments about integer overflow. On most systems, the int type can hold numbers up through 2^31 - 1. In insertion sort, the number of swaps that need to be made in the worst case on an array of length n is n(n - 1) / 2 (details later), so for an array of size 2^17, you may end up not being able to store the number of swaps that you need inside an int. To address this, consider using a larger integer type. For example, the uint64_t type can store numbers up to roughly 10^18, which should be good enough to store the answer for arrays up to length around 10^9. You mentioned that you're not sure how to use it, but the good news is that it's not that hard. Just add the line
#include <stdint.h>
(for C) or
#include <cstdint>
(for C++) to the top of your program. After that, you should just be able to use uint64_t in place of int without making any other modifications and everything should work out just fine.
Next, there's an issue of efficiency. The code you've posted essentially runs insertion sort and therefore takes time O(n^2) in the worst case. For large inputs - say, inputs around size 10^8 - this is prohibitively expensive. Amazingly, though, you can actually determine how many swaps insertion sort will make without actually running insertion sort.
In insertion sort, the number of swaps made is equal to the number of inversions that exist in the input array (an inversion is a pair of elements that are out of order). There's a beautiful divide-and-conquer algorithm for counting inversions that runs in time O(n log n), which likely will scale up to work on much larger inputs than just running insertion sort. I think that the "best" answer to this question would be to use this algorithm, while taking care to use the uint64_t type or some other type like it, since it will make your algorithm work correctly on much larger inputs.
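
A sketch of that divide-and-conquer inversion count (in C++ for brevity, though the same structure works in C); the running counter is a uint64_t since the answer can be as large as n(n - 1)/2. The names here are illustrative:
#include <cstdint>
#include <cstdio>
#include <vector>

// Merge sort while counting inversions: O(n log n) time, O(n) extra space.
static std::uint64_t merge_count(std::vector<int>& a, std::vector<int>& buf,
                                 std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return 0;
    std::size_t mid = lo + (hi - lo) / 2;
    std::uint64_t inv = merge_count(a, buf, lo, mid) + merge_count(a, buf, mid, hi);

    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (a[j] < a[i]) {
            inv += mid - i;                  // a[i..mid) are all greater than a[j]
            buf[k++] = a[j++];
        } else {
            buf[k++] = a[i++];
        }
    }
    while (i < mid) buf[k++] = a[i++];
    while (j < hi)  buf[k++] = a[j++];
    for (std::size_t t = lo; t < hi; ++t) a[t] = buf[t];
    return inv;
}

std::uint64_t count_inversions(std::vector<int> a) {   // works on a copy of the input
    std::vector<int> buf(a.size());
    return merge_count(a, buf, 0, a.size());
}

int main() {
    // 4 inversions: (2,1), (2,1), (3,1), (3,2)
    std::printf("%llu\n", (unsigned long long)count_inversions({2, 1, 3, 1, 2}));
}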

find pair of numbers whose difference is an input value 'k' in an unsorted array

As mentioned in the title, I want to find the pairs of elements whose difference is K
Example: k = 4 and a[] = {7, 6, 23, 19, 10, 11, 9, 3, 15}
output should be :
7,11
7,3
6,10
19,23
15,19
15,11
I have read the previous posts on SO about "find pair of numbers in array that add to given sum".
How much time does an efficient solution take? Is the time complexity O(n log n) or O(n)?
I tried to do this with a divide and conquer technique, but I'm not getting any clue of an exit condition...
If an efficient solution involves sorting the input array and walking it with two pointers, then I think it takes at least O(n log n)...
Is there any math-related technique that brings the solution down to O(n)? Any help is appreciated.
You can do it in O(n) with a hash table. Put all numbers in the hash for O(n), then go through them all again looking for number[i]+k. Hash table returns "Yes" or "No" in O(1), and you need to go through all numbers, so the total is O(n). Any set structure with O(1) setting and O(1) checking time will work instead of a hash table.
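
A short sketch of that in C++ with std::unordered_set (the values here are the example from the question):
#include <cstdio>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> a = {7, 6, 23, 19, 10, 11, 9, 3, 15};
    int k = 4;

    std::unordered_set<int> seen(a.begin(), a.end());   // expected O(n) to build
    for (int x : a)                                      // second O(n) pass
        if (seen.count(x + k))
            std::printf("%d, %d\n", x, x + k);           // x and x + k differ by exactly k
}
Note that if the array contains duplicates, each occurrence reports its pair; iterating over the set instead of the array (or deduplicating the output) avoids that.
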
A simple solution in O(n log n) is to sort your array and then go through it with this function:
void find_pairs(int n, int array[], int k)
{
    int first = 0;
    int second = 0;
    while (second < n)
    {
        /* advance 'second' until array[second] >= array[first] + k, staying inside the array */
        while (second < n && array[second] < array[first] + k)
            second++;
        if (second < n && array[second] == array[first] + k)
            printf("%d, %d\n", array[first], array[second]);
        first++;
    }
}
This solution does not use extra space, unlike the solution with a hashtable.
This may also be done with direct indexing in O(n); a sketch follows these steps.
Take a boolean array arr indexed by the values in the input list.
For each integer i in the input list, set arr[i] = true.
Traverse arr from the lowest integer to the highest, as follows:
whenever you find true at index i, note down this index;
check whether arr[i+k] is true. If yes, then i and i+k are a required pair;
else continue with the next integer, i+1.
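
A sketch of the indexing idea, assuming non-negative values and a known (or precomputed) maximum value:
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> input = {7, 6, 23, 19, 10, 11, 9, 3, 15};
    int k = 4;
    int max_value = 23;                        // assumed known, or found with one extra pass

    std::vector<bool> present(max_value + k + 1, false);
    for (int x : input) present[x] = true;     // mark every value that occurs

    for (int i = 0; i <= max_value; ++i)       // scan values in increasing order
        if (present[i] && present[i + k])
            std::printf("%d, %d\n", i, i + k);
}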

Find the single integer that occurs with even frequency in a given array of ints when all others occur with odd frequency

This is an interview question.
Given an array of integers, find the single integer value in the array which occurs with even frequency. All integers will be positive. All other numbers occur with odd frequency. The max number in the array can be INT_MAX.
For example, [2, 8, 6, 2] should return 2.
The original array can be modified if that lets you find a better solution, such as O(1) space with O(n) time.
I know how to solve it with a hashtable (traverse and count frequencies). That is O(n) time and space.
Is it possible to solve it in O(1) space or with better time?
Given this is an interview question, the answer is: O(1) space is achievable "for very big values of 1":
Prepare a matcharray 1..INT_MAX of all 0
When traversing the array, use the integer as an index into the matcharray, adding 1
When done, traverse the match array to find the one entry with a positive even value
The space for this is large, but independent of the size of the input array, so O(1) space. For really big data sets (say small value range, but enormous array length), this might even be a practically valid solution.
If you are allowed to sort the original array, I believe that you can do this in O(n lg U) time and O(lg U) space, where U is the maximum element of the array. The idea is as follows - using in-place MSD radix sort, sort the array in O(n lg U) time and O(lg U) space. Then, iterate across the array. Since all equal values are consecutive, you can then count how many times each value appears. Once you find the value that appears an even number of times, you can output the answer. This second scan requires O(n) time and O(1) space.
If we assume that U is a fixed constant, this gives an O(n)-time, O(1)-space algorithm. If you don't assume this, then the memory usage is still better than the O(n) algorithm provided that lg U = O(n), which should be true on most machines. Moreover, the space usage is only logarithmically as large as the largest element, meaning that the practical space usage is quite good. For example, on a 64-bit machine, we'd need only space sufficient to hold 64 recursive calls. This is much better than allocating a gigantic array up-front. Moreover, it means that the algorithm is a weakly-polynomial time algorithm as a function of U.
That said, this does rearrange the original array, and thus does destructively modify the input. In a sense, it's cheating because it uses the array itself for the O(n) storage space.
Hope this helps!
Scan through the list maintaining two sets, the 'Even' set and the 'Odd' set. If an element hasn't been seen before (i.e. if it's in neither set), place it in the 'Odd' set. If an element is in one set, move it to the other set. At the end, there should be only one item in the 'Even' set. This probably won't be fast, but the memory usage should be reasonable for large lists.
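
A sketch of the two-set idea using std::unordered_set; with hash sets the expected time is O(n), at the cost of O(n) extra space:
#include <cstdio>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> arr = {2, 8, 6, 2};

    std::unordered_set<int> odd, even;
    for (int x : arr) {
        if (odd.count(x)) { odd.erase(x); even.insert(x); }        // odd count -> even count
        else if (even.count(x)) { even.erase(x); odd.insert(x); }  // even count -> odd count
        else odd.insert(x);                                        // first sighting
    }
    // By the problem's guarantee, exactly one value remains in the 'even' set.
    if (!even.empty()) std::printf("%d\n", *even.begin());         // prints 2
}
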
-Make a hash table containing ints. Call it is_odd or something. Since you might have to look through an array of size INT_MAX, just make it an array of size INT_MAX. Initialize to 0.
-Traverse through the whole array. You have to do this. There's no way to beat O(n).
for each number:
    if it's not in the hash table, mark its spot in the table as 1.
    if it is in the hash table then:
        if its value is '1', make it '2'
        if its value is '2', make it '1'.
Now you have to traverse through the hash table. Pull out the sole entry with "2" as the value.
Time:
You traverse the array once and the hash table once, so O(n).
Space:
Just an array of size INT_MAX. Or if you know the range of your array you can restrict your memory use to that.
edit: I just saw that you already had this method. Sorry about that!
I guess we read the task improperly. It asks us "find the single integer value in the array which occurs with even frequency". So, assuming that there is exactly ONE even element, the solution is:
public static void main(String[] args) {
    int[] array = { 2, 1, 2, 4, 4 };
    int count = 0;
    for (int i : array) {
        count ^= i;
    }
    System.out.println(count); // Prints 1
}

Compare two integer arrays with same length

[Description] Given two integer arrays with the same length. Design an algorithm which can judge whether they're the same. The definition of "same" is that, if these two arrays were in sorted order, the elements in corresponding position should be the same.
[Example]
<1 2 3 4> = <3 1 2 4>
<1 2 3 4> != <3 4 1 1>
[Limitation] The algorithm should require constant extra space, and O(n) running time.
(Probably too complex for an interview question.)
(You can use O(N) time to check the min, max, sum, sumsq, etc. are equal first.)
Use no-extra-space radix sort to sort the two arrays in-place. O(N) time complexity, O(1) space.
Then compare them using the usual algorithm. O(N) time complexity, O(1) space.
(Provided (max − min) of the arrays is O(N^k) for some finite k.)
You can try a probabilistic approach - convert each array into a number in some huge base B, reduced mod some prime P; for example, compute the sum of B^a_i over all i, mod some big-ish P. If both arrays come out to the same number, try again for as many primes as you want. If the check fails at any attempt, the arrays are not equal. If they pass enough challenges, they are equal with high probability.
There's a trivial proof for B > N and P greater than the biggest number, so if the arrays differ there must be a challenge that cannot be met. This is actually the deterministic approach, though the complexity analysis might be more difficult, depending on how you view complexity in terms of the size of the input (as opposed to just the number of elements).
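
A rough sketch of that probabilistic check; the base B and prime P below are arbitrary illustrative choices, and in practice you would repeat the comparison with several independently chosen primes to drive the false-positive probability down:
#include <cstdint>
#include <cstdio>
#include <vector>

// (base^exp) mod p by fast exponentiation.
static std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t p) {
    std::uint64_t result = 1 % p;
    base %= p;
    while (exp > 0) {
        if (exp & 1) result = (result * base) % p;
        base = (base * base) % p;
        exp >>= 1;
    }
    return result;
}

// Sum of B^a_i mod P: equal multisets always give equal fingerprints,
// different multisets rarely collide for a well-chosen prime P.
static std::uint64_t fingerprint(const std::vector<std::uint32_t>& a,
                                 std::uint64_t B, std::uint64_t P) {
    std::uint64_t sum = 0;
    for (std::uint32_t x : a) sum = (sum + pow_mod(B, x, P)) % P;
    return sum;
}

bool probably_same(const std::vector<std::uint32_t>& a,
                   const std::vector<std::uint32_t>& b) {
    const std::uint64_t B = 1000003, P = 2147483647ULL;  // P = 2^31 - 1; keeps products inside 64 bits
    return fingerprint(a, B, P) == fingerprint(b, B, P);
}

int main() {
    std::printf("%d\n", probably_same({1, 2, 3, 4}, {3, 1, 2, 4}));   // prints 1
    std::printf("%d\n", probably_same({1, 2, 3, 4}, {3, 4, 1, 1}));   // prints 0
}
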
I claim that unless the range of the input is specified, it is IMPOSSIBLE to solve in constant extra space and O(n) running time.
I will be happy to be proven wrong, so that I can learn something new.
Insert all elements from the first array into a hashtable.
Try to insert all elements from the second array into the same hashtable; for each insert, the element should already be there.
OK, this is not constant extra space, but it's the best I could come up with at the moment :-). Are there any other constraints imposed on the question, like, for example, the biggest integer that may be included in the array?
A few answers are basically correct, even though they don't look like it. The hash table approach (for one example) has an upper limit based on the range of the type involved rather than the number of elements in the arrays. At least by most definitions, that makes the upper limit on the space a constant, although the constant may be quite large.
In theory, you could change that from an upper limit to a true constant amount of space. Just for example, if you were working in C or C++, and it was an array of char, you could use something like:
size_t counts[UCHAR_MAX + 1];   // one slot for every possible unsigned char value
Since UCHAR_MAX is a constant, the amount of space used by the array is also a constant.
Edit: I'd note for the record that a bound on the ranges/sizes of items involved is implicit in nearly all descriptions of algorithmic complexity. Just for example, we all "know" that Quicksort is an O(N log N) algorithm. That's only true, however, if we assume that comparing and swapping the items being sorted takes constant time, which can only be true if we bound the range. If the range of items involved is large enough that we can no longer treat a comparison or a swap as taking constant time, then its complexity would become something like O(N log N log R), where R is the range, so log R approximates the number of bits necessary to represent an item.
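
To make that concrete, a sketch of comparing two char arrays as multisets with such a fixed-size table; the table size depends only on the type's range, so the extra space is constant regardless of n:
#include <climits>
#include <cstddef>
#include <cstdio>

bool same_multiset(const unsigned char* a, const unsigned char* b, std::size_t n) {
    long long counts[UCHAR_MAX + 1] = {0};     // one slot per possible unsigned char value
    for (std::size_t i = 0; i < n; ++i) {
        ++counts[a[i]];                        // count up for the first array
        --counts[b[i]];                        // count down for the second
    }
    for (int v = 0; v <= UCHAR_MAX; ++v)
        if (counts[v] != 0) return false;      // some value occurs a different number of times
    return true;
}

int main() {
    const unsigned char a[] = "listen", b[] = "silent";
    std::printf("%d\n", same_multiset(a, b, 6));   // prints 1
}
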
Is this a trick question? If the authors assumed integers to be within a given range (2^32 etc.) then "extra constant space" might simply be an array of size 2^32 in which you count the occurrences in both lists.
If the integers are unranged, it cannot be done.
You could add each element into a hashmap<Integer, Integer>, with the following rules: Array A is the adder, array B is the remover. When inserting from Array A, if the key does not exist, insert it with a value of 1. If the key exists, increment the value (keep a count). When removing, if the key exists and is greater than 1, reduce it by 1. If the key exists and is 1, remove the element.
Run through array A followed by array B using the rules above. If at any time during the removal phase array B does not find an element, you can immediately return false. If after both the adder and remover are finished the hashmap is empty, the arrays are equivalent.
Edit: The size of the hashtable will be equal to the number of distinct values in the array; does this fit the definition of constant space?
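
A sketch of those adder/remover rules in C++ (std::unordered_map instead of a Java HashMap); as noted, the extra space is proportional to the number of distinct values rather than constant:
#include <cstdio>
#include <unordered_map>
#include <vector>

// Array A adds to the map, array B removes; the arrays match exactly when
// every removal finds its key and the map is empty at the end.
bool multiset_equal(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size()) return false;
    std::unordered_map<int, int> counts;
    for (int x : a) ++counts[x];                 // adder phase
    for (int x : b) {                            // remover phase
        auto it = counts.find(x);
        if (it == counts.end()) return false;    // b has a value a had fewer of (or never had)
        if (--it->second == 0) counts.erase(it);
    }
    return counts.empty();
}

int main() {
    std::printf("%d\n", multiset_equal({1, 2, 3, 4}, {3, 1, 2, 4}));   // prints 1
    std::printf("%d\n", multiset_equal({1, 2, 3, 4}, {3, 4, 1, 1}));   // prints 0
}
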
I imagine the solution will require some sort of transformation that is both associative and commutative and guarantees a unique result for a unique set of inputs. However I'm not sure if that even exists.
public static boolean match(int[] array1, int[] array2) {
    int x, y = 0;
    for (x = 0; x < array1.length; x++) {
        y = x;
        // scan the not-yet-matched tail of array2 for array1[x]
        while (array1[x] != array2[y]) {
            if (y + 1 == array1.length)
                return false;
            y++;
        }
        // swap the match into position x so it is not considered again
        int swap = array2[x];
        array2[x] = array2[y];
        array2[y] = swap;
    }
    return true;
}
For each array, use the counting sort technique to build a count of the number of elements less than or equal to each particular element. Then compare the two auxiliary arrays at every index; if they are equal at every index the arrays are equal, otherwise they are not. Counting sort takes O(n) and the comparison at every index is again O(n), so the total is O(n); the space required is the size of the two auxiliary arrays. Here is a link to counting sort: http://en.wikipedia.org/wiki/Counting_sort
Given that the ints are in the range -n..+n, a simple way to check for equality may be the following (pseudo code):
// a & b are the arrays
accumulator = 0
arraysize = size(a)
for (i = 0; i < arraysize; ++i) {
    accumulator = accumulator + a[i] - b[i]
    if abs(accumulator) > ((arraysize - i) * n) { return FALSE }
}
return (accumulator == 0)
The accumulator must be able to store integers in the range +- arraysize * n.
How 'bout this - XOR all the numbers in both the arrays. If the result is 0, you got a match.
