Most Frequent of every N Elements in C

I have a large array A of size 8388608 holding "relatively small" integers, each A[i] in the range [0, 131072], and I want to find the most frequently occurring element in every block of N=32 elements.
What would be faster,
A. Create an associative array B of size 131072, iterate through 32 elements, increment B[A[i]], then iterate through B, find the largest value, reset all elements in B to 0, repeat |A|/32 times.
B. qsort every 32 elements, find the largest range where A[i] == A[i-1] (and thus the most frequent element), repeat |A|/32 times.
(EDIT) C. Something else.

An improvement over the first approach is possible: there is no need to iterate through B, and B can stay a plain array of size 131072.
Every time you increment B[A[i]], look at the new value in that cell and compare it with a running highest_frequency_found_so_far. This starts at zero; whenever the new value is higher, it replaces the running maximum.
Alongside it, keep value_associated_with_that, the element that produced the highest count.
unsigned char B[131072] = {0}; /* zeroed once, outside the loop; a count never exceeds 32 */
for (size_t start = 0; start + 32 <= len; start += 32) { /* each block of 32 members of A; len is |A| */
    const int *block = &A[start]; /* the current 32-element sub-block */
    unsigned highest_frequency_found_so_far = 0;
    int value_associated_with_that = 0;
    for (int i = 0; i < 32; ++i) {
        const int a = block[i];
        const unsigned new_frequency = ++B[a];
        if (new_frequency > highest_frequency_found_so_far) {
            highest_frequency_found_so_far = new_frequency;
            value_associated_with_that = a;
        }
    }
    /* now, 'value_associated_with_that' is the most frequent element */
    /* Thanks to @AkiSuihkonen for pointing out a really simple way to reset B each time. */
    /* B is big; instead of zeroing each element explicitly, just do this loop to undo */
    /* the ++B[a] from earlier: */
    for (int i = 0; i < 32; ++i) { --B[block[i]]; }
}
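As a self-contained illustration of the counting-with-undo idea above, here is a minimal sketch. The function name block_modes, the output array out, and the table size 131073 (which assumes values from 0 to 131072 inclusive) are assumptions made for this example, not part of the original answer.

```c
#include <stddef.h>

#define BLOCK_SIZE 32
#define VALUE_LIMIT 131073 /* assumption: values lie in 0..131072 inclusive */

/* Writes the most frequent value of each full 32-element block of a[0..len)
 * into out[] and returns the number of blocks processed. On a tie, the value
 * that reaches the winning count first is reported. */
size_t block_modes(const int *a, size_t len, int *out)
{
    static unsigned char counts[VALUE_LIMIT]; /* zero-initialised once */
    size_t blocks = len / BLOCK_SIZE;

    for (size_t b = 0; b < blocks; ++b) {
        const int *blk = a + b * BLOCK_SIZE;
        unsigned best_count = 0;
        int best_value = 0;

        for (size_t i = 0; i < BLOCK_SIZE; ++i) {
            unsigned c = ++counts[blk[i]];
            if (c > best_count) {
                best_count = c;
                best_value = blk[i];
            }
        }
        /* Undo the increments so the table is all zeros for the next block. */
        for (size_t i = 0; i < BLOCK_SIZE; ++i)
            --counts[blk[i]];

        out[b] = best_value;
    }
    return blocks;
}
```

Only the 32 touched cells are written per block, so the large table is never scanned or cleared as a whole.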

what about a btree?
You only need a max of 32 nodes and can declare them up front.

Related

Given an array of integers of size n+1 consisting of the elements [1,n]. All elements are unique except one which is duplicated k times

I have been attempting to solve the following problem:
You are given an array of n+1 integers where all the elements lie in [1,n]. You are also given that one of the elements is duplicated a certain number of times, whilst the others are distinct. Develop an algorithm to find both the duplicated number and the number of times it is duplicated.
Here is my solution where I let k = number of duplications:
#include <vector>

struct LatticePoint { // holds the duplicate (a) and k (b)
    int a;
    int b;
    LatticePoint(int a_, int b_) : a(a_), b(b_) {}
};

LatticePoint findDuplicateAndK(const std::vector<int>& A) {
    int n = A.size() - 1;
    std::vector<int> Numbers(n);
    for (int i = 0; i < n + 1; ++i) {
        ++Numbers[A[i] - 1]; // A[i] is in range [1,n], so no out-of-bounds access
    }
    for (int i = 0; i < n; ++i) {
        if (Numbers[i] > 1) {
            int duplicate = i + 1;
            int k = Numbers[i] - 1;
            return LatticePoint{duplicate, k};
        }
    }
    return LatticePoint{0, 0}; // not reached for valid input
}
So, the basic idea is this: we go along the array and each time we see the number A[i] we increment the value of Numbers[A[i] - 1]. Since only the duplicate appears more than once, the entry of Numbers with value greater than 1 identifies the duplicate number (its index plus 1), and the value of that entry minus 1 is the number of duplications. This algorithm is O(n) in time complexity and O(n) in space.
I was wondering if someone had a solution that is better in time and/or space? (or indeed if there are any errors in my solution...)
You can reduce the scratch space to n bits instead of n ints, provided you either have or are willing to write a bitset with run-time specified size (see boost::dynamic_bitset).
You don't need to collect duplicate counts until you know which element is duplicated, and then you only need to keep that count. So all you need to track is whether you have previously seen the value (hence, n bits). Once you find the duplicated value, set count to 2 and run through the rest of the vector, incrementing count each time you hit an instance of the value. (You initialise count to 2, since by the time you get there, you will have seen exactly two of them.)
That's still O(n) space, but the constant factor is a lot smaller.
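A minimal C sketch of the n-bit approach described above, using a hand-rolled bit array in place of boost::dynamic_bitset. The names dup_result and find_dup_bitset are hypothetical, and the input is assumed to follow the problem statement (len = n+1 values in [1, n], exactly one value repeated).

```c
#include <limits.h>
#include <stdlib.h>

typedef struct {
    int value; /* the duplicated number, 0 if none was found */
    int count; /* how many times it occurs in total */
} dup_result;

dup_result find_dup_bitset(const int *a, size_t len)
{
    dup_result r = {0, 0};
    const size_t word_bits = sizeof(unsigned) * CHAR_BIT;
    /* One bit per possible value; values are at most len - 1. */
    size_t words = (len + word_bits - 1) / word_bits;
    unsigned *seen = calloc(words ? words : 1, sizeof *seen);
    if (seen == NULL)
        return r;

    size_t i = 0;
    for (; i < len; ++i) {
        size_t v = (size_t)a[i];
        unsigned mask = 1u << (v % word_bits);
        if (seen[v / word_bits] & mask) {
            /* Second sighting: this is the duplicate, seen twice so far. */
            r.value = a[i];
            r.count = 2;
            ++i;
            break;
        }
        seen[v / word_bits] |= mask;
    }
    free(seen);

    /* The bit array is no longer needed; just count the remaining copies. */
    for (; i < len; ++i)
        if (a[i] == r.value)
            ++r.count;
    return r;
}
```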
The idea of your code works.
But, thanks to the n+1 elements, we can achieve other tradeoffs of time and space.
If we have some number of buckets we're dividing numbers between, putting n+1 numbers in means that some bucket has to wind up with more than expected. This is a variant on the well-known pigeonhole principle.
So we use 2 buckets, one for the range 1..floor(n/2) and one for floor(n/2)+1..n. After one pass through the array, we know which half the answer is in. We then divide that half into halves, make another pass, and so on. This leads to a binary search which will get the answer with O(1) data, and with ceil(log_2(n)) passes, each taking time O(n). Therefore we get the answer in time O(n log(n)).
Now we don't need to use 2 buckets. If we used 3, we'd take ceil(log_3(n)) passes. So as we increased the fixed number of buckets, we take more space and save time. Are there other tradeoffs?
Well, you showed how to do it in 1 pass with n buckets. How many buckets do you need to do it in 2 passes? The answer turns out to be at least sqrt(n) buckets. And 3 passes is possible with the cube root. And so on.
So you get a whole family of tradeoffs where the more buckets you have, the more space you need, but the fewer passes. And your solution is merely at the extreme end, taking the most space and the least time.
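The two-bucket version of this idea could look roughly like the following sketch. The function name find_dup_halving and the count_out parameter are made up for illustration; the input is assumed to hold len = n+1 values in [1, n] with exactly one repeated value.

```c
#include <stddef.h>

/* Returns the duplicated value using O(1) extra space and about log2(n)
 * passes; stores its total number of occurrences in *count_out if non-NULL. */
int find_dup_halving(const int *a, size_t len, int *count_out)
{
    int lo = 1;
    int hi = (int)len - 1; /* n */

    /* Invariant: [lo, hi] contains more elements than it has distinct values. */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int in_lower = 0;
        for (size_t i = 0; i < len; ++i)
            if (a[i] >= lo && a[i] <= mid)
                ++in_lower;

        if (in_lower > mid - lo + 1)
            hi = mid;     /* lower bucket is over-full: duplicate is there */
        else
            lo = mid + 1; /* otherwise the upper bucket must be over-full */
    }

    int count = 0;
    for (size_t i = 0; i < len; ++i)
        if (a[i] == lo)
            ++count;
    if (count_out != NULL)
        *count_out = count;
    return lo;
}
```

Only a bucket that contains the duplicate can hold more elements than it has values, which is why a single count per pass is enough to pick the half.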
Here's a cheekier algorithm, which requires only constant space but rearranges the input vector. (It only reorders; all the original elements are still present at the end.)
It's still O(n) time, although that might not be completely obvious.
The idea is to try to rearrange the array so that A[i] is i, until we find the duplicate. The duplicate will show up when we try to put an element at the right index and it turns out that that index already holds that element. With that, we've found the duplicate; we have a value we want to move to A[j] but the same value is already at A[j]. We then scan through the rest of the array, incrementing the count every time we find another instance.
#include <utility>
#include <vector>

// A is taken by reference so that the reordering happens in place
// and no O(n) copy is made.
std::pair<int, int> count_dup(std::vector<int>& A) {
    /* Try to put each element in its "home" position (that is,
     * where the value is the same as the index). Since the
     * values start at 1, A[0] isn't home to anyone, so we start
     * the loop at 1.
     */
    int n = A.size();
    for (int i = 1; i < n; ++i) {
        while (A[i] != i) {
            int j = A[i];
            if (A[j] == j) {
                /* j is the duplicate. Now we need to count them.
                 * We have one at i. There's one at j, too, but we only
                 * need to add it if we're not going to run into it in
                 * the scan. And there might be one at position 0. After that,
                 * we just scan through the rest of the array.
                 */
                int count = 1;
                if (A[0] == j) ++count;
                if (j < i) ++count;
                for (++i; i < n; ++i) {
                    if (A[i] == j) ++count;
                }
                return std::make_pair(j, count);
            }
            /* This swap can only happen once per element. */
            std::swap(A[i], A[j]);
        }
    }
    /* If we get here, every element from 1 to n is at home.
     * So the duplicate must be A[0], and the duplicate count
     * must be 2.
     */
    return std::make_pair(A[0], 2);
}
A parallel solution with O(1) complexity is possible.
Introduce an array of atomic booleans and two atomic integers called duplicate and count. First set count to 1. Then access the array in parallel at the index positions of the numbers and perform a test-and-set operation on the boolean. If a boolean is set already, assign the number to duplicate and increment count.
This solution may not always perform better than the suggested sequential alternatives. Certainly not if all numbers are duplicates. Still, it has constant complexity in theory. Or maybe linear complexity in the number of duplicates. I am not quite sure. However, it should perform well when using many cores and especially if the test-and-set and increment operations are lock-free.

Compare two arrays and create new array with equal elements in C

The problem is to check two arrays for the same integer value and put matching values in a new array.
Let's say I have two arrays
a[n] = {2,5,2,7,8,4,2}
b[m] = {1,2,6,2,7,9,4,2,5,7,3}
Each array can be a different size.
I need to check if the arrays have matching elements and put them in a new array. The result in this case should be:
array[] = {2,2,2,5,7,4}
And I need to do it in O(n.log(n) + m.log(m)).
I know there is a way to do with merge sorting or put one of the array in a hash array but I really don't know how to implement it.
I will really appreciate your help, thanks!!!
As you have already figured out, you can use merge sort (implementing it is beyond the scope of this answer; I suppose you can find a solution on Wikipedia or by searching on Stack Overflow) so that you get nlogn + mlogm complexity, supposing n is the size of the first array and m is the size of the other.
Let's call the first array a (with size n) and the second one b (with size m). First sort these arrays (merge sort would give us nlogn + mlogm complexity). And now we have:
a[n] // {2,2,2,4,5,7,8} and b[m] // {1,2,2,2,3,4,5,6,7,7,9}
Supposing n <= m, we can simply iterate over both simultaneously, comparing corresponding values.
But first let's allocate an array int c[n]; to store the results (you can print to the console instead of storing if you prefer). And now the loop itself:
int k = 0; // store the new size of c array!
for (int i = 0, j = 0; i < n && j < m; )
{
    if (a[i] == b[j])
    {
        // match found, store it
        c[k] = a[i];
        ++i; ++j; ++k;
    }
    else if (a[i] > b[j])
    {
        // current value in a is leading, go to next in b
        ++j;
    }
    else
    {
        // the last possibility is a[i] < b[j] - b is leading
        ++i;
    }
}
Note: the loop itself is n+m complexity at worst (remember the n <= m assumption), which is less than the cost of sorting, so the overall complexity is nlogn + mlogm. Now you can iterate over the c array (its size is actually n as we allocated, but the number of elements in it is k) and do what you need with those numbers.
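Putting the sort and the merge-style loop together, a complete version might look like the sketch below. It uses the standard library's qsort in place of a hand-written merge sort; qsort is usually O(n log n) in practice, but the C standard does not promise that, so swap in merge sort if the bound must be guaranteed. The names cmp_int and intersect_sorted are made up for this example.

```c
#include <stdlib.h>

/* Three-way comparison for qsort that cannot overflow. */
static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x;
    int b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Sorts a and b in place, writes the common elements (with multiplicity)
 * to c, and returns how many were written. c needs room for min(n, m). */
size_t intersect_sorted(int *a, size_t n, int *b, size_t m, int *c)
{
    qsort(a, n, sizeof *a, cmp_int);
    qsort(b, m, sizeof *b, cmp_int);

    size_t i = 0, j = 0, k = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) {
            c[k++] = a[i]; /* match found, consume one from each side */
            ++i;
            ++j;
        } else if (a[i] > b[j]) {
            ++j;           /* a is ahead, advance b */
        } else {
            ++i;           /* b is ahead, advance a */
        }
    }
    return k;
}
```

For the arrays in the question this yields 2, 2, 2, 4, 5, 7: the same multiset as the expected result, in sorted order.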
From the way that you explain it the way to do this would be to loop over the shorter array and check it against the longer array. Let us assume that A is the shorter array and B the longer array. Create a results array C.
Loop over each element in A, call it I
If I is found in B, remove it from B and put it in C, break out of the test loop.
Now go to the next element in A.
This means that if a number I is found twice in A and three times in B, then I will only appear twice in C. Once you finish, then every number found in both arrays will appear in C the number of times that it actually appears in both.
I am carefully not putting in suggested code as your question is about a method that you can use. You should figure out the code yourself.
I would be inclined to take the following approach:
1) Sort array B. There are many well published sort algorithms to do this, as well as several implementations in various generally available libraries.
2) Loop through array A and for each element do a binary search (or other suitable algorithm) on array B for a match. If a match is found, remove the element from array B (to avoid future matches) and add it to the output array.

Merge k sorted arrays using C

I need to merge k (1 <= k <= 16) sorted arrays into one sorted array. This is for a homework assignment and the Professor requires that this be done using an O(n) algorithm. Merging 2 arrays is no problem and I can do it easily using an O(n) algorithm. I feel that what my professor is asking is undoable for n arrays with an O(n) algorithm.
I am using the below algorithm to split the array indices and running InsertionSort on each partition. I could save these start and end indices into a 2D array. I just don't see how the merging can be done using O(n) because this is going to require more than one loop. If it is possible, does anyone have any hints? I'm not looking for actual code, just a hint as to where I should start.
int chunkSize = round(float(arraySize) / numThreads);
for (int i = 0; i < numThreads; i++) {
    int start = i * chunkSize;
    int end = start + chunkSize - 1;
    if (i == numThreads - 1) {
        end = arraySize - 1;
    }
    InsertionSort(&array[start], end - start + 1);
}
EDIT: The requirement is that the algorithm be O(n) where n is the number of elements in the array. Also, I need to solve this without using a min heap.
EDIT #2: Here is an algorithm I came up with. The problem here is that I'm not storing the result of each iteration back into the original array. I could just copy all of it back in a for loop but that would be expensive. Is there any way I can do this, other than using something like memcpy? In the below code, indices is a 2D array [numThreads][2] where indices[i][0] is the start index and indices[i][1] is the end index of the ith array.
void mergeArrays(int array[], int indices[][2], int threads, int result[]) {
    for (int i = 0; i < threads - 1; i++) {
        int resPos = 0;
        int lhsPos = 0;
        int lhsEnd = indices[i][1];
        int rhsPos = indices[i+1][0];
        int rhsEnd = indices[i+1][1];
        while (lhsPos <= lhsEnd && rhsPos <= rhsEnd) {
            if (array[lhsPos] <= array[rhsPos]) {
                result[resPos] = array[lhsPos];
                lhsPos++;
            } else {
                result[resPos] = array[rhsPos];
                rhsPos++;
            }
            resPos++;
        }
        while (lhsPos <= lhsEnd) {
            result[resPos] = array[lhsPos];
            lhsPos++;
            resPos++;
        }
        while (rhsPos <= rhsEnd) {
            result[resPos] = array[rhsPos];
            rhsPos++;
            resPos++;
        }
    }
}
You can merge K sorted arrays in one sorted array with O(N*log(K)) algorithm, using priority queue with K entries, where N is overall number of elements in all arrays.
If K is considered as constant value (it is limited by 16 in your case), then complexity is O(N).
Note again: N is number of elements in my post, not number of arrays.
It is impossible to merge arrays in O(K) - simple copy takes O(N)
Using the facts you provided:
(1) n is the number of arrays to merge;
(2) the arrays to be merged are already sorted;
(3) the merge needs to be of order n, that is linear in the number of arrays
(and NOT linear in the number of elements in each array, as you might mistakenly think at first sight).
Use the analogy of merging 4 sorted piles of cards, low to high, face up. You would pick the card with the lowest face value from one of the piles and put it (face down) on the merged deck, until all piles are exhausted.
For your program: keep a counter for each array for the number of elements you have already transferred to the output. This is at the same time an index to the next element in each array NOT merged in the output. Pick the smallest element that you find at one of these locations. You have to lookup the first waiting element in all the arrays for that, so that is of order n.
Also, I don't understand why the answer from MoB got up-votes, it does not answer the question.
Here is one way to do it (pseudocode)
input array[k][n]
init indices[k] = { 0, 0, 0, ... }
init queue = { empty priority queue }
for i in 0..k-1:
    insert i into queue with priority (array[i][0])
while queue is not empty:
    let x = pop queue
    output array[x][indices[x]]
    increment indices[x]
    if indices[x] < n:
        insert x into queue with priority (array[x][indices[x]])
This can probably be simplified further in C. You would have to find a suitable queue implementation to use though as there are none in libc.
Complexity for this operation:
"while queue is not empty" => O(n)
"insert x into queue ..." => O(log k)
=> O(n log k)
Which, if you consider k = constant, is O(n).
After sorting the k sub-arrays (the method doesn't matter), the code does a k-way merge. The simplest implementation does k-1 compares to determine the smallest leading element of each of the k arrays, then moves that element from its sub-array to the output array and gets the next element from that array. When the end of an array is reached, the algorithm drops down to a (k-1) way merge, then (k-2) way merge, finally there's just one sub-array left and it's copied. This will be O(n) time since k-1 is a constant.
The k-1 compares can be sped up by using a minimum heap (which is how some priority queues are implemented), but it's still O(n), with just a smaller constant. The heap needs to be initialized at the start, then updated each time an element is removed and a new one added.
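The heap-free variant described above, scanning the k leading elements on every step, can be sketched as follows. The name merge_k, the MAX_K limit of 16 (taken from the question's bound on k), and the pointer-plus-length interface are assumptions for this example.

```c
#include <stddef.h>

#define MAX_K 16 /* the question limits k to 16 */

/* Merges k sorted arrays into out, which must have room for the total number
 * of elements. Returns the number of elements written, or 0 if k > MAX_K. */
size_t merge_k(const int *const arrays[], const size_t lengths[],
               size_t k, int *out)
{
    size_t pos[MAX_K] = {0}; /* next unmerged index in each array */
    size_t written = 0;

    if (k > MAX_K)
        return 0;

    for (;;) {
        size_t best = k; /* k means "no array has elements left" */
        for (size_t i = 0; i < k; ++i) {
            if (pos[i] < lengths[i] &&
                (best == k || arrays[i][pos[i]] < arrays[best][pos[best]]))
                best = i;
        }
        if (best == k)
            break;
        out[written++] = arrays[best][pos[best]++];
    }
    return written;
}
```

Each output element costs at most k comparisons, so the total work is O(n * k), which is O(n) when k is bounded by a constant such as 16.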

Puzzle : finding out repeated element in an Array

The size of an array is n. All elements in the array are distinct and in the range [0, n-1], except for two elements that are equal. Find the repeated element without using an extra temporary array, with constant time complexity.
I tried with O(n) like this.
a[]={1,0,0,2,3};
b[]={-1,-1,-1,-1,-1};
i=0;
int required;
while(i<n)
{
    b[a[i]]++;
    if(b[a[i]]==1)
        required=a[i];
    i++;
}
print required;
If there is no constraint on the range of numbers, i.e. out-of-range values are also allowed, is it possible to get an O(n) solution without a temporary array?
XOR all the elements together, then XOR the result with XOR([0..n-1]).
This gives you missing XOR repeat; since missing!=repeat, at least one bit is set in missing XOR repeat.
Pick one of those set bits. Iterate over all the elements again, and only XOR elements with that bit set. Then iterate from 1 to n-1 and XOR those numbers that have that bit set.
Now, the value is either the repeated value or the missing value. Scan the elements for that value. If you find it, it's the repeated element. Otherwise, it's the missing value so XOR it with missing XOR repeat.
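The XOR steps above can be sketched in C as follows. The function name find_repeat_xor is hypothetical, and the input is assumed to match the puzzle: n values in [0, n-1] with exactly one value repeated once and therefore exactly one value missing.

```c
/* Returns the repeated value in a[0..n). */
int find_repeat_xor(const int *a, int n)
{
    unsigned x = 0; /* becomes missing XOR repeat */
    for (int i = 0; i < n; ++i)
        x ^= (unsigned)a[i] ^ (unsigned)i;

    /* Lowest set bit of x; missing and repeat differ in this bit. */
    unsigned bit = x & (~x + 1u);

    unsigned y = 0; /* becomes whichever of the two has that bit set */
    for (int i = 0; i < n; ++i) {
        if ((unsigned)a[i] & bit)
            y ^= (unsigned)a[i];
        if ((unsigned)i & bit)
            y ^= (unsigned)i;
    }

    /* If y occurs in the array it is the repeat; otherwise it is the
     * missing value and the repeat is y XOR (missing XOR repeat). */
    for (int i = 0; i < n; ++i)
        if ((unsigned)a[i] == y)
            return (int)y;
    return (int)(y ^ x);
}
```

This makes three passes over the array and uses a constant number of variables.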
Note the first and last numbers of the range.
Calculate SUM(1), the sum of the range without any duplicate (for example, the sum of 1...5 = 1+2+3+4+5 = 15). As AaronMcSmooth pointed out, the formula is Sum(1, n) = (n+1)n/2.
Calculate SUM(2), the sum of the elements in the array that is given to you.
Subtract SUM(2) - SUM(1). The result is the duplicate number (for example, if the given array is 1, 2, 3, 4, 5, 3, then SUM(2) is 18 and 18 - 15 = 3, so 3 is the duplicate). Note that this relies on every value of the range being present plus one extra copy; if the duplicate replaces another value, the difference is the duplicate minus the missing value instead.
Good luck coding!
Pick two distinct random indexes. If the array values at those indexes are the same, return true.
This operates in constant time. As a bonus, you get the right answer with probability 2/n * 1/(n-1).
O(n) without the temp array.
a[]={1,0,0,2,3};
i=0;
int required;
while(i<n)
{
    a[a[i] % n] += n;
    if(a[a[i] % n] >= 2 * n)
        required = a[i] % n;
    i++;
}
print required;
(Assuming of course that n < MAX_INT - 2n)
This example could be useful for int, char, and string.
char[] ch = { 'A', 'B', 'C', 'D', 'F', 'A', 'B' };
Dictionary<char, int> result = new Dictionary<char, int>();
foreach (char c in ch)
{
    if (result.Keys.Contains(c))
    {
        result[c] = result[c] + 1;
    }
    else
    {
        result.Add(c, 1);
    }
}
foreach (KeyValuePair<char, int> pair in result)
{
    if (pair.Value > 1)
    {
        Console.WriteLine(pair.Key);
    }
}
Console.Read();
Build a lookup table. Lookup. Done.
Non-temporary array solution:
Build lookup into gate array hardware, invoke.
The best I can do is O(n log n) in time and O(1) in space:
The basic idea is to perform a binary search of the values 0 through n-1, passing over the whole array of n elements at each step.
1) Initially, let i=0, j=n-1 and k=(i+j)/2.
2) On each run through the array, sum the elements whose values are in the range i to k, and count the number of elements in this range.
3) If the count of elements is less than k-i+1, then the range has the omitted value but not the duplicate. If the count equals k-i+1 and the sum is equal to (k-i)*(k-i+1)/2 + i*(k-i+1), then the range i through k has neither the duplicate nor the omitted value. (The count has to be checked along with the sum, because a duplicate of 0 leaves the sum unchanged.) In either case, replace i by k+1 and k by the new value of (i+j)/2.
4) Else, replace j by k and k by the new value of (i+j)/2.
5) If i!=j, goto 2.
The algorithm terminates with i==j and both equal to the duplicate element.
(Note: I edited this to simplify it. The old version could have found either the duplicate or the omitted element, and had to use Vlad's difference trick to find the duplicate if the initial search turned up the omitted value instead.)
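A minimal C sketch of this binary search over the value range. The function name find_repeat_bisect is made up for illustration. The sketch looks at the count before the sum, so that a duplicate of 0 (which leaves the sum unchanged) is still detected; the input is assumed to hold n values in [0, n-1] with one value repeated once and one omitted.

```c
/* Returns the repeated value in a[0..n) using O(1) space and O(n log n) time. */
int find_repeat_bisect(const int *a, int n)
{
    int i = 0;
    int j = n - 1;

    while (i != j) {
        int k = (i + j) / 2;
        long sum = 0;
        int count = 0;
        for (int p = 0; p < n; ++p) {
            if (a[p] >= i && a[p] <= k) {
                sum += a[p];
                ++count;
            }
        }

        int size = k - i + 1;
        long expected = (long)(k - i) * size / 2 + (long)i * size;

        if (count < size || (count == size && sum == expected))
            i = k + 1; /* range i..k does not contain the duplicate */
        else
            j = k;     /* range i..k contains the duplicate */
    }
    return i;
}
```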
Lazy solution: Put the elements to java.util.Set one by one by add(E) until getting add(E)==false.
Sorry no constant-time. HashMap:O(N), TreeSet:O(lgN * N).
Based on @sje's answer. Worst case is 2 passes through the array, no additional storage, non destructive.
O(n) without the temp array.
a[]={1,0,0,2,3};
i=0;
int required;
while (a[a[i] % n] < n)
    a[a[i++] % n] += n;
required = a[i] % n;
while (i-->0)
    a[a[i]%n]-=n;
print required;
(Assuming of course that n < MAX_INT/2)
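For reference, the same marking trick written as a compilable C function. The name find_repeat_marking is hypothetical; the input is assumed to contain values in [0, n-1] with at least one repeated value, and 2n must fit in an int.

```c
/* Returns the repeated value in a[0..n). The array is modified while the
 * function runs but is restored to its original contents before returning. */
int find_repeat_marking(int *a, int n)
{
    int i = 0;

    /* Mark value v as seen by adding n to a[v]; stop at the first value
     * whose cell is already marked. */
    while (a[a[i] % n] < n) {
        a[a[i] % n] += n;
        ++i;
    }
    int required = a[i] % n;

    /* Walk back over the processed prefix and remove the marks. */
    while (i-- > 0)
        a[a[i] % n] -= n;

    return required;
}
```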

how to calculate the mode of an unsorted array of integers in O(N)?

...using an iterative procedure (no hash table)?
It's not homework. And by mode I mean the most frequent number (statistical mode). I don't want to use a hash table because I want to know how it can be done iteratively.
OK Fantius, how bout this?
Sort the list with a RadixSort (BucketSort) algorithm (technically O(N) time; the numbers must be integers). Start at the first element, remember its value and start a count at 1. Iterate through the list, incrementing the count, until you reach a different value. If the count for that value is higher than the current high count, remember that value and count as the mode. If you get a tie with the high count, remember both (or all) numbers.
... yeah, yeah, the RadixSort is not an in-place sort, and thus involves something you could call a hashtable (a collection of collections indexed by the current digit). However, the hashtable is used to sort, not to calculate the mode.
I'm going to say that on an unsorted list, it would be impossible to compute the mode in linear time without involving a hashtable SOMEWHERE. On a sorted list, the second half of this algorithm works by just keeping track of the current max count.
Definitely sounds like homework. But, try this: go through the list once, and find the largest number. Create an array of integers with that many elements, all initialized to zero. Then, go through the list again, and for each number, increment the equivalent index of the array by 1. Finally, scan your array and return the index that has the highest value. This will execute in roughly linear time, whereas any algorithm that includes a sort will probably take NlogN time or worse. However, this solution is a memory hog; it'll basically create a bell plot just to give you one number from it.
Remember that many (but not all) languages use arrays that are zero-based, so when converting from a "natural" number to an index, subtract one, and then add one to go from index to natural number.
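The two-pass counting idea described above might be sketched in C like this. The function name mode_by_counting is made up for the example, and the values are assumed to be non-negative so they can be used directly as indices.

```c
#include <stdlib.h>

/* Returns the most frequent value in a[0..n), or -1 if n is 0 or the
 * count table cannot be allocated. Ties go to the value that reaches
 * the winning count first. */
int mode_by_counting(const int *a, size_t n)
{
    if (n == 0)
        return -1;

    /* First pass: find the largest value to size the count table. */
    int max = a[0];
    for (size_t i = 1; i < n; ++i)
        if (a[i] > max)
            max = a[i];

    size_t *counts = calloc((size_t)max + 1, sizeof *counts);
    if (counts == NULL)
        return -1;

    /* Second pass: count, tracking the best value as we go so that no
     * final scan over the table is needed. */
    int mode = a[0];
    size_t best = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t c = ++counts[a[i]];
        if (c > best) {
            best = c;
            mode = a[i];
        }
    }

    free(counts);
    return mode;
}
```

As the answer warns, the table size depends on the largest value rather than on n, so a single huge value makes this a memory hog.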
If you don't want to use a hash, use a modified binary search trie (with a counter per node). For each element in the array insert into the trie. If it already exists in the trie, increment the counter. At the end, find the node with the highest counter.
Of course you can also use a hashmap that maps to a counter variable and will work the same way. I don't understand your complaint about it not being iterative... You iterate through the array, and then you iterate through the members of the hashmap to find the highest counter.
Just use counting sort and look into the array which stores the number of occurrences for each entity.
I prepared two implementations in Python with different space and time complexity:
The first one uses an "occurrence array"; it is O(n + k) in terms of time complexity and needs S(k+1) space, where n is the size of the input and k is the greatest number in the input.
input =[1,2,3,8,4,6,1,3,7,9,6,1,9]

def find_max(tab):
    max=tab[0]
    for i in range(0,len(tab)):
        if tab[i] > max:
            max=tab[i]
    return max

C = [0]*(find_max(input)+1)
print(len(C))

def count_occurences(tab):
    max_occurence=C[0]
    max_occurence_index=0
    for i in range(0,len(tab)):
        C[tab[i]]=C[tab[i]]+1
        if C[tab[i]]>max_occurence:
            max_occurence = C[tab[i]]
            max_occurence_index=tab[i]
    return max_occurence_index

print(count_occurences(input))
NOTE: Imagine such pitiful example of input like an array [1, 10^8,1,1,1], there will be array of length k+1=100000001 needed.
The second solution assumes that we sort our input before searching for the mode. I used radix sort, which has time complexity O(kn), where k is the length of the longest number and n is the size of the input array. Then we have to iterate over the whole sorted array of size n to determine the longest run of equal numbers, which stands for the mode.
input =[1,2,3,8,4,6,1,3,7,9,6,1,9]

def radix_sort(A):
    len_A = len(A)
    mod = 10  # base 10: one bucket per decimal digit
    div = 1
    while True:
        the_buckets = [[], [], [], [], [], [], [], [], [], []]
        for value in A:
            ldigit = value % mod
            ldigit = ldigit // div  # integer division, so this is a valid index
            the_buckets[ldigit].append(value)
        mod = mod * 10
        div = div * 10
        # every value landed in bucket 0: no digits left, the list is sorted
        if len(the_buckets[0]) == len_A:
            return the_buckets[0]
        A = []
        rd_list_append = A.append
        for b in the_buckets:
            for i in b:
                rd_list_append(i)

def find_mode_in_sorted(A):
    mode=A[0]
    number_of_occurences =1
    number_of_occurences_canidate=0
    for i in range(1,len(A)):
        if A[i] == mode:
            number_of_occurences =number_of_occurences +1
        else:
            number_of_occurences_canidate=number_of_occurences_canidate+1
        if A[i] != A[i-1]:
            number_of_occurences_canidate=0
        # the candidate counter is the current run length minus one
        if number_of_occurences_canidate + 1 > number_of_occurences :
            mode=A[i]
            number_of_occurences =number_of_occurences_canidate+1
    return mode#,number_of_occurences

s_input=radix_sort(input)
print(find_mode_in_sorted(s_input))
Using JavaScript:
const mode = (arr) => {
    let numMapping = {};
    let mode
    let greatestFreq = 0;
    for(var i = 0; i < arr.length; i++){
        if(numMapping[arr[i]] === undefined){
            numMapping[arr[i]] = 0;
        }
        numMapping[arr[i]] += 1;
        if (numMapping[arr[i]] > greatestFreq){
            greatestFreq = numMapping[arr[i]]
            mode = arr[i]
        }
    }
    return parseInt(mode)
}

Resources