Generate list of remaining numbers - arrays

Given a number n and an array of size m, where m < n, and where each number in the array is between 0 and n-1 (inclusive), I want to get, as efficiently as possible, the list of the n-m numbers from 0 to n-1 which aren't in the array.
This is how I'm doing it (in pseudocode), but it feels rather inefficient and I'm wondering if there's a better way:
int[] remaining (int[] assigned) {
    Set<int> s
    int[n-m] remaining
    add each int in assigned to s
    for (i = 0 to n-1)
        if (not s.contains(i)) remaining.add(i)
}
This isn't any particular computer language, but it should be illustrative. We'll assume that accessing an array is of course O(1) and that adding to/checking a set is O(log n), as with an AVL-tree-based set. So basically I'm trying to get this in linear time instead of O(n log n) as it is now, but if the initial array isn't sorted I don't know how to go about it, or whether it's even possible.

Copy the array into a hash set H. This takes O(m) expected time.
for i from 0 to n-1
    if (H.ispresent(i) == FALSE)
        output i
This for loop takes O(n). As n >= m, the overall complexity is O(n).
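A minimal Java sketch of this approach (the method shape and names are my own illustration; any set with constant-time lookup works, and HashSet gives expected O(1)):
import java.util.*;

static List<Integer> remaining(int[] assigned, int n) {
    // O(m): record which numbers are present
    Set<Integer> present = new HashSet<>();
    for (int v : assigned)
        present.add(v);
    // O(n) expected: collect the numbers that never appeared
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < n; i++)
        if (!present.contains(i))
            result.add(i);
    return result;
}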

I think this would be a little faster (pseudocode also):
int[] remaining (int[] assigned) {
    int[n] all          // assume initialized to 0
    int[n-m] remaining
    for (i = 0 to m-1)
        all[assigned[i]] = -1   // mark every number that appears in assigned
    int counter = 0
    for (i = 0 to n-1)
        if (all[i] != -1) {     // i was never marked, so it is missing
            remaining[counter] = i
            counter++
        }
    return remaining
}

The bitset (bit array) idea:
#include <iostream>
#include <fstream>
#include <bitset>

const int SIZE = 10; // for example

int main() {
    std::bitset<SIZE> bs;
    int i;
    std::ifstream fin("numbers.txt");
    while (fin >> i)
        bs.set(i);
    fin.close();
    for (i = 0; i < SIZE; ++i)
        if (!bs[i])
            std::cout << i << '\n';
    return 0;
}

If you have to find one or two missing numbers, you can always use the sum and/or the product of the numbers to figure out which are missing; for one missing number, the difference between n(n-1)/2 and the sum of the array is the missing value. If more than two numbers are missing, a bit set works well.
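For the single-missing-number case, a quick sketch of the sum trick (an illustration under stated assumptions: exactly one value from 0..n-1 is absent and there are no duplicates):
// Assumes exactly one value in 0..n-1 is missing and no duplicates occur.
static int findSingleMissing(int[] assigned, int n) {
    long expected = (long) n * (n - 1) / 2; // sum of 0..n-1
    long actual = 0;
    for (int v : assigned)
        actual += v;
    return (int) (expected - actual);
}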
Code using a BitSet in Java to find the missing numbers:
public List<Integer> findMissingNumbers(List<Integer> input, int maxNum) {
    /* You could also iterate through the list and find maxNum on the fly.
       The BitSet is vector-based and can grow, so maxNum is only a size hint. */
    if (input == null || input.size() == 0)
        return null;
    BitSet existSet = new BitSet(maxNum);
    for (int val : input) {
        existSet.set(val);
    }
    List<Integer> missingNum = new ArrayList<Integer>();
    for (int i = existSet.nextClearBit(0); i < maxNum; i = existSet.nextClearBit(i + 1)) {
        missingNum.add(i);
    }
    return missingNum;
}

Related

How to sort an int array in linear time?

I was given homework to write a program that sorts an array in ascending order. I did this:
#include <stdio.h>

int main()
{
    int a[100], i, n, j, temp;
    printf("Enter the number of elements: ");
    scanf("%d", &n);
    for (i = 0; i < n; ++i)
    {
        printf("%d. Enter element: ", i + 1);
        scanf("%d", &a[i]);
    }
    for (j = 0; j < n; ++j)
        for (i = j + 1; i < n; ++i)
        {
            if (a[j] > a[i])
            {
                temp = a[j];
                a[j] = a[i];
                a[i] = temp;
            }
        }
    printf("Ascending order: ");
    for (i = 0; i < n; ++i)
        printf("%d ", a[i]);
    return 0;
}
The input will not be more than 10 numbers. Can this be done in less code than I have here? I want the code to be as short as possible. Any help will be appreciated. Thanks!
If you know the range of the array elements, one way is to use another array to store the frequency of each of the array elements (all elements should be int :) ) and print the sorted array. I am posting it for a large number of elements (10^6); you can reduce it according to your need:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void){
    int t, num, *freq = malloc(sizeof(int) * 1000001);
    memset(freq, 0, sizeof(int) * 1000001); // set all elements of freq to 0
    scanf("%d", &t); // ask for the number of elements to be scanned (upper limit is 1000000)
    for (int i = 0; i < t; i++){
        scanf("%d", &num);
        freq[num]++;
    }
    for (int i = 0; i < 1000001; i++){
        if (freq[i]){
            while (freq[i]--){
                printf("%d\n", i);
            }
        }
    }
    free(freq);
    return 0;
}
This algorithm can be modified further. The modified version is known as counting sort, and it sorts the array in Θ(n) time.
Counting sort¹:
Counting sort assumes that each of the n input elements is an integer in the range
0 to k, for some integer k. When k = O(n), the sort runs in Θ(n) time.
Counting sort determines, for each input element x, the number of elements less
than x. It uses this information to place element x directly into its position in the
output array. For example, if 17 elements are less than x, then x belongs in output
position 18. We must modify this scheme slightly to handle the situation in which
several elements have the same value, since we do not want to put them all in the
same position.
In the code for counting sort, we assume that the input is an array A[1...n] and
thus A.length = n. We require two other arrays: the array B[1....n] holds the
sorted output, and the array C[0....k] provides temporary working storage.
The pseudocode for this algorithm:
for i ← 0 to k do
    C[i] ← 0
for j ← 1 to n do
    C[A[j]] ← C[A[j]] + 1
// C[i] now contains the number of elements equal to i
for i ← 1 to k do
    C[i] ← C[i] + C[i-1]
// C[i] now contains the number of elements ≤ i
for j ← n downto 1 do
    B[C[A[j]]] ← A[j]
    C[A[j]] ← C[A[j]] - 1
1. Content has been taken from Introduction to Algorithms by
Thomas H. Cormen and others.
You have 10 lines doing the sorting. If you're allowed to use someone else's work (subsequent notes indicate that you can't do this), you can reduce that by writing a comparator function and calling the standard C library qsort() function:
static int compare_int(void const *v1, void const *v2)
{
    int i1 = *(int *)v1;
    int i2 = *(int *)v2;
    if (i1 < i2)
        return -1;
    else if (i1 > i2)
        return +1;
    else
        return 0;
}
And then the call is:
qsort(a, n, sizeof(a[0]), compare_int);
Now, I wrote the function the way I did for a reason. In particular, it avoids arithmetic overflow which writing this does not:
static int compare_int(void const *v1, void const *v2)
{
    return *(int *)v1 - *(int *)v2;
}
Also, the original pattern generalizes to comparing structures, etc. You compare the first field for inequality, returning the appropriate result; if the first fields are equal, then you compare the second fields; then the third, then the Nth, only returning 0 if every comparison shows the values are equal.
Obviously, if you're supposed to write the sort algorithm, then you'll have to do a little more work than calling qsort(). Your algorithm is a Bubble Sort. It is one of the most inefficient sorting techniques — it is O(N²). You can look up Insertion Sort (also O(N²), but more efficient than Bubble Sort), or Selection Sort (also quadratic), or Shell Sort (very roughly O(N^(3/2))), or Heap Sort (O(N log N)), or Quick Sort (O(N log N) on average, but O(N²) in the worst case), or Intro Sort. The only ones that might be shorter than what you wrote are Insertion and Selection sorts; the others will be longer but faster for large amounts of data. For small sets like 10 or 100 numbers, efficiency is immaterial — all sorts will do. But as you get towards 1,000 or 1,000,000 entries, the sorting algorithm really matters. You can find a lot of questions on Stack Overflow about different sorting algorithms, and you can easily find information on Wikipedia for any and all of the algorithms mentioned.
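For instance, a minimal insertion sort looks like this (sketched in Java for brevity; a C version is a line-for-line transcription):
// Insertion sort: take each element and shift larger ones right until it fits.
static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
}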
Incidentally, if the input won't be more than 10 numbers, you don't need an array of size 100.

Sort an increasing array

The pseudocode:
S = {};
Loop 10000 times:
    u = unsorted_fixed_size_array_producer();
    S = sort(S + u);
I need an efficient implementation of sort, which takes a sorted array and an unsorted one and sorts them all together. But here we know that after a few iterations size(S) will be much bigger than size(u); that's known a priori.
Update: There's another prior: the size of u is known, say 10 or 20, and the number of loop iterations is also known.
Update: I implemented the algorithm that @Dukelnig advised, in C: https://gist.github.com/blackball/bd7e5619a1e83bd985a3 which fits my needs. Thanks!
Sort u, then merge S and u.
Merging simply involves iterating through two sorted arrays at the same time, and picking the smaller element and incrementing that iterator at each step.
The running time is O(|u| log |u| + |S|).
This is very similar to what merge sort does, so the fact that it produces a sorted array can be derived from there.
Some Java code for merge, derived from Wikipedia: (the C code wouldn't look all that different)
static void merge(int S[], int u[], int newS[])
{
    int iS = 0, iu = 0;
    for (int j = 0; j < S.length + u.length; j++)
        if (iS < S.length && (iu >= u.length || S[iS] <= u[iu]))
            newS[j] = S[iS++]; // increment iS after using it as an index
        else
            newS[j] = u[iu++]; // increment iu after using it as an index
}
This can also be done in-place (in S, assuming it has enough additional space) by going from the back.
Here's some working Java code that does this:
static void mergeInPlace(int S[], int SLength, int u[])
{
    int iS = SLength - 1, iu = u.length - 1;
    for (int j = SLength + u.length - 1; j >= 0; j--)
        if (iS >= 0 && (iu < 0 || S[iS] >= u[iu]))
            S[j] = S[iS--];
        else
            S[j] = u[iu--];
}

public static void main(String[] args)
{
    int[] S = {1,5,9,13,22, 0,0,0,0}; // 4 additional spots reserved here
    int[] u = {0,10,11,15};
    mergeInPlace(S, 5, u);
    // prints [0, 1, 5, 9, 10, 11, 13, 15, 22]
    System.out.println(Arrays.toString(S));
}
To reduce the number of comparisons, we can also use binary search (although the time complexity would remain the same - this can be useful when comparisons are expensive).
// returns the index of the first element in S[0..SLength-1] greater than value,
// or SLength if no such element exists
static int binarySearch(int S[], int SLength, int value) { ... }
static void mergeInPlaceBinarySearch(int S[], int SLength, int u[])
{
    int iS = SLength - 1;
    int iNew = SLength + u.length - 1;
    for (int iu = u.length - 1; iu >= 0; iu--)
    {
        if (iS >= 0)
        {
            int index = binarySearch(S, iS + 1, u[iu]);
            for ( ; iS >= index; iS--)
                S[iNew--] = S[iS];
        }
        S[iNew--] = u[iu];
    }
    // assert (iS != iNew)
    for ( ; iS >= 0; iS--)
        S[iNew--] = S[iS];
}
If S doesn't have to be an array
The above assumes that S has to be an array. If it doesn't, something like a binary search tree might be better, depending on how large u and S are.
The running time would be O(|u| log |S|) - just substitute some values to see which is better.
If you really really have to use a literal array for S at all times, then the best approach would be to individually insert the new elements into the already sorted S. I.e. basically use the classic insertion sort technique for each element in each new batch. This will be expensive in the sense that insertion into an array is expensive (you have to move the elements), but that's the price of having to use an array for S.
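A minimal sketch of that per-element insertion (assuming S has spare capacity for the new elements; names are illustrative):
// Insert each element of u into the already-sorted prefix S[0..SLength-1].
// Classic insertion-sort step: shift larger elements right, drop x in place.
static int insertBatch(int S[], int SLength, int u[])
{
    for (int x : u)
    {
        int j = SLength - 1;
        while (j >= 0 && S[j] > x)
        {
            S[j + 1] = S[j];
            j--;
        }
        S[j + 1] = x;
        SLength++;
    }
    return SLength; // the new logical length of S
}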
So if the size of S is much more than the size of u, isn't what you want simply an efficient sort for a mostly sorted array? Traditionally this would be insertion sort. But you will only know the real answer by experimentation and measurement - try different algorithms and pick the best one. Without actually running your code (and perhaps more importantly, with your data), you cannot reliably predict performance, even with something as well studied as sorting algorithms.
Say we have a big sorted list of size n and a little sorted list of size k.
Binary search, starting from the end (position n-1, n-2, n-4, etc.), for the insertion point for the largest element of the smaller list. Shift the tail end of the larger list k elements to the right, insert the largest element of the smaller list, then repeat.
So if we have the lists [1,2,4,5,6,8,9] and [3,7], we will do:
[1,2,4,5,6, , ,8,9]
[1,2,4,5,6, ,7,8,9]
[1,2, ,4,5,6,7,8,9]
[1,2,3,4,5,6,7,8,9]
But I would advise you to benchmark just concatenating the lists and sorting the whole thing before resorting to interesting merge procedures.

Effective algorithms for selecting the top k (in percent) items from a data stream:

I have to repeatedly sort an array containing 300 random elements. But I have to do a special kind of sort: I need the 5% smallest values from a subset of the array, then some value is calculated and the subset is increased. Then the value is calculated again and the subset increased again, and so on, until the subset contains the whole array.
The subset starts with the first 10 elements and is increased by 10 elements after each step.
i.e.:
subset size    k = ceil(5% * subset)
    10         1  (so just the smallest element)
    20         1  (so also just the smallest)
    30         2  (smallest and second smallest)
    ...
The calculated value is basically the sum of all elements smaller than the k-th smallest element, plus the specially weighted k-th smallest element itself.
In code:
k = ceil(0.05 * subset) - 1; // -1 because array indices start at 0
temp = 0.0;
for (int i = 0; i < k; i++)
    temp += smallestElements[i];
temp += b * smallestElements[k];
I have implemented a selection-sort-based algorithm myself (code at the end of this post). I use MAX(k) pointers to keep track of the k smallest elements, so I unnecessarily keep everything below the k-th smallest element sorted :/
Furthermore, I know selection sort is bad for performance, which is unfortunately crucial in my case.
I tried to figure out a way to use some quicksort- or heapsort-based algorithm. I know that quickselect and heapselect are perfect for finding the k smallest elements if k and the subset are fixed.
But because my subset is more like an input stream of data, I think quicksort-based algorithms drop out.
I know that heapselect would be perfect for a data stream if k were fixed, but I can't manage to adapt heapselect to dynamic k without big performance drops that make it less effective than my selection-sort-based version :( Can anyone help me modify heapselect for dynamic k?
If there is no better algorithm, you may still find a different/faster approach for my selection sort implementation. Here is a minimal example of my implementation; the calculated variable isn't used in this example, so don't worry about it. (In my real program I have just some loops unrolled manually for better performance.)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define ARRAY_SIZE 300
#define STEP_SIZE 10

float sortStream(float* array, float** pointerToSmallest, int k_max){
    int i, j, k, last = k_max - 1;
    float temp = 0.0;
    // init first two pointers
    if (array[0] < array[1]){
        pointerToSmallest[0] = &array[0];
        pointerToSmallest[1] = &array[1];
    } else {
        pointerToSmallest[0] = &array[1];
        pointerToSmallest[1] = &array[0];
    }
    // init remaining pointers until i = k_max
    for (i = 2; i < k_max; ++i){
        if (*pointerToSmallest[i-1] < array[i]){
            pointerToSmallest[i] = &array[i];
        } else {
            pointerToSmallest[i] = pointerToSmallest[i-1];
            for (j = 0; j < i-1 && *pointerToSmallest[i-2-j] > array[i]; ++j)
                pointerToSmallest[i-1-j] = pointerToSmallest[i-2-j];
            pointerToSmallest[i-1-j] = &array[i];
        }
        if ((i+1) % STEP_SIZE == 0){
            k = ceil(0.05 * i) - 1;
            for (j = 0; j < k; j++)
                temp += *pointerToSmallest[j];
            temp += 2 * (*pointerToSmallest[k]);
        }
    }
    // selection-sort the remaining elements
    for ( ; i < ARRAY_SIZE; ++i){
        if (*pointerToSmallest[last] > array[i]){
            for (j = 0; j != last && *pointerToSmallest[last-1-j] > array[i]; ++j)
                pointerToSmallest[last-j] = pointerToSmallest[last-1-j];
            pointerToSmallest[last-j] = &array[i];
        }
        if ((i+1) % STEP_SIZE == 0){
            k = ceil(0.05 * i) - 1;
            for (j = 0; j < k; j++)
                temp += *pointerToSmallest[j];
            temp += 2 * (*pointerToSmallest[k]);
        }
    }
    return temp;
}

int main(void){
    int i, k_max = ceil(0.05 * ARRAY_SIZE);
    float* array = (float*)malloc(ARRAY_SIZE * sizeof(float));
    float** pointerToSmallest = (float**)malloc(k_max * sizeof(float*));
    for (i = 0; i < ARRAY_SIZE; i++)
        array[i] = rand() / (float)RAND_MAX * 100 - 50;
    // just return a, so that the compiler doesn't drop the function call
    float a = sortStream(array, pointerToSmallest, k_max);
    return (int)a;
}
Thank you very much
By using two heaps to store all items from the stream, you can:
find the top p% elements in O(1)
update the data structure (two heaps) in O(log N)
Assume we now have N elements and k = p% * N:
a min-heap (LargerPartHeap) stores the top k items;
a max-heap (SmallerPartHeap) stores the other (N - k) items.
All items in SmallerPartHeap are less than or equal to the minimum item of LargerPartHeap (the top item of LargerPartHeap).
1. For the query "what are the top p% elements?", simply return LargerPartHeap.
2. For an update "new element x from the stream":
2.a Check the new k' = (N + 1) * p%; if k' = k + 1, move the top of SmallerPartHeap to LargerPartHeap. - O(log N)
2.b If x is larger than the top (minimum) element of LargerPartHeap, insert x into LargerPartHeap and move the top of LargerPartHeap to SmallerPartHeap; otherwise, insert x into SmallerPartHeap. - O(log N)
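A rough Java sketch of this two-heap scheme (the class and method names are my own; java.util.PriorityQueue serves as both heaps, with a reversed comparator for the max-heap):
import java.util.*;

// Tracks the top p% of the values seen so far (a sketch; assumes 0 < p <= 1).
class TopPercentTracker {
    private final double p;
    private final PriorityQueue<Double> largerPart = new PriorityQueue<>(); // min-heap: top k items
    private final PriorityQueue<Double> smallerPart =
            new PriorityQueue<>(Collections.reverseOrder());               // max-heap: the rest

    TopPercentTracker(double p) { this.p = p; }

    void add(double x) {
        // Insert on the side x belongs to.
        if (!largerPart.isEmpty() && x > largerPart.peek())
            largerPart.add(x);
        else
            smallerPart.add(x);
        // Rebalance so largerPart holds exactly k = ceil(p * N) items.
        int n = largerPart.size() + smallerPart.size();
        int k = (int) Math.ceil(p * n);
        while (largerPart.size() > k)
            smallerPart.add(largerPart.poll());
        while (largerPart.size() < k)
            largerPart.add(smallerPart.poll());
    }

    Collection<Double> topPercent() { return largerPart; } // the O(1) query
}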
I believe heap sort is far too complicated for this particular problem, even though heaps and other priority queue structures are well suited to getting the N minimum or maximum items from a stream.
The first thing to notice is the constraint: 0.05 * 300 = 15, so that is the maximum amount of data that has to be kept sorted at any moment. Also, during each iteration one has to add 10 elements. The overall operation, in place, could be:
for (i = 0; i < 30; i++)
{
    if (i != 1)
        qsort(input + i*10, 10, sizeof(input[0]), cmpfunc);
    else
        qsort(input, 20, sizeof(input[0]), cmpfunc);
    if (i > 1)
        merge_sort15(input, 15, input + i*10, 10, cmpfunc);
}
When i == 1, one could also merge sort input and input+10 to produce a completely sorted array of 20 in place, since that has lower complexity than the generic sort. Here "optimizing" also means minimizing the primitives of the algorithm.
merge_sort15 would only consider the first 15 elements of the first array and the first 10 elements of the second one.
EDIT: The parameters of the problem have a considerable effect on choosing the right algorithm; here, selecting 'sort 10 items' as the basic unit allows one half of the problem to be parallelized, namely sorting 30 individual blocks of 10 items each -- a problem which can be solved efficiently with a fixed-pipeline algorithm using sorting networks. With a different parametrization such an approach may not be feasible.

Find the number of occurrences of each element in an array and update the information related to each element

I have a big 2-D array, array[length][2], where length = 500000.
array[i][0] is a hex number and array[i][1] is 0 or 1, representing some information related to that hex number. Like this:
array[i][0] array[i][1]
e05f56f8 1
e045ac44 1
e05f57fc 1
e05f57b4 1
e05ff8dc 0
e05ff8ec 0
e05ff900 1
I want to get a new array which stores: the hex number, the number of occurrences, and the sum of array[i][1] over all rows with the same hex number.
I wrote the code like this:
// First sort the array according to array[][0]
int x, y, temp1, temp2;
for (x = lines_num1 - 2; x >= 0; x--)
{
    for (y = 0; y <= x; y++)
    {
        if (array[y][0] > array[y+1][0])
        {
            temp1 = array[y][0];
            array[y][0] = array[y+1][0];
            array[y+1][0] = temp1;
            temp2 = array[y][1];
            array[y][1] = array[y+1][1];
            array[y+1][1] = temp2;
        }
    }
}
// generate the new_array[][]
int new_array[length][3];
int n = 0;
for (n = 0; n < length; n++){
    new_array[n][0] = 0;
    new_array[n][1] = 0;
    new_array[n][2] = 0;
}
int prev = array[0][0];
new_array[0][0] = array[0][0];
new_array[0][1] = 1;
new_array[0][2] = array[0][2];
for (k = 1; k < length; k++)
{
    if (array[k][0] == prev)
    {
        new_array[n][1] = new_array[n][1] + 1;
        new_array[n][2] = new_array[n][2] + array[k][0];
    } else {
        prev = array[k][0];
        new_array[n+1][0] = array[k][0];
        new_array[n+1][1] = new_array[n+1][1] + 1;
        new_array[n+1][2] = new_array[n+1][2] + array[k][0];
        n++;
    }
}
But the code doesn't seem to work as I expected. First, the sorting is very slow. Second, it doesn't seem to generate the correct new_array. Any suggestions on how to deal with this?
Personally, I would write a hash function to index the result array with the hexadecimal value directly. Then it is simple:
struct {
    unsigned int nocc;
    unsigned int nsum;
} result[/* ... */];

/* calculate the results */
for (i = 0; i < LENGTH; ++i) {
    int *curr = array[i]; /* row i: curr[0] is the hex number, curr[1] the 0/1 flag */
    unsigned int index = hash(curr[0]);
    result[index].nocc++;
    result[index].nsum += curr[1];
}
If you want to sort your array, don't reinvent the wheel: use qsort from the standard C library.
Sorting is slow because you're using bubble sort to sort the data. Bubble sort has quadratic average complexity, which means it has to perform more than 100 billion comparisons and swaps to sort your array. For this reason, never use bubble sort. Instead, learn to use the qsort library function and apply it to your problem.
Also, your sorting code has at least one bug: when exchanging values for the second column of the array, you are getting the value with the wrong column index, [3] instead of [1].
For your scenario, insertion sort is the right solution: while doing the insertion itself you can maintain the count and the sum. When the sort is finished, you will have your result array as well.
The code might look something like this:
int hex = 0, count = 0, sum = 0, iHole;
for (i = 1; i < lines_num1 - 1; i++)
{
    hex = array[i][0];
    count = array[i][1];
    sum = array[i][2];
    iHole = i;
    // keep moving the hole to the next smaller index until array[iHole - 1][0] <= hex
    while (iHole > 0 && array[iHole - 1][0] > hex)
    {
        // move hole to next smaller index
        array[iHole][0] = array[iHole - 1][0];
        array[iHole][1] = array[iHole - 1][1];
        array[iHole][2] = array[iHole - 1][2];
        iHole = iHole - 1;
    }
    // put item in the hole
    if (array[iHole][0] == hex)
    {
        array[iHole][1]++;
        array[iHole][2] += array[iHole][0];
    }
    else
    {
        array[iHole][0] = hex;
        array[iHole][1] = 1;
        array[iHole][2] = hex;
    }
}
So the cost of building the second array is the cost of the sorting itself: O(n) best case, O(n²) worst case, and you don't have to traverse the array again to compute the sum and count.
Remember this is an in-place sort. If you don't want to affect your original array, that can be done as well, with iHole pointing into a new array; iHole should then point to the tail of the new array instead of i.

Algorithm to find the duplicate numbers in an array ---Fastest Way

I need the fastest and simplest algorithm which finds the duplicate numbers in an array; it should also be able to report how many duplicates there are.
E.g. if the array is {2,3,4,5,2,4,6,2,4,7,3,8,2},
I should be able to know that there are four 2's, two 3's and three 4's.
Make a hash table where the key is the array item and the value is a counter of how many times that item has occurred in the array. This is an efficient way to do it, but probably not the fastest.
Something like this (in pseudocode). You will find plenty of hash map implementations for C by googling.
hash_map = create_new_hash_map()
for item in array {
    if hash_map.contains_key(item) {
        counter = hash_map.get(item)
    } else {
        counter = 0
    }
    counter = counter + 1
    hash_map.put(item, counter)
}
This can be solved elegantly using LINQ:
public static void Main(string[] args)
{
    List<int> list = new List<int> { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };
    var grouping = list
        .GroupBy(x => x)
        .Select(x => new { Item = x.Key, Count = x.Count() });
    foreach (var item in grouping)
        Console.WriteLine("Item {0} has count {1}", item.Item, item.Count);
}
Internally it probably uses hashing to partition the list, but the code hides the internal details - here we are only telling it what to calculate. The compiler/runtime is free to choose how to calculate it and to optimize as it sees fit. Thanks to LINQ, this same code will run efficiently whether run on a list in memory or on a list in a database. In real code you should use this, but I guess you want to know how it works internally.
A more imperative approach that demonstrates the actual algorithm is as follows:
List<int> list = new List<int> { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };
Dictionary<int, int> counts = new Dictionary<int, int>();
foreach (int item in list)
{
    if (!counts.ContainsKey(item))
    {
        counts[item] = 1;
    }
    else
    {
        counts[item]++;
    }
}
foreach (KeyValuePair<int, int> item in counts)
    Console.WriteLine("Item {0} has count {1}", item.Key, item.Value);
Here you can see that we iterate over the list only once, keeping a count for each item we see on the way. This would be a bad idea if the items were in a database though, so for real code, prefer to use the LINQ method.
Here's a C version that does it with standard input; it runs in time linear in the size of the input (but beware, the number of parameters on the command line is limited...) and should give you an idea of how to proceed:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int dups[10] = { 0 };
    int i;
    for (i = 1; i < argc; i++)
        dups[atoi(argv[i])]++;
    for (i = 0; i < 10; i++)
        printf("%d: %d\n", i, dups[i]);
    return 0;
}
example usage:
$ gcc -o dups dups.c
$ ./dups 0 0 3 4 5
0: 2
1: 0
2: 0
3: 1
4: 1
5: 1
6: 0
7: 0
8: 0
9: 0
caveats:
if you plan to count also the number of 10s, 11s, and so on -> the dups[] array must be bigger
left as an exercise is to implement reading from an array of integers and to determine their position
The more you tell us about the input arrays, the faster we can make the algorithm. For example, for your example of single-digit numbers, creating an array of 10 elements (indexed 0:9) and accumulating the number of occurrences of each number in the corresponding element of the array (a poorly worded explanation, but you probably catch my drift) is likely to be faster than hashing. (I say likely to be faster because I haven't done any measurements and won't.)
I agree with most respondents that hashing is probably the right approach for the most general case, but it's always worth thinking about whether yours is a special case.
If you know the lower and upper bounds, and they are not too far apart, this would be a good place to use a Radix Sort. Since this smells of homework, I'm leaving it to the OP to read the article and implement the algorithm.
If you don't want to use a hash table or something like that, just sort the array and then count the number of occurrences; something like the below should work:
Arrays.sort(array);
lastOne = array's first element;
count = 0;
for (i = 0; i < array's length; i++)
{
    if (array[i] == lastOne)
        increment count
    else
    {
        print(lastOne + " has " + count + " occurrences");
        lastOne = array[i];
        count = 1;
    }
}
print(lastOne + " has " + count + " occurrences");
If the range of the numbers is known and small, you could use an array to keep track of how many times you've seen each (this is a bucket sort in essence). If the range is big, you can sort the array and then count duplicates, as they will follow each other.
option 1: hash it.
option 2: sort it and then count consecutive runs.
You can use a hash table to store each element value as a key, then increment the count by 1 each time a key already exists.
Using hash tables / associative arrays / dictionaries (all the same thing, but the terminology changes between programming environments) is the way to go.
As an example in python:
numberList = [1, 2, 3, 2, 1, ...]
countDict = {}
for value in numberList:
countDict[value] = countDict.get(value, 0) + 1
# Now countDict contains each value pointing to their count
Similar constructions exist in most programming languages.
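For example, a Java version of the same idea (a sketch using HashMap.merge, available since Java 8):
import java.util.*;

static Map<Integer, Integer> countOccurrences(int[] numbers) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int value : numbers)
        counts.merge(value, 1, Integer::sum); // start at 1, or add 1 to the existing count
    return counts;
}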
> I need the fastest and simple algorithm which finds the duplicate numbers in an array, also should be able to know the number of duplicates.
I think the fastest algorithm is counting the duplicates in an array:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <assert.h>

typedef int arr_t;
typedef unsigned char dup_t;
const dup_t dup_t_max = UCHAR_MAX;

dup_t *count_duplicates(arr_t *arr, arr_t min, arr_t max, size_t arr_len){
    assert(min <= max);
    dup_t *dup = calloc(max - min + 1, sizeof(dup[0]));
    for (size_t i = 0; i < arr_len; i++){
        assert(min <= arr[i] && arr[i] <= max && dup[arr[i] - min] < dup_t_max);
        dup[arr[i] - min]++;
    }
    return dup;
}

int main(void){
    arr_t arr[] = {2,3,4,5,2,4,6,2,4,7,3,8,2};
    size_t arr_len = sizeof(arr) / sizeof(arr[0]);
    arr_t min = 0, max = 16;
    dup_t *dup = count_duplicates(arr, min, max, arr_len);
    printf(" value count\n");
    printf(" -----------\n");
    for (size_t i = 0; i < (size_t)(max - min + 1); i++){
        if (dup[i]){
            printf("%5i %5i\n", (int)(i + min), (int)(dup[i]));
        }
    }
    free(dup);
}
Note: you cannot use the fastest algorithm on every array.
This code first sorts the array and then moves unique elements to the front, keeping track of the number of occurrences of each. It's slower than using bucket sort, but more convenient.
#include <stdio.h>
#include <stdlib.h>

static int cmpi(const void *p1, const void *p2)
{
    int i1 = *(const int *)p1;
    int i2 = *(const int *)p2;
    return (i1 > i2) - (i1 < i2);
}

size_t make_unique(int values[], size_t count, size_t *occ_nums)
{
    if (!count) return 0;
    qsort(values, count, sizeof *values, cmpi);
    size_t top = 0;
    int prev_value = values[0];
    if (occ_nums) occ_nums[0] = 1;
    size_t i = 1;
    for (; i < count; ++i)
    {
        if (values[i] != prev_value)
        {
            ++top;
            values[top] = prev_value = values[i];
            if (occ_nums) occ_nums[top] = 1;
        }
        else if (occ_nums) ++occ_nums[top];
    }
    return top + 1;
}

int main(void)
{
    int values[] = { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };
    size_t occ_nums[sizeof values / sizeof *values];
    size_t unique_count = make_unique(
        values, sizeof values / sizeof *values, occ_nums);
    size_t i = 0;
    for (; i < unique_count; ++i)
    {
        printf("number %i occurred %u time%s\n",
            values[i], (unsigned)occ_nums[i], occ_nums[i] > 1 ? "s" : "");
    }
}
There is an "algorithm" that I use all the time to find duplicate lines in a file in Unix:
sort file | uniq -d
If you implement the same strategy in C, then it is very difficult to beat it with a fancier strategy such as hash tables. Call a sorting algorithm, and then call your own function to detect duplicates in the sorted list. The sorting algorithm takes O(n log n) time and the uniq function takes linear time. (Southern Hospitality makes a similar point, but I want to emphasize that what he calls "option 2" seems both simpler and faster than the more popular hash-table suggestion.)
Counting sort is the answer to the above question. If you look at the algorithm for counting sort, you will find that there is an array that is kept for counting how many times each element i is present in the original array.
Here is another solution, but it takes O(n log n) time.
Use a divide-and-conquer approach to sort the given array using merge sort.
During the combine step of merge sort, find the duplicates by comparing the elements in the two sorted sub-arrays; a sketch follows below.
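A small Java sketch of that combine step (an illustration of the idea, not a reference implementation): merge two sorted halves and report a value as a duplicate whenever the two heads are equal.
// Merge two sorted arrays; values present in both halves are reported
// as duplicates while merging. (Duplicates confined to one half are
// found at a deeper level of the recursion.)
static int[] mergeAndReport(int[] left, int[] right) {
    int[] out = new int[left.length + right.length];
    int i = 0, j = 0, k = 0;
    while (i < left.length && j < right.length) {
        if (left[i] == right[j])
            System.out.println("duplicate: " + left[i]);
        out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    }
    while (i < left.length)  out[k++] = left[i++];
    while (j < right.length) out[k++] = right[j++];
    return out;
}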
