Find the most frequent number in an array, with limited memory - arrays

How do I find the most frequent number in an array? The array can be extremely large, for example 2 GB, and we only have limited memory, say 100 MB.
I'm thinking about external sort, that is, sorting and then counting the duplicate numbers that end up next to each other. Or a hashmap. But I don't know what to do with the limited memory, and I'm not even sure external sort is a good idea for this.

In the worst case, all your numbers are distinct except for one number which appears twice, and there's no way to detect this in main memory unless you have the two duplicate numbers loaded at the same time, which is unlikely without sorting when your total data size is much larger than your main memory. In that case, asymptotically the best thing to do is to sort the numbers in batches, save each sorted batch to its own file on disk, and then do the merge step of merge sort, reading all the sorted files into memory a few lines at a time and writing the merged sorted list to a new file. Then you go through the aggregate sorted file in order, count how many times you see each number, and keep track of which number has occurred the most times.
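A rough sketch of that batch-sort-and-merge approach in Python, assuming the input is a text file with one integer per line (the function name and the chunk_size parameter are illustrative, not part of the original answer):
import heapq
import itertools
import tempfile

def most_frequent_external(path, chunk_size=1_000_000):
    # Phase 1: read memory-sized chunks, sort each one, write it to a temp file.
    runs = []
    with open(path) as f:
        while True:
            lines = list(itertools.islice(f, chunk_size))
            if not lines:
                break
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(f"{x}\n" for x in sorted(map(int, lines)))
            run.seek(0)
            runs.append(run)

    # Phase 2: k-way merge of the sorted runs, counting equal neighbours
    # on the fly instead of writing the merged file back to disk.
    best_value, best_count = None, 0
    current, count = None, 0
    for x in heapq.merge(*(map(int, run) for run in runs)):
        count = count + 1 if x == current else 1
        current = x
        if count > best_count:
            best_value, best_count = current, count
    for run in runs:
        run.close()
    return best_value, best_count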
If you assume that the most frequent number appears strictly more than 50% of the time, then you can do much better. You can solve the problem with constant extra memory, going through the list of numbers just once. Basically, you start by initializing the most common value (MCV) to the first number and initializing a counter N to 1. Then you go through the list. If the next number in the list equals the MCV, you increase N by one; otherwise you decrease N by one. If N is 0 and the next number is different from the MCV, you set MCV to the new number and set N to 1. It is easy to prove that, under this assumption, the loop ends with the most common value stored in MCV.
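In code, a sketch of this single-pass scheme (widely known as the Boyer-Moore majority vote), with a second pass added to confirm the candidate in case the 50% assumption doesn't hold:
def majority_candidate(numbers):
    # One pass: keep the current most common value (MCV) and a counter N.
    mcv, n = None, 0
    for x in numbers:
        if n == 0:
            mcv, n = x, 1
        elif x == mcv:
            n += 1
        else:
            n -= 1
    return mcv

def majority_element(numbers):
    # Second pass: verify the candidate really appears more than 50% of the time.
    candidate = majority_candidate(numbers)
    if sum(1 for x in numbers if x == candidate) * 2 > len(numbers):
        return candidate
    return None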

Depending on what the hypotheses are, an even better way of doing it might be using the MJRTY algorithm:
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/
Or its generalization:
http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASimpleAlgorithmForFindingFrequentElementsInStreamsAndBags.pdf
The idea is that with exactly two variables (a counter and a value store) you can determine, if there exists a majority element (appearing strictly more than 50% of the time), what that element is. The generalization requires (k+1) counters and value stores to find the elements appearing more than 100/k% of the time.
Because these are only candidates for the majority (if there are k-majority elements, those are the candidates; but if there are no k-majority elements, the candidates are just elements that ended up there by chance), a second pass over the data lets you get the exact count of each candidate and determine which one, if any, is actually a majority element.
This is extremely fast and memory efficient.
There are a few other optimizations, but with 4 KB of memory you should be able to find the majority element of 2 GB of data with good probability, depending on the type of data you have.
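A sketch of the generalization in the Misra-Gries style, with the verification pass. Note the parameterization here uses k counters to catch anything above a 1/(k+1) fraction of the stream, which may be stated slightly differently in the linked paper:
from collections import Counter

def frequent_candidates(stream, k):
    # One pass with at most k counters: anything appearing more than
    # a 1/(k+1) fraction of the time is guaranteed to survive as a candidate.
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Decrement every counter and drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return list(counters)

def frequent_elements(data, k):
    # Second pass: keep only candidates whose true count clears the threshold.
    candidates = set(frequent_candidates(data, k))
    counts = Counter(x for x in data if x in candidates)
    threshold = len(data) / (k + 1)
    return [x for x, c in counts.items() if c > threshold]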

Assumptions:
Integer is 4 bytes.
There are fewer than (100 MB / 4 B) = (104857600 / 4) = 26214400 distinct integers in the 2 GB array. Every number maps into the 0-26214399 index range.
Let's do the histogram.
1. Make buckets in our 100 MB space. It's an integer array called histogram, which can store up to 26214400 counters. The counters are initially set to 0.
2. Iterate once through the 2 GB array. When you read x, do histogram[x]++.
3. Find the maximum in the histogram, iterating through it once. If the maximum is histogram[i], then i is the most frequent number.
The bottleneck is step 2, iterating through the 2 GB array, but we do it only once.
If the second assumption doesn't hold (i.e. there are more than 26214400 distinct integers):
Make a histogram for numbers with indices from 0 to 26214399. Keep the most frequent number from the histogram. Make a histogram for numbers with indices from 26214400 to 52428799. Keep the most frequent number from this histogram and the previous most frequent number. And so on.
In the worst case, with 2^32 distinct numbers, it will do (2^32 / 26214400 + 1) = 164 iterations over that 2 GB array.
In general, it will do (NUMBER_OF_DISTINCT_NUMBERS / 26214400 + 1) iterations.
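A sketch of this chunked histogram, assuming unsigned 32-bit values and a callable that re-reads the 2 GB file on every pass (both names below are made up for illustration):
from array import array

BUCKETS = 26_214_400                    # ~100 MB worth of 4-byte counters

def most_frequent_chunked(read_numbers):
    # read_numbers: a callable returning a fresh iterator over the unsigned
    # 32-bit integers on disk, since we need one full pass per chunk.
    best_value, best_count = None, 0
    for base in range(0, 2**32, BUCKETS):
        histogram = array('I', [0]) * BUCKETS
        for x in read_numbers():
            if base <= x < base + BUCKETS:
                histogram[x - base] += 1
        count = max(histogram)
        if count > best_count:
            best_count = count
            best_value = base + histogram.index(count)
    return best_value, best_count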

Assuming 4-byte integers, you can fit (100 MB / 4 B) = 25M integers into the available memory.
1. Read through your big file, counting each occurrence of any number in the range 0 ... 25M-1. Use a big array to accumulate the counts.
2. Find the number which occurs most frequently, store the number and its frequency, and clear the array.
3. Read through the big file again, repeating the counting process for numbers in the range 25M ... 50M-1.
4. Find the number which occurs most frequently in the new array. Compare it with the number/frequency you stored at step 2. Store the number/frequency of the one with the higher frequency and clear the array.
5. Lather, rinse, repeat.
ETA: If we can assume that there is a single answer, that there aren't two different numbers with the same highest frequency, then you can discard all the numbers involved if the array for a particular range shows a tie. Otherwise the problem of storing the winner for each range becomes more complex.

If you have limited memory but a reasonable amount of processing power and super fast speed isn't an issue, depending on your dataset you could try something like:
Iterate through the array, counting the numbers from 1 to 1000. Keep the one with the biggest count. Then count 1001 to 2000. Keep the biggest count among these and the biggest one from the first batch. Repeat until all numbers have been counted.
I'm sure there are many optimisations for this based on the specifics of your dataset.

Related

Fast way to count smaller/equal/larger elements in array

I need to optimize my algorithm for counting how many elements of an (unsorted) array are larger than, smaller than, or equal to a given number.
I have to do this many times, and the array can have thousands of elements.
The array doesn't change; the number changes.
Example:
array: 1,2,3,4,5
n = 3
Number of <: 2
Number of >: 2
Number of ==: 1
First thought:
Iterate through the array and check whether each element is >, <, or == n.
O(n*k)
Possible optimization:
O((n+k) * log n)
First sort the array (I'm using C's qsort), then use binary search to find the equal number, and then somehow count the smaller and larger values. But how do I do that?
If the element exists (bsearch returns a pointer to the element), I also need to check whether the array contains duplicates of this element (so I need to check before and after this element while the values are equal to the found one), and then use some pointer arithmetic to count the larger and smaller values.
How do I get the number of values larger/smaller from a pointer to the equal element?
But what do I do if I don't find the value (bsearch returns NULL)?
If the array is unsorted, and the numbers in it have no other useful properties, there is no way to beat an O(n) approach of walking the array once, and counting items in the three buckets.
Sorting the array followed by a binary search would be no better than O(n), assuming that you employ a sort algorithm that is linear in time (e.g. a radix sort). For comparison-based sorts, such as quicksort, the timing would increase to O(n*log2n).
On the other hand, sorting would help if you need to run multiple queries against the same set of numbers. The timing for k queries against n numbers would go from O(n*k) for k linear searches to O(n+k*log2n) assuming a linear-time sort, or O((n+k)*log2n) with comparison-based sort. Given a sufficiently large k, the average query time would go down.
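For example, a sketch of the sort-once, query-many approach, using Python's bisect as a stand-in for the C qsort/bsearch version (make_counter is just an illustrative name):
import bisect

def make_counter(values):
    # Sort once (O(n log n)); every query afterwards is two binary searches.
    data = sorted(values)

    def query(n):
        lo = bisect.bisect_left(data, n)     # index of the first value >= n
        hi = bisect.bisect_right(data, n)    # index of the first value > n
        return {'<': lo, '==': hi - lo, '>': len(data) - hi}

    return query

counts = make_counter([1, 2, 3, 4, 5])
print(counts(3))    # {'<': 2, '==': 1, '>': 2}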
Since the array is (apparently?) not changing, presort it. This allows a binary search (O(log n)).
a.) implement your own version of bsearch (it will be less code anyhow)
you can do it inline using indices vs. pointers
you won't need function pointers to a specialized function
b.) Since you say that you want to count the number of matches, you imply that the array can contain multiple entries with the same value (otherwise you would have used a boolean has_n).
This means you'll need to do a linear search for the beginning and end of the array of "n"s.
From which you can calculate the number less than n and greater than n.
It appears that you have some unwritten algorithm for choosing these (for n=3 you look for count of values greater and less than 2 and equal to 1, so there is no way to give specific code)
c.) For further optimization (at the expense of memory) you can sort the data into a binary search tree of structs that hold not just the value, but also the count and the number of values before and after each value. It may not use more memory at all if you have a lot of repeat values, but it is hard to tell without seeing the dataset.
That's as much as I can help without code that describes your hidden algorithms and data or at least a sufficient description (aside from recommending a course or courses in data structures and algorithms).
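A sketch of points a.) and b.) with indices rather than pointers. Instead of the linear scan described above, the two boundaries of the run of n's are found directly with a lower/upper bound search (the function names are illustrative):
def lower_bound(a, n):
    # Smallest index i with a[i] >= n (returns len(a) if there is none).
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < n:
            lo = mid + 1
        else:
            hi = mid
    return lo

def upper_bound(a, n):
    # Smallest index i with a[i] > n.
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] <= n:
            lo = mid + 1
        else:
            hi = mid
    return lo

def counts(sorted_a, n):
    lo, hi = lower_bound(sorted_a, n), upper_bound(sorted_a, n)
    return lo, hi - lo, len(sorted_a) - hi      # (smaller, equal, larger)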

Frequency of a number in array faster than linear time

Find the frequency of a number in an array in less than O(n) time.
Array 1,2,2,3,4,5,5,5,2
Input 5
Output 3
Array 1,1,1,1
Input 1
Output 4
If the only information you have is an unsorted array (as your test data seems to indicate), you cannot do better than O(n) in finding the frequency of a given value. There's no getting around that.
In order to achieve a better time complexity, there are a variety of ways.
One would be to keep the array sorted (or a parallel sorted array if you didn't want to change the order). This way, you could use a binary search to find the first item with the given value then sequentially scan that portion to get a count. While the worst case (all items the same and that value is what you're looking for) is still O(n), it will tend toward O(log n) average case.
Note that sorting the data each time before looking for a value will not work since that will almost certainly push you above the O(n) limit. The idea would be to sort only on item insertion.
Another method, provided your domain (possible values) is limited, is to maintain the actual frequencies of those values separately. For example, if the domain is limited to the numbers one through a hundred, have a separate array containing the frequency of each value.
When the list is empty, all frequencies are zero. Whenever you add or remove an item, increment or decrement the frequency for that value. This would make frequency extraction a quick O(1) operation.
But, as stated, both these solutions require extra/modified data to be maintained. Without that, you cannot do better than O(n) since you will need to examine every item in the array to see if it matches the value you're looking for.
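A sketch of the second approach, keeping the frequencies in step with the collection so lookups are O(1). It uses a dict rather than the fixed per-value array described above, so the domain does not have to be known in advance (class and method names are illustrative):
from collections import defaultdict

class CountedBag:
    # Maintains a value -> frequency table alongside the data, so a frequency
    # query is O(1) at the cost of O(1) extra work per insert or removal.
    def __init__(self, items=()):
        self.freq = defaultdict(int)
        for x in items:
            self.add(x)

    def add(self, x):
        self.freq[x] += 1

    def remove(self, x):
        if self.freq[x] > 0:
            self.freq[x] -= 1

    def frequency(self, x):
        return self.freq[x]

bag = CountedBag([1, 2, 2, 3, 4, 5, 5, 5, 2])
print(bag.frequency(5))    # 3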

Maximum sum of size n

The question is from a local hackathon:
I have a sorted array of positive integers in descending order, e.g. (9,4,2,1). You are allowed to traverse through n elements of the array to maximize the sum (initially n = the size of the array). As you traverse the array from the beginning to the end you are allowed to stop at any moment and start from the beginning again, but the cost of doing that is that you lose 1 element from your allowance. For example in this case the best way of doing that would be 9,0,9,4. Notice that I've stopped, lost an element (hence the 0) and continued again.
I want to solve this using dynamic programming. I know how to do this using DP in O(n^2). I am looking for an algorithm that does this in a better time complexity.
You don't want to take k numbers in some interval between restarts and k + 2 numbers in some other interval between restarts; it is always at least as good to take k + 1 each time. This means that, given the number of restarts, it's immediately clear what the pattern should be (as evenly as possible).
In time O(n), it's possible to precompute the sum of each prefix of the array. Then, in time O(n), iterate through all possible restart counts, computing for each in time O(1) what the total will be by examining the sums for two adjacent prefixes and multiplying by the appropriate number of times for each.
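A sketch of that O(n) computation (the function name is illustrative; the input is assumed sorted in descending order, as in the question):
def max_sum_with_restarts(a):
    n = len(a)
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)       # prefix[i] = sum of the i largest values

    best = 0
    for restarts in range(n):
        picks = n - restarts                # each restart costs one element of the allowance
        passes = restarts + 1
        q, r = divmod(picks, passes)        # r passes take q+1 items, the rest take q
        if r:
            total = r * prefix[q + 1] + (passes - r) * prefix[q]
        else:
            total = passes * prefix[q]
        best = max(best, total)
    return best

print(max_sum_with_restarts([9, 4, 2, 1]))  # 22, matching the 9,0,9,4 walk above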

Partial heap sorting to find k most frequent words in 5GB file

I know what algorithm I'd like to use but want to know what I'd have to change since the file is so large.
I want to use a hash to store the frequencies of the words and use a min-heap to store the most frequent words, adjusting the min-heap accordingly as I loop through the words. This should take O(n log k), I think. How will my algorithm need to change if I have too much data to store in memory? This is an issue I have trouble understanding in general, not just for this specific question, but I'm giving the context so that it might help with the explanation.
I think there is no deterministic way to do that without having the entire file in memory (or making some expensive kind of merge sort).
But there are some good probabilistic algorithms. Take a look at the Count-Min Sketch.
There is a great implementation of this and other algorithms, in this library.
Explaining the merge sort thing: if your file is already sorted, you can find the k most frequent pretty easily with a min-heap. Yes, a min-heap, so you can discard the least frequent term when you find a more competitive one. You can do this because you can know the frequency of the current word without having to read the entire file. If your file is unsorted, you must keep an entire list, because the most frequent term may appear anywhere in the file and be discarded as "non-competitive" too soon.
You can do a merge sort with limited memory pretty easily, but it's an I/O-intensive operation and may take a while. Actually, you can use any kind of external sort.
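For reference, a minimal Count-Min Sketch; the hash choice, width and depth here are arbitrary, and the linked library is far more careful about both. One common pattern is to pair a sketch like this with a small min-heap of candidates, similar to the heap loop shown later in this thread.
import hashlib

class CountMinSketch:
    # depth rows of width counters; estimates can only over-count, never under-count.
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The true count is at most the smallest of the depth counters.
        return min(self.table[row][col] for row, col in self._cells(item))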
Added after your comment that you need to calculate the frequencies.
You don't say how many words you expect are in the data, or what constitutes a word. If it's English text, I'd be surprised to see a half million words. And there certainly won't be a billion words in 5 gigabytes of text. But the technique doesn't really change, regardless of how many words there are.
You start by building a dictionary or hash map that contains key value pairs: word, count. As you read each word, look it up in the dictionary. If it's there, increase its count. If it's not there, add it with a count of 1.
If you have a lot of memory or relatively few words, it'll all fit into memory. If so, you can do the heap thing that I describe below.
If your memory fills up, then you simply write the key value pairs out to a text file, one word per line, like this:
word1, count
word2, count
Then clear your dictionary and keep going, adding words or increasing their counts. Repeat as necessary for each block of words until you've reached the end of the input.
Now you have a huge text file that contains word/count pairs. Sort it by word. There are many external sorting tools that'll do that. Two that come to mind are the Windows SORT utility and the GNU sort. Both can easily sort a very large file of short lines.
Once the file is sorted by word, you'll have:
word1, count
word1, count
word2, count
word3, count
word3, count
word3, count
Now it's a simple matter of going sequentially through the file, accumulating counts for words. At each word break, check its count against the heap as described below.
This whole process takes some time, but it works quite well. You can speed it some by sorting the blocks of words and writing them to individual files. Then, when you've reached the end of input you do an N-way merge on the several blocks. That's faster, but forces you to write a merge program, unless you can find one. If I were doing this once, I'd go for the simple solution. Were I to be doing it often, I'd spend the time to write a custom merge program.
After you've computed the frequencies ...
Assuming your file contains the words and their frequencies, and all you want to do is get the k words with the highest frequencies, then yes it's O(n log k), and you don't have to store all of the items in memory. Your heap only requires k items.
The idea, sketched here as runnable Python (assuming items yields the (word, frequency) pairs produced in the previous step and k is the number of words you want):
import heapq

heap = []                              # min-heap of (frequency, word) pairs
for word, frequency in items:
    if len(heap) < k:
        # fewer than k items on the heap yet, so just add this one
        heapq.heappush(heap, (frequency, word))
    elif frequency > heap[0][0]:
        # The new item's frequency is greater than the lowest frequency
        # already on the heap: replace the root with the new item.
        heapq.heapreplace(heap, (frequency, word))
After you've processed every item, the heap will contain the k items that have the highest frequencies.
You can use a selection algorithm (http://en.wikipedia.org/wiki/Selection_algorithm) to calculate the kth largest number. Then do a linear scan and select only the k largest numbers.
In practice you might want to start with an estimated range that the kth max falls into and continue from there. E.g. read the first M numbers and calculate an estimated kth max = the (k*M/N)th max among those M numbers. If you think the data is biased (i.e. partially sorted), then pick those M numbers randomly.
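A sketch of one such selection step (quickselect with a three-way partition, expected O(n)); a linear scan keeping values at least as large as the result then yields the k largest:
import random

def kth_largest(values, k):
    # Expected O(n) selection of the k-th largest value (k = 1 gives the maximum).
    a = list(values)
    target = len(a) - k                 # index in sorted (ascending) order
    lo, hi = 0, len(a) - 1
    while True:
        pivot = a[random.randint(lo, hi)]
        lt = [x for x in a[lo:hi + 1] if x < pivot]
        eq = [x for x in a[lo:hi + 1] if x == pivot]
        gt = [x for x in a[lo:hi + 1] if x > pivot]
        a[lo:hi + 1] = lt + eq + gt     # three-way partition of the current window
        if target < lo + len(lt):
            hi = lo + len(lt) - 1
        elif target < lo + len(lt) + len(eq):
            return pivot
        else:
            lo = lo + len(lt) + len(eq)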

How do I find common elements from n arrays

I am thinking of sorting and then doing binary search. Is that the best way?
I advocate hashes in such cases: you'll have time proportional to the combined size of the arrays.
Since most major languages offer a hashtable in their standard libraries, I hardly need to show you how to implement such a solution.
Iterate through each one and use a hash table to store counts. The key is the value of the integer and the value is the count of appearances.
It depends. If one set is substantially smaller than the other, or for some other reason you expect the intersection to be quite sparse, then a binary search may be justified. Otherwise, it's probably easiest to step through both at once. If the current element in one is smaller than in the other, advance to the next item in that array. When/if you get to equal elements, you send that as output, and advance to the next item in both arrays. (This assumes that, as you advocated, you've already sorted both, of course.)
This is an O(N+M) operation, where N is the size of one array, and M the size of the other. Using a binary search, you get O(N lg2 M) instead, which can be lower complexity if one array is lot smaller than the other, but is likely to be a net loss if they're close to the same size.
Depending on what you need/want, the versions that attempt to just count occurrences can cause a pretty substantial problem: if there are multiple occurrences of a single item in one array, they will still count that as two occurrences of that item, indicating an intersection that doesn't really exist. You can prevent this, but doing so renders the job somewhat less trivial -- you insert items from one array into your hash table, but always set the count to 1. When that's finished, you process the second array by setting the count to 2 if and only if the item is already present in the table.
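A sketch of the step-through-both-at-once version, assuming both arrays are already sorted and reporting each common value once (so duplicates within one array don't create phantom intersections):
def sorted_intersection(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif b[j] < a[i]:
            j += 1
        else:
            v = a[i]
            out.append(v)
            # skip duplicates so the value is reported only once
            while i < len(a) and a[i] == v:
                i += 1
            while j < len(b) and b[j] == v:
                j += 1
    return out

print(sorted_intersection([1, 2, 2, 5, 7], [2, 3, 5, 5, 8]))   # [2, 5]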
Define "best".
If you want to do it fast, you can do it O(n) by iterating through each array and keeping a count for each unique element. Details of how to count the unique elements depend on the alphabet of things that can be in the array, eg, is it sparse or dense?
Note that this is O(n) in the number of arrays, but O(n*m) for n arrays of length m.
The best way is probably to hash all the values and keep a count of occurrences, culling all that have not occurred i times when you examine array i where i = {1, 2, ..., n}. Unfortunately, no deterministic algorithm can get you less than an O(n*m) running time, since it's impossible to do this without examining all the values in all the arrays if they're unsorted.
A faster algorithm would need to either have an acceptable level of probability (Monte Carlo), or rely on some known condition of the lists to examine only a subset of elements (i.e. you only care about elements that have occurred in all i-1 previous lists when considering the ith list, but in an unsorted list it's non-trivial to search for elements).
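A sketch of the hash-and-cull idea: after processing array i, only values seen in every array so far survive, and duplicates within a single array are counted once.
def common_elements(arrays):
    if not arrays:
        return set()
    survivors = set(arrays[0])
    for arr in arrays[1:]:
        # Cull anything that does not also appear in the current array.
        survivors &= set(arr)
        if not survivors:
            break
    return survivors

print(common_elements([[1, 2, 3, 2], [2, 3, 4], [9, 3, 2]]))   # {2, 3}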
