I have an array of ~1000 objects that are float values which evolve over time (in a manner which cannot be predetermined; assume it is a black box). At every fixed time interval, I want to set a threshold value that separates the top 5-15% of values, making the cut wherever a distinction can be made most "naturally," in the sense that there are the largest gaps between data points in the array.
What is the best way for me to implement such an algorithm? Obviously (I think) the first step to take at the end of each time interval is to sort the array, but then after that I am not sure what the most efficient way to resolve this problem is. I have a feeling that it is not necessary to tabulate all of the gaps between consecutive data points in the region of interest in the sorted array, and that there is a much faster way than brute-force to solve this, but I am not sure what it is. Any ideas?
You could write your own quicksort/select routine that doesn't issue recursive calls for subarrays lying entirely outside of the 5%-15%ile range. For only 1,000 items, though, I'm not sure if it would be worth the trouble.
Another possibility would be to use fancy data structures to track the largest gaps online as the values evolve (e.g., a binary search tree decorated with subtree counts (for fast indexing) and largest subtree gaps). It's definitely not clear if this would be worth the trouble.
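For only ~1,000 items, though, the brute-force version is already cheap: sort, then scan the consecutive gaps inside the top 5-15% band and cut at the widest one. A minimal sketch of that idea (function name and the midpoint-of-gap convention are my own choices):

```python
def natural_threshold(values, lo_frac=0.85, hi_frac=0.95):
    """Sort, then place the cut at the widest gap between consecutive
    values whose rank falls in the top 5-15% (ranks 85%..95%)."""
    s = sorted(values)
    n = len(s)
    lo = int(n * lo_frac)  # first rank eligible for the cut
    hi = int(n * hi_frac)  # last rank eligible for the cut
    # index i in [lo, hi) maximizing the gap s[i+1] - s[i]
    best_i = max(range(lo, hi), key=lambda i: s[i + 1] - s[i])
    # put the threshold halfway across the widest gap
    return (s[best_i] + s[best_i + 1]) / 2
```

Sorting dominates at O(n log n); the gap scan over the band is O(n) and trivially fast at this size.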
I have n arrays of data, each of these arrays is sorted by the same criteria.
The number of arrays will, in almost all cases, not exceed 10, so it is a relatively small number. Each array, however, can contain a large number of objects, which should be treated as unbounded for the algorithm I am looking for.
I now want to treat these arrays as if they were one array. However, I need a way to retrieve objects in a given range as fast as possible, without touching all objects before the range and/or all objects after the range. Therefore it is not an option to iterate over all objects and store them in one single array. Fetches with low start values are also more likely than fetches with high start values: e.g. fetching objects [20,40) is much more likely than fetching objects [1000,1020), but the latter could happen.
The range itself will be pretty small, around 20 objects, or can be increased, if relevant for the performance, as long as this does not hit the limits of memory. So I would guess a couple of hundred objects would be fine as well.
Example:
3 arrays, each containing a couple of thousand entries. I now want to get the overall objects in the range [60, 80) without touching either the first 60 objects in each array or all the objects that come after object 80.
I am thinking about some sort of combined, modified binary search. My current idea is something like the following (note, that this is not fully thought through yet, it is just an idea):
get object 60 of each array - the beginning of the range cannot be after that, as each single array alone would already satisfy the requirement
use these objects as the maximum value for the binary search in every array
from one of the arrays, get the centered object (e.g. 30)
with a binary search in all the other arrays, try to find the object in each array, that would be before, but as close as possible to the picked object.
we now have 3 objects, e.g. objects 15, 10 and 20. The sum of their positions is 45, so there are 42 objects in front of them, which is more than the beginning of the range we are looking for (30). We continue our binary search in the remaining left half of one of the arrays
if we instead get a value where the sum is smaller than the beginning of the range we are looking for, we continue our search on the right.
at some point we will hit object 30. From there on, we can simply add the objects from each array, one by one, with an insertion sort until we hit the range length.
My questions are:
Is there any name for this kind of algorithm I described here?
Are there other algorithms or ideas for this problem, that might be better suited for this issue?
Thanks in advance for any idea or help!
People usually call this problem something like "selection in the union of multiple sorted arrays". One of the questions in the sidebar is about the special case of two sorted arrays, and this question is about the general case. Several comparison-based approaches appear in the combined answers; they more or less have to determine where the lower endpoint in each individual array is. Your binary search answer is one of the better approaches; there's an asymptotically faster algorithm due to Frederickson and Johnson, but it's complicated and not obviously an improvement for small ranks.
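For integer-valued data, the binary-search idea can be sketched as a search over the value space, counting ranks with `bisect` so only O(log) positions per array are ever touched (function names are my own; the tail slices in `fetch_range` are a simplification that copies the tails):

```python
import bisect
import heapq
import itertools

def kth_smallest(arrays, k):
    """0-based k-th smallest in the union of sorted integer arrays,
    by binary searching the value space and counting ranks with bisect."""
    lo = min(a[0] for a in arrays if a)
    hi = max(a[-1] for a in arrays if a)
    while lo < hi:
        mid = (lo + hi) // 2
        # how many elements of the union are <= mid
        if sum(bisect.bisect_right(a, mid) for a in arrays) <= k:
            lo = mid + 1
        else:
            hi = mid
    return lo

def fetch_range(arrays, start, count):
    """Objects [start, start+count) of the union, without scanning
    everything before `start`."""
    v = kth_smallest(arrays, start)
    below = sum(bisect.bisect_left(a, v) for a in arrays)  # elements < v
    tails = [a[bisect.bisect_left(a, v):] for a in arrays]
    merged = heapq.merge(*tails)
    # skip any duplicates of v that rank before `start`
    return list(itertools.islice(merged, start - below, start - below + count))
```

The small final merge matches the "insertion sort one by one until we hit the range length" step in the question; `heapq.merge` does exactly that lazily.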
This question already has answers here:
How to tell if an array is a permutation in O(n)?
(16 answers)
Closed 9 years ago.
Given an array A of size n, and two numbers a and b with b-a+1=n, I need to determine whether or not A contains each of the numbers between a and b (exactly once).
For example, if n=4 and a=1,b=4, then I'm looking to see if A is a rearrangement of [1,2,3,4].
In practice, I need to do this with O(1) space (no hash table).
My first idea was to sort A, but I have to do this without rearranging A, so that's out.
My second idea is to run through A once, adding up the entries and checking that they are in the correct range. At the end, I have to get the right sum (for a=1,b=n, this is n(n+1)/2), but this doesn't always catch everything, e.g. [1,1,4,4,5] passes the test for n=5,a=1,b=5, but shouldn't.
The only idea of mine that works is to pass through the array n times making sure to see each number once and only once. Is there a faster solution?
You can do this with a single pass through the array, using only a minor modification of the n(n+1)/2 method you already mentioned.
To do so, walk through the array, ignoring elements outside the a..b range. For numbers that are in the correct range, you want to track three values: the sum of the numbers, the sum of the squares of the numbers, and the count of the numbers.
You can precompute the correct values for both the sum of the numbers and the sum of the squares (and, trivially, the count).
Then compare your result to the expected results. Consider, for example, if you're searching for 1, 2, 3, 4. If you used only the sums of the numbers, then [1, 1, 4, 4] would produce the correct result (1+2+3+4 = 10, 1+1+4+4 = 10), but if you also add the sums of the squares, the problem is obvious: 1+4+9+16 = 30 but 1+1+16+16 = 34.
This is essentially applying (something at least very similar to) a Bloom filter to the problem. Given a sufficiently large group and a fixed pair of functions, there's going to be some set of incorrect inputs that will produce the correct output. You can reduce that possibility to an arbitrarily low value by increasing the number of filters you apply. Alternatively, you can probably design an adaptive algorithm that can't be fooled: offhand, it seems that if your range of inputs is N, raising each number to the power N+1 will probably ensure that only exactly the correct inputs produce the correct result (but I'll admit I'm not absolutely certain that's correct).
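A single-pass sketch of the sum / sum-of-squares test, with closed-form expected values (names are mine; as noted above, a sufficiently unlucky or adversarial input could in principle still match both moments, so treat it as a filter rather than a proof):

```python
def _sq_sum(m):
    # 1^2 + 2^2 + ... + m^2
    return m * (m + 1) * (2 * m + 1) // 6

def looks_like_permutation(A, a, b):
    """One pass, O(1) space: range check plus sum and sum of squares."""
    if len(A) != b - a + 1:
        return False
    total = sq = 0
    for x in A:
        if not a <= x <= b:
            return False
        total += x
        sq += x * x
    return (total == (a + b) * (b - a + 1) // 2
            and sq == _sq_sum(b) - _sq_sum(a - 1))
```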
Here is an O(1) space and O(n) time solution that might help:
Compute the expected mean and standard deviation of the range (a,b).
Scan the array and compute its actual mean and standard deviation.
If any number is outside (a,b), return false.
if (mean1 != mean2 || sd1 != sd2) return false; else return true.
Note: this might not be 100% accurate.
Here's a solution that fails with the probability of a hash collision.
Take an excellent (for example cryptographic) hash function H.
Compute: xor(H(x) for x in a...b)
Compute: xor(H(A[i]) for i in 1...n)
If the two are different, then you certainly don't have a permutation. If the two are the same, then you've almost certainly got a permutation. You can make this immune to input that's been picked to produce a hash collision by including a random seed in the hash.
This is obviously O(b-a) in running time, needs O(1) external storage, and is trivial to implement.
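A sketch of that scheme using a keyed `blake2b` hash (any strong hash would do; the seeding convention is my own):

```python
import hashlib

def _h(x, seed):
    # 64-bit keyed hash of the integer x
    d = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def is_permutation(A, a, b, seed=12345):
    """XOR of hashes over a..b must cancel the XOR over A.
    Fails only on a hash collision (or a seed known to an adversary)."""
    if len(A) != b - a + 1:
        return False
    acc = 0
    for x in range(a, b + 1):
        acc ^= _h(x, seed)
    for x in A:
        acc ^= _h(x, seed)
    return acc == 0
```

Note that with the length check in place, the only deterministic way for the XORs to cancel is for every value in a..b to appear an odd number of times, which with exactly n slots forces exactly once each; everything else is a genuine hash collision.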
How to find the most frequent number in an array? The array can be extremely large, for example 2GB and we only have limited memory, say 100MB.
I'm thinking about external sort: sorting, and then counting duplicates, which end up next to each other. Or a hashmap. But I don't know what to do with the limited memory, and I'm not even sure external sort is a good idea for this.
In the worst case, all your numbers are distinct except for one number which appears twice, and there's no way to detect this in main memory unless you have the two duplicate numbers loaded at the same time, which is unlikely without sorting if your total data size is much larger than main memory. In that case, asymptotically the best thing to do is sort numbers in batches, save each sorted batch to disk as a file, and then do a merge-sort merge step, reading all the sorted files into memory a few lines at a time and writing the merged sorted list to a new file. Then you go through the aggregate sorted file in order, count how many times you see each number, and keep track of which number has occurred the most times.
If you assume that the most frequent number appears with frequency 50% or higher, then you can do much better. You can solve the problem with constant extra memory, going through the list of numbers just once. Basically, you start by initializing the most common value (MCV) to the first number and a counter N to 1. Then you go through the list. If the next number in the list is the MCV, you increase N by one. Otherwise you decrease N by 1. If N is 0 and the next number is different from the MCV, you set MCV to the new number and N to 1. It is easy to prove that, under the majority assumption, this terminates with the most common value stored in MCV.
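That single-pass scheme is the Boyer-Moore majority vote; a minimal sketch:

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote: O(1) memory, one pass.
    Returns the majority element if one exists (> 50% frequency);
    otherwise the result is arbitrary and needs a second
    verification pass to confirm."""
    mcv, n = None, 0
    for x in stream:
        if n == 0:
            mcv, n = x, 1
        elif x == mcv:
            n += 1
        else:
            n -= 1
    return mcv
```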
Depending on what the hypotheses are, an even better way of doing it might be using the MJRTY algorithm:
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/
Or its generalization:
http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASimpleAlgorithmForFindingFrequentElementsInStreamsAndBags.pdf
The idea is that with exactly two variables (a counter and a value store) you can determine, if there exists a majority element (appearing strictly more than 50% of the time), what that element is. The generalization requires (k+1) counters and value stores to find the elements appearing more than 100/k% of the time.
Because these are only candidates for majority (if there are k-majority elements, these are they; but if there are no k-majority elements, they are just random elements that ended up there by chance), a second pass over the data can get you the exact counts of the candidates and determine which one, if any, is a majority element.
This is extremely fast and memory efficient.
There are a few other optimizations, but with 4 kB of memory you should be able to find the majority element of 2 GB of data with good probability, depending on the type of data you have.
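A sketch of the generalization (Misra-Gries), which keeps at most k-1 counters and returns candidates for every element that may appear more than n/k times (the exact frequency guarantee depends on the variant; a second pass confirms actual counts):

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-elements sketch with at most k-1 counters.
    The result is a superset of all elements with count > n/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement every counter; drop those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```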
Assumptions:
Integer is 4 bytes.
There are fewer than (100 MB / 4 B) = (104857600 / 4) = 26214400 distinct integers in the 2 GB array, and every number maps into the index range 0-26214399.
Let's do the histogram.
Make buckets in our 100 MB space: an integer array called histogram, which can store up to 26214400 counters. The counters are initially set to 0.
Iterate once through the 2 GB array. When you read x, do histogram[x]++.
Find the maximum in the histogram, iterating through it once. If the maximum is histogram[i], then i is the most frequent number.
The bottleneck is step 2, iterating through 2 GB array, but we do it only once.
If the second assumption doesn't hold (i.e. there are more than 26214400 distinct integers):
Make histogram for numbers with indices from 0 to 26214399. Keep the most frequent number from histogram. Make histogram for numbers with indices from 26214400 to 52428798. Keep the most frequent number from the histogram and the previous most frequent number. And so on.
In the worst case, with 2^32 distinct numbers, it will do (2^32 / 26214400 + 1) = 164 iterations over that 2 GB array.
In general, it will do (NUMBER_OF_DISTINCT_NUMBERS / 26214400 + 1) iterations.
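The multi-pass histogram can be sketched like this; an in-memory iterable stands in for re-reading the 2 GB file, and all names are mine:

```python
def most_frequent(read_pass, bucket_count, value_range):
    """Multi-pass histogram: each pass counts only the values falling in
    one bucket-sized window, so memory stays at `bucket_count` counters.
    `read_pass` is a callable returning a fresh iterator over the data
    (standing in for re-reading the big file)."""
    best_val, best_cnt = None, -1
    lo_all, hi_all = value_range
    lo = lo_all
    while lo <= hi_all:
        hi = min(lo + bucket_count - 1, hi_all)
        hist = [0] * (hi - lo + 1)
        for x in read_pass():           # one full pass over the data
            if lo <= x <= hi:
                hist[x - lo] += 1
        i = max(range(len(hist)), key=hist.__getitem__)
        if hist[i] > best_cnt:
            best_val, best_cnt = lo + i, hist[i]
        lo = hi + 1
    return best_val, best_cnt
```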
Assuming 4-byte integers, you can fit (100 MB / 4 B) = 25M counters into available memory.
Read through your big file, counting each occurrence of any number in the range 0 ... 25M-1. Use a big array to accumulate the counts.
Find the number which occurs most frequently, store the number and its frequency and clear the array.
Read through the big file, repeating the counting process for numbers in the range 25M ... 50M-1.
Find the number which occurs most frequently in the new array. Compare it with the number/frequency you stored at step 2. Store the number/frequency of the one with the higher frequency and clear the array.
Lather, rinse, repeat.
ETA: If we can assume that there is one single answer, that there aren't two different numbers with the same highest frequency, then you can discard all numbers if the array for a particular range shows a tie. Otherwise the problem of storing the winner for each range becomes more complex.
If you have limited memory but a reasonable amount of processing power and super fast speed isn't an issue, depending on your dataset you could try something like:
Iterate through array counting number of numbers 1 to 1000. Keep the one with the biggest count. Then count 1001 to 2000. Keep the biggest count between these, and the biggest one from the first batch. Repeat until all numbers have been counted.
I'm sure there are many optimisations for this based on the specifics of your dataset.
I already read this post but the answer didn't satisfy me: Check if Array is sorted in Log(N).
Imagine I have a seriously big array of over 1,000,000 double numbers (positive and/or negative) and I want to know if the array is "sorted", trying to avoid the maximum number of comparisons, because comparing doubles and floats takes too much time. Is it possible to use statistics on it? And if so:
Is it well regarded by real programmers?
Should I take samples?
How many samples should I take?
Should they be random, or in a sequence?
What error percentage is permitted to say "the array is sorted"?
Thanks.
That depends on your requirements. If you can say that 100 random samples out of 1,000,000 being in order is enough to assume it's sorted, then so it is. But to be absolutely sure, you will always have to go through every single entry. Only you can answer this question, since only you know how certain you need to be about it being sorted.
This is a classic probability problem taught in high school. Consider this question:
In a batch of 8,000 clocks, 7% are defective. A random sample of 10 (without replacement) is selected from the 8,000 and tested. If at least one is defective, the entire batch will be rejected.
What is the probability that the batch will be rejected?
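For reference, the rejection probability in that textbook example can be computed directly (numbers taken from the example; exact arithmetic via `Fraction`):

```python
from fractions import Fraction

# 8,000 clocks, 7% (560) defective; sample 10 without replacement.
# P(reject) = 1 - P(all 10 sampled clocks are good)
good, total, sample = 8000 - 560, 8000, 10
p_all_good = Fraction(1)
for i in range(sample):
    p_all_good *= Fraction(good - i, total - i)
p_reject = 1 - p_all_good
print(float(p_reject))  # roughly 0.516
```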
So you could take a number of random samples from your large array and check whether they are in order, but note that you would need to know the probability that a given sample is out of order, and since you don't have that information, a probabilistic approach won't work efficiently here.
(However, you can check 50% of the array and naively conclude that there is a 50% chance that it is sorted correctly.)
If you run a divide and conquer algorithm using multiprocessing (real parallelism, so only for multi-core CPUs) you can check whether an array is sorted or not in Log(N).
If you have GPU multiprocessing you can achieve Log(N) very easily, since modern graphics cards are able to run a few thousand threads in parallel.
Your question 5 is the question that you need to answer to determine the other answers. To ensure the array is perfectly sorted you must go through every element, because any one of them could be the one out of place.
The maximum number of comparisons to decide whether the array is sorted is N-1, because there are N-1 adjacent number pairs to compare. But for simplicity, we'll say N as it does not matter if we look at N or N+1 numbers.
Furthermore, it is unimportant where you start, so let's just start at the beginning.
Comparison #1 (A[0] vs. A[1]). If it fails, the array is unsorted. If it succeeds, good.
As we only compare, we can reduce this to the neighbors and whether the left one is smaller or equal (1) or not (0). So we can treat the array as a sequence of 0's and 1's, indicating whether two adjacent numbers are in order or not.
To calculate the error rate (or rather, the probability), we will have to look at all combinations of our 0/1 sequence.
I would look at it like this: we have 2^n possible 0/1 sequences (i.e. orderings of the pairs), of which only one is sorted: the all-1's sequence, indicating that each A[i] is less than or equal to A[i+1].
Now this seems to be simple: initially the chance that the array is sorted is 1/2^n. Each successful comparison eliminates half of the remaining combinations, so after k successful comparisons the chance of being sorted rises to 1/2^(n-k).
I'm not a mathematician, but it should be quite easy to calculate how many comparisons x are needed to reach a given error rate (find the smallest x such that 1 - 1/2^(n-x) <= ERROR).
Sorry for the confusing English; I come from Germany.
Since every single element can be the one element that is out-of-line, you have to run through all of them, hence your algorithm has runtime O(n).
If your understanding of "sorted" is less strict, you need to specify what exactly you mean by "sorted". Usually, "sorted" means that adjacent elements meet a less-than or less-than-or-equal condition.
Like everyone else says, the only way to be 100% sure that it is sorted is to run through every single element, which is O(N).
However, it seems to me that if you're so worried about it being sorted, then maybe having it sorted to begin with is more important than the array elements being stored in a contiguous portion in memory?
What I'm getting at is, you could use a map whose elements by definition follow a strict weak ordering. In other words, the elements in a map are always sorted. You could also use a set to achieve the same effect.
For example, std::map<int,double> collection; would allow you to almost use it like an array: collection[0]=3.0; std::cout<<collection[0]<<std::endl;. There are differences, of course, but if the sorting is so important then an array is the wrong choice for storing the data.
The old-fashioned way: print it out and see if it's in order. Really, if your sort is wrong you would probably see it soon. It's unlikely that you would see only a few misorderings if you were sorting, say, 100+ things. Whenever I deal with this, either my whole thing is completely off or it works.
As an example that you probably should not use, but that demonstrates sample size:
A statistically valid sample size can give you a reasonable estimate of sortedness. If you want to be 95% certain everything is sorted, you can do that by creating a list of truly random points to sample, perhaps ~1500.
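That spot-check might look like the sketch below (the sample count is illustrative, and a pass only means no inversion was found among the sampled adjacent pairs; it can never prove the array sorted):

```python
import random

def probably_sorted(a, samples=1500, seed=None):
    """Compare `samples` random adjacent pairs; return False on the
    first inversion found, True if none of the sampled pairs are
    out of order."""
    rng = random.Random(seed)
    for _ in range(samples):
        i = rng.randrange(len(a) - 1)
        if a[i] > a[i + 1]:
            return False
    return True
```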
Essentially, this is completely pointless if the values being out of order in even one single place will break subsequent algorithms or data requirements.
If this is a problem, preprocess the list before your code runs, or use a really fast sort package in your code. Most sort packages also have a validation mode, where it simply tells you yes, the list meets your sort criteria - or not. Other suggestions like parallelization of your check with threads are great ideas.