The question is from a local hackathon:
I have a sorted array of positive integers in descending order, e.g. (9, 4, 2, 1). You are allowed to traverse through n elements of the array to maximize the sum (initially n = the size of the array). As you traverse the array from the beginning to the end you are allowed to stop at any moment and start from the beginning again, but the cost of doing that is that you lose 1 element from your allowance. For example in this case the best way of doing that would be 9, 0, 9, 4. Notice that I've stopped, lost an element (hence the 0) and continued again.
I want to solve this using dynamic programming. I know how to do this using DP in O(n^2). I am looking for an algorithm that does this in a better time complexity.
You don't want to take k numbers in some interval between restarts and k + 2 numbers in some other interval between restarts; it is always at least as good to take k + 1 each time. This means that, given the number of restarts, it's immediately clear what the pattern should be (as evenly as possible).
In time O(n), it's possible to precompute the sum of each prefix of the array. Then, in time O(n), iterate through all possible restart counts, computing each total in O(1) by looking up the sums of two adjacent prefixes and multiplying each by the number of passes that use it.
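For illustration, here is a minimal Python sketch of that O(n) approach (the function name and structure are mine, not from the original question):

```python
def max_sum_with_restarts(a):
    """a is sorted in descending order; the allowance is n = len(a).
    Try every number of restarts r; the n - r picks are spread as evenly
    as possible over the r + 1 passes, each pass taking a prefix of a."""
    n = len(a)
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)          # prefix[i] = sum of the first i elements

    best = 0
    for r in range(n):                         # r = number of restarts
        picks = n - r                          # elements actually taken
        q, rem = divmod(picks, r + 1)          # rem passes take q+1 items, the rest take q
        if rem:
            total = rem * prefix[q + 1] + (r + 1 - rem) * prefix[q]
        else:
            total = (r + 1) * prefix[q]
        best = max(best, total)
    return best

# max_sum_with_restarts([9, 4, 2, 1]) == 22    # the 9, 0, 9, 4 pattern above
```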
Is there a way to find the median of an unsorted array:
1- without sorting it.
2- without using the select algorithm, nor the median of medians?
I found a lot of other questions similar to mine, but the solutions, most if not all of them, discussed the select problem and the median of medians.
You can certainly find the median of an array without sorting it. What is not easy is doing that efficiently.
For example, you could just iterate over the elements of the array; for each element, count the number of elements less than and equal to it, until you find a value with the correct count. That will be O(n²) time but only O(1) space.
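A rough Python sketch of that counting idea (the function name is mine; it returns the lower median when the array length is even):

```python
def median_by_counting(a):
    """O(n^2) time, O(1) extra space: for each candidate, count how many
    elements are smaller and how many are equal, and check whether the
    middle position n // 2 falls inside that candidate's run of values."""
    n = len(a)
    for x in a:
        less = sum(1 for y in a if y < x)
        equal = sum(1 for y in a if y == x)
        if less <= n // 2 < less + equal:
            return x
```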
Or you could use a min heap whose size is just over half the size of the array. (That is, if the array has 2k or 2k+1 elements, then the heap should have k+1 elements.) Build the heap using the first k+1 array elements, using the standard heap building algorithm (which is O(N)). Then, for each remaining element x, if x is greater than the heap's minimum, replace the min element with x and sift it down to restore the heap (which is O(log N)). At the end, the median is either the heap's minimum element (if the original array's size was odd) or is the average of the two smallest elements in the heap. So that's a total of O(n log n) time, and O(n) space if you cannot rearrange array elements. (If you can rearrange array elements, you can do this in-place.)
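A sketch of that heap approach using Python's heapq, assuming we may copy the first k+1 elements rather than rearrange the array in place:

```python
import heapq

def median_with_heap(a):
    """Min-heap of the k + 1 largest elements, where len(a) is 2k or 2k + 1."""
    n = len(a)
    k = n // 2
    heap = list(a[:k + 1])
    heapq.heapify(heap)                      # O(k)
    for x in a[k + 1:]:
        if x > heap[0]:
            heapq.heapreplace(heap, x)       # replace the minimum, O(log k)
    if n % 2:                                # odd length: median is the heap minimum
        return heap[0]
    smallest = heapq.heappop(heap)           # even length: average the two smallest
    return (smallest + heap[0]) / 2
```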
There is a randomized algorithm able to accomplish this task in O(n) steps (average case scenario), but it does involve sorting some subsets of the array. And, because of its random nature, there is no guarantee it will actually ever finish (though this unfortunate event should happen with vanishing probability).
I will leave the main idea here. For a more detailed description and for the proof of why this algorithm works, check here.
Let A be your array and let n = |A|. Let's assume all elements of A are distinct. The algorithm goes like this:
1. Randomly select t = n^(3/4) elements from A.
2. Let T be the "set" of the selected elements. Sort T.
3. Set pl = T[t/2 - sqrt(n)] and pr = T[t/2 + sqrt(n)].
4. Iterate through the elements of A and determine how many elements are less than pl (denoted by l) and how many are greater than pr (denoted by r). If l > n/2 or r > n/2, go back to step 1.
5. Let M be the set of elements of A in between pl and pr. M can be determined in step 4, just in case we reach step 5. If the size of M is no more than 4t, sort M. Otherwise, go back to step 1.
6. Return m = M[n/2 - l] as the median element.
The main idea behind the algorithm is to obtain two elements (pl and pr) that enclose the median element (i.e. pl < m < pr) such that these two are very close to each other in the ordered version of the array (and to do this without actually sorting the array). With high probability, all six steps only need to execute once (i.e. you will get pl and pr with these "good" properties from the first and only pass through steps 1-5, so no going back to step 1). Once you find two such elements, you can simply sort the elements in between them and find the median element of A.
Steps 2 and 5 do involve some sorting (which might be against the "rules" you've mysteriously established :p). If sorting a sub-array is on the table, you should use some sorting method that does this in O(s log s) steps, where s is the size of the array you are sorting. Since T and M are significantly smaller than A, the sorting steps take "less than" O(n) steps. If it is also against the rules to sort a sub-array, then take into consideration that in both cases the sorting is not really needed. You only need to find a way to determine pl, pr and m, which is just another selection problem (with the respective indices). While sorting T and M does accomplish this, you could use any other selection method (perhaps something rici suggested earlier).
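For concreteness, a rough Python sketch of the six steps above (random.sample stands in for the random selection, and the index clamping for very small n is my own addition):

```python
import random

def randomized_median(a):
    """Randomized median with O(n) average running time.
    Assumes all elements of a are distinct."""
    n = len(a)
    while True:
        t = int(round(n ** 0.75))
        T = sorted(random.sample(a, t))                   # steps 1-2
        s = int(n ** 0.5)
        pl = T[max(0, t // 2 - s)]                        # step 3 (clamped for tiny n)
        pr = T[min(t - 1, t // 2 + s)]
        l = sum(1 for x in a if x < pl)                   # step 4
        r = sum(1 for x in a if x > pr)
        if l > n // 2 or r > n // 2:
            continue                                      # go back to step 1
        M = sorted(x for x in a if pl <= x <= pr)         # step 5
        if len(M) > 4 * t:
            continue                                      # go back to step 1
        return M[n // 2 - l]                              # step 6
```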
A non-destructive routine selip() is described at http://www.aip.de/groups/soe/local/numres/bookfpdf/f8-5.pdf. It makes multiple passes through the data, at each stage making a random choice of items within the current range of values and then counting the number of items to establish the ranks of the random selection.
I was reading some practice interview questions and I have a question about this one. Assume a list of random integers, each between 1 and 100; compute the sum of the k largest integers. Discuss space and time complexity, and whether the approach changes if each integer is between 1 and m, where m varies.
My first thought is to sort the array and compute the sum of the largest k numbers. Then I thought about using a binary tree structure where I could start looking from the bottom right of the tree. I am not sure whether my approach would change if the numbers are between 1 and 100 or between 1 and m. Any thoughts on the most efficient approach?
The most efficient way might be to use something like randomized quickselect. It doesn't do the sorting step to completion and instead does just the partition step from quicksort. If you don't need the k largest integers in any particular order, this is the way I'd go. It takes linear time, but the analysis is not very straightforward. m would have little impact on this. Also, you can write the code in such a way that the sum is computed as you partition the array.
Time: O(n)
Space: O(1)
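Here is a rough sketch of that partitioning idea (names are mine; for simplicity it sums the tail after partitioning rather than accumulating during the partition):

```python
import random

def sum_k_largest(a, k):
    """Average O(n) time, O(1) extra space: partition the array in place
    (quickselect style) so that the k largest values occupy the tail,
    then sum that tail."""
    lo, hi = 0, len(a) - 1
    target = len(a) - k                  # index where the k largest begin
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                    # Hoare-style partition around pivot
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break                        # target lies between j and i: done
    return sum(a[target:])
```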
The alternative is sorting using something like counting sort which has a linear time guarantee. As you say the values are integers in a fixed range, it would work quite well. As m increases the space requirement goes up, but computing the sum is quite efficient within the buckets.
Time: O(n + m) (you must read all n values and scan the m buckets)
Space: O(m)
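And a sketch of the counting-sort variant, under the stated assumption that values lie in 1..m (m = 100 by default here; names are mine):

```python
def sum_k_largest_counting(a, k, m=100):
    """O(n + m) time, O(m) space: count occurrences of each value,
    then take the k largest from the top bucket downwards."""
    counts = [0] * (m + 1)
    for x in a:
        counts[x] += 1
    total, remaining = 0, k
    for v in range(m, 0, -1):            # walk buckets from largest value down
        take = min(counts[v], remaining)
        total += v * take
        remaining -= take
        if remaining == 0:
            break
    return total
```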
I'd say sorting is probably unnecessary. If k is small, then all you need to do is maintain a sorted list that truncates elements beyond the kth largest element.
Each step should be O(k) in the worst possible case, where the added element is a new maximum. The average case is much better, though: after a certain number of elements, most candidates will simply be smaller than the last element in the list and can be rejected at once, and for those that do belong, finding the insertion point is O(log(k)) with binary search.
One way is to use a min-heap (implemented as a binary tree) of maximum size k. Checking whether a new element belongs in the heap or not is only O(1), since it's a min-heap and retrieval of the minimum element is a constant-time operation. Each insertion step (or non-insertion, in the case of an element that is too small to be inserted) across the n list elements is O(log k). The final tree traversal and summation step is O(k).
Total complexity:
O(n log k + k) = O(n log k)
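A minimal heapq sketch of that scheme (function name mine), with a single summation at the end:

```python
import heapq

def sum_k_largest_heap(a, k):
    """Min-heap of the k largest values seen so far: O(n log k) time,
    O(k) space.  Assumes 1 <= k <= len(a)."""
    heap = list(a[:k])
    heapq.heapify(heap)                  # O(k)
    for x in a[k:]:
        if x > heap[0]:                  # O(1) peek at the smallest of the current top k
            heapq.heapreplace(heap, x)   # O(log k)
    return sum(heap)                     # single O(k) summation at the end
```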
Unless you have multiple cores on your computer, in which case parallel computing is an option, summation should only be done at the end. On-the-fly computation adds extra steps without reducing your time complexity at all (you will actually have more computations to do). You will always have to sum k elements anyway, so why not avoid the additional addition and subtraction steps?
The question is pretty much what the title says, with a slight variation. If I remember correctly, finding an entry in an array of size n has an average-case complexity of O(n).
I assume that is also the case if there is a fixed number of elements in the vector, of which we want to find one.
But what if the number of entries, of which we still only try to find one, is in some way related to the size of the vector, i.e. grows with it?
I have such a case at hand, but I don't know the exact relation between array size and the number of searched-for entries. It might be linear, it might be logarithmic... Is the average case still O(n)?
I would be grateful for any insights.
edit: an example
array size: 100
array content: at each position, a number from 1 to 10, completely random which one.
what we seek: the first occurrence of "1"
From a naive point of view, we should on average find an entry after about 10 lookups with any kind of linear search (which is what we have to do, as the content is not sorted).
As constant factors are usually omitted in big-O, does that mean we still say the search is O(n) in time, even though on average it only takes about n/10 lookups?
It is O(n) anyway.
Think about finding 1 here:
[9,9,9,9,9,1]
If you're doing a linear search through the array, then the average time complexity of finding one of M elements in an array with N elements will be O(I) where I is the average index of the first of the sought elements. If the array is randomly ordered, then I will be O(N/M) on average, and so the time complexity will also be O(N/M) on average and O(N-M) in the worst case.
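If you want to check that intuition empirically, a quick simulation sketch (the function and parameters are purely illustrative) gives an average first-hit index of roughly N/M:

```python
import random

def average_first_hit(n, m, trials=10_000):
    """Place m sought values uniformly at random among n positions and
    return the average index at which a linear scan first finds one."""
    total = 0
    for _ in range(trials):
        marked = set(random.sample(range(n), m))   # positions holding a sought value
        total += next(i for i in range(n) if i in marked)
    return total / trials

# average_first_hit(100, 10) comes out around 8-9, i.e. roughly n / m.
```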
I am of two minds on this question.
First, if you consider an unsorted array (which seems to be the case here), the average-case asymptotic complexity will surely be O(n).
Let's take an example.
We have n elements in the array, or better to say vector. The average case is a linear search, element by element, which takes about n/2 comparisons on average, i.e. O(n). If elements are added, the nature of the complexity won't change, but the effect is clear: it is still n/2 comparisons on average, which is directly half of n. With m sought elements now present in the array, this becomes O(n - m), or, in terms of comparisons, about (n - m)/2 as a result of adding those elements to the vector.
So we find that as the size of the array (or vector) grows, the nature of the complexity does not change, though the number of comparisons required grows, since it stays around n/2 in the average case.
Second, if the array or vector is sorted, then binary search has a worst case on the order of log(n+1), again dependent on n. The average number of comparisons also grows logarithmically, but the complexity order O(log n) won't change!
How to find the most frequent number in an array? The array can be extremely large, for example 2GB and we only have limited memory, say 100MB.
I'm thinking about an external sort, which is sorting and then counting duplicates that end up next to each other. Or a hash map. But I don't know what to do with the limited memory, and I'm not even sure if an external sort is a good idea for this.
In the worst case, all your numbers are distinct except for one number which appears twice, and there's no way to detect this in main memory unless you have the two duplicate numbers loaded into main memory at the same time, which is unlikely without sorting if your total data size is much larger than the main memory size. In that case, asymptotically the best thing to do is sort the numbers in batches, save the batches to disk as files, and then do the merge step of merge sort, reading all the sorted files into memory a few lines at a time and writing the merged sorted list to a new file. Then you go through the aggregate sorted file in order and count how many times you see each number, keeping track of which number has occurred the most times.
If you assume that the most frequent number appears more than 50% of the time, then you can do much better. You can solve the problem with constant extra memory, going through the list of numbers just once. Basically, you start by initializing the most common value (MCV) to the first number and initializing a counter N to 1. Then you go through the list. If the next number in the list is the MCV, you increase N by one. Otherwise you decrease N by 1. If N is 0 and the next number is different from the MCV, then you set MCV to the new number and set N to 1. It is easy to prove this will terminate with the most common value stored in MCV.
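In code, that single pass might look like the sketch below (this is the Boyer-Moore majority vote; the result is only guaranteed when a true majority exists, otherwise it is merely a candidate to verify with a second pass):

```python
def majority_candidate(numbers):
    """One pass, O(1) extra memory.  Returns the most common value (MCV)
    if some value occurs in more than half of the positions."""
    mcv, count = None, 0
    for x in numbers:
        if count == 0:
            mcv, count = x, 1        # adopt a new candidate
        elif x == mcv:
            count += 1
        else:
            count -= 1
    return mcv
```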
Depending on what the hypotheses are, an even better way of doing it might be using the MJRTY algorithm:
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/
Or its generalization:
http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASimpleAlgorithmForFindingFrequentElementsInStreamsAndBags.pdf
The idea is that with exactly two variables (a counter and a value store) you can determine, if there exists a majority element (appearing strictly more than 50% of the time), what that element is. The generalization requires (k+1) counters and value stores to find the elements appearing more than 100/k% of the time.
Because these are only candidates for the majority (if there are k-majority elements, those are they; but if there are no k-majority elements, then these are just elements that ended up there by chance), a second pass over the data can give you the exact counts of the candidates and determine which one, if any, is a majority element.
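A sketch of that kind of generalization, often called the Misra-Gries summary (the exact counter-count convention differs slightly from the description above): with c counters, any element occurring more than n/(c+1) times is guaranteed to survive as a candidate:

```python
def frequent_candidates(numbers, c):
    """Misra-Gries: keep at most c (value, counter) pairs; when a new value
    arrives and no slot is free, decrement every counter by one.  The
    survivors are the only possible elements with frequency > n / (c + 1);
    a second pass gives their exact counts."""
    counters = {}
    for x in numbers:
        if x in counters:
            counters[x] += 1
        elif len(counters) < c:
            counters[x] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return list(counters)
```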
This is extremely fast and memory efficient.
There are a few other optimizations, but with 4 KB of memory you should be able to find the majority element of 2 GB of data with good probability, depending on the type of data you have.
Assumptions:
Integer is 4 bytes.
There are fewer than (100 MB / 4 B) = (104857600 / 4) = 26214400 distinct integers in the 2 GB array. Every number maps into the index range 0-26214399.
Let's do the histogram.
1. Make buckets in our 100 MB space. It's an integer array called histogram, which can store up to 26214400 counters. Counters are initially set to 0.
2. Iterate once through the 2 GB array. When you read x, do histogram[x]++.
3. Find the maximum in the histogram, iterating through it once. If the maximum is histogram[i], then i is the most frequent number.
The bottleneck is step 2, iterating through 2 GB array, but we do it only once.
If the second assumption doesn't hold (i.e. there are more than 26214400 distinct integers):
Make a histogram for numbers with indices from 0 to 26214399. Keep the most frequent number from that histogram. Make a histogram for numbers with indices from 26214400 to 52428799. Keep the most frequent number from this histogram and the previous most frequent number. And so on.
In the worst case, with 2^32 distinct numbers, it will do (2^32 / 26214400 + 1) = 164 iterations over that 2 GB array.
In general, it will do (NUMBER_OF_DISTINCT_NUMBERS / 26214400 + 1) iterations.
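A rough sketch of that multi-pass histogram; the input is assumed here to be a text file with one unsigned 32-bit integer per line (purely a stand-in for the real 2 GB format), and array('I', ...) keeps each counter at 4 bytes on typical platforms:

```python
from array import array

BUCKETS = 26214400                      # counters that fit in ~100 MB at 4 bytes each

def most_frequent(path):
    """Scan the file once per value range of size BUCKETS, keeping only the
    best (value, count) pair seen so far across ranges."""
    best_value, best_count = None, 0
    low = 0
    while low < 2 ** 32:
        histogram = array('I', [0]) * BUCKETS           # all counters reset to 0
        with open(path) as f:
            for line in f:
                x = int(line)
                if low <= x < low + BUCKETS:
                    histogram[x - low] += 1
        i = max(range(BUCKETS), key=histogram.__getitem__)
        if histogram[i] > best_count:
            best_value, best_count = low + i, histogram[i]
        low += BUCKETS                                  # next value range
    return best_value
```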
Assuming 4-byte integers, you can fit (100 / 4) = 25M counters (about 26 million) into the available memory.
1. Read through your big file, counting each occurrence of any number in the range 0 ... 25M-1. Use a big array to accumulate the counts.
2. Find the number which occurs most frequently; store the number and its frequency and clear the array.
3. Read through the big file again, repeating the counting process for numbers in the range 25M ... 50M-1.
4. Find the number which occurs most frequently in the new array. Compare it with the number/frequency you stored at step 2. Store the number/frequency of the one with the higher frequency and clear the array.
5. Lather, rinse, repeat.
ETA: If we can assume that there is one single answer, that there aren't two different numbers with the same highest frequency, then you can discard all numbers if the array for a particular range shows a tie. Otherwise the problem of storing the winner for each range becomes more complex.
If you have limited memory but a reasonable amount of processing power and super fast speed isn't an issue, depending on your dataset you could try something like:
Iterate through the array, counting the numbers from 1 to 1000. Keep the one with the biggest count. Then count 1001 to 2000. Keep the biggest count among these, plus the biggest one from the first batch. Repeat until all numbers have been counted.
I'm sure there are many optimisations for this based on the specifics of your dataset.
I am having trouble on an assignment regarding running time.
The problem statement is:
"Isabel has an interesting way of summing up the values in an array A of n integers, where n is a power of two. She creates an array B of half the size of A, and sets B[i] = A[2i] + A[2i+1], for i=0,1,…,(n/2)-1. If B has size 1, then she outputs B[0]. Otherwise, she replaces A with B, and repeats the process. What is the running time of her algorithm?"
Would this be considered a O(log n) or a O(n)? I am thinking O(log n) because you would keep on dividing the array in half until you get the final result and I believe the basis of O(log n) is that you do not traverse the entire data structure. However in order to compute the sum, you have to access each element within the array thus making me think that it could possibly be O(n). Any help in understanding this would be greatly appreciated.
I believe the basis of O(log n) is that you do not traverse the entire data structure.
There's no basis for beliefs or guesses. Run through the algorithm mentally.
How many recursions are there going to be for array A of size n?
How many summations are there going to be for each recursion (when array A is of size n)?
First run: n/2 summations, n accesses to elements of A
...
Last run: 1 summation, 2 accesses to elements of A
How many runs are there total? When you sum this up, what is the highest power of n?
As you figured out yourself, you do need to access all elements to compute the sum. So your proposition:
I believe the basis of O(log n) is that you do not traverse the entire data structure
does not hold. You can safely disregard the possibility of the algorithm being O(log n) then.
As for being O(n) or something different, you need to think about how many operations will be done as a whole. George Skoptsov's answer gives a good hint at that. I'd just like to call attention to a fact (from my own experience) that to determine "the running time" you need to take everything into account: memory access, operations, input and output, etc. In your simple case, only looking at the accesses (or the number of sums) might be enough, but in practice you can have very skewed results if you don't look at the problem from every angle.
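To make the O(n) total concrete, here is a small sketch of the procedure from the problem statement that also counts the additions; they sum to n/2 + n/4 + ... + 1 = n - 1:

```python
def isabel_sum(a):
    """Repeatedly pair up neighbours until one value is left, counting how
    many additions were performed.  len(a) is assumed to be a power of two,
    as in the problem statement."""
    additions = 0
    while len(a) > 1:
        a = [a[2 * i] + a[2 * i + 1] for i in range(len(a) // 2)]
        additions += len(a)              # one addition per pair in this round
    return a[0], additions

# isabel_sum([1, 2, 3, 4, 5, 6, 7, 8]) -> (36, 7)   # 7 additions = n - 1
```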