Finding percentiles in a sorted array - arrays

I am writing some code, and I want to know if I am correctly computing percentiles in a sorted array. Currently, if I want to compute, say, the 90th percentile, I do this: ARR[(9 * (N + 1))/10]. Or, let's say I'm computing the 50th percentile in a sorted array, I do this: ARR[(5 * (N + 1)) / 10]. More generally, to compute the xth percentile, I check index [x/100 * (N + 1)], where N is the size of the array.
These seem to be working, but I am wondering whether there is some edge case I'm missing. For instance, say you only have 5 elements. What should the 90th percentile be then? Should it just be the largest value?
Thanks in advance

For instance, say you only have 5 elements. What should the 90th percentile be then? Should it just be the largest value?
Yes. If you go by a definition like (this one is just copied from Wikipedia)
the P-th percentile of a list of N ordered values (sorted from least to greatest) is the smallest value in the list such that no more than P percent of the data is strictly less than the value and at least P percent of the data is less than or equal to that value
the 5th element can be the 90th percentile:
no more than P percent of the data is strictly less than the value: 80% of the data is strictly less than the largest element, which is no more than 90%
at least P percent of the data is less than or equal to that value: 100% of the data is less than or equal to the 5th element, which is at least 90%
And the 5th element is the smallest one which can do that (even if the 4th and 5th elements are equal, the 5th element is still the smallest one, because the percentile is about the value, not the position).
For fine-tuning a formula, the border cases are more interesting - like the 79th, 80th and 81st percentiles of a 5-element list:
element (1-based):  1     2     3     4     5
strictly less:      0%    20%   40%   60%   80%
less or equal:      20%   40%   60%   80%   100%
79th percentile: the 4th element is expected (60% < 79%, 79% <= 80%)
80th percentile: the 4th element is expected (60% < 80%, 80% <= 80%)
81st percentile: the 5th element is expected (80% < 81%, 81% <= 100%)
That feels like rounding something (fractional indices) upwards: 80 is a border, and the 0-based index mappings are 79 -> 3, 80 -> 3, but 81 -> 4. The function is usually called something like ceil() or Math.ceil() (the question does not specify a programming language at the moment).
P     5*P/100    ceil(5*P/100)    (N = 5)
79    3.95       4
80    4.00       4
81    4.05       5
((N+1) would produce 4.74, 4.8, 4.86, so it is safe to say +1 is not needed)
And thus ceil(N*P/100) really seems to be the one (of course it is on Wikipedia too, just 2-3 lines below the definition)
Note that programming languages may add various quirks:
arrays/lists are often indexed from 0
the result of ceil() may need to be converted to integer
and a sneaky one: if N and P are integer numbers, you may need to ensure that the division is not an integer-division (automatically throwing away the fraction part, so rounding the result downwards).
A Java line would be something like
int index=(int)Math.ceil(N*P/100.0)-1;
If you want 0th percentile, it can be handled separately, or hacked into the same line with max()
public static int percentile(int[] array, float P) {
    return array[Math.max(0,
            Math.min(array.length, (int) Math.ceil(array.length * P / 100)) - 1)];
}
(This one also uses min() and will produce some result for any finite P, implicitly clamping it into the 0 <= P <= 100 range.)
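For a quick sanity check of the border cases above, using the percentile() method as written (the sample array and values here are mine, not from the question):

public static void main(String[] args) {
    int[] data = {10, 20, 30, 40, 50};         // sorted, 5 elements
    System.out.println(percentile(data, 79));  // 40 (4th element)
    System.out.println(percentile(data, 80));  // 40 (4th element)
    System.out.println(percentile(data, 81));  // 50 (5th element)
    System.out.println(percentile(data, 90));  // 50 (the largest value)
    System.out.println(percentile(data, 0));   // 10 (the smallest value, via the max() clamp)
}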

Related

Interleaving array {a1,a2,....,an,b1,b2,...,bn} to {a1,b1,a2,b2,a3,b3} in O(n) time and O(1) space

I have to interleave a given array of the form
{a1,a2,....,an,b1,b2,...,bn}
as
{a1,b1,a2,b2,a3,b3}
in O(n) time and O(1) space.
Example:
Input - {1,2,3,4,5,6}
Output- {1,4,2,5,3,6}
This is the arrangement of elements by indices:
Initial Index Final Index
0 0
1 2
2 4
3 1
4 3
5 5
By observation after taking some examples, I found that ai (i < n/2) goes from index i to index 2i, and bi (i >= n/2) goes from index i to index ((i - n/2) * 2) + 1. You can verify this yourself. Correct me if I am wrong.
However, I am not able to correctly apply this logic in code.
My pseudo code:
for (i = 0; i < n; i++)
    if (i < n/2)
        swap(arr[i], arr[2*i]);
    else
        swap(arr[i], arr[((i - n/2) * 2) + 1]);
It's not working.
How can I write an algorithm to solve this problem?
Element bn is in the correct position already, so let's forget about it and only worry about the other N = 2n-1 elements. Notice that N is always odd.
Now the problem can be restated as "move the element at each position i to position 2i % N"
The item at position 0 doesn't move, so let's start at position 1.
If you start at position 1 and move its element to position 2%N, you have to remember the item at position 2%N before you replace it. Then the one from position 2%N goes to position 4%N, the one from 4%N goes to 8%N, etc., until you get back to position 1, where you can put the remaining item into the slot you left open.
You are guaranteed to return to slot 1, because N is odd and multiplying by 2 mod an odd number is invertible. You are not guaranteed to cover all positions before you get back, though. The whole permutation will break into some number of cycles.
If you can start this process at one element from each cycle, then you will do the whole job. The trouble is figuring out which ones are done and which ones aren't, so you don't cover any cycle twice.
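As a minimal sketch of the cycle-following step itself (Java; the helper name is mine, start is assumed to be a cycle leader and N odd):

// Move each element along the cycle that starts at position `start`,
// sending the element at position i to position (2*i) % N.
static void followCycle(int[] a, int start, int N) {
    int carried = a[start];                 // element that is currently displaced
    int i = (2 * start) % N;
    while (i != start) {
        int displaced = a[i];
        a[i] = carried;                     // drop the carried element into its target slot
        carried = displaced;                // and pick up the one that was there
        i = (2 * i) % N;
    }
    a[start] = carried;                     // close the cycle
}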
I don't think you can do this for arbitrary N in a way that meets your time and space constraints... BUT if N = 2^x - 1 for some x, then this problem is much easier, because each cycle includes exactly the cyclic shifts of some x-bit pattern. You can generate single representatives for each cycle (called cycle leaders) in constant time per index. (I'll describe the procedure in an appendix at the end.)
Now we have the basis for a recursive algorithm that meets your constraints.
Given [a1...an,b1...bn]:
Find the largest x such that 2^x <= 2n
Rotate the middle elements to create [a1...am, b1...bm, am+1...an, bm+1...bn], where m = 2^(x-1) (see the reversal sketch below)
Interleave the first 2m = 2^x elements in linear time using the above-described procedure, since that block will have modulus 2^x - 1
Recurse to interleave the last part of the array.
Since the last part of the array we recurse on is guaranteed to be at most half the size of the original, we have this recurrence for the time complexity:
T(N) = O(N) + T(N/2)
= O(N)
And note that the recursion is a tail call, so you can do this in constant space.
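The rotation in step 2 can be done in place with the usual three-reversal trick; here is a sketch in Java (helper names are mine). Left-rotating the middle block [a(m+1)...an, b1...bm] by n-m positions swaps its two halves:

// Reverse a[lo..hi] in place.
static void reverse(int[] a, int lo, int hi) {
    while (lo < hi) {
        int t = a[lo];
        a[lo++] = a[hi];
        a[hi--] = t;
    }
}

// Left-rotate the subarray a[lo..hi] by k positions in O(hi - lo) time and O(1) space.
static void rotateLeft(int[] a, int lo, int hi, int k) {
    reverse(a, lo, lo + k - 1);   // reverse the first k elements
    reverse(a, lo + k, hi);       // reverse the rest
    reverse(a, lo, hi);           // reverse the whole range
}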
Appendix: Generating cycle leaders for shifts mod 2^x - 1
A simple algorithm for doing this is given in a paper called "An algorithm for generating necklaces of beads in 2 colors" by Fredricksen and Kessler. You can get a PDF here: https://core.ac.uk/download/pdf/82148295.pdf
The implementation is easy. Start with x 0s, and repeatedly:
Set the lowest order 0 bit to 1. Let this be bit y
Copy the lower order bits starting from the top
The result is a cycle leader if x-y divides x
Repeat until you have all x 1s
For example, if x=8 and we're at 10011111, the lowest 0 is bit 5. We switch it to 1 and then copy the remainder from the top to give 10110110. 8-5=3, though, and 3 does not divide 8, so this one is not a cycle leader and we continue to the next.
The algorithm I'm going to propose is probably not O(n).
It's not based on swapping elements but on moving elements, which could be O(1) per move if you have a list rather than an array.
Given an array of 2N elements, at each iteration i (1-based) you take the element at position N + i and move it to position 2*i:
a1,a2,a3,...,an,b1,b2,b3,...,bn
a1,b1,a2,a3,...,an,b2,b3,...,bn
a1,b1,a2,b2,a3,...,an,b3,...,bn
a1,b1,a2,b2,a3,b3,...,an,...,bn
and so on.
example with N = 4
1,2,3,4,5,6,7,8
1,5,2,3,4,6,7,8
1,5,2,6,3,4,7,8
1,5,2,6,3,7,4,8
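On a plain array each such move has to shift the elements in between, so a direct translation is O(n^2) time (still O(1) space). A sketch in Java, using 0-based indices (names are mine):

// At step i, move the element at index n + i (the (i+1)-th b) to index 2*i + 1,
// shifting everything in between one slot to the right. O(n^2) time, O(1) space.
static void interleaveByMoves(int[] arr) {
    int n = arr.length / 2;
    for (int i = 0; i < n; i++) {
        int from = n + i;            // current position of b(i+1)
        int to = 2 * i + 1;          // its final position
        int val = arr[from];
        for (int j = from; j > to; j--) {
            arr[j] = arr[j - 1];     // shift the block right by one
        }
        arr[to] = val;
    }
}

For {1,2,3,4,5,6} this produces {1,4,2,5,3,6}, matching the expected output.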
One idea, which is a little complex, is to suppose each location has the following value:
1, 3, 5, ..., 2n-1 | 2, 4, 6, ..., 2n
a1,a2, ..., an | b1, b2, ..., bn
Then use in-place merging of two sorted arrays, as explained in this article, in O(n) time and O(1) space. However, we need to manage this indexing during the process.
There is a practical linear time* in-place algorithm described in this question. Pseudocode and C code are included.
It involves swapping the first 1/2 of the items into the correct place, then unscrambling the permutation of the 1/4 of the items that got moved, then repeating for the remaining 1/2 array.
Unscrambling the permutation uses the fact that left items move into the right side with an alternating "add to end, swap oldest" pattern. We can find the i'th index in this permutation with this rule:
For even i, the end was at i/2.
For odd i, the oldest was added to the end at step (i-1)/2
*The number of data moves is definitely O(N). The question asks for the time complexity of the unscramble index calculation. I believe it is no worse than O(lg lg N).

Changing the values of array by the distance of the indexes (c)

I'm having hard time with this one:
I need to write a function in C that receives a binary array and its size, and the function should calculate and replace the current values with the distance (by indexes) of each 1 to the closest 0.
For example: if the function receives the array {1,1,0,1,1,1,0,1}, then the new values of the array should be {2,1,0,1,2,1,0,1}. It is known that the input has at least one zero.
So the first step I thought about was to locate a pair of zeros (or just one if there is only one) and set them as two indexes (z1, z2). Then I set another index i that checks each time which zero is the closest to it (by absolute value), and then the difference between i and z1 or z2 would be the new value.
I have the plan, but things are not going exactly as I planned. Basically I deleted the code (it wasn't good anyway), so I would appreciate any help. Thanks!
This problem is based on two things:
Keep an array left[i] which holds the distance from index i to the nearest 0 at or to the left of i (filled in one left-to-right pass).
Keep an array right[i] which holds the distance from index i to the nearest 0 at or to the right of i (filled in one right-to-left pass).
Each can be calculated in a single pass, so O(n).
Then for each position take the minimum of left[i] and right[i]. That is the answer for a 1 at position i.
Overall the time complexity is O(n).
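A sketch of this approach in Java (the question asks for C, but the logic carries over directly; names and the sentinel handling are mine):

// Replace each element with the distance to the nearest 0, using one
// left-to-right pass, one right-to-left pass, and a final minimum.
static void distancesToNearestZero(int[] a) {
    int n = a.length;
    final int INF = Integer.MAX_VALUE / 2;  // "no zero seen yet" sentinel
    int[] left = new int[n];
    int[] right = new int[n];

    int lastZero = -INF;                    // index of the last 0 seen so far
    for (int i = 0; i < n; i++) {
        if (a[i] == 0) lastZero = i;
        left[i] = i - lastZero;             // distance to the nearest 0 on the left
    }

    lastZero = INF;
    for (int i = n - 1; i >= 0; i--) {
        if (a[i] == 0) lastZero = i;
        right[i] = lastZero - i;            // distance to the nearest 0 on the right
    }

    for (int i = 0; i < n; i++) {
        a[i] = Math.min(left[i], right[i]); // 0 stays 0, each 1 gets its distance
    }
}

For the example {1,1,0,1,1,1,0,1} this produces {2,1,0,1,2,1,0,1}, matching the question.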

Maximize sum of weights with constraints given on left and right indices in array

I recently came through an interesting coding problem, which is as follows:
There are n boxes, let us assume this is an array of n boxes.
For each index i of this array, three values are given -
1.) Weight(i)
2.) Left(i)
3.) Right(i)
Left[i] means: if Weight[i] is chosen, we are not allowed to choose the Left[i] elements immediately to the left of this ith element.
Similarly, Right[i] means: if arr[i] is chosen, we are not allowed to choose the Right[i] elements immediately to the right of it.
Example :
Weight[2] = 5
Left[2] = 1
Right[2] = 3
Then, if I pick element at position 2, I get weight of 5 units. But, I cannot pick elements at position {1} (due to left constraint). And cannot pick elements at position {3,4,5} (due to right constraint).
Objective - We need to calculate the maximum sum of the weights we can pick.
Sample Test Case:
Input:
5
2 0 3
4 0 0
3 2 0
7 2 1
9 2 0
Output:
13
Note - First column is weights, Second column is left constraints, Third column is right constraints
I used a Dynamic Programming approach (similar to Longest Increasing Subsequence) to reach an O(n^2) solution, but I am not able to think of an O(n log n) solution. (n can be up to 10^5.)
I also tried to use a priority queue, in which elements with a lower value of (right[i] + i) are given higher priority (and the element with the lower value of i gets higher priority when the primary key is equal). But it is also giving a timeout error.
Is there any other approach for this, or any optimization of the priority queue method? I can post both of my codes if needed.
Thanks.
One approach is to use a binary indexed tree to create a data structure that makes it easy to do two operations in O(log n) time each:
Insert number into an array
Find maximum in a given range
We will use this data structure to hold the maximum weight that can be achieved by selecting box i along with an optimal selection of boxes to the left.
The key is that we will only insert values into this data structure when we reach a point where the right constraint has been met.
To find the best value for box i, we need to find the maximum value in the data structure over all locations to the left of the excluded range, i.e. locations 0..(i - left[i] - 1), which can be done in O(log n).
The final algorithm is to loop over i=0..n-1 and for each i:
Compute the result for box i by finding the maximum in the range 0..(i - left[i] - 1)
Schedule the result to be added when we reach location i+right[i]
Add any previously scheduled results into our data structure
The final result is the maximum value in the whole data structure.
Overall, the complexity is O(n log n), because each value of i results in one lookup and one update operation.
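Here is a sketch of that idea in Java: a Fenwick tree (binary indexed tree) over box positions supporting point updates and prefix-maximum queries, with each box's result inserted only once its right constraint can no longer be violated. The class, method names and exact index handling are mine, not from the original answer.

import java.util.ArrayList;
import java.util.List;

public class MaxWeightBoxes {
    static int n;
    static long[] tree;   // 1-based Fenwick tree storing prefix maxima

    // Record that choosing the box at position pos (0-based) can yield `value`.
    static void update(int pos, long value) {
        for (int i = pos + 1; i <= n; i += i & (-i))
            tree[i] = Math.max(tree[i], value);
    }

    // Maximum recorded value over positions 0..pos (0-based), or 0 if none.
    static long prefixMax(int pos) {
        long best = 0;
        for (int i = pos + 1; i > 0; i -= i & (-i))
            best = Math.max(best, tree[i]);
        return best;
    }

    static long solve(int[] weight, int[] left, int[] right) {
        n = weight.length;
        tree = new long[n + 1];
        // scheduled.get(k) holds {position, value} pairs to insert when we reach box k
        List<List<long[]>> scheduled = new ArrayList<>();
        for (int i = 0; i < n; i++) scheduled.add(new ArrayList<>());

        long answer = 0;
        for (int i = 0; i < n; i++) {
            // Insert results whose right constraint ends before box i.
            for (long[] e : scheduled.get(i)) update((int) e[0], e[1]);

            // Boxes 0 .. i - left[i] - 1 are compatible with choosing box i.
            int j = i - left[i] - 1;
            long best = weight[i] + (j >= 0 ? prefixMax(j) : 0);
            answer = Math.max(answer, best);

            // best becomes usable only by boxes beyond i + right[i].
            int available = i + right[i] + 1;
            if (available < n) scheduled.get(available).add(new long[]{i, best});
        }
        return answer;
    }

    public static void main(String[] args) {
        int[] w = {2, 4, 3, 7, 9};
        int[] l = {0, 0, 2, 2, 2};
        int[] r = {3, 0, 0, 1, 0};
        System.out.println(solve(w, l, r));   // 13, matching the sample test case
    }
}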

Find all possible distances from two arrays

Given two sorted arrays A and B, each of length N. Each element is a natural number less than M. Determine all possible distances over all combinations of an element of A with an element of B. In this case, if A[i] - B[j] < 0, then the distance is M + (A[i] - B[j]).
Example :
A = {0,2,3}
B = {1,2}
M = 5
Distances = {0,1,2,3,4}
Note: I know an O(N^2) solution, but I need a solution faster than O(N^2) and O(N x M).
Edit: Arrays A, B, and Distances contain distinct elements.
You can get an O(M log M) complexity solution in the following way.
Prepare an array Ax of length M with Ax[i] = 1 if i belongs to A (and 0 otherwise)
Prepare an array Bx of length M with Bx[M-1-i] = 1 if i belongs to B (and 0 otherwise)
Use the Fast Fourier Transform to convolve these 2 sequences together
Inspect the output array, non-zero values correspond to possible distances
Note that the FFT is normally done with floating point numbers, so in step 4 you probably want to test if the output is greater than 0.5 to avoid potential rounding noise issues.
It is possible to do this with an optimized N*N approach.
Convert A into a 0/1 array with a 1 at every position (in the range [0..M]) that is present in A.
Then pack this array into 64-bit bitmasks, which shrinks it by a factor of 64 and lets you insert results in blocks of 64 positions at a time.
The complexity is still N*N, but the running time drops greatly. With the limits mentioned by the author (50000 for the sizes of A and B and for M), the expected operation count is about N*N/64 ~= 4*10^7, which should pass in about 1 second.
You can use bitvectors to accomplish this. Bitvector operations on large bitvectors are linear in the size of the bitvector, but they are fast, easy to implement, and may work well given your 50k size limit.
Initialize two bitvectors of length M. Call these vectA and vectAnswer. Set the bits of vectA that correspond to the elements in A. Leave vectAnswer with all zeroes.
Define a method to rotate a bitvector by k elements (rotate down). I'll call this rotate(vect,k).
Then, for every element b of B, vectAnswer = vectAnswer | rotate(vectA,b).
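A sketch of the bitvector idea in Java using java.util.BitSet (method and variable names are mine). Instead of writing a rotate(vect, k), it stores A's members twice, at positions a and a + M, so that "rotate down by b" is just taking the window [b, b + M); BitSet.get(from, to) and or() both work a machine word at a time.

import java.util.BitSet;

// Returns a bit set in which bit d is set iff d = (a - b) mod M for some a in A, b in B.
static BitSet allDistances(int[] A, int[] B, int M) {
    BitSet doubledA = new BitSet(2 * M);
    for (int a : A) {
        doubledA.set(a);
        doubledA.set(a + M);       // second copy, so a window of length M wraps around
    }
    BitSet answer = new BitSet(M);
    for (int b : B) {
        answer.or(doubledA.get(b, b + M));   // all distances (a - b) mod M for this b
    }
    return answer;
}

For the example A = {0,2,3}, B = {1,2}, M = 5 this marks exactly {0,1,2,3,4}.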

Efficient way of calculating average difference of array elements from array average value

Is there a way to calculate the average distance of array elements from the array's average value, by only "visiting" each array element once? (I am looking for an algorithm.)
Example:
Array : [ 1 , 5 , 4 , 9 , 6 ]
Average : ( 1 + 5 + 4 + 9 + 6 ) / 5 = 5
Distance Array : [|1-5|, |5-5|, |4-5|, |9-5|, |6-5|] = [4 , 0 , 1 , 4 , 1 ]
Average Distance : ( 4 + 0 + 1 + 4 + 1 ) / 5 = 2
The simple algorithm needs 2 passes.
1st pass) Reads and accumulates values, then divides the result by array length to calculate average value of array elements.
2nd pass) Reads values, accumulates each one's distance from the previously calculated average value, and then divides the result by array length to find the average distance of the elements from the average value of the array.
The two passes are identical. It is the classic algorithm of calculating the average of a set of values. The first one takes as input the elements of the array, the second one the distances of each element from the array's average value.
Calculating the average can be modified so as not to accumulate the values, but to calculate the average "on the fly" as we sequentially read the elements from the array.
The formula is:
Compute Running Average of Array's elements
-------------------------------------------
RA[i] = A[i]                             { for i == 1 }
RA[i] = RA[i-1] - RA[i-1]/i + A[i]/i     { for i > 1 }
Where A[x] is the array's element at position x and RA[x] is the average of the array's elements between positions 1 and x (the running average).
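(For reference, a minimal Java version of this update; the method name is mine:)

// Running average computed "on the fly": RA[i] = RA[i-1] - RA[i-1]/i + A[i]/i, 1-based i.
static double runningAverage(double[] a) {
    double ra = 0.0;
    for (int i = 1; i <= a.length; i++) {
        ra = ra - ra / i + a[i - 1] / i;
    }
    return ra;
}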
My question is:
Is there a similar algorithm, to calculate "on the fly" (as we read the array's elements), the average distance of the elements from the array's mean value?
The problem is that, as we read the array's elements, the final average value of the array is not known; only the running average is known, so calculating differences from the running average will not yield the correct result. I suppose that if such an algorithm exists, it should somehow be able to compensate, on each new element read, for the error accumulated so far.
I don't think you can do better than O(n log n).
Suppose the array were sorted. Then we could divide it into the elements less than the average and the elements greater than the average. (If some elements are equal to the average, that doesn't matter.) Suppose the first k elements are less than the average. Then the average distance is
D = ((xave - x1) + (xave - x2) + ... + (xave - xk) + (x(k+1) - xave) + (x(k+2) - xave) + ... + (xn - xave)) / n
  = ([sum of the elements above the average] - [sum of the elements below the average] + (2k - n) * xave) / n
You could calculate this in one pass by working in from both ends, adjusting the limits on the (as-yet-unknown) average as you go. This would be O(n), and the sorting is O(n log n) (and they could perhaps be done in the same operation), so the whole thing is O(n log n).
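A sketch of that formula in Java, with the sort done explicitly first (names are mine):

import java.util.Arrays;

// Average absolute distance from the mean:
// D = (sumAbove - sumBelow + (2k - n) * avg) / n, where k elements are below the average.
static double averageDistanceFromMean(double[] values) {
    double[] x = values.clone();
    Arrays.sort(x);                      // O(n log n)
    int n = x.length;
    double total = 0;
    for (double v : x) total += v;
    double avg = total / n;

    int k = 0;                           // count of elements strictly below the average
    double sumBelow = 0;
    while (k < n && x[k] < avg) {
        sumBelow += x[k];
        k++;
    }
    double sumAbove = total - sumBelow;  // elements equal to the average contribute 0 either way
    return (sumAbove - sumBelow + (2.0 * k - n) * avg) / n;
}

For the example [1, 5, 4, 9, 6] this returns 2.0, matching the question.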
The only problem with a two pass approach is that you need to reread or store the entire sequence for the second pass. The obvious improvement would be to maintain a data structure so that you could adjust the sum of absolute differences when the average value changed.
Suppose that you change the average value to a very large value, by observing a huge number. Now compare the change made by this to that caused by observing a not quite so huge value. You will be able to work out the difference between the two sums of absolute differences, because both average values are above all the other numbers, so all of the absolute values decrease by the difference between the two huge averages. This predictable change carries on until the average meets the highest value observed in the standard numbers, and this change allows you to find out what the highest number observed was.
By running experiments like this you can recover the set of numbers observed before the numbers you shove in to run the experiments. Therefore any clever data structure you use to keep track of sums of absolute differences is capable of storing the set of numbers observed, which (except for order, and cases where multiple copies of the same number are observed) is pretty much what you do by storing all the numbers seen for a second pass. So I don't think there is a trick for the case of sums of absolute differences as there is for squares of differences, where most of the information you care about is described by just the pair of numbers (sum, sum of squares).
If the L2 norm (the square root of the average squared distance) is OK, then it's:
sqrt(sum(x^2)/n - (sum(x)/n)^2)
That is, the (square root of the) average of x^2 minus the square of the average of x.
The quantity under the root is called the variance, and its square root is the standard deviation, a typical "measure of spread".
Note that this is more sensitive to outliers than the measure you originally asked for.
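This one really can be done in a single pass, since only the running sums of x and x^2 are needed; a sketch (names mine):

// Population standard deviation: sqrt(mean of squares - square of mean), one pass.
static double standardDeviation(double[] values) {
    double sum = 0, sumSq = 0;
    for (double v : values) {
        sum += v;
        sumSq += v * v;
    }
    int n = values.length;
    double mean = sum / n;
    return Math.sqrt(sumSq / n - mean * mean);
}

For the example [1, 5, 4, 9, 6] this gives about 2.61 (versus the average absolute distance of 2).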
Your followup described your context as HLSL reading from a texture. If your filter footprint is a power of two and is aligned with the same power-of-two boundaries in the original image, you can use MIP maps to find the average value of the filter region.
For example, for an 8x8 filter, precompute a MIP map three levels down the MIP chain, whose elements will be the averages of each 8x8 region. Then a single texture read from that MIP level texture will give you the average for the 8x8 region. Unfortunately this doesn't work for sliding the filter around to arbitrary positions (not multiples of 8 in this example).
You could make use of intermediate MIP levels to decrease the number of texture reads by utilizing the MIP averages of 4x4 or 2x2 areas whenever possible, but that would complicate the algorithm quite a bit.
