I am writing some code, and I want to know if I am correctly computing percentiles in a sorted array. Currently, if I want to compute, say, the 90th percentile, I do this: ARR[(9 * (N + 1))/10]. Or, let's say I'm computing the 50th percentile in a sorted array, I do this: ARR[(5 * (N + 1)) / 10]. More generally, to compute the xth percentile, I check index [x/100 * (N + 1)], where N is the size of the array.
These seem to be working, but I am wondering whether there is some edge case I'm missing. For instance, say you only have 5 elements. What should the 90th percentile be then? Should it just be the largest value?
Thanks in advance
For instance, say you only have 5 elements. What should the 90th percentile be then? Should it just be the largest value?
Yes. If you go by a definition like (this one is just copied from Wikipedia)
the P-th percentile of a list of N ordered values (sorted from least to greatest) is the smallest value in the list such that no more than P percent of the data is strictly less than the value and at least P percent of the data is less than or equal to that value
the 5th element can be the 90th percentile:
no more than P percent of the data is strictly less than the value: 80% of the data is strictly less than the largest element, which is no more than 90%
at least P percent of the data is less than or equal to that value: 100% of the data is less than or equal to the 5th element, which is at least 90%
And the 5th element is the smallest one which can do that (even if the 4th and 5th elements are equal, the 5th element is still the smallest one, because the percentile is about the value, not the position).
For fine tuning a formula, border cases are more interesting - like the 79-80-81st percentile of a 5-element list
element (1-based):   1     2     3     4     5
strictly less:       0%   20%   40%   60%   80%
less or equal:      20%   40%   60%   80%  100%
79th percentile: the 4th element is expected (60% < 79%, 79% <= 80%)
80th percentile: the 4th element is expected (60% < 80%, 80% <= 80%)
81st percentile: the 5th element is expected (80% < 81%, 81% <= 100%)
That feels like rounding something (fractional indices) upwards: 80 is a border, and the mapping goes 79 -> 4th element, 80 -> 4th, but 81 -> 5th. The rounding function is usually called something like ceil() or Math.ceil() (the question specifies no programming language at the moment).
 P    N*P/100    ceil(N*P/100)    (with N = 5)
79     3.95           4
80     4.00           4
81     4.05           5
(Using (N+1) instead of N would produce 4.74, 4.80 and 4.86, which all round up to 5, so it is safe to say the +1 is not needed.)
And thus ceil(N*P/100) really seems to be the one (of course it is on Wikipedia too, just 2-3 lines below the definition)
Note that programming languages may add various quirks:
arrays/lists are often indexed from 0
the result of ceil() may need to be converted to integer
and a sneaky one: if N and P are integer numbers, you may need to ensure that the division is not an integer-division (automatically throwing away the fraction part, so rounding the result downwards).
A Java line would be something like
int index = (int) Math.ceil(N * P / 100.0) - 1;
If you want 0th percentile, it can be handled separately, or hacked into the same line with max()
public static int percentile(int[] array, float P) {
    return array[Math.max(0,
            Math.min(array.length, (int) Math.ceil(array.length * P / 100)) - 1)];
}
(This one also uses min() and will produce some result for any finite P, implicitly truncating it into the 0<=P<=100 range)
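To tie this back to the 5-element question at the top: a quick, hypothetical check of the function above on a sorted array of five values (the array contents here are made up purely for illustration):

int[] sorted = {1, 3, 5, 7, 9};   // N = 5, already sorted
percentile(sorted, 90);   // ceil(5*90/100) = 5 -> 5th element -> 9 (the largest value)
percentile(sorted, 81);   // ceil(4.05)     = 5 -> 5th element -> 9
percentile(sorted, 80);   // ceil(4.00)     = 4 -> 4th element -> 7
percentile(sorted, 50);   // ceil(2.50)     = 3 -> 3rd element -> 5 (the median)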
I have been attempting to solve the following problem. I have a sequence of positive integer numbers which can be very long (several millions of elements). This sequence can contain "jumps" in the element values. A jump means that two consecutive elements differ from each other by more than 1.
Example 01:
1 2 3 4 5 6 7 0
In the above mentioned example the jump occurs between 7 and 0.
I have been looking for an efficient algorithm (in terms of time) for finding the position where this jump occurs. The issue is complicated by the fact that there can be a situation where two jumps are present: one of them is the jump I am looking for, and the other one is a wrap-around which I am not looking for.
Example 02:
9 1 2 3 4 6 7 8
Here the first jump between 9 and 1 is a wrap-around. The second jump between
4 and 6 is the jump which I am looking for.
My idea is to somehow modify the binary search algorithm, but I am not sure whether that is possible due to the presence of the wrap-around. It is worth saying that at most two jumps can occur, and between these jumps the elements are sorted. Does anybody have any idea? Thanks in advance for any suggestions.
You cannot find an efficient solution (efficient meaning not looking at all numbers, i.e. faster than O(n)) since you cannot conclude anything about your numbers without looking at all of them. For example, if you only look at every second number (still O(n), but with a better constant factor) you would miss double jumps like these: 1 5 3. You can and must look at every single number and compare it to its neighbours. You could split the workload and use a multicore approach, but that's about it.
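For reference, the straightforward single pass described above might look like this in Java (a sketch; the method name is mine, and it simply reports every position where two consecutive elements differ by more than 1, leaving it to the caller to decide which of the at most two reported positions is the wrap-around):

static List<Integer> findJumps(int[] numbers) {
    List<Integer> jumps = new ArrayList<>();   // java.util.List / java.util.ArrayList
    for (int i = 0; i + 1 < numbers.length; i++) {
        if (Math.abs(numbers[i + 1] - numbers[i]) > 1) {
            jumps.add(i);                      // jump between positions i and i + 1
        }
    }
    return jumps;
}

For the two examples above it returns [6] and [0, 4] respectively.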
Update
If you have the special case that there is only 1 jump in your list and the rest is sorted (e.g. 1 2 3 7 8 9), you can find this jump rather efficiently. You cannot use vanilla binary search since the list might not be fully sorted and you don't know what number you are searching for, but you can use a variation of exponential search, which bears some resemblance.
We need the following assumptions for this algorithm to work:
There is only 1 jump (I ignore the "wrap around jump" since it is not technically between any following elements)
The list is otherwise sorted and it is strictly monotonically increasing
With these assumptions we are now basically searching for an interruption in our monotonicity. That means we are searching for the case where two elements a and b are n positions apart but do not fulfil b = a + n (which must hold if there is no jump between them). Now we only need to find elements which do not fulfil this, in a nonlinear manner, hence the exponential approach. This pseudocode could be such an algorithm:
let numbers be an array of length n fulfilling our assumptions

start = 0
stepsize = 1
while (start < n - 1)
    while (start + stepsize > n - 1)
        stepsize -= 1
    stop = start + stepsize
    while (numbers[stop] != numbers[start] + stepsize)
        // the jump must be between start and stop
        if (stepsize == 1)
            // congratulations, the jump is between start and start + 1
            return start
        else
            stepsize /= 2
            stop = start + stepsize
    start += stepsize
    stepsize *= 2
no jump found
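Translated into Java under the same assumptions (a single jump, everything else increasing by exactly 1), the sketch above could look roughly like this; the method name and the -1 "no jump" return value are my own choices for illustration:

static int findJump(int[] numbers) {
    // returns the index i such that numbers[i + 1] - numbers[i] > 1, or -1 if there is no jump
    int n = numbers.length;
    int start = 0;
    int stepsize = 1;
    while (start < n - 1) {
        while (start + stepsize > n - 1) {
            stepsize -= 1;                 // shrink the window so it stays inside the array
        }
        int stop = start + stepsize;
        while (numbers[stop] != numbers[start] + stepsize) {
            if (stepsize == 1) {
                return start;              // the jump is between start and start + 1
            }
            stepsize /= 2;                 // the jump is inside the window: halve the window
            stop = start + stepsize;
        }
        start += stepsize;                 // the window was jump-free: move past it
        stepsize *= 2;                     // and let it grow again
    }
    return -1;                             // no jump found
}

For example, findJump(new int[]{1, 2, 3, 7, 8, 9}) returns 2, the index of the 3 that is followed by 7.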
There is an array where all but one of the cells are 0, and we want to find the index of that single non-zero cell. The problem is, every time that you check for a cell in this array, that non-zero element will do one of the following:
move forward by 1
move backward by 1
stay where it is.
For example, if that element is currently at position 10, and I check what is in arr[5], then the element may be at position 9, 10 or 11 after I checked arr[5].
We only need to find the position where the element is currently at, not where it started at (which is impossible).
The hard part is, if we write a for loop, there really is no way to know if the element is currently in front of you, or behind you.
Some more context if it helps:
The interviewer did give a hint which is maybe I should move my pointer back after checking x-number of cells. The problem is, when should I move back, and by how many slots?
While "thinking out loud", I started saying a bunch of common approaches hoping that something would hit. When I said recursion, the interviewer did say "recursion is a good start". I don't know recursion really is the right approach, because I don't see how I can do recursion and #1 at the same time.
The interviewer said this problem can't be solved in O(n^2). So we are looking at at least O(n^3), or maybe even exponential.
Tl;dr: Your best bet is to keep checking each even index in the array in turn, wrapping around as many times as necessary until you find your target. On average you will stumble upon your target in the middle of your second pass.
First off, as many have already said, it is indeed impossible to ensure you will find your target element in any given amount of time. If the element knows where your next sample will be, it can always place itself somewhere else just in time. The best you can do is to sample the array in a way that minimizes the expected number of accesses - and because after each sample you learn nothing except if you were successful or not and a success means you stop sampling, an optimal strategy can be described simply as a sequence of indexes that should be checked, dependent only on the size of the array you're looking through. We can test each strategy in turn via automated means to see how well they perform. The results will depend on the specifics of the problem, so let's make some assumptions:
The question doesn't specify the starting position of our target. Let us assume that the starting position is chosen uniformly from across the entire array.
The question doesn't specify the probability that our target moves. For simplicity, let's say it's independent of parameters such as the current position in the array, the time passed and the history of samples. Using the probability 1/3 for each option gives us the least information, so let's use that.
Let us test our algorithms on an array of 101 elements. Also, let us test each algorithm one million times, just to be reasonably sure about its average case behavior.
The algorithms I've tested are:
Random sampling: after each attempt we forget where we were looking and choose an entirely new index at random. Each sample has an independent 1/n chance of succeeding, so we expect to take n samples on average. This is our control.
Sweep: try each position in sequence until our target is found. If our target wasn't moving, this would take n/2 samples on average. Our target is moving, however, so we may miss it on our first sweep.
Slow sweep: the same, except we test each position several times before moving on. Proposed by Patrick Trentin with a slowdown factor of 30x, tested with a slowdown factor of 2x.
Fast sweep: the opposite of slow sweep. After the first sample we skip (k-1) cells before testing the next one. The first pass starts at ary[0], the next at ary[1] and so on. Tested with each speed up factor (k) from 2 to 5.
Left-right sweep: First we check each index in turn from left to right, then each index from right to left. This algorithm would be guaranteed to find our target if it was always moving (which it isn't).
Smart greedy: Proposed by Aziuth. The idea behind this algorithm is that we track each cell's probability of holding our target, then always sample the cell with the highest probability. On one hand, this algorithm is relatively complex; on the other hand, it sounds like it should give us the optimal results.
Results:
The results are shown as [average] ± [standard deviation].
Random sampling: 100.889145 ± 100.318212
At this point I have realised a fencepost error in my code. Good thing we have a control sample. This also establishes that we have in the ballpark of two or three digits of useful precision (sqrt #samples), which is in line with other tests of this type.
Sweep: 100.327030 ± 91.210692
The chance of our target squeezing through the net well counteracts the effect of the target taking n/2 time on average to reach the net. The algorithm doesn't really fare any better than a random sample on average, but it's more consistent in its performance and it isn't hard to implement either.
slow sweep (x0.5): 128.272588 ± 99.003681
While the slow movement of our net means our target will probably get caught in the net during the first sweep and won't need a second sweep, it also means the first sweep takes twice as long. All in all, relying on the target moving onto us seems a little inefficient.
fast sweep x2: 75.981733 ± 72.620600
fast sweep x3: 84.576265 ± 83.117648
fast sweep x4: 88.811068 ± 87.676049
fast sweep x5: 91.264716 ± 90.337139
That's... a little surprising at first. While skipping every other step means we complete each lap in twice as many turns, each lap also has a reduced chance of actually encountering the target. A nicer view is to compare Sweep and FastSweep in broom-space: rotate each sample so that the index being sampled is always at 0 and the target drifts towards the left a bit faster. In Sweep, the target moves at 0, 1 or 2 speed each step. A quick parallel with the Fibonacci base tells us that the target should hit the broom/net around 62% of the time. If it misses, it takes another 100 turns to come back. In FastSweep, the target moves at 1, 2 or 3 speed each step meaning it misses more often, but it also takes half as much time to retry. Since the retry time drops more than the hit rate, it is advantageous to use FastSweep over Sweep.
Left-right sweep: 100.572156 ± 91.503060
Mostly acts like an ordinary sweep, and its score and standard deviation reflect that. Not too surprising a result.
Aziuth's smart greedy: 87.982552 ± 85.649941
At this point I have to admit a fault in my code: this algorithm is heavily dependent on its initial behavior (which is unspecified by Aziuth and was chosen to be randomised in my tests). Performance concerns, however, mean that this algorithm always chooses the same randomised order each time. The results are then characteristic of that randomisation rather than of the algorithm as a whole.
Always picking the most likely spot should find our target as fast as possible, right? Unfortunately, this complex algorithm barely competes with Sweep 3x. Why? I realise this is just speculation, but let us peek at the sequence Smart Greedy actually generates: During the first pass, each cell has equal probability of containing the target, so the algorithm has to choose. If it chooses randomly, it could pick up in the ballpark of 20% of cells before the dips in probability reach all of them. Afterwards the landscape is mostly smooth where the array hasn't been sampled recently, so the algorithm eventually stops sweeping and starts jumping around randomly. The real problem is that the algorithm is too greedy and doesn't really care about herding the target so it could pick at the target more easily.
Nevertheless, this complex algorithm does fare better than both simple Sweep and a random sampler. It still can't, however, compete with the simplicity and surprising efficiency of FastSweep. Repeated tests have shown that the initial randomisation could swing the efficiency anywhere between 80% run time (a 20% speedup) and 90% run time (a 10% speedup).
Finally, here's the code that was used to generate the results:
class WalkSim
  attr_reader :limit, :current, :time, :p_stay

  def initialize limit, p_stay
    @p_stay = p_stay
    @limit = limit
    @current = rand(limit + 1)
    @time = 0
  end

  # check position n; afterwards the target may move one step left/right or stay
  def poke n
    r = n == @current
    @current += (rand(2) == 1 ? 1 : -1) if rand > @p_stay
    @current = [0, @current, @limit].sort[1]   # clamp to the array bounds
    @time += 1
    r
  end

  def WalkSim.bench limit, p_stay, runs
    histogram = Hash.new{0}
    runs.times do
      sim = WalkSim.new limit, p_stay
      gen = yield
      nil until sim.poke gen.next
      histogram[sim.time] += 1
    end
    histogram.to_a.sort
  end
end
class Array; def sum; reduce 0, :+; end; end
def stats histogram
  count = histogram.map{|k,v| v}.sum.to_f
  avg = histogram.map{|k,v| k*v}.sum / count
  variance = histogram.map{|k,v| (k-avg)**2*v}.sum / (count - 1)
  {avg: avg, stddev: variance ** 0.5}
end
RUNS = 1_000_000
PSTAY = 1.0/3
LIMIT = 100
puts "random sampling"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
Enumerator.new {|y|loop{y.yield rand (LIMIT + 1)}}
}
puts "sweep"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
Enumerator.new {|y|loop{0.upto(LIMIT){|i|y.yield i}}}
}
puts "x0.5 speed sweep"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
Enumerator.new {|y|loop{0.upto(LIMIT){|i|2.times{y.yield i}}}}
}
(2..5).each do |speed|
puts "x#{speed} speed sweep"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
Enumerator.new {|y|loop{speed.times{|off|off.step(LIMIT, speed){|i|y.yield i}}}}
}
end
puts "sweep LR"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
Enumerator.new {|y|loop{
0.upto(LIMIT){|i|y.yield i}
LIMIT.downto(0){|i|y.yield i}
}}
}
$sg_gen = Enumerator.new do |y|
probs = Array.new(LIMIT + 1){1.0 / (LIMIT + 1)}
loop do
ix = probs.each_with_index.map{|v,i|[v,rand,i]}.max.last
probs[ix] = 0
probs = [probs[0] * (1 + PSTAY)/2 + probs[1] * (1 - PSTAY)/2,
*probs.each_cons(3).map{|a, b, c| (a + c) / 2 * (1 - PSTAY) + b * PSTAY},
probs[-1] * (1 + PSTAY)/2 + probs[-2] * (1 - PSTAY)/2]
y.yield ix
end
end
$sg_cache = []
def sg_enum; Enumerator.new{|y| $sg_cache.each{|n| y.yield n}; $sg_gen.each{|n| $sg_cache.push n; y.yield n}}; end
puts "smart greedy"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {sg_enum}
No, forget everything about loops.
Copy this array to another array and then check which cells are non-zero. For example, if your main array is mainArray[] you can use:
int temp[sizeOfMainArray];
int counter = 0;
while(counter < sizeOfMainArray)
{
    temp[counter] = mainArray[counter];
    counter++;
}
//then check what is non-zero in the copied array
counter = 0;
while(counter < sizeOfMainArray)
{
    if(temp[counter] != 0)
    {
        std::cout << "I Found It!!!";
    }
    counter++;
}//end of while
One approach perhaps :
i - Have four index variables f, f1, l, l1. f points at 0, f1 at 1, l points at n-1 (the end of the array) and l1 at n-2 (the second-to-last element).
ii - Check the elements at f1 and l1 - are any of them non-zero? If so, stop. If not, check the elements at f and l (to see if the element has jumped back by 1).
iii - If the elements at f and l are still zero, increment the indexes and repeat step ii. Stop when f1 > l1.
If it is only the equality check against an array index that makes the non-zero element jump, why not think of a way where we don't really require an equality check with an array index?
int check = 0;
for (int i = 0; i < arr.length; i++) {
    check |= arr[i];
    if (check != 0)
        break;
}
Orrr... maybe you can keep reading arr[mid]. The non-zero element will end up there. Some day. Reasoning: Patrick Trentin seems to have put it in his answer (somewhat; it's not really that, but you'll get the idea).
If you have some information about the array, maybe we can come up with a niftier approach.
Ignoring the trivial case where the 1 is in the first cell of the array, if you iterate through the array testing each element in turn you must eventually get to the position i where the 1 is in cell i+2. So when you read cell i+1, one of three things is going to happen.
The 1 stays where it is; you're going to find it next time you look.
The 1 moves away from you; you're back to the starting position, with the 1 at i+2, next time.
The 1 moves to the cell you've just checked; it dodged your scan.
Re-reading the i+1 cell will find the 1 in case 3, but just gives it another chance to move in cases 1 and 2, so a strategy based on re-reading won't work.
My option would therefore be to adopt a brute force approach: if I keep scanning the array then I'm going to hit case 1 at some point and find the elusive 1.
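A sketch of that brute-force idea, where checkCell() is a hypothetical stand-in for whatever probe the interviewer's array exposes (each call may cause the element to move, which is exactly why the loop simply keeps sweeping and why no bound on its running time can be given):

// hypothetical probe: checking a cell may make the hidden element move afterwards
interface MovingArray {
    int size();
    boolean checkCell(int index);   // true if the non-zero element is at `index` right now
}

static int find(MovingArray arr) {
    while (true) {                  // keep sweeping until a probe lands on the element (case 1)
        for (int i = 0; i < arr.size(); i++) {
            if (arr.checkCell(i)) {
                return i;           // where the element is now, not where it started
            }
        }
    }
}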
Assumptions:
The array is not a true array. This is obvious given the problem; we have got some class that behaves somewhat like an array.
The array is mostly hidden. The only public operations are [] and size().
The array is obfuscated. We cannot get any information by retrieving its address and then analyzing the memory at that position. Even if we iterate through the whole memory of our system, we can't do tricks, due to some advanced cryptographic means.
Every field of the array has the same probability to be the first field that hosts the one.
We know the probabilities of how the one changes its position when triggered.
Probability controlled algorithm:
Introduce another array of same size, the probability array (over double).
This array is initialized with all fields to be 1/size.
Every time we use [] on the base array, the probability array changes in this way:
The accessed position is set to zero (did not contain the one)
An entry becomes the sum of it's neighbors times the probability of that neighbor to jump to the entries position. (prob_array_next_it[i] = prob_array_last_it[i-1]*prob_jump_to_right + prob_array_last_it[i+1]*prob_jump_to_left + prob_array_last_it[i]*prob_dont_jump, different for i=0 and i=size-1 of course)
The probability array is normalized (setting one entry to zero sets the sum of the probabilities to below one)
The algorithm accesses the field with the highest probability (choosing, say, the first among the fields that share it)
It might be possible to optimize this further by controlling the flow of probabilities, but that would need to be based on the wandering event and might require some research.
No algorithm that tries to solve this problem is guaranteed to terminate after some time. For a complexity, we would analyze the average case.
Example:
Jump probabilities are 1/3, nothing happens if trying to jump out of bounds
Initialize:
Hidden array: 0 0 1 0 0 0 0 0
Probability array: 1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
First iteration: try [0] -> failure
Hidden array: 0 0 1 0 0 0 0 0 (no jump)
Probability array step 1: 0 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Probability array step 2: 1/24 2/24 1/8 1/8 1/8 1/8 1/8 1/8
Probability array step 3 (step 2 normalized, whole array * 8/7): 1/21 2/21 1/7 1/7 1/7 1/7 1/7 1/7
Second iteration: try [2], since 1/7 is the maximum and this is the first field with 1/7 -> success.
(The example should be clear by now. Of course this might not work so fast on another example; I had no interest in doing this for a lot of iterations, since the probabilities get cumbersome to compute by hand - one would need to implement it. Note that if the one had jumped to the left, we wouldn't have checked it so fast, even if it remained there for some time.)
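A minimal sketch of this probability bookkeeping in Java, using the same jump probabilities of 1/3 as the example (the probe interface is hypothetical, standing in for whatever [] access the real class offers):

interface ProbeTarget {
    boolean probe(int index);   // hypothetical: true if the one is at `index`; probing may make it move
}

static int find(ProbeTarget target, int size) {
    double[] probs = new double[size];
    java.util.Arrays.fill(probs, 1.0 / size);            // initially every field is equally likely
    while (true) {
        int best = 0;                                    // first field with the highest probability
        for (int i = 1; i < size; i++) {
            if (probs[i] > probs[best]) best = i;
        }
        if (target.probe(best)) return best;
        probs[best] = 0;                                 // it was not there
        double[] next = new double[size];                // propagate: 1/3 stays, 1/3 comes from each
        for (int i = 0; i < size; i++) {                 // neighbour; out-of-bounds jumps stay in place
            next[i] += probs[i] / 3.0;
            next[i] += (i > 0 ? probs[i - 1] : probs[i]) / 3.0;
            next[i] += (i < size - 1 ? probs[i + 1] : probs[i]) / 3.0;
        }
        double sum = 0;                                  // normalize so the probabilities sum to 1 again
        for (double p : next) sum += p;
        for (int i = 0; i < size; i++) probs[i] = next[i] / sum;
    }
}

On the example above this probes [0] first and [2] second, matching the hand computation.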
I am both new to this website and new to C. I need a program to find the average number of 'jumps' it takes from all points.
The idea is this: find the "jump" distance from 1 to 2, 1 to 3, 1 to 4 ... 1 to 9, or find 2 to 1, 2 to 3, 2 to 4, 2 to 5, etc.
Doing this on the first row is simple, just (2-1) or (3-1), and you get the correct number. But if I want to find the distance between 1 and 4, or 1 and 8, then I have absolutely no idea.
The dimensions of the matrix should potentially be changeable. But I just want help with a 3x3 matrix.
Could anyone show me how to find it?
Jump means a vertical or horizontal move from one point to another. From 1 to 2 = 1, from 1 to 9 = 4 (shortest path only).
The definition of "distance" on this kind of problems is always tricky.
Imagine that the points are marks on a field, and you can freely walk all over it. Then, you could take any path from one point to the other. The shortest route then would be a straight line; its length would be the length of the vector that joins the points, which happens to be the difference vector among two points' positions. This length can be computed with the help of Pythagora's theorem: dist = sqrt((x2-x1)^2 + (y2-y1)^2). This is known as the Euclidian distance between the points.
Now imagine that you are in a city, and each point is a building. You can't walk over a building, so the only options are to go either up/down or left/right. Then, the shortest distance is given by the sum of the components of the difference vector; which is the mathematical way of saying that "go down 2 blocks and then one block to the left" means walking 3 blocks' distance: dist = abs(x2-x1) + abs(y2-y1). This is known as the Manhattan distance between the points.
In your problem, however, it looks like the only possible move is to jump to an adjacent point, in a single step, diagonals allowed. Then the problem gets a bit trickier, because the path is very irregular. You need some Graph Theory here, which is very useful when modeling problems with linked elements, or "nodes". Each point would be a node, connected to its neighbors, and the problem would be to find the shortest path to another given point. If jumps had different weights (for instance, if jumping diagonally were harder), an easy way to solve this would be with Dijkstra's Algorithm; more details on implementation at Wikipedia.
If the cost is always the same, then the problem is reduced to counting the number of jumps in a Breadth-First Search of the destination point from the source.
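A small sketch of that BFS counting idea, in Java for illustration (the question mentions C, but the logic carries over directly); it assumes the 3x3 grid is numbered 1..9 in row-major order, which the question doesn't actually show, and allows only horizontal/vertical moves:

// shortest number of horizontal/vertical jumps between cell values `from` and `to`
// on a rows x cols grid numbered 1..rows*cols in row-major order
static int jumpDistance(int from, int to, int rows, int cols) {
    int[] dist = new int[rows * cols + 1];
    java.util.Arrays.fill(dist, -1);
    java.util.Queue<Integer> queue = new java.util.ArrayDeque<>();
    dist[from] = 0;
    queue.add(from);
    while (!queue.isEmpty()) {
        int cell = queue.remove();
        if (cell == to) return dist[cell];
        int r = (cell - 1) / cols, c = (cell - 1) % cols;
        int[][] moves = {{r - 1, c}, {r + 1, c}, {r, c - 1}, {r, c + 1}};
        for (int[] m : moves) {
            if (m[0] < 0 || m[0] >= rows || m[1] < 0 || m[1] >= cols) continue;
            int next = m[0] * cols + m[1] + 1;
            if (dist[next] == -1) {        // not visited yet
                dist[next] = dist[cell] + 1;
                queue.add(next);
            }
        }
    }
    return -1;   // unreachable; cannot happen on a connected grid
}

jumpDistance(1, 2, 3, 3) and jumpDistance(1, 9, 3, 3) give 1 and 4, matching the values in the question; with uniform costs this is the same as the Manhattan formula above.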
Let's define the 'jump' distance: "the number of hops required to get from Point A [Ax,Ay] to Point B [Bx,By]."
Now there can be two ways in which the hops are allowed:
Horizontally/Vertically
In this case, you can go up/down or left/right. As you have to travel the X axis and the Y axis independently, your answer is: jumpDistance = abs(Bx - Ax) + abs(By - Ay);
Horizontally/Vertically and also Diagonally
In this case, you can go up/down or left/right, and diagonally as well. How it differs from Case 1 is that now you have the ability to change your X axis and Y axis together at the cost of only one jump. Your answer now is: jumpDistance = Max(abs(Bx - Ax), abs(By - Ay));
What is the definition of "jump-distance"?
If you mean how many jumps a man needs to go from square M to N, and he can only jump vertically and horizontally, one possibility can be:
dist = abs(x2 - x1) + abs(y2 - y1);
For example jump-distance between 1 and 9 is: |3-1|+|3-1| = 4
There are two ways to calculate jump distance.
1) When only horizontal and vertical movements are allowed, all you need to do is form a rectangle between the two points and calculate the lengths of two adjacent sides. For example, if you want to move from 1 to 9, first move from 1 to 3 and then move from 3 to 9. (Convert it to code)
2) When movements in all eight directions are allowed, things get tricky. For example, if you want to move from 1 to 6, what you'll need to do is move from 1 to 5, and then from 5 to 6. The way of doing it in code is to find the maximum of the differences in the x and y coordinates. In this example, the difference in the x coordinate is 2 (3-1) and the difference in the y coordinate is 1 (2-1). So the maximum of these is 2, and that's the answer. (Convert to code)
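Both cases can be put into code along these lines (a sketch in Java; it assumes the same row-major 1..9 numbering used in the examples above, so a cell value is first converted into a row and a column):

// jump distance between cell values a and b on a grid with `cols` columns,
// numbered 1..rows*cols in row-major order
static int jumps(int a, int b, int cols, boolean diagonalsAllowed) {
    int rowDiff = Math.abs((a - 1) / cols - (b - 1) / cols);
    int colDiff = Math.abs((a - 1) % cols - (b - 1) % cols);
    return diagonalsAllowed ? Math.max(rowDiff, colDiff)   // case 2: take the larger difference
                            : rowDiff + colDiff;           // case 1: sum of the differences
}

For a 3x3 matrix, jumps(1, 9, 3, false) gives 4 and jumps(1, 6, 3, true) gives 2 (1 -> 5 -> 6), matching the examples above.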
A brief intro: I am creating a piece of medical software. I forget some of the combination/permutation theorems from college. Let's say I have five nerves: median, ulnar, radial, tibial, peroneal. I can choose one, two, three, four, or all five of them in any combination. What is the equation to find the maximum number of combinations I can make?
For example:
median
median + ulnar
median + ulnar + radial
etc etc
ulnar + median = median + ulnar, so those would be repetitive. Thank you for your help. I know this isn't directly programming related, but I thought you guys would be familiar with it.
The comment that says it is (2^n)-1 is correct. 2^n is the number of possible subsets you can form from a set of n objects (in this case you have 5 objects), and then in your case, you don't want to count the empty set, so you subtract out 1.
I'm sure you can do the math, but for the sake of completeness, for 5 nerves, there would be 2^5 - 1 = 32 - 1 = 31 possible combinations you could end up with.
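If it ever needs to be computed, or the combinations listed, in code, a small bit-mask sketch looks like this (the nerve names are just the ones from the question; each bit pattern from 1 to 2^n - 1 is one distinct, order-independent combination):

String[] nerves = {"median", "ulnar", "radial", "tibial", "peroneal"};
int n = nerves.length;
int total = (1 << n) - 1;                       // 2^n - 1 = 31 non-empty subsets
for (int mask = 1; mask <= total; mask++) {
    StringBuilder combo = new StringBuilder();
    for (int i = 0; i < n; i++) {
        if ((mask & (1 << i)) != 0) {           // nerve i is part of this combination
            if (combo.length() > 0) combo.append(" + ");
            combo.append(nerves[i]);
        }
    }
    System.out.println(combo);                  // e.g. "median", "median + ulnar", ...
}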