Picking a random item based on probabilities - c

There's a similar question, I know, but it confused me, so I thought it easier to ask in my way.
So I have an array of values, positive and negative. The higher they are, the more probability they have of being chosen.
I'm having trouble actually figuring out how to assign the probabilities and then randomly choose one. I'm guessing the array will need to be sorted first, but then I'm a bit lost after that.

"I have various different sizes of cups of coffee. The larger they are, the more I want to charge for them. I'm having trouble actually figuring out how to assign prices".
This isn't just a programming problem - you've specified that probability increases with value, but you haven't said how it increases with value. Normally, coffee shops don't charge in direct proportion to the amount of coffee. You can't assign probabilities in proportion to value, because some of your values are negative, but probabilities cannot be negative.
Sounds like you need to nail down the problem a bit more before you can write any code.
If you really don't care how probability relates to value, other than that they increase in order of value, then one easy way would be:
sort your array
assign a probability of 1 to the first element, 2 to the second, and so on.
now, your probabilities don't add up to 1, which is a problem. So divide each probability by the total of all the probabilities you have assigned: (1 + 2 + ... + n) = n(n+1)/2. This is called "normalization".
Given your list of probabilities, which add up to 1, the easiest way to repeatedly choose one is generally to calculate the cumulative probabilities, which I will demonstrate with an example:
value (sorted): -12 -3 127 1000000
assigned probability: 0.1 0.2 0.3 0.4
cumulative probability: 0.1 0.3 0.6 1.0
The cumulative probability is defined as the sum of all the probabilities up to that point.
Now, from your random number generator you need a random (floating-point) value between 0 and 1. If it lies between 0 and 0.1, you've picked -12. If it lies between 0.1 and 0.3, you've picked -3, and so on. To figure out which range it lies in, you could walk linearly through your array, or you could do a binary search.
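Putting those pieces together, here is a minimal C sketch of the rank-weighted scheme (the helper names such as pick_index are mine, not from the question; it uses a linear walk rather than a binary search):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* qsort comparison function: ascending order of values. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Pick an index into the sorted array, where sorted position i (0-based)
 * gets weight i+1, i.e. probability (i+1) / (n(n+1)/2). */
static size_t pick_index(size_t n) {
    double total = (double)n * (n + 1) / 2.0;
    double r = rand() / ((double)RAND_MAX + 1.0);   /* uniform in [0,1) */
    double cumulative = 0.0;
    for (size_t i = 0; i < n; i++) {
        cumulative += (i + 1) / total;              /* running cumulative probability */
        if (r < cumulative)
            return i;
    }
    return n - 1;                                   /* guard against round-off */
}

int main(void) {
    double values[] = { -12, 1000000, -3, 127 };
    size_t n = sizeof values / sizeof values[0];

    srand((unsigned)time(NULL));
    qsort(values, n, sizeof values[0], cmp_double);
    printf("picked %g\n", values[pick_index(n)]);
    return 0;
}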
You could skip the normalization step and the use of floating-point, if you wanted. Assign "cumulative probabilities" (1, 3, 6, 10 ...) , but make it understood that the actual probability is the stored integer value divided by n(n+1)/2. Then choose a random integer from 0 to n(n+1)/2 - 1. If it's less than 1, you've selected the first value, else if less than 3 the second, and so on. This may or may not make the code clearer, and your RNG may or may not do well choosing integer values from a large range.
Note that you could have assigned probabilities (0.001, 0.002, 0.003, 0.994) instead of (0.1, 0.2, 0.3, 0.4), and still satisfied your requirement that "the higher the value, the higher the probability".

One way could be
Make all values positive (add the absolute value of the minimum value to all values).
Normalize the values to sum to 1 (divide each value by the sum of the values).
To draw a value from the generated distribution, you can now:
Pick a random number in [0,1].
Start summing the probabilities until the sum is greater than or equal to the random number; choose that index as your random value.
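A minimal C sketch of those steps (the function name sample_shifted is mine; note that this scheme gives the smallest, most negative value a probability of exactly zero):
#include <stdlib.h>

static size_t sample_shifted(const double *v, size_t n) {
    double min = v[0], sum = 0.0, acc = 0.0, r;
    for (size_t i = 1; i < n; i++)
        if (v[i] < min) min = v[i];
    double shift = (min < 0.0) ? -min : 0.0;        /* add |min| only if negatives exist */

    for (size_t i = 0; i < n; i++)
        sum += v[i] + shift;                        /* total of the shifted weights */

    r = sum * (rand() / ((double)RAND_MAX + 1.0));  /* uniform in [0, sum) */
    for (size_t i = 0; i < n; i++) {
        acc += v[i] + shift;                        /* running sum of the weights */
        if (r < acc)
            return i;
    }
    return n - 1;                                   /* guard against round-off */
}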

Following up on Steve Jessop's suggestion, after you've chosen a random integer from 0 to n(n+1)/2 - 1, you can just get the triangular root: (-1 + sqrt((8*x)+1))/2
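In C that index can be computed directly (a sketch; the floor comes from the integer cast, and the small correction guards against floating-point error at exact triangular numbers):
#include <math.h>

/* Map a random integer x in [0, n(n+1)/2 - 1] straight to the chosen
 * 0-based index by flooring the triangular root. */
unsigned long index_from_triangular(unsigned long x) {
    unsigned long k = (unsigned long)((-1.0 + sqrt(8.0 * (double)x + 1.0)) / 2.0);
    if ((k + 1) * (k + 2) / 2 <= x)   /* round-off guard */
        k++;
    return k;
}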

Related

How to select value from different ranges with equal probability

Provided different ranges, select each value with equal probability.
Say a variable 'a' can take values from { [10,20], 40, [70,100] ... } (given). Each value allowed by these constraints should have the same probability of being selected. How do I get such a random value in C?
Giving each Range equal probabilistic chance:
1. Let N be the number of ranges you've defined in your problem set: Ranges { R0, R1, R2 ... RN-1 }, indexes starting at 0.
2. Generate a random number; RandValue mod N picks a range. In C the modulo operator is %, which gives you the integral remainder.
3. Is the picked range just a number? (like 40 in your example)
3.1 Yes: then your random value is that number.
3.2 No, it's a range: find a random value within the selected range.
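A small C sketch of these steps (struct range and pick_per_range are names I made up for illustration; modulo bias is ignored here, as discussed further down):
#include <stdlib.h>

/* A closed range [lo, hi]; a single number is just a range with lo == hi. */
struct range { int lo, hi; };

/* Equal probability per *range*: pick a range first, then a value inside it. */
int pick_per_range(const struct range *r, int nranges) {
    const struct range *c = &r[rand() % nranges];        /* step 2: pick a range */
    if (c->lo == c->hi)                                   /* step 3.1: just a number */
        return c->lo;
    return c->lo + rand() % (c->hi - c->lo + 1);          /* step 3.2: value inside it */
}
For the example set it would be called as pick_per_range((struct range[]){ {10,20}, {40,40}, {70,100} }, 3).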
Giving each value in all ranges equal probabilistic chance:
Let N be the number of values across all ranges.
Map each value to an index, Values { V0, V1, V2 ... VN-1 }, Indexes start at 0.
Use hash-tables for quick lookups. Also, you can handle overlapping ranges.
Generate a random number, RandValue mod N to pick a value-index.
Look up in hash-table for mapped value against the index.
Also note that the hash table could become huge if the ranges are large. In that case you may have to merge overlapping/consecutive ranges (if any), maintain a list (an array of structs) of ranges sorted by value-index, and assign index ranges to them. Use binary search to find the range for a given random index; the range offsets (start/end values and indexes) then give the final value for that random index.
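A sketch of the per-value variant in C, reusing struct range from the sketch above. It uses cumulative counts with a linear walk instead of a hash table, which is enough to show the idea; for many ranges the loop would be replaced by a binary search over the cumulative counts, as suggested above (ranges are assumed sorted and non-overlapping):
/* Equal probability per *value*: pick a value-index in [0, total) and walk
 * the ranges to translate that index back into a concrete value. */
int pick_per_value(const struct range *r, int nranges) {
    long total = 0;
    for (int i = 0; i < nranges; i++)
        total += r[i].hi - r[i].lo + 1;          /* how many values each range holds */

    long idx = rand() % total;                    /* value-index (modulo bias ignored) */
    for (int i = 0; i < nranges; i++) {
        long count = r[i].hi - r[i].lo + 1;
        if (idx < count)
            return r[i].lo + (int)idx;            /* the index falls inside this range */
        idx -= count;
    }
    return r[nranges - 1].hi;                     /* not reached if total is consistent */
}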
PS: This is for trivial implementations of randomness in C projects. I believe all randomness is deterministic.
Edit: I agree that there is modulo bias, and that values beyond (RAND_MAX - RAND_MAX % N) should be rejected.
Simple solution:
do
    r = rand();
while (!is_in_range(r));
It's not at all efficient, and especially it's not bounded in running time. But it should work.
And sometimes simple and stupid solutions are good enough.
(Once you start doing things like r=rand()%limit;, then you start introducing skewed probabilities. Imagine doing r=rand()%((RAND_MAX/2)+1);. It's twice as likely to return anything below RAND_MAX/2 as RAND_MAX/2.
See this answer for more detail. )
To improve performance, one could do something like what @Jakob Stark hinted at:
for (limit = 1; limit < top_of_range; limit <<= 1)
    ; // Find the smallest power of two no smaller than top_of_range
do
    r = rand() % limit;
while (!is_in_range(r));
It's still not guaranteed to run in finite time, though...

Finding percentiles in a sorted array

I am writing some code, and I want to know if I am correctly computing percentiles in a sorted array. Currently, if I want to compute, say, the 90th percentile, I do this: ARR[(9 * (N + 1))/10]. Or, let's say I'm computing the 50th percentile in a sorted array, I do this: ARR[(5 * (N + 1)) / 10]. More generally, to compute the xth percentile, I check index [x/100 * (N + 1)], where N is the size of the array.
These seem to be working, but I am just thinking if maybe there is some sort of edge case I'm missing. For instance, say you only have 5 elements. What should the 90th percentile be then? Should it just be the largest value?
Thanks in advance
For instance, say you only have 5 elements. What should the 90th percentile be then? Should it just be the largest value?
Yes. If you go by a definition like (this one is just copied from Wikipedia)
the P-th percentile of a list of N ordered values (sorted from least to greatest) is the smallest value in the list such that no more than P percent of the data is strictly less than the value and at least P percent of the data is less than or equal to that value
the 5th element can be the 90th percentile:
no more than P percent of the data is strictly less than the value: 80% of the data is strictly less than the largest element, which is no more than 90%
at least P percent of the data is less than or equal to that value: 100% of the data is less than or equal to the 5th element, which is at least 90%
And the 5th element is the smallest one which can do that (even if the 4th and 5th elements are equal, the 5th element is still the smallest one, because the percentile is about the value, not the position).
For fine tuning a formula, border cases are more interesting - like the 79-80-81st percentile of a 5-element list
element index: 0 1 2 3 4
strictly less: 0% 20% 40% 60% 80%
less or equal: 20% 40% 60% 80% 100%
79th percentile: 4th is expected (60%<79%, 79%<=80%)
80th percentile: 4th is expected (60%<80%, 80%<=80%)
81st percentile: 5th is expected (80%<81%, 81%<=100%)
That feels like rounding something (fractional indices) upwards (knowing that 80 is a border and looking at the 0-based index mappings 79->3, 80->3, but 81->4). The function is usually called something like ceil() or Math.ceil() (the question specifies no programming language at the moment).
P 5*P/100 ceil(5*P/100) (5=N)
79 3.95 4
80 4 4
81 4.05 5
((N+1) would produce 4.74, 4.8, 4.86, so it is safe to say +1 is not needed)
And thus ceil(N*P/100) really seems to be the one (of course it is on Wikipedia too, just 2-3 lines below the definition)
Note that programming languages may add various quirks:
arrays/lists are often indexed from 0
the result of ceil() may need to be converted to integer
and a sneaky one: if N and P are integer numbers, you may need to ensure that the division is not an integer-division (automatically throwing away the fraction part, so rounding the result downwards).
A Java line would be something like
int index=(int)Math.ceil(N*P/100.0)-1;
If you want 0th percentile, it can be handled separately, or hacked into the same line with max()
public static int percentile(int array[], float P) {
    return array[Math.max(0,
        Math.min(array.length, (int)Math.ceil(array.length*P/100)) - 1)];
}
(This one also uses min() and will produce some result for any finite P, implicitly truncating it into the 0<=P<=100 range)
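The question names no language; here is how the same formula might look in C (a sketch, assuming a sorted array and 0 <= P):
#include <math.h>

double percentile(const double *sorted, int n, double P) {
    int idx = (int)ceil(n * P / 100.0) - 1;   /* note the floating-point division */
    if (idx < 0) idx = 0;                      /* handles P == 0 */
    if (idx > n - 1) idx = n - 1;              /* clamps P > 100 */
    return sorted[idx];
}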

Round off accurately from the last digit after the decimal

I am stuck (again) and looking for the smart human beings of planet Earth to help me out.
Background
I have an application which distributes the amounts to some users in a given percentage. Say I have $35000 and it will distribute the amounts to 3 users (A, B and C) in some ratio. So the amount distributed will be
A - 5691.05459265518
B - 14654.473815207
C - 14654.4715921378
which totals up to $35000
The Problem
I have to present the results on the front end with 2 decimal places instead of the full float. So I use the ROUND function of SQL Server with a precision of 2 to convert these to 2 decimal places. But the issue is that when I total the rounded values, the sum comes out to be $34999.99 instead of $35000.
My Findings
I searched a bit and found
If the expression that you are rounding ends with a 5, the Round() function will round the expression so that the last digit is an even number. Here are some examples:
Round(34.55, 1) - Result: 34.6 (rounds up)
Round(34.65, 1) - Result: 34.6 (rounds down)
So technically the answer is correct, but I am looking for a function or a way to round the value to exactly what it should have been. I found that it works if I start rounding off from the last digit after the decimal (if the digit is less than 5, leave the previous digit alone; otherwise increment the previous digit by 1) and keep backtracking until I am left with only 2 decimal places.
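For what it's worth, a rough C sketch of that backtracking idea (purely illustrative: binary doubles cannot represent most decimal fractions exactly, so real money code would do this on a DECIMAL or string representation; the function name and parameters are mine):
#include <math.h>

/* Round x to 'target' decimal places by repeatedly rounding away the last
 * remaining decimal, starting from 'start' decimal places. */
double round_backtrack(double x, int start, int target) {
    for (int d = start; d > target; d--) {
        double scale = pow(10.0, d - 1);
        x = round(x * scale) / scale;   /* drop one decimal place per step */
    }
    return x;
}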
Please advise.

Implementing Geometric Median

When I google for Geometric median, I get this link: Geometric median,
but I have no clue how to implement it in C. I am not very good at understanding the mathematical explanation. Let's say I have 11 pairs of co-ordinates; how will I calculate the geometric median for them?
I am trying to solve this problem: Grid City. I was given a hint that the geometric median will help me achieve it. I am not looking for a final solution. If someone can guide me onto the right path, that would help.
Thanks in advance.
Below is the list of co-ordinates (a test case). Expected result: 3 4
1 2
1 7
2 2
2 3
2 5
3 4
4 2
4 5
4 6
5 3
6 5
I don't think this is solvable without an iterative algorithm.
Here is a pseudocode solution similar to the hill-climbing version, except that it works to arbitrary accuracy, and in higher dimensions.
CurrentPoint = Mean(Points)
Do
    Points2 = Empty List
    For Each Point In Points Do
        Vector = CurrentPoint - Point
        Vector Length = Vector Length - 1.0
        Point2 = Point + Vector
        Add Point2 To Points2
    Loop
    PreviousPoint = CurrentPoint
    CurrentPoint = Mean(Points2)
Loop While (CurrentPoint - PreviousPoint) Length > 0.01
Notes:
The constant 0.01 does not guarantee the result to be within 0.01 of the true value. Use smaller values for better precision.
The constant 1.0 should be adjusted to (I'm guessing) about 1/5 the distance between the furthest points. Too small values will slow down the algorithm, but too large values will cause inaccuracies, probably leading to an infinite loop.
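A direct C translation of the pseudocode above might look like this (a sketch only; the 1.0 step and 0.01 tolerance mirror the pseudocode and should be tuned as the notes say):
#include <math.h>

struct pt { double x, y; };

struct pt geometric_median(const struct pt *p, int n) {
    struct pt cur = { 0.0, 0.0 }, prev;
    for (int i = 0; i < n; i++) { cur.x += p[i].x; cur.y += p[i].y; }
    cur.x /= n; cur.y /= n;                       /* start at the centroid */

    do {
        prev = cur;
        struct pt sum = { 0.0, 0.0 };
        for (int i = 0; i < n; i++) {
            double vx = prev.x - p[i].x, vy = prev.y - p[i].y;
            double len = hypot(vx, vy);
            double scale = (len > 0.0) ? (len - 1.0) / len : 0.0;  /* shorten by 1.0 */
            sum.x += p[i].x + vx * scale;          /* the Point2 of the pseudocode */
            sum.y += p[i].y + vy * scale;
        }
        cur.x = sum.x / n; cur.y = sum.y / n;      /* new estimate = mean of the Point2s */
    } while (hypot(cur.x - prev.x, cur.y - prev.y) > 0.01);

    return cur;
}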
To resolve this problem, you just have to compute the mean for each coordinate and round up the result.
It should resolve your problem.
You are not obliged to use the concept of the geometric median; seeing that it is not easy to calculate, you are better off solving your problem without calculating it!
Here is an idea for an algorithm/implementation.
1. Start at any point (e.g. the first point in the given data).
2. Calculate the sum of distances for the current point and the 8 neighboring points (+/-1 in each direction, x and y).
3. If one of the neighbors is better than the current point, update the current point and start from 1.
4. (You have found the optimal distance; now choose the best point among those with equal distance.)
5. Calculate the sum of distances for the current point and the 3 neighboring points (-1 in each direction, x and y).
6. If one of the neighbors has the same sum as the current point, update the current point and continue from 5.
The answer is (xi, yj) where xi is the median of all the x's and yj is the median of all the y's.
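If the grid problem really measures Manhattan distance, this is easy to check in C on the sample data above; sorting each coordinate and taking the middle element reproduces the expected answer 3 4 (a sketch):
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

int main(void) {
    int xs[] = { 1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 6 };
    int ys[] = { 2, 7, 2, 3, 5, 4, 2, 5, 6, 3, 5 };
    int n = sizeof xs / sizeof xs[0];

    qsort(xs, n, sizeof xs[0], cmp_int);
    qsort(ys, n, sizeof ys[0], cmp_int);

    /* For odd n the median is simply the middle element. */
    printf("%d %d\n", xs[n / 2], ys[n / 2]);   /* prints: 3 4 */
    return 0;
}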
As I commented, the solution to your problem is not the geometric mean, but the arithmetic mean.
To calculate the arithmetic mean, you need to sum all the values of the column and divide the result by the number of elements.

Efficient way of calculating average difference of array elements from array average value

Is there a way to calculate the average distance of array elements from the array's average value by only "visiting" each array element once? (I am looking for an algorithm.)
Example:
Array : [ 1 , 5 , 4 , 9 , 6 ]
Average : ( 1 + 5 + 4 + 9 + 6 ) / 5 = 5
Distance Array : [|1-5|, |5-5|, |4-5|, |9-5|, |6-5|] = [4 , 0 , 1 , 4 , 1 ]
Average Distance : ( 4 + 0 + 1 + 4 + 1 ) / 5 = 2
The simple algorithm needs 2 passes.
1st pass) Reads and accumulates values, then divides the result by array length to calculate average value of array elements.
2nd pass) Reads values, accumulates each one's distance from the previously calculated average value, and then divides the result by array length to find the average distance of the elements from the average value of the array.
The two passes are identical. It is the classic algorithm of calculating the average of a set of values. The first one takes as input the elements of the array, the second one the distances of each element from the array's average value.
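For reference, the two-pass version is short in C (a sketch):
#include <math.h>

double average_distance(const double *a, int n) {
    double sum = 0.0, dist = 0.0, avg;
    for (int i = 0; i < n; i++)          /* 1st pass: the average */
        sum += a[i];
    avg = sum / n;
    for (int i = 0; i < n; i++)          /* 2nd pass: average distance from it */
        dist += fabs(a[i] - avg);
    return dist / n;
}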
The average calculation can be modified so that it does not accumulate the values, but instead calculates the average "on the fly" as we sequentially read the elements from the array.
The formula is:
Compute Running Average of Array's elements
-------------------------------------------
RA[i] = A[i]                            { for i == 1 }
RA[i] = RA[i-1] - RA[i-1]/i + A[i]/i    { for i > 1 }
Where A[x] is the array's element at position x and RA[x] is the average of the array's elements between positions 1 and x (the running average).
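In C the running-average update is one line per element (a sketch):
double running_average(const double *a, int n) {
    double ra = 0.0;
    for (int i = 1; i <= n; i++)
        ra = ra - ra / i + a[i - 1] / i;   /* RA[i] = RA[i-1] - RA[i-1]/i + A[i]/i */
    return ra;
}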
My question is:
Is there a similar algorithm, to calculate "on the fly" (as we read the array's elements), the average distance of the elements from the array's mean value?
The problem is that, as we read the array's elements, the final average value of the array is not yet known; only the running average is known. So calculating differences from the running average will not yield the correct result. I suppose that if such an algorithm exists, it should have the "ability" to compensate, on each new element read, for the error accumulated so far.
I don't think you can do better than O(n log n).
Suppose the array were sorted. Then we could divide it into the elements less than the average and the elements greater than the average. (If some elements are equal to the average, that doesn't matter.) Suppose the first k elements are less than the average. Then the average distance is
D = ((xave - x1) + (xave - x2) + (xave - x3) + ... + (xave - xk) + (x(k+1) - xave) + (x(k+2) - xave) + ... + (xn - xave)) / n
  = ((-x1) + (-x2) + (-x3) + ... + (-xk) + x(k+1) + x(k+2) + ... + xn + (2k - n)*xave) / n
  = ([sum of elements above the average] - [sum of elements below the average] + (2k - n)*xave) / n
You could calculate this in one pass by working in from both ends, adjusting the limits on the (as-yet-unknown) average as you go. That would be O(n), and the sorting is O(n log n) (and they could perhaps be done in the same operation), so the whole thing is O(n log n).
The only problem with a two pass approach is that you need to reread or store the entire sequence for the second pass. The obvious improvement would be to maintain a data structure so that you could adjust the sum of absolute differences when the average value changed.
Suppose that you change the average value to a very large value, by observing a huge number. Now compare the change made by this to that caused by observing a not quite so huge value. You will be able to work out the difference between the two sums of absolute differences, because both average values are above all the other numbers, so all of the absolute values decrease by the difference between the two huge averages. This predictable change carries on until the average meets the highest value observed in the standard numbers, and this change allows you to find out what the highest number observed was.
By running experiments like this you can recover the set of numbers observed before the numbers you shove in to run the experiments. Therefore any clever data structure you use to keep track of sums of absolute differences is capable of storing the set of numbers observed, which (except for order, and cases where multiple copies of the same number are observed) is pretty much what you do by storing all the numbers seen for a second pass. So I don't think there is a trick for the case of sums of absolute differences as there is for squares of differences, where most of the information you care about is described by just the pair of numbers (sum, sum of squares).
If the L2 norm (the square root of the average squared distance) is OK, then it's:
sqrt(sum(x^2)/n - (sum(x)/n)^2)
That's the (square root of the) average of x^2 minus the square of the average of x.
It's called the variance (actually, the above is the square root of the variance, which is called the standard deviation, and is a typical "measure of spread").
Note that this is more sensitive to outliers than the measure you originally asked for.
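That measure can be computed in a single pass by accumulating the sum and the sum of squares (a sketch; note the naive formula can lose precision when the values are large compared to their spread):
#include <math.h>

double stddev(const double *a, int n) {
    double sum = 0.0, sumsq = 0.0, mean;
    for (int i = 0; i < n; i++) {
        sum += a[i];
        sumsq += a[i] * a[i];
    }
    mean = sum / n;
    return sqrt(sumsq / n - mean * mean);    /* sqrt(E[x^2] - (E[x])^2) */
}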
Your followup described your context as HLSL reading from a texture. If your filter footprint is a power of two and is aligned with the same power-of-two boundaries in the original image, you can use MIP maps to find the average value of the filter region.
For example, for an 8x8 filter, precompute a MIP map three levels down the MIP chain, whose elements will be the averages of each 8x8 region. Then a single texture read from that MIP level texture will give you the average for the 8x8 region. Unfortunately this doesn't work for sliding the filter around to arbitrary positions (not multiples of 8 in this example).
You could make use of intermediate MIP levels to decrease the number of texture reads by utilizing the MIP averages of 4x4 or 2x2 areas whenever possible, but that would complicate the algorithm quite a bit.
