Finding max sum with operation limit - arrays

As an input i'm given an array of integers (all positive).
Also as an input i`m given a number of "actions". The goal is to find max possible sum of array elements with given number of actions.
As an "action" i can either:
Add current element to sum
Move to the next element
We are starting at 0 position in array. Each element could be added only once.
Limitation are:
2 < array.Length < 20
0 < number of "actions" < 20
It seems to me that this limitations essentially not important. Its possible to find each combination of "actions", but in this case complexity would be like 2^"actions" and this is bad...))
Examples:
array = [1, 4, 2], 3 actions. Output should be 5. In this case we added zero element, moved to first element, added first element.
array = [7, 8, 9], 2 actions. Output should be 8. In this case we moved to the first element, then added first element.
Could anyone please explain me the algorithm to solve this problem? Or at least the direction in which i shoudl try to solve it.
Thanks in advance

Here is another DP solution using memoization. The idea is to represent the state by a pair of integers (current_index, actions_left) and map it to the maximum sum when starting from the current_index, assuming actions_left is the upper bound on actions we are allowed to take:
from functools import lru_cache
def best_sum(arr, num_actions):
'get best sum from arr given a budget of actions limited to num_actions'
#lru_cache(None)
def dp(idx, num_actions_):
'return best sum starting at idx (inclusive)'
'with number of actions = num_actions_ available'
# return zero if out of list elements or actions
if idx >= len(arr) or num_actions_ <= 0:
return 0
# otherwise, decide if we should include current element or not
return max(
# if we include element at idx
# we spend two actions: one to include the element and one to move
# to the next element
dp(idx + 1, num_actions_ - 2) + arr[idx],
# if we do not include element at idx
# we spend one action to move to the next element
dp(idx + 1, num_actions_ - 1)
)
return dp(0, num_actions)
I am using Python 3.7.12.

array = [1, 1, 1, 1, 100]
actions = 5
In example like above, you just have to keep moving right and finally pickup the 100. At the beginning of the array we never know what values we are going to see further. So, this can't be greedy.
You have two actions and you have to try out both because you don't know which to apply when.
Below is a python code. If not familiar treat as pseudocode or feel free to convert to language of your choice. We recursively try both actions until we run out of actions or we reach the end of the input array.
def getMaxSum(current_index, actions_left, current_sum):
nonlocal max_sum
if actions_left == 0 or current_index == len(array):
max_sum = max(max_sum, current_sum)
return
if actions_left == 1:
#Add current element to sum
getMaxSum(current_index, actions_left - 1, current_sum + array[current_index])
else:
#Add current element to sum and Move to the next element
getMaxSum(current_index + 1, actions_left - 2, current_sum + array[current_index])
#Move to the next element
getMaxSum(current_index + 1, actions_left - 1, current_sum)
array = [7, 8, 9]
actions = 2
max_sum = 0
getMaxSum(0, actions, 0)
print(max_sum)
You will realize that there can be overlapping sub-problems here and we can avoid those repetitive computations by memoizing/caching the results to the sub-problems. I leave that task to you as an exercise. Basically, this is Dynamic Programming problem.
Hope it helped. Post in comments if any doubts.

Related

LeetCode Find All Numbers Disappeared in an Array Question

The problem I am doing is stated as follows:
Given an array nums of n integers where nums[i] is in the range [1, n], return an array of all the integers in the range [1, n] that do not appear in nums.
I found a solution that takes O(n) space fairly quickly however this problem has a condition to find a constant space solution and I do not understand the solution that is given. The constant space solution is recreated here as follows:
def findDisappearedNumbers(self, nums: List[int]) -> List[int]:
# Iterate over each of the elements in the original array
for i in range(len(nums)):
# Treat the value as the new index
new_index = abs(nums[i]) - 1
# Check the magnitude of value at this new index
# If the magnitude is positive, make it negative
# thus indicating that the number nums[i] has
# appeared or has been visited.
if nums[new_index] > 0:
nums[new_index] *= -1
# Response array that would contain the missing numbers
result = []
# Iterate over the numbers from 1 to N and add all those
# that have positive magnitude in the array
for i in range(1, len(nums) + 1):
if nums[i - 1] > 0:
result.append(i)
return result
I don't understand how this code works. To me it seems that every element will be made negative in the first pass and therefore no values will be appended to the answer list. I ran it through a debugger and it seems that is not what is happening. I was hoping that someone can explain to me what it is doing.
Let's take an example:
nums = [4,3,2,7,8,2,3,1]
Now Let's iterate over it,
Index-1: Value = 4 -> Mark (Value - 1) -> (4-1) index element as negative provided that element is positive.
Now, nums = [4,3,2,-7,8,2,3,1]
In this do for every index,
You will come to this:
nums = [-4,-3,-2,-7,8,2,-3,-1]
The element at index = 4 and index = 5 are positive.
So, the answer is [4+1,5+1] = [5,6].
Hope you understood this🔑.

What's the most efficient way to manipulate Arrays in Ruby

Im trying to make the following code more efficient but ran out of ideas, so looking for some help:
I receive an interval x and an Array of Integers space, I need to separate the space array into segments of x interval so for example if
space = [1, 2, 3, 1, 2]
x = 2
# The intervals Would be
[1, 2][2, 3][3, 1][1, 2]
So Then I need to find the minimum on each interval and then find the Max value of the minimums, so in the example above the minimums would be:
[1, 2, 1, 1]
Then I need to return the max value of the minimums that would be 2
Below is my solution, most test cases pass but some of them are failing because the time execution is exceeded. Any idea about how to make the following code more efficient?
def segment(x, space)
minimums = {}
space.each_with_index do |_val, index|
# 1 Create the segment
mark = x - 1
break if mark >= space.length
segment = space[index..mark]
x = x + 1
# 2 Find the min of the segment
minimum = segment.min
minimums[index] = minimum
end
# Compare the maximum of the minimums
return minimums.values.max
end
I tried making the minimums a hash instead of an array but didn't work. Also I thought I would convert the space array to a hash and manipulate the a hash instead of an Array, but I think that would be even more complicated...
You don’t need to collect the segments and process them later. Just walk the array once, keeping the largest min of each segment as you go.
Also each_with_index is a waste. Just walk the array yourself taking successive slices.
Simple example:
# initial conditions
space = [1, 2, 3, 1, 2]
x = 2
# here we go
min = space.min
(0..space.length-x).each do |i|
min = [space[i,x].min, min].max
end
puts min
As usual in ruby, there's always a more elegant way. For example, no need to create the slices; let ruby do it (each_cons):
# initial conditions
space = [1, 2, 3, 1, 2]
x = 2
# here we go
min = space.min
space.each_cons(x) do |slice|
min = [slice.min, min].max
end
puts min
And of course once you understand how this works, it can be refined even further (e.g. as shown in the comment below).

Most Efficient Algorithm to Align an Multiple Ordered Sequences

I have a strange feeling this is a very easy problem to solve but I'm not finding a good way of doing this without using brute force or dynamic programming. Here it goes:
Given N arrays of ordered and monotonic values, find the set of positions for each array i1, i2 ... in that minimises pair-wise difference of values at those indexes between all arrays. In other words, find the positions for all arrays whose values are closest to each other. Multiple solutions may exist and arrays may or may not be equally sized.
If A denotes the list of all arrays, the pair-wise difference is given by the sum of absolute differences between all values at the given indexes between all different arrays, as so:
An example, 3 arrays a, b and c:
a = [20 29 30 32 33]
b = [28 29 30 32 33]
c = [10 12 28 31 32 33]
The best alignment for this array would be a[3] b[3] c[4] or a[4] b[4] c[5], because (32,32,32) and (33,33,33) are all equal values and have, therefore minimum pairwise difference between each other. (Assuming array index starts at 0)
This is a common problem in bioinformatics thats usually solved with Dynamic Programming, but due to the fact this is an ordered sequence, I think there's somehow a way of exploiting this notion of order. I first thought about doing this pairwise, but this does not guarantee the global optimum because the best local answer might not be the best global answer.
This is meant to be language agnostic, but I don't really mind an answer for a specific language, as long as there is no loss of generality. I know Dynamic Programming is an option here, but I have a feeling there's an easier way to do this?
The tricky thing is parsing the arrays so that at some point you're guaranteed to be considering the set of indices that realize the pairwise min. Using a min heap on the values doesn't work. Counterexample with 4 arrays: [0,5], [1,2], [2], [2]. We start with a d(0,1,2,2) = 7, optimal is d(0,2,2,2) = 6, but the min heap moves us from 7 to d(5,1,2,2) = 12, then d(5,2,2,2) = 9.
I believe (but haven't proved) that if we alway increment the index that improves pairwise distance the most (or degrades it the least), we're guaranteed to visit every local min and the global min.
Assuming n total elements across k arrays:
Simple approach: we repeatedly get the pairwise distance deltas (delta wrt. incrementing each index), increment the best one, and any time doing so switch us from improvement to degradation (i.e. a local minimum) we calculate the pairwise distance. All this is O(k^2) per increment for a total running time of O((n-k) * (k^2)).
With O(k^2) storage, we could keep an array where (i,j) stores the pairwise distance delta achieve by increment the index of array i wrt. array j. We also store the column sums. Then on incrementing an index we can update the appropriate row & column & column sums in O(k). This gives us a running time of O((n-k)*k)
To just complete Dave's answer, here is the pseudocode of the delta algorithm:
initialise index_table to 0's where each row i denotes the index for the ith array
initialise delta_table with the corresponding cost of incrementing index of ith array and keeping the other indexes at their current values
cur_cost <- cost of current index table
best_cost <- cur_cost
best_solutions <- list with the current index table
while (can_at_least_one_index_increase)
i <- index whose delta is lowest
increment i-th entry of the index_table
if cost(index_table) < cur_cost
cur_cost = cost(index_table)
best_solutions = {} U {index_table}
if cost(index_table) = cur_cost
best_solutions = best_solutions U {index_table}
update delta_table
Important Note: During an iteration, some index_table entries might have already reached the maximum value for that array. Whenever updating the delta_table, it is necessary to never pick those values, otherwise this will result in a Array Out of Bounds,Segmentation Fault or undefined behaviour. A neat trick is to simply check which indexes are already at max and set a sufficiently large value, so they are never picked. If no index can increase anymore, the loop will end.
Here's an implementation in Python:
def align_ordered_sequences(arrays: list):
def get_cost(index_table):
n = len(arrays)
if n == 1:
return 0
sum = 0
for i in range(0, n-1):
for j in range(i+1, n):
v1 = arrays[i][index_table[i]]
v2 = arrays[j][index_table[j]]
sum += math.sqrt((v1 - v2) ** 2)
return sum
def compute_delta_table(index_table):
# Initialise the delta table: we switch each index element to 1, call
# the cost method and then revert the change, this avoids having to
# create copies, which decreases performance unnecessarily
delta_table = []
for i in range(n):
if index_table[i] + 1 >= len(arrays[i]):
# Implementation detail: if the index is outside the bounds of
# array i, choose a "large enough" number
delta_table.append(999999999999999)
else:
index_table[i] = index_table[i] + 1
delta_table.append(get_cost(index_table))
index_table[i] = index_table[i] - 1
return delta_table
def can_at_least_one_index_increase(index_table):
answer = False
for i in range(len(arrays)):
if index_table[i] < len(arrays[i]) - 1:
answer = True
return answer
n = len(arrays)
index_table = [0] * n
delta_table = compute_delta_table(index_table)
best_solutions = [index_table.copy()]
cur_cost = get_cost(index_table)
best_cost = cur_cost
while can_at_least_one_index_increase(index_table):
i = delta_table.index(min(delta_table))
index_table[i] = index_table[i] + 1
new_cost = get_cost(index_table)
# A new best solution was found
if new_cost < cur_cost:
cur_cost = new_cost
best_solutions = [index_table.copy()]
# A new solution with the same cost was found
elif new_cost == cur_cost:
best_solutions.append(index_table.copy())
# Update the delta table
delta_table = compute_delta_table(index_table)
return best_solutions
And here are some examples:
>>> print(align_ordered_sequences([[0,5], [1,2], [2], [2]]))
[[0, 1, 0, 0]]
>> print(align_ordered_sequences([[3, 5, 8, 29, 40, 50], [1, 4, 14, 17, 29, 50]]))
[[3, 4], [5, 5]]
Note 2: this outputs indexes not the actual values of each array.

Number of Distinct Subarrays

I want to find an algorithm to count the number of distinct subarrays of an array.
For example, in the case of A = [1,2,1,2],
the number of distinct subarrays is 7:
{ [1] , [2] , [1,2] , [2,1] , [1,2,1] , [2,1,2], [1,2,1,2]}
and in the case of B = [1,1,1], the number of distinct subarrays is 3:
{ [1] , [1,1] , [1,1,1] }
A sub-array is a contiguous subsequence, or slice, of an array. Distinct means different contents; for example:
[1] from A[0:1] and [1] from A[2:3] are not distinct.
and similarly:
B[0:1], B[1:2], B[2:3] are not distinct.
Construct suffix tree for this array. Then add together lengths of all edges in this tree.
Time needed to construct suffix tree is O(n) with proper algorithm (Ukkonen's or McCreight's algorithms). Time needed to traverse the tree and add together lengths is also O(n).
Edit: I think about how to reduce iteration/comparison number.
I foud a way to do it: if you retrieve a sub-array of size n, then each sub-arrays of size inferior to n will already be added.
Here is the code updated.
List<Integer> A = new ArrayList<Integer>();
A.add(1);
A.add(2);
A.add(1);
A.add(2);
System.out.println("global list to study: " + A);
//global list
List<List<Integer>> listOfUniqueList = new ArrayList<List<Integer>>();
// iterate on 1st position in list, start at 0
for (int initialPos=0; initialPos<A.size(); initialPos++) {
// iterate on liste size, start on full list and then decrease size
for (int currentListSize=A.size()-initialPos; currentListSize>0; currentListSize--) {
//initialize current list.
List<Integer> currentList = new ArrayList<Integer>();
// iterate on each (corresponding) int of global list
for ( int i = 0; i<currentListSize; i++) {
currentList.add(A.get(initialPos+i));
}
// insure unicity
if (!listOfUniqueList.contains(currentList)){
listOfUniqueList.add(currentList);
} else {
continue;
}
}
}
System.out.println("list retrieved: " + listOfUniqueList);
System.out.println("size of list retrieved: " + listOfUniqueList.size());
global list to study: [1, 2, 1, 2]
list retrieved: [[1, 2, 1, 2], [1, 2, 1], [1, 2], [1], [2, 1, 2], [2, 1], [2]]
size of list retrieved: 7
With a list containing the same patern many time the number of iteration and comparison will be quite low.
For your example [1, 2, 1, 2], the line if (!listOfUniqueList.contains(currentList)){ is executed 10 times. It only raise to 36 for the input [1, 2, 1, 2, 1, 2, 1, 2] that contains 15 different sub-arrays.
You could trivially make a set of the subsequences and count them, but i'm not certain it is the most efficient way, as it is O(n^2).
in python that would be something like :
subs = [tuple(A[i:j]) for i in range(0, len(A)) for j in range(i + 1, len(A) + 1)]
uniqSubs = set(subs)
which gives you :
set([(1, 2), (1, 2, 1), (1,), (1, 2, 1, 2), (2,), (2, 1), (2, 1, 2)])
The double loop in the comprehension clearly states the O(n²) complexity.
Edit
Apparently there are some discussion about the complexity. Creation of subs is O(n^2) as there are n^2 items.
Creating a set from a list is O(m) where m is the size of the list, m being n^2 in this case, as adding to a set is amortized O(1).
The overall is therefore O(n^2).
Right my first answer was a bit of a blonde moment.
I guess the answer would be to generate them all and then remove duplicates. Or if you are using a language like Java with a set object make all the arrays and add them to a set of int[]. Sets only contain one instance of each element and automatically remove duplicates so you can just get the size of the set at the end
I can think of 2 ways...
first is compute some sort of hash then add to a set.
if on adding your hashes are the same is an existing array... then do a verbose comparison... and log it so that you know your hash algorithm isn't good enough...
The second is to use some sort of probable match and then drill down from there...
if number of elements is same and the total of the elements added together is the same, then check verbosely.
Create an array of pair where each pair store the value of the element of subarray and its index.
pair[i] = (A[i],i);
Sort the pair in increasing order of A[i] and then decreasing order of i.
Consider example A = [1,3,6,3,6,3,1,3];
pair array after sorting will be pair = [(1,6),(1,0),(3,7),(3,5),(3,3),(3,1),(6,4),(6,2)]
pair[0] has element of index 6. From index 6 we can have two sub-arrays [1] and [1,3]. So ANS = 2;
Now take each consecutive pair one by one.
Taking pair[0] and pair[1],
pair[1] has index 0. We can have 8 sub-arrays beginning from index 0. But two subarrays [1] and [1,3] are already counted. So to remove them, we need to compare longest common prefix of sub-array for pair[0] and pair[1]. So longest common prefix length for indices beginning from 0 and 6 is 2 i.e [1,3].
So now new distinct sub-arrays will be [1,3,6] .. to [1,3,6,3,6,3,1,3] i.e. 6 sub-arrays.
So new value of ANS is 2+6 = 8;
So for pair[i] and pair[i+1]
ANS = ANS + Number of sub-arrays beginning from pair[i+1] - Length of longest common prefix.
The sorting part takes O(n logn).
Iterating each consecutive pair is O(n) and for each iteration find longest common prefix takes O(n) making whole iteration part O(n^2). Its the best I could get.
You can see that we dont need pair for this. The first value of pair, value of element was not required. I used this for better understanding. You can always skip that.

Weighted random selection from array

I would like to randomly select one element from an array, but each element has a known probability of selection.
All chances together (within the array) sums to 1.
What algorithm would you suggest as the fastest and most suitable for huge calculations?
Example:
id => chance
array[
0 => 0.8
1 => 0.2
]
for this pseudocode, the algorithm in question should on multiple calls statistically return four elements on id 0 for one element on id 1.
Compute the discrete cumulative density function (CDF) of your list -- or in simple terms the array of cumulative sums of the weights. Then generate a random number in the range between 0 and the sum of all weights (might be 1 in your case), do a binary search to find this random number in your discrete CDF array and get the value corresponding to this entry -- this is your weighted random number.
The algorithm is straight forward
rand_no = rand(0,1)
for each element in array
if(rand_num < element.probablity)
select and break
rand_num = rand_num - element.probability
I have found this article to be the most useful at understanding this problem fully.
This stackoverflow question may also be what you're looking for.
I believe the optimal solution is to use the Alias Method (wikipedia).
It requires O(n) time to initialize, O(1) time to make a selection, and O(n) memory.
Here is the algorithm for generating the result of rolling a weighted n-sided die (from here it is trivial to select an element from a length-n array) as take from this article.
The author assumes you have functions for rolling a fair die (floor(random() * n)) and flipping a biased coin (random() < p).
Algorithm: Vose's Alias Method
Initialization:
Create arrays Alias and Prob, each of size n.
Create two worklists, Small and Large.
Multiply each probability by n.
For each scaled probability pi:
If pi < 1, add i to Small.
Otherwise (pi ≥ 1), add i to Large.
While Small and Large are not empty: (Large might be emptied first)
Remove the first element from Small; call it l.
Remove the first element from Large; call it g.
Set Prob[l]=pl.
Set Alias[l]=g.
Set pg := (pg+pl)−1. (This is a more numerically stable option.)
If pg<1, add g to Small.
Otherwise (pg ≥ 1), add g to Large.
While Large is not empty:
Remove the first element from Large; call it g.
Set Prob[g] = 1.
While Small is not empty: This is only possible due to numerical instability.
Remove the first element from Small; call it l.
Set Prob[l] = 1.
Generation:
Generate a fair die roll from an n-sided die; call the side i.
Flip a biased coin that comes up heads with probability Prob[i].
If the coin comes up "heads," return i.
Otherwise, return Alias[i].
Here is an implementation in Ruby:
def weighted_rand(weights = {})
raise 'Probabilities must sum up to 1' unless weights.values.inject(&:+) == 1.0
raise 'Probabilities must not be negative' unless weights.values.all? { |p| p >= 0 }
# Do more sanity checks depending on the amount of trust in the software component using this method,
# e.g. don't allow duplicates, don't allow non-numeric values, etc.
# Ignore elements with probability 0
weights = weights.reject { |k, v| v == 0.0 } # e.g. => {"a"=>0.4, "b"=>0.4, "c"=>0.2}
# Accumulate probabilities and map them to a value
u = 0.0
ranges = weights.map { |v, p| [u += p, v] } # e.g. => [[0.4, "a"], [0.8, "b"], [1.0, "c"]]
# Generate a (pseudo-)random floating point number between 0.0(included) and 1.0(excluded)
u = rand # e.g. => 0.4651073966724186
# Find the first value that has an accumulated probability greater than the random number u
ranges.find { |p, v| p > u }.last # e.g. => "b"
end
How to use:
weights = {'a' => 0.4, 'b' => 0.4, 'c' => 0.2, 'd' => 0.0}
weighted_rand weights
What to expect roughly:
sample = 1000.times.map { weighted_rand weights }
sample.count('a') # 396
sample.count('b') # 406
sample.count('c') # 198
sample.count('d') # 0
An example in ruby
#each element is associated with its probability
a = {1 => 0.25 ,2 => 0.5 ,3 => 0.2, 4 => 0.05}
#at some point, convert to ccumulative probability
acc = 0
a.each { |e,w| a[e] = acc+=w }
#to select an element, pick a random between 0 and 1 and find the first
#cummulative probability that's greater than the random number
r = rand
selected = a.find{ |e,w| w>r }
p selected[0]
This can be done in O(1) expected time per sample as follows.
Compute the CDF F(i) for each element i to be the sum of probabilities less than or equal to i.
Define the range r(i) of an element i to be the interval [F(i - 1), F(i)].
For each interval [(i - 1)/n, i/n], create a bucket consisting of the list of the elements whose range overlaps the interval. This takes O(n) time in total for the full array as long as you are reasonably careful.
When you randomly sample the array, you simply compute which bucket the random number is in, and compare with each element of the list until you find the interval that contains it.
The cost of a sample is O(the expected length of a randomly chosen list) <= 2.
This is a PHP code I used in production:
/**
* #return \App\Models\CdnServer
*/
protected function selectWeightedServer(Collection $servers)
{
if ($servers->count() == 1) {
return $servers->first();
}
$totalWeight = 0;
foreach ($servers as $server) {
$totalWeight += $server->getWeight();
}
// Select a random server using weighted choice
$randWeight = mt_rand(1, $totalWeight);
$accWeight = 0;
foreach ($servers as $server) {
$accWeight += $server->getWeight();
if ($accWeight >= $randWeight) {
return $server;
}
}
}
Ruby solution using the pickup gem:
require 'pickup'
chances = {0=>80, 1=>20}
picker = Pickup.new(chances)
Example:
5.times.collect {
picker.pick(5)
}
gave output:
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]]
If the array is small, I would give the array a length of, in this case, five and assign the values as appropriate:
array[
0 => 0
1 => 0
2 => 0
3 => 0
4 => 1
]
"Wheel of Fortune" O(n), use for small arrays only:
function pickRandomWeighted(array, weights) {
var sum = 0;
for (var i=0; i<weights.length; i++) sum += weights[i];
for (var i=0, pick=Math.random()*sum; i<weights.length; i++, pick-=weights[i])
if (pick-weights[i]<0) return array[i];
}
the trick could be to sample an auxiliary array with elements repetitions which reflect the probability
Given the elements associated with their probability, as percentage:
h = {1 => 0.5, 2 => 0.3, 3 => 0.05, 4 => 0.05 }
auxiliary_array = h.inject([]){|memo,(k,v)| memo += Array.new((100*v).to_i,k) }
ruby-1.9.3-p194 > auxiliary_array
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]
auxiliary_array.sample
if you want to be as generic as possible, you need to calculate the multiplier based on the max number of fractional digits, and use it in the place of 100:
m = 10**h.values.collect{|e| e.to_s.split(".").last.size }.max
Another possibility is to associate, with each element of the array, a random number drawn from an exponential distribution with parameter given by the weight for that element. Then pick the element with the lowest such ‘ordering number’. In this case, the probability that a particular element has the lowest ordering number of the array is proportional to the array element's weight.
This is O(n), doesn't involve any reordering or extra storage, and the selection can be done in the course of a single pass through the array. The weights must be greater than zero, but don't have to sum to any particular value.
This has the further advantage that, if you store the ordering number with each array element, you have the option to sort the array by increasing ordering number, to get a random ordering of the array in which elements with higher weights have a higher probability of coming early (I've found this useful when deciding which DNS SRV record to pick, to decide which machine to query).
Repeated random sampling with replacement requires a new pass through the array each time; for random selection without replacement, the array can be sorted in order of increasing ordering number, and k elements can be read out in that order.
See the Wikipedia page about the exponential distribution (in particular the remarks about the distribution of the minima of an ensemble of such variates) for the proof that the above is true, and also for the pointer towards the technique of generating such variates: if T has a uniform random distribution in [0,1), then Z=-log(1-T)/w (where w is the parameter of the distribution; here the weight of the associated element) has an exponential distribution.
That is:
For each element i in the array, calculate zi = -log(T)/wi (or zi = -log(1-T)/wi), where T is drawn from a uniform distribution in [0,1), and wi is the weight of the I'th element.
Select the element which has the lowest zi.
The element i will be selected with probability wi/(w1+w2+...+wn).
See below for an illustration of this in Python, which takes a single pass through the array of weights, for each of 10000 trials.
import math, random
random.seed()
weights = [10, 20, 50, 20]
nw = len(weights)
results = [0 for i in range(nw)]
n = 10000
while n > 0: # do n trials
smallest_i = 0
smallest_z = -math.log(1-random.random())/weights[0]
for i in range(1, nw):
z = -math.log(1-random.random())/weights[i]
if z < smallest_z:
smallest_i = i
smallest_z = z
results[smallest_i] += 1 # accumulate our choices
n -= 1
for i in range(nw):
print("{} -> {}".format(weights[i], results[i]))
Edit (for history): after posting this, I felt sure I couldn't be the first to have thought of it, and another search with this solution in mind shows that this is indeed the case.
In an answer to a similar question, Joe K suggested this algorithm (and also noted that someone else must have thought of it before).
Another answer to that question, meanwhile, pointed to Efraimidis and Spirakis (preprint), which describes a similar method.
I'm pretty sure, looking at it, that the Efraimidis and Spirakis is in fact the same exponential-distribution algorithm in disguise, and this is corroborated by a passing remark in the Wikipedia page about Reservoir sampling that ‘[e]quivalently, a more numerically stable formulation of this algorithm’ is the exponential-distribution algorithm above. The reference there is to a sequence of lecture notes by Richard Arratia; the relevant property of the exponential distribution is mentioned in Sect.1.3 (which mentions that something similar to this is a ‘familiar fact’ in some circles), but not its relationship to the Efraimidis and Spirakis algorithm.
I would imagine that numbers greater or equal than 0.8 but less than 1.0 selects the third element.
In other terms:
x is a random number between 0 and 1
if 0.0 >= x < 0.2 : Item 1
if 0.2 >= x < 0.8 : Item 2
if 0.8 >= x < 1.0 : Item 3
I am going to improve on https://stackoverflow.com/users/626341/masciugo answer.
Basically you make one big array where the number of times an element shows up is proportional to the weight.
It has some drawbacks.
The weight might not be integer. Imagine element 1 has probability of pi and element 2 has probability of 1-pi. How do you divide that? Or imagine if there are hundreds of such elements.
The array created can be very big. Imagine if least common multiplier is 1 million, then we will need an array of 1 million element in the array we want to pick.
To counter that, this is what you do.
Create such array, but only insert an element randomly. The probability that an element is inserted is proportional the the weight.
Then select random element from usual.
So if there are 3 elements with various weight, you simply pick an element from an array of 1-3 elements.
Problems may arise if the constructed element is empty. That is it just happens that no elements show up in the array because their dice roll differently.
In which case, I propose that the probability an element is inserted is p(inserted)=wi/wmax.
That way, one element, namely the one that has the highest probability, will be inserted. The other elements will be inserted by the relative probability.
Say we have 2 objects.
element 1 shows up .20% of the time.
element 2 shows up .40% of the time and has the highest probability.
In thearray, element 2 will show up all the time. Element 1 will show up half the time.
So element 2 will be called 2 times as many as element 1. For generality all other elements will be called proportional to their weight. Also the sum of all their probability are 1 because the array will always have at least 1 element.
I wrote an implementation in C#:
https://github.com/cdanek/KaimiraWeightedList
O(1) gets (fast!), O(n) recalculates, O(n) memory use.

Resources