Random based on area - arrays

I have an array of elements:
$arr = array(
'0' => 265000, // Area
'1' => 190000,
'2' => 30000,
'3' => 1300
);
I want to get a random index based on the area (the array value). I need areas with big values to be selected more frequently.
How can I do this?
What I have now:
$random_idx = mt_rand(0, count($arr)-1);
$selected_area = (object)$arr[$random_idx];
Thanks!

1. Repeated values
Let's suppose we have an array in which every value corresponds to the relative probability of its index. For example, given a coin, the possible outcomes of a toss are 50% tails and 50% heads. We can represent those probabilities with an array like this (I'll use PHP, as that seems to be the language used by the OP):
$coin = array(
'head' => 1,
'tails' => 1
);
While the results of rolling two dice can be represented as:
$dice = array( '2' => 1, '3' => 2, '4' => 3, '5' => 4, '6' => 5, '7' => 6,
'8' => 5, '9' => 4, '10' => 3, '11' => 2, '12' => 1
);
An easy way to pick a random key (index) with probability proportional to the values of those arrays (and therefore consistent with the underlying model) is to create another array whose elements are the keys of the original one, each repeated as many times as its value indicates, and then pick a random element. For example, for the dice array:
$arr = array( 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, ...
Doing so, we are confident that each key will be picked with the right relative probability. We can encapsulate all the logic in a class with a constructor that builds the helper array and a function that returns a random key using mt_rand():
class RandomKeyMultiple {
private $pool = array();
private $max_range;
function __construct( $source ) {
// build the look-up array
foreach ( $source as $key => $value ) {
for ( $i = 0; $i < $value; $i++ ) {
$this->pool[] = $key;
}
}
$this->max_range = count($this->pool) - 1;
}
function get_random_key() {
$x = mt_rand(0, $this->max_range);
return $this->pool[$x];
}
}
The usage is simple, just create an object of the class passing the source array and then each call of the function will return a random key:
$test = new RandomKeyMultiple($dice);
echo $test->get_random_key();
The problem is that OP's array contains big values and this results in a very big (but still manageable, even without dividing all the values by 100) array.
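The same look-up-pool idea can be sketched in a few lines of Python (using the dice weights above; function and variable names are mine):

```python
import random

def build_pool(weights):
    """Repeat each key as many times as its weight says."""
    pool = []
    for key, weight in weights.items():
        pool.extend([key] * weight)
    return pool

dice = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6,
        8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
pool = build_pool(dice)        # 36 entries, one per equally likely outcome
print(random.choice(pool))     # each key drawn proportionally to its weight
```

Each uniform draw from the pool reproduces the weighted distribution exactly, at the cost of memory proportional to the sum of the weights.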
2. Steps
In general, a discrete probability distribution may be more complicated, with float values that cannot easily be translated into numbers of repetitions.
Another way to solve the problem is to consider the values in the array as the measures of intervals that divide the global range of all possible values:
+---------------------------+-----------------+-------+----+
| | | | |
|<--- 265000 --->|<-- 190000 -->|<30000>|1300|
|<------- 455000 ------>| |
|<---------- 485000 --------->| |
|<---------------- 486300 -------------->|
Then we can choose a random number between 0 and 486300 (the global range) and look up the right index (the odds of picking each index are proportional to the length of its segment, giving the correct probability distribution). Something like:
$x = mt_rand(0, 486300);
if ( $x < 265000 )
return 0;
elseif ( $x < 455000 )
return 1;
elseif ( $x < 485000 )
return 2;
else
return 3;
We can generalize the algorithm and encapsulate all the logic in a class (using a helper array to store the partial sums):
class RandomKey {
private $steps = array();
private $last_key;
private $max_range;
function __construct( $source ) {
// sort in ascending order to partially avoid numerical issues
asort($source);
// calculate the partial sums. Considering OP's array:
//
// 1300 ----> 0
// 30000 ----> 1300
// 190000 ----> 31300
// 265000 ----> 221300 ending with $partial = 486300
//
$partial = 0;
$temp = 0;
foreach ( $source as $k => &$v ) {
$temp = $v;
$v = $partial;
$partial += $temp;
}
// scale the steps to cover the entire mt_rand() range
$factor = mt_getrandmax() / $partial;
foreach ( $source as $k => &$v ) {
$v *= $factor;
}
// Having the most probable outcomes first minimizes the look-up
// of the correct index
$this->steps = array_reverse($source);
// remove the last element (not needed during checks) but save its key
end($this->steps);
$this->last_key = key($this->steps);
array_pop($this->steps);
}
function get_random_key() {
$x = mt_rand();
foreach ( $this->steps as $key => $value ) {
if ( $x > $value ) {
return $key;
}
}
return $this->last_key;
}
}
There are live demos with some examples and helper functions to check the probability distribution of the keys.
For bigger arrays, a binary search to look-up the index may also be considered.
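That binary-search look-up over the partial sums can be sketched in Python with the standard bisect module (using the OP's areas; names are mine):

```python
import bisect
import random
from itertools import accumulate

areas = [265000, 190000, 30000, 1300]
cumulative = list(accumulate(areas))   # [265000, 455000, 485000, 486300]

def random_index(cumulative):
    # draw x uniformly from [0, total) and find the segment containing it
    x = random.randrange(cumulative[-1])
    return bisect.bisect_right(cumulative, x)

print(random_index(cumulative))        # 0 most often, 3 very rarely
```

bisect_right finds the first partial sum strictly greater than x, so each index is returned with probability proportional to its segment length, in O(log n) per draw.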

This solution is based on the element's index, not its value. So we need the array to be sorted in ascending order to be sure that the element with the bigger value always has the bigger index.
A random index generator can now be represented as the linear dependency x = y:
array   4 |             +
index   3 |          +
(y)     2 |       +
        1 |    +
        0 | +
            0  1  2  3  4
            random number (x)
We need to generate indices non-linearly (the bigger the index, the higher the probability):
array   4 |                               +  +  +  +  +
index   3 |                   +  +  +  +
(y)     2 |          +  +  +
        1 |    +  +
        0 | +
            0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
            random number (x)
To find the range of x values for an array of length c we can calculate the sum of all numbers in the range 1..c:
(c * (c + 1)) / 2;
To find y for any x, let's solve the quadratic equation
y ^ 2 + y - 2 * x = 0;
Having solved this we get
y = (sqrt(8 * x + 1) - 1) / 2;
Now let's put it all together:
$c = count($arr);
$range = ($c * ($c + 1)) / 2;
// mt_rand() is inclusive, so draw from 0 to $range - 1
$random_x = mt_rand(0, $range - 1);
$random_idx = floor((sqrt(8 * $random_x + 1) - 1) / 2);
This solution fits big arrays best in terms of performance: the cost of a draw does not depend on the array's size or contents.
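A quick Python sketch of the same index formula (names are mine; the draw is taken from [0, range), which keeps the computed index below c):

```python
import math
from collections import Counter

c = 5                       # array length
rng = c * (c + 1) // 2      # 15 possible draws

def index_for(x):
    # invert x = y*(y+1)/2 + offset via the quadratic formula
    return int((math.sqrt(8 * x + 1) - 1) / 2)

counts = Counter(index_for(x) for x in range(rng))
print(counts)               # index i is hit i + 1 times
```

Enumerating every possible draw confirms that index i is selected i + 1 times out of c*(c+1)/2, i.e. with linearly increasing probability.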

This problem is somewhat similar to the way operating systems can identify the next thread to run with lottery scheduling.
The idea is to assign each area a number of tickets depending on its size and number all those tickets. Depending on which random number was chosen you know which ticket won and thus the winning area.
First you will need to sum up all the areas and pick a random number up to this total. Then you just iterate through your array and look for the first element whose running total exceeds the random number.
Assuming you are looking for a solution in PHP:
function get_random_index($array) {
// generate total
$total = array_sum($array);
// get a random number in the required range
$random_number = rand(0, $total-1);
// temporary sum needed to find the 'winning' area
$temp_total = 0;
// this variable helps us identify the winning area
$current_area_index = 0;
foreach ($array as $area) {
// add the area to our temporary total
$temp_total = $temp_total + $area;
// check if we already have the right ticket
if($temp_total > $random_number) {
return $current_area_index;
}
else {
// this area didn't win, so check the next one
$current_area_index++;
}
}
}

Your array describes a discrete probability distribution. Each array value ('area' or 'weight') relates to the probability of a discrete random variable taking a specific value from the range of array keys.
/**
* Draw a pseudorandom sample from the given discrete probability distribution.
* The input array values will be normalized and do not have to sum up to one.
*
* @param array $arr Array of samples => discrete probabilities (weights).
* @return int|string The sampled key.
*/
function draw_discrete_sample($arr) {
$rand = mt_rand(0, array_sum($arr) - 1);
foreach ($arr as $key => $weight) {
if (($rand -= $weight) < 0) return $key;
}
}
Replace the first line with $rand = mt_rand() / mt_getrandmax() * array_sum($arr); if you want to support non-integer weights / probabilities.
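The same subtract-the-weights loop in Python, split so the look-up can be tested separately from the random draw (a sketch; the names are mine):

```python
import random

def key_for(weights, x):
    """Walk the weights, subtracting each from x until it goes negative."""
    for key, weight in weights.items():
        x -= weight
        if x < 0:
            return key

def draw_discrete_sample(weights):
    total = sum(weights.values())
    return key_for(weights, random.randrange(total))

areas = {0: 265000, 1: 190000, 2: 30000, 3: 1300}
print(draw_discrete_sample(areas))
```

Feeding key_for specific draw values shows exactly which key owns each slice of the [0, total) range.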
You might also want to have a look at similar questions asked here. If you are only interested in sampling a small set of known distributions, I recommend the analytic approach outlined by Oleg Mikhailov.

Related

Generate a matrix of combinations (permutation) without repetition (array exceeds maximum array size preference)

I am trying to generate a matrix, that has all unique combinations of [0 0 1 1], I wrote this code for this:
v1 = [0 0 1 1];
M1 = unique(perms([0 0 1 1]),'rows');
• This isn't ideal, because perms() sees each vector element as unique and generates
4! = 4 * 3 * 2 * 1 = 24 permutations.
• With unique() I tried to delete all the repetitive rows so I end up with the combination matrix M1 →
only 4! / (2! * (4-2)!) = 6 combinations!
Now, when I try to do something very simple like:
n = 15;
i = 1;
v1 = [zeros(1,n-i) ones(1,i)];
M = unique(perms(v1),'rows');
• Instead of getting 15! / (1! * (15-1)!) = 15 combinations, the perms() function is trying to do
15! = 1.3077e+12 combinations and it's interrupted.
• How would you go about doing this in a much better way? Thanks in advance!
You can use nchoosek to return the indices which should be 1. I think in your heart you knew this must be possible, because you were using the definition of nchoosek to determine the expected final number of permutations! So we can use:
idx = nchoosek( 1:N, k );
Where N is the number of elements in your array v1, and k is the number of elements which have the value 1. Then it's simply a case of creating the zeros array and populating the ones.
v1 = [0, 0, 1, 1];
N = numel(v1); % number of elements in array
k = nnz(v1); % number of non-zero elements in array
colidx = nchoosek( 1:N, k ); % column index for ones
rowidx = repmat( 1:size(colidx,1), k, 1 ).'; % row index for ones
M = zeros( size(colidx,1), N ); % create output
M( rowidx(:) + size(M,1) * (colidx(:)-1) ) = 1;
This works for both of your examples without the need for a huge intermediate matrix.
Aside: since you'd have the indices using this approach, you could instead create a sparse matrix, but whether that's a good idea or not would depend on what you're doing after this point.
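The same idea translates directly to Python with itertools.combinations, if that helps to see the logic (a sketch, not MATLAB; names are mine):

```python
from itertools import combinations

def unique_perm_matrix(n, k):
    """Rows are all length-n 0/1 vectors with exactly k ones."""
    rows = []
    for ones in combinations(range(n), k):
        row = [0] * n
        for j in ones:
            row[j] = 1
        rows.append(row)
    return rows

M = unique_perm_matrix(4, 2)
print(len(M))   # C(4,2) = 6 rows, no huge intermediate matrix
```

Only the C(n,k) index tuples are ever materialized, so n = 15 is no problem.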

How many random requests do I need to make to a set of records to get 80% of the records?

Suppose I have an array of 100_000 records ( this is Ruby code, but any language will do)
ary = ['apple','orange','dog','tomato', 12, 17,'cat','tiger' .... ]
results = []
I can only make random calls to the array ( I cannot traverse it in any way)
results << ary.sample
# in ruby this will pull a random record from the array, and
# push into results array
How many random calls like that do I need to make to get at least 80% of the records from ary? Or, expressed another way: what should the size of results be so that results.uniq contains around 80_000 records from ary?
From my rusty memory of my stats class in college, I think it needs to be about 2× the result-set size, or around 160_000 requests (assuming the random function is truly random and there is no other underlying issue). My testing seems to confirm this.
ary = [*1..100_000];
result = [];
160_000.times{result << ary.sample};
result.uniq.size # ~ 80k
This is stats, so we are talking about probabilities, not guaranteed results. I just need a reasonable guess.
So the question really, what's the formula to confirm this?
I would just perform a quick simulation study. In R,
N = 1e5
# Simulate 300 times
s = replicate(300, sample(x = 1:N, size = 1.7e5, replace = TRUE))
Now work out when you hit your target
f = function(i) which(i == unique(i)[80000])[1]
stats = apply(s, 2, f)
To get
summary(stats)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 159711 160726 161032 161037 161399 162242
So in 300 trials, the maximum number of draws needed was 162242, with an average of 161037.
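The simulation matches the coupon-collector estimate: the expected number of uniform draws (with replacement) needed to see a fraction p of n items is roughly n * ln(1/(1 - p)), which for n = 100,000 and p = 0.8 gives n * ln 5 ≈ 160,944. A one-line check (my derivation, not part of the original answer):

```python
import math

def expected_draws(n, p):
    # expected uniform draws (with replacement) to see a fraction p of n items
    return n * math.log(1 / (1 - p))

print(round(expected_draws(100_000, 0.8)))   # 160944
```

That sits squarely inside the simulated interquartile range above, and explains the "about 2×" rule of thumb (ln 5 ≈ 1.61, so closer to 1.6×).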
With a Fisher-Yates shuffle you could get 80K items with exactly 80K random calls.
Have no knowledge of Ruby, but looking at https://gist.github.com/mindplace/3f3a08299651ebf4ab91de3d83254fbc and modifying it
def shuffle(array, counter)
#counter = array.length - 1
while counter > 0
# item selected from the unshuffled part of array
random_index = rand(counter)
# swap the items at those locations
array[counter], array[random_index] = array[random_index], array[counter]
# decrement counter
counter -= 1
end
array
end
indices = (0..99_999).to_a
counter = 80_000
shuffle(indices, counter)
res = []
i = 0
while counter > 0
  res[i] = ary[indices[i]]
  counter -= 1
  i += 1
end
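In Python the same partial Fisher-Yates draw can be sketched like this (random.sample does the equivalent job internally; names are mine):

```python
import random

def sample_indices(n, k):
    """Pick k distinct indices from range(n) via a partial Fisher-Yates shuffle."""
    indices = list(range(n))
    for i in range(k):
        j = random.randrange(i, n)            # swap a random remaining element in
        indices[i], indices[j] = indices[j], indices[i]
    return indices[:k]

picked = sample_indices(100_000, 80_000)
print(len(picked), len(set(picked)))          # 80000 80000: all distinct
```

Only k swaps are performed, so 80K distinct records really do cost exactly 80K random calls.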
UPDATE
Packing the sampled indices into a custom RNG (bear with me, I know nothing about Ruby):
class FYRandom
  def initialize(indices, max)
    @indices = indices
    @max = max
    @idx = 0
  end

  def rand(*)
    return -1.0 if @idx >= @max
    r = @indices[@idx]
    @idx += 1
    r.to_f / @max.to_f
  end
end
And the code for sample would be:
rng = FYRandom.new(indices, 80_000)
results << ary.sample(random: rng)

Sampling intervals, not numbers, without replacement

The sort of problem I am dealing with involves a few things, namely:
I need to randomly sample numbers from a range of numbers.
That range of numbers is really huge, as from 1 to 1,000,000,000.
I need the sampling process to avoid sampling from intervals within the range that have already been sampled. Since using an array is too slow, my attempts to use splice are not going to work.
I start by picking a number between 1 and 1,000,000,000.
my $random = int(rand(1_000_000_000)) + 1;
I add a value, say 100, to that to make $random and $random + 100 define an interval.
my $interval = $random + 100;
Then I push both $random and $interval into another array. This other array is to store the intervals.
push( @rememberOldIntervals, $random, $interval );
I step through array @rememberOldIntervals using a for loop, pulling out items in pairs. The first of a pair is a former $random and the other an $interval. Inside this for loop, I do another random number generation, but the number generated can't be inside an interval already taken. If it is, keep sampling until a number is found that is unique. Further, this new random number must be at least 100 away from any old interval.
for ( my $i = 0; $i < (scalar @rememberOldIntervals) / 2; $i += 2 ) {
$random = int(rand(1_000_000_000)) + 1;
my $new_random_low = $random - 100;
my $new_random_high = $random + 100;
if ( $new_random_low <= $rememberOldIntervals[0] or
$new_random_high >= $rememberOldIntervals[1] ){
push( @rememberOldIntervals, $new_random_low, $new_random_high );
}
else {
until ($new_random_low <= $rememberOldIntervals[0] or
$new_random_high >= $rememberOldIntervals[1] ) {
$random = int(rand(1_000_000_000)) + 1;
my $new_random_low = $random - 100;
my $new_random_high = $random + 100;
}
}
}
This latter loop would need to be embedded within another to drive it many times, say 10,000 times.
This problem can be reframed into pulling 10,000 random numbers between 0 and 1 billion, where no number is within 100 of another.
Brute Force - 5 secs
Because you're only pulling 10,000 numbers, and probably don't need to do it very often, I suggest approaching this type of problem with brute force initially. This follows the maxim that premature optimization is the root of all evil.
In this case, that means just pulling random numbers and comparing them to all previously pulled numbers. This will have a speed of O(N^2), but will also take less code.
use strict;
use warnings;
my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;
die "Too many numbers" if 2 * $dist * $count >= $max;
my @numbers;
while (@numbers < $count) {
my $num = int rand $max;
push @numbers, $num if ! grep {abs($num - $_) < $dist} @numbers;
}
print scalar(@numbers), "\n";
Output takes 5 seconds:
10000
Binary Search for faster generation - 0.14 secs
Now for faster algorithm, I agree with ysth that a much more efficient method to solve this is to create two lists of your random numbers. One of them is the running list, and the other is sorted. Use the sorted list to do a binary search for placement and then comparison to its nearby elements to see if it is within 100.
This reduces the number of comparisons from O(N^2) to O(N log N). The following takes just 0.14 seconds to run versus the 5 seconds of the brute force method.
use strict;
use warnings;
my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;
die "Too many numbers" if 2 * $dist * $count >= $max;
my @numbers;
my @sorted = (-$dist, $max); # Include edges to simplify binary search logic.
while (@numbers < $count) {
my $num = int rand $max;
# Binary Search of Sorted list.
my $binary_min = 0;
my $binary_max = $#sorted;
while ($binary_max > $binary_min) {
my $average = int( ($binary_max + $binary_min) / 2 );
$binary_max = $average if $sorted[$average] >= $num;
$binary_min = $average + 1 if $sorted[$average] <= $num;
}
if (! grep {abs($num - $_) < $dist} @sorted[$binary_max, $binary_max - 1]) {
splice @sorted, $binary_max, 0, $num;
push @numbers, $num;
}
}
print scalar(@numbers), "\n";
Hash of quotients for fastest - 0.05 secs
I inquired in the comments: "Could you simplify this problem to pick a random multiple of 100? That would ensure no overlap, and then you'd just need to pick a random number from 1 to 10 million without repeat, and then just multiply it by 100." You didn't respond, but we can still use grouping by multiples of 100 to simplify this problem.
Basically, if we keep track of each number's quotient when divided by 100, we only need to compare it to numbers whose quotients are one higher or lower. This reduces the number of comparisons to O(N), which, not surprisingly, is the fastest at 0.05 seconds:
use strict;
use warnings;
my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;
die "Too many numbers" if 2 * $dist * $count >= $max;
my @numbers;
my %num_per_quot;
while (@numbers < $count) {
my $num = int rand $max;
my $quotient = int $num / $dist;
if (! grep {defined && abs($num - $_) < $dist} map {$num_per_quot{$quotient + $_}} (-1, 0, 1)) {
push @numbers, $num;
$num_per_quot{$quotient} = $num;
}
}
print scalar(@numbers), "\n";
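The bucket-by-quotient trick ports to Python almost verbatim (my sketch, shown with smaller parameters):

```python
import random

def spaced_sample(count, maximum, dist):
    """Draw `count` numbers below `maximum`, all at least `dist` apart."""
    assert 2 * dist * count < maximum, "too many numbers"
    numbers = []
    per_quot = {}                     # quotient -> accepted number in that bucket
    while len(numbers) < count:
        num = random.randrange(maximum)
        q = num // dist
        neighbours = (per_quot.get(q + d) for d in (-1, 0, 1))
        if all(n is None or abs(num - n) >= dist for n in neighbours):
            numbers.append(num)
            per_quot[q] = num
    return numbers

nums = sorted(spaced_sample(1_000, 1_000_000, 100))
print(min(b - a for a, b in zip(nums, nums[1:])))   # always >= 100
```

Two numbers closer than dist must land in the same or adjacent buckets, so checking three dictionary entries per candidate is enough.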
Caution if you're on Windows
If you run this code on Windows with a version of perl less than v5.20, you'll need to use a better random number generator than the built-in rand. For reasons why, read avoid using rand if it matters.
I used Math::Random::MT qw(rand); in this code since I'm on Strawberry Perl v5.18.2. However, starting with Perl v5.20 this will no longer be a concern because rand now uses a consistent random number generator.
You can speed it up by using hashes and indices.
This will part the space into indexed segments of width 200, and each interval will be placed randomly in a random segment.
my $interval = 100;
my $space = 1e9;
my $interval_count = 1e4;
my @values;
my %index_taken;
for(1..$interval_count)
{
my $index;
1 while $index_taken{ $index = int rand( $space / 2 / $interval ) }++;
my $start = $index*2*$interval + 1 + int rand $interval;
push @values, $start, $start+$interval;
}
It guarantees nonoverlapping intervals but there will be inaccessible space of up to 200 between two intervals.
Or, if you want the intervals sorted:
@values = map {$_*=2*$interval; $_+=1+int rand $interval; ($_,$_+$interval)}
sort {$a <=> $b} keys %index_taken;
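A Python sketch of the same segment-indexing idea (names are mine): pick distinct segments of width 2 * interval, then place each interval at a random offset inside its segment, which guarantees non-overlap by construction.

```python
import random

def sample_intervals(count, space, interval):
    """Place `count` non-overlapping intervals of length `interval` in [1, space]."""
    segments = random.sample(range(space // (2 * interval)), count)
    values = []
    for idx in segments:
        start = idx * 2 * interval + 1 + random.randrange(interval)
        values.append((start, start + interval))
    return values

ivals = sorted(sample_intervals(10_000, 1_000_000_000, 100))
print(all(a2 > b1 for (_, b1), (a2, _) in zip(ivals, ivals[1:])))  # True
```

random.sample draws the segment indices without replacement, playing the role of the %index_taken hash.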

Given a list of numbers, find a subset that sum to any number in a range of numbers

How could I go about making a method that returns true or false when given an array of either positive or negative integers and a value k such that some numbers in that array sum up to any number from 1 to k.
For example, given an array [-10,20,14,-3] and k=6, this would return true as 14+(-10) = 4 which is between 1 and 6.
I know this requires dynamic programming (as it must run in polynomial time), but I'm not exactly sure how to go about implementing it, so any help would be great.
Thanks!
I think the way is to sort the numbers, then sum a smaller and smaller list until it works.
Here's an answer in Scala:
def sumit(x: List[Int], k: Int): List[Int] = x match {
  case x => if (x.sum > k) x else sumit(x.init, k)
}
The list must be presorted, i.e. called like this:
val numbers=List(-9, 1, 3, 6, 5, 1, 2)
sumit( numbers.sorted.reverse, 16)
and in perl
#!/usr/bin/perl
#
sub any_adds_to {
my $target=shift;
my $list=shift;
my @sorted=sort {$a <=> $b} @$list;
my $v=0;
for (@sorted) {
$v=$v+$_;
last if $v>$target;
}
return $v>$target;
}
sub assert {
$_[0] ? "OK\n" : "FAIL\n";
}
my %test_as_false=(18=>[3,2,4,2,6],
255=>[32,33,34,35,35],
1200=>[-1,1000,1,199]);
for my $val_to_find (keys %test_as_false) {
print assert(not(any_adds_to($val_to_find, $test_as_false{$val_to_find})));
}
my %test_as_correct=(18,[2,4,6,8],
255,[65,66,67,68],
1200=>[1,1000,1,199]);
for my $val_to_find (keys %test_as_correct) {
print assert((any_adds_to($val_to_find, $test_as_correct{$val_to_find})));
}
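For the dynamic-programming route the question asks about, here is a hedged Python sketch (my code, not from the answers above): track the set of sums attainable by non-empty subsets, which stays pseudo-polynomial because that set is bounded by the total magnitude of the inputs, then test whether any attainable sum lands in 1..k.

```python
def subset_sum_in_range(nums, k):
    """True if some non-empty subset of nums sums to a value in 1..k."""
    attainable = set()
    for x in nums:
        # each new element either starts a subset or extends an existing one
        attainable |= {s + x for s in attainable} | {x}
    return any(1 <= s <= k for s in attainable)

print(subset_sum_in_range([-10, 20, 14, -3], 6))   # True: 14 + (-10) = 4
print(subset_sum_in_range([10, 20], 6))            # False
```

This handles the negative numbers that defeat the greedy prefix-sum approach, since dropping an element can move the sum into range.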

Find repeated element in array

Consider array of INT of positive numbers:
{1,3,6,4,7,6,9,2,6,6,6,6,8}
Given: only one number is repeated; return the number and its positions, with an efficient algorithm.
Any ideas for efficient algorithms?
One possible solution is to maintain an external hash map. Iterate the array, and place the indices of values found into the hash map. When done, you now know which number was duplicated and the indices of the locations it was found at.
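That hash-map pass can be sketched in Python (names are mine): one dictionary pass finds the repeated value, then its positions are collected.

```python
def find_repeated(arr):
    """Return the single repeated value and all positions where it occurs."""
    first_seen = {}
    for i, v in enumerate(arr):
        if v in first_seen:
            # v is the repeated number; gather every index holding it
            return v, [j for j, w in enumerate(arr) if w == v]
        first_seen[v] = i
    return None, []

print(find_repeated([1, 3, 6, 4, 7, 6, 9, 2, 6, 6, 6, 6, 8]))
# (6, [2, 5, 8, 9, 10, 11])
```

O(n) time and O(n) extra space, matching the complexity claims made for the hash-based answers below.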
In an interview situation, I guess it's your chance to ask around the question: for example, how many numbers? What range of numbers? You could state that the optimal algorithm could change depending on the answers.
That gives you a chance to show how you solve problems.
If the range of ints in the array is small enough, you could create another array to keep a count of the number of times each integer is found, then go linearly through the array accumulating occurrence counts, stopping when you reach an occurrence count of two.
A hash will do just fine here. Add numbers to it one by one, each time checking whether the number is already in there.
Well, there probably is some trick (there usually is), but just off the cuff, you should be able to sort the list (O(n log n)). Then it's just a matter of finding a number that is the same as the next one (linear search, O(n)). You'd have to sort it as tuples of values and original indices, of course, so you could return the index you are looking for. But the point is that the upper bound on an algorithm that will do the job should be O(n log n).
If you just go through the list linearly, you could take each element, then search through the rest of the list after it for a matching value. I think that's roughly equivalent to the work done in a bubble sort, so it would probably be O(n^2), but a simple one.
I really hate trick questions as interview questions. They are kind of like optical illusions: Either you see it or you don't, but it doesn't really say anything bad about you if you don't see the trick.
I'd try this:
all elms of list have to be looked at (=> loop over the list)
before the repeated elm is known, store elm => location/index in a hash/dictionary
as soon as the second occurrence of the repeated element is found, store its first position (from the hash) and the current position in the result array
compare further elms of list against the repeated elm, append found locations to the result array
in code:
Function locRep( aSrc )
' to find repeated elm quickly
Dim dicElms : Set dicElms = CreateObject( "Scripting.Dictionary" )
' to store the locations
Dim aLocs : aLocs = Array()
' once found, simple comparison is enough
Dim vRepElm : vRepElm = Empty
Dim nIdx
For nIdx = 0 To UBound( aSrc )
If vRepElm = aSrc( nIdx ) Then ' repeated elm known, just store location
ReDim Preserve aLocs( UBound( aLocs ) + 1 )
aLocs( UBound( aLocs ) ) = nIdx
Else ' repeated elm not known
If dicElms.Exists( aSrc( nIdx ) ) Then ' found it
vRepElm = aSrc( nIdx )
ReDim aLocs( UBound( aLocs ) + 2 )
' location of first occurrence
aLocs( UBound( aLocs ) - 1 ) = dicElms( aSrc( nIdx ) )
' location of this occurrence
aLocs( UBound( aLocs ) ) = nIdx
Else
' location of first occurrence
dicElms( aSrc( nIdx ) ) = nIdx
End If
End If
Next
locRep = aLocs
End Function
Test run:
-------------------------------------------------
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
Src: 1 3 6 4 7 6 9 2 6 6 6 6 8
Res: 2 5 8 9 10 11
ok
Src:
Res:
ok
Src: 1 2 3
Res:
ok
Src: 1 1 2 3 4 5 6
Res: 0 1
ok
Src: 1 2 3 4 5 6 6
Res: 5 6
ok
=================================================
#include <list>
#include <vector>
#include <unordered_map>
using namespace std;
list<int> find_duplicate_idx(const vector<int>& A)
{
    unordered_map<int, int> X; // value -> index of first occurrence
    list<int> idx;
    for ( size_t i = 0; i < A.size(); ++i ) {
        unordered_map<int, int>::iterator it = X.find(A[i]);
        if ( it != X.end() ) {
            idx.push_back(it->second);
            idx.push_back(i);
            for ( size_t j = i + 1; j < A.size(); ++j )
                if ( A[j] == A[i] )
                    idx.push_back(j);
            return idx;
        }
        X[A[i]] = i;
    }
    return idx;
}
This is a solution my friend provided. Thank you SETI from mitbbs.com
Use the hash-map to solve it :
private int getRepeatedElementIndex(int[] arr) {
Map<Integer, Integer> map = new HashMap<>();
// find the duplicate element in an array
for (int i = 0; i < arr.length; i++) {
if(map.containsKey(arr[i])) {
return i;
} else {
map.put(arr[i], i);
}
}
throw new RuntimeException("No repeated element found");
}
Time complexity : O(n)
Space complexity : O(n)
