Sampling intervals, not numbers, without replacement - arrays

The sort of problem I am dealing with involves a few things, namely:
I need to randomly sample numbers from a range of numbers.
That range of numbers is really huge, from 1 to 1,000,000,000.
I need the sampling process to avoid sampling from intervals within the range that have already been sampled. Since using an array to represent the whole range is too slow, my attempts to use splice are not going to work.
I start by picking a number between 1 and 1,000,000,000.
my $random = int(rand(1_000_000_000)) + 1;
I add a value, say 100, to that to make $random and $random + 100 define an interval.
my $interval = $random + 100;
Then I push both $random and $interval into another array. This other array is to store the intervals.
push( @rememberOldIntervals, $random, $interval );
I step through the array @rememberOldIntervals using a for loop, pulling out items in pairs. The first of a pair is a former $random and the other a former $interval. Inside this for loop, I do another random number generation. But the generated number can't fall within an interval already taken; if it does, keep sampling until a number is found that is unique. Further, this new random number must be at least 100 away from any old interval.
for ( my $i = 0; $i < (scalar @rememberOldIntervals) / 2; $i += 2 ) {
$random = int(rand(1_000_000_000)) + 1;
my $new_random_low = $random - 100;
my $new_random_high = $random + 100;
if ( $new_random_low <= $rememberOldIntervals[0] or
$new_random_high >= $rememberOldIntervals[1] ) {
push( @rememberOldIntervals, $new_random_low, $new_random_high );
}
else {
until ( $new_random_low <= $rememberOldIntervals[0] or
$new_random_high >= $rememberOldIntervals[1] ) {
$random = int(rand(1_000_000_000)) + 1;
my $new_random_low = $random - 100;
my $new_random_high = $random + 100;
}
}
}
This latter loop would need to be embedded within another to drive it many times, say 10,000 times.

This problem can be reframed into pulling 10,000 random numbers between 0 and 1 billion, where no number is within 100 of another.
Brute Force - 5 secs
Because you're only pulling 10,000 numbers, and probably don't need to do it very often, I suggest approaching this type of problem using brute force initially. This follows the maxim that premature optimization is the root of all evil.
In this case, that means just pulling random numbers and comparing them to all previously pulled numbers. This will be O(N^2), but will also take less code.
use strict;
use warnings;
my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;
die "Too many numbers" if 2 * $dist * $count >= $max;
my @numbers;
while (@numbers < $count) {
my $num = int rand $max;
push @numbers, $num if ! grep {abs($num - $_) < $dist} @numbers;
}
print scalar(@numbers), "\n";
Output takes 5 seconds:
10000
Binary Search for faster generation - 0.14 secs
Now for a faster algorithm. I agree with ysth that a much more efficient method to solve this is to create two lists of your random numbers. One of them is the running list, and the other is sorted. Use the sorted list to do a binary search for placement and then compare with its nearby elements to see if it is within 100.
This reduces the number of comparisons from O(N^2) to O(N log N). The following takes just 0.14 seconds to run versus the 5 seconds of the brute force method.
use strict;
use warnings;
my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;
die "Too many numbers" if 2 * $dist * $count >= $max;
my @numbers;
my @sorted = (-$dist, $max); # Include edges to simplify binary search logic.
while (@numbers < $count) {
my $num = int rand $max;
# Binary Search of Sorted list.
my $binary_min = 0;
my $binary_max = $#sorted;
while ($binary_max > $binary_min) {
my $average = int( ($binary_max + $binary_min) / 2 );
$binary_max = $average if $sorted[$average] >= $num;
$binary_min = $average + 1 if $sorted[$average] <= $num;
}
if (! grep {abs($num - $_) < $dist} @sorted[$binary_max, $binary_max - 1]) {
splice @sorted, $binary_max, 0, $num;
push @numbers, $num;
}
}
print scalar(@numbers), "\n";
Hash of quotients for fastest - 0.05 secs
I inquired in the comments: "Could you simplify this problem to pick a random multiple of 100? That would ensure no overlap, and then you'd just need to pick a random number from 1 to 10 million without repeat, and then just multiply it by 100." You didn't respond, but we can still use grouping by multiples of 100 to simplify this problem.
Basically, if we keep track of a number's quotient when divided by 100, we only need to compare it to numbers whose quotients are the same or differ by one. This reduces the number of comparisons to O(N), which not surprisingly is the fastest at 0.05 seconds:
use strict;
use warnings;
my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;
die "Too many numbers" if 2 * $dist * $count >= $max;
my @numbers;
my %num_per_quot;
while (@numbers < $count) {
my $num = int rand $max;
my $quotient = int $num / $dist;
if (! grep {defined && abs($num - $_) < $dist} map {$num_per_quot{$quotient + $_}} (-1, 0, 1)) {
push @numbers, $num;
$num_per_quot{$quotient} = $num;
}
}
print scalar(@numbers), "\n";
Caution if you're on Windows
If you run this code on Windows and are using a version of perl less than v5.20, you'll need to use a better random number generator than the built-in rand. For reasons why, read avoid using rand if it matters.
I used Math::Random::MT qw(rand); in this code since I'm on Strawberry Perl v5.18.2. However, starting with Perl v5.20 this will no longer be a concern because rand now uses a consistent random number generator.
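For reference, swapping in that module's rand is just an import at the top of the script (assuming Math::Random::MT is installed from CPAN); the call sites below it stay unchanged:
use Math::Random::MT qw(rand);  # replaces the built-in rand for this package
my $num = int rand $max;        # same call as before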

You can speed it up by using hashes and indices.
This will partition the space into indexed segments of width 200, and each interval will be placed randomly within a random segment.
my $interval = 100;
my $space = 1e9;
my $interval_count = 1e4;
my @values;
my %index_taken;
for(1..$interval_count)
{
my $index;
$index while $index_taken{$index = int rand $space/2/$interval }++;
my $start = $index*2*$interval + 1 + int rand $interval;
push @values, $start, $start+$interval;
}
It guarantees nonoverlapping intervals but there will be inaccessible space of up to 200 between two intervals.
Or, if you want the intervals sorted:
@values = map {$_*=2*$interval; $_+=1+int rand $interval; ($_,$_+$interval)}
sort { $a <=> $b } keys %index_taken;
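As a quick sanity check on the output (a sketch reusing @values from the snippet above), you can sort the (start, end) pairs and confirm that no interval overlaps the next:
# pair up the flat start/end list, sort by start, and verify disjointness
my @pairs = sort { $a->[0] <=> $b->[0] }
            map { [ $values[2 * $_], $values[2 * $_ + 1] ] } 0 .. @values / 2 - 1;
for my $i (1 .. $#pairs) {
    die "overlap at pair $i\n" if $pairs[$i][0] <= $pairs[$i - 1][1];
}
print scalar @pairs, " intervals, all disjoint\n";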

Related

How many random requests do I need to make to a set of records to get 80% of the records?

Suppose I have an array of 100_000 records ( this is Ruby code, but any language will do)
ary = ['apple','orange','dog','tomato', 12, 17,'cat','tiger' .... ]
results = []
I can only make random calls to the array ( I cannot traverse it in any way)
results << ary.sample
# in ruby this will pull a random record from the array, and
# push into results array
How many random calls like that do I need to make to get at least 80% of the records from ary? Or expressed another way: what should be the size of results so that results.uniq will contain around 80_000 records from ary?
From my rusty memory of my stats class in college, I think it needs to be about 2x the result set size, or around 160_000 requests (assuming the random function is truly random and there is no other underlying issue). My testing seems to confirm this.
ary = [*1..100_000];
result = [];
160_000.times{result << ary.sample};
result.uniq.size # ~ 80k
This is stats, so we are talking about probabilities, not guaranteed results. I just need a reasonable guess.
So the question really, what's the formula to confirm this?
I would just perform a quick simulation study. In R,
N = 1e5
# Simulate 300 times
s = replicate(300, sample(x = 1:N, size = 1.7e5, replace = TRUE))
Now work out when you hit your target
f = function(i) which(i == unique(i)[80000])[1]
stats = apply(s, 2, f)
To get
summary(stats)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  159711  160726  161032  161037  161399  162242
So in 300 trials, the maximum number of draws needed was 162242, with a median of 161032.
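A closed form consistent with those numbers comes from the coupon-collector expectation: the expected number of draws needed to see k distinct records out of N is
E[D] = \sum_{i=0}^{k-1} \frac{N}{N - i} = N (H_N - H_{N-k}) \approx N \ln\frac{N}{N - k}
With N = 100_000 and k = 80_000 this gives about 100_000 * ln 5 ≈ 160_944 draws, in line with the simulated results above.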
With Fisher-Yates shuffle you could get 80K items from exactly 80K random calls
I have no knowledge of Ruby, but looking at https://gist.github.com/mindplace/3f3a08299651ebf4ab91de3d83254fbc and modifying it:
def shuffle(array, counter)
#counter = array.length - 1
while counter > 0
# item selected from the unshuffled part of array
random_index = rand(counter)
# swap the items at those locations
array[counter], array[random_index] = array[random_index], array[counter]
# decrement counter
counter -= 1
end
array
end
indices = [0, 1, 2, 3, ...] # up to 99999
counter = 80000
shuffle(indices, 80000)
res = []
i = 0
while counter > 0
res[i] = ary[indices[i]]
counter -= 1
i += 1
end
UPDATE
Packing the sampled indices into a custom RNG (bear with me, I know nothing about Ruby)
class FYRandom
def initialize(indices, max)
@indices = indices
@max = max
@idx = 0
end
def rand(*)
return -1.0 if @idx >= @max
r = @indices[@idx]
@idx += 1
r.to_f / @max.to_f
end
end
And code for sample would be
rng = FYRandom.new(indices, 80000)
results << ary.sample(random: rng)

Random based on area

I have an array of elements:
$arr = array(
'0' => 265000, // Area
'1' => 190000,
'2' => 30000,
'3' => 1300
);
I want to get a random index based on the area (array value). I need areas with big values to be selected more frequently.
How can I do this?
What I have now:
$random_idx = mt_rand(0, count($arr)-1);
$selected_area = (object)$arr[$random_idx];
Thanks!
1. Repeated values
Let's suppose we have an array in which every value corresponds to the relative probability of its index. For example, given a coin, the possible outcomes of a toss are 50% tails and 50% heads. We can represent those probabilities with an array like this (I'll use PHP as this seems to be the language used by the OP):
$coin = array(
'head' => 1,
'tails' => 1
);
While the results of rolling two dice can be represented as:
$dice = array( '2' => 1, '3' => 2, '4' => 3, '5' => 4, '6' => 5, '7' => 6,
'8' => 5, '9' => 4, '10' => 3, '11' => 2, '12' => 1
);
An easy way to pick a random key (index) with a probability proportional to the values of those arrays (and therefore consistent with the underlying model) is to create another array whose elements are the keys of the original one, each repeated as many times as indicated by its value, and then pick one of its elements at random. For example, for the dice array:
$arr = array( 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, ...
Doing so, we are confident that each key will be picked with the right relative probability. We can encapsulate all the logic in a class with a constructor that builds the helper array and a function that returns a random index using mt_rand():
class RandomKeyMultiple {
private $pool = array();
private $max_range;
function __construct( $source ) {
// build the look-up array
foreach ( $source as $key => $value ) {
for ( $i = 0; $i < $value; $i++ ) {
$this->pool[] = $key;
}
}
$this->max_range = count($this->pool) - 1;
}
function get_random_key() {
$x = mt_rand(0, $this->max_range);
return $this->pool[$x];
}
}
The usage is simple, just create an object of the class passing the source array and then each call of the function will return a random key:
$test = new RandomKeyMultiple($dice);
echo $test->get_random_key();
The problem is that OP's array contains big values and this results in a very big (but still manageable, even without dividing all the values by 100) array.
2. Steps
In general, a discrete probability distribution may be more complicated, with float values that cannot easily be translated into numbers of repetitions.
Another way to solve the problem is to consider the values in the array as the measures of intervals that divide the global range of all possible values:
+----------------------------+--------------------+-------+----+
|                            |                    |       |    |
|<---------- 265000 -------->|<----- 190000 ----->|<30000>|1300|
|<-------------------- 455000 ------------------->|       |    |
|<------------------------ 485000 ----------------------->|    |
|<-------------------------- 486300 ------------------------->|
Then we can choose a random number between 0 and 486300 (the global range) and look up the right index (the odds of each index being proportional to the length of its segment, giving the correct probability distribution). Something like:
$x = mt_rand(0, 486300);
if ( $x < 265000 )
return 0;
elseif ( $x < 455000 )
return 1;
elseif ( $x < 485000 )
return 2;
else
return 3;
We can generalize the algorithm and encapsulate all the logic in a class (using a helper array to store the partial sums):
class RandomKey {
private $steps = array();
private $last_key;
private $max_range;
function __construct( $source ) {
// sort in ascending order to partially avoid numerical issues
asort($source);
// calculate the partial sums. Considering OP's array:
//
// 1300 ----> 0
// 30000 ----> 1300
// 190000 ----> 31300
// 265000 ----> 221300 ending with $partial = 486300
//
$partial = 0;
$temp = 0;
foreach ( $source as $k => &$v ) {
$temp = $v;
$v = $partial;
$partial += $temp;
}
// scale the steps to cover the entire mt_rand() range
$factor = mt_getrandmax() / $partial;
foreach ( $source as $k => &$v ) {
$v *= $factor;
}
// Having the most probable outcomes first minimizes the look-up of
// the correct index
$this->steps = array_reverse($source);
// remove the last element (not needed during checks) but save the key
end($this->steps);
$this->last_key = key($this->steps);
array_pop($this->steps);
}
function get_random_key() {
$x = mt_rand();
foreach ( $this->steps as $key => $value ) {
if ( $x > $value ) {
return $key;
}
}
return $this->last_key;
}
}
Here or here there are live demos with some examples and helper functions to check the probability distribution of the keys.
For bigger arrays, a binary search to look-up the index may also be considered.
This solution is based on the element's index, not on its value. So we need the array to be ordered, to always be sure that an element with a bigger value has a bigger index.
Random index generator can now be represented as a linear dependency x = y:
(y)
a i 4         +
r n 3       +
r d 2     +
a e 1   +
y x 0 +
      0 1 2 3 4
      r a n d o m
      n u m b e r (x)
We need to generate indices non-linearly (bigger index - more probability):
a i 4                     +  +  +  +  +
r n 3             + + + +
r d 2       + + +
a e 1   + +
y x 0 +
      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
      r a n d o m
      n u m b e r
To find the range of x values for an array of length c we can calculate the sum of all numbers in range 0..c:
(c * (c + 1)) / 2;
To find x for any y let's solve quadratic equation
y ^ 2 + y - 2 * x = 0;
Having solved this we get
y = (sqrt(8 * x + 1) - 1) / 2;
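Filling in the step between the picture and the quadratic: index y owns y + 1 tickets, so the number of tickets belonging to indices below y is
\sum_{j=0}^{y-1} (j + 1) = \frac{y (y + 1)}{2}
A random ticket x lands on index y exactly when y(y+1)/2 <= x < (y+1)(y+2)/2; treating the lower bound as an equality gives the quadratic above, and taking the floor of its positive root maps x back to the index.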
Now let's put it all together:
$c = count($arr);
$range = ($c * ($c + 1)) / 2;
$random_x = mt_rand(0, $range);
$random_idx = floor((sqrt(8 * $random_x + 1) - 1) / 2);
This solution fits best for big arrays in terms of performance: the cost of picking an index does not depend on the array size or element type.
This problem is somewhat similar to the way operating systems can identify the next thread to run with lottery scheduling.
The idea is to assign each area a number of tickets depending on its size and number all those tickets. Depending on which random number was chosen you know which ticket won and thus the winning area.
First you will need to sum up all the areas and draw a random number up to this total. Now you just iterate through your array and look for the first element whose running total up to this point is larger than the random number.
Assuming you are looking for a solution in PHP:
function get_random_index($array) {
// generate total
$total = array_sum($array);
// get a random number in the required range
$random_number = rand(0, $total-1);
// temporary sum needed to find the 'winning' area
$temp_total = 0;
// this variable helps us identify the winning area
$current_area_index = 0;
foreach ($array as $area) {
// add the area to our temporary total
$temp_total = $temp_total + $area;
// check if we already have the right ticket
if($temp_total > $random_number) {
return $current_area_index;
}
else {
// this area didn't win, so check the next one
$current_area_index++;
}
}
}
Your array describes a discrete probability distribution. Each array value ('area' or 'weight') relates to the probability of a discrete random variable taking a specific value from the range of array keys.
/**
* Draw a pseudorandom sample from the given discrete probability distribution.
* The input array values will be normalized and do not have to sum up to one.
*
* @param array $arr Array of samples => discrete probabilities (weights).
* @return sample
*/
function draw_discrete_sample($arr) {
$rand = mt_rand(0, array_sum($arr) - 1);
foreach ($arr as $key => $weight) {
if (($rand -= $weight) < 0) return $key;
}
}
Replace the first line with $rand = mt_rand() / mt_getrandmax() * array_sum($arr); if you want to support non-integer weights / probabilities.
You might also want to have a look at similar questions asked here. If you are only interested in sampling a small set of known distributions, I recommend the analytic approach outlined by Oleg Mikhailov.

Perl: Performance of array-insert using 'splice()' VS linked-list

I have a script in which I use Perl arrays. Each array contains hundreds of thousands of items.
I frequently need to dynamically add items in the middle of an array, or to delete items from it.
I want to understand whether I should use linked lists instead of Perl arrays, as I make frequent insertions and deletions.
So my questions are:
How is splice() implemented?
What is the complexity of splice() when it is used for inserting item x at index i in a Perl array?
Can you recommend a Perl linked-list module that you've worked with?
Thanks!
Perl arrays are stored as an array of pointers, a beginning offset, a length, and an allocated length.
So inserting or deleting from the middle will require moving 4 or 8 bytes times the number of later elements in the array. Deleting from either end won't require moving anything, just adjusting the beginning offset or length. Inserting at the end will usually just require adjusting the length, but occasionally require reallocating the entire array of pointers. Inserting at the beginning, perl will do its best to arrange so that just the beginning offset will need to be adjusted, but sometimes the entire array will need to be moved or even reallocated.
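A rough way to see that asymmetry (a minimal sketch using the core Benchmark module; the size and iteration counts here are just illustrative) is to compare pushing onto the end with splicing into the middle:
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $size = 100_000;

cmpthese(-2, {
    # appending: usually only the length changes, occasionally a reallocation
    push_end => sub {
        my @a = (1 .. $size);
        push @a, 0 for 1 .. 100;
    },
    # mid-array insert: every later pointer must be shifted on each splice
    splice_mid => sub {
        my @a = (1 .. $size);
        splice(@a, $size / 2, 0, 0) for 1 .. 100;
    },
});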
In practice, the overhead of creating and managing a linked list using perl operations is going to be much greater in almost every case than just using an array.
To benchmark it, we would need to know a lot more about your particular case; what actual size of array, what kind and size of elements (not relevant to the cost of splice, but perhaps relevant to a linked list), relative frequency of inserts/deletes, etc.
Did a quick splicing benchmark and it seems to behave as O(N) for both removals and insertions.
Script:
use Time::HiRes;
my $length = shift;
my $toSplice = 100;
my @list = (1 .. $length);
my $t0 = Time::HiRes::time();
for(1 .. $toSplice) {
my $removeIdx = int(rand() * @list);
splice @list, $removeIdx, 1;
}
my $t1 = Time::HiRes::time();
for(1 .. $toSplice) {
my $insertIdx = int(rand() * @list);
splice @list, $insertIdx, 0, 0;
}
printf("Took %.4fs to remove\n", $t1 - $t0);
printf("Took %.4fs to insert\n", Time::HiRes::time() - $t0);
Results:
$ perl test.pl 100000
Took 0.0026s to remove
Took 0.0092s to insert
$ perl test.pl 1000000
Took 0.0296s to remove
Took 0.0617s to insert
$ perl test.pl 10000000
Took 0.2876s to remove
Took 0.6252s to insert
So increasing the number of iterations by 10x increased the run time by roughly 10x.
Your benchmarking of arrays versus linked list is flawed. The arrays method can be sped up using the following:
Create an array of scalars instead of the superfluous array of hash references to match the linked list.
This speeds up execution by a factor of 4.
Since you're just doing a single pass of the list, create a new list instead of trying to splice the old one.
This will increase speed by a factor of 10.
Of course this doubles your memory, but using the linked list increases it by a factor of 5 at least.
The following are benchmarks showing these two improvements. I also simplified the linked list functionality, but the array method is still twice as fast even with improvements to both.
use strict;
use warnings;
use Benchmark;
my $INSERTION_FREQUENCY = 5;
my $num_of_items = shift or die "Specify size of list\n";
timethese(10, {
'linked_list' => sub { linked_list($num_of_items) },
# 'array_splice' => sub { array_splice($num_of_items) },
'array_map' => sub { array_map($num_of_items) },
});
sub linked_list {
my $count = shift;
my $curr_node = my $list_head = {data => 1};
# Creating List
for my $i (2 .. $num_of_items) {
$curr_node = $curr_node->{next} = {
data => $i,
prev => $curr_node,
};
}
# Inserting Items
$curr_node = $list_head;
my $i = 0;
while ($curr_node) {
if (++$i % $INSERTION_FREQUENCY == 0) {
my %new_node = (
data => "inserted",
prev => $curr_node->{"prev"},
next => $curr_node,
);
$curr_node->{"prev"}{"next"} = \%new_node if $curr_node->{"prev"};
$curr_node->{"prev"} = \%new_node;
}
$curr_node = $curr_node->{"next"};
}
return $list_head;
}
sub array_splice {
my $num_of_items = shift;
# Creating Array
my @array = (1..$num_of_items);
# Inserting Items
for my $i (1 .. $num_of_items) {
if ($i % $INSERTION_FREQUENCY == 0) {
splice(@array, $i - 1, 0, "inserted");
}
}
return \@array;
}
sub array_map {
my $num_of_items = shift;
# Creating Array
my @array = (1..$num_of_items);
# Inserting Items
my $i = 0;
@array = map {
++$i % $INSERTION_FREQUENCY == 0 ? ("inserted", $_) : $_
} @array;
return \@array;
}
Benchmarks
$ perl arrays.pl 100000
Benchmark: timing 10 iterations of array_map, array_splice, linked_list...
array_map: 1 wallclock secs ( 0.58 usr + 0.01 sys = 0.59 CPU) @ 16.89/s (n=10)
array_splice: 16 wallclock secs (16.21 usr + 0.00 sys = 16.21 CPU) @ 0.62/s (n=10)
linked_list: 2 wallclock secs ( 1.43 usr + 0.09 sys = 1.53 CPU) @ 6.54/s (n=10)
$ perl arrays.pl 200000
Benchmark: timing 10 iterations of array_map, array_splice, linked_list...
array_map: 1 wallclock secs ( 1.20 usr + 0.05 sys = 1.25 CPU) @ 8.01/s (n=10)
array_splice: 64 wallclock secs (64.10 usr + 0.03 sys = 64.13 CPU) @ 0.16/s (n=10)
linked_list: 3 wallclock secs ( 2.92 usr + 0.23 sys = 3.15 CPU) @ 3.17/s (n=10)
$ perl arrays.pl 500000
Benchmark: timing 10 iterations of array_map, linked_list...
array_map: 4 wallclock secs ( 3.12 usr + 0.36 sys = 3.48 CPU) @ 2.87/s (n=10)
linked_list: 8 wallclock secs ( 7.52 usr + 0.70 sys = 8.22 CPU) @ 1.22/s (n=10)
I've also made a benchmark and wanted to share the results with you.
In the results I got, the linked list is by far faster than Perl arrays.
This is the benchmark I've done:
Created a linked-list or an array with 1M items
Iterated over the list/array and made 200K insertions in place
Checked how much time each scenario took.
Linked-list: 2sec
Perl-array: 1:55min
I share the code with you:
run commands and results:
> time perl_benchmark.pl list 1000000
1.876u 0.124s 0:02.01 99.0% 0+0k 0+0io 0pf+0w
> time perl_benchmark.pl array 1000000
115.159u 0.104s 1:55.27 99.9% 0+0k 0+0io 0pf+0w
Source code:
my $INSERTION_FREQUENCY = 5;
my $use_list = $ARGV[0] eq "list";
my $num_of_items = $ARGV[1];
my $list_header;
my $list_tail;
my @array;
# Creating List or Array
for (my $i = 0 ; $i < $num_of_items ; $i++) {
my %new_node;
$new_node{"data"} = $i;
if ($use_list) {
if (! defined($list_header)) {
$list_header = $list_tail = \%new_node;
} else {
$new_node{"prev"} = $list_tail;
$list_tail->{"next"} = \%new_node;
$list_tail = \%new_node;
}
} else {
push(@array, \%new_node);
}
}
# Inserting Items
my $curr_node = $list_header;
for (my $i = 1 ; $i < $num_of_items ; $i++) {
if ($i % $INSERTION_FREQUENCY == 0) {
my %new_node;
$new_node{"data"} = "inserted";
if ($use_list) {
my $prev_ptr = $curr_node->{"prev"};
if (defined($prev_ptr)) {
$prev_ptr->{"next"} = \%new_node;
}
$new_node{"prev"} = $prev_ptr;
$new_node{"next"} = $curr_node;
$curr_node->{"prev"} = \%new_node
} else {
splice(@array, $i - 1, 0, \%new_node);
}
}
if ($use_list) {
$curr_node = $curr_node->{"next"};
}
}

Matching Values in Array with Tolerance

I'm trying to weed out duplicate values in an array, which I'm successfully accomplishing with the "List::MoreUtils uniq/distinct" function.
However, I would also like to count those values that fall within a given tolerance, say +-5, as duplicates as well (I think tolerance is also sometimes referred to as "delta").
For example, if 588 is a value in the array, but so is 589, because the difference falls within the tolerance of 5, 589 gets the boot.
Without some nasty/costly cross-checking of arrays, is there an elegant way to do this?
EDIT: ikegami brought to my attention some ambiguity in my question and I'm having a bit of a hard time wrapping my head around the problem. However, I think I have it worked out.
[500,505,510,515,525,900]
If you try to match the values throughout the entire array, you should get:
[500,510,525,900]
It hits 505, sees it as non-unique, removes it from the array, then sees 510 as newly unique due to the absence of 505, and so on. This, I imagine, is the way I outlined my original question, but on reflection, it seems it's a useless and fairly arbitrary data set.
What I really want is the following match:
[500,900]
It represents a group of numbers that are within 5 of each other, while also spotting the vast variance in the 900 value. This seems to be more useful information than the former and it appears that perreal's answer gets me close. Sorry for the confusion, and many thanks to ikegami as well as perreal for forcing my clarification.
EDIT 2
An even better match would be:
[510,900]
510, being the median of all the sequential +-5 values.
However, I recognize that now we're deviating severely from my original question, so I would be more than happy with an answer to my EDIT 1 clarification.
Isolate the samples that form a chain where each is within the tolerance of the next, then choose one from that group.
sub collapse {
my $tol = shift;
my @collapsed;
while (@_) {
my @group = shift(@_);
while (@_ && $group[-1] + $tol >= $_[0]) {
push @group, shift(@_);
}
push @collapsed, choose_from(@group);
}
return @collapsed;
}
say join ',', collapse(5 => 500,505,510,515,525,900);
So how do you choose? Well, you could return the average.
use List::Util qw( sum );
sub choose_from {
return sum(@_)/@_;
}
# Outputs: 507.5,525,900
Or you could return the median.
use List::Util qw( sum );
sub choose_from {
my $median;
if (@_ % 2 == 0) {
my $avg = sum(@_)/@_;
my $diff0 = abs( $_[ @_/2 - 1 ] - $avg );
my $diff1 = abs( $_[ @_/2 - 0 ] - $avg );
if ($diff0 <= $diff1) {
return $_[ @_/2 - 1 ];
} else {
return $_[ @_/2 - 0 ];
}
} else {
return $_[ @_/2 ];
}
}
# Outputs: 505,525,900
This is a deceptively complex problem, as the data must not only be organized into groups, but also those groups must be combined if a new data point is seen that belongs to more than one of them.
This program seems to do what you need. It keeps a list of arrays @buckets, where each element contains all values seen so far that are within TOLERANCE of one another. This list is scanned to see if each new value falls within range of the maximum and minimum values already present. The indices of the groups that the value belongs to are stored in @memberof, and there will always be zero, one or two entries in this array.
All the groups specified by @memberof are removed from @buckets, combined together with the new data value, sorted, and replaced as a new group in the list.
At the end the @buckets array is converted to a list of median values, sorted and displayed. I have used Data::Dump to show the contents of the groups before they are aggregated to their median values.
To generate your desired output 510, 900 from the list 500, 510, 525, 900 the value for TOLERANCE must be increased so that values that differ by 15 or less are combined.
use strict;
use warnings;
use constant TOLERANCE => 5;
my @data = qw/ 500 505 510 515 525 900 /;
my @buckets;
for my $item (@data) {
my @memberof;
for my $i (0 .. $#buckets) {
if ($item >= $buckets[$i][0] - TOLERANCE and $item <= $buckets[$i][-1] + TOLERANCE) {
push @memberof, $i;
}
}
my @newbucket = ($item);
for my $i (reverse @memberof) {
push @newbucket, @{ splice @buckets, $i, 1 };
}
push @buckets, [ sort { $a <=> $b } @newbucket ];
}
use Data::Dump;
dd @buckets;
@buckets = sort { $a <=> $b } map median(@$_), @buckets;
print join(', ', @buckets), "\n";
sub median {
my $n = @_;
my $i = $n / 2;
if ($n % 2) {
return $_[$i];
}
else {
return ($_[$i-1] + $_[$i]) / 2;
}
}
output
([500, 505, 510, 515], [525], [900])
507.5, 525, 900

Perl- Reading sorted array elements into a hash with sorted keys

So I have an array (say @array) with sorted values between 0 and 1, and also a hash (say %hash) whose keys are sorted numbers between 0 and 1. The value for each key in the hash is 0. Now, I need to look at each element of @array, find the key in %hash which is immediately smaller than it, and increment the corresponding value by 1. That is, the keys serve as lower bounds for intervals.
if say
@array = (0.15, 0.33, 0.67, 0.87)
and %hash = ("0.25", 0, "0.50", 0, "0.75", 0)
and I take $array[1] = 0.33
then, I need to be able to determine that $array[1] is greater than 0.25 BUT less than 0.5 and, thus, increment the value for "0.25" by 1 giving me an updated hash %hash = ("0.25", 1, "0.50", 0, "0.75", 0).
I hope this made sense! thanks in advance!!!
A hash does not store keys in sorted order. You must rethink your approach to the problem.
You're building a frequency distribution for intervals or ranges. CPAN has modules that will do that. If you can reformulate your problem to agree with how those modules understand frequency distributions, you'll be able to save yourself a little bit of trouble and gain access to other statistical tools that might be useful for your project. An example:
use Statistics::Descriptive;
my @data = (0.15, 0.33, 0.67, 0.87);
my @bins = (0.25, 0.50, 0.75, 1.00);
my $stat = Statistics::Descriptive::Full->new();
$stat->add_data(#data);
my $freq = $stat->frequency_distribution_ref(\@bins);
The distribution in $freq will be a hash reference like this:
$freq = {
'0.25' => 1
'0.5' => 1, # N of items x, such that PREVIOUS_BIN_VAL < x <= .50
'0.75' => 1,
'1' => 1,
};
If you can't modify your problem, then you'll need to compute the distribution yourself, but you can take an important cue from Statistics::Descriptive. In particular, it will be helpful for you to have an ordered list of bin values. Here's an illustration:
my @data = (0.15, 0.33, 0.67, 0.87);
my @bins = (0.25, 0.50, 0.75); # Include 0.0 if you want 0.15 to be tallied.
my %freq = map {$_ => 0} @bins;
for my $d (@data){
for my $b (reverse @bins){
do { $freq{$b} ++; last } if $d >= $b;
}
}
As far as I understood, you want to keep track of how many items in $array are less than the keys in %hash.
So for each key in the hash, you can just retrieve a list of all items from the array that are less than the key and count them. You can use grep for this:
use strict;
use warnings;
use Data::Dumper;
my $array = [qw (0.15 0.33 0.67 0.87 1.5) ] ;
my %hash = (0.25 => 0, 0.50 => 0, 0.75 => 0, 0.05 => 0);
for my $k (keys %hash) {
my @filtered = grep { $_ < $k } @$array;
$hash{$k} = @filtered;
#$hash{$k} = @filtered ? 1 : 0; # if you just want a flag
}
print Dumper(\%hash);
If your hash keys are evenly spaced, like in your example, each can be calculated by a simple formula like $biggestSmaller = int($i*4)/4. If not, you need an auxiliary index like @keys = sort keys %hash - it could also be a binary tree, but this is trivial enough that a simple list ought to do (and if speed is not important, you could even be so lazy as to search bottom up, instead of implementing binary search).
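A minimal sketch of that auxiliary-index idea (the helper names are mine, reusing the example data from the question): binary-search the sorted keys for the largest key that is less than or equal to each value, then increment its count.
use strict;
use warnings;

my @data = (0.15, 0.33, 0.67, 0.87);
my %hash = (0.25 => 0, 0.50 => 0, 0.75 => 0);
my @keys = sort { $a <=> $b } keys %hash;   # auxiliary sorted index

for my $d (@data) {
    # binary search for the largest key <= $d
    my ($lo, $hi) = (0, $#keys);
    my $found;
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($keys[$mid] <= $d) { $found = $keys[$mid]; $lo = $mid + 1 }
        else                   { $hi = $mid - 1 }
    }
    $hash{$found}++ if defined $found;   # 0.15 has no smaller key, so it is skipped
}
# %hash is now (0.25 => 1, 0.5 => 1, 0.75 => 1)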

Resources