I'm trying to weed out duplicate values in an array, which I'm successfully accomplishing with the "List::MoreUtils uniq/distinct" function.
However, I would also like to count values that fall within a given tolerance, say +-5, as duplicates (I think this tolerance is also sometimes referred to as a "delta").
For example, if 588 is a value in the array, but so is 589, because the difference falls within the tolerance of 5, 589 gets the boot.
Without some nasty/costly cross-checking of arrays, is there an elegant way to do this?
EDIT: ikegami brought to my attention some ambiguity in my question and I'm having a bit of a hard time wrapping my head around the problem. However, I think I have it worked out.
[500,505,510,515,525,900]
If you try to match the values throughout the entire array, you should get:
[500,510,525,900]
It hits 505, sees it as non-unique, removes it from the array, then sees 510 as newly unique due to the absence of 505, and so on. This, I imagine, is the way I outlined my original question, but on reflection it seems a useless and fairly arbitrary data set.
What I really want is the following match:
[500,900]
It represents a group of numbers that are within 5 of each other, while also spotting the vast variance in the 900 value. This seems to be more useful information than the former and it appears that perreal's answer gets me close. Sorry for the confusion, and many thanks to ikegami as well as perreal for forcing my clarification.
EDIT 2
An even better match would be:
[510,900]
510, being the median of all the sequential +-5 values.
However, I recognize that now we're deviating severely from my original question, so I would be more than happy with an answer to my EDIT 1 clarification.
Isolate the samples that form a chain where each is within the tolerance of the next, then choose one from that group.
use 5.010;  # for say

sub collapse {
    my $tol = shift;
    my @collapsed;
    while (@_) {
        my @group = shift(@_);
        while (@_ && $group[-1] + $tol >= $_[0]) {
            push @group, shift(@_);
        }
        push @collapsed, choose_from(@group);
    }
    return @collapsed;
}

say join ',', collapse(5 => 500,505,510,515,525,900);
So how do you choose? Well, you could return the average.
use List::Util qw( sum );

sub choose_from {
    return sum(@_)/@_;
}

# Outputs: 507.5,525,900
Or you could return the median.
use List::Util qw( sum );

sub choose_from {
    if (@_ % 2 == 0) {
        my $avg = sum(@_)/@_;
        my $diff0 = abs( $_[ @_/2 - 1 ] - $avg );
        my $diff1 = abs( $_[ @_/2 - 0 ] - $avg );
        if ($diff0 <= $diff1) {
            return $_[ @_/2 - 1 ];
        } else {
            return $_[ @_/2 - 0 ];
        }
    } else {
        return $_[ @_/2 ];
    }
}

# Outputs: 505,525,900
This is a deceptively complex problem, as the data must not only be organized into groups, but also those groups must be combined if a new data point is seen that belongs to more than one of them.
This program seems to do what you need. It keeps a list of arrays, @buckets, where each element contains all values seen so far that are within TOLERANCE of one another. This list is scanned to see whether each new value falls within range of the maximum and minimum values already present. The indices of the groups that the value belongs to are stored in @memberof, and there will always be zero, one or two entries in this array.
All the groups specified by @memberof are removed from @buckets, combined together with the new data value, sorted, and replaced as a new group in the list.
At the end the @buckets array is converted to a list of median values, sorted and displayed. I have used Data::Dump to show the contents of the groups before they are aggregated to their median values.
To generate your desired output 510, 900 from the list 500, 510, 525, 900 the value for TOLERANCE must be increased so that values that differ by 15 or less are combined.
use strict;
use warnings;

use constant TOLERANCE => 5;

my @data = qw/ 500 505 510 515 525 900 /;

my @buckets;
for my $item (@data) {
    my @memberof;
    for my $i (0 .. $#buckets) {
        if ($item >= $buckets[$i][0] - TOLERANCE and $item <= $buckets[$i][-1] + TOLERANCE) {
            push @memberof, $i;
        }
    }
    my @newbucket = ($item);
    for my $i (reverse @memberof) {
        push @newbucket, @{ splice @buckets, $i, 1 };
    }
    push @buckets, [ sort { $a <=> $b } @newbucket ];
}

use Data::Dump;
dd @buckets;

@buckets = sort { $a <=> $b } map median(@$_), @buckets;
print join(', ', @buckets), "\n";

sub median {
    my $n = @_;
    my $i = int($n / 2);
    if ($n % 2) {
        return $_[$i];
    }
    else {
        return ($_[$i-1] + $_[$i]) / 2;
    }
}
output
([500, 505, 510, 515], [525], [900])
507.5, 525, 900
Related
I only get the smaller element as output, although there are 2 elements with the same highest occurrence in the array.
I have tried removing the sort function, but it still returns the smaller element.
my(@a) = (undef,11,12,13,14,15,13,13,14,14);
my(%count);
foreach my $value (@a) {
    $count{$value}++;
}
$max_value = (sort {$count{$b} <=> $count{$a}} @a)[0];
print "Max value = $max_value, occur $count{$max_value} times\n";
Expected result: Max value = 13 14, occur 3 times
max_by from List::UtilsBy will return all values that share the maximum in list context.
use strict;
use warnings;
use List::UtilsBy 'max_by';
my @a = (undef,11,12,13,14,15,13,13,14,14);
my %count;
$count{$_}++ for @a;
my @max_values = max_by { $count{$_} } keys %count;
Your code simply takes the first maximal value it finds in the sorted data. You need to continue reading array elements until you reach one that is no longer maximal.
However, as you probably have to test all the hash values there's no great advantage to sorting it. You can just traverse it and keep track of the maximal value(s) you find.
my @a = (undef,11,12,13,14,15,13,13,14,14);
my %count;
$count{$_}++ for @a;

my $max_count = 0;
my @max_values;
while ( my ($k, $v) = each %count ) {
    if ($v > $max_count) {
        @max_values = ($k);
        $max_count  = $v;
    }
    elsif ($v == $max_count) {
        push @max_values, $k;
    }
}
my $max_values = join " ", sort @max_values;
print "Max value = $max_values, occur $max_count times\n";
Note that undef is not a valid hash key - it gets converted to "".
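A quick sketch illustrates that conversion; the `no warnings 'uninitialized'` block just silences the warning Perl would otherwise emit:

```perl
use strict;
use warnings;

my %h;
my @a = (undef, 11);
{
    # undef used as a hash key stringifies to "" (and warns under 'use warnings').
    no warnings 'uninitialized';
    $h{$_}++ for @a;
}
print exists $h{""} ? "undef became the empty-string key\n" : "no\n";
```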
I have read that it is much faster to iterate through a hash than through an array. Retrieving values from a hash is also much faster.
Instead of using an array, why not just use a hash and give each key a value corresponding to an index? If the items ever need to be in order, they can be sorted.
Retrieving from a hash is faster in the sense that you can fetch a value directly by key instead of iterating over the whole hash (or over an array when you're searching for a particular string). That said, $hash{key} isn't faster than $array[0], as no iteration takes place in either case.
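A small Benchmark sketch makes the point that both direct lookups are O(1); the relative numbers are illustrative only and vary by machine:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Both are direct lookups; neither iterates over the container.
my @array = (1 .. 1000);
my %hash  = map { $_ => 1 } @array;

cmpthese(100_000, {
    array_index => sub { my $x = $array[500] },
    hash_lookup => sub { my $x = $hash{500}  },
});
```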
Arrays can't be replaced by hashes, as they have different features,
                       arrays    hashes
--------------------------------------------
ordered keys              x         -
push/pop                  x         -
suitable for looping      x         -
named keys                -         x
I don't know where you read that hashes are faster than arrays. According to some Perl reference works (Mastering Algorithms with Perl), arrays are faster than hashes (follow this link for some more info).
If speed is your only criterion, you should benchmark to see which technique is going to be faster. It depends on what operations you will be performing on the array/hash.
Here is an SO link with some further information: Advantage of 'one dimensional' hash over array in Perl
I think this is a good question: it's not so much a high level "language design" query so much as it is an implementation question. It could be worded in a way that emphasizes that - say using hashes versus arrays for a particular technique or use case.
Hashes are nice, but you need lists/arrays (cf. @RobEarl). You can use tie (or modules like Tie::IxHash or Tie::Hash::Indexed) to "preserve" the order of a hash, but I believe these have to be slower than a regular hash, and in some cases you can't pass them around or copy them in quite the same way.
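If a tied hash feels too heavy, a minimal core-Perl alternative (just a sketch) is a plain hash for O(1) lookup plus an array that remembers insertion order:

```perl
use strict;
use warnings;

my (%h, @order);
for my $k (qw(banana apple cherry)) {
    push @order, $k unless exists $h{$k};  # remember first-insertion order
    $h{$k} = 1;
}
print join(",", @order), "\n";   # banana,apple,cherry
```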
This code is more or less how a hash works. It should explain well enough why you would want to use an array instead of a hash.
package DIYHash;
use Digest::MD5 qw(md5);

sub new {
    my ($class, $buckets) = @_;
    my $self = bless [], $class;
    $#$self = $buckets || 32;
    return $self;
}

sub fetch {
    my ( $self, $key ) = @_;
    my $i  = $self->_get_bucket_index( $key );
    my $bo = $self->_find_key_in_bucket( $key, $i );
    return $self->[$i][$bo][1];
}

sub store {
    my ( $self, $key, $value ) = @_;
    my $i  = $self->_get_bucket_index( $key );
    my $bo = $self->_find_key_in_bucket( $key, $i );
    $self->[$i][$bo] = [$key, $value];
    return $value;
}

sub _find_key_in_bucket {
    my ($self, $key, $index) = @_;
    my $bucket = $self->[$index];
    my $i = undef;
    for ( 0..$#$bucket ) {
        next unless $bucket->[$_][0] eq $key;
        $i = $_;
        last;
    }
    $i = @$bucket unless defined $i;  # Key not found: use the next free slot.
    return $i;
}

# This function needs to always return the same index for a given key.
# It can do anything as long as it always does that.
# I use the md5 hashing algorithm here.
sub _get_bucket_index {
    my ( $self, $key ) = @_;
    # Get a number from 0 to bucket count - 1.
    my $index = unpack( "I", md5($key) ) % @$self;
    return $index;
}

1;
To use this amazing cluster of code:
my $hash = DIYHash->new(4); #This hash has 4 buckets.
$hash->store(mouse => "I like cheese");
$hash->store(cat => "I like mouse");
say $hash->fetch('mouse');
Hashes look like they are constant time, rather than order N because for a given data set, you select a number of buckets that keeps the number of items in any bucket very small.
A proper hashing system would be able to resize as appropriate when the number of collisions gets too high. You don't want to do this often, because it is an order N operation.
The sort of problem I am dealing with involves a few things, namely:
I need to randomly sample numbers from a range of numbers.
That range of numbers is really huge, as from 1 to 1,000,000,000.
I need the sampling process to avoid sampling from intervals within the range that have already been sampled. Since using an array is too slow, my attempts to use splice are not going to work.
I start by picking a number between 1 and 1,000,000,000.
my $random = int(rand(1_000_000_000)) + 1;
I add a value, say 100, to that to make $random and $random + 100 define an interval.
my $interval = $random + 100;
Then I push both $random and $interval into another array. This other array is to store the intervals.
push ( @rememberOldIntervals, $random, $interval );
I step through array @rememberOldIntervals using a for loop, pulling out items in pairs. The first of a pair is a former $random and the other an $interval. Inside this for loop, I do another random number generation, but the number generated can't lie within an interval already taken. If it does, keep sampling until a number is found that is unique. Further, this new random number must be at least 100 away from any old interval.
for ( my $i = 0; $i < (scalar @rememberOldIntervals) / 2; $i += 2 ) {
    $random = int(rand(1_000_000_000)) + 1;
    my $new_random_low  = $random - 100;
    my $new_random_high = $random + 100;

    if ( $new_random_low <= $rememberOldIntervals[0] or
         $new_random_high >= $rememberOldIntervals[1] ) {
        push( @rememberOldIntervals, $new_random_low, $new_random_high );
    }
    else {
        until ( $new_random_low <= $rememberOldIntervals[0] or
                $new_random_high >= $rememberOldIntervals[1] ) {
            $random = int(rand(1_000_000_000)) + 1;
            $new_random_low  = $random - 100;
            $new_random_high = $random + 100;
        }
    }
}
This latter loop would need to be embedded within another to drive it many times, say 10,000 times.
This problem can be reframed into pulling 10,000 random numbers between 0 and 1 billion, where no number is within 100 of another.
Brute Force - 5 secs
Because you're only pulling 10,000 numbers, and probably don't need to do it very often, I suggest approaching this type of problem with brute force initially, following the principle that premature optimization is the root of all evil.
In this case, that means just pulling random numbers and comparing each to all previously pulled numbers. This is O(N^2), but takes less code.
use strict;
use warnings;

my $max   = 1_000_000_000;
my $dist  = 100;
my $count = 10_000;

die "Too many numbers" if 2 * $dist * $count >= $max;

my @numbers;
while (@numbers < $count) {
    my $num = int rand $max;
    push @numbers, $num if ! grep {abs($num - $_) < $dist} @numbers;
}
print scalar(@numbers), "\n";
Output takes 5 seconds:
10000
Binary Search for faster generation - 0.14 secs
Now for a faster algorithm. I agree with ysth that a much more efficient way to solve this is to keep two lists of your random numbers: one is the running list, and the other is sorted. Use the sorted list to binary-search for the placement point, then compare with the nearby elements to see whether the new number is within 100 of them.
This reduces the number of comparisons from O(N^2) to O(N log N). The following takes just 0.14 seconds to run versus the 5 seconds of the brute force method.
use strict;
use warnings;

my $max   = 1_000_000_000;
my $dist  = 100;
my $count = 10_000;

die "Too many numbers" if 2 * $dist * $count >= $max;

my @numbers;
my @sorted = (-$dist, $max); # Include edges to simplify binary search logic.
while (@numbers < $count) {
    my $num = int rand $max;

    # Binary search of sorted list.
    my $binary_min = 0;
    my $binary_max = $#sorted;
    while ($binary_max > $binary_min) {
        my $average = int( ($binary_max + $binary_min) / 2 );
        $binary_max = $average     if $sorted[$average] >= $num;
        $binary_min = $average + 1 if $sorted[$average] <= $num;
    }

    if (! grep {abs($num - $_) < $dist} @sorted[$binary_max, $binary_max - 1]) {
        splice @sorted, $binary_max, 0, $num;
        push @numbers, $num;
    }
}
print scalar(@numbers), "\n";
Hash of quotients for fastest - 0.05 secs
I inquired in the comments: "Could you simplify this problem to pick a random multiple of 100? That would ensure no overlap, and then you'd just need to pick a random number from 1 to 10 million without repeat, and then just multiply it by 100." You didn't respond, but we can still use grouping by multiples of 100 to simplify this problem.
Basically, if we keep track of each number's quotient when divided by 100, we only need to compare it to numbers whose quotients are one above or one below. This reduces the number of comparisons to O(N), which, not surprisingly, is the fastest at 0.05 seconds:
use strict;
use warnings;

my $max   = 1_000_000_000;
my $dist  = 100;
my $count = 10_000;

die "Too many numbers" if 2 * $dist * $count >= $max;

my @numbers;
my %num_per_quot;
while (@numbers < $count) {
    my $num = int rand $max;
    my $quotient = int $num / $dist;
    if (! grep {defined && abs($num - $_) < $dist} map {$num_per_quot{$quotient + $_}} (-1, 0, 1)) {
        push @numbers, $num;
        $num_per_quot{$quotient} = $num;
    }
}
print scalar(@numbers), "\n";
Caution if you're on Windows
If you run this code on Windows and are using a version of perl less than v5.20, you'll need to use a better random number generator than the built-in rand. For reasons why, read avoid using rand if it matters.
I used Math::Random::MT qw(rand); in this code since I'm on Strawberry Perl v5.18.2. However, starting with Perl v5.20 this will no longer be a concern because rand now uses a consistent random number generator.
You can speed it up by using hashes and indices.
This will part the space into indexed segments of width 200, and each interval will be placed randomly in a random segment.
my $interval = 100;
my $space = 1e9;
my $interval_count = 1e4;
my @values;
my %index_taken;

for (1..$interval_count)
{
    my $index;
    1 while $index_taken{ $index = int rand $space/2/$interval }++;
    my $start = $index*2*$interval + 1 + int rand $interval;
    push @values, $start, $start+$interval;
}
It guarantees nonoverlapping intervals but there will be inaccessible space of up to 200 between two intervals.
Or, if you want the intervals sorted:
@values = map { $_ *= 2*$interval; $_ += 1 + int rand $interval; ($_, $_+$interval) }
          sort { $a <=> $b } keys %index_taken;
So I have an array (say @array) with sorted values between 0 and 1, and also a hash (say %hash) whose keys are sorted and are numbers between 0 and 1. The value for each key in the hash is 0. Now, I need to look at each element of @array, find the key in %hash which is immediately smaller than it, and increment the corresponding value by 1. That is, the keys serve as lower bounds for intervals.
If, say,
@array = (0.15,0.33,0.67,0.87)
and %hash = ("0.25", 0, "0.50", 0, "0.75", 0)
and I take $array[1] = 0.33,
then I need to be able to determine that $array[1] is greater than 0.25 BUT less than 0.5 and, thus, increment the value for "0.25" by 1, giving me an updated hash %hash = ("0.25", 1, "0.50", 0, "0.75", 0).
I hope this made sense! thanks in advance!!!
A hash does not store its keys in sorted order. You must rethink your approach to the problem.
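You can, however, impose an order at traversal time by sorting the keys numerically; a minimal sketch:

```perl
use strict;
use warnings;

my %hash = (0.25 => 0, 0.50 => 0, 0.75 => 0);
for my $k (sort { $a <=> $b } keys %hash) {
    print "$k\n";   # 0.25, then 0.5, then 0.75 (numeric keys stringify minimally)
}
```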
You're building a frequency distribution for intervals or ranges. CPAN has modules that will do that. If you can reformulate your problem to agree with how those modules understand frequency distributions, you'll be able to save yourself a little bit of trouble and gain access to other statistical tools that might be useful for your project. An example:
use Statistics::Descriptive;
my @data = (0.15, 0.33, 0.67, 0.87);
my @bins = (0.25, 0.50, 0.75, 1.00);
my $stat = Statistics::Descriptive::Full->new();
$stat->add_data(@data);
my $freq = $stat->frequency_distribution_ref(\@bins);
The distribution in $freq will be a hash reference like this:
$freq = {
    '0.25' => 1,
    '0.5'  => 1, # N of items x, such that PREVIOUS_BIN_VAL < x <= .50
    '0.75' => 1,
    '1'    => 1,
};
If you can't modify your problem, then you'll need to compute the distribution yourself, but you can take an important cue from Statistics::Descriptive. In particular, it will be helpful for you to have an ordered list of bin values. Here's an illustration:
my @data = (0.15, 0.33, 0.67, 0.87);
my @bins = (0.25, 0.50, 0.75); # Include 0.0 if you want 0.15 to be tallied.
my %freq = map {$_ => 0} @bins;

for my $d (@data){
    for my $b (reverse @bins){
        do { $freq{$b} ++; last } if $d >= $b;
    }
}
As far as I understand, you want to keep track of how many items in @array are less than each key in %hash.
So for each key in the hash, you can retrieve all items from the array that are less than the key into a list and take its count. You can use grep for this:
use strict;
use warnings;
use Data::Dumper;

my $array = [qw (0.15 0.33 0.67 0.87 1.5)];
my %hash = (0.25 => 0, 0.50 => 0, 0.75 => 0, 0.05 => 0);

for my $k (keys %hash) {
    my @filtered = grep { $_ < $k } @$array;
    $hash{$k} = @filtered;
    # $hash{$k} = @filtered ? 1 : 0; # if you just want a flag
}
print Dumper(\%hash);
If your hash keys are evenly spaced, as in your example, each can be calculated by a simple formula like $biggestSmaller = int($x*4)/4. If not, you need an auxiliary index like @keys = sort { $a <=> $b } keys %hash - it could also be a binary tree, but this is trivial enough that a simple list ought to do (and if speed is not important, you could even be so lazy as to search bottom-up instead of implementing binary search).
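For evenly spaced keys, the bin can be computed directly rather than searched for; a sketch assuming a spacing of 0.25:

```perl
use strict;
use warnings;

my $step = 0.25;
my %freq = (0.25 => 0, 0.50 => 0, 0.75 => 0);
for my $x (0.33, 0.67, 0.87) {
    my $bin = int($x / $step) * $step;   # largest multiple of $step <= $x
    $freq{$bin}++ if exists $freq{$bin};
}
# %freq is now (0.25 => 1, 0.5 => 1, 0.75 => 1)
```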
This function does the same as exists does with hashes.
I plan on use it a lot.
Can it be optimized in some way?
my @a = qw/a b c d/;
my $ret = array_exists("b", @a);

sub array_exists {
    my ($var, @a) = @_;
    foreach my $e (@a) {
        if ($var eq $e) {
            return 1;
        }
    }
    return 0;
}
If you have to do this a lot on a fixed array, use a hash instead:
my %hash = map { $_, 1 } @array;
if( exists $hash{$key} ) { ... }
Some people reach for the smart match operator, but that's one of the features that we need to remove from Perl. You need to decide if this should match, where the array holds an array reference that has a hash reference with the key b:
use 5.010;
my @a = (
    qw(x y z),
    [ { 'b' => 1 } ],
);

say 'Matches' if "b" ~~ @a; # This matches
Since the smart match is recursive, it keeps going down into data structures. I write about some of this in Rethinking smart matching.
You can use smart matching, available in Perl 5.10 and later:
if ("b" ~~ @a) {
    # "b" exists in @a
}
This should be much faster than a function call.
I'd use List::MoreUtils::any.
my $ret = any { $_ eq 'b' } @a;
Since there are lots of similar questions on StackOverflow where different "correct answers" return different results, I tried to compare them. This question seems to be a good place to share my little benchmark.
For my tests I used a test set (@test_set) of 1,000 elements (strings) of length 10, where only one element ($search_value) matches a given string.
I took the following statements to validate the existence of this element in a loop of 100,000 turns.
_grep
grep( $_ eq $search_value, @test_set )
_hash
{ map { $_ => 1 } @test_set }->{ $search_value }
_hash_premapped
$mapping->{ $search_value }
uses a $mapping that is precalculated as $mapping = { map { $_ => 1 } @test_set } (which is included in the final measuring)
_regex
sub{ my $rx = join "|", map quotemeta, @test_set; $search_value =~ /^(?:$rx)$/ }
_regex_prejoined
$search_value =~ /^(?:$rx)$/
uses a regular expression $rx that is precalculated as $rx = join "|", map quotemeta, @test_set; (which is included in the final measuring)
_manual_first
sub{ foreach ( @test_set ) { return 1 if( $_ eq $search_value ); } return 0; }
_first
first { $_ eq $search_value } @test_set
from List::Util (version 1.38)
_smart
$search_value ~~ @test_set
_any
any { $_ eq $search_value } @test_set
from List::MoreUtils (version 0.33)
On my machine ( Ubuntu, 3.2.0-60-generic, x86_64, Perl v5.14.2 ) I got the following results. The shown values are seconds and returned by gettimeofday and tv_interval of Time::HiRes (version 1.9726).
Element $search_value is located at position 0 in array @test_set
_hash_premapped: 0.056211
_smart: 0.060267
_manual_first: 0.064195
_first: 0.258953
_any: 0.292959
_regex_prejoined: 0.350076
_grep: 5.748364
_regex: 29.27262
_hash: 45.638838
Element $search_value is located at position 500 in array @test_set
_hash_premapped: 0.056316
_regex_prejoined: 0.357595
_first: 2.337911
_smart: 2.80226
_manual_first: 3.34348
_any: 3.408409
_grep: 5.772233
_regex: 28.668455
_hash: 45.076083
Element $search_value is located at position 999 in array @test_set
_hash_premapped: 0.054434
_regex_prejoined: 0.362615
_first: 4.383842
_smart: 5.536873
_grep: 5.962746
_any: 6.31152
_manual_first: 6.59063
_regex: 28.695459
_hash: 45.804386
Conclusion
The fastest method to check the existence of an element in an array is using prepared hashes. You of course pay for that with a proportional amount of memory consumption, and it only makes sense if you search the set multiple times. If your task involves small amounts of data and only a single search or a few, hashes can even be the worst solution. A similar idea, though not as fast, would be to use prepared regular expressions, which seem to have a smaller preparation time.
In many cases, a prepared environment is no option.
Surprisingly, List::Util::first shows very good results when it comes to comparing statements that don't have a prepared environment. With the search value at the beginning (which could perhaps also be read as the result for smaller sets), it is very close to the favourites ~~ and any (and could even be within the range of measurement inaccuracy). For items in the middle or at the end of my larger test set, first is definitely the fastest.
brian d foy suggested using a hash, which gives O(1) lookups, at the cost of slightly more expensive hash creation. There is a technique that Mark Jason Dominus describes in his book Higher Order Perl whereby a hash is used to memoize (or cache) the results of a sub for a given parameter. So for example, if findit(1000) always returns the same thing for the given parameter, there's no need to recalculate the result every time. The technique is implemented in the Memoize module (part of the Perl core).
Memoizing is not always a win. Sometimes the overhead of the memoized wrapper is greater than the cost of calculating a result. Sometimes a given parameter is unlikely to ever be checked more than once or a relatively few times. And sometimes it cannot be guaranteed that the result of a function for a given parameter will always be the same (ie, the cache can become stale). But if you have an expensive function with stable return values per parameter, memoization can be a big win.
Just as brian d foy's answer uses a hash, Memoize uses a hash internally. There is additional overhead in the Memoize implementation, but the benefit to using Memoize is that it doesn't require refactoring the original subroutine. You just use Memoize; and then memoize( 'expensive_function' );, provided it meets the criteria for benefitting from memoization.
I took your original subroutine and converted it to work with integers (just for simplicity in testing). Then I added a second version that passed a reference to the original array rather than copying the array. With those two versions, I created two more subs that I memoized. I then benchmarked the four subs.
In benchmarking, I had to make some decisions. First, how many iterations to test. The more iterations we test, the more likely we are to have good cache hits for the memoized versions. Then I also had to decide how many items to put into the sample array. The more items, the less likely to have cache hits, but the more significant the savings when a cache hit occurs. I ultimately decided on an array to be searched containing 8000 elements, and chose to search through 24000 iterations. That means that on average there should be two cache hits per memoized call. (The first call with a given param will write to the cache, while the second and third calls will read from the cache, so two good hits on average).
Here is the test code:
use warnings;
use strict;
use Memoize;
use Benchmark qw/cmpthese/;

my $n     = 8000;   # Elements in target array.
my $count = 24000;  # Test iterations.

my @a    = ( 1 .. $n );
my @find = map { int(rand($n)) } 0 .. $count;
my ( $orx, $ormx, $opx, $opmx ) = ( 0, 0, 0, 0 );

memoize( 'orig_memo' );
memoize( 'opt_memo' );

cmpthese( $count, {
    original  => sub { my $ret = original(  $find[ $orx++  ], @a  ); },
    orig_memo => sub { my $ret = orig_memo( $find[ $ormx++ ], @a  ); },
    optimized => sub { my $ret = optimized( $find[ $opx++  ], \@a ); },
    opt_memo  => sub { my $ret = opt_memo(  $find[ $opmx++ ], \@a ); }
} );

sub original {
    my ( $var, @a ) = @_;
    foreach my $e ( @a ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub orig_memo {
    my ( $var, @a ) = @_;
    foreach my $e ( @a ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub optimized {
    my ( $var, $aref ) = @_;
    foreach my $e ( @{$aref} ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub opt_memo {
    my ( $var, $aref ) = @_;
    foreach my $e ( @{$aref} ) {
        return 1 if $var == $e;
    }
    return 0;
}
And here are the results:
             Rate orig_memo original optimized opt_memo
orig_memo   876/s        --     -10%      -83%     -94%
original    972/s       11%       --      -82%     -94%
optimized  5298/s      505%     445%        --     -66%
opt_memo  15385/s     1657%    1483%      190%       --
As you can see, the memoized version of your original function was actually slower. That's because so much of the cost of your original subroutine was spent in making copies of the 8000 element array, combined with the fact that there is additional call-stack and bookkeeping overhead with the memoized version.
But once we pass an array reference instead of a copy, we remove the expense of passing the entire array around. Your speed jumps considerably. But the clear winner is the optimized (ie, passing array refs) version that we memoized (cached), at 1483% faster than your original function. With memoization the O(n) lookup only happens the first time a given parameter is checked. Subsequent lookups occur in O(1) time.
Now you would have to decide (by Benchmarking) whether memoization helps you. Certainly passing an array ref does. And if memoization doesn't help you, maybe brian's hash method is best. But in terms of not having to rewrite much code, memoization combined with passing an array ref may be an excellent alternative.
Your current solution iterates through the array before it finds the element it is looking for. As such, it is a linear algorithm.
If you sort the array first with a relational operator (> for numeric elements, gt for strings) you can use binary search to find the elements. It is a logarithmic algorithm, much faster than linear.
Of course, one must consider the penalty of sorting the array in the first place, which is a rather slow operation (n log n). If the contents of the array you are matching against change often, you must sort after every change, and it gets really slow. If the contents remain the same after you've initially sorted them, binary search ends up being practically faster.
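A minimal binary-search existence check might look like this (a sketch; it assumes the array has already been sorted numerically):

```perl
use strict;
use warnings;

# Classic binary search: halve the search range until the needle
# is found or the range is empty.
sub binsearch_exists {
    my ($needle, $aref) = @_;
    my ($lo, $hi) = (0, $#$aref);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($aref->[$mid] < $needle) { $lo = $mid + 1 }
        elsif ($aref->[$mid] > $needle) { $hi = $mid - 1 }
        else  { return 1 }
    }
    return 0;
}

my @sorted = sort { $a <=> $b } (15, 3, 42, 8, 23);
print binsearch_exists(23, \@sorted) ? "found\n" : "not found\n";   # found
```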
You can use grep:
sub array_exists {
    my $val = shift;
    return grep { $val eq $_ } @_;
}
Surprisingly, it's not far off in speed from List::MoreUtils' any(). It's faster by about 25% if your item is at the end of the list, and slower by about 50% if your item is at the start of the list.
You can also inline it if needed -- no need to shove it off into a subroutine. i.e.
if ( grep { $needle eq $_ } @haystack ) {
    ### Do something
    ...
}