Faster way to check for element in array? - arrays

This function does the same as exists does with hashes.
I plan on use it a lot.
Can it be optimized in some way?
my #a = qw/a b c d/;
my $ret = array_exists("b", #a);
sub array_exists {
my ($var, #a) = #_;
foreach my $e (#a) {
if ($var eq $e) {
return 1;
}
}
return 0;
}

If you have to do this a lot on a fixed array, use a hash instead:
my %hash = map { $_, 1 } #array;
if( exists $hash{$key} ) { ... }
Some people reach for the smart match operator, but that's one of the features that we need to remove from Perl. You need to decide if this should match, where the array hold an array reference that has a hash reference with the key b:
use 5.010;
my #a = (
qw(x y z),
[ { 'b' => 1 } ],
);
say 'Matches' if "b" ~~ #a; # This matches
Since the smart match is recursive, if keeps going down into data structures. I write about some of this in Rethinking smart matching.

You can use smart matching, available in Perl 5.10 and later:
if ("b" ~~ #a) {
# "b" exists in #a
}
This should be much faster than a function call.

I'd use List::MoreUtils::any.
my $ret = any { $_ eq 'b' } #a;

Since there are lots of similar questions on StackOverflow where different "correct answers" return different results, I tried to compare them. This question seems to be a good place to share my little benchmark.
For my tests I used a test set (#test_set) of 1,000 elements (strings) of length 10 where only one element ($search_value) matches a given string.
I took the following statements to validate the existence of this element in a loop of 100,000 turns.
_grep
grep( $_ eq $search_value, #test_set )
_hash
{ map { $_ => 1 } #test_set }->{ $search_value }
_hash_premapped
$mapping->{ $search_value }
uses a $mapping that is precalculated as $mapping = { map { $_ => 1 } #test_set } (which is included in the final measuring)
_regex
sub{ my $rx = join "|", map quotemeta, #test_set; $search_value =~ /^(?:$rx)$/ }
_regex_prejoined
$search_value =~ /^(?:$rx)$/
uses a regular expression $rx that is precalculated as $rx = join "|", map quotemeta, #test_set; (which is included in the final measuring)
_manual_first
sub{ foreach ( #test_set ) { return 1 if( $_ eq $search_value ); } return 0; }
_first
first { $_ eq $search_value } #test_set
from List::Util (version 1.38)
_smart
$search_value ~~ #test_set
_any
any { $_ eq $search_value } #test_set
from List::MoreUtils (version 0.33)
On my machine ( Ubuntu, 3.2.0-60-generic, x86_64, Perl v5.14.2 ) I got the following results. The shown values are seconds and returned by gettimeofday and tv_interval of Time::HiRes (version 1.9726).
Element $search_value is located at position 0 in array #test_set
_hash_premapped: 0.056211
_smart: 0.060267
_manual_first: 0.064195
_first: 0.258953
_any: 0.292959
_regex_prejoined: 0.350076
_grep: 5.748364
_regex: 29.27262
_hash: 45.638838
Element $search_value is located at position 500 in array #test_set
_hash_premapped: 0.056316
_regex_prejoined: 0.357595
_first: 2.337911
_smart: 2.80226
_manual_first: 3.34348
_any: 3.408409
_grep: 5.772233
_regex: 28.668455
_hash: 45.076083
Element $search_value is located at position 999 in array #test_set
_hash_premapped: 0.054434
_regex_prejoined: 0.362615
_first: 4.383842
_smart: 5.536873
_grep: 5.962746
_any: 6.31152
_manual_first: 6.59063
_regex: 28.695459
_hash: 45.804386
Conclusion
The fastest method to check the existence of an element in an array is using prepared hashes. You of course buy that by an proportional amount of memory consumption and it only makes sense if you search for elements in the set multiple times. If your task includes small amounts of data and only a single or a few searches, hashes can even be the worst solution. Not the same way fast, but a similar idea would be to use prepared regular expressions, which seem to have a smaller preparation time.
In many cases, a prepared environment is no option.
Surprisingly List::Util::first has very good results, when it comes to the comparison of statements, that don't have a prepared environment. While having the search value at the beginning (which could be perhaps interpreted as the result in smaller sets, too) it is very close to the favourites ~~ and any (and could even be in the range of measurement inaccuracy). For items in the middle or at the end of my larger test set, first is definitely the fastest.

brian d foy suggested using a hash, which gives O(1) lookups, at the cost of slightly more expensive hash creation. There is a technique that Marc Jason Dominus describes in his book Higher Order Perl where by a hash is used to memoize (or cache) results of a sub for a given parameter. So for example, if findit(1000) always returns the same thing for the given parameter, there's no need to recalculate the result every time. The technique is implemented in the Memoize module (part of the Perl core).
Memoizing is not always a win. Sometimes the overhead of the memoized wrapper is greater than the cost of calculating a result. Sometimes a given parameter is unlikely to ever be checked more than once or a relatively few times. And sometimes it cannot be guaranteed that the result of a function for a given parameter will always be the same (ie, the cache can become stale). But if you have an expensive function with stable return values per parameter, memoization can be a big win.
Just as brian d foy's answer uses a hash, Memoize uses a hash internally. There is additional overhead in the Memoize implementation, but the benefit to using Memoize is that it doesn't require refactoring the original subroutine. You just use Memoize; and then memoize( 'expensive_function' );, provided it meets the criteria for benefitting from memoization.
I took your original subroutine and converted it to work with integers (just for simplicity in testing). Then I added a second version that passed a reference to the original array rather than copying the array. With those two versions, I created two more subs that I memoized. I then benchmarked the four subs.
In benchmarking, I had to make some decisions. First, how many iterations to test. The more iterations we test, the more likely we are to have good cache hits for the memoized versions. Then I also had to decide how many items to put into the sample array. The more items, the less likely to have cache hits, but the more significant the savings when a cache hit occurs. I ultimately decided on an array to be searched containing 8000 elements, and chose to search through 24000 iterations. That means that on average there should be two cache hits per memoized call. (The first call with a given param will write to the cache, while the second and third calls will read from the cache, so two good hits on average).
Here is the test code:
use warnings;
use strict;
use Memoize;
use Benchmark qw/cmpthese/;
my $n = 8000; # Elements in target array
my $count = 24000; # Test iterations.
my #a = ( 1 .. $n );
my #find = map { int(rand($n)) } 0 .. $count;
my ( $orx, $ormx, $opx, $opmx ) = ( 0, 0, 0, 0 );
memoize( 'orig_memo' );
memoize( 'opt_memo' );
cmpthese( $count, {
original => sub{ my $ret = original( $find[ $orx++ ], #a ); },
orig_memo => sub{ my $ret = orig_memo( $find[ $ormx++ ], #a ); },
optimized => sub{ my $ret = optimized( $find[ $opx++ ], \#a ); },
opt_memo => sub{ my $ret = opt_memo( $find[ $opmx++ ], \#a ); }
} );
sub original {
my ( $var, #a) = #_;
foreach my $e ( #a ) {
return 1 if $var == $e;
}
return 0;
}
sub orig_memo {
my ( $var, #a ) = #_;
foreach my $e ( #a ) {
return 1 if $var == $e;
}
return 0;
}
sub optimized {
my( $var, $aref ) = #_;
foreach my $e ( #{$aref} ) {
return 1 if $var == $e;
}
return 0;
}
sub opt_memo {
my( $var, $aref ) = #_;
foreach my $e ( #{$aref} ) {
return 1 if $var == $e;
}
return 0;
}
And here are the results:
Rate orig_memo original optimized opt_memo
orig_memo 876/s -- -10% -83% -94%
original 972/s 11% -- -82% -94%
optimized 5298/s 505% 445% -- -66%
opt_memo 15385/s 1657% 1483% 190% --
As you can see, the memoized version of your original function was actually slower. That's because so much of the cost of your original subroutine was spent in making copies of the 8000 element array, combined with the fact that there is additional call-stack and bookkeeping overhead with the memoized version.
But once we pass an array reference instead of a copy, we remove the expense of passing the entire array around. Your speed jumps considerably. But the clear winner is the optimized (ie, passing array refs) version that we memoized (cached), at 1483% faster than your original function. With memoization the O(n) lookup only happens the first time a given parameter is checked. Subsequent lookups occur in O(1) time.
Now you would have to decide (by Benchmarking) whether memoization helps you. Certainly passing an array ref does. And if memoization doesn't help you, maybe brian's hash method is best. But in terms of not having to rewrite much code, memoization combined with passing an array ref may be an excellent alternative.

Your current solution iterates through the array before it finds the element it is looking for. As such, it is a linear algorithm.
If you sort the array first with a relational operator (>for numeric elements, gt for strings) you can use binary search to find the elements. It is a logarithmic algorithm, much faster than linear.
Of course, one must consider the penalty of sorting the array in the first place, which is a rather slow operation (n log n). If the contents of the array you are matching against change often, you must sort after every change, and it gets really slow. If the contents remain the same after you've initially sorted them, binary search ends up being practically faster.

You can use grep:
sub array_exists {
my $val = shift;
return grep { $val eq $_ } #_;
}
Surprisingly, it's not off too far in speed from List::MoreUtils' any(). It's faster if your item is at the end of the list by about 25% and slower by about 50% if your item is at the start of the list.
You can also inline it if needed -- no need to shove it off into a subroutine. i.e.
if ( grep { $needle eq $_ } #haystack ) {
### Do something
...
}

Related

Problem to get more than 1 element with same highest occurrence from array in Perl

I only get the smaller element as output although there are 2 elements with same highest occurrence in array
I have tried to remove sort function from the codes but it still returns me the smaller element
my(#a) = (undef,11,12,13,14,15,13,13,14,14);
my(%count);
foreach my $value (#a) {
$count{$value}++;
}
$max_value = (sort {$count{$b} <=> $count{$a}} #a)[0];
print "Max value = $max_value, occur $count{$max_value} times\n";
Expected result: Max value =13 14, occur 3 times
max_by from List::UtilsBy will return all values that share the maximum in list context.
use strict;
use warnings;
use List::UtilsBy 'max_by';
my #a = (undef,11,12,13,14,15,13,13,14,14);
my %count;
$count{$_}++ for #a;
my #max_values = max_by { $count{$_} } keys %count;
Your code simply takes the first maximal value it finds in the sorted data. You need to continue reading array elements until you reach one that is no longer maximal.
However, as you probably have to test all the hash values there's no great advantage to sorting it. You can just traverse it and keep track of the maximal value(s) you find.
my #a = (undef,11,12,13,14,15,13,13,14,14);
my %count;
$count{$_}++ for #a;
my ($max_count, #max_values);
while ( my ($k,$v) = each %count) {
if ($v > $max_count) {
#max_values = ($k);
$max_count = $v;
}
elsif ($v == $max_count) {
push #max_values, $k;
}
}
my $max_values = join " ", sort #max_values;
print "Max value = $max_values, occur $max_count times\n";
Note that undef is not a valid hash key - it gets converted to "".

Iterate through an array of hashes of array in Perl?

I have trouble visualizing loops and have what I think is an array of hashes of arrays. Please correct me if I am misunderstanding this. I want to be able to loop through the below array and print each key's value.
The End results would print the elements like so:
name
version
pop
tart
Unfortunately, I fall apart when I get to key three.
my #complex = (
[
{
one => 'name',
two => 'version',
three => [qw( pop tart )],
},
],
);
Here's what I've managed so far. I just don't know to handle key three within these loops.
for my $aref (#complex) {
for my $href (#$aref) {
for (keys %{$href}) {
print "$href->{$_}\n";
}
}
}
Any help would be appreciated.
What seems to be holding you back is that your hash has some values which are strings and some which are array references. You can find out which are which using ref and then print accordingly
for my $aref (#complex) {
for my $href (#$aref) {
for my $key (keys %{$href}) {
my $refval = ref $href->{$key};
if (not $refval) { # not a reference at all
print "$href->{$key}\n";
} elsif ($refval eq 'ARRAY') {
print "$_\n" for #{ $href->{$key} };
#print "#{ $href->{$key} }\n"; # or all in one line
} else {
warn "Unexpected data structure: $refval";
}
}
}
}
For deeper structures, or those that you don't know, write a recursive procedure based on this. And then there are modules which will do it, as well.
Note that a careful consideration of what data structures to use pays off handsomely; it is one of the critical parts of design. On the other hand, once these complex data structures grow unwieldy, or rather if you estimate ahead of time that that can happen in the lifetime of the project, the answer is to switch to a class.

Why ever use an array instead of a hash?

I have read that it is much faster to iterate through a hash than through an array. Retrieving values from a hash is also much faster.
Instead of using an array, why not just use a hash and give each key a value corresponding to an index? If the items ever need to be in order, they can be sorted.
Retrieving from hash is faster in a sense that you can fetch value directly by key instead of iterating over whole hash (or array when you're searching for particular string). Having that said, $hash{key} isn't faster than $array[0] as no iteration is taking place.
Arrays can't be replaced by hashes, as they have different features,
arrays hashes
------------------------------------
ordered keys x -
push/pop x -
suitable for looping x -
named keys - x
I don't know where you read that hashes are faster than arrays. According to some Perl reference works (Mastering Algorithms with Perl), arrays are faster than hashes (follow this link for some more info).
If speed is your only criterae, you should benchmark to see which technique is going to be faster. It depends on what operations you will be doing onto the array/hash.
Here is an SO link with some further information: Advantage of 'one dimensional' hash over array in Perl
I think this is a good question: it's not so much a high level "language design" query so much as it is an implementation question. It could be worded in a way that emphasizes that - say using hashes versus arrays for a particular technique or use case.
Hashes are nice but you need lists/arrays (c.f. #RobEarl). You can use tie (or modules like Tie::IxHash or Tie::Hash::Indexed ) to "preserve" the order of a hash, but I believe these would have to be slower than a regular hash and in some cases you can't pass them around or copy them in quite the same way.
This code is more or less how a hash works. It should explain well enough why you would want to use an array instead of a hash.
package DIYHash;
use Digest::MD5;
sub new {
my ($class, $buckets) = #_;
my $self = bless [], $class;
$#$self = $buckets || 32;
return $self;
}
sub fetch {
my ( $self, $key ) = #_;
my $i = $self->_get_bucket_index( $key );
my $bo = $self->_find_key_in_bucket($key);
return $self->[$i][$bo][1];
}
sub store {
my ( $self, $key, $value ) = #_;
my $i = $self->_get_bucket_index( $key );
my $bo = $self->_find_key_in_bucket($key);
$self->[$i][$bo] = [$key, $value];
return $value;
}
sub _find_key_in_bucket {
my ($self, $key, $index) = #_;
my $bucket = $self->[$index];
my $i = undef;
for ( 0..$#$bucket ) {
next unless $bucket->[$_][0] eq $key;
$i = $_;
}
$i = #$bucket unless defined $i;
return $i;
}
# This function needs to always return the same index for a given key.
# It can do anything as long as it always does that.
# I use the md5 hashing algorithm here.
sub _get_bucket_index {
my ( $self, $key ) = #_;
# Get a number from 0 to 1 - bucket count.
my $index = unpack( "I", md5($key) ) % #$self;
return $index;
}
1;
To use this amazing cluster of code:
my $hash = DIYHash->new(4); #This hash has 4 buckets.
$hash->store(mouse => "I like cheese");
$hash->store(cat => "I like mouse");
say $hash->fetch('mouse');
Hashes look like they are constant time, rather than order N because for a given data set, you select a number of buckets that keeps the number of items in any bucket very small.
A proper hashing system would be able to resize as appropriate when the number of collisions gets too high. You don't want to do this often, because it is an order N operation.

Matching Values in Array with Tolerance

I'm trying to weed out duplicate values in an array, which I'm successfully accomplishing with the "List::MoreUtils uniq/distinct" function.
However, I would also like to count those values that fall within a given tolerance, say +-5, as duplicates as well (I think tolerance is also sometimes referred to as "delta").
For example, if 588 is a value in the array, but so is 589, because the difference falls within the tolerance of 5, 589 gets the boot.
Without some nasty/costly cross-checking of arrays, is there an elegant way to do this?
EDIT: ikegami brought to my attention some ambiguity in my question and I'm having a bit of a hard time wrapping my head around the problem. However, I think I have it worked out.
[500,505,510,515,525,900]
If you try to match the values throughout the entire array, you should get:
[500,510,525,900]
It hits 505, sees it as non-unique, removes it from the array, then sees 510 as newly-unique due to the absence of 505, and so on. This, I imagine is the way I outlined my original question, but on reflection, it seems it's a useless and fairly arbitrary data set.
What I really want is the following match:
[500,900]
It represents a group of numbers that are within 5 of each other, while also spotting the vast variance in the 900 value. This seems to be more useful information than the former and it appears that perreal's answer gets me close. Sorry for the confusion, and many thanks to ikegami as well as perreal for forcing my clarification.
EDIT 2
An even better match would be:
[510,900]
510, being the median of all the sequential +-5 values.
However, I recognize that now we're deviating severely from my original question, so I would be more than happy with an answer to my EDIT 1 clarification.
Isolate the samples that form a chain where each is within the tolerance of the next, then choose one from that group.
sub collapse {
my $tol = shift;
my #collapsed;
while (#_) {
my #group = shift(#_);
while (#_ && $group[-1] + $tol >= $_[0]) {
push #group, shift(#_);
}
push #collapsed, choose_from(#group);
}
return #collapsed;
}
say join ',', collapse(5 => 500,505,510,515,525,900);
So how do you choose? Well, you could return the average.
use List::Util qw( sum );
sub choose_from {
return sum(#_)/#_;
}
# Outputs: 507.5,525,900
Or you could return the median.
use List::Util qw( sum );
sub choose_from {
my $median;
if (#_ % 2 == 0) {
my $avg = sum(#_)/#_;
my $diff0 = abs( $_[ #_/2 - 1 ] - $avg );
my $diff1 = abs( $_[ #_/2 - 0 ] - $avg );
if ($diff0 <= $diff1) {
return $_[ #_/2 - 1 ];
} else {
return $_[ #_/2 - 0 ];
}
} else {
return $_[ #_/2 ];
}
}
# Outputs: 505,525,900
This is a deceptively complex problem, as the data must not only be organized into groups, but also those groups must be combined if a new data point is seen that belongs to more than one of them.
This program seems to do what you need. It keeps a list of arrays #buckets, where each element contains all values seen so far that is within TOLERANCE of one other. This list is scanned to see if each value falls within range of the maximum and minimum values already present. The index of the groups that the value belongs to are stored in memberof, and there will always be zero, one or two entries in this array.
All the groups specified by #memberof are removed from #buckets, combined together with the new data value, sorted, and replaced as a new group in the list.
At the end the #buckets array is converted to a list of median values, sorted and displayed. I have used Data::Dump to show the contents of the groups before they are aggregated to their median values.
To generate your desired output 510, 900 from the list 500, 510, 525, 900 the value for TOLERANCE must be increased so that values that differ by 15 or less are combined.
use strict;
use warnings;
use constant TOLERANCE => 5;
my #data = qw/ 500 505 510 515 525 900 /;
my #buckets;
for my $item (#data) {
my #memberof;
for my $i (0 .. $#buckets) {
if ($item >= $buckets[$i][0] - TOLERANCE and $item <= $buckets[$i][-1] + TOLERANCE) {
push #memberof, $i;
}
}
my #newbucket = ($item);
for my $i (reverse #memberof) {
push #newbucket, #{ splice #buckets, $i, 1 };
}
push #buckets, [ sort { $a <=> $b } #newbucket ];
}
use Data::Dump;
dd #buckets;
#buckets = sort { $a <=> $b } map median(#$_), #buckets;
print join(', ', #buckets), "\n";
sub median {
my $n = #_;
my $i = $n / 2;
if ($n % 2) {
return $_[$i];
}
else {
return ($_[$i-1] + $_[$i]) / 2;
}
}
output
([500, 505, 510, 515], [525], [900])
507.5, 525, 900

How to do fast comparison of arrays in perl [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
In Perl, is there a built in way to compare two arrays for equality?
I need to compare arrays with a function that should return:
true if all elements are equal when compared pairwise
true if all elements are equal or the element in the first array is undefined when compared pairwise
false in all other cases
in other words, if the sub is called "comp":
#a = ('a', 'b', undef, 'c');
#b = ('a', 'b', 'f', 'c');
comp(#a, #b); # should return true
comp(#b, #a); # should return false
#a = ('a', 'b');
#b = ('a', 'b', 'f', 'c');
comp(#a, #b); # should return true
the obvious solution would be to do pairwise compares between the two arrays, but I'd like it to be faster than that, as the comparisons are run multiple times over a large set of arrays, the and the arrays may have many elements.
On the other hand, the contents of the arrays to be compared (i.e.: all the possible #b's) is pre-determined and does not change. The elements of the arrays do not have a fixed length, and there is no guarantee as to what chars they might contain (tabs, commas, you name it).
Is there a faster way to do this than pairwise comparison? Smart match won't cut it, as it returns true if all elements are equal (an therefore not if one is undef).
Could packing and doing bitwise comparisons be a strategy? It looks promising when I browse the docs for pack/unpack and vec, but I'm somewhat out of my depth there.
Thanks.
Perl can compare lists of 10,000 pairwise elements in about 100ms on my Macbook, so first thing I'll say is to profile your code to make sure this is actually the problem.
Doing some benchmarking, there's a few things you can do to speed things up.
Make sure to bail on the first failure to match.
Assuming you have a lot of comparisons which don't match, this will save HEAPS of time.
Check up front that the arrays are the same length.
If they arrays aren't the same length, they can never match. Compare their sizes and return early if they're different. This avoids needing to check this case over and over again inside the loop.
Use an iterator instead of a C-style for loop.
Iterating pair-wise you'd normally do something like for( my $idx = 0; $idx <= $#a; $idx += 2 ) but iterating over an array is faster than using a C-style for loop. This is an optimization trick of Perl, its more efficient to do the work inside perl in optimized C than to do it in Perl code. This gains you about 20%-30% depending on how you micro-optimize it.
for my $mark (0..$#{$a}/2) {
my $idx = $mark * 2;
next if !defined $a->[$idx] || !defined $b->[$idx];
return 0 if $a->[$idx] ne $b->[$idx] || $a->[$idx+1] ne $b->[$idx+1];
}
return 1;
Precompute the interesting indexes.
Since one set of pairs is fixed, you can produce an index of which pairs are defined. This makes the iterator even simpler and faster.
state $indexes = precompute_indexes($b);
for my $idx ( #$indexes ) {
next if !defined $a->[$idx];
return 0 if $a->[$idx] ne $b->[$idx] || $a->[$idx+1] ne $b->[$idx+1];
}
return 1;
With no nulls this is a performance boost of 40%. You get more beyond that the more nulls are in your fixed set.
use strict;
use warnings;
use v5.10; # for state
# Compute the indexes of a list of pairs which are interesting for
# comparison: those with defined keys.
sub precompute_indexes {
my $pairs = shift;
die "Unbalanced pairs" if #$pairs % 2 != 0;
my #indexes;
for( my $idx = 0; $idx <= $#$pairs; $idx += 2 ) {
push #indexes, $idx if defined $pairs->[$idx];
}
return \#indexes;
}
sub cmp_pairs_ignore_null_keys {
my($a, $b) = #_;
# state is like my but it will only evaluate once ever.
# It acts like a cache which initializes the first time the
# program is run.
state $indexes = precompute_indexes($b);
# If they don't have the same # of elements, they can never match.
return 0 if #$a != #$b;
for my $idx ( #$indexes ) {
next if !defined $a->[$idx];
return 0 if $a->[$idx] ne $b->[$idx] || $a->[$idx+1] ne $b->[$idx+1];
}
return 1;
}
I'm still convinced this is better to do in SQL with a self-join, but haven't worked that out.

Resources