Why ever use an array instead of a hash? - arrays

I have read that it is much faster to iterate through a hash than through an array. Retrieving values from a hash is also much faster.
Instead of using an array, why not just use a hash and give each key a value corresponding to an index? If the items ever need to be in order, they can be sorted.

Retrieving from a hash is faster in the sense that you can fetch a value directly by key instead of iterating over the whole hash (or the whole array, when you're searching for a particular string). That said, $hash{key} isn't faster than $array[0], as no iteration takes place in either case.
Arrays can't be replaced by hashes, as they have different features:
                       arrays  hashes
------------------------------------
ordered keys             x       -
push/pop                 x       -
suitable for looping     x       -
named keys               -       x
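A small sketch of the differences listed in the table; the variable names and data are made up for illustration:

```perl
use strict;
use warnings;

# Arrays keep their elements in order and support stack operations.
my @queue = ('first', 'second');
push @queue, 'third';            # append at the end
my $last = pop @queue;           # remove from the end again

# Hashes give direct access by name, but keys come back in no particular order.
my %age = (alice => 30, bob => 25);
my $bob_age = $age{bob};

# To walk a hash in a predictable order you have to sort the keys yourself.
my @names = sort keys %age;
```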

I don't know where you read that hashes are faster than arrays. According to some Perl reference works (Mastering Algorithms with Perl), arrays are faster than hashes (follow this link for some more info).
If speed is your only criterion, you should benchmark to see which technique is going to be faster. It depends on which operations you will be performing on the array or hash.
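One way to run such a benchmark is the core Benchmark module. The following is only a sketch: the data sizes and the two operations compared are assumptions, and you would substitute the operations your own program performs.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: the same 1000 values stored both ways.
my @array = (1 .. 1000);
my %hash  = map { $_ => $_ } 0 .. 999;

# Run each snippet for about one CPU second and print a comparison table.
cmpthese( -1, {
    array_index => sub { my $v = $array[500] },
    hash_key    => sub { my $v = $hash{500} },
} );
```

The numbers depend heavily on the Perl version and the machine, so treat them as relative, not absolute.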
Here is an SO link with some further information: Advantage of 'one dimensional' hash over array in Perl

I think this is a good question: it's not so much a high level "language design" query so much as it is an implementation question. It could be worded in a way that emphasizes that - say using hashes versus arrays for a particular technique or use case.
Hashes are nice but you need lists/arrays (cf. @RobEarl). You can use tie (or modules like Tie::IxHash or Tie::Hash::Indexed) to "preserve" the order of a hash, but I believe these would have to be slower than a regular hash, and in some cases you can't pass them around or copy them in quite the same way.
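If all you need is insertion order, a plain alternative to tie is to keep an array of keys next to the hash. This is a different technique from Tie::IxHash, sketched here with hypothetical names (store_ordered, %value_for, @key_order):

```perl
use strict;
use warnings;

my (%value_for, @key_order);

# Hypothetical helper: store a pair and remember when the key first appeared.
sub store_ordered {
    my ($key, $value) = @_;
    push @key_order, $key unless exists $value_for{$key};
    $value_for{$key} = $value;
    return $value;
}

store_ordered( zebra => 1 );
store_ordered( apple => 2 );
store_ordered( zebra => 3 );    # updates the value, keeps the original position

# Walk the pairs in insertion order.
my @pairs = map { $_ . '=' . $value_for{$_} } @key_order;
```

The array carries the order and the hash carries the lookup, which is exactly the division of labour the table above describes.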

This code is more or less how a hash works. It should explain well enough why you would want to use an array instead of a hash.
package DIYHash;
use strict;
use warnings;
use Digest::MD5 qw(md5);
sub new {
    my ($class, $buckets) = @_;
    my $self = bless [], $class;
    # Pre-extend the array so it holds exactly the requested number of buckets.
    $#$self = ($buckets || 32) - 1;
    return $self;
}
sub fetch {
    my ( $self, $key ) = @_;
    my $i  = $self->_get_bucket_index( $key );
    my $bo = $self->_find_key_in_bucket( $key, $i );
    my $pair = $self->[$i][$bo];
    return $pair ? $pair->[1] : undef;
}
sub store {
    my ( $self, $key, $value ) = @_;
    my $i  = $self->_get_bucket_index( $key );
    my $bo = $self->_find_key_in_bucket( $key, $i );
    $self->[$i][$bo] = [$key, $value];
    return $value;
}
# Returns the position of $key within bucket $index, or the next free
# position in that bucket if the key is not stored yet.
sub _find_key_in_bucket {
    my ($self, $key, $index) = @_;
    my $bucket = $self->[$index] ||= [];
    my $i = undef;
    for ( 0..$#$bucket ) {
        next unless $bucket->[$_][0] eq $key;
        $i = $_;
        last;
    }
    $i = @$bucket unless defined $i;
    return $i;
}
# This function needs to always return the same index for a given key.
# It can do anything as long as it always does that.
# I use the md5 hashing algorithm here.
sub _get_bucket_index {
    my ( $self, $key ) = @_;
    # Get a number from 0 to bucket count - 1.
    my $index = unpack( "I", md5($key) ) % @$self;
    return $index;
}
1;
To use this amazing cluster of code:
use feature 'say';
my $hash = DIYHash->new(4); # This hash has 4 buckets.
$hash->store(mouse => "I like cheese");
$hash->store(cat => "I like mouse");
say $hash->fetch('mouse');
Hashes look like they are constant time, rather than order N because for a given data set, you select a number of buckets that keeps the number of items in any bucket very small.
A proper hashing system would be able to resize as appropriate when the number of collisions gets too high. You don't want to do this often, because it is an order N operation.
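A rough sketch of what such a resize could look like, written as standalone functions rather than as part of DIYHash. The names bucket_index and resize and the doubling strategy are assumptions for illustration; real hash implementations differ in the details.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: map a key to a bucket number for a given bucket count.
sub bucket_index {
    my ($key, $bucket_count) = @_;
    return unpack( "I", md5($key) ) % $bucket_count;
}

# Build a bucket list with twice as many buckets and re-insert every
# [key, value] pair. Every pair is visited once, which is why this is order N.
sub resize {
    my ($old) = @_;
    my $new_count = 2 * @$old;
    my @new = map { [] } 1 .. $new_count;
    for my $bucket (@$old) {
        for my $pair (@$bucket) {
            push @{ $new[ bucket_index( $pair->[0], $new_count ) ] }, $pair;
        }
    }
    return \@new;
}

# Four buckets holding four pairs, then grown to eight buckets.
my $buckets = [ map { [] } 1 .. 4 ];
for my $key (qw(mouse cat dog bird)) {
    push @{ $buckets->[ bucket_index( $key, 4 ) ] }, [ $key, length $key ];
}
$buckets = resize($buckets);
```

Because a key's bucket depends on the bucket count, every pair has to be rehashed; nothing can simply be copied across.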

Related

Problem to get more than 1 element with same highest occurrence from array in Perl

I only get the smaller element as output although there are 2 elements with same highest occurrence in array
I have tried to remove sort function from the codes but it still returns me the smaller element
my(@a) = (undef,11,12,13,14,15,13,13,14,14);
my(%count);
foreach my $value (@a) {
    $count{$value}++;
}
$max_value = (sort {$count{$b} <=> $count{$a}} @a)[0];
print "Max value = $max_value, occur $count{$max_value} times\n";
Expected result: Max value =13 14, occur 3 times
max_by from List::UtilsBy will return all values that share the maximum in list context.
use strict;
use warnings;
use List::UtilsBy 'max_by';
my @a = (undef,11,12,13,14,15,13,13,14,14);
my %count;
$count{$_}++ for @a;
my @max_values = max_by { $count{$_} } keys %count;
Your code simply takes the first maximal value it finds in the sorted data. You need to continue reading array elements until you reach one that is no longer maximal.
However, as you probably have to test all the hash values there's no great advantage to sorting it. You can just traverse it and keep track of the maximal value(s) you find.
my @a = (undef,11,12,13,14,15,13,13,14,14);
my %count;
$count{$_}++ for @a;
my ($max_count, @max_values) = (0);
while ( my ($k,$v) = each %count) {
    if ($v > $max_count) {
        @max_values = ($k);
        $max_count = $v;
    }
    elsif ($v == $max_count) {
        push @max_values, $k;
    }
}
my $max_values = join " ", sort @max_values;
print "Max value = $max_values, occur $max_count times\n";
Note that undef is not a valid hash key - it gets converted to "".
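A short sketch of that conversion, and of one way to keep undef out of the tally (filtering with grep is my suggestion, not part of the code above):

```perl
use strict;
use warnings;

my @a = (undef, 11, 12);
my %count;
{
    # Using undef as a hash key triggers an "uninitialized" warning,
    # so silence it for this demonstration only.
    no warnings 'uninitialized';
    $count{$_}++ for @a;
}

my $has_empty_key = exists $count{''};    # true: undef was stored under ""

# Filtering first keeps undef out of the tally entirely.
my %clean;
$clean{$_}++ for grep { defined } @a;
```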

De-reference x number of times for x number of data structures

I've come across an obstacle in one of my perl scripts that I've managed to solve, but I don't really understand why it works the way it works. I've been scouring the internet but I haven't found a proper explanation.
I have a subroutine that returns a reference to a hash of arrays. The hash keys are simple strings, and the values are references to arrays.
I print out the elements of the array associated with each key, like this
for my $job_name (keys %$build_numbers) {
print "$job_name => ";
my @array = @{@$build_numbers{$job_name}}; # line 3
for my $item ( @array ) {
print "$item \n";
}
}
While I am able to print out the keys & values, I don't really understand the syntax behind line 3.
Our data structure is as follows:
Reference to a hash whose values are references to the populated arrays.
To extract the elements of the array, we have to:
- dereference the hash reference so we can access the keys
- dereference the array reference associated to a key to extract elements.
Final question being:
When dealing with perl hashes of hashes of arrays etc; to extract the elements at the "bottom" of the respective data structure "tree" we have to dereference each level in turn to reach the original data structures, until we obtain our desired level of elements?
Hopefully somebody could help out by clarifying.
Line 3 is taking a slice of your hash reference, but it's a very strange way to do what you're trying to do because a) you normally wouldn't slice a single element and b) there's cleaner and more obvious syntax that would make your code easier to read.
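To make the slice point concrete, here is a sketch with made-up data showing that the one-element slice and the ordinary element access hand back the same array reference:

```perl
use strict;
use warnings;

# Hypothetical data in the same shape as the question's structure.
my $build_numbers = { nightly => [ 101, 102 ], release => [ 7 ] };
my $job_name = 'nightly';

# A one-element hash slice returns a one-element list ...
my ($via_slice) = @$build_numbers{$job_name};

# ... holding the same array reference as the plain element access.
my $via_arrow = $build_numbers->{$job_name};

# Either way, one more dereference yields the array's contents.
my @items = @{ $build_numbers->{$job_name} };
```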
If your data looks something like this:
my $data = {
foo => [0 .. 9],
bar => ['A' .. 'F'],
};
Then the correct version of your example would be:
for my $key (keys(%$data)) {
print "$key => ";
for my $val (@{$data->{$key}}) {
print "$val ";
}
print "\n";
}
Which produces:
bar => A B C D E F
foo => 0 1 2 3 4 5 6 7 8 9
If I understand your second question, the answer is that you can access precise locations of complex data structures if you use the correct syntax. For example:
print "$data->{bar}->[4]\n";
Will print E.
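The same syntax extends to deeper structures. A sketch with a hypothetical hash of hashes of arrays; note that the arrow is optional between adjacent subscripts:

```perl
use strict;
use warnings;

# Hypothetical three-level structure: hash -> hash -> array.
my $tree = {
    build => {
        nightly => [ 101, 102, 103 ],
    },
};

# Reach a single element directly, with and without the optional arrows.
my $second      = $tree->{build}->{nightly}->[1];
my $second_too  = $tree->{build}{nightly}[1];

# Dereference only the level you need as a whole.
my @all_nightly = @{ $tree->{build}{nightly} };
```

So you only dereference "in turn" in the sense that each subscript steps one level down; no intermediate copies are needed.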
Additional recommended reading: perlref, perlreftut, and perldsc
Working with data structures can be hard depending on how it was made.
I am not sure if your "job" data structure is exactly this but:
#!/usr/bin/env perl
use strict;
use warnings;
use diagnostics;
my $hash_ref = {
job_one => [ 'one', 'two'],
job_two => [ '1','2'],
};
foreach my $job ( keys %{$hash_ref} ){
print " Job => $job\n";
my @array = @{$hash_ref->{$job}};
foreach my $item ( @array )
{
print "Job: $job Item $item\n";
}
}
You have a hash reference; you iterate over its keys, whose values are array references. Each item of such an array could in turn be another reference or a simple scalar.
Basically you can work with the ref or undo the ref like you did in the first loop.
There is a piece of documentation you can check for more details here.
So answering your question:
Final question being: - When dealing with perl hashes of hashes of
arrays etc; to extract the elements at the "bottom" of the respective
data structure "tree" we have to dereference each level in turn to
reach the original data structures, until we obtain our desired level
of elements?
It depends on how your data structure was built. If you already know what you are looking for, getting the value is simple. For example:
my %city_codes = (
a => 1, b => 2,
);
my $value = $city_codes{a};
Complex data structures come with complex code.

Perl Hash Trouble

An easy one for a Perl guru...
I want a function that simply takes in an array of items (actually multiple arrays) and counts the number of times each item in the key section of a hash is there. However, I am really unsure of Perl hashes.
@array = qw/banana apple orange apple orange apple pear/;
I read that you need to do arrays using code like this:
my %hash = (
'banana' => 0,
'orange' => 0,
'apple' => 0
#I intentionally left out pear... I only want the values in the array...
);
However, I am struggling getting a loop to work that can go through and add one to the value with a corresponding key equal to a value in the array for each item in the array.
foreach $fruit (@array) {
if ($_ #is equal to a key in the hash) {
#Add one to the corresponding value
}
}
This has a few basic functions all wrapped up in one, so on behalf of all beginning Perl programmers, thank you in advance!
All you need is
my @array = qw/banana apple orange apple orange apple pear/;
my %counts;
++$counts{$_} for @array;
This results in a hash like
my %counts = ( apple => 3, banana => 1, orange => 2, pear => 1 )
The for loop can be written with a block and an explicit loop variable if you prefer, like this
for my $word (@array) {
++$counts{$word};
}
with exactly the same effect.
You can use exists.
http://perldoc.perl.org/functions/exists.html
Given an expression that specifies an element of a hash, returns true
if the specified element in the hash has ever been initialized, even
if the corresponding value is undefined.
foreach my $fruit (@array) {
if (exists $hash{$fruit}) {
$hash{$fruit}++;
}
}
Suppose you have an array named @array. You'd access the 0th element of the array with $array[0].
Hashes are similar. %hash's banana element can be accessed with $hash{'banana'}.
Here's a pretty simple example. It makes use of the implicit variable $_ and a little bit of string interpolation:
use strict;
my @array = qw/banana apple orange apple orange apple pear/;
my %hash;
$hash{$_} += 1 for @array; #add one for each fruit in the list
print "$_: $hash{$_}\n" for keys %hash; #print our results
If needed, you can check if a particular hash key exists: if (exists $hash{'banana'}) {...}.
You'll eventually get to see something called a "hashref", which is not a hash but a reference to a hash. In that case, the banana element of $hashref is accessed with $hashref->{'banana'}.
I'm trying to understand you here:
You have an array and a hash.
You want to count the items in the array and see how many time they occur
But, only if this item is in your hash.
Think of hashes as keyed arrays. Arrays have a position. You can talk about the 0th element, or the 5th element. There is only one 0th element and there is only one 5th element.
Let's look at a hash:
my %jobs;
$jobs{bob} = "banker";
$jobs{sue} = "banker";
$jobs{joe} = "plumber";
Just as we can talk about the element in the array in the 0th position, we can talk about the element with the key of bob. Just as there is only one element in the 0th position, there can only be one element with a key of bob.
Hashes provide a quick way to look up information. For example, I can quickly find out Sue's job:
print "Sue is a $jobs{sue}\n";
We have:
An array filled with items.
A hash with the items we want to count
Another hash with the totals.
Here's the code:
use strict;
use warnings;
use feature qw(say);
my @items = qw(.....); # Items we want to count
my %valid_items = (....); # The valid items we want
# Initializing the totals. Explained below...
my %totals;
map { $totals{$_} = 0; } keys %valid_items;
for my $item ( @items ) {
if ( exists $valid_items{$item} ) {
$totals{$item} += 1; #Add one to the total number of items
}
}
#
# Now print the totals
#
for my $item ( sort keys %totals ) {
printf "%-10.10s %4d\n", $item, $totals{$item};
}
The map command takes the list of items on the right side (in our case keys %valid_items), and loops through the entire list.
Thus:
map { $totals{$_} = 0; } keys %valid_items;
Is a short way of saying:
for ( keys %valid_items ) {
$totals{$_} = 0;
}
The other things I use are keys which returns as an array (okay list) all of the keys of my hash. Thus, I get back apple, banana, and oranges when I say keys %valid_items.
exists is a test to see if a particular key is in my hash (see http://perldoc.perl.org/functions/exists.html). The value of that key might be zero, a null string, or even an undefined value, but if the key is in my hash, the exists function will return a true value.
However, if we can use exists to see if a key is in my %valid_items hash, we could do the same with %totals. They have the same set of keys.
Instead of creating a %valid_items hash, I'm going to use a @valid_items array because arrays are easier to initialize than hashes. I just have to list the values. Instead of using keys %valid_items to get a list of the keys, I can use @valid_items:
use strict;
use warnings;
use feature qw(say);
my @items = qw(banana apple orange apple orange apple pear); # Items we want to count
my @valid_items = qw(banana apple orange); # The valid items we want
my %totals;
map { $totals{$_} = 0; } @valid_items;
# Now %totals is storing our totals and is the list of valid items
for my $item ( @items ) {
if ( exists $totals{$item} ) {
$totals{$item} += 1; #Add one to the total number of items
}
}
#
# Now print the totals
#
for my $item ( sort keys %totals ) {
printf "%-10.10s %4d\n", $item, $totals{$item};
}
And this prints out:
apple         3
banana        1
orange        2
I like using printf for keeping tables nice and orderly.
This will be easier to understand as I too started to write code just 2 months back.
use Data::Dumper;
use strict;
use warnings;
my @array = qw/banana apple orange apple orange apple pear/;
my %hashvar;
foreach my $element (@array) {
#Check whether the element is already added into hash ; if yes increment; else add.
if (defined $hashvar{$element}) {
$hashvar{$element}++;
}
else {
$hashvar{$element} = 1;
}
}
print Dumper(\%hashvar);
Will print out the output as
$VAR1 = {
'banana' => 1,
'apple' => 3,
'orange' => 2,
'pear' => 1
};
Cheers

Faster way to check for element in array?

This function does the same as exists does with hashes.
I plan on using it a lot.
Can it be optimized in some way?
my @a = qw/a b c d/;
my $ret = array_exists("b", @a);
sub array_exists {
my ($var, @a) = @_;
foreach my $e (@a) {
if ($var eq $e) {
return 1;
}
}
return 0;
}
If you have to do this a lot on a fixed array, use a hash instead:
my %hash = map { $_, 1 } @array;
if( exists $hash{$key} ) { ... }
Some people reach for the smart match operator, but that's one of the features that we need to remove from Perl. You need to decide if this should match, where the array holds an array reference that has a hash reference with the key b:
use 5.010;
my #a = (
qw(x y z),
[ { 'b' => 1 } ],
);
say 'Matches' if "b" ~~ @a; # This matches
Since the smart match is recursive, it keeps going down into data structures. I write about some of this in Rethinking smart matching.
You can use smart matching, available in Perl 5.10 and later:
if ("b" ~~ @a) {
# "b" exists in @a
}
This should be much faster than a function call.
I'd use List::MoreUtils::any.
my $ret = any { $_ eq 'b' } @a;
Since there are lots of similar questions on StackOverflow where different "correct answers" return different results, I tried to compare them. This question seems to be a good place to share my little benchmark.
For my tests I used a test set (@test_set) of 1,000 elements (strings) of length 10 where only one element ($search_value) matches a given string.
I took the following statements to validate the existence of this element in a loop of 100,000 turns.
_grep
grep( $_ eq $search_value, @test_set )
_hash
{ map { $_ => 1 } @test_set }->{ $search_value }
_hash_premapped
$mapping->{ $search_value }
uses a $mapping that is precalculated as $mapping = { map { $_ => 1 } @test_set } (which is included in the final measuring)
_regex
sub{ my $rx = join "|", map quotemeta, @test_set; $search_value =~ /^(?:$rx)$/ }
_regex_prejoined
$search_value =~ /^(?:$rx)$/
uses a regular expression $rx that is precalculated as $rx = join "|", map quotemeta, @test_set; (which is included in the final measuring)
_manual_first
sub{ foreach ( @test_set ) { return 1 if( $_ eq $search_value ); } return 0; }
_first
first { $_ eq $search_value } @test_set
from List::Util (version 1.38)
_smart
$search_value ~~ @test_set
_any
any { $_ eq $search_value } @test_set
from List::MoreUtils (version 0.33)
On my machine ( Ubuntu, 3.2.0-60-generic, x86_64, Perl v5.14.2 ) I got the following results. The shown values are seconds and returned by gettimeofday and tv_interval of Time::HiRes (version 1.9726).
Element $search_value is located at position 0 in array @test_set
_hash_premapped: 0.056211
_smart: 0.060267
_manual_first: 0.064195
_first: 0.258953
_any: 0.292959
_regex_prejoined: 0.350076
_grep: 5.748364
_regex: 29.27262
_hash: 45.638838
Element $search_value is located at position 500 in array @test_set
_hash_premapped: 0.056316
_regex_prejoined: 0.357595
_first: 2.337911
_smart: 2.80226
_manual_first: 3.34348
_any: 3.408409
_grep: 5.772233
_regex: 28.668455
_hash: 45.076083
Element $search_value is located at position 999 in array @test_set
_hash_premapped: 0.054434
_regex_prejoined: 0.362615
_first: 4.383842
_smart: 5.536873
_grep: 5.962746
_any: 6.31152
_manual_first: 6.59063
_regex: 28.695459
_hash: 45.804386
Conclusion
The fastest method to check the existence of an element in an array is using prepared hashes. You of course pay for that with a proportional amount of memory consumption, and it only makes sense if you search for elements in the set multiple times. If your task includes small amounts of data and only a single or a few searches, hashes can even be the worst solution. Not quite as fast, but a similar idea, would be to use prepared regular expressions, which seem to have a smaller preparation time.
In many cases, a prepared environment is no option.
Surprisingly List::Util::first has very good results, when it comes to the comparison of statements, that don't have a prepared environment. While having the search value at the beginning (which could be perhaps interpreted as the result in smaller sets, too) it is very close to the favourites ~~ and any (and could even be in the range of measurement inaccuracy). For items in the middle or at the end of my larger test set, first is definitely the fastest.
brian d foy suggested using a hash, which gives O(1) lookups, at the cost of slightly more expensive hash creation. There is a technique that Mark Jason Dominus describes in his book Higher-Order Perl whereby a hash is used to memoize (or cache) results of a sub for a given parameter. So for example, if findit(1000) always returns the same thing for the given parameter, there's no need to recalculate the result every time. The technique is implemented in the Memoize module (part of the Perl core).
Memoizing is not always a win. Sometimes the overhead of the memoized wrapper is greater than the cost of calculating a result. Sometimes a given parameter is unlikely to ever be checked more than once or a relatively few times. And sometimes it cannot be guaranteed that the result of a function for a given parameter will always be the same (ie, the cache can become stale). But if you have an expensive function with stable return values per parameter, memoization can be a big win.
Just as brian d foy's answer uses a hash, Memoize uses a hash internally. There is additional overhead in the Memoize implementation, but the benefit to using Memoize is that it doesn't require refactoring the original subroutine. You just use Memoize; and then memoize( 'expensive_function' );, provided it meets the criteria for benefitting from memoization.
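A minimal sketch of that two-line setup. The sub expensive_function and the $calls counter are hypothetical stand-ins, there only to show that the second call never reaches the original code:

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;

# Hypothetical stand-in for an expensive computation whose result
# depends only on its argument.
sub expensive_function {
    my ($n) = @_;
    $calls++;
    return $n * $n;
}

# Replace the sub with a caching wrapper; the sub body itself is untouched.
memoize('expensive_function');

my $first  = expensive_function(12);   # computed, result cached
my $second = expensive_function(12);   # served from the cache
```

After both calls $calls is still 1, because the cache answered the second one.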
I took your original subroutine and converted it to work with integers (just for simplicity in testing). Then I added a second version that passed a reference to the original array rather than copying the array. With those two versions, I created two more subs that I memoized. I then benchmarked the four subs.
In benchmarking, I had to make some decisions. First, how many iterations to test. The more iterations we test, the more likely we are to have good cache hits for the memoized versions. Then I also had to decide how many items to put into the sample array. The more items, the less likely to have cache hits, but the more significant the savings when a cache hit occurs. I ultimately decided on an array to be searched containing 8000 elements, and chose to search through 24000 iterations. That means that on average there should be two cache hits per memoized call. (The first call with a given param will write to the cache, while the second and third calls will read from the cache, so two good hits on average).
Here is the test code:
use warnings;
use strict;
use Memoize;
use Benchmark qw/cmpthese/;
my $n = 8000; # Elements in target array
my $count = 24000; # Test iterations.
my @a = ( 1 .. $n );
my @find = map { int(rand($n)) } 0 .. $count;
my ( $orx, $ormx, $opx, $opmx ) = ( 0, 0, 0, 0 );
memoize( 'orig_memo' );
memoize( 'opt_memo' );
cmpthese( $count, {
original => sub{ my $ret = original( $find[ $orx++ ], @a ); },
orig_memo => sub{ my $ret = orig_memo( $find[ $ormx++ ], @a ); },
optimized => sub{ my $ret = optimized( $find[ $opx++ ], \@a ); },
opt_memo => sub{ my $ret = opt_memo( $find[ $opmx++ ], \@a ); }
} );
sub original {
my ( $var, @a) = @_;
foreach my $e ( @a ) {
return 1 if $var == $e;
}
return 0;
}
sub orig_memo {
my ( $var, @a ) = @_;
foreach my $e ( @a ) {
return 1 if $var == $e;
}
return 0;
}
sub optimized {
my( $var, $aref ) = @_;
foreach my $e ( @{$aref} ) {
return 1 if $var == $e;
}
return 0;
}
sub opt_memo {
my( $var, $aref ) = @_;
foreach my $e ( @{$aref} ) {
return 1 if $var == $e;
}
return 0;
}
And here are the results:
             Rate orig_memo  original optimized  opt_memo
orig_memo   876/s        --      -10%      -83%      -94%
original    972/s       11%        --      -82%      -94%
optimized  5298/s      505%      445%        --      -66%
opt_memo  15385/s     1657%     1483%      190%        --
As you can see, the memoized version of your original function was actually slower. That's because so much of the cost of your original subroutine was spent in making copies of the 8000 element array, combined with the fact that there is additional call-stack and bookkeeping overhead with the memoized version.
But once we pass an array reference instead of a copy, we remove the expense of passing the entire array around. Your speed jumps considerably. But the clear winner is the optimized (ie, passing array refs) version that we memoized (cached), at 1483% faster than your original function. With memoization the O(n) lookup only happens the first time a given parameter is checked. Subsequent lookups occur in O(1) time.
Now you would have to decide (by Benchmarking) whether memoization helps you. Certainly passing an array ref does. And if memoization doesn't help you, maybe brian's hash method is best. But in terms of not having to rewrite much code, memoization combined with passing an array ref may be an excellent alternative.
Your current solution iterates through the array until it finds the element it is looking for. As such, it is a linear algorithm.
If you sort the array first with a comparison operator (<=> for numeric elements, cmp for strings) you can use binary search to find the elements. It is a logarithmic algorithm, much faster than linear.
Of course, one must consider the penalty of sorting the array in the first place, which is a rather slow operation (n log n). If the contents of the array you are matching against change often, you must sort after every change, and it gets really slow. If the contents remain the same after you've initially sorted them, binary search ends up being practically faster.
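A sketch of the binary search for string elements, assuming the array has already been sorted in cmp order. The name binary_exists is made up, and the array is passed by reference to avoid copying it on every call:

```perl
use strict;
use warnings;

# Returns 1 if $target is in the sorted array referenced by $sorted, else 0.
sub binary_exists {
    my ($target, $sorted) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        my $cmp = $target cmp $sorted->[$mid];
        return 1 if $cmp == 0;
        # Discard the half that cannot contain the target.
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return 0;
}

# The default sort compares as strings, which matches cmp above.
my @sorted  = sort qw/d b a c/;
my $found   = binary_exists('b', \@sorted);
my $missing = binary_exists('x', \@sorted);
```

For numeric data you would sort with { $a <=> $b } and compare with <=> inside the loop instead.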
You can use grep:
sub array_exists {
my $val = shift;
return grep { $val eq $_ } @_;
}
Surprisingly, it's not off too far in speed from List::MoreUtils' any(). It's faster if your item is at the end of the list by about 25% and slower by about 50% if your item is at the start of the list.
You can also inline it if needed -- no need to shove it off into a subroutine. i.e.
if ( grep { $needle eq $_ } @haystack ) {
### Do something
...
}

How would I use a hash slice to initialize a hash stored in a data structure?

In an earlier question I asked how to initialize a Perl hash using slices. It is done like this:
my %hash = ();
my @fields = ('currency_symbol', 'currency_name');
my @array = ('BRL','Real');
@hash{@fields} = @array;
Now let's imagine a more complex hash, and here is how it is initialized:
my %hash = ();
my $iso = 'BR';
$hash->{$iso}->{currency_symbol} = 'BRL';
$hash->{$iso}->{currency_name} = 'Real';
print Dumper($hash);
This results in the following:
$VAR1 = {
'BR' => {
'currency_symbol' => 'BRL',
'currency_name' => 'Real'
}
};
Now the question would be: how to initialize this particular hash using the slice method?
The perllol documentation's Slices section covers array slices:
If you want to get at a slice (part of a row) in a multidimensional array, you're going to have to do some fancy subscripting. That's because while we have a nice synonym for single elements via the pointer arrow for dereferencing, no such convenience exists for slices. (Remember, of course, that you can always write a loop to do a slice operation.)
Here's how to do one operation using a loop. We'll assume an @AoA variable as before.
@part = ();
$x = 4;
for ($y = 7; $y < 13; $y++) {
push @part, $AoA[$x][$y];
}
That same loop could be replaced with a slice operation:
@part = @{ $AoA[4] } [ 7..12 ];
Extrapolating to hash slices, we get
@{ $hash{$iso} }{@fields} = @array;
You know it's a hash slice because the “subscripts” are surrounded with curly braces rather than square brackets.
First of all, since your hash is declared %hash, it would just be $hash{ $iso }. $hash->{ $iso } refers to a slot in the hash pointed to by $hash, which may or may not be pointing to %hash.
But once you have that, you can do the following:
@{ $hash{ $iso } }{ @fields } = qw<BRL Real>;
But as levels soon get complex, it's better to forgo the autovivification luxury and do the following:
my $h = $hash{ $iso }{blah}{blah} = {};
@$h{ @field_names } = @field_values;
Relocatable pointers within the hierarchy of hashes make it easier to write anonymous accesses that also allow for easy slices.
$hash{$iso} is going to be a hash reference. You replace what would be the variable name (without the sigil) in a simple slice with a block containing the reference, so:
@array{@list}
becomes
@{ $hash{$iso} }{@list}
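Putting the pieces together with the question's own data, as a self-contained sketch (the @values name is mine, chosen to avoid confusion with the slice syntax):

```perl
use strict;
use warnings;

my %hash;
my $iso    = 'BR';
my @fields = ('currency_symbol', 'currency_name');
my @values = ('BRL', 'Real');

# $hash{$iso} does not exist yet; assigning through it as a hash
# reference creates the inner hash automatically (autovivification).
@{ $hash{$iso} }{@fields} = @values;

# Reading back through the same block syntax gives a list of values.
my @read_back = @{ $hash{$iso} }{@fields};
```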
See http://perlmonks.org/?node=References+quick+reference
