create a hash of occurrences in an array with map - arrays

I seem to recall that there was a "clever" way to create a hash from an array with Perl with map such that the keys are the elements of the the array and the values are the number of times the element appears. Something like this, although this does not work:
$ perl -e '#a = ('a','a','b','c'); %h = map { $_ => $_ + 1 } #a ; foreach $k (keys (%h)) { print "$k -> $h{$k}\n"}'
c -> 1
b -> 1
a -> 2
$
Am I imagining things? How can I do this?

You can write map {$h{$_}++} #a ignoring its return value, but why would you do this? for (#a){$h{$_}++} is easy enough to type.
So why would you not do it?
map is meant for transforming lists. It takes an input list and generates an output list. It can confuse a reader if you use it in a different way using side effect instead of output.
Also, although map is optimized to not create the output list when called in void context, it is slower:
use warnings;
use strict;
use Benchmark qw/cmpthese/;
my #in = map {chr(int(rand(127)+1))} 1..10000;
my %out;
cmpthese(10000,
{stmtfor => sub{%out = (); $out{$_}++ for #in},
voidmap => sub{%out = (); map {$out{$_}++} #in;},
}
);
__END__
Rate voidmap stmtfor
voidmap 2075/s -- -17%
stmtfor 2513/s 21% --

I'm not sure this is what you are looking for, but you can easily do that with a for rather than a map:
$ perl -e '#a = ('a','a','b','c'); $h{$_}++ for #a; foreach $k (keys (%h)) { print "$k -> $h{$k}\n"}'
c -> 1
b -> 1
a -> 2

Related

Perl: Removing array items and resizing the array

I’m trying to filter an array of terms using another array in Perl. I have Perl 5.18.2 on OS X, though the behavior is the same if I use 5.010. Here’s my basic setup:
#!/usr/bin/perl
#use strict;
my #terms = ('alpha','beta test','gamma','delta quadrant','epsilon',
'zeta','eta','theta chi','one iota','kappa');
my #filters = ('beta','gamma','epsilon','iota');
foreach $filter (#filters) {
for my $ind (0 .. $#terms) {
if (grep { /$filter/ } $terms[$ind]) {
splice #terms,$ind,1;
}
}
}
This works to pull out the lines that match the various search terms, but the array length doesn’t change. If I write out the resulting #terms array, I get:
[alpha]
[delta quadrant]
[zeta]
[eta]
[theta chi]
[kappa]
[]
[]
[]
[]
As you might expect from that, printing scalar(#terms) gets a result of 10.
What I want is a resulting array of length 6, without the four blank items at the end. How do I get that result? And why isn’t the array shrinking, given that the perldoc page about splice says, “The array grows or shrinks as necessary.”?
(I’m not very fluent in Perl, so if you’re thinking “Why don’t you just...?”, it’s almost certainly because I don’t know about it or didn’t understand it when I heard about it.)
You can always regenerate the array minus things you don't want. grep acts as a filter allowing you to decide which elements you want and which you don't:
#!/usr/bin/perl
use strict;
my #terms = ('alpha','beta test','gamma','delta quadrant','epsilon',
'zeta','eta','theta chi','one iota','kappa');
my #filters = ('beta','gamma','epsilon','iota');
my %filter_exclusion = map { $_ => 1 } #filters;
my #filtered = grep { !$filter_exclusion{$_} } #terms;
print join(',', #filtered) . "\n";
It's pretty easy if you have a simple structure like %filter_exclusion on hand.
Update: If you want to allow arbitrary substring matches:
my $filter_exclusion = join '|', map quotemeta, #filters;
my #filtered = grep { !/$filter_exclusion/ } #terms;
To see what's going on, print the contents of the array in each step: When you splice the array, it shrinks, but your loop iterates over 0 .. $#terms, so at the end of the loop, $ind will point behind the end of the array. When you use grep { ... } $array[ $too_large ], Perl needs to alias the non-existent element to $_ inside the grep block, so it creates an undef element in the array.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my #terms = ('alpha', 'beta test', 'gamma', 'delta quadrant', 'epsilon',
'zeta', 'eta', 'theta chi', 'one iota', 'kappa');
my #filters = qw( beta gamma epsilon iota );
for my $filter (#filters) {
say $filter;
for my $ind (0 .. $#terms) {
if (grep { do {
no warnings 'uninitialized';
/$filter/
} } $terms[$ind]
) {
splice #terms, $ind, 1;
}
say "\t$ind\t", join ' ', map $_ || '-', #terms;
}
}
If you used $terms[$ind] =~ /$filter/ instead of grep, you'd still get uninitialized warnings, but as there's no need to alias the element, it won't be created.

Perl array element comparing

I am new in Perl programming. I am trying to compare the two arrays each element. So here is my code:
#!/usr/bin/perl
use strict;
use warnings;
use v5.10.1;
my #x = ("tom","john","michell");
my #y = ("tom","john","michell","robert","ricky");
if (#x ~~ #y)
{
say "elements matched";
}
else
{
say "no elements matched";
}
When I run this I get the output
no elements matched
So I want to compare both array elements in deep and the element do not matches, those elements I want to store it in a new array. As I can now compare the only matched elements but I can't store it in a new array.
How can I store those unmatched elements in a new array?
Please someone can help me and advice.
I'd avoid smart matching in Perl - e.g. see here
If you're trying to compare the contents of $y[0] with $x[0] then this is one way to go, which puts all non-matches in an new array #keep:
use strict;
use warnings;
use feature qw/say/;
my #x = qw(tom john michell);
my #y = qw(tom john michell robert ricky);
my #keep;
for (my $i = 0; $i <$#y; $i++) {
unless ($y[$i] eq $x[$i]){
push #keep, $y[$i];
}
}
say for #keep;
Or, if you simply want to see if one name exists in the other array (and aren't interested in directly comparing elements), use two hashes:
my (%x, %y);
$x{$_}++ for #x;
$y{$_}++ for #y;
foreach (keys %y){
say if not exists $x{$_};
}
It would be well worth your while spending some time reading the Perl FAQ.
Perl FAQ 4 concerns Data Manipulation and includes the following question and answer:
How do I compute the difference of two arrays? How do I compute
the intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each
element is unique in a given array:
my (#union, #intersection, #difference);
my %count = ();
foreach my $element (#array1, #array2) { $count{$element}++ }
foreach my $element (keys %count) {
push #union, $element;
push #{ $count{$element} > 1 ? \#intersection : \#difference }, $element;
}
Note that this is the symmetric difference, that is, all elements
in either A or in B but not in both. Think of it as an xor
operation.

Perl: Load file into hash

I'm struggling to understand logic behind hashes in Perl. Task is to load file in to hash and assign values to keys which are created using this file.
File contains alphabet with each letter on its own line:
a
b
c
d
e
and etc,.
When using array instead of hash, logic is simple: load file into array and then print each element with corresponding number using some counter ($counter++).
But now my question is, how can I read file into my hash, assign automatically generated values and sort it in that way where output is printed like this:
a:1
b:2
c:3
I've tried to first create array and then link it to hash using
%hash = #array
but it makes my hash non-sortable.
There are a number of ways to approach this. The most direct would be to load the data into the hash as you read through the file.
my %hash;
while(<>)
{
chomp;
$hash{$_} = $.; #Use the line number as your autogenerated counter.
}
You can also perform simliar logic if you already have a populated array.
for (0..$#array)
{
$hash{$array[$_]} = $_;
}
Although, if you are in that situation, map is the perlier way of doing things.
%hash = map { $array[$_] => $_ } #array;
Think of a hash as a set of pairs (key, value), where the keys must be unique. You want to read the file one line at a time, and add a pair to the hash:
$record = <$file_handle>;
$hash{$record} = $counter++;
Of course, you could read the entire file into an array at once and then assign to your hash. But the solution is not:
#records = <$file_handle>;
%hash = #records;
... as you found out. If you think in terms of (key, value) pairs, you will see that the above is equivalent to:
$hash{a} = 'b';
$hash{c} = 'd';
$hash{e} = 'f';
...
and so on. You still are going to need a loop, either an explicit one like this:
foreach my $rec (#records)
{
$hash{$rec} = $counter++;
}
or an implicit one like one of these:
%hash = map {$_ => $counter++} #records;
# or:
$hash{$_} = $counter++ for #records;
This code should generate the proper output, where my-text-file is the path to your data file:
my %hash;
my $counter = 0;
open(FILE, "my-text-file");
while (<FILE>) {
chomp;
$counter++;
$hash{$_} = $counter;
}
# Now to sort
foreach $key (sort(keys(%hash))) {
print $key . ":" . $hash{$key} . "\n";
}
I assume you want to sort the hash aplhabetically. keys(%hash) and values(%hash) return the keys and values of %hash as an array, respectively. Run the program on this file:
f
a
b
d
e
c
And we get:
a:2
b:3
c:6
d:4
e:5
f:1
I hope this helps you.

Remove elements from an array which have a substring that is itself an element of the array

In Perl, I'd like to remove all elements from an array where another element of the same array is a non-empty substring of said element.
Say I have the array
#itemlist = ("abcde", "ab", "khi", "jklm");
In this instance I would like to have the element "abcde" removed, because "ab" is a substring of "abcde".
I could make a copy of the array (maybe as a hash?), iterate over it, try to index with every element of the original array and remove it, but there has to be a more elegant way, no?
Thanks for your help!
Edited for clarity a bit.
You could construct a regex from all the items and throw out anything that matches:
$alternation = join('|', map(quotemeta, #itemlist));
#itemlist = grep !/($alternation).|.($alternation)/, #itemlist;
The ().|.() thing just ensures that an item doesn't match itself.
Well, I wouldn't call this elegant, but here goes:
#!usr/bin/perl
use strict;
use warnings;
my #itemlist = ("abcde", "ab", "khi", "jklm");
#itemlist = grep {
#itemlist ~~ sub {$_ !~ /(?:.\Q$_[0]\E|\Q$_[0]\E.)/}
} #itemlist;
print "#itemlist";
It relies on a rather obscure behavior of smart match: if the left argument is an array and the right argument a sub, it calls the sub for each element, and the final result is true only if the sub returns true for each element.
Explanation: for each element of the array, it checks that no other element is a substring of that element (requiring at least one additional character so that elements won't match themselves).
Note: wdebeaum's answer is probably the one I would prefer in the real world. Still, it is kind of interesting the strange things one can do with smart match.
wdebeaum's answer is the solution to use, not the one below, but I learned something by doing it and perhaps someone else will too. After I had written mine I decided to test it on lists of several thousand elements.
b.pl:
#!/usr/bin/perl
use strict;
use warnings;
my #itemlist = <>;
for(#itemlist) { chomp; }
my $regex;
if(defined $ENV{wdebeaum}) {
# wdebeaum's solution
my $alternation = join('|', map(quotemeta, #itemlist));
$regex = qr/(?:$alternation).|.(?:$alternation)/;
} else {
# my solution
$regex = join "|", map {qq{(?:\Q$_\E.)|(?:.\Q$_\E)}} #itemlist;
}
my #result = grep !/$regex/, #itemlist;
print scalar #itemlist, "\t", scalar #result, "\n";
I generated a list of 5000 random words.
sort -R /usr/share/dict/american-english|head -5000 > some-words
For small lists both solutions seem fine.
$ time head -200 some-words | wdebeaum=1 ./b.pl
200 198
real 0m0.012s
user 0m0.004s
sys 0m0.004s
$ time head -200 some-words | ./b.pl
200 198
real 0m0.068s
user 0m0.060s
sys 0m0.004s
But for larger lists, wdebeaum's is clearly better.
$ time cat some-words | wdebeaum=1 ./b.pl
5000 1947
real 0m0.068s
user 0m0.064s
sys 0m0.000s
$ time cat some-words | ./b.pl
5000 1947
real 0m8.305s
user 0m8.277s
sys 0m0.012s
I think the reason for the difference, is that even though both regular expressions have the same number of possible paths, my regex has more paths that have to be tried, since it has the same number of .s as paths, while wdebebaum's has only two.
You can use a hash to count substrings of all the words. Any word in the list that has a higher count than one is then a substring of another word. The minimum length of the substrings is two in this example:
use strict;
use warnings;
use feature 'say';
my #list = qw(abcde ab foo foobar de oba cd xs);
my %count;
for my $word (#list) {
my $len = length $word;
$count{$word}++;
for my $start (0 .. $len - 2) {
for my $long (2 .. $len - 2) {
my $sub = substr($word, $start, $long);
$count{$sub}++;
}
}
}
say for grep $count{$_} == 1, #list;
Output:
abcde
foobar
xs
The following will remove the substring from the array.
#!/usr/bin/perl
use strict;
use warnings;
my #ar=("asl","pwe","jsl","nxu","sl","baks","ak");
foreach my $i (#ar){
my $p = grep /$i/, #ar;
if ( $p == 1 ){
print "$i" , "\n";
}
}
I had the inverse problem: removing from the list strings which are substrings of other strings. Here is my not-too-elegant solution.
sub remove_substrings_from_list {
my #list = #_;
my #vals_without_superstrings;
my %hash_of_others;
for ( 0 .. $#list ) {
my $a = shift #list;
$hash_of_others{$a} = [ #list ];
push #list, $a;
}
foreach my $k ( keys %hash_of_others ) {
push #vals_without_superstrings, $k unless grep { index( $_, $k ) != -1 } #{ $hash_of_others{$k} };
}
return #vals_without_superstrings;
}

Swap key and array value pair

I have a text file layed out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have python (if you're on linux you probably already have) i'd use a short python script to do this. Note that we use sets to filter out "double" items.
Edited to be closer to requester's requirements:
import csv
from decimal import *
getcontext().prec = 7
csv_reader = csv.reader(open('test.csv'), delimiter='\t')
maindict = {}
for row in csv_reader:
value = row[0]
for key in row[1:]:
try:
maindict[key].add(Decimal(value))
except KeyError:
maindict[key] = set()
maindict[key].add(Decimal(value))
csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
csv_writer.writerow([key] + sorted(maindict[key]))
I would try perl if that's available to you. Loop through the input a row at a time. Split the line on the tab then the right hand part on the commas. Shove the values into an associative array with letters as the keys and the value being another associative array. The second associative array will be playing the part of a set so as to eliminate duplicates.
Once you read the input file, sort based on the keys of the associative array, loop through and spit out the results.
here's a small utility in php:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my #csv = split m"\s*,\s*", $csv;
push #{$data{$_}}, $number for #csv;
}
for my $number (sort keys %data){
my #unique = sort keys %{{ map { ($_,undef) } #{$data{$number}} }};
say $number, "\t", join ', ', #unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my #values;
if($csv->parse($rest)) {
#values = $csv->fields()
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(#values) {
push #{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } #{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them altogether and you should get your desired output. I tested with the GNU tools.

Resources