Separating CSV file into key and array - arrays

I am new to perl, and I am trying to separate a csv file (has 10 comma-separated items per line) into a key (first item) and an array (9 items) to put in a hash. Eventually, I want to use an if function to match another variable to the key in the hash and print out the elements in the array.
Here's the code I have, which doesn't work right.
use strict;
use warnings;
my %hash;
my $in2 = "metadata1.csv";
open IN2, "<$in2" or die "Cannot open the file: $!";
while (my $line = <IN2>) {
my ($key, #value) = split (/,/, $line, 2);
%hash = (
$key => #value
);
}
foreach my $key (keys %hash)
{
print "The key is $key and the array is $hash{$key}\n";
}
Thank you for any help!

Don't use 2 as the third argument to split: it will split the line to only two elements, so there'll be just one #value.
Also, by doing %hash =, you're overwriting the hash in each iteration of the loop. Just add a new key/value pair:
$hash{$key} = \#value;
Note the \ sign: you can't store an array directly as a hash value, you have to store a reference to it. When printing the value, you have to dereference it back:
#! /usr/bin/perl
use warnings;
use strict;
my %hash;
while (<DATA>) {
my ($key, #values) = split /,/;
$hash{$key} = \#values;
}
for my $key (keys %hash) {
print "$key => #{ $hash{$key} }";
}
__DATA__
id0,1,2,a
id1,3,4,b
id2,5,6,c
If your CSV file contains quoted or escaped commas, you should use Text::CSV.

First of all hash can have only one unique key, so when you have lines like these in your CSV file:
key1,val11,val12,val13,val14,val15,val16,val17,val18,val19
key1,val21,val22,val23,val24,val25,val26,val27,val28,val29
after adding both key/value pairs with 'key1' key to the hash, you'll get just one pair saved in the hash, the one that were added to the hash later.
So to keep all records, the result you probably need array of hashes structure, where value of each hash is an array reference, like this:
#result = (
{ 'key1' => ['val11','val12','val13','val14','val15','val16','val17','val18','val19'] },
{ 'key1' => ['val21','val22','val23','val24','val25','val26','val27','val28','val29'] },
{ 'and' => ['so on'] },
);
In order to achieve that your code should become like this:
use strict;
use warnings;
my #AoH; # array of hashes containing data from CSV
my $in2 = "metadata1.csv";
open IN2, "<$in2" or die "Cannot open the file: $!";
while (my $line = <IN2>) {
my #string_bits = split (/,/, $line);
my $key = $string_bits[0]; # first element - key
my $value = [ #string_bits[1 .. $#string_bits] ]; # rest get into arr ref
push #AoH, {$key => $value}; # array of hashes structure
}
foreach my $hash_ref (#AoH)
{
my $key = (keys %$hash_ref)[0]; # get hash key
my $value = join ', ', #{ $hash_ref->{$key} }; # join array into str
print "The key is '$key' and the array is '$value'\n";
}

Related

How can I make an array of values after duplicate keys in a hash?

I have a question regarding duplicate keys in hashes.
Say my dataset looks something like this:
>Mammals
Cats
>Fish
Clownfish
>Birds
Parrots
>Mammals
Dogs
>Reptiles
Snakes
>Reptiles
Snakes
What I would like to get out of my script is a hash that looks like this:
$VAR1 = {
'Birds' => 'Parrots',
'Mammals' => 'Dogs', 'Cats',
'Fish' => 'Clownfish',
'Reptiles' => 'Snakes'
};
I found a possible answer here (https://www.perlmonks.org/?node_id=1116320). However I am not sure how to identify the values and the duplicates with the format of my dataset.
Here's the code that I have been using:
use Data::Dumper;
open($fh, "<", $file) || die "Could not open file $file $!/n";
while (<$fh>) {
chomp;
if($_ =~ /^>(.+)/){
$group = $1;
$animals{$group} = "";
next;
}
$animals{$group} .= $_;
push #{$group (keys %animals)}, $animals{$group};
}
print Dumper(\%animals);
When I execute it the push function does not seem to work as the output from this command is the same as when the command is absent (in the duplicate "Mammal" group, it will replace the cat with the dog instead of having both as arrays within the same group).
Any suggestions as to what I am doing wrong would be highly appreciated.
Thanks !
You're very close here. We can't get exactly the output you want from Data::Dumper because hashes can only have one value per key. The easiest way to fix that is to assign a reference to an array to the key and add things to it. But since you want to eliminate the duplicates as well, it's easier to build hashes as an intermediate representation then transform them to arrays:
use Data::Dumper;
my $file = "animals.txt";
open($fh, "<", $file) || die "Could not open file $file $!/n";
while (<$fh>) {
chomp;
if(/^>(.+)/){
$group = $1;
next;
}
$animals{$group} = {} unless exists $animals{$group};
$animals{$group}->{$_} = 1;
}
# Transform the hashes to arrays
foreach my $group (keys %animals) {
# Make the hash into an array of its keys
$animals{$group} = [ sort keys %{$animals{$group}} ];
# Throw away the array if we only have one thing
$animals{$group} = $animals{$group}->[0] if #{ $animals{$group} } == 1;
}
print Dumper(\%animals);
Result is
$VAR1 = {
'Reptiles' => 'Snakes',
'Fish' => 'Clownfish',
'Birds' => 'Parrots',
'Mammals' => [
'Cats',
'Dogs'
]
};
which is as close as you can get to what you had as your desired output.
For ease in processing the ingested data, it may actually be easier to not throw away the arrays in the one-element case so that every entry in the hash can be processed the same way (they're all references to arrays, no matter how many things are in them). Otherwise you've added a conditional to strip out the arrays, and you have to add another conditional test in your processing code to check
if (ref $item) {
# This is an anonymous array
} else {
# This is just a single entry
}
and it's easier to just have one path there instead of two, even if the else just wraps the single item into an array again. Leave them as arrays (delete the $animals{$group} = $animals{$group}->[0] line) and you'll be fine.
Given:
__DATA__
>Mammals
Cats
>Fish
Clownfish
>Birds
Parrots
>Mammals
Dogs
>Reptiles
Snakes
>Reptiles
Snakes
(at the end of the source code or a file with that content)
If you are willing to slurp the file, you can do something with a regex and a HoH like this:
use Data::Dumper;
use warnings;
use strict;
my %animals;
my $s;
while(<DATA>){
$s.=$_;
}
while($s=~/^>(.*)\R(.*)/mg){
++$animals{$1}{$2};
}
print Dumper(\%animals);
Prints:
$VAR1 = {
'Mammals' => {
'Cats' => 1,
'Dogs' => 1
},
'Birds' => {
'Parrots' => 1
},
'Fish' => {
'Clownfish' => 1
},
'Reptiles' => {
'Snakes' => 2
}
};
Which you can arrive to your format with this complete Perl program:
$s.=$_ while(<DATA>);
++$animals{$1}{$2} while($s=~/^>(.*)\R(.*)/mg);
while ((my $k, my $v) = each (%animals)) {
print "$k: ". join(", ", keys($v)) . "\n";
}
Prints:
Fish: Clownfish
Birds: Parrots
Mammals: Cats, Dogs
Reptiles: Snakes
(Know that the output order may be different than file order since Perl hashes do not maintain insertion order...)

How to create variable array name in foreach loop for perl

I would like to create an array inside foreach loop that change name itself
our $j = 1;
foreach $key ( sort keys %hash ){
#array1 = $hash{$key};
$j++;
}
How to i change array name with $j. Like every key my array name will change from #array1, #array2, #array3....
That would require symbolic references and you don't want to be doing that.
It is a dangerous feature which is actually needed and used only very rarely for very specific reasons. For all other purposes there are other, better, ways.
Instead, use anonymous arrays (or array references) stored in a data structure, with an array
my #data;
foreach $key (sort keys %hash) {
push #data, [ ... ]; # (populate with $hash data)
}
or a hash
my %data;
foreach $key (sort keys %hash) {
my $name = ...; # work out a suitable key-name
$data{$name} = [ ... ]; # populate with $hash data
}
I don't know what to put in anonymous arrays [ ... ], or what good names for keys ($name) are, since it's not stated what is in the hash.
It is conceivable that your hash values themselves are in fact arrayrefs, in which case
my #data;
foreach $key (sort keys %hash) {
push #data, $hash{$key};
}
seems to fit the question but is really just
my #data = map { $hash{$_} } sort keys %hash;
or, if you don't need a predictable order based on keys
my #data = values %hash;
But I presume that there is more to do with hash's data before it is stored in arrays.
Then you can refer to the individual array(ref)s by index (or by name in the case of a hash).
for(my $i=0;$i<100;$i++){
my #arr_i=($i);
print(#arr_i,"\n");
}

creating hash from array in perl

I have an array that I want to convert into a hash table. Basically, I want #array[0] to be the keys of the hash, and #array[1] to be the values of the hash. Is there an easy way to do this in perl? The code I have so far is as follows:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
unless( open(INFILE, "<", 'scratch/Drosophila/fb_synonym_fb_2014_05.tsv')) {
die "Cannot open file for reading: ", $!;
while(<INFILE>) {
my #values = split();
#convert values[0] to keys, values[1] to values
}
the file is available for download here
#array[0] (an array slice, used to return multiple elements) is a bad way of writing $array[0] (an array lookup, used to return a single element). use warnings; would have told you this.
To set a hash element, one uses
$hash{$key} = $val;
So the code becomes
my %hash;
while (<>) {
chomp;
my #fields = split /\t/;
$hash{ $fields[0] } = $fields[1];
}
Better yet,
my %hash;
while (<>) {
chomp;
my ($key, $val) = split /\t/;
$hash{$key} = $val;
}
The name of the file implies the fields are tab-separated, not whitespace separated, so I switched
split ' '
to
split /\t/
This required the addition of chomp.

Is it possible to put elements of array to hash in perl?

example of file content:
>random sequence 1 consisting of 500 residues.
VILVWRISEMNPTHEIYPEVSYEDRQPFRCFDEGINMQMGQKSCRNCLIFTRNAFAYGIV
HFLEWGILLTHIIHCCHQIQGGCDCTRHPVRFYPQHRNDDVDKPCQTKSPMQVRYGDDSD;
>random sequence 2 consisting of 500 residues.
KAAATKKPWADTIPYLLCTFMQTSGLEWLHTDYNNFSSVVCVRYFEQFWVQCQDHVFVKN
KNWHQVLWEEYAVIDSMNFAWPPLYQSVSSNLDSTERMMWWWVYYQFEDNIQIRMEWCNI
YSGFLSREKLELTHNKCEVCVDKFVRLVFKQTKWVRTMNNRRRVRFRGIYQQTAIQEYHV
HQKIIRYPCHVMQFHDPSAPCDMTRQGKRMNFCFIIFLYTLYEVKYWMHFLTYLNCLEHR;
>random sequence 3 consisting of 500 residues.
AYCSCWRIHNVVFQKDVVLGYWGHCWMSWGSMNQPFHRQPYNKYFCMAPDWCNIGTYAWK
I need an algorithm to build a hash $hash{$key} = $value; where lines starting with > are the values and following lines are the keys.
What I have tried:
open (DATA, "seq-at.txt") or die "blabla";
#data = <DATA>;
%result = ();
$k = 0;
$i = 0;
while($k != #data) {
$info = #data[$k]; #istrina pirma elementa
if(#data[$i] !=~ ">") {
$key .= #data[$i]; $i++;
} else {
$k = $i;
}
$result{$key} = $value;
}
but it doesn't work.
You don't have to previously use an array, you can directly build your hash:
use strict;
use warnings;
# ^- start always your code like this to see errors and what is ambiguous
# declare your variables using "my" to specify the scope
my $filename = 'seq-at.txt';
# use the 3 parameters open syntax to avoid to overwrite the file:
open my $fh, '<', $filename or die "unable to open '$filename' $!";
my %hash;
my $hkey = '';
my $hval = '';
while (<$fh>) {
chomp; # remove the newline \n (or \r\n)
if (/^>/) { # when the line start with ">"
# store the key/value in the hash if the key isn't empty
# (the key is empty when the first ">" is encountered)
$hash{$hkey} = $hval if ($hkey);
# store the line in $hval and clear $hkey
($hval, $hkey) = $_;
} elsif (/\S/) { # when the line isn't empty (or blank)
# append the line to the key
$hkey .= $_;
}
}
# store the last key/val in the hash if any
$hash{$hkey} = $hval if ($hkey);
# display the hash
foreach (keys %hash) {
print "key: $_\nvalue: $hash{$_}\n\n";
}
It is unclear what you want, the array seems to be the lines subsequent to the random sequence number... If the contenst of a file test.txt are:
Line 1:">"random sequence 1 consisting of 500 residues.
Line 2:VILVWRISEMNPTHEIYPEVSYEDRQPFRCFDEGINMQMGQKSCRNCLIFTRNAFAYGIV
Line 3:HFLEWGILLTHIIHCCHQIQGGCDCTRHPVRFYPQHRNDDVDKPCQTKSPMQVRYGDDSD;
You could try something like:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $contentFile = $ARGV[0];
my %testHash = ();
my $currentKey = "";
open(my $contentFH,"<",$contentFile);
while(my $contentLine = <$contentFH>){
chomp($contentLine);
next if($contentLine eq ''); # Empty lines.
if($contentLine =~ /^"\>"(.*)/){
$currentKey= $1;
}else{
push(#{$testHash{$currentKey}},$contentLine);
}
}
print Dumper(\%testHash);
Which results in a structure like this:
seb#amon:[~]$ perl test.pl test.txt
$VAR1 = {
'random sequence 3 consisting of 500 residues.' => [
'AYCSCWRIHNVVFQKDVVLGYWGHCWMSWGSMNQPFHRQPYNKYFCMAPDWCNIGTYAWK'
],
'random sequence 1 consisting of 500 residues.' => [
'VILVWRISEMNPTHEIYPEVSYEDRQPFRCFDEGINMQMGQKSCRNCLIFTRNAFAYGIV',
'HFLEWGILLTHIIHCCHQIQGGCDCTRHPVRFYPQHRNDDVDKPCQTKSPMQVRYGDDSD;'
],
'random sequence 2 consisting of 500 residues.' => [
'KAAATKKPWADTIPYLLCTFMQTSGLEWLHTDYNNFSSVVCVRYFEQFWVQCQDHVFVKN',
'KNWHQVLWEEYAVIDSMNFAWPPLYQSVSSNLDSTERMMWWWVYYQFEDNIQIRMEWCNI',
'YSGFLSREKLELTHNKCEVCVDKFVRLVFKQTKWVRTMNNRRRVRFRGIYQQTAIQEYHV',
'HQKIIRYPCHVMQFHDPSAPCDMTRQGKRMNFCFIIFLYTLYEVKYWMHFLTYLNCLEHR;'
]
};
You would be basically using each hash "value" as an array structure, the #{$variable} does the magic.

Perl Array of Hashes - Reference each hash within array?

I am trying to create an array of hashes, and I am wondering how to reference each hash within the array?
For eg:
while(<INFILE>)
{
my $row = $_;
chomp $row;
my #cols = split(/\t/,$row);
my $key = $cols[0]."\t".$cols[1];
my #total = (); ## This is my array of hashes - wrong syntax???
for($i=2;$i<#cols;$i++)
{
$total[$c++]{$key} += $cols[$i];
}
}
close INFILE;
foreach (sort keys %total) #sort keys for one of the hashes within the array - wrong syntax???
{
print $_."\t".$total[0]{$_}."\n";
}
Thanks in advance for any help.
You don't need
my #total = ();
This:
my #total;
is sufficient for what you are after. No special syntax needed to declare that your array will contain hashes.
There's probably different ways of doing the foreach part, but this should work:
foreach (sort keys %{$total[$the_index_you_want]}) {
print $_."\t".$total[$the_index_you_want]{$_}."\n";
}
[BTW, declaring my #total; inside that loop is probably not what you want (it would be reset on each line). Move that outside the first loop.]
And use strict; use warnings;, obviously :-)
Here's what I get:
print join( "\t", #$_{ sort keys %$_ } ), "\n" foreach #total;
I'm simply iterating through #total and for each hashref taking a slice in sorted order and joining those values with tabs.
If you didn't need them in sorted order, you could just
print join( "\t", values %$_ ), "\n" foreach #total;
But I also would compress the line processing like so:
my ( $k1, $k2, #cols ) = split /\t/, $row;
my $key = "$k1\t$k2";
$totals[ $_ ]{ $key } += $cols[ $_ ] foreach 0..$#cols;
But with List::MoreUtils, you could also do this:
use List::MoreUtils qw<pairwise>;
my ( $k1, $k2, #cols ) = split /\t/, $row;
my $key = "$k1\t$k2";
pairwise { $a->{ $key } += $b } #totals => #cols;

Resources