How to create hash of array in Perl - arrays

I have a data like this
Group AT1G01040-TAIR-G
LOC_Os03g02970 69%
Group AT1G01050-TAIR-G
LOC_Os10g26600 85%
LOC_Os10g26633 35%
Group AT1G01090-TAIR-G
LOC_Os04g02900 74%
How can create the data structure that looks like this:
print Dumper \%big;
$VAR = { "Group AT1G01040-TAIR-G" => ['LOC_Os03g02970 69%'],
"Group AT1G01050-TAIR-G" => ['LOC_Os10g26600 85%','LOC_Os10g26633 35%'],
"Group AT1G01090-TAIR-G" => ['LOC_Os04g02900 74%']};
This is my attempt, but fail:
my %big;
while ( <> ) {
chomp;
my $line = $_;
my $head = "";
my #temp;
if ( $line =~ /^Group/ ) {
$head = $line;
$head =~ s/[\r\s]+//g;
#temp = ();
}
elsif ($line =~ /^\t/){
my $cont = $line;
$cont =~ s/[\t\r]+//g;
push #temp, $cont;
push #{$big{$head}},#temp;
};
}

Here's how I'd do it:
my %big;
my $currentGroup;
while (my $line = <> ) {
chomp $line;
if ( $line =~ /^Group/ ) {
$big{$line} = $currentGroup = [];
}
elsif ($line =~ s/^\t+//) {
push #$currentGroup, $line;
}
}
You should probably add some additional error checking to this, e.g. an else clause to warn about lines that don't match either regex. Also, check to see if $currentGroup is undef before pushing (in case the first line begins with a tab instead of "Group").
The biggest problem with your original code is that you're declaring and initializing $head and #temp inside the loop, which means they got reset on every line. Variables that need to persist across lines have to be declared outside the loop, as I've done with $currentGroup.
I'm not quite sure what you're intending to accomplish with the s/[\r\s]+//g; bit. \r is included in \s, so that means the same as s/\s+//g; (which would strip all whitespace), but your desired result hash includes whitespace in your keys. If you want to strip trailing whitespace, you need to include an anchor: s/\s+\z//.

Well, I don't want to give you an answer, so I'll just tell you to look at:
perlref
perlreftut
Well, there ya go :-).

Your pushing arrays to your hash item. You should just be pushing the values. (You don't need #temp at all.)
push #{$big{$head}}, $cont;
Also $head must be declared outside your loop, otherwise it looses its value after each iteration.

Related

Perl - initialization of hash

I'm not sure how to correctly initialize my hash - I'm trying to create a key/value pair for values in coupled lines in my input file.
For example, my input looks like this:
#cluster t.18
46421 ../../../output###.txt/
#cluster t.34
41554 ../../../output###.txt/
I'm extracting the t number from line 1 (#cluster line) and matching it to output###.txt in the second line (line starting with 46421). However, I can't seem to get these values into my hash with the script that I have written.
#!/usr/bin/perl
use warnings;
use strict;
my $key;
my $value;
my %hash;
my $filename = 'input.txt';
open my $fh, '<', $filename or die "Can't open $filename: $!";
while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^\#cluster/) {
my #fields = split /(\d+)/, $line;
my $key = $fields[1];
}
elsif ($line =~ m/^(\d+)/) {
my #output = split /\//, $line;
my $value = $output[5];
}
$hash{$key} = $value;
}
It's a good idea, but your $key that is created with my in the if block is a local variable scoped to that block, masking the global $key. Inside the if block the symbol $key has nothing to do with the one you nicely declared upfront. See my in perlsub.
This local $key goes out of scope as soon as if is done and does not exist outside the if block. The global $key is again available after the if, being visible elsewhere in the loop, but is undefined since it has never been assigned to. The same goes for $value in the elsif block.
Just drop the my declaration inside the loop, thus assign to those global variables (as intended?). So, $key = ... and $value = ..., and the hash will be assigned correctly.
Note -- this is about how to get that hash assignment right. I don't know how your actual data looks and whether the line is parsed correctly. Here is a toy input.txt
#cluster t.1
1111 ../../../output1.1.txt/
#cluster t.2
2222 ../../../output2.2.txt/
I pick the 4th field instead of the 6th, $value = $output[3];, and add
print "$_ => $hash{$_}\n" for keys %hash;
after the loop. This prints
1 => output1.1.txt
2 => output2.2.txt
I am not sure whether this is what you want but the hash is built fine.
A comment on choice of tools in parsing
You parse the lines for numbers, by using the property of split to return the separators as well, when they are captured. That is neat, but in some sense it reverses its main purpose, which is to extract other components from the string, as delimited by the pattern. Thus it may make the purpose of the code a little bit convoluted, and you also have to index very precisely to retrieve what you need.
Instead of using split to extract the delimiter itself, which is given by a regex, why not extract it by a regex? That makes the intention crystal clear, too. For example, with input
#cluster t.10 has 4319 elements, 0 subclusters
37652 ../../../../clust/output43888.txt 1.397428
the parsing can go as
if ($line =~ m/^\#cluster/) {
($key) = $line =~ /t\.(\d+)/;
}
elsif ($line =~ m/^(\d+)/) {
($value) = $line =~ m|.*/(\w+\.txt)|;
}
$hash{$key} = $value if defined $key and defined $value;
where t\. and \.txt are added to more precisely specify the targets. If the target strings aren't certain to have that precise form, just capture \d+, and in the second case all non-space after the last /, say by m|^\d+.*/(\S+)|. We use the greediness of .*, which matches everything possible up to the thing that comes after it (a /), thus all the way to the very last /.
Then you can also reduce it to a single regex for each line, for example
if ($line =~ m/^\#cluster\s+t\.(\d+)/) {
$key = $1;
}
elsif ($line =~ m|^\d+.*/(\w+\.txt)|) {
$value = $1;
}
Note that I've added a condition to the hash assignment. The original code in fact assigns an undef on the first iteration, since no $value had yet been seen at that point. This is overwritten on the next iteration and we don't see it if we only print the hash afterwards. The condition also guards you against failed matches, for malformatted lines or such. Of course, far better checks can be run.

Access the key value from an associative array

I have the associative array %cart_item, within this is a series of associative arrays. I need to access the value of the keys within %cart_item. I have the following code which iterates on each array key. (I do the equivalent of php's continue if the value is 'meta')
my $key_value;
for (keys %cart_item) {
next if (/^meta$/ || /^\s*$/);
}
I need to do something like this though (although this isn't valid), setting the value of the keys in the loop:
my $key_value;
for $i (keys %cart_item) {
next if (/^meta$/ || /^\s*$/);
$key_value = $i;
# do stuff
}
Could anyone suggest a solution here? Apologies if this is obvious, I'm a Perl newbie. Thanks
I think you are asking for
for my $key (keys %cart_item) {
next if $key =~ /^meta$/ || $key =~ /^\s*$/;
my $val = $cart_item{$key};
...
}
If you're just looking for the value that goes with the key, you can get both at the same time with each:
while (my ($key, $val) = each %cart_item) {
next if $key eq 'meta' || $key =~ /^\s*$/;
...
}
That's the equivalent of PHP's foreach ($cart_item as $key => $val).
I also changed the "meta" check to use simple string equality; no need to use a regular expression for an exact match.
Your original code has
for ( keys %cart_item ) {
next if (/^meta$/ || /^\s*$/);
}
which works fine because the for has no loop control variable so it defaults to Perl's "pronoun" it variable $_. In addition, your regex pattern matches have no object so they also default to $_
Written fully, this would be
for $_ ( keys %cart_item ) {
next if ( $_ =~ /^meta$/ || $+ =~ /^\s*$/);
}
but we don't have to write all of that. Some people hate it; others like me think it's absolute genius
Your non-working code
my $key_value;
for $i (keys %cart_item) {
next if (/^meta$/ || /^\s*$/);
$key_value = $i;
# do stuff
}
does use a loop control control variable $i (bad name for a hash key, by the way). That's all fine except that your regex matches still
my $key_value;
for $i (keys %cart_item) {
next if $i =~ /^meta$/ or $i =~ /^\s*$/;
$key_value = $i;
# do stuff
}
or, better still, stick with $_ and write this
for ( keys %cart_item ) {
next if /^meta$/ or /^\s*$/;
my $key_value = $_;
# do stuff
}

Creating multidimensional array while reading a file - Perl

I'm totally new to Perl, and I've been assigned some task... I have to read a tab separated file, and then do some operations with the data in a DB. The .tsv file is like this:
ID Name Date
155 Pedro 1988-05-05
522 Mengano 2002-08-02
So far I thought that creating a multidimensional array with the data of the file will be a good solution to handle this data later. So I read the file line by line, skip the item title columns and save the values in an array. However, I'm having difficulties creating this multidimensional array... this is what I've done so far:
#Read file from path
my #array;
my $fh = path($filename)->openr_utf8;
while (my $line = <$fh>) {
chomp $line;
# skip comments and blank lines and title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/ || $line =~ /ID/;
#split each line into array
my #aux_line = split(/\s+/, $line);
push #array, #{ $aux_line };
}
Obviously, last line is not working... how could be done to create an array of arrays this way? I'm little bit lost with references... And somebody can think of a better way to store this data we read from file? Thank you!
You can also do this with map:
use Data::Dumper;
my #stuff = map {[split]} <$fh>;
print Dumper \#stuff;
(with maybe a grep to skip comments)
But it may suit your use case better to use an array of hashes :
my #stuff ;
chomp(my #header = split ' ', <$fh>);
while ( <$fh>) {
my %this_row;
#this_row{#header} = split;
push ( #stuff, \%this_row) ;
}
First, use strict and use warnings. That would instantly alert you about that your wrong way to get array reference tries to access completely different variable (Perl allows variable of different types have same names).
After that just change your last line to:
push #array, \#aux_line;

Empty array in a perl while loop, should have input

Was working on this script when I came across a weird anomaly. When I go to print #extract after declaring it, it prints correctly the following:
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
Now the weird part, when I then try to print or return #extract (or $column) inside of the while loop, it comes up empty, thus rendering the rest of the script useless. I've never come across this before up until now, haven't been able to find any documentation or people with similar problems as mine. Below is the code, I marked with #<------ where the problems are and are not, to see if anyone can have any idea what is going on? Thank you kindly.
P.S. I am utilizing perl version 5.12.2
use strict;
use warnings;
#use diagnostics;
#use feature qw(say);
open (S, "Val nuc align.txt") || die "cannot open FASTA file to read: $!";
open (OUTPUT, ">output.txt");
my #extract;
my $sum = 0;
my #lines = <S>;
my #seq = ();
my $start = 0; #amino acid column start
my $end = 10; #amino acid column end
#Removing of the sequence tag until amino acid sequence composition (from >gi to )).
foreach my $line (#lines) {
$line =~ s/\n//g;
if ($line =~ />/g) {
$line =~ s/>.*\]/>/g;
push #seq, $line;
}
else {
push #seq, $line;
}
}
my $seq = join ('', #seq);
my #seq_prot = join "\n", split '>', $seq;
#seq_prot = grep {/[A-Z]/} #seq_prot;
#number of sequences
print OUTPUT "Number of sequences:", scalar (grep {defined} #seq_prot), "\n";
#selection of amino acid sequence. From $start to $end.
my #vertical_array;
while ( my $line = <#seq_prot> ) {
chomp $line;
my #split_line = split //, $line;
for my $index ( $start..$end ) { #AA position, extracts whole columns
$vertical_array[$index] .= $split_line[$index];
}
}
# Print out your vertical lines
for my $line ( #vertical_array ) {
my $extract = say OUTPUT for unpack "(a200)*", $line; #split at end of each column
#extract = grep {defined} $extract;
}
print OUTPUT #extract; #<--------------- This prints correctly the input
#Count selected amino acids excluding '-'.
my %counter;
while (my $column = #extract) {
print #extract; #<------------------------ Empty print, no input found
}
Update: Found the main problem to be with the unpack command, I thought I could utilize it to split my columns of my input at X elements (43 in this case). While this works, the minute I change $start to another number that is not 0 (say 200), the code brings up errors. Probably has something to do with the number of column elements does not match the lines. Will keep updated.
Write your last while loop the same way as your previous for loop. The assignment
my $column = #extract
is in scalar context, which does not give you the same result as:
for my $column (#extract)
Instead, it will give you the number of elements in the array. Try this second option and it should work.
However, I still have a concern, because in fact, if #extract had anything in it, you would obtain an infinite loop. Is there any code that you did not include between your two commented lines?

grepping command line arguments out of an array in perl

I have a file that looks like this:
[options42BuySide]
logged-check-times=06:01:00
logged-check-address=192.168.3.4
logged-check-reply=192.168.2.5
logged-check-vac-days=sat,sun
start-time=06:01:00
stop-time=19:00:00
falldown=logwrite after 10000
failtolog=logwrite after 10000
listento=all
global-search-text=Target Down. This message is stored;
[stock42BuySide]
logged-check-times=06:01:00
logged-check-address=192.168.2.13
logged-check-reply=192.168.2.54
logged-check-vac-days=sat,sun
start-time=06:01:00
stop-time=18:00:00
The script grinds the list down to just the name, start and stop time.
sellSide40, start-time=07:05:00, stop-time=17:59:00
SellSide42, start-time=07:06:00, stop-time=17:29:00
SellSide44, start-time=07:31:00, stop-time=16:55:00
42SellSide, start-time=09:01:00, stop-time=16:59:00
The problem is that I would like to filter out specific names from the file with comand line parameters.
I am trying to use the #ARGV array and grep the command line values out of the #nametimes array. Something like :
capser#capser$ ./get_start_stop SorosSellSide42 ETFBuySide42
The script works fine for parsing the file - I just need help on the command line array
#!/usr/bin/perl
use strict ;
use warnings ;
my ($name , $start, $stop, $specific);
my #nametimes;
my $inifile = "/var/log/Alert.ini";
open ( my $FILE, '<', "$inifile") or die ("could not open the file -- $!");
while(<$FILE>) {
chomp ;
if (/\[(\w+)\]/) {
$name = $1;
} elsif (/(start-time=\d+:\d+:\d+)/) {
$start = $1;
} elsif (/(stop-time=\d+:\d+:\d+)/) {
$stop = $1;
push (#nametimes, "$name, $start, $stop");
}
}
for ($a = 0; $a >= $#ARGV ; $a++) {
$specific = (grep /$ARGV[$a]/, #nametimes) ;
print "$specific\n";
}
It is probably pretty easy - however I have worked on it for days, and I am the only guy that uses perl in this shop. I don't have anyone to ask and the googlize is not panning out. I apologize in advance for angering the perl deities who are sure to yell at me for asking such and easy question.
Your construct for looping over #ARGV is a bit unwieldy - the more common way of doing that would be:
for my $name (#ARGV) {
#do something
}
But really, you don't even need to loop over it. You can just join them all directly into a single regular expression:
my $names = join("|", #ARGV);
my #matches = grep { /\b($names)\b/ } #nametimes;
I've used \b in the regex here - that indicates a word boundary, so the argument SellSide4 wouldn't match SellSide42. That may or may not be what you want...
Use an array to store the results from the grep(), not a scalar. Push them, not assign. Otherwise the second iteration of the for loop will overwrite results. Something like:
for my $el ( #ARGV ) {
push #specific, grep { /$el/ } #nametimes);
};
print join "\n", #specific;
The easiest thing to do is to store your INI file as a structure. Then, you can go through your structure and pull out what you want. The simplest structure would be a hash of hashes. Where your heading is the key to the outer hash, and the inner hash is keyed by the parameter:
Here's is creating the basic structure:
use warnings;
use strict;
use autodie;
use feature qw(say);
use Data::Dumper;
use constant INI_FILE => "test.file.txt";
open my $ini_fh, "<", INI_FILE;
my %ini_file;
my $heading;
while ( my $line = <$ini_fh> ) {
chomp $line;
if ( $line =~ /\[(.*)\]/ ) { #Headhing name
$heading = $1;
}
elsif ( $line =~ /(.+?)\s*=\s*(.+)/ ) {
my $parameter = $1;
my $value = $2;
$ini_file{$heading}->{$parameter} = $value;
}
else {
say "Invalid line $. - $line";
}
}
After this, the structure will look like this:
$VAR1 = {
'options42BuySide' => {
'stop-time' => '19:00:00',
'listento' => 'all',
'logged-check-reply' => '192.168.2.5',
'logged-check-vac-days' => 'sat,sun',
'falldown' => 'logwrite after 10000',
'start-time' => '06:01:00',
'logged-check-address' => '192.168.3.4',
'logged-check-times' => '06:01:00',
'failtolog' => 'logwrite after 10000',
'global-search-text' => 'Target Down. This message is stored;'
},
'stock42BuySide' => {
'stop-time' => '18:00:00',
'start-time' => '06:01:00',
'logged-check-reply' => '192.168.2.54',
'logged-check-address' => '192.168.2.13',
'logged-check-vac-days' => 'sat,sun',
'logged-check-times' => '06:01:00'
}
};
Now, all you have to do is parse your structure and pull the information you want out of it:
for my $heading ( sort keys %ini_file ) {
say "$heading " . $ini_file{$heading}->{"start-time"} . " " . $ini_file{$heading}->{"stop-time"};
}
You could easily modify this last loop to skip the headings you want, or to print out the exact parameters you want.
I would also recommend using Getopt::Long to parse your command line parameters:
my_file -include SorosSellSide42 -include ETFBuySide42 -param start-time -param stop-time
Getopt::Long could store your parameters in arrays. For example. It would put all the -include parameters in an #includes array and all the -param parameters in an #parameters array:
for my $heading ( #includes ) {
print "$heading ";
for my $parameter ( #parameters ) {
print "$ini_file{$heading}->{$parameter} . " ";
}
print "\n;
}
Of course, there needs to be lots of error checking (does the heading exist? What about the requested parameters?). But, this is the basic structure. Unless your file is extremely long, this is probably the easiest way to process it. If your file is extremely long, you could use the #includes and #parameters in the first loop as you read in the parameters and headings.

Resources