Perl - initialization of hash - arrays

I'm not sure how to correctly initialize my hash - I'm trying to create a key/value pair for values in coupled lines in my input file.
For example, my input looks like this:
#cluster t.18
46421 ../../../output###.txt/
#cluster t.34
41554 ../../../output###.txt/
I'm extracting the t number from line 1 (#cluster line) and matching it to output###.txt in the second line (line starting with 46421). However, I can't seem to get these values into my hash with the script that I have written.
#!/usr/bin/perl
use warnings;
use strict;
my $key;
my $value;
my %hash;
my $filename = 'input.txt';
open my $fh, '<', $filename or die "Can't open $filename: $!";
while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^\#cluster/) {
my #fields = split /(\d+)/, $line;
my $key = $fields[1];
}
elsif ($line =~ m/^(\d+)/) {
my #output = split /\//, $line;
my $value = $output[5];
}
$hash{$key} = $value;
}

It's a good idea, but your $key that is created with my in the if block is a local variable scoped to that block, masking the global $key. Inside the if block the symbol $key has nothing to do with the one you nicely declared upfront. See my in perlsub.
This local $key goes out of scope as soon as if is done and does not exist outside the if block. The global $key is again available after the if, being visible elsewhere in the loop, but is undefined since it has never been assigned to. The same goes for $value in the elsif block.
Just drop the my declaration inside the loop, thus assign to those global variables (as intended?). So, $key = ... and $value = ..., and the hash will be assigned correctly.
Note -- this is about how to get that hash assignment right. I don't know how your actual data looks and whether the line is parsed correctly. Here is a toy input.txt
#cluster t.1
1111 ../../../output1.1.txt/
#cluster t.2
2222 ../../../output2.2.txt/
I pick the 4th field instead of the 6th, $value = $output[3];, and add
print "$_ => $hash{$_}\n" for keys %hash;
after the loop. This prints
1 => output1.1.txt
2 => output2.2.txt
I am not sure whether this is what you want but the hash is built fine.
A comment on choice of tools in parsing
You parse the lines for numbers, by using the property of split to return the separators as well, when they are captured. That is neat, but in some sense it reverses its main purpose, which is to extract other components from the string, as delimited by the pattern. Thus it may make the purpose of the code a little bit convoluted, and you also have to index very precisely to retrieve what you need.
Instead of using split to extract the delimiter itself, which is given by a regex, why not extract it by a regex? That makes the intention crystal clear, too. For example, with input
#cluster t.10 has 4319 elements, 0 subclusters
37652 ../../../../clust/output43888.txt 1.397428
the parsing can go as
if ($line =~ m/^\#cluster/) {
($key) = $line =~ /t\.(\d+)/;
}
elsif ($line =~ m/^(\d+)/) {
($value) = $line =~ m|.*/(\w+\.txt)|;
}
$hash{$key} = $value if defined $key and defined $value;
where t\. and \.txt are added to more precisely specify the targets. If the target strings aren't certain to have that precise form, just capture \d+, and in the second case all non-space after the last /, say by m|^\d+.*/(\S+)|. We use the greediness of .*, which matches everything possible up to the thing that comes after it (a /), thus all the way to the very last /.
Then you can also reduce it to a single regex for each line, for example
if ($line =~ m/^\#cluster\s+t\.(\d+)/) {
$key = $1;
}
elsif ($line =~ m|^\d+.*/(\w+\.txt)|) {
$value = $1;
}
Note that I've added a condition to the hash assignment. The original code in fact assigns an undef on the first iteration, since no $value had yet been seen at that point. This is overwritten on the next iteration and we don't see it if we only print the hash afterwards. The condition also guards you against failed matches, for malformatted lines or such. Of course, far better checks can be run.

Related

Creating multidimensional array while reading a file - Perl

I'm totally new to Perl, and I've been assigned some task... I have to read a tab separated file, and then do some operations with the data in a DB. The .tsv file is like this:
ID Name Date
155 Pedro 1988-05-05
522 Mengano 2002-08-02
So far I thought that creating a multidimensional array with the data of the file will be a good solution to handle this data later. So I read the file line by line, skip the item title columns and save the values in an array. However, I'm having difficulties creating this multidimensional array... this is what I've done so far:
#Read file from path
my #array;
my $fh = path($filename)->openr_utf8;
while (my $line = <$fh>) {
chomp $line;
# skip comments and blank lines and title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/ || $line =~ /ID/;
#split each line into array
my #aux_line = split(/\s+/, $line);
push #array, #{ $aux_line };
}
Obviously, last line is not working... how could be done to create an array of arrays this way? I'm little bit lost with references... And somebody can think of a better way to store this data we read from file? Thank you!
You can also do this with map:
use Data::Dumper;
my #stuff = map {[split]} <$fh>;
print Dumper \#stuff;
(with maybe a grep to skip comments)
But it may suit your use case better to use an array of hashes :
my #stuff ;
chomp(my #header = split ' ', <$fh>);
while ( <$fh>) {
my %this_row;
#this_row{#header} = split;
push ( #stuff, \%this_row) ;
}
First, use strict and use warnings. That would instantly alert you about that your wrong way to get array reference tries to access completely different variable (Perl allows variable of different types have same names).
After that just change your last line to:
push #array, \#aux_line;

Array got flushed after while loop within a filehandle

I got a problem with a Perl script.
Here is the code:
use strict;
use Data::Dumper;
my #data = ('a','b');
if(#data){
for(#data){
&debug_check($_)
}
}
print "#data";#Nothing is printed, the array gets empty after the while loop.
sub debug_check{
my $ip = shift;
open my $fh, "<", "debug.txt";
while(<$fh>){
print "$_ $ip\n";
}
}
Array data, in this example, has two elements. I need to check if the array has elements. If it has, then for each element I call a subroutine, in this case called "debug_check". Inside the subroutine I need to open a file to read some data. After I read the file using a while loop, the data array gets empty.
Why the array is being flushed and how do I avoid this strange behavior?
Thanks.
The problem here I think, will be down to $_. This is a bit of a special case, in that it's an alias to a value. If you modify $_ within a loop, it'll update the array. So when you hand it into the subroutine, and then shift it, it also updates #data.
Try:
my ( $ip ) = #_;
Or instead:
for my $ip ( #array ) {
debug_check($ip);
}
Note - you should also avoid using an & prefix to a sub. It has a special meaning. Usually it'll work, but it's generally redundant at best, and might cause some strange glitches.
while (<$fh>)
is short for
while (defined($_ = <$fh>))
$_ is currently aliased to the element of #data, so your sub is replacing each element of #data with undef. Fix:
while (local $_ = <$fh>)
which is short for
while (defined(local $_ = <$fh>))
or
while (my $line = <$fh>) # And use $line instead of $_ afterwards
which is short for
while (defined(my $line = <$fh>))
Be careful when using global variables. You want to localize them if you modify them.

using string variables as names for hash in loop - perl [duplicate]

This question already has answers here:
How can I use a variable as a variable name in Perl?
(3 answers)
Closed 8 years ago.
I have an array of strings that I want to use in a loop and then call later as names of a hash. I wrote a test file to play around with it, but I can't get it to work no matter how hard I am trying. It keeps giving me "can't use string as a hash ref while strict refs is in use". I can turn off strict refs, but then the code simply skips those lines and doesn't do what I want. Is there any way to evaluate a variable that contains a string and then pipe it to the name of a hash?
Example code:
#usr/bin/perl
use strict;
my %combined = ();
my %highmut = ();
$combined{'a'} = 1;
$combined{'b'} = 2;
$highmut{'c'} = 3;
$highmut{'d'} = 4;
my #counts = ('combined', 'highmut');
foreach my $count (#counts) {
my $file = $count . '.txt';
open(my $random, ">>", $file);
foreach my $key (keys %{$count}) {
print $random "key: $key \t %{$count{$key}} \n";
close($random);
}
}
Specifically, the issues are with %{$count} and %{$count{$key}} in the last few lines.
What I want is to evaluate $count (as say combined) and then feed it to use as hash name (%combined).
Is there any way to do this?
Thanks
You can, but it's a bad idea. Consider using a multidimensional data structure instead.
my %data = ( combined => { a => 1,
b => 2,
},
highmut => { c => 3,
d => 4,
},
);
foreach my $count (keys %data) {
my $file = $count . '.txt';
open(my $random, ">>", $file);
foreach my $key (keys %{$data{$count}}) {
print $random "key: $key \t $data{$count}{$key} \n";
}
close($random);
}
Note: you also had a bug in your program flow. You open a file based on the count type, loop over your keys, write the first one, close the file, and continue looping with no filehandle open to write to for subsequent keys. I therefore moved your close out of the iteration over keys.
Further reading:
http://perldoc.perl.org/perlfaq7.html#How-can-I-use-a-variable-as-a-variable-name%3f
perl variable substitution in variable name
In Perl, how can I use a string as a variable name?
Using variable strings to access named hashes is a bad idea for all kinds of maintainability and spooky-action-at-a-distance reasons. That's why use strict 'refs' doesn't let you do it on purpose or by accident. If that's what you really want to do, disable strict refs:
foreach my $count (#counts) {
my $file = $count . '.txt';
open(my $random, ">>", $file);
no strict 'refs';
foreach my $key (keys %{$count}) {
...

Empty array in a perl while loop, should have input

Was working on this script when I came across a weird anomaly. When I go to print #extract after declaring it, it prints correctly the following:
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
Now the weird part, when I then try to print or return #extract (or $column) inside of the while loop, it comes up empty, thus rendering the rest of the script useless. I've never come across this before up until now, haven't been able to find any documentation or people with similar problems as mine. Below is the code, I marked with #<------ where the problems are and are not, to see if anyone can have any idea what is going on? Thank you kindly.
P.S. I am utilizing perl version 5.12.2
use strict;
use warnings;
#use diagnostics;
#use feature qw(say);
open (S, "Val nuc align.txt") || die "cannot open FASTA file to read: $!";
open (OUTPUT, ">output.txt");
my #extract;
my $sum = 0;
my #lines = <S>;
my #seq = ();
my $start = 0; #amino acid column start
my $end = 10; #amino acid column end
#Removing of the sequence tag until amino acid sequence composition (from >gi to )).
foreach my $line (#lines) {
$line =~ s/\n//g;
if ($line =~ />/g) {
$line =~ s/>.*\]/>/g;
push #seq, $line;
}
else {
push #seq, $line;
}
}
my $seq = join ('', #seq);
my #seq_prot = join "\n", split '>', $seq;
#seq_prot = grep {/[A-Z]/} #seq_prot;
#number of sequences
print OUTPUT "Number of sequences:", scalar (grep {defined} #seq_prot), "\n";
#selection of amino acid sequence. From $start to $end.
my #vertical_array;
while ( my $line = <#seq_prot> ) {
chomp $line;
my #split_line = split //, $line;
for my $index ( $start..$end ) { #AA position, extracts whole columns
$vertical_array[$index] .= $split_line[$index];
}
}
# Print out your vertical lines
for my $line ( #vertical_array ) {
my $extract = say OUTPUT for unpack "(a200)*", $line; #split at end of each column
#extract = grep {defined} $extract;
}
print OUTPUT #extract; #<--------------- This prints correctly the input
#Count selected amino acids excluding '-'.
my %counter;
while (my $column = #extract) {
print #extract; #<------------------------ Empty print, no input found
}
Update: Found the main problem to be with the unpack command, I thought I could utilize it to split my columns of my input at X elements (43 in this case). While this works, the minute I change $start to another number that is not 0 (say 200), the code brings up errors. Probably has something to do with the number of column elements does not match the lines. Will keep updated.
Write your last while loop the same way as your previous for loop. The assignment
my $column = #extract
is in scalar context, which does not give you the same result as:
for my $column (#extract)
Instead, it will give you the number of elements in the array. Try this second option and it should work.
However, I still have a concern, because in fact, if #extract had anything in it, you would obtain an infinite loop. Is there any code that you did not include between your two commented lines?

How to create hash of array in Perl

I have a data like this
Group AT1G01040-TAIR-G
LOC_Os03g02970 69%
Group AT1G01050-TAIR-G
LOC_Os10g26600 85%
LOC_Os10g26633 35%
Group AT1G01090-TAIR-G
LOC_Os04g02900 74%
How can create the data structure that looks like this:
print Dumper \%big;
$VAR = { "Group AT1G01040-TAIR-G" => ['LOC_Os03g02970 69%'],
"Group AT1G01050-TAIR-G" => ['LOC_Os10g26600 85%','LOC_Os10g26633 35%'],
"Group AT1G01090-TAIR-G" => ['LOC_Os04g02900 74%']};
This is my attempt, but fail:
my %big;
while ( <> ) {
chomp;
my $line = $_;
my $head = "";
my #temp;
if ( $line =~ /^Group/ ) {
$head = $line;
$head =~ s/[\r\s]+//g;
#temp = ();
}
elsif ($line =~ /^\t/){
my $cont = $line;
$cont =~ s/[\t\r]+//g;
push #temp, $cont;
push #{$big{$head}},#temp;
};
}
Here's how I'd do it:
my %big;
my $currentGroup;
while (my $line = <> ) {
chomp $line;
if ( $line =~ /^Group/ ) {
$big{$line} = $currentGroup = [];
}
elsif ($line =~ s/^\t+//) {
push #$currentGroup, $line;
}
}
You should probably add some additional error checking to this, e.g. an else clause to warn about lines that don't match either regex. Also, check to see if $currentGroup is undef before pushing (in case the first line begins with a tab instead of "Group").
The biggest problem with your original code is that you're declaring and initializing $head and #temp inside the loop, which means they got reset on every line. Variables that need to persist across lines have to be declared outside the loop, as I've done with $currentGroup.
I'm not quite sure what you're intending to accomplish with the s/[\r\s]+//g; bit. \r is included in \s, so that means the same as s/\s+//g; (which would strip all whitespace), but your desired result hash includes whitespace in your keys. If you want to strip trailing whitespace, you need to include an anchor: s/\s+\z//.
Well, I don't want to give you an answer, so I'll just tell you to look at:
perlref
perlreftut
Well, there ya go :-).
Your pushing arrays to your hash item. You should just be pushing the values. (You don't need #temp at all.)
push #{$big{$head}}, $cont;
Also $head must be declared outside your loop, otherwise it looses its value after each iteration.

Resources