Identify items in hash with matching and non-matching criteria - arrays

I have two tab-delimited files:
one is a reference with thousands of entries
and the other is a list of millions of criteria
that are used to search the reference.
I make a hash of the reference file with the following code
use strict;
use warnings;
#use Data::Dumper;
#use Timer::Runtime;
use feature qw( say );
my $in_qfn = $ARGV[0];
my $out_qfn = $ARGV[1];
my $transcripts_qfn = "file";
my %transcripts;
{
open(my $transcripts_fh, "<", $transcripts_qfn)
or die("Can't open \"$transcripts_qfn\": $!\n");
while ( <$transcripts_fh> ) {
chomp;
my @refs = split(/\t/, $_);
my ($ref_chr, $ref_strand) = @refs[0, 6];
my $values = {
start => $refs[3],
end => $refs[4],
info => $refs[8]
};
#print Data::Dumper->Dump([$values]), $/; #confirm structure is fine
push @{ $transcripts{$ref_chr}{$ref_strand} }, $values;
}
}
Then I open the other input file, define the elements, and parse the hash to find matching criteria
while ( <$in_fh> ) {
chomp;
my ($x, $strand, $chr, $y, $z) = split(/\t/, $_);
#match the reference hash for things equal to $chr and $strand
my $transcripts_array = $transcripts{$chr}{$strand};
for my $transcript ( @$transcripts_array ) {
my $start = $transcript->{start};
my $end = $transcript->{end};
my $info = $transcript->{info};
#print $info and other criteria from if statements to outfile, this code works
}
}
This works, but I would like to know if I can then find elements in the hash that match $chr but not $strand (which has a binary value of either sign).
I put the following into the same while block after the previous for, but it does not appear to work
my $transcripts_opposite_strand = $transcripts{$chr}{!$strand};
for my $transcript (@$transcripts_opposite_strand) {
my $start = $transcript->{start};
my $end = $transcript->{end};
my $info = $transcript->{info};
#print $info and other criteria from if statements
}
I apologize for the code snippets; I tried to keep the relevant information. Because of the size of the files I can't really brute force it by going line by line by line.

The negation operator ! enforces boolean context on its argument. "+" and "-" are both true in boolean context, so ! $strand is always false, i.e. "" in string context.
Either store a boolean value in the hash
$strand = $strand eq '+';
or don't use boolean negation:
my $transcripts_opposite_strand = $transcripts{$chr}{ $strand eq '+' ? '-' : '+' };
The ternary operator can be replaced by shorter but less readable alternatives, e.g.
qw( + - )[ $strand eq '+' ]
because in numeric context, true is interpreted as 1 and false as 0.
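The ternary variant can be sketched as a small helper. This is only an illustration: the miniature %transcripts data and the sub names opposite_strand and opposite_transcripts are made up here, and the sketch assumes the strand column only ever holds "+" or "-".

```perl
use strict;
use warnings;

# Hypothetical miniature of the %transcripts structure from the question.
my %transcripts = (
    chr1 => {
        '+' => [ { start => 100, end => 200, info => 'geneA' } ],
        '-' => [ { start => 150, end => 250, info => 'geneB' } ],
    },
);

# Return the other strand sign; anything that is not '+' maps to '+'.
sub opposite_strand {
    my ($strand) = @_;
    return $strand eq '+' ? '-' : '+';
}

# Return the transcripts on the opposite strand of the same chromosome,
# or an empty list when the chromosome or strand is absent.
# The lookup is done in two steps to avoid autovivifying $transcripts{$chr}.
sub opposite_transcripts {
    my ($chr, $strand) = @_;
    my $by_strand = $transcripts{$chr} or return;
    my $list = $by_strand->{ opposite_strand($strand) } or return;
    return @$list;
}

for my $t ( opposite_transcripts('chr1', '+') ) {
    print "$t->{start}\t$t->{end}\t$t->{info}\n";
}
```

Because the hash is already keyed by chromosome and strand, the opposite-strand lookup costs the same as the same-strand one, so it should not add a scan over the whole reference.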

Related

Is it possible to put elements of array to hash in perl?

example of file content:
>random sequence 1 consisting of 500 residues.
VILVWRISEMNPTHEIYPEVSYEDRQPFRCFDEGINMQMGQKSCRNCLIFTRNAFAYGIV
HFLEWGILLTHIIHCCHQIQGGCDCTRHPVRFYPQHRNDDVDKPCQTKSPMQVRYGDDSD;
>random sequence 2 consisting of 500 residues.
KAAATKKPWADTIPYLLCTFMQTSGLEWLHTDYNNFSSVVCVRYFEQFWVQCQDHVFVKN
KNWHQVLWEEYAVIDSMNFAWPPLYQSVSSNLDSTERMMWWWVYYQFEDNIQIRMEWCNI
YSGFLSREKLELTHNKCEVCVDKFVRLVFKQTKWVRTMNNRRRVRFRGIYQQTAIQEYHV
HQKIIRYPCHVMQFHDPSAPCDMTRQGKRMNFCFIIFLYTLYEVKYWMHFLTYLNCLEHR;
>random sequence 3 consisting of 500 residues.
AYCSCWRIHNVVFQKDVVLGYWGHCWMSWGSMNQPFHRQPYNKYFCMAPDWCNIGTYAWK
I need an algorithm to build a hash $hash{$key} = $value; where lines starting with > are the values and following lines are the keys.
What I have tried:
open (DATA, "seq-at.txt") or die "blabla";
@data = <DATA>;
%result = ();
$k = 0;
$i = 0;
while($k != @data) {
$info = @data[$k]; # deletes the first element
if(@data[$i] !=~ ">") {
$key .= @data[$i]; $i++;
} else {
$k = $i;
}
$result{$key} = $value;
}
but it doesn't work.
You don't need to read the file into an array first; you can build the hash directly:
use strict;
use warnings;
# ^- always start your code like this to see errors and ambiguities
# declare your variables using "my" to specify the scope
my $filename = 'seq-at.txt';
# use the three-argument form of open to avoid overwriting the file:
open my $fh, '<', $filename or die "unable to open '$filename' $!";
my %hash;
my $hkey = '';
my $hval = '';
while (<$fh>) {
chomp; # remove the newline \n (or \r\n)
if (/^>/) { # when the line start with ">"
# store the key/value in the hash if the key isn't empty
# (the key is empty when the first ">" is encountered)
$hash{$hkey} = $hval if ($hkey);
# store the line in $hval and clear $hkey
($hval, $hkey) = $_;
} elsif (/\S/) { # when the line isn't empty (or blank)
# append the line to the key
$hkey .= $_;
}
}
# store the last key/val in the hash if any
$hash{$hkey} = $hval if ($hkey);
# display the hash
foreach (keys %hash) {
print "key: $_\nvalue: $hash{$_}\n\n";
}
It is unclear what you want; the array seems to be the lines following each random sequence header. If the contents of a file test.txt are:
Line 1:">"random sequence 1 consisting of 500 residues.
Line 2:VILVWRISEMNPTHEIYPEVSYEDRQPFRCFDEGINMQMGQKSCRNCLIFTRNAFAYGIV
Line 3:HFLEWGILLTHIIHCCHQIQGGCDCTRHPVRFYPQHRNDDVDKPCQTKSPMQVRYGDDSD;
You could try something like:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $contentFile = $ARGV[0];
my %testHash = ();
my $currentKey = "";
open(my $contentFH,"<",$contentFile) or die "Can't open '$contentFile': $!";
while(my $contentLine = <$contentFH>){
chomp($contentLine);
next if($contentLine eq ''); # Empty lines.
if($contentLine =~ /^"\>"(.*)/){
$currentKey= $1;
}else{
push(@{$testHash{$currentKey}},$contentLine);
}
}
print Dumper(\%testHash);
Which results in a structure like this:
seb@amon:[~]$ perl test.pl test.txt
$VAR1 = {
'random sequence 3 consisting of 500 residues.' => [
'AYCSCWRIHNVVFQKDVVLGYWGHCWMSWGSMNQPFHRQPYNKYFCMAPDWCNIGTYAWK'
],
'random sequence 1 consisting of 500 residues.' => [
'VILVWRISEMNPTHEIYPEVSYEDRQPFRCFDEGINMQMGQKSCRNCLIFTRNAFAYGIV',
'HFLEWGILLTHIIHCCHQIQGGCDCTRHPVRFYPQHRNDDVDKPCQTKSPMQVRYGDDSD;'
],
'random sequence 2 consisting of 500 residues.' => [
'KAAATKKPWADTIPYLLCTFMQTSGLEWLHTDYNNFSSVVCVRYFEQFWVQCQDHVFVKN',
'KNWHQVLWEEYAVIDSMNFAWPPLYQSVSSNLDSTERMMWWWVYYQFEDNIQIRMEWCNI',
'YSGFLSREKLELTHNKCEVCVDKFVRLVFKQTKWVRTMNNRRRVRFRGIYQQTAIQEYHV',
'HQKIIRYPCHVMQFHDPSAPCDMTRQGKRMNFCFIIFLYTLYEVKYWMHFLTYLNCLEHR;'
]
};
You would basically be using each hash "value" as an array; the @{$variable} dereference does the magic.

Perl: Inserting values into specific columns of CSV file

I have CSV data of the form:
S.No,Label,Customer1,Customer2,Customer3...
1,label1,Y,N,Y
2,label2,N,Y,N
...
I need to reproduce the "label" to the left of "customer" columns marked with Y - and have nothing ("") to the left of columns marked with N.
Expected output:
S.No,Label,Customer1,Customer1,Customer2,Customer2,Customer3,Customer3...
1,label1,label1,Y,"",N,label1,Y
2,label2,"",N,label2,Y,"",N
When opened using Excel, it would look like this:
S.No Label Customer1 Customer1 Customer2 Customer2 Customer3 Customer3...
1 label1 label1 Y N label1 Y
2 label2 N label2 Y N
The two leftmost columns, referring to S.No and the original "Label" column, are constant.
What is the simplest way to do this? I tried the following code:
use strict;
use warnings;
my $nonIncludesFile = "nonIncludes.csv";
open(my $xfh, "+>", $nonIncludesFile) or warn "Unable to open $nonIncludesFile, $!";
chomp( my $header = <$xfh> );
my @names = split ",", $header;
my @names1;
my @fields;
my @fields1;
for(my $j=0; $j< scalar(@names); $j++)
{
$names1[$j] = $names[$j];
}
while(<$xfh>)
{
my $nonIncLine = $_;
$nonIncLine = chomp($nonIncLine);
@fields = split ",", $nonIncLine;
next if $. == 1; #skip the first line
for(my $i = 0; $i < scalar(@fields) -2; $i++) #Number of "customers" = scalar(@fields) -2
{
$fields1[0] = $fields[0];
$fields1[1] = $fields[1];
if('Y' eq $fields[ $i + 2 ])
{
$fields1[$i+2] = 'Y';
substr(@fields1, $i + 1, 0, $fields[1]); #insert the label to the left - HERE
}
else
{
$fields1[$i+2] = 'N';
substr(@fields1, $i + 1, 0, "");
}
}
}
print $xfh @names1;
print $xfh @fields1;
close($xfh);
This however complains of "substr outside of string" at the line marked by "HERE".
What am I doing wrong? And is there any simpler (and better) way to do this?
Something like this maybe?
#!/usr/bin/perl
use strict;
use warnings;
#read the header row
chomp( my ( $sn, $label, @customers ) = split( /,/, <DATA> ) );
#double the 'customers' column headings (one is suffixed "_label")
print join( ",", $sn, $label, map { $_ . "_label", $_ } @customers ), "\n";
#iterate data
while (<DATA>) {
#strip trailing linefeed
chomp;
#extract fields with split - note breaks if you've quoted commas inline.
my ( $sn, $label, @row ) = split /,/;
#pair each Y/N value with the label (for "Y") or a blank, then print the whole row
print join( ",", $sn, $label, map { $_ eq "Y" ? ( $label, $_ ) : ( "", $_ ) } @row ), "\n";
}
__DATA__
S.No,Label,Customer1,Customer2,Customer3
1,label1,Y,N,Y
2,label2,N,Y,N
Assumes you don't have any fruity special characters (e.g. commas) in the fields, because it'll break if you do, and you might want to consider Text::CSV instead.
It is always much better to post some usable test data than to write a question like this one.
However, it looks like your data has no quoted fields or escaped characters, so you can just use split and join to process the CSV data.
Here's a sample Perl program that fulfils your requirement. The example output uses your data as it is. Each line of data has to be processed backwards so that the insertions don't affect the indices of elements that are yet to be processed.
use strict;
use warnings 'all';
use feature 'say';
while ( <DATA> ) {
chomp;
my @fields = split /,/;
for ( my $i = $#fields; $i > 1; --$i ) {
my $newval =
$. == 1 ? $fields[$i] :
lc $fields[$i] eq 'y' ? $fields[1] :
'';
splice @fields, $i, 0, $newval;
}
say join ',', @fields;
}
__DATA__
S.No,Label,Customer1,Customer2,Customer3...
1,label1,Y,N,Y
2,label2,N,Y,N
output
S.No,Label,Customer1,Customer1,Customer2,Customer2,Customer3...,Customer3...
1,label1,label1,Y,,N,label1,Y
2,label2,,N,label2,Y,,N

Perl : matching the contents of a file with the contents of an array

I have an array @arr1 where each element is of the form #define A B.
I have another file, f1 with contents:
#define,x,y
#define,p,q
and so on. I need to check if the second value of every line (y, q etc) matches the first value in any element of the array. Example: say the array has an element #define abc 123 and the file has a line #define,hij,abc.
When such a match occurs, I need to add the line #define hij 123 to the array.
while(<$fhDef>) #Reading the file
{
chomp;
$_ =~ tr/\r//d;
if(/#define,(\w+),(\w+)/)
{
my $newLabel = $1;
my $oldLabel = $2;
push @oldLabels, $oldLabel;
push @newLabels, $newLabel;
}
}
foreach my $x(@tempX) #Reading the array
{
chomp $x;
if($x =~ /#define\h{1}\w+\h*0x(\w+)\h*/)
{
my $addr = $1;
unless(grep { $x =~ /$_/ } @oldLabels)
{
next;
}
my $index = grep { $oldLabels[$_] eq $_ } 0..$#oldLabels;
my $new1 = $newLabels[$index];
my $headerLabel1 = $headerLabel."X_".$new1;
chomp $headerLabel1;
my $headerLine = "#define ".$headerLabel1."0x".$addr;
push @tempX, $headerLine;
}
}
This just hangs. No doubt I'm missing something right in front of me, but what??
The canonical way is to use a hash. Hash the array, using the first argument as the key. Then walk the file and check for existence of the key in the hash. I used a HoA (hash of arrays) to handle multiple values for each key (see the last two lines).
#! /usr/bin/perl
use warnings;
use strict;
my @arr1 = ( '#define y x',
'#define abc 123',
);
my %hash;
for (@arr1) {
my ($arg1, $arg2) = (split ' ')[1, 2];
push @{ $hash{$arg1} }, $arg2;
}
while (<DATA>) {
chomp;
my ($arg1, $arg2) = (split /,/)[1, 2];
if ($hash{$arg2}) {
print "#define $arg1 $_\n" for @{ $hash{$arg2} };
}
}
__DATA__
#define,x,y
#define,p,q
#define,hij,abc
#define,klm,abc
As the other answer said, it's better to use a hash. Also, keep in mind that you're doing a
foreach my $x(@tempX)
but you're also doing a
push @tempX, $headerLine;
which means that you're modifying the array on which you're iterating. This is not just bad practice, this also means that you're most likely going to have an infinite loop because of it.
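One way to avoid growing the array while looping over it is to collect the new lines in a separate array and append them once the loop has finished. The sketch below is only illustrative: the sample data, the %new_for lookup hash and the sub name derived_defines are assumptions, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical data: existing definitions and an old-label => new-label map.
my @defines = ( '#define abc 0x123', '#define xyz 0x456' );
my %new_for = ( abc => 'hij' );

# Build the extra lines in a separate array instead of pushing onto
# the array that is being iterated.
sub derived_defines {
    my ($defines, $new_for) = @_;
    my @additions;
    for my $line (@$defines) {
        # Skip lines that do not look like "#define NAME VALUE".
        my ($old, $value) = $line =~ /^#define\s+(\w+)\s+(\S+)/ or next;
        next unless exists $new_for->{$old};
        push @additions, "#define $new_for->{$old} $value";
    }
    return @additions;
}

# Append only after the loop inside derived_defines has completed.
push @defines, derived_defines(\@defines, \%new_for);
print "$_\n" for @defines;
```

The loop visits each original element exactly once, so it terminates regardless of how many lines are added.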

Perl: Empty/broken AoH

I have subroutine in my module which checks (regular) user password age using regex search on shadow file:
Module.pm
my $pwdsetts_dump = "tmp/shadow_dump.txt";
system("cat /etc/shadow > $pwdsetts_dump");
open (my $fh1, "<", $pwdsetts_dump) or die "Could not open file '$pwdsetts_dump': $!";
sub CollectPWDSettings {
my @pwdsettings;
while (my $array = <$fh1>) {
if ($array =~ /^(\S+)[:][$]\S+[:](1[0-9]{4})/) {
my $pwdchange = "$2";
if ("$2" eq "0") {
$pwdchange = "Next login";
}
my %hash = (
"Username" => $1,
"Last change" => $pwdchange
);
push (@pwdsettings, \%hash);
}
}
my $current_date = int(time()/86400); # epoch
my $ndate = shift @_; # n-days
my $search_date = int($current_date - $ndate);
my @sorted = grep{$_->{'Last change'} > $search_date} @pwdsettings;
return \@sorted;
}
Script is divided in 2 steps:
1. load all password settings
2. search for password which is older than n-days
In my main script I use following script:
my ($user_changed_pwd);
if (grep{$_->{'Username'} eq $users_to_check} @{Module::CollectPWDSettings("100")}) {
$user_changed_pwd = "no";
}
else {
$user_changed_pwd = "yes";
}
The problem occurs in the first step: the AoH never gets populated. I'm also pretty sure that this subroutine always worked for me, and strict and warnings never complained about it, but now, for some reason, it refuses to work.
I've just run your regex against my /etc/shadow and got no matches. If I drop the leading 1 I get a few hits.
E.g.:
$array =~ /^(\S+)[:][$]\S+[:]([0-9]{4})/
But personally - I would suggest not trying to regex, and instead rely on the fact that /etc/shadow is defined as delimited by :.
my @fields = split ( /:/, $array );
$1 contains a bunch of stuff, and I suspect what you actually want is the username - but because \S+ is greedy, you might be accidentally ending up with encrypted passwords.
Which will be $fields[0].
And then the 'last change' field - from man shadow is $fields[2].
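A minimal sketch of the split-based approach, assuming the usual colon-delimited layout described in man shadow. The sub name parse_shadow_line and the sample record are made up for illustration; the hash in it is not a real one.

```perl
use strict;
use warnings;

# Split one shadow-style record into the two fields mentioned above.
# Field 0 is the login name, field 2 the day of the last password change
# (counted in days since 1970-01-01).
sub parse_shadow_line {
    my ($line) = @_;
    chomp $line;
    # The -1 limit keeps trailing empty fields, although only 0 and 2 are used.
    my ($user, $last_change) = ( split /:/, $line, -1 )[0, 2];
    return ( $user, $last_change );
}

# Made-up record; single quotes stop Perl from interpolating the $ signs.
my $sample = 'alice:$6$salt$hashedpassword:17000:0:99999:7:::';
my ($user, $last_change) = parse_shadow_line($sample);
print "$user last changed on day $last_change\n";
```

Because the record is split on the delimiter, a greedy pattern can no longer run across field boundaries.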
I think your regex pattern is the main problem. Don't forget that \S matches any non-space character, including colons, and \S+ will try to match as much as possible, so it will happily skip over multiple fields of the file.
I think using split to separate each record into colon-delimited fields is a better approach. I also think that, instead of the array of two-element hashes @pwdsettings, it would be better to store the data as a hash relating usernames to their last password change.
Here's how I would write this. It prints a list of all usernames whose last password change is more than 90 days old:
use strict;
use warnings;
use Time::Seconds 'ONE_DAY';
my @shadow = do {
open my $fh, '<', '/etc/shadow' or die qq{Unable to open "/etc/shadow" for input: $!};
<$fh>;
};
print "$_\n" for @{ collect_pwd_settings(90) };
sub collect_pwd_settings {
my ($ndate) = @_;
my %pwdsettings;
for ( @shadow ) {
my ($user, $pwdchange) = (split /:/)[0,2];
$pwdsettings{$user} = $pwdchange;
}
my $current_date = time / ONE_DAY;
my @filtered = grep { $current_date - $pwdsettings{$_} > $ndate } keys %pwdsettings;
return \@filtered;
}

grepping command line arguments out of an array in perl

I have a file that looks like this:
[options42BuySide]
logged-check-times=06:01:00
logged-check-address=192.168.3.4
logged-check-reply=192.168.2.5
logged-check-vac-days=sat,sun
start-time=06:01:00
stop-time=19:00:00
falldown=logwrite after 10000
failtolog=logwrite after 10000
listento=all
global-search-text=Target Down. This message is stored;
[stock42BuySide]
logged-check-times=06:01:00
logged-check-address=192.168.2.13
logged-check-reply=192.168.2.54
logged-check-vac-days=sat,sun
start-time=06:01:00
stop-time=18:00:00
The script grinds the list down to just the name, start and stop time.
sellSide40, start-time=07:05:00, stop-time=17:59:00
SellSide42, start-time=07:06:00, stop-time=17:29:00
SellSide44, start-time=07:31:00, stop-time=16:55:00
42SellSide, start-time=09:01:00, stop-time=16:59:00
The problem is that I would like to filter out specific names from the file with command line parameters.
I am trying to use the @ARGV array and grep the command line values out of the @nametimes array. Something like:
capser@capser$ ./get_start_stop SorosSellSide42 ETFBuySide42
The script works fine for parsing the file - I just need help on the command line array
#!/usr/bin/perl
use strict ;
use warnings ;
my ($name , $start, $stop, $specific);
my @nametimes;
my $inifile = "/var/log/Alert.ini";
open ( my $FILE, '<', "$inifile") or die ("could not open the file -- $!");
while(<$FILE>) {
chomp ;
if (/\[(\w+)\]/) {
$name = $1;
} elsif (/(start-time=\d+:\d+:\d+)/) {
$start = $1;
} elsif (/(stop-time=\d+:\d+:\d+)/) {
$stop = $1;
push (@nametimes, "$name, $start, $stop");
}
}
for ($a = 0; $a >= $#ARGV ; $a++) {
$specific = (grep /$ARGV[$a]/, @nametimes) ;
print "$specific\n";
}
It is probably pretty easy - however I have worked on it for days, and I am the only guy that uses perl in this shop. I don't have anyone to ask and the googlize is not panning out. I apologize in advance for angering the perl deities who are sure to yell at me for asking such an easy question.
Your construct for looping over @ARGV is a bit unwieldy - the more common way of doing that would be:
for my $name (@ARGV) {
#do something
}
But really, you don't even need to loop over it. You can just join them all directly into a single regular expression:
my $names = join("|", @ARGV);
my @matches = grep { /\b($names)\b/ } @nametimes;
I've used \b in the regex here - that indicates a word boundary, so the argument SellSide4 wouldn't match SellSide42. That may or may not be what you want...
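The joined-regex idea can be wrapped in a small sub. This is a sketch under assumptions: the sub name filter_by_names is made up, the sample lines only mimic the "name, start, stop" shape from the question, and the use of quotemeta (to escape regex metacharacters in the arguments) and the keep-everything behaviour for an empty name list are additions not found in the answer above.

```perl
use strict;
use warnings;

# Keep only the entries that contain one of the wanted names as a whole word.
sub filter_by_names {
    my ($wanted, $entries) = @_;
    return @$entries unless @$wanted;    # no names given: keep everything
    # quotemeta escapes characters such as "." or "+" in the arguments.
    my $alternation = join '|', map { quotemeta($_) } @$wanted;
    my $re = qr/\b(?:$alternation)\b/;
    return grep { $_ =~ $re } @$entries;
}

# Sample lines in the shape built by the question's script.
my @nametimes = (
    'SellSide42, start-time=07:06:00, stop-time=17:29:00',
    'SellSide44, start-time=07:31:00, stop-time=16:55:00',
    'SellSide4, start-time=08:00:00, stop-time=16:00:00',
);

# In the real script the first argument would be \@ARGV.
print "$_\n" for filter_by_names( [ 'SellSide4', 'SellSide44' ], \@nametimes );
```

With these inputs the SellSide42 line is skipped, because \b prevents SellSide4 from matching inside a longer name.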
Use an array to store the results from grep, not a scalar: in scalar context grep returns only the number of matches. Push onto the array rather than assigning to it, otherwise each iteration of the for loop overwrites the previous results. Something like:
my @specific;
for my $el ( @ARGV ) {
push @specific, grep { /$el/ } @nametimes;
}
print join "\n", @specific;
The easiest thing to do is to store your INI file as a structure. Then you can go through your structure and pull out what you want. The simplest structure would be a hash of hashes, where your heading is the key to the outer hash and the inner hash is keyed by the parameter.
Here's how to create the basic structure:
use warnings;
use strict;
use autodie;
use feature qw(say);
use Data::Dumper;
use constant INI_FILE => "test.file.txt";
open my $ini_fh, "<", INI_FILE;
my %ini_file;
my $heading;
while ( my $line = <$ini_fh> ) {
chomp $line;
if ( $line =~ /\[(.*)\]/ ) { # Heading name
$heading = $1;
}
elsif ( $line =~ /(.+?)\s*=\s*(.+)/ ) {
my $parameter = $1;
my $value = $2;
$ini_file{$heading}->{$parameter} = $value;
}
else {
say "Invalid line $. - $line";
}
}
After this, the structure will look like this:
$VAR1 = {
'options42BuySide' => {
'stop-time' => '19:00:00',
'listento' => 'all',
'logged-check-reply' => '192.168.2.5',
'logged-check-vac-days' => 'sat,sun',
'falldown' => 'logwrite after 10000',
'start-time' => '06:01:00',
'logged-check-address' => '192.168.3.4',
'logged-check-times' => '06:01:00',
'failtolog' => 'logwrite after 10000',
'global-search-text' => 'Target Down. This message is stored;'
},
'stock42BuySide' => {
'stop-time' => '18:00:00',
'start-time' => '06:01:00',
'logged-check-reply' => '192.168.2.54',
'logged-check-address' => '192.168.2.13',
'logged-check-vac-days' => 'sat,sun',
'logged-check-times' => '06:01:00'
}
};
Now, all you have to do is parse your structure and pull the information you want out of it:
for my $heading ( sort keys %ini_file ) {
say "$heading " . $ini_file{$heading}->{"start-time"} . " " . $ini_file{$heading}->{"stop-time"};
}
You could easily modify this last loop to skip the headings you want, or to print out the exact parameters you want.
I would also recommend using Getopt::Long to parse your command line parameters:
my_file -include SorosSellSide42 -include ETFBuySide42 -param start-time -param stop-time
Getopt::Long could store your parameters in arrays. For example, it would put all the -include parameters in an @includes array and all the -param parameters in a @parameters array:
for my $heading ( @includes ) {
print "$heading ";
for my $parameter ( @parameters ) {
print "$ini_file{$heading}->{$parameter} ";
}
print "\n";
}
Of course, there needs to be lots of error checking (does the heading exist? What about the requested parameters?). But, this is the basic structure. Unless your file is extremely long, this is probably the easiest way to process it. If your file is extremely long, you could use the @includes and @parameters in the first loop as you read in the parameters and headings.
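A rough sketch of how the Getopt::Long part might look, including a little of the error checking mentioned above. The option names follow the example command line; the sub names parse_args and report_line, the tiny %ini_file stand-in and the wording of the "missing" messages are assumptions made for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Collect repeated -include and -param options into two arrays.
sub parse_args {
    my @args = @_;
    my ( @includes, @parameters );
    GetOptionsFromArray(
        \@args,
        'include=s' => \@includes,
        'param=s'   => \@parameters,
    ) or die "Bad command line options\n";
    return ( \@includes, \@parameters );
}

# Small stand-in for the %ini_file structure built from the INI file.
my %ini_file = (
    stock42BuySide => { 'start-time' => '06:01:00', 'stop-time' => '18:00:00' },
);

# Format one heading, flagging unknown headings or parameters explicitly.
sub report_line {
    my ( $ini, $heading, $parameters ) = @_;
    return "$heading: no such heading" unless exists $ini->{$heading};
    my @values = map { $ini->{$heading}{$_} // "($_ missing)" } @$parameters;
    return join ' ', $heading, @values;
}

# In a real script the list passed here would be @ARGV.
my ( $includes, $parameters ) = parse_args(
    qw(-include stock42BuySide -include nosuch -param start-time -param stop-time)
);
print report_line( \%ini_file, $_, $parameters ), "\n" for @$includes;
```

Parsing from a copy of the argument list keeps the sketch testable without touching @ARGV; calling GetOptions directly on @ARGV would work the same way in a script.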
