Counting and manipulating occurrences in text file (Perl) - arrays

I have a tab-separated text file like this:
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B P 0.3
1B L 0.6
1B L 0.3
And I want to manipulate it to get the following information:
For each element in the 1st column, count how many occurrences of each element in the 2nd column there are, and compute the average of the corresponding numbers in the 3rd column. The desired output can be another tab-separated text file, where "Average" is the average number for that element in the 2nd column:
1st K# Average L# Average P# Average
1J 1 0.2 2 0.4 1 0.4
1B 1 0.7 3 0.37 1 0.3
How should I proceed? I thought about using a hash of arrays with key = 1st column, but I don't think that would be very advantageous.
I also thought about creating multiple arrays named @L, @P, @K to count the occurrences of each of these elements for each element of the 1st column, and other arrays @Ln, @Pn, @Kn that would collect all the numbers for each of them. In the end, the sum of each number divided by scalar @L would give me the average.
But my main problem with these is: how can I do all of this processing for each element of the 1st column?
Edit: another possibility (that I am trying right now) is to create an array of all unique elements of the first column, then grep for each one and do the processing. But there may be easier ways?
Edit2: it may happen that some elements of the second column do not exist for some elements in the first column, which would cause a division by zero. E.g.:
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B L 0.3 <- note that this is not P as in the example above.
1B L 0.6
1B L 0.3

Here is a way to go:
use strict;
use warnings;
use feature 'say';

my $result;
while (<DATA>) {
    chomp;
    my @data = split;
    $result->{$data[0]}{$data[1]}{sum} += $data[2];
    $result->{$data[0]}{$data[1]}{nbr}++;
}
say "1st\tK#\tavg\tL#\tavg\tP#\tavg";
foreach my $k (keys %$result) {
    print "$k\t";
    for my $c (qw(K L P)) {
        if (exists($result->{$k}{$c}{nbr}) && $result->{$k}{$c}{nbr} != 0) {
            printf("%d\t%.2f\t", $result->{$k}{$c}{nbr}, $result->{$k}{$c}{sum} / $result->{$k}{$c}{nbr});
        } else {
            printf("%d\t%.2f\t", 0, 0);
        }
    }
    print "\n";
}
__DATA__
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B P 0.3
1B L 0.6
1B L 0.3
output:
1st K# avg L# avg P# avg
1B 1 0.70 3 0.37 1 0.30
1J 1 0.20 2 0.40 1 0.40

Untested code:
while (<>) {
    chomp;
    ($x, $y, $z) = split /\t/;
    push @{$f{$x}{$y}}, $z; # E.g. $f{'1J'}{'L'}[1] will be 0.3
}
@cols = qw/L P K/;
foreach $x (sort keys %f) {
    print "$x\t";
    foreach $y (@cols) {
        $t = $n = 0;
        foreach $z (@{$f{$x}{$y}}) {
            $t += $z;
            ++$n;
        }
        $avg = $n ? $t / $n : 'N/A';
        print "$n\t$avg\t";
    }
    print "\n";
}

For each of the count and sum I would use a Hash of Hashes where the first column is the key to the outer hash and the second column is the key to the inner hash. So something like:
my (%count, %sum);
while (<>) {
    my @F = split /\t/;
    $count{$F[0]}->{$F[1]}++;
    $sum{$F[0]}->{$F[1]} += $F[2];
}
for my $key (keys %count) {
    print $key;
    for my $subkey ("K", "L", "P") {
        my $average = defined($count{$key}->{$subkey}) ? $sum{$key}->{$subkey} / $count{$key}->{$subkey} : 0;
        ...; # and print the result
    }
    print "\n";
}
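For completeness, here is one way to fill in the elided print, as a minimal sketch; the tab-separated layout (count first, then average) and the %.2f format are my assumptions, not part of the original answer:

for my $subkey ("K", "L", "P") {
    my $n = $count{$key}->{$subkey} // 0;   # // needs perl 5.10+
    my $average = $n ? $sum{$key}->{$subkey} / $n : 0;
    printf "\t%d\t%.2f", $n, $average;
}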

I am sorry I did this - really - but here is a "one-liner" (ahem) that I will try to translate into a real script and explain - as an exercise for myself :-) I hope this admittedly artificial example of a one-line solution adds something to the more clearly written and scripted examples submitted by the others.
perl -anE '$seen{$F[0]}->{$F[1]}++; $sum{$F[0]}->{$F[1]} += $F[2];}{
for (keys %seen){say " $_:"; for $F1 (sort keys %{$seen{$_}}) {
say "$F1","s: $seen{$_}->{$F1} avg:",$sum{$_}->{$F1}/$seen{$_}->{$F1}}}' data.txt
See perlrun(1) for a more detailed explanation of Perl's switches. Essentially, perl -anE starts Perl in "autosplit" mode (-a) and creates a while <> loop to read input (-n) for the code that is executed between the ' ' quotes. The -E turns on all the newest bells and whistles for execution (normally one uses -e). Here's my attempt at explaining what it does.
First, in the while loop, this (sort of) "one-liner":
autosplits input into an array (@F ... awkish for "fields", I guess) using whitespace as the delimiter.
uses the %seen{} trick to count occurrences. Here it increments a nested hash keyed by column one ($F[0]) and column two ($F[1]) each time a given pair appears, so repeated values in the second column are counted per element of the first column.
uses a hash %sum (or %total) to add up the values in column three ($F[2]) using the += operator. See this perlmonks node for another example.
Then it breaks out of the while <> loop created with -n by using a "butterfly" }{ that acts like an END block, allowing a nested for loop to spit everything out. I use $F1 as the subkey for the inner for loop to remind myself that I'm getting it from the second column of the autosplit array @F.
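To visualize what the two hashes hold after the while loop (my addition, using the core Data::Dumper module on the sample data):

use Data::Dumper;
print Dumper \%seen;
# $VAR1 = {
#   '1B' => { 'K' => 1, 'L' => 3, 'P' => 1 },
#   '1J' => { 'K' => 1, 'L' => 2, 'P' => 1 },
# };
# %sum has the same shape, holding summed third-column values instead of counts.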
Output (we need printf to get nicer numerical results):
1B:
Ks: 1 avg:0.7
Ls: 3 avg:0.366666666666667
Ps: 1 avg:0.3
1J:
Ks: 1 avg:0.2
Ls: 2 avg:0.4
Ps: 1 avg:0.4
This makes the numbers look nicer (using printf to format):
perl -anE '$seen{$F[0]}->{$F[1]}++; $sum{$F[0]}->{$F[1]} += $F[2];}{
for (keys %seen){say " $_:"; for $F1 (sort keys %{$seen{$_}}) {
printf("%ss %d avg: %.2f\n", $F1, $seen{$_}->{$F1}, $sum{$_}->{$F1}/$seen{$_}->{$F1})}}' data.txt
Script version. It increments values of repeated keys drawn from column two ($fields[1]) of the data; in a second hash it sums the key values drawn from column three ($fields[2]). I wanted to impress you with a more functional style or the exact right CPAN module for the job, but had to $work. Cheers, and be sure to ask more Perl questions!
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my %sum;
while (<DATA>) {
    my @fields = split;
    $seen{$fields[0]}{$fields[1]}++;
    $sum{$fields[0]}{$fields[1]} += $fields[2];
}
for (keys %seen) {
    print " $_:\n";
    for my $f (sort keys %{$seen{$_}}) {
        printf("%ss %d avg: %.2f\n",
            $f, $seen{$_}->{$f}, $sum{$_}->{$f}/$seen{$_}->{$f});
    }
}
__DATA__
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B L 0.3
1B L 0.6
1B L 0.3

Related

Is there a way to compare two arrays based on user input in matlab

MATLAB software
i=[0 1.264241 1.729329 1.900426 1.963369 1.986524 1.995042 1.998176 1.999329 1.999753 1.999909];
t=[0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2];
How can I call the value of i if the user input is a value from the t array, given that the position in both arrays is the same?
For example, if the user enters 0.2, the program should return the value 1.264241 from array i.
You can use input to get the user to enter a number, and ismembertol to find the number's index in t. Once you have the index, you can get the corresponding value in i. You could even throw an error if the number entered is not found in t. Here's an example:
i=[0 1.264241 1.729329 1.900426 1.963369 1.986524 1.995042 1.998176 1.999329 1.999753 1.999909];
t=[0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2];
x = input('Enter number:\n');
[~,ind] = ismembertol(x,t);
if ind > 0
fprintf('Corresponding number in i is %g\n', i(ind))
else
error('Number not found in t')
end

Printing lines containing the least number in groups - AWK/SED/PERL

I would like to print only the lines containing the least number in each group. My file contains multiple columns, and I use the first column to determine groups. Let's say the 1st, 4th, and 6th lines are in the same group because the content of the first column is the same. My goal is to print, for each group, the line that contains the least number in the second column.
file.txt:
VDDA 0.7 ....
VDDB 0.2 ....
VDDB 0.3 ....
VDDA 0.4 ....
VSS 0.1 ....
VDDA 0.2 ....
VSS 0.2 ....
output.txt:
VDDA 0.2 ....
VDDB 0.2 ....
VSS 0.1 ....
I think I can do this job with C using a for loop and comparisons, but I think there is a better way using AWK/SED/PERL.
If you don't care about preserving the order of the 1st field from the Input_file, the following may help. This code looks for the smallest value in the second column for each distinct first field and prints it.
awk '{a[$1]=a[$1]>$2?$2:(a[$1]?a[$1]:$2)} END{for(i in a){print i,a[i]}}' Input_file
EDIT1: If you want the output in the same order as $1 appears in the input, the following may help.
awk '!a[$1]{b[++i]=$1} {c[$1]=a[$1]>$2?$0:(c[$1]?c[$1]:$0);a[$1]=a[$1]>$2?$2:(a[$1]?a[$1]:$2);} END{for(j=1;j<=i;j++){print b[j],c[b[j]]}}' Input_file
$ awk '{split(a[$1],s);a[$1]=(s[2]<$2 && s[2])?a[$1]:$0} END{for(i in a)print a[i]}' file.txt
VDDA 0.2 ....
VDDB 0.2 ....
VSS 0.1 ....
Brief explanation:
Save $0 into a[$1]
split(a[$1],s): split the stored line a[$1] into array s, so the numeric field s[2] can be compared
if the condition s[2]<$2 && s[2] is met, keep a[$1] as it is; otherwise set a[$1]=$0
With GNU datamash tool:
Assuming the following example input file containing rows with 5 columns:
VDDA 0.7 c1 2 a
VDDB 0.2 c2 3 b
VDDB 0.3 c4 5 c
VDDA 0.4 c5 6 d
VSS 0.1 c6 7 e
VDDA 0.2 c7 8 f
VSS 0.2 c8 9 g
datamash -sWf -g1 min 2 < file | awk '{--NF}1'
The output (the awk '{--NF}1' step strips the trailing min column that datamash -f appends):
VDDA 0.2 c7 8 f
VDDB 0.2 c2 3 b
VSS 0.1 c6 7 e

Generating limited amount of random numbers in Perl for a known range

I have a file letters.txt which contains a couple of strings of letters:
#string1 10 letters
A H K Y M H O L H L
#string2 9 letters
H N U P W X L Y H
I am trying to create a file which will have a random number from 20 to 60 for each letter, for each string.
My expected output should look like:
output.txt:
#string1
29 27 56 43 39 40 36 48 59 38
#string2
26 36 39 39 26 51 38 42 42
I have tried the code below with $minimum as 20 and $maximum as 60:
open ($fh, '>', $fileToLocate) or die;
my $x = $minimum + int(rand($maximum - $minimum));
print $fh "$x\n";
close $fh;
It creates only one random number in the $fileToLocate file.
I also want to extract the number of letters in each string, which is written just before the letters in the input file: 10 for #string1 and 9 for #string2.
I have tried the code below to create 30 random numbers ranging between 20 and 60; however, it did not work out:
my @Chars = ( 20 .. 60 );
$RandString = join("", @Chars[ map { $x } ( 1 .. 30 ) ]);
print $fh "$RandString\n";
close $fh;
You're close.
The code to pick a random number from $min to $max looks like this.
my $rand = $min + int rand($max - $min + 1);
So you had that part right, but with an off-by-one error (I made the same mistake earlier). Because rand starts at 0, int rand $x will go from 0 to $x - 1.
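A quick way to convince yourself of the range (a throwaway tally, my example):

my %seen;
$seen{ int rand 3 }++ for 1 .. 10_000;
print join(' ', sort { $a <=> $b } keys %seen), "\n"; # prints: 0 1 2 -- never 3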
Then you need to generate a bunch of them. You're close with the map, but $x only stores a single random number, so map { $x } ( 1 .. 30 ) will just give you one number repeated 30 times. Instead you need to get a new random number inside the map block.
my @rands = map { $min + int rand($max - $min + 1) } 1 .. length $string;
That will run $min + int rand($max - $min + 1) a number of times equal to the number of characters in $string and put the results into the list @rands. Then you can join @rands like you have already.
print $fh join '', @rands;
That should get you the rest of the way.
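Putting the pieces together, here is a minimal sketch of a complete script; the file names letters.txt and output.txt come from the question, and instead of parsing the letter count from the header line I simply generate one number per letter found (an assumption on my part):

#!/usr/bin/env perl
use strict;
use warnings;

my ($min, $max) = (20, 60);

open my $in,  '<', 'letters.txt' or die "Can't open letters.txt: $!";
open my $out, '>', 'output.txt'  or die "Can't open output.txt: $!";

while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^(#\S+)/) {
        print $out "$1\n";            # header line like "#string1 10 letters"
    }
    elsif ($line =~ /\S/) {
        my @letters = split ' ', $line;
        my @rands   = map { $min + int rand($max - $min + 1) } @letters;
        print $out "@rands\n";        # one random 20..60 number per letter
    }
}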

Create an array with a sequence of numbers in bash

I would like to write a script that will create me an array with the following values:
{0.1 0.2 0.3 ... 2.5}
Until now I was using a script as follows:
plist=(0.1 0.2 0.3 0.4)
for i in ${plist[@]}; do
echo "submit a simulation with this parameter:"
echo "$i"
done
But now I need the list to be much longer (but still with constant intervals).
Is there a way to create such an array in a single command? What is the most efficient way to create such a list?
Using seq you can say seq FIRST STEP LAST. In your case:
seq 0 0.1 2.5
Then it is a matter of storing these values in an array:
vals=($(seq 0 0.1 2.5))
You can then check the values with:
$ printf "%s\n" "${vals[@]}"
0,0
0,1
0,2
...
2,3
2,4
2,5
Yes, my locale is set to use commas instead of dots for decimals. This can be changed by setting LC_NUMERIC="en_US.UTF-8".
By the way, brace expansion also allows you to set an increment. The problem is that it has to be an integer:
$ echo {0..15..3}
0 3 6 9 12 15
Bash supports C style For loops:
$ for ((i=1;i<5;i+=1)); do echo "0.${i}" ; done
0.1
0.2
0.3
0.4
Complementing the main answer: in my case, seq was not the best choice.
To produce a sequence, you can also use the jot utility. However, this command has a more elaborate syntax.
# 1 2 3 4
jot - 1 4
# 3 evenly distributed numbers between 0 and 10
# 0 5 10
jot 3 0 10
# a b c ... z
jot -c - 97 122

comparing all elements of two big files

How can I compare all elements of a file with all elements of another file using C or Perl for much larger data? File 1 contains 100,000 numbers and file 2 contains 500,000 elements.
I used a foreach within a foreach, splitting each and every element into arrays. It worked perfectly in Perl, but the time consumed to check and print every occurrence of elements of just a single column from file 2 in file 1 was 40 minutes. There are 28 such columns.
Is there any way to reduce the time, or should I use another language like C?
File 1:
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
File 2:
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
EDIT:
Expected output:
For each element in file 2: if it matches an element in file 1, print its column number; if not, print '0'.
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code I am using. Note: it checks each column of file 2 against file 1 and prints the column number if there is a match and '0' otherwise. It prints the output for every column into 28 different files.
#!/usr/bin/perl -w
chomp($file = "File1.txt");
open(FH, $file);
@k_org = <FH>;
chomp($hspfile = 'file2.txt');
open(FH1, $hspfile);
@hsporg = <FH1>;
for $z (1 .. 28) {
    open(OUT, ">$z.txt");
    foreach (@hsporg) {
        $i = 0;
        @h_org = split('\t', $_);
        chomp ($h_org[0]);
        foreach (@k_org) {
            @orginfo = split('\t', $_);
            chomp($orginfo[0]);
            if ($h_org[0] eq $orginfo[0]) {
                print OUT "$z\n";
                $i = 1;
                goto LABEL;
            } elsif ($h_org[0] ne $orginfo[0]) {
                if ($h_org[0] =~ /(\w+\s\w+)\s/) {
                    if ($orginfo[0] eq $1) {
                        print OUT "0";
                        $i = 1;
                        goto LABEL;
                    }
                }
            }
        }
        if ($i == 0) {
            print OUT "0";
        }
        LABEL:
    }
}
close FH;
close FH1;
close OUT;
If you sort(1) the files, you can then check them in a single pass. That should not take more than a couple of seconds (including the sort).
Another way is to load all values from file 1 into a hash. It's a bit more memory-consuming, especially if file 1 is large, but should be fast (again, no more than a couple of seconds).
I would choose perl over C for such a job, and I'm more proficient in C than in perl. It's much faster to code in perl for this kind of job, less error-prone and runs fast enough.
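A minimal sketch of that hash approach (file names from the question; this assumes the values are formatted identically in both files, so plain string comparison works):

#!/usr/bin/env perl
use strict;
use warnings;

# load every value from file 1 into a hash for O(1) lookups
open my $f1, '<', 'file1.txt' or die "Can't open file1.txt: $!";
my %in_file1;
while (<$f1>) {
    chomp;
    $in_file1{$_} = 1;
}

# one pass over file 2: a single hash lookup per column
open my $f2, '<', 'file2.txt' or die "Can't open file2.txt: $!";
while (<$f2>) {
    chomp;
    my @cols = split;
    print join(' ', map { $in_file1{$cols[$_]} ? $_ + 1 : 0 } 0 .. $#cols), "\n";
}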
This script runs a test case. Note that your expected output is objectively wrong: In file 2, line 1, column 20, the value 0.2 exists.
#!perl
use 5.010; # just for `say`
use strict; use warnings;
use Test::More;
# define input files + expected outcome
my $file_1_contents = <<'_FILE1_';
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
_FILE1_
my $file_2_contents = <<'_FILE2_';
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
_FILE2_
my $expected_output = <<'_OUTPUT_';
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
_OUTPUT_
# open the filehandles
open my $file1, "<", \$file_1_contents or die "$!";
open my $file2, "<", \$file_2_contents or die "$!";
open my $expected, "<", \$expected_output or die "$!";
my %file1 = map { chomp; 0+$_ => undef } <$file1>;
while (<$file2>) {
    chomp;
    my @vals = split;
    # If value exists in file1, print the col number.
    my $line = join " " => map { exists $file1{0+$vals[$_]} ? $_+1 : 0 } 0 .. $#vals;
    chomp(my $expected_line = <$expected>);
    is $line, $expected_line;
}
done_testing;
To print the exact same output to 28 files, you would remove the testing code, and rather
say {$_} $line for @filehandles;
instead.
Old answer
Your existing code is simply quite inefficient and unidiomatic. Let me help you fix that.
First, start all your Perl scripts with use strict; use warnings;, and if you have a modern perl (5.10 or later), you can add use 5.010; (or whatever your version is) to activate additional features.
The chomp call takes a variable and removes the current value of $/ (usually a newline) from the end of the string. This is important because the readline operator doesn't do that for us. It is not a good way to declare a constant variable, though. Rather, do
my $file = "File1.txt";
my $hspfile = "File2.txt";
The use strict pragma forces you to declare your variables properly; you can do so with my.
To open a file, you should use the following idiom:
open my $fh, "<", $filename or die "Can't open $filename: $!";
Instead of or die ... you can use autodie at the top of your script.
This will abort the script if you can't open the file, tell you the reason ($!), and it specifies an explicit open mode (here: < = read). This avoids bugs with special characters in filenames.
Lexical filehandles (in my variables, as contrasted to bareword filehandles) have proper scope, and close themselves. There are various other reasons why you should use them.
The split function takes a regex, not a string, as its first argument. If you inspect your program closely, you will see that you split each element in @hsporg 28 times, and each element in @k_org 28 × @hsporg times. This is extremely slow and unnecessary, as we can do it beforehand.
If a condition is false, you don't need to explicitly test for the falseness again in
if ($h_org[0] eq $orginfo[0]) {
...;
} elsif ($h_org[0] ne $orginfo[0]) {
...;
}
as $a ne $b is exactly equivalent to not $a eq $b.
It is quite unidiomatic in Perl to use a goto, and jumping to a label somewhere isn't especially fast either. Labels are mainly used for loop control:
# random example
LOOP: for my $i (1 .. 10) {
for my $j (1 .. 5) {
next if $i == $j; # start next iteration of current loop
next LOOP if 2 * $i == $j; # start next iteration of labeled loop
last LOOP if $i + $j == 13;# like `break` in C
}
The redo loop control verb is similar to next, but doesn't recheck the loop condition, if there is one.
Because of these loop control facilities, and the ability to break out of any enclosing loop, maintaining flags or elaborate gotos is often quite unnecessary.
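For illustration, a small contrived redo example (my own, not from the original answer): the body is rerun for the current element without advancing the loop:

my @words = ('perl', 'PERL', 'awk');
for my $w (@words) {
    if ($w =~ /[A-Z]/) {
        $w = lc $w; # normalize, then rerun the body with the new value
        redo;
    }
    print "$w\n"; # prints perl, perl, awk -- 'PERL' went through the body twice
}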
Here is a cleaned-up version of your script, without fixing too much of the actual algorithm:
#!/usr/bin/perl
use strict; use warnings;
use autodie; # automatic error messages
my ($file, $hspfile) = ("File1.txt", "file2.txt");
open my $fh1, "<", $file;
open my $fh2, "<", $hspfile;
my @k_org = <$fh1>;
my @hsporg = <$fh2>;
# Presplit the contents of the arrays:
for my $arr (\@k_org, \@hsporg) {
    for (@$arr) {
        chomp;
        $_ = [ split /\t/ ]; # put an *anonymous arrayref* into each slot
    }
}
my $output_files = 28;
for my $z (1 .. $output_files) {
    open my $out, ">", "$z.txt";
    H_ORG:
    for my $h_org (@hsporg) {
        my $i = 0;
        ORGINFO:
        for my $orginfo (@k_org) {
            # elements in array references are accessed like $arrayref->[$i]
            if ($h_org->[0] eq $orginfo->[0]) {
                print $out "$z\n";
                $i = 1;
                last ORGINFO; # break out of this loop
            } elsif ($h_org->[0] =~ /(\w+\s\w+)\s/ and $orginfo->[0] eq $1) {
                print $out "0";
                $i = 1;
                last ORGINFO;
            }
        }
        print $out "0" if not $i;
    }
}
# filehandles are closed automatically.
Now we can optimize further: In each line, you only ever use the first element. This means we don't have to store the rest:
...;
for (@$arr) {
    chomp;
    $_ = (split /\t/, $_, 2)[0]; # save just the first element
}
...;
ORGINFO:
for my $orginfo (@k_org) {
    # the elements are now plain scalars, no dereferencing needed
    if ($h_org eq $orginfo) {
        ...;
    } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
        ...;
    }
}
Also, accessing scalars is a bit faster than accessing array elements.
The third arg to split limits the number of resulting fragments. Because we are only interested in the first field, we don't have to split the rest too.
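A tiny illustration of that LIMIT argument (my example):

my @all   = split /\t/, "a\tb\tc";    # ("a", "b", "c")
my @first = split /\t/, "a\tb\tc", 2; # ("a", "b\tc") -- the rest is left unsplit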
Next, we last out of the ORGINFO loop, then check a flag. This is unnecessary: we can jump directly to the next iteration of the H_ORG loop instead of setting the flag. If we drop out of the ORGINFO loop naturally, the flag is guaranteed not to be set, so we can execute the print anyway:
H_ORG:
for my $h_org (@hsporg) {
    for my $orginfo (@k_org) {
        if ($h_org eq $orginfo) {
            print $out "$z\n";
            next H_ORG;
        } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
            print $out "0";
            next H_ORG;
        }
    }
    print $out "0";
}
Then, you compare the same data 28 times to print it to different files. Better: Define two subs print_index and print_zero. Inside these, you loop over the output filehandles:
# make this initialization *before* you use the subs!
my @filehandles = map { open my $fh, ">", "$_.txt"; $fh } 1 .. $output_files;
...; # the H_ORG loop
sub print_index {
    for my $i (0 .. $#filehandles) {
        print {$filehandles[$i]} $i+1, "\n";
    }
}
sub print_zero {
    print {$_} 0 for @filehandles;
}
Then:
# no enclosing $z loop!
H_ORG:
for my $h_org (@hsporg) {
    for my $orginfo (@k_org) {
        if ($h_org eq $orginfo) {
            print_index();
            next H_ORG;
        } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
            print_zero();
            next H_ORG;
        }
    }
    print_zero();
}
This avoids checking data you already know doesn't match.
In C you could try using the qsort and bsearch functions.
First you need to load your files into arrays.
Then you should run qsort() (unless you are sure the elements are already ordered), and use bsearch() to do a binary search in your array.
http://linux.die.net/man/3/bsearch
This will be much faster than checking all elements one by one.
You could also implement a binary search in Perl if one does not exist already; it is a simple algorithm.
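A minimal sketch of such a binary search in Perl, assuming numeric values and a numerically sorted array:

use strict;
use warnings;

# returns the index of $target in the sorted @$aref, or -1 if absent
sub binary_search {
    my ($aref, $target) = @_;
    my ($lo, $hi) = (0, $#$aref);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($aref->[$mid] < $target) { $lo = $mid + 1 }
        elsif ($aref->[$mid] > $target) { $hi = $mid - 1 }
        else                            { return $mid }
    }
    return -1;
}

my @sorted = sort { $a <=> $b } (0.2, 0.1, 0.12, 0.11, 0.13);
print binary_search(\@sorted, 0.12), "\n"; # prints 2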
