Comparing all elements of two big files - C

How can I compare all elements of one file with all elements of another file, using C or Perl, for much larger data? File 1 contains 100,000 such numbers and File 2 contains 500,000 elements.
I used a foreach within a foreach, splitting each and every element into arrays. It worked perfectly in Perl, but the time consumed to check and print every occurrence of the elements of just a single column of File 2 in File 1 was 40 minutes, and there are 28 such columns.
Is there any way to reduce the time, or would another language like C help?
File 1:
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
File 2:
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
EDIT:
Expected output:
If an element of File 2 is found in File 1, print its column number; if not, print '0'.
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code I am using. Note: it checks File 2 column-wise against File 1 and prints the column number on a match and '0' otherwise. It writes the output for each of the 28 columns to a separate file.
#!/usr/bin/perl -w
chomp($file = "File1.txt");
open(FH, $file);
@k_org = <FH>;
chomp($hspfile = 'file2.txt');
open(FH1, $hspfile);
@hsporg = <FH1>;

for $z (1 .. 28) {
    open(OUT, ">$z.txt");
    foreach (@hsporg) {
        $i = 0;
        @h_org = split('\t', $_);
        chomp($h_org[0]);
        foreach (@k_org) {
            @orginfo = split('\t', $_);
            chomp($orginfo[0]);
            if ($h_org[0] eq $orginfo[0]) {
                print OUT "$z\n";
                $i = 1;
                goto LABEL;
            } elsif ($h_org[0] ne $orginfo[0]) {
                if ($h_org[0] =~ /(\w+\s\w+)\s/) {
                    if ($orginfo[0] eq $1) {
                        print OUT "0";
                        $i = 1;
                        goto LABEL;
                    }
                }
            }
        }
        if ($i == 0) {
            print OUT "0";
        }
        LABEL:
    }
}
close FH;
close FH1;
close OUT;

If you sort(1) both files you can then check them in a single pass. It should not take more than a couple of seconds (including the sort).
Another way is to load all values from file1 into a hash. That is a bit more memory-consuming, especially if file1 is large, but should be fast (again, no more than a couple of seconds).
I would choose Perl over C for such a job, and I'm more proficient in C than in Perl. For this kind of task it's much faster to code in Perl, less error-prone, and it runs fast enough.
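For illustration, here is a minimal sketch of the sorted single-pass (merge-style) check, assuming pre-sorted inputs with one number per line; the file names are made up:
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: assumes file1.sorted and file2.sorted were produced by
# e.g. `sort -n` and contain one number per line (hypothetical names).
open my $fh1, "<", "file1.sorted" or die "Can't open file1.sorted: $!";
open my $fh2, "<", "file2.sorted" or die "Can't open file2.sorted: $!";

my $x = <$fh1>;
my $y = <$fh2>;
while (defined $x and defined $y) {
    chomp($x, $y);
    if    ($x < $y) { $x = <$fh1> }   # advance whichever side is smaller
    elsif ($x > $y) { $y = <$fh2> }
    else {                            # common element found
        print "$x\n";
        $x = <$fh1>;
        $y = <$fh2>;
    }
}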

This script runs a test case. Note that your expected output is objectively wrong: in File 2, line 1, column 20, the value 0.2 does exist in File 1.
#!perl
use 5.010; # just for `say`
use strict; use warnings;
use Test::More;
# define input files + expected outcome
my $file_1_contents = <<'_FILE1_';
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
_FILE1_
my $file_2_contents = <<'_FILE2_';
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
_FILE2_
my $expected_output = <<'_OUTPUT_';
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
_OUTPUT_
# open the filehandles
open my $file1, "<", \$file_1_contents or die "$!";
open my $file2, "<", \$file_2_contents or die "$!";
open my $expected, "<", \$expected_output or die "$!";
my %file1 = map { chomp; 0+$_ => undef } <$file1>;
while (<$file2>) {
chomp;
my @vals = split;
# If value exists in file1, print the col number.
my $line = join " " => map { exists $file1{0+$vals[$_]} ? $_+1 : 0 } 0 .. $#vals;
chomp(my $expected_line = <$expected>);
is $line, $expected_line;
}
done_testing;
To print the exact same output to 28 files, you would remove the testing code and instead write
say {$_} $line for @filehandles;
Old answer
Your existing code is simply quite inefficient and unidiomatic. Let me help you fix that.
First, start all your Perl scripts with use strict; use warnings;, and if you have a modern perl (5.10 or later), add use 5.010; (or whatever your version is) to activate additional features.
The chomp call takes a variable and removes the current value of $/ (usually a newline) from the end of the string. This matters because the readline operator doesn't do that for us. It is not a tool for initializing a constant variable. Rather, do
my $file = "File1.txt";
my $hspfile = "File2.txt";
The use strict pragma forces you to declare your variables properly; you can do so with my.
To open a file, you should use the following idiom:
open my $fh, "<", $filename or die "Can't open $filename: $!";
Instead of or die ... you can use autodie at the top of your script.
This will abort the script if you can't open the file and tell you the reason ($!). Specifying an explicit open mode (here: < for reading) avoids bugs with special characters in filenames.
Lexical filehandles (in my variables, as contrasted to bareword filehandles) have proper scope, and close themselves. There are various other reasons why you should use them.
The split function takes a regex, not a string, as its first argument. If you inspect your program closely, you will see that you split each element of @hsporg 28 times, and each element of @k_org 28 × @hsporg times. This is extremely slow and unnecessary, as we can do it once beforehand.
If a condition is false, you don't need to explicitly test for the falseness again in
if ($h_org[0] eq $orginfo[0]) {
    ...;
} elsif ($h_org[0] ne $orginfo[0]) {
    ...;
}
as $a ne $b is exactly equivalent to not $a eq $b.
It is quite unidiomatic in Perl to use a goto, and jumping to a label somewhere isn't especially fast either. Labels are mainly used for loop control:
# random example
LOOP: for my $i (1 .. 10) {
    for my $j (1 .. 5) {
        next if $i == $j;           # start next iteration of current loop
        next LOOP if 2 * $i == $j;  # start next iteration of labeled loop
        last LOOP if $i + $j == 13; # like `break` in C
    }
}
The redo loop control verb is similar to next, but doesn't recheck the loop condition, if there is one.
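For example, a classic use of redo (this sketch is adapted from the perlsyn documentation, not from the original post) is joining continuation lines, re-running the loop body on the extended line without reading a fresh one:
# Sketch: treat a trailing backslash as a line-continuation marker.
while (<>) {
    chomp;
    if (s/\\$//) {        # line ends with a backslash?
        $_ .= <>;         # append the next physical line ...
        redo unless eof;  # ... and re-run the body without re-reading
    }
    print "logical line: $_\n";
}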
Because of these loop control facilities, and the ability to break out of any enclosing loop, maintaining flags or writing elaborate gotos is usually unnecessary.
Here is a cleaned-up version of your script, without fixing too much of the actual algorithm:
#!/usr/bin/perl
use strict; use warnings;
use autodie; # automatic error messages

my ($file, $hspfile) = ("File1.txt", "file2.txt");

open my $fh1, "<", $file;
open my $fh2, "<", $hspfile;

my @k_org  = <$fh1>;
my @hsporg = <$fh2>;

# Presplit the contents of the arrays:
for my $arr (\@k_org, \@hsporg) {
    for (@$arr) {
        chomp;
        $_ = [ split /\t/ ]; # put an *anonymous arrayref* into each slot
    }
}

my $output_files = 28;

for my $z (1 .. $output_files) {
    open my $out, ">", "$z.txt";

    H_ORG:
    for my $h_org (@hsporg) {
        my $i = 0;

        ORGINFO:
        for my $orginfo (@k_org) {
            # elements in array references are accessed like $arrayref->[$i]
            if ($h_org->[0] eq $orginfo->[0]) {
                print $out "$z\n";
                $i = 1;
                last ORGINFO; # break out of this loop
            } elsif ($h_org->[0] =~ /(\w+\s\w+)\s/ and $orginfo->[0] eq $1) {
                print $out "0";
                $i = 1;
                last ORGINFO;
            }
        }

        print $out "0" if not $i;
    }
}

# filehandles are closed automatically.
Now we can optimize further: In each line, you only ever use the first element. This means we don't have to store the rest:
...;
for (@$arr) {
    chomp;
    $_ = (split /\t/, $_, 2)[0]; # save just the first element
}
...;

ORGINFO:
for my $orginfo (@k_org) {
    # the elements are plain scalars now
    if ($h_org eq $orginfo) {
        ...;
    } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
        ...;
    }
}
Also, accessing scalars is a bit faster than accessing array elements.
The third argument to split limits the number of resulting fragments. Because we are only interested in the first field, we don't have to split the rest.
Next, we last out of the ORGINFO loop, then check a flag. This is unnecessary: we can jump directly to the next iteration of the H_ORG loop instead of setting the flag. If we drop out of the ORGINFO loop naturally, the flag is guaranteed not to be set, so we can execute the print unconditionally:
H_ORG:
for my $h_org (@hsporg) {
    for my $orginfo (@k_org) {
        if ($h_org eq $orginfo) {
            print $out "$z\n";
            next H_ORG;
        } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
            print $out "0";
            next H_ORG;
        }
    }
    print $out "0";
}
Then, you compare the same data 28 times just to print it to different files. Better: define two subs, print_index and print_zero, which loop over the output filehandles:
# make this initialization *before* you use the subs!
my @filehandles = map { open my $fh, ">", "$_.txt"; $fh } 1 .. $output_files;

...; # the H_ORG loop

sub print_index {
    for my $i (0 .. $#filehandles) {
        print {$filehandles[$i]} $i + 1, "\n";
    }
}

sub print_zero {
    print {$_} 0 for @filehandles;
}
Then:
# no enclosing $z loop!
H_ORG:
for my $h_org (@hsporg) {
    for my $orginfo (@k_org) {
        if ($h_org eq $orginfo) {
            print_index();
            next H_ORG;
        } elsif ($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
            print_zero();
            next H_ORG;
        }
    }
    print_zero();
}
This avoids checking data you already know doesn't match.

In C you could try the qsort and bsearch standard library functions.
First you need to load your files into an array.
Then you should run qsort() (unless you are sure the elements already have an order), and use bsearch() to do a binary search in your array.
http://linux.die.net/man/3/bsearch
This will be much faster than checking all elements one by one.
You could also implement a binary search in Perl if one does not exist already; it is a simple algorithm.
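For completeness, here is a minimal binary-search sketch in Perl; the sub name, the numeric comparison, and the sample data are my own choices, not from the original answer:
use strict;
use warnings;

# Binary search over a numerically sorted array; returns the index of
# $target, or -1 if it is absent.
sub binary_search {
    my ($aref, $target) = @_;
    my ($lo, $hi) = (0, $#$aref);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($aref->[$mid] < $target) { $lo = $mid + 1 }
        elsif ($aref->[$mid] > $target) { $hi = $mid - 1 }
        else  { return $mid }
    }
    return -1;
}

my @sorted = sort { $a <=> $b } (0.2, 0.1, 0.15, 0.12);
print binary_search(\@sorted, 0.15), "\n"; # prints 2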

Related

Use of uninitialized value within @spl in substitution (s///)

I am getting the following errors while running the script:
Use of uninitialized value in print at PreProcess.pl line 137.
Use of uninitialized value within @spl in substitution (s///) at PreProcess.pl line 137.
Is there a syntax error in the script?
(I am running it on Windows with the latest 64-bit Strawberry Perl.)
my $Dat = 2;
my $a = 7;
foreach (@spl) {
    if ( $_ =~ $NameInstru ) {
        print $spl[$Dat] =~ s/-/\./gr, " 00:00; ", $spl[$a], "\n"; # data
        $Dat += 87;
        $a += 87;
    }
}
Inside the array I have this type of data:
"U.S. DOLLAR INDEX - ICE FUTURES U.S."
150113
2015-01-13
098662
ICUS
01
098
128104
14111
88637
505
13200
50
269
43140
34142
1862
37355
482
180
110623
126128
17480
1976
1081
-3699
8571
-120
646
50
248
1581
-8006
319
2093
31
-30
1039
1063
42
18
100.0
11.0
69.2
0.4
10.3
0.0
0.2
33.7
26.7
1.5
29.2
0.4
0.1
86.4
98.5
13.6
1.5
215
7
.
.
16
.
.
50
16
8
116
6
4
197
34
28.6
85.1
41.3
91.3
28.2
85.1
40.8
91.2
"(U.S. DOLLAR INDEX X $1000)"
"098662"
"ICUS"
"098"
"F90"
"Combined"
"U.S. DOLLAR INDEX - ICE FUTURES U.S."
150106
2015-01-06
098662
ICUS
01
098
127023
17810
80066
625
12554
0
21
41559
42148
1544
35262
452
210
109585
125065
17438
1958
19675
486
23911
49
2717
0
-73
9262
-5037
30
5873
270
95
18439
19245
1237
431
100.0
14.0
63.0
0.5
9.9
0.0
0.0
32.7
33.2
1.2
27.8
0.4
0.2
86.3
98.5
13.7
1.5
202
7
.
.
16
0
.
48
16
9
105
6
4
185
34
29.3
83.2
43.2
90.6
28.9
83.2
42.8
90.5
"(U.S. DOLLAR INDEX X $1000)"
"098662"
"ICUS"
"098"
"F90"
"Combined"
You are probably trying to load a file of data sets of 87 lines each into an array, and then you get the error at the end of your data, when you try to read past the last array index.
You can probably solve it by iterating over the array indexes instead of the array values, e.g.
my $Dat = 2;
my $a = 7;
my $set_size = 87;
for (my $n = 0; $n + $a < @spl; $n += $set_size) {
    if ( $spl[$n] =~ $NameInstru ) {
        print $spl[$n + $Dat] =~ s/-/\./gr, " 00:00; ", $spl[$n + $a], "\n"; # data
    }
}
While this might solve your problem, it might be better to try and find a proper way to parse your file.
If the records in the input file are separated by a blank line, you can read whole records at once by changing the input record separator $/ to "" or "\n\n". You can then split each element of the resulting array on newline "\n" and get an entire record as a result. For example:
$/ = "";
my #spl;
open my $fh ...
while (<$fh>) {
push #spl, [ split "\n", $_ ];
}
...
for my $record (#spl) {
# #$record is now an 87 element array with each record in the file
}
TLP's solution of iterating over the indexes of the array, incrementing by 87 at a time, is great.
Here's a more complex solution, but one that doesn't require loading the entire file into memory:
my $lines_per_row = 87;
my @row;
while (<>) {
    chomp;
    push @row, $_;
    if (@row == $lines_per_row) {
        my ($instru, $dat, $a) = @row[0, 2, 7];
        if ($instru =~ $NameInstru) {
            print $dat =~ s/-/\./gr, " 00:00; $a\n";
        }
        @row = ();
    }
}

Perl array size is smaller than it should be

I want to initialize 4^9 (= 262144) indices of @clump as 0. So I wrote this:
my $k = 9;
my @clump = ();
my $n = 4**$k;
for (my $i = 0; $i < $n; $i++) {
    push(@clump, 0);
    print "$i ";
}
But it keeps freezing at 261632! I then tried $n = 5^9 (= 1953125) and my code stopped at 1952392, so it's definitely not a memory issue. This should be simple enough, but I can't figure out what's wrong with my code. Help a newbie?
Suffering from buffering?
When I add a sleep 1000 to the end of your program, stream the output to a file, and read the tail of the file, I also observe the last numbers printed to be 261632 and 1952392. The remaining output is stuck in the output buffer, waiting for some event (the buffer filling up, the filehandle closing, the program exiting, or an explicit flush call) to flush it.
The buffering can be changed by one of the following statements early in your program:
$| = 1;
STDOUT->autoflush(1);
Apart from the buffering, the initialization itself can be done without an explicit loop:
#!/usr/bin/env perl
use strict;
use warnings;

my $k = 9;
my $n = 4 ** $k;
my @clump = (0) x $n; # list repetition: $n zeroes in one step

print join(' ', @clump), "\n";
printf "%d elements in \@clump\n", scalar @clump;
Or,
#!/usr/bin/env perl
use strict;
use warnings;

my $k = 9;
my $n = 4 ** $k;
my @clump;
$#clump = $n - 1;  # pre-extend the array to $n elements
$_ = 0 for @clump; # then zero each element

print join(' ', @clump), "\n";
printf "%d elements in \@clump\n", scalar @clump;
Output:
...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
262144 elements in @clump
Also, note that explicit initialization with 0 is almost never required in Perl. Why do you need this?
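As an illustration of that last point (my own example, not the poster's code): hash and array elements spring into existence on first use, and ++ treats undef as 0 without a warning, so counters need no zero-filling:
use strict;
use warnings;

my %count;
$count{$_}++ for qw(a b a c a b);              # no initialization needed
print "$_: $count{$_}\n" for sort keys %count; # a: 3, b: 2, c: 1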

Perl: Sort part of array

I have an array with many fields in each line, separated by varying amounts of whitespace, like:
INDDUMMY drawing2 139 30 1 0 0 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
PIMP drawing 7 0 1444 718 437 0 0 0
I'm trying to sort this array by the number in the 3rd field, so the desired output is:
PIMP drawing 7 0 1444 718 437 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
INDDUMMY drawing2 139 30 1 0 0 0 0 0
I tried to split using a regular expression within the sorting function, like:
@sortedListOfLayers = sort {
    split(m/\w+\s+(\d+)\s/gm, $a)
    cmp
    split(m/\w+\s+(\d+)\s/gm, $b)
} @listOfLayers;
but it doesn't work correctly. How could I make that type of sorting work?
You need to expand your sort function a little further. I'm also not sure that split is working the way you think it is: split turns text into a list based on a delimiter.
I think your problem is that your regular expression (thanks to the gm flags) isn't matching what you think it's matching. I'd approach it slightly differently, though:
#!/usr/bin/perl
use strict;
use warnings;

my @array = <DATA>;

sub sort_third_num {
    my $a1 = ( split ' ', $a )[2];
    my $b1 = ( split ' ', $b )[2];
    return $a1 <=> $b1;
}

print sort sort_third_num @array;

__DATA__
INDDUMMY drawing2 139 30 1 0 0 0 0 0
RMDUMMY drawing2 69 2 1 0 0 0 0 0
PIMP drawing 7 0 1444 718 437 0 0 0
This does the trick, for example.
If you're set on doing a regex approach:
sub sort_third_num {
    my ($a1) = $a =~ m/\s(\d+)/;
    my ($b1) = $b =~ m/\s(\d+)/;
    return $a1 <=> $b1;
}
Not matching globally means only the first match is returned: the first run of digits that follows whitespace. We also compare numerically rather than stringwise.
If you want to sort a list and the operation used in the sort block is expensive, an often used Perl idiom is the Schwartzian Transform: you apply the operation once to each list element and store the result alongside the original element, sort, then map back to your original format.
The classic textbook example is sorting files in a directory by size using the expensive -s file test. A naïve approach would be
my @sorted = sort { -s $a <=> -s $b } @unsorted;
which has to perform -s twice for each comparison operation.
Using the Schwartzian Transform, we map the file names into a list of array references, each referencing an array containing the list element and its size (which has to be determined only once per file), then sort by file size, and finally map the array references back to just the file names. This is all done in a single step:
my @sorted =
    map  $_->[0],                  # 3. map back to the file name
    sort { $a->[1] <=> $b->[1] }   # 2. sort by size
    map  [ $_, -s $_ ],            # 1. evaluate size once for each file
    @unsorted;
In your case, the question is how expensive it is to extract the third field of each array element. When in doubt, benchmark the different methods. In the file-size example the speedup is dramatic: about a factor of 10 for a few dozen files!
The Schwartzian Transform applied to your problem would look something like this:
my @sorted =
    map  $_->[0],                       # 3. map back to the original line
    sort { $a->[1] <=> $b->[1] }        # 2. sort by the third column
    map  [ $_, ( split ' ', $_ )[2] ],  # 1. use Sobrique's split idea
    @array;
If the operation used is so expensive that you want to avoid performing it more than once per value in case you have identical array elements, you can cache the results as outlined in this question; this is known as the Orcish Maneuver.
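A sketch of that caching idea applied here (the cache hash is my own addition): the defined-or assignment //= (Perl 5.10+) computes each sort key only on the first encounter of a value:
my %key_cache;
my @sorted = sort {
    ( $key_cache{$a} //= ( split ' ', $a )[2] )
        <=>
    ( $key_cache{$b} //= ( split ' ', $b )[2] )
} @array;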

Counting and manipulating occurrences in text file (Perl)

I have a tab-separated text file like:
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B P 0.3
1B L 0.6
1B L 0.3
And I want to manipulate it to get the following information: for each element in the 1st column, count how many repeated elements in the second column there are, and compute the average of all numbers in the third column for each element of the second column. The desired output can be another tab-separated text file, where "Average" is the average for that element of the 2nd column:
1st K# Average L# Average P# Average
1J 1 0.2 2 0.4 1 0.4
1B 1 0.7 3 0.38 1 0.3
How should I proceed? I thought about using a hash of arrays with key = 1st column, but I don't think this would be very advantageous.
I also thought about creating multiple arrays named @L, @P, @K to count the occurrences of each of these elements for each element of the 1st column, and other arrays @Ln, @Pn, @Kn that would collect all the numbers for each. In the end, the sum of the numbers divided by scalar @L would give me the average.
But my main problem with this is: how can I do all of this processing for each element of the 1st column?
Edit: another possibility (that I am trying right now) is to create an array of all unique elements of the first column, then grep each one and do the processing. But there may be easier ways?
Edit 2: it may happen that some elements of the second column do not exist for some elements of the first column, which leads to division by 0. E.g.:
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B L 0.3 <- note that this is not P as in the example above.
1B L 0.6
1B L 0.3
Here is a way to go:
my $result;
while (<DATA>) {
    chomp;
    my @data = split;
    $result->{ $data[0] }{ $data[1] }{sum} += $data[2];
    $result->{ $data[0] }{ $data[1] }{nbr}++;
}
say "1st\tK#\tavg\tL#\tavg\tP#\tavg";
foreach my $k (keys %$result) {
    print "$k\t";
    for my $c (qw(K L P)) {
        if (exists($result->{$k}{$c}{nbr}) && $result->{$k}{$c}{nbr} != 0) {
            printf("%d\t%.2f\t", $result->{$k}{$c}{nbr},
                $result->{$k}{$c}{sum} / $result->{$k}{$c}{nbr});
        } else {
            printf("%d\t%.2f\t", 0, 0);
        }
    }
    print "\n";
}
__DATA__
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B P 0.3
1B L 0.6
1B L 0.3
output:
1st K# avg L# avg P# avg
1B 1 0.70 3 0.37 1 0.30
1J 1 0.20 2 0.40 1 0.40
Untested code:
while (<>) {
    chomp;
    ($x, $y, $z) = split /\t/;
    push @{ $f{$x}{$y} }, $z; # e.g. $f{'1J'}{'L'}[1] will be 0.3
}
@cols = qw/L P K/;
foreach $x (sort keys %f) {
    print "$x\t";
    foreach $y (@cols) {
        $t = $n = 0;
        foreach $z (@{ $f{$x}{$y} }) {
            $t += $z;
            ++$n;
        }
        $avg = $n ? $t / $n : 'N/A';
        print "$n\t$avg\t";
    }
    print "\n";
}
For each of the count and the sum I would use a hash of hashes, where the first column is the key to the outer hash and the second column is the key to the inner hash. So something like:
my (%count, %sum);
while (<>) {
    my @F = split ' ', $_;
    $count{ $F[0] }->{ $F[1] }++;
    $sum{ $F[0] }->{ $F[1] } += $F[2];
}
for my $key (keys %count) {
    print $key;
    for my $subkey ("K", "L", "P") {
        my $average = defined( $count{$key}->{$subkey} )
            ? $sum{$key}->{$subkey} / $count{$key}->{$subkey}
            : 0;
        ...; # and print the result
    }
    print "\n";
}
I am sorry I did this - really - but here is a "one-liner" (ahem) that I will try to translate into a real script and explain, as an exercise for myself :-) I hope this admittedly artificial one-line solution adds something to the more clearly written and scripted examples submitted by the others.
perl -anE '$seen{$F[0]}->{$F[1]}++; $sum{$F[0]}->{$F[1]} += $F[2];}{
for (keys %seen) { say " $_:"; for $F1 (sort keys %{$seen{$_}}) {
say "$F1", "s: $seen{$_}->{$F1} avg:", $sum{$_}->{$F1}/$seen{$_}->{$F1} }}' data.txt
See perlrun(1) for a more detailed explanation of Perl's switches. Essentially, perl -anE starts Perl in "autosplit" mode (-a) and wraps the code between the quotes in a while (<>) loop that reads input (-n). The -E turns on all the newest bells and whistles (normally one uses -e). Here's my attempt at explaining what it does.
First, in the while loop this (sort of) "one-liner":
autosplits input into an array (@F, awkish for "fields", I guess) using whitespace as the delimiter;
uses the %seen trick to count occurrences: it increments the %seen hash element keyed on column one ($F[0]) and column two ($F[1]) of @F each time that combination repeats;
uses a hash %sum (or %total) to add up the values in column three ($F[2]) with the += operator. See this perlmonks node for another example.
Then it breaks out of the while <> loop created with -n by using a "butterfly" }{ that acts like an END block, allowing a nested for loop to print everything out. I use $F1 as the subkey of the inner for loop to remind myself that it comes from the second column of the autosplit array @F.
Output (we need printf to get nicer numerical results):
1B:
Ks: 1 avg:0.7
Ls: 3 avg:0.366666666666667
Ps: 1 avg:0.3
1J:
Ks: 1 avg:0.2
Ls: 2 avg:0.4
Ps: 1 avg:0.4
This makes the numbers look nicer (using printf to format):
perl -anE '$seen{$F[0]}->{$F[1]}++; $sum{$F[0]}->{$F[1]} += $F[2];}{
for (keys %seen) { say " $_:"; for $F1 (sort keys %{$seen{$_}}) {
printf "%ss %d avg: %.2f\n", $F1, $seen{$_}->{$F1}, $sum{$_}->{$F1}/$seen{$_}->{$F1} }}' data.txt
Script version. It increments the values of repeated keys drawn from column two ($fields[1]) of the data; a second hash sums the values drawn from column three ($fields[2]). I wanted to impress you with a more functional style or the exact right CPAN module for the job, but had to $work. Cheers, and be sure to ask more Perl questions!
#!/usr/bin/env perl
use strict;
use warnings;

my %seen;
my %sum;

while (<DATA>) {
    my @fields = split;
    $seen{ $fields[0] }{ $fields[1] }++;
    $sum{ $fields[0] }{ $fields[1] } += $fields[2];
}

for (keys %seen) {
    print " $_:\n";
    for my $f (sort keys %{ $seen{$_} }) {
        printf("%ss %d avg: %.2f\n",
            $f, $seen{$_}->{$f}, $sum{$_}->{$f} / $seen{$_}->{$f});
    }
}
__DATA__
1J L 0.5
1J P 0.4
1J K 0.2
1J L 0.3
1B K 0.7
1B L 0.2
1B L 0.3
1B L 0.6
1B L 0.3

Perl: Shuffle an array 10 times, computing 10 average maxes, printing the mean average max; repeat this entire process 1000 times

I have a data set which looks like the following
1. Dataset
NR_046018 DDX11L1 , 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92
NR_047520 LOC643837 , 3 2.2 0.2 0 0 0.28 1 1 1 1 2.2 4.8 5 5.32 5 5 5 5 3
NM_001005484 OR4F5 , 2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3
NR_028327 LOC100133331 , 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2. What is needed
Shuffle the array 10 times. After each shuffle, divide the array into 2 new arrays, say set1 and set2 (half goes into set1, the other half into set2).
From each new array, compute the maximum value of each row of numbers, followed by an average max over all rows.
Get 10 average maxes for each of set1 and set2 (10 average maxes for 10 shuffles). Compute the average of the 10 average maxes obtained for each set; call these 10avg1 and 10avg2.
Get a list of 1000 10avg1 and 1000 10avg2.
3. Code
use warnings;
use List::Util qw(max shuffle);

my $file = 'mergesmall.txt';

# Open file and output file
open my $fh, '<', $file or die "Unable to open file";
open OUT, ">Shuffle.out" or die;

# Read into array
my @arr = <$fh>;

# Initialize loop for shuffling 10 times
my $i = 10;
while ($i) {
    my @arr1 = (); # initialize 1st set
    my @arr2 = (); # initialize 2nd set
    my @shuffled = shuffle(@arr);
    push @arr1, (@shuffled[0..1]); # shift into 1st set
    push @arr2, (@shuffled[2..3]); # shift into 2nd set
    foreach $_ (@arr1) {
        my @val1 = split;
        my $max1 = max(@val1[3..$#val1]);
        $total1 += $max1;
        $num1++;
    }
    my $average_max1 = $total1 / $num1;
    #print "\n\n", "Average max 1st set is : ", $average_max1;
    print OUT "Average max 1st set is : ", $average_max1;
    foreach $_ (@arr2) {
        my @val2 = split;
        my $max2 = max(@val2[3..$#val2]);
        print "\n\n";
        $total2 += $max2;
        $num2++;
    }
    my $average_max2 = $total2 / $num2;
    #print "\n\n", "Average max 2nd set is : ", $average_max2;
    print OUT "\n", "Average max 2nd set is : ", $average_max2, "\n\n";
    $i--;
}
4. The Problem
The code I have been able to write so far gets the 10 maximum averages for each of set1 and set2. I am not able to figure out how to compute the average of these 10 maximum averages. If I can figure that out, I can easily put a for loop around it to run 1000 times and obtain 1000 10avgset1 and 1000 10avgset2.
5. Points to Note
1. The actual data set has each row comprising a maximum of 400 numbers; some rows have fewer, some have none at all, but never more than 400.
2. The actual dataset has 41,382 rows; set1 will comprise 23,558 rows and set2 will comprise 17,824 rows.
3. The file is a .txt file and all the numbers in each row are tab-delimited.
I would be grateful for any idea of how to compute the average of the maximum averages. I had thought of using push @10avgset1, $average_max1 but I am not able to make it work.
First thing I noted: you aren't using the strict pragma, and are in fact using global variables. I am not sure that is what you want. Also, variable names may not start with a digit (in general), so @10avgset1 is not a legal name.
Second thing I noted: You repeat yourself quite a lot.
Here is a function that does this weird "averaging of maxima" stuff:
use constant CARRY => 1; # set behaviour of original code

sub make_accumulator {
    my $group = shift;
    my ($max, $num) = (0, 0) if CARRY;
    my @acc;
    my $acc = sub {
        my ($max, $num) = (0, 0) unless CARRY;
        for (@_) {
            $max += max @$_;
            $num++;
        }
        my $avg = $max / $num;
        push @acc, $avg;
        printf "Average max in set %d is %.2f\n", $group, $avg;
        $avg;
    };
    my $get = sub { @acc };
    ($acc, $get);
}
We can then do my ($acc, $get) = make_accumulator(1), where $acc is a callback that encapsulates your algorithm, and $get returns the list of all such values computed so far.
A real average is computed with
sub average { sum(@_) / @_ }
To initialize the script, I did
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle max sum);
use constant CARRY => 1;

my @arr = map { my @arr = split; [ @arr[3..$#arr] ] } <DATA>;

my ($acc1, $get1) = make_accumulator(1);
my ($acc2, $get2) = make_accumulator(2);
The map line parses each line only once, during loading.
I then loop a few times over shuffled versions of @arr:
for (1 .. 5) {
    my @shuffled = shuffle @arr;
    my $halfway  = int(@shuffled / 2);
    my @arr1 = @shuffled[0 .. $halfway];
    my @arr2 = @shuffled[$halfway .. $#shuffled];
    my $average_max1 = $acc1->(@arr1);
    my $average_max2 = $acc2->(@arr2);
    printf "running: %.2f %.2f\n", average($get1->()), average($get2->());
    print "\n";
}
Here, I split the shuffled list into halves; you will want to hardcode 23557 later. I then print the running averages for set1 and set2.
This produces output like:
Average max in set 1 is 2.93
Average max in set 2 is 4.60
running: 2.93 4.60
Average max in set 1 is 3.17
Average max in set 2 is 4.60
running: 3.05 4.60
Average max in set 1 is 3.09
Average max in set 2 is 4.60
running: 3.07 4.60
Average max in set 1 is 3.17
Average max in set 2 is 4.55
running: 3.09 4.59
Average max in set 1 is 3.22
Average max in set 2 is 4.03
running: 3.12 4.48
If I set CARRY to a false value, I get
Average max in set 1 is 3.07
Average max in set 2 is 5.12
running: 3.07 5.12
Average max in set 1 is 3.07
Average max in set 2 is 2.46
running: 3.07 3.79
Average max in set 1 is 3.07
Average max in set 2 is 4.40
running: 3.07 3.99
Average max in set 1 is 3.41
Average max in set 2 is 4.40
running: 3.15 4.10
Average max in set 1 is 3.07
Average max in set 2 is 5.12
running: 3.14 4.30
This looks stupid, because there are only very few possible combinations of four lines (n!/(n/2)!, I guess).
Of course these values differ every time the script is run, because shuffle is pseudo-random.
Edit:
The DATA filehandle assumes that you have a data section at the end of the script like
__DATA__
NR_046018 DDX11L1 , 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92
NR_047520 LOC643837 , 3 2.2 0.2 0 0 0.28 1 1 1 1 2.2 4.8 5 5.32 5 5 5 5 3
NM_001005484 OR4F5 , 2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3
NR_028327 LOC100133331 , 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
To use whatever file is listed on the command line, do
my @arr = map {...} <>; # no explicit filehandle
or open a file manually.
