Perl. While loop behavior changes with seemingly inncocuous addition of command - arrays

The following code reads a file that contains many lines. Some lines of the file contain four elements. Other lines contain only a first element, followed single spaces separated by tabs (it is a tab delimited file). That is, some lines are "full" and others are "blank".
The point of this script is to read the data file, find an instance of a blank line, then remember the immediately preceding line (a full line), scroll to find all consecutive blank lines until the next full line is reached. This set of lines, consecutive blank lines flanked by immediately preceding full line and immediately succeeding full line, is to be used by a subroutine that will apply linear interpolation to "fill in" the blank lines. The information in the flanking full lines for each set will be used in the interpolation step. The script was an answer to a previously posted question, and was provided kindly by user #Kenosis. It is duplicated here but with some very minor changes in its layout---not as neat as #Kenosis originally proposed. You can see this interaction at Perl. Using until function
#!/usr/bin/perl
use strict; use warnings;
die "usage: [ map positions file post SAS ]\n\n" unless #ARGV == 1;
my $mapfile = $ARGV[ 0 ];
open( my $FILE, "<$mapfile" );
my #file = <$FILE>;
for ( my $i = 1 ; $i < $#file ; $i++ ) # $#file returns the index of the last element in #file
{
if ( $file[$i] =~ /(?:\t\s){3}/ ) # if a blank line is found
{
print $file[ $i - 1 ]; # print preceding line
while ( $file[$i] =~ /(?:\t\s){3}/ and $i < $#file ) # keep printing so long as they are blank
# or end of file
{
#print $file[ $i++ ] # one-column, blank line
}
print $file[ $i ]; # print the succeeding full line
} # if
} # for
The problem comes when I try to insert a modification.
my #collect = (); # array collects a current set of consecutive lines needed for linear interpolation
my #file = <$FILE>;
for ( my $i = 1 ; $i < $#file ; $i++ ) # $#file returns the index of the last element in #file
{
if ( $file[$i] =~ /(?:\t\s){3}/ ) # if a blank line is found
{
print $file[ $i - 1 ]; # print preceding line
push( #collect, $file[ $i - 1 ] );
while ( $file[$i] =~ /(?:\t\s){3}/ and $i < $#file ) # keep printing so long as they are blank
# or end of file
{
#print $file[ $i++ ]; # one-column, blank line
push( #collect, $file[ $i++ ] )
}
print $file[ $i ]; # else, succeeding full line
push( #collect, $file[ $i ] );
} # if
} # for
The culprit is in the while loop. Adding the push command there changes the behavior of the script. The script is no longer printing all the lines as the first script above. Why does adding that command change how the script is supposed to work?

What are you trying to do in that push line?
It includes the expression $i++, which adds 1 to the value of $1, so each iteration of that while loop will be jumping down another line in the file.
Do you just mean $i + 1?

Are you really adding a second line of code that increments $i? $i += 1 is not the same as $i += 2

Related

Output the line numbers where a string appears

I am trying to determine how many time a string, Apples appears in a text file and in which lines it appears.
The script outputs incorrect line numbers, instead it outputs numbers consecutively (1,2,..) and not the correct lines for the word.
file.txt
Apples
Grapes
Oranges
Apples
Goal Output
Apples appear 2 times in this file
Apples appear on these lines: 1, 4,
Instead my output as illustrated from the code below is:
Apples appear 2 times in this file
Apples appear on these lines: 1, 2,
Perl
my $filename = "<file.txt";
open( TEXT, $filename );
$initialLine = 10; ## holds the number of the line
$line = 0;
$counter = 0;
# holder for line numbers
#lineAry = ();
while ( $line = <TEXT> ) {
chomp( $line );
if ( $line =~ /Apples/ ) {
while ( $line =~ /Apples/ig ) {
$counter++;
}
push( #lineAry, $counter );
$initialLine++;
}
}
close( TEXT );
# print "\n\n'Apples' occurs $counter times in file.\n";
print "Apples appear $counter times in this file\n";
print "Apples appear on these lines: ";
foreach $a ( #lineAry ) {
print "$a, ";
}
print "\n\n";
exit;
There are a number of problems with your code, but the reason for the line numbers being printed wrongly is that you are incrementing your variable $counter once each time Apples appears on a line and saving it to #lineAry. That is different from the number of the line where the string appears, and the easiest fix is to use the built-in variable $. which represents the number of times a read has been performed on the file handle
In addition, I would encourage you to use lexical file handles, and the three-parameter form of open, and check that every call to open has succeeded
You never use the value of $initialLine, and I don't understand why you have initialised it to 10
I would write it like this
use strict;
use warnings 'all';
my $filename = 'file.txt';
open my $fh, '<', $filename or die qq{Unable to open "$filename" for input: $!};
my #lines;
my $n;
while ( <$fh> ) {
push #lines, $. if /apples/i;
++$n while /apples/ig;
}
print "Apples appear $n times in this file\n";
print "Apples appear on these lines: ", join( ', ', #lines ), "\n\n";
output
Apples appear 2 times in this file
Apples appear on these lines: 1, 4
Change
push(#lineAry, $counter);
to
push(#lineAry, $.);
$. is a variable that stores the line number when using perl's while (<>).
The alternative, if you want to use your $counter variable, is that you move the increment to increment on every line, not on every match.

Pulling out potentially overlapping subsets of elements in array to make smaller arrays

My input file looks like below (real one is much larger):
rs3683945_mark 0
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
IMP!2! 2
SNP117_impute 2.685815665
IMP!3! 3
SNP3_1_impute 3.643119709
SNP1_impute 3.643119809
rs13475706_mark 3.643119909
13 lines, two elements each line. First element is a name. Each name ends either with a "tag" _mark or impute, or there is no tag. The point of the tag is to distinguish between types of names, which form the basis of my search for subsets within the entire list.
The subsets begin with a _mark name that immediately precedes an instance of an _impute name. The subsets end with the very next instance of _mark. All names in between, which will necessarily not have any such tag, also go into a subset, which I'd like to collect into an array and send off to a subroutine to process (details of that not important). Please note, the positions with IMP in the name are not the same as those actually tagged with a _impute.
For example, with the above, the first useable subset is:
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
The second useable subset is:
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
and so on... EDIT: Note that last _mark name of the first set is the first _mark name of the second.
My code for this:
#!/usr/bin/perl
use strict; use warnings;
my $usage = "usage: merge_impute.pl {genotype file} {distances file} \n";
die $usage unless #ARGV == 2;
my $genotypes = $ARGV[0];
open (FILE, "<$genotypes");
my #genotypes = <FILE>;
close FILE;
my $distances = $ARGV[1];
open (DISTS, "<$distances");
my #distances = <DISTS>;
close DISTS;
my #workingset = ();
#print scalar #distances;
for ( my $i = 0; $i < scalar #distances; $i++ ){
chomp $distances[$i];
#print "$distances[$i]\n";
if ( $distances[$i] =~ m/impute/ ){
push ( #workingset,$distances[$i-1],$distances[$i],$distances[$i+1]);
}
print "i=$i: #workingset\n";
# at this point send off to sub routine
#workingset=();
}
As you can see, the if loop is only set up to find subsets that contain only one _impute name. How can I modify the code so that a subset will "fill up" with as many names as required until we arrive at the next _mark name?
EDIT: Perhaps instead of the for loop, I could something like...
push (#workingset, $distances[0], $distances[1] );
until ( $distance[ ??? ] =~ m/_mark/ ){
push ( #workingset, $distance[ ??? ] );
}
But what could $distances[ ??? ] be?
EDIT: Or an alternative for loop...
push (#workingset, $distances[0] );
for ( my $i = 1; $i < scalar #distances - 1 ; $i++ ){
until ( $distances[ $i ] =~ m/_mark/ ){
push ( #workingset, $distances[ $i ] );
# send #workingset to sub routine
#clear workingset
#workingset = ();
}
}
Though this isn't working.
I also tried...
push (#workingset, $distances[0] );
for ( my $i = 1; $i < scalar #distances - 1 ; $i++ ){
until ( $distances[ $i ] =~ m/_mark/ ){
push ( #workingset, $distances[ $i ] );
next if $distances[ $i+1 ] !~ /_mark/;
}
# send #workingset to sub routine here
print "i=$i, #workingset\n\n";
#clear workingset
#workingset = ();
}
I don't have a lot of time right now but I'll hopefully have some time in the morning to check back. Here's a quick example on how you could do it (it is meant to be simple and easy to understand, not fancy). Hopefully it helps you get on the right track for parsing the data.
use strict;
use warnings;
my $first_mark;
my #workingset = ();
my $second_mark;
while (<DATA>){
chomp;
if ( /_mark/ and scalar #workingset == 0 ) {
$first_mark = $_;
} elsif ( /IMP|_impute/ and defined $first_mark) {
push #workingset, $_;
} elsif ( /_mark/ and defined $first_mark) {
$second_mark = $_;
print "Found valid set: ";
print "$first_mark," . join(",", #workingset) . ",$second_mark\n";
#workingset = ();
$first_mark = $second_mark;
undef $second_mark;
}
}
__DATA__
rs3683945_mark 0
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
IMP!2! 2
SNP117_impute 2.685815665
IMP!3! 3
SNP3_1_impute 3.643119709
SNP1_impute 3.643119809
rs13475706_mark 3.643119909
Output:
Found valid set: rs6336442_mark 1E-07,rs31328150_impute 0.444121193,rs3658242_mark 0.444121293
Found valid set: rs3658242_mark 0.444121293,rs39342374_impute 0.444121393,IMP!1! 1,rs3677817_mark 1.986015679
Found valid set: rs3677817_mark 1.986015679,IMP!2! 2,SNP117_impute 2.685815665,IMP!3! 3,SNP3_1_impute 3.643119709,SNP1_impute 3.643119809,rs13475706_mark 3.643119909

Printing in unexpected order

I expected the following to print in the order of the elements of #Data, but it's printing in the order of the elements of #Queries. Am I missing something? I also tried declaring the items to be printed after foreach(#data){... and then printing inside that loop, but still wrong order.
$datafile is a file with the following:
GR29929,JAMES^BOB
GR21122,HANK^REN
$queryfile is a file with the following:
(3123123212):# FD [GR21122]
line 2
line 3
line 4
(12): # FD [HANK^REN]
line 6
line 7
line 8
(13): # FD [Y]
-------------------------------
--------------------------------
(3123123212):# FD [GR29929]
line 2
line 3
line 4
(12): # FD [JAMES^BOB]
line 6
line 7
line 8
(13): # FD [Z]
The output file is:
GR21122,HANK^WREN,Y
GR29929,JAMES^BOB,Z
When I want:
GR29929,JAMES^BOB,Z
GR21122,HANK^WREN,Y
Code is:
open(DA, "<$datafile");
open(QR, "<$queryfile");
my #Data = <DA>;
my #Queries = <QR>;
foreach (#Data) {
my ( $acce, $namee ) = split( ',', $_ );
chomp $acce;
chomp $namee;
print "'$acce' and '$namee'\n";
for my $i ( 0 .. $#Queries ) {
my $Qacce = $Queries[$i];
my $Qname = $Queries[ $i + 4 ];
my $Gen = $Queries[ $i + 8 ];
if ( $Qacce =~ m/$acce/ and $Qname =~ m/$namee/ ) {
my ($acc) = $Qacce =~ /\[(.+?)\]/;
my ($gen) = $Gen =~ /\[(.+?)\]/;
$gen =~ s/\s+$//;
my ($name) = $Qname =~ /\[(.+?)\]/;
print GL "$i,$acc,$gen,$name\n";
}
}
}
The basic shell of your program prints what you ask for, but there is a lot missing. The refactoring below should do what you want.
You had a problem with the values of your $i index variable, so that the first time around the loop you were accessing #data elements [0, 4, 8], the second time [1, 5, 9] etc. It looks like the second loop execution should use elements [11, 15, 19] and so on. Please correct me if I'm wrong.
In addition you were using regular expressions to compare the keys in the two files, and you were finding nothing because the name values contain caret ^ characters which are special within regexes. Escaping the strings using \Q...\E fixed this.
Note that a better solution would use hashes to match keys across the two files, but without details on your file format - particularly queryfile - I have had to follow your own algorithm.
use strict;
use warnings;
use autodie;
my ($data_file, $query_file) = qw/ datafile.txt queryfile.txt /;
my #queries = do {
open my $query_fh, '<', $query_file;
<$query_fh>;
};
chomp #queries;
open my $data_fh, '<', $data_file;
while (<$data_fh>) {
chomp;
my ($acce, $namee) = split /,/;
for (my $i = 0; $i < #queries; $i += 11) {
my ($qacce, $qname, $qgen) = #queries[$i, $i+4, $i+8];
if ( $qacce =~ /\Q$acce\E/ and $qname =~ /\Q$namee\E/ ) {
my ($acc, $name, $gen) = map / \[ ( [^\[\]]+ ) \] /x, ($qacce, $qname, $qgen);
$gen =~ s/\s+\z//;
print "$acc,$name,$gen\n";
}
}
}
output
GR29929,JAMES^BOB,Z
GR21122,HANK^REN,Y

Missmatch in array comparison error message ' Argument "" isn't numeric in array element at'

I am very new to perl and am struggling to get this script to work.
I have taken pieces or perl and gooten them to work as indivual sections but upon trying to blend them together it fails. Even with the error messages that show up I can not find where my mistake is.
The script when working and completed will read an output file and go through it section my section and utilmately generate a new output file with not much more the a heading with some additional text and a value of the amount of lines in that section.
My issues are when it does the looping for each keyword in the array it is now failing with the error message 'Argument "" isn't numeric in array element at'. Perl directs me to a section in the script but I can not see how I am calling the element incorrectly. All the elements in the array are alpha yet the error message is refering to a numeric value.
Can anyone see my mistake.
Thank you
Here is the script
#!/usr/bin/perl -w
use strict;
use warnings;
use diagnostics;
# this version reads each variable and loops through the 18 times put only displays on per loop.
my $NODE = `uname -n`;
my $a = "/tmp/";
my $b = $NODE ;
my $c = "_deco.txt";
my $d = "_deco_mini.txt";
chomp $b;
my $STRING = "$a$b$c";
my $STRING_out = "$a$b$d";
my #keyword = ( "Report", "Last", "HP", "sulog", "sudo", "eTrust", "proftp", "process", "active clusters", "pdos", "syslog", "BNY", "syslogmon", "errpt", "ports", "crontab", "NFS", "scripts", "messages");
my $i = 0;
my $keyword="";
my $x=0;
my $y=0;
my $jw="";
my $EOS = "########################################################################";
my $qty_lines=0;
my $skip5=0;
my $skipcnt=0;
my $keeplines=0;
my #HPLOG="";
do {
print "Reading File: [$STRING]\n";
if (-e "$STRING" && open (IN, "$STRING")) {
# ++$x; # proving my loop worked
# print "$x interal loop counter\n"; # proving my loop worked
for ( ++$i) { # working
while ( <IN> ) {
chomp ;
#if ($_ =~ /$keyword/) {
#if ($_ =~ / $i /) {
#if ($_ =~ /$keyword[ $i ]/) {
if ($_ =~ /$keyword $i/) {
print " $i \n";
$skip5=1;
next;
# print "$_\n";# $ not initalized error when tring to use it
}
if ($skip5) {
$skipcnt++;
print "SKIP LINE: $_\n";
print "Header LINE: $_\n";
next if $skipcnt <= 5;
$skip5=0;
$keeplines=1;
}
if ($keeplines) {
# ++$qty_lines; # for final output
last if $_ =~ /$EOS/;
print "KEEP LINE: $_\n";
# print "$qty_lines\n"; # for final output
push #HPLOG, "$_\n";
# push #HPLOG, "$qty_lines\n";# for final output
}
} ## end while ( <IN> )
} ## end for ( ++$i)
} ## end if (-e "$STRING" && open (IN, "$STRING"))
close (IN);
} while ( $i < 19 && ++$y < 18 );
Here is a sample section or the input file.
###############################################################################
Checking for active clusters.
#########
root 11730980 12189848 0 11:24:20 pts/2 0:00 egrep hagsd|harnad|HACMP|haemd
If there are any processes listed you need to remove the server from the cluster.
############################################################################
This is the output from Pdos log
Please review it for anything that looks like a users may be trying to run something.
#########
This server is not on Tamos
############################################################################
This is the output from syslog.conf.
Look for any entries on the right side column that are not the ususal logs or location.
#########
# #(#)34 1.11 src/bos/etc/syslog/syslog.conf, cmdnet, bos610 4/27/04 14:47:53
# IBM_PROLOG_BEGIN_TAG
# This is an automatically generated prolog.
#
# bos610 src/bos/etc/syslog/syslog.conf 1.11
I truncated the rest of the file
Can anyone see my mistake.
I can see quite a lot of mistakes. But I also see some good stuff like use strict and use warnings.
My suggestion for you is to work on your coding style so that it gets easier for you and others to debug any problems.
Naming variables
my $NODE = `uname -n`;
my $a = "/tmp/";
my $b = $NODE ;
my $c = "_deco.txt";
my $d = "_deco_mini.txt";
chomp $b;
my $STRING = "$a$b$c";
my $STRING_out = "$a$b$d";
Why are some of those names all uppercase and others all lower case? If you are building up a filename, why do you call the variable that holds the filename $STRING?
my #keyword = ( "Report", "Last", "HP", "sulog", "sudo", ....
If you have a list of several keywords, wouldn't it be apt to not chose a singular for the variable name? How about #keywords?
Using temporary variables you don't need
my $NODE = `uname -n`;
my $a = "/tmp/";
my $b = $NODE ;
my $c = "_deco.txt";
chomp $b;
my $STRING = "$a$b$c";
Why do you need $a, $b and $c? The (forgive me) stupid names of those vars are a tell-tale sign that you don't need them. How about this instead?
my $node_name = `uname -n`;
chomp $node_name;
my $file_name = sprintf '/tmp/%s/_deco.txt', $node_name;
Your biggest problem: you have no idea how to use arrays
You are making several drastic mistakes when it comes to arrays.
my #HPLOG="";
Do you want an array or another string? The # says array, the "" says string. I guess you wanted a new, empty array, so my #hplog = () would have been much better. But since there is no need to tell perl that you want an empty array as it will give you an empty one anyway, my #hplog; will do the job just fine.
It took me a while to figure out this next one and I'm still not sure whether I'm guessing your intentions correctly:
my #keyword = ( "Report", "Last", "HP", "sulog", "sudo", "eTrust", "proftp", "process", "active clusters", "pdos", "syslog", "BNY", "syslogmon", "errpt", "ports", "crontab", "NFS", "scripts", "messages");
...
if ($_ =~ /$keyword $i/) {
What I think you are doing here is trying to match your current input line against element number $i in #keywords. If my assumption is correct, you really wanted to say this:
if ( /$keyword[ $i ]/ ) {
Iterating arrays
Perl is not C. It doesn't make you jump through hoops to get a loop.
Just look at all the code you wrote to loop through your keywords:
my $i = 0;
...
for ( ++$i) { # working
...
if ($_ =~ /$keyword $i/) {
...
} while ( $i < 19 && ++$y < 18 );
Apart from the facts that your working comment is just self-deception and that you hard-coded the number of elements in your array, you could have just used a for-each loop:
foreach my $keyword ( #keywords ) {
# more code here
}
I'm sure that when you try to work on the above list, the problem that made you ask here will just go away. Have fun.

Linear interpolation of missing data from table of values

Here is an excerpt from a data file, my starting point:
Marker Distance_1 Distance_2 ID
.
.
.
30 13387412 34.80391242 seq-SN_FIRST
31 13387444 34.80391444 seq-SN_Second
31.1
31.2
32 13387555 39.80391 seq-SN_Third
.
.
.
This is a tab-delimited file of multiple rows of four elements each. First row is a header. After that, numerous rows of data. The vertical dots are not actually in the real file, but they are here just to represent that data similar to the actual rows shown occur before and after the example fo rows explicitly shown.
Some of the data rows are "full", that is, all four cell entries contain something. Other rows are "blank", with only a first actual entry but followed by 3 tab delimited single spaces. Those white spaces in the blank rows need to be "filled in." The filling in will be done by linear interpolation, using the corresponding cell entries of the immediately preceding and immediately succeeding rows. For example, missing Distance_1 values, in column 2, will be interpolated using the value 13387444 of the preceding row and the value 13387555 of the succeeding row. Similarly so for the column 3 values. Column 4 values are just ignored here.
The first goal of the script is to identify the blocks of data that require filling and their flanking "full" lines. Blank lines will contain 3 tabbed single spaces and will be ID'd that way. Once found, the consecutive sets of blank lines plus flanking full lines are sent to subroutine for interpolation.
#!/usr/bin/perl
use strict;
use warnings;
die "usage: [ map positions file post SAS ]\n\n" unless #ARGV == 1;
my #file = ();
while (my $line = <$ARGV[0]>) {
chomp $line;
push(#file, $line);
}
my #consecutive_lines = (); # array collects a current set of consecutive lines requiring linear interpolation
my #filled = (); # my final array, hopefully correctly filled in
#####
# search for consecutive set of lines in #file
#####
for (my $i = 0; $i < $#file; $i++) { # $#file returns the index of the last element in #file
if ($file[$i] !~ /(?:\t\s){3}/) { # if not a blank line
# but a "full line"
push(#filled, $file[$i]); # push the header and full lines, until...
}
elsif ($file[$i] =~ /(?:\t\s){3}/) { # ...a blank line is found
push (#consecutive_lines, $file[$i - 1]); # push preceding full line into array
while ($file[$i] =~ /(?:\t\s){3}/ and $i < $#file) { # keep pushing lines, so long as they are blank
# or end of file
push(#consecutive_lines, $file[$i++]);
}
push(#consecutive_lines, $file[$i]) ; # else we reach next full line, so push it into array
my #fillme = linearInterpolation(#consecutive_lines); # send set of lines out for filling
push(#filled, #fillme); # push filled in set of lines into the final array
#consecutive_lines = (); # reset or undef array #consecutive_lines for next cycle
} # end of elsif
} # end of for loop
Thanks to user #Kenosis for lots help with the above, which I have since modified (hopefully not mangled).
Next is the linear interpolation. It is here where I am trying to link the first phase of the script to the second phase. And it is not working well so far.
My goal is to hand off the array #incoming to the subroutine. This array is then split up, so that the actual cell entries are "visible" and can be indexed by the array, and so called upon. I have been trying to figure out how to do this for the column 2 values for Distance_1 first. I feel like this script gets close and it begins to stray at the point just after the interpolated values are calculated.
#####
# subroutine linear interpolation
#####
sub linearInterpolation {
my #incoming = #_; # array of consecutive set of lines
my #splitup; # declare new array, will be a "split up" version of #incoming
my ($A, $B, $C, $D, $E); # variables for linear interpolation
my #fillme; # declaring the "emtpy" array to be filled in
my #dist_1_fills; # array of interpolated values for dist_1
for (my $i = 0;
$i < scalar #incoming; $i++) # loop to split up lines of #incoming
{ # into indexed cell entries
chomp $incoming[$i]; # and make new array of them
my #entries = split('\t', $incoming[$i]);
push(#splitup, #entries);
}
$A = $splitup[1]; # cell entry in column 2 of preceding full line
$B = $splitup[-3]; # cell entry in column 2 of succeeding full line
$C = $splitup[2]; # cell entry in column 3 of preceding full line
$D = $splitup[-2]; # cell entry in column 3 of succeeding full line
$E = scalar #incoming - 1; # equals number of lines in the set minus 1
for (my $i = 1; $i < $E; $i++) { # need to start finding appropriate
# number interpolated values, given number of
my #dist_1_fills =
interpvalues($A, $B, $E, $i); # of lines in consecutive set of lines
for ($i = 0; $i < scalar #splitup; $i += 4) {
push(#fillme, $splitup[$i], $dist_1_fills[$i], "dist_2_fills", "--");
# fourth column values will be ignored or filled with --.
# "dist_2_fills" just occupying it's proper spot until I can figure out distance 1 fills
}
}
}
#########
sub interpvalues { # subroutine to find interpolated values
my ($A, $B, $E, $i) = #_;
my $dist_1_answers = (($B - $A) / ($E)) * $i + $A;
return $dist_1_answers;
}
The code gets confused in the second part that deals with finding the interpolated values and sending them back to the first part of the code to finally fill in the data set. I think specifically my biggest (though probably not my only) problem is trying to fill in the blank lines with the proper values after they have been calculated in the second subroutine.
Any hints and clues are greatly appreciated!
This program will do what you need. It expects the inout file name as a parameter on the command line.
use strict;
use warnings;
my #saved;
my #needed;
while (<>) {
chomp;
my #fields = split /\t/;
# Pass hrough headers and junk
unless ($fields[0] and $fields[0] =~ /\d/) {
print "$_\n";
next;
}
# Save x-value for records without a y-value
if ($fields[1] !~ /\d/) {
push #needed, $fields[0];
next;
}
# We have a filled-out row. Calculate any intermediate missing ones
if (#needed) {
if ($saved[0] == $fields[0]) {
die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
}
my ($a1, $b1) = solve_linear(#saved[0,1], #fields[0,1]);
my ($a2, $b2) = solve_linear(#saved[0,2], #fields[0,2]);
while (#needed) {
my $x = shift #needed;
my $y1 = $a1 * $x + $b1;
my $y2 = $a2 * $x + $b2;
print join("\t", $x, $y1, $y2), "\n";
}
}
print "$_\n";
#saved = #fields;
}
sub solve_linear {
my ($x0, $y0, $x1, $y1) = #_;
my ($dx, $dy) = ($x1 - $x0, $y1 - $y0);
my $aa = $dy / $dx;
my $bb = ($y0 * $dx - $x0 * $dy) / $dx;
return ($aa, $bb);
}
output
Marker Distance_1 Distance_2 ID
.
.
.
30 13387412 34.80391242 seq-SN_FIRST
31 13387444 34.80391444 seq-SN_Second
31.1 13387455.1 35.303913996 --
31.2 13387466.2 35.803913552 --
32 13387555 39.80391 seq-SN_Third
.
.
.
Tool completed successfully
I modified the code to this, so that the linear interpolation is based not on the values in the first column but rather on the values in the second and third columns. Thanks especially to users #Kenosis and #Borodin. I have accepted Kenosis' answer to a previous question, and I have accepted Borodin's here, even though I post this revision in the "answer your own question" section. Is posting a revision here acceptable? I skimmed the FAQ about this but haven't found anything relevant yet.
#!/usr/bin/perl
use strict; use warnings;
my #saved;
my #needed;
while (<>) {
chomp;
my #fields = split /\t/;
# Does the current line actually exist AND does it contain one or more digits.
unless ($fields[0] and $fields[0] =~ /\d/) {
# If no, this is the header, so print it. If yes, advance.
print "$_\n";
#after printing header, go back to <> and read in next line.
next;
}
# Is the second cell of the current line devoid of digits?
if ($fields[1] !~ /\d/) {
# If no, advance. If yes, remember $field[0], the Marker.
push #needed, $fields[0];
# After pushing, go back to <> and read in next line.
next;
}
# If we are here, we must have a filled-out row.
# Does #needed have any values? If no, advance. If yes,
if (#needed) {
if ($saved[0] == $fields[0]) {
die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
}
# Else send preceding dist_1 value, succeeding dist_1 value,
# preceding dist_2 value, succeeding dist_2 value,
# and number of emtpy lines to subroutine.
my ($dist_1_interval, $dist_2_interval) = interval_sizes($saved[1], $fields[1], $saved[2], $fields[2], scalar #needed);
# Current size of #needed is saved as $size and is used to help with iteration.
# So long as #needed contains values...
my $size = scalar #needed;
while (#needed) {
# ...remove left-most Marker value from array #needed.
my $x = shift #needed;
# Interpolated values for dist_1 and dist_2 are
# (respective interval size x iteration of while loop) + preceding values.
my $new_dist_1 = ($dist_1_interval * (1 + ($size - (scalar #needed + 1)))) + $saved[1];
my $new_dist_2 = ($dist_2_interval * (1 + ($size - (scalar #needed + 1)))) + $saved[2];
print join("\t", $x, $new_dist_1, $new_dist_2, "--"), "\n";
}
}
# We are here since current line is already a filled-in row.
print "$_\n";
# Print this row and assign it to #saved. Return to <>.
#saved = #fields;
}
sub interval_sizes {
# $A = preceding dist_1, $B = succeeding dist_1,
# $C = preceding dist_2, $D = succeeding dist_2,
# $E = number of needed distances.
my ($A, $B, $C, $D, $E) = #_;
# I need an interval size for dist_1 based on difference between $B and $A.
my $dist_1_interval = ($B - $A)/($E + 1);
# I need an interval size for dist_2 based on difference between $D and $C.
my $dist_2_interval = ($D - $C)/($E + 1);
return ($dist_1_interval, $dist_2_interval);
}

Resources