How can i split string per line on file - arrays

I have a file that contain string like this:
- ' *[0-9]-? [^a-c]#[*-^a-c]' '' < temp-test/758.inp.325.1
- ' *[0-9]-? [^a-c]#[*-^a-c]' '' < temp-test/759.inp.325.3
- ' *[0-9]#[[9-B]??[0-9]-[^-[^0-9]-[a-c][^a-c]' 'NEW' < temp-test/1133.inp.487.1`enter code here`
- ' *[0-9]#[[9-B]??[0-9]-[^-[^0-9]-[a-c][^a-c]' 'NEW' < temp-test/1134.inp.487.2
- '"##' 'm' < input/ruin.1890
I want to split this string per line to be 2 part , and i hope the result like this:
- line 1: array[0]=' *[0-9]-? [^a-c]#[*-^a-c]'; array [1]='' < temp-test/758.inp.325.1
- line 2: array[0]=' *[0-9]-? [^a-c]#[*-^a-c]'; array [1]='' < temp-test/759.inp.325.3
- line 3: array[0]=' *[0-9]#[[9-B]??[0-9]-[^-[^0-9]-[a-c][^a-c]'; array[1]='NEW' < temp-test/1133.inp.487.1
- line 4: array[0]=' *[0-9]#[[9-B]??[0-9]-[^-[^0-9]-[a-c][^a-c]'; array[1]='NEW' < temp-test/1134.inp.487.2
- line 5: array[0]='"##'; array[1]='m' < input/ruin.1890
and the code i've try is like this:
#!/usr/bin/perl
# location of universe file
$tc = "/root/Desktop/SIEMENS/replace/testplans.alt/universe";
# open file universe;
open( F, "<$tc" );
#test_case = <F>;
while ( $i < 5 ) {
$test_case[$i] =~ s/ //;
#isi = split( / /, $test_case[$i] );
if ( $#isi == 2 ) {
print "Input1:" . $isi[0] . "\n";
print "Input2:" . $isi[1] . "\n";
print "Input3:" . $isi[2] . "\n";
}
$i++;
}
I am confused because i can't slit that string with " " (space), because each of line have different order space and i can't get to be 2 part.
Thank you.

use strict;
use warnings;
# this stuff just gives me your data for testing
my $data =<<EOF;
' [0-9]-? [^a-c]\#[-^a-c]' '' < temp-test/758.inp.325.1
' [0-9]-? [^a-c]\#[-^a-c]' '' < temp-test/759.inp.325.3
' *[0-9]\#[[9-B]??[0-9]-[^-[^0-9]-[a-c][^a-c]' 'NEW' < temp-test/1133.inp.487.1
' *[0-9]\#[[9-B]??[0-9]-[^-[^0-9]-[a-c][^a-c]' 'NEW' < temp-test/1134.inp.487.2
'"\#\#' 'm' < input/ruin.1890
EOF
for my $line (split(/\n/,$data))
{
# this re splits your strings according
# to what I perceive to be your requirements
if ($line =~ /^(.*)('.*?' <.*)$/)
{
print("array[0]=$1; array[1]=$2;\n")
}
}
1;
Output:
array[0]=' [0-9]-? [^a-c]#[-^a-c]' ; array[1]='' < temp-test/758.inp.325.1;
array[0]=' [0-9]-? [^a-c]#[-^a-c]' ; array[1]='' < temp-test/759.inp.325.3;
array[0]=' *[0-9]#[[9-B]??[0-9]-[^-[^0-9]-[a-c][^a-c]' ; array[1]='NEW' < temp-test/1133.inp.487.1
array[0]=' *[0-9]#[[9-B]??[0-9]-[^-[^0-9]-[a-c][^a-c]' ; array[1]='NEW' < temp-test/1134.inp.487.2;
array[0]='"##' ; array[1]='m' < input/ruin.1890;
Applied to your code it might look something like this:
while ($i<5)
{
$test_case[$i] =~ /^(.*)('.*?' <.*)$/;
#isi = ($1,$2);
print "Input1:".$isi[0]."\n";
print "Input2:".$isi[1]."\n";
$i++;
}

Related

Add all values in array for each ID in Perl

I have this table:
NAME |12/31/2016|VALUE
AAA |1/31/2017 |10
AAA |2/1/2017 |20
AAA |2/2/2017 |30
AAA |2/3/2017 |40
AAA |2/4/2017 |50
NAME |2/9/2017 |VALUE
BBB |2/10/2017 |20
BBB |2/11/2017 |30
BBB |2/12/2017 |40
BBB |2/13/2017 |50
BBB |2/14/2017 |60
and this would be my desired output:
NAME |DATE |VALUE
AAA |12/31/2016 |150
AAA |1/31/2017 |140
AAA |2/1/2017 |120
NAME |DATE |VALUE
BBB |2/9/2017 |200
BBB |2/10/2017 |180
BBB |2/11/2017 |150
What I want to do is, for each of the valid symbols, (AAA, BBB) I want to have three rows.
For the first row of each column, I want all the values added,
For example, row 1 value for AAA:
10+20+30+40+50 = 150
then for row 2 I want to just add from the second value to the last.
For example row 2 value for AAA
20+30+40+50 = 140
and so on same goes for BBB.
I want to shift the dates down so that 12/31/2016 would match AAA, then get the first three dates for each row.
I currently have this code. but this doesn't do much. it just gives me a bunch of numbers.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use Data::Dumper;
sub uniq {
my %seen;
grep !$seen{$_}++, #_;
}
my %cashflow;
my %fields = (
ID => 0,
DATES => 1,
VALUE => 2,
);
my #total;
my #IDs;
my #uniqueIDs;
my #dates;
my #add;
my $i = 0;
my #values;
my $counter = 3;
open( FILE, "try.CSV" );
while ( my $line = <FILE> ) {
chomp( $line );
my #lineVals = split( /\|/, $line );
if ( $lineVals[ $fields{ID} ] !~ /^SYMBOL$/i ) {
push #IDs, $lineVals[ $fields{ID} ];
}
#uniqueIDs = uniq( #IDs );
#push all CASH FLOW AMOUNTS to #cashflow
if ( looks_like_number( $lineVals[ $fields{VALUE} ] ) ) {
$lineVals[ $fields{VALUE} ] =~ s/\r//;
push #total, $lineVals[ $fields{VALUE} ];
}
if ( $lineVals[ $fields{DATES} ] =~ /(\d{1,2})\/(\d{1,2})\/(\d{4})/ ) {
$lineVals[ $fields{DATES} ] = sprintf( '%04d%02d%02d', $3, $2, $1 );
}
$cashflow{ uc $lineVals[ $fields{ID} ] }{DATES} = $lineVals[ $fields{DATES} ];
$cashflow{ uc $lineVals[ $fields{ID} ] }{VALUE} = $lineVals[ $fields{VALUE} ];
foreach my $ID ( #uniqueIDs ) {
foreach my $symb ( keys %cashflow ) {
if ( $ID = $symb ) {
if ( looks_like_number( $lineVals[ $fields{VALUE} ] ) ) {
$lineVals[ $fields{VALUE} ] =~ s/\r//;
push #total, $lineVals[ $fields{VALUE} ];
my $i = 0;
my $grand = 0;
foreach my $val ( #total ) {
while ( $i < $counter ) {
$grand += $val;
print "$grand \n";
$i++;
}
shift #total;
}
}
}
}
}
}
close FILE;
I'm really stuck with this. I don't know what to do with the problem.
A possible solution:
#!perl
use strict;
use warnings;
sub trim {
my ($str) = #_;
s!\A\s+!!, s!\s+\z!! for $str;
$str
}
my $file = 'try.CSV';
open my $fh, '<', $file or die "$0: $file: $!\n";
my ($group_name, #dates, #values);
my $sum = 0;
my $print_group = sub {
return if !defined $group_name;
my $format = " %-6s|%-11s|%s\n";
printf $format, 'NAME', 'DATE', 'VALUE';
for my $date (#dates) {
printf $format, $group_name, $date, $sum;
$sum -= shift #values if #values;
}
};
while (my $line = readline $fh) {
my ($name, $date, $value) = map trim($_), split /\|/, $line;
if ($name eq 'NAME') {
$print_group->();
$group_name = undef;
#dates = $date;
#values = ();
$sum = 0;
next;
}
$group_name ||= $name;
push #dates, $date if #dates < 3;
push #values, $value if #values < 2;
$sum += $value;
}
$print_group->();
Let's go over it.
sub trim {
my ($str) = #_;
s!\A\s+!!, s!\s+\z!! for $str;
$str
}
A helper function for removing leading/trailing whitespace from a string. We're using ! as the s delimiter here because / breaks SO's syntax highlighting. Shrug.
my $file = 'try.CSV';
open my $fh, '<', $file or die "$0: $file: $!\n";
Open our input file. Note: We use a lexical variable ($fh) instead of a bareword filehandle, and we use 3-argument open. This is strongly recommended. We also check open's return value and produce a nice error message in case of failure, including both the name of the file that couldn't be opened ($file) and the reason for failing ($!).
my ($group_name, #dates, #values);
my $sum = 0;
We set up some state variables that we want to preserve across loop iterations. $group_name is the name of the group we're currently processing, #dates is the saved dates we've seen so far, #values is the saved values we've seen so far. $sum is a running sum of all the values in the current group, and it starts at 0.
my $print_group = sub {
return if !defined $group_name;
my $format = " %-6s|%-11s|%s\n";
printf $format, 'NAME', 'DATE', 'VALUE';
for my $date (#dates) {
printf $format, $group_name, $date, $sum;
$sum -= shift #values if #values;
}
};
A helper function for printing the output for a single group. If $group_name isn't set, we haven't processed any input for the current group yet, so we do nothing and return. Otherwise we print a NAME | DATE | VALUE header, followed by a row of data for each element in #dates. For each $date we output the current group name (e.g. AAA), $date, and the sum of values (all nicely formatted using printf). Initially $sum is the sum of all group values, but after the first iteration we start subtracting the values from #values: If the list of values in the input was x1, x2, x3, x4, ..., then $sum is initially x1 + x2 + x3 + x4 + ..., and that's what's printed in the first line of output. After that we subtract x1, so the next line gets x1 + x2 + x3 + x4 + ... - x1, which is x2 + x3 + x4 + .... After that we subtract x2, so the third row of data gets x3 + x4 + ....
while (my $line = readline $fh) {
my ($name, $date, $value) = map trim($_), split /\|/, $line;
Our main loop. We read a line of input, split it on |, and trim each field.
if ($name eq 'NAME') {
$print_group->();
$group_name = undef;
#dates = $date;
#values = ();
$sum = 0;
next;
}
If $name is 'NAME', this is the start of a new group. Print the output for the current group if any ($print_group->() does nothing if there is no current group), then reset our state variables back to initial values, except for #dates, which is filled with the $date value from the header row. Then start the next iteration of the loop because we're done with this line.
$group_name ||= $name;
push #dates, $date if #dates < 3;
push #values, $value if #values < 2;
$sum += $value;
If we get here, this line is not the start of a new group. We set $group_name if it hasn't been set yet. We add $date to our list of saved dates (but we only need 3 dates, so do nothing if we already have 3). We add $value to our list of saved values (but we only need 2 of them). Finally we add $value to our total $sum within the group.
}
$print_group->();
At the end of the loop we've also just finished processing a group, so we need to call $print_group here as well.
This will do as you ask. It reads the whole data file into an array of arrays and manipulates that array before printing it. The blocks are processed backwards from the end so that the other blocks remain in place when the trailing lines are deleted
This program expects the path to the input file as a parameter on the command line and writes the result to STDOUT
use strict;
use warnings 'all';
my #data = map [ /[^|\s]+/g ], <>;
# Make a list of the indices of all the header rows
my #headers = grep { $data[$_][0] eq 'NAME' } 0 .. $#data;
# Make a list of the indices of the first
# and last lines of all the data blocks
my #blocks = map {
[
$headers[$_] + 1,
$_ == $#headers ? $#data : $headers[$_+1] - 1
]
} 0 .. $#headers;
# Shift the second column down
# Replace the col2 header with 'DATE'
#
$data[$_][1] = $data[$_-1][1] for reverse 1 .. $#data;
$data[$_][1] = 'DATE' for #headers;
# Edit each block of data
#
for my $block ( reverse #blocks ) {
my ( $beg, $end ) = #$block;
# Calculate the block total
my $total = 0;
for ( $beg ... $end ) {
$total += $data[$_][2];
}
# Calculate the first three data values
for my $i ( $beg .. $beg + 2 ) {
my $next = $total - $data[$i][2];
$data[$i][2] = $total;
$total = $next;
}
# Remove everything except those three lines
splice #data, $beg+3, $end-$beg-2;
}
print join('|', #$_), "\n" for #data;
output
NAME|DATE|VALUE
AAA|12/31/2016|150
AAA|1/31/2017|140
AAA|2/1/2017|120
NAME|DATE|VALUE
BBB|2/9/2017|200
BBB|2/10/2017|180
BBB|2/11/2017|150

Use of uninitalized value in concatenation <.> or string at xyz.pl line 27

I am writing this perl program and want to create an array which should be stored in output file according to the values of for loop. I am new to this programming.
This is my code
use strict;
use warnings;
open( my $out_fh, ">", "output.txt" ) || die("Cannot open file.\n");
my ( $x, $y, $i, $j, $k, $p, $q );
my ( #Xrow, #b, #b_l, #w );
print("Enter the number of rows:\n");
$p = <STDIN>;
chop($p);
print("Enter the number of columns:\n");
$q = <STDIN>;
chop($q);
$x = 2**$p;
$y = 2**$q;
#Xrow = ( #b, #b_l, #w );
for ( $i = 0; $i < $x * $y; $i = $i + 1 ) {
for ( $j = 0; $j < $x; $j = $j + 1 ) {
for ( $k = 0; $k < $y; $k = $k + 1 ) {
$Xrow[$i] = "$b[$j],$b_l[$j],$w[$k]";
foreach (#Xrow) {
print $out_fh "$_\n";
}
}
}
}
So output should look like for example p=q=1
Xrow0 b0 b_l0 w0
Xrow1 b1 b_l1 w0
Xrow2 b0 b_l0 w1
Xrow3 b1 b_l1 w1
so it should print like this in output file without any braces and "="
But I am getting error like this
Use of uninitialized value in concatenation<.> or string at xyz.pl in line 27
Use of uninitialized value within #b in concatenation<.> or string at xyz.pl in line 27
You don't populate the arrays anywhere, so they stay empty.
But, in fact, you need no arrays to get the output you want.
Additional advice:
Prefer chomp to chop.
Declare the variables when you need them, not at the top of the program/subroutine.
How I'd do it:
#!/usr/bin/perl
use warnings;
use strict;
open my $OUT, '>', 'output.txt' or die "Cannot open file.\n";
print "Enter the number of rows:\n";
my $p = <STDIN>;
chomp $p;
print "Enter the number of columns:\n";
my $q = <STDIN>;
chomp $q;
my $x = 2 ** $p;
my $y = 2 ** $q;
my $i = 0;
for (my $j = 0; $j < $x; $j = $j + 1) {
for (my $k = 0; $k < $y; $k = $k + 1) {
print {$OUT} "Xrow$i b$k b_l$k w$j\n";
++$i;
}
}

Perl Script that should move through array with i=3 prints indicies that aren't x3

I have these arrays of Sequences and I wrote this script to walk through each sequence three letters at a time (eg. {0,1,2}, {3,4,5},{6,7,8}) and print the index of where it first encounters a certian 3 letter combination (TAA,TAG,TGA). (EX. if sequence were CGTAGCCCCTAACCCC, then the script would skip over the TAG in the 2 position because its not in the correct frame of 3 and report the TAA in the 9 position). Therefore, I am only expecting indices in multiples of 3 in my results.
On most strings there is no problem, however every once in a while it will index at 4 or other non multiples of three. I was wondering if anyone more advanced than I can figure out why this may happen. I know this script is ugly and I am sorry for that, I am a biologist and I mod it for whatever I am mining out of sequences at the time. I just can't figure out the bug.
Here are some sequences from my file. The 3rd line is the sequence that gives the strange result. Just for an example of what I am dealing with.
AGGTACGCGAGTCACCTTTCGTCTTCAATCTCGTTTGATCGAAGCTATTTGTCAAAAAGAGAGGATTTTTTTGCATCTCAATTATGATCATTCCTTAGGGTTTTCAGGGTTTTGGATTGTTGTTTTTGTTAACATTTATCTGATTCGTTTGTATTTGTGTGGCAGTCTAAAGTGGCATCAACAATGGCGTCTTTTATTATACATAAGCCAAAGGAGAGATCGCCTTTCACGAAAGCTGCTTTCAAAACGGTACCTTTAGTGATTCAGCATTTTTATCTGAAATATGTTTGTTGCATTATTGAATGATTCTGATGTGGTGTTGCTACCAACTTGTCTATGTTGGTTGATTTAGCTTGATAGCATCAAGGAGTTGGAACTGTTTATGTTGAAGCATCGAAAGGATTATGTTGATCTGCACCGGACTACAGAACAGGAAAAGGATAGTATTGAACAAGAAGTAAGTACTCTGAGCTAGGCTTGCCCGTAGTATATATCTGAACTCATGAAGTTACTGCGATAAATCTATGCTTGAGTTGAGATTGAACATATGGAACTATGGAATCATAAGAAATGTAGCAACTCATATTGAGATAACTCAGGAAGATTAATGTCTATTACTTTAGATAGCGAGGGAGTTAGTATATTGTGACACTGAGGAACTTGGATCTTGTATTCTTATACCTCTTGCAGTGTTTGATCGAGAACTATGTCTACTTATGTGTTGTGTAATATCATCAAACTCTCTCTCTCTCCCTCTTGCAGGTTGCTGCTTTTATTAAAGCTTGCAAAGAACAGATCGATATTCTCATAAACAGTATTAGAAATGAAGAAGCAAACTCCAAAGGATGGCTTGGCCTCCCCGCAGATAACTTCAATGCTGATTCTATAGCACACAAACATGGAGTGGTATGATATGCACCAATGTAGTAAGCCAACTTTGGTTTTTTTTTACTATGTTTTCTTTCAAAGTATCTAGATGTGTAGAAGTAATGGTAATTTTTTTTGTATGCAGGTTTTGATTCTGAGTGAGAAACTTCATTCAGTCACTGCCCAGTTTGATCAGCTTAGAGCTACTCGTTTCCAAGATATTATAAACAGAGCTATGCCGAGAAGAAAACCTAAGAGGGTCATAAAGGAAGCTACCCCAATTAATACAACTCTGGGAAATTCGGAGTCCATAGAACCGGATGAAATCCAGGCCCAACCTCGTAGATTACAACAACAACAACTTCTAGACGATGAAACACAAGCCCTTCAGGTAACAAGGCAAATATACATGATCTTCGAAAACTTGCATAAGTTTTGTAGTTATGCTAAATTTTGAAATTGATAATTTTTGCAGGTAGAGCTAAGTAATCTTTTAGATGGTGCTAGGCAGACAGAAACTAAGATGGTGGAGATGTCTGCATTAAACCACTTGATGGCAACTCATGTTCTGCAGCAAGCCCAACAGATAGAGTTTCTTTATGACCAGGTTAGGACTTATTAACTTCTCTAACGCTCTCATGTCAACACACTGTTTTGTTAGGCTTTCACTGTTCTTTACACTCCTTTGCTATCTCAAAGTTAAATTCGGATGCTTATTGTATTCAGAACTTTTCCTTGTCACATTCACCTAAATTAGGTATAGAGACGGGAAAGAAACTTTGTATTGGTCCAATTTTAATTGCTCTCCAATTTAGTGGTAGGAAATGGAACGGTTAATGTTTTTAGCTATGTAAAGTCTCTAAAACTCCATTTGAATGTGTCAATGACTCAATGCCATTCCCAATACTTTAGTTTATGGGGCTTTGCAGTTTTCCTACTCTGTAAACGTACAGCTTATGACTGACTTGGTGGCTCTCTTTATGTGTGTGTGTGTGTGTCTTGAGGCCCTTTTTCTCACTCAGTTTGACACTAAATGCAGGCAGTTGAGGCAACAAAGAACGTGGAGCTTGGAAACAAAGAGCTTTCTCAAGCAATCCAACGAAACAGCAGCAGCAGAACCTTTCTCTTACTGTTTTTCTTCGTCCTTACTTTCTCCGTCTTGTTCTTGGATTGGTACAGTTAAaaaacc
AGGTGATTGTTTTGTTATTATAAATCAAGATCAGTACATATATATTTTTGTTTTTCTTGGTTTCATATGTAATATTTTGGACTTTTGGTGTTTAGGTTTTTGACTTGGAAGAAAAGAACGTAATGGATGAGTCACTACACGAGGTGTATAAATTTTGCCTCACCGATGTTGATGAGAGAAGCAAGAAAGAGACATCAATGAAAGATGATTACATAGAACATAAGAAGTCTACTAGATTGTTGGCTGAAAATGCGAAGAAGTCCGGTCACAGTTTAGAAATATTAAGGCCGGAATCTAAACCTGAGACTGAAAAAGAGGTGATTTTATTTTCTTGTTATATAAAGATTCGTAGACATATATTTGGTTTTTCTTTGGTTTCATAATATTTTGGACTTATGTGTGTTTAGGTCAATGAAGAGGAAGAGAAGAGAGTAATGGATCCGGATGTGGATATTAGTTGTTATGAAGAGTCACCACACGAGGTGTATAAATTTAGCCTCACCGATTTCGAAGAAGAGATAATGGAAGATGATTACAGAGAAGATATGAAGTGTAGAATGTTGGATGATATAGTGAAGAATTCCGGTCACCGTGTAGAAATATCAAGGCCGGAATATTATAAACCTGAGATTGAAAAACAGGTTTTATTTTTTTGGTTATTTTGTGATTAAGATCAGTTTTTTTTTTTTTTTTTTTTGGTTTAATAATATTTGATCTTGTGTGTGTTTAGGTATATGAAAAGGAAGAGAAGAAAGTAATGGATCCGGATATCTATATTAGATCTTATGAAGAGTCACCAAACGAGGTGTATAAATTTAGCCTCACTGATTTGGAAGAAGAGATAATGGAAAATGACTCCATAGAAGGTGTGAAGTGTAGAATGTTGGATGAAATAATGAAGAAGTCCGGTCACCATTTAAAAATATCAAGGCCGGAATATAAACCTGAGATTGAAAAACAGGTTAGTTTTTAATAAAAAGATCACTAGATATTTTTTTTTATTTTTTTTTGTTTTTGGTTTCATAATATTTGACTTGTGGCATGTGTTTAGGTATATGAAGAGGAAGAGAAGAAAGTAATGGATCCAGATGTGGATATTAGATGTTATGAAGAGTCACCACACGAGGTGTCTAAATTTAGCCTCACCGATTTCGAAGAAGAGATAATGGAAGATGATTACATAGAAGCTTTGAAGTGTAGAATGTTGGATGATATATTGAAGAAGTCCGGTCACCGTTTAGAAATATCAAGGCGGCAATATAATAAACCTGAGATTGAAATACAGGTGATTTTTTTTTTTTATTATTGTTGTTATAGTAAGATCAGTAGATATATATCTTGGTTTCATAATATTTTGGACTTGTGTGTGTTTAGGTCAATGAAAAGGAAGAGAAGAAAGTAATCAATACGGATATGGATATTAGATATGATGATGAGTCACCAGAAGAGGTGGAGACATATTCTAGTCTCACGGATGATGAAGAAGAGAGAAGCAAGGAAGATACATCAATGGAAGATGTGAAGTGTAGAATGTTGGATTAAAAAACGACGAAGCTCGGCCACCTTTTAGGAATATCAAGGCCGGAATATAGACCTGAGATTGAAAAACAGGTGATTTTATTTTGTTGTTAATTGTATTAGTAAAGATCAGTAGATATATATTTGTTTTTGTTTTTCGGTTTCATAATATTTTGGACGCTTGTGTTTAGGTCAATGAAGAGAAAGAAAGAAAGTAATGGATATTAGATCTGCTGGTCAGTCACAAACACGAGGTGTACAAATTTAGCCTCACCGATATCAAAGAAGAGAGAAGCAATGAAGATACATCAATGGAAGATTGTTGCATAGAAGAGGCTCAAGTCGGAAAAGATCAAAGAGTCTTCAGATTCAGAGAAAGTAGTGAAGAGAAGAGAAAATCCTCATCATCACCATTATCACCACTAACAGAGTTTAGGGATATGGAGAGTTTGACGTATTACATGAGGCAAAAAGGGATGCATCGAAGAAGAAGAAGATCATCAACATCACCACATTGTTGCCATAATGTAGTATACAATGAGTTTAAAGTGACGAAGGAAGAAGAAGAGGAAGAAAGACAAAGATTAACAACCAAACGTGTTCATTCTAAGCTTCATGAATACGAACAATTTTTAACTCAGTTTAAAAAGAAGAAGGAAGAAGAAAACGAGAGACGAAGATTATCACCCAAAGACTTTGAGCCTACGCTTCCTGATTACGACCAAGTGATTACTCGCTTTAGAGTGCTGGAGAAGGAAGAAGAAGAAAGACGAAGATTAGCAACAAAACATGTTCATCCTAAGCTTCCTGATTACGACCAGATTGCTACTAAGTTTAAACTCCTGAAGGAGGTAGAAAAAGAAAGACGAAGATTATTAACCAAACACAGTTCATCCTAAgcttcc
TGGTAATTTTTGCATCTTCAAAATGTTCTAAAATTTTGGCAAATGGTTTTGTTAAGTTCGAATTTTTGGTTATGATACAGTTTGAACGTTTTTCTTCATAGATTACAGTTTTAGCAAATGTGAATCATTAAAAGTGGAATAGTTGGTTTGAAAACAATTGTCAATTTCATTTTTTTTTTGGTTTTATGGTTAGGCGAGGAAAGCATTAAGAGCTTTGAAAGGTATAGTGAAGCTACAAGCATTAGTGAGAGGATACTTAGTAAGGAAACGCGCGGCCGCAATGTTGCAGAGCATACAAACTTTGATCAGAGTCCAAACCGCTATGCGATCAAAACGCATCAATCGCAGCCTCAACAAAGAGTACAACAACATGTTTCAACCTCGACAATCCTTTGTAAAGAACTATTCTCATTTCCATTGGCTCTCTTTTTTTCTTTAAGCCAAAACAAGACTTAAAGTGTGTCCTCTGTTTGTAGGATAAGTTTGATGAAGCAACGTTCGATGACAGAAGAACAAAGATTGTAGAGAAGGACGATAGATACATGAGAAGATCAAGTTCAAGATCAAGATCTAGACAAGTGCACAATGTTGTTTCAATGTCTGACTATGAAGGCGATTTTGTTTACAAAGGGAATGATTTGGAGTTGTGTTTCTCGGATGAGAAGTGGAAGTTTGCTACCGCGCAGAACACGCCGAGATTATTGCATCACCATTCTGCTAATAATCGCTATTATGTAATGCAGTCTCCAGCTAAGAGTGTTGGTGGAAAGGCTTTGTGTGACTATGAAAGCAGTGTGAGTACTCCTGGCTACATGGAGAAAACTAAGTCCTTTAAGGCAAAAGTGCGTTCACACAGCGCACCGCGCCAGCGATCTGAGAGGCAGAGGTTGTCGCTAGATGAAGTTATGGCCTCTAAGAGTAGCGTTAGCGGTGTGAGTATGTCGCATCAGCATCCACCACGCCATTCTTGTTCCTGTGATCCGCTTTAActtaac
GAGTTAGTAAACAAAGTGTTCACATTTTAGTAAACATTGTTGTTCGTTAATCACGTAACGTTTTGTTTTTCCAGTTTACACTGAGCTCTGATGAGTATATAACGGAGGTGAATGGTTACTACAAAACTACGTTTTCGGGAGAAGTCATAACGTCGTTGACGTTCAAGACGAACAAAAGGACATATGGGACTTACGGAAATAAAACCAGTAGCTACTTTTCTGTTGCCGCACCCAAAGATAACCAGATTGTCGGTTTTCTTGGAAGTAGCAGCCATGCTCTCAACTCCATCGACGCTCATTTTGCCCCTGCTCCTCCTCCTGGTAGCACCGGAGCTAAGCCCGGTGCTAGTGGCATCGGAAGTGATTCTGGTAGCATTGGTAGTGCCGGAACTAACCCTGGTGCTGATGGCACCAGAGAAACCGAAAAAAACGCTGGTGGCTCAAAACCTAGTAGTGGTAGTGCCGGAACTAACCCTGGTGCTAGTGCTGTTGGCAACGGAGAAACCGAAAAAAATGCTGGTGGCTCAAAACCTAGCAGTGGTAGTGCTGGAACTAACCCTGGTGCTAGTGCTGGTGGCAACGGAGAAACCGAAAAAAACGTTGGTGGCTCAAAACCTAGCAGTGGTAAAGCCGGAACTAACCCTGGTGCTAATGCTGGTGGCAACGGAGGAACCGAAAAAAACGCTGGTGGCTCAAAATCTAGCAGTGGTAGTGCTCGAACTAACCCTGGTGCTAGTGCTGGTGGCAACGGAGAAACTGTTTCCAACATTGGAGATACGGAAAGTAACGCTGGTGGCTCGAAAAGTAATGATGGTGCTAACAATGGTGCTAGTGGCATTGAAAGTAATGCTGGTAGCACTGGAACTAACTTTGGTGCTGGTGGCACCGGGGGAATTGGAGATACGGAAAGTGATGCTGGTGGCTCCAAAACTAACTCTGGAAACGGCGGAACTAACGATGGTGCTAGTGGTATTGGAAGTAATGATGGTAGCACTGGAACTAACCCTGGTGCTGGTGGAGGAACAGATTCAAACATCGAAGGTACTGAAAATAACGTTGGTGGCAAGGAAACTAACCCTGGTGCTAGTGGCATTGGAAATAGTGATGGTAGCACTGGAACTAGCCCCGAAGGTACCGAAAGTAACGCTGACGGCACAAAAACTAACACGGGAGGCAAAGAATCTAACACCGGAAGTGAATCCAACACCAATTCTAGTCCACAAAAGTTGGAAGCACAAGGAGGCAATGGAGGAAATCAATGGGACGACGGAACCGATCATGATGGTGTGATGAAGATACATGTTGCAGTTGGTGGTCTAGGAATTGAGCAAATTAGATTTGATTATGTCAAGAACGGACAGTTGAAGGAAGGACCCTTCCACGGTGTCAAAGGAAGAGGTGGCACTTCAACGGTGCGTAAATTTTTATTATTATGGCTCAATTACGTTTTTCGAATAAGTGTTAATTCAAGATTATTGATCTTCATGATTCTGCAGATTGAGATTAGCCATCCGGACGAGTATCTTGTTTCCGTCGAGGGGTTGTACGACTCTTCCAATATCATTCAAGGAATCCAGTTTCAATCCAACAAACACACTTCTCAGTACTTTGGATATGAATATTATGGAGATGGTACACAATTTTCACTTCAAGTTAATGAAAAGAAGATCATTGGTTTCCATGGTTTTGCCGACTCACACCTTAATTCTCTTGGAGCTTATTTCGTTCCAATCTCATCCTCTTCTTCCTCCTTGACTCCTCCTCCCAACAAAGTTAAAGCTCAAGGAGGAAGTTATGGAGAAACATTTGACGATGGTGCTTTCGATCATGTAAGAAAGGTTTATGTTGGTCAAGGTGATTCTGGTGTAGCTTATGTCAAGTTCGATTATGAAAAAGACGGTAAAAAGGAGACACAAGAACATGGAAAAATGACATTGTCAGGAACAGAGGAGTTTGAGGTTGATTCAGACGATTACATAACATCAATGGAGGTTTATGTCGACAAAGTCTACGGTTATAAAAGCGAAATCGTCATTGCTCTTACCTTCAAGACCTTTAAGGGTGAAACTTCTCCACGTTTTGGAATAGAGACTGAGAATAAATATGAAGTTAAAGACGGTAAAGGAGGAAAACTTGCTGGTTTCCATGGAAAAGCTAGCGATGTTCTTTATGCTATTGGTGCTTATTTCATTCCAGCAGCAAATTAGagagtt
ACGTATGTCTTAGTTACTACTATCATACTATATTACTATGTATTGGAAAACTTTTGGTTAGAACCTGTTGGGAGGAAAGGGTTTATGTTCTGGTTCATTTTACGTGTACTAAGTACTTATAATTAAGATTAAAAGAAACATTTACAGCTTCACCCTCTGGTCGATGTATGTGGGCTGTGGGCATGTGGCCAATCTCTGAAGCGTTAGGTAGAGCAAATATAGAGTTGAGAGTTGCTTAAGTTAGTGAACGTGAATGACTAAAAAGATATGTTGCATTTAAATCGTATTGGGCCTCATCCCATCTAAAATATAGTAGGTGTAGGCCTTTTAGGTTAATTTGAATAAAATCAACCTTTTTGTAAGCAACATCGACGATTGTCACATTTTTCTCATACACATAGGTGTAATCTAGCTTTGAATGTTTTCTCATACACATAGGTGTAATCACCGTAATTATCATTTGTGAAGATATATGTTTTACCAAGTGGTTTGTATTGTCCATATATACTTTACCACTTTCATATTAACATATAATGTTTTTGTAAGTATTATACCATAAAGGATTGGTTTCTTAATATTATTAACAAAACGCAAAAATTCTTTTAAACGCAGGCGATTCCAATCCACAGCGTTGCGGTTAGAGTAGGATCAACACAAAGAGTAGTGATGGAGATCATAATCACATTCGCATTGGTCTACACTGTTTACGCCACAGCCATTGACTCCAACAATGGCACTCTCGGAACCATCGCTCCACTTGCTATCAGACTCATCGTTGGTGCTAACATTCTTGCAGCCGGCCCATTCTCTGGTGGTCCAATGAACCCTGGACGTTCTTTTGGATCATCTCTTGCCGTTGGAAATTTTTCAGGACATTAGgtttat
and here is the script I am running:
#!/usr/bin/perl
use strict;
use warnings;
# A program to find the first inframe stop codon of non-spliced intron containing genes
print "ENTER THE FILENAME FOR DNA SEQUENCES:= ";
# Asks for Sequence file and if file does not exist prints error message
my $filename = <STDIN>;
#my $sequence;
my #sequence;
chomp $filename;
unless (open(DNAFILE, $filename) ) {
print "Cannot open file \"$filename\"\n\n";
}
#sequence = <DNAFILE>;
close DNAFILE;
open (FILE, ">AtPTCindex.txt");
my $j;
my $i;
my $codon;
my $stopseq;
my $counter;
#Change $j<(375) to n=number of sequences
for ($j = 0; $j < #sequence; $j ++) {
$counter = 0;
for ($i = 0; $i < (length($sequence[$j]) - 2) && $counter < 1; $i += 3) {
$codon = substr($sequence[$j], $i, 3);
if ($codon =~ m/TAG|TGA|TAA/g) {
# m added before /TAG... above
$stopseq = substr($sequence[$j], $i, 9);
my $result = index($sequence[$j], $stopseq);
$counter = 1;
#my $results = index($sequence[$j], $stopseq);
print FILE "$result \n";
#print FILE "$results $j \n";
}
}
if ($counter == 0) {
print FILE "\n"
}
}
close FILE;
exit;
Thanks so much.
As threatened, the following is a cleaned up version of your script:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
die "Usage: $0 Filename\n" if #ARGV != 1;
my $file = shift;
open my $infh, '<', $file;
open my $outfh, '>', 'AtPTCindex.txt';
while (my $line = <$infh>) {
chomp($line);
my $result = '';
for (my $i = 0; $i < (length($line) - 2); $i += 3) {
my $codon = substr($line, $i, 3);
if ($codon =~ m/TAG|TGA|TAA/) {
# m added before /TAG... above
my $stopseq = substr($line, $i, 9);
$result = index($line, $stopseq);
$result .= " ($i, $codon, $stopseq)";
last;
}
}
print "$result\n";
# print $outfh "$result\n";
# print $outfh "$result $.\n";
}
close $infh;
close $outfh;
For the 5 lines of data that you provided, the following is the output:
84 (84, TGA, TGATCATTC)
3 (3, TGA, TGATTGTTT)
3 (3, TAA, TAATTTTTG)
4 (27, TAG, TAGTAAACA)
123 (123, TAA, TAAGATTAA)
I believe your issue is with these lines:
my $stopseq = substr($line, $i, 9);
$result = index($line, $stopseq);
You're pulling a sequence from the $line at position $i, and then immediately doing an index for it. In the case of 4 of 5 of those lines, it immediately finds the same value $i. However, in the case of line 4, it finds a matching sequence earlier in the line.
If this isn't desired, you'll have to explain what your desired behavior actually is. Perhaps, you just want $i? Or are you looking for a matching stop sequence any point AFTER $i? You'll have to specify what your actual logic wants to be.
I took a different approach, unpacking it into groups of three instead of counting by indexes of three. I believe this script does what you want, and it looks a lot cleaner. It can also optionally take the filename as argument.
#!/usr/bin/perl
use strict;
use warnings;
my $filename = 'a'; # dummy value
my $resultfile = 'AtPTCindex.txt';
# User may have passed filename as arguement
if (#ARGV) { if (-e $ARGV[0]) { $filename = $ARGV[0] } }
unless (-e $filename)
{
print "ENTER THE FILENAME FOR DNA SEQUENCES: ";
chomp($filename = <STDIN>)
}
open DNA,"<$filename" or die "Couldn't open $filename for reading: $!\n";
my #sequence = <DNA> or die "Couldn't read $filename: $!\n";;
close DNA;
# Uncomment the below line if you're braver than me
if (-e $resultfile) { die "Cowardly refusing to write to existing file" }
if (-e $resultfile) { unlink $resultfile };
open RESULT,">>$resultfile" or die "Courdn't open$!\n";
foreach my $string (#sequence)
{
# split into groups of 3
my #groups = unpack "(A3)*", $string;
# Search for the group you want
for (my $groupnum = 0; $groupnum < #groups - 1; $groupnum++)
{
if ($groups[$groupnum] =~ m/(TAG|TGA|TAA)/g)
{
print RESULT (($groupnum + 0) * 3) . "\n";
print "$1 (" . $1 . ( $groups[$groupnum + 1]) . ($groups[$groupnum + 2]) . ") at index " . (($groupnum + 0) * 3) . "\n";
last;
}
}
}
close RESULT;
Running the script on your sample data, it outputs:
TGA (TGATCATTC) at index 84
TGA (TGATTGTTT) at index 3
TAA (TAATTTTTG) at index 3
TAG (TAGTAAACA) at index 27
TAA (TAAGATTAA) at index 123
...as well as writes the raw index numbers to the file specified.

Perl - Splitting a string

I'm doing an Array that contents each word of a phrase. When I try to split it and print the length then the console gives me an enormous number such as 111039391231319239188238139123919232913123... (more lines)
why?
Here's my code:
$mynames = $texto3;
print $mynames. "\n";
#nameList = split(' ', $texto3);
#print #nameList.length();
for ($to = 0; $to<#nameList.length; $to++){
if($to<#nameList.length) {
#nameList[$to] = #nameList[$to] . "_" . #nameList[$to++];
}
print $to;
#print #nameList[$to] . "\n";
}
$string_level2 = join(' ', #nameList);
#print $string_level2;
To get the length of an array use scalar #nameList instead of #nameList.length.
A typical for-loop uses the less-than operator when counting up, e.g.:
for ( $to = 0; $to < scalar(#nameList); $to++ ) ...
You should never use a post-increment unless you understand the side effects. I believe the following line:
#nameList[$to] = #nameList[$to] . "_" . #nameList[$to++];
... should be written as ...
$nameList[$to] = $nameList[$to] . "_" . $nameList[$to + 1];
Finally the comparison you use should account for the boundary condition (because you refer to $to + 1 inside the loop):
if( $to < (scalar(#nameList) - 1) ) {
$nameList[ $to ] = $nameList[ $to ] . "_" . $nameList[ $to + 1 ];
}

What pitfalls does this Perl code have?

I have written some code to print formatted arrays (first line = no of inputs, second line = max width of numbers). The star can be any sort of marker to differentiate some elements from the rest.
$ cat inp.txt
6
2
1 *
2
3
4
9
12 *
$ cat inp.txt | ./formatmyarray.pl
____ ____ ____ ____ ____ ____
| * | | | | | * |
| 1 | 2 | 3 | 4 | 9 | 12 |
|____|____|____|____|____|____|
$
fomatmyarray.pl
#!/usr/bin/perl
use warnings;
use strict;
my $spc = q{ };
my $und = q{_};
my $sep = q{|};
my $end = "\n";
my #inp = <STDIN>;
my $len = $inp[0];
my $wid = $inp[1];
chomp $wid;
sub printwall {
my($left, $middle, $l, $w) = #_;
for(0..($l - 1)) {
if ($_ == 0) { print $left; }
for(1..($w + 2)) { print $middle; }
print $left;
}
print $end;
return;
}
sub printchar {
my($left, $middle, $l, $w) = #_;
for(0..($l - 1)) {
if ($_ == 0) { print $left; }
my #temp = split ' ', $inp[$_ + 2];
my $star = 0;
if (($#temp) >= 1) { $star = 1; }
my $mid = sprintf "%d", (($w + 2) /2);
for(1..($w + 2)) {
if (($_ == $mid) && ($star == 1)) { print "*"; }
else { print $middle; }
}
print $left;
}
print $end;
return;
}
sub printnum {
my($left, $middle, $l, $w) = #_;
for(0..($l - 1)) {
if ($_ == 0) { print $left; }
my #temp = split ' ', $inp[$_ + 2];
my $format = join '', q{%}, $w, q{d};
my $num = sprintf($format, $temp[0]);
print join '', $middle, $num, $middle, $left;
}
print $end;
return;
}
printwall($spc, $und, $len, $wid);
printchar($sep, $spc, $len, $wid);
printnum ($sep, $spc, $len, $wid);
printwall($sep, $und, $len, $wid);
I already checked it with Perl::Critic but that will only tell me the syntactical problems (which I have already corrected). Are there any problems that you see with this code. Something an experienced Perl programmer would do differently?
Any comments or suggestions are welcome.
Several suggestions in here. Hope this is helpful.
use strict;
use warnings;
use Getopt::Long qw(GetOptions);
my $SPC = q{ };
my $UND = q{_};
my $SEP = q{|};
my $END = "\n";
main();
sub main {
# Try to keep run options and core input data separate from each other.
GetOptions('max=i' => \my $max_n);
# Parse input at the outset so that subsequent methods
# don't have to worry about such low-level details.
my $inp = parse_input();
# Prune the input array at the outset.
# This helps to keep subsequent methods simpler.
splice #$inp, $max_n if $max_n;
# Don't require the user to compute max width.
my $wid = determine_width($inp);
# The format string can be defined at the outset.
my $fmt = join '', $SEP, $SPC, '%', $wid, 's', $SPC;
# You can print both data and stars using one method.
print_border($inp, $wid, $SPC);
print_content($inp, $fmt, $_) for qw(star data);
print_border($inp, $wid, $SEP);
}
sub parse_input {
my #parsed;
# Using <> provides more flexibility than <STDIN>.
while (<>){
chomp;
my ($value, $star) = split;
$star = $SPC unless defined $star;
push #parsed, { data => $value, star => $star }
}
return \#parsed;
}
sub determine_width {
my $inp = shift;
my $wid = 0;
for (#$inp){
my $len = length $_->{data};
$wid = $len if $len > $wid;
}
return $wid;
}
# Because we did work at the outset to create a data structure
# that represents our goals conveniently, generating output
# is much simpler.
sub print_border {
my ($inp, $wid, $wall_sep) = #_;
print $wall_sep, $UND x ($wid + 2) for #$inp;
print $wall_sep, $END;
}
sub print_content {
my ($inp, $fmt, $mode) = #_;
printf $fmt, $_->{$mode} for #$inp;
print $SEP, $END;
}
There's a lot of room for improvement here (I'll update this answer as and when I have time).
Let's start off with the inputs. You should not have to specify the number of entries or their maximum length as Perl can infer that for you:
my $entries = my #entries = <STDIN>;
Don't forget about CPAN.
For instance, consider Text::ASCIITable.
The return statements would not appear in most people's code - a sub returns when it reaches the end (but see discussion in comments).
In printwall, I'd unconditionally print the first left wall outside the loop; ditto the other functions.
I'm not convinced I'd read all the data into #inp as shown. More likely, I'd use:
my $num = <STDIN>; # Or, more likely, just <>
my $wid = <STDIN>;
my #inp = <STDIN>;
This would clean up the $inp[$_ + 2] in the functions.
I'd probably pass the array to the functions, rather than using global variables - globals are grubby in Perl as everywhere else.
The count of the number of values is not needed in the input. With the array containing just the data to be printed, you can iterate over each member of the array in the functions with a suitable foreach, improving its Perlishness.
In printnum, you can build the format string once (not each iteration).
This:
my $mid = sprintf "%d", (($w + 2) /2);
is a funny way of writing:
my $mid = int(($w + 2) / 2);
I'd probably use a regex to find the star; it isn't clear whether you should print a '*' if any character is found, or if you should print the character that is found.
I'd probably be using a single format to deal with the stars:
my $fmt = sprintf "%*s%%c%*s%c", $wid, $middle, $wid, $middle, $left;
I might need to tune one of the $wid values to allow for even widths, but the output would be:
" %c |"
You can then simply print each cell with a blank or a '*' for the value using the format.
Similarly, in printnum, I'd be generating a simple format string like " %2d |" to print each number - and I'd generate that format once.
Etc.

Resources