Array elements are not printed in Perl - arrays

I have two sets of files. One file gives a list of gene names (one gene per line). The second file has a list of gene pairs (e.g., '1,2', with one gene pair per line). The gene names are numerical. I want to list all possible gene combinations except the known gene pairs.
My output should be:
3,4
4,5
6,7
...
...
But, I get something like this =>
,4
,5
,7
All the first elements do not print. I'm not sure exactly what's wrong with the code. Can anyone help?
My code:
#!/usr/bin/perl
use strict;
use warnings;
if (@ARGV != 2) {
    die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions>\n";
}
my ($e_file, $k_file) = @ARGV;
open (IN, $e_file) or die "Error!! Cannot open $e_file\n";
open (IN2, $k_file) or die "Error!! Cannot open $k_file\n";
my @e_file = <IN>; chomp (@e_file);
my @k_file = <IN2>; chomp (@k_file);
my (%known_interactions, %random_interactions);
foreach my $line (@k_file) {
    my @array = split (/,/, $line);
    $known_interactions{$array[0]} = $array[1];
}
for (my $i = 0; $i <= $#e_file; $i++) {
    for (my $j = $i+1 ; $j <= $#e_file; $j++) {
        if ((exists $known_interactions{$e_file[$i]}) && ($known_interactions{$e_file[$i]} == $e_file[$j])) {next;}
        if ((exists $known_interactions{$e_file[$j]}) && ($known_interactions{$e_file[$j]} == $e_file[$i])) {next;}
        print "$e_file[$i],$e_file[$j]\n";
    }
}

Your file uses CR LF for line endings, but you're on a system that uses LF for line endings, so your program outputs
"3" <CR> "," "4" <CR> <LF>
which your terminal shows as
,4
Either fix the line endings using
dos2unix inputfile
Or change
chomp(@e_file);
chomp(@k_file);
to
s/\s+\z// for @e_file;
s/\s+\z// for @k_file;
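To see why chomp alone is not enough, here is a minimal sketch that compares the two approaches on a single line ending in CR LF. The helper name strip_eol is made up for illustration and is not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: remove any trailing whitespace, including CR and LF.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    return $line;
}

my $raw = "3\r\n";      # a line as read from a file with CR LF line endings

my $chomped = $raw;
chomp $chomped;         # removes only the trailing LF; the CR stays

printf "after chomp:     %d characters\n", length $chomped;
printf "after strip_eol: %d characters\n", length strip_eol($raw);
```

The leftover CR is what moves the cursor back to the start of the line, so the comma and second gene overwrite the first gene on the terminal.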

Related

Output the line numbers where a string appears

I am trying to determine how many times the string Apples appears in a text file and on which lines it appears.
The script outputs incorrect line numbers: it prints consecutive numbers (1, 2, ...) instead of the lines on which the word actually appears.
file.txt
Apples
Grapes
Oranges
Apples
Goal Output
Apples appear 2 times in this file
Apples appear on these lines: 1, 4,
Instead my output as illustrated from the code below is:
Apples appear 2 times in this file
Apples appear on these lines: 1, 2,
Perl
my $filename = "<file.txt";
open( TEXT, $filename );
$initialLine = 10; ## holds the number of the line
$line = 0;
$counter = 0;
# holder for line numbers
@lineAry = ();
while ( $line = <TEXT> ) {
    chomp( $line );
    if ( $line =~ /Apples/ ) {
        while ( $line =~ /Apples/ig ) {
            $counter++;
        }
        push( @lineAry, $counter );
        $initialLine++;
    }
}
close( TEXT );
# print "\n\n'Apples' occurs $counter times in file.\n";
print "Apples appear $counter times in this file\n";
print "Apples appear on these lines: ";
foreach $a ( @lineAry ) {
    print "$a, ";
}
print "\n\n";
exit;
There are a number of problems with your code, but the reason for the line numbers being printed wrongly is that you are incrementing your variable $counter once each time Apples appears on a line and saving it to @lineAry. That is different from the number of the line where the string appears, and the easiest fix is to use the built-in variable $., which represents the number of times a read has been performed on the file handle.
In addition, I would encourage you to use lexical file handles and the three-parameter form of open, and to check that every call to open has succeeded.
You never use the value of $initialLine, and I don't understand why you have initialised it to 10.
I would write it like this:
use strict;
use warnings 'all';
my $filename = 'file.txt';
open my $fh, '<', $filename or die qq{Unable to open "$filename" for input: $!};
my @lines;
my $n;
while ( <$fh> ) {
    push @lines, $. if /apples/i;
    ++$n while /apples/ig;
}
print "Apples appear $n times in this file\n";
print "Apples appear on these lines: ", join( ', ', @lines ), "\n\n";
output
Apples appear 2 times in this file
Apples appear on these lines: 1, 4
Change
push(@lineAry, $counter);
to
push(@lineAry, $.);
$. is a variable that stores the line number when using perl's while (<>).
The alternative, if you want to use your $counter variable, is that you move the increment to increment on every line, not on every match.
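As a rough sketch of that alternative, the following keeps a separate line counter that is incremented for every line, alongside the match total. The helper name find_apples and the in-memory list of lines are assumptions made for illustration; a real script would read from the file handle instead.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines and returns the total number of
# matches plus a reference to the 1-based line numbers that contained a match.
sub find_apples {
    my @lines = @_;
    my ($total, $line_no, @where) = (0, 0);
    for my $line (@lines) {
        $line_no++;                             # incremented for every line
        my $hits = () = $line =~ /Apples/gi;    # number of matches on this line
        next unless $hits;
        $total += $hits;
        push @where, $line_no;
    }
    return ($total, \@where);
}

my ($total, $where) = find_apples("Apples\n", "Grapes\n", "Oranges\n", "Apples\n");
print "Apples appear $total times in this file\n";
print "Apples appear on these lines: ", join(', ', @$where), "\n";
```

Keeping the two counters separate is the whole point: one tracks position in the file, the other tracks matches.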

Printing in unexpected order

I expected the following to print in the order of the elements of @Data, but it's printing in the order of the elements of @Queries. Am I missing something? I also tried declaring the items to be printed after foreach(@Data){... and then printing inside that loop, but still got the wrong order.
$datafile is a file with the following:
GR29929,JAMES^BOB
GR21122,HANK^REN
$queryfile is a file with the following:
(3123123212):# FD [GR21122]
line 2
line 3
line 4
(12): # FD [HANK^REN]
line 6
line 7
line 8
(13): # FD [Y]
-------------------------------
--------------------------------
(3123123212):# FD [GR29929]
line 2
line 3
line 4
(12): # FD [JAMES^BOB]
line 6
line 7
line 8
(13): # FD [Z]
The output file is:
GR21122,HANK^WREN,Y
GR29929,JAMES^BOB,Z
When I want:
GR29929,JAMES^BOB,Z
GR21122,HANK^WREN,Y
Code is:
open(DA, "<$datafile");
open(QR, "<$queryfile");
my @Data = <DA>;
my @Queries = <QR>;
foreach (@Data) {
    my ( $acce, $namee ) = split( ',', $_ );
    chomp $acce;
    chomp $namee;
    print "'$acce' and '$namee'\n";
    for my $i ( 0 .. $#Queries ) {
        my $Qacce = $Queries[$i];
        my $Qname = $Queries[ $i + 4 ];
        my $Gen = $Queries[ $i + 8 ];
        if ( $Qacce =~ m/$acce/ and $Qname =~ m/$namee/ ) {
            my ($acc) = $Qacce =~ /\[(.+?)\]/;
            my ($gen) = $Gen =~ /\[(.+?)\]/;
            $gen =~ s/\s+$//;
            my ($name) = $Qname =~ /\[(.+?)\]/;
            print GL "$i,$acc,$gen,$name\n";
        }
    }
}
The basic shell of your program prints what you ask for, but there is a lot missing. The refactoring below should do what you want.
You had a problem with the values of your $i index variable, so that the first time around the loop you were accessing @Queries elements [0, 4, 8], the second time [1, 5, 9] etc. It looks like the second loop execution should use elements [11, 15, 19] and so on. Please correct me if I'm wrong.
In addition you were using regular expressions to compare the keys in the two files, and you were finding nothing because the name values contain caret ^ characters which are special within regexes. Escaping the strings using \Q...\E fixed this.
Note that a better solution would use hashes to match keys across the two files, but without details on your file format - particularly queryfile - I have had to follow your own algorithm.
use strict;
use warnings;
use autodie;
my ($data_file, $query_file) = qw/ datafile.txt queryfile.txt /;
my @queries = do {
    open my $query_fh, '<', $query_file;
    <$query_fh>;
};
chomp @queries;
open my $data_fh, '<', $data_file;
while (<$data_fh>) {
    chomp;
    my ($acce, $namee) = split /,/;
    for (my $i = 0; $i < @queries; $i += 11) {
        my ($qacce, $qname, $qgen) = @queries[$i, $i+4, $i+8];
        if ( $qacce =~ /\Q$acce\E/ and $qname =~ /\Q$namee\E/ ) {
            my ($acc, $name, $gen) = map / \[ ( [^\[\]]+ ) \] /x, ($qacce, $qname, $qgen);
            $gen =~ s/\s+\z//;
            print "$acc,$name,$gen\n";
        }
    }
}
output
GR29929,JAMES^BOB,Z
GR21122,HANK^REN,Y
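The effect of the caret can be checked in isolation. This is a small sketch, with a made-up helper name contains_literal, showing the same name matched with and without \Q...\E against one of the query lines from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $needle occurs literally inside $haystack.
sub contains_literal {
    my ($haystack, $needle) = @_;
    return $haystack =~ /\Q$needle\E/ ? 1 : 0;
}

my $line = '(12): # FD [HANK^REN]';
my $name = 'HANK^REN';

# Unescaped, ^ is a start-of-string anchor, so it cannot match after "HANK".
print "plain:   ", ( $line =~ /$name/ ? "match" : "no match" ), "\n";

# With \Q...\E every character of $name is treated literally.
print "escaped: ", ( contains_literal($line, $name) ? "match" : "no match" ), "\n";
```

When only a literal substring test is needed, the built-in index function would avoid regular expressions altogether.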

Perl Script that should move through array with i=3 prints indicies that aren't x3

I have these arrays of sequences and I wrote this script to walk through each sequence three letters at a time (e.g. {0,1,2}, {3,4,5}, {6,7,8}) and print the index where it first encounters a certain three-letter combination (TAA, TAG, TGA). (For example, if the sequence were CGTAGCCCCTAACCCC, the script would skip over the TAG at position 2 because it's not in the correct frame of 3 and report the TAA at position 9.) Therefore, I am only expecting indices in multiples of 3 in my results.
On most strings there is no problem, however every once in a while it will index at 4 or other non multiples of three. I was wondering if anyone more advanced than I can figure out why this may happen. I know this script is ugly and I am sorry for that, I am a biologist and I mod it for whatever I am mining out of sequences at the time. I just can't figure out the bug.
Here are some sequences from my file. The 3rd line is the sequence that gives the strange result. Just for an example of what I am dealing with.
AGGTACGCGAGTCACCTTTCGTCTTCAATCTCGTTTGATCGAAGCTATTTGTCAAAAAGAGAGGATTTTTTTGCATCTCAATTATGATCATTCCTTAGGGTTTTCAGGGTTTTGGATTGTTGTTTTTGTTAACATTTATCTGATTCGTTTGTATTTGTGTGGCAGTCTAAAGTGGCATCAACAATGGCGTCTTTTATTATACATAAGCCAAAGGAGAGATCGCCTTTCACGAAAGCTGCTTTCAAAACGGTACCTTTAGTGATTCAGCATTTTTATCTGAAATATGTTTGTTGCATTATTGAATGATTCTGATGTGGTGTTGCTACCAACTTGTCTATGTTGGTTGATTTAGCTTGATAGCATCAAGGAGTTGGAACTGTTTATGTTGAAGCATCGAAAGGATTATGTTGATCTGCACCGGACTACAGAACAGGAAAAGGATAGTATTGAACAAGAAGTAAGTACTCTGAGCTAGGCTTGCCCGTAGTATATATCTGAACTCATGAAGTTACTGCGATAAATCTATGCTTGAGTTGAGATTGAACATATGGAACTATGGAATCATAAGAAATGTAGCAACTCATATTGAGATAACTCAGGAAGATTAATGTCTATTACTTTAGATAGCGAGGGAGTTAGTATATTGTGACACTGAGGAACTTGGATCTTGTATTCTTATACCTCTTGCAGTGTTTGATCGAGAACTATGTCTACTTATGTGTTGTGTAATATCATCAAACTCTCTCTCTCTCCCTCTTGCAGGTTGCTGCTTTTATTAAAGCTTGCAAAGAACAGATCGATATTCTCATAAACAGTATTAGAAATGAAGAAGCAAACTCCAAAGGATGGCTTGGCCTCCCCGCAGATAACTTCAATGCTGATTCTATAGCACACAAACATGGAGTGGTATGATATGCACCAATGTAGTAAGCCAACTTTGGTTTTTTTTTACTATGTTTTCTTTCAAAGTATCTAGATGTGTAGAAGTAATGGTAATTTTTTTTGTATGCAGGTTTTGATTCTGAGTGAGAAACTTCATTCAGTCACTGCCCAGTTTGATCAGCTTAGAGCTACTCGTTTCCAAGATATTATAAACAGAGCTATGCCGAGAAGAAAACCTAAGAGGGTCATAAAGGAAGCTACCCCAATTAATACAACTCTGGGAAATTCGGAGTCCATAGAACCGGATGAAATCCAGGCCCAACCTCGTAGATTACAACAACAACAACTTCTAGACGATGAAACACAAGCCCTTCAGGTAACAAGGCAAATATACATGATCTTCGAAAACTTGCATAAGTTTTGTAGTTATGCTAAATTTTGAAATTGATAATTTTTGCAGGTAGAGCTAAGTAATCTTTTAGATGGTGCTAGGCAGACAGAAACTAAGATGGTGGAGATGTCTGCATTAAACCACTTGATGGCAACTCATGTTCTGCAGCAAGCCCAACAGATAGAGTTTCTTTATGACCAGGTTAGGACTTATTAACTTCTCTAACGCTCTCATGTCAACACACTGTTTTGTTAGGCTTTCACTGTTCTTTACACTCCTTTGCTATCTCAAAGTTAAATTCGGATGCTTATTGTATTCAGAACTTTTCCTTGTCACATTCACCTAAATTAGGTATAGAGACGGGAAAGAAACTTTGTATTGGTCCAATTTTAATTGCTCTCCAATTTAGTGGTAGGAAATGGAACGGTTAATGTTTTTAGCTATGTAAAGTCTCTAAAACTCCATTTGAATGTGTCAATGACTCAATGCCATTCCCAATACTTTAGTTTATGGGGCTTTGCAGTTTTCCTACTCTGTAAACGTACAGCTTATGACTGACTTGGTGGCTCTCTTTATGTGTGTGTGTGTGTGTCTTGAGGCCCTTTTTCTCACTCAGTTTGACACTAAATGCAGGCAGTTGAGGCAACAAAGAACGTGGAGCTTGGAAACAAAGAGCTTTCTCAAGCAATCCAACGAAACAGCAGCAGCAGAACCTTTCTCTTACTGT
TTTTCTTCGTCCTTACTTTCTCCGTCTTGTTCTTGGATTGGTACAGTTAAaaaacc
AGGTGATTGTTTTGTTATTATAAATCAAGATCAGTACATATATATTTTTGTTTTTCTTGGTTTCATATGTAATATTTTGGACTTTTGGTGTTTAGGTTTTTGACTTGGAAGAAAAGAACGTAATGGATGAGTCACTACACGAGGTGTATAAATTTTGCCTCACCGATGTTGATGAGAGAAGCAAGAAAGAGACATCAATGAAAGATGATTACATAGAACATAAGAAGTCTACTAGATTGTTGGCTGAAAATGCGAAGAAGTCCGGTCACAGTTTAGAAATATTAAGGCCGGAATCTAAACCTGAGACTGAAAAAGAGGTGATTTTATTTTCTTGTTATATAAAGATTCGTAGACATATATTTGGTTTTTCTTTGGTTTCATAATATTTTGGACTTATGTGTGTTTAGGTCAATGAAGAGGAAGAGAAGAGAGTAATGGATCCGGATGTGGATATTAGTTGTTATGAAGAGTCACCACACGAGGTGTATAAATTTAGCCTCACCGATTTCGAAGAAGAGATAATGGAAGATGATTACAGAGAAGATATGAAGTGTAGAATGTTGGATGATATAGTGAAGAATTCCGGTCACCGTGTAGAAATATCAAGGCCGGAATATTATAAACCTGAGATTGAAAAACAGGTTTTATTTTTTTGGTTATTTTGTGATTAAGATCAGTTTTTTTTTTTTTTTTTTTTGGTTTAATAATATTTGATCTTGTGTGTGTTTAGGTATATGAAAAGGAAGAGAAGAAAGTAATGGATCCGGATATCTATATTAGATCTTATGAAGAGTCACCAAACGAGGTGTATAAATTTAGCCTCACTGATTTGGAAGAAGAGATAATGGAAAATGACTCCATAGAAGGTGTGAAGTGTAGAATGTTGGATGAAATAATGAAGAAGTCCGGTCACCATTTAAAAATATCAAGGCCGGAATATAAACCTGAGATTGAAAAACAGGTTAGTTTTTAATAAAAAGATCACTAGATATTTTTTTTTATTTTTTTTTGTTTTTGGTTTCATAATATTTGACTTGTGGCATGTGTTTAGGTATATGAAGAGGAAGAGAAGAAAGTAATGGATCCAGATGTGGATATTAGATGTTATGAAGAGTCACCACACGAGGTGTCTAAATTTAGCCTCACCGATTTCGAAGAAGAGATAATGGAAGATGATTACATAGAAGCTTTGAAGTGTAGAATGTTGGATGATATATTGAAGAAGTCCGGTCACCGTTTAGAAATATCAAGGCGGCAATATAATAAACCTGAGATTGAAATACAGGTGATTTTTTTTTTTTATTATTGTTGTTATAGTAAGATCAGTAGATATATATCTTGGTTTCATAATATTTTGGACTTGTGTGTGTTTAGGTCAATGAAAAGGAAGAGAAGAAAGTAATCAATACGGATATGGATATTAGATATGATGATGAGTCACCAGAAGAGGTGGAGACATATTCTAGTCTCACGGATGATGAAGAAGAGAGAAGCAAGGAAGATACATCAATGGAAGATGTGAAGTGTAGAATGTTGGATTAAAAAACGACGAAGCTCGGCCACCTTTTAGGAATATCAAGGCCGGAATATAGACCTGAGATTGAAAAACAGGTGATTTTATTTTGTTGTTAATTGTATTAGTAAAGATCAGTAGATATATATTTGTTTTTGTTTTTCGGTTTCATAATATTTTGGACGCTTGTGTTTAGGTCAATGAAGAGAAAGAAAGAAAGTAATGGATATTAGATCTGCTGGTCAGTCACAAACACGAGGTGTACAAATTTAGCCTCACCGATATCAAAGAAGAGAGAAGCAATGAAGATACATCAATGGAAGATTGTTGCATAGAAGAGGCTCAAGTCGGAAAAGATCAAAGAGTCTTCAGATTCAGAGAAAGTAGTGAAGAGAAGAGAAAATCCTCATCATCACCATTATCACCACTAACAGAGTTTAGGGATATGGAGAGTTTGACGTATTACATGAGGCAAAAAGGGATGCAT
CGAAGAAGAAGAAGATCATCAACATCACCACATTGTTGCCATAATGTAGTATACAATGAGTTTAAAGTGACGAAGGAAGAAGAAGAGGAAGAAAGACAAAGATTAACAACCAAACGTGTTCATTCTAAGCTTCATGAATACGAACAATTTTTAACTCAGTTTAAAAAGAAGAAGGAAGAAGAAAACGAGAGACGAAGATTATCACCCAAAGACTTTGAGCCTACGCTTCCTGATTACGACCAAGTGATTACTCGCTTTAGAGTGCTGGAGAAGGAAGAAGAAGAAAGACGAAGATTAGCAACAAAACATGTTCATCCTAAGCTTCCTGATTACGACCAGATTGCTACTAAGTTTAAACTCCTGAAGGAGGTAGAAAAAGAAAGACGAAGATTATTAACCAAACACAGTTCATCCTAAgcttcc
TGGTAATTTTTGCATCTTCAAAATGTTCTAAAATTTTGGCAAATGGTTTTGTTAAGTTCGAATTTTTGGTTATGATACAGTTTGAACGTTTTTCTTCATAGATTACAGTTTTAGCAAATGTGAATCATTAAAAGTGGAATAGTTGGTTTGAAAACAATTGTCAATTTCATTTTTTTTTTGGTTTTATGGTTAGGCGAGGAAAGCATTAAGAGCTTTGAAAGGTATAGTGAAGCTACAAGCATTAGTGAGAGGATACTTAGTAAGGAAACGCGCGGCCGCAATGTTGCAGAGCATACAAACTTTGATCAGAGTCCAAACCGCTATGCGATCAAAACGCATCAATCGCAGCCTCAACAAAGAGTACAACAACATGTTTCAACCTCGACAATCCTTTGTAAAGAACTATTCTCATTTCCATTGGCTCTCTTTTTTTCTTTAAGCCAAAACAAGACTTAAAGTGTGTCCTCTGTTTGTAGGATAAGTTTGATGAAGCAACGTTCGATGACAGAAGAACAAAGATTGTAGAGAAGGACGATAGATACATGAGAAGATCAAGTTCAAGATCAAGATCTAGACAAGTGCACAATGTTGTTTCAATGTCTGACTATGAAGGCGATTTTGTTTACAAAGGGAATGATTTGGAGTTGTGTTTCTCGGATGAGAAGTGGAAGTTTGCTACCGCGCAGAACACGCCGAGATTATTGCATCACCATTCTGCTAATAATCGCTATTATGTAATGCAGTCTCCAGCTAAGAGTGTTGGTGGAAAGGCTTTGTGTGACTATGAAAGCAGTGTGAGTACTCCTGGCTACATGGAGAAAACTAAGTCCTTTAAGGCAAAAGTGCGTTCACACAGCGCACCGCGCCAGCGATCTGAGAGGCAGAGGTTGTCGCTAGATGAAGTTATGGCCTCTAAGAGTAGCGTTAGCGGTGTGAGTATGTCGCATCAGCATCCACCACGCCATTCTTGTTCCTGTGATCCGCTTTAActtaac
GAGTTAGTAAACAAAGTGTTCACATTTTAGTAAACATTGTTGTTCGTTAATCACGTAACGTTTTGTTTTTCCAGTTTACACTGAGCTCTGATGAGTATATAACGGAGGTGAATGGTTACTACAAAACTACGTTTTCGGGAGAAGTCATAACGTCGTTGACGTTCAAGACGAACAAAAGGACATATGGGACTTACGGAAATAAAACCAGTAGCTACTTTTCTGTTGCCGCACCCAAAGATAACCAGATTGTCGGTTTTCTTGGAAGTAGCAGCCATGCTCTCAACTCCATCGACGCTCATTTTGCCCCTGCTCCTCCTCCTGGTAGCACCGGAGCTAAGCCCGGTGCTAGTGGCATCGGAAGTGATTCTGGTAGCATTGGTAGTGCCGGAACTAACCCTGGTGCTGATGGCACCAGAGAAACCGAAAAAAACGCTGGTGGCTCAAAACCTAGTAGTGGTAGTGCCGGAACTAACCCTGGTGCTAGTGCTGTTGGCAACGGAGAAACCGAAAAAAATGCTGGTGGCTCAAAACCTAGCAGTGGTAGTGCTGGAACTAACCCTGGTGCTAGTGCTGGTGGCAACGGAGAAACCGAAAAAAACGTTGGTGGCTCAAAACCTAGCAGTGGTAAAGCCGGAACTAACCCTGGTGCTAATGCTGGTGGCAACGGAGGAACCGAAAAAAACGCTGGTGGCTCAAAATCTAGCAGTGGTAGTGCTCGAACTAACCCTGGTGCTAGTGCTGGTGGCAACGGAGAAACTGTTTCCAACATTGGAGATACGGAAAGTAACGCTGGTGGCTCGAAAAGTAATGATGGTGCTAACAATGGTGCTAGTGGCATTGAAAGTAATGCTGGTAGCACTGGAACTAACTTTGGTGCTGGTGGCACCGGGGGAATTGGAGATACGGAAAGTGATGCTGGTGGCTCCAAAACTAACTCTGGAAACGGCGGAACTAACGATGGTGCTAGTGGTATTGGAAGTAATGATGGTAGCACTGGAACTAACCCTGGTGCTGGTGGAGGAACAGATTCAAACATCGAAGGTACTGAAAATAACGTTGGTGGCAAGGAAACTAACCCTGGTGCTAGTGGCATTGGAAATAGTGATGGTAGCACTGGAACTAGCCCCGAAGGTACCGAAAGTAACGCTGACGGCACAAAAACTAACACGGGAGGCAAAGAATCTAACACCGGAAGTGAATCCAACACCAATTCTAGTCCACAAAAGTTGGAAGCACAAGGAGGCAATGGAGGAAATCAATGGGACGACGGAACCGATCATGATGGTGTGATGAAGATACATGTTGCAGTTGGTGGTCTAGGAATTGAGCAAATTAGATTTGATTATGTCAAGAACGGACAGTTGAAGGAAGGACCCTTCCACGGTGTCAAAGGAAGAGGTGGCACTTCAACGGTGCGTAAATTTTTATTATTATGGCTCAATTACGTTTTTCGAATAAGTGTTAATTCAAGATTATTGATCTTCATGATTCTGCAGATTGAGATTAGCCATCCGGACGAGTATCTTGTTTCCGTCGAGGGGTTGTACGACTCTTCCAATATCATTCAAGGAATCCAGTTTCAATCCAACAAACACACTTCTCAGTACTTTGGATATGAATATTATGGAGATGGTACACAATTTTCACTTCAAGTTAATGAAAAGAAGATCATTGGTTTCCATGGTTTTGCCGACTCACACCTTAATTCTCTTGGAGCTTATTTCGTTCCAATCTCATCCTCTTCTTCCTCCTTGACTCCTCCTCCCAACAAAGTTAAAGCTCAAGGAGGAAGTTATGGAGAAACATTTGACGATGGTGCTTTCGATCATGTAAGAAAGGTTTATGTTGGTCAAGGTGATTCTGGTGTAGCTTATGTCAAGTTCGATTATGAAAAAGACGGTAAAAAGGAGACACAAGAACATGGAAAAATGACATTGTCAGGAACAGAGGAGTTTGAGGTTGATTCAGACGATTACATAACATCAATGGAGGTTTATGTC
GACAAAGTCTACGGTTATAAAAGCGAAATCGTCATTGCTCTTACCTTCAAGACCTTTAAGGGTGAAACTTCTCCACGTTTTGGAATAGAGACTGAGAATAAATATGAAGTTAAAGACGGTAAAGGAGGAAAACTTGCTGGTTTCCATGGAAAAGCTAGCGATGTTCTTTATGCTATTGGTGCTTATTTCATTCCAGCAGCAAATTAGagagtt
ACGTATGTCTTAGTTACTACTATCATACTATATTACTATGTATTGGAAAACTTTTGGTTAGAACCTGTTGGGAGGAAAGGGTTTATGTTCTGGTTCATTTTACGTGTACTAAGTACTTATAATTAAGATTAAAAGAAACATTTACAGCTTCACCCTCTGGTCGATGTATGTGGGCTGTGGGCATGTGGCCAATCTCTGAAGCGTTAGGTAGAGCAAATATAGAGTTGAGAGTTGCTTAAGTTAGTGAACGTGAATGACTAAAAAGATATGTTGCATTTAAATCGTATTGGGCCTCATCCCATCTAAAATATAGTAGGTGTAGGCCTTTTAGGTTAATTTGAATAAAATCAACCTTTTTGTAAGCAACATCGACGATTGTCACATTTTTCTCATACACATAGGTGTAATCTAGCTTTGAATGTTTTCTCATACACATAGGTGTAATCACCGTAATTATCATTTGTGAAGATATATGTTTTACCAAGTGGTTTGTATTGTCCATATATACTTTACCACTTTCATATTAACATATAATGTTTTTGTAAGTATTATACCATAAAGGATTGGTTTCTTAATATTATTAACAAAACGCAAAAATTCTTTTAAACGCAGGCGATTCCAATCCACAGCGTTGCGGTTAGAGTAGGATCAACACAAAGAGTAGTGATGGAGATCATAATCACATTCGCATTGGTCTACACTGTTTACGCCACAGCCATTGACTCCAACAATGGCACTCTCGGAACCATCGCTCCACTTGCTATCAGACTCATCGTTGGTGCTAACATTCTTGCAGCCGGCCCATTCTCTGGTGGTCCAATGAACCCTGGACGTTCTTTTGGATCATCTCTTGCCGTTGGAAATTTTTCAGGACATTAGgtttat
and here is the script I am running:
#!/usr/bin/perl
use strict;
use warnings;
# A program to find the first inframe stop codon of non-spliced intron containing genes
print "ENTER THE FILENAME FOR DNA SEQUENCES:= ";
# Asks for Sequence file and if file does not exist prints error message
my $filename = <STDIN>;
#my $sequence;
my @sequence;
chomp $filename;
unless (open(DNAFILE, $filename) ) {
    print "Cannot open file \"$filename\"\n\n";
}
@sequence = <DNAFILE>;
close DNAFILE;
open (FILE, ">AtPTCindex.txt");
my $j;
my $i;
my $codon;
my $stopseq;
my $counter;
#Change $j<(375) to n=number of sequences
for ($j = 0; $j < @sequence; $j ++) {
    $counter = 0;
    for ($i = 0; $i < (length($sequence[$j]) - 2) && $counter < 1; $i += 3) {
        $codon = substr($sequence[$j], $i, 3);
        if ($codon =~ m/TAG|TGA|TAA/g) {
            # m added before /TAG... above
            $stopseq = substr($sequence[$j], $i, 9);
            my $result = index($sequence[$j], $stopseq);
            $counter = 1;
            #my $results = index($sequence[$j], $stopseq);
            print FILE "$result \n";
            #print FILE "$results $j \n";
        }
    }
    if ($counter == 0) {
        print FILE "\n"
    }
}
close FILE;
exit;
Thanks so much.
As threatened, the following is a cleaned up version of your script:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
die "Usage: $0 Filename\n" if @ARGV != 1;
my $file = shift;
open my $infh, '<', $file;
open my $outfh, '>', 'AtPTCindex.txt';
while (my $line = <$infh>) {
    chomp($line);
    my $result = '';
    for (my $i = 0; $i < (length($line) - 2); $i += 3) {
        my $codon = substr($line, $i, 3);
        if ($codon =~ m/TAG|TGA|TAA/) {
            # m added before /TAG... above
            my $stopseq = substr($line, $i, 9);
            $result = index($line, $stopseq);
            $result .= " ($i, $codon, $stopseq)";
            last;
        }
    }
    print "$result\n";
    # print $outfh "$result\n";
    # print $outfh "$result $.\n";
}
close $infh;
close $outfh;
For the 5 lines of data that you provided, the following is the output:
84 (84, TGA, TGATCATTC)
3 (3, TGA, TGATTGTTT)
3 (3, TAA, TAATTTTTG)
4 (27, TAG, TAGTAAACA)
123 (123, TAA, TAAGATTAA)
I believe your issue is with these lines:
my $stopseq = substr($line, $i, 9);
$result = index($line, $stopseq);
You're pulling a sequence from the $line at position $i, and then immediately doing an index for it. In the case of 4 of 5 of those lines, it immediately finds the same value $i. However, in the case of line 4, it finds a matching sequence earlier in the line.
If this isn't desired, you'll have to explain what your desired behavior actually is. Perhaps, you just want $i? Or are you looking for a matching stop sequence any point AFTER $i? You'll have to specify what your actual logic wants to be.
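If the loop offset $i is what is wanted, the difference from index can be reproduced on a short string. The sketch below is only an illustration: the helper name first_inframe_stop and the sequence ATAGGCTAGGC are made up, and the helper returns -1 when no in-frame stop codon is found.

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the first in-frame stop codon, or -1 if none.
sub first_inframe_stop {
    my ($seq) = @_;
    for (my $i = 0; $i <= length($seq) - 3; $i += 3) {
        return $i if substr($seq, $i, 3) =~ /\A(?:TAA|TAG|TGA)\z/;
    }
    return -1;
}

my $seq = 'ATAGGCTAGGC';    # made-up sequence; in-frame codons are ATA, GGC, TAG
my $i   = first_inframe_stop($seq);
my $hit = index($seq, substr($seq, $i, 9));

print "in-frame offset: $i\n";      # found by stepping three letters at a time
print "index() result:  $hit\n";    # earliest occurrence of the same text, in any frame
```

Here the text starting at the in-frame TAG also occurs earlier at an out-of-frame position, so index reports the earlier one, which is the same effect seen on the fourth sequence.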
I took a different approach, unpacking it into groups of three instead of counting by indexes of three. I believe this script does what you want, and it looks a lot cleaner. It can also optionally take the filename as argument.
#!/usr/bin/perl
use strict;
use warnings;
my $filename = 'a'; # dummy value
my $resultfile = 'AtPTCindex.txt';
# User may have passed filename as argument
if (@ARGV) { if (-e $ARGV[0]) { $filename = $ARGV[0] } }
unless (-e $filename)
{
    print "ENTER THE FILENAME FOR DNA SEQUENCES: ";
    chomp($filename = <STDIN>)
}
open DNA,"<$filename" or die "Couldn't open $filename for reading: $!\n";
my @sequence = <DNA> or die "Couldn't read $filename: $!\n";
close DNA;
# Uncomment the below line if you're braver than me
if (-e $resultfile) { die "Cowardly refusing to write to existing file" }
if (-e $resultfile) { unlink $resultfile };
open RESULT,">>$resultfile" or die "Couldn't open $resultfile: $!\n";
foreach my $string (@sequence)
{
    # split into groups of 3
    my @groups = unpack "(A3)*", $string;
    # Search for the group you want
    for (my $groupnum = 0; $groupnum < @groups - 1; $groupnum++)
    {
        if ($groups[$groupnum] =~ m/(TAG|TGA|TAA)/g)
        {
            print RESULT (($groupnum + 0) * 3) . "\n";
            print "$1 (" . $1 . ( $groups[$groupnum + 1]) . ($groups[$groupnum + 2]) . ") at index " . (($groupnum + 0) * 3) . "\n";
            last;
        }
    }
}
close RESULT;
Running the script on your sample data, it outputs:
TGA (TGATCATTC) at index 84
TGA (TGATTGTTT) at index 3
TAA (TAATTTTTG) at index 3
TAG (TAGTAAACA) at index 27
TAA (TAAGATTAA) at index 123
...as well as writes the raw index numbers to the file specified.

Popping keys of an array to calculate a total

I'm trying to simply pop off each numeric value and add them together to gain a total.
Input file:
Samsung 46
RIM 16
Apple 87
Microsoft 30
My code compiles, however, it only returns 0:
open (UNITS, 'units.txt') || die "Can't open it $!";
my @lines = <UNITS>;
my $total = 0;
while (<UNITS>) {
    chomp;
    my $line = pop @lines;
    $line += $total;
}
print $total;
No need to slurp all lines into an array if you're just going to loop through them anyway with a while. Also, you need to split each line to get your numbers.
use warnings;
use strict;
open (UNITS, 'units.txt') || die "Can't open it $!";
my $total = 0;
while (<UNITS>) {
chomp;
my $num = (split)[1];
$total += $num;
}
print "$total\n";
__END__
179
There are three problems here:
You are trying to add strings like 'Samsung 46' + 'RIM 16'.
You read the entire file into @lines and then try to read more from the file in the while loop. That loop is never entered because you have already read to end of file.
You are adding $total to the variable $line within the loop, instead of the other way around. So $total remains at zero and $line keeps having zero added to it.
It is best to use while to read files unless you need something other than sequential access to the records, so removing @lines is a start.
It isn't completely clear which part of the records you want to accumulate. This program splits the lines on whitespace and adds together the last field of each line.
You must always use strict and use warnings at the start of every program. It is a measure that will make it far easier to locate bugs in your code. It is also best to use lexical file handles rather than the global one you used, and the three-parameter form of open.
use strict;
use warnings;
open my $units, '<', 'units.txt' or die "Can't open it: $!";
my $total;
while (<$units>) {
    my @fields = split;
    $total += $fields[-1];
}
print $total;
output
179
use strict;
use warnings;
open my $fh, "<", "units.txt" or die "well...";
my $total = 0;
while(<$fh>){
chomp;
my ($string, $num) = split(" ", $_);
$total += $num;
}
print $total;
This problem is a doddle with a one-liner:
$ perl -ane '$sum += $F[1] }{ print $sum' units.txt
Explanation
-a enables autosplit, each line is split and stored in @F
-n loops over the file line by line
-e tells perl that the next argument is to be treated as Perl code
the LHS of the Eskimo-kiss (that funny-looking }{ in the middle) is performed for every line in the input file, RHS performed only once
LHS accumulates the second column of every line in $sum
RHS prints the result of $sum once all lines have been processed
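For comparison, here is roughly what the one-liner does when written out in full. It is a sketch only: the helper name sum_second_column is invented, and it takes the lines as arguments instead of reading units.txt so that it can be run on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: sums the second whitespace-separated field of each line.
sub sum_second_column {
    my $sum = 0;
    for my $line (@_) {
        my @F = split ' ', $line;   # what -a does for each line
        $sum += $F[1];              # the code to the left of }{
    }
    return $sum;                    # the code to the right of }{ runs once, at the end
}

print sum_second_column("Samsung 46\n", "RIM 16\n", "Apple 87\n", "Microsoft 30\n"), "\n";
```

With the sample input from the question this prints 179, matching the other answers.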

perl - cutting many strings with given array of numbers

dear my fellow perl masters in the world~!
I need your help.
I have a string file A and a number file B like this:
File A:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
...and so on till 200.
File B:
3, 6, 2, 5, 6, 1, ... 2
(total 200 numbers in an array)
then, with the numbers in file B, I would like to cut each string from the start position to the number of characters in File B.
E.g. as File B starts with 3, 6, 2 ...
File A will be
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
like this.
So. this is my code so far...
use strict;
if (@ARGV != 2) {
    print "Invalid usage\n";
    print "Usage: perl program.pl [num_list] [string_file]\n";
    exit(0);
}
my $numbers=$ARGV[0];
my $strings=$ARGV[1];
my $i;
open(LIST,$number);
open(DATA,$strings);
my @list = <LIST>;
my $list_size = scalar @sp_list;
for ($i=0;$i<=$list_size;$i++) {
    print $i,"\n";
    #while (my $line = <DATA>) {
}
close(LIST);
close(DATA);
As the strings and numbers are 200 I changed the array into a scalar value to work on every numbers of every strings.
I'm working on this. and I know I suppose to use a pos function but i do not know how to match each number with each string. is reading the string first by while? or using for to know how many time that I have to run this to achieve the result?
Your help will be much appreciated!
Thank you.
I will be working on it, too. Need your feedback.
It is good that you use strict, and you should also use warnings. Further things to note:
You should check the return value of open to make sure it did not fail. You should also use the three-argument form of open and a lexical file handle. This matters especially when the file names come from command-line arguments, because the two-argument form poses a security risk.
open my $listfh, "<", $file or die $!;
You may wish to use a safety precaution
use ARGV::readonly;
You can easily make the list of numbers with a map statement. Assuming the numbers are in a comma separated list:
my @list = map split(/\s*,\s*/), <$listfh>;
This will split the input line(s) on comma and strip excess whitespace.
When reading your input file, you do not need to use a counter variable. You can simply do
open my $inputfh, "<", $file or die $!;
while (<$inputfh>) {
    my $length = shift @list; # these are your numbers
    chomp; # remove newline
    my $string = substr($_, 0, -$length); # negative length on substr
    print "$string\n";
}
The negative length on substr makes it leave that many characters off the end of the string.
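A quick way to try the negative length is a small wrapper such as the one below. The name drop_tail is made up for illustration. The guard for zero is an addition of this sketch, since a length of -0 is simply 0 and substr would then return an empty string.

```perl
use strict;
use warnings;

# Hypothetical helper: return $string without its last $n characters.
sub drop_tail {
    my ($string, $n) = @_;
    return $string if $n == 0;          # a length of -0 is 0 and would give ''
    return substr($string, 0, -$n);     # negative length: stop $n characters before the end
}

print drop_tail('ABCDEFGH', 3), "\n";   # ABCDE
print drop_tail('ABCDEFGH', 0), "\n";   # ABCDEFGH
```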
Here is a one-liner in action that demonstrates these principles:
perl -lwe '$f = pop; # save file name for later
@nums = map split(/\s*,\s*/), <>; # process first file
push @ARGV, $f; # put back file name
while (<>) {
    my $len = shift @nums;
    chomp;
    print substr($_,0,-$len);
}' fileb.txt filea.txt
Output:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEE
Note the use of implicit open of file name arguments by manipulating @ARGV. Also handling newlines with -l switch.
Here is my suggestion. It does use autodie so that there is no need to explicitly check the status of open calls, and temporarily undefines $/ - the input record separator - so that all of the num_list file is read in one go. It isn't clear whether this file will always contain just a single line; if it does, you can omit local $/.
The numbers are extracted from the text using a regular expression: /\d+/g returns all the strings of digits in the input as a list.
The second parameter to substr is the start position of the substring you want, and using a negative number counts from the end of the string instead of the beginning. The third parameter is the number of characters in the substring, and the fourth is a string to replace that substring in the target variable. So substr $data, -$n, $n, '' replaces the substring of length $n starting $n characters from the end with an empty string - i.e. it deletes it.
Note that if it is your intention to remove the given number of characters from the beginning of the string, then you would write substr $data, 0, $n, '' instead.
use strict;
use warnings;
use autodie;
unless (@ARGV == 2) {
    print "Usage: perl program.pl [num_list] [string_file]\n";
    exit;
}
my @numbers;
{
    open my $listfh, '<', $ARGV[0];
    local $/;
    my $numbers = <$listfh>;
    @numbers = $numbers =~ /\d+/g;
};
open my $datafh, '<', $ARGV[1];
for my $i (0 .. $#numbers) {
    print "$i\n";
    my $n = $numbers[$i];
    my $data = <$datafh>;
    chomp $data;
    substr $data, -$n, $n, '';
    print "$data\n";
}
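The two four-argument substr calls described above can be compared side by side. In this sketch the helper names trim_end and trim_start are invented for illustration, and both assume $n is no larger than the length of the string.

```perl
use strict;
use warnings;

# Hypothetical helpers; both assume $n does not exceed the string length.
sub trim_end {
    my ($string, $n) = @_;
    substr($string, -$n, $n, '');   # replace the last $n characters with ''
    return $string;
}

sub trim_start {
    my ($string, $n) = @_;
    substr($string, 0, $n, '');     # replace the first $n characters with ''
    return $string;
}

print trim_end('ABCDEFGH', 3), "\n";     # ABCDE
print trim_start('ABCDEFGH', 3), "\n";   # DEFGH
```

Because the helpers work on a copy of the argument, the caller's string is left unchanged, unlike the in-place call on $data in the script.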
Here is how I would do it. substr is the function to remove a part of a string. From your example, it is not clear whether you want to remove the characters at the beginning or at the end. Both alternatives are shown here:
#!/usr/bin/perl
use warnings;
use strict;
if (@ARGV != 2) {
    die "Invalid usage\n"
      . "Usage: perl program.pl [num_list] [string_file]\n";
}
my ($number_f, $string_f) = @ARGV;
open my $LIST, '<', $number_f or die "Cannot open $number_f: $!";
my @numbers = split /, */, <$LIST>;
close $LIST;
open my $DATA, '<', $string_f or die "Cannot open $string_f: $!";
while (my $string = <$DATA>) {
    substr $string, 0, shift @numbers, q(); # Replace the first n characters with an empty string.
    # To remove the trailing portion, replace the previous line with the following:
    # my $n = shift @numbers;
    # substr $string, -$n-1, $n, q();
    print $string;
}
You were not checking the return value of open. Try to remember to always do that.
Do not declare variables far before you are going to use them ($i here).
Do not use C-style for loops if you do not have to. They are prone to fence post errors.
You can use substr():
use strict;
use warnings;
if (@ARGV != 2) {
    print "Invalid usage\n";
    print "Usage: perl program.pl [num_list] [string_file]\n";
    exit(0);
}
my $numbers=$ARGV[0];
my $strings=$ARGV[1];
open my $list, '<', $numbers or die "Can't open $numbers: $!";
open my $data, '<', $strings or die "Can't open $strings: $!";
chomp(my $numlist = <$list>);
my @numbers = split /\s*,\s*/,$numlist;
for my $chop_length (@numbers)
{
    my $line = <$data> // die "not enough data in $strings";
    chomp $line; # remove the newline so that it is not counted among the chopped characters
    print substr($line,0,length($line)-$chop_length)."\n";
}
Your specs say you want "... to cut each string from the start position to the number of characters in File B." I agree with choroba that it's not perfectly clear whether characters from the start or the end of the string are to be cut. However, I tend to think that you want to remove characters from the beginning when you say, "... from the start position ...", but a string like ABCDEFGHIJKLMNOPQRSTUVWXYZ012345 would help clarify this issue.
This option is not as well self-documenting as the other solutions, but a discussion of it will follow:
use strict;
use warnings;
@ARGV == 2 or die "Usage: perl program.pl [num_list] [string_file]\n";
open my $fh, '<', pop or die "Cannot open string file: $!";
chomp( my @str = <$fh> );
local $/ = ', ';
while (<>) {
    chomp;
    print +( substr $str[ $. - 1 ], $_ ) . "\n";
}
Strings:
ABCDEFGHIJKLMNOPQRSTUVWXYZ012345
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Numbers:
3, 6, 2, 5, 6
Output:
DEFGHIJKLMNOPQRSTUVWXYZ012345
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEE
The strings' file name is popped off @ARGV (since an explicit argument for pop is not used) and passed to open to read the strings into @str. The record separator is set to ', ' so chomp leaves only the number. The current line number in $. is used as part of the index to the corresponding @str element, and the remaining characters in the string from n on are printed.
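The record-separator trick can be tried without any files by reading from an in-memory file handle. This is a sketch under that assumption; the helper name read_numbers is made up, and the extra substitution for the final record is an addition here, since the last number ends in a newline, not in ', '.

```perl
use strict;
use warnings;

# Hypothetical helper: read "3, 6, 2"-style text as records separated by ', '.
sub read_numbers {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = ', ';                # input record separator, for this scope only
    my @numbers;
    while (my $record = <$fh>) {
        chomp $record;              # removes a trailing ', ' if there is one
        $record =~ s/\s+\z//;       # the final record ends in a newline instead
        push @numbers, $record;
    }
    return @numbers;
}

print join('|', read_numbers("3, 6, 2, 5, 6\n")), "\n";
```

Each iteration also advances $., which is what lets the answer's loop pair the n-th number with the n-th string.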
