Printing in unexpected order - arrays

I expected the following to print in the order of the elements of #Data, but it's printing in the order of the elements of #Queries. Am I missing something? I also tried declaring the items to be printed after foreach(#data){... and then printing inside that loop, but still wrong order.
$datafile is a file with the following:
GR29929,JAMES^BOB
GR21122,HANK^REN
$queryfile is a file with the following:
(3123123212):# FD [GR21122]
line 2
line 3
line 4
(12): # FD [HANK^REN]
line 6
line 7
line 8
(13): # FD [Y]
-------------------------------
--------------------------------
(3123123212):# FD [GR29929]
line 2
line 3
line 4
(12): # FD [JAMES^BOB]
line 6
line 7
line 8
(13): # FD [Z]
The output file is:
GR21122,HANK^WREN,Y
GR29929,JAMES^BOB,Z
When I want:
GR29929,JAMES^BOB,Z
GR21122,HANK^WREN,Y
Code is:
open(DA, "<$datafile");
open(QR, "<$queryfile");
my #Data = <DA>;
my #Queries = <QR>;
foreach (#Data) {
my ( $acce, $namee ) = split( ',', $_ );
chomp $acce;
chomp $namee;
print "'$acce' and '$namee'\n";
for my $i ( 0 .. $#Queries ) {
my $Qacce = $Queries[$i];
my $Qname = $Queries[ $i + 4 ];
my $Gen = $Queries[ $i + 8 ];
if ( $Qacce =~ m/$acce/ and $Qname =~ m/$namee/ ) {
my ($acc) = $Qacce =~ /\[(.+?)\]/;
my ($gen) = $Gen =~ /\[(.+?)\]/;
$gen =~ s/\s+$//;
my ($name) = $Qname =~ /\[(.+?)\]/;
print GL "$i,$acc,$gen,$name\n";
}
}
}

The basic shell of your program prints what you ask for, but there is a lot missing. The refactoring below should do what you want.
You had a problem with the values of your $i index variable, so that the first time around the loop you were accessing #data elements [0, 4, 8], the second time [1, 5, 9] etc. It looks like the second loop execution should use elements [11, 15, 19] and so on. Please correct me if I'm wrong.
In addition you were using regular expressions to compare the keys in the two files, and you were finding nothing because the name values contain caret ^ characters which are special within regexes. Escaping the strings using \Q...\E fixed this.
Note that a better solution would use hashes to match keys across the two files, but without details on your file format - particularly queryfile - I have had to follow your own algorithm.
use strict;
use warnings;
use autodie;
my ($data_file, $query_file) = qw/ datafile.txt queryfile.txt /;
my #queries = do {
open my $query_fh, '<', $query_file;
<$query_fh>;
};
chomp #queries;
open my $data_fh, '<', $data_file;
while (<$data_fh>) {
chomp;
my ($acce, $namee) = split /,/;
for (my $i = 0; $i < #queries; $i += 11) {
my ($qacce, $qname, $qgen) = #queries[$i, $i+4, $i+8];
if ( $qacce =~ /\Q$acce\E/ and $qname =~ /\Q$namee\E/ ) {
my ($acc, $name, $gen) = map / \[ ( [^\[\]]+ ) \] /x, ($qacce, $qname, $qgen);
$gen =~ s/\s+\z//;
print "$acc,$name,$gen\n";
}
}
}
output
GR29929,JAMES^BOB,Z
GR21122,HANK^REN,Y

Related

Referencing an element in a 2D array in Perl

I have the following code which reads in a 6x6 array from STDIN and saves it as an array of anonymous arrays. I am trying to print out each element with $arr[i][j], but the code below isn't working. It just prints out the first element over and over. How am I not accessing the element correctly?
#!/user/bin/perl
my $arr_i = 0;
my #arr = ();
while ($arr_i < 6){
my $arr_temp = <STDIN>;
my #arr_t = split / /, $arr_temp;
chomp #arr_t;
push #arr,\#arr_t;
$arr_i++;
}
foreach my $i (0..5){
foreach my $j (0..5){
print $arr[i][j] . "\n";
}
}
i and j are not the same as the variables you declared in the foreach lines. Change:
print $arr[i][j] . "\n";
to:
print $arr[$i][$j] . "\n";
warnings alerted me to this issue. You should add these lines to all your Perl code:
use warnings;
use strict;
To demonstrate the Perlish mantra that there's "more than one way to do it":
use 5.10.0; # so can use "say"
use strict;
use warnings qw(all);
sub get_data {
my ($cols, $rows) = #_;
my ($line, #rows);
my $i;
for ($i = 1; $i <= $rows and $line = <DATA>; $i++) {
chomp $line;
my $cells = [ split ' ', $line ];
die "Row $i had ", scalar(#$cells), " instead of $cols" if #$cells != $cols;
push #rows, $cells;
}
die "Not enough rows, got ", $i - 1, "\n" if $i != $rows + 1;
\#rows;
}
sub print_data {
my ($cols, $rows, $data) = #_;
for (my $i = 0; $i < $rows; $i++) {
for (my $j = 0; $j < $cols; $j++) {
say $data->[$i][$j];
}
}
}
my $data = get_data(6, 6);
print_data(6, 6, $data);
__DATA__
1 2 3 4 5 6
a b c d e f
6 5 4 3 2 1
f e d c b a
A B C D E F
7 8 9 10 11 12
Explanation:
if we use say, that avoids unsightly print ..., "\n"
get_data is a function that can be called and/or reused, instead of just being part of the main script
get_data knows what data-shape it expects and throws an error if it doesn't get it
[ ... ] creates an anonymous array and returns a reference to it
get_data returns an array-reference so data isn't copied
print_data is a function too
both functions use a conventional for loop instead of making lists of numbers, which in Perl 5 needs to allocate memory
There is also a two-line version of the program (with surrounding bits, and test data):
use 5.10.0; # so can use "say"
my #lines = map { [ split ' ', <DATA> ] } (1..6);
map { say join ' ', map qq{"$_"}, #$_ } #lines;
__DATA__
1 2 3 4 5 6
a b c d e f
6 5 4 3 2 1
f e d c b a
A B C D E F
7 8 9 10 11 12
Explanation:
using map is the premier way to iterate over lists of things where you don't need to know how many you've seen (otherwise, a for loop is needed)
the adding of " around the cell contents is only to prove they've been processed. Otherwise the second line could just be: map { say join ' ', #$_ } #lines;

Is it possible to assign two variables in Perl foreach loop?

Is it possible to assign two variables the same data from an array in a Perl foreach loop?
I am using Perl 5, I think I came across something in Perl 6.
Something like this:
my $var1;
my $var2;
foreach $var1,$var2 (#array){...}
It's not in the Perl 5 core language, but List::Util has a pairs function which should be close enough (and a number of other pair... functions which may be more convenient, depending on what you're doing inside the loop):
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use List::Util 'pairs';
my #list = qw(a 1 b 2 c 3);
for my $pair (pairs #list) {
my ($first, $second) = #$pair;
say "$first => $second";
}
Output:
a => 1
b => 2
c => 3
The easiest way to use this is with a while loop that calls splice on the first two elements of the array each time,
while (my($var1, $var2) = splice(#array, 0, 2)) {
...
}
However, unlike foreach, this continually does a double-shift on the original array, so when you’re done, the array is empty. Also, the variables assigned are copies, not aliases as with foreach.
If you don’t like that, you can use a C-style for loop:
for (my $i = 0; $i < #array; $i += 2) {
my($var1, $var2) = #array[$i, $i+1];
...
}
That leaves the array in place but does not allow you to update it the way foreach does. To do that, you need to address the array directly.
my #pairlist = (
fee => 1,
fie => 2,
foe => 3,
fum => 4,
);
for (my $i = 0; $i < #pairlist; $i += 2) {
$pairlist[ $i + 0 ] x= 2;
$pairlist[ $i + 1 ] *= 2;
}
print "Array is #pairlist\n";
That prints out:
Array is feefee 2 fiefie 4 foefoe 6 fumfum 8
You can get those into aliased variables if you try hard enough, but it’s probably not worth it:
my #kvlist = (
fee => 1,
fie => 2,
foe => 3,
fum => 4,
);
for (my $i = 0; $i < #kvlist; $i += 2) {
our ($key, $value);
local(*key, $value) = \#kvlist[ $i, $i + 1 ];
$key x= 2;
$value *= 2;
}
print "Array is #kvlist\n";
Which prints out the expected changed array:
Array is feefee 2 fiefie 4 foefoe 6 fumfum 8
Note that the pairs offered by the List::Pairwise module, which were but very recently added to the core List::Util module (and so you probably cannot use it), are still not giving you aliases:
use List::Util 1.29 qw(pairs);
my #pairlist = (
fee => 1,
fie => 2,
foe => 3,
fum => 4,
);
for my $pref (pairs(#pairlist)) {
$pref->[0] x= 2;
$pref->[1] *= 2;
}
print "Array is #pairlist\n";
That prints out only:
Array is fee 1 fie 2 foe 3 fum 4
So it didn’t change the array at all. Oops. :(
Of course, if this were a real hash, you could double the values trivially:
for my $value (values %hash) { $value *= 2 }
The reasons that works is because those are aliases into the actual hash values.
You cannot change the keys, since they’re immutable. However, you can make a new hash that’s an updated copy of the old one easily enough:
my %old_hash = (
fee => 1,
fie => 2,
foe => 3,
fum => 4,
);
my %new_hash;
#new_hash{ map { $_ x 2 } keys %old_hash } =
map { $_ * 2 } values %old_hash;
print "Old hash is: ", join(" " => %old_hash), "\n";
print "New hash is: ", join(" " => %new_hash), "\n";
That outputs
Old hash is: foe 3 fee 1 fum 4 fie 2
New hash is: foefoe 6 fiefie 4 fumfum 8 feefee 2
A general algorithm for more than 2 variables:
while( #array ){
my $var1 = shift #array;
my $var2 = shift #array;
my $var3 = shift #array;
# other variables from #array
# do things with $var1, $var2, $var3, ...
}
PS: Using a working copy of the array to that it is preserved for use later:
if( my #working_copy = #array ){
while( #working_copy ){
my $var1 = shift #working_copy;
my $var2 = shift #working_copy;
my $var3 = shift #working_copy;
# other variables from #working_copy
# do things with $var1, $var2, $var3, ...
}
}
PPS: another way is to use indexing. Of course, that is a sure sign that the data structure is wrong. It should be an array of arrays (AoA) or an array of hashes (AoH). See perldoc perldsc and perldoc perllol.
my $i = 0;
while( $i < #array ){
my $var1 = $array[ $i++ ];
my $var2 = $array[ $i++ ];
my $var3 = $array[ $i++ ];
# other variables from #array
# do things with $var1, $var2, $var3, ...
}
PPPS: I've been asked to clarify why the data structure is wrong. It is a flatten set of tuples (aka records aka datasets). The tuples are recreated by counting of the number of data for each. But what is the reader constructing the set has a bug and doesn't always get the number right? If, for a missing value, it just skips adding anything? Then all the remaining tuples are shifted by one, causing the following tuples to be grouped incorrectly and therefore, invalid. That is why an AoA is better; only the tuple with the missing data would be invalid.
But an better structure would be an AoH. Each datum would access by a key. Then new or optional data can be added without breaking the code downstream.
While I'm at it, I'll add some code examples:
# example code for AoA
for my $tuple ( #aoa ){
my $var1 = $tuple->[0];
my $var2 = $tuple->[1];
my $var3 = $tuple->[2];
# etc
}
# example code for AoH
for my $tuple ( #aoh ){
my $var1 = $tuple->{keyname1};
my $var2 = $tuple->{key_name_2};
my $var3 = $tuple->{'key name with spaces'};
my $var4 = $tuple->{$key_name_in_scalar_variable};
# etc
}
Here is a module-less way to "loop" by an arbitrary value ($by) and output the resulting group of elements using an array slice:
#!perl -l
#array = "1".."6";
$by = 3; $by--;
for (my $i = 0 ; $i < #array ; $i += $by ) {
print "#array[$i..$i+$by]";
$i++ ;
}
As a one-liner to test (cut and paste to a Unix shell):
perl -E '#array = "1".."6"; $by = 3; $by--;
for (my $i = 0 ; $i < #array ; $i += $by ) {
say "#array[$i..$i+$by]"; $i++ }'
Output:
1 2 3
4 5 6
If you make $by = 2; it will print pairs of numbers. To get at specific elements of the resulting slice access it as an anonymous array: (e.g. [#array[$i..$i+$by]]->[1]).
See also:
How do I read two items at a time in a Perl foreach loop?
Perl way of iterating over 2 arrays in parallel
Some good responses there, including reference to natatime which is quite easy to use. It's easy to implement too - it is essentially a wrapper around the splice solutions mentioned in the responses here.
The following is not the nicest example, but I've been using autobox::Core and made an #array->natatime() "method" ;-) like this:
use autobox::Core ;
sub autobox::Core::ARRAY::natatime {
my ($self, $by) = #_;
my #copy = #$self ;
my #array ;
push #array, [splice (#copy, 0, $by) ] while #copy ;
if ( not defined wantarray ) {
print "#{ $_ } \n" for #array ;
}
return wantarray ? #array : \#array;
}
The #copy array is spliced destructively, but $self (which is how the #array in front of the autobox method -> arrow gets passed to the function) is still there. So I can do:
my #dozen = "1" .. "12" ; # cakes to eat
#dozen->natatime(4) ; # eat 4 at time
my $arr_ref = #dozen->natatime(4) ; # make a reference
say "Group 3: #{ $arr_ref->[2] }" ; # prints a group of elements
say scalar #dozen , " cakes left" ; # eat cake; still have it
Output:
1 2 3 4
5 6 7 8
9 10 11 12
Group 3: 9 10 11 12
12 cakes left
One other approach that also uses a CPAN module (I gave this answer elsewhere but it is worth repeating). This can also be done non-destructively, with Eric Strom's excellent List::Gen module:
perl -MList::Gen=":all" -E '#n = "1".."6"; say "#$_" for every 2 => #n'
1 2
3 4
5 6
Each group of elements you grab is returned in an anonymous array so the individual values are in: $_->[0] $_->[1] ... etc.
You mentioned Perl6, which handles multiple looping values nicely:
my #qarr = 1 .. 6;
my ($x, $y, $z) ;
for #qarr -> $x , $y , $z { say $x/$y ; say "z = " ~ $z }
Output:
0.5
z = 3
0.8
z = 6
For more on the Perl6 approach see: Looping for Fun and Profit from the 2009 Perl6 Advent Calendar, or the Blocks and Statements Synopsis for details. Perhaps Perl 5 will have a similar "loop by multliple values" construct one day - à la perl5i's foreach :-)

Pulling out potentially overlapping subsets of elements in array to make smaller arrays

My input file looks like below (real one is much larger):
rs3683945_mark 0
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
IMP!2! 2
SNP117_impute 2.685815665
IMP!3! 3
SNP3_1_impute 3.643119709
SNP1_impute 3.643119809
rs13475706_mark 3.643119909
13 lines, two elements each line. First element is a name. Each name ends either with a "tag" _mark or impute, or there is no tag. The point of the tag is to distinguish between types of names, which form the basis of my search for subsets within the entire list.
The subsets begin with a _mark name that immediately precedes an instance of an _impute name. The subsets end with the very next instance of _mark. All names in between, which will necessarily not have any such tag, also go into a subset, which I'd like to collect into an array and send off to a subroutine to process (details of that not important). Please note, the positions with IMP in the name are not the same as those actually tagged with a _impute.
For example, with the above, the first useable subset is:
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
The second useable subset is:
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
and so on... EDIT: Note that last _mark name of the first set is the first _mark name of the second.
My code for this:
#!/usr/bin/perl
use strict; use warnings;
my $usage = "usage: merge_impute.pl {genotype file} {distances file} \n";
die $usage unless #ARGV == 2;
my $genotypes = $ARGV[0];
open (FILE, "<$genotypes");
my #genotypes = <FILE>;
close FILE;
my $distances = $ARGV[1];
open (DISTS, "<$distances");
my #distances = <DISTS>;
close DISTS;
my #workingset = ();
#print scalar #distances;
for ( my $i = 0; $i < scalar #distances; $i++ ){
chomp $distances[$i];
#print "$distances[$i]\n";
if ( $distances[$i] =~ m/impute/ ){
push ( #workingset,$distances[$i-1],$distances[$i],$distances[$i+1]);
}
print "i=$i: #workingset\n";
# at this point send off to sub routine
#workingset=();
}
As you can see, the if loop is only set up to find subsets that contain only one _impute name. How can I modify the code so that a subset will "fill up" with as many names as required until we arrive at the next _mark name?
EDIT: Perhaps instead of the for loop, I could something like...
push (#workingset, $distances[0], $distances[1] );
until ( $distance[ ??? ] =~ m/_mark/ ){
push ( #workingset, $distance[ ??? ] );
}
But what could $distances[ ??? ] be?
EDIT: Or an alternative for loop...
push (#workingset, $distances[0] );
for ( my $i = 1; $i < scalar #distances - 1 ; $i++ ){
until ( $distances[ $i ] =~ m/_mark/ ){
push ( #workingset, $distances[ $i ] );
# send #workingset to sub routine
#clear workingset
#workingset = ();
}
}
Though this isn't working.
I also tried...
push (#workingset, $distances[0] );
for ( my $i = 1; $i < scalar #distances - 1 ; $i++ ){
until ( $distances[ $i ] =~ m/_mark/ ){
push ( #workingset, $distances[ $i ] );
next if $distances[ $i+1 ] !~ /_mark/;
}
# send #workingset to sub routine here
print "i=$i, #workingset\n\n";
#clear workingset
#workingset = ();
}
I don't have a lot of time right now but I'll hopefully have some time in the morning to check back. Here's a quick example on how you could do it (it is meant to be simple and easy to understand, not fancy). Hopefully it helps you get on the right track for parsing the data.
use strict;
use warnings;
my $first_mark;
my #workingset = ();
my $second_mark;
while (<DATA>){
chomp;
if ( /_mark/ and scalar #workingset == 0 ) {
$first_mark = $_;
} elsif ( /IMP|_impute/ and defined $first_mark) {
push #workingset, $_;
} elsif ( /_mark/ and defined $first_mark) {
$second_mark = $_;
print "Found valid set: ";
print "$first_mark," . join(",", #workingset) . ",$second_mark\n";
#workingset = ();
$first_mark = $second_mark;
undef $second_mark;
}
}
__DATA__
rs3683945_mark 0
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
IMP!2! 2
SNP117_impute 2.685815665
IMP!3! 3
SNP3_1_impute 3.643119709
SNP1_impute 3.643119809
rs13475706_mark 3.643119909
Output:
Found valid set: rs6336442_mark 1E-07,rs31328150_impute 0.444121193,rs3658242_mark 0.444121293
Found valid set: rs3658242_mark 0.444121293,rs39342374_impute 0.444121393,IMP!1! 1,rs3677817_mark 1.986015679
Found valid set: rs3677817_mark 1.986015679,IMP!2! 2,SNP117_impute 2.685815665,IMP!3! 3,SNP3_1_impute 3.643119709,SNP1_impute 3.643119809,rs13475706_mark 3.643119909

Perl Script that should move through array with i=3 prints indicies that aren't x3

I have these arrays of Sequences and I wrote this script to walk through each sequence three letters at a time (eg. {0,1,2}, {3,4,5},{6,7,8}) and print the index of where it first encounters a certian 3 letter combination (TAA,TAG,TGA). (EX. if sequence were CGTAGCCCCTAACCCC, then the script would skip over the TAG in the 2 position because its not in the correct frame of 3 and report the TAA in the 9 position). Therefore, I am only expecting indices in multiples of 3 in my results.
On most strings there is no problem, however every once in a while it will index at 4 or other non multiples of three. I was wondering if anyone more advanced than I can figure out why this may happen. I know this script is ugly and I am sorry for that, I am a biologist and I mod it for whatever I am mining out of sequences at the time. I just can't figure out the bug.
Here are some sequences from my file. The 3rd line is the sequence that gives the strange result. Just for an example of what I am dealing with.
AGGTACGCGAGTCACCTTTCGTCTTCAATCTCGTTTGATCGAAGCTATTTGTCAAAAAGAGAGGATTTTTTTGCATCTCAATTATGATCATTCCTTAGGGTTTTCAGGGTTTTGGATTGTTGTTTTTGTTAACATTTATCTGATTCGTTTGTATTTGTGTGGCAGTCTAAAGTGGCATCAACAATGGCGTCTTTTATTATACATAAGCCAAAGGAGAGATCGCCTTTCACGAAAGCTGCTTTCAAAACGGTACCTTTAGTGATTCAGCATTTTTATCTGAAATATGTTTGTTGCATTATTGAATGATTCTGATGTGGTGTTGCTACCAACTTGTCTATGTTGGTTGATTTAGCTTGATAGCATCAAGGAGTTGGAACTGTTTATGTTGAAGCATCGAAAGGATTATGTTGATCTGCACCGGACTACAGAACAGGAAAAGGATAGTATTGAACAAGAAGTAAGTACTCTGAGCTAGGCTTGCCCGTAGTATATATCTGAACTCATGAAGTTACTGCGATAAATCTATGCTTGAGTTGAGATTGAACATATGGAACTATGGAATCATAAGAAATGTAGCAACTCATATTGAGATAACTCAGGAAGATTAATGTCTATTACTTTAGATAGCGAGGGAGTTAGTATATTGTGACACTGAGGAACTTGGATCTTGTATTCTTATACCTCTTGCAGTGTTTGATCGAGAACTATGTCTACTTATGTGTTGTGTAATATCATCAAACTCTCTCTCTCTCCCTCTTGCAGGTTGCTGCTTTTATTAAAGCTTGCAAAGAACAGATCGATATTCTCATAAACAGTATTAGAAATGAAGAAGCAAACTCCAAAGGATGGCTTGGCCTCCCCGCAGATAACTTCAATGCTGATTCTATAGCACACAAACATGGAGTGGTATGATATGCACCAATGTAGTAAGCCAACTTTGGTTTTTTTTTACTATGTTTTCTTTCAAAGTATCTAGATGTGTAGAAGTAATGGTAATTTTTTTTGTATGCAGGTTTTGATTCTGAGTGAGAAACTTCATTCAGTCACTGCCCAGTTTGATCAGCTTAGAGCTACTCGTTTCCAAGATATTATAAACAGAGCTATGCCGAGAAGAAAACCTAAGAGGGTCATAAAGGAAGCTACCCCAATTAATACAACTCTGGGAAATTCGGAGTCCATAGAACCGGATGAAATCCAGGCCCAACCTCGTAGATTACAACAACAACAACTTCTAGACGATGAAACACAAGCCCTTCAGGTAACAAGGCAAATATACATGATCTTCGAAAACTTGCATAAGTTTTGTAGTTATGCTAAATTTTGAAATTGATAATTTTTGCAGGTAGAGCTAAGTAATCTTTTAGATGGTGCTAGGCAGACAGAAACTAAGATGGTGGAGATGTCTGCATTAAACCACTTGATGGCAACTCATGTTCTGCAGCAAGCCCAACAGATAGAGTTTCTTTATGACCAGGTTAGGACTTATTAACTTCTCTAACGCTCTCATGTCAACACACTGTTTTGTTAGGCTTTCACTGTTCTTTACACTCCTTTGCTATCTCAAAGTTAAATTCGGATGCTTATTGTATTCAGAACTTTTCCTTGTCACATTCACCTAAATTAGGTATAGAGACGGGAAAGAAACTTTGTATTGGTCCAATTTTAATTGCTCTCCAATTTAGTGGTAGGAAATGGAACGGTTAATGTTTTTAGCTATGTAAAGTCTCTAAAACTCCATTTGAATGTGTCAATGACTCAATGCCATTCCCAATACTTTAGTTTATGGGGCTTTGCAGTTTTCCTACTCTGTAAACGTACAGCTTATGACTGACTTGGTGGCTCTCTTTATGTGTGTGTGTGTGTGTCTTGAGGCCCTTTTTCTCACTCAGTTTGACACTAAATGCAGGCAGTTGAGGCAACAAAGAACGTGGAGCTTGGAAACAAAGAGCTTTCTCAAGCAATCCAACGAAACAGCAGCAGCAGAACCTTTCTCTTACTGTTTTTCTTCGTCCTTACTTTCTCCGTCTTGTTCTTGGATTGGTACAGTTAAaaaacc
AGGTGATTGTTTTGTTATTATAAATCAAGATCAGTACATATATATTTTTGTTTTTCTTGGTTTCATATGTAATATTTTGGACTTTTGGTGTTTAGGTTTTTGACTTGGAAGAAAAGAACGTAATGGATGAGTCACTACACGAGGTGTATAAATTTTGCCTCACCGATGTTGATGAGAGAAGCAAGAAAGAGACATCAATGAAAGATGATTACATAGAACATAAGAAGTCTACTAGATTGTTGGCTGAAAATGCGAAGAAGTCCGGTCACAGTTTAGAAATATTAAGGCCGGAATCTAAACCTGAGACTGAAAAAGAGGTGATTTTATTTTCTTGTTATATAAAGATTCGTAGACATATATTTGGTTTTTCTTTGGTTTCATAATATTTTGGACTTATGTGTGTTTAGGTCAATGAAGAGGAAGAGAAGAGAGTAATGGATCCGGATGTGGATATTAGTTGTTATGAAGAGTCACCACACGAGGTGTATAAATTTAGCCTCACCGATTTCGAAGAAGAGATAATGGAAGATGATTACAGAGAAGATATGAAGTGTAGAATGTTGGATGATATAGTGAAGAATTCCGGTCACCGTGTAGAAATATCAAGGCCGGAATATTATAAACCTGAGATTGAAAAACAGGTTTTATTTTTTTGGTTATTTTGTGATTAAGATCAGTTTTTTTTTTTTTTTTTTTTGGTTTAATAATATTTGATCTTGTGTGTGTTTAGGTATATGAAAAGGAAGAGAAGAAAGTAATGGATCCGGATATCTATATTAGATCTTATGAAGAGTCACCAAACGAGGTGTATAAATTTAGCCTCACTGATTTGGAAGAAGAGATAATGGAAAATGACTCCATAGAAGGTGTGAAGTGTAGAATGTTGGATGAAATAATGAAGAAGTCCGGTCACCATTTAAAAATATCAAGGCCGGAATATAAACCTGAGATTGAAAAACAGGTTAGTTTTTAATAAAAAGATCACTAGATATTTTTTTTTATTTTTTTTTGTTTTTGGTTTCATAATATTTGACTTGTGGCATGTGTTTAGGTATATGAAGAGGAAGAGAAGAAAGTAATGGATCCAGATGTGGATATTAGATGTTATGAAGAGTCACCACACGAGGTGTCTAAATTTAGCCTCACCGATTTCGAAGAAGAGATAATGGAAGATGATTACATAGAAGCTTTGAAGTGTAGAATGTTGGATGATATATTGAAGAAGTCCGGTCACCGTTTAGAAATATCAAGGCGGCAATATAATAAACCTGAGATTGAAATACAGGTGATTTTTTTTTTTTATTATTGTTGTTATAGTAAGATCAGTAGATATATATCTTGGTTTCATAATATTTTGGACTTGTGTGTGTTTAGGTCAATGAAAAGGAAGAGAAGAAAGTAATCAATACGGATATGGATATTAGATATGATGATGAGTCACCAGAAGAGGTGGAGACATATTCTAGTCTCACGGATGATGAAGAAGAGAGAAGCAAGGAAGATACATCAATGGAAGATGTGAAGTGTAGAATGTTGGATTAAAAAACGACGAAGCTCGGCCACCTTTTAGGAATATCAAGGCCGGAATATAGACCTGAGATTGAAAAACAGGTGATTTTATTTTGTTGTTAATTGTATTAGTAAAGATCAGTAGATATATATTTGTTTTTGTTTTTCGGTTTCATAATATTTTGGACGCTTGTGTTTAGGTCAATGAAGAGAAAGAAAGAAAGTAATGGATATTAGATCTGCTGGTCAGTCACAAACACGAGGTGTACAAATTTAGCCTCACCGATATCAAAGAAGAGAGAAGCAATGAAGATACATCAATGGAAGATTGTTGCATAGAAGAGGCTCAAGTCGGAAAAGATCAAAGAGTCTTCAGATTCAGAGAAAGTAGTGAAGAGAAGAGAAAATCCTCATCATCACCATTATCACCACTAACAGAGTTTAGGGATATGGAGAGTTTGACGTATTACATGAGGCAAAAAGGGATGCATCGAAGAAGAAGAAGATCATCAACATCACCACATTGTTGCCATAATGTAGTATACAATGAGTTTAAAGTGACGAAGGAAGAAGAAGAGGAAGAAAGACAAAGATTAACAACCAAACGTGTTCATTCTAAGCTTCATGAATACGAACAATTTTTAACTCAGTTTAAAAAGAAGAAGGAAGAAGAAAACGAGAGACGAAGATTATCACCCAAAGACTTTGAGCCTACGCTTCCTGATTACGACCAAGTGATTACTCGCTTTAGAGTGCTGGAGAAGGAAGAAGAAGAAAGACGAAGATTAGCAACAAAACATGTTCATCCTAAGCTTCCTGATTACGACCAGATTGCTACTAAGTTTAAACTCCTGAAGGAGGTAGAAAAAGAAAGACGAAGATTATTAACCAAACACAGTTCATCCTAAgcttcc
TGGTAATTTTTGCATCTTCAAAATGTTCTAAAATTTTGGCAAATGGTTTTGTTAAGTTCGAATTTTTGGTTATGATACAGTTTGAACGTTTTTCTTCATAGATTACAGTTTTAGCAAATGTGAATCATTAAAAGTGGAATAGTTGGTTTGAAAACAATTGTCAATTTCATTTTTTTTTTGGTTTTATGGTTAGGCGAGGAAAGCATTAAGAGCTTTGAAAGGTATAGTGAAGCTACAAGCATTAGTGAGAGGATACTTAGTAAGGAAACGCGCGGCCGCAATGTTGCAGAGCATACAAACTTTGATCAGAGTCCAAACCGCTATGCGATCAAAACGCATCAATCGCAGCCTCAACAAAGAGTACAACAACATGTTTCAACCTCGACAATCCTTTGTAAAGAACTATTCTCATTTCCATTGGCTCTCTTTTTTTCTTTAAGCCAAAACAAGACTTAAAGTGTGTCCTCTGTTTGTAGGATAAGTTTGATGAAGCAACGTTCGATGACAGAAGAACAAAGATTGTAGAGAAGGACGATAGATACATGAGAAGATCAAGTTCAAGATCAAGATCTAGACAAGTGCACAATGTTGTTTCAATGTCTGACTATGAAGGCGATTTTGTTTACAAAGGGAATGATTTGGAGTTGTGTTTCTCGGATGAGAAGTGGAAGTTTGCTACCGCGCAGAACACGCCGAGATTATTGCATCACCATTCTGCTAATAATCGCTATTATGTAATGCAGTCTCCAGCTAAGAGTGTTGGTGGAAAGGCTTTGTGTGACTATGAAAGCAGTGTGAGTACTCCTGGCTACATGGAGAAAACTAAGTCCTTTAAGGCAAAAGTGCGTTCACACAGCGCACCGCGCCAGCGATCTGAGAGGCAGAGGTTGTCGCTAGATGAAGTTATGGCCTCTAAGAGTAGCGTTAGCGGTGTGAGTATGTCGCATCAGCATCCACCACGCCATTCTTGTTCCTGTGATCCGCTTTAActtaac
GAGTTAGTAAACAAAGTGTTCACATTTTAGTAAACATTGTTGTTCGTTAATCACGTAACGTTTTGTTTTTCCAGTTTACACTGAGCTCTGATGAGTATATAACGGAGGTGAATGGTTACTACAAAACTACGTTTTCGGGAGAAGTCATAACGTCGTTGACGTTCAAGACGAACAAAAGGACATATGGGACTTACGGAAATAAAACCAGTAGCTACTTTTCTGTTGCCGCACCCAAAGATAACCAGATTGTCGGTTTTCTTGGAAGTAGCAGCCATGCTCTCAACTCCATCGACGCTCATTTTGCCCCTGCTCCTCCTCCTGGTAGCACCGGAGCTAAGCCCGGTGCTAGTGGCATCGGAAGTGATTCTGGTAGCATTGGTAGTGCCGGAACTAACCCTGGTGCTGATGGCACCAGAGAAACCGAAAAAAACGCTGGTGGCTCAAAACCTAGTAGTGGTAGTGCCGGAACTAACCCTGGTGCTAGTGCTGTTGGCAACGGAGAAACCGAAAAAAATGCTGGTGGCTCAAAACCTAGCAGTGGTAGTGCTGGAACTAACCCTGGTGCTAGTGCTGGTGGCAACGGAGAAACCGAAAAAAACGTTGGTGGCTCAAAACCTAGCAGTGGTAAAGCCGGAACTAACCCTGGTGCTAATGCTGGTGGCAACGGAGGAACCGAAAAAAACGCTGGTGGCTCAAAATCTAGCAGTGGTAGTGCTCGAACTAACCCTGGTGCTAGTGCTGGTGGCAACGGAGAAACTGTTTCCAACATTGGAGATACGGAAAGTAACGCTGGTGGCTCGAAAAGTAATGATGGTGCTAACAATGGTGCTAGTGGCATTGAAAGTAATGCTGGTAGCACTGGAACTAACTTTGGTGCTGGTGGCACCGGGGGAATTGGAGATACGGAAAGTGATGCTGGTGGCTCCAAAACTAACTCTGGAAACGGCGGAACTAACGATGGTGCTAGTGGTATTGGAAGTAATGATGGTAGCACTGGAACTAACCCTGGTGCTGGTGGAGGAACAGATTCAAACATCGAAGGTACTGAAAATAACGTTGGTGGCAAGGAAACTAACCCTGGTGCTAGTGGCATTGGAAATAGTGATGGTAGCACTGGAACTAGCCCCGAAGGTACCGAAAGTAACGCTGACGGCACAAAAACTAACACGGGAGGCAAAGAATCTAACACCGGAAGTGAATCCAACACCAATTCTAGTCCACAAAAGTTGGAAGCACAAGGAGGCAATGGAGGAAATCAATGGGACGACGGAACCGATCATGATGGTGTGATGAAGATACATGTTGCAGTTGGTGGTCTAGGAATTGAGCAAATTAGATTTGATTATGTCAAGAACGGACAGTTGAAGGAAGGACCCTTCCACGGTGTCAAAGGAAGAGGTGGCACTTCAACGGTGCGTAAATTTTTATTATTATGGCTCAATTACGTTTTTCGAATAAGTGTTAATTCAAGATTATTGATCTTCATGATTCTGCAGATTGAGATTAGCCATCCGGACGAGTATCTTGTTTCCGTCGAGGGGTTGTACGACTCTTCCAATATCATTCAAGGAATCCAGTTTCAATCCAACAAACACACTTCTCAGTACTTTGGATATGAATATTATGGAGATGGTACACAATTTTCACTTCAAGTTAATGAAAAGAAGATCATTGGTTTCCATGGTTTTGCCGACTCACACCTTAATTCTCTTGGAGCTTATTTCGTTCCAATCTCATCCTCTTCTTCCTCCTTGACTCCTCCTCCCAACAAAGTTAAAGCTCAAGGAGGAAGTTATGGAGAAACATTTGACGATGGTGCTTTCGATCATGTAAGAAAGGTTTATGTTGGTCAAGGTGATTCTGGTGTAGCTTATGTCAAGTTCGATTATGAAAAAGACGGTAAAAAGGAGACACAAGAACATGGAAAAATGACATTGTCAGGAACAGAGGAGTTTGAGGTTGATTCAGACGATTACATAACATCAATGGAGGTTTATGTCGACAAAGTCTACGGTTATAAAAGCGAAATCGTCATTGCTCTTACCTTCAAGACCTTTAAGGGTGAAACTTCTCCACGTTTTGGAATAGAGACTGAGAATAAATATGAAGTTAAAGACGGTAAAGGAGGAAAACTTGCTGGTTTCCATGGAAAAGCTAGCGATGTTCTTTATGCTATTGGTGCTTATTTCATTCCAGCAGCAAATTAGagagtt
ACGTATGTCTTAGTTACTACTATCATACTATATTACTATGTATTGGAAAACTTTTGGTTAGAACCTGTTGGGAGGAAAGGGTTTATGTTCTGGTTCATTTTACGTGTACTAAGTACTTATAATTAAGATTAAAAGAAACATTTACAGCTTCACCCTCTGGTCGATGTATGTGGGCTGTGGGCATGTGGCCAATCTCTGAAGCGTTAGGTAGAGCAAATATAGAGTTGAGAGTTGCTTAAGTTAGTGAACGTGAATGACTAAAAAGATATGTTGCATTTAAATCGTATTGGGCCTCATCCCATCTAAAATATAGTAGGTGTAGGCCTTTTAGGTTAATTTGAATAAAATCAACCTTTTTGTAAGCAACATCGACGATTGTCACATTTTTCTCATACACATAGGTGTAATCTAGCTTTGAATGTTTTCTCATACACATAGGTGTAATCACCGTAATTATCATTTGTGAAGATATATGTTTTACCAAGTGGTTTGTATTGTCCATATATACTTTACCACTTTCATATTAACATATAATGTTTTTGTAAGTATTATACCATAAAGGATTGGTTTCTTAATATTATTAACAAAACGCAAAAATTCTTTTAAACGCAGGCGATTCCAATCCACAGCGTTGCGGTTAGAGTAGGATCAACACAAAGAGTAGTGATGGAGATCATAATCACATTCGCATTGGTCTACACTGTTTACGCCACAGCCATTGACTCCAACAATGGCACTCTCGGAACCATCGCTCCACTTGCTATCAGACTCATCGTTGGTGCTAACATTCTTGCAGCCGGCCCATTCTCTGGTGGTCCAATGAACCCTGGACGTTCTTTTGGATCATCTCTTGCCGTTGGAAATTTTTCAGGACATTAGgtttat
and here is the script I am running:
#!/usr/bin/perl
use strict;
use warnings;
# A program to find the first inframe stop codon of non-spliced intron containing genes
print "ENTER THE FILENAME FOR DNA SEQUENCES:= ";
# Asks for Sequence file and if file does not exist prints error message
my $filename = <STDIN>;
#my $sequence;
my #sequence;
chomp $filename;
unless (open(DNAFILE, $filename) ) {
print "Cannot open file \"$filename\"\n\n";
}
#sequence = <DNAFILE>;
close DNAFILE;
open (FILE, ">AtPTCindex.txt");
my $j;
my $i;
my $codon;
my $stopseq;
my $counter;
#Change $j<(375) to n=number of sequences
for ($j = 0; $j < #sequence; $j ++) {
$counter = 0;
for ($i = 0; $i < (length($sequence[$j]) - 2) && $counter < 1; $i += 3) {
$codon = substr($sequence[$j], $i, 3);
if ($codon =~ m/TAG|TGA|TAA/g) {
# m added before /TAG... above
$stopseq = substr($sequence[$j], $i, 9);
my $result = index($sequence[$j], $stopseq);
$counter = 1;
#my $results = index($sequence[$j], $stopseq);
print FILE "$result \n";
#print FILE "$results $j \n";
}
}
if ($counter == 0) {
print FILE "\n"
}
}
close FILE;
exit;
Thanks so much.
As threatened, the following is a cleaned up version of your script:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
die "Usage: $0 Filename\n" if #ARGV != 1;
my $file = shift;
open my $infh, '<', $file;
open my $outfh, '>', 'AtPTCindex.txt';
while (my $line = <$infh>) {
chomp($line);
my $result = '';
for (my $i = 0; $i < (length($line) - 2); $i += 3) {
my $codon = substr($line, $i, 3);
if ($codon =~ m/TAG|TGA|TAA/) {
# m added before /TAG... above
my $stopseq = substr($line, $i, 9);
$result = index($line, $stopseq);
$result .= " ($i, $codon, $stopseq)";
last;
}
}
print "$result\n";
# print $outfh "$result\n";
# print $outfh "$result $.\n";
}
close $infh;
close $outfh;
For the 5 lines of data that you provided, the following is the output:
84 (84, TGA, TGATCATTC)
3 (3, TGA, TGATTGTTT)
3 (3, TAA, TAATTTTTG)
4 (27, TAG, TAGTAAACA)
123 (123, TAA, TAAGATTAA)
I believe your issue is with these lines:
my $stopseq = substr($line, $i, 9);
$result = index($line, $stopseq);
You're pulling a sequence from the $line at position $i, and then immediately doing an index for it. In the case of 4 of 5 of those lines, it immediately finds the same value $i. However, in the case of line 4, it finds a matching sequence earlier in the line.
If this isn't desired, you'll have to explain what your desired behavior actually is. Perhaps, you just want $i? Or are you looking for a matching stop sequence any point AFTER $i? You'll have to specify what your actual logic wants to be.
I took a different approach, unpacking it into groups of three instead of counting by indexes of three. I believe this script does what you want, and it looks a lot cleaner. It can also optionally take the filename as argument.
#!/usr/bin/perl
use strict;
use warnings;
my $filename = 'a'; # dummy value
my $resultfile = 'AtPTCindex.txt';
# User may have passed filename as arguement
if (#ARGV) { if (-e $ARGV[0]) { $filename = $ARGV[0] } }
unless (-e $filename)
{
print "ENTER THE FILENAME FOR DNA SEQUENCES: ";
chomp($filename = <STDIN>)
}
open DNA,"<$filename" or die "Couldn't open $filename for reading: $!\n";
my #sequence = <DNA> or die "Couldn't read $filename: $!\n";;
close DNA;
# Uncomment the below line if you're braver than me
if (-e $resultfile) { die "Cowardly refusing to write to existing file" }
if (-e $resultfile) { unlink $resultfile };
open RESULT,">>$resultfile" or die "Courdn't open$!\n";
foreach my $string (#sequence)
{
# split into groups of 3
my #groups = unpack "(A3)*", $string;
# Search for the group you want
for (my $groupnum = 0; $groupnum < #groups - 1; $groupnum++)
{
if ($groups[$groupnum] =~ m/(TAG|TGA|TAA)/g)
{
print RESULT (($groupnum + 0) * 3) . "\n";
print "$1 (" . $1 . ( $groups[$groupnum + 1]) . ($groups[$groupnum + 2]) . ") at index " . (($groupnum + 0) * 3) . "\n";
last;
}
}
}
close RESULT;
Running the script on your sample data, it outputs:
TGA (TGATCATTC) at index 84
TGA (TGATTGTTT) at index 3
TAA (TAATTTTTG) at index 3
TAG (TAGTAAACA) at index 27
TAA (TAAGATTAA) at index 123
...as well as writes the raw index numbers to the file specified.

Read CSV file and save in 2 d array

I am trying to read a huge CSV file in 2 D array, there must be a better way to split the line and save it in the 2 D array in one step :s
Cheers
my $j = 0;
while (<IN>)
{
chomp ;
my #cols=();
#cols = split(/,/);
shift(#cols) ; #to remove the first number which is a line header
for(my $i=0; $i<11; $i++)
{
$array[$i][$j] = $cols[$i];
}
$j++;
}
CSV is not trivial. Don't parse it yourself. Use a module like Text::CSV, which will do it correctly and fast.
use strict;
use warnings;
use Text::CSV;
my #data; # 2D array for CSV data
my $file = 'something.csv';
my $csv = Text::CSV->new;
open my $fh, '<', $file or die "Could not open $file: $!";
while( my $row = $csv->getline( $fh ) ) {
shift #$row; # throw away first value
push #data, $row;
}
That will get all your rows nicely in #data, without worrying about parsing CSV yourself.
If you ever find yourself reaching for the C-style for loop, then there's a good chance that your program design can be improved.
while (<IN>) {
chomp;
my #cols = split(/,/);
shift(#cols); #to remove the first number which is a line header
push #array, \#cols;
}
This assumes that you have a CSV file that can be processed with a simple split (i.e. the records contain no embedded commas).
Aside: You can simplify your code with:
my #cols = split /,/;
Your assignment to $array[$col][$row] uses an unusual subscript order; it complicates life.
With your column/row assignment order in the array, I don't think there's a simpler way to do it.
Alternative:
If you were to reverse the order of the subscripts in the array ($array[$row][$col]), you could think about using:
use strict;
use warnings;
my #array;
for (my $j = 0; <>; $j++) # For testing I used <> instead of <IN>
{
chomp;
$array[$j] = [ split /,/ ];
shift #{$array[$j]}; # Remove the line label
}
for (my $i = 0; $i < scalar(#array); $i++)
{
for (my $j = 0; $j < scalar(#{$array[$i]}); $j++)
{
print "array[$i,$j] = $array[$i][$j]\n";
}
}
Sample Data
label1,1,2,3
label2,3,2,1
label3,2,3,1
Sample Output
array[0,0] = 1
array[0,1] = 2
array[0,2] = 3
array[1,0] = 3
array[1,1] = 2
array[1,2] = 1
array[2,0] = 2
array[2,1] = 3
array[2,2] = 1

Resources