Morse Code Decoder in Perl - arrays

I am trying to teach myself Perl and I have been struggling... Last night I did a program to calculate the average of a set of numbers that the user provided in order to learn about lists and user input so today I thought I would do a Morse Code decoder to learn about Hashes. I have looked through the book that I bought and it doesn't really explain hashes very well... it actually doesn't explain a lot of things very well. Any help would be appreciated!
Anyways, I am wanting to write a program that decodes the morse code that the user inputs. So the user would enter:
-.-.
.-
-
...
!
.-.
..-
.-..
.
The exclamation point would signify a separate word. This message would return "Cats Rule" to the user. Below is the code I have so far... Remember.. I have been programming in perl for under 24 hours haha.
Code:
use 5.010;
my %morsecode=(
'.-' =>'A', '-...' =>'B', '-.-.' =>'C', '-..' =>'D',
'.' =>'E', '..-.' =>'F', '--.' =>'G', '....' =>'H',
'..' =>'I', '.---' =>'J', '-.-' =>'K', '.-..' =>'L',
'--' =>'M', '-.' =>'N', '---' =>'O', '.--.' =>'P',
'--.-' =>'Q', '.-.' =>'R', '...' =>'S', '-' =>'T',
'..-' =>'U', '...-' =>'V', '.--' =>'W', '-..-' =>'X',
'-.--' =>'Y', '--..' =>'Z', '.----' =>'1', '..---' =>'2',
'...--' =>'3', '....-' =>'4', '.....' =>'5', '-....' =>'6',
'--...' =>'7', '---..' =>'8', '----.' =>'9', '-----' =>'0',
'.-.-.-'=>'.', '--..--'=>',', '---...'=>':', '..--..'=>'?',
'.----.'=>'\'', '-...-' =>'-', '-..-.' =>'/', '.-..-.'=>'\"'
);
my #k = keys %morsecode;
my #v = values %morsecode;
say "Enter a message in morse code separated by a line. Use the exclamation point (!) to separate words. Hit Control+D to signal the end of input.";
my #message = <STDIN>;
chomp #message;
my $decodedMessage = encode(#message);
sub encode {
foreach #_ {
if (#_ == #k) {
return #k;
#This is where I am confused... I am going to have to add the values to an array, but I don't really know how to go about it.
}
else if(#_ == '!') {return ' '}
else
{
return 'Input is not valid';
}
}
}

Your code contains two syntactic errors: foreach requires a list to iterate over; this means parens. Unlike C and other languages, Perl doesn't support else if (...). Instead, use elsif (...).
Then there are a few semantic mistakes: The current value of an iteration is stored in $_. The array #_ contains the arguments of the call to your function.
Perl comparse strings and numbers differently:
Strings Numbers
eq ==
lt <
gt >
le <=
ge >=
ne !=
cmp <=>
Use the correct operators for the task at hand, in this case, the stringy ones.
(Your code #_ == #k does something, namely using arrays in numeric context. This produces the number of elements, which is subsequenty compared. #_ == '!' is just weird.)
What you really want to do is to map the inputted values to a list of characters. Your hash defines this mapping, but we want to apply it. Perl has a map function, it works like
#out_list = map { ACTION } #in_list;
Inside the action block, the current value is available as $_.
We want our action to look up the appropriate value in the hash, or include an error message if there is no mapping for the input string:
my #letters = map { $morsecode{$_} // "<unknown code $_>" } #message;
This assumes ! is registered as a space in the morsecode hash.
We then make a single string of these letters by joining them with the empty string:
my $translated_message = join "", #letters;
And don't forget to print out the result!
The complete code:
#!/usr/bin/perl
use strict; use warnings; use 5.012;
my %morsecode=(
'.-' =>'A', '-...' =>'B', '-.-.' =>'C', '-..' =>'D',
'.' =>'E', '..-.' =>'F', '--.' =>'G', '....' =>'H',
'..' =>'I', '.---' =>'J', '-.-' =>'K', '.-..' =>'L',
'--' =>'M', '-.' =>'N', '---' =>'O', '.--.' =>'P',
'--.-' =>'Q', '.-.' =>'R', '...' =>'S', '-' =>'T',
'..-' =>'U', '...-' =>'V', '.--' =>'W', '-..-' =>'X',
'-.--' =>'Y', '--..' =>'Z', '.----' =>'1', '..---' =>'2',
'...--' =>'3', '....-' =>'4', '.....' =>'5', '-....' =>'6',
'--...' =>'7', '---..' =>'8', '----.' =>'9', '-----' =>'0',
'.-.-.-'=>'.', '--..--'=>',', '---...'=>':', '..--..'=>'?',
'.----.'=>'\'', '-...-' =>'-', '-..-.' =>'/', '.-..-.'=>'"',
'!' =>' ',
);
say "Please type in your morse message:";
my #codes = <>;
chomp #codes;
my $message = join "", map { $morsecode{$_} // "<unknown code $_>" } #codes;
say "You said:";
say $message;
This produces the desired output.

There's a lot of value in learning the how and why, but here's the what:
sub encode {
my $output;
foreach my $symbol (#_) {
my $letter = $morsecode{$symbol};
die "Don't know how to decode $symbol" unless defined $letter;
$output .= $letter
}
return $output;
}
or even as little as sub encode { join '', map $morsecode{$_}, #_ } if you're not too worried about error-checking. #k and #v aren't needed for anything.

Searching for values in hashes is a very intense job, you are better of by just using a reverse hash. You can easily reverse a hash with the reverse function in Perl. Also, while watching your code I have seen that you will be able to enter lower-case input. But while searching in hashes on keys, this is case-sensitive. So you will need to uppercase your input. Also, I do not really like the way to "end" an STDIN. An exit word/sign would be better and cleaner.
My take on your code
my %morsecode=(
'.-' =>'A', '-...' =>'B', '-.-.' =>'C', '-..' =>'D',
'.' =>'E', '..-.' =>'F', '--.' =>'G', '....' =>'H',
'..' =>'I', '.---' =>'J', '-.-' =>'K', '.-..' =>'L',
'--' =>'M', '-.' =>'N', '---' =>'O', '.--.' =>'P',
'--.-' =>'Q', '.-.' =>'R', '...' =>'S', '-' =>'T',
'..-' =>'U', '...-' =>'V', '.--' =>'W', '-..-' =>'X',
'-.--' =>'Y', '--..' =>'Z', '.----' =>'1', '..---' =>'2',
'...--' =>'3', '....-' =>'4', '.....' =>'5', '-....' =>'6',
'--...' =>'7', '---..' =>'8', '----.' =>'9', '-----' =>'0',
'.-.-.-'=>'.', '--..--'=>',', '---...'=>':', '..--..'=>'?',
'.----.'=>'\'', '-...-' =>'-', '-..-.' =>'/', '.-..-.'=>'\"'
);
my %reversemorse = reverse %morsecode;
print "Enter a message\n";
chomp (my $message = <STDIN>);
print &encode($message);
sub encode{
my $origmsg = shift(#_);
my #letters = split('',$origmsg);
my $morse = '';
foreach $l(#letters)
{
$morse .= $reversemorse{uc($l)}." ";
}
return $morse;
}

Related

ref to a hash -> its member array -> this array's member's value. How to elegantly access and test?

I want to use an expression like
#{ %$hashref{'key_name'}[1]
or
%$hashref{'key_name}->[1]
to get - and then test - the second (index = 1) member of an array (reference) held by my hash as its "key_name" 's value. But, I can not.
This code here is correct (it works), but I would have liked to combine the two lines that I have marked into one single, efficient, perl-elegant line.
foreach my $tag ('doit', 'source', 'dest' ) {
my $exists = exists( $$thisSectionConfig{$tag});
my #tempA = %$thisSectionConfig{$tag} ; #this line
my $non0len = (#tempA[1] =~ /\w+/ ); # and this line
if ( !$exists || !$non0len) {
print STDERR "No complete \"$tag\" ... etc ... \n";
# program exit ...
}
I know you (the general 'you') can elegantly combine these two lines. Could someone tell me how I could do this?
This code it testing a section of a config file that has been read into a $thisSectionConfig reference-to-a-hash by Config::Simple. Each config file key=value pair then is (I looked with datadumper) held as a two-member array: [0] is the key, [1] is the value. The $tag 's are configuration settings that must be present in the config file sections being processed by this code snippet.
Thank you for any help.
You should read about Arrow operator(->). I guess you want something like this:
foreach my $tag ('doit', 'source', 'dest') {
if(exists $thisSectionConfig -> {$tag}){
my $non0len = ($thisSectionConfig -> {$tag} -> [1] =~ /(\w+)/) ;
}
else {
print STDERR "No complete \"$tag\" ... etc ... \n";
# program exit ...
}

I want to use loop in regular expression to increase the reg ex everytime is search match and add this match in new array

I have an array as #versions with various versions ex- 6.1.1.1; 6.2.22.123; 2.12.1.123; 2.1.11.11, etc.
I have to make classes of all these specific format versions like- An array with version whose format is 1.1.1.1, 1.1.11.1, 1.1.1.111 and so on
My code is:
my #version1=();
foreach my $values (#versions)
{
if( $values=~ /(.?[0-9]{1}.[0-9]{1}.[0-9]{1}.[0-9]{1}$)/) # Regular expression to find version ex-1.1.1.1 at the end of the line($).
{
push (#version1,$values);
}
};
print "Version1 values are: #version1\n";
my #version2=();
foreach my $values (#versions)
{
if( $values=~ /(.?[0-9]{1}.[0-9]{1}.[0-9]{1}.[0-9]{2}$)/) # Regular expression to find version ex-1.1.1.11 at the end of the line($).
{
push (#version2,$values);
}
};
#print "Version2 values are: #version2\n";
It gives me the result but I cannot create so many regular expression, so I need to use looping.
Use the + quantifier to mean 1 or more digits:
/(.?[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$)/

Perl: Filtering through an Array to Make a New Array

I'm trying to filter an array of a delimited text file in my program. The array from this text file looks like this:
YCL049C 1 511.2465 0 0 MFSK
YCL049C 2 4422.3098 0 0 YLVTASSLFVALT
YCL049C 3 1131.5600 0 0 DFYQVSFVK
YCL049C 4 1911.0213 0 0 SIAPAIVNSSVIFHDVSR
YCL049C 5 774.4059 0 0 GVAMGNVK
..
.
and the code I have for this section of the program is:
my #msfile_filtered;
my $msline;
foreach $msline (#msfile) {
my ($name, $pnum, $m2c, $charge, $missed, $sequence) = split (" ", $msline);
if (defined $amino) {
if ($amino =~ /$sequence/i) {
push (#msfile_filtered, $msline);
}
}
else {
push (#msfile_filtered, $msline);
}
}
$amino will just be a letter that will be input by the user, and corresponds to the last field $sequence. It is not essential that the user actually inputs $amino, so I need to duplicate this array and keep it unchanged if this is the case (hence the else statement). At the minute the #msfile_filtered array is empty, but I am unsure why, any ideas?
EDIT: just to clarify, there is only one space between each field, I copy and pasted this from notpad++, so extra spaced were added. The file itself will only have one space between fields.
Thanks in advance!
The regex that tries to find matching rows is backwards. To find a needle in a haystack, you need to write $haystack =~ /needle/, not the other way around.
Also, to simplify your logic, if $amino is undef, skip the loop entirely. I would rewrite your code as follows:
if (defined $amino)
{
foreach $msline (#msfile)
{
my ($name, $pnum, $m2c, $charge, $missed, $sequence) = split(" ", $msline);
push #msfile_filtered, $msline if ($sequence =~ /$amino/i);
}
} else
{
#msfile_filtered = #msfile;
}
You could simplify this further down to a single grep statement, but that begins to get hard to read. An example of such a line might be:
#msfile_filtered =
defined $amino
? grep { ( split(" ", $_ ) )[5] =~ /$amino/i } #msfile
: #msfile;
The split is should take more than one whitespaces, and the regex vars are vice versa.
First debug to check that values are correct after the split.
Also, you must swap your regex variables like this:
if ($sequence =~ /$amino/i) {
Now you're checking if $amino contains $sequence, which obviously it doesn't

Perl: Search a pattern across array elements

I am a Perl newbie, stuck with another bioinformatics problem that requires some help and input.
The problem in brief:
I have a file, which has over 40,000 unique DNA sequences. By unique, I mean unique sequence id. I am attaching a portion of it at the end of my post to help you show what it looks like.
I need to divide each of the 40,000 sequences into 3 parts. So if a particular sequence is 999 character long, each of the 3 parts would have 333 characters.
I need to look for the following pattern through each of the 3 individual parts:
$gpat = [G]{3,5};
$npat = [A-Z]{1,25};
$pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
If $pattern appears in the first of the 3 parts, increase the counter of 'beginning', if $pattern occurs in the 2nd of the 3 parts, increase counter of 'middle' and lastly if the $pattern appears in the 3rd part, increase counter of 'end'.
Print the counters of 'beginning','middle' and 'end' i.e basically summation of 'beginning','middle','end' for each of the sequences.
Say in 1st sequence, the values are like '2','5','3' respectively and in 2nd sequence, the values are '4','1','6', the final count should be '7,6,9'.
The issues I am having:
If a particular sequence is split into 3 parts, potential $pattern is lost. eg say on a sequence like :
gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta
a split into 3 parts produces following 3 sub-parts,each of 35 character length:
gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta
Hence, $pattern gets split into the first 2 parts. Is there anyway to say "If $pattern begins in 1st part and ends in 2nd part", increase count of "beginning" ?
##UPDATE## The following issue has been resolved thanks to the code suggested by Cupidvogel
2.How do I divide a sequence into 3 parts if its length is not divisible by 3? I tried using int, but then the last part is 1-2
characters short.
The following is the code I have written so far.
It reads in the file, displays the header name and sequence, the length into which each sequence will be divided and finally the sequence split into 3 parts which works fine provided the sequence length is divisible by 3, for those which aren't, the final 3rd part is 1-2 characters short.
#Take Filename from user
print "Please enter file name : ";
$in =<>;
chomp $in;
open (FASTA,"$in") or die ;
while (<FASTA>)
{
$/=">";
#array = split '\n', $_;
$header=shift #array; # Header of the fasta sequence
print "\n\nNext sequence: \n";
print $header,"\n";
$seq= join '', #array; # sequence
$seq=~s/\s//g;
$seq=~s/\*//g;
$seq=~s/>//g;
print $seq,"\n\n";
$num = int(length($seq)/3);
#arr = unpack("A$num A$num A*",$seq);
print " New method gives this :", #arr;
print "\nThe first element is :", $arr[0];
print "\nThe second element is :",$arr[1];
print "\nThe third element is :",$arr[2] ;
#The following lines of code were originally written to split...
#...the sequence into 3 parts, albeit unsuccessfully
#my $split = (length $seq)/3;
#print $split,"\n\n";
#my $int = int $split;
#print $int,"\n\n";
#my #array2 = $seq =~ /(.{$int})/g;
#print join (" ", #array2),"\n\n";
#print $array2[0],"\n",$array2[1],"\n",$array2[2];
}
exit;
I have been trying the code I have written so far with the following sample file : sample.fa
>ABC_123 2
atgtcgatcgatcggcgggcatgcgcgcgcggatg
atatatagcgcgcgctatatagcgcgactctacgc
atgctgctgactagctatagtcgctgactgcgcgt
gggaaaaagggcccgggccccgttttggggatcta
ggggatagctgatgctagcatgcatgctgactgca
>DEF_456 4
gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta
>GHI_789 1
atagctgctagtcgatcggcgcgggtatcgatcgg
ggatcgatcgatcggggatcgatcgggggatcgat
The actual input file looks like the following:
>NR_037701 1
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
gctccctcttttaaagattttccttccctctttccaactccctgggtcct
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
cacagactcaaaccctctctcacacacatacacatatacattgttattcc
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
agggttgggacttcaacacagctttttgggggatcataattcaacccatg
acagccactgagattattatatctccagagaataaatgtgtggagttaaa
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
>NM_198399 1
aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
caatagcagaggaaggagggactgagcaggagacggccactccagagaac
ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
actgcatatttttacccttatttttgctccttacagcaagattagtaggt
tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
attgattaagatatcattatttttgtttggtttggttttgcttttttcct
cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
aaaaaaaaaaaaa
>NR_026816 1
caacccactctctgtgctatgacttcattactctttcccagcccagccct
gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
ccagagccaaggctacagctagagagttgactcctctatttgagattgac
aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
>NR_027917 1
atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
ggtgtag
>NR_002777 3
cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
agcctaatagagaacataaaattctaaaagataaagataataataatgat
aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
actcataggacataaaaaaaaaaaaaa
>NR_033769 1
ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc
agctcagccagccacagaactggaatttttcaggagcagggggagcatgg
agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt
aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc
ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg
atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta
cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga
acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga
ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa
tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca
gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag
tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt
gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc
cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg
tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt
ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca
cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat
ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg
tggagctggtgcctccagagagccctttgatccagctcttcttggagaga
gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt
tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc
tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca
aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc
aagttaggtttatagagtttgactagttttttcgattagatttgtattag
ttataaatttgttcatagagtttgactaattttttcgattagatttgtat
ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga
ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa
taacagctattgttttgcatatccactgcaggccaagcactttcagcatc
atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg
gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta
tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc
atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa
aaagaaaagaaaaacagagacggtc
>NM_016326 3
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag
cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc
aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt
tttgtaattttatattactttttagtttgatactaagtattaaacatatt
tctgtattcttccacatattttctgcagttattttaactcagtataggag
ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta
acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg
gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga
atctctttactgcctggctggccggcagctccg
>NM_181641 2
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac
ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag
cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa
aaaagaagttttgtaattttatattactttttagtttgatactaagtatt
aaacatatttctgtattcttccacatattttctgcagttattttaactca
gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc
actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc
ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg
gatctttgaatctctttactgcctggctggccggcagctccg
>NM_001144931 1
gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt
ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg
ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg
ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct
gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc
acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag
gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc
agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc
atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg
ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa
cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact
ttgggaggccgag
>NR_029429 1
ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat
ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa
caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct
atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac
actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag
tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc
tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg
caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta
cttcacctgccagccaacccgccagtattgtggagagctcatccttggag
gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc
cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag
gccactggcttgtgctctgagggttgccaggccattgtggataccgagac
cttcctgc
>NR_026551 1
tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc
taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta
ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc
gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg
gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg
acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg
gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg
cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg
gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt
gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa
agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag
aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc
cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga
gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag
gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc
tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg
cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc
cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct
gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct
gagcagagccctcactcccaggcagagttgtctgaatccttcct
>NM_181640 2
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg
taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa
accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt
atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc
ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg
taattttatattactttttagtttgatactaagtattaaacatatttctg
tattcttccacatattttctgcagttattttaactcagtataggagctag
aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt
ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct
gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct
ctttactgcctggctggccggcagctccg
>NM_016951 3
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
acttgatcgattaatgaagtggttattttggcctttgcttgatattatca
actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg
ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt
gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc
tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa
gaagttttgtaattttatattactttttagtttgatactaagtattaaac
atatttctgtattcttccacatattttctgcagttattttaactcagtat
aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta
ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca
cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc
tttgaatctctttactgcctggctggccggcagctccg
>NR_002773 1
cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca
ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa
ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta
ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct
gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca
ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga
gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc
tttctgacccagcagctggggccagggctggtggatgcagcccaggccca
gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg
ctgcagccctggctcacttggacagggggagccccccacctgcccgggag
gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga
gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg
tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc
caagagtacctggacatagaccagatgatcttcgacagagagctgcccca
ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga
acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg
gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct
gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg
cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc
ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct
gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc
ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc
agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg
cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg
agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga
ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg
cttgggccacttctccacgcccctgacccatggggtggactgcccctacc
tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag
acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct
gcggcgacaccactcagatctctactcccactactttgggggccttgcgg
aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat
gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca
caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt
atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc
gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag
gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag
ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg
ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct
ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg
gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat
cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg
cactccatcctgcgtgactgaac
>NR_037806 1
attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg
ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt
ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac
cctccagtgttggcacaggcccacccctggctccaccagagccagaagca
gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc
catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag
agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag
ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca
gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg
gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt
gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag
gattaaaatcacactaataacccctggatggtcaatctgataataggatc
agatttacgtctaccctaattcttaacattgcagctttctctccatctgc
agattattcccagtctcccagtaacacgtttctacccagatcctttttca
tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag
aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg
ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa
gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct
tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc
tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg
aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa
ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg
ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc
ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag
tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc
aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg
agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa
agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa
gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt
attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa
atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt
gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg
gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg
ggattagctctg
Any help and input would be deeply appreciated.
Thank you for taking the time to go through my problem!
Rather than splitting the sequence into three parts, the way I see this working is to find all occurrences of $pattern in the complete sequence and determine in which third the pattern starts.
The built-in variable $-[0] contains the offset of the start of the most recent successful match.
The code below does what I think you want. It works by accumulating each sequence (which ends either when a new sequence ID is found or the end of file is reached) and passing it to the process_seq subroutine.
The subroutine takes the length of the sequence and caclulates the offset of the end of each third of the string. The idiomatic sprintf '%.0f', $value is used to round fractional values to the nearest character position.
The #counts array is adjusted for each occurrence of $regex in the sequence. The element of #counts to be incremented is established by comparing the starting position of the match in $-[0] with the end offset of each of the three segments of the sequence.
Once each sequence has been processed the values in #counts are accumulated into #totals to give overall figures for all sequences.
The output of the program when using your sample data is shown. The grand total is (9, 1, 6).
use strict;
use warnings;
my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;
open my $fh, '<', 'sequences.txt' or die $!;
my ($id, $seq);
my #totals = (0, 0, 0);
while (<$fh>) {
chomp;
if (/^>(\w+)/) {
process_seq($seq) if $id;
$id = $1;
$seq = '';
print "$id\n";
}
elsif ($id) {
$seq .= $_;
process_seq($seq) if eof;
}
}
print "Total: #totals\n";
sub process_seq {
my $sequence = shift;
my $length = length $sequence;
my #offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;
my #counts = (0, 0, 0);
while ($sequence =~ /$regex/g) {
my $place = $-[0];
for my $i (0..2) {
next if $place >= $offsets[$i];
$counts[$i]++;
last;
}
}
print "#counts\n\n";
$totals[$_] += $counts[$_] for 0..2;
}
output
NR_037701
0 0 1
NM_198399
1 0 0
NR_026816
1 0 1
NR_027917
0 0 0
NR_002777
0 0 0
NR_033769
1 0 0
NM_016326
1 0 1
NM_181641
1 0 1
NM_001144931
0 0 0
NR_029429
0 1 0
NR_026551
1 0 0
NM_181640
1 0 1
NM_016951
1 0 1
NR_002773
1 0 0
NR_037806
0 0 0
Total: 9 1 6
I lifted Borodin's process_seq function but used Bio:SeqIO to read in the file sequence by sequence, an advantage over manually reading line by line and the logic to determine various processing. I believe those advantages are:
Code that has been developed and tested by many others
Whenever possible, if output is done via the Bio::SeqIO module, the result file can then be read using Bio::SeqIO read (next_seq) method.
Other reasons I can't think of now :-)
I imagine the BioPerl package of Bio Genetic code modules must be overwhelming to a biologist beginning programming. He might not be willing to try to dig out the information he needs to begin building a program. BioPerl wiki is a good starting place, especially the Howto section, and then there's a how to for beginners and others. You'll find code examples which are mostly(?) helpful. Bio::Seq has some good code examples in the beginning and is where most of the general sequence functions are. Also, for input/output, the Bio::SeqIO module is used and it has examples at the beginning of it's manual.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;
my $in = Bio::SeqIO->new ( -file => "fasta_dat.txt",
-format => 'fasta');
my #totals;
while ( my $seq = $in->next_seq() ) {
process($seq);
}
print "Totals: ";
print "#totals\n";
sub process {
my $seq = shift;
my #offset = map {sprintf '%.0f', $seq->length * $_ / 3} 1..3;
my $sequence = $seq->seq;
my #count = (0,0,0);
while ($sequence =~ /$regex/g) {
my $place = $-[0];
for my $i (0 .. 2) {
next if $place >= $offset[$i];
$count[$i]++;
last;
}
}
print $seq->id, "\n#count\n";
$totals[$_] += $count[$_] for 0 .. $#count;
}

Array not being overwritten

This is a loop and the array does not get overwritten, it just grows.
for ( $j=0; $j <=$#just_ecps ; $j++){
print "$just_ecps[$j]\n";
for ($x=0; $x<=$#folder_dates ; $x++){
my $archivo_histo = "/home/ha2_d11/data/ibprod/archive/$folder_dates[$x]/$just_ecps[$j]/ghistogram.gz";
next unless (-r $archivo_histo);
open(FILEHANDLE, "gunzip -c $archivo_histo |") or die ("could not open file $archivo_histo");
while (<FILEHANDLE>) {
if ($_ =~ /ave:\s+(\d+\.\d+)\s/){
push ( #ecp_average , $1);
sleep 1;
}
print "#ecp_average\n";
}
}
In every instance it is the last three values that are valid, everything before it is a duplicate. I need to get rid of the duplicates and just keep the last three values.
Eislnd1
0.00420743 0.00414601 0.0044511
Eislnd2
0.00420743 0.00414601 0.0044511 0.00303575 0.00309721 0.00302753
Eislnd3
0.00420743 0.00414601 0.0044511 0.00303575 0.00309721 0.00302753 0.0031753 0.00324729 0.00295381
Eislnd4
0.00420743 0.00414601 0.0044511 0.00303575 0.00309721 0.00302753 0.0031753 0.00324729 0.00295381 0.00324191 0.00344244 0.00311481
You need to clear the array for every file:
open(FILEHANDLE, "gunzip -c $archivo_histo |") or die ...
#ecp_average = ();
while (<FILEHANDLE>) {
...
}
At some point you'll want to read up on lexically-scoped variables (i.e. the my declaration)
but for now this should work.
Things to note that will help improve this code:
use strict; use warnings;
Consider using for my $just_ecp ( #just_ecps ) { ... } instead of the C-style construct
Defining an averaging subroutine ( sub avg { sum(#_) / #_ } ) and using it ( avg( #ecp ) ) is a more intuitive way to compute averages instead of mucking around with array lengths. All it takes one line!

Resources