Bison yyerror ignore next token on grammar - c

So I'm having a litle problem on my Bison program, first be aware that my program works and is fully function what you see here is just an exerpt and sufficiant to help me.
basicaly the parser reads "vector(x,2)" the program is checking if the variable "x" already exists and if the number of elements"2" are greater then 1. I'm throwing yyerror when this situations dont happen. the problem is that the error is not ignoring the rest of the token. Let me show an example of vector(x,1), you will see that it should ignore the statement when reducing but is not.
Starting parse
Entering state 0
Stack now 0
Reducing stack by rule 1 (line 58):
-> $$ = nterm PROGRAM (1.1: )
Entering state 1
...
....
Entering state 24
Stack now 0 1 5 12 15 18 24
Next token is token PARENTSIS_RIGHT (1.1: )
Shifting token PARENTSIS_RIGHT (1.1: )
Entering state 33
Stack now 0 1 5 12 15 18 24 33
Reducing stack by rule 10 (line 87):
$1 = token RESEV_VAR (1.1: )
$2 = token PARENTSIS_LEFT (1.1: )
$3 = token STRING (1.1: )
$4 = token COMMA (1.1: )
$5 = nterm ELEMENTS (1.1: )
$6 = token PARENTSIS_RIGHT (1.1: )
Erro: blalalala.
-> $$ = nterm DEFINE_VECTOR (1.1: ) //here it should not run this one should skip it
Entering state 9
Stack now 0 1 9
Reducing stack by rule 6 (line 65):
$1 = nterm DEFINE_VECTOR (1.1: )
VECTOR
-> $$ = nterm DECLARACAO (1.1: )
As you can see at the end he is running the reduction to DEFINE_VECTOR when i specify error.
%start PROGRAM
%%
PROGRAM:
| PROGRAM STATMENT
| PROGRAM error STATMENT
;
STATMENT: QUIT { fclose(yyin); printf("\nBYE\n"); exit('0');}
| DEFINE_VAR { printf("\tVAR\n"); }
| DEFINE_VECTOR { printf("\tVAR\n"); }
;
...
DEFINE_VECTOR: RESEV_VAR PARENTSIS_LEFT STRING COMMA ELEMENT PARENTSIS_RIGHT {
if($5<=1){
yyerror("Vector to small");
}else{
if(check($3)==0){
createVar($3, 2, $5);
}else{
yyerror("Variable alrady exists");
}
}
}
;
ELEMENTS: ELMENTINT
| ELMENTFLOAT{ $$ = $1;}
;
SOLUTION:
%start PROGRAM
%%
PROGRAM:
| PROGRAM STATMENT
| PROGRAM error STATMENT
| PROGRAM error PARENTSIS_RIGHT STATMENT // added here
;
STATMENT: QUIT { fclose(yyin); printf("\nBYE\n"); exit('0');}
| DEFINE_VAR { printf("\tVAR\n"); }
| DEFINE_VECTOR { printf("\tVAR\n"); }
;
...
DEFINE_VECTOR: RESEV_VAR PARENTSIS_LEFT STRING COMMA ELEMENT PARENTSIS_RIGHT {
if($5<=1){
yyerror("Vector to small");
YYERROR; //added here
}else{
if(check($3)==0){
createVar($3, 2, $5);
}else{
yyerror("Variable alrady exists");
YYERROR; added here
}
}
}
;
ELEMENTS: ELMENTINT
| ELMENTFLOAT{ $$ = $1;}
;

yyerror is just a function that prints an error message. It is called by the bison parser when it has a syntax error, but you can also call it elsewhere to print an error. It has no effect on the parsing (it doesn't interrupt or change it -- it does not "throw")
If you want to affect the parsing, you need to use YYERROR (a macro that triggers a parser error, which the parser will then try to recover from). However, if you have a semantic error like this (not a syntax error), you probably don't want to do this, as there's no syntax to recover. You might want to use YYABORT to abort the parse immediately (return from yyparse). Or you might want to just continue parsing, to possibly identify other errors.

Related

Antlr4 parsing templateLiteral in jsx

I have tried to use grammar defined in grammars-v4 project(https://github.com/antlr/grammars-v4/tree/master/javascript/jsx) to parse jsx file.
When I parse the code snippet below,
let str =
`${dsName}${parameterStr ? `( ${parameterStr} )` : ""}${returns ? `{
${returns}}` : ""}`;
it shows the following error
line 2:32 at [#8,42:42='(',<8>,2:32]:no viable alternative at input '('
https://astexplorer.net/ shows it is a TemplateLiteral with conditionalExpression inside, any thoughts for parsing such grammar?
Thanks in advance.
Edit:
Thanks #Bart, it works.
It works fine on the following code.
const href1 = `https://example.com/lib/downloads_${page}.htm?joinSource=${joinSource}${inviter?`&inviter=${inviter}`: ''}`;
const href2 = `https://example.com/act/kol/detail_${this.props.values.list.res.rows && this.props.values.list.res.rows[0].id}_${page}.htm?joinSource=${joinSource}${inviter?`&inviter=${inviter}`: ''}`;
Given Mike's observation that it's the template string that is messing things up, here's a quick solution:
Changes:
JavaScriptLexerBase.java
Add the instance variable:
// Keeps track of the the current depth of nested template string backticks.
// E.g. after the X in:
//
// `${a ? `${X
//
// templateDepth will be 2. This variable is needed to determine if a `}` is a
// plain CloseBrace, or one that closes an expression inside a template string.
protected int templateDepth = 0;
JavaScriptLexer.g4
// Place TemplateCloseBrace above CloseBrace!
TemplateCloseBrace: {this.templateDepth > 0}? '}' -> popMode;
...
// Remove (or comment) TemplateStringLiteral
// TemplateStringLiteral: '`' ('\\`' | ~'`')* '`';
BackTick
: '`' {this.templateDepth++;} -> pushMode(TEMPLATE)
;
...
// Place at the end of the file:
mode TEMPLATE;
BackTickInside
: '`' {this.templateDepth--;} -> type(BackTick), popMode
;
TemplateStringStartExpression
: '${' -> pushMode(DEFAULT_MODE)
;
TemplateStringAtom
: ~[`]
;
JavaScriptParser.g4
singleExpression
: ...
| singleExpression templateStringLiteral # TemplateStringExpression // ECMAScript 6
| ...
;
literal
: ...
| templateStringLiteral
| ...
;
templateStringLiteral
: BackTick templateStringAtom* BackTick
;
templateStringAtom
: TemplateStringAtom
| TemplateStringStartExpression singleExpression TemplateCloseBrace
;
I did not fully test my solution, but it parses your example input successfully:
A quick look at the ANTLR grammar shows:
// TODO: `${`tmp`}`
TemplateStringLiteral: '`' ('\\`' | ~'`')* '`'
So, apparently, it’s on the “TODO List”. Allowing nested “`” string interpolation is surprising (to say the least), and will be “non-trivial” for the ANTLR Lexer (So, I would not "hold my breather" waiting on a fix for this).
Have you tried: (replacing the nested “`”s with “‘s?).
let str =
`${dsName}${parameterStr ? “( ${parameterStr} )” : ""}${returns ? “{
${returns}}” : ""}`;

Perl, grouping Array of Array element based on one column and condition

I have an AoA construct with four columns and many rows. Following is an example of data (input).
DQ556929 103480190 103480214 154943
DQ540839 103325247 103325275 2484
DQ566549 103322763 103322792 99
DQ699634 103322664 103322694 0
DQ544472 103322664 103322692 373
DQ709105 103322291 103322318 46
DQ705937 103322245 103322273 486
DQ699398 103321759 103321788 1211
DQ710151 103320548 103320577 692251
DQ548430 102628297 102628326 1
DQ558403 102628296 102628321 855795
DQ692476 101772501 101772529 481463
DQ544274 101291038 101291068 484047
DQ723982 100806991 100807020 1
DQ709023 100806990 100807020 3
DQ712307 100806987 100807014 0
DQ709654 100806987 100807012 571051
DQ707370 100235936 100235962 1481849
I want to group and write into a file all the row elements (sequentially).
Conditions are if column four values less than 1000 and minimum two values are next to each other, group them else if the value less than 1000 and lies between the values more than 1000 treat them as single and append separately in the same file and the values which are more than 1000 also write as a block but with out affecting the order of the 2nd and third column.
This file is output of my previous program, now for this I have tried implementing my hands but getting some weird results. Here is my chunk of code, but non functional. Guys I need just help if i am executing my logic well here, I am open for any comments as a beginner. And also correct me anywhere.
my #dataf= sort{ $a->[1]<=> $b->[1]} #data;
#dataf=reverse #dataf;
for(my $i>=0;$i<=$#Start;$i++)
{
print "$sortStart[$i]\n";
my $diff = $sortStart[$i] - $sortStart[$i+1];
$dataf[$i][3]= $diff;
# $IDdiff{$ID[$i]}=$diff;
}
#print Dumper(#dataf);
open (CLUST, ">> ./clustTest.txt" );
for (my $k=0;$k<=$#Start;$k++)
{
for (my $l=0;$l<=3;$l++)
{
# my $tempdataf = shift $dataf[$k][$l];
# print $tempdataf;
if ($dataf[$k][3]<=1000)
{
$flag = 1;
do
{
print CLUST"----- Cluster $clustNo -----\n";
print CLUST"$dataf[$k][$l]\t";
if ($dataf[$k][3]<=1000)
{
$flag1 = 1;
}else {$flag1=0;}
$clustNo++;
}until($flag1==0 && $data[$k][3] > 1000);
if($flag1==0 && $data[$k][3] > 1000)
{
print CLUST"Singlet \n";
print CLUST"$dataf[$k][$l]\t";
next;
}
#print CLUST"$dataf[$k][$l]\t"; ##IDdiff
}
print CLUST"\n";
}
}
Expected output in file:
Singlets
DQ556929 103480190 103480214 154943
DQ540839 103325247 103325275 2484
Cluster1
DQ566549 103322763 103322792 99
DQ699634 103322664 103322694 0
DQ544472 103322664 103322692 373
DQ709105 103322291 103322318 46
DQ705937 103322245 103322273 486
Singlets
DQ699398 103321759 103321788 1211
DQ710151 103320548 103320577 692251
DQ548430 102628297 102628326 1
DQ558403 102628296 102628321 855795
DQ692476 101772501 101772529 481463
DQ544274 101291038 101291068 484047
Cluster2
DQ723982 100806991 100807020 1
DQ709023 100806990 100807020 3
DQ712307 100806987 100807014 0
Singlets
DQ709654 100806987 100807012 571051
DQ707370 100235936 100235962 1481849
This seems to produce the expected output. I'm not sure I understood the specification correctly, so there might be errors and edge cases.
How it works: it remembers what kind of section it's currently outputting ($section, Singlet or Cluster). It accumulates lines in the #cluster array if they belong together, when an incompatible line arrives, the cluster is printed and a new one is started. If the cluster to print has only one member, it's treated as a singlet.
#!/usr/bin/perl
use warnings;
use strict;
my $section = q();
my #cluster;
my $cluster_count = 1;
sub output {
if (#cluster > 1) {
print "Cluster$cluster_count\n";
$cluster_count++;
} elsif (1 == #cluster) {
print $section = 'Singlet', "s\n" unless 'Singlet' eq $section;
}
print for #cluster;
#cluster = ();
}
my $last = 'INF';
while (<>) {
my ($id, $from, $to, $value) = split;
if ($value > 1000 || 1000 < abs($last - $from)) {
output();
} else {
$section = 'Cluster';
}
push #cluster, $_;
$last = $to;
}
output();

javacc C grammar and C "Bit fields" ; ParseException

I'm trying to use this javacc grammar https://java.net/downloads/javacc/contrib/grammars/C.jj to parse a C code containing bit fields
struct T{
int w:2;
};
struct T a;
The generated parser cannot parse this code:
$ javacc -DEBUG_PARSER=true C.jj && javac CParser.java && gcc -E input.c | java CParser
Java Compiler Compiler Version 5.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file C.jj . . .
File "TokenMgrError.java" is being rebuilt.
File "ParseException.java" is being rebuilt.
File "Token.java" is being rebuilt.
File "SimpleCharStream.java" is being rebuilt.
Parser generated successfully.
Note: CParser.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
C Parser Version 0.1Alpha: Reading from standard input . . .
Call: TranslationUnit
Call: ExternalDeclaration
Call: Declaration
Call: DeclarationSpecifiers
Call: TypeSpecifier
Call: StructOrUnionSpecifier
Call: StructOrUnion
Consumed token: <"struct" at line 5 column 1>
Return: StructOrUnion
Consumed token: <<IDENTIFIER>: "T" at line 5 column 8>
Consumed token: <"{" at line 5 column 9>
Call: StructDeclarationList
Call: StructDeclaration
Call: SpecifierQualifierList
Call: TypeSpecifier
Consumed token: <"int" at line 6 column 2>
Return: TypeSpecifier
Return: SpecifierQualifierList
Call: StructDeclaratorList
Call: StructDeclarator
Call: Declarator
Call: DirectDeclarator
Consumed token: <<IDENTIFIER>: "w" at line 6 column 6>
Return: DirectDeclarator
Return: Declarator
Return: StructDeclarator
Return: StructDeclaratorList
Return: StructDeclaration
Return: StructDeclarationList
Return: StructOrUnionSpecifier
Return: TypeSpecifier
Return: DeclarationSpecifiers
Return: Declaration
Return: ExternalDeclaration
Return: TranslationUnit
C Parser Version 0.1Alpha: Encountered errors during parse.
ParseException: Encountered " ":" ": "" at line 6, column 7.
Was expecting one of:
";" ...
"," ...
"(" ...
"[" ...
"[" ...
"(" ...
"(" ...
"," ...
";" ...
";" ...
";" ...
"[" ...
"(" ...
"(" ...
at CParser.generateParseException(CParser.java:4279)
at CParser.jj_consume_token(CParser.java:4154)
at CParser.StructDeclaration(CParser.java:433)
at CParser.StructDeclarationList(CParser.java:372)
at CParser.StructOrUnionSpecifier(CParser.java:328)
at CParser.TypeSpecifier(CParser.java:274)
at CParser.DeclarationSpecifiers(CParser.java:182)
at CParser.Declaration(CParser.java:129)
at CParser.ExternalDeclaration(CParser.java:96)
at CParser.TranslationUnit(CParser.java:77)
at CParser.main(CParser.java:63)
I tried to change (line 245)
(...) LOOKAHEAD( { isType(getToken(1).image) } )TypedefName() )
to
LOOKAHEAD( { isType(getToken(1).image) } ) TypedefName2()
(...)
void TypedefName2() : {}
{
TypedefName() (LOOKAHEAD(2) ":" <INTEGER_LITERAL> )?
}
but it doesn't work (same error) .
Is there a simple way to fix the javaCC grammar to handle Bit Fields ?
Try fixing this by modifying the StructDeclarator() rule on lines 310..313 as follows:
void StructDeclarator() : {}
{
( Declarator() [ ":" ConstantExpression() ] | ":" ConstantExpression() )
}
The idea is to remove the need of lookahead by letting the parser make a decision by checking if the struct declarator starts with a colon ":".

how to write a antlr parser for samba configuration file

I want to create a parser to a document very similar to the following samba configuration file. it has many sections, every section has a header line, which start with [ followed by a keyword section name, e.g. global, share_name, etc., till the end of line. followed the section header line is the parameters for this section. We don't know the end of a section till we reach the beginning of another section new line [.., how can I write a rule for this kind of doc? All antlr examples I found knows exactly when start a section and when to end a section. Thanks a lot!
[global]
netbios name = NETBIOS_NAME
workgroup = WORKGROUP
security = user
[SHARE_NAME]
comment = COMMENT
force create mode = 0770
locking = yes
[printers]
comment = COMMENT
path = /var/spool/samba
browseable = No
Here is my grammar:
grammar SambaConfiguration;
file : global_section
share_name_section
printer_section
EOF
;
global_section
: SECTION_TAG_START GLOBAL_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
(~SECTION_TAG_START (.)* NEW_LINE)*
;
share_name_section
: SECTION_TAG_START SHARE_NAME_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
((~SECTION_TAG_START) (.)* NEW_LINE)*
;
printer_section
: SECTION_TAG_START PRINTER_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
((~SECTION_TAG_START) (.)* NEW_LINE)*
;
SECTION_TAG_START
: '['
;
SECTION_TAG_END
: ']'
;
GLOBAL_SECTION_TAG
: 'global'
;
SHARE_NAME_SECTION_TAG
: 'SHARE_NAME'
;
PRINTER_SECTION_TAG
: 'printer'
;
NEW_LINE :
'\r' ? '\n' | '\r'
;
WHITE_SPACE
: ' ' | '\t'
;
Somehow, it does not work properly. When running in Antlrworks, it gives me the following exception:
problem matching token at 12:19 NoViableAltException('o'#[1:1: Tokens
: ( SECTION_TAG_START | SECTION_TAG_END | GLOBAL_SECTION_TAG |
SHARE_NAME_SECTION_TAG | PRINTER_SECTION_TAG | NEW_LINE | WHITE_SPACE
);])
Thanks.
The error message:
problem matching token at 12:19 NoViableAltException('o'#[1:1: Tokens : ( SECTION_TAG_START | SECTION_TAG_END | GLOBAL_SECTION_TAG | SHARE_NAME_SECTION_TAG | PRINTER_SECTION_TAG | NEW_LINE | WHITE_SPACE );])
means that ANTLR encounters a character, 'o', that it cannot create a token for. You probably think it will be matched by the . in your parser rules, but it doesn't. Inside parser rules, the . matches any token, while only inside lexer rules it matches any character.
Your lexer only creates the following tokens: SECTION_TAG_START, SECTION_TAG_END, GLOBAL_SECTION_TAG, SHARE_NAME_SECTION_TAG, PRINTER_SECTION_TAG, NEW_LINE and WHITE_SPACE. So a . inside a parser rule matches any of these tokens, nothing more.
Unless you're doing this to learn ANTLR, I'd hesitate to use ANTLR for this task. You can do this easier with some built-in string operations and reading the input line-by-line.
Using ANTLR, you could do something similar to this:
grammar T;
parse
: section* EOF
;
section
: header line*
;
header
: SECTION_TAG_START name=text SECTION_TAG_END NEW_LINE
{
System.out.println("name=" + $name.text);
}
;
line
: key=text ASSIGN value=text (NEW_LINE | EOF)
{
System.out.println(" key=`" + $key.text.trim() +
"`, value=`" + $value.text.trim() + "`");
}
;
text
: OTHER+
;
SECTION_TAG_START : '[';
SECTION_TAG_END : ']';
ASSIGN : '=';
NEW_LINE : '\r'? '\n';
OTHER : . /* any other char: must be the last rule! */;
Parsing your example input would print the following to your console:
name=global
key=`netbios name`, value=`NETBIOS_NAME`
key=`workgroup`, value=`WORKGROUP`
key=`security`, value=`user`
name=SHARE_NAME
key=`comment`, value=`COMMENT`
key=`force create mode`, value=`0770`
key=`locking`, value=`yes`
name=printers
key=`comment`, value=`COMMENT`
key=`path`, value=`/var/spool/samba`
key=`browseable`, value=`No`

Perl: Search a pattern across array elements

I am a Perl newbie, stuck with another bioinformatics problem that requires some help and input.
The problem in brief:
I have a file, which has over 40,000 unique DNA sequences. By unique, I mean unique sequence id. I am attaching a portion of it at the end of my post to help you show what it looks like.
I need to divide each of the 40,000 sequences into 3 parts. So if a particular sequence is 999 character long, each of the 3 parts would have 333 characters.
I need to look for the following pattern through each of the 3 individual parts:
$gpat = [G]{3,5};
$npat = [A-Z]{1,25};
$pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
If $pattern appears in the first of the 3 parts, increase the counter of 'beginning', if $pattern occurs in the 2nd of the 3 parts, increase counter of 'middle' and lastly if the $pattern appears in the 3rd part, increase counter of 'end'.
Print the counters of 'beginning','middle' and 'end' i.e basically summation of 'beginning','middle','end' for each of the sequences.
Say in 1st sequence, the values are like '2','5','3' respectively and in 2nd sequence, the values are '4','1','6', the final count should be '7,6,9'.
The issues I am having:
If a particular sequence is split into 3 parts, potential $pattern is lost. eg say on a sequence like :
gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta
a split into 3 parts produces following 3 sub-parts,each of 35 character length:
gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta
Hence, $pattern gets split into the first 2 parts. Is there anyway to say "If $pattern begins in 1st part and ends in 2nd part", increase count of "beginning" ?
##UPDATE## The following issue has been resolved thanks to the code suggested by Cupidvogel
2.How do I divide a sequence into 3 parts if its length is not divisible by 3? I tried using int, but then the last part is 1-2
characters short.
The following is the code I have written so far.
It reads in the file, displays the header name and sequence, the length into which each sequence will be divided and finally the sequence split into 3 parts which works fine provided the sequence length is divisible by 3, for those which aren't, the final 3rd part is 1-2 characters short.
#Take Filename from user
print "Please enter file name : ";
$in =<>;
chomp $in;
open (FASTA,"$in") or die ;
while (<FASTA>)
{
$/=">";
#array = split '\n', $_;
$header=shift #array; # Header of the fasta sequence
print "\n\nNext sequence: \n";
print $header,"\n";
$seq= join '', #array; # sequence
$seq=~s/\s//g;
$seq=~s/\*//g;
$seq=~s/>//g;
print $seq,"\n\n";
$num = int(length($seq)/3);
#arr = unpack("A$num A$num A*",$seq);
print " New method gives this :", #arr;
print "\nThe first element is :", $arr[0];
print "\nThe second element is :",$arr[1];
print "\nThe third element is :",$arr[2] ;
#The following lines of code were originally written to split...
#...the sequence into 3 parts, albeit unsuccessfully
#my $split = (length $seq)/3;
#print $split,"\n\n";
#my $int = int $split;
#print $int,"\n\n";
#my #array2 = $seq =~ /(.{$int})/g;
#print join (" ", #array2),"\n\n";
#print $array2[0],"\n",$array2[1],"\n",$array2[2];
}
exit;
I have been trying the code I have written so far with the following sample file : sample.fa
>ABC_123 2
atgtcgatcgatcggcgggcatgcgcgcgcggatg
atatatagcgcgcgctatatagcgcgactctacgc
atgctgctgactagctatagtcgctgactgcgcgt
gggaaaaagggcccgggccccgttttggggatcta
ggggatagctgatgctagcatgcatgctgactgca
>DEF_456 4
gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta
>GHI_789 1
atagctgctagtcgatcggcgcgggtatcgatcgg
ggatcgatcgatcggggatcgatcgggggatcgat
The actual input file looks like the following:
>NR_037701 1
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
gctccctcttttaaagattttccttccctctttccaactccctgggtcct
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
cacagactcaaaccctctctcacacacatacacatatacattgttattcc
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
agggttgggacttcaacacagctttttgggggatcataattcaacccatg
acagccactgagattattatatctccagagaataaatgtgtggagttaaa
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
>NM_198399 1
aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
caatagcagaggaaggagggactgagcaggagacggccactccagagaac
ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
actgcatatttttacccttatttttgctccttacagcaagattagtaggt
tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
attgattaagatatcattatttttgtttggtttggttttgcttttttcct
cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
aaaaaaaaaaaaa
>NR_026816 1
caacccactctctgtgctatgacttcattactctttcccagcccagccct
gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
ccagagccaaggctacagctagagagttgactcctctatttgagattgac
aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
>NR_027917 1
atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
ggtgtag
>NR_002777 3
cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
agcctaatagagaacataaaattctaaaagataaagataataataatgat
aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
actcataggacataaaaaaaaaaaaaa
>NR_033769 1
ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc
agctcagccagccacagaactggaatttttcaggagcagggggagcatgg
agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt
aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc
ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg
atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta
cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga
acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga
ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa
tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca
gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag
tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt
gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc
cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg
tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt
ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca
cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat
ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg
tggagctggtgcctccagagagccctttgatccagctcttcttggagaga
gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt
tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc
tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca
aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc
aagttaggtttatagagtttgactagttttttcgattagatttgtattag
ttataaatttgttcatagagtttgactaattttttcgattagatttgtat
ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga
ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa
taacagctattgttttgcatatccactgcaggccaagcactttcagcatc
atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg
gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta
tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc
atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa
aaagaaaagaaaaacagagacggtc
>NM_016326 3
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag
cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc
aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt
tttgtaattttatattactttttagtttgatactaagtattaaacatatt
tctgtattcttccacatattttctgcagttattttaactcagtataggag
ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta
acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg
gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga
atctctttactgcctggctggccggcagctccg
>NM_181641 2
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac
ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag
cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa
aaaagaagttttgtaattttatattactttttagtttgatactaagtatt
aaacatatttctgtattcttccacatattttctgcagttattttaactca
gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc
actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc
ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg
gatctttgaatctctttactgcctggctggccggcagctccg
>NM_001144931 1
gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt
ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg
ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg
ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct
gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc
acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag
gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc
agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc
atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg
ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa
cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact
ttgggaggccgag
>NR_029429 1
ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat
ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa
caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct
atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac
actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag
tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc
tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg
caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta
cttcacctgccagccaacccgccagtattgtggagagctcatccttggag
gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc
cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag
gccactggcttgtgctctgagggttgccaggccattgtggataccgagac
cttcctgc
>NR_026551 1
tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc
taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta
ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc
gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg
gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg
acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg
gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg
cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg
gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt
gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa
agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag
aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc
cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga
gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag
gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc
tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg
cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc
cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct
gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct
gagcagagccctcactcccaggcagagttgtctgaatccttcct
>NM_181640 2
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg
taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa
accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt
atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc
ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg
taattttatattactttttagtttgatactaagtattaaacatatttctg
tattcttccacatattttctgcagttattttaactcagtataggagctag
aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt
ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct
gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct
ctttactgcctggctggccggcagctccg
>NM_016951 3
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
acttgatcgattaatgaagtggttattttggcctttgcttgatattatca
actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg
ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt
gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc
tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa
gaagttttgtaattttatattactttttagtttgatactaagtattaaac
atatttctgtattcttccacatattttctgcagttattttaactcagtat
aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta
ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca
cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc
tttgaatctctttactgcctggctggccggcagctccg
>NR_002773 1
cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca
ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa
ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta
ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct
gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca
ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga
gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc
tttctgacccagcagctggggccagggctggtggatgcagcccaggccca
gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg
ctgcagccctggctcacttggacagggggagccccccacctgcccgggag
gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga
gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg
tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc
caagagtacctggacatagaccagatgatcttcgacagagagctgcccca
ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga
acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg
gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct
gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg
cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc
ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct
gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc
ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc
agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg
cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg
agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga
ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg
cttgggccacttctccacgcccctgacccatggggtggactgcccctacc
tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag
acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct
gcggcgacaccactcagatctctactcccactactttgggggccttgcgg
aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat
gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca
caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt
atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc
gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag
gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag
ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg
ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct
ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg
gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat
cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg
cactccatcctgcgtgactgaac
>NR_037806 1
attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg
ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt
ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac
cctccagtgttggcacaggcccacccctggctccaccagagccagaagca
gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc
catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag
agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag
ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca
gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg
gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt
gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag
gattaaaatcacactaataacccctggatggtcaatctgataataggatc
agatttacgtctaccctaattcttaacattgcagctttctctccatctgc
agattattcccagtctcccagtaacacgtttctacccagatcctttttca
tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag
aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg
ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa
gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct
tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc
tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg
aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa
ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg
ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc
ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag
tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc
aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg
agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa
agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa
gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt
attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa
atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt
gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg
gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg
ggattagctctg
Any help and input would be deeply appreciated.
Thank you for taking the time to go through my problem!
Rather than splitting the sequence into three parts, the way I see this working is to find all occurrences of $pattern in the complete sequence and determine in which third the pattern starts.
The built-in variable $-[0] contains the offset of the start of the most recent successful match.
The code below does what I think you want. It works by accumulating each sequence (which ends either when a new sequence ID is found or the end of file is reached) and passing it to the process_seq subroutine.
The subroutine takes the length of the sequence and caclulates the offset of the end of each third of the string. The idiomatic sprintf '%.0f', $value is used to round fractional values to the nearest character position.
The #counts array is adjusted for each occurrence of $regex in the sequence. The element of #counts to be incremented is established by comparing the starting position of the match in $-[0] with the end offset of each of the three segments of the sequence.
Once each sequence has been processed the values in #counts are accumulated into #totals to give overall figures for all sequences.
The output of the program when using your sample data is shown. The grand total is (9, 1, 6).
use strict;
use warnings;
my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;
open my $fh, '<', 'sequences.txt' or die $!;
my ($id, $seq);
my #totals = (0, 0, 0);
while (<$fh>) {
chomp;
if (/^>(\w+)/) {
process_seq($seq) if $id;
$id = $1;
$seq = '';
print "$id\n";
}
elsif ($id) {
$seq .= $_;
process_seq($seq) if eof;
}
}
print "Total: #totals\n";
sub process_seq {
my $sequence = shift;
my $length = length $sequence;
my #offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;
my #counts = (0, 0, 0);
while ($sequence =~ /$regex/g) {
my $place = $-[0];
for my $i (0..2) {
next if $place >= $offsets[$i];
$counts[$i]++;
last;
}
}
print "#counts\n\n";
$totals[$_] += $counts[$_] for 0..2;
}
output
NR_037701
0 0 1
NM_198399
1 0 0
NR_026816
1 0 1
NR_027917
0 0 0
NR_002777
0 0 0
NR_033769
1 0 0
NM_016326
1 0 1
NM_181641
1 0 1
NM_001144931
0 0 0
NR_029429
0 1 0
NR_026551
1 0 0
NM_181640
1 0 1
NM_016951
1 0 1
NR_002773
1 0 0
NR_037806
0 0 0
Total: 9 1 6
I lifted Borodin's process_seq function but used Bio:SeqIO to read in the file sequence by sequence, an advantage over manually reading line by line and the logic to determine various processing. I believe those advantages are:
Code that has been developed and tested by many others
Whenever possible, if output is done via the Bio::SeqIO module, the result file can then be read using Bio::SeqIO read (next_seq) method.
Other reasons I can't think of now :-)
I imagine the BioPerl package of Bio Genetic code modules must be overwhelming to a biologist beginning programming. He might not be willing to try to dig out the information he needs to begin building a program. BioPerl wiki is a good starting place, especially the Howto section, and then there's a how to for beginners and others. You'll find code examples which are mostly(?) helpful. Bio::Seq has some good code examples in the beginning and is where most of the general sequence functions are. Also, for input/output, the Bio::SeqIO module is used and it has examples at the beginning of it's manual.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;
my $in = Bio::SeqIO->new ( -file => "fasta_dat.txt",
-format => 'fasta');
my #totals;
while ( my $seq = $in->next_seq() ) {
process($seq);
}
print "Totals: ";
print "#totals\n";
sub process {
my $seq = shift;
my #offset = map {sprintf '%.0f', $seq->length * $_ / 3} 1..3;
my $sequence = $seq->seq;
my #count = (0,0,0);
while ($sequence =~ /$regex/g) {
my $place = $-[0];
for my $i (0 .. 2) {
next if $place >= $offset[$i];
$count[$i]++;
last;
}
}
print $seq->id, "\n#count\n";
$totals[$_] += $count[$_] for 0 .. $#count;
}

Resources