Force ANTLR (version 3) to match lexer rule - c

I have the following ANTLR (version 3) grammar:
grammar GRM;
options
{
language = C;
output = AST;
}
create_statement : CREATE_KEYWORD SPACE_KEYWORD FILE_KEYWORD SPACE_KEYWORD value -> ^(value);
value : NUMBER | STRING;
CREATE_KEYWORD : 'CREATE';
FILE_KEYWORD : 'FILE';
SPACE_KEYWORD : ' ';
NUMBER : DIGIT+;
STRING : (LETTER | DIGIT)+;
fragment DIGIT : '0'..'9';
fragment LETTER : 'a'..'z' | 'A'..'Z';
With this grammar, I am able to successfully parse strings like CREATE FILE dump or CREATE FILE output. However, when I try to parse a string like CREATE FILE file it doesn't work. ANTLR matches the text file (in the string) with lexer rule FILE_KEYWORD which is not the match that I was expecting. I was expecting it to match with lexer rule STRING.
How can I force ANTLR to do this?

Your problem seems to be a variant of the classic contextual keyword vs. identifier issue.
Either value should become a lexer rule rather than a parser rule (by the time the parser sees the tokens it is too late), or you should reorder the lexer rules (or both).
In other words, a lexer rule such as VALUE : NUMBER | STRING; instead of the lowercase parser rule value will help. The order of the lexer rules also matters: the identifier-like rule (VALUE in your code) usually comes after the keyword definitions.
See also: 'IDENTIFIER' rule also consumes keyword in ANTLR Lexer grammar
grammar GMR;
options
{
language = C;
output = AST;
}
create_statement : CREATE_KEYWORD SPACE_KEYWORD FILE_KEYWORD SPACE_KEYWORD value -> ^(value);
CREATE_KEYWORD : 'CREATE';
FILE_KEYWORD : 'FILE';
value : (LETTER | DIGIT)+ | FILE_KEYWORD | CREATE_KEYWORD ;
SPACE_KEYWORD : ' ';
This works for me in ANTLRWorks for the input CREATE FILE file, and also for CREATE FILE FILE if needed.
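For completeness, here is a minimal sketch (untested) that keeps the original lexer rules from the question and simply lists the keyword tokens as extra alternatives in the value parser rule, which is the usual way to accept keywords in places where identifiers are expected:
grammar GRM;
options
{
language = C;
output = AST;
}
create_statement : CREATE_KEYWORD SPACE_KEYWORD FILE_KEYWORD SPACE_KEYWORD value -> ^(value);
// Keyword tokens are listed as alternatives, so input such as
// CREATE FILE FILE is still accepted; plain identifiers like
// "file" or "dump" keep lexing as STRING.
value : NUMBER | STRING | FILE_KEYWORD | CREATE_KEYWORD;
CREATE_KEYWORD : 'CREATE';
FILE_KEYWORD : 'FILE';
SPACE_KEYWORD : ' ';
NUMBER : DIGIT+;
STRING : (LETTER | DIGIT)+;
fragment DIGIT : '0'..'9';
fragment LETTER : 'a'..'z' | 'A'..'Z';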

Related

Antlr4 parsing templateLiteral in jsx

I have tried to use the grammar defined in the grammars-v4 project (https://github.com/antlr/grammars-v4/tree/master/javascript/jsx) to parse a JSX file.
When I parse the code snippet below,
let str =
`${dsName}${parameterStr ? `( ${parameterStr} )` : ""}${returns ? `{
${returns}}` : ""}`;
it shows the following error
line 2:32 at [#8,42:42='(',<8>,2:32]:no viable alternative at input '('
https://astexplorer.net/ shows it is a TemplateLiteral with a ConditionalExpression inside. Any thoughts on how to parse such a construct?
Thanks in advance.
Edit:
Thanks @Bart, it works.
It also works fine on the following code:
const href1 = `https://example.com/lib/downloads_${page}.htm?joinSource=${joinSource}${inviter?`&inviter=${inviter}`: ''}`;
const href2 = `https://example.com/act/kol/detail_${this.props.values.list.res.rows && this.props.values.list.res.rows[0].id}_${page}.htm?joinSource=${joinSource}${inviter?`&inviter=${inviter}`: ''}`;
Given Mike's observation that it's the template string that is messing things up, here's a quick solution:
Changes:
JavaScriptLexerBase.java
Add the instance variable:
// Keeps track of the current depth of nested template string backticks.
// E.g. after the X in:
//
// `${a ? `${X
//
// templateDepth will be 2. This variable is needed to determine if a `}` is a
// plain CloseBrace, or one that closes an expression inside a template string.
protected int templateDepth = 0;
JavaScriptLexer.g4
// Place TemplateCloseBrace above CloseBrace!
TemplateCloseBrace: {this.templateDepth > 0}? '}' -> popMode;
...
// Remove (or comment) TemplateStringLiteral
// TemplateStringLiteral: '`' ('\\`' | ~'`')* '`';
BackTick
: '`' {this.templateDepth++;} -> pushMode(TEMPLATE)
;
...
// Place at the end of the file:
mode TEMPLATE;
BackTickInside
: '`' {this.templateDepth--;} -> type(BackTick), popMode
;
TemplateStringStartExpression
: '${' -> pushMode(DEFAULT_MODE)
;
TemplateStringAtom
: ~[`]
;
JavaScriptParser.g4
singleExpression
: ...
| singleExpression templateStringLiteral # TemplateStringExpression // ECMAScript 6
| ...
;
literal
: ...
| templateStringLiteral
| ...
;
templateStringLiteral
: BackTick templateStringAtom* BackTick
;
templateStringAtom
: TemplateStringAtom
| TemplateStringStartExpression singleExpression TemplateCloseBrace
;
I did not fully test my solution, but it parses your example input successfully.
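For reference, a minimal test harness might look like the following, assuming the lexer and parser are regenerated after the changes above and keep the usual grammars-v4 class and rule names (JavaScriptLexer, JavaScriptParser, entry rule program):
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class TemplateStringTest {
    public static void main(String[] args) {
        String src =
            "let str =\n" +
            "`${dsName}${parameterStr ? `( ${parameterStr} )` : \"\"}${returns ? `{\n" +
            "${returns}}` : \"\"}`;\n";

        // Any remaining syntax problem is reported by the default console error listener.
        JavaScriptLexer lexer = new JavaScriptLexer(CharStreams.fromString(src));
        JavaScriptParser parser = new JavaScriptParser(new CommonTokenStream(lexer));
        System.out.println(parser.program().toStringTree(parser));
    }
}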
A quick look at the ANTLR grammar shows:
// TODO: `${`tmp`}`
TemplateStringLiteral: '`' ('\\`' | ~'`')* '`'
So, apparently, it's on the "TODO list". Allowing nested backtick string interpolation is surprising (to say the least), and will be "non-trivial" for the ANTLR lexer (so I would not hold my breath waiting on a fix for this).
Have you tried replacing the nested backticks with plain double quotes? For example:
let str =
`${dsName}${parameterStr ? "( ${parameterStr} )" : ""}${returns ? "{
${returns}}" : ""}`;

String Arrays in Guvnor rule

I am having some difficulty in understanding how a String[] can be represented in a Guvnor rule. How can an array of strings be passed to a Java method that uses String[] as an argument from a rule in Guvnor?
I keep getting mismatched input errors (Error Code 102) when I attempt to validate the rule in Guvnor.
Any pointers/tips welcome.
In the following rule, comm is a global object with a sendMail function whose signature is (String[] recipientlist, String alertType, String message):
rule "list-email"
dialect "java"
when
$result : Grade( subject == "Math" , $marks : mark >= 99.0 )
$emailList : "{xyz#abc.com, fgh#def.com}"
then
comm.sendMail($emailList, "High Grade Alert", "Scored: " + " Marks:" + Double.toString($marks));
It's not a good idea to try to introduce a String[] on the LHS: you aren't matching on it, and I doubt the syntax is correct. Use this instead; on the RHS it's plain Java:
rule "list-email"
dialect "java"
when
$result : Grade( subject == "Math" , $marks : mark >= 99.0 )
then
String[] addrs = new String[]{"xyz#abc.com", "fgh#def.com"};
comm.sendMail(addrs, "High Grade Alert", "Scored: " + " Marks:" + $marks );
end
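For completeness, the comm global has to be declared in the package and backed by a Java class whose sendMail signature matches the call on the RHS. A minimal sketch, where the package and class names are made up for illustration:
// DRL global declaration (the class name is an assumption):
global com.example.alerts.CommService comm;

// com/example/alerts/CommService.java
package com.example.alerts;

public class CommService {
    // The signature must line up with the RHS call: (String[], String, String).
    public void sendMail(String[] recipientList, String alertType, String message) {
        System.out.println("To: " + String.join(",", recipientList)
                + " [" + alertType + "] " + message);
    }
}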

javacc C grammar and C "Bit fields" ; ParseException

I'm trying to use this JavaCC grammar, https://java.net/downloads/javacc/contrib/grammars/C.jj, to parse C code containing bit fields:
struct T{
int w:2;
};
struct T a;
The generated parser cannot parse this code:
$ javacc -DEBUG_PARSER=true C.jj && javac CParser.java && gcc -E input.c | java CParser
Java Compiler Compiler Version 5.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file C.jj . . .
File "TokenMgrError.java" is being rebuilt.
File "ParseException.java" is being rebuilt.
File "Token.java" is being rebuilt.
File "SimpleCharStream.java" is being rebuilt.
Parser generated successfully.
Note: CParser.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
C Parser Version 0.1Alpha: Reading from standard input . . .
Call: TranslationUnit
Call: ExternalDeclaration
Call: Declaration
Call: DeclarationSpecifiers
Call: TypeSpecifier
Call: StructOrUnionSpecifier
Call: StructOrUnion
Consumed token: <"struct" at line 5 column 1>
Return: StructOrUnion
Consumed token: <<IDENTIFIER>: "T" at line 5 column 8>
Consumed token: <"{" at line 5 column 9>
Call: StructDeclarationList
Call: StructDeclaration
Call: SpecifierQualifierList
Call: TypeSpecifier
Consumed token: <"int" at line 6 column 2>
Return: TypeSpecifier
Return: SpecifierQualifierList
Call: StructDeclaratorList
Call: StructDeclarator
Call: Declarator
Call: DirectDeclarator
Consumed token: <<IDENTIFIER>: "w" at line 6 column 6>
Return: DirectDeclarator
Return: Declarator
Return: StructDeclarator
Return: StructDeclaratorList
Return: StructDeclaration
Return: StructDeclarationList
Return: StructOrUnionSpecifier
Return: TypeSpecifier
Return: DeclarationSpecifiers
Return: Declaration
Return: ExternalDeclaration
Return: TranslationUnit
C Parser Version 0.1Alpha: Encountered errors during parse.
ParseException: Encountered " ":" ": "" at line 6, column 7.
Was expecting one of:
";" ...
"," ...
"(" ...
"[" ...
"[" ...
"(" ...
"(" ...
"," ...
";" ...
";" ...
";" ...
"[" ...
"(" ...
"(" ...
at CParser.generateParseException(CParser.java:4279)
at CParser.jj_consume_token(CParser.java:4154)
at CParser.StructDeclaration(CParser.java:433)
at CParser.StructDeclarationList(CParser.java:372)
at CParser.StructOrUnionSpecifier(CParser.java:328)
at CParser.TypeSpecifier(CParser.java:274)
at CParser.DeclarationSpecifiers(CParser.java:182)
at CParser.Declaration(CParser.java:129)
at CParser.ExternalDeclaration(CParser.java:96)
at CParser.TranslationUnit(CParser.java:77)
at CParser.main(CParser.java:63)
I tried to change (line 245)
(...) LOOKAHEAD( { isType(getToken(1).image) } )TypedefName() )
to
LOOKAHEAD( { isType(getToken(1).image) } ) TypedefName2()
(...)
void TypedefName2() : {}
{
TypedefName() (LOOKAHEAD(2) ":" <INTEGER_LITERAL> )?
}
but it doesn't work (same error).
Is there a simple way to fix the JavaCC grammar to handle bit fields?
Try fixing this by modifying the StructDeclarator() rule on lines 310..313 as follows:
void StructDeclarator() : {}
{
( Declarator() [ ":" ConstantExpression() ] | ":" ConstantExpression() )
}
The idea is to remove the need for lookahead by letting the parser decide based on whether the struct declarator starts with a colon ":".
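With that change, input exercising both alternatives (named and anonymous bit fields) should parse, for example:
struct flags {
    int ready : 1;   /* Declarator ":" ConstantExpression */
    int       : 3;   /* anonymous padding: ":" ConstantExpression only */
    int mode  : 2;
};
struct flags f;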

how to write a antlr parser for samba configuration file

I want to create a parser for a document very similar to the following Samba configuration file. It has many sections; every section has a header line, which starts with [ followed by a keyword section name (e.g. global, share_name, etc.) until the end of the line. The header line is followed by the parameters for that section. We don't know where a section ends until we reach the start of another section, i.e. a new line beginning with [. How can I write a rule for this kind of document? All the ANTLR examples I found know exactly where a section starts and where it ends. Thanks a lot!
[global]
netbios name = NETBIOS_NAME
workgroup = WORKGROUP
security = user
[SHARE_NAME]
comment = COMMENT
force create mode = 0770
locking = yes
[printers]
comment = COMMENT
path = /var/spool/samba
browseable = No
Here is my grammar:
grammar SambaConfiguration;
file : global_section
share_name_section
printer_section
EOF
;
global_section
: SECTION_TAG_START GLOBAL_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
(~SECTION_TAG_START (.)* NEW_LINE)*
;
share_name_section
: SECTION_TAG_START SHARE_NAME_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
((~SECTION_TAG_START) (.)* NEW_LINE)*
;
printer_section
: SECTION_TAG_START PRINTER_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
((~SECTION_TAG_START) (.)* NEW_LINE)*
;
SECTION_TAG_START
: '['
;
SECTION_TAG_END
: ']'
;
GLOBAL_SECTION_TAG
: 'global'
;
SHARE_NAME_SECTION_TAG
: 'SHARE_NAME'
;
PRINTER_SECTION_TAG
: 'printer'
;
NEW_LINE :
'\r' ? '\n' | '\r'
;
WHITE_SPACE
: ' ' | '\t'
;
Somehow, it does not work properly. When running it in ANTLRWorks, it gives me the following exception:
problem matching token at 12:19 NoViableAltException('o'#[1:1: Tokens
: ( SECTION_TAG_START | SECTION_TAG_END | GLOBAL_SECTION_TAG |
SHARE_NAME_SECTION_TAG | PRINTER_SECTION_TAG | NEW_LINE | WHITE_SPACE
);])
Thanks.
The error message:
problem matching token at 12:19 NoViableAltException('o'#[1:1: Tokens : ( SECTION_TAG_START | SECTION_TAG_END | GLOBAL_SECTION_TAG | SHARE_NAME_SECTION_TAG | PRINTER_SECTION_TAG | NEW_LINE | WHITE_SPACE );])
means that ANTLR encountered a character, 'o', that it cannot create a token for. You probably expect it to be matched by the . in your parser rules, but it isn't: inside parser rules, the . matches any token, whereas only inside lexer rules does it match any character.
Your lexer only creates the following tokens: SECTION_TAG_START, SECTION_TAG_END, GLOBAL_SECTION_TAG, SHARE_NAME_SECTION_TAG, PRINTER_SECTION_TAG, NEW_LINE and WHITE_SPACE. So a . inside a parser rule matches any of these tokens, nothing more.
Unless you're doing this to learn ANTLR, I'd hesitate to use ANTLR for this task. You can do it more easily with some built-in string operations, reading the input line by line.
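For example, a plain Java sketch along those lines (the file name and output format are just illustrative):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SambaConfReader {
    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get("smb.conf"))) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.startsWith("[") && trimmed.endsWith("]")) {
                // A new section header implicitly ends the previous section.
                System.out.println("name=" + trimmed.substring(1, trimmed.length() - 1));
            } else {
                int eq = trimmed.indexOf('=');
                if (eq > 0) {
                    String key = trimmed.substring(0, eq).trim();
                    String value = trimmed.substring(eq + 1).trim();
                    System.out.println("  key=`" + key + "`, value=`" + value + "`");
                }
            }
        }
    }
}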
Using ANTLR, you could do something similar to this:
grammar T;
parse
: section* EOF
;
section
: header line*
;
header
: SECTION_TAG_START name=text SECTION_TAG_END NEW_LINE
{
System.out.println("name=" + $name.text);
}
;
line
: key=text ASSIGN value=text (NEW_LINE | EOF)
{
System.out.println(" key=`" + $key.text.trim() +
"`, value=`" + $value.text.trim() + "`");
}
;
text
: OTHER+
;
SECTION_TAG_START : '[';
SECTION_TAG_END : ']';
ASSIGN : '=';
NEW_LINE : '\r'? '\n';
OTHER : . /* any other char: must be the last rule! */;
Parsing your example input would print the following to your console:
name=global
key=`netbios name`, value=`NETBIOS_NAME`
key=`workgroup`, value=`WORKGROUP`
key=`security`, value=`user`
name=SHARE_NAME
key=`comment`, value=`COMMENT`
key=`force create mode`, value=`0770`
key=`locking`, value=`yes`
name=printers
key=`comment`, value=`COMMENT`
key=`path`, value=`/var/spool/samba`
key=`browseable`, value=`No`

ANTLR3 C Target - parser return 'misses' out root element

I'm trying to use the ANTLR3 C Target to make sense of an AST, but am running into some difficulties.
I have a simple SQL-like grammar file:
grammar sql;
options
{
language = C;
output=AST;
ASTLabelType=pANTLR3_BASE_TREE;
}
sql : VERB fields;
fields : FIELD (',' FIELD)*;
VERB : 'SELECT' | 'UPDATE' | 'INSERT';
FIELD : CHAR+;
fragment
CHAR : 'a'..'z';
and this works as expected within ANTLRWorks.
In my C code I have:
const char pInput[] = "SELECT one,two,three";
pANTLR3_INPUT_STREAM pNewStrm = antlr3NewAsciiStringInPlaceStream((pANTLR3_UINT8) pInput,sizeof(pInput),NULL);
psqlLexer lex = sqlLexerNew (pNewStrm);
pANTLR3_COMMON_TOKEN_STREAM tstream = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
TOKENSOURCE(lex));
psqlParser ps = sqlParserNew( tstream );
sqlParser_sql_return ret = ps->sql(ps);
pANTLR3_BASE_TREE pTree = ret.tree;
cout << "Tree: " << pTree->toStringTree(pTree)->chars << endl;
ParseSubTree(0,pTree);
This outputs a flat tree structure when you use ->getChildCount and ->children->get to recurse through the tree.
void ParseSubTree(int level,pANTLR3_BASE_TREE pTree)
{
ANTLR3_UINT32 childcount = pTree->getChildCount(pTree);
for (int i=0;i<childcount;i++)
{
pANTLR3_BASE_TREE pChild = (pANTLR3_BASE_TREE) pTree->children->get(pTree->children,i);
for (int j=0;j<level;j++)
{
std::cout << " - ";
}
std::cout <<
pChild->getText(pChild)->chars <<
std::endl;
int f=pChild->getChildCount(pChild);
if (f>0)
{
ParseSubTree(level+1,pChild);
}
}
}
Program output:
Tree: SELECT one , two , three
SELECT
one
,
two
,
three
Now, if I alter the grammar file:
sql : VERB ^fields;
.. the call to ParseSubTree only displays the child nodes of fields.
Program output:
Tree: (SELECT one , two , three)
one
,
two
,
three
My question is: why, in the second case, does ANTLR give just the child nodes (in effect missing out the SELECT token)?
I'd be very grateful if anybody could give me any pointers for making sense of the tree returned by ANTLR.
Useful Information:
AntlrWorks 1.4.2,
Antlr C Target 3.3,
MSVC 10
Placing output=AST; in the options section will not produce an actual AST by itself; it only causes ANTLR to create CommonTree nodes instead of CommonToken objects (or, in your case, the equivalent C structs).
If you use output=AST;, the next step is to put tree operators or rewrite rules inside your parser rules to give shape to your AST.
See this previous Q&A to find out how to create a proper AST.
For example, the following grammar (with rewrite rules):
options {
output=AST;
// ...
}
sql // make VERB the root
: VERB fields -> ^(VERB fields)
;
fields // omit the comma's from the AST
: FIELD (',' FIELD)* -> FIELD+
;
VERB : 'SELECT' | 'UPDATE' | 'INSERT';
FIELD : CHAR+;
SPACE : ' ' {$channel=HIDDEN;};
fragment CHAR : 'a'..'z';
will parse the following input:
UPDATE field, foo , bar
into the following AST, with VERB (UPDATE) as the root and the commas omitted: ^(UPDATE field foo bar)
I think it is important that you realize that the tree you see in ANTLRWorks is not the AST. The ".tree" in your code is the AST, but it may look different from what you expect. In order to create the AST, you need to mark the root nodes with the ^ operator in strategic places, or use rewrite rules.
You can read more here.
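As a side note: ParseSubTree as written only prints the children of the node it is given, so the root token itself never shows up in the output. If you also want the root printed, a small variant using the same C-target accessors as in the question could look like this:
// Prints the node itself, then recurses into its children.
void PrintTree(int level, pANTLR3_BASE_TREE pTree)
{
    for (int j = 0; j < level; j++)
    {
        std::cout << " - ";
    }
    std::cout << pTree->getText(pTree)->chars << std::endl;

    ANTLR3_UINT32 childCount = pTree->getChildCount(pTree);
    for (ANTLR3_UINT32 i = 0; i < childCount; i++)
    {
        pANTLR3_BASE_TREE pChild =
            (pANTLR3_BASE_TREE) pTree->children->get(pTree->children, i);
        PrintTree(level + 1, pChild);
    }
}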
