I want to create a parser to a document very similar to the following samba configuration file. it has many sections, every section has a header line, which start with [ followed by a keyword section name, e.g. global, share_name, etc., till the end of line. followed the section header line is the parameters for this section. We don't know the end of a section till we reach the beginning of another section new line [.., how can I write a rule for this kind of doc? All antlr examples I found knows exactly when start a section and when to end a section. Thanks a lot!
[global]
netbios name = NETBIOS_NAME
workgroup = WORKGROUP
security = user
[SHARE_NAME]
comment = COMMENT
force create mode = 0770
locking = yes
[printers]
comment = COMMENT
path = /var/spool/samba
browseable = No
Here is my grammar:
grammar SambaConfiguration;
file : global_section
share_name_section
printer_section
EOF
;
global_section
: SECTION_TAG_START GLOBAL_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
(~SECTION_TAG_START (.)* NEW_LINE)*
;
share_name_section
: SECTION_TAG_START SHARE_NAME_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
((~SECTION_TAG_START) (.)* NEW_LINE)*
;
printer_section
: SECTION_TAG_START PRINTER_SECTION_TAG (.)* SECTION_TAG_END NEW_LINE
((~SECTION_TAG_START) (.)* NEW_LINE)*
;
SECTION_TAG_START
: '['
;
SECTION_TAG_END
: ']'
;
GLOBAL_SECTION_TAG
: 'global'
;
SHARE_NAME_SECTION_TAG
: 'SHARE_NAME'
;
PRINTER_SECTION_TAG
: 'printer'
;
NEW_LINE :
'\r' ? '\n' | '\r'
;
WHITE_SPACE
: ' ' | '\t'
;
Somehow, it does not work properly. When running in Antlrworks, it gives me the following exception:
problem matching token at 12:19 NoViableAltException('o'#[1:1: Tokens
: ( SECTION_TAG_START | SECTION_TAG_END | GLOBAL_SECTION_TAG |
SHARE_NAME_SECTION_TAG | PRINTER_SECTION_TAG | NEW_LINE | WHITE_SPACE
);])
Thanks.
The error message:
problem matching token at 12:19 NoViableAltException('o'#[1:1: Tokens : ( SECTION_TAG_START | SECTION_TAG_END | GLOBAL_SECTION_TAG | SHARE_NAME_SECTION_TAG | PRINTER_SECTION_TAG | NEW_LINE | WHITE_SPACE );])
means that ANTLR encounters a character, 'o', that it cannot create a token for. You probably think it will be matched by the . in your parser rules, but it doesn't. Inside parser rules, the . matches any token, while only inside lexer rules it matches any character.
Your lexer only creates the following tokens: SECTION_TAG_START, SECTION_TAG_END, GLOBAL_SECTION_TAG, SHARE_NAME_SECTION_TAG, PRINTER_SECTION_TAG, NEW_LINE and WHITE_SPACE. So a . inside a parser rule matches any of these tokens, nothing more.
Unless you're doing this to learn ANTLR, I'd hesitate to use ANTLR for this task. You can do this easier with some built-in string operations and reading the input line-by-line.
Using ANTLR, you could do something similar to this:
grammar T;
parse
: section* EOF
;
section
: header line*
;
header
: SECTION_TAG_START name=text SECTION_TAG_END NEW_LINE
{
System.out.println("name=" + $name.text);
}
;
line
: key=text ASSIGN value=text (NEW_LINE | EOF)
{
System.out.println(" key=`" + $key.text.trim() +
"`, value=`" + $value.text.trim() + "`");
}
;
text
: OTHER+
;
SECTION_TAG_START : '[';
SECTION_TAG_END : ']';
ASSIGN : '=';
NEW_LINE : '\r'? '\n';
OTHER : . /* any other char: must be the last rule! */;
Parsing your example input would print the following to your console:
name=global
key=`netbios name`, value=`NETBIOS_NAME`
key=`workgroup`, value=`WORKGROUP`
key=`security`, value=`user`
name=SHARE_NAME
key=`comment`, value=`COMMENT`
key=`force create mode`, value=`0770`
key=`locking`, value=`yes`
name=printers
key=`comment`, value=`COMMENT`
key=`path`, value=`/var/spool/samba`
key=`browseable`, value=`No`
Related
I have tried to use grammar defined in grammars-v4 project(https://github.com/antlr/grammars-v4/tree/master/javascript/jsx) to parse jsx file.
When I parse the code snippet below,
let str =
`${dsName}${parameterStr ? `( ${parameterStr} )` : ""}${returns ? `{
${returns}}` : ""}`;
it shows the following error
line 2:32 at [#8,42:42='(',<8>,2:32]:no viable alternative at input '('
https://astexplorer.net/ shows it is a TemplateLiteral with conditionalExpression inside, any thoughts for parsing such grammar?
Thanks in advance.
Edit:
Thanks #Bart, it works.
It works fine on the following code.
const href1 = `https://example.com/lib/downloads_${page}.htm?joinSource=${joinSource}${inviter?`&inviter=${inviter}`: ''}`;
const href2 = `https://example.com/act/kol/detail_${this.props.values.list.res.rows && this.props.values.list.res.rows[0].id}_${page}.htm?joinSource=${joinSource}${inviter?`&inviter=${inviter}`: ''}`;
Given Mike's observation that it's the template string that is messing things up, here's a quick solution:
Changes:
JavaScriptLexerBase.java
Add the instance variable:
// Keeps track of the the current depth of nested template string backticks.
// E.g. after the X in:
//
// `${a ? `${X
//
// templateDepth will be 2. This variable is needed to determine if a `}` is a
// plain CloseBrace, or one that closes an expression inside a template string.
protected int templateDepth = 0;
JavaScriptLexer.g4
// Place TemplateCloseBrace above CloseBrace!
TemplateCloseBrace: {this.templateDepth > 0}? '}' -> popMode;
...
// Remove (or comment) TemplateStringLiteral
// TemplateStringLiteral: '`' ('\\`' | ~'`')* '`';
BackTick
: '`' {this.templateDepth++;} -> pushMode(TEMPLATE)
;
...
// Place at the end of the file:
mode TEMPLATE;
BackTickInside
: '`' {this.templateDepth--;} -> type(BackTick), popMode
;
TemplateStringStartExpression
: '${' -> pushMode(DEFAULT_MODE)
;
TemplateStringAtom
: ~[`]
;
JavaScriptParser.g4
singleExpression
: ...
| singleExpression templateStringLiteral # TemplateStringExpression // ECMAScript 6
| ...
;
literal
: ...
| templateStringLiteral
| ...
;
templateStringLiteral
: BackTick templateStringAtom* BackTick
;
templateStringAtom
: TemplateStringAtom
| TemplateStringStartExpression singleExpression TemplateCloseBrace
;
I did not fully test my solution, but it parses your example input successfully:
A quick look at the ANTLR grammar shows:
// TODO: `${`tmp`}`
TemplateStringLiteral: '`' ('\\`' | ~'`')* '`'
So, apparently, it’s on the “TODO List”. Allowing nested “`” string interpolation is surprising (to say the least), and will be “non-trivial” for the ANTLR Lexer (So, I would not "hold my breather" waiting on a fix for this).
Have you tried: (replacing the nested “`”s with “‘s?).
let str =
`${dsName}${parameterStr ? “( ${parameterStr} )” : ""}${returns ? “{
${returns}}” : ""}`;
I have the following ANTLR (version 3) grammar:
grammar GRM;
options
{
language = C;
output = AST;
}
create_statement : CREATE_KEYWORD SPACE_KEYWORD FILE_KEYWORD SPACE_KEYWORD value -> ^(value);
value : NUMBER | STRING;
CREATE_KEYWORD : 'CREATE';
FILE_KEYWORD : 'FILE';
SPACE_KEYWORD : ' ';
NUMBER : DIGIT+;
STRING : (LETTER | DIGIT)+;
fragment DIGIT : '0'..'9';
fragment LETTER : 'a'..'z' | 'A'..'Z';
With this grammar, I am able to successfully parse strings like CREATE FILE dump or CREATE FILE output. However, when I try to parse a string like CREATE FILE file it doesn't work. ANTLR matches the text file (in the string) with lexer rule FILE_KEYWORD which is not the match that I was expecting. I was expecting it to match with lexer rule STRING.
How can I force ANTLR to do this?
Your problem is a variant on classic contextual keyword vs identifier issue, it seems.
Either "value" should be a lexer rule, not a parser rule, it's too late otherwise, or you should reorder the rules (or both).
Hence using VALUE = NUMBER | STRING (lexer rule) instead of lower case value (grammar rule) will help. The order of the lexer rules are also important, usually definition of ID ("VALUE" in your code) comes after keyword definitions.
See also : 'IDENTIFIER' rule also consumes keyword in ANTLR Lexer grammar
grammar GMR;
options
{
language = C;
output = AST;
}
create_statement : CREATE_KEYWORD SPACE_KEYWORD FILE_KEYWORD SPACE_KEYWORD value -> ^(value);
CREATE_KEYWORD : 'CREATE';
FILE_KEYWORD : 'FILE';
value : (LETTER | DIGIT) + | FILE_KEYWORD | CREATE_KEYWORD ;
SPACE_KEYWORD : ' ';
this works for me in ANTLRworks for input CREATE FILE file and for input CREATE FILE FILE if needed.
So I'm having a litle problem on my Bison program, first be aware that my program works and is fully function what you see here is just an exerpt and sufficiant to help me.
basicaly the parser reads "vector(x,2)" the program is checking if the variable "x" already exists and if the number of elements"2" are greater then 1. I'm throwing yyerror when this situations dont happen. the problem is that the error is not ignoring the rest of the token. Let me show an example of vector(x,1), you will see that it should ignore the statement when reducing but is not.
Starting parse
Entering state 0
Stack now 0
Reducing stack by rule 1 (line 58):
-> $$ = nterm PROGRAM (1.1: )
Entering state 1
...
....
Entering state 24
Stack now 0 1 5 12 15 18 24
Next token is token PARENTSIS_RIGHT (1.1: )
Shifting token PARENTSIS_RIGHT (1.1: )
Entering state 33
Stack now 0 1 5 12 15 18 24 33
Reducing stack by rule 10 (line 87):
$1 = token RESEV_VAR (1.1: )
$2 = token PARENTSIS_LEFT (1.1: )
$3 = token STRING (1.1: )
$4 = token COMMA (1.1: )
$5 = nterm ELEMENTS (1.1: )
$6 = token PARENTSIS_RIGHT (1.1: )
Erro: blalalala.
-> $$ = nterm DEFINE_VECTOR (1.1: ) //here it should not run this one should skip it
Entering state 9
Stack now 0 1 9
Reducing stack by rule 6 (line 65):
$1 = nterm DEFINE_VECTOR (1.1: )
VECTOR
-> $$ = nterm DECLARACAO (1.1: )
As you can see at the end he is running the reduction to DEFINE_VECTOR when i specify error.
%start PROGRAM
%%
PROGRAM:
| PROGRAM STATMENT
| PROGRAM error STATMENT
;
STATMENT: QUIT { fclose(yyin); printf("\nBYE\n"); exit('0');}
| DEFINE_VAR { printf("\tVAR\n"); }
| DEFINE_VECTOR { printf("\tVAR\n"); }
;
...
DEFINE_VECTOR: RESEV_VAR PARENTSIS_LEFT STRING COMMA ELEMENT PARENTSIS_RIGHT {
if($5<=1){
yyerror("Vector to small");
}else{
if(check($3)==0){
createVar($3, 2, $5);
}else{
yyerror("Variable alrady exists");
}
}
}
;
ELEMENTS: ELMENTINT
| ELMENTFLOAT{ $$ = $1;}
;
SOLUTION:
%start PROGRAM
%%
PROGRAM:
| PROGRAM STATMENT
| PROGRAM error STATMENT
| PROGRAM error PARENTSIS_RIGHT STATMENT // added here
;
STATMENT: QUIT { fclose(yyin); printf("\nBYE\n"); exit('0');}
| DEFINE_VAR { printf("\tVAR\n"); }
| DEFINE_VECTOR { printf("\tVAR\n"); }
;
...
DEFINE_VECTOR: RESEV_VAR PARENTSIS_LEFT STRING COMMA ELEMENT PARENTSIS_RIGHT {
if($5<=1){
yyerror("Vector to small");
YYERROR; //added here
}else{
if(check($3)==0){
createVar($3, 2, $5);
}else{
yyerror("Variable alrady exists");
YYERROR; added here
}
}
}
;
ELEMENTS: ELMENTINT
| ELMENTFLOAT{ $$ = $1;}
;
yyerror is just a function that prints an error message. It is called by the bison parser when it has a syntax error, but you can also call it elsewhere to print an error. It has no effect on the parsing (it doesn't interrupt or change it -- it does not "throw")
If you want to affect the parsing, you need to use YYERROR (a macro that triggers a parser error, which the parser will then try to recover from). However, if you have a semantic error like this (not a syntax error), you probably don't want to do this, as there's no syntax to recover. You might want to use YYABORT to abort the parse immediately (return from yyparse). Or you might want to just continue parsing, to possibly identify other errors.
Trying to find an elegant F# solution for this. I'm reading 1000 bytes from a file into a buffer, "buff". That part is easy.
Now, I want to scan the buffer looking for the last occurrence of a two-character combination:
Either a carriage return ('\r') or a line feed ('\f') that is not followed by a comma.
When I've found that, I need to find the next CR or LF (or the end of the buffer) and print the contents in between as a string.
Context: The file is a CSV file and I want the last line that has some non-empty value in the first column.
First of all, if you are reading CSV files, then it might be better idea to use CSV type provider. This gives you a nice typed access to CSV files and it has a couple of options that you can use for dealing with messy CSV files (e.g. if you need to skip a few lines). Alternatively, the F# Data library also has CSV parser, which lets you read the file using untyped API.
That said, if you really want to implement parsing on your own, then the following example should illustrate the idiomatic approach. I'm not sure I understand your problem exactly, but say we have:
let input = "start \r body \r, comma"
let buff = input.ToCharArray()
I believe you want to find the region between \r and \r,. You can do this using a recursive function that remembers the end of the range and the start of the range and decrements the starting range as it iterates over the string. You can use pattern matching to detect the cases that you need:
let rec findRange startLoc endLoc =
if startLoc < 0 then failwith "reached beginning"
match buff.[startLoc], buff.[startLoc+1] with
| ('\r' | '\f'), ',' -> findRange (startLoc - 1) startLoc
| ('\r' | '\f'), _ -> startLoc, endLoc
| _, _ -> findRange (startLoc - 1) endLoc
Using this, we can now get the range and get the required substring:
let s, e = findRange (buff.Length-2) (buff.Length-1)
input.Substring(s + 1, e - s - 1)
Elegant is in the eye of the beholder but one approach is implementing a matcher type. A matcher is function that given an input string and a position either succeeds returning a new matcher state with an updated position or fails.
// A matcher state holds a string and the position
[<Struct>]
type MatcherState =
{
Input : string
Pos : int
}
static member New i p : MatcherState = { Input = i ; Pos = p }
member x.Reposition p : MatcherState = { Input = x.Input ; Pos = p }
member x.AdvanceBy i : MatcherState = { Input = x.Input ; Pos = x.Pos + i }
member x.Current = x.Input.[x.Pos]
member x.InRange = x.Pos >= 0 && x.Pos < x.Input.Length
member x.Eos = x.Pos >= x.Input.Length
// A Matcher is a function that given a MatcherState
// returns Some MatcherState with a new position if successful
// otherwise returns None
type Matcher = MatcherState -> MatcherState option
By defining a few active patterns we can pattern match for the line start:
// Matches a line start
let mlineStart =
fun ms ->
match ms with
// Bad cases, new line followed by WS + Comma
| Cr (Ln (Ws (Comma _ | Eos _)))
| Ln (Ws (Comma _ | Eos _)) -> mbad
// Good cases, new line not followed by WS + Comma
| Cr (Ln (Ws ms))
| Ln (Ws ms) -> mgood ms
// All other cases bad
| _ -> mbad
Note: I handle new line followed by whitespace + comma here.
The line end is matched similar:
// Matches a line end
let mlineEnd =
fun ms ->
match ms with
// Good cases, new line or EOS
| Cr (Ln ms)
| Ln ms
| Eos ms -> mgood ms
// All other cases bad
| _ -> mbad
Finally we scanBackward looking for the line start and if we find it scanForward from that position until we find the line end.
match scanBackward testCase testCase.Length mlineStart with
| None -> printfn "No matching line start found"
| Some startPos ->
// Scan forwards from line start until we find a line end
match scanForward testCase startPos mlineEnd with
| None -> printfn "Line start found #%d, but no matching line end found" startPos
| Some endPos ->
let line = testCase.Substring (startPos, endPos - startPos)
printfn "Line found: %s" line
Matcher is actually a simplistic parser but that produces no values and that support scanning forward and backwards. The approach I have chosen is not the most efficient. If efficiency is important it can be improved by applying parser combinator techniques used by for example FParsec.
Hope this was interesting. I am sure someone can up with a shorter regex solution but what fun is that?
Full example follows (no quality guarantees given, use it as an inspiration)
// A matcher state holds a string and the position
[<Struct>]
type MatcherState =
{
Input : string
Pos : int
}
static member New i p : MatcherState = { Input = i ; Pos = p }
member x.Reposition p : MatcherState = { Input = x.Input ; Pos = p }
member x.AdvanceBy i : MatcherState = { Input = x.Input ; Pos = x.Pos + i }
member x.Current = x.Input.[x.Pos]
member x.InRange = x.Pos >= 0 && x.Pos < x.Input.Length
member x.Eos = x.Pos >= x.Input.Length
// A Matcher is a function that given a MatcherState
// returns Some MatcherState with a new position if successful
// otherwise returns None
type Matcher = MatcherState -> MatcherState option
let mgood ms = Some ms
let mbad = None
// Matches EOS
let meos : Matcher =
fun ms ->
if ms.Eos then
mgood ms
else
mbad
// Matches a specific character
let mch ch : Matcher =
fun ms ->
if not ms.InRange then
mbad
elif ms.Current = ch then
mgood <| ms.AdvanceBy 1
else mbad
// Matches zero or more whitespaces
let mws : Matcher =
fun ms ->
let rec loop pos =
if pos < ms.Input.Length then
let ch = ms.Input.[pos]
if ch = ' ' then
loop (pos + 1)
else
mgood <| ms.Reposition pos
else
mgood <| ms.Reposition pos
loop (max ms.Pos 0)
// Active patterns
let (|Eos|_|) = meos
let (|Comma|_|) = mch ','
let (|Cr|_|) = mch '\r'
let (|Ln|_|) = mch '\n'
let (|Ws|_|) = mws
// Matches a line start
let mlineStart =
fun ms ->
match ms with
// Bad cases, new line followed by WS + Comma
| Cr (Ln (Ws (Comma _ | Eos _)))
| Ln (Ws (Comma _ | Eos _)) -> mbad
// Good cases, new line not followed by WS + Comma
| Cr (Ln (Ws ms))
| Ln (Ws ms) -> mgood ms
// All other cases bad
| _ -> mbad
// Matches a line end
let mlineEnd =
fun ms ->
match ms with
// Good cases, new line or EOS
| Cr (Ln ms)
| Ln ms
| Eos ms -> mgood ms
// All other cases bad
| _ -> mbad
// Scans either backward or forward looking for a match
let scan steps input pos (m : Matcher) =
let rec loop ms =
match m ms with
| Some mms ->
if steps < 0 then
Some mms.Pos
else
Some ms.Pos
| None ->
if steps = 0 then
None
elif steps > 0 && ms.Pos >= ms.Input.Length then
None
elif steps < 0 && ms.Pos < 0 then
None
else
loop <| ms.AdvanceBy steps
loop (MatcherState.New input (min input.Length (max 0 pos)))
let scanForward = scan 1
let scanBackward = scan -1
[<EntryPoint>]
let main argv =
// Some test cases
let testCases =
[|
"""1,2,3,4
4,5,6,7"""
"""1,2,3,4
4,5,6,7
"""
"""1,2,3,4
4,5,6,7
,2,3,4
"""
"""1,2,3,4
4,5,6,7
,2,3,4
"""
|]
for testCase in testCases do
// Scan backwards from end until we find a line start
match scanBackward testCase testCase.Length mlineStart with
| None -> printfn "No matching line start found"
| Some startPos ->
// Scan forwards from line start until we find a line end
match scanForward testCase startPos mlineEnd with
| None -> printfn "Line start found #%d, but no matching line end found" startPos
| Some endPos ->
let line = testCase.Substring (startPos, endPos - startPos)
printfn "Line found: %s" line
0
I'm trying to use this javacc grammar https://java.net/downloads/javacc/contrib/grammars/C.jj to parse a C code containing bit fields
struct T{
int w:2;
};
struct T a;
The generated parser cannot parse this code:
$ javacc -DEBUG_PARSER=true C.jj && javac CParser.java && gcc -E input.c | java CParser
Java Compiler Compiler Version 5.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file C.jj . . .
File "TokenMgrError.java" is being rebuilt.
File "ParseException.java" is being rebuilt.
File "Token.java" is being rebuilt.
File "SimpleCharStream.java" is being rebuilt.
Parser generated successfully.
Note: CParser.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
C Parser Version 0.1Alpha: Reading from standard input . . .
Call: TranslationUnit
Call: ExternalDeclaration
Call: Declaration
Call: DeclarationSpecifiers
Call: TypeSpecifier
Call: StructOrUnionSpecifier
Call: StructOrUnion
Consumed token: <"struct" at line 5 column 1>
Return: StructOrUnion
Consumed token: <<IDENTIFIER>: "T" at line 5 column 8>
Consumed token: <"{" at line 5 column 9>
Call: StructDeclarationList
Call: StructDeclaration
Call: SpecifierQualifierList
Call: TypeSpecifier
Consumed token: <"int" at line 6 column 2>
Return: TypeSpecifier
Return: SpecifierQualifierList
Call: StructDeclaratorList
Call: StructDeclarator
Call: Declarator
Call: DirectDeclarator
Consumed token: <<IDENTIFIER>: "w" at line 6 column 6>
Return: DirectDeclarator
Return: Declarator
Return: StructDeclarator
Return: StructDeclaratorList
Return: StructDeclaration
Return: StructDeclarationList
Return: StructOrUnionSpecifier
Return: TypeSpecifier
Return: DeclarationSpecifiers
Return: Declaration
Return: ExternalDeclaration
Return: TranslationUnit
C Parser Version 0.1Alpha: Encountered errors during parse.
ParseException: Encountered " ":" ": "" at line 6, column 7.
Was expecting one of:
";" ...
"," ...
"(" ...
"[" ...
"[" ...
"(" ...
"(" ...
"," ...
";" ...
";" ...
";" ...
"[" ...
"(" ...
"(" ...
at CParser.generateParseException(CParser.java:4279)
at CParser.jj_consume_token(CParser.java:4154)
at CParser.StructDeclaration(CParser.java:433)
at CParser.StructDeclarationList(CParser.java:372)
at CParser.StructOrUnionSpecifier(CParser.java:328)
at CParser.TypeSpecifier(CParser.java:274)
at CParser.DeclarationSpecifiers(CParser.java:182)
at CParser.Declaration(CParser.java:129)
at CParser.ExternalDeclaration(CParser.java:96)
at CParser.TranslationUnit(CParser.java:77)
at CParser.main(CParser.java:63)
I tried to change (line 245)
(...) LOOKAHEAD( { isType(getToken(1).image) } )TypedefName() )
to
LOOKAHEAD( { isType(getToken(1).image) } ) TypedefName2()
(...)
void TypedefName2() : {}
{
TypedefName() (LOOKAHEAD(2) ":" <INTEGER_LITERAL> )?
}
but it doesn't work (same error) .
Is there a simple way to fix the javaCC grammar to handle Bit Fields ?
Try fixing this by modifying the StructDeclarator() rule on lines 310..313 as follows:
void StructDeclarator() : {}
{
( Declarator() [ ":" ConstantExpression() ] | ":" ConstantExpression() )
}
The idea is to remove the need of lookahead by letting the parser make a decision by checking if the struct declarator starts with a colon ":".