Formal grammar of XML - c

Im trying to build small parser for XML files in C. I know, i could find some finished solutions but, i need just some basic stuff for embedded project. I`m trying to create grammar for describing XML without attributes, just tags, but it seems it is not working and i was not able to figure out why.
Here is the grammar:
XML : FIRST_TAG NIZ
NIZ : VAL NIZ | eps
VAL : START VAL END
| STR
| eps
Here is part of C code that implement this grammar :
void check() {
getSymbol();
if( sym == FIRST_LINE )
{
niz();
}
else {
printf("FIRST_LINE EXPECTED");
exit(1);
}
}
void niz() {
getSymbol();
if( sym == ERROR )
return;
if( sym == START ) {
back = 1;
val();
niz();
}
printf(" EPS OR START EXPECTED\n");
}
void val() {
getSymbol();
if( sym == ERROR )
return;
if( sym == START ) {
back = 0;
val();
getSymbol();
if( sym != END ) {
printf("END EXPECTED");
exit(1);
}
return;
}
if( sym == EMPTY_TAG || sym == STR)
return;
printf("START, STR, EMPTY_TAG OR EPS EXPECTED\n");
exit(1);
}
void getSymbol() {
int pom;
if(back == 1) {
back = 0;
return;
}
sym = getNextToken(cmd + offset, &pom);
offset += pom + 1;
}
EDIT: Here is the example of XML file that does not satisfy this grammar:
<?xml version="1.0"?>
<VATCHANGES>
<DATE>15/08/2012</DATE>
<TIME>1452</TIME>
<EFDSERIAL>01KE000001</EFDSERIAL>
<CHANGENUM>1</CHANGENUM>
<VATRATE>A</VATRATE>
<FROMVALUE>16.00</FROMVALUE>
<TOVALUE>18.00</TOVALUE>
<VATRATE>B</VATRATE>
<FROMVALUE>2.00</FROMVALUE>
<TOVALUE>0.00</TOVALUE>
<VATRATE>C</VATRATE>
<FROMVALUE>5.00</FROMVALUE>
<TOVALUE>0.00</TOVALUE>
<DATE>25/05/2010</DATE>
<CHANGENUM>2</CHANGENUM>
<VATRATE>C</VATRATE>
<FROMVALUE>0.00</FROMVALUE>
<TOVALUE>4.00</TOVALUE>
</VATCHANGES>
It gives END EXPECTED at the output.

First, your grammar needs some work. Assuming the preamble is handled correctly, you have a basic error in the definition of NIZ.
NIZ : VAL NIZ | eps
VAL : START VAL END
| STR
| eps
So we enter NIZ and we look for VAL first. The problem is the eps on the end of both VAL's possible productions and NIZ. Therefore, if VAL produces nothing (i.e. eps) and consumes no tokens in the process (which it can't to be proper, since eps is the production), NIZ reduces to:
NIZ: eps NIZ | eps
which isn't good.
Consider into something more along these lines: I just spewed this with no real foresight into having something beyond a purely basic construction.
XML: START_LINE ELEMENT
ELEMENT: OPENTAG BODY CLOSETAG
OPENTAG: lt id(n) gt
CLOSETAG: lt fs id(n) gt
BODY: ELEMENT | VALUE
VALUE: str | eps
This is super basic. Terminals include:
lt: '<'
gt: '>'
fs: '/'
str: any alphanumeric string excluding chars lt or gt.
id(n): any alphanumeric string excluding chars lt, gt, or fs.
I can almost feel the wrath of the XML purists raining down on me right now, but the point I'm trying to get across is that, when an grammar is well-defined, the RDP will literally write itself. Obviously the lexer (i.e. the token engine) needs to handle the terminals accordingly. Note: the id(n) is an id-stack to ensure you properly close the innermost tag, and is an attribute of your parser in accordance with how it manages tag ids. Its not traditional, but it makes things MUCH easier.
This can/should clearly be expanded to include stand-alone element declarations and short-cut element closure. For example, this grammar allows for elements of this form:
<ElementName>...</ElementName>
but not of this form:
<ElementName/>
Nor does it account for short-cut termination such as:
<ElementName>...</>
Accounting for such additions will obviously complicate the grammar considerably, but also make the parser substantially more robust. Like I said, the sample above is basic with a capital B. If you're really going to embark on this these are things you want to consider when designing your grammar, and thus also your RDP by consequence.
Anyway, just consider how a few reworks in your grammar can/will substantially make this easier on you.

Related

How would I go about writing an interpreter in C? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I'd love some references, or tips, possibly an e-book or two. I'm not looking to write a compiler, just looking for a tutorial I could follow along and modify as I go. Thank you for being understanding!
BTW: It must be C.
Any more replies would be appreciated.
A great way to get started writing an interpreter is to write a simple machine simulator. Here's a simple language you can write an interpreter for:
The language has a stack and 6 instructions:
push <num> # push a number on to the stack
pop # pop off the first number on the stack
add # pop off the top 2 items on the stack and push their sum on to the stack. (remember you can add negative numbers, so you have subtraction covered too). You can also get multiplication my creating a loop using some of the other instructions with this one.
ifeq <address> # examine the top of the stack, if it's 0, continue, else, jump to <address> where <address> is a line number
jump <address> # jump to a line number
print # print the value at the top of the stack
dup # push a copy of what's at the top of the stack back onto the stack.
Once you've written a program that can take these instructions and execute them, you've essentially created a very simple stack based virtual machine. Since this is a very low level language, you won't need to understand what an AST is, how to parse a grammar into an AST, and translate it to machine code, etc. That's too complicated for a tutorial project. Start with this, and once you've created this little VM, you can start thinking about how you can translate some common constructs into this machine. e.g. you might want to think about how you might translate a C if/else statement or while loop into this language.
Edit:
From the comments below, it sounds like you need a bit more experience with C before you can tackle this task.
What I would suggest is to first learn about the following topics:
scanf, printf, putchar, getchar - basic C IO functions
struct - the basic data structure in C
malloc - how to allocate memory, and the difference between stack memory and heap memory
linked lists - and how to implement a stack, then perhaps a binary tree (you'll need to
understand structs and malloc first)
Then it'll be good to learn a bit more about the string.h library as well
- strcmp, strdup - a couple useful string functions that will be useful.
In short, C has a much higher learning curve compared to python, just because it's a lower level language and you have to manage your own memory, so it's good to learn a few basic things about C first before trying to write an interpreter, even if you already know how to write one in python.
The only difference between an interpreter and a compiler is that instead of generating code from the AST, you execute it in a VM instead. Once you understand this, almost any compiler book, even the Red Dragon Book (first edition, not second!), is enough.
I see this is a bit of a late reply, however since this thread showed up at second place in the result list when I did a search for writing an interpreter and no one have mentioned anything very concrete I will provide the following example:
Disclaimer: This is just some simple code I wrote in a hurry in order to have a foundation for the explanation below and are therefore not perfect, but it compiles and runs, and seems to give the expected answers.
Read the following C-code from bottom to top:
#include <stdio.h>
#include <stdlib.h>
double expression(void);
double vars[26]; // variables
char get(void) { char c = getchar(); return c; } // get one byte
char peek(void) { char c = getchar(); ungetc(c, stdin); return c; } // peek at next byte
double number(void) { double d; scanf("%lf", &d); return d; } // read one double
void expect(char c) { // expect char c from stream
char d = get();
if (c != d) {
fprintf(stderr, "Error: Expected %c but got %c.\n", c, d);
}
}
double factor(void) { // read a factor
double f;
char c = peek();
if (c == '(') { // an expression inside parantesis?
expect('(');
f = expression();
expect(')');
} else if (c >= 'A' && c <= 'Z') { // a variable ?
expect(c);
f = vars[c - 'A'];
} else { // or, a number?
f = number();
}
return f;
}
double term(void) { // read a term
double t = factor();
while (peek() == '*' || peek() == '/') { // * or / more factors
char c = get();
if (c == '*') {
t = t * factor();
} else {
t = t / factor();
}
}
return t;
}
double expression(void) { // read an expression
double e = term();
while (peek() == '+' || peek() == '-') { // + or - more terms
char c = get();
if (c == '+') {
e = e + term();
} else {
e = e - term();
}
}
return e;
}
double statement(void) { // read a statement
double ret;
char c = peek();
if (c >= 'A' && c <= 'Z') { // variable ?
expect(c);
if (peek() == '=') { // assignment ?
expect('=');
double val = expression();
vars[c - 'A'] = val;
ret = val;
} else {
ungetc(c, stdin);
ret = expression();
}
} else {
ret = expression();
}
expect('\n');
return ret;
}
int main(void) {
printf("> "); fflush(stdout);
for (;;) {
double v = statement();
printf(" = %lf\n> ", v); fflush(stdout);
}
return EXIT_SUCCESS;
}
This is an simple recursive descend parser for basic mathematical expressions supporting one letter variables. Running it and typing some statements yields the following results:
> (1+2)*3
= 9.000000
> A=1
= 1.000000
> B=2
= 2.000000
> C=3
= 3.000000
> (A+B)*C
= 9.000000
You can alter the get(), peek() and number() to read from a file or list of code lines. Also you should make a function to read identifiers (basically words). Then you expand the statement() function to be able to alter which line it runs next in order to do branching. Last you add the branch operations you want to the statement function, like
if "condition" then
"statements"
else
"statements"
endif.
while "condition" do
"statements"
endwhile
function fac(x)
if x = 0 then
return 1
else
return x*fac(x-1)
endif
endfunction
Obviously you can decide the syntax to be as you like. You need to think about ways of define functions and how to handle arguments/parameter variables, local variables and global variables. If preferable arrays and data structures. References∕pointers. Input/output?
In order to handle recursive function calls you probably need to use a stack.
In my opinion this would be easier to do all this with C++ and STL. Where for example one std::map could be used to hold local variables, and another map could be used for globals...
It is of course possible to write a compiler that build an abstract syntax tree out of the code. Then travels this tree in order to produce either machine code or some kind of byte code which executed on a virtual machine (like Java and .Net). This gives better performance than naively parse line by line and executing them, but in my opinion that is NOT writing an interpreter. That is writing both a compiler and its targeted virtual machine.
If someone wants to learn to write an interpreter, they should try making the most basic simple and practical working interpreter.

How to parse lambda term

I would like to parse a lambda calculus. I dont know how to parse the term and respect parenthesis priority. Ex:
(lx ly (x(xy)))(lx ly xxxy)
I don't manage to find the good way to do this. I just can't see the adapted algorithm.
A term is represented by a structure that have a type (APPLICATION, ABSTRACTION, VARIABLE) and
a right and left component of type "struc term".
Any idea how to do this ?
EDIT
Sorry to disturb you again, but I really want to understand. Can you check the function "expression()" to let me know if I am right.
Term* expression(){
if(current==LINKER){
Term* t = create_node(ABSTRACTION);
get_next_symbol();
t->right = create_node_variable();
get_next_symbol();
t->left = expression();
}
else if(current==OPEN_PARENTHESIS){
application();
get_next_symbol();
if(current != CLOSE_PARENTHESIS){
printf("Error\n");
exit(1);
}
}
else if(current==VARIABLE){
return create_node_variable();
}
else if(current==END_OF_TERM)
{
printf("Error");
exit(1);
}
}
Thanks
The can be simplified by separating the application from other expressions:
EXPR -> l{v} APPL "abstraction"
-> (APPL) "brackets"
-> {v} "variable"
APPL -> EXPR + "application"
The only difference with your approach is that the application is represented as a list of expressions, because abcd can be implicitly read as (((ab)c)d) so you might at well store it as abcd while parsing.
Based on this grammar, a simple recursive descent parser can be created with a single character of lookahead:
EXPR: 'l' // read character, then APPL, return as abstraction
'(' // read APPL, read ')', return as-is
any // read character, return as variable
eof // fail
APPL: ')' // unread character, return as application
any // read EXPR, append to list, loop
eof // return as application
The root symbol is APPL, of course. As a post-parsing step, you can turn your APPL = list of EXPR into a tree of applications. The recursive descent is so simple that you can easily turn into an imperative solution with an explicit stack if you wish.

Way to pass argv[] to CreateProcess()

My C Win32 application should allow passing a full command line for another program to start, e.g.
myapp.exe /foo /bar "C:\Program Files\Some\App.exe" arg1 "arg 2"
myapp.exe may look something like
int main(int argc, char**argv)
{
int i;
for (i=1; i<argc; ++i) {
if (!strcmp(argv[i], "/foo") {
// handle /foo
} else if (!strcmp(argv[i], "/bar") {
// handle /bar
} else {
// not an option => start of a child command line
break;
}
}
// run the command
STARTUPINFO si;
PROCESS_INFORMATION pi;
// customize the above...
// I want this, but there is no such API! :(
CreateProcessFromArgv(argv+i, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi);
// use startup info si for some operations on a process
// ...
}
I can think about some workarounds:
use GetCommandLine()
and find a substring corresponding to argv[i]
write something similar to ArgvToCommandLine() also mentioned in another SO question
Both of them lengthy and re-implement cumbersome windows command line parsing logic, which is already a part of CommandLineToArgvW().
Is there a "standard" solution for my problem? A standard (Win32, CRT, etc.) implementation of workarounds counts as a solution.
It's actually easier than you think.
1) There is an API, GetCommandLine() that will return you the whole string
myapp.exe /foo /bar "C:\Program Files\Some\App.exe" arg1 "arg 2"
2) CreateProcess() allows to specify the command line, so using it as
CreateProcess(NULL, "c:\\hello.exe arg1 arg2 etc", ....)
will do exactly what you need.
3) By parsing your command line, you can just find where the exe name starts, and pass that address to the CreateProcess() . It could be easily done with
char* cmd_pos = strstr(GetCommandLine(), argv[3]);
and finally: CreateProcess(NULL, strstr(GetCommandLine(), argv[i]), ...);
EDIT: now I see that you've already considered this option. If you're concerned about performance penalties, they are nothing comparing to process creation.
The only standard function which you not yet included in your question is PathGetArgs, but it do not so much. The functions PathQuoteSpaces and PathUnquoteSpaces can be also helpful. In my opinion the usage of CommandLineToArgvW in combination with the with GetCommandLineW is what you really need. The usage of UNICODE during the parsing of the command line is in my opinion mandatory if you want to have a general solution.
I solved it as follows: With your Visual Studio install you can find a copy of some of the standard code used to create the C library. In particular if you look in VC\crt\src\stdargv.c you will find the implementation of "wparse_cmdline" function which creates argc and argv from the result of GetCommandLineW API. I created an augmented version of this code which also created a "cmdv" array of pointers which pointed back into the original string at the place where each argv pointer begins. You can then act on argv arguments as you wish, and when you want to pass the "rest" on to CreateProcess you can just pass in cmdv[i] instead.
This solution has the advantages that is uses the exact same parsing code, still provides argv as usual, and allows you to pass on the original without needing to re-quote or re-escape it.
I have faced the same problems with you. The thing is, we don't need to parse the whole string, if we can separate the result of GetCommandLine(), then you can put them together afterwards.
According to Microsoft's documentation, you should consider only backslashes and quotes.
You can find their documents here.
And, then, you can call Solve to get the next parameter start point.
E.g.
"a b c" d e
First Part: "a b c"
Next Parameter Start: d e
I solved the examples in Microsoft documentation, so do worry about the compatibility.
By calling the Solve function recursively, you can get the whole argv array.
Here's the file test.c
#include <stdio.h>
extern char* Solve(char* p);
void showString(char *str)
{
char *end = Solve(str);
char *p = str;
printf("First Part: ");
while(p < end){
fputc(*p, stdout);
p++;
}
printf("\nNext Parameter Start: %s\n", p + 1);
}
int main(){
char str[] = "\"a b c\" d e";
char str2[] = "a\\\\b d\"e f\"g h";
char str3[] = "a\\\\\\\"b c d";
char str4[] = "a\\\\\\\\\"b c\" d e";
showString(str);
showString(str2);
showString(str3);
showString(str4);
return 0;
}
Running result are:
First Part: "a b c"
Next Parameter Start: d e
First Part: a\\b
Next Parameter Start: d"e f"g h
First Part: a\\\"b
Next Parameter Start: c d
First Part: a\\\\"b c"
Next Parameter Start: d e
Here's all the source code of Solve function, file findarg.c
/**
This is a FSM for quote recognization.
Status will be
1. Quoted. (STATUS_QUOTE)
2. Normal. (STATUS_NORMAL)
3. End. (STATUS_END)
Quoted can be ended with a " or \0
Normal can be ended with a " or space( ) or \0
Slashes
*/
#ifndef TRUE
#define TRUE 1
#endif
#define STATUS_END 0
#define STATUS_NORMAL 1
#define STATUS_QUOTE 2
typedef char * Pointer;
typedef int STATUS;
static void MoveSlashes(Pointer *p){
/*According to Microsoft's note, http://msdn.microsoft.com/en-us/library/17w5ykft.aspx */
/*Backslashes are interpreted literally, unless they immediately precede a double quotation mark.*/
/*Here we skip every backslashes, and those linked with quotes. because we don't need to parse it.*/
while (**p == '\\'){
(*p)++;
//You need always check the next element
//Skip \" as well.
if (**p == '\\' || **p == '"')
(*p)++;
}
}
/* Quoted can be ended with a " or \0 */
static STATUS SolveQuote(Pointer *p){
while (TRUE){
MoveSlashes(p);
if (**p == 0)
return STATUS_END;
if (**p == '"')
return STATUS_NORMAL;
(*p)++;
}
}
/* Normal can be ended with a " or space( ) or \0 */
static STATUS SolveNormal(Pointer *p){
while (TRUE){
MoveSlashes(p);
if (**p == 0)
return STATUS_END;
if (**p == '"')
return STATUS_QUOTE;
if (**p == ' ')
return STATUS_END;
(*p)++;
}
}
/*
Solve the problem and return the end pointer.
#param p The start pointer
#return The target pointer.
*/
Pointer Solve(Pointer p){
STATUS status = STATUS_NORMAL;
while (status != STATUS_END){
switch (status)
{
case STATUS_NORMAL:
status = SolveNormal(&p); break;
case STATUS_QUOTE:
status = SolveQuote(&p); break;
case STATUS_END:
default:
break;
}
//Move pointer to the next place.
if (status != STATUS_END)
p++;
}
return p;
}
I think it's actually harder than you think for a general case.
See What's up with the strange treatment of quotation marks and backslashes by CommandLineToArgvW.
It's ultimately up to the individual programs how they tokenize the command-line into an argv array (and even CommandLineToArgv in theory (and perhaps in practice, if what one of the comments said is true) could behave differently than the CRT when it initializes argv to main()), so there isn't even a standard set of esoteric rules that you can follow.
But anyway, the short answer is: no, there unfortunately is no easy/standard solution. You'll have to roll your own function to deal with quotes and backslashes and the like.

Parsing some particular statements with antlr3 in C target

I have some questions about antlr3 with tree grammar in C target.
I have almost done my interpretor (functions, variables, boolean and math expressions ok) and i have kept the most difficult statements for the end (like if, switch, etc.)
1) I would like interpreting a simple loop statement:
repeat: ^(REPEAT DIGIT stmt);
I've seen many examples but nothing about the tree walker (only a topic here with the macros MARK() / REWIND(m) + #init / #after but not working (i've antlr errors: "unexpected node at offset 0")). How can i interpret this statement in C?
2) Same question with a simple if statement:
if: ^(IF condition stmt elseifstmt* elsestmt?);
The problem is to skip the statement if the condition is false and test the other elseif/else statements.
3) I have some statements which can stop the script (like "break" or "exit"). How can i interrupt the tree walker and skip the following tokens?
4) When a lexer or parser error is detected, antlr returns an error. But i would like to make my homemade error messages. How can i have the line number where parser crashed?
Ask me if you want more details.
Thanks you very much (and i apologize for my poor english)
About the repeat statement, i think i've found a way to do it. In antlr.org, i've found a complete interpreter for C-- language but made in Java.
I put here the while statement (a bit different but the way is the same):
whileStmt
scope{
Boolean breaked;
}
#after{
CommonTree stmtNode=(CommonTree)$whileStmt.start.getChild(1);
CommonTree exprNode=(CommonTree)$whileStmt.start.getChild(0);
int test;
$whileStmt::breaked=false;
while($whileStmt::breaked==false){
stream.push(stream.getNodeIndex(exprNode));
test=expr().value;
stream.pop();
if (test==0) break;
stream.push(stream.getNodeIndex(stmtNode));
stmt();
stream.pop();
}
}
: ^(WHILE . .)
;
I've tried to transform this code into C language:
repeat
scope {
int breaked;
int tours;
}
#after
{
int test;
pANTLR3_BASE_TREE repeatstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,1);
pANTLR3_BASE_TREE exprstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,0);
$repeat::breaked = 0;
test = 1;
while($repeat::breaked == 0)
{
TW_FOLLOWPUSH(exprstmt);
TW_FOLLOWPOP();
test++;
if(test == $repeat::tours)
break;
TW_FOLLOWPUSH(repeatstmt);
CTX->repeat(CTX);
TW_FOLLOWPOP();
}
}
: ^(REPEAT DIGIT stmt)
{
$repeat::tours = $DIGIT.text->toInt32($DIGIT.text);
}
But nothing happened (stmt is parsed juste once).
Do you have an idea about this please?
About the homemade errors messages, i've found the macro GETLINE() in the lexer. It works when the tree walker crashes but antlr continues to display errors messages for lexer or parser errors.
Thanks.

How to get entire input string in Lex and Yacc?

OK, so here is the deal.
In my language I have some commands, say
XYZ 3 5
GGB 8 9
HDH 8783 33
And in my Lex file
XYZ { return XYZ; }
GGB { return GGB; }
HDH { return HDH; }
[0-9]+ { yylval.ival = atoi(yytext); return NUMBER; }
\n { return EOL; }
In my yacc file
start : commands
;
commands : command
| command EOL commands
;
command : xyz
| ggb
| hdh
;
xyz : XYZ NUMBER NUMBER { /* Do something with the numbers */ }
;
etc. etc. etc. etc.
My question is, how can I get the entire text
XYZ 3 5
GGB 8 9
HDH 8783 33
Into commands while still returning the NUMBERs?
Also when my Lex returns a STRING [0-9a-zA-Z]+, and I want to do verification on it's length, should I do it like
rule: STRING STRING { if (strlen($1) < 5 ) /* Do some shit else error */ }
or actually have a token in my Lex that returns different tokens depending on length?
If I've understood your first question correctly, you can have semantic actions like
{ $$ = makeXYZ($2, $3); }
which will allow you to build the value of command as you want.
For your second question, the borders between lexical analysis and grammatical analysis and between grammatical analysis and semantic analysis aren't hard and well fixed. Moving them is a trade-off between factors like easiness of description, clarity of error messages and robustness in presence of errors. Considering the verification of string length, the likelihood of an error occurring is quite high and the error message if it is handled by returning different terminals for different length will probably be not clear. So if it is possible -- that depend on the grammar -- I'd handle it in the semantic analysis phase, where the message can easily be tailored.
If you arrange for your lexical analyzer (yylex()) to store the whole string in some variable, then your code can access it. The communication with the parser proper will be through the normal mechanisms, but there's nothing that says you can't also have another variable lurking around (probably a file static variable - but beware multithreading) that stores the whole input line before it is dissected.
As you use yylval.ival you already have union with ival field in your YACC source, like this:
%union {
int ival;
}
Now you specify token type, like this:
%token <ival> NUMBER
So now you can access ival field simply for NUMBER token as $1 in your rules, like
xyz : XYZ NUMBER NUMBER { printf("XYZ %d %d", $2, $3); }
For your second question I'd define union like this:
%union {
char* strval;
int ival;
}
and in you LEX source specify token types
%token <strval> STRING;
%token <ival> NUMBER;
So now you can do things like
foo : STRING NUMBER { printf("%s (len %d) %d", $1, strlen($1), $2); }

Resources