Why Use Lexical Analyzers? - lexical-analysis

I'm building my own language using Flex, but I want to know some things:
Why should I use lexical analyzers?
Are they going to help me in something?
Are they obligatory?

Lexical analysis helps simplify parsing because the lexemes can be treated as abstract entities rather than concrete character sequences.
You'll need more than flex to build your language, though: Lexical analysis is just the first step.

Any time you are converting an input string into space-separated strings and/or numeric values, you are performing lexical analysis. Writing a cascading series of else if (strcmp (..)==0) ... statements counts as lexical analysis. Even such nasty tools as sscanf and strtok are lexical analysis tools.
You'd want to use a tool like flex instead of one of the above for one of several reasons:
The error handling can be made much better.
You can be much more flexible in what different things you recognize with flex. For instance, it is tough to parse a C-format hexidecimal value properly with scanf routines. scanf pretty much has to know the hex value is comming. Lex can figure it out for you.
Lex scanners are faster. If you are parsing a lot of files, and/or large ones, this could become important.

You would consider using a lexical analyzer because you could use BNF (or EBNF) to describe your language (the grammar) declaratively, and then just use a parser to parse a program written in your language and get it in a structure in memory and then manipulate it freely.
It's not obligatory and you can of course write your own, but that depends on how complex the language is and how much time you have to reinvent the wheel.
Also, the fact that you can use a language (BNF) to describe your language without changing the lexical analyzer itself, enables you to make many experiments and change the grammar of your language until you have exactly what it works for you.

Related

How should I parse keywords when writing a C Compiler?

I am currently in the process of writing a C to Assembly compiler, it is not meant to be practical, but I would like to do it for the educational value. I was wondering when I am testing for keywords, is there any more efficient way rather than just reading in the next word in the file and then running it through a bunch of nested if statements that test for the keywords. Is there any better way?
Your question is actually quite specific. You are asking about how to build the lexical analyzer, also known as the scanner, and how to efficiently and conveniently recognize keywords. The scanner is the first phase of a typical compiler, and it converts the source code, which is a sequence of characters, to a sequence of tokens, where a token is a unit such as a number, an operator or a keyword.
Since keywords match the pattern for general identifiers, a common trick is to put all the keywords in the symbol table, together with information that it is a keyword. Then, when the scanner finds an identifier, it as usual searches the symbol table to see if that identifier has been seen before. If this identifier was a kewyord, it will be found, together with the information about which keyword it is.
Are you doing this for part of a class? If so, there should be guidelines on parsing and lexing. If not, you're in for a lot of work!
Writing an actual compiler is much more complicated than just going through a bunch of if statements, because you need to keep track of the environment. You'll need to think about how you allow classes, functions, function calls, class instantiations, recursive functions... the list goes on.
Take a look at course lectures from UC Berkeley on the subject, i.e. parsing, lexing, code generation, and the tools you'll need:
http://www-inst.eecs.berkeley.edu/~cs164/fa13/
Note that this course in particular used C++ to write a Python2.5 to Assembly compiler, but the concepts in the Lectures and Readings and some of the tools are not language-restricted.
Keywords (instead of tokens in general) is a closed set, for which it's practical to generate a collision free hash function. Because the set is small, it's not even necessary to have a minimum hash function.
You can do it with a bunch of if - else if statements and strcmp(). However, writing statements for all of the keywords gets annoying very quickly. You'd be better off using a hash table - at the start of the compilation you put all keywords in the table and then you do lookups as needed. The drawback of this is that if you have to use C, you'll also have to write your own hash table (or use one from a library). If you can use C++, though, then you can use a map or an unordered_map from the STL. In any case, if you are worried about the performance, as someone else mentioned, it won't be a bottle neck.

What parser-generators with code separation and language extensibility would you recommend?

I'm looking for a context-free grammar parser generator with grammar/code separation and a possibility to add support for new target languages. For instance if I want parser in Pascal, I can write my own pascal code generator without reimplementing the whole thing.
I understand that most open-source parser generators can in theory be extended, still I'd prefer something that has extendability planned and documented.
Feature-wise I need the parser to at least support Python-style indentation, maybe with some additional work. No requirement on the type of parser generated, but I'd prefer something fast.
Which are the most well-known/maintained options?
Popular parser-generators seem to mostly use mixed grammar/code approach which I really don't like. Comparison list on Wikipedia lists a few but I'm a novice at this and can't tell which to try.
Why I don't like mixing grammar/code: because this approach seems like a mess. Grammar is grammar, implementation details are implementation details. They're different things written in different languages, it's intuitive to keep them in separate places.
What if I want to reuse parts of grammar in another project, with different implementation details? What if I want to compile a parser in a different language? All of this requires grammar to be kept separate.
Most parser generators won't handle context-free grammars. They handle some subset (LL(1), LL(k), LL(*), LALR(1), LR(k), ...). If you choose one of these, you will almost certainly have to hack your grammar to match the limitations of the parser generator (no left recursion, limited lookahead, ...). If you want a real context free parser generator you want an Early parser generator (inefficient), a GLR parser generator (the most practical of the lot), or a PEG parser generator (and the last isn't context-free; it requires rules to be ordered to determine which ones take precedence).
You seem to be worried about mixing syntax and parser-actions used to build the trees.
If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.
Another way is to give each rule a name (or some unique identifier), and set the tree-building machinery off to the side indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names, and anytime you have a few hundred names that's inconvenient by itself and it is hard to make them mnemonic.
A third way is to make the a function of the syntax, and auto-generate the tree building steps. This requires no extra stuff off to the side at all to produce the ASTs. The only tool I know that does it (there may be others but I've been looking for 20 odd years and haven't seen one) is my company's product,, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes it handles Python style indents].
One objection is that such trees are concrete, bloated and confusing; if done right, that's not true.
My SO answer to this question:
What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree? discusses how we get the benefits of ASTs from automatically generated compressed CSTs.
The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not so good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis,...) and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem a number of ways:
By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.
While DMS allows rules to have names, this is only for convenient access by procedural code, not associating other mechanisms with the rule.
DMS provides a third way to associate these mechanisms (esp. attribute grammar computations) by using the rule itself as a kind of giant name. So, you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that a (rule,associated-computation) has a valid rule in the base grammar, so it makes it relatively each to track down what needs fixing when the base grammar changes.
This being my tool (I'm the architect) you shouldn't take this as a recommendation, just a bias. That bias is supported by DMS's ability to parse (without whimpering) C, C++, Java, C#, IBM Enterprise COBOL, Python, F77/F90/F95 with column6 continues/F90 continues and embedded C preprocessor directives to boot under most circumstances), Mumps, PHP4/5 and many other languages.
First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...
Secondly, I don't think you really need the parser itself to be "extensible" do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser-generator can do that.
Thirdly, you don't really say what about the mix between grammar and code you don't like. Would you rather it be all implemented in a meta-language (kinda tough), or all in code?
Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.
The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...
Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inhernetly good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. Its a tremendous problem if you are parsing hundreds of files at a go.

How to define grammar which excludes a certain set of words?

I have built a small code for static analysis of C code. The purpose of building it is to warn users about the use of methods such as strcpy() which could essentially cause buffer overflows.
Now, to formalise the same, I need to write a formal Grammar which shows the excluded libraries as NOT a part of the allowed set of accepted library methods used.
For example,
AllowedSentence->ANSI C Permitted Code, NOT UnSafeLibraryMethods
UnSafeLibraryMethods->strcpy|other potentially unsafe methods
Any ideas on how this grammar can be formalised?
I think, this should not be done at the grammar level. It should be a rule that is applied to the parse tree after parsing is done.
You hardly need a parser for the way you have posed the problem. If your only goal is to object to the presence of certain identifiers ("strcpy"), you can simply build a lexer that processes C and picks identifiers. Special lexemes can recognize your list of "you shouldn't use this". This way you use positive recognition instead of negative recognition to pick out the identifiers that you belive to be trouble.
If you want a more sophisticated analaysis tool, you'll likely want to parse C, an name-resolve the identifers to their actual definitisn, then the scan the tree looking for identifiers that are objectionable. This will at least let you decide if the identifier is actually defined by the user, or comes from some known library; surely, if my code defines strcpy, you shouldn't complain unless you know my strcpy is defective somehow.

What libraries would be useful for implementing a small language interpreter in C?

For my own learning experience, I want to try writing an interpreter for a simple programming language in C – the main thing I think I need is a hash table library, but a general purpose collection of data structures and helper functions would be pretty helpful. What would you guys recommend?
libbasekit - by the author of Io. You can also use libcoroutine.
One library I recommend looking into is libgc, a garbage collector for C.
You use it by replacing calls to malloc, realloc, strdup, etc. with their libgc counterparts (e.g. GC_MALLOC). It works by scanning the stack, global variables, and GC-allocated blocks, looking for numbers that might be pointers. Believe it or not, it actually performs quite well (almost on par with the very good ptmalloc, which is the default (non-garbage collected) malloc implementation in GNU/Linux), and a lot of programs use it (including Mono and GCJ). A disadvantage, though, is it might not play well with other libraries you may want to use, and you may even have to recompile some of them by hand to replace calls to malloc with GC_MALLOC.
Honestly - and I know some people will hate me for it - but I recommend you use C++. You don't have to bust a gut to learn it just to be able to start your project. Just use it like C, but in an hour you can learn how to use std::map<> (an associative container), std::string for easy textual data handling, and std::vector<> for a resizable heap-allocated array. If you want to spend an extra hour or two, learn to put member functions in classes (don't worry about polymorphism, virtual functions etc. to begin with), and you'll get a more organised program.
You need no more than the standard library for a suitably small language with simple constructs. The most complex part of an interpreted language is probably expression evaluation. For both that, procedure-calling, and construct-nesting you will need to understand and implement stack data structures.
The code at the link above is C++, but the algorithm is described clearly and you could re-implement it easily in C. There again there are few valid arguments for not using C++ IMO.
Before diving into what libraries to use I suggest you learn about grammars and compiler design. Especially input parsing is for compilers and interpreters similar, that is tokenizing and parsing. The process of tokenizing converts a stream characters (your input) into a stream of tokens. A parser takes this stream of tokens and matches it with your grammar.
You don't mention what language you're writing an interpreter for. But very likely that language contains recursion. In that case you need to use a so-called bottom-up parser which you cannot write by hand but needs to be generated. If you try write such a parser by hand you will end up with a error-prone mess.
If you're developing for a posix platform then you can use lex and yacc. These tools are a bit old but very powerful for building parsers. Lex can generate code that implements the tokenizing process and yacc can generate a bottom-up parser.
My answer probably raises more questions than it answers. That's because the field of compilers/interpreters is quite complex and cannot simply be explained in a short answer. Just get a good book on compiler design.

Need a way to parse algebraic expressions in C

I need to parse algebraic expressions for an application I'm working on and am hoping to garnish a bit of collective wisdom before taking a crack at it and, possibly, heading down the wrong road.
What I need to do is pretty straight forward: given a textual algebraic expression (3*x - 4(y - sin(pi))) create a object representation of the equation. The custom objects already exist, so I need a parser that creates a tree I can walk to instantiate the objects I need.
The basic requirements would be:
Ability to express the algebra as a grammar so I have control and can customize/extend it as necessary.
The initial syntax will include integers, real numbers, constants, variables, arithmetic operators (+, - , *, /), powers (^), equations (=), parenthesis, precedence, and simple functions (sin(pi)). I'm hoping to extend my app fairly quickly to support functions proper (f(x) = 3x +2).
Must compile in C as it needs to be integrated into my code.
I DON'T need to evaluate the expression mathematically, so software that solves for a variable or performs the arithmetic is noise.
I've done my Google homework and it looks like the best approach is to use a BNF grammar and software to generate a compiler in C. So my questions:
Does a BNF grammar with corresponding parser generator for algebraic expressions (or better yet, LaTex) already exist? Someone has to have done this already. I REALLY want to avoid rolling my own, mainly because I don't want to test it. I'd be willing to pay a reasonable amount for a library (under $50)
If not, which parser generator for C do you think is the easiest to learn/use here? Lex? YACC? Flex, Bison, Python/SymPy, Others? I'm not familiar with any of these.
The standard Linux tools flex and bison would probably be most appropriate here. IIRC the sample parsers and lexers used in these tools do something close to what you want, so you might be able to just modify that code to get what you need.
These tools seem like they meet your specifications. You can customize the grammars, compile down to C, and use any operator you want.
I've had very good luck with ANTLR. It has runtimes for many different languages, including C, and has a very nice syntax for specifying grammars and building trees. I recently wrote a similar grammar (algebraic expressions) in 131 lines, which is definitely manageable.
I used the code (found on the net) from the following:
Program Translation Fundamentals" by Peter Calingaert
I enhanced it to handle functions, which lets you implement things like "if(a, b, c)" (kind of like how Excel does things).
you can build simple parser yourself or use any of popular "compiler-compiler" (some of them were listed by other posts). just decide if your parser will be complicated enough to use (and learn) an external tool. in any case you'll need to define the grammar, usually it's the most brain intensive task if you don't have prior experience. the formal way to define syntactic grammars is BNF or EBNF

Resources