Synchronize flex states with bison (HTTP parser) - c

I am trying to write an HTTP parser using flex and bison. My first thought was to lex only the most basic tokens and let the parser combine them. For example, in the HTTP header there would be tokens like GET, POST or PUT as well as URL and HTTP_VERSION. The problem is that it is perfectly valid for a URL to contain, e.g., the string "POST", and the lexer cannot differentiate between the "POST" inside a URL token and the "POST" in an actual POST token.
A solution to that problem would be to use flex start condition states to enable and disable specific patterns depending on what is expected. After finding a GET, POST, etc. token in the INITIAL state, the lexer switches to another state where it only matches the URL. This solution does not feel right, though. It feels like implementing logic in the lexer that belongs in the parser (things like "there is exactly one GET, POST, ... token followed by exactly one URL").
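For reference, a minimal sketch of that start-condition approach might look like the following. The token names METHOD, URL, HTTP_VERSION and EOL, and the header parser.tab.h, are assumptions for illustration only, not from any particular project:

%{
/* Sketch only: METHOD, URL, HTTP_VERSION and EOL are hypothetical token
 * names assumed to be defined in a bison-generated header. */
#include "parser.tab.h"
%}
%option noyywrap
%x REQ_URL REQ_VERSION

%%
<INITIAL>GET|POST|PUT|DELETE|HEAD   { BEGIN(REQ_URL); return METHOD; }
<REQ_URL>[^ \t\r\n]+                { BEGIN(REQ_VERSION); return URL; }
<REQ_VERSION>HTTP\/[0-9]\.[0-9]     { BEGIN(INITIAL); return HTTP_VERSION; }
<*>[ \t]+                           { /* skip the separating spaces */ }
<*>\r?\n                            { BEGIN(INITIAL); return EOL; }
%%

With this setup the URL rule happily matches a target that contains the text "POST", because the INITIAL rules are switched off at that point.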
That brings me to my question:
Is there a way to "synchronize" between a bison parser and a flex lexer, so that I can change the lexer state depending on what bison expects to be the next token?
P.S.:
I don't know much about grammar classes and the theory behind parsing languages. It feels to me that parsing an HTTP header does not even require a context-free parser like bison. Is HTTP a regular language, so that I could parse it with a regular expression?

Using Bison to parse a HTTP header is not just overkill; it's a mismatch between tool and problem. It's like trying to write an essay using Photoshop. Photoshop is a highly sophisticated tool which, in the hands of a skilled operator, can make any image look beautiful, and in a certain sense, an essay is an image (or a series of images). But the task is ridiculous. Sure, Photoshop has text blocks. But you don't want to concentrate on every rectangular block of text. You want to write, letting the words flow from page to page, something which is outside of Photoshop's model, regardless of how sophisticated the tool is.
Bison and Flex are designed to parse a language with a complex syntax, in which the input can first be separated into lexical units (tokens) without regard to their syntactic context. If you find yourself asking questions like "how can I communicate the nature of the expected input to the lexer?", then you are probably using the wrong tool. Certainly, Flex is a powerful tool and it has features which let it adapt to minor variations in the lexical context. But these should be exceptional cases.
Protocols like HTTP are designed to be easy to analyse. You need neither a parser nor a lexer to see which of a handful of possible verbs is at the start of a line, or to extract the following string up to the first space character. You don't need to do sophisticated error analysis and recovery, and you shouldn't even try because it will leak information and could open a vulnerability. If a character can't be parsed, just send a 400 and stop reading.
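As a very rough illustration of that point, here is a hand-rolled request-line check in plain C; the function name, buffer sizes and verb list are placeholders, and anything that does not fit the expected shape is treated as "send a 400 and stop":

#include <stdio.h>
#include <string.h>

/* Hypothetical request-line parser: returns 0 on success, -1 for "send 400". */
static int parse_request_line(const char *line,
                              char *method, size_t msz,
                              char *target, size_t tsz)
{
    static const char *verbs[] = { "GET", "POST", "PUT", "DELETE", "HEAD", NULL };
    const char *sp = strchr(line, ' ');
    if (!sp)
        return -1;                      /* no space after the verb: 400 */

    size_t mlen = (size_t)(sp - line);
    int known = 0;
    for (int i = 0; verbs[i]; i++)
        if (mlen == strlen(verbs[i]) && memcmp(line, verbs[i], mlen) == 0)
            known = 1;
    if (!known || mlen + 1 > msz)
        return -1;
    memcpy(method, line, mlen);
    method[mlen] = '\0';

    const char *start = sp + 1;
    const char *end = strchr(start, ' ');
    if (!end)
        return -1;                      /* missing "HTTP/x.y" part: 400 */
    size_t tlen = (size_t)(end - start);
    if (tlen == 0 || tlen + 1 > tsz)
        return -1;
    memcpy(target, start, tlen);
    target[tlen] = '\0';
    return 0;
}

int main(void)
{
    char method[16], target[256];
    if (parse_request_line("POST /submit?name=POST HTTP/1.1",
                           method, sizeof method, target, sizeof target) == 0)
        printf("%s %s\n", method, target);   /* prints: POST /submit?name=POST */
    return 0;
}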
Good tools simplify the solution of problems within the domain for which they were designed. Part of the skillset of a good engineer is recognising which tools are appropriate to a given problem. In this case, Bison/Flex is not the tool.
And I'm not sure that regular expressions help much either, because the main challenge in HTTP parsing is dealing with an input stream which is intermittent, asynchronous, somewhat unreliable, and subject to attack. Like Flex, most regex libraries expect a single input string, already clearly terminated, which is not the case for a protocol transmitted over the internet.
It is possible to construct a backchannel from a Bison parser to a Flex scanner, but this is not the place to try it. So I'm disinclined to try to explain the techniques within this context.

HTTP is a regular language. We can see that from the fact that it can be parsed with a finite state machine. In theory it should be possible (don't quote me on that) to match an entire HTTP message with a regular expression. The problem with this parsing approach is that common regex languages don't have sophisticated capture options. Often we can capture single submatches and store them for later use, but there is nothing like the callback-like approach in bison, where one can specify arbitrary actions based on the recognition of some subpattern.
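For example, with POSIX regcomp/regexec one can capture the three request-line fields as submatches, but that is about the extent of it; there is no hook for running an action per subpattern. This is only an illustrative sketch:

#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* Capture method, target and version as submatches; there is no way to
     * attach a per-subpattern action the way a bison rule action would. */
    regex_t re;
    regmatch_t m[4];
    const char *line = "GET /index.html HTTP/1.1";

    if (regcomp(&re, "^([A-Z]+) ([^ ]+) HTTP/([0-9]\\.[0-9])$", REG_EXTENDED) != 0)
        return 1;
    if (regexec(&re, line, 4, m, 0) == 0)
        printf("method=%.*s target=%.*s version=%.*s\n",
               (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so,
               (int)(m[2].rm_eo - m[2].rm_so), line + m[2].rm_so,
               (int)(m[3].rm_eo - m[3].rm_so), line + m[3].rm_so);
    regfree(&re);
    return 0;
}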
The reason that bison and flex are not the right tools for this problem (as stated in other answers) lies in a very fundamental property that distinguishes the HTTP language from the typical programming languages for which flex and bison are usually used:
A programming language consists of an alphabet of tokens. That means, e.g., a keyword can be thought of as a single element of that alphabet, and the input stream of characters can be unambiguously converted into a stream of tokens. This is exactly the purpose of the lexer, which acts as a translator from one alphabet to another. There are situations where the lexer additionally acts as a parser and requires more internal logic than simply translating input sequences to tokens. One of the most common of these situations is comments: a keyword inside a comment no longer emits a token.
For HTTP, on the other hand, the alphabet is just the input characters, so it does not make much sense to use a lexer. The problem the OP mentioned about the communication between flex and bison stems from trying to split the parsing of the language (which is a single problem) into two parts. Flex is not a parser and is not suitable for parsing. The solution is simply to drop the lexer stage, as it is not needed: bison is perfectly capable of producing a parser for a regular language and can be used to parse HTTP.
That brings me to the last point. Even if bison is capable of generating a parser for a regular language, it is not the most suitable tool. Bison generates parsers for context-free languages, which are parsed using a finite state machine with a stack. For parsing a regular language the stack is not needed, as there are no nested structures that need to be "remembered". Thus an HTTP parser generated by bison will most likely not be as optimized as one produced by a generator designed for regular languages. Sadly, I could not find a pure regular-language parser generator, but such a thing would be the preferable tool for generating an HTTP parser.

Related

C Language. How to use a string value as delimiter in SSCANF

Is there a way to use a string as a delimiter?
We can use characters as delimiters using sscanf();
Example
I have
char url[] = "username=jack&pwd=jack123&email=jack#example.com";
I can use:
char username[100], pwd[100], email[100];
sscanf(url, "username=%[^&]&pwd=%[^&]&email=%[^\n]", username, pwd, email);
It works fine for this string, but for
char url[] = "username=jack&jill&pwd=jack&123&email=jack#example.com";
it can't be used (the goal is to guard against SQL injection). I want to learn a trick to use the strings
"&pwd" and "&email" as delimiters, not necessarily with sscanf.
Update: The solution doesn't necessarily need to be in C. I only want to know of a way to use a string as a delimiter.
Just code your own parsing. In many cases, representing in memory the AST you have parsed is useful. But do specify and document your input language (perhaps using EBNF notation).
Your input language (which you have not defined in your question) seems to be similar to the MIME type application/x-www-form-urlencoded used in HTTP POST requests. So you might look, at least for inspiration, into the source code of free software libraries related to HTTP server processing (like libonion) and HTTP client processing (like libcurl).
You could read an entire line with getline (or perhaps fgets) then parse it appropriately. sscanf with %n, or strtok, might be useful, but you can also parse the line "manually" (consider writing, e.g., your own recursive descent parser). You might also use strchr or strstr.
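As a rough sketch of the strstr route for the question above (no particular library assumed), you can cut the buffer at the literal substrings "&pwd=" and "&email=", which still works when a stray '&' appears inside a value:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Writable copy, because we overwrite the delimiter positions with '\0'. */
    char url[] = "username=jack&jill&pwd=jack&123&email=jack#example.com";

    char *p = strstr(url, "&pwd=");
    char *e = strstr(url, "&email=");
    if (!p || !e || e < p)
        return 1;                       /* unexpected layout: give up */

    *p = '\0';                          /* terminate the username part */
    *e = '\0';                          /* terminate the pwd part */

    const char *username = url + strlen("username=");
    const char *pwd      = p + strlen("&pwd=");
    const char *email    = e + strlen("&email=");

    printf("username=%s pwd=%s email=%s\n", username, pwd, email);
    /* prints: username=jack&jill pwd=jack&123 email=jack#example.com */
    return 0;
}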
BTW, in many cases, using common textual representations like JSON, YAML, XML can be helpful, and you can easily find many libraries to handle them.
Notice also that strings can be processed as FILE* by using fmemopen and/or open_memstream.
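For instance, a minimal sketch, assuming a POSIX system where fmemopen is available:

#define _POSIX_C_SOURCE 200809L   /* for fmemopen on glibc */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[] = "username=jack&pwd=secret";
    FILE *f = fmemopen(buf, strlen(buf), "r");   /* read the string as a stream */
    if (!f)
        return 1;

    char field[64];
    while (fscanf(f, "%63[^&]", field) == 1) {
        printf("field: %s\n", field);
        fgetc(f);                /* consume the '&' separator, if present */
    }
    fclose(f);
    return 0;
}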
You could use parser generators such as bison (with flex).
In some cases, regular expressions could be useful. See regcomp and friends.
So what you want to achieve is quite easy to do and standard practice. But you need more than just sscanf, and you may want to combine several things.
Many external libraries (e.g. glib from GTK) provide some parsing. And you should care about UTF-8 (today, you have UTF-8 everywhere).
On Linux, if permitted to do so, you might use GNU readline instead of getline when you want interactive input (with editing abilities and autocompletion). Then take inspiration from the source code of GNU bash (or of RefPerSys, if interested by C++).
If you are unfamiliar with usual parsing techniques, read a good book such as the Dragon Book. Most large programs deal somewhere with parsing, so you need to know how that can be done.

Multiple start points for Bison grammar/parser

OK, so I have a complete (and working) Bison grammar.
The thing is I want to be able to set another starting point (%start) if I wish.
How is this doable, without having to create a separate grammar/parser?
I'm going to try to put together a version of yacc that does this. There is one complication that makes this not as trivial as it seems: the question of what constitutes an "end" symbol. The kind of place where this is of greatest use is in processing chunks in mid-stream (Knuth's TeX processor for [c]Web does this, for instance). Along these lines, another example where this can be used is in providing a unified parser for both the pre-processing layer and language layer and in processing individual macros themselves as entire parsing units (as well as being able to account for which macro bodies are common syntactic units like "expression" or "statement" and which are not).
In those kinds of applications, there is no natural "end" symbol to mark off the boundary of a segment for parsing. Normally, the LR method requires this in order to recognize when to take the "accept" action. Otherwise, you have accept-reduce (and even accept-shift) conflicts to contend with!
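In the meantime, the workaround usually suggested is a pseudo-token that the lexer hands out on its first call, selecting the effective start symbol for that parse. A minimal sketch, with all token and rule names invented for illustration:

%{
/* Sketch of the pseudo-token workaround for multiple start symbols. */
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "error: %s\n", s); }
int start_token;            /* set by the caller before yyparse() */
%}

%token START_EXPR START_STMT NUMBER IDENT

%%

start
    : START_EXPR expr       { puts("parsed as expression"); }
    | START_STMT stmt       { puts("parsed as statement"); }
    ;

expr
    : NUMBER
    | expr '+' NUMBER
    ;

stmt
    : IDENT '=' expr ';'
    ;

%%

int yylex(void)
{
    if (start_token) {      /* hand out the selector exactly once */
        int t = start_token;
        start_token = 0;
        return t;
    }
    /* ... real scanning goes here ... */
    return 0;
}

This sidesteps rather than solves the "end symbol" question above: each sub-grammar still needs some way to know where its input stops.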

What parser-generators with code separation and language extensibility would you recommend?

I'm looking for a context-free grammar parser generator with grammar/code separation and the possibility to add support for new target languages. For instance, if I want a parser in Pascal, I can write my own Pascal code generator without reimplementing the whole thing.
I understand that most open-source parser generators can in theory be extended; still, I'd prefer something that has extensibility planned and documented.
Feature-wise I need the parser to at least support Python-style indentation, maybe with some additional work. No requirement on the type of parser generated, but I'd prefer something fast.
Which are the most well-known/maintained options?
Popular parser-generators seem to mostly use mixed grammar/code approach which I really don't like. Comparison list on Wikipedia lists a few but I'm a novice at this and can't tell which to try.
Why I don't like mixing grammar/code: because this approach seems like a mess. Grammar is grammar, implementation details are implementation details. They're different things written in different languages, it's intuitive to keep them in separate places.
What if I want to reuse parts of grammar in another project, with different implementation details? What if I want to compile a parser in a different language? All of this requires grammar to be kept separate.
Most parser generators won't handle arbitrary context-free grammars. They handle some subset (LL(1), LL(k), LL(*), LALR(1), LR(k), ...). If you choose one of these, you will almost certainly have to hack your grammar to match the limitations of the parser generator (no left recursion, limited lookahead, ...). If you want a true context-free parser generator, you want an Earley parser generator (inefficient), a GLR parser generator (the most practical of the lot), or a PEG parser generator (and the last isn't context-free; it requires rules to be ordered to determine which ones take precedence).
You seem to be worried about mixing syntax and parser-actions used to build the trees.
If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.
Another way is to give each rule a name (or some unique identifier), and set the tree-building machinery off to the side indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names, and anytime you have a few hundred names that's inconvenient by itself and it is hard to make them mnemonic.
A third way is to make the tree a direct function of the syntax and auto-generate the tree-building steps. This requires no extra machinery off to the side at all to produce the ASTs. The only tool I know that does it (there may be others, but I've been looking for 20-odd years and haven't seen one) is my company's product, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes, it handles Python-style indents.]
One objection is that such trees are concrete, bloated and confusing; if done right, that's not true.
My SO answer to this question:
What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree? discusses how we get the benefits of ASTs from automatically generated compressed CSTs.
The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not so good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis,...) and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem a number of ways:
By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.
While DMS allows rules to have names, this is only for convenient access by procedural code, not associating other mechanisms with the rule.
DMS provides a third way to associate these mechanisms (esp. attribute grammar computations) by using the rule itself as a kind of giant name. So, you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that a (rule, associated-computation) pair has a valid rule in the base grammar, so it makes it relatively easy to track down what needs fixing when the base grammar changes.
This being my tool (I'm the architect), you shouldn't take this as a recommendation, just a bias. That bias is supported by DMS's ability to parse (without whimpering) C, C++, Java, C#, IBM Enterprise COBOL, Python, F77/F90/F95 (with column-6 continuations, F90 continuations, and embedded C preprocessor directives to boot, under most circumstances), Mumps, PHP4/5 and many other languages.
First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...
Secondly, I don't think you really need the parser itself to be "extensible" do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser-generator can do that.
Thirdly, you don't really say what about the mix between grammar and code you don't like. Would you rather it be all implemented in a meta-language (kinda tough), or all in code?
Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.
The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...
Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inherently good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. It's a tremendous problem if you are parsing hundreds of files at a go.

How to define grammar which excludes a certain set of words?

I have built a small tool for static analysis of C code. Its purpose is to warn users about the use of functions such as strcpy(), which could essentially cause buffer overflows.
Now, to formalise this, I need to write a formal grammar which shows the excluded library methods as NOT part of the allowed set of accepted library methods.
For example,
AllowedSentence->ANSI C Permitted Code, NOT UnSafeLibraryMethods
UnSafeLibraryMethods->strcpy|other potentially unsafe methods
Any ideas on how this grammar can be formalised?
I think this should not be done at the grammar level. It should be a rule that is applied to the parse tree after parsing is done.
You hardly need a parser for the way you have posed the problem. If your only goal is to object to the presence of certain identifiers ("strcpy"), you can simply build a lexer that processes C and picks out identifiers. Special lexemes can recognize your list of "you shouldn't use this". This way you use positive recognition instead of negative recognition to pick out the identifiers that you believe to be trouble.
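A minimal flex sketch of that positive-recognition idea might look like this; the blacklist and the output format are, of course, placeholders:

%{
/* Sketch: report only identifiers on a blacklist; everything else is ignored. */
#include <stdio.h>
%}
%option noyywrap yylineno

ID      [A-Za-z_][A-Za-z0-9_]*

%%
"strcpy"|"strcat"|"gets"|"sprintf"   { printf("line %d: unsafe call to %s\n",
                                              yylineno, yytext); }
{ID}                    { /* any other identifier: ignore */ }
\"([^"\\\n]|\\.)*\"     { /* skip string literals so "strcpy" in a string is not flagged */ }
"/*"([^*]|\*+[^*/])*\*+"/"   { /* skip C comments */ }
.|\n                    { /* ignore everything else */ }
%%

int main(void)
{
    yylex();            /* scan stdin */
    return 0;
}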
If you want a more sophisticated analysis tool, you'll likely want to parse C, name-resolve the identifiers to their actual definitions, and then scan the tree looking for identifiers that are objectionable. This will at least let you decide whether the identifier is actually defined by the user or comes from some known library; surely, if my code defines strcpy, you shouldn't complain unless you know my strcpy is defective somehow.

Why Use Lexical Analyzers?

I'm building my own language using Flex, but I want to know some things:
Why should I use lexical analyzers?
Are they going to help me in something?
Are they obligatory?
Lexical analysis helps simplify parsing because the lexemes can be treated as abstract entities rather than concrete character sequences.
You'll need more than flex to build your language, though: Lexical analysis is just the first step.
Any time you are converting an input string into space-separated strings and/or numeric values, you are performing lexical analysis. Writing a cascading series of else if (strcmp (..)==0) ... statements counts as lexical analysis. Even such nasty tools as sscanf and strtok are lexical analysis tools.
You'd want to use a tool like flex instead of one of the above for one of several reasons:
The error handling can be made much better.
You can be much more flexible in what different things you recognize with flex. For instance, it is tough to parse a C-format hexadecimal value properly with the scanf routines; scanf pretty much has to know that a hex value is coming. Lex can figure it out for you (see the sketch after this list).
Lex scanners are faster. If you are parsing a lot of files, and/or large ones, this could become important.
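As an illustration of the second point, here is a tiny flex sketch in which the scanner itself decides whether it is looking at a hex constant, a decimal constant or a word, with no advance knowledge required from the caller (a sketch only, not production code):

%{
/* The rules distinguish hex, decimal and identifiers by pattern alone. */
#include <stdio.h>
#include <stdlib.h>
%}
%option noyywrap

%%
0[xX][0-9a-fA-F]+       { printf("hex     %ld\n", strtol(yytext, NULL, 16)); }
[0-9]+                  { printf("decimal %ld\n", strtol(yytext, NULL, 10)); }
[A-Za-z_][A-Za-z0-9_]*  { printf("word    %s\n", yytext); }
[ \t\n]+                { /* skip whitespace */ }
.                       { /* ignore anything else */ }
%%

int main(void)
{
    yylex();            /* tokenize stdin */
    return 0;
}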
You would consider using a lexical analyzer because you could use BNF (or EBNF) to describe your language (the grammar) declaratively, and then just use a parser to parse a program written in your language and get it in a structure in memory and then manipulate it freely.
It's not obligatory and you can of course write your own, but that depends on how complex the language is and how much time you have to reinvent the wheel.
Also, the fact that you can use a language (BNF) to describe your grammar without changing the lexical analyzer itself enables you to experiment and change the grammar of your language until you have exactly what works for you.
