Is there a way to use a string as a delimiter?
We can use single characters as delimiters with sscanf().
Example
I have
char url[]="username=jack&pwd=jack123&email=jack#example.com";
I can use:
char username[100],pwd[100],email[100];
sscanf(url, "username=%[^&]&pwd=%[^&]&email=%[^\n]", username,pwd,email);
It works fine for this string, but for
url="username=jack&jill&pwd=jack&123&email=jack#example.com"
it can't be used. This is to prevent SQL injection, but I want to learn a trick to use
&pwd= and &email= as delimiters, not necessarily with sscanf.
Update: The solution doesn't necessarily need to be in C. I only want to know of a way to use a string as a delimiter.
Just code your own parsing. In many cases, representing the parsed AST in memory is useful. But do specify and document your input language (perhaps using EBNF notation).
Your input language (which you have not defined in your question) seems to be similar to the MIME type application/x-www-form-urlencoded used in HTTP POST requests. So you might look, at least for inspiration, into the source code of free software libraries related to HTTP server processing (like libonion) and HTTP client processing (like libcurl).
You could read an entire line with getline (or perhaps fgets) then parse it appropriately. sscanf with %n, or strtok, might be useful, but you can also parse the line "manually" (e.g. with a hand-written recursive descent parser). You might also use strchr or strstr, as in the sketch below.
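A minimal sketch for the question's concrete case, assuming the query always has the shape username=...&pwd=...&email=... with those field names hard-coded: strstr locates the literal delimiters "&pwd=" and "&email=", so a stray '&' inside a value is not treated as a separator.
#include <stdio.h>
#include <string.h>
int main(void) {
    char url[] = "username=jack&jill&pwd=jack&123&email=jack#example.com";
    char username[100] = "", pwd[100] = "", email[100] = "";
    char *p = strstr(url, "&pwd=");     /* start of the "&pwd=" delimiter   */
    char *e = strstr(url, "&email=");   /* start of the "&email=" delimiter */
    if (strncmp(url, "username=", strlen("username=")) != 0 || !p || !e || e < p)
        return 1;                       /* input does not have the expected shape */
    /* Copy each slice lying between the delimiters, staying inside the buffers. */
    snprintf(username, sizeof username, "%.*s",
             (int)(p - (url + strlen("username="))), url + strlen("username="));
    snprintf(pwd, sizeof pwd, "%.*s",
             (int)(e - (p + strlen("&pwd="))), p + strlen("&pwd="));
    snprintf(email, sizeof email, "%s", e + strlen("&email="));
    printf("username=%s\npwd=%s\nemail=%s\n", username, pwd, email);
    return 0;
}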
BTW, in many cases, using common textual representations like JSON, YAML, XML can be helpful, and you can easily find many libraries to handle them.
Notice also that strings can be processed as FILE* by using fmemopen and/or open_memstream.
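A minimal sketch of that, assuming a POSIX system where fmemopen is available: wrap the string in a stream, then use the ordinary stdio functions on it.
#define _POSIX_C_SOURCE 200809L   /* for fmemopen on glibc in strict modes */
#include <stdio.h>
#include <string.h>
int main(void) {
    char buf[] = "username=jack&pwd=jack123";
    FILE *f = fmemopen(buf, strlen(buf), "r");
    if (!f) return 1;
    char field[100];
    while (fscanf(f, "%99[^&]", field) == 1) {   /* read up to the next '&' */
        printf("field: %s\n", field);
        if (fgetc(f) == EOF) break;              /* consume the '&' separator */
    }
    fclose(f);
    return 0;
}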
You could use parser generators such as bison (with flex).
In some cases, regular expressions could be useful. See regcomp and friends.
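For instance, a POSIX regcomp/regexec sketch for the question's input could put capture groups around the literal "&pwd=" and "&email=" separators; the exact pattern below is only an assumption about the input's shape.
#include <regex.h>
#include <stdio.h>
int main(void) {
    const char *url = "username=jack&jill&pwd=jack&123&email=jack#example.com";
    regex_t re;
    regmatch_t m[4];   /* m[0] is the whole match, m[1..3] the three fields */
    if (regcomp(&re, "^username=(.*)&pwd=(.*)&email=(.*)$", REG_EXTENDED) != 0)
        return 1;
    if (regexec(&re, url, 4, m, 0) == 0) {
        printf("username: %.*s\n", (int)(m[1].rm_eo - m[1].rm_so), url + m[1].rm_so);
        printf("pwd:      %.*s\n", (int)(m[2].rm_eo - m[2].rm_so), url + m[2].rm_so);
        printf("email:    %.*s\n", (int)(m[3].rm_eo - m[3].rm_so), url + m[3].rm_so);
    }
    regfree(&re);
    return 0;
}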
So what you want to achieve is quite easy to do and standard practice. But you need more than just sscanf, and you may want to combine several things.
Many external libraries (e.g. glib from GTK) provide some parsing. And you should care about UTF-8 (today, you have UTF-8 everywhere).
On Linux, if permitted to do so, you might use GNU readline instead of getline when you want interactive input (with editing abilities and autocompletion). Then take inspiration from the source code of GNU bash (or of RefPerSys, if interested by C++).
If you are unfamiliar with usual parsing techniques, read a good book such as the Dragon Book. Most large programs deal somewhere with parsing, so you need to know how that can be done.
Related
I am trying to write an HTTP parser using flex and bison. My first thought was to lex only the most basic tokens and use the parser to combine them. For example, in the HTTP header there would be tokens like GET, POST or PUT as well as URL and HTTP_VERSION. The problem now is that it is perfectly valid for a URL to include e.g. the string "POST", and the lexer cannot differentiate between the "POST" in a URL token and the "POST" in an actual POST token.
A solution to that problem would be to use flex start condition states, to enable and disable specific patterns depending on what is expected. After finding a GET, POST, etc. token in the INITIAL state, the lexer switches to another state where it only matches the URL. This solution does not feel right though. It feels like implementing logic in the lexer that belongs in the parser (stuff like "there is exactly one GET, POST, ... token followed by exactly one URL").
That brings me to my question:
Is there a way to "synchronize" between a bison parser and a flex lexer, so that I can change the lexer state depending on what bison expects to be the next token?
P.S.:
I don't know much about grammar classes and the theoretical stuff behind parsing languages. It feels to me that parsing an HTTP header does not even require a context-free language parser like bison. Is HTTP a regular language, so that I can parse it with a regular expression?
Using Bison to parse an HTTP header is not just overkill; it's a mismatch between tool and problem. It's like trying to write an essay using Photoshop. Photoshop is a highly sophisticated tool which, in the hands of a skilled operator, can make any image look beautiful, and in a certain sense, an essay is an image (or a series of images). But the task is ridiculous. Sure, Photoshop has text blocks. But you don't want to concentrate on every rectangular block of text; you want to write, letting the words flow from page to page, something which is outside of Photoshop's model, regardless of how sophisticated the tool is.
Bison and Flex are designed to parse a language with a complex syntax, in which the input can first be separated into lexical units (tokens) without regard to their syntactic context. If you find yourself asking questions like "how can I communicate the nature of the expected input to the lexer?", then you are probably using the wrong tool. Certainly, Flex is a powerful tool and it has features which let it adapt to minor variations in the lexical context. But these should be exceptional cases.
Protocols like HTTP are designed to be easy to analyse. You need neither a parser nor a lexer to see which of a handful of possible verbs is at the start of a line, or to extract the following string up to the first space character. You don't need to do sophisticated error analysis and recovery, and you shouldn't even try because it will leak information and could open a vulnerability. If a character can't be parsed, just send a 400 and stop reading.
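As a small illustration of that point (the verb list and the behaviour on error here are only an example, not taken from any particular server), plain strncmp and strchr are enough for the request line:
#include <stdio.h>
#include <string.h>
static int parse_request_line(const char *line) {
    static const char *verbs[] = { "GET", "POST", "PUT", "DELETE", "HEAD" };
    for (size_t i = 0; i < sizeof verbs / sizeof verbs[0]; i++) {
        size_t n = strlen(verbs[i]);
        if (strncmp(line, verbs[i], n) == 0 && line[n] == ' ') {
            const char *url = line + n + 1;
            const char *end = strchr(url, ' ');
            if (!end)
                return -1;              /* malformed: would answer 400 and stop */
            printf("verb=%s target=%.*s\n", verbs[i], (int)(end - url), url);
            return 0;
        }
    }
    return -1;                          /* unknown verb: 400 */
}
int main(void) {
    return parse_request_line("GET /index.html HTTP/1.1") == 0 ? 0 : 1;
}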
Good tools simplify the solution of problems within the domain for which they were designed. Part of the skillset of a good engineer is recognising which tools are appropriate to a given problem. In this case, Bison/Flex is not the tool.
And I'm not sure that regular expressions help much either, because the main challenge in HTTP parsing is dealing with an input stream which is intermittent, asynchronous, somewhat unreliable, and subject to attack. Like Flex, most regex libraries expect a single input string, already clearly terminated, which is not the case for a protocol transmitted over the internet.
It is possible to construct a backchannel from a Bison parser to a Flex scanner, but this is not the place to try it. So I'm disinclined to try to explain the techniques within this context.
HTTP is a regular language. We can see that from the fact that it can be parsed with a finite state machine. In theory it should be possible (don't quote me on that) to match an entire HTTP message with a regular expression. The problem with this parsing approach is that common regex languages don't have sophisticated capture options. Often we can capture single submatches and store them for later use, but there is nothing like the callback-like approach in bison, where one can specify arbitrary actions based on the recognition of some subpattern.
The reason that bison and flex are not the right tools for this problem (as stated in other answers) lies in a very fundamental property that distinguishes the HTTP language from typical programming languages, where flex and bison are usually used:
A programming language consists of an alphabet of tokens. That means e.g. a keyword can be thought of as a single element in that alphabet, and the input stream of characters can unambiguously be converted into a stream of tokens. This is exactly the purpose of the lexer, which acts as a kind of translator from one alphabet to another. There are situations where the lexer additionally acts as a parser and requires more internal logic than simply translating input sequences into tokens. One of the most common of these situations is comments: a keyword in a comment no longer emits a token.
For HTTP, on the other hand, the alphabet is just the input characters. It does not make much sense to use a lexer. The problem that the OP mentioned about the communication between flex and bison stems from the fact that he tried to split the parsing of the language (which is a single problem) into two parts. Flex is not a parser and is not suitable for parsing. The solution would be to just skip the lexer stage, as it is not needed. Bison is very capable of producing a parser for a regular language and can be used to parse HTTP.
That brings me to the last point. Even if bison is capable of generating a regular-language parser, it is not the most suitable tool. Bison generates parsers for context-free languages, which are parsed using a finite state machine with a stack. For parsing a regular language, one does not need the stack, as there are no nested structures that need to be "remembered". Thus an HTTP parser generated by bison will most likely not be as optimized as one generated by a regular-language parser generator. Sadly, I could not find a pure regular-language parser generator, but such a thing would be the preferable tool to generate an HTTP parser.
I found the above type of code in a pre-completed portion of a coding question on HackerRank. I was wondering what \n would do. Does it make any difference?
Read some good C reference website, and perhaps the C11 standard n1570 and probably Modern C.
The documentation of scanf(3) explains what is happening for \n in the format control string. It is handled like a space and matches a sequence of space characters (such as ' ', or '\t', or '\n') in the input stream.
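A tiny illustration of that behaviour (the input values are arbitrary): without the \n, the %c would pick up the blank right after the number.
#include <stdio.h>
int main(void) {
    int a;
    char c;
    /* The \n in the format matches any run of whitespace (or none at all),
       not a literal newline: here it skips the blanks and tab before 'x'. */
    if (sscanf("42 \t  x", "%d\n%c", &a, &c) == 2)
        printf("a=%d c=%c\n", a, c);   /* prints a=42 c=x */
    return 0;
}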
If you explicitly want to parse lines, you would use some parser generator like GNU bison and/or use first fgets(3) or getline(3) and later sscanf(3).
Don't forget to handle error cases. See errno(3). Consider documenting using EBNF notation the valid inputs of your program.
Study for inspiration the source code of existing open source programs, including GNU bash or GNU make. Be aware that in 2020 UTF-8 should be used everywhere (then you might want to use libunistring, whose source code you could study and improve, since it is free software).
If you use Linux, consider using gdb(1) or ltrace(1) to understand the behavior of your program. Of course, read the documentation of your C compiler (perhaps GCC) and debugger (perhaps GDB).
For example, if the input is x+=5, the program should return an array of x, +=, 5. Notice that there is no space between x and +=, so splitting by spaces only probably won't work, because then you would have to iterate through it all over again to find the keywords.
How would I do something like this?
Is there an efficient way to do this in C?
Lexing is not specific to C (in the sense that you'll use similar techniques in other programming languages). You could do that with hand-written code (using finite-automaton coding techniques). You could use a lexer generator like flex. You might even use regular expressions, e.g. the regex.h functions on POSIX systems.
Parsing is also a well-known domain with standard techniques (at least for context-free languages, if you want some efficiency). You could use recursive descent parsing, or you could generate a parser using bison (which has examples very close to your homework) or ANTLR. Read more about LL parsing & LR parsing. BTW, parsing techniques can be used for lexing.
BTW, there is a lot of free software illustrating these techniques: interpreters of scripting languages (like Guile, Lua, Python, etc.), JSON, YAML and XML parsers, several compilers (e.g. tinycc), etc. You'll learn a lot by studying their source code.
It could be easier for you to sometimes have a lookahead of one or two characters, e.g. by first reading the entire line (with getline(3) or else fgets(3), and perhaps even readline, which gives you a line editor). If you cannot read a whole line, consider using fgetc(3) and ungetc when needed. The classifying utilities from <ctype.h> like isalpha might be helpful.
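A hand-written sketch along these lines (the operator set and token names are just an example, and it is byte-oriented): classify characters with <ctype.h> and peek one character ahead for two-character operators like +=.
#include <ctype.h>
#include <stdio.h>
#include <string.h>
static void tokenize(const char *s) {
    while (*s) {
        if (isspace((unsigned char)*s)) { s++; continue; }
        if (isalpha((unsigned char)*s) || *s == '_') {          /* identifier */
            const char *start = s;
            while (isalnum((unsigned char)*s) || *s == '_') s++;
            printf("ident:  %.*s\n", (int)(s - start), start);
        } else if (isdigit((unsigned char)*s)) {                 /* number */
            const char *start = s;
            while (isdigit((unsigned char)*s)) s++;
            printf("number: %.*s\n", (int)(s - start), start);
        } else if (strchr("+-*/<>=!", *s) && s[1] == '=') {      /* two-char operator */
            printf("op:     %.2s\n", s);
            s += 2;
        } else {                                                 /* one-char operator */
            printf("op:     %c\n", *s);
            s++;
        }
    }
}
int main(void) {
    tokenize("x+=5");    /* prints ident: x, op: +=, number: 5 */
    return 0;
}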
If you care about UTF-8 (and in principle you should) things become slightly more complex since some Unicode characters (like €, é, 𝛃, ...) are represented in UTF-8 by several bytes. A library like libunistring should be very helpful.
What's the best way to concatenate a string using Win32? If I understand correctly, the normal C approach would be to use strcat, but since Win32 now deals with Unicode strings (aka LPWSTR), I can't think of a way for strcat to work with this.
Is there a function for this, or should I just write my own?
lstrcat comes in ANSI and Unicode variants. Actually lstrcat is simply a macro defined as either lstrcatA or lstrcatW.
These functions are available by importing kernel32.dll. Useful if you're trying to completely avoid the C runtime library. In most cases you can just use wcscat or _tcscat as roy commented.
Also consider the strsafe.h functions, such as StringCchCat. These come in ANSI and Unicode variants as well, and they additionally help protect against buffer overflow.
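A minimal sketch with the Unicode variant (the buffer size and strings are arbitrary): the destination size is passed in characters, so an over-long concatenation fails instead of overflowing.
#include <windows.h>
#include <strsafe.h>
#include <stdio.h>
int main(void) {
    wchar_t buf[32] = L"Hello, ";
    /* StringCchCatW takes the destination capacity in characters and
       returns an HRESULT instead of silently writing past the end. */
    HRESULT hr = StringCchCatW(buf, ARRAYSIZE(buf), L"world");
    if (SUCCEEDED(hr))
        wprintf(L"%ls\n", buf);
    return 0;
}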
For my own learning experience, I want to try writing an interpreter for a simple programming language in C – the main thing I think I need is a hash table library, but a general purpose collection of data structures and helper functions would be pretty helpful. What would you guys recommend?
libbasekit - by the author of Io. You can also use libcoroutine.
One library I recommend looking into is libgc, a garbage collector for C.
You use it by replacing calls to malloc, realloc, strdup, etc. with their libgc counterparts (e.g. GC_MALLOC). It works by scanning the stack, global variables, and GC-allocated blocks, looking for numbers that might be pointers. Believe it or not, it actually performs quite well (almost on par with the very good ptmalloc, which is the default (non-garbage-collected) malloc implementation in GNU/Linux), and a lot of programs use it (including Mono and GCJ). A disadvantage, though, is that it might not play well with other libraries you may want to use, and you may even have to recompile some of them by hand to replace calls to malloc with GC_MALLOC.
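A minimal sketch of that replacement idea (the header name and link flag may vary by system; on typical Linux installs you link with -lgc): allocate with GC_MALLOC and never free, and the collector reclaims unreachable blocks.
#include <gc.h>        /* may be <gc/gc.h> on some systems */
#include <stdio.h>
int main(void) {
    GC_INIT();
    for (int i = 0; i < 1000000; i++) {
        char *s = GC_MALLOC(64);        /* would leak badly with plain malloc */
        snprintf(s, 64, "block %d", i);
        (void)s;                        /* dropped on the floor: the GC reclaims it */
    }
    printf("done\n");
    return 0;
}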
Honestly - and I know some people will hate me for it - but I recommend you use C++. You don't have to bust a gut to learn it just to be able to start your project. Just use it like C, but in an hour you can learn how to use std::map<> (an associative container), std::string for easy textual data handling, and std::vector<> for a resizable heap-allocated array. If you want to spend an extra hour or two, learn to put member functions in classes (don't worry about polymorphism, virtual functions etc. to begin with), and you'll get a more organised program.
You need no more than the standard library for a suitably small language with simple constructs. The most complex part of an interpreted language is probably expression evaluation. For that, as well as procedure calling and construct nesting, you will need to understand and implement stack data structures.
The code at the link above is C++, but the algorithm is described clearly and you could re-implement it easily in C. There again there are few valid arguments for not using C++ IMO.
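If you stay in C, a minimal sketch of such a stack (a growable array of doubles, the kind of thing an expression evaluator pushes operands onto) might look like this:
#include <stdio.h>
#include <stdlib.h>
typedef struct {
    double *data;
    size_t  len, cap;
} Stack;
static void push(Stack *s, double v) {
    if (s->len == s->cap) {                       /* grow the array on demand */
        s->cap = s->cap ? s->cap * 2 : 16;
        s->data = realloc(s->data, s->cap * sizeof *s->data);
        if (!s->data) exit(1);
    }
    s->data[s->len++] = v;
}
static double pop(Stack *s) {
    return s->data[--s->len];                     /* caller must check s->len first */
}
int main(void) {
    Stack s = {0};
    push(&s, 2); push(&s, 3);
    double b = pop(&s), a = pop(&s);
    printf("%g\n", a + b);                        /* 5: the core move of expression evaluation */
    free(s.data);
    return 0;
}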
Before diving into what libraries to use, I suggest you learn about grammars and compiler design. Input parsing in particular is similar for compilers and interpreters, namely tokenizing and parsing. The process of tokenizing converts a stream of characters (your input) into a stream of tokens. A parser takes this stream of tokens and matches it against your grammar.
You don't mention what language you're writing an interpreter for. But very likely that language contains recursion. In that case you need to use a so-called bottom-up parser, which you cannot write by hand but need to generate. If you try to write such a parser by hand you will end up with an error-prone mess.
If you're developing for a POSIX platform then you can use lex and yacc. These tools are a bit old but very powerful for building parsers. Lex can generate code that implements the tokenizing process and yacc can generate a bottom-up parser.
My answer probably raises more questions than it answers. That's because the field of compilers/interpreters is quite complex and cannot simply be explained in a short answer. Just get a good book on compiler design.