How to retrieve tokens from a string that are keywords? - c

For example, if the input is x+=5, the program should return an array of x, +=, 5. Notice that there is no space between x and +=, so splitting on spaces alone probably won't work; you would then have to scan each piece again to find the keywords.
How would I do something like this?
Is there an efficient way to do this in C?

Lexing is not specific to C (in the sense that you'll use similar techniques in other programming languages). You could do it with hand-written code (using finite automaton coding techniques). You could use a lexer generator like flex. You might even use regular expressions, e.g. the regex.h functions on POSIX systems.
Parsing is also a well known domain with standard techniques (at least for context free languages, if you want some efficiency). You could use recursive descent parsing, you could generate a parser using bison (which has examples very close to your homework) or ANTLR. Read more about LL parsing & LR parsing. BTW, parsing techniques can be used for lexing.
BTW, there is a lot of free software illustrating these techniques: interpreters of scripting languages (Guile, Lua, Python, ...), parsers for JSON, YAML, and XML, several compilers (e.g. tinycc), etc. You'll learn a lot by studying their source code.
It could be easier for you to have a lookahead of one or two characters, e.g. by first reading the entire line (with getline(3) or else fgets(3), or perhaps even readline, which gives you a line editor). If you cannot read a whole line, consider using fgetc(3) and ungetc(3) when needed. The classifying utilities from <ctype.h>, like isalpha, might be helpful.
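For the x+=5 example, a hand-written lexer along these lines can be quite short. Below is a minimal sketch (Token and next_token are illustrative names, not any standard API) using the <ctype.h> classifiers and one character of lookahead for the two-character operators:

#include <ctype.h>
#include <stdio.h>

/* Minimal hand-written tokenizer sketch: identifiers, integer
   literals, and the operators + = += ==.  Token and next_token
   are illustrative names, not a standard API. */
typedef struct { const char *start; int len; } Token;

static const char *next_token(const char *p, Token *tok) {
    while (isspace((unsigned char)*p)) p++;          /* skip blanks */
    tok->start = p;
    if (*p == '\0') { tok->len = 0; return NULL; }   /* end of input */
    if (isalpha((unsigned char)*p) || *p == '_') {   /* identifier */
        while (isalnum((unsigned char)*p) || *p == '_') p++;
    } else if (isdigit((unsigned char)*p)) {         /* integer literal */
        while (isdigit((unsigned char)*p)) p++;
    } else if (*p == '+' || *p == '=') {             /* + or =, maybe += or == */
        p++;
        if (*p == '=') p++;                          /* one character of lookahead */
    } else {
        p++;                                         /* any other single-char token */
    }
    tok->len = (int)(p - tok->start);
    return p;
}

int main(void) {
    const char *p = "x+=5";
    Token t;
    while ((p = next_token(p, &t)) != NULL)
        printf("%.*s\n", t.len, t.start);            /* prints x, +=, 5 */
    return 0;
}

The same structure extends naturally to more operators and to keyword recognition; this is the kind of code a lexer generator like flex automates.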
If you care about UTF-8 (and in principle you should) things become slightly more complex since some Unicode characters (like €, é, 𝛃, ...) are represented in UTF-8 by several bytes. A library like libunistring should be very helpful.

Related

C language: how to use a string value as a delimiter in sscanf

Is there a way to use a string as a delimiter?
We can use single characters as delimiters with sscanf().
Example
I have
char url[]="username=jack&pwd=jack123&email=jack#example.com";
I can use:
char username[100],pwd[100],email[100];
sscanf(url, "username=%[^&]&pwd=%[^&]&email=%[^\n]", username,pwd,email);
It works fine for this string, but for
url="username=jack&jill&pwd=jack&123&email=jack#example.com"
it can't be used... it's to prevent SQL injection... but I want to learn a trick to use
&pwd and &email as delimiters, not necessarily with sscanf.
Update: the solution doesn't necessarily need to be in C. I only want to know of a way to use a string as a delimiter.
Just code your own parsing. In many cases, representing in memory the AST you have parsed is useful. But do specify and document your input language (perhaps using EBNF notation).
Your input language (which you have not defined in your question) seems to be similar to the MIME type application/x-www-form-urlencoded used in HTTP POST requests. So you might look, at least for inspiration, into the source code of free software libraries related to HTTP server processing (like libonion) and HTTP client processing (like libcurl).
You could read an entire line with getline (or perhaps fgets) then parse it appropriately. sscanf with %n, or strtok, might be useful, but you can also parse the line "manually" (consider writing, e.g., your own recursive descent parser). You might also use strchr or strstr.
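As a concrete illustration of the strstr suggestion, here is a minimal sketch that splits the problematic string from the question on the string delimiters "&pwd=" and "&email=". It assumes the fields appear in that fixed order and that no value contains the literal text of a later delimiter:

#include <stdio.h>
#include <string.h>

int main(void) {
    char url[] = "username=jack&jill&pwd=jack&123&email=jack#example.com";
    char *u = strstr(url, "username=");
    char *p = strstr(url, "&pwd=");
    char *e = strstr(url, "&email=");
    if (!u || !p || !e) return 1;        /* malformed input */
    *p = '\0';                           /* terminate the username value */
    *e = '\0';                           /* terminate the pwd value */
    printf("username: %s\n", u + strlen("username="));
    printf("pwd:      %s\n", p + strlen("&pwd="));
    printf("email:    %s\n", e + strlen("&email="));
    return 0;
}

This prints jack&jill, jack&123 and jack#example.com: the embedded & characters no longer break the parse, because the delimiters are whole strings rather than single characters.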
BTW, in many cases, using common textual representations like JSON, YAML, XML can be helpful, and you can easily find many libraries to handle them.
Notice also that strings can be processed as FILE* by using fmemopen and/or open_memstream.
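For example, here is a minimal sketch (fmemopen is POSIX.1-2008):

#include <stdio.h>

int main(void) {
    char buf[] = "x+=5";
    /* Treat the in-memory string as a FILE*, so the same
       fgetc/ungetc-based lexing code works on strings and files. */
    FILE *f = fmemopen(buf, sizeof buf - 1, "r");
    if (!f) return 1;
    int c;
    while ((c = fgetc(f)) != EOF)
        printf("read '%c'\n", c);
    fclose(f);
    return 0;
}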
You could use parser generators such as bison (with flex).
In some cases, regular expressions could be useful. See regcomp and friends.
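A minimal sketch using the POSIX regex API, matching the first identifier in a line (error handling abbreviated):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    regmatch_t m;
    /* POSIX extended regular expression for a C-like identifier. */
    if (regcomp(&re, "[A-Za-z_][A-Za-z0-9_]*", REG_EXTENDED) != 0)
        return 1;
    const char *line = "x+=5";
    if (regexec(&re, line, 1, &m, 0) == 0)
        printf("matched: %.*s\n", (int)(m.rm_eo - m.rm_so), line + m.rm_so);
    regfree(&re);
    return 0;
}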
So what you want to achieve is quite easy to do and standard practice. But you need more than just sscanf, and you may want to combine several things.
Many external libraries (e.g. glib from GTK) provide some parsing. And you should care about UTF-8 (today, you have UTF-8 everywhere).
On Linux, if permitted to do so, you might use GNU readline instead of getline when you want interactive input (with editing abilities and autocompletion). Then take inspiration from the source code of GNU bash (or of RefPerSys, if interested by C++).
If you are unfamiliar with usual parsing techniques, read a good book such as the Dragon Book. Most large programs deal somewhere with parsing, so you need to know how that can be done.

How do I lex unicode characters in C?

I've written a lexer in C; it currently lexes ASCII files successfully, but I'm confused as to how I would lex Unicode. Which Unicode encodings would I need to support; for instance, should I handle UTF-8, UTF-16, etc.? What do languages like Rust or Go support?
Also, are there any libraries that can help me out? I would prefer to try to do it myself so I can learn, but even then a small library that I could read to learn from would be great.
There are already versions of lex (and other lexer tools) that support Unicode; they are tabulated on the Wikipedia page List of Lexer Generators. There is also a list of lexer tools on the Wikipedia parser page. In summary, the following tools handle Unicode:
JavaCC - JavaCC generates lexical analyzers written in Java.
JFLex - A lexical analyzer generator for Java.
Quex - A fast universal lexical analyzer generator for C and C++.
FsLex - A lexer generator for byte and Unicode character input for F#
And, of course, there are the techniques used by W3.org and cited by @jim mcnamara at http://www.w3.org/2005/03/23-lex-U.
You say you have written your own lexer in C, but you have used the tag lex for the tool called lex; perhaps that was an oversight?
In the comments you say you have not used regular expressions, but you also want to learn. Learning something about the theory of language recognition is key to writing an efficient and working lexer. The symbols being recognised form a Chomsky type-3 language, or regular language, which can be described by regular expressions. Regular expressions can be implemented by code that realises a finite state automaton (or finite state machine). The standard implementation of a finite state machine is a loop containing a switch. Most experienced coders should know, and be able to recognise and exploit, this form:
while ( not <<EOF>> ) {
    switch ( input_symbol ) {
    case ( state_symbol[0] ) :
        ...
    case ( state_symbol[1] ) :
        ...
    default:
        ...
    }
}
If you code in this style, the same structure can work whether the symbols being handled are 8-bit or 16-bit, because the algorithmic coding pattern remains the same.
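For UTF-8 input, the usual approach is to decode each multi-byte sequence into a code point before feeding it to the automaton, so the loop-and-switch structure itself does not change. Here is a minimal decoding sketch; note that it does not validate overlong forms or truncated sequences, which a production lexer should reject:

#include <stdint.h>
#include <stdio.h>

/* Decode one UTF-8 sequence into a code point; returns a pointer
   just past the sequence.  Validation is deliberately omitted. */
static const unsigned char *decode_utf8(const unsigned char *p, uint32_t *cp) {
    if (p[0] < 0x80) {                 /* 1-byte (ASCII) */
        *cp = p[0];
        return p + 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {       /* 2-byte sequence */
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return p + 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {       /* 3-byte sequence */
        *cp = ((uint32_t)(p[0] & 0x0F) << 12)
            | ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        return p + 3;
    }
    /* 4-byte sequence */
    *cp = ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12)
        | ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    return p + 4;
}

int main(void) {
    const unsigned char *p = (const unsigned char *)"a\xC3\xA9" "b"; /* "aéb" */
    uint32_t cp;
    while (*p) {
        p = decode_utf8(p, &cp);
        printf("U+%04X\n", cp);        /* U+0061 U+00E9 U+0062 */
    }
    return 0;
}

The switch in the automaton then dispatches on the decoded cp value rather than on a raw byte.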
Ad-hoc coding of a lexical analyser without an understanding of the underlying theory and practice will eventually hit its limits. I think you will find it beneficial to read a little more in this area.

What libraries would be useful for implementing a small language interpreter in C?

For my own learning experience, I want to try writing an interpreter for a simple programming language in C – the main thing I think I need is a hash table library, but a general purpose collection of data structures and helper functions would be pretty helpful. What would you guys recommend?
libbasekit - by the author of Io. You can also use libcoroutine.
One library I recommend looking into is libgc, a garbage collector for C.
You use it by replacing calls to malloc, realloc, strdup, etc. with their libgc counterparts (e.g. GC_MALLOC). It works by scanning the stack, global variables, and GC-allocated blocks, looking for numbers that might be pointers. Believe it or not, it actually performs quite well (almost on par with the very good ptmalloc, which is the default (non-garbage collected) malloc implementation in GNU/Linux), and a lot of programs use it (including Mono and GCJ). A disadvantage, though, is it might not play well with other libraries you may want to use, and you may even have to recompile some of them by hand to replace calls to malloc with GC_MALLOC.
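A minimal usage sketch (the header may be <gc/gc.h> on some installations; link with -lgc):

#include <gc.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    GC_INIT();                           /* initialize the collector */
    for (int i = 0; i < 1000000; i++) {
        char *s = GC_MALLOC(64);         /* becomes garbage next iteration */
        strcpy(s, "cons cell payload");
    }
    printf("done, and no free() calls\n");
    return 0;
}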
Honestly - and I know some people will hate me for it - but I recommend you use C++. You don't have to bust a gut to learn it just to be able to start your project. Just use it like C, but in an hour you can learn how to use std::map<> (an associative container), std::string for easy textual data handling, and std::vector<> for a resizable heap-allocated array. If you want to spend an extra hour or two, learn to put member functions in classes (don't worry about polymorphism, virtual functions etc. to begin with), and you'll get a more organised program.
You need no more than the standard library for a suitably small language with simple constructs. The most complex part of an interpreted language is probably expression evaluation. For that, as well as for procedure calling and construct nesting, you will need to understand and implement stack data structures.
The code at the link above is C++, but the algorithm is described clearly and you could re-implement it easily in C. Then again, there are few valid arguments for not using C++, IMO.
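As an illustration of the stack data structures mentioned above, here is a minimal postfix expression evaluator in C. Evaluating infix input would additionally require converting it to postfix first (e.g. with the shunting-yard algorithm), which this sketch does not do:

#include <ctype.h>
#include <stdio.h>

/* Evaluate a single-digit postfix expression with an explicit
   operand stack.  No overflow or underflow checking. */
static int eval_postfix(const char *s) {
    int stack[64], top = 0;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[top++] = *s - '0';              /* push operand */
        } else if (*s == '+' || *s == '-' || *s == '*') {
            int b = stack[--top], a = stack[--top];
            stack[top++] = (*s == '+') ? a + b    /* pop two, push result */
                         : (*s == '-') ? a - b
                         : a * b;
        }                                         /* other characters ignored */
    }
    return stack[0];
}

int main(void) {
    printf("%d\n", eval_postfix("34+2*"));        /* (3+4)*2 = 14 */
    return 0;
}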
Before diving into what libraries to use, I suggest you learn about grammars and compiler design. Input handling in particular is similar for compilers and interpreters: tokenizing and parsing. Tokenizing converts a stream of characters (your input) into a stream of tokens. A parser takes this stream of tokens and matches it against your grammar.
You don't mention what language you're writing an interpreter for, but very likely that language contains recursion. In that case you will probably want a generated parser, such as the bottom-up (LALR) parsers produced by parser generators; trying to write such a parser by hand usually ends in an error-prone mess.
If you're developing for a POSIX platform then you can use lex and yacc. These tools are a bit old but very powerful for building parsers. Lex generates code that implements the tokenizing process, and yacc generates a bottom-up parser.
My answer probably raises more questions than it answers. That's because the field of compilers/interpreters is quite complex and cannot simply be explained in a short answer. Just get a good book on compiler design.

Why Use Lexical Analyzers?

I'm building my own language using Flex, but I want to know some things:
Why should I use lexical analyzers?
Are they going to help me in something?
Are they obligatory?
Lexical analysis helps simplify parsing because the lexemes can be treated as abstract entities rather than concrete character sequences.
You'll need more than flex to build your language, though: Lexical analysis is just the first step.
Any time you are converting an input string into space-separated strings and/or numeric values, you are performing lexical analysis. Writing a cascading series of else if (strcmp (..)==0) ... statements counts as lexical analysis. Even such nasty tools as sscanf and strtok are lexical analysis tools.
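For instance, keyword classification by string comparison might look like the following sketch; a small table plus a loop rather than literal else-if chains, but the idea is the same:

#include <stdio.h>
#include <string.h>

static const char *keywords[] = { "if", "else", "while", "return" };

/* Return 1 if the lexeme is a keyword, 0 otherwise. */
static int is_keyword(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}

int main(void) {
    printf("%d %d\n", is_keyword("while"), is_keyword("x"));  /* 1 0 */
    return 0;
}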
You'd want to use a tool like flex instead of one of the above for one of several reasons:
The error handling can be made much better.
You can be much more flexible in what you recognize with flex. For instance, it is tough to parse a C-format hexadecimal value properly with the scanf routines: scanf pretty much has to know in advance that a hex value is coming, whereas lex can figure it out for you (see the sketch after this list).
Lex scanners are faster. If you are parsing a lot of files, and/or large ones, this could become important.
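As an illustration of the hexadecimal case, a minimal flex rules fragment might look like this (the actions are illustrative; linking with -lfl supplies a default main and yywrap):

%%
0[xX][0-9a-fA-F]+   { printf("hex literal: %s\n", yytext); }
[0-9]+              { printf("decimal literal: %s\n", yytext); }
[ \t\n]+            ;   /* skip whitespace */
.                   { printf("other: %s\n", yytext); }
%%

Flex picks the longest match, so 0x1F is recognized as one hex token rather than a decimal 0 followed by stray characters, with no advance knowledge needed.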
You would consider using a lexical analyzer because you could use BNF (or EBNF) to describe your language (its grammar) declaratively, then just use a parser to parse a program written in your language, get it into a structure in memory, and then manipulate it freely.
It's not obligatory and you can of course write your own, but that depends on how complex the language is and how much time you have to reinvent the wheel.
Also, the fact that you can use a notation (BNF) to describe your language without changing the lexical analyzer itself enables you to experiment freely and change the grammar of your language until you have exactly what works for you.

Where can I find a Lisp reader in C?

I have a Lisp reader written in Java that I'm thinking of translating into C. (Or perhaps C++.) It's a fairly complete and useful hack, so the main issue is doing the dynamic-storage allocation in a language without garbage collection. If someone has already thought this through I'd rather borrow their code than figure it out myself. (C is not my favorite language.)
Of course, having a Lisp reader makes no sense unless you're planning to do something with the things you read, so perhaps I should have phrased the question, Where do I find a simple Lisp core written in C?, but in my experience the hardest unavoidable part of writing a Lisp (somewhat surprisingly) is the reader. Plus, I don't want to have a garbage collector; I'm anticipating an application where the list structures will be freed more or less by hand.
Gary Knott's Interpreting Lisp is very nice.
You may also try others, like Jim Mayfield's Lisp. There are probably lots of little Lisps out there...
You mentioned that you don't like C. Maybe you'd like Haskell -- in which case you could try "Write yourself a Scheme in 48 hours", an interesting tutorial (you get to write a Scheme interpreter in Haskell).
Update: I know that a Lisper would hardly feel comfortable using Haskell, but hey, it's much more comfortable than C (at least for me)! Besides that, Haskell has a good FFI, so it should be easy to use the Haskell-made Lisp reader as a C-compatible library.
Update 2: If you want to use XLisp, as suggested by another user, you will probably need src/xlread.c (863 lines) and include/xlisp.h (1379 lines) -- but I could be wrong...
Update 3: If you use Gary Knott's Lisp (a single C file of 942 lines), the reader's signature is int32 sread(void). This would be my choice if I didn't need anything fancy (like read macros) or highly optimized (there is a PDF paper that describes how the code is implemented, so you won't have to find your way through a labyrinth). The documentation for the function is:
This procedure scans an input string g using a lexical token scanning routine, e(), where e() returns:
1 if the token is '('
2 if the token is '''
3 if the token is '.'
4 if the token is ')'
or a typed pointer d to an atom or number stored in row ptrv(d) in the atom or number tables. Due to the typecode (8 or 9) of d, d is a negative 32-bit integer. The token found by e() is stripped from the front of g.
SREAD constructs an S-expression and returns a typed pointer to it as its result.
Note that Gary's Lisp is old, so you'll need to change it to make it compile. Instead of including linuxenv.h, include:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <setjmp.h>
Also, it does not work on 64-bit machines (the documentation of sread should tell you why...).
Update 4: There are also the Scheme implementations by Nils Holm (there are books describing their internals):
Lisp500 http://code.google.com/p/lisp5000
ThinLisp http://www.thinlisp.org
lispreader is a simple Lisp file parser done in plain C.
If you want C++, you can dig around in the SuperTux source code, it contains a Lisp file parser written in C++.
When you want an actual implementation of Lisp instead of just a parser you could have a look at Abuse, which contains a small one as the games scripting language.
MIT Professor Rivest published a set of small readers for s-expressions back in 1997 http://people.csail.mit.edu/rivest/sexp.html as part of his DARPA supported research on public key cryptography. The code only does reading and printing and is well described in a document written in the style of an internet RFC.
There are lots of embeddable Scheme implementations; off the top of my head: SIOD, Guile, Chicken Scheme, Scheme48, ...
