I'm currently creating a Linux shell to learn more about system calls.
I've already figured out most of it: the parser, token generation, and passing the right things to the right system calls all work.
The thing is that even before I start making tokens, I split the whole command string into separate words. It's based on an array of separators, and it works surprisingly well. Except that I'm struggling to add extra functionality to it, like escape sequences or quotes. I can't really do without them, since even people running basic grep commands use arguments with quotes. I'll need to add functionality for:
' ' - ignore every other separator, operator, or double quote found between the pair, pass the contents along as one string, and don't include the quotation marks in the resulting word,
" " - same as above, but ignoring single quotes instead,
\\ - escape this into a single backslash,
\(space) - escape this into a space; do not parse the resulting space as a separator,
\", \' - analogously to the above,
and many other things that I haven't yet figured out I need, every single one of which seems like an exception of its own. Each of them has to work in a variety of possible positions within a command, may or may not be included in the result, and can influence the rest of the parsing. It makes my code look like a big ball of mud.
Is there a better approach to do this? Is there a more general algorithm for that purpose?
You are trying to solve a classic problem in program analysis (lexing and parsing) using a nontraditional structure for the lexer ("I split whole command string into separate words..."). OK, then you will have nontraditional trouble getting the lexer right.
That doesn't mean your approach is doomed to failure, and without seeing specific instances of your problem (you list a set of constructs you want to handle, but don't say why they are hard to process), it is hard to provide any specific advice. It also doesn't mean your approach will lead to success: splitting the line first may break tokens that shouldn't be broken, usually by getting confused about what has been escaped.
The point of using a standard lexer (such as Flex or any of the 1000 variants you can get) is that it provides a proven approach to complex lexing problems, based generally on the idea that you can use regular expressions to describe the shape of individual lexemes. You get one regexp per lexeme type - an ocean of them, but each one is pretty easy to specify by itself.
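To make that concrete, here is a sketch of what the constructs listed in the question might look like as Flex rules. append_char, append_str, and end_word are hypothetical helpers that accumulate the current word; a real shell lexer would need more (variable expansion inside double quotes, for one), so treat this as a shape, not a solution:

    %%
    \\\\          { append_char('\\');   /* \\ -> literal backslash */ }
    \\\"          { append_char('"');    /* \" -> literal double quote */ }
    \\'           { append_char('\'');   /* \' -> literal single quote */ }
    \\" "         { append_char(' ');    /* backslash-space -> space, not a separator */ }
    '[^'\n]*'     { append_str(yytext + 1, yyleng - 2);  /* '...' minus the quotes */ }
    \"[^\"\n]*\"  { append_str(yytext + 1, yyleng - 2);  /* "..." minus the quotes */ }
    [ \t]+        { end_word();          /* separators end the current word */ }
    .             { append_char(yytext[0]); }
    %%

Each rule stands alone, which is exactly what keeps the ball of mud from forming: adding a new escape is one more line, not another special case threaded through the splitter.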
I've done ~40 languages using strong lexers and parsers (using one of the ones in that list). I assure you the standard approach is empirically pretty effective; the types of surprises it produces are well understood and manageable. A nonstandard approach always carries the risk of surprising you in a bad way.
Last remark: shell languages for Unix have had people adding crazy stuff for 40 years. Expect the job to be at least medium hard, and don't expect it to be pretty like Wirth's original Pascal.
Related
For some time now, I've been looking to build a logging framework in C (not C++!) for small microcontrollers or other devices with a small footprint. My idea is to hash the strings being logged to a certain value and save just the hashed value with the timestamp, instead of the complete ASCII string. The hash can then be correlated with a 'database' file generated by an external process that parses the strings out of the C source files and saves each logged string along with its hash value.
After doing a bit of research I found that this idea is not new, but I could not find an implementation of it in C. The idea has been worked out in other languages, but that is not the point of my exercise. An example is this talk, where the same concept is worked out in C++: youtube.com/watch?v=Dt0vx-7e_B0
Some of the requirements that I've set myself for this library are the following:
as portable C code as possible
COMPILE-TIME hashing for the string-to-hash conversion; for a single log statement with no arguments, the generated code should be equivalent to just printf("%d\n", hashed_value).
arguments can be passed to the logging statement, similarly to the printf function.
the user can define their own output function (console, file descriptor, sending the data directly over a UART connection, ...).
fast to run! Fast to compile is nice to have, but compilation should at least not be terribly slow.
very easy to use, with no complicated API.
But what is a good approach to achieve this in C? I've tried several things, but haven't found a good method yet.
An overview of things I've tried so far, along with the drawbacks are:
Full pre-processor string hashing: I did get this working, but compile times became terribly slow. The code also doesn't feel very portable across C compilers.
Semi pre-processor string hashing: the idea was to generate a hash for each string and put a define for each string's hash value in an external header file. The problem is that I cannot figure out a way to map a string to the matching preprocessor define.
Letting go of the usual logging macro with a string pointer: instead of the common LOG_DEBUG("Some logging statement"), an external parser converts it to /* LOG_DEBUG("Some logging statement") */ LOG_RAW(45). This solves the hashing problem, since the external parser substitutes the correct hash, but it is not the cleanest to read because the original statement survives only as a comment (see the sketch below).
Expanding this idea to take care of arguments also proved tricky: how do you handle multiple variable types as efficiently as possible?
I've tried some other methods as well, all without success. Especially when I want to add arguments to log the value of a variable, it gets very complicated and I don't get the required result...
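For what it's worth, here is a minimal sketch of the shape the third approach could take. Every name below is hypothetical, and the argument handling is deliberately naive (integers only) - which is precisely the part that gets tricky with multiple types:

    #include <stdarg.h>
    #include <stdint.h>
    #include <stdio.h>

    static void log_emit(uint16_t id, int nargs, ...);

    /* An external tool would rewrite LOG_DEBUG("...", ...) into
       LOG_RAW(id, nargs, ...) and emit a database mapping id -> string. */
    #define LOG_RAW(...) log_emit(__VA_ARGS__)

    static void log_emit(uint16_t id, int nargs, ...)
    {
        va_list ap;
        va_start(ap, nargs);
        printf("#%u", id);                   /* stand-in for timestamp + id frame */
        for (int i = 0; i < nargs; i++)
            printf(" %d", va_arg(ap, int));  /* sketch: assumes int arguments */
        printf("\n");
        va_end(ap);
    }

    int main(void)
    {
        int temperature = 42;
        LOG_RAW(45, 0);                /* was: LOG_DEBUG("Some logging statement") */
        LOG_RAW(46, 1, temperature);   /* was: LOG_DEBUG("Temp: %d", temperature) */
        return 0;
    }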
I'm writing a simple Lisp in C without any external dependencies (please do not link the BuildYourOwnLisp), and I'm following this guide as a basis for parsing the Lisp. It describes two steps in tokenising an S-expression:
Put spaces around every parenthesis
Split on white space
The first step is easy enough; I wrote a trivial function that replaces certain substrings with other substrings. But I'm having problems with the second step. The article only uses the string "Lisp" in its S-expression examples; if I were to use strtok() to blindly split by whitespace, any string in an S-expression that had a space within it would become fragmented and be interpreted incorrectly by my Lisp. Obviously, a language limited to single-word strings isn't very useful.
How would I write a function that splits a string by white space, except when the text is between two double quotes?
I've tried using regex, but from what I can see of the POSIX regex.h library and PCRE, just extracting the matches would be incredibly laborious in terms of the amount of auxiliary code I'd have to write, which would only serve to bloat my codebase. Besides, one of my goals with this project was to use only ANSI C, or, if need be, C99, solely for the sake of portability - fiddling with the POSIX library and the Win32 API would just fatten my code and make moving my Lisp around a nightmare.
When researching this problem I came across this StackOverflow answer, but the accepted answer only sends the tokenised string to stdout, which isn't useful for me; I'd ideally have the tokens in a char** so that I could then parse them into useful in-memory data structures.
On top of this, the accepted answer on the aforementioned SO question is restricted specifically to my problem - ideally, I'd have a general-purpose function that lets me tokenise a string, except when a substring is between two occurrences of character x. That isn't a huge deal; it's just that I'd like my codebase to be clean and composable.
You have two delimiters: the space and double quotes.
You can use the strcspn function for that (see cppreference - strcspn for an example).
Iterate over the string and look for the delimiters (space and quotes); strcspn returns the length of the text before the next delimiter. If a space was found, continue looking for both. If a double quote was found, the delimiter set changes from " \"" (space and quotes) to "\"" (double quotes only). When you hit the quote again, change the delimiter set back to " \"" (space and quotes).
Based on your comment:
Let's say you have a string like
This is an Example.
The output would be
This
is
an
Example.
If the string would look like
This "is an" Example.
The output would be
This
is an
Example.
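A minimal sketch of that loop, using strspn to skip the separators and strcspn to measure each token (it just prints the tokens; collecting them into a char** works the same way):

    #include <stdio.h>
    #include <string.h>

    static void tokenize(const char *p)
    {
        size_t len;
        while (*p) {
            p += strspn(p, " ");            /* skip separators */
            if (*p == '\0')
                break;
            if (*p == '"') {                /* quoted token: scan to closing quote */
                p++;
                len = strcspn(p, "\"");
            } else {                        /* plain token: scan to space or quote */
                len = strcspn(p, " \"");
            }
            printf("%.*s\n", (int)len, p);  /* one token */
            p += len;
            if (*p == '"')
                p++;                        /* step past the closing quote */
        }
    }

    int main(void)
    {
        tokenize("This \"is an\" Example.");
        return 0;
    }

Generalizing this to "any character x" from the question is a matter of swapping the quote character in the two delimiter strings passed to strcspn.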
I'm writing an interpreter for a new programming language. The language's syntax is very simple, and the "system library" commands are treated as plain identifiers (they are no special construct, just functions like everything else - only pre-defined internally). And no, this is not yet another one of the million Lisps out there.
The question is:
Should I have the Lexer catch them, or should I do it in the AST-construction code?
What I've done so far:
I tried recognizing all of them in my Lexer script, and there are a lot of them already - over 200. I send the same token back to Bison (SYSTEM_CMD), only with a different value (basically a numeric index into the array where all the system commands are stored).
I think this approach makes it much faster than looking every identifier up in a hash table to see whether it's a system command.
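For comparison, the table-lookup alternative mentioned above would usually be a single identifier rule in the lexer plus a search over a sorted table - something like this sketch (all names hypothetical):

    #include <stdlib.h>
    #include <string.h>

    /* Sorted table of system command names (only a few shown). */
    static const char *system_cmds[] = { "abs", "cos", "print", "sin" };

    static int cmp(const void *key, const void *elem)
    {
        return strcmp((const char *)key, *(const char **)elem);
    }

    /* Returns the SYSTEM_CMD index, or -1 for a plain identifier. */
    int lookup_ident(const char *text)
    {
        const char **hit = bsearch(text, system_cmds,
                                   sizeof system_cmds / sizeof *system_cmds,
                                   sizeof *system_cmds, cmp);
        return hit ? (int)(hit - system_cmds) : -1;
    }

With ~200 entries that is at most about 8 string comparisons per identifier, and the lexer's tables stay small because there is only one identifier pattern.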
The thing is, the Lexer is getting quite huge rather fast (in terms of resulting binary file size, I mean). And I obviously don't like that.
Given that my focus is something both lightning-fast (I'm already quite good with that) and small enough to be embedded, what would be the most recommended approach?
Are there any libraries out there that I can pass my .c files through that will count the visible number of, for example, "if" statements?
We don't have to worry about "if" statements in other files included by the current file - just the count for the current file.
I can do a simple grep or regex, but I wanted to check whether there is something better (but still simple).
If you want to be sure it's done right, I'd make use of clang and walk the AST. A URL to get you started:
http://clang.llvm.org/docs/IntroductionToTheClangAST.html
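A minimal sketch using libclang (clang's C API; link with -lclang, and pass any compiler flags your file needs as command-line arguments) could look like this:

    #include <stdio.h>
    #include <clang-c/Index.h>

    static enum CXChildVisitResult
    count_ifs(CXCursor c, CXCursor parent, CXClientData data)
    {
        (void)parent;
        if (clang_getCursorKind(c) == CXCursor_IfStmt)
            ++*(unsigned *)data;
        return CXChildVisit_Recurse;    /* walk the whole tree */
    }

    int main(int argc, char **argv)
    {
        unsigned n = 0;
        CXIndex idx;
        CXTranslationUnit tu;
        if (argc < 2)
            return 1;
        idx = clang_createIndex(0, 0);
        tu = clang_parseTranslationUnit(idx, argv[1], NULL, 0, NULL, 0,
                                        CXTranslationUnit_None);
        if (tu) {
            clang_visitChildren(clang_getTranslationUnitCursor(tu),
                                count_ifs, &n);
            printf("%u if statements\n", n);
            clang_disposeTranslationUnit(tu);
        }
        clang_disposeIndex(idx);
        return 0;
    }

Note that this counts if statements after preprocessing, so conditionals hidden behind macros are counted and conditionals removed by #if are not.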
First off, there is no way to get the correct answer you are looking for with regular expressions or grep. There are lots of ways you would find those strings, but they could be buried in any number of escape sequences, string literals, comments, etc.
As some commenters have stated, you will need to use a parser/lexer that understands the C language. You want something simple, you said, so you won't be writing this yourself :)
This seems like it might be usable for you:
http://wiki.tcl.tk/3891
From the page:
lexes a string containing C source into a list of tokens
That will probably get you what you want, but even then it's not going to be trivial.
What everyone has said so far is correct; it seems a lot easier to just grep the shit out of your file. The performance hit of this is negligible compared to the alternative, which is to get the gcc source code (or whichever compiler you're using) and hook into the parsing code while it builds the syntax tree. That seems like a pain in the ass, especially when all you're worried about is conditional statements. If what you actually care about is the branches, you could instead look at the object code and count the conditional branch instructions in the assembly, which would correctly tell you the number of branches (rather than relying on how many times you typed a conditional, which will not translate exactly to the branching of the program).
I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of, then stream the data to another computer. I'm currently reading the file in 1000-byte chunks.
So, for example, say the stream contains ASCII codes escaped as '$', some number of digits, and ';', and the data looks like this: "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a" ($33 is '!' and $126 is '~'; "$a" and "$12a" are malformed, so they pass through unchanged).
In my current approach I have the following problems:
What happens when a control string falls on a buffer boundary?
If the '$' is followed by anything other than digits and a ';', I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking about, it is called lexical analysis or tokenization, and it is the territory of regular expressions. For regular languages you can construct a finite state machine that will recognize your input. In practice you can use a tool that understands regular expressions to recognize the input and perform different actions on it.
Depending on your requirements you might go about this differently. For more complicated languages you might want a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much simpler approach, once we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data the buffer can hold has been processed, do another read to refill that half. Your buffer size should be at least twice as large as the largest "word" you need to recognize. Indexing into this buffer uses the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, you can use bitwise & instead).
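For instance, the wrapping index looks like this (a fragment, not a complete buffer manager):

    #include <stddef.h>

    #define BUF_SIZE 4096                  /* power of two */
    static char buf[BUF_SIZE];

    /* i can grow without bound; the mask folds it back into the buffer. */
    static char at(size_t i)
    {
        return buf[i & (BUF_SIZE - 1)];    /* same effect as i % BUF_SIZE */
    }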
Now you just look at characters until you see a $, output what you've seen up to that point, and then, knowing you are in a different state because you saw a $, keep looking at characters until you see the character that ends the current state (the ;) and perform some other action on the data you read in. How to handle the case where the $ is not followed by a well-formed number and a ; wasn't entirely clear in your question - what to do if there are a million digits before you see a ;, for instance.
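To make the states concrete, here is a byte-at-a-time sketch; feed() is called with each byte pulled out of the buffer, and emit() stands in for whatever sends output to the other computer. Because the state lives across calls, a control string split across two reads is handled for free, and the size of ctrl caps how far we read ahead before giving up on a '$':

    #include <ctype.h>
    #include <stdio.h>

    enum state { TEXT, IN_CTRL };

    static enum state st = TEXT;
    static char ctrl[8];     /* held-back "$ddd" characters */
    static int  ctrl_len;
    static int  code;

    static void emit(int ch) { putchar(ch); }

    void feed(int ch)
    {
        switch (st) {
        case TEXT:
            if (ch == '$') {
                st = IN_CTRL;
                ctrl_len = 0;
                ctrl[ctrl_len++] = '$';
                code = 0;
            } else {
                emit(ch);
            }
            break;
        case IN_CTRL:
            if (isdigit(ch) && ctrl_len < (int)sizeof ctrl - 1) {
                ctrl[ctrl_len++] = (char)ch;
                code = code * 10 + (ch - '0');
            } else if (ch == ';' && ctrl_len > 1) {
                emit(code);              /* "$126;" becomes '~' */
                st = TEXT;
            } else {                     /* malformed: replay what we held back */
                int i;
                for (i = 0; i < ctrl_len; i++)
                    emit(ctrl[i]);
                st = TEXT;
                feed(ch);                /* reprocess this byte as plain text */
            }
            break;
        }
    }

At end of input you still need one flush: if st is IN_CTRL, the held-back characters in ctrl should be written out as-is.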
The regular expressions would be:
[^$]
Any non-dollar-sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a whole run of non-$ characters at a time, but such a run could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits, followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in brackets because $ is special in many regular expression notations when it appears at the end of a pattern (which it does in this case), where it means "match only at the end of a line".
This rule catches a dollar sign in the cases where it is not matched by the other, longer pattern.
In lex you might have
[^$]{1,1024}  { write_string(yytext); }         /* plain text: pass through */
$[0-9]{1,3};  { write_char(atoi(yytext + 1)); }  /* "$126;" -> '~' (skip the '$') */
[$]           { write_char(*yytext); }           /* lone '$': pass through */
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.