Parsing: library functions, FSM, explode(), or lex/yacc? (C)

When I have to parse text (e.g. config files or other rather simple/descriptive languages), there are several solutions that come to my mind:
using library functions, e.g. strtok(), sscanf()
a finite state machine which processes one char at a time, tokenizing and parsing
using the explode() function I once wrote out of pure boredom
using lex/yacc (read: flex/bison) to generate an appropriate parser
I don't like the "library functions" approach. It feels clumsy and awkward. explode(), while it doesn't take much new code, feels even more bloated. And flex/bison often seems like sheer overkill.
I usually implement an FSM (sketched below), but at the same time I already feel sorry for the poor guy who may have to maintain my code at a later point.
Hence my question:
What is the best way to parse relatively simple text files?
Does it matter at all?
Is there a commonly agreed-upon approach?
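
To make the FSM option concrete, here is a rough sketch of the kind of thing I usually write, assuming key = value lines (bounds checks omitted for brevity):

#include <ctype.h>
#include <stdio.h>

enum state { S_KEY, S_SEP, S_VALUE };

/* Process one line character by character: collect the key, skip the
 * '=' and surrounding blanks, then collect the rest as the value. */
static void parse_line(const char *line)
{
    enum state st = S_KEY;
    char key[64], value[128];
    size_t k = 0, v = 0;
    const char *p;

    for (p = line; *p && *p != '\n'; p++) {
        switch (st) {
        case S_KEY:
            if (isalnum((unsigned char)*p) || *p == '_')
                key[k++] = *p;
            else
                st = S_SEP;      /* hit '=' or whitespace */
            break;
        case S_SEP:
            if (*p != '=' && !isspace((unsigned char)*p)) {
                value[v++] = *p;
                st = S_VALUE;
            }
            break;
        case S_VALUE:
            value[v++] = *p;
            break;
        }
    }
    key[k] = value[v] = '\0';
    printf("key='%s' value='%s'\n", key, value);
}

int main(void)
{
    parse_line("timeout = 30");
    return 0;
}

Even at this size you can see why I feel sorry for whoever maintains it.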

I'm going to break the rules a bit and answer your questions out of order.
Is there a commonly agreed-upon approach?
Absolutely not. IMHO the solution you choose should depend on (to name a few) your text, your timeframe, your experience, even your personality. If the text is simple enough to make flex and bison overkill, maybe C is itself overkill. Is it more important to be fast, or robust? Does it need to be maintained, or can it start quick and dirty? Are you a passionate C user, or can you be enticed away with the right language features? &c., &c.
Does it matter at all?
Again, this is something only you can answer. If you're working closely with a team of people, with particular skills and abilities, and the parser is important and needs to be maintained, it sure does matter! If you're writing something "out of pure boredom," I would suggest that it doesn't matter at all, no. :-)
What is the best way to parse relatively simple text files?
Well, I don't know that you're going to like my answer. Maybe first read some of the other fine answers here.
No, really, go ahead. I'll wait.
Ah, you're back and relaxed. Let's ease into things, shall we?
Never write it in 'C' if you can do it in 'awk';
Never do it in 'awk' if 'sed' can handle it;
Never use 'sed' when 'tr' can do the job;
Never invoke 'tr' when 'cat' is sufficient;
Avoid using 'cat' whenever possible.
-- Taylor's Laws of Programming
If you're writing it in C, but C feels like the wrong tool...it really might be the wrong tool. awk or perl will likely do what you're trying to do without all the aggravation. You may even be able to do it with cut or something similar.
On the other hand, if you're writing it in C, you probably have a good reason to write it in C. Maybe your parser is a tiny part of a much larger system, which, for the sake of argument, is embedded, in a refrigerator, on the moon. Or maybe you loooove C. You may even hate awk and perl, heaven forfend.
If you don't hate awk and perl, you may want to embed them into your C program. This is doable, in principle--I've never done it myself. For awk, try libmawk. For perl, there are probably a few ways (TMTOWTDI). You can run perl separately using popen to start it, or you can actually embed a Perl interpreter into your C program--see man perlembed.
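
For the popen route, a minimal sketch (the perl one-liner and the config.ini name are just placeholders I invented):

#include <stdio.h>

int main(void)
{
    /* Let perl pull the value of "name" out of config.ini and read
     * whatever it prints back through a pipe. */
    FILE *p = popen("perl -ne 'print \"$1\\n\" if /^name\\s*=\\s*(.*)/' config.ini", "r");
    char line[256];

    if (!p) {
        perror("popen");
        return 1;
    }
    while (fgets(line, sizeof line, p))
        printf("got: %s", line);
    pclose(p);
    return 0;
}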
Anyhow, as I've said, "the best way to parse" entirely depends on you and your team, the problem space, and your approach to the issue. What I can offer is my opinion.
I'm going to assume that in your C-only solutions (library functions and FSM; I consider your explode() to be essentially a library function) you've already done your best at isolating the relevant code, designing the code and files well, and so forth.
Even so, I'm going to recommend lex and yacc.
Library functions feel "clumsy and awkward." A state machine seems unmaintainable. But you say that lex and yacc feel like overkill.
I think you should approach your complaints differently. What you're really doing is specifying an FSM. However, you're also hiring someone to write and maintain it for you, thereby solving most of the maintainability problem. Overkill? Did I mention they'll work for free?
I suspect, but do not know, that the reason lex and yacc originally felt like overkill was that your config / simple files just felt too, well, simple. If I'm right (a big if), you may be able to do most of your work in the lexer. (It's even conceivable that you can do all of your work in the lexer, but I know nothing about your input.) If your input is not only simple but widespread, you may be able to find a lexer/parser combination freely available for what you need.
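
To illustrate how much of the work can happen in the lexer, here is a hypothetical flex specification for key = value config lines (the patterns and token names are invented for the example; build with flex config.l && cc lex.yy.c):

%{
#include <stdio.h>
%}
%option noyywrap
%%
[A-Za-z_][A-Za-z0-9_]*   { printf("KEY    %s\n", yytext); }
\"[^\"\n]*\"             { printf("STRING %s\n", yytext); }
[0-9]+                   { printf("NUMBER %s\n", yytext); }
"="                      { printf("EQUALS\n"); }
#.*                      ;   /* comment: skip to end of line */
[ \t\r\n]+               ;   /* whitespace: ignore */
.                        { fprintf(stderr, "unexpected '%s'\n", yytext); }
%%
int main(void) { return yylex(); }

If yacc turns out to be needed after all, those printf calls become return TOKEN; statements and the grammar lives in a separate .y file.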
In short: if you can do this not in C, try something else. If you want C, use lex and yacc--they have a little overhead, but they're a very good solution.

If you can get it to work, I'd go with an FSM, but with a huge assist from Perl-compatible regular expressions (PCRE). The library is easy to understand, and you ought to be able to trim back sufficient extraneous spaghetti to give your monster that aerodynamic flair to which all flying monsters aspire. That, and plenty of comments in well-structured spaghetti, ought to make your code-maintaining successor comfortable. (And, as I'm sure you know, that code-maintaining successor is you after six months, when you've moved on to something else and the details of this code have slipped your mind.)
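
As a sketch of the combination (using the current PCRE2 flavor of the library; the pattern and input line are invented for the example):

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>
#include <string.h>

/* Build with: cc fsm.c $(pcre2-config --libs8) */
int main(void)
{
    int errcode;
    PCRE2_SIZE erroffset;
    pcre2_code *re = pcre2_compile(
        (PCRE2_SPTR)"^\\s*(\\w+)\\s*=\\s*(.*?)\\s*$",
        PCRE2_ZERO_TERMINATED, 0, &errcode, &erroffset, NULL);
    if (!re)
        return 1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    const char *line = "timeout = 30";

    /* One regex replaces several FSM states: it recognizes the key,
     * the separator, and the value in a single match. */
    if (pcre2_match(re, (PCRE2_SPTR)line, strlen(line), 0, 0, md, NULL) > 0) {
        PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
        printf("key='%.*s' value='%.*s'\n",
               (int)(ov[3] - ov[2]), line + ov[2],
               (int)(ov[5] - ov[4]), line + ov[4]);
    }
    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}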

My short answer is: use the right tool for the problem. If you have configuration files, use existing standards and formats, e.g. INI files, and parse them with Boost program_options.
If you enter the world of "own" languages, use lex/yacc, since they provide the required features, but you have to weigh the cost of maintaining the grammar and language implementation.
As a result, I would recommend narrowing your problem scope further to find the right tool.

Related

Best way to identify system library commands in Lexer/Bison

I'm writing an interpreter for a new programming language. The language's syntax is very simple, and the "system library" commands are treated as plain identifiers (there is no special construct for them; each is a function like everything else, only pre-defined internally). And no, this is not yet another one of the million Lisps out there.
The question is:
Should I have the Lexer catch them, or should I do it in the AST-construction code?
What I've done so far:
I tried recognizing all of them in my Lexer script, and there are a lot of them already -- over 200. I send the same token (SYSTEM_CMD) back to Bison, only with a different value (basically a numeric index into the array where all the system commands are stored).
As an approach, I think this makes it much faster than having to look up every single one of them in a hash table to see whether it's a system command.
The thing is, the Lexer is getting quite huge (in terms of resulting binary file size, I mean) rather fast. And I obviously don't like it.
Given that my focus is something both lightning-fast (I'm already quite good with that) and small enough to be embedded, what would be the most recommended approach?
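
For reference, the table-lookup alternative I'm weighing against would look roughly like this (a hypothetical sketch; the command names are invented): the lexer keeps one generic identifier rule and asks this function whether the name is a built-in.

#include <stdlib.h>
#include <string.h>

/* Sorted table of built-in command names (the real one has 200+). */
static const char *system_cmds[] = { "abs", "print", "sqrt", "substr" };

static int cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char **)elem);
}

/* Returns the command's index, or -1 if it is an ordinary identifier.
 * One bsearch() here replaces 200+ separate lexer patterns, which is
 * what makes the generated scanner tables so large. */
int system_cmd_index(const char *name)
{
    const char **hit = bsearch(name, system_cmds,
                               sizeof system_cmds / sizeof *system_cmds,
                               sizeof *system_cmds, cmp);
    return hit ? (int)(hit - system_cmds) : -1;
}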

Jest -- Proper way to write test descriptions

This may seem pedantic but I'm asking from a learner's perspective.
The it helper lends itself to the syntax
it('should behave in this certain way...')
and almost all the coders I know insist on writing 'should' for every single test.
To me this is a subtle but annoying pet peeve, because the whole point of it is to save me keystrokes -- if it comes with the requirement to type 'should' for every test, it seems like an utter waste of a shortcut.
Is it a best practice to write it('should...') or can we simply write it('behaves in this expected way')?
Serious question -- I'm losing my mind over this little detail.
This might be an opinion-based question, but I found myself writing a lot of shoulds in the past, until I found a Spotify repository that changed my testing life.
Should-up is a CLI that removes "should" from your tests, and its README.md explains why and how you should write test descriptions.
Basically it just removes a lot of text that is redundant, and makes everything easier to read.

Readable text in disassembled code

Is there any widely used procedure for hiding readable strings? After debugging my code I found a lot of plain text. I could use some simple encryption (Caesar cipher, etc.), but that solution would totally slow down my code. Any ideas? Thanks for the help.
No, there is no widely used method for hiding referenced strings.
At some point an accessed string has to be decrypted, and that reveals the key/method, so your decryption becomes mere obfuscation. If somebody wants to read all your referenced strings, they could easily write a script to convert them all to readable form.
I can't think of any reason to obfuscate strings like that. They are only visible to someone who analyses your executable, and those people are just as capable of reverse engineering your deobfuscation and applying it to all the strings.
If secrecy of strings is vital to the security of your application, you have to rethink that.
Sidenote: there is no way that deciphering strings in C will slow down your application, unless your application is full of strings and you do something very inefficient in the deciphering. Have you tested this?
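
To see why it is only obfuscation, and why the cost is negligible, consider a hypothetical one-byte XOR scheme (the string and key are invented):

#include <stdio.h>
#include <stddef.h>

/* "hello" with every byte XOR-ed with 0x5A; the terminating 0 is kept. */
static unsigned char secret[] = { 0x32, 0x3F, 0x36, 0x36, 0x35, 0x00 };

static const char *decode(unsigned char *s, unsigned char key)
{
    size_t i;
    for (i = 0; s[i]; i++)
        s[i] ^= key;     /* one XOR per byte: effectively free */
    return (const char *)s;
}

int main(void)
{
    puts(decode(secret, 0x5A));   /* prints "hello" */
    return 0;
}

Anyone who spots decode() in the disassembly can run the same loop over every blob in the binary, which is exactly the script mentioned above.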

Clojure Database Access Patterns/Idioms

The macros listed in this gist https://gist.github.com/1177043 (pasted below)
(defmacro wrap-connection [& body]
  `(if (sql/find-connection)
     (do ~@body)
     (sql/with-connection db ~@body)))

(defmacro transaction [& body]
  `(if (sql/find-connection)
     (sql/transaction ~@body)
     (sql/with-connection db (sql/transaction ~@body))))
seem to be pretty useful. Is there a "standard" implementation of these? By standard, I mean something in clojure.contrib or similar. I can easily copy and paste this into my code, but I'm wondering if there's a better way. Or, put another way, what's the Clojure way of doing this?
This is my first foray into actually writing Clojure code (I've read a lot about it and Common Lisp), so I'm also trying to get a feel for what libraries are out there. It seems to me that the Lisp mentality is kind of "I can write it myself in 15 lines, so why would I use somebody else's?"
From what I have seen, once you abstract beyond sql/transaction and the like, the abstractions tend to be more application-specific. Writing your own wrappers, as above, when it truly makes things simpler is the canonical way to go about things.
Only use as many macros as actually make your life simpler; it can be tempting to nest macros (as in the gist) in ways that are more confusing or harder to maintain. I like this gist; just be careful not to over-macro.
PS: if some code is more than about five lines, I look to see if someone else has written it first :) but many Clojurians feel differently.

C style printf/scanf

I've seen, here and elsewhere, many questions that, to get input data, use something like this:
...
printf("What's your name? ");
scanf("%s",name);
...
This is very reminiscent of the old BASIC days (INPUT for those who remember it).
The majority of those questions, if not all, are from people just learning C, and are homework or examples taken from their books.
I clearly remember that when I learned C, I was told that this question/answer style was not good practice for getting user input. The "Right Way" was either to get parameters on the command line (argv[...]) or to read from a data file to be parsed with fgets(). When user-friendliness was a must, termio and friends had to be used.
Now, I wonder whether anything has changed in recent years. Are people taught to structure user interaction as a set of questions and answers now?
I can only see disadvantages in the printf()/scanf() approach, the main one being the diversity of terminals (^H anyone?) that can make it difficult for the user to correct mistakes.
Could anyone point me to concrete advantages of this approach?
This structure is easy to explain and easy to learn, which is why it appears in so many introductory materials. Doing user input "the right way" in C can appear fairly daunting to a neophyte, especially when you have to deal with tokenizing and conversions.
However, I agree that it would be valuable for introductory materials to demonstrate more robust methods for handling user input.
I always thought the Unix way was to accept input from stdin. That way the caller of the command can pipe input in from another command, from a file or manually.
fgets() / sscanf() or similar is the right way to accept user input.
Take what you read here and elsewhere with a (large) pinch of salt.
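
A minimal sketch of that fgets()/sscanf() pattern:

#include <stdio.h>

int main(void)
{
    char line[128];
    char name[64];

    printf("What's your name? ");
    fflush(stdout);
    /* Read a whole line first, then parse it: a malformed entry never
     * leaves stray characters sitting in the input stream, and the
     * field width keeps sscanf from overrunning the buffer. */
    if (fgets(line, sizeof line, stdin) && sscanf(line, "%63s", name) == 1)
        printf("Hello, %s!\n", name);
    return 0;
}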
The GNU readline library is really an excellent resource for this. Its main advantage is that it handles all the intricacies of editing, as well as letting users have their own input settings, eg Vi or Emacs mode.
This is the library bash and many other programs that accept interactive line-based data use.
By using the library, you get an interface that your users will have some knowledge of how to use, plus you get all sorts of nice features without having to explicitly code support for line editing.
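
A minimal sketch (link with -lreadline):

#include <stdio.h>
#include <stdlib.h>
#include <readline/readline.h>
#include <readline/history.h>

int main(void)
{
    char *line;

    /* readline() gives the user line editing in their preferred style;
     * add_history() makes earlier input reachable with the up arrow. */
    while ((line = readline("> ")) != NULL) {
        if (*line)
            add_history(line);
        printf("you typed: %s\n", line);
        free(line);   /* each returned line is malloc'd by readline */
    }
    return 0;
}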
I agree with CTFord that IF you're doing a command-line program, then stdin is a perfectly reasonable way to deal with input.
However, nowadays the correct way is to make a window that has a text box, a label, and a button, and blah blah blah.
I know that doesn't really answer your question, but I think my point is that that style of programming has become MUCH less prevalent, so it doesn't really matter what is "correct".
In many cases you are probably missing the point of the exercise if you think this is important. There's a world of difference between a homework exercise and a real-world program. The purpose of such exercises is seldom, if ever, to teach user-interface design; the technique you describe is generally just a simple way to obtain test input for the real exercise. It is also probably encouraged by tutors who, pressed for time, require submitted code to adhere to some 'course style' to make marking easier. 'Course style' is seldom the same thing as 'good practice' or 'practical style', but that is not to say that it serves no purpose, or even that the purpose is not beneficial to the student's education.
Unfortunately, novices often get hung up on this stuff when it is generally irrelevant to the actual purpose of the exercise. The problem is that user input using standard library primitives is not as foolproof as some tutors seem to think.
