Parsing string with C - whats the right tool? - c

I got a string in this format:
Stuff: </value_1/value_2/value_3>; key="value"
What I need parsed out is value_1, value_2 and value_3 along with the key/value pair. value_3 might or might not be present in the string.
What would one use in C in order to get this done?
I thought about sscanf but the values can be of arbitratry size, so they should be allocated dynamically. strtok would have been my next idea, but that probably needs two separate loops to extract the / separated values and the key/value pair… seems tedious but at least doable.
Anyone with more experience in C has any better idea?
EDIT: regex could be an option, but I would prefer standard string functions if possible at all.

If you use "</>; =\"" as the delimiter set, then strtok() will work in a single pass, extracting in turn:
Stuff:
value_1
value_2
value_3
key
value
Not sure what Stuff: is or if it is needed in the extraction, or just an elidation on your part.

You could use regular expressions; see Regular Expressions - The GNU C Library.
If you don't know what regular expressions are; see Regular Expressions.
Here's an online regular expression editor (or this one, if you don't have Flash installed), so you can test your regexp!

Related

Parsing shell commands in c: string cutting with respect to its contents

I'm currently creating Linux shell to learn more about system calls.
I've already figured out most of the things. Parser, token generation, passing appropriate things to appropriate system calls - works.
The thing is, that even before I start making tokens, I split whole command string into separate words. It's based on array of separators, and it works surprisingly good. Except that I'm struggling with adding additional functionality to it, like escape sequences or quotes. I can't really live without it, since even people using basic grep commands use arguments with quotes. I'll need to add functionality for:
' ' - ignore every other separator, operator or double quotes found between those two, pass this as one string, don't include these quotation marks into resulting word,
" "- same as above, but ignore single quotes,
\\ - escape this into single backslash,
\(space) - escape this into space, do not parse resulting space as separator
\", \' - analogously to the above.
Many other things that I haven't figured out I need yet
and every single one of them seems like an exception on its own. Each of them must operate on diversity of possible positions in commands, being included into result or not, having influence on the rest of the parsing. It makes my code look like big ball of mud.
Is there a better approach to do this? Is there a more general algorithm for that purpose?
You are trying to solve a classic problem in program analysis (of lexing and parsing) using a nontraditional structure for lexer ( I split whole command string into separate words... ). OK, then you will have non-traditional troubles with getting the lexer "right".
That doesn't mean that way is doomed to failure, and without seeing specific instances of your problem, (you list a set of constructs you want to handle, but don't say why these are hard to process), it is hard to provide any specific advice. It also doesn't mean that way will lead to success; splitting the line may break tokens that shouldn't be broken (usually by getting confused about what has been escaped).
The point of using a standard lexer (such as Flex or any of the 1000 variants you can get) is that they provide a proven approach to complex lexing problems, based generally on the idea that one can use regular expressions to describe the shape of individual lexemes. Thus, you get one regexp per lexeme type, thus an ocean of them but each one is pretty easy to specify by itself.
I've done ~~40 languages using strong lexers and parsers (using one of the ones in that list). I assure you the standard approach is empirically pretty effective. The types of surprises are well understood and manageable. A nonstandard approach always has the risk that it will surprise you in a bad way.
Last remark: shell languages for Unix have had people adding crazy stuff for 40 years. Expect the job to be at least medium hard, and don't expect it to be pretty like Wirth's original Pascal.

Use of regular expressions in c for strcmp function

So, what im interesting in trying to do is read in a line with getline and look for specific phases if i see something like "verbose on" or "verbose off" i know what i want to do, however if it is "verbose "something"" then i want to error. Im pretty sure that this is going to require regular expressions because what comes after is arbitrary. Some insight on this problem would be much appreciated. Thank you.
strcmp(buf,"verbose on")==0
strcmp(buf,"verbose off")==0
strcmp(buf,"verbose "regex expression here im thinking"")==0
This is how i think it should go, just need a bit of a push.
No need for a regex. You can use strncmp:
strncmp(buf, "verbose", strlen("verbose")) == 0
This only compares the first 7 characters, so it will match any buf that starts with "verbose".
Note: I'm allergic against magic numbers, but you could of course replace the strlen call with a literal 7 if you prefer. Also, for real code, I would replace the duplicated string literal with a constant.

Access a cell directly in .csv file using C programming

hey guys!
is there any way of directly accessing a cell in a .csv file format using C?
e.g. i want to sum up a column using C, how do i do it?
It's probably easiest to use the scanf-family for this, but it depends a little on how your data is organized. Let's say you have three columns of numeric data, and you want to sum up the third column, you could loop over a statement like this: (file is a FILE*, and is opened using fopen, and you loop until end of file is reached)
int n; fscanf(file, "%*d,%*d,%d", &n);
and sum up the ns. If you have other kinds of data in your file, you need to specify your format string accordingly. If different lines have different kinds of data, you'll probably need to search the string for separators instead and pick the third interval.
That said, it's probably easier not to use C at all, e.g. perl or awk will probably do a better job, :) but I suppose that's not an option.
If you have to use C: read the entire line to memory, go couting "," until you reach your desired column, read the value and sum it, go to next line.
When you reach your value, you can use sscanf to read it.
You might want to start by looking at RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files, and then looking for implementations of the same. (Be aware though, that the notion of comma separated values predates the RFC and there are many implementations that do not comply with that document.)
I find:
ccsv
And not many others in plain c. There are quite a few c++ implementations, and most of the are probably readily adapted to c.

C library for parsing query strings?

I'm writing a C/CGI web application. Is there a library to parse a query string into something like a GHashTable? I could write my own but it certainly doesn't seem to be worth the effort to reinvent the wheel.
The uriparser library can parse query strings into key-value pairs.
If you're really writing C and the set of keys is known, you'd do much better to store the values in a struct instead of some bloated hash table. Have a static const table of:
key name
type (integer/string is probably sufficient)
offset in struct (using offsetof macro)
and use that to parse the query string and fill in the struct.
You may find usefull the ap_getword functions family from the Apache Portable Runtime (apr) library.
Most of the string parsing routines belong to the ap_getword* family, which together provide functionality similar to the Perl split() function. Each member of this family is able to extract a word from a string, splitting the text on delimiters such as whitespace or commas. Unlike Perl split(), in which the entire string is split at once and the pieces are returned in a list, the ap_getword* functions operate on one word at a time. The function returns the next word each time it's called and keeps track of where it's been by bumping up a pointer.
You may search google for "ap_getword(r->pool" "httpd.h" and find some relevant source code to learn from.

parsing of mathematical expressions

(in c90) (linux)
input:
sqrt(2 - sin(3*A/B)^2.5) + 0.5*(C*~(D) + 3.11 +B)
a
b /*there are values for a,b,c,d */
c
d
input:
cos(2 - asin(3*A/B)^2.5) +cos(0.5*(C*~(D)) + 3.11 +B)
a
b /*there are values for a,b,c,d */
c
d
input:
sqrt(2 - sin(3*A/B)^2.5)/(0.5*(C*~(D)) + sin(3.11) +ln(B))
/*max lenght of formula is 250 characters*/
a
b /*there are values for a,b,c,d */
c /*each variable with set of floating numbers*/
d
As you can see infix formula in the input depends on user.
My program will take a formula and n-tuples value.
Then it calculate the results for each value of a,b,c and d.
If you wonder I am saying ;outcome of program is graph.
/sometimes,I think i will take input and store in string.
then another idea is arise " I should store formula in the struct"
but ı don't know how I can construct
the code on the base of structure./
really, I don't know way how to store the formula in program code so that
I can do my job.
can you show me?
/* a,b,c,d is letters
cos,sin,sqrt,ln is function*/
You need to write a lexical analyzer to tokenize the input (break it into its component parts--operators, punctuators, identifiers, etc.). Inevitably, you'll end up with some sequence of tokens.
After that, there are a number of ways to evaluate the input. One of the easiest ways to do this is to convert the expression to postfix using the shunting yard algorithm (evaluation of a postfix expression is Easy with a capital E).
You should look up "abstract syntax trees" and "expression trees" as well as "lexical analysis", "syntax", "parse", and "compiler theory". Reading text input and getting meaning from it is quite difficult for most things (though we often try to make sure we have simple input).
The first step in generating a parser is to write down the grammar for your input language. In this case your input language is some Mathematical expressions, so you would do something like:
expr => <function_identifier> ( stmt )
( stmt )
<variable_identifier>
<numerical_constant>
stmt => expr <operator> stmt
(I haven't written a grammar like this {look up BNF and EBNF} in a few years so I've probably made some glaring errors that someone else will kindly point out)
This can get a lot more complicated depending on how you handle operator precedence (multiply and device before add and subtract type stuff), but the point of the grammar in this case is to help you to write a parser.
There are tools that will help you do this (yacc, bison, antlr, and others) but you can do it by hand as well. There are many many ways to go about doing this, but they all have one thing in common -- a stack. Processing a language such as this requires something called a push down automaton, which is just a fancy way of saying something that can make decisions based on new input, a current state, and the top item of the stack. The decisions that it can make include pushing, popping, changing state, and combining (turning 2+3 into 5 is a form of combining). Combining is usually referred to as a production because it produces a result.
Of the various common types of parsers you will almost certainly start out with a recursive decent parser. They are usually written directly in a general purpose programming language, such as C. This type of parser is made up of several (often many) functions that call each other, and they end up using the system stack as the push down automaton stack.
Another thing you will need to do is to write down the different types of words and operators that make up your language. These words and operators are called lexemes and represent the tokens of your language. I represented these tokens in the grammar <like_this>, except for the parenthesis which represented themselves.
You will most likely want to describe your lexemes with a set of regular expressions. You should be familiar with these if you use grep, sed, awk, or perl. They are a way of describing what is known as a regular language which can be processed by something known as a Finite State Automaton. That is just a fancy way of saying that it is a program that can make a decision about changing state by considering only its current state and the next input (the next character of input). For example part of your lexical description might be:
[A-Z] variable-identifier
sqrt function-identifier
log function-identifier
[0-9]+ unsigned-literal
+ operator
- operator
There are also tools which can generate code for this. lex which is one of these is highly integrated with the parser generating program yacc, but since you are trying to learn you can also write your own tokenizer/lexical analysis code in C.
After you have done all of this (it will probably take you quite a while) you will need to have your parser build a tree to represent the expressions and grammar of the input. In the simple case of expression evaluation (like writing a simple command line calculator program) you could have your parser evaluate the formula as it processed the input, but for your case, as I understand it, you will need to make a tree (or Reverse Polish representation, but trees are easier in my opinion).
Then after you have read the values for the variables you can traverse the tree and calculate an actual number.
Possibly the easiest thing to do is use an embedded language like Lua or Python, for both of which the interpreter is written in C. Unfortunately, if you go the Lua route you'll have to convert the binary operations to function calls, in which case it's likely easier to use Python. So I'll go down that path.
If you just want to output the result to the console this is really easy and you won't even have to delve too deep in Python embedding. Since, then you only have to write a single line program in Python to output the value.
Here is the Python code you could use:
exec "import math;A=<vala>;B=<valb>;C=<valc>;D=<vald>;print <formula>".replace("^", "**").replace("log","math.log").replace("ln", "math.log").replace("sin","math.sin").replace("sqrt", "math.sqrt").replace("cos","math.cos")
Note the replaces are done in Python, since I'm quite sure it's easier to do this in Python and not C. Also note, that if you want to use xor('^') you'll have to remove .replace("^","**") and use ** for powering.
I don't know enough C to be able to tell you how to generate this string in C, but after you have, you can use the following program to run it:
#include <Python.h>
int main(int argc, char* argv[])
{
char* progstr = "...";
Py_Initialize();
PyRun_SimpleString(progstr);
Py_Finalize();
return 0;
}
You can look up more information about embedding Python in C here: Python Extension and Embedding Documentation
If you need to use the result of the calculation in your program there are ways to read this value from Python, but you'll have to read up on them yourself.
Also, you should review your posts to SO and other posts regarding Binary Trees. Implement this using a tree structure. Traverse as infix to evaluate. There have been some excellent answers to tree questions.
If you need to store this (for persistance as in a file), I suggest XML. Parsing XML should make you really appreciate how easy your assignment is.
Check out this post:
http://blog.barvinograd.com/2011/03/online-function-grapher-formula-parser-part-2/
It uses ANTLR library for parsing math expression, this one specifically uses JavaScript output but ANTLR has many outputs such as Java, Ruby, C++, C# and you should be able to use the grammar in the post for any output language.

Resources