First, I know I know. This question has kind of been asked some times before, but most of the answers got on other topics only partly answer my question.
I'm doing something which can parse C like expressions.
That includes expressions for example like (some examples)
1) struct1.struct2.structarray[283].shd->_var
2) *((*array_dptr)[2][1] + 5)
3) struct1.struct2.struct3.var + b * c / 3 % 5
Problem is... I need to be fast on this. The fastest possible, even if it makes the code ugly - well, obviously, the speed improvement must be tangible. The reason is that it is interpreted. It needs to be fast...
I have many questions, and I will probably ask some more depending on your answers. But anyways...
First, I'm aware of "operator priorities". For example algorithms implemented in C compilers will assign to operators a priority number and evaluate the expression based on that.
I've consulted this table : http://en.wikipedia.org/wiki/Operators_in_C_and_C++#Operator_precedence
Now, this is cool but... I wonder a few things.
My principal question is... how would you implement this to be the fastest possible?
I have thought about for example... (please note the program I'm speaking about actually parses a file containing these expressions, and not all C operators will be supported)
1) Stocking the expression string into an array, storing each operator position inside an array, and then starting to parse all this crap, starting from the highest priority operator. For example if I had str = "2*1+3", then after checking all the operators present, I would check for the position at str[1], and the check at right and left, do the operation (here multiply) and then substitude the expression with the result and evaluate again.
The problem I see there is... say two operators in the expr are the same priority
for example : var1 * var2 / var3 / var4
since * and / have both the same precedence, how to know on which position to start the parsing? Of course this example is rather intuitive, but I can the problem growing on enormous expressions.
2) Is this even possible to do it non recursive? Usually recursive means slower due to multiple function call setting their own stack frames, re-initializing stuff etc etc.
3) How to distinguish unary operators from non unaries?
For example : 2 + *a + b * c
There is the dereferencing op and the multiplication one. Intuitively I have an idea on how to do it, but I ain't sure. I'd rather have your advices on this (i think : check if one of the right or left members are operators, if so, then it's unary?)
4) I don't get expressions being evaluated right-to-left. Seems so unnatural to be. More that I don't unterstand what does it means. Would you show an example? Why do it that way?!?
5) Do you have better algorithms in head? Better ideas of achieving it?
For now, that sums pretty much what I'm thinking about.
This ain't an homework by the way. It's practical stuff.
Thanks!
Related
From main(), I want the user to input a mathematical function (I,e: 2xy) through the command line. From there, I initially thought to iterate through the string and parse out different arithmetic operators, x, y, etc. However, this could become fairly complicated for more intricate functions, (e.g: (2x^2)/5 +sqrt(x^4) ). Is there a more general method to be able to parse a mathematical function string like this one?
One of the most helpful ways to deal with parsing issues like that is to switch the input methods from equations like that to an RPN based input where the arguments come first and the operators come last.
Rewriting your complex equation would end up looking like:
2 2 x ^ * 5 / x 4 ^ sqrt +
This is generally easier to implement, as you can do it with a simple stack -- pushing new arguments on, while the operators pull the require pieces off the stack and put the result back on. Greatly simplifies the parsing, but you still need to implement the functions.
What you need is an expression evaluator.
A while ago, I wrote a complete C expression evaluator (i.e. evaluated expressions written using C syntax) for a command line processor and scripting language on an embedded system. I used this description of the algorithm as a starting point. You could use the accompanying code directly, but I did not like the implementation, and wrote my own from the algorithm description.
It needed some work to support all C operators, function calls, and variables, but is a clear explanation and therefore a good starting point, especially if you don't need that level of completeness.
The basic principle is that expression evaluation is easier for a computer using a stack and 'Reverse Polish Notation', so the algorithm converts an in-fix notation expression with associated order of precedence and parentheses to RPN, and then evaluates it by popping operands, performing operations, and pushing results, until there are no operations left and one value left on the stack.
It might get a bit more complicated is you choose to deal with implicit multiply operators (2xy rather then 2 * x * y for example. Not least because you'd need to unambiguously distinguish the variables x and y from a single variable xy. That is probably only feasible if you only allow single character variable names. I suggest you either do that and insert explicit multiply operators on the operator stack as part of the parse, or you disallow implicit multiply.
I know this works to get the index (and do the pointer arithmetic automatically):
struct Obj *c = &p[i];
However, does this have a performance hit compared to doing the pointer arithmetic manually?
struct Obj *c = p+i;
Are there any reasons why the first should be avoided?
However, does this have a performance hit compared to doing the pointer arithmetic manually?
Expressions using the indexing operator are defined to be equivalent to corresponding operations involving pointer arithmetic and dereferencing, so
&p[i]
is exactly equivalent to
&(*(p + i))
The standard furthermore specifies that when the operand of the & operator is the result of a * operator, neither is evaluated, and the result is as if both were omitted, so that is in turn equivalent to
p + i
, which has defined behavior as long as it does not purport to produce a pointer to before the beginning of the array from which pointer p was derived, nor more than one position past its end.
Although it is conceivable that a compiler would produce different or worse code in one case than in the other, there is no reason to expect that.
Are there any reasons why the first should be avoided?
Not much. The latter has a lighter cognitive load and is easier to read, inasmuch as (syntactically) it involves only one operation rather than two. I tend to prefer it myself for that reason. On the other hand, understanding it may be a little less intuitive.
That second version is perplexing and stands out as an anomaly, so unless you have an extremely compelling case for using it, don't. If you do, then you should document why so someone doesn't go and replace it with the former.
The second form is also really brittle in that it might break if you changed something by refactoring and introducing a new struct like:
struct Mobj *c = p + i * sizeof(struct Obj);
Where you forgot to update the sizeof part and now your code is super broken.
That of course overlooks the fact that when doing pointer math it will automatically increment by the size of the structure anyway, that is:
p[i] == *(p + i)
But in your case what you're doing is effectively this:
p[i * sizeof(Obj)]
Which is not what you're intending.
Less is more. Coding is hard enough as it is, don't overcomplicate things.
Yea, it is okay. In fact, when you use p it is actually the same as &p[0] from the compiler's point of view. Always let the compiler do the pointer math for you, use the first form.
The C standard states that &array[index] is exactly equivalent to array + index. This means that both do pointer arithmetic, so they should be no different.
Even if they were different choosing one over the other would be a microoptimization that's not worth doing unless you have a significant measurable difference between them.
Was there a very early C "standard" where the following was legal for the definition of a two-dimensional array?
int array[const_x, const_y];
int array2[2, 10];
I've stumbled upon some old legacy code which uses this (and only this) notation for multi-dimensional arrays. The code is, except for this oddity, perfectly valid C (and surprisingly well-designed for the time).
As I didn't find any macros which convert between [,] and [][], and I assume it's not a form of practical joke, it seems that once upon a time there hath been thy olde C compiler which accepted this notation. Or did I miss something?
Edit: If it helps, it's for embedded microcontrollers (atmel). From experience I can tell, that embedded compilers are not that well-known for standard-compliance.
The code on current compilers works as intended (as far as it can be guessed from the function names, descriptions and variables) if I change all [,] to [][].
The first formal standard was ANSI X3.159-1989, and the first informal standard would generally be agreed to be the first edition of Kernighan & Ritchie. Neither of these allowed the comma to be used to declare a two-dimensional array.
It appears to be an idiosyncracy of your particular compiler (one that renders it non-standard-conforming, since it would change the semantics of some conforming programs).
Take a look at this forum post.
the comma operator evaluates the left hand side, discards
the result, then evaluates the right hand side. Thus "2,5" is the same
as "5", and "5,2" is the same as "2".
This could be what is happening, although the why of it is beyond me.
Note that comma cannot be used in indexing multidimensional array: the code A[i, j] evaluates to A[j] with the i discarded, instead of the correct A[i][j]. This differs from the syntax in Pascal, where A[i, j] is correct, and can be a source of errors.
From Wikipedia
In K&R Section 5.10, in their sample implementation of a grep-like function, there are these lines:
while (--argc > 0 && (*++argv)[0] == '-')
while (c = *++argv[0])
Understanding the syntax there was one of the most challenging things for me, and even now a couple weeks after viewing it for the first time, I still have to think very slowly through the syntax to make sense of it. I compiled the program with this alternate syntax, but I'm not sure that the second line is allowable. I've just never seen *'s and ++'s interleaved like this, but it makes sense to me, it compiles, and it runs. It also requires no parentheses or brackets, which is maybe part of why it seems more clear to me. I just read the operators in one direction only (right to left) rather than bouncing back and forth to either side of the variable name.
while (--argc > 0 && **++argv == '-')
while (c = *++*argv)
Well for one, that's one way to make anyone reading your code to go huh?!?!?!
So, from a readability standpoint, no, you probably shouldn't write code like that.
Nevertheless, it's valid code and breaks down as this:
*(++(*p))
First, p is dereferenced. Then it is incremented. Then it is dereferenced again.
To make thing worse, this line:
while (c = *++*argv)
has an assignment in the loop-condition. So now you have two side-effects to make your reader's head spin. YAY!!!
Seems valid to me. Of course, you should not read it left to right, that's not how C compiler parses the source, and that's not how C language grammatics work. As a rule of thumb, you should first locate the object that's subject to operating upon (in this case - argv), and then analyze the operators, often, like in this case, from inside (the object) to outside. The actual parsing (and reading) rules are of course more complicated.
P. S. And personally, I think this line of code is really not hard to understand (and I'm not a C programming guru), so I don't think you should surround it with parentheses as Mysticial suggests. That would only make the code look big, if you know what I mean...
There's no ambiguity, even without knowledge of the precedence rules.
Both ++ and * are prefix unary operators; they can only apply to an operand that follows them. The second * can only apply to argv, the ++ to *argv, and the first * to ++*argv. So it's equivalent to *(++(*argv)). There's no possible relationship between the precedences of ++ and * that could make it mean anything else.
This is unlike something like *argv++, which could conceivably be either (*argv)++ or *(argv++), and you have to apply precedence rules to determine which (it's *(argv++)` because postfix operators bind more tightly than prefix unary operators).
There's a constraint that ++ can only be applied to an lvalue; since *argv is an lvalue, that's not a problem.
Is this code valid? Yes, but that's not what you asked.
Is this code acceptable? That depends (acceptable to who?).
I wouldn't consider it acceptable - I'd consider it "harder to read than necessary" for a few different reasons.
First; lots of programmers have to work with several different languages, potentially with different operator precedence rules. If your code looks like it relies on a specific language's operator precedence rules (even if it doesn't) then people have to stop and try to remember which rules apply to which language.
Second; different programmers have different skill levels. If you're ever working in a large team of developers you'll find that the best programmers write code that everyone can understand, and the worst programmers write code that contains subtle bugs that half of the team can't spot. Most C programmers should understand "*++*argv", but a good programmer knows that a small number of "not-so-good" programmers either won't understand it or will take a while to figure it out.
Third; out of all the different ways of writing something, you should choose the variation that expresses your intent the best. For this code you're working with an array, and therefore it should look like you intend to be working with an array (and not a pointer). Note: For the same reason, "uint32_t foo = 0x00000002;" is better than "uint32_t foo = 0x02;".
This was an interview question. I said they were the same, but this was adjudged an incorrect response. From the assembler point of view, is there any imaginable difference? I have compiled two short C programs using default gcc optimization and -S to see the assembler output, and they are the same.
The interviewer may have wanted an answer something like this:
i=i+1 will have to load the value of i, add one to it, and then store the result back to i. In contrast, ++i may simply increment the value using a single assembly instruction, so in theory it could be more efficient. However, most compilers will optimize away the difference, and the generated code will be exactly the same.
FWIW, the fact that you know how to look at assembly makes you a better programmer than 90% of the people I've had to interview over the years. Take solace in the fact that you won't have to work with the clueless loser who interviewed you.
Looks like you were right and they were wrong. I had a similar issue in a job interview, where I gave the correct answer that was deemed incorrect.
I confidently argued the point with my interviewer, who obviously took offence to my impudence. I didn't get the job, but then again, working under someone who "knows everything" wouldn't be that desirable either.
You are probably right. A naive compiler might do:
++i to inc [ax]
and
i = i + 1 to add [ax], 1
but any half sensible compiler will just optimize adding 1 to the first version.
This all assumes the relevant architecture has inc and add instructions (like x86 does).
To defend the interviewer, context is everything. What is the type of i? Are we talking C or C++ (or some other C like language)? Were you given:
++i;
i = i + 1;
or was there more context?
If I had been asked this, my first response would've been "is i volatile?" If the answer is yes, then the difference is huge. If not, the difference is small and semantic, but pragmatically none. The proof of that is the difference in parse tree, and the ultimate meaning of the subtrees generated.
So it sounds like you got the pragmatic side right, but the semantic/critical thought side wrong.
To attack the interviewer (without context), I'd have to wonder what the purpose of the question was. If I asked the question, I'd want to use it to find out if the candidate knew subtle semantic differences, how to generate a parse tree, how to think critically and so on and so forth. I typically ask a C question of my interviewees that nearly every candidate gets wrong - and that's by design. I actually don't care about the answer to the question, I care about the journey that I will be taking with the candidate to reach understanding, which tells me far more about than right/wrong on a trivia question.
In C++, it depends if i is an int or an object. If it's an object, it would probably generate a temporary instance.
the context is the main thing here because on a optimized release build the compiler will optimize away the i++ if its available to a simple [inc eax]. whereas something like int some_int = i++ would need to store the i value in some_int FIRST and only then increment i.