In Bob Nystrom's Crafting Interpreters, the author creates his parser using a function pointer table where the main parsing function parsePrecedence() looks up a particular rule in the table and calls the function in the table. This makes sense for unary arithmetic operators like negation, or binary arithmetic operators like addition and multiplication. However, once global variables come into the mix, I no longer understand. For instance, why is the function varDeclaration() not placed in the TOKEN_VAR slot in the table? Wouldn't it count as a prefix operator? And why isn't the assignment operator inserted into the table and considered as an infix operator?
Nystrom explains (in the page which you cited) the reason why assignment operators are special-cased in his parser:
Our bytecode VM uses a single-pass compiler. It parses and generates bytecode on the fly without any intermediate AST. As soon as it recognizes a piece of syntax, it emits code for it.
If the parser were building an AST as an intermediate representation, there would be no problem parsing all operators in the same generic fashion. But since he needs to emit correct byte-code as soon as the assignment is detected, he cannot use the generic framework, because that will emit code to look up the value of the thing on the left-hand side of the assignment operator, and the code which is required is code to set that value.
As for variable declarations (and other statements which start with keywords), it might be possible to use operator precedence techniques to parse them, but since the syntax which follows the keyword is not generally an expression, and because the keyword cannot generally be used as a value inside an expression, the operator-precedence parser would need a lot of special-casing to produce the correct parse. Still, there are language parsers which use operator-precedence parsing for everything, so it is possible. But it's not necessarily the most convenient.
I am learning the basics of the C language and I do not understand how this code works. It should give an error, because I am using the assignment operator (=) instead of the equality operator (==) in the if condition.
#include<stdio.h>
int main()
{
int i=4;
if(i=5){
printf("Yup");
}
else
printf("Nope");
}
While it might not seem intuitive, an assignment is actually a valid expression, with the assigned value being the value of the expression.
So when you see this:
if(i=5){
It is effectively:
if(5){
So why is this behavior allowed? A classic example is that it allows you to call a function, save the return value, and check the return value in one shot:
FILE *fp;
if ((fp = fopen("filename","r")) == NULL) {
perror("fopen failed");
exit(1);
}
// use fp
Here, the return value of fopen is assigned to fp, then fp is checked to see if it is NULL, i.e. if fopen failed.
Using the assignment operator inside an if statement will not give any error.
The assignment i = 5 will take place, and the if statement will be evaluated according to the value of the assignment expression, which is the value that was assigned. In this case that is 5.
The expression i=5 evaluates to a nonzero value, hence the if() condition is true.
I believe this question can be interpreted in two different ways. The first is the most literal: "Why does a C compiler allow this syntax?" The second is probably more vague: "Why was C designed to allow such syntax to be legal?"
The answer to the first can be found in The C Programming Language (a highly recommended book if you do not already have it) and comes down to "because the language says so." It's just the way it is defined.
In the book you can refer to Appendix A to find a description of how the grammar is broken down. Specifically A7. Expressions, and A9. Statements.
A9.4 Selection Statements states:
selection-statement:
if ( expression ) statement
if ( expression ) statement else statement
switch ( expression ) statement
Meaning that any valid expression, which includes an assignment, is legal as the 'argument' to the selection, with a minor caveat (emphasis is my own):
In both forms of the if statement, the expression, which must have arithmetic or pointer type, is evaluated, including all side effects, and if it compares unequal to 0, the first substatement is executed.
This might seem odd if you are coming from a language like Java, which requires the expression used in a conditional to be explicitly boolean. That requirement is meant to reduce runtime errors caused by typos (e.g. using = instead of ==).
As for why C's syntax is like this, I am not sure. A quick Google search returns nothing immediately, but I offer this conjecture (I stress that I have found nothing to back up my claim, and my experience with assembly languages is minimal):
C was designed to be a low level language that mapped closely to assembly level mechanisms; making it easier to implement a compiler for, and to translate assembly to.
In assembly languages, branches are the result of instructions that look at registers and decide what to do. How the value got into the register is of no concern. Decrementing a counter is not a boolean operation, but testing the resulting value in the register is. Allowing a general expression possibly made implementations of C easier to write. The original compiler written by Dennis Ritchie simply spat out assembly files that needed to be assembled manually.
In C, the assignment operator = is just that: an operator. You can use it anywhere an expression is expected,† including in the control expression of an if statement. Modern compilers typically warn about this; make sure to turn on this warning.
† Except where a constant expression is expected as an expression involving the = operator is not a constant expression.
I am studying about undefined behavior in C and I came to a statement that states that
there is no particular order of evaluation of function arguments
but then what about the standard calling conventions like _cdecl and _stdcall, whose definition said (in a book) that arguments are evaluated from right to left.
Now I am confused by these two statements: the one about undefined behavior says something different from the one in the definition of the calling conventions. Please reconcile the two.
As Graznarak's answer correctly points out, the order in which arguments are evaluated is distinct from the order in which arguments are passed.
An ABI typically applies only to the order in which arguments are passed, for example which registers are used and/or the order in which argument values are pushed onto the stack.
What the C standard says is that the order of evaluation is unspecified. For example (remembering that printf returns an int result):
some_func(printf("first\n"), printf("second\n"));
the C standard says that the two messages will be printed in some order (evaluation is not interleaved), but explicitly does not say which order is chosen. It can even vary from one call to the next, without violating the C standard. It could even evaluate the first argument, then evaluate the second argument, then push the second argument's result onto the stack, then push the first argument's result onto the stack.
An ABI might specify which registers are used to pass the two arguments, or exactly where on the stack the values are pushed, which is entirely consistent with the requirements of the C standard.
But even if an ABI actually requires the evaluation to occur in a specified order (so that, for example, printing "second\n" followed by "first\n" would violate the ABI) that would still be consistent with the C standard.
What the C standard says is that the C standard itself does not define the order of evaluation. Some secondary standard is still free to do so.
Incidentally, this does not by itself involve undefined behavior. There are cases where the unspecified order of evaluation can lead to undefined behavior, for example:
printf("%d %d\n", i++, i++); /* undefined behavior! */
Argument evaluation and argument passing are related but different problems.
How arguments are passed, meaning the order in which they are pushed onto the stack and which of them go in registers instead, is what the ABI and calling conventions such as _cdecl and _stdcall specify.
The order of evaluation of arguments before placing them in the locations that the function call requires is unspecified. It can evaluate them left to right, right to left, or some other order. This is compiler dependent and may even vary depending on optimization level.
_cdecl and _stdcall merely specify that the arguments are pushed onto the stack in right-to-left order, not that they are evaluated in that order. Think about what would happen if calling conventions like _cdecl, _stdcall, and pascal changed the order that the arguments were evaluated.
If evaluation order were modified by calling convention, you would have to know the calling convention of the function you're calling in order to understand how your own code would behave. That's a leaky abstraction if I've ever seen one. Somewhere, buried in a header file someone else wrote, would be a cryptic key to understanding just that one line of code; but you've got a few hundred thousand lines, and the behavior changes for each one? That would be insanity.
I feel like much of the undefined behavior in C89 arose from the fact that the standard was written after multiple conflicting implementations existed. They were maybe more concerned with agreeing on a sane baseline that most implementers could accept than they were with defining all behavior. I like to think that all undefined behavior in C is just a place where a group of smart and passionate people agreed to disagree, but I wasn't there.
I'm tempted now to fork a C compiler and make it evaluate function arguments as if they're a binary tree that I'm running a breadth-first traversal of. You can never have too much fun with undefined behavior!
Check the book you mentioned for any references to "Sequence points", because I think that's what you're trying to get at.
Basically, a sequence point is a point at which you can be certain that all preceding expressions have been fully evaluated and all of their side effects are complete.
For example, the end of an initializer is a sequence point. This means that after:
bool foo = !(i++ > j);
You are sure that i will be equal to i's initial value +1, and that foo has been assigned true or false. Another example:
int bar = i++ > j ? i : j;
This is perfectly predictable. It reads as follows: compare the current value of i with j, then increment i. The question mark of the ternary operator is a sequence point, so the increment is complete before either branch is evaluated. If the comparison was true, assign i (NEW VALUE) to bar; otherwise assign j.
All sequence points listed in the C99 standard (Annex C) are:
The following are the sequence points described in 5.1.2.3:
— The call to a function, after the arguments have been evaluated (6.5.2.2).
— The end of the first operand of the following operators: logical AND && (6.5.13);
logical OR || (6.5.14); conditional ? (6.5.15); comma , (6.5.17).
— The end of a full declarator: declarators (6.7.5);
— The end of a full expression: an initializer (6.7.8); the expression in an expression
statement (6.8.3); the controlling expression of a selection statement (if or switch)
(6.8.4); the controlling expression of a while or do statement (6.8.5); each of the
expressions of a for statement (6.8.5.3); the expression in a return statement
(6.8.6.4).
— Immediately before a library function returns (7.1.4).
— After the actions associated with each formatted input/output function conversion
specifier (7.19.6, 7.24.2).
— Immediately before and immediately after each call to a comparison function, and
also between any call to a comparison function and any movement of the objects
passed as arguments to that call (7.20.5).
What this means, in essence, is that an expression that modifies the same object more than once without an intervening sequence point can invoke undefined behaviour, like, for example:
printf("%d, %d and %d\n", i++, i++, i--);
In this statement, the sequence point that applies is "The call to a function, after the arguments have been evaluated". Note the wording: after the arguments are evaluated. If we then look at the semantics, in the same standard under 6.5.2.2, point ten, we see:
10 The order of evaluation of the function designator, the actual arguments, and
subexpressions within the actual arguments is unspecified, but there is a sequence point
before the actual call.
That means for i = 1, the values that are passed to printf could be:
1, 2, 3//left to right
But equally valid would be:
1, 0, 1//evaluated i-- first
//or
1, 2, 1//evaluated i-- second
What you can be sure of is that the new value of i after this call will be 2.
But all of the values listed above are, theoretically, equally valid, and 100% standard compliant.
But the appendix on undefined behaviour explicitly lists this as being code that invokes undefined behaviour, too:
Between two sequence points, an object is modified more than once, or is modified
and the prior value is read other than to determine the value to be stored (6.5).
In theory, your program could crash; instead of printing 1, 2, and 3, the output "666, 666 and 666" would be possible, too.
So finally I found it.
It is because the arguments are passed after they are evaluated, so passing arguments is a completely different story from evaluating them. C compilers, traditionally built to maximize speed and optimization, can evaluate the argument expressions in any order.
So argument passing and argument evaluation are different stories altogether.
Since the C standard does not specify any order for evaluating arguments, every compiler implementation is free to adopt one. That's one reason why coding something like foo(i++, i) is complete insanity: you may get different results when switching compilers.
One other important thing which has not been highlighted here - if your favorite ARM compiler evaluates parameters left to right, it will do so for all cases and for all subsequent versions. Reading order of parameters for a compiler is merely a convention...
Consider the function call (calling int sum(int, int))
printf("%d", sum(a,b));
How does the compiler decide that the , used in the function call sum(int, int) is not a comma operator?
NOTE: I didn't want to actually use the comma operator in the function call. I just wanted to know how the compiler knows that it is not a comma operator.
Look at the grammar for the C language. It's listed, in full, in Appendix A of the standard. The way it works is that you can step through each token in a C program and match them up with the next item in the grammar. At each step you have only a limited number of options, so the interpretation of any given character will depend on the context in which it appears. Inside each rule in the grammar, each line gives a valid alternative for the program to match.
Specifically, if you look for parameter-list, you will see that it contains an explicit comma. Therefore, whenever the compiler's C parser is in "parameter-list" mode, commas that it finds will be understood as parameter separators, not as comma operators. The same is true for brackets (that can also occur in expressions).
This works because the parameter-list rule is careful to use assignment-expression rules, rather than just the plain expression rule. An expression can contain commas, whereas an assignment-expression cannot. If this were not the case the grammar would be ambiguous, and the compiler would not know what to do when it encountered a comma inside a parameter list.
However, an opening bracket, for example, that is not part of a function definition/call, or an if, while, or for statement, will be interpreted as part of an expression (because there's no other option, but only if the start of an expression is a valid choice at that point), and then, inside the brackets, the expression syntax rules will apply, and that allows comma operators.
From C99 6.5.17:
As indicated by the syntax, the comma operator (as described in this subclause) cannot
appear in contexts where a comma is used to separate items in a list (such as arguments to functions or lists
of initializers). On the other hand, it can be used within a parenthesized expression or within the second
expression of a conditional operator in such contexts. In the function call
f(a, (t=3, t+2), c)
the function has three arguments, the second of which has the value 5.
Another similar example is the initializer list of arrays or structs:
int array[5] = {1, 2};
struct Foo bar = {1, 2};
If a comma operator were to be used as the function parameter, use it like this:
sum((a,b))
This won't compile, of course.
The reason is the C Grammar. While everyone else seems to like to cite the example, the real deal is the phrase structure grammar for function calls in the Standard (C99). Yes, a function call consists of the () operator applied to a postfix expression (like for example an identifier):
6.5.2 postfix-expression:
...
postfix-expression ( argument-expression-list_opt )
together with
argument-expression-list:
assignment-expression
argument-expression-list , assignment-expression <-- arglist comma
expression:
assignment-expression
expression , assignment-expression <-- comma operator
The comma operator can only occur in an expression, i.e. further down in the grammar. So the compiler treats a comma in a function argument list as the one separating assignment-expressions, not as one separating expressions.
Existing answers say "because the C language spec says it's a list separator, and not an operator".
However, your question is asking "how does the compiler know...", and that's altogether different: It's really no different from how the compiler knows that the comma in printf("Hello, world\n"); isn't a comma operator: The compiler 'knows' because of the context where the comma appears - basically, what's gone before.
The C 'language' can be described in Backus-Naur Form (BNF): essentially, a set of rules that the compiler's parser uses to scan your input file. The BNF for C will distinguish between these different possible occurrences of commas in the language.
There are lots of good resources on how compilers work, and how to write one.
The draft C99 standard says:
As indicated by the syntax, the comma operator (as described in this subclause) cannot
appear in contexts where a comma is used to separate items in a list (such as arguments to functions or lists of initializers). On the other hand, it can be used within a parenthesized expression or within the second expression of a conditional operator in such contexts. In the function call f(a, (t=3, t+2), c) the function has three arguments, the second of which has the value 5.
In other words, "because".
There are multiple facets to this question. One part is that the definition says so. Well, how does the compiler know what context this comma is in? That's the parser's job. For C in particular, the language can be parsed by an LR(1) parser (http://en.wikipedia.org/wiki/Canonical_LR_parser).
The way this works is that the parser generates a bunch of tables that make up the possible states of the parser. Only a certain set of symbols are valid in certain states, and the symbols may have different meaning in different states. The parser knows that it is parsing a function because of the preceding symbols. Thus, it knows the possible states do not include the comma operator.
I am being very general here, but you can read all about the details in the Wiki.
Having been writing Java code for many years, I was amazed when I saw this C++ statement:
int a,b;
int c = (a=1, b=a+2, b*3);
My question is: Is this a choice of coding style, or does it have a real benefit? (I am looking for a practical use case.)
I think the compiler will see it the same as the following:
int a=1, b=a+2;
int c = b*3;
(What's the official name for this? I assume it's a standard C/C++ syntax.)
It's the comma operator, used twice. You are correct about the result, and I don't see much point in using it that way.
Looks like an obscure use of a , (comma) operator.
It's not a representative way of doing things in C++.
The only "good-style" use for the comma operator might be in a for statement that has multiple loop variables, used something like this:
// Copy from source buffer to destination buffer until we see a zero
for (char *src = source, *dst = dest; *src != 0; ++src, ++dst) {
*dst = *src;
}
I put "good-style" in scare quotes because there is almost always a better way than to use the comma operator.
Another context where I've seen this used is with the ternary operator, when you want to have multiple side effects, e.g.,
bool didStuff = DoWeNeedToDoStuff() ? (Foo(), Bar(), Baz(), true) : false;
Again, there are better ways to express this kind of thing. These idioms are holdovers from the days when we could only see 24 lines of text on our monitors, and squeezing a lot of stuff into each line had some practical importance.
Dunno its name, but it seems to be missing from the Job Security Coding Guidelines!
Seriously: C++ allows you to do a lot of things in many contexts, even when they are not necessarily sound. With great power comes great responsibility...
This is called 'obfuscated C'. It is legal, but intended to confuse the reader. And it seems to have worked. Unless you're trying to be obscure it's best avoided.
Hotei
Your sample code uses two features of C expressions that are not well known to beginners (though not really hidden either):
the comma operator: a normal binary operator whose role is to return the last of its two operands. If the operands are expressions, they are evaluated from left to right.
assignment as an operator that returns a value. C assignment is not a statement as in other languages, and returns the value that has been assigned.
Most use cases of both these features involve some form of obfuscation, but there are some legitimate ones. The point is that you can use them anywhere you can provide an expression: inside an if or a while condition, in a for loop iteration clause, in function call arguments (if using the comma you must use parentheses to avoid confusion with actual function arguments), in macro arguments, etc.
The most usual use of the comma is probably in loop control, when you want to change two variables at once, or store some value before performing the loop test or the loop iteration.
For example a reverse function can be written as below, thanks to comma operator:
void reverse(int * d, int len){
int i, j;
for (i = 0, j = len - 1 ; i < j ; i++, j--){
SWAP(d[i], d[j]);
}
}
Another legitimate (not obfuscated, really) use of the comma operator I have in mind is a DEBUG macro I found in some project, defined as:
#if defined(DEBUGMODE)
#define DEBUG(x) printf x
#else
#define DEBUG(x) x
#endif
You use it like:
DEBUG(("my debug message with some value=%d\n", d));
If DEBUGMODE is on then you'll get a printf; if not, printf will not be called, but the expression between parentheses is still valid C (a comma expression). The point is that any side effect of the printing code will apply in both release code and debug code, like those introduced by:
DEBUG(("my debug message with some value=%d\n", d++));
With the above macro d will always be incremented regardless of debug or release mode.
There are probably some other rare cases where the comma operator and assignment values are useful and the code is easier to write when you use them.
I agree that assignment operator is a great source of errors because it can easily be confused with == in a conditional.
I agree that as comma is also used with a different meaning in other contexts (function calls, initialisation lists, declaration lists) it was not a very good choice for an operator. But basically it's not worse than using < and > for template parameters in C++ and it exists in C from much older days.
It's strictly coding style and won't make any difference in your program. Especially since any decent C++ compiler will optimize it to
int a=1;
int b=3;
int c=9;
The math won't even be performed during assignment at runtime. (and some of the variables may even be eliminated entirely).
As to choice of coding style, I prefer the second example. Most of the time, less nesting is better, and you won't need the extra parentheses. Since the use of commas exhibited will be known to virtually all C++ programmers, you have some choice of style. Otherwise, I would say put each assignment on its own line.
Is this a choice of coding style, or does it have a real benefit? (I am looking for a practical use case.)
It's both a choice of coding style and it has a real benefit.
It's clearly a different coding style as compared to your equivalent example.
The benefit is that I already know I would never want to employ the person who wrote it, not as a programmer anyway.
A use case: Bob comes to me with a piece of code containing that line. I have him transferred to marketing.
You have found a hideous abuse of the comma operator written by a programmer who probably wishes that C++ had multiple assignment. It doesn't. I'm reminded of the old saw that you can write FORTRAN in any language. Evidently you can try to write Dijkstra's language of guarded commands in C++.
To answer your question, it is purely a matter of (bad) style, and the compiler doesn't care—the compiler will generate exactly the same code as from something a C++ programmer would consider sane and sensible.
You can see this for yourself if you make two little example functions and compile both with the -S option.