C standard addressing simplification inconsistency

Section §6.5.3.2 "Address and indirection operators" ¶3 says (relevant section only):
The unary & operator returns the address of its operand. ...
If the operand is the result of a unary * operator, neither that operator nor the & operator is evaluated and the result is as if both were omitted, except that the constraints on the operators still apply and the result is not an lvalue. Similarly, if the operand is the result of a [] operator, neither the & operator nor the unary * that is implied by the [] is evaluated and the result is as if the & operator were removed and the [] operator were changed to a + operator. ...
This means that this:
#define NUM 10
int tmp[NUM];
int *i = tmp;
printf("%ti\n", (ptrdiff_t) (&*i - i) );
printf("%ti\n", (ptrdiff_t) (&i[NUM] - i) );
Should be perfectly legal, printing 0 and NUM (10). The standard seems very clear that both of those cases are required to be optimized.
However, it doesn't seem to require the following to be optimized:
struct { int a; short b; } tmp, *s = &tmp;
printf("%ti\n", (ptrdiff_t) ((char *) &s->b - (char *) s) );
This seems awfully inconsistent. I can see no reason that the above code shouldn't print sizeof(int) plus any (unlikely) padding (possibly 4).
Simplifying a &-> expression is going to be the same conceptually (IMHO) as &[], a simple address-plus-offset. It's even an offset that's going to be determinable at compile time, rather than potentially runtime with the [] operator.
Is there anything in the rationale about why this is so seemingly inconsistent?

In your example, &i[NUM] is legal only because i points to the first element of tmp: it becomes i + 10, the one-past-the-end address of the array. Had i instead been a null pointer, it would become NULL + 10, and you can't perform arithmetic on a null pointer. (6.5.6/8 lists the conditions under which pointer arithmetic can be performed.)
Anyway, this rule was added in C99; it was not present in C89. My understanding is that it was added in large part to make code like the following well-defined:
int* begin, * end;
int v[10];
begin = &v[0];
end = &v[10];
That last line is technically invalid in C89 (and in C++) but is allowed in C99 because of this rule. It was a relatively minor change that made a commonly used construct well-defined.
Because you can't perform arithmetic on a null pointer, your example (&s->b) would likewise be invalid if s were a null pointer.
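As an aside (my addition, not from the original answer): the well-defined way to obtain the member offset the question is computing is offsetof; the struct tag S below is illustrative:

#include <stdio.h>
#include <stddef.h>   /* offsetof, size_t */

struct S { int a; short b; };

int main(void)
{
    /* The portable way to get a member's offset: */
    printf("%zu\n", offsetof(struct S, b));          /* typically 4 */

    /* The classic null-pointer trick; implementations may define offsetof
       this way internally, but in user code the standard does not define it: */
    printf("%zu\n", (size_t) &((struct S *) 0)->b);
    return 0;
}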
As for why there is this "inconsistency," I can only guess. It's likely that no one thought to make it consistent or no one saw a compelling use case for this. It's possible that this was considered and ultimately rejected. There are no remarks about the &* reduction in the Rationale. You might be able to find some definitive information in the WG14 papers, but unfortunately they seem to be quite poorly organized, so trawling through them may be tedious.

I think that the rule wasn't added for optimization purposes (what does it bring that the as-if rule doesn't?) but to allow &t[sizeof(t)/sizeof(*t)] and &*(t+sizeof(t)/sizeof(*t)), which would be undefined behaviour without it (writing such things directly may seem silly, but add a layer or two of macros and it can make sense). I don't see a case where special-casing &p->m would bring such a benefit. Note that, as James pointed out, &p[10] with p a null pointer is still undefined behaviour; &p->m with p a null pointer would similarly have stayed invalid (and I must admit I don't see any use for it when p is the null pointer).
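A sketch of the macro layering meant here (the names COUNT and END are mine, not from the original):

#define COUNT(t) (sizeof(t) / sizeof(*(t)))
#define END(t)   (&(t)[COUNT(t)])   /* one past the end: relies on the C99 &[] rule */

int buf[10];
int *end = END(buf);                /* same value as buf + 10 */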

I believe that the compiler can choose to pack structs in different ways, possibly adding padding between members to increase memory access speed. This means that you can't say for certain that b will always be at an offset of 4; the single-value case (&*i) does not have that problem, since no offset is involved.
Also, the compiler may not know the layout of a struct in memory during the optimization phase, thus preventing any sort of optimization concerning struct member accesses and subsequent pointer casts.
Edit:
I have another theory...
Many times the compiler will optimize the abstract syntax tree just after lexical analysis and parsing. This means it will find things like operators that cancel out and expressions that evaluate to a constant, and reduce those sections of the tree to one node. This also means that information about structs is not available then. Later optimization passes that occur after some code generation may be able to take this into account because they have additional information, but for things like trimming the AST, that information is not yet there.


Array declared with negative buffer works if done with a variable

I'm new to C programming and I'm struggling really hard to understand why this works
#include <stdio.h>
int main() {
    int l = -5;
    char arr[l];
    printf("content: %s; sizeof: %d\n", arr, sizeof(arr));
}
with output:
content: ; sizeof: -5
and this doesn't:
#include <stdio.h>
int main() {
    char arr[-5];
    printf("content: %s; sizeof: %d\n", arr, sizeof(arr));
}
with output:
name.c:6:10: error: size of array ‘arr’ is negative
6 | char arr[-5];
| ^~~
I was expecting an error also from the first example, but I really don't know what's happening here.
Neither version of the program conforms to the C language specification (even after the format specifiers are corrected to properly match the size_t argument). But the two cases are semantically different, and they violate different provisions of the language specification.
Taking this one first:
char arr[-5];
The expression -5 is an integer constant expression, so this is a declaration of an ordinary array (not a variable-length array). It is subject to paragraph 6.7.6.2/1 of the C17 language spec, which says, in part:
In addition to optional type qualifiers and the keyword static, the [
and ] may delimit an expression or *. If they delimit an expression
(which specifies the size of an array), the expression shall have an
integer type. If the expression is a constant expression, it shall
have a value greater than zero.
(Emphasis added.)
That is part of a language constraint, which means that the compiler is obligated to emit a diagnostic message when it observes a violation. In principle, implementations are not required to reject code containing constraint violations, but if they accept such code then the language does not define the results.
On the other hand, consider
int l = -5;
char arr[l];
Because l is not a constant expression (and would not be even if l were declared const), the provision discussed above does not apply, and, separately, arr is a variable-length array. This is subject to paragraph 6.7.6.2/5 of the spec, the relevant part requiring of the size expression that:
each time it is evaluated it shall have a value greater than zero
The program violates that provision, but it is a semantic rule, not a language constraint, so the compiler is not obligated to diagnose it, much less to reject the code. In the general case, the compiler cannot recognize or diagnose violations of this particular rule, though in principle, it could do so in this particular case. If it accepts the code then the runtime behavior is undefined.
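As a minimal sketch (my addition, not part of the original program), the VLA case could be guarded at run time, since the size rule is a run-time requirement:

#include <stdio.h>

int main(void)
{
    int l = -5;
    if (l <= 0) {            /* a VLA size must be > 0 each time it is evaluated */
        fprintf(stderr, "invalid array size: %d\n", l);
        return 1;
    }
    char arr[l];             /* now well-defined: l > 0 */
    printf("sizeof: %zu\n", sizeof arr);   /* %zu correctly matches size_t */
    return 0;
}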
Why this program emits -5 when you compile and run it with your particular C implementation on your particular hardware is not established by C. It might be specified by your implementation, or it might not. Small variations in the program or different versions of your C implementation might produce different results.
Overall, this is yet another example of C refusing to hold your hand. New coders and those used to interpreted languages and virtual machines seem often to have the expectation that some component of the system will inform them when they have written bad code. Sometimes it does, but other times it just does something with that bad code that might or might not resemble what the programmer had in mind. Effective programming in C requires attention to detail.

What does this line mean in C99?

static int* p = (int*)(&foo);
I just know p points to a memory in the code segment.
But I don't know what exactly happens in this line.
I thought maybe it's a pointer to a function, but the format for a pointer to a function is:
returnType (*pointerName) (params,...);
pointerName = &someFunc; // or pointerName=someFunc;
You take the address of foo and cast it to pointer to int.
If foo and p are of incompatible types, the compiler might issue a warning about the type mismatch. The cast is there to suppress that warning.
For example, consider the following code, which causes a warning from the compiler (initialization from incompatible pointer type):
float foo = 42;
int *p = &foo;
Here foo is a float, while p points to an int. Clearly - different types.
A typecast makes the compiler treat one variable as if it were of a different type. You typecast by putting the new type name in parentheses. Here we will make the pointer to float be treated like a pointer to int, and the warning will be no more:
float foo = 5;
int *p = (int*)(&foo);
You could've omitted one pair of parentheses as well and it'd mean the same:
float foo = 5;
int *p = (int*)&foo;
The issue is the same if foo is a function: we have a pointer to a function on the right side of the assignment and a pointer to int on the left side, so a cast is added to make the pointer to the function be treated as an address of an int.
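For illustration, a sketch of the function case (foo here is a hypothetical function; the conversion is not portable, and dereferencing p is not defined by the standard):

void foo(void) { }

/* Take foo's address and force it into an int *. The cast silences the
   incompatible-pointer diagnostic; many compilers accept the conversion as
   an extension, but the standard does not define what p can be used for. */
static int *p = (int *) &foo;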
A pointer of a type which points to an object (i.e. not void* and not a
pointer to a function) cannot be stored to a pointer to any other kind of
object without a cast, except in a few cases where the types are identical
except for qualifiers. Conforming compilers are required to issue a
diagnostic if that rule is broken.
Beyond that, the Standard allows compilers to interpret code that casts
pointers in nonsensical fashion unless code abides by some restrictions
which, as written, make such casts essentially useless, for the nominal
purpose of promoting optimization. When the rules were written, most
compilers would probably do about half of the optimizations that would
be allowed under the rules, but would still process pointer casts sensibly
since doing so would cost maybe 5% of the theoretically-possible
optimizations. Today, however, it is more fashionable for compiler writers
to seek out all cases where an optimization would be allowed by the
Standard without regard for whether they make sense.
Compilers like gcc have an option -fno-strict-aliasing that blocks this
kind of optimization, both in cases where it would offer big benefits and
little risk, as well as in the cases where it would almost certainly break
code and be unlikely to offer any real benefit. It would be helpful if they
had an option to block only the latter, but I'm unaware of one. Thus, unless
one wants to program in a very limited subset of Dennis Ritchie's language,
I'd suggest targeting the -fno-strict-aliasing dialect.
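To make the stakes concrete, a sketch (mine, not the answerer's): the cast-based pun below is exactly the kind of code strict aliasing lets optimizers break, while the memcpy version stays well-defined:

#include <stdint.h>
#include <string.h>

/* Breaks the aliasing rules: reads a float object through an lvalue of
   incompatible type. Under -fstrict-aliasing the optimizer may assume the
   two pointers never alias. */
uint32_t bits_by_cast(float f)
{
    return *(uint32_t *) &f;
}

/* Well-defined alternative: copy the object representation. */
uint32_t bits_by_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* assumes sizeof(float) == sizeof(uint32_t) */
    return u;
}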

The spiral rule about declarations — when is it in error?

I recently learned the spiral rule for deobfuscating complex declarations that should have been written as a series of typedefs. However, the following comment alarms me:
A frequently cited simplification, which only works for a few simple cases.
I do not find void (*signal(int, void (*fp)(int)))(int); a "simple case". Which is all the more alarming, by the way.
So, my question is, in which situations will I be correct to apply the rule, and in which it would be in error?
Basically speaking, the rule simply doesn't work, or else it works by redefining what is meant by "spiral" (in which case, there's no point in it). Consider, for example:
int* a[10][15];
The spiral rule would give: a is an array[10] of pointers to arrays[15] of int, which is wrong; it is actually an array[10] of arrays[15] of pointers to int. In the case you cite, it doesn't work either; in fact, in the case of signal, it's not even clear where you should start the spiral.
In general, it's easier to find examples of where the rule fails
than examples where it works.
I'm often tempted to say that parsing a C++ declaration is
simple, but nobody who has tried with complicated declarations
would believe me. On the other hand, it's not as hard as it is
sometimes made out to be. The secret is to think of the
declaration exactly as you would an expression, but with far
fewer operators, and a very simple precedence rule: all operators
to the right have precedence over all operators to the left. In
the absence of parentheses, this means process everything to the
right first, then everything to the left, and process
parentheses exactly as you would in any other expression. The
actual difficulty is not the syntax per se, but that it
results in some very complex and counterintuitive declarations,
in particular where function return values and pointers to
functions are involved: the first right, then left rule means
that operators at a particular level are often widely separated,
e.g.:
int (*f( /* lots of parameters */ ))[10];
The final term in the expansion here is int[10], but putting
the [10] after the complete function specification is (at
least to me) very unnatural, and I have to stop and work it out
each time. (It's probably this tendency for logically adjacent
parts to spread out that led to the spiral rule. The problem
is, of course, that in the absence of parentheses, they don't
always spread out—anytime you see [i][j], the rule is go
right, then go right again, rather than spiral.)
And since we're now thinking of declarations in terms of
expressions: what do you do when an expression becomes too
complicated to read? You introduce intermediate variables in order
to make it easier to read. In the case of declarations, the
"intermediate variables" are typedef. In particular, I would
argue that any time part of the return type ends up after the
function arguments (and a lot of other times as well), you
should use a typedef to make the declaration simpler. (This
is a "do as I say, not as I do" rule, however. I'm afraid that
I'll occasionally use some very complex declarations.)
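For example, a sketch of that advice applied to the earlier declaration (parameter list reduced to void for brevity; the name Row is mine):

typedef int Row[10];   /* Row: array of 10 int */
Row *f(void);          /* same as: int (*f(void))[10]; */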
The rule is correct. However, one should be very careful in applying it.
I suggest applying it in a more formal way to C99+ declarations.
The most important thing here is to recognize the following recursive structure of all declarations (const, volatile, static, extern, inline, struct, union, typedef are removed from the picture for simplicity but can be added back easily):
base-type [derived-part1: *'s] [object] [derived-part2: []'s or ()]
Yep, that's it, four parts.
where
base-type is one of the following (I'm using a somewhat compressed notation):
void
[signed/unsigned] char
[signed/unsigned] short [int]
[signed/unsigned] int
[signed/unsigned] long [long] [int]
float
[long] double
etc.
object is
an identifier
OR
([derived-part1: *'s] [object] [derived-part2: []'s or ()])
* is *, denotes a pointer and can be repeated
[] in derived-part2 denotes bracketed array dimensions and can be repeated
() in derived-part2 denotes parenthesized function parameters delimited with ,'s
[] elsewhere denotes an optional part
() elsewhere denotes parentheses
Once you've got all 4 parts parsed,
[object] is [derived-part2 (containing/returning)] [derived-part1 (pointer to)] base-type.
If there's recursion, you find your object (if there's any) at the bottom of the recursion stack; it'll be the innermost one, and you'll get the full declaration by going back up, collecting and combining derived parts at each level of recursion.
While parsing you may move [object] to after [derived-part2] (if any). This will give you a linearized, easy-to-understand declaration (the form shown above).
Thus, in
char* (**(*foo[3][5])(void))[7][9];
you get:
base-type = char
level 1: derived-part1 = *, object = (**(*foo[3][5])(void)), derived-part2 = [7][9]
level 2: derived-part1 = **, object = (*foo[3][5]), derived-part2 = (void)
level 3: derived-part1 = *, object = foo, derived-part2 = [3][5]
From there:
level 3: * [3][5] foo
level 2: ** (void) * [3][5] foo
level 1: * [7][9] ** (void) * [3][5] foo
finally, char * [7][9] ** (void) * [3][5] foo
Now, reading right to left:
foo is an array of 3 arrays of 5 pointers to a function (taking no params) returning a pointer to a pointer to an array of 7 arrays of 9 pointers to a char.
You could reverse the array dimensions in every derived-part2 in the process as well.
That's your spiral rule.
And it's easy to see the spiral. You dive into the ever more deeply nested [object] from the left and then resurface on the right only to note that on the upper level there's another pair of left and right and so on.
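One way to verify such a reading (a sketch; the typedef names are mine) is to rebuild the type with typedefs and let the compiler confirm that the two spellings agree:

typedef char *PChar;        /* pointer to char */
typedef PChar A79[7][9];    /* array of 7 arrays of 9 pointers to char */
typedef A79 *PA;            /* pointer to that array */
typedef PA *PPA;            /* pointer to pointer to that array */
typedef PPA Fn(void);       /* function (no params) returning the above */

char *(**(*foo[3][5])(void))[7][9];
Fn *(*check)[5] = foo;      /* compiles cleanly: foo really is an array of
                               3 arrays of 5 pointers to such functions */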
The spiral rule is actually an over-complicated way of looking at it. The actual rule is much simpler:
postfix is higher precedence than prefix.
That's it. That's all you need to remember. The 'complex' cases are when you have parentheses to override that postfix-higher-than-prefix precedence, but you really just need to find the matching parenthesis, then look at the things inside the parens first, and, if that is not complete, pull in the next level outside the parentheses, postfix first.
So looking at your complex example
void (*signal(int, void (*fp)(int)))(int);
we can start at any name and figure out what that name is. If you start at int, you're done -- int is a type and you can understand it by itself.
If you start at fp, fp is not a type, it's a name being declared as something. So look at the first set of parens enclosing it:
(*fp)
There's no suffix (deal with postfix first), and the prefix * means pointer. Pointer to what? Not complete yet, so look out another level:
void (*fp)(int)
The suffix first is "function taking an int param", then the prefix is "returning void". So we have: fp is a "pointer to function taking an int param, returning void".
If we start at signal, the first level has a suffix (function) and a prefix (returning pointer). We need the next level out to see what that points to (function returning void). So we end up with: signal is a "function with two params (an int and a pointer to function), returning a pointer to a function with one (int) param, returning void".
E.g.:
int * a[][5];
This is an array of arrays of 5 pointers to int, not an array of pointers to arrays of int.
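And coming back to signal, a typedef collapses the whole declaration to something readable (Handler is my name, not from the original):

typedef void Handler(int);              /* function taking int, returning void */
Handler *signal(int sig, Handler *fp);  /* equivalent to:
                                           void (*signal(int, void (*fp)(int)))(int); */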

Why *foo++ = *++foo may be undefined?

It should set the current character to the next character. For example:
while (*foo) {
    if (baa(*foo)) *foo++ = *++foo;
    foo++;
}
But I get the following errors:
error: operation on ‘foo’ may be undefined [-Werror=sequence-point]
cc1: all warnings being treated as errors
Can anyone explain why? Isn't that valid C syntax?
You're incrementing foo on both sides of the assignment, with no sequence points in between. That's not allowed; you can only modify an object once between sequence points.
Let's take a closer look at this expression:
*foo++ = *++foo
*foo++ evaluates as *(foo++) (postfix ++ has higher precedence than unary *); you take the current value of foo, dereference it, and advance foo as a side effect. *++foo evaluates as *(++foo) (prefix unary operators bind right-to-left, so the ++ is applied first); foo is advanced again as a side effect, and you dereference its new value, foo + 1. Then you assign the result of the second expression to the object designated by the first.
The problem is that the exact order in which all of those side effects are applied (assignment, postincrement, and preincrement) is unspecified; the compiler is free to order those operations however it sees fit. Because of this, expressions of the form *x++ = *++x can give different results with different compilers, with the same compiler under different settings, or even depending on the surrounding code.
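If the intent is "copy the next character over the current one, then advance", one well-defined rewrite (a sketch of one possible reading of the original loop) is:

while (*foo) {
    if (baa(*foo)) {
        *foo = *(foo + 1);   /* read next, write current: foo itself is untouched */
    }
    foo++;                   /* the only modification of foo, after a sequence point */
}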
The language standard explicitly calls this out as undefined behavior so that compiler implementors are free to handle the situation any way they see fit, with no requirement to try to do the "right thing" (whatever the "right thing" may be). GCC obviously issues a diagnostic in this case, but it doesn't have to. For one thing, not all cases are as easy to detect as this one. Imagine a function like
void bar (int *a, int *b)
{
    *a++ = *++b;
}
Is this a problem? Only if a and b point to the same thing, but if the caller is in a separate translation unit, there's no way to know that at compile time.

Can parentheses in C cause implicit cast?

Background
The last time I asked about whether parentheses were causing an implicit cast (here), pmg was nice enough to point out that "Nothing in C is done below int". But there, the discussion was about bitwise operators, and the parentheses turned out to be just a distraction.
Introduction
Below, the parentheses are the main attraction. Or, to be more boring but precise, the only operators I see are the parentheses and assignment operators.
At this reference about the C parentheses operator, I do not see anything about parentheses changing the type (outside of typecast syntax, which is not this case).
Meanwhile, here's a reference that reminds that there is automatic type conversion on assignment, but I don't think that will explain the static analysis tool behavior I will describe here.
As in my previous question, "OK" means that the static analysis tool did not warn about an implicit type conversion, and "NOT OK" means that it did.
int main(void)
{
    unsigned int ui;
    int i;
    ui = (256U); // NOT OK (*) (1)
    i = (256U);  // NOT OK (*) (2)
    i = 256;     // OK
    i = 256U;    // NOT OK
    ui = 256U;   // OK (3)
    ui = 256;    // NOT OK
    return(0);
}
I can understand them all except the first two - what do the parentheses do? If they do nothing in the way of implicit typecasting, then I would expect (1) to be OK and (2) to be NOT OK. If they do automatic type promotion of types smaller than int up to int, then I would expect (1) to be NOT OK and (2) to be OK. But this tool says that both are NOT OK.
Is this a static analysis tool error, or is the tool correct and there's something else I need to learn about implicit type conversions in C?
(BTW I hope that the value 256 is small enough not to cause overflow on my machine ...)
First, let's clear up some terminology. Nothing can cause an "implicit cast", because there is no such thing. A cast is an explicit operator, consisting of a type name in parentheses preceding an expression, such as (double)42; it specifies a conversion. Conversions can be either explicit (specified by a cast operator) or implicit, as in double x = 42;. So what you're really asking is whether parentheses can cause an implicit conversion.
And the answer, at least in the code you've shown us, is no.
Quoting the C99 standard (3.7 MB PDF), section 6.5.1p5:
A parenthesized expression is a primary expression. Its type and value
are identical to those of the unparenthesized expression. It is an
lvalue, a function designator, or a void expression if the
unparenthesized expression is, respectively, an lvalue, a function
designator, or a void expression.
And since 256U is already a primary expression, the parentheses make no difference at all; parentheses generally indicate precedence, but in this case there is no precedence to indicate.
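You can check this directly with C11's _Generic (a sketch added for illustration; the macro name is mine):

#include <stdio.h>

#define TYPE_NAME(x) _Generic((x), \
    unsigned int: "unsigned int",  \
    int:          "int",           \
    default:      "something else")

int main(void)
{
    printf("256U   -> %s\n", TYPE_NAME(256U));    /* unsigned int */
    printf("(256U) -> %s\n", TYPE_NAME((256U)));  /* unsigned int: parens change nothing */
    return 0;
}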
What static analysis tool are you using? You should probably submit a bug report.
The tool is confused somehow. There's no casting here. Those parentheses just indicate precedence.
