Just curious, what actually happens if I define a zero-length array int array[0]; in code? GCC doesn't complain at all.
Sample Program
#include <stdio.h>
int main() {
    int arr[0];
    return 0;
}
Clarification
I'm actually trying to figure out whether zero-length arrays declared this way, rather than being pointed at like the variable-length case in Darhazer's comments, are optimised out or not.
This is because I have to release some code out into the wild, so I'm trying to figure out whether I have to handle cases where SIZE is defined as 0, which happens in some code with a statically defined int array[SIZE];
I was actually surprised that GCC does not complain, which led to my question. From the answers I've received, I believe the lack of a warning is largely due to supporting old code which has not been updated with the new [] syntax.
Because I was mainly wondering about the error, I am tagging Lundin's answer as correct (Nawaz's was first, but it wasn't as complete). The others pointed out the actual use for tail-padded structures, which, while relevant, isn't exactly what I was looking for.
An array cannot have zero size.
ISO 9899:2011 6.7.6.2:
If the expression is a constant expression, it shall have a value greater than zero.
The above text applies to a plain array (paragraph 1). For a VLA (variable length array), the behavior is undefined if the expression's value is less than or equal to zero (paragraph 5). This is normative text in the C standard; a compiler is not allowed to implement it differently.
gcc -std=c99 -pedantic gives a warning for the non-VLA case.
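To make the two paragraphs concrete, here is a minimal sketch of my own (not from the answer): with gcc -std=c99 -pedantic only the first declaration draws a diagnostic, while the second is accepted and simply becomes undefined at run time if n <= 0.
void demo(int n) {
    int fixed[0];  /* plain array: constant size must be > 0 (constraint violation; gcc -pedantic warns) */
    int vla[n];    /* VLA: if n <= 0 at run time, the behavior is undefined */
    (void)fixed;
    (void)vla;
}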
As per the standard, it is not allowed.
However, it has been common practice among C compilers to treat those declarations as a flexible array member (FAM) declaration:
C99 6.7.2.1, §16: As a special case, the last element of a structure with more than one named member may have an incomplete array type; this is called a flexible array member.
The standard syntax of a FAM is:
struct Array {
    size_t size;
    int content[];
};
The idea is that you would then allocate it so:
void foo(size_t x) {
    struct Array *array = malloc(sizeof(struct Array) + x * sizeof(int));
    array->size = x;
    for (size_t i = 0; i != x; ++i) {
        array->content[i] = 0;
    }
}
You might also use it statically (gcc extension):
struct Array a = { 3, { 1, 2, 3 } };
This is also known as tail-padded structures (this term predates the publication of the C99 Standard) or struct hack (thanks to Joe Wreschnig for pointing it out).
However, this syntax was standardized (and its effects guaranteed) only as recently as C99. Before that, a constant size was required:
1 was the portable way to go, though it was rather strange.
0 was better at indicating intent, but it was not legal as far as the Standard was concerned, though it was supported as an extension by some compilers (including gcc).
The tail padding practice, however, relies on the fact that storage is available (via a careful malloc), so it is not suited to stack usage in general.
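For illustration, here is a sketch of my own (not from the answer) of the pre-C99 variant with a trailing array of size 1; the names are made up:
#include <stdlib.h>

struct OldArray {
    size_t size;
    int content[1];   /* 1 was portable; 0 only worked as a compiler extension */
};

struct OldArray *make_old_array(size_t x) {
    /* one element is already counted in sizeof, so allocate x - 1 extra; assumes x >= 1 */
    struct OldArray *a = malloc(sizeof *a + (x - 1) * sizeof a->content[0]);
    if (a != NULL)
        a->size = x;
    return a;
}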
In Standard C and C++, a zero-size array is not allowed.
If you're using GCC, compile with the -pedantic option. It will give a warning, saying:
zero.c:3:6: warning: ISO C forbids zero-size array 'a' [-pedantic]
In the case of C++, it gives a similar warning.
It's totally illegal, and always has been, but a lot of compilers neglect to signal the error. I'm not sure why you want to do this. The one use I know of is to trigger a compile time error from a boolean:
char someCondition[ condition ];
If condition is false, then I get a compile time error. Because compilers do allow this, however, I've taken to using:
char someCondition[ 2 * condition - 1 ];
This gives a size of either 1 or -1, and I've never found a compiler which would accept a size of -1.
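As a concrete sketch of the trick (mine, with made-up names), used as a poor man's static assertion in pre-C11 code:
enum { BUFFER_SIZE = 16 };

/* Compiles only if the condition is true: a false condition gives size -1. */
char buffer_is_large_enough[2 * (BUFFER_SIZE >= 8) - 1];

int main(void) { return 0; }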
Another use of zero-length arrays is for making variable-length objects (pre-C99). Zero-length arrays are different from flexible array members, which use [] without the 0.
Quoted from gcc doc:
Zero-length arrays are allowed in GNU C. They are very useful as the last element of a structure that is really a header for a variable-length object:
struct line {
    int length;
    char contents[0];
};

struct line *thisline = (struct line *)
    malloc (sizeof (struct line) + this_length);
thisline->length = this_length;
In ISO C99, you would use a flexible array member, which is slightly different in syntax and semantics:
Flexible array members are written as contents[] without the 0.
Flexible array members have incomplete type, and so the sizeof operator may not be applied.
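A small sketch of my own (not from the gcc docs) of that last difference; the sizes in the comments assume a typical 4-byte int and no extra padding:
#include <stdio.h>

struct gnu_line { int length; char contents[0]; }; /* GNU zero-length array     */
struct c99_line { int length; char contents[];  }; /* C99 flexible array member */

int main(void) {
    printf("%zu\n", sizeof(struct gnu_line));  /* typically 4 */
    printf("%zu\n", sizeof(struct c99_line));  /* typically 4 */
    /* sizeof on the member itself is 0 for the GNU form, but a compile-time
       error for the C99 form, because the member has incomplete type. */
    return 0;
}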
A real-world example is zero-length arrays of struct kdbus_item in kdbus.h (a Linux kernel module).
I'll add that there is a whole page in the online gcc documentation dedicated to this topic.
Some quotes:
Zero-length arrays are allowed in GNU C.
In ISO C90, you would have to give contents a length of 1
and
GCC versions before 3.0 allowed zero-length arrays to be statically initialized, as if they were flexible arrays. In addition to those cases that were useful, it also allowed initializations in situations that would corrupt later data
so you could
int arr[0] = { 1 };
and boom :-)
Zero-size array declarations within structs would be useful if they were allowed, and if the semantics were such that (1) they would force alignment but otherwise not allocate any space, and (2) indexing the array would be considered defined behavior in the case where the resulting pointer would be within the same block of memory as the struct. Such behavior was never permitted by any C standard, but some older compilers allowed it before it became standard for compilers to allow incomplete array declarations with empty brackets.
The struct hack, as commonly implemented using an array of size 1, is dodgy and I don't think there's any requirement that compilers refrain from breaking it. For example, I would expect that if a compiler sees int a[1], it would be within its rights to regard a[i] as a[0]. If someone tries to work around the alignment issues of the struct hack via something like
typedef struct {
    uint32_t size;
    uint8_t data[4]; // Use four, to avoid having padding throw off the size of the struct
}
a compiler might get clever and assume the array size really is four:
; As written
foo = myStruct->data[i];
; As interpreted (assuming little-endian hardware)
foo = ((*(uint32_t*)myStruct->data) >> (i << 3)) & 0xFF;
Such an optimization might be reasonable, especially if myStruct->data could be loaded into a register in the same operation as myStruct->size. I know of nothing in the standard that would forbid such an optimization, though of course it would break any code which might expect to access stuff beyond the fourth element.
You definitely can't have zero-sized arrays according to the standard, but in practice most popular compilers let you do it anyway. So I will try to explain why it can be bad:
#include <cstdio>

int main() {
    struct A {
        A() {
            printf("A()\n");
        }
        ~A() {
            printf("~A()\n");
        }
        int empty[0];
    };
    A vals[3];
}
As a human, I would expect this output:
A()
A()
A()
~A()
~A()
~A()
Clang prints this:
A()
~A()
GCC prints this:
A()
A()
A()
This is totally strange, so it is a good reason not to use zero-length arrays in C++ if you can avoid them.
There is also an extension in GNU C which lets you create zero-length arrays in C, but as I understand it, there should be at least one other member in the structure before the array, or you will get very strange results like the ones above if you use C++.
I have been reading about the strict aliasing rule for a while, and I'm starting to get really confused. First of all, I have read these questions and some answers:
strict-aliasing-rule-and-char-pointers
when-is-char-safe-for-strict-pointer-aliasing
is-the-strict-aliasing-rule-really-a-two-way-street
According to them (as far as I understand), accessing a char buffer using a pointer to another type violates the strict aliasing rule. However, the glibc implementation of strlen() has such code (with comments and the 64-bit implementation removed):
size_t strlen(const char *str)
{
    const char *char_ptr;
    const unsigned long int *longword_ptr;
    unsigned long int longword, magic_bits, himagic, lomagic;

    for (char_ptr = str; ((unsigned long int) char_ptr
                          & (sizeof (longword) - 1)) != 0; ++char_ptr)
        if (*char_ptr == '\0')
            return char_ptr - str;

    longword_ptr = (unsigned long int *) char_ptr;

    himagic = 0x80808080L;
    lomagic = 0x01010101L;

    for (;;)
    {
        longword = *longword_ptr++;

        if (((longword - lomagic) & himagic) != 0)
        {
            const char *cp = (const char *) (longword_ptr - 1);

            if (cp[0] == 0)
                return cp - str;
            if (cp[1] == 0)
                return cp - str + 1;
            if (cp[2] == 0)
                return cp - str + 2;
            if (cp[3] == 0)
                return cp - str + 3;
        }
    }
}
The longword_ptr = (unsigned long int *) char_ptr; line obviously aliases an unsigned long int to char. I fail to understand what makes this possible. I see that the code takes care of alignment problems, so no issues there, but I think this is not related to the strict aliasing rule.
The accepted answer for the third linked question says:
However, there is a very common compiler extension allowing you to cast properly aligned pointers from char to other types and access them, however this is non-standard.
The only thing that comes to my mind is the -fno-strict-aliasing option; is this the case? I could not find documented anywhere what the glibc implementors depend on, and the comments somehow imply that this cast is done without any concern, as if it were obvious that there will be no problems. That makes me think that it is indeed obvious and I am missing something silly, but my search failed me.
In ISO C this code would violate the strict aliasing rule. (And also violate the rule that you cannot define a function with the same name as a standard library function). However this code is not subject to the rules of ISO C. The standard library doesn't even have to be implemented in a C-like language. The standard only specifies that the implementation implements the behaviour of the standard functions.
In this case, we could say that the implementation is in a C-like GNU dialect, and if the code is compiled with the writer's intended compiler and settings then it would implement the standard library function successfully.
When writing the aliasing rules, the authors of the Standard only considered forms that would be useful, and should thus be mandated, on all implementations. C implementations are targeted toward a variety of purposes, and the authors of the Standard make no attempt to specify what a compiler must do to be suitable for any particular purpose (e.g. low-level programming) or, for that matter, any purpose whatsoever.
Code like the above which relies upon low-level constructs should not be expected to run on compilers that make no claim of being suitable for low-level programming. On the flip side, any compiler which can't support such code should be viewed as unsuitable for low-level programming. Note that compilers can employ type-based aliasing assumptions and still be suitable for low-level programming if they make a reasonable effort to recognize common aliasing patterns. Some compiler writers are very highly invested in a view of code which fits neither common low-level coding patterns nor the C Standard, but anyone writing low-level code should simply recognize that those compilers' optimizers are unsuitable for use with low-level code.
The wording of the standard is actually a bit more weird than the actual compiler implementations: The C standard talks about declared object types, but the compilers only ever see pointers to these objects. As such, when a compiler sees a cast from a char* to an unsigned long*, it has to assume that the char* is actually aliasing an object with a declared type of unsigned long, making the cast correct.
A word of caution: I assume that strlen() is compiled into a library that is later only linked to the rest of the application. As such, the optimizer does not see the use of the function when compiling it, forcing it to assume that the cast to unsigned long* is indeed legit. If you called strlen() with
short myString[] = {0x666f, 0x6f00, 0};
size_t length = strlen((char*)myString); //implementation now invokes undefined behavior!
the cast within strlen() is undefined behavior, and your compiler would be allowed to strip pretty much the entire body of strlen() if it saw your use while compiling strlen() itself. The only thing that allows strlen() to behave as expected in this call is the fact, that strlen() is compiled separately as a library, hiding the undefined behavior from the optimizer, so the optimizer has to assume the cast to be legit when compiling strlen().
So, assuming that the optimizer cannot see the undefined behavior, the reason why casts from char* to anything else are dangerous is not aliasing, but alignment. On some hardware, weird stuff starts happening if you try to access a misaligned pointer. The hardware might load data from the wrong address, raise an interrupt, or just process the requested memory load extremely slowly. That is why the C standard generally declares such casts undefined behavior.
Nevertheless, you see that the code in question actually handles the alignment issue explicitly (the first loop that contains the (unsigned long int) char_ptr & (sizeof (longword) - 1) subcondition). After that, the char* is properly aligned to be reinterpreted as unsigned long*.
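As a small illustration of that alignment test (my sketch, not glibc code, and it assumes sizeof(unsigned long) is a power of two, as on typical platforms): a pointer is suitably aligned for unsigned long exactly when its low bits are zero.
#include <stdio.h>

static int aligned_for_long(const char *p) {
    return ((unsigned long int)p & (sizeof(unsigned long int) - 1)) == 0;
}

int main(void) {
    unsigned long int storage[2] = {0};
    const char *p = (const char *)storage;  /* aligned for unsigned long */
    printf("%d %d\n", aligned_for_long(p), aligned_for_long(p + 1));  /* prints "1 0" */
    return 0;
}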
Of course, all of this is not really compliant with the C standard, but it is compliant with the C implementation of the compiler that this code is meant to be compiled with. If the gcc people modified their compiler to act up on this bit of code, the glibc people would just complain about it loud enough so that the gcc will be changed back to handle this kind of cast correctly.
At the end of the day, standard C library implementations simply must violate strict aliasing rules to work properly and be efficient. strlen() just needs to violate those rules to be efficient, the malloc()/free() function pair must be able to take a memory region that had a declared type of Foo, and turn it into a memory region of declared type Bar. And there is no malloc() call inside the malloc() implementation that would give the object a declared type in the first place. The abstraction of the C language simply breaks down at this level.
The underlying assumption is probably that the function is separately compiled, and not available for inlining or other cross function optimizations. This means that no compile time information flows inside or outside the function.
The function doesn't try to modify anything through a pointer, so there is no conflict.
#include <stdio.h>
#include <stdlib.h>

int *func(int *);

int main(void)
{
    int i, size;
    const int *arr = func(&size);

    for (i = 0; i < size; i++)
    {
        printf("Enter a[%d] : ", i);
        scanf("%d", &arr[i]);
    }

    for (i = 0; i < size; i++)
    {
        printf("%d\t", arr[i]);
    }

    return 0;
}

int *func(int *psize)
{
    int *p;
    printf("Enter the size: ");
    scanf("%d", psize);
    p = (int *)malloc(*psize * sizeof(int));
    return p;
}
Enter the size: 3
Enter a[0] : 1
Enter a[1] : 2
Enter a[2] : 3
1 2 3
Here, in this code, I use the const keyword so that the data pointed to by the 'arr' pointer cannot be modified.
If I use the const keyword, then why does it still give me this output?
You just encountered one of the many cases where C allows you to shoot yourself in the foot - for good and bad.
Some general remarks about your assumptions: const is a guarantee you give to the compiler, so you have to make sure not to violate the contract. C does not have true constants [1]. Semantically, const is a qualifier which allows the compiler additional error-checking. But that requires the type of the argument to be known to the compiler. This is true for functions with a proper prototype, but (normally [2]) not for those with a variable number of arguments ("variadic functions"), as their types are not given at compile-time (and not explicitly available at run-time).
In
scanf("%d",&arr[i]);
You actually pass a problematic (see below) pointer type to scanf. The function itself does not check, but just expects the correct type. It cannot, because C does not provide the type of an object at run-time.
A modern compiler should warn about an argument type mismatch for printf and scanf. Always enable warnings (for gcc use at least -Wall -Wextra -Wconversion) and pay heed to them.
Edit: After heavy discussion, I have to change my mind. It seems not to be undefined behaviour [3] for the reason given initially: passing a const int * to scanf which expects an int *.
This is because the object malloc'ed in func has no effective type until the first write (6.5p6). That write occurs in scanf using an int *. Thus the object has type int - no const. However, your further accesses through a const int * are valid. 6.7.3p6 only makes the other direction undefined behaviour (for good reason).
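A small sketch of my own (not from the answer) of that effective-type reasoning:
#include <stdlib.h>

int main(void) {
    void *raw = malloc(sizeof(int));
    if (raw == NULL)
        return 1;

    int *writer = raw;
    *writer = 42;             /* first write: the object's effective type becomes int */

    const int *reader = raw;  /* reading the int object through a const int * is fine */
    int ok = (*reader == 42);

    free(raw);
    return ok ? 0 : 1;
}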
Getting through undetected is only possible for variadic functions, because there is no information about the expected type available in the function declaration. Consider something like:
void f(int *p)
{
    *p = 0;
}

int main(void)
{
    const int *p = ...;
    f(p);
}
Here the compiler will generate a warning. Variadic functions are one of the cases where C cannot check for qualifier-correctness (this includes e.g. volatile, too). There are more such cases and some are quite subtle.
The only case of undefined behaviour here is to pass an incompatible differently qualified pointer than expected (6.7.6.1p2).
Recommendation: Enable warnings, but do not rely on the compiler detecting all flaws (this is not only true for const-correctness). If you need more safety, C is not the right language. There are good reasons higher-level languages like Python, Java, etc. exist (C++ is somewhere in-between). On the other hand, the open ends in C allow things that are very hard to accomplish (if at all) in those languages where required. As always: know your tools.
Note: You should not cast the result of malloc & friends in C. And sizeof(char) is useless. It is defined by the standard to yield 1.
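For reference, a minimal sketch of my own of the allocation idiom implied by that note (the function name is made up):
#include <stdlib.h>

int *alloc_ints(size_t n) {
    /* no cast of malloc's result, and sizeof *p instead of sizeof(int) (or sizeof(char)) */
    int *p = malloc(n * sizeof *p);
    return p;
}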
[1] As violating the contract is undefined behaviour, the compiler is actually free to store such data in read-only memory. This is vital for microcontrollers which run code and read some data straight from ROM, for example.
[2] Modern compilers can parse the format string of the printf and scanf families for the argument types and warn about mismatches. That requires the string to be a string literal (not a variable), though. That is a courtesy of the compiler writers, as these functions are widely used.
[3] Basically, undefined behaviour means anything can happen - your computer might run away, nasal daemons may appear, or it might just work. But none of that is guaranteed to be reliable or deterministic. So the next time you run the program, something else might happen.
It is quite evident that you have turned off all compiler warnings.
Your program invokes undefined behaviour by assigning the result of a function returning int to a const int*. The compiler should have told you; maybe you ignored it, but from then on all bets are off.
You pass a const int* to scanf. Again, the compiler should have warned you; maybe you ignored it, but again all bets are off.
const int* doesn't make the object pointed to unmodifiable. It tells the compiler to not let you modify anything through that pointer, that's all. The storage area returned by malloc is never unmodifiable.
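A minimal sketch of my own (not from the answer) of that distinction between the object and the pointer's view of it:
#include <stdlib.h>

int main(void) {
    int *writable = malloc(sizeof *writable);
    if (writable == NULL)
        return 1;

    *writable = 42;            /* fine: malloc'ed storage is modifiable         */
    const int *ro = writable;  /* read-only view of the same object             */
    /* *ro = 7; */             /* error: the compiler rejects writes through ro */

    free(writable);
    return 0;
}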
I'm not sure I quite understand the extent to which undefined behavior can jeopardize a program.
Let's say I have this code:
#include <stdio.h>

int main()
{
    int v = 0;
    scanf("%d", &v);
    if (v != 0)
    {
        int *p;
        *p = v; // Oops
    }
    return v;
}
Is the behavior of this program undefined for only those cases in which v is nonzero, or is it undefined even if v is zero?
I'd say that the behavior is undefined only if the user enters a number different from 0. After all, if the offending code section is not actually run, the conditions for UB aren't met (i.e. the non-initialized pointer is neither created nor dereferenced).
A hint of this can be found into the standard, at 3.4.3:
behavior, upon use of a nonportable or erroneous program construct or of erroneous data,
for which this International Standard imposes no requirements
This seems to imply that, if such "erroneous data" was instead correct, the behavior would be perfectly defined - which seems pretty much applicable to our case.
Additional example: integer overflow. Any program that does an addition with user-provided data without doing extensive checks on it is subject to this kind of undefined behavior - but the addition is UB only when the user provides such particular data.
Since this has the language-lawyer tag, I have an extremely nitpicking argument that the program's behavior is undefined regardless of user input, but not for the reasons you might expect -- though it can be well-defined (when v==0) depending on the implementation.
The program defines main as
int main()
{
/* ... */
}
C99 5.1.2.2.1 says that the main function shall be defined either as
int main(void) { /* ... */ }
or as
int main(int argc, char *argv[]) { /* ... */ }
or equivalent; or in some other implementation-defined manner.
int main() is not equivalent to int main(void). The former, as a declaration, says that main takes a fixed but unspecified number and type of arguments; the latter says it takes no arguments. The difference is that a recursive call to main such as
main(42);
is a constraint violation if you use int main(void), but not if you use int main().
For example, these two programs:
int main() {
    if (0) main(42); /* not a constraint violation */
}

int main(void) {
    if (0) main(42); /* constraint violation, requires a diagnostic */
}
are not equivalent.
If the implementation documents that it accepts int main() as an extension, then this doesn't apply for that implementation.
This is an extremely nitpicking point (about which not everyone agrees), and is easily avoided by declaring int main(void) (which you should do anyway; all functions should have prototypes, not old-style declarations/definitions).
In practice, every compiler I've seen accepts int main() without complaint.
To answer the question that was intended:
Once that change is made, the program's behavior is well defined if v==0, and is undefined if v!=0. Yes, the definedness of the program's behavior depends on user input. There's nothing particularly unusual about that.
Let me give an argument for why I think this is still undefined.
First, the responders saying this is "mostly defined" or somesuch, based on their experience with some compilers, are just wrong. A small modification of your example will serve to illustrate:
#include <stdio.h>

int
main()
{
    int v;
    scanf("%d", &v);
    if (v != 0)
    {
        printf("Hello\n");
        int *p;
        *p = v; // Oops
    }
    return v;
}
What does this program do if you provide "1" as input? If your answer is "It prints Hello and then crashes", you are wrong. "Undefined behavior" does not mean the behavior of some specific statement is undefined; it means the behavior of the entire program is undefined. The compiler is allowed to assume that you do not engage in undefined behavior, so in this case, it may assume that v is zero and simply not emit any of the bracketed code at all, including the printf.
If you think this is unlikely, think again. GCC may not perform this analysis exactly, but it does perform very similar ones. My favorite example that actually illustrates the point for real:
int test(int x) { return x+1 > x; }
Try writing a little test program to print out INT_MAX, INT_MAX+1, and test(INT_MAX). (Be sure to enable optimization.) A typical implementation might show INT_MAX to be 2147483647, INT_MAX+1 to be -2147483648, and test(INT_MAX) to be 1.
In fact, GCC compiles this function to return a constant 1. Why? Because integer overflow is undefined behavior, therefore the compiler may assume you are not doing that, therefore x cannot equal INT_MAX, therefore x+1 is greater than x, therefore this function can return 1 unconditionally.
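Here is a sketch of my own of the little test program suggested above; build with optimization enabled (e.g. gcc -O2). The exact output depends on the compiler and target.
#include <stdio.h>
#include <limits.h>

int test(int x) { return x + 1 > x; }

int main(void) {
    printf("INT_MAX       = %d\n", INT_MAX);
    printf("INT_MAX + 1   = %d\n", INT_MAX + 1);    /* signed overflow: undefined behavior */
    printf("test(INT_MAX) = %d\n", test(INT_MAX));  /* a typical optimizer prints 1        */
    return 0;
}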
Undefined behavior can and does result in variables that are not equal to themselves, negative numbers that compare greater than positive numbers (see above example), and other bizarre behavior. The smarter the compiler, the more bizarre the behavior.
OK, I admit I cannot quote chapter and verse of the standard to answer the exact question you asked. But people who say "Yeah yeah, but in real life dereferencing NULL just gives a seg fault" are more wrong than they can possibly imagine, and they get more wrong with every compiler generation.
And in real life, if the code is dead you should remove it; if it is not dead, you must not invoke undefined behavior. So that is my answer to your question.
If v is 0, your random pointer assignment never gets executed, and the function will return zero, so it is not undefined behaviour.
When you declare a variable (especially an explicit pointer), a piece of memory is allocated (usually the size of an int). That piece of memory is handed to your program, but the old value stored there is not cleared (this depends on how memory allocation is implemented by the compiler; it might fill the place with zeroes), so your int *p will hold a random (junk) value which it has to interpret as an address. That address is the place in memory where p points (p's pointee). When you try to dereference it (i.e. access that piece of memory), it will (almost every time) be occupied by another process/program, so trying to alter/modify some other process's memory will result in an access violation raised by the memory manager.
So in this example, any value other than 0 will result in undefined behavior, because no one knows where p will point at that moment.
I hope this explanation is of any help.
Edit: Ah, sorry, a few answers got ahead of me again :)
It is simple: if a piece of code doesn't execute, it doesn't have a behavior, whether defined or not.
If input is 0, then the code inside if doesn't run, so it depends on the rest of the program to determine whether the behavior is defined (in this case it is defined).
If input is not 0, you execute code that we all know is a case of undefined behavior.
I would say it makes the whole program undefined.
The key to undefined behavior is that it is undefined. The compiler can do whatever it wants to when it sees that statement. Now, every compiler will handle it as expected, but they still have every right to do whatever they want to - including changing parts unrelated to it.
For example, a compiler may choose to add a message "this program may be dangerous" to the program if it detects undefined behavior. This would change the output whether or not v is 0.
Your program is pretty well defined. If v == 0 then it returns zero. If v != 0 then it splatters over some random point in memory.
p is a pointer, its initial value could be anything, since you don't initialise it. The actual value depends on the operating system (some zero memory before giving it to your process, some don't), your compiler, your hardware and what was in memory before you ran your program.
The pointer assignment is just writing into a random memory location. It might succeed, it might corrupt other data or it might segfault - it depends on all of the above factors.
As far as C goes, it's pretty well defined that uninitialised variables do not have a known value, and your program (though it might compile) will not be correct.
As pointed out in an answer to this question, the compiler (in this case gcc-4.1.2, yes it's old, no I can't change it) can replace struct assignments with memcpy where it thinks it is appropriate.
I'm running some code under valgrind and got a warning about memcpy source/destination overlap. When I look at the code, I see this (paraphrasing):
struct inner
{
    int x;
    // lots of other stuff
};

struct outer
{
    struct inner i;
    // lots of other stuff
};
void frob(struct inner* i, struct outer* o)
{
    o->i = *i;
}

int main()
{
    struct outer o;
    // assign a bunch of fields in o->i...
    frob(&o.i, o);
    return 0;
}
If gcc decides to replace that assignment with memcpy, then it's an invalid call because the source and dest overlap.
Obviously, if I change the assignment statement in frob to call memmove instead, then the problem goes away.
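For reference, this is roughly what that memmove version of frob looks like (my paraphrase of the workaround, not my actual code; it reuses the struct definitions above):
#include <string.h>

void frob_memmove(struct inner *i, struct outer *o) {
    /* memmove, unlike memcpy, permits overlapping source and destination */
    memmove(&o->i, i, sizeof o->i);
}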
But is this a compiler bug, or is that assignment statement somehow invalid?
I think that you are mixing up the levels. gcc is perfectly correct to replace an assignment operation by a call to any library function of its liking, as long as it can guarantee the correct behavior.
It is not "calling" memcpy or whatsoever in the sense of the standard. It is just using one function it its library for which it might have additional information that guarantees correctness. The properties of memcpy as they are described in the standard are properties seen as interfaces for the programmer, not for the compiler/environment implementor.
Whether or not memcpy in the implementation in question has a behavior that makes it valid for the assignment operation is another question. It should not be too difficult to check that, or even to inspect the code.
As far as I can tell, this is a compiler bug. i is allowed to alias &o.i according to the aliasing rules, since the types match and the compiler cannot prove that the address of o.i could not have been previously taken. And of course calling memcpy with overlapping (or same) pointers invokes UB.
By the way note that, in your example, o->i is nonsense. You meant o.i I think...
I suppose that there is a typo: "&o" was intended instead of "o".
Under this hypothesis, the "overlap" is actually a strict overwrite: memcpy(&o->i,&o->i,sizeof(o->i)). In this particular case memcpy behaves correctly.