string literals and strcat - c

I am not sure why strcat works in this case for me:
char* foo="foo";
printf(strcat(foo,"bar"));
It successfully prints "foobar" for me.
However, as per an earlier topic discussed on Stack Overflow here: I just can't figure out strcat
It says that the above should not work, because foo points to a string literal. Instead, foo needs to be declared as a buffer (an array of a predetermined size, so that it can accommodate another string that we are trying to concatenate).
In that case, why does the above program work for me successfully?

This code invokes Undefined Behavior (UB), meaning that you have no guarantee of what will happen (failure here).
The reason is that string literals are immutable: any attempt to modify one invokes UB.
Note what a difficult class of logical error UB can produce: the code might work (today, on your system), yet still be wrong, which makes it very likely that you'll miss the error and carry on as if everything were fine.
PS: In this Live Demo, I am lucky enough to get a Segmentation fault. I say lucky, because this seg fault will make me investigate and debug the code.
It's worth noting that GCC issues no warning, and the warning from Clang is also unrelated to the real problem:
prog.c:7:8: warning: format string is not a string literal (potentially insecure) [-Wformat-security]
    printf(strcat(foo,"bar"));
           ^~~~~~~~~~~~~~~~~
prog.c:7:8: note: treat the string as an argument to avoid this
    printf(strcat(foo,"bar"));
           ^
           "%s",
1 warning generated.

String literals are immutable in the sense that the compiler will operate under the assumption that you won't mutate them, not that you'll necessarily get an error if you try to modify them. In legalese, this is "undefined behavior", so anything can happen, and, as far as the standard is concerned, it's fine.
Now, on modern platforms and with modern compilers you do have extra protections: on platforms that have memory protection the string table generally gets placed in a read-only memory area, so that modifying it will get you a runtime error.
Still, you may have a compiler that doesn't provide any of the runtime-enforced checks, either because you are compiling for a platform without memory protection (e.g. pre-80386 x86, so pretty much any C compiler for DOS such as Turbo C, most microcontrollers when operating on RAM and not on flash, ...), or with an older compiler which doesn't exploit this hardware capability by default to remain compatible with older revisions (older VC++ for a long time), or with a modern compiler which has such an option explicitly enabled, again for compatibility with older code (e.g. gcc with -fwritable-strings). In all these cases, it's normal that you won't get any runtime error.
Finally, there's an extra devious corner case: current-day optimizers actively exploit undefined behavior - i.e. they assume that it never happens, and modify the code accordingly. It's not impossible that a particularly smart compiler generates code that just drops such a write, as it's legally allowed to do anything it likes in such a case.
This can be seen for some simple code, such as:
int foo() {
    char *bar = "bar";
    *bar = 'a';
    if (*bar == 'b') return 1;
    return 0;
}
here, with optimizations enabled:
VC++ sees that the write is used just for the condition that immediately follows, so it simplifies the whole thing to return 0; no memory write, no segfault, it "appears to work" (https://godbolt.org/g/cKqYU1);
gcc 4.1.2 "knows" that literals don't change; the write is redundant and it gets optimized away (so, no segfault), the whole thing becomes return 1 (https://godbolt.org/g/ejbqDm);
any more modern gcc chooses a more schizophrenic route: the write is not elided (so you get a segfault with the default linker options), but if it succeeded (e.g. if you manually fiddle with memory protection) you'd get a return 1 (https://godbolt.org/g/rnUDYr) - so, the memory is modified, but the code that follows thinks it hasn't been; this is particularly egregious on AVR, where there's no memory protection and the write succeeds.
clang does pretty much the same as gcc.
Long story short: don't try your luck and tread carefully. Always assign string literals to const char * (not plain char *) and let the type system help you avoid this kind of problem.
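For example, a minimal sketch of how const helps (buffer size chosen so the result fits):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *foo = "foo";   // the literal gets a const-qualified pointer
    // strcat(foo, "bar");     // now a compile-time diagnostic: discards 'const'
    char buf[8] = "foo";       // writable buffer with room for "foobar" + '\0'
    strcat(buf, "bar");        // well-defined: buf now holds "foobar"
    printf("%s %s\n", foo, buf);
    return 0;
}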

Related

Can a string literal in C be modified?

I recently had a question. I know that a pointer to a constant array, initialized as in the code below, points into the .rodata region, and that this region is read-only.
However, I saw in the C11 standard that writing to this memory address is undefined behavior.
I was aware that Borland's Turbo-C compiler lets you write where the pointer points. Is this because the processor operated in real mode on some systems of the time, such as MS-DOS, or is it independent of the processor's operating mode? Is there any other compiler that allows the write without any memory access fault, with the processor in protected mode?
#include <stdio.h>
int main(void) {
    char *st = "aaa";
    *st = 'b';
    return 0;
}
Compiling this code with Turbo-C on MS-DOS, you will be able to write to that memory.
As has been pointed out, trying to modify a constant string in C results in undefined behavior. There are several reasons for this.
One reason is that the string may be placed in read-only memory. This allows it to be shared across multiple instances of the same program, and doesn't require the memory to be saved to disk if the page it's on is paged out (since the page is read-only and thus can be reloaded later from the executable). It also helps detect run-time errors by giving an error (e.g. a segmentation fault) if an attempt is made to modify it.
Another reason is that the string may be shared. Many compilers (e.g., gcc) will notice when the same literal string appears more than once in a compilation unit, and will share the same storage for it. So if a program modifies one instance, it could affect others as well.
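For instance, a sketch of how merging makes the write extra dangerous (assuming the compiler pools identical literals):

char *p = "aaa";
char *q = "aaa";   // the compiler may make p and q point at the very same bytes
// if a write through p succeeded, the string seen through q would change too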
There is also never a need to do this, since the same intended effect can easily be achieved by using a static character array. For instance:
#include <stdio.h>
int main(void) {
    static char st_arr[] = "aaa";
    char *st = st_arr;
    *st = 'b';
    return 0;
}
This does exactly what the posted code attempted to do, but without any undefined behavior. It also takes the same amount of memory. In this example, the string "aaa" is used as an array initializer, and does not have any storage of its own. The array st_arr takes the place of the constant string from the original example, but (1) it will not be placed in read-only memory, and (2) it will not be shared with any other references to the string. So it's safe to modify it, if in fact that's what you want.
Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?
GCC 3 and earlier used to support gcc -fwritable-strings to let you compile old K&R C where this was apparently legal, according to https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Incompatibilities.html. (It's undefined behaviour in ISO C, and thus a bug in an ISO C program.) That option defines the behaviour of the assignment which ISO C leaves undefined.
GCC 3.3.6 manual - C Dialect options
-fwritable-strings
Store string constants in the writable data segment and don't uniquize them. This is for compatibility with old programs which assume they can write into string constants.
Writing into string constants is a very bad idea; “constants” should be constant.
GCC 4.0 removed that option (release notes); the last GCC3 series was gcc3.4.6 in March 2006. Although apparently it had become buggy in that version.
gcc -fwritable-strings would treat string literals like non-const anonymous character arrays (see #gnasher's answer), so they go in the .data section instead of .rodata, and thus get linked into a segment of the executable that's mapped to read+write pages, not read-only. (Executable segments have basically nothing to do with x86 segmentation, it's just a start+range memory-mapping from the executable file to memory.)
And it would disable duplicate-string merging, so char *foo() { return "hello"; } and char *bar() { return "hello"; } would return different pointer values, instead of merging identical string literals.
Related:
How can some GCC compilers modify a constant char pointer?
https://softwareengineering.stackexchange.com/questions/294748/why-are-c-string-literals-read-only
Linker option: still Undefined Behaviour so probably not viable
On GNU/Linux, linking with ld -N (--omagic) will make the text (as well as data) section read+write. This may apply to .rodata even though modern GNU Binutils ld puts .rodata in its own section (normally with read but not exec permission) instead of making it part of .text. Having .text writeable could easily be a security problem: you never want a page with write+exec at the same time, otherwise some bugs like buffer overflows can turn into code-injection attacks.
To do this from gcc, use gcc -Wl,-N to pass on that option to ld when linking.
This doesn't do anything about it being Undefined Behaviour to write const objects. e.g. the compiler will still merge duplicate strings, so writing into one char *foo = "hello"; will affect all other uses of "hello" in the whole program, even across files.
What to use instead:
If you want something writeable, use static char foo[] = "hello"; where the quoted string is just an array initializer for a non-const array. As a bonus, this is more efficient than static char *foo = "hello"; at global scope, because there's one fewer level of indirection to get to the data: it's just an array instead of a pointer stored in memory.
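A minimal sketch of the difference (identifier names are illustrative):

static char foo_arr[] = "hello";  // 6 writable bytes; foo_arr[0] = 'H' is fine
static char *foo_ptr = "hello";   // a pointer (in .data) to a literal (likely in .rodata);
                                  // foo_ptr[0] = 'H' is undefined behaviour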
You are asking whether or not the platform may cause undefined behavior to be defined. The answer to that question is yes.
But you are also asking whether or not the platform defines this behavior. In fact it does not.
Under some optimization hints, the compiler will merge string constants, so that writing to one constant will write to the other uses of that constant. I used such a compiler once; it was quite capable of merging strings.
Don't write this code. It's not good. You will regret writing code in this style when you move onto a more modern platform.
Your literal "aaa" produces a static array of four const char 'a', 'a', 'a', '\0' in an anonymous location and returns a pointer to the first 'a', cast to char*.
Trying to modify any of the four characters is undefined behaviour. Undefined behaviour can do anything, from modifying the char as intended, pretending to modify the char, doing nothing, or crashing.
It's basically the same as static const char anonymous[4] = { 'a', 'a', 'a', '\0' }; char* st = (char*) &anonymous [0];
To add to the correct answers above: DOS runs in real mode, so there is no read-only memory; all memory is flat and writable. Hence, writing to the literal actually went through (as did writing to any sort of const variable) at the time.

Assigning a "string" to a varible previously declared as "int"

I am new to programming, and while learning about dynamic typing in Python, a doubt arose about "static typing". I tried out this code (assigning a string to an integer variable which was previously declared) and printed the variable as printf(var_name), and it gives output; can anyone explain this?
#include<stdio.h>
#include<conio.h>
void main()
{
    int i = 20;
    i = "hello";
    printf(i);
}
Although your question might be a duplicate, let me add something missing from the read-worthy answer https://stackoverflow.com/a/430414/3537677.
C is strongly/statically typed but weakly checked
This is one of the biggest core language features setting C apart from other languages like C++ (which people often mistake for simply "C with classes").
Meaning: although C has a strong type system in the sense that it needs and uses types to know the sizes of things at compile time, the C language does not use its type system to check for misuse. So compilers are neither mandated to check this nor allowed to reject your code, because it's legal C. Modern compilers will issue a warning, though.
C compilers only ensure "their type system" for the mentioned size management. Meaning: if you just write int i = 42; this variable has so-called automatic storage duration, or what many people more or less correctly call "the stack". The compiler takes care of reserving space for the variable and cleaning it up. If it cannot know the size of something but needs it, it will indeed generate an error. But this can be circumvented by doing things at run time and by using types without any type whatsoever, i.e. pointers and void* aka void-pointers.
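A short sketch of what this size management means in practice:

int i = 42;        // the compiler knows the size, so it reserves storage itself
void *p = &i;      // a void-pointer erases the type...
// *p = 7;         // ...so this dereference is rejected: no size is known
*(int *)p = 7;     // a type must be re-asserted before the write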
Regarding your code
Your code seems to be written for an old, non-standard C compiler, judging by the #include<conio.h> and the void-returning main. With a few modifications one can compile your code, but by calling printf with an illegal format string you are causing so-called undefined behaviour (UB), meaning it might work on your machine but crash on mine.
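For reference, a minimal sketch with those modifications applied, assuming the intent was simply to print the string:

#include <stdio.h>

int main(void)
{
    const char *s = "hello";  // give the string a matching pointer type
    printf("%s\n", s);        // and always pass printf a format string
    return 0;
}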

What could happen when you call function returning int with void (*)() pointer?

I would like to know what could happen in a situation like this:
int foo()
{
    return 1;
}

void bar()
{
    void (*fPtr)();
    fPtr = (void (*)())foo;
    fPtr();
}
The address of a function returning int is assigned to a pointer of type void (*)(), and the function pointed to is called.
What does the standard say about it?
Regardless of the answer to the 1st question: are we safe to call the function like this? In practice, shouldn't the outcome just be that the callee (foo) puts something in EAX / RAX and the caller (bar) ignores the RAX content and goes on with the program? I'm interested in the Windows calling conventions for x86 and x64.
Thanks a lot for your time
1)
From the C11 standard - 6.5.2.2 - 9
If the function is defined with a type that is not compatible with the type (of the expression) pointed to by the expression that denotes the called function, the behavior is undefined
It is clearly stated that if a function is called using a pointer of type that does not match the type it is defined with, it leads to Undefined Behavior.
But the cast is okay.
2)
Regarding your second question - in the case of a well-defined calling convention XXX and implementation YYYY -
You might have disassembled a sample program (even this one) and figured out that it "works". But there are slight complications. You see, compilers these days are very smart. Some compilers are capable of performing precise inter-procedural analysis. Such a compiler might figure out that you have behavior that is not defined, make an assumption based on that, and thereby break the behavior you expected.
A simple example -
Since the compiler sees that this function is called with type void (*)(), it may assume that the function is not supposed to return anything and remove the instructions required to return the correct value.
In that case, other functions calling this function (in the right way) would get a bad value, with visibly bad effects.
PS: As pointed out by @PeterCordes, any modern, sane and useful compiler won't have such an optimization, and it is probably always safe to use such calls. But the intent of this answer and its (probably too simplistic) example is to remind you that one must tread very carefully when dealing with UB.
What happens in practice depends a lot on how the compiler implements this. You're assuming C is just a thin ("obvious") layer over asm, but it isn't.
In this case, a compiler can see that you're calling a function through a pointer with the wrong type (which has undefined behavior1), so it could theoretically compile bar() to:
bar:
    ret
A compiler can assume undefined behavior never happens during the execution of a program. Calling bar() always results in undefined behavior. Therefore the compiler can assume bar is never called and optimize the rest of the program based on that.
1 C99, 6.3.2.3/8:
If a converted pointer is used to call a function whose type is not compatible with the pointed-to type, the behavior is undefined.
About sub-question 2:
Nearly all x86 calling conventions I know (cdecl, stdcall, syscall, fastcall, pascal, 64-bit Windows and 64-bit Linux) will allow void functions to modify the ax/eax/rax register and the difference between an int function and a void function is only that the returned value is passed in the eax register.
The same is true for the "default" calling convention on most other CPUs I have already worked with (MIPS, Sparc, ARM, V850/RH850, PowerPC, TriCore). The register name is not eax but different, of course.
So when using these calling convention you can safely call the int function using a void pointer.
There are, however, calling conventions where this is not the case: I've read about a calling convention that implicitly uses an additional argument for non-void functions...
At the asm level only, this is safe in all normal x86 calling conventions for integer types: eax/rax is call-clobbered, and the caller doesn't have to do anything differently to call a void function vs. an int function and ignoring the return value.
For non-integer return types, this is a problem even in asm. Struct returns are done via a hidden pointer arg that displaces the other args, and the caller is going to store through it so it better not hold garbage. (Assuming the case is more complex than the one shown here, so the function doesn't just inline when optimization is enabled.) See the Godbolt link below for an example of calling through a casted function pointer that results in a store through a garbage "pointer" in rdi.
For legacy 32-bit code, FP return values are in st(0) on the x87 stack, and it's the caller's responsibility to not leave the x87 stack unbalanced. float / double / __m128 return values are safe to ignore in 64-bit ABIs, or in 32-bit code using a calling convention that returns FP values in xmm0 (SSE/SSE2).
In C, this is UB (see other answers for quotes from the standard). When possible / convenient, prefer a workaround (see below).
It's possible that future aggressive optimizations based on a no-UB assumption could break code like this. For example, a compiler might assume any path that leads to UB is never taken, so an if() condition that leads to this code running must always be false.
Note that merely compiling bar() can't break foo() or other functions that don't call bar(). There's only UB if bar() ever runs, so emitting a broken externally-visible definition for foo() (like @Ajay suggests) is not a possible consequence. (Except maybe if you use whole-program optimization and the compiler proves that bar() is always called at least once.) The compiler can break functions that call bar(), though, at least the parts of them that lead to the UB.
However, it is allowed (by accident or on purpose) by many current compilers for x86. Some users expect this to work, and this kind of thing is present in some real codebases, so compiler devs may support this usage even if they implement aggressive optimizations that would otherwise assume this function (and thus all paths that lead to it in any callers) never run. Or maybe not!
An implementation is free to define the behaviour in cases where the ISO C standard leaves the behaviour undefined. However, I don't think gcc/clang or any other compiler explicitly guarantees that this is safe. Compiler devs might or might not consider it a compiler bug if this code stopped working.
I definitely can't recommend doing this, because it may well not continue to be safe. Hopefully if compiler devs decide to break it with aggressive no-UB-assuming optimizations, there will be options to control which kinds of UB are assumed not to happen. And/or there will be warnings. As discussed in comments, whether to take a risk of possible future breakage for short-term performance / convenience benefits depends on external factors (like will lives be at risk, and how carefully you plan to maintain in the future, e.g. checking compiler warnings with future compiler versions.)
Anyway, if it works, it's because of the generosity of your compiler, not because of any kind of standards guarantee. This compiler generosity may be intentional and semi-maintained, though.
See also discussion on another answer: the compilers people actually use aim to be useful, not just standards compliant. The C standard allows enough freedom to make a compliant but not very useful implementation. (Many would argue that compilers that assume no signed overflow even on machines where it has well-defined semantics have already gone past this point, though. See also What Every C Programmer Should Know About Undefined Behavior (an LLVM blog post).)
If the compiler can't prove that it would be UB (e.g. if it can't statically determine which function a function-pointer is pointing to), there's pretty much no way it can break (if the functions are ABI-compatible). Clang's runtime UB-sanitizer would still find it, but a compiler doesn't have much choice in code-gen for calling through an unknown function pointer. It just has to call the way the ABI / calling convention says it should. It can't tell the difference between casting a function pointer to the "wrong" type and casting it back to the correct type (unless you dereference the same function pointer with two different types, which means one or the other must be UB. But the compiler would have a hard time proving it, because the first call might not return. noreturn functions don't have to be marked noreturn.)
But remember that link-time optimization / inlining / constant-propagation could let the compiler see which function is pointed to even in a function that gets a function pointer as an arg or from a global variable.
Workarounds (for a function before you take its address):
If the function won't be part of Link-Time-Optimization, you could lie to the compiler and give it a prototype that matches how you want to call it (as long as you're sure the asm-level calling convention is compatible).
You could write a wrapper function. It's potentially less efficient (an extra jmp if it just tail-calls the original), but if it inlines then you're cloning the function to make a version that doesn't do any of the work of creating a return value. This might still be a loss if that was cheap compared to the extra I-cache / uop cache pressure of a 2nd definition, if the version that does return a value is used too.
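A sketch of such a wrapper (names are illustrative):

int foo(void);                 // the real function, returning int

static void foo_as_void(void)  // wrapper with the signature the callers expect
{
    (void)foo();               // call foo, explicitly discarding the result
}
// take the address of foo_as_void instead of casting foo's address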
You could also define an alternate name for a function, using linker stuff so both symbols have the same address. That way you can have two prototypes for the same block of compiler-generated machine code.
Using the GNU toolchain, you can use an attribute on a prototype to make it a weak alias (at the asm / linker level). This doesn't work for all targets; it works for ELF object files, but IDK about Windows.
// in GNU C:
int foo(void) { return 4; }

// include this line in a header if you want; weakref is per translation unit
// a definition (or prototype) for foo doesn't have to be visible.
static void foo_void(void) __attribute((weakref("foo"))); // in C++, use the mangled name

int bar_safe(void) {
    void (*goo)(void) = (void (*)())foo_void;
    goo();
    return 1;
}
example on Godbolt for gcc7.2 and clang5.0.
gcc7.2 inlines foo through the weak alias call to foo_void! clang doesn't, though. I think that means that this is safe, and so is function-pointer casting, in gcc. Alternatively it means that this is potentially dangerous, too. >.<
clang's undefined-behaviour sanitizer does runtime function typeinfo checking (in C++ mode only) for calls through function pointers. int () is different from void (), so it will detect and report this UB on x86. (See the asm on Godbolt). It probably doesn't mean it's actually unsafe at the moment, though, because it doesn't yet detect / warn about it at compile time.
Use the above workarounds in the code that takes the address of the function, not in the code that receives a function pointer.
You want to let the compiler see a real function with the signature that it will eventually be called with, regardless of the function pointer type you pass it through. Make an alias / wrapper with a signature that matches what the function pointer will eventually be cast to. If that means you have to cast the function pointer to pass it in the first place, so be it.
(I think it's safe to create a pointer to the wrong type as long as it's not dereferenced. It's UB to even create an unaligned pointer, even if you don't dereference, but that's different.)
If you have code that needs to deref the same function pointer as int foo(args) in one place and void foo(args) in another place, you're screwed as far as avoiding UB.
C11 §6.3.2.3 paragraph 8:
A pointer to a function of one type may be converted to a pointer to a function of another type and back again; the result shall compare equal to the original pointer. If a converted pointer is used to call a function whose type is not compatible with the referenced type, the behavior is undefined.

Why is C so good at crapping out on undefined variables but when a var lacks initialization, a check of if it is untrue goes fine? [closed]

e.g. I'm usually obsessive-compulsive and like to do
static int i = 0;
if (!i) i = var;
but
static int i;
if (!i) i = var;
would also work.
Why? Why can't it segfault so we can all be happy that undefined variables are evil and be concise about it?
Not even the compilers complain:(
This 'philosophy' of indecisiveness in C has led me to make errors such as this:
strcat(<uninitialized>, <proper_string>); // wrong!!1
strcpy(<uninitialized>, <proper_string>); // nice
In your example, i is not an undefined variable; it is an uninitialized variable. And C has good reasons for not producing an error in these cases. For instance, a variable may be uninitialized when it is defined but assigned a value before it is used, so it is not a semantic error to lack an initialization in the definition statement.
Not all uses of uninitialized variables can be checked at compile-time. You could suggest that the program check every access to every variable by performing a runtime check, but that requires incurring a runtime overhead for something that is not necessary if the programmer wrote the code correctly. That's against the philosophy of C. A similar argument applies to why automatically-allocated variables aren't initialized by default.
However, in cases where the use of a variable before being initialized can be detected at compile-time, most modern compilers will emit a warning about it, if you have your warning level turned up high enough (which you always should). So even though the standard does not require it, it's easy to get a helpful diagnostic about this sort of thing.
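For instance, a minimal sketch that typically draws such a diagnostic with -Wall (older gcc may additionally need -O; exact wording varies by compiler):

int f(void)
{
    int x;         // automatic variable, never initialized
    return x + 1;  // reading x here is what the compiler warns about
}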
Edit: Your edit to your question makes it make no sense. If i is declared to be static then it is initialized -- to zero.
This comes from C's "lightweight" and "concise" roots. Default initializing to zero bytes was free (for global variables). And why specify anything in source text when you know what the compiler is going to do?
Uninitialized auto variables can contain random data, and in that case your "if" statements are not only odd but don't reliably do what you desire.
It seems you don't understand something about C.
int i;
actually DOES define the variable in addition to declaring it. There is memory storage. There is just no initialization when in function scope.
int i=0;
declares, defines, and initializes the storage to 0.
if (!i)
is completely unnecessary before assigning a value to i. All it does is test the value of integer i (which may or may not be initialized to a specific value depending on which statement above you used).
It would only be useful if you did:
int *i = malloc(sizeof(int));
because then i would be a pointer you are checking for validity.
You said:
Why? Why can't it segfault so we can all be happy that undefined variables are evil and be concise about it?
A "segfault" or segmentation fault, is a term that is a throwback to segmented memory OSes. Segmentation was used to get around the fact that the size of the machine word was inadequate to address all of available memory. As such, it is a runtime error, not a compile time one.
C is really not that many steps up from assembly language. It just does what you tell it to do. When you define your int, a machine word's worth of memory is allocated. Period. That memory is in a particular state at runtime, whether you initialize it specifically or leave it to randomness.
It's to squeeze every last cycle out of your CPU. On a modern CPU of course not initializing a variable until the last millisecond is a totally trivial thing, but when C was designed, that was not necessarily the case.
That behavior is undefined. Stack variables are uninitialized, so your second example may work in your compiler on your platform that one time you ran it, but it probably won't in most cases.
Getting to your broader question, compile with -Wall and -pedantic and it may make you happier. Also, if you're going to be ocd about it, you may as well write if (i == 0) i = var;
p.s. Don't be ocd about it. Trust that variable initialization works or don't use it. With C99 you can declare your variables right before you use them.
C simply gives you space, but it doesn't promise to know what is in that space. It is garbage data. An automatically added check in the compiler is possible, but that extends compile times. C is a powerful language, and as such you have the power to do anything and fall on your face at the same time. If you want something, you have to explicitly ask for it. Thus is the C philosophy.
Where is var defined?
The codepad compiler for C gives me the following error:
In function 'main':
Line 4: error: 'var' undeclared (first use in this function)
Line 4: error: (Each undeclared identifier is reported only once
Line 4: error: for each function it appears in.)
for the code:
int main(void)
{
    static int i;
    if (!i) i = var;
    return 0;
}
If I define var as an int then the program compiles fine.
I am not really sure where your problem is. The program seems to be working fine. A segfault is not there to crash your program just because you coded something that the language leaves undefined. The variable i is uninitialized, not undefined. You defined it as static int. Had you simply done:
int main(void)
{
    i = var;
    return 0;
}
Then it would most definitely be undefined.
Your compiler should throw a warning when i isn't initialized, to catch this sort of gotcha. It seems your if statement acts as a sort of catch for that warning, even if the compiler does not report it.
Static variables (in function or file scope) and global variables are always initialized to zero.
Stack variables have no dependable value. To answer your question, uninitialized stack variables are often set to non-zero, so they often evaluate as true. That cannot be depended on.
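A quick sketch of the contrast:

#include <stdio.h>

static int s;           // static storage: guaranteed to start as 0

int main(void)
{
    int a;              // automatic storage: indeterminate value
    printf("%d\n", s);  // always prints 0
    // printf("%d\n", a);  // uninitialized read: cannot be depended on
    return 0;
}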
When I was running Gentoo Linux I once found a bug in some open source Unicode handling code that checked an uninitialized variable against -1 in a while loop. On 32-bit x86 with GCC this code always ran fine, because the variable was never -1. On AMD-64 with its extra registers, the variable ended up always set to -1 and the processing loop never ran.
So, always use the compiler's high warning levels when building so you can find those bugs.
That's C, you cannot do much about it. The basic purpose of C is to be fast - a default initialization of the variable takes a few more CPU cycles and therefore you have to explicitly specify that you want to spend them. Because of this (and many other pitfalls) C is not considered good for those who don't know what they are doing :)
The C standard says the behavior of uninitialized auto variables is undefined. A particular C compiler may initialize all variables to zero (or pointers to null), but since this is undefined behavior you can't rely on it being the case for a compiler, or even any version of a particular compiler. In other words, always be explicit, and undefined means just that: The behavior is not defined and may vary from implementation to implementation.
-- Edit --
As pointed out, the particular question was about static variables, which have defined initialization behavior according to the standard. Although it's still good practice to always explicitly initialize variables, this answer is only relevant to auto variables which do not have defined behavior.

Why do compilers not warn about out-of-bounds static array indices?

A colleague of mine recently got bitten badly by writing out of bounds to a static array on the stack (he added an element to it without increasing the array size). Shouldn't the compiler catch this kind of error? The following code compiles cleanly with gcc, even with the -Wall -Wextra options, and yet it is clearly erroneous:
int main(void)
{
    int a[10];
    a[13] = 3; // oops, overwrote the return address
    return 0;
}
I'm positive that this is undefined behavior, although I can't find an excerpt from the C99 standard saying so at the moment. But in the simplest case, where the size of an array is known at compile time and the indices are known at compile time, shouldn't the compiler emit a warning at the very least?
GCC does warn about this. But you need to do two things:
Enable optimization. Without at least -O2, GCC is not doing enough analysis to know what a is, and that you ran off the edge.
Change your example so that a[] is actually used, otherwise GCC generates a no-op program and has completely discarded your assignment.
$ cat foo.c
int main(void)
{
    int a[10];
    a[13] = 3; // oops, overwrote the return address
    return a[1];
}
$ gcc -Wall -Wextra -O2 -c foo.c
foo.c: In function ‘main’:
foo.c:4: warning: array subscript is above array bounds
BTW: If you returned a[13] in your test program, that wouldn't work either, as GCC optimizes out the array again.
Have you tried -fmudflap with GCC? These are runtime checks, but they are useful, since most of the time you are dealing with runtime-calculated indices anyway. Instead of silently continuing to work, it will notify you about those bugs.
-fmudflap -fmudflapth -fmudflapir
For front-ends that support it (C and C++), instrument all risky pointer/array dereferencing operations, some standard library string/heap functions, and some other associated constructs with range/validity tests. Modules so instrumented should be immune to buffer overflows, invalid heap use, and some other classes of C/C++ programming errors. The instrumentation relies on a separate runtime library (libmudflap), which will be linked into a program if -fmudflap is given at link time. Run-time behavior of the instrumented program is controlled by the MUDFLAP_OPTIONS environment variable. See "env MUDFLAP_OPTIONS=-help a.out" for its options.
Use -fmudflapth instead of -fmudflap to compile and to link if your program is multi-threaded. Use -fmudflapir, in addition to -fmudflap or -fmudflapth, if instrumentation should ignore pointer reads. This produces less instrumentation (and therefore faster execution) and still provides some protection against outright memory corrupting writes, but allows erroneously read data to propagate within a program.
Here is what mudflap gives me for your example:
[js#HOST2 cpp]$ gcc -fstack-protector-all -fmudflap -lmudflap mudf.c
[js#HOST2 cpp]$ ./a.out
*******
mudflap violation 1 (check/write): time=1229801723.191441 ptr=0xbfdd9c04 size=56
pc=0xb7fb126d location=`mudf.c:4:3 (main)'
/usr/lib/libmudflap.so.0(__mf_check+0x3d) [0xb7fb126d]
./a.out(main+0xb9) [0x804887d]
/usr/lib/libmudflap.so.0(__wrap_main+0x4f) [0xb7fb0a5f]
Nearby object 1: checked region begins 0B into and ends 16B after
mudflap object 0x8509cd8: name=`mudf.c:3:7 (main) a'
bounds=[0xbfdd9c04,0xbfdd9c2b] size=40 area=stack check=0r/3w liveness=3
alloc time=1229801723.191433 pc=0xb7fb09fd
number of nearby objects: 1
[js#HOST2 cpp]$
It has a bunch of options. For example, it can fork off a gdb process upon violations, show you where your program leaked (using -print-leaks), or detect uninitialized variable reads. Use MUDFLAP_OPTIONS=-help ./a.out to get a list of options. Since mudflap only outputs addresses, not source filenames and line numbers, I wrote a little gawk script:
/^ / {
    file = gensub(/([^(]*).*/, "\\1", 1);
    addr = gensub(/.*\[([x[:xdigit:]]*)\]$/, "\\1", 1);
    if (file && addr) {
        cmd = "addr2line -e " file " " addr
        cmd | getline laddr
        print $0 " (" laddr ")"
        close(cmd)
        next;
    }
}
1 # print all other lines
Pipe the output of mudflap into it, and it will display the sourcefile and line of each backtrace entry.
Also -fstack-protector[-all] :
-fstack-protector
Emit extra code to check for buffer overflows, such as stack smashing attacks. This is done by adding a guard variable to functions with vulnerable objects. This includes functions that call alloca, and functions with buffers larger than 8 bytes. The guards are initialized when a function is entered and then checked when the function exits. If a guard check fails, an error message is printed and the program exits.
-fstack-protector-all
Like -fstack-protector except that all functions are protected.
You're right, the behavior is undefined. C99 pointers must point within or just one element beyond declared or heap-allocated data structures.
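A sketch of that rule:

int a[10];
int *end = a + 10;  // legal: one element past the end (may be compared, not dereferenced)
int *bad = a + 11;  // undefined behavior: even computing this pointer is out of bounds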
I've never been able to figure out how the gcc people decide when to warn. I was shocked to learn that -Wall by itself will not warn of uninitialized variables; at minimum you need -O, and even then the warning is sometimes omitted.
I conjecture that because unbounded arrays are so common in C, the compiler probably doesn't have a way in its expression trees to represent an array that has a size known at compile time. So although the information is present at the declaration, I conjecture that at the use it is already lost.
I second the recommendation of valgrind. If you are programming in C, you should run valgrind on every program, all the time until you can no longer take the performance hit.
It's not a static array.
Undefined behavior or not, it's writing to an address 13 integers from the beginning of the array. What's there is your responsibility. There are several C techniques that intentionally misallocate arrays for legitimate reasons. And this situation is not unusual in incomplete compilation units.
Depending on your flag settings, there are a number of features of this program that would be flagged, such as the fact that the array is never used. And the compiler might just as easily optimize it out of existence and not tell you - a tree falling in the forest.
It's the C way. It's your array, your memory, do what you want with it. :)
(There are any number of lint tools for helping you find this sort of thing; and you should use them liberally. They don't all work through the compiler though; Compiling and linking are often tedious enough as it is.)
The reason C doesn't do it is that C doesn't have the information. A statement like
int a[10];
does two things: it allocates sizeof(int)*10 bytes of space (plus, potentially, a little dead space for alignment), and it puts an entry in the symbol table that reads, conceptually,
a : address of a[0]
or in C terms
a : &a[0]
and that's all. In fact, in C you can interchange *(a+i) with a[i] in (almost*) all cases with no effect BY DEFINITION. So your question is equivalent to asking "why can I add any integer to this (address) value?"
* Pop quiz: what is the one case in which this isn't true?
The C philosophy is that the programmer is always right. So it will silently allow you to access whatever memory address you give there, assuming that you always know what you are doing and will not bother you with a warning.
I believe that some compilers do in certain cases. For example, if my memory serves me correctly, newer Microsoft compilers have a "Buffer Security Check" option which will detect trivial cases of buffer overruns.
Why don't all compilers do this? Either (as previously mentioned) the internal representation used by the compiler doesn't lend itself to this type of static analysis, or it just isn't high enough on the writers' priority list. Which, to be honest, is a shame either way.
shouldn't the compiler emit a warning at the very least?
No; C compilers generally do not perform array bounds checks. The obvious negative effect of this is, as you mention, an error with undefined behavior, which can be very difficult to find.
The positive side of this is a possible small performance advantage in certain cases.
There are some extensions in gcc for that (from the compiler side):
http://www.doc.ic.ac.uk/~awl03/projects/miro/
On the other hand, splint, RATS and quite a few other static code analysis tools would have found that.
You can also use valgrind on your code and see the output.
http://valgrind.org/
Another widely used library seems to be libefence.
It's simply a design decision that was once made, which now leads to these things.
Regards,
Friedrich
The -fbounds-checking option is available with gcc.
This article is worth going through:
http://www.doc.ic.ac.uk/~phjk/BoundsChecking.html
'le dorfier' has given an apt answer to your question though: it's your program, and that is the way C behaves.
