I have the following code:
#include <stdio.h>
#include <string.h>

#define M 5

void mycopy(char* text)
{
    char buffer[M];
    strcpy(buffer, text);
}

int main()
{
    char *name = "Kshitij";
    int i = 0;
    mycopy(name);
    printf("i = %d", i);
    return 0;
}
This code compiles in GCC on Apple LLVM version 8.0.0 (clang-800.0.42.1).
However, when I try to run the corresponding executable, I get a process abort output such as:
[1] PID abort ./executable.out
I understand that an error is expected, since the size of the buffer array is smaller than the length of the text argument being passed to it in this case. However, I am unable to grasp the concept behind this behavior. Why doesn't the compiler raise a compile-time error stating the reason?
The C11 standard says the following in 7.1.4. Use of library functions:
If a function argument is described as being an array, the pointer actually passed to the function shall have a value such that all address computations and accesses to objects (that would be valid if the pointer did point to the first element of such an array) are in fact valid.
And of strcpy
2 The strcpy function copies the string pointed to by s2 (including the terminating null character) into the array pointed to by s1.
And in Appendix J.2 Undefined behavior:
1 The behavior is undefined in the following circumstances:
The pointer passed to a library function array parameter does not have a value such that all address computations and object accesses are valid (7.1.4).
Since the behaviour is undefined, according to the standard, anything can happen. A compiler is actually allowed to do compile-time bounds checking, and the compilation might very well be aborted if it can be deduced that the program always writes out of bounds. Or the implementation can support range checking and abort with clear diagnostics. Or the strcpy might just copy the first 4 characters and add a terminating null, or copy 42 into the target string instead - and all these implementations would be standard-conforming.
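If the goal is simply to make this particular call safe at run time, one option (just a minimal sketch, not the only correct approach) is to make the copy length-aware so it can never write past the destination:

#include <stdio.h>
#include <string.h>

#define M 5

/* Sketch: copy at most M - 1 characters and always null-terminate,
   so buffer is never overrun regardless of how long text is. */
void mycopy(const char *text)
{
    char buffer[M];
    strncpy(buffer, text, M - 1);
    buffer[M - 1] = '\0';   /* strncpy does not terminate when it truncates */
    printf("copied: %s\n", buffer);
}

int main(void)
{
    mycopy("Kshitij");      /* prints "copied: Kshi" - truncated, but no UB */
    return 0;
}

snprintf(buffer, sizeof buffer, "%s", text) gives the same guarantee in a single call, if truncation is acceptable in the first place.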
Related
I have a question about this code below:
#include <stdio.h>

char abcd(char array[]);

int main(void)
{
    char array[4] = { 'a', 'b', 'c', 'd' };
    printf("%c\n", abcd(array));
    return 0;
}

char abcd(char array[])
{
    char *p = array;
    while (*p) {
        putchar(*p);
        p++;
    }
    putchar(*p);
    putchar(p[4]);
    return *p;
}
Why isn't a segmentation fault generated when this program reaches putchar(*p) right after exiting the while loop? I thought that once *p went beyond array[3] there would be no value at the following memory locations. For example, trying to access p[4] would be out of bounds and therefore illegal, I thought. On the contrary, this program runs with no errors. Is this because any memory to which no value has been assigned (in this case, any memory beyond the 4 elements of array) is null, i.e. holds the value '\0'?
OP seems to think that when accessing an array out of bounds, something special should happen.
Accessing outside array bounds is undefined behavior (UB). Anything may happen.
Let's clarify what undefined behavior is.
The C standard is a contract between the developer and the compiler as to what the code means. However, it just so happens that you can write things that are just outside what is defined by the standard.
One of the most common cases is trying to do out-of-bounds access. Other languages say that this should result in an exception or another error. C does not. An argument is that it would imply adding costly checks at every array access.
The compiler does not know that what you are writing is undefined behavior¹. Instead, the compiler assumes that what you write contains no undefined behavior, and translates your code to assembly accordingly.
If you want an example, compile the code below with or without optimizations:
#include <stdio.h>

int table[4] = {0, 0, 0, 0};

int exists_in_table(int v)
{
    for (int i = 0; i <= 4; i++) {
        if (table[i] == v) {
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    printf("%d\n", exists_in_table(3));
}
Without optimizations, the assembly I get from gcc does what you might expect: it just goes too far in the memory, which might cause a segfault if the array is allocated right before a page boundary.
With optimizations, however, the compiler looks at your code and notices that it cannot exit the loop (otherwise, it would try to access table[4], which cannot be), so the function exists_in_table necessarily returns 1. And we get the following, valid, implementation:
exists_in_table(int):
mov eax, 1
ret
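For comparison, here is a sketch of the same logic with the off-by-one fixed; since the loop can now terminate without ever reading table[4], the compiler can no longer fold the whole function into return 1:

int table[4] = {0, 0, 0, 0};

int exists_in_table_fixed(int v)
{
    for (int i = 0; i < 4; i++) {   /* i < 4 stays inside table[4] */
        if (table[i] == v) {
            return 1;
        }
    }
    return 0;                       /* reachable now, so the loop must be kept */
}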
Undefined behavior means undefined. It is very tricky to detect because it can be virtually invisible after compiling. You need an advanced static analyzer to interpret the C source code and understand whether what it does can be undefined behavior.
¹ in the general case, that is; modern compilers use some basic static analysis to detect the most common errors
C does no bounds checking on array accesses; because of how arrays and array subscripting are implemented, it can't do any bounds checking. It simply doesn't know that you've run past the end of the array. The operating environment will throw a runtime error if you cross a page boundary, but up until that point you can read or clobber any memory following the end of the array.
The behavior on subscripting past the end of the array is undefined - the language definition does not require the compiler or the operating environment to handle it any particular way. You may get a segfault, you may get corrupted data, you may clobber a frame pointer or return instruction address and put your code in a bad state, or it may work exactly as expected.
There are a few points worth noting about your program:
array inside main and array inside the abcd function are different. In main, it is an array of 4 elements; in abcd, it is a parameter of array type, which decays to a pointer. If, inside main, you wrote something like array[4], the compiler could warn you about it, but there is no such warning for the same access inside abcd.
p is a pointer to array, or in other words, it points to the first element of array. In C, there is no boundary or limit enforced on p. Your program is lucky because the memory right after array happens to contain a 0 value that stops the while (*p) loop. After the loop, p no longer points into array; it points just past its last element, at memory that does not belong to the array.
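If you want the loop to stay inside the array no matter what the surrounding memory happens to contain, one option (a sketch, with the hypothetical name abcd_n) is to pass the length explicitly instead of relying on a terminator the array does not have:

#include <stdio.h>

/* Sketch: print exactly n characters; never reads past the array. */
char abcd_n(const char array[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        putchar(array[i]);
    }
    return array[0];
}

int main(void)
{
    char array[4] = { 'a', 'b', 'c', 'd' };
    printf("\n%c\n", abcd_n(array, sizeof array));
    return 0;
}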
I am trying to understand how character arrays relate to strings.
char a[5]="hello";
Here, the array a is a character array of size 5. "hello" occupies indexes 0 to 4. Since we declared the array size as 5, there is no space to store the null character at the end of the string.
So my understanding is that when we try to print a, it should print until a null character is encountered; otherwise it may run into a segmentation fault.
But when I run it on my system, it always prints "hello" and terminates.
So can anyone clarify whether my understanding is correct? Or does it depend upon the system on which we execute it?
As ever so often, the answer is:
Undefined behavior is undefined.
What this means is, trying to feed this character array to a function handling strings is wrong. It's wrong because it isn't a string. A string in C is a sequence of characters that ends with a \0 character.
The C standard will tell you that this is undefined behavior. So, anything can happen. In C, you don't have runtime checks, the code just executes. If the code has undefined behavior, you have to be prepared for any effect. This includes working like you expected, just by accident.
It's quite possible that the byte in memory following your array happens to be a \0 byte. In that case, any function processing this "string" will behave as if you had passed it a valid string, and a crash is just waiting to happen on some seemingly unrelated change to the code.
You could try to add some char foo = 42; before or after the array definition; it's quite likely that you will see it in the output. But of course, there's no guarantee, because, again, undefined behavior is undefined :)
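If the intent is to have a proper string, the simplest fix (a minimal sketch) is to let the compiler size the array so the terminator fits, or to print with an explicit length so no terminator is needed:

#include <stdio.h>

int main(void)
{
    char a[] = "hello";     /* 6 bytes: 'h' 'e' 'l' 'l' 'o' '\0' */
    char b[5] = "hello";    /* 5 bytes, no terminator - not a string */

    printf("%s\n", a);      /* fine: a really is a string */
    printf("%.5s\n", b);    /* fine: the precision caps the read at 5 bytes */
    return 0;
}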
What you have done is undefined behavior. Apparently whatever compiler you used happened to initialize memory after your array to 0.
Here, array a is an character array of size 5. "hello" occupies the array index from 0 to 4. Since, we have declared the array size as 5, there is no space to store the null character at the end of the string.
So my understanding is when we try to print a, it should print until a null character is encountered.
Yes, when you use printf("%s", a), it prints characters until it hits a '\0' character (or segfaults or something else bad happens - undefined behavior). I can demonstrate that with a simple program:
#include <stdio.h>

int main()
{
    char a[5] = "hello";
    char b[5] = "world";
    int c = 5;
    printf("%s%s%d\n", a, b, c);
    return 0;
}
Output:
$ ./a.out
helloworldworld5
You can see the printf function continuing to read characters after it has already read all the characters in array a. I don't know when it will stop reading characters, however.
I've slightly modified my program to demonstrate how this undefined behavior can create bad problems.
#include <stdio.h>
#include <string.h>

int main()
{
    char a[5] = "hello";
    char b[5] = "world";
    int c = 5;
    printf("%s%s%d\n", a, b, c);
    char d[5];
    strcpy(d, a);
    printf("%s", d);
    return 0;
}
Here's the result:
$ ./a.out
helloworld��world��5
*** stack smashing detected ***: <unknown> terminated
helloworldhell�p��UAborted (core dumped)
This is a classic case of a stack buffer overflow (pun intended) due to undefined behavior.
Edit:
I need to emphasize: this is UNDEFINED BEHAVIOR. What happened in this example may or may not happen to you, depending on your compiler, architecture, libraries, etc. You can make guesses to what will happen based on your understanding of different implementations of various libraries and compilers on different platforms, but you can NEVER say for certain what will happen. My example was on Ubuntu 17.10 with gcc version 7. My guess is that something very different could happen if I tried this on an embedded platform with a different compiler, but I cannot say for certain. In fact, something different could happen if I had this example inside of a larger program on the same machine.
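To make the second example well defined, the copy has to account both for the missing terminator in a and for the size of d. One way (a sketch, not the only one) is to size the destination for the terminator and copy by length rather than by terminator:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[5] = "hello";    /* still no terminator, as before */
    char d[6];              /* room for 5 characters plus '\0' */

    memcpy(d, a, sizeof a); /* copies exactly 5 bytes - no terminator needed */
    d[5] = '\0';            /* add the terminator ourselves */

    printf("%s\n", d);      /* d is now a real string */
    return 0;
}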
strlen returns the number of characters that precede the terminating null character. An implementation of strlen might look like this:
size_t strlen(const char *str)
{
    const char *s;
    for (s = str; *s; ++s) {}
    return s - str;
}
This particular implementation dereferences s, where the bytes s points to may hold indeterminate values. It's equivalent to this:
int a;
int* p = &a;
*p;
So for example if one were to do this (which causes strlen to give an incorrect output):
char buffer[10];
buffer[9] = '\0';
strlen(buffer);
Is it undefined behavior?
Calling the standard function strlen causes undefined behaviour. DR 451 clarifies this:
library functions will exhibit undefined behavior when used on indeterminate values
For a more in-depth discussion see this thread.
The behavior of the variant that you are showing is well defined under these circumstances.
The bytes of the uninitialized array have all indeterminate values, with exception of the 10th element that you set to 0.
Accessing an indeterminate value is only UB if the address of the underlying object is never taken or if the value is a trap representation for the corresponding type.
Since this is an array and access to array elements is through pointer arithmetic, the first case is not relevant, here.
Any char value can be accessed without UB, the clauses about trap representations in the standard explicitly exclude all character types from that.
Thus the values that you are dealing with are simply "unspecified".
Reading unspecified values may, according to some members of the C standards committee, give different results each time, what some call a "wobbly" state or so. This property is not relevant here, since your function reads any such value at most once.
So your access to the array elements gives you any arbitrary but valid char value.
You are sure that your for loop stops at latest at position 9, so you will not overrun your array.
So no "bad" things beyond the visible may happen if you use your specific version of the function. But having a function call that produces unspecified results is certainly nothing you want to see in real code. Something like this here leads to very subtle bugs, and you should avoid it by all means.
No, it's not undefined behavior. Your strlen function will stop before the end of the buffer. If your strlen function referenced buffer[10], then, yes that is undefined.
It certainly will be unexpected behavior, since most of buffer contains random data. "Undefined" is a special word for people writing language standards. It means that anything could happen, including memory faults or exiting the program. By unexpected, I mean that it is surely not what the programmer wanted to happen. On some runs, the result of strlen could be 3; on others it could be 9.
Yes, it's undefined behaviour. From the draft C11 standard, §J.2 "Undefined behavior":
The behavior is undefined in the following circumstances:
...
The value of an object with automatic storage duration is used while it is indeterminate.
I wonder why the following code does not throw a segmentation fault when the string literal that is the result of dirname() is modified, but does throw a segmentation fault when a string literal created in the usual way is modified:
#include <stdio.h>
#include <stdlib.h>
#include <libgen.h>

#define FILE_PATH "/usr/bin/screen"

int main(void)
{
    char db_file[] = FILE_PATH;
    char *filename = dirname(db_file);

    /* no segfault here */
    filename[1] = 'a';

    /* segfault here */
    char *p = "abc";
    p[1] = 'z';

    exit(0);
}
I know that it's UB to modify a string literal, so the output I get may be perfectly valid, but I wonder if this can be explained. Are string literals that are returned by functions treated differently by compilers? The same situation occurs when I compile this code with Clang 3.0 on x86 and with gcc on x86 and ARM.
dirname() does not return a reference to a "string" literal, so it is fully legal to modify the data referenced by the pointer returned. Whether to do so makes sense or not is a different question, as the pointer returned may reference the char array passed to dirname().
However, if the OP's code had passed a string literal to dirname(), that would already have been illegal, as the POSIX specification explicitly states that the function may modify the array passed in.
From the POSIX specifications:
The dirname() function may modify the string pointed to by path.
From the manual
These functions may return pointers to statically allocated memory which may be overwritten by subsequent calls. Alternatively, they may return a pointer to some part of path, so that the string referred to by path should not be modified or freed until the pointer returned by the function is no longer required.
So it might return memory you can modify, and it might not. Depends on your system I guess.
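Because dirname() may modify its argument and may return a pointer either into that argument or into static storage, a defensive pattern (a sketch; strdup assumes a POSIX system) is to pass a writable copy and duplicate the result if you need to keep or modify it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libgen.h>

int main(void)
{
    char path[] = "/usr/bin/screen";    /* writable array, never a string literal */
    char *dir = strdup(dirname(path));  /* our own copy, safe to keep and modify */

    if (dir != NULL) {
        printf("%s\n", dir);            /* "/usr/bin" */
        free(dir);
    }
    return 0;
}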
String literals are typically placed in a read-only memory segment during compilation and loaded as such at runtime.
This question explains things very nicely:
String literals: Where do they go?
#include <stdio.h>

int main(void)
{
    char name1[5];
    int count;

    printf("Please enter names\n");
    count = scanf("%s", name1);
    printf("You entered name1 %s\n", name1);
    return 0;
}
When I entered more than 5 characters, it printed all the characters I entered, even though the char array is declared as:
char name1[5];
Why did this happen?
Because the characters are stored at the addresses following the array's storage space. This is very dangerous and can lead to crashes.
E.g. suppose you enter the name Michael and the name1 variable starts at address 0x1000.

name1:   M       i       c       h       a       e       l       \0
         0x1000  0x1001  0x1002  0x1003  0x1004  0x1005  0x1006  0x1007
         [...... allocated space ......]

The allocated space is shown with [...]; it covers only the 5 bytes 0x1000 to 0x1004.
This means the memory from 0x1005 onward is overwritten.
Solution:
Copy only 5 characters (including the \0 at the end) or check the length of the entered string before you copy it.
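One concrete way to apply that advice with scanf (a minimal sketch) is to give %s a maximum field width one less than the buffer size, so scanf stops before it can overrun name1:

#include <stdio.h>

int main(void)
{
    char name1[5];

    printf("Please enter names\n");
    if (scanf("%4s", name1) == 1) {   /* at most 4 characters plus '\0' fit in name1[5] */
        printf("You entered name1 %s\n", name1);
    }
    return 0;
}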
This is undefined behavior, you are writing beyond the bounds of allocated memory. Anything can happen, including a program that appears to work correctly.
The C99 draft standard section J.2 Undefined Behavior says:
The behavior is undefined in the following circumstances:
and contains the following bullet:
An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6).
This applies to the more general case since E1[E2] is identical to (*((E1)+(E2))).
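That identity is easy to see in a small sketch: for an array a and an index i, a[i], *(a + i) and even i[a] all designate the same object:

#include <stdio.h>

int main(void)
{
    int a[5] = {10, 20, 30, 40, 50};

    /* all three expressions read the same element */
    printf("%d %d %d\n", a[2], *(a + 2), 2[a]);
    return 0;
}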
This is undefined behavior, you can't count on it. It just happens to work, it may not work on another machine.
To avoid the buffer overflow, use
fgets(name1, sizeof(name1), stdin);
(fgets reads at most sizeof(name1) - 1 characters and always null-terminates), or, if your implementation supports the optional Annex K bounds-checked interfaces from C11,
gets_s(name1, sizeof(name1));
Another example to make things clearer:
#include <stdio.h>

int array[5];

int main(void)
{
    array[-1] = array[-1];      // sounds strange??
    printf("%d", array[-1]);    // but it works!!
    return 0;
}
array in this case is an address, and you can read the memory just before or after that address, but doing so is undefined behavior unless you know exactly what you are doing. Pointer arithmetic with ++ or -- works the same way!
It's clear from the other answers that this constitutes a vulnerability in your program.
What can be learned from this? Let's assume:
int func(void)
{
    char buffer[1];
    ...
In almost every implementation of a C compiler, the code generated here will create a local stack area and let you access that stack through the address held in buffer. Other important data resides on this stack too, for example: the address of the next instruction to be executed after the function returns to its caller.
You could, therefore, theoretically:
Enter a lot of carefully crafted input into your input function,
Have that input encode (in machine code) a new function that does something ugly,
Overwrite the correct return address (on the stack) with the address this new function would have, by writing beyond the buffer's bounds.
This is called a buffer overflow exploit; you can read up on it here (and in many other places).
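As a sketch of the two patterns involved (the function names vulnerable and safer are just illustrative), compare an unbounded read, which lets input run past the array and over whatever the stack holds next, with a bounded one:

#include <stdio.h>

/* Unbounded: "%s" does not know how big buffer is, so a long enough line
   overwrites whatever follows the array on the stack - possibly the saved
   return address, which is exactly what an exploit targets. */
void vulnerable(void)
{
    char buffer[8];
    scanf("%s", buffer);
    printf("%s\n", buffer);
}

/* Bounded: fgets never writes more than sizeof buffer bytes. */
void safer(void)
{
    char buffer[8];
    if (fgets(buffer, sizeof buffer, stdin) != NULL) {
        printf("%s\n", buffer);
    }
}

int main(void)
{
    safer();
    return 0;
}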
Yes, it is allowed in C, as there is no bounds checking.