Is strcpy where src == dest strictly defined? - c

It's not too hard to demonstrate that strcpy on overlapped source and destination addresses fails on some platforms, either producing incorrect results or trapping (the latter with some negative random offsets on Linux/amd64).
I'd instrumented a strcpy wrapper function for our codebase with debug-build assertions that check for such overlapped copies, and have received a number of internal development requests to weaken this assertion checking so that it only aborts for non-zero overlaps.
I've been hesitant to do so, based on my read of the strcpy documentation since I'd assume that equal source and destinations would count as overlapped. Is overlapped defined explicitly in the C++ standard (or C), and does this also include equality?
I suspect many vendor strcpy implementations special case this despite the freedom the standard allows to have this be undefined behavior. Are there any platform/hardware combinations where such an equal copy is known to fail?

Since you are using C++, the question is really: why are you not using std::string? strcpy is notoriously unsafe (as are its cousins strcpy_s and strncpy, although they are mildly safer than strcpy).
If you attempt to copy from a source to a destination that are the same, at best you will get no change.

If you found yourself a better documentation website for C functions, you'd see this signature:
char *strcpy(char *restrict s1, const char *restrict s2);
restrict in this case indicates that the caller promises that the two buffers in question do not overlap.
We can further search for the meaning of restrict as of C99, and find this wikipedia page:
It says that for the lifetime of the pointer, only it or a value
directly derived from it (such as pointer + 1) will be used to access
the object to which it points.
which is pretty clear that identical pointers are not allowed. If it happens to work on your system, there is no reason for you to think it will work in the next iteration of the compiler, library, or on new hardware.
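A debug-build check of the kind the question describes could be sketched as follows. The names regions_overlap and checked_strcpy are hypothetical, and the comparison assumes a flat address space in which converting pointers to uintptr_t preserves their ordering, which the C standard does not guarantee. Equal pointers count as an overlap here, in line with the reading of restrict above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if the n-byte regions starting at a and b
 * share at least one byte. Assumes a flat address space (an assumption,
 * not something the C standard promises). */
static bool regions_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t ua = (uintptr_t)a;
    uintptr_t ub = (uintptr_t)b;

    /* Compare the distance between the start addresses with n. */
    return (ua <= ub) ? (ub - ua < n) : (ua - ub < n);
}

/* Hypothetical wrapper: asserts in debug builds that the bytes strcpy
 * reads and the bytes it writes do not overlap, including the case
 * dst == src. */
static char *checked_strcpy(char *dst, const char *src)
{
    size_t n = strlen(src) + 1; /* bytes copied, including the terminator */

    assert(!regions_overlap(dst, src, n));
    return strcpy(dst, src);
}
```

With NDEBUG defined the assertion disappears and the wrapper behaves like plain strcpy.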

Related

Why C11 standard doesn't drop unsafe strcat(),strcpy() functions?

The C11 and C++14 standards dropped the gets() function, which is inherently insecure and leads to security problems because it performs no bounds checking, resulting in buffer overflows. Then why doesn't the C11 standard drop strcat() and strcpy()? strcat() doesn't check whether the second string will fit in the first array. strcpy() also has no provision for checking the bounds of the target array. What if the source array has more characters than the destination array can hold? Most probably the program will crash at runtime.
So, wouldn't it be nice if these two unsafe functions were completely removed from the language? Why do they still exist? What is the reason? Wouldn't it be fine to have only functions like strncat() and strncpy()? If I am not wrong, the Microsoft C and C++ compiler provides safe versions of these functions, strcpy_s() and strcat_s(). Then why aren't they officially implemented by other C compilers to provide safety?
gets() is inherently unsafe, because in general it can overflow the target if too much data is received on stdin. This:
char s[MANY];
gets(s);
will cause undefined behavior if more than MANY characters are entered, and there is typically nothing the program can do to prevent it.
strcpy() and strcat() can be used completely safely, since they can overflow the target only if the source string is too long to be contained in the target array. The source string is contained in an array object that is under the control of the program itself, not of any external input. For example, this:
char s[100];
strcpy(s, "hello");
strcat(s, ", ");
strcat(s, "world");
cannot possibly overflow unless the program itself is modified.
strncat() can be used as a safer version of strcat() -- as long as you specify the third argument correctly. One problem with strncat() is that it only gives you one way of handling the case where there's not enough room in the target array: it silently truncates the string. Sometimes that might be what you want, but sometimes you might want to detect the overflow and do something about it.
As for strncpy(), it is not simply a safer version of strcpy(). It's not inherently dangerous, but if you're not very careful you can easily leave the target array without a terminating '\0' null character, leading to undefined behavior next time you pass it to a function expecting a pointer to a string. As it happens, I've written about this.
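The missing-terminator pitfall can be illustrated with a small sketch. The helper names is_terminated and copy_terminated are made up for this example; the second one shows one common way to use strncpy with an explicit terminator, at the cost of silent truncation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a '\0' occurs within the first
 * dst_size bytes of dst, 0 otherwise. */
static int is_terminated(const char *dst, size_t dst_size)
{
    return memchr(dst, '\0', dst_size) != NULL;
}

/* Hypothetical helper: strncpy followed by explicit termination.
 * The result is always a valid string, possibly truncated. */
static void copy_terminated(char *dst, size_t dst_size, const char *src)
{
    assert(dst_size > 0);
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

Calling strncpy(raw, "abcdef", 4) on a 4-byte array fills all four bytes with "abcd" and writes no terminator, so is_terminated(raw, 4) reports 0, whereas copy_terminated leaves "abc" in the same size of buffer.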
strcpy and strcat aren't similar to gets. The problem of gets is, it's used to read from input, so it's out of the programmer's control whether there will be buffer overflow.
The C99 Rationale explains strncpy as:
Rationale for International Standard — Programming Languages — C §7.21.2.4 The strncpy function
strncpy was initially introduced into the C library to deal with fixed-length name fields in structures such as directory entries. Such fields are not used in the same way as strings: the trailing null is unnecessary for a maximum-length field, and setting trailing bytes for shorter names to null assures efficient field-wise comparisons. strncpy is not by origin a “bounded strcpy,” and the Committee preferred to recognize existing practice rather than alter the function to better suit it to such use.
Myth 1: strcpy() is unsafe and how it works comes as a great surprise to a veteran C programmer.
Myth 2: strncpy() is safe.
Myth 3: strncpy() is a safer version of strcpy().
Myth 4: Microsoft is some kind of authority of the use of the C language and know what they are talking about.
strcat() and strcpy() are perfectly safe functions.
Also note that strncpy was never intended to be a safe version of strcpy. It is used for an obscure, obsolete string format used in an ancient version of Unix. strncpy is actually very unsafe (one of many blog posts about it here), unlike strcpy, since very few programmers seem to be able to use the former without producing fatal bugs (no null termination).
A better question is why the inherently unsafe strncpy() wasn't removed from the language. Is anyone working with obscure Unix strings from the 1970s much?
When removing a function completely, one of the major things the standards committee has to consider is how much code it could break and how many people (programmers, library writers, compiler vendors, etc.) would be annoyed by (or would oppose) the change.
gets() was deprecated in LSB (Linux Standard Base). POSIX-2008 made it obsolete, and gets() has historically been known as a seriously bad function whose use has always been strongly discouraged. Pretty much every C programmer knew it was seriously dangerous to use gets(). So the chance of its removal breaking any production code is very small, if not non-existent. So it was easy for the committee to remove gets() from C11.
But that's not the case with strcpy, strcat, etc. They can be used safely, and they are still being used by many programmers in new code. While they can be subject to buffer overflow, that is mostly under the programmer's control, whereas with gets() it isn't.
An argument can be made for using snprintf in place of strcpy and strcat. But it would seem pointless in simple cases like:
char buf[256];
strcpy(buf, "hello");
(if buf were a pointer, the allocated size would need to be tracked for use in snprintf)
because as a programmer, I know the above is perfectly safe. More importantly, a lot of legacy code would break. Basically, no strong argument can be made for removing strcpy and similar functions, as they can be used safely.
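For the cases where the source length is not known in advance, the snprintf approach mentioned above might look roughly like this. The helper name copy_or_report and its return convention are assumptions made for the sketch; only snprintf itself is standard.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies src into dst using snprintf.
 * Returns 0 if the whole string fit, -1 if it was truncated or an
 * encoding error occurred. On truncation dst still holds a terminated
 * prefix of src. */
static int copy_or_report(char *dst, size_t dst_size, const char *src)
{
    assert(dst_size > 0);

    /* snprintf returns the length the full output would have had. */
    int needed = snprintf(dst, dst_size, "%s", src);

    if (needed < 0 || (size_t)needed >= dst_size)
        return -1;
    return 0;
}
```

Unlike strncat or strncpy, this lets the caller detect the overflow and decide what to do about it instead of truncating silently.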
What you are talking about is scenarios which will lead to undefined behavior.
Let's say
char a[3] = "ab";
for (int i = 0; i < 5; i++)   /* reads a[3] and a[4], past the end of the array */
    printf("%c\n", a[i]);
You have an out-of-bounds array access, and the standard hasn't removed this because it is you who is assigning the value and it is under your control.
The same goes for strcpy() and strcat().
So the standard can't remove every scenario that leads to UB.
With gets(), on the other hand, the input is not under the programmer's control: it takes data from a stream, you never know what the input might be, and there is a high probability of ending up with a buffer overflow. So it has been removed, and the safer function fgets() is available instead.
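As a rough illustration of why fgets() is the safer choice, here is a minimal sketch of a bounded line reader. The helper name read_line is hypothetical; the point is that fgets takes the buffer size, so overlong input is cut off rather than written past the end of the array.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads at most size-1 characters of one line
 * from in, strips a trailing newline if present, and returns 1.
 * Returns 0 on end of file or read error. */
static int read_line(FILE *in, char *buf, size_t size)
{
    assert(size > 0 && size <= INT_MAX);

    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0'; /* drop the newline, if any */
    return 1;
}
```

If a line is longer than the buffer, the remainder stays in the stream and is returned by the next call, so the caller can detect and handle it.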

strlen not checking for NULL

Why is strlen() not checking for NULL?
if I do strlen(NULL), the program segmentation faults.
Trying to understand the rationale behind it (if any).
The rationale behind it is simple -- how can you check the length of something that does not exist?
Also, unlike in "managed languages", there is no expectation that the run-time system will handle invalid data or data structures correctly. (This type of issue is exactly why more "modern" languages are more popular for applications that are not computation-heavy or have lower performance requirements.)
A standard template in C would look like this
size_t someStrLen;
if (someStr != NULL) // or if (someStr)
    someStrLen = strlen(someStr);
else
{
    // handle error.
}
The portion of the language standard that defines the string handling library states that, unless specified otherwise for the specific function, any pointer arguments must have valid values.
The philosophy behind the design of the C standard library is that the programmer is ultimately in the best position to know whether a run-time check really needs to be performed. Back in the days when your total system memory was measured in kilobytes, the overhead of performing an unnecessary runtime check could be pretty painful. So the C standard library doesn't bother doing any of those checks; it assumes that the programmer has already done it if it's really necessary. If you know you will never pass a bad pointer value to strlen (such as, you're passing in a string literal, or a locally allocated array), then there's no need to clutter up the resulting binary with an unnecessary check against NULL.
The standard does not require it, so implementations just avoid a test and potentially an expensive jump.
A little macro to help your grief:
#define strlens(s) ((s) == NULL ? 0 : strlen(s))
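A function is a possible alternative to the macro, since it evaluates its argument only once. This is just a sketch; strlen_or_zero is a made-up name, and treating NULL as length 0 is a policy choice, not something the standard library defines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: like strlen, but maps a null pointer to 0
 * instead of invoking undefined behaviour. */
static inline size_t strlen_or_zero(const char *s)
{
    return s != NULL ? strlen(s) : 0;
}

/* Usage sketch: both calls are well defined. */
static void strlen_or_zero_demo(void)
{
    assert(strlen_or_zero(NULL) == 0);
    assert(strlen_or_zero("abc") == 3);
}
```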
Three significant reasons:
The standard library and the C language are designed assuming that the programmer knows what he is doing, so a null pointer isn't treated as an edge case, but rather as a programmer's mistake that results in undefined behaviour;
It incurs runtime overhead - calling strlen thousands of times and always doing str != NULL is not reasonable unless the programmer is treated as a sissy;
It adds up to the code size - it could only be a few instructions, but if you adopt this principle and do it everywhere it can inflate your code significantly.
size_t strlen ( const char * str );
http://www.cplusplus.com/reference/clibrary/cstring/strlen/
strlen takes a pointer to a character array as a parameter; NULL is not a valid argument to this function.

Is the function strcpy always dangerous?

Are functions like strcpy, gets, etc. always dangerous? What if I write a code like this:
int main(void)
{
char *str1 = "abcdefghijklmnop";
char *str2 = malloc(100);
strcpy(str2, str1);
}
This way the function doesn't accept arguments(parameters...) and the str variable will always be the same length...which is here 16 or slightly more depending on the compiler version...but yeah 100 will suffice as of march, 2011 :).
Is there a way for a hacker to take advantage of the code above?
10x!
Absolutely not. Contrary to Microsoft's marketing campaign for their non-standard functions, strcpy is safe when used properly.
The above is redundant, but mostly safe. The only potential issue is that you're not checking the malloc return value, so you may be dereferencing null (as pointed out by kotlinski). In practice, this is likely to cause an immediate SIGSEGV and program termination.
An improper and dangerous use would be:
char array[100];
// ... Read line into uncheckedInput
// Extract substring without checking length
strcpy(array, uncheckedInput + 10);
This is unsafe because the strcpy may overflow, causing undefined behavior. In practice, this is likely to overwrite other local variables (itself a major security breach). One of these may be the return address. Through a return-to-libc attack, the attacker may be able to use C functions like system to execute arbitrary programs. There are other possible consequences to overflows.
However, gets is indeed inherently unsafe, and will be removed from the next version of C (C1X). There is simply no way to ensure the input won't overflow (causing the same consequences given above). Some people would argue it's safe when used with a known input file, but there's really no reason to ever use it. POSIX's getline is a far better alternative.
Also, the length of str1 doesn't vary by compiler. It should always be 17, including the terminating NUL.
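Putting the two remarks together (check the malloc result, and size the buffer from the string itself), a minimal sketch could look like this. The helper name duplicate_string is hypothetical. Here strcpy is safe because the destination is allocated with exactly strlen(src) + 1 bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of src, or NULL
 * if the allocation fails. The caller must free() the result. */
static char *duplicate_string(const char *src)
{
    assert(src != NULL);

    size_t size = strlen(src) + 1; /* +1 for the terminating NUL */
    char *copy = malloc(size);

    if (copy == NULL)
        return NULL; /* report the failure instead of dereferencing NULL */
    strcpy(copy, src); /* cannot overflow: copy has exactly size bytes */
    return copy;
}
```

For the literal in the question, strlen gives 16 and sizeof gives 17, the extra byte being the terminator.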
You are forcefully stuffing completely different things into one category.
The function gets is indeed always dangerous. There's no way to make a safe call to gets regardless of what steps you are willing to take and how defensive you are willing to get.
Function strcpy is perfectly safe if you are willing to take the [simple] necessary steps to make sure that your calls to strcpy are safe.
That already puts gets and strcpy in vastly different categories, which have nothing in common with regard to safety.
The popular criticisms directed at safety aspects of strcpy are based entirely on anecdotal social observations as opposed to formal facts, e.g. "programmers are lazy and incompetent, so don't let them use strcpy". Taken in the context of C programming, this is, of course, utter nonsense. Following this logic we should also declare the division operator exactly as unsafe for exactly the same reasons.
In reality, there are no problems with strcpy whatsoever. gets, on the other hand, is a completely different story, as I said above.
yes, it is dangerous. After 5 years of maintenance, your code will look like this:
int main(void)
{
char *str1 = "abcdefghijklmnop";
{enough lines have been inserted here so as to not have str1 and str2 nice and close to each other on the screen}
char *str2 = malloc(100);
strcpy(str2, str1);
}
at that point, someone will go and change str1 to
str1 = "THIS IS A REALLY LONG STRING WHICH WILL NOW OVERRUN ANY BUFFER BEING USED TO COPY IT INTO UNLESS PRECAUTIONS ARE TAKEN TO RANGE CHECK THE LIMITS OF THE STRING. AND FEW PEOPLE REMEMBER TO DO THAT WHEN BUGFIXING A PROBLEM IN A 5 YEAR OLD BUGGY PROGRAM"
and forget to look where str1 is used and then random errors will start happening...
Your code is not safe. The return value of malloc is unchecked, if it fails and returns 0 the strcpy will give undefined behavior.
Besides that, I see no problem other than that the example basically does not do anything.
strcpy isn't dangerous as long as you know that the destination buffer is large enough to hold the characters of the source string; otherwise strcpy will happily copy more characters than your target buffer can hold, which can lead to several unfortunate consequences (stack/other variables overwriting, which can result in crashes, stack smashing attacks & co.).
But: if you have a generic char * in input which hasn't been already checked, the only way to be sure is to apply strlen to such string and check if it's too large for your buffer; however, now you have to walk the entire source string twice, once for checking its length, once to perform the copy.
This is suboptimal, since, if strcpy were a little bit more advanced, it could receive as a parameter the size of the buffer and stop copying if the source string were too long; in a perfect world, this is how strncpy would perform (following the pattern of other strn*** functions). However, this is not a perfect world, and strncpy is not designed to do this. Instead, the nonstandard (but popular) alternative is strlcpy, which, instead of going out of the bounds of the target buffer, truncates.
Several CRT implementations do not provide this function (notably glibc before version 2.38), but you can still get one of the BSD implementations and put it in your application. A standard (but slower) alternative can be to use snprintf with "%s" as format string.
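To make the truncating behaviour concrete, here is a sketch of a strlcpy-style function. It is not the BSD implementation, and the name copy_truncating is invented for this example. It still calls strlen on the source first, so it does not avoid the double traversal mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical strlcpy-style helper (not the BSD code): copies at most
 * dst_size - 1 bytes of src into dst and always terminates dst when
 * dst_size > 0. Returns strlen(src), so a return value >= dst_size
 * means the copy was truncated. */
static size_t copy_truncating(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);

    size_t src_len = strlen(src);

    if (dst_size > 0) {
        size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return src_len;
}
```

A caller can compare the return value against the buffer size to find out whether truncation happened.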
That said, since you're programming in C++ (edit I see now that the C++ tag has been removed), why don't you just avoid all the C-string nonsense (when you can, obviously) and go with std::string? All these potential security problems vanish and string operations become much easier.
The only way malloc may fail is when an out-of-memory error occurs, which is a disaster by itself. You cannot reliably recover from it because virtually anything may trigger it again, and the OS is likely to kill your process anyway.
As you point out, under constrained circumstances strcpy isn't dangerous. It is more typical to take in a string parameter and copy it to a local buffer, which is when things can get dangerous and lead to a buffer overrun. Just remember to check your copy lengths before calling strcpy and null terminate the string afterward.
Aside from potentially dereferencing NULL (as you do not check the result from malloc), which is UB and likely not a security threat, there is no potential security problem with this.
gets() is always unsafe; the other functions can be used safely.
gets() is unsafe even when you have full control on the input -- someday, the program may be run by someone else.
The only safe way to use gets() is to use it for a single run thing: create the source; compile; run; delete the binary and the source; interpret results.

Valgrind Warning: Should I Take It Seriously

Background:
I have a small routine that mimics fgets(character, 2, fp) except it takes a character from a string instead of a stream. newBuff is dynamically allocated string passed as a parameter and character is declared as char character[2].
Routine:
character[0] = newBuff[0];
character[1] = '\0';
strcpy(newBuff, newBuff+1);
The strcpy replicates the loss of information as each character is read from it.
Problem: Valgrind warns me about this activity: "Source and destination overlap in strcpy(0x419b818, 0x419b819)".
Should I worry about this warning?
Probably the standard does not specify what happens when these buffers overlap. So yes, valgrind is right to complain about this.
In practical terms you will most likely find that your strcpy copies in order from left-to-right (eg. while (*dst++ = *src++);) and that it's not an issue. But it is still incorrect and may have issues when running with other C libraries.
One standards-correct way to write this would be:
memmove(newBuff, newBuff+1, strlen(newBuff));
Because memmove is defined to handle overlap. (Although here you would end up traversing the string twice, once to check the length and once to copy. I also took a shortcut, since strlen(newBuff) should equal strlen(newBuff+1)+1, which is what I originally wrote.)
Yes, and you should also worry that your function has pathologically bad performance (O(n^2) for a task that should be O(n)). Moving the entire contents of the string back by a character every time you read a character is a huge waste of time. Instead you should just keep a pointer to the current position and increment that pointer.
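The pointer-based approach suggested above might be sketched like this. The struct char_reader and the function next_char are hypothetical names; the idea is simply that reading a character advances a cursor instead of shifting the whole string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader state: a cursor into a string owned elsewhere. */
struct char_reader {
    const char *pos;
};

/* Stores the next character as a one-character string in out and
 * advances the cursor. Returns 1 on success, 0 at the end of the
 * string. Each call is O(1); no bytes are moved. */
static int next_char(struct char_reader *r, char out[2])
{
    assert(r != NULL && r->pos != NULL);

    if (*r->pos == '\0')
        return 0;
    out[0] = *r->pos++;
    out[1] = '\0';
    return 1;
}
```

The original buffer is left unmodified, so the Valgrind overlap warning does not arise at all.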
Situations where you find yourself needing memmove or the equivalent (copying between buffers that overlap) almost always indicate a design flaw. Often it's not just a flaw in the implementation but in the interface.
Yes -- the behavior of strcpy is only defined if the source and dest don't overlap. You might consider a combination of strlen and memmove instead.
Yes, you should worry. The C standard states that the behavior of strcpy is undefined when the source and destination objects overlap. Undefined behavior means it may work sometimes, or it may fail, or it may appear to succeed but manifest failure elsewhere in the program.
The behavior of strcpy() is officially undefined if source and destination overlap.
From the manpage for memcpy comes a suggestion:
The memcpy() function copies n bytes from memory area s2 to memory area s1. If s1 and s2 overlap, behavior is undefined. Applications in which s1 and s2 might overlap should use memmove(3) instead.
The answer is yes: with certain compiler/library implementations, newest ones I guess, you'll end up with a bogus result. See How is strcpy implemented? for an example.

How to implement memmove in standard C without an intermediate copy?

From the man page on my system:
void *memmove(void *dst, const void *src, size_t len);
DESCRIPTION
The memmove() function copies len bytes from string src to string dst.
The two strings may overlap; the copy is always done in a non-destructive
manner.
From the C99 standard:
6.5.8.5 When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object or incomplete types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.
The emphasis is mine.
The arguments dst and src can be converted to pointers to char so as to alleviate strict aliasing problems, but is it possible to compare two pointers that may point inside different blocks, so as to do the copy in the correct order in case they point inside the same block?
The obvious solution is if (src < dst), but that is undefined if src and dst point to different blocks. "Undefined" means you should not even assume that the condition returns 0 or 1 (this would have been called "unspecified" in the standard's vocabulary).
An alternative is if ((uintptr_t)src < (uintptr_t)dst), which is at least unspecified, but I am not sure that the standard guarantees that when src < dst is defined, it is equivalent to (uintptr_t)src < (uintptr_t)dst. Pointer comparison is defined from pointer arithmetic. For instance, when I read section 6.5.6 on addition, it seems to me that pointer arithmetic could go in the direction opposite to uintptr_t arithmetic, that is, that a compliant compiler might have, when p is of type char*:
((uintptr_t)p)+1==((uintptr_t)(p-1))
This is only an example. Generally speaking very little seems to be guaranteed when converting pointers to integers.
This is a purely academic question, because memmove is provided together with the compiler. In practice, the compiler authors can simply promote undefined pointer comparison to unspecified behavior, or use the relevant pragma to force their compiler to compile their memmove correctly. For instance, this implementation has this snippet:
if ((uintptr_t)dst < (uintptr_t)src) {
/*
* As author/maintainer of libc, take advantage of the
* fact that we know memcpy copies forwards.
*/
return memcpy(dst, src, len);
}
I would still like to use this example as proof that the standard goes too far with undefined behaviors, if it is true that memmove cannot be implemented efficiently in standard C. For instance, no-one ticked when answering this SO question.
I think you're right, it's not possible to implement memmove efficiently in standard C.
The only truly portable way to test whether the regions overlap, I think, is something like this:
for (size_t l = 0; l < len; ++l) {
if ((src + l == dst) || (src + l == dst + len - 1)) {
// they overlap, so now we can use comparison,
// and copy forwards or backwards as appropriate.
...
return dst;
}
}
// No overlap, doesn't matter which direction we copy
return memcpy(dst, src, len);
You can't implement either memcpy or memmove all that efficiently in portable code, because the platform-specific implementation is likely to kick your butt whatever you do. But a portable memcpy at least looks plausible.
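A complete version of the equality-scan idea could look roughly like the sketch below. It is one possible way to fill in the elided part, not the answerer's code, and the name scan_memmove is invented. It relies only on == between pointers, which is defined even for pointers into different objects, and pays for that with an O(len) scan before the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical portable memmove: decides the copy direction using
 * pointer equality only, never relational comparison. */
static void *scan_memmove(void *dst_v, const void *src_v, size_t len)
{
    unsigned char *dst = dst_v;
    const unsigned char *src = src_v;
    int dst_inside_src = 0;

    assert(len == 0 || (dst_v != NULL && src_v != NULL));

    /* Does dst point strictly inside the source region? Every src + i
     * with i < len stays within the source object, so each comparison
     * is well defined. */
    for (size_t i = 1; i < len; ++i) {
        if (src + i == dst) {
            dst_inside_src = 1;
            break;
        }
    }

    if (dst_inside_src) {
        /* dst is above src and overlaps it: copy from the end. */
        for (size_t k = len; k-- > 0; )
            dst[k] = src[k];
    } else {
        /* No overlap, dst == src, or dst below src: copy forwards. */
        for (size_t k = 0; k < len; ++k)
            dst[k] = src[k];
    }
    return dst_v;
}
```

Only the case where dst lies inside the source needs a backward copy; a forward byte copy is already safe when the regions are disjoint or when dst is below src.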
C++ introduced a pointer specialization of std::less, which is defined to work for any two pointers of the same type. It might in theory be slower than <, but obviously on a non-segmented architecture it isn't.
C has no such thing, so in a sense, the C++ standard agrees with you that C doesn't have enough defined behaviour. But then, C++ needs it for std::map and so on. It's much more likely that you'd want to implement std::map (or something like it) without knowledge of the implementation than that you'd want to implement memmove (or something like it) without knowledge of the implementation.
For two memory areas to be valid and overlapping, I believe you would need to be in one of the defined situations of 6.5.8.5. That is, two areas of an array, union, struct, etc.
The reason other situations are undefined is that two different objects might not even be in the same kind of memory, with the same kind of pointer. On PC architectures, addresses are usually just 32-bit addresses into virtual memory, but C supports all kinds of bizarre architectures, where memory is nothing like that.
The reason that C leaves things undefined is to give leeway to the compiler writers when the situation doesn't need to be defined. The way to read 6.5.8.5 is a paragraph carefully describing architectures that C wants to support where pointer comparison doesn't make sense unless it's inside the same object.
Also, the reason memmove and memcpy are provided by the compiler is that they are sometimes written in tuned assembly for the target CPU, using a specialized instruction. They are not meant to be able to be implemented in C with the same efficiency.
For starters, the C standard is notorious for having problems in the details like this. Part of the problem is because C is used on multiple platforms and the standard attempts to be abstract enough to cover all current and future platforms (which might use some convoluted memory layout that's beyond anything we've ever seen). There is a lot of undefined or implementation-specific behavior in order for compiler writers to "do the right thing" for the target platform. Including details for every platform would be impractical (and constantly out-of-date); instead, the C standard leaves it up to the compiler writer to document what happens in these cases. "Unspecified" behavior only means that the C standard doesn't specify what happens, not necessarily that the outcome cannot be predicted. The outcome is usually still predictable if you read the documentation for your target platform and your compiler.
Since determining if two pointers point to the same block, memory segment, or address space depends on how the memory for that platform is laid out, the spec does not define a way to make that determination. It assumes that the compiler knows how to make this determination. The part of the spec you quoted said that result of pointer comparison depends on the pointers' "relative location in the address space". Notice that "address space" is singular here. This section is only referring to pointers that are in the same address space; that is, pointers that are directly comparable. If the pointers are in different address spaces, then the result is undefined by the C standard and is instead defined by the requirements of the target platform.
In the case of memmove, the implementor generally determines first if the addresses are directly comparable. If not, then the rest of the function is platform-specific. Most of the time, being in different memory spaces is enough to ensure that the regions don't overlap and the function turns into a memcpy. If the addresses are directly comparable, then it's just a simple byte copy process starting from the first byte and going forward or from the last byte and going backwards (whichever one will safely copy the data without clobbering anything).
All in all, the C standard leaves a lot intentionally unspecified where it can't write a simple rule that works on any target platform. However, the standard writers could have done a better job explaining why some things are not defined and used more descriptive terms like "architecture-dependent".
Here's another idea, but I don't know if it's correct. To avoid the O(len) loop in Steve's answer, one could put it in the #else clause of an #ifdef UINTPTR_MAX with the cast-to-uintptr_t implementation. Provided that cast of unsigned char * to uintptr_t commutes with adding integer offsets whenever the offset is valid with the pointer, this makes the pointer comparison well-defined.
I'm not sure whether this commutativity is defined by the standard, but it would make sense, as it works even if only the lower bits of a pointer are an actual numeric address and the upper bits are some sort of black box.
I would still like to use this example as proof that the standard goes too far with undefined behaviors, if it is true that memmove cannot be implemented efficiently in standard C
But it's not proof. There's absolutely no way to guarantee that you can compare two arbitrary pointers on an arbitrary machine architecture. The behaviour of such a pointer comparison cannot be legislated by the C standard or even a compiler. I could imagine a machine with a segmented architecture that might produce a different result depending on how the segments are organised in RAM or might even choose to throw an exception when pointers into different segments are compared. This is why the behaviour is "undefined". The exact same program on the exact same machine might give different results from run to run.
The oft given "solution" of memmove() using the relationship of the two pointers to choose whether to copy from the beginning to the end or from the end to the beginning only works if all memory blocks are allocated from the same address space. Fortunately, this is usually the case although it wasn't in the days of 16 bit x86 code.
