What's wrong with strcmp?

In the responses to the question Reading In A String and comparing it C, more than one person discouraged the use of strcmp(), saying things like
I also strongly, strongly advise you to get used to using strncmp()
now, ... to avoid many problems down the road.
or (in Why does my string comparison fail?)
Make certain you use strncmp and not strcmp. strcmp is profoundly
unsafe.
What problems are they alluding to?
The reason scanf() with string specifiers and gets() are strongly discouraged is that they almost inevitably lead to buffer overflow vulnerabilities. However, it's not possible to overflow a buffer with strcmp(), right?
"A buffer overflow, or buffer overrun, is an anomaly where a program, while writing data to a buffer, overruns the buffer's boundary and overwrites adjacent memory."
( -- Wikipedia: buffer overflow).
Since the strcmp() function never writes to any buffer, the strcmp() function cannot cause a buffer overflow, right?
What is the reason people discourage the use of strcmp(), and recommend strncmp() instead?

While strncmp can prevent you from overrunning a buffer, its primary purpose isn't safety. Rather, it exists for the case where one wants to compare only the first N characters of a (possibly non-NUL-terminated) string.
From the man page:
The strcmp() function compares the two strings s1 and s2. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.
The strncmp() function is similar, except it compares only the first (at most) n bytes of s1 and s2.
Note that strncmp in this case cannot be replaced with a simple memcmp, because you still need to take advantage of its stop-on-NUL behavior, in case one of the strings is shorter than n.
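To make the stop-on-NUL point concrete, here is a minimal sketch (the strings are made up for illustration): strncmp stops at the terminator of the shorter string, where memcmp would keep reading.
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "ab";   /* occupies exactly 3 bytes: 'a', 'b', '\0' */
    /* strncmp stops at the NUL in s, so even with n == 5 it never
       reads past the end of the 3-byte literal: */
    if (strncmp(s, "abcde", 5) != 0)
        puts("different: strncmp stopped at the NUL");
    /* memcmp(s, "abcde", 5) would blindly read 5 bytes from s,
       running 2 bytes past its end - undefined behavior. */
    return 0;
}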
If strcmp causes a buffer overrun, then one of two things is true:
Your data isn't expected to be NUL-terminated, and you should be using memcmp instead.
Your data is expected to be NUL-terminated, but you've already screwed up when you populated the buffer, by somehow not NUL-terminating it.
Note that reading past the end of a buffer is still considered a buffer overrun. While it may seem harmless, it can be just as dangerous as writing past the end.
Reading, writing, executing... it doesn't matter. Any memory reference to an unintended address is undefined behavior. In the most obvious scenario, you attempt to access a page that isn't mapped into your process's address space, causing a page fault and a subsequent SIGSEGV. In the worst case, you sometimes run into a \0 byte, but other times you run into some other buffer, causing inconsistent program behavior.

A string is by definition "a contiguous sequence of characters terminated by and including the first null character".
The only case where strncmp() would be safer than strcmp() is when you're comparing two character arrays as strings, you're certain that both arrays are at least n bytes long (the 3rd argument passed to strncmp()), and you're not certain that both arrays contain strings (i.e., contain a '\0' null character terminator).
In most cases, your code (if it's correct) will guarantee that any arrays that are supposed to contain null-terminated strings actually do contain null-terminated strings.
That added n in strncmp() is not a magic wand that makes unsafe code safe. It doesn't guard against null pointers, uninitialized pointers, uninitialized arrays, an incorrect value of n, or just passing incorrect data. You can shoot yourself in the foot with either function.
And if you're trying to call strcmp or strncmp with an array that you thought contained a null-terminated string but actually doesn't, then your code already has a bug. Using strncmp() might help you avoid the immediate symptom of that bug, but it won't fix it.
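For illustration, here is a minimal sketch of that one case; the 8-byte record field and the function name are hypothetical:
#include <string.h>

/* A hypothetical 8-byte field read straight from a binary record;
   nothing guarantees it contains a '\0'. */
int is_status(const char field[8]) {
    /* strcmp(field, "STATUS") could run past the end of field while
       looking for a '\0'. strncmp bounded by the field size reads at
       most 8 bytes, and still stops early at the literal's own NUL: */
    return strncmp(field, "STATUS", 8) == 0;
}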

strcmp compares two strings character by character until a difference is detected or a \0 is found in one of them.
On the other hand, strncmp provides a way to limit the number of characters to be compared, so if the strings do not end with \0, the function won't continue checking after the size limit has been reached.
Imagine what would happen if you are comparing two strings at these two memory regions:
0x40, 0x41, 0x42,...
0x40, 0x41, 0x42,...
And you are only interested in the first two characters. Somehow the \0 has been removed from the end of the strings, and the third byte happens to coincide in the two regions. strncmp would avoid comparing this third byte if the num parameter is 2.
EDIT
As the comments below indicate, this situation results from incorrect or very specific use of the language.

Related

Properties of strcpy()

I have a global definition as following:
#define globalstring "example1"

typedef struct
{
    char key[100];
    char trail[10][100];
    bson_value_t value;
} ObjectInfo;

typedef struct
{
    ObjectInfo CurrentOrderInfoSet[5];
} DataPackage;

DataPackage GlobalDataPackage[10];
And I would like to use the strcpy() function in some of my functions as following:
strcpy(GlobalDataPackage[2].CurrentOrderInfoSet[0].key, "example2");
char string[100] = "example3";
strcpy(GlobalDataPackage[2].CurrentOrderInfoSet[0].key, string);
strcpy(GlobalDataPackage[2].CurrentOrderInfoSet[0].key, globalstring);
First question: Are the globally defined strings all initialized with 100 '\0' characters?
Second question: I am a bit confused as to how exactly strcpy() works. Does it only overwrite the characters necessary to place the source string into the destination string, plus a \0 at the end, and leave the rest as it is, or does it fully delete any content of the destination string prior to that?
Third question: All my strings have a fixed length of 100. If I use the 3 examples of strcpy() above, with my strings not exceeding 99 characters, does strcpy() properly overwrite the destination string and NULL terminate it? Meaning, do I run into problems when using functions like strlen() or printf() later?
Fourth question: What happens when I strcpy() empty strings?
I plan to overwrite these strings multiple times in loops, and would like to know if it would be safer to use memset() to fully "empty" the strings prior to strcpy() on every iteration.
Thx.
Are the globally defined strings all initialized with 100 '\0' characters?
Yes. Global char arrays will be initialized to all zeros.
I am a bit confused as to how exactly strcpy() works. Does it only overwrite the characters necessary to place the source string into the destination string plus a \0 at the end and leave the rest as it is?
Exactly. It copies the characters up to and including the '\0' and does not care about the rest.
If I use ... my strings not exceeding 99 characters, does strcpy() properly overwrite the destination string and NULL terminate it?
Yes, but NULL is a pointer; the string is terminated with a zero byte, sometimes called NUL. You might want to see What is the difference between NUL and NULL?
Meaning do I run into problems when using functions like strlen(), printf() later?
Not if your string lengths are less than or equal to 99.
What happens when I strcpy() empty strings?
It just copies one zero byte.
would like to know if it would be safer to use memset() to fully "empty" the strings prior to strcpy() on every iteration.
Safety is a broad concept. As far as whether the program will execute properly is concerned, there is nothing after the zero byte worth caring about, so just strcpy it.
But you should check that your strings are no longer than 99 characters and decide what to do if they are longer. You might be interested in strnlen, but its interface is confusing - I recommend using memcpy and setting the zero byte explicitly yourself.
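A minimal sketch of that memcpy approach, reusing the 100-byte key field from the question (the function name copy_key is made up):
#include <string.h>

/* Copy src into a 100-byte field, truncating if necessary and always
   leaving the field NUL-terminated. */
void copy_key(char dest[100], const char *src) {
    size_t len = strlen(src);
    if (len > 99)            /* keep one byte for the terminator */
        len = 99;
    memcpy(dest, src, len);
    dest[len] = '\0';
}
Usage would then be copy_key(GlobalDataPackage[2].CurrentOrderInfoSet[0].key, string); instead of the bare strcpy.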

A little query, String in C

Recently I was programming in Code::Blocks, and I wrote a little program in C just as a hobby.
char littleString[1];
fflush( stdin );
scanf( "%s", littleString );
printf( "\n%s", littleString);
If I created a string of one character, why does Code::Blocks allow me to store 13 characters in it?
C has no bounds-checking; writing out of bounds of arrays or dynamically allocated memory can't be caught by the compiler. Instead it leads to undefined behavior.
To prevent buffer overflow with scanf you can tell it to only read a specific number of characters, and nothing more. So to tell it to read only one character you use the format "%1s".
As a small side-note: Remember that strings in C have an extra character in them, the terminator (character '\0'). So if you have a string that should contain one character, the size actually needs to be two characters.
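Putting those two points together, a corrected sketch of the original program might look like this:
#include <stdio.h>

int main(void) {
    char littleString[2];                 /* 1 character + terminator */
    if (scanf("%1s", littleString) == 1)  /* width limit: at most 1 char */
        printf("\n%s", littleString);
    return 0;
}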
littleString is not a string. It is a char array of length one. In order for a char array to be a string, it must be null-terminated with a '\0'. You are writing past the memory you have allotted for littleString; this is undefined behavior.
scanf just reads user input from the console and assigns it to the variable specified, in this case littleString. If you would like to control the length of the user input assigned to the variable, I would suggest using scanf_s. Please note that scanf_s is not part of the C99 standard.
Many functions in C are implemented without any checks for correctness of use. In other words, it is the caller's responsibility to ensure that the arguments fulfill the rules set by the function.
Example: For strcpy the Linux man page says
The strcpy() function copies the string pointed to by src,
including the terminating null byte ('\0'), to the buffer
pointed to by dest. The strings may not overlap, and the
destination string dest must be large enough to receive the copy.
If you as a caller break that contract by passing a too small buffer, you'll have undefined behavior and anything can happen.
The program may crash, or it may even do exactly what you expected 99 times out of 100 and do something strange 1 time out of 100.
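To illustrate the caller's side of that contract, a small sketch (the buffer size is chosen arbitrarily):
#include <string.h>

void example(const char *src) {
    char dest[16];
    /* Honor the contract: only copy if dest has room for the whole
       string plus its terminating '\0'. */
    if (strlen(src) < sizeof dest)
        strcpy(dest, src);
    /* else: handle the oversized input instead of invoking UB */
}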

Is there an idiomatic use of strncmp()?

The strncmp() function really only has one use case (for lexicographical ordering):
One of the strings has a known length,† the other string is known to be NUL terminated. (As a bonus, the string with known length need not be NUL terminated at all.)
The reasons I believe there is just one use case (prefix match detection is not lexicographical ordering):‡ (1) If both strings are NUL terminated, strcmp() should be used, as it will do the job correctly; and (2) If both strings have known length, memcmp() should be used, as it will avoid the unnecessary check against NUL on a byte per byte basis.
I am seeking an idiomatic (and readable) way to use the function to lexicographically compare two such arguments correctly (one of them is NUL terminated, one of them is not necessarily NUL terminated, with known length).
Does an idiom exist? If so, what is it? If not, what should it be, or what should be used instead?
Simply using the result of strncmp() won't work, because it will result in a false equality result in the case that the argument with known length is shorter than the NUL terminated one, and it happens to be a prefix. Therefore, extra code is required to test for that case.
As a standalone function I don't see much wrong with this construction, and it appears idiomatic:
/* s1 is NUL terminated */
int variation_as_function (const char *s1, const char *s2, size_t s2len) {
    int result = strncmp(s1, s2, s2len);
    if (result == 0) {
        result = (s1[s2len] != '\0');
    }
    return result;
}
However, when inlining this construction into code, it results in a double test for 0 when equality needs special action:
int result = strncmp(key, input, inputlen);
if (result == 0) {
    result = (key[inputlen] != '\0');
}
if (result == 0) {
    do_something();
} else {
    do_something_else();
}
The motivation for inlining the call is because the standalone function is esoteric: It matters which string argument is NUL terminated and which one is not.
Please note, the question is not about performance, but about writing idiomatic code and adopting best practices for coding style. I see there is some DRY violation with the comparison. Is there a straightforward way to avoid the duplication?
† By known length, I mean the length is correct (there is no embedded NUL that would truncate the length). In other words, the input was validated at some earlier point in the program, and its length was recorded, but the input is not explicitly NUL terminated. As a hypothetical example, a scanner on a stream of text could have this property.
‡ As has been pointed out by addy2012, strncmp() could be used for prefix matching. I was focused on lexicographical ordering. However, (1) if the length of the prefix string is used as the length argument, both arguments need to be NUL terminated to guard against reading past an input string shorter than the prefix string; and (2) if the minimum of the two lengths (prefix string and input string) is known, then memcmp() would be a better choice, providing equivalent functionality at less CPU cost and no loss in readability.
The strncmp() function really only has one use case:
One of the strings has a known length, the other string is known to be
NUL terminated.
No, you can use it to compare the beginnings of two strings, whether or not the length of either string is known. For example, if you have an array or a list of last names and you want to find all that begin with "Mac".
In fact, strncmp should generally be used in preference to strcmp unless you absolutely know that both strings are well-formed and NUL-terminated.
Why? Because otherwise you have a vulnerability to buffer overflows.
This rule is unfortunately not followed often.
There are a lot of buffer overflow errors.
Update
I think the core error here is in "one of the strings has a known length". No C string has a known length a priori. They're not like Pascal or Java strings, which are essentially a pair of (length, buffer). A C string is by definition a char[] identifying a chunk of memory, with the distinguished symbol \0 to identify the end. strncmp, strncpy, etc. exist to protect against attempts to use a chunk of memory as a string that is not well-formed.
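To see why the raw strncmp result is not enough in the questioner's scenario, consider this small sketch (it assumes variation_as_function from the question is in scope):
#include <assert.h>
#include <string.h>

int variation_as_function(const char *s1, const char *s2, size_t s2len);

void demo(void) {
    /* "foo" has known length 3 and is a prefix of "foobar", so strncmp
       alone reports equality even though the strings differ: */
    assert(strncmp("foobar", "foo", 3) == 0);
    /* The wrapper catches this case, because "foobar"[3] is 'b', not '\0': */
    assert(variation_as_function("foobar", "foo", 3) != 0);
}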

need for the last '\0' in fgets

I've seen several usage of fgets (for example, here) that go like this:
char buff[7]="";
(...)
fgets(buff, sizeof(buff), stdin);
The point being that, if I supply a long input like "aaaaaaaaaaa", fgets will truncate it to "aaaaaa" here, because the 7th character will be used to store the '\0'.
However, when doing this:
int i = 0;
for (i = 0; i < 7; i++)
{
    buff[i] = 'a';
}
printf("%s\n", buff);
I will always get 7 'a's, and the program will not crash. But if I try to write 8 'a's, it will.
As I discovered later, the reason for this is that, at least on my system, when I allocate char buff[7] (with or without =""), the 8th byte (counting from 1, not from 0) gets set to 0. My guess is that things are done like this precisely so that a for loop with 7 writes, followed by a string-formatted read, can succeed whether the last character written was '\0' or not, thus sparing the programmer from having to set the last '\0' himself when writing chars individually.
From this, it follows that in the case of
fgets(buff, sizeof(buff), stdin);
and then providing a too long input, the resulting buff string will automatically have two '\0' characters: one inside the array, and one right after it that was written by the system.
I have also observed that doing
fgets(buff,(sizeof(buff)+17),stdin);
will still work, and output a very long string, without crashing. From what I guessed, this is because fgets will keep writing until sizeof(buff)+17, and the last char to be written will precisely be a '\0', ensuring that any forthcoming string reading process would terminate properly (although the memory is messed up anyway).
But then, what about fgets(buff, (sizeof(buff)+1), stdin);? This would use up all the space that was rightfully allocated in buff, and then write a '\0' right after it, thus overwriting... the '\0' previously written by the system. In other words, yes, fgets would go out of bounds, but it can be proven that when adding only one to the length of the write, the program will never crash.
So in the end, here comes the question: why does fgets always terminate its write with a '\0', when another '\0', placed by the system right after the array, already exists? Why not do as in the one-by-one for-loop based write, which can access the whole of the array and write anything the programmer wants, without endangering anything?
Thank you very much for your answer!
EDIT: indeed, there is no proof possible, as long as I do not know whether this 8th '\0' that mysteriously appears upon allocation of buff[7], is part of the C standard or not, specifically for string arrays. If not, then...it's just luck that it works :-)
but it can be proven that when adding only one to the length of the write, the program will never crash.
No! You can't prove that! Not in the sense of a mathematical proof. You have only shown that on your system, with your compiler, with the particular compiler settings you used, with a particular environment configuration, it might not crash. This is far from a mathematical proof!
In fact the C standard itself, although it guarantees that you can get the address of "one place after the last element of an array", it also states that dereferencing that address (i.e. trying to read or write from that address) is undefined behaviour.
That means that an implementation can do anything in this case. It can even do what you expect with naive reasoning (i.e. work - but it's sheer luck), but it may also crash, or it may even format your HD (if you are very, very unlucky). This is especially true when writing system software (e.g. a device driver or a program running on the bare metal), i.e. when there is no OS to shield you from the nastiest consequences of writing bad code!
Edit This should answer the question made in a comment (C99 draft standard):
7.19.7.2 The fgets function
Synopsis
#include <stdio.h>
char *fgets(char * restrict s, int n,
            FILE * restrict stream);
Description
The fgets function reads at most one less than the number of characters specified by n
from the stream pointed to by stream into the array pointed to by s. No additional
characters are read after a new-line character (which is retained) or after end-of-file. A
null character is written immediately after the last character read into the array.
Returns
The fgets function returns s if successful. If end-of-file is encountered and no
characters have been read into the array, the contents of the array remain unchanged and a
null pointer is returned. If a read error occurs during the operation, the array contents are
indeterminate and a null pointer is returned.
Edit: Since it seems that the problem lies in a misunderstanding of what a string is, this is the relevant excerpt from the standard (emphasis mine):
7.1.1 Definitions of terms
A string is a contiguous sequence of characters terminated by and including the first null
character. The term multibyte string is sometimes used instead to emphasize special
processing given to multibyte characters contained in the string or to avoid confusion
with a wide string. A pointer to a string is a pointer to its initial (lowest addressed)
character. The length of a string is the number of bytes preceding the null character and
the value of a string is the sequence of the values of the contained characters, in order.
From C11 standard draft:
The fgets function reads at most one less than the number of characters specified by n
from the stream pointed to by stream into the array pointed to by s. No additional
characters are read after a new-line character (which is retained) or after end-of-file. A
null character is written immediately after the last character read into the array.
The fgets function returns s if successful. If end-of-file is encountered and no
characters have been read into the array, the contents of the array remain unchanged and a
null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.
The behaviour you describe is undefined.
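For reference, a minimal sketch of the intended, in-bounds use of fgets with the 7-byte buffer from the question:
#include <stdio.h>

int main(void) {
    char buff[7] = "";
    /* Pass the real size of the buffer, never sizeof(buff) + n: fgets
       then reads at most 6 characters and writes its own terminating
       '\0' inside the array. */
    if (fgets(buff, sizeof buff, stdin) != NULL)
        printf("%s", buff);
    return 0;
}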

Does strlen() in a strncmp() expression defeat the purpose of using strncmp() over strcmp()?

By my understanding, strcmp() (no 'n'), upon seeing a null character in either argument, immediately stops processing and returns a result.
Therefore, if one of the arguments is known with 100% certainty to be null-terminated (e.g. it is a string literal), there is no security benefit whatsoever in using strncmp() (with 'n') with a call to strlen() as part of the third argument to limit the comparison to the known string length, because strcmp() will already never read more characters than are in that known-terminating string.
In fact, it seems to me that a call to strncmp() whose length argument is a strlen() on one of the first two arguments is only different from the strcmp() case in that it wastes time linear in the size of the known-terminating string by evaluating the strlen() expression.
Consider:
Sample code A:
if (strcmp(user_input, "status") == 0)
    reply_with_status();
Sample code B:
if (strncmp(user_input, "status", strlen("status")+1) == 0)
    reply_with_status();
Is there any benefit to the latter over the former? Because I see it in other people's code a lot.
Do I have a flawed understanding of how these functions work?
In your particular example, I would say it's detrimental to use strncmp because of:
Using strlen does the scan anyway
Repetition of the string literal "status"
Addition of 1 to make sure that strings are indeed equal
All these add to clutter, and in neither case would you be protected from overrunning user_input if it indeed was shorter than 6 characters and contained the same characters as the test string.
That would be exceptional. If you know that your input string always has more memory than the number of characters in your test strings, then don't worry. Otherwise, you might need to worry, or at least consider it. strncmp is useful for testing stuff inside large buffers.
My preference is for code readability.
Yes, it does. If you use strlen in a strncmp, it will traverse the string until it sees a null character. That makes it functionally equivalent to strcmp.
In the particular case you've given, it's indeed useless. However, a slight alteration is more common:
if (strncmp(user_input, "status", strlen("status")) == 0)
    reply_with_status();
This version just checks if user_input starts with "status", so it has different semantics.
Beyond tricks to check if the beginning of a string matches an input, strncmp is only going to be useful when you are not 100% sure that a string is null terminated before the end of its allocated space.
So, if you had a fixed-size buffer that you read your user input into, you could use:
strncmp(user_input, "status", sizeof(user_input))
This ensures that your comparison does not overrun the buffer.
However, in this case you do have to be careful: if your user_input wasn't null-terminated, then it will really be checking whether user_input matches the beginning of "status".
A better way might be to say:
if (user_input[sizeof(user_input) - 1] != '\0') {
    // handle it, since it is _not_ equal to your string
    // unless filling the buffer is valid
}
else if (strcmp(user_input, "status") == 0) { ... }
Now, I agree that this isn't a particularly useful use of strncmp(), and I can't see any benefit in it over strcmp().
However, if we change the code by removing the +1 after strlen, then it starts to be useful.
strncmp(user_input, "status", strlen("status"))
since that compares the first 6 characters of user_input with "status" - which, at least sometimes, is meaningful.
So, if the +1 is there, it becomes a regular strcmp - and it's just a waste of time calculating the length. But without the +1, it's quite a useful comparison (under the right circumstances).
strncmp() has limited use. The normal strcmp() will stop if it encounters a NUL in either of the two strings (and if only one of them ends there, the strings are different). strncmp() will additionally stop after at most N characters and return zero ("the strings are equal in their first N characters").
One possible use of strncmp() is parsing options up to the non-significant part, e.g.
if (!strncmp("-st", argv[xx], 3)) {}
, which would return zero for "-string" or "-state" or "-st0", but not for "-sadistic".
