The strncmp() function really only has one use case (for lexicographical ordering):
One of the strings has a known length,† the other string is known to be NUL terminated. (As a bonus, the string with known length need not be NUL terminated at all.)
The reasons I believe there is just one use case (prefix match detection is not lexicographical ordering):‡ (1) If both strings are NUL terminated, strcmp() should be used, as it will do the job correctly; and (2) If both strings have known length, memcmp() should be used, as it will avoid the unnecessary check against NUL on a byte per byte basis.
I am seeking an idiomatic (and readable) way to use the function to lexicographically compare two such arguments correctly (one of them is NUL terminated, one of them is not necessarily NUL terminated, with known length).
Does an idiom exist? If so, what is it? If not, what should it be, or what should be used instead?
Simply using the result of strncmp() won't work, because it will result in a false equality result in the case that the argument with known length is shorter than the NUL terminated one, and it happens to be a prefix. Therefore, extra code is required to test for that case.
As a standalone function I don't see much wrong with this construction, and it appears idiomatic:
/* s1 is NUL terminated */
int variation_as_function (const char *s1, const char *s2, size_t s2len) {
int result = strncmp(s1, s2, s2len);
if (result == 0) {
result = (s1[s2len] != '\0');
}
return result;
}
However, when inlining this construction into code, it results in a double test for 0 when equality needs special action:
int result = strncmp(key, input, inputlen);
if (result == 0) {
result = (key[inputlen] != '\0');
}
if (result == 0) {
do_something();
} else {
do_something_else();
}
The motivation for inlining the call is because the standalone function is esoteric: It matters which string argument is NUL terminated and which one is not.
Please note, the question is not about performance, but about writing idiomatic code and adopting best practices for coding style. I see there is some DRY violation with the comparison. Is there a straightforward way to avoid the duplication?
† By known length, I mean the length is correct (there is no embedded NUL that would truncate the length). In other words, the input was validated at some earlier point in the program, and its length was recorded, but the input is not explicitly NUL terminated. As a hypothetical example, a scanner on a stream of text could have this property.
‡ As has been pointed out by addy2012, strncmp() could be used for prefix matching. I as focused on lexicographical ordering. However, (1) If the length of the prefix string is used as the length argument, both arguments need to be NUL terminated to guard against reading past an input string shorter than the prefix string. (2) If the minimum length is known between the prefix string and the input string, then memcmp() would be a better choice in terms of providing equivalent functionality at less CPU cost and no loss in readability.
The strncmp() function really only has one use case:
One of the strings has a known length, the other string is known to be
NUL terminated.
No, you can use it to compare the beginnings of two strings, no matter if the length of any string is known or not. For example, if you have an array / a list with last names, and you want to find all which begin with "Mac".
In fact, strncmp should generally be used in preference to strcmp unless you know absolutely know that both strings are well-formed and nul-terminated.
Why? Because otherwise you have a vulnerability to buffer overflows.
This rule is unfortunately not followed often.
There are a lot of buffer overflow errors.
Update
I think the core error here is in "one of the strings has a known length". No C string has a known length a priori. They're not like Pascal or Java strings, which are essentially a pair of (length, buffer). A C string is by definition a char[] identifying a chunk of memory, with the distinguished symbol \0 to identify the end. strncmp, strncpy etc exist to protect against attempts to use a chunk of memory as a string that is not well-formed.
Related
For clarity I'm only talking about null terminated strings.
I'm familiar with the standard way of doing string comparisons in C with the usage of strcmp. But I feel like it's slow and inefficient.
I'm not necessarily looking for the easiest method but the most efficient.
Can the current comparison method (strcmp) be optimized further while the underlying code remains cross platform?
If strcmp can't be optimized further, what is the fastest way which I could perform the string comparison without strcmp?
Current use case:
Determine if two arbitrary strings match
Strings will not exceed 4096 bytes, nor be less than 1 byte in size
Strings are allocated/deallocated and compared within the same code/library
Once comparison is complete I do pass the string to another C library which needs the format to be in a standard null terminated format
System memory limits are not a huge concern, but I will have tens of thousands of such strings queued up for comparison
Strings may contain high-ascii character set or UTF-8 characters but for my purposes I only need to know if they match, content is not a concern
Application runs on x86 but should also run on x64
Reference to current strcmp() implementation:
How does strcmp work?
What does strcmp actually do?
GLIBC strcmp() source code
Edit: Clarified the solution does not need to be a modification of strcmp.
Edit 2: Added specific examples for this use case.
I'm afraid your reference imlementation for strcmp() is both inaccurate and irrelevant:
it is inaccurate because it compares characters using the char type instead of the unsigned char type as specified in the C11 Standard:
7.24.4 Comparison functions
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
It is irrelevant because the actual implementation used by modern compilers is much more sophisticated, expanded inline using hand-coded assembly language.
Any generic implementation is likely to be less optimal, especially if coded to remain portable across platforms.
Here are a few directions to explore if your program's bottleneck is comparing strings.
Analyze your algorithms, try and find ways to reduce the number of comparisons: for example if you search for a string in an array, sorting that array and using a binary search with drastically reduce the number of comparisons.
If your strings are tokens used in many different places, allocate unique copies of these tokens and use those as scalar values. The strings will be equal if and only if the pointers are equal. I use this trick in compilers and interpreters all the time with a hash table.
If your strings have the same known length, you can use memcmp() instead of strcmp(). memcmp() is simpler than strcmp() and can be implemented even more efficiently in places where the strings are known to be properly aligned.
EDIT: with the extra information provided, you could use a structure like this for your strings:
typedef struct string_t {
size_t len;
size_t hash; // optional
char str[]; // flexible array, use [1] for pre-c99 compilers
} string_t;
You allocate this structure this way:
string_t *create_str(const char *s) {
size_t len = strlen(s);
string_t *str = malloc(sizeof(*str) + len + 1;
str->len = len;
str->hash = hash_str(s, len);
memcpy(str->str, s, len + 1);
return str;
}
If you can use these str things for all your strings, you can greatly improve the efficiency of the matching by first comparing the lengths or the hashes. You can still pass the str member to your library function, it is properly null terminated.
In the responses to the question Reading In A String and comparing it C, more than one person discouraged the use of strcmp(), saying things like
I also strongly, strongly advise you to get used to using strncmp()
now, ... to avoid many problems down the road.
or (in Why does my string comparison fail? )
Make certain you use strncmp and not strcmp. strcmp is profoundly
unsafe.
What problems are they alluding to?
The reason scanf() with string specifiers and gets() are strongly discouraged is because they almost inevitably lead to buffer overflow vulnerabilities. However, it's not possible to overflow a buffer with strcmp(), right?
"A buffer overflow, or buffer overrun, is an anomaly where a program, while writing data to a buffer, overruns the buffer's boundary and overwrites adjacent memory."
( -- Wikipedia: buffer overflow).
Since the strcmp() function never writes to any buffer, the strcmp() function cannot cause a buffer overflow, right?
What is the reason people discourage the use of strcmp(), and recommend strncmp() instead?
While strncmp can prevent you from overrunning a buffer, its primary purpose isn't for safety. Rather, it exists for the case where one wants to compare only the first N characters of a (properly possibly NUL-terminated) string.
From the man page:
The strcmp() function compares the two strings s1 and s2. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.
The strncmp() function is similar, except it compares the only first (at most) n bytes of s1 and s2.
Note that strncmp in this case cannot be replaced with a simple memcmp, because you still need to take advantage of its stop-on-NUL behavior, in case one of the strings is shorter than n.
If strcmp causes a buffer overrun, then one of two things is true:
Your data isn't expected to be NUL-terminated, and you should be using memcmp instead.
Your data is expected to be NUL-terminated, but you've already screwed up when you populated the buffer, by somehow not NUL-terminating it.
Note that reading past the end of a buffer is still considered a buffer overrun. While it may seem harmless, it can be just as dangerous as writing past the end.
Reading, writing, executing... it doesn't matter. Any memory reference to an unintended address is undefined behavior. In the most apparent scenario, you attempt to access a page that isn't mapped into your process's address space, causing a page fault, and subsequent SIGSEGV. In the worst case, you sometimes run into a \0 byte, but other times you run into some other buffer, causing inconstant program behavior.
A string is by definition "a contiguous sequence of characters terminated by and including the first null character".
The only case where strncmp() would be safer than strcmp() is when you're comparing two character arrays as strings, you're certain that both arrays are at least n bytes long (the 3rd argument passed to strncmp()), and you're not certain that both arrays contain strings (i.e., contain a '\0' null character terminator).
In most cases, your code (if it's correct) will guarantee that any arrays that are supposed to contain null-terminated strings actually do contain null-terminated strings.
That added n in strncmp() is not a magic wand that makes unsafe code safe. It doesn't guard against null pointers, uninitialized pointers, uninitialized arrays, an incorrect value of n, or just passing incorrect data. You can shoot yourself in the foot with either function.
And if you're trying to call strcmp or strncmp with an array that you thought contained a null-terminated string but actually doesn't, then your code already has a bug. Using strncmp() might help you avoid the immediate symptom of that bug, but it won't fix it.
strcmp compares two strings character to character until a difference has been detected or the \0 is found at one of them.
On the other hand, strncmp provides a way to limit the number of characters to be compared so if the strings do not end with \0 the function won't continue checking after the size limit has been reached.
Imagine what would happen if you are comparing two strings at this two memory regions:
0x40, 0x41, 0x42,...
0x40, 0x41, 0x42,...
And you are only interested in the two first characters. Somehow \0 has been removed from the end of the strings and the third byte happens to coincide at the two regions. strncmp would avoid comparing this third byte if num parameter is 2.
EDIT
As the comments below indicate, this situation is derived from a wrong or very concrete use of the language.
By my understanding, strcmp() (no 'n'), upon seeing a null character in either argument, immediately stops processing and returns a result.
Therefore, if one of the arguments is known with 100% certainty to be null-terminated (e.g. it is a string literal), there is no security benefit whatsoever in using strncmp() (with 'n') with a call to strlen() as part of the third argument to limit the comparison to the known string length, because strcmp() will already never read more characters than are in that known-terminating string.
In fact, it seems to me that a call to strncmp() whose length argument is a strlen() on one of the first two arguments is only different from the strcmp() case in that it wastes time linear in the size of the known-terminating string by evaluating the strlen() expression.
Consider:
Sample code A:
if (strcmp(user_input, "status") == 0)
reply_with_status();
Sample code B:
if (strncmp(user_input, "status", strlen("status")+1) == 0)
reply_with_status();
Is there any benefit to the former over the latter? Because I see it in other people's code a lot.
Do I have a flawed understanding of how these functions work?
In your particular example, I would say it's detrimental to use strncmp because of:
Using strlen does the scan anyway
Repetition of the string literal "status"
Addition of 1 to make sure that strings are indeed equal
All these add to clutter, and in neither case would you be protected from overflowing user_input if it indeed was shorter than 6 characters and contained the same characters as the test string.
That would be exceptional. If you know that your input string always has more memory than the number of characters in your test strings, then don't worry. Otherwise, you might need to worry, or at least consider it. strncmp is useful for testing stuff inside large buffers.
My preference is for code readability.
Yes it does. If you use strlen in a strncmp, it will traverse the pointer until it sees a null in the string. That makes it functionally equivalent to strcmp.
In the particular case you've given, it's indeed useless. However, a slight alteration is more common:
if (strncmp(user_input, "status", strlen("status")) == 0)
reply_with_status();
This version just checks if user_input starts with "status", so it has different semantics.
Beyond tricks to check if the beginning of a string matches an input, strncmp is only going to be useful when you are not 100% sure that a string is null terminated before the end of its allocated space.
So, if you had a fixed size buffer that you took your user input in with you could use:
strncmp(user_input, "status", sizeof(user_input))
Therefore ensuring that your comparison does not overflow.
However in this case you do have to be careful, since if your user_input wasn't null terminated, then it will really be checking if user_input matches the beginning of status.
A better way might be to say:
if (user_input[sizeof(user_input) - 1] != '\0') {
// handle it, since it is _not_ equal to your string
// unless filling the buffer is valid
}
else if (strcmp(user_input, "status")) { ... }
Now, I agree that this isn't particularly useful use of strncmp(), and I can't see any benefit in this over strcmp().
However, if we change the code by removing the +1 after strlen, then it starts to be useful.
strncmp(user_input, "status", strlen("status"))
since that compares the first 6 characters of user_input with `"status" - which, at least sometimes is meaningful.
So, if the +1 is there, it becomes a regular strcmp - and it's just a waste of time calculating the length. But without the +1, it's quite a useful comparison (under the right circumstances).
strncmp() has limited use. The normal strcmp() will stop if it encounters a NUL on any of the two strings. (and in that case the strings are different) Strncmp() would stop and return zero ("the strings are equal in the first N characters")
One possible use of stncmp() is parsing options, upto the non-signifacant part, eg
if (!strncmp("-st", argv[xx], 3)) {}
, which would return zero for "-string" or "-state" or "-st0", but not for "-sadistic".
I wrote this small piece of code in C to test memcmp() strncmp() strcmp() functions in C.
Here is the code that I wrote:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
char *word1="apple",*word2="atoms";
if (strncmp(word1,word2,5)==0)
printf("strncmp result.\n");
if (memcmp(word1,word2,5)==0)
printf("memcmp result.\n");
if (strcmp(word1,word2)==0)
printf("strcmp result.\n");
}
Can somebody explain me the differences because I am confused with these three functions?
My main problem is that I have a file in which I tokenize its line of it,the problem is that when I tokenize the word "atoms" in the file I have to stop the process of tokenizing.
I first tried strcmp() but unfortunately when it reached to the point where the word "atoms" were placed in the file it didn't stop and it continued,but when I used either the memcmp() or the strncmp() it stopped and I was happy.
But then I thought,what if there will be a case in which there is one string in which the first 5 letters are a,t,o,m,s and these are being followed by other letters.
Unfortunately,my thoughts were right as I tested it using the above code by initializing word1 to "atomsaaaaa" and word2 to atoms and memcmp() and strncmp() in the if statements returned 0.On the other hand strcmp() it didn't. It seems that I must use strcmp().
In short:
strcmp compares null-terminated C strings
strncmp compares at most N characters of null-terminated C strings
memcmp compares binary byte buffers of N bytes
So, if you have these strings:
const char s1[] = "atoms\0\0\0\0"; // extra null bytes at end
const char s2[] = "atoms\0abc"; // embedded null byte
const char s3[] = "atomsaaa";
Then these results hold true:
strcmp(s1, s2) == 0 // strcmp stops at null terminator
strcmp(s1, s3) != 0 // Strings are different
strncmp(s1, s3, 5) == 0 // First 5 characters of strings are the same
memcmp(s1, s3, 5) == 0 // First 5 bytes are the same
strncmp(s1, s2, 8) == 0 // Strings are the same up through the null terminator
memcmp(s1, s2, 8) != 0 // First 8 bytes are different
memcmp compares a number of bytes.
strcmp and the like compare strings.
You kind of cheat in your example because you know that both strings are 5 characters long (plus the null terminator). However, what if you don't know the length of the strings, which is often the case? Well, you use strcmp because it knows how to deal with strings, memcmp does not.
memcmp is all about comparing byte sequences. If you know how long each string is then yeah, you could use memcmp to compare them, but how often is that the case? Rarely. You often need string comparison functions because, well... they know what a string is and how to compare them.
As for any other issues you are experiencing it is unclear from your question and code. Rest assured though that strcmp is better equipped in the general case for string comparisons than memcmp is.
strcmp():
It is used to compare the two string stored in two variable, It takes some time to compare them. And so it slows down the process.
strncmp():
It is very much similar to the previous one, but in this one, it compares the first n number of characters alone. This also slows down the process.
memcmp():
This function is used compare two variables using their memory. It doesn't compare them one by one, It compares four characters at one time. If your program is too concerned about speed, I recommend using memcmp().
To summarize:
strncmp() and strcmp() treat a 0 byte as the end of a string, and don't compare beyond it
to memcmp(), a 0 byte has no special meaning
strncmp and memcmp are same except the fact that former takes care of NULL terminated string.
For strcmp you'll want to be only comparing what you know are going to be strings however sometimes this is not always the case such as reading lines of binary files and there for you would want to use memcmp to compare certain lines of input that contain NUL characters but match and you may want to continue checking further lengths of input.
A comment on one of my answers has left me a little puzzled. When trying to compute how much memory is needed to concat two strings to a new block of memory, it was said that using snprintf was preferred over strlen, as shown below:
size_t length = snprintf(0, 0, "%s%s", str1, str2);
// preferred over:
size_t length = strlen(str1) + strlen(str2);
Can I get some reasoning behind this? What is the advantage, if any, and would one ever see one result differ from the other?
I was the one who said it, and I left out the +1 in my comment which was written quickly and carelessly, so let me explain. My point was merely that you should use the pattern of using the same method to compute the length that will eventually be used to fill the string, rather than using two different methods that could potentially differ in subtle ways.
For example, if you had three strings rather than two, and two or more of them overlapped, it would be possible that strlen(str1)+strlen(str2)+strlen(str3)+1 exceeds SIZE_MAX and wraps past zero, resulting in under-allocation and truncation of the output (if snprintf is used) or extremely dangerous memory corruption (if strcpy and strcat are used).
snprintf will return -1 with errno=EOVERFLOW when the resulting string would be longer than INT_MAX, so you're protected. You do need to check the return value before using it though, and add one for the null terminator.
If you only need to determine how big would be the concatenation of the two strings, I don't see any particular reason to prefer snprintf, since the minimum operations to determine the total length of the two strings is what the two strlen calls do. snprintf will almost surely be slower, because it has to check the parameters and parse the format string besides just walking the two strings counting the characters.
... but... it may be an intelligent move to use snprintf if you are in a scenario where you want to concatenate two strings, and have a static, not too big buffer to handle normal cases, but you can fallback to a dynamically allocated buffer in case of big strings, e.g.:
/* static buffer "big enough" for most cases */
char buffer[256];
/* pointer used in the part where work on the string is actually done */
char * outputStr=buffer;
/* try to concatenate, get the length of the resulting string */
int length = snprintf(buffer, sizeof(buffer), "%s%s", str1, str2);
if(length<0)
{
/* error, panic and death */
}
else if(length>sizeof(buffer)-1)
{
/* buffer wasn't enough, allocate dynamically */
outputStr=malloc(length+1);
if(outputStr==NULL)
{
/* allocation error, death and panic */
}
if(snprintf(outputStr, length, "%s%s", str1, str2)<0)
{
/* error, the world is doomed */
}
}
/* here do whatever you want with outputStr */
if(outputStr!=buffer)
free(outputStr);
One advantage would be that the input strings are only scanned once (inside the snprintf()) instead of twice for the strlen/strcpy solution.
Actually, on rereading this question and the comment on your previous answer, I don't see what the point is in using sprintf() just to calculate the concatenated string length. If you're actually doing the concatenation, my above paragraph applies.
You need to add 1 to the strlen() example. Remember you need to allocate space for nul terminating byte.
So snprintf( ) gives me the size a string would have been. That means I can malloc( ) space for that guy. Hugely useful.
I wanted (but did not find until now) this function of snprintf( ) because I format tons of strings for output later; but I wanted not to have to assign static bufs for the outputs because it's hard to predict how long the outputs will be. So I ended up with a lot of 4096-long char arrays :-(
But now -- using this newly-discovered (to me) snprintf( ) char-counting function, I can malloc( ) output bufs AND sleep at night, both.
Thanks again and apologies to the OP and to Matteo.
EDIT: random, mistaken nonsense removed. Did I say that?
EDIT: Matteo in his comment below is absolutely right and I was absolutely wrong.
From C99:
2 The snprintf function is equivalent to fprintf, except that the output is written into
an array (specified by argument s) rather than to a stream. If n is zero, nothing is written,
and s may be a null pointer. Otherwise, output characters beyond the n-1st are
discarded rather than being written to the array, and a null character is written at the end
of the characters actually written into the array. If copying takes place between objects
that overlap, the behavior is undefined.
Returns
3 The snprintf function returns the number of characters that would have been written
had n been sufficiently large, not counting the terminating null character, or a neg ative
value if an encoding error occurred. Thus, the null-terminated output has been
completely written if and only if the returned value is nonnegative and less than n.
Thank you, Matteo, and I apologize to the OP.
This is great news because it gives a positive answer to a question I'd asked here only a three weeks ago. I can't explain why I didn't read all of the answers, which gave me what I wanted. Awesome!
The "advantage" that I can see here is that strlen(NULL) might cause a segmentation fault, while (at least glibc's) snprintf() handles NULL parameters without failing.
Hence, with glibc-snprintf() you don't need to check whether one of the strings is NULL, although length might be slightly larger than needed, because (at least on my system) printf("%s", NULL); prints "(null)" instead of nothing.
I wouldn't recommend using snprintf() instead of strlen() though. It's just not obvious. A much better solution is a wrapper for strlen() which returns 0 when the argument is NULL:
size_t my_strlen(const char *str)
{
return str ? strlen(str) : 0;
}