What is the fastest way to compare two strings in C? - c

For clarity I'm only talking about null terminated strings.
I'm familiar with the standard way of doing string comparisons in C with the usage of strcmp. But I feel like it's slow and inefficient.
I'm not necessarily looking for the easiest method but the most efficient.
Can the current comparison method (strcmp) be optimized further while the underlying code remains cross platform?
If strcmp can't be optimized further, what is the fastest way which I could perform the string comparison without strcmp?
Current use case:
Determine if two arbitrary strings match
Strings will not exceed 4096 bytes, nor be less than 1 byte in size
Strings are allocated/deallocated and compared within the same code/library
Once comparison is complete I do pass the string to another C library which needs the format to be in a standard null terminated format
System memory limits are not a huge concern, but I will have tens of thousands of such strings queued up for comparison
Strings may contain high-ascii character set or UTF-8 characters but for my purposes I only need to know if they match, content is not a concern
Application runs on x86 but should also run on x64
Reference to current strcmp() implementation:
How does strcmp work?
What does strcmp actually do?
GLIBC strcmp() source code
Edit: Clarified the solution does not need to be a modification of strcmp.
Edit 2: Added specific examples for this use case.

I'm afraid your reference implementation for strcmp() is both inaccurate and irrelevant:
it is inaccurate because it compares characters using the char type instead of the unsigned char type as specified in the C11 Standard:
7.24.4 Comparison functions
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
It is irrelevant because the actual implementation used by modern compilers is much more sophisticated, expanded inline using hand-coded assembly language.
Any generic implementation is likely to be less optimal, especially if coded to remain portable across platforms.
Here are a few directions to explore if your program's bottleneck is comparing strings.
Analyze your algorithms and try to find ways to reduce the number of comparisons: for example, if you search for a string in an array, sorting that array and using a binary search will drastically reduce the number of comparisons.
If your strings are tokens used in many different places, allocate unique copies of these tokens and use those as scalar values. The strings will be equal if and only if the pointers are equal. I use this trick in compilers and interpreters all the time with a hash table.
If your strings have the same known length, you can use memcmp() instead of strcmp(). memcmp() is simpler than strcmp() and can be implemented even more efficiently in places where the strings are known to be properly aligned.
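As a rough illustration of the first suggestion, here is a minimal sketch of sorting a table of strings once and then looking keys up with a binary search. The names cmp_str_ptr, sort_table and table_contains are made up for this example; only qsort(), bsearch() and strcmp() come from the standard library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort/bsearch comparator for an array whose elements are `const char *`. */
int cmp_str_ptr(const void *a, const void *b) {
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sort the table once; afterwards each lookup needs O(log n) strcmp calls. */
void sort_table(const char **table, size_t n) {
    assert(table != NULL);
    qsort(table, n, sizeof *table, cmp_str_ptr);
}

/* Returns 1 if key is present in the already sorted table, 0 otherwise. */
int table_contains(const char *const *table, size_t n, const char *key) {
    return bsearch(&key, table, n, sizeof *table, cmp_str_ptr) != NULL;
}
```

Whether this pays off depends on how often the table changes compared to how often it is searched.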
EDIT: with the extra information provided, you could use a structure like this for your strings:
typedef struct string_t {
    size_t len;
    size_t hash;  // optional
    char str[];   // flexible array, use [1] for pre-c99 compilers
} string_t;
You allocate this structure this way:
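string_t *create_str(const char *s) {
    size_t len = strlen(s);
    string_t *str = malloc(sizeof(*str) + len + 1);
    if (str == NULL)
        return NULL;                  // allocation failed
    str->len = len;
    str->hash = hash_str(s, len);     // hash_str() is your own hash function
    memcpy(str->str, s, len + 1);     // copies the null terminator too
    return str;
}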
If you can use these str things for all your strings, you can greatly improve the efficiency of the matching by first comparing the lengths or the hashes. You can still pass the str member to your library function, it is properly null terminated.
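To make the matching step concrete, here is a sketch of an equality test on top of that structure. The hash_str() body is only a placeholder (an FNV-1a style loop chosen for illustration, not something the answer prescribes), and str_equal is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct string_t {
    size_t len;
    size_t hash;
    char str[];   /* flexible array member (C99) */
} string_t;

/* Placeholder hash (FNV-1a style); any reasonable hash function would do. */
size_t hash_str(const char *s, size_t len) {
    size_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

string_t *create_str(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    string_t *str = malloc(sizeof(*str) + len + 1);
    if (str == NULL)
        return NULL;
    str->len = len;
    str->hash = hash_str(s, len);
    memcpy(str->str, s, len + 1);
    return str;
}

/* Equality only: cheap scalar checks first, memcmp only when both agree. */
int str_equal(const string_t *a, const string_t *b) {
    return a->len == b->len
        && a->hash == b->hash
        && memcmp(a->str, b->str, a->len) == 0;
}
```

Most mismatches are rejected by the length or hash comparison without touching the character data, and the memcmp() call is still needed because two different strings can share a hash.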

Related

Wide-character version of _memccpy

I have to concatenate wide C-style strings, and based on my research, it seems that something like _memccpy is most ideal (in order to avoid Shlemiel's problem). But I can't seem to find a wide-character version. Does something like that exist?
Does something like that exist?
The C standard library does not contain a wide-character version of Microsoft's _memccpy(). Neither does it contain _memccpy() itself, although POSIX specifies the memccpy() function on which MS's _memccpy() appears to be modeled.
POSIX also defines wcpcpy() (a wide version of stpcpy()), which copies a wide string and returns a pointer to the end of the result. That's not as fully featured as memccpy(), but it would suffice to avoid Shlemiel's problem, if only Microsoft's C library provided a version of it.
You can, however, use swprintf() to concatenate wide strings without suffering from Shlemiel's problem, with the added advantage that it is in the standard library, since C99. It does not provide the memccpy behavior of halting after copying a user-specified (wide) character, but it does return the number of wide characters written, which is equivalent to returning a pointer to the end of the result. Also, it can directly concatenate an arbitrary fixed number of strings in a single call. swprintf does have significant overhead, though.
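For instance, a single swprintf() call can join several wide strings. This is only a sketch: concat3 is a hypothetical helper name, and the capacity is counted in wide characters, terminator included.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Concatenates three wide strings into dest. Returns the number of wide
   characters written (not counting the terminator), or a negative value
   if the result does not fit or an encoding error occurs. */
int concat3(wchar_t *dest, size_t capacity,
            const wchar_t *a, const wchar_t *b, const wchar_t *c) {
    assert(dest != NULL && capacity > 0);
    return swprintf(dest, capacity, L"%ls%ls%ls", a, b, c);
}
```

Unlike snprintf(), swprintf() reports a result that is too long as an error instead of returning the length that would have been needed, so the return value must be checked before the buffer is used.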
But of course, if the overhead of swprintf puts you off then it's pretty easy to write your own. The result might not be as efficient as a well-tuned implementation from your library vendor, but we're talking about a scaling problem, so you mainly need to win on the asymptotic complexity front. Simple example:
/*
 * Copies at most 'count' wide characters from 'src' to 'dest', stopping after
 * copying a wide character with value 0 if that happens first. If no 0 is
 * encountered in the first 'count' wide characters of 'src' then the result
 * will be unterminated.
 * Returns 'dest' + n, where n is the number of non-zero wide characters copied to 'dest'.
 */
wchar_t *wcpncpy(wchar_t *dest, const wchar_t *src, size_t count) {
    for (wchar_t *bound = dest + count; dest < bound; ) {
        if ((*dest++ = *src++) == 0) return dest - 1;
    }
    return dest;
}
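To show how the returned end pointer avoids rescanning, here is a sketch that chains such copies to join several strings. The copy logic is the same as in the function above, but it is renamed copy_wide here (a hypothetical name) because some <wchar.h> headers already declare a POSIX wcpncpy() with different padding behavior. join_wide is likewise made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Same logic as the function above, under a different name. */
wchar_t *copy_wide(wchar_t *dest, const wchar_t *src, size_t count) {
    for (wchar_t *bound = dest + count; dest < bound; ) {
        if ((*dest++ = *src++) == 0) return dest - 1;
    }
    return dest;
}

/* Joins n wide strings into dest; capacity counts wide characters,
   terminator included. Each copy resumes where the previous one ended,
   so no part of dest is scanned twice. The result is always terminated.
   Returns 0 on success, -1 if the result was truncated. */
int join_wide(wchar_t *dest, size_t capacity,
              const wchar_t *const *parts, size_t n) {
    assert(dest != NULL && capacity > 0);
    wchar_t *end = dest;
    wchar_t *limit = dest + capacity - 1;   /* keep room for the terminator */
    for (size_t i = 0; i < n; i++) {
        wchar_t *start = end;
        end = copy_wide(end, parts[i], (size_t)(limit - end));
        if (end == limit && parts[i][end - start] != L'\0') {
            *end = L'\0';
            return -1;                      /* ran out of room mid-part */
        }
    }
    *end = L'\0';
    return 0;
}
```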

Is there an idiomatic use of strncmp()?

The strncmp() function really only has one use case (for lexicographical ordering):
One of the strings has a known length,† the other string is known to be NUL terminated. (As a bonus, the string with known length need not be NUL terminated at all.)
The reasons I believe there is just one use case (prefix match detection is not lexicographical ordering):‡ (1) If both strings are NUL terminated, strcmp() should be used, as it will do the job correctly; and (2) If both strings have known length, memcmp() should be used, as it will avoid the unnecessary check against NUL on a byte per byte basis.
I am seeking an idiomatic (and readable) way to use the function to lexicographically compare two such arguments correctly (one of them is NUL terminated, one of them is not necessarily NUL terminated, with known length).
Does an idiom exist? If so, what is it? If not, what should it be, or what should be used instead?
Simply using the result of strncmp() won't work, because it will result in a false equality result in the case that the argument with known length is shorter than the NUL terminated one, and it happens to be a prefix. Therefore, extra code is required to test for that case.
As a standalone function I don't see much wrong with this construction, and it appears idiomatic:
/* s1 is NUL terminated */
int variation_as_function (const char *s1, const char *s2, size_t s2len) {
    int result = strncmp(s1, s2, s2len);
    if (result == 0) {
        result = (s1[s2len] != '\0');
    }
    return result;
}
However, when inlining this construction into code, it results in a double test for 0 when equality needs special action:
int result = strncmp(key, input, inputlen);
if (result == 0) {
    result = (key[inputlen] != '\0');
}
if (result == 0) {
    do_something();
} else {
    do_something_else();
}
The motivation for inlining the call is because the standalone function is esoteric: It matters which string argument is NUL terminated and which one is not.
Please note, the question is not about performance, but about writing idiomatic code and adopting best practices for coding style. I see there is some DRY violation with the comparison. Is there a straightforward way to avoid the duplication?
† By known length, I mean the length is correct (there is no embedded NUL that would truncate the length). In other words, the input was validated at some earlier point in the program, and its length was recorded, but the input is not explicitly NUL terminated. As a hypothetical example, a scanner on a stream of text could have this property.
‡ As has been pointed out by addy2012, strncmp() could be used for prefix matching. I was focused on lexicographical ordering. However, (1) If the length of the prefix string is used as the length argument, both arguments need to be NUL terminated to guard against reading past an input string shorter than the prefix string. (2) If the minimum length is known between the prefix string and the input string, then memcmp() would be a better choice in terms of providing equivalent functionality at less CPU cost and no loss in readability.
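When only equality matters (as in the inlined case above), one possible way to avoid the double test is to fold both conditions into a single boolean expression. This is a sketch, not an established idiom, and key_matches is a hypothetical name; it assumes, as stated in the question, that input has no embedded NUL within inputlen bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* key is NUL terminated; input has inputlen valid bytes and need not be
   NUL terminated. True only if key is exactly those inputlen bytes. */
bool key_matches(const char *key, const char *input, size_t inputlen) {
    assert(key != NULL && input != NULL);
    return strncmp(key, input, inputlen) == 0 && key[inputlen] == '\0';
}
```

Reading key[inputlen] is safe here because the second operand is evaluated only after strncmp() has confirmed that the first inputlen bytes of key equal the NUL-free input. This form answers only "equal or not"; it does not give an ordering.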
The strncmp() function really only has one use case:
One of the strings has a known length, the other string is known to be
NUL terminated.
No, you can use it to compare the beginnings of two strings, no matter if the length of any string is known or not. For example, if you have an array / a list with last names, and you want to find all which begin with "Mac".
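A minimal sketch of that prefix test, assuming both the name and the prefix are NUL terminated (has_prefix and count_with_prefix are names invented for the example):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero if s begins with prefix; both must be NUL terminated. */
int has_prefix(const char *s, const char *prefix) {
    assert(s != NULL && prefix != NULL);
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Counts the names that begin with prefix. */
size_t count_with_prefix(const char *const *names, size_t n,
                         const char *prefix) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += has_prefix(names[i], prefix) ? 1 : 0;
    return count;
}
```

A name shorter than the prefix is handled correctly because strncmp() stops at its terminating NUL, which differs from the corresponding prefix character.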
In fact, strncmp should generally be used in preference to strcmp unless you absolutely know that both strings are well-formed and nul-terminated.
Why? Because otherwise you have a vulnerability to buffer overflows.
This rule is unfortunately not followed often.
There are a lot of buffer overflow errors.
Update
I think the core error here is in "one of the strings has a known length". No C string has a known length a priori. They're not like Pascal or Java strings, which are essentially a pair of (length, buffer). A C string is by definition a char[] identifying a chunk of memory, with the distinguished symbol \0 to identify the end. strncmp, strncpy etc exist to protect against attempts to use a chunk of memory as a string that is not well-formed.

C Strings Comparison with Equal Sign

I have this code:
char *name = "George";
if (name == "George")
    printf("It's George");
I thought that C strings could not be compared with the == sign and that I have to use strcmp. For some unknown reason, when I compile with gcc (version 4.7.3) this code works. I thought that this was wrong because it is like comparing pointers, so I searched on Google and many people say that it's wrong and that comparing with == can't be done. So why does this comparing method work?
I thought that c strings could not be compared with == sign and I have to use strcmp
Right.
I thought that this was wrong because it is like comparing pointers so I searched in google and many people say that it's wrong and comparing with == can't be done
That's right too.
So why this comparing method works ?
It doesn't "work". It only appears to be working.
The reason why this happens is probably a compiler optimization: the two string literals are identical, so the compiler really generates only one instance of them, and uses that very same pointer/array whenever the string literal is referenced.
Just to provide a reference to @H2CO3's answer:
C11 6.4.5 String literals
It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is
undefined.
This means that in your example, name (which points to a string literal "George") and "George" may or may not share the same location; it's up to the implementation. So don't count on this: it may give different results on other machines.
The comparison you have done compares the location of the two strings, rather than their content. It just so happens that your compiler decided to only create one string literal containing the characters "George". This means that the location of the string stored in name and the location of the second "George" are the same, so the comparison returns non-zero.
The compiler is not required to do this, however - it could just as easily create two different string literals, with different locations but the same content, and the comparison would then return zero.
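For comparison, a content-based check with strcmp() does not depend on where the compiler places the literals. A minimal sketch (is_george is a name made up for the example):

```c
#include <assert.h>
#include <string.h>

/* Returns nonzero if name holds the same characters as "George",
   regardless of where either string is stored. */
int is_george(const char *name) {
    assert(name != NULL);
    return strcmp(name, "George") == 0;
}
```

With a local array such as char buf[] = "George";, the array is a separate object, so comparing its address to the literal with == cannot succeed, while is_george(buf) is true.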
This will fail, since you are comparing two different pointers of two separate strings.
If this code still works, then this is a result of a heavy optimization of GCC, that keeps only one copy for size optimization.
Use strcmp(). Link.
If you compare two strings with ==, you are comparing the base addresses of those strings, not the actual characters in them. To compare strings, use the strcmp() and strcasecmp() library functions, or write a function like this. The code below is not a full program, just the logic required for string comparison.
int mystrcmp(const char *s1, const char *s2)
{
    size_t i = 0;
    /* advance while both strings agree and s1 has not ended */
    while (s1[i] != '\0' && s1[i] == s2[i])
        i++;
    /* 0 if equal, otherwise the sign of the first differing pair */
    return (unsigned char)s1[i] - (unsigned char)s2[i];
}

Tool functions for chars

I want to handle some char variables and would like to get a list of some functions that can do these tasks when it comes to handling chars.
Getting first characters of a char (var_name[1] doesn't seem to work)
Getting last characters of a char
Checking for char1 matches with char2 (e.g. if "unicorn" matches words with "bicycle")
I am pretty sure some of these methods exist in libraries such as stdio.h or so but google isnt my friend.
EDIT: My 3rd question means not a direct match with strcmp but a single character match (e.g. "hey" and "hello" have e as a common letter).
Use var_name[0] to get first character (array indexes run from 0 to N - 1, where N is the number of elements in the array).
Use var_name[strlen(var_name) - 1] to get the last character.
Use strcmp() to compare two char strings.
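As a small sketch of the first two points (first_char and last_char are hypothetical helper names), note that an empty string needs a guard, since strlen(var_name) - 1 would wrap around for a length of 0:

```c
#include <assert.h>
#include <string.h>

/* First character of s; this is '\0' when s is empty. */
char first_char(const char *s) {
    assert(s != NULL);
    return s[0];
}

/* Last character of s, or '\0' when s is empty (avoids an out-of-range index). */
char last_char(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    return len > 0 ? s[len - 1] : '\0';
}
```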
EDIT:
To search for character in a string you can use strchr():
if (strchr("hello", 'e') && strchr("hey", 'e'))
{
}
There is also strpbrk() function that would indicate if two strings have any common characters:
if (strpbrk("hello", "hey"))
{
}
Assuming you mean a char[], and not a char which is a single character.
C uses 0-based indexing, var_name[0] gives you the first char.
strlen() gives you the length of the string, which together with my answer to 1. means
char lastchar = var_name[strlen(var_name)-1]; http://www.cplusplus.com/reference/clibrary/cstring/strlen/
strcmp(var_name1, var_name2) == 0. http://www.cplusplus.com/reference/clibrary/cstring/strcmp/
I am pretty sure some of these methods exist in libraries such as
stdio.h or so but google isnt my friend.
The string functions in the C standard library (libc) are described in the header file <string.h>. If you're on a unix-ish machine, try typing man 3 string at a command line. You can then use the man program again to get more information about specific functions, e.g. man 3 strlen. (The '3' just tells man to look in "section 3", which describes the C standard library functions.)
What you're looking for is the string functions in the C runtime library. These are defined in string.h, not stdio.h.
But your list of problems is simple:
var_name[0] works perfectly well for accessing the first char in an array. var_name[1] doesn't give you the first char because arrays in C are zero-based.
The last char in an array is:
char c;
c = var_name[strlen(var_name)-1];
Testing for equality is simple:
if (var_name[0] == var_name[1])
; // they match
C and C++ strings are zero indexed. The memory you need to hold a particular length string has to be at least the string length and one character for the string terminator \0. So, the first character is array[0].
As @Carey Gregory said, the basic string handling functions are in string.h. But these are only primitives for handling strings. C is a low-level enough language that you have the opportunity to build up your own string handling library based on the functions in string.h.
One example might be that you want to pass a string pointer to a function along with the length of the buffer holding that same string, not just the string length itself.

Different ways to calculate string length

A comment on one of my answers has left me a little puzzled. When trying to compute how much memory is needed to concat two strings to a new block of memory, it was said that using snprintf was preferred over strlen, as shown below:
size_t length = snprintf(0, 0, "%s%s", str1, str2);
// preferred over:
size_t length = strlen(str1) + strlen(str2);
Can I get some reasoning behind this? What is the advantage, if any, and would one ever see one result differ from the other?
I was the one who said it, and I left out the +1 in my comment which was written quickly and carelessly, so let me explain. My point was merely that you should use the pattern of using the same method to compute the length that will eventually be used to fill the string, rather than using two different methods that could potentially differ in subtle ways.
For example, if you had three strings rather than two, and two or more of them overlapped, it would be possible that strlen(str1)+strlen(str2)+strlen(str3)+1 exceeds SIZE_MAX and wraps past zero, resulting in under-allocation and truncation of the output (if snprintf is used) or extremely dangerous memory corruption (if strcpy and strcat are used).
snprintf will return -1 with errno=EOVERFLOW when the resulting string would be longer than INT_MAX, so you're protected. You do need to check the return value before using it though, and add one for the null terminator.
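Put together, the pattern might look like the sketch below: the same snprintf() format is used first to measure and then to fill. concat_alloc is a hypothetical name, and the caller is assumed to free the result.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns a newly allocated concatenation of a and b, or NULL on failure. */
char *concat_alloc(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    int len = snprintf(NULL, 0, "%s%s", a, b);  /* length without terminator */
    if (len < 0)
        return NULL;                            /* error reported by snprintf */
    char *out = malloc((size_t)len + 1);        /* + 1 for the terminator */
    if (out == NULL)
        return NULL;
    if (snprintf(out, (size_t)len + 1, "%s%s", a, b) != len) {
        free(out);                              /* second pass disagreed */
        return NULL;
    }
    return out;
}
```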
If you only need to determine how big would be the concatenation of the two strings, I don't see any particular reason to prefer snprintf, since the minimum operations to determine the total length of the two strings is what the two strlen calls do. snprintf will almost surely be slower, because it has to check the parameters and parse the format string besides just walking the two strings counting the characters.
... but... it may be an intelligent move to use snprintf if you are in a scenario where you want to concatenate two strings, and have a static, not too big buffer to handle normal cases, but you can fallback to a dynamically allocated buffer in case of big strings, e.g.:
/* static buffer "big enough" for most cases */
char buffer[256];
/* pointer used in the part where work on the string is actually done */
char *outputStr = buffer;
/* try to concatenate, get the length of the resulting string */
int length = snprintf(buffer, sizeof(buffer), "%s%s", str1, str2);
if (length < 0)
{
    /* error, panic and death */
}
else if ((size_t)length >= sizeof(buffer))
{
    /* buffer wasn't enough, allocate dynamically (+1 for the terminator) */
    outputStr = malloc((size_t)length + 1);
    if (outputStr == NULL)
    {
        /* allocation error, death and panic */
    }
    /* the size argument counts the terminator, so pass length + 1 */
    if (snprintf(outputStr, (size_t)length + 1, "%s%s", str1, str2) < 0)
    {
        /* error, the world is doomed */
    }
}
/* here do whatever you want with outputStr */
if (outputStr != buffer)
    free(outputStr);
One advantage would be that the input strings are only scanned once (inside the snprintf()) instead of twice for the strlen/strcpy solution.
Actually, on rereading this question and the comment on your previous answer, I don't see what the point is in using snprintf() just to calculate the concatenated string length. If you're actually doing the concatenation, my above paragraph applies.
You need to add 1 to the strlen() example. Remember you need to allocate space for nul terminating byte.
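For reference, a strlen()-based version with the extra byte accounted for could look like this sketch. concat_strlen is a hypothetical name, and the wrap-around check is an extra precaution rather than something the answers require.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated concatenation of a and b, or NULL on failure.
   The caller frees the result. */
char *concat_strlen(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    size_t la = strlen(a);
    size_t lb = strlen(b);
    if (la > SIZE_MAX - 1 - lb)        /* la + lb + 1 would wrap around */
        return NULL;
    char *out = malloc(la + lb + 1);   /* + 1 for the terminating NUL */
    if (out == NULL)
        return NULL;
    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);       /* copies b's terminator as well */
    return out;
}
```

Because the lengths are already known, memcpy() is used for the copy, so neither string is scanned again to find its end.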
So snprintf( ) gives me the size a string would have been. That means I can malloc( ) space for that guy. Hugely useful.
I wanted (but did not find until now) this function of snprintf( ) because I format tons of strings for output later; but I wanted not to have to assign static bufs for the outputs because it's hard to predict how long the outputs will be. So I ended up with a lot of 4096-long char arrays :-(
But now -- using this newly-discovered (to me) snprintf( ) char-counting function, I can malloc( ) output bufs AND sleep at night, both.
Thanks again and apologies to the OP and to Matteo.
EDIT: random, mistaken nonsense removed. Did I say that?
EDIT: Matteo in his comment below is absolutely right and I was absolutely wrong.
From C99:
2 The snprintf function is equivalent to fprintf, except that the output is written into
an array (specified by argument s) rather than to a stream. If n is zero, nothing is written,
and s may be a null pointer. Otherwise, output characters beyond the n-1st are
discarded rather than being written to the array, and a null character is written at the end
of the characters actually written into the array. If copying takes place between objects
that overlap, the behavior is undefined.
Returns
3 The snprintf function returns the number of characters that would have been written
had n been sufficiently large, not counting the terminating null character, or a negative
value if an encoding error occurred. Thus, the null-terminated output has been
completely written if and only if the returned value is nonnegative and less than n.
Thank you, Matteo, and I apologize to the OP.
This is great news because it gives a positive answer to a question I'd asked here only three weeks ago. I can't explain why I didn't read all of the answers, which gave me what I wanted. Awesome!
The "advantage" that I can see here is that strlen(NULL) might cause a segmentation fault, while (at least glibc's) snprintf() handles NULL parameters without failing.
Hence, with glibc-snprintf() you don't need to check whether one of the strings is NULL, although length might be slightly larger than needed, because (at least on my system) printf("%s", NULL); prints "(null)" instead of nothing.
I wouldn't recommend using snprintf() instead of strlen() though. It's just not obvious. A much better solution is a wrapper for strlen() which returns 0 when the argument is NULL:
size_t my_strlen(const char *str)
{
    return str ? strlen(str) : 0;
}

Resources