Differences between memchr() and strchr()

What is the actual difference between memchr() and strchr(), besides the extra parameter? When do you use one or the other? And would there be a performance benefit, theoretically speaking, to replacing strchr() with memchr() when parsing a big file?

strchr stops when it hits a null character but memchr does not; this is why the former does not need a length parameter but the latter does.

Functionally there is no difference in that they both scan an array / pointer for a provided value. The memchr version just takes an extra parameter because it needs to know the length of the provided buffer. The strchr version can avoid this because the null terminator marks the end of the string.
Differences can pop up if you attempt to use a char* which stores binary data with strchr, as it potentially won't see the full length of the data. This is true of pretty much any char* with binary data and a str* function. For non-binary data, though, they are virtually the same function.
You can actually code up strchr in terms of memchr fairly easily
#include <string.h>

/* named my_strchr here: defining strchr itself would clash with the
 * declaration in <string.h> */
const char* my_strchr(const char* pStr, char value) {
    return (const char*)memchr(pStr, value, strlen(pStr) + 1);
}
The +1 is necessary here because strchr can be used to find the null terminator in the string. This is definitely not an optimal implementation because it walks the memory twice. But it does serve to demonstrate how close the two are in functionality.
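To make the binary-data caveat concrete, here is a minimal sketch (the data is illustrative): strchr gives up at the embedded NUL, while memchr scans the whole buffer.

#include <stdio.h>
#include <string.h>

int main(void) {
    char data[] = { 'A', '\0', 'B', 'C' };
    /* strchr stops at data[1], the embedded NUL, and never sees 'B' */
    char *s = strchr(data, 'B');
    /* memchr scans all four bytes and finds data[2] */
    char *m = memchr(data, 'B', sizeof data);
    printf("strchr: %s, memchr: %s\n",
           s ? "found" : "not found",
           m ? "found" : "not found");
    return 0;
}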

strchr expects that the first parameter is null-terminated, and hence doesn't require a length parameter.
memchr works similarly but doesn't expect that the memory block is null-terminated, so you can even search for a '\0' byte successfully.

No real difference, just that strchr() assumes it is looking through a null-terminated string (so that determines the size).
memchr() simply looks for the given value up to the size passed in.

In practical terms, there's not much difference. Also, implementations are free to make one function faster than the other.
The real difference comes from context. If you're dealing with strings, then use strchr(). If you have a finite-size, non-terminated buffer, then use memchr(). If you want to search a finite-size subset of a string, then use memchr().
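As a minimal sketch of the last case (searching a finite-size subset of a string; the record layout here is just an illustration):

#include <stddef.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *rec = "name:alice;role:admin";
    /* look for ':' only within the first 8 bytes of the record */
    const char *hit = memchr(rec, ':', 8);
    if (hit)
        printf("found at offset %td\n", hit - rec);
    return 0;
}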

Can a C implementation use length-prefixed-strings "under the hood"?

After reading this question: What are the problems of a zero-terminated string that length-prefixed strings overcome? I started to wonder, what exactly is stopping a C implementation from allocating a few extra bytes for any char or wchar_t array allocated on the stack or heap and using them as a "string prefix" to store the number N of its elements?
Then, if the N-th character is '\0', N - 1 would signify the string length.
I believe this could mightily boost performance of functions such as strlen or strcat.
This could potentially turn to extra memory consumption if a program uses non-0-terminated char arrays extensively, but that could be remedied by a compiler flag turning on or off the regular "count-until-you-reach-'\0'" routine for the compiled code.
What are possible obstacles for such an implementation? Does the C Standard allow for this? What problems can this technique cause that I haven't accounted for?
And... has this actually ever been done?
You can store the length of the allocation. And malloc implementations really do do that (or some do, at least).
You can't reasonably store the length of whatever string is stored in the allocation, though, because the user can change the contents to their whim; it would be unreasonable to keep the length up to date. Furthermore, users might start strings somewhere in the middle of the character array, or might not even be using the array to hold a string!
Then, if the N-th character is '\0', N - 1 would signify the string length.
Actually, no, and that's why this suggestion cannot work.
If I overwrite a character in a string with a 0, I have effectively truncated the string, and a subsequent call of strlen on the string must return the truncated length. (This is commonly done by application programs, including every scanner generated by (f)lex, as well as the strtok standard library function. Amongst others.)
Moreover, it is entirely legal to call strlen on an interior byte of the string.
For example (just for demonstration purposes, although I'll bet you can find code almost identical to this in common use.)
#include <string.h>

/* Split a string like 'key=value...' into key and value parts, and
 * return the value, and optionally its length (if the second argument
 * is not a NULL pointer).
 * On success, returns the value part and modifies the original string
 * so that it is the key.
 * If there is no '=' in the supplied string, neither it nor the value
 * pointed to by plen are modified, and NULL is returned.
 */
char* keyval_split(char* keyval, int* plen) {
    char* delim = strchr(keyval, '=');
    if (delim) {
        if (plen) *plen = (int)strlen(delim + 1);
        *delim = 0;
        return delim + 1;
    } else {
        return NULL;
    }
}
There's nothing fundamentally stopping you from doing this in your application, if that was useful (one of the comments noted this). There are two problems that would emerge, however:
1. You'd have to reimplement all the string-handling functions, and have my_strlen, my_strcpy, and so on, and add string-creating functions. That might be annoying, but it's a bounded problem.
2. You'd have to stop people who write code for the system from treating the associated character arrays, deliberately or accidentally, as ‘ordinary’ C strings and using the usual functions on them. You might have to make sure that such usages broke promptly.
This means that it would, I think, be infeasible to smuggle a reimplemented ‘C string’ into an existing program.
Something like
typedef struct {
    size_t len;
    char* buf;
} String;
size_t my_strlen(String*);
...
might work, since type-checking would frustrate (2) (unless someone decided to hack things ‘for efficiency’, in which case there's not much you can do).
Of course, you wouldn't do this unless and until you'd proven that string management was the bottleneck in your code and that this approach provably improved things....
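For illustration, here is a minimal sketch of what such a type might look like in use; the names my_strnew, my_strlen, my_strfree and the layout are assumptions, not a standard API:

#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t len;
    char*  buf;
} String;

/* build a String from a C string; keeps a terminator so buf can still
   be handed to legacy APIs */
String* my_strnew(const char* s) {
    String* str = malloc(sizeof *str);
    if (str == NULL) return NULL;
    str->len = strlen(s);
    str->buf = malloc(str->len + 1);
    if (str->buf == NULL) { free(str); return NULL; }
    memcpy(str->buf, s, str->len + 1);
    return str;
}

size_t my_strlen(const String* s) {
    return s->len;   /* O(1): no scan over the characters */
}

void my_strfree(String* s) {
    if (s) { free(s->buf); free(s); }
}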
There are a couple of issues with this approach. First of all, you wouldn't be able to create arbitrarily long strings. If you only reserve 1 byte for length, then your string can only go up to 255 characters. You can certainly use more bytes to store the length, but how many? 2? 4?
What if you try to concatenate two strings that are both at the edge of their size limits (i.e., if you use 1 byte for length and try to concatenate two 250-character strings to each other, what happens)? Do you simply add more bytes to the length as necessary?
Secondly, where do you store this metadata? It somehow has to be associated with the string. This is similar to the problem Dennis Ritchie ran into when he was implementing arrays in C. Originally, array objects stored an explicit pointer to the first element of the array, but as he added struct types to the language, he realized that he didn't want that metadata cluttering up the representation of the struct object in memory, so he got rid of it and introduced the rule that array expressions get converted to pointer expressions in most circumstances.
You could create a new aggregate type like
struct string
{
    char *data;
    size_t len;
};
but then you wouldn't be able to use the C string library to manipulate objects of that type; an implementation would still have to support the existing interface.
You could store the length in the leading byte or bytes of the string, but how many do you reserve? You could use a variable number of bytes to store the length, but now you need a way to distinguish length bytes from content bytes, and you can't read the first character by simply dereferencing the pointer. Functions like strcat would have to know how to step around the length bytes, how to adjust the contents if the number of length bytes changes, etc.
The 0-terminated approach has its disadvantages, but it's also a helluva lot easier to implement and makes manipulating strings a lot easier.
The string methods in the standard library have defined semantics. If one generates an array of char that contains various values, and passes a pointer to the array or a portion thereof, the methods whose behavior is defined in terms of NUL bytes must search for NUL bytes in the same fashion as defined by the standard.
One could define one's own methods for string handling which use a better form of string storage, and simply pretend that the standard library string-related functions don't exist unless one must pass strings to things like fopen. The biggest difficulty with such an approach is that unless one uses non-portable compiler features it would not be possible to use in-line string literals. Instead of saying:
ns_output(my_file, "This is a test"); // ns -- new string
one would have to say something more like:
MAKE_NEW_STRING(this_is_a_test, "This is a test");
ns_output(my_file, this_is_a_test);
where the macro MAKE_NEW_STRING would create a union of an anonymous type, define an instance called this_is_a_test, and suitably initialize it. Since a lot of strings would be of different anonymous types, type-checking would require that strings be unions that include a member of a known array type, and code expecting strings should be given a pointer to that member, likely using something like:
#define ns_output(f,s) (ns_output_func((f),(s).stringref))
It would be possible to define the types in such a way as to avoid the need for the stringref member and have code just accept void*, but the stringref member would essentially perform static duck-typing (only things with a stringref member could be given to such a macro) and could also allow type-checking on the type of stringref itself.
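A minimal sketch of how these pieces might fit together, using a struct rather than the union described above for brevity; all names besides those quoted from the answer are illustrative:

#include <stddef.h>
#include <stdio.h>

typedef struct {
    size_t len;
    const char *chars;
} ns_stringref;

/* define a length-prefixed string instance from a literal */
#define MAKE_NEW_STRING(name, lit) \
    static const struct { ns_stringref stringref; } name = \
        { { sizeof(lit) - 1, (lit) } }

void ns_output_func(FILE *f, ns_stringref s) {
    fwrite(s.chars, 1, s.len, f);   /* no NUL scan needed */
}

/* static duck-typing: only things with a .stringref member compile */
#define ns_output(f, s) (ns_output_func((f), (s).stringref))

int main(void) {
    MAKE_NEW_STRING(this_is_a_test, "This is a test");
    ns_output(stdout, this_is_a_test);
    return 0;
}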
If one could accept those constraints, I think one could probably write code that was more efficient in almost every way than zero-terminated strings; the question would be whether the advantages would be worth the hassle.

'strncpy' vs. 'sprintf'

I can see many sprintf's used in my applications for copying a string.
I have a character array:
char myarray[10];
const char *str = "mystring";
Now if I want to copy the string str into myarray, is it better to use:
sprintf(myarray, "%s", str);
or
strncpy(myarray, str, 8);
?
Neither should be used, at all.
sprintf is dangerous, deprecated, and superseded by snprintf. The only way to use the old sprintf safely with string inputs is to either measure their length before calling sprintf, which is ugly and error-prone, or by adding a field precision specifier (e.g. %.8s or %.*s with an extra integer argument for the size limit). This is also ugly and error-prone, especially if more than one %s specifier is involved.
strncpy is also dangerous. It is not a buffer-size-limited version of strcpy. It's a function for copying characters into a fixed-length, null-padded (as opposed to null-terminated) array, where the source may be either a C string or a fixed-length character array at least the size of the destination. Its intended use was for legacy unix directory tables, database entries, etc. that worked with fixed-size text fields and did not want to waste even a single byte on disk or in memory for null termination.
It can be misused as a buffer-size-limited strcpy, but doing so is harmful for two reasons. First of all, it fails to null terminate if the whole buffer is used for string data (i.e. if the source string length is at least as long as the dest buffer). You can add the termination back yourself, but this is ugly and error-prone. And second, strncpy always pads the full destination buffer with null bytes when the source string is shorter than the output buffer. This is simply a waste of time.
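A minimal sketch of both failure modes just described (buffer sizes are arbitrary):

#include <string.h>

int main(void) {
    char small[4];
    /* source longer than dest: copies 'h','e','l','l' and does NOT
       null-terminate small */
    strncpy(small, "hello", sizeof small);

    char big[1024];
    /* source shorter than dest: copies "hi" and then pads all 1022
       remaining bytes with '\0' */
    strncpy(big, "hi", sizeof big);
    return 0;
}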
So what should you use instead?
Some people like the BSD strlcpy function. Semantically, it's identical to snprintf(dest, destsize, "%s", source) except that the return value is size_t and it does not impose an artificial INT_MAX limit on string length. However, most popular non-BSD systems lack strlcpy, and it's easy to make dangerous errors writing your own, so if you want to use it, you should obtain a safe, known-working version from a trustworthy source.
My preference is to simply use snprintf for any nontrivial string construction, and strlen+memcpy for some trivial cases that have been measured to be performance-critical. If you get in a habit of using this idiom correctly, it becomes almost impossible to accidentally write code with string-related vulnerabilities.
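A minimal sketch of both idioms, applied to the buffers from the question above (assuming truncation is acceptable in the snprintf case):

#include <stdio.h>
#include <string.h>

int main(void) {
    char myarray[10];
    const char *str = "mystring";

    /* snprintf: always NUL-terminates, truncates if str is too long */
    snprintf(myarray, sizeof myarray, "%s", str);

    /* strlen + memcpy: explicit check, then one bounded copy */
    size_t n = strlen(str);
    if (n < sizeof myarray)
        memcpy(myarray, str, n + 1);   /* +1 copies the terminator */

    puts(myarray);
    return 0;
}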
The different versions of printf/scanf are incredibly slow functions, for the following reasons:
They use variable argument lists, which makes parameter passing more complex. This is done through various obscure macros and pointers. All the arguments have to be parsed at runtime to determine their types, which adds extra overhead code. (VA lists are also quite a redundant feature of the language, and dangerous as well, as they have far weaker typing than plain parameter passing.)
They must handle a lot of complex formatting and all the different types supported. This adds plenty of overhead to the function as well. Since all type evaluations are done at runtime, the compiler cannot optimize away parts of the function that are never used. So if you only wanted to print integers with printf(), you still get support for float numbers, complex arithmetic, string handling, etc. linked into your program, as a complete waste of space.
Functions like strcpy() and particularly memcpy(), on the other hand, are heavily optimized by the compiler, often implemented in inline assembly for maximum performance.
Some measurements I once made on barebone 16-bit low-end microcontrollers are included below.
As a rule of thumb, you should never use stdio.h in any form of production code. It is to be considered as a debugging/testing library. MISRA-C:2004 bans stdio.h in production code.
EDIT
Replaced subjective numbers with facts:
Measurements of strcpy versus sprintf on target Freescale HCS12, compiler Freescale Codewarrior 5.1, using the C90 implementation of sprintf (C99 would be even less efficient). All optimizations enabled. The following code was tested:
const char str[] = "Hello, world";
char buf[100];
strcpy(buf, str);
sprintf(buf, "%s", str);
Execution time, including parameter shuffling on/off call stack:
strcpy 43 instructions
sprintf 467 instructions
Program/ROM space allocated:
strcpy 56 bytes
sprintf 1488 bytes
RAM/stack space allocated:
strcpy 0 bytes
sprintf 15 bytes
Number of internal function calls:
strcpy 0
sprintf 9
Function call stack depth:
strcpy 0 (inlined)
sprintf 3
I would not use sprintf just to copy a string. It's overkill, and someone who reads that code would certainly stop and wonder why I did that, and if they (or I) are missing something.
There is one way to use sprintf() (or, if being paranoid, snprintf()) to do a "safe" string copy that truncates instead of overflowing the field or leaving it un-NUL-terminated.
That is to use the "*" format character as "string precision", as follows:
char dest_buff[32];
....
/* cast needed: the '*' precision specifier expects an int, not size_t */
sprintf(dest_buff, "%.*s", (int)(sizeof(dest_buff) - 1), unknown_string);
This places the contents of unknown_string into dest_buff allowing space for the terminating NUL.

Getting the length of a formatted string from wsprintf

When using standard char* strings, the snprintf and vsnprintf functions will return the length of the output string, even if that string was truncated due to overflow.* It seems like the ISO C committee didn't like this functionality when they added swprintf and vswprintf, which return -1 on overflow.
Does anyone know of a function that will provide this length? I don't know the size of the potential strings. I might be asking too much, but... I'd rather not:
allocate a huge static temp buffer
iteratively allocate and free memory until I've found a size that fits
add an additional library dependency
write my own format string parser
*I realize MSVC doesn't do this, and instead provides the scprintf and vscprintf functions, but I'm looking for other compilers, mainly GCC.
My best suggestion to you would be not to use wchar_t strings at all, especially if you're not writing Windows-oriented code. In case that's not an option, here are some other ideas:
If your format string does not contain non-ASCII characters itself, what about first calling vsnprintf with the same set of arguments to get the length in bytes, then using that as a safe upper bound for the length in wchar_t characters (if there are few or no non-ASCII characters, the bound will be tight)?
If you're okay with introducing a dependency on a POSIX function (which is likely to be added to C1x), use open_wmemstream and fwprintf.
Just iterate allocating larger buffers, but do it smart: increase the size geometrically at each step, e.g. 127, 255, 511, 1023, 2047, ... I like this pattern better than whole powers of 2 because it's easy to avoid the dangerous case where an allocation might succeed for SIZE_MAX/2+1 but then the size would wrap to 0 at the next iteration.
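A minimal sketch of that loop, assuming C99 vswprintf semantics (it returns a negative value when the buffer is too small); the function name vaswprintf_alloc is illustrative:

#include <stdarg.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

wchar_t *vaswprintf_alloc(const wchar_t *fmt, va_list args) {
    size_t size = 127;   /* grows 127, 255, 511, ... (2n + 1) */
    for (;;) {
        wchar_t *buf = malloc(size * sizeof *buf);
        if (buf == NULL)
            return NULL;
        va_list copy;
        va_copy(copy, args);        /* args must survive retries */
        int n = vswprintf(buf, size, fmt, copy);
        va_end(copy);
        if (n >= 0)
            return buf;             /* it fit */
        free(buf);
        /* note: vswprintf also returns < 0 on encoding errors; a real
           implementation would distinguish that case */
        if (size > (SIZE_MAX / sizeof(wchar_t) - 1) / 2)
            return NULL;            /* next step would overflow */
        size = size * 2 + 1;
    }
}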
On some implementations, the following call returns the needed buffer size for wide character strings, but this is not guaranteed by the C standard (passing a null buffer to vswprintf is not defined behavior, and nullptr requires C23 or C++):
vswprintf(nullptr, -1, aFormat, argPtr);

Reading user input and checking the string

How does one check the read in string for a substring in C?
If I have the following
char name[21];
fgets(name, 21, stdin);
How do I check the string for a series of substrings?
How does one check for a substring before a character? For example, how would one check for a substring before an = sign?
Be wary of strtok(); it is not re-entrant. Amongst other things, it means that if you need to call it in one function, and then call another function, and if that other function also uses strtok(), your first function is messed up. It also writes NUL ('\0') bytes over the separators, so it modifies the input string as it goes. If you are looking for more than one terminator character, you can't tell which one was found.
Further, if you write a library function for others to use, yet your function uses strtok(), you must document the fact so that callers of your function are not bemused by the failures of their own code that uses strtok() after calling your function. In other words, it is poisonous: if your function calls strtok(), it makes your function unreusable, in general; similarly, your code that uses strtok() cannot call other people's functions that also use it.
If you still like the idea of the functionality - some people do (but I almost invariably avoid it) - then look for strtok_r() on your system. It is re-entrant; it takes an extra parameter which means that other functions can use strtok_r() (or strtok()) without affecting your function.
There are a variety of alternatives that might be appropriate. The obvious ones to consider are strchr(), strrchr(), strpbrk(), strspn(), strcspn(): none of these modify the strings they analyze. All are part of Standard C (as is strtok()), so they are essentially available everywhere. Looking for the material before a single character suggests that you should use strchr().
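For the "substring before an =" case, a minimal sketch using strchr, which (unlike strtok) leaves the input intact:

#include <stdio.h>
#include <string.h>

int main(void) {
    char name[21] = "user=alice";
    const char *eq = strchr(name, '=');
    if (eq != NULL) {
        /* print everything before '=' without modifying name */
        printf("key: %.*s\n", (int)(eq - name), name);
    }
    return 0;
}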
Use strtok() to split the string into tokens.
char *pch;
pch = strtok(name, "=");
if (pch != NULL)
{
    printf("Substring: %s\n", pch);
}
You can keep calling strtok() to find more strings after the =.
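A minimal sketch of that continuation: passing NULL on subsequent calls makes strtok pick up where it left off (and remember it modifies its input):

#include <stdio.h>
#include <string.h>

int main(void) {
    char name[] = "key=value=more";
    for (char *pch = strtok(name, "="); pch != NULL; pch = strtok(NULL, "="))
        printf("Token: %s\n", pch);
    return 0;
}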
You can use strtok but it's not reentrant and it destroys the original string. Other (perhaps safer) functions to look into would be strchr, strstr, strspn, and perhaps the mem* variations. In general, I avoid the strn* variants because, while they do "bounds checking," they still rely on the nul terminator. They can fail on a valid string that just happens to be longer than you expected to deal with, and they won't actually prevent a buffer overrun unless you know the buffer size. Better (IMHO) to ignore the terminator and know exactly how much data you're working with every time, the way the mem* functions work.

What makes a C standard library function dangerous, and what is the alternative?

While learning C I regularly come across resources which recommend that some functions (e.g. gets()) are never to be used, because they are either difficult or impossible to use safely.
If the C standard library contains a number of these "never-use" functions, it would seem necessary to learn a list of them, what makes them unsafe, and what to do instead.
So far, I've learned that functions which:
Cannot be prevented from overwriting memory
Are not guaranteed to null-terminate a string
Maintain internal state between calls
are commonly regarded as being unsafe to use. Is there a list of functions which exhibit these behaviours? Are there other types of functions which are impossible to use safely?
In the old days, most of the string functions had no bounds checking. Of course they couldn't just delete the old functions, or modify their signatures to include an upper bound, that would break compatibility. Now, for almost every one of those functions, there is an alternative "n" version. For example:
strcpy -> strncpy
strlen -> strnlen
strcmp -> strncmp
strcat -> strncat
strdup -> strndup
sprintf -> snprintf
wcscpy -> wcsncpy
wcslen -> wcsnlen
And more.
See also https://github.com/leafsr/gcc-poison which is a project to create a header file that causes gcc to report an error if you use an unsafe function.
Yes, fgets(..., ..., stdin) is a good alternative to gets(), because it takes a size parameter (gets() has in fact been removed from the C standard entirely in C11). Note that fgets() is not exactly a drop-in replacement for gets(), because the former will include the terminating \n character if there was room in the buffer for a complete line to be read.
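A minimal sketch of the replacement, including stripping the newline that fgets keeps:

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[80];
    if (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* drop the trailing '\n', if any */
        printf("read: %s\n", line);
    }
    return 0;
}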
scanf() is considered problematic in some cases, rather than straight-out "bad", because if the input doesn't conform to the expected format it can be impossible to recover sensibly (it doesn't let you rewind the input and try again). If you can just give up on badly formatted input, it's useable. A "better" alternative here is to use an input function like fgets() or fgetc() to read chunks of input, then scan it with sscanf() or parse it with string handling functions like strchr() and strtol(). Also see below for a specific problem with the "%s" conversion specifier in scanf().
It's not a standard C function, but the BSD and POSIX function mktemp() is generally impossible to use safely, because there is always a TOCTTOU race condition between testing for the existence of the file and subsequently creating it. mkstemp() or tmpfile() are good replacements.
strncpy() is a slightly tricky function, because it doesn't null-terminate the destination if there was no room for it. Despite the apparently generic name, this function was designed for creating a specific style of string that differs from ordinary C strings - strings stored in a known fixed width field where the null terminator is not required if the string fills the field exactly (original UNIX directory entries were of this style). If you don't have such a situation, you probably should avoid this function.
atoi() can be a bad choice in some situations, because you can't tell when there was an error doing the conversion (e.g., if the number exceeded the range of an int). Use strtol() if this matters to you.
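A minimal sketch of the difference: strtol reports both range errors and the absence of digits, where atoi would fail silently:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *s = "999999999999999999999";
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno == ERANGE)
        puts("out of range");      /* atoi would silently return garbage */
    else if (end == s)
        puts("no digits found");
    else
        printf("value: %ld\n", v);
    return 0;
}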
strcpy(), strcat() and sprintf() suffer from a similar problem to gets() - they don't allow you to specify the size of the destination buffer. It's still possible, at least in theory, to use them safely - but you are much better off using strncat() and snprintf() instead (you could use strncpy(), but see above). Do note that whereas the n for snprintf() is the size of the destination buffer, the n for strncat() is the maximum number of characters to append and does not include the null terminator. Another alternative, if you have already calculated the relevant string and buffer sizes, is memmove() or memcpy().
On the same theme, if you use the scanf() family of functions, don't use a plain "%s" - specify the size of the destination e.g. "%200s".
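A minimal sketch: the field width bounds what %s stores, and the buffer needs one extra byte for the terminator:

#include <stdio.h>

int main(void) {
    char word[201];                    /* 200 characters + '\0' */
    if (scanf("%200s", word) == 1)
        printf("read: %s\n", word);
    return 0;
}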
strtok() is generally considered to be evil because it stores state information between calls. Don't try running THAT in a multithreaded environment!
Strictly speaking, there is one really dangerous function. It is gets() because its input is not under the control of the programmer. All other functions mentioned here are safe in and of themselves. "Good" and "bad" boils down to defensive programming, namely preconditions, postconditions and boilerplate code.
Let's take strcpy() for example. It has some preconditions that the programmer must fulfill before calling the function. Both strings must be valid, non-NULL pointers to zero terminated strings, and the destination must provide enough space with a final string length inside the range of size_t. Additionally, the strings are not allowed to overlap.
That is quite a lot of preconditions, and none of them is checked by strcpy(). The programmer must be sure they are fulfilled, or he must explicitly test them with additional boilerplate code before calling strcpy():
n = DST_BUFFER_SIZE;
if ((dst != NULL) && (src != NULL) && (strlen(src) + 1 <= n))
{
    strcpy(dst, src);
}
This already silently assumes a zero-terminated source and non-overlapping buffers.
strncpy() does include some of these checks, but it adds another postcondition the programmer must take care of after calling the function, because the result may not be zero-terminated.
strncpy(dst, src, n);
if (n > 0)
{
    dst[n-1] = '\0';
}
Why are these functions considered "bad"? Because they would require additional boilerplate code for each call to really be on the safe side when the programmer assumes wrong about the validity, and programmers tend to forget this code.
Or people even argue against checking. Take the printf() family. These functions return a status that indicates error or success. Who checks whether the output to stdout or stderr succeeded? Hardly anyone, with the argument that you can't do anything at all when the standard channels are not working. Well, what about rescuing the user data and terminating the program with an error-indicating exit code, instead of the possible alternative: crashing and burning later with corrupted user data?
In a time- and money-limited environment it is always a question of how many safety nets you really want and what the resulting worst case scenario is. If it is a buffer overflow, as in the case of the str functions, then it makes sense to forbid them and probably provide wrapper functions with the safety nets already built in.
One final question about this: What makes you sure that your "good" alternatives are really good?
Any function that does not take a maximum-length parameter and instead relies on an end-of-data marker being present (such as many 'string' handling functions).
Any method that maintains state between calls.
sprintf is bad: it does not check the destination size; use snprintf.
gmtime and localtime are not reentrant; use gmtime_r and localtime_r.
To add something about strncpy that most people here forgot to mention: strncpy can result in performance problems because it clears the buffer out to the length given.
char buff[1000];
strncpy(buff, "1", sizeof buff);
will copy 1 char and overwrite 999 bytes with 0
Another reason why I prefer strlcpy (I know strlcpy is a BSDism but it is so easy to implement that there's no excuse to not use it).
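For illustration, a minimal strlcpy-style sketch (named my_strlcpy to avoid clashing with a system-provided version); per the warning earlier on this page, prefer a known-good implementation from a trustworthy source:

#include <string.h>

/* copies up to dstsize-1 bytes, always NUL-terminates (if dstsize > 0),
   and returns strlen(src) so the caller can detect truncation */
size_t my_strlcpy(char *dst, const char *src, size_t dstsize) {
    size_t srclen = strlen(src);
    if (dstsize != 0) {
        size_t n = (srclen < dstsize - 1) ? srclen : dstsize - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;   /* a return value >= dstsize means truncation */
}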
See page 7 (PDF page 9) of the SAFECode Dev Practices document.
Edit: From the page -
strcpy family
strncpy family
strcat family
scanf family
sprintf family
gets family
strcpy - again!
Most people agree that strcpy is dangerous, but strncpy is only rarely a useful replacement. It is usually important to know when you've needed to truncate a string, and for that reason you usually need to examine the length of the source string anyway. If that is the case, memcpy is usually the better replacement, as you know exactly how many characters you want copied.
e.g. truncation is an error:

n = strlen( src );
if( n >= buflen )
    return ERROR;
memcpy( dst, src, n + 1 );

truncation allowed, but the number of characters must be returned so the caller knows:

n = strlen( src );
if( n >= buflen )
    n = buflen - 1;
memcpy( dst, src, n );
dst[n] = '\0';
return n;
