Does sscanf require a null terminated string as input?


A recently published explanation for GTA Online's lengthy load times (1) showed that many implementations of sscanf() call strlen() on their input string to set up a context object for an internal routine shared with other scanning functions (scanf(), fscanf()...). This can become a performance bottleneck when the input string is very long. Parsing a 10MB JSON file loaded as a string with repeated calls to sscanf() with an offset and a %n conversion proved to be a dominant cause of the load time.
My question is: should sscanf() even read the input string beyond the bytes necessary for the conversions to complete? For example, does the following code invoke undefined behavior:
int test(void) {
    char buf[1] = { '1' };
    int v;
    sscanf(buf, "%1d", &v);
    return v;
}
The function should return 1 and does not need to read more than one byte from buf, but is sscanf() allowed to read from buf beyond the first byte?
(1) references provided by JdeBP:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
https://news.ycombinator.com/item?id=26297612
https://github.com/biojppm/rapidyaml/issues/40

Here are the relevant parts from the C Standard:
7.21.6.7 The sscanf function
Synopsis
#include <stdio.h>
int sscanf(const char * restrict s, const char * restrict format, ...);
Description
The sscanf function is equivalent to fscanf, except that input is obtained from a string (specified by the argument s) rather than from a stream. Reaching the end of the string is equivalent to encountering end-of-file for the fscanf function. If copying takes place between objects that overlap, the behavior is undefined.
Returns
The sscanf function returns the value of the macro EOF if an input failure occurs before the first conversion (if any) has completed. Otherwise, the sscanf function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure.
The input is specifically referred to as a string, so it must be null terminated.
Although the conversion uses only the initial prefix that matches the conversion specifier, plus possibly the next byte that determines where the matching sequence ends, the remaining characters must still be followed by a null terminator so that the input is a well-formed string. It is therefore conforming for an implementation to call strlen() on it to determine the input length.
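When the bytes to parse are not followed by a null terminator, one conservative option is to copy the field into a small terminated buffer before calling sscanf(). This is only a minimal sketch: the helper name scan_int_field and the 32-byte limit are assumptions made for illustration, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse an int from a field of `len` bytes that is
   NOT guaranteed to be null terminated.
   Returns 1 on success, 0 otherwise. */
int scan_int_field(const char *field, size_t len, int *out)
{
    char tmp[32];                 /* arbitrary limit chosen for this sketch */

    if (len >= sizeof tmp)
        return 0;                 /* field too long for the temporary copy */
    memcpy(tmp, field, len);
    tmp[len] = '\0';              /* now a well-formed string for sscanf() */
    return sscanf(tmp, "%d", out) == 1;
}
```

With this helper, the one-byte buf from the question could be passed as scan_int_field(buf, sizeof buf, &v) without relying on what sscanf() reads past the first byte.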
To avoid linear time complexity on long input strings, sscanf() should limit the scan for the end of string to a small size with strnlen() or equivalent and pass an appropriate refill function. Passing a huge length and letting the internal routine special case the null byte is an even better approach.
In the meantime, programmers should avoid passing long input strings to sscanf() and use more specialized functions for their parsing tasks, such as strtol(), which also requires a well formed C string, but is implemented in a more conservative way. This would also avoid potential undefined behavior on number conversions for out of range string representations.
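As a rough illustration of the strtol() approach, the sketch below walks a whitespace-separated list of integers by following the end pointer that strtol() reports, instead of calling sscanf() with an offset and %n. The function name sum_longs is hypothetical, and overflow of the running sum is deliberately not handled here.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: sum every integer in a whitespace-separated list.
   strtol() reports where it stopped through `end`, so the loop advances
   without rescanning the input from the start.
   Returns 0 on success, -1 on a malformed or out-of-range number. */
int sum_longs(const char *s, long *sum)
{
    *sum = 0;
    for (;;) {
        char *end;
        errno = 0;
        long v = strtol(s, &end, 10);
        if (end == s)             /* no digits consumed: stop scanning */
            break;
        if (errno == ERANGE)
            return -1;            /* value does not fit in a long */
        *sum += v;                /* overflow of the sum is not checked */
        s = end;
    }
    /* Only trailing whitespace may remain; anything else is junk. */
    while (*s == ' ' || *s == '\t' || *s == '\n')
        s++;
    return *s == '\0' ? 0 : -1;
}
```

The input must still be a well-formed string, but in typical implementations each strtol() call examines only the characters of one number plus the byte that ends it.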

When the Standard was written, many library functions were handled identically by almost all existing implementations, but some implementations may have had good reasons for handling a few cases differently. If the number of implementations that would have reason to differ from the commonplace behavior was substantial, then the Committee would either require that all implementations behave in the common fashion (as happens when e.g. computing UINT_MAX+1u), or explicitly state that they were not required to do so (as when e.g. computing INT_MAX+1). In cases where there was a clear common behavior, but it might not be practical on all implementations, however, the Committee generally simply refrained from saying anything, on the presumption that most compilers would have no reason to deviate from the common behavior, and the authors of those that would have reason to deviate would be better placed than the Committee to judge the pros and cons of following the common behavior versus deviating from it.
The sscanf behavior at issue fits the latter pattern. The Committee didn't want to mandate that implementations which would have trouble if the data source didn't have a trailing zero byte must be changed to deal with such data sources, but nor did they want to require that programmers copy data from sources that don't have trailing zero bytes to places that do before using sscanf upon it, even when their implementation wouldn't care about anything beyond the portion of the source that would be meaningfully examined. Since makers of implementations that require a trailing zero will likely block any change to the Standard that would require them to tolerate its absence, and programmers whose implementations impose no such needless requirement will block any change to the Standard that would require them to add extra data-copying steps to their code, the situation will remain deadlocked unless people can agree to categorize implementations that impose the trailing-byte requirement as "conforming but deficient" and require that they indicate such deficiency via a predefined macro or other such means.

Related

Why the different behavior of snprintf vs swprintf?

The C standard states the following from the standard library function snprintf:
"The snprintf function is equivalent to fprintf, except that the
output is written into an array (specified by arguments) rather than
to a stream. If n is zero, nothing is written, and s may be a null
pointer. Otherwise, output characters beyond the n-1st are discarded
rather than being written to the array, and a null character is
written at the end of the characters actually written into the array.
If copying takes place between objects that overlap, the behavior is
undefined."
"The snprintf function returns the number of characters that would
have been written had n been sufficiently large, not counting the
terminating null character, or a negative value if an encoding error
occurred. Thus, the null-terminated output has been completely written
if and only if the returned value is nonnegative and less than n."
Compare it to the statement about swprintf:
"The swprintf function is equivalent to fwprintf, except that the
argument s specifies an array of wide characters into which the
generated output is to be written, rather than written to a stream. No
more than n wide characters are written, including a terminating null
wide character, which is always added (unless n is zero)."
"The swprintf function returns the number of wide characters written
in the array, not counting the terminating null wide character, or a
negative value if an encoding error occurred or if n or more wide
characters were requested to be written."
At first glance it may seem like snprintf and swprintf are completely equivalent to each other, the latter merely handling wide strings and the former narrow strings. However, that's not the case. While snprintf returns the number of characters that would have been written if n had been large enough, swprintf returns a negative value in this case (which means that you can't know how many characters would have been written if there had been enough space). This makes the two functions not fully interchangeable, because their behavior is different in this regard (and thus the latter can't be used for something that the former can, such as evaluating how long the output buffer would need to be before actually creating it).
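The sizing use of snprintf described above is commonly written as a two-pass call: a first call with n equal to zero to learn the length, then a second call into a buffer of that size. The sketch below assumes a hypothetical helper name, format_alloc, and uses the v-variant so that the argument list can be walked twice.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly malloc'ed string.
   Returns NULL on encoding error or allocation failure; caller frees. */
char *format_alloc(const char *fmt, ...)
{
    va_list ap, ap2;
    char *buf = NULL;

    va_start(ap, fmt);
    va_copy(ap2, ap);                         /* second pass needs its own list */
    int need = vsnprintf(NULL, 0, fmt, ap);   /* n == 0: nothing is written */
    va_end(ap);

    if (need >= 0 && (buf = malloc((size_t)need + 1)) != NULL)
        vsnprintf(buf, (size_t)need + 1, fmt, ap2);
    va_end(ap2);
    return buf;
}
```

Nothing equivalent can be built on swprintf alone, because its return value does not report the required length.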
Why would they make this difference? I suppose the behavior of swprintf makes the implementation more efficient when n is too small, but still, why the difference? I don't think it's even a question of snprintf being older and thus "legacy" and "dragging the weight of its history, which can't be changed later" and swprintf being newer and thus free to be improved, because both were introduced in C99.
There is, however, another significantly subtler difference between the two specifications. They are not merely carbon copies of each other with the return value as the only difference: they also differ in the somewhat ambiguous behavior when n is too small for the string to be printed.
The specification for snprintf quite clearly states that the output will be written up to n-1 characters even when n is too small, and the null character will be written at the end of it always. The specification of swprintf almost states this... except it leaves it ambiguous in its specification of the return value.
More specifically a negative return value is used to signal that an error occurred while trying to write the string to the destination. A negative value is also returned when n was too small. It's left ambiguous whether this is actually considered an error situation or not. This is significant because if it's considered an error, then the implementation is free to not write all, or anything, into the destination, because it can be signaling "an error occurred, the output is invalid". The first paragraph of the specification makes it sound like at most n-1 characters are always written, and an ending null character is always written ("which is always added"), but the second paragraph about the return value leaves it ambiguous whether this is actually an error situation and whether the implementation can choose not to write those things in this case.
This is significant because the glibc implementation of swprintf does not write the final null character when n is too small, making the result an invalid string. While I can't find definitive information on this, I have got the impression that the developers of glibc have interpreted the standard in such a manner that they don't have to write the final null character (or anything) to the output because this is an error situation.
The thing is that the standard seems to be very ambiguous and vague in this regard. Is it a correct interpretation? Or are they misinterpreting? Why would the standard leave it this ambiguous?
My interpretation differs from that of the glibc developers. I understand the second paragraph to mean:
A negative value is returned if:
an encoding error occurred, or
n or more wide characters were requested to be written.
I don't see how this could be interpreted as n being too small being considered an error.

unlimited buffer printf - formatted puts directly to stream

my understanding is that most implementations of printf rely on something like
vsnprintf( _acBuffer[0], sizeof( _acBuffer[0] ), pcFormat, *ptArgList );
to actually handle the formatting and then they output them to the stream via puts.
Are there any implementations that minimize the size of _acBuffer[0] required while maintaining the ability to print all strings?
Obviously something like :
printf("%s", pcReallyLongString);
would be a problem.
Your thoughts are much appreciated!

Your understanding is just wrong. I've never seen or heard of a printf implementation that works by first formatting the entire output to a temporary string buffer. Usually printf is done the other way around: the fundamental building block is vfprintf and vsnprintf is a wrapper for that which creates a fake FILE whose buffer is the destination string.
Edit: Some popular (e.g. glibc) implementations do make some use of unboundedly-large intermediate buffers for certain formats, especially wide character conversions, and will fail unpredictably when they cannot allocate sufficient memory for the buffer. This is purely a low-quality-implementation issue, however; there's no fundamental reason any of the printf functions should require any more than a small constant amount of working space, regardless of what they're printing.

I'd say that the whole point of the fprintf (or printf) specification being the way it is, is to allow a "bufferless" one-pass implementation of this function. I.e. it converts the data sequentially piece by piece (if it requires conversion), immediately sends it to the output and forgets about it for good. The function can use an intermediate buffer for numeric data conversion, but this is a temporary buffer of fixed and insignificant compile-time size.
Unless I'm missing something, a properly implemented fprintf function should impose absolutely no limitations on how long the resultant string is. Your hypothetical implementation through vsnprintf would violate that principle.
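To make the one-pass idea concrete, here is a toy formatter that supports only %s, %u and %% and pushes each character to an output hook as soon as it is produced. It is a sketch under stated assumptions, not how any particular C library is written: sink_fn, mini_format, emit_uint and the membuf example sink are all names invented for this illustration.

```c
#include <stdarg.h>
#include <stddef.h>

typedef void (*sink_fn)(char c, void *ctx);   /* hypothetical output hook */

/* Emit an unsigned value using a small, fixed scratch buffer. */
static void emit_uint(sink_fn put, void *ctx, unsigned v)
{
    char digits[3 * sizeof v];    /* enough decimal digits for any unsigned */
    size_t n = 0;
    do {
        digits[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);
    while (n > 0)
        put(digits[--n], ctx);
}

/* Toy formatter: every piece goes to the sink immediately, so the total
   output length is not limited by any intermediate buffer. */
void mini_format(sink_fn put, void *ctx, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    for (; *fmt != '\0'; fmt++) {
        if (*fmt != '%') { put(*fmt, ctx); continue; }
        fmt++;
        if (*fmt == '\0')
            break;                            /* lone trailing '%' */
        if (*fmt == 's') {
            for (const char *s = va_arg(ap, const char *); *s != '\0'; s++)
                put(*s, ctx);
        } else if (*fmt == 'u') {
            emit_uint(put, ctx, va_arg(ap, unsigned));
        } else {
            put(*fmt, ctx);                   /* "%%" and unknown specifiers */
        }
    }
    va_end(ap);
}

/* Example sink for testing: collects output in a fixed array and
   silently drops anything that does not fit. */
struct membuf { char data[64]; size_t len; };

void membuf_put(char c, void *ctx)
{
    struct membuf *m = ctx;
    if (m->len < sizeof m->data)
        m->data[m->len++] = c;
}
```

A sink that forwards each character to fputc() would stream arbitrarily long %s arguments while using only the fixed scratch array in emit_uint.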

Why does C not have an snwprintf function?

Does anyone know why there is no snwprintf function in the C standard library?
I am aware of swprintf, but that doesn't have the same semantics of a true, wchar_t version of snprintf. As far as I can tell, there is no easy way to implement an snwprintf function using [v]swprintf:
Unlike snprintf, swprintf does not return the necessary buffer size; if the supplied buffer is insufficient, it simply returns -1. This is indistinguishable from failure due to encoding errors, so I can't keep retrying with progressively larger buffers hoping that it eventually will succeed.
I suppose I could set the last element of the buffer to be non-NUL, call swprintf, and assume that truncation occurred if that element is NUL afterward. However, is that guaranteed to work? The standard does not specify what state the buffer should be in if swprintf fails. (In contrast, snprintf describes which characters are written and which are discarded.)
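The retry idea can at least be bounded. The sketch below doubles the buffer until vswprintf() succeeds and gives up at a caller-chosen limit, precisely because a negative return does not say whether the buffer was too small or an encoding error occurred. The name wformat_alloc, the starting size of 16 and the max_chars parameter are assumptions made for illustration.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: format into a malloc'ed wide string by retrying
   vswprintf() with a doubled buffer. Gives up once the buffer would
   exceed max_chars wide characters.
   Returns NULL on failure; caller frees. */
wchar_t *wformat_alloc(size_t max_chars, const wchar_t *fmt, ...)
{
    for (size_t cap = 16; cap <= max_chars; cap *= 2) {
        wchar_t *buf = malloc(cap * sizeof *buf);
        if (buf == NULL)
            return NULL;

        va_list ap;
        va_start(ap, fmt);                    /* restart the list per attempt */
        int n = vswprintf(buf, cap, fmt, ap);
        va_end(ap);

        if (n >= 0)
            return buf;           /* everything fit, terminator included */
        free(buf);                /* too small OR encoding error: retry */
    }
    return NULL;
}
```

The buffer contents after a failed call are never inspected, so the sketch does not depend on what state swprintf leaves the buffer in.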

See the answer given by Larry Jones here.
Essentially, swprintf was added in C95 while snprintf was added in C99, and since many implementations already returned the number of required characters (for snprintf) and it seemed a useful thing to do, that was the behavior that was standardized. They didn't think that behavior was important enough to break backwards compatibility with swprintf, which had been added without that behavior several years earlier.

Which functions from the standard library must (should) be avoided?

I've read on Stack Overflow that some C functions are "obsolete" or "should be avoided". Can you please give me some examples of this kind of function and the reason why?
What alternatives to those functions exist?
Can we use them safely - any good practices?

Deprecated Functions
Insecure
A perfect example of such a function is gets(), because there is no way to tell it how big the destination buffer is. Consequently, any program that reads input using gets() has a buffer overflow vulnerability. For similar reasons, one should use strncpy() in place of strcpy() and strncat() in place of strcat().
Further examples include the tmpfile() and mktemp() functions, which have potential security issues with overwriting temporary files and are superseded by the more secure mkstemp() function.
Non-Reentrant
Other examples include gethostbyaddr() and gethostbyname() which are non-reentrant (and, therefore, not guaranteed to be threadsafe) and have been superseded by the reentrant getaddrinfo() and freeaddrinfo().
You may be noticing a pattern here... either lack of security (possibly by failing to include enough information in the signature to possibly implement it securely) or non-reentrance are common sources of deprecation.
Outdated, Non-Portable
Some other functions simply become deprecated because they duplicate functionality and are not as portable as other variants. For example, bzero() is deprecated in favor of memset().
Thread Safety and Reentrance
You asked, in your post, about thread safety and reentrance. There is a slight difference. A function is reentrant if it does not use any shared, mutable state. So, for example, if all the information it needs is passed into the function, and any buffers needed are also passed into the function (rather than shared by all calls to the function), then it is reentrant. That means that different threads, by using independent parameters, do not risk accidentally sharing state. Reentrancy is a stronger guarantee than thread safety. A function is thread safe if it can be used by multiple threads concurrently. A function is thread safe if:
It is reentrant (i.e. it does not share any state between calls), or:
It is non-reentrant, but it uses synchronization/locking as needed for shared state.
In general, in the Single UNIX Specification and IEEE 1003.1 (i.e. "POSIX"), any function which is not guaranteed to be reentrant is not guaranteed to be thread safe. So, in other words, only functions which are guaranteed to be reentrant may be portably used in multithreaded applications (without external locking). That does not mean, however, that implementations of these standards cannot choose to make a non-reentrant function threadsafe. For example, Linux frequently adds synchronization to non-reentrant functions in order to add a guarantee (beyond that of the Single UNIX Specification) of threadsafety.
Strings (and Memory Buffers, in General)
You also asked if there is some fundamental flaw with strings/arrays. Some might argue that this is the case, but I would argue that no, there is no fundamental flaw in the language. C and C++ require you to pass the length/capacity of an array separately (it is not a ".length" property as in some other languages). This is not a flaw, per-se. Any C and C++ developer can write correct code simply by passing the length as a parameter where needed. The problem is that several APIs that required this information failed to specify it as a parameter. Or assumed that some MAX_BUFFER_SIZE constant would be used. Such APIs have now been deprecated and replaced by alternative APIs that allow the array/buffer/string sizes to be specified.
Scanf (In Answer to Your Last Question)
Personally, I use the C++ iostreams library (std::cin, std::cout, the << and >> operators, std::getline, std::istringstream, std::ostringstream, etc.), so I do not typically deal with that. If I were forced to use pure C, though, I would personally just use fgetc() or getchar() in combination with strtol(), strtoul(), etc. and parse things manually, since I'm not a huge fan of varargs or format strings. That said, to the best of my knowledge, there is no problem with [f]scanf(), [f]printf(), etc. so long as you craft the format strings yourself, you never pass arbitrary format strings or allow user input to be used as format strings, and you use the formatting macros defined in <inttypes.h> where appropriate. (Note, snprintf() should be used in place of sprintf(), but that has to do with failing to specify the size of the destination buffer and not the use of format strings). I should also point out that, in C++, boost::format provides printf-like formatting without varargs.

Once again people are repeating, mantra-like, the ludicrous assertion that the "n" versions of the str functions are safe versions.
If that was what they were intended for then they would always null terminate the strings.
The "n" versions of the functions were written for use with fixed length fields (such as directory entries in early file systems) where the nul terminator is only required if the string does not fill the field. This is also the reason why the functions have strange side effects that are pointlessly inefficient if just used as replacements - take strncpy() for example:
If the array pointed to by s2 is a
string that is shorter than n bytes,
null bytes are appended to the copy in
the array pointed to by s1, until n
bytes in all are written.
As buffers allocated to handle filenames are typically 4kbytes this can lead to a massive deterioration in performance.
If you want "supposedly" safe versions then obtain - or write your own - strl routines (strlcpy, strlcat etc) which always nul terminate the strings and don't have side effects. Please note though that these aren't really safe as they can silently truncate the string - this is rarely the best course of action in any real-world program. There are occasions where this is OK but there are also many circumstances where it could lead to catastrophic results (e.g. printing out medical prescriptions).
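For readers who want to write their own routine along these lines, a strlcpy-style copy might look like the sketch below. The name copy_str is hypothetical (strlcpy itself is not part of ISO C). The return value follows the strlcpy convention, so the caller can detect truncation instead of having it happen silently.

```c
#include <string.h>

/* Hypothetical strlcpy-style helper: copies at most size-1 bytes, always
   terminates the result when size > 0, and returns strlen(src) so that a
   result >= size tells the caller the copy was truncated. */
size_t copy_str(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);
    if (size > 0) {
        size_t n = len < size ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';            /* one terminator, no zero-padding */
    }
    return len;
}
```

A caller would typically write if (copy_str(dst, src, sizeof dst) >= sizeof dst) and treat that branch as an error rather than carrying on with a shortened string.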

Several answers here suggest using strncat() over strcat(); I'd suggest that strncat() (and strncpy()) should also be avoided. It has problems that make it difficult to use correctly and lead to bugs:
the length parameter to strncat() is related to (but not quite exactly - see the 3rd point) the maximum number of characters that can be copied to the destination rather than the size of the destination buffer. This makes strncat() more difficult to use than it should be, particularly if multiple items will be concatenated to the destination.
it can be difficult to determine if the result was truncated (which may or may not be important)
it's easy to have an off-by-one error. As the C99 standard notes, "Thus, the maximum number of characters that can end up in the array pointed to by s1 is strlen(s1)+n+1" for a call that looks like strncat( s1, s2, n)
strncpy() also has an issue that can result in bugs if you try to use it in an intuitive way - it doesn't guarantee that the destination is null terminated. To ensure that, you have to specifically handle that corner case by dropping a '\0' in the buffer's last location yourself (at least in certain situations).
I'd suggest using something like OpenBSD's strlcat() and strlcpy() (though I know that some people dislike those functions; I believe they're far easier to use safely than strncat()/strncpy()).
Here's a little of what Todd Miller and Theo de Raadt had to say about problems with strncat() and strncpy():
There are several problems encountered when strncpy() and strncat() are used as safe versions of strcpy() and strcat(). Both functions deal with NUL-termination and the length parameter in different and non-intuitive ways that confuse even experienced programmers. They also provide no easy way to detect when truncation occurs. ... Of all these issues, the confusion caused by the length parameters and the related issue of NUL-termination are most important. When we audited the OpenBSD source tree for potential security holes we found rampant misuse of strncpy() and strncat(). While not all of these resulted in exploitable security holes, they made it clear that the rules for using strncpy() and strncat() in safe string operations are widely misunderstood.
OpenBSD's security audit found that bugs with these functions were "rampant". Unlike gets(), these functions can be used safely, but in practice there are a lot of problems because the interface is confusing, unintuitive and difficult to use correctly. I know that Microsoft has also done analysis (though I don't know how much of their data they may have published), and as a result have banned (or at least very strongly discouraged - the 'ban' might not be absolute) the use of strncat() and strncpy() (among other functions).
Some links with more information:
http://www.usenix.org/events/usenix99/full_papers/millert/millert_html/
http://en.wikipedia.org/wiki/Off-by-one_error#Security_implications
http://blogs.msdn.com/michael_howard/archive/2004/10/29/249713.aspx
http://blogs.msdn.com/michael_howard/archive/2004/11/02/251296.aspx
http://blogs.msdn.com/michael_howard/archive/2004/12/10/279639.aspx
http://blogs.msdn.com/michael_howard/archive/2006/10/30/something-else-to-look-out-for-when-reviewing-code.aspx

Standard library functions that should never be used:
setjmp.h
setjmp(). Together with longjmp(), these functions are widely recognized as incredibly dangerous to use: they lead to spaghetti programming, they come with numerous forms of undefined behavior, they can cause unintended side-effects in the program environment, such as affecting values stored on the stack. References: MISRA-C:2012 rule 21.4, CERT C MSC22-C.
longjmp(). See setjmp().
stdio.h
gets(). The function has been removed from the C language (as per C11), as it was unsafe as per design. The function was already flagged as obsolete in C99. Use fgets() instead. References: ISO 9899:2011 K.3.5.4.1, also see note 404.
stdlib.h
atoi() family of functions. These have no error handling but invoke undefined behavior whenever errors occur. Completely superfluous functions that can be replaced with the strtol() family of functions. References: MISRA-C:2012 rule 21.7.
string.h
strncat(). Has an awkward interface that is often misused. It is mostly a superfluous function. Also see remarks for strncpy().
strncpy(). The intention of this function was never to be a safer version of strcpy(). Its sole purpose was always to handle an ancient string format on Unix systems, and that it got included in the standard library is a known mistake. This function is dangerous because it may leave the string without null termination and programmers are known to often use it incorrectly. References: Why are strlcpy and strlcat considered insecure?, with a more detailed explanation here: Is strcpy dangerous and what should be used instead?.
Standard library functions that should be used with caution:
assert.h
assert(). Comes with overhead and should generally not be used in production code. It is better to use an application-specific error handler which displays errors but does not necessarily close down the whole program.
signal.h
signal(). References: MISRA-C:2012 rule 21.5, CERT C SIG32-C.
stdarg.h
va_arg() family of functions. The presence of variable-argument (variadic) functions in a C program is almost always an indication of poor program design. Should be avoided unless you have very specific requirements.
stdio.h
Generally, this whole library is not recommended for production code, as it comes with numerous cases of poorly-defined behavior and poor type safety.
fflush(). Perfectly fine to use for output streams. Invokes undefined behavior if used for input streams.
gets_s(). Safe version of gets() included in C11 bounds-checking interface. It is preferred to use fgets() instead, as per C standard recommendation. References: ISO 9899:2011 K.3.5.4.1.
printf() family of functions. Resource heavy functions that come with lots of undefined behavior and poor type safety. sprintf() also has vulnerabilities. These functions should be avoided in production code. References: MISRA-C:2012 rule 21.6.
scanf() family of functions. See remarks about printf(). Also, scanf() is vulnerable to buffer overruns if not used correctly. fgets() is preferred when possible. References: CERT C INT05-C, MISRA-C:2012 rule 21.6.
tmpfile() family of functions. Comes with various vulnerability issues. References: CERT C FIO21-C.
stdlib.h
malloc() family of functions. Perfectly fine to use in hosted systems, though be aware of well-known issues in C90 and therefore don't cast the result. The malloc() family of functions should never be used in freestanding applications. References: MISRA-C:2012 rule 21.3.
Also note that realloc() is dangerous in case you overwrite the old pointer with the result of realloc(). In case the function fails, you create a leak.
system(). Comes with lots of overhead and although portable, it is often better to use system-specific API functions instead. Comes with various poorly-defined behavior. References: CERT C ENV33-C.
string.h
strcat(). See remarks for strcpy().
strcpy(). Perfectly fine to use, unless the size of the data to be copied is unknown or larger than the destination buffer. If no check of the incoming data size is done, there may be buffer overruns. Which is no fault of strcpy() itself, but of the calling application - that strcpy() is unsafe is mostly a myth created by Microsoft.
strtok(). Alters the caller string and uses internal state variables, which could make it unsafe in a multi-threaded environment.
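The realloc() remark in the stdlib.h entries above is usually addressed by keeping the result in a temporary pointer until it is known to be valid. A minimal sketch, with resize_buffer as a hypothetical helper name:

```c
#include <stdlib.h>

/* Hypothetical helper: resize the block *pp to new_size bytes.
   On failure the original block is left untouched and still owned by
   the caller, so nothing leaks.
   Returns 0 on success, -1 on failure. */
int resize_buffer(void **pp, size_t new_size)
{
    if (new_size == 0)
        return -1;                /* zero-size realloc is not portable */

    void *tmp = realloc(*pp, new_size);
    if (tmp == NULL)
        return -1;                /* *pp is still valid here */
    *pp = tmp;
    return 0;
}
```

The helper takes a void ** for brevity, so the caller needs a void * variable; in typed code the same temporary-pointer pattern is normally written inline.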

Some people would claim that strcpy and strcat should be avoided, in favor of strncpy and strncat. This is somewhat subjective, in my opinion.
They should definitely be avoided when dealing with user input - no doubt here.
In code "far" from the user, when you just know the buffers are long enough, strcpy and strcat may be a bit more efficient because computing the n to pass to their cousins may be superfluous.

Avoid
strtok for multithreaded programs, as it's not thread-safe.
gets, as it could cause a buffer overflow.
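One way to sidestep strtok's hidden state using only ISO C is to keep the position in a caller-owned cursor and find token boundaries with strspn() and strcspn(). The sketch below is an illustration under that assumption; next_token is a hypothetical name, and unlike strtok it leaves the input string unmodified.

```c
#include <string.h>

/* Hypothetical reentrant tokenizer: all state lives in *cursor, and the
   input string is never written to. Returns a pointer to the next token
   and stores its length in *len, or returns NULL when none remain. */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    const char *start = *cursor + strspn(*cursor, delims);  /* skip delimiters */
    if (*start == '\0')
        return NULL;
    *len = strcspn(start, delims);                          /* token length */
    *cursor = start + *len;
    return start;
}
```

Because tokens are returned as pointer plus length rather than as terminated strings, callers print them with a precision, for example printf("%.*s", (int)len, tok).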

It is probably worth adding again that strncpy() is not the general-purpose replacement for strcpy() that its name might suggest. It is designed for fixed-length fields that don't need a nul-terminator (it was originally designed for use with UNIX directory entries, but can be useful for things like encryption key fields).
It is easy, however, to use strncat() as a replacement for strcpy():
if (dest_size > 0)
{
    dest[0] = '\0';
    strncat(dest, source, dest_size - 1);
}
(The if test can obviously be dropped in the common case, where you know that dest_size is definitely nonzero).

Also check out Microsoft's list of banned APIs. These are APIs (including many already listed here) that are banned from Microsoft code because they are often misused and lead to security problems.
You may not agree with all of them, but they are all worth considering. They add an API to the list when its misuse has led to a number of security bugs.

It is very hard to use scanf safely. Good use of scanf can avoid buffer overflows, but you are still vulnerable to undefined behavior when reading numbers that don't fit in the requested type. In most cases, fgets followed by self-parsing (using sscanf, strchr, etc.) is a better option.
But I wouldn't say "avoid scanf all the time". scanf has its uses. As an example, let's say you want to read user input in a char array that's 10 bytes long. You want to remove the trailing newline, if any. If the user enters more than 9 characters before a newline, you want to store the first 9 characters in the buffer and discard everything until the next newline. You can do:
char buf[10];
scanf("%9[^\n]%*[^\n]", buf);
getchar();
Once you get used to this idiom, it's shorter and in some ways cleaner than:
char buf[10];
if (fgets(buf, sizeof buf, stdin) != NULL) {
    char *nl;
    if ((nl = strrchr(buf, '\n')) == NULL) {
        int c;
        while ((c = getchar()) != EOF && c != '\n') {
            ;
        }
    } else {
        *nl = 0;
    }
}

Almost any function that deals with NUL terminated strings is potentially unsafe.
If you are receiving data from the outside world and manipulating it via the str*() functions, then you set yourself up for catastrophe.

Don't forget about sprintf - it is the cause of many problems. The complication is that its alternative, snprintf, has different implementations on some platforms, which can make your code unportable.
linux: http://linux.die.net/man/3/snprintf
windows: http://msdn.microsoft.com/en-us/library/2ts7cx93%28VS.71%29.aspx
In case 1 (linux) the return value is the amount of data needed to store the entire output (if it is greater than or equal to the size of the given buffer then the output was truncated)
In case 2 (windows) the return value is a negative number in case the output is truncated.
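Code that has to cope with both return conventions can treat a negative result and a result of at least the buffer size the same way. The wrapper below is a sketch of that idea; format_fits is a hypothetical name, and the explicit terminator write is a precaution for runtimes that may not terminate on truncation.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: returns 1 if the whole output fit in buf,
   0 if it was truncated or an error occurred. A negative result and a
   result >= size are both treated as "did not fit". */
int format_fits(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);

    if (size > 0)
        buf[size - 1] = '\0';     /* harmless if already terminated */
    return n >= 0 && (size_t)n < size;
}
```

The wrapper deliberately discards the required length, since only one of the two conventions reports it.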
Generally you should avoid functions that are not:
buffer overflow safe (a lot of functions are already mentioned in here)
thread safe/not reentrant (strtok for example)
In the manual of each function you should search for keywords like: safe, sync, async, thread, buffer, bugs

In all the string-copy/move scenarios - strcat(), strncat(), strcpy(), strncpy(), etc. - things go much better (safer) if a couple simple heuristics are enforced:
1. Always NUL-fill your buffer(s) before adding data.
2. Declare character-buffers as [SIZE+1], with a macro-constant.
For example, given:
#define BUFSIZE 10
char Buffer[BUFSIZE+1] = { 0x00 }; /* The compiler NUL-fills the rest */
we can use code like:
memset(Buffer,0x00,sizeof(Buffer));
strncpy(Buffer,"12345678901234567890",BUFSIZE);
relatively safely. The memset() should appear before the strncpy(), even though we initialized Buffer at compile-time, because we don't know what garbage other code placed into it before our function was called. The strncpy() will truncate the copied data to "1234567890", and will not NUL-terminate it. However, since we have already NUL-filled the entire buffer - sizeof(Buffer), rather than BUFSIZE - there is guaranteed to be a final "out-of-scope" terminating NUL anyway, as long as we constrain our writes using the BUFSIZE constant, instead of sizeof(Buffer).
Buffer and BUFSIZE likewise work fine for snprintf():
memset(Buffer,0x00,sizeof(Buffer));
if(snprintf(Buffer,BUFSIZE,"Data: %s","Too much data") >= BUFSIZE) {
/* Do some error-handling */
} /* If using MFC, you need if(... < 0), instead */
Even though snprintf() specifically writes only BUFSIZE-1 characters, followed by NUL, this works safely. So we "waste" an extraneous NUL byte at the end of Buffer...we prevent both buffer-overflow and unterminated string conditions, for a pretty small memory-cost.
My call on strcat() and strncat() is more hard-line: don't use them. It is difficult to use strcat() safely, and the API for strncat() is so counter-intuitive that the effort needed to use it properly negates any benefit. I propose the following drop-in:
#define strncat(target,source,bufsize) snprintf(target,bufsize,"%s%s",target,source)
It is tempting to create a strcat() drop-in, but not a good idea:
#define strcat(target,source) snprintf(target,sizeof(target),"%s%s",target,source)
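because target may be a pointer (thus sizeof() does not return the information we need). I don't have a good "universal" solution to instances of strcat() in your code. Be aware, too, that both macros pass target to snprintf() as the destination and as a "%s" argument, and the C standard leaves copying between overlapping objects undefined, so building the result in a separate buffer is the safer route.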
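If you are free to change the call sites, an explicit helper that takes the buffer size avoids both the sizeof() problem and the macro tricks. This is a rough sketch only; append_bounded is a hypothetical name, not a standard function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src to the string in dst, never writing
 * more than dstsize bytes in total. Returns 0 if everything fit, -1 if
 * the result was truncated or dst held no terminator within dstsize. */
int append_bounded(char *dst, size_t dstsize, const char *src)
{
    char *end = memchr(dst, '\0', dstsize);
    size_t used, room, len;

    if (end == NULL)
        return -1;                 /* no terminator inside the buffer */

    used = (size_t)(end - dst);
    room = dstsize - used - 1;     /* free bytes, excluding the NUL */
    len = strlen(src);

    if (len > room) {
        memcpy(end, src, room);    /* copy what fits */
        end[room] = '\0';
        return -1;                 /* tell the caller about truncation */
    }
    memcpy(end, src, len + 1);     /* includes the terminator */
    return 0;
}
```

The size argument always means "total size of the destination", which is the convention snprintf() uses and the one strncat() does not.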
A problem I frequently encounter from "strFunc()-aware" programmers is an attempt to protect against buffer-overflows by using strlen(). This is fine if the contents are guaranteed to be NUL-terminated. Otherwise, strlen() itself can cause a buffer-overrun error (usually leading to a segmentation violation or other core-dump situation), before you ever reach the "problematic" code you are trying to protect.
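When the terminator is not guaranteed, the search for it has to be bounded as well. A minimal sketch using memchr(), which is standard C (bounded_length is a made-up name; strnlen() does the same job where it is available):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the string in buf, examining no more
 * than bufsize bytes. Returns bufsize if no terminator was found. */
size_t bounded_length(const char *buf, size_t bufsize)
{
    const char *end = memchr(buf, '\0', bufsize);
    return end ? (size_t)(end - buf) : bufsize;
}
```

A result equal to bufsize tells the caller that the data is not a terminated string and must not be handed to the str*() functions as is.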

atoi is not thread safe. I use strtol instead, per recommendation from the man page.

What makes a C standard library function dangerous, and what is the alternative?

While learning C I regularly come across resources which recommend that some functions (e.g. gets()) are never to be used, because they are either difficult or impossible to use safely.
If the C standard library contains a number of these "never-use" functions, it would seem necessary to learn a list of them, what makes them unsafe, and what to do instead.
So far, I've learned that functions which:
Cannot be prevented from overwriting memory
Are not guaranteed to null-terminate a string
Maintain internal state between calls
are commonly regarded as being unsafe to use. Is there a list of functions which exhibit these behaviours? Are there other types of functions which are impossible to use safely?

In the old days, most of the string functions had no bounds checking. Of course they couldn't just delete the old functions, or modify their signatures to include an upper bound, that would break compatibility. Now, for almost every one of those functions, there is an alternative "n" version. For example:
strcpy -> strncpy
strlen -> strnlen
strcmp -> strncmp
strcat -> strncat
strdup -> strndup
sprintf -> snprintf
wcscpy -> wcsncpy
wcslen -> wcsnlen
And more.
See also https://github.com/leafsr/gcc-poison which is a project to create a header file that causes gcc to report an error if you use an unsafe function.
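The mechanism behind such a header is GCC's poison pragma. The fragment below is only an illustration of the idea, not the contents of that project; make_label is a hypothetical function, and the pragma is a compiler extension rather than standard C.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Compiler extension: any later use of these identifiers is rejected at
 * compile time. It is placed after the standard headers so that their
 * own declarations have already been processed. */
#pragma GCC poison gets strcpy strcat

/* Hypothetical function, here only to show that bounded calls still compile. */
int make_label(char *out, size_t outsize, int id)
{
    int n = snprintf(out, outsize, "item-%d", id);
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}
```

Adding a call to strcpy() anywhere below the pragma should then stop the build instead of slipping through review.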

Yes, fgets(..., ..., stdin) is a good alternative to gets(), because it takes a size parameter (gets() has in fact been removed from the C standard entirely in C11). Note that fgets() is not exactly a drop-in replacement for gets(), because the former will include the terminating \n character if there was room in the buffer for a complete line to be read.
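The newline difference is easy to deal with in a small wrapper. A sketch, assuming you want the newline removed and want to know whether the line was cut short (read_line is a hypothetical name):

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line from fp into buf (at most size-1
 * characters) and strips the newline that fgets() keeps.
 * Returns 1 for a complete line, 0 if no newline was seen (line longer
 * than the buffer, or last line without newline), -1 if nothing was read. */
int read_line(FILE *fp, char *buf, size_t size)
{
    size_t len;

    if (size == 0 || size > INT_MAX || fgets(buf, (int)size, fp) == NULL)
        return -1;

    len = strcspn(buf, "\n");      /* index of newline, or of the NUL */
    if (buf[len] == '\n') {
        buf[len] = '\0';
        return 1;
    }
    return 0;
}
```

When it returns 0 because the buffer filled up, the rest of the line is still waiting in the stream, which gets() never gave you a chance to notice.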
scanf() is considered problematic in some cases, rather than straight-out "bad", because if the input doesn't conform to the expected format it can be impossible to recover sensibly (it doesn't let you rewind the input and try again). If you can just give up on badly formatted input, it's useable. A "better" alternative here is to use an input function like fgets() or fgetc() to read chunks of input, then scan it with sscanf() or parse it with string handling functions like strchr() and strtol(). Also see below for a specific problem with the "%s" conversion specifier in scanf().
It's not a standard C function, but the BSD and POSIX function mktemp() is generally impossible to use safely, because there is always a TOCTTOU race condition between testing for the existence of the file and subsequently creating it. mkstemp() or tmpfile() are good replacements.
strncpy() is a slightly tricky function, because it doesn't null-terminate the destination if there was no room for it. Despite the apparently generic name, this function was designed for creating a specific style of string that differs from ordinary C strings - strings stored in a known fixed width field where the null terminator is not required if the string fills the field exactly (original UNIX directory entries were of this style). If you don't have such a situation, you probably should avoid this function.
atoi() can be a bad choice in some situations, because you can't tell when there was an error doing the conversion (e.g., if the number exceeded the range of an int). Use strtol() if this matters to you.
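For illustration, this is roughly what the strtol() version looks like once the error checks are written out. parse_int is a hypothetical helper, and rejecting trailing characters is a choice made for this sketch, not something strtol() requires.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: converts text to int. Returns 0 on success,
 * -1 if text has no digits, has trailing characters, or is out of range. */
int parse_int(const char *text, int *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);

    if (end == text || *end != '\0')
        return -1;                 /* nothing converted, or junk after it */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return -1;                 /* does not fit in an int */
    *out = (int)value;
    return 0;
}
```

With atoi() none of these three failures can be told apart from a valid result.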
strcpy(), strcat() and sprintf() suffer from a similar problem to gets() - they don't allow you to specify the size of the destination buffer. It's still possible, at least in theory, to use them safely - but you are much better off using strncat() and snprintf() instead (you could use strncpy(), but see above). Do note that whereas the n for snprintf() is the size of the destination buffer, the n for strncat() is the maximum number of characters to append and does not include the null terminator. Another alternative, if you have already calculated the relevant string and buffer sizes, is memmove() or memcpy().
On the same theme, if you use the scanf() family of functions, don't use a plain "%s" - specify the size of the destination e.g. "%200s".
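One detail that is easy to get wrong: the field width counts the characters stored and does not include the terminating NUL, so the destination needs one byte more than the width. A small sketch, with WORD_MAX and first_word as made-up names:

```c
#include <stdio.h>

#define WORD_MAX 15   /* hypothetical limit; must match the width below */

/* Reads the first whitespace-delimited word of line into word.
 * word must hold WORD_MAX + 1 bytes, because "%15s" stores up to
 * 15 characters and then adds the terminator.
 * Returns 1 if a word was read, 0 otherwise. */
int first_word(const char *line, char word[WORD_MAX + 1])
{
    return sscanf(line, "%15s", word) == 1;
}
```

Input longer than the width is simply cut at 15 characters; the remainder stays unread.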

strtok() is generally considered to be evil because it stores state information between calls. Don't try running THAT in a multithreaded environment!
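If strtok_r() is not available to you, a tokenizer that keeps its position in a caller-supplied pointer is short to write with strspn() and strcspn(). This is a sketch under a hypothetical name, next_token; unlike strtok() it does not modify the input, so tokens come back as a pointer plus a length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the next token in *cursor with no hidden
 * state. On success stores the token length in *len, advances *cursor
 * past the token and returns the token start. Returns NULL when no
 * tokens remain. */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    /* skip leading delimiters */
    const char *start = *cursor + strspn(*cursor, delims);

    if (*start == '\0') {
        *cursor = start;
        return NULL;
    }
    *len = strcspn(start, delims); /* token runs to next delimiter or NUL */
    *cursor = start + *len;
    return start;
}
```

Because all state lives in the caller's cursor, two threads (or two nested loops) can tokenize different strings at the same time.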

Strictly speaking, there is one really dangerous function. It is gets() because its input is not under the control of the programmer. All other functions mentioned here are safe in and of themselves. "Good" and "bad" boils down to defensive programming, namely preconditions, postconditions and boilerplate code.
Let's take strcpy() for example. It has some preconditions that the programmer must fulfill before calling the function. Both arguments must be valid, non-NULL pointers, the source must be a zero-terminated string, and the destination must provide enough space for the source plus its terminator, with a final string length inside the range of size_t. Additionally, the strings are not allowed to overlap.
That is quite a lot of preconditions, and none of them is checked by strcpy(). The programmer must be sure they are fulfilled, or he must explicitly test them with additional boilerplate code before calling strcpy():
n = DST_BUFFER_SIZE;
if ((dst != NULL) && (src != NULL) && (strlen(src)+1 <= n))
{
strcpy(dst, src);
}
Already silently assuming the non-overlap and zero-terminated strings.
strncpy() does include some of these checks, but it adds another postcondition the programmer must take care of after calling the function, because the result may not be zero-terminated.
strncpy(dst, src, n);
if (n > 0)
{
dst[n-1] = '\0';
}
Why are these functions considered "bad"? Because they would require additional boilerplate code for each call to really be on the safe side when the programmer assumes wrong about the validity, and programmers tend to forget this code.
Or even argue against it. Take the printf() family. These functions return a status that indicates error or success. Who checks whether the output to stdout or stderr succeeded? The usual argument is that you can't do anything at all when the standard channels are not working. Well, what about rescuing the user data and terminating the program with an error-indicating exit code, instead of the possible alternative of crashing and burning later with corrupted user data?
In a time- and money-limited environment it is always a question of how many safety nets you really want and what the resulting worst-case scenario is. If it is a buffer overflow, as in the case of the str-functions, then it makes sense to forbid them and probably provide wrapper functions with the safety nets already within.
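Such a wrapper might look roughly like the following. It is a sketch only: checked_copy is a hypothetical name, and the choice to leave an empty string behind on failure is an assumption of this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copies src into dst only if every precondition
 * it is able to check holds. Returns 0 on success, -1 otherwise.
 * It still cannot verify that src is zero-terminated or that the two
 * buffers do not overlap. */
int checked_copy(char *dst, size_t dstsize, const char *src)
{
    size_t len;

    if (dst == NULL || src == NULL || dstsize == 0)
        return -1;

    len = strlen(src);
    if (len >= dstsize) {
        dst[0] = '\0';             /* leave a valid empty string behind */
        return -1;
    }
    memcpy(dst, src, len + 1);     /* includes the terminator */
    return 0;
}
```

The boilerplate is written once, and every call site is reduced to checking one return value.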
One final question about this: What makes you sure that your "good" alternatives are really good?

Any function that does not take a maximum length parameter and instead relies on an end marker to be present (such as many 'string' handling functions).
Any method that maintains state between calls.

sprintf is bad, does not check size, use snprintf
gmtime, localtime -- use gmtime_r, localtime_r

To add something about strncpy most people here forgot to mention. strncpy can result in performance problems as it clears the buffer to the length given.
char buff[1000];
strncpy(buff, "1", sizeof buff);
will copy 1 char and overwrite 999 bytes with 0
Another reason why I prefer strlcpy (I know strlcpy is a BSDism but it is so easy to implement that there's no excuse to not use it).
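To show how little code is involved, here is a sketch of the strlcpy() contract. It uses a made-up name, my_strlcpy, so that it does not clash with a libc that already ships strlcpy(); it is not the BSD source.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the strlcpy() contract under a hypothetical name.
 * Copies at most dstsize-1 characters, always terminates when
 * dstsize > 0, writes nothing beyond the terminator, and returns
 * strlen(src) so that a result >= dstsize signals truncation. */
size_t my_strlcpy(char *dst, const char *src, size_t dstsize)
{
    size_t srclen = strlen(src);

    if (dstsize > 0) {
        size_t n = srclen < dstsize ? srclen : dstsize - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}
```

With the 1000-byte buffer above, this writes two bytes instead of a thousand.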

View page 7 (PDF page 9) SAFECode Dev Practices
Edit: From the page -
strcpy family
strncpy family
strcat family
scanf family
sprintf family
gets family

strcpy - again!
Most people agree that strcpy is dangerous, but strncpy is only rarely a useful replacement. It is usually important that you know when you've needed to truncate a string in any case, and for this reason you usually need to examine the length of the source string anyway. If this is the case, usually memcpy is the better replacement as you know exactly how many characters you want copied.
e.g. truncation is error:
n = strlen( src );
if( n >= buflen )
return ERROR;
memcpy( dst, src, n + 1 );
truncation allowed, but number of characters must be returned so caller knows:
n = strlen( src );
if( n >= buflen )
n = buflen - 1;
memcpy( dst, src, n );
dst[n] = '\0';
return n;
