Getting the length of a formatted string from swprintf - c

When using standard char* strings, the snprintf and vsnprintf functions return the length that the output string would have had, even if that string was truncated due to overflow.* It seems like the ISO C committee didn't like this functionality when they added swprintf and vswprintf, which return -1 on overflow.
Does anyone know of a function that will provide this length? I don't know the size of the potential strings. I might be asking too much, but... I'd rather not:
allocate a huge static temp buffer
iteratively allocate and free memory until I've found a size that fits
add an additional library dependency
write my own format string parser
*I realize MSVC doesn't do this, and instead provides the _scprintf/_vscprintf (and, for wide strings, _scwprintf/_vscwprintf) functions, but I'm looking for other compilers, mainly GCC.

My best suggestion to you would be not to use wchar_t strings at all, especially if you're not writing Windows-oriented code. In case that's not an option, here are some other ideas:
If your format string does not contain non-ASCII characters itself, what about first calling vsnprintf with the same set of arguments to get the length in bytes, then using that as a safe upper bound for the length in wchar_t characters (if there are few or no non-ASCII characters, the bound will be tight)?
If you're okay with introducing a dependency on a POSIX function (open_wmemstream is part of POSIX.1-2008), use open_wmemstream and fwprintf.
Just iterate allocating larger buffers, but do it smartly: increase the size geometrically at each step, e.g. 127, 255, 511, 1023, 2047, ... I like this pattern better than whole powers of 2 because it's easier to avoid the dangerous case where an allocation might succeed for SIZE_MAX/2+1 but then the size wraps to 0 at the next iteration. A sketch of this approach follows.
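Here is a minimal sketch of that third approach (the wrapper name vaswprintf and the overflow cap are mine, not anything standard):

#include <stdarg.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Sketch: format into a geometrically growing buffer. Returns a malloc'd
 * wide string (caller frees) or NULL on failure. */
wchar_t *vaswprintf(const wchar_t *fmt, va_list ap)
{
    size_t size = 127;                      /* grows 127, 255, 511, ... */
    wchar_t *buf = NULL;

    for (;;) {
        wchar_t *tmp = realloc(buf, (size + 1) * sizeof *buf);
        if (tmp == NULL) { free(buf); return NULL; }
        buf = tmp;

        va_list aq;
        va_copy(aq, ap);                    /* keep ap reusable for retries */
        int n = vswprintf(buf, size + 1, fmt, aq);
        va_end(aq);

        /* n >= 0 means it fit; beware that a negative result can also mean
         * an encoding error, which this sketch treats as "buffer too small". */
        if (n >= 0)
            return buf;

        if (size >= SIZE_MAX / 2 / sizeof *buf) {
            free(buf);                      /* give up before the size wraps */
            return NULL;
        }
        size = size * 2 + 1;
    }
}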

On some implementations, passing a null buffer and a huge size is reported to return the needed buffer size for wide character strings; note, however, that unlike for snprintf, the C standard does not guarantee this behavior for vswprintf, so treat it as non-portable:
vswprintf(NULL, (size_t)-1, aFormat, argPtr);

Related

Can a C implementation use length-prefixed-strings "under the hood"?

After reading this question: What are the problems of a zero-terminated string that length-prefixed strings overcome? I started to wonder, what exactly is stopping a C implementation from allocating a few extra bytes for any char or wchar_t array allocated on the stack or heap and using them as a "string prefix" to store the number N of its elements?
Then, if the N-th character is '\0', N - 1 would signify the string length.
I believe this could mightily boost performance of functions such as strlen or strcat.
This could potentially turn into extra memory consumption if a program uses non-0-terminated char arrays extensively, but that could be remedied by a compiler flag turning on or off the regular "count-until-you-reach-'\0'" routine for the compiled code.
What are possible obstacles for such an implementation? Does the C Standard allow for this? What problems can this technique cause that I haven't accounted for?
And... has this actually ever been done?
You can store the length of the allocation. And malloc implementations really do do that (or some do, at least).
You can't reasonably store the length of whatever string is stored in the allocation, though, because the user can change the contents at will; it would be unreasonable to keep the length up to date. Furthermore, users might start strings somewhere in the middle of the character array, or might not even be using the array to hold a string!
Then, if the N-th character is '\0', N - 1 would signify the string length.
Actually, no, and that's why this suggestion cannot work.
If I overwrite a character in a string with a 0, I have effectively truncated the string, and a subsequent call of strlen on the string must return the truncated length. (This is commonly done by application programs, including every scanner generated by (f)lex, as well as the strtok standard library function. Amongst others.)
Moreover, it is entirely legal to call strlen on an interior byte of the string.
For example (just for demonstration purposes, although I'll bet you can find code almost identical to this in common use.)
#include <string.h>

/* Split a string like 'key=value...' into key and value parts, and
 * return the value, and optionally its length (if the second argument
 * is not a NULL pointer).
 * On success, returns the value part and modifies the original string
 * so that it is the key.
 * If there is no '=' in the supplied string, neither it nor the value
 * pointed to by plen are modified, and NULL is returned.
 */
char* keyval_split(char* keyval, int* plen) {
    char* delim = strchr(keyval, '=');
    if (delim) {
        if (plen) *plen = (int)strlen(delim + 1);
        *delim = 0;              /* cut the key off at the '=' */
        return delim + 1;
    } else {
        return NULL;
    }
}
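A quick usage sketch (the input is hypothetical; note the argument must be writable, since the '=' gets overwritten):

#include <stdio.h>

int main(void)
{
    char line[] = "user=alice";   /* array, not a pointer to a literal */
    int len;
    char *val = keyval_split(line, &len);
    if (val != NULL)
        printf("key=\"%s\" value=\"%s\" (%d chars)\n", line, val, len);
    return 0;
}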
There's nothing fundamentally stopping you from doing this in your application, if that was useful (one of the comments noted this). There are two problems that would emerge, however:
1. You'd have to reimplement all the string-handling functions, and have my_strlen, my_strcpy, and so on, and add string-creating functions. That might be annoying, but it's a bounded problem.
2. You'd have to stop people, when writing for the system, from deliberately or automatically treating the associated character arrays as 'ordinary' C strings, and using the usual functions on them. You might have to make sure that such usages broke promptly.
This means that it would, I think, be infeasible to smuggle a reimplemented ‘C string’ into an existing program.
Something like
typedef struct {
    size_t len;
    char* buf;
} String;

size_t my_strlen(String*);
...
might work, since type-checking would frustrate (2) (unless someone decided to hack things ‘for efficiency’, in which case there's not much you can do).
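A minimal sketch of what a couple of those functions might look like, assuming the String type above (the my_strnew helper is a made-up name):

#include <stdlib.h>
#include <string.h>

size_t my_strlen(String *s)
{
    return s->len;                       /* O(1): the length is stored */
}

/* Build a String from an ordinary C string. */
String my_strnew(const char *src)
{
    String s;
    s.len = strlen(src);
    s.buf = malloc(s.len + 1);
    if (s.buf != NULL)
        memcpy(s.buf, src, s.len + 1);   /* keep a trailing 0 for fopen() etc. */
    else
        s.len = 0;
    return s;
}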
Of course, you wouldn't do this unless and until you'd proven that string management was the bottleneck in your code and that this approach provably improved things....
There are a couple of issues with this approach. First of all, you wouldn't be able to create arbitrarily long strings. If you only reserve 1 byte for length, then your string can only go up to 255 characters. You can certainly use more bytes to store the length, but how many? 2? 4?
What if you try to concatenate two strings that are both at the edge of their size limits (i.e., if you use 1 byte for length and try to concatenate two 250-character strings to each other, what happens)? Do you simply add more bytes to the length as necessary?
Secondly, where do you store this metadata? It somehow has to be associated with the string. This is similar to the problem Dennis Ritchie ran into when he was implementing arrays in C. Originally, array objects stored an explicit pointer to the first element of the array, but as he added struct types to the language, he realized that he didn't want that metadata cluttering up the representation of the struct object in memory, so he got rid of it and introduced the rule that array expressions get converted to pointer expressions in most circumstances.
You could create a new aggregate type like
struct string
{
    char *data;
    size_t len;
};
but then you wouldn't be able to use the C string library to manipulate objects of that type; an implementation would still have to support the existing interface.
You could store the length in the leading byte or bytes of the string, but how many do you reserve? You could use a variable number of bytes to store the length, but now you need a way to distinguish length bytes from content bytes, and you can't read the first character by simply dereferencing the pointer. Functions like strcat would have to know how to step around the length bytes, how to adjust the contents if the number of length bytes changes, etc.
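To make that trade-off concrete, here is a sketch of the single-length-byte ("Pascal string") layout, with its 255-character ceiling (the pstr helpers are illustrative names, not an established API):

#include <stddef.h>
#include <string.h>

/* Layout: p[0] holds the length (max 255), p[1..] holds the characters. */
typedef unsigned char pstr[256];

void pstr_set(pstr p, const char *src)
{
    size_t n = strlen(src);
    if (n > 255)
        n = 255;                 /* silently truncate: the 1-byte limit */
    p[0] = (unsigned char)n;
    memcpy(p + 1, src, n);
}

size_t pstr_len(const pstr p)
{
    return p[0];                 /* O(1), but p no longer points at the text */
}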
The 0-terminated approach has its disadvantages, but it's also a helluva lot easier to implement and makes manipulating strings a lot easier.
The string methods in the standard library have defined semantics. If one generates an array of char that contains various values, and passes a pointer to the array or a portion thereof, the methods whose behavior is defined in terms of NUL bytes must search for NUL bytes in the same fashion as defined by the standard.
One could define one's own methods for string handling which use a better form of string storage, and simply pretend that the standard library string-related functions don't exist unless one must pass strings to things like fopen. The biggest difficulty with such an approach is that unless one uses non-portable compiler features it would not be possible to use in-line string literals. Instead of saying:
ns_output(my_file, "This is a test"); // ns -- new string
one would have to say something more like:
MAKE_NEW_STRING(this_is_a_test, "This is a test");
ns_output(my_file, this_is_a_test);
where the macro MAKE_NEW_STRING would create a union of an anonymous type, define an instance called this_is_a_test, and suitably initialize it. Since a lot of strings would be of different anonymous types, type-checking would require that strings be unions that include a member of a known array type, and code expecting strings should be given a pointer to that member, likely using something like:
#define ns_output(f,s) (ns_output_func((f),(s).stringref))
It would be possible to define the types in such a way as to avoid the need for the stringref member and have code just accept void*, but the stringref member would essentially perform static duck-typing (only things with a stringref member could be given to such a macro) and could also allow type-checking on the type of stringref itself.
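One hedged guess at what those macros could expand to (all names are illustrative; this leans on the union members' common initial sequence plus the classic "struct hack", so it is a sketch rather than strictly portable code):

#include <stddef.h>
#include <stdio.h>

/* The known type that string-consuming code accepts a pointer to. */
typedef struct { size_t len; char text[1]; } ns_header;

/* Define and initialize a string object of an anonymous union type. */
#define MAKE_NEW_STRING(name, lit)                               \
    static union {                                               \
        struct { size_t len; char text[sizeof(lit)]; } data;     \
        ns_header stringref;                                     \
    } name = { { sizeof(lit) - 1, lit } }

static void ns_output_func(FILE *f, const ns_header *s)
{
    fwrite(s->text, 1, s->len, f);   /* length known: no strlen scan */
}
#define ns_output(f, s) (ns_output_func((f), &(s).stringref))

int main(void)
{
    MAKE_NEW_STRING(this_is_a_test, "This is a test");
    ns_output(stdout, this_is_a_test);
    return 0;
}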
If one could accept those constraints, I think one could probably write code that was more efficient in almost every way than zero-terminated strings; the question would be whether the advantages would be worth the hassle.

'strncpy' vs. 'sprintf'

I can see many sprintf's used in my applications for copying a string.
I have a character array:
char myarray[10];
const char *str = "mystring";
Now if I want to copy the string str into myarray, is it better to use:
sprintf(myarray, "%s", str);
or
strncpy(myarray, str, 8);
?
Neither should be used, at all.
sprintf is dangerous, deprecated, and superseded by snprintf. The only way to use the old sprintf safely with string inputs is to either measure their length before calling sprintf, which is ugly and error-prone, or by adding a field precision specifier (e.g. %.8s or %.*s with an extra integer argument for the size limit). This is also ugly and error-prone, especially if more than one %s specifier is involved.
strncpy is also dangerous. It is not a buffer-size-limited version of strcpy. It's a function for copying characters into a fixed-length, null-padded (as opposed to null-terminated) array, where the source may be either a C string or a fixed-length character array at least the size of the destination. Its intended use was for legacy unix directory tables, database entries, etc. that worked with fixed-size text fields and did not want to waste even a single byte on disk or in memory for null termination. It can be misused as a buffer-size-limited strcpy, but doing so is harmful for two reasons. First of all, it fails to null terminate if the whole buffer is used for string data (i.e. if the source string length is at least as long as the dest buffer). You can add the termination back yourself, but this is ugly and error-prone. And second, strncpy always pads the full destination buffer with null bytes when the source string is shorter than the output buffer. This is simply a waste of time.
So what should you use instead?
Some people like the BSD strlcpy function. Semantically, it's identical to snprintf(dest, destsize, "%s", source) except that the return value is size_t and it does not impose an artificial INT_MAX limit on string length. However, most popular non-BSD systems lack strlcpy, and it's easy to make dangerous errors writing your own, so if you want to use it, you should obtain a safe, known-working version from a trustworthy source.
My preference is to simply use snprintf for any nontrivial string construction, and strlen+memcpy for some trivial cases that have been measured to be performance-critical. If you get in a habit of using this idiom correctly, it becomes almost impossible to accidentally write code with string-related vulnerabilities.
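For example, the snprintf idiom with an explicit truncation check might look like this (the helper name is made up):

#include <stdio.h>

/* Copy src into dst[dstsize]; returns 0 on success, -1 on truncation. */
int str_copy(char *dst, size_t dstsize, const char *src)
{
    int n = snprintf(dst, dstsize, "%s", src);
    return (n >= 0 && (size_t)n < dstsize) ? 0 : -1;
}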
The different versions of printf/scanf are incredibly slow functions, for the following reasons:
They use variable argument lists, which makes parameter passing more complex. This is done through various obscure macros and pointers. All the arguments have to be parsed at runtime to determine their types, which adds extra overhead code. (VA lists are also quite a redundant feature of the language, and dangerous as well, as they have far weaker typing than plain parameter passing.)
They must handle a lot of complex formatting and all the different types supported. This adds plenty of overhead to the function as well. Since all type evaluations are done at runtime, the compiler cannot optimize away parts of the function that are never used. So if you only wanted to print integers with printf(), you get support for floating-point numbers, complex arithmetic, string handling, etc. linked into your program, as a complete waste of space.
Functions like strcpy() and particularly memcpy(), on the other hand, are heavily optimized by the compiler, often implemented in inline assembly for maximum performance.
Some measurements I once made on barebone 16-bit low-end microcontrollers are included below.
As a rule of thumb, you should never use stdio.h in any form of production code. It is to be considered as a debugging/testing library. MISRA-C:2004 bans stdio.h in production code.
EDIT
Replaced subjective numbers with facts:
Measurements of strcpy versus sprintf on target Freescale HCS12, compiler Freescale Codewarrior 5.1. Using the C90 implementation of sprintf; C99 would be even less efficient. All optimizations enabled. The following code was tested:
const char str[] = "Hello, world";
char buf[100];
strcpy(buf, str);
sprintf(buf, "%s", str);
Execution time, including parameter shuffling on/off call stack:
strcpy 43 instructions
sprintf 467 instructions
Program/ROM space allocated:
strcpy 56 bytes
sprintf 1488 bytes
RAM/stack space allocated:
strcpy 0 bytes
sprintf 15 bytes
Number of internal function calls:
strcpy 0
sprintf 9
Function call stack depth:
strcpy 0 (inlined)
sprintf 3
I would not use sprintf just to copy a string. It's overkill, and someone who reads that code would certainly stop and wonder why I did that, and if they (or I) are missing something.
There is one way to use sprintf() (or, if being paranoid, snprintf()) to do a "safe" string copy that truncates instead of overflowing the field or leaving it un-NUL-terminated.
That is to use the "*" format character as "string precision", as follows:
char dest_buff[32];
....
sprintf(dest_buff, "%.*s", (int)sizeof(dest_buff) - 1, unknown_string); /* the '*' precision expects an int */
This places the contents of unknown_string into dest_buff, truncating it if necessary, while allowing space for the terminating NUL.

Disadvantages of strlen() in Embedded Domain

What are the disadvantages of using strlen()?
If, during TCP communication, a NUL character sometimes arrives inside the string, then we find the length of the string only up to that NUL character.
We can't find the actual length of the string.
If we write an alternative to this strlen function, it also stops at the NUL character. So which method can I use to find out the string length in C?
To read from a "TCP Communication" you are probably using read. The prototype for read is
ssize_t read(int fildes, void *buf, size_t nbyte);
and the return value is the number of bytes read (even if the bytes themselves are 0).
So, let's say you're about to read 10 bytes, all of which are 0. You have an array with more than enough to hold all the data
#include <unistd.h>

int fildes;
char data[1000];
ssize_t nbytes;

// fildes = TCP connection
nbytes = read(fildes, data, sizeof data);
Now, by inspecting nbytes you know you have read 10 bytes. If you check data[0] through data[9], you will find they contain 0.
If the runtime library provides strcpy() and strcat(), then surely it provides strlen().
I suspect you are confusing NULL, an invalid pointer value, with the ASCII code NUL, a zero character value which indicates the end of a string to many C runtime functions.
As such, there is no concern for inserting a NUL value in a string, nor in detecting it.
Response to updated question:
Since you seem to be processing binary data, the string functions are not a good fit—unless you can guarantee there are no NULs in the stream. However, for this reason, most TCP/IP messages use headers with fields containing the number of bytes which follow.
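As a sketch of that convention, a receiver can read a fixed-size length field first and then exactly that many payload bytes (the 4-byte big-endian prefix and the function names here are assumptions for illustration):

#include <arpa/inet.h>   /* ntohl */
#include <stdint.h>
#include <unistd.h>

/* Loop over read() until len bytes arrive, or fail on error/EOF. */
static ssize_t recv_all(int fd, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, (char *)buf + got, len - got);
        if (n <= 0)
            return -1;
        got += (size_t)n;
    }
    return (ssize_t)got;
}

/* Read one message framed as: 4-byte big-endian length, then payload. */
ssize_t read_message(int fd, void *buf, size_t bufsize)
{
    uint32_t netlen;
    if (recv_all(fd, &netlen, sizeof netlen) < 0)
        return -1;
    uint32_t len = ntohl(netlen);
    if (len > bufsize)
        return -1;                    /* too large for caller's buffer */
    return recv_all(fd, buf, len);    /* payload may contain NUL bytes */
}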
"Embedded" strikes me as a red herring here.
If you're processing binary data where an embedded NUL might be valid, then you can't expect meaningful results from strlen.
If you're processing strings (as that term is defined in C -- a block of non-NUL data terminated by a NUL) then you can use strlen just fine.
A system being "embedded" would affect this only to the degree that it might be less common to process strings and more common to process binary data.
It is safer to use strnlen instead of strlen to avoid the problems with strlen. The strlen problems are present everywhere, not just in embedded systems. Many of the string functions are dangerous because they go on forever or until a zero is hit, or, like scanf or strtok, go until a pattern is hit.
Remember TCP is a stream, not a packet: you may have to wait for multiple packets and piece together the data before you can attempt to call it a string anyway. That is assuming the payload is an ASCIIZ string in the first place; if it is raw data, then don't use string functions, use some other solution.
Yes, strlen() uses a terminating character \0, aka NUL. Most str* functions do so. There could be a risk that data coming from files/the command line/sockets would not contain this character (usually they won't: they'll be \n-terminated), but their size will also be provided by the read()/recv() function you've used. If that's a concern, you can always use a buffer slightly larger than what is declared to those functions, e.g.
char mybuf[256+4];
ssize_t reallen = read(STDIN_FILENO, mybuf, 256);
if (reallen >= 0)
    mybuf[reallen] = 0;
// we've got a 0-terminated string in mybuf
If your data may not contain \0, compare strlen(mybuf) with reallen and terminate the session with an error code if they differ.
If your data may contain 0, then it should be processed as a buffer and not as a string. Size must be kept aside, and memcpy / memcmp functions should be used instead of strcpy and strcmp.
Also, your network protocol should be very explicit on whether strings or binary data are expected in different parts of the communication. HTTP is, for instance, and it provides many ways to tell the actual size of the transmitted payload.
This isn't specific to "embedded" programs, but it has become a major concern in every program to ensure no remote code/script injection can occur. If by "embedded" you mean you're in a non-preemptive environment and have only limited time available to perform some action... then yeah, you don't want to end up scanning 2 GB of incoming bits for a (never-appearing) \0. Either the above trick or strnlen (mentioned in another answer) can be used to ensure this isn't the case.

Is sscanf considered safe to use?

I have vague memories of suggestions that sscanf was bad. I know it won't overflow buffers if I use the field width specifier, so is my memory just playing tricks with me?
I think it depends on how you're using it: If you're scanning for something like int, it's fine. If you're scanning for a string, it's not (unless there was a width field I'm forgetting?).
Edit:
It's not always safe for scanning strings.
If your buffer size is a constant, then you can certainly specify it as something like %20s. But if it's not a constant, you need to specify it in the format string, and you'd need to do:
char format[80];                          // make sure this is big enough... kinda painful
sprintf(format, "%%%ds", cchBuffer - 1);  // don't miss the percent signs and the "- 1"!
sscanf(input, format, buffer);            // good luck
which is possible but very easy to get wrong, like I did in my previous edit (forgot to take care of the null-terminator). You might even overflow the format string buffer.
The reason why sscanf might be considered bad is that it doesn't require you to specify a maximum string width for string arguments, which can result in overflows if the input read from the source string is longer. So the precise answer is: it is safe if you specify widths properly in the format string, otherwise not.
Note that as long as your buffers are at least as long as strlen(input_string)+1, there is no way the %s or %[ specifiers can overflow. You can also use field widths in the specifiers if you want to enforce stricter limits, or you can use %*s and %*[ to suppress assignment and instead use %n before and after to get the offsets in the original string, and then use those to read the resulting sub-string in-place from the input string.
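A sketch of that %n technique (input and names are made up):

#include <stdio.h>

int main(void)
{
    const char *input = "  alpha beta";
    int start = 0, end = 0;

    /* %n records how many characters have been consumed so far;
     * %*s matches a token without assigning it to any buffer. */
    sscanf(input, " %n%*s%n", &start, &end);
    if (end > start)
        printf("first token: %.*s\n", end - start, input + start);
    return 0;
}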
Yes it is... if you specify the string width, so there are no buffer-overflow-related problems.
Anyway, as @Mehrdad showed us, there will be possible problems if the buffer size isn't established at compile time. I suppose that putting a limit on the length of any string that can be supplied to sscanf could eliminate the problem.
All of the scanf functions have fundamental design flaws, only some of which could be fixed. They should not be used in production code.
Numeric conversion has full-on demons-fly-out-of-your-nose undefined behavior if a value overflows the representable range of the variable you're storing the value in. I am not making this up. The C library is allowed to crash your program just because somebody typed too many input digits. Even if it doesn't crash, it's not obliged to do anything sensible. There is no workaround.
As pointed out in several other answers, %s is just as dangerous as the infamous gets. It's possible to avoid this by using either the 'm' modifier, or a field width, but you have to remember to do that for every single text field you want to convert, and you have to wire the field widths into the format string -- you can't pass sizeof(buff) as an argument.
If the input does not exactly match the format string, sscanf doesn't tell you how many characters into the input buffer it got before it gave up. This means the only practical error-recovery policy is to discard the entire input buffer. This can be OK if you are processing a file that's a simple linear array of records of some sort (e.g. with a CSV file, "skip the malformed line and go on to the next one" is a sensible error recovery policy), but if the input has any more structure than that, you're hosed.
In C, parse jobs that aren't complicated enough to justify using lex and yacc are generally best done either with POSIX regexps (regex.h) or with hand-rolled string parsing. The strto* numeric conversion functions do have well-specified and useful behavior on overflow and do tell you how many characters of input they consumed, and string.h has lots of handy functions for hand-rolled parsers (strchr, strcspn, strsep, etc.).
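For instance, strtol reports overflow through errno and tells you exactly where it stopped (a sketch):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *input = "12345abc";
    char *end;

    errno = 0;
    long v = strtol(input, &end, 10);
    if (errno == ERANGE)
        puts("value out of range for long");  /* well-defined, unlike scanf */
    else if (end == input)
        puts("no digits found");
    else
        printf("parsed %ld; \"%s\" left unconsumed\n", v, end);
    return 0;
}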
There are two points to take care of.
The output buffer[s].
As mentioned by others, if you specify a size smaller than or equal to the output buffer size in the format string, you are safe.
The input buffer.
Here you need to make sure that it is a null-terminated string, or that you will not read more than the input buffer size.
If the input string is not null-terminated, sscanf may read past the boundary of the buffer and crash if the memory is not allocated.

What makes a C standard library function dangerous, and what is the alternative?

While learning C I regularly come across resources which recommend that some functions (e.g. gets()) are never to be used, because they are either difficult or impossible to use safely.
If the C standard library contains a number of these "never-use" functions, it would seem necessary to learn a list of them, what makes them unsafe, and what to do instead.
So far, I've learned that functions which:
Cannot be prevented from overwriting memory
Are not guaranteed to null-terminate a string
Maintain internal state between calls
are commonly regarded as being unsafe to use. Is there a list of functions which exhibit these behaviours? Are there other types of functions which are impossible to use safely?
In the old days, most of the string functions had no bounds checking. Of course they couldn't just delete the old functions or modify their signatures to include an upper bound; that would break compatibility. Now, for almost every one of those functions, there is an alternative "n" version. For example:
strcpy -> strncpy
strlen -> strnlen
strcmp -> strncmp
strcat -> strncat
strdup -> strndup
sprintf -> snprintf
wcscpy -> wcsncpy
wcslen -> wcsnlen
And more.
See also https://github.com/leafsr/gcc-poison which is a project to create a header file that causes gcc to report an error if you use an unsafe function.
Yes, fgets(..., ..., stdin) is a good alternative to gets(), because it takes a size parameter (gets() has in fact been removed from the C standard entirely in C11). Note that fgets() is not exactly a drop-in replacement for gets(), because the former will include the terminating \n character if there was room in the buffer for a complete line to be read.
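A common pattern is to strip that newline right after the call, e.g. with strcspn (sketch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[128];
    if (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';  /* drop the trailing \n, if any */
        printf("read: \"%s\"\n", line);
    }
    return 0;
}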
scanf() is considered problematic in some cases, rather than straight-out "bad", because if the input doesn't conform to the expected format it can be impossible to recover sensibly (it doesn't let you rewind the input and try again). If you can just give up on badly formatted input, it's useable. A "better" alternative here is to use an input function like fgets() or fgetc() to read chunks of input, then scan it with sscanf() or parse it with string handling functions like strchr() and strtol(). Also see below for a specific problem with the "%s" conversion specifier in scanf().
It's not a standard C function, but the BSD and POSIX function mktemp() is generally impossible to use safely, because there is always a TOCTTOU race condition between testing for the existence of the file and subsequently creating it. mkstemp() or tmpfile() are good replacements.
strncpy() is a slightly tricky function, because it doesn't null-terminate the destination if there was no room for it. Despite the apparently generic name, this function was designed for creating a specific style of string that differs from ordinary C strings - strings stored in a known fixed width field where the null terminator is not required if the string fills the field exactly (original UNIX directory entries were of this style). If you don't have such a situation, you probably should avoid this function.
atoi() can be a bad choice in some situations, because you can't tell when there was an error doing the conversion (e.g., if the number exceeded the range of an int). Use strtol() if this matters to you.
strcpy(), strcat() and sprintf() suffer from a similar problem to gets() - they don't allow you to specify the size of the destination buffer. It's still possible, at least in theory, to use them safely - but you are much better off using strncat() and snprintf() instead (you could use strncpy(), but see above). Do note that whereas the n for snprintf() is the size of the destination buffer, the n for strncat() is the maximum number of characters to append and does not include the null terminator. Another alternative, if you have already calculated the relevant string and buffer sizes, is memmove() or memcpy().
On the same theme, if you use the scanf() family of functions, don't use a plain "%s" - specify the size of the destination e.g. "%200s".
strtok() is generally considered to be evil because it stores state information between calls. Don't try running THAT in a multithreaded environment!
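Where available, the POSIX strtok_r variant keeps that state in a caller-supplied pointer instead (sketch):

#define _POSIX_C_SOURCE 200809L  /* strtok_r is POSIX, not ISO C */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char csv[] = "red,green,blue";   /* strtok_r modifies its input */
    char *save = NULL;               /* per-caller state, no hidden globals */

    for (char *tok = strtok_r(csv, ",", &save);
         tok != NULL;
         tok = strtok_r(NULL, ",", &save))
        puts(tok);
    return 0;
}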
Strictly speaking, there is one really dangerous function. It is gets() because its input is not under the control of the programmer. All other functions mentioned here are safe in and of themselves. "Good" and "bad" boils down to defensive programming, namely preconditions, postconditions and boilerplate code.
Let's take strcpy() for example. It has some preconditions that the programmer must fulfill before calling the function. The source must be a valid, non-NULL pointer to a zero-terminated string, and the destination must be a valid, non-NULL pointer to a buffer providing enough space, with the final string length inside the range of size_t. Additionally, the buffers are not allowed to overlap.
That is quite a lot of preconditions, and none of them is checked by strcpy(). The programmer must be sure they are fulfilled, or he must explicitly test them with additional boilerplate code before calling strcpy():
n = DST_BUFFER_SIZE;
if ((dst != NULL) && (src != NULL) && (strlen(src) + 1 <= n))
{
    strcpy(dst, src);
}
This already silently assumes non-overlapping buffers and a zero-terminated source string.
strncpy() does include some of these checks, but it adds another postcondition the programmer must take care of after calling the function, because the result may not be zero-terminated.
strncpy(dst, src, n);
if (n > 0)
{
    dst[n-1] = '\0';
}
Why are these functions considered "bad"? Because they would require additional boilerplate code for each call to really be on the safe side whenever the programmer's assumptions about validity might be wrong, and programmers tend to forget this code.
Or people even argue against checking. Take the printf() family. These functions return a status that indicates error or success. Who checks whether the output to stdout or stderr succeeded? The argument goes that you can't do anything at all when the standard channels are not working. Well, what about rescuing the user data and terminating the program with an error-indicating exit code, instead of the possible alternative of crashing and burning later with corrupted user data?
In a time- and money-limited environment, it is always a question of how many safety nets you really want, and what the resulting worst-case scenario is. If it is a buffer overflow, as in the case of the str functions, then it makes sense to forbid them and probably provide wrapper functions with the safety nets already built in.
One final question about this: What makes you sure that your "good" alternatives are really good?
Any function that does not take a maximum length parameter and instead relies on an end-of-string marker being present (such as many 'string' handling functions).
Any method that maintains state between calls.
sprintf is bad, does not check size, use snprintf
gmtime, localtime -- use gmtime_r, localtime_r
To add something about strncpy that most people here forgot to mention: strncpy can result in performance problems, as it clears the buffer out to the length given.
char buff[1000];
strncpy(buff, "1", sizeof buff);
will copy 1 char and overwrite 999 bytes with 0
Another reason why I prefer strlcpy (I know strlcpy is a BSDism, but it is so easy to implement that there's no excuse not to use it).
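For reference, a minimal sketch of it (semantics per the BSD man page: always NUL-terminates when size > 0 and returns strlen(src), so a return value >= size signals truncation); named my_strlcpy here to avoid clashing with a system-provided one:

#include <string.h>

size_t my_strlcpy(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);
    if (size > 0) {
        size_t n = (srclen < size - 1) ? srclen : size - 1;
        memcpy(dst, src, n);    /* copy only what fits... */
        dst[n] = '\0';          /* ...and always terminate */
    }
    return srclen;              /* >= size means truncation occurred */
}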
View page 7 (PDF page 9) SAFECode Dev Practices
Edit: From the page -
strcpy family
strncpy family
strcat family
scanf family
sprintf family
gets family
strcpy - again!
Most people agree that strcpy is dangerous, but strncpy is only rarely a useful replacement. It is usually important that you know when you've needed to truncate a string, and for this reason you usually need to examine the length of the source string anyway. If this is the case, memcpy is usually the better replacement, as you know exactly how many characters you want copied.
e.g., truncation is an error (sketched here as a small function; the name is illustrative):
int copy_exact(char *dst, size_t buflen, const char *src)
{
    size_t n = strlen(src);
    if (n >= buflen)
        return ERROR;          /* ERROR: caller-defined failure code */
    memcpy(dst, src, n + 1);   /* the +1 also copies the terminating NUL */
    return 0;
}
truncation allowed, but the number of characters must be returned so the caller knows:
size_t copy_truncate(char *dst, size_t buflen, const char *src)
{
    size_t n = strlen(src);
    if (n >= buflen)
        n = buflen - 1;        /* truncate to what fits */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;                  /* caller compares with strlen(src) */
}
