While trying to output some counted strings, I encountered the following asymmetry between Microsoft's sprintf() and wsprintf() functions:
#define _UNICODE
sprintf(buff, "%.3s", "abcdef"); //Outputs: "abc"
sprintf(buff, "%.*s", 3, "abcdef"); //Outputs: "abc"
wsprintf(buff, L"%.3s", L"abcdef"); //Outputs: L"abc"
wsprintf(buff, L"%.*s", 3, L"abcdef"); //Outputs: L"*s"
Note that the last wsprintf() does not output L"abc" like its narrow sister function sprintf() with the same (but wide) arguments.
Q: Is this a bug or feature?
Note: This is similar to the issue described here:
Formatting differences between sprintf() and wsprintf() in VS2015
wsprintf is old, very old. Its documentation does not list * as a precision, so don't pass that format string to wsprintf. What your test does is technically unspecified.
Please note that wsprintf will not write more than 1023 characters to buff, followed by the null character, and that it works in UCS-2 rather than UTF-16. The design of this function is that you pass it a fixed-size stack buffer of 1024 characters and don't worry about buffer overflow, since it truncates for you.
As far as I can tell, it is intended more for building debug messages to pass to MessageBox than for actual application use. It's a much-reduced form of snprintf with a fixed n, implemented independently of the standard libraries.
Ok so you want a swprintf that always null terminates. Try this:
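#include <stdarg.h>
#include <wchar.h>

int swprintf2(wchar_t *ws, size_t len, const wchar_t *format, ...)
{
    va_list arg;
    int ret;
    va_start(arg, format);
    ws[len - 1] = 0;    /* terminator survives even if the output is truncated */
    ret = vswprintf(ws, len - 1, format, arg);
    va_end(arg);
    return ret;
}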
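For the counted-string case from the question, here is a minimal sketch that uses swprintf instead of wsprintf. It assumes a C99-conforming swprintf that takes the buffer length. format_counted is a hypothetical helper name, and %ls is used because it denotes a wide string in both the standard and the Microsoft wide printf functions.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: write at most `count` wide characters of `src`
   into `dst` by passing the precision through a '*' argument. */
int format_counted(wchar_t *dst, size_t dst_len, const wchar_t *src, int count)
{
    assert(dst != NULL && dst_len > 0 && src != NULL && count >= 0);
    return swprintf(dst, dst_len, L"%.*ls", count, src);
}
```

Calling format_counted(buff, 16, L"abcdef", 3) should leave L"abc" in buff, which is the result the question expected from wsprintf.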
The following are bare minimum examples (I know that e.g. UNICODE/_UNICODE should be defined) that I've found to work:
Linux:
#include <stdio.h>
int main() {
char* str = "Rölf";
printf("%s\n", str);
}
Windows:
#include <stdio.h>
#include <locale.h>
int main() {
setlocale(LC_ALL, "");
wchar_t* str = L"Rölf";
wprintf(L"%s\n", str);
}
Now, I've read that one way of going about it is to basically "just use UTF-8/char everywhere and worry about platform-specific conversion when you do API calls".
And that would be great - have users provide char* as input for my library and "simply" convert that. So I've tried the following snippet based on this example (I've also seen it in variations elsewhere). If this would actually work, it would be amazing. But it doesn't:
char* str = u8"Rölf";
int len = mbstowcs(NULL, str, 0) + 1;
wchar_t wstr[len];
mbstowcs(wstr, str, len);
wprintf(L"%s\n", wstr);
I've also stumbled across discussions about console fonts and whatnot being the cause of faulty rendering, so to demonstrate that this is not a console issue - the following doesn't work either (well - the L"" literal does. The converted u8 literal doesn't):
MessageBoxW(NULL, wstr, L"Rölf", MB_OK);
Am I misunderstanding the conversion process? Is there a way to make this work? (Without using e.g. ICU)
The mbstowcs function converts from a string encoded in the current locale's encoding to wchar_t[], not from UTF-8 (unless that encoding is UTF-8). On Windows 10 builds from the April 2018 beta onward (version 1803+), you actually can set Windows to use UTF-8 as the encoding for plain char[] strings, either as a global setting or presumably by calling _setmbcp(65001). Older versions of Windows explicitly forbid this, however, for dubious historical reasons.
Anyway, your second version of the code, which you called "Windows", should work on arbitrary systems if not for a bug in MSVC's wprintf that you worked around: they have the meanings of %ls and %s backwards for the wide stdio functions. In standard C, you need %ls to format a wchar_t[] string. But there's actually no reason to use wprintf there at all, and in fact wprintf is highly problematic because you can't mix it with byte-oriented stdio on the same stream (doing so invokes undefined behavior). So better would be:
#include <stdio.h>
#include <locale.h>
int main() {
setlocale(LC_ALL, "");
wchar_t* str = L"Rölf";
printf("%ls\n", str);
}
and this version should work correctly on Windows and standards-conforming C implementations, since for the byte-oriented printf functions, MSVC doesn't have the meaning of %s and %ls reversed.
If you really want to, you can also use a variant of your third version of the code, but you can't use mbstowcs to convert from UTF-8 to wchar_t. Instead you need to either:
Assume wchar_t is Unicode-encoded, and convert from UTF-8 to Unicode codepoints with your own (or a third-party library's) UTF-8 decoder. But this is a bad assumption, because MSVC is also non-conforming in that it uses UTF-16 for wchar_t (C explicitly forbids "multi-wchar_t characters", because the mb/wc APIs are inherently incompatible with them), not Unicode codepoint values (equivalent to UTF-32).
Convert from UTF-8 to char32_t (UTF-32) with your own (or a third-party library's) UTF-8 decoder, then use c32rtomb to convert to the locale's multibyte encoding and mbstowcs to convert that to wchar_t[].
Use iconv (standard on POSIX systems; available as a third-party library on Windows) to convert directly from UTF-8 to wchar_t.
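The "own UTF-8 decoder" mentioned in the options above can be quite small. The following is only a sketch under stated assumptions: utf8_decode is a hypothetical name, the input is a null-terminated string, and uint32_t stands in for char32_t so that the code does not depend on <uchar.h> being available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the UTF-8 string `s`.
   Stores it in *out and returns the number of bytes consumed (1 to 4),
   or 0 if the sequence is malformed. */
size_t utf8_decode(const char *s, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    size_t len, i;

    assert(s != NULL && out != NULL);
    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)  /* also stops at an early '\0' */
            return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = cp;
    return len;
}
```

Calling it in a loop and advancing by the returned length yields the code points one by one. Turning those into wchar_t[] still depends on what the platform's wchar_t holds, as discussed above.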
UTF8 option for Windows 10, version 1803+
Thanks to Barmak Shemirani making me aware of MultiByteToWideChar, I've found a solution to this that is even C99-conforming (and that works on Windows 7, by the way).
Note that setlocale() is only necessary for console output to render correctly. I didn't use it to highlight that it doesn't seem to be needed for GUI-related API calls.
#define UNICODE
#define _UNICODE
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
//#include <locale.h>
wchar_t* toWide(char* str) {
int wchars_num = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);
wchar_t* wstr = (wchar_t*)malloc(sizeof(wchar_t) * wchars_num);
MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wchars_num);
return wstr;
}
int main() {
// For output in console to render correctly - as far as the font allows anyway...
//setlocale(LC_ALL, "");
// PLATFORM-AGNOSTIC DATA STRUCTURE WITH UTF-8 TEXT
// (Usually not directly next to the platform-specific API calls...)
char* str = "Rölf";
// PLATFORM-SPECIFIC TEXT HANDLING
wchar_t* wstr = toWide(str);
printf("%ls\n", wstr);
MessageBox(NULL, wstr, L"Rölf", MB_OK);
free(wstr);
}
The way I use it is that I declare a data structure to be filled by my users where all text is char* and assumed to be UTF-8. Then in my library, I use platform-specific UI APIs. And in the case of Windows, doing the above UTF-16 conversion is obviously necessary.
I can specify the maximum amount of characters for scanf to read to a buffer using this technique:
char buffer[64];
/* Read one line of text to buffer. */
scanf("%63[^\n]", buffer);
But what if we do not know the buffer length when we write the code? What if it is the parameter of a function?
void function(FILE *file, size_t n, char buffer[n])
{
/* ... */
fscanf(file, "%[^\n]", buffer); /* WHAT NOW? */
}
This code is vulnerable to buffer overflows as fscanf does not know how big the buffer is.
I remember seeing this before and started to think that it was the solution to the problem:
fscanf(file, "%*[^\n]", n, buffer);
My first thought was that the * in "%*[^\n]" meant that the maximum string size is passed as an argument (in this case n). This is the meaning of the * in printf.
When I checked the documentation for scanf I found out that it means that scanf should discard the result of [^\n].
This left me somewhat disappointed as I think that it would be a very useful feature to be able to pass the buffer size dynamically for scanf.
Is there any way I can pass the buffer size to scanf dynamically?
Basic answer
There isn't an analog to the printf() format specifier * in scanf().
In The Practice of Programming, Kernighan and Pike recommend using snprintf() to create the format string:
size_t sz = 64;
char format[32];
snprintf(format, sizeof(format), "%%%zus", sz);
if (scanf(format, buffer) != 1) { …oops… }
Extra information
Upgrading the example to a complete function:
int read_name(FILE *fp, char *buffer, size_t bufsiz)
{
char format[16];
snprintf(format, sizeof(format), "%%%zus", bufsiz - 1);
return fscanf(fp, format, buffer);
}
This emphasizes that the size in the format specification is one less than the size of the buffer (it is the number of non-null characters that can be stored without counting the terminating null). Note that this is in contrast to fgets() where the size (an int, incidentally; not a size_t) is the size of the buffer, not one less. There are multiple ways of improving the function, but it shows the point. (You can replace the s in the format with [^\n] if that's what you want.)
Also, as Tim Čas noted in the comments, if you want (the rest of) a line of input, you're usually better off using fgets() to read the line, but remember that it includes the newline in its output (whereas %63[^\n] leaves the newline to be read by the next I/O operation). For more general scanning (for example, 2 or 3 strings), this technique may be better — especially if used with fgets() or getline() and then sscanf() to parse the input.
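A minimal sketch of the fgets() route for the "rest of a line" case, with the buffer size passed at run time. read_line is a hypothetical helper name. The sketch assumes that a line longer than the buffer should simply be split across successive calls.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into buffer[0..n-1] and drop the
   newline that fgets() keeps. Returns 1 on success, 0 on end-of-file
   or read error. */
int read_line(FILE *fp, char *buffer, size_t n)
{
    assert(fp != NULL && buffer != NULL && n > 0 && n <= INT_MAX);
    if (fgets(buffer, (int)n, fp) == NULL)   /* fgets takes the full buffer size */
        return 0;
    buffer[strcspn(buffer, "\n")] = '\0';    /* strip the newline if present */
    return 1;
}
```

The resulting buffer can then be handed to sscanf() for further parsing, which keeps the size handling in one place.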
Also, the TR 24731-1 'safe' functions, implemented by Microsoft (more or less) and standardized in Annex K of ISO/IEC 9899:2011 (the C11 standard), require a length explicitly:
if (scanf_s("%[^\n]", buffer, sizeof(buffer)) != 1)
...oops...
This avoids buffer overflows, but probably generates an error if the input is too long. The size could/should be specified in the format string as before:
if (scanf_s("%63[^\n]", buffer, sizeof(buffer)) != 1)
...oops...
if (scanf_s(format, buffer, sizeof(buffer)) != 1)
...oops...
Note that the warning (from some compilers under some sets of flags) about 'non-constant format string' has to be ignored or suppressed for code using the generated format string.
There is indeed no variable width specifier in the scanf family of functions. Alternatives include creating the format string dynamically (though this seems a bit silly if the width is a compile-time constant) or simply accepting the magic number. One possibility is to use preprocessor macros for specifying both the buffer and format string width:
#define STR_VALUE(x) STR(x)
#define STR(x) #x
#define MAX_LEN 63
char buffer[MAX_LEN + 1];
fscanf(file, "%" STR_VALUE(MAX_LEN) "[^\n]", buffer);
Another option is to #define the length of the string:
#define STRING_MAX_LENGTH "%10s"
or
#define DOUBLE_LENGTH "%5lf"
I'm rewriting some code to be compatible with 32-bit and 64-bit architectures, and I'm having trouble with a vsnprintf call. It doesn't appear that vsnprintf handles the fixed size integer types from inttypes.h properly on either architecture.
Here is the relevant code:
void formatString(char *buffer, int size, char *format, ...)
{
va_list va;
/* Format the data */
va_start( va, format );
vsnprintf( (char *)buffer, size, format, va );
va_end( va );
}
int main(int argc, char *argv[])
{
char buffer[2048];
printf("The format string: %s\n", stringsLookup(0));
formatString(&buffer[0], sizeof(buffer), stringsLookup(0), 1, 2);
printf("The output string: %s\n", buffer);
return 0;
}
The output is as follows:
The format string: action=DoSomething&Val1=%"PRIx32"&Val2=%x
The output string: action=DoSomething&Val1=%"PRIx32"&Val2=1
You can see that the %"PRIx32" portion of the format string was not replaced with the value '1' as expected. Is this a known issue? Is there a workaround?
I will mention that if I hard code the strings in the source, the preprocessor appears to convert "%PRIu32" to the appropriate macro for the architecture and the call to vsnprintf works. Unfortunately I need to be able to load the strings.
Update
Some additional background: When I moved from a 32-bit system to a 64-bit system, I had to fix the size of certain variables. I declared them as uint32_t. I also changed the places where they were printed to clean up compiler warnings. The previous code used printf("%lx"). I used printf("%"PRIx32). I need to do something similar with the call to vsnprintf.
As I've mentioned, if I hard code the string in the source code, the preprocessor appears to convert "%"PRIx32 to "%lx" or "%x" appropriately. Unfortunately, I'm running into trouble when I have to load the strings from a file. The preprocessor can't help me there.
PRIx32 is a macro whose name should not appear textually even in the format string. You are almost certainly using it wrong, unless it expands to a string that contains "PRIx32" (it almost certainly doesn't).
A typical use is printf("Number: %" PRIx32 "...", arg);.
In the typical idiom above, "Number: %" PRIx32 "..." is expanded to, say, "Number: %" "lx" "...", which, because C concatenates adjacent string literals, is equivalent to "Number: %lx..."
If you need to create the format string dynamically, use strcat or other string-manipulation functions. Do not write the equivalent of "Number: %\"PRIx32\"...".
Just remember that PRIx32 expands to a string literal, and don't write "%PRIx32"; that does not make sense.
EDIT:
If you are loading the format string from a file (information that I suggested you provide in a comment 45 minutes ago), then you have to do your own substitution when the file is loaded. Invent a syntax similar to the % syntax of printf, and write your own function to recognize it and substitute it with what is right on the architecture the program is running on.
Note that from a security point of view, if you load format strings from a file, whoever controls the file controls what the program does.
Also note that printf("Number: %llx\n", (unsigned long long) e); almost always works. It can only disappoint you if your compiler has an integer type wider than unsigned long long and you use it.
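As a rough sketch of the "do your own substitution" idea for loaded strings: the token syntax %{x32} and the function name expand_format are invented here for illustration, and only this one token is handled. The replacement text comes from the PRIx32 macro, so it is whatever is right for the platform the program was compiled for.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical syntax: the token "%{x32}" in a loaded string stands for
   "%" PRIx32 on this platform. Returns 0 on success, or -1 if the
   expanded string does not fit in dst. */
int expand_format(char *dst, size_t dst_size, const char *src)
{
    static const char token[] = "%{x32}";
    static const char repl[]  = "%" PRIx32;
    size_t used = 0;

    assert(dst != NULL && src != NULL);
    while (*src != '\0') {
        const char *piece = src;      /* default: copy one character */
        size_t piece_len = 1;
        if (strncmp(src, token, sizeof token - 1) == 0) {
            piece = repl;             /* copy the platform's specifier instead */
            piece_len = sizeof repl - 1;
            src += sizeof token - 1;
        } else {
            src++;
        }
        if (used + piece_len >= dst_size)
            return -1;                /* keep room for the terminator */
        memcpy(dst + used, piece, piece_len);
        used += piece_len;
    }
    if (dst_size == 0)
        return -1;
    dst[used] = '\0';
    return 0;
}
```

The expanded string can then be passed to vsnprintf as before. The security caveat above still applies: whoever controls the file controls the format string.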
I am using snprintf like this to avoid a buffer overrun:
char err_msg[32] = {0};
snprintf(err_msg, sizeof(err_msg) - 1, "[ ST_ENGINE_FAILED ]");
I added the -1 to reserve space for the null terminator in case the string is more than 32 bytes long.
Am I correct in my thinking?
Platform:
GCC 4.4.1
C99
As others have said, you do not need the -1 in this case. If the array is fixed size, I would use strncpy instead. It was made for copying strings, whereas sprintf was made for doing difficult formatting. However, snprintf is the right tool if the size of the array is unknown or if you are trying to determine how much storage a formatted string needs. This is what I really like about the Standard-specified version of snprintf:
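#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *get_error_message(char const *msg) {
    int err = errno;  /* save errno: later calls such as malloc may change it */
    int needed = snprintf(NULL, 0, "%s: %s (%d)", msg, strerror(err), err);
    char *buffer = needed < 0 ? NULL : malloc((size_t)needed + 1);
    if (buffer != NULL)
        snprintf(buffer, (size_t)needed + 1, "%s: %s (%d)", msg, strerror(err), err);
    return buffer;
}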
Combine this feature with va_copy and you can create very safe formatted string operations.
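One way the va_copy combination might look, as a sketch rather than a definitive implementation. format_alloc is a hypothetical name, and the caller is assumed to free() the result.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc()ed formatted string,
   or NULL on failure. The caller frees it. */
char *format_alloc(const char *fmt, ...)
{
    va_list args, args_copy;
    char *buffer = NULL;
    int needed;

    assert(fmt != NULL);
    va_start(args, fmt);
    va_copy(args_copy, args);        /* the measuring pass consumes `args` */
    needed = vsnprintf(NULL, 0, fmt, args);
    va_end(args);

    if (needed >= 0 && (buffer = malloc((size_t)needed + 1)) != NULL)
        vsnprintf(buffer, (size_t)needed + 1, fmt, args_copy);
    va_end(args_copy);
    return buffer;
}
```

The copy is needed because a va_list cannot be reused after vsnprintf has walked it. The first call only measures, and the second call formats into a buffer of exactly the measured size.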
You don't need the -1, as the reference states:
The functions snprintf() and vsnprintf() do not write more than size bytes (including the trailing '\0').
Note the "including the trailing '\0'" part.
No need for -1. C99 snprintf always zero-terminates. Size argument specifies the size of output buffer including zero terminator. The code, thus, becomes
char err_msg[32];
int ret = snprintf(err_msg, sizeof err_msg, "[ ST_ENGINE_FAILED ]");
ret contains the number of characters that would have been written had the buffer been large enough (excluding the zero terminator), so a value of sizeof err_msg or more means the output was truncated.
However, do not confuse this with Microsoft's _snprintf (pre-C99), which does not null-terminate when the output fills or exceeds the buffer and, for that matter, behaves completely differently (e.g. it returns -1 instead of the would-be printed length if the buffer is not big enough). If using _snprintf, you should be using the same code as in your question.
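A small sketch of checking the C99 return value for truncation. format_status and the message layout are made up for illustration, and a C99-conforming snprintf is assumed (not _snprintf).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the whole message fit, 0 if it was
   truncated, and -1 if snprintf reported an error. */
int format_status(char *dst, size_t dst_size, const char *name, int code)
{
    int ret;

    assert(dst != NULL && dst_size > 0 && name != NULL);
    ret = snprintf(dst, dst_size, "[ %s: %d ]", name, code);
    if (ret < 0)
        return -1;
    /* ret counts the characters the full output needs, without the '\0'. */
    return (size_t)ret < dst_size;
}
```

Even in the truncated case the buffer is null-terminated, so it is safe to print what was written.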
According to snprintf(3):
The functions snprintf() and vsnprintf() do not write more than size bytes (including the trailing '\0').
For the example given, you should be doing this instead:
char err_msg[32];
strncpy(err_msg, "[ ST_ENGINE_FAILED ]", sizeof(err_msg));
err_msg[sizeof(err_msg) - 1] = '\0';
or even better:
char err_msg[32] = "[ ST_ENGINE_FAILED ]";
sizeof returns the number of bytes an object occupies in memory, not the length of the string it holds. For example, sizeof(int) is 4 on a typical 32-bit system (it depends on the implementation). Since err_msg is an array with a constant size, you can happily pass sizeof(err_msg) to snprintf.
The following code causes an error and kills my application. It makes sense as the buffer is only 10 bytes long and the text is 22 bytes long (buffer overflow).
char buffer[10];
int length = sprintf_s( buffer, 10, "1234567890.1234567890." );
How do I catch this error so I can report it instead of crashing my application?
Edit:
After reading the comments below I went with _snprintf_s. If it returns a -1 value then the buffer was not updated.
length = _snprintf_s( buffer, 10, 9, "123456789" );
printf( "1) Length=%d\n", length ); // Length == 9
length = _snprintf_s( buffer, 10, 9, "1234567890.1234567890." );
printf( "2) Length=%d\n", length ); // Length == -1
length = _snprintf_s( buffer, 10, 10, "1234567890.1234567890." );
printf( "3) Length=%d\n", length ); // Crash, it needs room for the NULL char
It's by design. The entire point of sprintf_s, and other functions from the *_s family, is to catch buffer overrun errors and treat them as precondition violations. This means that they're not really meant to be recoverable. This is designed to catch errors only - you shouldn't ever call sprintf_s if you know the string can be too large for a destination buffer. In that case, use strlen first to check and decide whether you need to trim.
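The "check with strlen first and decide whether to trim" advice could look roughly like this. copy_trimmed is a hypothetical name, and the sketch covers only the plain-string case from the question, not general formatting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, trimming it if it is too long.
   Returns 1 if the text was trimmed and 0 if it fit as is. */
int copy_trimmed(char *dst, size_t dst_size, const char *src)
{
    size_t len;
    int trimmed = 0;

    assert(dst != NULL && dst_size > 0 && src != NULL);
    len = strlen(src);
    if (len >= dst_size) {          /* no room for the text plus '\0' */
        len = dst_size - 1;
        trimmed = 1;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return trimmed;
}
```

The return value lets the caller report the overflow instead of crashing, which is what the question asked for.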
Instead of sprintf_s, you could use snprintf (a.k.a. _snprintf on Windows).
#ifdef WIN32
#define snprintf _snprintf
#endif
char buffer[10];
int length = snprintf( buffer, 10, "1234567890.1234567890." );
// unix snprintf returns length output would actually require;
// windows _snprintf returns actual output length if output fits, else negative
if (length < 0 || (size_t)length >= sizeof(buffer))
{
/* error handling */
}
This works with VC++ and is even safer than using snprintf (and certainly safer than _snprintf):
void TestString(const char* pEvil)
{
char buffer[100];
_snprintf_s(buffer, _TRUNCATE, "Some data: %s\n", pEvil);
}
The _TRUNCATE flag indicates that the string should be truncated. In this form the size of the buffer isn't actually passed in, which (paradoxically!) is what makes it so safe. The compiler uses template magic to infer the buffer size which means it cannot be incorrectly specified (a surprisingly common error). This technique can be applied to create other safe string wrappers, as described in my blog post here:
https://randomascii.wordpress.com/2013/04/03/stop-using-strncpy-already/
From MSDN:
The other main difference between sprintf_s and sprintf is that sprintf_s takes a length parameter specifying the size of the output buffer in characters. If the buffer is too small for the text being printed then the buffer is set to an empty string and the invalid parameter handler is invoked. Unlike snprintf, sprintf_s guarantees that the buffer will be null-terminated (unless the buffer size is zero).
So ideally what you've written should work correctly.
Looks like you're writing on MSVC of some sort?
I think the MSDN docs for sprintf_s say that it dies on an assertion, so I'm not too sure you can programmatically catch that.
As LBushkin suggested, you're much better off using classes that manage the strings.
See section 6.6.1 of TR24731, which is the ISO C Committee version of the functionality implemented by Microsoft. It provides the functions set_constraint_handler(), abort_constraint_handler() and ignore_constraint_handler() (C11 Annex K names them set_constraint_handler_s(), abort_handler_s() and ignore_handler_s()).
There are comments from Pavel Minaev suggesting that the Microsoft implementation does not adhere to the TR24731 proposal (which is a 'Type 2 Tech Report'), so you may not be able to intervene, or you may have to do something different from what the TR indicates should be done. For that, scrutinize MSDN.