I'm currently learning C and lately, I have been focusing on the topic of character encoding. Note that I'm a Windows programmer. While I currently test my code only on Windows, I want to eventually port it to Linux and macOS, so I'm trying to learn the best practices right now.
In the example below, I store a file path in a wchar_t variable to be opened later with _wfopen. I need to use _wfopen because my file path may contain characters that are not in my default code page. Afterwards, the file path and a text literal are stored inside a char variable named message for further use. My understanding is that you can store a wide string into a multibyte string with the %ls modifier.
char message[8094] = "";
wchar_t file_path[4096] = L"C:\\test\\test.html";
sprintf(message, "Accessing: %ls\n", file_path);
While the code works, GCC/MinGW outputs the following warning and notes:
warning: '%ls' directive writing up to 49146 bytes into a region of size 8083 [-Wformat-overflow=]
note: assuming directive output of 16382 bytes
note: 'sprintf' output between 13 and 49159 bytes into a destination of size 8094
My issue is that I simply do not understand how sprintf could output up to 49159 bytes into the message variable. I output the Accessing: string literal, the file_path variable, the \n char and the \0 char. What else is there to output?
Sure, I could declare message as a wchar_t variable and use wsprintf instead of sprintf, but my understanding is that wchar_t does not make for nice portable code. As such, I'm trying to avoid using it unless it's required by a specific API.
So, what am I missing?
The warning doesn't take into account the actual contents of file_path; it is calculated on the assumption that file_path could hold any possible content. There would be an overflow if file_path consisted of 4095 emoji and a null terminator.
Using %ls in the narrow printf family converts the source to multibyte characters, and each wide character can take several bytes in the output.
To avoid this warning you could:
disable it with -Wno-format-overflow
use snprintf instead of sprintf
The latter is always a good idea IMHO, it is always good to have a second line of defence against mistakes introduced in code maintenance later (e.g. someone comes along and changes the code to grab a path from user input instead of hardcoded value).
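As a minimal sketch of the snprintf option, the helper below bounds the write and checks the return value. The name format_access_message and its return codes are made up for illustration and are not part of the question's code. It assumes a C99-conforming snprintf; as noted elsewhere in this answer, MinGW builds that defer to MSVCRT may behave differently.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper (not from the question): formats "Accessing: <path>\n"
   into out, never writing more than out_size bytes.
   Returns 0 on success, 1 if the output was truncated, and -1 if a wide
   character could not be converted in the current locale. */
int format_access_message(char *out, size_t out_size, const wchar_t *file_path)
{
    int needed = snprintf(out, out_size, "Accessing: %ls\n", file_path);

    if (needed < 0)
        return -1;                    /* encoding error during conversion */
    if ((size_t)needed >= out_size)
        return 1;                     /* message did not fit; out holds a prefix */
    return 0;
}
```

A caller could use it as format_access_message(message, sizeof message, file_path) and decide what to do when the result is non-zero, instead of silently overflowing the buffer.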
Afterword: be very careful when using wide characters with the printf family in MinGW, which implements the printf family by calling MSVCRT, which does not follow the C Standard. Further reading
To get closer to standard behaviour, use a build of MinGW-w64 which attempts to implement stdio library functions itself, instead of deferring to MSVCRT. (E.g. MSYS2 build).
Related
I am looking for ways to convert a PCHAR* variable to a TCHAR* without getting any warnings in Visual Studio (this is a requirement).
Looking online I can't find a function or a method to do so without having warnings. Maybe somebody has come across something similar?
Thank you !
convert a PCHAR* variable to a TCHAR*
PCHAR is a typedef that resolves to char*, so PCHAR* means char**.
TCHAR is a macro #define'd to either the "wide" wchar_t or the "narrow" char.
In neither case can you (safely) convert between a char ** and a simple character pointer, so the following assumes the question is actually about converting a PCHAR to a TCHAR*.
PCHAR is the same as TCHAR* in ANSI builds, and no conversion would be necessary in that case, so it can be further assumed that the question is about Unicode builds.
The PCHAR comes from the function declaration (which can't be changed) and the TCHAR comes from GetCurrentDirectory. I want to concatenate the two using _tcscat_s, but I need to convert the PCHAR first.
The general question of converting between narrow and wide strings has been answered before, see for example Convert char * to LPWSTR or How to convert char* to LPCWSTR?. However, in this particular case, you could weigh the alternatives before choosing the general approaches.
Change your build settings to ANSI, instead of Unicode, then no conversion is necessary.
That's as easy as making sure neither UNICODE nor _UNICODE macros are defined when compiling, or changing in the IDE the project Configuration Properties / Advanced / Character Set from Use Unicode Character Set to either Not Set or Use Multi-Byte Character Set.
Disclaimer: it is retrograde nowadays to compile against an 8-bit Windows codepage. I am not advising it, and doing that means many international characters cannot be represented literally. However, a chain is only as strong as its weakest link, and if you are forced to use narrow strings returned by an external function that you cannot change, then that's limiting the usefulness of going full Unicode elsewhere.
Keep the build as Unicode, but change just the concatenation code to use ANSI strings.
This can be done by explicitly calling the ANSI version GetCurrentDirectoryA of the API, which returns a narrow string. Then you can strcat that directly with the other PCHAR string.
Keep it as is, but combine the narrow and wide strings using a printf-style formatting function (wsprintf in the example below) instead of _tcscat_s.
char szFile[] = "test.txt";
PCHAR pszFile = szFile; // narrow string from ext function
wchar_t wszDir[_MAX_PATH];
GetCurrentDirectoryW(_MAX_PATH, wszDir); // wide string from own code
wchar_t wszPath[_MAX_PATH];
wsprintf(wszPath, L"%ws\\%hs", wszDir, pszFile); // combined into wide string
I'm trying to recode a part of printf.
setlocale(LC_ALL, "en_US.UTF-8");
int ret = printf("%S\n", "我是一只猫。");
printf("Printf returned %d\n", ret);
If the format is %s, printf writes the wide characters and returns 19.
If the format is %S, printf returns -1 because the argument is not a wide string (no L before "").
In my own implementation of printf, how can I determine if the string passed in parameter is wide, so I can return -1 if it isn't ?
Edit
I'm programming on OS X El Capitan (but I would have liked a portable solution if it were possible)
In my programming environment, %S and %ls are the same - it doesn't really matter for my question here
Printf also returns -1 when I don't set a locale for the example with format %s. This is the only reason why I've set a locale.
I'm compiling with clang (Apple LLVM version 7.0.0 (clang-700.1.76))
Basically, you can't. Passing something that is not a wide-string for %S is undefined behaviour, anything can happen, including dæmons flying out of your nose. You are lucky that printf catches that, likely it detects that the contents of "我是一只猫。" when interpreted as an array of wchar_t aren't all valid codepoints (if that happens, errno is set to EILSEQ by printf).
In my own implementation of printf, how can I determine if the string passed in parameter is wide, so I can return -1 if it isn't ?
You cannot. The %S format specifier is documented in printf(3) as
(Not in C99 or C11, but in SUSv2, SUSv3, and SUSv4.) Synonym
for %ls. Don't use.
so you should probably not use it (since it is not in the C11 standard, but in SUSv4). And if you did use it for your own printf, it would be a promise that the corresponding actual argument is a wide string.
You might however, if your C compiler is a recent GCC, use an appropriate format function attribute (it is a GCC extension) in the declaration of your printf (or similar) function. This would give users of your function warnings about ill-typed arguments. And you could even customize GCC (e.g. using MELT) by defining your own function attribute, which would enable extra type-checking at compile time. However, there is no portable way, given a pointer to something, to check at runtime whether it is a pointer to a string or to something else (like an array of integers).
At runtime, your printf would use stdarg(3) facilities so would have to "interpret" the format string to handle appropriately the various format specifiers. Without compiler support (à la __attribute__((format(printf,1,2))) in GCC (also supported by Clang), or with your own function attribute) you cannot get any compile-time type checking for variadic functions. And the type information is erased in C at runtime.
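A rough sketch of the format attribute on a user-defined function follows. The names my_printf and PRINTF_LIKE are hypothetical, and the body simply forwards to vprintf rather than reimplementing the formatting, which is an assumption made to keep the example short.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical macro: expands to the GCC/Clang format attribute, and to
   nothing on compilers that do not support it. */
#if defined(__GNUC__) || defined(__clang__)
#define PRINTF_LIKE(fmt_index, first_arg) \
    __attribute__((format(printf, fmt_index, first_arg)))
#else
#define PRINTF_LIKE(fmt_index, first_arg)
#endif

/* Argument 1 is the format string; the variadic arguments start at 2. */
int my_printf(const char *fmt, ...) PRINTF_LIKE(1, 2);

int my_printf(const char *fmt, ...)
{
    va_list args;
    int written;

    va_start(args, fmt);
    written = vprintf(fmt, args);   /* a real reimplementation would parse fmt here */
    va_end(args);
    return written;
}
```

With this declaration, a call such as my_printf("%ls\n", "narrow") should be diagnosed at compile time when format warnings are enabled (for example with -Wall), because the argument is a char * and not a wchar_t *. Nothing is checked at runtime.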
Look also at existing implementations of printf-like functions in free software implementations of the C standard library. The stdio/vfprintf.c file of musl libc is quite readable.
Also, GNU libunistring has some elementary string-checking functions, e.g. u16_check, which checks whether an array (whose size is given) of 16-bit integers is a valid UTF-16 string. Notice that "我是一只猫。" in UTF-8 is not a UTF-16 string terminated by a zero double-byte or a zero wide character (so simply computing its length as a wchar_t* wide string is undefined behavior, because of buffer overflow!) and might not even have the required alignment for wide strings.
Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly. For instance, one could opt to write a Hello World program as such:
%:include <stdio.h>
%:include <stdlib.h>
int main()
<%
fputs("Hello World!??/n", stdout);
return EXIT_SUCCESS;
%>
However, besides digraphs, I used the ??/ trigraph to write the \ character.
Given my assumptions above, is it possible to either
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
write a newline to a FILE * without using the '\n' character?
For stdout you could just use puts("") to output a newline. Or indeed replace the fputs in your original program with puts and delete the \n.
If you want to get the newline character into a variable so you can do other things with it, I know another standard function that gives you one for free:
int gimme_a_newline(void)
{
time_t t = time(0);
return strchr(ctime(&t), 0)[-1];
}
You could then say
fprintf(stderr, "Hello, world!%c", gimme_a_newline());
(I hope all of the characters I used are ISO646 or digraph-accessible. I found it surprisingly difficult to get a simple list of which ASCII characters are not in ISO646. Wikipedia has a color-coded table with not nearly enough contrast between colors for me to tell what's what.)
Your premise:
Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly.
is questionable. C99 defines "source" and "execution" character sets, and requires that both include representations of the backslash character (C99 5.2.1). The only reason I can imagine for an effort such as you describe would be to try to produce source code that does not require character set transcoding upon movement among machines. In that case, however, the choice of ISO 646 as a common baseline is odd. You're more likely to run into an EBCDIC machine than one that uses an ISO 646 variant that is not coincident with the ISO-8859 family of character sets. (And if you can assume ISO 8859, then backslash does not present a problem.)
Nevertheless, if you insist on writing C source code without using a literal backslash character, then the trigraph for that character is the way to do so. That's what trigraphs were invented for. In character constants and string literals, you cannot portably substitute anything else for \n or its trigraph equivalent, ??/n, because it is implementation-dependent how that code is mapped. In particular, it is not safe to assume that it maps to a line-feed character (which, however, is included among the invariant characters of ISO 646).
Update:
You ask specifically whether it is possible to
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
No, it is not possible, because there is no one '\n' character. Moreover, there seems to be a bit of a misconception here: \n in a character or string literal represents one character in the execution character set. The compiler is therefore responsible for that transformation, not the stdio functions. The stdio functions' responsibility is to handle that character on output by writing a character or character sequence intended to produce the specified effect ("[m]oves the active position to the initial position of the next line").
You also ask whether it is possible to
write a newline to a FILE * without using the '\n' character?
This one depends on exactly what you mean. If you want to write a character whose code in the execution character set you know, then you can write a numeric constant having that numeric value. In particular, if you want to write the character with encoded value 0xa (in the execution character set) then you can do so. For example, you could
fputc(0xa, my_file);
but that does not necessarily produce a result equivalent to
fputc('\n', my_file);
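Whether the two calls coincide can be checked for a given implementation. The sketch below uses two hypothetical helpers: one reports whether '\n' has the value 0xa in the execution character set, the other writes a newline without assuming any particular code.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if this implementation encodes '\n' as 0x0a
   in its execution character set, 0 otherwise. */
int newline_is_0x0a(void)
{
    return '\n' == 0x0a;
}

/* Hypothetical helper: writes a newline to stream without relying on its
   numeric code. Returns 1 on success, 0 on failure. */
int write_newline(FILE *stream)
{
    return fputc('\n', stream) != EOF;
}
```

On ASCII-based implementations the first helper is expected to return 1, but the standard does not promise it, which is why the numeric constant is the less portable of the two.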
Short answer is, yes, for what you want to do, you have to use this trigraph.
Even if there was a digraph for \, it would be useless inside a string literal because digraphs must be tokens, they are recognized by the tokenizer, while trigraphs are pre-processed and so still work inside string literals and the like.
Still wondering why somebody would encode source this way today ... :o
No. \n (or its trigraph equivalent) is the portable representation of a newline character.
No. You'd have to represent the literal newline somehow, and \n (or its trigraph equivalent) is the only portable representation.
It's very unusual to find C source code that uses trigraphs or digraphs! Some compilers (e.g. GNU gcc) require a command-line option to enable trigraphs; without it, they assume any trigraph in the source was unintentional and issue a warning.
EDIT: I forgot about puts(""). That's a sneaky way to do it, but only works for stdout.
Yes of course it's possible
fputc(0x0A, file);
I have an issue with fgetws and wprintf.
NULL is returned when a special character is found in the file opened earlier. I don't have this problem with fgets.
I tried to use setlocale, as recommended here : fgetws fails to get the exact wide char string from FILE*
but it doesn't change anything.
Moreover, wprintf(L"éé"); prints ?? (I also don't have this problem with printf) in the terminal (on Ubuntu 12), what can be done to avoid this?
Edit : as it is asked in the comments, here is the very simple code :
# include "sys.h"
#define MAX_LINE_LENGTH 1024
int main (void){
FILE *File = fopen("D.txt", "r");
wchar_t line[MAX_LINE_LENGTH];
while (fgetws(line, MAX_LINE_LENGTH, File))
wprintf(L"%S", line);
fclose(File);
return 0;
}
By default, when a program starts, it is running in the C locale, which is not guaranteed to support any characters except those needed for translating C programs. (It can contain more as an implementation detail, but you cannot rely on this.) In order to use wchar_t to store other characters and process them with the wide character conversion functions or wide stdio functions, you need to set a locale in which those characters are supported.
The locales available, and how they are named, vary by system, so you should not attempt to set a locale by name. Instead, pass "" to setlocale to request the "default" locale for the user or the system. On POSIX-like systems, this uses the LANG and LC_* environment variables to determine the preferred locale. As long as the characters you're trying to use exist in the user's locale, your wprintf should work.
The call to setlocale should look like:
setlocale(LC_CTYPE, "");
or:
setlocale(LC_ALL, "");
The former only applies the locale settings to character encoding/character type functions (things that process wchar_t). The latter also causes locale to be set for, and affect, a number of other things like message language, formatting of numbers and time, ...
One detail to note is that wide stdio functions bind the character encoding of the locale that's in use at the time the stream "becomes wide-oriented", i.e. on the first wide operation that's performed on it. So you need to call setlocale before using wprintf.
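Putting these points together, a minimal sketch of the reading loop might look like this. The function name print_file_wide and its return convention are made up for illustration. The sketch assumes the file's encoding matches the user's locale, and it deliberately carries on in the C locale if setlocale fails.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define MAX_LINE_LENGTH 1024

/* Hypothetical helper: prints a text file through the wide stdio functions.
   Returns the number of successful fgetws reads, or -1 on failure. */
long print_file_wide(const char *path)
{
    FILE *file;
    wchar_t line[MAX_LINE_LENGTH];
    long count = 0;
    int failed;

    /* Select the user's locale before the first wide I/O operation.
       If this fails, the program simply stays in the "C" locale. */
    setlocale(LC_CTYPE, "");

    file = fopen(path, "r");
    if (file == NULL)
        return -1;

    while (fgetws(line, MAX_LINE_LENGTH, file) != NULL) {
        wprintf(L"%ls", line);       /* %ls is the standard wide-string specifier */
        count++;
    }

    failed = ferror(file);           /* set when bytes are invalid in the locale */
    fclose(file);
    return failed ? -1 : count;
}
```

Checking ferror after the loop separates a genuine end of file from a conversion failure, which is the case in which fgetws returns NULL on a "special" character.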
Basically I have some simple code that does some things for files and I'm trying to port it to windows. I have something that looks like this:
int SomeFileCall(const char * filename){
#ifndef __unix__
SomeWindowsFileCall(filename);
#endif
#ifdef __unix__
/**** Some unix only stat code here! ****/
#endif
}
the line SomeWindowsFileCall(filename); causes the compiler error:
cannot convert parameter 1 from 'const char *' to 'LPCWSTR'
How do I fix this, without changing the SomeFileCall prototype?
Most of the Windows APIs that take strings have two versions: one that takes char * and one that takes WCHAR * (the latter is equivalent to wchar_t *).
SetWindowText, for example, is actually a macro that expands to either SetWindowTextA (which takes char *) or SetWindowTextW (which takes WCHAR *).
In your project, it sounds like all of these macros are referencing the -W versions. This is controlled by the UNICODE preprocessor macro (which is defined if you choose the "Use Unicode Character Set" project option in Visual Studio). (Some of Microsoft's C and C++ run time library functions also have ANSI and wide versions. Which one you get is selected by the similarly-named _UNICODE macro that is also defined by that Visual Studio project setting.)
Typically, both of the -A and -W functions exist in the libraries and are available, even if your application is compiled for Unicode. (There are exceptions; some newer functions are available only in "wide" versions.)
If you have a char * that contains text in the proper ANSI code page, you can call the -A version explicitly (e.g., SetWindowTextA). The -A versions are typically wrappers that make wide character copies of the string parameters and pass control to the -W versions.
An alternative is to make your own wide character copies of the strings. You can do this with MultiByteToWideChar. Calling it can be tricky, because you have to manage the buffers. If you can get away with calling the -A version directly, that's generally simpler and already tested. But if your char * string is using UTF-8 or any encoding other than the user's current ANSI code page, you should do the conversion yourself.
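The usual two-call pattern for MultiByteToWideChar is sketched below: the first call asks for the required size, the second fills the buffer. The helper name widen_string is hypothetical. It assumes the input is UTF-8 on Windows. The non-Windows branch is only there so the sketch compiles elsewhere; it converts according to the current locale with mbsrtowcs instead.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns a newly allocated wide copy of narrow,
   or NULL on failure. The caller must free() the result. */
wchar_t *widen_string(const char *narrow)
{
    wchar_t *wide;
#ifdef _WIN32
    /* First call: number of wide characters needed, terminator included. */
    int count = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    narrow, -1, NULL, 0);
    if (count <= 0)
        return NULL;
    wide = malloc((size_t)count * sizeof *wide);
    if (wide == NULL)
        return NULL;
    /* Second call: perform the conversion into the buffer. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            narrow, -1, wide, count) == 0) {
        free(wide);
        return NULL;
    }
    return wide;
#else
    mbstate_t state;
    const char *src = narrow;
    size_t len;

    memset(&state, 0, sizeof state);
    len = mbsrtowcs(NULL, &src, 0, &state);   /* length without terminator */
    if (len == (size_t)-1)
        return NULL;
    wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;
    src = narrow;
    memset(&state, 0, sizeof state);
    mbsrtowcs(wide, &src, len + 1, &state);   /* copies the terminator too */
    return wide;
#endif
}
```

In the question's code, the result could be passed to SomeWindowsFileCall in place of filename and freed afterwards, which leaves the SomeFileCall prototype unchanged.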
Bonus Info
The -A suffix stands for "ANSI", which was the common Windows term for a single-byte code-page character set.
The -W suffix stands for "Wide" (meaning the encoding units are wider than a single byte). Specifically, Windows uses little-endian UTF-16 for wide strings. The MSDN documentation simply calls this "Unicode", which is a little bit of a misnomer.
Configure your project to use ANSI character set. (General -> Character Set)
What are TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR etc.
typedef const wchar_t* LPCWSTR;
{project properties->advanced->character set->use multi byte character set} If you follow these steps, your problem is solved.
You are building with WinApi in Unicode mode, so all string parameters resolve to wide strings. The simplest fix would be to change the WinApi to ANSI, otherwise you need to create a wchar_t* with the contents from filename and use that as an argument.
I was able to solve this error by setting the Character Set to "Use Multi-Byte Character Set":
[Project Properties -> Configuration Properties -> General -> Character Set -> "Use Multi-Byte Character Set"]
I'm not sure what compiler you are using, but in Visual Studio you can specify the default char type, whether it be UNICODE or multibyte. In your case it sounds as if UNICODE is the default, so the simplest solution is to look for the switch in your particular compiler that determines the default char type. That would save you some work; otherwise you would end up adding code to convert back and forth from UNICODE, which may add unnecessary overhead and could be an additional source of errors.