It's not clear to me what encodings are used where in C's argv. In particular, I'm interested in the following scenario:
A user uses locale L1 to create a file whose name, N, contains non-ASCII characters
Later on, a user uses locale L2 to tab-complete the name of that file on the command line, which is fed into a program P as a command line argument
What sequence of bytes does P see on the command line?
I have observed that on Linux, creating a filename in a UTF-8 locale and then tab-completing it in (e.g.) the zh_TW.Big5 locale seems to cause my program P to be fed UTF-8 rather than Big5. However, on OS X the same series of actions results in my program P getting a Big5-encoded filename.
Here is what I think is going on so far (long, and I'm probably wrong and need to be corrected):
Windows
File names are stored on disk in some Unicode format. So Windows takes the name N, converts from L1 (the current code page) to a Unicode version of N we will call N1, and stores N1 on disk.
What I then assume happens is that when tab-completing later on, the name N1 is converted to locale L2 (the new current code page) for display. With luck, this will yield the original name N -- but this won't be true if N contained characters unrepresentable in L2. We call the new name N2.
When the user actually presses enter to run P with that argument, the name N2 is converted back into Unicode, yielding N1 again. This N1 is now available to the program in UCS2 format via GetCommandLineW/wmain/tmain, but users of GetCommandLine/main will see the name N2 in the current locale (code page).
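For reference, here is a minimal sketch of the wide entry point; wmain is a Microsoft extension, so this assumes MSVC or MinGW with -municode. It dumps the UTF-16 code units of each argument, which is a handy way to see what the console actually handed over:

#include <stdio.h>
#include <wchar.h>

/* Windows-only sketch: wmain receives the command line as UTF-16,
   so no conversion to the current code page has taken place. */
int wmain(int argc, wchar_t **argv)
{
    for (int i = 1; i < argc; i++) {
        /* Dump the raw UTF-16 code units of each argument. */
        for (wchar_t *p = argv[i]; *p; p++)
            printf("%04x ", (unsigned)*p);
        printf("\n");
    }
    return 0;
}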
OS X
The disk-storage story is the same, as far as I know. OS X stores file names as Unicode.
With a Unicode terminal, I think what happens is that the terminal builds the command line in a Unicode buffer. So when you tab complete, it copies the file name as a Unicode file name to that buffer.
When you run the command, that Unicode buffer is converted to the current locale, L2, and fed to the program via argv, and the program can decode argv with the current locale into Unicode for display.
Linux
On Linux, everything is different and I'm extra-confused about what is going on. Linux stores file names as byte strings, not in Unicode. So if you create a file with name N in locale L1, then N as a byte string is exactly what is stored on disk.
When I later run the terminal and try and tab-complete the name, I'm not sure what happens. It looks to me like the command line is constructed as a byte buffer, and the name of the file as a byte string is just concatenated onto that buffer. I assume that when you type a standard character it is encoded on the fly to bytes that are appended to that buffer.
When you run a program, I think that buffer is sent directly to argv. Now, what encoding does argv have? It looks like any characters you typed in the command line while in locale L2 will be in the L2 encoding, but the file name will be in the L1 encoding. So argv contains a mixture of two encodings!
Question
I'd really like it if someone could let me know what is going on here. All I have at the moment is half-guesses and speculation, and it doesn't really fit together. What I'd really like to be true is for argv to be encoded in the current code page (Windows) or the current locale (Linux / OS X) but that doesn't seem to be the case...
Extras
Here is a simple candidate program P that lets you observe encodings for yourself:
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    for (char *c = argv[1]; *c; c++, len++) {
        printf("%d ", (int)(*c));
    }
    printf("\nLength: %d\n", len);

    return 0;
}
You can use locale -a to see available locales, and use export LC_ALL=my_encoding to change your locale.
Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things that have resolved my question:
As discussed, on Windows argv is encoded using the current code page. However, you can retrieve the command line as UTF-16 using GetCommandLineW. Use of argv is not recommended for modern Windows apps with Unicode support, because code pages are a legacy mechanism.
On Unix-like systems, argv has no fixed encoding:
a) File names inserted by tab-completion/globbing will occur in argv verbatim as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.
b) Input entered directly by the user using their IME will occur in argv in the locale encoding. (Ubuntu seems to use the locale to decide how to encode IME input, whereas OS X uses the Terminal.app encoding preference.)
This is annoying for languages such as Python, Haskell or Java, which want to treat command line arguments as strings. They need to decide how to decode argv into whatever Unicode representation is used internally for a String (which is UTF-16 in Java's case). However, if they just use the locale encoding to do this decoding, then valid filenames in the input may fail to decode, causing an exception.
The solution to this problem adopted by Python 3 is the "surrogateescape" error handler (PEP 383, http://www.python.org/dev/peps/pep-0383/), which represents any undecodable byte in argv as a special Unicode code point in the range U+DC80 to U+DCFF. When such a code point is encoded back to a byte stream, it simply becomes the original byte again. This allows data from argv that is not valid in the current encoding (i.e. a filename named in something other than the current locale) to round-trip through the native Python string type and back to bytes with no loss of information.
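To illustrate the idea, here is a rough C sketch of the escaping scheme (not Python's implementation); it assumes wchar_t holds Unicode code points, as it does on glibc. Undecodable bytes are mapped to U+DC00 + byte value on decoding and could be mapped back on encoding.

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    setlocale(LC_CTYPE, "");
    if (argc < 2) {
        fprintf(stderr, "Need an argument\n");
        return 1;
    }

    const char *in = argv[1];
    size_t len = strlen(in);
    wchar_t out[1024];
    size_t n = 0;
    mbstate_t st;
    memset(&st, 0, sizeof st);

    for (size_t i = 0; i < len && n + 1 < sizeof out / sizeof out[0]; ) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, in + i, len - i, &st);
        if (r == (size_t)-1 || r == (size_t)-2) {
            /* Invalid or truncated sequence: escape one raw byte as a
               lone surrogate (0xDC00 + byte) and resynchronize. */
            out[n++] = (wchar_t)(0xDC00 + (unsigned char)in[i]);
            i++;
            memset(&st, 0, sizeof st);
        } else if (r == 0) {
            break;               /* embedded NUL, stop */
        } else {
            out[n++] = wc;       /* normally decoded character */
            i += r;
        }
    }
    out[n] = L'\0';

    printf("decoded %zu code points from %zu bytes\n", n, len);
    return 0;
}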
As you can see, the situation is pretty messy :-)
I can only speak about Windows for now. On Windows, code pages are only meant for legacy applications and are not used by the system or by modern applications. Windows uses UTF-16 (and has done so for ages) for everything: text display, file names, the terminal, the system API. Conversions between UTF-16 and the legacy code pages are only performed at the highest possible level, directly at the interface between the system and the application (technically, the older API functions are implemented twice: one function FunctionW that does the real work and expects UTF-16 strings, and one compatibility function FunctionA that simply converts input strings from the current (thread) code page to UTF-16, calls FunctionW, and converts back the results).
Tab-completion should always yield UTF-16 strings (it definitely does when using a TrueType font) because the console uses only UTF-16 as well. The tab-completed UTF-16 file name is handed over to the application. If that application is a legacy application (i.e., it uses main instead of wmain/GetCommandLineW etc.), then the Microsoft C runtime (probably) uses GetCommandLineA to have the system convert the command line.
So basically I think what you're saying about Windows is correct (only that there is probably no conversion involved while tab-completing): the argv array will always contain the arguments in the code page of the current application, because the information about which code page (L1) the original program used has been irreversibly lost during the intermediate UTF-16 stage.
The conclusion is, as always on Windows: avoid the legacy code pages and use the UTF-16 API wherever you can. If you have to use main instead of wmain (e.g., to be platform independent), use GetCommandLineW instead of the argv array.
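For example, a program stuck with plain main can still recover the unconverted UTF-16 arguments roughly like this (Windows-only sketch; CommandLineToArgvW lives in Shell32):

#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW; link with Shell32 */
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    int wargc = 0;
    /* Parse the original UTF-16 command line, bypassing the code-page argv. */
    wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (!wargv)
        return 1;
    for (int i = 0; i < wargc; i++)
        wprintf(L"wargv[%d] = %ls\n", i, wargv[i]);
    LocalFree(wargv);
    return 0;
}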
The output from your test app needed some modifications to make sense: you need hex codes, and you need to get rid of the negative values. Otherwise you can't print things like UTF-8 multibyte sequences in a way you can read.
First the modified SW:
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    for (unsigned char *c = (unsigned char *)argv[1]; *c; c++, len++) {
        printf("%x ", (*c));
    }
    printf("\nLength: %d\n", len);

    return 0;
}
Then on my Ubuntu box that is using UTF-8 I get this output.
$> gcc -std=c99 argc.c -o argc
$> ./argc 1ü
31 c3 bc
Length: 3
And here you can see that in my case ü is encoded over 2 bytes,
and that the 1 is a single byte.
More or less exactly what you would expect from a UTF-8 encoding.
And this actually matches what is in the LANG environment variable.
$> env | grep LANG
LANG=en_US.utf8
Hope this clarifies the Linux case a little.
Good luck!
Yep, users have to be careful when mixing locales on Unix in general. GUI file managers that display and change filenames also have this problem. On Mac OS X the standard Unix encoding is UTF-8. In fact the HFS+ filesystem, when called via the Unix interfaces, enforces UTF-8 filenames, because it needs to convert them to UTF-16 for storage in the filesystem itself.
Related
I have tried many ways to read Unicode input from the user, using scanf() and getc(), but nothing worked. Most of the time, 0 is stored in the supplied variable (maybe indicating wrong input?). How can I make it so that when the user enters any Unicode codepoint, it is properly recognized and stored in either a string or a char?
I'm guessing you already know that C chars and Unicode characters are two very different things, so I'll skip over that. The assumptions I'll make here include:
Your C strings will contain UTF-8 encoded characters, terminated by a NUL (\x00) character.
You won't use any C functions that could break the multi-byte encoding, and you will use functions such as strlen() with the understanding that you need to differentiate between C chars and real characters.
It really is as simple as:
char input[256];
scanf("%255[^\n]", input);
printf("%s\n", input);
The problems come with what is providing the input and what is displaying the output.
#include <stdio.h>

int main(int argc, char **argv)
{
    char *banana = "\xF0\x9F\x8D\x8C\x00";
    printf("%s\n", banana);
}
This probably won't display a banana. That's because the UTF-8 sequence being written to the terminal isn't being interpreted as a UTF-8 sequence.
So, the first thing you need to do is to configure your terminal. If your program is likely to only use one terminal type, then you might even be able to do this from within the program; however, there are tons of people who use different terminals, some that even cross Operating System boundaries. For example, I'm testing my Linux programs in a Windows terminal, connected to the Linux system using SSH.
Once the terminal is configured, your (probably already correct) program should display a banana. But even a correctly configured terminal can fail.
After the terminal is verified to be correctly configured, the last piece of the puzzle is the font. Not all fonts contain glyphs for all Unicode characters. The banana is one of those characters that isn't typically typed into a computer, so you need to open up a font tool and search the font for the glyph. If it doesn't exist in that font, you need to find a font that implements a glyph for that character.
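To tie this back to reading such input in C: assuming the terminal and the locale are both UTF-8, a sketch along these lines reads a line as bytes and counts the Unicode characters it contains, which is a quick way to check that multi-byte input is arriving intact:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* use the environment's encoding, e.g. UTF-8 */

    char line[256];
    if (!fgets(line, sizeof line, stdin))
        return 1;
    line[strcspn(line, "\n")] = '\0'; /* strip the trailing newline */

    /* mbstowcs with a NULL destination just counts the characters. */
    size_t chars = mbstowcs(NULL, line, 0);
    if (chars == (size_t)-1) {
        fprintf(stderr, "Input is not valid in the current locale\n");
        return 1;
    }
    printf("%zu bytes, %zu characters\n", strlen(line), chars);
    return 0;
}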
In the code below I'm creating two files, one in text mode and the other in binary mode. The icons of the files look the same. But the characteristics of both files are exactly the same, including the size, charset (== binary) and stream (octet). Why is neither detected as a text file? If I create a text file explicitly, its charset is ASCII.
Compiler version - gcc (Ubuntu 8.3.0-6ubuntu1) 8.3.0.
Operating system - Tried on both Ubuntu 18.10 and 19.04.
No messages displayed by compiler.
Command used to examine the files: file --mime.
Output of the command for file Text1.txt:
Text1.txt: application/octet-stream; charset=binary
Output of the command for file Binary:
Binary: application/octet-stream; charset=binary
Output of the command od -xa FILENAME is the same for both files and is:
0000000 0021
!
0000001
#include <stdio.h>

void main(){
    FILE *fp;
    FILE *fp2;
    int a = 10111110;

    fp2 = fopen("Text1.txt","w");
    fputc("!",fp2);

    fp = fopen("Binary","wb");
    fputc("!",fp);
}
Expected output: one file with charset ASCII and one with charset binary. Actual output: both files with charset binary.
The file command diagnoses the files as binary and not ASCII because you are writing non-ASCII characters to the files due to incorrect use of fputc.
fputc("!",fp2); is incorrect. The first argument to fputc should be an int with a character value. "!" is a string literal, which is an array, which is automatically converted to a pointer to its first character.
GCC warns you about this, saying “warning: passing argument 1 of 'fputc' makes integer from pointer without a cast [-Wint-conversion]”. You apparently ignored the warning. Do not do that. When the compiler warns you about something, pay attention, diagnose the problem, and fix it.
The result is that the pointer is converted to an int, and this int is passed to fputc. That may result in some non-ASCII character being written to the file, which in turn causes the file command to diagnose the file as binary.
To fix this, change the string "!" to a single character '!', so that you pass a single character to fputc, with fputc('!',fp2);.
Additionally, main should not be declared with void main(). Declare it with int main(void) or int main(int argc, char *argv[]) or another implementation-defined manner.
On Unix systems, the resulting files with the corrected code will be identical. Core Unix does not distinguish between text and binary files, except that some applications may use metadata (such as “extended attributes”) to characterize files in various ways. The files resulting from the incorrect code may or may not be identical, because identical string literals in different places may or may not have the same address, so the resulting pointer may or may not have the same value.
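Putting the two fixes together, a corrected version of the program might look like this (error checks added for illustration):

#include <stdio.h>

int main(void)
{
    FILE *fp2 = fopen("Text1.txt", "w");   /* text mode */
    FILE *fp  = fopen("Binary", "wb");     /* binary mode */
    if (!fp2 || !fp)
        return 1;

    fputc('!', fp2);   /* character constant, not a string literal */
    fputc('!', fp);

    fclose(fp2);
    fclose(fp);
    return 0;
}

With this version, file should report both files as plain ASCII text on Linux.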
C provides a distinction in principle between binary and text streams. Data traversing a text stream may be subject to implementation-dependent conversions:
Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. Whether space characters that are written out immediately before a new-line character appear when read in is implementation-defined.
(C2011, 7.21.2/2)
In practice, however, the only conversion you will see for byte-oriented streams on any system you're likely to meet is line terminator conversions on systems (primarily Windows) that use carriage return / newline pairs for line terminators in text files. C text mode streams will convert between that external representation and C's newline-only internal representation.
On Linux and modern BSD-based macOS, however, there isn't even that -- these operating systems make no distinction in practice between text and binary files, and it is not at all surprising that your two mechanisms for producing a file yield identical files.
It is an entirely separate question how an external program that attempts to guess at file types might interpret any given file, especially a very short one. Your chances are better for a file to be detected as text if it contains genuine text in the form of words and sentences.
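A quick way to observe the one conversion you are likely to meet is to write a newline through both kinds of stream: on Windows the text-mode file ends up containing "\r\n" while the binary-mode file contains a bare "\n"; on Linux and macOS the two files are byte-for-byte identical. A small sketch (file names are just placeholders):

#include <stdio.h>

int main(void)
{
    FILE *t = fopen("text_mode.txt", "w");    /* text mode */
    FILE *b = fopen("binary_mode.txt", "wb"); /* binary mode */
    if (!t || !b)
        return 1;

    fputs("hello\n", t);   /* "\n" may be written as "\r\n" on Windows */
    fputs("hello\n", b);   /* written verbatim */

    fclose(t);
    fclose(b);
    return 0;
}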
I have tried to check the importance of, and reason for, using the W WinAPI functions vs. the A ones (W meaning wide char, A meaning ASCII, right?).
I have made a simple example where I retrieve the temp path for the current user like this:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a Russian name, so it's written in Cyrillic. printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would receive some error, or a mess of symbols, from GetTempPathA since the current user name is Unicode, but I figured out that Cyrillic characters are actually included in the extended ASCII character set. So I have a question: if my software were to extract data into the temp folder of a current user who is Chinese (assuming they have Chinese symbols in their user name), would I get a mess or an error using the GetTempPathA version? Should I always use the W-prefixed functions for production software that works with the WinAPI directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out-of-luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
UTF-8 is conceptually similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not fully support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
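Here is a hedged sketch of the convert-at-the-boundary pattern described above; the path and the file name 例子.txt (spelled as UTF-8 byte escapes so the source stays ASCII) are purely illustrative:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* The application keeps its text as UTF-8: this is "C:\temp\例子.txt"
       with the two Chinese characters written as UTF-8 byte escapes. */
    const char *utf8_name = "C:\\temp\\\xE4\xBE\x8B\xE5\xAD\x90.txt";

    /* Convert to UTF-16 only at the Windows API boundary. */
    wchar_t wide_name[MAX_PATH];
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1,
                                wide_name, MAX_PATH);
    if (n == 0) {
        fprintf(stderr, "MultiByteToWideChar failed: %lu\n", GetLastError());
        return 1;
    }

    /* Call the -W version of the API with the converted string. */
    HANDLE h = CreateFileW(wide_name, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
        return 1;
    }
    CloseHandle(h);
    return 0;
}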
Of course, you need the wide version. The ANSI version can't even technically handle more than 256 distinct characters in a single-byte code page. Cyrillic is included in the extended ASCII set (if that's your localization), while Chinese isn't and can't be, due to the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it will only work properly if the executing machine has a matching localization. On a machine with a non-Cyrillic localization, the text will be displayed according to whatever is defined by the localization settings.
I'm primarily interested in the Unix-like systems (e.g., portable POSIX) as it seems like Windows does strange things for wide characters.
Do the read and write wide character functions (like getwchar() and putwchar()) always "do the right thing", for example read from utf-8 and write to utf-8 when that is the set locale, or do I have to manually call wcrtomb() and print the string using e.g. fputs()? On my system (openSUSE 12.3) where $LANG is set to en_GB.UTF-8 they do seem to do the right thing (inspecting the output I see what looks like UTF-8 even though strings were stored using wchar_t and written using the wide character functions).
However I am unsure if this is guaranteed. For example cprogramming.com states that:
[wide characters] should not be used for output, since spurious zero bytes and other low-ASCII characters with common meanings (such as '/' and '\n') will likely be sprinkled throughout the data.
Which seems to indicate that outputting wide characters (presumably using the wide character output functions) can wreak havoc.
Since the C standard does not seem to mention encoding at all, I really have no idea who/when/how an encoding is applied when using wchar_t. So my question is basically whether reading, writing and using wide characters exclusively is a proper thing to do when my application has no need to know about the encoding used. I only need string lengths and console widths (wcswidth()), so to me using wchar_t everywhere when dealing with text seems ideal.
The relevant text governing the behavior of the wide character stdio functions and their relationship to locale is from POSIX XSH 2.5.2 Stream Orientation and Encoding Rules:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_02
Basically, the wide character stdio functions always write in the encoding that's in effect (per the LC_CTYPE locale category) at the time the FILE stream becomes wide-oriented; this means the first time a wide stdio function is called on it, or fwide is used to set the orientation to wide. So as long as a proper LC_CTYPE locale is in effect matching the desired "system" encoding (e.g. UTF-8) when you start working with the stream, everything should be fine.
However, one important consideration you should not overlook is that you must not mix byte and wide oriented operations on the same FILE stream. Failure to observe this rule is not a reportable error; it simply results in undefined behavior. As a good deal of library code assumes stderr is byte oriented (and some even makes the same assumption about stdout), I would strongly discourage ever using wide-oriented functions on the standard streams. If you do, you need to be very careful about which library functions you use.
Really, I can't think of any reason at all to use wide-oriented functions. fprintf is perfectly capable of sending wide-character strings to byte-oriented FILE streams using the %ls specifier.
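For instance, a sketch like the following keeps stdout byte-oriented while still printing a wide string, letting printf perform the wide-to-multibyte conversion according to the current LC_CTYPE:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");           /* pick up the environment's locale */

    wchar_t *msg = L"caf\u00e9";     /* "café" as a wide string */

    /* %ls converts the wide string to the locale's multibyte encoding;
       stdout stays byte-oriented, so plain printf/fputs keep working. */
    printf("%ls\n", msg);
    printf("still byte-oriented: %d\n", fwide(stdout, 0) < 0);
    return 0;
}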
So long as the locale is set correctly, there shouldn't be any issues processing UTF-8 files on a system using UTF-8, using the wide character functions. They'll be able to interpret things correctly, i.e. they'll treat a character as 1-4 bytes as necessary (in both input and output). You can test it out by something like this:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main()
{
    setlocale(LC_CTYPE, "en_GB.UTF-8");
    // setlocale(LC_CTYPE, ""); // to use environment variable instead
    wchar_t *txt = L"£Δᗩ";
    wprintf(L"The string %ls has %zu characters\n", txt, wcslen(txt));
}
$ gcc -o loc loc.c && ./loc
The string £Δᗩ has 3 characters
If you use the standard functions (in particular character functions) on multibyte strings carelessly, things will start to break, e.g. the equivalent:
char *txt = "£Δᗩ";
printf("The string %s has %zu characters\n", txt, strlen(txt));
$ gcc -o nloc nloc.c && ./nloc
The string £Δᗩ has 7 characters
The string still prints correctly here because it's essentially just a stream of bytes, and as the system is expecting UTF-8 sequences, they're translated perfectly. Of course strlen is reporting the number of bytes in the string, 7 (plus the \0), with no understanding that a character and a byte aren't equivalent.
In this respect, because of the compatibility between ASCII and UTF-8, you can often get away with treating UTF-8 files as simply multibyte C strings, as long as you're careful.
There's a degree of flexibility as well. It's possible to convert a standard C string (as a multibyte string) to a wide character string easily:
char *stdtxt = "ASCII and UTF-8 €£¢";
wchar_t buf[100];
mbstowcs(buf, stdtxt, 20);
wprintf(L"%ls has %zu wide characters\n", buf, wcslen(buf));
Output:
ASCII and UTF-8 €£¢ has 19 wide characters
Once you've used a wide character function on a stream, it's set to wide orientation. If you later want to use standard byte i/o functions, you'll need to re-open the stream first. This is probably why the recommendation is not to use it on stdout. However, if you only use wide character functions on stdin and stdout (including any code that you link to), you will not have any problems.
Don't use fputs with anything other than ASCII.
If you want to write, say, UTF-8, then use a function that returns the real size used by the UTF-8 string and use fwrite to write that exact number of bytes, without worrying about stray '\0' bytes inside the string.
I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *ascii_ = "中日友好"; //line-1
wchar_t *wchar_ = L"中日友好"; //line-2
int main()
{
    printf("ascii_: %s\n", ascii_);     //line-3
    wprintf(L"wchar_: %s\n", wchar_);   //line-4
    return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to a char* pointer in line-1, but I just did it, and the output of line-3 is correct. So why? How could printf() in line-3 give me the non-ASCII characters? Does it know the encoding somehow?
I assume the code in line-2 and line-4 is correct, but why didn't I get any output from line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is compatible with ASCII.
Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream can only be set to either byte (narrow) orientation or wide orientation. Once set, it cannot be changed. It is set the first time the stream is used (which is byte-oriented here, because of the printf). After that, the wprintf will not work due to the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question about why the wprintf line doesn't work even by itself: it's probably because the code is compiled so that the UTF-8 bytes of 中日友好 end up stored in wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).
So there's two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but it may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using Unicode escape sequences (see the sketch below). This will obviously make them unreadable in the source code, but it will work on any machine that can print Chinese character fonts regardless of the locale.
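Here is a rough sketch of that second option, using universal character names so the source file stays ASCII-only (the escapes below are the code points of 中日友好):

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "");   /* needed so wprintf can encode for the terminal */

    /* \u escapes keep the source ASCII-only; these are 中日友好. */
    wchar_t *wchar_ = L"\u4E2D\u65E5\u53CB\u597D";

    wprintf(L"wchar_: %ls\n", wchar_);
    return 0;
}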
Line 1 is not ascii, it's whatever multibyte encoding is used by your compiler at compile-time. On modern systems that's probably UTF-8. printf does not know the encoding. It's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character I/O on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.
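If in doubt, fwide can be used to query (or set) a stream's orientation before committing to one family of functions. A small sketch of the query form:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* fwide(stream, 0) only queries: >0 means wide-oriented, <0 byte-oriented,
       0 means the stream has no orientation yet. */
    int before = fwide(stdout, 0);
    printf("orientation before: %d\n", before);   /* this printf makes stdout byte-oriented */

    int after = fwide(stdout, 0);
    printf("orientation after: %d\n", after);     /* now negative */
    return 0;
}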
You are omitting one step and therefore thinking about it the wrong way.
You have a C file on disk, containing bytes. You have an "ASCII" string and a wide string.
The ASCII string takes the bytes exactly as they are in line 1 and outputs them.
This works as long as the encoding on the user's side is the same as the one on the programmer's side.
The wide string is first decoded from the given bytes into Unicode code points and stored in the program (maybe this goes wrong on your side). On output, they are encoded again according to the encoding on the user's side. This ensures that these characters are emitted as they are intended to be, not as they were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.