How to read a banana (🍌) from the console in C?

I have tried many ways to do it.. using scanf(), getc(), but nothing worked. Most of the time, 0 is stored in the supplied variable (maybe indicating wrong input?). How can I make it so that when the user enters any Unicode codepoint, it is properly recognized and stored in either a string or a char?

I'm guessing you already know that C chars and Unicode characters are two very different things, so I'll skip over that. The assumptions I'll make here include:
Your C strings will contain UTF-8 encoded characters, terminated by a NUL (\x00) character.
You won't use any C functions that could break the per-character encoding, and you will use functions like strlen() with the understanding that you need to differentiate between C chars and real characters.
It really is as simple as:
char input[256];
scanf("%255[^\n]", input);
printf("%s\n", input);
The problem comes with what is providing the input and what is displaying the output.
#include <stdio.h>

int main(int argc, char** argv) {
    char* banana = "\xF0\x9F\x8D\x8C\x00";
    printf("%s\n", banana);
}
This probably won't display a banana. That's because the UTF-8 sequence being written to the terminal isn't being interpreted as a UTF-8 sequence.
So, the first thing you need to do is to configure your terminal. If your program is likely to only use one terminal type, then you might even be able to do this from within the program; however, there are tons of people who use different terminals, some that even cross Operating System boundaries. For example, I'm testing my Linux programs in a Windows terminal, connected to the Linux system using SSH.
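If it helps, here's a hedged sketch of the "do it from within the program" route for the specific case of the Windows console (on most Linux terminals UTF-8 is already the default, so there is usually nothing to set); it assumes the Win32 call SetConsoleOutputCP(CP_UTF8) is available:
#ifdef _WIN32
#include <windows.h>
#endif
#include <stdio.h>

int main(void) {
#ifdef _WIN32
    /* Ask the Windows console to treat output bytes as UTF-8 (code page 65001). */
    SetConsoleOutputCP(CP_UTF8);
#endif
    /* UTF-8 encoding of U+1F34C BANANA, sent as plain bytes. */
    printf("\xF0\x9F\x8D\x8C\n");
    return 0;
}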
Once the terminal is configured, your (probably already correct) program should display a banana. But even a correctly configured terminal can fail.
After the terminal is verified to be correctly configured, the last piece of the puzzle is the font. Not all fonts contain glyphs for all Unicode characters. The banana is one of those characters that isn't typically typed into a computer, so you need to open up a font tool and search the font for the glyph. If it doesn't exist in that font, you need to find a font that implements a glyph for that character.
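Putting the pieces together, a minimal read-and-echo sketch, assuming a UTF-8 terminal and a font that has the banana glyph, could look like this (fgets() instead of scanf() to keep the buffer handling simple):
#include <stdio.h>
#include <string.h>

int main(void) {
    char input[256];

    /* Read one line of raw bytes; in a UTF-8 terminal a banana arrives
       as the four bytes F0 9F 8D 8C. */
    if (fgets(input, sizeof input, stdin) == NULL)
        return 1;
    input[strcspn(input, "\n")] = '\0';   /* strip the newline */

    /* strlen() counts C chars (bytes), not characters: a lone banana is 4. */
    printf("You typed \"%s\" (%zu bytes)\n", input, strlen(input));
    return 0;
}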

Related

How to take I/O in Greek, in C program on Windows console

For a school project I decided to make an app. I am writing it in C and running it on the Windows console. I live in Greece and the program needs to read and write text in Greek too. So I have tried just plainly
printf("Καλησπέρα");
But it prints some random characters. How can I output Greek letters? And, similarly, how can I take input in Greek?
Welcome to Stack Overflow, and thank you for asking such an interesting question! I wish what you are trying to do was simple. But your programming language (C), and your execution environment (the Windows console) were both designed a long time ago, without Greek in mind. As a result, it is not easy to use them for your simple school project.
When your C program outputs bytes to stdout via printf, the Windows Console interprets those bytes as characters. It has a default interpretation, or encoding, which does not include Greek. In order for your Greek letters to appear, you need to tell the Windows Console to use the correct encoding. You do this with the _setmode call, passing the _O_U16TEXT parameter. This is described in the Windows _setmode documentation, as Semih Artan pointed out in the comments.
The _O_U16TEXT mode means your program must print text out in UTF-16 form. Each character is 16 bits long. That means you must represent your text as wide characters, using C syntax like L"\x039a". The L before the double quotes marks the string as having "wide characters", where each character has 16 bits instead of 8 bits. The \x in the string indicates that the next four characters are hex digits, representing the 16 bits of a wide character.
Your C program is itself a text file. The C compiler must interpret the bytes of this text file in terms of characters. When used in a simple way, the compiler will expect only ASCII-compatible byte values in the file. That includes Latin letters and digits, and simple punctuation. It does not include Greek letters. Thus you must write your Greek text by representing its bytes with ASCII substitutes.
The Greek characters Καλησπέρα are, I believe, represented in C wide character syntax as L"\x039a\x03b1\x03bb\x03b7\x03c3\x03c0\x03ad\x03c1\x03b1".
Finally, Windows Console must have access to a Greek font in order for it to display the Greek characters. I expect this is not a problem for you, because you are probably already running your computer in Greek. In any case Windows worldwide includes fonts with Greek coverage.
Plugging this Greek text into the sample program in Microsoft's _setmode documentation gives this. (Note: I have not tested this program myself.)
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x039a\x03b1\x03bb\x03b7\x03c3\x03c0\x03ad\x03c1\x03b1\n");
    return 0;
}
Input is another matter. I won't attempt to go through it here. You probably have to set the mode of stdin to _O_U16TEXT. Then characters will appear as UTF-16. You may need to convert them before they are useful to your program.
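A sketch of that input side, which I also have not tested, assuming the same _setmode approach works for stdin and that fgetws() reads UTF-16 as documented:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* Put both ends of the console in UTF-16 mode. */
    _setmode(_fileno(stdin), _O_U16TEXT);
    _setmode(_fileno(stdout), _O_U16TEXT);

    wchar_t line[256];
    /* Read one line of wide characters (UTF-16 code units). */
    if (fgetws(line, 256, stdin) != NULL) {
        /* Echo it back; Greek typed in the console should round-trip. */
        wprintf(L"You typed: %ls", line);
    }
    return 0;
}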
Overall, to write a simple app for a school project which reads and writes Greek, I suggest you consider using a tool like Visual Studio to write a GUI program. Those tools have a more modern design and give you access to text with Greek letters more easily.

Does printf() care about the locale?

wprintf() takes a wchar_t string as argument and prints the string in the specified locale character encoding.
But I have noticed that when using printf() and passing it a UTF-8 string, the UTF-8 string will always be printed regardless of the specified locale character encoding (for example, if the UTF-8 string contains Arabic characters, and the locale is set to "C" (not "C.UTF-8"), then the Arabic characters will still be printed).
Am I correct that printf() doesn't care about the locale?
True, printf doesn't care about the locale for C strings. If you pass it a UTF-8 string, it knows nothing about it; it just sees a sequence of bytes (hopefully terminated by an ASCII NUL). The bytes are then passed to the output as-is and interpreted by the terminal (or whatever the output is). If the terminal is able to interpret UTF-8 sequences, it does so (if not, it tries to interpret them the way it is configured, Latin-1 or the like), and if it is also able to print them correctly it does so (sometimes it doesn't have the right font/glyph and prints unknown characters as ? or similar).
This is one of the big virtues (perhaps the biggest virtue) of UTF-8: it's just a string of reasonably ordinary bytes. If your code-editing environment knows how to let you type
printf("Cööl!\n");
and if your display environment (e.g. your terminal window) knows how to display it, you can just write that, and run it, and it works (as it sounds like you've discovered).
So you don't need special run-time support, you don't need special header files or libraries or anything, you don't need to write your code in some fancy new Unicodey way -- you can just keep on using ordinary C strings and printf and friends like you're used to, and it all just works.
Of course, those two if's can be big ones. If you can't figure out how to (or your code editing environment won't let you) type the characters, or if your display environment doesn't display them, you may be stuck, or you may have to do some hard work after all. (Display environments that don't properly display UTF-8 output from C programs are evidently quite common, based on the number of times the question gets asked here on SO.)
See also the "UTF-8 Everywhere" manifesto.
(Now, with all of this said, this doesn't mean that printf doesn't care about locale settings at all. There are aspects of the locale that printf may care about, and there may be character sets and encodings that printf might have to treat specially, in a locale-dependent way. But since printf doesn't have to do anything special to make UTF-8 work right, that one aspect of the locale -- although it's a biggie -- doesn't end up affecting printf at all.)
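To see exactly what printf is handed, here is a small sketch that prints the same UTF-8 string once as text and once byte by byte (the string is just illustrative):
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "é" in UTF-8 is the two bytes C3 A9; printf just forwards them. */
    const char *s = "\xC3\xA9";

    printf("%s\n", s);   /* the terminal decides how to render the bytes */

    /* Dump what printf actually handed to the stream. */
    for (size_t i = 0; i < strlen(s); i++)
        printf("%02X ", (unsigned)(unsigned char)s[i]);
    printf("\n");
    return 0;
}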
Let's consider the following simple program, which uses printf() to print a wide string if run without command-line arguments, and wprintf() otherwise:
#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

const wchar_t hello1[] = L"تحية طيبة";
const wchar_t hello2[] = L"Tervehdys";

int main(int argc, char *argv[])
{
    if (!setlocale(LC_ALL, ""))
        fprintf(stderr, "Warning: Current locale is not supported by the C library.\n");

    if (argc <= 1) {
        printf("printf 1: %ls\n", hello1);
        printf("printf 2: %ls\n", hello2);
    } else {
        wprintf(L"wprintf: %ls\n", hello1);
        wprintf(L"wprintf: %ls\n", hello2);
    }

    return EXIT_SUCCESS;
}
Using the GNU C library and any UTF-8 locale:
$ ./example
printf 1: تحية طيبة
printf 2: Tervehdys
$ ./example wide
wprintf: تحية طيبة
wprintf: Tervehdys
i.e. both produce the exact same output. However, if we run the example in the C/POSIX locale (that only supports ASCII), we get
$ LANG=C LC_ALL=C ./example
printf 1: printf 2: Tervehdys
i.e., the first printf() stopped at the first non-ASCII character (and that's why the second printf() printed on the same line);
$ LANG=C LC_ALL=C ./example wide
wprintf: ???? ????
wprintf: Tervehdys
i.e. wprintf() replaces wide characters that cannot be represented in the charset used by the current locale with a ?.
So, if we consider the GNU C library (which exhibits this behaviour), then we must say yes, printf cares about the locale, although it actually mostly cares about the character set used by the locale, and not the locale per se:
printf() will stop when trying to print wide strings that cannot be represented by the current character set (as defined by the locale). wprintf() will output question marks for those characters instead.
libc6-2.23-0ubuntu10 on x86-64 (amd64) does some replacements for multibyte characters in the printf format string, but multibyte characters in strings printed with %s are printed as-is. This means it is a bit complicated to say exactly what gets printed, and whether printf() gives up on the first multibyte or wide character it cannot convert or just prints it as-is.
However, wprintf() is pretty rock solid. (It too may choke if you try to print narrow strings with multibyte characters not representable in the character set used by the current locale, but for wide string stuff, it seems to work very well.)
Do note that POSIX.1 C libraries also provide iconv_open(), iconv(), and iconv_close() for converting strings, as well as mbstowcs() and wcstombs() to convert between wide and narrow/multibyte strings. You can also use asprintf() to create a dynamically allocated narrow string out of narrow and/or wide character strings (%s and %ls, respectively).
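As a rough, untested sketch of that wcstombs()/mbstowcs() round trip (assuming a locale that can represent the string, and keeping stdout byte-oriented throughout):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
    /* The conversions use the current locale's multibyte encoding,
       so pick it up from the environment (e.g. a UTF-8 locale). */
    setlocale(LC_ALL, "");

    const wchar_t *wide = L"Tervehdys";
    char narrow[64];

    /* wcstombs(): wide string -> narrow/multibyte string. */
    if (wcstombs(narrow, wide, sizeof narrow) == (size_t)-1) {
        fprintf(stderr, "Not representable in the current locale\n");
        return EXIT_FAILURE;
    }

    wchar_t back[64];
    /* mbstowcs(): narrow/multibyte string -> wide string. */
    if (mbstowcs(back, narrow, 64) == (size_t)-1) {
        fprintf(stderr, "Invalid multibyte sequence\n");
        return EXIT_FAILURE;
    }

    /* Keep stdout byte-oriented throughout, printing the wide string via %ls. */
    printf("narrow: %s\nwide again: %ls\n", narrow, back);
    return EXIT_SUCCESS;
}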

Impossible to put stdout in wide char mode

On my system, a pretty normal Ubuntu 13.10, the French accented characters "éèàçù..." are always handled correctly by whatever tools I use, despite the LC_ environment variables being set to en_US.UTF-8.
In particular command line utilities like grep, cat, ... always read and print these characters without a hitch.
Despite these remarks, such a small program as
#include <stdio.h>

int main() {
    printf("%c", getchar());
    return 0;
}
fails when the user enters "é".
From the man pages and a lot of googling, there is no standard way to close stdout and then reopen it. According to man fwide(), if stdout is in byte mode, I can't switch it to wide-character mode short of closing it and reopening it; therefore I can't use getwchar() and wprintf().
I can't believe that every single utility like cat, grep, etc... reimplements a way to manage wide characters, yet from my research, I see no other way.
Is it my system that has a problem? I can't see how since every utility works flawlessly.
What am I missing, please?
When a C program starts, stdout, stdin and stderr are neither byte nor wide-character oriented. fwide(stdin, 0) should return 0 at this point.
If you expand your minimal program to:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main()
{
    setlocale(LC_ALL, "");
    printf("%lc\n", getwchar());
    return 0;
}
Then it should work as you expect. (There is no need to explicitly set the orientation of stdin here - since the first operation on it is a wide-character operation, it will have wide-character orientation).
You do need to use getwchar() instead of getchar() if you want to read a wide character with it, though.
UTF-8 characters are read as byte codes, not characters, and non-ASCII characters are more than one byte. Check this question for more info.
The utilities you mention are generally line-oriented. If you were to try to read a whole line with e.g. fgets() rather than a single character, I think it'll work for you, too.
When you start reading single characters (which may be just bytes, and often are), you are of course very much susceptible to encoding issues.
Reading full lines will work just fine, as long as the line-termination encoding is not misunderstood (and for UTF-8 it won't be).
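A sketch of that line-oriented approach, assuming a UTF-8 locale: read whole lines with fgets() and echo them without ever interpreting individual bytes.
#include <stdio.h>
#include <string.h>

int main(void) {
    char line[1024];

    /* Read whole lines as raw bytes; multi-byte UTF-8 sequences such as
       "é" stay intact because they are never split apart. */
    while (fgets(line, sizeof line, stdin) != NULL) {
        size_t len = strcspn(line, "\n");
        printf("%.*s (%zu bytes)\n", (int)len, line, len);
    }
    return 0;
}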

char vs wchar_t

I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>

char *ascii_ = "中日友好";          //line-1
wchar_t *wchar_ = L"中日友好";      //line-2

int main()
{
    printf("ascii_: %s\n", ascii_);     //line-3
    wprintf(L"wchar_: %s\n", wchar_);   //line-4
    return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to a char* pointer in line-1, but I just did, and the output of line-3 is correct. So why? How could printf() in line-3 give me the non-ASCII characters? Does it somehow know the encoding?
I assume the code in line-2 and line-4 is correct, but why didn't I get any output from line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is ASCII-compatible.
Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream can only be set to either byte or wide orientation. Once set, it cannot be changed. It is set the first time the stream is used (here it becomes byte-oriented because of the printf). After that, the wprintf will not work due to the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question about why the wprintf line doesn't work even by itself: it's probably because the code is compiled so that the UTF-8 form of 中日友好 is stored into wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).
So there's two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using unicode escape sequences. This will obviously make it unreadable in the source code, but it will work on any machine that can print Chinese character fonts regardless of the locale.
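A minimal sketch of the second option (untested here, and assuming a terminal and font that can display CJK): write the characters as universal character names so the source stays ASCII, and stick to wide-character output only.
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* Pick up the user's locale so wprintf() knows how to encode the
       wide characters for output (e.g. as UTF-8 on a typical Linux box). */
    setlocale(LC_ALL, "");

    /* 中日友好 written as universal character names, so the source file
       itself stays plain ASCII. */
    const wchar_t *wchar_ = L"\u4e2d\u65e5\u53cb\u597d";

    /* Use only wide-character output so stdout keeps wide orientation. */
    wprintf(L"wchar_: %ls\n", wchar_);
    return 0;
}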
Line 1 is not ASCII; it's whatever multibyte encoding is used by your compiler at compile time. On modern systems that's probably UTF-8. printf does not know the encoding. It's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character io on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.
You are omitting one step and therefore thinking about it the wrong way.
You have a C file on disk, containing bytes. You have an "ASCII" string and a wide string.
The ASCII string takes the bytes exactly as they are in line-1 and outputs them.
This works as long as the encoding on the user's side is the same as the one on the programmer's side.
The wide string is first decoded from the given bytes into Unicode code points and stored in the program (maybe this goes wrong on your side). On output, they are encoded again according to the encoding on the user's side. This ensures that these characters are emitted as they are intended, not merely as the bytes that were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.

What is the encoding of argv?

It's not clear to me what encodings are used where in C's argv. In particular, I'm interested in the following scenario:
A user uses locale L1 to create a file whose name, N, contains non-ASCII characters
Later on, a user uses locale L2 to tab-complete the name of that file on the command line, which is fed into a program P as a command line argument
What sequence of bytes does P see on the command line?
I have observed that on Linux, creating a filename in the UTF-8 locale and then tab-completing it in (e.g.) the zw_TW.big5 locale seems to cause my program P to be fed UTF-8 rather than Big5. However, on OS X the same series of actions results in my program P getting a Big5 encoded filename.
Here is what I think is going on so far (long, and I'm probably wrong and need to be corrected):
Windows
File names are stored on disk in some Unicode format. So Windows takes the name N, converts from L1 (the current code page) to a Unicode version of N we will call N1, and stores N1 on disk.
What I then assume happens is that when tab-completing later on, the name N1 is converted to locale L2 (the new current code page) for display. With luck, this will yield the original name N -- but this won't be true if N contained characters unrepresentable in L2. We call the new name N2.
When the user actually presses enter to run P with that argument, the name N2 is converted back into Unicode, yielding N1 again. This N1 is now available to the program in UCS2 format via GetCommandLineW/wmain/tmain, but users of GetCommandLine/main will see the name N2 in the current locale (code page).
OS X
The disk-storage story is the same, as far as I know. OS X stores file names as Unicode.
With a Unicode terminal, I think what happens is that the terminal builds the command line in a Unicode buffer. So when you tab complete, it copies the file name as a Unicode file name to that buffer.
When you run the command, that Unicode buffer is converted to the current locale, L2, and fed to the program via argv, and the program can decode argv with the current locale into Unicode for display.
Linux
On Linux, everything is different and I'm extra confused about what is going on. Linux stores file names as byte strings, not in Unicode. So if you create a file with name N in locale L1, then N as a byte string is what is stored on disk.
When I later run the terminal and try and tab-complete the name, I'm not sure what happens. It looks to me like the command line is constructed as a byte buffer, and the name of the file as a byte string is just concatenated onto that buffer. I assume that when you type a standard character it is encoded on the fly to bytes that are appended to that buffer.
When you run a program, I think that buffer is sent directly to argv. Now, what encoding does argv have? It looks like any characters you typed in the command line while in locale L2 will be in the L2 encoding, but the file name will be in the L1 encoding. So argv contains a mixture of two encodings!
Question
I'd really like it if someone could let me know what is going on here. All I have at the moment is half-guesses and speculation, and it doesn't really fit together. What I'd really like to be true is for argv to be encoded in the current code page (Windows) or the current locale (Linux / OS X) but that doesn't seem to be the case...
Extras
Here is a simple candidate program P that lets you observe encodings for yourself:
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    for (char *c = argv[1]; *c; c++, len++) {
        printf("%d ", (int)(*c));
    }

    printf("\nLength: %d\n", len);
    return 0;
}
You can use locale -a to see available locales, and use export LC_ALL=my_encoding to change your locale.
Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things that have resolved my question:
As discussed, on Windows the argv is encoded using the current code page. However, you can retrieve the command line as UTF-16 using GetCommandLineW. Use of argv is not recommended for modern Windows apps with Unicode support because code pages are deprecated.
On Unixes, the argv has no fixed encoding:
a) File names inserted by tab-completion/globbing will occur in argv verbatim as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.
b) Input entered directly by the user using their IME will occur in argv in the locale encoding. (Ubuntu seems to use LOCALE to decide how to encode IME input, whereas OS X uses the Terminal.app encoding Preference.)
This is annoying for languages such as Python, Haskell or Java, which want to treat command line arguments as strings. They need to decide how to decode argv into whatever encoding is used internally for a String (which is UTF-16 for those languages). However, if they just use the locale encoding to do this decoding, then valid filenames in the input may fail to decode, causing an exception.
The solution to this problem adopted by Python 3 is a surrogate-byte encoding scheme (http://www.python.org/dev/peps/pep-0383/) which represents any undecodable byte in argv as special Unicode code points. When that code point is decoded back to a byte stream, it just becomes the original byte again. This allows for roundtripping data from argv that is not valid in the current encoding (i.e. a filename named in something other than the current locale) through the native Python string type and back to bytes with no loss of information.
As you can see, the situation is pretty messy :-)
I can only speak about Windows for now. On Windows, code pages are only meant for legacy applications and are not used by the system or by modern applications. Windows uses UTF-16 (and has done so for ages) for everything: text display, file names, the terminal, the system API. Conversions between UTF-16 and the legacy code pages are only performed at the highest possible level, directly at the interface between the system and the application (technically, the older API functions are implemented twice: one function FunctionW that does the real work and expects UTF-16 strings, and one compatibility function FunctionA that simply converts input strings from the current (thread) code page to UTF-16, calls FunctionW, and converts the results back). Tab-completion should always yield UTF-16 strings (it definitely does when using a TrueType font) because the console uses only UTF-16 as well. The tab-completed UTF-16 file name is handed over to the application. If that application is a legacy application (i.e., it uses main instead of wmain/GetCommandLineW etc.), then the Microsoft C runtime (probably) uses GetCommandLineA to have the system convert the command line. So basically I think what you're saying about Windows is correct (only that there is probably no conversion involved while tab-completing): the argv array will always contain the arguments in the code page of the current application, because the information about which code page (L1) the original program used has been irreversibly lost during the intermediate UTF-16 stage.
The conclusion is as always on Windows: Avoid the legacy code pages; use the UTF-16 API wherever you can. If you have to use main instead of wmain (e.g., to be platform independent), use GetCommandLineW instead of the argv array.
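As a rough, untested sketch of that suggestion: keep a narrow entry point, but pull the arguments out as UTF-16 with GetCommandLineW/CommandLineToArgvW and dump each argument's UTF-16 code units in hex.
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW; link with Shell32 */
#include <stdio.h>

int main(void) {
    /* Retrieve the command line as UTF-16, bypassing the code-page
       conversion that produced the narrow argv. */
    int wargc = 0;
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv == NULL)
        return 1;

    for (int i = 0; i < wargc; i++) {
        /* Print the UTF-16 code units of each argument in hex. */
        wprintf(L"arg %d:", i);
        for (wchar_t *p = wargv[i]; *p; p++)
            wprintf(L" %04x", (unsigned)*p);
        wprintf(L"\n");
    }

    LocalFree(wargv);
    return 0;
}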
The output from your test app needed some modifications to make sense: you need hex codes, and you need to get rid of the negative values; otherwise you can't print things like UTF-8 special chars in a readable way.
First, the modified program:
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    for (unsigned char *c = (unsigned char *)argv[1]; *c; c++, len++) {
        printf("%x ", (*c));
    }

    printf("\nLength: %d\n", len);
    return 0;
}
Then on my Ubuntu box that is using UTF-8 I get this output.
$> gcc -std=c99 argc.c -o argc
$> ./argc 1ü
31 c3 bc
Length: 3
And here you can see that in my case ü is encoded as two chars, while the 1 is a single char. More or less exactly what you expect from a UTF-8 encoding.
And this actually matches what is in the LANG environment variable.
$> env | grep LANG
LANG=en_US.utf8
Hope this clarifies the Linux case a little.
/Good luck
Yep, users have to be careful when mixing locales on Unix in general. GUI file managers that display and change filenames also have this problem. On Mac OS X the standard Unix encoding is UTF-8. In fact, the HFS+ filesystem, when called via the Unix interfaces, enforces UTF-8 filenames because it needs to convert them to UTF-16 for storage in the filesystem itself.

Resources