Impossible to put stdout in wide char mode - c

On my system, a pretty normal Ubuntu 13.10, the french accented characters "éèàçù..." are always handled correctly by whatever tools I use, despite LC_ environment variables being set to en_US.UTF-8.
In particular command line utilities like grep, cat, ... always read and print these characters without a hitch.
Despite these remarks, such a small program as
int main() {
printf("%c", getchar());
return 0;
}
fails when the user enters "é".
From the man pages, and a lot of googling, there is no standard way to close stdout, then reopening it. From man fwide(), if stdout is in byte mode, I can't pass it to wide character mode, short of closing it and reopening it... therefore I can't use getwchar() and wprintf().
I can't believe that every single utility like cat, grep, etc... reimplements a way to manage wide characters, yet from my research, I see no other way.
Is it my system that has a problem? I can't see how since every utility works flawlessly.
What am I missing, please?

When a C program starts, stdout, stdin and stderr are neither byte nor wide-character oriented. fwide(stdin, 0) should return 0 at this point.
If you expand your minimal program to:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main()
{
setlocale(LC_ALL, "");
printf("%lc\n", getwchar());
return 0;
}
Then it should work as you expect. (There is no need to explicitly set the orientation of stdin here - since the first operation on it is a wide-character operation, it will have wide-character orientation).
You do need to use getwchar() instead of getchar() if you want to read a wide character with it, though.

UTF-8 character are taken as byte code not character and non ascii character are more then one byte.
Check this Question
for more info

The utilities you mention are generally line-oriented. If you were to try to read a whole line with e.g. fgets() rather than a single character, I think it'll work for you, too.
When you start reading single characters (which may be just bytes, and often are), you are of course very much susceptible to encoding issues.
Reading full lines will work just fine, as long as the line-termiation encoding is not mis-understood (and for UTF-8 it won't be).

Related

What's the difference between putch() and putchar()?

Okay so, I'm pretty new to C.
I've been trying to figure out what exactly is the difference between putch() and putchar()?
I tried googling my answers but all I got was the same copy-pasted-like message that said:
putchar(): This function is used to print one character on the screen, and this may be any character from C character set (i.e it may be printable or non printable characters).
putch(): The putch() function is used to display all alphanumeric characters through the standard output device like monitor. this function display single character at a time.
As English isn't my first language I'm kinda lost. Are there non printable characters in C? If so, what are they? And why can't putch produce the same results?
I've tried googling the C character set and all of the alphanumeric characters there are, but as much as my testing went, there wasn't really anything that one function could print and the other couldn't.
Anyways, I'm kind of lost.
Could anyone help me out? thanks!
TLDR;
what can putchar() do that putch() can't? (or the opposite or something idk)
dunno, hoped there would be a visible difference between the two but can't seem to find it.
putchar() is a standard function, possibly implemented as a macro, defined in <stdio.h> to output a single byte to the standard output stream (stdout).
putch() is a non standard function available on some legacy systems, originally implemented on MS/DOS as a way to output characters to the screen directly, bypassing the standard streams buffering scheme. This function is mostly obsolete, don't use it.
Output via stdout to a terminal is line buffered by default on most systems, so as long as your output ends with a newline, it will appear on the screen without further delay. If you need output to be flushed in the absence of a newline, use fflush(stdout) to force the stream buffered contents to be written to the terminal.
putch replaces the newline character (\ n)
putchar is a function in the C programming language that writes a single character to the standard output

How does the Windows terminal know if I wrote a wide character or a normal character?

I used _setmode(_fileno(stdout), _O_U8TEXT).
I am trying to understand how wprintf() differs from printf(), so I am trying to understand the difference betweem putc() and fputwc(). I thought the difference was that fputwc() could write 1–2 bytes and putc() only one byte. This was the case when using them to write one byte long characters to a file, but when I use them to write to stdout, they were different.
The letter L'é' is equivalent to 233, which is one byte long. So I thought that using putc(233, stdout) would work as fputwc(233, stdout), but it does not: only fputwc(233, stdout) works, even though both work equally when using them to write to a file. The other command wrote nothing.
Hence my question: how does the (Windows) terminal know if I wrote a wide character or a normal character?

How to read banana (🍌) from console in C?

I have tried many ways to do it.. using scanf(), getc(), but nothing worked. Most of the time, 0 is stored in the supplied variable (maybe indicating wrong input?). How can I make it so that when the user enters any Unicode codepoint, it is properly recognized and stored in either a string or a char?
I'm guessing you already know that C chars and Unicode characters are two very different things, so I'll skip over that. The assumptions I'll make here include:
Your C strings will contain UTF-8 encoded characters, terminated by a NUL (\x00) character.
You won't use any C functions that could break the per-character encoding, and you will use output (strlen(), etc) with the understanding you need to differentiate between C chars and real characters.
It really is as simple as:
char input[256];
scanf("%[^\n]", &input);
printf("%s\n", input);
The problems comes with what is providing the input, and what is displaying the output.
#include <stdio.h>
int main(int argc, char** argv) {
char* bananna = "\xF0\x9F\x8D\x8C\x00";
printf("%s\n", bananna);
}
This probably won't display a banana. That's because the UTF-8 sequence being written to the terminal isn't being interpreted as a UTF-8 sequence.
So, the first thing you need to do is to configure your terminal. If your program is likely to only use one terminal type, then you might even be able to do this from within the program; however, there are tons of people who use different terminals, some that even cross Operating System boundaries. For example, I'm testing my Linux programs in a Windows terminal, connected to the Linux system using SSH.
Once the terminal is configured, your probably already correct program should display a banana. But, even a correctly configured terminal can fail.
After the terminal is verified to be correctly configured, the last piece of the puzzle is the font. Not all fonts contain glyphs for all Unicode characters. The banana is one of those characters that isn't typically typed into a computer, so you need to open up a font tool and search the font for the glyph. If it doesn't exist in that font, you need to find a font that implements a glyph for that character.

Does wide character input/output in C always read from / write to the correct (system default) encoding?

I'm primarily interested in the Unix-like systems (e.g., portable POSIX) as it seems like Windows does strange things for wide characters.
Do the read and write wide character functions (like getwchar() and putwchar()) always "do the right thing", for example read from utf-8 and write to utf-8 when that is the set locale, or do I have to manually call wcrtomb() and print the string using e.g. fputs()? On my system (openSUSE 12.3) where $LANG is set to en_GB.UTF-8 they do seem to do the right thing (inspecting the output I see what looks like UTF-8 even though strings were stored using wchar_t and written using the wide character functions).
However I am unsure if this is guaranteed. For example cprogramming.com states that:
[wide characters] should not be used for output, since spurious zero
bytes and other low-ASCII characters with common meanings (such as '/'
and '\n') will likely be sprinkled throughout the data.
Which seems to indicate that outputting wide characters (presumably using the wide character output functions) can wreak havoc.
Since the C standard does not seem to mention coding at all I really have no idea who/when/how coding is applied when using wchar_t. So my question is basically if reading, writing and using wide characters exclusively is a proper thing to do when my application has no need to know about the encoding used. I only need string lengths and console widths (wcswidth()), so to me using wchar_t everywhere when dealing with text seems ideal.
The relevant text governing the behavior of the wide character stdio functions and their relationship to locale is from POSIX XSH 2.5.2 Stream Orientation and Encoding Rules:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_02
Basically, the wide character stdio functions always write in the encoding that's in effect (per the LC_CTYPE locale category) at the time the FILE stream becomes wide-oriented; this means the first time a wide stdio function is called on it, or fwide is used to set the orientation to wide. So as long as a proper LC_CTYPE locale is in effect matching the desired "system" encoding (e.g. UTF-8) when you start working with the stream, everything should be fine.
However, one important consideration you should not overlook is that you must not mix byte and wide oriented operations on the same FILE stream. Failure to observe this rule is not a reportable error; it simply results in undefined behavior. As a good deal of library code assumes stderr is byte oriented (and some even makes the same assumption about stdout), I would strongly discourage ever using wide-oriented functions on the standard streams. If you do, you need to be very careful about which library functions you use.
Really, I can't think of any reason at all to use wide-oriented functions. fprintf is perfectly capable of sending wide-character strings to byte-oriented FILE streams using the %ls specifier.
So long as the locale is set correctly, there shouldn't be any issues processing UTF-8 files on a system using UTF-8, using the wide character functions. They'll be able to interpret things correctly, i.e. they'll treat a character as 1-4 bytes as necessary (in both input and output). You can test it out by something like this:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main()
{
setlocale(LC_CTYPE, "en_GB.UTF-8");
// setlocale(LC_CTYPE, ""); // to use environment variable instead
wchar_t *txt = L"£Δᗩ";
wprintf(L"The string %ls has %d characters\n", txt, wcslen(txt));
}
$ gcc -o loc loc.c && ./loc
The string £Δᗩ has 3 characters
If you use the standard functions (in particular character functions) on multibyte strings carelessly, things will start to break, e.g. the equivalent:
char *txt = "£Δᗩ";
printf("The string %s has %zu characters\n", txt, strlen(txt));
$ gcc -o nloc nloc.c && ./nloc
The string £Δᗩ has 7 characters
The string still prints correctly here because it's essentially just a stream of bytes, and as the system is expecting UTF-8 sequences, they're translated perfectly. Of course strlen is reporting the number of bytes in the string, 7 (plus the \0), with no understanding that a character and a byte aren't equivalent.
In this respect, because of the compatibility between ASCII and UTF-8, you can often get away with treating UTF-8 files as simply multibyte C strings, as long as you're careful.
There's a degree of flexibility as well. It's possible to convert a standard C string (as a multibyte string) to a wide character string easily:
char *stdtxt = "ASCII and UTF-8 €£¢";
wchar_t buf[100];
mbstowcs(buf, stdtxt, 20);
wprintf(L"%ls has %zu wide characters\n", buf, wcslen(buf));
Output:
ASCII and UTF-8 €£¢ has 19 wide characters
Once you've used a wide character function on a stream, it's set to wide orientation. If you later want to use standard byte i/o functions, you'll need to re-open the stream first. This is probably why the recommendation is not to use it on stdout. However, if you only use wide character functions on stdin and stdout (including any code that you link to), you will not have any problems.
Don't use fputs with anything else than ASCII.
If you want to write down lets say UTF8, then use a function who return the real size used by the utf8 string and use fwrite to write the good number of bytes, without worrying of vicious '\0' inside the string.

char vs wchar_t

I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *ascii_ = "中日友好"; //line-1
wchar_t *wchar_ = L"中日友好"; //line-2
int main()
{
printf("ascii_: %s\n", ascii_); //line-3
wprintf(L"wchar_: %s\n", wchar_); //line-4
return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to char* pointer in line-1, but I just did it, and the output of line-3 is correct, So why? How could printf() in line-3 give me the non-ascii characters? Does it know the encoding somehow?
I assume the code in line-2 and line-4 are correct, but why I didn't get any output of line-4?
First of all, it's usually not a good idea to use non-ascii characters in source code. What's probably happening is that the chinese characters are being encoded as UTF-8 which works with ascii.
Now, as for why the wprintf() isn't working. This has to do with stream orientation. Each stream can only be set to either normal or wide. Once set, it cannot be changed. It is set the first time it is used. (which is ascii due to the printf). After that the wprintf will not work due the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question about why the wprintf line doesn't work even by itself. It's probably because the code is being compiled so that the UTF-8 format of 中日友好 is stored into wchar_. However, wchar_t needs 4-byte unicode encoding. (2-bytes in Windows)
So there's two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using unicode escape sequences. This will obviously make it unreadable in the source code, but it will work on any machine that can print Chinese character fonts regardless of the locale.
Line 1 is not ascii, it's whatever multibyte encoding is used by your compiler at compile-time. On modern systems that's probably UTF-8. printf does not know the encoding. It's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character io on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.
You are omitting one step and therefore think the wrong way.
You have a C file on disk, containing bytes. You have a "ASCII" string and a wide string.
The ASCII string takes the bytes exactly like they are in line 1 and outputs them.
This works as long as the encoding of the user's side is the same as the one on the programmer's side.
The wide string first decodes the given bytes into unicode codepoints and stored in the program- maybe this goes wrong on your side. On output they are encoded again according to the encoding on the user's side. This ensures that these characters are emitted as they are intended to, not as they are entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.

Resources