How can C read Chinese from console and file - c

I'm using Ubuntu 12.04.
I want to know how I can read Chinese using C:
setlocale(LC_ALL, "zh_CN.UTF-8");
scanf("%s", st1);
for (b = 0; b < max_w; b++)
{
    printf("%d ", st1[b]);
    if (st1[b] == 0)
        break;
}
For this code, when I input English it outputs fine, but if I enter Chinese like "的", it outputs
Enter word or sentence (EXIT to break): 的
target char seq :
-25 -102 -124 0
I'm wondering why there are negative values in the array.
Further, I found that the bytes of "的" read from a file using fscanf are different from those read from the console.

UTF-8 encodes characters with a variable number of bytes. This is why you see three bytes for the 的 sign.
At graphemica - 的, you can see that 的 has the value U+7684 which translates to E7 9A 84 when you encode it in UTF-8.
You print every byte separately as an integer value. A char type might be signed and when it is converted to an integer, you can get negative numbers too. In your case this is
-25 = E7
-102 = 9A
-124 = 84
You can print the bytes as hex values with %x or as unsigned integers with %u (casting to unsigned char first); then you will see only positive numbers.
You can also change your print statement to
printf("%d ", (unsigned char) st1[b]);
which will interpret the bytes as unsigned values and show your output as
231 154 132 0
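To see all three views side by side, here is a small self-contained sketch (my own, with the three bytes of 的 hard-coded so it does not depend on console input):
#include <stdio.h>

int main(void)
{
    /* The three UTF-8 bytes of 的 (U+7684), hard-coded for the demo. */
    char st1[] = "\xe7\x9a\x84";

    for (int b = 0; st1[b] != 0; b++)
        printf("%d / %u / %x\n",
               st1[b],                  /* may be negative if char is signed */
               (unsigned char)st1[b],   /* 231, then 154, then 132 */
               (unsigned char)st1[b]);  /* e7, 9a, 84 */
    return 0;
}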

There's no need (and in fact it's harmful) to hard-code a specific locale name. What characters you can read are independent of the locale's language (used for messages), and any locale with UTF-8 encoding should work fine.
The easiest (but ugly once you try to go too far with it) way to make this work is to use the wide character stdio functions (e.g. getwc) instead of the byte-oriented ones. Otherwise you can read bytes then process them with mbrtowc.
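As a rough illustration of the wide-character route (a minimal sketch of my own, not the asker's code; it assumes the program runs under any UTF-8 locale taken from the environment):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* "" picks up the user's locale; any UTF-8 locale works, not just zh_CN. */
    setlocale(LC_ALL, "");

    wint_t wc;
    /* getwc() assembles complete multibyte characters from stdin, so a
       character such as 的 arrives as one wide character, not three bytes. */
    while ((wc = getwc(stdin)) != WEOF && wc != L'\n')
        wprintf(L"U+%04lX ", (unsigned long)wc);
    wprintf(L"\n");
    return 0;
}
Typing 的 and pressing Enter should print U+7684, matching the three bytes E7 9A 84 discussed above.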

Related

linux - serial port programming ( ASCII to Byte )

I tried to receive data from a serial port. However, the data is unrecognizable to me. The root cause is that it is in ASCII; to decode the data, it needs to be in byte format.
The buffer I've created is unsigned char [255] and I try to print out the data using
while (STOP == FALSE) {
    res = read(fd, buf, 255);
    buf[res] = 0;
    printf(":%x\n", buf[0]);
    if (buf[0] == 'z') STOP = TRUE;
}
Two questions here:
The data might be shorter than 255 bytes in the real case; it might only fill 20 - 30 of the 255 entries. In that case, how can I print just those 20 entries?
The correct output should be 41542b (AT+) as the head of the entire command, since this is an AT command. So I expect buf[0] to be 41 at the beginning. It is, but I don't know why the second byte is e0 when I expect 54 (T).
Thanks
ASCII is a text encoding in bytes. There's no difference in reading them; it's just a matter of how you interpret what you read. This is not your problem.
Your problem is you read up to 255 bytes at once and only ever print the first of them.
It's pointless to set buf[res] to 0 when you expect binary data (that possibly contains 0 bytes). That's just useful for terminating text strings.
Just use a loop over your buffer, e.g.
for (int i = 0; i < res; ++i)
{
    printf("%x", buf[i]);
}
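For completeness, a sketch of how the pieces could fit together (assuming, as in the question, that fd is an already-opened serial port descriptor; dump_serial is a hypothetical helper, and %02x is used so single-digit bytes keep a leading zero):
#include <stdio.h>
#include <unistd.h>

/* Hex-dump exactly the bytes returned by each read() on an already-opened
   descriptor, stopping on error or when a chunk starts with 'z'. */
static void dump_serial(int fd)
{
    unsigned char buf[255];
    ssize_t res;

    for (;;) {
        res = read(fd, buf, sizeof buf);
        if (res <= 0)                 /* read error or end of stream */
            break;
        for (ssize_t i = 0; i < res; ++i)
            printf("%02x ", buf[i]);  /* only the bytes actually received */
        printf("\n");
        if (buf[0] == 'z')            /* same stop condition as the question */
            break;
    }
}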

Columns generated by wprintf are not equal

I'm using wprintf to print out c-strings of different size.
wprintf(L"%-*.*ls ", PRINTED_WORD_LENGTH, PRINTED_WORD_LENGTH, word->string);
int i;
for (i = 0; i < word->usage_length; i++) {
    printf("%d ", word->usage[i]);
}
printf("\n");
As you can see, these strings contain some diacritic characters. Rows with these characters aren't formatted correctly (wprintf doesn't use enough spaces when it encounters them). Is there any way to format rows correctly without writing a new function?
z 39 46 62 113
za 101 105
zabawa 132
zasną 123
zatrzymać 88
They do align correctly at the byte level. It is only because you are looking at the output as UTF-8 multi-byte characters that they appear misaligned (for whatever definition of text alignment you want to use).
If you are targeting a POSIX-conforming implementation, you can perhaps use the wcswidth(3) function: it was purposely specified to solve this kind of problem (originally with CJKV characters). It is a bit lower-level, though.
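A rough sketch of that approach (my own illustration, assuming a POSIX system where wcswidth is available; print_padded is a hypothetical helper, not a standard function):
#define _XOPEN_SOURCE 700   /* wcswidth() is an XSI extension */
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Print a wide string, then pad with spaces until it occupies
   target_width terminal columns as measured by wcswidth(). */
static void print_padded(const wchar_t *s, int target_width)
{
    int cols = wcswidth(s, wcslen(s));
    if (cols < 0)                    /* non-printable character: fall back */
        cols = (int)wcslen(s);
    wprintf(L"%ls", s);
    for (int i = cols; i < target_width; ++i)
        putwchar(L' ');
}

int main(void)
{
    setlocale(LC_ALL, "");           /* required for wide output and wcswidth */
    print_padded(L"zabawa", 12);
    wprintf(L"132\n");
    print_padded(L"zasną", 12);
    wprintf(L"123\n");
    return 0;
}
Padding by display columns rather than by bytes keeps the rows aligned, and the same helper also copes with double-width CJK characters.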

How does C retrieve the specific ASCII value?

In the below program
#include <stdio.h>
int main()
{
    int k = 65;
    printf(" The ASCII value is : %c", k);
    return 0;
}
The output is "The ASCII value is : A".
I just don't understand how %c produces the corresponding ASCII character for that number.
I mean, how can an integer value be passed to %c (instead of %d) and still produce the ASCII character?
How does this process work? Please explain.
It's not the "%c" that is doing it. When you run your program, all it does is output a sequence of bytes (numbers) to the standard output. If you use "%c" it will output a single byte with value 65, and if you use "%d" it will output two bytes, one with value 54 for the '6' and one with value 53 for the '5'. Then your terminal displays those bytes as character glyphs, according to whatever encoding it is using. If your terminal is using an ASCII-compatible encoding, then 65 will be the code for "A".
The short version: your operating system has a table that links a graphical symbol to each integer, and for 65 it has linked the glyph for "A". The ASCII standard says that 65 should be linked to a graphic that humans can read as "A".
I think the following answer helps you. Credits to Sujeet Gholap.
the %c works as follows:
1. Take the least significant byte of the variable
2. Interpret that byte as an ASCII character
That's it.
So, when you look at the four bytes of 65 (in hex), they look like
00 00 00 41
The %c looks at the 41
and prints it as A.
That's it.
#include <stdio.h>
int main()
{
    int k = 65 + 256;
    printf(" The ASCII value is : %c", k);
    return 0;
}
consider that code
where k is
00 00 01 41
here, even now, the last byte is 41
so, it still prints A
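A small sketch that makes the same point (my own example, not from the answer): %c converts its int argument to unsigned char, so values that differ by a multiple of 256 print the same character.
#include <stdio.h>

int main(void)
{
    /* Only the low-order byte matters to %c: 65, 321 and 577 all print 'A'. */
    int values[] = { 65, 65 + 256, 65 + 512 };
    for (int i = 0; i < 3; i++)
        printf("%d prints as %c\n", values[i], values[i]);
    return 0;
}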

printf field width : bytes or chars?

The printf/fprintf/sprintf family supports a width field in its format specifier. I have a doubt for the case of (non-wide) char array arguments:
Is the width field supposed to mean bytes or characters?
What is the (correct, de facto) behaviour if the char array corresponds to (say) a raw UTF-8 string?
(I know that normally I should use some wide char type; that's not the point.)
For example, in
char s[] = "ni\xc3\xb1o"; // utf8 encoded "niño"
fprintf(f,"%5s",s);
Is that function supposed to try to output just 5 bytes (plain C chars), with you taking responsibility for misalignments or other problems if several of those bytes make up one textual character?
Or is it supposed to try to compute the length of the array in "textual characters" (decoding it... according to the current locale)?
(In the example, this would amount to finding out that the string has 4 Unicode chars, so it would add a space for padding.)
UPDATE: I agree with the answers; it is logical that the printf family doesn't distinguish plain C chars from bytes. The problem is that my glibc does not seem to fully respect this notion if the locale has been set previously and one has the (today most common) LANG/LC_CTYPE=en_US.UTF-8.
Case in point:
#include <stdio.h>
#include <locale.h>

int main(void) {
    char *locale = setlocale(LC_ALL, ""); /* I have LC_CTYPE="en_US.UTF-8" */
    char s[] = {'n', 'i', 0xc3, 0xb1, 'o', 0}; /* "niño" in utf8: 5 bytes, 4 unicode chars */
    printf("|%*s|\n", 6, s);    /* this should pad a blank - works ok */
    printf("|%.*s|\n", 4, s);   /* this should eat a char - works ok */
    char s3[] = {'A', 0xb1, 'B', 0}; /* this is not valid UTF8 */
    printf("|%s|\n", s3);       /* print raw chars - ok */
    printf("|%.*s|\n", 15, s3); /* panics (why???) */
    return 0;
}
So, even when a non-POSIX-C locale has been set, printf still seems to have the right notion for counting width: bytes (plain C chars) and not Unicode chars. That's fine. However, when given a char array that is not decodable in its locale, it silently panics (it aborts - nothing is printed after the first '|' - without error messages)... but only when it needs to count some width. I don't understand why it even tries to decode the string from utf-8 when it doesn't need to. Is this a bug in glibc?
Tested with glibc 2.11.1 (Fedora 12) (also glibc 2.3.6)
Note: it's not related to terminal display issues - you can check the output by piping to od : $ ./a.out | od -t cx1 Here's my output:
0000000 | n i 303 261 o | \n | n i 303 261 | \n
7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a
0000020 | A 261 B | \n |
7c 41 b1 42 7c 0a 7c
UPDATE 2 (May 2015): This questionable behaviour has been fixed in newer versions of glibc (from 2.17, it seems). With glibc-2.17-21.fc19 it works ok for me.
It will result in five bytes being output. And five chars. In ISO C, there is no distinction between chars and bytes. Bytes are not necessarily 8 bits, instead being defined as the width of a char.
The ISO term for an 8-bit value is an octet.
Your "niño" string is actually five characters wide in terms of the C environment (sans the null terminator, of course). If only four symbols show up on your terminal, that's almost certainly a function of the terminal, not C's output functions.
I'm not saying a C implementation couldn't handle Unicode. It could quite easily do UTF-32 if CHAR_BIT was defined as 32. UTF-8 would be harder since it's a variable-length encoding, but there are ways around almost any problem :-)
Based on your update, it seems like you might have a problem. However, I'm not seeing your described behaviour in my setup with the same locale settings. In my case, I'm getting the same output in those last two printf statements.
If your setup is just stopping output after the first | (I assume that's what you mean by abort but, if you meant the whole program aborts, that's much more serious), I would raise the issue with GNU (try your particular distribution's bug procedures first). You've done all the important work, such as producing a minimal test case, so someone should be happy to run it against the latest version if your distribution doesn't quite get there (most don't).
As an aside, I'm not sure what you meant by checking the od output. On my system, I get:
pax> ./qq | od -t cx1
0000000 | n i 303 261 o | \n | n i 303 261 | \n
7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a
0000020 | A 261 B | \n | A 261 B | \n
7c 41 b1 42 7c 0a 7c 41 b1 42 7c 0a
0000034
so you can see the output stream contains the UTF-8, meaning that it's the terminal program which must interpret this. C/glibc isn't modifying the output at all, so maybe I just misunderstood what you were trying to say.
Although I've just realised you may be saying that your od output has only the starting bar on that line as well (unlike mine, which appears not to have the problem), meaning that something is wrong within C/glibc rather than the terminal silently dropping the characters. In all honesty, I would expect the terminal to drop either the whole line or just the offending character (i.e., output |A); the fact that you're just getting | seems to preclude a terminal problem. Please clarify that.
Bytes (chars). There is no built-in support for Unicode semantics. You can imagine it as resulting in at least 5 calls to fputc.
What you've found is a bug in glibc. Unfortunately it's an intentional one which the developers refuse to fix. See here for a description:
http://www.kernel.org/pub/linux/libs/uclibc/Glibc_vs_uClibc_Differences.txt
The original question (bytes or chars?) was rightly answered by several people: both according to the spec and the glibc implementation, the width (or precision) in the printf C function counts bytes (or plain C chars, which are the same thing). So, fprintf(f,"%5s",s) in my first example, means definitely "try to output at least 5 bytes (plain chars) from the array s -if not enough, pad with blanks".
It does not matter whether the string (in my example, of byte-length 5) represents text encoded in, say, UTF-8 and in fact contains 4 "textual (unicode) characters". To printf(), internally, it just has 5 (plain) C chars, and that's what counts.
Ok, this seems crystal clear. But it doesn't explain my other problem. Then we must be missing something.
Searching in glibc bug-tracker, I found some related (rather old) issues - I was not the first one caught by this... feature:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308
http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
This quote, from the last link, is specially relevant here:
ISO C99 requires for %.*s to only write complete characters that fit below the
precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1
characters as shown in the input file you provided, some of the strings are
not valid UTF-8 strings, therefore sprintf fails with -1 because of the
encoding error. That's not a bug in glibc.
Whether it is a bug (perhaps in interpretation or in the ISO spec itself) is debatable.
But what glibc is doing is clear now.
Recall my problematic statement: printf("|%.*s|\n", 15, s3). Here, glibc must find out whether the length of s3 is greater than 15 and, if so, truncate it. For computing this length it doesn't need to mess with encodings at all. But, if the string must be truncated, glibc strives to be careful: if it just kept the first 15 bytes, it could potentially break a multibyte character in half, and hence produce invalid text output (I'd be ok with that - but glibc sticks to its curious ISO C99 interpretation).
So, it unfortunately needs to decode the char array, using the environment locale, to find out where the real character boundaries are. Hence, for example, if LC_CTYPE says UTF-8 and the array is not a valid UTF-8 byte sequence, it aborts (not so bad, because printf then returns -1; not so good, because it prints part of the string anyway, so it's difficult to recover cleanly).
Apparently only in this case, when a precision is specified for a string and there is possibility of truncation, glibc needs to mix some Unicode semantics with the plain-chars/bytes semantics. Quite ugly, IMO, but so it is.
Update: Notice that this behaviour is relevant not only for the case of invalid original encodings, but also for invalid codes after the truncation. For example:
char s[] = "ni\xc3\xb1o"; /* "niño" in UTF8: 5 bytes, 4 unicode chars */
printf("|%.3s|",s); /* would cut the double-byte UTF8 char in two */
This truncates the field to 2 bytes, not 3, because it refuses to output an invalid UTF-8 string:
$ ./a.out
|ni|
$ ./a.out | od -t cx1
0000000 | n i | \n
7c 6e 69 7c 0a
UPDATE (May 2015): This (IMO) questionable behaviour has been changed (fixed) in newer versions of glibc. See the main question.
To be portable, convert the string using mbstowcs and print it using printf( "%6ls", wchar_ptr ).
%ls is the specifier for a wide string according to POSIX.
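A minimal sketch of that suggestion (my own illustration, assuming a UTF-8 locale is picked up from the environment; how the width then behaves is exactly the kind of detail discussed above):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                /* pick up en_US.UTF-8 or similar */

    const char *utf8 = "ni\xc3\xb1o";     /* "niño": 5 bytes, 4 characters */
    wchar_t wide[16];

    size_t n = mbstowcs(wide, utf8, 16);  /* decode multibyte -> wide chars */
    if (n == (size_t)-1) {
        perror("mbstowcs");               /* invalid sequence for this locale */
        return 1;
    }
    printf("|%6ls|\n", wide);             /* %ls as suggested above */
    return 0;
}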
There is no "de-facto" standard. Typically, I would expect stdout to accept UTF-8 if the OS and locale have been configured to treat it as a UTF-8 file, but I would expect printf to be ignorant of multibyte encoding because it isn't defined in those terms.
Don't use mbstowcs unless you also make sure that wchar_t is at least 32 bits long;
otherwise you'll likely end up with UTF-16, which has all the disadvantages of UTF-8 and
all the disadvantages of UTF-32.
I'm not saying avoid mbstowcs, I'm just saying don't let Windows programmers use it.
It might be simpler to use iconv to convert to UTF-32.

Emacs, xterm, mousepad, C, Unicode and UTF-8: Trying to make sense of it all

Disclaimer: My apologies for all the text below (for a single simple question), but I sincerely think that every bit of information is relevant to the question. I'd be happy to learn otherwise. I can only hope that, if successful, the question(s) and the answers may help others in Unicode madness. Here goes.
I have read all the usually highly-regarded websites about utf8, particularly this one, which is very good for my purposes, but I've read the classics too, like those mentioned in other similar questions on SO. However, I still lack the knowledge about how to integrate it all in my virtual lab. I use Emacs with
;; Internationalization
(prefer-coding-system 'utf-8)
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
in my .emacs, xterm started with
LC_CTYPE=en_US.UTF-8 xterm -geometry 91x58\
-fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
and my locale reads:
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
My questions are the following (some of the answers may be the expected behavior of the application, but I still need to make sense of it, so bear with me):
Supposing the following C program:
#include <stdio.h>
int main(void) {
    int c;
    while ((c = getc(stdin)) != EOF) {
        if (c != '\n') {
            printf("Character: %c, Integer: %d\n", c, c);
        }
    }
    return 0;
}
If I run this in my xterm I get:
€
Character: � Integer: 226
Character: �, Integer: 130
Character: �, Integer: 172
(in case it does not render here: the chars I get are a white question mark within a black circle). The ints are the decimal representation of the 3 bytes needed to encode €, but I am not exactly sure why xterm does not display them properly.
Instead, Mousepad, for example, prints
Character: â, Integer: 226
Character: ,, Integer: 130 (a comma, standing for U+0082 <control>, why?!)
Character: ¬, Integer: 172
Meanwhile, Emacs displays
Character: \342, Integer: 226
Character: \202, Integer: 130
Character: \254, Integer: 172
QUESTION: The most general question I can ask is: How do I get everything to print the same character? But I am certain there will be follow-ups.
Thanks again, and apologies for all the text.
Ok, so your problem here is due to mixing old-school C library calls (getc, printf %c) and UTF-8. Your code is correctly reading the three bytes which make up '€' - 226, 130 and 172 as decimal - but these values individually are not valid UTF-8 encoded glyphs.
If you look at the UTF-8 encoding, integer values 0..127 are the encodings for the original US-ASCII character set. However, 128..255 (i.e. all your bytes) are part of a multibyte UTF-8 character and so don't correspond to a valid UTF-8 character individually.
In other words, the single byte 226 doesn't mean anything on its own (it's the prefix of a 3-byte character, as expected). The printf call prints it as a single byte, which is invalid in the UTF-8 encoding, so each program copes with the invalid value in a different way.
Assuming you just want to 'see' what bytes a UTF-8 character is made of, I suggest you stick to the integer output you already have (or maybe use hex if that is more sensible); as your >127 bytes aren't valid Unicode on their own, you're unlikely to get consistent results across different programs.
The UTF-8 encoding says that the three bytes together in a string form the euro sign, '€'. But single bytes, like the ones produced by your C program, don't make sense in a UTF-8 stream. That is why they are replaced with the U+FFFD "REPLACEMENT CHARACTER", '�'.
Emacs is smart: it knows that the single bytes are invalid data for the output stream and replaces each one with a visible escape representation of the byte. Mousepad's output looks broken at first sight; it is falling back to the CP1252 Windows codepage, where the individual bytes represent characters. The "comma" is not a comma, it is a low curved quote.
The first thing you posted:
Character: � Integer: 226
Character: �, Integer: 130
Character: �, Integer: 172
Is the "correct" answer. When you print character 226 and the terminal expects UTF-8, there is nothing the terminal can do; you gave it invalid data. The sequence "226" followed by a space is an error. The ? character is a nice way of showing you that there is malformed data somewhere.
If you want to replicate your second example, you need to properly encode the character.
Imagine two functions: decode, which takes a character encoding and an octet stream and produces a list of characters; and encode, which takes an encoding and a list of characters and produces an octet stream. encode/decode should be reversible when your data is valid: encode( 'utf8', decode( 'utf8', "..." ) ) == "...".
Anyway, in the second example, the application ("mousepad?") is treating each octet in the three octet representation of the euro character as an individual latin1 character. It gets the octet, decodes it from latin-1 to some internal representation of a "character" (not octet or byte), and then encodes that character as utf8 and writes that to the terminal. That's why it works.
If you have GNU Recode, try this:
$ recode latin1..utf8
<three-octet representation of the euro character> <control-D>
â¬
What this did was treat each octet of the utf-8 representation as a latin1 character, and then converted each of those characters into something your terminal can understand. Perhaps running this through hd makes it clearer:
$ cat | hd
€
00000000 e2 82 ac 0a |....|
00000004
As you can see, it's 3 octets for the utf-8 representation of the character, and then a newline.
Running through recode:
$ recode latin1..utf8 | hd
€
00000000 c3 a2 c2 82 c2 ac 0a |.......|
00000007
This is the utf-8 representation of the "latin1" input string; something your terminal can display. The idea is: if you output e2 82 ac to your terminal, you'll see the euro sign. If you output just one of those bytes on its own, you get nothing; that's not valid. Finally, if you output c3 a2 c2 82 c2 ac, you get the "garbage" that is the "utf-8 representation" of the character.
If this seems confusing, it is. You should never worry about the internal representation like this; if you are working with characters and you need to print them to a utf-8 terminal, you always have to encode to utf-8. If you are reading from a utf-8 encoded file, you need to decode the octets into characters before processing them in your application.
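For what it's worth, here is a sketch in C of the "assemble the bytes into complete characters before printing" idea, using mbrtowc (my own illustration under the UTF-8 locale from the environment; it is not code from the answers above):
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* honour LC_CTYPE=en_US.UTF-8 */

    mbstate_t state;
    memset(&state, 0, sizeof state);

    int c;
    while ((c = getchar()) != EOF) {
        if (c == '\n')
            continue;
        char byte = (char)c;
        wchar_t wc;
        size_t r = mbrtowc(&wc, &byte, 1, &state);
        if (r == (size_t)-2)
            continue;                 /* mid-sequence: wait for more bytes */
        if (r == (size_t)-1) {
            fprintf(stderr, "Invalid byte: %d\n", c);
            memset(&state, 0, sizeof state);
            continue;
        }
        /* Only print once a complete character has been assembled. */
        printf("Character: %lc, Code point: U+%04lX\n",
               (wint_t)wc, (unsigned long)wc);
    }
    return 0;
}
Typing € and pressing Enter should then produce a single line, Character: €, Code point: U+20AC, in xterm, Mousepad and Emacs alike, because only complete UTF-8 sequences ever reach the output.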
