Columns generated by wprintf are not equal - c

I'm using wprintf to print out c-strings of different size.
wprintf(L"%-*.*ls ", PRINTED_WORD_LENGTH, PRINTED_WORD_LENGTH, word->string);
int i;
for (i = 0; i < word->usage_length; i++) {
    printf("%d ", word->usage[i]);
}
printf("\n");
As the sample output below shows, some of these strings contain diacritic characters. Rows containing these characters aren't formatted correctly (wprintf doesn't use enough spaces when it encounters them). Is there any way to format the rows correctly without writing a new function?
z 39 46 62 113
za 101 105
zabawa 132
zasną 123
zatrzymać 88

They do align correctly at the byte level. It is only because you are looking at the output as UTF-8 multi-byte characters that they appear misaligned (for whatever definition of text alignment you want to use).
If you are targeting a POSIX-conforming implementation, you can perhaps use the wcswidth(3) function: it was purposely specified to solve this kind of problem (originally with CJKV characters). It is a bit lower-level, though.
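A minimal sketch of that approach, assuming a POSIX system: pad by display columns computed with wcswidth rather than by character count. PRINTED_WORD_LENGTH and the sample words are taken from the question; the sketch uses wprintf throughout, since a stream's byte/wide orientation should not be mixed.
#define _XOPEN_SOURCE 700
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define PRINTED_WORD_LENGTH 12            /* name taken from the question */

/* Pad by display columns (wcswidth), not by character or byte count. */
static void print_padded(const wchar_t *s)
{
    int width = wcswidth(s, wcslen(s));   /* columns the string occupies */
    if (width < 0)
        width = (int)wcslen(s);           /* fall back for non-printable input */
    wprintf(L"%ls", s);
    for (int i = width; i < PRINTED_WORD_LENGTH; i++)
        wprintf(L" ");
}

int main(void)
{
    setlocale(LC_ALL, "");                /* pick up the user's UTF-8 locale */
    const wchar_t *words[] = { L"z", L"zabawa", L"zatrzymać" };
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++) {
        print_padded(words[i]);
        wprintf(L"|\n");                  /* end marker to show the columns line up */
    }
    return 0;
}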

Related

Logical XOR in character arrays

I've been trying to make a program on the Vernam Cipher which requires me to XOR two strings. I tried to do this program in C and have been getting an error. The lengths of the two strings are the same.
#include <stdio.h>
#include <string.h>
int main()
{
    printf("Enter your string to be encrypted ");
    char a[50];
    char b[50];
    scanf("%s", a);
    printf("Enter the key ");
    scanf("%s", b);
    char c[50];
    int q = strlen(a);
    int i = 0;
    for (i = 0; i < q; i++)
    {
        c[i] = (char)(a[i] ^ b[i]);
    }
    printf("%s", c);
}
Whenever I run the code, I get output as ????? in boxes. What is the method to XOR these two strings ?
I've been trying to make a program on Vernam Cipher which requires me to XOR two strings
Yes, it does, but that's not the only thing it requires. The Vernam cipher involves first representing the message and key in the ITA2 encoding (also known as Baudot-Murray code), and then computing the XOR of each pair of corresponding character codes from the message and key streams.
Moreover, to display the result in the manner you indicate wanting to do, you must first convert it from ITA2 to the appropriate character encoding for your locale, which is probably a superset of ASCII.
The transcoding to and from ITA2 is relatively straightforward, but not so trivial that I'm inclined to write them for you. There is a code chart at the ITA2 link above.
Note also that ITA2 is a stateful encoding that includes shift codes and a null character. This implies that the enciphered message may contain non-printing characters, which could cause some confusion, including a null character, which will be misinterpreted as a string terminator if you are not careful. More importantly, encoding in ITA2 may increase the length of the message as a result of a need to insert shift codes.
Additionally, as a technical matter, if you want to treat the enciphered bytes as a C string, then you need to ensure that it is terminated with a null character. On a related note, scanf() will do that for the strings it reads, which uses one array element, leaving you only 49 each for the actual message and key characters.
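Setting ITA2 aside, a minimal sketch of those two points follows: the XOR loop from the question with the terminator added, width-limited scanf, and the result dumped in hex so non-printable bytes stay visible. Buffer sizes mirror the question's code.
#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[50], b[50], c[50];

    printf("Enter your string to be encrypted ");
    scanf("%49s", a);                 /* width limits prevent buffer overflow */
    printf("Enter the key ");
    scanf("%49s", b);

    size_t q = strlen(a);
    for (size_t i = 0; i < q; i++)
        c[i] = (char)(a[i] ^ b[i]);
    c[q] = '\0';                      /* terminate, in case c is used as a string */

    for (size_t i = 0; i < q; i++)
        printf("%02X ", (unsigned char)c[i]);   /* hex dump instead of %s */
    printf("\n");
    return 0;
}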
What is the method to XOR these two strings ?
The XOR itself is not your problem. Your code for that is fine. The problem is that you are XORing the wrong values, and (once the preceding is corrected) outputting the result in a manner that does not serve your purpose.
Whenever I run the code, I get output as ????? in boxes...
XORing two printable characters does not always result in a printable value.
Consider the following:
the ^ operator operates at the bit level.
there is a limited range of values that are printable (from here):
Control Characters (0–31 & 127): Control characters are not printable characters. They are used to send commands to the PC or the printer and are based on telex technology. With these characters, you can set line breaks or tabs. Today, they are mostly out of use.
Special Characters (32–47 / 58–64 / 91–96 / 123–126): Special characters include all printable characters that are neither letters nor numbers. These include punctuation and technical or mathematical characters. ASCII also includes the space (a non-visible but printable character), which therefore does not belong to the control characters category, as one might suspect.
Numbers (48–57): These include the ten Arabic numerals from 0–9.
Letters (65–90 / 97–122): Letters are divided into two blocks, with the first group containing the uppercase letters and the second group containing the lowercase.
Using the following two strings and the following code:
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
Following demonstrates XORing the elements of the strings:
#include <stdio.h>

int main(void)
{
    char str1[] = {"asdf"};
    char str2[] = {"jkl;"};
    for (int i = 0; i < sizeof(str1) / sizeof(str1[0]); i++)
    {
        printf("%d ^ %d: %d\n", str1[i], str2[i], str1[i] ^ str2[i]);
    }
    getchar();
    return 0;
}
While all of the input characters are printable (except the NULL character), not all of the XOR results of corresponding characters are:
97 ^ 106: 11 //not printable
115 ^ 107: 24 //not printable
100 ^ 108: 8 //not printable
102 ^ 59: 93
0 ^ 0: 0
This is why you are seeing the odd output. While all of the values may be completely valid for your purposes, they are not all printable.

At what point do encodings enter into play in C, if at all? How do strings get properly printed, then?

To investigate how C deals with UTF-8 / Unicode characters, I did this little experiment.
It's not that I'm trying to solve anything particular at the moment, but I know that Java deals with the whole encoding situation in a transparent way to the coder and I was wondering how C, that is a lot lower level, treats its characters.
The following test seems to indicate that C is entirely ignorant of encoding concerns, and that it's just up to the display device to know how to interpret the sequence of chars when showing them on screen. The later tests (printing the characters surrounded by _) seem particularly telling.
#include <stdio.h>
#include <string.h>
int main() {
    char str[] = "João"; // ã does not belong to the standard
                         // (or extended) ASCII characters
    printf("number of chars = %d\n", (int)strlen(str)); // 5

    int len = 0;
    while (str[len] != '\0')
        len++;
    printf("number of bytes = %d\n", len); // 5

    for (int i = 0; i < len; i++)
        printf("%c", str[i]);
    puts("");
    // "João"

    for (int i = 0; i < len; i++)
        printf("_%c_", str[i]);
    puts("");
    // _J__o__�__�__o_ -> wow!!!

    str[2] = 'X'; // let's change this special character
                  // and see what happens
    for (int i = 0; i < len; i++)
        printf("%c", str[i]);
    puts("");
    // JoX�o

    for (int i = 0; i < len; i++)
        printf("_%c_", str[i]);
    puts("");
    // _J__o__X__�__o_
}
I have knowledge of how ASCII / UTF-8 work; what I'm really unsure about is at what moment the characters get interpreted as "compound" characters, as it seems that C just treats them as dumb bytes. What's really the science behind this?
The printing isn't a function of C, but of the display context, whatever that is. For a terminal there are UTF-8 decoding functions which map the raw character data into the character to be shown on screen using a particular font. A similar sort of display logic happens in graphical applications, though with even more complexity relating to proportional font widths, ligatures, hyphenation, and numerous other typographical concerns.
Internally this is often done by decoding UTF-8 into some intermediate form first, like UTF-16 or UTF-32, for look-up purposes. In extremely simple terms, each character in a font has a Unicode identifier. In practice this is a lot more complicated as there is room for character variants, and multiple characters may be represented by a singular character in a font, like "fi" and "ff" ligatures. Accented characters like "ç" may be a combination of characters, as allowed by Unicode. That's where things like Zalgo text come about: you can often stack a truly ridiculous number of Unicode "combining characters" together into a single output character.
Typography is a complex world with complex libraries required to render properly.
You can handle UTF-8 data in C, but only with special libraries. Nothing that C ships with in the Standard Library can understand it; to C it's just a series of bytes, and it assumes a byte is equivalent to a character for the purposes of length. That is, strlen and the like work with bytes as the unit, not characters.
C++, as an example, has much better support for this distinction between byte and character. Other languages have even better support, with languages like Swift having exceptional support for UTF-8 specifically and Unicode in general.
printf("_%c_", str[i]); prints the character associated with each str[i] - one at a time.
The value of char str[i] is converted to an int when passed ot a ... function. The int value is then converted to unsigned char as directed by "%c" and "and the resulting character is written".
char str[] = "João"; does not certainly specify a UTF8 sequence. That in an implementation detail. A specified way is to use char str[] = u8"João"; since C11 (or maybe C99).
printf() does not specify a direct way to print UTF8 stirrings.
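As a small illustration of the byte/character distinction (a sketch assuming a UTF-8 locale, not part of the original answers): strlen() counts bytes, while mbstowcs() decodes the multibyte string and reports the character count.
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "");                  /* enable the user's (UTF-8) locale */
    const char *s = "João";                 /* 5 bytes in UTF-8, 4 characters */
    wchar_t wbuf[32];
    size_t nchars = mbstowcs(wbuf, s, 32);  /* decodes; returns character count */
    printf("bytes      = %zu\n", strlen(s));
    printf("characters = %zu\n", nchars);
    return 0;
}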

How can C read Chinese from console and file

I'm using ubuntu 12.04
I want to know how I can read Chinese using C:
setlocale(LC_ALL, "zh_CN.UTF-8");
scanf("%s", st1);
for (b = 0; b < max_w; b++)
{
    printf("%d ", st1[b]);
    if (st1[b] == 0)
        break;
}
For this code, when I input English it outputs fine, but if I enter Chinese like "的", it outputs
Enter word or sentence (EXIT to break): 的
target char seq :
-25 -102 -124 0
I'm wondering why there are negative values in the array.
Further, I found that the bytes of a "的" read from a file using fscanf differ from those read from the console.
UTF-8 encodes characters with a variable number of bytes. This is why you see three bytes for the 的 sign.
At graphemica - 的, you can see that 的 has the value U+7684 which translates to E7 9A 84 when you encode it in UTF-8.
You print every byte separately as an integer value. A char type might be signed and when it is converted to an integer, you can get negative numbers too. In your case this is
-25 = E7
-102 = 9A
-124 = 84
You can print the bytes as hex values with %x or as an unsigned integer %u, then you will see positive numbers only.
You can also change your print statement to
printf("%d ", (unsigned char) st1[b]);
which will interpret the bytes as unsigned values and show your output as
231 154 132 0
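A self-contained version of the question's loop with that cast applied might look like the following sketch. The buffer size is an assumption, and the hard-coded locale is kept only to mirror the question; the next answer explains why "" is preferable.
#include <locale.h>
#include <stdio.h>

int main(void)
{
    setlocale(LC_ALL, "zh_CN.UTF-8");          /* mirrors the question */
    char st1[100];
    if (scanf("%99s", st1) != 1)
        return 1;
    for (int b = 0; ; b++) {
        printf("%d ", (unsigned char)st1[b]);  /* 的 prints as 231 154 132 */
        if (st1[b] == 0)
            break;                             /* terminator printed as 0, then stop */
    }
    printf("\n");
    return 0;
}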
There's no need (and in fact it's harmful) to hard-code a specific locale name. What characters you can read are independent of the locale's language (used for messages), and any locale with UTF-8 encoding should work fine.
The easiest (but ugly once you try to go too far with it) way to make this work is to use the wide character stdio functions (e.g. getwc) instead of the byte-oriented ones. Otherwise you can read bytes then process them with mbrtowc.
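A minimal sketch of the wide-character route, assuming a UTF-8 locale: read characters with getwc() and print each one's code point.
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* any UTF-8 locale will do, not just zh_CN */
    wint_t wc;
    while ((wc = getwc(stdin)) != WEOF && wc != L'\n')
        wprintf(L"U+%04X ", (unsigned int)wc);   /* 的 prints as U+7684 */
    wprintf(L"\n");
    return 0;
}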

Looking for patterns in binary files

I'm working on a small project in C where I have to parse a binary file of undocumented file format. As I'm quite new to C I have two questions to some more experienced programmers.
The first seems to be an easy one. How do I extract all the strings from the binary file and put them into an array? Basically I am looking for a simple implementation of the strings program in C.
When I open the binary file in any text editor I get a lot of rubbish with some readable strings mixed in. I can extract these strings using strings on the command line. Now I'd like to do something similar in C, like in the pseudocode below:
while (!EOF) {
    if (string found) {
        put it into array[i]
        i++
    }
}
return i;
The second problem is a little bit more complicated and is, I believe, the proper way of achieving the same thing. When I look at the file in HEX editor it's easy to notice some patterns. For example before each string there is a byte of value 02 (0x02) followed by the length of the string and the string itself. For example 02 18 52 4F 4F 54 4B 69 57 69 4B 61 4B 69 is a string with the string part in bold.
Now the function I'm trying to create would work like this:
while (!EOF) {
    for (i = 0; i < buffer_size; ++i) {
        if (buffer[i] hex value == 02) {
            int n = read the next byte;
            string = read the next n bytes as char;
            put string into array;
        }
    }
}
Thanks for any pointers. :)
The first seems to be an easy one. How do I extract all the strings from the binary file and put them into an array?
Figure out what character range represents printable ASCII characters. Iterate across the file, checking whether each byte is a printable ASCII character and counting the length of runs of adjacent printable characters. By default, strings treats sequences of four or more such characters as strings; when you find the next non-printable character, check whether that threshold has been reached and, if it has, output the string. Some book-keeping is necessary.
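A rough sketch of that approach, using the usual strings default of four printable bytes (the input file name is hypothetical):
#include <ctype.h>
#include <stdio.h>

#define MIN_RUN 4                        /* same default as strings(1) */

int main(void)
{
    FILE *f = fopen("data.bin", "rb");   /* hypothetical input file */
    if (!f)
        return 1;

    char run[256];
    size_t len = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (isprint((unsigned char)c)) {
            if (len == sizeof run - 1) { /* buffer full: emit and start over */
                run[len] = '\0';
                puts(run);
                len = 0;
            }
            run[len++] = (char)c;        /* extend the current printable run */
        } else {
            if (len >= MIN_RUN) {        /* only keep runs long enough */
                run[len] = '\0';
                puts(run);
            }
            len = 0;
        }
    }
    if (len >= MIN_RUN) {                /* flush a run that ends at EOF */
        run[len] = '\0';
        puts(run);
    }
    fclose(f);
    return 0;
}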
The second problem is a little bit more complicated and is, I believe, the proper way of achieving the same thing.
Your pseudocode is essentially correct. You can manually compare the contents of buffer[i] with an integer (e.g. 2). Reading a byte is as simple as incrementing i. Make sure you don't overrun the buffer, and make sure the array you're reading the string into is big enough (if the size parameter is only one byte, you can get away with a 255-byte buffer).
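Under the assumption that the format really is a 0x02 marker followed by a one-byte length and the string bytes, a sketch that reads straight from the file could look like this (the file name is hypothetical):
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("data.bin", "rb");   /* hypothetical input file */
    if (!f)
        return 1;

    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == 0x02) {                 /* marker byte before each string */
            int n = fgetc(f);            /* next byte is the string length */
            if (n == EOF)
                break;
            char s[256];                 /* one length byte => at most 255 chars */
            if (fread(s, 1, (size_t)n, f) != (size_t)n)
                break;                   /* truncated record: stop */
            s[n] = '\0';
            puts(s);                     /* or copy it into an array of strings */
        }
    }
    fclose(f);
    return 0;
}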
I'm not sure your solution will work: what if you find a string with 350 char length?
Can numbers be part of a string, or do you consider them "rubbish"?
I think the safest way is to:
Define what you consider a string and what you consider "rubbish" - for instance, are ":,!?" part of a "string" or "rubbish"?
Define a minimum string length to be considered a "readable" string.
Parse the file looking for every group of characters with length >= the minimum.
I know, it's boring, but I think it's the only safe way. Good luck!

printf field width : bytes or chars?

The printf/fprintf/sprintf family supports a width field in its format specifier. I have a doubt for the case of (non-wide) char array arguments:
Is the width field supposed to mean bytes or characters?
What is the (correct / de facto) behaviour if the char array corresponds to (say) a raw UTF-8 string?
(I know that normally I should use some wide char type; that's not the point.)
For example, in
char s[] = "ni\xc3\xb1o"; // utf8 encoded "niño"
fprintf(f,"%5s",s);
Is that function supposed to try to output just 5 bytes (plain C chars), with you taking responsibility for misalignments or other problems if two of those bytes make up one textual character? Or is it supposed to try to compute the length of the array in "textual characters" (decoding it according to the current locale)? (In the example, this would amount to finding out that the string has 4 Unicode chars, so it would add a space for padding.)
UPDATE: I agree with the answers; it is logical that the printf family doesn't distinguish plain C chars from bytes. The problem is that my glibc does not seem to fully respect this notion if the locale has been set previously and one has the (today most used) LANG/LC_CTYPE=en_US.UTF-8.
Case in point:
#include <stdio.h>
#include <locale.h>
int main() {
    char *locale = setlocale(LC_ALL, ""); /* I have LC_CTYPE="en_US.UTF-8" */
    char s[] = {'n', 'i', 0xc3, 0xb1, 'o', 0}; /* "niño" in utf8: 5 bytes, 4 unicode chars */
    printf("|%*s|\n", 6, s);    /* this should pad a blank - works ok */
    printf("|%.*s|\n", 4, s);   /* this should eat a char - works ok */
    char s3[] = {'A', 0xb1, 'B', 0}; /* this is not valid UTF8 */
    printf("|%s|\n", s3);       /* print raw chars - ok */
    printf("|%.*s|\n", 15, s3); /* panics (why???) */
}
So, even when a non-POSIX-C locale has been set, printf still seems to have the right notion for counting width: bytes (plain C chars), not Unicode chars. That's fine. However, when given a char array that is not decodable in its locale, it silently panics (it aborts - nothing is printed after the first '|' - without error messages), but only if it needs to count some width. I don't understand why it even tries to decode the string from UTF-8 when it doesn't need to. Is this a bug in glibc?
Tested with glibc 2.11.1 (Fedora 12) (also glibc 2.3.6)
Note: it's not related to terminal display issues - you can check the output by piping to od: $ ./a.out | od -t cx1. Here's my output:
0000000 | n i 303 261 o | \n | n i 303 261 | \n
7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a
0000020 | A 261 B | \n |
7c 41 b1 42 7c 0a 7c
UPDATE 2 (May 2015): This questionable behaviour has been fixed in newer versions of glibc (from 2.17, it seems). With glibc-2.17-21.fc19 it works ok for me.
It will result in five bytes being output. And five chars. In ISO C, there is no distinction between chars and bytes. Bytes are not necessarily 8 bits, instead being defined as the width of a char.
The ISO term for an 8-bit value is an octet.
Your "niño" string is actually five characters wide in terms of the C environment (sans the null terminator, of course). If only four symbols show up on your terminal, that's almost certainly a function of the terminal, not C's output functions.
I'm not saying a C implementation couldn't handle Unicode. It could quite easily do UTF-32 if CHAR_BIT was defined as 32. UTF-8 would be harder since it's a variable-length encoding, but there are ways around almost any problem :-)
Based on your update, it seems like you might have a problem. However, I'm not seeing your described behaviour in my setup with the same locale settings. In my case, I'm getting the same output in those last two printf statements.
If your setup is just stopping output after the first | (I assume that's what you mean by abort but, if you meant the whole program aborts, that's much more serious), I would raise the issue with GNU (try your particular distribution's bug procedures first). You've done all the important work, such as producing a minimal test case, so someone should even be happy to run that against the latest version if your distribution doesn't quite get there (most don't).
As an aside, I'm not sure what you meant by checking the od output. On my system, I get:
pax> ./qq | od -t cx1
0000000 | n i 303 261 o | \n | n i 303 261 | \n
7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a
0000020 | A 261 B | \n | A 261 B | \n
7c 41 b1 42 7c 0a 7c 41 b1 42 7c 0a
0000034
so you can see the output stream contains the UTF-8, meaning that it's the terminal program which must interpret this. C/glibc isn't modifying the output at all, so maybe I just misunderstood what you were trying to say.
Although I've just realised you may be saying that your od output has only the starting bar on that line as well (unlike mine which appears to not have the problem), meaning that it is something wrong within C/glibc, not something wrong with the terminal silently dropping the characters (in all honesty, I would expect the terminal to drop either the whole line or just the offending character (i.e., output |A) - the fact that you're just getting | seems to preclude a terminal problem). Please clarify that.
Bytes (chars). There is no built-in support for Unicode semantics. You can imagine it as resulting in at least 5 calls to fputc.
What you've found is a bug in glibc. Unfortunately it's an intentional one which the developers refuse to fix. See here for a description:
http://www.kernel.org/pub/linux/libs/uclibc/Glibc_vs_uClibc_Differences.txt
The original question (bytes or chars?) was rightly answered by several people: both according to the spec and the glibc implementation, the width (or precision) in the printf C function counts bytes (or plain C chars, which are the same thing). So, fprintf(f,"%5s",s) in my first example, means definitely "try to output at least 5 bytes (plain chars) from the array s -if not enough, pad with blanks".
It does not matter whether the string (in my example, of byte-length 5) represents text encoded in -say- UTF8 and if fact contains 4 "textual (unicode) characters". To printf(), internally, it just has 5 (plain) C chars, and that's what counts.
Ok, this seems crystal clear. But it doesn't explain my other problem. Then we must be missing something.
Searching in glibc bug-tracker, I found some related (rather old) issues - I was not the first one caught by this... feature:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308
http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
This quote, from the last link, is specially relevant here:
ISO C99 requires for %.*s to only write complete characters that fit below the precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1 characters as shown in the input file you provided, some of the strings are not valid UTF-8 strings, therefore sprintf fails with -1 because of the encoding error. That's not a bug in glibc.
Whether it is a bug (perhaps in interpretation or in the ISO spec itself) is debatable.
But what glibc is doing is clear now.
Recall my problematic statement: printf("|%.*s|\n",15,s3). Here, glibc must find out whether the length of s3 is greater than 15 and, if so, truncate it. For computing this length it doesn't need to mess with encodings at all. But, if the string must be truncated, glibc strives to be careful: if it just kept the first 15 bytes, it could potentially break a multibyte character in half, and hence produce invalid text output (I'd be OK with that, but glibc sticks to its curious ISO C99 interpretation).
So it unfortunately needs to decode the char array, using the environment locale, to find out where the real character boundaries are. Hence, for example, if LC_CTYPE says UTF-8 and the array is not a valid UTF-8 byte sequence, it aborts (not so bad, because printf then returns -1; not so good, because it prints part of the string anyway, so it's difficult to recover cleanly).
Apparently only in this case, when a precision is specified for a string and there is possibility of truncation, glibc needs to mix some Unicode semantics with the plain-chars/bytes semantics. Quite ugly, IMO, but so it is.
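On the glibc versions discussed here (before the fix mentioned above), the failure can at least be detected from printf's return value. A small sketch, reusing the invalid byte sequence from the question; on POSIX systems errno is typically set to EILSEQ in this case:
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "");                      /* e.g. en_US.UTF-8 */
    char s3[] = { 'A', (char)0xb1, 'B', 0 };    /* not valid UTF-8 */
    errno = 0;
    int rc = printf("|%.*s|\n", 15, s3);
    if (rc < 0)                                 /* affected glibc versions fail here */
        fprintf(stderr, "printf failed: %s\n", strerror(errno));
    return 0;
}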
Update: Notice that this behaviour is relevant not only for the case of invalid original encodings, but also for invalid codes after the truncation. For example:
char s[] = "ni\xc3\xb1o"; /* "niño" in UTF8: 5 bytes, 4 unicode chars */
printf("|%.3s|",s); /* would cut the double-byte UTF8 char in two */
This truncates the field to 2 bytes, not 3, because it refuses to output an invalid UTF-8 string:
$ ./a.out
|ni|
$ ./a.out | od -t cx1
0000000 | n i | \n
7c 6e 69 7c 0a
UPDATE (May 2015): This (IMO) questionable behaviour has been changed (fixed) in newer versions of glibc. See the main question.
To be portable, convert the string using mbstowcs and print it using printf( "%6ls", wchar_ptr ).
%ls is the specifier for a wide string according to POSIX.
There is no "de-facto" standard. Typically, I would expect stdout to accept UTF-8 if the OS and locale have been configured to treat it as a UTF-8 file, but I would expect printf to be ignorant of multibyte encoding because it isn't defined in those terms.
Don't use mbstowcs unless you also make sure that wchar_t is at least 32 bits long; otherwise you'll likely end up with UTF-16, which has all the disadvantages of UTF-8 and all the disadvantages of UTF-32. I'm not saying to avoid mbstowcs, I'm just saying don't let Windows programmers use it.
It might be simpler to use iconv to convert to UTF-32.
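A sketch of that iconv route, converting a UTF-8 buffer to fixed-width UTF-32 code points (the little-endian target encoding name is an assumption; adjust for the host):
#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "ni\xc3\xb1o";                  /* "niño" in UTF-8 */
    uint32_t out[64];
    char *inp = in;
    char *outp = (char *)out;
    size_t inleft = strlen(in);
    size_t outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");  /* assumes a little-endian host */
    if (cd == (iconv_t)-1)
        return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");                        /* e.g. EILSEQ on invalid input */
        iconv_close(cd);
        return 1;
    }
    iconv_close(cd);

    size_t n = (sizeof out - outleft) / sizeof out[0];
    for (size_t i = 0; i < n; i++)
        printf("U+%04X ", (unsigned int)out[i]);   /* one code point per element */
    printf("\n");
    return 0;
}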
