Printing wide character literals in C

I am trying to print Unicode to the terminal under Linux using the wchar_t type defined in the wchar.h header. I have tried the following:
#include <wchar.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
/*
char* direct = "\xc2\xb5";
fprintf(stderr, "%s\n", direct);
*/
wchar_t* dir_lit = L"μ";
wchar_t* uni_lit = L"\u03BC";
wchar_t* hex_lit = L"\xc2\xb5";
fwprintf(stderr,
L"direct: %ls, unicode: %ls, hex: %ls\n",
dir_lit,
uni_lit,
hex_lit);
return 0;
}
and compiled it using gcc -O0 -g -std=c11 -o main main.c.
This produces the output direct: m, unicode: m, hex: ?u (based on a terminal with LANG=en_US.UTF-8). In hex:
00000000 64 69 72 65 63 74 3a 20 6d 2c 20 75 6e 69 63 6f |direct: m, unico|
00000010 64 65 3a 20 6d 2c 20 68 65 78 3a 20 3f 75 0a |de: m, hex: ?u.|
0000001f
The only way that I have managed to obtain the desired output of μ is via the code commented out above (as a char* built from hex escapes).
I have also tried to print via the wcstombs function:
void print_wcstombs(wchar_t* str)
{
char buffer[100];
wcstombs(buffer, str, sizeof(buffer));
fprintf(stderr, "%s\n", buffer);
}
If I call, for example, print_wcstombs(dir_lit), nothing is printed at all, so this approach does not seem to work.
I would be content with the hex-escape solution in principle; however, the length of the string is not calculated correctly (it should count as one character, but it is two bytes long), so formatting via printf does not work correctly.
Is there any way to handle / print unicode literals the way I intend using the wchar_t type?

With your program as-is, I compiled and ran it to get
direct: ?, unicode: ?, hex: ?u
I then included <locale.h> and added a setlocale(LC_CTYPE, ""); at the very beginning of the main() function, which, when run using a Unicode locale (LANG=en_US.UTF-8), produces
direct: μ, unicode: μ, hex: Âµ
(Codepoint 0xC2 is Â (U+00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX) and 0xB5 is µ (U+00B5 MICRO SIGN, as opposed to U+03BC GREEK SMALL LETTER MU); hence the two characters seen for the 'hex' output. Results might vary in an environment that does not use Unicode for wide characters.)
Basically, to output wide characters you need to set the ctype locale so the stdio system knows how to convert them to the multibyte ones expected by the underlying system.
The updated program:
#include <wchar.h>
#include <stdio.h>
#include <locale.h>
int main(int argc, char *argv[])
{
setlocale(LC_CTYPE, "");
wchar_t* dir_lit = L"μ";
wchar_t* uni_lit = L"\u03BC";
wchar_t* hex_lit = L"\xc2\xb5";
fwprintf(stderr,
L"direct: %ls, unicode: %ls, hex: %ls\n",
dir_lit,
uni_lit,
hex_lit);
return 0;
}
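For completeness, the print_wcstombs approach from the question also starts working once the locale is set, because wcstombs converts according to the current LC_CTYPE locale. A minimal sketch (the 100-byte buffer is an assumption carried over from the question, and the return value should be checked in real code):
#include <wchar.h>
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

/* Convert a wide string to the locale's multibyte encoding and print it. */
void print_wcstombs(const wchar_t *str)
{
    char buffer[100];
    /* wcstombs returns (size_t)-1 if a character cannot be converted */
    if (wcstombs(buffer, str, sizeof(buffer)) != (size_t)-1)
        fprintf(stderr, "%s\n", buffer);
}

int main(void)
{
    setlocale(LC_CTYPE, "");          /* without this, the conversion fails */
    print_wcstombs(L"\u03BC");        /* prints μ on a UTF-8 terminal */
    return 0;
}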

Related

Writing out a binary structure to file, unexpected results?

I'm trying to write out a packed data structure to a binary file, however as you can see from od -x below the results are not expected in their ordering. I'm using gcc on a 64-bit Intel system. Does anyone know why the ordering is wrong? It doesn't look like an endianness issue.
#include <stdio.h>
#include <stdlib.h>
struct B {
char a;
int b;
char c;
short d;
} __attribute__ ((packed));
int main(int argc, char *argv[])
{
FILE *fp;
fp = fopen("temp.bin", "w");
struct B b = {'a', 0xA5A5, 'b', 0xFF};
if (fwrite(&b, sizeof(b), 1, fp) != 1)
printf("Error fwrite\n");
exit(0);
}
Hex 61 is ASCII 'a', so that's the b.a member; hex 62 is 'b', the b.c member. It's odd how 0xA5A5 is spread out over the sequence.
$ od -x temp.bin
0000000 a561 00a5 6200 00ff
0000010
od -x groups the input into 2-byte units and swaps their endianness. It's a confusing output format. Use -t x1 to leave the bytes alone.
$ od -t x1 temp.bin
0000000 61 a5 a5 00 00 62 ff 00
0000010
Or, easier to remember, use hd (hex dump) instead of od (octal dump). hd's default format doesn't need adjusting, plus it shows both a hex and ASCII dump.
$ hd temp.bin
00000000 61 a5 a5 00 00 62 ff 00 |a....b..|
00000008
od -x writes the data out as 2-byte little-endian units. Per the od man page:
-x same as -t x2, select hexadecimal 2-byte units
So
0000000 a561 00a5 6200 00ff
is, on disk:
0000000 61a5 a500 0062 ff00
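A quick way to double-check the layout without reaching for od is to dump the struct's bytes yourself. A small sketch along the lines of the original program:
#include <stdio.h>

struct B {
    char a;
    int b;
    char c;
    short d;
} __attribute__ ((packed));

int main(void)
{
    struct B b = {'a', 0xA5A5, 'b', 0xFF};
    const unsigned char *p = (const unsigned char *)&b;

    /* on a little-endian machine this prints: 61 a5 a5 00 00 62 ff 00 */
    for (size_t i = 0; i < sizeof b; i++)
        printf("%02x ", p[i]);
    putchar('\n');
    return 0;
}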

sha512: C program using the OpenSSL library

I have a program in C which calculates the SHA-256 hash of an input file, using the OpenSSL library. Here is the core of the program:
#include <openssl/sha.h>
SHA256_CTX ctx;
unsigned char buffer[512];
SHA256_Init(&ctx);
SHA256_Update(&ctx, buffer, len);
SHA256_Final(buffer, &ctx);
fwrite(&buffer,32,1,stdout);
I need to change it to calculate sha512 hash instead.
Can I just (naively) change all the names of the functions from SHA256 to SHA512, and then in the last step fwrite 64 bytes instead of 32? Is that all, or do I have to make more changes?
Yes, this will work. The man page for the SHA family of functions lists the following:
int SHA256_Init(SHA256_CTX *c);
int SHA256_Update(SHA256_CTX *c, const void *data, size_t len);
int SHA256_Final(unsigned char *md, SHA256_CTX *c);
unsigned char *SHA256(const unsigned char *d, size_t n,
unsigned char *md);
...
int SHA512_Init(SHA512_CTX *c);
int SHA512_Update(SHA512_CTX *c, const void *data, size_t len);
int SHA512_Final(unsigned char *md, SHA512_CTX *c);
unsigned char *SHA512(const unsigned char *d, size_t n,
unsigned char *md);
...
SHA1_Init() initializes a SHA_CTX structure.
SHA1_Update() can be called repeatedly with chunks of the message
to be hashed (len bytes at data).
SHA1_Final() places the message digest in md, which must have space
for SHA_DIGEST_LENGTH == 20 bytes of output, and erases the
SHA_CTX.
The SHA224, SHA256, SHA384 and SHA512 families of functions operate
in the same way as for the SHA1 functions. Note that SHA224 and
SHA256 use a SHA256_CTX object instead of SHA_CTX. SHA384 and
SHA512 use SHA512_CTX. The buffer md must have space for the
output from the SHA variant being used (defined by
SHA224_DIGEST_LENGTH, SHA256_DIGEST_LENGTH, SHA384_DIGEST_LENGTH
and SHA512_DIGEST_LENGTH). Also note that, as for the SHA1()
function above, the SHA224(), SHA256(), SHA384() and SHA512()
functions are not thread safe if md is NULL.
To confirm, let's look at some code segments. First with SHA256:
SHA256_CTX ctx;
unsigned char buffer[512];
char *str = "this is a test";
int len = strlen(str);
strcpy(buffer,str);
SHA256_Init(&ctx);
SHA256_Update(&ctx, buffer, len);
SHA256_Final(buffer, &ctx);
fwrite(&buffer,32,1,stdout);
When run as:
./test1 | od -t x1
Outputs:
0000000 2e 99 75 85 48 97 2a 8e 88 22 ad 47 fa 10 17 ff
0000020 72 f0 6f 3f f6 a0 16 85 1f 45 c3 98 73 2b c5 0c
0000040
Which matches the output of:
echo -n "this is a test" | openssl sha256
Which is:
(stdin)= 2e99758548972a8e8822ad47fa1017ff72f06f3ff6a016851f45c398732bc50c
Now the same code with the changes you suggested:
SHA512_CTX ctx;
unsigned char buffer[512];
char *str = "this is a test";
int len = strlen(str);
strcpy(buffer,str);
SHA512_Init(&ctx);
SHA512_Update(&ctx, buffer, len);
SHA512_Final(buffer, &ctx);
fwrite(&buffer,64,1,stdout);
The output when passed through "od" gives us:
0000000 7d 0a 84 68 ed 22 04 00 c0 b8 e6 f3 35 ba a7 e0
0000020 70 ce 88 0a 37 e2 ac 59 95 b9 a9 7b 80 90 26 de
0000040 62 6d a6 36 ac 73 65 24 9b b9 74 c7 19 ed f5 43
0000060 b5 2e d2 86 64 6f 43 7d c7 f8 10 cc 20 68 37 5c
0000100
Which matches the output of:
echo -n "this is a test" | openssl sha512
Which is:
(stdin)= 7d0a8468ed220400c0b8e6f335baa7e070ce880a37e2ac5995b9a97b809026de626da636ac7365249bb974c719edf543b52ed286646f437dc7f810cc2068375c
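For reference, here is a self-contained version of the SHA512 fragment above, so the od output can be reproduced directly (a sketch using the legacy SHA512_* interface; compile with -lcrypto):
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

int main(void)
{
    SHA512_CTX ctx;
    unsigned char buffer[512];
    const char *str = "this is a test";
    size_t len = strlen(str);

    memcpy(buffer, str, len);

    SHA512_Init(&ctx);
    SHA512_Update(&ctx, buffer, len);
    SHA512_Final(buffer, &ctx);     /* the 64-byte digest overwrites the buffer */

    fwrite(buffer, 64, 1, stdout);  /* pipe through od -t x1 to see the hex */
    return 0;
}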
I am using macOS and openssl. This works for me:
#include <openssl/sha.h>
#include <stdio.h>
#include <string.h>
int main() {
unsigned char data[] = "some text";
unsigned char hash[SHA512_DIGEST_LENGTH];
SHA512(data, strlen((char *)data), hash);
for (int i = 0; i < SHA512_DIGEST_LENGTH; i++)
printf("%02x", hash[i]);
putchar('\n');
}
I compile using,
~$ gcc -o sha512 sha512.c \
-I /usr/local/opt/openssl/include \
-L /usr/local/opt/openssl/lib \
-lcrypto
~$ ./sha512
e2732baedca3eac1407828637de1dbca702c3fc9ece16cf536ddb8d6139cd85dfe7464b8235b29826f608ccf4ac643e29b19c637858a3d8710a59111df42ddb5
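One caveat worth adding: recent OpenSSL releases (3.0 and later) mark the low-level SHA512_* functions as deprecated in favour of the EVP digest interface. A hedged sketch of the same computation via EVP, assuming OpenSSL 1.1 or later (link with -lcrypto):
#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

int main(void)
{
    const unsigned char data[] = "some text";
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int md_len = 0;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    if (!ctx)
        return 1;

    EVP_DigestInit_ex(ctx, EVP_sha512(), NULL);
    EVP_DigestUpdate(ctx, data, strlen((const char *)data));
    EVP_DigestFinal_ex(ctx, md, &md_len);
    EVP_MD_CTX_free(ctx);

    for (unsigned int i = 0; i < md_len; i++)
        printf("%02x", md[i]);
    putchar('\n');
    return 0;
}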

putwchar() can't display a wchar_t variable

Why can printf() display é (\u00E9 in UTF-16) while putwchar() can't?
And what is the right syntax to get putwchar() to display é correctly?
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main() {
wint_t wc = L'\u00E9';
setlocale(LC_CTYPE, "fr_FR.utf8");
printf("%C\n", wc);
putwchar((wchar_t)wc);
putchar('\n');
return 0;
}
Environment:
OS : openSUSE Leap 42.1
compiler : gcc version 4.8.5 (SUSE Linux)
Terminal : Terminator
Terminal encoding : UTF-8
Shell : zsh
CPU : x86_64
Shell env :
env | grep LC && env | grep LANG
LC_CTYPE=fr_FR.utf8
LANG=fr_FR.UTF-8
GDM_LANG=fr_FR.utf8
Edit
in :
wint_t wc = L'\u00E9'
setlocale(LC_CTYPE, "");
out:
C3 A9 0A E9 0A
in:
wint_t wc = L'\xc3a9';
setlocale(LC_CTYPE, "");
out:
EC 8E A9 0A A9 0A
You cannot mix wide character and byte input/output functions (printf is a byte output function, regardless if it includes formats for wide characters) on the same stream. The orientation of a stream can only be reset with freopen, which must be done again before calling the byte-oriented putchar function.
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main() {
wint_t wc = L'\u00E9';
setlocale(LC_CTYPE, "");
printf("%lc\n", wc);
freopen(NULL, "w", stdout);
putwchar((wchar_t)wc);
freopen(NULL, "w", stdout);
putchar('\n');
return 0;
}
The fact that the orientation can only be set by reopening the stream indicates that this is not intended to be done trivially, and most programs should use only one kind of output. (i.e. either wprintf/putwchar, or printf/putchar, using printf or wctomb if you need to print a wide character)
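As a concrete example of that last suggestion, wctomb converts a single wide character to its multibyte form according to the current LC_CTYPE locale, so it can be written through the byte-oriented stdio functions without ever giving the stream a wide orientation. A minimal sketch:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    wchar_t wc = L'\u00E9';
    char buf[MB_LEN_MAX];
    int n = wctomb(buf, wc);        /* bytes written, or -1 if not representable */

    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    putchar('\n');
    return 0;
}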
The problem is your setlocale() call failed. If you check the result you'll see that.
if( !setlocale(LC_CTYPE, "fr_FR.utf8") ) {
printf("Failed to set locale\n");
return 1;
}
The problem is fr_FR.utf8 is not the correct name for the locale. Instead, use the LANG format: fr_FR.UTF-8.
if( !setlocale(LC_CTYPE, "fr_FR.UTF-8") ) {
printf("Failed to set locale\n");
return 1;
}
The locale names are whatever is installed on your system, probably in /usr/share/locale/. Or you can get a list with locale -a.
It's rare you want to hard code a locale. Usually you want to use whatever is specified by the environment. To do this, pass in "" as the locale and the program will figure it out.
if( !setlocale(LC_CTYPE, "") ) {
printf("Failed to set locale\n");
return 1;
}

Unescape a universal character name to the corresponding character in C

NEW EDIT:
Basically, I've provided an example that isn't correct. In my real application the string will of course not always be "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt". Instead I will have an input window in Java, and I will "escape" the Unicode characters to universal character names, which will then be "unescaped" in C (I do this to avoid problems with passing multibyte characters from Java to C). So here is an example where I actually ask the user to input a string (a filename):
#include <stdio.h>
#include <string.h>
int func(const char *fname);
int main()
{
char src[100];
scanf("%s", &src);
printf("%s\n", src);
int exists = func((const char*) src);
printf("Does the file exist? %d\n", exists);
return exists;
}
int func(const char *fname)
{
FILE *file;
if (file = fopen(fname, "r"))
{
fclose(file);
return 1;
}
return 0;
}
And now it will treat the universal character names as just part of the actual filename. So how do I "unescape" the universal character names included in the input?
FIRST EDIT:
So I compile this example like this: "gcc -std=c99 read.c", where 'read.c' is my source file. I need the -std=c99 parameter because I'm using the prefix '\u' for my universal character name. If I change it to '\x' it works fine, and I can remove the -std=c99 parameter. But in my real application the input will not use the prefix '\x'; instead it will use the prefix '\u'. So how do I work around this?
This code gives the desired result but for my real application I can't really use '\x':
#include <stdio.h>
#include <string.h>
int func(const char *fname);
int main()
{
char *src = "C:/Users/Familjen-Styren/Documents/V\x00E5gformer/20140104-0002/text.txt";
int exists = func((const char*) src);
printf("Does the file exist? %d\n", exists);
return exists;
}
int func(const char *fname)
{
FILE *file;
if (file = fopen(fname, "r"))
{
fclose(file);
return 1;
}
return 0;
}
ORIGINAL:
I've found a few examples of how to do this in other programming languages like JavaScript, but I couldn't find any example of how to do it in C. Here is a code sample which reproduces the problem:
#include <stdio.h>
#include <string.h>
int func(const char *fname);
int main()
{
char *src = "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt";
int len = strlen(src); /* This returns 68. */
char fname[len];
sprintf(fname,"%s", src);
int exists = func((const char*) src);
printf("%s\n", fname);
printf("Does the file exist? %d\n", exists); /* Outputs 'Does the file exist? 0' which means it doesn't exist. */
return exists;
}
int func(const char *fname)
{
FILE *file;
if (file = fopen(fname, "r"))
{
fclose(file);
return 1;
}
return 0;
}
If I instead use the same string without universal character names:
#include <stdio.h>
#include <string.h>
int func(const char *fname);
int main()
{
char *src = "C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text.txt";
int exists = func((const char*) src);
printf("Does the file exist? %d\n", exists); /* Outputs 'Does the file exist? 1' which means it does exist. */
return exists;
}
int func(const char *fname)
{
FILE *file;
if (file = fopen(fname, "r"))
{
fclose(file);
return 1;
}
return 0;
}
it will output 'Does the file exist? 1', which means it does indeed exist. But the problem is that I need to be able to handle universal character names. So how do I unescape a string which contains universal character names?
Thanks in advance.
I'm re-editing the answer in the hope of making it clearer. First of all, I'm assuming you are familiar with this: http://www.joelonsoftware.com/articles/Unicode.html. It is required background knowledge when dealing with character encodings.
Now I'll start with a simple test program, test.c, that I typed on my Linux machine:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#define BUF_SZ 255
void test_fwrite_universal(const char *fname)
{
printf("test_fwrite_universal on %s\n", fname);
printf("In memory we have %d bytes: ", strlen(fname));
for (unsigned i=0; i<strlen(fname); ++i) {
printf("%x ", (unsigned char)fname[i]);
}
printf("\n");
FILE* file = fopen(fname, "w");
if (file) {
fwrite((const void*)fname, 1, strlen(fname), file);
fclose(file);
file = NULL;
printf("Wrote to file successfully\n");
}
}
int main()
{
test_fwrite_universal("file_\u00e5.txt");
test_fwrite_universal("file_å.txt");
test_fwrite_universal("file_\u0436.txt");
return 0;
}
The source file test.c is encoded as UTF-8. On my Linux machine my locale is en_US.UTF-8.
So I compile and run the program like this:
gcc -std=c99 test.c -fexec-charset=UTF-8 -o test
test
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_ж.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74
Wrote to file successfully
The source file is in UTF-8, my locale is UTF-8, and the execution character set for char is UTF-8.
In main I call the test function three times with character strings. The function prints the string byte by byte, then creates a file with that name and writes the string into it.
We can see that "file_\u00e5.txt" and "file_å.txt" are the same: 66 69 6c 65 5f c3 a5 2e 74 78 74
and sure enough (http://www.fileformat.info/info/unicode/char/e5/index.htm) the UTF-8 representation for code point +00E5 is: c3 a5
In the last example I used \u0436 which is a Russian character ж (UTF-8 d0 b6)
Now let's try the same on my Windows machine. Here I use MinGW and execute the same code:
C:\test>gcc -std=c99 test.c -fexec-charset=UTF-8 -o test.exe
C:\test>test
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_╨╢.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74
Wrote to file successfully
So it looks like something went horribly wrong: printf is not writing the characters properly, and the file names on disk also look wrong.
Two things are worth noting: in terms of byte values, the file name is the same on both Linux and Windows, and the content of the file is also correct when opened with something like Notepad++.
The reason for the problem is the C standard library on Windows and the locale. Where on Linux the system locale is UTF-8, on Windows my default locale is CP-437. So when I call functions such as printf or fopen, the input is assumed to be in CP-437, where c3 a5 is actually two separate characters.
Before we look at a proper Windows solution, let's try to explain why you get different results for file_å.txt vs. file_\u00e5.txt.
I believe the key is the encoding of your source file. If I re-encode the same test.c in CP-437:
C:\test>iconv -f UTF-8 -t cp437 test.c > test_lcl.c
C:\test>gcc -std=c99 test_lcl.c -fexec-charset=UTF-8 -o test_lcl.exe
C:\test>test_lcl
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 10 bytes: 66 69 6c 65 5f 86 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_╨╢.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74
Wrote to file successfully
I now get a difference between file_å and file_\u00e5. The character å in the source file is now encoded as 0x86. Notice that this time the second string is 10 bytes long, not 11.
If we look at the file name and tell Notepad++ to interpret it as UTF-8, we will see a funny result. The same goes for the actual data written to the file.
Finally, how to get the damn thing working on Windows. Unfortunately, it seems impossible to use the standard library with UTF-8 encoded strings, since on Windows you can't set the C locale to UTF-8; see: What is the Windows equivalent for en_US.UTF-8 locale?
However we can work around this with wide characters:
#include <stdio.h>
#include <string.h>
#include <windows.h>
#define BUF_SZ 255
void test_fopen_windows(const char *fname)
{
wchar_t buf[BUF_SZ] = {0};
int sz = MultiByteToWideChar(CP_UTF8, 0, fname, strlen(fname), (LPWSTR)buf, BUF_SZ-1);
wprintf(L"converted %d characters\n", sz);
wprintf(L"Converting to wide characters %s\n", buf);
FILE* file = _wfopen(buf, L"w");
if (file) {
fwrite((const void*)fname, 1, strlen(fname), file);
fclose(file);
wprintf(L"Wrote file %s successfully\n", buf);
}
}
int main()
{
test_fopen_windows("file_\u00e5.txt");
return 0;
}
To compile use:
gcc -std=gnu99 -fexec-charset=UTF-8 test_wide.c -o test_wide.exe
_wfopen is not ANSI compliant, and -std=c99 implies __STRICT_ANSI__, so you should use -std=gnu99 to have that function declared.
The problem is a wrong array size (the .txt and the terminating \0 were forgotten, and an encoded non-ASCII character takes up more than 1 byte):
// length of the string without the universal character name.
// C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text
// 123456789012345678901234567890123456789012345678901234567890123
//          1         2         3         4         5         6
// int len = 63;
// C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text.txt
int len = 100;
char *src = "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt";
char fname[len];
// or if you can use VLA
char fname[strlen(src)+1];
sprintf(fname, "%s", src);
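None of the answers above actually performs the runtime "unescaping" the question asks for, i.e. turning a literal \uXXXX sequence read from input into UTF-8 bytes before calling fopen. A minimal sketch of one way to do that, with a hypothetical helper named unescape_ucn (BMP code points only, and no validation of the hex digits, for brevity):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Decode \uXXXX escapes (BMP only) in 'in' into UTF-8 in 'out'.
   'out' must be large enough; returns the number of bytes written. */
static size_t unescape_ucn(const char *in, char *out)
{
    size_t o = 0;
    while (*in) {
        if (in[0] == '\\' && in[1] == 'u' &&
            in[2] && in[3] && in[4] && in[5]) {    /* need four more chars */
            char hex[5] = {0};
            memcpy(hex, in + 2, 4);
            unsigned long cp = strtoul(hex, NULL, 16);
            if (cp < 0x80) {                       /* 1-byte UTF-8 sequence */
                out[o++] = (char)cp;
            } else if (cp < 0x800) {               /* 2-byte UTF-8 sequence */
                out[o++] = (char)(0xC0 | (cp >> 6));
                out[o++] = (char)(0x80 | (cp & 0x3F));
            } else {                               /* 3-byte UTF-8 sequence */
                out[o++] = (char)(0xE0 | (cp >> 12));
                out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
                out[o++] = (char)(0x80 | (cp & 0x3F));
            }
            in += 6;
        } else {
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
    return o;
}

int main(void)
{
    char out[256];
    /* the backslash is doubled only because this is a C string literal;
       input read with scanf would contain a single literal backslash */
    unescape_ucn("V\\u00E5gformer", out);
    printf("%s\n", out);            /* prints Vågformer on a UTF-8 terminal */
    return 0;
}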

About Linux C - UUID

#include <stdio.h>
#include <stdlib.h>
#include <uuid/uuid.h>
int main(void) {
puts("!!!Hello World!!!"); /* prints !!!Hello World!!! */
uuid_t uuid;
int uuid_generate_time_safe(uuid);
printf("%x",uuid);
return EXIT_SUCCESS;
}
I just wonder why uuid is not 16 bytes long.
I used the debugger to view the memory; it is indeed not 16 bytes.
And I use libpcap to develop my program; the UUID is not unique.
I just tried your program on my system, and uuid is 16 bytes long. But your program doesn't display its size.
The line:
int uuid_generate_time_safe(uuid);
isn't a call to the uuid_generate_time_safe function, it's a declaration of that function with uuid as the (ignored) name of the single parameter. (And that kind of function declaration isn't even valid as of the 1999 standard, which dropped the old "implicit int" rule.)
Your printf call:
printf("%x",uuid);
has undefined behavior; "%x" requires an argument of type unsigned int.
If you look in /usr/include/uuid/uuid.h, you'll see that the definition of uuid_t is:
typedef unsigned char uuid_t[16];
The correct declaration of uuid_generate_time_safe (see man uuid_generate_time_safe) is:
int uuid_generate_time_safe(uuid_t out);
You don't need that declaration in your own code; it's provided by the #include <uuid/uuid.h>.
Because uuid_t is an array type, the parameter is really of type unsigned char*, which is why the function is seemingly able to modify its argument.
Here's a more correct program that illustrates the use of the function:
#include <stdio.h>
#include <uuid/uuid.h>
int main(void) {
uuid_t uuid;
int result = uuid_generate_time_safe(uuid);
printf("sizeof uuid = %d\n", (int)sizeof uuid);
// or: printf("sizeof uuid = %zu\n", sizeof uuid);
if (result == 0) {
puts("uuid generated safely");
}
else {
puts("uuid not generated safely");
}
for (size_t i = 0; i < sizeof uuid; i ++) {
printf("%02x ", uuid[i]);
}
putchar('\n');
return 0;
}
On my system, I got the following output:
sizeof uuid = 16
uuid not generated safely
26 9b fc b8 89 35 11 e1 96 30 00 13 20 db 0a c4
See the man page for information about why the "uuid not generated safely" message might appear.
Note that I had to install the uuid-dev package to be able to build and run this program.
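If the goal is the conventional hyphenated textual form rather than raw bytes, libuuid also provides uuid_unparse. A short sketch (link with -luuid):
#include <stdio.h>
#include <uuid/uuid.h>

int main(void)
{
    uuid_t uuid;
    char text[37];           /* 36 characters plus the terminating NUL */

    uuid_generate(uuid);     /* generates a random or time-based UUID */
    uuid_unparse(uuid, text);
    printf("%s\n", text);
    return 0;
}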
