Handling special characters in C (UTF-8 encoding)

Handling special characters in C (UTF-8 encoding) - c

I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å among others. When I run the program in terminal the output for those characters are represented with a "?".
Is there an easy fix?

First things first:
Read in the buffer
Use libiconv or similar to obtain wchar_t type from UTF-8 and use the wide character handling functions such as wprintf()
Use the wide character functions in C! Most file/output handling functions have a wide-character variant
Ensure that your terminal can handle UTF-8 output. Having the correct locale setup and manipulating the locale data can automate alot of the file opening and conversion for you ... depending on what you are doing.
Remember that the width of a code-point or character in UTF-8 is variable. This means you can't just seek to a byte and begin reading like with ASCII ... because you might land in the middle of a code point. Good libraries can do this in some cases.
Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.
#include <stdio.h>
#include <wchar.h>
int main()
{
FILE *f = fopen("data.txt", "r, ccs=UTF-8");
if (!f)
return 1;
for (wint_t c; (c = fgetwc(f)) != WEOF;)
printf("%04X\n", c);
fclose(f);
return 0;
}
Links
libiconv
Locale data in C/GNU libc
Some handy info
Another good Unicode/UTF-8 in C resource

Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.
It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:
static void print_buffer(const char *buffer, size_t length)
{
size_t i;
for(i = 0; i < length; i++)
printf("%02x ", (unsigned int) buffer[i]);
putchar('\n');
}
You can do this after loading a very short file, containing just a few characters.
Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.

Probably your text file is ISO-8559-1 encoded but your terminal is UTF-8. This kind of mismatch is a standard problem when dealing with byte-oriented text handling; other C programs (such as the standard ‘cat’ and ‘more’ commands) will do the same thing and it isn't generally considered an error or something that needs to be fixed.
If you want to operate on a Unicode character level instead of bytes that's fine, but you'll need to use wchar as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file encoding actually is. (Whilst it is sometimes possible to guess, it's not very reliable.)

I don't know if it could help but if you're sure that the encodings of terminal and input file are the same, you can try to setlocale():
#include <locale.h>
…
setlocale(LC_CTYPE, "");

Related

Why does fgetc() in C always reads extra, non-existent characters whenever I try to read non-printable characters from txt files?

I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.
However, I have noticed that for every non-printable character I read, there is always an extra non-printable character existing in front of what I really want to read.
For example, the character I want to read is "§".
And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".
I looked it up in the debugger and saw "Â§" in the char array. But I don't have Â anywhere in my input file.
screenshot of debugger
And after I write the non-printable character into my output file, I have noticed that it is also just "§", not "Â§".
There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?
Thanks!
Code as follows:
case 1:
mode = 1;
FILE *fp;
fp = fopen ("input2.txt", "r");
int charCount = 0;
while(!feof(fp)) {
original_message[charCount] = fgetc(fp);
charCount++;
}
original_message[charCount - 1] = '\0';
fclose(fp);
k = strlen(original_message);//split the original message into k input symbols
printf("k: \n%lld\n", k);
printf("ASCII code:\n");
for (int i = 0; i < k; i++)
{
ASCII = original_message[i];
printf("%d ", ASCII);
}

C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.
But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.
You will have to #include <wchar.h> to get the prototype for fgetwc. And you may have to add the call
setlocale(LC_CTYPE, "");
at the top of your program to synchronize your program's character set "locale" with that of your operating system.
Not your original code, but I wrote this little program:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main()
{
wchar_t c;
setlocale(LC_CTYPE, "");
while((c = fgetwc(stdin)) != EOF)
printf("%lc %d\n", c, c);
}
When I type "A", it prints A 65.
When I type "§", it prints § 167.
When I type "Ƶ", it prints Ƶ 437.
When I type "†", it prints † 8224.
Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences:
Your original_message array is going to have to be an array of wchar_t, not an array of char.
Your original_message array isn't going to be an ordinary C string — it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc.
So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.

What encoding are character / string literals stored in? (Or how to find a literal character in a string from input?)

As we know, different encodings map different representations to same characters. Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well? I'd find this surprising since these are compile-time!
This matters for tasks as simple as, for example, determining whether a string read from input contains a specific character. When reading strings from input it seems sensible to set the locale to to the user's locale (setlocale("LC_ALL", "");) so that the string is read and processed correctly. But when we're comparing this string with a character literal, won't problems arise due to mismatched encoding?
In other words: The following snippet seems to work for me. But doesn't it work only because of coincidence? Because - for example? - the source code happened to be saved in the same encoding that is used on the machine during runtime?
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
int main()
{
setlocale(LC_ALL, "");
// Read line and convert it to wide string so that wcschr can be used
// So many lines! And that's even though I'm omitting the necessary
// error checking for brevity. Ah I'm also omitting free's
char *s = NULL; size_t n = 0;
getline(&s, &n, stdin);
mbstate_t st = {0}; const char* cs = s;
size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
wchar_t *ws = malloc((wn+1) * sizeof(wchar_t));
st = (mbstate_t){0};
mbsrtowcs(ws, &cs, (wn+1), &st);
int contains_guitar = (wcschr(ws, L'🎸') != NULL);
if(contains_guitar)
printf("Let's rock!\n");
else
printf("Let's not.\n");
return 0;
}
How to do this correctly?

Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well?
No. String literals use the execution character set, which is defined by your compiler at compile time.
Execution character set does not have to be the same as the source character set, the character set used in the source code. The C compiler is responsible for the translation, and should have options for choosing/defining them. The default depends on the compiler, but on Linux and most current POSIXy systems, is usually UTF-8.
The following snippet seems to work for me. But doesn't it work only because of coincidence?
The example works because the character set of your locale, the source character set, and the execution character set used when the binary was constructed, all happen to be UTF-8.
How to do this correctly?
Two options. One is to use wide characters and string literals. The other is to use UTF-8 everywhere.
For wide input and output, see e.g. this example in another answer here.
Do note that getwline() and getwdelim() are not in POSIX.1, but in C11 Annex K. This means they are optional, and as of this writing, not widely available at all. Thus, a custom implementation around fgetwc() is recommended instead. (One based on fgetws(), wcslen(), and/or wcscspn() will not be able to handle embedded nuls, L'\0', correctly.)
In a typical wide I/O program, you only need mbstowcs() to convert command-line arguments and environment variables to wide strings.
Using UTF-8 everywhere is also a perfectly valid practical approach, at least if it is well documented, so that users know the program inputs and outputs UTF-8 strings, and developers know to ensure their C compiler uses UTF-8 as the execution character set when compiling those binaries.
Your program can even use e.g.
if (!setlocale(LC_ALL, ""))
fprintf(stderr, "Warning: Your C library does not support your current locale.\n");
if (strcmp("UTF-8", nl_langinfo(CODESET)))
fprintf(stderr, "Warning: Your locale does not use the UTF-8 character set.\n");
to verify the current locale uses UTF-8.
I have used both approaches, depending on the circumstances. It is difficult to say which one is more portable in practice, because as usual, both work just fine on non-Windows OSes without issues.

If you're willing to assume UTF-8,
strstr(s,"🎸")
Or:
strstr(s,u8"🎸")
The latter avoids some assumptions but requires a C11 compiler. If you want the best of both and can sacrifice readability:
strstr(s,"\360\237\216\270")

How to write äõüö is C? [duplicate]

I am trying to do my own version of wc (unix filter), but I have a problem with non-ASCII characters. I did a HEX dump of a text file and found out that these characters occupy more than one byte. So they won't fit to char. Is there any way I can read these characters from file and handle them like single characters (in order to count characters in a file) in C?
I've been googling a little bit and found some wchar_t type, but there were not any simple examples how to use it with files.

I've been googling a little bit and found some wchar_t type, but there was not any simple example how to use it with files.
Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.
Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.
The answer to the question "What is the best unicode library for C?" is to use the ICU library. You may want to look at ustdio.h, as it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.
Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.
I, personally, have never used ICU, but I probably will from now on :-)

If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():
setlocale(LC_CTYPE, "");
This will cause the wide character functions to use the appropriate character set defined by the environment - eg. on Unix-like systems, the LANG environment variable. For example, this means that if your LANG variable is set to a UTF8 locale, the wide character functions will handle input and output in UTF8. (This is how the POSIX wc utility is specified to work).
You can then use the wide-character versions of all the standard functions. For example, if you have code like this:
long words = 0;
int in_word = 0;
int c;
while ((c = getchar()) != EOF)
{
if (isspace(c))
{
if (in_word)
{
in_word = 0;
words++;
}
}
else
{
in_word = 1;
}
}
...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():
long words = 0;
int in_word = 0;
wint_t c;
while ((c = getwchar()) != WEOF)
{
if (iswspace(c))
{
if (in_word)
{
in_word = 0;
words++;
}
}
else
{
in_word = 1;
}
}

Go have a look at ICU. That library is what you need to deal with all the issues.

Most of the answers so far have merit, but which you use depends on the semantics you want:
If you want to process text in the configured locale's encoding, and don't care about complete failure in the case of encountering invalid sequences, using getwchar() is fine.
If you want to process text in the configured locale's encoding, but need to detect and recover from invalid sequences, you need to read bytes and use mbrtowc manually.
If you always want to process text as UTF-8, you need to read bytes and feed them to your own decoder. If you know in advance the file will be valid UTF-8, you can just count bytes in the ranges 00-7F and C2-F4 and skip counting all other bytes, but this could give wrong results in the presence of invalid sequences. A more robust approach would be decoding the bytestream to Unicode codepoints and counting the number of successful decodes.
Hope this helps.

Are you sure you really need the number of characters? wc counts the number of bytes.
~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt
1 1 11 hebrew.txt
(11 = 5 two-byte characters + 1 byte for '\n')
However, if you really do want to count characters rather than bytes, and can assume that your text files are encoded in UTF-8, then the easiest approach is to count all bytes that are not trail bytes (i.e., in the range 0x80 to 0xBF).
If you can't assume UTF-8 but can assume that any non-UTF-8 files are in a single-byte encoding, then perform a UTF-8 validation check on the data. If it passes, return the number of UTF-8 lead bytes. If if fails, return the number of total bytes.
(Note that the above approach is specific to wc. If you're actually doing something with the characters rather than just counting them, you'll need to know the encoding.)

Looking for UTF8-aware formatting functions like printf(), etc

I discovered an interesting problem when processing UTF-8 strings containing non-ASCII chars with C standard library formatting functions like sprintf():
The functions of the printf() family are not aware of utf-8 and process everything based on the number of bytes, not chars. Therefore the formatting is incorrect.
Simple example:
#include <stdio.h>
int main(int argc, char *argv[])
{
const char* testMsg = "Tääääßt";
char buf[1024];
int len;
sprintf(buf, "|%7.7s|", testMsg);
len = strlen(buf);
printf("Result=\"%s\", len=%d", buf, len);
return 0;
}
The result is:
Result="|Täää|", len=7
Most probably some of you will recommand to convert the application from char to wchar_t and use fwprintf(), etc., but that's absolutely impossible because of huge existing applications. I could imagine writing a wrapper that uses these functions internally, but this would be tricky and very inefficient.
So the best solution would be a UTF-8-aware replacement for the formatting functions of the Standard C Library.
Currently I'm working on QNX 6.4, but replies for other operating systems. e.g. Linux, are also very welcome.

Well, once you ask printf to do intelligent padding of Unicode characters, you run into major problems. As they say,
w͢͢͝h͡o͢͡ ̸͢k̵͟n̴͘ǫw̸̛s͘ ̀́w͘͢ḩ̵a҉̡͢t ̧̕h́o̵r͏̵rors̡ ̶͡͠lį̶e͟͟ ̶͝in͢ ͏t̕h̷̡͟e ͟͟d̛a͜r̕͡k̢̨ ͡h̴e͏a̷̢̡rt́͏ ̴̷͠ò̵̶f̸ u̧͘ní̛͜c͢͏o̷͏d̸͢e̡͝?͞
How many Unicode characters are in Tääääßt? Well, it could be anywhere from 7 to 11, depending on how it's encoded. Each ä can be written as U+00E4, which is one character, or it could be written as U+0061 U+0308, which is two characters. So your next hope is to count grapheme clusters. (No, normalization won't make the problem go away.)
But, how wide is a grapheme cluster? Obviously, a is one column wide. U+200B should be zero columns wide, it's a "zero-width" space. Should each ひらがな be two columns wide? They usually are in terminal emulators. What happens when you format ひらがな as 7 columns, do you get "ひらが ", which adds a space, or do you get "ひらが", which is only 6 columns?
If you cut something up which mixes RTL and LTR text, should you reset the text direction afterwards? What are you going to do? (Some terminal emulators, such as Apple's, support a mixture of left-to-right and right-to-left text.)
What is your goal by truncating text? Are you trying to show the user a string in limited space, or are you trying to write a format that uses fixed-width fields?
Basically, if you want to cut Unicode text into chunks, you shouldn't be doing it with something as simple as printf (or wprintf, which is quite possibly worse). Use LibICU (website) to iterate over the breaks you want. Writing a UTF-8 aware version of printf is asking for all sorts of trouble that you don't want.

The following C99 code snippet defines the function u8printf where format specifiers such as %10s yield 10 utf-8 code points, that is characters rather than bytes. Don't forget to set the locale with setlocale(LC_ALL,"") somewhere before this routine is called. This works because the wprintf uses wchar_t internally. You can define u8fprintf and u8sprintf in a similar way. If you want to write this without C99 variable length arrays than a suitable combination of malloc/free is also possible.
int u8printf(char *fmt,...){
va_list ap;
va_start(ap,fmt);
int n=mbstowcs(0,fmt,0);
if(n==-1) return -1;
wchar_t wfmt[n+1];
mbstowcs(wfmt,fmt,n+1);
for(int m=128;m<=32768;m*=2){
wchar_t wbuf[m];
int r=vswprintf(wbuf,m,wfmt,ap);
if(r!=-1) {
char buf[m*4];
wcstombs(buf,wbuf,m*4);
fputs(buf,stdout);
return r;
}
}
return -1;
va_end(ap);
}

Handling multibyte (non-ASCII) characters in C

I am trying to do my own version of wc (unix filter), but I have a problem with non-ASCII characters. I did a HEX dump of a text file and found out that these characters occupy more than one byte. So they won't fit to char. Is there any way I can read these characters from file and handle them like single characters (in order to count characters in a file) in C?
I've been googling a little bit and found some wchar_t type, but there were not any simple examples how to use it with files.

I've been googling a little bit and found some wchar_t type, but there was not any simple example how to use it with files.
Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.
Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.
The answer to the question "What is the best unicode library for C?" is to use the ICU library. You may want to look at ustdio.h, as it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.
Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.
I, personally, have never used ICU, but I probably will from now on :-)

If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():
setlocale(LC_CTYPE, "");
This will cause the wide character functions to use the appropriate character set defined by the environment - eg. on Unix-like systems, the LANG environment variable. For example, this means that if your LANG variable is set to a UTF8 locale, the wide character functions will handle input and output in UTF8. (This is how the POSIX wc utility is specified to work).
You can then use the wide-character versions of all the standard functions. For example, if you have code like this:
long words = 0;
int in_word = 0;
int c;
while ((c = getchar()) != EOF)
{
if (isspace(c))
{
if (in_word)
{
in_word = 0;
words++;
}
}
else
{
in_word = 1;
}
}
...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():
long words = 0;
int in_word = 0;
wint_t c;
while ((c = getwchar()) != WEOF)
{
if (iswspace(c))
{
if (in_word)
{
in_word = 0;
words++;
}
}
else
{
in_word = 1;
}
}

Go have a look at ICU. That library is what you need to deal with all the issues.

Most of the answers so far have merit, but which you use depends on the semantics you want:
If you want to process text in the configured locale's encoding, and don't care about complete failure in the case of encountering invalid sequences, using getwchar() is fine.
If you want to process text in the configured locale's encoding, but need to detect and recover from invalid sequences, you need to read bytes and use mbrtowc manually.
If you always want to process text as UTF-8, you need to read bytes and feed them to your own decoder. If you know in advance the file will be valid UTF-8, you can just count bytes in the ranges 00-7F and C2-F4 and skip counting all other bytes, but this could give wrong results in the presence of invalid sequences. A more robust approach would be decoding the bytestream to Unicode codepoints and counting the number of successful decodes.
Hope this helps.

Are you sure you really need the number of characters? wc counts the number of bytes.
~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt
1 1 11 hebrew.txt
(11 = 5 two-byte characters + 1 byte for '\n')
However, if you really do want to count characters rather than bytes, and can assume that your text files are encoded in UTF-8, then the easiest approach is to count all bytes that are not trail bytes (i.e., in the range 0x80 to 0xBF).
If you can't assume UTF-8 but can assume that any non-UTF-8 files are in a single-byte encoding, then perform a UTF-8 validation check on the data. If it passes, return the number of UTF-8 lead bytes. If if fails, return the number of total bytes.
(Note that the above approach is specific to wc. If you're actually doing something with the characters rather than just counting them, you'll need to know the encoding.)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight