I am trying to make a simple ancient Greek to modern Greek converter, in C, by changing the tones of the vowels. For example, the user types a text in Greek which contains the character ῶ (Unicode: U+1FF6), and the program converts it into ώ (Unicode: U+1F7D). Greek characters are not supported by C, so I don't know how to make it work. Any ideas?
Assuming you use a sane operating system (meaning, not Windows), this is very easy to achieve using C99/C11 locale and wide character support. Consider filter.c:
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <stdio.h>

wint_t convert(const wint_t wc)
{
    switch (wc) {
    case L'ῶ': return L'ώ';
    default: return wc;
    }
}

int main(void)
{
    wint_t wc;

    if (!setlocale(LC_ALL, "")) {
        fprintf(stderr, "Current locale is unsupported.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdin, 1) <= 0) {
        fprintf(stderr, "Standard input does not support wide characters.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdout, 1) <= 0) {
        fprintf(stderr, "Standard output does not support wide characters.\n");
        return EXIT_FAILURE;
    }
    while ((wc = fgetwc(stdin)) != WEOF)
        fputwc(convert(wc), stdout);
    return EXIT_SUCCESS;
}
The above program reads standard input, converts each ῶ into a ώ, and outputs the result.
Note that wide character strings and characters have an L prefix; L'ῶ' is a wide character constant. These are only in Unicode if the execution character set (the character set the code is compiled for) is Unicode, and that depends on your development environment. (Fortunately, outside of Windows, UTF-8 is pretty much a standard nowadays -- and that is a good thing -- so code like the above Just Works.)
On POSIXy systems (like Linux, Android, Mac OS, BSDs), you can use the iconv() facilities to convert from any input character set to Unicode, do the conversion there, and finally convert back to any output character set. Unfortunately, the question is not tagged posix, so that is outside this particular question.
The above example uses a simple switch/case statement. If there are many replacement pairs, one could use e.g.
typedef struct {
    wint_t from;
    wint_t to;
} widepair;

static widepair replace[] = {
    { L'ῶ', L'ώ' },
    /* Others? */
};
#define NUM_REPLACE (sizeof replace / sizeof replace[0])
and at runtime, sort replace[] (using qsort() and a function that compares the from elements), and use binary search to quickly determine if a wide character is to be replaced (and if so, to which wide character). Because this is an O(log2 N) operation, with N being the number of pairs, and it uses the cache reasonably well, even thousands of replacement pairs are not a problem this way. (And of course, you can build the replacement array at runtime just as well, even from user input or command-line options.)
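A minimal sketch of that sort-then-search idea, using qsort() and bsearch() from <stdlib.h>. The names compare_from() and replace_init() are made up for this illustration, and the table entries are sample pairs written as code point numbers, which assumes that wide characters hold Unicode code points on your system:

```c
#include <stdlib.h>
#include <wchar.h>

typedef struct {
    wint_t from;
    wint_t to;
} widepair;

/* Sample pairs only (an assumption, not a complete table), deliberately
   listed out of order so that the sort below has something to do. */
static widepair replace[] = {
    { 0x1FF6, 0x1F7D },  /* omega with perispomeni -> omega with oxia */
    { 0x1FB6, 0x1F71 },  /* alpha with perispomeni -> alpha with oxia */
    { 0x1FC6, 0x1F75 },  /* eta with perispomeni   -> eta with oxia   */
};
#define NUM_REPLACE (sizeof replace / sizeof replace[0])

/* Orders pairs by their 'from' member; used for both sorting and searching. */
static int compare_from(const void *a, const void *b)
{
    const wint_t x = ((const widepair *)a)->from;
    const wint_t y = ((const widepair *)b)->from;
    return (x > y) - (x < y);
}

/* Call once at startup, before the first convert(). */
void replace_init(void)
{
    qsort(replace, NUM_REPLACE, sizeof replace[0], compare_from);
}

/* Returns the replacement for wc, or wc itself if it has no entry. */
wint_t convert(const wint_t wc)
{
    widepair key;
    const widepair *hit;

    key.from = wc;
    key.to = wc;
    hit = bsearch(&key, replace, NUM_REPLACE, sizeof replace[0], compare_from);
    return hit ? hit->to : wc;
}
```

After replace_init() has run, this convert() can stand in for the switch-based version in the filter loop.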
For Unicode characters, we could use a uint32_t map_to[0x110000]; to directly map each code point to another Unicode code point, but because we do not know whether wide characters are Unicode or not, we cannot do that; we do not know the code range of the wide characters until after compile time. Of course, we can do a multi-stage compilation, where a test program generates the replace[] array shown above, and outputs their codes in decimal; then do some kind of auto-grouping or clustering, for example bit maps or hash tables, to do it "even faster".
However, in practice it usually turns out that the I/O (reading and writing the data) takes more real-world time than the conversion itself. Even when the conversion is the bottleneck, the conversion rate is sufficient for most humans. (As an example, when compiling C or C++ code with the GNU utilities, the preprocessor first converts the source code to UTF-8 internally.)
Okay, here's some quick advice. I wouldn't use C because Unicode is not well supported (yet).
A better language choice would be Python, Java, ..., anything with good Unicode support.
I'd write a utility that reads from standard input and writes to standard output. This makes it easy to use from the command line and in scripts.
I might be missing something but it's going to be something like this (in pseudo code):
while ((inCharacter = getCharacterFromStandardInput()) != EOF)
{
    switch (inCharacter)
    {
        case 'ῶ': outCharacter = 'ώ'; break;
        ...
        default: outCharacter = inCharacter; break;
    }
    writeCharacterToStandardOutput(outCharacter);
}
You'll also need to select & handle the format: UTF-8/16/32.
That's it. Good luck!
I want to use √ symbol in the program written below.
#include <stdio.h>
main(){
    char a='√';
    if (a=='√'){
        printf("Working");
    }
    else{
        printf("Not working");
    }
}
√ is not ASCII, and that's why it's not working. But I want to know how to make it work.
Thanks in advance.
There are two different things going on here to be aware of:
The source C file itself may not be able to contain this character correctly.
The char type within the semantics of the actual program does not support this character, either.
As to the first issue, it depends on your platform (etc) but being conservative with C source is most portable, which means sticking to ASCII characters only within the code file. That means, e.g., in comments as well as within meaningful code. That said, lots of platforms will allow and support Unicode characters inside the source files.
Regarding the second, a char is old-fashioned for containing characters, and is limited to an octet, which means arbitrary unicode characters with values above 0xFF just don't fit inside of them. I suppose some non-ASCII characters do in a platform dependent way (Windows code pages?) above value 0x7F, but in this case, I would treat this as a string, using a unicode escape sequence for this character: "\u221A".
char * sqrt = "\u221A";
if (strcmp(sqrt, "\u221A") == 0) {
    printf("Working");
} else {
    printf("Not working");
}
Heads-up that C strings (char*) are not really designed around non-ASCII characters either, so in this case you end up embedding the UTF8 encoded representation of the character (which is three bytes long) inside the char string. This works, preserves the value, and the compare works, but if you're going to be working with unicode more generally...
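To make the "three bytes" point concrete, here is a small sketch that spells out the UTF-8 encoding of U+221A byte by byte, so it does not depend on how the compiler translates "\u221A". The names sqrt_utf8 and starts_with_sqrt() are made up for this illustration:

```c
#include <string.h>

/* U+221A (square root sign) written as its three UTF-8 bytes, so the
   contents do not depend on the compiler's execution character set. */
const char sqrt_utf8[] = "\xE2\x88\x9A";

/* Returns nonzero if s begins with the UTF-8 encoding of U+221A. */
int starts_with_sqrt(const char *s)
{
    return strncmp(s, sqrt_utf8, sizeof sqrt_utf8 - 1) == 0;
}
```

Here strlen(sqrt_utf8) is 3, not 1, which is why the character has to be compared as a string (or a byte sequence) rather than as a single char.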
If your platform supports "wide characters" (wchar_t or unichar or similar) that can hold Unicode characters, then you can use those types to hold this character, and do direct equality comparisons like you were doing:
wchar_t sqrt = L'\u221A';
if (sqrt == L'\u221A') {
...
(FYI: be aware that these wide char types may not be wide enough for arbitrary Unicode code points on your platform, and thus might work for the square root char, but not, say, for an emoji.)
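If you want to check the width concern in code, a rough sketch is to compare WCHAR_MAX (from <wchar.h>) against the largest Unicode code point. The function name is made up for this illustration:

```c
#include <wchar.h>   /* defines WCHAR_MAX */

/* Hypothetical helper: returns 1 if wchar_t can represent every value up
   to U+10FFFF, 0 otherwise. It says nothing about whether the wide
   character set on this platform actually is Unicode. */
int wchar_covers_unicode_range(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

On platforms with a 16-bit wchar_t this returns 0, which is the case where the square root sign fits but an emoji does not. Implementations that define the macro __STDC_ISO_10646__ additionally promise that wchar_t values are Unicode code points.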
Finally, for the sake of completeness, I feel honor-bound to admit that given a contemporary development environment/toolchain and target platform, you could probably get away with using the explicit character in a widechar literal like so:
wchar_t sqrt = L'√';
if (sqrt == L'√') {
....
But I'm old-fashioned, this feels sketchy, and I don't recommend it. :)
As we know, different encodings map different representations to same characters. Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well? I'd find this surprising since these are compile-time!
This matters for tasks as simple as, for example, determining whether a string read from input contains a specific character. When reading strings from input it seems sensible to set the locale to the user's locale (setlocale(LC_ALL, "");) so that the string is read and processed correctly. But when we're comparing this string with a character literal, won't problems arise due to mismatched encoding?
In other words: The following snippet seems to work for me. But doesn't it work only because of coincidence? Because - for example? - the source code happened to be saved in the same encoding that is used on the machine during runtime?
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "");
    // Read line and convert it to wide string so that wcschr can be used
    // So many lines! And that's even though I'm omitting the necessary
    // error checking for brevity. Ah I'm also omitting free's
    char *s = NULL; size_t n = 0;
    getline(&s, &n, stdin);
    mbstate_t st = {0}; const char* cs = s;
    size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
    wchar_t *ws = malloc((wn+1) * sizeof(wchar_t));
    st = (mbstate_t){0};
    mbsrtowcs(ws, &cs, (wn+1), &st);
    int contains_guitar = (wcschr(ws, L'🎸') != NULL);
    if(contains_guitar)
        printf("Let's rock!\n");
    else
        printf("Let's not.\n");
    return 0;
}
How to do this correctly?
Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well?
No. String literals use the execution character set, which is defined by your compiler at compile time.
The execution character set does not have to be the same as the source character set (the character set used in the source code). The C compiler is responsible for the translation between the two, and should have options for choosing or defining them. The default depends on the compiler, but on Linux and most current POSIXy systems it is usually UTF-8.
The following snippet seems to work for me. But doesn't it work only because of coincidence?
The example works because the character set of your locale, the source character set, and the execution character set used when the binary was constructed, all happen to be UTF-8.
How to do this correctly?
Two options. One is to use wide characters and string literals. The other is to use UTF-8 everywhere.
For wide input and output, see e.g. this example in another answer here.
Do note that getwline() and getwdelim() are not in POSIX.1 or in the C standard proper; they are specified in ISO/IEC TR 24731-2 (the dynamic allocation functions extension). This means they are optional, and as of this writing, not widely available at all. Thus, a custom implementation around fgetwc() is recommended instead. (One based on fgetws(), wcslen(), and/or wcscspn() will not be able to handle embedded nuls, L'\0', correctly.)
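A custom line reader around fgetwc() could look roughly like the following sketch. The name read_wide_line() and its interface (modeled loosely on POSIX getline()) are assumptions made for this illustration, not a standard function:

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: reads one line from 'in', including the trailing
   newline if there is one, into a buffer that grows as needed.
   Start with *line == NULL and *size == 0; free(*line) when done.
   Returns the number of wide characters read, or -1 at end of input
   or on error. The returned length lets the caller cope with embedded
   L'\0' characters, since the buffer is not scanned for a terminator. */
long read_wide_line(wchar_t **line, size_t *size, FILE *in)
{
    size_t used = 0;
    wint_t wc;

    while ((wc = fgetwc(in)) != WEOF) {
        if (used + 2 > *size) {  /* room for this character and the terminator */
            size_t new_size = *size ? 2 * *size : 64;
            wchar_t *tmp = realloc(*line, new_size * sizeof **line);
            if (!tmp)
                return -1;
            *line = tmp;
            *size = new_size;
        }
        (*line)[used++] = (wchar_t)wc;
        if (wc == L'\n')
            break;
    }
    if (used == 0)
        return -1;
    (*line)[used] = L'\0';
    return (long)used;
}
```

A typical loop would be while ((len = read_wide_line(&line, &size, stdin)) > 0) { ... }, after setlocale() and fwide() have been called as in the wide I/O example.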
In a typical wide I/O program, you only need mbstowcs() to convert command-line arguments and environment variables to wide strings.
Using UTF-8 everywhere is also a perfectly valid practical approach, at least if it is well documented, so that users know the program inputs and outputs UTF-8 strings, and developers know to ensure their C compiler uses UTF-8 as the execution character set when compiling those binaries.
Your program can even use e.g.
if (!setlocale(LC_ALL, ""))
    fprintf(stderr, "Warning: Your C library does not support your current locale.\n");
if (strcmp("UTF-8", nl_langinfo(CODESET)))
    fprintf(stderr, "Warning: Your locale does not use the UTF-8 character set.\n");
to verify the current locale uses UTF-8.
I have used both approaches, depending on the circumstances. It is difficult to say which one is more portable in practice, because as usual, both work just fine on non-Windows OSes without issues.
If you're willing to assume UTF-8,
strstr(s,"🎸")
Or:
strstr(s,u8"🎸")
The latter avoids some assumptions but requires a C11 compiler. If you want the best of both and can sacrifice readability:
strstr(s,"\360\237\216\270")
I am trying to do my own version of wc (unix filter), but I have a problem with non-ASCII characters. I did a HEX dump of a text file and found out that these characters occupy more than one byte. So they won't fit to char. Is there any way I can read these characters from file and handle them like single characters (in order to count characters in a file) in C?
I've been googling a little bit and found some wchar_t type, but there were not any simple examples how to use it with files.
I've been googling a little bit and found some wchar_t type, but there was not any simple example how to use it with files.
Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.
Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.
The answer to the question "What is the best unicode library for C?" is to use the ICU library. You may want to look at ustdio.h, as it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.
Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.
I, personally, have never used ICU, but I probably will from now on :-)
If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():
setlocale(LC_CTYPE, "");
This will cause the wide character functions to use the appropriate character set defined by the environment, e.g. on Unix-like systems, the LANG environment variable. For example, this means that if your LANG variable is set to a UTF-8 locale, the wide character functions will handle input and output in UTF-8. (This is how the POSIX wc utility is specified to work.)
You can then use the wide-character versions of all the standard functions. For example, if you have code like this:
long words = 0;
int in_word = 0;
int c;

while ((c = getchar()) != EOF)
{
    if (isspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():
long words = 0;
int in_word = 0;
wint_t c;

while ((c = getwchar()) != WEOF)
{
    if (iswspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
Go have a look at ICU. That library is what you need to deal with all the issues.
Most of the answers so far have merit, but which you use depends on the semantics you want:
If you want to process text in the configured locale's encoding, and don't care about complete failure in the case of encountering invalid sequences, using getwchar() is fine.
If you want to process text in the configured locale's encoding, but need to detect and recover from invalid sequences, you need to read bytes and use mbrtowc manually.
If you always want to process text as UTF-8, you need to read bytes and feed them to your own decoder. If you know in advance the file will be valid UTF-8, you can just count bytes in the ranges 00-7F and C2-F4 and skip counting all other bytes, but this could give wrong results in the presence of invalid sequences. A more robust approach would be decoding the bytestream to Unicode codepoints and counting the number of successful decodes.
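For the second case (locale encoding, with recovery from invalid sequences), a rough sketch using mbrtowc() on a buffer of bytes might look like this. The function name and the "skip one byte and carry on" recovery policy are assumptions for this illustration; other policies are possible:

```c
#include <string.h>
#include <wchar.h>

/* Counts the characters in the n bytes at s, decoding them with mbrtowc()
   in the current locale's encoding (set up with setlocale() beforehand).
   A byte that does not start a valid, complete sequence is skipped and
   tallied in *invalid instead of being counted as a character. */
size_t count_chars(const char *s, size_t n, size_t *invalid)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);
    *invalid = 0;
    while (n > 0) {
        size_t len = mbrtowc(NULL, s, n, &st);
        if (len == (size_t)-1 || len == (size_t)-2) {
            /* Invalid or truncated sequence: reset the conversion state,
               skip a single byte, and try again from the next one. */
            memset(&st, 0, sizeof st);
            ++*invalid;
            s++;
            n--;
            continue;
        }
        if (len == 0)
            len = 1;  /* a null character; assumed to occupy one byte */
        count++;
        s += len;
        n -= len;
    }
    return count;
}
```

When reading from a file in chunks, a sequence split across two chunks shows up as (size_t)-2 at the end of a buffer; this sketch treats that as invalid, so a real implementation would carry the leftover bytes over to the next chunk instead.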
Hope this helps.
Are you sure you really need the number of characters? wc counts the number of bytes.
~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt
1 1 11 hebrew.txt
(11 = 5 two-byte characters + 1 byte for '\n')
However, if you really do want to count characters rather than bytes, and can assume that your text files are encoded in UTF-8, then the easiest approach is to count all bytes that are not trail bytes (trail bytes being those in the range 0x80 to 0xBF).
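As a sketch, counting the non-trail bytes takes only a few lines. The function name is made up, and the result is only meaningful if the input really is valid UTF-8:

```c
#include <stddef.h>

/* Counts UTF-8 code points by counting every byte that is not a trail
   byte. Trail bytes have the bit pattern 10xxxxxx (0x80 to 0xBF).
   Assumes the n bytes at s are valid UTF-8. */
size_t utf8_count(const char *s, size_t n)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

For the 11 bytes of hebrew.txt above, this gives 6: five letters plus the newline.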
If you can't assume UTF-8 but can assume that any non-UTF-8 files are in a single-byte encoding, then perform a UTF-8 validation check on the data. If it passes, return the number of UTF-8 lead bytes. If it fails, return the total number of bytes.
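A sketch of that validate-then-fall-back approach, with a hand-written check for well-formed UTF-8. The function name is hypothetical, and the strictness (rejecting overlong forms, surrogates, and values above U+10FFFF) is my own choice for this illustration:

```c
#include <stddef.h>

/* Returns the number of code points if the n bytes at s are well-formed
   UTF-8; otherwise returns n, treating the data as a single-byte encoding. */
size_t count_chars_guess(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0, count = 0;

    while (i < n) {
        unsigned char b = p[i];
        unsigned long cp, min;
        size_t len, k;

        if (b < 0x80)                    { len = 1; cp = b;        min = 0;       }
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; cp = b & 0x1F; min = 0x80;    }
        else if ((b & 0xF0) == 0xE0)     { len = 3; cp = b & 0x0F; min = 0x800;   }
        else if (b >= 0xF0 && b <= 0xF4) { len = 4; cp = b & 0x07; min = 0x10000; }
        else
            return n;  /* stray trail byte or invalid lead byte */

        if (len > n - i)
            return n;  /* sequence cut off at the end of the data */
        for (k = 1; k < len; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return n;  /* expected a trail byte */
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return n;  /* overlong form, out of range, or surrogate */
        count++;
        i += len;
    }
    return count;
}
```

Pure ASCII input gives the same answer either way, so the fallback only matters for files that contain bytes above 0x7F.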
(Note that the above approach is specific to wc. If you're actually doing something with the characters rather than just counting them, you'll need to know the encoding.)
I discovered an interesting problem when processing UTF-8 strings containing non-ASCII chars with C standard library formatting functions like sprintf():
The functions of the printf() family are not aware of utf-8 and process everything based on the number of bytes, not chars. Therefore the formatting is incorrect.
Simple example:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    const char* testMsg = "Tääääßt";
    char buf[1024];
    int len;

    sprintf(buf, "|%7.7s|", testMsg);
    len = strlen(buf);
    printf("Result=\"%s\", len=%d", buf, len);
    return 0;
}
The result is:
Result="|Täää|", len=7
Most probably some of you will recommend converting the application from char to wchar_t and using fwprintf(), etc., but that's absolutely impossible because of huge existing applications. I could imagine writing a wrapper that uses these functions internally, but this would be tricky and very inefficient.
So the best solution would be a UTF-8-aware replacement for the formatting functions of the Standard C Library.
Currently I'm working on QNX 6.4, but replies for other operating systems, e.g. Linux, are also very welcome.
Well, once you ask printf to do intelligent padding of Unicode characters, you run into major problems. As they say,
w͢͢͝h͡o͢͡ ̸͢k̵͟n̴͘ǫw̸̛s͘ ̀́w͘͢ḩ̵a҉̡͢t ̧̕h́o̵r͏̵rors̡ ̶͡͠lį̶e͟͟ ̶͝in͢ ͏t̕h̷̡͟e ͟͟d̛a͜r̕͡k̢̨ ͡h̴e͏a̷̢̡rt́͏ ̴̷͠ò̵̶f̸ u̧͘ní̛͜c͢͏o̷͏d̸͢e̡͝?͞
How many Unicode characters are in Tääääßt? Well, it could be anywhere from 7 to 11, depending on how it's encoded. Each ä can be written as U+00E4, which is one character, or it could be written as U+0061 U+0308, which is two characters. So your next hope is to count grapheme clusters. (No, normalization won't make the problem go away.)
But, how wide is a grapheme cluster? Obviously, a is one column wide. U+200B should be zero columns wide, it's a "zero-width" space. Should each ひらがな be two columns wide? They usually are in terminal emulators. What happens when you format ひらがな as 7 columns, do you get "ひらが ", which adds a space, or do you get "ひらが", which is only 6 columns?
If you cut something up which mixes RTL and LTR text, should you reset the text direction afterwards? What are you going to do? (Some terminal emulators, such as Apple's, support a mixture of left-to-right and right-to-left text.)
What is your goal by truncating text? Are you trying to show the user a string in limited space, or are you trying to write a format that uses fixed-width fields?
Basically, if you want to cut Unicode text into chunks, you shouldn't be doing it with something as simple as printf (or wprintf, which is quite possibly worse). Use LibICU (website) to iterate over the breaks you want. Writing a UTF-8 aware version of printf is asking for all sorts of trouble that you don't want.
The following C99 code snippet defines the function u8printf, where format specifiers such as %10s count UTF-8 code points, that is, characters rather than bytes. Don't forget to set the locale with setlocale(LC_ALL,"") somewhere before this routine is called. This works because the wprintf family converts %s arguments to wchar_t internally and applies widths and precisions to the wide characters. You can define u8fprintf and u8sprintf in a similar way. If you want to write this without C99 variable length arrays, then a suitable combination of malloc/free is also possible.
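#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int u8printf(const char *fmt, ...) {
    size_t n = mbstowcs(NULL, fmt, 0);
    if (n == (size_t)-1) return -1;     /* fmt is not valid in the current locale */
    wchar_t wfmt[n + 1];
    mbstowcs(wfmt, fmt, n + 1);
    for (size_t m = 128; m <= 32768; m *= 2) {
        wchar_t wbuf[m];
        va_list ap;
        va_start(ap, fmt);              /* restart the argument list for every attempt */
        int r = vswprintf(wbuf, m, wfmt, ap);
        va_end(ap);
        if (r >= 0) {
            char buf[m * 4];            /* assumes at most 4 bytes per character, as in UTF-8 */
            wcstombs(buf, wbuf, m * 4);
            fputs(buf, stdout);
            return r;
        }
    }
    return -1;                          /* output did not fit in the largest buffer tried */
}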