Diacritic characters in C char arrays or strings

Background
I'm working on an embedded project and I'm trying to handle non-standard characters and fonts.
I have a raw bitmap font in a 600+ element array. Every 5 elements of this array contain one character: character 32 (space) is in the first 5 elements, character 33 (!) in elements 6-10, and so on.
I have to handle national diacritic characters ("ę" for example). I located them after character 122. Now I'm trying to remap the characters, so that the proper glyph is printed when I type print("Test ę"); in the C source.
Problem
So I want to be able to write this in the source:
print("Test diactric ę");
// warning: (228) illegal character (0xC4)
When I try this (to see what code C assigns to "ę"):
int a = 'ę';
// error: (226) char const too long
How can I work around this?
I'm using the XC8 compiler (GCC-based?).
I found in the compiler manual that it uses a 7-bit character encoding, but maybe there is some way? My source file is encoded in UTF-8.
EDIT
Looks like wchar.h, suggested by Emilien, could work for me, but unfortunately there is no wchar.h for my compiler.
Maybe some preprocessor trick? I really want to avoid hardcore text preparation like this:
print("abcde");
print_diactric(123); // 123 code used for ę
print("fgh");
// to get the word "abcdeęfgh"

You need to think about the difference between the source encoding (what it sounds like: the character encoding used by your C source files on the system where the compiler runs) and the target encoding, which is the encoding the compiler assumes for the system where the code will run.
If your compiler's target encoding is 7-bit, then there's no standard way to express a character like ę; it's simply not part of the target charset. You're going to have to work around that, perhaps by implementing the encoding yourself on top of some other representation.
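For example, since your source files are UTF-8, you could scan the bytes of each string yourself and remap multi-byte sequences onto your font-table indices. Below is a minimal sketch; put_glyph and FONT_INDEX_E_OGONEK are hypothetical stand-ins for your glyph-drawing routine and the slot you placed ę in. Spelling the literal as \xC4\x99 (the UTF-8 bytes of ę) should also sidestep the "illegal character" warning, because the source stays 7-bit:
#include <stdio.h>

#define FONT_INDEX_E_OGONEK 123  /* hypothetical: the font slot you chose for ę */

/* Stub standing in for whatever routine draws one 5-byte glyph
   from the bitmap font array, given its character code. */
static void put_glyph(unsigned char code)
{
    printf("[glyph %u]", code);
}

static void print_utf8(const char *s)
{
    while (*s) {
        unsigned char c = (unsigned char)*s++;
        if (c == 0xC4 && (unsigned char)*s == 0x99) { /* the UTF-8 bytes of ę */
            put_glyph(FONT_INDEX_E_OGONEK);
            s++;                                      /* consume the 2nd byte */
        } else {
            put_glyph(c);                             /* plain 7-bit character */
        }
    }
}

int main(void)
{
    print_utf8("Test \xC4\x99");  /* "Test ę" spelled out as escaped bytes */
    return 0;
}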

As unwind explained, you'll need more than 7 bits in order to encode these characters; maybe you can use the wide character type?
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Note: these are plain char strings; the UTF-8 bytes pass straight through printf. */
    printf("%s\n", "漢語");
    printf("%s\n", "ę");
    return 0;
}
output:
~$ gcc wcharexample.c -o wcharexample && ./wcharexample
漢語
ę

Related

How to take I/O in Greek in a C program on the Windows console

For a school project I decided to make an app. I am writing it in C and running it on the Windows console. I live in Greece and the program needs to read and write text in Greek too. So, I have tried just plainly
printf("Καλησπέρα");
But it prints some random characters. How can I output Greek letters? And, similarly, how can I take input in Greek?
Welcome to Stack Overflow, and thank you for asking such an interesting question! I wish what you are trying to do was simple. But your programming language (C), and your execution environment (the Windows console) were both designed a long time ago, without Greek in mind. As a result, it is not easy to use them for your simple school project.
When your C program outputs bytes to stdout via printf, the Windows Console interprets those bytes as characters. It has a default interpretation, or encoding, which does not include Greek. In order for your Greek letters to appear, you need to tell Windows Console to use the correct encoding. You do this using the _setmode call, using the _O_U16TEXT parameter. This is described in the Windows _setmode documentation, as Semih Artan pointed out in the comments.
The _O_U16TEXT mode means your program must print text out in UTF-16 form. Each code unit is 16 bits long. That means you must represent your text as wide characters, using C syntax like L"\x039a". The L before the double quotes marks the string as having "wide characters", where each character has 16 bits instead of 8 bits. The \x in the string indicates that the next four characters are hex digits, representing the 16 bits of a wide character.
Your C program is itself a text file. The C compiler must interpret the bytes of this text file in terms of characters. When used in a simple way, the compiler will expect only ASCII-compatible byte values in the file. That includes Latin letters and digits, and simple punctuation. It does not include Greek letters. Thus you must write your Greek text by representing its bytes with ASCII substitutes.
The Greek characters Καλησπέρα are, I believe, represented in C wide character syntax as L"\x039a\x03b1\x03bb\x03b7\x03c3\x03c0\x03ad\x03c1\x03b1".
Finally, Windows Console must have access to a Greek font in order for it to display the Greek characters. I expect this is not a problem for you, because you are probably already running your computer in Greek. In any case Windows worldwide includes fonts with Greek coverage.
Plugging this Greek text into the sample program in Microsoft's _setmode documentation gives this. (Note: I have not tested this program myself.)
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x039a\x03b1\x03bb\x03b7\x03c3\x03c0\x03ad\x03c1\x03b1\n");
    return 0;
}
Input is another matter. I won't attempt to go through it here. You probably have to set the mode of stdin to _O_U16TEXT. Then characters will appear as UTF-16. You may need to convert them before they are useful to your program.
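Following the same pattern (and the same not-tested-by-me caveat), a minimal echo program that sets both stdin and stdout to _O_U16TEXT might look like this:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    _setmode(_fileno(stdin), _O_U16TEXT);   /* console input arrives as UTF-16 */
    _setmode(_fileno(stdout), _O_U16TEXT);  /* console output leaves as UTF-16 */
    wchar_t line[80];
    if (fgetws(line, 80, stdin) != NULL)    /* read one line of wide characters */
        wprintf(L"%ls", line);              /* echo it back, Greek included */
    return 0;
}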
Overall, to write a simple app for a school project that reads and writes Greek, I suggest you consider using a tool like Visual Studio to write a GUI program. These tools have a more modern design and give you access to Greek text more easily.

How to compress Non-ASCII characters to 1 byte in C for Linux?

I have a list of Turkish words and I need to compare their lengths. But since some Turkish characters are non-ASCII, I can't compare the lengths correctly: each non-ASCII Turkish character occupies 2 bytes (in UTF-8).
For example:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char s1[] = "ab";
    char s2[] = "çş";
    printf("%zu\n", strlen(s1)); // it prints 2
    printf("%zu\n", strlen(s2)); // it prints 4
    return 0;
}
My friend said it's possible to do that in Windows with the line of code below:
system("chcp 1254");
He said that it maps the Turkish characters into the extended ASCII table. However, it doesn't work on Linux.
Is there a way to do that on Linux?
It's 2017 and soon 2018. So use UTF-8 everywhere; on recent Linux distributions, UTF-8 is the most common encoding for most locale(7)s, and certainly the default on your system. Of course, a Unicode character coded in UTF-8 may take one to four bytes, so the number of Unicode characters in a UTF-8 string is not given by strlen. Consider using some UTF-8 library, like libunistring (or others, e.g. in GLib).
The chcp 1254 thing is some Windows-specific stuff, irrelevant on UTF-8 systems. So forget about it.
If you code a GUI application, use a widget toolkit like GTK or Qt. They both handle Unicode and are able to accept (or convert to) UTF-8. Notice that even simply displaying Unicode (e.g. some UTF-8 or UTF-16 string) is non-trivial, because a string could mix e.g. Arabic, Japanese, Cyrillic, and English words (which need to be displayed in both left-to-right and right-to-left directions), so you had better find a library (or another tool, e.g. a UTF-8 capable terminal emulator) to do that.
If you happen to get some file, you need to know the encoding it uses (which is only a convention that you need to learn and follow). In some cases, the file(1) command might help you guess that encoding, but you still need to understand the convention used to make the file. If it is not UTF-8 encoded, you can convert it (provided you know the source encoding), perhaps with the iconv(1) command.
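If you only need to count characters, and don't want a library just for that, a minimal hand-rolled sketch is to count every byte that is not a UTF-8 continuation byte (continuation bytes always match the bit pattern 10xxxxxx):
#include <stdio.h>

/* Count code points in a UTF-8 string: skip bytes of the
   form 10xxxxxx, which continue a multi-byte sequence. */
static size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

int main(void)
{
    printf("%zu\n", utf8_length("ab")); // prints 2
    printf("%zu\n", utf8_length("çş")); // prints 2
    return 0;
}
Note this counts code points, not grapheme clusters, but for precomposed Turkish characters like ç and ş that distinction rarely matters.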
One possibility could be to use wide character strings to store the words. That does not store each character in one byte, but it solves your main problem: getting a set of functions that work correctly with your language. The program would look like the following:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t s1[] = L"ab";
    wchar_t s2[] = L"çş";
    printf("%zu\n", wcslen(s1)); // it prints 2
    printf("%zu\n", wcslen(s2)); // it prints 2
    return 0;
}

Using sprintf with Unicode characters

I wanted to print out depictions of playing cards using Unicode.
Code snippet:
void printCard(int card)
{
    char strCard[10];
    sprintf(strCard, "\U0001F0A%x", (card % 13) + 1);
    printf("%s\n", strCard);
}
Since the \U requires 8 hex characters after it I get the following from compiling:
error: incomplete universal character name \U0001F0A
I could create a bunch of if/else statements and print out the card that way but I was hoping for a way that wouldn't make me explicitly write out every card's Unicode encoding.
Universal character names (like \U0001F0A1) are resolved by the compiler. If you use one in a format string, printf will see the UTF-8 representation of the character; it has no idea how to handle backslash escapes. (The same is true of \n and \x2C; those are single characters resolved by the compiler.) So you certainly cannot compute the UCN at runtime.
The most readable solution would be to use an array of strings to hold the 13 different card symbols.
That avoids hard-wiring knowledge about Unicode and UTF-8 encoding into the program. If you knew that the active locale was a UTF-8 locale, you could compute the code point as a wchar_t and then use the wide-character-to-multibyte standard library functions to produce the UTF-8 version. But I'm not at all convinced that it would be worthwhile.
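For illustration, a minimal sketch of that array-of-strings approach (assuming the compiler's execution character set is UTF-8, so the \U escapes expand to UTF-8 bytes; the range U+1F0A1..U+1F0AD matches the question's (card % 13) + 1 mapping):
#include <stdio.h>

/* One UTF-8 string literal per card, U+1F0A1 through U+1F0AD. */
static const char *card_symbols[13] = {
    "\U0001F0A1", "\U0001F0A2", "\U0001F0A3", "\U0001F0A4",
    "\U0001F0A5", "\U0001F0A6", "\U0001F0A7", "\U0001F0A8",
    "\U0001F0A9", "\U0001F0AA", "\U0001F0AB", "\U0001F0AC",
    "\U0001F0AD"
};

void printCard(int card)
{
    printf("%s\n", card_symbols[card % 13]);
}

int main(void)
{
    for (int i = 0; i < 13; i++)
        printCard(i);  /* print every card symbol once */
    return 0;
}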
A quick and dirty UTF-8 solution:
void printCard(int card)
{
    printf("\xF0\x9F\x82%c\n", 0xA1 + card % 13);
}
The UTF-8 representation of \U0001F0A1 is F0 9F 82 A1. The above code will correctly handle all 13 cards, provided your terminal supports UTF-8 and non-BMP code points, like iTerm2 on OS X.
Alternative solutions involving wide-char conversion to multibyte character sets are complicated to use and would not work on platforms where wchar_t is limited to 16 bits.

How to print "box drawers" Unicode characters in C (Linux utf8 terminal)?

I'm trying to display Unicode characters from the Box Drawing range (2500-257F). It's supposed to be standard UTF-8 (The Unicode Standard, Version 6.2), but I'm simply unable to do it.
I first tried the good old ASCII characters, but the Linux terminal displays in UTF-8 and there is no conversion (a ? symbol is displayed in their place).
Could anyone answer these questions:
How do I encode a Unicode character in a C variable (wchar_t style)?
How do I use an escape sequence such as 0x or 0o (hex, oct) for Unicode?
I know about U+, but it didn't seem to work.
setlocale(LC_ALL,"");
short a = 0x2500, b = 0x2501;
wchar_t ac = a;
wchar_t bc = b;
wprintf(L"%c%c\n", ac, bc);
exit(0);
I know that the results depend on the font used, but I use a UTF-8 font (http://www.unicode.org/charts/fonts.html) and the codes from 2500 to 257F should be displayed... Actually they aren't.
Thanks for your help in advance...
[EDIT, LATER]
The issue has since been solved... and I found out how to use wprintf() with %lc instead of %c... and dug deeper.
Now those box drawers are part of my student "tools" library, to make learning console programming a little more colourful.
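For reference, a minimal version of that %lc fix (assuming a UTF-8 locale):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");        /* adopt the environment's (UTF-8) locale */
    wchar_t ac = 0x2500, bc = 0x2501;
    wprintf(L"%lc%lc\n", ac, bc); /* %lc converts a wide character on output */
    return 0;
}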
Use a C string containing the bytes of the UTF-8 versions of those characters. If you print that C string, it will print those characters.
An example for your two characters:
#include <stdio.h>

int main(int argc, char *argv[])
{
    char block1[] = { 0xe2, 0x94, 0x80, '\0' };
    char block2[] = { 0xe2, 0x94, 0x81, '\0' };
    printf("%s%s\n", block1, block2);
    return 0;
}
prints ─━ for me.
Also, if you print a C string containing UTF-8 character bytes somewhere in it, it will print those characters without problems (assuming you use gcc; IIRC gcc uses UTF-8 internally anyway).
EDIT: Your question changed a bit while I was writing this, so my answer is less relevant now.
But judging from your symptoms: if you see one ? for each character you expect, I'd say your terminal font might be missing the glyphs required for those characters.
That depends on what you call "terminal".
The Linux console uses various hacks to display Unicode, but in reality its font is limited to 512 symbols IIRC, so it can't really display the whole Unicode range; what it can display depends on the loaded font (this may change in the future).
Windows terminals used to access Linux are usually brain-damaged one way or another, Unicode-wise.
Physical terminals are usually worse and only operate in ASCII-land.
Linux GUI terminals (such as gnome-terminal) can display pretty much everything, as long as you have the corresponding fonts.
Are you sure you don't want to use ncurses instead of writing your own terminal widgets?

char vs wchar_t

I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>

char *ascii_ = "中日友好";       //line-1
wchar_t *wchar_ = L"中日友好";   //line-2

int main()
{
    printf("ascii_: %s\n", ascii_);   //line-3
    wprintf(L"wchar_: %s\n", wchar_); //line-4
    return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to a char* pointer in line-1, but I just did, and the output of line-3 is correct. Why is that? How can printf() in line-3 give me the non-ASCII characters? Does it somehow know the encoding?
I assume the code in line-2 and line-4 is correct, but why didn't I get any output from line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is ASCII-compatible.
Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream can only be set to either byte or wide orientation. Once set, it cannot be changed; it is set the first time the stream is used (byte-oriented here, due to the printf). After that, wprintf will not work because of the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (Except on Windows.)
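You can observe the orientation with fwide(); passing 0 as the second argument only queries it, without setting it. A small demonstration:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    printf("hello\n");           /* first use: stdout becomes byte-oriented */
    int mode = fwide(stdout, 0); /* query the orientation without changing it */
    printf("fwide(stdout, 0) = %d\n", mode); /* negative means byte-oriented */
    return 0;
}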
EDIT:
To answer the question about why the wprintf line doesn't work even by itself: it's probably because the code is being compiled so that the UTF-8 form of 中日友好 is stored in wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).
So there are two options I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but it may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using Unicode escape sequences. This will obviously make them unreadable in the source code, but it will work on any machine that can print Chinese fonts, regardless of the locale (see the sketch below).
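A sketch of option 2 (the \u values below are the code points of 中日友好; note also that %ls, not %s, is the wprintf conversion for a wide string):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                         /* use the user's locale for output */
    wchar_t *wchar_ = L"\u4e2d\u65e5\u53cb\u597d"; /* 中日友好 as escape sequences */
    wprintf(L"wchar_: %ls\n", wchar_);
    return 0;
}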
Line 1 is not ASCII; it's whatever multibyte encoding your compiler uses at compile time. On modern systems that's probably UTF-8. printf does not know the encoding; it's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character I/O on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform an operation of the opposite orientation results in UB.
You are omitting one step, and are therefore thinking about this the wrong way.
You have a C file on disk, containing bytes. You have an "ASCII" string and a wide string.
The ASCII string takes the bytes exactly as they are in line 1 and outputs them.
This works as long as the encoding on the user's side is the same as the one on the programmer's side.
The wide string is different: the given bytes are first decoded into Unicode code points and stored in the program (maybe this goes wrong on your side). On output they are encoded again, according to the encoding on the user's side. This ensures that the characters are emitted as they are intended, not merely as they were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.
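A round trip that makes those two steps explicit (a sketch assuming a UTF-8 source file and a UTF-8 locale):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    setlocale(LC_ALL, "");                  /* decode/encode per the user's locale */
    const char *bytes = "中日友好";          /* the bytes as stored in the source file */
    wchar_t wide[16];
    size_t n = mbstowcs(wide, bytes, 16);   /* decode bytes -> code points */
    printf("decoded %zu code points\n", n); /* 4 in a UTF-8 locale */

    char out[32];
    wcstombs(out, wide, 32);                /* encode code points -> bytes again */
    printf("%s\n", out);
    return 0;
}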
