I'm trying to display Unicode characters from the Box Drawing block (U+2500–U+257F). It's supposed to be standard UTF-8 (The Unicode Standard, Version 6.2). I'm simply unable to do it.
I first tried the good old extended ASCII characters, but the Linux terminal displays in UTF-8 and there is no conversion: a ? symbol is displayed in their place.
Could anyone answer these questions:
How do I encode a Unicode character in a C variable (wchar_t style)?
How do I use an escape sequence such as 0x or 0o (hex, oct) for Unicode?
I know about U+, but it didn't seem to work.
setlocale(LC_ALL,"");
short a = 0x2500, b = 0x2501;
wchar_t ac = a;
wchar_t bc = b;
wprintf(L"%c%c\n", ac, bc);
exit(0);
I know that the results depend on the font used, but I use a Unicode-capable font (http://www.unicode.org/charts/fonts.html), so code points 2500 through 257F should be displayable... Actually they aren't.
Thanks for your help in advance...
[LATE EDIT]
The issue has since been solved: I found out how to use wprintf() with %lc instead of %c, and learned a good deal more along the way.
Those box-drawing characters are now part of my students' "tools" library, to make learning console programming a little more colorful.
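For the record, a minimal working version of the snippet above (assuming a UTF-8 locale on Linux; the two key points are calling setlocale() and using %lc):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* adopt the UTF-8 locale from the environment */
    wchar_t ac = 0x2500, bc = 0x2501; /* ─ and ━ from the Box Drawing block */
    wprintf(L"%lc%lc\n", ac, bc);     /* %lc converts a wchar_t to multibyte output */
    return 0;
}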
Use a C string containing the bytes of the UTF-8 encoding of those characters. If you print that C string, it will print the character.
An example for your two characters:
#include <stdio.h>

int main (int argc, char *argv[])
{
    /* UTF-8 byte sequences for U+2500 (─) and U+2501 (━) */
    char block1[] = { 0xe2, 0x94, 0x80, '\0' };
    char block2[] = { 0xe2, 0x94, 0x81, '\0' };
    printf("%s%s\n", block1, block2);
    return 0;
}
prints ─━ for me.
Also, if you print a C string containing UTF-8 character bytes somewhere in the middle of it, it will print those characters without problems.
(This assumes you use gcc; and IIRC, gcc uses UTF-8 internally anyway.)
EDIT: Your question changed a bit while I was writing this, so my answer is less relevant now.
But judging from your symptoms, if you see one ? for each character you expect, I'd say your terminal font might be missing the glyphs required for those characters.
That depends on what you call "terminal".
- The Linux console uses various hacks to display Unicode, but in reality its font is limited to 512 glyphs IIRC, so it can't display the whole Unicode range, and what it can display depends on the font loaded (this may change in the future).
- Windows terminals used to access Linux are usually brain-damaged one way or another, Unicode-wise.
- Physical terminals are usually worse and only operate in ASCII-land.
- Linux GUI terminals (such as gnome-terminal) can display pretty much everything, as long as you have the corresponding fonts.
Are you sure you don't want to use ncurses instead of writing your own terminal widgets?
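To elaborate on that suggestion, a minimal ncurses sketch (compile with -lncurses; ACS_HLINE is the portable horizontal-line character, drawn with the best glyph the terminal can manage):
#include <curses.h>

int main(void)
{
    initscr();             /* enter curses mode */
    for (int i = 0; i < 20; i++)
        addch(ACS_HLINE);  /* box-drawing horizontal line */
    refresh();
    getch();               /* wait for a key before leaving curses mode */
    endwin();
    return 0;
}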
I have tried many ways to read a Unicode character from the user, using scanf() and getc(), but nothing worked. Most of the time, 0 is stored in the supplied variable (maybe indicating wrong input?). How can I make it so that when the user enters any Unicode code point, it is properly recognized and stored in either a string or a char?
I'm guessing you already know that C chars and Unicode characters are two very different things, so I'll skip over that. The assumptions I'll make here include:
Your C strings will contain UTF-8 encoded characters, terminated by a NUL (\x00) character.
You won't use any C functions that could break the per-character encoding, and you'll interpret the output of functions like strlen() with the understanding that you need to differentiate between C chars (bytes) and real characters.
It really is as simple as:
char input[256];
scanf("%255[^\n]", input); /* note: input, not &input; the field width prevents overflow */
printf("%s\n", input);
The problem comes with what is providing the input and what is displaying the output.
#include <stdio.h>

int main(int argc, char** argv) {
    /* UTF-8 byte sequence for U+1F34C (banana); the explicit \x00 is redundant,
       since string literals are already NUL-terminated */
    char* bananna = "\xF0\x9F\x8D\x8C\x00";
    printf("%s\n", bananna);
}
This probably won't display a banana. That's because the UTF-8 sequence being written to the terminal isn't being interpreted as a UTF-8 sequence.
So, the first thing you need to do is to configure your terminal. If your program is likely to only use one terminal type, then you might even be able to do this from within the program; however, there are tons of people who use different terminals, some that even cross Operating System boundaries. For example, I'm testing my Linux programs in a Windows terminal, connected to the Linux system using SSH.
Once the terminal is configured, your (probably already correct) program should display a banana. But even a correctly configured terminal can fail.
After the terminal is verified to be correctly configured, the last piece of the puzzle is the font. Not all fonts contain glyphs for all Unicode characters. The banana is one of those characters that isn't typically typed into a computer, so you need to open up a font tool and search the font for the glyph. If it doesn't exist in that font, you need to find a font that implements a glyph for that character.
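If you also want the numeric code point back out of those bytes (which is what the question asks for), here is a minimal decoding sketch; it assumes well-formed UTF-8 and does no validation:
#include <stdio.h>

/* decode the first UTF-8 sequence in s into its Unicode code point */
unsigned long decode_utf8(const unsigned char *s)
{
    if (s[0] < 0x80) return s[0];                                 /* 1 byte  */
    if (s[0] < 0xE0) return (s[0] & 0x1Fu) << 6 | (s[1] & 0x3Fu); /* 2 bytes */
    if (s[0] < 0xF0) return (s[0] & 0x0Fu) << 12                  /* 3 bytes */
                          | (s[1] & 0x3Fu) << 6 | (s[2] & 0x3Fu);
    return (s[0] & 0x07u) << 18 | (s[1] & 0x3Fu) << 12            /* 4 bytes */
         | (s[2] & 0x3Fu) << 6 | (s[3] & 0x3Fu);
}

int main(void)
{
    printf("U+%04lX\n", decode_utf8((const unsigned char *)"ę")); /* U+0119 */
    return 0;
}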
I have a list of Turkish words and I need to compare their lengths. But since some Turkish characters are non-ASCII, I can't compare the lengths correctly: the non-ASCII Turkish characters take 2 bytes each.
For example:
#include <stdio.h>
#include <string.h>

int main()
{
    char s1[] = "ab";
    char s2[] = "çş";
    printf("%zu\n", strlen(s1)); // prints 2
    printf("%zu\n", strlen(s2)); // prints 4: strlen counts bytes, not characters
    return 0;
}
My friend said it's possible to do that in Windows with the line of code below:
system("chcp 1254");
He said that it maps the Turkish characters into the extended ASCII table. However, it doesn't work on Linux.
Is there a way to do that in Linux?
It's 2017, and soon 2018, so use UTF-8 everywhere. (On recent Linux distributions, UTF-8 is the most common encoding for most locale(7)-s, and certainly the default on your system.) Of course, a Unicode character coded in UTF-8 takes one to four bytes, so the number of Unicode characters in a UTF-8 string is not given by strlen. Consider using some UTF-8 library, like libunistring (or others, e.g. in GLib).
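For instance, a minimal sketch of counting code points rather than bytes (every UTF-8 continuation byte has the bit pattern 10xxxxxx, so it is enough to count the bytes that don't):
#include <stdio.h>
#include <string.h>

/* count Unicode code points in a UTF-8 string by skipping continuation bytes */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

int main(void)
{
    printf("%zu\n", strlen("çş"));      /* 4: bytes */
    printf("%zu\n", utf8_strlen("çş")); /* 2: characters */
    return 0;
}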
The chcp 1254 thing is Windows-specific stuff, irrelevant on UTF-8 systems. So forget about it.
If you code a GUI application, use a widget toolkit like GTK or Qt. They both handle Unicode and can accept (or convert to) UTF-8. Notice that even simply displaying Unicode (e.g. some UTF-8 or UTF-16 string) is nontrivial, because a string could mix e.g. Arabic, Japanese, Cyrillic and English words (which you'd need to display in both left-to-right and right-to-left directions), so better find a library (or other tool, e.g. a UTF-8 capable terminal emulator) to do that for you.
If you happen to get some file, you need to know the encoding it uses (and that is just a convention that you need to obtain and follow). In some cases, the file(1) command might help you guess the encoding, but you still need to understand the convention used to make the file. If it is not UTF-8 encoded, you can convert it (provided you know the source encoding), perhaps with the iconv(1) command.
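If you would rather convert inside the program, the same machinery is available from C through iconv(3). A minimal sketch (the ISO-8859-9 source encoding here is only an assumption for illustration):
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-9"); /* to, from */
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char in[] = "\xE7\xFE";  /* "çş" in ISO-8859-9 */
    char out[16] = { 0 };
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    printf("%s\n", out);     /* prints: çş on a UTF-8 terminal */
    iconv_close(cd);
    return 0;
}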
One possibility would be to use wide character strings to store the words. That does not store each character as one byte, but it solves your main problem, and <wchar.h> gives you a set of functions that work with your language. The program would look like the following:
#include <stdio.h>
#include <wchar.h>

int main()
{
    wchar_t s1[] = L"ab";
    wchar_t s2[] = L"çş";
    printf("%zu\n", wcslen(s1)); // prints 2
    printf("%zu\n", wcslen(s2)); // prints 2: wcslen counts wide characters, not bytes
    return 0;
}
Background
I'm working on an embedded project, and I'm trying to handle non-standard characters and a custom font.
I have a raw bitmap font in a 600+ element array. Every 5 elements of this array contain one character: character 32 (space) is in the first 5 elements, character 33 (!) in elements 6-10, etc.
I have to handle national diacritic characters ("ę" for example). I placed them after character 122. Now I'm trying to remap the characters, to get the proper glyph printed when I type print("Test ę"); in the C source.
Problem
So I want to type like this in source:
print("Test diactric ę");
// warning: (228) illegal character (0xC4)
When I try this (to see what code C assigns for "ę"):
int a = 'ę';
// error: (226) char const too long
How can I work around this?
I'm using the XC8 compiler (GCC-based?).
I found in the compiler manual that it uses 7-bit character encoding, but maybe there is some way? My source file is encoded in UTF-8.
EDIT
Looks like the wchar.h approach suggested by Emilien could work for me, but unfortunately there is no wchar.h for my compiler.
Maybe some preprocessor trick? I really want to avoid manual text preparation like this:
print("abcde");
print_diactric(123); // 123 code used for ę
print("fgh");
// to get the "word" "abcdeęfgh"
You need to think about the difference between the source encoding (what it sounds like, the character encoding used by your C source files on the system where the compiler runs) and the target encoding, which is the encoding that the compiler assumes for the system where the code will be running.
If your compiler's target encoding is "7-bit", then there's no standard way to express a character like ę; it's simply not part of the target charset. You're going to have to work around that, perhaps by implementing the encoding yourself on top of some other representation.
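For example, here is a sketch of that workaround (it borrows print_diactric() from the question as the routine that draws one raw glyph index, and assumes slot 123 of the font holds "ę"): scan the bytes yourself and translate the UTF-8 sequence 0xC4 0x99 (U+0119, "ę") to the custom index:
#include <stdint.h>

/* from the question: draws the glyph stored at the given font index */
extern void print_diactric(uint8_t code);

/* print a string, remapping the UTF-8 bytes for "ę" to font slot 123;
   extend the comparison into a lookup table for more diacritics */
void print_utf8(const char *s)
{
    const uint8_t *p = (const uint8_t *)s;
    while (*p) {
        if (p[0] == 0xC4 && p[1] == 0x99) { /* UTF-8 sequence for "ę" */
            print_diactric(123);
            p += 2;
        } else {
            print_diactric(*p++);           /* plain 7-bit character */
        }
    }
}
Since XC8 rejects the raw bytes in a literal, the string can be written with hex escapes: print_utf8("Test \xC4\x99");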
As unwind explained, you'll need more than 7 bits in order to encode these characters. Maybe you can use the wide character type?
#include <stdio.h>

int main(){
    /* these literals are plain UTF-8 multibyte strings, printed byte by byte */
    printf("%s\n", "漢語");
    printf("%s\n", "ę");
}
output:
~$ gcc wcharexample.c -o wcharexample && ./wcharexample
漢語
ę
I just started to learn C and then want to proceed to learn C++. I am currently using a textbook and just write the examples in order to get a bit more familiar with the programming language and procedure.
Since the example given in the book didn't work, I tried to find other, similar code. The problem is that after compiling the code, the program does not show any of the symbols represented by %c. I get symbols for the numbers 33-126, but everything else is either nothing at all or just a white block...
Also, in a previous example I wanted to write °C for temperature, and it couldn't display the ° symbol.
The example I found on the web that does not display the %c symbols is:
#include <stdio.h>
#include <ctype.h>

int main()
{
    int i;
    i = 0;
    do
    {
        printf("%i %c \n", i, i);
        i++;
    }
    while (i <= 255);
}
Is anyone familiar with this? Why can I not get output for %c, or e.g. for °, as well?
ASCII is a 7-bit character set, which means it consists only of code points in the range [0, 127]. In 8-bit code pages there are still 128 available code points with values from 128 to 255 (i.e. the high bit is set). These are sometimes called "extended ASCII" (although they're not related to ASCII at all), and the characters they map to depend on the character set. An 8-bit charset is sometimes also called "ANSI", although that is actually a misnomer.
US English Windows uses the Windows-1252 code page by default, with the character ° at code point 0xB0. Other OSes/languages may use different character sets, which may have a different code point for °, or possibly no ° symbol at all.
You have many solutions to this:
If your PC uses an 8-bit charset
Look up the value of ° in the charset your computer is using and print it normally. For example, if you're using CP437, then printf("\xF8") will work, because ° is at code point 0xF8. printf("°") also works, provided the source file is saved in the same code page (CP437).
Or just change the charset to Windows-1252/ISO 8859-1 and print '°' or '\xB0'. This can be done programmatically (using SetConsoleOutputCP on Windows and similar APIs on other OSes) or manually (via console settings, or by running chcp 1252 in Windows cmd). The source file still needs to be saved in the same code page.
Print Unicode. This is the recommended way to do it:
Linux/Unix and most other modern OSes use UTF-8, so just output the correct UTF-8 string and you don't need to care about anything else. However, because ° is a multibyte sequence in UTF-8, you must print it as a string. That means you need %s instead of %c; a single char can't represent ° in UTF-8. Newer Windows 10 also supports UTF-8 as a locale, so you can print the UTF-8 string directly.
On older Windows you need to print the string out as UTF-16. It's a little bit tricky, but not impossible.
If you use "\u00B0" and it prints successfully, then your terminal is already using UTF-8. \u is the escape sequence for arbitrary Unicode code points.
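A minimal example of the UTF-8 route (assuming a UTF-8 terminal; both lines print the same two bytes):
#include <stdio.h>

int main(void)
{
    /* ° is U+00B0, which UTF-8 encodes as the two-byte sequence 0xC2 0xB0,
       so it must be printed as a string, never as a single char */
    printf("%s\n", "\xC2\xB0"); /* raw UTF-8 bytes */
    printf("%s\n", "\u00B0");   /* same thing via a C99 universal character name */
    return 0;
}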
See also
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Anything outside the range 33-126 isn't a visible ASCII character. 0-32 is stuff like backspace (8), "device control 2" (18), and space (32); 127 is DEL, and anything past that isn't even ASCII, so who knows how your terminal will handle it.
I have a simple question that I can't find an answer to anywhere on the internet: how can I convert UTF-8 to ASCII (mostly accented characters to the same character without the accent) in C, using only the standard library? I found solutions for most of the languages out there, but not for C in particular.
Thanks!
EDIT: Some of the kind folks who commented made me double-check what I needed, and I was exaggerating. I only need an idea of how to write a function that does: char with accent -> char without accent. :)
Take a look at libiconv. Even if you insist on doing it without libraries, you might find inspiration there.
In general, you can't. UTF-8 covers much more than accented characters.
There's no built-in way of doing that. There's really little difference between UTF-8 and ASCII unless you're dealing with characters outside the ASCII range, which cannot be represented in ASCII anyway.
If there is a specific mapping you want (such as a-with-accent -> a), then you should probably just handle it as a string replace operation, as sketched below.
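A minimal sketch of that idea (the three mappings in the table are just illustrative; a real table would cover whatever accented characters your data actually contains):
#include <stdio.h>
#include <string.h>

/* map a few two-byte UTF-8 sequences to unaccented ASCII */
static const struct { const char *utf8; char ascii; } map[] = {
    { "\xC3\xA1", 'a' }, /* á, U+00E1 */
    { "\xC3\xA7", 'c' }, /* ç, U+00E7 */
    { "\xC3\xA9", 'e' }, /* é, U+00E9 */
};

void fold_accents(char *dest, const char *src)
{
    while (*src) {
        size_t i, n = sizeof map / sizeof map[0];
        for (i = 0; i < n; i++) {
            if (strncmp(src, map[i].utf8, 2) == 0) {
                *dest++ = map[i].ascii;
                src += 2;
                break;
            }
        }
        if (i == n)          /* no match: copy the byte through unchanged */
            *dest++ = *src++;
    }
    *dest = '\0';
}

int main(void)
{
    char out[64];
    fold_accents(out, "ça é bien");
    printf("%s\n", out); /* prints: ca e bien */
    return 0;
}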
Every decent Unicode support library (not the C standard library, of course) has a way to decompose a string into NFKC or NFKD form, which separates the diacritics from the base letters, giving you a shot at filtering them out. I'm not so sure this is worth pursuing, though: the result is just gibberish to a native reader of the language, and not every letter is decomposable. In other words, junk with question marks.
Since this is homework, I'm guessing your teacher is clueless and doesn't know anything about UTF-8, and is probably stuck in the 1980s with "code pages" and "extended ASCII" (terms you should erase from your vocabulary if you haven't already). Your teacher probably wants you to write a 128-byte lookup table that maps CP437 or Windows-1252 bytes in the range 128-255 to similar-looking ASCII letters. It would go something like...
void strip_accents(unsigned char *dest, const unsigned char *src)
{
    static const unsigned char lut[128] = { /* mapping here */ };
    do {
        /* bytes >= 128 must be offset into the 128-entry table */
        *dest++ = *src < 128 ? *src : lut[*src - 128];
    } while (*src++);
}