I am working on an OS-independent file manager, using SDL_ttf to draw my text.
On Windows everything works well, but on Linux I have to use the UTF8 functions of SDL_ttf, because filenames can be UTF-8 encoded.
This works well, but if I have my own C string (not a file name) such as "Ää", it is displayed wrongly. Is there any way to tell gcc to encode my strings as UTF-8?
You don't need anything special from your C compiler for UTF-8 string literals. Proper support for it in the APIs you use is another matter, but that seems to be covered.
What you do need to do is to make sure your source files are actually saved in UTF-8, so that non-ASCII characters don't get converted to some other encoding when you edit or save the file.
The compiler doesn't need specific UTF-8 support, as long as it assumes 8-bit characters and the usual ASCII values for any syntactically significant characters; in other words, it's almost certainly not the problem.
gcc should interpret your source code and string literals as UTF-8 by default. If the execution environment expects something else, try -fexec-charset (and -finput-charset for the source encoding).
See also: http://gcc.gnu.org/onlinedocs/gcc-4.0.1/cpp/Implementation_002ddefined-behavior.html#Implementation_002ddefined-behavior
C has a wide string literal syntax (L"..."). Googling for "Unicode programming C" should get you started; two tutorials that seemed good are the one on developerworks and the one on cprogramming.com.
The general approach for your specific case would be using a wide string literal L"Ää", then converting that into UTF-8 with wcstombs().
I created a basic, simple C program and saved it with the .c extension using the "Unicode" encoding standard, and it did not compile. An error occurred saying "null character(s) ignored", but when I saved the same program using the ASCII standard it compiled just fine.
What is the reason behind this? My compiler is gcc.
Thank you.
There is no encoding called "Unicode". Unicode is not an encoding. It is a standard for many many things including several encodings.
The encodings are things like UTF-16LE and UTF-8. I presume you're using Notepad.exe on Windows; Microsoft calls UTF-16 little-endian "Unicode". It represents each ASCII character as two bytes, one of which is a NULL byte.
As far as I know GCC never expects the file to be in UTF-16 encoding, so it just ignores these intervening null bytes...
What you need to do is get a proper text editor that uses proper terminology and save your files as UTF-8 or whatever lesser encoding the operating system happens to use from day to day.
I have to draw a box in C, using ncurses;
First, I have defined some values for simplicity:
#define RB "\e(0\x6a\e(B" (ASCII 188,Right bottom, for example)
I have compiled with gcc, over Ubuntu, with -finput-charset=UTF-8 flag.
But if I try to print it with addstr or printw, I get the hex codes instead.
What am I doing wrong?
ncurses defines the values ACS_HLINE, ACS_VLINE, ACS_ULCORNER, ACS_URCORNER, ACS_LLCORNER and ACS_LRCORNER. You can use those constants in addch and friends, which should result in your seeing the expected box characters. (There's lots more ACS characters; you'll find a complete list in man addch.)
ncurses needs to know what it is drawing because it needs to know exactly where the cursor is all the time. Outputting console control sequences is not a good idea; if ncurses knows how to handle the sequence, it has its own abstraction for the feature and you should use that abstraction. The ACS ("alternate character set") defines are one of those abstractions.
A few issues:
if your program writes something like "\e(0\x6a\e(B" using addstr, then ncurses (any curses implementation) will translate the individual characters to printable form as described in the addch manual page.
ncurses supports line-drawing for commonly-used pseudo-graphics using symbols (such as ACS_HLINE) which are predefined characters with the A_ALTCHARSET attribute combined. You can read about those in the Line Graphics section of the addch manual page.
the code 0x6a is ASCII j, which (given a VT100-style mapping) is the lower-right corner. The curses symbol for that is ACS_LRCORNER.
you cannot write the line-drawing characters with addstr; instead addch, addchstr are useful. There are also functions oriented to line-drawing (see box and friends).
running in Ubuntu, your locale encoding is probably UTF-8. To make your program work properly, it should initialize the locale as described in the Initialization section of the ncurses manual page. In particular:
setlocale(LC_ALL, "");
Also, your program should link against the ncursesw library (-lncursesw) to use UTF-8, rather than just ncurses (-lncurses).
when compiling on Ubuntu, to use the proper header definitions, you should define _GNU_SOURCE.
BTW, I'm probably arriving somewhat late to the party, but I'll give you some insight that might or might not shed some light on your "box drawing" needs.
As of 2020 I'm involved in a fun project of my own mixing Swift + ncurses (under macOS for now, but thinking about mixing it with Linux). Apparently it works flawlessly.
The thing is, as I'm using Swift, internally it all reduces to importing .h and .c files from the Darwin.ncurses library the macOS Xcode/runtime offers.
That means (I hope) my newly acquired skills might be useful to you, because apparently we're using the very same .h and .c files for our ncurses needs (or at least they should be really similar).
Said that:
As of now, I've "ignored" the ACS_* corner chars (I can't find them under the Swift/Xcode/Darwin.ncurses runtime!) in favour of pure UTF-8 "corner chars", which also exist in the Unicode code space; look:
https://en.wikipedia.org/wiki/Box-drawing_character
What does it mean? Whenever I want some box-drawing chars I just copy & paste pure UTF-8 chars into my strings, and send those very strings to addstr.
Why does it work? Because, as someone also answered above, before initializing ncurses with initscr() I claim "I want proper locale support" in the form of a setlocale(LC_ALL, ""); line.
What did I achieve? Apparently pure magic, and a very comfortable kind, as I just copy-paste box chars inside my normal strings. At least under Darwin.ncurses/macOS Mojave I get not only box-drawing chars but full UTF-8 support.
Try the setlocale(LC_ALL, ""); initscr(); approach and tell us whether drawing boxes also works for you in a pure C environment using UTF-8 box-drawing chars.
Greetings and happy ncursing!
I have an exe file built from C code. There is a situation where a Russian string is passed as an argument to this exe.
When I call the exe with this argument, Task Manager shows the Russian string perfectly as a command-line argument.
But when I print that argument from my exe, it just prints ???.
How can I make my C program (hence the exe) handle Russian characters?
The answer depends on the target platform of your program. Traditionally, a C or C++ program begins its life in the main(....) function, which may receive byte-oriented strings as arguments (notice the char* in the declaration int main(int argc, char* argv[])). Byte-oriented means that the characters of a string are passed in a specific byte-oriented encoding, and that one character, for example Я or Ñ in UTF-8, may occupy more than one char.
Nowadays the most widely used encoding on Linux/Unix platforms is UTF-8, but some time ago other encodings were in use, such as ISO 8859-1, KOI8-R and many others. Most programs are still byte-oriented, since UTF-8 is mostly backward-compatible with the traditional C string APIs.
On the other hand, wide strings can be more convenient to use, because each character in a wide string occupies a predefined amount of space. Thus, for example, the following expression passes its assertion: std::wstring hello = L"Привет!¡Hola!"; assert(L'в' == hello[3]); (if UTF-8 char strings were used instead, the test would fail). So if your program performs many operations on individual letters rather than on strings as a whole, wide strings can be the solution.
To convert strings from multi-byte to a wide-character encoding, you may use the mbtowc family of functions, or that awesome fancy codecvt C++11 facility if your compiler supports it (likely it doesn't, as of mid-2014 :)).
On Windows, strings can also be passed as byte-oriented strings, and for Russian, CP1251 is most likely used (this depends on the operating system settings, but for Windows sold within Russia and the CIS it is the most popular variant). MSVC also has a language extension that lets an application programmer avoid all this complexity of manually converting byte strings to wide strings: a variant of the main() function that receives wide strings directly.
@user3159253 provided a good answer that I will complete with some more references:
Windows: Usually it uses wide characters.
Linux: Normally it uses UTF-8 encoding: please do NOT use wide chars in this case.
You are facing an internationalization (cf. i18n, l10n) issue.
You might need tools like iconv for character set conversion, and gettext for string translation.
Is there a portable way to convert a UTF-8 string in C to upper case? If not, what is the Linux way to do it?
The portable way of doing it would be to use a Unicode-aware library such as ICU. It seems like u_strToUpper might be the function you're looking for.
glib has g_utf8_strup().
The canonical way to do this is with wchar_t: if you have a string of wide characters, use towlower/towupper/towctrans on them (which will work if your locale is set correctly). So you need to take your UTF-8 string, convert it into a wide-character string, apply these wchar_t functions, and then convert back.
This is a giant PITA so you're probably better off using a supported, open-source Unicode library like ICU.
I am trying to print the microsecond symbol in C, but I don't get any data in the output.
printf("Micro second = \230");
I also tried using
int i = 230;
printf("Character %c", i);
but in vain! Any pointers?
That depends entirely on the character encoding used by the console you're using. If you're using Linux or Mac OS X, then most likely that encoding is UTF-8. The UTF-8 encoding for µ (Unicode code point U+00B5) is 'C2 B5' (two bytes), so you can print it like this:
printf("Micro second = \xC2\xB5"); // UTF-8
If you're on Windows, the default encoding used by the console is code page 437, where µ is encoded as 0xE6, so you would have to do this:
printf("Micro second = \xE6"); // Windows, assuming CP 437
Here is the standards-sanctioned way to print it in C:
printf("%lc", L'\u00b5');
If you're happy assuming UTF-8, though, I'd just hard-code "µ".
Since you work on Mac OS, you can rest assured that the terminal uses UTF-8. Therefore, bring up the characters palette (Edit -> Special characters...), find the microsecond symbol there, and put it right into your string.
int main()
{
printf("µs\n");
}
It will work as long as your source file is UTF-8 too. Otherwise, you'll need to find the code point for it (which should also be indicated in the characters palette). Mouse over the character to find its UTF-8 value (and excuse my French system).
This means you can use printf("\xc2\xb5") as an encoding-independent replacement for the character itself.
printf("%c", 230);
works for me, though it is encoding dependent rather than OS dependent (thanks zneak). I'm on Windows.
I had the same issue; I found the solution based on code page 437:
"\xB5"
I just included that code in a string and it worked for me.
My environment: Windows, C++. I was trying to convert a string to display in SFML.
Using a mix of C and C++14 for a work project, here's what worked for me:
printf("Micro second symbol = \xC2\xB5");
\xB5 and \230 both produced ▒ in my output. My project's equipment is apparently picky, because when I tried it elsewhere with some online IDEs and a version of Xcode on a Mac, it all worked fine.
It could be that if you're trying to print somewhere other than a console, say to an LCD screen, your screen would have to be programmed (with the character created and mapped) to include that character. Though from my experience that varies wildly depending on what you're creating.