How does ncurses output non-ascii characters? - c

I'd like to know how ncurses (a C library) manages to output characters like ├, despite them not (to the best of my knowledge) being part of ASCII.
I would have assumed it was just drawing them pixel by pixel, but you can copy/paste them out of the terminal (in macOS).

ncurses puts characters such as ├ on the screen by assuming that your locale environment variables (LC_ALL and/or LC_CTYPE) match the terminal on which you are displaying. The environment variables indicate the encoding (e.g., UTF-8). There are other encodings and terminals which support those encodings, but generally speaking you'll mostly see UTF-8. If the environment and terminal cooperate, things "just work":
at startup, ncurses checks for the locale which a program has initialized, via setlocale, and determines if that uses UTF-8. It uses that information later.
when a program adds character strings, e.g., using addstr, ncurses uses the character-type information (set as a side-effect of calling setlocale), and uses standard C library functions for combining sequences of bytes which make up a multi-byte character, and converting those into wide characters. It stores those wide characters internally, and
when writing to the terminal, ncurses reverses the process, converting from wide characters to use the encoding assumed to be supported by the terminal (assuming that your locale environment matches the terminal).
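The middle step above (turning bytes into wide characters) can be sketched with the standard C function mbrtowc. This is only an illustration of the kind of conversion described, not code taken from ncurses; the helper name decode_multibyte is made up for this example, and it assumes the program has already called setlocale so that LC_CTYPE names the encoding in use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of ncurses: decode a multibyte string
 * (in the encoding of the current LC_CTYPE locale) into wide characters.
 * Returns the number of wide characters stored in out, or (size_t)-1 if
 * the input contains an invalid or truncated sequence. */
size_t decode_multibyte(const char *s, wchar_t *out, size_t out_cap)
{
    assert(s != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* start in the initial shift state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0 && count < out_cap) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;                      /* reached a terminating null */
        out[count++] = wc;              /* one character, however many bytes */
        s += n;
        remaining -= n;
    }
    return count;
}
```

In a UTF-8 locale, the three bytes that encode ├ should come back as a single wide character; in the default "C" locale only plain ASCII is guaranteed to decode.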
However, the character ├ happens to be a special case. It is one of the graphic characters used for line-drawing, which predate Unicode and UTF-8. curses has names for these graphic characters, making it simple to refer to them, e.g., ACS_LTEE (the ├ is a left-tee).
Before UTF-8 came along to complicate things, developers came up with a scheme using a table of these graphic characters by adapting the escape sequences used for the VT100 (late 1970s) and the AT&T 4410 and 5410 terminals (apparently the early 1980s since the latter were in use by 1984) for drawing their graphic characters.
AT&T SystemV curses provided support for these graphic characters from the mid-1980s. BSD curses never did that...
Unicode (roughly 1990 and later) provided most of the same glyphs using a different encoding. There are a few omissions (the most noticeable are the scan lines above/below the one used for horizontal lines), but once UTF-8 got into use in the early 2000s, it was logical to extend ncurses to use these characters.
ncurses looks at the locale settings, but prefers using the terminal description for these graphic characters except for cases where that is known to not work — and will assume that the terminal can display the Unicode equivalents for these characters if the terminal is assumed to use UTF-8. It uses a table for this purpose (SystemV curses and its successor X/Open Curses didn't do any of this — NetBSD curses adapted the table from ncurses sometime after 2010).
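To make the idea of such a table concrete, here is a small sketch that maps a few VT100 line-drawing letters to the curses symbol name and the Unicode box-drawing code point. It is an illustrative subset written for this answer, not the table that ncurses actually ships; the struct and function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subset only: VT100 line-drawing letter, curses symbol name,
 * and the corresponding Unicode box-drawing code point. */
struct acs_map {
    char vt100;
    const char *curses_name;
    unsigned long unicode;
};

static const struct acs_map acs_table[] = {
    { 'j', "ACS_LRCORNER", 0x2518 },   /* lower right corner */
    { 'k', "ACS_URCORNER", 0x2510 },   /* upper right corner */
    { 'l', "ACS_ULCORNER", 0x250C },   /* upper left corner  */
    { 'm', "ACS_LLCORNER", 0x2514 },   /* lower left corner  */
    { 'n', "ACS_PLUS",     0x253C },   /* crossing lines     */
    { 'q', "ACS_HLINE",    0x2500 },   /* horizontal line    */
    { 't', "ACS_LTEE",     0x251C },   /* left tee           */
    { 'u', "ACS_RTEE",     0x2524 },   /* right tee          */
    { 'v', "ACS_BTEE",     0x2534 },   /* bottom tee         */
    { 'w', "ACS_TTEE",     0x252C },   /* top tee            */
    { 'x', "ACS_VLINE",    0x2502 },   /* vertical line      */
};

#define ACS_COUNT (sizeof acs_table / sizeof acs_table[0])

/* Look up by VT100 letter; returns 0 if the letter is not in the subset. */
unsigned long acs_unicode_by_letter(char vt100)
{
    for (size_t i = 0; i < ACS_COUNT; i++)
        if (acs_table[i].vt100 == vt100)
            return acs_table[i].unicode;
    return 0;
}

/* Look up by curses symbol name; returns 0 if the name is not in the subset. */
unsigned long acs_unicode_by_name(const char *curses_name)
{
    assert(curses_name != NULL);
    for (size_t i = 0; i < ACS_COUNT; i++)
        if (strcmp(acs_table[i].curses_name, curses_name) == 0)
            return acs_table[i].unicode;
    return 0;
}
```

With a table like this, a library can either send the VT100 letter with the alternate character set switched on, or emit the Unicode code point encoded as UTF-8, depending on what it believes the terminal supports.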
Further reading:
NCURSES_NO_UTF8_ACS
Line Graphics (in curs_addch(3x))
Line Graphics (in curs_add_wch(3x))

There is more than one version of ncurses, for more than one platform, and if you really want to know, check the source. However, none of them would draw a character pixel-by-pixel; that isn’t something a library running inside a terminal emulator does.
Modern versions of the C standard library, POSIX and ncurses all support writing wide characters to the console and conversion between wide and multibyte strings. Today, wide characters are normally UTF-16 or UTF-32 and multibyte strings are normally UTF-8. You can see the documentation for <wchar.h> and ncursesw for more information.
Note that C11 does have support for UTF-8 literals, through the u8 prefix.
A program that’s concerned about portability with systems where the local multibyte encoding is something other than UTF-8 can use another library such as the C++ standard library or ICU to convert between UTF-8 and wide-character strings, then display those with curses.
You might need to #define _XOPEN_SOURCE 700, or the appropriate value for the version of the standard you are targeting, and with some versions of the libraries, also #define _XOPEN_SOURCE_EXTENDED 1, to get your system libraries to let you use functions such as addwstr().
However, many programs might simply send strings of char encoded in UTF-8 to the console and assume it can handle them. I don’t recommend this approach, but it works on most Linux systems in 2017.

Related

How could I guarantee a terminal has Unicode/wide character support with NCURSES?

I am developing an NCURSES application for a little TUI (text user interface) exercise. Unfortunately, I do not have the option of using the ever-so-wonderful-and-faithful ASCII. My program uses a LOT of Unicode box drawing characters.
My program can already detect if the terminal is color-capable. I need to do something like:
if(!supportsUnicode()) //I prefer camel-case, it's just the way I am.
{
    fprintf(stderr, "This program requires a Unicode-capable terminal.\n\r");
    exit(1);
}
else
{
    //Yay, we have Unicode! some random UI-related code goes here.
}
This isn't just a matter of simply including ncursesw and just setting the locale. I need to get specific terminal info and actually throw an error if it's not gonna happen. I need to, for example, throw an error when the user tries to run the program in the lovely XTerm rather than the Unicode-capable UXTerm.
As noted, you cannot detect the terminal's capabilities reliably. For that matter, you cannot detect the terminal's support for color either. In either case, your application can only detect what you have configured, which is not the same thing.
Some people have had partial success detecting Unicode support by writing a UTF-encoded character and using the cursor-position report to see where the cursor is (see for example Detect how much of Unicode my terminal supports, even through screen).
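A rough sketch of the reply-parsing half of that trick follows. The idea, as I understand it, is to move to column 1, write one two-byte UTF-8 character (for example "\xC3\xA9"), send the cursor-position request ESC [ 6 n, and read back a reply of the form ESC [ row ; col R. Putting the terminal into raw mode and reading the reply are left out here, and both function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a cursor-position report of the form ESC [ row ; col R.
 * Returns 1 and fills in row/col on success, 0 if the reply does not match. */
int parse_cursor_report(const char *reply, int *row, int *col)
{
    assert(reply != NULL && row != NULL && col != NULL);

    char terminator = 0;
    if (sscanf(reply, "\033[%d;%d%c", row, col, &terminator) != 3)
        return 0;
    return terminator == 'R';
}

/* After printing one two-byte UTF-8 character starting at column 1, a
 * terminal that decoded it as a single character reports column 2; one that
 * showed two separate 8-bit characters reports column 3. */
int looks_like_utf8_terminal(int column_after_probe)
{
    return column_after_probe == 2;
}
```

This only tells you how the terminal treated that one character, which is why the approach is described as a partial success rather than a guarantee.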
Compiling/linking with ncursesw relies upon having your locale configured properly, with some workarounds for terminals (such as PuTTY) which do not support VT100 line-graphics when in UTF-8 mode.
Further reading:
Line Graphics curs_add_wch(3x)
NCURSES_NO_UTF8_ACS ncurses(3x)
You can't. ncurses(w) uses the terminfo database (or termcap on some older systems) to determine what capabilities a terminal has, and that lookup is keyed on the $TERM environment variable. There is no special value of that variable that indicates that a terminal supports Unicode; both XTerm and UXTerm set TERM=xterm. Many other terminal applications use that value of $TERM as well, including both ones that support Unicode and ones that don't. (Indeed, in many terminal emulators, it's possible to enable and disable Unicode support at runtime.)
If you want to start outputting Unicode text to the terminal, you will just have to take it on faith that the user's terminal will support that.
If all you want to do is output box drawing characters, though, you may not need Unicode at all: those characters are available as part of the VT100 graphical character set. You can output these characters in an ncurses application using the ACS_* constants (e.g., ACS_ULCORNER for ┌), or use a function like box() to draw a larger figure for you.
You can check which codeset the current locale uses with nl_langinfo(). From its POSIX description: "The nl_langinfo() function shall return a pointer to a string containing information relevant to the particular language or cultural area defined in the current locale." Note that this reports what the locale is configured for, not what the terminal can actually display.
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>
bool supportsUnicode(void)
{
    /* Adopt the locale from the user's environment (LC_ALL, LC_CTYPE, LANG).
     * Hard-coding "en_US.UTF-8" here would report UTF-8 on any system where
     * that locale is installed, regardless of what the user configured.
     */
    setlocale(LC_CTYPE, "");
    /* CODESET names the character encoding of the current locale. */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
Refer to htop source code which can draw lines with/without Unicode.

Print Unicode characters in C, using ncurses

I have to draw a box in C, using ncurses;
First, I have defined some values for simplicity:
#define RB "\e(0\x6a\e(B" (ASCII 188,Right bottom, for example)
I have compiled with gcc, over Ubuntu, with -finput-charset=UTF-8 flag.
But if I try to print with addstr or printw, I get the hex code.
What am I doing wrong?
ncurses defines the values ACS_HLINE, ACS_VLINE, ACS_ULCORNER, ACS_URCORNER, ACS_LLCORNER and ACS_LRCORNER. You can use those constants in addch and friends, which should result in your seeing the expected box characters. (There's lots more ACS characters; you'll find a complete list in man addch.)
ncurses needs to know what it is drawing because it needs to know exactly where the cursor is all the time. Outputting console control sequences is not a good idea; if ncurses knows how to handle the sequence, it has its own abstraction for the feature and you should use that abstraction. The ACS ("alternate character set") defines are one of those abstractions.
A few issues:
if your program writes something like "\e(0\x6a\e(B" using addstr, then ncurses (any curses implementation) will translate the individual characters to printable form as described in the addch manual page.
ncurses supports line-drawing for commonly-used pseudo-graphics using symbols (such as ACS_HLINE) which are predefined characters with the A_ALTCHARSET attribute combined. You can read about those in the Line Graphics section of the addch manual page.
the code 0x6a is ASCII j, which (given a VT100-style mapping) would be the lower right corner. The curses symbol for that is ACS_LRCORNER.
you cannot write the line-drawing characters with addstr; use addch or addchstr instead. There are also functions oriented to line-drawing (see box and friends).
running in Ubuntu, your locale encoding is probably UTF-8. To make your program work properly, it should initialize the locale as described in the Initialization section of the ncurses manual page. In particular:
setlocale(LC_ALL, "");
Also, your program should link against the ncursesw library (-lncursesw) to use UTF-8, rather than just ncurses (-lncurses).
when compiling on Ubuntu, to use the proper header definitions, you should define _GNU_SOURCE.
BTW, I'm probably arriving somewhat late to the party, but here is some insight that may or may not help with your "box drawing" needs.
As of 2020 I'm involved in a funny project on my own mixing Swift + Ncurses (under OSX for now, but thinking about mixing it with linux). Apparently it works flawlessly.
The thing is, as I'm using Swift, internally it all reduces to "importing .h and .c" files from some Darwin.ncurses library the MacOS Xcode/runtime offers.
That means (I hope) my newly acquired skills might be useful for you because apparently we're using the very same .h and .c files for our ncurses needs. (or at least they should be really similar)
That said:
As of now, I "ignored" ACS_corner chars (I can't find them under swift/Xcode/Darwin.ncurses runtime !!!) in favour of pure UTF "corner chars", which also exist in the unicode pointspace, look:
https://en.wikipedia.org/wiki/Box-drawing_character
What does it mean? Whenever I want to use some drawing box chars around I just copy&paste pure UTF-8 chars into my strings, and I send these very strings onto addstr.
Why does it work? Because as someone also answered above, before initializing ncurses with initscr(), I just claimed "I want a proper locale support" in the form of a setlocale(LC_ALL, ""); line.
What did I achieve? Apparently pure magic. And very comfortable one, as I just copy paste box chars inside my normal strings. At least under Darwin.ncurses/OSX Mojave I'm getting, not only "bounding box chars", but also full UTF8 support.
Try the "setlocale(LC_ALL, ""); initscr();" approach and tell us if "drawing boxes" works also for you under a pure C environment just using UTF8 bounding box chars.
Greetings and happy ncursing!

how to handle russian string as a command line argument in C program

I have an exe file build from C code. There is a situation where russian string is passed as an argument to this exe.
When I call exe with this argument, task manager shows russian string perfectly as command line argument.
But when I print that argument from my exe it just prints ???
How can I make my C program(hence exe) handle russian character?
The answer depends on a target platform for your program. Traditionally, a C- or C++-program begins its life from main(....) function which may have byte-oriented strings passed as arguments (notice char* in main declaration int main(int argc, char* argv[])). Byte-oriented strings mean that characters in a string are passed in a specific byte-oriented encoding and one character, for example Я or Ñ in UTF-8 may take more than 1 char.
Nowadays the most widely used encoding on Linux/Unix platforms is UTF-8, but some time ago other encodings were in use, such as ISO8859-1, KOI8-R and many others. Most programs are still byte-oriented, as the UTF-8 encoding is mostly backward-compatible with the traditional C string APIs.
On the other hand, wide strings can be more convenient to use, because each character in a wide string occupies a fixed amount of space. Thus, for example, the following assertion passes: std::wstring hello = L"Привет!¡Hola!"; assert(L'в' == hello[3]); (with UTF-8 char strings the test would fail). So if your program performs a lot of operations on individual letters rather than on strings as a whole, wide strings can be the solution.
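To see why indexing a UTF-8 char string by position does not give you "the fourth letter", here is a small C sketch that counts characters in a UTF-8 string and can be compared with strlen. The function name is made up, and it assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a UTF-8 string by counting every byte that is not a
 * continuation byte (continuation bytes have the bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is performed. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

For the word Привет written as the bytes "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82", strlen reports 12 while utf8_code_points reports 6, so byte index 3 lands in the middle of the second letter.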
To convert strings from multi-byte to a wide character encoding, you may use mbtowc functions family or that awesome fancy codecvt C++-11 facility if your compiler supports it (likely it doesn't as of mid-2014 :))
On Windows, strings can also be passed as byte-oriented strings, and for Russian CP1251 is most likely used (this depends on operating system settings, but for Windows sold within Russia and the CIS it is the most popular variant). MSVC also has a language extension that lets an application programmer avoid the complexity of manually converting byte strings to wide strings: a variant of the main() function (wmain) that receives wide strings directly.
@user3159253 provided a good answer that I will complement with some more references:
Windows: Usually it uses wide characters.
Linux: Normally it uses UTF-8 encoding: please do NOT use wide chars in this case.
You are facing an internationalization issue (cf. i18n, l10n).
You might need tools like iconv for character set conversion, and gettext for string translation.

Converting string in host character encoding to Unicode in C

Is there a way to portably (that is, conforming to the C standard) convert strings in the host character encoding to an array of Unicode code points? I'm working on some data serialization software, and I've got a problem because while I need to send UTF-8 over the wire, the C standard doesn't guarantee the ASCII encoding, so converting a string in the host character encoding can be a nontrivial task.
Is there a library that takes care of this kind of stuff for me? Is there a function hidden in the C standard library that can do something like this?
The C11 standard, ISO/IEC 9899:2011, has a new header <uchar.h> with rudimentary facilities to help. It is described in section §7.28 Unicode utilities <uchar.h>.
There are two pairs of functions defined:
c16rtomb() and mbrtoc16() — using type char16_t aka uint_least16_t.
c32rtomb() and mbrtoc32() — using type char32_t aka uint_least32_t.
The r in the name is for 'restartable'; the functions are intended to be called iteratively. The mbrtoc{16,32}() pair convert from a multibyte code set (hence the mb) to either char16_t or char32_t. The c{16,32}rtomb() pair convert from either char16_t or char32_t to a multibyte character sequence.
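A minimal sketch of calling mbrtoc32() iteratively is below. The helper name to_code_points is made up, the input is assumed to be in the encoding of the current locale, and the resulting char32_t values are only guaranteed to be Unicode code points when the implementation defines __STDC_UTF_32__.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: convert a multibyte string in the current locale's
 * encoding to char32_t values. Returns the number of values stored, or
 * (size_t)-1 on an invalid or truncated sequence. */
size_t to_code_points(const char *s, char32_t *out, size_t out_cap)
{
    assert(s != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0 && count < out_cap) {
        char32_t c;
        size_t n = mbrtoc32(&c, s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete input */
        if (n == (size_t)-3) {          /* output produced, no input consumed */
            out[count++] = c;
            continue;
        }
        if (n == 0)
            break;                      /* null character decoded */
        out[count++] = c;
        s += n;                         /* advance past the bytes consumed */
        remaining -= n;
    }
    return count;
}
```

Going the other way, c32rtomb() would write each value back out in the locale's multibyte encoding, so getting UTF-8 on the wire this way still depends on running in a UTF-8 locale.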
I'm not sure whether they'll do what you want. The <uchar.h> header and hence the functions are not available on Mac OS X 10.9.1 with either the Apple-provided clang or with the 'home-built' GCC 4.8.2, so I've not had a chance to investigate them. The header does appear to be available on Linux (Ubuntu 13.10) with GCC 4.8.1.
I think it likely that ICU is a better choice — it is, however, a rather large library (but that is because it does a thorough job of supporting Unicode in general and different locales in general).

Adding unicode support to a library for Windows

I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in utf8 encoded. Based on feedback it seems windows usually provides 3 function versions.
fooA() ANSI encoded strings
fooW() Unicode encoded strings
foo() string encoding depends on the UNICODE define
Is there an easy way to add this support without writing a lot of wrapper functions myself? Some of the functions are callable from the library and by the user and this complicates the situation a little.
I would like to keep support for utf8 strings as the library is usable on multiple operating systems.
The foo functions without the suffix are in fact macros. The fooA functions are obsolete and are simple wrappers around the fooW functions, which are the only ones that actually perform work. Windows uses UTF-16 strings for everything, so if you want to continue using UTF-8 strings, you must convert them for every API call (e.g. with MultiByteToWideChar).
For the public interface of your library, stick to exactly one encoding, either UTF-16, UTF-32 or UTF-8. Everything else (locale-dependent or OS-dependent encodings) is too complex for the callers. You don't need UTF-8 to be compatible with other OSes: many platform-independent libraries such as ICU, Qt or the Java standard libraries use UTF-16 on all systems. I think the choice between the three Unicode encodings depends on which OS you expect the library will be used most: If it will mostly be used on Windows, stick to UTF-16 so that you can avoid all string conversions. On Linux, UTF-8 is a common choice as a filesystem or terminal encoding (because it is the only Unicode encoding with an 8-bit-wide character unit), but see the note above regarding libraries. OS X uses UTF-8 for its POSIX interface and UTF-16 for everything else (Carbon, Cocoa).
Some notes on terminology: The words "ANSI" and "Unicode" as used in the Microsoft documentation are not in accordance with what the international standards say. When Microsoft speaks of "Unicode" or "wide characters", they mean "UTF-16" or (historically) the BMP subset thereof (with one code unit per code point). "ANSI" in Microsoft parlance means some locale-dependent legacy encoding which is completely obsolete in all modern versions of Windows.
If you want a definitive recommendation, go for UTF-16 and the ICU library.
Since your library already requires UTF-8 encoded strings, then it is already fully Unicode enabled, as UTF-8 is a loss-less Unicode encoding. If you are wanting to use your library in an environment that normally uses UTF-16 or even UTF-32 strings, then it could simply encode to, and decode from, UTF-8 when talking with your library. Otherwise, your library would have to expose extra UTF-16/32 functions that do those encoding/decoding operations internally.
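If you go the wrapper route, the conversion itself is not much code. Below is a portable sketch of a UTF-8 to UTF-16 converter; the function name is made up, and validation is deliberately minimal (it does not reject overlong forms or encoded surrogates), so treat it as a starting point rather than a hardened implementation. On Windows you could call MultiByteToWideChar instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a null-terminated UTF-8 string to UTF-16 code units.
 * Returns the number of code units written (no terminator is added), or
 * (size_t)-1 on malformed input or when out is too small. */
size_t utf8_to_utf16(const char *in, uint16_t *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    while (*p != 0) {
        uint32_t cp;
        int extra;                      /* continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;         /* stray continuation or bad lead byte */
        p++;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)    /* also catches a premature null */
                return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }

        if (cp >= 0x10000) {            /* needs a surrogate pair */
            if (cp > 0x10FFFF || n + 2 > out_cap)
                return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 > out_cap)
                return (size_t)-1;
            out[n++] = (uint16_t)cp;
        }
    }
    return n;
}
```

A fooW-style entry point would then do the reverse conversion on its arguments and call the existing UTF-8 function, so the core of the library stays in one encoding.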
