Hebrew support in a C ncurses application - c

We have a C ncurses-based application (it runs on most flavours of Unix, but we favour RHEL). We've got Unicode support in there, but now we have to provide a Hebrew version of the application. Does anyone know a process we could go through to convert the program? It mainly gets and stores data from Oracle, which can support Hebrew, so there should not be a problem there.
It really is just the display of the text that is the issue.

It is important to know which terminal your users run, because that determines how you should write the code. Some terminals support BiDi (bidirectional text), which means they automatically display Hebrew/Arabic text right to left.
That has its own problems; you can check what your app would look like using mlterm.
Basically it reverses the lines that contain Hebrew text while keeping what is interpreted as English characters LTR. A Hebrew character printed at 10,70 will appear at 10,10. You can use the Unicode LTR/RTL control characters to try to force direction for things that break your formatting, but at least on mlterm, while they work, they print garbage characters.
If they use regular terminals with Unicode support, however, you have to reverse the characters yourself.
Then of course, if the program is run on a bidirectional terminal, the text would be reversed again and the formatting lost.
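As a rough illustration of reversing the characters yourself for a terminal without BiDi support, here is a minimal sketch. The function name reverse_utf8 is made up for this example, and it assumes the input is valid UTF-8 consisting only of right-to-left text. It is not the Unicode BiDi algorithm: embedded digits or Latin words would come out reversed too, so a real application would more likely hand the string to a BiDi library before passing it to ncurses.

```c
#include <stddef.h>
#include <string.h>

/* Reverse a UTF-8 string code point by code point into dst.
 * dst must have room for strlen(src) + 1 bytes.
 * Assumption: src is valid UTF-8 and contains only RTL text. */
void reverse_utf8(const char *src, char *dst)
{
    size_t end = strlen(src); /* one past the last byte of the current code point */
    size_t out = 0;           /* bytes written to dst so far */

    while (end > 0) {
        size_t start = end - 1;
        /* Step back over continuation bytes (10xxxxxx) to the lead byte. */
        while (start > 0 && ((unsigned char)src[start] & 0xC0) == 0x80)
            start--;
        /* Copy the whole code point so its bytes stay in order. */
        memcpy(dst + out, src + start, end - start);
        out += end - start;
        end = start;
    }
    dst[out] = '\0';
}
```

The reversed buffer can then be printed at the intended column with the usual ncurses calls. Since the result would be wrong on a BiDi terminal, it probably makes sense to put the reversal behind a user-selectable option.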

Related

Clarification on Winapi Paths and Filename (W functions and A functions)

I have tried to work out why and when to use the W WinAPI functions rather than the A ones (W meaning wide char, A meaning ASCII, right?).
I made a simple example in which I retrieve the temp path for the current user like this:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a Russian name, so it is written in Cyrillic. printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would get an error, or a mess of symbols, from GetTempPathA, since the current user name is Unicode, but I figured out that Cyrillic characters are actually included in the extended ASCII character set. So I have a question: if my software extracts data into the temp folder of the current user, and that user is Chinese (assuming he has Chinese characters in his user name), will I get a mess or an error using the GetTempPathA version? Should I always use the W-suffixed functions in production software that works with WinAPI directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out of luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
UTF-8 is similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course you need the wide version. The ASCII version technically can't even handle more than 256 distinct characters. Cyrillic is included in the extended ASCII set (if that's your localization), while Chinese isn't and can't be, because of the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it will only work properly if the executing machine has a matching localization. On a machine with a non-Cyrillic localization, the text will be displayed according to whatever the localization settings define.

Using C I would like to format my output such that the output in the terminal stops once it hits the edge of the window

If you type ps aux into your terminal and make the window really small, the output of the command will not wrap and the format is still very clear.
When I use printf and output my 5 or 6 strings, sometimes the length of my output exceeds that of the terminal window and the strings wrap to the next line which totally screws up the format. How can I write my program such that the output continues to the edge of the window but no further?
I've tried searching for an answer to this question but I'm having trouble narrowing it down and thus my search results never have anything to do with it so it seems.
Thanks!
There are functions that can let you know information about the terminal window, and some others that will allow you to manipulate it. Look up the "ncurses" or the "termcap" library.
A simple approach to solving your problem is to get the terminal window size (especially the width), and then format your output accordingly.
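A minimal sketch of that approach, using the TIOCGWINSZ ioctl directly instead of ncurses or termcap (it is available on Linux, the BSDs and macOS). The helper names terminal_width and clip_line are invented for this example. It counts bytes, not display columns, so it is only right for plain ASCII output.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Return the terminal width in columns, or `fallback` if stdout is not a
 * terminal (for example when piped) or the size cannot be determined. */
int terminal_width(int fallback)
{
    struct winsize ws;

    if (isatty(STDOUT_FILENO) &&
        ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == 0 && ws.ws_col > 0)
        return ws.ws_col;
    return fallback;
}

/* Copy at most `width` bytes of `line` into `out` (capacity `cap`),
 * always NUL-terminating the result. */
void clip_line(const char *line, int width, char *out, size_t cap)
{
    if (cap == 0)
        return;
    if (width < 0)
        width = 0;
    /* The %.*s precision limits how many bytes of the string are used. */
    snprintf(out, cap, "%.*s", width, line);
}
```

For direct output the same precision trick works without a buffer: printf("%.*s\n", terminal_width(80), line). The width is read once per call here; a long-running program would also want to react to SIGWINCH when the window is resized.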
There are two possible ways to fix your problem:
Turn off line wrapping in your terminal emulator (if it supports it).
Look into the Curses library. Applications like top or vim use the Curses library for screen formatting.
You can find, or at least guess, the width of the terminal using methods that other answers describe. That's only part of the problem, however: the tricky bit is formatting the output to fit the console. I don't believe there's any alternative to reading the text word by word, and moving the output to the next line when a word would overflow the width. You'll need to implement a method to detect where the white-space is, allowing for the fact that there could be multiple white spaces in a row. You'll need to decide how to handle line-breaking white-space, like CR/LF, if you have any. You'll need to decide whether you can break a word on punctuation (e.g., a hyphen). My approach is to use a simple finite-state machine, where the states are "At start of line", "in a word", "in whitespace", etc., and the characters (or, rather, character classes) encountered are the events that change the state.
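To make the state-machine idea concrete, here is a deliberately small sketch with only two states (in whitespace, in a word). It is not the code referred to in this answer. The name wrap_text is invented, it handles single-byte text only, it collapses every run of whitespace (CR/LF included) into one space or one line break, and it never splits a word, so a word longer than the width still overflows.

```c
#include <ctype.h>
#include <stddef.h>

enum wrap_state { IN_SPACE, IN_WORD };

/* Greedy word wrap of `text` into `out` (capacity `cap`), keeping lines
 * to at most `width` bytes where possible.
 * Returns the number of bytes written, excluding the terminating NUL. */
size_t wrap_text(const char *text, size_t width, char *out, size_t cap)
{
    enum wrap_state state = IN_SPACE;
    size_t col = 0;           /* bytes already on the current output line */
    size_t n = 0;             /* bytes written to out */
    const char *word = NULL;  /* start of the word being scanned */
    const char *p;

    if (cap == 0)
        return 0;

    for (p = text; ; p++) {
        /* Character class: end of input counts as whitespace. */
        int is_space = (*p == '\0') || isspace((unsigned char)*p);

        if (state == IN_SPACE && !is_space) {
            word = p;                         /* event: a word begins */
            state = IN_WORD;
        } else if (state == IN_WORD && is_space) {
            size_t len = (size_t)(p - word);  /* event: a word ends */
            size_t i;

            if (col > 0 && col + 1 + len > width) {
                if (n + 1 < cap) out[n++] = '\n';  /* word would overflow */
                col = 0;
            } else if (col > 0) {
                if (n + 1 < cap) out[n++] = ' ';   /* separator on same line */
                col++;
            }
            for (i = 0; i < len && n + 1 < cap; i++)
                out[n++] = word[i];
            col += len;
            state = IN_SPACE;
        }
        if (*p == '\0')
            break;
    }
    out[n] = '\0';
    return n;
}
```

Breaking on hyphens or preserving paragraph breaks would mean adding character classes and states, which is where most of the real complexity comes from.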
A particular complication when working in C is that there is little-to-no built-in support for multi-byte characters. That's fine for text which you are certain will only ever be in English, and use only the ASCII punctuation symbols, but with any kind of internationalization you need to be more careful. I've found that it's easiest to convert the text into some wide format, perhaps UTF-32, and then work with arrays of 32-bit integers to represent the characters. If your text is UTF-8, there are various tricks you can use to avoid having to do this conversion, but they are a bit ugly.
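For the conversion to a wide format, a hand-rolled decoder is one option. The sketch below (utf8_to_utf32 is a name made up here) turns UTF-8 into an array of 32-bit code points. It assumes mostly well-formed input: a bad lead byte or a truncated sequence becomes U+FFFD, but overlong encodings and surrogates are not rejected, so treat it as a starting point, not a validator.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into UTF-32 code points.
 * Writes at most `cap` code points to `out` and returns how many
 * were written. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0' && n < cap) {
        uint32_t cp;
        int extra;  /* number of continuation bytes still expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; }
        p++;

        while (extra > 0) {
            if ((*p & 0xC0) != 0x80) {  /* sequence cut short */
                cp = 0xFFFD;
                break;
            }
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
            extra--;
        }
        out[n++] = cp;
    }
    return n;
}
```

Keep in mind that one code point is not always one screen column: combining marks take none and many CJK characters take two, so wrapping by code point count is still an approximation.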
I have some code I could share, but I don't claim it is production quality, or even comprehensible. This simple-seeming problem is actually far more complicated than first impressions suggest. It's easy to do badly, but difficult to do well.

Do modern terminals generally render all utf-8 characters correctly?

I am writing an application in C that will be run in a terminal, and it would be handy but not necessary to use some of the less-used Unicode characters. From my experimentation, I have not had any trouble rendering them. However, I would not use any non-ASCII characters if they were a likely source of trouble in the future.
So, in short, can I count on just about any terminal or terminal emulator in the modern *nix world (mainly linux, freebsd, and osx) to properly render arbitrary utf-8 characters?
If I cannot make such an assumption, there are particular subsets of unicode characters defined for various purposes, so would some such subset at least be reliably rendered in any likely modern *nix terminal or terminal emulator?
NOTE: When I say arbitrary, I do mean arbitrary: any unicode characters. But for completeness of my question, I will note that I am primarily interested in arrows and mathematical characters, this link has lists of both: https://en.wikipedia.org/wiki/Unicode_symbols.
No, you should not assume that. Even in a modern system, the set of fonts installed, the font used by the terminal application, and environment variables such as LANG, LC_*, etc. may influence whether certain characters can be displayed correctly on the terminal or not.
You might be able to make reasonable guesses based on the values of the TERM, LANG, and LC_* environment variables as to what is supported, but it's still going to be a guess. I'd suggest either not relying on it at all or providing some means of enabling/disabling the use (via an environment variable and/or via command-line flags to the application).
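A sketch of that kind of guess plus an off switch, using the POSIX nl_langinfo(CODESET) call. The environment variable MYAPP_ASCII and the function names are hypothetical, chosen just for this example. Note that this only tells you the locale claims UTF-8; it says nothing about whether the terminal's font actually has the glyph.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return nonzero if the current locale's character encoding is UTF-8.
 * Call setlocale(LC_ALL, "") once at startup first, otherwise the
 * program stays in the "C" locale and LANG/LC_* are never consulted. */
int locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);

    return codeset != NULL &&
           (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}

/* Decide whether to emit non-ASCII symbols. MYAPP_ASCII is a made-up
 * override: if it is set at all, the user has asked for plain ASCII. */
int unicode_enabled(void)
{
    if (getenv("MYAPP_ASCII") != NULL)
        return 0;
    return locale_is_utf8();
}

/* Pick a right arrow: U+2192 as UTF-8 bytes, or an ASCII stand-in. */
const char *right_arrow(int use_unicode)
{
    return use_unicode ? "\xE2\x86\x92" : "->";
}
```

Typical use would be printf("a %s b\n", right_arrow(unicode_enabled())), so every special symbol has an ASCII fallback decided in one place.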
For the most part, this depends on the font, not the terminal. But there are a couple of things the terminal software has to take into account. For example, halfwidth and fullwidth forms of CJK characters.
Also, Unicode characters are added on a regular basis. There's no way that every font and terminal software is automatically updated as soon as a new version of the Unicode standard is released.
In general, you should assume that there are always Unicode characters that are not rendered correctly, even on a modern terminal.

How to manage unicode inputs in terminal unix in C89

I did some research on this subject (Unicode input in C89), but I didn't find everything I wanted to know.
Can someone explain how to handle the whole keyboard (UTF-8) with some basic operations (looking only at the binary values)? I couldn't find out how to tell character keys from function keys.
Thanks a lot.
Function keys are not sent as Unicode, but most likely as ANSI escape sequences: VT100, VT52, etc., as used by 1980s devices (the Atari ST used VT52, for instance).
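Looking only at the binary value of each byte, a rough classification could look like the sketch below, written in C89 style. The names are invented for this example, and it assumes the terminal sends UTF-8 for characters and ESC-prefixed sequences for function keys, which is common but not guaranteed.

```c
enum key_class {
    KC_ESCAPE,     /* 0x1B: usually the start of a function-key sequence */
    KC_CONTROL,    /* other control bytes (Enter, Tab, Ctrl-x) and DEL */
    KC_ASCII,      /* printable ASCII, a complete one-byte character */
    KC_UTF8_LEAD,  /* first byte of a multi-byte UTF-8 character */
    KC_UTF8_CONT,  /* continuation byte of the form 10xxxxxx */
    KC_INVALID     /* byte that never appears in valid UTF-8 */
};

/* Classify one input byte by its value alone. */
enum key_class classify_byte(unsigned char b)
{
    if (b == 0x1B)
        return KC_ESCAPE;
    if (b < 0x20 || b == 0x7F)
        return KC_CONTROL;
    if (b < 0x80)
        return KC_ASCII;
    if (b < 0xC0)
        return KC_UTF8_CONT;
    if (b >= 0xC2 && b <= 0xF4)
        return KC_UTF8_LEAD;
    return KC_INVALID;
}

/* Total number of bytes in the character that starts with lead byte b,
 * or 0 if b cannot start a character. */
int utf8_length(unsigned char b)
{
    if (b < 0x80)
        return 1;
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;
}
```

On a VT100-style terminal the up arrow typically arrives as the three bytes ESC [ A, so after KC_ESCAPE you keep reading until the sequence is complete. A bare Esc key press sends the same 0x1B byte, which is why programs usually apply a short timeout to tell the two apart. Reading keys one byte at a time also requires switching the terminal out of line-buffered mode (termios on Unix), which is outside C89 itself.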

Background c program for keyboard mapping

I have installed a Bramma TTF file on my Windows 8 system. Using the Windows Character Map, I was able to find the individual character codes. A screenshot of the map is attached below. We can see at the bottom right that the character code for "!" is 0x21. Similarly, I can find the character codes of all the other letters.
Now I have defined a character mapping for this font with my US keyboard layout. For example, I mapped the physical character 'a' on the keyboard to the character shown in the 3rd row and 1st column. [whenever I hit 'a' on the keyboard, the corresponding character has to be displayed]
I would like to write a background C program that listens for key presses and, following my previously defined character mapping, outputs the mapped character. That is, when I hit the character 'a' on the keyboard, it should return the mapped character.
Can anyone help me solve this problem, or else just give me a lead towards the solution?
I'm somewhat familiar with this kind of font; they have popped up in other questions on SO, asked by users who tried to deal with the consequences of using such a font. Those consequences are rather grave.
The biggest problem is that this font is not Unicode compatible. The actual string that underlies the text that's rendered to the screen is very different, containing characters from the ANSI character set. What goes horribly wrong is when the program that displays these strings saves the data. The data file contains the original strings, a good example is an Excel spreadsheet. This spreadsheet just contains gibberish when it is read by any other program. Especially bad when read by a program on another machine that doesn't have the same font installed. Very, very painful.
You are in fact making this problem worse by even destroying the normal mapping between keyboard to ANSI character. The 1st character in the 3rd row is produced by typing a capital I (eye) on the keyboard.
The message is clear: don't do this. Windows supports Unicode compatible fonts with Indic scripts well, fonts like Sylfaen, Mangal, Latha. All of which are available on my Windows 8 machine, about ten thousand miles away from where they are normally used. It also has Indic keyboard layouts available under the Language applet, I just picked one as an example:
Well, it is your funeral. You don't have to write a C program to translate these keystrokes, you need a custom keyboard layout. It is a DLL. You normally need the DDK to build them, but there is simple tooling available to auto-generate them. It doesn't get any easier than with MKLC, the Microsoft Keyboard Layout Creator. Web page and download link are here.
You should probably use AutoHotkey.
With this application, you can listen for a set of keys and then send a different set of keys.
This can be used as an implementation of "autocorrect",
e.g.
:*:btw::By the way `
will autocorrect btw to By the way.
AutoHotkey supports quite complicated scripts, and many scripts are already available online.
On another note, if you only want an English keyboard to produce Malayalam Unicode characters, you may also consider a popular piece of software called Baraha.
Google's Virtual Keyboard (also works with your physical keyboard)
https://code.google.com/apis/ajax/playground/#virtual_keyboard
http://www.tavultesoft.com/ allows you to create keyboards for MSWindows and the web. Over 1000 keyboards are readily provided. There is a developer and a user version. With the developer version you may create installation programs which install fonts, keyboards, keymaps and documentation.
