How to show arbitrary characters in C? - c

warning C4566: character represented by universal-character-name '\u2E81' cannot be represented in the current code page (936)
Sometimes we need to display text in various languages, such as Russian, Japanese, and so on.
But it seems a single code page can only show the characters of one language. How can I show characters from several languages at the same time?

Since you're (apparently) using VC++, you probably want to switch to the UTF-8 code page. You'll also need to set the font to one that has glyphs for all the code points you care about (many have few if any beyond the first 256).

Related

Clarification on Winapi Paths and Filename (W functions and A functions)

I have been trying to understand why and when it matters to use the W WinAPI functions instead of the A ones (W meaning wide char and A meaning ASCII, right?).
I made a simple example that retrieves the temp path for the current user like this:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a Russian name, so it is written in Cyrillic. printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I expected an error or a mess of symbols from GetTempPathA, since the current user's name is Unicode, but then I figured out that Cyrillic characters are actually included in the extended ASCII character set. So my question is: if my software extracts data into the temp folder of the current user, and that user is Chinese (assuming they have Chinese characters in their user name), will I get a mess or an error from the GetTempPathA version? Should I always use the W-suffixed functions in production software that works with WinAPI directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
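As a small illustration of the 7-bit point, here is a minimal sketch of a check for whether a narrow string stays inside plain ASCII. The helper name is_pure_ascii is hypothetical. The assumption is that bytes 0x00 to 0x7F mean the same thing in the common Windows ANSI code pages, so a string that passes this check should not depend on which code page is active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte of s is in the 7-bit ASCII
   range 0x00..0x7F. Any byte above 0x7F depends on the code page. */
bool is_pure_ascii(const char *s)
{
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if (*p > 0x7F)
            return false;
    }
    return true;
}
```

A path that fails this check is one where the -A and -W results can start to differ from machine to machine.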
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out of luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
UTF-8 is similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course, you need the wide version. The A version, with a single-byte code page, cannot even technically handle more than 256 distinct characters. Cyrillic is included in the extended set (if that is your localization), while Chinese is not, and cannot be, because of the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it only works properly if the executing machine has a matching localization. On a machine with a non-Cyrillic localization, the text will be displayed according to whatever the localization settings define.

Using C I would like to format my output such that the output in the terminal stops once it hits the edge of the window

If you type ps aux into your terminal and make the window really small, the output of the command will not wrap and the format is still very clear.
When I use printf and output my 5 or 6 strings, sometimes the length of my output exceeds that of the terminal window and the strings wrap to the next line which totally screws up the format. How can I write my program such that the output continues to the edge of the window but no further?
I've tried searching for an answer to this question but I'm having trouble narrowing it down and thus my search results never have anything to do with it so it seems.
Thanks!
There are functions that can let you know information about the terminal window, and some others that will allow you to manipulate it. Look up the "ncurses" or the "termcap" library.
A simple approach to solving your problem is to get the terminal window size (especially the width) and then format your output accordingly.
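A minimal sketch of that approach, assuming a POSIX-like system where the TIOCGWINSZ ioctl is available (it is widely supported but not part of POSIX itself). The function names terminal_width and clip_line are hypothetical, and the clipping assumes one byte per screen column, which only holds for plain ASCII text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Width of the terminal attached to stdout, or `fallback` when stdout
   is not a terminal (for example, when output is piped to a file). */
int terminal_width(int fallback)
{
    struct winsize ws;
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == 0 && ws.ws_col > 0)
        return ws.ws_col;
    return fallback;
}

/* Hypothetical helper: copy at most `width` bytes of `line` into `out`
   (capacity `outsz`), always NUL-terminating the result.
   Returns the number of bytes copied. */
size_t clip_line(const char *line, int width, char *out, size_t outsz)
{
    assert(line != NULL && out != NULL && outsz > 0);
    size_t n = strlen(line);
    if (width < 0)
        width = 0;
    if (n > (size_t)width)
        n = (size_t)width;
    if (n > outsz - 1)
        n = outsz - 1;
    memcpy(out, line, n);
    out[n] = '\0';
    return n;
}
```

When the line is printed directly, the same clipping can be had from printf alone with a precision on %s, as in printf("%.*s\n", terminal_width(80), line).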
There are two possible ways to fix your problem:
1. Turn off line wrapping in your terminal emulator (if it supports it).
2. Look into the Curses library. Applications like top or vim use the Curses library for screen formatting.
You can find, or at least guess, the width of the terminal using the methods that other answers describe. That's only part of the problem, however -- the tricky bit is formatting the output to fit the console. I don't believe there's any alternative to reading the text word by word and moving the output to the next line when a word would overflow the width. You'll need to implement a method to detect where the whitespace is, allowing for the fact that there could be several whitespace characters in a row. You'll need to decide how to handle line-breaking whitespace, like CR/LF, if you have any. You'll need to decide whether you can break a word on punctuation (e.g., a hyphen). My approach is to use a simple finite-state machine, where the states are "at start of line", "in a word", "in whitespace", etc., and the characters (or, rather, character classes) encountered are the events that change the state.
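The word-by-word idea can be sketched roughly as follows. This is a reduced illustration, not the answerer's code: it uses only two states, treats every whitespace character (including CR/LF) as a plain separator, never breaks inside a word, and assumes ASCII text where one byte is one column. The names wrap_text and append are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum wrap_state { IN_SPACE, IN_WORD };

/* Append n bytes to out if they fit, leaving room for the final NUL. */
static size_t append(char *out, size_t used, size_t cap,
                     const char *src, size_t n)
{
    if (used + n < cap) {
        memcpy(out + used, src, n);
        used += n;
    }
    return used;
}

/* Re-flow `text` into lines of at most `width` columns, written to `out`.
   A capacity of strlen(text) + 1 is always enough. Words longer than
   `width` get a line of their own and overflow it.
   Returns the number of bytes written, excluding the NUL. */
size_t wrap_text(const char *text, size_t width, char *out, size_t cap)
{
    assert(text != NULL && out != NULL && cap > 0);
    enum wrap_state state = IN_SPACE;
    const char *word = NULL;  /* start of the word being scanned */
    size_t used = 0;          /* bytes written to out so far */
    size_t col = 0;           /* columns used on the current line */

    for (const char *p = text; ; p++) {
        int is_break = (*p == '\0') || isspace((unsigned char)*p);
        if (state == IN_SPACE && !is_break) {
            word = p;                         /* a new word begins */
            state = IN_WORD;
        } else if (state == IN_WORD && is_break) {
            size_t len = (size_t)(p - word);  /* the word just ended */
            if (col > 0 && col + 1 + len <= width) {
                used = append(out, used, cap, " ", 1);
                col += 1 + len;
            } else {
                if (col > 0)
                    used = append(out, used, cap, "\n", 1);
                col = len;
            }
            used = append(out, used, cap, word, len);
            state = IN_SPACE;
        }
        if (*p == '\0')
            break;
    }
    out[used] = '\0';
    return used;
}
```

With a width of 10, the input "the quick  brown fox\njumps" comes out as three lines: "the quick", "brown fox" and "jumps". Handling hyphens or preserving existing line breaks would need more states, as the answer suggests.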
A particular complication when working in C is that there is little-to-no built-in support for multi-byte characters. That's fine for text which you are certain will only ever be in English, and use only the ASCII punctuation symbols, but with any kind of internationalization you need to be more careful. I've found that it's easiest to convert the text into some wide format, perhaps UTF-32, and then work with arrays of 32-bit integers to represent the characters. If your text is UTF-8, there are various tricks you can use to avoid having to do this conversion, but they are a bit ugly.
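For the conversion to a wide format, a minimal sketch of decoding UTF-8 into an array of 32-bit code points might look like this. The function name utf8_to_utf32 is hypothetical. The sketch replaces invalid lead bytes and truncated sequences with U+FFFD, but it does not reject overlong forms or surrogates, so it is not a validating decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode NUL-terminated UTF-8 into 32-bit code points.
   Stores at most `cap` code points and returns how many were stored. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0 && n < cap) {
        uint32_t cp;
        int extra;  /* number of continuation bytes still expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; }
        p++;

        while (extra > 0) {
            if ((*p & 0xC0) != 0x80) {  /* sequence ended too early */
                cp = 0xFFFD;
                break;
            }
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
            extra--;
        }
        out[n++] = cp;
    }
    return n;
}
```

Once the text is an array of code points, counting characters or looking for whitespace becomes a matter of indexing. Column width is still a separate question, since some characters are double-width or combining.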
I have some code I could share, but I don't claim it is production quality, or even comprehensible. This simple-seeming problem is actually far more complicated than first impressions suggest. It's easy to do badly, but difficult to do well.

UTF-8 and ISO 8859-9

I have been reading about UTF-8 and Unicode for the last couple of days, and just when I thought I had it all figured out, I got confused when I read that UTF-8 and ISO 8859-9 are not compatible.
I have a database that stores data as UTF-8. I have a requirement from a customer to support various ISO 8859-x code pages (i.e. 8859-3, 8859-2, and also ISO 6937). My questions are:
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using unicode?
I understand that Unicode can support all characters and that it is the way to go. However, my customer is a European entity that wants us to use ISO code pages, so my question is: how can I support multiple client use cases using existing UTF-8 data? Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use case? Is that all I need to do, or is there more to it?
By the way, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data. If so, how is the character set applied? Do I have to write code to return an 8859-x response, or is it enough to set the appropriate character set value in the response header?
The topic is pretty vast, so let me simplify (a lot, even too much) and answer point by point.
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using unicode?
Yes, you're using Unicode, and you're storing Unicode characters (formally called code points) using the UTF-8 encoding. Please note that Unicode defines rules and sets of characters (even if the same word is often used as a synonym for the UTF-16 encoding); the way you encode such characters in a byte stream is another thing.
... However, my customer is a European entity that wants us to use ISO code pages, so my question is: how can I support multiple client use cases using existing UTF-8 data?
Of course, if you store Unicode characters (it doesn't matter with which encoding), then you can always convert them to a specific 8-bit code page (or to any other encoding). OK, this isn't formally always true (because Unicode doesn't define every possible character in use now or in the past), but I would ignore this point...
... Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use case?
All characters from the ISO 8859 code pages are also available in Unicode, so from this point of view each code page is a subset. Of course the encoded values are different, so they need to be converted. If you know the code page each customer needs, you can always convert Unicode text encoded as UTF-8 into text in that code page, as long as every character in the text exists in it.
Is that all I need to do, or is there more to it?
Just that. The code could be pretty short, but you didn't tag your question with any language, so I won't provide links or examples. For a rudimentary example, take a look at this post.
Let me also say one important thing: if they want to consume your data in an 8-bit encoding with their code page, then you have to perform a conversion. If they can consume UTF-8 data directly (or you present it somehow in your own application), then you don't have to worry about code pages (that's why we're using Unicode) because, no matter the encoding, the Unicode character set contains all the characters they may need.
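To make the conversion concrete, here is a minimal sketch in C for one target, ISO 8859-9. It relies on two facts: ISO 8859-1 bytes equal the code points U+0000 to U+00FF, and ISO 8859-9 differs from 8859-1 only in six positions, where Turkish letters replace Icelandic ones. The function names are hypothetical, characters missing from the code page become '?', and the UTF-8 handling does no validation. In practice a library such as iconv would usually do this job.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map one Unicode code point to its ISO 8859-9 byte, or return -1 if
   the character does not exist in that code page. */
int iso8859_9_from_codepoint(uint32_t cp)
{
    switch (cp) {
    /* Six Turkish letters take over positions used by 8859-1. */
    case 0x011E: return 0xD0;  /* G with breve, capital */
    case 0x0130: return 0xDD;  /* I with dot above, capital */
    case 0x015E: return 0xDE;  /* S with cedilla, capital */
    case 0x011F: return 0xF0;  /* g with breve, small */
    case 0x0131: return 0xFD;  /* dotless i, small */
    case 0x015F: return 0xFE;  /* s with cedilla, small */
    /* The 8859-1 letters at those positions are therefore absent. */
    case 0x00D0: case 0x00DD: case 0x00DE:
    case 0x00F0: case 0x00FD: case 0x00FE:
        return -1;
    default:
        return cp <= 0xFF ? (int)cp : -1;
    }
}

/* Convert NUL-terminated UTF-8 to ISO 8859-9, writing '?' for anything
   the code page cannot represent. Returns bytes written, excluding NUL. */
size_t utf8_to_iso8859_9(const char *in, unsigned char *out, size_t cap)
{
    assert(in != NULL && out != NULL && cap > 0);
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    while (*p != 0 && n + 1 < cap) {
        uint32_t cp;
        if (*p < 0x80) {
            cp = *p++;                       /* one-byte (ASCII) form */
        } else if ((*p & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            cp = ((uint32_t)(*p & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
            p += 2;                          /* two-byte form */
        } else {
            /* Longer or malformed sequence: skip it as one unit. */
            p++;
            while ((*p & 0xC0) == 0x80)
                p++;
            cp = 0xFFFFFFFFu;                /* never representable */
        }
        int b = iso8859_9_from_codepoint(cp);
        out[n++] = (unsigned char)(b >= 0 ? b : '?');
    }
    out[n] = '\0';
    return n;
}
```

Only one- and two-byte UTF-8 sequences are decoded because every character in ISO 8859-9 lies below U+0800. The '?' substitution is where the conversion becomes lossy, so a real service would have to decide whether to substitute, transliterate or reject such text.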
By the way, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data.
Not exactly. You have a table of characters, right? Take A, for example. Now you have to store a numeric value that will be interpreted as A. In ASCII they arbitrarily decided that 65 is the numeric value that represents that character. Unicode is a long list of characters (and rules to combine them); the UTF-X encodings are arbitrary representations used to store them as numeric values.
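To show the difference between a character's number and its stored bytes, here is a minimal sketch of how UTF-8 turns one code point into one to four bytes. The function name utf8_encode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[0..3].
   Returns the number of bytes written, or 0 if cp is a surrogate or
   lies above U+10FFFF and so cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 7 bits: same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* up to 11 bits */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* up to 16 bits */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* up to 21 bits */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The letter A (code point 65, or 0x41) is stored as the single byte 0x41, exactly as in ASCII, while the Cyrillic letter U+041F is stored as the two bytes 0xD0 0x9F. The code point is the same whichever encoding is used; only the bytes differ.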
If so, how is the character set applied?
"Character set" is a pretty vague term. By Unicode character set you mean all the characters available in Unicode. If you mean a code page, then (simplifying) it represents a subset of the available character set. Imagine you have an 8-bit extension of ASCII (so up to 256 symbols): you can't accommodate all the characters used in Europe, right? Code pages solve this problem: half of these symbols are always the same, and the other half represent different characters according to the code page (each "country" will use a specific code page with its preferred characters).
For an introductory overview about this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

Background c program for keyboard mapping

I have installed a Bramma TTF file on my Windows 8 system. Through the Windows Character Map, I was able to find the individual character codes. The screenshot of the map is attached below. At the bottom right we can see that the character code for "!" is 0x21. Similarly, I can find the character codes of all the other letters.
Now I have defined a character mapping for this font with my US keyboard layout. For example, I mapped the physical key 'a' on the keyboard to the character shown in the 3rd row and 1st column. [Whenever I hit 'a' on the keyboard, the corresponding character has to be displayed.]
I would like to write a background C program that listens for key presses and outputs the mapped character according to my previously defined character mapping, i.e., when I hit 'a' on the keyboard, it should return the mapped character.
Can anyone help me solve this problem, or at least give me a lead towards the solution?
I'm somewhat familiar with this kind of font; they have popped up in other questions on SO, the kind of questions from users who tried to deal with the consequences of using such a font. Those consequences are rather grave.
The biggest problem is that this font is not Unicode compatible. The actual string that underlies the text that's rendered to the screen is very different, containing characters from the ANSI character set. What goes horribly wrong is when the program that displays these strings saves the data. The data file contains the original strings, a good example is an Excel spreadsheet. This spreadsheet just contains gibberish when it is read by any other program. Especially bad when read by a program on another machine that doesn't have the same font installed. Very, very painful.
You are in fact making this problem worse by even destroying the normal mapping from keyboard to ANSI character. The 1st character in the 3rd row is produced by typing a capital I (eye) on the keyboard.
The message is clear: don't do this. Windows supports Unicode compatible fonts with Indic scripts well, fonts like Sylfaen, Mangal, Latha. All of which are available on my Windows 8 machine, about ten thousand miles away from where they are normally used. It also has Indic keyboard layouts available under the Language applet, I just picked one as an example:
Well, it is your funeral. You don't have to write a C program to translate these keystrokes, you need a custom keyboard layout. It is a DLL. You normally need the DDK to build them, but there is simple tooling available to auto-generate them. It doesn't get any easier than with MKLC, the Microsoft Keyboard Layout Creator. Web page and download link are here.
You should probably use AutoHotkey.
With this application, you can listen for a set of keys and then send a different set of keys.
This can be used to implement "autocorrect".
e.g.
:*:btw::By the way
will autocorrect btw to By the way.
AutoHotkey supports quite complicated scripts, and many scripts are already available online.
On another note, if you only want an English keyboard to type Malayalam Unicode characters, you may also consider a popular program called Baraha.
Google's Virtual Keyboard (also works with your physical keyboard)
https://code.google.com/apis/ajax/playground/#virtual_keyboard
http://www.tavultesoft.com/ allows you to create keyboards for MSWindows and the web. Over 1000 keyboards are readily provided. There is a developer and a user version. With the developer version you may create installation programs which install fonts, keyboards, keymaps and documentation.

get the text in the display with ncurses

Is there any way to read the characters that were output with ncurses back into a variable?
Let's say I do:
printw("test");
then I want to be able to:
somefunc(strbuffer);
printf("%s",strbuffer); // test
I need a function to get all the characters on the screen back into a variable. scr_dump gets close, but the output format is unreadable.
If you put stuff on the screen using curses functions (e.g. addch, mvaddch, addstr), you can use inchstr and related functions to read the characters from the screen (extracting them by ANDing the returned value with A_CHARTEXT).
However, if you use printf or any other non-curses method of putting text on the screen (including a system call to another program that uses curses), you will not be able to read the content of the screen.
Curses maintains the current screen contents internally and the inchstr functions use the internal representation of the screen to find the current contents.
There are two sets of functions for retrieving data from the screen. If your printw uses only (as in the question) text which is represented as an 8-bit encoding (ASCII, POSIX, ISO-8859-1), then inch and inchstr work:
inch retrieves a single cell along with its attributes
inchstr retrieves multiple cells along with their attributes
or more simply using instr and its variations. These functions return the data without additional need for masking the attributes from the characters.
However, if the data uses a multibyte encoding (such as UTF-8), then you must use a different interface for retrieving the characters. These are the equivalents of inch and inchstr:
in_wch, etc. - extract a complex character and rendition from a window
in_wchstr, etc. - get an array of complex characters and renditions from a curses window
A complex character is a structure, which X/Open Curses treats as opaque. You must use getcchar to extract data (such as a wide-character string) from each cell's data.
A (little) more simply, you can read the wide-character string information from a window:
inwstr, etc. - get a string of wchar_t characters from a curses window
There is no single-character form; you must retrieve data as a one-character string.
In summary, while your application can put data as an array of char (or individual chtype values), in a UTF-8 environment it must retrieve it as complex characters or wide-characters. If you happen to be using Linux, you can generally treat wchar_t as Unicode values. Given data as an array of wchar_t values, you would use other (non-curses) functions to obtain a multibyte (UTF-8) string.
Since the question said ncurses rather than simply curses, it's appropriate to point out that applications using ncurses can differ from X/Open Curses in the way they put data on the screen (which can affect your expectations about retrieving it). In ncurses, addch (and similar char-oriented functions) will handle bytes in a multi-byte string such as UTF-8, storing the result as wide-characters. None of the other X/Open Curses implementations to date do this. The others treat those bytes as independent, and may represent them as invalid wide-characters.
By the way, since the question was asked in 2010, ncurses' scr_dump format has been extended, making it "readable".
