The Windows console API provides the WriteConsoleOutput function, which allows you to write characters to arbitrary locations on the console. This function takes an array of CHAR_INFO structures as an argument, specifying the:
Characters (i.e. code points) to write
Attributes thereof
However, the CHAR_INFO structure allows code points to be specified only as either WCHAR or CHAR. CHAR supports only ANSI characters, and WCHAR supports only code points in the range U+0000 to U+FFFF (i.e. the BMP).
Is there any way to use the console API to write out code points in the range U+10000 to U+10FFFF, i.e. code points outside the BMP?
To the best of my knowledge, the Windows console API is limited to UCS-2 and so cannot output non-BMP characters.
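For reference, a minimal sketch (not from the question; the coordinates and attributes are just illustrative) of writing one cell with WriteConsoleOutputW. Each CHAR_INFO cell holds a single WCHAR, so there is no room for the two UTF-16 code units that a non-BMP code point such as U+1F600 would require:

#include <windows.h>

int main(void)
{
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

    CHAR_INFO cell;
    cell.Char.UnicodeChar = 0x00E9;  /* U+00E9 is in the BMP, so it fits in one WCHAR */
    cell.Attributes = FOREGROUND_RED | FOREGROUND_INTENSITY;

    COORD bufSize  = { 1, 1 };
    COORD bufCoord = { 0, 0 };
    SMALL_RECT region = { 0, 0, 0, 0 };  /* top-left cell of the console */

    WriteConsoleOutputW(hOut, &cell, bufSize, bufCoord, &region);
    return 0;
}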
Related
I have been trying to understand the importance of, and the reasons for, using the W Win32 API functions versus the A ones (W meaning wide char, A meaning ASCII, right?).
I made a simple example that retrieves the temp path for the current user:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a Russian name, so it is written in Cyrillic. printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would get an error, or a mess of symbols, from GetTempPathA, since the current user name is Unicode, but I figured out that Cyrillic characters are actually included in the extended ASCII character set. So my question is: if my software extracts data into the temp folder of a user who is Chinese (assuming the user name contains Chinese characters), will I get a mess or an error using the GetTempPathA version? Should I always use the W-prefixed functions in production software that works with the Win32 API directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, refers to an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but those aren't actually standards and, in some cases, aren't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable-length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out of luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
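A minimal sketch of that pattern, assuming a hypothetical helper named widen_utf8 (error handling kept to a minimum):

#include <windows.h>
#include <stdlib.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to a
   newly allocated wide-character string; the caller frees the result. */
wchar_t *widen_utf8(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;
    wchar_t *wide = malloc(len * sizeof(wchar_t));
    if (wide != NULL)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
    return wide;
}

You would then pass the result to the -W function (CreateFileW, CreateDirectoryW, and so on) and free it afterwards.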
UTF-8 is similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course, you need the wide version. The ANSI version can't technically handle more than 256 distinct characters. Cyrillic is included in the extended ASCII set (if that's your localization), while Chinese isn't and can't be, due to the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it will only work properly if the executing machine has a matching localization. On a machine with a non-Cyrillic localization, the text will be displayed according to whatever is defined by the localization settings.
I've got code that used MultiByteToWideChar like so:
wchar_t * bufferW = malloc(mbBufferLen * 2);
MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, bufferW, mbBufferLen);
Note that the code does not make a prior call to MultiByteToWideChar to check how large the new Unicode buffer needs to be; it assumes the buffer will need at most twice the length of the multibyte buffer.
My question is whether this usage is safe. Could there be a default code page that maps a character into a 3-byte or larger unicode character and cause an overflow? While I'm aware the usage isn't exactly correct, I'd like to gauge the risk impact.
Could there be a default code page that maps a character into a 3-byte or larger [sequence of wchar_t UTF-16 code units]
There is currently no ANSI code page that maps a single byte to a character outside the BMP (i.e. one that would take more than one 2-byte code unit in UTF-16).
No single multi-byte ANSI character can ever be encoded as more than two 2-byte code units in UTF-16. So, at worst, you will never end up with a UTF-16 string that has more than 2x the length of the input ANSI string (not counting the null terminator, which does not apply in this case since you are passing explicit lengths), and at best you will end up with a UTF-16 string that has fewer wchar_t characters than the input string has char characters.
For what it's worth, Microsoft are endeavouring not to develop the ANSI code pages any further, and I suspect the NLS file format would need changes to allow it, so it's pretty unlikely that this will change in future. But there is no firm API promise that this will definitely always hold true.
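For comparison, a sketch of the safer two-call pattern, using the same mbBuffer/mbBufferLen inputs as in the question (the helper name widen_acp is hypothetical):

#include <windows.h>
#include <stdlib.h>

/* Two-call pattern: ask MultiByteToWideChar for the required length
   first, then allocate exactly that many wchar_t. */
wchar_t *widen_acp(const char *mbBuffer, int mbBufferLen, int *wideLen)
{
    int len = MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, NULL, 0);
    if (len == 0)
        return NULL;
    wchar_t *bufferW = malloc(len * sizeof(wchar_t));
    if (bufferW != NULL) {
        MultiByteToWideChar(CP_ACP, 0, mbBuffer, mbBufferLen, bufferW, len);
        *wideLen = len;
    }
    return bufferW;
}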
I have an API which takes Unicode data as a C character array and sends it as a correct SMS in Unicode.
Now I have four code point values corresponding to four characters in some native alphabet, and I want to send them correctly by inserting them into a C char array.
I tried
char test_data[] = {"\x00\x6B\x00\x6A\x00\x63\x00\x69"};
where 0x006B is one code point and so on.
The API internally calls
int len = mbstowcs(NULL,test_data,0);
which results in 0 for the above. It seems the 0x00 bytes are treated as a terminating null.
I want to assign the above code points correctly to the C array so they result in the corresponding UTF-16 characters on the receiving phone (which does support the character set). If required, I have the leverage to change the API too.
The platform is Linux with glib.
UTF-16BE is not the native execution (AKA multibyte) character set, and mbstowcs expects null-terminated strings, so this will not work. Since you are using Linux, the function probably expects any char[] sequence to be UTF-8.
I believe you can transcode character data in Linux using uniconv. I've only used the ICU4C project.
Your code would read the UTF-16BE data, transcode it to a common form (e.g. uint8_t), then transcode it to the native execution character set prior to calling the API (which will then transcode it to the native wide character set).
Note: this may be a lossy process if the execution character set does not contain the relevant code points, but you have no choice because this is what the API is expecting. But as I noted above, modern Linux systems should default to UTF-8. I wrote a little bit about transcoding codepoints in C here.
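For illustration, a minimal sketch using POSIX iconv, one possible transcoding route (the answer above mentions uniconv and ICU4C), to turn the question's UTF-16BE bytes into UTF-8 before handing them to the API:

#include <iconv.h>
#include <stdio.h>

int main(void)
{
    /* The four UTF-16BE code units from the question. */
    char utf16be[] = "\x00\x6B\x00\x6A\x00\x63\x00\x69";
    char utf8[32];

    char *in = utf16be, *out = utf8;
    size_t inleft = 8, outleft = sizeof utf8;

    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1)
        return 1;
    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
        return 1;
    iconv_close(cd);

    fwrite(utf8, 1, sizeof utf8 - outleft, stdout);  /* prints "kjci" */
    putchar('\n');
    return 0;
}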
I think using wchar_t would solve your problem.
Correct me if I am wrong or missing something.
I think you should create a union of chars and ints:
typedef union { int int_arr[200]; char char_arr[800]; } wchars;
Then memcpy the data into this union for your assignment.
I am attempting to learn debugging in x86 assembly and am trying to debug my simple C program. However, I am confused as to how large values (like strings) are stored in memory. For example, let's say I store the string VEQ9SZ9T8I62ZCIWE6RKZDE6AZSI2 at address 0012E965, with that address held in register EBX, and I look at the hex dump at that address: how do I know where the string ends? Say I didn't have a nice ASCII string stored at that location; how would I know where the hex dump ended for that particular address? As you can see, I am quite a beginner at assembly, so I thank everyone for their patience and help.
It's mostly a matter of interpretation. How a string (or any data in memory) is interpreted is (not surprisingly) defined by the code which interprets it. From just looking at a hex dump you cannot say which method was used to create the string, but chances are that a common method was used. Null-terminated strings are easily recognized by a trailing zero, and some strings may be prefixed with their length in bytes or chars. It's also possible that the size is not encoded in data memory but was put into the program as an immediate value.
It depends on who stored or generated the string. If it was generated by the assembler or a C program/library, it is most likely a C string.
For storing strings there are several possibilities:
Using a terminating 0 character, aka a C string. To determine the length of the string you have to call a function like strlen. In this case the string ends at the first 0 char.
Storing the length of the string in a separate variable at the beginning. The length variable can be of byte, 16-bit, 32-bit or 64-bit width.
Storing the length of the string and a pointer to an address in a global memory pool.
Additionally there are variants for storing wide chars, UTF-8 and such, and mixtures of everything. As an assembly programmer it's up to you what you use internally. It makes sense to use a format which can be used by the OS (as in file names) or which is common to the programs or libraries you want to use, so C strings are probably the most common in assembly programs.
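A small C sketch contrasting the first two layouts (the byte values are purely illustrative):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* C string: terminated by a 0 byte; the length is found by scanning. */
    const char cstr[] = "VEQ9SZ9T8I62ZCIWE6RKZDE6AZSI2";
    printf("C string length: %zu\n", strlen(cstr));

    /* Length-prefixed string: the first byte holds the length, so no
       scan is needed and embedded 0 bytes are allowed. */
    const unsigned char pstr[] = { 5, 'h', 'e', 'l', 'l', 'o' };
    printf("prefixed length: %u\n", pstr[0]);
    return 0;
}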
Is there any way, with ncurses, to get the characters that were output back into a variable?
let's say I do:
printw("test");
then I want to be able to:
somefunc(strbuffer);
printf("%s",strbuffer); // test
I need a function to get all the characters on the screen back into a variable; scr_dump gets close, but its output format is unreadable.
If you put stuff on the screen using curses functions (e.g. addch, mvaddch, addstr), you can use inchstr (and related functions) to read the characters back from the screen, extracting them by ANDing the returned values with A_CHARTEXT.
However, if you use printf or any other non-curses method of putting text on the screen (including a system call to another program that uses curses), you will not be able to read the content of the screen.
Curses maintains the current screen contents internally and the inchstr functions use the internal representation of the screen to find the current contents.
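A minimal sketch of that approach; the 4-cell length simply matches the "test" example from the question:

#include <curses.h>
#include <stdio.h>

int main(void)
{
    chtype cells[5];
    char buffer[5];

    initscr();
    addstr("test");              /* put text on the screen via curses */
    refresh();

    mvinchnstr(0, 0, cells, 4);  /* read 4 cells from row 0, column 0 */
    for (int i = 0; i < 4; i++)
        buffer[i] = (char)(cells[i] & A_CHARTEXT);  /* strip attributes */
    buffer[4] = '\0';

    endwin();
    printf("%s\n", buffer);      /* prints "test" */
    return 0;
}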
There are two sets of functions for retrieving data from the screen. If your printw uses only (as in the question) text which is represented as an 8-bit encoding (ASCII, POSIX, ISO-8859-1), then inch and inchstr work:
inch retrieves a single cell along with its attributes
inchstr retrieves multiple cells along with their attributes
or, more simply, using instr and its variations. These functions return the data without the additional need to mask the attributes from the characters.
However, if the data uses a multibyte encoding (such as UTF-8), then you must use a different interface for retrieving the characters. These are the equivalents of inch and inchstr:
in_wch, etc. - extract a complex character and rendition from a window
in_wchstr, etc. - get an array of complex characters and renditions from a curses window
A complex character is a structure, which X/Open Curses treats as opaque. You must use getcchar to extract data (such as a wide-character string) from each cell's data.
A (little) more simply, you can read the wide-character string information from a window:
inwstr, etc. - get a string of wchar_t characters from a curses window
There is no single-character form; you must retrieve the data as a one-character string.
In summary, while your application can put data as an array of char (or individual chtype values), in a UTF-8 environment it must retrieve it as complex characters or wide-characters. If you happen to be using Linux, you can generally treat wchar_t as Unicode values. Given data as an array of wchar_t values, you would use other (non-curses) functions to obtain a multibyte (UTF-8) string.
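A minimal sketch of the wide-character route, assuming an ncursesw build (link with -lncursesw) and a UTF-8 locale:

#define _XOPEN_SOURCE_EXTENDED
#include <curses.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    wchar_t wbuf[16];

    setlocale(LC_ALL, "");      /* required for multibyte locales */
    initscr();
    addstr("test");             /* the bytes of a multibyte string */
    refresh();

    mvinnwstr(0, 0, wbuf, 4);   /* read the cells back as wchar_t */
    wbuf[4] = L'\0';
    endwin();

    printf("%ls\n", wbuf);      /* prints "test" */
    return 0;
}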
Since the question said ncurses rather than simply curses, it's appropriate to point out that applications using ncurses can differ from X/Open Curses in the way they put data on the screen (which can affect your expectations about retrieving it). In ncurses, addch (and similar char-oriented functions) will handle bytes in a multi-byte string such as UTF-8, storing the result as wide-characters. None of the other X/Open Curses implementations to date do this. The others treat those bytes as independent, and may represent them as invalid wide-characters.
By the way, since the question was asked in 2010, ncurses' scr_dump format has been extended, making it "readable".