get the text in the display with ncurses - c

Is there any way to get the characters that were output back into a variable with ncurses?
let's say I do:
printw("test");
then I want to be able to:
somefunc(strbuffer);
printf("%s",strbuffer); // test
I need a function to get all the characters on the screen back into a variable. scr_dump gets close, but its output format is unreadable.

If you put stuff on the screen using curses functions (e.g. addch, mvaddch, addstr), you can use inchstr() and related functions to read the characters back from the screen (extracting each character by AND'ing the returned value with A_CHARTEXT).
However, if you use printf or any other non-curses method of putting text on the screen (including a system call to another program that uses curses), you will not be able to read the contents of the screen.
Curses maintains the current screen contents internally and the inchstr functions use the internal representation of the screen to find the current contents.

There are two sets of functions for retrieving data from the screen. If, as in the question, your printw uses only text represented in an 8-bit encoding (ASCII, POSIX, ISO-8859-1), then inch and inchstr work:
inch retrieves a single cell along with its attributes
inchstr retrieves multiple cells along with their attributes
Or, more simply, use instr and its variants. These functions return the data without the additional need to mask the attributes off the characters.
However, if the data uses a multibyte encoding (such as UTF-8), then you must use a different interface for retrieving the characters. These are the equivalents of inch and inchstr:
in_wch, etc. - extract a complex character and rendition from a window
in_wchstr, etc. - get an array of complex characters and renditions from a curses window
A complex character is a structure, which X/Open Curses treats as opaque. You must use getcchar to extract data (such as a wide-character string) from each cell's data.
A (little) more simply, you can read the wide-character string information from a window:
inwstr, etc. - get a string of wchar_t characters from a curses window
There is no single-character form; you must retrieve the data as a one-character string.
In summary, while your application can put data as an array of char (or individual chtype values), in a UTF-8 environment it must retrieve it as complex characters or wide-characters. If you happen to be using Linux, you can generally treat wchar_t as Unicode values. Given data as an array of wchar_t values, you would use other (non-curses) functions to obtain a multibyte (UTF-8) string.
Since the question said ncurses rather than simply curses, it's appropriate to point out that applications using ncurses can differ from X/Open Curses in the way they put data on the screen (which can affect your expectations about retrieving it). In ncurses, addch (and similar char-oriented functions) will handle bytes in a multi-byte string such as UTF-8, storing the result as wide-characters. None of the other X/Open Curses implementations to date do this. The others treat those bytes as independent, and may represent them as invalid wide-characters.
By the way, since the question was asked in 2010, ncurses' scr_dump format has been extended, making it "readable".

Clarification on Winapi Paths and Filename (W functions and A functions)

I have tried to check the importance of, and reason for, using the W WinAPI functions vs. the A ones (W meaning wide char, A meaning ASCII, right?).
I made a simple example; I receive the temp path for the current user like this:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a Russian name, so it's written in Cyrillic. printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would receive an error, or a mess of symbols, from GetTempPathA since the current user name is Unicode, but I figured out that Cyrillic characters are actually included in the extended ASCII character set. So I have a question: if my software were to extract data into the temp folder of a current user who is Chinese (assuming they have Chinese symbols in their user name), would I get a mess or an error using the GetTempPathA version? Should I always use the W-prefixed functions for production software that works with WinAPI directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out-of-luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
UTF-8 is similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course you need the wide version. The ASCII version technically can't even handle more than 256 distinct characters. Cyrillic is included in the extended ASCII set (if that's your localization), while Chinese isn't and can't be, due to the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it will only work properly if the executing machine has a matching localization. On a machine with a non-Cyrillic localization, the text will be displayed according to whatever the localization settings define.

UTF-8 and ISO 8859-9

I have been reading about UTF-8 and Unicode for the last couple of days, and just when I thought I had it all figured out, I got confused reading that UTF-8 and ISO 8859-9 are not compatible.
I have a database that stores data as UTF-8. I have a requirement from a customer to support various ISO 8859-x code pages (i.e. 8859-3, 8859-2, and also ISO 6937). My questions are:
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using unicode?
I understand that Unicode can support all characters and is the way to go. However, my customer is a European entity that wants us to use ISO code pages. So my question is: how can I support multiple client use cases using the existing UTF-8 data? Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use cases? Is that all I need to do, or is there more to it?
By the way, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data. If so, how is a character set applied? Do I have to write code to return an 8859-x response, or is all that's needed to set an appropriate character set value in the response header?
The topic is pretty vast, so let me simplify (a lot, maybe even too much) and answer point by point.
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using unicode?
Yes, you're using UNICODE, and you're storing UNICODE characters (formally called code points) using the UTF-8 encoding. Please note that UNICODE defines rules and sets of characters (even if the same word is often used as a synonym for the UTF-16 encoding); the way you encode such characters in a byte stream is another matter.
... However, my customer is an european entity that wants us to use ISO code pages. so my question is how can I support multiple client use cases using existing UTF-8 data?
Of course, if you store UNICODE characters (it doesn't matter with which encoding) then you can always convert them to a specific 8-bit code page (or to any other encoding). OK, this isn't formally always true (because UNICODE doesn't define every possible character in use, or used in the past), but I would ignore this point...
... Since ISO 8859-x is not a subset of unicode, do I have to write code to send appropriate character set of ISO 8859-x depending on my use cases?
All characters from the ISO 8859 code pages are also available in UNICODE, so from this point of view it is a subset. Of course the encoded values are different, so they need to be converted. If you know the needed code page for each customer, then you can always convert UTF-8 encoded UNICODE text into text in the right code page.
Is that I need to do or there is more to it?
Just that. The code could be pretty short, but you didn't tag your question with any language, so I won't provide links/examples. For a rudimentary example, take a look at this post.
Let me also say one important thing: if they want to consume your data in an 8-bit encoding with their code page, then you have to perform a conversion. If they can consume UTF-8 data directly (or you present it somehow in your own application), then you don't have to worry about code pages (that's why we use UNICODE): no matter the encoding, the UNICODE character set contains all the characters they may need.
btw, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data.
Not exactly. You have a table of characters, right? For example, A. Now you have to store a numeric value that will be interpreted as A. In ASCII they arbitrarily decided that 65 is the numeric value that represents that character. UNICODE is a long list of characters (and rules for combining them); the UTF-x encodings are arbitrary representations used to store them as numeric values.
if so, how character set is applied?
"Character set" is a pretty vague term. By the UNICODE character set you mean all the characters available in UNICODE. If you mean a code page, then (simplifying) it represents a subset of the available character set. Imagine you have 8-bit ASCII (so up to 256 symbols); you can't accommodate all the characters used in Europe, right? Code pages solve this problem: half of the symbols are always the same, and the other half represent different characters according to the code page (each country uses a specific code page with its preferred characters).
For an introductory overview about this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

Write Non-BMP Code Points to Console

The Windows console API provides the WriteConsoleOutput function, which allows you to write characters to arbitrary locations on the console. This function takes an array of CHAR_INFO structures as an argument, specifying the:
Characters (i.e. code points) to write
Attributes thereof
However the CHAR_INFO structure allows code points to be specified only as either WCHAR or CHAR. CHAR supports only ANSI characters, and WCHAR supports only code points in the range U+0000 to U+FFFF (i.e. the BMP).
Is there any way to use the console API to write out code points in the range U+10000 to U+10FFFF? I.e. to write code points outside of the BMP?
To the best of my knowledge, the Windows console API is limited to UCS-2 and so cannot output non-BMP characters.

Why isn't wchar_t widely used in code for Linux / related platforms?

This intrigues me, so I'm going to ask - for what reason is wchar_t not used so widely on Linux/Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally whereas I believe Linux does not and this is reflected in a number of open source packages using char types.
My understanding is that given a character c which requires multiple bytes to represent, in char[] form c is split across several chars, whereas it forms a single unit in a wchar_t[]. Is it not easier, then, to always use wchar_t? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?
wchar_t is a wide character with platform-defined width, which doesn't really help much.
UTF-8 characters span 1-4 bytes each. UCS-2, which uses exactly 2 bytes per character, is now obsolete and can't represent the full Unicode character set.
Linux applications that support Unicode tend to do so properly, above the byte-wise storage layer. Windows applications tend to make this silly assumption that only two bytes will do.
wchar_t's Wikipedia article briefly touches on this.
The first people to use UTF-8 on a Unix-based platform explained:
The Unicode Standard [then at version 1.1] defines an adequate character set but an unreasonable representation [UCS-2]. It states that all characters are 16 bits wide [no longer true] and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode Consortium was thinking of files, not pipes.) To adopt this encoding, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines by different manufacturers [italics mine], it is impossible.
The italicized part is less relevant to Windows systems, which have a preference towards monolithic applications (Microsoft Office), non-diverse machines (everything's an x86 and thus little-endian), and a single OS vendor.
And the Unix philosophy of having small, single-purpose programs means fewer of them need to do serious character manipulation.
The source for our tools and applications had already been converted to work with Latin-1, so it was ‘8-bit safe’, but the conversion to the Unicode Standard and UTF[-8] is more involved. Some programs needed no change at all: cat, for instance, interprets its argument strings, delivered in UTF[-8], as file names that it passes uninterpreted to the open system call, and then just copies bytes from its input to its output; it never makes decisions based on the values of the bytes... Most programs, however, needed modest change.
...Few tools actually need to operate on runes [Unicode code points] internally; more typically they need only to look for the final slash in a file name and similar trivial tasks. Of the 170 C source programs... only 23 now contain the word Rune.
The programs that do store runes internally are mostly those whose raison d’être is character manipulation: sam (the text editor), sed, sort, tr, troff, 8½ (the window system and terminal emulator), and so on. To decide whether to compute using runes or UTF-encoded byte strings requires balancing the cost of converting the data when read and written against the cost of converting relevant text on demand. For programs such as editors that run a long time with a relatively constant dataset, runes are the better choice...
UTF-32, with code points directly accessible, is indeed more convenient if you need character properties like categories and case mappings.
But widechars are awkward to use on Linux for the same reason that UTF-8 is awkward to use on Windows. GNU libc has no _wfopen or _wstat function.
UTF-8, being compatible to ASCII, makes it possible to ignore Unicode somewhat.
Often, programs don't care (and in fact, don't need to care) about what the input is, as long as there is not a \0 that could terminate strings. See:
char buf[whatever];
printf("Your favorite pizza topping is which?\n");
fgets(buf, sizeof(buf), stdin); /* Jalapeños */
printf("%s it shall be.\n", buf);
The only times I have found I needed Unicode support are when I had to treat a multibyte character as a single unit (wchar_t), e.g. when counting the number of characters in a string rather than bytes. iconv from UTF-8 to wchar_t will do that quickly. For bigger issues like zero-width spaces and combining diacritics, something heavier like ICU is needed, but how often do you do that anyway?
wchar_t is not the same size on all platforms. On Windows it is a UTF-16 code unit that uses two bytes. On other platforms it typically uses 4 bytes (for UCS-4/UTF-32). It is therefore unlikely that these platforms would standardize on using wchar_t, since it would waste a lot of space.

UTF-8 tuple storage using lowest common technological denominator, append-only

EDIT: Note that due to the way hard drives actually write data, none of the schemes in this list work reliably. Do not use them. Just use a database. SQLite is a good simple one.
What's the most low-tech but reliable way of storing tuples of UTF-8 strings on disk? Storage should be append-only for reliability.
As part of a document storage system I'm experimenting with I have to store UTF-8 tuple data on disk. Obviously, for a full-blown implementation, I want to use something like Amazon S3, Project Voldemort, or CouchDB.
However, at the moment I'm experimenting and haven't even firmly settled on a programming language yet. I have been using CSV, but CSV tends to become brittle when you try to store outlandish Unicode and unexpected whitespace (e.g. vertical tabs).
I could use XML or JSON for storage, but they don't play nice with append-only files. My best guess so far is a rather idiosyncratic format in which each string is preceded by a 4-byte signed integer giving the number of bytes it contains, with a value of -1 indicating that the tuple is complete: the equivalent of a CSV newline. The main source of headaches there is having to decide on the endianness of the integer on disk.
Edit: actually, this won't work. If the program exits while writing a string, the data becomes irrevocably misaligned. Some sort of out-of-band signalling is needed to ensure alignment can be regained after an aborted tuple.
Edit 2: Turns out that guaranteeing atomicity when appending to text files is possible, but the parser is quite non-trivial. Writing said parser now.
Edit 3: You can view the end result at http://github.com/MetalBeetle/Fruitbat/tree/master/src/com/metalbeetle/fruitbat/atrio/ .
I would recommend tab delimiting each field and carriage-return delimiting each record.
Within each string, replace all characters that would affect the field and record interpretation and rendering. These include control characters (U+0000-U+001F, U+007F-U+009F), non-graphical line and paragraph separators (U+2028, U+2029), directional control characters (U+202A-U+202E), and the byte order mark (U+FEFF).
They should be replaced with escape sequences of constant length. The escape sequences should begin with a rare (for your application) character. The escape character itself should also be escaped.
This would allow you to append new records easily. It has the additional advantage of being able to load the file for visual inspection and modification into any spreadsheet or word processing program, which could be useful for debugging purposes.
This would also be easy to code, since the file will be a valid UTF-8 document, so standard text reading and writing routines may be used. This also allows you to convert easily to UTF-16BE or UTF-16LE if desired, without complications.
Example:
U+0009 CHARACTER TABULATION becomes ~TB
U+000A LINE FEED becomes ~LF
U+000D CARRIAGE RETURN becomes ~CR
U+007E TILDE becomes ~~~
etc.
There are a couple of reasons why tabs would be better than commas as field delimiters. Commas appear more commonly within normal text strings (such as English text), and would have to be replaced more frequently. And spreadsheet programs (such as Microsoft Excel) tend to handle tab-delimited files much more naturally.
Mostly thinking out loud here...
Really low tech would be to use (for example) null bytes as separators, and just "quote" all null bytes appearing in the output with an additional null.
Perhaps one could use SCSU along with that.
Or it might be worth looking at the gzip format, and maybe aping it, if not using it outright:
A gzip file consists of a series of "members" (compressed data sets).
[...]
The members simply appear one after another in the file, with no additional information before, between, or after them.
Each of these members can have an optional "filename", comment, or the like, and I believe you can just keep appending members.
Or you could use bencode, as used in torrent files. Or BSON.
See also Wikipedia's Comparison of data serialization formats.
Otherwise, I think your idea of preceding each string with its length is probably the simplest one.
