Expected encoding of wcwidth() argument

I'm trying to find out what the expected encoding of wcwidth() argument is.
The man page says absolutely nothing about this, and I wasted hours trying to
find out what it is. Here's an example, in C:
#include <stdio.h>
#include <wchar.h>
void main()
{
wchar_t c = L'ｈ';
printf("%d\n", wcwidth(c));
}
I want to know how I should encode this character literal so that this program
prints 2 instead of -1.
Here's a Rust example:
extern "C" {
fn wcwidth(c: libc::wchar_t) -> libc::c_int;
}
fn main() {
let c = 'ｈ';
println!("{}", unsafe { wcwidth(c as libc::wchar_t) });
}
Similarly I want to convert this character constant to wchar_t (i32) so that
this program prints 2.
Thanks.
UPDATE: Sorry for my wording, I made this sound specific to C's long char literals. I want to encode character literals in any language as a 32-bit int so that when I pass it to wcwidth I get a right answer. So my question is not specific to C or C's long char literals.
UPDATE 2: I'd also be happy with another function like wcwidth that is better specified (and maybe even platform independent). E.g. one that takes UTF-8 encoded character and returns number of cols needed to render it in a monospace terminal.

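You need to define _XOPEN_SOURCE before including any header, so that <wchar.h> declares wcwidth(), and you also need to set the locale. Without a setlocale() call the program runs in the "C" locale, in which a character such as 'ｈ' is not printable, so wcwidth() returns -1.
Try this: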
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main(void)
{
setlocale(LC_CTYPE, "");
wchar_t c = L'ｈ';
printf("%d\n", wcwidth(c));
return 0;
}
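On glibc, which defines __STDC_ISO_10646__, a wchar_t holds the Unicode code point itself, so the value wcwidth() expects for 'ｈ' is 0xFF48 regardless of the source language (the Rust cast from char already yields that number). For input that arrives as UTF-8, a minimal sketch of the decoding step follows; utf8_decode is a hypothetical helper written for this illustration, not a library function, and it skips checks for overlong forms and surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence starting at s into *cp.
   Returns the number of bytes consumed, or 0 if the sequence is
   malformed. Overlong forms and surrogates are not rejected here,
   to keep the sketch short. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;

    if (p[0] < 0x80)                { *cp = p[0];        return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;               /* truncated sequence */
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}
```

The decoded code point can then be cast to wchar_t and passed to wcwidth() once setlocale(LC_CTYPE, "") has selected a UTF-8 locale. On platforms where wchar_t is only 16 bits wide this does not work for code points above 0xFFFF.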

Related

Is there a missing code on the isupper function?

#include <stdio.h>
#include <cs50.h>
#include <ctype.h>
#include <math.h>
// Prototype
string Get_text(void);
char isupper(ch);
int main(void)
{
string text = Get_text();
printf("%s\n", text );
}
// Prompt the user for text
string Get_text(void)
{
string n;
do {
n = get_string("Text: ");
}
while (n >= 0);
// Letters, Words, & Sentences
char ch = 'A';
}
I encountered an error when I ran my code. It points to line 9, where I declared the isupper function to check whether the letters are capital. I even added an extra parenthesis to isupper before the parenthesis on char, but there are still errors. P.S. I'm not yet done with the code. I'm reviewing how the isupper function works.

isupper was once a macro. Never declare it. #include <ctype.h> does the right thing.
If the offending declaration were for something other than isupper I would answer rather as follows:
char isupper(ch);
is bad syntax because each parameter must be written as (type argumentname) inside the parentheses. It should rather read as it appears in the man page (taking the luxury of correcting the type):
int isupper(int ch);
but as I said don't actually do this because of macro fun for the builtins in ctype.h.
Anyway, you're coding in c (from the tag) so there's no stock string type. This is not a compilable fragment; thankfully the line you're asking about occurs early enough that we can tell anyway what the problem is.

The isupper identifier coincides with a <ctype.h> function/macro that is included in the standard library.
As there's an already introduced prototype for the isupper function (int isupper(int ch);) that doesn't match the one you have used, it is giving you an error.
Simply give your own function a different name (all the more so if you plan to use the <ctype.h> routines), not isupper.
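As an illustration of using isupper() without declaring it yourself, here is a minimal sketch that counts the capital letters in a string. The name count_upper is made up for this example, and plain const char * is used instead of the cs50 string type so the fragment depends only on the standard library.

```c
#include <ctype.h>   /* declares isupper(); no prototype of our own */
#include <stddef.h>

/* Counts the uppercase letters in a NUL-terminated string. */
size_t count_upper(const char *text)
{
    size_t count = 0;
    for (; *text != '\0'; text++) {
        /* Cast to unsigned char: isupper() takes an int whose value must
           be representable as unsigned char (or be EOF). */
        if (isupper((unsigned char)*text))
            count++;
    }
    return count;
}
```

Calling count_upper(text) on the string returned by get_string() would give the number of capitals without any extra declaration in the program.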

Weird space when using wide characters in c

I am trying to draw a square with a given width and height.
I am trying to do so while using the box characters from Unicode.
I am using this code:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include "string_prints.h"
#define VERTICAL_PIPE L"║"
#define HORIZONTAL_PIPE L"═"
#define UP_RIGHT_CORNER L"╗"
#define UP_LEFT_CORNER L"╔"
#define DOWN_RIGHT_CORNER L"╝"
#define DOWN_LEFT_CORNER L"╚"
// Function to print the top line
void DrawUpLine(int w){
setlocale(LC_ALL, "");
wprintf(UP_LEFT_CORNER);
for (int i = 0; i < w; i++)
{
wprintf(HORIZONTAL_PIPE);
}
wprintf(UP_RIGHT_CORNER);
}
// Function to print the sides
void DrawSides(int w, int h){
setlocale(LC_ALL, "");
for (int i = 0; i < h; i++)
{
wprintf(VERTICAL_PIPE);
for (int j = 0; j < w; j++)
{
putchar(' ');
}
wprintf(VERTICAL_PIPE);
putchar('\n');
}
}
// Function to print the bottom line
void DrawDownLine(int w){
setlocale(LC_ALL, "");
wprintf(DOWN_LEFT_CORNER);
for (int i = 0; i < w; i++)
{
wprintf(HORIZONTAL_PIPE);
}
wprintf(DOWN_RIGHT_CORNER);
}
void DrawFrame(int w, int h){
DrawUpLine(w);
putchar('\n');
DrawSides(w, h);
putchar('\n');
DrawDownLine(w);
}
But when I run this code with some int values I get output with seemingly random spaces and newlines (although the pipes seem to be in the correct order).
It is being called from main.c from the header like so:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include "string_prints.h"
int main(){
DrawFrame(10, 20); // Calling the function
return 0;
}
Also as you can see I don't understand the correct use of setlocale, do you need to do it only once? or more?
Any help appreciated thanks in advance!

"Also as you can see I don't understand the correct use of setlocale, do you need to do it only once? or more?"
Locale changes applied via setlocale() are persistent within the calling process. You do not need to call that function multiple times unless you want to make multiple changes. But you do need to name a locale to it that serves your intended purpose, or if you call it with an empty string then you or the program user does need to ensure that the environment variables that define the various locale categories are set to values that suit the purpose.
"But when I am running this code with some int values I get an output with seemingly random spaces and newlines."
That sounds like the result of a character-encoding mismatch, or even two (but see also below):
there can be a runtime mismatch because the locale you tell the program to use for output does not match the one expected by the output device (e.g. a terminal) with which the program's output is displayed, and
there can also be a compile time mismatch between the actual character encoding of your source file and the encoding the compiler interprets it as having.
Additionally, use of wide string literal syntax notwithstanding, it is implementation-dependent which characters other than C's basic set may appear in your source code. The wide syntax specifies mostly the form of the storage for the literal (elements of type wchar_t), not so much what character values are valid or how they are interpreted.
Note also that the width of wchar_t is implementation-dependent, and it can be as small as eight bits. It is not necessarily the case that a wchar_t can represent arbitrary Unicode characters -- in fact, it is pretty common for wchar_t to be 16 bits wide, which in fact isn't wide enough for the majority of characters from Unicode's 21-bit code space. You might get an internal representation of wider characters in a two-unit form, such as a UTF-16 surrogate pair, but you also might not -- a great deal of this is left to individual implementations.
Among those things, what encoding the compiler expects, under what circumstances, and how you can influence that are all implementation-dependent. For GCC, for instance, the default source ("input") character set is UTF-8, and you can define a different one via its -finput-charset option. You can also specify both a standard and a wide execution character set via the -fexec-charset and -fwide-exec-charset options, if you wish to do so. GCC relies on iconv for conversions, both at compile time (source charset to execution charset) and at runtime (from execution charset to locale charset). Other implementations have other options (or none), with their own semantics.
So what should you do? In the first place, I suggest taking the source character set out of the equation by using UTF-8 string literals expressed using only the basic character set (requires C11):
#define VERTICAL_PIPE u8"\xe2\x95\x91"
#define HORIZONTAL_PIPE u8"\xe2\x95\x90"
#define UP_RIGHT_CORNER u8"\xe2\x95\x97"
#define UP_LEFT_CORNER u8"\xe2\x95\x94"
#define DOWN_RIGHT_CORNER u8"\xe2\x95\x9d"
#define DOWN_LEFT_CORNER u8"\xe2\x95\x9a"
Note well that the resulting strings are normal, not wide, so you should not use the wide-oriented output functions with them. Instead, use the normal printf, putchar, etc.
And that brings us to another issue with your code: you must not mix wide-oriented and byte-oriented functions writing to the same stream without taking explicit measures to switch (freopen or fwide; see paragraph 7.21.2/4 of the standard). In practice, mixing the two can quite plausibly produce mangled results.
Then also ensure that your locale environment variables are set correctly for your actual environment. Chances are good that they already are, but it's worth a check.
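Putting the advice above together, a minimal sketch of the frame drawn with byte-oriented output only might look like this. The names draw_border and draw_frame are invented for the example, and it assumes the terminal interprets output as UTF-8. The u8 prefix is left off here because the hex escapes already fix the bytes, which keeps the literals plain char arrays.

```c
#include <stdio.h>

/* UTF-8 bytes of the box-drawing characters, written as hex escapes so
   that the encoding of the source file does not matter. */
static const char VERTICAL[]   = "\xe2\x95\x91";  /* U+2551 */
static const char HORIZONTAL[] = "\xe2\x95\x90";  /* U+2550 */
static const char UP_RIGHT[]   = "\xe2\x95\x97";  /* U+2557 */
static const char UP_LEFT[]    = "\xe2\x95\x94";  /* U+2554 */
static const char DOWN_RIGHT[] = "\xe2\x95\x9d";  /* U+255D */
static const char DOWN_LEFT[]  = "\xe2\x95\x9a";  /* U+255A */

/* Writes one horizontal border: left corner, w bars, right corner. */
static void draw_border(FILE *out, const char *left, const char *right, int w)
{
    fputs(left, out);
    for (int i = 0; i < w; i++)
        fputs(HORIZONTAL, out);
    fputs(right, out);
    fputc('\n', out);
}

/* Draws a frame with a w-by-h interior using only byte-oriented calls.
   Returns the number of rows written, or 0 for negative sizes. */
int draw_frame(FILE *out, int w, int h)
{
    if (w < 0 || h < 0)
        return 0;
    draw_border(out, UP_LEFT, UP_RIGHT, w);
    for (int i = 0; i < h; i++) {
        fputs(VERTICAL, out);
        for (int j = 0; j < w; j++)
            fputc(' ', out);
        fputs(VERTICAL, out);
        fputc('\n', out);
    }
    draw_border(out, DOWN_LEFT, DOWN_RIGHT, w);
    return h + 2;
}
```

A call such as draw_frame(stdout, 10, 20) replaces the original DrawFrame. Because the stream never becomes wide-oriented and the bytes are written verbatim, no setlocale() call is needed for this variant.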

How to fix locale?

Add ru_RU.CP1251 locale (on debian uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen) and
compile the following program with gcc -fexec-charset=cp1251 test.c (input file is in UTF-8). The result is empty. Just letter 'я' is wrong.
Other letters are determined either lowercase or uppercase just fine.
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
int main (void)
{
setlocale(LC_ALL, "ru_RU.CP1251");
char c = 'я';
int i;
char z;
for (i = 7; i >= 0; i--) {
z = 1 << i;
if ((z & c) == z) printf("1"); else printf("0");
}
printf("\n");
if (islower(c))
printf("lowercase\n");
if (isupper(c))
printf("uppercase\n");
return 0;
}
Why neither islower() nor isupper() work on letter я?

The answer is that the lowercase version of that character is encoded as decimal 255 (0xFF) in CP 1251. Where plain char is signed, as in your build, that byte reaches islower() and isupper() as -1, which is the usual value of EOF, and neither function classifies EOF as a letter.
You can confirm this by printing (int)c, or by tracking down the source code of your runtime library to see what it does and why.
The solution is to write your own implementations, or to wrap the ones you have so that the argument is converted to unsigned char first. Personally, I never use these functions directly because of the many gotchas.

Igor, if your file is UTF-8, it makes no sense to use code page 1251, as it has nothing in common with the UTF-8 encoding. Just use the locale ru_RU.UTF-8 and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll first need to convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):
iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt
On the other hand, -fexec-charset=cp1251 only affects the characters stored in the executable; you have not specified the input charset the compiler should assume for the literals in your source code. Probably the compiler is determining that from the environment (the locale set through your LANG, LC_ALL or LC_CTYPE environment variables).
Only once you control exactly which encoding is used at each stage will you get coherent results.
The main reason an effort is being made to move every country to a common character set (Unicode, usually encoded as UTF-8) is precisely to avoid dealing with all these locale settings at each stage.
If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer, and when you receive a document encoded in UTF-8, you'll have to convert it to be able to see it correctly.
I strongly recommend switching to UTF-8, as it is an encoding that supports every country's character set, but the decision is yours.
NOTE
On debian linux:
$ sed 's/^/ /' pru-$$.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>
#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)
int main()
{
setlocale(LC_ALL, "");
Q(0xff);
}
Compiled with
$ make pru-$$
cc pru-1342.c -o pru-1342
execution with ru_RU.CP1251 locale
$ locale | sed 's/^/ /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=
$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512
So, glibc is not faulty, the fault is in your code.

The first comment of Jonathan Leffler to the OP is right. The isxxx() (and iswxxx()) functions are required to handle an EOF (WEOF) argument
(probably to be fool-proof).
This is why int was chosen as the argument type. When we pass an argument of type char or a character literal, it is
promoted to int (preserving the sign). And because the char type and character literals are signed by default in gcc on this platform,
0xFF becomes -1, which by unhappy coincidence is the value of EOF.
Therefore always cast explicitly when passing values of type char (including character literals with code 0xFF) to functions that take an int argument; don't count on char being unsigned, because that is implementation-defined. The cast may be written either as (unsigned char) or as (uint8_t), which is less to type (you must include stdint.h).
See also https://sourceware.org/bugzilla/show_bug.cgi?id=20792 and Why passing char as parameter to islower() does not work correctly?
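The cast described above can be packaged once in small wrappers. This is a minimal sketch; char_is_lower and char_is_upper are hypothetical names chosen for the example.

```c
#include <ctype.h>

/* Wrappers that convert the char to unsigned char before the call, so a
   byte such as 0xFF arrives as 255 instead of -1 (EOF) where char is signed.
   The result is normalized to 0 or 1. */
int char_is_lower(char c)
{
    return islower((unsigned char)c) != 0;
}

int char_is_upper(char c)
{
    return isupper((unsigned char)c) != 0;
}
```

With setlocale(LC_ALL, "ru_RU.CP1251") in effect, char_is_lower(c) for the 'я' byte should then agree with the islower(255) result shown in the transcript above, whether or not plain char is signed.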

ASCII characters in C

I'm trying to save a character from the cyrillic alphabet in a char.
When I take a string from the console it saves it in the char array successfully but just initializing it doesn't seem to work. I get "programName.exe has stopped working" when trying to run it.
#include <stdio.h>
#include <conio.h>
#include <string.h>
#include <Windows.h>
#include <stdlib.h>
void test(){
char test = 'Я';
printf("%s",test);
}
void main(){
SetConsoleOutputCP(1251);
SetConsoleCP(1251);
test();
}
fgets ( books[booksCount].bookTitle, 80, stdin ); // this seems to be working ok with ascii.
I tried using wchar_t but I get the same results.

If you're using Russian Windows which uses Windows-1251 codepage by default, you can print the character encoded as a single byte using the old printf but you need to make sure that the source code uses the same cp1251 charset. Don't save as Unicode.
But the preferred way should be using wprintf with wide char string
void test() {
wchar_t test_char = L'Я';
const wchar_t *test_string = L"АБВГ"; // or LPCWSTR test_string
wprintf(L"%lc\n%ls", test_char, test_string); // %lc and %ls are the standard wide conversions
}
This time you need to save the file as Unicode (UTF-8 or UTF-16).
UTF-8 may be better, but it's trickier on Windows. Moreover, if you use UTF-8 you cannot use a single char to store Я because it needs more than 1 byte. You must use a char array (a string) instead.
Note that main must return int, not void, and the above fgets must be called from inside some function.
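To make the byte count concrete: Я is U+042F, which UTF-8 encodes as the two bytes 0xD0 0xAF, so it only fits in a char array. The sketch below counts characters rather than bytes in such a string; utf8_length is a hypothetical helper and it assumes well-formed UTF-8.

```c
#include <stddef.h>

/* Counts code points in a UTF-8 string by skipping continuation
   bytes (those of the form 10xxxxxx). Assumes well-formed input. */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "\xD0\xAF", strlen() reports 2 bytes while utf8_length() reports 1 character, which is why a single char cannot hold it.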

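The crash comes from passing a char to %s, which expects a pointer to a string. This could be solved by printing the single character instead: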
void test()
{
char test = 'Я';
putchar(test);
}
But there is a catch: since 'Я' is not an ASCII character, you might need to set an appropriate locale or code page first.
Moreover, only the ASCII characters 32 to 126 are guaranteed to be printable, and to show the same symbol, on all systems.

How to find the built-in function to deal with char16_t in C?

Please tell me what the char16_t versions of the string manipulation functions are,
such as those listed here:
http://www.tutorialspoint.com/ansi_c/c_function_references.htm
I found many reference sites, but none of them mentions this.
The printing function is the most important one for me, because it lets me verify whether the manipulation functions work.
#include <stdio.h>
#include <uchar.h>
char16_t *u=u"α";
int main(int argc, char *argv[])
{
printf("%x\n",u[0]); // output 3b1, it is UTF16
wprintf("%s\n",u); //no ouput
_cwprintf("%s\n",u); //incorrect output
return 0;
}

To print, read, or write such strings, you first need to convert them to the multibyte (char) form using the c16rtomb function from <uchar.h>; mbrtoc16 converts in the other direction.
For all intents and purposes, char16_t is just another encoded representation, so one needs these conversion functions to work with this integral type.
A few answers used the L"" prefix, which is incorrect for char16_t: L"" produces wchar_t strings, while 16-bit strings require the u"" prefix.
The following code gets you everything you need to work with 8, 16, and 32-bit string representations.
#include <string.h>
#include <wchar.h>
#include <uchar.h>
You can Google the procedures found in <wchar.h> if you don't have manual pages (UNIX).
Gnome.org's GLib has some great code for you to drop-in if overhead isn't an issue.
char16_t and char32_t were introduced in ISO C11 (ISO/IEC 9899:2011).
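C11 provides no printf conversion for char16_t strings, so one way to print them is to convert to the locale's multibyte encoding first. The following is a minimal sketch using c16rtomb from <uchar.h>; c16_to_mb is a hypothetical helper name, and some platforms do not ship <uchar.h> at all.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>   /* mbstate_t */

/* Converts a NUL-terminated char16_t string to the current locale's
   multibyte encoding. Returns the number of bytes written (excluding
   the terminator), or -1 on an encoding error or a too-small buffer. */
long c16_to_mb(const char16_t *src, char *dst, size_t dstsize)
{
    mbstate_t state;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    memset(&state, 0, sizeof state);
    for (; *src != 0; src++) {
        size_t n = c16rtomb(tmp, *src, &state);
        if (n == (size_t)-1)
            return -1;               /* not representable in this locale */
        if (used + n >= dstsize)
            return -1;               /* keep room for the terminator */
        memcpy(dst + used, tmp, n);  /* n can be 0 after a high surrogate */
        used += n;
    }
    if (used >= dstsize)
        return -1;
    dst[used] = '\0';
    return (long)used;
}
```

After setlocale(LC_ALL, "") has selected a UTF-8 locale, converting u"α" into a buffer this way and printing the buffer with printf("%s\n", buf) should show the character. In the default "C" locale only ASCII input converts successfully.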

wprintf and its wchar colleagues need to have the format string in wchar too:
wprintf( L"%s\n", u);
For wchar L is used as a prefix to the string literals.
Edit:
Here's a code snippet (tested on Windows):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
#include <wchar.h>
int main(void)
{
const wchar_t *a = L"α";
fflush(stdout); //must be done before _setmode
_setmode(_fileno(stdout), _O_U16TEXT); // set console mode to unicode
wprintf(L"alpha is:\n\t%ls\n", a); // %ls is the standard conversion for wchar_t strings
return 0;
}
Without the _setmode call, the Windows console doesn't work in Unicode and prints a "?" for non-ASCII chars. _setmode and _O_U16TEXT are Windows-specific; on Linux, call setlocale(LC_ALL, "") at startup instead.
Note: for Windows GUI output there is already proper support, so you can use wsprintf to format Unicode strings.
