I am trying to draw a square with a given width and height.
I am trying to do so while using the box characters from Unicode.
I am using this code:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

#include "string_prints.h"

#define VERTICAL_PIPE L"║"
#define HORIZONTAL_PIPE L"═"
#define UP_RIGHT_CORNER L"╗"
#define UP_LEFT_CORNER L"╔"
#define DOWN_RIGHT_CORNER L"╝"
#define DOWN_LEFT_CORNER L"╚"

// Function to print the top line
void DrawUpLine(int w){
    setlocale(LC_ALL, "");
    wprintf(UP_LEFT_CORNER);
    for (int i = 0; i < w; i++)
    {
        wprintf(HORIZONTAL_PIPE);
    }
    wprintf(UP_RIGHT_CORNER);
}

// Function to print the sides
void DrawSides(int w, int h){
    setlocale(LC_ALL, "");
    for (int i = 0; i < h; i++)
    {
        wprintf(VERTICAL_PIPE);
        for (int j = 0; j < w; j++)
        {
            putchar(' ');
        }
        wprintf(VERTICAL_PIPE);
        putchar('\n');
    }
}

// Function to print the bottom line
void DrawDownLine(int w){
    setlocale(LC_ALL, "");
    wprintf(DOWN_LEFT_CORNER);
    for (int i = 0; i < w; i++)
    {
        wprintf(HORIZONTAL_PIPE);
    }
    wprintf(DOWN_RIGHT_CORNER);
}

void DrawFrame(int w, int h){
    DrawUpLine(w);
    putchar('\n');
    DrawSides(w, h);
    putchar('\n');
    DrawDownLine(w);
}
But when I run this code with some int values I get an output with seemingly random spaces and newlines (although the pipes appear in the correct order).
It is called from main.c via the header, like so:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include "string_prints.h"
int main(){
    DrawFrame(10, 20); // Calling the function
    return 0;
}
Also, as you can see, I don't understand the correct use of setlocale: do you need to call it only once, or more than once?
Any help appreciated thanks in advance!
Also, as you can see, I don't understand the correct use of setlocale: do you need to call it only once, or more than once?
Locale changes applied via setlocale() are persistent within the calling process. You do not need to call that function multiple times unless you want to make multiple changes. But you do need to pass it a locale name that serves your intended purpose; if you call it with an empty string, then you (or the program's user) need to ensure that the environment variables defining the various locale categories are set to values that suit the purpose.
But when I am running this code with some int values I get an output
with seemingly random spaces and newlines.
That sounds like the result of a character-encoding mismatch, or even two (but see also below):
there can be a runtime mismatch because the locale you tell the program to use for output does not match the one expected by the output device (e.g. a terminal) with which the program's output is displayed, and
there can also be a compile time mismatch between the actual character encoding of your source file and the encoding the compiler interprets it as having.
Additionally, use of wide string literal syntax notwithstanding, it is implementation-dependent which characters other than C's basic set may appear in your source code. The wide syntax specifies mostly the form of the storage for the literal (elements of type wchar_t), not so much what character values are valid or how they are interpreted.
Note also that the width of wchar_t is implementation-dependent, and it can be as small as eight bits. It is not necessarily the case that a wchar_t can represent arbitrary Unicode characters -- in fact, it is pretty common for wchar_t to be 16 bits wide, which in fact isn't wide enough for the majority of characters from Unicode's 21-bit code space. You might get an internal representation of wider characters in a two-unit form, such as a UTF-16 surrogate pair, but you also might not -- a great deal of this is left to individual implementations.
Among those things, what encoding the compiler expects, under what circumstances, and how you can influence that are all implementation-dependent. For GCC, for instance, the default source ("input") character set is UTF-8, and you can define a different one via its -finput-charset option. You can also specify both a standard and a wide execution character set via the -fexec-charset and -fwide-exec-charset options, if you wish to do so. GCC relies on iconv for conversions, both at compile time (source charset to execution charset) and at runtime (from execution charset to locale charset). Other implementations have other options (or none), with their own semantics.
So what should you do? In the first place, I suggest taking the source character set out of the equation by using UTF-8 string literals expressed using only the basic character set (requires C2011):
#define VERTICAL_PIPE u8"\xe2\x95\x91"
#define HORIZONTAL_PIPE u8"\xe2\x95\x90"
#define UP_RIGHT_CORNER u8"\xe2\x95\x97"
#define UP_LEFT_CORNER u8"\xe2\x95\x94"
#define DOWN_RIGHT_CORNER u8"\xe2\x95\x9d"
#define DOWN_LEFT_CORNER u8"\xe2\x95\x9a"
Note well that the resulting strings are normal, not wide, so you should not use the wide-oriented output functions with them. Instead, use the normal printf, putchar, etc.
And that brings us to another issue with your code: you must not mix wide-oriented and byte-oriented functions writing to the same stream without taking explicit measures to switch (freopen or fwide; see paragraph 7.21.2/4 of the standard). In practice, mixing the two can quite plausibly produce mangled results.
Then also ensure that your local environment variables are set correctly for your actual environment. Chances are good that they already are, but it's worth a check.
Related
As we know, different encodings map different representations to the same characters. Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well? I'd find this surprising, since those are handled at compile time!
This matters for tasks as simple as, for example, determining whether a string read from input contains a specific character. When reading strings from input it seems sensible to set the locale to the user's locale (setlocale(LC_ALL, "");) so that the string is read and processed correctly. But when we're comparing this string with a character literal, won't problems arise due to mismatched encodings?
In other words: the following snippet seems to work for me. But doesn't it work only by coincidence? Because, for example, the source code happened to be saved in the same encoding that is used on the machine at runtime?
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, "");

    // Read a line and convert it to a wide string so that wcschr can be used.
    // So many lines! And that's even though I'm omitting the necessary
    // error checking for brevity. Ah, I'm also omitting free's.
    char *s = NULL;
    size_t n = 0;
    getline(&s, &n, stdin);
    mbstate_t st = {0};
    const char *cs = s;
    size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
    wchar_t *ws = malloc((wn + 1) * sizeof(wchar_t));
    st = (mbstate_t){0};
    mbsrtowcs(ws, &cs, wn + 1, &st);

    int contains_guitar = (wcschr(ws, L'🎸') != NULL);
    if (contains_guitar)
        printf("Let's rock!\n");
    else
        printf("Let's not.\n");
    return 0;
}
How to do this correctly?
Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well?
No. String literals use the execution character set, which is defined by your compiler at compile time.
The execution character set does not have to be the same as the source character set, the character set used in the source code. The C compiler is responsible for the translation, and should have options for choosing/defining them. The default depends on the compiler, but on Linux and most current POSIXy systems it is usually UTF-8.
The following snippet seems to work for me. But doesn't it work only because of coincidence?
The example works because the character set of your locale, the source character set, and the execution character set used when the binary was constructed, all happen to be UTF-8.
How to do this correctly?
Two options. One is to use wide characters and string literals. The other is to use UTF-8 everywhere.
For wide input and output, see e.g. this example in another answer here.
Do note that getwline() and getwdelim() are not in POSIX.1; they are in ISO/IEC TR 24731-2. This means they are optional and, as of this writing, not widely available at all. Thus, a custom implementation around fgetwc() is recommended instead. (One based on fgetws(), wcslen(), and/or wcscspn() will not be able to handle embedded nuls, L'\0', correctly.)
In a typical wide I/O program, you only need mbstowcs() to convert command-line arguments and environment variables to wide strings.
Using UTF-8 everywhere is also a perfectly valid practical approach, at least if it is well documented, so that users know the program inputs and outputs UTF-8 strings, and developers know to ensure their C compiler uses UTF-8 as the execution character set when compiling those binaries.
Your program can even use e.g.
if (!setlocale(LC_ALL, ""))
    fprintf(stderr, "Warning: Your C library does not support your current locale.\n");
if (strcmp("UTF-8", nl_langinfo(CODESET)))
    fprintf(stderr, "Warning: Your locale does not use the UTF-8 character set.\n");
to verify that the current locale uses the UTF-8 character set (nl_langinfo() and CODESET are declared in <langinfo.h>).
I have used both approaches, depending on the circumstances. It is difficult to say which one is more portable in practice, because as usual, both work just fine on non-Windows OSes without issues.
If you're willing to assume UTF-8,
strstr(s,"🎸")
Or:
strstr(s,u8"🎸")
The latter avoids some assumptions but requires a C11 compiler. If you want the best of both and can sacrifice readability:
strstr(s,"\360\237\216\270")
I am trying to make a simple ancient-Greek to modern-Greek converter in C, by changing the tones of the vowels. For example, the user types a text in Greek which contains the character ῶ (Unicode U+1FF6), and the program converts it into ώ (Unicode U+1F7D). Greek characters are not supported by C, so I don't know how to make it work. Any ideas?
Assuming you use a sane operating system (meaning, not Windows), this is very easy to achieve using C99/C11 locale and wide character support. Consider filter.c:
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <stdio.h>
wint_t convert(const wint_t wc)
{
    switch (wc) {
    case L'ῶ': return L'ώ';
    default:   return wc;
    }
}

int main(void)
{
    wint_t wc;

    if (!setlocale(LC_ALL, "")) {
        fprintf(stderr, "Current locale is unsupported.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdin, 1) <= 0) {
        fprintf(stderr, "Standard input does not support wide characters.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdout, 1) <= 0) {
        fprintf(stderr, "Standard output does not support wide characters.\n");
        return EXIT_FAILURE;
    }

    while ((wc = fgetwc(stdin)) != WEOF)
        fputwc(convert(wc), stdout);

    return EXIT_SUCCESS;
}
The above program reads standard input, converts each ῶ into a ώ, and outputs the result.
Note that wide character strings and characters have an L prefix; L'ῶ' is a wide character constant. These are only in Unicode if the execution character set (the character set the code is compiled for) is Unicode, and that depends on your development environment. (Fortunately, outside of Windows, UTF-8 is pretty much a standard nowadays -- and that is a good thing -- so code like the above Just Works.)
On POSIXy systems (like Linux, Android, Mac OS, BSDs), you can use the iconv() facilities to convert from any input character set to Unicode, do the conversion there, and finally convert back to any output character set. Unfortunately, the question is not tagged posix, so that is outside this particular question.
The above example uses a simple switch/case statement. If there are many replacement pairs, one could use e.g.
typedef struct {
    wint_t from;
    wint_t to;
} widepair;

static widepair replace[] = {
    { L'ῶ', L'ώ' },
    /* Others? */
};
#define NUM_REPLACE (sizeof replace / sizeof replace[0])
and at runtime, sort replace[] (using qsort() and a function that compares the from elements), then use binary search to quickly determine whether a wide character is to be replaced (and if so, with which wide character). Because this is an O(log N) operation, with N the number of pairs, and it is reasonably cache-friendly, even thousands of replacement pairs are not a problem. (And of course, you can build the replacement array at runtime just as well, even from user input or command-line options.)
For Unicode characters, we could use a uint32_t map_to[0x110000]; to directly map each code point to another Unicode code point, but because we do not know whether wide characters are Unicode or not, we cannot do that; we do not know the code range of the wide characters until after compile time. Of course, we can do a multi-stage compilation, where a test program generates the replace[] array shown above, and outputs their codes in decimal; then do some kind of auto-grouping or clustering, for example bit maps or hash tables, to do it "even faster".
However, in practice it usually turns out that the I/O (reading and writing the data) takes more real-world time than the conversion itself. Even when the conversion is the bottleneck, the conversion rate is sufficient for most humans. (As an example, when compiling C or C++ code with the GNU utilities, the preprocessor first converts the source code to UTF-8 internally.)
Okay, here's some quick advice. I wouldn't use C, because Unicode is not well supported (yet).
A better language choice would be Python, Java, ..., anything with good Unicode support.
I'd write a utility that reads from standard input and writes to standard output. This makes it easy to use from the command line and in scripts.
I might be missing something but it's going to be something like this (in pseudo code):
while ((inCharacter = getCharacterFromStandardInput()) != EOF)
{
    switch (inCharacter)
    {
        case 'ῶ': outCharacter = 'ώ'; break
        ...
    }
    writeCharacterToStandardOutput(outCharacter)
}
You'll also need to select & handle the format: UTF-8/16/32.
That's it. Good luck!
Add the ru_RU.CP1251 locale (on Debian, uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen) and
compile the following program with gcc -fexec-charset=cp1251 test.c (the input file is in UTF-8). For the letter 'я' the result is empty: neither branch prints.
All other letters are correctly classified as either lowercase or uppercase.
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
int main (void)
{
    setlocale(LC_ALL, "ru_RU.CP1251");
    char c = 'я';
    int i;
    char z;

    for (i = 7; i >= 0; i--) {
        z = 1 << i;
        if ((z & c) == z) printf("1"); else printf("0");
    }
    printf("\n");

    if (islower(c))
        printf("lowercase\n");
    if (isupper(c))
        printf("uppercase\n");
    return 0;
}
Why does neither islower() nor isupper() work on the letter 'я'?
The answer is that the encoding of the lower-case version of that character in CP1251 is decimal 255; stored in a (signed) char that becomes -1, the usual value of EOF, and your implementation's islower() and isupper() do not classify that value.
You need to track down the source code for the runtime library to see what it does and why.
The solution is to write your own implementations, or wrap the ones you have. Personally, I never use these functions directly because of the many gotchas.
Igor, if your file is UTF-8 it makes no sense to try to use code page 1251, as it has nothing in common with the UTF-8 encoding. Just use the locale ru_RU.UTF-8 and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll need to first convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):
iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt
On the other hand, -fexec-charset=cp1251 only affects the characters in the executable; you have not specified the input charset to use for string literals in your source code. Probably the compiler is determining that from the environment (which you have set in your LANG or LC_CHARSET environment variables).
Only once you control exactly which encodings are used at each stage will you get coherent results.
The main reason an effort is being made to switch all countries to a common charset (UTF-8) is precisely to avoid dealing with all these locale settings at each stage.
If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer; but when you receive a document encoded in UTF-8, you'll have to convert it to be able to see it right.
I mostly recommend you switch to UTF-8, as it is an encoding that supports all countries' character sets, but at this moment that decision is only yours.
NOTE
On Debian Linux:
$ sed 's/^/ /' pru-$$.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>
#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)
int main()
{
    setlocale(LC_ALL, "");
    Q(0xff);
}
Compiled with
$ make pru-$$
cc pru-1342.c -o pru-1342
execution with ru_RU.CP1251 locale
$ locale | sed 's/^/ /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=
$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512
So, glibc is not faulty, the fault is in your code.
The first comment by Jonathan Leffler on the question is true: the isxxx() (and iswxxx()) functions are required to handle an EOF (WEOF) argument
(probably to be fool-proof).
This is why int was chosen as the argument type. When we pass an argument of type char or a character literal, it is
promoted to int (preserving the sign). And because the char type and character literals are signed by default in gcc,
0xFF becomes -1, which by unhappy coincidence is the value of EOF.
Therefore, always cast explicitly when passing parameters of type char (and character literals with code 0xFF) to functions taking an int argument (don't count on char being unsigned, because its signedness is implementation-defined). The cast can be either to (unsigned char), or to (uint8_t), which is less to type (you must include stdint.h).
See also https://sourceware.org/bugzilla/show_bug.cgi?id=20792 and Why passing char as parameter to islower() does not work correctly?
I would like to convert (transliterate) UTF-8 characters to their closest match in ASCII, in C. A character like ú should be transliterated to u. I can do that with iconv on the command line, with iconv -f utf-8 -t ascii//TRANSLIT.
In C, there is a function towctrans to do that, but I only found documentation about two possible transliterations: to lower case and to upper case (see man wctrans). The documentation says wctrans depends on LC_CTYPE. But what other mappings (other than "tolower" and "toupper") are available for a specific LC_CTYPE value?
A simple example with towctrans and the basic toupper transliteration:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main() {
    wchar_t frase[] = L"Amélia";
    int i;

    setlocale(LC_ALL, "");
    for (i = 0; i < wcslen(frase); i++) {
        printf("%lc --> %lc\n", frase[i], towctrans(frase[i], wctrans("toupper")));
    }
}
I know I can do this conversion with libiconv, but I was trying to find out possible already defined wctrans functions.
While the standard allows for implementation-defined or locale-defined transformations via wctrans, I'm not aware of any existing implementations that offer such a feature, and it's certainly not widespread. The iconv approach of //TRANSLIT is also non-standard and in fact conflicts with the standard: POSIX requires a charset name containing a slash character to be interpreted as a pathname to a charmap file, so use of slash for specifying translit-mode is non-conforming.
Edit:
I can only use stdio.h and stdlib.h
I would like to iterate through a char array filled with chars.
However, chars like ä and ö take up twice the space and use two array elements.
This is where my problem lies: I don't know how to access those special chars.
In my example, the char "ä" would use hmm[0] and hmm[1].
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
    char* hmm = "äö";
    printf("%c\n", hmm[0]); // I want to print "ä"
    printf("%zu\n", strlen(hmm));
    return 0;
}
Thanks, I tried to run my attached code in Eclipse; there it works. I assume that is because it uses 64 bits and the "ä" has enough space to fit. strlen confirms that each "ä" is only counted as one element.
So I guess I could somehow tell it to allocate more space for each char (so "ä" can fit)?
#include <stdio.h>
#include <stdlib.h>
int main()
{
    char* hmm = "äüö";
    printf("%c\n", hmm[0]);
    printf("%c\n", hmm[1]);
    printf("%c\n", hmm[2]);
    return 0;
}
A char is always one byte.
In your case you think that "ä" is one char: wrong.
Open your .c source file with a hexadecimal viewer and you will see that ä uses two chars, because the file is encoded in UTF-8.
Now the question is: do you want to use wide characters?
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
int main()
{
    const wchar_t hmm[] = L"äö";

    setlocale(LC_ALL, "");
    wprintf(L"%ls\n", hmm);
    wprintf(L"%lc\n", hmm[0]);
    wprintf(L"%zu\n", wcslen(hmm));
    return 0;
}
Your data is in a multi-byte encoding. Therefore, you need to use multibyte character handling techniques to divvy up the string. For example:
#include <stdio.h>
#include <string.h>
#include <locale.h>
int main(void)
{
    char *hmm = "äö";
    int off = 0;
    int len;
    int max = strlen(hmm);

    setlocale(LC_ALL, "");
    printf("<<%s>>\n", hmm);
    printf("%zu\n", strlen(hmm));
    while (hmm[off] != '\0' && (len = mblen(&hmm[off], max - off)) > 0)
    {
        printf("<<%.*s>>\n", len, &hmm[off]);
        off += len;
    }
    return 0;
}
On my Mac, it produced:
<<äö>>
4
<<ä>>
<<ö>>
The call to setlocale() was crucial; without that, the program runs in the "C" locale instead of my en_US.UTF-8 locale, and mblen() mishandled things:
<<äö>>
4
<<?>>
<<?>>
<<?>>
<<?>>
The question marks appear because the bytes being printed are invalid single bytes as far as the UTF-8 terminal is concerned.
You can also use wide characters and wide-character printing, as shown in benjarobin's answer.
Sorry to drag this on, but I think it's important to highlight some issues. As I understand it, OS X can have UTF-8 as the default OS code page, so this answer is mostly in regard to Windows, which under the hood uses UTF-16 and whose default ACP code page depends on the specified OS region.
Firstly, you can open Character Map and find that both ä and ö reside in code page 1252 (Western), so this is not an MBCS issue. The only way it could be an MBCS issue is if you saved the file using an MBCS encoding (Shift-JIS, Big5, Korean, GBK).
The answer of using setlocale(LC_ALL, "") does not explain why äö was rendered incorrectly in the Command Prompt window.
Command Prompt uses its own code pages, namely OEM code pages; there are references listing the available OEM code pages with their character maps.
Going into Command Prompt and typing the command chcp will reveal the current OEM code page that Command Prompt is using.
The Microsoft documentation for setlocale(LC_ALL, "") details the following behavior:
setlocale( LC_ALL, "" );
Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.
You can do this manually by using chcp and passing your required code page; then run your application and it should output the text perfectly fine.
If it were a multi-byte character set problem, there would be a whole list of other issues:
Under MBCS, characters are encoded in either one or two bytes. In two-byte characters, the first, or "lead-byte," signals that both it and the following byte are to be interpreted as one character. The first byte comes from a range of codes reserved for use as lead bytes. Which ranges of bytes can be lead bytes depends on the code page in use. For example, Japanese code page 932 uses the range 0x81 through 0x9F as lead bytes, but Korean code page 949 uses a different range.
Looking at the situation, and that the length was 4 instead of 2, I would say the file has been saved as UTF-8. (It could in fact have been saved as UTF-16, though you would have run into problems with the compiler sooner rather than later.) You're using characters outside the ASCII range of 0 to 127, so UTF-8 encodes each of those Unicode code points as two bytes. Your compiler opens the file assuming it is in your default OS code page or ANSI. When parsing your string, it interprets the string as an ANSI string: 1 byte = 1 character.
To solve the issue under Windows, convert the UTF-8 string to UTF-16 and print it with wprintf. Currently there is no native UTF-8 support for the ASCII/MBCS stdio functions.
For Mac OS X, which has UTF-8 as the default OS code page, I would recommend following Jonathan Leffler's solution to the problem, because it is more elegant. Though if you port it to Windows later, you will find you need to convert the string from UTF-8 to UTF-16 using the example below.
In either solution, you will still need to change the Command Prompt code page to your operating system's code page to print the characters above ASCII correctly.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
#include <locale>
// File saved as UTF-8, with characters outside the ASCII range
// File saved as UTF-8, with characters outside the ASCII range
int main()
{
    // Set the locale to the user-default ANSI code page obtained from the OS
    setlocale(LC_ALL, "");

    // ä and ö reside outside the ASCII range, in the Western Latin-1 block,
    // and thus take two bytes per code point when saved as UTF-8
    char* hmm = "äö";
    printf("UTF-8 file string using Windows 1252 code page read as:%s\n", hmm);
    printf("Length:%zu\n", strlen(hmm));

    // Convert the UTF-8 string to a wide (UTF-16) string
    int nLen = MultiByteToWideChar(CP_UTF8, 0, hmm, -1, NULL, 0);
    LPWSTR lpszW = new WCHAR[nLen];
    MultiByteToWideChar(CP_UTF8, 0, hmm, -1, lpszW, nLen);

    // Print it (with MSVC's wprintf, %s means a wide string)
    wprintf(L"wprintf wide character of UTF-8 string: %s\n", lpszW);

    // Free the memory
    delete[] lpszW;

    int c = getchar();
    return 0;
}
UTF-8 file string using Windows 1252 code page read as:äö
Length:4
wprintf wide character of UTF-8 string: äö
I would check your Command Prompt font/code page to make sure it can display your OS's single-byte encoding. Note that Command Prompt has its own code page, which differs from your text editor's.