I would like to convert (transliterate) UTF-8 characters to their closest ASCII match in C. A character like ú would be transliterated to u. I can do that on the command line with iconv -f utf-8 -t ascii//TRANSLIT.
In C, there is a function towctrans for such mappings, but I have only found documentation for two possible transliterations: to lower case and to upper case (see man wctrans). According to the documentation, wctrans depends on LC_CTYPE. But what other mappings (besides "tolower" and "toupper") are available for a specific LC_CTYPE value?
A simple example with towctrans and the basic toupper transliteration:
#include <stdio.h>
#include <wchar.h>
#include <wctype.h> /* for towctrans/wctrans */
#include <locale.h>

int main() {
    wchar_t frase[] = L"Amélia";
    int i;
    setlocale(LC_ALL, "");
    for (i = 0; i < wcslen(frase); i++) {
        printf("%lc --> %lc\n", frase[i], towctrans(frase[i], wctrans("toupper")));
    }
}
I know I can do this conversion with libiconv, but I was trying to find out possible already defined wctrans functions.
While the standard allows for implementation-defined or locale-defined transformations via wctrans, I'm not aware of any existing implementations that offer such a feature, and it's certainly not widespread. The iconv approach of //TRANSLIT is also non-standard and in fact conflicts with the standard: POSIX requires a charset name containing a slash character to be interpreted as a pathname to a charmap file, so use of slash for specifying translit-mode is non-conforming.
Related
I am trying to draw a square with a given width and height.
I am trying to do so while using the box characters from Unicode.
I am using this code:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include "string_prints.h"

#define VERTICAL_PIPE L"║"
#define HORIZONTAL_PIPE L"═"
#define UP_RIGHT_CORNER L"╗"
#define UP_LEFT_CORNER L"╔"
#define DOWN_RIGHT_CORNER L"╝"
#define DOWN_LEFT_CORNER L"╚"

// Function to print the top line
void DrawUpLine(int w){
    setlocale(LC_ALL, "");
    wprintf(UP_LEFT_CORNER);
    for (int i = 0; i < w; i++)
    {
        wprintf(HORIZONTAL_PIPE);
    }
    wprintf(UP_RIGHT_CORNER);
}

// Function to print the sides
void DrawSides(int w, int h){
    setlocale(LC_ALL, "");
    for (int i = 0; i < h; i++)
    {
        wprintf(VERTICAL_PIPE);
        for (int j = 0; j < w; j++)
        {
            putchar(' ');
        }
        wprintf(VERTICAL_PIPE);
        putchar('\n');
    }
}

// Function to print the bottom line
void DrawDownLine(int w){
    setlocale(LC_ALL, "");
    wprintf(DOWN_LEFT_CORNER);
    for (int i = 0; i < w; i++)
    {
        wprintf(HORIZONTAL_PIPE);
    }
    wprintf(DOWN_RIGHT_CORNER);
}

void DrawFrame(int w, int h){
    DrawUpLine(w);
    putchar('\n');
    DrawSides(w, h);
    putchar('\n');
    DrawDownLine(w);
}
But when I run this code with some int values, I get output with seemingly random spaces and newlines (although the pipes appear in the correct order).
It is called from main.c, which includes the header, like so:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include "string_prints.h"

int main(){
    DrawFrame(10, 20); // Calling the function
    return 0;
}
Also, as you can see, I don't understand the correct use of setlocale: do you need to call it only once, or multiple times?
Any help appreciated, thanks in advance!
Also as you can see I don't understand the correct use of setlocale, do you need to do it only once? or more?
Locale changes applied via setlocale() persist for the calling process. You do not need to call the function multiple times unless you want to make multiple changes. But you do need to pass it a locale name that serves your intended purpose; if you call it with an empty string, then you (or the program's user) need to ensure that the environment variables defining the various locale categories are set to values that suit that purpose.
But when I am running this code with some int values I get an output
with seemingly random spaces and newlines.
That sounds like the result of a character-encoding mismatch, or even two (but see also below):
there can be a runtime mismatch because the locale you tell the program to use for output does not match the one expected by the output device (e.g. a terminal) with which the program's output is displayed, and
there can also be a compile time mismatch between the actual character encoding of your source file and the encoding the compiler interprets it as having.
Additionally, use of wide string literal syntax notwithstanding, it is implementation-dependent which characters other than C's basic set may appear in your source code. The wide syntax specifies mostly the form of the storage for the literal (elements of type wchar_t), not so much what character values are valid or how they are interpreted.
Note also that the width of wchar_t is implementation-dependent, and it can be as small as eight bits. It is not necessarily the case that a wchar_t can represent arbitrary Unicode characters -- in fact, it is pretty common for wchar_t to be 16 bits wide, which in fact isn't wide enough for the majority of characters from Unicode's 21-bit code space. You might get an internal representation of wider characters in a two-unit form, such as a UTF-16 surrogate pair, but you also might not -- a great deal of this is left to individual implementations.
Among those things, what encoding the compiler expects, under what circumstances, and how you can influence that are all implementation-dependent. For GCC, for instance, the default source ("input") character set is UTF-8, and you can define a different one via its -finput-charset option. You can also specify both a standard and a wide execution character set via the -fexec-charset and -fwide-exec-charset options, if you wish to do so. GCC relies on iconv for conversions, both at compile time (source charset to execution charset) and at runtime (from execution charset to locale charset). Other implementations have other options (or none), with their own semantics.
So what should you do? In the first place, I suggest taking the source character set out of the equation by using UTF-8 string literals expressed using only the basic character set (this requires C2011):
#define VERTICAL_PIPE u8"\xe2\x95\x91"
#define HORIZONTAL_PIPE u8"\xe2\x95\x90"
#define UP_RIGHT_CORNER u8"\xe2\x95\x97"
#define UP_LEFT_CORNER u8"\xe2\x95\x94"
#define DOWN_RIGHT_CORNER u8"\xe2\x95\x9d"
#define DOWN_LEFT_CORNER u8"\xe2\x95\x9a"
Note well that the resulting strings are normal (byte) strings, not wide ones, so you should not use the wide-oriented output functions with them. Instead, use the normal printf, putchar, etc.
And that brings us to another issue with your code: you must not mix wide-oriented and byte-oriented functions writing to the same stream without taking explicit measures to switch (freopen or fwide; see paragraph 7.21.2/4 of the standard). In practice, mixing the two can quite plausibly produce mangled results.
Then also ensure that your local environment variables are set correctly for your actual environment. Chances are good that they already are, but it's worth a check.
I am trying to make a simple ancient-Greek-to-modern-Greek converter in C, by changing the tones of the vowels. For example, the user types a text in Greek which contains the character ῶ (Unicode U+1FF6), and the program converts it into ώ (Unicode U+1F7D). Greek is not supported by C, so I don't know how to make it work. Any ideas?
Assuming you use a sane operating system (meaning, not Windows), this is very easy to achieve using C99/C11 locale and wide character support. Consider filter.c:
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <stdio.h>

wint_t convert(const wint_t wc)
{
    switch (wc) {
    case L'ῶ': return L'ώ';
    default:   return wc;
    }
}

int main(void)
{
    wint_t wc;

    if (!setlocale(LC_ALL, "")) {
        fprintf(stderr, "Current locale is unsupported.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdin, 1) <= 0) {
        fprintf(stderr, "Standard input does not support wide characters.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdout, 1) <= 0) {
        fprintf(stderr, "Standard output does not support wide characters.\n");
        return EXIT_FAILURE;
    }

    while ((wc = fgetwc(stdin)) != WEOF)
        fputwc(convert(wc), stdout);

    return EXIT_SUCCESS;
}
The above program reads standard input, converts each ῶ into a ώ, and outputs the result.
Note that wide character strings and characters have an L prefix; L'ῶ' is a wide character constant. These are only in Unicode if the execution character set (the character set the code is compiled for) is Unicode, and that depends on your development environment. (Fortunately, outside of Windows, UTF-8 is pretty much a standard nowadays -- and that is a good thing -- so code like the above Just Works.)
On POSIXy systems (like Linux, Android, Mac OS, BSDs), you can use the iconv() facilities to convert from any input character set to Unicode, do the conversion there, and finally convert back to any output character set. Unfortunately, the question is not tagged posix, so that is outside this particular question.
The above example uses a simple switch/case statement. If there are many replacement pairs, one could use e.g.
typedef struct {
wint_t from;
wint_t to;
} widepair;
static widepair replace[] = {
{ L'ῶ', L'ώ' },
/* Others? */
};
#define NUM_REPLACE (sizeof replace / sizeof replace[0])
and at runtime, sort replace[] (using qsort() and a function that compares the from elements), and use binary search to quickly determine whether a wide character is to be replaced (and if so, with which wide character). Because this is an O(log₂ N) operation, with N being the number of pairs, and it utilizes the cache okay, even thousands of replacement pairs are not a problem this way. (And of course, you can build the replacement array at runtime just as well, even from user input or command-line options.)
For Unicode characters, we could use a uint32_t map_to[0x110000]; to directly map each code point to another Unicode code point, but because we do not know whether wide characters are Unicode or not, we cannot do that; we do not know the code range of the wide characters until after compile time. Of course, we can do a multi-stage compilation, where a test program generates the replace[] array shown above, and outputs their codes in decimal; then do some kind of auto-grouping or clustering, for example bit maps or hash tables, to do it "even faster".
However, in practice it usually turns out that the I/O (reading and writing the data) takes more real-world time than the conversion itself. Even when the conversion is the bottleneck, the conversion rate is sufficient for most humans. (As an example, when compiling C or C++ code with the GNU utilities, the preprocessor first converts the source code to UTF-8 internally.)
Okay, here's some quick advice. I wouldn't use C, because Unicode is not well supported (yet).
A better language choice would be Python, Java, ..., anything with good Unicode support.
I'd write a utility that reads from standard input and writes to standard output. This makes it easy to use from the command line and in scripts.
I might be missing something but it's going to be something like this (in pseudo code):
while ((inCharacter = getCharacterFromStandardInput()) != EOF)
{
    switch (inCharacter)
    {
        case 'ῶ': outCharacter = 'ώ'; break
        ...
    }
    writeCharacterToStandardOutput(outCharacter)
}
You'll also need to select & handle the format: UTF-8/16/32.
That's it. Good luck!
Add the ru_RU.CP1251 locale (on Debian, uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen) and
compile the following program with gcc -fexec-charset=cp1251 test.c (the input file is in UTF-8). Neither "lowercase" nor "uppercase" is printed. Only the letter 'я' behaves this way;
other letters are reported as either lowercase or uppercase just fine.
#include <locale.h>
#include <ctype.h>
#include <stdio.h>

int main (void)
{
    setlocale(LC_ALL, "ru_RU.CP1251");
    char c = 'я';
    int i;
    char z;
    for (i = 7; i >= 0; i--) {
        z = 1 << i;
        if ((z & c) == z) printf("1"); else printf("0");
    }
    printf("\n");
    if (islower(c))
        printf("lowercase\n");
    if (isupper(c))
        printf("uppercase\n");
    return 0;
}
Why does neither islower() nor isupper() work on the letter я?
The answer is that the encoding of the lower-case version of that character in CP1251 is decimal 255, and islower() and isupper() in your implementation do not accept that value when it is passed as a plain char (it is often interpreted as EOF).
You would need to track down the source code of the runtime library to see exactly what it does and why.
The solution is to write your own implementations, or to wrap the ones you have. Personally, I never use these functions directly because of the many gotchas.
Igor, if your file is UTF-8, it makes no sense to use code page 1251, as it has nothing in common with the UTF-8 encoding. Just use the locale ru_RU.UTF-8 and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll first need to convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):
iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt
On the other side, -fexec-charset=cp1251 only affects the characters in the executable; you have not specified the input charset to use for string literals in your source code (GCC's -finput-charset option). Probably the compiler is determining that from the environment (which you have set in your LANG or LC_CTYPE environment variables).
Only once you control exactly which locales are used at each stage will you get coherent results.
The main reason an effort is being made to switch all countries to a common charset (UTF-8) is precisely to avoid having to deal with all these locale settings at every stage.
If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer; but when you receive a document encoded in UTF-8, you'll have to convert it to be able to see it correctly.
I mostly recommend you switch to UTF-8, as it's an encoding that supports all countries' character sets, but at this moment that decision is only yours.
NOTE
On debian linux:
$ sed 's/^/ /' pru-$$.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>

#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)

int main()
{
    setlocale(LC_ALL, "");
    Q(0xff);
}
Compiled with
$ make pru-$$
cc pru-1342.c -o pru-1342
execution with ru_RU.CP1251 locale
$ locale | sed 's/^/ /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=
$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512
So, glibc is not faulty, the fault is in your code.
The first comment by Jonathan Leffler under the question is correct: the isxxx() (and iswxxx()) functions are required to handle an EOF (WEOF) argument
(probably to be fool-proof).
This is why int was chosen as the argument type. When we pass an argument of type char, or a character literal, it is
promoted to int (preserving the sign). And because char and character literals are signed by default in gcc,
0xFF becomes -1, which by unhappy coincidence is the value of EOF.
Therefore, always cast explicitly when passing parameters of type char (and character literals with code 0xFF) to functions taking an int argument (don't count on char being unsigned, because its signedness is implementation-defined). The cast can be done either via (unsigned char), or via (uint8_t), which is less to type (you must include stdint.h).
See also https://sourceware.org/bugzilla/show_bug.cgi?id=20792 and Why passing char as parameter to islower() does not work correctly?
I have this program:
#include <ncurses.h>
int main () {
    initscr();
    mvaddstr(0, 0, " A B C D E ");
    mvaddstr(1, 24, "ñandñ");
    mvaddstr(1, 34, "esdrñjulñ");
    refresh();
    getch();
    endwin();
    return 0;
}
When ncurses prints the first word (ñandñ), something happens so that when moving to another position afterwards (in this case to 1, 34), it actually moves to a different position, and the second word gets printed in another column.
So what it should look like:
A B C D E
ñandñ esdrñjulñ
looks like this:
A B C D E
ñandñ esdrñjulñ
because of the two multibyte 'ñ' characters in the first word.
Any idea what is wrong?
Thanks!
If you want to use multibyte UTF-8 characters, you must use the version of the ncurses library compiled with multibyte support, and you need to set the locale correctly at the beginning of your program.
The multibyte ncurses library is usually called libncursesw, so to use it, it is sufficient to change -lncurses to -lncursesw in your linker options. You do not need a different header file. However, if you actually want to use the wide-character functions, you must #define the symbol _XOPEN_SOURCE_EXTENDED before any #include directive. The easiest way to do that with gcc (or clang) is to add -D_XOPEN_SOURCE_EXTENDED to your compiler options.
If your shell's locale has been set to a UTF-8 locale (which will usually be the case on a modern Linux distribution), it is sufficient to insert
setlocale(LC_ALL, "");
before any call to an ncurses routine. That sets the executable's locale to the locale configured by the environment variables. You'll need to add
#include <locale.h>
to your header includes. No special library needs to be linked for locale support.
The original question indicated that mvaddwstr was being used. That's generally a better solution for multibyte characters, but as indicated above, you can use the narrow string interfaces if you want to. However, you cannot output incomplete UTF-8 sequences, so the single-character narrow interfaces like addch can only be used with character codes less than 128.
This note applied to an attempt to call mvaddwstr with a char* instead of a wchar_t* argument. Don't do that:
Like all of the w ncurses functions which accept strings, mvaddwstr takes a wchar_t* as its string argument. Your compiler should have warned you about that (unless it warned you that there was no prototype for mvaddwstr). So the strings should be prefixed with the L length attribute: L"ñandñ".
I'm able to output a single character using this code:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main() {
    setlocale(LC_CTYPE, "");
    wchar_t a = L'Ö';
    putwchar(a);
}
How can I adapt the code to output a string?
Something like
wchar_t *a = L"ÖÜÄöüä";
wprintf("%ls", a);
wprintf(L"%ls", str)
It's a bit tricky, you have to know what your internal wchar_ts mean. (See here for a little discussion.) Basically you should communicate with the environment via mbstowcs/wcstombs, and with data with known encoding via iconv (converting from and to WCHAR_T).
(The exception here is Windows, where you can't really communicate with the environment meaningfully, but you can access it in a wide version directly with Windows API functions, and you can write wide strings directly into message boxes etc.)
That said, once you have your internal wide string, you can convert it to the environment's multibyte string with wcstombs, or you can just use printf("%ls", mywstr); which performs the conversion for you. Just don't forget to call setlocale(LC_CTYPE, "") at the very beginning of your program.