I need to store the letter ñ in a char[], but I can't get it to work. I tried this:
char example[1];
example[0] = 'ñ';
When compiling I get this:
$ gcc example.c
error: character too large for enclosing
character literal type
example[0] = 'ñ';
Does anyone know how to do this?
If you're using High Sierra, you are presumably using a Mac running macOS 10.13.3 (High Sierra), the same as me.
This comes down to code sets and locales — and can get tricky. Mac terminals use UTF-8 by default and ñ is Unicode character U+00F1, which requires two bytes, 0xC3 and 0xB1, to represent it in UTF-8. And the compiler is letting you know that one byte isn't big enough to hold two bytes of data. (In the single-byte code sets such as ISO 8859-1 or 8859-15, ñ has character code 0xF1 — 0xF1 and U+00F1 are similar, and this is not a coincidence; Unicode code points U+0000 to U+00FF are the same as in ISO 8859-1. ISO 8859-15 is a more modern variant of 8859-1, with the Euro symbol € and 7 other variations from 8859-1.)
Another option is to change the character set that your terminal works with; you need to adapt your code to suit the code set that the terminal uses.
You can work around this by using wchar_t:
#include <wchar.h>
void function(void);
void function(void)
{
wchar_t example[1];
example[0] = L'ñ';
putwchar(example[0]);
putwchar(L'\n');
}
#include <locale.h>
int main(void)
{
setlocale(LC_ALL, "");
function();
return 0;
}
This compiles; if you omit the call to setlocale(LC_ALL, "");, it doesn't work as I want (it generates just octal byte \361 (aka 0xF1) and a newline, which generates a ? on the terminal), whereas with setlocale(), it generates two bytes (\303\261 in octal, aka 0xC3 and 0xB1) and you see ñ on the console output.
You can use "extended ASCII". In the extended ASCII chart (IBM code page 437), 'ñ' is represented as 164.
example[0] = (char)164;
You can print this character just like any other character
putchar(example[0]);
As noted in the comments above, this will depend on your environment. It might work on your machine but not another one.
The better answer is to use Unicode, for example:
wchar_t example = L'\u00F1';
This really depends on which character set / locale you will be using. If you want to hardcode this as a latin1 character, this example program does that:
#include <stdio.h>
int main() {
char example[2] = {'\xF1'};
printf("%s", example);
return 0;
}
This, however, results in this output on my system that uses UTF-8:
$ ./a.out
�
So if you want to use non-ascii strings, I'd recommend not representing them as char arrays directly. If you really need to use char directly, the UTF-8 sequence for ñ is two chars wide, and can be written as such (again with a terminating '\0' for good measure):
char s[3] = {"\xC3\xB1"};
Related
My main language is Portuguese, so we have some accented words (with á é í ó ú... etc. characters). I'm trying to read and store those characters in a variable, but it just doesn't work. If I set the value in the code it works, but if I ask the user for input it doesn't. Example code:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
int main(int argc, char *argv[]) {
setlocale(LC_ALL, "Portuguese");
char test, test2; //The same still happens using unsigned char
test = 'í';
printf("Character: %c\n", test);
scanf(" %c", &test2); //The same still happens using fgets in case of a string
printf("Character: %c\n", test2);
system("pause");
return 0;
}
When compiled and executed the code shows:
Character: í
(wait for input, example:) í
Character: ¡
If the input is 'á' it prints ' ' (space), 'é' prints ',', 'ó' prints '¢' and 'ú' prints '£'.
I'm new to programming and Stack Overflow, so sorry for any mistakes I made. Any help is appreciated, thank you.
Also, I'm using Dev-C++ to compile, if that makes any difference.
You need to recognize that a char in C is a numeric type of size 1 byte. It is not really intended to hold a single character of a language (sometimes called a code point).
You have two options to deal with this situation:
Use a single-byte character encoding (e.g. the appropriate member of the ISO 8859 family, ISO 8859-1 in your case). This ensures that every character fits into a single byte.
Handle your input with the proper mechanisms for multibyte characters. You might look at the char16_t or char32_t types, or turn to wchar_t and the related library routines.
Add the ru_RU.CP1251 locale (on Debian, uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen) and compile the following program with gcc -fexec-charset=cp1251 test.c (the input file is in UTF-8). Neither "lowercase" nor "uppercase" is printed. Only the letter 'я' is affected.
Other letters are correctly identified as lowercase or uppercase.
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
int main (void)
{
setlocale(LC_ALL, "ru_RU.CP1251");
char c = 'я';
int i;
char z;
for (i = 7; i >= 0; i--) {
z = 1 << i;
if ((z & c) == z) printf("1"); else printf("0");
}
printf("\n");
if (islower(c))
printf("lowercase\n");
if (isupper(c))
printf("uppercase\n");
return 0;
}
Why neither islower() nor isupper() work on letter я?
The answer is that the encoding for the lower case version of that character in CP 1251 is decimal 255, and islower() and isupper() for your implementation do not accept or return that value (which is often interpreted as EOF).
You need to track down the source code for the runtime library to see what it does and why.
The solution is to write your own implementations, or wrap the ones you have. Personally, I never use these functions directly because of the many gotchas.
Igor, if your file is UTF-8, it makes no sense to try to use code page 1251, as it has nothing in common with the UTF-8 encoding. Just use the locale ru_RU.UTF-8 and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll first need to convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):
iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt
On the other hand, -fexec-charset=cp1251 only affects the characters used in the executable, but you have not specified the input charset used for string literals in your source code (GCC's -finput-charset option). The compiler probably determines that from the environment (the LANG or LC_CTYPE environment variables).
Only once you control exactly which locales are used at each stage will you get coherent results.
The main reason an effort is being made to switch all countries to a common charset (UTF) is precisely to avoid dealing with all these locale settings at each stage.
If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer, and when you receive a document encoded in UTF-8, you'll have to convert it to be able to read it correctly.
I recommend switching to UTF-8, as it is an encoding that supports the character sets of all countries, but in the end that decision is yours.
NOTE
On debian linux:
$ sed 's/^/ /' pru-$$.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>
#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)
int main()
{
setlocale(LC_ALL, "");
Q(0xff);
}
Compiled with
$ make pru-$$
cc pru-1342.c -o pru-1342
execution with ru_RU.CP1251 locale
$ locale | sed 's/^/ /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=
$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512
So, glibc is not faulty, the fault is in your code.
The first comment of Jonathan Leffler to OP is true. isxxx() (and iswxxx()) functions are required to handle EOF (WEOF) argument
(probably to be fool-proof).
This is why int was chosen as the argument type. When we pass argument of type char or character literal, it is
promoted to int (preserving the sign). And because by default char type and character literals are signed in gcc,
0xFF becomes -1, which is by unhappy coincidence the value of EOF.
Therefore always do explicit typecasting when passing parameters of type char (and character literals with code 0xFF) to functions, using int argument type (don't count on the unsignedness of char, because it is implementation-defined). Typecasting may be either done via (unsigned char), or via (uint8_t), which is less to type (you must include stdint.h).
See also https://sourceware.org/bugzilla/show_bug.cgi?id=20792 and Why passing char as parameter to islower() does not work correctly?
Code:
#include <stdio.h>
#include <wchar.h>
#define USE_W
int main()
{
#ifdef USE_W
const wchar_t *ae_utf16 = L"\x00E6 & ASCII text ae\n";
wprintf(ae_utf16);
#else
const char *ae_utf8 = "\xC3\xA6 & ASCII text ae\n";
printf(ae_utf8);
#endif
return 0;
}
Output:
ae & ASCII text ae
While printf produces correct UTF-8 output:
æ & ASCII text ae
You can test this here.
printf just sends raw bytes to your terminal; it does not know anything about encodings. If your terminal happens to be configured to interpret that as UTF-8, it will show the right characters.
wprintf, on the other hand, does know about encodings. It behaves as though it uses the function wcrtomb, which encodes a wide character (wchar_t) into a multibyte sequence, depending on the current locale. If the default locale happens to be "C", which is quite minimalistic, the character æ gets converted to the "more or less equivalent" byte sequence ae.
If you set the locale explicitly to something using UTF-8, like "en_US.UTF-8", the output is as expected. Of course, the set of supported locales differs per system, so it's no good to hardcode this.
I have a UTF-8 character in Chinese or Arabic. I need to get the value of that UTF-8 character, just as I would get the value of an ASCII character. I need to implement it in C. Can you please provide suggestions?
For example:
char array[3] = "ab";
int v1,v2;
v1 = array[0];
v2 = array[1];
In the above code I will get the corresponding ASCII values in v1 and v2. In the same way, for a UTF-8 string I need to get the value of each character in the string.
Only the C11 standard version of the C language offers UTF-8 support, so depending on what standard you are targeting, you can use the C11 features (<uchar.h>) or rely on a UTF library such as ICU.
There is no such thing as a UTF-8 character. There are Unicode characters and there are encodings for Unicode characters such as UTF-8.
What you probably want is to decode several bytes - encoded in UTF-8 and representing a single Unicode character - into the Unicode code point.
There's a lot of C source code for this available on the net. Just google for UTF-8 decoding C.
Update:
What you're obviously looking for is UTF-8 decoding for more than just one character, namely a function that decodes an array of bytes (UTF-8 encoded text) into an array of ints (Unicode code points).
The answer remains the same: use Google. There's a lot of C code for it out there.
C and C++ model is that the encoding is tied to the locale, so code using that model works for the encoding of the locale, whatever it is.
If you have a locale that uses UTF-8 for the narrow encoding, see mbtowc(), mbrtowc(), mbstowcs() and mbsrtowcs(); they should be pretty straightforward to use.
With icu, you can skip through utf8 characters with U8_NEXT
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unicode/utf.h>
#include <unicode/ustring.h>
int main(int argc, char **argv)
{
const char s[] = "日本語";
UChar32 c;
int32_t k;
int32_t len = strlen(s);
for (k = 0; k < len;) {
U8_NEXT(s, k, len, c);
printf("%d - %x\n", k, c);
}
return 0;
}
To compile with gcc utf.c -o utf $(icu-config --ldflags --ldflags-icuio)
U8_NEXT advances the index k past the character it has just decoded, so the k printed here is the offset at which the next character starts. c contains the Unicode value (32 bits) of the decoded character.
Edit:
I can only use stdio.h and stdlib.h
I would like to iterate through a char array filled with chars.
However chars like ä,ö take up twice the space and use two elements.
This is where my problem lies, I don't know how to access those special chars.
In my example the char "ä" would use hmm[0] and hmm[1].
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
char* hmm = "äö";
printf("%c\n", hmm[0]); //i want to print "ä"
printf("%zu\n", strlen(hmm));
return 0;
}
Thanks, I tried to run my attached code in Eclipse, and there it works. I assume that's because it uses 64 bits and the "ä" has enough space to fit. strlen confirms that each "ä" is only counted as one element.
So I guess I could somehow tell it to allocate more space for each char (so "ä" can fit)?
#include <stdio.h>
#include <stdlib.h>
int main()
{
char* hmm = "äüö";
printf("%c\n", hmm[0]);
printf("%c\n", hmm[1]);
printf("%c\n", hmm[2]);
return 0;
}
A char always uses one byte.
In your case you think that "ä" is one char: wrong.
Open your .c source file in a hexadecimal viewer and you will see that ä takes up 2 chars, because the file is encoded in UTF-8.
Now the question is: do you want to use wide characters?
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
int main()
{
const wchar_t hmm[] = L"äö";
setlocale(LC_ALL, "");
wprintf(L"%ls\n", hmm);
wprintf(L"%lc\n", hmm[0]);
wprintf(L"%zu\n", wcslen(hmm));
return 0;
}
Your data is in a multi-byte encoding. Therefore, you need to use multibyte character handling techniques to divvy up the string. For example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
int main(void)
{
char* hmm = "äö";
int off = 0;
int len;
int max = strlen(hmm);
setlocale(LC_ALL, "");
printf("<<%s>>\n", hmm);
printf("%zu\n", strlen(hmm));
while (hmm[off] != '\0' && (len = mblen(&hmm[off], max - off)) > 0)
{
printf("<<%.*s>>\n", len, &hmm[off]);
off += len;
}
return 0;
}
On my Mac, it produced:
<<äö>>
4
<<ä>>
<<ö>>
The call to setlocale() was crucial; without that, the program runs in the "C" locale instead of my en_US.UTF-8 locale, and mblen() mishandled things:
<<äö>>
4
<<?>>
<<?>>
<<?>>
<<?>>
The question marks appear because the bytes being printed are invalid single bytes as far as the UTF-8 terminal is concerned.
You can also use wide characters and wide-character printing, as shown in benjarobin's answer.
Sorry to drag this on, though I think it's important to highlight some issues. As I understand it, OS X can use UTF-8 as the default OS code page, so this answer mostly concerns Windows, which uses UTF-16 under the hood and whose default ANSI code page (ACP) depends on the configured OS region.
Firstly, you can open Character Map and find that
äö
both reside in code page 1252 (Western), so this is not an MBCS issue. The only way it could be an MBCS issue is if you saved the file using an MBCS encoding (Shift-JIS, Big5, Korean, GBK).
The answer of using
setlocale( LC_ALL, "" )
does not give insight into why äö was rendered incorrectly in the Command Prompt window.
Command Prompt uses its own code pages, namely the OEM code pages. Here is a reference to the available OEM code pages with their character maps.
Going into Command Prompt and typing the chcp command will reveal the OEM code page that Command Prompt is currently using.
The Microsoft documentation for setlocale(LC_ALL, "") describes the following behavior.
setlocale( LC_ALL, "" );
Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.
You can do this manually by using chcp and passing your required code page; then run your application and it should output the text perfectly fine.
If it were a multibyte character set problem, there would be a whole list of other issues:
Under MBCS, characters are encoded in either one or two bytes. In two-byte characters, the first, or "lead-byte," signals that both it and the following byte are to be interpreted as one character. The first byte comes from a range of codes reserved for use as lead bytes. Which ranges of bytes can be lead bytes depends on the code page in use. For example, Japanese code page 932 uses the range 0x81 through 0x9F as lead bytes, but Korean code page 949 uses a different range.
Looking at the situation, and given that the length was 4 instead of 2, I would say that the file was saved as UTF-8 (it could in fact have been saved as UTF-16, though you would have run into problems with the compiler sooner rather than later). You're using characters outside the ASCII range of 0 to 127, so UTF-8 encodes each Unicode code point as two bytes. Your compiler opens the file and assumes it is in your default OS code page. When parsing your string, it interprets it as an ANSI string in which 1 byte = 1 character.
To solve the issue under Windows, convert the UTF-8 string to UTF-16 and print it with wprintf. Currently there is no native UTF-8 support in the ASCII/MBCS stdio functions.
For Mac OS X, which has UTF-8 as the default OS code page, I would recommend following Jonathan Leffler's solution to the problem because it is more elegant. If you port it to Windows later, though, you will find that you need to convert the string from UTF-8 to UTF-16 using the example below.
In either solution you will still need to change the Command Prompt code page to your operating system code page to print the characters above ASCII correctly.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
#include <locale.h>
// C++ source file saved as UTF-8, with characters outside the ASCII range
int main()
{
    // Set the C runtime locale to the user-default ANSI code page
    setlocale(LC_ALL, "");
    // äö lie outside the ASCII range, in the Latin-1 part of Unicode,
    // so each one needs a lead byte plus a trail byte when saved as UTF-8
    const char* hmm = "äö";
    printf("UTF-8 file string using Windows 1252 code page read as:%s\n", hmm);
    printf("Length:%zu\n", strlen(hmm));
    // Ask for the required buffer size, in wide characters, including the terminator
    int nLen = MultiByteToWideChar(CP_UTF8, 0, hmm, -1, NULL, 0);
    LPWSTR lpszW = new WCHAR[nLen];
    // Convert the UTF-8 string to UTF-16
    MultiByteToWideChar(CP_UTF8, 0, hmm, -1, lpszW, nLen);
    // Print it (%ls is the standard conversion for a wide string)
    wprintf(L"wprintf wide character of UTF-8 string: %ls\n", lpszW);
    // Free the memory
    delete[] lpszW;
    // Wait for a key press before exiting
    getchar();
    return 0;
}
UTF-8 file string using Windows 1252 code page read as:äö
Length:4
wprintf wide character of UTF-8 string: äö
I would check your Command Prompt font and code page to make sure that it can display your OS single-byte encoding. Note that Command Prompt has its own code page, which differs from the one your text editor uses.