Edit:
I can only use stdio.h and stdlib.h
I would like to iterate through a char array filled with chars.
However, chars like ä and ö take up twice the space, using two array elements each.
This is where my problem lies: I don't know how to access those special chars.
In my example the char "ä" would use hmm[0] and hmm[1].
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    char *hmm = "äö";
    printf("%c\n", hmm[0]); // I want to print "ä"
    printf("%zu\n", strlen(hmm));
    return 0;
}
Thanks, I tried to run my attached code in Eclipse, and there it works. I assume that's because it uses 64 bits and the "ä" has enough space to fit; strlen confirms that each "ä" is only counted as one element.
So I guess I could somehow tell it to allocate more space for each char (so "ä" can fit)?
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char *hmm = "äüö";
    printf("%c\n", hmm[0]);
    printf("%c\n", hmm[1]);
    printf("%c\n", hmm[2]);
    return 0;
}
A char always uses one byte.
In your case you think that "ä" is one char: wrong.
Open your .c source code with a hexadecimal viewer and you will see that ä uses two chars, because the file is encoded in UTF-8 (ä is the byte pair 0xC3 0xA4).
Now the question is: do you want to use wide characters?
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    const wchar_t hmm[] = L"äö";

    setlocale(LC_ALL, "");
    wprintf(L"%ls\n", hmm);
    wprintf(L"%lc\n", hmm[0]);
    wprintf(L"%zu\n", wcslen(hmm));
    return 0;
}
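With a UTF-8 terminal this should print the whole string, then ä on its own line, then the length 2, since wcslen() counts wide characters rather than bytes.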
Your data is in a multi-byte encoding. Therefore, you need to use multibyte character handling techniques to divvy up the string. For example:
#include <stdio.h>
#include <stdlib.h>   /* for mblen() */
#include <string.h>
#include <locale.h>

int main(void)
{
    char *hmm = "äö";
    int off = 0;
    int len;
    int max = strlen(hmm);

    setlocale(LC_ALL, "");
    printf("<<%s>>\n", hmm);
    printf("%zu\n", strlen(hmm));
    while (hmm[off] != '\0' && (len = mblen(&hmm[off], max - off)) > 0)
    {
        printf("<<%.*s>>\n", len, &hmm[off]);
        off += len;
    }
    return 0;
}
On my Mac, it produced:
<<äö>>
4
<<ä>>
<<ö>>
The call to setlocale() was crucial; without that, the program runs in the "C" locale instead of my en_US.UTF-8 locale, and mblen() mishandled things:
<<äö>>
4
<<?>>
<<?>>
<<?>>
<<?>>
The question marks appear because the bytes being printed are invalid single bytes as far as the UTF-8 terminal is concerned.
You can also use wide characters and wide-character printing, as shown in benjarobin's answer.
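If you want each character as a single wchar_t value rather than a byte span, mbtowc() can do the piecewise conversion under the same locale assumptions; a minimal sketch:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    char *hmm = "äö";
    wchar_t wc;
    int off = 0;
    int len;

    setlocale(LC_ALL, "");
    /* mbtowc() converts one multibyte sequence and reports its byte length */
    while ((len = mbtowc(&wc, &hmm[off], MB_CUR_MAX)) > 0)
    {
        wprintf(L"<<%lc>>\n", wc);
        off += len;
    }
    return 0;
}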
Sorry to drag this on, but I think it's important to highlight some issues. As I understand it, OS X can have UTF-8 as the default OS code page, so this answer is mostly about Windows, which uses UTF-16 under the hood and whose default ACP code page depends on the configured OS region.
Firstly, you can open Character Map and find that both
äö
reside in code page 1252 (Western), so this is not an MBCS issue. The only way it could be an MBCS issue is if you saved the file in an MBCS encoding (Shift-JIS, Big5, Korean, GBK).
The answer of using
setlocale(LC_ALL, "")
does not give insight into why äö was rendered incorrectly in the Command Prompt window.
Command Prompt uses its own code pages, namely OEM code pages, each with its own character map.
Going into Command Prompt and typing the command chcp will reveal the current OEM code page that it is using.
The Microsoft documentation for
setlocale(LC_ALL, "");
details the following behavior: it sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.
You can set the console code page manually by passing your required code page to chcp, then run your application and it should output the text perfectly fine, as shown below.
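For example (app.exe here is just a placeholder name; running chcp with no argument merely reports the current code page):
chcp 1252
Active code page: 1252
app.exe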
If it were a multi-byte character set problem, there would be a whole list of other issues:
Under MBCS, characters are encoded in either one or two bytes. In two-byte characters, the first, or "lead-byte," signals that both it and the following byte are to be interpreted as one character. The first byte comes from a range of codes reserved for use as lead bytes. Which ranges of bytes can be lead bytes depends on the code page in use. For example, Japanese code page 932 uses the range 0x81 through 0x9F as lead bytes, but Korean code page 949 uses a different range.
Looking at the situation, and given that the length was 4 instead of 2, I would say the file has been saved as UTF-8 (it could in fact have been saved as UTF-16, though you would have run into problems with the compiler sooner rather than later). You're using characters outside the ASCII range of 0 to 127, so UTF-8 encodes each of these Unicode code points as two bytes. Your compiler opens the file assuming it is in your default OS code page, and when parsing your string it interprets it as an ANSI string where 1 byte = 1 character.
To solve the issue under Windows, convert the UTF-8 string to UTF-16 and print it with wprintf. Currently there is no native UTF-8 support in the ASCII/MBCS stdio functions.
For Mac OS X, which has UTF-8 as its default OS code page, I would recommend following Jonathan Leffler's solution because it is more elegant. If you port it to Windows later, though, you will find you need to convert the string from UTF-8 to UTF-16 using the example below.
With either solution, you will still need to change the Command Prompt code page to your operating system code page to print characters above ASCII correctly.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
#include <clocale>

// File saved as UTF-8, with characters outside the ASCII range
int main()
{
    // Set the locale to the default OS code page
    setlocale(LC_ALL, "");

    // äö reside outside of the ASCII range, in the Unicode Western Latin-1 block;
    // thus each code point takes two bytes (lead byte + continuation) in UTF-8
    char *hmm = "äö";
    printf("UTF-8 file string using Windows 1252 code page read as:%s\n", hmm);
    printf("Length:%zu\n", strlen(hmm));

    // Convert the UTF-8 string to wide characters
    int nLen = MultiByteToWideChar(CP_UTF8, 0, hmm, -1, NULL, 0);
    LPWSTR lpszW = new WCHAR[nLen];
    MultiByteToWideChar(CP_UTF8, 0, hmm, -1, lpszW, nLen);

    // Print it
    wprintf(L"wprintf wide character of UTF-8 string: %s\n", lpszW);

    // Free the memory
    delete[] lpszW;

    int c = getchar();
    return 0;
}
UTF-8 file string using Windows 1252 code page read as:äö
Length:4
wprintf wide character of UTF-8 string: äö
I would check your Command Prompt font/code page to make sure that it can display your OS single-byte encoding; note that Command Prompt has its own code page, which differs from your text editor's.
Related
Another day, another problem with strings in C. Let's say I have a text file named fileR.txt and I want to print its contents. The file goes like this:
Letter á
Letter b
Letter c
Letter ê
I would like to read it and show it on the screen, so I tried the following code:
#include <stdlib.h>
#include <locale.h>
#include <clocale>
#include <stdio.h>
#include <conio.h>
#include <wchar.h>

int main()
{
    FILE *pF;
    char line[512]; // Current line

    setlocale(LC_ALL, "");
    pF = fopen("Aulas\\source\\fileR.txt", "r");
    while (!feof(pF))
    {
        fgets(line, 512, pF);
        fputs(line, stdout);
    }
    return 0;
}
And the output was:
Letter á
Letter b
Letter c
Letter ê
I then attempted to use wchar_t to do it:
#include <stdlib.h>
#include <locale.h>
#include <clocale>
#include <stdio.h>
#include <conio.h>
#include <wchar.h>

int main()
{
    FILE *pF;
    wchar_t line[512]; // Current line

    setlocale(LC_ALL, "");
    pF = fopen("Aulas\\source\\fileR.txt", "r");
    while (!feof(pF))
    {
        fgetws(line, 512, pF);
        fputws(line, stdout);
    }
    return 0;
}
The output was even worse:
Letter ÃLetter b
Letter c
Letter Ã
I have seen people suggesting the use of an unsigned char array, but that simply results in an error, as the stdio functions made for input and output take signed char arrays, and even if I were to write my own function to print an array of unsigned chars, I would not know how to read something from a file as unsigned.
So, how can I read and print a file with accented characters in C?
The problem you are having is not in your code, it's in your expectations. A text character is really just a value that has been associated with some form of glyph (symbol). There are different schemes for making this association, generally referred to as encodings. One early and still common encoding is known as ASCII (American Standard Code for Information Interchange). As the name implies, it is American English centric. Originally this was a 7-bit encoding (128 values), but later it was extended to include other symbols using 8 bits. Other encodings were developed for other languages. This was non-optimal. The Unicode standard was developed to address this. It's a relatively complicated standard designed to include any symbols one might want to encode. Unicode has various schemes that trade off data size for character size, for example UTF-7, UTF-8, UTF-16 and UTF-32. Because of this, there will not necessarily be a one-to-one relationship between a byte and a character.
So different character representations have different values, and those values can be greater than a single byte. The next problem is that to display the associated glyphs you need a system that correctly maps the value to the glyph and is able to display said glyph. A lot of "terminal" applications don't support Unicode by default. They use ASCII or Extended ASCII. It looks like that is what you may be using. The terminal is making the assumption that each byte it needs to display corresponds to a single character (which, as discussed, isn't necessarily true in Unicode).
One thing to try is to redirect your output to a file and use a Unicode-aware editor (like Notepad++) to view the file using a UTF-8 (for example) encoding. You can also hex dump the input file to see how it has been encoded. Sometimes Unicode files are written with a BOM (Byte Order Mark) to help identify the Unicode encoding and byte order in play.
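As a quick way to inspect the encoding, here is a minimal sketch (using the file name from the question) that dumps the first few bytes; a leading EF BB BF would be a UTF-8 BOM:
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("fileR.txt", "rb"); /* binary mode: see the raw bytes */
    if (f == NULL)
        return 1;

    int c;
    for (int i = 0; i < 16 && (c = getc(f)) != EOF; i++)
        printf("%02X ", c);             /* e.g. EF BB BF marks a UTF-8 BOM */
    putchar('\n');

    fclose(f);
    return 0;
}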
I have to save the letter ñ in a char[] and I'm not able to do it. I tried this:
char example[1];
example[0] = 'ñ';
When compiling I get this:
$ gcc example.c
error: character too large for enclosing character literal type
example[0] = 'ñ';
Does anyone know how to do this?
If you're using High Sierra, you are presumably using a Mac running macOS 10.13.3 (High Sierra), the same as me.
This comes down to code sets and locales — and can get tricky. Mac terminals use UTF-8 by default, and ñ is Unicode character U+00F1, which requires two bytes, 0xC3 and 0xB1, to represent it in UTF-8. The compiler is letting you know that one byte isn't big enough to hold two bytes of data. (In single-byte code sets such as ISO 8859-1 or 8859-15, ñ has character code 0xF1; the resemblance between 0xF1 and U+00F1 is no coincidence, since Unicode code points U+0000 to U+00FF are the same as in ISO 8859-1. ISO 8859-15 is a more modern variant of 8859-1, with the Euro symbol € and 7 other variations from 8859-1.)
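For the curious, the encoding arithmetic: U+00F1 is 11110001 in binary (8 significant bits), which doesn't fit the 7-bit single-byte form of UTF-8, so it is packed into the two-byte pattern 110xxxxx 10xxxxxx; filling in the bits gives 11000011 10110001, i.e. 0xC3 0xB1.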
Another option is to change the character set that your terminal works with; you need to adapt your code to suit the code set that the terminal uses.
You can work around this by using wchar_t:
#include <wchar.h>
void function(void);
void function(void)
{
wchar_t example[1];
example[0] = L'ñ';
putwchar(example[0]);
putwchar(L'\n');
}
#include <locale.h>
int main(void)
{
setlocale(LC_ALL, "");
function();
return 0;
}
This compiles; if you omit the call to setlocale(LC_ALL, "");, it doesn't work as I want (it generates just octal byte \361 (aka 0xF1) and a newline, which generates a ? on the terminal), whereas with setlocale(), it generates two bytes (\303\261 in octal, aka 0xC3 and 0xB1) and you see ñ on the console output.
You can use "extended ASCII". Extended ASCII charts (e.g. code page 437) show that 'ñ' can be represented as 164.
example[0] = (char)164;
You can print this character just like any other character
putchar(example[0]);
As noted in the comments above, this will depend on your environment. It might work on your machine but not another one.
The better answer is to use Unicode, for example:
wchar_t example = L'\u00F1';
This really depends on which character set / locale you will be using. If you want to hardcode this as a Latin-1 character, this example program does that:
#include <cstdio>

int main() {
    char example[2] = {'\xF1'};
    printf("%s", example);
    return 0;
}
This, however, results in this output on my system that uses UTF-8:
$ ./a.out
�
So if you want to use non-ASCII strings, I'd recommend not representing them as char arrays directly. If you really need to use char directly, the UTF-8 sequence for ñ is two chars wide, and can be written as such (again with a terminating '\0' for good measure):
char s[3] = {"\xC3\xB1"};
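Printed with printf("%s\n", s), a UTF-8 terminal will render those two bytes as the single glyph ñ.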
I'm trying to save a character from the Cyrillic alphabet in a char.
When I take a string from the console it saves it in the char array successfully but just initializing it doesn't seem to work. I get "programName.exe has stopped working" when trying to run it.
#include <stdio.h>
#include <conio.h>
#include <string.h>
#include <Windows.h>
#include <stdlib.h>

void test(){
    char test = 'Я';
    printf("%s", test);
}

void main(){
    SetConsoleOutputCP(1251);
    SetConsoleCP(1251);
    test();
}
fgets ( books[booksCount].bookTitle, 80, stdin ); // this seems to be working ok with ascii.
I tried using wchar_t but I get the same results.
If you're using Russian Windows, which uses the Windows-1251 code page by default, you can print the character encoded as a single byte using plain printf, but you need to make sure the source file is saved in the same CP1251 charset. Don't save it as Unicode.
But the preferred way should be using wprintf with a wide char string:
void test() {
    wchar_t test_char = L'Я';
    wchar_t *test_string = L"АБВГ"; // or LPCWSTR test_string
    wprintf(L"%c\n%s", test_char, test_string);
}
This time you need to save the file as Unicode (UTF-8 or UTF-16).
UTF-8 may be better, but it's trickier on Windows. Moreover, if you use UTF-8 you cannot use a char to store Я, because it needs more than one byte; you must use a char* instead.
Note that main must return int, not void, and the fgets above must be called from inside some function.
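For reference, a minimal portable variant of the above (my sketch, using setlocale and the standard %lc/%ls conversions instead of the Windows-specific console calls):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

void test(void)
{
    wchar_t test_char = L'Я';
    wchar_t *test_string = L"АБВГ";
    wprintf(L"%lc\n%ls\n", test_char, test_string);
}

int main(void)
{
    setlocale(LC_ALL, ""); /* pick up the user's locale so wide output is converted */
    test();
    return 0;
}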
This could be solved by doing:
void test()
{
    char test = 'Я';
    putchar(test);
}
But there is a catch: since 'Я' is not an ASCII character, you might need to set an appropriate locale first.
Moreover, only ASCII characters 32 to 126 are guaranteed to be printable as the same symbol on all systems.
I'm able to output a single character using this code:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void){
    setlocale(LC_CTYPE, "");
    wchar_t a = L'Ö';
    putwchar(a);
}
How can I adapt the code to output a string?
Something like
wchar_t *a = L"ÖÜÄöüä";
wprintf(L"%ls", a);
wprintf(L"%ls", str)
It's a bit tricky: you have to know what your internal wchar_ts mean. Basically you should communicate with the environment via mbstowcs/wcstombs, and with data of known encoding via iconv (converting from and to WCHAR_T).
(The exception here is Windows, where you can't really communicate with the environment meaningfully, but you can access it in a wide version directly with Windows API functions, and you can write wide strings directly into message boxes etc.)
That said, once you have your internal wide string, you can convert it to the environment's multibyte string with wcstombs, or you can just use printf("%ls", mywstr); which performs the conversion for you. Just don't forget to call setlocale(LC_CTYPE, "") at the very beginning of your program.
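A minimal sketch of both routes (assuming a locale whose charset can represent the characters):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");                /* use the environment's encoding */

    wchar_t *wide = L"ÖÜÄöüä";
    char narrow[64];

    /* route 1: convert explicitly to the locale's multibyte encoding */
    size_t n = wcstombs(narrow, wide, sizeof(narrow));
    if (n == (size_t)-1)
        return 1;                           /* not representable in this locale */
    printf("%s\n", narrow);

    /* route 2: let printf perform the conversion */
    printf("%ls\n", wide);
    return 0;
}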
I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å among others. When I run the program in terminal the output for those characters are represented with a "?".
Is there an easy fix?
First things first:
Read in the buffer
Use libiconv or similar to obtain wchar_t type from UTF-8 and use the wide character handling functions such as wprintf()
Use the wide character functions in C! Most file/output handling functions have a wide-character variant
Ensure that your terminal can handle UTF-8 output. Having the correct locale setup and manipulating the locale data can automate a lot of the file opening and conversion for you ... depending on what you are doing.
Remember that the width of a code point or character in UTF-8 is variable. This means you can't just seek to a byte and begin reading like with ASCII, because you might land in the middle of a code point. Good libraries can do this in some cases. (A code-point counting sketch follows the example below.)
Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.
#include <stdio.h>
#include <wchar.h>

int main()
{
    /* "ccs=UTF-8" is a Windows/glibc extension, not standard C */
    FILE *f = fopen("data.txt", "r, ccs=UTF-8");
    if (!f)
        return 1;

    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04X\n", (unsigned)c);

    fclose(f);
    return 0;
}
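Since UTF-8 is variable-width (the seek caveat above), you can still count code points in a plain byte string without wide characters by skipping continuation bytes; a minimal sketch, assuming valid UTF-8 input:
#include <stdio.h>

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx). */
static size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

int main(void)
{
    printf("%zu\n", utf8_length("äö")); /* 2 code points, 4 bytes */
    return 0;
}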
Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.
It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:
static void print_buffer(const char *buffer, size_t length)
{
    size_t i;

    for (i = 0; i < length; i++)
        printf("%02x ", (unsigned char) buffer[i]); /* cast via unsigned char to avoid sign extension */
    putchar('\n');
}
You can do this after loading a very short file, containing just a few characters.
Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.
Probably your text file is ISO-8859-1 encoded but your terminal is UTF-8. This kind of mismatch is a standard problem when dealing with byte-oriented text handling; other C programs (such as the standard 'cat' and 'more' commands) will do the same thing, and it isn't generally considered an error or something that needs to be fixed.
If you want to operate on a Unicode character level instead of bytes, that's fine, but you'll need to use wchar_t as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file encoding actually is. (Whilst it is sometimes possible to guess, it's not very reliable.)
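If you know the input really is ISO-8859-1 and the terminal is UTF-8, POSIX iconv (not plain stdio) can do the re-encoding; a minimal sketch with the characters from the question:
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "\xC6 \xD8 \xC5";           /* Æ Ø Å in ISO-8859-1 */
    char out[32];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        return 1;
    *outp = '\0';

    printf("%s\n", out);                    /* now valid UTF-8 for the terminal */
    iconv_close(cd);
    return 0;
}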
I don't know if it will help, but if you're sure that the encodings of the terminal and the input file are the same, you can try setlocale():
#include <locale.h>
…
setlocale(LC_CTYPE, "");