How to read and print a Unicode file in C

I have a test input file input.txt with one line with the following contents:
кёльнский
I am using this code to attempt to read it in and print it out.
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    FILE *input;
    wchar_t buf[1000];

    setlocale(LC_CTYPE, "");
    if ((input = fopen("input.txt", "r")) == NULL)
        return 1;
    printf("Read and print\n");
    while (fgetws(buf, 1000, input) != NULL)
        wprintf(L"%s", buf);
    fclose(input);
}
However, when I run it I see "Read and print" and then nothing else.
I am compiling with gcc on Ubuntu.
What am I doing wrong?
It turns out that substituting the wprintf line with
printf("%ls",buf);
fixes the problem.
Why is this?

You're doing two things wrong:
Mixing normal (byte-oriented) and wide output functions to standard output. You need to stick to one or the other. From the C11 draft, section 7.21.2:
Each stream has an orientation. After a stream is associated with an external file, but before any operations are performed on it, the stream is without orientation. Once a wide character input/output function has been applied to a stream without orientation, the stream becomes a wide-oriented stream. Similarly, once a byte input/output function has been applied to a stream without orientation, the stream becomes a byte-oriented stream. ...
Byte input/output functions shall not be applied to a wide-oriented stream and wide character input/output functions shall not be applied to a byte-oriented stream.
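As a minimal illustration of orientation, fwide() lets you query (or set) a stream's orientation; here is a short sketch:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* fwide(stream, 0) queries the orientation without changing it:
       > 0 means wide-oriented, < 0 byte-oriented, 0 still undecided. */
    printf("hello\n");              /* first byte I/O fixes stdout's orientation */
    if (fwide(stdout, 0) < 0)
        fputs("stdout is now byte-oriented\n", stdout);
    return 0;
}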
Using the wrong printf format to print a wide string. %s is for a normal char string; %ls is for a wchar_t string. But for just printing a wide string to a wide stream, prefer fputws(). There is no point in using a printf-family function unless you actually need its formatting capabilities, are mixing literal text with variables, are printing wide characters to a byte-oriented stream, or are doing something else fancy.
One way (of many) to fix the above problems, treating standard output as a wide-oriented stream:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    FILE *input;
    wchar_t buf[1000];

    setlocale(LC_CTYPE, "");
    if ((input = fopen("input.txt", "r")) == NULL)
        return 1;
    fputws(L"Read and print\n", stdout);
    while (fgetws(buf, 1000, input) != NULL)
        fputws(buf, stdout);
    fclose(input);
}
Another, using a byte-oriented standard output:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    FILE *input;
    wchar_t buf[1000];

    setlocale(LC_CTYPE, "");
    if ((input = fopen("input.txt", "r")) == NULL)
        return 1;
    puts("Read and print");
    while (fgetws(buf, 1000, input) != NULL)
        printf("%ls", buf);
    fclose(input);
}

fgetws does not read UTF-16; it reads multibyte text according to the current locale (UTF-8 in a UTF-8 locale) and converts it to wide characters. We normally handle UTF-8 with plain fgets into a char array and print it with plain printf, and we expect that to work; that's the point of UTF-8. If it doesn't display correctly, your terminal probably isn't set to UTF-8; this is easily checked by running cat input.txt.
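A minimal byte-oriented sketch of that approach (assuming both the file and the terminal are UTF-8):
#include <stdio.h>

int main(void)
{
    FILE *input = fopen("input.txt", "r");
    char buf[1000];

    if (input == NULL)
        return 1;
    /* UTF-8 passes through unchanged: read bytes, write bytes. */
    while (fgets(buf, sizeof buf, input) != NULL)
        fputs(buf, stdout);
    fclose(input);
}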

Related

How do I read from a file in C if the file has accented characters such as 'á'?

Another day, another problem with strings in C. Let's say I have a text file named fileR.txt and I want to print its contents. The file goes like this:
Letter á
Letter b
Letter c
Letter ê
I would like to read it and show it on the screen, so I tried the following code:
#include <stdlib.h>
#include <locale.h>
#include <clocale>
#include <stdio.h>
#include <conio.h>
#include <wchar.h>

int main()
{
    FILE *pF;
    char line[512]; // Current line

    setlocale(LC_ALL, "");
    pF = fopen("Aulas\\source\\fileR.txt", "r");
    while (!feof(pF))
    {
        fgets(line, 512, pF);
        fputs(line, stdout);
    }
    return 0;
}
And the output was:
Letter á
Letter b
Letter c
Letter ê
I then attempted to use wchar_t to do it:
#include <stdlib.h>
#include <locale.h>
#include <clocale>
#include <stdio.h>
#include <conio.h>
#include <wchar.h>

int main()
{
    FILE *pF;
    wchar_t line[512]; // Current line

    setlocale(LC_ALL, "");
    pF = fopen("Aulas\\source\\fileR.txt", "r");
    while (!feof(pF))
    {
        fgetws(line, 512, pF);
        fputws(line, stdout);
    }
    return 0;
}
The output was even worse:
Letter ÃLetter b
Letter c
Letter Ã
I have seen people suggesting the use of an unsigned char array, but that simply results in an error, as the stdio functions made for input and output take signed char arrays, and even if I were to write my own function to print an array of unsigned chars, I would not know how to read something from a file as unsigned.
So, how can I read and print a file with accented characters in C?
The problem you are having is not in your code; it's in your expectations. A text character is really just a value that has been associated with some form of glyph (symbol). There are different schemes for making this association, generally referred to as encodings. One early and still common encoding is known as ASCII (American Standard Code for Information Interchange). As the name implies, it is American English centric. Originally this was a 7-bit encoding (128 values), but it was later extended to include other symbols using 8 bits. Other encodings were developed for other languages, which was non-optimal. The Unicode standard was developed to address this. It is a relatively complicated standard designed to include any symbol one might want to encode. Unicode has various encoding schemes that trade off code-unit size against data size, for example UTF-7, UTF-8, UTF-16 and UTF-32. Because of this, there will not necessarily be a one-to-one relationship between a byte and a character.
So different character representations have different values, and those values can be greater than a single byte. The next problem is that to display the associated glyphs you need a system that correctly maps each value to its glyph and is able to display it. A lot of "terminal" applications don't support Unicode by default; they use ASCII or extended ASCII. It looks like that is what you may be using. The terminal is assuming that each byte it needs to display corresponds to a single character (which, as discussed, isn't necessarily true in Unicode).
One thing to try is to redirect your output to a file and use a Unicode-aware editor (like Notepad++) to view the file using a UTF-8 encoding, for example. You can also hex dump the input file to see how it has been encoded. Sometimes Unicode files are written with a BOM (Byte Order Mark) to help identify the Unicode encoding and the byte order in play.
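For instance, a minimal hex dump in C (a sketch; od -c or xxd do the same from the shell):
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("fileR.txt", "r");  /* path assumed; adjust as needed */
    int c;

    if (f == NULL)
        return 1;
    /* Print each byte as hex; a UTF-8 BOM would show up as ef bb bf. */
    while ((c = fgetc(f)) != EOF)
        printf("%02x ", c);
    putchar('\n');
    fclose(f);
    return 0;
}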

While loop doesn't exit at EOF in C

I'm trying to write some C code that simply writes the C: drive information to a txt file with the cmd command "wmic logicaldisk get", but I need only the numbers (e.g. size 4294931).
So I capture this output into a txt file in order to read back only the numbers. (I know it's quite strange.)
This is the full code:
#include <stdlib.h>
#include <stdio.h>
#include <windows.h>
#include <ctype.h>

int main()
{
    system("wmic logicaldisk get size> test.txt");
    unsigned char symb;
    FILE *FileIn;
    FileIn = fopen("test.txt", "rt");
    int getc(FILE *stream);
    while ((symb = getc(FileIn)) != EOF)
    {
        if (isdigit(symb))
        {
            printf("%C", symb);
        }
    }
    printf("test"); //for debug
}
The code works, but it never exits the while loop; the number is printed correctly, but the following commands aren't executed (so the printf of "test" never runs).
There are three things wrong in your code:
You redeclare a prototype for getc. You should not do that, since your declaration might not be the same as the official standard declaration.
The getc function returns an int. That is because EOF is an int constant with the value -1, and ((unsigned char) -1) != -1: stored in an unsigned char, -1 becomes 255, which is never equal to -1 when compared as an int. The variable you use with getc (or any similar function) must be an int.
The printf format specifier "%C" (with an upper-case C) is not a standard format specifier. It is a Microsoft Visual C++ extension for wide characters of type wchar_t. Since your variable symb is not the type matching that format specifier, you get undefined behavior. For a narrow character like yours, use lower-case %c.
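Putting the three fixes together, a corrected sketch (keeping the wmic command from the question):
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    FILE *FileIn;
    int symb;  /* int, not unsigned char, so EOF (-1) compares correctly */

    system("wmic logicaldisk get size> test.txt");
    FileIn = fopen("test.txt", "rt");
    if (FileIn == NULL)
        return 1;
    while ((symb = getc(FileIn)) != EOF)
    {
        if (isdigit(symb))
            printf("%c", symb);  /* lower-case c for a narrow character */
    }
    fclose(FileIn);
    printf("test"); //for debug
}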

Cannot read from stdin extended ASCII character in NCURSES

I have a problem trying to read extended ASCII chars in NCURSES.
I have this program:
#include <ncurses.h>

int main () {
    initscr();
    int d = getch();
    mvprintw(0, 0, "letter: %c.", d);
    refresh();
    getch();
    endwin();
    return 0;
}
I build it with: gcc -lncursesw a.c
If I type a character in the 7bit ascii, like the 'e' char, I get:
letter: e.
And then I have to type another for the program to end.
If I type a character in the extended ascii, like the 'á' char, I get:
letter: .
and the program ends.
It's like the second byte is read as another character.
How can I get the correct char 'á'?
Thanks!
The characters that you want to type require the program to set up the locale. As described in the manual:

Initialization
The library uses the locale which the calling program has initialized. That is normally done with setlocale:
setlocale(LC_ALL, "");
If the locale is not initialized, the library assumes that characters are printable as in ISO-8859-1, to work with certain legacy programs. You should initialize the locale and not rely on specific details of the library when the locale has not been setup.
Past that, it is likely that your locale uses UTF-8. To work with UTF-8, you should compile and link against the ncursesw library.
Further, the getch function only returns values for single-byte encodings, such as ISO-8859-1, which some people confuse with Windows cp1252, and thence to "Extended ASCII" (which says something about two fallacies not cancelling out). UTF-8 is a multibyte encoding. If you use getch to read that, you will get the first byte of the character.
Instead, to read UTF-8, you should use get_wch (unless you want to decode the UTF-8 yourself). Here is a revised program which does that:
#include <ncurses.h>
#include <locale.h>
#include <wchar.h>

int
main(void)
{
    wint_t value;

    setlocale(LC_ALL, "");
    initscr();
    get_wch(&value);
    mvprintw(0, 0, "letter: %#x.", value);
    refresh();
    getch();
    endwin();
    return 0;
}
I printed the result as a number, because printw does not know about Unicode values. printw uses the same C runtime support as printf, so you may be able to print the value directly. For instance, I see that POSIX printf has a formatting option for handling wint_t:
c
The int argument shall be converted to an unsigned char, and the resulting byte shall be written.
If an l (ell) qualifier is present, the wint_t argument shall be converted as if by an ls conversion specification with no precision and an argument that points to a two-element array of type wchar_t, the first element of which contains the wint_t argument to the ls conversion specification and the second element contains a null wide character.
Since ncurses works on many platforms, not all of those actually support the feature. But you can probably assume it works with the GNU C library: most distributions routinely provide workable locale configurations.
Doing that, the example is more interesting:
#include <ncurses.h>
#include <locale.h>
#include <wchar.h>

int
main(void)
{
    wint_t value;

    setlocale(LC_ALL, "");
    initscr();
    get_wch(&value);
    mvprintw(0, 0, "letter: %#x (%lc).", value, value);
    refresh();
    getch();
    endwin();
    return 0;
}

Reading and outputting Unicode in C

FILE * f = fopen("filename", "r");
int c;
while ((c = fgetc(f)) != EOF) {
    printf("%c\n", c);
}
Hello, I have searched for a whole hour and found many wise dissertations on Unicode, but no answer to this simple question:
what would be the shortest equivalent of these four lines that can manage UTF-8, on Linux using gcc and bash?
Thank you
Something like this should work, given your system:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_CTYPE, "en_GB.UTF-8");
    FILE * f = fopen("filename", "r");
    wint_t c;
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
}
The problem with your original code is that C doesn't realise (or care) that the characters are multibyte, and so your multibyte characters are corrupted by the \n inserted between each of their bytes. With this version, a character is treated as UTF-8, and so %lc may now represent as many as 4 actual bytes (the maximum for a UTF-8 code point), which are guaranteed to be output correctly. If the input has any ASCII, it'll simply use one byte per character as previously (since ASCII is compatible with UTF-8).
strace is always useful for debugging things like this. As an example, suppose the file contains just ££ (£ has the UTF-8 sequence \302\243). Your version produces:
write(1, "\302\n\243\n\302\n\243\n\n\n", 10) = 10
And mine,
write(1, "\302\243\n\302\243\n", 6) = 6
Note that once you read or write to a stream (including stdout) it is set to either byte or wide orientation, and you will need to re-open the stream if you want to change it. So, for example, if you wanted to read the UTF-8 file but leave stdout byte-oriented, you could replace the wprintf with:
printf("%lc\n", c);
This involves extra code in the background (to convert the formats), but provides better compatibility with other code that expects a byte stream.
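Put together, a byte-oriented variant might look like this (a sketch, keeping the same hard-coded locale as above):
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    FILE *f;
    wint_t c;

    setlocale(LC_CTYPE, "en_GB.UTF-8");
    if ((f = fopen("filename", "r")) == NULL)
        return 1;
    /* The file stream becomes wide-oriented; stdout stays byte-oriented. */
    while ((c = fgetwc(f)) != WEOF)
        printf("%lc\n", c);
    fclose(f);
}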

Handling special characters in C (UTF-8 encoding)

I'm writing a small application in C that reads a simple text file and then outputs its lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å, among others. When I run the program in a terminal, those characters come out as "?".
Is there an easy fix?
First things first:
Read in the buffer
Use libiconv or similar to obtain wchar_t strings from UTF-8, and use the wide-character handling functions such as wprintf()
Use the wide character functions in C! Most file/output handling functions have a wide-character variant
Ensure that your terminal can handle UTF-8 output. Having the correct locale set up and manipulating the locale data can automate a lot of the file opening and conversion for you, depending on what you are doing
Remember that the width of a code point or character in UTF-8 is variable. This means you can't just seek to a byte and begin reading like with ASCII, because you might land in the middle of a code point. Good libraries can handle this in some cases (see the resynchronization sketch after the example below)
Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.
#include <stdio.h>
#include <wchar.h>

int main()
{
    /* Note: "ccs=UTF-8" in the mode string is a nonstandard extension
       (supported by glibc and the Windows CRT, among others). */
    FILE *f = fopen("data.txt", "r, ccs=UTF-8");
    if (!f)
        return 1;
    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04X\n", c);
    fclose(f);
    return 0;
}
Links
libiconv
Locale data in C/GNU libc
Some handy info
Another good Unicode/UTF-8 in C resource
Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.
It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:
static void print_buffer(const char *buffer, size_t length)
{
    size_t i;

    for (i = 0; i < length; i++)
        printf("%02x ", (unsigned char) buffer[i]);  /* cast via unsigned char to avoid sign extension */
    putchar('\n');
}
You can do this after loading a very short file, containing just a few characters.
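For instance, a sketch of such a check (the file name is just an example):
#include <stdio.h>
#include <string.h>

static void print_buffer(const char *buffer, size_t length);  /* as defined above */

int main(void)
{
    char buffer[64];
    FILE *f = fopen("data.txt", "r");

    if (f == NULL)
        return 1;
    if (fgets(buffer, sizeof buffer, f) != NULL)
        print_buffer(buffer, strlen(buffer));  /* e.g. "c3 86 0a " for "Æ\n" */
    fclose(f);
    return 0;
}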
Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.
Probably your text file is ISO-8859-1 encoded but your terminal is UTF-8. This kind of mismatch is a standard problem when dealing with byte-oriented text handling; other C programs (such as the standard cat and more commands) will do the same thing, and it isn't generally considered an error or something that needs to be fixed.
If you want to operate on a Unicode character level instead of bytes, that's fine, but you'll need to use wchar_t as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file encoding actually is. (Whilst it is sometimes possible to guess, it's not very reliable.)
I don't know if it helps, but if you're sure that the encodings of the terminal and the input file are the same, you can try setlocale():
#include <locale.h>
…
setlocale(LC_CTYPE, "");
