Reading and outputting Unicode in C

FILE *f = fopen("filename", "r");
int c;
while ((c = fgetc(f)) != EOF) {
    printf("%c\n", c);
}
Hello, I have searched for a whole hour and found many wise dissertations on Unicode, but no answer to this simple question:
what is the shortest equivalent of these four lines that can handle UTF-8, on Linux using gcc and bash?
Thank you

Something like this should work, given your system:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_CTYPE, "en_GB.UTF-8");
    FILE *f = fopen("filename", "r");
    wint_t c;
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
}
The problem with your original code is that C doesn't realise (or care) that the characters are multibyte, so each multibyte character gets corrupted by the \n printed between its individual bytes. With this version, a character is treated as UTF-8, and %lc may now emit as many as 4 actual bytes (the maximum length of a UTF-8 sequence), which are output together and therefore correctly. Any ASCII in the input still takes one byte per character as before, since ASCII is a subset of UTF-8.
strace is always useful for debugging things like this. As an example, suppose the file contains just ££ (£ is the two-byte UTF-8 sequence \302\243). Your version produces:
write(1, "\302\n\243\n\302\n\243\n\n\n", 10) = 10
And mine:
write(1, "\302\243\n\302\243\n", 6) = 6
Note that once you read from or write to a stream (including stdout), it is set to either byte or wide orientation, and you would have to reopen the stream to change that. So, for example, if you wanted to read the UTF-8 file but leave stdout byte-oriented, you could replace the wprintf with:
printf("%lc\n", c);
This involves extra work behind the scenes (converting between the wide and multibyte representations), but provides better compatibility with other code that expects a byte stream.
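As a minimal sketch of that mixed approach (reading wide characters, writing bytes), assuming the same file name as the question and a UTF-8 locale taken from the environment:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_CTYPE, "");           /* pick up the environment's locale, e.g. en_GB.UTF-8 */
    FILE *f = fopen("filename", "r");
    if (!f)
        return 1;
    wint_t c;
    while ((c = fgetwc(f)) != WEOF)
        printf("%lc\n", c);            /* %lc converts the wide character back to multibyte */
    fclose(f);
    return 0;
}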

Related

Why does fgetc() in C always read extra, non-existent characters whenever I try to read non-printable characters from txt files?

I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.
However, I have noticed that for every non-printable character I read, there is always an extra non-printable character existing in front of what I really want to read.
For example, the character I want to read is "§".
And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".
I looked it up in the debugger and saw "Â§" in the char array. But I don't have "Â" anywhere in my input file.
And after I write the non-printable character into my output file, I have noticed that it is also just "§", not "Â§".
There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?
Thanks!
Code as follows:
case 1:
    mode = 1;
    FILE *fp;
    fp = fopen("input2.txt", "r");
    int charCount = 0;
    while (!feof(fp)) {
        original_message[charCount] = fgetc(fp);
        charCount++;
    }
    original_message[charCount - 1] = '\0';
    fclose(fp);
    k = strlen(original_message); // split the original message into k input symbols
    printf("k: \n%lld\n", k);
    printf("ASCII code:\n");
    for (int i = 0; i < k; i++)
    {
        ASCII = original_message[i];
        printf("%d ", ASCII);
    }
C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.
But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.
You will have to #include <wchar.h> to get the prototype for fgetwc. And you may have to add the call
setlocale(LC_CTYPE, "");
at the top of your program to synchronize your program's character set "locale" with that of your operating system.
Not your original code, but I wrote this little program:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    wint_t c;  /* wint_t rather than wchar_t, so the comparison against WEOF is reliable */
    setlocale(LC_CTYPE, "");
    while ((c = fgetwc(stdin)) != WEOF)
        printf("%lc %d\n", c, c);
}
When I type "A", it prints A 65.
When I type "§", it prints § 167.
When I type "Ƶ", it prints Ƶ 437.
When I type "†", it prints † 8224.
Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences:
Your original_message array is going to have to be an array of wchar_t, not an array of char.
Your original_message array isn't going to be an ordinary C string — it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc.
So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.
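For instance, here is a minimal byte-oriented sketch (mine, not the question's code; it assumes the same input2.txt) that passes UTF-8 through untouched and shows where strlen reports bytes rather than characters:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("input2.txt", "r");
    if (!fp)
        return 1;
    char line[1024];
    while (fgets(line, sizeof line, fp)) {
        /* strlen counts bytes, not characters: each "§" contributes 2 */
        printf("%zu bytes: %s", strlen(line), line);
    }
    fclose(fp);
    return 0;
}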

How to read and print a Unicode file

I have a test input file input.txt with one line with the following contents:
кёльнский
I am using this code to attempt to read it in and print it out.
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    FILE *input;
    wchar_t buf[1000];
    setlocale(LC_CTYPE, "");
    if ((input = fopen("input.txt", "r")) == NULL)
        return 1;
    printf("Read and print\n");
    while (fgetws(buf, 1000, input) != NULL)
        wprintf(L"%s", buf);
    fclose(input);
}
However when I run it I see "Read and print" and then nothing else.
I am compiling with gcc on Ubuntu.
What am I doing wrong?
It turns out that substituting the wprintf line with
printf("%ls",buf);
fixes the problem.
Why is this?
You're doing two things wrong:
Mixing normal (byte-oriented) and wide output functions to standard output. You need to stick to one or the other. From the C11 draft, section 7.21.2:
Each stream has an orientation. After a stream is associated with an external file, but before any operations are performed on it, the stream is without orientation. Once a wide character input/output function has been applied to a stream without orientation, the stream becomes a wide-oriented stream. Similarly, once a byte input/output function has been applied to a stream without orientation, the stream becomes a byte-oriented stream. ...
Byte input/output functions shall not be applied to a wide-oriented stream and wide character input/output functions shall not be applied to a byte-oriented stream.
Using the wrong printf format to print a wide string. %s is for a normal char string. %ls is for a wchar_t string. But for just printing a wide string to a wide stream, prefer fputws(). No point in using a printf function if you're not actually using its formatting capabilities or mixing literal text with variables or printing wide characters to a byte-oriented stream or something else fancy.
One way (of many alternatives) to fix the above problems, treating standard output as a wide-oriented stream:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    FILE *input;
    wchar_t buf[1000];
    setlocale(LC_CTYPE, "");
    if ((input = fopen("input.txt", "r")) == NULL)
        return 1;
    fputws(L"Read and print\n", stdout);
    while (fgetws(buf, 1000, input) != NULL)
        fputws(buf, stdout);
    fclose(input);
}
Another, using a byte-oriented standard output:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    FILE *input;
    wchar_t buf[1000];
    setlocale(LC_CTYPE, "");
    if ((input = fopen("input.txt", "r")) == NULL)
        return 1;
    puts("Read and print");
    while (fgetws(buf, 1000, input) != NULL)
        printf("%ls", buf);
    fclose(input);
}
Despite what you might guess, fgetws does not read UTF-16; like the other wide functions, it reads multibyte characters in the current locale's encoding (here UTF-8) and converts them to wide characters. We normally handle UTF-8 with plain fgets into a char array, expect it to work, and print it with plain printf; that byte transparency is the point of UTF-8. If the text doesn't display correctly, your terminal probably isn't set to UTF-8, which is easily checked by running cat input.txt.
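A minimal sketch of that byte-oriented approach (assuming a UTF-8 terminal and the same input.txt):

#include <stdio.h>

int main(void)
{
    FILE *input = fopen("input.txt", "r");
    if (!input)
        return 1;
    char buf[1000];
    puts("Read and print");
    /* the UTF-8 bytes pass through unchanged; no wide-character API is needed */
    while (fgets(buf, sizeof buf, input))
        fputs(buf, stdout);
    fclose(input);
    return 0;
}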

How do I read from a file in C if the file has accented characters such as 'á'?

Another day, another problem with strings in C. Let's say I have a text file named fileR.txt and I want to print its contents. The file goes like this:
Letter á
Letter b
Letter c
Letter ê
I would like to read it and show it on the screen, so I tried the following code:
#include <stdlib.h>
#include <locale.h>
#include <clocale>
#include <stdio.h>
#include <conio.h>
#include <wchar.h>

int main()
{
    FILE *pF;
    char line[512]; // Current line
    setlocale(LC_ALL, "");
    pF = fopen("Aulas\\source\\fileR.txt", "r");
    while (!feof(pF))
    {
        fgets(line, 512, pF);
        fputs(line, stdout);
    }
    return 0;
}
And the output was:
Letter Ã¡
Letter b
Letter c
Letter Ãª
I then attempted to use wchar_t to do it:
#include <stdlib.h>
#include <locale.h>
#include <clocale>
#include <stdio.h>
#include <conio.h>
#include <wchar.h>

int main()
{
    FILE *pF;
    wchar_t line[512]; // Current line
    setlocale(LC_ALL, "");
    pF = fopen("Aulas\\source\\fileR.txt", "r");
    while (!feof(pF))
    {
        fgetws(line, 512, pF);
        fputws(line, stdout);
    }
    return 0;
}
The output was even worse:
Letter ÃLetter b
Letter c
Letter Ã
I have seen people suggesting the use of an unsigned char array, but that simply results in an error, as the stdio input and output functions take plain char arrays; and even if I were to write my own function to print an array of unsigned chars, I would not know how to read something from a file as unsigned in the first place.
So, how can I read and print a file with accented characters in C?
The problem you are having is not in your code, it's in your expectations. A text character is really just a value that has been associated with some form of glyph (symbol). There are different schemes for making this association, generally referred to as encodings. One early and still common encoding is known as ASCII (American Standard Code for Information Interchange). As the name implies, it is American English centric. Originally it was a 7-bit encoding (128 values), but it was later extended to include other symbols using 8 bits. Other encodings were developed for other languages, which was non-optimal. The Unicode standard was developed to address this. It is a relatively complicated standard designed to include any symbol one might want to encode, and it has several encoding forms that trade off space against code-unit size, for example UTF-7, UTF-8, UTF-16 and UTF-32. Because of this there will not necessarily be a one-to-one relationship between a byte and a character.
So different character representations have different values, and those values can be longer than a single byte. The next problem is that to display the associated glyphs you need a system that correctly maps each value to its glyph and is able to display it. A lot of "terminal" applications don't support Unicode by default; they use ASCII or extended ASCII. It looks like that may be what you are using. The terminal is assuming that each byte it needs to display corresponds to a single character (which, as discussed, isn't necessarily true in Unicode).
One thing to try is to redirect your output to a file and use a Unicode-aware editor (like Notepad++) to view the file using a UTF-8 (for example) encoding. You can also hex dump the input file to see how it has been encoded. Sometimes Unicode files are written with a BOM (Byte Order Mark) to help identify the Unicode encoding and the byte order in play.
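As a rough sketch of that kind of inspection (the file name is the question's, with the path shortened; a UTF-8 BOM is the standard three-byte signature EF BB BF):

#include <stdio.h>

int main(void)
{
    FILE *pF = fopen("fileR.txt", "rb");    /* binary mode: no newline translation */
    if (!pF)
        return 1;
    unsigned char buf[16];
    size_t n = fread(buf, 1, sizeof buf, pF);
    /* a UTF-8 BOM, if present, is EF BB BF */
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        puts("UTF-8 BOM found");
    for (size_t i = 0; i < n; i++)          /* hex dump of the first bytes */
        printf("%02x ", (unsigned) buf[i]);
    putchar('\n');
    fclose(pF);
    return 0;
}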

Reading a general file

I'm making a program that reads in a file from stdin, does something to it and sends it to stdout.
As it stands, I have a line in my program:
while ((c = getchar()) != EOF) {
where c is an int.
However the problem is I want to use this program on ELF executables. And it appears that there must be the byte that represents EOF for ascii files inside the executable, which results in it being truncated (correct me if I'm wrong here - this is just my hypothesis).
What is an effective general way to go about doing this? I could dig up documents on the ELF format and then just check for whatever comes at the end. That would be useful, but I think it would be better if I could still apply this program to any kind of file.
You'll be fine - the EOF constant doesn't contain a valid ASCII value (it's typically -1).
For example, below is an excerpt from stdio.h on my system:
/* End of file character.
   Some things throughout the library rely on this being -1. */
#ifndef EOF
# define EOF (-1)
#endif
You might want to go a bit lower level and use system functions like open(), close() and read(); this way you can do what you like with the input, as it will be stored in your own buffer.
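A minimal sketch of that lower-level approach (POSIX-specific; note that read() signals end of file by returning 0, so no in-band EOF value is involved):

#include <fcntl.h>    /* open */
#include <unistd.h>   /* read, write, close */

int main(int argc, char *argv[])
{
    /* read from the named file if given, otherwise from stdin (fd 0) */
    int fd = argc > 1 ? open(argv[1], O_RDONLY) : 0;
    if (fd < 0)
        return 1;
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(1, buf, (size_t) n);          /* fd 1 is stdout */
    if (fd != 0)
        close(fd);
    return 0;
}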
You are doing it correctly.
EOF is not a character. There is no way c will contain EOF to represent any byte in the stream. If and when c does contain EOF, that value did not originate from the file itself but from the underlying library/OS: EOF is a signal that there is nothing more to read (or that an error occurred).
Make sure c is an int, though.
Oh, and you might want to read from a stream under your control. In the absence of code to do otherwise, stdin is subject to "text translation", which might not be desirable when reading binary data.
FILE *mystream = fopen(filename, "rb");
if (mystream) {
    int c;  /* int, not char, so that EOF is distinguishable from data */
    /* use fgetc() instead of getchar() */
    while ((c = fgetc(mystream)) != EOF) {
        /* ... */
    }
    fclose(mystream);
} else {
    /* error */
}
From the getchar(3) man page:
Character values are returned as an
unsigned char converted to an int.
This means a character value read via getchar can never be equal to the signed integer value -1. This little program illustrates it:
#include <stdio.h>

int main(void)
{
    int a;
    unsigned char c = EOF;  /* (unsigned char)-1 is 0xff */
    a = (int)c;
    // output: 000000ff - 000000ff - ffffffff
    printf("%08x - %08x - %08x\n", a, c, -1);
    return 0;
}

Handling special characters in C (UTF-8 encoding)

I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å among others. When I run the program in terminal the output for those characters are represented with a "?".
Is there an easy fix?
First things first:
Read the raw bytes into a buffer
Use libiconv or similar to obtain wchar_t from UTF-8, and use the wide-character handling functions such as wprintf()
Use the wide-character functions in C! Most file/output handling functions have a wide-character variant
Ensure that your terminal can handle UTF-8 output. Having the correct locale set up and manipulating the locale data can automate a lot of the file opening and conversion for you, depending on what you are doing.
Remember that the width of a code point or character in UTF-8 is variable. This means you can't just seek to a byte and begin reading like with ASCII, because you might land in the middle of a code point. Good libraries can resynchronise in some cases, as sketched below.
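To illustrate that last point, here is a small hypothetical helper (my sketch, not part of any library mentioned here) that backs up to the start of a code point; it relies on the fact that UTF-8 continuation bytes all match the bit pattern 10xxxxxx:

#include <stdio.h>

/* Hypothetical helper: given an arbitrary byte offset into a UTF-8 buffer,
   step backwards to the first byte of the enclosing code point.
   Continuation bytes satisfy (byte & 0xC0) == 0x80. */
static size_t utf8_codepoint_start(const unsigned char *buf, size_t pos)
{
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}

int main(void)
{
    const unsigned char text[] = "a\xC3\x86q";  /* 'a', then 'Æ' (bytes C3 86), then 'q' */
    /* offset 2 is the middle of 'Æ'; resynchronising lands on offset 1 */
    printf("%zu\n", utf8_codepoint_start(text, 2));  /* prints 1 */
    return 0;
}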
Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.
#include <stdio.h>
#include <wchar.h>

int main()
{
    FILE *f = fopen("data.txt", "r, ccs=UTF-8");
    if (!f)
        return 1;
    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04X\n", c);
    fclose(f);
    return 0;
}
Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.
It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:
#include <stdio.h>

static void print_buffer(const char *buffer, size_t length)
{
    size_t i;
    for (i = 0; i < length; i++)
        printf("%02x ", (unsigned char) buffer[i]); /* via unsigned char, so bytes
                                                       >= 0x80 don't sign-extend */
    putchar('\n');
}
You can do this after loading a very short file, containing just a few characters.
Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.
Probably your text file is ISO-8859-1 encoded but your terminal is UTF-8. This kind of mismatch is a standard problem when dealing with byte-oriented text handling; other C programs (such as the standard ‘cat’ and ‘more’ commands) will do the same thing, and it isn't generally considered an error or something that needs to be fixed.
If you want to operate on the level of Unicode characters instead of bytes, that's fine, but you'll need to use wchar_t as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file's encoding actually is. (While it is sometimes possible to guess, it's not very reliable.)
I don't know if it helps, but if you're sure that the encodings of the terminal and the input file are the same, you can try setlocale():
#include <locale.h>
…
setlocale(LC_CTYPE, "");
