C writes unexpected characters

I've never really used C before but am trying to run this code: https://github.com/stanfordnlp/GloVe/blob/master/src/glove.c
Problem: when I read a UTF-8 word using this code and simply write it back out, the characters come out differently.
Here is an example:
µl µl
。 。
ß Ã<9f>
versión versión
◘ â<97><98>
Léon Léon
Résumé Résumé
Cancún Cancún
������ ���ï¿
The left side is the original word in fid and the right side is what this code outputs.
The fprintf happens at lines 234-237:
if (fscanf(fid,format,word) == 0) return 1;
if (strcmp(word, "<unk>") == 0) return 1;
fprintf(fout, "%s",word);
The first line reads a word from fid using format. However, format is built with sprintf(format,"%%%ds",MAX_STRING_LENGTH); and carries no information about the encoding.
My question is: how does C know which encoding to read and write? In this file, I can't find anywhere that an encoding such as UTF-8 or ISO-8859 is specified.
How can I make this code write the characters shown on the left side?
Any comment (short is fine too!) or some keywords that I should look up will be highly appreciated! Thanks.

C doesn't know anything about whatever encoding you use for the input. The fscanf call will simply read whitespace-delimited "characters", where each character is a single byte. Those bytes are written back out by fprintf unchanged, so the output file should contain the same UTF-8 byte sequences as the input; the garbled right-hand column is most likely what those bytes look like when whatever you view the output with decodes them as a single-byte encoding such as ISO-8859-1.
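To see what that means in practice, here is a minimal sketch (my own, not taken from the GloVe sources; the file names are placeholders) showing that the fscanf/fprintf pair copies words byte-for-byte, so UTF-8 input produces byte-identical UTF-8 output:
#include <stdio.h>

#define MAX_STRING_LENGTH 1000

int main(void)
{
    char format[20];
    char word[MAX_STRING_LENGTH + 1];
    FILE *fid = fopen("vocab.txt", "r");    /* placeholder input file */
    FILE *fout = fopen("copy.txt", "w");    /* placeholder output file */
    if (!fid || !fout) return 1;

    sprintf(format, "%%%ds", MAX_STRING_LENGTH);  /* yields "%1000s" */
    while (fscanf(fid, format, word) == 1)
        fprintf(fout, "%s\n", word);        /* the same bytes go back out */

    fclose(fid);
    fclose(fout);
    return 0;
}
If you open copy.txt in a viewer configured for UTF-8, the words display exactly as in the input.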

Related

How EOF is defined for binary and ASCII files

I'm programming C on Windows (the system language is Japanese), and I have a problem with the EOF of binary and ASCII files.
I asked this question last week and a kind guy helped me, but I still can't really understand how the program works when reading a binary or an ASCII file.
I did the following test:
Test1:
int oneChar;
iFile = fopen("myFile.tar.gz", "rb");
while ((oneChar = fgetc(iFile)) != EOF) {
    printf("%d ", oneChar);
}
Test2:
int oneChar;
iFile = fopen("myFile.tar.gz", "r");
while ((oneChar = fgetc(iFile)) != EOF) {
    printf("%d ", oneChar);
}
In the Test1 case, things worked perfectly for both binary and ASCII files. But in Test2, the program stopped reading when it encountered 0x1A in a binary file. (Does this mean that 1A == EOF?) The ASCII table tells me that 1A is a control character called SUB, or "substitute" (whatever that means...). And yet when I printf("%d", EOF), it gives me -1...
I also found this question, which tells me that the OS knows exactly where a file ends, so I don't really need to look for an EOF marker in the file, because EOF is outside the range of a byte (so what about 1A?).
Can someone clear things up a little for me? Thanks in advance.
This is Windows-specific behavior for files opened in text mode: the SUB character, typed as the Ctrl+Z sequence, is interpreted as EOF by fgetc. You do not have to have a 1A byte in your text file in order to get EOF back from fgetc, though: once you reach the actual end of the file, EOF is returned.
The standard does not define 1A as the char value representing EOF. The constant EOF is of type int, with a negative value outside the range of unsigned char. In fact, the reason fgetc returns an int rather than a char is precisely so that it can return this special value.
The convention of ending a file with Ctrl-Z originated with CP/M, a very old operating system for 8080/Z80 microcomputers. Its file system did not keep track of file sizes down to the byte level, only to the 128-byte sector level, so there needed to be another way to mark the end-of-file.
Microsoft's DOS was made to be as compatible with CP/M as possible, so it kept the convention when reading text files. By this time the file size was kept by the file system so it wasn't strictly necessary, just retained for backward compatibility.
This convention has persisted to the present day in the C and C++ libraries for Windows; when you open a file in text mode, every character is checked for Ctrl-Z and the end-of-file flag is set if it's detected. You're seeing the effects of backwards compatibility taken to an extreme, back to systems that are almost 40 years old.
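The practical fix, as your Test1 already shows, is to open anything that isn't plain text in binary mode. A minimal sketch of a byte-counting loop that treats 0x1A like any other byte:
#include <stdio.h>

int main(void)
{
    FILE *iFile = fopen("myFile.tar.gz", "rb");  /* "rb": no text-mode EOF trick */
    if (!iFile) return 1;

    int oneChar;
    long count = 0;
    while ((oneChar = fgetc(iFile)) != EOF)
        count++;                                 /* 0x1A is counted like any byte */

    if (ferror(iFile)) perror("read error");
    printf("%ld bytes\n", count);
    fclose(iFile);
    return 0;
}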
Found a terrific article that answers all the questions! https://latedev.wordpress.com/2012/12/04/all-about-eof/
On DOS and Windows, the end of a text file is conventionally marked with character 0x1A (ASCII 26, SUB).

Special characters are not displayed correctly in the Linux Terminal

I have a file encoded in UTF-8, as shown by the following command:
file -i D.txt
D.txt: text/plain; charset=utf-8
I just want to display the characters one by one, so I have done this:
FILE * F_entree = fopen("D.txt", "r");
if (! F_entree) usage("impossible d'ouvrir le fichier d'entrée");
char ligne[TAILLE_MAX];
while (fgets(ligne, TAILLE_MAX, F_entree))
{
    string mot = strtok(strdup(ligne), "\t");
    while (*mot++) { printf("%c \n", *mot); }
}
But the special characters aren't displayed correctly in the terminal (on Ubuntu 12); a <?> is displayed instead. I think the problem is that only an ASCII code can be stored in %c, but how can I display those special characters?
And what's a good way to keep those characters in memory (in order to implement a tree index)? (I'm aware that this last question is unclear; don't hesitate to ask for clarifications.)
It does not work because your code splits the multi-byte characters up into separate bytes. Your console, after seeing the first byte of a multi-byte sequence, expects the rest of that sequence to follow; when it does not receive the correct codes, you get your <?> -- translated freely, "whuh?". It does not receive the correct codes because you are stuffing a space and a newline in between.
Your console can only correctly interpret UTF8 characters if you send the right codes in the correct sequence. The algorithm is:
Is the next character the start code for a UTF-8 sequence? If not, print it and continue.
If it is, print it and print all "next" codes for this character. See Wikipedia on UTF8 for the actual encoding; I took a shortcut in my code below.
Only then print your space (..?) and newline.
The procedure to recognize the start and length of a UTF8 multibyte character is this:
"Regular" (ASCII) characters never have their high bit (0x80) set; testing against 0x80 is enough to differentiate them from UTF8.
Each UTF8 character sequence starts with one of the bit patterns 110xxxxx, 1110xxxx, 11110xxx, 111110xx, or 1111110x. Each unique bit pattern has an associated number of extra bytes; the first one, for example, expects one additional byte. The xxx bits are combined with bits from the next byte(s) to form the Unicode code point. (After all, that is what UTF8 is all about.)
Each next byte -- no matter how many! -- has the bit pattern 10xxxxxx. Important: none of the previous patterns start with this code!
Therefore, as soon as you see any UTF8 lead byte, you can immediately print it together with all its 'next' bytes, as long as they start with the bit pattern 10xxxxxx. This can be tested efficiently with a bit mask: value & 0xc0 should equal 0x80. Any other value means it isn't a 'next' byte anymore, so you're done.
All of this only works if your source file is valid UTF8. If you see some strange output, it most likely isn't. If you need to check the input file for validity, you do need to implement the entire table on the Wikipedia page and check that each 110xxxxx byte really is followed by a single 10xxxxxx byte, and so on. The pattern 10xxxxxx appearing on its own would indicate an error.
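For illustration, here is a rough sketch of such a validity check (my own simplification, separate from the printing code further below; it only handles the modern 1-4 byte forms and ignores finer points such as overlong encodings):
#include <stdio.h>

/* returns the number of continuation bytes a lead byte announces, or -1 */
static int utf8_extra_bytes(unsigned char b)
{
    if (b < 0x80)           return 0;  /* plain ASCII */
    if ((b & 0xe0) == 0xc0) return 1;  /* 110xxxxx */
    if ((b & 0xf0) == 0xe0) return 2;  /* 1110xxxx */
    if ((b & 0xf8) == 0xf0) return 3;  /* 11110xxx */
    return -1;                         /* stray continuation byte or invalid */
}

static int is_valid_utf8(const unsigned char *s)
{
    while (*s)
    {
        int extra = utf8_extra_bytes(*s++);
        if (extra < 0) return 0;
        while (extra--)                /* each must match 10xxxxxx */
            if ((*s++ & 0xc0) != 0x80) return 0;
    }
    return 1;
}

int main(void)
{
    /* prints 1 when this source file itself is saved as UTF-8 */
    printf("%d\n", is_valid_utf8((const unsigned char *)"Léon"));
    return 0;
}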
A definitive must-read is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). See also UTF-8 and Unicode FAQ for Unix/Linux for more background information.
My code below addresses a few other issues with yours. I've used English variable names (see Meta Stack Overflow, "Foreign variable names etc. in code"). It appears to me that strdup is not necessary. Also, string is a C++ type, not a C one.
My code does not "fix" or handle anything beyond the UTF-8 printing. Because of your use of strtok, the code only prints the text before the first \t Tab character on each line in your input file. I assume you know what you are doing there ;-)
Addendum: ah, I forgot to address Q2, "what's a good way to keep those characters in memory". UTF8 is designed to be maximally compatible with C-style char strings, so you can safely store them as such. You don't need to do anything special to print them on a UTF8-aware console -- well, except when you are doing things like you do here, printing them as separate characters. printf ought to work just fine for whole words.
If you need UTF8-aware equivalents of strcmp, strchr, and strlen, you can roll your own code (see the Wikipedia link above) or find yourself a good pre-made library. (I left out strcpy intentionally!)
#include <stdio.h>
#include <string.h>

#define MAX_LINE_LENGTH 1024

int main (void)
{
    char line[MAX_LINE_LENGTH], *word;
    FILE *entry_file = fopen("D.txt", "r");
    if (!entry_file)
    {
        printf("not possible to open entry_file\n");
        return -1;
    }
    while (fgets(line, MAX_LINE_LENGTH, entry_file))
    {
        word = strtok(line, "\t");
        if (!word)                 /* line held nothing but delimiters */
            continue;
        while (*word)
        {
            /* print UTF8 encoded characters as a single entity */
            if (*word & 0x80)
            {
                do
                {
                    printf("%c", *word);
                    word++;
                } while ((*word & 0xc0) == 0x80);
                printf("\n");
            }
            else
            {
                /* print low ASCII characters as-is */
                printf("%c \n", *word);
                word++;
            }
        }
    }
    fclose(entry_file);
    return 0;
}

Reading only alphabetic chars with fscanf

Hello, I have a simple function to read from a file:
while (fscanf(fp, " %255[a-zA-Z]", test) == 1)
{
    puste = 1;
    push(&drzewo, test);
}
It should read only words that consist entirely of alphabetic characters, and that works great. But when there is, for example, a single number in my file, my while loop quits; how should I change it?
Of course it stops, since the fscanf() call will fail to do the conversion you're requiring, and thus return 0. What would you expect it to do?
It's often better to read whole lines using fgets() and then parse them "manually"; that way it's easy to just skip ahead and read another line if the desired data is not found.
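A sketch of that approach (the file name words.txt is a placeholder, and puts stands in for your push(&drzewo, test) call): read a line, collect each run of alphabetic characters, and simply step over anything else:
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    char line[1024], test[256];
    FILE *fp = fopen("words.txt", "r");   /* placeholder file name */
    if (!fp) return 1;

    while (fgets(line, sizeof line, fp))
    {
        char *p = line;
        while (*p)
        {
            if (isalpha((unsigned char)*p))
            {
                int n = 0;
                while (isalpha((unsigned char)*p) && n < 255)
                    test[n++] = *p++;
                test[n] = '\0';
                puts(test);               /* or: push(&drzewo, test); */
            }
            else
                p++;                      /* skip digits, punctuation, spaces */
        }
    }
    fclose(fp);
    return 0;
}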

How to write äõüö in C? [duplicate]

I am trying to write my own version of wc (the Unix filter), but I have a problem with non-ASCII characters. I did a hex dump of a text file and found out that these characters occupy more than one byte, so they won't fit in a char. Is there any way I can read these characters from a file and handle them like single characters (in order to count the characters in a file) in C?
I've been googling a little bit and found the wchar_t type, but there weren't any simple examples of how to use it with files.
I've been googling a little bit and found the wchar_t type, but there weren't any simple examples of how to use it with files.
Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.
Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.
The answer to the question "What is the best unicode library for C?" is to use the ICU library. You may want to look at ustdio.h, as it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.
Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.
I, personally, have never used ICU, but I probably will from now on :-)
If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():
setlocale(LC_CTYPE, "");
This will cause the wide character functions to use the appropriate character set defined by the environment -- e.g., on Unix-like systems, by the LANG environment variable. This means, for example, that if your LANG variable is set to a UTF-8 locale, the wide character functions will handle input and output in UTF-8. (This is how the POSIX wc utility is specified to work.)
You can then use the wide-character versions of all the standard functions. For example, if you have code like this:
long words = 0;
int in_word = 0;
int c;
while ((c = getchar()) != EOF)
{
    if (isspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():
long words = 0;
int in_word = 0;
wint_t c;
while ((c = getwchar()) != WEOF)
{
    if (iswspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
Go have a look at ICU. That library is what you need to deal with all the issues.
Most of the answers so far have merit, but which you use depends on the semantics you want:
If you want to process text in the configured locale's encoding, and don't care about complete failure in the case of encountering invalid sequences, using getwchar() is fine.
If you want to process text in the configured locale's encoding, but need to detect and recover from invalid sequences, you need to read bytes and use mbrtowc manually (see the sketch after this list).
If you always want to process text as UTF-8, you need to read bytes and feed them to your own decoder. If you know in advance the file will be valid UTF-8, you can just count bytes in the ranges 00-7F and C2-F4 and skip counting all other bytes, but this could give wrong results in the presence of invalid sequences. A more robust approach would be decoding the bytestream to Unicode codepoints and counting the number of successful decodes.
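Here is a rough sketch of that second option (the file name input.txt is a placeholder; error handling is kept minimal): read raw bytes, decode them with mbrtowc, and resynchronize after an invalid sequence:
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");              /* use the locale's encoding */
    FILE *f = fopen("input.txt", "rb");   /* placeholder file name */
    if (!f) return 1;

    char buf[4096];
    size_t len;
    mbstate_t st;
    memset(&st, 0, sizeof st);
    long chars = 0;

    while ((len = fread(buf, 1, sizeof buf, f)) > 0)
    {
        size_t pos = 0;
        while (pos < len)
        {
            wchar_t wc;
            size_t n = mbrtowc(&wc, buf + pos, len - pos, &st);
            if (n == (size_t)-1)          /* invalid sequence: skip one byte */
            {
                memset(&st, 0, sizeof st);
                pos++;
            }
            else if (n == (size_t)-2)     /* char continues in the next read */
                pos = len;
            else
            {
                pos += (n ? n : 1);       /* n == 0 means an L'\0' was decoded */
                chars++;
            }
        }
    }
    printf("%ld characters\n", chars);
    fclose(f);
    return 0;
}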
Hope this helps.
Are you sure you really need the number of characters? wc counts the number of bytes.
~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt
1 1 11 hebrew.txt
(11 = 5 two-byte characters + 1 byte for '\n')
However, if you really do want to count characters rather than bytes, and can assume that your text files are encoded in UTF-8, then the easiest approach is to count all bytes that are not trail bytes (trail bytes being those in the range 0x80 to 0xBF).
If you can't assume UTF-8 but can assume that any non-UTF-8 files are in a single-byte encoding, then perform a UTF-8 validation check on the data. If it passes, return the number of UTF-8 lead bytes. If it fails, return the total number of bytes.
(Note that the above approach is specific to wc. If you're actually doing something with the characters rather than just counting them, you'll need to know the encoding.)
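For instance, a minimal sketch of that non-trail-byte count (it assumes valid UTF-8 and, like wc -m, counts the newline as a character; hebrew.txt is the file from the example above):
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("hebrew.txt", "rb");
    if (!f) return 1;

    int c;
    long chars = 0;
    while ((c = fgetc(f)) != EOF)
        if ((c & 0xc0) != 0x80)          /* not a trail byte: ASCII or lead byte */
            chars++;

    printf("%ld characters\n", chars);   /* prints 6 for the example above */
    fclose(f);
    return 0;
}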


Resources