Issue reading Japanese characters from file - C


I am writing a program that reads a file with almost 2 million lines. Each line has the format: integer ID, a tab, then an artist name string.
6821361 Selinsgrove High School Chorus
10151460 greek-Antique
10236365 jnr walker & the all-stars
6878792 Grieg - Kraggerud, Kjekshus
6880556 Mr. Oiseau
6906305 stars on 54 (maxi single)
10584525 Jonie Mitchel
10299729 エリス レジーナ／アントニオ カルロス ジョビン
Above is an example with some items from the file (note: some lines do not follow this format). My program works fine until it gets to the last line of the example; then it endlessly prints エリス レジーナ／アントニオ カルロス ジョビ\343\203.
struct artist *read_artists(char *fname)
{
FILE *file;
struct artist *temp = (struct artist*)malloc(sizeof(struct artist));
struct artist *head = (struct artist*)malloc(sizeof(struct artist));
file = fopen("/Users/Daniel/Library/Developer/Xcode/DerivedData/project_Audioscrobbler_Artists-hgwyqpinuoxayzbmvarcjxryqnrz/Build/Products/Debug/artist_data.txt", "r");
if(file == 0)
{
perror("fopen");
exit(1);
}
int artist_ID;
char artist_name[650];
while(!feof(file))
{
fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name);
temp = create_play(artist_ID, artist_name, 0, -1);
head = add_play(head, temp);
printf("%s\n", artist_name);
}
fclose(file);
//print_plays(head);
return head;
}
Above is my code for reading from the file. Can you please help explain what is wrong?

As the comments indicate, one problem is with while(!feof(file)). The linked content explains in detail why this is not a good idea; in summary, quoting from one of the answers in the link:
(!feof(file))...
...is wrong because it tests for something that is
irrelevant and fails to test for something that you need to know. The
result is that you are erroneously executing code that assumes that it
is accessing data that was read successfully, when in fact this never
happened. - Kerrek SB
In your case, this usage does not cause your problem but, as Kerrek explains might happen, it masks it.
You can replace that with fgets(...):
char lineBuf[1000];//make length longer or shorter for your purpose
file = fopen("/Users/Daniel/Library/Developer/Xcode/DerivedData/project_Audioscrobbler_Artists-hgwyqpinuoxayzbmvarcjxryqnrz/Build/Products/Debug/artist_data.txt", "r");
if(!file) return NULL;//read_artists returns a pointer, so signal failure with NULL
while(fgets (lineBuf, sizeof(lineBuf), file))
{
//process each line here
//But processing Japanese characters
//will require special considerations.
//Refer to the link below for UNICODE tips
}
Unicode in C and C++...
In particular, you will need to use variable types that are sufficient for containing the different size characters you will be processing. The link discusses this in great detail.
Here is an excerpt:
"char" no longer means character
I hereby recommend referring to character codes in C programs using a 32-bit unsigned integer type. Many platforms provide a
"wchar_t" (wide character) type, but unfortunately it is to be avoided
since some compilers allot it only 16 bits—not enough to represent
Unicode. Wherever you need to pass around an individual character,
change "char" to "unsigned int" or similar. The only remaining use for
the "char" type is to mean "byte".
Edit:
In the comments above, you state that the string it fails on is 66 bytes long. Because the format specifier limits the read to 65 bytes, the input was truncated one byte short of completing the last character. An ASCII character fits in a single char; a Japanese character encoded in UTF-8 occupies several. With a field width large enough for the whole name (artist_name has room for 649 bytes plus the terminator), the last byte would have been included.

OP's code failed because the result of fscanf() was not checked.
fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name);
The fscanf() read in 65 char of "エリス レジーナ／アントニオ カルロス ジョビン". Yet this string, encoded in UTF-8, has a length of 66 bytes. The last 'ン' is codes 227, 131, 179 (octal 343 203 263) and only the first 2 were read. When artist_name is printed, the following appears.
エリス レジーナ／アントニオ カルロス ジョビ\343\203
Now the problem begins. The last char, 179, remains in the file. The next fscanf() fails because char 179 does not convert into an int ("%d"), so fscanf() returns 0. Since the code did not check the result of fscanf(), it does not realize that artist_ID and artist_name are left over from before, and so it prints the same text.
Because char 179 is never consumed, feof() never becomes true, and we have an infinite loop.
The while(!feof(file)) hid this problem, but did not cause it.
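The byte counts above can be checked with a small helper. This is only a sketch: utf8_codepoints is a hypothetical name, and it assumes the input is valid UTF-8. It counts characters by skipping continuation bytes, while strlen() counts bytes.

```c
#include <stddef.h>

/* Counts Unicode code points in a UTF-8 string by skipping continuation
   bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
   utf8_codepoints is a hypothetical helper name, not a library function. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++; /* lead byte or ASCII byte: starts a new code point */
    }
    return n;
}
```

For 'ン' (bytes 227, 131, 179) strlen() reports 3 while this helper reports 1, which is why a field width given in bytes can cut a character in half.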
The fgets() proposed by @ryyker is a good approach. Another is:
while (fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name) == 2) {
temp = create_play(artist_ID, artist_name, 0, -1);
head = add_play(head, temp);
printf("%s\n", artist_name);
}
In other words, validate the results of *scanf().
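Both suggestions (read with fgets, then validate the conversion count) can be combined. The following is a minimal sketch, not the poster's code: parse_artist_line and ARTIST_NAME_SIZE are hypothetical names, and the width 649 assumes the 650-byte artist_name buffer from the question.

```c
#include <stdio.h>

#define ARTIST_NAME_SIZE 650 /* assumed: same size as artist_name in the question */

/* Parses one "ID<TAB>name" line that was already read with fgets().
   Returns 1 if both fields were converted, 0 if the line is malformed.
   parse_artist_line is a hypothetical helper name. */
static int parse_artist_line(const char *line, int *id,
                             char name[ARTIST_NAME_SIZE])
{
    /* 649 = ARTIST_NAME_SIZE - 1 leaves room for the terminating '\0'. */
    return sscanf(line, "%d\t%649[^\t\n]", id, name) == 2;
}
```

Inside a while (fgets(lineBuf, sizeof lineBuf, file)) loop, a line for which this returns 0 can be skipped or reported. Because fgets has already consumed the line, a bad record cannot stall the loop.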

Related

puts and printf do not give out full text (text containing CJK characters), when the text is read from a local file, on Windows, MSVC

The text contains:
..... (some characters can't be posted on SO)
xxxxxxxx=xxx xxxxxxx=xxxxx://xxx..xxx/xxxxx/xx9528994
(for full text & data please see https://github.com/ggaarder/snippets/raw/master/x.txt)
which ends in xxxxx://xxx..xxx/xxxxx/xx9528994. However, when I read it and then call puts, it only gives
..... (some characters can't be posted on SO)
xxxxxxxx=xxx xxxxxxx=xxxxx:/
which only prints up to xxxxx:/, and /xxx..xxx/xxxxx/xx9528994 is missing.
Code to test:
#include <stdio.h>
int main(void)
{
char s[30000];
FILE *f = fopen("x.txt", "r");
fread(s, sizeof(s), 1, f);
puts(s);
return 0;
}
The buffer size 30000 is adequate. x.txt is 1049 bytes.
You can download x.txt at https://github.com/ggaarder/snippets/raw/master/x.txt, for convenience I have packed everything to https://github.com/ggaarder/snippets/raw/master/foo.zip.
It would be very kind of you to download x.txt and take a look, since most of it can't be posted on SO because of the special characters, including some CJK.
Attempts:
The whole file is read properly. @pmg notices that fread returns zero, while @Someprogrammerdude points out that if fread's size and count arguments are swapped, fread returns 1049, which supports the guess.
If the CJK letters are removed, the output will be totally OK. So I think there is no '\0' in the middle.
By adding
ret = puts(s);
printf("\nret: %d, %s", ret, strerror(errno));
We get ret: 0, No error. puts returns zero and there's nothing in errno.
You may notice that there's a leading \n in the printf above. Yes, puts doesn't print the trailing newline as it usually does. Does this suggest that puts failed?
But then why does it return zero with nothing in errno?
Might it be related to the Windows NT cmd? Maybe some special terminal control characters are unintentionally being output.
Reading with "rb" gives the same result. x.txt is XML text; for convenience I removed the irrelevant parts of it, which is why it looks like spam.
I guess this is just yet another encoding issue, plus some magical secret Windows commandline control sequence .... I'm not taking it. I will just erase all non-ASCII characters.

The order of the "size" and "count" arguments to fread is crucial.
The first argument is the "element" size, and the second argument is the number of elements to attempt to read.
In the case of a text file, the element size is a single character, usually a single byte. The number of elements to attempt to read is the size of the destination array.
So your call should be
fread(s, 1, sizeof s, f);
instead.
With the arguments the other way around, you are saying that the "element" size is 30000 bytes and that fread should read one such element. Since the file is smaller than 30000 bytes, fread cannot read even a single complete element, and returns 0 to indicate that.
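The difference between the two argument orders can be observed directly. This is a sketch under assumptions: fread_result is a hypothetical helper, and it relies on tmpfile() being able to create a scratch file.

```c
#include <stdio.h>

/* Writes `len` bytes to a temporary file, then reads them back with the
   given element size and count. Returns what fread() returned, or
   (size_t)-1 if the setup failed. fread_result is a hypothetical helper. */
static size_t fread_result(const char *data, size_t len,
                           size_t size, size_t count)
{
    char buf[256];
    size_t got;
    FILE *f;

    if (size * count > sizeof buf)
        return (size_t)-1;      /* request would not fit in buf */
    f = tmpfile();
    if (f == NULL)
        return (size_t)-1;
    if (fwrite(data, 1, len, f) != len) {
        fclose(f);
        return (size_t)-1;
    }
    rewind(f);
    got = fread(buf, size, count, f); /* number of COMPLETE elements read */
    fclose(f);
    return got;
}
```

With a 5-byte file, fread_result("hello", 5, 100, 1) gives 0 because no complete 100-byte element exists, while fread_result("hello", 5, 1, 100) gives 5.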

open the file in binary mode
switch arguments and check the return value of fread().
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char s[30000];
FILE *f = fopen("x.txt", "rb"); // binary mode
if (f == NULL) { // fopen can fail, e.g. if the file is missing
perror("x.txt");
exit(EXIT_FAILURE);
}
size_t len = fread(s, 1, sizeof(s) - 1, f); // switch args, leave room for the terminator
if (len < 1) {
perror("bad fread");
exit(EXIT_FAILURE);
}
s[len] = 0; // properly terminate s
puts(s);
fclose(f);
return 0;
}

It's just another everyday encoding issue. Call SetConsoleOutputCP(65001), compile with /utf-8, or set the execution character set with a #pragma, and everything will be fine.

How to read and print hexadecimal numbers from a file in C

I'm trying to read 14-digit hexadecimal numbers from a file and then print them. My idea is to use a long long int, read the lines from the file with fscanf as if they were strings, and then turn each string into a hex number using atoll. The problem is that I am getting a segfault on my fscanf line according to valgrind, and I have absolutely no idea why. Here is the code:
#include<stdio.h>
int main(int argc, char **argv){
if(argc != 2){
printf("error argc!= 2\n");
return 0;
}
char *fileName = argv[1];
FILE *fp = fopen( fileName, "r");
if(fp == NULL){
return 0;
}
long long int num;
char *line;
while( fscanf(fp, "%s", line) == 1 ){
num = atoll(line);
printf("%x\n", num);
}
return 0;
}

Are you sure you want to read your numbers as character strings? Why not let fscanf do the work for you?
unsigned long long num; // %llx expects a pointer to unsigned long long
while( fscanf(fp, "%llx", &num) == 1 ){ // read an unsigned long long in hex
printf("%llx\n", num); // print an unsigned long long in hex
}
BTW, note the ll length modifier on the %x conversion in printf: it specifies that the argument is of type unsigned long long.
Edit
Here is a simple example of two loops reading a 3-line input (with two, no and three numbers in consecutive lines) with a 'hex int' format and with a 'string' format:
http://ideone.com/ntzKEi
A call to rewind allows the second loop to read the same input data.

That line variable is not initialized, so when fscanf() dereferences it you get undefined behavior.
You should use the following to do the loading:
char line[1024];
while(fgets(line, sizeof line, fp) != NULL)
If you're on C99, you might want to use uint64_t to hold the number, since that makes it clear that 14-digit hexadecimal numbers (4 * 14 = 56) will fit.
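If the lines are read as text, note that atoll() only parses decimal digits; a base-16 conversion needs strtoull() or similar. A minimal sketch, assuming one number per line (parse_hex is a hypothetical helper name):

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Converts a hexadecimal string such as "1a2b3c4d5e6f70\n" to a uint64_t.
   Returns 1 on success, 0 if there are no hex digits, the value is out of
   range, or anything other than whitespace follows the number.
   parse_hex is a hypothetical helper name. */
static int parse_hex(const char *text, uint64_t *out)
{
    char *end;
    unsigned long long value;

    errno = 0;
    value = strtoull(text, &end, 16);   /* base 16 */
    if (end == text || errno == ERANGE)
        return 0;
    while (isspace((unsigned char)*end))
        end++;                          /* allow the newline kept by fgets */
    if (*end != '\0')
        return 0;
    *out = (uint64_t)value;
    return 1;
}
```

The result can be printed portably with printf("%" PRIx64 "\n", value) after including <inttypes.h>.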

The other answers are good, but I want to clarify the actual reason for the crash you are seeing. The problem is that:
fscanf(fp, "%s", line)
... essentially means "read a string from a file, and store it in the buffer pointed at by line". In this case, your line variable hasn't been initialised, so it doesn't point anywhere valid. Technically, this is undefined behavior; in practice, the result will often be that you write over some arbitrary location in your process's address space; furthermore, since it will often point at an illegal address, the operating system can detect and report it as a segmentation violation or similar, as you are indeed seeing.
Note that fscanf with a %s conversion will not necessarily read a whole line - it reads a string delimited by whitespace. It might skip lines if they are empty and it might read multiple strings from a single line. This might not matter if you know the precise format of the input file (and it always has one value per line, for instance).
Although it appears in that case that you can probably just use an appropriate modifier to read a hexadecimal number (fscanf(fp, "%llx", &num)), rather than read a string and try to do a conversion, there are various situations where you do need to read strings and especially whole lines. There are various solutions to that problem, depending on what platform you are on. If it's a GNU system (generally including Linux) and you don't care about portability, you could use the m modifier, and change line to &line:
fscanf(fp, "%ms", &line);
This passes a pointer to line to fscanf, rather than its value (which is uninitialised), and the m causes fscanf to allocate a buffer and store its address in line. You then should free the buffer when you are done with it. Check the Glibc manual for details. The nice thing about this approach is that you do not need to know the line length beforehand.
If you are not using a GNU system or you do care about portability, use fgets instead of fscanf. It is more direct and allows you to limit the length of the line read, meaning that you won't overflow a fixed buffer. Just be aware that it reads a whole line at a time, unlike fscanf, as discussed above. You should declare line as a char array rather than a char * and choose a suitable size for it. (Note that you can also specify a "maximum field width" for fscanf, e.g. fscanf(fp, "%1000s", line), which requires a buffer of at least 1001 bytes, but you really might as well use fgets.)

What happens with extra memory using fscanf?

I'm new to C and I have a couple questions about fscanf. I wrote a simple program that reads the contents of a file and spits it back out on the command line:
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char* argv[1])
{
if (argc != 2)
{
printf("Usage: fscanf txt\n");
return 1;
}
char* txt = argv[1];
FILE* fp = fopen(txt, "r");
if (fp == NULL)
{
printf("Could not open %s.\n", txt);
return 2;
}
char s[50];
while (fscanf(fp, "%49s", s) == 1)
printf("%s\n", s);
return 0;
}
Let's say the contents of my text file is just "C is cool.", which will output:
C
is
cool.
So I have two questions here:
1) Does fscanf assume that the placeholder "%s" will be a single word (an array of chars only)? According to this program's output, spaces and line breaks seem to prompt the function to return. But what if I wanted to read a whole paragraph? Would I use fread() instead?
2) More importantly I'm wondering what happens with all of the unused space in the array. On the first iteration, I think s[0] = "C" and s[1] = "\0", so are s[2] - s[49] just wasted?
EDIT: while (fscanf(fp, "%49s", s) == 1) - thanks to @M Oehm for pointing this out - enforcing a strict limit here to prevent dangerous buffer overflows

1) Does fscanf assume that the placeholder "%s" will be a single word
(an array of chars only)? According to this program's output, spaces
and line breaks seem to prompt the function to return. But what if I
wanted to read a whole paragraph? Would I use fread() instead?
The %s specifier reads single words that are delimited by white space. The scanf family of functions is very crude; these functions do not normally distinguish between line breaks and spaces, for example.
A line is anything up to the next newline. There is no concept of paragraph, but you might consider anything between blank lines a paragraph. The function to read lines of text is fgets, so you could read lines until you find an empty one. (fgets retains the newline at the end, mind.)
fread is a function for reading binary data. It is not useful for reading structured texts. (But it can be used to read the contents of a whole text file at once.)
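Since fgets keeps the newline, code that looks for an empty line usually strips it first. A small sketch of that step (chomp is a hypothetical helper name, not a standard function):

```c
#include <string.h>

/* Removes a trailing "\n" or "\r\n" in place and returns the new length.
   A return value of 0 means the line was empty, which could be treated
   as a paragraph break. chomp is a hypothetical helper name. */
static size_t chomp(char *line)
{
    size_t len = strcspn(line, "\r\n"); /* index of first CR or LF, if any */
    line[len] = '\0';
    return len;
}
```

Called on each buffer filled by fgets, it turns "C is cool.\n" into "C is cool." and "\n" into an empty string.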
2) More importantly I'm wondering what happens with all of the unused
space in the array. On the first iteration, I think c[0] = 'C' and
c[1] = '\0', so are c[2] - c[49] just wasted?
You are right, the data after the null terminator isn't used. "Wasted" is too negative, though: with user input you don't know whether you will eventually encounter a longer word. Because dynamic allocation requires some care in C, allocating "enough for most cases" is good practice. You should enforce the hard limit when reading, though, to prevent buffer overruns:
fscanf(fp, "%49s", s)
The issue of "wasted" memory becomes more serious if you have an array of arrays of 50 chars. Most of the words will be much shorter than 50 chars. Here, the extra memory might eventually hurt you. 48 extra characters for reading a line are okay, though.
(A strategy to store "compact" arrays of chars is to have a running array of chars that is a concatenation of all strings, including their terminators. The word array is then an array of pointers into that master string.)
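The compact storage strategy can be sketched as follows. The names (word_pool, pool_add) and the fixed capacities are assumptions for illustration; a real program might grow the storage dynamically instead.

```c
#include <stddef.h>
#include <string.h>

#define POOL_BYTES 1024 /* assumed capacity of the master string */
#define POOL_WORDS 128  /* assumed maximum number of words */

/* All words live back to back in `bytes`, each with its terminator;
   `words` holds pointers into that storage. Start from a zeroed struct. */
struct word_pool {
    char        bytes[POOL_BYTES];
    size_t      used;
    const char *words[POOL_WORDS];
    size_t      count;
};

/* Appends a copy of `word` to the pool. Returns the stored copy,
   or NULL if the pool is full. */
static const char *pool_add(struct word_pool *p, const char *word)
{
    size_t len = strlen(word) + 1; /* include the '\0' */
    char *dst;

    if (p->count == POOL_WORDS || len > POOL_BYTES - p->used)
        return NULL;
    dst = p->bytes + p->used;
    memcpy(dst, word, len);
    p->used += len;
    p->words[p->count++] = dst;
    return dst;
}
```

Storing "C", "is" and "cool." this way takes 11 bytes of character storage instead of three 50-byte arrays.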

You use the specifier %s, which reads and stores data in the array s until it encounters a space or newline. As soon as it encounters a space, fscanf returns.
I think c[0] = "C" and c[1] = "\0", so are c[2] - c[49] just wasted?
Yes, s[0]='C' and s[1]='\0', and you probably can't do anything about the array being much larger than needed.
If you want complete string "C is cool" stored in array use fgets.
#define len 1000
char s[len];
while(fgets(s,len,fp)!=NULL) {
//your code
}

fgets combined with sscanf

Today I've looked over some C code that was parsing data from a text file
and I've stumbled upon these lines
fgets(line,MAX,fp);
if(line[strlen(line)-1]=='\n'){
line[strlen(line)-1]='\0';
}else{
printf("Error on line length\n");
exit(1);
}
sscanf(line,"%s",records->bday);
with record being a structure
typedef struct {
char bday[11];
}record;
So my question here regards the fgets-sscanf combination to create a type/length safe stream reader:
Is there any other way to work this out beside having to combine these two readers?
What about the \n checking-removing sequence?

The combination of fgets() with sscanf() is usually good. However, you should probably be using:
if (fgets(line, sizeof(line), fp) != 0)
{
...
}
This checks for I/O errors and EOF. It also assumes that the definition of the array is visible (otherwise sizeof gives you the size of a pointer, not of the array). If the array is not in scope, you should probably pass the size of the array to the function containing this code. All that said, there are worse sins than using MAX in place of sizeof(line).
You have not checked for a zero-length birthday string; you will probably end up doing quite a lot of validation on the string that is entered, though (dates are fickle and hard to process).
Given that MAX is 60, but sizeof(records->bday) == 11, you need to protect yourself from buffer overflows in the sscanf(). One way to do that is:
if (sscanf(line, "%10s", records->bday) != 1)
...handle error...
Note that the 10 is sizeof(records->bday) - 1, but you can't provide the length as an argument to sscanf(); it has to appear in the format string literally. Here, you can probably live with the odd sizing, but if you were dealing with more generic code, you'd probably think about:
sprintf(format, "%%%zus", sizeof(records->bday) - 1);
The first %% maps to %; the %zu formats the size (z is C99 for size_t); the s is for the string conversion when the format is used.
Or you could consider using strcpy() or memcpy() or memmove() to copy the right subsection of the input string to the structure - but note that %10s skips leading blanks which strcpy() et al will not. You have to know how long the string is before you do the copying, of course, and make sure the string is null terminated.
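The idea of building the format string from the destination size can be sketched like this. scan_word is a hypothetical helper name, and snprintf is used in place of sprintf to bound the format buffer.

```c
#include <stdio.h>

/* Copies the first whitespace-delimited word of `line` into `dst`,
   never writing more than dst_size bytes (terminator included).
   Returns 1 if a word was read, 0 otherwise.
   scan_word is a hypothetical helper name. */
static int scan_word(const char *line, char *dst, size_t dst_size)
{
    char format[32];

    if (dst_size < 2)
        return 0; /* no room for even one character plus '\0' */
    /* For dst_size == 11 this produces the format "%10s". */
    snprintf(format, sizeof format, "%%%zus", dst_size - 1);
    return sscanf(line, format, dst) == 1;
}
```

With the structure from the question, the call would be scan_word(line, records->bday, sizeof records->bday); an over-long token is cut off at 10 characters rather than overflowing bday.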

Read in text file - 1 character at a time. using C

I'm trying to read in a text file line by line and process each character individually.
For example, one line in my text file might look like this:
ABC XXXX XXXXXXXX ABC
The number of spaces in a line varies, but the total number of characters (including spaces) is always the same.
This is what I have so far...
char currentLine[100];
fgets(currentLine, 22, inputFile);
I'm then trying to iterate through the currentLine Array and work with each character...
for (j = 0; j<22; j++) {
if (&currentLine[j] == 'x') {
// character is an x... do something
}
}
Can anyone help me with how I should be doing this?
As you can probably tell - I've just started using C.

Something like the following is the canonical way to process a file character by character:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
FILE *fp;
int c;
if (argc != 2) {
fprintf(stderr, "Usage: %s file.txt\n", argv[0]);
exit(1);
}
if (!(fp = fopen(argv[1], "rt"))) {
perror(argv[1]);
exit(1);
}
while ((c = fgetc(fp)) != EOF) {
// now do something with each character, c.
}
fclose(fp);
return 0;
}
Note that c is declared int, not char because EOF has a value that is distinct from all characters that can be stored in a char.
For more complex parsing, then reading the file a line at a time is generally the right approach. You will, however, want to be much more defensive against input data that is not formatted correctly. Essentially, write the code to assume that the outside world is hostile. Never assume that the file is intact, even if it is a file that you just wrote.
For example, you are using a 100 character buffer to read lines, but limiting the amount read to 22 characters (probably because you know that 22 is the "correct" line length). The extra buffer space is fine, but you should allow for the possibility that the file might contain a line that is the wrong length. Even if that is an error, you have to decide how to handle that error and either resynchronize your process or abandon it.
Edit: I've added some skeleton of an assumed rest of the program for the canonical simple case. There are couple of things to point out there for new users of C. First, I've assumed a simple command line interface to get the name of the file to process, and verified using argc that an argument is really present. If not, I print a brief usage message taking advantage of the content of argv[0] which by convention names the current program in some useful way, and exit with a non-zero status.
I open the file for reading in text mode. The distinction between text and binary modes is unimportant on Unix platforms, but can be important on others, especially Windows. Since the discussion is of processing the file a character at a time, I'm assuming that the file is text and not binary. If fopen() fails, then it returns NULL and sets the global variable errno to a descriptive code for why it failed. The call to perror() translates errno to something human-readable and prints it along with a provided string. Here I've provided the name of the file we attempted to open. The result will look something like "foo.txt: no such file". We also exit with non-zero status in this case. I haven't bothered, but it is often sensible to exit with distinct non-zero status codes for distinct reasons, which can help shell scripts make better sense of errors.
Finally, I close the file. In principle, I should also test the fclose() for failure. For a process that just reads a file, most error conditions will already have been detected as some kind of content error, and there will be no useful status added at the close. For file writing, however, you might not discover certain I/O errors until the call to fclose(). When writing a file it is good practice to check return codes and expect to handle I/O errors at any call that touches the file.

You don't need the address operator (&). You're trying to compare the value of the variable currentLine[j] to 'x', not its address.

ABC XXXX XXXXXXXX ABC has 21 characters. There's also the line break (22 chars) and the terminating null byte (23 chars).
You need to fgets(currentLine, 23, inputFile); to read the full line.
But you declared currentLine as an array of 100. Why not use all of it?
fgets(currentLine, sizeof currentLine, inputFile);
Using all of it doesn't mean that more than one line is read each time fgets is called; fgets always stops after reading a '\n'.

Try
while( fgets(currentLine, 100, inputFile) ) {
for (j = 0; j<22; j++) {
if (/*&*/currentLine[j] == 'x') { /* <--- without & */
// character is an x... do something
}
}
}
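A variant that does not depend on the line being exactly 22 characters long is to walk the buffer until the terminating null byte. A minimal sketch (count_char is a hypothetical helper name):

```c
#include <stddef.h>

/* Counts how many times `target` occurs in a null-terminated line.
   The loop stops at '\0', so short or over-long lines are handled
   without reading past the data fgets actually stored.
   count_char is a hypothetical helper name. */
static size_t count_char(const char *line, char target)
{
    size_t n = 0;
    size_t j;

    for (j = 0; line[j] != '\0'; j++) {
        if (line[j] == target)
            n++;
    }
    return n;
}
```

Note that the comparison is case-sensitive: the sample line contains 'X', so comparing against 'x' would find nothing.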
