How to read a non-text file containing EOF characters in C?

I'm trying to loop over all bytes of a file using a simple while loop, like so:
char c = fgetc(InputFile);
while (c != EOF)
{
    doStuff(c);
    c = fgetc(InputFile);
}
However, when working with non-text files, I've found that some of the bytes within the file (that aren't the last one) contain the value 255, and therefore register as EOF and the while loop ends prematurely.
How do I get around this and loop over all bytes?

As mentioned in the comments, you should assign the value returned by fgetc to an int variable, not a char. That way, you can distinguish between a successfully read byte with the hex value 0xFF (fgetc returns 255) and an end-of-file condition (fgetc returns EOF, a negative value that is typically -1).
From the cppreference page for fgetc:
On success, returns the obtained character as an unsigned char
converted to an int. On failure, returns EOF.
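A minimal sketch of a loop that survives 0xFF bytes. The helper name count_byte and its parameters are made up for illustration; the only essential part is that the result of fgetc is kept in an int:

```c
#include <stdio.h>

/* Hypothetical helper: counts how many bytes in the stream equal `target`.
   The return value of fgetc is kept in an int so that the byte 0xFF (255)
   stays distinct from EOF. */
long count_byte(FILE *in, unsigned char target)
{
    long count = 0;
    int c;                      /* int, not char */

    while ((c = fgetc(in)) != EOF) {
        if (c == target)
            count++;
    }
    return count;
}
```

For binary data the file should also be opened in binary mode ("rb"), so that no byte is translated on the way in.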

Related

How to handle data or char -1 when reading from file, since EOF is also -1

I'm trying to read a series of negative numbers from a file.
The number -1 is repeated n times, and I also read all data until EOF.
Since the data is -1 and EOF is also -1, how do I handle this situation?
The standard C character input functions return a value that is either an unsigned char or EOF. Thus, to use the return value from a function like fgetc, store it in an int, not char:
int x = fgetc(stdin);
if (x == EOF)
    // Code for handling error or end of file.
else
    // Code for handling a character.
Also note that many of the standard C routines for working with characters use unsigned char. Using char in your code can cause problems.
If your code has a function that reads text from the input and converts numerals in it to numbers and then returns those numbers, you must design your function so that it has some way of indicating whether it is returning −1 or EOF. A common way to do this is to return two separate values: One is an indication of whether a value was successfully read or not, and the other is the value (if successful).
Methods of returning two values include:
Return a struct that contains two members.
Return a status indication (success or failure) in the function return value and return the actual value in an object that is passed to the function via a pointer.
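The second method could be sketched like this. The function name read_int is hypothetical, and the sketch assumes the numbers are stored as whitespace-separated decimal text:

```c
#include <stdio.h>

/* Hypothetical helper: reads one decimal integer from `in`.
   Returns 1 and stores the number in *out on success;
   returns 0 on end of file, read error, or malformed input.
   The value -1 is therefore ordinary data, never a status. */
int read_int(FILE *in, int *out)
{
    return fscanf(in, "%d", out) == 1;
}
```

A caller would loop with while (read_int(fp, &value)) and can then treat -1 like any other number, because the end-of-input signal travels in the return value and not in the data.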
Source: C 2018 7.21.7.1 paragraphs 2 and 3 say:
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int…
If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the fgetc function returns the next character from the input stream pointed to by stream. If a read error occurs, the error indicator for the stream is set and the fgetc function returns EOF.

Why is type int needed to handle EOF and return of getchar()?

As written in the book:
The problem is distinguishing the end of the input from valid data. The solution is that getchar returns a distinctive value when there is no more input, a value that cannot be confused with any real character. This value is called EOF, for "end of file." We must declare c to be a type big enough to hold any value that getchar returns. We can't use char since c must be big enough to hold EOF in addition to any possible char. Therefore we use int.
#include <stdio.h>

int main(void)
{
    int c;
    c = getchar();
    while (c != EOF) {
        putchar(c);
        c = getchar();
    }
    return 0;
}
I am not able to understand the actual reason for using int instead of char. What value does EOF have that cannot be stored in a char?
An 8-bit unsigned char can hold 256 different values (0 to 255), and each of them is a valid character that getchar may return. If EOF were a char, its value would have to be one of those 256 values, which would mean there is one character that you cannot read. Therefore the value of EOF must lie outside the range 0 to 255, which means it cannot be one of those character values and needs a type larger than char, for example an int.
In other words, EOF is not a char and we don't want to store it in a char. Its only purpose is to let a program detect that it has attempted to read past the end of the file.
Or, put yet another way: let's suppose EOF is defined as 255 and therefore fits into a char. Now let's suppose getchar returns the value 255 (that is, EOF). What does that value represent? Is it EOF or is it the character 255?

fgetc returns an unknown character

I have the following code:
FILE *f = fopen('/path/to/some/file', 'rb');
char c;
while((c = fgetc(f)) != EOF)
{
    printf("next char: '%c', '%d'", c, c);
}
For some reason, when printing out the characters, at the end of the file, an un-renderable character gets printed out, along with the ASCII ordinal -1.
next char: '?', '-1'
What character is this supposed to be? I know it's not EOF because there's a check for that, and shortly after the character is printed, the program segfaults.
The trouble is that fgetc() and its relatives return an int, not a char:
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF.
It has to return every possible valid character value and a distinct value, EOF (which is negative, and usually but not necessarily -1).
When you read the value into a char instead of an int, one of two undesirable things happens:
If plain char is unsigned, then you never get a value equal to EOF, so the loop never terminates.
If plain char is signed, then a legitimate character, 0xFF (often ÿ, y-umlaut, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS), is treated the same as EOF, so you detect EOF prematurely.
Either way, it is not good.
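Both failure modes can be reproduced without a file. A minimal sketch with hypothetical helper names: each function stores a value that fgetc might return in a variable of a given type and reports whether that variable then compares equal to EOF. It uses signed char and unsigned char explicitly, because whether plain char is signed depends on the compiler:

```c
#include <stdio.h>

/* Each helper takes a value that fgetc could return (0..UCHAR_MAX or EOF),
   stores it in a variable of the given type, and reports whether that
   variable then compares equal to EOF. All names are hypothetical. */

int matches_eof_as_int(int r)
{
    int c = r;                          /* correct: nothing is lost */
    return c == EOF;
}

int matches_eof_as_unsigned_char(int r)
{
    unsigned char c = (unsigned char)r; /* EOF wraps to a non-negative value */
    return c == EOF;                    /* never true where int is wider than char */
}

int matches_eof_as_signed_char(int r)
{
    signed char c = (signed char)r;     /* 0xFF typically becomes -1 */
    return c == EOF;                    /* true for 0xFF when EOF is -1 */
}
```

On a typical platform where EOF is -1, matches_eof_as_signed_char(0xFF) reports a false end of file, while matches_eof_as_unsigned_char(EOF) never reports the real one. Only the int version gets both cases right.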
The Fix
The fix is to use int c; instead of char c;.
Incidentally, the fopen() call should not compile:
FILE *f = fopen('/path/to/some/file', 'rb');
should be:
FILE *f = fopen("/path/to/some/file", "rb");
Always check the result of fopen(); of all the I/O functions, it is more prone to failure than almost any other (not through its own fault, but because the user or programmer makes a mistake with the file name).
This is the culprit:
char c;
Please change it to:
int c;
The return type of fgetc is int, not char. You get strange behavior when you convert int to char on some platforms.

EOF missing for unix file [duplicate]

I have created a file named "file.txt" in Unix and tried to read its contents from my C program, but I never receive the EOF character. Does Unix not store an EOF character when a file is created? If so, what is the alternative way to detect EOF in a Unix-created file using C?
Here's the code sample
int main(){
    File *fp;
    int nl,c;
    nl =0;
    fp = fopen("file.txt", "r");
    while((c = fgetc(fp)) != EOF){
        if (c=='\n')
            nl++;
    }
    return 0;
}
If I explicitly give CTRL + D the EOF is detected even when I use char c.
This can happen if the type of c is char (and char is unsigned in your compiler; you can check this by examining the value of CHAR_MIN in <limits.h>) and not int.
The value of EOF is negative according to the C standard.
So, implicitly converting EOF to unsigned char loses the true value of EOF, and the comparison will always fail.
UPDATE: There's a bigger problem that has to be addressed first. In the expression c = fgetc(fp) != EOF, fgetc(fp) != EOF is evaluated first (to 0 or 1) and then that value is assigned to c. As long as there's a character to read, fgetc(fp) != EOF evaluates to 1, so c is always 1 inside the loop and never equals '\n'; the character that was read is lost. You need to add parentheses, like so: (c = fgetc(fp)) != EOF.
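The difference can be seen in a small sketch. The two function names are hypothetical; each reads one character from the stream and returns whatever ended up in c:

```c
#include <stdio.h>

/* Without parentheses, != binds tighter than =, so c receives the result
   of the comparison (0 or 1), not the character. */
int read_without_parens(FILE *in)
{
    int c;
    c = fgetc(in) != EOF;
    return c;
}

/* With parentheses, c receives the character, and the comparison result
   is a separate value, stored in *more. */
int read_with_parens(FILE *in, int *more)
{
    int c;
    *more = (c = fgetc(in)) != EOF;
    return c;
}
```

For a stream whose next character is 'A', the first function returns 1 and the second returns 'A' with *more set to 1.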
Missing parentheses. Should be:
while((c = fgetc(fp)) != EOF)
Remember: fgetc() returns an int, not a char. It has to return an int because its set of return values includes all possible valid characters plus a separate (negative) EOF indicator.
There are two possible traps if you use type char for c instead of int:
If the type char is signed with your compiler, you will detect a valid character as EOF. Often, the character ÿ (y-umlaut, officially known in Unicode as LATIN SMALL LETTER Y WITH DIAERESIS, U+00FF, hex code 0xFF in the ISO 8859-1 aka Latin 1 code set) will be detected as equivalent to EOF, when it is a valid character.
If the type char is unsigned, then the comparison will never be true.
Both problems are serious, and both are avoided by using the correct type:
FILE *fp = fopen("file.txt", "r");
if (fp != 0)
{
    int c;
    int nl = 0;
    while ((c = fgetc(fp)) != EOF)
        if (c == '\n')
            nl++;
    printf("Number of lines: %d\n", nl);
    fclose(fp);
}
Note that the type is FILE and not File. Note that you should check that the file was opened before trying to read via fp.
If I explicitly give CTRL + D, the EOF is detected even when I use char c.
This means that your compiler provides you with char as a signed type. It also means you will not be able to count lines accurately in files which contain ÿ.
Unlike CP/M and DOS, Unix does not use any character to indicate EOF; you reach EOF when there are no more characters to read. What confuses many people is that if you type a certain key combination at the terminal, programs detect EOF. What actually happens is that the terminal driver recognizes the character and sends any unread characters to the program. If there are no unread characters, the program gets 0 bytes returned, which is the same result you get when you've reached the end of file. So, the character combination (often, but not always, Ctrl-D) appears to 'send EOF' to the program. However, the character is not stored in a file if you are using cat >file; further, if you read a file which contains a control-D, that is a perfectly fine character with byte value 0x04. If a program generates a control-D and sends that to a program, that does not indicate EOF to the program. It is strictly a property of Unix terminals (tty and pty — teletype and pseudo-teletype — devices).
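As a rough illustration of the point that a control-D byte inside a file is ordinary data, here is a sketch with a hypothetical helper, count_bytes, that reads until fgetc reports EOF:

```c
#include <stdio.h>

/* Hypothetical helper: counts every byte in the stream until fgetc
   reports EOF. A 0x04 (Ctrl-D) byte is just data here; only running
   out of bytes ends the loop. */
long count_bytes(FILE *in)
{
    long n = 0;

    while (fgetc(in) != EOF)
        n++;
    return n;
}
```

Run on a file holding the three bytes 'a', 0x04, 'b', this counts 3; the loop does not stop at the 0x04.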
You do not show how you declare the variable c; it should be of type int, not char.

Reading input from file in C

Okay, so I have an input file, and I successfully count the words and characters in each line.
When I get to the end of the line using the code below, the loop exits, so only the first line is read. How do I move on to the next line of input to continue the program?
EDIT: I must parse each line separately, so I can't use EOF.
while( (c = getchar()) != '\n')
Change '\n' to EOF. You're reading until the end of the line when you want to read until the end of the file (EOF is a macro in stdio.h; it is the value getchar returns when there is no more input, not a character stored in the file).
Disclaimer: I make no claims about the security of the method.
'\n' is the line feed (newline) character, so the loop terminates when the end of the first line is reached. The end of the file is not marked by a character; getchar() signals it by returning EOF. stdio.h (cstdio in C++), which declares getchar(), defines the EOF constant, so just change the while line to
while( (c = getchar()) != EOF)
From the man page: "reads the next character from stream and returns it as an unsigned char cast to an int, or EOF on end of file or error." EOF is a macro (often -1) for the return value of this and related functions that indicates end of file. You want to check whether this is what you're getting back. Note that getc returns a signed int, but that valid values are unsigned chars cast to ints. Watch out if c is a signed char.
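Since the question asks for each line to be parsed separately, one possible approach is to test for both '\n' and EOF in the inner loop and call it once per line. A minimal sketch; the helper name count_line_chars and its interface are assumptions, not something from the question:

```c
#include <stdio.h>

/* Hypothetical helper: reads one line from `in` and counts its characters
   (newline excluded). Returns 1 if a line was read, 0 when the stream was
   already at end of file before any character arrived. */
int count_line_chars(FILE *in, long *nchars)
{
    int c;                      /* int, so EOF stays distinguishable */
    long n = 0;

    while ((c = getc(in)) != EOF && c != '\n')
        n++;

    if (c == EOF && n == 0)
        return 0;               /* nothing left to read */

    *nchars = n;
    return 1;
}
```

A caller can then write while (count_line_chars(stdin, &n)) { ... } to handle the lines one at a time. A final line without a trailing newline is still reported, and empty lines are reported with a count of 0.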
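Well, on some systems (Windows, for example) the end of a line in the file is actually a combination of two bytes: byte 13 (carriage return) followed by byte 10 (line feed). In C, '\n' itself is a single character, and a stream opened in text mode normally translates the pair to '\n' for you. If the two bytes reach your program untranslated, you could try something like,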
int c2=getchar(),c1;
while(1)
{
    c1=c2;
    c2=getchar();
    if(c1==EOF)
        break;
    if(c1==(char)13 && c2==(char)10)
        break;
    /*use c1 as the input character*/
}
This should test whether two consecutive input characters form the pair (13, 10).

Resources