What really is EOF for binary files? Condition? Character? - c

I have managed this far with the knowledge that EOF is a special character inserted automatically at the end of a text file to indicate its end. But I now feel the need for some more clarification on this. I checked on Google and the Wikipedia page for EOF but they couldn't answer the following, and there are no exact Stack Overflow links for this either. So please help me on this:
My book says that binary mode files keep track of the end of file from the number of characters present in the directory entry of the file. (In contrast to text files which have a special EOF character to mark the end). So what is the story of EOF in context of binary files? I am confused because in the following program I successfully use !=EOF comparison while reading from an .exe file in binary mode:
#include <stdio.h>
#include <stdlib.h>
int main()
{
    int ch;
    FILE *fp1, *fp2;
    fp1 = fopen("source.exe", "rb");
    fp2 = fopen("dest.exe", "wb");
    if (fp1 == NULL || fp2 == NULL)
    {
        printf("Error opening files");
        exit(-1);
    }
    while ((ch = getc(fp1)) != EOF)
        putc(ch, fp2);
    fclose(fp1);
    fclose(fp2);
}
Is EOF a special "character" at all? Or is it a condition as Wikipedia says, a condition where the computer knows when to return a particular value like -1 (EOF on my computer)? Example of such "condition" being when a character-reading function finishes reading all characters present, or when character/string I/O functions encounter an error in reading/writing?
Interestingly, the Stack Overflow tag for EOF blended both those definitions of the EOF. The tag for EOF said "In programming realm, EOF is a sequence of byte (or a character) which indicates that there are no more contents after this.", while it also said in the "about" section that "End of file (commonly abbreviated EOF) is a condition in a computer operating system where no more data can be read from a data source. The data source is usually called a file or stream."
But I have a strong feeling EOF won't be a character as every other function seems to be returning it when it encounters an error during I/O.
It will be really nice of you if you can clear the matter for me.

The various EOF indicators that C provides to you do not necessarily have anything to do with how the file system marks the end of a file.
Most modern file systems know the length of a file because they record it somewhere, separately from the contents of the file. The routines that read the file keep track of where you are reading and they stop when you reach the end. The C library routines generate an EOF value to return to you; they are not returning a value that is actually in the file.
Note that the EOF returned by C library routines is not actually a character. The C library routines generally return an int, and that int is either a character value or an EOF. E.g., in one implementation, the characters might have values from 0 to 255, and EOF might have the value −1. When the library routine encountered the end of the file, it did not actually see a −1 character, because there is no such character. Instead, it was told by the underlying system routine that the end of file had been reached, and it responded by returning −1 to you.
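As a small illustration of why the return type matters, here is a minimal sketch (the function name count_bytes is made up for this example). It counts the bytes in a stream and keeps the result of getc in an int, so a data byte of 0xFF is never mistaken for EOF:

```c
#include <stdio.h>

/* Counts the bytes remaining in a stream. ch must be an int: a plain
   char could not tell the data byte 0xFF apart from the EOF value. */
long count_bytes(FILE *fp)
{
    long n = 0;
    int ch;                          /* int, not char */

    while ((ch = getc(fp)) != EOF)   /* EOF is a return value, not file data */
        n++;
    return n;
}
```

If ch were declared as a char, then depending on whether char is signed the loop could either stop early at the first 0xFF byte or never terminate.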
Old and crude file systems might have a value in the file that marks the end of file. For various reasons, this is usually undesirable. In its simplest implementation, it makes it impossible to store arbitrary data in the file, because you cannot store the end-of-file marker as data. One could, however, have an implementation in which the raw data in the file contains something that indicates the end of file, but data is transformed when reading or writing so that arbitrary data can be stored. (E.g., by “quoting” the end-of-file marker.)
In certain cases, things like end-of-file markers also appear in streams. This is common when reading from the terminal (or a pseudo-terminal or terminal-like device). On Windows, pressing control-Z is an indication that the user is done entering input, and it is treated similarly to reaching the end of a file. This does not mean that control-Z is an EOF. The software reading from the terminal sees control-Z, treats it as end-of-file, and returns end-of-file indications, which are likely different from control-Z. On Unix, control-D is commonly a similar sentinel marking the end of input.

This should clear it up nicely for you.
Basically, EOF is just a macro with a pre-defined value representing the error code from I/O functions indicating that there is no more data to be read.

The file doesn't actually contain an EOF. EOF isn't a character of sorts - remember a byte can be between 0 and 255, so it wouldn't make sense if a file could contain a -1. The EOF is a signal from the operating system that you're using, which indicates the end of the file has been reached. Notice how getc() returns an int - that is so it can return that -1 to tell you the stream has reached the end of the file.
The EOF signal is treated the same for binary and text files - the actual definition of binary and text stream varies between the OSes (for example on *nix binary and text mode are the same thing.) Either way, as stated above, it is not part of the file itself. The OS passes it to getc() to tell the program that the end of the stream has been reached.
From the GNU C library:
This macro is an integer value that is returned by a number of narrow stream functions to indicate an end-of-file condition, or some other error situation. With the GNU C Library, EOF is -1. In other libraries, its value may be some other negative number.
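Because the same EOF value is returned both at end of file and on an error, the standard feof and ferror functions are the usual way to tell the two apart afterwards. A minimal sketch (the helper name drain is made up for this example):

```c
#include <stdio.h>

/* Reads fp until getc returns EOF. Returns 0 if reading stopped at a
   genuine end of file, -1 if it stopped because of a read error. */
int drain(FILE *fp)
{
    int ch;

    while ((ch = getc(fp)) != EOF)
        ;                            /* discard the data */

    if (ferror(fp))                  /* EOF was returned due to an error */
        return -1;
    return feof(fp) ? 0 : -1;        /* EOF was returned at end of file */
}
```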

EOF is not a character. In this context, it's -1, which, technically, isn't a character (if you wanted to be extremely precise, it could be argued that it could be a character, but that's irrelevant in this discussion). EOF, just to be clear, stands for "End of File". While you're reading a file, you need to know when to stop, otherwise a number of things could happen depending on the environment if you try to read past the end of the file.
So, a macro was devised to signal that End of File has been reached in the course of reading a file, which is EOF. For getc this works because it returns an int rather than a char, so there's extra room to return something other than a char to signal EOF. Other I/O calls may signal EOF differently, such as by throwing an exception.
As a point of interest, in DOS (and maybe still on Windows?) an actual, physical character ^Z was placed at the end of a file to signal its end. So, on DOS, there actually was an EOF character. Unix never had such a thing.

Well, it is quite possible to find the EOF of a binary file if you study its structure.
No, you don't need the OS to know the EOF of an executable.
Almost every type of executable has a Page Zero which describes the basic information that the OS might need while loading the code into the memory and is stored as the first page of that executable.
Let's take the example of an MZ executable.
https://wiki.osdev.org/MZ
Here at offset 2 we have the number of bytes in the last page, and right after that, at offset 4, we have the total number of complete/partial pages (a page is 512 bytes). This information is generally used by the OS to safely load the code into memory, but you can use it to calculate the EOF of your binary file.
Algorithm (assuming the file has at least two pages):
1. Start
2. Parse the parameter and open the file as per your requirement.
3. Load the first page (page zero) into a (char) buffer of the page size and print it.
4. Get the page count at *((unsigned short *)(buffer + 4)) and store it in a loop variable called (unsigned short) i.
5. Get the number of bytes in the last page at *((unsigned short *)(buffer + 2)) and store it in a variable called (unsigned short) l.
6. Subtract 2 from i: page zero has already been read, and the last page is handled separately in step 8.
7. Load and print (or do whatever you wanted to do) 'size of page' characters into the buffer, decrementing i each time, until i equals zero.
8. Once the loop has finished executing, load `l` bytes into that buffer (a full page if `l` is zero) and again perform whatever you wanted to do.
9. Stop
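The size calculation behind these steps can be sketched as follows. This assumes the header layout given on the linked OSDev page (bytes in the last page at offset 2, page count at offset 4, both 16-bit little-endian, 512-byte pages). The name mz_image_size is made up for this example.

```c
#define MZ_PAGE_SIZE 512L

/* Returns the image size in bytes recorded in an MZ header, or -1 if
   hdr does not start with the "MZ" signature or records zero pages.
   hdr must hold at least the first 6 bytes of the file. */
long mz_image_size(const unsigned char *hdr)
{
    unsigned last_bytes, pages;

    if (hdr[0] != 'M' || hdr[1] != 'Z')
        return -1;

    /* Assemble the 16-bit little-endian fields byte by byte, so the
       code does not depend on host endianness or alignment. */
    last_bytes = hdr[2] | ((unsigned)hdr[3] << 8);   /* offset 2 */
    pages      = hdr[4] | ((unsigned)hdr[5] << 8);   /* offset 4 */

    if (pages == 0)
        return -1;
    if (last_bytes == 0)             /* the last page is completely full */
        return (long)pages * MZ_PAGE_SIZE;
    return (long)(pages - 1) * MZ_PAGE_SIZE + (long)last_bytes;
}
```

Note that this is the size the header declares, not necessarily the size of the file on disk: a file may carry extra data after the MZ image, so the two can differ.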
If you're designing your own binary file format then consider adding some sort of meta data at the start of that file or a special character or word that denotes the end of that file.
And there's a good amount of probability that the OS loads the size of the file from here with the help of simple maths and by analyzing the meta-data even though it might seem that the OS has stored it somewhere along with other information it's expected to store (Abstraction to reduce redundancy).

Related

Copy file in C doesn't seem to work completely

For my programming course I have to make a program that copies a file.
This program asks for the following:
an input file in the command prompt
a name for the output file
The files required to copy are .WAV audio files. I tried this with an audio sample of 3 seconds.
The thing is that I do get a file back, but it is empty. I have added the fopen and fclose statements.
while((ch = fgetc(input)) != EOF)
{
    fputc(ch, output);
}
I hope someone can point out where I probably made some beginner's mistake.
The little while loop you show should in principle work if all prerequisites are met:
The files could be opened.
If on a Microsoft operating system, the files were opened in binary mode (see below).
ch is an int.
In other words, all problems you have are outside this code.
Binary mode: The CR-LF issue
There is a post explaining possible reasons for using a carriage return/linefeed combination; in the end, it is the natural thing to do, given that with typewriters, and by association with teletypes, the two are distinct operations: You move the large lever on the carriage to rotate the platen roller or cylinder a specified number of degrees so that the next line would not print over the previous one; that's the aptly named line feed. Only then, with the same lever, you move the carriage so that the horizontal print position is at the beginning of the line. That's the aptly named carriage return. The order of events is only a technicality.
DOS C implementations tried to be smart: A C program ported from Unix might produce text with only newlines in it; the output routines would transparently add the carriage return so that it would follow the DOS conventions and print properly. Correspondingly, CR/LF combinations in an input file would be silently converted to only LF when read by the standard library implementations.
The DOS file convention also uses CTRL-Z (26) as an end-of-file marker. Again, this could be a useful hint to a printer that all data for the current job had been received.
Unfortunately, these conventions were made the default behavior, and today are typically a nuisance: Nobody sends plain text to a printer any longer (apart from the three people who will comment under this post that they still do that).
It is a nuisance because for files that are not plain text silent data changes are catastrophic and must be suppressed, with a b "flag" indicating "binary" data passed in the fopen mode argument: To faithfully read you must specify fopen(filename, "rb"), and in order to faithfully write you must specify fopen(filename, "wb").
Empty file !?
When I tried copying a wave file without the binary flags, the data was changed in the described fashion, and the copy stopped before the first byte with the value 26 (CTRL-Z) in the source. In other words, while the copy was corrupt, it was not empty. By the way, all wave files start with the bytes RIFF, so no CTRL-Z can be encountered in the first position.
There are a number of possibilities for an empty target file, the most likely of which:
You didn't emit or missed an error message regarding opening the files (does your editor keep a lock on the output?), and the program crashed silently when one of the file pointers was null. Note that error messages may fail to be printed when you make error output on standard out: That stream is buffered, and buffered output may be lost in a crash. By contrast, output to stderr is unbuffered exactly to prevent message loss.
You are looking at the wrong output file. This kind of error is surprisingly common. You could perform a sanity check by deleting the file you are looking at, or by printing something manually before you start copying.
Generally, check the return value of every operation (including your fputc!).
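Putting the points above together, a copy routine that opens both files in binary mode and checks every call might look like this minimal sketch (the function name copy_file and the 4096-byte buffer size are arbitrary choices for the example):

```c
#include <stdio.h>

/* Copies src to dst byte for byte. Returns 0 on success, -1 on failure. */
int copy_file(const char *src, const char *dst)
{
    unsigned char buf[4096];
    size_t n;
    int ok = 1;
    FILE *in, *out;

    in = fopen(src, "rb");           /* "b": no newline or CTRL-Z translation */
    if (in == NULL) {
        perror(src);                 /* report on stderr, which is unbuffered */
        return -1;
    }
    out = fopen(dst, "wb");
    if (out == NULL) {
        perror(dst);
        fclose(in);
        return -1;
    }

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {   /* short write: disk full, etc. */
            ok = 0;
            break;
        }
    }
    if (ferror(in))                  /* fread stopped on an error, not on EOF */
        ok = 0;

    fclose(in);
    if (fclose(out) != 0)            /* flushing buffered output can fail too */
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would check the result, for example `if (copy_file(argv[1], argv[2]) != 0) return EXIT_FAILURE;`, rather than assuming the copy worked.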

How can I know in which way newline is represented in my environment?

I want to know which ASCII characters represent a newline in my environment.
How can I check it?
When I read it with getchar or scanf and check the ASCII number that was read, I get 10.
How can I check the sequence that represents a newline in the environment itself?
Those "text-aware" I/O functions will abstract this and do conversions so that '\n' works.
One way is to create a text file containing a single (empty) line of text, then re-open it in binary mode and inspect the contents. Binary mode will turn off any such translations of course, and expose the raw bytes.
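That experiment can be sketched as follows (the function name newline_bytes is made up for this example, and the caller supplies the path of a scratch file):

```c
#include <stdio.h>

/* Writes one '\n' to path in text mode, reads the file back in binary
   mode, and stores the raw bytes in out (at most cap of them).
   Returns the number of bytes stored, or -1 on failure. */
int newline_bytes(const char *path, unsigned char *out, size_t cap)
{
    size_t n;
    FILE *fp = fopen(path, "w");     /* text mode: translation is active */

    if (fp == NULL)
        return -1;
    if (putc('\n', fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    fp = fopen(path, "rb");          /* binary mode: raw bytes, no translation */
    if (fp == NULL)
        return -1;
    n = fread(out, 1, cap, fp);
    fclose(fp);
    return (int)n;
}
```

On Unix-like systems this should report the single byte 10 (LF); on Windows it should report the two bytes 13 and 10 (CR LF).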
Not sure how you'd do that without touching the file system, but I'm sure it's doable. Most of the time this kind of thing is static: it's always going to be the same for a particular target platform, so it's of course possible to add the knowledge at compile time instead.

How does the operating system recognize end of a text file?

For example, there is a text file called "Hello.txt"
Hello World!
Then how does the operating system (I'm using MS-DOS) recognize the end of this text file? Is some kind of character or symbol hidden after '!' which indicates the end of file?
If you use MS-DOS then there is some chance that there is indeed a special character at the end of the text. MS-DOS was derived from QDOS, which Tim Paterson wrote to be as compatible as possible with the then-dominant CP/M. CP/M, an OS for 8-bit machines, kept track of a file's size only by counting the number of disk sectors used by the file, which made the file size always a multiple of 128 bytes.
That required a hack to indicate the real end of a text file, since it could be located in the middle of a sector: CP/M used the Ctrl+Z control character (character code 0x1A). A language runtime implementation in turn had to remove it again and declare end-of-file when it encountered the character. Ctrl+Z is not quite forgotten; it still works when you type it in a Windows console to terminate input. Compare to Ctrl+D in a Unix terminal.
Whether it actually is present in the file depends on what program created the file, which would have to be an MS-DOS program as well to get the Ctrl+Z appended. It is certainly not required. Paterson improved on CP/M to remove some of its restrictions, greatly aided by having a lot more address space available (1 MB vs 64 KB): MS-DOS keeps track of the actual number of bytes in a file, so it can always reliably indicate the true end of a file. That is probably the most accurate answer to your question.
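To illustrate what such a runtime has to do, here is a minimal sketch. The name text_length is made up for this example, and the buffer stands in for file contents padded out to whole 128-byte sectors.

```c
#include <stddef.h>

#define CTRL_Z 0x1A   /* the end-of-text marker described above */

/* Returns the logical length of a text held in buf: the offset of the
   first Ctrl+Z, or n if the buffer contains none. */
size_t text_length(const unsigned char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (buf[i] == CTRL_Z)
            return i;                /* everything from here on is padding */
    return n;
}
```

For a 128-byte sector holding "Hello World!" followed by Ctrl+Z padding, this yields 12. It also shows why the scheme only suits text: binary data may legitimately contain the byte 0x1A.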
Ancient history btw, invest your time wisely.

Convert binary data file generated in windows to linux

I apologize ahead of time for my lack of c knowledge, as I am a native FORTRAN programmer. I was given some c code to debug which ingests a binary file and parses it into an input file containing several hundred records (871, to be exact) for a Fortran program that I'm working with. The problem is that these input binaries, and the associated c code, were created in a Windows environment. The parser reads through the binary until it reaches the end of the file:
SAGE_Lvl0_Packet GetNextPacket()
{
    int i;
    SAGE_Lvl0_Packet inpkt;
    WORD rdbuf[128];
    memset(rdbuf,0,sizeof(rdbuf));
    fprintf(stdout,"Nbytes: %u\n",Nbytes);//returns 224
    if((i = fread(rdbuf,Nbytes,1,Fp)) != 1)
        FileEnd = 1;
    else
    {
        if(FileType == 0)
            memcpy(&(inpkt.CCSDS),rdbuf,Nbytes);
        else
            memcpy(&inpkt,rdbuf,Nbytes);
        memcpy(&CurrentPacket,&inpkt,sizeof(inpkt));
    }
    return inpkt;
}
So when the code gets to packet 872, this snippet should return FileEnd = 1. Instead, the parser attempts to read a large amount of data from (near) the end of the file. This, I would think, would cause the program to crash (at least it would in Fortran. Would c just start reading the next portion of memory?) Fortunately, there is a CRC later on in the code that catches that the parser isn't reading correct data and exits gracefully.
I assume the problem originates with the binary buffer size and value in a Windows binary being larger/different than that in Linux. If that is the case, is there an easy way to convert Windows' binaries to Linux either in c or Linux? If I'm wrong in my assumption, then perhaps I need to look over the code some more. BTW, a WORD is an unsigned short int, and a SAGE_Lvl0_Packet is a 3-tiered structure with a total of 106 WORDs.
I think the biggest problem here is that, when fread() indicates end of file, the FileEnd flag gets set, but the function still ends up returning an invalid packet (inpkt is never filled in on that path; only rdbuf is zeroed). Not a particularly robust design. I assume that the caller should be checking FileEnd before it attempts to use the packet just returned, but since that's not shown, it's quite possible that's a false assumption.
Also, not knowing what the packet looks like, it's impossible to tell whether the various memcpy() calls are correct. The fact that memcpy() is asked to copy 224 bytes into a structure that is supposedly only 212 bytes long is highly problematic.
There are likely other issues, but those are the big ones I see at the moment.
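One way to address both points is to return a status instead of a packet and to refuse record sizes that do not fit the structure. The following is only a sketch: the Packet type is a hypothetical stand-in for SAGE_Lvl0_Packet (106 16-bit words, as described in the question), and read_packet is a made-up name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the real packet type: 106 16-bit words. */
typedef struct {
    unsigned short words[106];
} Packet;

/* Reads one record of nbytes from fp into *pkt.
   Returns 1 on success, 0 at end of file (including a short final
   record), -1 on a read error or if the record would not fit. */
int read_packet(FILE *fp, Packet *pkt, size_t nbytes)
{
    unsigned char raw[sizeof(Packet)];

    if (nbytes > sizeof raw)
        return -1;                   /* copying would overflow the structure */

    if (fread(raw, nbytes, 1, fp) != 1)
        return ferror(fp) ? -1 : 0;  /* no valid packet to hand back */

    memset(pkt, 0, sizeof *pkt);
    memcpy(pkt, raw, nbytes);
    return 1;
}
```

The caller then loops with `while (read_packet(fp, &pkt, nbytes) == 1)`, so an invalid packet can never be processed by accident. With a 212-byte structure, a request for 224 bytes is rejected up front instead of silently overrunning memory.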

C file reading incorrect number of chars

I have stumbled across a problem where I am attempting to read in a file, which is, according to windows, '87.1 kb' in size, and using the ftell method in program, returns '89282', effectively confirming what windows is saying.
So why is every method to read chars from the file only returning 173 or 174 characters?
The file is a .GIF file renamed to .txt (and I am trying to build a program that can load the data fully as I am working on a program to download online images and need to run comparisons on them).
So far I have tried:
fgetc - This returns 173/174 chars.
fread - Same as above, this is with a string with 1024 or more spaces available.
fgets - Doesn't work (as it doesn't return how many characters it has read - characters which include nulls).
setvbuf - Disabling this with _IONBF, or even supplying a buffer of 1024 or more only means 173/174 is still returned.
fflush - This produced a 'result', although a negative one - it returned '2' chars instead of '173'.
I am utterly stumped as to why it isn't reading anything more than 173/174 chars. Is there something I need to compensate for or expect at the lower level? Some buffer I need to expand or some weird character I need to look out for?
Here's one thing to look at. Have a look at the file in a hex viewer and see if there's a CTRL-Z somewhere around that 173/174 offset.
Then check to see if you're opening it with the "r" mode.
If so, it may be that the Windows translation between text and binary is stopping your reading there because CTRL-Z is an EOF marker in text mode. If so, you can probably fix this with "rb" mode on the fopen.
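For comparison, here is a minimal sketch of reading an entire file in "rb" mode (the function name read_all is made up for this example). In binary mode a CTRL-Z byte is ordinary data, so the byte count returned should match the size reported by the file system:

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole file at path in binary mode. On success returns a
   malloc'd buffer (the caller frees it) and stores the byte count in
   *len; returns NULL on failure. */
unsigned char *read_all(const char *path, size_t *len)
{
    unsigned char *buf = NULL, *tmp;
    size_t used = 0, cap = 0, n;
    FILE *fp = fopen(path, "rb");    /* binary: CTRL-Z is just data */

    if (fp == NULL)
        return NULL;

    for (;;) {
        if (used == cap) {           /* buffer full: double its capacity */
            cap = cap ? cap * 2 : 4096;
            tmp = realloc(buf, cap);
            if (tmp == NULL) {
                free(buf);
                fclose(fp);
                return NULL;
            }
            buf = tmp;
        }
        n = fread(buf + used, 1, cap - used, fp);
        used += n;
        if (n == 0)                  /* end of file or error */
            break;
    }
    if (ferror(fp)) {
        free(buf);
        fclose(fp);
        return NULL;
    }
    fclose(fp);
    *len = used;
    return buf;
}
```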
Failing that, you need to post the smallest code segment that exhibits the problem behaviour. It may be obvious to some of us here but only usually if we can see the code :-)
