For example, there is a text file called "Hello.txt"
Hello World!
Then how does the operating system (I'm using MS-DOS) recognize the end of this text file? Is some kind of character or symbol hidden after '!' which indicates the end of file?
If you use MS-DOS then there are decent odds that there is indeed a special character at the end of the file. MS-DOS was derived from Tim Paterson's QDOS, which he wrote to be as compatible as possible with the then-dominant CP/M. CP/M, an OS for 8-bit machines, kept track of a file's size only by counting the number of disk sectors the file used, which made the file size always a multiple of 128 bytes.
That required a hack to indicate the real end of a text file, since it could fall in the middle of a sector: the Ctrl+Z control character (character code 0x1A). A language runtime implementation then had to remove it again and declare end-of-file when it encountered that character. Ctrl+Z is not quite forgotten; it still works when you type it in a Windows console to terminate input. Compare Ctrl+D in a Unix terminal.
Whether it is actually present in the file depends on what program created the file, which would have to be an MS-DOS program as well to get the Ctrl+Z appended. It is certainly not required. Paterson improved on CP/M to remove some of its restrictions, greatly aided by having a lot more address space available (1 MB vs 64 KB): MS-DOS keeps track of the actual number of bytes in a file, so it can always reliably indicate the true end of a file. That is probably the most accurate answer to your question.
Ancient history btw, invest your time wisely.
Related
With the C standard library stdio.h, I read that to output ASCII/text data, one should use mode "w" and to output binary data, one should use "wb". But why the difference?
In either case, I'm just outputting a byte (char) array, right? And if I output a non-ASCII byte in ASCII mode, the program still outputs the correct byte.
Some operating systems - most notably Windows - don't guarantee that they will read and write text to files exactly the way you pass it in. On Windows, '\n' is translated to "\r\n" on output and "\r\n" back to '\n' on input. This is fine and transparent when reading and writing text, but it would trash a stream of binary data. Basically, always give Windows the 'b' flag if you want it to faithfully read and write data to files exactly the way you passed it in.
There are certain transformations that can take place when outputting in text (ASCII) mode (e.g. outputting carriage-return + line-feed when the output character is a newline), depending on your platform. Such transformations will not take place when using binary mode.
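If you want to see the difference for yourself, here is a minimal sketch (the file names are made up for the illustration) that writes the same bytes through a text-mode and a binary-mode stream and then compares the resulting file sizes. On Windows the text-mode file is typically one byte longer per newline; on Unix-like systems the two files usually come out identical.

#include <stdio.h>

/* Report a file's size by reopening it in binary mode, so no translation
   happens and we count the raw bytes on disk. */
static long file_size(const char *name)
{
    FILE *f = fopen(name, "rb");
    long size = -1;
    if (f) {
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        fclose(f);
    }
    return size;
}

int main(void)
{
    const char payload[] = "a\nb";          /* contains one newline */
    FILE *text = fopen("text_mode.txt", "w");
    FILE *bin  = fopen("binary_mode.txt", "wb");
    if (!text || !bin)
        return 1;

    fwrite(payload, 1, sizeof payload - 1, text);
    fwrite(payload, 1, sizeof payload - 1, bin);
    fclose(text);
    fclose(bin);

    printf("text mode:   %ld bytes\n", file_size("text_mode.txt"));
    printf("binary mode: %ld bytes\n", file_size("binary_mode.txt"));
    return 0;
}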
I have managed this far with the knowledge that EOF is a special character inserted automatically at the end of a text file to indicate its end. But I now feel the need for some more clarification on this. I checked on Google and the Wikipedia page for EOF but they couldn't answer the following, and there are no exact Stack Overflow links for this either. So please help me on this:
My book says that binary mode files keep track of the end of file from the number of characters present in the directory entry of the file. (In contrast to text files which have a special EOF character to mark the end). So what is the story of EOF in context of binary files? I am confused because in the following program I successfully use !=EOF comparison while reading from an .exe file in binary mode:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int ch;                          /* int, not char, so it can hold EOF */
    FILE *fp1, *fp2;

    fp1 = fopen("source.exe", "rb");
    fp2 = fopen("dest.exe", "wb");
    if (fp1 == NULL || fp2 == NULL)
    {
        printf("Error opening files");
        exit(-1);
    }

    /* copy byte by byte until getc() reports EOF */
    while ((ch = getc(fp1)) != EOF)
        putc(ch, fp2);

    fclose(fp1);
    fclose(fp2);
}
Is EOF a special "character" at all? Or is it a condition as Wikipedia says, a condition where the computer knows when to return a particular value like -1 (EOF on my computer)? Example of such "condition" being when a character-reading function finishes reading all characters present, or when character/string I/O functions encounter an error in reading/writing?
Interestingly, the Stack Overflow tag for EOF blends both those definitions of EOF. The tag description says "In programming realm, EOF is a sequence of bytes (or a character) which indicates that there are no more contents after this.", while it also says in the "about" section that "End of file (commonly abbreviated EOF) is a condition in a computer operating system where no more data can be read from a data source. The data source is usually called a file or stream."
But I have a strong feeling EOF can't be a character, as every such function seems to return it when it encounters an error during I/O.
It will be really nice of you if you can clear the matter for me.
The various EOF indicators that C provides to you do not necessarily have anything to do with how the file system marks the end of a file.
Most modern file systems know the length of a file because they record it somewhere, separately from the contents of the file. The routines that read the file keep track of where you are reading and they stop when you reach the end. The C library routines generate an EOF value to return to you; they are not returning a value that is actually in the file.
Note that the EOF returned by C library routines is not actually a character. The C library routines generally return an int, and that int is either a character value or an EOF. E.g., in one implementation, the characters might have values from 0 to 255, and EOF might have the value −1. When the library routine encountered the end of the file, it did not actually see a −1 character, because there is no such character. Instead, it was told by the underlying system routine that the end of file had been reached, and it responded by returning −1 to you.
Old and crude file systems might have a value in the file that marks the end of file. For various reasons, this is usually undesirable. In its simplest implementation, it makes it impossible to store arbitrary data in the file, because you cannot store the end-of-file marker as data. One could, however, have an implementation in which the raw data in the file contains something that indicates the end of file, but data is transformed when reading or writing so that arbitrary data can be stored. (E.g., by “quoting” the end-of-file marker.)
In certain cases, things like end-of-file markers also appear in streams. This is common when reading from the terminal (or a pseudo-terminal or terminal-like device). On Windows, pressing control-Z is an indication that the user is done entering input, and it is treated similarly to reaching the end of a file. This does not mean that control-Z is an EOF. The software reading from the terminal sees control-Z, treats it as the end of input, and returns end-of-file indications, which are likely different from control-Z. On Unix, control-D is commonly a similar sentinel marking the end of input.
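As a small illustration of that last point, EOF is just an integer macro defined by <stdio.h>, and it deliberately lies outside the range of byte values that getc() can return from a file; a minimal sketch:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* EOF is an ordinary integer constant, not a byte stored in any file:
       getc() returns bytes as values in the range 0..UCHAR_MAX, while EOF
       is a negative number (commonly -1). */
    printf("EOF on this implementation = %d\n", EOF);
    printf("byte values range from 0 to %d\n", UCHAR_MAX);
    return 0;
}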
This should clear it up nicely for you.
Basically, EOF is just a macro with a pre-defined value representing the error code from I/O functions indicating that there is no more data to be read.
The file doesn't actually contain an EOF. EOF isn't a character as such - remember a byte can be between 0 and 255, so it wouldn't make sense for a file to contain a -1. EOF is a signal from the operating system that you're using, which indicates that the end of the file has been reached. Notice how getc() returns an int - that is so it can return that -1 to tell you the stream has reached the end of the file.
The EOF signal is treated the same for binary and text files - the actual definition of binary and text stream varies between the OSes (for example on *nix binary and text mode are the same thing.) Either way, as stated above, it is not part of the file itself. The OS passes it to getc() to tell the program that the end of the stream has been reached.
From the GNU C library documentation:
This macro is an integer value that is returned by a number of narrow stream functions to indicate an end-of-file condition, or some other error situation. With the GNU C Library, EOF is -1. In other libraries, its value may be some other negative number.
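Since the same EOF value can signal either the end of the data or an I/O error, standard C also provides feof() and ferror() to tell the two apart after a read loop; a small sketch (the file name is only illustrative):

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("input.dat", "rb");   /* hypothetical file name */
    if (fp == NULL)
        return 1;

    int ch;
    while ((ch = getc(fp)) != EOF) {
        /* process ch ... */
    }

    /* getc() returned EOF: decide why. */
    if (ferror(fp))
        fprintf(stderr, "read error before the end of the file\n");
    else if (feof(fp))
        fprintf(stderr, "reached the actual end of the file\n");

    fclose(fp);
    return 0;
}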
EOF is not a character. In this context, it's -1, which, technically, isn't a character (if you wanted to be extremely precise, it could be argued that it could be a character, but that's irrelevant in this discussion). EOF, just to be clear is "End of File". While you're reading a file, you need to know when to stop, otherwise a number of things could happen depending on the environment if you try to read past the end of the file.
So, a macro was devised to signal that End of File has been reached in the course of reading a file, which is EOF. For getc this works because it returns an int rather than a char, so there's extra room to return something other than a char to signal EOF. Other I/O calls may signal EOF differently, such as by throwing an exception.
As a point of interest, in DOS (and maybe still on Windows?) an actual, physical character ^Z was placed at the end of a file to signal its end. So, on DOS, there actually was an EOF character. Unix never had such a thing.
Well, it is quite possible to find the end of a binary file if you study its structure.
No, you don't need the OS to tell you where an executable file ends.
Almost every executable format has a "page zero" stored as the first page of the file, describing the basic information that the OS might need while loading the code into memory.
Let's take the example of an MZ executable.
https://wiki.osdev.org/MZ
Here at offset 2 we have the number of bytes used in the last page, and right after that at offset 4 we have the total number of complete/partial 512-byte pages. This information is generally used by the OS to safely load the code into memory, but you can use it to calculate where your binary file ends (see the sketch after the algorithm below).
Algorithm:
1. Start.
2. Parse the parameters and set up the file pointer as per your requirement.
3. Load the first page (page zero) into a (char) buffer of the default page-zero size and print it.
4. Get the value at *(short int *)(buffer + 4) (pages in the file) and store it in a loop variable called (short int) i.
5. Get the value at *(short int *)(buffer + 2) (bytes used in the last page) and store it in a variable called (short int) l.
6. i--
7. Load and print (or do whatever you wanted to do) 'size of page' bytes into the buffer until i reaches zero.
8. Once the loop has finished executing, just load `l` bytes into that buffer and again perform whatever you wanted to do.
9. Stop.
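For what it's worth, here is a minimal C sketch of the same idea, assuming a well-formed MZ header (it reuses the "source.exe" name from the earlier example): it reads the two header fields and computes the file size as full 512-byte pages plus the bytes used in the last page, where a value of 0 means the last page is full.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    FILE *fp = fopen("source.exe", "rb");
    if (fp == NULL)
        return 1;

    /* Read just the first six header bytes: "MZ", bytes-in-last-page,
       pages-in-file (both little-endian 16-bit values). */
    unsigned char header[6];
    if (fread(header, 1, sizeof header, fp) != sizeof header) {
        fclose(fp);
        return 1;
    }

    uint16_t bytes_in_last_page = (uint16_t)(header[2] | (header[3] << 8));
    uint16_t pages_in_file      = (uint16_t)(header[4] | (header[5] << 8));

    long size = (long)pages_in_file * 512;
    if (bytes_in_last_page != 0)
        size = size - 512 + bytes_in_last_page;

    printf("size according to the MZ header: %ld bytes\n", size);
    fclose(fp);
    return 0;
}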
If you're designing your own binary file format, consider adding some sort of metadata at the start of the file, or a special character or word that denotes its end.
And there's a fair chance the OS derives the size of such a file from that metadata with a bit of simple arithmetic, even though it might seem that the OS stores the size somewhere along with the other information it's expected to keep (abstraction to reduce redundancy).
I'm practising file management in C. I saw that there are plenty of modes for opening a file with fopen, using letters such as a, r, etc. That's all fine, but I also read that if I add b to that mode string, the file is treated as a binary file. What does that mean? What are the differences from a normal file?
Opening a file in text mode causes the C libraries to do some handling specific to text. For example, new lines are different between Windows and Unix/linux but you can simply write '\n' because C is handling that difference for you.
Opening a file in binary mode doesn't do any of this special handling, it just treats it as raw bytes. There's a bit of a longer explanation of this on the C FAQ
Note that this only matters on Windows; Unix/linux systems don't (need to) differentiate between text and binary modes, though you can include the 'b' flag without them complaining.
If you open a regular file in the binary mode, you'll get all its data as-is and whatever you write into it, will appear in it.
OTOH, if you open a regular file in the text mode, things like ends of lines can get special treatment. For example, the sequence of bytes with values of 13 (CR or '\r') and 10 (LF or '\n') can get truncated to just one byte, 10, when reading or 10 can get expanded into 13 followed by 10 when writing. This treatment is platform-specific (read, compiler/OS-specific).
For text files, this is often unimportant. But if you apply the text mode to a non-text file, you risk data corruptions.
Also, reading and writing bytes at arbitrary offsets in files opened in the text mode isn't supported because of that special treatment.
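As an illustration of that last point, random access with fseek() to arbitrary byte offsets is only well-defined on a stream opened in binary mode; a small sketch (the file name and offset are made up for the example):

#include <stdio.h>

int main(void)
{
    /* Binary mode: byte offsets mean exactly what they say.  In text mode,
       fseek() is only guaranteed to work with offsets previously returned
       by ftell(). */
    FILE *fp = fopen("data.bin", "rb");
    if (fp == NULL)
        return 1;

    /* Jump straight to byte 128 and read 16 raw bytes from there. */
    unsigned char record[16];
    if (fseek(fp, 128, SEEK_SET) == 0 &&
        fread(record, 1, sizeof record, fp) == sizeof record) {
        printf("first byte of the record at offset 128: 0x%02X\n", record[0]);
    }

    fclose(fp);
    return 0;
}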
The difference is explained here
A binary file is a series of 1s and 0s. In the case of a compiled program, that is machine language, which microprocessors can interpret directly. It is much more compact, but not readable by humans.
For this reason, text files are a string of binary values designated to be displayed as more people-friendly characters, which lend themselves to language much better than raw bits. ASCII is an example of one such designation. This reveals the truth of the matter: all files are binary at the lowest level.
But binary lends itself to any application which does not have to be textually legible to us lowly humans =] Example applications where binary is preferred are sound files, images, and compiled programs. The reason binary is preferred to text there is that it is more efficient to have an image described in a machine-oriented form than textually (which would have to be translated anyway).
There are two types of files: text files and binary files.
Binary files have two features that distinguish them from text files: You can jump instantly to any record in the file, which provides random access as in an array; and you can change the contents of a record anywhere in the file at any time. Binary files also usually have faster read and write times than text files, because a binary image of the record is stored directly from memory to disk (or vice versa). In a text file, everything has to be converted back and forth to text, and this takes time.
more info here
b is for working with binary files. However, this has no effect on POSIX compliant operating systems.
from the manpage of fopen:
The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux. (Other systems may treat text files and binary files differently, and adding the 'b' may be a good idea if you do I/O to a binary file and expect that your program may be ported to non-UNIX environments.)
I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like BOM [byte-order-marker])
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for goahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long - that is assuming you're not performing Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check whether every byte is a text character.
So basically the first thing you have to do is check the file's encoding. If it's pure ASCII, you have an easy task: just read the whole file into a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with functions like isalpha and isdigit... of course you have to take care of special exceptions like the tabulator '\t', the space ' ', or the newline ('\n' on Linux, "\r\n" on Windows).
In case of a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric... also note that in the case of UTF-16 or wider, a simple char array is simply too small... but if you are doing it in, for example, C#, you don't have to worry about the size :)
You can write a function that will try to determine if a file is text based. While this will not be 100% accurate, it may be just enough for you. Such a function does not need to go through the whole file, about a kilobyte should be enough (or even less). One thing to do is to count how many whitespaces and newlines are there. Another thing would be to consider individual bytes and check if they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.
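A rough sketch of such a heuristic, assuming single-byte text (so it will misjudge some encodings, exactly as the caveat above warns): read the first kilobyte, reject the file outright if it contains NUL bytes, and tolerate only a small share of other non-printable bytes. The buffer size and threshold are arbitrary choices for the illustration.

#include <stdio.h>
#include <ctype.h>

/* Heuristic only: returns 1 if the first chunk of the file looks like
   single-byte text, 0 otherwise. */
static int is_probably_text(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;

    unsigned char buf[1024];
    size_t n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);

    size_t suspicious = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == 0)
            return 0;                    /* NUL byte: almost certainly binary */
        if (!isprint(buf[i]) && !isspace(buf[i]))
            suspicious++;
    }
    /* Allow roughly 5% odd bytes before calling it binary. */
    return n == 0 || suspicious * 20 <= n;
}

int main(void)
{
    printf("%d\n", is_probably_text("example.txt"));   /* illustrative path */
    return 0;
}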
It seems like just putting a linefeed is good enough, but I know it is supposed to be carriage return + line feed. Does anything horrible happen if you don't put the carriage return and only use line feeds?
This is in ANSI C and not going to be redirected to a file or anything else. Just a normal console app.
The Windows console follows the same line ending convention that is assumed for files, or for that matter for actual, physical terminals. It needs to see both CR and LF to properly move to the next line.
That said, there is a lot of software infrastructure between an ANSI C program and that console. In particular, any standard C library I/O function is going to try to do the right thing, assuming you've allowed it the chance. This is why fopen()'s b modifier (and, on many platforms, the non-standard t modifier) for the mode parameter was defined.
In text mode (the default for most streams, and in particular for stdin and stdout), any \n printed is converted to a CRLF sequence, and the reverse happens for reads. To turn off that behavior, use the b modifier.
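On Windows with the Microsoft C runtime, one way to turn that translation off for a stream that is already open, such as stdout, is _setmode(); a minimal sketch, specific to that runtime:

#include <stdio.h>
#include <io.h>       /* _setmode, _fileno (Microsoft C runtime) */
#include <fcntl.h>    /* _O_BINARY */

int main(void)
{
    /* Switch stdout to binary mode so '\n' is no longer expanded to "\r\n". */
    _setmode(_fileno(stdout), _O_BINARY);

    fputs("line one\nline two\n", stdout);   /* written as bare LFs */
    return 0;
}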
Incidentally, the terminals traditionally hooked to *nix boxes, including the DEC VT100 emulated by XTerm, also need both CR and LF. However, in the *nix world, the conversion from a newline character to a CRLF sequence is handled in the tty device driver, so most programs don't need to know about it, and the t and b modifiers are both ignored. On those platforms, if you need to send and receive characters on a tty without that modification, you need to look up stty(1) or the system calls it depends on.
If your otherwise ANSI C program is avoiding C library I/O to the console (perhaps because you need access to the console's character color and other attributes) then whether you need to send CR or not will depend on which Win32 API calls you are using to send the characters.
If you're in a *nix environment, \n (line feed) is probably OK. If you're in Windows and aren't redirecting (for now), a line feed is also OK, but if someone at some point redirects the output, :-(
If you're doing Windows though, there could be issues if the output is redirected to a text file and then another process tries to consume the data.
The console knows what to show, but consumers might not be happy...
If you are using C#, you might try the Environment.NewLine "constant".
http://msdn.microsoft.com/en-us/library/system.environment.newline.aspx
If you're really in vanilla C, you're stuck with \r\n. :-)
It depends on what you're using them for. Some programs will not display newlines properly if you don't put both \r and \n.
If you try to only write \n some programs that consume your text file (or output) may display your text as a single line instead of multiple lines.
There are also some file formats and protocols that will completely be invalid without using both \r and \n.
I haven't tried it in so long that I'm not sure I remember what happens... but doesn't a linefeed by itself move down a line without returning to the left column?
Depending on your compiler, the standard output might be opened in text mode, in which case a single linefeed will be translated to \r\n before being written out.
Edit: I just tried a quick test, and in XP a file without returns displays normally. I still don't know if any compilers insert the returns for you.
In C, files (called "streams") come in two flavors - binary or text.
The meaning of this distinction is left implementation/platform dependent, but on Windows (with common implementations that I've seen) when writing to text streams '\n' is automatically translated to "\r\n", and when reading from text streams "\r\n" is automatically translated to '\n'.
The "console" is actually "standard output", which is a stream opened by default as a text stream. So, in practice on Windows, writing "Hello, world!\n" should be quite sufficient - and portable.