I'm practicing file management in C. I saw that there are several ways to open a file with fopen, using mode strings like "a", "r", etc. That's all fine, but I also read that if I add a "b" to that string, the file is opened as a binary file. What does that mean? What are the differences from a normal file?
Opening a file in text mode causes the C library to do some handling specific to text. For example, line endings differ between Windows and Unix/Linux, but you can simply write '\n' because C handles that difference for you.
Opening a file in binary mode doesn't do any of this special handling; it just treats the file as raw bytes. There's a longer explanation of this in the C FAQ.
Note that this only matters on Windows; Unix/Linux systems don't (need to) differentiate between text and binary modes, though you can include the 'b' flag without them complaining.
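A minimal sketch of the difference in the call itself (file names and contents here are made up): the only change is the mode string, but on Windows the text-mode stream translates '\n' while the binary-mode stream does not.

    #include <stdio.h>

    int main(void)
    {
        /* Text mode: on Windows, each '\n' written here becomes "\r\n" on disk. */
        FILE *txt = fopen("notes.txt", "w");
        if (txt) {
            fputs("first line\nsecond line\n", txt);
            fclose(txt);
        }

        /* Binary mode: bytes are written exactly as given, with no translation. */
        FILE *bin = fopen("data.bin", "wb");
        if (bin) {
            unsigned char bytes[4] = { 0x0D, 0x0A, 0x00, 0xFF };
            fwrite(bytes, 1, sizeof bytes, bin);
            fclose(bin);
        }
        return 0;
    }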
If you open a regular file in binary mode, you'll get all of its data as-is, and whatever you write into it will appear in it unchanged.
OTOH, if you open a regular file in text mode, things like line endings can get special treatment. For example, the sequence of the bytes 13 (CR, '\r') and 10 (LF, '\n') can be collapsed to just the single byte 10 when reading, and a 10 can be expanded into 13 followed by 10 when writing. This treatment is platform-specific (read: compiler/OS-specific).
For text files, this is often unimportant. But if you apply text mode to a non-text file, you risk data corruption.
Also, reading and writing bytes at arbitrary offsets in files opened in text mode isn't supported, because of that special treatment.
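As a rough illustration of that treatment (only Windows shows a difference; on Unix/Linux both files come out the same size, and the file names here are arbitrary):

    #include <stdio.h>

    /* Write ten '\n' bytes in the given mode, then measure the file in binary
     * mode. On Windows the text-mode file is larger because each '\n' was
     * expanded to "\r\n"; on Unix/Linux both files are 10 bytes. */
    static long write_and_measure(const char *name, const char *mode)
    {
        FILE *fp = fopen(name, mode);
        if (!fp)
            return -1;
        for (int i = 0; i < 10; i++)
            fputc('\n', fp);
        fclose(fp);

        fp = fopen(name, "rb");
        if (!fp)
            return -1;
        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);
        fclose(fp);
        return size;
    }

    int main(void)
    {
        printf("text mode:   %ld bytes\n", write_and_measure("t.txt", "w"));
        printf("binary mode: %ld bytes\n", write_and_measure("b.bin", "wb"));
        return 0;
    }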
The difference is explained here
A binary file is a series of 1s and 0s. This is called machine language, because microprocessors interpret it directly (a signal for 1, no signal for 0). It is much more compact, but not readable by humans.
For this reason, text files are strings of binary values designated to be displayed as more people-friendly characters, which lend themselves to language much better than raw binary. ASCII is one example of such a designation. This reveals the truth of the matter: all files are binary at the lowest level.
But binary lends itself to any application which does not have to be textually legible to us lowly humans =] Example applications where binary is preferred are sound files, images, and compiled programs. The reason binary is preferred over text is that it is more efficient to describe an image in machine language than textually (which has to be translated to machine language anyway).
There are two types of files: text files and binary files.
Binary files have two features that distinguish them from text files: you can jump straight to any fixed-size record in the file, which gives random access as in an array, and you can change the contents of a record anywhere in the file at any time. Binary files also usually have faster read and write times than text files, because a binary image of the record is copied directly between memory and disk. In a text file, everything has to be converted back and forth to text, and that takes time.
more info here
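A small sketch of the record-style random access described above, assuming a file of fixed-size structs (the struct layout and file name are made up for illustration):

    #include <stdio.h>

    struct record {
        int  id;
        char name[28];
    };

    /* Overwrite the n-th record in place. This only works predictably in
     * binary mode, where every record occupies exactly sizeof(struct record)
     * bytes on disk. */
    int update_record(const char *path, long n, const struct record *rec)
    {
        FILE *fp = fopen(path, "r+b");   /* read/update, binary */
        if (!fp)
            return -1;
        if (fseek(fp, n * (long)sizeof *rec, SEEK_SET) != 0 ||
            fwrite(rec, sizeof *rec, 1, fp) != 1) {
            fclose(fp);
            return -1;
        }
        fclose(fp);
        return 0;
    }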
b is for working with binary files. However, this has no effect on POSIX compliant operating systems.
from the manpage of fopen:
The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux. (Other systems may treat text files and binary files differently, and adding the 'b' may be a good idea if you do I/O to a binary file and expect that your program may be ported to non-UNIX environments.)
Related
I am a little confused about when to open a file in text mode versus binary mode. I read some documentation and examples, and observed that getc()/putc() and fgets()/fputs() were used in text mode as well as in binary mode. Can I open a file in text mode and use fread()/fwrite(), or should I use only binary mode for binary I/O functions like fread()/fwrite()?
To use fseek() and ftell(), which mode should I use: text mode or binary mode?
I am using the C programming language and a Linux distro (Fedora).
On Unix systems (and Linux in particular), there's no difference between binary mode and text mode (the library simply ignores the 'b' qualifier), but other systems do differ. On Windows, a line end is indicated by the sequence \r\n, which is converted into \n on input, while on output \n is converted into the sequence \r\n for text files. Binary files are not converted at all, so no transformation is done in either direction. Keep in mind that this transformation is not reversible: reading the data back, you cannot tell whether a given sequence was produced by the conversion or was already in that form in the original data.
Text mode means that your file is text, and it will be transformed to comply with the operating system's line-ending convention. If it actually is text, this is normally not a problem, but if you do this with a genuinely binary file (e.g. a compressed file or a .jpg image) the results will be unpredictable.
You can use fread and fwrite in both text and binary mode, although they are more commonly used for binary. You can also use fseek and ftell in both modes ("wb", write binary, and "w", write text).
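A minimal sketch of fread/fseek/ftell on a binary-mode stream (the file name is arbitrary); in binary mode the value returned by ftell is a plain byte offset:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("data.bin", "rb");
        if (!fp)
            return 1;

        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);            /* total size in bytes */
        rewind(fp);

        unsigned char *buf = malloc(size);
        if (buf && fread(buf, 1, size, fp) == (size_t)size)
            printf("read %ld bytes\n", size);

        free(buf);
        fclose(fp);
        return 0;
    }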
With the C standard library stdio.h, I read that to output ASCII/text data, one should use mode "w" and to output binary data, one should use "wb". But why the difference?
In either case, I'm just outputting a byte (char) array, right? And if I output a non-ASCII byte in ASCII mode, the program still outputs the correct byte.
Some operating systems (mostly named "Windows") don't guarantee that they will read and write ASCII to files exactly the way you pass it in. On Windows, \r\n is mapped to \n on input (and \n to \r\n on output). This is fine and transparent when reading and writing ASCII text, but it would trash a stream of binary data. Basically, always give Windows the 'b' flag if you want it to faithfully read and write data to files exactly the way you passed it in.
There are certain transformations that can take place when outputting in ASCII (e.g. outputting carriage-return + newline when the output character is a newline), depending on your platform. Such transformations will not take place when using binary mode.
I am learning file I/O in C and was a little confused about binary files. My question is: what is the use of binary files when we can always use ASCII or some other format that is easily understandable? Also, in what applications are binary files more useful?
Any help on this is really appreciated. Thanks!
All files are binary in nature. ASCII files are the subset of binary files that contain what can be considered 'human-readable' data. A pure binary file is not constrained to that readable subset of characters. Reasons to prefer binary include:
Speed of access
Obfuscation
The ability to write native objects to file without creating big serialised files.
ASCII is easily understandable by humans, but for many other purposes it's more efficient and easier for the computer to store things in a binary format. For example, if you want to keep a sequence of integers, it's easier for the computer to read/write the 4 bytes it takes to represent an int than to write out the ASCII representation of the number and then parse it back while reading.
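A small sketch of that difference (the file names are arbitrary, and this assumes the usual 4-byte int):

    #include <stdio.h>

    int main(void)
    {
        int values[3] = { 7, 4294967, -12 };

        /* Binary: exactly sizeof values bytes, read back later with one fread. */
        FILE *bin = fopen("ints.bin", "wb");
        if (bin) {
            fwrite(values, sizeof values[0], 3, bin);
            fclose(bin);
        }

        /* ASCII: variable-length text that has to be parsed again (e.g. with
         * fscanf) when reading it back. */
        FILE *txt = fopen("ints.txt", "w");
        if (txt) {
            for (int i = 0; i < 3; i++)
                fprintf(txt, "%d\n", values[i]);
            fclose(txt);
        }
        return 0;
    }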
It is critically important that any byte value can be stored. Programs, for example, are binary: any possible byte value may be a CPU instruction.
ASCII only uses 7-bit values, so half of the possible byte values are wasted.
Further, how would an integer be stored?
The number 4294967295 can be stored in 4 bytes (32 bits), but stored in ASCII as a number it would require 10 characters. It would also require processing to convert it back into a 32-bit number. Neither of those things is good.
The 32-bit number is a fixed size, so it is easy to get to the 234856th value in the file: just seek to position 4 * 234856.
If 32-bit numbers are stored as ASCII, either they must always take 10 bytes, making the file 2.5 times bigger, or they are stored at a variable size, making it virtually impossible to seek to a particular value without reading the whole file.
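A minimal sketch of that fixed-size seek, assuming the file holds raw 32-bit values written with fwrite (the file name and index are made up):

    #include <stdio.h>
    #include <stdint.h>

    /* Read the n-th 32-bit value from a file of raw uint32_t records. */
    int read_nth_u32(const char *path, long n, uint32_t *out)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return -1;
        if (fseek(fp, n * (long)sizeof *out, SEEK_SET) != 0 ||
            fread(out, sizeof *out, 1, fp) != 1) {
            fclose(fp);
            return -1;
        }
        fclose(fp);
        return 0;
    }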
Edit:
It is worth adding that (in normal use) a human cannot see the data held in a file directly. The only way to examine the contents of a file is by running programs which can read and present the data. So the convenience of a human is a small consideration.
In general, data is stored in the form most convenient for programs to use, and that form is designed to fit the program's purpose. ASCII is a format designed for text-editing programs to create human-readable documents and to support simple ways to display the text, limited to English letters, numbers, and some punctuation. When we want to support all human written languages, ASCII is far too limited.
I believe we have over one million characters to represent human written languages (and some other pictograms), and we still do not have characters for all human languages.
UTF-8 is a way to represent the written characters we have so far as sequences of one or more bytes. UTF-8 uses full 8-bit bytes, which go beyond the 7-bit range of ASCII.
Think of a binary file as a true representation of data, to be interpreted directly by a computer program rather than read by humans. It would be a lot of overhead for a program to write its data out in an ASCII format, whether it is text or numeric. Most likely the programmer would have to invent a protocol for writing arrays, structs, and scalars out to a file in ASCII form so they could be human-readable and also be read back in by the program and converted back to binary form.
A database table is a good example. Whether or not there are text or numeric fields in the table, the database manager reads and writes that data in binary format. It is easier to write out, read in, and then convert as needed to display any data you can read.
Perception gave a great answer I had never considered before. All data is binary, and ASCII is a subset. That answer made me think of FTP and setting the transfer mode to ASCII or binary. If I'm shuttling Windows binaries stored on a Linux system, I tell FTP to transfer them as binary. That means: don't interpret this as an ASCII file and add \r at the end of each line. There are even times I'll transfer .csv and .txt data as binary, because I know Windows Excel knows how to interpret those non-DOS files.
I would not want to write a program that had to encode/decode images, or audio files, or GIS data, or spacecraft telemetry, or <fill in the blank> as ASCII.
I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like a BOM [byte order mark])
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for a go-ahead if the file does not match the internal whitelist. The risk of mangling a binary file is mitigated if your search string is long, assuming you're not performing a Y2K-style conversion (a la sed 's/y/k/g').
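A rough sketch of such a whitelist check (the extension list is just an example, and a case-insensitive compare would be friendlier on Windows):

    #include <string.h>

    /* Return 1 if the path ends in one of a few known text extensions. */
    int has_text_extension(const char *path)
    {
        static const char *whitelist[] = { ".txt", ".csv", ".log", ".ini", ".md" };
        const char *dot = strrchr(path, '.');
        if (!dot)
            return 0;
        for (size_t i = 0; i < sizeof whitelist / sizeof whitelist[0]; i++)
            if (strcmp(dot, whitelist[i]) == 0)
                return 1;
        return 0;
    }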
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check whether it contains only alphanumeric (and other printable) characters.
So basically the first thing you have to do is check the file encoding. If it's pure ASCII, you have an easy task: just read the whole file into a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with functions like isalpha and isdigit... of course you have to take care of special exceptions like tabs '\t', spaces ' ', and newlines ('\n' on Linux, '\r''\n' on Windows).
In the case of a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric... also note that in the case of UTF-16 or greater, a simple char array is simply too small... but if you are doing it in, for example, C#, you don't have to worry about the size :)
You can write a function that tries to determine whether a file is text-based. While this will not be 100% accurate, it may be good enough for you. Such a function does not need to go through the whole file; about a kilobyte should be enough (or even less). One thing to do is to count how many whitespace characters and newlines there are. Another is to look at individual bytes and check whether they are printable or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach, and text encodings might complicate things.
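A rough sketch of such a heuristic, sampling only the first kilobyte; both the sample size and the 95% cutoff are arbitrary tuning values, and multi-byte encodings will need more care:

    #include <stdio.h>
    #include <ctype.h>

    /* Returns 1 if the first KB looks like text: no NUL bytes, and mostly
     * printable characters or common whitespace. This is only a heuristic. */
    int looks_like_text(const char *path)
    {
        unsigned char buf[1024];
        FILE *fp = fopen(path, "rb");    /* binary mode: examine the raw bytes */
        if (!fp)
            return 0;
        size_t n = fread(buf, 1, sizeof buf, fp);
        fclose(fp);
        if (n == 0)
            return 1;                    /* treat an empty file as text */

        size_t plausible = 0;
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == '\0')
                return 0;                /* NUL almost never appears in text */
            if (isprint(buf[i]) || buf[i] == '\n' || buf[i] == '\r' || buf[i] == '\t')
                plausible++;
        }
        return plausible * 100 >= n * 95;    /* at least 95% plausible text bytes */
    }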