I am learning file I/O in C and am a little confused about binary files. My question is: what is the use of binary files when we can always use files in ASCII or some other format that is easily understandable? Also, in what applications are binary files more useful?
Any help on this is really appreciated. Thanks!
All files are binary in nature. ASCII files are the subset of binary files that contain what can be considered 'human-readable' data. A pure binary file is not constrained to that subset of readable characters. Reasons to use one include:
Speed of access
Obfuscation
The ability to write native objects to file without creating big serialised files.
ASCII is easily understandable by humans, but for many other purposes, it's more efficient and easier for the computer to store things in a binary format. For example, if you want to keep a sequence of integers, it's easier for the computer to read/write the 4 bytes it takes to represent an int than it is to write out the ASCII representation of the number and then parse it while reading.
It is critically important that any byte value can be stored. Programs, for example, are binary: any possible byte value may be part of an instruction for the CPU.
ASCII only stores 7-bit values, so half of the possible byte values are wasted.
Further, what would an integer be stored as?
The number 4294967295 can be stored in 4 bytes (32 bits), but if it were stored in ASCII, as a number, it would require 10 characters. Further, it would require processing to convert it into the 32-bit number. Neither of those things is good.
The 32-bit number is a fixed size, so it is easy to get to the 234856th value in the file: just seek to position 4*234856.
If 32-bit numbers are stored as ASCII, either they must always take 10 bytes, making the file 2.5 times bigger, or they are stored with variable size, making it virtually impossible to seek to a particular value without reading the whole file.
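The seek arithmetic above can be sketched in C roughly as follows. This is a minimal illustration, not a definitive implementation: the function names `write_u32_file` and `read_u32_at` are made up for this example, the index is zero-based, and the file is assumed to be read on the same kind of machine that wrote it (so byte order is not handled).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write `count` 32-bit values to `path` as raw 4-byte records.
   Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int write_u32_file(const char *path, const uint32_t *values, size_t count)
{
    assert(path != NULL && values != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(values, sizeof values[0], count, f);
    int close_failed = (fclose(f) != 0);
    return (written == count && !close_failed) ? 0 : -1;
}

/* Read the value at zero-based `index` by seeking straight to byte
   offset 4 * index, without reading anything that comes before it.
   Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int read_u32_at(const char *path, long index, uint32_t *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    int ok = fseek(f, index * (long)sizeof *out, SEEK_SET) == 0
          && fread(out, sizeof *out, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

With variable-length ASCII numbers there is no equivalent of the `fseek` call, because the byte offset of the Nth value depends on the lengths of all the values before it.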
Edit:
It is worth adding that (in normal use) a human cannot see the data held in a file. The only way to examine the contents of files is by running programs which can read and use the data. So the convenience of a human is a small consideration.
In general, data is stored in the form most convenient for a program's use, and the form is designed to fit the program's purpose. ASCII is a format designed for text editing programs to create human-readable documents and to support simple ways of displaying the text, which is limited to English letters, numbers and some punctuation. When we want to support all human written language, ASCII is far too limited.
I believe Unicode has room for over one million code points to represent human written languages (and some other pictures), and we have not yet got characters for all human languages.
UTF-8 is a way to represent the written characters we have so far as one or more bytes each. UTF-8 uses all 8 bits of each byte, so it goes beyond the 7-bit range of ASCII.
Think of a binary file as a true representation of data to be interpreted directly by a computer program and not to be read by humans. It would be a lot of overhead for a program to write out all of its data, whether text or numeric, in an ASCII format. Most likely, the programmer would have to invent a protocol for writing arrays, structs, and scalars out into a file in ASCII form, so they could be human-readable and also be read back in by the program and converted back to binary form.
A database table is a good example. Whether or not there are text or numeric fields in the table, the database manager reads and writes that data in binary format. It is easier to write out, read in, and then convert as needed to display any data you can read.
Perception gave a great answer I had never considered before: all data is binary, and ASCII is a subset. That answer made me think of FTP and setting the mode to ASCII or binary. If I'm shuttling Windows binaries that are stored on a Linux system, I tell FTP to transfer them as binary. That means: don't interpret the file as ASCII and add a carriage return (\r) at the end of each line. There are even times I'll transfer .csv and .txt data as binary, because I know Excel on Windows knows how to interpret those non-DOS files.
I would not want to write a program that had to encode/decode images, or audio files, or GIS data, or spacecraft telemetry, or <fill in the blank> as ASCII.
I was thinking about the fact that everything in a computer is stored as a sequence of 1s and 0s. So the same should be true for any files and software stored on hard drives. But is it possible to see the sequence of 1s and 0s for a specific file? For example, suppose that a folder contains files named "myfile.docx", "myfile.iso", "myfile.dll", "myfile.rar", and so on. How can I see what sequence of 1s and 0s each of these files is made of?
Thanks in advance.
There are a number of hex editors that can display the data in binary. I haven't used any recently so I don't have any specific recommendations.
You can open the file in a hex editor, which might display something similar to this:
What you're looking at is a hexadecimal representation of the bytes in the file, which is the binary data. Some editors may have an option to display it as binary 0's and 1's. Of course there would be a lot more to look at in that case. Failing that, you can infer the binary data from the hexadecimal representations. Or even write a program where you paste in the hex values and it calculates the binary representation for you.
You could also write a program in a variety of languages which would automate this for you. For example, if you read a file as a byte[] in something like C# then you can loop through it, byte by byte, outputting "binary" to the page for those bytes. (Some math on a helper method can convert a numeric value, which is all a byte is, to a series of 0's and 1's to print to the screen.) I imagine most, if not all, reasonable programming languages will have similar operations available.
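As a rough sketch of such a program in C (rather than C#), the helper below converts each byte to a string of 0's and 1's and prints it next to its offset and hex value. The names `byte_to_bits` and `dump_bits` are invented for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Render one byte as eight '0'/'1' characters, most significant bit first.
   (Hypothetical helper.) */
void byte_to_bits(unsigned char byte, char out[9])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++)
        out[i] = (byte & (0x80u >> i)) ? '1' : '0';
    out[8] = '\0';
}

/* Print every byte of `path` to `out` as: offset, hex value, bit pattern.
   Returns the number of bytes dumped, or -1 if the file cannot be opened. */
long dump_bits(const char *path, FILE *out)
{
    assert(path != NULL && out != NULL);
    FILE *in = fopen(path, "rb");   /* "rb": no newline translation */
    if (in == NULL)
        return -1;
    char bits[9];
    long offset = 0;
    int c;
    while ((c = fgetc(in)) != EOF) {
        byte_to_bits((unsigned char)c, bits);
        fprintf(out, "%08lx  %02x  %s\n", (unsigned long)offset, (unsigned)c, bits);
        offset++;
    }
    fclose(in);
    return offset;
}
```

Calling `dump_bits("myfile.docx", stdout)` would print one line per byte, which works the same way for any file type since the program never interprets the contents.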
I'm quite new to C and I would like some help.
Let's say I need to store only 6-digit numbers in a file. (Let's assume the size of int is 4 bytes.)
Which would be more efficient (in terms of memory): a text file or a binary file? I am not really sure how to approach this problem, so any help is welcome.
Most people classify files in two categories: binary files and ASCII (text) files. You've actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an ASCII file.
An ASCII file is defined as a file that consists of ASCII characters. It's usually created by using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for writing code, but they may not always save it as ASCII. ASCII is an international standard.
Computer science is all about creating good abstractions. Sometimes it succeeds and sometimes it doesn't. Good abstractions are all about presenting a view of the world that the user can use. One of the most successful abstractions is the text editor.
When you're writing a program, and typing in comments, it's hard to imagine that this information is not being stored as characters. ASCII/text files are really stored as 0's and 1's.
Files are stored on disks, and disks have some way to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction. Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think of them that way.
In effect, ASCII files are basically binary files, because they store binary numbers. That is, ASCII files store 0's and 1's.
The Difference between ASCII and Binary Files?
An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.
However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.
Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.
A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.
Example of ASCII files
Suppose you're editing a text file with a text editor. Because you're using a text editor, you're pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters 'c', then 'a', then 't'. Then, you save the file and quit.
What happens? For the time being, we won't worry about the mechanism of what it means to open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.
If you look up an ASCII table, you will discover that the ASCII codes for 'c', 'a', and 't' are 0x63, 0x61, and 0x74 (the 0x merely indicates the values are in hexadecimal, instead of decimal/base 10).
Here's how it looks:
ASCII    'c'        'a'        't'
Hex      63         61         74
Binary   0110 0011  0110 0001  0111 0100
Each time you type in an ASCII character and save it, an entire byte is written which corresponds to that character. This includes punctuation, spaces, and so forth.
Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
Now sometimes a text editor throws in characters you may not expect. For example, some editors "insist" that each line end with a newline character.
The only place a file can be missing a newline at the end of the line is the very last line. Some editors allow the very last line to end in something besides a newline character. Some editors add a newline at the end of every file.
Unfortunately, even the newline character is not that universally standard. It's common to use newline characters on UNIX files, but in Windows, it's common to use two characters to end each line (carriage return, newline, which is \r and \n, I believe). Why two characters when only one is necessary?
This dates back to printers. In the old days, the time it took for a printer to return back to the beginning of a line was equal to the time it took to type two characters. So, two characters were placed in the file to give the printer time to move the printer ball back to the beginning of the line.
This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case you've wondered why transferring files to UNIX from Windows sometimes generates funny characters.
Editing Binary Files
Now that you know that each character typed in an ASCII file corresponds to one byte in a file, you might understand why it's difficult to edit a binary file.
If you want to edit a binary file, you really would like to edit individual bits. For example, suppose you want to write the binary pattern 1100 0011. How would you do this?
You might be naive, and type in the following in a file:
11000011
But you should know, by now, that this is not editing individual bits of a file. If you type in '1' and '0', you are really entering 0x31 and 0x30 (49 and 48 in decimal). That is, you're entering 0011 0001 and 0011 0000 into the file. You're actually (indirectly) typing 8 bits at a time.
There are some programs that allow you to type in 49, and they translate this to a single byte, 0100 1001, instead of the ASCII codes for '4' and '9'. You can call these programs hex editors. Unfortunately, these may not be so readily available. It's not too hard to write a program that reads in an ASCII file that looks like hex pairs and converts it to a true binary file with the corresponding bit patterns.
That is, it takes a file that looks like:
63 a0 de
and converts this ASCII file to a binary file that begins 0110 0011 (which is 63 in binary). Notice that this file is ASCII, which means what's really stored is the ASCII code for '6', '3', ' ' (space), 'a', '0', and so forth. A program can read this ASCII file then generate the appropriate binary code and write that to a file.
Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the output binary file would contain 3 bytes, one byte per hex pair.
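A converter along those lines might look like the following sketch. It works on an in-memory string to keep the example short; `hex_text_to_bytes` and `hex_value` are names chosen for this illustration, and the digit arithmetic assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit; the caller has already checked isxdigit().
   Assumes 'a'..'f' are contiguous, as they are in ASCII. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Convert text such as "63 a0 de" into raw bytes.
   Returns the number of bytes stored, or -1 if the input is malformed
   or `out` is too small. (Hypothetical helper.) */
long hex_text_to_bytes(const char *text, unsigned char *out, size_t capacity)
{
    assert(text != NULL && out != NULL);
    size_t n = 0;
    const unsigned char *p = (const unsigned char *)text;
    while (*p != '\0') {
        if (isspace(*p)) {          /* skip separators between pairs */
            p++;
            continue;
        }
        if (!isxdigit(p[0]) || !isxdigit(p[1]) || n == capacity)
            return -1;
        out[n++] = (unsigned char)(hex_value(p[0]) * 16 + hex_value(p[1]));
        p += 2;
    }
    return (long)n;
}
```

The resulting buffer can then be written with `fwrite` to a file opened in "wb" mode, turning the 8-byte ASCII input into a 3-byte binary file.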
Writing Binary Files
Why do people use binary files anyway? One reason is compactness. For example, suppose you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters (which is 6 bytes). However, if you represent it as unsigned binary, you can write it out using 4 bytes.
ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space. You can represent information more compactly by using binary files.
For example, one thing you can do is to save an object to a file. This is a kind of serialization. To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object and the number of bytes used to represent the object (use the sizeof operator to determine this) to the write() method. The method then dumps out the bytes as they appear in memory into a file.
You can then recover the information from the file and place it into the object by using a corresponding read() method which typically takes a pointer to an object (and it should point to an object that has memory allocated, whether it be statically or dynamically allocated) and the number of bytes for the object, and copies the bytes from the file into the object.
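In C, the write() and read() methods described here correspond most directly to the standard library's `fwrite` and `fread`. The sketch below assumes a made-up `struct point` that contains no pointers; `save_point` and `load_point` are likewise names invented for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type. It holds no pointers, so its bytes are
   meaningful on their own. */
struct point {
    int x;
    int y;
    double weight;
};

/* Dump the object's bytes, exactly as laid out in memory, to `path`.
   Returns 0 on success, -1 on failure. */
int save_point(const char *path, const struct point *p)
{
    assert(path != NULL && p != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t n = fwrite(p, sizeof *p, 1, f);
    int close_failed = (fclose(f) != 0);
    return (n == 1 && !close_failed) ? 0 : -1;
}

/* Copy sizeof(struct point) bytes from the file back into *p.
   The caller must supply an object that already has storage. */
int load_point(const char *path, struct point *p)
{
    assert(path != NULL && p != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(p, sizeof *p, 1, f);
    fclose(f);
    return (n == 1) ? 0 : -1;
}
```

The file produced this way includes any padding bytes the compiler placed inside the struct, which is one reason it may not be readable by a program built with a different compiler or for a different machine.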
Of course, you must be careful. If you use two different compilers, or transfer the file from one kind of machine to another, this process may not work. In particular, the object may be laid out differently. This can be as simple as endianness, or there may be issues with padding.
This way of saving objects to a file is nice and simple, but it may not be all that portable. Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will write out the addresses to the file. Those addresses are likely to be totally meaningless. Addresses may make sense at the time a program is running, but if you quit and restart, those addresses may change.
This is why some people invent their own format for storing objects: to increase portability.
But if you know you aren't storing objects that contain pointers, and you are reading the file in on the same kind of computer system you wrote it on, and you're using the same compiler, it should work.
This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire objects. They tend to be somewhat more portable.
An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to 0. Think of an ASCII file as a special kind of binary file.
A generic binary file uses all 8 bits. Each byte of a binary file can have any of the full 256 bitstring patterns (as opposed to an ASCII file, which only has 128 bitstring patterns).
There may be a time where Unicode text files become more prevalent. But for now, ASCII files are the standard format for text files.
A binary file is basically any file that is not "line-oriented". Any file where besides the actual written characters and newlines there are other symbols as well.
When you write a file in text mode on Windows, any newline \n is translated to a carriage return + line feed \r\n; in binary mode the bytes are written unchanged.
There isn't any memory efficiency that can be achieved by using a binary file as opposed to a text file, since files are stored on disk and not in memory. It all depends on what you want to do with the file and how you wish to format it.
Since you are working with pure integers (regardless of what the int size is), using a text or binary file will have the same impact on performance (meaning that it won't make any difference which type you choose to work with).
If you want to later modify or read the file in a text editor, it is best to use the text mode to write the file.
I have an EBCDIC flat file from a mainframe to be processed in a C module. What would be a good process for converting the COMP and COMP-3 values into readable values? Do I have to convert the EBCDIC characters to ASCII and then to hex for COMP-3? What about COMP? Thanks
Bill Woodger has given you some very good advice through his comments to your question. Actually, he answered the question and should have
posted his comments as an answer.
I would like to reiterate a few of his points and expand on a few others.
If you need to convert a file created from what is probably a COBOL application so it may be read
by some other non-COBOL program, possibly on a machine with an architecture unlike the one where it was created, then
you should demand that the file be created using only display formatted data (i.e. all character data). Mashing non-display
(binary, packed, encoded) data outside of the operating environment where it was created is just a formula for
long term pain. You will be subjected to the joys of sorting out various endianness issues
between architectures and code page conversions. These are the things that
file transfer protocols are designed to manage - they do it well so don't try to reinvent them. Short answer, use FTP or
similar file transport mechanism to move data between machines. And only transport display (character) based data.
Packed Decimal (COMP-3) data types occupy a varying number of bytes depending on their specific PICTURE layout. The position of the decimal point
is implied so cannot be determined without reference to the PICTURE used to define it. Packed Decimal fields may be either signed
or unsigned. If signed, the sign is imbedded in the low 4 bits of the least significant digit. Each byte of a Packed Decimal
data type contains two digits, except possibly the first and last bytes. The first byte contains only 1 digit if the field is signed
and contains an even number of digits. The last byte contains 2 digits if unsigned but only 1 if signed. There are several other subtleties that
you need to be aware of if you want to do your own Packed Decimal to character conversions. At this point I hope you can see
that this is not going to be a trivial exercise.
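To give a feel for what such a conversion involves, here is a minimal C sketch for one common case. It rests on several assumptions that are not confirmed by the file in question: the field always ends in a sign nibble (0xD or 0xB meaning negative, 0xA, 0xC, 0xE, or 0xF meaning positive or unsigned), the field is at most 9 bytes so the result fits in a `long long`, and the implied decimal point is applied by the caller from the PICTURE. The function name `unpack_comp3` is made up. Note that the bytes must be taken raw from the file; running them through an EBCDIC-to-ASCII translation first would corrupt the nibbles.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a packed decimal field into an integer, ignoring the implied
   decimal point. ASSUMPTION: the low nibble of the last byte is a sign
   nibble (0xD/0xB negative, 0xA/0xC/0xE/0xF positive or unsigned).
   Returns 0 on success, -1 on an invalid nibble or unsupported length. */
int unpack_comp3(const unsigned char *field, size_t len, long long *out)
{
    assert(field != NULL && out != NULL);
    if (len == 0 || len > 9)        /* at most 17 digits: no overflow */
        return -1;
    long long value = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned hi = field[i] >> 4;
        unsigned lo = field[i] & 0x0Fu;
        if (hi > 9)
            return -1;
        value = value * 10 + hi;
        if (i + 1 < len) {          /* ordinary byte: two digits */
            if (lo > 9)
                return -1;
            value = value * 10 + lo;
        } else {                    /* last byte: digit + sign nibble */
            if (lo < 0x0A)
                return -1;
            if (lo == 0x0D || lo == 0x0B)
                value = -value;
        }
    }
    *out = value;
    return 0;
}
```

For instance, the three bytes 0x12 0x34 0x5C decode to 12345 under these assumptions; whether that means 12345, 123.45, or 1.2345 still depends on the PICTURE clause.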
Binary (COMP) data types have a different but no less complex set of issues to resolve. Again, not a trivial exercise.
So what should you be doing? Basically, do as Bill suggested. Have the program that generates this file use display formats
for output (meaning you have to do nothing). Or, failing that, use a utility program such as DFSORT/SYNCSORT to do the conversions
for you. Going the utility
route still requires that you have the original COBOL file layout (and that you understand it) in order to do the conversion.
The last resort is simply writing a simple read-a-record-write-a-record COBOL program that takes in the unformatted data, MOVEs
each COMP-whatever field to a corresponding DISPLAY field, and writes it out again.
As Bill said, if the group that produced this file tells you that it is too difficult/expensive to produce a DISPLAY formatted
output file they are lying to you or they are incompetent or just too lazy to
do the job they were hired to do. I can think of no other excuses.
Use XML to transport data.
That is, write a program that converts your file into characters (if on the mainframe, stay with EBCDIC, but with numeric fields unpacked, etc.) and then enclose each record and each field in XML tags.
This avoids formatting issues (what field is in column 1, what field is in column 2, are the delimiters spaces or commas or either, etc., ad nauseam).
Then transmit the XML file with your favorite utility that converts from EBCDIC to ASCII.
As I understand it, when I save a file in C using wb mode, shouldn't I see binary numbers (zeros and ones) in the saved file?
When I save in wb mode the output in the file is:
Feras Wilson — n FFFF îè` c P xHF F
û¥2012
But this is not binary zeros and ones. How do I save a file so that it contains zeros and ones, and then read it back in C?
It is saved as 0 and 1, but your text editor reads them as bytes (it groups them in 8 bits) and displays them using ASCII. [1]
When you write to a text file, a lot of effort goes into converting the binary data you wish to write into a human-readable format.
For example, if you write the number 255, it first has to be converted to the form '2', '5', '5' (which are characters!), and then each of these characters is written.
If you write to a binary file, the actual binary data is put in the file. What that looks like depends on what kind of variable it is (how many octets represent it), on endianness, and on other things. If it is an unsigned char, 0b11111111 is put in the binary file (the actual raw number, not characters!).
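The difference can be seen with a small C sketch that stores the same value, 255, both ways. The function names `write_as_text` and `write_as_binary` are invented for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Write `value` as decimal digit characters, e.g. '2', '5', '5'.
   Returns the number of bytes written, or a negative value on failure. */
int write_as_text(const char *path, unsigned char value)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int n = fprintf(f, "%u", (unsigned)value);
    if (fclose(f) != 0)
        return -1;
    return n;
}

/* Write `value` as the single raw byte it occupies in memory.
   Returns the number of bytes written (1), or -1 on failure. */
int write_as_binary(const char *path, unsigned char value)
{
    assert(path != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t n = fwrite(&value, 1, 1, f);
    if (fclose(f) != 0)
        return -1;
    return (n == 1) ? 1 : -1;
}
```

For 255, the text version produces a 3-byte file holding the bytes 0x32 0x35 0x35, while the binary version produces a 1-byte file holding 0xFF. A text editor shows the first as "255" and the second as whatever glyph it maps 0xFF to, which is why a wb file looks like gibberish rather than zeros and ones.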
[1] http://www.asciitable.com/
This is only the textual representation of the file produced by your editor or command. Internally, all files are stored as 0s and 1s on the HDD/SSD/RAM/... Try opening the file with a hex editor like bless (easy to use on Linux, Mono required for Windows; alternatively, search for another hex editor you want to use) to see how the bytes are stored. I suggest bless because it offers different representations in different formats.
In your code, you can use the read functions to store the content bytewise and interpret it. Just keep a possible endianness fix in mind if you read more than one byte at a time: little-endian and big-endian systems store and read the bytes of a multi-byte value in "reversed" order, so a word 0x1337 could be read as 0x3713. Get familiar with this term and use Wikipedia to understand how to handle it, if necessary.
All files are stored in binary! It's just a question of how a subsequent program views/interprets this binary. Depending on how you use the file, it'll get read as a sequence of bytes representing characters, or a sequence of bytes representing instructions, or words representing Unicode, etc.
If you want to see your file in different formats, use od:
NAME
od - dump files in octal and other formats
which will dump your file in hex, characters, octal etc. (the one thing it won't do is show you in binary, but you can derive that from the octal/hex output easily enough)
I am designing a binary file format from scratch, and I would like to include some magic bytes at the beginning so that it can be identified easily. How do I go about choosing which bytes? I am not aware of any central registry of magic numbers, so is it just a matter of picking something fairly random that isn't already identified by, say, the file command on a nearby UNIX box?
Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string for identifier. Follow that by an EOF char, and as an added bonus people who cat or type your binary file won't get a mangled terminal.
There is no universally correct way. Best practices can be suggested, but these are often situational. For example, if you're checking the integrity of volatile memory, which has an undefined initial state when power is applied, it may be beneficial to incorporate long runs of 0s or 1s in a sequence (e.g. FFF0 00FF F000) which can stand out against random noise.
If the file is mostly binary, a popular choice is using a text encoding like ASCII which stands out among the binary data in a hex editor. For example, GIF uses GIF89a, FLAC uses fLaC. On the other hand, a plain text identifier may be falsely detected in a random text file, so invalid/control characters might be incorporated.
In general, it does not matter that much what they are, even a bunch of NULL bytes can be used for file detection. But ideally you want the longest unique identifier you can afford, and at minimum 4 bytes long. Any identifier under 4 bytes will show up more often in random data. The longer it is, the less likely it will ever be detected as a false positive. Some known examples are as long as 40 bytes. In a way, it's like a password.
Also, it doesn't have to be at offset 0. The file signature has conventionally been at offset zero, since it made sense to store it first if it will be processed first.
That said, a single file signature should not be the only line of defense. The actual parsing process itself should be able to verify integrity and weed out invalid files even if the signature matches. This can be done with additional file signatures, using length-sensitive data, value/range checking, and especially, hash/checksum values.
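As an illustration of the signature check itself, here is a small C sketch. The 8-byte signature is entirely made up for this example (readable text followed by 0x1A, the Ctrl-Z character that DOS-era tools treat as end of file), and `has_magic` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 8-byte signature: text that stands out in a hex editor,
   followed by 0x1A so that typing the file stops before the binary part. */
static const unsigned char MAGIC[8] = {
    'M', 'Y', 'F', 'M', 'T', '0', '1', 0x1A
};

/* Return 1 if `data` (the first `len` bytes of a file) begins with the
   signature, 0 otherwise. Buffers shorter than the signature never match. */
int has_magic(const unsigned char *data, size_t len)
{
    assert(data != NULL);
    return len >= sizeof MAGIC && memcmp(data, MAGIC, sizeof MAGIC) == 0;
}
```

A reader would call this on the first bytes of the file and then go on to validate lengths, ranges, and checksums in the rest of the header, since a matching signature alone does not prove the file is well-formed.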