C binary file versus text file efficiency - c

I'm quite new to C and I would like some help.
Let's say I need to store only 6-digit numbers in a file (let's assume the size of int equals 4).
What would be more efficient in terms of memory: using a text file or a binary file? I am not really sure how to approach this problem; any help will be welcome.

Most people classify files into two categories: binary files and ASCII (text) files. You've actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an ASCII file.
An ASCII file is defined as a file that consists of ASCII characters. It's usually created by using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for writing code, but they may not always save it as ASCII. ASCII is an international standard.
Computer science is all about creating good abstractions. Sometimes it succeeds and sometimes it doesn't. Good abstractions are all about presenting a view of the world that the user can use. One of the most successful abstractions is the text editor.
When you're writing a program, and typing in comments, it's hard to imagine that this information is not being stored as characters. ASCII/text files are really stored as 0's and 1's.
Files are stored on disks, and disks have some way to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction. Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think of them that way.
In effect, ASCII files are basically binary files, because they store binary numbers. That is, ASCII files store 0's and 1's.
The Difference between ASCII and Binary Files?
An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.
However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.
Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.
A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.
Example of ASCII files
Suppose you're editing a text file with a text editor. Because you're using a text editor, you're pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters 'c', then 'a', then 't'. Then, you save the file and quit.
What happens? For the time being, we won't worry about the mechanism of what it means to open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.
If you look up an ASCII table, you will discover that the ASCII codes for these characters are 0x63, 0x61, and 0x74 (the 0x merely indicates the values are in hexadecimal, instead of decimal/base 10).
Here's how it looks:
ASCII   'c'         'a'         't'
Hex     63          61          74
Binary  0110 0011   0110 0001   0111 0100
Each time you type in an ASCII character and save it, an entire byte is written which corresponds to that character. This includes punctuation, spaces, and so forth.
Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
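If you want to check the table above yourself, here is a minimal C sketch (nothing about it is specific to any particular system) that prints the byte value each character of "cat" is stored as:

#include <stdio.h>

int main(void)
{
    const char *word = "cat";
    /* Print each character and the byte value it is stored as. */
    for (const char *p = word; *p != '\0'; p++)
        printf("'%c' = 0x%02x\n", *p, (unsigned char)*p);
    return 0;
}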
Now sometimes a text editor throws in characters you may not expect. For example, some editors "insist" that each line end with a newline character.
The only place a file can be missing a newline at the end of the line is the very last line. Some editors allow the very last line to end in something besides a newline character. Some editors add a newline at the end of every file.
Unfortunately, even the newline character is not universally standard. It's common to use newline characters on UNIX files, but in Windows, it's common to use two characters to end each line (carriage return then newline, that is \r followed by \n). Why two characters when only one is necessary?
This dates back to printers. In the old days, the time it took for a printer to return to the beginning of a line was equal to the time it took to print two characters. So, two characters were placed in the file to give the printer time to move the print head back to the beginning of the line.
This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case you've wondered why transferring files to UNIX from Windows sometimes generates funny characters.
Editing Binary Files
Now that you know that each character typed in an ASCII file corresponds to one byte in a file, you might understand why it's difficult to edit a binary file.
If you want to edit a binary file, you really would like to edit individual bits. For example, suppose you want to write the binary pattern 1100 0011. How would you do this?
You might be naive, and type in the following in a file:
11000011
But you should know, by now, that this is not editing individual bits of a file. If you type in '1' and '0', you are really entering 0x31 and 0x30. That is, you're entering 0011 0001 and 0011 0000 into the file. You're actually (indirectly) typing 8 bits at a time.
There are some programs that allow you to type in 49 and translate it to a single byte, 0100 1001, instead of the ASCII codes for '4' and '9'. You can call these programs hex editors. Unfortunately, these may not be so readily available. It's not too hard to write a program that reads in an ASCII file that looks like hex pairs and converts it to a true binary file with the corresponding bit patterns.
That is, it takes a file that looks like:
63 a0 de
and converts this ASCII file to a binary file that begins 0110 0011 (which is 0x63 in binary). Notice that the input file is ASCII, which means what's really stored is the ASCII code for '6', '3', ' ' (space), 'a', '0', and so forth. A program can read this ASCII file, then generate the appropriate binary code and write that to a file.
Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the output binary file would contain 3 bytes, one byte per hex pair.
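Here is a minimal sketch of such a converter; the file names "pairs.txt" and "raw.bin" are just placeholders, and it assumes the input really is whitespace-separated hex pairs like the example above:

#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("pairs.txt", "r");   /* ASCII input: hex pairs (placeholder name) */
    FILE *out = fopen("raw.bin", "wb");    /* binary output (placeholder name) */
    if (in == NULL || out == NULL)
        return 1;

    unsigned int value;
    /* %2x skips whitespace and reads one hex pair at a time. */
    while (fscanf(in, "%2x", &value) == 1) {
        unsigned char byte = (unsigned char)value;
        fwrite(&byte, 1, 1, out);
    }

    fclose(in);
    fclose(out);
    return 0;
}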
Writing Binary Files
Why do people use binary files anyway? One reason is compactness. For example, suppose you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters (which is 6 bytes). However, if you represent it as unsigned binary, you can write it out using 4 bytes.
ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space. You can represent information more compactly by using binary files.
For example, one thing you can do is to save an object to a file. This is a kind of serialization. To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object and the number of bytes used to represent the object (use the sizeof operator to determine this) to the write() method. The method then dumps out the bytes as they appear in memory into a file.
You can then recover the information from the file and place it into the object by using a corresponding read() method, which typically takes a pointer to an object (one that already has memory allocated, whether statically or dynamically) and the number of bytes for the object, and copies the bytes from the file into the object.
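In C, the write()/read() methods described above correspond to fwrite() and fread(). A minimal sketch of the round trip, using a made-up struct and file name, might look like this:

#include <stdio.h>

/* Example record; its exact layout (padding, endianness) is compiler-dependent. */
struct record {
    int id;
    double score;
};

int main(void)
{
    struct record out = { 42, 3.5 };
    struct record in;

    /* Dump the object's bytes exactly as they appear in memory. */
    FILE *f = fopen("record.bin", "wb");
    if (f == NULL)
        return 1;
    fwrite(&out, sizeof out, 1, f);
    fclose(f);

    /* Read the bytes back into an object that already has storage. */
    f = fopen("record.bin", "rb");
    if (f == NULL)
        return 1;
    if (fread(&in, sizeof in, 1, f) != 1) {
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("id=%d score=%f\n", in.id, in.score);
    return 0;
}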
Of course, you must be careful. If you use two different compilers, or transfer the file from one kind of machine to another, this process may not work. In particular, the object may be laid out differently. This can be as simple as endianness, or there may be issues with padding.
This way of saving objects to a file is nice and simple, but it may not be all that portable. Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will write out the addresses to the file. Those addresses are likely to be totally meaningless. Addresses may make sense at the time a program is running, but if you quit and restart, those addresses may change.
This is why some people invent their own format for storing objects: to increase portability.
But if you know you aren't storing objects that contain pointers, and you are reading the file in on the same kind of computer system you wrote it on, and you're using the same compiler, it should work.
This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire objects. They tend to be somewhat more portable.
An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to 0. Think of an ASCII file as a special kind of binary file.
A generic binary file uses all 8-bits. Each byte of a binary file can have the full 256 bitstring patterns (as opposed to an ASCII file which only has 128 bitstring patterns).
There may come a time when Unicode text files become more prevalent. But for now, ASCII files are the standard format for text files.

A binary file is basically any file that is not "line-oriented": any file where, besides the actual written characters and newlines, there are other symbols as well.
Usually when you write a file in text mode on Windows, any newline \n will be translated to a carriage return + line feed, \r\n.
There isn't any memory efficiency to be gained by using a binary file as opposed to a text file; files are stored on disk, not in memory. It all depends on what you want to do with the file and how you wish to format it.
Since you are working with pure integers (regardless of what the int size is), using a text or binary file will have the same impact on performance (meaning it won't make any difference which type you choose to work with).
If you want to later modify or read the file in a text editor, it is best to use the text mode to write the file.

Related

Is it possible to see what sequence of 1s and 0s a stored file is made of?

I was thinking about the fact that everything in a computer is stored as a sequence of 1s and 0s. So the same thing should be true for any files and software stored on hard drives. But is it possible to see the sequence of 1s and 0s for a specific file? For example, suppose that in a folder there are files named "myfile.docx", "myfile.iso", "myfile.dll", "myfile.rar", ... how can I see what sequence of 1s and 0s each of these files is made of?
Thanks in advance.
There are a number of hex editors that can display the data in binary. I haven't used any recently so I don't have any specific recommendations.
You can open the file in a hex editor, which will display the file's bytes as rows of hexadecimal values.
What you're looking at is a hexadecimal representation of the bytes in the file, which is the binary data. Some editors may have an option to display it as binary 0's and 1's. Of course there would be a lot more to look at in that case. Failing that, you can infer the binary data from the hexadecimal representations. Or even write a program where you paste in the hex values and it calculates the binary representation for you.
You could also write a program in a variety of languages which would automate this for you. For example, if you read a file as a byte[] in something like C# then you can loop through it, byte by byte, outputting "binary" to the page for those bytes. (Some math on a helper method can convert a numeric value, which is all a byte is, to a series of 0's and 1's to print to the screen.) I imagine most, if not all, reasonable programming languages will have similar operations available.
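The paragraph above mentions C#; here is a minimal sketch of the same idea in C (to match the rest of this page): it reads a file byte by byte and prints each byte as eight bits. The file name is a placeholder.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("myfile.bin", "rb");   /* placeholder: any file you want to inspect */
    if (f == NULL)
        return 1;

    int c;
    while ((c = fgetc(f)) != EOF) {
        /* Print the 8 bits of this byte, most significant bit first. */
        for (int bit = 7; bit >= 0; bit--)
            putchar(((c >> bit) & 1) ? '1' : '0');
        putchar(' ');
    }
    putchar('\n');

    fclose(f);
    return 0;
}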

Storing a Hex Dump inside an Array in C

I am trying to create a PFS de-compressor using the C language. For this project, I need to analyse the archive's hex dump, from which I will get the files' offsets, names and sizes.
Do you think it is good practice to store a hex dump in a char array, and will I be able to convert hex to characters?
Thank you,
Andrew Borg
"will I be able to convert Hex to Characters?"
Yes. After your program reads in the literal characters that represent Hex digits -- for example, '5' and 'A', the program can turn that into the hex value 0x5a, which is decimal 90, and then print that value out as a character (which prints as "Z").
Pretty much every other conceivable way of converting characters to hex and vice versa is also pretty easy in C; these related questions cover the usual approaches, and a minimal sketch follows them:
How to correctly convert a Hex String to Byte Array in C?
Printing char buffer in hex array
Convert array of bytes to hexadecimal string in plain old C
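As a minimal sketch of the basic conversion (the '5'/'A' pair is taken from the example above; the helper name hex_value is made up for illustration):

#include <stdio.h>

/* Convert one hex digit character ('0'-'9', 'a'-'f', 'A'-'F') to its numeric value. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1; /* not a hex digit */
}

int main(void)
{
    char high = '5', low = 'A';
    int byte = (hex_value(high) << 4) | hex_value(low);
    printf("0x%02X = %d = '%c'\n", byte, byte, byte);  /* prints: 0x5A = 90 = 'Z' */
    return 0;
}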
I am not clear on what you are asking in the rest of your question.
Are you asking:
"I have this binary PFS file; can I write a C program that, when the program is run, reads that file into RAM, and if so, is a char array (char[]) a good container?"
Yes. When C programs read raw binary file data into memory, they typically store that data into a char array, and they keep track of the length separately (because, unlike char arrays that represent C strings, your binary data is likely to have the occasional zero byte).
You may want to look at the standard memory library functions (memcmp(), memchr(), etc. )
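A minimal sketch of reading a whole binary file into a char array while keeping its length separately (the file name "archive.pfs" is just a placeholder):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("archive.pfs", "rb");   /* placeholder name */
    if (f == NULL)
        return 1;

    /* Find the file size, then read the whole thing into a char array. */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    char *data = malloc(size);
    if (data == NULL || fread(data, 1, size, f) != (size_t)size) {
        free(data);
        fclose(f);
        return 1;
    }
    fclose(f);

    /* 'data' and 'size' together describe the buffer; binary data may contain
       zero bytes, so strlen() would be the wrong way to measure it. */
    printf("read %ld bytes\n", size);

    free(data);
    return 0;
}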
"I have an ASCII text file that is a hex dump of a PFS file; can I write a C program that, when the program is run, reads that ASCII text file into RAM, and if so, is a char array (char[]) a good container?"
Yes. When C programs read text files into memory, they typically store that data in a char array. You don't need to keep track of the length separately, because ASCII text files never include a zero byte, so the normal end-of-string zero byte works normally.
You may want to look at the standard string library functions ( strchr(), strstr(), etc.).
"I have this PFS file; can I write a C program that, before the program is ever run, the complete contents of that file are embedded into the executable code of my program?"
In principle, yes, you could convert a raw binary file into a text file full of hex digits that you can include in your source code.
This is rarely done, and is a bigger hassle than simply loading the file in at run time,
but occasionally people do it when they want a single self-contained executable (see the sketch after these links):
How to embed a file into an executable?
https://gist.github.com/valenok/4714740
http://www.daniweb.com/software-development/c/threads/352381/how-to-define-hex-character-array-in-c
http://www.massmind.org/techref/datafile/charset/extractor/charset_extractor.htm
How to embed a file into an executable file?
Embedding resources in executable using GCC
http://www.linuxjournal.com/content/embedding-file-executable-aka-hello-world-version-5967
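The result of that approach usually looks something like the following sketch: a tool such as xxd -i turns the raw file into a C array that you compile into the program. The bytes shown here are made up for illustration, not a real PFS header.

#include <stdio.h>

/* Generated (e.g. by xxd -i) from the raw file; example bytes only. */
static const unsigned char pfs_data[] = {
    0x50, 0x46, 0x53, 0x00, 0x01, 0x02, 0x03, 0x04
};
static const unsigned int pfs_data_len = sizeof pfs_data;

int main(void)
{
    printf("embedded %u bytes, first byte 0x%02x\n", pfs_data_len, pfs_data[0]);
    return 0;
}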
p.s.:
Is your PFS file from an Amiga or a Playstation or something else?

Save as binary file in C but it doesn't display zeros and ones

As I understand it, by saving a file in C using wb mode, shouldn't I see binary numbers (zeros and ones) in the saved file?
When I save in wb mode the output in the file is:
Feras Wilson — n FFFF îè` c P xHF F
û¥2012
But this is not binary zeros and ones. How do I save a file to contain zeros and ones and then be able to read it in C?
It is saved as 0s and 1s, but your text editor reads them as bytes (it groups them into 8-bit chunks) and displays them using ASCII. [1]
When you write to a text file, a lot of work is done to interpret the binary data that you wish to write so it is put in a human-readable format.
For example, if you write the number 255, it has to be brought to the form '2', '5', '5' (which are characters!) and then each character is written.
If it writes to a binary file, it will just put the actual binary data in the file. What ends up on disk depends on what kind of variable it is (how many bytes it is represented on), on endianness, and on other things. If it is an unsigned char, it will put 0b11111111 in the binary file (which is the actual raw number, not characters!).
[1] http://www.asciitable.com/
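A minimal sketch of the difference for the 255 example above (the file names are placeholders):

#include <stdio.h>

int main(void)
{
    unsigned char value = 255;

    /* Text: the number is converted to the characters '2', '5', '5'. */
    FILE *t = fopen("value.txt", "w");      /* placeholder name */
    if (t == NULL)
        return 1;
    fprintf(t, "%d", value);                /* 3 bytes on disk: 0x32 0x35 0x35 */
    fclose(t);

    /* Binary: the raw byte is written as-is. */
    FILE *b = fopen("value.bin", "wb");     /* placeholder name */
    if (b == NULL)
        return 1;
    fwrite(&value, sizeof value, 1, b);     /* 1 byte on disk: 0xFF */
    fclose(b);

    return 0;
}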
This is only the textual representation of the file by your editor or command. Internally all files are stored as 0s and 1s on the HDD/SSD/RAM/... - try opening the file with a hex editor like bless (easy to use on Linux; Mono required for Windows - alternatively search for another hex editor you want to use) to see how the bytes are stored. I suggest bless in particular because it offers different representations in different formats.
In your code, you can use the read functions to store the content byte by byte and interpret it. Just keep a possible endianness fix in mind if you read more than one byte at a time. That is, little-endian and big-endian systems store and read the bytes of a multi-byte value in opposite orders: a word written as 0x1337 on one could be read back as 0x3713 on the other. Just get familiar with this term and use Wikipedia to understand how to handle it, if necessary.
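One common way to sidestep the endianness issue is to assemble multi-byte values from individual bytes in a fixed order. A minimal sketch (the file name is a placeholder, and it assumes the value was stored little-endian):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    FILE *f = fopen("data.bin", "rb");   /* placeholder name */
    if (f == NULL)
        return 1;

    /* Read a 16-bit value stored little-endian, independent of the host's
       own byte order, by combining the two bytes explicitly. */
    int lo = fgetc(f);
    int hi = fgetc(f);
    fclose(f);
    if (lo == EOF || hi == EOF)
        return 1;

    uint16_t word = (uint16_t)((hi << 8) | lo);
    printf("0x%04x\n", word);
    return 0;
}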
All files are stored in binary! It's just a question of how a successive program views/interprets this binary. Depending on how you use this file, it'll get read as a sequence of bytes representing characters, or a sequence of bytes representing instructions, or words representing Unicode, etc.
If you want to see your file in different formats, use od:
NAME
od - dump files in octal and other formats
which will dump your file in hex, characters, octal etc. (the one thing it won't do is show you in binary, but you can derive that from the octal/hex output easily enough)

What are the uses of having binary files?

I am learning file I/O in C and was a little confused about binary files. My question is: what is the use of having binary files when we can always use files in ASCII or some other format which is easily understandable? Also, in what applications are binary files more useful?
Any help on this is really appreciated. Thanks!
All files are binary in nature. ASCII files are those subset of binary files that contain what can be considered to be 'human-readable' data. A pure binary file is not constrained to that subset of characters that is readable.
Speed of access
Obfuscation
The ability to write native objects to file without creating big serialised files.
ASCII is easily understandable by humans, but for many other purposes, it's more efficient and easier for the computer to store things in a binary format. For example, if you want to keep a sequence of integers, it's easier for the computer to read/write the 4 bytes it takes to represent an int, than it is to write out the ascii representation of the number, then parse it while reading.
It is critically important that any byte value can be stored; for example, programs are binary, and any possible bit pattern may be a program instruction for the CPU.
ASCII only stores 7-bit values, so half the possible values are wasted.
Further, what would an integer be stored as?
The number 4294967295 can be stored in 4 bytes (32 bits), but if it were stored in ASCII, as a number, it would require 10 characters. Further, it would require processing to convert it into the 32-bit number. Neither of those things is good.
The 32-bit number is a fixed size, so it is easy to get to the 234856th value in the file: just seek to position 4*234856.
If 32-bit numbers are stored as ASCII, either they must always take 10 bytes, making the file 2.5 times bigger, or they are stored as variable size, making it virtually impossible to seek to a particular value without reading the whole file.
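A minimal sketch of that random access (it assumes the file holds raw ints written with fwrite; the file name is a placeholder):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("values.bin", "rb");   /* placeholder: a file of raw ints */
    if (f == NULL)
        return 1;

    /* Fixed-size records make seeking trivial: the Nth value starts at
       byte offset N * sizeof(value). */
    long n = 234856;
    int value;
    if (fseek(f, n * (long)sizeof value, SEEK_SET) == 0 &&
        fread(&value, sizeof value, 1, f) == 1)
        printf("value #%ld = %d\n", n, value);

    fclose(f);
    return 0;
}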
Edit:
It is worth adding that (in normal use) a human cannot see the data held in a file. The only way to examine the contents of files is by running programs which can read and use the data. So the convenience of a human is a small consideration.
In general, data is stored in the most convenient form for a program's use, and the form is designed to fit the program's purpose. ASCII is a format designed for text editors to create human-readable documents and support simple ways to display text, limited to English letters, numbers and some punctuation. When we want to support all human written language, ASCII is far too limited.
I believe we have over one million characters to represent human written languages (and some other pictures), and we have not yet got characters for all human languages.
UTF-8 is a way to represent the written characters we have so far, as multiple bytes. UTF-8 is an 8-bit encoding, so it uses byte values beyond the range of ASCII.
Think of a binary file as a true representation of data to be interpreted directly by a computer program and not to be read by humans. It would be a lot of overhead for a program to write out data, whether text or numeric, in an ASCII format. Most likely, the programmer would have to invent a protocol for writing arrays, structs, and scalars out into a file in ASCII form, so they could be human-readable and also be read back in by the program and converted back to binary form.
A database table is a good example. Whether or not there are text or numeric fields in the table, the database manager reads and writes that data in binary format. It is easier to write out, read in, and then convert as needed to display any data you can read.
Perception gave a great answer I had never considered before: all data is binary, and ASCII is a subset. That answer made me think of FTP and setting the mode to ASCII or binary. If I'm shuttling Windows binaries stored on a Linux system, I tell FTP to transfer them as binary. That means: don't interpret it as an ASCII file and add a carriage return at the end of each line. There are even times I'll transfer .csv and .txt data as binary, because I know Windows Excel knows how to interpret those non-DOS files.
I would not want to write a program that had to encode/decode images, or audio files, or GIS data, or spacecraft telemetry, or <fill in the blank> as ASCII.

Saving data to a binary file

I would like to save a file as binary, because I've heard that it would probably be smaller than a normal text file.
Now I am trying to save a binary file with some text, but the problem is that the file just contains the text and NULL at the end. I would expect to see only zero's and one's inside the file.
Any explaination or suggestions are highly appreciated.
Here is my code
#include <iostream>
#include <stdio.h>
#include <cstdlib> /* for std::system */
int main()
{
    /*Temporary data buffer*/
    char buffer[20];
    /*Data to be stored in file*/
    char temp[20]="Test";
    /*Opening file for writing in binary mode*/
    FILE *handleWrite=fopen("test.bin","wb");
    /*Writing data to file*/
    fwrite(temp, 1, 13, handleWrite);
    /*Closing File*/
    fclose(handleWrite);
    /*Opening file for reading*/
    FILE *handleRead=fopen("test.bin","rb");
    /*Reading data from file into temporary buffer*/
    fread(buffer,1,13,handleRead);
    /*Displaying content of file on console*/
    printf("%s",buffer);
    /*Closing File*/
    fclose(handleRead);
    std::system("pause");
    return 0;
}
All files contain only ones and zeroes, on binary computers that's all there is to play with.
When you save text, you are saving the binary representation of that text, in a given encoding that defines how each letter is mapped to bits.
So for text, a text file or a binary file almost doesn't matter; the savings in space that you've heard about generally come into play for other data types.
Consider a floating point number, such as 3.141592653589. If saved as text, that would take one character per digit (just count them), plus the period. If saved in binary as just a copy of the float's bits, it will take four bytes (32 bits) on a typical 32-bit system. The exact number of bits stored by a call such as:
FILE *my_file = fopen("pi.bin", "wb");
float x = 3.1415;
fwrite(&x, sizeof x, 1, my_file);
is CHAR_BIT * sizeof x; see <limits.h> for CHAR_BIT.
The problem you describe is a chain of (very common [1], unfortunately) mistakes and misunderstandings. Let me try to fully detail what is going on; hopefully you will take the time to read through all the material: it is lengthy, but these are very important basics that any programmer should master. Please do not despair if you do not fully understand all of it: just try to play around with it, come back in a week or two, practice, and see what happens :)
There is a crucial difference between the concepts of a character encoding and a character set. Unless you really understand this difference, you will never really get what is going on, here. Joel Spolsky (one of the founders of Stackoverflow, come to think of it) wrote an article explaining the difference a while ago: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Before you continue reading this, before you continue programming, even, read that. Honestly, read it, understand it: the title is no exaggeration. You must absolutely know this stuff.
After that, let us proceed:
When a C program runs, a memory location that is supposed to hold a value of type "char" contains, just like any other memory location, a sequence of ones and zeroes. The "type" of a variable only means something to the compiler, not to the running program, which just sees ones and zeroes and does not know more than that. In other words: where you commonly think of a "letter" (an element from a character set) residing in memory somewhere, what is actually there is a bit sequence (an element from a character encoding).
Every compiler is free to use whatever encoding it wishes to represent characters in memory. As a consequence, it is free to represent what we call a "newline" internally as any number it chooses. For example, say I write a compiler: I can agree with myself that every time I want to store a "newline" internally, I store it as the number six (6), which is just 0x6 in hexadecimal (or 110 in binary).
Writing to a file is done by telling the operating system [2] four things at the same time:
The fact that you want to write to a file (fwrite())
Where the data starts that you want to write (first argument to fwrite)
How much data you want to write (second and third argument, multiplied)
What file you want to write to (last argument)
Note that this has nothing to do with the "type" of that data: your operating system has no idea, and does not care. It does not know anything about character sets and it does not care: it just sees a sequence of ones and zeroes starting somewhere and copies that to a file.
Opening a file in "binary" mode is actually the normal, intuitive way of dealing with files that a novice programmer would expect: the memory location you specify is copied one-on-one to the file. If you write a memory location that used to hold variables that the compiler decided to store as type "char", those values are written one-on-one to the file. Unless you know how the compiler stores values internally (what value it associates with a newline, with a letter 'a', 'b', etc), THIS IS MEANINGLESS. Compare this to Joel's similar point about a text file being useless without knowing what its encoding is: same thing.
Opening a file in "text" mode is almost equal to binary mode, with one (and only one) difference: anytime a value is written that has value equal to what the compiler uses INTERNALLY for the newline (6, in our case), it writes something different to the file: not that value, but whatever the operating system you are on considers to be a newline. On windows, this is two bytes (13 and 10, or 0x0d 0x0a, on Windows). Note, again, if you do not know about the compiler's choice of internal representation of the other characters, this is STILL MEANINGLESS.
Note at this point that it is pretty clear that writing anything but data that the compiler designated as characters to a file in text mode is a bad idea: in our case, a 6 might just happen to be among the values you are writing, in which case the output is altered in a way that we absolutely do not mean to.
(Un)Luckily, most (all?) compilers actually use the same internal representation for characters: this representation is US-ASCII and it is the mother of all defaults. This is the reason you can write some "characters" to a file in your program, compiled with any random compiler, and then open it with a text editor: they all use/understand US-ASCII and it happens to work.
OK, now to connect this to your example: why is there no difference between writing "test" in binary mode and in text mode? Because there is no newline in "test", that is why!
And what does it mean when you "open a file", and then "see" characters? It means that the program you used to inspect the sequence of ones and zeroes in that file (because everything is ones and zeroes on your hard disk) decided to interpret that as US-ASCII, and that happened to be what your compiler decided to encode that string as, in its memory.
Bonus points: write a program that reads the ones and zeroes from a file into memory and prints every BIT (there's multiple bits to make up one byte, to extract them you need to know 'bitwise' operator tricks, google!) as a "1" or "0" to the user. Note that "1" is the CHARACTER 1, the point in the character set of your choosing, so your program must take a bit (number 1 or 0) and transform it to the sequence of bits needed to represent character 1 or 0 in the encoding that the terminal emulator uses that you are viewing the standard out of the program on oh my God. Good news: you can take lots of short-cuts by assuming US-ASCII everywhere. This program will show you what you wanted: the sequence of ones and zeroes that your compiler uses to represent "test" internally.
This stuff is really daunting for newbies, and I know that it took me a long time to even know that there was a difference between a character set and an encoding, let alone how all of this worked. Hopefully I did not demotivate you, if I did, just remember that you can never lose knowledge you already have, only gain it (ok not always true :P). It is normal in life that a statement raises more questions than it answered, Socrates knew this and his wisdom seamlessly applies to modern day technology 2.4k years later.
Good luck, do not hesitate to continue asking. To other readers: please feel welcome to improve this post if you see errors.
Hraban
[1] The person that told you that "saving a file in binary is probably smaller", for example, probably gravely misunderstands these fundamentals. Unless he was referring to compressing the data before you save it, in which case he just uses a confusing word ("binary") for "compressed".
[2] "Telling the operating system something" is what is commonly known as a system call.
Well, the difference between text and binary mode is the way the end of line is handled.
If you write a string in binary mode, it will stay the same string.
If you want to make it smaller, you'll have to compress it somehow (look at zlib, for example).
Where you do save space is with binary data (like an array of bytes): it's smaller to save it as raw binary rather than putting it in a string (either in a hex representation or base64). I hope this helps.
I think you're a bit confused here.
The ASCII-string "Test" will still be an ASCII-string when you write it to the file (even in binary mode). The cases when it makes sense to write binary are for other types than chars (e.g. an array of integers).
try replacing
FILE *handleWrite=fopen("test.bin","wb");
fwrite(temp, 1, 13, handleWrite);
with
FILE *handleWrite=fopen("test.bin","w");
fprintf(handleWrite, "%s", temp);
Function printf("%s",buffer); prints buffer as zero-ending string.
Try to use:
char temp[20]="Test\n\rTest";
