Understanding `read`, `write` system calls in Unix - C

My Systems Programming project has us implementing a compression/decompression program to crunch down ASCII text files by removing each byte's always-zero top bit, writing the output to a separate file (and reversing the process for decompression). The professor requires us to use binary files and the Unix system calls open, close, read, write, etc.
From my understanding, read pulls in the binary data in chunks of a defined number of bytes. However, since this data is binary, I'm not sure how to parse it.
This is a stripped down version of my code, minus the error checking:
void compress(char readFile[]) {
    char buffer[BUFFER]; // buffer size set to 4096, but tunable to system preference
    int openReadFile;

    openReadFile = open(readFile, O_RDONLY);
}
If I use read to read the data into buffer, will the data in buffer be in binary or character format? Nothing I've come across addresses that detail, and it's very relevant to how I parse the contents.

read() will read the bytes in without any interpretation (so "binary" mode).
Since the data is binary and you want to access the individual bytes, you should use a buffer of unsigned char:
unsigned char buffer[BUFFER]. You can regard char/unsigned char as bytes; they'll be 8 bits on Linux.
Now, since what you're dealing with is 8-bit ASCII compressed down to 7 bits, you'll have to convert those 7 bits back into 8 bits again so you can make sense of the data.
To explain what's been done, consider the text "Hey". That's 3 bytes. Each byte has 8 bits, and in ASCII these are the bit patterns:
01001000 01100101 01111001
Now, removing the most significant bit from this, you shift the remaining bits one bit to the left.
X1001000 X1100101 X1111001
Above, X is the bit to be removed. Removing those and shifting the others, you end up with bytes with this pattern:
10010001 10010111 11001000
The rightmost 3 bits are just filled in with 0. So far no space is saved, though; there are still 3 bytes.
With a string of 8 bytes, we'd save 1 byte, as that would compress down to 7 bytes (8 × 7 = 56 bits = 7 bytes).
Now you have to do the reverse on the bytes you've read back in.
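For concreteness, here is a minimal sketch of the packing step described above, assuming the input bytes are already in memory (the function name pack7 and its interface are illustrative, not part of the assignment):

#include <stddef.h>

/* Pack n bytes of 7-bit ASCII into ceil(n*7/8) output bytes, MSB-first,
 * exactly as in the "Hey" example above. Returns the output length. */
size_t pack7(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned acc = 0;   /* bit accumulator */
    int bits = 0;       /* number of valid bits currently in acc */
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        acc = (acc << 7) | (in[i] & 0x7F);   /* drop the top bit, keep 7 */
        bits += 7;
        if (bits >= 8) {                     /* a full output byte is ready */
            bits -= 8;
            out[len++] = (unsigned char)(acc >> bits);
            acc &= (1u << bits) - 1;         /* discard the bits just written */
        }
    }
    if (bits > 0)                            /* pad the final byte with 0s */
        out[len++] = (unsigned char)(acc << (8 - bits));
    return len;
}

Decompression is the mirror image: accumulate bits from each input byte and emit one output byte for every 7 bits collected, with the top bit set back to 0.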

I'll quote the manual page of the fopen function (which is built on top of the open primitive), from http://www.kernel.org/doc/man-pages/online/pages/man3/fopen.3.html:
The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux.
So even the high-level function ignores the 'b' mode flag :-)

It will read the binary content of the file and load it into the memory that buffer points to. A byte is 8 bits, and a char is 8 bits too (on typical systems), so if the file was a regular plain-text document you'll end up with a printable string. Be careful with how it ends: read returns the number of bytes read (which, for an ASCII-encoded plain-text file, is the number of characters), and it does not add a terminating '\0' for you.
Edit: in case the file you're reading isn't a text file but a collection of binary representations, you can declare the buffer with the same type as the data stored in the file, even if that type is a struct.
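As a hedged sketch of what such a read() loop usually looks like (error handling and the surrounding open()/close() omitted; BUFFER as in the question):

#include <unistd.h>     /* read, ssize_t */

#define BUFFER 4096

/* Consume a file descriptor chunk by chunk. read() returns the number
 * of bytes actually obtained, 0 at end of file, or -1 on error. */
static void consume(int fd)
{
    unsigned char buffer[BUFFER];
    ssize_t n;

    while ((n = read(fd, buffer, sizeof buffer)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            /* buffer[i] is one raw byte, 0..255; interpret as needed */
        }
    }
}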

Related

C binary file versus text file efficiency

I'm quite new to C and I would like some help.
Let's say I need to store only 6-digit numbers in a file (let's assume the size of int equals 4).
What would be more efficient in terms of memory: using a text file or a binary file? I am not really sure how to approach this problem; any help will be welcome.
Most people classify files into two categories: binary files and ASCII (text) files. You've actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an ASCII file.
An ASCII file is defined as a file that consists of ASCII characters. It's usually created by using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for writing code, but they may not always save it as ASCII. ASCII is an international standard.
Computer science is all about creating good abstractions. Sometimes it succeeds and sometimes it doesn't. Good abstractions are all about presenting a view of the world that the user can use. One of the most successful abstractions is the text editor.
When you're writing a program, and typing in comments, it's hard to imagine that this information is not being stored as characters. ASCII/text files are really stored as 0's and 1's.
Files are stored on disks, and disks have some way to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction. Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think of them that way.
In effect, ASCII files are basically binary files, because they store binary numbers. That is, ASCII files store 0's and 1's.
The Difference between ASCII and Binary Files?
An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.
However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.
Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.
A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.
Example of ASCII files
Suppose you're editing a text file with a text editor. Because you're using a text editor, you're pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters 'c', then 'a', then 't'. Then, you save the file and quit.
What happens? For the time being, we won't worry about the mechanism of what it means to open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.
If you look up an ASCII table, you will discover that the ASCII codes for 'c', 'a', and 't' are 0x63, 0x61, and 0x74 (the 0x merely indicates the values are in hexadecimal, instead of decimal/base 10).
Here's how it looks:
ASCII:   'c'        'a'        't'
Hex:     63         61         74
Binary:  0110 0011  0110 0001  0111 0100
Each time you type an ASCII character and save it, an entire byte is written which corresponds to that character. This includes punctuation, spaces, and so forth.
Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
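You can verify the table with a few lines of C; this small sketch just prints each byte of the string in hex:

#include <stdio.h>

int main(void)
{
    const unsigned char text[] = "cat";

    for (int i = 0; text[i] != '\0'; i++)
        printf("'%c' = 0x%02x\n", text[i], (unsigned)text[i]);  /* 63, 61, 74 */
    return 0;
}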
Now sometimes a text editor throws in characters you may not expect. For example, some editors "insist" that each line end with a newline character.
The only place a file can be missing a newline at the end of the line is the very last line. Some editors allow the very last line to end in something besides a newline character. Some editors add a newline at the end of every file.
Unfortunately, even the newline character is not that universally standard. It's common to use newline characters on UNIX files, but in Windows, it's common to use two characters to end each line (carriage return, newline, which is \r and \n, I believe). Why two characters when only one is necessary?
This dates back to printers. In the old days, the time it took for a printer to return back to the beginning of a line was equal to the time it took to type two characters. So, two characters were placed in the file to give the printer time to move the printer ball back to the beginning of the line.
This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case you've wondered why transferring files to UNIX from Windows sometimes generates funny characters.
Editing Binary Files
Now that you know that each character typed in an ASCII file corresponds to one byte in a file, you might understand why it's difficult to edit a binary file.
If you want to edit a binary file, you really would like to edit individual bits. For example, suppose you want to write the binary pattern 1100 0011. How would you do this?
You might be naive, and type in the following in a file:
11000011
But you should know, by now, that this is not editing individual bits of a file. If you type in '1' and '0', you are really entering 0x31 and 0x30. That is, you're entering 0011 0001 and 0011 0000 into the file. You're actually (indirectly) typing 8 bits at a time.
There are some programs that allow you to type in 49, and it translates this to a single byte, 0100 1001, instead of the ASCII codes for '4' and '9'. You can call these programs hex editors. Unfortunately, these may not be so readily available. It's not too hard to write a program that reads in an ASCII file that looks like hex pairs and converts it to a true binary file with the corresponding bit patterns.
That is, it takes a file that looks like:
63 a0 de
and converts this ASCII file to a binary file that begins 0110 0011 (which is hex 63 in binary). Notice that the input file is ASCII, which means what's really stored is the ASCII codes for '6', '3', ' ' (space), 'a', '0', and so forth. A program can read this ASCII file, then generate the appropriate binary code and write that to a file.
Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the output binary file would contain 3 bytes, one byte per hex pair.
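A sketch of such a converter, assuming the input is whitespace-separated hex pairs on stdin and the raw bytes go to stdout:

#include <stdio.h>

/* Convert an ASCII file of hex pairs ("63 a0 de") to raw binary.
 * Usage: ./hex2bin < pairs.txt > out.bin */
int main(void)
{
    unsigned byte;

    /* %x skips whitespace; the field width 2 limits it to one pair */
    while (scanf("%2x", &byte) == 1)
        putchar((int)byte);
    return 0;
}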
Writing Binary Files
Why do people use binary files anyway? One reason is compactness. For example, suppose you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters (which is 6 bytes). However, if you represent it as a 4-byte unsigned int, you can write it out using 4 bytes.
ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space. You can represent information more compactly by using binary files.
For example, one thing you can do is to save an object to a file. This is a kind of serialization. To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object and the number of bytes used to represent the object (use the sizeof operator to determine this) to the write() method. The method then dumps out the bytes as they appear in memory into a file.
You can then recover the information from the file and place it into the object by using a corresponding read() method, which typically takes a pointer to an object (one that already has memory allocated, whether statically or dynamically) and the number of bytes for the object, and copies the bytes from the file into the object.
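In C, these "methods" are typically the standard library's fwrite() and fread(). A minimal sketch of such a round trip (the struct and file name are illustrative; error checks omitted), with the portability caveats discussed next:

#include <stdio.h>

struct point { int x, y; };

int main(void)
{
    struct point p = {3, 7}, q;
    FILE *fp;

    fp = fopen("point.bin", "wb");
    fwrite(&p, sizeof p, 1, fp);   /* dump the bytes as they appear in memory */
    fclose(fp);

    fp = fopen("point.bin", "rb");
    fread(&q, sizeof q, 1, fp);    /* copy the bytes back into the object */
    fclose(fp);

    printf("%d %d\n", q.x, q.y);   /* prints: 3 7 */
    return 0;
}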
Of course, you must be careful. If you use two different compilers, or transfer the file from one kind of machine to another, this process may not work. In particular, the object may be laid out differently. This can be as simple as endianness, or there may be issues with padding.
This way of saving objects to a file is nice and simple, but it may not be all that portable. Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will write out the addresses to the file. Those addresses are likely to be totally meaningless. Addresses may make sense at the time a program is running, but if you quit and restart, those addresses may change.
This is why some people invent their own format for storing objects: to increase portability.
But if you know you aren't storing objects that contain pointers, and you are reading the file in on the same kind of computer system you wrote it on, and you're using the same compiler, it should work.
This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire objects. They tend to be somewhat more portable.
An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to 0. Think of an ASCII file as a special kind of binary file.
A generic binary file uses all 8 bits. Each byte of a binary file can have any of the full 256 bitstring patterns (as opposed to an ASCII file, which only has 128 bitstring patterns).
There may be a time when Unicode text files become more prevalent. But for now, ASCII files are the standard format for text files.
A binary file is basically any file that is not "line-oriented": any file where, besides the actual written characters and newlines, there are other symbols as well.
Usually, when you write a file in text mode on Windows, any newline \n will be translated to a carriage return + line feed pair, \r\n.
There isn't any memory efficiency that can be achieved by using a binary file as opposed to a text file; files are stored on disk and not in memory. It all depends on what you want to do with the file and how you wish to format it.
Since you are working with pure integers (regardless of the int size), using a text or binary file will have the same impact on performance (meaning it won't make any difference which type you choose to work with).
If you want to later modify or read the file in a text editor, it is best to use the text mode to write the file.

How is a text character in a file with ASCII encoding stored on disk?

Let us say I have a 32-bit machine running a 32-bit OS with an application program like Notepad (assume). Assume I create a .txt file with that program containing just the single character 'A' and save the file with ANSI (i.e., ASCII) encoding on disk. With 32 bits making up a single addressable memory block called a word, how would the 4 bytes in the word be used to store 'A' (i.e., the number 65 in ASCII)? 65 translates to 0100 0001 in binary.
ASCII means that you are using just one byte per character. Many encodings use one byte per character, but there are some, like UTF-16, which use (at least) two bytes for each character.
The 32 bits only become relevant if you are processing these characters in your CPU in a register and you load them as an integer: then the single byte is converted to a 32-bit integer and processed by the CPU, and when you save it, it is again one byte long.
How one byte is converted into a 32-bit integer is described, for example, here: http://en.wikipedia.org/wiki/Endianness
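A small sketch to make the byte-order point concrete: store the character code 65 in a 32-bit integer and look at its individual bytes in memory.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t v = 65;                        /* 'A' widened to a 32-bit integer */
    const unsigned char *p = (const unsigned char *)&v;

    /* On a little-endian x86 machine this prints: 41 00 00 00 */
    for (int i = 0; i < 4; i++)
        printf("%02x ", p[i]);
    printf("\n");
    return 0;
}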

Save as binary file in C but it doesn't display zeros and ones

As I understand it, by saving a file in C using wb mode, shouldn't I see binary numbers (zeros and ones) in the saved file?
When I save in wb mode, the output in the file is:
Feras Wilson — n FFFF îè` c P xHF F
û¥2012
But this is not binary zeros and ones. How do I save a file so it contains zeros and ones, and then read it back in C?
It is saved as 0s and 1s, but your text editor reads them as bytes (it groups them in 8 bits) and displays them using ASCII. [1]
When you write to a text file, a lot of work is done to convert the binary data you wish to write into a human-readable format.
For example, if you write the number 255, it has to be brought to the form '2', '5', '5' (which are characters!) and then each of those characters is written.
If you write to a binary file, the actual binary data is put in the file. How many bytes that is depends on the kind of variable (how many octets it is represented on), as well as on endianness and other things. If it is an unsigned char, the single byte 0b11111111 goes into the binary file (the actual raw number, not characters!).
[1] http://www.asciitable.com/
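To see the difference directly, here is a short sketch writing the same value both ways (file names are illustrative):

#include <stdio.h>

int main(void)
{
    unsigned char n = 255;
    FILE *fp;

    fp = fopen("as_text.txt", "w");
    fprintf(fp, "%d", n);           /* writes the 3 characters '2' '5' '5' */
    fclose(fp);

    fp = fopen("as_binary.bin", "wb");
    fwrite(&n, sizeof n, 1, fp);    /* writes the single byte 0b11111111 */
    fclose(fp);
    return 0;
}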
This is only the textual representation of the file by your editor or command. Internally, all files are stored as 0s and 1s on the HDD/SSD/RAM/... Try opening the file with a hex editor like bless (easy to use on Linux; Mono required for Windows; alternatively, search for another hex editor you prefer) to see how the bytes are stored. I suggest bless because it offers different representations in different formats.
In your code, you can use the read functions to take in the content byte by byte and interpret it. Just keep a possible endianness fix in mind if you read more than one byte at a time: little-endian and big-endian systems store and read multi-byte values in "reversed" byte order relative to each other, so a word written as 0x1337 on one could be read back as 0x3713 on the other. Get familiar with this term and use Wikipedia to understand how to handle it, if necessary.
All files are stored in binary! It's just a question of how a subsequent program views/interprets this binary. Depending on how you use the file, it'll be read as a sequence of bytes representing characters, or a sequence of bytes representing instructions, or words representing Unicode, etc.
If you want to see your file in different formats, use od:
NAME
od - dump files in octal and other formats
which will dump your file in hex, characters, octal, etc. (the one thing it won't do is show you binary, but you can derive that from the octal/hex output easily enough)
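If you do want base-2 output, a tiny dumper is easy to write yourself; a hedged sketch:

#include <stdio.h>

/* Dump every byte of a file as 8 binary digits.
 * Usage: ./bindump somefile */
int main(int argc, char *argv[])
{
    FILE *fp;
    int c;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL)
        return 1;
    while ((c = fgetc(fp)) != EOF) {
        for (int bit = 7; bit >= 0; bit--)
            putchar(((c >> bit) & 1) ? '1' : '0');
        putchar(' ');
    }
    putchar('\n');
    fclose(fp);
    return 0;
}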

Reading a binary file bit by bit

I know the function below:
size_t fread(void *ptr, size_t size_of_elements, size_t number_of_elements, FILE *a_file);
It only reads byte by byte; my goal is to be able to read 12 bits at a time and put them into an array. Any help or pointers would be greatly appreciated!
Adding to the first comment, you can try reading one byte at a time (declare a char variable and read into it), and then use the bitwise operators >> and << to examine it bit by bit. Read more here: http://www.cprogramming.com/tutorial/bitwise_operators.html
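A minimal sketch of that suggestion, reading one byte at a time with fread and walking its bits MSB-first (the file name is illustrative):

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("data.bin", "rb");
    unsigned char byte;

    while (fp != NULL && fread(&byte, 1, 1, fp) == 1) {
        for (int i = 7; i >= 0; i--)
            printf("%d", (byte >> i) & 1);   /* extract one bit at a time */
        printf("\n");
    }
    if (fp != NULL)
        fclose(fp);
    return 0;
}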
Many years ago, I wrote some I/O routines in C for a Huffman encoder. This needs to be able to read and write on the granularity of bits rather than bytes. I created functions analogous to read(2) and write(2) that could be asked to (say) read 13 bits from a stream. To encode, for example, bytes would be fed into the coder and variable numbers of bits would emerge the other side. I had a simple structure with a bit pointer into the current byte being read or written. Every time it went off the end, it flushed the completed byte out and reset the pointer to zero. Unfortunately that code is long gone, but it might be an idea to pull apart an open-source Huffman coder and see how the problem was solved there. Similarly, base64 coding takes 3 bytes of data and turns them into 4 (or vice versa).
I've implemented a couple of methods to read/write files bit by bit. Here they are. Whether this is viable for your use case, you have to decide for yourself. I've tried to write the most readable, optimized code I could, not being a seasoned C developer (for now).
Internally, it uses a "bitCursor" to store information about previous bits that don't yet fit a full byte. It has two data fields: d stores data and s stores size, the number of bits held in the cursor.
You have four functions:

newBitCursor(), which returns a bitCursor object with default values {0,0}. Such a cursor is needed at the beginning of a sequence of read/write operations to or from a file.

fwriteb(void *ptr, size_t sizeb, size_t rSizeb, FILE *fp, bitCursor *c), which writes the sizeb rightmost bits of the value stored in ptr to fp.

fcloseb(FILE *fp, bitCursor *c), which writes a remaining byte if the previous writes did not exactly encapsulate all the data needed to be written; that is probably almost always the case...

freadb(void *ptr, size_t sizeb, size_t rSizeb, FILE *fp, bitCursor *c), which reads sizeb bits and bitwise ORs them into *ptr. (It is, therefore, your responsibility to initialize *ptr to 0.)
More info is provided in the comments. Have Fun!
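The listing itself isn't reproduced here, so what follows is only a reconstruction of what such a bitCursor API might look like, based on the description above (signatures simplified, MSB-first bit order assumed, and separate cursors used for reading and writing):

#include <stdio.h>
#include <stdint.h>

/* d stores pending bits, s stores how many of them are valid. */
typedef struct { unsigned char d; int s; } bitCursor;

bitCursor newBitCursor(void) { bitCursor c = {0, 0}; return c; }

/* Write the sizeb rightmost bits of val to fp, MSB-first. */
void fwriteb(uint32_t val, int sizeb, FILE *fp, bitCursor *c)
{
    for (int i = sizeb - 1; i >= 0; i--) {
        c->d = (unsigned char)((c->d << 1) | ((val >> i) & 1u));
        if (++c->s == 8) {          /* a full byte accumulated: flush it */
            fputc(c->d, fp);
            c->d = 0;
            c->s = 0;
        }
    }
}

/* Pad the final partial byte with zero bits and flush it. */
void fcloseb(FILE *fp, bitCursor *c)
{
    if (c->s > 0) {
        fputc(c->d << (8 - c->s), fp);
        c->d = 0;
        c->s = 0;
    }
}

/* Read sizeb bits, MSB-first, OR-ing them into *ptr (initialize *ptr
 * to 0 first). Here s counts the bits still unread in d. Returns the
 * number of bits actually read. */
int freadb(uint32_t *ptr, int sizeb, FILE *fp, bitCursor *c)
{
    for (int i = 0; i < sizeb; i++) {
        if (c->s == 0) {            /* refill the cursor from the file */
            int b = fgetc(fp);
            if (b == EOF)
                return i;
            c->d = (unsigned char)b;
            c->s = 8;
        }
        *ptr = (*ptr << 1) | ((c->d >> --c->s) & 1u);
    }
    return sizeb;
}

For example, fwriteb(0x5, 3, fp, &wc) followed by fcloseb(fp, &wc) writes the single byte 1010 0000.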
Edit: It has come to my knowledge today that when I made this, I assumed little endian! :P Oops! It's always nice to realize how much of a noob I still am ;D.
Edit: GNU's Binary File Descriptors.
Read the first two bytes from your a_file file pointer and check the bits in the least or most significant byte, depending on the endianness of your platform (x86 is little-endian), using the bitshift operators.
You can't really put bits into an array, as there isn't a datatype for bits. Rather than keeping 1s and 0s in an array, which is inefficient, it seems cheaper to just keep the two bytes in a two-element array (say, of type unsigned char) and write functions that map those two bytes to one of 4096 (2^12) values of interest.
As a complication, on subsequent reads, if you want to fread through the pointer every 12 bits, you would read only one byte, using the left-over bits from the previous read to build a new 12-bit value. If there are no leftovers, you would need to read two bytes.
Your mapping functions would need to address the second case where bits were used from previous read, because the two bytes would need different mapping. To do this efficiently, a modulus on a read-counter could be used to swap between two mappings.
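For illustration, a sketch of the no-leftovers layout, where every 3 bytes carry two 12-bit values (MSB-first packing assumed; the file name is illustrative):

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("data.bin", "rb");
    unsigned char b[3];

    /* b[0] plus the high nibble of b[1] form the first value;
     * the low nibble of b[1] plus b[2] form the second. */
    while (fp != NULL && fread(b, 1, 3, fp) == 3) {
        unsigned v1 = ((unsigned)b[0] << 4) | (b[1] >> 4);
        unsigned v2 = ((unsigned)(b[1] & 0x0F) << 8) | b[2];
        printf("%u %u\n", v1, v2);
    }
    if (fp != NULL)
        fclose(fp);
    return 0;
}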
Read 2 bytes and do bitwise operations to extract the 12 bits; from the next read onwards, apply the bitwise operations again and you will get back what you expect . . . .
For your problem you can look at this demo program, which reads 2 bytes even though the actual information is only 12 bits; this is the kind of situation bitwise access is used for.
fread() is a standard library function that takes its size argument in bytes (as a size_t), so an exact 12-bit read is not possible. If you create the file yourself, then create it as below and read it as below, and that solves your problem.
If the file is a special file which was not written by you, then follow the standard provided for that file to read it; I think they are also writing it like this. Otherwise, describe the exact format and I may be able to help.
#include <stdio.h>
#include <stdlib.h>

struct node
{
    unsigned int data : 12;   /* a 12-bit bit-field (2048 would overflow a signed one) */
} NODE;

int main(void)
{
    FILE *fp;

    fp = fopen("t", "wb");
    NODE.data = 1024;
    printf("%d\n", NODE.data);
    /* note: fwrite still writes whole bytes; sizeof(NODE) is
     * typically 4 because the bit-field lives inside an int */
    fwrite(&NODE, sizeof(NODE), 1, fp);
    NODE.data = 2048;
    printf("%d\n", NODE.data);
    fwrite(&NODE, sizeof(NODE), 1, fp);
    fclose(fp);

    fp = fopen("t", "rb");
    fread(&NODE, sizeof(NODE), 1, fp);
    printf("%d\n", NODE.data);
    fread(&NODE, sizeof(NODE), 1, fp);
    printf("%d\n", NODE.data);
    fclose(fp);
    return 0;
}

Dumping bits/bytes to a file in binary mode

I have the code below.
FILE *fp;
int a;

fp = fopen("dump.bin", "wb");
a = 0xffafbcdf;
fprintf(fp, "%x", a & 0x3ffff);
I am trying to dump only 18 LSBits of variable a. But the value dumped in the file is 3bcdf.
My question - Is it not possible to dump/fwrite/fprintf desired number of bits which is not a multiple of 4(nibble) to a file?
EDIT:
1.) When I checked my file dump I realized that, since I am opening the file in binary mode, I should not be using fprintf; I should use fwrite.
2.) What I see in the output is DF BC 03. It writes in multiples of 8 bits, so it writes 24 bits (3 bytes), but I was interested in only 18 bits. But then I realized that no file-write library call can write a number of bits that is not a multiple of 8; it will always add leading zero bits to complete the byte and then dump it. The "least count unit" for file data is a byte.
As others have said, the character is the minimal sized entity you can write to a file. I'd just like to observe that there is nothing wrong with using fprintf() (or any other stream output function) on binary files.
0x3bcdf are the 18 LSBits of 0xffafbcdf, so it seems to be working as expected. What did you expect?
Files are written in bytes (8 bits), not nibbles (4 bits).
Edit after your edit: fprintf will print text, thus you'll get 5 bytes (the characters '3' 'b' 'c' 'd' 'f'), whereas if you use fwrite you will write an int (32 bits) to the file, with the mask applied to the value recorded.
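A short sketch showing both behaviors side by side (a little-endian x86 machine is assumed for the fwrite byte order):

#include <stdio.h>

int main(void)
{
    unsigned int a = 0xffafbcdf;
    unsigned int masked = a & 0x3ffff;      /* 0x3bcdf: the 18 LSBits */
    FILE *fp = fopen("dump.bin", "wb");

    if (fp == NULL)
        return 1;
    fprintf(fp, "%x", masked);              /* 5 text characters: '3' 'b' 'c' 'd' 'f' */
    fwrite(&masked, sizeof masked, 1, fp);  /* 4 raw bytes: df bc 03 00 */
    fclose(fp);
    return 0;
}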
