How do fread and fwrite distinguish between different data (types) in C?

I am working on a C program (on Ubuntu, using bash) that manipulates binary data files. First of all, when I use fopen(filename, "w") it creates a file, but without any extension. However, when I use vim filename it opens it up in some binary form.
For this question: when I use fwrite(array, sizeof(some struct), # of structs, filePointer), it writes the structs into the file in binary (though I am not sure how). When I use fread(anotherArray, sizeof(same struct), same # of structs, anotherFilePointer), it somehow magically knows how to read each struct back in binary form and put it into the array, just from the element size and how many to read. What happens if I pass a count smaller than the number of structs in the file? How would fread know what to read correctly? How does it read the data correctly just by looking at the sizes, without knowing what type of data it is?

fwrite writes the bytes of the memory where the object is stored to the output stream and fread reads bytes from the input stream into the memory whose address it gets as an argument. No assumption is made regarding the types and representations of the C objects stored in this memory.
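For example, a minimal round trip might look like this sketch (the struct and file names are invented for illustration):

#include <stdio.h>

struct record { int id; double value; };  /* invented example struct */

int main(void) {
    struct record out[3] = {{1, 1.5}, {2, 2.5}, {3, 3.5}}, in[3];

    FILE *fp = fopen("records.bin", "wb");
    if (!fp) return 1;
    /* writes 3 * sizeof(struct record) raw bytes, nothing more */
    fwrite(out, sizeof(struct record), 3, fp);
    fclose(fp);

    fp = fopen("records.bin", "rb");
    if (!fp) return 1;
    /* a smaller count (e.g. 2) would simply read fewer bytes and leave
       in[2] uninitialized; fread returns the count actually read */
    size_t n = fread(in, sizeof(struct record), 3, fp);
    fclose(fp);

    printf("read %zu records, first id = %d\n", n, in[0].id);
    return 0;
}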
Hence a number of problems can occur:
the representation of basic types can differ from one compiler to another, one machine to another, one OS to another, possibly even depending on compiler switches. Writing the bytes of the memory representation of basic types makes sense only if you know you will be reading the file back into byte-compatible structures.
the mode for accessing the input and output files matters: as you mention, files must be opened in binary mode to avoid any translation between memory representation and file contents, such as what happens for text files on legacy systems. For example, text mode on MS-Windows causes 0A bytes to be converted to 0D 0A sequences on output and 0D bytes to be stripped on input, resulting in different contents for isolated 0D bytes in the initial content.
if the C structure contains pointers, the bytes written to the output represent the value of these pointers, not what they point to. Reading these values back into memory is highly likely to create invalid pointers and very unlikely to make any sense.
if the C structure has a flexible array at the end, its contents are not included in the sizeof(T) bytes written by fwrite or read by fread.
the C structure may contain padding between members, causing the output file to contain non-deterministic bytes, which might be a problem in some circumstances.
if the C structure has arrays with only partially meaningful contents, such as char arrays containing C strings, beware that fwrite will write the bytes beyond the null terminator; these should be meaningless, but they might be sensitive information such as password fragments or other meaningful data. Carefully erasing such arrays may avoid this issue, but padding bytes cannot be erased reliably, so this solution is not perfect.
For all the above reasons and others, reading and writing binary data should be reserved for very specific cases where the programmer knows exactly what is happening. For other purposes, saving data as text files in human-readable form is much preferred.
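As a sketch of that text-based alternative (struct and field names invented for illustration):

#include <stdio.h>

struct record { int id; double value; };  /* invented example struct */

/* One record per line, human readable and portable across machines. */
void save_text(FILE *fp, const struct record *r) {
    fprintf(fp, "%d %.17g\n", r->id, r->value);
}

/* Returns 1 on success, 0 on failure or end of file. */
int load_text(FILE *fp, struct record *r) {
    return fscanf(fp, "%d %lg", &r->id, &r->value) == 2;
}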

From the question comments, by David C. Rankin:
"Well, fread/fwrite read and write bytes (binary data - if you write out then read in the same number of bytes -- you get the same thing back). If you want to read and write text where you need to worry about line-breaks, etc.., fgets/fputs. or fprintf"
So I guess I can never know what I read in with fread unless I know what I wrote with fwrite?
"Right, look at the type for your buffer in fwrite(3) - Linux man page it is type void *. It's just a starting address for fwrite to use in writing however many bytes you told it to write. (obviously you know what it is writing) The same for fread -- it just reads bytes -- you have to know what you are reading (or at least the format of it). That's what binary I/O is about, it's all just bytes -- it's up to you, the Programmer, to know what you are writing and reading and how to unpack it. Otherwise, use formatted-I/O and lines, words, etc.."

Related

Reading data from a file to a struct in C

Let's say I used the fread function to read data from a file into a struct. How exactly is the data read into the struct? Let's say my struct has the following:
int x;
char c;
Will the first 4 bytes read go into x and the next byte go into c?
And if I read in more bytes than the elements of my struct can hold, what's going to happen?
Will the first 4 bytes read go into x and the next byte go into c?
Yes, unless your compiler has extremely strange padding rules (e.g. every member must be 8-byte aligned), and assuming int is 4 bytes and char is 1 byte.
And if I read in more bytes than the elements of my struct can hold, what's going to happen?
That's undefined behavior, unless perhaps the over-long write is no more than sizeof(YourStruct), in which case you'll only be writing into the padding bytes (which on a lot of platforms will be the 3 bytes after the char).
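A quick way to see the layout (and padding) your compiler actually chose is offsetof and sizeof; this sketch assumes a typical platform:

#include <stdio.h>
#include <stddef.h>  /* offsetof */

struct S { int x; char c; };

int main(void) {
    /* On a typical platform this prints 0, 4 and 8: the compiler adds
       three padding bytes after c so sizeof(S) is a multiple of 4. */
    printf("offsetof(S, x) = %zu\n", offsetof(struct S, x));
    printf("offsetof(S, c) = %zu\n", offsetof(struct S, c));
    printf("sizeof(S)      = %zu\n", sizeof(struct S));
    return 0;
}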
fread reads data byte-for-byte from a file (stream) into memory. Therefore, if what you're trying to read is a struct, the byte layout of the struct in the file must exactly match the layout your compiler has chosen for the struct in memory.
So the question of "How does fread read from a file?" really boils down to, "How does the compiler lay out structs in memory?"
And the answer to that question is, it's partly determined by the rules of the C language, and it's partly up to the compiler.
So if you want to read structures from a file, you have three choices:
Learn everything you can about the C rules for laying out structures in memory, and the choices compilers can make in interpreting these rules. Keep all these rules in mind as you design your structures and your data file formats. (This is not an impossible task. Many programmers take this approach to file I/O all the time.)
Don't worry about the layout too much. Define your structures, and write them out to files using fwrite. Then the files are automatically readable using fread -- at least, as long as the program doing the reading is running on the same kind of machine, and was compiled by the same compiler using the same settings. (This, too, is a popular strategy, and works much of the time.)
Don't use fread to read structures from a file. (And although it sounds defeatist, this is my own preferred approach.)
There's much, much more that could be said about this question. If you choose approach 1, as I've already said, you're going to have to learn everything you can about the C rules for laying out structures in memory, and the choices compilers can make in interpreting those rules. If you choose approach 3, you have to learn some decent techniques for reading and writing data without using fwrite and fread. But I'm not going to launch into long explanations of either of those topics here. I'm sure someone else will post some links, or you could start with Chapter 17 of these C programming notes.
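To give a flavour of approach 3, here is a sketch that reads each field separately with an explicit byte order instead of fread-ing the whole struct (the little-endian layout is an assumed file format, not anything from the question):

#include <stdio.h>
#include <stdint.h>

struct S { int32_t x; char c; };

/* Read one record stored in the file as 4 little-endian bytes
   followed by 1 byte. Returns 1 on success, 0 on failure. */
int read_record(FILE *fp, struct S *s) {
    unsigned char buf[5];
    if (fread(buf, 1, sizeof buf, fp) != sizeof buf)
        return 0;
    s->x = (int32_t)((uint32_t)buf[0]
                   | ((uint32_t)buf[1] << 8)
                   | ((uint32_t)buf[2] << 16)
                   | ((uint32_t)buf[3] << 24));
    s->c = (char)buf[4];
    return 1;
}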

How does fread read different data blocks in a binary file in C?

I'm porting some C code to C#. I know little about C, but I'm flexible and I can learn new programming languages. Anyway, I wasn't able to figure out the exact behaviour of the code I'm porting.
I've read about fread() on the web.
fread(&(targetObj->data), sizeof(TestObj), 1, file);
Now, file is a big binary file with lots of data in it.
What I want to know is how I can do this in C#.
Let me explain:
I think that line of code does this:
TestObj is an unsigned short
it reads a chunk of data the size of TestObj (an unsigned short), 1 time
it reads from file (a pointer to a binary file on the filesystem) into targetObj->data
What I don't understand is:
I have a big binary file; what does it actually read? Are there headers somewhere that define where an unsigned-short-sized chunk of data is written?
Where in the binary does it take that object from? How can I know how to read it back from the binary file in C#? Maybe C knows where to pick up that single unsigned short, but I don't in C#.
For example, if that binary file has 40 unsigned shorts saved in it, does the C code line above read just the first one?
and if I do
fread(&(targetObj->data), sizeof(TestObj), 5, file);
is targetObj->data expected to be an array of 5 unsigned shorts?
And will the code read the first 5 unsigned shorts that it finds in the whole binary file?
I can't wrap my head around this, but I need to know how C recognizes that unsigned short in a big binary file whose content I don't know, nor can I think how to say in C# "read the first C unsigned short from that file".
fread just reads the specified number of bytes from the current file cursor position, and advances the file cursor (or "file pointer", not to be confused with a C pointer).
So if sizeof(TestObj) is 2, it will read two bytes and place them into the location pointed to by &(targetObj->data), with no bounds checking, and regardless of any differences between your architecture's endianness and the file protocol's endianness. Note that this approach is not a platform-independent way of parsing files containing numbers in binary form, since the number might be stored differently on your machine, compared to how it is stored inside the file (by whoever designed the binary protocol you are trying to read).
In C#, you might achieve a similar thing by manually specifying struct packing and field placement, although the code will suffer from the same problems as your C code.
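Back in C, if the file is known to store the value little-endian, a layout-independent read would look something like this sketch (the helper name is invented):

#include <stdio.h>
#include <stdint.h>

/* Read one 16-bit value stored little-endian in the file, regardless
   of the endianness of the machine this runs on. Returns 1 on success. */
int read_u16_le(FILE *fp, uint16_t *out) {
    unsigned char b[2];
    if (fread(b, 1, 2, fp) != 2)
        return 0;
    *out = (uint16_t)(b[0] | (b[1] << 8));
    return 1;
}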
fread reads from the current position in the stream; see also ftell and fseek. The equivalent in C# would be Stream.Read.
From man fread
size_t
fread(void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);
The function fread() reads nitems objects, each size bytes long, from the stream pointed to by stream, storing them at the location given by ptr.
sizeof(short) is resolved by the compiler, as per https://stackoverflow.com/a/14171152/6204612
And C does not do any pretty conversions for you. What is read is precisely sizeof(short) bytes, and these bytes are put into the TestObj variable. Whether this is correct or not is the implementer's responsibility. You need to manage offsets, collection sizes etc. on your own.
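For example, to pull out the Nth unsigned short you would position the cursor explicitly, as in this sketch (the helper name is invented, and it assumes the in-file size matches sizeof(unsigned short)):

#include <stdio.h>

/* Read record number n (0-based) from a file of raw unsigned shorts.
   Returns 1 on success, 0 on failure. */
int read_nth_ushort(FILE *fp, long n, unsigned short *out) {
    if (fseek(fp, n * (long)sizeof(unsigned short), SEEK_SET) != 0)
        return 0;
    return fread(out, sizeof(unsigned short), 1, fp) == 1;
}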

Writing a CFSTR to the terminal in Mac OS X

How best would I output cf_string in the following code?
#include <CoreFoundation/CoreFoundation.h> // Needed for CFSTR
int main(int argc, char *argv[])
{
    char *c_string = "Hello I am a C String. :-).";
    CFStringRef cf_string = CFStringCreateWithCString(0, c_string, kCFStringEncodingUTF8);
    // output cf_string here
}
There's no API to write a CFString directly to any file (including stdout or stderr), because you can only write bytes to a file. Characters are a (somewhat) more ideal concept; they're too high-level to be written to a file. It's like saying “I want to write these pixels”; you must first decide what format to write them in (say, PNG), and then encode them in that format, and then write that data.
So, too, with characters. You must encode them as bytes in some format, then write those bytes.
Encoding the characters as bytes/data
First, you must pick an encoding. For display on a Terminal, you probably want UTF-8, which is kCFStringEncodingUTF8. For writing to a file… you usually want UTF-8. In fact, unless you specifically need something else, you almost always want UTF-8.
Next, you must encode the characters as bytes. Creating a C string is one way; another is to create a CFData object; still another is to extract bytes (not null-terminated) directly.
To create a C string, use the CFStringGetCString function.
To extract bytes, use the CFStringGetBytes function.
You said you want to stick to CF, so we'll skip the C string option (which is less efficient anyway, since whatever calls write is going to have to call strlen)—it's easier, but slower, particularly when you use it on large strings and/or frequently. Instead, we'll create CFData.
Fortunately, CFString provides an API to create a CFData object from the CFString's contents. Unfortunately, this only works for creating an external representation. You probably do not want to write this to stdout; it's only appropriate for writing out as the entire contents of a regular file.
So, we need to drop down a level and get bytes ourselves. This function takes a buffer (region of memory) and the size of that buffer in bytes.
Do not use CFStringGetLength for the size of the buffer. That counts characters, not bytes, and the relationship between the number of characters and the number of bytes is not always one-to-one. (For example, some characters can be encoded in UTF-8 in a single byte… but not all. Not nearly all. And for the others, the number of bytes required varies.)
The correct way is to call CFStringGetBytes twice: once with no buffer (NULL), whereupon it will simply tell you how many bytes it'll give you (without trying to write into the buffer you haven't given it); then, you create a buffer of that size, and then call it again with the buffer.
You could create a buffer using malloc, but you want to stick to CF stuff, so we'll do it this way instead: create a CFMutableData object whose capacity is the number of bytes you got from your first CFStringGetBytes call, increase its length to that same number of bytes, then get the data's mutable byte pointer. That pointer is the pointer to the buffer you need to write into; it's the pointer you pass to the second call to CFStringGetBytes.
To recap the steps so far (a code sketch follows this list):
Call CFStringGetBytes with no buffer to find out how big the buffer needs to be.
Create a CFMutableData object of that capacity and increase its length up to that size.
Get the CFMutableData object's mutable byte pointer, which is your buffer, and call CFStringGetBytes again, this time with the buffer, to encode the characters into bytes in the data object.
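In code, those three steps might look like this (a sketch, reusing cf_string from the question):

// First call: no buffer; just ask how many bytes the encoding needs.
CFIndex length = CFStringGetLength(cf_string);
CFIndex bytesNeeded = 0;
CFStringGetBytes(cf_string, CFRangeMake(0, length), kCFStringEncodingUTF8,
                 0, false, NULL, 0, &bytesNeeded);
// Create a data object of that capacity and use its storage as the buffer.
CFMutableDataRef data = CFDataCreateMutable(kCFAllocatorDefault, bytesNeeded);
CFDataSetLength(data, bytesNeeded);
// Second call: actually encode the characters into the buffer.
CFStringGetBytes(cf_string, CFRangeMake(0, length), kCFStringEncodingUTF8,
                 0, false, CFDataGetMutableBytePtr(data), bytesNeeded, NULL);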
Writing it out
To write bytes/data to a file in pure CF, you must use CFWriteStream.
Sadly, there's no CF equivalent to nice Cocoa APIs like [NSFileHandle fileHandleWithStandardOutput]. The only way to create a write stream to stdout is to create it using the path to stdout, wrapped in a URL.
You can create a URL easily enough from a path; the path to the standard output device is /dev/stdout, so creating the URL looks like this:
CFURLRef stdoutURL = CFURLCreateWithFileSystemPath(kCFAllocatorDefault, CFSTR("/dev/stdout"), kCFURLPOSIXPathStyle, /*isDirectory*/ false);
(Of course, like everything you Create, you need to Release that.)
Having a URL, you can then create a write stream for the file so referenced. Then, you must open the stream, whereupon you can write the data to it (you will need to get the data's byte pointer and its length), and finally close the stream.
Note that you may have missing/un-displayed text if what you're writing out doesn't end with a newline. NSLog adds a newline for you when it writes to stderr on your behalf; when you write to stderr yourself, you have to do it (or live with the consequences).
So (a code sketch follows this list):
Create a URL that refers to the file you want to write to.
Create a stream that can write to that file.
Open the stream.
Write bytes to the stream. (You can do this as many times as you want, or do it asynchronously.)
When you're all done, close the stream.
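Putting the writing steps together (a sketch, reusing the data object from the encoding sketch above):

CFURLRef stdoutURL = CFURLCreateWithFileSystemPath(kCFAllocatorDefault,
    CFSTR("/dev/stdout"), kCFURLPOSIXPathStyle, /*isDirectory*/ false);
CFWriteStreamRef stream = CFWriteStreamCreateWithFile(kCFAllocatorDefault, stdoutURL);
if (CFWriteStreamOpen(stream)) {
    CFWriteStreamWrite(stream, CFDataGetBytePtr(data), CFDataGetLength(data));
    CFWriteStreamClose(stream);
}
// Everything you Create must be Released.
CFRelease(stream);
CFRelease(stdoutURL);
CFRelease(data);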

Saving data to a binary file

I would like to save a file as binary, because I've heard that it would probably be smaller than a normal text file.
Now I am trying to save a binary file with some text, but the problem is that the file just contains the text and a NUL byte at the end. I would expect to see only zeros and ones inside the file.
Any explanation or suggestions are highly appreciated.
Here is my code
#include <iostream>
#include <cstdlib>  /* for std::system */
#include <stdio.h>
int main()
{
    /* Temporary data buffer */
    char buffer[20];
    /* Data to be stored in file */
    char temp[20] = "Test";
    /* Opening file for writing in binary mode */
    FILE *handleWrite = fopen("test.bin", "wb");
    /* Writing data to file */
    fwrite(temp, 1, 13, handleWrite);
    /* Closing File */
    fclose(handleWrite);
    /* Opening file for reading */
    FILE *handleRead = fopen("test.bin", "rb");
    /* Reading data from file into temporary buffer */
    fread(buffer, 1, 13, handleRead);
    /* Displaying content of file on console */
    printf("%s", buffer);
    /* Closing File */
    fclose(handleRead);
    std::system("pause");
    return 0;
}
All files contain only ones and zeroes; on binary computers, that's all there is to play with.
When you save text, you are saving the binary representation of that text, in a given encoding that defines how each letter is mapped to bits.
So for text, a text file or a binary file almost doesn't matter; the savings in space that you've heard about generally come into play for other data types.
Consider a floating point number, such as 3.141592653589. If saved as text, that would take one character per digit (just count them), plus the period. If saved in binary as just a copy of the float's bits, it will take four bytes (32 bits) on a typical 32-bit system. The exact number of bits stored by a call such as:
FILE *my_file = fopen("pi.bin", "wb");
float x = 3.1415;
fwrite(&x, sizeof x, 1, my_file);
is CHAR_BIT * sizeof x; see <limits.h> for CHAR_BIT.
The problem you describe is a chain of (very common [1], unfortunately) mistakes and misunderstandings. Let me try to fully detail what is going on, hopefully you will take the time to read through all the material: it is lengthy, but these are very important basics that any programmer should master. Please do not despair if you do not fully understand all of it: just try to play around with it, come back in a week or two, practice, see what happens :)
There is a crucial difference between the concepts of a character encoding and a character set. Unless you really understand this difference, you will never really get what is going on, here. Joel Spolsky (one of the founders of Stackoverflow, come to think of it) wrote an article explaining the difference a while ago: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Before you continue reading this, before you continue programming, even, read that. Honestly, read it, understand it: the title is no exaggeration. You must absolutely know this stuff.
After that, let us proceed:
When a C program runs, a memory location that is supposed to hold a value of type "char" contains, just like any other memory location, a sequence of ones and zeroes. The "type" of a variable only means something to the compiler, not to the running program, which just sees ones and zeroes and does not know more than that. In other words: where you commonly think of a "letter" (an element from a character set) residing in memory somewhere, what is actually there is a bit sequence (an element from a character encoding).
Every compiler is free to use whatever encoding it wishes to represent characters in memory. As a consequence, it is free to represent what we call a "newline" internally as any number it chooses. For example, say I write a compiler; I can agree with myself that every time I want to store a "newline" internally I store it as the number six (6), which is 0x06 in hexadecimal (or 110 in binary).
Writing to a file is done by telling the operating system [2] four things at the same time:
The fact that you want to write to a file (fwrite())
Where the data starts that you want to write (first argument to fwrite)
How much data you want to write (second and third argument, multiplied)
What file you want to write to (last argument)
Note that this has nothing to do with the "type" of that data: your operating system has no idea, and does not care. It does not know anything about character sets and it does not care: it just sees a sequence of ones and zeroes starting somewhere and copies that to a file.
Opening a file in "binary" mode is actually the normal, intuitive way of dealing with files that a novice programmer would expect: the memory location you specify is copied one-on-one to the file. If you write a memory location that used to hold variables that the compiler decided to store as type "char", those values are written one-on-one to the file. Unless you know how the compiler stores values internally (what value it associates with a newline, with a letter 'a', 'b', etc), THIS IS MEANINGLESS. Compare this to Joel's similar point about a text file being useless without knowing what its encoding is: same thing.
Opening a file in "text" mode is almost equal to binary mode, with one (and only one) difference: anytime a value is written that has value equal to what the compiler uses INTERNALLY for the newline (6, in our case), it writes something different to the file: not that value, but whatever the operating system you are on considers to be a newline. On windows, this is two bytes (13 and 10, or 0x0d 0x0a, on Windows). Note, again, if you do not know about the compiler's choice of internal representation of the other characters, this is STILL MEANINGLESS.
Note at this point that it is pretty clear that writing anything but data that the compiler designated as characters to a file in text mode is a bad idea: in our case, a 6 might just happen to be among the values you are writing, in which case the output is altered in a way that we absolutely do not mean to.
(Un)Luckily, most (all?) compilers actually use the same internal representation for characters: this representation is US-ASCII and it is the mother of all defaults. This is the reason you can write some "characters" to a file in your program, compiled with any random compiler, and then open it with a text editor: they all use/understand US-ASCII and it happens to work.
OK, now to connect this to your example: why is there no difference between writing "test" in binary mode and in text mode? Because there is no newline in "test", that is why!
And what does it mean when you "open a file", and then "see" characters? It means that the program you used to inspect the sequence of ones and zeroes in that file (because everything is ones and zeroes on your hard disk) decided to interpret that as US-ASCII, and that happened to be what your compiler decided to encode that string as, in its memory.
Bonus points: write a program that reads the ones and zeroes from a file into memory and prints every BIT (there are multiple bits making up one byte; to extract them you need to know 'bitwise' operator tricks, google!) as a "1" or "0" to the user. Note that "1" is the CHARACTER 1, i.e. a point in the character set of your choosing, so your program must take a bit (the number 1 or 0) and transform it into the sequence of bits needed to represent the character 1 or 0 in the encoding that the terminal emulator uses, on which you are viewing the program's standard output... oh my God. Good news: you can take lots of short-cuts by assuming US-ASCII everywhere. This program will show you what you wanted: the sequence of ones and zeroes that your compiler uses to represent "test" internally.
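A sketch of that bonus program, taking the US-ASCII short-cut (the file name is just an example):

#include <stdio.h>

/* Print every bit of every byte in a file as the CHARACTERS '1' and '0'
   (the US-ASCII short-cut: putchar('1') writes the byte 0x31). */
int main(void) {
    FILE *fp = fopen("test.bin", "rb");  /* example file name */
    int byte;
    if (!fp) return 1;
    while ((byte = fgetc(fp)) != EOF) {
        for (int bit = 7; bit >= 0; bit--)         /* most significant first */
            putchar(((byte >> bit) & 1) ? '1' : '0');
        putchar(' ');
    }
    putchar('\n');
    fclose(fp);
    return 0;
}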
This stuff is really daunting for newbies, and I know it took me a long time to even learn that there is a difference between a character set and an encoding, let alone how all of this works. Hopefully I did not demotivate you; if I did, just remember that you can never lose knowledge you already have, only gain it (ok, not always true :P). It is normal in life that a statement raises more questions than it answers; Socrates knew this, and his wisdom seamlessly applies to modern-day technology 2.4k years later.
Good luck, do not hesitate to continue asking. To other readers: please feel welcome to improve this post if you see errors.
Hraban
[1] The person who told you that "saving a file in binary is probably smaller", for example, probably gravely misunderstands these fundamentals. Unless he was referring to compressing the data before you save it, in which case he just uses a confusing word ("binary") for "compressed".
[2] "Telling the operating system something" is what is commonly known as a system call.
Well, the difference between text and binary mode is the way the end of line is handled.
If you write a string in binary mode, it will stay the same string.
If you want to make it smaller, you'll have to compress it somehow (look at zlib, for example).
What is smaller: when you want to save binary data (like an array of bytes), it's smaller to save it as binary rather than putting it in a string (either in hex representation or Base64). I hope this helps.
I think you're a bit confused here.
The ASCII string "Test" will still be an ASCII string when you write it to the file (even in binary mode). Writing in binary makes sense for types other than char (e.g. an array of integers).
try replacing
FILE *handleWrite = fopen("test.bin", "wb");
fwrite(temp, 1, 13, handleWrite);
with
FILE *handleWrite = fopen("test.bin", "w");
fprintf(handleWrite, "%s", temp);
The call printf("%s", buffer); prints buffer as a zero-terminated string.
Try using:
char temp[20] = "Test\n\rTest";

What's a good coding style for reading different bits of data from a binary file in C?

I'm a novice programmer writing a simple wav-player in C as a pet project. Part of the file-loading process requires reading specific data (sampling rate, number of channels, ...) from the file header.
Currently what I'm doing is similar to this:
Scan for a sequence of bytes and skip past it
Read 2 bytes into variable a
Check value and return on error
Skip 4 bytes
Read 4 bytes into variable b
Check value and return on error
...and so on. (code see: https://github.com/qgi/Player/blob/master/Importer.c)
I've written a number of helper functions to do the scanning/skipping/reading. Still, I'm repeating the read-check-skip pattern several times, which seems neither very efficient nor very smart. It's not a real issue for my project, but as this seems to be quite a common task when handling binary files, I was wondering:
Is there some kind of a pattern on how to do this more effectively with cleaner code?
Most often, people define structs (often with something like #pragma pack(1) to guard against padding) that match the file's structures. They then read data into an instance of that struct with something like fread, and use the values from the struct.
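For example, the fixed part of a RIFF/WAV header could be mirrored like this (a sketch; #pragma pack is compiler-specific, and endianness is ignored here):

#include <stdio.h>
#include <stdint.h>

#pragma pack(push, 1)             /* no padding between members */
struct riff_header {
    char     chunk_id[4];         /* "RIFF" */
    uint32_t chunk_size;
    char     format[4];           /* "WAVE" */
};
#pragma pack(pop)

/* The struct now matches the file layout byte for byte,
   so one fread fills the whole header. Returns 1 on success. */
int read_header(FILE *fp, struct riff_header *h) {
    return fread(h, sizeof *h, 1, fp) == 1;
}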
The cleanest option that I've come across is the scanf-like function unpack presented by Kernighan & Pike on page 219 of The Practice of Programming, which can be used like
// assume we read the file header into buf
// and the header consists of magic (4 bytes), type (2) and length (4).
// "l" == 4 bytes (long)
// "s" == 2 bytes (short)
unpack(buf, "lsl", &magic, &type, &length);
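A minimal sketch of such an unpack (not the book's exact code: Kernighan & Pike unpack big-endian "network order" values, while WAV data is little-endian, so this variant decodes little-endian and assumes uint16_t* and uint32_t* arguments):

#include <stdarg.h>
#include <stdint.h>

int unpack(const unsigned char *buf, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0'; p++) {
        if (*p == 's') {                       /* 2-byte field */
            uint16_t *v = va_arg(ap, uint16_t *);
            *v = (uint16_t)(buf[0] | (buf[1] << 8));
            buf += 2;
        } else if (*p == 'l') {                /* 4-byte field */
            uint32_t *v = va_arg(ap, uint32_t *);
            *v = (uint32_t)buf[0]
               | ((uint32_t)buf[1] << 8)
               | ((uint32_t)buf[2] << 16)
               | ((uint32_t)buf[3] << 24);
            buf += 4;
        } else {
            va_end(ap);
            return -1;                         /* unknown format char */
        }
    }
    va_end(ap);
    return 0;
}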
For efficiency, reading into a buffer of, say, 4096 bytes and then parsing the data in the buffer would be more efficient, and of course a single-pass, forward-only parse is the most efficient of all.
