Saving data to a binary file - c

I would like to save a file as binary, because I've heard that it would probably be smaller than a normal text file.
Now I am trying to save a binary file with some text, but the problem is that the file just contains the text and NULL at the end. I would expect to see only zeros and ones inside the file.
Any explanation or suggestions are highly appreciated.
Here is my code
#include <iostream>
#include <cstdlib>   /* for std::system */
#include <stdio.h>

int main()
{
    /* Temporary data buffer */
    char buffer[20];
    /* Data to be stored in file */
    char temp[20] = "Test";
    /* Opening file for writing in binary mode */
    FILE *handleWrite = fopen("test.bin", "wb");
    /* Writing data to file */
    fwrite(temp, 1, 13, handleWrite);
    /* Closing File */
    fclose(handleWrite);
    /* Opening file for reading */
    FILE *handleRead = fopen("test.bin", "rb");
    /* Reading data from file into temporary buffer */
    fread(buffer, 1, 13, handleRead);
    /* Displaying content of file on console */
    printf("%s", buffer);
    /* Closing File */
    fclose(handleRead);
    std::system("pause");
    return 0;
}

All files contain only ones and zeroes; on binary computers that's all there is to play with.
When you save text, you are saving the binary representation of that text, in a given encoding that defines how each letter is mapped to bits.
So for text, a text file versus a binary file makes almost no difference; the savings in space that you've heard about generally come into play for other data types.
Consider a floating point number, such as 3.141592653589. If saved as text, that would take one character per digit (just count them), plus the period. If saved in binary as just a copy of the float's bits, it will take four bytes (32 bits) on a typical system. The exact number of bits stored by a call such as:
FILE *my_file = fopen("pi.bin", "wb");
float x = 3.1415;
fwrite(&x, sizeof x, 1, my_file);
is CHAR_BIT * sizeof x; see <limits.h> for CHAR_BIT.
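To make the size difference concrete, here is a minimal sketch (not part of the original answer; file names invented, error checks omitted for brevity). It writes the same number once as text and once as a raw binary copy; double is used so the decimal digits above fit exactly:
#include <stdio.h>

int main(void)
{
    double x = 3.141592653589;

    /* Text: one byte per printed character ("3.141592653589" is 14 bytes). */
    FILE *as_text = fopen("pi.txt", "w");
    fprintf(as_text, "%.12f", x);
    fclose(as_text);

    /* Binary: a verbatim copy of the bits, sizeof(double) bytes (typically 8). */
    FILE *as_binary = fopen("pi.bin", "wb");
    fwrite(&x, sizeof x, 1, as_binary);
    fclose(as_binary);

    return 0;
}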

The problem you describe is a chain of (very common[1], unfortunately) mistakes and misunderstandings. Let me try to fully detail what is going on; hopefully you will take the time to read through all the material: it is lengthy, but these are very important basics that any programmer should master. Please do not despair if you do not fully understand all of it: just try to play around with it, come back in a week or two, practice, see what happens :)
There is a crucial difference between the concepts of a character encoding and a character set. Unless you really understand this difference, you will never really get what is going on, here. Joel Spolsky (one of the founders of Stackoverflow, come to think of it) wrote an article explaining the difference a while ago: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Before you continue reading this, before you continue programming, even, read that. Honestly, read it, understand it: the title is no exaggeration. You must absolutely know this stuff.
After that, let us proceed:
When a C program runs, a memory location that is supposed to hold a value of type "char" contains, just like any other memory location, a sequence of ones and zeroes. The "type" of a variable only means something to the compiler, not to the running program, which just sees ones and zeroes and does not know more than that. In other words: where you commonly think of a "letter" (an element from a character set) residing in memory somewhere, what is actually there is a bit sequence (an element from a character encoding).
Every compiler is free to use whatever encoding it wishes to represent characters in memory. As a consequence, it is free to represent what we call a "newline" internally as any number it chooses. For example, say I write a compiler, I can agree with myself that every time I want to store a "newline" internally I store it as the number six (6), which is just 0x6 in hexadecimal (or 110 in binary).
Writing to a file is done by telling the operating system[2] four things at the same time:
The fact that you want to write to a file (fwrite())
Where the data starts that you want to write (first argument to fwrite)
How much data you want to write (second and third argument, multiplied)
What file you want to write to (last argument)
Note that this has nothing to do with the "type" of that data: your operating system has no idea, and does not care. It does not know anything about character sets and it does not care: it just sees a sequence of ones and zeroes starting somewhere and copies that to a file.
Opening a file in "binary" mode is actually the normal, intuitive way of dealing with files that a novice programmer would expect: the memory location you specify is copied one-to-one to the file. If you write a memory location that used to hold variables that the compiler decided to store as type "char", those values are written one-to-one to the file. Unless you know how the compiler stores values internally (what value it associates with a newline, with a letter 'a', 'b', etc), THIS IS MEANINGLESS. Compare this to Joel's similar point about a text file being useless without knowing what its encoding is: same thing.
Opening a file in "text" mode is almost identical to binary mode, with one (and only one) difference: any time a value is written that is equal to what the compiler uses INTERNALLY for the newline (6, in our case), it writes something different to the file: not that value, but whatever the operating system you are on considers to be a newline. On Windows, this is two bytes (13 and 10, or 0x0d 0x0a). Note, again, if you do not know about the compiler's choice of internal representation of the other characters, this is STILL MEANINGLESS.
Note at this point that it is pretty clear that writing anything but data that the compiler designated as characters to a file in text mode is a bad idea: in our case, a 6 might just happen to be among the values you are writing, in which case the output is altered in a way we absolutely do not intend.
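To see that hazard in practice, here is a small sketch (mine, not the answer author's) that writes the same five bytes once in binary mode and once in text mode; it uses the real ASCII newline value 10 rather than the hypothetical 6 above. On Windows the text-mode file ends up 6 bytes long because the 10 is expanded to 13 10; on POSIX systems both files are 5 bytes.
#include <stdio.h>

int main(void)
{
    unsigned char data[] = { 1, 2, 10, 4, 5 };   /* 10 == 0x0A == '\n' */

    FILE *bin = fopen("raw.bin", "wb");          /* binary: copied verbatim */
    fwrite(data, 1, sizeof data, bin);
    fclose(bin);

    FILE *txt = fopen("raw.txt", "w");           /* text: newline translated */
    fwrite(data, 1, sizeof data, txt);
    fclose(txt);

    return 0;
}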
(Un)Luckily, most (all?) compilers actually use the same internal representation for characters: this representation is US-ASCII and it is the mother of all defaults. This is the reason you can write some "characters" to a file in your program, compiled with any random compiler, and then open it with a text editor: they all use/understand US-ASCII and it happens to work.
OK, now to connect this to your example: why is there no difference between writing "test" in binary mode and in text mode? Because there is no newline in "test", that is why!
And what does it mean when you "open a file", and then "see" characters? It means that the program you used to inspect the sequence of ones and zeroes in that file (because everything is ones and zeroes on your hard disk) decided to interpret that as US-ASCII, and that happened to be what your compiler decided to encode that string as, in its memory.
Bonus points: write a program that reads the ones and zeroes from a file into memory and prints every BIT (there are multiple bits making up one byte; to extract them you need to know 'bitwise' operator tricks, google!) as a "1" or "0" to the user. Note that "1" is the CHARACTER 1, the point in the character set of your choosing, so your program must take a bit (the number 1 or 0) and transform it into the sequence of bits needed to represent the character 1 or 0 in the encoding used by the terminal emulator you are viewing the program's standard output on (oh my God). Good news: you can take lots of short-cuts by assuming US-ASCII everywhere. This program will show you what you wanted: the sequence of ones and zeroes that your compiler uses to represent "test" internally.
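For what it's worth, here is one rough take on that bonus exercise, assuming 8-bit bytes and US-ASCII as suggested; it reads the test.bin file from the question and prints each byte as eight '1'/'0' characters, most significant bit first:
#include <stdio.h>

int main(void)
{
    FILE *in = fopen("test.bin", "rb");
    if (in == NULL)
        return 1;

    int byte;
    while ((byte = fgetc(in)) != EOF) {
        /* Walk the bits from most significant to least significant. */
        for (int bit = 7; bit >= 0; bit--)
            putchar((byte >> bit) & 1 ? '1' : '0');
        putchar(' ');       /* group the output one byte at a time */
    }
    putchar('\n');

    fclose(in);
    return 0;
}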
This stuff is really daunting for newbies, and I know that it took me a long time to even know that there was a difference between a character set and an encoding, let alone how all of this worked. Hopefully I did not demotivate you; if I did, just remember that you can never lose knowledge you already have, only gain it (ok, not always true :P). It is normal in life that a statement raises more questions than it answers; Socrates knew this, and his wisdom seamlessly applies to modern day technology 2.4k years later.
Good luck, do not hesitate to continue asking. To other readers: please feel welcome to improve this post if you see errors.
Hraban
[1] The person who told you that "saving a file in binary is probably smaller", for example, probably gravely misunderstands these fundamentals. Unless he was referring to compressing the data before you save it, in which case he just uses a confusing word ("binary") for "compressed".
[2] "Telling the operating system something" is what is commonly known as a system call.

Well, the difference between text and binary mode is the way the end of line is handled.
If you write a string in binary mode, it will stay the same string.
If you want to make it smaller, you'll have to compress it somehow (look at zlib, for example).
What is smaller: when you want to save binary data (like an array of bytes), it's smaller to save it as binary rather than putting it in a string (either in hex representation or base64). I hope this helps.

I think you're a bit confused here.
The ASCII string "Test" will still be an ASCII string when you write it to the file (even in binary mode). The cases where it makes sense to write binary are for types other than char (e.g. an array of integers).
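As a small illustration of that case (a sketch of mine, with a made-up file name), writing an array of ints in binary mode copies their raw bytes in one call:
#include <stdio.h>

int main(void)
{
    int values[] = { 10, 200, 3000, 40000 };

    FILE *out = fopen("values.bin", "wb");
    if (out == NULL)
        return 1;

    /* Writes sizeof(int) * 4 raw bytes, rather than the decimal digits
     * plus separators that a text file would hold. */
    fwrite(values, sizeof values[0], 4, out);
    fclose(out);
    return 0;
}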

try replacing
FILE *handleWrite=fopen("test.bin","wb");
fwrite(temp, 1, 13, handleWrite);
with
FILE *handleWrite=fopen("test.bin","w");
fprintf(handleWrite, "%s", temp);

The call printf("%s", buffer); prints buffer as a zero-terminated string.
Try to use:
char temp[20]="Test\n\rTest";

Related

Confusion about files in C?

I want to work with files in C. I know that the standard library has some functions to help with that. However, I have some questions. Are the contents of a file just an array of characters, or are they a combination of ints, chars, doubles and other data types? What are files terminated by in C?
A few things jump to mind, such as EOF and NULL, but I'm not sure.
C is extremely basic in that the contents of a file are just a byte stream until you write code that interprets them otherwise.
Keep in mind that even the concept of something like an int is hazy at best once it hits a file, due to endianness issues. While any given C compiler has strong opinions on the form an int takes, these are architecture-specific and not portable. Over the decades an "integer" has meant anything from binary-coded-decimal values to "words" of varying sizes, and likely even stranger things along the way.
Files aren't "terminated" by anything in C. They're just files. They have zero or more bytes of data, and possibly holes to keep things interesting.
When reading a file you read in either a fixed amount of data, or read until you bump into the "End of File".
Remember, some files may be actively written to, so the EOF position might be constantly changing.
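A typical "read until End of File" loop looks roughly like this (a sketch with an invented file name): keep calling fread until it returns fewer bytes than requested, then use ferror() to tell a real error apart from a plain EOF.
#include <stdio.h>

int main(void)
{
    FILE *in = fopen("input.dat", "rb");
    if (in == NULL)
        return 1;

    unsigned char chunk[4096];
    size_t got;
    while ((got = fread(chunk, 1, sizeof chunk, in)) > 0) {
        /* process 'got' bytes of chunk here */
    }

    if (ferror(in))
        perror("fread");
    fclose(in);
    return 0;
}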
It's also important to keep in mind that NULL in C means a null pointer. When talking about the zero byte used as a string terminator, it's often written NUL to avoid confusion, a term captured in the ASCII standard though it actually pre-dates that standard.

How do fread and fwrite distinguish between different data (types) in C?

I am working with a program in C (on Ubuntu, with bash) and using it to manipulate binary data files. First of all, when I use fopen(filename, "w") it creates a file, but without any extension. However, when I use vim filename it opens it up in some binary form.
For this question: when I use fwrite(array, sizeof(some struct), # of structs, filePointer) it writes the structs into the file (in a binary form I am not sure I understand). When I use fread(anotherArray, sizeof(same struct), same # of structs, anotherFilePointer) it somehow magically knows how to read each struct in binary form and puts it into the array, just by knowing its size and how much to read. What happens if I put a value less than the number of structs there are in the # of structs parameter? How would fread know what to read correctly? How does it work, reading data just by looking at the sizes and not knowing what type of data it is?
fwrite writes the bytes of the memory where the object is stored to the output stream and fread reads bytes from the input stream into the memory whose address it gets as an argument. No assumption is made regarding the types and representations of the C objects stored in this memory.
Hence a number of problems can occur:
the representation of basic types can differ from one compiler to another, one machine to another, one OS to another, possibly even depending on compiler switches. Writing the bytes of the memory representation of basic types makes sense only if you know you will be reading the file back into byte-compatible structures.
the mode for accessing the input and output files matters: as you mention, files must be open in binary mode to avoid any translation between memory representation and file contents such as what happens for text files on legacy systems. For example text mode on MS-Windows causes 0A bytes to convert to 0D 0A sequences on output and 0D bytes to be stripped on input, resulting in different contents for isolated 0D bytes in the initial content.
if the C structure contains pointers, the bytes written to the output represent the value of these pointers, not what they point to. Reading these values back into memory is highly likely to create invalid pointers and very unlikely to make any sense.
if the C structure has a flexible array member at the end, its contents are not included in the sizeof(T) bytes written by fwrite or read by fread.
the C structure may contain padding between members, causing the output file to contain non-deterministic bytes, which might be a problem in some circumstances.
if the C structure has arrays with only partial meaningful contents, such as char arrays containing C strings, beware that fwrite will write the bytes beyond the null terminator, which should not be meaningful, but might be sensitive information such as password fragments or other meaningful data. Carefully erasing such arrays may avoid this issue, but padding bytes cannot be erased reliably, so this solution is not perfect.
For all the above reasons and others, reading/writing binary data should be reserved for very specific cases where the programmer knows exactly what is happening. For other purposes, saving as text files in human-readable form is much preferred.
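For reference, here is a minimal sketch of the raw round-trip being discussed (the struct and file name are invented); as the points above warn, it only makes sense when the reader is byte-compatible with the writer:
#include <stdio.h>

struct record {
    int    id;
    double value;
};

int main(void)
{
    struct record out = { 42, 3.5 }, in = { 0, 0.0 };

    FILE *f = fopen("record.bin", "wb");
    if (f == NULL)
        return 1;
    fwrite(&out, sizeof out, 1, f);        /* raw bytes of the struct */
    fclose(f);

    f = fopen("record.bin", "rb");
    if (f == NULL)
        return 1;
    fread(&in, sizeof in, 1, f);           /* same size, same layout assumed */
    fclose(f);

    printf("id=%d value=%f\n", in.id, in.value);
    return 0;
}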
From @David C. Rankin in the question comments:
"Well, fread/fwrite read and write bytes (binary data - if you write out then read in the same number of bytes -- you get the same thing back). If you want to read and write text where you need to worry about line-breaks, etc., use fgets/fputs or fprintf."
So I guess I can never know what I read in with fread unless I know what I wrote with fwrite?
"Right, look at the type for your buffer in fwrite(3) - Linux man page: it is type void *. It's just a starting address for fwrite to use in writing however many bytes you told it to write. (Obviously you know what it is writing.) The same for fread -- it just reads bytes -- you have to know what you are reading (or at least the format of it). That's what binary I/O is about, it's all just bytes -- it's up to you, the programmer, to know what you are writing and reading and how to unpack it. Otherwise, use formatted I/O and lines, words, etc."

C binary file versus text file efficiency

I'm quite new to C and I would like some help.
Let's say I need to store only 6-digit numbers in a file (let's assume the size of int equals 4).
What would be more efficient (in terms of memory): using a text file or a binary file? I am not really sure how to approach this problem; any help will be welcome.
Most people classify files into two categories: binary files and ASCII (text) files. You've actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an ASCII file.
An ASCII file is defined as a file that consists of ASCII characters. It's usually created by using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for writing code, but they may not always save it as ASCII. ASCII is an international standard.
Computer science is all about creating good abstractions. Sometimes it succeeds and sometimes it doesn't. Good abstractions are all about presenting a view of the world that the user can use. One of the most successful abstractions is the text editor.
When you're writing a program, and typing in comments, it's hard to imagine that this information is not being stored as characters. ASCII/text files are really stored as 0's and 1's.
Files are stored on disks, and disks have some way to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction. Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think of them that way.
In effect, ASCII files are basically binary files, because they store binary numbers. That is, ASCII files store 0's and 1's.
The Difference between ASCII and Binary Files?
An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.
However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.
Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.
A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.
Example of ASCII files
Suppose you're editing a text file with a text editor. Because you're using a text editor, you're pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters 'c', then 'a', then 't'. Then, you save the file and quit.
What happens? For the time being, we won't worry about the mechanism of what it means to open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.
If you look up an ASCII table, you will discover that the ASCII codes for 'c', 'a', and 't' are 0x63, 0x61, and 0x74 (the 0x merely indicates the values are in hexadecimal, instead of decimal/base 10).
Here's how it looks:
Character  'c'        'a'        't'
Hex        63         61         74
Binary     0110 0011  0110 0001  0111 0100
Each time you type in an ASCII character and save it, an entire byte is written which corresponds to that character. This includes punctuations, spaces, and so forth.
Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
Now sometimes a text editor throws in characters you may not expect. For example, some editors "insist" that each line end with a newline character.
The only place a file can be missing a newline at the end of the line is the very last line. Some editors allow the very last line to end in something besides a newline character. Some editors add a newline at the end of every file.
Unfortunately, even the newline character is not all that universally standard. It's common to use newline characters on UNIX files, but in Windows, it's common to use two characters to end each line (carriage return then line feed, i.e. \r followed by \n). Why two characters when only one is necessary?
This dates back to printers. In the old days, the time it took for a printer to return back to the beginning of a line was equal to the time it took to type two characters. So, two characters were placed in the file to give the printer time to move the printer ball back to the beginning of the line.
This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case you've wondered why transferring files to UNIX from Windows sometimes generates funny characters.
Editing Binary Files
Now that you know that each character typed in an ASCII file corresponds to one byte in a file, you might understand why it's difficult to edit a binary file.
If you want to edit a binary file, you really would like to edit individual bits. For example, suppose you want to write the binary pattern 1100 0011. How would you do this?
You might be naive, and type in the following in a file:
11000011
But you should know, by now, that this is not editing individual bits of a file. If you type in '1' and '0', you are really entering 0x31 and 0x30. That is, you're entering 0011 0001 and 0011 0000 into the file. You're actually (indirectly) typing 8 bits at a time.
There are some programs that allow you to type in 49 and translate it to a single byte, 0100 1001, instead of the ASCII codes for '4' and '9'. You can call these programs hex editors. Unfortunately, these may not be so readily available. It's not too hard to write a program that reads in an ASCII file that looks like hex pairs and converts it to a true binary file with the corresponding bit patterns.
That is, it takes a file that looks like:
63 a0 de
and converts this ASCII file to a binary file that begins 0110 0011 (which is hex 63 in binary). Notice that the input file is ASCII, which means what's really stored is the ASCII codes for '6', '3', ' ' (space), 'a', '0', and so forth. A program can read this ASCII file, then generate the appropriate binary code and write that to a file.
Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the output binary file would contain 3 bytes, one byte per hex pair.
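A bare-bones version of such a converter might look like this (my sketch; file names invented). fscanf with %x skips the spaces and reads one hex pair at a time, and each value is written out as a single raw byte:
#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("bytes.hex", "r");
    FILE *out = fopen("bytes.bin", "wb");
    if (in == NULL || out == NULL)
        return 1;

    unsigned int value;
    /* %x skips whitespace, then reads one hex number such as "63" or "a0" */
    while (fscanf(in, "%x", &value) == 1) {
        unsigned char byte = (unsigned char)value;
        fwrite(&byte, 1, 1, out);
    }

    fclose(in);
    fclose(out);
    return 0;
}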
Writing Binary Files
Why do people use binary files anyway? One reason is compactness. For example, suppose you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters (which is 6 bytes). However, if you represent it as unsigned binary, you can write it out using 4 bytes.
ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space. You can represent information more compactly by using binary files.
For example, one thing you can do is to save an object to a file. This is a kind of serialization. To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object and the number of bytes used to represent the object (use the sizeof operator to determine this) to the write() method. The method then dumps out the bytes as it appears in memory into a file.
You can then recover the information from the file and place it into the object by using a corresponding read() method which typically takes a pointer to an object (and it should point to an object that has memory allocated, whether it be statically or dynamically allocated) and the number of bytes for the object, and copies the bytes from the file into the object.
Of course, you must be careful. If you use two different compilers, or transfer the file from one kind of machine to another, this process may not work. In particular, the object may be laid out differently. This can be as simple as endianness, or there may be issues with padding.
This way of saving objects to a file is nice and simple, but it may not be all that portable. Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will write out the addresses to the file. Those addresses are likely to be totally meaningless. Addresses may make sense at the time a program is running, but if you quit and restart, those addresses may change.
This is why some people invent their own format for storing objects: to increase portability.
But if you know you aren't storing objects that contain pointers, and you are reading the file in on the same kind of computer system you wrote it on, and you're using the same compiler, it should work.
This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire objects. They tend to be somewhat more portable.
An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to 0. Think of an ASCII file as a special kind of binary file.
A generic binary file uses all 8-bits. Each byte of a binary file can have the full 256 bitstring patterns (as opposed to an ASCII file which only has 128 bitstring patterns).
There may be a time when Unicode text files become more prevalent. But for now, ASCII files are the standard format for text files.
A binary file is basically any file that is not "line-oriented": any file where, besides the actual written characters and newlines, there are other symbols as well.
Usually when you write a file in text mode, any newline \n will be translated to a carriage return + line feed \r\n (on Windows, that is).
There isn't any memory efficiency that can be achieved by using a binary file as opposed to a text file; files are stored on disk and not in memory. It all depends on what you want to do with the file and how you wish to format it.
Since you are working with pure integers (regardless of what the int size is), using a text or binary file will have the same impact on performance (meaning that it won't make any difference which type you choose to work with).
If you want to later modify or read the file in a text editor, it is best to use text mode to write the file.

C: creating an archive file header

I am creating a file archiver/extractor (like tar), using POSIX API system calls in C. I have done part of the archiving bit.
I would like to know if anyone could help me with some C source code (using the above) to create a file header for a file in C (where the header acts as an index), which describes the file's attributes/metadata (name, date and time, etc). All I have done so far is understand (not sure if that's even correct) that to create a file header I need
a struct to hold the metadata, and lseek to seek to the beginning/end of the file,
like:
FileName=file.txt FileSize=0
FileDir=./blah/blah
FilePerms=000
\n\n
The archiving part of the program has this process:
1. Get a list of all the files from the command line. (I can do this part)
2. Create a structure to hold the metadata about each file: name (255 char), size (64-bit int), date and time, and permissions.
3. For each file, get its stats.
4. Store the stats of each file within an array of structures.
5. Open the archive for writing. (I can do this part)
6. Write the header structure.
7. For each file, append its content to the archive file (at the end/start of each file).
8. Close the archive file. (I can do this part)
I am having difficulty creating the header as a whole, even though I know what it needs to do; the parts I can't do are the numbered points above (2, 3, 4, 6, 7).
Any help will be appreciated.
Thanks.
As ijw notes, there are several ways to create an archive file header. If cross-platform portability is going to be an issue at all - or if you need to switch between 32-bit and 64-bit builds of the software on the same platform, even - then you need to ensure that the sizes and layouts of the fields are fully understood on all platforms.
Per-file Metadata
One way to do that is to use a fixed format binary header with types of known size and endianness. This is what ijw suggested. However, you will need to handle long file names, and so you will need to store a length (probably in a 2-byte unsigned integer) and then follow that with the actual pathname.
The alternative, and generally now favoured technique, is to use printable fields (often called ASCII format, though that is something of a misnomer). The time is recorded as the decimal number of seconds since the Epoch converted to a string, etc. This is what modern ar archives use; it is what GNU tar does (more or less; there are some historical quirks that make that more confusing); it is what cpio -c (which is usually the default these days) does. The fields might be separated by nulls or spaces; there is an easy way to detect the end of the header; the header contains information about the file name (not necessarily as directly as you'd like or expect, but again, that is usually because the format has evolved over the years), and then is followed by the actual data. Somehow, you know the size of each field, and the file which the header describes, so that you can read the data reliably.
Efficiency is a red herring. The conversion to/from the text format is so swift by comparison with the first disk access that there is essentially no measurable performance issue. And the guaranteed portability typically far outweighs the (microscopic) performance benefit from using binary data format instead - doubly so when the binary data has to be transformed on input or output anyway to get it into an architecture-neutral format.
Central Index vs Distributed Index
The other issue to consider is whether the index of files in the archive is centralized (at the front, or at the end) or distributed (the metadata for each file immediately precedes the data for the file). There are some advantages to each format - generally, systems use the distributed version because you can write the information for each file without knowing how many files there are to process in total (for example, because you are recursively archiving a directory's contents). Having a central index up front means you can list the files without reading the whole archive - distributed metadata means you have to read the whole file. However, the central index complicates the building of the archive.
Note that even with a distributed index, you will normally need a header for the archive as a whole so that you can detect that the file is in the format you expect. Typically, there is some sort of marker information (!<arch>\n for an ar archive, usually; %PDF-1.2\n at the start of a PDF file, etc) to reassure you that the file contains what you expect. There might be some overall (archive-level) metadata. Then you will have the first file metadata followed by the file data, repeating until the end of archive (which might, or might not, have a formal end marker - more metadata).
[H]ow would I go about implementing it in the 'fixed format binary header' you suggested. I am having trouble with deciding what commands/functions are needed.
I intended to suggest that you do not go with a fixed format binary header; you should use a text-based header format. If you can work out how to do the binary format, be my guest (I've done it numerous times over the years - that doesn't mean I think it is a good idea).
So, some pointers here towards the 'text header' format.
For the file metadata, you might define that you include:
size
mode (permissions, type)
owner
group
modification time
length of name
name
You might reasonably decide that your file sizes are limited to 64-bit unsigned integer quantities, which means 20 decimal digits. The mode might be printed as a 16-bit octal number, requiring 6 octal digits. The owner and group might be printed as UID and GID values (rather than name), in which case you could use 10 digits for each. Alternatively, you could decide to use names, but you should then allow for names up to say 32 characters each. Note that names are typically more portable than numbers. Neither name nor number is of much relevance on the receiving machine unless you extract the data as root (but why would you want to do that?). The modification time is classically a 32-bit signed integer, representing the number of seconds since the Epoch (1970-01-01 00:00:00Z). You should allow for the Y2038 bug by allowing the number of seconds to grow bigger than the 32-bit quantity; you might decide that 12 leading digits will take you beyond the Y10K crisis (by a factor of 4 or so), and this is good enough; you might decide to allow for fractional seconds too. Together, this suggests that 26 spaces for the timestamp should be overkill. You can decide that each field will be separated from the next by a space (for legibility - think 'ease of debugging'!). You might reasonably decide that the length of any file name can be written in at most 4 decimal digits (that is, names shorter than 10,000 characters).
You need to know how to format the types portably - #include <inttypes.h> is your friend.
You then devise a format string for printing (writing) the file metadata, and a parallel string for scanning (reading) the file metadata.
Printing:
"%20" PRIu64 " %06o %-.32s %-.32s %26" PRIu64 " %-4d %s\n"
This prints the name too. It terminates the header with a newline. The total size is 127 bytes plus the length of the file name. That's probably excessive, but you can tweak the numbers to suit yourself.
Scanning:
"%" SCNu64 " %o %.32s %.32s %" SCNu64 "%d"
This does not scan the name; you need to create the scanner for the name carefully, not least because you need to read spaces in the name. In fact, the code for scanning the user name and group name both assume no spaces too. If that is not acceptable (that is, names may contain spaces), then you need a more complex scan format, or something other than sscanf() to process the input data.
I'm assuming a 64-bit integer for the time field, rather than mixing up fractional seconds, etc, even though there's enough space to allow for fractional seconds. You'd likely save some space here.
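To make that concrete, here is a hedged sketch of emitting one such header line with the print format above; the metadata values are invented, and <inttypes.h> supplies PRIu64 as mentioned earlier:
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint64_t size  = 1234;
    unsigned mode  = 0644;
    const char *owner = "alice";
    const char *group = "users";
    uint64_t mtime = 1600000000;        /* seconds since the Epoch */
    const char *name  = "blah/file.txt";

    /* One header line: size, mode, owner, group, mtime, name length, name. */
    printf("%20" PRIu64 " %06o %-.32s %-.32s %26" PRIu64 " %-4d %s\n",
           size, mode, owner, group, mtime, (int)strlen(name), name);
    return 0;
}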
The getting of information for each file you can do with the stat() system call.
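For example (a minimal POSIX sketch with a made-up path), stat() fills a struct stat with the size, mode and modification time needed for the header:
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    if (stat("file.txt", &st) != 0) {
        perror("stat");
        return 1;
    }

    printf("size:  %lld bytes\n", (long long)st.st_size);
    printf("mode:  %o\n", (unsigned)st.st_mode);
    printf("mtime: %lld\n", (long long)st.st_mtime);
    return 0;
}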
For the writing of the header, here's two solutions.
Trivial but evil:
struct file_header {
    ... data you want to put in
} fhdr;

fwrite(&fhdr, sizeof(fhdr), 1, file);
This is evil because structure packing varies from machine to machine, as does byte order and the size of basic types like 'int'. A file written by your program may not be readable by your program when it's compiled on another machine, or even with another compiler on the same machine in some cases.
Non-trivial but safe:
char name[xxx];
uint32_t length; /* Fixed byte length across architectures */
...
fwrite(name, sizeof(name), 1, file);
length = htonl(length); /* Or something else that converts
                           the length to a known endianness */
fwrite(&length, sizeof(length), 1, file);
Personally I'm not a fan of htonl() and friends; I prefer to write something that converts a uint32_t to a uchar[4] using shift operators (which is trivial to do), because C doesn't pin down the format of even an integer in memory. In practice you'd be hard pushed to find something that doesn't store a uint32_t as 4 bytes of 8 bits, but it's a thing to consider.
The variables listed above can be structure members in your structure. Reversing the process on read is left as an exercise to the reader.
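Here is a small sketch of that shift-operator approach (mine, not part of the answer): pack a uint32_t into four bytes in a fixed big-endian order and unpack it again, independent of the machine's own in-memory layout:
#include <stdint.h>
#include <stdio.h>

static void put_u32(unsigned char out[4], uint32_t v)
{
    /* Most significant byte first, regardless of host endianness. */
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)(v);
}

static uint32_t get_u32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

int main(void)
{
    unsigned char buf[4];
    put_u32(buf, 123456789u);
    printf("%u\n", get_u32(buf));       /* prints 123456789 */
    return 0;
}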

How do I choose a good magic number for my file format?

I am designing a binary file format from scratch, and I would like to include some magic bytes at the beginning so that it can be identified easily. How do I go about choosing which bytes? I am not aware of any central registry of magic numbers, so is it just a matter of picking something fairly random that isn't already identified by, say, the file command on a nearby UNIX box?
Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string for identifier. Follow that by an EOF char, and as an added bonus people who cat or type your binary file won't get a mangled terminal.
There is no universally correct way. Best practices can be suggested, but these are often situational. For example, if you're checking the integrity of volatile memory, which has an undefined initial state when power is applied, it may be beneficial to incorporate many 0s or 1s in a sequence (e.g. FFF0 00FF F000) which can stand out against random noise.
If the file is mostly binary, a popular choice is using a text encoding like ASCII which stands out among the binary data in a hex editor. For example, GIF uses GIF89a, FLAC uses fLaC. On the other hand, a plain text identifier may be falsely detected in a random text file, so invalid/control characters might be incorporated.
In general, it does not matter that much what they are, even a bunch of NULL bytes can be used for file detection. But ideally you want the longest unique identifier you can afford, and at minimum 4 bytes long. Any identifier under 4 bytes will show up more often in random data. The longer it is, the less likely it will ever be detected as a false positive. Some known examples are as long as 40 bytes. In a way, it's like a password.
Also, it doesn't have to be at offset 0. The file signature has conventionally been at offset zero, since it made sense to store it first if it will be processed first.
That said, a single file signature should not be the only line of defense. The actual parsing process itself should be able to verify integrity and weed out invalid files even if the signature matches. This can be done with additional file signatures, using length-sensitive data, value/range checking, and especially, hash/checksum values.
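As a closing illustration (a sketch with an invented format name and signature), a reader might check the magic bytes before doing any further parsing:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical 8-byte signature for a made-up "MYFMT" format. */
    static const unsigned char magic[8] = { 'M','Y','F','M','T', 0x1A, 0, 1 };
    unsigned char header[8];

    FILE *f = fopen("data.myfmt", "rb");
    if (f == NULL)
        return 1;

    if (fread(header, 1, sizeof header, f) != sizeof header ||
        memcmp(header, magic, sizeof magic) != 0) {
        fprintf(stderr, "not a MYFMT file\n");
        fclose(f);
        return 1;
    }

    /* ... continue with length checks, checksums, etc. ... */
    fclose(f);
    return 0;
}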
