Alternatives to fread and fwrite for use with structured data - c

The book Beginning Linux Programming (3rd ed) says
"Note that fread and fwrite are not recommended for use with structured data. Part of the problem is that files written with fwrite are potentially nonportable between different machines."
What does that mean exactly? What calls should I use if I want to write a portable structured data reader or writer? Direct system calls?

The book is wisely cautioning against reading a block of bytes from a file directly into a data structure.
The problem with this is that there can be unnamed padding bytes between individual elements of a data structure, and the number and position of these bytes is entirely implementation dependent.
You can still use the fread and fwrite calls to read and write data from and to a file, but you should read and write each element of the data structure individually, rather than reading or writing the whole struct at once.
There are other portability concerns you'll want to keep in mind as well. For example, the various numeric types have implementation-dependent sizes. For portability, you can use the types defined in the stdint.h header.
There may also be differences in floating-point and signed integer representation, but most systems and file formats now use IEEE 754 and two's complement, respectively, so compatibility issues are far less frequent with those types. Just make sure you know what your specification says.
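For instance, here is a minimal sketch of the field-by-field approach (the record type and its fields are invented for illustration); byte order is a separate concern you would still have to settle:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical record type used only for illustration. */
struct record {
    uint32_t id;
    int16_t  temperature;
    char     name[16];
};

/* Write each field separately so no padding bytes end up in the file. */
int write_record(FILE *f, const struct record *r)
{
    if (fwrite(&r->id, sizeof r->id, 1, f) != 1) return -1;
    if (fwrite(&r->temperature, sizeof r->temperature, 1, f) != 1) return -1;
    if (fwrite(r->name, sizeof r->name, 1, f) != 1) return -1;
    return 0;
}

/* Read the fields back in the same order. */
int read_record(FILE *f, struct record *r)
{
    if (fread(&r->id, sizeof r->id, 1, f) != 1) return -1;
    if (fread(&r->temperature, sizeof r->temperature, 1, f) != 1) return -1;
    if (fread(r->name, sizeof r->name, 1, f) != 1) return -1;
    return 0;
}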

Data serialization is the topic you're interested in.
It's about the sizes of variables, it's about encoding (strings might be UTF-8, UTF-16, etc.), and it's about endianness (big-endian vs. little-endian).
For a portable solution, I would recommend taking a look at Google Protocol Buffers and Thrift.

If portability of the data is a concern for you, you should look into serialization techniques and libraries, in particular s11n, JSON, YAML, XDR, ASN.1, Jansson, XML, etc.
Ask yourself what your data and your application will look like in a couple of years.
Textual representations are generally less "brittle" than binary ones.


I need to serialize a C struct to a file in a portable way, so that I can read the file on other machines and can be guaranteed that I will get the same thing that I put in.
The file format doesn't matter as long as it is reasonably compact (writing out the in-memory representation of a struct would be ideal if it wasn't for the portability issues.)
Is there a clean way to easily achieve this?
You are essentially designing a binary network protocol, so you may want to use an existing library (like Google's protocol buffers). If you still want to design your own, you can achieve reasonable portability of writing raw structs by doing this:
Pack your structs (GCC's __attribute__((packed)), MSVC's #pragma pack). This is compiler-specific.
Make sure your integer endianness is correct (htons, htonl). This is architecture-specific.
Do not use pointers for strings (use character buffers).
Use C99 exact integer sizes (uint32_t etc).
Ensure that the code only compiles where CHAR_BIT is 8, which is the most common, or otherwise handles transformation of character strings to a stream of 8-bit octets. There are some environments where CHAR_BIT != 8, but they tend to be special-purpose hardware.
With this you can be reasonably sure you will get the same result on the other end as long as you are using the same struct definition. I am not sure about floating-point number representation, however; I usually avoid sending those.
Another thing, unrelated to portability, that you may want to address is backward compatibility: introduce a length as the first field, and/or use a version tag.
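As a rough sketch of those points combined (the wire_record layout, field names, and version number are invented for illustration, and it assumes GCC plus a POSIX htonl):

#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>   /* htonl; Windows would use <winsock2.h> instead */

/* Hypothetical wire format illustrating the points above: packed layout,
   fixed-width integers, a fixed-size character buffer instead of a pointer.
   With MSVC you would wrap the declaration in #pragma pack(push,1)/pack(pop)
   instead of using __attribute__((packed)). */
struct wire_record {
    uint32_t length;    /* total record size, leaves room for future fields */
    uint32_t version;   /* format version tag */
    uint32_t id;
    char     name[32];  /* NUL-padded text, no pointers */
} __attribute__((packed));

/* Write one record, converting integers to network byte order first. */
int write_wire_record(FILE *f, uint32_t id, const char *name)
{
    struct wire_record r = {0};
    r.length  = htonl((uint32_t)sizeof r);
    r.version = htonl(1);
    r.id      = htonl(id);
    snprintf(r.name, sizeof r.name, "%s", name);
    return fwrite(&r, sizeof r, 1, f) == 1 ? 0 : -1;
}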
You could try using a library such as protocol buffers; rolling your own is probably not worth the effort.
Write one function for output: use fprintf to print an ASCII representation of each field to the file, one field per line.
Write one function for input: use fgets to load each line from the file and sscanf to convert it back to binary, directly into the corresponding field in your structure.
If you plan on doing this with a lot of different structures, consider adding a header to each file which identifies what kind of structure it represents.
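A minimal sketch of that approach, using a made-up point structure (the %.17g format keeps a double's full precision in the text):

#include <stdio.h>

/* Hypothetical structure used only to illustrate the text approach. */
struct point {
    double x;
    double y;
};

/* Output: one field per line, in a fixed order. */
int save_point(FILE *f, const struct point *p)
{
    return fprintf(f, "%.17g\n%.17g\n", p->x, p->y) > 0 ? 0 : -1;
}

/* Input: read each line back and convert it into the matching field. */
int load_point(FILE *f, struct point *p)
{
    char line[64];

    if (!fgets(line, sizeof line, f) || sscanf(line, "%lf", &p->x) != 1)
        return -1;
    if (!fgets(line, sizeof line, f) || sscanf(line, "%lf", &p->y) != 1)
        return -1;
    return 0;
}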

Is it possible to fwrite() raw data to standard output?

Assume I have a struct called "Book"; after constructing several Books I want to print their raw data to standard output. I can't think of any way to do it using printf, so I wonder whether I can use fwrite to read a "Book" and write it to standard output?
I tried something like the following, but it didn't work:
You can write it to stdout, but it will not be formatted in a human-readable manner - it will just be the binary values that happened to be in the structure fields (and padding bytes). If you want something human readable, use printf on the fields of the struct.
It's rarely a good idea to fwrite structs, even when you do want binary data. For one, there's no portable way of reading them back, as structure layout may depend on the platform, and you may also leak sensitive information, as the padding bytes and parts of arrays that you haven't written etc. may contain leftover information from other parts of your program. Either write the fields explicitly by yourself, or use one of the many serialization libraries available instead.
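For illustration, a small sketch of both options (the Book fields are assumed, since the original struct isn't shown):

#include <stdio.h>

/* Hypothetical Book type for illustration. */
struct Book {
    int  id;
    char title[32];
};

int main(void)
{
    struct Book b = {42, "Example"};

    /* This compiles and runs, but dumps raw bytes (including padding)
       to stdout, which is not human-readable. */
    fwrite(&b, sizeof b, 1, stdout);

    /* For readable output, print each field explicitly. */
    printf("id=%d title=%s\n", b.id, b.title);
    return 0;
}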

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons; encoding things as text has non-trivial overhead in both of those departments (today, perhaps mostly in size), which is sometimes prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
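As a sketch of why fixed-length records make this easy (the rec layout is invented for illustration):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical fixed-length record; because every record occupies the same
   number of bytes, you can jump straight to record n without reading
   anything before it. */
struct rec {
    uint32_t key;
    char     payload[60];
};

/* Read record number n (0-based) from an open binary file. */
int read_nth(FILE *f, long n, struct rec *out)
{
    if (fseek(f, n * (long)sizeof *out, SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof *out, 1, f) == 1 ? 0 : -1;
}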
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data, and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project in a readily accessible, but compact way.

How do you create a file format?

I've been doing some reading on file formats and I'm very interested in them. I'm wondering what the process is to create a format. For example, a .jpeg, or .gif, or an audio format. What programming language would you use (if you use a programming language at all)?
The site warned me that this question might be closed, but that's just a risk I'll take in the pursuit of knowledge. :)
what the process is to create a format. For example, a .jpeg, or .gif, or an audio format.
Step 1. Decide what data is going to be in the file.
Step 2. Design how to represent that data in the file.
Step 3. Write it down so other people can understand it.
That's it. A file format is just an idea. Properly, it's an "agreement". Nothing more.
Everyone agrees to put the given information in the given format.
What programming language would you use (if you use a programming language at all)?
All programming languages that can do I/O can have file formats. Some have limitations on which file formats they can handle. Some languages don't handle low-level bytes as well as others.
But a "format" is not an "implementation".
The format is a concept. The implementation is -- well -- an implementation.
You do not need a programming language to write the specification for a file format, although a word processor might prove to be a handy tool.
Basically, you need to decide how the information in the file is to be stored as a sequence of bits. This might be trivial, or it might be exceedingly difficult. As a trivial example, a very primitive bitmap image format could start with one unsigned 32-bit integer representing the width of the bitmap, and then one more such integer representing the height of the bitmap. Then you could decide to simply write out the colour of the pixels sequentially, left-to-right and top-to-bottom (row 1 of pixels, row 2 of pixels, ...), using 24 bits per pixel, in the form of 8 bits for red + 8 bits for green + 8 bits for blue. For instance, an 8×8 bitmap consisting of alternating blue and red pixels would be stored as
00000008000000080000FFFF00000000FFFF0000...
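As a sketch of how small an encoder for such a toy format can be (the function names are made up, and it writes the integers big-endian, which a real specification would have to pin down explicitly):

#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value big-endian (most significant byte first). */
static void put_u32_be(FILE *f, uint32_t v)
{
    putc((v >> 24) & 0xFF, f);
    putc((v >> 16) & 0xFF, f);
    putc((v >>  8) & 0xFF, f);
    putc( v        & 0xFF, f);
}

/* Writes the toy bitmap format described above: width, height, then
   24-bit RGB pixels row by row. rgb must point to w*h*3 bytes. */
int write_toy_bitmap(FILE *f, uint32_t w, uint32_t h, const uint8_t *rgb)
{
    put_u32_be(f, w);
    put_u32_be(f, h);
    return fwrite(rgb, 3, (size_t)w * h, f) == (size_t)w * h ? 0 : -1;
}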
In a less trivial example, it really depends on the data you wish to save. Typically you would define a lot of records/structures, such as BITMAPINFOHEADER, and specify in what order they should come, how they should be nested, and you might need to write a lot of indices and look-up tables. Myself, I have written quite a few file formats, most recently the ASD (AlgoSim Data) file format used to save AlgoSim structures. Such files consist of a number of records (maybe nested), look-up tables, magic words (indicating structure begin, structure end, etc.) and strings in a custom-defined format. One typical thing that often simplifies the file format is that the records contain data about their size, and the sizes of the custom data parts following the record (in case the record is some sort of a header preceding data in a custom format, e.g. pixel colours or sound samples).
If you haven't been working with file formats before, I would suggest that you learn a very simple format, such as the Windows 3 bitmap format, and write your own BMP encoder/decoder, i.e. programs that create and read BMP files (from scratch) and display the read BMP files. Then you know the basic ideas.
Fundamentally, files only exist to store information that needs to be loaded back in the future, either by the same program or a different one. A really good file format is designed so that:
Any programming language can be used to read or write it.
The information a program would most likely need from the file can be accessed quickly and efficiently.
The format can be extended and expanded in the future, without breaking backwards compatibility.
The format should accommodate any special requirements (e.g. error resiliency, compression, encoding, etc.) present in the domain in which the file will be used.
You are most certainly interested in looking into Protocol Buffers and Thrift. These tools provide a modern, principled way of designing forwards and backward compatible file formats.

Writing more to a file than just plain text

I have always been able to read and write basic text files in C++, but so far no one has discussed much more than that.
My question is this:
If developing a file type by myself for use by an application I also create, how would I go about writing the data to a file and preserve the layout, formatting, etc.? Are there any standards, or does it just depend on the creativity of the programmer?
You basically have to come up with your own file format and write binary data.
You can also serialize your object model and write the output to a file, but that's usually less efficient.
Better to use an existing database, or use xml (or other) for simple needs. If you want to write a file in a format that already exists, find a library that supports it.
You have to know the binary file format for the file you are trying to create. Consider Joel's post on this topic: the 97-2003 File Format is a 349 page spec.
Nearly all the time, to do something like that, you use an API, to avoid the grunt work. Be careful, however, because figuring out "what works" by trial and error can result in an upgrade of the program breaking your code. Plus you have to take into account other operating systems, minor version differences, patches, etc.
There are a number of standards of course. The likely one to use is some flavor of xml since there are libraries and tools that already exist to help you work with it, but nothing is stopping you from inventing your own.
Well you could store the data in a format you could read, but which maintained the integrity of your data (XML or JSON for instance).
Or (shudder) you could come up with your own proprietary binary format, and use that.
You would go at it exactly the same way as you would with a text file: writing your data byte by byte, encoded in such a way that when you read the file you know what you are reading.
For a spreadsheet application you could even use a text format (OOXML, OpenDocument) to store presentation and content information.
Or you could define binary data structures and write those directly to the file.
The choice between a text or binary format depends on the application: for a configuration file you may prefer a text file which can be modified outside your app, while for a database you will most likely choose a binary format for performance reasons.
See wotsit.org for information on file formats for various file types. Example: You can figure out exactly how to write out a .BMP file and how it is composed.
Writing to a database can be done by using a wrapper class in your language, mainly passing it SQL commands.
If you create a binary file, you can write any file to it. The only drawback is that you have to know exactly where it starts and where it ends.
Use XML (something open, descriptive, and validatable), and stick with text. There are standards for this sort of thing as well, including ODF.
You can open the file as binary, instead of text (how one does this depends somewhat on the platform), and from there you can write the data directly out to disk. The only real caveat to this is endianness, which can become an issue when moving the files from one architecture to another (x86 to PPC for instance).
Writing binary data to disk is really no harder than writing text, and really, your creativity is key for how you store the data.
The general problem is usually referred to as serialization of your application state and in your case with a source/target of a file in whatever format makes sense for you. These days the preferred input/output format is XML, and you may want to look into the existing standards in this field. The problem then becomes how do I map from the state of my system to the particular schema. Boost has a serialization framework that you may want to check out.
/Allan
There are a variety of approaches you can take, but in general you'll want some sort of serialization library. Boost.Serialization or Google's Protocol Buffers are good examples of these. The basic idea is that you have memory structures (classes and objects) that represent your data, and you want to write that data to a file in a way that can be used to reconstruct those structures again.
If you're hesitant to use a library, you can do it all manually, but realize that you can end up writing a lot of redundant code, or developing your own library. See fopen, fread, fwrite and fclose for a starting point.
A typical binary file format for custom data is an "indexed file format" consisting of
-------
|index|
-------
|data |
-------
Where the index contains records "pointing" to the data.
The index consists of records containing an offset and a size. The offset tells you where in the file the data is stored and the size tells you the size of the data at that offset (i.e. the number of bytes to read).
typedef struct {
    size_t offset;    /* where in the file the data starts */
    size_t size;      /* number of bytes stored at that offset */
} Index;

typedef struct {
    int   ID;
    char  First[20];
    char  Last[20];
    char *RandomInfo; /* variable-length part; write its length and bytes, not the pointer */
} Data;
Suppose you want to store 50 records in the file you would create 50 indices and 50 data structures. The 50 index structures would be written to the file first, followed by the 50 data structures.
To read the file you would read in the 50 index structures, then from the data in the read-in index structures you could tell where to "seek" to read the data records.
Look up fopen, fread, fwrite, fseek, ftell, and fclose for the functions to read and write the data.
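Here is a sketch of the read side, assuming the Index typedef above and offsets measured from the start of the file:

#include <stdio.h>

/* Read the record described by one index entry into a caller-supplied buffer. */
int read_indexed_record(FILE *f, const Index *idx, void *buffer, size_t bufsize)
{
    if (idx->size > bufsize)
        return -1;
    if (fseek(f, (long)idx->offset, SEEK_SET) != 0)
        return -1;
    return fread(buffer, 1, idx->size, f) == idx->size ? 0 : -1;
}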
You usually use a third-party library for these things. For example, you would link in a database library for, say, Oracle that would allow you to talk to the database. Because the underlying file types (i.e. Excel spreadsheet vs. OpenOffice, Oracle vs. MySQL, etc.) differ, these libraries abstract away your need to care how the file is constructed.
Hope that helps you find what you're looking for!
1985 called, and said they have some help IFF you are willing to read up. The interchange file format is still in use today and provides some basic metadata around binary files, such as RIFF or WAV audio. (Unfortunately, TIFF is a false friend.) It allegedly even inspired PNG, so it can't be that bad.
