How to write portable binary files in C?

Let us consider the following piece of code:
#include <stdio.h>

int main(void)
{
    int val = 30;
    FILE *file;
    if ((file = fopen("file.bin", "wb")) != NULL) {
        fwrite(&val, sizeof(int), 1, file);
        fclose(file);
    }
    return 0;
}
I was wondering what happens if I try to read the file resulting from this code with fread on an architecture where integers have a different size from the integers on the architecture that produced the file. I think the result will not match the original value of the variable val in this code.
If this is true, how can we deal with this problem? How can we produce portable binary files in C?

I was wondering what happens if I try to read the file on an architecture where integers have a different size from the integers on the architecture that produced the file.
That is absolutely a good thing to worry about. The other big concern is byte order.
When you say
fwrite(&val, sizeof(int), 1, file);
you are saying, "write this int to the file in binary, using exactly the same representation it has in memory on my machine: same size, same byte order, same everything". And, yes, that means the file format is essentially defined by "the representation it has on my machine", not in any nicely-portable way.
But that's not the only way to write an int to a file in binary. There are lots of other ways, with varying degrees of portability. The way I like to do it is simply:
putc((val >> 8) & 0xff, file); /* MSB */
putc( val & 0xff, file); /* LSB */
For simplicity here I'm assuming that the binary format being written uses two bytes (16 bits) for the on-disk version of the integer, meaning I'm assuming that the int variable val never holds a number bigger than that.
Written that way, the two-byte integer is written in "big endian" order, with the most-significant byte first. If you want to define your binary file format to use "little endian" order instead, the change is almost trivial:
putc( val & 0xff, file); /* LSB */
putc((val >> 8) & 0xff, file); /* MSB */
You would use some similar-looking code, involving calls to getc and some more bit-shifting operations, to read the bytes back in on the other end and recombine them into an integer. Here's the little-endian version:
val = getc(file);
val |= getc(file) << 8;
These examples aren't perfect, and are guaranteed to work properly for all values only if val is an unsigned type. There are more wrinkles we might apply in order to deal with signed integers, and integers of size other than two bytes, but this should get you started.
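For example, here is a sketch of the same technique extended to a four-byte on-disk format (the helper names are mine, and error handling for EOF is omitted for brevity):

#include <stdio.h>
#include <stdint.h>

/* Write a 32-bit value big-endian, most significant byte first. */
void put_u32_be(uint32_t val, FILE *file) {
    putc((val >> 24) & 0xff, file);
    putc((val >> 16) & 0xff, file);
    putc((val >> 8) & 0xff, file);
    putc(val & 0xff, file);
}

/* Read the four bytes back in the same order and recombine them. */
uint32_t get_u32_be(FILE *file) {
    uint32_t val = (uint32_t)getc(file) << 24;
    val |= (uint32_t)getc(file) << 16;
    val |= (uint32_t)getc(file) << 8;
    val |= (uint32_t)getc(file);
    return val;
}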
See also questions 12.42 and 16.7 in the C FAQ list.
See also this chapter of some on-line C Programming notes.

Related

memcpy inverting data, C language

I have a question here: I'm trying to use memcpy() to copy a string[9] into an unsigned long long int variable. Here's the code:
unsigned char string[9] = "message";
string[8] = '\0';
unsigned long long int aux;
memcpy(&aux, string, 8);
printf("%llx\n", aux); // prints inverted data
/*
* expected: 6d65737361676565
* printed: 656567617373656d
*/
How do I make this copy without inverting the data?
Your system is using little endian byte ordering for integers. That means that the least significant byte comes first. For example, a 32 bit integer would store 258 (0x00000102) as 0x02 0x01 0x00 0x00.
Rather than copying your string into an integer, just loop through the characters and print each one in hex:
int i;
int len = strlen((char *)string);  /* cast needed: string is unsigned char[] */

for (i = 0; i < len; i++) {
    printf("%02x ", string[i]);
}
printf("\n");
Since string is an array of unsigned char and you're doing bit manipulation for the purpose of implementing DES, you don't need to change it at all. Just use it as is.
Looks like you've just discovered by accident how CPUs store integer values. There are two competing conventions, termed endianness, with little-endian and big-endian both found in the wild.
If you want them in byte-for-byte order, an integer type will be problematic and should be avoided. Just use a byte array.
There are conversion functions that can go from one endian form to another, though you need to know what sort your architecture uses before converting properly.
So if you're reading in a binary value, you must know what endian form it's in in order to import it correctly into a native int type. It's generally good practice to pick a consistent endian form when writing binary files so there's no guessing; the "network byte order" scheme used in the vast majority of internet protocols is a good default. Then you can use functions like htonl and ntohl to convert back and forth as necessary.
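As a rough sketch of that default (assuming a POSIX system, where these functions are declared in <arpa/inet.h>, and file is an already-open FILE *):

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

uint32_t value = 0x11223344;
uint32_t wire = htonl(value);          /* host order -> big-endian "network" order */
fwrite(&wire, sizeof wire, 1, file);   /* the bytes on disk are now unambiguous */
/* ... and after fread-ing into wire on any machine: */
value = ntohl(wire);                   /* big-endian -> host order */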

Printing actual bit representation of integers in C

I wanted to print the actual bit representation of integers in C. These are the two approaches that I found.
First:
union int_char {
    int val;
    unsigned char c[sizeof(int)];
} data;

data.val = n1;
// printf("Integer: %p\nFirst char: %p\nLast char: %p\n", &data.val, &data.c[0], &data.c[sizeof(int)-1]);
for(int i = 0; i < sizeof(int); i++)
    printf("%.2x", data.c[i]);
printf("\n");
Second:
for(int i = 0; i < 8*sizeof(int); i++) {
    int j = 8 * sizeof(int) - 1 - i;
    printf("%d", (val >> j) & 1);
}
printf("\n");
With these two approaches, the outputs are 00000002 and 02000000. I also tried other numbers, and it seems that the bytes are swapped between the two. Which one is correct?
Welcome to the exotic world of endian-ness.
Because we write numbers most significant digit first, you might imagine the most significant byte is stored at the lower address.
The electrical engineers who build computers are more imaginative.
Sometimes they store the most significant byte first, but on your platform it's the least significant.
There are even platforms where it's all a bit mixed up - but you'll rarely encounter those in practice.
So we talk about big-endian and little-endian for the most part. It's a joke about Gulliver's Travels, where there's a pointless war over which end of a boiled egg to start at, which is itself a satire of some disputes in the Christian Church. But I digress.
Because your first snippet looks at the value as a series of bytes, it encounters them in the machine's byte order.
But because >> is defined as operating on the bits of the value, it works 'logically', without regard to the underlying representation.
It's right of C to not define the byte order because hardware not supporting the model C chose would be burdened with an overhead of shuffling bytes around endlessly and pointlessly.
There sadly isn't a built-in identifier telling you what the model is, though code that detects it can be found.
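Such a runtime check is only a few lines; a minimal sketch (the helper name is mine):

#include <stdint.h>

int am_little_endian(void)
{
    /* Store a known 32-bit value and inspect its first byte in memory. */
    const union { uint32_t u; uint8_t c[4]; } probe = { 1u };
    return probe.c[0] == 1;  /* the LSB comes first on a little-endian machine */
}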
It will become relevant to you if (a) as above, you want to break down integer types into bytes and manipulate them, or (b) you receive files from other platforms containing multi-byte structures.
Unicode offers something called a BOM (Byte Order Mark) in UTF-16 and UTF-32.
In fact, a good reason (among many) for using UTF-8 is that the problem goes away, because each code unit is a single byte.
Footnote:
It's been pointed out quite fairly in the comments that I haven't told the whole story.
The C language specification admits more than one representation of integers, and particularly of signed integers: specifically sign-magnitude, two's complement and one's complement.
It also permits 'padding bits' that don't represent part of the value.
So in principle, along with tackling endianness, we need to consider representation.
In principle, that is. In practice, all modern computers use two's complement, and extant machines that use anything else are very rare; unless you have a genuine requirement to support such platforms, I recommend assuming you're on a two's-complement system.
The correct hex representation as a string is 00000002, just as if you had declared the integer with a hex literal:
int n = 0x00000002; //n=2
or as you would get when printing the integer as hex, as in:
printf("%08x", n);
But when printing the integer's bytes one after another, you must also consider the endianness, that is, the byte order of multi-byte integers:
On a big-endian system (some UNIX systems use it), the 4 bytes will be ordered in memory as:
00 00 00 02
While on a little-endian system (used by most desktop OSes), the bytes will be ordered in memory as:
02 00 00 00
The first prints the bytes that represent the integer in the order they appear in memory. Platforms with different endianness will print different results, as they store integers in different ways.
The second prints the bits that make up the integer value, most significant bit first. This result is independent of endianness. It is also independent of how the >> operator is implemented for signed ints, as it does not look at the bits that may be influenced by the implementation.
The second is a better match to the question "Printing actual bit representation of integers in C", although there is a lot of ambiguity.
It depends on your definition of "correct".
The first one will print the data exactly as it's laid out in memory, so I bet that's the one you're getting the maybe unexpected 02000000 for. *) IMHO, that's the correct one. It could be done more simply by just aliasing with an unsigned char * directly (char pointers are always allowed to alias any other pointers; in fact, accessing representations is a use case for char pointers mentioned in the standard):
int x = 2;
unsigned char *rep = (unsigned char *)&x;
for (size_t i = 0; i < sizeof x; ++i) printf("0x%hhx ", rep[i]);
The second one will print only the value bits **) and take them in order from the most significant bit to the least significant one. I wouldn't call it correct, because it also assumes that bytes have 8 bits, and because the shifting used is implementation-defined for negative numbers. ***) Furthermore, just ignoring padding bits doesn't seem correct either if you really want to see the representation.
edit: As commented by Gerhardh meanwhile, this second code doesn't print byte by byte but bit by bit. So, the output you claim to see isn't possible. Still, it's the same principle, it only prints value bits and starts at the most significant one.
*) You're on a "little endian" machine. On these machines, the least significant byte is stored first in memory. Read more about endianness on Wikipedia.
**) Representations of types in C may also have padding bits. Some types aren't allowed to include padding (like char), but int is allowed to have them. This second option doesn't alias to char, so the padding bits remain invisible.
***) A correct version of this code (for printing all the value bits) must a) correctly determine the number of value bits (8 * sizeof(int) is wrong because bytes (char) can have more than 8 bits; even CHAR_BIT * sizeof(int) is wrong, because it would also count padding bits if present) and b) avoid the implementation-defined shifting behavior by first converting to unsigned. It could look, for example, like this:
#include <stdio.h>

#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
      + (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))

int main(void)
{
    int x = 2;
    /* IMAX_BITS((unsigned)-1) evaluates to the number of value bits in unsigned int */
    for (unsigned mask = 1U << (IMAX_BITS((unsigned)-1) - 1); mask; mask >>= 1)
    {
        putchar((unsigned) x & mask ? '1' : '0');
    }
    puts("");
}
See this answer for an explanation of this strange macro.

Writing a byte to a file and then reading the same byte are not the same

Basically I have a file, and in this file I am writing 3 bytes, and then I'm writing a 4 byte integer. In another application I read the first 3 bytes, and then I read the next 4 bytes and convert them to an integer.
When I print out the value, I have very different results...
fwrite(&recordNum, 2, 1, file); //The first 2 bytes (recordNum is a short int)
fwrite(&charval, 1, 1, file); //charval is a single byte char
fwrite(&time, 4, 1, file);
// I continue writing a total of 40 bytes
Here is how time was calculated:
time_t rawtime;
struct tm * timeinfo;
time(&rawtime);
timeinfo = localtime(&rawtime);
int time = (int)rawtime;
I have tested to see that sizeof(time) is 4 bytes, and it is. I have also tested using an epoch converter to make sure this is the correct time (in seconds) and it is.
Now, in another file I read the 40 bytes to a char buffer:
char record[40];
fread(record, 1, 40, file);
// Then I convert those 4 bytes into an uint32_t
uint32_t timestamp =(uint32_t)record[6] | (uint32_t)record[5] << 8 | (uint32_t)record[4] << 16 | (uint32_t)record[3] << 24;
printf("Testing timestamp = %d\n", timestamp);
But this prints out -6624. The expected value is 551995007.
EDIT
To be clear, everything else that I am reading from the char buffer is correct. After this timestamp I have text, which I simply print and it runs fine.
You write the time at once with fwrite, which uses the native byte ordering, then you explicitly read the individual bytes in big-endian format (most significant byte first). Your machine is likely using little-endian format for byte ordering, which would explain the difference.
You need to read/write in a consistent manner. The simplest way to do this is to fread one variable at a time, just like you're writing:
fread(&recordNum, sizeof(recordNum), 1, file);
fread(&charval, sizeof(charval), 1, file);
fread(&time, sizeof(time), 1, file);
Also note the use of sizeof to calculate the size.
Your problem is probably right here:
uint32_t timestamp =(uint32_t)record[6] | (uint32_t)record[5] << 8 | (uint32_t)record[4] << 16 | (uint32_t)record[3] << 24;
printf("Testing timestamp = %d\n", timestamp);
You've used fwrite to write out a 32 bit integer.. in whatever order the processor stored it in memory.. and you don't actually know what byte ordering (endian-ness) the machine used. Maybe the first byte written out is the lowest byte of the integer, or maybe it's the highest byte of the integer.
If you're reading and writing the data on the same machine, or on different machines with the same architecture, you don't need to care about that.. it will work. But if the data is written on an architecture with one byte ordering, and potentially read in on an architecture with another byte ordering, it will be wrong: Your code needs to know what order the bytes should be in memory and what order they will be read/written on disk.
In this case, in your code, you are doing a mix of both: You write them out in whatever endian-ness the machine uses natively.. then when you read them in, you start shifting the bits around as if you know what order they were originally in.. but you don't, because you didn't pay attention to the order when you wrote them out.
So if you're writing and reading the file on the same machine, or identical machine (same processor, OS, compiler, etc), just write them out in the native order (without worrying about what that is) and then read them back in exactly as you wrote them out. If you write them and read them on the same machine, it'll work.
So if your timestamp is located at offset 3 through 6 of your record, just do this:
uint32_t timestamp;
memcpy(&timestamp, record+3, sizeof(timestamp));
Note that you cannot directly cast record+3 to a uint32_t pointer, because it might violate the system's alignment requirements.
Note also that you should probably be using the time_t type to hold the timestamp; if you're on a Unix-like system, that's the natural type supplied for epoch time values.
But if you are planning to move this file to another machine at any point and try to read it there, you could easily end up with your data on a system that has different endian-ness or different size for time_t. Simply writing bytes in and out of a file with no thought to the endian-ness or size of types on different operating systems is just fine for temporary files or for files which are meant to be used on one computer only and which will never be moved to other types of system.
Making data files that are portable between systems is a whole subject in itself. But the first thing you should do, if you care about that, is to look at the functions htons(), ntohs(), htonl(), ntohl(), and their ilk, which convert between the system's native endianness and a known (big) endianness that is the standard for internet communications and generally used for interoperability (even though Intel processors are little-endian and dominate the market these days). These functions do something similar to what you were doing with your bit-shifting, but since someone else wrote them, you don't have to. It's a lot easier to use the library functions for this!
For example:
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main() {
    uint32_t x = 1234, z;
    // open a file for writing, convert x from native to big endian, write it.
    FILE *file = fopen("foo.txt", "wb");
    z = htonl(x);
    fwrite(&z, sizeof(z), 1, file);
    fclose(file);
    // read it back and convert from big endian to native.
    file = fopen("foo.txt", "rb");
    fread(&z, sizeof(z), 1, file);
    x = ntohl(z);
    fclose(file);
    printf("%u\n", x);
}
NOTE: I am NOT CHECKING FOR ERRORS in this code; it is just an example. Do not use functions like fopen, fread, etc. without checking for errors.
By using these functions both when writing the data out to disk and when reading it back, you guarantee that the data on disk is always big-endian: htonl() on a big-endian platform does nothing, while on a little-endian platform it converts from little to big endian. And ntohl() does the opposite. So your data on disk will always be read in correctly.

Most efficient way to store an unsigned 16-bit Integer to a file

I'm making a dictionary compressor in C with dictionary max size 64000. Because of this, I'm storing my entries as 16-bit integers.
What I'm currently doing:
To encode 'a', I get its ASCII value, 97, and then convert this number into a string representation of the 16-bit integer of 97. So I end up encoding '0000000001100001' for 'a', which obviously isn't saving much space in the short run.
I'm aware that more efficient versions of this algorithm would start with smaller integer sizes (fewer bits of storage until we need more), but I'm wondering if there's a better way to either
convert my integer '97' into an ASCII string of fixed length that can store 16 bits of data (97 would be x digits, 46347 would also be x digits), or
write to a file that can ONLY store 1s and 0s. Because as it is, it seems like I'm writing 16 ASCII characters to a text file, each of which is 8 bits... so that's not really helping the cause much, is it?
Please let me know if I can be more clear in any way. I'm pretty new to this site. Thank you!
EDIT: How I store my dictionary is entirely up to me as far as I know. I just know that I need to be able to easily read the encoded file back and get the integers from it.
Also, I can only include stdio.h, stdlib.h, string.h, and header files I wrote for the program.
Please, do ignore these people who are suggesting that you "write directly to the file". There are a number of issues with that, which ultimately fall into the category of "integer representation". While there may appear to be some compelling reasons to write integers straight to external storage using fwrite or what-not, there are some solid facts in play here.
The bottleneck is the external storage controller. Either that, or the network, if you're writing a network application. Thus, writing two bytes as a single fwrite, or as two distinct fputc calls, should be roughly the same speed, provided your memory profile is adequate for your platform. You can adjust the amount of buffering your FILE *s use with setvbuf, so we can always fine-tune per platform based on what our profilers tell us, though this information should probably float gracefully upstream to the standard library through gentle suggestions, to be useful for other projects too.
Underlying integer representations are inconsistent between today's computers. Suppose you write unsigned ints directly to a file on system X, which uses 32-bit ints and big-endian representation; you'll end up with issues reading that file on system Y, which uses 16-bit ints and little-endian representation, or on system Z, which uses 64-bit ints with mixed-endian representation and 32 padding bits. Nowadays we have everything in the mix, from 15-year-old computers that people torture themselves with, to ARM big.LITTLE SoCs, smartphones and smart TVs, gaming consoles and PCs, all of which have their own quirks falling outside the realm of standard C, especially with regard to integer representation, padding and so on.
C was developed with abstractions in mind that allow you to express your algorithm portably, so that you don't have to write different code for each OS! Here's an example of reading and converting four hex digits to an unsigned int value, portably:
unsigned int value;
int value_is_valid = fscanf(fd, "%04x", &value) == 1;
assert(value_is_valid); // #include <assert.h>
/* NOTE: Actual error handling should occur in place of that
 * assertion
 */
I should point out the reason why I chose %04x and not %08x or something more contemporary: if we go by questions asked even today, unfortunately there are students, for example, using textbooks and compilers that are over 20 years old... Their int is 16-bit and, technically, their compilers are compliant in that aspect (though they really ought to push gcc and llvm throughout academia). With portability in mind, here's how I'd write that value:
value &= 0xFFFF;
fprintf(fd, "%04x", value);
// side-note: We often don't check the return value of fprintf, but it can
// become very important, particularly when dealing with streams and large files...
Supposing your unsigned int values occupy two bytes, here's how I'd read those two bytes, portably, using big endian representation:
int hi = fgetc(fd);
int lo = fgetc(fd);
unsigned int value = 0;
assert(hi >= 0 && lo >= 0); // again, proper error detection & handling logic should be here
value += hi & 0xFF; value <<= 8;
value += lo & 0xFF;
... and here's how I'd write those two bytes, in their big endian order:
fputc((value >> 8) & 0xFF, fd);
fputc(value & 0xFF, fd);
// and you might also want to check this return value (perhaps in a finely tuned end product)
Perhaps you're more interested in little endian. The neat thing is, the code really isn't that different. Here's input:
int lo = fgetc(fd);
int hi = fgetc(fd);
unsigned int value = 0;
assert(hi >= 0 && lo >= 0);
value += hi & 0xFF; value <<= 8;
value += lo & 0xFF;
... and here's output:
fputc(value & 0xFF, fd);
fputc((value >> 8) & 0xFF, fd);
For anything larger than two bytes (i.e. a long unsigned or long signed), you might want to fwrite((char unsigned[]){ value >> 24, value >> 16, value >> 8, value }, 1, 4, fd); or something for example, to reduce boilerplate. With that in mind, it doesn't seem abusive to form a preprocessor macro:
#define write(fd, ...) fwrite((char unsigned[]){ __VA_ARGS__ }, 1, sizeof ((char unsigned[]){ __VA_ARGS__ }), fd)
I suppose one might look at this like choosing the better of two evils: preprocessor abuse or the magic number 4 in the code above, because now we can write(fd, value >> 24, value >> 16, value >> 8, value); without the 4 being hard-coded... but a word for the uninitiated: side-effects might cause headaches, so don't go causing modifications, writes or global state changes of any kind in arguments of write.
Well, that's my update to this post for the day... Socially delayed geek person signing out for now.
What you are contemplating is using ASCII characters to save your numbers; this is completely unnecessary and most inefficient.
The most space-efficient way to do this (without resorting to complex algorithms) would be to just dump the bytes of the numbers into the file; the number of bits per entry would depend on the largest number you intend to save, or you could have multiple files for 8-bit, 16-bit values, etc.
Then, when you read the file, you know that each number occupies x bits, so you just read them out one by one, or in big chunks that you then turn into an array of a type matching that width.
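A minimal sketch of that "just dump the bytes" approach for 16-bit entries, staying within the stdio.h-only constraint mentioned in the question (the helper names are mine); each code is written as two bytes in a fixed big-endian order, so it reads back identically everywhere:

#include <stdio.h>

/* Write one dictionary code (0..65535) as two raw bytes, MSB first. */
void put_code(unsigned code, FILE *f) {
    fputc((code >> 8) & 0xFF, f);
    fputc(code & 0xFF, f);
}

/* Read one code back; EOF checking omitted for brevity. */
unsigned get_code(FILE *f) {
    int hi = fgetc(f);
    int lo = fgetc(f);
    return ((unsigned)(hi & 0xFF) << 8) | (unsigned)(lo & 0xFF);
}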

is fread on a single integer affected by the endianness of my system

I'm currently working on a binary file format for some arbitrary values, including some strings, and string-length values, which are stored as uint32_t's.
But I was wondering, if I write the string length with fwrite to a file on a little-endian system, and read that value from the same file with fread on a big-endian system, will the order of the bytes be reversed? And if so, what is the best practice to fix that?
EDIT: Surely there has to be some GNU functionality around that does this for me, and that has been used, tested and validated for, like, 20 years?
Yes, fwrite and fread on an integer makes your file format unportable to another endianness, as other answers correctly state.
As for best practice, I would discourage any conditional byte flipping and endianness testing at all. Decide on the endianness of your file format, then write and read bytes, and make integers from them by ORing and shifting.
In other words, I agree with Rob Pike on the issue.
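A minimal sketch of that byte-oriented style (the helper name is mine; the file format here is assumed to be little-endian), which is correct regardless of the host's endianness:

#include <stdio.h>
#include <stdint.h>

/* Read a little-endian 32-bit length; no endianness test, no flipping. */
uint32_t read_u32_le(FILE *f) {
    unsigned char b[4];
    fread(b, 1, 4, f);  /* error handling omitted for brevity */
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

The writing side is symmetric: mask and shift the value, and fputc the four bytes in the same order.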
If I write the string length with fwrite to a file on a little-endian system, and read that value from the same file with fread on a big-endian system, will the oder of the bytes be reversed?
Yes. fwrite simply writes the memory contents to file in linear order. fread simply reads from file to memory in linear order.
What is the best practice to fix that?
Decide on an ordering for your files. Then write wrapper functions to write and read integers to/from files. Inside this function, conditionally flip the byte order if you're on a system with the opposite ordering.
(There are lots of questions here regarding determining the endianness of a system.)
Surely there has to be some GNU functionality around that does this for me
There's nothing in the standard library. However, POSIX defines a bunch of functions for this: ntohl, htonl, etc. They're typically used for network transfer, but could equally be used for files.
Yes, it will, since fread() operates on raw bytes. If the order of bytes is different in memory, it will be different in the file too.
And if so, what is the best practice to fix that?
Detect the endianness of your system, and flip the bytes if it doesn't match the endianness of your file format.
#include <stdint.h>

int is_little_endian(void)
{
    uint32_t magic = 0x00000001;
    uint8_t black_magic = *(uint8_t *)&magic;
    return black_magic;
}

uint32_t to_little_endian(uint32_t dword)
{
    if (is_little_endian()) return dword;
    return (((dword >> 0) & 0xff) << 24)
         | (((dword >> 8) & 0xff) << 16)
         | (((dword >> 16) & 0xff) << 8)
         | (((dword >> 24) & 0xff) << 0);
}
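A usage sketch (variable names mine; file is an open FILE *). Note that the same function converts in both directions, since the byte swap is its own inverse:

uint32_t tmp = to_little_endian(length);    /* host order -> file (little-endian) order */
fwrite(&tmp, sizeof tmp, 1, file);
/* ... and when reading back: */
fread(&tmp, sizeof tmp, 1, file);
uint32_t length_in = to_little_endian(tmp); /* file order -> host order */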
Linux provides
htobe16, htole16, be16toh, le16toh, htobe32, htole32, be32toh, le32toh, htobe64, htole64, be64toh, le64toh - convert values between host and big-/little-endian byte order
[https://linux.die.net/man/3/le32toh]
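A usage sketch for these (on Linux with glibc they are declared in <endian.h>, possibly requiring _DEFAULT_SOURCE; file is an open FILE * as above):

#include <endian.h>
#include <stdint.h>
#include <stdio.h>

uint32_t len = 42;
uint32_t wire = htole32(len);          /* host -> little-endian */
fwrite(&wire, sizeof wire, 1, file);
/* ... and when reading back: */
fread(&wire, sizeof wire, 1, file);
len = le32toh(wire);                   /* little-endian -> host */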
