Dealing with different architectures when loading data using stdio - c

I want to read in some data from a file. Say an integer:
fread(&var1, 4, 1, f);
Where var1 would be an integer. But then I got to thinking that this is not safe as there isn't any guarantee that an integer is 4 bytes long. (I'm ignoring other issues like feof and ferror for the sake of this question).
I also soon realised that there were even more issues than just int size, such as the endianness of the system, and probably others which I haven't even thought of.
So, what is the best way to ensure that your data being read in is interpreted properly? So far, the only thing I can think of is to store the data as text rather than as binary, read in the text, and convert it at run time. I would guess that no matter the solution, though, if you want to ensure portability, it will always involve some form of conversion anyway.
Thank you.

To avoid the size problem, you should be doing:
fread(&var1, sizeof(var1), 1, f);
If you're worried that the size of int might vary between the platform that writes the data and the platform that reads the data, then you have a more fundamental problem. In this scenario, you should avoid using int, short, etc., and use the types defined in <stdint.h>, such as int16_t, uint32_t.
To deal with endianness issues, you should consider writing helper functions that explicitly write/read the individual bytes in a known order, such as:
void write_uint32_t(uint8_t *buf, uint32_t x)
{
    buf[0] = (uint8_t)(x >> 0);
    buf[1] = (uint8_t)(x >> 8);
    buf[2] = (uint8_t)(x >> 16);
    buf[3] = (uint8_t)(x >> 24);
}
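The matching reader (my sketch, mirroring the answer's little-endian byte order) just reassembles the value:

uint32_t read_uint32_t(const uint8_t *buf)
{
    /* Rebuild the value from the same little-endian byte order */
    return ((uint32_t)buf[0] << 0)  |
           ((uint32_t)buf[1] << 8)  |
           ((uint32_t)buf[2] << 16) |
           ((uint32_t)buf[3] << 24);
}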
All of the above applies only to integer types. For floating-point types, there is no perfect universal solution.
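One common workaround, sketched here under the assumption that both ends use IEEE 754 binary64 doubles (near-universal today, but not guaranteed by the C standard): copy the double's bits into a uint64_t with memcpy, then serialize that integer byte by byte as above.

#include <stdint.h>
#include <string.h>

void write_double(uint8_t *buf, double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);           /* grab the raw bit pattern */
    for (int i = 0; i < 8; i++)
        buf[i] = (uint8_t)(bits >> (8 * i));  /* little-endian, as above */
}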

Always use the sizeof operator to get the size of a type. Never rely on hard-coded values!

Related

Location of function call changes functionality

When writing to a GPIO output register (BCM2711 chip on the Raspberry Pi 4), the location of my clear_register function call changes the result completely. The code is below.
void clear_register(long reg) {
    *(unsigned int*)reg = (unsigned int)0;
}

int set_pin(unsigned int pin_number) {
    long reg = SET_BASE; //0xFE200001c
    clear_register(SET_BASE);
    unsigned int curval = *(unsigned int*)reg;
    unsigned int value = (1 << pin_number);
    value |= curval;
    gpio_write(reg, value);
    return 1;
}
When written like this, the register gets cleared. However, if I call clear_register(SET_BASE) before long reg = SET_BASE, the registers don't get cleared. I can't tell why this would make a difference; any help is appreciated.
I can't reproduce it since I don't have the target hardware which allows writing to that address but you have several bugs and dangerous style issues:
0xFE200001c (which needs more than 32 bits, so it's a 64-bit signed constant) does not necessarily fit inside a long (possibly 32 bits) and certainly does not fit inside an unsigned int (16 or 32 bits).
Even if that was a typo and you meant to write 0xFE20001c (32 bits), your program still suffers from "sloppy typing", which is when the programmer just types out long, int etc. without giving it any deeper thought. The outcome is strange, subtle and intermittent bugs. I think everyone can relate to the situation "just cast it to unsigned int and it works", with no idea why. Bugs like that are always caused by "sloppy typing".
You should be using the stdint.h types instead: uint32_t and so on.
Also related to "sloppy typing": there exist very few cases where you actually want to use signed numbers in embedded systems - certainly not when dealing with addresses and bits. Negative addresses in Linux are a thing - kernel space - but accidentally changing an address to a negative value is a bug.
Signed types only cause problems when doing bitwise arithmetic, and they come with undefined overflow behaviour. So you need to nuke the presence of every signed variable in your code unless it explicitly needs to be signed.
This means no sloppy long but the correct type, which is either uint32_t or uintptr_t if it should hold an address, as is the case in your example. It also means no sloppy hex constants; they must always have a u suffix: 0xFE20001Cu, or they easily boil down to the wrong type.
For these reasons, 1 << is always a bug in a C program, because 1 is of type signed int and you may end up shifting into the sign bit of that signed int, which is undefined behavior. There exists no reason not to write 1u << instead, always.
Whenever you are dealing with hardware registers, you must always use volatile qualified pointers or you might end up with very strange bugs caused by the optimizer. No exceptions here either.
(unsigned int)0 is a pointless cast. If you wish to be explicit you could write 0u, but that doesn't change anything in the case of assignment, since an "lvalue conversion" to the type of the left operand always happens upon assignment. 0u does make some static analysers happy though, MISRA C checkers etc.
With all bug fixes, your clear function should for example look something like this:
void clear_register (uintptr_t reg) {
    *(volatile uint32_t*)reg = 0;
}
Or better yet, don't write functions for such very basic things! *(volatile uint32_t*)SET_BASE = 0; is perfectly clear, readable and self-documented code. While clear_register(SET_BASE) suggests that something more advanced than just a simple write is taking place.
However, you could have placed the cast inside the SET_BASE macro. For details and examples see How to access a hardware register from firmware?
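A sketch of that idea (the macro name and placement are illustrative, using the corrected 32-bit address from above):

/* The cast and the volatile qualifier live in one place */
#define SET_REG (*(volatile uint32_t *)0xFE20001Cu)

SET_REG = 0u;   /* clears the register; no helper function needed */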

Typecasting Arrays for Variable Width Access

Sorry, I am not sure if I wrote the title accurately.
But first, here are my constraints:
Array[], used as a register map, is declared as an unsigned 8-bit array (uint8_t); this is so that indexing (the offset) is per byte.
Data to be read/written into the array has varying width (8-bit, 16-bit, 32-bit and 64-bit).
Memory is very limited, and speed is a must.
What are the caveats in doing the following
uint8_t some_function(uint16_t offset_addr) // 16-bit address
{
    uint8_t Array[0x100];
    uint8_t data_byte = 0xAA;
    uint16_t data_word;
    uint32_t data_double = 0xBEEFFACE;
    // A. Storing wider data into the array
    *((uint32_t *) &Array[offset_addr]) = data_double;
    // B. Reading multiple bytes from the array
    data_word = *((uint16_t *) &Array[offset_addr]);
    return 0;
}
I know I could try writing the data per byte, but that would be slow due to bit shifting.
Is there going to be a significant problem with this usage?
I have run this on my hardware and have not seen any problems so far, but I want to take note of potential problems this implementation might cause.
Is there going to be a significant problem with this usage?
It produces undefined behavior. Therefore, even if in practice that manifests as you intend on your current C implementation, hardware, program, and data, you might find that it breaks unexpectedly when something (anything) changes.
Even if the compiler implements the cast and dereference in the obvious way (which it is not obligated to do, because UB) misaligned accesses resulting from your approach will at least slow many CPUs, and will produce traps on some.
The standard-conforming way to do what you want is this:
#include <string.h>  /* memcpy */

uint8_t some_function(uint16_t offset_addr) {
    uint8_t Array[0x100];
    uint8_t data_byte = 0xAA;
    uint16_t data_word;
    uint32_t data_double = 0xBEEFFACE;
    // A. Storing wider data into the array
    memcpy(Array + offset_addr, &data_double, sizeof data_double);
    // B. Reading multiple bytes from the array
    memcpy(&data_word, Array + offset_addr, sizeof data_word);
    return 0;
}
This is not necessarily any slower than your version, and it has defined behavior as long as you do not overrun the bounds of your array.
This is probably fine. Many have done things like this. C performs well with this kind of thing.
Two things to watch out for:
Buffer overruns. You know those zero-days like EternalBlue and attacks like WannaCry? Many of them exploited bugs in code like yours. Malicious input caused the code to write too much stuff into data structures like your uint8_t Array[0x100]. Be careful: avoid allocating buffers on the stack (as function-local variables) as you have done, because clobbering the stack is exploitable. Make them big enough. Check that you don't overrun them.
Machine byte ordering vs. network byte ordering, aka endianness. If these data structures move from machine to machine over the net you may get into trouble.

Most efficient way to store an unsigned 16-bit Integer to a file

I'm making a dictionary compressor in C with dictionary max size 64000. Because of this, I'm storing my entries as 16-bit integers.
What I'm currently doing:
To encode 'a', I get its ASCII value, 97, and then convert this number into a string representation of the 16-bit integer of 97. So I end up encoding '0000000001100001' for 'a', which obviously isn't saving much space in the short run.
I'm aware that more efficient versions of this algorithm would start with smaller integer sizes (fewer bits of storage until we need more), but I'm wondering if there's a better way to either:
Convert my integer 97 into an ASCII string of fixed length that can store 16 bits of data (97 would be x digits, 46347 would also be x digits), or
Write to a file that can ONLY store 1s and 0s. Because as it is, it seems like I'm writing 16 ASCII characters to a text file, each of which is 8 bits... so that's not really helping the cause much, is it?
Please let me know if I can be more clear in any way. I'm pretty new to this site. Thank you!
EDIT: How I store my dictionary is entirely up to me as far as I know. I just know that I need to be able to easily read the encoded file back and get the integers from it.
Also, I can only include stdio.h, stdlib.h, string.h, and header files I wrote for the program.
Please, do ignore the people suggesting that you just "write directly to the file". There are a number of issues with that, which ultimately fall into the category of "integer representation". While there appear to be some compelling reasons to write integers straight to external storage using fwrite or what-not, there are some solid facts in play here.
The bottleneck is the external storage controller - or the network, if you're writing a network application. Thus, writing two bytes as a single fwrite, or as two distinct fputc calls, should be roughly the same speed, provided your memory profile is adequate for your platform. You can adjust the amount of buffering that your FILE *s use with setvbuf (a power of two is a common choice for the buffer size), so we can always fine-tune per platform based on what our profilers tell us - though such tuning should ideally float gracefully upstream to the standard library, through gentle suggestions, to be useful for other projects too.
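For instance, a sketch of such tuning (the file name and the 64 KiB size are illustrative choices, not recommendations):

#include <stdio.h>

FILE *fd = fopen("entries.bin", "wb");
if (fd != NULL) {
    /* Request a 64 KiB fully buffered stream; setvbuf must be called
     * before any I/O. On failure the stream keeps its default buffer. */
    setvbuf(fd, NULL, _IOFBF, 1u << 16);
}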
Underlying integer representations are inconsistent between today's computers. Suppose you write unsigned ints directly to a file on system X, which uses 32-bit ints and big-endian representation: you'll have issues reading that file on system Y, which uses 16-bit ints and little-endian representation, or on system Z, which uses 64-bit ints with mixed-endian representation and 32 padding bits. Nowadays we have everything from computers 15 years old that people torture themselves with, to ARM big.LITTLE SoCs, smartphones and smart TVs, gaming consoles and PCs, all of which have their own quirks that fall outside the realm of standard C, especially with regards to integer representation, padding and so on.
C was developed with abstractions in mind that allow you to express your algorithm portably, so that you don't have to write different code for each OS! Here's an example of reading and converting four hex digits to an unsigned int value, portably:
unsigned int value;
int value_is_valid = fscanf(fd, "%04x", &value) == 1;
assert(value_is_valid); // #include <assert.h>
/* NOTE: Actual error handling should occur in place of that
 * assertion
 */
I should point out the reason why I chose %04x and not %08x or something more contemporary: going by the questions asked even today, there are unfortunately students using textbooks and compilers that are over 20 years old... Their int is 16-bit and, technically, their compilers are compliant in that aspect (though gcc and llvm really ought to be pushed throughout academia). With portability in mind, here's how I'd write that value:
value &= 0xFFFF;
fprintf(fd, "%04x", value);
// side-note: we often don't check the return value of fprintf, but it can
// become very important, particularly when dealing with streams and large files...
Supposing your unsigned int values occupy two bytes, here's how I'd read those two bytes, portably, using big endian representation:
int hi = fgetc(fd);
int lo = fgetc(fd);
unsigned int value = 0;
assert(hi >= 0 && lo >= 0); // again, proper error detection & handling logic should be here
value += hi & 0xFF; value <<= 8;
value += lo & 0xFF;
... and here's how I'd write those two bytes, in their big endian order:
fputc((value >> 8) & 0xFF, fd);
fputc(value & 0xFF, fd);
// and you might also want to check this return value (perhaps in a finely tuned end product)
Perhaps you're more interested in little endian. The neat thing is, the code really isn't that different. Here's input:
int lo = fgetc(fd);
int hi = fgetc(fd);
unsigned int value = 0;
assert(hi >= 0 && lo >= 0);
value += hi & 0xFF; value <<= 8;
value += lo & 0xFF;
... and here's output:
fputc(value & 0xFF, fd);
fputc((value >> 8) & 0xFF, fd);
For anything larger than two bytes (i.e. a long unsigned or long signed), you might want to fwrite((char unsigned[]){ value >> 24, value >> 16, value >> 8, value }, 1, 4, fd); or something for example, to reduce boilerplate. With that in mind, it doesn't seem abusive to form a preprocessor macro:
#define write(fd, ...) fwrite((char unsigned[]){ __VA_ARGS__ }, 1, sizeof ((char unsigned[]){ __VA_ARGS__ }), fd)
I suppose one might look at this like choosing the better of two evils: preprocessor abuse or the magic number 4 in the code above, because now we can write(fd, value >> 24, value >> 16, value >> 8, value); without the 4 being hard-coded... but a word for the uninitiated: side-effects might cause headaches, so don't go causing modifications, writes or global state changes of any kind in arguments of write.
Well, that's my update to this post for the day... Socially delayed geek person signing out for now.
What you are contemplating is using ASCII characters to save your numbers; this is completely unnecessary and very inefficient.
The most space-efficient way to do this (without using complex algorithms) would be to just dump the bytes of the numbers into the file (the number of bits would have to depend on the largest number you intend to save, or you could have multiple files for 8-bit, 16-bit, etc.).
Then when you read the file, you know that your numbers are located every x bits, so you just read them out one by one, or in a big chunk (or chunks), and then turn the chunk(s) into an array of a type that matches x bits.
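A minimal sketch of that approach for the 16-bit case (my code, not the answerer's), fixing a little-endian byte order so the file reads back the same on any host:

#include <stdio.h>
#include <stdint.h>

/* Write one 16-bit entry as exactly two bytes, low byte first */
static void put_u16(FILE *f, uint16_t v)
{
    fputc(v & 0xFF, f);
    fputc((v >> 8) & 0xFF, f);
}

/* Read one entry back; returns -1 on end of file */
static long get_u16(FILE *f)
{
    int lo = fgetc(f);
    int hi = fgetc(f);
    if (lo == EOF || hi == EOF)
        return -1;
    return (long)(((unsigned)hi << 8) | (unsigned)lo);
}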

is fread on a single integer affected by the endianness of my system

I'm currently working on a binary file format for some arbitrary values, including some strings, and string-length values, which are stored as uint32_t's.
But I was wondering: if I write the string length with fwrite to a file on a little-endian system, and read that value from the same file with fread on a big-endian system, will the order of the bytes be reversed? And if so, what is the best practice to fix that?
EDIT: Surely there has to be some GNU functionality around that does this for me, and that is used, tested and validated for, like, 20 years?
Yes, fwrite and fread on an integer makes your file format unportable to another endianness, as other answers correctly state.
As for best practice, I would discourage any conditional byte flipping and endianness testing at all. Decide on the endianness of your file format, then write and read bytes, and assemble integers from them by ORing and shifting.
In other words, I agree with Rob Pike on the issue.
If I write the string length with fwrite to a file on a little-endian system, and read that value from the same file with fread on a big-endian system, will the order of the bytes be reversed?
Yes. fwrite simply writes the memory contents to file in linear order. fread simply reads from file to memory in linear order.
What is the best practice to fix that?
Decide on an ordering for your files. Then write wrapper functions to write and read integers to/from files. Inside this function, conditionally flip the byte order if you're on a system with the opposite ordering.
(There are lots of questions here regarding determining the endianness of a system.)
Surely there has to be some GNU functionality around that does this for me
There's nothing in the standard library. However, POSIX defines a bunch of functions for this: ntohl, htonl, etc. They're typically used for network transfer, but could equally be used for files.
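For example (a sketch assuming an already-open FILE *f and a uint32_t length; error checking omitted; htonl/ntohl are POSIX, declared in <arpa/inet.h>):

#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <stdio.h>

/* Writing: convert host byte order to network (big-endian) order first */
uint32_t wire = htonl(length);
fwrite(&wire, sizeof wire, 1, f);

/* Reading: undo the conversion after fread */
fread(&wire, sizeof wire, 1, f);
length = ntohl(wire);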
Yes, it will, since fread() operates on raw bytes. If the order of bytes is different in memory, it will be different in the file too.
And if so, what is the best practice to fix that?
Detect the endianness of your system, and flip the bytes if it doesn't match the endianness of your file format.
int is_little_endian()
{
    uint32_t magic = 0x00000001;
    uint8_t black_magic = *(uint8_t *)&magic;
    return black_magic;
}

uint32_t to_little_endian(uint32_t dword)
{
    if (is_little_endian()) return dword;
    return (((dword >> 0) & 0xff) << 24)
         | (((dword >> 8) & 0xff) << 16)
         | (((dword >> 16) & 0xff) << 8)
         | (((dword >> 24) & 0xff) << 0);
}
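Usage could look like this (my sketch, assuming a uint32_t string_length and an open FILE *f; note the byte swap is its own inverse, so the same function serves both directions):

/* Writing: convert to the file's little-endian format first */
uint32_t out = to_little_endian(string_length);
fwrite(&out, sizeof out, 1, f);

/* Reading: convert again after fread */
uint32_t in;
fread(&in, sizeof in, 1, f);
string_length = to_little_endian(in);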
Linux provides
htobe16, htole16, be16toh, le16toh, htobe32, htole32, be32toh, le32toh, htobe64, htole64, be64toh, le64toh - convert values between host and big-/little-endian byte order
[https://linux.die.net/man/3/le32toh]

Safely punning char* to double in C

In an Open Source program I wrote, I'm reading binary data (written by another program) from a file and outputting ints, doubles, and other assorted data types. One of the challenges is that it needs to run on 32-bit and 64-bit machines of both endiannesses, which means that I end up having to do quite a bit of low-level bit-twiddling. I know a (very) little bit about type punning and strict aliasing and want to make sure I'm doing things the right way.
Basically, it's easy to convert from a char* to an int of various sizes:
int64_t snativeint64_t(const char *buf)
{
    /* Interpret the first 8 bytes of buf as a 64-bit int */
    return *(int64_t *) buf;
}
and I have a set of support functions to swap byte orders as needed, such as:
int64_t sswappedint64_t(const int64_t wrongend)
{
    /* Change the endianness of a 64-bit integer */
    return (((wrongend & 0xff00000000000000LL) >> 56) |
            ((wrongend & 0x00ff000000000000LL) >> 40) |
            ((wrongend & 0x0000ff0000000000LL) >> 24) |
            ((wrongend & 0x000000ff00000000LL) >> 8)  |
            ((wrongend & 0x00000000ff000000LL) << 8)  |
            ((wrongend & 0x0000000000ff0000LL) << 24) |
            ((wrongend & 0x000000000000ff00LL) << 40) |
            ((wrongend & 0x00000000000000ffLL) << 56));
}
At runtime, the program detects the endianness of the machine and assigns one of the above to a function pointer:
int64_t (*slittleint64_t)(const char *);

if (littleendian) {
    slittleint64_t = snativeint64_t;
} else {
    slittleint64_t = sswappedint64_t;
}
Now, the tricky part comes when I'm trying to cast a char* to a double. I'd like to re-use the endian-swapping code like so:
union {
    double d;
    int64_t i;
} int64todouble;

int64todouble.i = slittleint64_t(bufoffset);
printf("%lf", int64todouble.d);
However, some compilers could optimize away the int64todouble.i assignment and break the program. Is there a safer way to do this, while considering that this program must stay optimized for performance, and also that I'd prefer not to write a parallel set of transformations to cast char* to double directly? If the union method of punning is safe, should I be re-writing my functions like snativeint64_t to use it?
I ended up using Steve Jessop's answer, because the conversion functions, re-written to use memcpy like so:
int64_t snativeint64_t(const char *buf)
{
    /* Interpret the first 8 bytes of buf as a 64-bit int */
    int64_t output;
    memcpy(&output, buf, 8);
    return output;
}
compiled into the exact same assembler as my original code:
snativeint64_t:
    movq (%rdi), %rax
    ret
Of the two, the memcpy version more explicitly expresses what I'm trying to do and should work on even the most naive compilers.
Adam, your answer was also wonderful and I learned a lot from it. Thanks for posting!
I highly suggest you read Understanding Strict Aliasing. Specifically, see the sections labeled "Casting through a union". It has a number of very good examples. While the article is on a website about the Cell processor and uses PPC assembly examples, almost all of it is equally applicable to other architectures, including x86.
Since you seem to know enough about your implementation to be sure that int64_t and double are the same size, and have suitable storage representations, you might hazard a memcpy. Then you don't even have to think about aliasing.
Since you're using a function pointer for a function that might easily be inlined if you were willing to release multiple binaries, performance must not be a huge issue anyway. But you might like to know that some compilers can be quite fiendish when optimising memcpy: for small integer sizes, a set of loads and stores can be inlined, and you might even find the variables are optimised away entirely and the compiler does the "copy" simply by reassigning the stack slots it's using for the variables, just like a union.
int64_t i = slittleint64_t(buffoffset);
double d;
memcpy(&d,&i,8); /* might emit no code if you're lucky */
printf("%lf", d);
Examine the resulting code, or just profile it. Chances are even in the worst case it will not be slow.
In general, though, doing anything too clever with byteswapping results in portability issues. There exist ABIs with middle-endian doubles, where each word is little-endian, but the big word comes first.
Normally you could consider storing your doubles using sprintf and sscanf, but for your project the file formats aren't under your control. But if your application is just shovelling IEEE doubles from an input file in one format to an output file in another format (not sure if it is, since I don't know the database formats in question, but if so), then perhaps you can forget about the fact that it's a double, since you aren't using it for arithmetic anyway. Just treat it as an opaque char[8], requiring byteswapping only if the file formats differ.
The standard says that writing to one field of a union and then immediately reading from a different field is undefined behaviour. So if you go by the rule book, the union-based method won't work.
Macros are usually a bad idea, but this might be an exception to the rule. It should be possible to get template-like behaviour in C using a set of macros using the input and output types as parameters.
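A sketch of what that might look like - a macro that stamps out a byte-reversal function per unsigned width (the names here are illustrative):

#include <stddef.h>
#include <stdint.h>

/* Generate a byte-reversal function for a given unsigned type */
#define DEFINE_BSWAP(name, type)                      \
    static type name(type x)                          \
    {                                                 \
        type r = 0;                                   \
        for (size_t i = 0; i < sizeof(type); i++) {   \
            r = (type)((r << 8) | (x & 0xFF));        \
            x = (type)(x >> 8);                       \
        }                                             \
        return r;                                     \
    }

DEFINE_BSWAP(bswap16, uint16_t)
DEFINE_BSWAP(bswap32, uint32_t)
DEFINE_BSWAP(bswap64, uint64_t)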
As a very small sub-suggestion, I suggest you investigate if you can swap the masking and the shifting, in the 64-bit case. Since the operation is swapping bytes, you should be able to always get away with a mask of just 0xff. This should lead to faster, more compact code, unless the compiler is smart enough to figure that one out itself.
In brief, changing this:
((wrongend & 0xff00000000000000LL) >> 56)
into this:
((wrongend >> 56) & 0xff)
should generate the same result.
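Applied to the whole 64-bit function from the question, the shift-then-mask rewrite could look like this (my sketch; I also route the value through uint64_t, since shifting signed values around is a separate hazard):

int64_t sswappedint64_t(const int64_t wrongend)
{
    /* Shift first, then mask: every mask collapses to 0xff */
    uint64_t x = (uint64_t)wrongend;
    return (int64_t)(((x >> 56) & 0xff)         |
                     (((x >> 48) & 0xff) << 8)  |
                     (((x >> 40) & 0xff) << 16) |
                     (((x >> 32) & 0xff) << 24) |
                     (((x >> 24) & 0xff) << 32) |
                     (((x >> 16) & 0xff) << 40) |
                     (((x >> 8)  & 0xff) << 48) |
                     ((x & 0xff) << 56));
}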
Edit:
I removed my comments regarding how to effectively store data always big-endian and swap to machine endianness, since the questioner hasn't mentioned that another program writes his data (which is important information). Still, if the data needs conversion from any endian to big and from big to host endian, ntohs/ntohl/htons/htonl are the best methods, most elegant and unbeatable in speed (they will perform the task in hardware if the CPU supports that - you can't beat that).
Regarding double/float, just store them to ints by memory casting:
double d = 3.1234;
printf("Double %f\n", d);
int64_t i = *(int64_t *)&d;
// Now i contains the double value as int
double d2 = *(double *)&i;
printf("Double2 %f\n", d2);
Wrap it into a function
int64_t doubleToInt64(double d)
{
    return *(int64_t *)&d;
}

double int64ToDouble(int64_t i)
{
    return *(double *)&i;
}
Questioner provided this link:
http://cocoawithlove.com/2008/04/using-pointers-to-recast-in-c-is-bad.html
as proof that casting is bad... unfortunately I can only strongly disagree with most of this page. Quotes and comments:
As common as casting through a pointer is, it is actually bad practice and potentially risky code. Casting through a pointer has the potential to create bugs because of type punning.
It is not risky at all, and it is also not bad practice. It only has the potential to cause bugs if you do it incorrectly - just as programming in C has the potential to cause bugs if you do it incorrectly, as does programming in any language. By that argument you would have to stop programming altogether.
Type punning: A form of pointer aliasing where two pointers refer to the same location in memory but represent that location as different types. The compiler will treat both "puns" as unrelated pointers. Type punning has the potential to cause dependency problems for any data accessed through both pointers.
This is true, but unfortunately totally unrelated to my code.
What he refers to is code like this:
int64_t * intPointer;
:
// Init intPointer somehow
:
double * doublePointer = (double *)intPointer;
Now doublePointer and intPointer both point to the same memory location, but treat it as different types. This is the situation you should indeed solve with a union; anything else is pretty bad. But that is not what my code does!
My code copies by value, not by reference. I cast a double to an int64_t pointer (or the other way round) and immediately dereference it. Once the function returns, there is no pointer held to anything. There is an int64_t and a double, and these are totally unrelated to the input parameter of the function. I never copy any pointer to a pointer of a different type (if you saw this in my code sample, you strongly misread the C code I wrote); I just transfer the value to a variable of a different type (in its own memory location). So the definition of type punning does not apply at all, as it says "refer to the same location in memory", and nothing here refers to the same memory location.
int64_t intValue = 12345;
double doubleValue = int64ToDouble(intValue);
// The statement below will not change the value of doubleValue!
// They are not pointing to the same memory location; both have their
// own storage space on the stack and are totally unrelated.
intValue = 5678;
My code is nothing more than a memory copy, just written in C without an external function.
int64_t doubleToInt64(double d)
{
    return *(int64_t *)&d;
}

Could be written as

int64_t doubleToInt64(double d)
{
    int64_t result;
    memcpy(&result, &d, sizeof(d));
    return result;
}
It's nothing more than that, so there is no type punning even in sight anywhere. And this operation is as safe as an operation can be in C. A double is 64 bits on essentially every platform you will encounter (IEEE 754 binary64; unlike int, its size does not vary in practice, though the C standard does not strictly guarantee it), hence it will fit into an int64_t sized variable.
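If you rely on that size assumption, it costs nothing to have the compiler check it for you (my addition; static_assert is C11, exposed via <assert.h>):

#include <assert.h>
#include <stdint.h>

static_assert(sizeof(double) == sizeof(int64_t),
              "this code assumes doubles are 64 bits wide");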
