Skip/avoid alignment padding bytes when calculating struct checksum - c

Is there a generic way to skip/avoid alignment padding bytes when calculating the checksum of a C struct?
I want to calculate the checksum of a struct by summing the bytes. The problem is, the struct has alignment padding bytes which can get random (unspecified) values and cause two structs with the identical data to get different checksum values.
Note: I'm mainly concerned about maintainability (adding/removing/modifying fields without the need to update the code) and reusability, not about portability (the platform is very specific and unlikely to change).
Currently, I found a few solutions, but they all have disadvantages:
Pack the struct (e.g. #pragma pack (1)). Disadvantage: I prefer to avoid packing for better performance.
Calculate checksum field by field. Disadvantage: The code will need to be updated when modifying the struct and requires more code (depending on the number of fields).
Set to zero all struct bytes before setting values. Disadvantage: I cannot fully guarantee that all structs were initially zeroed.
Arrange the struct fields to avoid padding and possibly add dummy fields to fill padding. Disadvantage: Not generic, the struct will need to be carefully rearranged when modifying the struct.
Is there a better generic way?
Calculating checksum example:
unsigned int calcCheckSum(MyStruct* myStruct)
{
unsigned int checkSum = 0;
unsigned char* bytes = (unsigned char*)myStruct;
unsigned int byteCount = sizeof(MyStruct);
for(int i = 0; i < byteCount; i++)
{
checkSum += bytes[i];
}
return checkSum;
}

Is there a generic way to skip/avoid alignment padding bytes when calculating the checksum of a C struct?
There is no such mechanism on which a strictly conforming program can rely. This follows from
the fact that C implementations are permitted to lay out structures with arbitrary padding following any member or members, for any reason or none, and
the fact that
When a value is stored in an object of structure or union type,
including in a member object, the bytes of the object representation
that correspond to any padding bytes take unspecified values.
(C2011, 6.2.6.1/6)
The former means that the standard provides no conforming way to guarantee that a structure layout contains no padding, and the latter means that in principle, there is nothing you can do to control the values of the padding bytes -- even if you initially zero-fill a structure instance, any padding takes indeterminate values as soon as you assign to that object or to any of its members.
In practice, it is likely that any of the approaches you mention in the question will do the job where the C implementation and the nature of the data permit. But only (2), computing the checksum member by member, can be used by a strictly-conforming program, and that one is not "generic" as I take you to mean that term. This is what I would choose. If you have many distinct structures that require checksumming, then it might be worthwhile to deploy a code generator or macro magic to help you with maintaining things.
On the other hand, your most reliable way to provide for generic checksumming is to exercise an implementation-specific extension that enables you to avoid structures containing any padding (your (1)). Note that this will tie you to a specific C implementation or implementations that implement such an extension compatibly, that it may not work at all on some systems (such as those where misaligned access is a hard error), and that it may reduce performance on other systems.
Your (4) is an alternative way to avoid padding, but it would be a portability and maintenance nightmare. Nevertheless, it could provide for generic checksumming, in the sense that the checksum algorithm wouldn't need to pay attention to individual members. But note also that this also places a requirement for initialization behavior analogous to (3). That would come cheaper, but it would not be altogether automatic.
In practice, C implementations do not wantonly modify padding bytes, but they don't necessarily go out of their way to preserve them, either. In particular, even if you zero-filled rigorously, per your (3), padding is not guaranteed to be copied by whole-structure assignment or when you pass or return a structure by value. If you want to do any of those things then you need to take measures on the receiving side to ensure zero-filling, and requires member-by-member attention.

This sounds like an XY problem. Computing a checksum for a C object in memory is not usually a meaningful operation; the result is dependent on the C implementation (arch/ABI if not even the specific compiler) and C does not admit a fault-tolerance programming model able to handle the possibility of object values changing out from under you due to hardware faults of memory-safety errors. Checksums make sense mainly for serialized data on disk or in transit over a network where you want to guard against data corruption in storage/transit. And C structs are not for serialization (although they're commonly abused for it). If you write proper serialization routines, you then can just do the checksum on the serialized byte stream.

Related

Should I use bit-fields for mapping incoming serial data?

We have data coming in over serial (Bluetooth), which maps to a particular structure. Some parts of the structure are sub-byte size, so the "obvious" solution is to map the incoming data to a bit-field. What I can't work out is whether the bit-endianness of the machine or compiler will affect it (which is difficult to test), and whether I should just abandon the bit-fields altogether.
For example, we have a piece of data which is 1.5 bytes, so we used the struct:
{
uint8_t data1; // lsb
uint8_t data2:4; // msb
uint8_t reserved:4;
} Data;
The reserved bits are always 1
So for example, if the incoming data is 0xD2,0xF4, the value is 0x04D2, or 1234.
The struct we have used is always working on the systems we have tested on, but we need it to be as portable as possible.
My questions are:
Will data1 always represent the correct value as expected regardless of endianness (I assume yes, and that the hardware/software interface should always handle that correctly for a single, whole byte - if 0xD2 is sent, 0xD2 should be received)?
Could data2 and reserved be the wrong way around, with data2 representing the upper 4 bits instead of the lower 4 bits?
If yes:
Is the bit endianness (generally) dependent on the byte endianness, or can they differ entirely?
Is the bit-endianness determined by the hardware or the compiler? It seems all linux systems on Intel are the same - is that true for ARM as well? (If we can say we can support all Intel and ARM linux builds, we should be OK)
Is there a simple way to determine in the compiler which way around it is, and reserve the bit-fields entries if needed?
Although bit-fields are the neatest way, code-wise, to map the incoming data, I suppose I am just wondering if it's a lot safer to just abandon them, and use something like:
struct {
uint8_t data1; // lsb (0xFF)
uint8_t data2; // msb (0x0F) & reserved (0xF0)
} Data;
Data d;
int value = (d.data2 & 0x0F) << 16 + d.data1
The reason we have not just done this in the first place is because a number of the data fields are less than 1 byte, rather than more than 1 - meaning that generally with a bit-field we don't have to do any masking and shifting, so the post-processing is simpler.
Should I use bit-fields for mapping incoming serial data?
No. Bit-fields have a lot of implementation specified behaviour that makes using them a nightmare.
Will data1 always represent the correct value as expected regardless of endianness.
Yes, but that is because uint8_t is smallest possible addressable unit: a byte. For larger data types you need to take care of the byte endianness.
Could data2 and reserved be the wrong way around, with data2 representing the upper 4 bits instead of the lower 4 bits?
Yes. They could also be on different bytes. Also, compiler doesn't have to support uint8_t for bitfields, even if it would support the type otherwise.
Is the bit endianness (generally) dependent on the byte endianness, or can they differ entirely?
The least signifact bit will always be in the least significant byte, but it's impossible to determine in C where in the byte the bit will be.
Bit shifting operators give reliable abstraction of the order that is good enough: For data type uint8_t the (1u << 0) is always the least significant and (1u << 7) the most significant bit, for all compilers and for all architectures.
Bit-fields on the other hand are so poorly defined that you cannot determine the order of bits by the order of your defined fields.
Is the bit-endianness determined by the hardware or the compiler?
Compiler dictates how datatypes map to actual bits, but hardware heavily influences it. For bit-fields, two different compilers for the same hardware can put fields in different order.
Is there a simple way to determine in the compiler which way around it is, and reserve the bit-fields entries if needed?
Not really. It depends on your compiler how to do it, if it's possible at all.
Although bit-fields are the neatest way, code-wise, to map the incoming data, I suppose I am just wondering if it's a lot safer to just abandon them, and use something like:
Definitely abandon bit-fields, but I would also recommend abandoning structures altogether for this purpose, because:
You need to use compiler extensions or manual work to handle byte order.
You need to use compiler extensions to disable padding to avoid gaps due to alignment restrictions. This affects member access performance on some systems.
You cannot have variable width or optional fields.
It's very easy to have strict aliasing violations if you are unaware of those issues. If you define byte array for the data frame and cast that to pointer to structure and then dereference that, you have problems in many cases.
Instead I recommend doing it manually. Define byte array and then write each field into it manually by breaking them apart using bit shifting and masking when necessary. You can write a simple reusable conversion functions for the basic data types.

Are there reasons to avoid bit-field structure members?

I long knew there are bit-fields in C and occasionally I use them for defining densely packed structs:
typedef struct Message_s {
unsigned int flag : 1;
unsigned int channel : 4;
unsigned int signal : 11;
} Message;
When I read open source code, I instead often find bit-masks and bit-shifting operations to store and retrieve such information in hand-rolled bit-fields. This is so common that I do not think the authors were not aware of the bit-field syntax, so I wonder if there are reasons to roll bit-fields via bit-masks and shifting operations your own instead of relying on the compiler to generate code for getting and setting such bit-fields.
Why other programmers use hand-coded bit manipulations instead of bitfields to pack multiple fields into a single word?
This answer is opinion based as the question is quite open:
Many programmers are unaware of the availability of bitfields or unsure about their portability and precise semantics. Some even distrust the compiler's ability to produce correct code. They prefer to write explicit code that they understand.
As commented by Cornstalks, this attitude is rooted in real life experience as explained in this article.
Bitfield's actual memory layout is implementation defined: if the memory layout must follow a precise specification, bitfields should not be used and hand-coded bit manipulations may be required.
The handing of signed values in signed typed bitfields is implementation defined. If signed values are packed into a range of bits, it may be more reliable to hand-code the access functions.
Are there reasons to avoid bitfield-structs?
bitfield-structs come with some limitations:
Bit fields result in non-portable code. Also, the bit field length has a high dependency on word size.
Reading (using scanf()) and using pointers on bit fields is not possible due to non-addressability.
Bit fields are used to pack more variables into a smaller data space, but cause the compiler to generate additional code to manipulate these variables. This results in an increase in both space as well as time complexities.
The sizeof() operator cannot be applied to the bit fields, since sizeof() yields the result in bytes and not in bits.
Source
So whether you should use them or not depends. Read more in Why bit endianness is an issue in bitfields?
PS: When to use bit-fields in C?
There is no reason for it. Bitfields are useful and convenient. They are in the common use in the embedded projects. Some architectures (like ARM) have even special instructions to manipulate bitfields.
Just compare the code (and write the rest of the function foo1)
https://godbolt.org/g/72b3vY
In many cases, it is useful to be able to address individual groups of bits within a word, or to operate on a word as a unit. The Standard presently does not provide
any practical and portable way to achieve such functionality. If code is written to use bitfields and it later becomes necessary to access multiple groups as a word, there would be no nice way to accommodate that without reworking all the code using the bit fields or disabling type-based aliasing optimizations, using type punning, and hoping everything gets laid out as expected.
Using shifts and masks may be inelegant, but until C provides a means of treating an explicitly-designated sequence of bits within one lvalue as another lvalue, it is often the best way to ensure that code will be adaptable to meet needs.

The use of __int8_t on some small values on 32bit machine is useless optimization effort?

I have some values on my functions that are < 10, normally. So,if I use __int8_t instead of int to store this values is useless effort of optimization?
Not only can this be a useless optimization, but you may cause integer misalignment (i.e., integers not aligned to 4-byte or 8-byte boundaries depending on the architecture) which may actually slow performance. To get around this, most compilers try to align integers on most architectures, so you may not be saving any space at all, because the compiler adds padding (e.g., for stack based variables and struct members) to properly align larger variables or struct members that follow your 8-bit integer. Since your question looked like it was regarding a local variable in a function and assuming that there was only one such variable and that the function was not recursive, the optimization is most likely unnecessary and you should just use the platform native integer size (i.e., int).
The optimization you mentioned can be worthwhile if you have very many instances of a particular type and you want to consume less memory, by using 8-bit fields instead of say 64-bit fields for small integers in a struct. If this is your intention, be sure to use the correct pragmas and/or compiler switches for your platform, so that structs get properly packed to the smallest size. Another instance where using a smaller integer size can come in handy is when arrays, especially large ones, are involved. Finally, keep in mind that specifying the number of bits in the type is non-portable, as the leading double underscore in its name indicated.
I hazard to mention this, since it appears to only be tangentially related to your question, but you can go even smaller than 8-bit ints in structs, for even smaller ranges of values than 0-255. In this case, bit fields become an option - the epitomy of micro-optimization. Don't use bit fields unless you have actually measured performance in the time and space domains and are sure that there are significant savings to be had. To disuade you further, I present some of the pitfalls of using them in the following paragraph.
Bit fields add more complexity to the development process by forcing you to manually order struct members to pack data into as few bytes as possible - a task that is not only arduous, but also often leads to a completely unintuitive struct member variable ordering. Changes, additions and deletions of member variables often tend to cascade to the members that follow them causing a maintenance problem down the road. To get around this problem, developers often stick padding and alignment bits into the struct, complicating the process further for all but the original developer of the code. Commenting such code stops being a nicety and often becomes a necessity. There is also the problem of developers forgetting that they are dealing with a bit field within the code and treating it as if it had the entire domain of its underlying type, leading to no end of silent and sometimes difficult to debug under/overflows.
Given the above guidelines, profile your application to see whether the optimizations just discussed make sense for your particular compiler(s) and platform(s).
It won't save time, but if you have a million of these things, it will save 3mb of space (and that may save time).

Questions about memory alignement in structures and portability of the sizeof operator

I have a question about structure padding and memory alignment optimizations regarding structures in C language. I am sending a structure over the network, I know that, for run-time optimizations purposes, the memory inside a structure is not contiguous. I've run some tests on my local computer and indeed, sizeof(my_structure) was different than the sum of all my structure members. I ran some research to find out two things :
First, the sizeof() operator retrieves the padded size of the structure (i.e the real size that would be stored in memory).
When specifying __attribute__((__packed__)) in the declaration of the structure this optimization is disabled by the compiler, so sizeof(my_structure) will be exactly the same as the sum of the fields of my structure.
That being said, i am wondering if the sizeof operator was getting the padded size on every compilers implementation and on every architecture, in other words, is it always safe to copy a structure with memcpy for example using the sizeof operator such as :
memcpy(struct_dest, struct_src, sizeof(struct_src));
I am also wondering what is the real purpose of __attribute__((__packed__)), is it used to send a less important amount the data on a network when submitting a structure or is it, in fact, used to avoid some unspecified and platform-dependant sizeof operator behaviour ?
Thanks by advance.
Different compilers on different architectures can and do use different padding. So for wire transmission it is not uncommon to pack structs to achieve a consistent binary layout. This can then cater for the code at each end of the wire running on different architecture.
However you also need to make sure that your data types are the same size if you use this approach. For example, on 64 bit systems, long is 4 bytes on Windows and 8 bytes almost everywhere else. And you also need to deal with endianness issues. The standard is to transmit over the wire in network byte order. In practice you would be better using a dedicated serialization library rather than trying to reinvent solutions to all these issues.
I am sending a structure over the network
Stop there. Perhaps some would disagree with me on this (in practice you do see a lot of projects doing this), but struct is a way of laying out things in memory - it's not a serialization mechanism. By using this tool for the job, you're already tying yourself to a bunch of non-portable assumptions.
Sure, you may be able to fake it with things like structure padding pragmas and attributes, but - can you really? Even with those non-portable mechanisms you never know what quirks might show up. I recall working in a code base where "packed" structures were used, then suddenly taking it to a platform where access had to be word aligned... even though it was nominally the same compiler (thus supported the same proprietary extensions) it produced binaries which crashed. Any pain you get from this path is probably deserved, and I would say only take it if you can be 100% sure it will only run in a given compiler and environment, and that will never change. I'd say the safer bet is to write a proper serialization mechanism that doesn't allow writing structures around across process boundaries.
Is it always safe to copy a structure with memcpy for example using the sizeof operator
Yes, it is and that is the purpose of providing the sizeof operator.
Usually __attribute__((__packed__)) is used not for size considerations but when you want want to to make sure of the layout of a structure is exactly as you want it to be.
For ex:
If a structure is to be used to match hardware or be sent on a wire then it needs to have the exact same layout without any padding.This is because different architectures usually implement different kinds & amounts of padding and alignment and the only way to ensure common ground is to remove padding out out of the picture by using packing.

Why can't I declare bitfields as automatic variables?

I want to declare a bitfield with the size specified using the a colon (I can't remember what the syntax is called). I want to write this:
void myFunction()
{
unsigned int thing : 12;
...
}
But GCC says it's a syntax error (it thinks I'm trying to write a nested function). I have no problem doing this though:
struct thingStruct
{
unsigned int thing : 4;
};
and then putting one such struct on the stack
void myFunction()
{
struct thingStruct thing;
...
}
This leads me to believe that it's being prevented by syntax, not semantic issues.
So why won't the first example work? What am I missing?
The first example won't work because you can only declare bitfields inside structs. This is syntax, not semantics, as you said, but there it is. If you want a bitfield, use a struct.
Why would you want to do such a thing? A bit field of 12 would on all common architectures be padded to at least 16 or 32 bits.
If you want to ensure the width of an integer variable use the types in inttypes.h, e.g int16_t or int32_t.
As others have said, bitfields must be declared inside a struct (or union, but that's not really useful). Why? Here are two reasons.
Mainly, it's to make the compiler writer's job easier. Bitfields tend to require more machine instructions to extract the bits from the bytes. Only fields can be bitfields, and not variables or other objects, so the compiler writer doesn't have to worry about them if there is no . or -> operator involved.
But, you say, sometimes the language designers make the compiler writer's job harder in order to make the programmer's life easier. Well, there is not a lot of demand from programmers for bitfields outside structs. The reason is that programmers pretty much only bother with bitfields when they're going to cram several small integers inside a single data structure. Otherwise, they'd use a plain integral type.
Other languages have integer range types, e.g., you can specify that a variable ranges from 17 to 42. There isn't much call for this in C because C never requires that an implementation check for overflow. So C programmers just choose a type that's capable of representing the desired range; it's their job to check bounds anyway.
C89 (i.e., the version of the C language that you can find just about everywhere) offers a limited selection of types that have at least n bits. There's unsigned char for 8 bits, unsigned short for 16 bits and unsigned long for 32 bits (plus signed variants). C99 offers a wider selection of types called uint_least8_t, uint_least16_t, uint_least32_t and uint_least64_t. These types are guaranteed to be the smallest types with at least that many value bits. An implementation can provide types for other number of bits, such as uint_least12_t, but most don't. These types are defined in <stdint.h>, which is available on many C89 implementations even though it's not required by the standard.
Bitfields provide a consistent syntax to access certain implementation-dependent functionality. The most common purpose of that functionality is to place certain data items into bits in a certain way, relative to each other. If two items (bit-fields or not) are declared as consecutive items in a struct, they are guaranteed to be stored consecutively. No such guarantee exists with individual variables, regardless of storage class or scope. If a struct contains:
struct foo {
unsigned bar: 1;
unsigned boz: 1;
};
it is guaranteed that bar and boz will be stored consecutively (most likely in the same storage location, though I don't think that's actually guaranteed). By contrast, 'bar' and 'boz' were single-bit automatic variables, there's no telling where they would be stored, so there'd be little benefit to having them as bitfields. If they did share space with some other variable, it would be hard to make sure that different functions reading and writing different bits in the same byte didn't interfere with each other.
Note that some embedded-systems compilers do expose a genuine 'bit' type, which are packed eight to a byte. Such compilers generally have an area of memory which is allocated for storing nothing but bit variables, and the processors for which they generate code have atomic instructions to test, set, and clear individual bits. Since the memory locations holding the bits are only accessed using such instructions, there's no danger of conflicts.

Resources