Casting uint64_t on bitfield - c

I found code where bitfield is used for network messages. I would like to know what casting bitfield_struct data = *(bitfield_struct *)&tmp; exaclty does and how it's syntax work. Won't it violate the strict aliasing rule? Here is part of code:
typedef struct
{
unsigned var1 : 1;
unsigned var2 : 13;
unsigned var3 : 8;
unsigned var4 : 10;
unsigned var5 : 7;
unsigned var6 : 12;
unsigned var7 : 7;
unsigned var8 : 6;
} bitfield_struct;
void print_data(u_int64_t * raw, FILE * f, int no_object)
{
uint64_t tmp = ntohll(*raw);
bitfield_struct data = *(bitfield_struct *)&tmp;
...
}

Won't it violate the strict aliasing rule?
Yes it will, so the code invokes undefined behavior. It is also highly non-portable:
We don't know the size of the abstract item called "addressable storage unit" that the given system uses. It isn't necessarily 64 bits, so there could in theory be padding and other nasty things hidden in the bit-field. 64 bit unsigned is fishy.
Neither do we know if the bit-field uses the same bit-order as uint64_t. Nor can we know if they use the same endianess.
If individual bit (fields) of the uint64_t need to be accessed, I would recommend doing so using bitwise shifts, as that makes the code fully portable even between different endianess architectures. Then you don't need the non-portable ntohll call either.

What it does (or attempts to do) is quite straightforward.
uint64_t tmp = ntohll(*raw);
This line takes the value in pointer raw, reverses the byte-order and copies it into temp.
bitfield_struct data = *(bitfield_struct *)&tmp;
This line reinterprets the data in temp (which was a uint64) as type bitfield_struct and copies it into data. This is basically the equivalent of doing:
/* Create a bitfield_struct pointer that points to tmp */
bitfield_struct *p = (bitfield_struct *)&tmp;
/* Copy the value in tmp to data */
bitfield_struct data = *p;
This because normally bitfield_struct and uint64 are incompatible types and you cannot assign one to the other with just bitfield_struct data = tmp;
The code presumably continues to access fields within the bitfield through data, such as data.var1.
Now, like people pointed out, there are several issues which makes this code unreliable and non-portable.
Bit-fields are heavily implementation-dependent. Solution? Read the manual and figure out how your specific compiler variant treats bit-fields. Or don't use bitfields at all.
There is no guarantee that a uint64_t and bitfield_struct have the same alignment. Which means there could be padding which can completely offset your expectations and make you end up with wrong data. One solution is to use memcpy to copy instead of pointers, which might let you this particular issue. Or specify packed alignment using the mechanism provided by your compiler.
The code invokes UB when strict aliasing rules are applied. Solution? Most compilers will have a no-strict-aliasing flag that can be enabled, at a performance cost. Or even better, create a union type with bitfield_struct and uint64_t and use this to reinterpret between one and the other. This is allowed even with the strict-aliasing rules. Using memcpy is also legal, since it treats the data as an array of chars.
However, the best thing to do is not use this piece of code at all. As you may have noticed, it relies too much on compiler and platform specific stuff. Instead, try to accomplish the same thing using bit masks and shifts. This gets rid of all three problems mentioned above, without needing special compiler flags or having to face any real question of portability. Most importantly, it saves other developers reading your code, from having to worry about such things in the future.

Right to left:
&tmp Take address of tmp
(bitfield_struct *)&tmp Address of tmp is address to data of type bitfield_struct
*(bitfield_struct *)&tmp Extract value out of tmp, assuming it's bitfield_struct data
bitfield_struct data = *(bitfield_struct *)&tmp; Store tmp to data, assuming that tmp is bitfield_struct
So it's just copy using extra pointers to avoid compilation errors/warnings of incompatible types.
What you may not understand is bit-addressing of structure.
unsigned var1 : 1;
unsigned var2 : 13;
Here you will find some more info about it: https://www.tutorialspoint.com/cprogramming/c_bit_fields.htm

Related

Unaligned memory acces with array

In a C program, having an array that is meant to work as a buffer FooBuffer[] for storing the contents of a data member of a struct like this:
struct Foo {
uint64_t data;
};
I was told that this line might cause unaligned access:
uint8_t FooBuffer[10] = {0U};
I have some knowledge that unaligned access depends on the alignment offset of the processor and in general, it consumes more read/write cycles. Under what circumstances would this cause unaligned memory access and how could I prevent it?
Edit:
A variable of type struct Foo would be stored in the buffer. Particularly, its member data would be split up into eight bytes that would be stored in the array FooBuffer. See attached code with some options for this.
#include <stdio.h>
#include <string.h>
typedef unsigned long uint64;
typedef unsigned char uint8;
struct Foo
{
uint64 data;
};
int main()
{
struct Foo foo1 = {0x0123456789001122};
uint8 FooBuffer[10] = {0U};
FooBuffer[0] = (uint8)(foo1.data);
FooBuffer[1] = (uint8)(foo1.data >> 8);
FooBuffer[2] = (uint8)(foo1.data >> 16);
FooBuffer[3] = (uint8)(foo1.data >> 24);
FooBuffer[4] = (uint8)(foo1.data >> 32);
FooBuffer[5] = (uint8)(foo1.data >> 40);
FooBuffer[6] = (uint8)(foo1.data >> 48);
FooBuffer[7] = (uint8)(foo1.data >> 56);
struct Foo foo2 = {0x9876543210112233};
uint8 FooBuffer2[10] = {0U};
memcpy(FooBuffer2, &foo2, sizeof(foo2));
return 0;
}
However, it is not clear how this process is done since a piece of privative software performs the operation. What would be the scenarios that could result in unaligned memory access after the "conversion"?
Defining either a structure such as struct Foo { uint64_t data; } or an array such as uint8_t FooBuffer[10]; and using them in normal ways will not cause an unaligned access. (Why did you use 10 for FooBuffer? Only 8 bytes are needed for this example?)
A method that novices sometimes attempt that can cause unaligned accesses is attempting to reinterpret an array of bytes as a data structure. For example, consider:
// Get raw bytes from network or somewhere.
uint8_t FooBuffer[10];
CallRoutineToReadBytes(FooBuffer,...);
// Reinterpret bytes as original type.
struct Foo x = * (struct Foo *) FooBuffer; // Never do this!
The problem here is that struct Foo has some alignment requirement, but FooBuffer does not. So FooBuffer could be at any address, but the cast to struct Foo * attempts to force it to an address for a struct Foo. If the alignment is not correct, the behavior is not defined by the C standard. Even if the system allows it and the program “works,” it may be accessing a struct Foo at an improperly aligned address and suffering performance problems.
To avoid this, a proper way to reinterpret bytes is to copy them into a new object:
struct Foo x;
memcpy(&x, FooBuffer, sizeof x);
Often a compiler will recognize what is happening here and, especially if struct Foo is not large, implement the memcpy in an efficient way, perhaps as two load-four-byte instructions or one load-eight-byte instruction.
Something you can do to help that along is ask the compiler to align FooBuffer by declaring it with the _Alignas keyword:
uint8_t _Alignas(Struct Foo) FooBuffer[10];
Note that that might not help if you need to take bytes from the middle of a buffer, such as from a network message that includes preceding protocol bytes and other data. And, even if it does give the desired alignment, never use the * (struct Foo *) FooBuffer shown above. It has more problems than just alignment, one of which is that the C standard does not guarantee the behavior of reinterpreting data like this. (A supported way to do it in C is through unions, but memcpy is a fine solution.)
In the code you show, bytes are copied from foo1.data to FooBuffer using bit shifts. This also will not cause alignment problems; expressions that manipulate data like this work just fine. But there are two issues with it. One is that it nominally manipulates individual bytes one by one. That is perfectly legal in C, but it can be slow. A compiler might optimize it, and there might be built-ins or library functions to assist with it, depending on your platform.
The other issue is that it puts the bytes in an order according to their position values: The low-position-value bytes are put into the buffer first. In contrast, the memcpy method copies the bytes in the order they are stored in memory. Which method you want to use depends on the problem you are trying to solve. To store data on one system and read it back later on the same system, the memcpy method is fine. To send data between two systems using the same byte ordering, the memcpy method is fine. However, if you want to send data from one system on the Internet to another, and the two systems do not use the same byte order in memory, you need to agree on an order to use in the network packages. In this case, it is common to use the arrange-bytes-by-position-value method. Again, your platform may have builtins or library routines to assist with this. For example, the htonl and ntohl routines are BSD routines that take a normal 32-bit unsigned integer and return it with its bytes arranged for network order or vice-versa.

Suspicious pointer-to-pointer conversion (area too small) in C

uint8_t *buf;
uint16_t *ptr = (uint16_t *)buf;
I feel the above code is correct but I get "Suspicious pointer-to-pointer conversion (area too small)" Lint warning. Anyone knows how to solve this warning?
IMHO, don't cast the pointer at all. It changes the way (representation) a variable (and corresponding memory) is accessed, which is very problematic.
If you have to, cast the value instead.
Its a problem in your code
In line uint8_t buf you must have assigned it with an memory area which is 8 bits long
and then in line uint16_t *ptr = (uint16_t *)buf; you try to assign it to a pointer that will access it as 16 bits.
When you try to access with ptr variable you are acessing memory beyond what is assigned to you and its a undefined behaviour.
The problem here is strict aliasing. Imagine that when you have a case like this:
uint8_t raw_data [N]; // chunk of raw data
uint16_t val = 1;
memcpy(raw_data, &val, sizeof(val)); // copy an uint16_t into this array
uint16_t* lets_go_crazy = (uint16_t*)raw_data;
*lets_go_crazy = 5;
print(raw_data);
Assume the print function prints all the values of the bytes. You might then think that the two first bytes have changed to contain your machine's representation of an uint16_t containing the value 5. Not necessarily so.
Because the compiler on the other hand is free to assume that you have never modified the raw_data array since the memcpy, because an uint16_t is not allowed to alias an uint8_t. So it might optimize the whole code into something like:
uint8_t raw_data [N]; // chunk of raw data
uint16_t val = 1;
memcpy(raw_data, &val, sizeof(val)); // copy an uint16_t into this array
print(raw_data);
What will happen is undefined, because the strict aliasing was broken.
Though note that your code might be perfectly fine if, and only if, you can guarantee that your particular compiler does not apply strict aliasing to whatever scenario you have. There are compiler options to disable it on some compilers.
Alternatively, you could use a pointer to a struct or union containing an uint16_t, in which case the aliasing does not apply.

C Casting from pointer to uint32_t to a pointer to a union containing uint32_t

I'd like to know if casting a pointer to uint32_t to a pointer of a union containing a uint32_t will lead to defined behavior in C, i.e.
typedef union
{
uint8_t u8[4];
uint32_t u32;
} T32;
void change_value(T32 *t32)
{
t32->u32 = 5678;
}
int main()
{
uint32_t value = 1234;
change_value((T32 *)&value); // value is 5678 afterwards
return EXIT_SUCCESS;
}
Is this valid C? Many thanks in advance.
The general answer to your question is, no, this is in general not defined. If the union contains a field that has larger alignment than uint32_t such a union must have the largest alignment and accessing that pointer would then lead to UB. This could e.g happen if you replace uint8_t in your example by double.
In your particular case, though, the behavior is well defined. uint8_t, if it exists, is most likely nothing other than unsigned char and all character types always have the least alignment requirement.
Edit:
As R.. mentions in his comments there are other issues with your approach. First, theoretically, uint8_t could be different from unsigned char if there is an unsigned "extended integer type" of that width. This is very unlikely, I never heard of such an architecture. Second, your approach is subject to aliasing issues, so you should be extremely careful.
At the risk of incurring downvotes... Conceptually, there is nothing wrong with what you are trying to do. That is, define a piece of storage that can be viewed as four bytes an a 32 bit integer, and then reference and modify that storage using a pointer.
However, I would ask why you would want to write code where its intent is obscured. What you are really doing is forcing the next programmer who reads your code to think for minutes and maybe even try a little test program. Thus, this programming style is "expensive".
You could have just as easily defined, value as:
T32 value;
// etc.
change_value(&value);
and then avoid the cast and subsequent angst.
Since all union members are guaranteed to start at the same memory address, your program as written does not lead to undefined behavior.

Portable way to find size of a packed structure in C

I'm coding a network layer protocol and it is required to find a size of packed a structure defined in C. Since compilers may add extra padding bytes which makes sizeof function useless in my case. I looked up Google and find that we could use ___attribute(packed)___ something like this to prevent compiler from adding extra padding bytes. But I believe this is not portable approach, my code needs to support both windows and linux environment.
Currently, I've defined a macro to map packed sizes of every structure defined in my code. Consider code below:
typedef struct {
...
} a_t;
typedef struct {
...
} b_t;
#define SIZE_a_t 8;
#define SIZE_b_t 10;
#define SIZEOF(XX) SIZE_##XX;
and then in main function, I can use above macro definition as below:-
int size = SIZEOF(a_t);
This approach does work, but I believe it may not be best approach. Any suggestions or ideas on how to efficiently solve this problem in C?
Example
Consider the C structure below:-
typedef struct {
uint8_t a;
uint16_t b;
} e_t;
Under Linux, sizeof function return 4 bytes instead of 3 bytes. To prevent this I'm currently doing this:-
typedef struct {
uint8_t a;
uint16_t b;
} e_t;
#define SIZE_e_t 3
#define SIZEOF(XX) SIZE_##e_t
Now, when I call SIZEOF(e_t) in my functin, it should return 3 not 4.
sizeof is the portable way to find the size of a struct, or of any other C data type.
The problem you're facing is how to ensure that your struct has the size and layout that you need.
#pragma pack or __attribute__((packed)) may well do the job for you. It's not 100% portable (there's no mention of packing in the C standard), but it may be portable enough for your current purposes, but consider whether your code might need to be ported to some other platform in the future. It's also potentially unsafe; see this question and this answer.
The only 100% portable approach is to use arrays of unsigned char and keep track of which fields occupy which ranges of bytes. This is a lot more cumbersome, of course.
Your macro tells you the size that you think the struct should have, if it has been laid out as you intend.
If that's not equal to sizeof(a_t), then whatever code you write that thinks it is packed isn't going to work anyway. Assuming they're equal, you might as well just use sizeof(a_t) for all purposes. If they're not equal then you should be using it only for some kind of check that SIZEOF(a_t) == sizeof(a_t), which will fail and prevent your non-working code from compiling.
So it follows that you might as well just put the check in the header file that sizeof(a_t) == 8, and not bother defining SIZEOF.
That's all aside from the fact that SIZEOF doesn't really behave like sizeof. For example consider typedef a_t foo; sizeof(foo);, which obviously won't work with SIZEOF.
I don't think, that specifying size manually is more portable, than using sizeof.
If size is changed your const-specified size will be wrong.
Attribute packed is portable. In Visual Studio it is #pragma pack.
I would recommend against trying to read/write data by overlaying it on a struct. I would suggest instead writing a family of routines which are conceptually like printf/scanf, but which use format specifiers that specify binary data formats. Rather than using percent-sign-based tags, I would suggest simply using a binary encoding of the data format.
There are a few approaches one could take, involving trade-off between the size of the serialization/deserialization routines themselves, the size of the code necessary to use them, and the ability to handle a variety of deserialization formats. The simplest (and most easily portable) approach would be to have routines which, instead of using a format string, process items individually by taking a double-indirect pointer, read some data type from it, and increment it suitably. Thus:
uint32_t read_uint32_bigendian(uint8_t const ** src)
{
uint8_t *p;
uint32_t tmp;
p = *src;
tmp = (*p++) << 24;
tmp |= (*p++) << 16;
tmp |= (*p++) << 8;
tmp |= (*p++);
*src = p;
}
...
char buff[256];
...
uint8_t *buffptr = buff;
first_word = read_uint32_bigendian(&buffptr);
next_word = read_uint32_bigendian(&buffptr);
This approach is simple, but has the disadvantage of having lots of redundancy in the packing and unpacking code. Adding a format string could simplify it:
#define BIGEND_INT32 "\x43" // Or whatever the appropriate token would be
uint8_t *buffptr = buff;
read_data(&buffptr, BIGEND_INT32 BIGEND_INT32, &first_word, &second_word);
This approach could read any number of data items with a single function call, passing buffptr only once, rather than once per data item. On some systems, it might still be a bit slow. An alternative approach would be to pass in a string indicating what sort of data should be received from the source, and then also pass in a string or structure indicating where the data should go. This could allow any amount of data to be parsed by a single call giving a double-indirect pointer for the source, a string pointer indicating the format of data at the source, a pointer to a struct indicating how the data should be unpacked, and a a pointer to a struct to hold the target data.

Does casting remove endian dependency in C/C++?

i.e. if we cast a C or C++ unsigned char array named arr as (unsigned short*)arr and then assign to it, is the result the same independent of machine endianness?
Side note - I saw the discussion on IBM and elsewhere on SO with example:
unsigned char endian[2] = {1, 0};
short x;
x = *(short *) endian;
...stating that the value of x will depend on the layout of endian, and hence the endianness of the machine. That means dereferencing an array is endian-dependent, but what about assigning to it?
*(short*) endian = 1;
Are all future short-casted dereferences then guaranteed to return 1, regardless of endianness?
After reading the responses, I wanted to post some context:
In this struct
struct pix {
unsigned char r;
unsigned char g;
unsigned char b;
unsigned char a;
unsigned char y[2];
};
replacing unsigned char y[2] with unsigned short y makes no individual difference, but if I make an array of these structs and put that in another struct, then I've noticed that the size of the container struct tends to be higher for the "unsigned short" version, so, since I intend to make a large array, I went with unsigned char[2] to save space overhead. I'm not sure why, but I imagine it's easier to align the uchar[2] in memory.
Because I need to do a ton of math with that variable y, which is meant to be a single short-length numerical value, I find myself casting to short a lot just to avoid individually accessing the uchar bytes... sort of a fast way to avoid ugly byte-specific math, but then I thought about endianness and whether my math would still be correct if I just cast everything like
*(unsigned short*)this->operator()(x0, y0).y = (ySum >> 2) & 0xFFFF;
...which is a line from a program that averages 4-adjacent-neighbors in a 2-D array, but the point is that I have a bunch of these operations that need to act on the uchar[2] field as a single short, and I'm trying to find the lightest (i.e. without an endian-based if-else statement every time I need to access or assign), endian-independent way of working with the short.
Thanks to strict pointer aliasing it's undefined behaviour, so it might be anything. If you'd do the same with a union however the answer is no, the result is dependent on machine endianness.
Each possible value of short has a so-called "object representation"[*], which is a sequence of byte values. When an object of type short holds that value, the bytes of the object hold that sequence of values.
You can think of endianness as just being one of the ways in which the object representation is implementation-dependent: does the byte with the lowest address hold the most significant bits of the value, or the least significant?
Hopefully this answers your question. Provided you've safely written a valid object representation of 1 as a short into some memory, when you read it back from the same memory you'll get the same value again, regardless of what the object representation of 1 actually is in that implementation. And in particular regardless of endianness. But as the others say, you do have to avoid undefined behavior.
[*] Or possibly there's more than one object representation for the same value, on exotic architectures.
Yes, all future dereferences will return 1 as well: As 1 is in range of type short, it will end up in memory unmodified and won't change behind your back once it's there.
However, the code itself violates effective typing: It's illegal to access an unsigned char[2] as a short, and may raise a SIGBUS if your architecture doesn't support unaligned access and you're particularly unlucky.
However, character-wise access of any object is always legal, and a portable version of your code looks like this:
short value = 1;
unsigned char *bytes = (unsigned char *)&value;
How value is stored in memory is of course still implementation-defined, ie you can't know what the following will print without further knowledge about the architecture:
assert(sizeof value == 2); // check for size 2 shorts
printf("%i %i\n", bytes[0], bytes[1]);

Resources