I am aware that type casting a buffer to struct is violating strict aliasing rule and that it's not portable
However is memcpy() a buffer to struct with attribute packed not voiding that rule, is it a good decision rather then parsing the contents of the buffer ? Lets keep in mind that both always have a fixed size
If you have ensured that the packed structure lays out the bytes as desired on all the target platforms you wish to support, then copying the bytes into the structure via memcpy and then accessing them through the structure members is fine.
Depending on circumstances, it may be advisable to copy the structure members to a normal (not packed) structure for further use, so that unaligned members in the packed structure are not repeatedly accessed, which may be inefficient. Ultimately, this may be equivalent to issuing multiple memcpy calls to copy bytes from the buffer into individual members of the normal structure.
Using memcpy is certainly at least as efficient as parsing the buffer, as memcpy is about the simplest thing one can do with data. But whether it is more efficient or just the same depends on what sort of parsing you would be doing. Once you have the data in a structure, you will still have to operate on it in whatever ways your application requires anyway, so the memcpy would not seem to eliminate any real work that must be done.
Related
In GCC, I need to use __attribute__((packed)) to make structs take the least amount of space, for example, if I have a large array of structs I should pack them. What other common compilers do struct padding, and how do I pack structs in these other compilers?
The premise of your question is mistaken. Packed structs are not something you're "supposed to use" to save space. They're a dubious nonstandard feature with lots of problems that should be avoided whenever possible. (Ultimately it's always possible to avoid them, but some people balk at the tradeoffs involved.) For example, whenever you use packed structures, any use of pointers to members is potentially unsafe because the pointer value is not necessarily a valid (properly aligned) pointer to the type it points to. The only time there is a "need" for packed structures is when you're using them to access memory-mapped hardware registers that are misaligned, or to access in-file/on-disk data structures that are misaligned (but the latter won't be portable anyway since representation/endianness may not match, and both problems are solved together much better with a proper serialization/deserialization function).
If your goal is to save space, then as long as you control the definition of the structure, simply order it so as not to leave unnecessary padding space. This can be achieved simply by ordering the members in order of decreasing size; if you do that, then on any reasonable implementation, the wasted space will be at most the difference between the size of the largest member and the size of the smallest.
I was wondering if anyone could offer a more full explanation to the meaning of the packed attribute used in the bitmap example in pset4.
"Our use, incidentally, of the attribute called packed ensures that clang does not try to "word-align" members (whereby the address of each member’s first byte is a multiple of 4), lest we end up with "gaps" in our structs that don’t actually exist on disk."
I do not understand the comment around gaps in our structs. Does this refer to gaps in the memory location between each struct (i.e. one byte between each 3 byte RGB if it was to word-algin)? Why does this matter in for optimization?
typedef uint8_t BYTE;
typedef struct
{
BYTE rgbtBlue;
BYTE rgbtGreen;
BYTE rgbtRed;
} __attribute__((__packed__))
RGBTRIPLE;
Beware: prejudices on display!
As noted in comments, when the compiler adds the padding to a structure, it does so to improve performance. It uses the alignments for the structure elements that will give the best performance.
Not so very long ago, the DEC Alpha chips would handle a 'unaligned memory request' (umr) by doing a page fault, jumping into the kernel, fiddling with the bytes to get the required result, and returning the correct result. This was painfully slow by comparison with a correctly aligned memory request; you avoided such behaviour at all costs.
Other RISC chips (used to) give you a SIGBUS error if you do misaligned memory accesses. Even Intel chips have to do some fancy footwork to deal with misaligned memory accesses.
The purpose of removing padding is to (decrease performance but) benefit by being able to serialize and unserialize the data without doing the job 'properly' — it is a form of laziness that actually doesn't work properly when the machines communicating are not of the same type, so proper serialization should have been done in the first place.
What I mean is that if you are writing data over the network, it seems simpler to be able to send the data by writing the contents of a structure as a block of memory (error checking etc omitted):
write(fd, &structure, sizeof(structure));
The receiving end can read the data:
read(fd, &structure, sizeof(structure));
However, if the machines are of different types (for example, one has an Intel CPU and the other a SPARC or Power CPU), the interpretation of the data in those structures will vary between the two machines (unless every element of the array is either a char or an array of char). To relay the information reliably, you have to agree on a byte order (e.g. network byte order — this is very much a factor in TCP/IP networking, for example), and the data should be transmitted in the agreed upon order so that both ends can understand what the other is saying.
You can define other mechanisms: you could use a 'sender makes right' mechanism, in which the 'receiver' let's the sender know how it wants the data presented and the sender is responsible for fixing up the transmitted data. You can also use a 'receiver makes right' mechanism which works the other way around. Both these have been used commercially — see DRDA for one such protocol.
Given that the type of BYTE is uint8_t, there won't be any padding in the structure in any sane (commercially viable) compiler. IMO, the precaution is a fantasy or phobia without a basis in reality. I'd certainly need a carefully documented counter-example to believe that there's an actual problem that the attribute helps with.
I was led to believe that you could encounter issues when you pass the entire struct to a function like fread as it assumes you're giving it an array like chunk of memory, with no gaps in it. If your struct has gaps, the first byte ends up in the right place, but the next two bytes get written in the gap, which you don't have a proper way to access.
Sorta...but mostly no. The issue is that the values in the padding bytes are indeterminate. However, in the structure shown, there will be no padding in any compiler I've come across; the structure will be 3 bytes long. There is no reason to put any padding anywhere inside the structure (between elements) or after the last element (and the standard prohibits padding before the first element). So, in this context, there is no issue.
If you write binary data to a file and it has holes in it, then you get arbitrary byte values written where the holes are. If you read back on the same (type of) machine, there won't actually be a problem. If you read back on a different (type of) machine, there may be problems — hence my comments about serialization and deserialization. I've only been programming in C a little over 30 years; I've never needed packed, and don't expect to. (And yes, I've dealt with serialization and deserialization using a standard layout — the system I mainly worked on used big-endian data transfer, which corresponds to network byte order.)
Sometimes, the elements of a struct are simply aligned to a 4-byte boundary (or whatever the size of a register is in the CPU) to optimize read/write access to RAM. Often, smaller elements are packed together, but alignment is dictated by a larger type in the struct.
In your case, you probably don't need to pack the struct, but it doesn't hurt.
With some compilers, each byte in your struct could end up taking 4 bytes of RAM each (so, 12 bytes for the entire struct). Packing the struct removes the alignment requirement for each of the BYTEs, and ensures that the entire struct is placed into one 4-byte DWORD (unless the alignment for the entire program is set to one byte, or the struct is in an array of said structs, in that case it would literally be stored in 3 contiguous bytes of RAM).
See comments below for further discussion...
The objective is exactly what you said, not having gaps between each struct. Why is this important? Mostly because of cache. Memory access is slow!!! Cache is really fast. If you can fit more in cache you avoid cache misses (memory accesses).
Edit: Seems I was wrong, didn't seem really useful if the objective was structure padding since the struct has 3 BYTE
In embedded software domain for copying structure of same type people don't use direct assignment and do that by memcpy() function or each element copying.
lets have for example
struct tag
{
int a;
int b;
};
struct tag exmple1 = {10,20};
struct tag exmple2;
for copying exmple1 into exmple2..
instead of writing direct
exmple2=exmple1;
people use
memcpy(exmple2,exmple1,sizeof(struct tag));
or
exmple2.a=exmple1.a;
exmple2.b=exmple1.b;
why ????
One way or the other there is nothing specific about embedded systems that makes this dangerous, the language semantics are identical for all platforms.
C has been used in embedded systems for many years, and early C compilers, before ANSI/ISO standardisation did not support direct structure assignment. Many practitioners are either from that era, or have been taught by those that were, or are using legacy code written by such practitioners. This is probably the root of the doubt, but it is not a problem on an ISO compliant implementation. On some very resource constrained targets, the available compiler may not be fully ISO compliant for a number of reasons, but I doubt that this feature would be affected.
One issue (that applies to embedded and non-embedded alike), is that when assigning a structure, an implementation need not duplicate the value of any undefined padding bits, therefore if you performed a structure assignment, and then performed a memcmp() rather than member-by-member comparison to test for equality, there is no guarantee that they will be equal. However if you perform a memcpy(), any padding bits will be copied so that memcmp() and member-by-member comparison will yield equality.
So it is arguably safer to use memcpy() in all cases (not just embedded), but the improvement is marginal, and not conducive to readability. It would be a strange implementation that did not use the simplest method of structure assignment, and that is a simple memcpy(), so it is unlikely that the theoretical mismatch would occur.
In your given code there is no problem even if you write:
example2 = example1;
But just assume if in future, the struct definition changes to:
struct tag
{
int a[1000];
int b;
};
Now if you execute the assignment operator as above then (some of the) compiler might inline the code for byte by byte (or int by int) copying. i.e.
example1.a[0] = example.a[0];
example1.a[1] = example.a[1];
example1.a[2] = example.a[2];
...
which will result in code bloat in your code segment. Such kind of memory errors are not trivial to find. That's why people use memcpy.
[However, I have heard that modern compilers are capable enough to use memcpy internally when such instruction is encountered especially for PODs.]
Copying C-structures via memcpy() is often used by programmers who learned C decades ago and did not follow the standardization process since. They simple don't know that C supports assignment of structures (direct structure assignment was not available in all pre-ANSI-C89 compilers).
When they learn about this feature some still stick to the memcpy() way because it is their custom. There are also motivations that originate in cargo cult programming, e.g. it is claimed that memcpy is just faster - of course - without being able to back this up with a benchmark test case.
Structures are also memcpy()ied by some newbie programmers because they either confuse structure assignment with the assignment of a pointer of a structure - or they simply overuse memcpy() (they often also use memcpy() where strcpy() would be more appropriate).
There is also the memcmp() structure comparison anti-pattern that is sometimes cited by some programmers for using memcpy() instead of structure assignment. The reasoning behind this is the following: since C does not automatically generate a == operator for structures and writing a custom structure comparison function is tedious, memcmp() is used to compare structures. In the next step - to avoid differences in the padding bits of compared structures - memset(...,0,...) is used to initialize all structures (instead of using the C99 initializer syntax or initializing all fields separately) and memcpy() is used to copy the structures! Because memcpy() also copies the content of the padding bits ...
But note that this reasoning is flawed for several reasons:
the use of memcpy()/memcmp()/memset() introduce new error possibilities - e.g. supplying a wrong size
when the structure contains integer fields the ordering under memcmp() changes between big- and little-endian architectures
a char array field of size n that is 0-terminated at position x must also have all elements after position x zeroed out at any time - else 2 otherwise equal structs compare unequal
assignment from a register to a field may also set the neighbouring padding bits to values unequal 0, thus, following comparisons with otherwise equal structures yield an unequal result
The last point is best illustrated with a small example (assuming architecture X):
struct S {
int a; // on X: sizeof(int) == 4
char b; // on X: 24 padding bits are inserted after b
int c;
};
typedef struct S S;
S s1;
memset(&s1, 0, sizeof(S));
s1.a = 0;
s1.b = 'a';
s1.c = 0;
S s2;
memcpy(&s2, &s1, sizeof(S));
assert(memcmp(&s1, &s2, sizeof(S)==0); // assertion is always true
s2.b = 'x';
assert(memcmp(&s1, &s2, sizeof(S)!=0); // assertion is always true
// some computation
char x = 'x'; // on X: 'x' is stored in a 32 bit register
// as least significant byte
// the other bytes contain previous data
s1.b = x; // the complete register is copied
// i.e. the higher 3 register bytes are the new
// padding bits in s1
assert(memcmp(&s1, &s2, sizeof(S)==0); // assertion is not always true
The failure of the last assertion may depend on code reordering, change of the compiler, change of compiler options and stuff like that.
Conclusion
As a general rule: to increase code correctness and portability use direct struct assignment (instead of memcpy()), C99 struct initialization syntax (instead of memset) and a custom comparison function (instead of memcmp()).
In C people probably do that, because they think that memcpy would be faster. But I don't think that is true. Compiler optimizations would take care of that.
In C++ it may also have different semantics because of user defined assignment operator and copy constructors.
On top of what the others wrote some additional points:
Using memcpy instead of a simple assignment gives a hint to someone who maintains the code that the operation might be expensive. Using memcpy in these cases will improves the understanding of the code.
Embedded systems are often written with portability and performance in mind. Portability is important because you may want to re-use your code even if the CPU in the original design is not available or if a cheaper micro-controller can do the same job.
These days low-end micro-controllers come and go faster than the compiler developers can catch up, so it is not uncommon to work with compilers that use a simple byte-copy loop instead of something optimized for structure assignments. With the move to 32 bit ARM cores this is not true for a large part of embedded developers. There are however a lot of people out there who build products that target obscure 8 and 16 bit micro-controllers.
A memcpy tuned for a specific platform may be more optimal than what a compiler can generate. For example on embedded platforms having structures in flash memory is common. Reading from flash is not as slow as writing to it, but it is still a lot slower than a ordinary copy from RAM to RAM. A optimized memcpy function may use DMA or special features from the flash controller to speed up the copy process.
That is a complete nonsense. Use whichever way you prefer. The simplest is :
exmple2=exmple1;
Whatever you do, don't do this:
exmple2.a=exmple1.a;
exmple2.b=exmple1.b;
It poses a maintainability problem because any time that anyone adds a member to the structure, they have to add a line of code to do the copy of that member. Someone is going to forget to do that and it will cause a hard to find bug.
On some implementations, the way in which memcpy() is performed may differ from the way in which "normal" structure assignment would be performed, in a manner that may be important in some narrow contexts. For example, one or the other structure operand may be unaligned and the compiler might not know about it (e.g. one memory region might have external linkage and be defined in a module written in a different language that has no means of enforcing alignment). Use of a __packed declaration would be better if a compiler supported such, but not all compilers do.
Another reason for using something other than structure assignment could be that a particular implementation's memcpy might access its operands in a sequence that would work correctly with certain kinds of volatile source or destination, while that implementation's struct assignment might use a different sequence that wouldn't work. This is generally not a good reason to use memcpy, however, since aside from the alignment issue (which memcpy is required to handle correctly in any case) the specifications for memcpy don't promise much about how the operation will be performed. It would be better to use a specially-written routine which performed the operations exactly as required (for example, if the target is a piece of hardware which needs to have 4 bytes of structure data written using four 8-bit writes rather than one 32-bit writes, one should write a routine which does that, rather than hoping that no future version of memcpy decides to "optimize" the operation).
A third reason for using memcpy in some cases would be the fact that compilers will often perform small structure assignments using a direct sequence of loads and stores, rather than using a library routine. On some controllers, the amount of code this requires may vary depending upon where the structures are located in memory, to the point that the load/store sequence may end up being bigger than a memcpy call. For example, on a PICmicro controller with 1Kwords of code space and 192 bytes of RAM, coping a 4-byte structure from bank 1 to bank 0 would take 16 instructions. A memcpy call would take eight or nine (depending upon whether count is an unsigned char or int [with only 192 bytes of RAM total, unsigned char should be more than sufficient!] Note, however, that calling a memcpy-ish routine which assumed a hard-coded size and required both operands be in RAM rather than code space would only require five instructions to call, and that could be reduced to four with the use of a global variable.
first version is perfect.
second one may be used for speed (there is no reason for your size).
3rd one is used only if padding is different for target and source.
I am making a memory block copy routine and need to deal with blocks of raw memory in efficient chunks. My question is not about the specialized copy routine I'm making, but in how to correctly examine raw pointer alignment in C.
I have a raw pointer of memory, let's say it's already cast as a non-null char *.
In my architecture, I can very efficiently copy memory in 64 byte chunks WHEN IT IS ALIGNED TO A 64 BYTE chunk. So the (standard) trick is that I will do a simple copy of 0-63 bytes "manually" at the head and/or tail to transform the copy from an arbitrary char* of arbitrary length to a 64 byte aligned pointer with some multiple of 64 bytes in length.
Now the question is, how do you legally "examine" a pointer to determine (and manipulate) its alignment?
The obvious way is to cast it into an integer and just examine the bits:
char *pointer=something.
int p=(int)pointer;
char *alignedPointer=(char *)((p+63)&~63);
Note here I realize that alignedPointer doesn't point to the same memory as pointer... this is the "rounded up" pointer that I can call my efficient copy routine on, and I'll handle any other bytes at the beginning manually.
But compilers (justifiably) freak out at casting a pointer into an integer. But how else can I examine and manipulate the pointer's lower bits in LEGAL C? Ideally so that with different compilers I'd get no errors or warnings.
For integer types that are large enough to hold pointers, C99 stdint.h has:
uintptr_t
intptr_t
For data lengths there are:
size_t
ssize_t
which have been around since well before C99.
If your platform doesn't have these, you can maximise your code's portability by still using these type names, and making suitable typedefs for them.
I don't think that in the past people were as reluctant to do their own bit-banging, but maybe the current "don't touch that" mood would be conducive to someone creating some kind of standard library for aligning pointers. Lacking some kind of official api, you have no choice but to AND and OR your way through.
Instead of int, try a datatype that's guaranteed to be the same size as a pointer (INT_PTR on Win32/64). Maybe the compiler won't freak out too much. :) Or use a union, if 64-bit compatibility is not important.
Casting pointers to and from integers is valid, but the results are implementation-defined. See section 6.3.2.3 of the standard. The intention seems to be that the results are what anybody familiar with the system would expect, and indeed this appears to be routinely the case in practice.
If the architecture in question can efficiently manipulate pointers and integers interchangeably, and the issue is just whether it will work on all compilers for that system, then the answer is that it probably will anyway.
(Certainly, if I were writing this code, I would think it fine as-is until proven otherwise. My experience has been that compilers for a given system all behave in very similar ways at this sort of level; the assembly language just suggests a particular approach, that all then take.)
"Probably works" isn't very good general advice though, so my suggestion would be just write the code that works, surround it enough suitable #ifdefs that only the known compiler(s) will compile it, and defer to memcpy in other cases.
#ifdef is rarely ideal, but it's fairly lightweight compared to other possibilities. And if implementation-defined behaviour or compiler-specific tricks are needed then the options are pretty limited anyway.
I'm writing an embedded control system in C that consists of multiple tasks that send messages to each other (a fairly common idiom, I believe!), but I'm having a hard time designing a mechanism which:
is neat
is generic
is relatively efficient
most importantly: is platform independent (specifically, doesn't violate strict-aliasing or alignment issues)
Conceptually, I'd like to represent each message type as a separate struct definition, and I'd like a system with the following functions (simplified):
void sendMsg(queue_t *pQueue, void *pMsg, size_t size);
void *dequeueMsg(queue_t *pQueue);
where a queue_t comprises a linked list of nodes, each with a char buf[MAX_SIZE] field. The system I'm on doesn't have a malloc() implementation, so there'll need to be a global pool of free nodes, and then one of the following (perceived issues in bold):
sendMsg() does a memcpy of the incoming message into the buffer of a free node.My understanding is that this will have alignment issues unless the caller of dequeueMsg() does a further memcpy on the return value.
or there'll be a void *getFreeBuffer() function which returns the buf[] of the next free node, which the caller (the sender) will cast to the appropriate pointer-to-type.My understanding is that this will now have alignment issues on the way in, and still requires a memcpy after dequeueMsg() to avoid alignment issues on the way out.
or redefine the buffer in queue_t nodes as (e.g.) uint32_t buf[MAX_SIZE].My understanding is that this violates strict aliasing, and isn't platform-independent.
The only other option I can see is to create a union of all the message types along with char buf[MAX_SIZE], but I don't count this as "neat"!
So my question is, how does one do this properly?
The way we deal with this is to have our free list consist entirely of aligned nodes. In fact we have multiple free lists for different sizes of node, so we have lists that are aligned on 2 byte, 4 byte, and 16 byte boundaries (our platform doesn't care about alignment larger than one SIMD vector) . Any allocation gets rounded up to one of those values and put in a properly aligned node. So, sendMsg always copies its data into an aligned node. Since you're writing the free list yourself, you can easily enforce alignment.
We would also use a #pragma or declspec to force that char buf[MAX_SIZE] array to be aligned to at least a word boundary inside the queue_t node struct.
This assumes of course that the input data is aligned, but if for some reason you're passing in a message that expects to be (let's say) 3 bytes off from alignment, you can always detect that with a modulus and return an offset into the free node.
With that underlying design we have interfaces that support both option 1 and 2 above. Again, we precondition that input data is always natively aligned, so our dequeue of course returns an aligned pointer; but if you need oddly aligned data in, again, just offset into the free node and return the offset pointer.
This keeps you dealing in void *s thus avoiding your strict aliasing problems. (In general though I think you may need to relax your strict aliasing requirements when writing your own memory allocators, since by nature they blur types internally.)
I don't understand why 1 presents an alignment issue - as long as each buf[MAX_SIZE] element is aligned to the natural largest single primitive type that occurs in your message structs (probably 32 or 64bit), then it doesn't matter what the contents of each message type is; as it will always be aligned to that size.
Edit
Actually, it's even simpler than that. As each element in your message queue is MAX_SIZE in length, then assuming that you start each message in it's own buf (i.e. you don't pack them if a message is < MAX_SIZE) then every message in the queue will start on a boundary at least as big as itself, therefore it will always be correctly aligned.