I am wondering whether C packs bytes in the stack for optimal CPU retrieval even if they are outside of a struct. And if not, why do so specifically for a struct?
Structs are very widely used in C, and the compiler adjusts their layout for two reasons: (1) aligning objects for faster access and (2) matching specific architecture mappings (for example, on ARM/Thumb), where a developer can write code that maps a struct onto peripheral registers. But sometimes we need explicit control over the layout, for example when transmitting data between different systems (as in network protocols).
From the point of view of embedded systems (ARM), below is a specific recommendation - "peripheral locations should not be accessed using __packed structs (where unaligned members are allowed and there is no internal padding), or using C bitfields. This is because it is not possible to control the number and type of memory access that is being performed by the compiler. The result is code which is non-portable, has undesirable side-effects, and will not work as intended".
Also see Structure padding and packing
How to get the memory granularity of a CPU in C?
Suppose I want to allocate an array where all the elements are properly memory aligned. I can pad each element to a certain size N to achieve this. How do I know the value of N?
Note: I am trying to create a memory pool where each slot is memory aligned. Any suggestion will be appreciated.
In Theory
How to get the memory granularity of a CPU in C?
First, you read the instruction set architecture manual. It may specify that certain instructions require certain alignments, or even that the addressing forms in certain instructions cannot represent non-aligned addresses. It may specify other properties regarding alignment.
Second, you read the processor manual. It may specify performance characteristics (such as that unaligned loads or stores are supported but may be slower or use more resources than aligned loads or stores) and may specify various options allowed by the instruction set architecture.
Third, you read the operating system documentation. Some architectures allow the operating system to select features related to alignment, such as whether unaligned loads and stores are made to fail or are supported albeit with slower performance than aligned loads or stores. The operating system documentation should have this information.
In Practice
For many programming situations, what you need to know is not the “memory granularity” of a CPU but the alignment requirements of the C implementation you are using (or of whatever language you are using). And, for the most part, you do not need to know the alignment requirements directly but just need to follow the language rules about managing objects: use objects with declared types, do not use casts to convert pointers between incompatible types except where specific rules allow it, use the suitably aligned memory as provided by malloc rather than adjusting your own pointers to bytes, and so on. Following these rules will give good alignment for the objects in your program.
In C, when you define an array, the element size will automatically be the size that the C implementation needs for its alignment. For example, long double x[100]; may use 16 bytes for each array element even though the hardware uses only ten bytes for a long double. Or, for any struct foo that you define, the compiler will automatically include padding as needed in the structure to give the desired alignment, and any array struct foo x[100]; will already include that padding. sizeof(struct foo) will be the same as sizeof x[0], because each structure object has that padding built in, even for a single structure object, not just for elements in arrays.
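A minimal sketch of the padding point, using a hypothetical struct foo with a char and a double (the members are an assumption for illustration). Exact sizes are implementation-specific, so the code only relies on relationships the language guarantees.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t, offsetof */

/* Hypothetical structure, used only for illustration. */
struct foo {
    char tag;       /* padding usually follows so that value is aligned */
    double value;
};

/* An array of 100 elements; each element carries its own padding. */
static struct foo x[100];

/* The size of one element equals sizeof(struct foo). */
static_assert(sizeof x[0] == sizeof(struct foo), "element size includes padding");

/* Number of padding bytes the implementation added to struct foo. */
size_t foo_padding_bytes(void)
{
    return sizeof(struct foo) - (sizeof(char) + sizeof(double));
}

/* Distance in bytes between two consecutive array elements. */
size_t foo_stride(void)
{
    return (size_t)((char *)&x[1] - (char *)&x[0]);
}
```

On a typical 64-bit ABI foo_padding_bytes() would report 7, but that value is not guaranteed; the stride always equals sizeof(struct foo).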
When you do need to know the alignment that a C implementation requires for a type, you can use C’s _Alignof operator. The expression _Alignof(type) provides the alignment required for type.
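For the memory pool in the question, one possible approach is to round each slot size up to a multiple of the strictest fundamental alignment. The sketch below assumes C11; round_up and slot_size_for are hypothetical helper names, not library functions.

```c
#include <assert.h>
#include <stddef.h>   /* size_t, max_align_t */

/* Round size up to the next multiple of align.
   align must be a power of two, which every C alignment value is. */
size_t round_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Slot size for a pool whose slots may hold any object of up to
   payload bytes: pad to the strictest fundamental alignment. */
size_t slot_size_for(size_t payload)
{
    return round_up(payload, _Alignof(max_align_t));
}
```

If the pool's base address is itself aligned for max_align_t (as memory from malloc is), every slot at base + i * slot_size_for(n) is aligned as well. For a pool that holds a single type T, _Alignof(T) can replace _Alignof(max_align_t) to waste less space.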
Other
… properly memory aligned.
Proper alignment is a matter of degrees:
What the processor supports may determine whether your program works or does not work. An improper alignment is one that causes your program to trap.
What is efficient with respect to individual loads and stores may affect how fast your program runs. An improper alignment is one that causes your program to execute more slowly.
In certain performance-critical situations, alignment with respect to cache and memory mapping features can also affect performance.
Short answer
Use 64 bytes.
Long answer
Data are loaded from and stored to memory in units called cache lines. If your program loads only part of the data in a cache line, then the whole line will be loaded into the CPU caches. Perhaps more importantly, the algorithm used for moving data between cores in a multi-core CPU operates on full cache lines; aligning your data to cache lines avoids false sharing, the situation where a cache line bounces between cores because it contains data manipulated by different threads.
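It used to be the case that cache line sizes depended heavily on the architecture, ranging from 16 up to 512 bytes. Today, most mainstream processors (Intel, AMD, and many ARM and MIPS designs) use a cache line of 64 bytes, although some use 128 bytes, so check the documentation for your target.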
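A minimal sketch of cache-line alignment, assuming a 64-byte line (the CACHE_LINE constant is an assumption, not a value queried from the hardware) and a C11 library that provides aligned_alloc. The struct padded_counter and the helper names are hypothetical.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>   /* uintptr_t */
#include <stdlib.h>   /* aligned_alloc, free, size_t */

#define CACHE_LINE 64   /* assumption: 64-byte cache lines on the target */

/* One counter per thread; _Alignas pads each counter to a full
   cache line so that two counters never share a line. */
struct padded_counter {
    _Alignas(CACHE_LINE) unsigned long value;
};

static_assert(sizeof(struct padded_counter) % CACHE_LINE == 0,
              "each counter occupies whole cache lines");

/* Allocate n counters starting on a cache-line boundary.
   The caller releases the result with free(). */
struct padded_counter *alloc_counters(size_t n)
{
    /* aligned_alloc wants a size that is a multiple of the alignment;
       n * sizeof(struct padded_counter) always is. */
    return aligned_alloc(CACHE_LINE, n * sizeof(struct padded_counter));
}

/* Nonzero when p lies on a cache-line boundary. */
int is_line_aligned(const void *p)
{
    return (uintptr_t)p % CACHE_LINE == 0;
}
```

Each thread would then update only its own element, so no two threads write to the same line.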
This depends heavily on the cpu microarchitecture that you are using.
In many cases, the memory address of an operand should be a multiple of the operand size; otherwise execution will be slow (or might even raise an exception).
But there are also CPUs which do not care about a specific alignment of the operands in memory at all.
Usually, the C compiler will care about those details for you. You should, however, make sure that the compiler assumes the correct target (micro-)architecture, for example by specifying it with the correct compiler flags (-march=? on gcc).
While reading the CERT C Coding Standard, I came across DCL39-C which discusses why it's generally a bad idea for something like the Linux kernel to return an unpacked struct to userspace due to information leakage.
In a nutshell, structs aren't generally packed by default and the padding bytes between members of a struct often contain uninitialized data, hence the information leakage.
Why aren't structs packed by default? There was a mention in the guide that it's an optimization feature of compilers for specific architectures, I believe. Why is aligning structs to a certain byte size more efficient, as it wastes memory space?
Also, why doesn't the C standard specify a standardized way of asking for a packed struct? I can ask GCC using __attribute__((packed)), and there are other ways for different compilers, but it seems like a feature that'd be nice to have as part of the standard.
Data is carried through electronic circuits by groups of parallel wires (buses). Likewise, the circuits themselves tend to be arrayed in parallel. The physical distance between parallel components adds resistance and capacitance to any crosswise wires that bridge them. So, such bridges tend to be expensive and slow, and computer architects avoid them when possible.
Unaligned loads require shifting bytes across lanes. Some CPUs (e.g. efficiency-oriented RISC) are physically incapable of doing this, because the bridge component doesn't exist. Some will detect the condition and interpose a lane shift at the expense of a cycle or two. Others can handle misalignment without a speed penalty… assuming paged memory doesn't add another problem.
There's another, completely different issue. The memory management unit (MMU) sits between the CPU execution core and the memory bus, translating program-visible logical addresses to the physical addresses for memory chips. Two adjacent logical addresses might reside on different chips. The MMU is optimized for the common case where an access only requires one translation, but a misaligned access may require two.
A misaligned access straddling a page boundary might incur an exception, which might be fatal inside a kernel. Since pages are relatively large, this condition is relatively rare. It might evade tests, and it may be non-deterministic.
TL;DR: Packed structures shouldn't be used for active program state, especially not in a kernel. They may be used for serialization and communication, but safe usage is another question.
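One way to serialize without a packed struct is to write each field into a byte buffer explicitly. The sketch below assumes a hypothetical three-field message and a big-endian wire format; neither comes from the answer above.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical message, kept in its natural (padded) layout in memory. */
struct message {
    uint8_t  kind;
    uint32_t length;
    uint16_t checksum;
};

enum { MESSAGE_WIRE_SIZE = 7 };   /* 1 + 4 + 2 bytes, no padding on the wire */

static_assert(sizeof(struct message) >= MESSAGE_WIRE_SIZE,
              "the in-memory layout may be larger because of padding");

/* Write *msg into out[0..6] in big-endian order, one byte at a time. */
void message_encode(const struct message *msg, uint8_t out[MESSAGE_WIRE_SIZE])
{
    out[0] = msg->kind;
    out[1] = (uint8_t)(msg->length >> 24);
    out[2] = (uint8_t)(msg->length >> 16);
    out[3] = (uint8_t)(msg->length >> 8);
    out[4] = (uint8_t)(msg->length);
    out[5] = (uint8_t)(msg->checksum >> 8);
    out[6] = (uint8_t)(msg->checksum);
}

/* Rebuild *msg from the 7 wire bytes produced by message_encode. */
void message_decode(struct message *msg, const uint8_t in[MESSAGE_WIRE_SIZE])
{
    msg->kind = in[0];
    msg->length = ((uint32_t)in[1] << 24) | ((uint32_t)in[2] << 16)
                | ((uint32_t)in[3] << 8)  |  (uint32_t)in[4];
    msg->checksum = (uint16_t)(((unsigned)in[5] << 8) | in[6]);
}
```

Because every access is a single-byte load or store, no access is ever misaligned, and the wire format is independent of the host's padding and byte order.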
Leaving structs "unpacked" allows the compiler to align all members so that operations are more efficient on those members (measured in terms of clock time, number of instructions, etc). The alignment requirement for types depends on the host architecture and (for struct types) on the alignment requirement of contained members.
Packing struct members forces some (if not all) members to be aligned in a way that is sub-optimal for performance. In the worst cases, depending on the host architecture, operations on unaligned variables (or on unaligned struct members) trigger a processor fault condition. Many RISC processor architectures, for example, generate an alignment fault when a load or store operation affects an unaligned address. Some SSE instructions on recent x86 architectures require the data they act on to be aligned on 16-byte boundaries.
In best cases, the operations behave as intended, but less efficiently, due to overhead of copying an unaligned variable to an aligned location or to a register, doing the operation there, and copying it back. Those copying operations are less efficient when unaligned variables are involved - after all, the processor architecture is optimised for performance when variable alignment meets its design requirements.
If you are worried about data leaking out of your program, simply use functions like memset() to overwrite the contents of your structures at the end of their lifetime (e.g. just before an instance is about to pass out of scope, or immediately before dynamically allocated memory is deallocated using free()).
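A minimal sketch of that memset() approach, using a hypothetical struct record; the helper names are assumptions for illustration. memset() overwrites the whole object representation, so padding bytes are cleared along with the members.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <string.h>   /* memset */

/* Hypothetical record; most ABIs insert padding between id and balance. */
struct record {
    char id;
    long balance;
};

static_assert(sizeof(struct record) >= sizeof(char) + sizeof(long),
              "padding can only make the struct larger than its members");

/* Overwrite every byte of *r, padding included, with zeros. */
void record_scrub(struct record *r)
{
    memset(r, 0, sizeof *r);
}

/* Count the nonzero bytes in the object representation of *r. */
size_t record_nonzero_bytes(const struct record *r)
{
    const unsigned char *bytes = (const unsigned char *)r;
    size_t count = 0;
    for (size_t i = 0; i < sizeof *r; i++)
        count += (bytes[i] != 0);
    return count;
}
```

One caveat: a compiler may remove a memset() whose result is never read afterwards. Where that matters, functions such as memset_s (C11 Annex K, optional) or explicit_bzero (glibc and the BSDs) are meant for this purpose.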
Or use an operating system (like OpenBSD) which does overwrite memory before making it available to processes or programs. Bear in mind that such features tend to make both the operating system and programs it hosts run less efficiently.
Recent versions of the C standard (since 2011) do have some facilities to query and control alignment of variables (and affect packing of struct members). The default is whatever alignment is most effective for the host architecture - which for struct types normally means unpacked.
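A short sketch of those C11 facilities, assuming a compiler that provides <stdalign.h>; struct packet is hypothetical. Note that alignas can only make alignment stricter than the natural one, so it can add padding but cannot be used to pack a struct.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof (C11) */
#include <stddef.h>    /* offsetof */

/* Hypothetical struct: alignas raises the alignment of payload to 16. */
struct packet {
    char header;
    alignas(16) unsigned char payload[16];
};

/* The struct inherits the strictest alignment among its members. */
static_assert(alignof(struct packet) >= 16,
              "struct is at least as aligned as payload");

/* Padding is inserted after header so that payload starts on a
   16-byte boundary relative to the start of the struct. */
static_assert(offsetof(struct packet, payload) % 16 == 0,
              "payload starts on a 16-byte boundary");
```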
On some compilers such as Microchip XC8, all structs are indeed always packed.
On some platforms compilers will only generate byte access instructions to access members of a packed struct, because byte access instructions are always aligned. If all structs are packed, the 16-, 32-, and 64-bit load/store instructions are not used. This is an obvious waste of resources.
The C standard does not specify a way of packing structs, possibly because the standard itself is not aware of the concept of packing: the layout of non-bit-field members of structs is implementation-defined, and so out of scope for the standard. Or possibly the standard is written to support architectures that always add padding in structs, since such architectures are indeed theoretically feasible.
What are the problems of mixing (Visual Studio) C/C++ projects which have different structure alignment options set? I know that I can obviously set the options differently in different projects, but does this have an effect on the behaviour of the output library/application?
Say I have a library, with structure alignment set to 1-byte, would there be a problem if I had an application which links to the library, but has structure alignment set to default? Does it make a difference if the structures of the library are not used by the application?
Thanks for any help with this!
The structure layout is determined by the alignment rules in effect at compilation time. For efficiency reasons, structure members are aligned in such a way as to make their addresses in memory multiples of 2, 4, 8 or sometimes larger values.
Extra padding is inserted between structure members depending on their sizes and alignment requirements, and extra padding may be added at the end of the structure to allow arrays of structures to keep this alignment valid for members of all array elements. malloc is guaranteed to return a pointer aligned for all practical purposes:
The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement.
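As a rough illustration of that guarantee, the following sketch checks a pointer against the alignment of max_align_t, which C11 uses to represent the greatest fundamental alignment. The helper name is hypothetical, and converting a pointer to uintptr_t to inspect its low bits is implementation-defined, though it behaves as expected on common platforms.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* max_align_t */
#include <stdint.h>   /* uintptr_t */
#include <stdlib.h>   /* malloc, free */

static_assert(_Alignof(max_align_t) >= _Alignof(double),
              "max_align_t is at least as strictly aligned as double");

/* Nonzero when p is aligned for every type with a fundamental alignment. */
int is_fundamentally_aligned(const void *p)
{
    return (uintptr_t)p % _Alignof(max_align_t) == 0;
}
```

A pointer obtained from malloc is expected to pass this check, which is why a struct placed at the start of a malloc'ed block has correctly aligned members.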
The padding behavior can sometimes be controlled via compiler switches and/or pragmas, as seems to be possible with Visual C/C++.
The consequence is important for structures for which said options will produce different padding. The code to access structure members will be different in the various libraries and programs handling shared data, invoking undefined behaviour.
It is therefore strongly advised to not use different alignment options for data structures shared between different pieces of compiled code. Whether they live in different programs, dynamic libraries, statically linked object files, user or kernel code, device drivers, etc. the result is the same: undefined behavior with potential catastrophic consequences (like any UB).
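To make the mismatch concrete, the sketch below declares the same hypothetical header twice, once under #pragma pack(push, 1) and once with default alignment, standing in for two modules built with different options. The struct and function names are assumptions for illustration; #pragma pack is a compiler extension supported by MSVC, GCC and Clang.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t, offsetof */

/* The header as seen by a module built with 1-byte packing... */
#pragma pack(push, 1)
struct header_packed {
    char tag;
    int  value;
};
#pragma pack(pop)

/* ...and as seen by a module built with the default alignment. */
struct header_default {
    char tag;
    int  value;
};

static_assert(sizeof(struct header_packed) <= sizeof(struct header_default),
              "packing never enlarges this struct");

/* Offset at which each module would look for the value member. */
size_t value_offset_packed(void)  { return offsetof(struct header_packed, value); }
size_t value_offset_default(void) { return offsetof(struct header_default, value); }
```

If one module writes the struct with one layout and another reads it with the other, value is read from the wrong bytes, which is the failure described above.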
It only matters for structures used and shared between the incompatible modules, but it is very difficult to keep track over time of who uses and who does not use structure definitions published in shared header files. Structures that are only used locally and whose definition is not visible should not pose a problem, but only for as long as they stay private.
If you use compile time switches, all structures will be affected by these, including standard definitions from the C library, and all structures defined locally or globally in your header files. Standard include files may have provisions to keep them immune from such discrepancies, but your own structure definitions will not.
Don't do this with compile time switches unless you know exactly why you need to do it. Don't do this in source code (via pragmas, attributes and other non portable constructs) unless you are a real expert with complete control over every use of your data.
If you are concerned with structure packing, reorder the struct members and choose their types appropriately to achieve your goals. Changing the compiler's packing rules is highly error prone. See http://www.catb.org/esr/structure-packing/ for a complete study.
Systems demand that certain primitives be aligned to certain points within the memory (ints to bytes that are multiples of 4, shorts to bytes that are multiples of 2, etc.). Of course, these can be optimized to waste the least space in padding.
My question is why doesn't GCC do this automatically? Is the more obvious heuristic (order variables from biggest size requirement to smallest) lacking in some way? Is some code dependent on the physical ordering of its structs (is that a good idea)?
I'm only asking because GCC is super optimized in a lot of ways but not in this one, and I'm thinking there must be some relatively cool explanation (to which I am oblivious).
gcc does not reorder the elements of a struct, because that would violate the C standard. Section 6.7.2.1 of the C99 standard states:
Within a structure object, the non-bit-field members and the units in which bit-fields
reside have addresses that increase in the order in which they are declared.
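The ordering rule can be observed with offsetof. The sketch below uses a hypothetical struct mixed whose members are deliberately in a padding-heavy order; the compiler must still lay them out as declared.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

/* Hypothetical struct with members in a deliberately wasteful order. */
struct mixed {
    char a;
    int  b;
    char c;
};

/* The first member always sits at offset 0. */
static_assert(offsetof(struct mixed, a) == 0, "no padding before the first member");

/* Nonzero when the members' offsets increase in declaration order. */
int mixed_is_in_declaration_order(void)
{
    return offsetof(struct mixed, a) < offsetof(struct mixed, b)
        && offsetof(struct mixed, b) < offsetof(struct mixed, c);
}
```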
Structs are frequently used as representations of the packing order of binary file formats and network protocols. This would break if that were done. In addition, different compilers would optimize things differently and linking code together from both would be impossible. This simply isn't feasible.
GCC is smarter than most of us in producing machine code from our source code; however, I shiver if it was smarter than us in re-arranging our structs, since it's data that e.g. can be written to a file. A struct that starts with 4 chars and then has a 4 byte integer would be useless if read on another system where GCC decided that it should re-arrange the struct members.
gcc SVN does have a structure reorganization optimization (-fipa-struct-reorg), but it requires whole-program analysis and isn't very powerful at the moment.
C compilers don't automatically pack structs precisely because of alignment issues like you mention. Accesses not on word boundaries (32-bit on most CPUs) carry a heavy penalty on x86 and cause fatal traps on many RISC architectures.
Not saying it's a good idea, but you can certainly write code that relies on the order of the members of a struct. For example, as a hack, often people cast a pointer to a struct as the type of a certain field inside that they want access to, then use pointer arithmetic to get there. To me this is a pretty dangerous idea, but I've seen it used, especially in C++ to force a variable that's been declared private to be publicly accessible when it's in a class from a 3rd party library and isn't publicly encapsulated. Reordering the members would totally break that.
You might want to try the latest gcc trunk or the struct-reorg-branch, which is under active development.
https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Olga+Golovanevsky_+Memory+Layout+Optimizations+of+Structures+and+Objects.pdf
Older K&R (2nd ed.) and other C-language texts I have read that discuss the implementation of a dynamic memory allocator in the style of malloc() and free() usually also mention, in passing, something about data type alignment restrictions. Apparently certain computer hardware architectures (CPU, registers, and memory access) restrict how you can store and address certain value types. For example, there may be a requirement that a 4 byte (long) integer must be stored beginning at addresses that are multiples of four.
What restrictions, if any, do major platforms (Intel & AMD, SPARC, Alpha) impose for memory allocation and memory access, or can I safely ignore aligning memory allocations on specific address boundaries?
Sparc, MIPS, Alpha, and most other "classical RISC" architectures only allow aligned accesses to memory, even today. An unaligned access will cause an exception, but some operating systems will handle the exception by copying from the desired address in software using smaller loads and stores. The application code won't know there was a problem, except that the performance will be very bad.
MIPS has special instructions (lwl and lwr) which can be used to access 32 bit quantities from unaligned addresses. Whenever the compiler can tell that the address is likely unaligned it will use this two instruction sequence instead of a normal lw instruction.
x86 can handle unaligned memory accesses in hardware without an exception, but there is still a performance hit of up to 3X compared to aligned accesses.
Ulrich Drepper wrote a comprehensive paper on this and other memory-related topics, What Every Programmer Should Know About Memory. It is a very long writeup, but filled with chewy goodness.
Alignment is still quite important today. Some processors (the 68k family jumps to mind) would throw an exception if you tried to access a word value on an odd boundary. Today, most processors will run two memory cycles to fetch an unaligned word, but this will definitely be slower than an aligned fetch. Some other processors won't even throw an exception, but will fetch an incorrect value from memory!
If for no other reason than performance, it is wise to try to follow your processor's alignment preferences. Usually, your compiler will take care of all the details, but if you're doing anything where you lay out the memory structure yourself, then it's worth considering.
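When you do lay out memory yourself and must read a value from an address that may be unaligned, copying through memcpy is a common portable approach: memcpy has no alignment requirement, and compilers typically turn it into whatever load sequence the target supports. The helper names below are hypothetical.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>   /* uint32_t */
#include <string.h>   /* memcpy */

static_assert(sizeof(uint32_t) == 4, "this sketch assumes 8-bit bytes");

/* Read a 32-bit value from an address that may not be 4-byte aligned.
   The result is in host byte order. */
uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Store a 32-bit value to a possibly unaligned address. */
void store_u32_unaligned(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Casting the byte pointer to uint32_t * and dereferencing it instead would be undefined behavior when the address is not suitably aligned, even on hardware that tolerates unaligned loads.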
You still need to be aware of alignment issues when laying out a class or struct in C(++). In these cases the compiler will do the right thing for you, but the overall size of the struct/class may be more wasteful than necessary.
For example:
struct
{
    char A;
    int B;
    char C;
    int D;
};
Would have a size of 4 * 4 = 16 bytes (assume Windows on x86) whereas
struct
{
    char A;
    char C;
    int B;
    int D;
};
Would have a size of 4*3 = 12 bytes.
This is because the compiler enforces a 4 byte alignment for integers, but only 1 byte for chars.
In general pack member variables of the same size (type) together to minimize wasted space.
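The two sizes can be checked rather than assumed. The sketch below repeats the two layouts from the example above under the tag names interleaved and grouped (the names are assumptions for illustration) and computes the padding in each.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */

/* The two member orders from the example, with hypothetical tag names. */
struct interleaved { char A; int B; char C; int D; };
struct grouped     { char A; char C; int B; int D; };

/* Grouping the two chars never makes this struct larger. */
static_assert(sizeof(struct grouped) <= sizeof(struct interleaved),
              "grouping same-sized members should not add padding");

/* Bytes of padding in each layout: total size minus the members' sizes. */
size_t interleaved_padding(void)
{
    return sizeof(struct interleaved) - 2 * sizeof(char) - 2 * sizeof(int);
}

size_t grouped_padding(void)
{
    return sizeof(struct grouped) - 2 * sizeof(char) - 2 * sizeof(int);
}
```

With a 4-byte int aligned to 4 bytes, the sizes are 16 and 12 as stated in the answer, which corresponds to 6 and 2 bytes of padding.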
As Greg mentioned it is still important today (perhaps more so in some ways) and compilers usually take care of the alignment based on the target of the architecture. In managed environments, the JIT compiler can optimize the alignment based on the runtime architecture.
You may see pragma directives (in C/C++) that change the alignment. This should only be used when very specific alignment is required.
// For example, this changes the pack to 2 byte alignment.
#pragma pack(2)
Note that even on IA-32 and AMD64, some of the SSE instructions/intrinsics require aligned data. These instructions will throw an exception if the data is unaligned, so at least you won't have to debug "wrong data" bugs. There are equivalent unaligned instructions as well, but like Denton says, they are slower.
If you're using VC++, then besides the #pragma pack directives, you also have the __declspec(align(#)) directive for precise alignment. The VC++ documentation also mentions an _aligned_malloc function for specific alignment requirements.
As a rule of thumb, unless you are moving data across compilers/languages or are using the SSE instructions, you can probably ignore alignment issues.