Alignment restrictions for malloc()/free() - c

Older K&R (2nd ed.) and other C-language texts I have read that discuss the implementation of a dynamic memory allocator in the style of malloc() and free() usually also mention, in passing, something about data type alignment restrictions. Apparently certain computer hardware architectures (CPU, registers, and memory access) restrict how you can store and address certain value types. For example, there may be a requirement that a 4 byte (long) integer must be stored beginning at addresses that are multiples of four.
What restrictions, if any, do major platforms (Intel & AMD, SPARC, Alpha) impose for memory allocation and memory access, or can I safely ignore aligning memory allocations on specific address boundaries?

Sparc, MIPS, Alpha, and most other "classical RISC" architectures only allow aligned accesses to memory, even today. An unaligned access will cause an exception, but some operating systems will handle the exception by copying from the desired address in software using smaller loads and stores. The application code won't know there was a problem, except that the performance will be very bad.
MIPS has special instructions (lwl and lwr) which can be used to access 32 bit quantities from unaligned addresses. Whenever the compiler can tell that the address is likely unaligned it will use this two instruction sequence instead of a normal lw instruction.
x86 can handle unaligned memory accesses in hardware without an exception, but there is still a performance hit of up to 3X compared to aligned accesses.
Ulrich Drepper wrote a comprehensive paper on this and other memory-related topics, What Every Programmer Should Know About Memory. It is a very long writeup, but filled with chewy goodness.

Alignment is still quite important today. Some processors (the 68k family jumps to mind) would throw an exception if you tried to access a word value on an odd boundary. Today, most processors will run two memory cycles to fetch an unaligned word, but this will definitely be slower than an aligned fetch. Some other processors won't even throw an exception, but will fetch an incorrect value from memory!
If for no other reason than performance, it is wise to try to follow your processor's alignment preferences. Usually, your compiler will take care of all the details, but if you're doing anything where you lay out the memory structure yourself, then it's worth considering.

You still need to be aware of alignment issues when laying out a class or struct in C(++). In these cases the compiler will do the right thing for you, but the overall size of the struct/class may be more wasteful than necessary.
For example:
struct
{
    char A;
    int B;
    char C;
    int D;
};
Would have a size of 4 * 4 = 16 bytes (assume Windows on x86) whereas
struct
{
    char A;
    char C;
    int B;
    int D;
};
Would have a size of 4*3 = 12 bytes.
This is because the compiler enforces a 4 byte alignment for integers, but only 1 byte for chars.
In general pack member variables of the same size (type) together to minimize wasted space.

As Greg mentioned, it is still important today (perhaps more so in some ways), and compilers usually take care of the alignment based on the target architecture. In managed environments, the JIT compiler can optimize the alignment based on the runtime architecture.
You may see pragma directives (in C/C++) that change the alignment. This should only be used when very specific alignment is required.
// For example, this changes the pack to 2 byte alignment.
#pragma pack(2)

Note that even on IA-32 and AMD64, some of the SSE instructions/intrinsics require aligned data. These instructions will throw an exception if the data is unaligned, so at least you won't have to debug "wrong data" bugs. There are equivalent unaligned instructions as well, but like Denton says, they are slower.
If you're using VC++, then besides the #pragma pack directives, you also have the __declspec(align(#)) directive for precise alignment. The VC++ documentation also mentions an _aligned_malloc function for specific alignment requirements.
As a rule of thumb, unless you are moving data across compilers/languages or are using the SSE instructions, you can probably ignore alignment issues.

Related

Why does Clang-Tidy suggest a larger alignment?

Given the following c language struct definition:
typedef struct PackTest {
    long long a;
    int b;
    int c;
} PackTest;
Clang-Tidy gives the following message:
Accessing fields in struct 'PackTest' is inefficient due to poor alignment; currently aligned to 8 bytes, but recommended alignment is 16 bytes
I know why the struct is aligned to 8 bytes, but I don't know if the suggestion is valid and why.
Some particular specialized assembly instructions might have alignment requirements (for example, x86 non-scalar SSE instructions strictly require alignment to 16 bytes boundaries). Other instructions might have lower throughput when used on data that is not aligned to 16 byte boundaries (for example, x86 SSE2).
These kinds of instructions are usually used to perform aggressive optimizations based on the hardware features of the processor. Overall, the message you get is only useful in those scenarios (i.e. if you are actually planning to take advantage of such instructions).
See also:
What does alignment to 16-byte boundary mean in x86
Why and where align 16 is used for SSE alignment for instructions?
Finally I'll just quote Rich from the above comment since they make a really good point:
There is nothing "untidy" about having standard structs that are not ridiculously over-aligned. For very specialized purposes you might want an over-aligned object, but if it's flagging this then most things it's flagging are just wrong, and encouraging you to write code that's inefficient and gratuitously nonstandard.
You can add -altera-struct-pack-align to the Clang-Tidy checks list to disable this warning.
Source: https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg171275.html

Get memory granularity of a processor

How to get the memory granularity of a CPU in C?
Suppose I want to allocate an array where all the elements are properly memory aligned. I can pad each element to a certain size N to achieve this. How do I know the value of N?
Note: I am trying to create a memory pool where each slot is memory aligned. Any suggestion will be appreciated.
In Theory
How to get the memory granularity of a CPU in C?
First, you read the instruction set architecture manual. It may specify that certain instructions require certain alignments, or even that the addressing forms in certain instructions cannot represent non-aligned addresses. It may specify other properties regarding alignment.
Second, you read the processor manual. It may specify performance characteristics (such as that unaligned loads or stores are supported but may be slower or use more resources than aligned loads or stores) and may specify various options allowed by the instructions set architecture.
Third, you read the operating system documentation. Some architectures allow the operating system to select features related to alignment, such as whether unaligned loads and stores are made to fail or are supported albeit with slower performance than aligned loads or stores. The operating system documentation should have this information.
In Practice
For many programming situations, what you need to know is not the “memory granularity” of a CPU but the alignment requirements of the C implementation you are using (or of whatever language you are using). And, for the most part, you do not need to know the alignment requirements directly but just need to follow the language rules about managing objects: use objects with declared types, do not use casts to convert pointers between incompatible types except where specific rules allow it, use the suitably aligned memory as provided by malloc rather than adjusting your own pointers to bytes, and so on. Following these rules will give good alignment for the objects in your program.
In C, when you define an array, the element size will automatically be the size that the C implementation needs for its alignment. For example, long double x[100]; may use 16 bytes for each array element even though the hardware uses only ten bytes for a long double. Or, for any struct foo that you define, the compiler will automatically include padding as needed in the structure to give the desired alignment, and any array struct foo x[100]; will already include that padding. sizeof(struct foo) will be the same as sizeof x[0], because each structure object has that padding built in, even just for a single structure object, not just for elements in arrays.
When you do need to know the alignment that a C implementation requires for a type, you can use C’s _Alignof operator. The expression _Alignof(type) provides the alignment required for type.
Other
… properly memory aligned.
Proper alignment is a matter of degrees:
What the processor supports may determine whether your program works or does not work. An improper alignment is one that causes your program to trap.
What is efficient with respect to individual loads and stores may affect how fast your program runs. An improper alignment is one that causes your program to execute more slowly.
In certain performance-critical situations, alignment with respect to cache and memory mapping features can also affect performance.
Short answer
Use 64 bytes.
Long answer
Data are loaded from and stored to memory in units called cache lines. If your program loads only part of the data in a cache line, then the whole line will be loaded into the CPU caches. Perhaps more importantly, the algorithm used for moving data between cores in a multi-core CPU operates on full cache lines; aligning your data to cache lines avoids false sharing, the situation where a cache line bounces between cores because it contains data manipulated by different threads.
It used to be the case that cache line sizes depended on the architecture, ranging from 16 up to 512 bytes. However, most current mainstream processors (Intel, AMD, and most ARM and MIPS cores) use a cache line of 64 bytes; check your target's documentation if the exact value matters.
This depends heavily on the cpu microarchitecture that you are using.
In many cases, the memory address of an operand should be a multiple of the operand size; otherwise execution will be slow (or might even throw an exception).
But there are also CPUs which do not care about a specific alignment of the operands in memory at all.
Usually, the C compiler will care about those details for you. You should, however, make sure that the compiler assumes the correct target (micro-)architecture, for example by specifying it with the correct compiler flags (-march=? on gcc).

Where can I find what the alignment requirement for any arbitrary compiler? [closed]

I came across this page, The Lost Art of C Structure Packing, and while I have never had to actually pad any structs, I'd like to learn a bit more so that when/if I need to, I can.
It says:
Storage for the basic C datatypes on an x86 or ARM processor doesn’t normally start at arbitrary byte addresses in memory. Rather, each type except char has an alignment requirement; chars can start on any byte address, but 2-byte shorts must start on an even address, 4-byte ints or floats must start on an address divisible by 4, and 8-byte longs or doubles must start on an address divisible by 8. Signed or unsigned makes no difference.
Does this imply that all 32 bit processors (x86, ARM, AVR32, PIC32,...) have this alignment requirement? What about 16 bit processors?
If not, and it is device specific, where can I find this information?
I tried searching through Microchip XC16 Manual but I could not find the alignment requirements that say that ints start at addresses divisible by 4.
I assume that the information is there, and I am not searching for the right key words - what is the "alignment requirement" called if I were to search online for more information?
Alignment requirements have two considerations: what is required and what is efficient.
Required: Some platforms require various types, like an int, to be aligned. Contorted code that attempts to access an int on an unaligned boundary results in a fault. Compilers will normally align data automatically to prevent this issue.
Efficiency: Unaligned accesses may be allowed yet result in slower code. Many compilers, rather than packing the data, will default to aligned data for speed efficiency. Typically such compilers allow a compiler-specific keyword or compiler option to pack the data instead for space efficiency.
These issues apply to various processors of various sizes in different degrees. An 8-bit processor may have a 16-bit data bus and oblige 16+ -bit types to be aligned. A compliant C compiler for a 64-bit processor may have only 64-bit types, even char. The possibilities are vast.
C provides the type max_align_t in <stddef.h>. This could be used in various ways to determine the minimum general alignment requirement.
... max_align_t which is an object type whose alignment is as great as is supported by the implementation in all contexts; ... C11 §7.19 2
C also has _Alignas() to impose stricter alignment of a variable.
There are two global answers here. Yes, all processors have an alignment penalty of some sort (ARM, MIPS, x86, etc). No, you cannot determine it by type. Not all ARMs have the same alignment penalty. Despite what folks think they know about the older ARMv4 and ARMv5, you could do unaligned accesses in a predictable way; that predictable way was not what most of us would have preferred, and you had to enable it. MIPS, ARM, and perhaps others at one point had a severe punishment for unaligned transfers: you would get a data fault. But due to the nature of how programmers program, the default, at least for ARM, is to have that fault disabled on some newer cores. You can disable it or enable it, whichever way you want.
ALL processors have a penalty for unaligned transfers, a performance penalty, and those hits happen at the various layers, sometimes in the core, at the edge of the core, on each cache layer, and at the outer layer of ram. Since the designs vary so widely you cannot come up with a single rule.
Likewise, since alignment in compilers is implementation-defined, you can't write fully portable code. So if you are dealing with a processor (likely an ARM, since that is where most folks get bitten) that has unaligned faults enabled, the most portable solution, though not foolproof, is to start your structs with the 64-bit variables, then the 32-bit, then the 16-bit, then the 8-bit. Compilers place members in the order that you defined them, so as long as the whole struct starts on the right boundary for that target, the variables will fall into alignment properly with no padding required between them. There is no global solution to the problem other than not using structs, or disabling alignment checking and suffering the performance hits.
Note that the 32-bit ARMs we generally deal with today use a 64-bit AMBA/AXI bus, not a 32-bit one. They can still check all the alignments (16, 32, 64) for transfers if enabled, but the unaligned performance hits, at least at the AMBA/AXI level, don't affect you unless you cross a 64-bit aligned boundary. You may still have an extra cache line hit, although that is unlikely if you don't have an AMBA/AXI hit.

Why aren't structs packed by default?

While reading the CERT C Coding Standard, I came across DCL39-C which discusses why it's generally a bad idea for something like the Linux kernel to return an unpacked struct to userspace due to information leakage.
In a nutshell, structs aren't generally packed by default and the padding bytes between members of a struct often contain uninitialized data, hence the information leakage.
Why aren't structs packed by default? There was a mention in the guide that it's an optimization feature of compilers for specific architectures, I believe. Why is aligning structs to a certain byte size more efficient, as it wastes memory space?
Also, why doesn't the C standard specify a standardized way of asking for a packed struct? I can ask GCC using __attribute__((packed)), and there are other ways for different compilers, but it seems like a feature that'd be nice to have as part of the standard.
Data is carried through electronic circuits by groups of parallel wires (buses). Likewise, the circuits themselves tend to be arrayed in parallel. The physical distance between parallel components adds resistance and capacitance to any crosswise wires that bridge them. So, such bridges tend to be expensive and slow, and computer architects avoid them when possible.
Unaligned loads require shifting bytes across lanes. Some CPUs (e.g. efficiency-oriented RISC) are physically incapable of doing this, because the bridge component doesn't exist. Some will detect the condition and interpose a lane shift at the expense of a cycle or two. Others can handle misalignment without a speed penalty… assuming paged memory doesn't add another problem.
There's another, completely different issue. The memory management unit (MMU) sits between the CPU execution core and the memory bus, translating program-visible logical addresses to the physical addresses for memory chips. Two adjacent logical addresses might reside on different chips. The MMU is optimized for the common case where an access only requires one translation, but a misaligned access may require two.
A misaligned access straddling a page boundary might incur an exception, which might be fatal inside a kernel. Since pages are relatively large, this condition is relatively rare. It might evade tests, and it may be non-deterministic.
TL;DR: Packed structures shouldn't be used for active program state, especially not in a kernel. They may be used for serialization and communication, but safe usage is another question.
Leaving structs "unpacked" allows the compiler to align all members so that operations are more efficient on those members (measured in terms of clock time, number of instructions, etc). The alignment requirement for types depends on the host architecture and (for struct types) on the alignment requirement of contained members.
Packing struct members forces some (if not all) members to be aligned in a way that is sub-optimal for performance. In some worst cases, depending on host architecture, operations on unaligned variables (or on unaligned struct members) trigger a processor fault condition. Many RISC processor architectures, for example, generate an alignment fault when a load or store operation affects an unaligned address. Some SSE instructions on recent x86 architectures require the data they act on to be aligned on 16-byte boundaries.
In best cases, the operations behave as intended, but less efficiently, due to overhead of copying an unaligned variable to an aligned location or to a register, doing the operation there, and copying it back. Those copying operations are less efficient when unaligned variables are involved - after all, the processor architecture is optimised for performance when variable alignment meets its design requirements.
If you are worried about data leaking out of your program, simply use functions like memset() to overwrite the contents of your structures at the end of their lifetime (e.g. just before an instance is about to pass out of scope, or immediately before dynamically allocated memory is deallocated using free()).
Or use an operating system (like OpenBSD) which does overwrite memory before making it available to processes or programs. Bear in mind that such features tend to make both the operating system and programs it hosts run less efficiently.
Recent versions of the C standard (since 2011) do have some facilities to query and control alignment of variables (and affect packing of struct members). The default is whatever alignment is most effective for the host architecture - which for struct types normally means unpacked.
On some compilers such as Microchip XC8, all structs are indeed always packed.
On some platforms compilers will only generate byte access instructions to access members of a packed struct, because byte access instructions are always aligned. If all structs are packed, the 16-, 32-, and 64- bit load/store instructions are not used. This is an obvious waste of resources.
The C standard does not specify a way of packing structs, possibly because the standard itself is not aware of the concept of packing: the layout of non-bit-field members of structs is implementation-defined and therefore out of scope for the standard. Or possibly, the standard is made to support architectures that always add padding in structs, since such architectures are indeed theoretically feasible.

Data alignment: where can it be read off? can it be changed?

This is an excerpt from a book about data alignment of primitive types in memory.
Microsoft Windows imposes a stronger alignment requirement—any primitive object of K bytes, for
K = 2, 4, or 8, must have an address that is a multiple of K. In particular, it requires that the address
of a double or a long long be a multiple of 8. This requirement enhances the memory performance at
the expense of some wasted space. The Linux convention, where 8-byte values are aligned on 4-byte
boundaries was probably good for the i386, back when memory was scarce and memory interfaces were
only 4 bytes wide. With modern processors, Microsoft’s alignment is a better design decision. Data type
long double, for which gcc generates IA32 code allocating 12 bytes (even though the actual data type
requires only 10 bytes) has a 4-byte alignment requirement with both Windows and Linux.
Questions are:
What imposes data alignment, OS or compiler?
Can I change it, or is it fixed?
Generally speaking, it's the compiler that imposes the alignment. Whenever you declare a primitive type (eg. double), the compiler will automatically align it to 8 bytes on the stack.
Furthermore, memory allocations are also generally aligned to the largest primitive type so that you can safely do this:
double *ptr = (double*)malloc(size);
without having to worry about alignment.
Therefore, generally speaking, if you're programming with good habits, you won't have to worry about alignment. One way to get something misaligned is to do something like this:
char *ch_ptr = (char*)malloc(size);
double *d_ptr = (double*)(ch_ptr + 1);
There are some exceptions to this: When you start getting into SSE and vectorization, things get a bit messy because malloc no longer guarantees 16-byte alignment.
To override the alignment of something, MSVC has the __declspec(align(#)) modifier. It's used to increase the alignment of something; the documentation says explicitly that you cannot decrease alignment with this modifier.
EDIT :
I found the documentation stating the alignment of malloc() in the GNU C Library (the library typically used with GCC on Linux):
The address of a block returned by malloc or realloc in the GNU system
is always a multiple of eight (or sixteen on 64-bit systems).
Source: http://www.gnu.org/s/hello/manual/libc/Aligned-Memory-Blocks.html
So yes, malloc on GNU systems aligns to at least 8 bytes.
The x86 CPUs have pretty lax alignment requirements. Most of data can be stored and accessed at unaligned locations, possibly at the expense of degraded performance. Things become more complex when you start developing multiprocessor software as alignment becomes important for atomicity and observed order of events (writing this from memory, this may be not entirely correct).
Compilers can often be directed to align variables differently from the default alignment. There're compiler options for that and special compiler-specific keywords (e.g. #pragma pack and others).
The well-established OS APIs can't be changed, neither by the application programmer (the OS is already compiled), nor by the OS developers (unless, of course, they are OK with breaking compatibility).
So, you can change some things, but not everything.
I don't know where Microsoft got its information from, but the results on
gcc (4.6.1 Target: x86_64-linux-gnu, standard mode, no flags except -Wall) are quite different:
#include <stdio.h>
struct lll {
    long l;
    long long ll;
};
struct lld {
    long l;
    long double ld;
};
struct lll lll1, lll2[2];
struct lld lld1, lld2[2];
int main(void)
{
    printf("lll1=%u, lll2=%u\n"
        , (unsigned) sizeof lll1
        , (unsigned) sizeof lll2
        );
    printf("lld=%u, lld2=%u\n"
        , (unsigned) sizeof lld1
        , (unsigned) sizeof lld2
        );
    return 0;
}
Results:
./a.out
lll1=16, lll2=32
lld=32, lld2=64
This might be FUD (from the company that actually managed to put unaligned ints into the MBR ...). But it could also be a result of the author not being informed too well.
To answer the question: it is the hardware that imposes the alignment restrictions. The compiler only needs to implement them.
