Memory alignment today and 20 years ago

In the famous paper "Smashing the Stack for Fun and Profit", its author takes a C function
void function(int a, int b, int c) {
char buffer1[5];
char buffer2[10];
}
and generates the corresponding assembly code output
pushl %ebp
movl %esp,%ebp
subl $20,%esp
The author explains that since computers address memory in multiples of word size, the compiler reserved 20 bytes on the stack (8 bytes for buffer1, 12 bytes for buffer2).
I tried to recreate this example and got the following
pushl %ebp
movl %esp, %ebp
subl $16, %esp
A different result! I tried various combinations of sizes for buffer1 and buffer2, and it seems that modern gcc does not pad buffer sizes to multiples of word size anymore. Instead it abides by the -mpreferred-stack-boundary option.
As an illustration -- using the paper's arithmetic rules, for buffer1[5] and buffer2[13] I'd get 8+16 = 24 bytes reserved on the stack. But in reality I got 32 bytes.
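Both sets of numbers can be reproduced with a little arithmetic. The sketch below is only a model fitted to the outputs quoted above, not gcc's actual frame-layout logic: it assumes the paper's compiler padded each buffer to a 4-byte word, and that the newer gcc sums the raw sizes and rounds the whole frame up to a 16-byte boundary. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of m (m must be non-zero). */
static size_t round_up(size_t n, size_t m)
{
    assert(m != 0);
    return (n + m - 1) / m * m;
}

/* The paper's rule: pad every buffer separately to the 4-byte word size. */
static size_t frame_size_per_buffer(const size_t *sizes, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += round_up(sizes[i], 4);
    return total;
}

/* Assumed model of the newer behaviour: sum the raw sizes, then round the
   whole frame up to the stack boundary (16 bytes in the outputs above). */
static size_t frame_size_whole_frame(const size_t *sizes, size_t count,
                                     size_t boundary)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += sizes[i];
    return round_up(total, boundary);
}
```

With sizes {5, 10} the first rule gives 20 and the second 16; with {5, 13} they give 24 and 32, matching the numbers reported here.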
The paper is quite old and a lot has happened since. I'd like to know what exactly motivated this change of behavior. Is it the move towards 64-bit machines, or something else?
Edit
The code is compiled on an x86_64 machine using gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) like this:
$ gcc -S -o example1.s example1.c -fno-stack-protector -m32

What has changed is SSE, which requires 16-byte alignment. This is covered in this older gcc document for -mpreferred-stack-boundary=num, which says (emphasis mine):
On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 suffers similar penalties if it is not 16 byte aligned.
This is also backed up by the paper Smashing The Modern Stack For Fun And Profit, which covers this and other modern changes that break Smashing the Stack for Fun and Profit.

Memory alignment, of which stack alignment is just one aspect, depends on the architecture. It is partly defined in the Application Binary Interface of the language and a Procedure Call Standard (sometimes both are in a single spec) for the architecture (the CPU; it might even vary by platform), and it also depends on the compiler/toolchain where those documents leave room for variation.
Those two documents (names may vary) are mostly for the external interface between functions; they might leave internal structure to the toolchain. However, that has to match the architecture. Normally the hardware requires a minimal alignment, but allows for a larger alignment for performance reasons (e.g.: byte-alignment minimum, but this would require multiple bus-cycles to read a 32 bit word, so the compiler uses a 32 bit alignment).
Normally, the compiler (following the PCS) uses an alignment optimal for the architecture and under control of optimization settings (optimize for speed or size). It takes into account not only the size of the object (aligned to its natural boundary), but also sizes of internal busses (e.g. a 32 bit x86 has internal 64 or 128 bit busses, ARM CPUs have internal 32 to 128 (possibly even wider) bit busses), caches, etc. For local variables, it may also take into account access-patterns, so two adjacent variables may be loaded in parallel into a register pair instead of two separate loads or even reorder such variables.
The stack pointer might require a higher alignment, for instance so the CPU can push two registers at once in an interrupt frame, push vector registers which require higher alignment, etc. You can write quite a thick book about this subject (and I bet someone already has).
So, in general, there is no single one-alignment-fits-all rule. However, for struct and array packing, the C standard does define some rules for packing/alignment, mostly to guarantee consistency of e.g. sizeof(type) and the address in an array (required for correct malloc()).
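Those guarantees can be checked directly. A small sketch, assuming a C11 compiler; struct sample and element_stride are made-up names, and the exact padding depends on the ABI, so only the relationships are asserted, not specific byte counts.

```c
#include <assert.h>
#include <stddef.h>

struct sample {
    char tag;   /* 1 byte, usually followed by padding */
    int  value; /* placed at a multiple of _Alignof(int) */
};

/* sizeof is a multiple of the alignment, so every element of an array of
   the type is itself correctly aligned. */
static_assert(sizeof(struct sample) % _Alignof(struct sample) == 0,
              "size is a multiple of alignment");
static_assert(offsetof(struct sample, value) % _Alignof(int) == 0,
              "member sits on its own alignment boundary");

/* Byte distance between consecutive array elements: always sizeof(T). */
static size_t element_stride(void)
{
    struct sample arr[2];
    return (size_t)((char *)&arr[1] - (char *)&arr[0]);
}
```

This is what lets malloc(n * sizeof(struct sample)) be indexed as an array without any element ending up misaligned.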
Even char arrays might be aligned for optimal cache layout. Note it is not only the CPU which might have caches, but also PCIe bridges, not to mention PCIe transfers themselves down to DRAM pages.

I have not tried that specific version of compiler or the distribution version you report. My guess would be the 16 is from byte alignment requirements on stack (i.e. all stack adjustments would be x byte aligned and x may be 16 for your invocation).
Note that the variable alignment you seem to have started with is slightly different from the above, and is controlled by align markings on the variable in gcc. Try using those and you should see a difference.
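As a sketch of what such a marking can look like: C11 spells it alignas (from <stdalign.h>), and gcc also accepts __attribute__((aligned(16))). The function names below are made up, and 16 is just an example boundary.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Distance of p past the previous multiple of boundary (0 means aligned). */
static size_t misalignment(const void *p, size_t boundary)
{
    assert(boundary != 0);
    return (size_t)((uintptr_t)p % boundary);
}

/* A 5-byte local buffer explicitly marked for 16-byte alignment. */
static size_t marked_buffer_misalignment(void)
{
    alignas(16) char buffer1[5];
    buffer1[0] = 0; /* touch the buffer so it is really materialised */
    return misalignment(buffer1, 16);
}
```

Compiling a function like this with -S should show the extra stack adjustment the marking forces, compared with the unmarked buffer1[5].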

Related

Can memcpy of array of 16-bit objects be interrupted in between

Global data:
uint16_t global_buffer[128];
Thread 1:
uint16_t local_buffer[128];
while(true)
{
...
if(data_ready)
memcpy(global_buffer, local_buffer, sizeof(uint16_t)*128);
}
Thread 2:
void timer_handler()
{
uint16_t value = global_buffer[10];
//do something with value
}
My question is whether this is safe to do? I mean, is it guaranteed that value will either get an old value or a new value (if thread 1 memcpy() is interrupted by context switch)?
Is it possible that the memcpy gets interrupted after one byte of the 16-bit value is updated but not the second? In that case, value will be garbage.
If memcpy operation only gets interrupted in between blocks of even number of bytes, I think this is safe.
Platforms: x86 & x86-64 only (only Intel i7 processor or newer actually)
OS: Linux
Compiler: gcc

It would depend on the implementation of memcpy() - there are no guarantees. Even if you know the implementation makes this safe, it would be unwise to rely on it remaining so across all versions and platforms this code or pattern may get re-used on.
You might implement your own word-by-word 16 bit copy with a word copy that you know to be atomic. How to do that warrants a new question.
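A minimal sketch of such an element-by-element copy, assuming C11 <stdatomic.h> is available and that the buffer can be declared _Atomic. The function names publish and read_element are made up here. Relaxed ordering only rules out tearing within one element; it does not make the whole 128-element buffer a consistent snapshot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define BUFFER_LEN 128

/* Shared buffer: every element is individually atomic. */
static _Atomic uint16_t global_buffer[BUFFER_LEN];

/* Writer side: copy element by element with atomic stores, so a reader
   sees either the old or the new value of each element, never a mix. */
static void publish(const uint16_t *local_buffer, size_t count)
{
    assert(count <= BUFFER_LEN);
    for (size_t i = 0; i < count; i++)
        atomic_store_explicit(&global_buffer[i], local_buffer[i],
                              memory_order_relaxed);
}

/* Reader side: a single atomic load of one element. */
static uint16_t read_element(size_t index)
{
    assert(index < BUFFER_LEN);
    return atomic_load_explicit(&global_buffer[index], memory_order_relaxed);
}
```

Thread 1 would call publish(local_buffer, 128) where it currently calls memcpy, and the timer handler would call read_element(10).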

Interrupts aren't really relevant unless you're running this on a single-core VM. On a normal system with a multi-core CPU, two threads can be running simultaneously on separate cores. This is why we have C++ std::atomic<> and C _Atomic which are useful for single variables like int.
It depends on your memcpy implementation. Any non-terrible one won't do any single-byte copies, and all the 16-bit loads/stores will actually be part of larger loads/stores (or possibly the internals for rep movsb microcode). It's hard to imagine how a sensible compiler (not a DeathStation 9000) would ever choose to inline a copy that could introduce tearing across a uint16_t boundary.
But if you don't do it manually (e.g. with AVX intrinsics), it is barely possible some weird optimization could get a compiler to do a byte load/store.
For a SIMD implementation like a normal library will use for small sizes, it comes down to Per-element atomicity of vector load/store and gather/scatter? - annoyingly there's no formal guarantee from either major x86 vendor (AMD or Intel). It's almost certain that it's safe, though, especially if the entire vector is itself aligned (so no cache-line splits or page splits). Using alignas(64) uint16_t global_buffer[128]; would be a good way to ensure that.
If your total copy size wasn't a multiple of the vector width, overlapping copies still won't introduce tearing within one uint16_t. Like the first 8 uint16_t and the last 8 uint16_t, for copy sizes from 8 (full overlap) to 16 (no overlap) array elements.
And BTW, that's basically what glibc memcpy does for small copies. A 4 to 7-byte memcpy is done with two 4-byte loads and 4-byte stores, 32 .. 63 bytes is done with 2x 32-byte vectors. (2 fully-overlapping avoids store-forwarding stalls when reading later, vs. two non-overlapping halves. The upper end might actually let it go up to 64 bytes with a pair of full-size AVX vectors.)
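The overlapping-chunk idea can be sketched in portable C. This is an illustration of the pattern, not glibc's code: each fixed-size 16-byte memcpy is something compilers commonly turn into one vector load or store, but that is an assumption about code generation, not a guarantee. The function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy between 8 and 16 uint16_t elements using two fixed-size 16-byte
   chunks: the first 8 elements and the last 8 elements. The chunks overlap
   when count < 16, but every chunk boundary falls between elements. */
static void copy_8_to_16_elements(uint16_t *dst, const uint16_t *src,
                                  size_t count)
{
    uint16_t head[8], tail[8];

    assert(count >= 8 && count <= 16);
    /* Load both chunks before storing either one. */
    memcpy(head, src, sizeof head);
    memcpy(tail, src + (count - 8), sizeof tail);
    memcpy(dst, head, sizeof head);
    memcpy(dst + (count - 8), tail, sizeof tail);
}
```

Because both chunk start offsets are whole multiples of sizeof(uint16_t), no element is ever split between the two stores.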

how does the processor read memory?

I'm trying to re-implement malloc and I need to understand the purpose of the alignment. As I understand it, if the memory is aligned, the code will be executed faster because the processor won't have to take an extra step to recover the bits of memory that are cut. I think I understand that a 64-bit processor reads 64-bit by 64-bit memory. Now, let's imagine that I have a structure with in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be on an address which is a multiple of 2. Same question for the integers and other types?
I also have a second question: With the structure I mentioned before, how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?

The effects can even include correctness, not just performance: C Undefined Behaviour (UB) leading to possible segfaults or other misbehaviour for example if you have a short object that doesn't satisfy alignof(short). (Faulting is expected on ISAs where load/store instructions require alignment by default, like SPARC, and MIPS before MIPS64r6. And possible even on x86 after compiler optimization of loops, even though x86 asm allows unaligned loads/stores except for some SIMD with 16-byte or wider.)
Or tearing of atomic operations if an _Atomic int doesn't have alignof(_Atomic int).
(Typically alignof(T) = sizeof(T) up to some size, often register width or wider, in any given ABI).
malloc should return memory with alignof(max_align_t) because you don't have any type info about how the allocation will be used.
For allocations smaller than sizeof(max_align_t), you can return memory that's merely naturally aligned (e.g. a 4-byte allocation aligned by 4 bytes) if you want, because you know that storage can't be used for anything with a higher alignment requirement.
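For a home-grown malloc, the simplest way to honour this is to round every returned offset up to alignof(max_align_t). A toy sketch under that assumption: a bump allocator over a static arena that never frees. The names arena, align_up and bump_alloc are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE 4096

/* Backing storage, itself aligned for any standard type. */
static _Alignas(max_align_t) unsigned char arena[ARENA_SIZE];
static size_t arena_used;

/* Round n up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Hand out the next chunk, starting on a max_align_t boundary.
   Returns NULL when the arena is exhausted. */
static void *bump_alloc(size_t size)
{
    size_t start = align_up(arena_used, _Alignof(max_align_t));
    if (start > ARENA_SIZE || size > ARENA_SIZE - start)
        return NULL;
    arena_used = start + size;
    return &arena[start];
}
```

A real allocator would also need free lists and headers, and any header it places in front of a block has to keep the user pointer on the same boundary.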
Over-aligned stuff like the dynamically-allocated equivalent of alignas (16) int32_t foo needs to use a special allocator like C11 aligned_alloc. If you're implementing your own allocator library, you probably want to support aligned_realloc and aligned_calloc, filling those gaps that ISO C leaves for no apparent reason.
And make sure you don't implement the braindead ISO C++17 requirement for aligned_alloc to fail if the allocation size isn't a multiple of the alignment. Nobody wants an allocator that rejects an allocation of 101 floats starting on a 16-byte boundary, or much larger for better transparent hugepages. aligned_alloc function requirements and How to solve the 32-byte-alignment issue for AVX load/store operations?
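For callers stuck with a strict aligned_alloc, one workaround is to round the size up before calling it. A minimal sketch, assuming a C11 library that provides aligned_alloc; the wrapper name aligned_alloc_any_size is made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Accept any size by rounding it up to a multiple of the alignment before
   calling C11 aligned_alloc. alignment must be a power of two that the
   implementation supports. Release the result with free(). */
static void *aligned_alloc_any_size(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    if (rounded < size) /* overflow while rounding up */
        return NULL;
    return aligned_alloc(alignment, rounded);
}
```

For example, aligned_alloc_any_size(16, 101 * sizeof(float)) asks the library for 416 bytes instead of 404, so the 101-float case above succeeds even on a strict implementation.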
I think I understand that a 64-bit processor reads 64-bit by 64-bit memory
Nope. Data bus width and burst size, and load/store execution unit max width or actually-used width, don't have to be the same as width of integer registers, or however the CPU defines its bitness. (And in modern high performance CPUs typically aren't. e.g. 32-bit P5 Pentium had a 64-bit bus; modern 32-bit ARM has load/store-pair instructions that do atomic 64-bit accesses.)
Processors read whole cache lines from DRAM / L3 / L2 cache into L1d cache; 64 bytes on modern x86; 32 bytes on some other systems.
And when reading individual objects or array elements, they read from L1d cache with the element width. e.g. a uint16_t array may only benefit from alignment to a 2-byte boundary for 2-byte loads/stores.
Or if a compiler vectorizes a loop with SIMD, a uint16_t array can be read 16 or 32 bytes at a time, i.e. SIMD vectors of 8 or 16 elements. (Or even 64 with AVX512). Aligning arrays to the expected vector width can be helpful; unaligned SIMD load/store run fast on modern x86 when they don't cross a cache-line boundary.
Cache-line splits and especially page-splits are where modern x86 slows down from misalignment; unaligned within a cache line generally not because they spend the transistors for fast unaligned load/store. Some other ISAs slow down, and some even fault, on any misalignment, even within a cache line. The solution is the same: give types natural alignment: alignof(T) = sizeof(T).
In your struct example, modern x86 CPUs will have no penalty even though the short is misaligned. alignof(int) = 4 in any normal ABI, so the whole struct has alignof(struct) = 4, so the char;short;char block starts at a 4-byte boundary. Thus the short is contained within a single 4-byte dword, not crossing any wider boundary. AMD and Intel both handle this with full efficiency. (And the x86 ISA guarantees that accesses to it are atomic, even uncached, on CPUs compatible with P5 Pentium or later: Why is integer assignment on a naturally aligned variable atomic on x86?)
Some non-x86 CPUs would have penalties for the misaligned short, or have to use other instructions. (Since you know the alignment relative to an aligned 32-bit chunk, for loads you'd probably do a 32-bit load and shift.)
So yes there's no problem accessing one single word containing the short, but the problem is for load-port hardware to extract and zero-extend (or sign-extend) that short into a full register. This is where x86 spends the transistors to make this fast. (@Eric's answer on a previous version of this question goes into more detail about the shifting required.)
Committing an unaligned store back into cache is also non-trivial. For example, L1d cache might have ECC (error-correction against bit flips) in 32-bit or 64-bit chunks (which I'll call "cache words"). Writing only part of a cache word is thus a problem for that reason, as well as for shifting it to an arbitrary byte boundary within the cache word you want to access. (Coalescing of adjacent narrow stores in the store buffer can produce a full-width commit that avoids an RMW cycle to update part of a word, in caches that handle narrow stores that way). Note that I'm saying "word" now because I'm talking about hardware that's more word-oriented instead of being designed around unaligned loads/stores the way modern x86 is. See Are there any modern CPUs where a cached byte store is actually slower than a word store? (storing a single byte is only slightly simpler than an unaligned short)
(If the short spans two cache words, it would of course need two separate RMW cycles, one for each byte.)
And of course the short is misaligned for the simple reason that alignof(short) = 2 and it violates this ABI rule (assuming an ABI that does have that). So if you pass a pointer to it to some other function, you could get into trouble. Especially on CPUs that have fault-on-misaligned loads, instead of hardware handling that case when it turns out to be misaligned at runtime. Then you can get cases like Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? where GCC auto-vectorization expected to reach a 16-byte boundary by doing some multiple of 2-byte elements scalar, so violating the ABI leads to a segfault on x86 (which is normally tolerant of misalignment.)
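When code really has to read a short from an address that might be odd (a packed file format, say), the usual way to stay within the C rules is to go through memcpy instead of dereferencing a misaligned pointer. A minimal sketch with made-up helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 16-bit value from an address that may be odd. memcpy makes no
   alignment assumption, so this avoids dereferencing a misaligned
   uint16_t pointer (undefined behaviour in C). */
static uint16_t load_u16_unaligned(const void *p)
{
    uint16_t value;
    assert(p != NULL);
    memcpy(&value, p, sizeof value);
    return value;
}

/* Matching store for a possibly misaligned destination. */
static void store_u16_unaligned(void *p, uint16_t value)
{
    assert(p != NULL);
    memcpy(p, &value, sizeof value);
}
```

Compilers targeting x86 typically turn each helper into a single 2-byte move, so on that platform the safe version normally costs nothing extra.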
For the full details on memory access, from DRAM RAS / CAS latency up to cache bandwidth and alignment, see What Every Programmer Should Know About Memory? It's pretty much still relevant / applicable
Also Purpose of memory alignment has a nice answer. There are plenty of other good answers in SO's memory-alignment tag.
For a more detailed look at (somewhat) modern Intel load/store execution units, see: https://electronics.stackexchange.com/questions/329789/how-can-cache-be-that-fast/329955#329955
how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
It doesn't, other than the fact it's running instructions which treat the data that way.
In asm / machine-code, everything is just bytes. Every instruction specifies exactly what to do with which data. It's up to the compiler (or human programmer) to implement variables with types, and the logic of a C program, on top of a raw array of bytes (main memory).
What I mean by that is that in asm, you can run any load or store instruction you want to, and it's up to you to use the right ones on the right addresses. You could load 4 bytes that overlap two adjacent int variables into a floating-point register, and then run addss (single-precision FP add) on it, and the CPU won't complain. But you probably don't want to because making the CPU interpret those 4 bytes as an IEEE754 binary32 float is unlikely to be meaningful.
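The same point can be shown from C: the bytes have no type of their own, only the operations applied to them do. A small sketch, assuming float is IEEE754 binary32 (true on x86); the function names are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");

/* Reinterpret the same 4 bytes as a float, without converting the value. */
static float bits_as_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* And the other way round: look at a float's raw bit pattern. */
static uint32_t float_as_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

The integer 0x3F800000 (1065353216) and the float 1.0f are the same four bytes; which one you get depends only on which instructions read them.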

Modern processors and memory are built to optimize memory access as much as possible. One current way of accessing memory is to address it not byte by byte but by the address of a bigger block, e.g. an 8-byte block. You do not need the 3 lower bits of the address this way. To access a certain byte within the block, the processor needs to get the block at the aligned address, then shift and mask the byte. So, it gets slower.
When fields in the struct are not aligned, there is a risk of slowing down the access to them. Therefore, it is better to align them.
But the alignment requirements are based on the underlying platform. For systems which support word access (32 bit), 4-byte alignment is ok, otherwise 8-byte can be used or some other. The compiler (and libc) knows the requirements.
So, in your example char, short, char, the short will start at an odd byte position if not padded. To access it, the system might need to read the 64 bit word for the struct, then shift it 1 byte right and then mask 2 bytes in order to provide you with this short.
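The shift-and-mask step looks like this when written out in software. A sketch assuming little-endian byte numbering (byte 0 is the least significant) and a field that lies entirely inside one 8-byte block; extract_u16 is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Extract a 16-bit field that starts byte_offset bytes into a 64-bit block
   that has already been fetched from an aligned address. */
static uint16_t extract_u16(uint64_t block, unsigned byte_offset)
{
    assert(byte_offset <= 6); /* the field must not run past the block */
    return (uint16_t)((block >> (8 * byte_offset)) & 0xFFFFu);
}
```

For the char, short, char layout the short sits at byte offset 1, so a block holding the bytes 11 22 33 44 yields 0x3322.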

As I understand it, if the memory is aligned, the code will be executed faster because the processor won't have to take an extra step to recover the bits of memory that are cut.
It's not necessarily an execution thing. An x86 has variable-length instructions, from single 8-bit instructions up to several bytes, so it's all about being unaligned, but they have taken measures to smooth that out for the most part.
If I have a 64 bit bus on the edge of my processor that doesn't mean edge of chip that means edge of the core. The other side of this is a memory controller that knows the bus protocol and is the first place the addresses start to be decoded and the transactions start to split up down other buses toward their destination.
It is very much architecture and bus design specific and you can have architectures with different buses over time or different versions you can get an arm with a 64 bus or a 32 bit bus for example. But let's say we have a not atypical situation where the bus is 64 bits wide and all transactions on that bus are aligned on a 64 bit boundary.
If I were to do a 64 bit write to 0x1000 that would be a single bus transaction, which these days is some sort of write address bus with some id x and a length of 0 (n-1) then the other side acks that I see you want to do a write with id x, I am ready to take your data. Then the processor uses the data bus with id x to send the data, one clock per 64 bits this is a single 64 bit so one clock on that bus. and maybe an ack comes back or maybe not.
But if I wanted to do a 64 bit write to 0x1004, what would happen is that turns into two transactions one complete 64 bit address/data transaction at address 0x1000 with only four byte lanes enabled lanes 4-7 (representing bytes at address 0x1004-0x1007). Then a complete transaction at 0x1008 with 4 byte lanes enabled, lanes 0-3. So the actual data movement across the bus goes from one clock to two, but there is also twice the overhead of the handshakes to get to those data cycles. On that bus it is very noticeable, how the overall system design is though you may feel it or not, or may have to do many of them to feel it or not. But the inefficiency is there, buried in the noise or not.
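The transaction counting can be written down as arithmetic. This is a toy model of the hypothetical bus described here (64 bits wide, every transaction aligned to 8 bytes), not of any particular real bus; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_BYTES 8u /* assumed 64-bit bus, transactions aligned to 8 bytes */

/* Number of aligned bus transactions needed for `size` bytes at `addr`. */
static unsigned bus_transactions(uint64_t addr, unsigned size)
{
    assert(size > 0);
    uint64_t first = addr / BUS_BYTES;
    uint64_t last = (addr + size - 1) / BUS_BYTES;
    return (unsigned)(last - first + 1);
}

/* Byte-lane enable mask for the first transaction: bit i set means
   lane i carries data. */
static unsigned first_lane_mask(uint64_t addr, unsigned size)
{
    assert(size > 0);
    unsigned start = (unsigned)(addr % BUS_BYTES);
    unsigned end = start + size; /* one past the last enabled lane */
    if (end > BUS_BYTES)
        end = BUS_BYTES;
    return ((1u << end) - 1u) & ~((1u << start) - 1u);
}
```

A 64-bit write at 0x1000 is one transaction with all lanes enabled; at 0x1004 it becomes two, the first with mask 0xF0 (lanes 4-7) and the second, at 0x1008, with lanes 0-3.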
I think I understand that a 64-bit processor reads 64-bit by 64-bit memory.
Not a good assumption at all. 32 bit ARMs have 64 bit buses these days the ARMv6 and ARMv7s for example come with them or can.
Now, let's imagine that I have a structure with in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be on an address which is a multiple of 2. Same question for the integers and other types?
unsigned char a 0x1000
unsigned short b 0x1001
unsigned char c 0x1003
unsigned int d 0x1004
You would normally use the structure items in the code something.a something.b something.c something.d. When you access something.b that is a 16 bit transaction against the bus. In a 64 bit system you are correct that if aligned as I have addressed it, then the whole structure is being read when you do x = something.b but the processor is going to discard all but byte lanes 1 and 2 (discarding 0 and 3-7), then if you access something.c it will do another bus transaction at 0x1000 and discard all but lane 3.
When you do a write to something.b with a 64 bit bus only byte lanes 1 and 2 are enabled. Now where more pain comes in is if there is a cache it is likely also constructed of a 64 bit ram to mate up with this bus, doesn't have to, but let's assume it does. You want to write through the cache to something.b, a write transaction at 0x1000 with byte lanes 1 and 2 enabled 0, 3-7 disabled. The cache ultimately gets this transaction, it internally has to do a read-modify write because it is not a full 64 bit wide transaction (all lanes enabled) so you are taking hit with that read-modify write from a performance perspective as well (same was true for the unaligned 64 bit write above).
The short is unaligned because when packed its address lsbit is set. To be aligned, a 16 bit item in an 8-bit-is-a-byte world needs that bit to be zero; for a 32 bit item to be aligned the lower two bits of its address are zero; for 64 bit, three zeros, and so on.
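That low-bits rule is a one-line check. A sketch with a made-up helper name, assuming item sizes that are powers of two:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is aligned for an item of `size` bytes (size a power of
   two): the low log2(size) address bits must all be zero. */
static bool is_aligned(uint64_t addr, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}
```

For the packed layout, is_aligned(0x1001, 2) is false for the short, while is_aligned(0x1004, 4) is true for the int.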
Depending on the system you may end up on a 32 or 16 bit bus (not for memory so much these days) so you can end up with the multiple transfers thing.
Your highly efficient processors like MIPS and ARM took the approach of aligned instructions, and forced aligned transactions even in the something.b case that specifically doesn't have a penalty on a 32 nor 64 bit bus. The approach is performance over memory consumption, so the instructions are to some extent wasteful in their consumption to be more efficient in their fetching and execution. The data bus is likewise much simpler. When high level concepts like a struct in C are constructed there is memory waste in padding to align each item in the struct to gain performance.
unsigned char a 0x1000
unsigned short b 0x1002
unsigned char c 0x1004
unsigned int d 0x1008
as an example
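Both layouts can be checked with offsetof. A sketch assuming a typical ABI where short is 2 bytes and int is 4 bytes with natural alignment (x86 and ARM Linux, for instance). __attribute__((packed)) is a gcc/clang extension, not ISO C, and the struct names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Default layout: the compiler inserts padding so each member is aligned. */
struct padded_layout {
    unsigned char  a;
    unsigned short b;
    unsigned char  c;
    unsigned int   d;
};

/* Packed layout: no padding, so b lands on an odd offset. */
struct __attribute__((packed)) packed_layout {
    unsigned char  a;
    unsigned short b;
    unsigned char  c;
    unsigned int   d;
};

static_assert(offsetof(struct packed_layout, b) == 1,
              "packed: the short directly follows the first char");
static_assert(sizeof(struct padded_layout) >= sizeof(struct packed_layout),
              "padding never makes the struct smaller");
```

On such an ABI the padded offsets are 0, 2, 4, 8 with a total size of 12, and the packed offsets are 0, 1, 3, 4 with a total size of 8, matching the two address tables.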
I also have a second question: With the structure I mentioned before, how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
unsigned char c 0x1003
the compiler generates a single byte sized read at address 0x1003, this turns in to that specific instruction with that address and the processor generates the bus transaction to do that, the other side of the processor bus then does its job and so on down the line.
The compiler in general does not turn a packed version of that struct into a single 64 bit transaction that gives you all of the items, you burn a 64 bit bus transaction for each item.
it is possible that depending on the instruction set, prefetcher, caches and so on that instead of using a struct at a high level you create a single 64 bit integer and you do the work in the code, then you might or might not gain performance. This is not expected to perform better on most architectures running with caches and such, but when you get into embedded systems where you may have some number of wait states on the ram or some number of wait states on the flash or whatever code storage there is you can find times where instead of fewer instructions and more data transactions you want more instructions and fewer data transactions. code is linear a code section like this read, mask and shift, mask and shift, etc. the instruction storage may have a burst mode for linear transactions but data transactions take as many clocks as they take.
A middle ground is to just make everything a 32 bit variable or a 64 bit, then it is all aligned and performs relatively well at the cost of more memory used.
Because folks don't understand alignment, have been spoiled by x86 programming, choose to use structs across compile domains (such a bad idea), the ARMs and others are tolerating unaligned accesses, you can very much feel the performance hit on those platforms as they are so efficient if everything is aligned, but when you do something unaligned it just generates more bus transactions making everything take longer. So the older arms would fault by default, the arm7 could have the fault disabled but would rotate the data around the word (nice trick for swapping 16 bit values in a word) rather than spill over into the next word, later architectures default to not fault on unaligned or most folks set them to not fault on unaligned and they read/write the unaligned transfers as one would hope/expect.
For every x86 chip you have in your computer you have several if not handfuls of non-x86 processors in that same computer or peripherals hanging off that computer (mouse, keyboard, monitor, etc). A lot of those are 8-bit 8051s and z80s, but also a lot of them are arm based. So there is lots of non-x86 development going on not just all the phones and tablets main processors. Those others desire to be low cost and low power so more efficiency in the coding both in its bus performance so the clock can be slower but also a balance of code/data usage overall to reduce the cost of the flash/ram.
It is quite difficult to force these alignment issues on an x86 platform; there is a lot of overhead to overcome its architectural issues. But you can see this on more efficient platforms. It's like a train vs a sports car: something falls off a train, a person jumps off or on, there is so much momentum it's not noticed one bit, but step change the mass on the sports car and you will feel it. So trying to do this on an x86 you are going to have to work a lot harder, if you can even figure out how to do it. But on other platforms it's easier to see the effects. Unless you find an 8086 chip, and I suspect you can feel the differences there; would have to pull out my manual to confirm.
If you are lucky enough to have access to chip sources/simulations then you can see this kind of thing happening all over the place and can really start to hand tune your program (for that platform). Likewise you can see what caching, write buffering, instruction prefetching in its various forms and so on do for overall performance and at times create parallel periods of time where other not-so-efficient transactions can hide, and or intentional spare cycles are created so that transactions that take extra time can have a time slice.

structure padding - what is the purpose of natural alignment? [duplicate]

I was learning about structure padding and data alignment. I came about this point that all the elements of the structure in the memory should be in natural alignment. so for example if I have following structure declared:
struct align{
char c;
double d;
int s;
};
If I take a 32 bit architecture, then it fetches 4 bytes at a time. So keeping this point in mind, if I start padding I will get (my assumption):
1byte(char) + 3bytes(padding) + 8bytes(double) + 4bytes(int) ---------> 1
all these shall be fetched with minimum machine cycles.
But originally the following is happening:
1byte(char) + 7bytes(padding) + 8bytes(double) + 4bytes(int) ----------> 2
Why is it that we need this natural alignment for double when we could save 4 bytes by going with method 1 (while fetching each element with the same number of machine cycles in both cases)?

Natural alignment refers to the size of the variable, not the size of the processor register and/or data path. A floating point double is 8 bytes, and so its natural alignment is 8 bytes. To be more precise, the natural alignment is the smallest power of 2 that is large enough to hold the variable. That definition covers the case of "long double" or x86 extended precision, which is a 10-byte variable whose natural alignment is 16 bytes. For x86 processors see the optimization manual and search for alignment; you will find this is a subject rich in detail, and specifics vary by micro-architecture, even within the same processor family. In particular, section 3.6.4 Alignment says
For best performance, align data as follows:
Align 8-bit data at any address.
Align 16-bit data to be contained within an aligned 4-byte word.
Align 32-bit data so that its base address is a multiple of four.
Align 64-bit data so that its base address is a multiple of eight.
Align 80-bit data so that its base address is a multiple of sixteen.
Align 128-bit data so that its base address is a multiple of sixteen.
The Pentium 4 is a 32-bit processor, part of the IA-32 family, yet it has a 64-bit data path (Front Side Bus). There are 32-bit processors that have only 16-bit buses, see 32-bit computing historical perspective. Accessing a variable at an alignment other than its natural alignment may result in a performance penalty, or an alignment fault, depending on the processor, in some cases the setting of a control bit, the type of variable, the instruction used, etc.
The actual alignment is up to the compiler and the calling conventions. For structures the requirement is that the first member variable must be at offset 0 (zero) and variables must be allocated in the order they are declared, padding may be inserted between variables for alignment and after the last variable to pad the size of the structure. In 32-bit Windows the stack is only required to be 4-byte aligned, so the compiler would have to generate extra code to ensure 8-byte alignment of a double allocated on the stack.
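Those structure rules are mechanical enough to compute by hand or in a few lines. The sketch below is a simplified model (declaration order, each member at a multiple of its alignment, total size padded to the largest member alignment); the struct member type and the layout function are made-up names, and real compilers add ABI-specific details on top.

```c
#include <assert.h>
#include <stddef.h>

struct member {
    size_t size;  /* bytes occupied by the member */
    size_t align; /* required alignment of the member */
};

/* Fill offsets[] for each member and return the padded structure size. */
static size_t layout(const struct member *m, size_t count, size_t *offsets)
{
    size_t offset = 0, max_align = 1;
    for (size_t i = 0; i < count; i++) {
        assert(m[i].align != 0);
        /* padding before the member, up to its alignment */
        offset = (offset + m[i].align - 1) / m[i].align * m[i].align;
        offsets[i] = offset;
        offset += m[i].size;
        if (m[i].align > max_align)
            max_align = m[i].align;
    }
    /* tail padding so that arrays of the structure stay aligned */
    return (offset + max_align - 1) / max_align * max_align;
}
```

For char, double, int with an 8-byte-aligned double this gives offsets 0, 8, 16 and a size of 24 (the question's method 2 plus 4 bytes of tail padding); with a double that only needs 4-byte alignment it gives 0, 4, 12 and a size of 16 (method 1).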
In Agner Fog's Calling Conventions document you will find details on the alignment used in different operating systems and by different compilers. The stack has a 4-byte alignment in 32-bit Windows, which explains why you may have observed a floating point double aligned at a 4-byte but not 8-byte boundary when allocated on the stack - the compiler doesn't have a clue when a function gets called whether the stack will be 8-byte aligned or not. In table-2 of that document it shows the alignment of various data types allocated in static storage as implemented by various compilers, you will notice that in 32-bit Windows the only compiler that allows 4-byte alignment for double is the Borland compiler.
When allocating in a structure according to that document the Borland compiler allows double to be at any byte offset (which I find surprising).
Here's the text description in the document, copied here for reference
Table 3 shows the alignment in bytes of data members of structures
and classes. The compiler will insert unused bytes, as required,
between members to obtain this alignment. The compiler will also
insert unused bytes at the end of the structure so that the total size
of the structure is a multiple of the alignment of the element that
requires the highest alignment. Many compilers have options to change
the default alignments. Differences in structure member alignment will
cause incompatibility between different programs or modules accessing
the same data and when data are stored in binary files. The programmer
can avoid such compatibility problems by ordering the structure
members so that no unused bytes need to be inserted. Likewise, the
padding at the end of the structure may be specified explicitly by
inserting dummy members of the required size. The size of the virtual
table pointer, if any, must be taken into account (see chapter 11).
5 Stack alignment
The stack pointer must be aligned by the stack word
size at all times. Some systems require a higher alignment. The Gnu
compiler version 3.x and later for 32-bit Linux and Mac OS X makes the
stack pointer aligned by 16 at every function call instruction.
Consequently it can rely on ESP = 12 modulo 16 at every function
entry. This alignment is not consistently implemented. It is
specified in the Mac OS ABI, but nowhere else. The stack is not
aligned when compiling with option -Os or
-mpreferred-stack-boundary=2, but apparently the Gnu compiler erroneously relies on the stack being aligned by 16 despite these
options. The Intel compiler (v. 9.1.038) for 32 bit Linux does not
have the same alignment. (I have submitted bug reports to Gnu and
Intel about this in 2006. In 2009 Intel added a -falign-stack=
assume-16-byte option to ICC version 11.0 to fix the problem). The
stack is aligned by 4 in 32-bit Windows. The 64 bit systems keep the
stack aligned by 16. The stack word size is 8 bytes, but the stack
must be aligned by 16 before any call instruction. Consequently, the
value of the stack pointer is always 8 modulo 16 at the entry of a
procedure. A procedure must subtract an odd multiple of 8 from the
stack pointer before any call instruction. A procedure can rely on
these rules when storing XMM data that require 16-byte alignment. This
applies to all 64 bit systems (Windows, Linux, BSD). Where at least
one function parameter of type __m256 is transferred on the stack,
Unix systems (32 and 64 bit) align the parameter by 32 and the called
function can rely on the stack being aligned by 32 before the call
(i.e. the stack pointer is 32 minus the word size modulo 32 at the
function entry). This does not apply if the parameter is transferred
in a register. Various methods for aligning the stack are described
in Intel's application note AP 589 "Software Conventions for
Streaming SIMD Extensions", "Data Alignment and Programming Issues
for the Streaming SIMD Extensions with the Intel® C/C++ Compiler", and
"IA-32 Intel ® Architecture Optimization Reference Manual".

Your comment is valid, and you'll probably get the result you are looking for if, instead of using a struct, you simply lay down the variables as part of the local stack inside a function. Something along these lines :
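#include <stdio.h>

void alignTest(void)
{
    char c;
    double d;
    int s;
    /* %p with a void* cast prints addresses portably; casting to int truncates on 64-bit targets */
    printf("%p %p %p\n", (void *)&c, (void *)&d, (void *)&s);
}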
In this example, the compiler is free to make whatever choices are best for performance and memory use. Heck, it can even reorder the variables if it wishes. With this setup, I have already seen a double placed on a 4-byte (not 8-byte) boundary by 32-bit compilers.
On the other hand, when using a struct, you need to keep in mind that it is part of an interface contract. It is not just a matter of the compiler picking whichever layout it prefers: if the struct is part of an API, it will be used by other programs, potentially built with another compiler or another version of the same compiler. This happens all the time: think of DLLs, or wrappers from other languages (calling a C function from a Delphi or Python program), and so on.
You can't have an interface element in a "random state", with a layout that depends on the compiler. In this case, the allocation rules for variables inside a struct are set in stone by the specification.
In this specification, member order is always respected, and doubles are aligned on 8 bytes.
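The fixed layout of a struct can be inspected with offsetof. The following is only a sketch: the struct name AlignTest and the helper padding_before_d are made up for illustration, and the amount of padding depends on the ABI in use.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical struct holding the same three variables as alignTest(). */
struct AlignTest {
    char c;
    double d;
    int s;
};

/* Members keep their declaration order: each offset is larger than the last. */
static_assert(offsetof(struct AlignTest, c) == 0, "first member starts the struct");
static_assert(offsetof(struct AlignTest, c) < offsetof(struct AlignTest, d), "c before d");
static_assert(offsetof(struct AlignTest, d) < offsetof(struct AlignTest, s), "d before s");

/* Bytes of padding the compiler inserted between c and d to align the double. */
static size_t padding_before_d(void)
{
    return offsetof(struct AlignTest, d) - sizeof(char);
}
```

With 8-byte alignment for double, padding_before_d() would return 7; an ABI that aligns double members to 4 bytes would give 3.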

Prefetching aligned memory

I have some threaded C code that requires 64 byte alignment of the processed data structure. How will this alignment interact with prefetch instructions like the gcc __builtin_prefetch? Will the effects of prefetching be the same as using a non-aligned array or not?
Note that I am using memalign to obtain the aligned array.
Thanks.

The answer to this one is highly implementation-dependent.
However, on x86 and x86_64, GCC implements __builtin_prefetch as a single PREFETCH assembly instruction.
According to Intel's documentation (search for "PREFETCH"):
Fetches the line of data from memory that contains the byte specified with the source
operand to a location in the cache hierarchy specified by a locality hint:
I am 99% sure the AMD version behaves the same way, but I am too busy to check...
So if the memory operand is unaligned, it will effectively be rounded down to a multiple of 64 bytes and that cache line will be prefetched. (Well, 64 bytes on all the current CPUs I know of. The instruction set reference only guarantees the line size to be "a minimum of 32 bytes". Not sure why they bothered saying that; in any situation where it makes sense to use this gadget, you are already assuming a lot about the particular CPU.)
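To make the rounding concrete, here is a small sketch. It assumes a 64-byte cache line (the CACHE_LINE constant) and uses two made-up helpers, cache_line_base and sum_with_prefetch; only __builtin_prefetch itself comes from GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; 64 bytes matches the CPUs discussed above. */
#define CACHE_LINE 64u

/* Address of the first byte of the cache line that contains p. */
static uintptr_t cache_line_base(const void *p)
{
    return (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Sum an array while prefetching one cache line ahead (hypothetical helper). */
static long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    const size_t ahead = CACHE_LINE / sizeof(int);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n) {
            /* Arguments: address, 0 = read access, 3 = keep in all cache levels.
               Issued per element for simplicity; one per line would be enough. */
            __builtin_prefetch(&data[i + ahead], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

Whether data comes from memalign or is unaligned, the prefetch targets the line returned by cache_line_base for that address; alignment only changes which elements share a line.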

Alignment restrictions for malloc()/free()

Older K&R (2nd ed.) and other C-language texts I have read that discuss the implementation of a dynamic memory allocator in the style of malloc() and free() usually also mention, in passing, something about data type alignment restrictions. Apparently certain computer hardware architectures (CPU, registers, and memory access) restrict how you can store and address certain value types. For example, there may be a requirement that a 4 byte (long) integer must be stored beginning at addresses that are multiples of four.
What restrictions, if any, do major platforms (Intel & AMD, SPARC, Alpha) impose for memory allocation and memory access, or can I safely ignore aligning memory allocations on specific address boundaries?

Sparc, MIPS, Alpha, and most other "classical RISC" architectures only allow aligned accesses to memory, even today. An unaligned access will cause an exception, but some operating systems will handle the exception by copying from the desired address in software using smaller loads and stores. The application code won't know there was a problem, except that the performance will be very bad.
MIPS has special instructions (lwl and lwr) which can be used to access 32 bit quantities from unaligned addresses. Whenever the compiler can tell that the address is likely unaligned it will use this two instruction sequence instead of a normal lw instruction.
x86 can handle unaligned memory accesses in hardware without an exception, but there is still a performance hit of up to 3X compared to aligned accesses.
Ulrich Drepper wrote a comprehensive paper on this and other memory-related topics, What Every Programmer Should Know About Memory. It is a very long writeup, but filled with chewy goodness.
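A common portable way to read from a possibly unaligned address is to copy the bytes with memcpy and let the compiler choose the instructions. This is a sketch of that idiom, not something taken from the paper; load_u32 is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address that may not be 4-byte aligned.
   memcpy has no alignment requirement, so this is safe on strict-alignment CPUs. */
static uint32_t load_u32(const void *p)
{
    assert(p != NULL);
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On x86 a compiler will typically turn this into a single ordinary load, while on strict-alignment targets it can emit whatever unaligned sequence the architecture offers.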

Alignment is still quite important today. Some processors (the 68k family jumps to mind) would throw an exception if you tried to access a word value on an odd boundary. Today, most processors will run two memory cycles to fetch an unaligned word, but this will definitely be slower than an aligned fetch. Some other processors won't even throw an exception, but will fetch an incorrect value from memory!
If for no other reason than performance, it is wise to try to follow your processor's alignment preferences. Usually, your compiler will take care of all the details, but if you're doing anything where you lay out the memory structure yourself, then it's worth considering.
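If you do lay out memory yourself, the usual building blocks are rounding a size up to the next boundary and checking an address against it. A minimal sketch, assuming power-of-two alignments; align_up and is_aligned are illustrative names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of alignment (must be a power of two). */
static size_t align_up(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Non-zero if p sits on a multiple of alignment (must be a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

For example, align_up(5, 4) gives 8 and align_up(13, 16) gives 16.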

You still need to be aware of alignment issues when laying out a class or struct in C(++). In these cases the compiler will do the right thing for you, but the overall size of the struct/class may be more wasteful than necessary.
For example:
struct
{
char A;
int B;
char C;
int D;
};
Would have a size of 4 * 4 = 16 bytes (assume Windows on x86) whereas
struct
{
char A;
char C;
int B;
int D;
};
Would have a size of 4*3 = 12 bytes.
This is because the compiler enforces 4-byte alignment for ints, but only 1-byte alignment for chars.
In general, pack member variables of the same size (type) together to minimize wasted space.
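The two layouts can be compared directly with sizeof. This sketch names the structs Interleaved and Grouped for illustration and assumes a 4-byte int with 4-byte alignment, as on x86.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* size_t */

/* char/int alternate, so each char is followed by 3 padding bytes. */
struct Interleaved {
    char A;
    int  B;
    char C;
    int  D;
};

/* Both chars share one 4-byte slot; only 2 padding bytes remain. */
struct Grouped {
    char A;
    char C;
    int  B;
    int  D;
};

/* Grouping members by size never makes this struct larger. */
static_assert(sizeof(struct Grouped) <= sizeof(struct Interleaved),
              "grouped layout should not be larger");

/* Bytes saved by reordering the members. */
static size_t bytes_saved(void)
{
    return sizeof(struct Interleaved) - sizeof(struct Grouped);
}
```

Under the stated assumption, bytes_saved() returns 4 (16 bytes versus 12).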

As Greg mentioned, it is still important today (perhaps more so in some ways), and compilers usually take care of alignment based on the target architecture. In managed environments, the JIT compiler can optimize the alignment based on the runtime architecture.
You may see pragma directives (in C/C++) that change the alignment. This should only be used when very specific alignment is required.
// For example, this changes the pack to 2 byte alignment.
#pragma pack(2)
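The effect of the pragma can be seen by declaring the same members with and without it. A sketch, assuming a 4-byte int and a compiler that supports the push/pop form of the pragma (GCC, Clang and VC++ do); the struct names are made up.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof */

/* Default layout: value is placed on its natural (typically 4-byte) boundary. */
struct DefaultLayout {
    char tag;
    int  value;
};

/* push saves the current packing, pop restores it, so only Packed2 is affected. */
#pragma pack(push, 2)
struct Packed2 {
    char tag;
    int  value;  /* now only needs a 2-byte boundary */
};
#pragma pack(pop)

/* Packing can only move members closer together, never further apart. */
static_assert(offsetof(struct Packed2, value) <= offsetof(struct DefaultLayout, value),
              "pack(2) must not increase the offset");
```

With these assumptions value sits at offset 2 in Packed2 and the struct shrinks to 6 bytes, at the cost of misaligned access to value.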

Note that even on IA-32 and AMD64, some of the SSE instructions/intrinsics require aligned data. These instructions will throw an exception if the data is unaligned, so at least you won't have to debug "wrong data" bugs. There are equivalent unaligned instructions as well, but like Denton says, they are slower.
If you're using VC++, then besides the #pragma pack directives, you also have the __declspec(align) directives for precise alignment. The VC++ documentation also mentions an _aligned_malloc function for specific alignment requirements.
As a rule of thumb, unless you are moving data across compilers/languages or are using the SSE instructions, you can probably ignore alignment issues.
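Outside VC++, C11 offers aligned_alloc for the same purpose. The following is a sketch only: alloc_floats_16 is a made-up helper, it does not check for size overflow, and it assumes a C11 library that provides aligned_alloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for count floats on a 16-byte boundary, as SSE loads expect.
   Returns NULL on failure; release the memory with free(). */
static float *alloc_floats_16(size_t count)
{
    size_t bytes = count * sizeof(float);        /* no overflow check: sketch only */
    size_t rounded = (bytes + 15) & ~(size_t)15; /* C11 wants a multiple of the alignment */
    if (rounded == 0) {
        rounded = 16;
    }
    float *p = aligned_alloc(16, rounded);
    assert(p == NULL || ((uintptr_t)p % 16) == 0);
    return p;
}
```

A caller would use the returned pointer like any malloc result and pass it to free when done.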
