Structure padding - c

I was trying to understand why structure padding is the reason structures cannot be compared with memcmp.
One small thing I don't understand about structure padding is this:
Why should "a short be 2-byte aligned" or "a long be 4-byte aligned"? I understand that it relates to their sizes, but why can they not appear at any byte boundary?
Or, in other words, why is 0x10004566 not a valid location for a long variable when 0x10004568 is?

Because some platforms (i.e. CPUs) physically don't support "mis-aligned" memory accesses. Other platforms support them, but in a much slower fashion.
The padding you get in a struct is dependent on the choices your compiler makes, but it will be making those choices in order to satisfy the specific requirements of the CPU the code is targeted at.
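As a rough sketch of how padding interferes with memcmp (the struct and function names are hypothetical, and the amount of padding depends on the compiler and target): the bytes between members belong to no member, so a byte-wise comparison can see differences that a member-wise comparison does not.

```c
#include <stddef.h>

/* Hypothetical example type: a 1-byte member followed by a wider one. */
struct sample {
    char tag;    /* 1 byte */
    long value;  /* usually must start at a multiple of its alignment */
};

/* Number of bytes in the struct that belong to no member (padding).
   The result is implementation-defined; 3 or 7 is typical. */
size_t sample_padding(void)
{
    return sizeof(struct sample) - sizeof(char) - sizeof(long);
}

/* Member-wise comparison: ignores whatever the padding bytes contain. */
int sample_equal(const struct sample *a, const struct sample *b)
{
    return a->tag == b->tag && a->value == b->value;
}
```

Two objects for which sample_equal returns 1 may still compare unequal with memcmp(&a, &b, sizeof a), because the padding bytes hold unspecified values.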

Memory alignment is a very important issue when optimizing a program for speed. C, being a language that - generally - puts strong emphasis on speed, likes to enforce some rules which may make the program faster.
The limitation of aligned and unaligned memory accesses comes directly from the hardware used for fetching the data from the memory, which usually fetches it in chunks which are equal to the machine word in size. Say you want to access a doubleword (4 bytes) stored at location 101. This means that the memory controller would firstly have to (probably) issue a read of a doubleword at location 100, then another read of a doubleword at location 104, and then splice the individual bytes from locations 101, 102, 103, and 104 together. The whole operation takes (hypothetically) two clock cycles.
If you want to access a doubleword at location 100, there's no such issue, which should be illustrated clearly enough by the example I provided.
In fact, misaligned data access is such a big issue that SSE instructions (the "aligned" versions, there are also "misaligned" versions which don't do that) will cause a general protection fault if you try to access misaligned data with those.
As a rule of thumb, it never hurts to align 4-byte data on a 4-byte boundary, 8-byte data on an 8-byte boundary, and so forth.
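The arithmetic behind the example can be sketched in a few lines. The function names are made up for illustration, and the model assumes memory is fetched only in aligned chunks of one word, which is a simplification of real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if addr is a multiple of alignment (alignment must be a power of two). */
int address_is_aligned(uintptr_t addr, size_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}

/* How many word-sized chunks an access of `size` bytes at `addr` touches,
   assuming memory is fetched in aligned chunks of `word` bytes. */
size_t words_touched(uintptr_t addr, size_t size, size_t word)
{
    return (addr + size - 1) / word - addr / word + 1;
}
```

With a 4-byte word, words_touched(101, 4, 4) is 2 and words_touched(100, 4, 4) is 1. Likewise address_is_aligned(0x10004566, 4) is 0, while address_is_aligned(0x10004568, 4) is nonzero.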

The only additional example I can think of with respect to alignment is data transfer. Depending on the architecture, data is transferred in blocks of, say, 32 bytes. If your data crosses a block boundary, it could require two transfers to receive it rather than one.

Related

what is aligned attribute and what are the uses of it

I have following lines in the code
# define __align_(x) __attribute__((aligned(x)))
I can use it as int i __align_; what difference does it make?
If I use the aligned attribute as above rather than just declaring my variable as int i;, does it change how the variable is created in memory?
I can use it as int i __align_; what difference does it make?
This will not work because the macro is defined to have a parameter, __align_(x). When it is used without a parameter, it will not be replaced, and the compiler will report a syntax error. Also, identifiers starting with __ are reserved for the C implementation (for the use of the compiler, the standard library, and any other parts forming the C implementation), so a regular program should not use such a name.
When you use the macro correctly, it changes the normal alignment requirement for the type.
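A minimal sketch of correct usage, assuming GCC or Clang (the attribute is a compiler extension). The macro name ALIGNED and the variable and function names are hypothetical, chosen to avoid the reserved double-underscore prefix.

```c
#include <stdint.h>

/* GCC/Clang-specific; the name avoids the reserved leading double underscore. */
#define ALIGNED(x) __attribute__((aligned(x)))

int plain_counter;                 /* default alignment for int */
int aligned_counter ALIGNED(16);   /* address is a multiple of 16 */

/* Nonzero if the pointer value is a multiple of n. */
int is_multiple_of(const void *p, uintptr_t n)
{
    return (uintptr_t)p % n == 0;
}
```

Here is_multiple_of(&aligned_counter, 16) holds, whereas plain_counter is only guaranteed to meet the normal alignment of int.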
Generally, objects of various types have alignment requirements: They should be located in memory at addresses that are multiples of their requirement. The reasons for this are because computer hardware is usually designed to work with groups of bytes, so it may fetch data from memory in groups of, for example, four bytes: Bytes from 0 to 3, bytes from 4 to 7, bytes from 8 to 11, and so on.
If a four-byte object with four-byte alignment requirement is located at a multiple of four bytes, then it can be read from memory easily, by loading the group of bytes it is in. It can also be written to memory easily.
If the object were not at a multiple of four bytes, it cannot be loaded as one group of bytes. It can be loaded by loading the two groups of bytes it straddles, extracting the desired bytes, and combining the desired bytes in one processor register. However, that takes more work, so we want to avoid it. The compiler is written to automatically align things as desired for the C implementation, and it writes load and store instructions that expect the desired alignment.1
Different object types can have different alignment requirements even though they are bound by the same hardware behavior. For example, with a two-byte short, the alignment requirement may be two bytes. This is because, whether it starts at byte 0 or byte 2 within a group (say at address 100, 102, 104, or 106), we can load the short by loading a single group of four bytes and taking just the two bytes we want. However, if it started at byte 3 (say at address 103), we would have to load two groups of bytes (100 to 103 and 104 to 107) to get the bytes we needed for the short (103 and 104). So two-byte alignment suffices for this short even though the hardware is designed with four-byte groups.
As mentioned, the compiler handles alignment automatically. When you define a structure with multiple members of different types, the compiler inserts padding so that each member is aligned correctly, and it inserts padding at the end of the structure so that an array of them keeps the alignment from element to element in the array.
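To see the padding the compiler inserts, offsetof and sizeof can be inspected. This is only a sketch with a made-up struct; the actual numbers depend on the target.

```c
#include <stddef.h>

struct record {
    char  kind;   /* offset 0 */
    int   count;  /* padded so it starts at a multiple of _Alignof(int) */
    char  flag;   /* followed by trailing padding */
};

/* Offset of each member, as chosen by the compiler. */
size_t record_count_offset(void) { return offsetof(struct record, count); }
size_t record_flag_offset(void)  { return offsetof(struct record, flag); }

/* Distance in bytes between consecutive array elements;
   equals sizeof(struct record), trailing padding included. */
size_t record_stride(void)
{
    static struct record items[2];
    return (size_t)((char *)&items[1] - (char *)&items[0]);
}
```

On a typical target with a 4-byte int, this commonly gives offsets 0, 4, and 8 and a size of 12, even though the members themselves occupy only 6 bytes.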
There are times when we want to override the compiler’s automatic behavior. When we are preparing to send data over a network connection, the communication protocol might require the different fields of a message to be packed together in consecutive bytes, with no padding. In this case, we can define a structure with an alignment requirement of 1 byte for it and all its members. When we are ready to send a message, we could copy data into this structure’s members and then write the structure to the network device.
When you tell the compiler an object is not aligned normally, the compiler will generate instructions for that. Instead of the normal load or store instructions, it will use special unaligned load or store instructions if the computer architecture has them. If it does not, the compiler will use instructions to shift and store individual bytes or to shift and merge bytes and store them as aligned words, depending on what instructions are available in the computer architecture. This is generally inefficient; it will slow down your program. So it should not be used in normal programming. Decreasing the alignment requirements should be used only when there is a need for controlling the layout of data in memory.
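As an illustration of the network case, here is a sketch of a packed message header, assuming GCC or Clang (packed is a compiler extension). The field layout and names are invented for the example, and byte-order conversion, which a real protocol would also need, is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format: 1-byte type, 4-byte length, 2-byte checksum,
   with no gaps. __attribute__((packed)) is a GCC/Clang extension. */
struct __attribute__((packed)) wire_header {
    uint8_t  type;
    uint32_t length;
    uint16_t checksum;
};

/* Copy the packed header into an output buffer; returns bytes written. */
size_t write_header(unsigned char *out, const struct wire_header *h)
{
    memcpy(out, h, sizeof *h);
    return sizeof *h;
}
```

With the attribute, sizeof(struct wire_header) is 7 and length sits at offset 1. Without it, a typical compiler would place length at offset 4.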
Sometimes increasing the alignment requirements is used for performance. For example, an array of four-byte float elements generally only needs four-byte alignment. However, some computers have special instructions to process four float elements (16 bytes) at a time, and they benefit from having that data aligned to a multiple of 16 bytes. (And some computers have instructions for even more data at one time.) In this case, we might increase the alignment requirement for our float array (but not its individual elements) so that it is aligned suitably for these instructions.
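In C11 this can be written with alignas from <stdalign.h>. The sketch below uses made-up names, and the 16-byte figure is just the example value from the text.

```c
#include <stdalign.h>
#include <stdint.h>

/* The array as a whole starts on a 16-byte boundary;
   its elements are still only 4 bytes apart. */
static alignas(16) float samples[8];

/* Nonzero if the array start is a multiple of 16. */
int samples_are_16_byte_aligned(void)
{
    return (uintptr_t)samples % 16 == 0;
}

/* Byte distance between neighbouring elements (still sizeof(float)). */
unsigned element_spacing(void)
{
    return (unsigned)((char *)&samples[1] - (char *)&samples[0]);
}
```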
Footnote
1 What happens if you force an object to be located at an undesired alignment without telling the compiler varies. In some computers, when a load instruction is executed with an unaligned address, the processor will “trap,” meaning it stops normal program execution and transfers control to the operating system, reporting an error in your program. In some computers, the processor will ignore the low bits of the address and load the wrong data. In some computers, the processor will load the two groups of bytes, extract the desired bytes, and merge them. On computers that trap, the operating system might do the manual fix-up of loading the bytes, or it might terminate your program or report the error to your program.
The attribute tells the compiler that the variable in question must be placed in memory at an address that is aligned to a certain number of bytes (addr % alignment == 0).
This is important because the CPU can only work on some integer values if they are aligned - such as int32 must be 4 bytes aligned and int64 must be 8 bytes aligned, pointers need to be 4/8 (32/64 bit cpu) aligned too.
The attribute is mostly used for structures, where certain fields within the structure must be memory aligned in order to allow the CPU to do integer operations on them (like mov.l) without hitting a BUS ERROR from the memory controller.
If structures aren't properly aligned, the compiler will have to add extra instructions to first load the unaligned value into a register with several memory operations which is more expensive in performance.
It can also be used to bump performance in more performance sensitive systems by creating buffers that are page aligned (4k usually) so that paging will have less of an impact, or if you want to create DMA-able buffer zones - but that's a bit more advanced...
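For the page-aligned buffer case, one possible sketch uses C11 aligned_alloc. The 4096-byte page size is an assumption (real code should ask the operating system), the names are hypothetical, and aligned_alloc is not available on every platform.

```c
#include <stdint.h>
#include <stdlib.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption; query the OS in real code */

/* Allocate `pages` whole pages starting on a page boundary.
   Returns NULL on failure; release the memory with free(). */
void *alloc_pages(size_t pages)
{
    /* aligned_alloc requires the size to be a multiple of the alignment. */
    return aligned_alloc(ASSUMED_PAGE_SIZE, pages * ASSUMED_PAGE_SIZE);
}
```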

Why "any primitive object of K bytes must have an address that is a multiple of K"?

Computer Systems: a Programmer's Perspective says
The x86-64 hardware will work correctly regardless of the alignment of
data. However, Intel recommends that data be aligned to improve memory
system performance. Their alignment rule is based on the principle
that any primitive object of K bytes must have an address that is a
multiple of K. We can see that this rule leads to the following
alignments:
K Types
1 char
2 short
4 int, float
8 long, double, char *
Why is it that "any primitive object of K bytes must have an address that is a multiple of K"?
How is "aligned" defined or what does it mean?
On a x86-64 machine,
if an object has K bytes (such as K=2 (e.g. short) or K=4 (e.g. int, or float)), "any primitive object of K bytes must have an address that is a multiple of K" means that such an object must have an address that is a multiple of K. But isn't the object aligned, as long as its storage space falls completely between two addresses which are two consecutive multiples of 8, which is a less strict requirement than that the object must have an address that is a multiple of K?
If the K of an object is smaller than 8 but not equal to 1, 2 or 4, does "any primitive object of K bytes must have an address that is a multiple of K" still apply? For example if K=3,5,6, or 7?
On a X86 machine, which has 32-bit addresses,
what is the alignment rule, and Does "any primitive object of K bytes must have an address that is a multiple of K" still apply?
Thanks.
Since this was tagged C as well: note that not only the architecture makes these decisions, but compilers do too. The C compiler often has its own alignment rules that mostly follow either the required or the preferred alignment of the architecture, especially when optimizing for speed. And the compiler's requirements are what you need to worry about most of the time, not the architecture's.
Even if the processor supports unaligned accesses, it might have a preferred alignment for multibyte objects that the C compiler can exploit. For example, where int has 4-byte alignment, a compiler is allowed to assume that any int will reside at, and therefore any int * pointer will always point to, an address divisible by 4.
Now there are people who say that since x86-64 supports unaligned access, they can make an int * pointer that points to an address not divisible by 4 and things will work fine.
They're wrong.
There are some instructions in the x86-64 instruction set that require alignment. I.e. the "will work correctly regardless of alignment" means that these instructions too work "correctly, according to the specification, when given an unaligned access" - they raise an exception that would kill your process. The reason for having these is that they can be so much faster and require less silicon to implement than the versions that can deal with unaligned data.
And the compiler knows exactly when it is allowed to use these instructions! Whenever it sees an int * being dereferenced it knows that it can use an instruction that requires the operand be aligned at 4 bytes, should it be more effective.
See this question for a case where the OP ran into a problem with C code that "should have been fine on x86-64 anyway": C undefined behavior. Strict aliasing rule, or incorrect alignment?
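If you really need to read a value from an address that may be misaligned, the usual way to stay within the rules is to copy the bytes instead of dereferencing a misaligned pointer. A minimal sketch (the function name is hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address that may not be suitably aligned.
   memcpy has no alignment requirement, so this avoids dereferencing a
   misaligned uint32_t pointer (which is undefined behavior in C). */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Optimizing compilers commonly reduce this to a single load on targets that allow unaligned access, so it need not be slower than the pointer cast.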
As for x86-32, the alignment requirement for doubles is generally 4 in C compilers because doubles need to be passed on stack and stack grows in 4 not 8 byte increments.
And finally:
If the K of an object is smaller than 8 but not equal to 1, 2 or 4, does "any primitive object of K bytes must have an address that is a multiple of K" still apply? For example if K=3,5,6, or 7?
There are no primitive objects with K in {3, 5, 6, 7} in x86.
The C standard's stance is that an alignment can only be a power of 2, and there are no gaps in arrays. Therefore an object with such a size would need to be padded upwards to its alignment requirement, or its alignment requirement must be 1.
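Both outcomes can be checked with C11 _Alignof. The two struct types below are invented for illustration, and the sizes mentioned in the comments are what typical compilers produce, not guarantees.

```c
#include <stddef.h>

/* Nonzero if n is a power of two. */
int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical 3-byte type: its alignment is typically 1,
   so no padding is needed. */
struct three_bytes { unsigned char b[3]; };

/* Hypothetical type with 6 bytes of members: because of the int member,
   its size is typically padded up to 8 so array elements stay aligned. */
struct int_plus_short { int i; short s; };

/* Nonzero if the size of T is a whole multiple of its alignment. */
#define SIZE_IS_MULTIPLE_OF_ALIGNMENT(T) (sizeof(T) % _Alignof(T) == 0)
```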
The rules are different on each processor model. I will discuss one hypothetical example. We may have a processor with an eight-byte interface to the bus. Given some address X, the processor can load eight bytes from that address by requesting the memory to deliver eight bytes from its unit of storage numbered X/8. That is, the memory does not have any way to address individual bytes. The processor can only request data at a certain address that is a multiple of eight, and the memory will send the entire eight bytes at that address. (Keep in mind this is a hypothetical example to illustrate basic principles. Also, I am ignoring cache. Cache helps mask some of the effects of alignment issues, because the misalignments can be largely managed in level-one cache inside the processor. But handling this still requires extra hardware, as discussed below.)
Suppose we want the four-byte object that is in bytes 7, 8, 9, and 10. To get this, the processor has to request unit 0 from memory, which supplies bytes 0 through 7, and it has to request unit 1, which supplies bytes 8 through 15. So, already, there is a performance problem: We had to use two bus transfers to get this word that is only half the size of one transfer. That is inefficient, and the bus can only do half as many of these double transfers as it can if we loaded only aligned data requiring single transfers.
Continuing, the processor has all the bytes it needs, 0 through 15, so it extracts bytes 7 through 10, which make up the object we want. To do this, though, it has to shift the bytes to put them into a register. Ideally, if nobody did any “unaligned” loads, four-byte objects would come in from the bus only at offsets 0 and 4 in the eight-byte transfers, and the processor only needs to have wires going from those offsets to the register destinations.
However, our processor supports unaligned loads, so it has additional switches and wires so the data can be shunted down a different path, where it will be shifted by three bytes. Keep in mind, the data from both transfers has to be shifted by three bytes and then spliced together. So a lot of extra wires and switches are needed. Two eight-byte transfers is 128 bits, so there are hundreds of extra connections involved in this.
Well, fine, the processor has these wires and switches, why not use them? To make this processor fast, it supports multiple loads and stores in progress simultaneously. As soon as the bus transfers one piece of data, we want to be getting another from the bus, while the data from the first is still on its way to a register. So there are actually multiple parts of the processor moving data around for several loads. Since we expect unaligned loads to be rare, maybe only one of the parts for handling loads has the extra components to handle unaligned loads. The others all handle aligned loads. So, if you have just one unaligned load occasionally, the processor sends it to that part, and the performance effect is unnoticeable. However, if you do many unaligned loads in a row, they all have to go through the one part, so they end up waiting in a queue instead of running in parallel, and performance decreases.
That is just for loads. When you store that four-byte object, there is no way to write just bytes 7 through 10. Since the bus and the memory only work in eight-byte units, we need to write units 0 and 1, which also contain bytes 0 through 6 and bytes 11 to 15. To implement the store, the processor must:
Load memory unit 0, providing bytes 0 through 7.
Load memory unit 1, providing bytes 8 through 15.
Move the first byte of the four-byte object into byte 7.
Move the last three bytes of the object into bytes 8 through 10.
Store the changed memory unit 0.
Store the changed memory unit 1.
Again, that is twice as much work as it would be with an aligned object (load one memory unit, move the bytes in, store the unit). And, besides the time of the operations, you are occupying more resources inside the processor—it has to use two internal registers to hold the data from memory temporarily while it is merging the changes.
Actually, it is more than twice the work and resources, because it also requires extra wires and switches to shift the bytes by non-standard amounts.
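The store sequence for this hypothetical machine can be simulated in software. Everything here (the 8-byte unit, the tiny memory array, the function names) is an assumption for illustration, and the loop handles one unit at a time rather than loading both units first.

```c
#include <stdint.h>
#include <string.h>

#define UNIT 8  /* bytes per bus transfer in the hypothetical machine */

/* Hypothetical memory that can only be accessed one whole unit at a time. */
static uint8_t memory[4][UNIT];
static unsigned transfers;  /* running count of bus transfers */

static void load_unit(unsigned n, uint8_t out[UNIT])
{
    memcpy(out, memory[n], UNIT);
    transfers++;
}

static void store_unit(unsigned n, const uint8_t in[UNIT])
{
    memcpy(memory[n], in, UNIT);
    transfers++;
}

/* Store `len` bytes at byte address `addr` using only whole-unit transfers
   (read, modify, write back). The range must lie within `memory`.
   Returns the number of bus transfers used. */
unsigned store_bytes(unsigned addr, const uint8_t *src, unsigned len)
{
    unsigned first = addr / UNIT, last = (addr + len - 1) / UNIT;
    unsigned before = transfers;
    for (unsigned u = first; u <= last; u++) {
        uint8_t tmp[UNIT];
        load_unit(u, tmp);
        for (unsigned i = 0; i < UNIT; i++) {
            unsigned byte_addr = u * UNIT + i;
            if (byte_addr >= addr && byte_addr < addr + len)
                tmp[i] = src[byte_addr - addr];
        }
        store_unit(u, tmp);
    }
    return transfers - before;
}
```

Storing a four-byte object at address 7 costs four transfers in this model, while storing it at address 8 costs two.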
The processor bus, which is the medium used to access memory, is normally as wide as the processor's word size in bits. This means a 32-bit processor normally accesses memory in 32-bit chunks, so only one memory read is necessary to read a 32-bit value from memory.
Addresses, by contrast, are byte oriented, so a double (8 bytes) normally occupies eight contiguous memory addresses. To access a single eight-byte datum with only one bus request, the data must begin at the start of an eight-byte word and end before the next one begins. For old processors this was imperative: if you requested a memory access that was not aligned, an exception was raised. Current processors don't have this restriction, but beware that if you have, for example, a double at an address that is not a multiple of eight, the processor will need to make two bus accesses (with the overhead that this implies) to get the data from memory.
For this reason the processor vendor warns you about the alignment of data: if all the data is unaligned, the time required to execute some piece of code can double or worse compared with properly aligned data.
Modern processors have several levels of cache, which are filled from main memory in chunks of one cache line (64 bytes or even more), so this is largely not an issue. Still, it is a good idea to keep data aligned in case you need to run your code on a less advanced processor.

Is 8-byte alignment for "double" type necessary?

I understand word-alignment, which makes the cpu only need to read once when reading an integer into a register.
But is 8-byte alignment (let's assume 32bit system) for "double" necessary? What is the benefit? What will happen if the space for storing a "double" is just 4-byte alignment?
There are multiple hardware components that may be adversely affected by unaligned loads or stores.
The interface to memory might be eight bytes wide and only able to access memory at multiples of eight bytes. Loading an unaligned eight-byte double then requires two reads on the bus. Stores are worse, because an aligned eight-byte store can simply write eight bytes to memory, but an unaligned eight-byte store must read two eight-byte pieces, merge the new data with the old data, and write two eight-byte pieces.
Cache lines are typically 32 or 64 bytes. If eight-byte objects are aligned to multiples of eight bytes, then each object is in just one cache line. If they are unaligned, then some of the objects are partly in one cache line and partly in another. Loading or storing these objects then requires using two cache lines instead of one. This effect occurs at all levels of cache (three levels is not uncommon in modern processors).
Memory system pages are typically 512 bytes or more. Again, each aligned object is in just one page, but some unaligned objects are in multiple pages. Each page that is accessed requires hardware resources: The virtual address must be translated to a physical address, this may require accessing translation tables, and address collisions must be detected. (Processors may have multiple load and store operations in operation simultaneously. Even though your program may appear to be single-threaded, the processor reads instructions in advance and tries to execute those that it can. So a processor may start a load instruction before preceding instructions have completed. However, to be sure this does not cause an error, the processor checks each load instruction to be sure it is not loading from an address that a prior store instruction is changing. If an access crosses a page boundary, the two parts of the loaded data have to be checked separately.)
The response of the system to unaligned operations varies from system to system. Some systems are designed to support only aligned accesses. In these cases, unaligned accesses either cause exceptions that lead to program termination or exceptions that cause execution of special handlers that emulate unaligned operations in software (by performing aligned operations and merging the data as necessary). Software handlers such as these are much slower than hardware operations.
Some systems support unaligned accesses, but this usually consumes more hardware resources than aligned accesses. In the best case, the hardware performs two operations instead of one. But some hardware is designed to start operations as if they were aligned and then, upon discovering the operation is not aligned, to abort it and start over using different paths in the hardware to handle the unaligned operation. In such systems, unaligned accesses have a significant performance penalty, although it is not as great as in systems where software handles unaligned accesses.
In some systems, the hardware may have multiple load-store execution units that can perform the two operations required of unaligned accesses just as quickly as one unit can perform the operation of aligned accesses. So there is no direct performance degradation of unaligned accesses. However, because multiple execution units are kept busy by unaligned accesses, they are unavailable to perform other operations. Thus, programs that perform many load-store operations, normally in parallel, will execute more slowly with unaligned accesses than with aligned accesses.
On many architectures, unaligned access of any load/store unit (short, int, long) is simply an exception. Compilers are responsible for ensuring it doesn't happen on potentially mis-aligned data, by emitting smaller access instructions and re-assembling in registers if they can't prove a given pointer is OK.
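The "smaller accesses re-assembled in registers" approach looks roughly like this when written by hand. This is a sketch only: it assumes little-endian byte order, and the function name is made up.

```c
#include <stdint.h>

/* Assemble a 32-bit little-endian value from four single-byte loads.
   Byte loads have no alignment requirement, so p may be any address. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```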
Performance-wise, 8-byte alignment of doubles on 32-bit systems can be valuable for a few reasons. The most apparent is that 4-byte alignment of an 8-byte double means that one element could cross the boundary of two cache lines. Memory access occurs in units of whole cache lines, and so misalignment doubles the cost of access.
I seem to remember that the recommendation for the 486 was to align doubles on 32-bit boundaries, so requiring 64-bit alignment is not mandatory.
You seem to think that there is a relationship between the data bus width and the processor bitness. While it is often the case, you can find variation in both direction. For instance the Pentium was a 32-bit processor, but its data bus size was 64 bits.
Caches offer something else which may explain the usefulness of 64-bit alignment for 64-bit types. Here the external bus is not a factor; it is the cache line size that matters. Data crossing a cache line boundary is costlier to access than data not crossing it (even if it is unaligned in both cases). Aligning types on their size ensures that they won't cross cache lines as long as the cache line size is a multiple of the type size.
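The last claim is easy to check by brute force. In this sketch the function names are hypothetical and the 64-byte line size used in the note is just a common value.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the range [addr, addr + size) crosses a multiple of `line`. */
int crosses_line(uintptr_t addr, size_t size, size_t line)
{
    return addr / line != (addr + size - 1) / line;
}

/* Count the start offsets within one line, stepping by `align`, at which
   an object of `size` bytes would spill into the next line. */
unsigned crossing_offsets(size_t size, size_t align, size_t line)
{
    unsigned count = 0;
    for (size_t off = 0; off < line; off += align)
        count += crosses_line(off, size, line);
    return count;
}
```

For an 8-byte double and a 64-byte line, crossing_offsets(8, 8, 64) is 0, while crossing_offsets(8, 4, 64) is 1: with only 4-byte alignment, the double that starts at offset 60 straddles two lines.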
I just found the answer:
"6. When memory reading is efficient in reading 4 bytes at a time on 32 bit machine, why should a double type be aligned on 8 byte boundary?
It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU). Any floating point operation in the code will be translated into FPU instructions. The main processor is nothing to do with floating point execution. All this will be done behind the scenes.
As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.
The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary. I am assuming (I don’t have concrete information) in case of FPU operations, data fetch might be different, I mean the data bus, since it goes to FPU. Hence, the address decoding will be different for double types (which is expected to be on 8 byte boundary). It means, the address decoding circuits of floating point unit will not have last 3 pins."
Edited:
The advantage of alignment is that it reduces the number of memory cycles needed to retrieve the data. For example, an 8-byte value that takes a single cycle when aligned might take 2 cycles when misaligned, since one part of it is obtained in the first memory cycle and the rest in the next.
I came across this:
"Aligned access is faster because the external bus to memory is not a single byte wide - it is typically 4 or 8 bytes wide (or even wider). So the CPU doesn't fetch a single byte at a time - it fetches 4 or 8 bytes starting at the requested address. Therefore, the 2 or 3 least significant bits of the memory address are not actually sent by the CPU - the external memory can only be read or written at addresses that are a multiple of the bus width. If you requested a byte at address "9", the CPU would actually ask the memory for the block of bytes beginning at address 8, and load the second one into your register (discarding the others).
This implies that a misaligned access can require two reads from memory: If you ask for 8 bytes beginning at address 9, the CPU must fetch the 8 bytes beginning at address 8 as well as the 8 bytes beginning at address 16, then mask out the bytes you wanted. On the other hand, if you ask for the 8 bytes beginning at address 8, then only a single fetch is needed. Some CPUs will not even perform such a misaligned load - they will simply raise an exception (or even silently load the wrong data!)."
You might see this link for more details.
http://www.ibm.com/developerworks/library/pa-dalign/
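The "low address bits are not sent" idea from the quote amounts to masking the address. A small sketch, with a hypothetical function name and the bus width given in bytes:

```c
#include <stdint.h>

/* Address actually placed on the bus: the low bits are dropped.
   `width` is the bus width in bytes and must be a power of two. */
uintptr_t bus_block_start(uintptr_t addr, uintptr_t width)
{
    return addr & ~(width - 1);
}
```

For an 8-byte bus, bus_block_start(9, 8) is 8. An 8-byte read starting at address 9 ends at byte 16, and bus_block_start(16, 8) is 16, so two different blocks are involved.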

CPU and Data alignment

Pardon me if you feel this has been answered numerous times, but I need answers to the following queries!
Why does data have to be aligned (on 2-byte / 4-byte / 8-byte boundaries)? My doubt is that when the CPU has address lines Ax Ax-1 Ax-2 ... A2 A1 A0, it is quite possible to address memory locations sequentially. So why is there a need to align data at specific boundaries?
How do I find the alignment requirements when I am compiling my code and generating the executable?
If, for example, data is aligned on a 4-byte boundary, does that mean each consecutive byte is located at a modulo-4 offset? That is, if data is 4-byte aligned and a byte is at 1004, is the next byte at 1008 (or at 1005)?
CPUs are word oriented, not byte oriented. In a simple CPU, memory is generally configured to return one word (32bits, 64bits, etc) per address strobe, where the bottom two (or more) address lines are generally don't-care bits.
Intel CPUs can perform accesses on non-word boundaries for many instructions, however there is a performance penalty as internally the CPU performs two memory accesses and a math operation to load one word. If you are doing byte reads, no alignment applies.
Some CPUs (ARM, or Intel SSE instructions) require aligned memory and have undefined operation when doing unaligned accesses (or throw an exception). They save significant silicon space by not implementing the much more complicated load/store subsystem.
Alignment depends on the CPU word size (16, 32, 64bit) or in the case of SSE the SSE register size (128 bits).
For your last question, if you are loading a single data byte at a time there is no alignment restriction on most CPUs (some DSPs don't have byte-level instructions, but it's likely you won't run into one).
Very little data "has" to be aligned. It's more that certain types of data may perform better or certain cpu operations require a certain data alignment.
First of all, let's say you're reading 4 bytes of data at a time. Let's also say that your CPU has a 32-bit data bus. Let's also say your data is stored at byte 2 in the system memory.
Now since you can load 4 bytes of data at once, it doesn't make too much sense to have your Address register to point to a single byte. By making your address register point to every 4 bytes you can manipulate 4 times the data. So in other words your CPU may only be able to read data starting at bytes 0, 4, 8, 12, 16, etc.
So here's the issue. If you want the data starting at byte 2 and you're reading 4 bytes, then half your data will be in address position 0 and the other half in position 1.
So basically you'd end up hitting the memory twice to read your one 4 byte data element. Some CPUs don't support this sort of operation (or force you to load and combine the two results manually).
Go here for more details: http://en.wikipedia.org/wiki/Data_structure_alignment
1.) Some architectures do not have this requirement at all, some encourage alignment (there is a speed penalty when accessing non-aligned data items), and some may enforce it strictly (misalignment causes a processor exception).
Many of today's popular architectures fall in the speed penalty category. The CPU designers had to make a trade-off between flexibility/performance and cost (silicon area/number of control signals required for bus cycles).
2.) What language, which architecture? Consult your compiler's manual and/or the CPU architecture documentation.
3.) Again, this is totally architecture dependent (some architectures may not permit access to byte-sized items at all, or have bus widths which are not even a multiple of 8 bits). So unless you are asking about a specific architecture, you won't get any useful answers.
In general, the one answer to all three of those questions is "it depends on your system". Some more details:
Your memory system might not be byte-addressable. Besides that, you might incur a performance penalty to have your processor access unaligned data. Some processors (like older ARM chips, for example) just can't do it at all.
Read the manual for your processor and whatever ABI specification your code is being generated for.
Usually when people refer to data being at a certain alignment, it refers only to the first byte. So if the ABI spec said "data structure X must be 4-byte aligned", it means that X should be placed in memory at an address that's divisible by 4. Nothing is implied by that statement about the size or internal layout of structure X.
As far as your particular example goes, if the data is 4-byte aligned starting at address 1004, the next byte will be at 1005.
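In C11, the compiler's alignment requirements can also be queried directly with alignof, which covers the "how do I find out" question for a given compiler and target. The sketch below uses invented function names; the values returned depend on the implementation.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Address of the i-th byte of an object starting at `base`.
   Alignment constrains only `base`; the bytes are consecutive. */
uintptr_t byte_address(uintptr_t base, size_t i)
{
    return base + i;
}

/* Alignment requirements as reported by the compiler in use. */
size_t alignment_of_int(void)    { return alignof(int); }
size_t alignment_of_double(void) { return alignof(double); }

/* Strictest alignment of any fundamental type on this implementation. */
size_t strictest_fundamental_alignment(void)
{
    return alignof(max_align_t);
}
```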
It completely depends on the CPU you are using!
Some architectures deal only in 32 (or 36!) bit words, and you need special instructions to load single characters or half words.
Some CPUs (notably PowerPC and other IBM RISC chips) don't care about alignment and will load integers from odd addresses.
For most modern architectures you need to align integers to word boundaries and long integers to double-word boundaries. This simplifies the circuitry for loading registers and speeds things up ever so slightly.
Data alignment is required by the CPU for performance reasons. The Intel website gives details on how to align data in memory:
Data Alignment when Migrating to 64-Bit Intel® Architecture
One of these is the alignment of data items – their location in memory in relation to addresses that are multiples of four, eight or 16 bytes. Under the 16-bit Intel architecture, data alignment had little effect on performance, and its use was entirely optional. Under IA-32, aligning data correctly can be an important optimization, although its use is still optional with a very few exceptions, where correct alignment is mandatory. The 64-bit environment, however, imposes more-stringent requirements on data items. Misaligned objects cause program exceptions. For an item to be aligned properly, it must fulfill the requirements imposed by 64-bit Intel architecture (discussed shortly), plus those of the linker used to build the application.
The fundamental rule of data alignment is that the safest (and most widely supported) approach relies on what Intel terms "the natural boundaries." Those are the ones that occur when you round up the size of a data item to the next largest size of two, four, eight or 16 bytes. For example, a 10-byte float should be aligned on a 16-byte address, whereas 64-bit integers should be aligned to an eight-byte address. Because this is a 64-bit architecture, pointer sizes are all eight bytes wide, and so they too should align on eight-byte boundaries.
It is recommended that all structures larger than 16 bytes align on 16-byte boundaries. In general, for the best performance, align data as follows:
Align 8-bit data at any address
Align 16-bit data to be contained within an aligned four-byte word
Align 32-bit data so that its base address is a multiple of four
Align 64-bit data so that its base address is a multiple of eight
Align 80-bit data so that its base address is a multiple of sixteen
Align 128-bit data so that its base address is a multiple of sixteen
A 64-byte or greater data structure or array should be aligned so that its base address is a multiple of 64. Sorting data in decreasing size order is one heuristic for assisting with natural alignment. As long as 16-byte boundaries (and cache lines) are never crossed, natural alignment is not strictly necessary, although it is an easy way to enforce adherence to general alignment recommendations.
Aligning data correctly within structures can cause data bloat (due to the padding necessary to place fields correctly), so where necessary and possible, it is useful to reorganize structures so that fields that require the widest alignment are first in the structure. More on solving this problem appears in the article "Preparing Code for the IA-64 Architecture (Code Clean)."
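As a rough illustration of the reordering advice in the quoted text, the sketch below declares the same four fields in two orders. The struct names, field names and the `padding_bytes` helper are invented for this example. The exact sizes depend on your compiler and ABI; on a typical x86-64 ABI I would expect 24 bytes for the first layout and 16 for the second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total size of the fields themselves, ignoring any padding. */
enum { FIELD_BYTES = 8 + 4 + 1 + 1 };

/* Narrow and wide fields interleaved: the compiler usually has to
   insert padding before `total` and before `count`. */
struct scattered {
    uint8_t  flag;
    uint64_t total;
    uint8_t  kind;
    uint32_t count;
};

/* Same fields, widest alignment first: padding, if any, ends up
   only at the tail of the structure. */
struct sorted {
    uint64_t total;
    uint32_t count;
    uint8_t  flag;
    uint8_t  kind;
};

/* Hypothetical helper: bytes of a struct not occupied by its fields. */
static size_t padding_bytes(size_t struct_size)
{
    assert(struct_size >= FIELD_BYTES);
    return struct_size - FIELD_BYTES;
}
```

Comparing `padding_bytes(sizeof(struct scattered))` with `padding_bytes(sizeof(struct sorted))` shows how much of the bloat comes purely from field order.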
For Intel Architecture, Chapter 4 DATA TYPES of Intel 64 and IA-32 Architectures Software Developer’s Manual answers your question 1.

Memory alignment on modern processors?

I often see code such as the following when, e.g., representing a large bitmap in memory:
size_t width = 1280;
size_t height = 800;
size_t bytesPerPixel = 3;
size_t bytewidth = ((width * bytesPerPixel) + 3) & ~3; /* Aligned to 4 bytes */
uint8_t *pixelData = malloc(bytewidth * height);
(that is, a bitmap allocated as a contiguous block of memory having a bytewidth aligned to a certain number of bytes, most commonly 4.)
A point on the image is then given via:
pixelData + (bytewidth * y) + (bytesPerPixel * x)
This leads me to two questions:
Does aligning a buffer like this have a performance impact on modern processors? Should I be worrying about alignment at all, or will the compiler handle this?
If it does have an impact, could someone point me to a resource to find the ideal byte alignment for various processors?
Thanks.
It depends on a lot of factors. If you're only accessing the pixel data one byte at a time, the alignment will not make any difference the vast majority of the time. For reading/writing one byte of data, most processors won't care at all whether that byte is on a 4-byte boundary or not.
However, if you're accessing data in units larger than a byte (say, in 2-byte or 4-byte units), then you will definitely see alignment effects. For some processors (e.g. many RISC processors), it is outright illegal to access unaligned data on certain levels: attempting to read a 4-byte word from an address that's not 4-byte aligned will generate a Data Access Exception (or Data Storage Exception) on a PowerPC, for example.
On other processors (e.g. x86), accessing unaligned addresses is permitted, but it often comes with a hidden performance penalty. Memory loads/stores are often implemented in microcode, and the microcode will detect the unaligned access. Normally, the microcode will fetch the proper 4-byte quantity from memory, but if it's not aligned, it will have to fetch two 4-byte locations from memory and reconstruct the desired 4-byte quantity from the appropriate bytes of the two locations. Fetching two memory locations is obviously slower than one.
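If you need to read a multi-byte value from an address that may not be aligned, the portable way in C is to copy the bytes out with memcpy rather than cast the pointer and dereference it. The sketch below assumes a plain byte buffer; `load_u32` is a name made up for this example. Compilers commonly turn the memcpy into a single load on targets where unaligned loads are allowed, and into byte-wise code where they are not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads a 32-bit value (in host byte order) starting
   at any byte offset of buf, without dereferencing a misaligned pointer. */
static uint32_t load_u32(const unsigned char *buf, size_t len, size_t offset)
{
    uint32_t value;
    /* The four bytes being read must lie inside the buffer. */
    assert(offset <= len && len - offset >= sizeof value);
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

Writing `*(const uint32_t *)(buf + 1)` instead may appear to work on x86, but it is undefined behaviour in C and can fault on stricter CPUs.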
That's just for simple loads and stores, though. Some instructions, such as the aligned load/store forms in the SSE instruction set, require their memory operands to be properly aligned. If you attempt to access unaligned memory using those special instructions, the CPU raises a fault (a general protection fault on x86) instead of performing a slower access.
To summarize, I wouldn't really worry too much about alignment unless you're writing super performance-critical code (e.g. in assembly). The compiler helps you out a lot, e.g. by padding structures so that 4-byte quantities are aligned on 4-byte boundaries, and on x86, the CPU also helps you out when dealing with unaligned accesses. Since the pixel data you're dealing with is in quantities of 3 bytes, you'll almost always be doing single-byte accesses anyway.
If you decide you instead want to access pixels in singular 4-byte accesses (as opposed to 3 1-byte accesses), it would be better to use 32-bit pixels and have each individual pixel aligned on a 4-byte boundary. Aligning each row to a 4-byte boundary but not each pixel will have little, if any, effect.
Based on your code, I'm guessing it's related to reading the Windows bitmap file format -- bitmap files require the length of each scanline to be a multiple of 4 bytes, so setting up your pixel data buffers with that property has the property that you can just read in the entire bitmap in one fell swoop into your buffer (of course, you still have to deal with the fact that the scanlines are stored bottom-to-top instead of top-to-bottom and that the pixel data is BGR instead of RGB). This isn't really much of an advantage, though -- it's not that much harder to read in the bitmap one scanline at a time.
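For reference, the scanline rounding from the question can be written as a small helper. This is only a sketch: `row_stride` and `pixel_offset` are names I made up, and the alignment is assumed to be a power of two so the mask trick works.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes per scanline for `width` pixels, rounded up
   to a multiple of `alignment` (a non-zero power of two). */
static size_t row_stride(size_t width, size_t bytes_per_pixel, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (width * bytes_per_pixel + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: byte offset of pixel (x, y) within the buffer. */
static size_t pixel_offset(size_t stride, size_t bytes_per_pixel,
                           size_t x, size_t y)
{
    return stride * y + bytes_per_pixel * x;
}
```

For the 1280-pixel-wide, 3-bytes-per-pixel image in the question, the row is already 3840 bytes, a multiple of 4, so no padding is added. A width of 1281 gives 3843 bytes of pixel data, which rounds up to a 3844-byte stride.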
Yes, alignment does have a performance impact on modern-- let's say x86--processors. Generally, loads and stores of data happen on natural alignment boundaries; if you're getting a 32-bit value into a register, it's going to be fastest if it's aligned on a 32-bit boundary already. If it's not, the x86 will "take care of it for you", in the sense that the CPU will still do the load, but it will take a significantly larger number of cycles to do it, because there will be internal wrangling to "re-align" the access.
Of course, in most cases, this overhead is trivial. Structures of binary data are frequently packed together in unaligned ways for transport over the network or for persistence on disk, and the size benefits of the packed storage outweigh any perf hit from operating occasionally on this data.
But particularly with large buffers of uniform data that get accessed randomly and where performance in the aggregate really is important, as in your pixel buffer above, keeping data structures aligned can still be beneficial.
Note that in the case of the example you give above, only each "line" of pixel data is aligned. The pixels themselves are still 3 bytes long and often unaligned within the "lines", so there's not much benefit here. There are texture formats, for example, that have 3 bytes of real data per pixel, and literally just waste an extra byte on each one to keep the data aligned.
There's some more general information here: http://en.wikipedia.org/wiki/Data_structure_alignment
(The specific characteristics vary between architectures, both in what the natural alignments are, whether the CPU handles unaligned loads/stores automatically, and in how expensive those end up being. In cases where the CPU doesn't handle access magically, often the compiler/C runtime will do what it can to do this work for you.)
Buffer alignment has an impact. The question is: is it a significant impact? The answer can be highly application specific. In architectures which do not natively support unaligned access—for example, the 68000 and 68010 (the 68020 adds unaligned access)—it's truly a performance and/or maintenance problem since the CPU will fault, or maybe trap to a handler to perform unaligned access.
The ideal alignment for various processors can be estimated: 4-byte alignment is appropriate for architectures with a 32-bit data path. 8-byte alignment for 64-bit. However, L1 caching has an effect. For many CPUs this is 64 bytes though it will no doubt change in the future.
Too high of an alignment (that is, eight byte where only two byte is needed) causes no performance inefficiency for any narrower system, even on an 8-bit microcontroller. It simply wastes (potentially) a few bytes of storage.
Your example is rather peculiar: the 3-byte elements have a 50% chance of individually being unaligned (to 32 bits), so aligning the buffer seems pointless—at least for performance reasons. However, in the case of a bulk transfer of the whole thing, it optimizes the first access. Note that an unaligned first byte might also have a performance impact in the transfer to a video controller.
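The 50% figure can be checked with a few lines of C, if you read "unaligned" as "the pixel's bytes straddle a 4-byte word boundary" (that reading is my assumption). The function name `count_straddling` is invented for this sketch, and the row is assumed to start on a word boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: of the first `count` tightly packed pixels of
   `bytes_per_pixel` bytes each, counts those whose first and last byte
   fall into different `word`-byte words. */
static size_t count_straddling(size_t count, size_t bytes_per_pixel, size_t word)
{
    size_t straddling = 0;
    assert(word != 0 && bytes_per_pixel != 0);
    for (size_t x = 0; x < count; x++) {
        size_t first = x * bytes_per_pixel;       /* offset of first byte */
        size_t last = first + bytes_per_pixel - 1; /* offset of last byte */
        if (first / word != last / word)
            straddling++;
    }
    return straddling;
}
```

For a 1280-pixel row of 3-byte pixels and 4-byte words this gives 640, i.e. half of them. With 4-byte pixels the count drops to 0, which is the argument for padding each pixel to 32 bits.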
Does aligning a buffer like this have a performance impact on modern processors?
Yes. For instance if memcpy is optimized using SIMD instructions (like MMX/SSE) some operations will be faster with aligned memory. In some architectures there are (processor) instructions that fail if the data is not aligned, thus something might work on your machine but not in another one.
With aligned data you also make a better use of the CPU caches.
Should I be worrying about alignment at all, or will the compiler handle this?
You should worry about alignment when you use dynamic memory, because the compiler cannot handle this for you (see the reply to this comment).
For other stuff in your code you have the -malign flag and aligned attribute to play with.
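For the dynamic-memory case, one option is C11's aligned_alloc. The sketch below assumes your C library provides it (not every platform does); the wrapper name `alloc_aligned` is made up. It rounds the size up because C11 requires the size to be a multiple of the alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: allocates at least `size` bytes starting at an
   address that is a multiple of `alignment` (a non-zero power of two).
   Returns NULL on failure; release the block with free(). */
static void *alloc_aligned(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* Round the request up to a multiple of the alignment. */
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, rounded);
}
```

For example, `alloc_aligned(64, bytewidth * height)` would give a pixel buffer that starts on a 64-byte boundary, which matches the common cache-line size mentioned above. On POSIX systems, posix_memalign is an alternative.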

Resources