Why does CPU access memory on a word boundary? - c

I have often heard that data should be properly aligned in memory for better access efficiency, and that the CPU accesses memory on word boundaries.
So in the following scenario, the CPU has to make 2 memory accesses to get a single word.
Supposing: 1 word = 4 bytes
("|" stands for word boundary. "o" stands for byte boundary)
|----o----o----o----|----o----o----o----| (The word boundary in CPU's eye)
----o----o----o---- (What I want to read from memory)
Why should this happen? What is the root cause of the CPU only being able to read at a word boundary?
If the CPU can only access memory at 4-byte word boundaries, the address lines should only need to be 30 bits wide, not 32, because the last 2 bits are always 0 in the CPU's eyes.
ADD 1
What's more, if we accept that the CPU must read at a word boundary, why can't the boundary start where I want to read? It seems that the boundary is fixed in the CPU's eyes.
ADD 2
According to AnT, it seems that the boundary setting is hardwired and it is hardwired by the memory access hardware. CPU is just innocent as far as this is concerned.
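The observation about the low two address bits can be made concrete. This is only a minimal sketch, assuming 4-byte words and 32-bit addresses as in the question; the names `word_index`, `byte_offset` and `straddles_word` are made up for illustration and do not describe any particular CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption from the question: 1 word = 4 bytes. */
#define WORD_BYTES 4u

/* Upper 30 bits of the address: which word the memory is asked for. */
uint32_t word_index(uint32_t addr)  { return addr / WORD_BYTES; }

/* Lower 2 bits of the address: which byte inside that word. */
uint32_t byte_offset(uint32_t addr) { return addr % WORD_BYTES; }

/* True when a read of `size` bytes (1 to 4) starting at addr does not
   fit inside a single word and therefore needs a second access. */
bool straddles_word(uint32_t addr, uint32_t size)
{
    return byte_offset(addr) + size > WORD_BYTES;
}
```

With these definitions a 4-byte read at address 8 stays inside word 2, while the same read at address 9 spills into word 3, which is the two-access case drawn in the question.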

The meaning of "can" (in "...CPU can access...") in this case depends on the hardware platform.
On x86 platform CPU instructions can access data aligned on absolutely any boundary, not only on "word boundary". The misaligned access might be less efficient than aligned access, but the reasons for that have absolutely nothing to do with CPU. It has everything to do with how the underlying low-level memory access hardware works. It is quite possible that in this case the memory-related hardware will have to make two accesses to the actual memory, but that's something CPU instructions don't know about and don't need to know about. As far as CPU is concerned, it can access any data on any boundary. The rest is implemented transparently to CPU instructions.
On hardware platforms like Sun SPARC, CPU cannot access misaligned data (in simple words, your program will crash if you attempt to), which means that if for some reason you need to perform this kind of misaligned access, you'll have to implement it manually and explicitly: split it into two (or more) CPU instructions and thus explicitly perform two (or more) memory accesses.
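A rough sketch of what "implementing it manually" can look like in C. The function names are invented for this example. The first version leans on `memcpy`, which has no alignment requirement; the second spells out the individual byte loads and assumes the value is stored least significant byte first.

```c
#include <stdint.h>
#include <string.h>

/* Copies four bytes from any address into an aligned local variable.
   memcpy has no alignment requirement, so this is well defined even
   when p is odd. The result uses the host byte order. */
uint32_t load_u32_any(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Explicit version: four single-byte loads assembled with shifts.
   Assumption: the value is stored least significant byte first. */
uint32_t load_u32_le_bytes(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Either function can be called with a pointer such as `buf + 1`. What the compiler turns the `memcpy` into depends on the target, so treat this as a portability sketch rather than a statement about generated code.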
As for why it is so... well, that's just how modern computer memory hardware works. The data has to be aligned. If it is not aligned, the access either is less efficient or does not work at all.
A very simplified model of modern memory would be a grid of cells (rows and columns), each cell storing a word of data. A programmable robotic arm can put a word into a specific cell and retrieve a word from a specific cell. One at a time. If your data is spread across several cells, you have no other choice but to make several consecutive trips with that robotic arm. On some hardware platforms the task of organizing these consecutive trips is hidden from CPU (meaning that the arm itself knows what to do to assemble the necessary data from several pieces), on other platforms it is visible to the CPU (meaning that it is the CPU who's responsible for organizing these consecutive trips of the arm).

It saves silicon in the addressing logic if you can make certain assumptions about the address (like "bottom n bits are zero"). Some CPUs (x86 and their work-alikes) will put logic in place to turn misaligned data into multiple fetches, concealing some nasty performance hits from the programmer. Most CPUs outside of that world will instead raise a hardware error explaining in no uncertain terms that they don't like this.
All the arguments you're going to hear about "efficiency" are bollocks or, more precisely, are begging the question. The real reason is simply that it saves silicon in the processor core if the number of address bits can be reduced for operations. Any inefficiency that arises from misaligned access (like in the x86 world) is a result of the hardware design decisions, not intrinsic to addressing in general.
Now that being said, for most use cases the hardware design decision makes sense. If you're accessing data in two-byte words, most common use cases have you access offset, then offset+2, then offset+4 and so on. Being able to increment the address byte-wise while accessing two-byte words is typically (as in 99.44% certainly) not what you want to be doing. As such it doesn't hurt to require address offsets to align on word boundaries (it's a mild, one-time inconvenience when you design your data structures) but it sure does save on your silicon.
As a historical aside, I worked once on an Interdata Model 70 -- a 16-bit minicomputer. It required all memory access to be 16-bit aligned. It also had a very small amount of memory by the time I was working on it by the standards of the time. (It was a relic even back then.) The word-alignment was used to double the memory capacity since the wire-wrapped CPU could be easily hacked. New address decode logic was added that took a 1 in the low bit of the address (previously an alignment error in the making) and used it to switch to a second bank of memory. Try that without alignment logic! :)

Because it is more efficient.
In your example, the CPU would have to do two reads: it has to read in the first half, then read in the second half separately, then reassemble them together to do the computation. This is much more complicated and slower than doing the read in one go if the data was properly aligned.
Some processors, like x86, can tolerate misaligned data access (so you would still need all 32 bits) - others like Itanium absolutely cannot handle misaligned data accesses and will complain quite spectacularly.

Word alignment is not only a feature of CPUs.
On the hardware level, most RAM-Modules have a given Word size in respect to the amount of bits that can be accessed per read/write cycle.
On a module I had to interface with on an embedded device, addressing was implemented through three parameters: the module was organized in four banks, which could be selected prior to the read/write operation. Each of these banks was essentially a large table of 32-bit words, which could be addressed through a row and column index.
In this design, access was only possible per cell, so every read operation returned 4 bytes, and every write operation expected 4 bytes.
A memory controller hooked up to this RAM chip could be designed in two ways: either allowing unrestricted access to the memory chip, using several cycles to split/merge unaligned data to/from several cells (with additional logic), or imposing some restrictions on how memory can be accessed, with the gain of reduced complexity.
As complexity can impede maintainability and performance, most designers chose the latter [citation needed]

Related

What is the motivation to explicitly set the "falign-functions" compiler flag to a certain value?

I am working on the software for an embedded system and trying to understand some low-level details that were set up by an earlier developer. The target platform is a custom-made OpenRISC 1200 processor, synthesized in an FPGA. The software is built using a GCC-based cross-compiler.
Among the compiler flags I find this one: -falign-functions=16. There is a comment in the build configuration saying:
On Open RISC 1200, function alignment needs to be on a cache boundary (16 bytes). If not, performance suffer severely.
I realize my understanding of cache memories is a bit shallow and I should probably read something like: What Every Programmer Should Know About Memory. I haven't yet, but I will. With that said, I have some questions:
I understand that this is about minimizing cache misses in the instruction cache, but why is that achieved by setting the function alignment to the instruction cache line size (i.e. 16 bytes)?
If this is the most memory efficient way, wouldn't you expect this to be the default setting for function alignment in the cross-compiler? I mean, for a more common platform like x86, amd64 or ARM you don't need to care about function alignments (or am I wrong?).
Most architectures have aspects of memory access and instructions that can depend on alignment.
but why is that achieved by setting the function alignment to the instruction cache line size
The CPU will fetch complete cache lines from memory (as if the memory is divided into these larger blocks rather than bytes). So if all the data you need fits in one cache line, there is just one fetch, but if you have even just 2 bytes of data, with one byte at the end of a cache line and the other byte at the start of the next, it now has to load two complete cache lines. This wastes space in the small CPU cache and costs more memory transfers.
A quick search indicates that the OpenRISC 1200 uses a 16 byte cache line, so when targeting that specifically, aligning the start of any data you have on those 16 byte multiples helps avoid straddling two lines within one function / piece of data.
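To put numbers on the straddling effect, here is a small sketch that counts how many cache lines a block of code covers. It assumes the 16-byte line size mentioned above; `lines_touched` is a name made up for this example, not something from GCC or the OR1200 toolchain.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption taken from the answer: 16-byte instruction cache lines. */
#define LINE_BYTES 16u

/* Number of cache lines that must be fetched to cover `size` bytes
   of code starting at address `start` (size must be at least 1). */
size_t lines_touched(uintptr_t start, size_t size)
{
    uintptr_t first = start / LINE_BYTES;
    uintptr_t last  = (start + size - 1u) / LINE_BYTES;
    return (size_t)(last - first + 1u);
}
```

Under this model a 40-byte function starting at 0x1000 covers three lines, while the same function starting at 0x100C covers four, and a 2-byte fragment at 0x100F already costs two.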
If this is the most memory efficient way, wouldn't you expect this to be the default setting for function alignment in the cross-compiler?
There can be more to it than that. Firstly, this alignment is achieved by wasting "padding" memory. If you would have used 1 byte of a cache line calling a function, then another 15 bytes are wasted to reach the 16 byte boundary.
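The padding cost can be sketched with the usual round-up-to-a-power-of-two trick. The helper names `align_up` and `padding_before` are hypothetical, and the addresses in the note are only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds addr up to the next multiple of align.
   align must be a power of two (16 in the OR1200 example). */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1u)) == 0);
    return (addr + align - 1u) & ~(align - 1u);
}

/* Bytes of padding inserted before an object placed at addr. */
size_t padding_before(uintptr_t addr, uintptr_t align)
{
    return (size_t)(align_up(addr, align) - addr);
}
```

If the previous function ends one byte into a line, say at 0x1001, the next function is pushed to 0x1010 and 15 bytes are spent on padding, which is the case described above.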
Also in the case of a function call, there is a reasonable chance that memory will be in cache anyway, and jumping forward might leave the cached memory, causing a load that would otherwise not be needed.
So this leaves a trade off, functions that use little stack space and return quickly, might not benefit much from the extra alignment, but a function that runs for longer and uses more stack space might benefit by not "wasting" cache space on the "previous function".
Another reason alignment is often desired is when dealing with instructions that either require it outright (fail on an unaligned address), or are much slower (with loads/stores getting split up into parts), or maybe some other effects (like a load/store not being atomic if not properly aligned).
With a quick search I believe the general alignment requirement on OR1200 appears to be 4 bytes, even for 8 byte types. So in this respect an alignment of at least 4 would seem desirable, and 8 or 16 might only provide a benefit in certain cases mentioned before.
I am not familiar with OpenRISC specifically, but on some platforms instructions added at a later date (e.g. 16-byte / 128-bit SSE instructions) require or benefit from an alignment greater than what was the default (I believe AMD64 upped the default alignment to 16, but then later AVX came wanting 32-byte alignment).

Why must an int have a memory address that is divisible by four on most current architectures? [duplicate]

Admittedly I don't get it. Say you have a memory with a memory word length of 1 byte. Why can't you access a 4-byte variable in a single memory access at an unaligned address (i.e. not divisible by 4), as is the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
    char c;  // one byte
    int i;   // four bytes
    short s; // two bytes
};
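On a 32-bit processor it would most likely be laid out like this: c at 0x0000 followed by three padding bytes, i at 0x0004, and s at 0x0008 followed by two padding bytes.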
The processor can read each of these members in one transaction.
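The layout the compiler actually picks can be inspected with `offsetof`. The following is a sketch only; the wrapper functions are invented for this example, and the concrete numbers depend on the platform's ABI.

```c
#include <stddef.h>

struct mystruct {
    char  c;  /* one byte on every platform */
    int   i;  /* four bytes on typical 32- and 64-bit platforms */
    short s;  /* two bytes on typical platforms */
};

/* Offsets the compiler actually chose, including padding. */
size_t offset_of_c(void) { return offsetof(struct mystruct, c); }
size_t offset_of_i(void) { return offsetof(struct mystruct, i); }
size_t offset_of_s(void) { return offsetof(struct mystruct, s); }

/* Size the members would need with no padding at all. */
size_t packed_size(void)
{
    return sizeof(char) + sizeof(int) + sizeof(short);
}
```

On a typical platform with a 4-byte `int` and a 2-byte `short`, the offsets come out as 0, 4 and 8 and `sizeof(struct mystruct)` is 12, against a packed size of 7. Other ABIs may differ.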
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this: c at 0x0000, i at 0x0001 through 0x0004, and s at 0x0005 through 0x0006.
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
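The read, shift, read, shift, OR sequence can be imitated in software. This is a toy model, not how any real load unit is built: memory is an array of 32-bit words that can only be read whole, and byte 0 of each word is assumed to be its most significant byte so that the shift directions match the description above. The function names are made up.

```c
#include <stdint.h>

/* Toy model: memory that can only be read one aligned 32-bit word
   at a time. Byte 0 of each word is its most significant byte
   (big-endian numbering), an assumption made for this sketch. */
uint32_t read_aligned_word(const uint32_t *mem, uint32_t byte_addr)
{
    return mem[byte_addr / 4u];
}

/* Reads 32 bits from any byte address using aligned reads only. */
uint32_t read_u32_any(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t offset = byte_addr % 4u;
    uint32_t base   = byte_addr - offset;

    if (offset == 0u)
        return read_aligned_word(mem, base);          /* one read */

    /* First word: shift left to drop the bytes before the address. */
    uint32_t hi = read_aligned_word(mem, base) << (8u * offset);
    /* Second word: shift right to keep only its leading bytes. */
    uint32_t lo = read_aligned_word(mem, base + 4u) >> (8u * (4u - offset));
    return hi | lo;                                   /* two reads */
}
```

With two words holding 0x00112233 and 0x44556677, a read at byte address 1 yields 0x11223344 and needs both words, whereas a read at address 0 or 4 needs only one.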
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line. Because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable, but that was split across 2 chunks, then that required double the required memory accesses.
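The mask-and-shift step for pulling one byte out of a 64-bit chunk looks roughly like this. It is a sketch that numbers bytes from the least significant end, which is an assumption of the example; `byte_from_chunk` is an invented name.

```c
#include <stdint.h>

/* Returns byte number `index` (0..7) of a 64-bit chunk, where
   byte 0 is the least significant byte (little-endian numbering). */
uint8_t byte_from_chunk(uint64_t chunk, unsigned index)
{
    return (uint8_t)((chunk >> (8u * index)) & 0xFFu);
}
```

For index 0 the shift amount is zero and only the mask does any work; for a byte in the middle of the chunk both the shift and the mask are needed.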
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments looks like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a Macbook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.
If you have a 32-bit data bus, the address lines connected to the memory will start from A2, so only 32-bit-aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary, i.e. A0 is not zero for 16-bit or 32-bit data, or A1 is not zero for 32-bit data, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
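The "plus one" idea can be sketched as follows. This is a toy model assuming four byte-wide banks in which byte address A lives in bank A % 4 at row A / 4; the function names are invented and no real memory controller is being described.

```c
#include <stdint.h>

/* Toy model of a 32-bit bus built from four byte-wide memories.
   Byte address A lives in bank (A % 4) at row (A / 4). */
#define BANKS 4u

/* Row that `bank` must be given so that a 4-byte read starting at
   byte address `addr` gets its byte from that bank. */
uint32_t row_for_bank(uint32_t addr, uint32_t bank)
{
    uint32_t row        = addr / BANKS;
    uint32_t first_bank = addr % BANKS;
    /* Banks below the starting bank hold bytes of the next word,
       so they need the "plus one" row. */
    return bank < first_bank ? row + 1u : row;
}

/* 1 if all four banks can be driven with the same row address. */
int same_row_everywhere(uint32_t addr)
{
    return addr % BANKS == 0u;
}
```

For a read at address 8 every bank is given row 2. For a read at address 9, banks 1 to 3 use row 2 but bank 0 must use row 3, which is exactly the case that plain memory wired to a single shared address cannot serve in one go.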
On PowerPC you can load an integer from an odd address with no problems.
SPARC and I86 and (I think) Itanium raise hardware exceptions when you try this.
One 32-bit load vs four 8-bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.

Is 8-byte alignment for "double" type necessary?

I understand word-alignment, which makes the cpu only need to read once when reading an integer into a register.
But is 8-byte alignment (let's assume 32bit system) for "double" necessary? What is the benefit? What will happen if the space for storing a "double" is just 4-byte alignment?
There are multiple hardware components that may be adversely affected by unaligned loads or stores.
The interface to memory might be eight bytes wide and only able to access memory at multiples of eight bytes. Loading an unaligned eight-byte double then requires two reads on the bus. Stores are worse, because an aligned eight-byte store can simply write eight bytes to memory, but an unaligned eight-byte store must read two eight-byte pieces, merge the new data with the old data, and write two eight-byte pieces.
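The difference between the aligned and the unaligned store can be sketched with a toy memory that is only touched in whole 8-byte blocks. The `bus_stats` counters and the function name `store8` are made up for illustration; real hardware does this in the memory interface, not in software.

```c
#include <stdint.h>
#include <string.h>

#define BUS_BYTES 8u

/* Counters so the cost of each store can be inspected. */
struct bus_stats { unsigned reads; unsigned writes; };

/* Stores 8 bytes at byte address `addr` into `mem`, touching memory
   only in whole, aligned 8-byte blocks, as the bus in this toy
   model requires. */
void store8(unsigned char *mem, uint32_t addr,
            const unsigned char src[BUS_BYTES], struct bus_stats *st)
{
    uint32_t offset = addr % BUS_BYTES;
    uint32_t base   = addr - offset;

    if (offset == 0u) {
        memcpy(mem + base, src, BUS_BYTES);         /* one block write */
        st->writes += 1u;
        return;
    }

    unsigned char blocks[2u * BUS_BYTES];
    memcpy(blocks, mem + base, 2u * BUS_BYTES);     /* two block reads */
    st->reads += 2u;
    memcpy(blocks + offset, src, BUS_BYTES);        /* merge new data */
    memcpy(mem + base, blocks, 2u * BUS_BYTES);     /* two block writes */
    st->writes += 2u;
}
```

In this model an aligned store costs one block write, while an unaligned one costs two block reads plus two block writes, and the bytes around the stored value are preserved by the merge.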
Cache lines are typically 32 or 64 bytes. If eight-byte objects are aligned to multiples of eight bytes, then each object is in just one cache line. If they are unaligned, then some of the objects are partly in one cache line and partly in another. Loading or storing these objects then requires using two cache lines instead of one. This effect occurs at all levels of cache (three levels is not uncommon in modern processors).
Memory system pages are typically 512 bytes or more. Again, each aligned object is in just one page, but some unaligned objects are in multiple pages. Each page that is accessed requires hardware resources: The virtual address must be translated to a physical address, this may require accessing translation tables, and address collisions must be detected. (Processors may have multiple load and store operations in operation simultaneously. Even though your program may appear to be single-threaded, the processor reads instructions in advance and tries to execute those that it can. So a processor may start a load instruction before preceding instructions have completed. However, to be sure this does not cause an error, the processor checks each load instruction to be sure it is not loading from an address that a prior store instruction is changing. If an access crosses a page boundary, the two parts of the loaded data have to be checked separately.)
The response of the system to unaligned operations varies from system to system. Some systems are designed to support only aligned accesses. In these cases, unaligned accesses either cause exceptions that lead to program termination or exceptions that cause execution of special handlers that emulate unaligned operations in software (by performing aligned operations and merging the data as necessary). Software handlers such as these are much slower than hardware operations.
Some systems support unaligned accesses, but this usually consumes more hardware resources than aligned accesses. In the best case, the hardware performs two operations instead of one. But some hardware is designed to start operations as if they were aligned and then, upon discovering the operation is not aligned, to abort it and start over using different paths in the hardware to handle the unaligned operation. In such systems, unaligned accesses have a significant performance penalty, although it is not as great as in systems where software handles unaligned accesses.
In some systems, the hardware may have multiple load-store execution units that can perform the two operations required of unaligned accesses just as quickly as one unit can perform the operation of aligned accesses. So there is no direct performance degradation of unaligned accesses. However, because multiple execution units are kept busy by unaligned accesses, they are unavailable to perform other operations. Thus, programs that perform many load-store operations, normally in parallel, will execute more slowly with unaligned accesses than with aligned accesses.
On many architectures, unaligned access of any load/store unit (short, int, long) is simply an exception. Compilers are responsible for ensuring it doesn't happen on potentially mis-aligned data, by emitting smaller access instructions and re-assembling in registers if they can't prove a given pointer is OK.
Performance-wise, 8-byte alignment of doubles on 32-bit systems can be valuable for a few reasons. The most apparent is that 4-byte alignment of an 8-byte double means that one element could cross the boundary of two cache lines. Memory access occurs in units of whole cache lines, and so misalignment doubles the cost of access.
I seem to remember that the recommendation for the 486 was to align doubles on 32-bit boundaries, so 64-bit alignment is not mandatory.
You seem to think that there is a relationship between the data bus width and the processor bitness. While it is often the case, you can find variation in both direction. For instance the Pentium was a 32-bit processor, but its data bus size was 64 bits.
Caches offer something else which may explain the usefulness of having 64-bit alignment for 64-bit types. Here the external bus is not a factor; it is the cache line size which is important. Data crossing a cache line boundary is costlier to access than data not crossing it (even if it is unaligned in both cases). Aligning types on their size ensures that they won't cross cache lines as long as the cache line size is a multiple of the type size.
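That last claim is easy to check by brute force. The sketch below simply tries every allowed starting offset inside one cache line; `crossing_placements` is a hypothetical helper, and the 64-byte line in the note is just an example value.

```c
#include <stddef.h>

/* Counts how many placements of a `size`-byte object, at every
   offset that is a multiple of `align` within one cache line of
   `line` bytes, spill over into the following line.
   align must be nonzero. */
size_t crossing_placements(size_t line, size_t size, size_t align)
{
    size_t crossings = 0;
    for (size_t offset = 0; offset < line; offset += align) {
        if (offset + size > line)
            crossings++;
    }
    return crossings;
}
```

For a 64-byte line, an 8-byte double aligned to 8 never crosses. Aligned to only 4, it crosses at one of the sixteen possible offsets (offset 60), and with no alignment at all it crosses at seven of the sixty-four.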
I just found the answer:
"6. When memory reading is efficient in reading 4 bytes at a time on 32 bit machine, why should a double type be aligned on 8 byte boundary?
It is important to note that most of the processors will have math co-processor, called Floating Point Unit (FPU). Any floating point operation in the code will be translated into FPU instructions. The main processor is nothing to do with floating point execution. All this will be done behind the scenes.
As per standard, double type will occupy 8 bytes. And, every floating point operation performed in FPU will be of 64 bit length. Even float types will be promoted to 64 bit prior to execution.
The 64 bit length of FPU registers forces double type to be allocated on 8 byte boundary. I am assuming (I don’t have concrete information) in case of FPU operations, data fetch might be different, I mean the data bus, since it goes to FPU. Hence, the address decoding will be different for double types (which is expected to be on 8 byte boundary). It means, the address decoding circuits of floating point unit will not have last 3 pins."
Edited:
The advantage of alignment is to reduce the number of memory cycles needed to retrieve the data. For example, an 8-byte value that would take a single cycle if aligned might take 2 cycles if misaligned, since one part of it is obtained in the first memory cycle and the rest in the next.
I came across this:
"Aligned access is faster because the external bus to memory is not a single byte wide - it is typically 4 or 8 bytes wide (or even wider). So the CPU doesn't fetch a single byte at a time - it fetches 4 or 8 bytes starting at the requested address. Therefore, the 2 or 3 least significant bits of the memory address are not actually sent by the CPU - the external memory can only be read or written at addresses that are a multiple of the bus width. If you requested a byte at address "9", the CPU would actually ask the memory for the block of bytes beginning at address 8, and load the second one into your register (discarding the others).
This implies that a misaligned access can require two reads from memory: If you ask for 8 bytes beginning at address 9, the CPU must fetch the 8 bytes beginning at address 8 as well as the 8 bytes beginning at address 16, then mask out the bytes you wanted. On the other hand, if you ask for the 8 bytes beginning at address 8, then only a single fetch is needed. Some CPUs will not even perform such a misaligned load - they will simply raise an exception (or even silently load the wrong data!)."
You might see this link for more details.
http://www.ibm.com/developerworks/library/pa-dalign/

Structure padding

I was trying to understand why structure padding is the reason structures cannot be compared by memcmp.
One small thing I don't understand about structure padding is this:
Why should "a short be 2-byte aligned" or "a long be 4-byte aligned"? I understand it has to do with their sizes, but why can they not appear at any byte boundary?
Or in other words "why is 0x10004566 not a valid location for a long variable but 0x10004568 is?"
Because some platforms (i.e. CPUs) physically don't support "mis-aligned" memory accesses. Other platforms support them, but in a much slower fashion.
The padding you get in a struct is dependent on the choices your compiler makes, but it will be making those choices in order to satisfy the specific requirements of the CPU the code is targeted at.
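Since the padding bytes are not part of any member, a comparison that should ignore them has to go member by member. A minimal sketch, using a hypothetical `struct record` and helper names invented for this example:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical record; on most platforms the compiler inserts
   padding bytes after `tag` so that `value` is aligned. */
struct record {
    char tag;
    long value;
};

/* Compares only the named members, so whatever happens to be in
   the padding bytes cannot affect the result. */
bool record_equal(const struct record *a, const struct record *b)
{
    return a->tag == b->tag && a->value == b->value;
}

/* Fills the whole object, padding included, with `fill`, then
   sets the members. Used here to produce records whose padding
   is likely to differ. */
struct record make_record(unsigned char fill, char tag, long value)
{
    struct record r;
    memset(&r, fill, sizeof r);
    r.tag = tag;
    r.value = value;
    return r;
}
```

Two records built with different fill bytes but the same members compare equal with `record_equal`, whereas `memcmp` over the whole struct may report a difference because it also looks at the padding. The contents of padding bytes are unspecified, so the `memcmp` result is not something to rely on either way.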
Memory alignment is a very important issue when optimizing a program for speed. C, being a language that - generally - puts strong emphasis on speed, likes to enforce some rules which may make the program faster.
The limitation of aligned and unaligned memory accesses comes directly from the hardware used for fetching the data from the memory, which usually fetches it in chunks which are equal to the machine word in size. Say you want to access a doubleword (4 bytes) stored at location 101. This means that the memory controller would firstly have to (probably) issue a read of a doubleword at location 100, then another read of a doubleword at location 104, and then splice the individual bytes from locations 101, 102, 103, and 104 together. The whole operation takes (hypothetically) two clock cycles.
If you want to access a doubleword at location 100, there's no such issue, which should be illustrated clearly enough by the example I provided.
In fact, misaligned data access is such a big issue that SSE instructions (the "aligned" versions, there are also "misaligned" versions which don't do that) will cause a general protection fault if you try to access misaligned data with those.
As a rule of thumb, it never hurts to align 4-byte data on a 4-byte boundary, 8-byte data on a 8-byte boundary, and so forth.
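The rule of thumb boils down to a remainder check. A tiny sketch follows; `is_aligned` is an invented helper, and the addresses in the note are the ones from the question and from the example above.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addr is a multiple of align (align must be nonzero). */
bool is_aligned(uintptr_t addr, uintptr_t align)
{
    return addr % align == 0;
}
```

With a 4-byte alignment requirement, 0x10004568 and 100 pass the check while 0x10004566 and 101 do not. 0x10004566 would still be fine for a 2-byte short.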
The only additional example I can think of with respect to alignment is the transfer of data. Depending on the architecture, data is transferred in blocks of, say, 32 bytes; if your data crosses a block boundary, it could require 2 transfers to receive it rather than 1.

Memory alignment on modern processors?

I often see code such as the following when, e.g., representing a large bitmap in memory:
size_t width = 1280;
size_t height = 800;
size_t bytesPerPixel = 3;
size_t bytewidth = ((width * bytesPerPixel) + 3) & ~3; /* Aligned to 4 bytes */
uint8_t *pixelData = malloc(bytewidth * height);
(that is, a bitmap allocated as a contiguous block of memory having a bytewidth aligned to a certain number of bytes, most commonly 4.)
A point on the image is then given via:
pixelData + (bytewidth * y) + (bytesPerPixel * x)
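The same arithmetic, pulled into two small helpers so the rounding is easier to see. This is only a sketch; `row_stride` and `pixel_offset` are names made up here and are not part of any bitmap library.

```c
#include <stddef.h>

/* Bytes per row after rounding up to a multiple of `align`
   (align must be a power of two, 4 in the snippet above). */
size_t row_stride(size_t width, size_t bytes_per_pixel, size_t align)
{
    return (width * bytes_per_pixel + align - 1) & ~(align - 1);
}

/* Byte offset of pixel (x, y) from the start of the buffer. */
size_t pixel_offset(size_t x, size_t y, size_t stride, size_t bytes_per_pixel)
{
    return stride * y + bytes_per_pixel * x;
}
```

For a width of 1280 at 3 bytes per pixel the row is already 3840 bytes, a multiple of 4, so no padding is added. A width of 1281 gives 3843 bytes, which is rounded up to 3844.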
This leads me to two questions:
Does aligning a buffer like this have a performance impact on modern processors? Should I be worrying about alignment at all, or will the compiler handle this?
If it does have an impact, could someone point me to a resource to find the ideal byte alignment for various processors?
Thanks.
It depends on a lot of factors. If you're only accessing the pixel data one byte at a time, the alignment will not make any difference the vast majority of the time. For reading/writing one byte of data, most processors won't care at all whether that byte is on a 4-byte boundary or not.
However, if you're accessing data in units larger than a byte (say, in 2-byte or 4-byte units), then you will definitely see alignment effects. For some processors (e.g. many RISC processors), it is outright illegal to access unaligned data on certain levels: attempting to read a 4-byte word from an address that's not 4-byte aligned will generate a Data Access Exception (or Data Storage Exception) on a PowerPC, for example.
On other processors (e.g. x86), accessing unaligned addresses is permitted, but it often comes with a hidden performance penalty. Memory loads/stores are often implemented in microcode, and the microcode will detect the unaligned access. Normally, the microcode will fetch the proper 4-byte quantity from memory, but if it's not aligned, it will have to fetch two 4-byte locations from memory and reconstruct the desired 4-byte quantity from the appropriate bytes of the two locations. Fetching two memory locations is obviously slower than one.
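The "fetch two words and reconstruct" step can be mimicked in software. The sketch below is only a model of the idea, not what any real CPU does internally: memory is simulated as an array of aligned 32-bit words, byte order is assumed little-endian, and `load32_unaligned` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated memory: byte address 4*i + k is bits 8k..8k+7 of words[i]
   (little-endian layout, assumed for this sketch). The caller must make
   sure words[byte_addr / 4 + 1] exists when byte_addr is not aligned. */
static uint32_t load32_unaligned(const uint32_t *words, size_t byte_addr)
{
    assert(words != NULL);
    size_t index = byte_addr / 4;
    unsigned shift = (unsigned)(byte_addr % 4) * 8;

    if (shift == 0)
        return words[index];            /* aligned: a single fetch is enough */

    uint32_t lo = words[index];         /* first fetch: holds the low bytes    */
    uint32_t hi = words[index + 1];     /* second fetch: holds the high bytes  */
    return (lo >> shift) | (hi << (32 - shift));
}
```

The aligned path touches one word; every other offset touches two, which is where the extra cost comes from.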
That's just for simple loads and stores, though. Some instructions, such as the aligned load/store forms in the SSE instruction set, require their memory operands to be properly aligned. If you attempt to access unaligned memory using those instructions, you'll get an exception (a general protection fault on x86).
To summarize, I wouldn't really worry too much about alignment unless you're writing super performance-critical code (e.g. in assembly). The compiler helps you out a lot, e.g. by padding structures so that 4-byte quantities are aligned on 4-byte boundaries, and on x86, the CPU also helps you out when dealing with unaligned accesses. Since the pixel data you're dealing with is in quantities of 3 bytes, you'll almost always be doing single-byte accesses anyway.
If you decide you instead want to access each pixel in a single 4-byte access (as opposed to three 1-byte accesses), it would be better to use 32-bit pixels and have each individual pixel aligned on a 4-byte boundary. Aligning each row to a 4-byte boundary but not each pixel will have little, if any, effect.
Based on your code, I'm guessing it's related to reading the Windows bitmap file format. Bitmap files require the length of each scanline to be a multiple of 4 bytes, so if your pixel data buffer has the same property, you can read the entire bitmap into your buffer in one fell swoop (of course, you still have to deal with the fact that the scanlines are stored bottom-to-top instead of top-to-bottom and that the pixel data is BGR instead of RGB). This isn't really much of an advantage, though: it's not that much harder to read in the bitmap one scanline at a time.
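The scanline padding from the question can be wrapped in a few small helpers. This is a sketch under the same assumptions as the question's snippet (rows padded to a multiple of 4 bytes, pixels stored back to back); the names `align_up`, `row_stride` and `pixel_offset` are illustrative, not from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Bytes per scanline, including padding up to a 4-byte multiple. */
static size_t row_stride(size_t width, size_t bytes_per_pixel)
{
    return align_up(width * bytes_per_pixel, 4);
}

/* Byte offset of pixel (x, y) from the start of the buffer. */
static size_t pixel_offset(size_t stride, size_t bytes_per_pixel,
                           size_t x, size_t y)
{
    return stride * y + bytes_per_pixel * x;
}
```

For the 1280-pixel-wide, 3-bytes-per-pixel bitmap in the question, the row is already 3840 bytes, so no padding is added; a 5-pixel-wide row would grow from 15 to 16 bytes.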
Yes, alignment does have a performance impact on modern processors (let's say x86). Generally, loads and stores of data happen on natural alignment boundaries; if you're getting a 32-bit value into a register, it's going to be fastest if it's aligned on a 32-bit boundary already. If it's not, the x86 will "take care of it for you", in the sense that the CPU will still do the load, but it will take a significantly larger number of cycles to do it, because there will be internal wrangling to "re-align" the access.
Of course, in most cases, this overhead is trivial. Structures of binary data are frequently packed together in unaligned ways for transport over the network or for persistence on disk, and the size benefits of the packed storage outweigh any perf hit from operating occasionally on this data.
But particularly with large buffers of uniform data that get accessed randomly and where performance in the aggregate really is important, as in your pixel buffer above, keeping data structures aligned can still be beneficial.
Note that in the case of the example you give above, only each "line" of pixel data is aligned. The pixels themselves are still 3 bytes long and often unaligned within the "lines", so there's not much benefit here. There are texture formats, for example, that have 3 bytes of real data per pixel, and literally just waste an extra byte on each one to keep the data aligned.
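A quick way to see why row alignment alone does not help is to count how many packed pixels straddle a 4-byte word. The sketch below assumes the row starts at offset 0 and pixels are stored back to back; `count_straddling` is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many of n consecutive pixels of pixel_size bytes, packed from
   offset 0, have their first and last byte in different word-sized chunks. */
static size_t count_straddling(size_t n, size_t pixel_size, size_t word)
{
    assert(pixel_size > 0 && word > 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        size_t first = i * pixel_size;          /* offset of the pixel's first byte */
        size_t last = first + pixel_size - 1;   /* offset of the pixel's last byte  */
        if (first / word != last / word)
            count++;
    }
    return count;
}
```

With 3-byte pixels, half of the pixels in a row cross a 4-byte boundary regardless of how the row itself is aligned; with 4-byte (padded) pixels, none do.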
There's some more general information here: http://en.wikipedia.org/wiki/Data_structure_alignment
(The specific characteristics vary between architectures: what the natural alignments are, whether the CPU handles unaligned loads/stores automatically, and how expensive those end up being. In cases where the CPU doesn't handle access magically, often the compiler/C runtime will do what it can to do this work for you.)
Buffer alignment has an impact. The question is: is it a significant impact? The answer can be highly application specific. In architectures which do not natively support unaligned access—for example, the 68000 and 68010 (the 68020 adds unaligned access)—it's truly a performance and/or maintenance problem since the CPU will fault, or maybe trap to a handler to perform unaligned access.
The ideal alignment for various processors can be estimated: 4-byte alignment is appropriate for architectures with a 32-bit data path, and 8-byte alignment for 64-bit. However, L1 caching also has an effect: data moves in whole cache lines, which are 64 bytes on many CPUs, though that will no doubt change in the future.
Too high an alignment (that is, 8-byte where only 2-byte is needed) causes no performance penalty on any narrower system, even an 8-bit microcontroller. It simply wastes (potentially) a few bytes of storage.
Your example is rather peculiar: the 3-byte elements have a 50% chance of individually being unaligned (to 32 bits), so aligning the buffer seems pointless—at least for performance reasons. However, in the case of a bulk transfer of the whole thing, it optimizes the first access. Note that an unaligned first byte might also have a performance impact in the transfer to a video controller.
Does aligning a buffer like this have a performance impact on modern processors?
Yes. For instance if memcpy is optimized using SIMD instructions (like MMX/SSE) some operations will be faster with aligned memory. In some architectures there are (processor) instructions that fail if the data is not aligned, thus something might work on your machine but not in another one.
With aligned data you also make a better use of the CPU caches.
Should I be worrying about alignment at all, or will the compiler handle this?
You should worry about alignment when you use dynamic memory and the compiler cannot handle it for you (see the reply to this comment).
For other stuff in your code you have the -malign flag and aligned attribute to play with.
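For the dynamic-memory case, one option is C11's `aligned_alloc`. The wrapper below is a minimal sketch, assuming a C11 library that provides `aligned_alloc` (some toolchains, such as MSVC, do not); `alloc_aligned` is a hypothetical helper name, and 16 and 64 are only example alignments (a common SIMD width and a common cache line size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate at least size bytes at an address that is a multiple of align.
   align must be a power of two. Release the result with free(). */
static void *alloc_aligned(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* C11 aligned_alloc expects size to be a multiple of align, so round up. */
    size_t rounded = (size + align - 1) & ~(align - 1);
    return aligned_alloc(align, rounded);
}
```

A pixel buffer obtained as `alloc_aligned(bytewidth * height, 16)` starts on a 16-byte boundary; whether the rows inside it stay aligned still depends on the row stride.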

Resources