Alignment of pointers for lock-free algorithms - c

When calloc is being used pointers to newly allocated memory are aligned to at least certian number of the least significant bits, meaning that least significent bits(as tagged pointeres) can be used for lock-free algorithms, and in fact is commonly used in case of those algorithms. I was testing memory menagment feature on linux ubuntu server( x86_64 GNU/Linux, 3.10.23-xxxx-std-ipv6-64-vps) and it seems, from my experiments, that the 4 least significant bits are set to 0. From what i have read it states that pointer alignment is formed in such a way for pointer expressed as uintptr to be divided by 4(alignment to 2 least significant bits)
What is the minimum number of the least significant bits in newly allocated memory pointers, obtained from memory menagment system in POSIX (linux), that are always set to 0 during initial memory allocation process?
What is the maximum number of the least significant bits that can be used as tagged pointers on linux systems (eg. lock-free algorithms)?
How to force compiler to align newly allocated pointers to exect number of the least significant bits ?
Does the alignment of pointers affect system overall performance, and how ?

Alignment is important in optimization for many related reasons:
efficient usage of the cache lines
avoid to disable the prefetching logics
best usage of vector registers/instructions (SSE, AVX).
especially when I/O is concerned, also memory page alignment can be important.
You can find very good references for Intel architecture here:
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Answering quickly to your questions:
What is the minimum number of the least significant bits in newly
allocated memory pointers, obtained from memory menagment system in
POSIX (linux), that are always set to 0 during initial memory
allocation process?
It actually depends on the CPU/architecture you are speaking of.
What is the maximum number of the least significant bits that can be
used as tagged pointers on linux systems (eg. lock-free algorithms)?
The same as the former: you should use std::atomic or boost::atomic in order to have some sort of portability, if C++ is an option.
On Intel architectures, memory load and stores are atomic for 32 bit, on x86_32, for 64 on x86_64, if data are properly aligned.
If you are really enjoying this kind of low level, don't forget to have a look into memory semantics, memory fences and so on ("Fence instructions" in the above manual)

I'm afraid I can't answer your whole question, but I can make a start:
Pointer alignment might not only change performance but also necessary to make your code work at all. Especially for things like ARM processors you can't read numbers larger then 1 byte if the pointer is unaligned. Doing this will result in an error.
If I, for example, work with a big data-stream I prefer have my data aligned so I can read more bytes at the same time, instead having to read byte for byte what will cost more time/CPU.

on x86/x86_64 architecture reading/writing to unaligned memory is paid with a performance cost, because you will need two memory ops instead of a single one: the bus operations to/from memory are always aligned.
On GNU/Linux you can use posix_memalign & C. to get heap aligned memory (man memalign) in user space.
Some compilers also supports macro to get aligned memory on the stack, for instance
/* GCC align declarator */
#define MYMEMALIGN(x, y) x __attribute__( (aligned( y )) )
#endif
but I guess this are non portable solutions.

Related

What value of alignment should I with mkl_malloc?

The function mkl_malloc is similar to malloc but has an extra alignment argument. Here's the prototype:
void* mkl_malloc (size_t alloc_size, int alignment);
I've noticed different performances with different values of alignment. Apart from trial and error, is there a canonical or documented methodical way to decide on the best value of alignment? i.e. processor being used, function being called, operation being performed etc.
This question widely applicable to anyone who uses MKL so I'm very surprised it is not in the reference manual.
update: I have tried with mkl_sparse_spmm and have not noticed a significant difference in performance for setting the alignment to powers of 2 up to 1024 bytes, after that the performance tends to drop. I'm using an Intel Xeon E5-2683.
Alignment only affects performance when SSE/AVX instructions can be used - this is commonly true when operating with arrays as you wish to apply the same operation to a range of elements.
In general, you want to choose alignment based on the CPU, if it supports AVX2 which has 256bit registers, then you want 32 byte alignment, if it supports AVX512, then 64 bytes would be optimal.
To that end, mkl_malloc will guarantee alignment to the value you specify, however, obviously if the data are 32-byte aligned, then they are also aligned to a (16, 8, 4...)-byte boundary. The purpose of the call is to ensure this is always the case and thus avoid any potential complications.
On my machine (Linux kernel 4.17.11 running on i7 6700K), the default alignment of mkl_malloc seems to be 128-bytes (for large enough arrays, if they are too small the value seems to be 32KB), in other words, any value smaller than that has no effect on alignment, I can however input 256 and the data will be aligned to the 256-byte boundary.
In contrast, using malloc gives me 16byte alignment for 1GB of data and 32-byte alignment for 1KB, whatever the OS gives me with absolutely no preference regarding alignment.
So using mkl_malloc makes sense as it ensures you get the alignment you desire. However, that doesn't mean you should set the value to be too large, that will simply cause you to waste memory and potentially expose you to an increased number of cache misses.
In short, you want your data to be aligned to the size of the vector registers in your CPU so that you can make use of the relevant extensions. Using mkl_malloc with some parameter for alignment guarantees alignment to at least that value, it can however be more. It should be used to make sure the data are aligned the way you want, but there is absolutely no good reason to align to 1MB.
The only reason, why regardless of your input, you have no penalties / gains from specifying the alignment is that you get machine aligned memory no matter what you type in. So on your processor, which supports AVX, you are always getting 32 byte aligned memory regardless of your input.
You will also see, that whatever alignment value you go for, the memory address, which mkl_malloc, returns is divisible 32-aligned. Alternatively you may test that low level intrisics like _mm256_load_pd, which would seg fault, when a not 32 byte aligned address is used never seg fault.
Some minor details: OSX always gives you 32 byte address, independant of heap / stack when you allocate a chunk of memory, while Linux will always give you aligned memory, when allocating on heap. Stack is a matter of luck on Linux, but you exceed with small matrix size already the limit for stack allocations. I have no understanding of memory allocation on Windows.
I noticed the latter, when I was writing tests for my numerics library where I use std::vector<typename T, alignment A> for memory allocation and smaller matrix tests sometimes seg faulted on Linux.
TLDR: your alignment input is effectively discarded and you are getting machine alignment regardless.
I think there can be no "best" value for alignment. Depending on your architecture, alignment is generally a property enforced by the hardware, for optimization reasons mostly.
Coming to your specific question, it's important to state what exactly are you allocating memory for? What piece of hw accesses the memory? For e.g., I have worked with DMA engines which required the source address to be aligned to per transaction transfer size(where xfer size = 4, 8, 16, 32, 128). I also worked with vector registers where it was wise to have a 128 bit aligned load.
To summarize: It depends.

When do we need to use posix_memalign instead of malloc?

Seems posix_memalign let you choose a customized alignment,but when is that necessary?
malloc has already done the alignment work internally.
UPDATE
The exact reason I ask this is because I see nginx does this,ngx_memalign(NGX_POOL_ALIGNMENT, size, log);,here NGX_POOL_ALIGNMENT is defined as 16, nginxs.googlecode.com/svn-history/trunk/src/core/ngx_palloc.c
Basically, if you need tougher alignment than malloc will give you. Malloc generally returns a pointer aligned such, that it may be used with any of the primitive types (often, 8 bytes on common desktop machines).
However, sometimes you need memory aligned on other boundaries, for example 4K-aligned, etc. In this case, you would need memalign.
You would need this, for example,
when writing a memory manager (such as a garbage collector). In this case, it is sometimes handy to work with memory aligned on larger block sizes. This way, you can store meta-data common to all objects in a given block at the bottom of the allocated area, and access this simply by masking the least significant bits of the object pointer.
when interfacing with hardware (never done this myself, but IIRC, certain kinds of block-devices require aligned memory). See n.m.'s answer for details.
The only benefits of posix_memalign, as far as I can tell, are:
Allocating page-aligned (typically 4096 or larger alignment) memory for hardware-specific purposes.
Evil hacks where you keep the low N bits of a pointer zero so you can store an N-bit integer in the low bits. :-)
Various hardware may have alignment requirements which malloc cannot satisfy. The Linux man page gives one such example, I quote:
On many systems there are alignment
restrictions, e.g. on buffers used for
direct block device I/O. POSIX
specifies the
pathconf(path,_PC_REC_XFER_ALIGN) call
that tells what alignment is needed.
A couple of uses:
Some processors have instructions that will only work on data that is aligned on a power of two greater than or equal to the buffer size - for example bit reverse addressing instructions used in ffts (fast fourier transforms).
To align data to cache boundaries to optimize access in multiprocessing applications so that data in the same cache line isn't being accessed by two processors simultaneously.
Basically, if you don't need to do absurd levels of optimizations and/or your hardware doesn't demand that an array be on a particular boundary then you can forget about posix_memalign.

How did 16-bit C compilers work?

C's memory model, with its use of pointer arithmetic and all, seems to model flat address space. 16-bit computers used segmented memory access. How did 16-bit C compilers deal with this issue and simulate a flat address space from the perspective of the C programmer? For example, roughly what assembly language instructions would the following code compile to on an 8086?
long arr[65536]; // Assume 32 bit longs.
long i;
for(i = 0; i < 65536; i++) {
arr[i] = i;
}
How did 16-bit C compilers deal with
this issue and simulate a flat address
space from the perspective of the C
programmer?
They didn't. Instead, they made segmentation visible to the C programmer, extending the language by having multiple types of pointers: near, far, and huge. A near pointer was an offset only, while far and huge pointers were a combined segment and offset. There was a compiler option to set the memory model, which determined whether the default pointer type was near or far.
In Windows code, even today, you'll often see typedefs like LPCSTR (for const char*). The "LP" is a holdover from the 16-bit days; it stands for "Long (far) Pointer".
C memory model does not in any way imply flat address space. It never did. In fact, C language specification is specifically designed to allow non-flat address spaces.
In the most trivial implementation with segmented address space, the size of the largest continuous object would be limited by the size of the segment (65536 bytes on a 16 bit platform). This means that size_t in such implementation would be 16 bit, and that your code simply would not compile, since you are attempting to declare an object that has larger size than the allowed maximum.
A more complex implementation would support so called huge memory model. You see, there's really no problem addressing continuous memory blocks of any size on a segmented memory model, it just requires some extra efforts in pointer arithmetics. So, within the huge memory model, the implementation would make those extra efforts, which would make the code a bit slower, but at the same time would allow addressing objects of virtually any size. So, your code would compile perfectly fine.
The true 16-bit environments use 16 bit pointers which reach any address. Examples include the PDP-11, 6800 family (6802, 6809, 68HC11), and the 8085. This is a clean and efficient environment, just like a simple 32-bit architecture.
The 80x86 family forced upon us a hybrid 16-bit/20-bit address space in so-called "real mode"—the native 8086 addressing space. The usual mechanism to deal with this was enhancing the types of pointers into two basic types, near (16-bit pointer) and far (32-bit pointer). The default for code and data pointers can be set in bulk by a "memory model": tiny, small, compact, medium, far, and huge (some compilers do not support all models).
The tiny memory model is useful for small programs in which the entire space (code + data + stack) is less than 64K. All pointers are (by default) 16 bits or near; a pointer is implicitly associated with a segment value for the whole program.
The small model assumes that data + stack is less than 64K and in the same segment; the code segment contains only code, so can have up to 64K as well, for a maximum memory footprint of 128K. Code pointers are near and implicitly associated with CS (the code segment). Data pointers are also near and associated with DS (the data segment).
The medium model has up to 64K of data + stack (like small), but can have any amount of code. Data pointers are 16 bits and are implicitly tied to the data segment. Code pointers are 32 bit far pointers and have a segment value depending on how the linker has set up the code groups (a yucky bookkeeping hassle).
The compact model is the complement of medium: less than 64K of code, but any amount of data. Data pointers are far and code pointers are near.
In large or huge model, the default subtype of pointers are 32 bit or far. The main difference is that huge pointers are always automatically normalized so that incrementing them avoids problems with 64K wrap arounds. See this.
In DOS 16 bit, I dont remember being able to do that. You could have multiple things that were each 64K (bytes)(because the segment could be adjusted and the offset zeroed) but dont remember if you could cross the boundary with a single array. The flat memory space where you could willy nilly allocate whatever you wanted and reach as deep as you liked into an array didnt happen until we could compile 32 bit DOS programs (on 386 or 486 processors). Perhaps other operating systems and compilers other than microsoft and borland could generate flat arrays greater than 64kbytes. Win16 I dont remember that freedom until win32 hit, perhaps my memory is getting rusty...You were lucky or rich to have a megabyte of memory anyway, a 256kbyte or 512kbyte machine was not unheard of. Your floppy drive had a fraction of a meg to 1.44 meg eventually, and your hard disk if any had a dozen or few meg, so you just didnt compute thing that large that often.
I remember the particular challenge I had learning about DNS when you could download the entire DNS database of all registered domain names on the planet, in fact you had to to put up your own dns server which was almost required at the time to have a web site. That file was 35megabytes, and my hard disk was 100megabytes, plus dos and windows chewing up some of that. Probably had 1 or 2 meg of memory, might have been able to do 32 bit dos programs at the time. Part if it was me wanting to parse the ascii file which I did in multiple passes, but each pass the output had to go to another file, and I had to delete the prior file to have room on the disk for the next file. Two disk controllers on a standard motherboard, one for the hard disk and one for the cdrom drive, here again this stuff wasnt cheap, there were not a lot of spare isa slots if you could afford another hard disk and disk controller card.
There was even the problem of reading 64kbytes with C you passed fread the number of bytes you wanted to read in a 16 bit int, which meant 0 to 65535 not 65536 bytes, and performance dropped dramatically if you didnt read in even sized sectors so you just read 32kbytes at a time to maximize performance, 64k didnt come until well into the dos32 days when you were finally convinced that the value passed to fread was now a 32 bit number and the compiler wasnt going to chop off the upper 16 bits and only use the lower 16 bits (which happened often if you used enough compilers/versions). We are currently suffering similar problems in the 32 bit to 64 transition as we did with the 16 to 32 bit transition. What is most interesting is the code from the folks like me that learned that going from 16 to 32 bit int changed size, but unsigned char and unsigned long did not, so you adapted and rarely used int so that your programs would compile and work for both 16 and 32 bit. (The code from folks from that generation kind of stands out to other folks that also lived through it and used the same trick). But for the 32 to 64 transition it is the other way around and code not refactored to use uint32 type declarations are suffering.
Reading wallyk's answer that just came in, the huge pointer thing that wrapped around does ring a bell, also not always being able to compile for huge. small was the flat memory model we are comfortable with today, and as with today was easy because you didnt have to worry about segments. So it was a desireable to compile for small when you could. You still didnt have a lot of memory or disk or floppy space so you just didnt normally deal with data that large.
And agreeing with another answer, the segment offset thing was 8088/8086 intel. The whole world was not yet dominated by intel, so there were other platforms that just had a flat memory space, or used other tricks perhaps in hardware (outside the processor) to solve the problem. Because of the segment/offset intel was able to ride the 16 bit thing longer than it probably should have. Segment/offset had some cool and interesting things you could do with it, but it was as much a pain as anything else. You either simplified your life and lived in a flat memory space or you constantly worried about segment boundaries.
Really pinning down the address size on old x86's is sort of tricky. You could say that its 16 bit, because the arithmetic you can perform on an address must fit in a 16 bit register. You could also say that it's 32 bit, because actual addresses are computed against a 16 bit general purpose register and 16 bit segment register (all 32 bits are significant). You could also just say it's 20 bit, because the segment registers are shifted 4 bits left and added to the gp registers for hardware addressing.
It actually doesn't matter that much which one of these you chose, because they are all roughly equal approximations of the c abstract machine. Some compilers let you pick a memory model you were using per compilation, while others just assume 32 bit addresses and then carefully check that operations that could overflow 16 bits emit instructions that handle that case correctly.
Check out this wikipedia entry. About Far pointers. Basically, its possible to indicate a segment and an offset, making it possible to jump to another segment.

What is the difference between far pointers and near pointers?

Can anybody tell me the difference between far pointers and near pointers in C?
On a 16-bit x86 segmented memory architecture, four registers are used to refer to the respective segments:
DS → data segment
CS → code segment
SS → stack segment
ES → extra segment
A logical address on this architecture is written segment:offset. Now to answer the question:
Near pointers refer (as an offset) to the current segment.
Far pointers use segment info and an offset to point across segments. So, to use them, DS or CS must be changed to the specified value, the memory will be dereferenced and then the original value of DS/CS restored. Note that pointer arithmetic on them doesn't modify the segment portion of the pointer, so overflowing the offset will just wrap it around.
And then there are huge pointers, which are normalized to have the highest possible segment for a given address (contrary to far pointers).
On 32-bit and 64-bit architectures, memory models are using segments differently, or not at all.
Since nobody mentioned DOS, lets forget about old DOS PC computers and look at this from a generic point-of-view. Then, very simplified, it goes like this:
Any CPU has a data bus, which is the maximum amount of data the CPU can process in one single instruction, i.e equal to the size of its registers. The data bus width is expressed in bits: 8 bits, or 16 bits, or 64 bits etc. This is where the term "64 bit CPU" comes from - it refers to the data bus.
Any CPU has an address bus, also with a certain bus width expressed in bits. Any memory cell in your computer that the CPU can access directly has an unique address. The address bus is large enough to cover all the addressable memory you have.
For example, if a computer has 65536 bytes of addressable memory, you can cover these with a 16 bit address bus, 2^16 = 65536.
Most often, but not always, the data bus width is as wide as the address bus width. It is nice if they are of the same size, as it keeps both the CPU instruction set and the programs written for it clearer. If the CPU needs to calculate an address, it is convenient if that address is small enough to fit inside the CPU registers (often called index registers when it comes to addresses).
The non-standard keywords far and near are used to describe pointers on systems where you need to address memory beyond the normal CPU address bus width.
For example, it might be convenient for a CPU with 16 bit data bus to also have a 16 bit address bus. But the same computer may also need more than 2^16 = 65536 bytes = 64kB of addressable memory.
The CPU will then typically have special instructions (that are slightly slower) which allows it to address memory beyond those 64kb. For example, the CPU can divide its large memory into n pages (also sometimes called banks, segments and other such terms, that could mean a different thing from one CPU to another), where every page is 64kB. It will then have a "page" register which has to be set first, before addressing that extended memory. Similarly, it will have special instructions when calling/returning from sub routines in extended memory.
In order for a C compiler to generate the correct CPU instructions when dealing with such extended memory, the non-standard near and far keywords were invented. Non-standard as in they aren't specified by the C standard, but they are de facto industry standard and almost every compiler supports them in some manner.
far refers to memory located in extended memory, beyond the width of the address bus. Since it refers to addresses, most often you use it when declaring pointers. For example: int * far x; means "give me a pointer that points to extended memory". And the compiler will then know that it should generate the special instructions needed to access such memory. Similarly, function pointers that use far will generate special instructions to jump to/return from extended memory. If you didn't use far then you would get a pointer to the normal, addressable memory, and you'd end up pointing at something entirely different.
near is mainly included for consistency with far; it refers to anything in the addressable memory as is equivalent to a regular pointer. So it is mainly a useless keyword, save for some rare cases where you want to ensure that code is placed inside the standard addressable memory. You could then explicitly label something as near. The most typical case is low-level hardware programming where you write interrupt service routines. They are called by hardware from an interrupt vector with a fixed width, which is the same as the address bus width. Meaning that the interrupt service routine must be in the standard addressable memory.
The most famous use of far and near is perhaps the mentioned old MS DOS PC, which is nowadays regarded as quite ancient and therefore of mild interest.
But these keywords exist on more modern CPUs too! Most notably in embedded systems where they exist for pretty much every 8 and 16 bit microcontroller family on the market, as those microcontrollers typically have an address bus width of 16 bits, but sometimes more than 64kB memory.
Whenever you have a CPU where you need to address memory beyond the address bus width, you will have the need of far and near. Generally, such solutions are frowned upon though, since it is quite a pain to program on them and always take the extended memory in account.
One of the main reasons why there was a push to develop the 64 bit PC, was actually that the 32 bit PCs had come to the point where their memory usage was starting to hit the address bus limit: they could only address 4GB of RAM. 2^32 = 4,29 billion bytes = 4GB. In order to enable the use of more RAM, the options were then either to resort to some burdensome extended memory solution like in the DOS days, or to expand the computers, including their address bus, to 64 bits.
Far and near pointers were used in old platforms like DOS.
I don't think they're relevant in modern platforms. But you can learn about them here and here (as pointed by other answers). Basically, a far pointer is a way to extend the addressable memory in a computer. I.E., address more than 64k of memory in a 16bit platform.
A pointer basically holds addresses. As we all know, Intel memory management is divided into 4 segments.
So when an address pointed to by a pointer is within the same segment, then it is a near pointer and therefore it requires only 2 bytes for offset.
On the other hand, when a pointer points to an address which is out of the segment (that means in another segment), then that pointer is a far pointer. It consist of 4 bytes: two for segment and two for offset.
Four registers are used to refer to four segments on the 16-bit x86 segmented memory architecture. DS (data segment), CS (code segment), SS (stack segment), and ES (extra segment). A logical address on this platform is written segment:offset, in hexadecimal.
Near pointers refer (as an offset) to the current segment.
Far pointers use segment info and an offset to point across segments. So, to use them, DS or CS must be changed to the specified value, the memory will be dereferenced and then the original value of DS/CS restored. Note that pointer arithmetic on them doesn't modify the segment portion of the pointer, so overflowing the offset will just wrap it around.
And then there are huge pointers, which are normalized to have the highest possible segment for a given address (contrary to far pointers).
On 32-bit and 64-bit architectures, memory models are using segments differently, or not at all.
Well in DOS it was kind of funny dealing with registers. And Segments. All about maximum counting capacities of RAM.
Today it is pretty much irrelevant. All you need to read is difference about virtual/user space and kernel.
Since win nt4 (when they stole ideas from *nix) microsoft programmers started to use what was called user/kernel memory spaces.
And avoided direct access to physical controllers since then. Since then dissapered a problem dealing with direct access to memory segments as well. - Everything became R/W through OS.
However if you insist on understanding and manipulating far/near pointers look at linux kernel source and how it works - you will newer come back I guess.
And if you still need to use CS (Code Segment)/DS (Data Segment) in DOS. Look at these:
https://en.wikipedia.org/wiki/Intel_Memory_Model
http://www.digitalmars.com/ctg/ctgMemoryModel.html
I would like to point out to perfect answer below.. from Lundin. I was too lazy to answer properly. Lundin gave very detailed and sensible explanation "thumbs up"!

Alignment restrictions for malloc()/free()

Older K&R (2nd ed.) and other C-language texts I have read that discuss the implementation of a dynamic memory allocator in the style of malloc() and free() usually also mention, in passing, something about data type alignment restrictions. Apparently certain computer hardware architectures (CPU, registers, and memory access) restrict how you can store and address certain value types. For example, there may be a requirement that a 4 byte (long) integer must be stored beginning at addresses that are multiples of four.
What restrictions, if any, do major platforms (Intel & AMD, SPARC, Alpha) impose for memory allocation and memory access, or can I safely ignore aligning memory allocations on specific address boundaries?
Sparc, MIPS, Alpha, and most other "classical RISC" architectures only allow aligned accesses to memory, even today. An unaligned access will cause an exception, but some operating systems will handle the exception by copying from the desired address in software using smaller loads and stores. The application code won't know there was a problem, except that the performance will be very bad.
MIPS has special instructions (lwl and lwr) which can be used to access 32 bit quantities from unaligned addresses. Whenever the compiler can tell that the address is likely unaligned it will use this two instruction sequence instead of a normal lw instruction.
x86 can handle unaligned memory accesses in hardware without an exception, but there is still a performance hit of up to 3X compared to aligned accesses.
Ulrich Drepper wrote a comprehensive paper on this and other memory-related topics, What Every Programmer Should Know About Memory. It is a very long writeup, but filled with chewy goodness.
Alignment is still quite important today. Some processors (the 68k family jumps to mind) would throw an exception if you tried to access a word value on an odd boundary. Today, most processors will run two memory cycles to fetch an unaligned word, but this will definitely be slower than an aligned fetch. Some other processors won't even throw an exception, but will fetch an incorrect value from memory!
If for no other reason than performance, it is wise to try to follow your processor's alignment preferences. Usually, your compiler will take care of all the details, but if you're doing anything where you lay out the memory structure yourself, then it's worth considering.
You still need to be aware of alignment issues when laying out a class or struct in C(++). In these cases the compiler will do the right thing for you, but the overall size of the struct/class may be more wastefull than necessary
For example:
struct
{
char A;
int B;
char C;
int D;
};
Would have a size of 4 * 4 = 16 bytes (assume Windows on x86) whereas
struct
{
char A;
char C;
int B;
int D;
};
Would have a size of 4*3 = 12 bytes.
This is because the compiler enforces a 4 byte alignment for integers, but only 1 byte for chars.
In general pack member variables of the same size (type) together to minimize wasted space.
As Greg mentioned it is still important today (perhaps more so in some ways) and compilers usually take care of the alignment based on the target of the architecture. In managed environments, the JIT compiler can optimize the alignment based on the runtime architecture.
You may see pragma directives (in C/C++) that change the alignment. This should only be used when very specific alignment is required.
// For example, this changes the pack to 2 byte alignment.
#pragma pack(2)
Note that even on IA-32 and the AMD64, some of the SSE instructions/intrinsics require aligned data. These instructions will throw an exception if the data is unaligned, so at least you won't have to debug "wrong data" bugs. There are equivalent unaligned instructions as well, but like Denton says, they're are slower.
If you're using VC++, then besides the #pragma pack directives, you also have the __declspec(align) directives for precise alignment. VC++ documentation also mentions an __aligned_malloc function for specific alignment requirements.
As a rule of thumb, unless you are moving data across compilers/languages or are using the SSE instructions, you can probably ignore alignment issues.

Resources