Seems posix_memalign let you choose a customized alignment,but when is that necessary?
malloc has already done the alignment work internally.
UPDATE
The exact reason I ask this is because I see nginx does this,ngx_memalign(NGX_POOL_ALIGNMENT, size, log);,here NGX_POOL_ALIGNMENT is defined as 16, nginxs.googlecode.com/svn-history/trunk/src/core/ngx_palloc.c
Basically, if you need tougher alignment than malloc will give you. Malloc generally returns a pointer aligned such, that it may be used with any of the primitive types (often, 8 bytes on common desktop machines).
However, sometimes you need memory aligned on other boundaries, for example 4K-aligned, etc. In this case, you would need memalign.
You would need this, for example,
when writing a memory manager (such as a garbage collector). In this case, it is sometimes handy to work with memory aligned on larger block sizes. This way, you can store meta-data common to all objects in a given block at the bottom of the allocated area, and access this simply by masking the least significant bits of the object pointer.
when interfacing with hardware (never done this myself, but IIRC, certain kinds of block-devices require aligned memory). See n.m.'s answer for details.
The only benefits of posix_memalign, as far as I can tell, are:
Allocating page-aligned (typically 4096 or larger alignment) memory for hardware-specific purposes.
Evil hacks where you keep the low N bits of a pointer zero so you can store an N-bit integer in the low bits. :-)
Various hardware may have alignment requirements which malloc cannot satisfy. The Linux man page gives one such example, I quote:
On many systems there are alignment
restrictions, e.g. on buffers used for
direct block device I/O. POSIX
specifies the
pathconf(path,_PC_REC_XFER_ALIGN) call
that tells what alignment is needed.
A couple of uses:
Some processors have instructions that will only work on data that is aligned on a power of two greater than or equal to the buffer size - for example bit reverse addressing instructions used in ffts (fast fourier transforms).
To align data to cache boundaries to optimize access in multiprocessing applications so that data in the same cache line isn't being accessed by two processors simultaneously.
Basically, if you don't need to do absurd levels of optimizations and/or your hardware doesn't demand that an array be on a particular boundary then you can forget about posix_memalign.
Related
How to get the memory granularity of a CPU in C?
Suppose I want to allocate an array where all the elements are properly memory aligned. I can pad each element to a certain size N to achieve this. How do I know the value of N?
Note: I am trying to create a memory pool where each slot is memory aligned. Any suggestion will be appreciated.
In Theory
How to get the memory granularity of a CPU in C?
First, you read the instruction set architecture manual. It may specify that certain instructions require certain alignments, or even that the addressing forms in certain instructions cannot represent non-aligned addresses. It may specify other properties regarding alignment.
Second, you read the processor manual. It may specify performance characteristics (such as that unaligned loads or stores are supported but may be slower or use more resources than aligned loads or stores) and may specify various options allowed by the instructions set architecture.
Third, you read the operating system documentation. Some architectures allow the operating system to select features related to alignment, such as whether unaligned loads and stores are made to fail or are supported albeit with slower performance than aligned loads or stores. The operating system documentation should have this information.
In Practice
For many programming situations, what you need to know is not the “memory granularity” of a CPU but the alignment requirements of the C implementation you are using (or of whatever language you are using). And, for the most part, you do not need to know the alignment requirements directly but just need to follow the language rules about managing objects—use objects with declared types, do not use casts to convert pointers between incompatible types exceed where specific rules allow it, use the suitably aligned memory as provided by malloc rather than adjusting your own pointers to bytes, and so on. Following these rules will give good alignment for the objects in your program.
In C, when you define an array, the element size will automatically be the size that C implementation needs for its alignment. For example, long double x[100]; may use 16 bytes for each array element even though the hardware uses only ten bytes for a long double. Or, for any struct foo that you define, the compiler will automatically include padding as needed in the structure to give the desired alignment, and any array struct foo x[100]; will already include that padding. sizeof(struct foo) will be the same as sizeof x[0], because each structure object has that padding built in, even just for a single structure object, not just for elements in arrays.
When you do need to know the alignment that a C implementation requires for a type, you can use C’s _Alignof operator. The expression _Alignof(type) provides the alignment required for type.
Other
… properly memory aligned.
Proper alignment is a matter of degrees:
What the processor supports may determine whether your program works or does not work. An improper alignment is one that causes your program to trap.
What is efficient with respect to individual loads and stores may affect how fast your program runs. An improper alignment is one that causes your program to execute more slowly.
In certain performance-critical situations, alignment with respect to cache and memory mapping features can also affect performance.
Short answer
Use 64 bytes.
Long answer
Data are loaded from and stored to memory in units called cache lines. If your program loads only part of the data in a cache line, then the whole line will be loaded into the CPU caches. Perhaps more importantly, the algorithm used for moving data between cores in a multi-core CPU operates on full cache lines; aligning your data to cache lines avoids false sharing, the situation where a cache line bounces between cores because it contains data manipulated by different threads.
It used to be the case that cache lines depended on the architecture, ranging from 16 up to 512 bytes. However, all current processors (Intel, AMD, ARM, MIPS) use a cache line of 64 bytes.
This depends heavily on the cpu microarchitecture that you are using.
In many cases, the memory address of an operator should be a multiple of the operand size, otherwise execution will be slow (or even might throw an exception).
But there are also CPUs which do not care about a specific alignment of the operands in memory at all.
Usually, the C compiler will care about those details for you. You should, however, make sure that the compiler assumes the correct target (micro-)architecture, for example by specifying it with the correct compiler flags (-march=? on gcc).
I am working on the SW for an embedded system and trying to understand some low-level details that was setup by an earlier developer. The target platform is a custom made OpenRISC 1200 processor, synthesized in a FPGA. The software is built using a GCC based cross-compiler.
Among the compiler flags I find this one: -falign-functions=16. There is a comment in the build configuration saying:
On Open RISC 1200, function alignment needs to be on a cache boundary (16 bytes). If not, performance suffer severely.
I realize my understanding of cache memories are a bit shallow and I should probably read something like: What Every Programmer Should Know About Memory. I haven't yet, but I will. With that said, I have some questions:
I understand that this is about minimizing cache misses in the instruction cache, but why is that achieved by setting the function alignment to the instruction cache line size (i.e. 16 bytes)?
If this is the most memory efficient way, wouldn't you expect this to be the default setting for function alignment in the cross-compiler? I mean, for a more common platform like x86, amd64 or ARM you don't need to care about function alignments (or am I wrong?).
Most architectures have aspects of memory access and instructions that can depend on alignment.
but why is that achieved by setting the function alignment to the instruction cache line size
The CPU will fetch complete cache lines from memory (as if the memory is divided into these larger blocks rather than bytes). So if all the data you need fits in one cache line, there is just one fetch, but if you have even just 2 bytes of data, but one byte is the end of a cache line and the other byte the start of the next, well now it has to load in two complete cache lines. This wastes space in the small CPU cache, and more memory transfers.
A quick search indicates that the OpenRISC 1200 uses a 16 byte cache line, so when targeting that specifically, aligning the start of any data you have on those 16 byte multiples helps avoid straddling two lines within one function / piece of data.
If this is the most memory efficient way, wouldn't you expect this to be the default setting for function alignment in the cross-compiler?
There can be more to it than that. Firstly, this alignment is achieved by wasting "padding" memory. If you would have used 1 byte of a cache line calling a function, then another 15 bytes are wasted to reach the 16 byte boundary.
Also in the case of a function call, there is a reasonable chance that memory will be in cache anyway, and jumping forward might leave the cached memory, causing a load that would otherwise not be needed.
So this leaves a trade off, functions that use little stack space and return quickly, might not benefit much from the extra alignment, but a function that runs for longer and uses more stack space might benefit by not "wasting" cache space on the "previous function".
Another reason alignment is often desired is when dealing with instructions that either require it outright (fail on an unaligned address), or are much slower (with loads/stores getting split up into parts), or maybe some other effects (like a load/store not being atomic if not properly aligned).
With a quick search I believe the general alignment requirement on OR1200 appears to be 4 bytes, even for 8 byte types. So in this respect an alignment of at least 4 would seem desirable, and 8 or 16 might only provide a benefit in certain cases mentioned before.
I am not familiar with Open RISC specifically, but on some platforms instructions added at a later date (e.g. 16byte / 128bit SSE instructions) require or benefit from an alignment greater than what was the default (I believe AMD64 upped the default alignment to 16, but then later AVX came wanting 32 byte alignment).
The function mkl_malloc is similar to malloc but has an extra alignment argument. Here's the prototype:
void* mkl_malloc (size_t alloc_size, int alignment);
I've noticed different performances with different values of alignment. Apart from trial and error, is there a canonical or documented methodical way to decide on the best value of alignment? i.e. processor being used, function being called, operation being performed etc.
This question widely applicable to anyone who uses MKL so I'm very surprised it is not in the reference manual.
update: I have tried with mkl_sparse_spmm and have not noticed a significant difference in performance for setting the alignment to powers of 2 up to 1024 bytes, after that the performance tends to drop. I'm using an Intel Xeon E5-2683.
Alignment only affects performance when SSE/AVX instructions can be used - this is commonly true when operating with arrays as you wish to apply the same operation to a range of elements.
In general, you want to choose alignment based on the CPU, if it supports AVX2 which has 256bit registers, then you want 32 byte alignment, if it supports AVX512, then 64 bytes would be optimal.
To that end, mkl_malloc will guarantee alignment to the value you specify, however, obviously if the data are 32-byte aligned, then they are also aligned to a (16, 8, 4...)-byte boundary. The purpose of the call is to ensure this is always the case and thus avoid any potential complications.
On my machine (Linux kernel 4.17.11 running on i7 6700K), the default alignment of mkl_malloc seems to be 128-bytes (for large enough arrays, if they are too small the value seems to be 32KB), in other words, any value smaller than that has no effect on alignment, I can however input 256 and the data will be aligned to the 256-byte boundary.
In contrast, using malloc gives me 16byte alignment for 1GB of data and 32-byte alignment for 1KB, whatever the OS gives me with absolutely no preference regarding alignment.
So using mkl_malloc makes sense as it ensures you get the alignment you desire. However, that doesn't mean you should set the value to be too large, that will simply cause you to waste memory and potentially expose you to an increased number of cache misses.
In short, you want your data to be aligned to the size of the vector registers in your CPU so that you can make use of the relevant extensions. Using mkl_malloc with some parameter for alignment guarantees alignment to at least that value, it can however be more. It should be used to make sure the data are aligned the way you want, but there is absolutely no good reason to align to 1MB.
The only reason, why regardless of your input, you have no penalties / gains from specifying the alignment is that you get machine aligned memory no matter what you type in. So on your processor, which supports AVX, you are always getting 32 byte aligned memory regardless of your input.
You will also see, that whatever alignment value you go for, the memory address, which mkl_malloc, returns is divisible 32-aligned. Alternatively you may test that low level intrisics like _mm256_load_pd, which would seg fault, when a not 32 byte aligned address is used never seg fault.
Some minor details: OSX always gives you 32 byte address, independant of heap / stack when you allocate a chunk of memory, while Linux will always give you aligned memory, when allocating on heap. Stack is a matter of luck on Linux, but you exceed with small matrix size already the limit for stack allocations. I have no understanding of memory allocation on Windows.
I noticed the latter, when I was writing tests for my numerics library where I use std::vector<typename T, alignment A> for memory allocation and smaller matrix tests sometimes seg faulted on Linux.
TLDR: your alignment input is effectively discarded and you are getting machine alignment regardless.
I think there can be no "best" value for alignment. Depending on your architecture, alignment is generally a property enforced by the hardware, for optimization reasons mostly.
Coming to your specific question, it's important to state what exactly are you allocating memory for? What piece of hw accesses the memory? For e.g., I have worked with DMA engines which required the source address to be aligned to per transaction transfer size(where xfer size = 4, 8, 16, 32, 128). I also worked with vector registers where it was wise to have a 128 bit aligned load.
To summarize: It depends.
While reading the CERT C Coding Standard, I came across DCL39-C which discusses why it's generally a bad idea for something like the Linux kernel to return an unpacked struct to userspace due to information leakage.
In a nutshell, structs aren't generally packed by default and the padding bytes between members of a struct often contain uninitialized data, hence the information leakage.
Why aren't structs packed by default? There was a mention in the guide that it's an optimization feature of compilers for specific architectures, I believe. Why is aligning structs to a certain byte size more efficient, as it wastes memory space?
Also, why doesn't the C standard specify a standardized way of asking for a packed struct? I can ask GCC using __attribute__((packed)), and there are other ways for different compilers, but it seems like a feature that'd be nice to have as part of the standard.
Data is carried though electronic circuits by groups of parallel wires (buses). Likewise, the circuits themselves tend to be arrayed in parallel. The physical distance between parallel components adds resistance and capacitance to any crosswise wires that bridge them. So, such bridges tend to be expensive and slow, and computer architects avoid them when possible.
Unaligned loads require shifting bytes across lanes. Some CPUs (e.g. efficiency-oriented RISC) are physically incapable of doing this, because the bridge component doesn't exist. Some will detect the condition and interpose a lane shift at the expense of a cycle or two. Others can handle misalignment without a speed penalty… assuming paged memory doesn't add another problem.
There's another, completely different issue. The memory management unit (MMU) sits between the CPU execution core and the memory bus, translating program-visible logical addresses to the physical addresses for memory chips. Two adjacent logical addresses might reside on different chips. The MMU is optimized for the common case where an access only requires one translation, but a misaligned access may require two.
A misaligned access straddling a page boundary might incur an exception, which might be fatal inside a kernel. Since pages are relatively large, this condition is relatively rare. It might evade tests, and it may be non-deterministic.
TL;DR: Packed structures shouldn't be used for active program state, especially not in a kernel. They may be used for serialization and communication, but safe usage is another question.
Leaving structs "unpacked" allows the compiler to align all members so that operations are more efficient on those members (measured in terms of clock time, number of instructions, etc). The alignment requirement for types depends on the host architecture and (for struct types) on the alignment requirement of contained members.
Packing struct members forces some (if not all) members to be aligned in a way that is sub-optimal for performance. In some worst cases - depending on host architecture - operations on unaligned variables (or on unaligned struct members) triggers a processor fault condition. RISC processor architectures, for example, generate an alignment fault when a load or store operation affects an unaligned address. Some SSE instructions on recent x86 architectures require data they act on to be aligned on 16 byte boundaries.
In best cases, the operations behave as intended, but less efficiently, due to overhead of copying an unaligned variable to an aligned location or to a register, doing the operation there, and copying it back. Those copying operations are less efficient when unaligned variables are involved - after all, the processor architecture is optimised for performance when variable alignment meets its design requirements.
If you are worried about data leaking out of your program, simply use functions like memset() to overwrite the contents of your structures at the end of their lifetime (e.g. just before an instance is about to pass out of scope, or immediately before dynamically allocated memory is deallocated using free()).
Or use an operating system (like OpenBSD) which does overwrite memory before making it available to processes or programs. Bear in mind that such features tend to make both the operating system and programs it hosts run less efficiently.
Recent versions of the C standard (since 2011) do have some facilities to query and control alignment of variables (and affect packing of struct members). The default is whatever alignment is most effective for the host architecture - which for struct types normally means unpacked.
On some compilers such as Microchip XC8, all structs are indeed always packed.
On some platforms compilers will only generate byte access instructions to access members of a packed struct, because byte access instructions are always aligned. If all structs are packed, the 16-, 32-, and 64- bit load/store instructions are not used. This is an obvious waste of resources.
The C standard does not specify a way of packing struct possibly because the standard itself is not aware of the concept of packing. Since the layout of non-bit-field members of structs is implementation defined, out of scope for the standard. Or possibly, the standard is made to support architectures that always add padding in structs, since such architectures are indeed theoretically feasible.
When calloc is being used pointers to newly allocated memory are aligned to at least certian number of the least significant bits, meaning that least significent bits(as tagged pointeres) can be used for lock-free algorithms, and in fact is commonly used in case of those algorithms. I was testing memory menagment feature on linux ubuntu server( x86_64 GNU/Linux, 3.10.23-xxxx-std-ipv6-64-vps) and it seems, from my experiments, that the 4 least significant bits are set to 0. From what i have read it states that pointer alignment is formed in such a way for pointer expressed as uintptr to be divided by 4(alignment to 2 least significant bits)
What is the minimum number of the least significant bits in newly allocated memory pointers, obtained from memory menagment system in POSIX (linux), that are always set to 0 during initial memory allocation process?
What is the maximum number of the least significant bits that can be used as tagged pointers on linux systems (eg. lock-free algorithms)?
How to force compiler to align newly allocated pointers to exect number of the least significant bits ?
Does the alignment of pointers affect system overall performance, and how ?
Alignment is important in optimization for many related reasons:
efficient usage of the cache lines
avoid to disable the prefetching logics
best usage of vector registers/instructions (SSE, AVX).
especially when I/O is concerned, also memory page alignment can be important.
You can find very good references for Intel architecture here:
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Answering quickly to your questions:
What is the minimum number of the least significant bits in newly
allocated memory pointers, obtained from memory menagment system in
POSIX (linux), that are always set to 0 during initial memory
allocation process?
It actually depends on the CPU/architecture you are speaking of.
What is the maximum number of the least significant bits that can be
used as tagged pointers on linux systems (eg. lock-free algorithms)?
The same as the former: you should use std::atomic or boost::atomic in order to have some sort of portability, if C++ is an option.
On Intel architectures, memory load and stores are atomic for 32 bit, on x86_32, for 64 on x86_64, if data are properly aligned.
If you are really enjoying this kind of low level, don't forget to have a look into memory semantics, memory fences and so on ("Fence instructions" in the above manual)
I'm afraid I can't answer your whole question, but I can make a start:
Pointer alignment might not only change performance but also necessary to make your code work at all. Especially for things like ARM processors you can't read numbers larger then 1 byte if the pointer is unaligned. Doing this will result in an error.
If I, for example, work with a big data-stream I prefer have my data aligned so I can read more bytes at the same time, instead having to read byte for byte what will cost more time/CPU.
on x86/x86_64 architecture reading/writing to unaligned memory is paid with a performance cost, because you will need two memory ops instead of a single one: the bus operations to/from memory are always aligned.
On GNU/Linux you can use posix_memalign & C. to get heap aligned memory (man memalign) in user space.
Some compilers also supports macro to get aligned memory on the stack, for instance
/* GCC align declarator */
#define MYMEMALIGN(x, y) x __attribute__( (aligned( y )) )
#endif
but I guess this are non portable solutions.