I'm trying to re-implement malloc and I need to understand the purpose of the alignment. As I understand it, if the memory is aligned, the code will be executed faster because the processor won't have to take an extra step to recover the bits of memory that are cut. I think I understand that a 64-bit processor reads 64-bit by 64-bit memory. Now, let's imagine that I have a structure with in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be on an address which is a multiple of 2? Same question for the integers and other types?
I also have a second question: With the structure I mentioned before, how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
The effects can even include correctness, not just performance: C Undefined Behaviour (UB) leading to possible segfaults or other misbehaviour, for example if you have a short object that doesn't satisfy alignof(short). (Faulting is expected on ISAs where load/store instructions require alignment by default, like SPARC, and MIPS before MIPS64r6. Faults are possible even on x86 after compiler optimization of loops, even though x86 asm allows unaligned loads/stores except for some SIMD instructions with 16-byte or wider accesses.)
Or tearing of atomic operations if an _Atomic int doesn't have alignof(_Atomic int).
(Typically alignof(T) = sizeof(T) up to some size, often register width or wider, in any given ABI).
malloc should return memory with alignof(max_align_t) because you don't have any type info about how the allocation will be used.
For allocations smaller than sizeof(max_align_t), you can return memory that's merely naturally aligned (e.g. a 4-byte allocation aligned by 4 bytes) if you want, because you know that storage can't be used for anything with a higher alignment requirement.
Over-aligned stuff like the dynamically-allocated equivalent of alignas(16) int32_t foo needs to use a special allocator like C11 aligned_alloc. If you're implementing your own allocator library, you probably want to support aligned_realloc and aligned_calloc, filling those gaps that ISO C leaves for no apparent reason.
And make sure you don't implement the braindead ISO C++17 requirement for aligned_alloc to fail if the allocation size isn't a multiple of the alignment. Nobody wants an allocator that rejects an allocation of 101 floats starting on a 16-byte boundary, or much larger for better transparent hugepages. See aligned_alloc function requirements and How to solve the 32-byte-alignment issue for AVX load/store operations?
I think I understand that a 64-bit processor reads 64-bit by 64-bit memory
Nope. Data bus width and burst size, and load/store execution unit max width or actually-used width, don't have to be the same as width of integer registers, or however the CPU defines its bitness. (And in modern high performance CPUs typically aren't. e.g. 32-bit P5 Pentium had a 64-bit bus; modern 32-bit ARM has load/store-pair instructions that do atomic 64-bit accesses.)
Processors read whole cache lines from DRAM / L3 / L2 cache into L1d cache; 64 bytes on modern x86; 32 bytes on some other systems.
And when reading individual objects or array elements, they read from L1d cache with the element width. e.g. a uint16_t array may only benefit from alignment to a 2-byte boundary for 2-byte loads/stores.
Or if a compiler vectorizes a loop with SIMD, a uint16_t array can be read 16 or 32 bytes at a time, i.e. SIMD vectors of 8 or 16 elements. (Or even 64 with AVX512). Aligning arrays to the expected vector width can be helpful; unaligned SIMD load/store run fast on modern x86 when they don't cross a cache-line boundary.
Cache-line splits and especially page-splits are where modern x86 slows down from misalignment; unaligned within a cache line generally not because they spend the transistors for fast unaligned load/store. Some other ISAs slow down, and some even fault, on any misalignment, even within a cache line. The solution is the same: give types natural alignment: alignof(T) = sizeof(T).
In your struct example, modern x86 CPUs will have no penalty even though the short is misaligned. alignof(int) = 4 in any normal ABI, so the whole struct has alignof(struct) = 4, so the char;short;char block starts at a 4-byte boundary. Thus the short is contained within a single 4-byte dword, not crossing any wider boundary. AMD and Intel both handle this with full efficiency. (And the x86 ISA guarantees that accesses to it are atomic, even uncached, on CPUs compatible with P5 Pentium or later: Why is integer assignment on a naturally aligned variable atomic on x86?)
Some non-x86 CPUs would have penalties for the misaligned short, or have to use other instructions. (Since you know the alignment relative to an aligned 32-bit chunk, for loads you'd probably do a 32-bit load and shift.)
So yes, there's no problem accessing one single word containing the short, but the problem is for the load-port hardware to extract and zero-extend (or sign-extend) that short into a full register. This is where x86 spends the transistors to make this fast. (@Eric's answer on a previous version of this question goes into more detail about the shifting required.)
Committing an unaligned store back into cache is also non-trivial. For example, L1d cache might have ECC (error-correction against bit flips) in 32-bit or 64-bit chunks (which I'll call "cache words"). Writing only part of a cache word is thus a problem for that reason, as well as for shifting it to an arbitrary byte boundary within the cache word you want to access. (Coalescing of adjacent narrow stores in the store buffer can produce a full-width commit that avoids an RMW cycle to update part of a word, in caches that handle narrow stores that way). Note that I'm saying "word" now because I'm talking about hardware that's more word-oriented instead of being designed around unaligned loads/stores the way modern x86 is. See Are there any modern CPUs where a cached byte store is actually slower than a word store? (storing a single byte is only slightly simpler than an unaligned short)
(If the short spans two cache words, it would of course need two separate RMW cycles, one for each byte.)
And of course the short is misaligned for the simple reason that alignof(short) = 2 and it violates this ABI rule (assuming an ABI that does have that). So if you pass a pointer to it to some other function, you could get into trouble. Especially on CPUs that have fault-on-misaligned loads, instead of hardware handling that case when it turns out to be misaligned at runtime. Then you can get cases like Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? where GCC auto-vectorization expected to reach a 16-byte boundary by doing some multiple of 2-byte elements scalar, so violating the ABI leads to a segfault on x86 (which is normally tolerant of misalignment.)
For the full details on memory access, from DRAM RAS / CAS latency up to cache bandwidth and alignment, see What Every Programmer Should Know About Memory? It's pretty much still relevant / applicable
Also Purpose of memory alignment has a nice answer. There are plenty of other good answers in SO's memory-alignment tag.
For a more detailed look at (somewhat) modern Intel load/store execution units, see: https://electronics.stackexchange.com/questions/329789/how-can-cache-be-that-fast/329955#329955
how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
It doesn't, other than the fact it's running instructions which treat the data that way.
In asm / machine-code, everything is just bytes. Every instruction specifies exactly what to do with which data. It's up to the compiler (or human programmer) to implement variables with types, and the logic of a C program, on top of a raw array of bytes (main memory).
What I mean by that is that in asm, you can run any load or store instruction you want to, and it's up to you to use the right ones on the right addresses. You could load 4 bytes that overlap two adjacent int variables into a floating-point register, and run addss (single-precision FP add) on it, and the CPU won't complain. But you probably don't want to, because making the CPU interpret those 4 bytes as an IEEE754 binary32 float is unlikely to be meaningful.
Modern processors and memory are built to optimize memory access as much as possible. One current way of accessing memory is to address it not byte by byte, but by the address of a bigger block, e.g. in 8-byte blocks. You do not need the 3 lower bits of the address this way. To access a certain byte within the block, the processor has to fetch the block at the aligned address, then shift and mask out the byte. So, it gets slower.
When fields in the struct are not aligned, there is a risk of slowing down the access to them. Therefore, it is better to align them.
But the alignment requirements are based on the underlying platform. For systems which support word access (32 bit), 4-byte alignment is ok; otherwise 8-byte or some other alignment can be used. The compiler (and libc) knows the requirements.
So, in your example char, short, char, the short will start at an odd byte position if not padded. To access it, the system might need to read the 64 bit word for the struct, then shift it 1 byte right and then mask 2 bytes in order to provide you with that short.
As I understand it, if the memory is aligned, the code will be executed faster because the processor won't have to take an extra step to recover the bits of memory that are cut.
It's not necessarily an execution thing. x86 has variable-length instructions, starting with single-byte instructions on up to a handful of bytes, so instruction fetch is inherently unaligned, but they have taken measures to smooth that out for the most part.
If I have a 64 bit bus on the edge of my processor, that doesn't mean the edge of the chip, it means the edge of the core. The other side of this is a memory controller that knows the bus protocol and is the first place the addresses start to be decoded and the transactions start to split up down other buses toward their destination.
It is very much architecture and bus design specific, and you can have architectures with different buses over time or different versions; you can get an ARM with a 64 bit bus or a 32 bit bus, for example. But let's say we have a not atypical situation where the bus is 64 bits wide and all transactions on that bus are aligned on a 64 bit boundary.
If I were to do a 64 bit write to 0x1000, that would be a single bus transaction, which these days is some sort of write-address bus cycle with some id x and a length of 0 (n-1); then the other side acks: I see you want to do a write with id x, I am ready to take your data. Then the processor uses the data bus with id x to send the data, one clock per 64 bits. This is a single 64 bit write, so one clock on that bus, and maybe an ack comes back or maybe not.
But if I wanted to do a 64 bit write to 0x1004, that turns into two transactions: one complete 64 bit address/data transaction at address 0x1000 with only four byte lanes enabled, lanes 4-7 (representing the bytes at addresses 0x1004-0x1007), then a complete transaction at 0x1008 with 4 byte lanes enabled, lanes 0-3. So the actual data movement across the bus goes from one clock to two, and there is also twice the overhead of the handshakes to get to those data cycles. On that bus it is very noticeable; depending on the overall system design you may feel it or not, or may have to do many of them to feel it. But the inefficiency is there, buried in the noise or not.
I think I understand that a 64-bit processor reads 64-bit by 64-bit memory.
Not a good assumption at all. 32 bit ARMs have 64 bit buses these days; the ARMv6 and ARMv7 cores, for example, come with them or can.
Now, let's imagine that I have a structure with in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be on an address which is a multiple of 2? Same question for the integers and other types?
unsigned char a 0x1000
unsigned short b 0x1001
unsigned char c 0x1003
unsigned int d 0x1004
You would normally use the structure items in the code as something.a, something.b, something.c, something.d. When you access something.b, that is a 16 bit transaction against the bus. In a 64 bit system you are correct that, if aligned as I have addressed it, the whole structure is being read when you do x = something.b, but the processor is going to discard all but byte lanes 1 and 2 (discarding 0 and 3-7); then if you access something.c it will do another bus transaction at 0x1000 and discard all but lane 3.
When you do a write to something.b with a 64 bit bus, only byte lanes 1 and 2 are enabled. Now where more pain comes in is if there is a cache: it is likely also constructed of 64 bit wide RAM to mate up with this bus; it doesn't have to be, but let's assume it is. You want to write through the cache to something.b: a write transaction at 0x1000 with byte lanes 1 and 2 enabled, 0 and 3-7 disabled. The cache ultimately gets this transaction and internally has to do a read-modify-write, because it is not a full 64 bit wide transaction (all lanes enabled), so you are taking a hit with that read-modify-write from a performance perspective as well (the same was true for the unaligned 64 bit write above).
The short is unaligned because, when packed, the lsbit of its address is set; to be aligned, a 16 bit item in a byte-addressable world needs the lsbit of its address to be zero, for a 32 bit item to be aligned the lower two bits of its address are zero, for 64 bit, three zeros, and so on.
Depending on the system you may end up on a 32 or 16 bit bus (not for memory so much these days) so you can end up with the multiple transfers thing.
Your highly efficient processors like MIPS and ARM took the approach of aligned instructions, and forced aligned transactions even in the something.b case that specifically doesn't have a penalty on either a 32 or a 64 bit bus. The approach is performance over memory consumption, so the instructions are to some extent wasteful in their consumption to be more efficient in their fetching and execution. The data bus is likewise much simpler. When high level concepts like a struct in C are constructed, there is memory waste in padding to align each item in the struct to gain performance.
unsigned char a 0x1000
unsigned short b 0x1002
unsigned char c 0x1004
unsigned int d 0x1008
as an example
I also have a second question: With the structure I mentioned before, how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
unsigned char c 0x1003
The compiler generates a single byte-sized read at address 0x1003; this turns into that specific instruction with that address, and the processor generates the bus transaction to do it. The other side of the processor bus then does its job, and so on down the line.
The compiler in general does not turn a packed version of that struct into a single 64 bit transaction that gives you all of the items, you burn a 64 bit bus transaction for each item.
It is possible, depending on the instruction set, prefetcher, caches and so on, that instead of using a struct at a high level you create a single 64 bit integer and do the unpacking in code; then you might or might not gain performance. This is not expected to perform better on most architectures running with caches and such, but when you get into embedded systems, where you may have some number of wait states on the RAM or on the flash or whatever code storage there is, you can find times where, instead of fewer instructions and more data transactions, you want more instructions and fewer data transactions. Code is linear, and a code section like this (read, mask and shift, mask and shift, etc.) may benefit from the instruction storage's burst mode for linear fetches, but data transactions take as many clocks as they take.
A middle ground is to just make everything a 32 bit variable or a 64 bit, then it is all aligned and performs relatively well at the cost of more memory used.
Because folks don't understand alignment, have been spoiled by x86 programming, and choose to use structs across compile domains (such a bad idea), ARMs and others now tolerate unaligned accesses. You can very much feel the performance hit on those platforms, as they are so efficient if everything is aligned, but when you do something unaligned it just generates more bus transactions, making everything take longer. The older ARMs would fault by default; the ARM7 could have the fault disabled but would rotate the data around the word (a nice trick for swapping 16 bit values in a word) rather than spill over into the next word. Later architectures default to not faulting on unaligned accesses, or most folks set them not to fault, and they read/write the unaligned transfers as one would hope/expect.
For every x86 chip you have in your computer, you have several if not handfuls of non-x86 processors in that same computer or in peripherals hanging off that computer (mouse, keyboard, monitor, etc). A lot of those are 8-bit 8051s and Z80s, but a lot of them are also ARM-based. So there is lots of non-x86 development going on, not just in all the phones' and tablets' main processors. Those other designs want to be low cost and low power, so they need more efficiency in the coding: both in bus performance, so the clock can be slower, and in the overall balance of code/data usage, to reduce the cost of the flash/RAM.
It is quite difficult to force these alignment issues on an x86 platform; there is a lot of overhead to overcome its architectural issues. But you can see this on more efficient platforms. It's like a train vs a sports car: something falls off a train, a person jumps off or on, there is so much momentum it's not noticed one bit; but step-change the mass on the sports car and you will feel it. So trying to do this on an x86, you are going to have to work a lot harder, if you can even figure out how to do it. But on other platforms it's easier to see the effects. Unless you find an 8086 chip; I suspect you could feel the differences there, but I would have to pull out my manual to confirm.
If you are lucky enough to have access to chip sources/simulations, then you can see this kind of thing happening all over the place and can really start to hand tune your program (for that platform). Likewise you can see what caching, write buffering, instruction prefetching in its various forms and so on do for overall performance, and at times create parallel periods of time where other not-so-efficient transactions can hide, and/or intentional spare cycles are created so that transactions that take extra time can have a time slice.
Admittedly I don't get it. Say you have a memory with a memory word length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address (i.e. one not divisible by 4), as is the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
    char c;   // one byte
    int i;    // four bytes
    short s;  // two bytes
};
On a 32-bit processor it would most likely be padded and aligned like this:
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005, it will have to read a word from 0x0004 and shift it down one byte (on a little-endian machine) to place the value in a 16-bit register; some extra work, but most CPUs can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read a word from 0x0000 into the result register and shift it down 1 byte, then read another word from 0x0004 into a temporary register, shift it up 3 bytes, then OR it with the result register (again, on a little-endian machine).
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data in, out, and between more and faster execution units, in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes:
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line: because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable that was split across 2 chunks, that required double the memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of the L1 cache (32 KB)
128 bits - width of the bus (one cache line)
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments looks like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16 GB of RAM.
If you have a 32-bit data bus, the address bus lines connected to the memory will start from A2, so only 32-bit-aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data is not zero - two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
Sparc and I86 and (I think) Itanium raise hardware exceptions when you try this.
One 32 bit load vs four 8 bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.
Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of input buffers. Here, "small amount" generally means up to W - 1 bytes past the end, where W is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).
It's clear that writing past the end of an input buffer is never safe, in general, since you may clobber data beyond the buffer1. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.
In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have a 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K) and so aligned reads will only access bytes in the same page as the valid part of the buffer.
Here's a canonical example of some loop that aligns its input and reads up to 7 bytes past the end of buffer:
#include <stddef.h>
#include <stdint.h>

int match(uint64_t bytes);   // returns the lowest matching byte position (0-7), or -1
int shortMethod(void);       // handles inputs shorter than 8 bytes (not shown)

int processBytes(uint8_t *input, size_t size) {
    uint64_t *input64, *end64 = (uint64_t *)(input + size);
    int res;

    if (size < 8) {
        // special case for short inputs that we aren't concerned with here
        return shortMethod();
    }

    // check the first 8 (possibly unaligned) bytes
    if ((res = match(*(uint64_t *)input)) >= 0) {
        return res;
    }

    // align the pointer up to the next 8-byte boundary
    input64 = (uint64_t *)(((uintptr_t)input + 8) & ~(uintptr_t)0x7);

    // read 8 aligned bytes at a time; may read up to 7 bytes past the end
    for (; input64 < end64; input64++) {
        if ((res = match(*input64)) >= 0) {
            // exclude spurious matches that fall beyond the end of the buffer
            size_t pos = (size_t)((uint8_t *)input64 - input) + (size_t)res;
            return pos < size ? (int)pos : -1;
        }
    }

    return -1;
}
The inner function int match(uint64_t bytes) isn't shown, but it is something that looks for a byte matching a certain pattern, and returns the lowest such position (0-7) if found or -1 otherwise.
First, cases with size < 8 are pawned off to another function for simplicity of exposition. Then a single check is done for the first 8 (possibly unaligned) bytes. Then a loop is done for the remaining floor((size - 7) / 8) chunks of 8 bytes2. This loop may read up to 7 bytes past the end of the buffer (the 7 byte case occurs when input & 0xF == 1). However, the return statement has a check which excludes any spurious matches which occur beyond the end of the buffer.
Practically speaking, is such a function safe on x86 and x86-64?
These types of overreads are common in high performance code. Special tail code to avoid such overreads is also common. Sometimes you see the latter type replacing the former to silence tools like valgrind. Sometimes you see a proposal to do such a replacement, which is rejected on the grounds the idiom is safe and the tool is in error (or simply too conservative)3.
A note for language lawyers:
Reading from a pointer beyond its allocated size is definitely not allowed
in the standard. I appreciate language lawyer answers, and even occasionally write
them myself, and I'll even be happy when someone digs up the chapter
and verse which shows the code above is undefined behavior and hence
not safe in the strictest sense (and I'll copy the details here). Ultimately though, that's not what
I'm after. As a practical matter, many common idioms involving pointer
conversion, structure access through such pointers and so on are
technically undefined, but are widespread in high quality and high
performance code. Often there is no alternative, or the alternative
runs at half speed or less.
If you wish, consider a modified version of this question, which is:
After the above code has been compiled to x86/x86-64 assembly, and the user has verified that it is compiled in the expected way (i.e., the compiler hasn't used a provable partially out-of-bounds access to do something really clever), is executing the compiled program safe?
In that respect, this question is both a C question and a x86 assembly question. Most of the code using this trick that I've seen is written in C, and C is still the dominant language for high performance libraries, easily eclipsing lower level stuff like asm, and higher level stuff like <everything else>. At least outside of the hardcore numerical niche where FORTRAN still plays ball. So I'm interested in the C-compiler-and-below view of the question, which is why I didn't formulate it as a pure x86 assembly question.
All that said, while I am only moderately interested in a link to the standard showing this is UB, I am very interested in any details of actual implementations that can use this particular UB to produce unexpected code. Now I don't think this can happen without some pretty deep cross-procedure analysis, but the gcc overflow stuff surprised a lot of people too...
1 Even in apparently harmless cases, e.g., where the same value is written back, it can break concurrent code.
2 Note that for this overlapping to work, this function and the match() function need to behave in a specific idempotent way - in particular, the return value must support overlapping checks. So a "find first byte matching pattern" works, since all the match() calls are still in-order. A "count bytes matching pattern" method would not work, however, since some bytes could be double counted. As an aside: some functions, such as "return the minimum byte", would work even without the in-order restriction, but they need to examine all bytes.
3 It's worth noting here that for valgrind's Memcheck there is a flag, --partial-loads-ok, which controls whether such reads are in fact reported as an error. The default is yes, meaning that in general such loads are not treated as immediate errors, but that an effort is made to track the subsequent use of loaded bytes, some of which are valid and some of which are not, with an error being flagged if the out-of-range bytes are used. In cases such as the example above, in which the entire word is accessed in match(), such analysis will conclude the bytes are accessed, even though the results are ultimately discarded. Valgrind cannot in general determine whether invalid bytes from a partial load are actually used (and detection in general is probably very hard).
Yes, it's safe in x86 asm, and existing libc strlen(3) implementations take advantage of this in hand-written asm. And even glibc's fallback C does, but it compiles without LTO so it can never inline. It's basically using C as a portable assembler to create machine code for one function, not as part of a larger C program with inlining. But that's mostly because it also has potential strict-aliasing UB; see my answer on the linked Q&A. You probably also want a GNU C __attribute__((may_alias)) typedef instead of plain unsigned long as your wider type, like __m128i etc. already use.
It's safe because an aligned load will never cross a higher alignment boundary, and memory protection happens with aligned pages, so at least 4k boundaries.
Any naturally-aligned load that touches at least 1 valid byte can't fault. It's also safe to just check if you're far enough from the next page boundary to do a 16-byte load, like if (p & 4095 > (4096 - 16)) do_special_case_fallback. See the section below about that for more detail.
It's also generally safe in C compiled for x86, as far as I know. Reading outside an object is of course Undefined Behaviour in C, but works in C-targeting-x86. I don't think compilers explicitly / on purpose define the behaviour, but in practice it works that way.
I think it's not the kind of UB that aggressive compilers will assume can't happen while optimizing, but confirmation from a compiler-writer on this point would be good, especially for cases where it's easily provable at compile-time that an access goes past the end of an object. (See discussion in comments with @RossRidge: a previous version of this answer asserted that it was absolutely safe, but that LLVM blog post doesn't really read that way).
This is required in asm to go faster than 1 byte at a time processing an implicit-length string. In C in theory a compiler could know how to optimize such a loop, but in practice they don't so you have to do hacks like this. Until that changes, I suspect that the compilers people care about will generally avoid breaking code that contains this potential UB.
There's no danger when the overread isn't visible to code that knows how long an object is. A compiler has to make asm that works for the case where there are array elements as far as we actually read. The plausible danger I can see with possible future compilers is: after inlining, a compiler might see the UB and decide that this path of execution must never be taken. Or that the terminating condition must be found before the final not-full-vector and leave that out when fully unrolling.
The data you get is unpredictable garbage, but there won't be any other potential side-effects. As long as your program isn't affected by the garbage bytes, it's fine. (e.g. use bithacks to find whether any of the bytes of a uint64_t is zero, then a byte loop to find the first zero byte, regardless of what garbage is beyond it.)
Unusual situations where this wouldn't be safe in x86 asm
Hardware data breakpoints (watchpoints) that trigger on a load from a given address. If there's a variable you're monitoring right after an array, you could get a spurious hit. This might be a minor annoyance to someone debugging a normal program. If your function will be part of a program that uses x86 debug registers DR0-DR3 and the resulting exceptions for something that could affect correctness, then be careful with this.
Or similarly a code checker like valgrind could complain about reading outside an object.
Under a hypothetical 16 or 32-bit OS that uses segmentation: a segment limit can use 4k or 1-byte granularity, so it's possible to create a segment where the first faulting offset is odd. (Having the base of the segment aligned to a cache line or page is irrelevant except for performance). All mainstream x86 OSes use flat memory models, and x86-64 removes support for segment limits for 64-bit mode.
Memory-mapped I/O registers right after the buffer you wanted to loop over with wide loads, especially the same 64B cache-line. This is extremely unlikely even if you're calling functions like this from a device driver (or a user-space program like an X server that has mapped some MMIO space).
If you're processing a 60-byte buffer and need to avoid reading from a 4-byte MMIO register, you'll know about it and will be using a volatile T*. This sort of situation doesn't happen for normal code.
strlen is the canonical example of a loop that processes an implicit-length buffer and thus can't vectorize without reading past the end of a buffer. If you need to avoid reading past the terminating 0 byte, you can only read one byte at a time.
For example, glibc's implementation uses a prologue to handle data up to the first 64B alignment boundary. Then in the main loop (gitweb link to the asm source), it loads a whole 64B cache line using four SSE2 aligned loads. It merges them down to one vector with pminub (min of unsigned bytes), so the final vector will have a zero element only if any of the four vectors had a zero. After finding that the end of the string was somewhere in that cache line, it re-checks each of the four vectors separately to see where. (Using the typical pcmpeqb against a vector of all-zero, and pmovmskb / bsf to find the position within the vector.) glibc used to have a couple different strlen strategies to choose from, but the current one is good on all x86-64 CPUs.
Usually loops like this avoid touching any extra cache-lines they don't need to touch, not just pages, for performance reasons, like glibc's strlen.
Loading 64B at a time is of course only safe from a 64B-aligned pointer, since naturally-aligned accesses can't cross cache-line or page boundaries.
If you do know the length of a buffer ahead of time, you can avoid reading past the end by handling the bytes beyond the last full aligned vector using an unaligned load that ends at the last byte of the buffer.
(Again, this only works with idempotent algorithms, like memcpy, which don't care if they do overlapping stores into the destination. Modify-in-place algorithms often can't do this, except with something like converting a string to upper-case with SSE2, where it's ok to reprocess data that's already been upcased. Other than the store-forwarding stall if you do an unaligned load that overlaps with your last aligned store.)
So if you are vectorizing over a buffer of known length, it's often best to avoid overread anyway.
Non-faulting overread of an object is the kind of UB that definitely can't hurt if the compiler can't see it at compile time. The resulting asm will work as if the extra bytes were part of some object.
But even if it is visible at compile-time, it generally doesn't hurt with current compilers.
PS: a previous version of this answer claimed that unaligned deref of int * was also safe in C compiled for x86. That is not true. I was a bit too cavalier 3 years ago when writing that part. You need a typedef with GNU C __attribute__((aligned(1),may_alias)), or memcpy, to make that safe. The may_alias part isn't needed if you only access it via signed/unsigned int* and/or char*, i.e. in ways that wouldn't violate the normal C strict-aliasing rules.
The set of things ISO C leaves undefined but that Intel intrinsics requires compilers to define does include creating unaligned pointers (at least with types like __m128i*), but not dereferencing them directly. Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
Checking if a pointer is far enough from the end of a 4k page
This is useful for the first vector of strlen; after this you can p = (p+16) & -16 to go to the next aligned vector. This will partially overlap if p was not 16-byte aligned, but doing redundant work is sometimes the most compact way to set up for an efficient loop. Avoiding it might mean looping 1 byte at a time until an alignment boundary, and that's certainly worse.
e.g. check ((p + 15) ^ p) & 0xFFF...F000 == 0 (LEA / XOR / TEST) which tells you that the last byte of a 16-byte load has the same page-address bits as the first byte. Or p+15 <= p|0xFFF (LEA / OR / CMP with better ILP) checks that the last byte-address of the load is <= the last byte of the page containing the first byte.
Or more simply, p & 4095 <= (4096 - 16) (MOV / AND / CMP), i.e. p & (pgsize-1) <= (pgsize - vecwidth), checks that the offset-within-page is far enough from the end of a page; the inverted comparison (as in the example earlier) detects the case that needs a fallback.
You can use 32-bit operand-size to save code size (REX prefixes) for this or any of the other checks because the high bits don't matter. Some compilers don't notice this optimization, so you can cast to unsigned int instead of uintptr_t, although to silence warnings about code that isn't 64-bit clean you might need to cast (unsigned)(uintptr_t)p. Further code-size saving can be had with ((unsigned int)p << 20) > ((4096 - vectorlen) << 20) (MOV / SHL / CMP), because shl reg, 20 is 3 bytes, vs. and eax, imm32 being 5, or 6 for any other register. (Using EAX will also allow the no-modrm short form for cmp eax, 0xfff.)
If doing this in GNU C, you probably want typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias)); to make it safe to do unaligned accesses.
If you permit consideration of non-CPU devices, then one example of a potentially unsafe operation is accessing out-of-bounds regions of PCI-mapped memory pages. There's no guarantee that the target device is using the same page size or alignment as the main memory subsystem. Attempting to access, for example, address [cpu page base]+0x800 may trigger a device page fault if the device is in a 2KiB page mode. This will usually cause a system bugcheck.
When calloc is used, pointers to newly allocated memory are aligned such that at least a certain number of the least significant bits are zero, meaning those least significant bits can be used as tagged pointers, which is in fact common in lock-free algorithms. I was testing the memory management behaviour on a Linux Ubuntu server (x86_64 GNU/Linux, 3.10.23-xxxx-std-ipv6-64-vps) and it seems, from my experiments, that the 4 least significant bits are set to 0. From what I have read, pointer alignment only guarantees that a pointer expressed as a uintptr is divisible by 4 (i.e. the 2 least significant bits are zero).
What is the minimum number of the least significant bits in newly allocated memory pointers, obtained from the memory management system in POSIX (Linux), that are always set to 0 during the initial memory allocation process?
What is the maximum number of the least significant bits that can be used as tagged pointers on Linux systems (e.g. lock-free algorithms)?
How to force the compiler to align newly allocated pointers so that an exact number of the least significant bits is zero?
Does the alignment of pointers affect overall system performance, and how?
Alignment is important in optimization for many related reasons:
efficient usage of the cache lines
not disabling the prefetching logic
best usage of vector registers/instructions (SSE, AVX).
especially when I/O is concerned, memory page alignment can also be important.
You can find very good references for Intel architecture here:
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Answering your questions quickly:
What is the minimum number of the least significant bits in newly allocated memory pointers, obtained from the memory management system in POSIX (Linux), that are always set to 0 during the initial memory allocation process?
It actually depends on the CPU/architecture you are speaking of.
What is the maximum number of the least significant bits that can be used as tagged pointers on Linux systems (e.g. lock-free algorithms)?
The same as the former: you should use std::atomic or boost::atomic in order to have some sort of portability, if C++ is an option.
On Intel architectures, memory loads and stores are atomic for 32-bit values on x86_32 and for 64-bit values on x86_64, if the data is properly aligned.
If you are really enjoying this kind of low-level detail, don't forget to have a look into memory semantics, memory fences and so on ("Fence instructions" in the above manual).
I'm afraid I can't answer your whole question, but I can make a start:
Pointer alignment might not only change performance but may also be necessary to make your code work at all. Especially on things like ARM processors, you can't read values larger than 1 byte if the pointer is unaligned; doing so will result in an error.
If I, for example, work with a big data stream, I prefer to have my data aligned so I can read more bytes at a time, instead of having to read byte by byte, which costs more time/CPU.
On the x86/x86_64 architecture, reading/writing unaligned memory comes with a performance cost, because you may need two memory operations instead of a single one: the bus operations to/from memory are always aligned.
On GNU/Linux you can use posix_memalign and friends to get aligned heap memory (man memalign) in user space.
Some compilers also support attributes or macros to get aligned memory on the stack, for instance:
#if defined(__GNUC__)
/* GCC align declarator */
#define MYMEMALIGN(x, y) x __attribute__( (aligned( y )) )
#endif
but I guess these are non-portable solutions.
I was curious as to whether there is any efficiency advantage to using memset() in a situation similar to the one below.
Given the following buffer declarations...
struct More_Buffer_Info
{
    unsigned char a[10];
    unsigned char b[10];
    unsigned char c[10];
};

struct My_Buffer_Type
{
    struct More_Buffer_Info buffer_info[100];
};
struct My_Buffer_Type my_buffer[5];
unsigned char *p;
p = (unsigned char *)my_buffer;
Besides having fewer lines of code, is there an advantage to using this:
memset((void *)p, 0, sizeof(my_buffer));
Over this:
for (i = 0; i < sizeof(my_buffer); i++)
{
    *p++ = 0;
}
This applies to both memset() and memcpy():
Less Code: As you have already mentioned, it's shorter - fewer lines of code.
More Readable: Shorter usually makes it more readable as well. (memset() is more readable than that loop)
It can be faster: It can sometimes allow more aggressive compiler optimizations. (so it may be faster)
Misalignment: In some cases, when you're dealing with misaligned data on a processor that doesn't support misaligned accesses, memset() and memcpy() may be the only clean solution.
To expand on the 3rd point, memset() can be heavily optimized by the compiler using SIMD and such. If you write a loop instead, the compiler will first need to "figure out" what it does before it can attempt to optimize it.
The basic idea here is that memset() and similar library functions, in some sense, "tell" the compiler your intent.
As mentioned by @Oli in the comments, there are some downsides. I'll expand on them here:
You need to make sure that memset() actually does what you want. The standard doesn't guarantee that the zero value of every datatype (e.g. pointers or floating-point) is represented by all-zero bytes in memory.
For non-zero data, memset() is restricted to only 1 byte content. So you can't use memset() if you want to set an array of ints to something other than zero (or 0x01010101 or something...).
Although rare, there are some corner cases, where it's actually possible to beat the compiler in performance with your own loop.*
*I'll give one example of this from my experience:
Although memset() and memcpy() are usually compiler intrinsics with special handling by the compiler, they are still generic functions. They say nothing about the datatype including the alignment of the data.
So in a few (albeit rare) cases, the compiler isn't able to determine the alignment of the memory region, and thus must produce extra code to handle misalignment. Whereas, if you, the programmer, are 100% sure of the alignment, using a loop might actually be faster.
A common example is when using SSE/AVX intrinsics. (such as copying a 16/32-byte aligned array of floats) If the compiler can't determine the 16/32-byte alignment, it will need to use misaligned load/stores and/or handling code. If you simply write a loop using SSE/AVX aligned load/store intrinsics, you can probably do better.
float *ptrA = ... // some unknown source, guaranteed to be 32-byte aligned
float *ptrB = ... // some unknown source, guaranteed to be 32-byte aligned
int length = ... // some unknown source, guaranteed to be multiple of 8
// memcpy() - The compiler can't read comments. It doesn't know the data is 32-byte
// aligned. So it may generate unnecessary misalignment handling code.
memcpy(ptrA, ptrB, length * sizeof(float));

// This loop could potentially be faster because it "uses" the fact that
// the pointers are aligned. The compiler can also further optimize this.
for (int c = 0; c < length; c += 8){
    _mm256_store_ps(ptrA + c, _mm256_load_ps(ptrB + c));
}
It depends on the quality of the compiler and the libraries. In most cases memset is superior.
The advantage of memset is that in many platforms it is actually a compiler intrinsic; that is, the compiler can "understand" the intention to set a large swath of memory to a certain value, and possibly generate better code.
In particular, that could mean using specific hardware operations for setting large regions of memory, like SSE on the x86, AltiVec on the PowerPC, NEON on the ARM, and so on. This can be an enormous performance improvement.
On the other hand, by using a for loop you are telling the compiler to do something more specific, "load this address into a register. Write a number to it. Add one to the address. Write a number to it," and so on. In theory a perfectly intelligent compiler would recognize this loop for what it is and turn it into a memset anyway; but I have never encountered a real compiler that did this.
So, the assumption is that memset was written by smart people to be the very best and fastest possible way to set a whole region of memory, for the specific platform and hardware the compiler supports. That is often, but not always, true.
Remember that this
for (i = 0; i < sizeof(my_buffer); i++)
{
    p[i] = 0;
}
can also be faster than
for (i = 0; i < sizeof(my_buffer); i++)
{
    *p++ = 0;
}
As already answered, the compiler often has hand-optimized routines for memset(), memcpy() and other string functions. And we are talking significantly faster. The amount of code, the number of instructions, in a fast memcpy or memset from the compiler is usually much larger than in the loop solution you suggested. Fewer lines of code, fewer instructions, does not mean faster.
Anyway, my message is: try both. Disassemble the code, see the difference, try to understand it, and ask questions at Stack Overflow if you don't. Then use a timer and time the two solutions: call whichever memcpy/memset function thousands or hundreds of thousands of times and time the whole thing (to eliminate error in the timing). Make sure you do short copies, like say 5 or 7 items, and large copies, like hundreds of bytes per memset, and try some prime numbers while you are at it. On some processors, on some systems, your loop can be faster for a few items like 3 or 5 or something like that, but very quickly it gets slow.
Here is one hint about performance. The DDR memory in your computer is likely 64 bits wide and needs to be written 64 bits at a time; maybe it has ECC and you have to compute across those bits and write 72 bits at a time. Not always that exact number, but follow the thought here; it applies equally to 32 bits or 64 or 128 or whatever. If you perform a single-byte write instruction to RAM and there are no caches along the way, the memory system has to perform a 64 bit read, modify the one byte, then write it back. Without some sort of hardware optimization, writing 8 bytes within that one DRAM row, one byte at a time, is 16 memory cycles, and DRAM is very, very slow; don't be fooled by the 1333 MHz numbers.
Now if you have a cache, the first byte write is going to require a cache line read from DRAM, which is one or several of these 64 bit reads. The next 7 or 15 or whatever byte writes are probably going to be really fast, as they only go to the cache and not to DDR; eventually that cache line goes back out to DRAM (slow), as one or two or four, etc., of these 64 bit or whatever DDR transfers. So even though you are only doing writes, you still have to read all of that RAM and then write it, so twice as many cycles as desired. If possible, and it is with some processors and memory systems, the memset, or the write part of a memcpy, can write a whole cache line or whole DDR location per instruction with no read required, instantly doubling the speed. This is not how all the optimizations work, but it hopefully gives you an idea of how to think about the problem. With your program being pulled into cache in cache lines, you can double or triple the number of instructions executed if, in return, you cut the number of DDR cycles in half or a quarter or more, and you win overall.
At a minimum, the compiler's memset and memcpy routines are going to perform a byte operation if the start address is odd, then a 16 bit operation if not aligned on 32 bits, then a 32 bit operation if not aligned on 64, and on up until they hit the optimal transfer size for that instruction set/system. On ARM they tend to aim for 128 bits. So the worst case on the front end would be a single byte, then a single halfword, then a few words, before getting into the main set or copy loop; in the case of ARM, 128 bits are written per instruction. Then on the back end, if unaligned, the same deal: a few words, one halfword, one byte worst case. You will also see the libraries do things like: if the number of bytes is less than X, where X is a small number like 13 or so, go into a loop like yours and just copy some bytes, because the number of instructions and clock cycles to support that loop is smaller/faster. Disassemble or find the gcc source code for ARM, and probably MIPS and some other good processors, and see what I am talking about.
Two advantages:
The version with memset is easier to read - this is related to, but not the same as, having fewer lines of code. It takes less thinking to know what the memset version does, especially if you write it
memset(my_buffer, 0, sizeof(my_buffer));
instead of with the indirection through p and the unnecessary cast to void * (NOTE: only unnecessary if you're really coding in C and not C++ - some people are unclear on the difference).
memset is likely to be able to write 4 or 8 bytes at a time and/or take advantage of special cache hint instructions; therefore it may well be faster than your byte-at-a-time loop. (NOTE: Some compilers are clever enough to recognize a bulk-clearing loop and substitute either wider writes to memory or a call to memset. Your mileage may vary. Always measure performance before attempting to shave cycles.)
memset gives a standard way to write code, letting the particular platform/compiler libraries determine the most efficient mechanism. Based on data sizes it may for example do 32-bit or 64-bit stores as much as possible.
Your variable p is only required for the initialisation loop. The code for the memset should be simply
memset( my_buffer, 0, sizeof(my_buffer));
which is simpler and less error prone. The point of a void* parameter is exactly that it will accept any pointer type, the explicit cast is unnecessary, and the assignment to a pointer of a different type is pointless.
So one benefit of using memset() in this case is to avoid an unnecessary intermediate variable.
Another benefit is that memset() on any particular platform is likely to be optimised for the target platform, whereas your loop efficiency is dependent on the compiler and compiler settings.