Related
How to get the memory granularity of a CPU in C?
Suppose I want to allocate an array where all the elements are properly memory aligned. I can pad each element to a certain size N to achieve this. How do I know the value of N?
Note: I am trying to create a memory pool where each slot is memory aligned. Any suggestion will be appreciated.
In Theory
How to get the memory granularity of a CPU in C?
First, you read the instruction set architecture manual. It may specify that certain instructions require certain alignments, or even that the addressing forms in certain instructions cannot represent non-aligned addresses. It may specify other properties regarding alignment.
Second, you read the processor manual. It may specify performance characteristics (such as that unaligned loads or stores are supported but may be slower or use more resources than aligned loads or stores) and may specify various options allowed by the instruction set architecture.
Third, you read the operating system documentation. Some architectures allow the operating system to select features related to alignment, such as whether unaligned loads and stores are made to fail or are supported albeit with slower performance than aligned loads or stores. The operating system documentation should have this information.
In Practice
For many programming situations, what you need to know is not the “memory granularity” of a CPU but the alignment requirements of the C implementation you are using (or of whatever language you are using). And, for the most part, you do not need to know the alignment requirements directly but just need to follow the language rules about managing objects—use objects with declared types, do not use casts to convert pointers between incompatible types except where specific rules allow it, use the suitably aligned memory as provided by malloc rather than adjusting your own pointers to bytes, and so on. Following these rules will give good alignment for the objects in your program.
In C, when you define an array, the element size will automatically be the size the C implementation needs for its alignment. For example, long double x[100]; may use 16 bytes for each array element even though the hardware uses only ten bytes for a long double. Or, for any struct foo that you define, the compiler will automatically include padding as needed in the structure to give the desired alignment, and any array struct foo x[100]; will already include that padding. sizeof(struct foo) will be the same as sizeof x[0], because each structure object has that padding built in, even for a single structure object, not just for elements of arrays.
When you do need to know the alignment that a C implementation requires for a type, you can use C’s _Alignof operator. The expression _Alignof(type) provides the alignment required for type.
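For example, a minimal C11 sketch (the printed values are implementation-defined; a typical 64-bit ABI reports 8, 8 and 16):

#include <stdio.h>

struct foo { char c; double d; };   /* padding is inserted after 'c' so 'd' is aligned */

int main(void)
{
    printf("_Alignof(double)     = %zu\n", _Alignof(double));
    printf("_Alignof(struct foo) = %zu\n", _Alignof(struct foo));
    printf("sizeof(struct foo)   = %zu\n", sizeof(struct foo));
    return 0;
}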
Other
… properly memory aligned.
Proper alignment is a matter of degrees:
What the processor supports may determine whether your program works or does not work. An improper alignment is one that causes your program to trap.
What is efficient with respect to individual loads and stores may affect how fast your program runs. An improper alignment is one that causes your program to execute more slowly.
In certain performance-critical situations, alignment with respect to cache and memory mapping features can also affect performance.
Short answer
Use 64 bytes.
Long answer
Data are loaded from and stored to memory in units called cache lines. If your program loads only part of the data in a cache line, then the whole line will be loaded into the CPU caches. Perhaps more importantly, the algorithm used for moving data between cores in a multi-core CPU operates on full cache lines; aligning your data to cache lines avoids false sharing, the situation where a cache line bounces between cores because it contains data manipulated by different threads.
It used to be the case that cache lines depended on the architecture, ranging from 16 up to 512 bytes. However, all current processors (Intel, AMD, ARM, MIPS) use a cache line of 64 bytes.
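For the memory-pool use case in the question, a minimal sketch of a pool with cache-line-sized, cache-line-aligned slots might look like this (the names here are made up; aligned_alloc is C11 and requires the total size to be a multiple of the alignment):

#include <stdlib.h>

#define CACHE_LINE 64

struct my_item { int id; double value; };   /* hypothetical element type */

/* round the element size up to a whole number of cache lines */
#define SLOT_SIZE ((sizeof(struct my_item) + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE)

static void *pool_create(size_t nslots)
{
    return aligned_alloc(CACHE_LINE, nslots * SLOT_SIZE);   /* every slot starts on a cache line */
}

static struct my_item *pool_slot(void *pool, size_t i)
{
    return (struct my_item *)((char *)pool + i * SLOT_SIZE);
}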
This depends heavily on the cpu microarchitecture that you are using.
In many cases, the memory address of an operand should be a multiple of the operand size; otherwise, execution will be slow (or might even throw an exception).
But there are also CPUs which do not care about a specific alignment of the operands in memory at all.
Usually, the C compiler will care about those details for you. You should, however, make sure that the compiler assumes the correct target (micro-)architecture, for example by specifying it with the correct compiler flags (-march=? on gcc).
I was wondering if anyone could offer a fuller explanation of the meaning of the packed attribute used in the bitmap example in pset4.
"Our use, incidentally, of the attribute called packed ensures that clang does not try to "word-align" members (whereby the address of each member’s first byte is a multiple of 4), lest we end up with "gaps" in our structs that don’t actually exist on disk."
I do not understand the comment about gaps in our structs. Does this refer to gaps in the memory locations between each struct (i.e. one byte between each 3-byte RGB triple if it were word-aligned)? Why does this matter for optimization?
typedef uint8_t BYTE;
typedef struct
{
BYTE rgbtBlue;
BYTE rgbtGreen;
BYTE rgbtRed;
} __attribute__((__packed__))
RGBTRIPLE;
Beware: prejudices on display!
As noted in comments, when the compiler adds the padding to a structure, it does so to improve performance. It uses the alignments for the structure elements that will give the best performance.
Not so very long ago, the DEC Alpha chips would handle an 'unaligned memory request' (UMR) by taking a fault, jumping into the kernel, fiddling with the bytes to get the required result, and returning the correct result. This was painfully slow compared with a correctly aligned memory request; you avoided such behaviour at all costs.
Other RISC chips (used to) give you a SIGBUS error if you do misaligned memory accesses. Even Intel chips have to do some fancy footwork to deal with misaligned memory accesses.
The purpose of removing padding is to (at some cost in performance) gain the ability to serialize and deserialize the data without doing the job 'properly'. It is a form of laziness that doesn't actually work when the machines communicating are not of the same type, so proper serialization should have been done in the first place.
What I mean is that if you are writing data over the network, it seems simpler to be able to send the data by writing the contents of a structure as a block of memory (error checking etc omitted):
write(fd, &structure, sizeof(structure));
The receiving end can read the data:
read(fd, &structure, sizeof(structure));
However, if the machines are of different types (for example, one has an Intel CPU and the other a SPARC or Power CPU), the interpretation of the data in those structures will vary between the two machines (unless every element of the structure is either a char or an array of char). To relay the information reliably, you have to agree on a byte order (e.g. network byte order — this is very much a factor in TCP/IP networking, for example), and the data should be transmitted in the agreed-upon order so that both ends can understand what the other is saying.
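For example, a minimal sketch of doing the job 'properly' for a hypothetical message with a 32-bit field (the struct and field names are made up; htonl comes from POSIX):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl */

struct msg { uint32_t length; uint8_t flags; };   /* hypothetical wire message */

/* Serialize field by field into an agreed-upon (network/big-endian) byte order,
   rather than writing the in-memory struct as a single block. */
static size_t msg_serialize(const struct msg *m, uint8_t out[5])
{
    uint32_t len_be = htonl(m->length);
    memcpy(out, &len_be, sizeof len_be);   /* bytes 0-3: length, big-endian */
    out[4] = m->flags;                     /* byte 4: flags */
    return 5;
}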
You can define other mechanisms: you could use a 'sender makes right' mechanism, in which the 'receiver' lets the sender know how it wants the data presented and the sender is responsible for fixing up the transmitted data. You can also use a 'receiver makes right' mechanism, which works the other way around. Both of these have been used commercially — see DRDA for one such protocol.
Given that the type of BYTE is uint8_t, there won't be any padding in the structure in any sane (commercially viable) compiler. IMO, the precaution is a fantasy or phobia without a basis in reality. I'd certainly need a carefully documented counter-example to believe that there's an actual problem that the attribute helps with.
I was led to believe that you could encounter issues when you pass the entire struct to a function like fread, as it assumes you're giving it an array-like chunk of memory with no gaps in it. If your struct has gaps, the first byte ends up in the right place, but the next two bytes get written in the gap, which you don't have a proper way to access.
Sorta...but mostly no. The issue is that the values in the padding bytes are indeterminate. However, in the structure shown, there will be no padding in any compiler I've come across; the structure will be 3 bytes long. There is no reason to put any padding anywhere inside the structure (between elements) or after the last element (and the standard prohibits padding before the first element). So, in this context, there is no issue.
If you write binary data to a file and it has holes in it, then you get arbitrary byte values written where the holes are. If you read back on the same (type of) machine, there won't actually be a problem. If you read back on a different (type of) machine, there may be problems — hence my comments about serialization and deserialization. I've only been programming in C a little over 30 years; I've never needed packed, and don't expect to. (And yes, I've dealt with serialization and deserialization using a standard layout — the system I mainly worked on used big-endian data transfer, which corresponds to network byte order.)
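As an illustration of when packing does change the layout, here is a small sketch with a struct that mixes a uint8_t and a uint32_t (the exact sizes shown in the comments are typical for GCC/Clang on common ABIs, not guaranteed):

#include <stdint.h>
#include <stdio.h>

struct sample        { uint8_t tag; uint32_t value; };                              /* typically 8 bytes: 3 padding bytes after 'tag' */
struct sample_packed { uint8_t tag; uint32_t value; } __attribute__((__packed__));  /* 5 bytes: no gap, but 'value' may be misaligned */

int main(void)
{
    printf("%zu %zu\n", sizeof(struct sample), sizeof(struct sample_packed));
    return 0;
}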
Sometimes, the elements of a struct are simply aligned to a 4-byte boundary (or whatever the size of a register is in the CPU) to optimize read/write access to RAM. Often, smaller elements are packed together, but alignment is dictated by a larger type in the struct.
In your case, you probably don't need to pack the struct, but it doesn't hurt.
With some compilers, each byte in your struct could end up taking 4 bytes of RAM (so, 12 bytes for the entire struct). Packing the struct removes the alignment requirement for each of the BYTEs, and ensures that the entire struct is placed into one 4-byte DWORD (unless the alignment for the entire program is set to one byte, or the struct is in an array of said structs, in which case it would literally be stored in 3 contiguous bytes of RAM).
See comments below for further discussion...
The objective is exactly what you said, not having gaps between each struct. Why is this important? Mostly because of cache. Memory access is slow!!! Cache is really fast. If you can fit more in cache you avoid cache misses (memory accesses).
Edit: Seems I was wrong; this isn't really useful if the objective was structure padding, since the struct has 3 BYTEs and no padding anyway.
Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of input buffers. Here, "small amount" generally means up to W - 1 bytes past the end, where W is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).
It's clear that writing past the end of an input buffer is never safe, in general, since you may clobber data beyond the buffer1. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.
In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have a 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K) and so aligned reads will only access bytes in the same page as the valid part of the buffer.
Here's a canonical example of some loop that aligns its input and reads up to 7 bytes past the end of buffer:
// Returns the offset of the first matching byte, or -1 if none is found.
// match() and shortMethod() are assumed to be defined elsewhere.
int processBytes(uint8_t *input, size_t size) {
    uint64_t *input64 = (uint64_t *)input;
    uint64_t *end64   = (uint64_t *)(input + size);
    int res;

    if (size < 8) {
        // special case for short inputs that we aren't concerned with here
        return shortMethod();
    }

    // check the first 8 (possibly unaligned) bytes
    if ((res = match(*(uint64_t *)input)) >= 0) {
        return res;
    }

    // align the pointer to the next 8-byte boundary
    input64 = (uint64_t *)((uintptr_t)(input64 + 1) & ~(uintptr_t)0x7);

    for (; input64 < end64; input64++) {
        if ((res = match(*input64)) >= 0) {
            int pos = (int)((uint8_t *)input64 - input) + res;
            return pos < (int)size ? pos : -1;
        }
    }

    return -1;
}
The inner function int match(uint64_t bytes) isn't shown, but it is something that looks for a byte matching a certain pattern, and returns the lowest such position (0-7) if found or -1 otherwise.
First, cases with size < 8 are pawned off to another function for simplicity of exposition. Then a single check is done for the first 8 (unaligned) bytes. Then a loop is done for the remaining floor((size - 7) / 8) chunks of 8 bytes2. This loop may read up to 7 bytes past the end of the buffer (the 7 byte case occurs when input & 0xF == 1). However, the return has a check which excludes any spurious matches that occur beyond the end of the buffer.
Practically speaking, is such a function safe on x86 and x86-64?
These types of overreads are common in high performance code. Special tail code to avoid such overreads is also common. Sometimes you see the latter type replacing the former to silence tools like valgrind. Sometimes you see a proposal to do such a replacement, which is rejected on the grounds the idiom is safe and the tool is in error (or simply too conservative)3.
A note for language lawyers:
Reading from a pointer beyond its allocated size is definitely not allowed in the standard. I appreciate language-lawyer answers, and even occasionally write them myself, and I'll even be happy when someone digs up the chapter and verse which shows the code above is undefined behavior and hence not safe in the strictest sense (and I'll copy the details here). Ultimately though, that's not what I'm after. As a practical matter, many common idioms involving pointer conversion, structure access through such pointers, and so on are technically undefined, but are widespread in high-quality and high-performance code. Often there is no alternative, or the alternative runs at half speed or less.
If you wish, consider a modified version of this question, which is:
After the above code has been compiled to x86/x86-64 assembly, and the user has verified that it is compiled in the expected way (i.e., the compiler hasn't used a provable partially out-of-bounds access to do something really clever), is executing the compiled program safe?
In that respect, this question is both a C question and a x86 assembly question. Most of the code using this trick that I've seen is written in C, and C is still the dominant language for high performance libraries, easily eclipsing lower level stuff like asm, and higher level stuff like <everything else>. At least outside of the hardcore numerical niche where FORTRAN still plays ball. So I'm interested in the C-compiler-and-below view of the question, which is why I didn't formulate it as a pure x86 assembly question.
All that said, while I am only moderately interested in a link to the standard showing this is UB, I am very interested in any details of actual implementations that can use this particular UB to produce unexpected code. Now I don't think this can happen without some pretty deep cross-procedure analysis, but the gcc overflow stuff surprised a lot of people too...
1 Even in apparently harmless cases, e.g., where the same value is written back, it can break concurrent code.
2 Note that for this overlapping to work, this function and the match() function need to behave in a specific idempotent way; in particular, the return value must support overlapping checks. So a "find first byte matching pattern" works, since all the match() calls are still in-order. A "count bytes matching pattern" method would not work, however, since some bytes could be double-counted. As an aside: some functions, such as "return the minimum byte", would work even without the in-order restriction, but need to examine all bytes.
3 It's worth noting here that valgrind's Memcheck has a flag, --partial-loads-ok, which controls whether such reads are in fact reported as an error. The default is yes, meaning that in general such loads are not treated as immediate errors, but an effort is made to track the subsequent use of the loaded bytes, some of which are valid and some of which are not, with an error being flagged if the out-of-range bytes are used. In cases such as the example above, in which the entire word is accessed in match(), such analysis will conclude the bytes are accessed, even though the results are ultimately discarded. Valgrind cannot in general determine whether invalid bytes from a partial load are actually used (and detection in general is probably very hard).
Yes, it's safe in x86 asm, and existing libc strlen(3) implementations take advantage of this in hand-written asm. Even glibc's fallback C implementation does, but it compiles without LTO so it can never inline. It's basically using C as a portable assembler to create machine code for one function, not as part of a larger C program with inlining. But that's mostly because it also has potential strict-aliasing UB; see my answer on the linked Q&A. You probably also want a GNU C __attribute__((may_alias)) typedef instead of plain unsigned long as your wider type, like __m128i etc. already use.
It's safe because an aligned load will never cross a higher alignment boundary, and memory protection happens with aligned pages, so at least 4k boundaries.
Any naturally-aligned load that touches at least 1 valid byte can't fault. It's also safe to just check whether you're far enough from the next page boundary to do a 16-byte load, like if ((p & 4095) > (4096 - 16)) do_special_case_fallback. See the section below about that for more detail.
It's also generally safe in C compiled for x86, as far as I know. Reading outside an object is of course Undefined Behaviour in C, but works in C-targeting-x86. I don't think compilers explicitly / on purpose define the behaviour, but in practice it works that way.
I think it's not the kind of UB that aggressive compilers will assume can't happen while optimizing, but confirmation from a compiler writer on this point would be good, especially for cases where it's easily provable at compile time that an access goes past the end of an object. (See discussion in comments with @RossRidge: a previous version of this answer asserted that it was absolutely safe, but that LLVM blog post doesn't really read that way).
This is required in asm to go faster than 1 byte at a time when processing an implicit-length string. In C, a compiler could in theory know how to optimize such a loop, but in practice they don't, so you have to do hacks like this. Until that changes, I suspect that the compilers people care about will generally avoid breaking code that contains this potential UB.
There's no danger when the overread isn't visible to code that knows how long an object is. A compiler has to make asm that works for the case where there are array elements as far as we actually read. The plausible danger I can see with possible future compilers is: after inlining, a compiler might see the UB and decide that this path of execution must never be taken. Or it might decide that the terminating condition must be found before the final not-full vector and leave that out when fully unrolling.
The data you get is unpredictable garbage, but there won't be any other potential side effects. As long as your program isn't affected by the garbage bytes, it's fine. (e.g. use bithacks to find if one of the bytes of a uint64_t is zero, then a byte loop to find the first zero byte, regardless of what garbage is beyond it.)
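For instance, the classic "has a zero byte" bithack works on a whole uint64_t at once; any garbage in the bytes past the terminator is harmless because only the first zero byte is ever acted on (a small sketch):

#include <stdint.h>

/* nonzero iff some byte of v is 0x00 */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}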
Unusual situations where this wouldn't be safe in x86 asm
Hardware data breakpoints (watchpoints) that trigger on a load from a given address. If there's a variable you're monitoring right after an array, you could get a spurious hit. This might be a minor annoyance to someone debugging a normal program. If your function will be part of a program that uses x86 debug registers DR0-DR3 and the resulting exceptions for something that could affect correctness, then be careful with this.
Or similarly a code checker like valgrind could complain about reading outside an object.
Under a hypothetical 16- or 32-bit OS that uses segmentation: a segment limit can use 4k or 1-byte granularity, so it's possible to create a segment where the first faulting offset is odd. (Having the base of the segment aligned to a cache line or page is irrelevant except for performance.) All mainstream x86 OSes use flat memory models, and x86-64 removes support for segment limits for 64-bit mode.
Memory-mapped I/O registers right after the buffer you wanted to loop over with wide loads, especially the same 64B cache-line. This is extremely unlikely even if you're calling functions like this from a device driver (or a user-space program like an X server that has mapped some MMIO space).
If you're processing a 60-byte buffer and need to avoid reading from a 4-byte MMIO register, you'll know about it and will be using a volatile T*. This sort of situation doesn't happen for normal code.
strlen is the canonical example of a loop that processes an implicit-length buffer and thus can't vectorize without reading past the end of a buffer. If you need to avoid reading past the terminating 0 byte, you can only read one byte at a time.
For example, glibc's implementation uses a prologue to handle data up to the first 64B alignment boundary. Then in the main loop (gitweb link to the asm source), it loads a whole 64B cache line using four SSE2 aligned loads. It merges them down to one vector with pminub (min of unsigned bytes), so the final vector will have a zero element only if any of the four vectors had a zero. After finding that the end of the string was somewhere in that cache line, it re-checks each of the four vectors separately to see where. (Using the typical pcmpeqb against a vector of all-zero, and pmovmskb / bsf to find the position within the vector.) glibc used to have a couple different strlen strategies to choose from, but the current one is good on all x86-64 CPUs.
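A minimal sketch of the pcmpeqb / pmovmskb step in C with SSE2 intrinsics (this is just an illustration of the idea, not glibc's actual code; __builtin_ctz is a GCC/Clang builtin):

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Given a 16-byte-aligned pointer, return the index (0-15) of the first
   zero byte in that 16-byte block, or -1 if there is none. */
static int first_zero_in_block(const uint8_t *p)
{
    __m128i v    = _mm_load_si128((const __m128i *)p);       /* aligned 16-byte load */
    __m128i eq0  = _mm_cmpeq_epi8(v, _mm_setzero_si128());   /* 0xFF where byte == 0 */
    int     mask = _mm_movemask_epi8(eq0);                   /* one bit per byte */
    return mask ? __builtin_ctz(mask) : -1;                  /* lowest set bit = first zero */
}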
Usually loops like this avoid touching any extra cache-lines they don't need to touch, not just pages, for performance reasons, like glibc's strlen.
Loading 64B at a time is of course only safe from a 64B-aligned pointer, since naturally-aligned accesses can't cross cache-line or page boundaries.
If you do know the length of a buffer ahead of time, you can avoid reading past the end by handling the bytes beyond the last full aligned vector using an unaligned load that ends at the last byte of the buffer.
(Again, this only works with idempotent algorithms, like memcpy, which don't care if they do overlapping stores into the destination. Modify-in-place algorithms often can't do this, except with something like converting a string to upper-case with SSE2, where it's ok to reprocess data that's already been upcased. Other than the store-forwarding stall if you do an unaligned load that overlaps with your last aligned store.)
So if you are vectorizing over a buffer of known length, it's often best to avoid overread anyway.
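For example, here is a sketch of that trick for a memcpy-style operation on a known-length buffer (assumes n >= 16; the final unaligned vector overlaps the previous one and ends exactly at the last byte, so nothing past the buffer is ever touched):

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stddef.h>

static void copy_no_overread(uint8_t *dst, const uint8_t *src, size_t n /* >= 16 */)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm_loadu_si128((const __m128i *)(src + i)));
    if (i < n)   /* tail: redo the last 16 bytes, overlapping the previous store */
        _mm_storeu_si128((__m128i *)(dst + n - 16),
                         _mm_loadu_si128((const __m128i *)(src + n - 16)));
}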
Non-faulting overread of an object is the kind of UB that definitely can't hurt if the compiler can't see it at compile time. The resulting asm will work as if the extra bytes were part of some object.
But even if it is visible at compile-time, it generally doesn't hurt with current compilers.
PS: a previous version of this answer claimed that unaligned deref of int * was also safe in C compiled for x86. That is not true. I was a bit too cavalier 3 years ago when writing that part. You need a typedef with GNU C __attribute__((aligned(1),may_alias)), or memcpy, to make that safe. The may_alias part isn't needed if you only access it via signed/unsigned int* and/or char*, i.e. in ways that wouldn't violate the normal C strict-aliasing rules.
The set of things ISO C leaves undefined but that Intel intrinsics requires compilers to define does include creating unaligned pointers (at least with types like __m128i*), but not dereferencing them directly. Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
Checking if a pointer is far enough from the end of a 4k page
This is useful for the first vector of strlen; after this you can p = (p+16) & -16 to go to the next aligned vector. This will partially overlap if p was not 16-byte aligned, but doing redundant work is sometimes the most compact way to set up for an efficient loop. Avoiding it might mean looping 1 byte at a time until an alignment boundary, and that's certainly worse.
e.g. check (((p + 15) ^ p) & 0xFFF...F000) == 0 (LEA / XOR / TEST), which tells you that the last byte of a 16-byte load has the same page-address bits as the first byte. Or p+15 <= (p|0xFFF) (LEA / OR / CMP with better ILP) checks that the last byte-address of the load is <= the last byte of the page containing the first byte.
Or more simply, (p & 4095) > (4096 - 16) (MOV / AND / CMP) is true when a 16-byte load would cross into the next page; equivalently, (p & (pgsize-1)) <= (pgsize - vecwidth) checks that the offset-within-page is far enough from the end of a page.
You can use 32-bit operand-size to save code size (REX prefixes) for this or any of the other checks because the high bits don't matter. Some compilers don't notice this optimization, so you can cast to unsigned int instead of uintptr_t, although to silence warnings about code that isn't 64-bit clean you might need to cast (unsigned)(uintptr_t)p. Further code-size saving can be had with ((unsigned int)p << 20) > ((4096 - vectorlen) << 20) (MOV / SHL / CMP), because shl reg, 20 is 3 bytes, vs. and eax, imm32 being 5, or 6 for any other register. (Using EAX will also allow the no-modrm short form for cmp eax, 0xfff.)
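In C, the simple form of that check might look like this sketch (4096 is the assumed page size and 16 the load width):

#include <stdint.h>

/* true when a 16-byte load starting at p cannot cross into the next 4 KiB page */
static int load16_stays_in_page(const void *p)
{
    return ((uintptr_t)p & 4095) <= 4096 - 16;
}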
If doing this in GNU C, you probably want typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias)); to make it safe to do unaligned accesses.
If you permit consideration of non-CPU devices, then one example of a potentially unsafe operation is accessing out-of-bounds regions of PCI-mapped memory pages. There's no guarantee that the target device is using the same page size or alignment as the main memory subsystem. Attempting to access, for example, address [cpu page base]+0x800 may trigger a device page fault if the device is in a 2KiB page mode. This will usually cause a system bugcheck.
I was curious as to whether or not there is any advantage, with regard to efficiency, in using memset() in a situation similar to the one below.
Given the following buffer declarations...
struct More_Buffer_Info
{
unsigned char a[10];
unsigned char b[10];
unsigned char c[10];
};
struct My_Buffer_Type
{
struct More_Buffer_Info buffer_info[100];
};
struct My_Buffer_Type my_buffer[5];
unsigned char *p;
p = (unsigned char *)my_buffer;
Besides having fewer lines of code, is there an advantage to using this:
memset((void *)p, 0, sizeof(my_buffer));
Over this:
for (i = 0; i < sizeof(my_buffer); i++)
{
*p++ = 0;
}
This applies to both memset() and memcpy():
Less Code: As you have already mentioned, it's shorter - fewer lines of code.
More Readable: Shorter usually makes it more readable as well. (memset() is more readable than that loop)
It can be faster: It can sometimes allow more aggressive compiler optimizations. (so it may be faster)
Misalignment: In some cases, when you're dealing with misaligned data on a processor that doesn't support misaligned accesses, memset() and memcpy() may be the only clean solution.
To expand on the 3rd point, memset() can be heavily optimized by the compiler using SIMD and such. If you write a loop instead, the compiler will first need to "figure out" what it does before it can attempt to optimize it.
The basic idea here is that memset() and similar library functions, in some sense, "tell" the compiler your intent.
As mentioned by @Oli in the comments, there are some downsides. I'll expand on them here:
You need to make sure that memset() actually does what you want. The standard doesn't guarantee that an all-bytes-zero pattern is the representation of zero for every data type (for example, null pointers and floating-point 0.0 are not required to be all-bits-zero, even though they are on common platforms).
For non-zero data, memset() is restricted to only 1 byte content. So you can't use memset() if you want to set an array of ints to something other than zero (or 0x01010101 or something...).
Although rare, there are some corner cases, where it's actually possible to beat the compiler in performance with your own loop.*
*I'll give one example of this from my experience:
Although memset() and memcpy() are usually compiler intrinsics with special handling by the compiler, they are still generic functions. They say nothing about the datatype including the alignment of the data.
So in a few (albeit rare) cases, the compiler isn't able to determine the alignment of the memory region, and thus must produce extra code to handle misalignment. Whereas if you, the programmer, are 100% sure of the alignment, using a loop might actually be faster.
A common example is when using SSE/AVX intrinsics. (such as copying a 16/32-byte aligned array of floats) If the compiler can't determine the 16/32-byte alignment, it will need to use misaligned load/stores and/or handling code. If you simply write a loop using SSE/AVX aligned load/store intrinsics, you can probably do better.
float *ptrA = ... // some unknown source, guaranteed to be 32-byte aligned
float *ptrB = ... // some unknown source, guaranteed to be 32-byte aligned
int length = ... // some unknown source, guaranteed to be multiple of 8
// memcpy() - The compiler can't read comments. It doesn't know the data is 32-byte
// aligned, so it may generate unnecessary misalignment-handling code.
memcpy(ptrA, ptrB, length * sizeof(float));
// This loop could potentially be faster because it "uses" the fact that
// the pointers are aligned. The compiler can also further optimize this.
for (int c = 0; c < length; c += 8){
_mm256_store_ps(ptrA + c, _mm256_load_ps(ptrB + c));
}
It depends on the quality of the compiler and the libraries. In most cases memset is superior.
The advantage of memset is that in many platforms it is actually a compiler intrinsic; that is, the compiler can "understand" the intention to set a large swath of memory to a certain value, and possibly generate better code.
In particular, that could mean using specific hardware operations for setting large regions of memory, like SSE on the x86, AltiVec on the PowerPC, NEON on the ARM, and so on. This can be an enormous performance improvement.
On the other hand, by using a for loop you are telling the compiler to do something more specific, "load this address into a register. Write a number to it. Add one to the address. Write a number to it," and so on. In theory a perfectly intelligent compiler would recognize this loop for what it is and turn it into a memset anyway; but I have never encountered a real compiler that did this.
So, the assumption is that memset was written by smart people to be the very best and fastest possible way to set a whole region of memory, for the specific platform and hardware the compiler supports. That is often, but not always, true.
Remember that this
for (i = 0; i < sizeof(my_buffer); i++)
{
p[i] = 0;
}
can also be faster than
for (i = 0; i < sizeof(my_buffer); i++)
{
*p++ = 0;
}
As already answered, the compiler often has hand-optimized routines for memset(), memcpy() and other string functions, and we are talking significantly faster. Note that the amount of code (the number of instructions) in a fast memcpy or memset from the compiler is usually much larger than the loop solution you suggested; fewer lines of code or fewer instructions does not mean faster.
Anyway, my message is: try both. Disassemble the code, see the difference, try to understand, and ask questions at Stack Overflow if you don't. Then use a timer and time the two solutions: call whichever memcpy/memset function thousands or hundreds of thousands of times and time the whole thing (to eliminate error in the timing). Make sure you do short copies, like say 7 items or 5 items, and large copies like hundreds of bytes per memset, and try some prime numbers while you are at it. On some processors on some systems, your loop can be faster for a few items like 3 or 5 or something like that, but very quickly it gets slow.
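A rough timing harness might look like the sketch below (clock_gettime is POSIX; note that an optimizing compiler may turn the byte loop into a memset call or remove it entirely, so check the generated asm before trusting the numbers):

#include <stdio.h>
#include <string.h>
#include <time.h>

static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    static unsigned char buf[4096];
    const int iters = 100000;

    double t0 = seconds_now();
    for (int i = 0; i < iters; i++)
        memset(buf, 0, sizeof buf);

    double t1 = seconds_now();
    for (int i = 0; i < iters; i++)
        for (size_t j = 0; j < sizeof buf; j++)
            buf[j] = 0;

    double t2 = seconds_now();
    printf("memset: %.3f s, byte loop: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}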
Here is one hint about performance. The DDR memory in your computer is likely 64 bits wide and needs to be written 64 bits at a time; maybe it has ECC and you have to compute across those bits and write 72 bits at a time. Not always that exact number, but follow the thought here; it will make sense for 32 bits or 64 or 128 or whatever. If you perform a single byte write instruction to RAM, the hardware is going to need to do one of two things. If there are no caches along the way, the memory system has to perform a 64-bit read, modify the one byte, then write it back. Without some sort of hardware optimization, writing 8 bytes within that one DRAM row is 16 memory cycles, and DRAM is very, very slow; don't be fooled by the 1333 MHz numbers.
Now if you have a cache, the first byte write is going to require a cache line read from DRAM, which is one or more of these 64-bit reads. The next 7 or 15 or whatever byte writes are probably going to be really fast, as they only go to the cache and not to DDR; eventually that cache line goes out to DRAM (slowly), as one, two, four, etc. of these 64-bit or whatever DDR locations. So even though you are only doing writes, you still have to read all of that RAM and then write it, so twice as many cycles as desired. If possible (and it is with some processors and memory systems), the memset, or the write part of a memcpy, can use single instructions that cover a whole cache line or whole DDR location with no read required: instantly doubled speed. This is not how all the optimizations work, but it hopefully gives you an idea of how to think about the problem. With your program being pulled into cache in cache lines, you can double or triple the number of instructions executed if, in return, you cut the number of DDR cycles in half or a quarter or more, and you win overall.
At a minimum, the compiler's memset and memcpy routines will perform a byte operation if the start address is odd, then a 16-bit operation if not aligned on 32 bits, then a 32-bit operation if not aligned on 64, and on up until they hit the optimal transfer size for that instruction set/system. On ARM they tend to aim for 128 bits. So the worst case on the front end would be a single byte, then a single halfword, then a few words, before getting into the main set or copy loop; in the case of ARM's 128-bit transfers, 128 bits are written per instruction. Then on the back end, if unaligned, the same deal: a few words, one halfword, one byte worst case. You will also see the libraries do things like: if the number of bytes is less than X, where X is a small number like 13 or so, then they go into a loop like yours and just copy some bytes, because the number of instructions and clock cycles to support that simple loop is smaller/faster. Disassemble or find the gcc source code for ARM, and probably MIPS and some other good processors, and see what I am talking about.
Two advantages:
The version with memset is easier to read - this is related to, but not the same as, having fewer lines of code. It takes less thinking to know what the memset version does, especially if you write it
memset(my_buffer, 0, sizeof(my_buffer));
instead of with the indirection through p and the unnecessary cast to void * (NOTE: only unnecessary if you're really coding in C and not C++ - some people are unclear on the difference).
memset is likely to be able to write 4 or 8 bytes at a time and/or take advantage of special cache hint instructions; therefore it may well be faster than your byte-at-a-time loop. (NOTE: Some compilers are clever enough to recognize a bulk-clearing loop and substitute either wider writes to memory or a call to memset. Your mileage may vary. Always measure performance before attempting to shave cycles.)
memset gives a standard way to write code, letting the particular platform/compiler libraries determine the most efficient mechanism. Based on data sizes it may for example do 32-bit or 64-bit stores as much as possible.
Your variable p is only required for the initialisation loop. The code for the memset should be simply
memset( my_buffer, 0, sizeof(my_buffer));
which is simpler and less error prone. The point of a void * parameter is exactly that it will accept any pointer type; the explicit cast is unnecessary, and the assignment to a pointer of a different type is pointless.
So one benefit of using memset() in this case is to avoid an unnecessary intermediate variable.
Another benefit is that memset() on any particular platform is likely to be optimised for the target platform, whereas your loop efficiency is dependent on the compiler and compiler settings.
It seems posix_memalign lets you choose a customized alignment, but when is that necessary?
malloc has already done the alignment work internally.
UPDATE
The exact reason I ask this is because I see nginx does this: ngx_memalign(NGX_POOL_ALIGNMENT, size, log);, where NGX_POOL_ALIGNMENT is defined as 16, nginxs.googlecode.com/svn-history/trunk/src/core/ngx_palloc.c
Basically, when you need stricter alignment than malloc will give you. malloc generally returns a pointer aligned such that it may be used with any of the primitive types (often 8 or 16 bytes on common desktop machines).
However, sometimes you need memory aligned on other boundaries, for example 4K-aligned, etc. In this case, you would need memalign.
You would need this, for example,
when writing a memory manager (such as a garbage collector). In this case, it is sometimes handy to work with memory aligned on larger block sizes. This way, you can store metadata common to all objects in a given block at the bottom of the allocated area, and access it simply by masking the least significant bits of the object pointer (see the sketch after this list).
when interfacing with hardware (never done this myself, but IIRC, certain kinds of block-devices require aligned memory). See n.m.'s answer for details.
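A sketch of the block-masking idea from the list above (BLOCK_SIZE and the header layout are hypothetical; posix_memalign requires the alignment to be a power of two and a multiple of sizeof(void *)):

#include <stdlib.h>
#include <stdint.h>

#define BLOCK_SIZE 4096   /* hypothetical power-of-two block size */

struct block_header { size_t used; /* ... per-block metadata ... */ };

static void *block_alloc(void)
{
    void *p = NULL;
    if (posix_memalign(&p, BLOCK_SIZE, BLOCK_SIZE) != 0)   /* block aligned to its own size */
        return NULL;
    return p;
}

/* recover the block header from any pointer into the block by masking low bits */
static struct block_header *header_of(void *obj)
{
    return (struct block_header *)((uintptr_t)obj & ~(uintptr_t)(BLOCK_SIZE - 1));
}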
The only benefits of posix_memalign, as far as I can tell, are:
Allocating page-aligned (typically 4096 or larger alignment) memory for hardware-specific purposes.
Evil hacks where you keep the low N bits of a pointer zero so you can store an N-bit integer in the low bits. :-)
Various hardware may have alignment requirements which malloc cannot satisfy. The Linux man page gives one such example, I quote:
On many systems there are alignment restrictions, e.g. on buffers used for direct block device I/O. POSIX specifies the pathconf(path, _PC_REC_XFER_ALIGN) call that tells what alignment is needed.
A couple of uses:
Some processors have instructions that will only work on data that is aligned on a power of two greater than or equal to the buffer size, for example the bit-reversed addressing instructions used in FFTs (fast Fourier transforms).
To align data to cache boundaries to optimize access in multiprocessing applications so that data in the same cache line isn't being accessed by two processors simultaneously.
Basically, if you don't need to do absurd levels of optimizations and/or your hardware doesn't demand that an array be on a particular boundary then you can forget about posix_memalign.