What's missing/sub-optimal in this memcpy implementation?

I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's
some guy's implementation:
__forceinline // Since Size is usually known,
// most useless code will be optimized out
// if the function is inlined.
void* myMemcpy(char* Dst, const char* Src, size_t Size)
{
void* start = Dst;
for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
{
__m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
_mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
}
#define CPY_1B *((uint8_t * &)Dst)++ = *((const uint8_t * &)Src)++
#define CPY_2B *((uint16_t* &)Dst)++ = *((const uint16_t* &)Src)++
#define CPY_4B *((uint32_t* &)Dst)++ = *((const uint32_t* &)Src)++
#if defined _M_X64 || defined _M_IA64 || defined __amd64
#define CPY_8B *((uint64_t* &)Dst)++ = *((const uint64_t* &)Src)++
#else
#define CPY_8B _mm_storel_epi64((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const uint64_t* &)Src, ++(uint64_t* &)Dst
#endif
#define CPY16B _mm_storeu_si128((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const __m128i* &)Src, ++(__m128i* &)Dst
switch (Size) {
case 0x00: break;
case 0x01: CPY_1B; break;
case 0x02: CPY_2B; break;
case 0x03: CPY_1B; CPY_2B; break;
case 0x04: CPY_4B; break;
case 0x05: CPY_1B; CPY_4B; break;
case 0x06: CPY_2B; CPY_4B; break;
case 0x07: CPY_1B; CPY_2B; CPY_4B; break;
case 0x08: CPY_8B; break;
case 0x09: CPY_1B; CPY_8B; break;
case 0x0A: CPY_2B; CPY_8B; break;
case 0x0B: CPY_1B; CPY_2B; CPY_8B; break;
case 0x0C: CPY_4B; CPY_8B; break;
case 0x0D: CPY_1B; CPY_4B; CPY_8B; break;
case 0x0E: CPY_2B; CPY_4B; CPY_8B; break;
case 0x0F: CPY_1B; CPY_2B; CPY_4B; CPY_8B; break;
case 0x10: CPY16B; break;
case 0x11: CPY_1B; CPY16B; break;
case 0x12: CPY_2B; CPY16B; break;
case 0x13: CPY_1B; CPY_2B; CPY16B; break;
case 0x14: CPY_4B; CPY16B; break;
case 0x15: CPY_1B; CPY_4B; CPY16B; break;
case 0x16: CPY_2B; CPY_4B; CPY16B; break;
case 0x17: CPY_1B; CPY_2B; CPY_4B; CPY16B; break;
case 0x18: CPY_8B; CPY16B; break;
case 0x19: CPY_1B; CPY_8B; CPY16B; break;
case 0x1A: CPY_2B; CPY_8B; CPY16B; break;
case 0x1B: CPY_1B; CPY_2B; CPY_8B; CPY16B; break;
case 0x1C: CPY_4B; CPY_8B; CPY16B; break;
case 0x1D: CPY_1B; CPY_4B; CPY_8B; CPY16B; break;
case 0x1E: CPY_2B; CPY_4B; CPY_8B; CPY16B; break;
case 0x1F: CPY_1B; CPY_2B; CPY_4B; CPY_8B; CPY16B; break;
}
#undef CPY_1B
#undef CPY_2B
#undef CPY_4B
#undef CPY_8B
#undef CPY16B
return start;
}
The comment translates as "Since Size is usually known, the compiler can optimize out most of the useless code when the function is inlined".
I would like to improve, if possible, on this implementation - but maybe there isn't much to improve. I see it uses SSE/AVX for the larger chunks of memory, then instead of a loop over the last < 32 bytes does the equivalent of manual unrolling, with some tweaking. So, here are my questions:
Why unroll the loop for the last several bytes, but not partially unroll the first (and now single) loop?
What about alignment issues? Aren't they important? Should I handle the first several bytes up to some alignment quantum differently, then perform the 256-bit ops on aligned sequences of bytes? And if so, how do I determine the appropriate alignment quantum?
What's the most important missing feature in this implementation (if any)?
Features/Principles mentioned in the answers so far
You should __restrict__ your parameters. (@chux)
The memory bandwidth is a limiting factor; measure your implementation against it. (@Zboson)
For small arrays, you can expect to approach the memory bandwidth; for larger arrays - not as much. (@Zboson)
Multiple threads (may be | are) necessary to saturate the memory bandwidth. (@Zboson)
It is probably wise to optimize differently for large and small copy sizes. (@Zboson)
(Alignment is important? Not explicitly addressed!)
The compiler should be made more explicitly aware of "obvious facts" it can use for optimization (such as the fact that Size < 32 after the first loop). (@chux)
There are arguments for unrolling your SSE/AVX calls (@BenJackson, here), and arguments against doing so (@PaulR)
Non-temporal transfers (with which you tell the CPU you don't need it to cache the target location) should be useful for copying larger buffers. (@Zboson)

I have been studying measuring memory bandwidth for Intel processors with various operations and one of them is memcpy. I have done this on Core2, Ivy Bridge, and Haswell. I did most of my tests using C/C++ with intrinsics (see the code below - but I'm currently rewriting my tests in assembly).
To write your own efficient memcpy function it's important to know what the absolute best bandwidth possible is. This bandwidth is a function of the size of the arrays which will be copied and therefore an efficient memcpy function needs to optimize differently for small and big (and maybe in between). To keep things simple I have optimized for small arrays of 8192 bytes and large arrays of 1 GB.
For small arrays the maximum read and write bandwidth for each core is:
Core2-Ivy Bridge 32 bytes/cycle
Haswell 64 bytes/cycle
This is the benchmark you should aim for with small arrays. For my tests I assume the arrays are aligned to 64 bytes and that the array size is a multiple of 8*sizeof(float)*unroll_factor. Here are my current memcpy results for a size of 8192 bytes (Ubuntu 14.04, GCC 4.9, EGLIBC 2.19):
                                  GB/s    efficiency
Core2 (p9600 @ 2.66 GHz)
    builtin:                      35.2    41.3%
    eglibc:                       39.2    46.0%
    asmlib:                       76.0    89.3%
    copy_unroll1:                 39.1    46.0%
    copy_unroll8:                 73.6    86.5%
Ivy Bridge (E5-1620 @ 3.6 GHz)
    builtin:                     102.2    88.7%
    eglibc:                      107.0    92.9%
    asmlib:                      107.6    93.4%
    copy_unroll1:                106.9    92.8%
    copy_unroll8:                111.3    96.6%
Haswell (i5-4250U @ 1.3 GHz)
    builtin:                      68.4    82.2%
    eglibc:                       39.7    47.7%
    asmlib:                       73.2    87.6%
    copy_unroll1:                 39.6    47.6%
    copy_unroll8:                 81.9    98.4%
The asmlib is Agner Fog's asmlib. The copy_unroll1 and copy_unroll8 functions are defined below.
From this table we can see that the GCC builtin memcpy does not work well on Core2 and that memcpy in EGLIBC does not work well on Core2 or Haswell. I did check out a head version of GLIBC recently and the performance was much better on Haswell. In all cases unrolling gets the best result.
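The efficiency column in the table above appears to be the measured bandwidth divided by the per-core peak (bytes per cycle times clock frequency). That formula is an assumption on my part, but it reproduces the listed figures. A minimal sketch, with hypothetical function names:

```c
/* Peak per-core bandwidth in GB/s: bytes moved per cycle times clock in GHz.
   Assumption: this is how the peak behind the table was derived. */
double peak_gbps(double bytes_per_cycle, double clock_ghz) {
    return bytes_per_cycle * clock_ghz;
}

/* Efficiency in percent: measured bandwidth relative to that peak. */
double efficiency_percent(double measured_gbps, double bytes_per_cycle,
                          double clock_ghz) {
    return 100.0 * measured_gbps / peak_gbps(bytes_per_cycle, clock_ghz);
}
```

For example, efficiency_percent(76.0, 32.0, 2.66) corresponds to the Core2 asmlib row and efficiency_percent(81.9, 64.0, 1.3) to the Haswell copy_unroll8 row.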
void copy_unroll1(const float *x, float *y, const int n) {
for(int i=0; i<n/JUMP; i++) {
VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
}
}
void copy_unroll8(const float *x, float *y, const int n) {
for(int i=0; i<n/JUMP; i+=8) {
VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
VECNF().LOAD(&x[JUMP*(i+1)]).STORE(&y[JUMP*(i+1)]);
VECNF().LOAD(&x[JUMP*(i+2)]).STORE(&y[JUMP*(i+2)]);
VECNF().LOAD(&x[JUMP*(i+3)]).STORE(&y[JUMP*(i+3)]);
VECNF().LOAD(&x[JUMP*(i+4)]).STORE(&y[JUMP*(i+4)]);
VECNF().LOAD(&x[JUMP*(i+5)]).STORE(&y[JUMP*(i+5)]);
VECNF().LOAD(&x[JUMP*(i+6)]).STORE(&y[JUMP*(i+6)]);
VECNF().LOAD(&x[JUMP*(i+7)]).STORE(&y[JUMP*(i+7)]);
}
}
Where VECNF().LOAD is _mm_load_ps() for SSE or _mm256_load_ps() for AVX, VECNF().STORE is _mm_store_ps() for SSE or _mm256_store_ps() for AVX, and JUMP is 4 for SSE or 8 for AVX.
For the large size the best result is obtained by using non-temporal store instructions and by using multiple threads. Contrary to what many people may believe a single thread does NOT usually saturate the memory bandwidth.
void copy_stream(const float *x, float *y, const int n) {
#pragma omp parallel for
for(int i=0; i<n/JUMP; i++) {
VECNF v = VECNF().load_a(&x[JUMP*i]);
stream(&y[JUMP*i], v);
}
}
Where stream is _mm_stream_ps() for SSE or _mm256_stream_ps() for AVX.
Here are the memcpy results on my E5-1620 @ 3.6 GHz with four threads for 1 GB with a maximum main memory bandwidth of 51.2 GB/s.
                 GB/s    efficiency
eglibc:          23.6    46%
asmlib:          36.7    72%
copy_stream:     36.7    72%
Once again EGLIBC performs poorly. This is because it does not use non-temporal stores.
I modified the eglibc and asmlib memcpy functions to run in parallel like this:
void COPY(const float * __restrict x, float * __restrict y, const int n) {
#pragma omp parallel
{
size_t my_start, my_size;
int id = omp_get_thread_num();
int num = omp_get_num_threads();
my_start = (id*n)/num;
my_size = ((id+1)*n)/num - my_start;
memcpy(y+my_start, x+my_start, sizeof(float)*my_size);
}
}
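The per-thread split above can be isolated into a small helper to check that the chunks tile the whole array. This is only a sketch: chunk_bounds is a hypothetical name, and it uses size_t arithmetic because the int product id*n in the code above can overflow for large n and higher thread counts.

```c
#include <stddef.h>

/* Half-open element range [start, start + size) owned by thread `id`
   out of `num` threads, using the same formula as the COPY wrapper. */
void chunk_bounds(size_t n, size_t id, size_t num,
                  size_t *start, size_t *size) {
    *start = (id * n) / num;
    *size = ((id + 1) * n) / num - *start;
}
```

Because each chunk ends exactly where the next one begins, the sizes sum to n even when n is not a multiple of the thread count.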
A general memcpy function needs to account for arrays which are not aligned to 64 bytes (or even to 32 or to 16 bytes) and where the size is not a multiple of 32 bytes or the unroll factor. Additionally, a decision has to be made as to when to use non-temporal stores. The general rule of thumb is to only use non-temporal stores for sizes larger than half the largest cache level (usually L3). But these are "second order" details which I think should be dealt with after optimizing for ideal cases of large and small. There's not much point in worrying about correcting for misalignment or non-ideal size multiples if the ideal case performs poorly as well.
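The rule of thumb for non-temporal stores could be written as a one-line predicate. This is a sketch only: should_use_nt_stores is a hypothetical name, and the size of the largest cache level is assumed to be supplied by the caller (it is not detected here).

```c
#include <stdbool.h>
#include <stddef.h>

/* True when the copy is larger than half of the largest cache level.
   llc_bytes is an assumed, caller-provided cache size in bytes. */
bool should_use_nt_stores(size_t copy_bytes, size_t llc_bytes) {
    return copy_bytes > llc_bytes / 2;
}
```

With a hypothetical 10 MiB L3, the 8192-byte case would stay on regular stores and the 1 GB case would switch to non-temporal stores.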
Update
Based on comments by Stephen Canon I have learned that on Ivy Bridge and Haswell it's more efficient to use rep movsb than movntdq (a non-temporal store instruction). Intel calls this enhanced rep movsb (ERMSB). This is described in the Intel Optimization manuals in section 3.7.6, Enhanced REP MOVSB and STOSB operation (ERMSB).
Additionally, in Agner Fog's Optimizing Subroutines in Assembly manual in section 17.9 Moving blocks of data (All processors) he writes:
"There are several ways of moving large blocks of data. The most common methods are:
1. REP MOVS instruction.
2. If data are aligned: Read and write in a loop with the largest available register size.
3. If size is constant: inline move instructions.
4. If data are misaligned: First move as many bytes as required to make the destination aligned. Then read unaligned and write aligned in a loop with the largest available register size.
5. If data are misaligned: Read aligned, shift to compensate for misalignment and write aligned.
6. If the data size is too big for caching, use non-temporal writes to bypass the cache. Shift to compensate for misalignment, if necessary."
A general memcpy should consider each of these points. Additionally, with Ivy Bridge and Haswell it seems that point 1 is better than point 6 for large arrays. Different techniques are necessary for Intel and AMD and for each iteration of technology. I think it's clear that writing your own general efficient memcpy function can be quite complicated. But in the special cases I have looked at I have already managed to do better than the GCC builtin memcpy or the one in EGLIBC, so the assumption that you can't do better than the standard libraries is incorrect.

The question can't be answered precisely without some additional details such as:
What is the target platform (CPU architecture, mostly, but memory configuration plays a role too)?
What is the distribution and predictability [1] of the copy lengths (and to a lesser extent, the distribution and predictability of alignments)?
Will the copy size ever be statically known at compile-time?
Still, I can point out a couple things that are likely to be sub-optimal for at least some combination of the above parameters.
32-case Switch Statement
The 32-case switch statement is a cute way of handling the trailing 0 to 31 bytes, and likely benchmarks very well - but may perform badly in the real world due to at least two factors.
Code Size
This switch statement alone takes several hundred bytes of code for the body, in addition to a 32-entry lookup table needed to jump to the correct location for each length. The cost of this isn't going to show up in a focused benchmark of memcpy on a full-sized CPU because everything still fits in the fastest cache level: but in the real world you execute other code too and there is contention for the uop cache and L1 data and instruction caches.
That many instructions may take fully 20% of the effective size of your uop cache [3], and uop cache misses (and the corresponding cache-to-legacy encoder transition cycles) could easily wipe out the small benefit given by this elaborate switch.
On top of that, the switch requires a 32-entry, 256-byte lookup table for the jump targets [4]. If you ever get a miss to DRAM on that lookup, you are talking about a penalty of 150+ cycles: how many non-misses do you then need to make the switch worth it, given that it's probably saving a few cycles at most? Again, that won't show up in a microbenchmark.
For what it's worth, this memcpy isn't unusual: that kind of "exhaustive enumeration of cases" is common even in optimized libraries. I can conclude that either their development was driven mostly by microbenchmarks, or that it is still worth it for a large slice of general purpose code, despite the downsides. That said, there are certainly scenarios (instruction and/or data cache pressure) where this is suboptimal.
Branch Prediction
The switch statement relies on a single indirect branch to choose among the alternatives. This is going to be efficient to the extent that the branch predictor can predict this indirect branch, which basically means that the sequence of observed lengths needs to be predictable.
Because it is an indirect branch, there are more limits on the predictability of the branch than a conditional branch since there are a limited number of BTB entries. Recent CPUs have made strides here, but it is safe to say that if the series of lengths fed to memcpy don't follow a simple repeating pattern of a short period (as short as 1 or 2 on older CPUs), there will be a branch-mispredict on each call.
This issue is particularly insidious because it is likely to hurt you the most in real-world in exactly the situations where a microbenchmark shows the switch to be the best: short lengths. For very long lengths, the behavior on the trailing 31 bytes isn't very important since it is dominated by the bulk copy. For short lengths, the switch is all-important (indeed, for copies of 31 bytes or less it is all that executes)!
For these short lengths, a predictable series of lengths works very well for the switch since the indirect jump is basically free. In particular, a typical memcpy benchmark "sweeps" over a series of lengths, using the same length repeatedly for each sub-test to report the results for easy graphing of "time vs length" graphs. The switch does great on these tests, often reporting results like 2 or 3 cycles for small lengths of a few bytes.
In the real world, your lengths might be small but unpredictable. In that case, the indirect branch will frequently mispredict [5], with a penalty of ~20 cycles on modern CPUs. Compared to the best case of a couple of cycles it is an order of magnitude worse. So the glass jaw here can be very serious (i.e., the behavior of the switch in this typical case can be an order of magnitude worse than the best, whereas at long lengths, you are usually looking at a difference of 50% at most between different strategies).
Solutions
So how can you do better than the above, at least under the conditions where the switch falls apart?
Use Duff's Device
One solution to the code size issue is to combine the switch cases together, Duff's device-style.
For example, the assembled code for the length 1, 3 and 7 cases looks like:
Length 1
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
ret
Length 3
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
movzx edx, WORD PTR [rsi+1]
mov WORD PTR [rcx+1], dx
ret
Length 7
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
movzx edx, WORD PTR [rsi+1]
mov WORD PTR [rcx+1], dx
mov edx, DWORD PTR [rsi+3]
mov DWORD PTR [rcx+3], edx
ret
This can be combined into a single case, with various jump-ins:
len7:
mov edx, DWORD PTR [rsi-6]
mov DWORD PTR [rcx-6], edx
len3:
movzx edx, WORD PTR [rsi-2]
mov WORD PTR [rcx-2], dx
len1:
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
ret
The labels don't cost anything, and they combine the cases together and remove two out of 3 ret instructions. Note that the basis for rsi and rcx has changed here: they point to the last byte to copy from/to, rather than the first. That change is free or very cheap depending on the code before the jump.
You can extend that for longer lengths (e.g., you can attach lengths 15 and 31 to the chain above), and use other chains for the missing lengths. The full exercise is left to the reader. You can probably get a 50% size reduction alone from this approach, and much better if you combine it with something else to collapse the sizes from 16 - 31.
This approach only helps with the code size (and possibly the jump table size, if you shrink the size as described in footnote 4 and you get under 256 bytes, allowing a byte-sized lookup table). It does nothing for predictability.
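In C, the same chain of jump-ins can be sketched with a fall-through switch. This is an illustration of the 1/3/7 chain only, not a full replacement for the 32-case switch. The function name copy_1_3_7 is hypothetical, and the fixed-size memcpy calls stand in for the word and dword moves in the assembly (compilers typically lower them to single moves).

```c
#include <stddef.h>
#include <string.h>

/* Copies len bytes for len in {1, 3, 7}. Pointers are rebased to the
   last byte, as in the assembly. Returns 0 on success, -1 otherwise. */
int copy_1_3_7(unsigned char *dst, const unsigned char *src, size_t len) {
    if (len == 0) return -1;
    unsigned char *d = dst + len - 1;        /* last byte to write */
    const unsigned char *s = src + len - 1;  /* last byte to read */
    switch (len) {
    case 7: memcpy(d - 6, s - 6, 4); /* fall through */
    case 3: memcpy(d - 2, s - 2, 2); /* fall through */
    case 1: *d = *s; return 0;
    default: return -1;              /* other lengths need other chains */
    }
}
```

Each longer case reuses the code of the shorter ones, which is where the size reduction comes from. The switch itself is still an indirect branch, so predictability is unchanged.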
Overlapping Stores
One trick that helps for both code size and predictability is to use overlapping stores. That is, memcpy of 8 to 15 bytes can be accomplished in a branch-free way with two 8-byte stores, with the second store partly overlapping the first. For example, to copy 11 bytes, you would do an 8-byte copy at relative position 0 and 11 - 8 == 3. Some of the bytes in the middle would be "copied twice", but in practice this is fine since an 8-byte copy is the same speed as a 1, 2 or 4-byte one.
The C code looks like:
if (Size >= 8) {
*((uint64_t*)Dst) = *((const uint64_t*)Src);
size_t offset = Size & 0x7;
*(uint64_t *)(Dst + offset) = *(const uint64_t *)(Src + offset);
}
... and the corresponding assembly is not problematic:
cmp rdx, 7
jbe .L8
mov rcx, QWORD PTR [rsi]
and edx, 7
mov QWORD PTR [rdi], rcx
mov rcx, QWORD PTR [rsi+rdx]
mov QWORD PTR [rdi+rdx], rcx
In particular, note that you get exactly two loads, two stores and one and (in addition to the cmp and jmp whose existence depends on how you organize the surrounding code). That's already tied with or better than most of the compiler-generated approaches for 8-15 bytes, which might use up to 4 load/store pairs.
Older processors suffered some penalty for such "overlapping stores", but newer architectures (the last decade or so, at least) seem to handle them without penalty [6]. This has two main advantages:
The behavior is branch free for a range of sizes. Effectively, this quantizes the branching so that many values take the same path. All sizes from 8 to 15 (or 8 to 16 if you want) take the same path and suffer no misprediction pressure.
At least 8 or 9 different cases from the switch are subsumed into a single case with a fraction of the total code size.
This approach can be combined with the switch approach, but using only a few cases, or it can be extended to larger sizes with conditional moves that could do, for example, all moves from 8 to 31 bytes without branches.
What works out best again depends on the branch distribution, but overall this "overlapping" technique works very well.
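A variant of the overlapping-store idea that stays within standard C can be sketched as follows. It routes the 8-byte moves through memcpy into uint64_t temporaries instead of casting the char pointers, and it uses size - 8 as the second offset so that size 16 is covered as well. The function name copy_8_to_16 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies size bytes, 8 <= size <= 16, using two 8-byte moves whose
   destinations overlap whenever size < 16. No branch on size. */
void copy_8_to_16(void *dst, const void *src, size_t size) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t head, tail;
    memcpy(&head, s, 8);              /* first 8 bytes */
    memcpy(&tail, s + size - 8, 8);   /* last 8 bytes */
    memcpy(d, &head, 8);
    memcpy(d + size - 8, &tail, 8);
}
```

For size 11 this writes bytes 0..7 and then bytes 3..10, so bytes 3..7 are written twice with the same values.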
Alignment
The existing code doesn't address alignment.
In fact, it isn't, in general, legal C or C++, since the char * pointers are simply cast to larger types and dereferenced, which is not legal - although in practice it generates code that works on today's x86 compilers (but would fail on platforms with stricter alignment requirements).
Beyond that, it is often better to handle the alignment specifically. There are three main cases:
The source and destination are already aligned. Even the original algorithm will work fine here.
The source and destination are relatively aligned, but absolutely misaligned. That is, there is a value A that can be added to both the source and destination such that both are aligned.
The source and destination are fully misaligned (i.e., they are not actually aligned and case (2) does not apply).
The existing algorithm will work ok in case (1). It is potentially missing a large optimization in case (2), since a small intro loop could turn an unaligned copy into an aligned one.
It is also likely performing poorly in case (3), since in general in the totally misaligned case you can choose to either align the destination or the source and then proceed "semi-aligned".
The alignment penalties have been getting smaller over time and on the most recent chips are modest for general purpose code but can still be serious for code with many loads and stores. For large copies, it probably doesn't matter too much since you'll end up DRAM bandwidth limited, but for smaller copies misalignment may reduce throughput by 50% or more.
If you use NT stores, alignment can also be important, because many of the NT store instructions perform poorly with misaligned arguments.
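As a rough illustration of the "intro" idea, the sketch below computes how many bytes are needed to reach a 32-byte boundary, tests whether source and destination are relatively aligned, and copies with a short intro so that the bulk loop writes to an aligned destination. All names are hypothetical, the 32-byte quantum is an assumption matching the AVX register width, and the 32-byte memcpy in the loop is a stand-in for an aligned vector store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes to advance p to the next multiple of align (a power of two). */
size_t bytes_to_align(const void *p, size_t align) {
    return (size_t)(-(uintptr_t)p & (uintptr_t)(align - 1));
}

/* Case (2): both pointers have the same offset within an aligned block. */
int relatively_aligned(const void *dst, const void *src, size_t align) {
    return (((uintptr_t)dst ^ (uintptr_t)src) & (uintptr_t)(align - 1)) == 0;
}

/* Intro / aligned bulk / outro structure, aligning the destination. */
void *copy_dst_aligned(void *dst, const void *src, size_t size) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t head = bytes_to_align(d, 32);
    if (head > size) head = size;
    memcpy(d, s, head);                  /* intro: at most 31 bytes */
    d += head; s += head; size -= head;
    while (size >= 32) {                 /* d is 32-byte aligned here */
        memcpy(d, s, 32);
        d += 32; s += 32; size -= 32;
    }
    memcpy(d, s, size);                  /* outro: at most 31 bytes */
    return dst;
}
```

When relatively_aligned() holds, the same intro aligns the source too, so the bulk loop could use aligned loads as well as aligned stores.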
No unrolling
The code is not unrolled and compilers unroll by different amounts by default. Clearly this is suboptimal since among two compilers with different unroll strategies, at most one will be best.
The best approach (at least for known platform targets) is to determine which unroll factor is best, and then apply that in the code.
Furthermore, the unrolling can often be combined in a smart way with the "intro" or "outro" code, doing a better job than the compiler could.
Known sizes
The primary reason that it is tough to beat the "builtin" memcpy routine with modern compilers is that compilers don't just call a library memcpy whenever memcpy appears in the source. They know the contract of memcpy and are free to implement it with a single inlined instruction, or even less [7], in the right scenario.
This is especially obvious with known lengths in memcpy. In this case, if the length is small, compilers will just insert a few instructions to perform the copy efficiently and in-place. This not only avoids the overhead of the function call, but all the checks about size and so on - and also generates at compile time efficient code for the copy, much like the big switch in the implementation above - but without the costs of the switch.
Similarly, the compiler knows a lot about the alignment of structures in the calling code, and can create code that deals efficiently with alignment.
If you just implement a memcpy2 as a library function, that is tough to replicate. You can get part of the way there by splitting the method into a small and big part: the small part appears in the header file, and does some size checks and potentially just calls the existing memcpy if the size is small or delegates to the library routine if it is large. Through the magic of inlining, you might get to the same place as the builtin memcpy.
Finally, you can also try tricks with __builtin_constant_p or equivalents to handle the small, known case efficiently.
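One way the header/library split with __builtin_constant_p might look is sketched below. It assumes GCC or Clang (other compilers take the fallback branch), the 32-byte cutoff is an arbitrary illustration, and my_memcpy_large is a hypothetical placeholder for the real out-of-line routine.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical out-of-line routine for large or runtime-sized copies.
   A plain byte loop stands in for the optimized implementation. */
void *my_memcpy_large(void *dst, const void *src, size_t size) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (size--) *d++ = *s++;
    return dst;
}

#if defined(__GNUC__) || defined(__clang__)
/* Small, compile-time-known sizes go to memcpy, which the compiler can
   expand inline; everything else goes to the library routine. */
#define MY_MEMCPY(dst, src, size)                     \
    ((__builtin_constant_p(size) && (size) <= 32)     \
         ? memcpy((dst), (src), (size))               \
         : my_memcpy_large((dst), (src), (size)))
#else
#define MY_MEMCPY(dst, src, size) my_memcpy_large((dst), (src), (size))
#endif
```

Both branches produce the same bytes, so the choice only affects which code the compiler emits at the call site.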
[1] Note that I'm drawing a distinction here between the "distribution" of sizes - e.g., you might say uniformly distributed between 8 and 24 bytes - and the "predictability" of the actual sequence of sizes (e.g., do the sizes have a predictable pattern)? The question of predictability is somewhat subtle because it depends on the implementation, since as described above certain implementations are inherently more predictable.
[2] In particular, ~750 bytes of instructions in clang and ~600 bytes in gcc for the body alone, on top of the 256-byte jump lookup table for the switch body which had 180 - 250 instructions (gcc and clang respectively). Godbolt link.
[3] Basically 200 fused uops out of an effective uop cache size of 1000 instructions. While recent x86 have had uop cache sizes around ~1500 uops, you can't use it all outside of extremely dedicated padding of your codebase because of the restrictive code-to-cache assignment rules.
[4] The switch cases have different compiled lengths, so the jump can't be directly calculated. For what it's worth, it could have been done differently: they could have used a 16-bit value in the lookup table at the cost of not using memory-source for the jmp, cutting its size by 75%.
[5] Unlike conditional branch prediction, which has a typical worst-case prediction rate of ~50% (for totally random branches), a hard-to-predict indirect branch can easily approach 100% since you aren't flipping a coin, you are choosing from an almost infinite set of branch targets. This happens in the real world: if memcpy is being used to copy small strings with lengths uniformly distributed between 0 and 30, the switch code will mispredict ~97% of the time.
[6] Of course, there may be penalties for misaligned stores, but these are also generally small and have been getting smaller.
[7] For example, a memcpy to the stack, followed by some manipulation and a copy somewhere else may be totally eliminated, directly moving the original data to its final location. Even things like malloc followed by memcpy can be totally eliminated.

Firstly the main loop uses unaligned AVX vector loads/stores to copy 32 bytes at a time, until there are < 32 bytes left to copy:
for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
{
__m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
_mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
}
Then the final switch statement handles the residual 0..31 bytes in as efficient a manner as possible, using a combination of 16/8/4/2/1 byte copies as appropriate. Note that this is not an unrolled loop - it's just 32 different optimised code paths which handle the residual bytes using the minimum number of loads and stores.
As for why the main 32 byte AVX loop is not manually unrolled - there are several possible reasons for this:
most compilers will unroll small loops automatically (depending on loop size and optimisation switches)
excessive unrolling can cause small loops to spill out of the LSD cache (typically only 28 decoded µops)
on current Core iX CPUs you can only issue two concurrent loads/stores before you stall [*]
typically even a non-unrolled AVX loop like this can saturate available DRAM bandwidth [*]
[*] note that the last two comments above apply to cases where source and/or destination are not in cache (i.e. writing/reading to/from DRAM), and therefore load/store latency is high.

Taking Benefits of The ERMSB
Please also consider using REP MOVSB for larger blocks.
As you know, since the first Pentium CPU was produced in 1993, Intel began to make simple commands faster and complex commands (like REP MOVSB) slower. So REP MOVSB became very slow, and there was no more reason to use it. In 2013, Intel decided to revisit REP MOVSB. If the CPU has the CPUID ERMSB (Enhanced REP MOVSB) bit, then REP MOVSB commands are executed differently than on older processors, and are supposed to be fast. In practice, it is only fast for large blocks, 256 bytes and larger, and only when certain conditions are met:
both the source and destination addresses have to be aligned to a 16-Byte boundary;
the source region should not overlap with the destination region;
the length has to be a multiple of 64 to produce higher performance;
the direction has to be forward (CLD).
See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Intel recommends using AVX for blocks smaller than 2048 bytes. For larger blocks, Intel recommends using REP MOVSB. This is because of the high initial startup cost of REP MOVSB (about 35 cycles).
I have done speed tests, and for blocks of 2048 bytes and larger, the performance of REP MOVSB is unbeatable. However, for blocks smaller than 256 bytes, REP MOVSB is very slow, even slower than plain MOV RAX back and forth in a loop.
Please note that ERMSB only affects MOVSB, not MOVSD (MOVSQ), so MOVSB is a little bit faster than MOVSD (MOVSQ).
So, you can use AVX for your memcpy() implementation, and if the block is larger than 2048 bytes and all the conditions are met, then call REP MOVSB - so your memcpy() implementation will be unbeatable.
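The dispatch described above could be sketched as a pure decision function. The thresholds and conditions are the ones stated in this answer, not independently verified. The enum, the function name choose_copy_path, and the cpu_has_ermsb flag (however the caller detects the CPUID bit) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { COPY_VECTOR_LOOP, COPY_REP_MOVSB } copy_path;

/* Picks REP MOVSB only for blocks of 2048 bytes or more on a CPU with
   ERMSB, when both pointers are 16-byte aligned. Non-overlap is already
   part of memcpy's contract; the multiple-of-64 length condition is
   treated as a soft preference and not checked here. */
copy_path choose_copy_path(const void *dst, const void *src, size_t size,
                           bool cpu_has_ermsb) {
    bool aligned16 = (((uintptr_t)dst | (uintptr_t)src) & 15u) == 0;
    if (cpu_has_ermsb && size >= 2048 && aligned16)
        return COPY_REP_MOVSB;
    return COPY_VECTOR_LOOP;
}
```

Keeping the decision separate from the copy loops makes it easy to adjust the threshold after measuring on a particular CPU.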
Taking Benefits of The Out-of-Order Execution Engine
You can also read about the Out-of-Order Execution Engine in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf section 2.1.2, and take advantage of it.
For example, the Intel Skylake processor series (launched in 2015) has:
4 execution units for the arithmetic logic unit (ALU) (add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup*),
3 execution units for the vector ALU ((v)pand, (v)por, (v)pxor, (v)movq, (v)movap*, (v)movup*, (v)andp*, (v)orp*, (v)paddb/w/d/q, (v)blendv*, (v)blendp*, (v)pblendd).
So we can occupy the above units (3+4) in parallel if we use register-only operations. We cannot use 3+4 instructions in parallel for memory copy. We can simultaneously use at most two 32-byte instructions to load from memory and one 32-byte instruction to store to memory, even if we are working with the Level-1 cache.
Please see the Intel manual again to understand on how to do the fastest memcpy implementation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Section 2.2.2 (The Out-of-Order Engine of the Haswell microarchitecture): "The Scheduler controls the dispatch of micro-ops onto the dispatch ports. There are eight dispatch ports to support the out-of-order execution core. Four of the eight ports provided execution resources for computational operations. The other 4 ports support memory operations of up to two 256-bit load and one 256-bit store operation in a cycle."
Section 2.2.4 (Cache and Memory Subsystem) has the following note: "First level data cache supports two load micro-ops each cycle; each micro-op can fetch up to 32-bytes of data."
Section 2.2.4.1 (Load and Store Operation Enhancements) has the following information: The L1 data cache can handle two 256-bit (32 bytes) load and one 256-bit (32 bytes) store operations each cycle. The unified L2 can service one cache line (64 bytes) each cycle. Additionally, there are 72 load buffers and 42 store buffers available to support micro-ops execution in-flight.
The other sections (2.3 and so on, dedicated to Sandy Bridge and other microarchitectures) basically reiterate the above information.
The section 2.3.4 (The Execution Core) gives additional details.
The scheduler can dispatch up to six micro-ops every cycle, one on each port. The following table summarizes which operations can be dispatched on which port.
Port 0: ALU, Shift, Mul, STTNI, Int-Div, 128b-Mov, Blend, 256b-Mov
Port 1: ALU, Fast LEA, Slow LEA, MUL, Shuf, Blend, 128bMov, Add, CVT
Port 2 & Port 3: Load_Addr, Store_addr
Port 4: Store_data
Port 5: ALU, Shift, Branch, Fast LEA, Shuf, Blend, 128b-Mov, 256b-Mov
The section 2.3.5.1 (Load and Store Operation Overview) may also be useful to understand on how to make fast memory copy, as well as the section 2.4.4.1 (Loads and Stores).
For the other processor architectures, it is again - two load units and one store unit. Table 2-4 (Cache Parameters of the Skylake Microarchitecture) has the following information:
Peak Bandwidth (bytes/cyc):
First Level Data Cache: 96 bytes (2x32B Load + 1*32B Store)
Second Level Cache: 64 bytes
Third Level Cache: 32 bytes.
I have also done speed tests on my Intel Core i5 6600 CPU (Skylake, 14nm, released in September 2015) with DDR4 memory, and this has confirmed the theory. For example, my tests have shown that using generic 64-bit registers for memory copy, even many registers in parallel, degrades performance. Also, using just 2 XMM registers is enough - adding a 3rd doesn't add performance.
If your CPU has the AVX CPUID bit, you may take advantage of the large, 256-bit (32 byte) YMM registers to copy memory, to occupy two full load units. AVX support was first introduced by Intel with the Sandy Bridge processors, shipping in Q1 2011, and later on by AMD with the Bulldozer processor shipping in Q3 2011.
; first cycle
vmovdqa ymm0, ymmword ptr [rcx+0]      ; load 1st 32-byte part using first load unit
vmovdqa ymm1, ymmword ptr [rcx+20h]    ; load 2nd 32-byte part using second load unit
; second cycle
vmovdqa ymmword ptr [rdx+0], ymm0      ; store 1st 32-byte part using the single store unit
; third cycle
vmovdqa ymmword ptr [rdx+20h], ymm1    ; store 2nd 32-byte part using the single store unit (this needs a separate cycle, since there is only one store unit and two stores cannot issue in a single cycle)
add rcx, 40h                           ; these instructions use a different unit since they neither load nor store, so they don't require a new cycle
add rdx, 40h
Also, there is a speed benefit if you loop-unroll this code at least 8 times. As I wrote before, adding more registers besides ymm0 and ymm1 doesn't increase performance, because there are just two load units and one store unit. Adding loop control like "dec r9 / jnz @@again" degrades the performance, but a simple "add rcx/rdx" does not.
Finally, if your CPU has AVX-512 extension, you can use 512-bit (64-byte) registers to copy memory:
vmovdqu64 zmm0, [rcx+0] ; load 1st 64-byte part
vmovdqu64 zmm1, [rcx+40h] ; load 2nd 64-byte part
vmovdqu64 [rdx+0], zmm0 ; store 1st 64-byte part
vmovdqu64 [rdx+40h], zmm1 ; store 2nd 64-byte part
add rcx, 80h
add rdx, 80h
AVX-512 is supported by the following processors: Xeon Phi x200, released in 2016; Skylake EP/EX Xeon "Purley" (Xeon E5-26xx V5) processors (H2 2017); Cannonlake processors (H2 2017); Skylake-X processors - Core i9-7×××X, i7-7×××X, i5-7×××X - released in June 2017.
Please note that the memory has to be aligned to the size of the registers that you are using. If it is not, please use the "unaligned" instructions: vmovdqu and movups.
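As a rough C counterpart to the assembly above, here is a minimal sketch that copies 32 bytes per iteration through two 128-bit registers with unaligned SSE2 loads and stores (the answer reports that two XMM registers were enough in its tests). The name copy_two_regs is made up for illustration. The code falls back to plain memcpy where SSE2 is not available, and it is not tuned in any way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: copy n bytes from src to dst (buffers must not overlap). */
static void *copy_two_regs(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(d != NULL && s != NULL);
#if HAVE_SSE2
    /* Main loop: two independent loads, then two stores, 32 bytes per pass. */
    while (n >= 32) {
        __m128i a = _mm_loadu_si128((const __m128i *)s);
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        _mm_storeu_si128((__m128i *)d, a);
        _mm_storeu_si128((__m128i *)(d + 16), b);
        s += 32;
        d += 32;
        n -= 32;
    }
#endif
    /* Tail (or the whole buffer without SSE2): leave it to the library. */
    memcpy(d, s, n);
    return dst;
}
```

The unaligned load/store intrinsics are used so the sketch works for any pointer; with buffers known to be 16-byte aligned, the aligned variants could be swapped in.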

Related

Benchmarking memory copy in a single shot

Whiskey Lake i7-8565U
I'm trying to learn how to write benchmarks in a single shot by hands (without using any benchmarking frameworks) on an example of memory copy routine with regular and NonTemporal writes to WB memory and would like to ask for some sort of review.
Declaration:
void *avx_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
void *avx_nt_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
Definition:
avx_memcpy_forward_llss:
shr rdx, 0x3
xor rcx, rcx
avx_memcpy_forward_loop_llss:
vmovdqa ymm0, [rsi + 8*rcx]
vmovdqa ymm1, [rsi + 8*rcx + 0x20]
vmovdqa [rdi + rcx*8], ymm0
vmovdqa [rdi + rcx*8 + 0x20], ymm1
add rcx, 0x08
cmp rdx, rcx
ja avx_memcpy_forward_loop_llss
ret
avx_nt_memcpy_forward_llss:
shr rdx, 0x3
xor rcx, rcx
avx_nt_memcpy_forward_loop_llss:
vmovdqa ymm0, [rsi + 8*rcx]
vmovdqa ymm1, [rsi + 8*rcx + 0x20]
vmovntdq [rdi + rcx*8], ymm0
vmovntdq [rdi + rcx*8 + 0x20], ymm1
add rcx, 0x08
cmp rdx, rcx
ja avx_nt_memcpy_forward_loop_llss
ret
Benchmark code:
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <immintrin.h>
#include <x86intrin.h>
#include "memcopy.h"
#define BUF_SIZE (128 * 1024 * 1024)
_Alignas(64) char src[BUF_SIZE];
_Alignas(64) char dest[BUF_SIZE];
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t));
static inline void cache_flush(char *buf, size_t size);
static inline void generate_data(char *buf, size_t size);
uint64_t run_benchmark(unsigned wa_iteration, void *(*copy_fn)(void *, const void *, size_t)){
generate_data(src, sizeof src);
warmup(4, copy_fn);
cache_flush(src, sizeof src);
cache_flush(dest, sizeof dest);
__asm__ __volatile__("mov $0, %%rax\n cpuid":::"rax", "rbx", "rcx", "rdx", "memory");
uint64_t cycles_start = __rdpmc((1 << 30) + 1);
copy_fn(dest, src, sizeof src);
__asm__ __volatile__("lfence" ::: "memory");
uint64_t cycles_end = __rdpmc((1 << 30) + 1);
return cycles_end - cycles_start;
}
int main(void){
uint64_t single_shot_result = run_benchmark(1024, avx_memcpy_forward_llss);
printf("Core clock cycles = %" PRIu64 "\n", single_shot_result);
}
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t)){
while(wa_iterations --> 0){
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
}
}
static inline void generate_data(char *buf, size_t sz){
int fd = open("/dev/urandom", O_RDONLY);
read(fd, buf, sz);
}
static inline void cache_flush(char *buf, size_t sz){
/* _SC_LEVEL1_DCACHE_LINESIZE is a sysconf() selector, not a byte count; step by the 64-byte line size instead */
for(size_t i = 0; i < sz; i += 64){
_mm_clflush(buf + i);
}
}
Results:
avx_memcpy_forward_llss median: 44479368 core cycles
UPD: time
real 0m0,217s
user 0m0,093s
sys 0m0,124s
avx_nt_memcpy_forward_llss median: 24053086 core cycles
UPD: time
real 0m0,184s
user 0m0,056s
sys 0m0,128s
UPD: The result was gotten when running the benchmark with taskset -c 1 ./bin
So I got almost a 2x difference in core cycles between the two memory copy routines. My interpretation is that with regular stores to WB memory, the RFO requests compete for bus bandwidth, as described in the Intel Optimization Manual (IOM/3.6.12, emphasis mine):
Although the data bandwidth of full 64-byte bus writes due to
non-temporal stores is twice that of bus writes to WB memory,
transferring 8-byte chunks wastes bus request bandwidth and delivers
significantly lower data bandwidth.
QUESTION 1: How do I analyze a benchmark that runs as a single shot? Perf counters do not seem to be useful because of the perf startup overhead and the warmup iteration overhead.
QUESTION 2: Is such a benchmark correct? I execute cpuid at the beginning in order to start measuring with clean CPU resources and avoid stalls due to previous instructions in flight. I added memory clobbers as a compiler barrier, and lfence to prevent rdpmc from being executed out of order.

Whenever possible, benchmarks should report results in ways that allow as much "sanity-checking" as possible. In this case, a few ways to enable such checks include:
1. For tests involving main memory bandwidth, results should be presented in units that allow direct comparison with the known peak DRAM bandwidth of the system. For a typical configuration of the Core i7-8565U, this is 2 channels * 8 Bytes/transfer * 2.4 billion transfers/sec = 38.4 GB/s (See also item (6), below.)
2. For tests that involve transfer of data anywhere in the memory hierarchy, the results should include a clear description of the size of the "memory footprint" (number of distinct cache line addresses accessed times the cache line size) and the number of repetitions of the transfer(s). Your code is easy to read here and the size is completely reasonable for a main memory test.
3. For any timed test, the absolute time should be included to enable comparison against plausible overheads of timing. Your use of only the CORE_CYCLES_UNHALTED counter makes it impossible to compute the elapsed time directly (though the test is clearly long enough that timing overheads are negligible).
Other important "best practice" principles:
4. Any test that employs RDPMC instructions must be bound to a single logical processor. Results should be presented in a way that confirms to the reader that such binding was employed. Common ways to enforce such binding in Linux include using the "taskset" or "numactl --physcpubind=[n]" commands, or including an inline call to "sched_setaffinity()" with a single allowed logical processor, or setting an environment variable that causes a runtime library (e.g., OpenMP) to bind the thread to a single logical processor.
5. When using hardware performance counters, extra care is needed to ensure that all of the configuration data for the counters is available and described correctly. The code above uses RDPMC to read IA32_PERF_FIXED_CTR1, which has an event name of CPU_CLK_UNHALTED. The modifier to the event name depends on the programming of IA32_FIXED_CTR_CTRL (MSR 0x38d) bits 7:4. There is no generally-accepted way of mapping from all possible control bits to event name modifiers, so it is best to provide the complete contents of IA32_FIXED_CTR_CTRL along with the results.
6. The CPU_CLK_UNHALTED performance counter event is the right one to use for benchmarks of portions of the processor whose behavior scales directly with processor core frequency -- such as instruction execution and data transfers involving only the L1 and L2 caches. Memory bandwidth involves portions of the processor whose performance does not scale directly with processor frequency. In particular, using CPU_CLK_UNHALTED without also forcing fixed-frequency operation makes it impossible to compute the elapsed time (required by (1) and (3) above). In your case, RDTSCP would have been easier than RDPMC -- RDTSC does not require the process to be bound to a single logical processor, it is not influenced by other configuration MSRs, and it allows direct computation of elapsed time in seconds.
7. Advanced: For tests involving transfer of data in the memory hierarchy, it is helpful to control for cache contents and the state (clean or dirty) of the cache contents, and to provide explicit descriptions of the "before" and "after" states along with the results. Given the sizes of your arrays, your code should completely fill all levels of the cache with some combination of portions of the source and destination arrays, and then flush all of those addresses, leaving a cache hierarchy that is (almost) completely full of invalid (clean) entries.
8. Advanced: Using CPUID as a serialization instruction is almost never useful in benchmarking. Although it guarantees ordering, it also takes a long time to execute -- Agner Fog's "Instruction Tables" report it at 100-250 cycles (presumably depending on the input arguments). (Update: Measurements over short intervals are always very tricky. The CPUID instruction has a long and variable execution time, and it is not clear what impact the microcoded implementation has on the internal state of the processor. It may be helpful in specific cases, but it should not be considered as something that is automatically included in benchmarks. For measurements over long intervals, out-of-order processing across the measurement boundaries is negligible, so CPUID is not needed.)
9. Advanced: Using LFENCE in benchmarks is only relevant if you are measuring at very fine granularity -- less than a few hundred cycles. More notes on this topic at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/
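To make the point about reporting elapsed time and bandwidth concrete, here is a minimal sketch that times one copy with the POSIX clock_gettime(CLOCK_MONOTONIC) call and converts the result to GB/s. The helper names (elapsed_seconds, gigabytes_per_second, time_one_copy) are made up for illustration, and memcpy stands in for the routine under test. It is not a replacement for the RDTSCP approach mentioned above, only a way to get wall-clock seconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Seconds elapsed between two CLOCK_MONOTONIC timestamps. */
static double elapsed_seconds(struct timespec t0, struct timespec t1)
{
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

/* Bytes moved per second, in GB/s (1 GB = 1e9 bytes). */
static double gigabytes_per_second(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

/* Hypothetical wrapper: wall-clock time of one memcpy over n bytes. */
static double time_one_copy(void *dst, const void *src, size_t n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_seconds(t0, t1);
}
```

Reporting gigabytes_per_second(bytes_moved, seconds) next to the cycle count lets a reader compare the result directly against the 38.4 GB/s peak from item 1.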
If I assume that your processor was running at its maximum Turbo frequency of 4.6 GHz during the test, then the reported cycle counts correspond to 9.67 milliseconds and 5.23 milliseconds, respectively. Plugging these into a "sanity check" shows:
Assuming that the first case performs one read, one allocate, and one writeback (each 128MiB), the corresponding DRAM traffic rates are 27.8GB/s + 13.9 GB/s = 41.6 GB/s == 108% of peak.
Assuming that the second case performs one read and one streaming store (each 128MiB), the corresponding DRAM traffic rates are 25.7 GB/s + 25.7 GB/s = 51.3 GB/s = 134% of peak.
The failure of these "sanity checks" tells us that the frequency could not have been as high as 4.6 GHz (and was probably no higher than 3.0 GHz), but mostly just points to the need to measure the elapsed time unambiguously....
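The arithmetic behind this sanity check can be reproduced with a few lines of C. This is only a sketch: the 4.6 GHz frequency is the same assumption as above, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed seconds for a cycle count at an assumed fixed core frequency. */
static double cycles_to_seconds(uint64_t cycles, double hz)
{
    assert(hz > 0.0);
    return (double)cycles / hz;
}

/* DRAM traffic in GB/s when `streams` full passes over `bytes` take `seconds`. */
static double dram_traffic_gbs(double bytes, int streams, double seconds)
{
    assert(seconds > 0.0);
    return bytes * streams / seconds / 1e9;
}

/* Peak DRAM bandwidth in GB/s: channels * bytes per transfer * transfers/sec. */
static double peak_dram_gbs(int channels, int bytes_per_transfer, double transfers_per_sec)
{
    return channels * bytes_per_transfer * transfers_per_sec / 1e9;
}
```

With 128 MiB = 134217728 bytes, the regular-store case is dram_traffic_gbs(134217728.0, 3, cycles_to_seconds(44479368, 4.6e9)) (read, allocate, writeback) and the streaming-store case is dram_traffic_gbs(134217728.0, 2, cycles_to_seconds(24053086, 4.6e9)). Both come out above peak_dram_gbs(2, 8, 2.4e9), which is the failure described above.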
Your quote from the optimization manual on the inefficiency of streaming stores applies only to cases that cannot be coalesced into full cache line transfers. Your code stores to every element of the output cache lines following "best practice" recommendations (all store instructions writing to the same line are executed consecutively and generating only one stream of stores per loop). It is not possible to completely prevent the hardware from breaking up streaming stores, but in your case it should be extremely rare -- perhaps a few out of a million. Detecting partial streaming stores is a very advanced topic, requiring the use of poorly-documented performance counters in the "uncore" and/or indirect detection of partial streaming stores by looking for elevated DRAM CAS counts (which might be due to other causes). More notes on streaming stores are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/

Does multiplying a 1-100 int by -1 or setting said int to zero take more time?

This is for C, if the language matters. As far as I understand, at the assembly level negation is done using two's complement, while in the other case you're just storing the value 0 in the int variable. I'm not entirely sure what happens in that case.
I got: 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was spent incrementing the counter variables.
int i;
int n;
i = 0;
while (i < 999999999)
{
n = 0;
i++;
n++;
}
I got: 4.56s user 0.02s system 99% cpu 4.613 total for the code below.
int i;
int n;
i = 0;
n = 5;
while (i < 999999999)
{
n *= -1;
i++;
n++;
}
return (0);
I don't particularly understand much about assembly, but it doesn't seem intuitive that using the two's complement operation takes more time than setting one thing to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it'll actually be in practice.
If it seems pointless, the reason for it is because I can easily implement a "checklist" by simply multiplying an integer on a map by -1, meaning it's already been checked(But I need to keep the value, so when I do the check, I can just -1 whatever I'm comparing it to). But I was wondering if that's too slow, I could make a separate boolean 2D array to check if the value was checked or not, or change my data structure into an array of structures so it could hold an int 1/0. I'm wondering what the best implementation will be-- doing the -1 operation itself a billion times will already total up to around 5 seconds not counting the rest of my program. But making a separate 1 billion square int array or creating a billion square struct doesn't seem to be the best way either.

Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
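As an illustration of the "one bit of the integer" idea, here is a minimal sketch that reserves the top bit of a 32-bit cell as the checked flag and keeps the value in the remaining bits. All names (CHECKED_BIT, make_cell, mark_checked, is_checked, cell_value) are made up, and the sketch assumes the stored values never need bit 31, which holds for the 1-100 range in the question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stored values never use bit 31, so it is free to act as a flag. */
#define CHECKED_BIT UINT32_C(0x80000000)

/* Build an unchecked cell from a plain value. */
static inline uint32_t make_cell(uint32_t value)
{
    assert((value & CHECKED_BIT) == 0);
    return value;
}

/* Set the flag; the value bits are left untouched. */
static inline uint32_t mark_checked(uint32_t cell) { return cell | CHECKED_BIT; }

/* Test the flag. */
static inline bool is_checked(uint32_t cell) { return (cell & CHECKED_BIT) != 0; }

/* Recover the original value whether or not the cell is checked. */
static inline uint32_t cell_value(uint32_t cell) { return cell & ~CHECKED_BIT; }
```

Compared with multiplying by -1, marking is idempotent (marking twice does not flip the state back), and a value of 0 can be flagged too.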
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
That said, your results are easy to explain, since -O0 compiles each C statement separately and predictably.
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.

About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
int n;
for (int i = 0; i < 999999999; i++) {
n = 0;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
int n = 5;
for (int i = 0; i < 999999999; i++) {
n *= -1;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C instructions inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to add special meaning onto variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.

How to avoid cache line invalidation from multiple threads writing to a shared array?

Context of the problem:
I am writing code that creates 32 threads and sets the affinity of each of them to one of the 32 cores in my multi-core, multi-processor system.
The threads simply execute the RDTSCP instruction, and the value is stored in a shared array at a non-overlapping position. This is the shared array:
uint64_t rdtscp_values[32];
So, every thread is going to write to the specific array position based on its core number.
Up to now, everything is working properly, except that I know I may not be using the right data structure to avoid cache line bouncing.
P.S: I have checked already that my processor's cache line is 64-bytes wide.
Because I am using a simple uint64_t array, it implies that a single cache line is going to store 8 positions of this array, because of the read-ahead.
Question:
Because of this simple array, although the threads write to different indexes, my understanding is that every write to this array will cause a cache invalidation for all other threads. Is that right?
How could I create a structure that is aligned to the cache line?
EDIT 1
My system is: 2x Intel Xeon E5-2670 2.30GHz (8 cores, 16 threads)

Yes you definitely want to avoid "false sharing" and cache-line ping-pong.
But this probably doesn't make sense: if these memory locations are thread-private more often than they're collected by other threads, they should be stored with other per-thread data so you're not wasting cache footprint on 56 bytes of padding. See also Cache-friendly way to collect results from multiple threads. (There's no great answer; avoid designing a system that needs really fine-grained gathering of results if you can.)
But let's just assume for a minute that unused padding between slots for different threads is actually what you want.
Yes, you need the stride to be 64 bytes (1 cache line), but you don't actually need the 8B you're using to be at the start of each cache line. Thus, you don't need any extra alignment as long as the uint64_t objects are naturally-aligned (so they aren't split across a cache-line boundary).
It's fine if each thread is writing to the 3rd qword of its cache line instead of the 1st. OTOH, aligning to 64B makes sure nothing else is sharing a cache line with the first element, and it's easy so we might as well.
Static storage: aligning static storage is very easy in ISO C11 using alignas(), or with compiler-specific stuff.
With a struct, padding is implicit to make the size a multiple of the required alignment. Having one member with an alignment requirement implies that the whole struct requires at least that much alignment. The compiler takes care of this for you with static and automatic storage, but you have to use aligned_alloc or an alternative for over-aligned dynamic allocation.
#include <stdalign.h> // for #define alignas _Alignas for C++ compat
#include <stdint.h> // for uint64_t
// compiler knows the padding is just padding
struct { alignas(64) uint64_t v; } rdtscp_values[32];
int foo(unsigned t) {
rdtscp_values[t].v = 1;
return sizeof(rdtscp_values[0]); // yes, this is 64
}
Or with an array as suggested by @Eric Postpischil:
alignas(64) // optional, stride will still be 64B without this.
uint64_t rdtscp_values_2d[32][8]; // 8 uint64_t per cache line
void bar(unsigned t) {
rdtscp_values_2d[t][0] = 1;
}
alignas() is optional if you don't care about the whole thing being 64B aligned, just having 64B stride between elements you use. You could also use __attribute__((aligned(64))) in GNU C or C++, or __declspec(align(64)) for MSVC, using #ifdef to define an ALIGN macro that's portable across the major x86 compilers.
Either way produces the same asm. We can check compiler output to verify that we got what we wanted. I put it up on the Godbolt compiler explorer. We get:
foo: # and same for bar
mov eax, edi # zero extend 32-bit to 64-bit
shl rax, 6 # *64 is the same as <<6
mov qword ptr [rax + rdtscp_values], 1 # store 1
mov eax, 64 # return value = 64 = sizeof(struct)
ret
Both arrays are declared the same way, with the compiler requesting 64B alignment from the assembler/linker with the 3rd arg to .comm:
.comm rdtscp_values_2d,2048,64
.comm rdtscp_values,2048,64
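The portable ALIGN macro mentioned above could be sketched as follows. The macro name comes from the text; the struct, array and function names (padded_slot, slots, store_slot) are made up for illustration, and only the three compiler families named above are handled.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER)
#define ALIGN(n) __declspec(align(n))
#elif defined(__GNUC__) || defined(__clang__)
#define ALIGN(n) __attribute__((aligned(n)))
#else
#define ALIGN(n) _Alignas(n) /* ISO C11 fallback */
#endif

/* One slot per thread; the aligned member pads the struct out to 64 bytes. */
struct padded_slot { ALIGN(64) uint64_t v; };

static struct padded_slot slots[32];

/* Each thread t writes only to its own cache line. */
static void store_slot(unsigned t, uint64_t value)
{
    assert(t < 32);
    slots[t].v = value;
}
```

A compile-time check such as _Static_assert(sizeof(struct padded_slot) == 64, "one slot per cache line") is a cheap way to catch a compiler that ignores the attribute.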
Dynamic storage:
If the number of threads is not a compile-time constant, then you can use an aligned allocation function to get aligned dynamically-allocated memory (especially if you want to support a very high number of threads). See How to solve the 32-byte-alignment issue for AVX load/store operations?, but really just use C11 aligned_alloc. It's perfect for this, and returns a pointer that's compatible with free().
#include <stdlib.h> // for aligned_alloc
struct { alignas(64) uint64_t v; } *dynamic_rdtscp_values;
void init(unsigned nthreads) {
size_t sz = sizeof(dynamic_rdtscp_values[0]);
dynamic_rdtscp_values = aligned_alloc(sz, nthreads*sz); // args are (alignment, size)
}
void baz(unsigned t) {
dynamic_rdtscp_values[t].v = 1;
}
baz:
mov rax, qword ptr [rip + dynamic_rdtscp_values]
mov ecx, edi # same code as before to scale by 64 bytes
shl rcx, 6
mov qword ptr [rax + rcx], 1
ret
The address of the array is no longer a link-time constant, so there's an extra level of indirection to access it. But the pointer is read-only after it's initialized, so it will stay shared in cache in each core and reloading it when needed is very cheap.
Footnote: In the i386 System V ABI, uint64_t only has 4B-alignment inside structs by default (without alignas(8) or __attribute__((aligned(8)))), so if you put an int before a uint64_t and didn't do any alignment of the whole struct, it would be possible to get cache-line splits. But compilers align it by 8B whenever possible, so your struct-with padding is still fine.

SOLUTION
So, I followed the comments here and I must say thanks for all contributions.
Finally I got what I expected: cache lines being used properly per each thread.
Here is the shared structure:
typedef struct align_st {
uint64_t v;
uint64_t padding[7];
} align_st_t __attribute__ ((aligned (64)));
I am using uint64_t padding[7] inside the structure to fill the remaining bytes of the cache line when this structure is loaded into the L1 cache. In addition, I am asking the compiler to use 64-byte alignment for it with __attribute__ ((aligned (64))).
So, I allocate this structure dynamically based on the number of cores, using the memalign() for this:
align_st_t *al = (align_st_t*) memalign(64, n_cores * sizeof(align_st_t));
To compare, I wrote one version of the code (V1) that uses this aligned structure, and another version (V2) that uses the simple array.
Running both under perf, I got these numbers:
V1: 7.184 cache-misses;
V2: 2.621.347 cache-misses.
P.S.: Each thread is writing 1-thousand times to the same address of the shared structure just to increase the numbers

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
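For reference, the operation being asked about can be written as a plain scalar loop. This is a minimal sketch for comparison and testing, not an AVX2 solution. The names left_pack_scalar and is_positive are made up, and "greater than zero" is just an example condition.

```c
#include <assert.h>
#include <stddef.h>

/* Example predicate: keep strictly positive values. */
static int is_positive(float x) { return x > 0.0f; }

/* Copy the elements of in[0..n) that satisfy keep() to out, packed to the
   front in their original order. Returns how many elements were written. */
static size_t left_pack_scalar(float *out, const float *in, size_t n,
                               int (*keep)(float))
{
    size_t written = 0;
    assert(out != NULL && in != NULL && keep != NULL);
    for (size_t i = 0; i < n; i++) {
        if (keep(in[i]))
            out[written++] = in[i];
    }
    return written;
}
```

A vectorized version has to produce exactly this output, so the scalar loop is also useful as a correctness check.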
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
// (named bits so it does not redeclare the mask parameter)
int bits = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[bits]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
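The shufmasks table itself is not shown in the slides excerpt. One way it could be generated is sketched below, assuming each entry holds 16 pshufb control bytes and that unused output lanes should be zeroed (a control byte of 0x80 makes pshufb write zero). The function name build_shufmasks is made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table builder: shuf[m] holds the 16 pshufb control bytes that
   move the 32-bit lanes selected by the 4-bit mask m to the front. */
static void build_shufmasks(uint8_t shuf[16][16])
{
    assert(shuf != NULL);
    for (unsigned m = 0; m < 16; m++) {
        unsigned out = 0;                 /* next free output lane */
        memset(shuf[m], 0x80, 16);        /* 0x80 makes pshufb write a zero byte */
        for (unsigned lane = 0; lane < 4; lane++) {
            if (m & (1u << lane)) {
                for (unsigned b = 0; b < 4; b++)
                    shuf[m][4 * out + b] = (uint8_t)(4 * lane + b);
                out++;
            }
        }
    }
}
```

The resulting table is 16 * 16 = 256 bytes. The same construction for 8 lanes would need 256 entries of 32 bytes, which is the 8k figure above.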
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
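The steps above can be checked without BMI2 hardware using a bit-by-bit software stand-in for pext. This is only an illustrative sketch: pext_soft, expand_mask_3bit and packed_indices are made-up names, and the loop is far slower than the real instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for BMI2 pext: gather the bits of src selected by mask
   into the low bits of the result, preserving their order. */
static uint64_t pext_soft(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    unsigned k = 0;
    for (unsigned bit = 0; bit < 64; bit++) {
        if (mask & (UINT64_C(1) << bit)) {
            out |= ((src >> bit) & 1u) << k;
            k++;
        }
    }
    return out;
}

/* Turn each set bit of an 8-bit lane mask into a 3-bit group of ones. */
static uint32_t expand_mask_3bit(unsigned lane_mask)
{
    uint32_t m = 0;
    assert(lane_mask < 256);
    for (unsigned lane = 0; lane < 8; lane++)
        if (lane_mask & (1u << lane))
            m |= UINT32_C(7) << (3 * lane);
    return m;
}

/* Packed 3-bit indices of the lanes to keep, lowest lane first. */
static uint32_t packed_indices(unsigned lane_mask)
{
    const uint32_t identity = 0xFAC688; /* [ 7 6 5 4 3 2 1 0 ], 3 bits each */
    return (uint32_t)pext_soft(identity, expand_mask_3bit(lane_mask));
}
```

For the example in the text (keep lanes 0 and 2), expand_mask_3bit(0x05) gives 0b111'000'111 and packed_indices(0x05) gives 0b010'000, i.e. [ ... 2 0 ].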
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
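The two ways of doing that multiply are arithmetically identical, since x * 255 = x * 256 - x. A small sketch (the function names are made up for illustration):

```c
#include <stdint.h>

/* Both functions replicate the low bit of each byte across the whole byte,
   assuming the input only has bit 0 of each byte set (the pdep result). */

/* What the source says: a single multiply. */
uint64_t fill_bytes_mul(uint64_t x)
{
    return x * 0xFF;
}

/* What gcc emits instead: copy, shift left by 8, subtract. */
uint64_t fill_bytes_shift_sub(uint64_t x)
{
    return (x << 8) - x;
}
```

Unsigned arithmetic wraps modulo 2^64 in C, so the two agree for every input, not just for the pdep output.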
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but @Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
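Note that the intrinsics loop above always processes whole 16-float vectors and runs at least once, so as written it assumes len is a non-zero multiple of 16. A scalar version like the following sketch (the function name is hypothetical) could serve as a cleanup loop for a tail, or as a reference to check the vector version against:

```c
#include <stddef.h>

/* Scalar reference: copy every element >= 0 to dst, in order, and return
   how many were kept. NaN compares false, which matches _CMP_GE_OQ. */
size_t filter_non_negative_scalar(float *dst, const float *src, size_t len)
{
    size_t kept = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] >= 0.0f)
            dst[kept++] = src[i];
    }
    return kept;
}
```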
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
@ZachB reports in comments that in practice, a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdep and pext on Ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are at the bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the low 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000
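The LUT layout and the per-lane unpack can be modelled in plain scalar C, which may help when checking the table contents. This is only a sketch: pack_indices_3bit mirrors get_nth_bits above, and unpack_indices_3bit is a hypothetical scalar stand-in for the broadcast plus variable shift (it also ANDs with 7, which the vector version can skip because the permute ignores the upper bits).

```c
#include <assert.h>
#include <stdint.h>

/* Pack the indices of the set bits of an 8-bit mask, 3 bits each,
   lowest index first (same layout as one 24-bit LUT entry). */
uint32_t pack_indices_3bit(unsigned mask8)
{
    assert(mask8 < 256);
    uint32_t out = 0;
    int count = 0;
    for (int i = 0; i < 8; i++) {
        if ((mask8 >> i) & 1) {
            out |= (uint32_t)i << (count * 3);
            count++;
        }
    }
    return out;
}

/* Scalar model of the unpack: lane n shifts the entry right by 3*n
   and keeps the low 3 bits. */
void unpack_indices_3bit(uint32_t entry, uint8_t lanes[8])
{
    for (int lane = 0; lane < 8; lane++)
        lanes[lane] = (uint8_t)((entry >> (3 * lane)) & 7);
}
```

For the mask 0xA4 (elements 2, 5 and 7 selected), the first three lanes come out as 2, 5, 7 and the remaining lanes are 0.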

This adds more information to the great answer from @PeterCordes: https://stackoverflow.com/a/36951611/5021064.
I used it to implement std::remove from the C++ standard library for integer types. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller than 4 bytes for a 256 bit register, I split them into two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and uses custom SIMD wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of @PeterCordes's answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using (x << 4 | x) & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
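The spreading step for a single 16-bit lane, written out as a scalar sketch (spread_nibbles is a name made up here; the real code does the same thing on all eight lanes at once):

```c
#include <stdint.h>

/* Split a 16-bit lane holding two packed nibble indexes (0x00hl)
   into one index per byte (0x0h0l). */
uint16_t spread_nibbles(uint16_t x)
{
    /* x << 4 moves the high nibble into the upper byte; OR keeps the low
       nibble in place; the AND clears the two leftover middle nibbles. */
    return (uint16_t)(((x << 4) | x) & 0x0f0f);
}
```

For the example above, 0x00fe becomes 0x0ffe after the shift and OR, and 0x0f0e after the AND.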
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the @PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one-for-one @PeterCordes's solution - the only difference is the _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5. The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
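The effect of that pdep and multiply is that every bit of the 32-bit mask gets doubled into two adjacent bits of the 64-bit result. A scalar sketch of the same mapping, assuming nothing beyond standard C (double_each_bit is a hypothetical name):

```c
#include <stdint.h>

/* Scalar model of _pdep_u64(mmask, 0x5555555555555555) * 3:
   input bit i becomes output bits 2*i and 2*i + 1. */
uint64_t double_each_bit(uint32_t mmask)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < 32; i++) {
        if ((mmask >> i) & 1)
            out |= (uint64_t)3 << (2 * i);
    }
    return out;
}
```

So the 4 movemask bits belonging to one selected int turn into one full byte of the pext selector, which lines up with the one-index-per-byte constant 0x0706050403020100.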
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the @PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to a 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide at the beginning of the function (before entering the loop). The main numbers I show are the min for each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because SIMD performance depends on the size of the data and not the number of elements. The element count can be derived from the element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since the time it takes for non-simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get a small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. When the branch predictor can't help the non-simd version, there is about a 27 times speed up. It's an interesting property of simd code that its performance tends to be much less dependent on the data. Using 128 vs 256 bit registers shows practically no difference, since most of the work is still split into 2 128 bit registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For 1000 bytes only the 256 bit version makes sense - a 20-30% win, excluding the case with no 0s to remove whatsoever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as for 1000 chars: from 2-6 times faster when the branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them into 2 128 bit ones. In size it grows from 88 to 129 instructions, which is not a lot, so it might make sense depending on your use-case. For a baseline, the non-simd version is 79 instructions (as far as I know, these are smaller than the SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worse does it get if the code happens to be poorly aligned (generally speaking, there is very little one can do about it)?
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other than that - from 10% to 1.6-1.8 times worse
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into the remove part it goes from 1.3 times to 5 times worse than the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}
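The shuffle immediates in that switch are not arbitrary: each one lists the indices of the selected elements in order, 2 bits per slot. Here is a sketch of a generator for them (leftpack_shuffle_imm is a hypothetical helper, useful for checking the table or for generating the 256-case version). It leaves the unused high fields as 0, which is also what the hand-written constants do.

```c
#include <assert.h>

/* Build the _mm_shuffle_ps immediate that left-packs the elements whose
   bits are set in a 4-bit movemask. Slot k (bits 2k+1:2k) receives the
   index of the k-th selected element; unused slots stay 0. */
unsigned leftpack_shuffle_imm(unsigned mask4)
{
    assert(mask4 < 16);
    unsigned imm = 0;
    int slot = 0;
    for (unsigned i = 0; i < 4; i++) {
        if ((mask4 >> i) & 1) {
            imm |= i << (2 * slot);
            slot++;
        }
    }
    return imm;
}
```

For masks 0, 1, 3, 7 and 15 the selected elements are already at the front, so the switch simply returns val instead of shuffling.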

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with its unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256d left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally, _mm_popcnt_u32(_mm_movemask_ps(_mm_castsi128_ps(mask))) can be used to determine the number of elements which remained after the left-packing.

How to optimize these loops (with compiler optimization disabled)?

I need to optimize some for-loops for speed (for a school assignment) without using compiler optimization flags.
Given a specific Linux server (owned by the school), a satisfactory improvement is to make it run under 7 seconds, and a great improvement is to make it run under 5 seconds. This code that I have right here gets about 5.6 seconds. I am thinking I may need to use pointers with this in some way to get it to go faster, but I'm not really sure. What options do I have?
The file must remain 50 lines or less (not counting comments).
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
register double sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0;
register int j;
// ... and this one.
printf("CS201 - Asgmt 4 - \n");
for (i = 0; i < N_TIMES; i++)
{
// You can change anything between this comment ...
for (j = 0; j < ARRAY_SIZE; j += 10)
{
sum += array[j];
sum1 += array[j + 1];
sum2 += array[j + 2];
sum3 += array[j + 3];
sum4 += array[j + 4];
sum5 += array[j + 5];
sum6 += array[j + 6];
sum7 += array[j + 7];
sum8 += array[j + 8];
sum9 += array[j + 9];
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
sum += sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7 + sum8 + sum9;
// ... and this one.
return 0;
}

Re-posting a modified version of my answer from optimized sum of an array of doubles in C, since that question got voted down to -5. The OP of the other question phrased it more as "what else is possible", so I took him at his word and info-dumped about vectorizing and tuning for current CPU hardware. :)
The OP of that question eventually said he wasn't allowed to use compiler options higher than -O0, which I guess is the case here, too.
Summary:
Why using -O0 distorts things (unfairly penalizes things that are fine in normal code for a normal compiler). Using -O0 (the gcc/clang default) so your loops don't optimize away is not a valid excuse or a useful way to find out what will be faster with normal optimization enabled. (See also Idiomatic way of performance evaluation? for more about benchmark methods and pitfalls, like ways to enable optimization but still stop the compiler from optimizing away the work you want to measure.)
Stuff that's wrong with the assignment.
Types of optimizations. FP latency vs. throughput, and dependency chains. Link to Agner Fog's site. (Essential reading for optimization).
Experiments getting the compiler to optimize it (after fixing it to not optimize away). Best result with auto-vectorization (no source changes): gcc: half as fast as an optimal vectorized loop. clang: same speed as a hand-vectorized loop.
Some more comments on why bigger expressions are a perf win with -O0 only.
Source changes to get good performance without -ffast-math, making the code closer to what we want the compiler to do. Also some rules-lawyering ideas that would be useless in the real-world.
Vectorizing the loop with GCC architecture-neutral vectors, to see how close the auto-vectorizing compilers came to matching the performance of ideal asm code (since I checked the compiler output).
I think the point of the assignment is to sort of teach assembly-language performance optimizations using C with no compiler optimizations. This is silly. It's mixing up things the compiler will do for you in real life with things that do require source-level changes.
See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
-O0 doesn't just "not optimize", it makes the compiler store variables to memory after every statement instead of keeping them in registers. It does this so you get the "expected" results if you set a breakpoint with gdb and modify the value (in memory) of a C variable. Or even if you jump to another line in the same function. So each C statement has to be compiled to an independent block of asm that starts and ends with all variables in memory. For a modern portable compiler like gcc which already transforms through multiple internal representations of program flow on the way from source to asm, this part of -O0 requires explicitly de-optimizing its graph of data flow back into separate C statements. These store/reloads lengthen every loop-carried dependency chain so it's horrible for tiny loops if the loop counter is kept in memory. (e.g. 1 cycle per iteration for inc reg vs. 6c for inc [mem], creating a bottleneck on loop counter updates in tight loops).
With gcc -O0, the register keyword lets gcc keep a var in a register instead of memory, and thus can make a big difference in tight loops (Example on the Godbolt Compiler explorer). But that's only with -O0. In real code, register is meaningless: the compiler attempts to optimally use the available registers for variables and temporaries. register is already deprecated in ISO C++11 (but not C11), and there's a proposal to remove it from the language along with other obsolete stuff like trigraphs.
With extra variables involved, -O0 hurts array indexing a bit more than pointer incrementing.
Array indexing usually makes code easier to read. Compilers sometimes fail to optimize stuff like array[i*width + j*width*height], so it's a good idea to change the source to do the strength-reduction optimization of turning the multiplies into += adds.
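As a rough illustration of that strength-reduction, here is a sketch that sums one column of a row-major matrix both ways. The function names and the matrix layout are assumptions made up for this example, not code from the assignment.

```c
#include <stddef.h>

/* Indexed form: one multiply per access. */
double sum_column_indexed(const double *a, size_t rows, size_t width, size_t col)
{
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        sum += a[i * width + col];
    return sum;
}

/* Strength-reduced form: the multiply becomes a running += of the stride. */
double sum_column_reduced(const double *a, size_t rows, size_t width, size_t col)
{
    double sum = 0.0;
    size_t offset = col;
    for (size_t i = 0; i < rows; i++) {
        sum += a[offset];
        offset += width;
    }
    return sum;
}
```

Both versions add the same elements in the same order, so they return identical results. An optimizing compiler will usually do this transformation itself for a loop this simple.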
At an asm level, array indexing vs. pointer incrementing are close to the same performance. (x86 for example has addressing modes like [rsi + rdx*4] which are as fast as [rdi], except on Sandybridge and later.) It's the compiler's job to optimize your code by using pointer incrementing even when the source uses array indexing, when that's faster.
For good performance, you have to be aware of what compilers can and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing an optimization that was essential for some code to run fast. (e.g. pulling a constant computation out of a loop, or proving something about how different branch conditions are related to each other, and simplifying.)
Besides all that, it's a crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.
(You can fix this by printing sum at the end. gcc and clang don't seem to realize that calloc returns zeroed memory, so they don't optimize the sum away to 0.0. See my code below.)
Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
Also, the other version of this question had an uninitialized variable kicking around. It looks like long int help was introduced by the OP of that question, not the prof. So I will have to downgrade my "utter nonsense" to merely "silly", because the code doesn't even print the result at the end. That's the most common way of getting the compiler not to optimize everything away in a microbenchmark like this.
I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.
Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
Further essential reading:
Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory
Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. Usually the difference in rounding error is small, so the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it.
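To make the non-associativity concrete, here is a small sketch (sum_left and sum_right are names I made up). It assumes IEEE-754 doubles evaluated in double precision, as on x86-64 with SSE2, and no -ffast-math:

```c
/* Add three doubles left to right: (a + b) + c */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Same three values, grouped the other way: a + (b + c) */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e16, b = -1e16, c = 1.0, sum_left gives 1.0, but sum_right gives 0.0, because -1e16 + 1.0 rounds back to -1e16 (doubles near 1e16 are spaced 2 apart). This is the kind of re-grouping the compiler is only allowed to do when you pass -ffast-math.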
Instead of just unrolling, keep multiple accumulators which you only add up at the end, like you're doing with the sum0..sum9 unroll-by-10. FP instructions have medium latency but high throughput, so you need to keep multiple FP operations in flight to keep the floating point execution units saturated.
If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
Compiler Options
Let's start by seeing what the compiler can do for us.
I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):
gcc -O0 fast-loop-cs201.c -o fl: 16.43s. Performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win: it collects the sums every time, instead of giving each thread 1/4 of the outer loop iterations.
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s. profile-guided optimization is a good idea when you can exercise all the relevant code-paths, so the compiler can make better unrolling / inlining decisions.
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang 3.5 is too old to support -march=sandybridge. You should prefer to use a compiler version that's new enough to know about the target architecture you're tuning for, esp. if using -march to make code that doesn't need to run on older architectures.)
gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.
clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. It knows that calloc only returns 16-byte aligned memory (on x86-64 System V), and when tuning for Sandybridge (before Haswell) it knows that 32-byte loads have a big penalty when misaligned. And that splitting them isn't too expensive since a 32-byte load takes 2 cycles in a load port anyway.
vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4
This is worse on later CPUs, especially when the data does happen to be aligned at run-time; see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? about GCC versions where -mavx256-split-unaligned-load was on by default with -mtune=generic.
It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) In that case it uses a tight loop of 4x vaddpd mem, %ymm, %ymm. It only runs about 0.65 insns per cycle (and 0.93 uops / cycle), according to perf, so the bottleneck isn't front-end.
I checked with a debugger, and calloc is indeed returning a pointer that's an odd multiple of 16. (glibc for large allocations tends to allocate new pages, and put bookkeeping info in the initial bytes, always misaligning to any boundary wider than 16.) So half the 32B memory accesses are crossing a cache line, causing a big slowdown. It is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. (gcc enables -mavx256-split-unaligned-load and ...-store for -march=sandybridge, and also for the default tune=generic with -mavx, which is not so good especially for Haswell or with memory that's usually aligned but the compiler doesn't know about it.)
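If you'd rather check alignment in code than in a debugger, something like this sketch works. is_aligned and alloc_doubles_32 are hypothetical helper names, and I'm assuming a C11 library that provides aligned_alloc:

```c
#include <stdint.h>
#include <stdlib.h>

/* Nonzero if p is aligned to `alignment` bytes (alignment must be a power of 2). */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Allocate room for n doubles on a 32-byte boundary. Release with free().
   The memory is NOT zeroed, unlike calloc. */
double *alloc_doubles_32(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 31) & ~(size_t)31;  /* C11 wants the size to be a multiple of the alignment */
    return aligned_alloc(32, bytes);
}
```

Calling is_aligned(array, 32) on the calloc result would show the odd-multiple-of-16 case described above without needing gdb.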
Source level changes
As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:
for (j = 0; j < ARRAY_SIZE; j+=4) { // unroll 4 times
sum0 += array[j];
sum1 += array[j+1];
sum2 += array[j+2];
sum3 += array[j+3];
}
and then don't collect the 4 accumulators into one until after the end of the outer loop.
Your (from the other question) source change of
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. Order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and find the parallelism to keep those medium latency, high throughput FP execution units fed.
With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
Using 4 accumulator variables and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
GCC, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.
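For completeness, here is roughly what the 4-accumulator scalar version looks like with the cleanup code I skipped. This is only a sketch: sum_4acc is a made-up name, and it's written as a standalone function rather than inside the assignment's loop nest:

```c
#include <stddef.h>

/* Sum n doubles using 4 independent accumulators to hide FP add latency. */
double sum_4acc(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t j = 0;

    /* Main loop: only runs while a full group of 4 elements remains,
       so it never reads past the end of the array. */
    for (; j + 4 <= n; j += 4) {
        s0 += a[j];
        s1 += a[j + 1];
        s2 += a[j + 2];
        s3 += a[j + 3];
    }

    /* Cleanup: the last 0 to 3 elements when n isn't a multiple of 4. */
    for (; j < n; j++)
        s0 += a[j];

    /* Combine the accumulators only once, at the very end. */
    return (s0 + s1) + (s2 + s3);
}
```

The j + 4 <= n condition is the part that's easy to get wrong. Writing j < n there with a j += 4 step is exactly how you end up reading past the end.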
Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:
// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help = 0;
typedef double v4df __attribute__ ((vector_size (8*4)));
v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};
const size_t array_bytes = ARRAY_SIZE*sizeof(double);
double *aligned_array = NULL;
// this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
exit (1);
}
memcpy(aligned_array, array, array_bytes); // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
/*
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
array = __builtin_assume_aligned(array, 32);
#else
// force-align for other compilers. This loop-invariant will be done outside the loop.
array = (double*) ((ptrdiff_t)array & ~31);
#endif
*/
assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) ); // We don't have a cleanup loop to handle where the array size isn't a multiple of 16
// incrementing pointers can be more efficient than indexing arrays
// esp. on recent Intel where micro-fusion only works with one-register addressing modes
// of course, the compiler can always generate pointer-incrementing asm from array-indexing source
const double *start = aligned_array;
while ( (ptrdiff_t)start & 31 ) {
// annoying loops like this are the reason people use aligned buffers
sum += *start++; // scalar until we reach 32B alignment
// in practice, this loop doesn't run, because we copy into an aligned buffer
// This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
}
const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
sum0 += p[0]; // p+=4 increments the pointer by 4 * 4 * 8 bytes
sum1 += p[1]; // make sure you keep track of what you're incrementing
sum2 += p[2];
sum3 += p[3];
}
// the compiler might be smart enough to pull this out of the inner loop
// in fact, gcc turns this into a 64bit movabs outside of both loops :P
help+= ARRAY_SIZE;
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
/* You could argue legalese and say that
if (i == 0) {
for (j ...)
sum += array[j];
sum *= N_TIMES;
}
* still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
*/
}
// You can add some final code between this comment ...
sum0 = (sum0 + sum1) + (sum2 + sum3);
sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
printf("sum = %g; help=%ld\n", sum, help); // defeat the compiler.
free (aligned_array);
free (array); // not strictly necessary, because this is the end of main(). Leaving it out for this special case is a bad example for a CS class, though.
// ... and this one.
return 0;
}
The inner loop compiles to:
4007c0: c5 e5 58 19 vaddpd (%rcx),%ymm3,%ymm3
4007c4: 48 83 e9 80 sub $0xffffffffffffff80,%rcx # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
4007c8: c5 f5 58 49 a0 vaddpd -0x60(%rcx),%ymm1,%ymm1 # one-register addressing mode can micro-fuse
4007cd: c5 ed 58 51 c0 vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2: c5 fd 58 41 e0 vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7: 4c 39 c1 cmp %r8,%rcx # compare p with end
4007da: 75 e4 jne 4007c0 <main+0xb0>
(For more, see online compiler output at the godbolt compiler explorer. The -xc compiler option compiles as C, not C++. The inner loop is from .L3 to jne .L3. See the x86 tag wiki for x86 asm links. See also this q&a about micro-fusion not happening on SnB-family, which Agner Fog's guides don't cover).
performance:
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000
Performance counter stats for './fl3-vec':
1086.571078 task-clock (msec) # 1.000 CPUs utilized
4,072,679,849 cycles # 3.748 GHz
2,629,419,883 instructions # 0.65 insns per cycle
# 1.27 stalled cycles per insn
4,028,715,968 r1b1 # 3707.733 M/sec # unfused uops
2,257,875,023 r10e # 2077.982 M/sec # fused uops. lower than insns because of macro-fusion
3,328,275,626 stalled-cycles-frontend # 81.72% frontend cycles idle
1,648,011,059 stalled-cycles-backend # 40.47% backend cycles idle
751,736,741 L1-dcache-load-misses # 691.843 M/sec
18,772 cache-misses # 0.017 M/sec
1.086925466 seconds time elapsed
I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfer per cycle, which should keep up with the one 32B FP vector add per cycle.
32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).
Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.
0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/
update: I guess L2 bandwidth was the problem after all. There aren't enough line-fill buffers to keep enough misses in flight to sustain the peak throughput every cycle. L2 sustained bandwidth is less than peak on Intel SnB / Haswell / Skylake CPUs.
See also Single Threaded Memory Bandwidth on Sandy Bridge (Intel forum thread, with much discussion about what limits throughput, and how latency * max_concurrency is one possible bottleneck). See also the "Latency Bound Platforms" part of the answer to Enhanced REP MOVSB for memcpy: limited memory concurrency is a bottleneck for loads as well as stores, but for loads, prefetch into L2 does mean you might not be limited purely by Line Fill buffers for outstanding L1D misses.
Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) Loop tiling is a much better solution, see below.
Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the 64kiB L1D on an AMD K10 (Istanbul) CPU, but not Bulldozer-family (16kiB L1D) or Ryzen (32kiB L1D).
Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.
Also, it's pretty silly. The N_TIMES is just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:
for (...) {
sum += a[j] + a[j] + a[j] + a[j];
}
i += 3; // The inner loop does 4 total iterations of the outer loop
More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
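A sketch of the interchanged loops, as a standalone function (sum_interchanged is a hypothetical name, not part of the assignment skeleton):

```c
#include <stddef.h>

/* Same total number of additions as the original loop nest, but each
   array element is loaded once and then added n_times in a row. */
double sum_interchanged(const double *a, size_t n, long n_times)
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++) {        /* was the inner loop */
        const double v = a[j];
        for (long i = 0; i < n_times; i++)  /* was the outer repeat loop */
            sum += v;
    }
    return sum;
}
```

Keep in mind that this changes the order of the FP additions, so for non-zero data the rounding of the final sum can differ slightly from the original order. It's also still a single dependency chain, so you'd want multiple accumulators on top of this.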
A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do several stages of processing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
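Here's a rough sketch of what cache blocking looks like for this problem. sum_tiled and the block parameter are names I made up; pick the block size so a block fits comfortably in L1D (e.g. 1024 doubles = 8kiB), and treat that number as something to tune by measuring, not a given:

```c
#include <stddef.h>

/* Do all n_times repeats over one cache-sized block before moving on
   to the next block. `block` must be greater than 0. */
double sum_tiled(const double *a, size_t n, long n_times, size_t block)
{
    double sum = 0.0;
    for (size_t start = 0; start < n; start += block) {
        /* The last block may be shorter than the others. */
        size_t end = (n - start < block) ? n : start + block;
        for (long i = 0; i < n_times; i++)
            for (size_t j = start; j < end; j++)
                sum += a[j];
    }
    return sum;
}
```

The inner loop is still a plain loop over contiguous data, so everything said earlier about vectorizing and multiple accumulators applies to it unchanged. The total number of additions is the same as in the original; only the order differs.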

You may be on the right track, though you'll need to measure it to be certain (my normal advice to measure, not guess seems a little superfluous here since the whole point of the assignment is to measure).
Optimising compilers will probably not see much of a difference since they're pretty clever about that sort of stuff but, since we don't know what optimisation level it will be compiling at, you may get a substantial improvement.
To use pointers in the inner loop is a simple matter of first adding a pointer variable:
register double *pj;
then changing the loop to:
for (pj = &(array[0]); pj < &(array[ARRAY_SIZE]); pj++) {
sum += *pj++;
sum1 += *pj++;
sum2 += *pj++;
sum3 += *pj++;
sum4 += *pj++;
sum5 += *pj++;
sum6 += *pj++;
sum7 += *pj++;
sum8 += *pj++;
sum9 += *pj; // the pj++ in the for statement steps past this 10th element
}
This keeps the amount of additions the same within the loop (assuming you're counting += and ++ as addition operators, of course) but basically uses pointers rather than array indexes.
With no optimisation [1] on my system, this drops it from 9.868 seconds (CPU time) to 4.84 seconds. Your mileage may vary.
[1] With optimisation level -O3, both are reported as taking 0.001 seconds so, as mentioned, the optimisers are pretty clever. However, given you're seeing 5+ seconds, I'd suggest it wasn't compiled with optimisation on.
As an aside, this is a good reason why it's usually advisable to write your code in a readable manner and let the compiler take care of getting it running faster. While my meager attempts at optimisation roughly doubled the speed, using -O3 made it run some ten thousand times faster :-)

Before anything else, try to change compiler settings to produce faster code. There is general optimisation, and the compiler might do auto vectorisation.
What you would always do is try several approaches and check what is fastest. As a target, try to get to one cycle per addition or better.
Number of iterations per loop: You add up 10 sums simultaneously. It might be that your processor doesn't have enough registers for that, or it has more. I'd measure the time for 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... sums per loop.
Number of sums: Having more than one sum means that latency doesn't bite you, just throughput. But more than four or six might not be helpful. Try four sums, with 4, 8, 12, 16 iterations per loop. Or six sums, with 6, 12, 18 iterations.
Caching: You are running through an array of 80,000 bytes. Probably more than L1 cache. Split the array into 2 or 4 parts. Do an outer loop iterating over the two or four subarrays, the next loop from 0 to N_TIMES - 1, and the inner loop adding up values.
And then you can try using vector operations, or multi-threading your code, or using the GPU to do the work.
And if you are forced to use no optimisation, then the "register" keyword might actually work.
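For the "try several approaches and check what is fastest" part, a small timing helper along these lines may be handy. All names here (sum_fn, time_sum, sum_one_acc, sum_two_acc) are invented for the sketch, and it uses the standard clock() function, which measures CPU time at fairly coarse resolution:

```c
#include <stddef.h>
#include <time.h>

typedef double (*sum_fn)(const double *a, size_t n);

/* Candidate 1: a single accumulator. */
double sum_one_acc(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        s += a[j];
    return s;
}

/* Candidate 2: two independent accumulators. */
double sum_two_acc(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t j = 0;
    for (; j + 2 <= n; j += 2) {
        s0 += a[j];
        s1 += a[j + 1];
    }
    if (j < n)          /* odd element left over */
        s0 += a[j];
    return s0 + s1;
}

/* Run fn `repeats` times, store the accumulated result through result_out
   (so the caller can print it), and return the CPU seconds used. */
double time_sum(sum_fn fn, const double *a, size_t n, int repeats,
                double *result_out)
{
    double total = 0.0;
    clock_t t0 = clock();
    for (int r = 0; r < repeats; r++)
        total += fn(a, n);
    clock_t t1 = clock();
    *result_out = total;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

You'd call time_sum once per candidate with the same array and repeat count, print both the seconds and the result, and add more candidates (4, 6, 8 sums...) the same way. With optimisation enabled, check the generated asm to make sure the compiler hasn't hoisted or merged the repeated calls.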
