Benchmarking memory copy in a single shot

Benchmarking memory copy in a single shot - c

Whiskey Lake i7-8565U
I'm trying to learn how to write benchmarks in a single shot by hands (without using any benchmarking frameworks) on an example of memory copy routine with regular and NonTemporal writes to WB memory and would like to ask for some sort of review.
Declaration:
void *avx_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
void *avx_nt_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
Definition:
avx_memcpy_forward_llss:
shr rdx, 0x3
xor rcx, rcx
avx_memcpy_forward_loop_llss:
vmovdqa ymm0, [rsi + 8*rcx]
vmovdqa ymm1, [rsi + 8*rcx + 0x20]
vmovdqa [rdi + rcx*8], ymm0
vmovdqa [rdi + rcx*8 + 0x20], ymm1
add rcx, 0x08
cmp rdx, rcx
ja avx_memcpy_forward_loop_llss
ret
avx_nt_memcpy_forward_llss:
shr rdx, 0x3
xor rcx, rcx
avx_nt_memcpy_forward_loop_llss:
vmovdqa ymm0, [rsi + 8*rcx]
vmovdqa ymm1, [rsi + 8*rcx + 0x20]
vmovntdq [rdi + rcx*8], ymm0
vmovntdq [rdi + rcx*8 + 0x20], ymm1
add rcx, 0x08
cmp rdx, rcx
ja avx_nt_memcpy_forward_loop_llss
ret
Benchmark code:
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <immintrin.h>
#include <x86intrin.h>
#include "memcopy.h"
#define BUF_SIZE 128 * 1024 * 1024
_Alignas(64) char src[BUF_SIZE];
_Alignas(64) char dest[BUF_SIZE];
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t));
static inline void cache_flush(char *buf, size_t size);
static inline void generate_data(char *buf, size_t size);
uint64_t run_benchmark(unsigned wa_iteration, void *(*copy_fn)(void *, const void *, size_t)){
generate_data(src, sizeof src);
warmup(4, copy_fn);
cache_flush(src, sizeof src);
cache_flush(dest, sizeof dest);
__asm__ __volatile__("mov $0, %%rax\n cpuid":::"rax", "rbx", "rcx", "rdx", "memory");
uint64_t cycles_start = __rdpmc((1 << 30) + 1);
copy_fn(dest, src, sizeof src);
__asm__ __volatile__("lfence" ::: "memory");
uint64_t cycles_end = __rdpmc((1 << 30) + 1);
return cycles_end - cycles_start;
}
int main(void){
uint64_t single_shot_result = run_benchmark(1024, avx_memcpy_forward_llss);
printf("Core clock cycles = %" PRIu64 "\n", single_shot_result);
}
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t)){
while(wa_iterations --> 0){
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
}
}
static inline void generate_data(char *buf, size_t sz){
int fd = open("/dev/urandom", O_RDONLY);
read(fd, buf, sz);
}
static inline void cache_flush(char *buf, size_t sz){
for(size_t i = 0; i < sz; i+=_SC_LEVEL1_DCACHE_LINESIZE){
_mm_clflush(buf + i);
}
}
Results:
avx_memcpy_forward_llss median: 44479368 core cycles
UPD: time
real 0m0,217s
user 0m0,093s
sys 0m0,124s
avx_nt_memcpy_forward_llss median: 24053086 core cycles
UPD: time
real 0m0,184s
user 0m0,056s
sys 0m0,128s
UPD: The result was gotten when running the benchmark with taskset -c 1 ./bin
So I got almost almost 2 times difference in core cycles between the memory copy routine implementation. I interpret it as in case of regular stores to WB memory we have RFO requests competing on bus bandwidth as it is specified in IOM/3.6.12 (emphasize mine):
Although the data bandwidth of full 64-byte bus writes due to
non-temporal stores is twice that of bus writes to WB memory,
transferring 8-byte chunks wastes bus request bandwidth and delivers
significantly lower data bandwidth.
QUESTION 1: How to do benchmark analysis in case of a single shot? Perf counters does not seem to be useful due to perf startup overhead and warmup iteration overhead.
QUESTION 2: Is such benchmark correct. I accounted cpuid in the beginning in order to start measuring with clean CPU resources to avoid stalls due to previous instruction in flight. I added memory clobbers as compile barrier and lfence to avoid rdpmc to be executed OoO.

Whenever possible, benchmarks should report results in ways that allow as much "sanity-checking" as possible. In this case, a few ways to enable such checks include:
For tests involving main memory bandwidth, results should be presented in units that allow direct comparison with the known peak DRAM bandwidth of the system. For a typical configuration of the Core i7-8565U, this is 2 channels * 8 Bytes/transfer * 2.4 billion transfers/sec = 38.4 GB/s (See also item (6), below.)
For tests that involve transfer of data anywhere in the memory hierarchy, the results should include a clear description of the size of the "memory footprint" (number of distinct cache line addresses accessed times the cache line size) and the number of repetitions of the transfer(s). Your code is easy to read here and the size is completely reasonable for a main memory test.
For any timed test, the absolute time should be included to enable comparison against plausible overheads of timing. Your use of only the CORE_CYCLES_UNHALTED counter makes it impossible to compute the elapsed time directly (though the test is clearly long enough that timing overheads are negligible).
Other important "best practice" principles:
Any test that employs RDPMC instructions must be bound to a single logical processor. Results should be presented in a way that confirms to the reader that such binding was employed. Common ways to enforce such binding in Linux include using the "taskset" or "numactl --physcpubind=[n]" commands, or including an inline call to "sched_setaffinity()" with a single allowed logical processor, or setting an environment variable that causes a runtime library (e.g., OpenMP) to bind the thread to a single logical processor.
When using hardware performance counters, extra care is needed to ensure that all of the configuration data for the counters is available and described correctly. The code above uses RDPMC to read IA32_PERF_FIXED_CTR1, which has an event name of CPU_CLK_UNHALTED. The modifier to the event name depends on the programming of IA32_FIXED_CTR_CTRL (MSR 0x38d) bits 7:4. There is no generally-accepted way of mapping from all possible control bits to event name modifiers, so it is best to provide the complete contents of IA32_FIXED_CTR_CTRL along with the results.
The CPU_CLK_UNHALTED performance counter event is the right one to use for benchmarks of portions of the processor whose behavior scales directly with processor core frequency -- such as instruction execution and data transfers involving only the L1 and L2 caches. Memory bandwidth involves portions of the processor whose performance does not scale directly with processor frequency. In particular, using CPU_CLK_UNHALTED without also forcing fixed-frequency operation makes it impossible to compute the elapsed time (required by (1) and (3) above). In your case, RDTSCP would have been easier than RDPMC -- RDTSC does not require the processes to be bound a single logical processor, it is not influenced by other configuration MSRs, and it allows direct computation of elapsed time in seconds.
Advanced: For tests involving transfer of data in the memory hierarchy, it is helpful to control for cache contents and the state (clean or dirty) of the cache contents, and to provide explicit descriptions of the "before" and "after" states along with the results. Given the sizes of your arrays, your code should completely fill all levels of the cache with some combination of portions of the source and destination arrays, and then flush all of those addresses, leaving a cache hierarchy that is (almost) completely full of invalid (clean) entries.
Advanced: Using CPUID as a serialization instruction is almost never useful in benchmarking. Although it guarantees ordering, it also takes a long time to execute -- Agner Fog's "Instruction Tables" report it at 100-250 cycles (presumably depending on the input arguments). (Update: Measurements over short intervals are always very tricky. The CPUID instruction has a long and variable execution time, and it is not clear what impact the microcoded implementation has on the internal state of the processor. It may be helpful in specific cases, but it should not be considered as something that is automatically included in benchmarks. For measurements over long intervals, out-of-order processing across the measurement boundaries is negligible, so CPUID is not needed.)
Advanced: Using LFENCE in benchmarks is only relevant if you are measuring at very fine granularity -- less than a few hundred cycles. More notes on this topic at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/
If I assume that your processor was running at its maximum Turbo frequency of 4.6 GHz during the test, then the reported cycle counts correspond to 9.67 milliseconds and 5.23 milliseconds, respectively. Plugging these into a "sanity check" shows:
Assuming that the first case performs one read, one allocate, and one writeback (each 128MiB), the corresponding DRAM traffic rates are 27.8GB/s + 13.9 GB/s = 41.6 GB/s == 108% of peak.
Assuming that the second case performs one read and one streaming store (each 128MiB), the corresponding DRAM traffic rates are 25.7 GB/s + 25.7 GB/s = 51.3 GB/s = 134% of peak.
The failure of these "sanity checks" tells us that the frequency could not have been as high as 4.6 GHz (and was probably no higher than 3.0 GHz), but mostly just points to the need to measure the elapsed time unambiguously....
Your quote from the optimization manual on the inefficiency of streaming stores applies only to cases that cannot be coalesced into full cache line transfers. Your code stores to every element of the output cache lines following "best practice" recommendations (all store instructions writing to the same line are executed consecutively and generating only one stream of stores per loop). It is not possible to completely prevent the hardware from breaking up streaming stores, but in your case it should be extremely rare -- perhaps a few out of a million. Detecting partial streaming stores is a very advanced topic, requiring the use of poorly-documented performance counters in the "uncore" and/or indirect detection of partial streaming stores by looking for elevated DRAM CAS counts (which might be due to other causes). More notes on streaming stores are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/

Related

Why does adding an if(!memcmp()) speed up a loop that makes random short strides through a huge byte array?

Sorry, I have never understood the rules here. I have deleted all the duplicate posts I have posted. This is the first related issue.
Please do not mark this post as a duplicate of my other one(Reduce the number of executions by 3 times, but the execution efficiency is almost unchanged. In C), even if the code is somewhat similar, they raise very different questions.These are also two problems that I found on the same day. A similar post has been repeated by "misjudgment" and then closed. It may be that I did not clarify this issue. I really hope to get the answer, so I reposted it. I hope everyone can see the problem clearly, thank you very much!
In the following C code, I added an "if" statement to the loop of the first test time, and the execution times are exactly the same. In theory, it should be slower. Even though branch prediction can make their performance almost the same, it actually becomes a lot faster. What is the principle of this? I tried to use clang and gcc compilers to run in Mac and Linux environments respectively, and tried various optimization levels. In order to prevent the cache from being affected, I let the faster ones execute first, but the loops with redundant code execute faster.
If you think my description is not credible, please compile the following code into your computer and run it. Hope someone can answer this question for me，thanks.
C code:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#define TLen 300000000
#define SLen 10
int main(int argc, const char * argv[]) {
srandom((unsigned)time(NULL));
// An array to increase the index,
// the range of its elements is 1-256
int rand_arr[128];
for (int i = 0; i < 128; ++i)
rand_arr[i] = random()%256+1;
// A random text(very long), the range of its elements is 0-127
char *tex = malloc((sizeof *tex) * TLen);
for (int i = 0; i < TLen; ++i)
tex[i] = random()%128;
// A random string(very short))
char *str = malloc((sizeof *str) * SLen);
for (int i = 0; i < SLen; ++i)
str[i] = random()%128;
// The first testing
clock_t start = clock();
for (int i = 0; i < TLen; ){
if (!memcmp(tex+i, str, SLen)) printf("yes!\n");
i += rand_arr[tex[i]];
}
clock_t end = clock();
printf("No.1: %lf s\n", ((double)(end - start)) / CLOCKS_PER_SEC);
// The second testing
start = clock();
for (int i = 0; i < TLen; ){
i += rand_arr[tex[i]];
}
end = clock();
printf("No.2: %lf s\n", ((double)(end - start)) / CLOCKS_PER_SEC);
return 0;
}
I have run hundreds of times, almost all of this ratio. The following is a representative result of a test in Linux:
No.1: 0.110000 s
No.2: 0.140000 s

The key bottleneck is the load-use latency of cache misses while striding through tex[], as part of the dependency chain from old i to new i. Lookups from the small rand_array[] will hit in cache and just add about 5 cycles of L1d load-use latency to the loop-carried dependency chain, making it even more unlikely for anything other than latency through i to be relevant to the overall speed of the loop. (Jérôme Richard's answer on your previous question discusses that latency dep chain and the random stride increment.)
memcmp(tex+i, str, SLen) might be acting as prefetching if tex+i is near the end of a cache line. If you compiled with -O0, you're calling glibc memcmp so the SSE 16-byte or AVX2 32-byte load done in glibc memcmp crosses a cache-line boundary and pulls in the next line. (IIRC, glibc memcmp checks and avoids page crossing, but not cache-line splits, because those are relatively cheap on modern CPUs. Single-step into it to see. Into the 2nd call to avoid lazy dynamic linking, or compile with -fno-plt).
Possibly touching some cache lines that would have been skipped by the random stride helps HW prefetch be more aggressive and detect it as a sequential access pattern. A 32-byte AVX load is half a cache line, so it's fairly likely to cross into the next one.
So a more-predictable sequence of cache-line requests seen by the L2 cache is a plausible explanation for the memcmp version being faster. (Intel CPUs put the main prefetchers in the L2.)
If you compile with optimization, that 10-byte memcmp inlines and in the inner loop only touches 8 bytes. (If they match, then it jumps to another block where it checks the last 2. But it's exceedingly unlikely that ever happens).
This might be why -O0 is faster than -O3 on my system, with GCC11.1 on i7-6700k Skylake with DDR4-2666 DRAM, on Arch GNU/Linux. (Your question didn't bother to specify which compiler and options your numbers were from, or which hardware.)
$ gcc -O3 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out
No.1: 0.041710 s
No.2: 0.063072 s
No.1: 0.040457 s
No.2: 0.061166 s
Performance counter stats for './a.out' (2 runs):
1,843.71 msec task-clock # 0.999 CPUs utilized ( +- 0.14% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
73,300 page-faults # 39.757 K/sec ( +- 0.00% )
6,607,078,798 cycles # 3.584 GHz ( +- 0.06% )
18,614,303,386 instructions # 2.82 insn per cycle ( +- 0.00% )
19,407,689,229 uops_issued.any # 10.526 G/sec ( +- 0.02% )
21,970,261,576 uops_executed.thread # 11.916 G/sec ( +- 0.02% )
5,408,090,087 mem_load_retired.l1_hit # 2.933 G/sec ( +- 0.00% )
3,861,687 mem_load_retired.l1_miss # 2.095 M/sec ( +- 2.11% )
1.84537 +- 0.00135 seconds time elapsed ( +- 0.07% )
$ grep . /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
... same on all 8 logical cores
The perf counter data is basically meaningless; it's for the whole run time, not the timed regions (which are about 0.1 sec out of 1.8 sec), so the majority of the time spent here was in glibc random(3) which uses "a nonlinear additive feedback random number generator employing a default table of size 31 long integers". Also in page-faults on the big malloc region.
It's only interesting in terms of the delta between two builds, but the loops outside the timed region still contribute many extra uops so it's still not as interesting as one would like.
vs. gcc -O0: No.1 is faster, No.2 is as expected a bit slower, with -O0 putting a store/reload into the dependency chain involving i.
$ gcc -O0 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out
No.1: 0.028402 s
No.2: 0.076405 s
No.1: 0.028079 s
No.2: 0.069492 s
Performance counter stats for './a.out' (2 runs):
1,979.57 msec task-clock # 0.999 CPUs utilized ( +- 0.04% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
66,656 page-faults # 33.672 K/sec ( +- 0.00% )
7,252,728,414 cycles # 3.664 GHz ( +- 0.02% )
20,507,166,672 instructions # 2.83 insn per cycle ( +- 0.01% )
22,268,130,378 uops_issued.any # 11.249 G/sec ( +- 0.00% )
25,117,638,171 uops_executed.thread # 12.688 G/sec ( +- 0.00% )
6,640,523,801 mem_load_retired.l1_hit # 3.355 G/sec ( +- 0.01% )
3,350,518 mem_load_retired.l1_miss # 1.693 M/sec ( +- 1.39% )
1.9810591 +- 0.0000934 seconds time elapsed ( +- 0.00% )
Remember this profiling is for the total run time, not the timed regions, so the 2.83 IPC is not for the timed region.
Out of order exec hiding cost of independent work
The actual throughput cost of running that macro-fused cmp/je uop doesn't add to any bottleneck, thanks to out-of-order exec. Or even a whole call memcmp#plt and setting up the args. The bottleneck is latency, not the front-end, or load ports in the back end, and the OoO exec window is deep enough to hide the memcmp work. See also the following to understand modern CPUs. (Yes, it will take a lot of reading to wrap your head around them.)
Modern Microprocessors: A 90-Minute Guide!
https://agner.org/optimize/
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
The loop-carried dependency through i goes i -> tex[i] -> indirectly back into i for next iteration, monotonically increasing it. tex[] is too big to fit in cache, and HW prefetch won't keep up with those strides.
So DRAM bandwidth might be the limiting factor, or even DRAM latency if HW prefetch doesn't detect and lock on to the somewhat-sequential access pattern to consecutive or every-other cache line.
The actual branch predicts perfectly, so it can execute (and confirm the prediction) when the load result arrives, in parallel with the stuff waiting for the sign-extending byte load that started at the same place.
and the execution times are exactly the same. ... it actually becomes a lot faster.
Huh? Your times don't show exactly the same. Yes, it does become measurably faster with that memcmp.
In theory, it should be slower.
Only if your theory is too simplistic to model modern CPUs. Not slowing it down is easily explainable because there is spare out-of-order exec throughput to do that independent work while waiting for the load latency.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
How many CPU cycles are needed for each assembly instruction? (that's not how performance works; CPUs overlap the execution of multiple instructions; there is no single number cost you can assign to each instruction and then add them up).
Also possible related for -O0 performance:
Adding a redundant assignment speeds up code when compiled without optimization - Sandybridge-family store forwarding has variable latency, and not trying too soon can actually let you achieve better latency, i.e. less of a bottleneck. But I think the major reason -O0 memcmp is faster is the prefetch effect from doing a wider load.
For reference, the with-memcmp inner loops from GCC on Godbolt, matching what I benchmarked on my desktop.
# gcc11.1 -O0 -fPIE
.L10: # do{
mov eax, DWORD PTR -16[rbp]
movsx rdx, eax
mov rax, QWORD PTR -32[rbp]
lea rcx, [rdx+rax]
mov rax, QWORD PTR -40[rbp]
mov edx, 10
mov rsi, rax
mov rdi, rcx
call memcmp#PLT
test eax, eax
jne .L9 # normally taken jump over puts
lea rax, .LC0[rip]
mov rdi, rax
call puts#PLT
.L9:
mov eax, DWORD PTR -16[rbp]
movsx rdx, eax
mov rax, QWORD PTR -32[rbp]
add rax, rdx # strange not using an addressing mode, but ok
movzx eax, BYTE PTR [rax]
movsx eax, al # GCC -O0 is dumb,
cdqe # really dumb.
mov eax, DWORD PTR -576[rbp+rax*4] # look up in rand_array[]
add DWORD PTR -16[rbp], eax # i += ...
.L8:
cmp DWORD PTR -16[rbp], 299999999
jle .L10 # }while(i<300000000)
vs. the -O3 code:
# RBP holds char *tex at at this point
.L10: # do{
movsx r12, r13d # sign-extend i
mov rax, QWORD PTR [rbx] # str[0..7] gets reloaded because alias analysis and the possible call to puts defeated the optimizer. Avoiding malloc for it may have helped.
add r12, rbp # i+tex to avoid indexed addressing modes later? Probably not optimal
cmp QWORD PTR [r12], rax # first 8 bytes of the memcmp
je .L18 # code at .L18 checks next 2 and maybe does puts, before jumping back
.L5:
movsx rax, BYTE PTR [r12] # sign-extending byte load to pointer width, of tex[i]
add r13d, DWORD PTR [rsp+rax*4] # i += look up in rand_array[]
cmp r13d, 299999999
jle .L10 # }while(i < 300000000)
Since the je .L18 is never taken, this is 7 uops per iteration, so Skylake can comfortable issue it at under 2 cycles per iteration.
Even with L1d cache hits, the loop-carried dependency through i (R13D) is:
1 cycle: movsx r12, r13d
1 cycle: add r12, rbp
~5 cycle load-use latency: movsx rax, BYTE PTR [r12]. (Not 4c, that optimistic special case only applies when the reg in the simple addressing mode was itself the direct result of a load.)
~5 cycle load-use latency plus 1 cycle ALU latency: add r13d, DWORD PTR [rsp+rax*4]
So a total of about 13 cycle latency best case with L1d hits, leaving tons of free "slots" in the front-end, and idle back-end execution units, that even a call to actual glibc memcmp would be no big deal.
(Of course the -O0 code is noisier so some of those free slots in the pipeline area already used up, but the dep chain is even longer because -O0 code keeps i in memory: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?)
CPU frequency in memory-bottlenecked loops, especially on Skylake
The init loops provide plenty of time for the CPU to get to max turbo before the timed region, and to pre-fault the allocated pages by touching them first. Idiomatic way of performance evaluation?
There might also be an effect of keeping CPU frequency higher if there's more work, less stalling, during those waits for incoming data from cache miss loads. See Slowing down CPU Frequency by imposing memory stress.
In the above results, I tested with EPP = balance_performance.
I also tested with performance, and -O0 (libc memcmp) is still faster than -O3 (inlined memcmp), but all 4 loops did speed up some (with/without memcmp, and optimized or not).
compiler/CPU
No.1 if(memcmp)
No.2 plain
-O0 / balance_performance
0.028079 s
0.069492 s
-O0 / performance
0.026898 s
0.053805 s
-O3 / balance_performance
0.040457 s
0.061166 s
-O3 / performance
0.035621 s
0.047475 s
The faster-at--O0 effect is still present and very significant even with EPP=performance, so we know it's not just a matter of clock speed. From previous experiments, performance keeps it pegged at max turbo even in memory-bound conditions where other settings would clock down from 4.2 to 2.7 GHz. So it's very likely that calling memcmp is helping trigger better prefetching to reduce average cache-miss latency.
I unfortunately didn't use a fixed seed for the PRNG so there may be some variation due to the randomness being good vs. bad for the prefetcher, in terms of patterns of whole cache lines accessed. And I just took one specific run for each (the 2nd one in the pair started by perf stat -r 2, so it should hopefully be even less affected by system fluctuation.)
Looks like performance made a bigger difference to the No.2 loop (with less going on in it), and also to the -O3 version (again with less going on in it), which is consistent with Skylake lowering clock speed when a core is pretty much only bottlenecking on memory, not running many other instructions, in EPP settings other than performance.

Short answer, and pure speculation: In the first timed loop, the buffer pointed to by tex is still in cache after initialization. The printf does a lot of complicated stuff, including kernel calls, and any cache-lines are now very likely filled with other data.
The second timed loop now has to re-read the data from RAM.
Try to swap the loops, and/or delay the printf to after both measurements.
Oh, and the call to memcmp has a potential buffer overflow.

How avoid cache line invalidation from multiple threads writing to a shared array?

Context of the problem:
I am writing a code that creates 32 threads, and set affinity of them to each one of the 32 cores in my multi-core-multi-processor system.
Threads simply execute the RDTSCP instruction and the value is stored in a shared array at a non-overlapping position, this is the shared array:
uint64_t rdtscp_values[32];
So, every thread is going to write to the specific array position based on its core number.
Up to know, everything is working properly with the exception that I know that I may not be using the right data structure to avoid cache line bouncing.
P.S: I have checked already that my processor's cache line is 64-bytes wide.
Because I am using a simple uint64_t array, it implies that a single cache line is going to store 8 positions of this array, because of the read-ahead.
Question:
Because of this simple array, although the threads write to different indexes, my understanding tells that every write to this array will cause a cache invalidation to all other threads?
How could I create a structure that is aligned to the cache line?
EDIT 1
My system is: 2x Intel Xeon E5-2670 2.30GHz (8 cores, 16 threads)

Yes you definitely want to avoid "false sharing" and cache-line ping-pong.
But this probably doesn't make sense: if these memory locations are thread-private more often than they're collected by other threads, they should be stored with other per-thread data so you're not wasting cache footprint on 56 bytes of padding. See also Cache-friendly way to collect results from multiple threads. (There's no great answer; avoid designing a system that needs really fine-grained gathering of results if you can.)
But let's just assume for a minute that unused padding between slots for different threads is actually what you want.
Yes, you need the stride to be 64 bytes (1 cache line), but you don't actually need the 8B you're using to be at the start of each cache line. Thus, you don't need any extra alignment as long as the uint64_t objects are naturally-aligned (so they aren't split across a cache-line boundary).
It's fine if each thread is writing to the 3rd qword of its cache line instead of the 1st. OTOH, aligning to 64B makes sure nothing else is sharing a cache line with the first element, and it's easy so we might as well.
Static storage: aligning static storage is very easy in ISO C11 using alignas(), or with compiler-specific stuff.
With a struct, padding is implicit to make the size a multiple of the required alignment. Having one member with an alignment requirement implies that the whole struct requires at least that much alignment. The compiler takes care of this for you with static and automatic storage, but you have to use aligned_alloc or an alternative for over-aligned dynamic allocation.
#include <stdalign.h> // for #define alignas _Alignas for C++ compat
#include <stdint.h> // for uint64_t
// compiler knows the padding is just padding
struct { alignas(64) uint64_t v; } rdtscp_values[32];
int foo(unsigned t) {
rdtscp_values[t].v = 1;
return sizeof(rdtscp_values[0]); // yes, this is 64
}
Or with an array as suggested by # Eric Postpischil:
alignas(64) // optional, stride will still be 64B without this.
uint64_t rdtscp_values_2d[32][8]; // 8 uint64_t per cache line
void bar(unsigned t) {
rdtscp_values_2d[t][0] = 1;
}
alignas() is optional if you don't care about the whole thing being 64B aligned, just having 64B stride between elements you use. You could also use __attribute__((aligned(64))) in GNU C or C++, or __declspec(align(64)) for MSVC, using #ifdef to define an ALIGN macro that's portable across the major x86 compilers.
Either way produces the same asm. We can check compiler output to verify that we got what we wanted. I put it up on the Godbolt compiler explorer. We get:
foo: # and same for bar
mov eax, edi # zero extend 32-bit to 64-bit
shl rax, 6 # *64 is the same as <<6
mov qword ptr [rax + rdtscp_values], 1 # store 1
mov eax, 64 # return value = 64 = sizeof(struct)
ret
Both arrays are declared the same way, with the compiler requesting 64B alignment from the assembler/linker with the 3rd arg to .comm:
.comm rdtscp_values_2d,2048,64
.comm rdtscp_values,2048,64
Dynamic storage:
If the number of threads is not a compile-time constant, then you can use an aligned allocation function to get aligned dynamically-allocated memory (especially if you want to support a very high number of threads). See How to solve the 32-byte-alignment issue for AVX load/store operations?, but really just use C11 aligned_alloc. It's perfect for this, and returns a pointer that's compatible with free().
struct { alignas(64) uint64_t v; } *dynamic_rdtscp_values;
void init(unsigned nthreads) {
size_t sz = sizeof(dynamic_rdtscp_values[0]);
dynamic_rdtscp_values = aligned_alloc(nthreads*sz, sz);
}
void baz(unsigned t) {
dynamic_rdtscp_values[t].v = 1;
}
baz:
mov rax, qword ptr [rip + dynamic_rdtscp_values]
mov ecx, edi # same code as before to scale by 64 bytes
shl rcx, 6
mov qword ptr [rax + rcx], 1
ret
The address of the array is no longer a link-time constant, so there's an extra level of indirection to access it. But the pointer is read-only after it's initialized, so it will stay shared in cache in each core and reloading it when needed is very cheap.
Footnote: In the i386 System V ABI, uint64_t only has 4B-alignment inside structs by default (without alignas(8) or __attribute__((aligned(8)))), so if you put an int before a uint64_t and didn't do any alignment of the whole struct, it would be possible to get cache-line splits. But compilers align it by 8B whenever possible, so your struct-with padding is still fine.

SOLUTION
So, I followed the comments here and I must say thanks for all contributions.
Finally I got what I expected: cache lines being used properly per each thread.
Here is the shared structure:
typedef struct align_st {
uint64_t v;
uint64_t padding[7];
} align_st_t __attribute__ ((aligned (64)));
I am using a padding uint64_t padding[7] inside the structure to fill the remaining bytes in the cache line when this structure is loaded to the L1 cache. Nonetheless, I am asking to the compiler to use 64-bytes memory alignment when compiling it __attribute__ ((aligned (64))).
So, I allocate this structure dynamically based on the number of cores, using the memalign() for this:
align_st_t *al = (align_st_t*) memalign(64, n_cores * sizeof(align_st_t));
To compare it, I wrote one code version (V1) that uses these aligned mechanisms, and other code version (V2) that uses the simple array method.
By executing with perf, and I got these numbers:
V1: 7.184 cache-misses;
V2: 2.621.347 cache-misses.
P.S.: Each thread is writing 1-thousand times to the same address of the shared structure just to increase the numbers

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

What's missing/sub-optimal in this memcpy implementation?

I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's
some guy's implementation:
__forceinline // Since Size is usually known,
// most useless code will be optimized out
// if the function is inlined.
void* myMemcpy(char* Dst, const char* Src, size_t Size)
{
void* start = Dst;
for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
{
__m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
_mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
}
#define CPY_1B *((uint8_t * &)Dst)++ = *((const uint8_t * &)Src)++
#define CPY_2B *((uint16_t* &)Dst)++ = *((const uint16_t* &)Src)++
#define CPY_4B *((uint32_t* &)Dst)++ = *((const uint32_t* &)Src)++
#if defined _M_X64 || defined _M_IA64 || defined __amd64
#define CPY_8B *((uint64_t* &)Dst)++ = *((const uint64_t* &)Src)++
#else
#define CPY_8B _mm_storel_epi64((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const uint64_t* &)Src, ++(uint64_t* &)Dst
#endif
#define CPY16B _mm_storeu_si128((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const __m128i* &)Src, ++(__m128i* &)Dst
switch (Size) {
case 0x00: break;
case 0x01: CPY_1B; break;
case 0x02: CPY_2B; break;
case 0x03: CPY_1B; CPY_2B; break;
case 0x04: CPY_4B; break;
case 0x05: CPY_1B; CPY_4B; break;
case 0x06: CPY_2B; CPY_4B; break;
case 0x07: CPY_1B; CPY_2B; CPY_4B; break;
case 0x08: CPY_8B; break;
case 0x09: CPY_1B; CPY_8B; break;
case 0x0A: CPY_2B; CPY_8B; break;
case 0x0B: CPY_1B; CPY_2B; CPY_8B; break;
case 0x0C: CPY_4B; CPY_8B; break;
case 0x0D: CPY_1B; CPY_4B; CPY_8B; break;
case 0x0E: CPY_2B; CPY_4B; CPY_8B; break;
case 0x0F: CPY_1B; CPY_2B; CPY_4B; CPY_8B; break;
case 0x10: CPY16B; break;
case 0x11: CPY_1B; CPY16B; break;
case 0x12: CPY_2B; CPY16B; break;
case 0x13: CPY_1B; CPY_2B; CPY16B; break;
case 0x14: CPY_4B; CPY16B; break;
case 0x15: CPY_1B; CPY_4B; CPY16B; break;
case 0x16: CPY_2B; CPY_4B; CPY16B; break;
case 0x17: CPY_1B; CPY_2B; CPY_4B; CPY16B; break;
case 0x18: CPY_8B; CPY16B; break;
case 0x19: CPY_1B; CPY_8B; CPY16B; break;
case 0x1A: CPY_2B; CPY_8B; CPY16B; break;
case 0x1B: CPY_1B; CPY_2B; CPY_8B; CPY16B; break;
case 0x1C: CPY_4B; CPY_8B; CPY16B; break;
case 0x1D: CPY_1B; CPY_4B; CPY_8B; CPY16B; break;
case 0x1E: CPY_2B; CPY_4B; CPY_8B; CPY16B; break;
case 0x1F: CPY_1B; CPY_2B; CPY_4B; CPY_8B; CPY16B; break;
}
#undef CPY_1B
#undef CPY_2B
#undef CPY_4B
#undef CPY_8B
#undef CPY16B
return start;
}
The comment translates as "Size is usually known as the compiler can optimize the code inline out most useless".
I would like to improve, if possible, on this implementation - but maybe there isn't much to improve. I see it uses SSE/AVX for the larger chunks of memory, then instead of a loop over the last < 32 bytes does the equivalent of manual unrolling, with some tweaking. So, here are my questions:
Why unroll the loop for the last several bytes, but not partially unroll the first (and now single) loop?
What about alignment issues? Aren't they important? Should I handle the first several bytes up to some alignment quantum differently, then perform the 256-bit ops on aligned sequences of bytes? And if so, how do I determine the appropriate alignment quantum?
What's the most important missing feature in this implementation (if any)?
Features/Principles mentioned in the answers so far
You should __restrict__ your parameters. (#chux)
The memory bandwidth is a limiting factor; measure your implementation against it.(#Zboson)
For small arrays, you can expect to approach the memory bandwidth; for larger arrays - not as much. (#Zboson)
Multiple threads (may be | are) necessary to saturate the memory bandwidth. (#Zboson)
It is probably wise to optimize differently for large and small copy sizes. (#Zboson)
(Alignment is important? Not explicitly addressed!)
The compiler should be made more explicitly aware of "obvious facts" it can use for optimization (such as the fact that Size < 32 after the first loop). (#chux)
There are arguments for unrolling your SSE/AVX calls (#BenJackson, here), and arguments against doing so (#PaulR)
non-temporal transfers (with which you tell the CPU you don't need it to cache the target location) should be useful for copying larger buffers. (#Zboson)

I have been studying measuring memory bandwidth for Intel processors with various operations and one of them is memcpy. I have done this on Core2, Ivy Bridge, and Haswell. I did most of my tests using C/C++ with intrinsics (see the code below - but I'm currently rewriting my tests in assembly).
To write your own efficient memcpy function it's important to know what the absolute best bandwidth possible is. This bandwidth is a function of the size of the arrays which will be copied and therefore an efficient memcpy function needs to optimize differently for small and big (and maybe in between). To keep things simple I have optimized for small arrays of 8192 bytes and large arrays of 1 GB.
For small arrays the maximum read and write bandwidth for each core is:
Core2-Ivy Bridge 32 bytes/cycle
Haswell 64 bytes/cycle
This is the benchmark you should aim for small arrays. For my tests I assume the arrays are aligned to 64-bytes and that the array size is a multiple of 8*sizeof(float)*unroll_factor. Here are my current memcpy results for a size of 8192 bytes (Ubuntu 14.04, GCC 4.9, EGLIBC 2.19):
GB/s efficiency
Core2 (p9600#2.66 GHz)
builtin 35.2 41.3%
eglibc 39.2 46.0%
asmlib: 76.0 89.3%
copy_unroll1: 39.1 46.0%
copy_unroll8: 73.6 86.5%
Ivy Bridge (E5-1620#3.6 GHz)
builtin 102.2 88.7%
eglibc: 107.0 92.9%
asmlib: 107.6 93.4%
copy_unroll1: 106.9 92.8%
copy_unroll8: 111.3 96.6%
Haswell (i5-4250U#1.3 GHz)
builtin: 68.4 82.2%
eglibc: 39.7 47.7%
asmlib: 73.2 87.6%
copy_unroll1: 39.6 47.6%
copy_unroll8: 81.9 98.4%
The asmlib is Agner Fog's asmlib. The copy_unroll1 and copy_unroll8 functions are defined below.
From this table we can see that the GCC builtin memcpy does not work well on Core2 and that memcpy in EGLIBC does not work well on Core2 or Haswell. I did check out a head version of GLIBC recently and the performance was much better on Haswell. In all cases unrolling gets the best result.
void copy_unroll1(const float *x, float *y, const int n) {
for(int i=0; i<n/JUMP; i++) {
VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
}
}
void copy_unroll8(const float *x, float *y, const int n) {
for(int i=0; i<n/JUMP; i+=8) {
VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
VECNF().LOAD(&x[JUMP*(i+1)]).STORE(&y[JUMP*(i+1)]);
VECNF().LOAD(&x[JUMP*(i+2)]).STORE(&y[JUMP*(i+2)]);
VECNF().LOAD(&x[JUMP*(i+3)]).STORE(&y[JUMP*(i+3)]);
VECNF().LOAD(&x[JUMP*(i+4)]).STORE(&y[JUMP*(i+4)]);
VECNF().LOAD(&x[JUMP*(i+5)]).STORE(&y[JUMP*(i+5)]);
VECNF().LOAD(&x[JUMP*(i+6)]).STORE(&y[JUMP*(i+6)]);
VECNF().LOAD(&x[JUMP*(i+7)]).STORE(&y[JUMP*(i+7)]);
}
}
Where VECNF().LOADis _mm_load_ps() for SSE or _mm256_load_ps() for AVX, VECNF().STORE is _mm_store_ps() for SSE or _mm256_store_ps() for AVX, and JUMP is 4 for SSE or 8 for AVX.
For the large size the best result is obtained by using non-temporal store instructions and by using multiple threads. Contrary to what many people may believe a single thread does NOT usually saturate the memory bandwidth.
void copy_stream(const float *x, float *y, const int n) {
#pragma omp parallel for
for(int i=0; i<n/JUMP; i++) {
VECNF v = VECNF().load_a(&x[JUMP*i]);
stream(&y[JUMP*i], v);
}
}
Where stream is _mm_stream_ps() for SSE or _mm256_stream_ps() for AVX
Here are the memcpy results on my E5-1620#3.6 GHz with four threads for 1 GB with a maximum main memory bandwidth of 51.2 GB/s.
GB/s efficiency
eglibc: 23.6 46%
asmlib: 36.7 72%
copy_stream: 36.7 72%
Once again EGLIBC performs poorly. This is because it does not use non-temporal stores.
I modfied the eglibc and asmlib memcpy functions to run in parallel like this
void COPY(const float * __restrict x, float * __restrict y, const int n) {
#pragma omp parallel
{
size_t my_start, my_size;
int id = omp_get_thread_num();
int num = omp_get_num_threads();
my_start = (id*n)/num;
my_size = ((id+1)*n)/num - my_start;
memcpy(y+my_start, x+my_start, sizeof(float)*my_size);
}
}
A general memcpy function needs to account for arrays which are not aligned to 64 bytes (or even to 32 or to 16 bytes) and where the size is not a multiple of 32 bytes or the unroll factor. Additionally, a decision has to be made as to when to use non-temporal stores. The general rule of thumb is to only use non-temporal stores for sizes larger than half the largest cache level (usually L3). But theses are "second order" details which I think should be dealt with after optimizing for ideal cases of large and small. There's not much point in worrying about correcting for misalignment or non-ideal size multiples if the ideal case performs poorly as well.
Update
Based on comments by Stephen Canon I have learned that on Ivy Bridge and Haswell it's more efficient to use rep movsb than movntdqa (a non-temporal store instruction). Intel calls this enhanced rep movsb (ERMSB). This is described in the Intel Optimization manuals in the section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB).
Additionally, in Agner Fog's Optimizing Subroutines in Assembly manual in section 17.9 Moving blocks of data (All processors) he writes:
"There are several ways of moving large blocks of data. The most common methods are:
REP MOVS instruction.
If data are aligned: Read and write in a loop with the largest available register size.
If size is constant: inline move instructions.
If data are misaligned: First move as many bytes as required to make the destination
aligned. Then read unaligned and write aligned in a loop with the largest available
register size.
If data are misaligned: Read aligned, shift to compensate for misalignment and write
aligned.
If the data size is too big for caching, use non-temporal writes to bypass the cache.
Shift to compensate for misalignment, if necessary."
A general memcpy should consider each of these points. Additionally, with Ivy Bridge and Haswell it seems that point 1 is better than point 6 for large arrays. Different techniques are necessary for Intel and AMD and for each iteration of technology. I think it's clear that writing your own general efficient memcpyfunction can be quite complicated. But in the special cases I have looked at I have already managed to do better than the GCC builtin memcpy or the one in EGLIBC so the assumption that you can't do better than the standard libraries is incorrect.

The question can't be answered precisely without some additional details such as:
What is the target platform (CPU architecture, most, but memory configuration plays a role too)?
What is the distribution and predictability1 of the copy lengths (and to a lesser extent, the distribution and predictability of alignments)?
Will the copy size ever be statically known at compile-time?
Still, I can point out a couple things that are likely to be sub-optimal for at least some combination of the above parameters.
32-case Switch Statement
The 32-case switch statement is a cute way of handling the trailing 0 to 31 bytes, and likely benchmarks very well - but may perform badly in the real world due to at least two factors.
Code Size
This switch statement alone takes several hundred bytes of code for the body, in addition to a 32-entry lookup table needed to jump to the correct location for each length. The cost of this isn't going to show up in a focused benchmark of memcpy on a full-sized CPU because everything still fit in the fastest cache level: but in the real world you execute other code too and there is contention for the uop cache and L1 data and instruction caches.
That many instructions may take fully 20% of the effective size of your uop cache3, and uop cache misses (and the corresponding cache-to-legacy encoder transition cycles) could easily wipe the small benefit given by this elaborate switch.
On top of that, the switch requires a 32-entry, 256 byte lookup table for the jump targets4. If you ever get a miss to DRAM on that lookup, you are talking a penalty of 150+ cycles: how many non-misses do you need to then to make the switch worth it, given it's probably saving a few or two at the most? Again, that won't show up in a microbenchmark.
For what its worth, this memcpy isn't unusual: that kind of "exhaustive enumeration of cases" is common even in optimized libraries. I can conclude that either their development was driven mostly by microbenchmarks, or that it is still worth it for a large slice of general purpose code, despite the downsides. That said, there are certainly scenarios (instruction and/or data cache pressure) where this is suboptimal.
Branch Prediction
The switch statement relies on a single indirect branch to choose among the alternatives. This going to be efficient to the extent that the branch predictor can predict this indirect branch, which basically means that the sequence of observed lengths needs to be predictable.
Because it is an indirect branch, there are more limits on the predictability of the branch than a conditional branch since there are a limited number of BTB entries. Recent CPUs have made strides here, but it is safe to say that if the series of lengths fed to memcpy don't follow a simple repeating pattern of a short period (as short as 1 or 2 on older CPUs), there will be a branch-mispredict on each call.
This issue is particularly insidious because it is likely to hurt you the most in real-world in exactly the situations where a microbenchmark shows the switch to be the best: short lengths. For very long lengths, the behavior on the trailing 31 bytes isn't very important since it is dominated by the bulk copy. For short lengths, the switch is all-important (indeed, for copies of 31 bytes or less it is all that executes)!
For these short lengths, a predictable series of lengths works very well for the switch since the indirect jump is basically free. In particular, a typical memcpy benchmark "sweeps" over a series of lengths, using the same length repeatedly for each sub-test to report the results for easy graphing of "time vs length" graphs. The switch does great on these tests, often reporting results like 2 or 3 cycles for small lengths of a few bytes.
In the real world, your lengths might be small but unpredicable. In that case, the indirect branch will frequently mispredict5, with a penalty of ~20 cycles on modern CPUs. Compared to best case of a couple cycles it is an order of magnitude worse. So the glass jaw here can be very serious (i.e., the behavior of the switch in this typical case can be an order of magnitude worse than the best, whereas at long lengths, you are usually looking at a difference of 50% at most between different strategies).
Solutions
So how can you do better than the above, at least under the conditions where the switch falls apart?
Use Duff's Device
One solution to the code size issue is to combine the switch cases together, duff's device-style.
For example, the assembled code for the length 1, 3 and 7 cases looks like:
Length 1
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
ret
Length 3
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
movzx edx, WORD PTR [rsi+1]
mov WORD PTR [rcx+1], dx
Length 7
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
movzx edx, WORD PTR [rsi+1]
mov WORD PTR [rcx+1], dx
mov edx, DWORD PTR [rsi+3]
mov DWORD PTR [rcx+3], edx
ret
This can combined into a single case, with various jump-ins:
len7:
mov edx, DWORD PTR [rsi-6]
mov DWORD PTR [rcx-6], edx
len3:
movzx edx, WORD PTR [rsi-2]
mov WORD PTR [rcx-2], dx
len1:
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
ret
The labels don't cost anything, and they combine the cases together and removes two out of 3 ret instructions. Note that the basis for rsi and rcx have changed here: they point to the last byte to copy from/to, rather than the first. That change is free or very cheap depending on the code before the jump.
You can extend that for longer lengths (e.g., you can attach lengths 15 and 31 to the chain above), and use other chains for the missing lengths. The full exercise is left to the reader. You can probably get a 50% size reduction alone from this approach, and much better if you combine it with something else to collapse the sizes from 16 - 31.
This approach only helps with the code size (and possibly the jump table size, if you shrink the size as described in 4 and you get under 256 bytes, allowing a byte-sized lookup table. It does nothing for predictability.
Overlapping Stores
One trick that helps for both code size and predictability is to use overlapping stores. That is, memcpy of 8 to 15 bytes can be accomplished in a branch-free way with two 8-byte stores, with the second store partly overlapping the first. For example, to copy 11 bytes, you would do an 8-byte copy at relative position 0 and 11 - 8 == 3. Some of the bytes in the middle would be "copied twice", but in practice this is fine since an 8-byte copy is the same speed as a 1, 2 or 4-byte one.
The C code looks like:
if (Size >= 8) {
*((uint64_t*)Dst) = *((const uint64_t*)Src);
size_t offset = Size & 0x7;
*(uint64_t *)(Dst + offset) = *(const uint64_t *)(Src + offset);
}
... and the corresponding assembly is not problematic:
cmp rdx, 7
jbe .L8
mov rcx, QWORD PTR [rsi]
and edx, 7
mov QWORD PTR [rdi], rcx
mov rcx, QWORD PTR [rsi+rdx]
mov QWORD PTR [rdi+rdx], rcx
In particular, note that you get exactly two loads, two stores and one and (in addition to the cmp and jmp whose existence depends on how you organize the surrounding code). That's already tied or better than most of the compiler-generated approaches for 8-15 bytes, which might use up to 4 load/store pairs.
Older processors suffered some penalty for such "overlapping stores", but newer architectures (the last decade or so, at least) seem to handle them without penalty6. This has two main advantages:
The behavior is branch free for a range of sizes. Effectively, this quantizes the branching so that many values take the same path. All sizes from 8 to 15 (or 8 to 16 if you want) take the same path and suffer no misprediction pressure.
At least 8 or 9 different cases from the switch are subsumed into a single case with a fraction of the total code size.
This approach can be combined with the switch approach, but using only a few cases, or it can be extended to larger sizes with conditional moves that could do, for example, all moves from 8 to 31 bytes without branches.
What works out best again depends on the branch distribution, but overall this "overlapping" technique works very well.
Alignment
The existing code doesn't address alignment.
In fact, it isn't, in general, legal or C or C++, since the char * pointers are simply casted to larger types and dereferenced, which is not legal - although in practice it generates codes that works on today's x86 compilers (but in fact would fail for platform with stricter alignment requirements).
Beyond that, it is often better to handle the alignment specifically. There are three main cases:
The source and destination are already alignment. Even the original algorithm will work fine here.
The source and destination are relatively aligned, but absolutely misaligned. That is, there is a value A that can be added to both the source and destination such that both are aligned.
The source and destination are fully misaligned (i.e., they are not actually aligned and case (2) does not apply).
The existing algorithm will work ok in case (1). It is potentially missing a large optimization the case of (2) since small intro loop could turn an unaligned copy into an aligned one.
It is also likely performing poorly in case (3), since in general in the totally misaligned case you can chose to either align the destination or the source and then proceed "semi-aligned".
The alignment penalties have been getting smaller over time and on the most recent chips are modest for general purpose code but can still be serious for code with many loads and stores. For large copies, it probably doesn't matter too much since you'll end up DRAM bandwidth limited, but for smaller copies misalignment may reduce throughput by 50% or more.
If you use NT stores, alignment can also be important, because many of the NT store instructions perform poorly with misaligned arguments.
No unrolling
The code is not unrolled and compilers unrolled by different amounts by default. Clearly this is suboptimal since among two compilers with different unroll strategies, at most one will be best.
The best approach (at least for known platform targets) is determine which unroll factor is best, and then apply that in the code.
Furthermore, the unrolling can often be combined in a smart way with the "intro" our "outro" code, doing a better job than the compiler could.
Known sizes
The primary reason that it is tough to beat the "builtin" memcpy routine with modern compilers is that compilers don't just call a library memcpy whenever memcpy appears in the source. They know the contract of memcpy and are free to implement it with a single inlined instruction, or even less7, in the right scenario.
This is especially obvious with known lengths in memcpy. In this case, if the length is small, compilers will just insert a few instructions to perform the copy efficiently and in-place. This not only avoids the overhead of the function call, but all the checks about size and so on - and also generates at compile time efficient code for the copy, much like the big switch in the implementation above - but without the costs of the switch.
Similarly, the compiler knows a lot of about the alignment of structures in the calling code, and can create code that deals efficiently with alignment.
If you just implement a memcpy2 as a library function, that is tough to replicate. You can get part of the way there my splitting the method into a small and big part: the small part appears in the header file, and does some size checks and potentially just calls the existing memcpy if the size is small or delegates to the library routine if it is large. Through the magic of inlining, you might get to the same place as the builtin memcpy.
Finally, you can also try tricks with __builtin_constant_p or equivalents to handle the small, known case efficiently.
1 Note that I'm drawing a distinction here between the "distribution" of sizes - e.g., you might say _uniformly distributed between 8 and 24 bytes - and the "predictability" of the actual sequence of sizes (e.g., do the sizes have a predicable pattern)? The question of predictability somewhat subtle because it depends on on the implementation, since as described above certain implementations are inherently more predictable.
2 In particular, ~750 bytes of instructions in clang and ~600 bytes in gcc for the body alone, on top of the 256-byte jump lookup table for the switch body which had 180 - 250 instructions (gcc and clang respectively). Godbolt link.
3 Basically 200 fused uops out of an effective uop cache size of 1000 instructions. While recent x86 have had uop cache sizes around ~1500 uops, you can't use it all outside of extremely dedicated padding of your codebase because of the restrictive code-to-cache assignment rules.
4 The switch cases have different compiled lengths, so the jump can't be directly calculated. For what it's worth, it could have been done differently: they could have used a 16-bit value in the lookup table at the cost of not using memory-source for the jmp, cutting its size by 75%.
5 Unlike conditional branch prediction, which has a typical worst-case prediction rate of ~50% (for totally random branches), a hard-to-predict indirect branch can easily approach 100% since you aren't flipping a coin, you are choosing for an almost infinite set of branch targets. This happens in the real-world: if memcpy is being used to copy small strings with lengths uniformly distributed between 0 and 30, the switch code will mispredict ~97% of the time.
6 Of course, there may be penalties for misaligned stores, but these are also generally small and have been getting smaller.
7 For example, a memcpy to the stack, followed by some manipulation and a copy somewhere else may be totally eliminated, directly moving the original data to its final location. Even things like malloc followed by memcpy can be totally eliminated.

Firstly the main loop uses unaligned AVX vector loads/stores to copy 32 bytes at a time, until there are < 32 bytes left to copy:
for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
{
__m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
_mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
}
Then the final switch statement handles the residual 0..31 bytes in as efficient manner as possible, using a combination of 8/4/2/1 byte copies as appropriate. Note that this is not an unrolled loop - it's just 32 different optimised code paths which handle the residual bytes using the minimum number of loads and stores.
As for why the main 32 byte AVX loop is not manually unrolled - there are several possible reasons for this:
most compilers will unroll small loops automatically (depending on loop size and optimisation switches)
excessive unrolling can cause small loops to spill out of the LSD cache (typically only 28 decoded µops)
on current Core iX CPUs you can only issue two concurrent loads/stores before you stall [*]
typically even a non-unrolled AVX loop like this can saturate available DRAM bandwidth [*]
[*] note that the last two comments above apply to cases where source and/or destination are not in cache (i.e. writing/reading to/from DRAM), and therefore load/store latency is high.

Taking Benefits of The ERMSB
Please also consider using REP MOVSB for larger blocks.
As you know, since first Pentium CPU produced in 1993, Intel began to make simple commands faster and complex commands (like REP MOVSB) slower. So, REP MOVSB became very slow, and there was no more reason to use it. In 2013, Intel decided to revisit REP MOVSB. If the CPU has CPUID ERMSB (Enhanced REP MOVSB) bit, then REP MOVSB commands are executed differently than on older processors, and are supposed to be fast. On practice, it is only fast for large blocks, 256 bytes and larger, and only when certain conditions are met:
both the source and destination addresses have to be aligned to a 16-Byte boundary;
the source region should not overlap with the destination region;
the length has to be a multiple of 64 to produce higher performance;
the direction has to be forward (CLD).
See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Intel recommends using AVX for blocks smaller than 2048 bytes. For the larger blocks, Intel recommends using REP MOVSB. This is because high initial startup costs of REP MOVSB (about 35 cycles).
I have done speed tests, and for the blocks of than 2048 bytes and higher, the performance of REP MOVSB is unbeatable. However, for blocks smaller than 256 bytes, REP MOVSB is very slow, even slower than plain MOV RAX back and forth in a loop.
Please not that ERMSB only affects MOVSB, not MOVSD (MOVSQ), so MOVSB is little bit faster than MOVSD (MOVSQ).
So, you can use AVX for your memcpy() implementation, and if the block is larger than 2048 bytes and all the conditions are met, then call REP MOVSB - so your memcpy() implementation will be unbeatable.
Taking Benefits of The Out-of-Order Execution Engine
You can also read about The Out-of-Order Execution Engine
in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf section the 2.1.2, and take benefits of it.
For example, in Intel SkyLake processor series (launched in 2015), it has:
4 execution units for the Arithmetic logic unit (ALU) (add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup),
3 execution units for Vector ALU ( (v)pand, (v)por, (v)pxor, (v)movq, (v)movq, (v)movap*, (v)movup*, (v)andp*, (v)orp*, (v)paddb/w/d/q, (v)blendv*, (v)blendp*, (v)pblendd)
So we can occupy above units (3+4) in parallel if we use register-only operations. We cannot use 3+4 instructions in parallel for memory copy. We can use simultaneously maximum of up to two 32-bytes instructions to load from memory and one 32-bytes instructions to store from memory, and even if we are working with Level-1 cache.
Please see the Intel manual again to understand on how to do the fastest memcpy implementation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Section 2.2.2 (The Out-of-Order Engine of the Haswelll microarchitecture): "The Scheduler controls the dispatch of micro-ops onto the dispatch ports. There are eight dispatch ports to support the out-of-order execution core. Four of the eight ports provided execution resources for computational operations. The other 4 ports support memory operations of up to two 256-bit load and one 256-bit store operation in a cycle."
Section 2.2.4 (Cache and Memory Subsystem) has the following note: "First level data cache supports two load micro-ops each cycle; each micro-op can fetch up to 32-bytes of data."
Section 2.2.4.1 (Load and Store Operation Enhancements) has the following information: The L1 data cache can handle two 256-bit (32 bytes) load and one 256-bit (32 bytes) store operations each cycle. The unified L2 can service one cache line (64 bytes) each cycle. Additionally, there are 72 load buffers and 42 store buffers available to support micro-ops execution in-flight.
The other sections (2.3 and so on, dedicated to Sandy Bridge and other microarchitectures) basically reiterate the above information.
The section 2.3.4 (The Execution Core) gives additional details.
The scheduler can dispatch up to six micro-ops every cycle, one on each port. The following table summarizes which operations can be dispatched on which port.
Port 0: ALU, Shift, Mul, STTNI, Int-Div, 128b-Mov, Blend, 256b-Mov
Port 1: ALU, Fast LEA, Slow LEA, MUL, Shuf, Blend, 128bMov, Add, CVT
Port 2 & Port 3: Load_Addr, Store_addr
Port 4: Store_data
Port 5: ALU, Shift, Branch, Fast LEA, Shuf, Blend, 128b-Mov, 256b-Mov
The section 2.3.5.1 (Load and Store Operation Overview) may also be useful to understand on how to make fast memory copy, as well as the section 2.4.4.1 (Loads and Stores).
For the other processor architectures, it is again - two load units and one store unit. Table 2-4 (Cache Parameters of the Skylake Microarchitecture) has the following information:
Peak Bandwidth (bytes/cyc):
First Level Data Cache: 96 bytes (2x32B Load + 1*32B Store)
Second Level Cache: 64 bytes
Third Level Cache: 32 bytes.
I have also done speed tests on my Intel Core i5 6600 CPU (Skylake, 14nm, released in September 2015) with DDR4 memory, and this has confirmed the teory. For example, my test have shown that using generic 64-bit registers for memory copy, even many registers in parallel, degrades performance. Also, using just 2 XMM registers is enough - adding the 3rd doesn't add performance.
If your CPU has AVX CPUID bit, you may take benefits of the large, 256-bit (32 byte) YMM registers to copy memory, to occupy two full load units. The AVX support was first introduced by Intel with the Sandy Bridge processors, shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011.
// first cycle
vmovdqa ymm0, ymmword ptr [rcx+0] // load 1st 32-byte part using first load unit
vmovdqa ymm1, ymmword ptr [rcx+20h] // load 2nd 32-byte part using second load unit
// second cycle
vmovdqa ymmword ptr [rdx+0], ymm0 // store 1st 32-byte part using the single store unit
// third cycle
vmovdqa ymmword ptr [rdx+20h], ymm1 ; store 2nd 32-byte part - using the single store unit (this instruction will require a separate cycle since there is only one store unit, and we cannot do two stores in a single cycle)
add ecx, 40h // these instructions will be used by a different unit since they don't invoke load or store, so they won't require a new cycle
add edx, 40h
Also, there is speed benefit if you loop-unroll this code at least 8 times. As I wrote before, adding more registers besides ymm0 and ymm1 doesn't increase performance, because there are just two load units and one store unit. Adding loops like "dec r9 jnz ##again" degrades the performance, but simple "add ecx/edx" does not.
Finally, if your CPU has AVX-512 extension, you can use 512-bit (64-byte) registers to copy memory:
vmovdqu64 zmm0, [rcx+0] ; load 1st 64-byte part
vmovdqu64 zmm1, [rcx+40h] ; load 2nd 64-byte part
vmovdqu64 [rdx+0], zmm0 ; store 1st 64-byte part
vmovdqu64 [rdx+40h], zmm1 ; store 2nd 64-byte part
add rcx, 80h
add rdx, 80h
AVX-512 is supported by the following processors: Xeon Phi x200, released in 2016; Skylake EP/EX Xeon "Purley" (Xeon E5-26xx V5) processors (H2 2017); Cannonlake processors (H2 2017), Skylake-X processors - Core i9-7×××X, i7-7×××X, i5-7×××X - released on June 2017.
Please note that the memory have to be aligned on the size of the registers that you are using. If it is not, please use "unaligned" instructions: vmovdqu and moveups.

My attempt to optimize memset on a 64bit machine takes more time than standard implementation. Can someone please explain why?

(machine is x86 64 bit running SL6)
I was trying to see if I can optimize memset on my 64 bit machine. As per my understanding memset goes byte by byte and sets the value. I assumed that if I do in units of 64 bits, it would be faster. But somehow it takes more time. Can someone take a look at my code and suggest why ?
/* Code */
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
void memset8(unsigned char *dest, unsigned char val, uint32_t count)
{
while (count--)
*dest++ = val;
}
void memset32(uint32_t *dest, uint32_t val, uint32_t count)
{
while (count--)
*dest++ = val;
}
void
memset64(uint64_t *dest, uint64_t val, uint32_t count)
{
while (count--)
*dest++ = val;
}
#define CYCLES 1000000000
int main()
{
clock_t start, end;
double total;
uint64_t loop;
uint64_t val;
/* memset 32 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset32((uint32_t*)&val, 0, 2);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset32 %g\n", total);
/* memset 64 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset64(&val, 0, 1);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset64 %g\n", total);
/* memset 8 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset8((unsigned char*)&val, 0, 8);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset8 %g\n", total);
/* memset */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset(&val, 0, 8);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset %g\n", total);
printf("-----------------------------------------\n");
}
/*Result*/
Timetaken memset32 12.46
Timetaken memset64 7.57
Timetaken memset8 37.12
Timetaken memset 6.03
-----------------------------------------
Looks like the standard memset is more optimized than my implementation.
I tried looking into code and everywhere is see that implementation of memset is same as what I did for memset8. When I use memset8, the results are more like what I expect and very different from memset.
Can someone suggest what am I doing wrong ?

Actual memset implementations are typically hand-optimized in assembly, and use the widest aligned writes available on the targeted hardware. On x86_64 that will be at least 16B stores (movaps, for example). It may also take advantage of prefetching (this is less common recently, as most architectures have good automatic streaming prefetchers for regular access patterns), streaming stores or dedicated instructions (historically rep stos was unusably slow on x86, but it is quite fast on recent microarchitectures). Your implementation does none of these things. It should not be terribly surprising that the system implementation is faster.
As an example, consider the implementation used in OS X 10.8 (which has been superseded in 10.9). Here’s the core loop for modest-sized buffers:
.align 4,0x90
1: movdqa %xmm0, (%rdi,%rcx)
movdqa %xmm0, 16(%rdi,%rcx)
movdqa %xmm0, 32(%rdi,%rcx)
movdqa %xmm0, 48(%rdi,%rcx)
addq $64, %rcx
jne 1b
This loop will saturate the LSU when hitting cache on pre-Haswell microarchitectures at 16B/cycle. An implementation based on 64-bit stores like your memset64 cannot exceed 8B/cycle (and may not even achieve that, depending on the microarchitecture in question and whether or not the compiler unrolls your loop). On Haswell, an implementation that uses AVX stores or rep stos can go even faster and achieve 32B/cycle.

As per my understanding memset goes byte by byte and sets the value.
The details of what the memset facility does are implementation dependent. Relying on this is usually a good thing, because the I'm sure the implementors have extensive knowledge of the system and know all kind of techniques to make things as fast as possible.
To elaborate a little more, lets look at:
memset(&val, 0, 8);
When the compiler sees this it can notice a few things like:
The fill value is 0
The number of bytes to fill is 8
and then choose the right instructions to use depending on where val or &val is (in a register, in memory, ...). But if memset is stuck needing to be a function call (like your implementations), none of those optimizations are possible. Even if it can't make compile time decisions like:
memset(&val, x, y); // no way to tell at compile time what x and y will be...
you can be assured that there's a function call written in assembler that will be as fast as possible for your platform.

I think it's worth exploring how to write a faster memset particularly with GCC (which I assume you are using with Scientific Linux 6) in C/C++. Many people assume the standard implementation is optimized. This is not necessarily true. If you see table 2.1 of Agner Fog's Optimizing Software in C++ manuals he compares memcpy for for several different compilers and platforms to his own assembly optimized version of memcpy. Memcpy in GCC at the time really underperformed (but the Mac version was good). He claims the built in functions are even worse and recommends using -no-builtin. GCC in my experience is very good at optimizing code but its library functions (and built in functions) are not very optimized (with ICC it's the other way around).
It would be interesting to see how good you could do using intrinsics. If you look at his asmlib you can see how he implements memset with SSE and AVX (it would be interesting to compare this to the Apple's optimized version Stephen Canon posted).
With AVX you can see he writes 32 bytes at a time.
K100: ; Loop through 32-bytes blocks. Register use is swapped
; Rcount = end of 32-bytes blocks part
; Rdest = negative index from the end, counting up to zero
vmovaps [Rcount+Rdest], ymm0
add Rdest, 20H
jnz K100
vmovaps in this case is the same as the intrinsic _mm256_store_ps. Maybe GCC has improved since then but you might be able to beat GCC's implementation of memset using intrinsics. If you don't have AVX you certainly have SSE (all x86 64bit do) so you could look at the SSE version of his code to see what you could do.
Here is a start for your memset32 funcion assuming the array fits in the L1 cache. If the array does not fit in the cache you want to do a non temporal store with _mm256_stream_ps. For a general function you need several cases also including cases when the memory is not 32 byte aligned.
#include <immintrin.h>
int main() {
int count = (1<<14)/sizeof(int);
int* dest = (int*)_mm_malloc(sizeof(int)*count, 32); // 32 byte aligned
int val = 0xDEADBEEFDEADBEEF;
__m256 val8 = _mm256_castsi256_ps(_mm256_set1_epi32(val));
for(int i=0; i<count; i+=8) {
_mm256_store_ps((float*)(dest+i), val8);
}
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight