While learning about ARM NEON intrinsics, I was timing a function that I wrote to double the elements in an array. The version that uses the intrinsics takes more time than a plain C version of the function.
Without NEON:
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int loop;
for( loop= 0; loop<size; loop++)
ptr[loop]<<=1;
return;
}
With NEON:
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int i;
uint32x4_t Q0,vector128Output;
for( i=0;i<(SIZE/4);i++)
{
Q0=vld1q_u32(ptr);
Q0=vaddq_u32(Q0,Q0);
vst1q_u32(ptr,Q0);
ptr+=4;
}
return;
}
I'm wondering whether the load/store operations between the array and the vector registers consume enough time to offset the benefit of the parallel addition.
UPDATE: More info in response to Igor's reply.
1. The code is posted here:
plain.c
plain.s
neon.c
neon.s
From the section (label) L7 in both assembly listings, I see that the NEON version has a larger number of assembly instructions (hence more time taken?).
2. I compiled using -mfpu=neon on arm-gcc, with no other flags or optimizations. For the plain version, no compiler flags at all.
3. That was a typo; SIZE was meant to be size. Both are the same.
4, 5. Tried on an array of 4000 elements. I timed using gettimeofday() before and after the function call. NEON = 230 us, ordinary = 155 us.
6. Yes, I printed the elements in each case.
7. Did this; no improvement whatsoever.
Something like this might run a bit faster.
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int i;
uint32x4_t Q0,Q1,Q2,Q3;
for( i=0;i<(SIZE/16);i++)
{
Q0=vld1q_u32(ptr);
Q1=vld1q_u32(ptr+4);
Q0=vaddq_u32(Q0,Q0);
Q2=vld1q_u32(ptr+8);
Q1=vaddq_u32(Q1,Q1);
Q3=vld1q_u32(ptr+12);
Q2=vaddq_u32(Q2,Q2);
vst1q_u32(ptr,Q0);
Q3=vaddq_u32(Q3,Q3);
vst1q_u32(ptr+4,Q1);
vst1q_u32(ptr+8,Q2);
vst1q_u32(ptr+12,Q3);
ptr+=16;
}
return;
}
There are a few problems with the original code (some of these the optimizer may fix, others it may not; you need to verify in the generated code):
The result of the add is only available in the N3 stage of the NEON pipeline so the following store will stall.
Assuming the compiler is not unrolling the loop there may be some overhead associated with the loop/branch.
It doesn't take advantage of the ability to dual issue load/store with another NEON instruction.
If the source data isn't in cache then the loads would stall. You can preload the data to speed this up with the __builtin_prefetch intrinsic.
Also as others have pointed out the operation is fairly trivial, you'll see more gains for more complex operations.
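To illustrate the prefetch point, here is a minimal, untested sketch of the same intrinsics loop with __builtin_prefetch added; the prefetch distance of 64 elements (four cache lines ahead) is an arbitrary assumption that would need tuning.
#include <arm_neon.h>

/* Sketch only: the original intrinsics loop plus a prefetch hint. */
void double_elements_prefetch(unsigned int *ptr, unsigned int size)
{
    unsigned int i;
    for (i = 0; i < size / 4; i++)
    {
        __builtin_prefetch(ptr + 64);      /* hint a later chunk into cache */
        uint32x4_t q0 = vld1q_u32(ptr);
        q0 = vaddq_u32(q0, q0);
        vst1q_u32(ptr, q0);
        ptr += 4;
    }
}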
If you were to write this with inline assembly you could also:
Use the aligned loads/stores (which I don't think the intrinsics can generate) and ensure your pointer is always 128-bit aligned, e.g. vld1.32 {q0}, [r1 :128]
You could also use the post-increment version (which I'm also not sure the intrinsics will generate), e.g. vld1.32 {q0}, [r1 :128]!
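A rough sketch of what that could look like from C with GCC inline assembly, assuming ptr is 16-byte aligned and size is a multiple of 4; it is untested and only illustrates the :128 alignment qualifier and the post-increment form (it also uses a shift rather than an add to double).
/* Sketch: aligned, post-incrementing NEON load/shift/store in inline asm.
   Assumes ptr is 16-byte aligned and size is a multiple of 4. */
void double_elements_asm(unsigned int *ptr, unsigned int size)
{
    unsigned int i;
    for (i = 0; i < size / 4; i++)
    {
        __asm__ __volatile__(
            "vld1.32 {q0}, [%0 :128]    \n\t"   /* aligned load              */
            "vshl.u32 q0, q0, #1        \n\t"   /* double via shift left     */
            "vst1.32 {q0}, [%0 :128]!   \n\t"   /* aligned store, post-inc   */
            : "+r"(ptr)
            :
            : "q0", "memory");
    }
}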
95us for 4000 elements sounds pretty slow; on a 1GHz processor that's ~95 cycles per 128-bit chunk. You should be able to do better, assuming you're working from the cache. This figure is about what you'd expect if you're bound by the speed of the external memory.
The question is rather vague and you didn't provide much info but I'll try to give you some pointers.
You won't know for sure what's going on until you look at the assembly. Use -S, Luke!
You didn't specify the compiler settings. Are you using optimizations? Loop unrolling?
First function uses size, second uses SIZE, is this intentional? Are they the same?
What is the size of the array you tried? I don't expect NEON to help at all for a couple of elements.
What is the speed difference? Several percents? Couple of orders of magnitude?
Did you check that the results are the same? Are you sure the code is equivalent?
You're using the same variable for intermediate result. Try storing the result of the addition in another variable, that could help (though I expect the compiler will be smart and allocate a different register). Also, you could try using shift (vshl_n_u32) instead of the addition.
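A minimal sketch of those two suggestions together; note that for uint32x4_t the matching 128-bit intrinsic is vshlq_n_u32 (vshl_n_u32 is the 64-bit, two-lane form).
#include <arm_neon.h>

/* Sketch: separate output variable and a shift instead of an add. */
void double_elements_shift(unsigned int *ptr, unsigned int size)
{
    unsigned int i;
    for (i = 0; i < size / 4; i++)
    {
        uint32x4_t in  = vld1q_u32(ptr);
        uint32x4_t out = vshlq_n_u32(in, 1);   /* x << 1 == 2*x */
        vst1q_u32(ptr, out);
        ptr += 4;
    }
}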
Edit: thanks for the answers. I've looked a bit around and found this discussion, which says (emphasis mine):
Moving data from NEON to ARM registers in Cortex-A8 is expensive, so NEON in Cortex-A8 is best used for large blocks of work with little ARM pipeline interaction.
In your case there's no NEON to ARM conversion but only loads and stores. Still, it seems that the savings in parallel operation are eaten up by the non-NEON parts. I would expect better results in code which does many things while in NEON, e.g. color conversions.
Process bigger quantities per instruction, interleave loads/stores, and interleave usage. This function currently doubles (shifts left) 56 uints.
void shiftleft56(const unsigned int* input, unsigned int* output)
{
__asm__ (
"vldm %0!, {q2-q8}\n\t"
"vldm %0!, {q9-q15}\n\t"
"vshl.u32 q0, q2, #1\n\t"
"vshl.u32 q1, q3, #1\n\t"
"vshl.u32 q2, q4, #1\n\t"
"vshl.u32 q3, q5, #1\n\t"
"vshl.u32 q4, q6, #1\n\t"
"vshl.u32 q5, q7, #1\n\t"
"vshl.u32 q6, q8, #1\n\t"
"vshl.u32 q7, q9, #1\n\t"
"vstm %1!, {q0-q6}\n\t"
// "vldm %0!, {q0-q6}\n\t" if you want to overlap...
"vshl.u32 q8, q10, #1\n\t"
"vshl.u32 q9, q11, #1\n\t"
"vshl.u32 q10, q12, #1\n\t"
"vshl.u32 q11, q13, #1\n\t"
"vshl.u32 q12, q14, #1\n\t"
"vshl.u32 q13, q15, #1\n\t"
// lost cycle here unless you overlap
"vstm %1!, {q7-q13}\n\t"
: "=r"(input), "=r"(output) : "0"(input), "1"(output)
: "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",
"q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15", "memory" );
}
What's important to remember for Neon optimization... It has two pipelines, one for load/stores (with a 2 instruction queue - one pending and one running - typically taking 3-9 cycles each), and one for arithmetical operations (with a 2 instruction pipeline, one executing and one saving its results). As long as you keep these two pipelines busy and interleave your instructions, it will work really fast. Even better, if you have ARM instructions, as long as you stay in registers, it will never have to wait for NEON to be done, they will be executed at the same time (up to 8 instructions in cache)! So you can put up some basic loop logic in ARM instructions, and they'll be executed simultaneously.
Your original code was also only using one register value out of four (a q register holds four 32-bit values). Three of them were getting a doubling operation for no apparent reason, so you were four times as slow as you could have been.
What would be better in this code is to process the iterations in an overlapped fashion: for this loop, add the vldm %0!, {q2-q8} right after the vstm %1!... and so on. You can also see that I wait one more instruction before storing a result, so the pipes are never waiting for something else. Finally, note the !: it means post-increment, so it reads/writes the value and then increments the pointer in the register automatically. I suggest you don't use that register in ARM code, so it won't hang its own pipelines... keep your registers separated, and keep a redundant count variable on the ARM side.
Last part... what I said might be true, but not always. It depends on the current NEON revision you have. Timing might change in the future, or might not always have been like that. It works for me, YMMV.
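For completeness, a small hypothetical driver showing how shiftleft56 might be applied to a larger buffer; the tail that isn't a multiple of 56 elements still needs a scalar loop.
void shiftleft56(const unsigned int* input, unsigned int* output);   /* defined above */

/* Hypothetical driver: 56-element blocks through the asm routine, scalar tail. */
void shiftleft_buffer(const unsigned int *in, unsigned int *out, unsigned int count)
{
    unsigned int i = 0;
    for (; i + 56 <= count; i += 56)
        shiftleft56(in + i, out + i);
    for (; i < count; i++)             /* leftover elements */
        out[i] = in[i] << 1;
}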
Whiskey Lake i7-8565U
I'm trying to learn how to write single-shot benchmarks by hand (without using any benchmarking framework), using a memory copy routine with regular and non-temporal writes to WB memory as the example, and I would like to ask for some sort of review.
Declaration:
void *avx_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
void *avx_nt_memcpy_forward_llss(void *restrict, const void *restrict, size_t);
Definition:
avx_memcpy_forward_llss:
shr rdx, 0x3
xor rcx, rcx
avx_memcpy_forward_loop_llss:
vmovdqa ymm0, [rsi + 8*rcx]
vmovdqa ymm1, [rsi + 8*rcx + 0x20]
vmovdqa [rdi + rcx*8], ymm0
vmovdqa [rdi + rcx*8 + 0x20], ymm1
add rcx, 0x08
cmp rdx, rcx
ja avx_memcpy_forward_loop_llss
ret
avx_nt_memcpy_forward_llss:
shr rdx, 0x3
xor rcx, rcx
avx_nt_memcpy_forward_loop_llss:
vmovdqa ymm0, [rsi + 8*rcx]
vmovdqa ymm1, [rsi + 8*rcx + 0x20]
vmovntdq [rdi + rcx*8], ymm0
vmovntdq [rdi + rcx*8 + 0x20], ymm1
add rcx, 0x08
cmp rdx, rcx
ja avx_nt_memcpy_forward_loop_llss
ret
Benchmark code:
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <immintrin.h>
#include <x86intrin.h>
#include "memcopy.h"
#define BUF_SIZE 128 * 1024 * 1024
_Alignas(64) char src[BUF_SIZE];
_Alignas(64) char dest[BUF_SIZE];
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t));
static inline void cache_flush(char *buf, size_t size);
static inline void generate_data(char *buf, size_t size);
uint64_t run_benchmark(unsigned wa_iteration, void *(*copy_fn)(void *, const void *, size_t)){
generate_data(src, sizeof src);
warmup(4, copy_fn);
cache_flush(src, sizeof src);
cache_flush(dest, sizeof dest);
__asm__ __volatile__("mov $0, %%rax\n cpuid":::"rax", "rbx", "rcx", "rdx", "memory");
uint64_t cycles_start = __rdpmc((1 << 30) + 1);
copy_fn(dest, src, sizeof src);
__asm__ __volatile__("lfence" ::: "memory");
uint64_t cycles_end = __rdpmc((1 << 30) + 1);
return cycles_end - cycles_start;
}
int main(void){
uint64_t single_shot_result = run_benchmark(1024, avx_memcpy_forward_llss);
printf("Core clock cycles = %" PRIu64 "\n", single_shot_result);
}
static inline void warmup(unsigned wa_iterations, void *(*copy_fn)(void *, const void *, size_t)){
while(wa_iterations --> 0){
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
copy_fn(dest, src, sizeof src);
}
}
static inline void generate_data(char *buf, size_t sz){
int fd = open("/dev/urandom", O_RDONLY);
read(fd, buf, sz);
}
static inline void cache_flush(char *buf, size_t sz){
long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* _SC_... is a sysconf name, not the line size itself */
if(line <= 0) line = 64;
for(size_t i = 0; i < sz; i += (size_t)line){
_mm_clflush(buf + i);
}
}
Results:
avx_memcpy_forward_llss median: 44479368 core cycles
UPD: time
real 0m0,217s
user 0m0,093s
sys 0m0,124s
avx_nt_memcpy_forward_llss median: 24053086 core cycles
UPD: time
real 0m0,184s
user 0m0,056s
sys 0m0,128s
UPD: The results were obtained when running the benchmark with taskset -c 1 ./bin
So I got almost a 2x difference in core cycles between the two memory copy routine implementations. I interpret it as: in the case of regular stores to WB memory we have RFO requests competing for bus bandwidth, as specified in IOM/3.6.12 (emphasis mine):
Although the data bandwidth of full 64-byte bus writes due to
non-temporal stores is twice that of bus writes to WB memory,
transferring 8-byte chunks wastes bus request bandwidth and delivers
significantly lower data bandwidth.
QUESTION 1: How do I do benchmark analysis in the case of a single shot? Perf counters do not seem to be useful due to perf startup overhead and warmup iteration overhead.
QUESTION 2: Is such a benchmark correct? I included cpuid at the beginning in order to start measuring with clean CPU resources and avoid stalls due to previous instructions in flight. I added memory clobbers as a compile barrier, and lfence to avoid rdpmc being executed out of order.
Whenever possible, benchmarks should report results in ways that allow as much "sanity-checking" as possible. In this case, a few ways to enable such checks include:
For tests involving main memory bandwidth, results should be presented in units that allow direct comparison with the known peak DRAM bandwidth of the system. For a typical configuration of the Core i7-8565U, this is 2 channels * 8 Bytes/transfer * 2.4 billion transfers/sec = 38.4 GB/s (See also item (6), below.)
For tests that involve transfer of data anywhere in the memory hierarchy, the results should include a clear description of the size of the "memory footprint" (number of distinct cache line addresses accessed times the cache line size) and the number of repetitions of the transfer(s). Your code is easy to read here and the size is completely reasonable for a main memory test.
For any timed test, the absolute time should be included to enable comparison against plausible overheads of timing. Your use of only the CORE_CYCLES_UNHALTED counter makes it impossible to compute the elapsed time directly (though the test is clearly long enough that timing overheads are negligible).
Other important "best practice" principles:
Any test that employs RDPMC instructions must be bound to a single logical processor. Results should be presented in a way that confirms to the reader that such binding was employed. Common ways to enforce such binding in Linux include using the "taskset" or "numactl --physcpubind=[n]" commands, or including an inline call to "sched_setaffinity()" with a single allowed logical processor, or setting an environment variable that causes a runtime library (e.g., OpenMP) to bind the thread to a single logical processor.
When using hardware performance counters, extra care is needed to ensure that all of the configuration data for the counters is available and described correctly. The code above uses RDPMC to read IA32_PERF_FIXED_CTR1, which has an event name of CPU_CLK_UNHALTED. The modifier to the event name depends on the programming of IA32_FIXED_CTR_CTRL (MSR 0x38d) bits 7:4. There is no generally-accepted way of mapping from all possible control bits to event name modifiers, so it is best to provide the complete contents of IA32_FIXED_CTR_CTRL along with the results.
The CPU_CLK_UNHALTED performance counter event is the right one to use for benchmarks of portions of the processor whose behavior scales directly with processor core frequency -- such as instruction execution and data transfers involving only the L1 and L2 caches. Memory bandwidth involves portions of the processor whose performance does not scale directly with processor frequency. In particular, using CPU_CLK_UNHALTED without also forcing fixed-frequency operation makes it impossible to compute the elapsed time (required by (1) and (3) above). In your case, RDTSCP would have been easier than RDPMC -- RDTSC does not require the process to be bound to a single logical processor, it is not influenced by other configuration MSRs, and it allows direct computation of elapsed time in seconds.
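For reference, a minimal sketch of the RDTSCP-based timing suggested here; TSC_HZ is an assumed constant that you would replace with your part's TSC frequency (which is fixed and is not the current core clock).
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define TSC_HZ 1800000000ull   /* assumption: adjust to your part's TSC frequency */

/* Sketch: time one copy with RDTSCP instead of RDPMC. */
static uint64_t time_one_copy(void *(*copy_fn)(void *, const void *, size_t),
                              void *dst, const void *src, size_t n)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);   /* waits for earlier instructions to finish */
    copy_fn(dst, src, n);
    uint64_t t1 = __rdtscp(&aux);
    printf("elapsed: %.6f s\n", (double)(t1 - t0) / (double)TSC_HZ);
    return t1 - t0;
}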
Advanced: For tests involving transfer of data in the memory hierarchy, it is helpful to control for cache contents and the state (clean or dirty) of the cache contents, and to provide explicit descriptions of the "before" and "after" states along with the results. Given the sizes of your arrays, your code should completely fill all levels of the cache with some combination of portions of the source and destination arrays, and then flush all of those addresses, leaving a cache hierarchy that is (almost) completely full of invalid (clean) entries.
Advanced: Using CPUID as a serialization instruction is almost never useful in benchmarking. Although it guarantees ordering, it also takes a long time to execute -- Agner Fog's "Instruction Tables" report it at 100-250 cycles (presumably depending on the input arguments). (Update: Measurements over short intervals are always very tricky. The CPUID instruction has a long and variable execution time, and it is not clear what impact the microcoded implementation has on the internal state of the processor. It may be helpful in specific cases, but it should not be considered as something that is automatically included in benchmarks. For measurements over long intervals, out-of-order processing across the measurement boundaries is negligible, so CPUID is not needed.)
Advanced: Using LFENCE in benchmarks is only relevant if you are measuring at very fine granularity -- less than a few hundred cycles. More notes on this topic at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/
If I assume that your processor was running at its maximum Turbo frequency of 4.6 GHz during the test, then the reported cycle counts correspond to 9.67 milliseconds and 5.23 milliseconds, respectively. Plugging these into a "sanity check" shows:
Assuming that the first case performs one read, one allocate, and one writeback (each 128MiB), the corresponding DRAM traffic rates are 27.8GB/s + 13.9 GB/s = 41.6 GB/s == 108% of peak.
Assuming that the second case performs one read and one streaming store (each 128MiB), the corresponding DRAM traffic rates are 25.7 GB/s + 25.7 GB/s = 51.3 GB/s = 134% of peak.
The failure of these "sanity checks" tells us that the frequency could not have been as high as 4.6 GHz (and was probably no higher than 3.0 GHz), but mostly just points to the need to measure the elapsed time unambiguously....
Your quote from the optimization manual on the inefficiency of streaming stores applies only to cases that cannot be coalesced into full cache line transfers. Your code stores to every element of the output cache lines following "best practice" recommendations (all store instructions writing to the same line are executed consecutively and generating only one stream of stores per loop). It is not possible to completely prevent the hardware from breaking up streaming stores, but in your case it should be extremely rare -- perhaps a few out of a million. Detecting partial streaming stores is a very advanced topic, requiring the use of poorly-documented performance counters in the "uncore" and/or indirect detection of partial streaming stores by looking for elevated DRAM CAS counts (which might be due to other causes). More notes on streaming stores are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/
Consider the following C and (ARM) assembly snippet, which is to be compiled with GCC:
__asm__ __volatile__ (
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
"vmov.f32 q12, #0.0\n\t"
: [data_addr] "+r" (data_addr)
: : "q0", "q12");
for(int n=0; n<10; ++n){
__asm__ __volatile__ (
"vadd.f32 q12, q12, q0\n\t"
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
: [data_addr] "+r" (data_addr)
:: "q0", "q12");
}
In this example, I am initialising some SIMD registers outside the loop and then having C handle the loop logic, with those initialised registers being used inside the loop.
This works in some test code, but I'm concerned about the risk of the compiler clobbering the registers between snippets. Is there any way of ensuring this doesn't happen? Can I infer any assurances about the types of registers that are going to be used in a snippet (in this case, that no SIMD registers will be clobbered)?
In general, there's not a way to do this in gcc; clobbers only guarantee that registers will be preserved around the asm call. If you need to ensure that the registers are saved between two asm sections, you will need to store them to memory in the first, and reload in the second.
Edit: After much fiddling around, I've come to the conclusion that this is much harder to solve in general using the strategy described below than I initially thought.
The problem is that, particularly when all the registers are used, there is nothing to stop the first register stash from overwriting another. Whether there is some trick to play with direct memory writes that can be optimised away I don't know, but initial tests suggest the compiler might still choose to clobber not-yet-stashed registers.
For the time being, and until I have more information, I'm unmarking this answer as correct, and it should be treated as probably wrong in the general case. My conclusion is that such local protection of registers needs better support in the compiler to be useful.
This absolutely is possible to do reliably. Drawing on the comments by @PeterCordes as well as the docs and a couple of useful bug reports (gcc 41538 and 37188), I came up with the following solution.
The point that makes it valid is the use of temporary variables to make sure the registers are maintained (logically, if the loop clobbers them, then they will be reloaded). In practice, the temporary variables are optimised away which is clear from inspection of the resultant asm.
// d0 and d1 map to the first and second values of q0, so we use
// q0 to reduce the number of tmp variables we pass around (instead
// of using one for each of d0 and d1).
register float32x4_t data __asm__ ("q0");
register float32x4_t output __asm__ ("q12");
float32x4_t tmp_data;
float32x4_t tmp_output;
__asm__ __volatile__ (
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
"vmov.f32 %q[output], #0.0\n\t"
: [data_addr] "+r" (data_addr),
[output] "=&w" (output),
"=&w" (data) // we still need to constrain data (q0) as written to.
::);
// Stash the register values
tmp_data = data;
tmp_output = output;
for(int n=0; n<10; ++n){
// Make sure the registers are loaded correctly
output = tmp_output;
data = tmp_data;
__asm__ __volatile__ (
"vadd.f32 %[output], %[output], q0\n\t"
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
: [data_addr] "+r" (data_addr),
[output] "+w" (output),
"+w" (data) // again, data (q0) was written to in the vldmia op.
::);
// Remember to stash the registers again before continuing
tmp_data = data;
tmp_output = output;
}
It's necessary to instruct the compiler that q0 is written to in the last line of each asm output constraint block, so it doesn't think it can reorder the stashing and reloading of the data register resulting in the asm block getting invalid values.
I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's
some guy's implementation:
__forceinline // Since Size is usually known,
// most useless code will be optimized out
// if the function is inlined.
void* myMemcpy(char* Dst, const char* Src, size_t Size)
{
void* start = Dst;
for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
{
__m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
_mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
}
#define CPY_1B *((uint8_t * &)Dst)++ = *((const uint8_t * &)Src)++
#define CPY_2B *((uint16_t* &)Dst)++ = *((const uint16_t* &)Src)++
#define CPY_4B *((uint32_t* &)Dst)++ = *((const uint32_t* &)Src)++
#if defined _M_X64 || defined _M_IA64 || defined __amd64
#define CPY_8B *((uint64_t* &)Dst)++ = *((const uint64_t* &)Src)++
#else
#define CPY_8B _mm_storel_epi64((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const uint64_t* &)Src, ++(uint64_t* &)Dst
#endif
#define CPY16B _mm_storeu_si128((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const __m128i* &)Src, ++(__m128i* &)Dst
switch (Size) {
case 0x00: break;
case 0x01: CPY_1B; break;
case 0x02: CPY_2B; break;
case 0x03: CPY_1B; CPY_2B; break;
case 0x04: CPY_4B; break;
case 0x05: CPY_1B; CPY_4B; break;
case 0x06: CPY_2B; CPY_4B; break;
case 0x07: CPY_1B; CPY_2B; CPY_4B; break;
case 0x08: CPY_8B; break;
case 0x09: CPY_1B; CPY_8B; break;
case 0x0A: CPY_2B; CPY_8B; break;
case 0x0B: CPY_1B; CPY_2B; CPY_8B; break;
case 0x0C: CPY_4B; CPY_8B; break;
case 0x0D: CPY_1B; CPY_4B; CPY_8B; break;
case 0x0E: CPY_2B; CPY_4B; CPY_8B; break;
case 0x0F: CPY_1B; CPY_2B; CPY_4B; CPY_8B; break;
case 0x10: CPY16B; break;
case 0x11: CPY_1B; CPY16B; break;
case 0x12: CPY_2B; CPY16B; break;
case 0x13: CPY_1B; CPY_2B; CPY16B; break;
case 0x14: CPY_4B; CPY16B; break;
case 0x15: CPY_1B; CPY_4B; CPY16B; break;
case 0x16: CPY_2B; CPY_4B; CPY16B; break;
case 0x17: CPY_1B; CPY_2B; CPY_4B; CPY16B; break;
case 0x18: CPY_8B; CPY16B; break;
case 0x19: CPY_1B; CPY_8B; CPY16B; break;
case 0x1A: CPY_2B; CPY_8B; CPY16B; break;
case 0x1B: CPY_1B; CPY_2B; CPY_8B; CPY16B; break;
case 0x1C: CPY_4B; CPY_8B; CPY16B; break;
case 0x1D: CPY_1B; CPY_4B; CPY_8B; CPY16B; break;
case 0x1E: CPY_2B; CPY_4B; CPY_8B; CPY16B; break;
case 0x1F: CPY_1B; CPY_2B; CPY_4B; CPY_8B; CPY16B; break;
}
#undef CPY_1B
#undef CPY_2B
#undef CPY_4B
#undef CPY_8B
#undef CPY16B
return start;
}
The comment translates as "Since Size is usually known, the compiler can optimize out most of the useless code when the function is inlined."
I would like to improve, if possible, on this implementation - but maybe there isn't much to improve. I see it uses SSE/AVX for the larger chunks of memory, then instead of a loop over the last < 32 bytes does the equivalent of manual unrolling, with some tweaking. So, here are my questions:
Why unroll the loop for the last several bytes, but not partially unroll the first (and now single) loop?
What about alignment issues? Aren't they important? Should I handle the first several bytes up to some alignment quantum differently, then perform the 256-bit ops on aligned sequences of bytes? And if so, how do I determine the appropriate alignment quantum?
What's the most important missing feature in this implementation (if any)?
Features/Principles mentioned in the answers so far
You should __restrict__ your parameters. (@chux)
The memory bandwidth is a limiting factor; measure your implementation against it. (@Zboson)
For small arrays, you can expect to approach the memory bandwidth; for larger arrays - not as much. (@Zboson)
Multiple threads (may be | are) necessary to saturate the memory bandwidth. (@Zboson)
It is probably wise to optimize differently for large and small copy sizes. (@Zboson)
(Alignment is important? Not explicitly addressed!)
The compiler should be made more explicitly aware of "obvious facts" it can use for optimization (such as the fact that Size < 32 after the first loop). (@chux)
There are arguments for unrolling your SSE/AVX calls (@BenJackson, here), and arguments against doing so (@PaulR)
Non-temporal transfers (with which you tell the CPU you don't need it to cache the target location) should be useful for copying larger buffers. (@Zboson)
I have been studying measuring memory bandwidth for Intel processors with various operations and one of them is memcpy. I have done this on Core2, Ivy Bridge, and Haswell. I did most of my tests using C/C++ with intrinsics (see the code below - but I'm currently rewriting my tests in assembly).
To write your own efficient memcpy function it's important to know what the absolute best bandwidth possible is. This bandwidth is a function of the size of the arrays which will be copied and therefore an efficient memcpy function needs to optimize differently for small and big (and maybe in between). To keep things simple I have optimized for small arrays of 8192 bytes and large arrays of 1 GB.
For small arrays the maximum read and write bandwidth for each core is:
Core2-Ivy Bridge 32 bytes/cycle
Haswell 64 bytes/cycle
This is the benchmark you should aim for small arrays. For my tests I assume the arrays are aligned to 64-bytes and that the array size is a multiple of 8*sizeof(float)*unroll_factor. Here are my current memcpy results for a size of 8192 bytes (Ubuntu 14.04, GCC 4.9, EGLIBC 2.19):
               GB/s   efficiency
Core2 (P9600 @ 2.66 GHz)
builtin:       35.2   41.3%
eglibc:        39.2   46.0%
asmlib:        76.0   89.3%
copy_unroll1:  39.1   46.0%
copy_unroll8:  73.6   86.5%
Ivy Bridge (E5-1620 @ 3.6 GHz)
builtin:      102.2   88.7%
eglibc:       107.0   92.9%
asmlib:       107.6   93.4%
copy_unroll1: 106.9   92.8%
copy_unroll8: 111.3   96.6%
Haswell (i5-4250U @ 1.3 GHz)
builtin:       68.4   82.2%
eglibc:        39.7   47.7%
asmlib:        73.2   87.6%
copy_unroll1:  39.6   47.6%
copy_unroll8:  81.9   98.4%
The asmlib is Agner Fog's asmlib. The copy_unroll1 and copy_unroll8 functions are defined below.
From this table we can see that the GCC builtin memcpy does not work well on Core2 and that memcpy in EGLIBC does not work well on Core2 or Haswell. I did check out a head version of GLIBC recently and the performance was much better on Haswell. In all cases unrolling gets the best result.
void copy_unroll1(const float *x, float *y, const int n) {
for(int i=0; i<n/JUMP; i++) {
VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
}
}
void copy_unroll8(const float *x, float *y, const int n) {
for(int i=0; i<n/JUMP; i+=8) {
VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
VECNF().LOAD(&x[JUMP*(i+1)]).STORE(&y[JUMP*(i+1)]);
VECNF().LOAD(&x[JUMP*(i+2)]).STORE(&y[JUMP*(i+2)]);
VECNF().LOAD(&x[JUMP*(i+3)]).STORE(&y[JUMP*(i+3)]);
VECNF().LOAD(&x[JUMP*(i+4)]).STORE(&y[JUMP*(i+4)]);
VECNF().LOAD(&x[JUMP*(i+5)]).STORE(&y[JUMP*(i+5)]);
VECNF().LOAD(&x[JUMP*(i+6)]).STORE(&y[JUMP*(i+6)]);
VECNF().LOAD(&x[JUMP*(i+7)]).STORE(&y[JUMP*(i+7)]);
}
}
Where VECNF().LOAD is _mm_load_ps() for SSE or _mm256_load_ps() for AVX, VECNF().STORE is _mm_store_ps() for SSE or _mm256_store_ps() for AVX, and JUMP is 4 for SSE or 8 for AVX.
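For readers who don't have the VECNF wrapper handy, here is roughly what copy_unroll1 corresponds to in plain AVX intrinsics (JUMP == 8), assuming 32-byte-aligned pointers; this is an illustrative sketch, not the code that produced the numbers above.
#include <immintrin.h>

/* Approximate plain-intrinsics expansion of copy_unroll1 for AVX (JUMP == 8).
   Assumes x and y are 32-byte aligned and n is a multiple of 8. */
void copy_unroll1_avx(const float *x, float *y, const int n)
{
    for (int i = 0; i < n / 8; i++) {
        __m256 v = _mm256_load_ps(&x[8 * i]);
        _mm256_store_ps(&y[8 * i], v);
    }
}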
For the large size, the best result is obtained by using non-temporal store instructions and multiple threads. Contrary to what many people may believe, a single thread does NOT usually saturate the memory bandwidth.
void copy_stream(const float *x, float *y, const int n) {
#pragma omp parallel for
for(int i=0; i<n/JUMP; i++) {
VECNF v = VECNF().load_a(&x[JUMP*i]);
stream(&y[JUMP*i], v);
}
}
Where stream is _mm_stream_ps() for SSE or _mm256_stream_ps() for AVX.
Here are the memcpy results on my E5-1620 @ 3.6 GHz with four threads for 1 GB, with a maximum main memory bandwidth of 51.2 GB/s.
GB/s efficiency
eglibc: 23.6 46%
asmlib: 36.7 72%
copy_stream: 36.7 72%
Once again EGLIBC performs poorly. This is because it does not use non-temporal stores.
I modified the eglibc and asmlib memcpy functions to run in parallel like this:
void COPY(const float * __restrict x, float * __restrict y, const int n) {
#pragma omp parallel
{
size_t my_start, my_size;
int id = omp_get_thread_num();
int num = omp_get_num_threads();
my_start = (id*n)/num;
my_size = ((id+1)*n)/num - my_start;
memcpy(y+my_start, x+my_start, sizeof(float)*my_size);
}
}
A general memcpy function needs to account for arrays which are not aligned to 64 bytes (or even to 32 or 16 bytes) and where the size is not a multiple of 32 bytes or the unroll factor. Additionally, a decision has to be made as to when to use non-temporal stores. The general rule of thumb is to only use non-temporal stores for sizes larger than half the largest cache level (usually L3). But these are "second order" details which I think should be dealt with after optimizing for the ideal cases of large and small. There's not much point in worrying about correcting for misalignment or non-ideal size multiples if the ideal case performs poorly as well.
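As a sketch of that rule of thumb, a size-based dispatcher might look like the following; LLC_BYTES is an assumed constant (it could instead be queried at startup), and copy_unroll8/copy_stream are the routines defined above.
#include <stddef.h>

#define LLC_BYTES (8u * 1024u * 1024u)   /* assumption: 8 MiB last-level cache */

void copy_unroll8(const float *x, float *y, const int n);   /* defined above */
void copy_stream(const float *x, float *y, const int n);    /* defined above */

/* Sketch: choose non-temporal stores only for copies bigger than half the LLC. */
void copy_dispatch(const float *x, float *y, const int n)
{
    if ((size_t)n * sizeof(float) > LLC_BYTES / 2)
        copy_stream(x, y, n);     /* streaming (non-temporal) path */
    else
        copy_unroll8(x, y, n);    /* cached path */
}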
Update
Based on comments by Stephen Canon I have learned that on Ivy Bridge and Haswell it's more efficient to use rep movsb than movntdq (a non-temporal store instruction). Intel calls this enhanced rep movsb (ERMSB). This is described in the Intel optimization manuals in section 3.7.6, Enhanced REP MOVSB and STOSB operation (ERMSB).
Additionally, in Agner Fog's Optimizing Subroutines in Assembly manual in section 17.9 Moving blocks of data (All processors) he writes:
"There are several ways of moving large blocks of data. The most common methods are:
1. REP MOVS instruction.
2. If data are aligned: Read and write in a loop with the largest available register size.
3. If size is constant: inline move instructions.
4. If data are misaligned: First move as many bytes as required to make the destination aligned. Then read unaligned and write aligned in a loop with the largest available register size.
5. If data are misaligned: Read aligned, shift to compensate for misalignment and write aligned.
6. If the data size is too big for caching, use non-temporal writes to bypass the cache. Shift to compensate for misalignment, if necessary."
A general memcpy should consider each of these points. Additionally, with Ivy Bridge and Haswell it seems that point 1 is better than point 6 for large arrays. Different techniques are necessary for Intel and AMD and for each iteration of technology. I think it's clear that writing your own general efficient memcpy function can be quite complicated. But in the special cases I have looked at I have already managed to do better than the GCC builtin memcpy or the one in EGLIBC, so the assumption that you can't do better than the standard libraries is incorrect.
The question can't be answered precisely without some additional details such as:
What is the target platform (CPU architecture, mostly, but memory configuration plays a role too)?
What is the distribution and predictability1 of the copy lengths (and to a lesser extent, the distribution and predictability of alignments)?
Will the copy size ever be statically known at compile-time?
Still, I can point out a couple things that are likely to be sub-optimal for at least some combination of the above parameters.
32-case Switch Statement
The 32-case switch statement is a cute way of handling the trailing 0 to 31 bytes, and likely benchmarks very well - but may perform badly in the real world due to at least two factors.
Code Size
This switch statement alone takes several hundred bytes of code for the body, in addition to a 32-entry lookup table needed to jump to the correct location for each length. The cost of this isn't going to show up in a focused benchmark of memcpy on a full-sized CPU because everything still fits in the fastest cache level: but in the real world you execute other code too and there is contention for the uop cache and L1 data and instruction caches.
That many instructions may take fully 20% of the effective size of your uop cache3, and uop cache misses (and the corresponding cache-to-legacy encoder transition cycles) could easily wipe the small benefit given by this elaborate switch.
On top of that, the switch requires a 32-entry, 256-byte lookup table for the jump targets4. If you ever get a miss to DRAM on that lookup, you are talking about a penalty of 150+ cycles: how many non-misses do you need then to make the switch worth it, given it's probably saving a cycle or two at most? Again, that won't show up in a microbenchmark.
For what it's worth, this memcpy isn't unusual: that kind of "exhaustive enumeration of cases" is common even in optimized libraries. I can conclude that either their development was driven mostly by microbenchmarks, or that it is still worth it for a large slice of general-purpose code, despite the downsides. That said, there are certainly scenarios (instruction and/or data cache pressure) where this is suboptimal.
Branch Prediction
The switch statement relies on a single indirect branch to choose among the alternatives. This is going to be efficient to the extent that the branch predictor can predict this indirect branch, which basically means that the sequence of observed lengths needs to be predictable.
Because it is an indirect branch, there are more limits on the predictability of the branch than for a conditional branch, since there is a limited number of BTB entries. Recent CPUs have made strides here, but it is safe to say that if the series of lengths fed to memcpy doesn't follow a simple repeating pattern of a short period (as short as 1 or 2 on older CPUs), there will be a branch mispredict on each call.
This issue is particularly insidious because it is likely to hurt you the most in the real world exactly in the situations where a microbenchmark shows the switch to be the best: short lengths. For very long lengths, the behavior on the trailing 31 bytes isn't very important since it is dominated by the bulk copy. For short lengths, the switch is all-important (indeed, for copies of 31 bytes or less it is all that executes)!
For these short lengths, a predictable series of lengths works very well for the switch since the indirect jump is basically free. In particular, a typical memcpy benchmark "sweeps" over a series of lengths, using the same length repeatedly for each sub-test to report the results for easy graphing of "time vs length" graphs. The switch does great on these tests, often reporting results like 2 or 3 cycles for small lengths of a few bytes.
In the real world, your lengths might be small but unpredictable. In that case, the indirect branch will frequently mispredict5, with a penalty of ~20 cycles on modern CPUs. Compared to the best case of a couple of cycles, it is an order of magnitude worse. So the glass jaw here can be very serious (i.e., the behavior of the switch in this typical case can be an order of magnitude worse than the best, whereas at long lengths, you are usually looking at a difference of 50% at most between different strategies).
Solutions
So how can you do better than the above, at least under the conditions where the switch falls apart?
Use Duff's Device
One solution to the code size issue is to combine the switch cases together, Duff's device style.
For example, the assembled code for the length 1, 3 and 7 cases looks like:
Length 1
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
ret
Length 3
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
movzx edx, WORD PTR [rsi+1]
mov WORD PTR [rcx+1], dx
Length 7
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
movzx edx, WORD PTR [rsi+1]
mov WORD PTR [rcx+1], dx
mov edx, DWORD PTR [rsi+3]
mov DWORD PTR [rcx+3], edx
ret
This can be combined into a single case, with various jump-ins:
len7:
mov edx, DWORD PTR [rsi-6]
mov DWORD PTR [rcx-6], edx
len3:
movzx edx, WORD PTR [rsi-2]
mov WORD PTR [rcx-2], dx
len1:
movzx edx, BYTE PTR [rsi]
mov BYTE PTR [rcx], dl
ret
The labels don't cost anything, and they combine the cases together and remove two out of three ret instructions. Note that the basis for rsi and rcx has changed here: they point to the last byte to copy from/to, rather than the first. That change is free or very cheap depending on the code before the jump.
You can extend that to longer lengths (e.g., you can attach lengths 15 and 31 to the chain above), and use other chains for the missing lengths. The full exercise is left to the reader. You can probably get a 50% size reduction from this approach alone, and much better if you combine it with something else to collapse the sizes from 16 to 31.
This approach only helps with the code size (and possibly the jump table size, if you shrink it as described in 4 and get under 256 bytes, allowing a byte-sized lookup table). It does nothing for predictability.
Overlapping Stores
One trick that helps for both code size and predictability is to use overlapping stores. That is, memcpy of 8 to 15 bytes can be accomplished in a branch-free way with two 8-byte stores, with the second store partly overlapping the first. For example, to copy 11 bytes, you would do an 8-byte copy at relative position 0 and 11 - 8 == 3. Some of the bytes in the middle would be "copied twice", but in practice this is fine since an 8-byte copy is the same speed as a 1, 2 or 4-byte one.
The C code looks like:
if (Size >= 8) {
*((uint64_t*)Dst) = *((const uint64_t*)Src);
size_t offset = Size & 0x7;
*(uint64_t *)(Dst + offset) = *(const uint64_t *)(Src + offset);
}
... and the corresponding assembly is not problematic:
cmp rdx, 7
jbe .L8
mov rcx, QWORD PTR [rsi]
and edx, 7
mov QWORD PTR [rdi], rcx
mov rcx, QWORD PTR [rsi+rdx]
mov QWORD PTR [rdi+rdx], rcx
In particular, note that you get exactly two loads, two stores and one and (in addition to the cmp and jmp whose existence depends on how you organize the surrounding code). That's already tied or better than most of the compiler-generated approaches for 8-15 bytes, which might use up to 4 load/store pairs.
Older processors suffered some penalty for such "overlapping stores", but newer architectures (the last decade or so, at least) seem to handle them without penalty6. This has two main advantages:
The behavior is branch free for a range of sizes. Effectively, this quantizes the branching so that many values take the same path. All sizes from 8 to 15 (or 8 to 16 if you want) take the same path and suffer no misprediction pressure.
At least 8 or 9 different cases from the switch are subsumed into a single case with a fraction of the total code size.
This approach can be combined with the switch approach, but using only a few cases, or it can be extended to larger sizes with conditional moves that could do, for example, all moves from 8 to 31 bytes without branches.
What works out best again depends on the branch distribution, but overall this "overlapping" technique works very well.
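To illustrate how the overlap trick extends beyond 8-15 bytes, here is a sketch covering 16-31 bytes with two (possibly overlapping) 16-byte SSE moves; the function and variable names are mine, not the original author's.
#include <stddef.h>
#include <emmintrin.h>

/* Sketch: branch-free copy of any size in [16, 31] using two overlapping 16-byte moves. */
static void copy_16_to_31(char *dst, const char *src, size_t size)
{
    __m128i head = _mm_loadu_si128((const __m128i *)src);
    __m128i tail = _mm_loadu_si128((const __m128i *)(src + size - 16));
    _mm_storeu_si128((__m128i *)dst, head);
    _mm_storeu_si128((__m128i *)(dst + size - 16), tail);   /* overlaps the first store */
}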
Alignment
The existing code doesn't address alignment.
In fact, it isn't, in general, legal C or C++, since the char * pointers are simply cast to larger types and dereferenced, which is not legal - although in practice it generates code that works on today's x86 compilers (but would in fact fail on platforms with stricter alignment requirements).
Beyond that, it is often better to handle the alignment specifically. There are three main cases:
The source and destination are already aligned. Even the original algorithm will work fine here.
The source and destination are relatively aligned, but absolutely misaligned. That is, there is a value A that can be added to both the source and destination such that both are aligned.
The source and destination are fully misaligned (i.e., they are not actually aligned and case (2) does not apply).
The existing algorithm will work OK in case (1). It is potentially missing a large optimization in case (2), since a small intro loop could turn an unaligned copy into an aligned one.
It is also likely performing poorly in case (3), since in general in the totally misaligned case you can choose to align either the destination or the source and then proceed "semi-aligned".
The alignment penalties have been getting smaller over time and on the most recent chips are modest for general purpose code but can still be serious for code with many loads and stores. For large copies, it probably doesn't matter too much since you'll end up DRAM bandwidth limited, but for smaller copies misalignment may reduce throughput by 50% or more.
If you use NT stores, alignment can also be important, because many of the NT store instructions perform poorly with misaligned arguments.
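Here is a sketch of the "intro loop" idea for case (2) (or for aligning just the destination in case (3)); copy_head is a hypothetical small-copy helper, not part of the original code.
#include <stddef.h>
#include <stdint.h>

void copy_head(char *dst, const char *src, size_t n);   /* hypothetical 1..31-byte copier */

/* Sketch: peel off enough bytes to make the destination 32-byte aligned,
   so the main loop can then use aligned 32-byte stores. */
static void align_destination(char **Dst, const char **Src, size_t *Size)
{
    size_t mis = (uintptr_t)*Dst & 31;
    if (mis != 0 && *Size >= 32) {
        size_t head = 32 - mis;          /* 1..31 bytes to reach the boundary */
        copy_head(*Dst, *Src, head);
        *Dst  += head;
        *Src  += head;
        *Size -= head;
    }
}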
No unrolling
The code is not unrolled, and compilers unroll by different amounts by default. Clearly this is suboptimal, since among two compilers with different unroll strategies, at most one will be best.
The best approach (at least for known platform targets) is to determine which unroll factor is best, and then apply that in the code.
Furthermore, the unrolling can often be combined in a smart way with the "intro" or "outro" code, doing a better job than the compiler could.
Known sizes
The primary reason that it is tough to beat the "builtin" memcpy routine with modern compilers is that compilers don't just call a library memcpy whenever memcpy appears in the source. They know the contract of memcpy and are free to implement it with a single inlined instruction, or even less7, in the right scenario.
This is especially obvious with known lengths in memcpy. In this case, if the length is small, compilers will just insert a few instructions to perform the copy efficiently and in-place. This not only avoids the overhead of the function call, but all the checks about size and so on - and also generates at compile time efficient code for the copy, much like the big switch in the implementation above - but without the costs of the switch.
Similarly, the compiler knows a lot about the alignment of structures in the calling code, and can create code that deals efficiently with alignment.
If you just implement a memcpy2 as a library function, that is tough to replicate. You can get part of the way there by splitting the method into a small and a big part: the small part appears in the header file, does some size checks, and potentially just calls the existing memcpy if the size is small or delegates to the library routine if it is large. Through the magic of inlining, you might get to the same place as the builtin memcpy.
Finally, you can also try tricks with __builtin_constant_p or equivalents to handle the small, known case efficiently.
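A sketch of that last trick; my_memcpy_big is a hypothetical name standing in for the out-of-line routine, and the 64-byte threshold is an arbitrary assumption.
#include <stddef.h>
#include <string.h>

void *my_memcpy_big(void *dst, const void *src, size_t n);   /* hypothetical out-of-line copy */

/* Sketch: in a header, let the compiler expand known-small copies itself and
   only call the heavyweight routine for the general case. */
static inline void *my_memcpy(void *dst, const void *src, size_t n)
{
    if (__builtin_constant_p(n) && n <= 64)
        return memcpy(dst, src, n);       /* compiler inlines this for constant small n */
    return my_memcpy_big(dst, src, n);
}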
1 Note that I'm drawing a distinction here between the "distribution" of sizes - e.g., you might say "uniformly distributed between 8 and 24 bytes" - and the "predictability" of the actual sequence of sizes (e.g., do the sizes have a predictable pattern?). The question of predictability is somewhat subtle because it depends on the implementation, since as described above certain implementations are inherently more predictable.
2 In particular, ~750 bytes of instructions in clang and ~600 bytes in gcc for the body alone, on top of the 256-byte jump lookup table for the switch body which had 180 - 250 instructions (gcc and clang respectively). Godbolt link.
3 Basically 200 fused uops out of an effective uop cache size of 1000 instructions. While recent x86 have had uop cache sizes around ~1500 uops, you can't use it all outside of extremely dedicated padding of your codebase because of the restrictive code-to-cache assignment rules.
4 The switch cases have different compiled lengths, so the jump can't be directly calculated. For what it's worth, it could have been done differently: they could have used a 16-bit value in the lookup table at the cost of not using memory-source for the jmp, cutting its size by 75%.
5 Unlike conditional branch prediction, which has a typical worst-case prediction rate of ~50% (for totally random branches), a hard-to-predict indirect branch can easily approach 100% since you aren't flipping a coin, you are choosing from an almost infinite set of branch targets. This happens in the real world: if memcpy is being used to copy small strings with lengths uniformly distributed between 0 and 30, the switch code will mispredict ~97% of the time.
6 Of course, there may be penalties for misaligned stores, but these are also generally small and have been getting smaller.
7 For example, a memcpy to the stack, followed by some manipulation and a copy somewhere else may be totally eliminated, directly moving the original data to its final location. Even things like malloc followed by memcpy can be totally eliminated.
Firstly the main loop uses unaligned AVX vector loads/stores to copy 32 bytes at a time, until there are < 32 bytes left to copy:
for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
{
__m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
_mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
}
Then the final switch statement handles the residual 0..31 bytes in as efficient a manner as possible, using a combination of 8/4/2/1 byte copies as appropriate. Note that this is not an unrolled loop - it's just 32 different optimised code paths which handle the residual bytes using the minimum number of loads and stores.
As for why the main 32 byte AVX loop is not manually unrolled - there are several possible reasons for this:
most compilers will unroll small loops automatically (depending on loop size and optimisation switches)
excessive unrolling can cause small loops to spill out of the LSD cache (typically only 28 decoded µops)
on current Core iX CPUs you can only issue two concurrent loads/stores before you stall [*]
typically even a non-unrolled AVX loop like this can saturate available DRAM bandwidth [*]
[*] note that the last two comments above apply to cases where source and/or destination are not in cache (i.e. writing/reading to/from DRAM), and therefore load/store latency is high.
Taking Benefits of The ERMSB
Please also consider using REP MOVSB for larger blocks.
As you know, since the first Pentium CPU, produced in 1993, Intel began to make simple commands faster and complex commands (like REP MOVSB) slower. So REP MOVSB became very slow, and there was no longer any reason to use it. In 2013, Intel decided to revisit REP MOVSB. If the CPU has the CPUID ERMSB (Enhanced REP MOVSB) bit, then REP MOVSB commands are executed differently than on older processors, and are supposed to be fast. In practice, it is only fast for large blocks, 256 bytes and larger, and only when certain conditions are met:
both the source and destination addresses have to be aligned to a 16-Byte boundary;
the source region should not overlap with the destination region;
the length has to be a multiple of 64 to produce higher performance;
the direction has to be forward (CLD).
See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Intel recommends using AVX for blocks smaller than 2048 bytes. For larger blocks, Intel recommends using REP MOVSB. This is because of the high initial startup costs of REP MOVSB (about 35 cycles).
I have done speed tests, and for blocks of 2048 bytes and larger, the performance of REP MOVSB is unbeatable. However, for blocks smaller than 256 bytes, REP MOVSB is very slow, even slower than plain MOV RAX back and forth in a loop.
Please note that ERMSB only affects MOVSB, not MOVSD (MOVSQ), so MOVSB is a little bit faster than MOVSD (MOVSQ).
So, you can use AVX for your memcpy() implementation, and if the block is larger than 2048 bytes and all the conditions are met, then call REP MOVSB - this way your memcpy() implementation will be unbeatable.
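A hedged sketch of that dispatch; the 2048-byte threshold is the figure quoted above, and avx_copy is a placeholder name for whatever AVX loop you use.
#include <stddef.h>

void avx_copy(void *dst, const void *src, size_t n);   /* your AVX path (hypothetical name) */

/* Sketch: REP MOVSB via inline asm for big blocks, AVX loop otherwise. */
static void *memcpy_ermsb_dispatch(void *dst, const void *src, size_t n)
{
    if (n >= 2048) {
        void *d = dst;
        const void *s = src;
        size_t c = n;
        __asm__ __volatile__("rep movsb"      /* copies rcx bytes from [rsi] to [rdi] */
                             : "+D"(d), "+S"(s), "+c"(c)
                             : : "memory");
    } else {
        avx_copy(dst, src, n);
    }
    return dst;
}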
Taking Benefits of The Out-of-Order Execution Engine
You can also read about The Out-of-Order Execution Engine
in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf, section 2.1.2, and take advantage of it.
For example, the Intel Skylake processor series (launched in 2015) has:
4 execution units for the Arithmetic logic unit (ALU) (add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup),
3 execution units for Vector ALU ( (v)pand, (v)por, (v)pxor, (v)movq, (v)movq, (v)movap*, (v)movup*, (v)andp*, (v)orp*, (v)paddb/w/d/q, (v)blendv*, (v)blendp*, (v)pblendd)
So we can occupy the above units (3+4) in parallel if we use register-only operations. We cannot use 3+4 instructions in parallel for memory copy. We can simultaneously use at most two 32-byte load instructions and one 32-byte store instruction, even when working from the Level-1 cache.
Please see the Intel manual again to understand how to do the fastest memcpy implementation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Section 2.2.2 (The Out-of-Order Engine of the Haswell microarchitecture): "The Scheduler controls the dispatch of micro-ops onto the dispatch ports. There are eight dispatch ports to support the out-of-order execution core. Four of the eight ports provided execution resources for computational operations. The other 4 ports support memory operations of up to two 256-bit load and one 256-bit store operation in a cycle."
Section 2.2.4 (Cache and Memory Subsystem) has the following note: "First level data cache supports two load micro-ops each cycle; each micro-op can fetch up to 32-bytes of data."
Section 2.2.4.1 (Load and Store Operation Enhancements) has the following information: The L1 data cache can handle two 256-bit (32 bytes) load and one 256-bit (32 bytes) store operations each cycle. The unified L2 can service one cache line (64 bytes) each cycle. Additionally, there are 72 load buffers and 42 store buffers available to support micro-ops execution in-flight.
The other sections (2.3 and so on, dedicated to Sandy Bridge and other microarchitectures) basically reiterate the above information.
The section 2.3.4 (The Execution Core) gives additional details.
The scheduler can dispatch up to six micro-ops every cycle, one on each port. The following table summarizes which operations can be dispatched on which port.
Port 0: ALU, Shift, Mul, STTNI, Int-Div, 128b-Mov, Blend, 256b-Mov
Port 1: ALU, Fast LEA, Slow LEA, MUL, Shuf, Blend, 128bMov, Add, CVT
Port 2 & Port 3: Load_Addr, Store_addr
Port 4: Store_data
Port 5: ALU, Shift, Branch, Fast LEA, Shuf, Blend, 128b-Mov, 256b-Mov
The section 2.3.5.1 (Load and Store Operation Overview) may also be useful to understand on how to make fast memory copy, as well as the section 2.4.4.1 (Loads and Stores).
For the other processor architectures, it is again - two load units and one store unit. Table 2-4 (Cache Parameters of the Skylake Microarchitecture) has the following information:
Peak Bandwidth (bytes/cyc):
First Level Data Cache: 96 bytes (2x32B Load + 1*32B Store)
Second Level Cache: 64 bytes
Third Level Cache: 32 bytes.
I have also done speed tests on my Intel Core i5 6600 CPU (Skylake, 14nm, released in September 2015) with DDR4 memory, and this has confirmed the theory. For example, my tests have shown that using generic 64-bit registers for memory copy, even many registers in parallel, degrades performance. Also, using just 2 XMM registers is enough - adding a 3rd doesn't add performance.
If your CPU has AVX CPUID bit, you may take benefits of the large, 256-bit (32 byte) YMM registers to copy memory, to occupy two full load units. The AVX support was first introduced by Intel with the Sandy Bridge processors, shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011.
// first cycle
vmovdqa ymm0, ymmword ptr [rcx+0] // load 1st 32-byte part using first load unit
vmovdqa ymm1, ymmword ptr [rcx+20h] // load 2nd 32-byte part using second load unit
// second cycle
vmovdqa ymmword ptr [rdx+0], ymm0 // store 1st 32-byte part using the single store unit
// third cycle
vmovdqa ymmword ptr [rdx+20h], ymm1 ; store 2nd 32-byte part - using the single store unit (this instruction will require a separate cycle since there is only one store unit, and we cannot do two stores in a single cycle)
add ecx, 40h // these instructions will be used by a different unit since they don't invoke load or store, so they won't require a new cycle
add edx, 40h
Also, there is speed benefit if you loop-unroll this code at least 8 times. As I wrote before, adding more registers besides ymm0 and ymm1 doesn't increase performance, because there are just two load units and one store unit. Adding loops like "dec r9 jnz ##again" degrades the performance, but simple "add ecx/edx" does not.
Finally, if your CPU has AVX-512 extension, you can use 512-bit (64-byte) registers to copy memory:
vmovdqu64 zmm0, [rcx+0] ; load 1st 64-byte part
vmovdqu64 zmm1, [rcx+40h] ; load 2nd 64-byte part
vmovdqu64 [rdx+0], zmm0 ; store 1st 64-byte part
vmovdqu64 [rdx+40h], zmm1 ; store 2nd 64-byte part
add rcx, 80h
add rdx, 80h
AVX-512 is supported by the following processors: Xeon Phi x200, released in 2016; Skylake EP/EX Xeon "Purley" (Xeon E5-26xx V5) processors (H2 2017); Cannonlake processors (H2 2017); and Skylake-X processors - Core i9-7×××X, i7-7×××X, i5-7×××X - released in June 2017.
Please note that the memory has to be aligned to the size of the registers that you are using. If it is not, please use the "unaligned" instructions: vmovdqu and movups.
The problem
I'm working on a custom OS for an ARM Cortex-M3 processor. To interact with my kernel, user threads have to generate a SuperVisor Call (SVC) instruction (previously known as SWI, for SoftWare Interrupt). The definition of this instruction in the ARM ARM is:
This means that the instruction requires an immediate argument, not a register value.
This is making it difficult for me to architect my interface in a readable fashion. It requires code like:
asm volatile( "svc #0");
when I'd much prefer something like
svc(SVC_YIELD);
However, I'm at a loss to construct this function, because the SVC instruction requires an immediate argument and I can't provide that when the value is passed in through a register.
The kernel:
For background, the svc instruction is decoded in the kernel as follows
#define SVC_YIELD 0
// Other SVC codes
// Called by the SVC interrupt handler (not shown)
void handleSVC(char code)
{
switch (code) {
case SVC_YIELD:
svc_yield();
break;
// Other cases follow
This case statement is getting rapidly out of hand, but I see no way around this problem. Any suggestions are welcome.
What I've tried
SVC with a register argument
I initially considered
void __attribute__((naked)) svc(char code)
{
asm volatile ("svc r0");
}
but that, of course, does not work, as SVC does not accept a register argument.
Brute force
The brute-force attempt to solve the problem looks like:
void svc(char code)
{
switch (code) {
case 0:
asm volatile("svc #0");
break;
case 1:
asm volatile("svc #1");
break;
/* 253 cases omitted */
case 255:
asm volatile("svc #255");
break;
}
}
but that has a nasty code smell. Surely this can be done better.
Generating the instruction encoding on the fly
A final attempt was to generate the instruction in RAM (the rest of the code is running from read-only Flash) and then run it:
void svc(char code)
{
asm volatile (
"orr r0, 0xDF00 \n\t" // Bitwise-OR the code with the SVC encoding
"push {r1, r0} \n\t" // Store the instruction to RAM (on the stack)
"mov r0, sp \n\t" // Copy the stack pointer to an ordinary register
"add r0, #1 \n\t" // Add 1 to the address to specify THUMB mode
"bx r0 \n\t" // Branch to newly created instruction
"pop {r1, r0} \n\t" // Restore the stack
"bx lr \n\t" // Return to caller
);
}
but this just doesn't feel right either. Also, it doesn't work; there's something I'm doing wrong here. Perhaps my instruction isn't properly aligned, or I haven't set up the processor to allow running code from RAM at this location.
What should I do?
I'll have to keep working on that last option, but it still feels like I ought to be able to do something like:
void __attribute__((naked)) svc(char code)
{
asm volatile ("svc %0"
: /* No outputs */
: "i" (code) // Imaginary directive specifying an immediate argument
// as opposed to conventional "r"
);
}
but I'm not finding any such option in the documentation and I'm at a loss to explain how such a feature would be implemented, so it probably doesn't exist. How should I do this?
You want to use a constraint to force the operand to be allocated as an 8-bit immediate. For ARM, that is constraint I. So you want
#define SVC(code) asm volatile ("svc %0" : : "I" (code) )
See the GCC documentation for a summary of what all the constraints are -- you need to look at the processor-specific notes to see the constraints for specific platforms. In some cases, you may need to look at the .md (machine description) file for the architecture in the gcc source for the full details.
There are also some good ARM-specific gcc docs here. A couple of pages down, under the heading "Input and output operands", it provides a table of all the ARM constraints.
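As a small usage sketch (SVC_YIELD is the value from the question; svc_yield() here is just an illustrative wrapper), call sites then stay readable:
#define SVC(code) asm volatile ("svc %0" : : "I" (code))   /* macro from above */

#define SVC_YIELD 0

static inline void svc_yield(void)
{
    SVC(SVC_YIELD);   /* compiles to a single "svc #0" */
}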
What about using a macro:
#define SVC(i) asm volatile("svc #"#i)
As noted by Chris Dodd in the comments on the macro, it doesn't quite work, but this does:
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)
#define SVC(i) asm volatile("svc #" STRINGIFY(i))
Note, however, that it won't work if you pass an enum value to it, only a #defined one; for example:
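A short illustration of that limitation (names here are only for illustration):
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)
#define SVC(i) asm volatile("svc #" STRINGIFY(i))

#define SVC_YIELD_DEF 0          /* visible to the preprocessor */
enum { SVC_YIELD_ENUM = 0 };     /* not visible to the preprocessor */

void example(void)
{
    SVC(SVC_YIELD_DEF);        /* expands to asm volatile("svc #" "0") and assembles fine   */
    /* SVC(SVC_YIELD_ENUM); */ /* would expand to "svc #" "SVC_YIELD_ENUM": assembler error */
}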
Therefore, Chris' answer above is the best, as it uses an immediate value, which is what's required, at least for Thumb instructions.
My solution ("Generating the instruction encoding on the fly"):
#include <stdint.h>

#define INSTR_CODE_SVC (0xDF00)
#define INSTR_CODE_BX_LR (0x4770)
void svc_call(uint32_t svc_num)
{
uint16_t instrs[2];
instrs[0] = (uint16_t)(INSTR_CODE_SVC | svc_num);
instrs[1] = (uint16_t)(INSTR_CODE_BX_LR);
// PC = instrs | 1 (bit 0 set -> Thumb mode)
((void(*)(void))((uint32_t)instrs | 1))();
}
It works, and it's much better than the switch-case variant, which takes ~2 KB of ROM for 256 SVCs. This function does not have to be placed in a RAM section; flash is fine.
You can use it when svc_num needs to be a runtime variable.
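If you want to be extra cautious about branching into freshly written data, a variant with explicit barriers might look like this. This is a sketch, not part of the answer above: it reuses the definitions above, the function name is mine, and on a cache-less Cortex-M3 the dsb/isb pair may well be unnecessary (the stack region must, of course, be executable):
#include <stdint.h>

void svc_call_with_barriers(uint32_t svc_num)
{
    uint16_t instrs[2];
    instrs[0] = (uint16_t)(INSTR_CODE_SVC | svc_num);   /* svc #svc_num */
    instrs[1] = (uint16_t)(INSTR_CODE_BX_LR);           /* bx lr        */
    /* Make sure the writes are complete and no stale instructions are prefetched
       before branching into the buffer. */
    __asm volatile ("dsb\n\tisb" : : : "memory");
    ((void (*)(void))((uint32_t)instrs | 1))();         /* bit 0 set -> Thumb state */
}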
As discussed in this question, the operand of SVC is fixed; that is, it has to be known at build time because it is encoded into the instruction itself, and it is different from the immediate operands of data-processing instructions.
The gcc manual reads
'I'- Integer that is valid as an immediate operand in a data processing instruction. That is, an integer in the range 0 to 255 rotated by a multiple of 2.
Therefore the answers here that use a macro are preferred, and the answer of Chris Dodd is not guaranteed to work, depending on the gcc version and optimization level. See the discussion of the other question.
I recently wrote an SVC handler for my own toy OS on Cortex-M. It works if tasks use the PSP stack pointer.
Idea:
Get the interrupted process's stack pointer and read the stacked PC, which holds the address of the instruction after the SVC; then read the immediate value out of the SVC instruction itself. It's not as hard as it sounds.
uint8_t __attribute__((naked)) get_svc_code(void){
__asm volatile("MSR R0, PSP"); //Get Process Stack Pointer (We're in SVC ISR, so currently MSP in use)
__asm volatile("ADD R0, #24"); //Pointer to stacked process's PC is in R0
__asm volatile("LDR R1, [R0]"); //Instruction Address after SVC is in R1
__asm volatile("SUB R1, R1, #2"); //Subtract 2 bytes from the address of the current instruction. Now R1 contains address of SVC instruction
__asm volatile("LDRB R0, [R1]"); //Load lower byte of 16-bit instruction into R0. It's immediate value.
//Value is in R0; a naked function has no epilogue, so return to the caller explicitly
__asm volatile("BX LR");
}
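The same idea can also be done mostly in C once the handler has the relevant stack pointer. A sketch (the function name is mine, and it assumes the standard ARMv7-M exception frame layout):
#include <stdint.h>

uint8_t svc_number_from_frame(const uint32_t *frame)
{
    /* Exception entry stacks R0-R3, R12, LR, PC and xPSR, so the stacked PC is
       frame[6]; it points at the instruction right after the SVC. */
    const uint16_t *svc_instr = (const uint16_t *)(frame[6] - 2);
    return (uint8_t)(*svc_instr & 0xFFu);   /* low byte of the Thumb SVC encoding is the immediate */
}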
The following piece of code was given to us from our instructor so we could measure some algorithms performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
: "=r" (*hi), "=r" (*lo)
: /* No input */
: "%edx", "%eax");
}
void start_counter() {
access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
double result;
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
double c1, c2;
start_counter();
c1 = get_counter();
sleep(1);
c2 = get_counter();
printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as a guest on VMware.
On a friend's machine (a MacBook) it works to some extent: the result is bigger than 0, but it varies because the CPU frequency is not fixed (we tried to fix it, but for some reason we weren't able to). He has another machine running Linux (Ubuntu) as the host, and it also reports 0. This rules out the virtual machine being the problem, which is what I thought the issue was at first.
Any ideas why this is happening and how I can fix it?
Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order, though. Despite your original order, the CPU is (mostly) free to execute it just like this:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) cover the steps necessary to use CPUID correctly/effectively for this task, starting roughly from there.
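For illustration, a CPUID-serialized read could look like this in GNU C inline asm (a sketch; cpuid writes eax/ebx/ecx/edx, hence the extra clobbers, and the matching "0" constraint feeds it eax = 0):
#include <stdint.h>

static inline uint64_t rdtsc_serialized(void)
{
    unsigned int lo, hi;
    /* cpuid is a serializing instruction, so everything before it must retire
       before rdtsc reads the time-stamp counter. */
    __asm__ __volatile__ ("cpuid\n\t"
                          "rdtsc"
                          : "=a" (lo), "=d" (hi)
                          : "0" (0)
                          : "ebx", "ecx");
    return ((uint64_t)hi << 32) | lo;
}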
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do 64-bit math carries yourself, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm; you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
#include <stdint.h>
uint64_t rdtsc() {
uint64_t ret;
# if __WORDSIZE == 64
asm ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax;"
: "=A"(ret)
: /* no input */
: "%edx"
);
#else
asm ("rdtsc"
: "=A"(ret)
);
#endif
return ret;
}
Also, you should consider printing out the values you get from this, so you can see whether you're getting zeros or something else.
As for VMware, take a look at the timekeeping spec (PDF link), as well as this thread. Depending on the guest OS, TSC instructions are:
Passed directly to the real hardware (PV guest)
Count cycles while the VM is executing on the host processor (Windows / etc)
Note the qualification in #2: while the VM is executing on the host processor. The same phenomenon would apply to Xen as well, if I recall correctly. In essence, you can expect the code to work as expected on a paravirtualized guest. If emulated, it's entirely unreasonable to expect hardware-like consistency.
You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized c2 - c1 to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and #Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.
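As a minimal sketch of redoing the frequency estimate on top of the intrinsic (keep in mind that on modern CPUs the TSC ticks at a constant reference rate, so this measures the TSC rate rather than the momentary core clock):
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    unsigned long long start = __rdtsc();
    sleep(1);
    unsigned long long end = __rdtsc();
    printf("Approx. TSC rate: %.1f MHz\n", (double)(end - start) / 1e6);
    return 0;
}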
Hmmm, I'm not positive, but I suspect the problem may be in this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm not sure you can safely carry out such huge multiplications in an "unsigned"; isn't that often a 32-bit number? Just the fact that you couldn't safely multiply by 2^32 and had to append it as an extra "* 4" added to the 2^30 already hints at this possibility. You might need to convert each sub-component, hi and lo, to a double (instead of a single conversion at the very end) and do the multiplication using the two doubles.