I would like to translate this code using SSE intrinsics.
for (uint32_t i = 0; i < length; i += 4, src += 4, dest += 4)
{
uint32_t value = *(uint32_t*)src;
*(uint32_t*)dest = ((value >> 16) & 0xFFFF) | (value << 16);
}
Is anyone aware of an intrinsic to perform the 16-bit word swapping?
pshufb (SSSE3) should be faster than 2 shifts and an OR. Also, a slight modification to the shuffle mask would enable an endian conversion, instead of just a word-swap.
stealing Paul R's function structure, just replacing the vector intrinsics:
void word_swapping_ssse3(uint32_t* dest, const uint32_t* src, size_t count)
{
size_t i;
__m128i shufmask = _mm_set_epi8(13,12, 15,14, 9,8, 11,10, 5,4, 7,6, 1,0, 3,2);
// _mm_set args go in big-endian order for some reason.
for (i = 0; i + 4 <= count; i += 4)
{
__m128i s = _mm_loadu_si128((__m128i*)&src[i]);
__m128i d = _mm_shuffle_epi8(s, shufmask);
_mm_storeu_si128((__m128i*)&dest[i], d);
}
for ( ; i < count; ++i) // handle residual elements
{
uint32_t w = src[i];
w = (w >> 16) | (w << 16);
dest[i] = w;
}
}
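As a sketch of the endian-conversion variant mentioned above (my function name; only the mask changes, here reversing all 4 bytes of each 32-bit element instead of swapping the 16-bit halves):
#include <tmmintrin.h> // SSSE3
#include <stdint.h>
#include <stddef.h>
void byte_reverse_ssse3(uint32_t* dest, const uint32_t* src, size_t count)
{
    size_t i;
    __m128i shufmask = _mm_set_epi8(12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3);
    for (i = 0; i + 4 <= count; i += 4)
    {
        __m128i s = _mm_loadu_si128((__m128i*)&src[i]);
        _mm_storeu_si128((__m128i*)&dest[i], _mm_shuffle_epi8(s, shufmask));
    }
    for ( ; i < count; ++i)  // scalar cleanup: byte-reverse one element
    {
        uint32_t w = src[i];
        dest[i] = (w >> 24) | ((w >> 8) & 0xFF00) | ((w << 8) & 0xFF0000) | (w << 24);
    }
}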
pshufb can have a memory operand, but it has to be the shuffle mask, not the data to be shuffled. So you can't use it as a shuffled-load. :/
gcc doesn't generate great code for the loop. The main loop is
# src: r8. dest: rcx. count: rax. shufmask: xmm1
.L16:
movq %r9, %rax
.L3: # first-iteration entry point
movdqu (%r8), %xmm0
leaq 4(%rax), %r9
addq $16, %r8
addq $16, %rcx
pshufb %xmm1, %xmm0
movups %xmm0, -16(%rcx)
cmpq %rdx, %r9
jbe .L16
With all that loop overhead, and needing a separate load and store instruction, throughput will only be 1 shuffle per 2 cycles. (8 uops, since cmp macro-fuses with jbe).
A faster loop would be
shl $2, %rax # uint count -> byte count
# check for %rax less than 16 and skip the vector loop
# cmp / jsomething
add %rax, %r8 # set up pointers to the end of the array
add %rax, %rcx
neg %rax # and count upwards toward zero
.loop:
movdqu (%r8, %rax), %xmm0
pshufb %xmm1, %xmm0
movups %xmm0, (%rcx, %rax) # IDK why gcc chooses movups for stores. Shorter encoding?
add $16, %rax
jl .loop
# ...
# scalar cleanup
movdqu loads can micro-fuse with complex addressing modes, unlike vector ALU ops, so all these instructions are single-uop except the store, I believe.
This should run at 1 cycle per iteration with some unrolling, since add can macro-fuse with jl. So the loop has 5 total uops, 3 of which are load/store ops that have dedicated ports. Bottlenecks are: pshufb can only run on one execution port on Haswell (SnB/IvB can run pshufb on ports 1 and 5); one store per cycle (all microarchitectures); and finally, the 4 fused-domain uops per clock limit for Intel CPUs, which should be reachable barring cache misses on Nehalem and later (uop loop buffer).
Unrolling would bring the total fused-domain uops per 16B down below 4. Incrementing pointers, instead of using complex addressing modes, would let the stores micro-fuse. (Reducing loop overhead is always good: letting the re-order buffer fill up with future iterations means the CPU has something to do when it hits a mispredict at the end of the loop and returns to other code.)
This is pretty much what you'd get by unrolling the intrinsics loops, as Elalfer rightly suggests would be a good idea. With gcc, try -funroll-loops if that doesn't bloat the code too much.
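For reference, a sketch of a 2x manual unroll of the intrinsics loop (untested; the scalar cleanup is unchanged):
#include <tmmintrin.h> // SSSE3
#include <stdint.h>
#include <stddef.h>
void word_swapping_ssse3_unroll2(uint32_t* dest, const uint32_t* src, size_t count)
{
    const __m128i shufmask = _mm_set_epi8(13,12, 15,14, 9,8, 11,10, 5,4, 7,6, 1,0, 3,2);
    size_t i = 0;
    for ( ; i + 8 <= count; i += 8)   // two 16B vectors per iteration
    {
        __m128i s0 = _mm_loadu_si128((__m128i*)&src[i]);
        __m128i s1 = _mm_loadu_si128((__m128i*)&src[i + 4]);
        _mm_storeu_si128((__m128i*)&dest[i],     _mm_shuffle_epi8(s0, shufmask));
        _mm_storeu_si128((__m128i*)&dest[i + 4], _mm_shuffle_epi8(s1, shufmask));
    }
    for ( ; i < count; ++i)           // scalar cleanup
    {
        uint32_t w = src[i];
        dest[i] = (w >> 16) | (w << 16);
    }
}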
BTW, it's probably going to be better to byte-swap while loading or storing, mixed in with other code, rather than converting a buffer as a separate operation.
The scalar code in your question isn't really byte swapping (in the sense of endianness conversion, at least) - it's just swapping the high and low 16 bits within a 32 bit word. If this is what you want though then just re-use the solution to your previous question, with appropriate changes:
void byte_swapping(uint32_t* dest, const uint32_t* src, size_t count)
{
size_t i;
for (i = 0; i + 4 <= count; i += 4)
{
__m128i s = _mm_loadu_si128((__m128i*)&src[i]);
__m128i d = _mm_or_si128(_mm_slli_epi32(s, 16), _mm_srli_epi32(s, 16));
_mm_storeu_si128((__m128i*)&dest[i], d);
}
for ( ; i < count; ++i) // handle residual elements
{
uint32_t w = src[i];
w = (w >> 16) | (w << 16);
dest[i] = w;
}
}
Question
Say you have a simple function that returns a value based on a lookup table, for example:
See edit about assumptions.
uint32_t
lookup0(uint32_t r) {
static const uint32_t tbl[] = { 0, 1, 2, 3 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r`. */
return tbl[r];
}
uint32_t
lookup1(uint32_t r) {
static const uint32_t tbl[] = { 0, 0, 1, 1 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r / 2`. */
return tbl[r];
}
Is there any super-optimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?
Motivation
The motivation is I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common that in NUMA locks you will need to do cpu_id -> numa_node. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can get, I am hoping for a generic solution that will be able to cover most layouts.
Looking at how modern compilers do:
Neither clang nor gcc is able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
lookup0(unsigned int): # #lookup0(unsigned int)
movl %edi, %eax
movl lookup0(unsigned int)::tbl(,%rax,4), %eax
retq
...
case0(unsigned int): # #case0(unsigned int)
movl %edi, %eax
retq
but can't get lookup1.
lookup1(unsigned int): # #lookup1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
...
case1(unsigned int): # #case1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
GCC can't get either.
lookup0(unsigned int):
movl %edi, %edi
movl lookup0(unsigned int)::tbl(,%rdi,4), %eax
ret
lookup1(unsigned int):
movl %edi, %edi
movl lookup1(unsigned int)::tbl(,%rdi,4), %eax
ret
case0(unsigned int):
leal -1(%rdi), %eax
cmpl $2, %eax
movl $0, %eax
cmovbe %edi, %eax
ret
case1(unsigned int):
subl $2, %edi
xorl %eax, %eax
cmpl $1, %edi
setbe %al
ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
Edit:
The only true assumption is:
All inputs have an index in the LUT.
All values are positive (I think that makes things easier), and this will be true for just about any sys-config that's online.
(Edit4) I would add one more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <upper_bound>] but nothing outside of that range.
In my case for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut> but that is specific to the one example I gave and would have some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The example lookup0/lookup1 are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology so ~256 - 1024 is where I would expect the upper bound in size but for a generic LUT it could obviously get much larger.
The best "generic" solution I am aware of is the following:
int compute(int r)
{
static const int T[] = {0,0,1,1};
const int lut_size = sizeof(T) / sizeof(T[0]);
int result = 0;
for(int i=0 ; i<lut_size ; ++i)
result += (r == i) * T[i];
return result;
}
At -O3, GCC and Clang unroll the loop, propagate constants, and generate intermediate code similar to the following:
int compute(int r)
{
return (r == 0) * 0 + (r == 1) * 0 + (r == 2) * 1 + (r == 3) * 1;
}
GCC/Clang optimizers know that the multiplication can be replaced with conditional moves (since developers often use this as a trick to guide compilers into generating assembly code without conditional branches).
The resulting assembly is the following for Clang:
compute:
xor ecx, ecx
cmp edi, 2
sete cl
xor eax, eax
cmp edi, 3
sete al
add eax, ecx
ret
The same applies to GCC. There are no branches nor memory accesses (at least as long as the values are small). Multiplications by small values are also replaced with the fast lea instruction.
A more complete test is available on Godbolt.
Note that this method should work for bigger tables but if the table is too big, then the loop will not be automatically unrolled. You can tell the compiler to use a more aggressive unrolling thanks to compilation flags. That being said, a LUT will likely be faster if it is big since having a huge code to load and execute is slow in this pathological case.
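For example, a sketch of requesting per-loop unrolling with a pragma instead of global flags (pragma spellings vary between compilers and versions, so treat these as assumptions to verify on Godbolt):
int compute(int r)
{
    static const int T[] = {0, 0, 1, 1};
    const int lut_size = sizeof(T) / sizeof(T[0]);
    int result = 0;
    // GCC 8+ accepts "#pragma GCC unroll N"; Clang also accepts "#pragma unroll".
    #pragma GCC unroll 256
    for (int i = 0; i < lut_size; ++i)
        result += (r == i) * T[i];
    return result;
}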
You could pack the array into a long integer and use bitshifts and anding to extract the result.
For example, the table {2,0,3,1} could be handled with:
uint32_t lookup0(uint32_t r) {
static const uint32_t tbl = (2u << 0) | (0u << 8) |
(3u << 16) | (1u << 24);
return (tbl >> (8 * r)) & 0xff;
}
It produces relatively nice assembly:
lookup0: # #lookup0
lea ecx, [8*rdi]
mov eax, 16973826
shr eax, cl
movzx eax, al
ret
Not perfect but branchless and with no indirection.
This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
There are a few tricks that allow handling larger arrays, like using longer integers (e.g. uint64_t or the __uint128_t extension).
Another approach is to split the bits of the array values, e.g. into a high and a low byte, look each part up separately, and combine the results using bitwise operations.
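A sketch of the longer-integer variant mentioned above (example values of mine, 8 entries of 8 bits each; every value must fit in a byte):
#include <stdint.h>

uint32_t lookup8(uint32_t r) { /* assumes r < 8 */
    static const uint64_t tbl = (0ull <<  0) | (0ull <<  8) |
                                (1ull << 16) | (1ull << 24) |
                                (2ull << 32) | (2ull << 40) |
                                (3ull << 48) | (3ull << 56);
    return (uint32_t)((tbl >> (8 * r)) & 0xff);
}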
So, I'm trying to get familiar with assembly and trying to reverse-engineer some code. My problem lies in trying to decode addq, which I understand performs Source + Destination = Destination.
I am using the assumptions that parameters x, y, and z are passed in registers %rdi, %rsi, and %rdx. The return value is stored in %rax.
long someFunc(long x, long y, long z){
1. long temp=(x-z)*x;
2. long temp2= (temp<<63)>>63;
3. long temp3= (temp2 ^ x);
4. long answer=y+temp3;
5. return answer;
}
So far everything above line 4 is exactly what I am wanting. However, line 4 gives me leaq (%rsi,%rdi), %rax rather than addq %rsi, %rax. I'm not sure if this is something I am doing wrong, but I am looking for some insight.
Those instructions aren't equivalent. For LEA, rax is a pure output. For your hoped-for add, it's rax += rsi so the compiler would have to mov %rdi, %rax first. That's less efficient so it doesn't do that.
lea is a totally normal way for compilers to implement dst = src1 + src2, saving a mov instruction. In general don't expect C operators to compile to instruction named after them. Especially small left-shifts and add, or multiply by 3, 5, or 9, because those are prime targets for optimization with LEA. e.g. lea (%rsi, %rsi, 2), %rax implements result = y*3. See Using LEA on values that aren't addresses / pointers? for more. LEA is also useful to avoid destroying either of the inputs, if they're both needed later.
Clang does compile this the way you were expecting: it does a better job of register allocation than GCC, so it can use a shorter and more efficient add instruction for the last operation instead of needing lea, without any extra mov instructions (Godbolt). This saves code-size (lea with an indexed addressing mode has a longer encoding), and add has slightly better throughput than LEA on most CPUs, like 4/clock instead of 2/clock.
Clang also optimized the shifts into andl $1, %eax / negq %rax to create the 0 or -1 result of that arithmetic right shift = bit-broadcast. It also optimized to 32-bit operand-size for the first few steps because the shifts throw away all but the low bit of temp1.
# side by side comparison, like the Godbolt diff pane
clang: | gcc:
movl %edi, %eax movq %rdi, %rax
subl %edx, %eax subq %rdx, %rdi
imull %edi, %eax imulq %rax, %rdi # temp1
andl $1, %eax salq $63, %rdi
negq %rax sarq $63, %rdi # temp2
xorq %rdi, %rax xorq %rax, %rdi # temp3
addq %rsi, %rax leaq (%rdi,%rsi), %rax # answer
retq ret
Notice that clang chose imul %edi, %eax (into RAX) but GCC chose to multiply into RDI. That's the difference in register allocation that leads to GCC needing an lea at the end instead of an add.
Compilers sometimes even get stuck with an extra mov instruction at the end of a small function when they make poor choices like this, if the last operation wasn't something like addition that can be done with lea as a non-destructive op-and-copy. These are missed-optimization bugs; you can report them on GCC's bugzilla.
Other missed optimizations
GCC and clang could have optimized by using and instead of imul to set the low bit only if both inputs are odd.
Also, since only the low bit of the sub output matters, XOR (add without carry) would have worked, or even addition! (Odd +- even = odd. Even +- even = even. Odd +- odd = even.) That would have allowed an lea instead of mov/sub as the first instruction.
lea (%rdi,%rdx), %eax # x+z: its low bit matches x-z
and %edi, %eax # low bit matches (x-z)*x
andl $1, %eax # keep only the low bit
negq %rax # temp2
Let's make a truth table for the low bits of x and z to see how this shakes out if we want to optimize more / differently:
# truth table for the low bit: the input to the shifts that broadcast it to all bits
x&1 | z&1 | x-z = x^z | x*(x-z) = x & (x-z)
 0  |  0  |     0     |     0
 0  |  1  |     1     |     0
 1  |  0  |     1     |     1
 1  |  1  |     0     |     0
The last column is x & (~z), i.e. what BMI1 andn computes.
So temp2 = (x^z) & x & 1 ? -1 : 0. But also temp2 = -((x & ~z) & 1).
We can rearrange that to -((x&1) & ~z) which lets us start with not z and and $1, x in parallel, for better ILP. Or if z might be ready first, we could do operations on it and shorten the critical path from x -> answer, at the expense of z.
Or with a BMI1 andn instruction which does (~z) & x, we can do this in one instruction. (Plus another to isolate the low bit)
I think this function has the same behaviour for every possible input, so compilers could have emitted it from your source code. This is one possibility you should wish your compiler emitted:
# hand-optimized
# long someFunc(long x, long y, long z)
someFunc:
not %edx # ~z
and $1, %edx
and %edi, %edx # x&1 & ~z = low bit of temp1
neg %rdx # temp2 = 0 or -1
xor %rdi, %rdx # temp3 = x or ~x
lea (%rsi, %rdx), %rax # answer = y + temp3
ret
So there's still no ILP, unless z is ready before x and/or y. Using an extra mov instruction, we could do x&1 in parallel with not z.
Possibly you could do something with test/setz or cmov, but IDK if that would beat lea/and (temp1) + and/neg (temp2) + xor + add.
I haven't looked into optimizing the final xor and add, but note that temp3 is basically a conditional NOT of x. You could maybe improve latency at the expense of throughput by calculating both ways at once and selecting between them with cmov. Possibly by involving the 2's complement identity that -x - 1 = ~x. Maybe improve ILP / latency by doing x+y and then correcting that with something that depends on the x and z condition? Since we can't subtract using LEA, it seems best to just NOT and ADD.
# return y + x or y + (~x) according to the condition on x and z
someFunc:
lea (%rsi, %rdi), %rax # y + x
andn %edi, %edx, %ecx # ecx = x & (~z)
not %rdi # ~x
add %rsi, %rdi # y + (~x)
test $1, %cl
cmovnz %rdi, %rax # select between y+x and y+~x
retq
This has more ILP, but needs BMI1 andn to still be only 6 (single-uop) instructions. Broadwell and later have single-uop CMOV; on earlier Intel it's 2 uops.
The other function could be 5 uops using BMI andn.
In this version, the first 3 instructions can all run in the first cycle, assuming x,y, and z are all ready. Then in the 2nd cycle, ADD and TEST can both run. In the 3rd cycle, CMOV can run, taking integer inputs from LEA, ADD, and flag input from TEST. So the total latency from x->answer, y->answer, or z->answer is 3 cycles in this version. (Assuming single-uop / single-cycle cmov). Great if it's on the critical path, not very relevant if it's part of an independent dep chain and throughput is all that matters.
vs. 5 (andn) or 6 cycles (without) for the previous attempt. Or even worse for the compiler output using imul instead of and (3 cycle latency just for that instruction).
I have the following assembly code from the C function long loop(long x, int n)
with x in %rdi, n in %esi on a 64 bit machine. I've written my comments on what I think the assembly instructions are doing.
loop:
movl %esi, %ecx // store the value of n in register ecx
movl $1, %edx // store the value of 1 in register edx (rdx).initial mask
movl $0, %eax //store the value of 0 in register eax (rax). this is initial return value
jmp .L2
.L3:
movq %rdi, %r8 //store the value of x in register r8
andq %rdx, %r8 //store the value of (x & mask) in r8
orq %r8, %rax //update the return value rax by (x & mask | [rax] )
salq %cl, %rdx //update the mask rdx by ( [rdx] << n)
.L2:
testq %rdx, %rdx //test mask&mask
jne .L3 // if (mask&mask) != 0, jump to L3
rep; ret
I have the following C function which needs to correspond to the assembly code:
long loop(long x, int n){
long result = _____ ;
long mask;
// for (mask = ______; mask ________; mask = ______){ // filled in as:
for (mask = 1; mask != 0; mask <<n) {
result |= ________;
}
return result;
}
I need some help filling in the blanks. I'm not 100% sure what the assembly instructions do, but I've given it my best shot by commenting each line.
You've pretty much got it in your comments.
long loop(long x, int n) {
long result = 0;
long mask;
for (mask = 1; mask != 0; mask <<= n) {
result |= (x & mask);
}
return result;
}
Because result is the return value, and the return value is stored in %rax, movl $0, %eax loads 0 into result initially.
Inside the for loop, %r8 holds the value that is or'd with result, which, like you mentioned in your comments, is just x & mask.
The function copies every nth bit to result.
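As a quick illustration of that (my own example, assuming the reconstructed loop() above is in scope):
#include <stdio.h>

int main(void) {
    long x = 0xFFFFFFFFL;         /* low 32 bits set */
    printf("%lx\n", loop(x, 4));  /* 11111111: bits 0,4,...,28 of x survive */
    printf("%lx\n", loop(x, 1));  /* ffffffff: every bit of x survives */
    return 0;
}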
For the record, the implementation is full of missed optimizations, especially if we're tuning for Sandybridge-family where bts reg,reg is only 1 uop with 1c latency, but shl %cl is 3 uops. (BTS is also 1 uop on Intel P6 / Atom / Silvermont CPUs)
bts is only 2 uops on AMD K8/Bulldozer/Zen. BTS reg,reg masks the shift count the same way x86 integer shifts do, so bts %rdx, %rax implements rax |= 1ULL << (rdx&0x3f). i.e. setting bit n in RAX.
(This code is clearly designed to be simple to understand, and doesn't even use the most well-known x86 peephole optimization, xor-zeroing, but it's fun to see how we can implement the same thing efficiently.)
More obviously, doing the and inside the loop is bad. Instead we can just build up a mask with every nth bit set, and return x & mask. This has the added advantage that with a non-ret instruction following the conditional branch, we don't need the rep prefix as padding for the ret even if we care about tuning for the branch predictors in AMD Phenom CPUs. (Because it isn't the first byte after a conditional branch.)
# x86-64 System V: x in RDI, n in ESI
mask_bitstride: # give the function a meaningful name
mov $1, %eax # mask = 1
mov %esi, %ecx # unsigned bitidx = n (new tmp var)
# the loop always runs at least 1 iteration, so just fall into it
.Lloop: # do {
bts %rcx, %rax # rax |= 1ULL << bitidx
add %esi, %ecx # bitidx += n
cmp $63, %ecx # sizeof(long)*CHAR_BIT - 1
jbe .Lloop # }while(bitidx <= maxbit); // unsigned condition
and %rdi, %rax # return x & mask
ret # not following a JCC so no need for a REP prefix even on K10
We assume n is in the 0..63 range, because otherwise the C would have undefined behaviour. Outside that range, this implementation differs from the shift-based implementation in the question: the shl version would treat n==64 as an infinite loop, because shift count = 0x40 & 0x3f = 0, so mask would never change. This bitidx += n version would exit after the first iteration, because bitidx immediately becomes > 63, i.e. out of range.
A less extreme case is n=65: the shl version would copy all the bits (shift count of 1), while this version would copy only the low bit.
Both versions create an infinite loop for n=0. I used an unsigned compare so negative n will exit the loop promptly.
On Intel Sandybridge-family, the inner loop in the original is 7 uops. (mov = 1 + and=1 + or=1 + variable-count-shl=3 + macro-fused test+jcc=1). This will bottleneck on the front-end, or on ALU throughput on SnB/IvB.
My version is only 3 uops, and can run about twice as fast. (1 iteration per clock.)
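For completeness, a C sketch of the same strategy (build the strided mask, then AND with x once at the end). It assumes 1 <= n <= 63: unlike bts, a C shift by 64 or more is undefined, and n == 0 would loop forever just like the asm.
long mask_bitstride_c(long x, int n)
{
    unsigned long mask = 1;                  /* bit 0, like the mov $1, %eax */
    unsigned bitidx = (unsigned)n;
    do {                                     /* always runs at least one iteration */
        mask |= 1UL << bitidx;               /* the bts */
        bitidx += (unsigned)n;
    } while (bitidx <= 63);
    return (long)(mask & (unsigned long)x);  /* return x & mask */
}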
Disclaimer: I am well aware implementing your own crypto is a very bad idea. This is part of a master thesis, the code will not be used in practice.
As part of a larger cryptographic algorithm, I need to sort an array of constant length (small, 24 to be precise), without leaking any information on the contents of this array. As far as I know (please correct me if these are not sufficient to prevent timing and cache attacks), this means:
The sort should run in an amount of cycles that depends only on the length of the array, not on the particular values of the array
The sort should not branch or access memory depending on the particular values of the array
Do any such implementations exist? If not, are there any good resources on this type of programming?
To be honest, I'm even struggling with the easier subproblem, namely finding the smallest value of an array.
double arr[24]; // some input
double min = DBL_MAX;
int i;
for (i = 0; i < 24; ++i) {
if (arr[i] < min) {
min = arr[i];
}
}
Would adding an else with a dummy assignment be sufficient to make it timing-safe? If so, how do I ensure the compiler (GCC in my case) doesn't undo my hard work? Would this be susceptible to cache attacks?
Use a sorting network, a series of comparisons and swaps.
The swap call must not be dependent on the comparison: it must execute the same sequence of instructions regardless of the comparison result.
Like this:
#include <stdbool.h>

void swap( int* a , int* b , bool c )
{
const int min = c ? *b : *a;
const int max = c ? *a : *b;
*a = min;
*b = max;
}
swap( &array[0] , &array[1] , array[0] > array[1] );
Then find the sorting network and use the swaps. Here is a generator that does that for you: http://pages.ripco.net/~jgamble/nw.html
Example for 4 elements, the numbers are array indices, generated by the above link:
SWAP(0, 1);
SWAP(2, 3);
SWAP(0, 2);
SWAP(1, 3);
SWAP(1, 2);
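One way to wire those generated SWAP(i, j) lines to the branchless swap above (my own glue; it assumes array is an int array in scope and that swap is the function shown earlier):
#include <stdbool.h>

#define SWAP(i, j) swap(&array[(i)], &array[(j)], array[(i)] > array[(j)])

void sort4(int array[4])
{
    SWAP(0, 1);
    SWAP(2, 3);
    SWAP(0, 2);
    SWAP(1, 3);
    SWAP(1, 2);
}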
This is a very dumb bubble sort that actually works and doesn't branch or change memory access behavior depending on input data. Not sure if this can be plugged into another sorting algorithm; they need their compares separate from the swaps, but maybe it's possible - working on that now.
#include <stdint.h>
static void
cmp_and_swap(uint32_t *ap, uint32_t *bp)
{
uint32_t a = *ap;
uint32_t b = *bp;
int64_t c = (int64_t)a - (int64_t)b;
uint32_t sign = ((uint64_t)c >> 63);
uint32_t min = a * sign + b * (sign ^ 1);
uint32_t max = b * sign + a * (sign ^ 1);
*ap = min;
*bp = max;
}
void
timing_sort(uint32_t *arr, int n)
{
int i, j;
for (i = n - 1; i >= 0; i--) {
for (j = 0; j < i; j++) {
cmp_and_swap(&arr[j], &arr[j + 1]);
}
}
}
The cmp_and_swap function compiles to (Apple LLVM version 7.3.0 (clang-703.0.29), compiled with -O3):
_cmp_and_swap:
00000001000009e0 pushq %rbp
00000001000009e1 movq %rsp, %rbp
00000001000009e4 movl (%rdi), %r8d
00000001000009e7 movl (%rsi), %r9d
00000001000009ea movq %r8, %rdx
00000001000009ed subq %r9, %rdx
00000001000009f0 shrq $0x3f, %rdx
00000001000009f4 movl %edx, %r10d
00000001000009f7 negl %r10d
00000001000009fa orl $-0x2, %edx
00000001000009fd incl %edx
00000001000009ff movl %r9d, %ecx
0000000100000a02 andl %edx, %ecx
0000000100000a04 andl %r8d, %edx
0000000100000a07 movl %r8d, %eax
0000000100000a0a andl %r10d, %eax
0000000100000a0d addl %eax, %ecx
0000000100000a0f andl %r9d, %r10d
0000000100000a12 addl %r10d, %edx
0000000100000a15 movl %ecx, (%rdi)
0000000100000a17 movl %edx, (%rsi)
0000000100000a19 popq %rbp
0000000100000a1a retq
0000000100000a1b nopl (%rax,%rax)
The only memory accesses are the reads and writes of the array elements; there are no branches. The compiler did figure out what the multiplication actually does (quite clever, actually), but it didn't use branches for that.
The casts to int64_t are necessary to avoid overflows. I'm pretty sure it can be written cleaner.
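For instance, here's one possibly cleaner variant (my sketch, untested): select with an all-ones/all-zeros mask and XOR instead of multiplying.
#include <stdint.h>

static void
cmp_and_swap_masked(uint32_t *ap, uint32_t *bp)
{
    uint32_t a = *ap;
    uint32_t b = *bp;
    uint64_t diff = (uint64_t)a - (uint64_t)b;              /* bit 63 set iff a < b */
    uint32_t mask = (uint32_t)0 - (uint32_t)(diff >> 63);   /* 0xFFFFFFFF or 0 */
    uint32_t min = b ^ ((a ^ b) & mask);                    /* a if a < b, else b */
    uint32_t max = a ^ b ^ min;                             /* whichever one min isn't */
    *ap = min;
    *bp = max;
}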
As requested, here's a compare function for doubles:
#include <math.h> /* for signbit() */

void
cmp_and_swap(double *ap, double *bp)
{
double a = *ap;
double b = *bp;
int sign = !!signbit(a - b); /* normalize to 0/1; signbit() only guarantees zero vs. nonzero */
double min = a * sign + b * (sign ^ 1);
double max = b * sign + a * (sign ^ 1);
*ap = min;
*bp = max;
}
Compiled code is branchless and doesn't change memory access pattern depending on input data.
A very trivial, time-constant (but also highly inefficient) sort is to
have a src and destination array
for each element in the (sorted) destination array, iterate through the complete source array to find the element that belongs exactly into this position.
No early breaks, (nearly) constant timing, not depending on even partial sortedness of the source.
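A sketch of that idea (my own code; the value comparisons are written as integer arithmetic, but as with everything here you'd want to inspect the generated assembly to confirm the compiler kept them branchless):
#include <stdint.h>

/* Rank-based selection: for each output position, OR in the one element whose
 * rank equals that position. Ties are broken by index, so duplicates get
 * distinct ranks. O(n^3) comparisons, but a fixed memory access pattern. */
void trivial_ct_sort(uint32_t *dst, const uint32_t *src, int n)
{
    for (int pos = 0; pos < n; pos++) {
        uint32_t out = 0;
        for (int i = 0; i < n; i++) {
            int rank = 0;
            for (int j = 0; j < n; j++)
                rank += (src[j] < src[i]) | ((src[j] == src[i]) & (j < i));
            uint32_t take = (uint32_t)0 - (uint32_t)(rank == pos); /* all-ones or 0 */
            out |= src[i] & take;
        }
        dst[pos] = out;
    }
}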
I am investigating the effect of vectorization on the performance of the program. In this regard, I have written following code:
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#define LEN 10000000
int main(){
struct timeval stTime, endTime;
double* a = (double*)malloc(LEN*sizeof(*a));
double* b = (double*)malloc(LEN*sizeof(*b));
double* c = (double*)malloc(LEN*sizeof(*c));
int k;
for(k = 0; k < LEN; k++){
a[k] = rand();
b[k] = rand();
}
gettimeofday(&stTime, NULL);
for(k = 0; k < LEN; k++)
c[k] = a[k] * b[k];
gettimeofday(&endTime, NULL);
FILE* fh = fopen("dump", "w");
for(k = 0; k < LEN; k++)
fprintf(fh, "c[%d] = %f\t", k, c[k]);
fclose(fh);
double timeE = (double)(endTime.tv_usec + endTime.tv_sec*1000000 - stTime.tv_usec - stTime.tv_sec*1000000);
printf("Time elapsed: %f\n", timeE);
return 0;
}
In this code, I am simply initializing and multiplying two vectors. The results are saved in vector c. What I am mainly interested in is the effect of vectorizing the following loop:
for(k = 0; k < LEN; k++)
c[k] = a[k] * b[k];
I compile the code using the following two commands:
1) icc -O2 TestSMID.c -o TestSMID -no-vec -no-simd
2) icc -O2 TestSMID.c -o TestSMID -vec-report2
I expect to see performance improvement since the second command successfully vectorizes the loop. However, my studies show that there is no performance improvement when the loop is vectorized.
I may have missed something here since I am not super familiar with the topic. So, please let me know if there is something wrong with my code.
Thanks in advance for your help.
PS: I am using Mac OSX, so there is no need to align the data, as all the allocated memory is 16-byte aligned.
Edit:
I would like to first thank you all for your comments and answers.
I thought about the answer proposed by @Mysticial and there are some further points that should be mentioned here.
Firstly, as @Vinska mentioned, c[k]=a[k]*b[k] does not take only one cycle. In addition to incrementing the loop index and comparing k against LEN, there are other things to be done to perform the operation. Having a look at the assembly code generated by the compiler, it can be seen that a simple multiplication needs much more than one cycle. The vectorized version looks like:
L_B1.9: # Preds L_B1.8
movq %r13, %rax #25.5
andq $15, %rax #25.5
testl %eax, %eax #25.5
je L_B1.12 # Prob 50% #25.5
# LOE rbx r12 r13 r14 r15 eax
L_B1.10: # Preds L_B1.9
testb $7, %al #25.5
jne L_B1.32 # Prob 10% #25.5
# LOE rbx r12 r13 r14 r15
L_B1.11: # Preds L_B1.10
movsd (%r14), %xmm0 #26.16
movl $1, %eax #25.5
mulsd (%r15), %xmm0 #26.23
movsd %xmm0, (%r13) #26.9
# LOE rbx r12 r13 r14 r15 eax
L_B1.12: # Preds L_B1.11 L_B1.9
movl %eax, %edx #25.5
movl %eax, %eax #26.23
negl %edx #25.5
andl $1, %edx #25.5
negl %edx #25.5
addl $10000000, %edx #25.5
lea (%r15,%rax,8), %rcx #26.23
testq $15, %rcx #25.5
je L_B1.16 # Prob 60% #25.5
# LOE rdx rbx r12 r13 r14 r15 eax
L_B1.13: # Preds L_B1.12
movl %eax, %eax #25.5
# LOE rax rdx rbx r12 r13 r14 r15
L_B1.14: # Preds L_B1.14 L_B1.13
movups (%r15,%rax,8), %xmm0 #26.23
movsd (%r14,%rax,8), %xmm1 #26.16
movhpd 8(%r14,%rax,8), %xmm1 #26.16
mulpd %xmm0, %xmm1 #26.23
movntpd %xmm1, (%r13,%rax,8) #26.9
addq $2, %rax #25.5
cmpq %rdx, %rax #25.5
jb L_B1.14 # Prob 99% #25.5
jmp L_B1.20 # Prob 100% #25.5
# LOE rax rdx rbx r12 r13 r14 r15
L_B1.16: # Preds L_B1.12
movl %eax, %eax #25.5
# LOE rax rdx rbx r12 r13 r14 r15
L_B1.17: # Preds L_B1.17 L_B1.16
movsd (%r14,%rax,8), %xmm0 #26.16
movhpd 8(%r14,%rax,8), %xmm0 #26.16
mulpd (%r15,%rax,8), %xmm0 #26.23
movntpd %xmm0, (%r13,%rax,8) #26.9
addq $2, %rax #25.5
cmpq %rdx, %rax #25.5
jb L_B1.17 # Prob 99% #25.5
# LOE rax rdx rbx r12 r13 r14 r15
L_B1.18: # Preds L_B1.17
mfence #25.5
# LOE rdx rbx r12 r13 r14 r15
L_B1.19: # Preds L_B1.18
mfence #25.5
# LOE rdx rbx r12 r13 r14 r15
L_B1.20: # Preds L_B1.14 L_B1.19 L_B1.32
cmpq $10000000, %rdx #25.5
jae L_B1.24 # Prob 0% #25.5
# LOE rdx rbx r12 r13 r14 r15
L_B1.22: # Preds L_B1.20 L_B1.22
movsd (%r14,%rdx,8), %xmm0 #26.16
mulsd (%r15,%rdx,8), %xmm0 #26.23
movsd %xmm0, (%r13,%rdx,8) #26.9
incq %rdx #25.5
cmpq $10000000, %rdx #25.5
jb L_B1.22 # Prob 99% #25.5
# LOE rdx rbx r12 r13 r14 r15
L_B1.24: # Preds L_B1.22 L_B1.20
And the non-vectorized version is:
L_B1.9: # Preds L_B1.8
xorl %eax, %eax #25.5
# LOE rbx r12 r13 r14 r15 eax
L_B1.10: # Preds L_B1.10 L_B1.9
lea (%rax,%rax), %edx #26.9
incl %eax #25.5
cmpl $5000000, %eax #25.5
movsd (%r15,%rdx,8), %xmm0 #26.16
movsd 8(%r15,%rdx,8), %xmm1 #26.16
mulsd (%r13,%rdx,8), %xmm0 #26.23
mulsd 8(%r13,%rdx,8), %xmm1 #26.23
movsd %xmm0, (%rbx,%rdx,8) #26.9
movsd %xmm1, 8(%rbx,%rdx,8) #26.9
jb L_B1.10 # Prob 99% #25.5
# LOE rbx r12 r13 r14 r15 eax
Besides this, the processor does not load only 24 bytes. On each access to memory, a full cache line (64 bytes) is loaded. More importantly, since the memory required for a, b, and c is contiguous, the prefetcher would definitely help a lot and load the next blocks in advance.
Having said that, I think the memory bandwidth calculated by @Mysticial is too pessimistic.
Moreover, using SIMD to improve the performance of a program for a very simple addition is mentioned in the Intel Vectorization Guide. Therefore, it seems we should be able to gain some performance improvement for this very simple loop.
Edit2:
Thanks again for your comments. Also, thanks to @Mysticial's sample code, I finally saw the effect of SIMD on performance. The problem, as Mysticial mentioned, was memory bandwidth. By choosing a small size for a, b, and c that fits into the L1 cache, it can be seen that SIMD can improve the performance significantly. Here are the results that I got:
icc -O2 -o TestSMIDNoVec -no-vec TestSMID2.c: 17.34 sec
icc -O2 -o TestSMIDVecNoUnroll -vec-report2 TestSMID2.c: 9.33 sec
And unrolling the loop improves the performance even further:
icc -O2 -o TestSMIDVecUnroll -vec-report2 TestSMID2.c -unroll=8: 8.6sec
Also, I should mention that it takes only one cycle for my processor to complete an iteration when compiled with -O2.
PS: My computer is a MacBook Pro with a Core i5 @ 2.5 GHz (dual core).
This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date.
See the end of this answer for the 2017 update.
Original Answer (2013):
Because you're bottlenecked by memory bandwidth.
While vectorization and other micro-optimizations can improve the speed of computation, they can't increase the speed of your memory.
In your example:
for(k = 0; k < LEN; k++)
c[k] = a[k] * b[k];
You are making a single pass over all the memory doing very little work. This is maxing out your memory bandwidth.
So regardless of how it's optimized, (vectorized, unrolled, etc...) it isn't gonna get much faster.
A typical desktop machine of 2013 has on the order of 10 GB/s of memory bandwidth*. Your loop touches 24 bytes/iteration.
Without vectorization, a modern x64 processor can probably do about 1 iteration a cycle*.
Suppose you're running at 4 GHz:
(4 * 10^9) * 24 bytes/iteration = 96 GB/s
That's almost 10x of your memory bandwidth - without vectorization.
*Not surprisingly, a few people doubted the numbers I gave above since I gave no citation. Well those were off the top of my head from experience. So here's some benchmarks to prove it.
The loop iteration can run as fast as 1 cycle/iteration:
We can get rid of the memory bottleneck if we reduce LEN so that it fits in cache.
(I tested this in C++ since it was easier. But it makes no difference.)
#include <iostream>
#include <stdlib.h> // for malloc()
#include <time.h>
using std::cout;
using std::endl;
int main(){
const int LEN = 256;
double *a = (double*)malloc(LEN*sizeof(*a));
double *b = (double*)malloc(LEN*sizeof(*a));
double *c = (double*)malloc(LEN*sizeof(*a));
int k;
for(k = 0; k < LEN; k++){
a[k] = rand();
b[k] = rand();
}
clock_t time0 = clock();
for (int i = 0; i < 100000000; i++){
for(k = 0; k < LEN; k++)
c[k] = a[k] * b[k];
}
clock_t time1 = clock();
cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}
Processor: Intel Core i7 2600K @ 4.2 GHz
Compiler: Visual Studio 2012
Time: 6.55 seconds
In this test, I ran 25,600,000,000 iterations in only 6.55 seconds.
6.55 * 4.2 GHz = 27,510,000,000 cycles
27,510,000,000 / 25,600,000,000 = 1.074 cycles/iteration
Now if you're wondering how it's possible to do:
2 loads
1 store
1 multiply
increment counter
compare + branch
all in one cycle...
It's because modern processors and compilers are awesome.
While each of these operations have latency (especially the multiply), the processor is able to execute multiple iterations at the same time. My test machine is a Sandy Bridge processor, which is capable of sustaining 2x128b loads, 1x128b store, and 1x256b vector FP multiply every single cycle. And potentially another one or two vector or integer ops, if the loads are memory source operands for micro-fused uops. (2 loads + 1 store throughput only when using 256b AVX loads/stores, otherwise only two total memory ops per cycle (at most one store)).
Looking at the assembly (which I'll omit for brevity), it seems that the compiler unrolled the loop, thereby reducing the looping overhead. But it didn't quite manage to vectorize it.
Memory bandwidth is on the order of 10 GB/s:
The easiest way to test this is via a memset():
#include <iostream>
#include <stdlib.h> // for calloc()
#include <string.h> // for memset()
#include <time.h>
using std::cout;
using std::endl;
int main(){
const int LEN = 1 << 30; // 1GB
char *a = (char*)calloc(LEN,1);
clock_t time0 = clock();
for (int i = 0; i < 100; i++){
memset(a,0xff,LEN);
}
clock_t time1 = clock();
cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}
Processor: Intel Core i7 2600K @ 4.2 GHz
Compiler: Visual Studio 2012
Time: 5.811 seconds
So it takes my machine 5.811 seconds to write to 100 GB of memory. That's about 17.2 GB/s.
And my processor is on the higher end. The Nehalem and Core 2 generation processors have less memory bandwidth.
Update March 2017:
As of 2017, things have gotten more complicated.
Thanks to DDR4 and quad-channel memory, it is no longer possible for a single thread to saturate memory bandwidth. But the problem of bandwidth doesn't necessarily go away. Even though bandwidth has gone up, processor cores have also improved - and there are more of them.
To put it mathematically:
Each core has a bandwidth limit X.
Main memory has a bandwidth limit of Y.
On older systems, X > Y.
On current high-end systems, X < Y. But X * (# of cores) > Y.
Back in 2013: Sandy Bridge @ 4 GHz + dual-channel DDR3 @ 1333 MHz
No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~17 GB/s
Vectorized SSE* (16-byte load/stores): X = 64 GB/s and Y = ~17 GB/s
Now in 2017: Haswell-E @ 4 GHz + quad-channel DDR4 @ 2400 MHz
No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~70 GB/s
Vectorized AVX* (32-byte load/stores): X = 64 GB/s and Y = ~70 GB/s
(For both Sandy Bridge and Haswell, architectural limits in the cache will limit bandwidth to about 16 bytes/cycle regardless of SIMD width.)
So nowadays, a single thread will not always be able to saturate memory bandwidth. And you will need to vectorize to achieve that limit of X. But you will still hit the main memory bandwidth limit of Y with 2 or more threads.
But one thing hasn't changed and probably won't change for a long time: You will not be able to run a bandwidth-hogging loop on all cores without saturating the total memory bandwidth.
As Mysticial already described, main-memory bandwidth limitations are the bottleneck for large buffers here. The way around this is to redesign your processing to work in chunks that fit in the cache. (Instead of multiplying a whole 200MiB of doubles, multiply just 128kiB, then do something with that. So the code that uses the output of the multiply will find it still in L2 cache. L2 is typically 256kiB, and is private to each CPU core, on recent Intel designs.)
This technique is called cache blocking, or loop tiling. It might be tricky for some algorithms, but the payoff is the difference between L2 cache bandwidth vs. main memory bandwidth.
If you do this, make sure the compiler isn't still generating streaming stores (movnt...). Those writes bypass the caches to avoid polluting it with data that won't fit. The next read of that data will need to touch main memory.
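A minimal loop-tiling sketch (the chunk size and the "consume" step are placeholders of mine; the point is that whatever uses c should run on each chunk while it's still hot in L2):
#include <stddef.h>

#define CHUNK (8 * 1024)   /* 8K doubles = 64 KiB per array; a, b, c chunks together fit in a 256 KiB L2 */

double multiply_blocked(const double *a, const double *b, double *c, size_t len)
{
    double sum = 0.0;
    for (size_t base = 0; base < len; base += CHUNK) {
        size_t end = (base + CHUNK < len) ? base + CHUNK : len;
        for (size_t k = base; k < end; k++)   /* produce one cache-sized chunk of c */
            c[k] = a[k] * b[k];
        for (size_t k = base; k < end; k++)   /* consume it while it's still in cache */
            sum += c[k];
    }
    return sum;
}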
EDIT: Modified the answer a lot. Also, please disregard most of what I wrote before about Mysticial's answer not being entirely correct.
Though I still do not agree that it is bottlenecked by memory: despite doing a very wide variety of tests, I couldn't see any signs of the original code being bound by memory speed. Meanwhile, it kept showing clear signs of being CPU-bound.
There can be many reasons. And since the reason[s] can be very hardware-dependent, I decided I shouldn't speculate based on guesses.
I'm just going to outline the things I encountered during later testing, where I used a much more accurate and reliable CPU-time measuring method and looped the loop 1000 times. I believe this information could be of help. But please take it with a grain of salt, as it's hardware-dependent.
When using instructions from the SSE family, the vectorized code I got was over 10% faster than the non-vectorized code.
Vectorized code using SSE-family and vectorized code using AVX ran more or less with the same performance.
When using AVX instructions, non-vectorized code ran the fastest - 25% or more faster than every other thing I tried.
Results scaled linearly with CPU clock in all cases.
Results were hardly affected by memory clock.
Results were considerably affected by memory latency - much more than memory clock, but not nearly as much as CPU clock affected the results.
WRT Mysticial's example of running nearly 1 iteration per clock - I didn't expect the CPU scheduler to be that efficient and was assuming 1 iteration every 1.5-2 clock ticks. But to my surprise, that is not the case; I sure was wrong, sorry about that. My own CPU ran it even more efficiently - 1.048 cycles/iteration. So I can attest that this part of Mysticial's answer is definitely right.
Just in case a[], b[], and c[] are fighting for the L2 cache:
#include <string.h> /* for memcpy */
...
gettimeofday(&stTime, NULL);
for(k = 0; k < LEN; k += 4) {
double a4[4], b4[4], c4[4];
memcpy(a4,a+k, sizeof a4);
memcpy(b4,b+k, sizeof b4);
c4[0] = a4[0] * b4[0];
c4[1] = a4[1] * b4[1];
c4[2] = a4[2] * b4[2];
c4[3] = a4[3] * b4[3];
memcpy(c+k,c4, sizeof c4);
}
gettimeofday(&endTime, NULL);
This reduces the running time from 98429.000000 to 67213.000000;
unrolling the loop 8-fold reduces it to 57157.000000 here.