Reverse vector order in ARM NEON intrinsics

Reverse vector order in ARM NEON intrinsics - arm

I'm trying to reverse the order of a 128 bit vector (uint16x8).
For example if I have
a b c d e f g h
I would like to obtain
h g f e d c b a
Is there a simple way to do that with NEON intrinsics?
I tried with the VREV but it doesn't work.

You want vrev64.16 instruction however it doesn't swap between double registers of a single quad register. You need to achieve that using an additional vswp.
For intrinsics
q = vrev64q_u16(q)
should do the trick for swapping inside double words, then you need to swap double words in quad register. However that gets cumbersome since there is no vswp intrinsics directly which forces you to use something like
q = vcombine_u16(vget_high_u16(q), vget_low_u16(q))
which actually ends up as a vswp instruction.
See below for an example.
#include <stdio.h>
#include <stdlib.h>
#include <arm_neon.h>
int main() {
uint16_t s[] = {0x101, 0x102, 0x103, 0x104, 0x105, 0x106, 0x107, 0x108};
uint16_t *t = malloc(sizeof(uint16_t) * 8);
for (int i = 0; i < 8; i++) {
t[i] = 0;
}
uint16x8_t a = vld1q_u16(s);
a = vrev64q_u16(a);
a = vcombine_u16(vget_high_u16(a), vget_low_u16(a));
vst1q_u16(t, a);
for (int i = 0; i < 8; i++) {
printf("0x%3x ", t[i]);
}
printf("\n");
return 0;
}
which generates an assembly like below
vld1.16 {d16-d17}, [sp:64]
movs r4, #0
vrev64.16 q8, q8
vswp d16, d17
vst1.16 {d16-d17}, [r5]
and outputs
$ rev
0x108 0x107 0x106 0x105 0x104 0x103 0x102 0x101

Related

Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute value and index)

I want to see if it's possible to write some generic SIMD code that can compile efficiently. Mostly for SSE, AVX, and NEON. A simplified version of the problem is: Find the maximum absolute value of an array of floating point numbers and return both the value and the index. It is the last part, the index of the maximum, that causes the problem. There doesn't seem to be a very good way to write code that has a branch.
See update at end for finished code using some of the suggested answers.
Here's a sample implementation (more complete version on godbolt):
#define VLEN 8
typedef float vNs __attribute__((vector_size(VLEN*sizeof(float))));
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
#define SWAP128 4,5,6,7, 0,1,2,3
#define SWAP64 2,3, 0,1, 6,7, 4,5
#define SWAP32 1, 0, 3, 2, 5, 4, 7, 6
static bool any(vNb x) {
x = x | __builtin_shufflevector(x,x, SWAP128);
x = x | __builtin_shufflevector(x,x, SWAP64);
x = x | __builtin_shufflevector(x,x, SWAP32);
return x[0];
}
float maxabs(float* __attribute__((aligned(32))) data, unsigned n, unsigned *index) {
vNs max = {0,0,0,0,0,0,0,0};
vNs tmax;
unsigned imax = 0;
for (unsigned i = 0 ; i < n; i += VLEN) {
vNs t = *(vNs*)(data + i);
t = -t < t ? t : -t; // Absolute value
vNb cmp = t > max;
if (any(cmp)) {
tmax = t; imax = i;
// broadcast horizontal max of t into every element of max
vNs tswap128 = __builtin_shufflevector(t,t, SWAP128);
t = t < tswap128 ? tswap128 : t;
vNs tswap64 = __builtin_shufflevector(t,t, SWAP64);
t = t < tswap64 ? tswap64 : t;
vNs tswap32 = __builtin_shufflevector(t,t, SWAP32);
max = t < tswap32 ? tswap32 : t;
}
}
// To simplify example, ignore finding index of true value in tmax==max
*index = imax; // + which(tmax == max);
return max[0];
}
Code on godbolt allows changing VLEN to 8 or 4.
This mostly works very well. For AVX/SSE the absolute value becomes t & 0x7fffffff using a (v)andps, i.e. clear the sign bit. For NEON it's done with vneg + fmaxnm. The block to find and broadcast the horizontal max becomes an efficient sequence of permute and max instructions. gcc is able to use NEON fabs for absolute value.
The 8 element vector on the 4 element SSE/NEON targets works well on clang. It uses a pair of instructions on two sets of registers and for the SWAP128 horizontal op will max or or the two registers without any unnecessary permute. gcc on the other hand really can't handle this and produces mostly non-SIMD code. If we reduce the vector length to 4, gcc works fine for SSE and NEON.
But there's a problem with if (any(cmp)). For clang + SSE/AVX, it works well, vcmpltps + vptest, with an orps to go from 8->4 on SSE.
But gcc and clang on NEON do all the permutes and ORs, then move the result to a gp register to test.
Is there some bit of code, other than architecture specific intrinsics, to get ptest with gcc and vmaxvq with clang/gcc and NEON?
I tried some other methods, like if (x[0] || x[1] || ... x[7]) but they were worse.
Update
I've created an updated example that shows two different implementations, both the original and "indices in a vector" method as suggested by chtz and shown in Aki Suihkonen's answer. One can see the resulting SSE and NEON output.
While some might be skeptical, the compiler does produce very good code from the generic SIMD (not auto-vectorization!) C++ code. On SSE/AVX, I see very little room to improve the code in the loop. The NEON version still troubled by a sub-optimal implementation of "any()".
Unless the data is usually in ascending order, or nearly so, my original version is still fastest on SSE/AVX. I haven't tested on NEON. This is because most loop iterations do not find a new max value and it's best to optimize for that case. The "indices in a vector" method produces a tighter loop and the compiler does a better job too, but the common case is just a bit slower on SSE/AVX. The common case might be equal or faster on NEON.
Some notes on writing generic SIMD code.
The absolute value of a vector of floats can be found with the following. It produces optimal code on SSE/AVX (and with a mask that clears the sign bit) and on NEON (the fabs instruction).
static vNs vabs(vNs x) {
return -x < x ? x : -x;
}
This will do a vertical max efficiently on SSE/AVX/NEON. It doesn't do a compare; it produces the architecture's "max' instruction. On NEON, changing it to use > instead of < causes the compiler to produce very bad scalar code. Something with denormals or exceptions I guess.
template <typename v> // Deduce vector type (float, unsigned, etc.)
static v vmax(v a, v b) {
return a < b ? b : a; // compiles best with "<" as compare op
}
This code will broadcast the horizontal max across a register. It compiles very well on SSE/AVX. On NEON, it would probably be better if the compiler could use a horizontal max instruction and then broadcast the result. I was impressed to see that if one uses 8 element vectors on SSE/NEON, which have only 4 element registers, the compiler is smart enough to use just one register for the broadcasted result, since the top 4 and bottom 4 elements are the same.
template <typename v>
static v hmax(v x) {
if (VLEN >= 8)
x = vmax(x, __builtin_shufflevector(x,x, SWAP128));
x = vmax(x, __builtin_shufflevector(x,x, SWAP64));
return vmax(x, __builtin_shufflevector(x,x, SWAP32));
}
This is the best "any()" I found. It is optimal on SSE/AVX, using a single ptest instruction. On NEON it does the permutes and ORs, instead of a horizontal max instruction, but I haven't found a way to get anything better on NEON.
static bool any(vNb x) {
if (VLEN >= 8)
x |= __builtin_shufflevector(x,x, SWAP128);
x |= __builtin_shufflevector(x,x, SWAP64);
x |= __builtin_shufflevector(x,x, SWAP32);
return x[0];
}
Also interesting, on AVX the code i = i + 1 will be compiled to vpsubd ymmI, ymmI, ymmNegativeOne, i.e. subtract -1. Why? Because a vector of -1s is produced with vpcmpeqd ymm0, ymm0, ymm0 and that's faster than broadcasting a vector of 1s.
Here is the best which() I've come up with. This gives you the index of the 1st true value in a vector of booleans (0 = false, -1 = true). One can do somewhat better on AVX with movemask. I don't know about the best NEON.
// vector of signed ints
typedef int vNi __attribute__((vector_size(VLEN*sizeof(int))));
// vector of bytes, same number of elements, 1/4 the size
typedef unsigned char vNb __attribute__((vector_size(VLEN*sizeof(unsigned char))));
// scalar type the same size as the byte vector
using sNb = std::conditional_t<VLEN == 4, uint32_t, uint64_t>;
static int which(vNi x) {
vNb cidx = __builtin_convertvector(x, vNb);
return __builtin_ctzll((sNb)cidx) / 8u;
}

As commented by chtz, the most generic and typical method is to have another mask to gather indices:
Vec8s indices = { 0,1,2,3,4,5,6,7};
Vec8s max_idx = indices;
Vec8f max_abs = abs(load8(ptr));
for (auto i = 8; i + 8 <= vec_length; i+=8) {
Vec8s data = abs(load8(ptr[i]));
auto mask = is_greater(data, max_abs);
max_idx = bitselect(mask, indices, max_idx);
max_abs = max(max_abs, data);
indices = indices + 8;
}
Another option is to interleave the values and indices:
auto data = load8s(ptr) & 0x7fffffff; // can load data as int32_t
auto idx = vec8s{0,1,2,3,4,5,6,7};
auto lo = zip_lo(idx, data);
auto hi = zip_hi(idx, data);
for (int i = 8; i + 8 <= size; i+=8) {
idx = idx + 8;
auto d1 = load8s(ptr + i) & 0x7fffffff;
auto lo1 = zip_lo(idx, d1);
auto hi1 = zip_hi(idx, d1);
lo = max_u64(lo, lo1);
hi = max_u64(hi, hi1);
}
This method is especially lucrative, if the range of inputs is small enough to shift the input left, while appending a few bits from the index to the LSB bits of the same word.
Even in this case we can repurpose 1 bit in the float allowing us to save one half of the bit/index selection operations.
auto data0 = load8u(ptr) << 1; // take abs by shifting left
auto data1 = (load8u(ptr + 8) << 1) + 1; // encode odd index to data
auto mx = max_u32(data0, data1); // the LSB contains one bit of index
Looks like one can use double as the storage, since even SSE2 supports _mm_max_pd (some attention needs to be given to Inf/Nan handling, which don't encode as Inf/Nan any more when reinterpreted as the high part of 64-bit double).

UPD: the no-aligning issue is fixed now, all the examples on godbolt use aligned reads.
UPD: MISSED THE ABS
Terribly sorry about that, I missed the absolute value from the definition.
I do not have the measurements, but here are all 3 functions vectorised:
max value with abs: https://godbolt.org/z/6Wznrc5qq
find with abs: https://godbolt.org/z/61r9Efxvn
one pass with abs: https://godbolt.org/z/EvdbfnWjb
Asm stashed in a gist
On the method
The way to do max element with simd is to first find the value and then find the index.
Alternatively you have to keep a register of indexes and blend the indexes.
This requires keeping indexes, doing more operations and the problem of the overflow needs to be addressed.
Here are my timings on avx2 by type (char, short and int) for 10'000 bytes of data
The min_element is my implementation of keeping the index.
reduce(min) + find is doing two loops - first get the value, then find where.
For ints (should behave like floats), performance is 25% faster for the two loops solution, at least on my measurements.
For completeness, comparisons against scalar for both methods - this is definitely an operation that should be vectorized.
How to do it
finding the maximum value is auto-vectorised across all platforms if you write it as reduce
if (!arr.size()) return {};
// std::reduce is also ok, just showing for more C ppl
float res = arr[0];
for (int i = 1; i != (int)arr.size(); ++i) {
res = res > arr[i] ? res : arr[i];
}
return res;
https://godbolt.org/z/EsazWf1vT
Now the find portion is trickier, non of the compilers I know autovectorize find
We have eve library that provides you with find algorithm: https://godbolt.org/z/93a98x6Tj
Or I explain how to implement find in this talk if you want to do it yourself.
UPD:
UPD2: changed the blend to max
#Peter Cordes in the comments said that there is maybe a point to doing the one pass solution in case of bigger data.
I have no evidence of this - my measurements point to reduce + find.
However, I hacked together roughly how keeping the index looks (there is an aligning issue at the moment, we should definitely align reads here)
https://godbolt.org/z/djrzobEj4
AVX2 main loop:
.L6:
vmovups ymm6, YMMWORD PTR [rdx]
add rdx, 32
vcmpps ymm3, ymm6, ymm0, 30
vmaxps ymm0, ymm6, ymm0
vpblendvb ymm3, ymm2, ymm1, ymm3
vpaddd ymm1, ymm5, ymm1
vmovdqa ymm2, ymm3
cmp rcx, rdx
jne .L6
ARM-64 main loop:
.L6:
ldr q3, [x0], 16
fcmgt v4.4s, v3.4s, v0.4s
fmax v0.4s, v3.4s, v0.4s
bit v1.16b, v2.16b, v4.16b
add v2.4s, v2.4s, v5.4s
cmp x0, x1
bne .L6
Links to ASM if godbolt becomes stale: https://gist.github.com/DenisYaroshevskiy/56d82c8cf4a4dd5bf91d58b053ea80f2

I don’t believe that’s possible. Compilers aren’t smart enough to do that efficiently.
Compare the other answer (which uses NEON-like pseudocode) with the SSE version below:
// Compare vector absolute value with aa, if greater update both aa and maxIdx
inline void updateMax( __m128 vec, __m128i idx, __m128& aa, __m128& maxIdx )
{
vec = _mm_andnot_ps( _mm_set1_ps( -0.0f ), vec );
const __m128 greater = _mm_cmpgt_ps( vec, aa );
aa = _mm_max_ps( vec, aa );
// If you don't have SSE4, emulate with bitwise ops: and, andnot, or
maxIdx = _mm_blendv_ps( maxIdx, _mm_castsi128_ps( idx ), greater );
}
float maxabs_sse4( const float* rsi, size_t length, size_t& index )
{
// Initialize things
const float* const end = rsi + length;
const float* const endAligned = rsi + ( ( length / 4 ) * 4 );
__m128 aa = _mm_set1_ps( -1 );
__m128 maxIdx = _mm_setzero_ps();
__m128i idx = _mm_setr_epi32( 0, 1, 2, 3 );
// Main vectorized portion
while( rsi < endAligned )
{
__m128 vec = _mm_loadu_ps( rsi );
rsi += 4;
updateMax( vec, idx, aa, maxIdx );
idx = _mm_add_epi32( idx, _mm_set1_epi32( 4 ) );
}
// Handle the remainder, if present
if( rsi < end )
{
__m128 vec;
if( length > 4 )
{
// The source has at least 5 elements
// Offset the source pointer + index back, by a few elements
const int offset = (int)( 4 - ( length % 4 ) );
rsi -= offset;
idx = _mm_sub_epi32( idx, _mm_set1_epi32( offset ) );
vec = _mm_loadu_ps( rsi );
}
else
{
// The source was smaller than 4 elements, copy them into temporary buffer and load vector from there
alignas( 16 ) float buff[ 4 ];
_mm_store_ps( buff, _mm_setzero_ps() );
for( size_t i = 0; i < length; i++ )
buff[ i ] = rsi[ i ];
vec = _mm_load_ps( buff );
}
updateMax( vec, idx, aa, maxIdx );
}
// Reduce to scalar
__m128 tmpMax = _mm_movehl_ps( aa, aa );
__m128 tmpMaxIdx = _mm_movehl_ps( maxIdx, maxIdx );
__m128 greater = _mm_cmpgt_ps( tmpMax, aa );
aa = _mm_max_ps( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
// SSE3 has 100% market penetration in 2022
tmpMax = _mm_movehdup_ps( tmpMax );
tmpMaxIdx = _mm_movehdup_ps( tmpMaxIdx );
greater = _mm_cmpgt_ss( tmpMax, aa );
aa = _mm_max_ss( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
index = (size_t)_mm_cvtsi128_si32( _mm_castps_si128( maxIdx ) );
return _mm_cvtss_f32( aa );
}
As you see, pretty much everything is completely different. Not just the boilerplate about remainder and final reduction, the main loop is very different too.
SSE doesn’t have bitselect; blendvps is not quite that, it selects 32-bit lanes based on high bit of the selector. Unlike NEON, SSE doesn’t have instructions for absolute value, need to be emulated with bitwise andnot.
The final reduction going to be completely different as well. NEON has very limited shuffles, but it has better horizontal operations, like vmaxvq_f32 which finds horizontal maximum over the complete SIMD vector.

Why the cos function in math.h faster than x86 fcos instruction

The cos() in math.h run faster than the x86 asm fcos.
The following code is compare between the x86 fcos and the cos() in math.h.
In this code, 1000000 times asm fcos cost 150ms; 1000000 times cos() call cost only 80ms.
How is the fcos implemented in x86?
Why is the fcos much slower than cos()?
My enviroment is intel i7-6820HQ + win10 + visual studio 2017.
#include "string"
#include "iostream"
#include<time.h>
#include "math.h"
int main()
{
int i;
const int i_max = 1000000;
float c = 10000;
float *d = &c;
float start_value = 8.333333f;
float* pstart_value = &start_value;
clock_t a, b;
a = clock();
__asm {
mov edx, pstart_value;
fld [edx];
}
for (i = 0; i < i_max; i++) {
__asm {
fcos;
}
}
b = clock();
printf("asm time = %u", b - a);
a = clock();
double y;
for (i = 0; i < i_max; i++) {
start_value = cos(start_value);
}
b = clock();
printf("math time = %u", b - a);
return 0;
}
According to my personal understanding, a single asm instruction is usually faster than a function call.
Why in this case the fcos so slow?
Update:
I have run the same code on another laptop with i7-6700HQ.
On this laptop the 1000000 times fcos cost only 51ms. Why there is such a big difference between the two cpus.

I bet the answer is easy. You do not use the result of cos and it is optimized out as in this example
https://godbolt.org/z/iw-nft
Change the variables to volatile to force cos call.
https://godbolt.org/z/9_dpMs
Another guess:
Maybe your cos implementation uses lookup tables. Then it will be faster than the hardware implementation.

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.
Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)
I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:
// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
if (mask & (1ull << i)) {
target[i]++
}
}
The problem is that I need do the above loops on tens of millions of masks and I need to finish in less than a second. Wonder if there are any way to speed it up, like using some sort special assembly instruction that represents the above loop.
Currently I use gcc 4.8.4 on ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code and took about 2 seconds. Would love to make it run under 200ms.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];
int main(int argc, char *argv[]) {
int i, j;
unsigned long x = 123;
unsigned long m = 1;
char *p = malloc(8 * 10000000);
if (!p) {
printf("failed to allocate\n");
exit(0);
}
memset(p, 0xff, 80000000);
printf("p=%p\n", p);
unsigned long *pLong = (unsigned long*)p;
double start = getTS();
for (j = 0; j < 10000000; j++) {
m = 1;
for (i = 0; i < 64; i++) {
if ((pLong[j] & m) == m) {
target[i]++;
}
m = (m << 1);
}
}
printf("took %f secs\n", getTS() - start);
return 0;
}
Thanks in advance!

On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. it runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocate\n");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%p\n", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }\n");
printf("took %f secs\n", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.

Related:
an earlier duplicate has some alternate ideas: How to quickly count bits into separate bins in a series of ints on Sandy Bridge?.
Harold's answer on AVX2 column population count algorithm over each bit-column separately.
Matrix transpose and population count has a couple useful answers with AVX2, including benchmarks. It uses 32-bit chunks instead of 64-bit.
Also: https://github.com/mklarqvist/positional-popcount has SSE blend, various AVX2, various AVX512 including Harley-Seal which is great for large arrays, and various other algorithms for positional popcount. Possibly only for uint16_t, but most could be adapted for other word widths. I think the algorithm I propose below is what they call adder_forest.
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize your loop-over-bits for you, even if you write it branchlessly to give them a better chance.
And unfortunately not smart enough to auto-vectorize the fast version that gradually widens and adds.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is good: Unpack bits to something narrower than the final count array gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to accumulate into the 32-bit counter array.
It only takes 4x 16-byte __m128i vectors to hold all 64 uint8_t elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[] once at the very end, after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
An even better strategy is to widen gradually
Starting with 2-bit accumulators, then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data, not "diluting" it too much right away. Higher information / entropy density means that each instruction does more useful work.
Using SWAR techniques for 32x 2-bit add inside scalar or SIMD registers is easy / cheap because we need to avoid the possibility of carry out the top of an element anyway. With proper SIMD, we'd lose those counts, with SWAR we'd corrupt the next element.
uint64_t x = *(input++); // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555; // 0b...01010101;
uint64_t lo = x & even_1bits;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo; // can do up to 3 iterations of this without overflow
accum2_hi += hi; // because a 2-bit integer overflows at 4
Then you repeat up to 4 vectors of 4-bit elements, then 8 vectors of 8-bit elements, then you should widen all the way to 32 and accumulate into the array in memory because you'll run out of registers anyway, and this outer outer loop work is infrequent enough that we don't need to bother with going to 16-bit. (Especially if we manually vectorize).
Biggest downside: this doesn't auto-vectorize, unlike #njuffa's version. But with gcc -O3 -march=sandybridge for AVX1 (then running the code on Skylake), this running scalar 64-bit is actually still slightly faster than 128-bit AVX auto-vectorized asm from #njuffa's code.
But that's timing on Skylake, which has 4 scalar ALU ports (and mov-elimination), while Sandybridge lacks mov-elimination and only has 3 ALU ports, so the scalar code will probably hit back-end execution-port bottlenecks. (But SIMD code may be nearly as fast, because there's plenty of AND / ADD mixed with the shifts, and SnB does have SIMD execution units on all 3 of its ports that have any ALUs on them. Haswell just added port 6, for scalar-only including shifts and branches.)
With good manual vectorization, this should be a factor of almost 2 or 4 faster.
But if you have to choose between this scalar or #njuffa's with AVX2 autovectorization, #njuffa's is faster on Skylake with -march=native
If building on a 32-bit target is possible/required, this suffers a lot (without vectorization because of using uint64_t in 32-bit registers), while vectorized code barely suffers at all (because all the work happens in vector regs of the same width).
// TODO: put the target[] re-ordering somewhere
// TODO: cleanup for N not a multiple of 3*4*21 = 252
// TODO: manual vectorize with __m128i, __m256i, and/or __m512i
void sum_gradual_widen (const uint64_t *restrict input, unsigned int *restrict target, size_t length)
{
const uint64_t *endp = input + length - 3*4*21; // 252 masks per outer iteration
while(input <= endp) {
uint64_t accum8[8] = {0}; // 8-bit accumulators
for (int k=0 ; k<21 ; k++) {
uint64_t accum4[4] = {0}; // 4-bit accumulators can hold counts up to 15. We use 4*3=12
for(int j=0 ; j<4 ; j++){
uint64_t accum2_lo=0, accum2_hi=0;
for(int i=0 ; i<3 ; i++) { // the compiler should fully unroll this
uint64_t x = *input++; // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555;
uint64_t lo = x & even_1bits; // 0b...01010101;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo;
accum2_hi += hi; // can do up to 3 iterations of this without overflow
}
const uint64_t even_2bits = 0x3333333333333333;
accum4[0] += accum2_lo & even_2bits; // 0b...001100110011; // same constant 4 times, because we shift *first*
accum4[1] += (accum2_lo >> 2) & even_2bits;
accum4[2] += accum2_hi & even_2bits;
accum4[3] += (accum2_hi >> 2) & even_2bits;
}
for (int i = 0 ; i<4 ; i++) {
accum8[i*2 + 0] += accum4[i] & 0x0f0f0f0f0f0f0f0f;
accum8[i*2 + 1] += (accum4[i] >> 4) & 0x0f0f0f0f0f0f0f0f;
}
}
// char* can safely alias anything.
unsigned char *narrow = (uint8_t*) accum8;
for (int i=0 ; i<64 ; i++){
target[i] += narrow[i];
}
}
/* target[0] = bit 0
* target[1] = bit 8
* ...
* target[8] = bit 1
* target[9] = bit 9
* ...
*/
// TODO: 8x8 transpose
}
We don't care about order, so accum4[0] has 4-bit accumulators for every 4th bit, for example. The final fixup needed (but not yet implemented) at the very end is an 8x8 transpose of the uint32_t target[64] array, which can be done efficiently using unpck and vshufps with only AVX1. (Transpose an 8x8 float using AVX/AVX2). And also a cleanup loop for the last up to 251 masks.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
Benchmark results on Arch Linux i7-6700k from #njuffa's test harness, with this added. (Godbolt) N = (10000000 / (3*4*21) * 3*4*21) = 9999864 (i.e. 10000000 rounded down to a multiple of the 252 iteration "unroll" factor, so my simplistic implementation is doing the same amount of work, not counting re-ordering target[] which it doesn't do, so it does print mismatch results.
But the printed counts match another position of the reference array.)
I ran the program 4x in a row (to make sure the CPU was warmed up to max turbo) and took one of the runs that looked good (none of the 3 times abnormally high).
ref: the best bit-loop (next section)
fast: #njuffa's code. (auto-vectorized with 128-bit AVX integer instructions).
gradual: my version (not auto-vectorized by gcc or clang, at least not in the inner loop.) gcc and clang fully unroll the inner 12 iterations.
gcc8.2 -O3 -march=sandybridge -fpie -no-pie
ref: 0.331373 secs, fast: 0.011387 secs, gradual: 0.009966 secs
gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.397175 secs, fast: 0.011255 secs, gradual: 0.010018 secs
clang7.0 -O3 -march=sandybridge -fpie -no-pie
ref: 0.352381 secs, fast: 0.011926 secs, gradual: 0.009269 secs (very low counts for port 7 uops, clang used indexed addressing for stores)
clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.293014 secs, fast: 0.011777 secs, gradual: 0.009235 secs
-march=skylake (allowing AVX2 for 256-bit integer vectors) helps both, but #njuffa's most because more of it vectorizes (including its inner-most loop):
gcc8.2 -O3 -march=skylake -fpie -no-pie
ref: 0.328725 secs, fast: 0.007621 secs, gradual: 0.010054 secs (gcc shows no gain for "gradual", only "fast")
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.333922 secs, fast: 0.007620 secs, gradual: 0.009866 secs
clang7.0 -O3 -march=skylake -fpie -no-pie
ref: 0.260616 secs, fast: 0.007521 secs, gradual: 0.008535 secs (IDK why gradual is faster than -march=sandybridge; it's not using BMI1 andn. I guess because it's using 256-bit AVX2 for the k=0..20 outer loop with vpaddq)
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.259159 secs, fast: 0.007496 secs, gradual: 0.008671 secs
Without AVX, just SSE4.2: (-march=nehalem), bizarrely clang's gradual is faster than with AVX / tune=sandybridge. "fast" is only barely slower than with AVX.
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.337178 secs, fast: 0.011983 secs, gradual: 0.010587 secs
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.293555 secs, fast: 0.012549 secs, gradual: 0.008697 secs
-fprofile-generate / -fprofile-use help some for GCC, especially for the "ref" version where it doesn't unroll at all by default.
I highlighted the best, but often they're within measurement noise margin of each other. It's unsurprising the -fno-pie -no-pie was sometimes faster: indexing static arrays with [disp32 + reg] is not an indexed addressing mode, just base + disp32, so it doesn't ever unlaminate on Sandybridge-family CPUs.
But with gcc sometimes -fpie was faster; I didn't check but I assume gcc just shot itself in the foot somehow when 32-bit absolute addressing was possible. Or just innocent-looking differences in code-gen happened to cause alignment or uop-cache problems; I didn't check in detail.
For SIMD, we can simply do 2 or 4x uint64_t in parallel, only accumulating horizontally in the final step where we widen bytes to 32-bit elements. (Perhaps by shuffling in-lane and then using pmaddubsw with a multiplier of _mm256_set1_epi8(1) to add horizontal byte pairs into 16-bit elements.)
TODO: manually-vectorized __m128i and __m256i (and __m512i) versions of this. Should be close to 2x, 4x, or even 8x faster than the "gradual" times above. Probably HW prefetch can still keep up with it, except maybe an AVX512 version with data coming from DRAM, especially if there's contention from other threads. We do a significant amount of work per qword we read.
Obsolete code: improvements to the bit-loop
Your portable scalar version can be improved, too, speeding it up from ~1.92 seconds (with a 34% branch mispredict rate overall, with the fast loops commented out!) to ~0.35sec (clang7.0 -O3 -march=sandybridge) with a properly random input on 3.9GHz Skylake. Or 1.83 sec for the branchy version with != 0 instead of == m, because compilers fail to prove that m always has exactly 1 bit set and/or optimize accordingly.
(vs. 0.01 sec for #njuffa's or my fast version above, so this is pretty useless in an absolute sense, but it's worth mentioning as a general optimization example of when to use branchless code.)
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += (pLong[j] & m != 0);, using bool->int conversion to get a 0 / 1 integer.
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much much worse than we can do with SSE2 or AVX, or even scalar using #chrqlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):
for (int j = 0; j < 10000000; j++) {
#if 1 // extract low bit directly
unsigned long long tmp = pLong[j];
for (int i=0 ; i<64 ; i++) { // while(tmp) could mispredict, but good for sparse data
target[i] += tmp&1;
tmp >>= 1;
}
#else // bool -> int shifting a mask
unsigned long m = 1;
for (i = 0; i < 64; i++) {
target[i]+= (pLong[j] & m) != 0;
m = (m << 1);
}
#endif
Note that unsigned long is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode), and Windows x64. Or in 32-bit ABIs like i386 System V.
Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uops in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2: # =>This Loop Header: Depth=1
# outer loop loads a uint64 from the src
mov rdx, qword ptr [r14 + 8*rbx]
mov rsi, -256
.LBB1_3: # Parent Loop BB1_2 Depth=1
# do {
mov edi, edx
and edi, 1 # isolate the low bit
add dword ptr [rsi + target+256], edi # and += into target
mov edi, edx
shr edi
and edi, 1 # isolate the 2nd bit
add dword ptr [rsi + target+260], edi
shr rdx, 2 # tmp >>= 2;
add rsi, 8
jne .LBB1_3 # } while(offset += 8 != 0);
This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.
If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data, everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().

One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ... 56, second accumulator counts bits in positions 1, 9, 17, ... 57; and so on. After we are finished processing a block of data, we transfers the counts form the byte-wise accumulator into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:
/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 # 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:
p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs
where ref refers to the asker's original solution. The speed-up here is about a factor 74x. Different speed-ups will be observed with other (and especially newer) compilers.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
/*
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
#define N (10000000)
#define BITS (64)
#define BLOCK_SIZE (255)
/* cupdate the count of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
int main (void)
{
double start_ref, stop_ref, start, stop;
uint64_t *pLong;
unsigned int target_ref [BITS] = {0};
unsigned int target [BITS] = {0};
int i, j;
pLong = malloc (sizeof(pLong[0]) * N);
if (!pLong) {
printf("failed to allocate\n");
return EXIT_FAILURE;
}
printf("p=%p\n", pLong);
/* init data */
for (j = 0; j < N; j++) {
pLong[j] = KISS64;
}
/* count bits slowly */
start_ref = second();
for (j = 0; j < N; j++) {
uint64_t m = 1;
for (i = 0; i < BITS; i++) {
if ((pLong[j] & m) == m) {
target_ref[i]++;
}
m = (m << 1);
}
}
stop_ref = second();
/* count bits fast */
start = second();
for (j = 0; j < N / BLOCK_SIZE; j++) {
sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
}
sum_block (pLong, target, j * BLOCK_SIZE, N);
stop = second();
/* check whether result is correct */
for (i = 0; i < BITS; i++) {
if (target[i] != target_ref[i]) {
printf ("error # %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
}
}
/* print benchmark results */
printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
return EXIT_SUCCESS;
}

For starters, the problem of unpacking the bits, because seriously, you do not want to test each bit individually.
So just follow the following strategy for unpacking the bits into bytes of a vector: https://stackoverflow.com/a/24242696/2879325
Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you would have to expect potential overflows, so you need to transfer.
After each block of 255, unpack again to 32bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of byte accumulators).
At 8 instructions per bitmask (4 per each lower and higher 32-bit with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second and core on an recent CPU.
typedef uint64_t T;
const size_t bytes = 8;
const size_t bits = bytes * 8;
const size_t block_size = 128;
static inline __m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x); // we only use the low 32bits of each lane, but this is fine with AVX2
// Each byte gets the source byte containing the corresponding bit
const __m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
const __m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);
// this is the extra step: byte == 0 ? 0 : -1
return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
}
void bitcount_vectorized(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count - (count % block_size); outer += block_size)
{
__m256i temp_accumulator[bits / 32] = { _mm256_setzero_si256() };
for (size_t inner = 0; inner < block_size; ++inner) {
for (size_t j = 0; j < bits / 32; j++)
{
const auto unpacked = expand_bits_to_bytes(static_cast<uint32_t>(data[outer + inner] >> (j * 32)));
temp_accumulator[j] = _mm256_sub_epi8(temp_accumulator[j], unpacked);
}
}
for (size_t j = 0; j < bits; j++)
{
accumulator[j] += ((uint8_t*)(&temp_accumulator))[j];
}
}
for (size_t outer = count - (count % block_size); outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
void bitcount_naive(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
Depending on the chose compiler, the vectorized form achieved roughly a factor 25 speedup over the naive one.
On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.
Surprisingly, this is actually still 50% slower than the solution proposed by #njuffa.

See
Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire (7 Nov 2019)
Faster Population Counts using AVX2 Instructions by Wojciech Muła, Nathan Kurz, Daniel Lemire (23 Nov 2016).
Basically, each full adder compresses 3 inputs to 2 outputs. So one can eliminate an entire 256-bit word for the price of 5 logic instructions. The full adder operation could be repeated until registers become exhausted. Then results in the registers are accumulated (as seen in most of the other answers).
Positional popcnt for 16-bit subwords is implemented here:
https://github.com/mklarqvist/positional-popcount
// Carry-Save Full Adder (3:2 compressor)
b ^= a;
a ^= c;
c ^= b; // xor sum
b |= a;
b ^= c; // carry
Note: the accumulate step for positional-popcnt is more expensive than for normal simd popcnt. Which I believe makes it feasible to add a couple of half-adders to the end of the CSU, it might pay to go all the way up to 256 words before accumulating.

GCC generates redundant code for repeated XOR of an array element

GCC is giving me a hard time generating optimal assembly for following source code:
memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
X[0] ^= table[i][Y[i]].asQWord;
}
X being an uint64_t[2] array, and
Y being an unsigned char[16] array, and
table being a double dimensional array of union qword_t:
union qword_t {
uint8_t asBytes[8];
uint64_t asQWord;
};
const union qword_t table[16][256] = /* ... */;
With options -m64 -Ofast -mno-sse it does unroll the loop, and each xor with assignment results in 3 instructions (thus overall number of instructions issued is 3 * 16 = 48):
movzx r9d, byte ptr [Y + i] ; extracting byte
xor rax, qword ptr [table + r9*8 + SHIFT] ; xoring, SHIFT = i * 0x800
mov qword ptr [X], rax ; storing result
Now, my understanding is that resulting X value could be accumulated in rax register throughout all 16 xors, and then it could be stored at [X] address, which could be achieved with these two instructions for each xor with assignment:
movzx r9d, byte ptr [Y + i] ; extracting byte
xor rax, qword ptr [table + r9*8 + SHIFT] ; xoring, SHIFT = i * 0x800
and single storing:
mov qword ptr [X], rax ; storing result
(In this case overall number of instructions is 2 * 16 + 1 = 33)
Why does GCC generate these redundant mov instructions? What can I do to avoid this?
P.S. C99, GCC 5.3.0, Intel Core i5 Sandy Bridge

Redundant stores are usually down to aliasing; in this case gcc would be unable to prove to its satisfaction that the store to X[0] does not affect table. It makes a big difference how the variables are passed to the routine; if they are globals or members of the same larger struct then proving non-aliasing is easier.
Example:
void f1(uint64_t X[2]) {
memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
X[0] ^= table[i][Y[i]].asQWord;
}
}
uint64_t X[2];
void f2() {
memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
X[0] ^= table[i][Y[i]].asQWord;
}
}
Here the store to X[0] is sunk out of the loop in f2 but not in f1, because only in f2 can gcc prove that X does not alias members of table.
Your workaround/fix could be to adjust how the parameters are passed, to use the restrict specifier, or to manually sink the store yourself.

To avoid this, you could use this instead:
uint64_t v = 0;
for (int i= 0; i < 16; ++i) {
v ^= table[i][Y[i]].asQWord;
}
X[0] = v;
X[1] = 0;
You can easily notice the generated instructions are sub-optimal in your case, however for different reasons gcc may not be able to determine that. (And in this case, gcc cannot determine that table will never access the same memory-region as X, as ecatmur explains more elaborately.)

Sort integer in array in Objective-C [duplicate]

Answering to another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 integers?
As the question is very low level:
we can't assume libraries are available (and the call itself has its cost), only plain C
to avoid emptying instruction pipeline (that has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||).
room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.
Really this question is a kind of Golf where the goal is not to minimize source length but execution time. I call it 'Zening' code as used in the title of the book Zen of Code optimization by Michael Abrash and its sequels.
As for why it is interesting, there is several layers:
the example is simple and easy to understand and measure, not much C skill involved
it shows effects of choice of a good algorithm for the problem, but also effects of the compiler and underlying hardware.
Here is my reference (naive, not optimized) implementation and my test set.
#include <stdio.h>
static __inline__ int sort6(int * d){
char j, i, imin;
int tmp;
for (j = 0 ; j < 5 ; j++){
imin = j;
for (i = j + 1; i < 6 ; i++){
if (d[i] < d[imin]){
imin = i;
}
}
tmp = d[j];
d[j] = d[imin];
d[imin] = tmp;
}
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}
int main(int argc, char ** argv){
int i;
int d[6][5] = {
{1, 2, 3, 4, 5, 6},
{6, 5, 4, 3, 2, 1},
{100, 2, 300, 4, 500, 6},
{100, 2, 3, 4, 500, 6},
{1, 200, 3, 4, 5, 600},
{1, 1, 2, 1, 2, 1}
};
    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
    sort6(d[i]);
    /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
     *  d[i][0], d[i][6], d[i][7],
      *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}
Raw results
As number of variants is becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those showed above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset).
I gave the answer to Daniel Stutzbach (for golfing) one year ago as he was at the source of the fastest solution at that time (sorting networks).
Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2
Direct call to qsort library function : 689.38
Naive implementation (insertion sort) : 285.70
Insertion Sort (Daniel Stutzbach) : 142.12
Insertion Sort Unrolled : 125.47
Rank Order : 102.26
Rank Order with registers : 58.03
Sorting Networks (Daniel Stutzbach) : 111.68
Sorting Networks (Paul R) : 66.36
Sorting Networks 12 with Fast Swap : 58.86
Sorting Networks 12 reordered Swap : 53.74
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.54
Reordered Sorting Network w/ fast swap V2 : 33.63
Inlined Bubble Sort (Paolo Bonzini) : 48.85
Unrolled Insertion Sort (Paolo Bonzini) : 75.30
Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1
Direct call to qsort library function : 705.93
Naive implementation (insertion sort) : 135.60
Insertion Sort (Daniel Stutzbach) : 142.11
Insertion Sort Unrolled : 126.75
Rank Order : 46.42
Rank Order with registers : 43.58
Sorting Networks (Daniel Stutzbach) : 115.57
Sorting Networks (Paul R) : 64.44
Sorting Networks 12 with Fast Swap : 61.98
Sorting Networks 12 reordered Swap : 54.67
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.24
Reordered Sorting Network w/ fast swap V2 : 33.07
Inlined Bubble Sort (Paolo Bonzini) : 45.79
Unrolled Insertion Sort (Paolo Bonzini) : 80.15
I included both -O1 and -O2 results because surprisingly for several programs O2 is less efficient than O1. I wonder what specific optimization has this effect ?
Comments on proposed solutions
Insertion Sort (Daniel Stutzbach)
As expected minimizing branches is indeed a good idea.
Sorting Networks (Daniel Stutzbach)
Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).
Sorting Networks (Paul R)
The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?
Sorting Networks 12 SWAP with Fast Swap
As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.
It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.
Calling Library qsort
To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).
Rank order
Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimizing this particular code (there was very little difference for other proposals).
update:
As published figures above shows this effect was still enhanced by later versions of gcc and Rank Order became consistently twice as fast as any other alternative.
Sorting Networks 12 with reordered Swap
The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.
Sorting Networks 12 with Simple Swap
One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.
The "best" code is now as follow:
static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x)
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
const int b = max(d[x], d[y]); \
d[x] = a; d[y] = b; }
SWAP(1, 2);
SWAP(4, 5);
SWAP(0, 2);
SWAP(3, 5);
SWAP(0, 1);
SWAP(3, 4);
SWAP(1, 4);
SWAP(0, 3);
SWAP(2, 5);
SWAP(1, 3);
SWAP(2, 4);
SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}
If we believe our test set (and, yes it is quite poor, it's mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That put each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible ?

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.
Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted dat, so it will be the better choice if there's an above-average chance of almost-sorted data.
The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.
Here's an insertion sort implementation:
static __inline__ int sort6(int *d){
int i, j;
for (i = 1; i < 6; i++) {
int tmp = d[i];
for (j = i; j >= 1 && tmp < d[j-1]; j--)
d[j] = d[j-1];
d[j] = tmp;
}
}
Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:
static __inline__ int sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
SWAP(1, 2);
SWAP(0, 2);
SWAP(0, 1);
SWAP(4, 5);
SWAP(3, 5);
SWAP(3, 4);
SWAP(0, 3);
SWAP(1, 4);
SWAP(2, 5);
SWAP(2, 4);
SWAP(1, 3);
SWAP(2, 3);
#undef SWAP
}

Here's an implementation using sorting networks:
inline void Sort2(int *p0, int *p1)
{
const int temp = min(*p0, *p1);
*p1 = max(*p0, *p1);
*p0 = temp;
}
inline void Sort3(int *p0, int *p1, int *p2)
{
Sort2(p0, p1);
Sort2(p1, p2);
Sort2(p0, p1);
}
inline void Sort4(int *p0, int *p1, int *p2, int *p3)
{
Sort2(p0, p1);
Sort2(p2, p3);
Sort2(p0, p2);
Sort2(p1, p3);
Sort2(p1, p2);
}
inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
{
Sort3(p0, p1, p2);
Sort3(p3, p4, p5);
Sort2(p0, p3);
Sort2(p2, p5);
Sort4(p1, p2, p3, p4);
}
You really need very efficient branchless min and max implementations for this, since that is effectively what this code boils down to - a sequence of min and max operations (13 of each, in total). I leave this as an exercise for the reader.
Note that this implementation lends itself easily to vectorization (e.g. SIMD - most SIMD ISAs have vector min/max instructions) and also to GPU implementations (e.g. CUDA - being branchless there are no problems with warp divergence etc).
See also: Fast algorithm implementation to sort very small list

Since these are integers and compares are fast, why not compute the rank order of each directly:
inline void sort6(int *d) {
int e[6];
memcpy(e,d,6*sizeof(int));
int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
int o5 = 15-(o0+o1+o2+o3+o4);
d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}

Looks like I got to the party a year late, but here we go...
Looking at the assembly generated by gcc 4.5.2 I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads at stores to be as close as possible to there the registers are first needed and last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro which gcc converts into something similar to Gunderson's, but gcc is able to better order the instructions since they aren't given as explicit assembly.
I used the same swap order as the reordered swap network given as the best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.
I changed the testing code to consider over 4000 arrays and show the average number of cycles needed to sort each one. On an i5-650 I'm getting ~34.1 cycles/sort (using -O3), compared to the original reordered sorting network getting ~65.3 cycles/sort (using -O1, beats -O2 and -O3).
#include <stdio.h>
static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
register int x0,x1,x2,x3,x4,x5;
x1 = d[1];
x2 = d[2];
SWAP(x1, x2);
x4 = d[4];
x5 = d[5];
SWAP(x4, x5);
x0 = d[0];
SWAP(x0, x2);
x3 = d[3];
SWAP(x3, x5);
SWAP(x0, x1);
SWAP(x3, x4);
SWAP(x1, x4);
SWAP(x0, x3);
d[0] = x0;
SWAP(x2, x5);
d[5] = x5;
SWAP(x1, x3);
d[1] = x1;
SWAP(x2, x4);
d[4] = x4;
SWAP(x2, x3);
d[2] = x2;
d[3] = x3;
#undef SWAP
#undef min
#undef max
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
return x;
}
void ran_fill(int n, int *a) {
static int seed = 76521;
while (n--) *a++ = (seed = seed *1812433253 + 12345);
}
#define NTESTS 4096
int main() {
int i;
int d[6*NTESTS];
ran_fill(6*NTESTS, d);
unsigned long long cycles = rdtsc();
for (i = 0; i < 6*NTESTS ; i+=6) {
sort6_fast(d+i);
}
cycles = rdtsc() - cycles;
printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);
for (i = 0; i < 6*NTESTS ; i+=6) {
if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
printf("d%d : %d %d %d %d %d %d\n", i,
d[i+0], d[i+1], d[i+2],
d[i+3], d[i+4], d[i+5]);
}
return 0;
}
I changed modified the test suite to also report clocks per sort and run more tests (the cmp function was updated to handle integer overflow as well), here are the results on some different architectures. I attempted testing on an AMD cpu but rdtsc isn't reliable on the X6 1100T I have available.
Clarkdale (i5-650)
==================
Direct call to qsort library function 635.14 575.65 581.61 577.76 521.12
Naive implementation (insertion sort) 538.30 135.36 134.89 240.62 101.23
Insertion Sort (Daniel Stutzbach) 424.48 159.85 160.76 152.01 151.92
Insertion Sort Unrolled 339.16 125.16 125.81 129.93 123.16
Rank Order 184.34 106.58 54.74 93.24 94.09
Rank Order with registers 127.45 104.65 53.79 98.05 97.95
Sorting Networks (Daniel Stutzbach) 269.77 130.56 128.15 126.70 127.30
Sorting Networks (Paul R) 551.64 103.20 64.57 73.68 73.51
Sorting Networks 12 with Fast Swap 321.74 61.61 63.90 67.92 67.76
Sorting Networks 12 reordered Swap 318.75 60.69 65.90 70.25 70.06
Reordered Sorting Network w/ fast swap 145.91 34.17 32.66 32.22 32.18
Kentsfield (Core 2 Quad)
========================
Direct call to qsort library function 870.01 736.39 723.39 725.48 721.85
Naive implementation (insertion sort) 503.67 174.09 182.13 284.41 191.10
Insertion Sort (Daniel Stutzbach) 345.32 152.84 157.67 151.23 150.96
Insertion Sort Unrolled 316.20 133.03 129.86 118.96 105.06
Rank Order 164.37 138.32 46.29 99.87 99.81
Rank Order with registers 115.44 116.02 44.04 116.04 116.03
Sorting Networks (Daniel Stutzbach) 230.35 114.31 119.15 110.51 111.45
Sorting Networks (Paul R) 498.94 77.24 63.98 62.17 65.67
Sorting Networks 12 with Fast Swap 315.98 59.41 58.36 60.29 55.15
Sorting Networks 12 reordered Swap 307.67 55.78 51.48 51.67 50.74
Reordered Sorting Network w/ fast swap 149.68 31.46 30.91 31.54 31.58
Sandy Bridge (i7-2600k)
=======================
Direct call to qsort library function 559.97 451.88 464.84 491.35 458.11
Naive implementation (insertion sort) 341.15 160.26 160.45 154.40 106.54
Insertion Sort (Daniel Stutzbach) 284.17 136.74 132.69 123.85 121.77
Insertion Sort Unrolled 239.40 110.49 114.81 110.79 117.30
Rank Order 114.24 76.42 45.31 36.96 36.73
Rank Order with registers 105.09 32.31 48.54 32.51 33.29
Sorting Networks (Daniel Stutzbach) 210.56 115.68 116.69 107.05 124.08
Sorting Networks (Paul R) 364.03 66.02 61.64 45.70 44.19
Sorting Networks 12 with Fast Swap 246.97 41.36 59.03 41.66 38.98
Sorting Networks 12 reordered Swap 235.39 38.84 47.36 38.61 37.29
Reordered Sorting Network w/ fast swap 115.58 27.23 27.75 27.25 26.54
Nehalem (Xeon E5640)
====================
Direct call to qsort library function 911.62 890.88 681.80 876.03 872.89
Naive implementation (insertion sort) 457.69 236.87 127.68 388.74 175.28
Insertion Sort (Daniel Stutzbach) 317.89 279.74 147.78 247.97 245.09
Insertion Sort Unrolled 259.63 220.60 116.55 221.66 212.93
Rank Order 140.62 197.04 52.10 163.66 153.63
Rank Order with registers 84.83 96.78 50.93 109.96 54.73
Sorting Networks (Daniel Stutzbach) 214.59 220.94 118.68 120.60 116.09
Sorting Networks (Paul R) 459.17 163.76 56.40 61.83 58.69
Sorting Networks 12 with Fast Swap 284.58 95.01 50.66 53.19 55.47
Sorting Networks 12 reordered Swap 281.20 96.72 44.15 56.38 54.57
Reordered Sorting Network w/ fast swap 128.34 50.87 26.87 27.91 28.02

The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to “optimize” into something subtly wrong), the included tests are very rudimentary (no negative numbers?) and there's nothing to stop the compiler from just discarding the entire function as dead code.
That being said, it's also pretty easy to improve on the bitonic network solution; simply change the min/max/SWAP stuff to
#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }
and it comes out about 65% faster for me (Debian gcc 4.4.5 with -O2, amd64, Core i7).

I stumbled onto this question from Google a few days ago because I also had a need to quickly sort a fixed length array of 6 integers. In my case however, my integers are only 8 bits (instead of 32) and I do not have a strict requirement of only using C. I thought I would share my findings anyways, in case they might be helpful to someone...
I implemented a variant of a network sort in assembly that uses SSE to vectorize the compare and swap operations, to the extent possible. It takes six "passes" to completely sort the array. I used a novel mechanism to directly convert the results of PCMPGTB (vectorized compare) to shuffle parameters for PSHUFB (vectorized swap), using only a PADDB (vectorized add) and in some cases also a PAND (bitwise AND) instruction.
This approach also had the side effect of yielding a truly branchless function. There are no jump instructions whatsoever.
It appears that this implementation is about 38% faster than the implementation which is currently marked as the fastest option in the question ("Sorting Networks 12 with Simple Swap"). I modified that implementation to use char array elements during my testing, to make the comparison fair.
I should note that this approach can be applied to any array size up to 16 elements. I expect the relative speed advantage over the alternatives to grow larger for the bigger arrays.
The code is written in MASM for x86_64 processors with SSSE3. The function uses the "new" Windows x64 calling convention. Here it is...
PUBLIC simd_sort_6
.DATA
ALIGN 16
pass1_shuffle OWORD 0F0E0D0C0B0A09080706040503010200h
pass1_add OWORD 0F0E0D0C0B0A09080706050503020200h
pass2_shuffle OWORD 0F0E0D0C0B0A09080706030405000102h
pass2_and OWORD 00000000000000000000FE00FEFE00FEh
pass2_add OWORD 0F0E0D0C0B0A09080706050405020102h
pass3_shuffle OWORD 0F0E0D0C0B0A09080706020304050001h
pass3_and OWORD 00000000000000000000FDFFFFFDFFFFh
pass3_add OWORD 0F0E0D0C0B0A09080706050404050101h
pass4_shuffle OWORD 0F0E0D0C0B0A09080706050100020403h
pass4_and OWORD 0000000000000000000000FDFD00FDFDh
pass4_add OWORD 0F0E0D0C0B0A09080706050403020403h
pass5_shuffle OWORD 0F0E0D0C0B0A09080706050201040300h
pass5_and OWORD 0000000000000000000000FEFEFEFE00h
pass5_add OWORD 0F0E0D0C0B0A09080706050403040300h
pass6_shuffle OWORD 0F0E0D0C0B0A09080706050402030100h
pass6_add OWORD 0F0E0D0C0B0A09080706050403030100h
.CODE
simd_sort_6 PROC FRAME
.endprolog
; pxor xmm4, xmm4
; pinsrd xmm4, dword ptr [rcx], 0
; pinsrb xmm4, byte ptr [rcx + 4], 4
; pinsrb xmm4, byte ptr [rcx + 5], 5
; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer. Same on extract
; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
movd xmm4, dword ptr [rcx]
pinsrw xmm4, word ptr [rcx + 4], 2 ; word 2 = bytes 4 and 5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass1_shuffle]
pcmpgtb xmm5, xmm4
paddb xmm5, oword ptr [pass1_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass2_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass2_and]
paddb xmm5, oword ptr [pass2_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass3_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass3_and]
paddb xmm5, oword ptr [pass3_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass4_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass4_and]
paddb xmm5, oword ptr [pass4_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass5_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass5_and]
paddb xmm5, oword ptr [pass5_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass6_shuffle]
pcmpgtb xmm5, xmm4
paddb xmm5, oword ptr [pass6_add]
pshufb xmm4, xmm5
;pextrd dword ptr [rcx], xmm4, 0 ; benchmarked with this
;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
;pextrb byte ptr [rcx + 5], xmm4, 5
movd dword ptr [rcx], xmm4
pextrw word ptr [rcx + 4], xmm4, 2 ; x86 is little-endian, so this is the right order
ret
simd_sort_6 ENDP
END
You can compile this to an executable object and link it into your C project. For instructions on how to do this in Visual Studio, you can read this article. You can use the following C prototype to call the function from your C code:
void simd_sort_6(char *values);

While I really like the swap macro provided:
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
I see an improvement (which a good compiler might make):
#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }
We take note of how min and max work and pull the common sub-expression explicitly. This eliminates the min and max macros completely.

Never optimize min/max without benchmarking and looking at actual compiler generated assembly. If I let GCC optimize min with conditional move instructions I get a 33% speedup:
#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }
(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the above is a little bit faster. This SWAP is faster with both GCC and Clang.
Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables upfront, and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5]). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. :)
By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two have some variation in the results, can't say really which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.
Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5);
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4);
SWAP(0,1); SWAP(1,2); SWAP(2,3);
SWAP(0,1); SWAP(1,2);
SWAP(0,1);
and here is the insertion sort:
//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
//Faster on x86, probably slower on ARM or similar:
#define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
static inline void sort6_insertion_sort_unrolled_v2(int * d){
int t;
t = d[1]; ITER(0);
t = d[2]; ITER(1); ITER(0);
t = d[3]; ITER(2); ITER(1); ITER(0);
t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);
This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the t = d[2]; ITER(1); ITER(0); line in ARM assembly:
MOV r6, r2
CMP r6, r1
MOVLT r2, r1
MOVLT r1, r6
CMP r6, r0
MOVLT r1, r0
MOVLT r0, r6
For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But it's not going to be true when the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).

I ported the test suite to a PPC architecture machine I can not identify (didn't have to touch code, just increase the iterations of the test, use 8 test cases to avoid polluting results with mods and replace the x86 specific rdtsc):
Direct call to qsort library function : 101
Naive implementation (insertion sort) : 299
Insertion Sort (Daniel Stutzbach) : 108
Insertion Sort Unrolled : 51
Sorting Networks (Daniel Stutzbach) : 26
Sorting Networks (Paul R) : 85
Sorting Networks 12 with Fast Swap : 117
Sorting Networks 12 reordered Swap : 116
Rank Order : 56

An XOR swap may be useful in your swapping functions.
void xorSwap (int *x, int *y) {
if (*x != *y) {
*x ^= *y;
*y ^= *x;
*x ^= *y;
}
}
The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.

Looking forward to trying my hand at this and learning from these examples, but first some timings from my 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM. (I borrowed a similar rdtsc-like timer for PPC from http://www.mcs.anl.gov/~kazutomo/rdtsc.html for the timings.) I ran the program a few times and the absolute results varied but the consistently fastest test was "Insertion Sort (Daniel Stutzbach)", with "Insertion Sort Unrolled" a close second.
Here's the last set of times:
**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)** : 85
**Insertion Sort Unrolled** : 97
**Sorting Networks (Daniel Stutzbach)** : 457
**Sorting Networks (Paul R)** : 179
**Sorting Networks 12 with Fast Swap** : 238
**Sorting Networks 12 reordered Swap** : 236
**Rank Order** : 116

Here is my contribution to this thread: an optimized 1, 4 gap shellsort for a 6-member int vector (valp) containing unique values.
void shellsort (int *valp)
{
int c,a,*cp,*ip=valp,*ep=valp+5;
c=*valp; a=*(valp+4);if (c>a) {*valp= a;*(valp+4)=c;}
c=*(valp+1);a=*(valp+5);if (c>a) {*(valp+1)=a;*(valp+5)=c;}
cp=ip;
do
{
c=*cp;
a=*(cp+1);
do
{
if (c<a) break;
*cp=a;
*(cp+1)=c;
cp-=1;
c=*cp;
} while (cp>=valp);
ip+=1;
cp=ip;
} while (ip<ep);
}
On my HP dv7-3010so laptop with a dual-core Athlon M300 # 2 Ghz (DDR2 memory) it executes in 165 clock cycles. This is an average calculated from timing every unique sequence (6!/720 in all). Compiled to Win32 using OpenWatcom 1.8. The loop is essentially an insertion sort and is 16 instructions/37 bytes long.
I do not have a 64-bit environment to compile on.

If insertion sort is reasonably competitive here, I would recommend trying a shellsort. I'm afraid 6 elements is probably just too little for it to be among the best, but it might be worth a try.
Example code, untested, undebugged, etc. You want to tune the inc = 4 and inc -= 3 sequence to find the optimum (try inc = 2, inc -= 1 for example).
static __inline__ int sort6(int * d) {
char j, i;
int tmp;
for (inc = 4; inc > 0; inc -= 3) {
for (i = inc; i < 5; i++) {
tmp = a[i];
j = i;
while (j >= inc && a[j - inc] > tmp) {
a[j] = a[j - inc];
j -= inc;
}
a[j] = tmp;
}
}
}
I don't think this will win, but if someone posts a question about sorting 10 elements, who knows...
According to Wikipedia this can even be combined with sorting networks:
Pratt, V (1979). Shellsort and sorting networks (Outstanding dissertations in the computer sciences). Garland. ISBN 0-824-04406-1

I know I'm super-late, but I was interested in experimenting with some different solutions. First, I cleaned up that paste, made it compile, and put it into a repository. I kept some undesirable solutions as dead-ends so that others wouldn't try it. Among this was my first solution, which attempted to ensure that x1>x2 was calculated once. After optimization, it is no faster than the other, simple versions.
I added a looping version of rank order sort, since my own application of this study is for sorting 2-8 items, so since there are a variable number of arguments, a loop is necessary. This is also why I ignored the sorting network solutions.
The test code didn't test that duplicates were handled correctly, so while the existing solutions were all correct, I added a special case to the test code to ensure that duplicates were handled correctly.
Then, I wrote an insertion sort that is entirely in AVX registers. On my machine it is 25% faster than the other insertion sorts, but 100% slower than rank order. I did this purely for experiment and had no expectation of this being better due to the branching in insertion sort.
static inline void sort6_insertion_sort_avx(int* d) {
__m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], 0, 0);
__m256i index = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
__m256i shlpermute = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
__m256i sorted = _mm256_setr_epi32(d[0], INT_MAX, INT_MAX, INT_MAX,
INT_MAX, INT_MAX, INT_MAX, INT_MAX);
__m256i val, gt, permute;
unsigned j;
// 8 / 32 = 2^-2
#define ITER(I) \
val = _mm256_permutevar8x32_epi32(src, _mm256_set1_epi32(I));\
gt = _mm256_cmpgt_epi32(sorted, val);\
permute = _mm256_blendv_epi8(index, shlpermute, gt);\
j = ffs( _mm256_movemask_epi8(gt)) >> 2;\
sorted = _mm256_blendv_epi8(_mm256_permutevar8x32_epi32(sorted, permute),\
val, _mm256_cmpeq_epi32(index, _mm256_set1_epi32(j)))
ITER(1);
ITER(2);
ITER(3);
ITER(4);
ITER(5);
int x[8];
_mm256_storeu_si256((__m256i*)x, sorted);
d[0] = x[0]; d[1] = x[1]; d[2] = x[2]; d[3] = x[3]; d[4] = x[4]; d[5] = x[5];
#undef ITER
}
Then, I wrote a rank order sort using AVX. This matches the speed of the other rank-order solutions, but is no faster. The issue here is that I can only calculate the indices with AVX, and then I have to make a table of indices. This is because the calculation is destination-based rather than source-based. See Converting from Source-based Indices to Destination-based Indices
static inline void sort6_rank_order_avx(int* d) {
__m256i ror = _mm256_setr_epi32(5, 0, 1, 2, 3, 4, 6, 7);
__m256i one = _mm256_set1_epi32(1);
__m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], INT_MAX, INT_MAX);
__m256i rot = src;
__m256i index = _mm256_setzero_si256();
__m256i gt, permute;
__m256i shl = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 6, 6);
__m256i dstIx = _mm256_setr_epi32(0,1,2,3,4,5,6,7);
__m256i srcIx = dstIx;
__m256i eq = one;
__m256i rotIx = _mm256_setzero_si256();
#define INC(I)\
rot = _mm256_permutevar8x32_epi32(rot, ror);\
gt = _mm256_cmpgt_epi32(src, rot);\
index = _mm256_add_epi32(index, _mm256_and_si256(gt, one));\
index = _mm256_add_epi32(index, _mm256_and_si256(eq,\
_mm256_cmpeq_epi32(src, rot)));\
eq = _mm256_insert_epi32(eq, 0, I)
INC(0);
INC(1);
INC(2);
INC(3);
INC(4);
int e[6];
e[0] = d[0]; e[1] = d[1]; e[2] = d[2]; e[3] = d[3]; e[4] = d[4]; e[5] = d[5];
int i[8];
_mm256_storeu_si256((__m256i*)i, index);
d[i[0]] = e[0]; d[i[1]] = e[1]; d[i[2]] = e[2]; d[i[3]] = e[3]; d[i[4]] = e[4]; d[i[5]] = e[5];
}
The repo can be found here: https://github.com/eyepatchParrot/sort6/

This question is becoming quite old, but I actually had to solve the same problem these days: fast agorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).
The algorithm sort6 uses the algorithm sort4 which uses the algorithm sort3. Here is the implementation in some light C++ form (the original is template-heavy so that it can work with any random-access iterator and any suitable comparison function).
Sorting 3 values
The following algorithm is an unrolled insertion sort. When two swaps (6 assignments) have to be performed, it uses 4 assignments instead:
void sort3(int* array)
{
if (array[1] < array[0]) {
if (array[2] < array[0]) {
if (array[2] < array[1]) {
std::swap(array[0], array[2]);
} else {
int tmp = array[0];
array[0] = array[1];
array[1] = array[2];
array[2] = tmp;
}
} else {
std::swap(array[0], array[1]);
}
} else {
if (array[2] < array[1]) {
if (array[2] < array[0]) {
int tmp = array[2];
array[2] = array[1];
array[1] = array[0];
array[0] = tmp;
} else {
std::swap(array[1], array[2]);
}
}
}
}
It looks a bit complex because the sort has more or less one branch for every possible permutation of the array, using 2~3 comparisons and at most 4 assignments to sort the three values.
Sorting 4 values
This one calls sort3 then performs an unrolled insertion sort with the last element of the array:
void sort4(int* array)
{
// Sort the first 3 elements
sort3(array);
// Insert the 4th element with insertion sort
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
if (array[1] < array[0]) {
std::swap(array[0], array[1]);
}
}
}
}
This algorithm performs 3 to 6 comparisons and at most 5 swaps. It is easy to unroll an insertion sort, but we will be using another algorithm for the last sort...
Sorting 6 values
This one uses an unrolled version of what I called a double insertion sort. The name isn't that great, but it's quite descriptive, here is how it works:
Sort everything but the first and the last elements of the array.
Swap the first and the elements of the array if the first is greater than the last.
Insert the first element into the sorted sequence from the front then the last element from the back.
After the swap, the first element is always smaller than the last, which means that, when inserting them into the sorted sequence, there won't be more than N comparisons to insert the two elements in the worst case: for example, if the first element has been insert in the 3rd position, then the last one can't be inserted lower than the 4th position.
void sort6(int* array)
{
// Sort everything but first and last elements
sort4(array+1);
// Switch first and last elements if needed
if (array[5] < array[0]) {
std::swap(array[0], array[5]);
}
// Insert first element from the front
if (array[1] < array[0]) {
std::swap(array[0], array[1]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[4] < array[3]) {
std::swap(array[3], array[4]);
}
}
}
}
// Insert last element from the back
if (array[5] < array[4]) {
std::swap(array[4], array[5]);
if (array[4] < array[3]) {
std::swap(array[3], array[4]);
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
}
}
}
}
}
My tests on every permutation of 6 values ever show that this algorithms always performs between 6 and 13 comparisons. I didn't compute the number of swaps performed, but I don't expect it to be higher than 11 in the worst case.
I hope that this helps, even if this question may not represent an actual problem anymore :)
EDIT: after putting it in the provided benchmark, it is cleary slower than most of the interesting alternatives. It tends to perform a bit better than the unrolled insertion sort, but that's pretty much it. Basically, it isn't the best sort for integers but could be interesting for types with an expensive comparison operation.

I found that at least on my system, the functions sort6_iterator() and sort6_iterator_local() defined below both ran at least as fast, and frequently noticeably faster, than the above current record holder:
#define MIN(x, y) (x<y?x:y)
#define MAX(x, y) (x<y?y:x)
template<class IterType>
inline void sort6_iterator(IterType it)
{
#define SWAP(x,y) { const auto a = MIN(*(it + x), *(it + y)); \
const auto b = MAX(*(it + x), *(it + y)); \
*(it + x) = a; *(it + y) = b; }
SWAP(1, 2) SWAP(4, 5)
SWAP(0, 2) SWAP(3, 5)
SWAP(0, 1) SWAP(3, 4)
SWAP(1, 4) SWAP(0, 3)
SWAP(2, 5) SWAP(1, 3)
SWAP(2, 4)
SWAP(2, 3)
#undef SWAP
}
I passed this function a std::vector's iterator in my timing code.
I suspect (from comments like this and elsewhere) that using iterators gives g++ certain assurances about what can and can't happen to the memory that the iterator refers to, which it otherwise wouldn't have and it is these assurances that allow g++ to better optimize the sorting code (e.g. with pointers, the compiler can't be sure that all pointers are pointing to different memory locations). If I remember correctly, this is also part of the reason why so many STL algorithms, such as std::sort(), generally have such obscenely good performance.
Moreover, sort6_iterator() is sometimes (again, depending on the context in which the function is called) consistently outperformed by the following sorting function, which copies the data into local variables before sorting them.1 Note that since there are only 6 local variables defined, if these local variables are primitives then they are likely never actually stored in RAM and are instead only ever stored in the CPU's registers until the end of the function call, which helps make this sorting function fast. (It also helps that the compiler knows that distinct local variables have distinct locations in memory).
template<class IterType>
inline void sort6_iterator_local(IterType it)
{
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
const auto b = MAX(data##x, data##y); \
data##x = a; data##y = b; }
//DD = Define Data
#define DD1(a) auto data##a = *(it + a);
#define DD2(a,b) auto data##a = *(it + a), data##b = *(it + b);
//CB = Copy Back
#define CB(a) *(it + a) = data##a;
DD2(1,2) SWAP(1, 2)
DD2(4,5) SWAP(4, 5)
DD1(0) SWAP(0, 2)
DD1(3) SWAP(3, 5)
SWAP(0, 1) SWAP(3, 4)
SWAP(1, 4) SWAP(0, 3) CB(0)
SWAP(2, 5) CB(5)
SWAP(1, 3) CB(1)
SWAP(2, 4) CB(4)
SWAP(2, 3) CB(2) CB(3)
#undef CB
#undef DD2
#undef DD1
#undef SWAP
}
Note that defining SWAP() as follows sometimes results in slightly better performance although most of the time it results in slightly worse performance or a negligible difference in performance.
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
data##y = MAX(data##x, data##y); \
data##x = a; }
If you just want a sorting algorithm that on primitive data types, gcc -O3 is consistently good at optimizing no matter what context the call to the sorting function appears in1 then, depending on how you pass the input, try one of the following two algorithms:
template<class T> inline void sort6(T it) {
#define SORT2(x,y) {if(data##x>data##y){auto a=std::move(data##y);data##y=std::move(data##x);data##x=std::move(a);}}
#define DD1(a) register auto data##a=*(it+a);
#define DD2(a,b) register auto data##a=*(it+a);register auto data##b=*(it+b);
#define CB1(a) *(it+a)=data##a;
#define CB2(a,b) *(it+a)=data##a;*(it+b)=data##b;
DD2(1,2) SORT2(1,2)
DD2(4,5) SORT2(4,5)
DD1(0) SORT2(0,2)
DD1(3) SORT2(3,5)
SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
SORT2(1,4) SORT2(0,3) CB1(0)
SORT2(2,4) CB1(4)
SORT2(1,3) CB1(1)
SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}
Or if you want to pass the variables by reference then use this (the below function differs from the above in its first 5 lines):
template<class T> inline void sort6(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
#define DD1(a) register auto data##a=e##a;
#define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b;
#define CB1(a) e##a=data##a;
#define CB2(a,b) e##a=data##a;e##b=data##b;
DD2(1,2) SORT2(1,2)
DD2(4,5) SORT2(4,5)
DD1(0) SORT2(0,2)
DD1(3) SORT2(3,5)
SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
SORT2(1,4) SORT2(0,3) CB1(0)
SORT2(2,4) CB1(4)
SORT2(1,3) CB1(1)
SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}
The reason for using the register keyword is because this is one of the few times that you know that you want these values in registers. Without register, the compiler will figure this out most of the time but sometimes it doesn't. Using the register keyword helps solve this issue. Normally, however, don't use the register keyword since it's more likely to slow your code than speed it up.
Also, note the use of templates. This is done on purpose since, even with the inline keyword, template functions are generally much more aggressively optimized by gcc than vanilla C functions (this has to do with gcc needing to deal with function pointers for vanilla C functions but not with template functions).
While timing various sorting functions I noticed that the context (i.e. surrounding code) in which the call to the sorting function was made had a significant impact on performance, which is likely due to the function being inlined and then optimized. For instance, if the program was sufficiently simple then there usually wasn't much of a difference in performance between passing the sorting function a pointer versus passing it an iterator; otherwise using iterators usually resulted in noticeably better performance and never (in my experience so far at least) any noticeably worse performance. I suspect that this may be because g++ can globally optimize sufficiently simple code.

I believe there are two parts to your question.
The first is to determine the optimal algorithm. This is done - at least in this case - by looping through every possible ordering (there aren't that many) which allows you to compute exact min, max, average and standard deviation of compares and swaps. Have a runner-up or two handy as well.
The second is to optimize the algorithm. A lot can be done to convert textbook code examples to mean and lean real-life algorithms. If you realize that an algorithm can't be optimized to the extent required, try a runner-up.
I wouldn't worry too much about emptying pipelines (assuming current x86): branch prediction has come a long way. What I would worry about is making sure that the code and data fit in one cache line each (maybe two for the code). Once there fetch latencies are refreshingly low which will compensate for any stall. It also means that your inner loop will be maybe ten instructions or so which is right where it should be (there are two different inner loops in my sorting algorithm, they are 10 instructions/22 bytes and 9/22 long respectively). Assuming the code doesn't contain any divs you can be sure it will be blindingly fast.

I know this is an old question.
But I just wrote a different kind of solution I want to share.
Using nothing but nested MIN MAX,
It's not fast as it uses 114 of each,
could reduce it to 75 pretty simply like so -> pastebin
But then it's not purely min max anymore.
What might work is doing min/max on multiple integers at once with AVX
PMINSW reference
#include <stdio.h>
static __inline__ int MIN(int a, int b){
int result =a;
__asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ int MAX(int a, int b){
int result = a;
__asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ unsigned long long rdtsc(void){
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" :
"=A" (x));
return x;
}
#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))
static __inline__ void sort6(int * in) {
const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];
in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) );
const int
AB = MAX(A, B),
AC = MAX(A, C),
AD = MAX(A, D),
AE = MAX(A, E),
AF = MAX(A, F),
BC = MAX(B, C),
BD = MAX(B, D),
BE = MAX(B, E),
BF = MAX(B, F),
CD = MAX(C, D),
CE = MAX(C, E),
CF = MAX(C, F),
DE = MAX(D, E),
DF = MAX(D, F),
EF = MAX(E, F);
in[1] = MIN4 (
MIN4( AB, AC, AD, AE ),
MIN4( AF, BC, BD, BE ),
MIN4( BF, CD, CE, CF ),
MIN3( DE, DF, EF)
);
const int
ABC = MAX(AB,C),
ABD = MAX(AB,D),
ABE = MAX(AB,E),
ABF = MAX(AB,F),
ACD = MAX(AC,D),
ACE = MAX(AC,E),
ACF = MAX(AC,F),
ADE = MAX(AD,E),
ADF = MAX(AD,F),
AEF = MAX(AE,F),
BCD = MAX(BC,D),
BCE = MAX(BC,E),
BCF = MAX(BC,F),
BDE = MAX(BD,E),
BDF = MAX(BD,F),
BEF = MAX(BE,F),
CDE = MAX(CD,E),
CDF = MAX(CD,F),
CEF = MAX(CE,F),
DEF = MAX(DE,F);
in[2] = MIN( MIN4 (
MIN4( ABC, ABD, ABE, ABF ),
MIN4( ACD, ACE, ACF, ADE ),
MIN4( ADF, AEF, BCD, BCE ),
MIN4( BCF, BDE, BDF, BEF )),
MIN4( CDE, CDF, CEF, DEF )
);
const int
ABCD = MAX(ABC,D),
ABCE = MAX(ABC,E),
ABCF = MAX(ABC,F),
ABDE = MAX(ABD,E),
ABDF = MAX(ABD,F),
ABEF = MAX(ABE,F),
ACDE = MAX(ACD,E),
ACDF = MAX(ACD,F),
ACEF = MAX(ACE,F),
ADEF = MAX(ADE,F),
BCDE = MAX(BCD,E),
BCDF = MAX(BCD,F),
BCEF = MAX(BCE,F),
BDEF = MAX(BDE,F),
CDEF = MAX(CDE,F);
in[3] = MIN4 (
MIN4( ABCD, ABCE, ABCF, ABDE ),
MIN4( ABDF, ABEF, ACDE, ACDF ),
MIN4( ACEF, ADEF, BCDE, BCDF ),
MIN3( BCEF, BDEF, CDEF )
);
const int
ABCDE= MAX(ABCD,E),
ABCDF= MAX(ABCD,F),
ABCEF= MAX(ABCE,F),
ABDEF= MAX(ABDE,F),
ACDEF= MAX(ACDE,F),
BCDEF= MAX(BCDE,F);
in[4]= MIN (
MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
MIN ( ACDEF, BCDEF )
);
in[5] = MAX(ABCDE,F);
}
int main(int argc, char ** argv) {
int d[6][6] = {
{1, 2, 3, 4, 5, 6},
{6, 5, 4, 3, 2, 1},
{100, 2, 300, 4, 500, 6},
{100, 2, 3, 4, 500, 6},
{1, 200, 3, 4, 5, 600},
{1, 1, 2, 1, 2, 1}
};
unsigned long long cycles = rdtsc();
for (int i = 0; i < 6; i++) {
sort6(d[i]);
}
cycles = rdtsc() - cycles;
printf("Time is %d\n", (unsigned)cycles);
for (int i = 0; i < 6; i++) {
printf("d%d : %d %d %d %d %d %d\n", i,
d[i][0], d[i][1], d[i][2],
d[i][3], d[i][4], d[i][5]);
}
}
EDIT:
Rank order solution inspired by Rex Kerr's,
Much faster than the mess above
static void sort6(int *o) {
const int
A=o[0],B=o[1],C=o[2],D=o[3],E=o[4],F=o[5];
const unsigned char
AB = A>B, AC = A>C, AD = A>D, AE = A>E,
BC = B>C, BD = B>D, BE = B>E,
CD = C>D, CE = C>E,
DE = D>E,
a = AB + AC + AD + AE + (A>F),
b = 1 - AB + BC + BD + BE + (B>F),
c = 2 - AC - BC + CD + CE + (C>F),
d = 3 - AD - BD - CD + DE + (D>F),
e = 4 - AE - BE - CE - DE + (E>F);
o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
o[15-a-b-c-d-e]=F;
}

I thought I'd try an unrolled Ford-Johnson merge-insertion sort, which achieves the minimum possible number of comparisons (ceil(log2(6!)) = 10) and no swaps.
It doesn't compete, though (I got a slightly better timing than the worst sorting networks solution sort6_sorting_network_v1).
It loads the values into six registers, then performs 8 to 10 comparisons
to decide which of the 720=6!
cases it's in, then writes the registers back in the appropriate one
of those 720 orders (separate code for each case).
There are no swaps or reordering of anything until the final write-back. I haven't looked at the generated assembly code.
static inline void sort6_ford_johnson_unrolled(int *D) {
register int a = D[0], b = D[1], c = D[2], d = D[3], e = D[4], f = D[5];
#define abcdef(a,b,c,d,e,f) (D[0]=a, D[1]=b, D[2]=c, D[3]=d, D[4]=e, D[5]=f)
#define abdef_cd(a,b,c,d,e,f) (c<a ? abcdef(c,a,b,d,e,f) \
: c<b ? abcdef(a,c,b,d,e,f) \
: abcdef(a,b,c,d,e,f))
#define abedf_cd(a,b,c,d,e,f) (c<b ? c<a ? abcdef(c,a,b,e,d,f) \
: abcdef(a,c,b,e,d,f) \
: c<e ? abcdef(a,b,c,e,d,f) \
: abcdef(a,b,e,c,d,f))
#define abdf_cd_ef(a,b,c,d,e,f) (e<b ? e<a ? abedf_cd(e,a,c,d,b,f) \
: abedf_cd(a,e,c,d,b,f) \
: e<d ? abedf_cd(a,b,c,d,e,f) \
: abdef_cd(a,b,c,d,e,f))
#define abd_cd_ef(a,b,c,d,e,f) (d<f ? abdf_cd_ef(a,b,c,d,e,f) \
: b<f ? abdf_cd_ef(a,b,e,f,c,d) \
: abdf_cd_ef(e,f,a,b,c,d))
#define ab_cd_ef(a,b,c,d,e,f) (b<d ? abd_cd_ef(a,b,c,d,e,f) \
: abd_cd_ef(c,d,a,b,e,f))
#define ab_cd(a,b,c,d,e,f) (e<f ? ab_cd_ef(a,b,c,d,e,f) \
: ab_cd_ef(a,b,c,d,f,e))
#define ab(a,b,c,d,e,f) (c<d ? ab_cd(a,b,c,d,e,f) \
: ab_cd(a,b,d,c,e,f))
a<b ? ab(a,b,c,d,e,f)
: ab(b,a,c,d,e,f);
#undef ab
#undef ab_cd
#undef ab_cd_ef
#undef abd_cd_ef
#undef abdf_cd_ef
#undef abedf_cd
#undef abdef_cd
#undef abcdef
}
TEST(ford_johnson_unrolled, "Unrolled Ford-Johnson Merge-Insertion sort");

Try 'merging sorted list' sort. :) Use two array. Fastest for small and big array.
If you concating, you only check where insert. Other bigger values you not need compare (cmp = a-b>0).
For 4 numbers, you can use system 4-5 cmp (~4.6) or 3-6 cmp (~4.9). Bubble sort use 6 cmp (6). Lots of cmp for big numbers slower code.
This code use 5 cmp (not MSL sort):
if (cmp(arr[n][i+0],arr[n][i+1])>0) {swap(n,i+0,i+1);}
if (cmp(arr[n][i+2],arr[n][i+3])>0) {swap(n,i+2,i+3);}
if (cmp(arr[n][i+0],arr[n][i+2])>0) {swap(n,i+0,i+2);}
if (cmp(arr[n][i+1],arr[n][i+3])>0) {swap(n,i+1,i+3);}
if (cmp(arr[n][i+1],arr[n][i+2])>0) {swap(n,i+1,i+2);}
Principial MSL
9 8 7 6 5 4 3 2 1 0
89 67 45 23 01 ... concat two sorted lists, list length = 1
6789 2345 01 ... concat two sorted lists, list length = 2
23456789 01 ... concat two sorted lists, list length = 4
0123456789 ... concat two sorted lists, list length = 8
js code
function sortListMerge_2a(cmp)
{
var step, stepmax, tmp, a,b,c, i,j,k, m,n, cycles;
var start = 0;
var end = arr_count;
//var str = '';
cycles = 0;
if (end>3)
{
stepmax = ((end - start + 1) >> 1) << 1;
m = 1;
n = 2;
for (step=1;step<stepmax;step<<=1) //bounds 1-1, 2-2, 4-4, 8-8...
{
a = start;
while (a<end)
{
b = a + step;
c = a + step + step;
b = b<end ? b : end;
c = c<end ? c : end;
i = a;
j = b;
k = i;
while (i<b && j<c)
{
if (cmp(arr[m][i],arr[m][j])>0)
{arr[n][k] = arr[m][j]; j++; k++;}
else {arr[n][k] = arr[m][i]; i++; k++;}
}
while (i<b)
{arr[n][k] = arr[m][i]; i++; k++;
}
while (j<c)
{arr[n][k] = arr[m][j]; j++; k++;
}
a = c;
}
tmp = m; m = n; n = tmp;
}
return m;
}
else
{
// sort 3 items
sort10(cmp);
return m;
}
}

Maybe I am late to the party, but at least my contribution is a new approach.
The code really should be inlined
even if inlined, there are too many branches
the analysing part is basically O(N(N-1)) which seems OK for N=6
the code could be more effective if the cost of swap would be higher (irt the cost of compare)
I trust on static functions being inlined.
The method is related to rank-sort
instead of ranks, the relative ranks (offsets) are used.
the sum of the ranks is zero for every cycle in any permutation group.
instead of SWAP()ing two elements, the cycles are chased, needing only one temp, and one (register->register) swap (new <- old).
Update: changed the code a bit, some people use C++ compilers to compile C code ...
#include <stdio.h>
#if WANT_CHAR
typedef signed char Dif;
#else
typedef signed int Dif;
#endif
static int walksort (int *arr, int cnt);
static void countdifs (int *arr, Dif *dif, int cnt);
static void calcranks(int *arr, Dif *dif);
int wsort6(int *arr);
void do_print_a(char *msg, int *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
fprintf(stderr, " %3d", *arr);
}
fprintf(stderr,"\n");
}
void do_print_d(char *msg, Dif *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
fprintf(stderr, " %3d", (int) *arr);
}
fprintf(stderr,"\n");
}
static void inline countdifs (int *arr, Dif *dif, int cnt)
{
int top, bot;
for (top = 0; top < cnt; top++ ) {
for (bot = 0; bot < top; bot++ ) {
if (arr[top] < arr[bot]) { dif[top]--; dif[bot]++; }
}
}
return ;
}
/* Copied from RexKerr ... */
static void inline calcranks(int *arr, Dif *dif){
dif[0] = (arr[0]>arr[1])+(arr[0]>arr[2])+(arr[0]>arr[3])+(arr[0]>arr[4])+(arr[0]>arr[5]);
dif[1] = -1+ (arr[1]>=arr[0])+(arr[1]>arr[2])+(arr[1]>arr[3])+(arr[1]>arr[4])+(arr[1]>arr[5]);
dif[2] = -2+ (arr[2]>=arr[0])+(arr[2]>=arr[1])+(arr[2]>arr[3])+(arr[2]>arr[4])+(arr[2]>arr[5]);
dif[3] = -3+ (arr[3]>=arr[0])+(arr[3]>=arr[1])+(arr[3]>=arr[2])+(arr[3]>arr[4])+(arr[3]>arr[5]);
dif[4] = -4+ (arr[4]>=arr[0])+(arr[4]>=arr[1])+(arr[4]>=arr[2])+(arr[4]>=arr[3])+(arr[4]>arr[5]);
dif[5] = -(dif[0]+dif[1]+dif[2]+dif[3]+dif[4]);
}
static int walksort (int *arr, int cnt)
{
int idx, src,dst, nswap;
Dif difs[cnt];
#if WANT_REXK
calcranks(arr, difs);
#else
for (idx=0; idx < cnt; idx++) difs[idx] =0;
countdifs(arr, difs, cnt);
#endif
calcranks(arr, difs);
#define DUMP_IT 0
#if DUMP_IT
do_print_d("ISteps ", difs, cnt);
#endif
nswap = 0;
for (idx=0; idx < cnt; idx++) {
int newval;
int step,cyc;
if ( !difs[idx] ) continue;
newval = arr[idx];
cyc = 0;
src = idx;
do {
int oldval;
step = difs[src];
difs[src] =0;
dst = src + step;
cyc += step ;
if(dst == idx+1)idx=dst;
oldval = arr[dst];
#if (DUMP_IT&1)
fprintf(stderr, "[Nswap=%d] Cyc=%d Step=%2d Idx=%d Old=%2d New=%2d #### Src=%d Dst=%d[%2d]->%2d <-- %d\n##\n"
, nswap, cyc, step, idx, oldval, newval
, src, dst, difs[dst], arr[dst]
, newval );
do_print_a("Array ", arr, cnt);
do_print_d("Steps ", difs, cnt);
#endif
arr[dst] = newval;
newval = oldval;
nswap++;
src = dst;
} while( cyc);
}
return nswap;
}
/*************/
int wsort6(int *arr)
{
return walksort(arr, 6);
}

//Bruteforce compute unrolled count dumbsort(min to 0-index)
void bcudc_sort6(int* a)
{
int t[6] = {0};
int r1,r2;
r1=0;
r1 += (a[0] > a[1]);
r1 += (a[0] > a[2]);
r1 += (a[0] > a[3]);
r1 += (a[0] > a[4]);
r1 += (a[0] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[0];
r2=0;
r2 += (a[1] > a[0]);
r2 += (a[1] > a[2]);
r2 += (a[1] > a[3]);
r2 += (a[1] > a[4]);
r2 += (a[1] > a[5]);
while(t[r2]){r2++;}
t[r2] = a[1];
r1=0;
r1 += (a[2] > a[0]);
r1 += (a[2] > a[1]);
r1 += (a[2] > a[3]);
r1 += (a[2] > a[4]);
r1 += (a[2] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[2];
r2=0;
r2 += (a[3] > a[0]);
r2 += (a[3] > a[1]);
r2 += (a[3] > a[2]);
r2 += (a[3] > a[4]);
r2 += (a[3] > a[5]);
while(t[r2]){r2++;}
t[r2] = a[3];
r1=0;
r1 += (a[4] > a[0]);
r1 += (a[4] > a[1]);
r1 += (a[4] > a[2]);
r1 += (a[4] > a[3]);
r1 += (a[4] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[4];
r2=0;
r2 += (a[5] > a[0]);
r2 += (a[5] > a[1]);
r2 += (a[5] > a[2]);
r2 += (a[5] > a[3]);
r2 += (a[5] > a[4]);
while(t[r2]){r2++;}
t[r2] = a[5];
a[0]=t[0];
a[1]=t[1];
a[2]=t[2];
a[3]=t[3];
a[4]=t[4];
a[5]=t[5];
}
static __inline__ void sort6(int* a)
{
#define wire(x,y); t = a[x] ^ a[y] ^ ( (a[x] ^ a[y]) & -(a[x] < a[y]) ); a[x] = a[x] ^ t; a[y] = a[y] ^ t;
register int t;
wire( 0, 1); wire( 2, 3); wire( 4, 5);
wire( 3, 5); wire( 0, 2); wire( 1, 4);
wire( 4, 5); wire( 2, 3); wire( 0, 1);
wire( 3, 4); wire( 1, 2);
wire( 2, 3);
#undef wire
}

Well, if it's only 6 elements and you can leverage parallelism, want to minimize conditional branching, etc. Why you don't generate all the combinations and test for order? I would venture that in some architectures, it can be pretty fast (as long as you have the memory preallocated)

Sort 4 items with usage cmp==0.
Numbers of cmp is ~4.34 (FF native have ~4.52), but take 3x time than merging lists. But better less cmp operations, if you have big numbers or big text.
Edit: repaired bug
Online test http://mlich.zam.slu.cz/js-sort/x-sort-x2.htm
function sort4DG(cmp,start,end,n) // sort 4
{
var n = typeof(n) !=='undefined' ? n : 1;
var cmp = typeof(cmp) !=='undefined' ? cmp : sortCompare2;
var start = typeof(start)!=='undefined' ? start : 0;
var end = typeof(end) !=='undefined' ? end : arr[n].length;
var count = end - start;
var pos = -1;
var i = start;
var cc = [];
// stabilni?
cc[01] = cmp(arr[n][i+0],arr[n][i+1]);
cc[23] = cmp(arr[n][i+2],arr[n][i+3]);
if (cc[01]>0) {swap(n,i+0,i+1);}
if (cc[23]>0) {swap(n,i+2,i+3);}
cc[12] = cmp(arr[n][i+1],arr[n][i+2]);
if (!(cc[12]>0)) {return n;}
cc[02] = cc[01]==0 ? cc[12] : cmp(arr[n][i+0],arr[n][i+2]);
if (cc[02]>0)
{
swap(n,i+1,i+2); swap(n,i+0,i+1); // bubble last to top
cc[13] = cc[23]==0 ? cc[12] : cmp(arr[n][i+1],arr[n][i+3]);
if (cc[13]>0)
{
swap(n,i+2,i+3); swap(n,i+1,i+2); // bubble
return n;
}
else {
cc[23] = cc[23]==0 ? cc[12] : (cc[01]==0 ? cc[30] : cmp(arr[n][i+2],arr[n][i+3])); // new cc23 | c03 //repaired
if (cc[23]>0)
{
swap(n,i+2,i+3);
return n;
}
return n;
}
}
else {
if (cc[12]>0)
{
swap(n,i+1,i+2);
cc[23] = cc[23]==0 ? cc[12] : cmp(arr[n][i+2],arr[n][i+3]); // new cc23
if (cc[23]>0)
{
swap(n,i+2,i+3);
return n;
}
return n;
}
else {
return n;
}
}
return n;
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Reverse vector order in ARM NEON intrinsics - arm

I'm trying to reverse the order of a 128 bit vector (uint16x8). For example if I have a b c d e f g h I would like to obtain h g f e d c b a Is there a simple way to do that with NEON intrinsics? I tried with the VREV but it doesn't work.

Related

Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute value and index)

Why the cos function in math.h faster than x86 fcos instruction

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

GCC generates redundant code for repeated XOR of an array element

Sort integer in array in Objective-C [duplicate]

Categories

Resources