Schönauer Triad Benchmark - L1D Cache not visible

Schönauer Triad Benchmark - L1D Cache not visible - c

We are two HPC students getting involved into the famous Schönauer Triad Benchmark, whose C code are reported here along with its short explanation:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define DEFAULT_NMAX 10000000
#define DEFAULT_NR DEFAULT_NMAX
#define DEFAULT_INC 10
#define DEFAULT_XIDX 0
#define MAX_PATH_LENGTH 1024
// #define WINOS
#define STACKALLOC
#ifdef WINOS
#include <windows.h>
#endif
static void dummy(double A[], double B[], double C[], double D[])
{
return;
}
static double simulation(int N, int R)
{
int i, j;
#ifdef STACKALLOC
double A[N];
double B[N];
double C[N];
double D[N];
#else
double * A = malloc(N*sizeof(double));
double * B = malloc(N*sizeof(double));
double * C = malloc(N*sizeof(double));
double * D = malloc(N*sizeof(double));
#endif
double elaps;
for (i = 0; i < N; ++i)
{
A[i] = 0.00;
B[i] = 1.00;
C[i] = 2.00;
D[i] = 3.00;
}
#ifdef WINOS
FILETIME tp;
GetSystemTimePreciseAsFileTime(&tp);
elaps = - (double)(((ULONGLONG)tp.dwHighDateTime << 32) | (ULONGLONG)tp.dwLowDateTime)/10000000.0;
#else
struct timeval tp;
gettimeofday(&tp, NULL);
elaps = -(double)(tp.tv_sec + tp.tv_usec/1000000.0);
#endif
for(j=0; j<R; ++j)
{
for(i=0; i<N; ++i)
A[i] = B[i] + C[i]*D[i];
if(A[2] < 0) dummy(A, B, C, D);
}
#ifndef STACKALLOC
free(A);
free(B);
free(C);
free(D);
#endif
#ifdef WINOS
GetSystemTimePreciseAsFileTime(&tp);
return elaps + (double)(((ULONGLONG)tp.dwHighDateTime << 32) | (ULONGLONG)tp.dwLowDateTime)/10000000.0;
#else
gettimeofday(&tp, NULL);
return elaps + ((double)(tp.tv_sec + tp.tv_usec/1000000.0));
#endif
}
int main(int argc, char *argv[])
{
const int NR = argc > 1 ? atoi(argv[1]) : DEFAULT_NR;
const int NMAX = argc > 2 ? atoi(argv[2]) : DEFAULT_NMAX;
const int inc = argc > 3 ? atoi(argv[3]) : DEFAULT_INC;
const int xidx = argc > 4 ? atoi(argv[4]) : DEFAULT_XIDX;
int i, j, k;
FILE * fp;
printf("\n*** Schonauer Triad benchmark ***\n");
char csvname[MAX_PATH_LENGTH];
sprintf(csvname, "data%d.csv", xidx);
if(!(fp = fopen(csvname, "a+")))
{
printf("\nError whilst writing to file\n");
return 1;
}
int R, N;
double MFLOPS;
double elaps;
for(N=1; N<=NMAX; N += inc)
{
R = NR/N;
elaps = simulation(N, R);
MFLOPS = ((R*N)<<1)/(elaps*1000000);
fprintf(fp, "%d,%lf\n", N, MFLOPS);
printf("N = %d, R = %d\n", N, R);
printf("Elapsed time: %lf\n", elaps);
printf("MFLOPS: %lf\n", MFLOPS);
}
fclose(fp);
(void) getchar();
return 0;
}
The code simply loops over N and for each N, it does NR floating point operations, where NR is a constant that stands for the number of constant operations to do at each outermost iteration, in order to take accurate time measurements even for too short N values. The kernel to analyze is obviously the simulation subroutine.
We've got some strange results:
We started with benchmarking the kernel on an E4 E9220 server 2U, consisting of 8 nodes, each of them equipped with dual-socket Intel Xeon E5-2697 V2 (Ivy Bridge) # 2,7 GHz, 12 cores. The code has been compiled with gcc (GCC) 4.8.2, and has been run on Linux CentOS release 6. Below are listed the resulting plots in a single image:
N versus MFlops plots: -Ofast (above) and -Ofast along -march=native (below)
It is straightforward to see that L2 and L3 downhills are pretty visible, and they are numerically OK by doing some simple calculations and taking into account multiprogramming issues and the facts that L2-L3 are UNIFIED and L3 are also SHARED among all 12 cores. In the first plot L1 is not visible, while in the second it is visible and it starts in an N value so the resulting L1D saturation value is exactly 32 KB, according to the per-core L1D size. The first question is: why don't we see L1 downhill without -march=native architecture specialization flag?
After some tricky (obviously wrong) self-explanations, we decided to do the benchmark on a Lenovo Z500, equipped with a single socket Intel Core i7-3632QM (Ivy Bridge) # 2.2 GHz. This time we've used gcc (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406 (from gcc --version), and the resulting plots are listed below:
N versus MFlops plots: -Ofast (above) and -Ofast along -march=native (below)
The second question is somewhat spontaneous: why we see L1D downhill without -march=native- this time?

There are assembly fragments of inner "TRIAD" loop (A[i] = B[i] + C[i]*D[i]: per i iteration 2 double_precision flops, 3 reads of double, 1 write of double).
Exact percents from perf annotate was not very useful, as you profiled all regions with different performance into single run. And long perf report not useful at all, only 5-10 first lines after # are usually needed. You may try to limit the test to the interesting region of 4*N*sizeof(double) < sizeof(L1d_cache) and recollect perf annotate and also get results of perf stat ./program and perf stat -d ./program (and also learn about Intel-specific perf wrapper ocperf.py - https://github.com/andikleen/pmu-tools and other tools there).
From gcc-6.3.0 -Ofast - 128-bit (2 doubles) XMM registers and SSE2 movupd/movups are used (SSE2 is default FPU for x86_64 cpu), 2 iterations of i for every assembler loop (movupd loads 2 doubles from memory)
: A[i] = B[i] + C[i]*D[i];
0.03 : d70: movupd (%r11,%rax,1),%xmm1 # load C[i:i+1] into xmm1
14.87 : d76: add $0x1,%ecx # advance 'i/2' loop counter by 1
0.10 : d79: movupd (%r10,%rax,1),%xmm0 # load D[i:i+1] into xmm0
14.59 : d7f: mulpd %xmm1,%xmm0 # multiply them into xmm0
2.78 : d83: addpd (%r14,%rax,1),%xmm0 # load B[i:i+1] and add to xmm0
17.69 : d89: movups %xmm0,(%rsi,%rax,1) # store into A[i:i+1]
2.71 : d8d: add $0x10,%rax # advance array pointer by 2 doubles (0x10=16=2*8)
1.68 : d91: cmp %edi,%ecx # check for end of loop (edi is N/2)
0.00 : d93: jb d70 <main+0x4c0> # if not, jump to 0xd70
From gcc-6.3.0 -Ofast -march=native: vmovupd are not just vector (SSE2 somethingpd are vector too), they are AVX instructions which may use 2 times wide registers YMM (256 bits, 4 doubles per register). There is longer loop but 4 i iterations are processed per loop iteration
0.02 : db6: vmovupd (%r10,%rdx,1),%xmm0 # load C[i:i+1] into xmm0 (low part of ymm0)
8.42 : dbc: vinsertf128 $0x1,0x10(%r10,%rdx,1),%ymm0,%ymm1 # load C[i+2:i+3] into high part of ymm1 and copy xmm0 into lower part; ymm1 is C[i:i+3]
7.37 : dc4: add $0x1,%esi # loop counter ++
0.06 : dc7: vmovupd (%r9,%rdx,1),%xmm0 # load D[i:i+1] -> xmm0
15.05 : dcd: vinsertf128 $0x1,0x10(%r9,%rdx,1),%ymm0,%ymm0 # load D[i+2:i+3] and get D[i:i+3] in ymm0
0.85 : dd5: vmulpd %ymm0,%ymm1,%ymm0 # mul C[i:i+3] and D[i:i+3] into ymm0
1.65 : dd9: vaddpd (%r11,%rdx,1),%ymm0,%ymm0 # soad 4 doubles of B[i:i+3] and add to ymm0
21.18 : ddf: vmovups %xmm0,(%r8,%rdx,1) # store low 2 doubles to A[i:i+1]
1.24 : de5: vextractf128 $0x1,%ymm0,0x10(%r8,%rdx,1) # store high 2 doubles to A[i+2:i+3]
2.04 : ded: add $0x20,%rdx # advance array pointer by 4 doubles
0.02 : df1: cmp -0x460(%rbp),%esi # loop cmp
0.00 : df7: jb db6 <main+0x506> # loop jump to 0xdb6
The code with AVX enabled (with -march=native) is better as it uses better unroll, but it uses narrow loads of 2 doubles. With more real tests arrays will be better aligned and compiler may select widest 256-bit vmovupd into ymm, without need of insert/extract instructions.
The code you have now probably may be so slow that it is unable to fully load (saturate) interface to L1 data cache in most cases with short arrays. Another possibility is bad alignment between arrays.
You have short spike of high bandwidth in lower graph in https://i.stack.imgur.com/2ovxm.png - 6 "GFLOPS" and it is strange. Do the calculation to convert this into GByte/s and find the L1d bandwidth of Ivy Bridge and limitations of load issue rate... something like https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/532346 "Haswell core can only issue two loads per cycle, so they have to be 256-bit AVX loads to have any chance of achieving a rate of 64 Bytes/cycle." (The word of expert in TRIAD and author of STREAM, John D. McCalpin, PhD "Dr. Bandwidth", do search for his posts) and http://www.overclock.net/t/1541624/how-much-bandwidth-is-in-cpu-cache-and-how-is-it-calculated "L1 bandwidth depends on the instructions per tick and the stride of the instructions (AVX = 256-bit, SSE = 128-bit etc.). IIRC, Sandy Bridge has 1 instruction per tick"

Related

Why cannot my program reach integer addition instruction throughput bound?

I have read chapter 5 of CSAPP 3e. I want to test if the optimization techniques described in the book can work on my computer. I write the following program:
#define SIZE (1024)
int main(int argc, char* argv[]) {
int sum = 0;
int* array = malloc(sizeof(int) * SIZE);
unsigned long long before = __rdtsc();
for (int i = 0; i < SIZE; ++i) {
sum += array[i];
}
unsigned long long after = __rdtsc();
double cpe = (double)(after - before) / SIZE;
printf("CPE is %f\n", cpe);
printf("sum is %d\n", sum);
return 0;
}
and it reports the CPE is around 1.00.
I transform the program using the 4x4 loop unrolling technique and it leads to the following program:
#define SIZE (1024)
int main(int argc, char* argv[]) {
int sum = 0;
int* array = malloc(sizeof(int) * SIZE);
int sum0 = 0;
int sum1 = 0;
int sum2 = 0;
int sum3 = 0;
/* 4x4 unrolling */
unsigned long long before = __rdtsc();
for (int i = 0; i < SIZE; i += 4) {
sum0 += array[i];
sum1 += array[i + 1];
sum2 += array[i + 2];
sum3 += array[i + 3];
}
unsigned long long after = __rdtsc();
sum = sum0 + sum1 + sum2 + sum3;
double cpe = (double)(after - before) / SIZE;
printf("CPE is %f\n", cpe);
printf("sum is %d\n", sum);
return 0;
}
Note that I omit the code to handle the situation when SIZE is not a multiple of 4. This program reports the CPE is around 0.80.
My program runs on an AMD 5950X, and according to AMD's software optimization manual (https://developer.amd.com/resources/developer-guides-manuals/), the integer addition instruction has a latency of 1 cycle and throughput of 4 instructions per cycle. It also has a load-store unit which could execute three independent load operations at the same time. My expectation of the CPE is 0.33, and I do not know why the result is so much higher.
My compiler is gcc 12.2.0. All programs are compiled with flags -Og.
I checked the assembly code of the optimized program, but found nothing helpful:
.L4:
movslq %r9d, %rcx
addl (%r8,%rcx,4), %r11d
addl 4(%r8,%rcx,4), %r10d
addl 8(%r8,%rcx,4), %ebx
addl 12(%r8,%rcx,4), %esi
addl $4, %r9d
.L3:
cmpl $127, %r9d
jle .L4
I assume at least 3 of the 4 addl instructions should execute in parallel. However, the result of the program does not meet my expectation.

cmpl $127, %r9d is not a large iteration count compared to rdtsc overhead and the branch mispredict when you exit the loop, and time for the CPU to ramp up to max frequency.
Also, you want to measure core clock cycles, not TSC reference cycles. Put the loop in a static executable (for minimal startup overhead) and run it with perf stat to get core clocks for the whole process. (As in Can x86's MOV really be "free"? Why can't I reproduce this at all? or some perf experiments I've posted in other answers.)
See Idiomatic way of performance evaluation?
10M to 1000M total iterations is appropriate since that's still under a second and we only want to measure steady-state behaviour, not cold-cache or cold-branch-predictor effect. Or page-faults. Interrupt overhead tends to be under 1% on an idle system. Use perf stat --all-user to only count user-space cycles and instructions.
If you want to do it over an array (instead of just removing the pointer increment from the asm), do many passes over a small (16K) array so they all hit in L1d cache. Use a nested loop, or use an and to wrap an index.
Doing that, yes you should be able to measure the 3/clock throughput of add mem, reg on Zen3 and later, even if you leave in the movslq overhead and crap like that from compiler -Og output.
When you're truly micro-benchmarking to find out stuff about throughput of one form of one instruction, it's usually easier to write asm by hand than to coax a compiler into emitting the loop you want. (As long as you know enough asm to avoid pitfalls, e.g. .balign 64 before the loop just for good measure, to hopefully avoid front-end bottlenecks.)
See also https://uops.info/ for how they measure; for any given test, you can click on the link to see the asm loop body for the experiments they ran, and the raw perf counter outputs for each variation on the test. (Although I have to admit I forget what MPERF and APERF mean for AMD CPUs; the results for Intel CPUs are more obvious.) e.g. https://uops.info/html-tp/ZEN3/ADD_R32_M32-Measurements.html is the Zen3 results, which includes a test of 4 or 8 independent add reg, [r14+const] instructions as the inner loop body.
They also tested with an indexed addressing mode. With "With unroll_count=200 and no inner loop" they got identical results for MPERF / APERF / UOPS for 4 independent adds, with indexed vs. non-indexed addressing modes. (Their loops don't have a pointer increment.)

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.
Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)
I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:
// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
if (mask & (1ull << i)) {
target[i]++
}
}
The problem is that I need do the above loops on tens of millions of masks and I need to finish in less than a second. Wonder if there are any way to speed it up, like using some sort special assembly instruction that represents the above loop.
Currently I use gcc 4.8.4 on ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code and took about 2 seconds. Would love to make it run under 200ms.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];
int main(int argc, char *argv[]) {
int i, j;
unsigned long x = 123;
unsigned long m = 1;
char *p = malloc(8 * 10000000);
if (!p) {
printf("failed to allocate\n");
exit(0);
}
memset(p, 0xff, 80000000);
printf("p=%p\n", p);
unsigned long *pLong = (unsigned long*)p;
double start = getTS();
for (j = 0; j < 10000000; j++) {
m = 1;
for (i = 0; i < 64; i++) {
if ((pLong[j] & m) == m) {
target[i]++;
}
m = (m << 1);
}
}
printf("took %f secs\n", getTS() - start);
return 0;
}
Thanks in advance!

On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. it runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocate\n");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%p\n", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }\n");
printf("took %f secs\n", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.

Related:
an earlier duplicate has some alternate ideas: How to quickly count bits into separate bins in a series of ints on Sandy Bridge?.
Harold's answer on AVX2 column population count algorithm over each bit-column separately.
Matrix transpose and population count has a couple useful answers with AVX2, including benchmarks. It uses 32-bit chunks instead of 64-bit.
Also: https://github.com/mklarqvist/positional-popcount has SSE blend, various AVX2, various AVX512 including Harley-Seal which is great for large arrays, and various other algorithms for positional popcount. Possibly only for uint16_t, but most could be adapted for other word widths. I think the algorithm I propose below is what they call adder_forest.
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize your loop-over-bits for you, even if you write it branchlessly to give them a better chance.
And unfortunately not smart enough to auto-vectorize the fast version that gradually widens and adds.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is good: Unpack bits to something narrower than the final count array gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to accumulate into the 32-bit counter array.
It only takes 4x 16-byte __m128i vectors to hold all 64 uint8_t elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[] once at the very end, after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
An even better strategy is to widen gradually
Starting with 2-bit accumulators, then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data, not "diluting" it too much right away. Higher information / entropy density means that each instruction does more useful work.
Using SWAR techniques for 32x 2-bit add inside scalar or SIMD registers is easy / cheap because we need to avoid the possibility of carry out the top of an element anyway. With proper SIMD, we'd lose those counts, with SWAR we'd corrupt the next element.
uint64_t x = *(input++); // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555; // 0b...01010101;
uint64_t lo = x & even_1bits;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo; // can do up to 3 iterations of this without overflow
accum2_hi += hi; // because a 2-bit integer overflows at 4
Then you repeat up to 4 vectors of 4-bit elements, then 8 vectors of 8-bit elements, then you should widen all the way to 32 and accumulate into the array in memory because you'll run out of registers anyway, and this outer outer loop work is infrequent enough that we don't need to bother with going to 16-bit. (Especially if we manually vectorize).
Biggest downside: this doesn't auto-vectorize, unlike #njuffa's version. But with gcc -O3 -march=sandybridge for AVX1 (then running the code on Skylake), this running scalar 64-bit is actually still slightly faster than 128-bit AVX auto-vectorized asm from #njuffa's code.
But that's timing on Skylake, which has 4 scalar ALU ports (and mov-elimination), while Sandybridge lacks mov-elimination and only has 3 ALU ports, so the scalar code will probably hit back-end execution-port bottlenecks. (But SIMD code may be nearly as fast, because there's plenty of AND / ADD mixed with the shifts, and SnB does have SIMD execution units on all 3 of its ports that have any ALUs on them. Haswell just added port 6, for scalar-only including shifts and branches.)
With good manual vectorization, this should be a factor of almost 2 or 4 faster.
But if you have to choose between this scalar or #njuffa's with AVX2 autovectorization, #njuffa's is faster on Skylake with -march=native
If building on a 32-bit target is possible/required, this suffers a lot (without vectorization because of using uint64_t in 32-bit registers), while vectorized code barely suffers at all (because all the work happens in vector regs of the same width).
// TODO: put the target[] re-ordering somewhere
// TODO: cleanup for N not a multiple of 3*4*21 = 252
// TODO: manual vectorize with __m128i, __m256i, and/or __m512i
void sum_gradual_widen (const uint64_t *restrict input, unsigned int *restrict target, size_t length)
{
const uint64_t *endp = input + length - 3*4*21; // 252 masks per outer iteration
while(input <= endp) {
uint64_t accum8[8] = {0}; // 8-bit accumulators
for (int k=0 ; k<21 ; k++) {
uint64_t accum4[4] = {0}; // 4-bit accumulators can hold counts up to 15. We use 4*3=12
for(int j=0 ; j<4 ; j++){
uint64_t accum2_lo=0, accum2_hi=0;
for(int i=0 ; i<3 ; i++) { // the compiler should fully unroll this
uint64_t x = *input++; // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555;
uint64_t lo = x & even_1bits; // 0b...01010101;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo;
accum2_hi += hi; // can do up to 3 iterations of this without overflow
}
const uint64_t even_2bits = 0x3333333333333333;
accum4[0] += accum2_lo & even_2bits; // 0b...001100110011; // same constant 4 times, because we shift *first*
accum4[1] += (accum2_lo >> 2) & even_2bits;
accum4[2] += accum2_hi & even_2bits;
accum4[3] += (accum2_hi >> 2) & even_2bits;
}
for (int i = 0 ; i<4 ; i++) {
accum8[i*2 + 0] += accum4[i] & 0x0f0f0f0f0f0f0f0f;
accum8[i*2 + 1] += (accum4[i] >> 4) & 0x0f0f0f0f0f0f0f0f;
}
}
// char* can safely alias anything.
unsigned char *narrow = (uint8_t*) accum8;
for (int i=0 ; i<64 ; i++){
target[i] += narrow[i];
}
}
/* target[0] = bit 0
* target[1] = bit 8
* ...
* target[8] = bit 1
* target[9] = bit 9
* ...
*/
// TODO: 8x8 transpose
}
We don't care about order, so accum4[0] has 4-bit accumulators for every 4th bit, for example. The final fixup needed (but not yet implemented) at the very end is an 8x8 transpose of the uint32_t target[64] array, which can be done efficiently using unpck and vshufps with only AVX1. (Transpose an 8x8 float using AVX/AVX2). And also a cleanup loop for the last up to 251 masks.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
Benchmark results on Arch Linux i7-6700k from #njuffa's test harness, with this added. (Godbolt) N = (10000000 / (3*4*21) * 3*4*21) = 9999864 (i.e. 10000000 rounded down to a multiple of the 252 iteration "unroll" factor, so my simplistic implementation is doing the same amount of work, not counting re-ordering target[] which it doesn't do, so it does print mismatch results.
But the printed counts match another position of the reference array.)
I ran the program 4x in a row (to make sure the CPU was warmed up to max turbo) and took one of the runs that looked good (none of the 3 times abnormally high).
ref: the best bit-loop (next section)
fast: #njuffa's code. (auto-vectorized with 128-bit AVX integer instructions).
gradual: my version (not auto-vectorized by gcc or clang, at least not in the inner loop.) gcc and clang fully unroll the inner 12 iterations.
gcc8.2 -O3 -march=sandybridge -fpie -no-pie
ref: 0.331373 secs, fast: 0.011387 secs, gradual: 0.009966 secs
gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.397175 secs, fast: 0.011255 secs, gradual: 0.010018 secs
clang7.0 -O3 -march=sandybridge -fpie -no-pie
ref: 0.352381 secs, fast: 0.011926 secs, gradual: 0.009269 secs (very low counts for port 7 uops, clang used indexed addressing for stores)
clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.293014 secs, fast: 0.011777 secs, gradual: 0.009235 secs
-march=skylake (allowing AVX2 for 256-bit integer vectors) helps both, but #njuffa's most because more of it vectorizes (including its inner-most loop):
gcc8.2 -O3 -march=skylake -fpie -no-pie
ref: 0.328725 secs, fast: 0.007621 secs, gradual: 0.010054 secs (gcc shows no gain for "gradual", only "fast")
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.333922 secs, fast: 0.007620 secs, gradual: 0.009866 secs
clang7.0 -O3 -march=skylake -fpie -no-pie
ref: 0.260616 secs, fast: 0.007521 secs, gradual: 0.008535 secs (IDK why gradual is faster than -march=sandybridge; it's not using BMI1 andn. I guess because it's using 256-bit AVX2 for the k=0..20 outer loop with vpaddq)
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.259159 secs, fast: 0.007496 secs, gradual: 0.008671 secs
Without AVX, just SSE4.2: (-march=nehalem), bizarrely clang's gradual is faster than with AVX / tune=sandybridge. "fast" is only barely slower than with AVX.
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.337178 secs, fast: 0.011983 secs, gradual: 0.010587 secs
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.293555 secs, fast: 0.012549 secs, gradual: 0.008697 secs
-fprofile-generate / -fprofile-use help some for GCC, especially for the "ref" version where it doesn't unroll at all by default.
I highlighted the best, but often they're within measurement noise margin of each other. It's unsurprising the -fno-pie -no-pie was sometimes faster: indexing static arrays with [disp32 + reg] is not an indexed addressing mode, just base + disp32, so it doesn't ever unlaminate on Sandybridge-family CPUs.
But with gcc sometimes -fpie was faster; I didn't check but I assume gcc just shot itself in the foot somehow when 32-bit absolute addressing was possible. Or just innocent-looking differences in code-gen happened to cause alignment or uop-cache problems; I didn't check in detail.
For SIMD, we can simply do 2 or 4x uint64_t in parallel, only accumulating horizontally in the final step where we widen bytes to 32-bit elements. (Perhaps by shuffling in-lane and then using pmaddubsw with a multiplier of _mm256_set1_epi8(1) to add horizontal byte pairs into 16-bit elements.)
TODO: manually-vectorized __m128i and __m256i (and __m512i) versions of this. Should be close to 2x, 4x, or even 8x faster than the "gradual" times above. Probably HW prefetch can still keep up with it, except maybe an AVX512 version with data coming from DRAM, especially if there's contention from other threads. We do a significant amount of work per qword we read.
Obsolete code: improvements to the bit-loop
Your portable scalar version can be improved, too, speeding it up from ~1.92 seconds (with a 34% branch mispredict rate overall, with the fast loops commented out!) to ~0.35sec (clang7.0 -O3 -march=sandybridge) with a properly random input on 3.9GHz Skylake. Or 1.83 sec for the branchy version with != 0 instead of == m, because compilers fail to prove that m always has exactly 1 bit set and/or optimize accordingly.
(vs. 0.01 sec for #njuffa's or my fast version above, so this is pretty useless in an absolute sense, but it's worth mentioning as a general optimization example of when to use branchless code.)
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += (pLong[j] & m != 0);, using bool->int conversion to get a 0 / 1 integer.
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much much worse than we can do with SSE2 or AVX, or even scalar using #chrqlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):
for (int j = 0; j < 10000000; j++) {
#if 1 // extract low bit directly
unsigned long long tmp = pLong[j];
for (int i=0 ; i<64 ; i++) { // while(tmp) could mispredict, but good for sparse data
target[i] += tmp&1;
tmp >>= 1;
}
#else // bool -> int shifting a mask
unsigned long m = 1;
for (i = 0; i < 64; i++) {
target[i]+= (pLong[j] & m) != 0;
m = (m << 1);
}
#endif
Note that unsigned long is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode), and Windows x64. Or in 32-bit ABIs like i386 System V.
Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uops in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2: # =>This Loop Header: Depth=1
# outer loop loads a uint64 from the src
mov rdx, qword ptr [r14 + 8*rbx]
mov rsi, -256
.LBB1_3: # Parent Loop BB1_2 Depth=1
# do {
mov edi, edx
and edi, 1 # isolate the low bit
add dword ptr [rsi + target+256], edi # and += into target
mov edi, edx
shr edi
and edi, 1 # isolate the 2nd bit
add dword ptr [rsi + target+260], edi
shr rdx, 2 # tmp >>= 2;
add rsi, 8
jne .LBB1_3 # } while(offset += 8 != 0);
This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.
If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data, everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().

One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ... 56, second accumulator counts bits in positions 1, 9, 17, ... 57; and so on. After we are finished processing a block of data, we transfers the counts form the byte-wise accumulator into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:
/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 # 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:
p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs
where ref refers to the asker's original solution. The speed-up here is about a factor 74x. Different speed-ups will be observed with other (and especially newer) compilers.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
/*
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
#define N (10000000)
#define BITS (64)
#define BLOCK_SIZE (255)
/* cupdate the count of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
int main (void)
{
double start_ref, stop_ref, start, stop;
uint64_t *pLong;
unsigned int target_ref [BITS] = {0};
unsigned int target [BITS] = {0};
int i, j;
pLong = malloc (sizeof(pLong[0]) * N);
if (!pLong) {
printf("failed to allocate\n");
return EXIT_FAILURE;
}
printf("p=%p\n", pLong);
/* init data */
for (j = 0; j < N; j++) {
pLong[j] = KISS64;
}
/* count bits slowly */
start_ref = second();
for (j = 0; j < N; j++) {
uint64_t m = 1;
for (i = 0; i < BITS; i++) {
if ((pLong[j] & m) == m) {
target_ref[i]++;
}
m = (m << 1);
}
}
stop_ref = second();
/* count bits fast */
start = second();
for (j = 0; j < N / BLOCK_SIZE; j++) {
sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
}
sum_block (pLong, target, j * BLOCK_SIZE, N);
stop = second();
/* check whether result is correct */
for (i = 0; i < BITS; i++) {
if (target[i] != target_ref[i]) {
printf ("error # %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
}
}
/* print benchmark results */
printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
return EXIT_SUCCESS;
}

For starters, the problem of unpacking the bits, because seriously, you do not want to test each bit individually.
So just follow the following strategy for unpacking the bits into bytes of a vector: https://stackoverflow.com/a/24242696/2879325
Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you would have to expect potential overflows, so you need to transfer.
After each block of 255, unpack again to 32bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of byte accumulators).
At 8 instructions per bitmask (4 per each lower and higher 32-bit with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second and core on an recent CPU.
typedef uint64_t T;
const size_t bytes = 8;
const size_t bits = bytes * 8;
const size_t block_size = 128;
static inline __m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x); // we only use the low 32bits of each lane, but this is fine with AVX2
// Each byte gets the source byte containing the corresponding bit
const __m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
const __m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);
// this is the extra step: byte == 0 ? 0 : -1
return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
}
void bitcount_vectorized(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count - (count % block_size); outer += block_size)
{
__m256i temp_accumulator[bits / 32] = { _mm256_setzero_si256() };
for (size_t inner = 0; inner < block_size; ++inner) {
for (size_t j = 0; j < bits / 32; j++)
{
const auto unpacked = expand_bits_to_bytes(static_cast<uint32_t>(data[outer + inner] >> (j * 32)));
temp_accumulator[j] = _mm256_sub_epi8(temp_accumulator[j], unpacked);
}
}
for (size_t j = 0; j < bits; j++)
{
accumulator[j] += ((uint8_t*)(&temp_accumulator))[j];
}
}
for (size_t outer = count - (count % block_size); outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
void bitcount_naive(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
Depending on the chose compiler, the vectorized form achieved roughly a factor 25 speedup over the naive one.
On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.
Surprisingly, this is actually still 50% slower than the solution proposed by #njuffa.

See
Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire (7 Nov 2019)
Faster Population Counts using AVX2 Instructions by Wojciech Muła, Nathan Kurz, Daniel Lemire (23 Nov 2016).
Basically, each full adder compresses 3 inputs to 2 outputs. So one can eliminate an entire 256-bit word for the price of 5 logic instructions. The full adder operation could be repeated until registers become exhausted. Then results in the registers are accumulated (as seen in most of the other answers).
Positional popcnt for 16-bit subwords is implemented here:
https://github.com/mklarqvist/positional-popcount
// Carry-Save Full Adder (3:2 compressor)
b ^= a;
a ^= c;
c ^= b; // xor sum
b |= a;
b ^= c; // carry
Note: the accumulate step for positional-popcnt is more expensive than for normal simd popcnt. Which I believe makes it feasible to add a couple of half-adders to the end of the CSU, it might pay to go all the way up to 256 words before accumulating.

Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:
_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
eightFloats);
Is the way to go for that extra bit of performance and avoiding a pipeline stall.
Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?
_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
_mm256_set1_ps(0.1))),
eightFloats);
My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:
float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
int thisSqrt = (int) floor(sqrt(vals[i]));
sum += min(thisSqrt, 0x3F);
}
All the values of vals should be integer values. (Why everything isn't just int is a different question...)

tl;dr: See the end for code that compiles and should work.
To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x)). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.
The thing you're missing:
Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.
Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).
Also:
Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.
It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).
63*1.5*2^(-12) == 0.023071...
approx_sqrt(63*63-1) == 62.99206... +/- 0.023068..
Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:
define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140
Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?
You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63) to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2-12 requires more instructions. I don't see any tricks a NR iteration.
If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.
; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087 # (x+0.5)/sqrt(x+0.5)
~0.00732879495710064718 ~0.00731963968187500662 # relative error
Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.
Anyway, I think you need one Newton-Raphson iteration.
The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).
FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.
Compiler output: a decent starting point
I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).
sum_fp doesn't auto-vectorize, even with the full -ffast-math.
However clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX2, otherwise converting back to integer will give the "indefinite" result of INT_MIN. (sign bit set, all other bits cleared). This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.
See the asm output on godbolt from:
// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
int sum = 0;
for (int i = 0; i < MAX; i++) {
int thisSqrt = (int) sqrtf(vals[i]);
sum += std::min(thisSqrt, 0x3F);
}
return sum;
}
To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.
putting it together:
I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).
As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).
I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.
Horizontal sum from this answer.
Complete but untested version:
#include <immintrin.h>
#define MAX 4096
// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
// clang / gcc usually use -3.0 and -0.5. We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
__m256 three = _mm256_set1_ps(3.0f);
//__m256 half = _mm256_set1_ps(0.5f); // we omit the *0.5 step
__m256 nr = _mm256_rsqrt_ps( x ); // initial approximation for Newton-Raphson
// 1/sqrt(x) ~= nr * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
// sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
// 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
__m256 xnr = _mm256_mul_ps( x, nr );
__m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three ); // -(xnr*nr) + 3
return _mm256_mul_ps( xnr, three_minus_muls );
}
// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
__m256 offset = _mm256_set1_ps(0.5f); // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
__m256 xoff = _mm256_add_ps(x, offset); // avoids 0*Inf = NaN, and rounding error before truncation
__m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
__m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
return _mm256_srli_epi32(i2sqrtx, 1); // divide by 2 with truncation
// alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}
__m256i isqrt256_ps_simple_exact(__m256 x) {
__m256 sqrt_x = _mm256_sqrt_ps(x);
__m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
return isqrtx;
}
int hsum_epi32_avx(__m256i x256){
__m128i xhi = _mm256_extracti128_si256(x256, 1);
__m128i xlo = _mm256_castsi256_si128(x256);
__m128i x = _mm_add_epi32(xlo, xhi);
__m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
hl = _mm_add_epi32(hl, x);
x = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
hl = _mm_add_epi32(hl, x);
return _mm_cvtsi128_si32(hl);
}
int sum_int_avx(float vals[]){
__m256i sum = _mm256_setzero_si256();
__m256i upperlimit = _mm256_set1_epi32(0x3F);
for (int i = 0; i < MAX; i+=8) {
__m256 v = _mm256_loadu_ps(vals+i);
__m256i visqrt = isqrt256_ps(v);
// assert visqrt == isqrt256_ps_simple_exact(v) or something
visqrt = _mm256_min_epi32(visqrt, upperlimit);
sum = _mm256_add_epi32(sum, visqrt);
}
return hsum_epi32_avx(sum);
}
Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code that gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.
Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. It probably didn't notice the x*initial_guess(x) common-subexpressions that occur in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's the cmpps/blendvps.
sum_int_avx(float*):
vpxor ymm3, ymm3, ymm3
xor eax, eax
vbroadcastss ymm0, dword ptr [rip + .LCPI4_0] ; set1(0.5)
vbroadcastss ymm1, dword ptr [rip + .LCPI4_1] ; set1(3.0)
vpbroadcastd ymm2, dword ptr [rip + .LCPI4_2] ; set1(63)
LBB4_1: ; latencies
vaddps ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
vrsqrtps ymm5, ymm4 ; 7c
vmulps ymm4, ymm4, ymm5 ; x*nr ; 5c
vfnmadd213ps ymm5, ymm4, ymm1 ; 5c
vmulps ymm4, ymm4, ymm5 ; 5c
vcvttps2dq ymm4, ymm4 ; 3c
vpsrld ymm4, ymm4, 1 ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
vpminsd ymm4, ymm4, ymm2 ; 1c
vpaddd ymm3, ymm4, ymm3 ; 1c
; ... (those 9 insns repeated: loop unrolling)
add rax, 16
cmp rax, 4096
jl .LBB4_1
;... horizontal sum
IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.
Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.
Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.

Decimal concatenation of two integers using bit operations

I want to concatenate two integers using only bit operations since i need efficiency as much as possible.There are various answers available but they are not fast enough what I want is implementation that uses only bit operations like left shift,or and etc.
Please guide me how to do it.
example
int x=32;
int y=12;
int result=3212;
I am having and FPGA implentation of AES.I need this on my system to reduce time consumption of some task

The most efficient way to do it, is likely something similar to this:
uint32_t uintcat (uint32_t ms, uint32_t ls)
{
uint32_t mult=1;
do
{
mult *= 10;
} while(mult <= ls);
return ms * mult + ls;
}
Then let the compiler worry about optimization. There's likely not a lot it can improve since this is base 10, which doesn't get along well with the various instructions of the computer, like shifting.
EDIT : BENCHMARK TEST
Intel i7-3770 2 3,4 GHz
OS: Windows 7/64
Mingw, GCC version 4.6.2
gcc -O3 -std=c99 -pedantic-errors -Wall
10 million random values, from 0 to 3276732767.
Result (approximates):
Algorithm 1: 60287 micro seconds
Algorithm 2: 65185 micro seconds
Benchmark code used:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>
#include <time.h>
uint32_t uintcat (uint32_t ms, uint32_t ls)
{
uint32_t mult=1;
do
{
mult *= 10;
} while(mult <= ls);
return ms * mult + ls;
}
uint32_t myConcat (uint32_t a, uint32_t b) {
switch( (b >= 10000000) ? 7 :
(b >= 1000000) ? 6 :
(b >= 100000) ? 5 :
(b >= 10000) ? 4 :
(b >= 1000) ? 3 :
(b >= 100) ? 2 :
(b >= 10) ? 1 : 0 ) {
case 1: return a*100+b; break;
case 2: return a*1000+b; break;
case 3: return a*10000+b; break;
case 4: return a*100000+b; break;
case 5: return a*1000000+b; break;
case 6: return a*10000000+b; break;
case 7: return a*100000000+b; break;
default: return a*10+b; break;
}
}
static LARGE_INTEGER freq;
static void print_benchmark_results (LARGE_INTEGER* start, LARGE_INTEGER* end)
{
LARGE_INTEGER elapsed;
elapsed.QuadPart = end->QuadPart - start->QuadPart;
elapsed.QuadPart *= 1000000;
elapsed.QuadPart /= freq.QuadPart;
printf("%lu micro seconds", elapsed.QuadPart);
}
int main()
{
const uint32_t TEST_N = 10000000;
uint32_t* data1 = malloc (sizeof(uint32_t) * TEST_N);
uint32_t* data2 = malloc (sizeof(uint32_t) * TEST_N);
volatile uint32_t* result_algo1 = malloc (sizeof(uint32_t) * TEST_N);
volatile uint32_t* result_algo2 = malloc (sizeof(uint32_t) * TEST_N);
srand (time(NULL));
// Mingw rand() apparently gives numbers up to 32767
// worst case should therefore be 3,276,732,767
// fill up random data in arrays
for(uint32_t i=0; i<TEST_N; i++)
{
data1[i] = rand();
data2[i] = rand();
}
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start, end;
// run algorithm 1
QueryPerformanceCounter(&start);
for(uint32_t i=0; i<TEST_N; i++)
{
result_algo1[i] = uintcat(data1[i], data2[i]);
}
QueryPerformanceCounter(&end);
// print results
printf("Algorithm 1: ");
print_benchmark_results(&start, &end);
printf("\n");
// run algorithm 2
QueryPerformanceCounter(&start);
for(uint32_t i=0; i<TEST_N; i++)
{
result_algo2[i] = myConcat(data1[i], data2[i]);
}
QueryPerformanceCounter(&end);
// print results
printf("Algorithm 2: ");
print_benchmark_results(&start, &end);
printf("\n\n");
// sanity check both algorithms against each other
for(uint32_t i=0; i<TEST_N; i++)
{
if(result_algo1[i] != result_algo2[i])
{
printf("Results mismatch for %lu %lu. Expected: %lu%lu, algo1: %lu, algo2: %lu\n",
data1[i],
data2[i],
data1[i],
data2[i],
result_algo1[i],
result_algo2[i]);
}
}
// clean up
free((void*)data1);
free((void*)data2);
free((void*)result_algo1);
free((void*)result_algo2);
}

Bit operations use the binary representation of the numbers. However what you try to achieve is to concatenate the numbers in decimal notation. Please note that concatenating the decimal representations has little to do with concatenating the binary representations. Though it is theoretically possible to solve the problem using binary operations I am sure it will be far from the most efficient way.

We need to calculate a*10^N + b very fast.
Bit operations isn't a best idea to optimize it (even using tricks like a := (a<<1) + (a<<3) <=> a := a*10 as compiler can make it himself).
The first problem is to calculate 10^N, but there is no need to calculate it, there is just 9 possible values.
The second problem is to calculate N from b (length of 10 representation). If your data have uniform distribution you can minimize operations count in average case.
Check b <= 10^9, b <= 10^8, ..., b <= 10 with ()?: (it's faster than if( ) after optimizations, it has much simplier grammar and functionality), call the result N. Next, make switch(N) with lines "return a*10^N + b" (where 10^N is constant). As I know, switch() with 3-4 "case" is faster than same if( ) construction after optimizations.
unsigned int myConcat(unsigned int& a, unsigned int& b) {
switch( (b >= 10000000) ? 7 :
(b >= 1000000) ? 6 :
(b >= 100000) ? 5 :
(b >= 10000) ? 4 :
(b >= 1000) ? 3 :
(b >= 100) ? 2 :
(b >= 10) ? 1 : 0 ) {
case 1: return a*100+b; break;
case 2: return a*1000+b; break;
case 3: return a*10000+b; break;
case 4: return a*100000+b; break;
case 5: return a*1000000+b; break;
case 6: return a*10000000+b; break;
case 7: return a*100000000+b; break;
default: return a*10+b; break;
// I don't really know what to do here
//case 8: return a*1000*1000*1000+b; break;
//case 9: return a*10*1000*1000*1000+b; break;
}
}
As you can see, there is 2-3 operations in average case + optimisations is very effective here. I've benchmarked it in comparison with Lundin's suggestion, here is the result. 0ms vs 100ms

If you care about decimal digit concatenation, you might want to simply do that as you're printing, and convert both numbers to a sequence of digits sequentially. e.g. How do I print an integer in Assembly Level Programming without printf from the c library? shows an efficient C function, as well as asm. Call it twice into the same buffer.
#Lundin's answer loops increasing powers of 10 to find
the right decimal-shift, i.e. a linear search for the right power of 10. If it's called very frequently so a lookup table can stay hot in cache, a speedup maybe be possible.
If you can use GNU C __builtin_clz (Count Leading Zeros) or some other way of quickly finding the MSB position of the right-hand input (ls, the least-significant part of the resulting concatenation), you can start the search for the right mult from a 32-entry lookup table. (And you only have to check at most one more iteration, so it's not a loop.)
Most of the common modern CPU architectures have a HW instruction that the compiler can use directly or with a bit of processing to implement clz. https://en.wikipedia.org/wiki/Find_first_set#Hardware_support. (And on all but x86, the result is well-defined for an input of 0, but unfortunately GNU C doesn't portably give us access to that.)
If the table stays hot in L1d cache, this can be pretty good. The extra latency of a clz and a table lookup are comparable to a couple iterations of the loop (on a modern x86 like Skylake or Ryzen for example, where bsf or tzcnt is 3 cycle latency, L1d latency is 4 or 5 cycles, imul latency is 3 cycles.)
Of course, on many architectures (including x86), multiplying by 10 is cheaper than by a runtime variable, using shift and add. 2 LEA instructions on x86, or an add+lsl on ARM/AArch64 using a shifted input to do tmp = x + x*4 with the add. So on Intel CPUs, we're only looking at a 2-cycle loop-carried dependency chain, not 3. But AMD CPUs have slower LEA when using a scaled index.
This doesn't sound great for small numbers. But it can reduce branch mispredictions by needing at most one iteration. It even makes a branchless implementation possible. And it means less total work for large lower parts (large powers of 10). But large integers will easily overflow unless you use a wider result type.
Unfortunately, 10 is not a power of 2, so the MSB position alone can't give us the exact power of 10 to multiply by. e.g. all numbers from 64 to 127 all have MSB = 1<<7, but some of them have 2 decimal digits and some have 3. Since we want to avoid division (because it requires a multiplication by a magic constant and shifting the high half), we want to always start with the lower power of 10 and see if that's big enough.
But fortunately, a bitscan does get us within one power of 10 so we no longer need a loop.
I probably wouldn't have written the part with _lzcnt_u32 or ARM __clz if I'd learned of the clz(a|1) trick for avoiding problems with input=0 beforehand. But I did, and played around with the source a bit to try to get nicer asm from gcc and clang. Index the table on clz or BSR depending on target platform makes it a bit of a mess.
#include <stdint.h>
#include <limits.h>
#include <assert.h>
// builtin_clz matches Intel's docs for x86 BSR: garbage result for input=0
// actual x86 HW leaves the destination register unmodified; AMD even documents this.
// but GNU C doesn't let us take advantage with intrinsics.
// unless you use BMI1 _lzcnt_u32
// if available, use an intrinsic that gives us a leading-zero count
// *without* an undefined result for input=0
#ifdef __LZCNT__ // x86 CPU feature
#include <immintrin.h> // Intel's intrinsics
#define HAVE_LZCNT32
#define lzcnt32(a) _lzcnt_u32(a)
#endif
#ifdef __ARM__ // TODO: do older ARMs not have this?
#define HAVE_LZCNT32
#define lzcnt32(a) __clz(a) // builtin, no header needed
#endif
// Some POWER compilers define `__cntlzw`?
// index = msb position, or lzcnt, depending on which the HW can do more efficiently
// defined later; one or the other is unused and optimized out, depending on target platform
// alternative: fill this at run-time startup
// with a loop that does mult*=10 when (x<<1)-1 > mult, or something
//#if INDEX_BY_MSB_POS == 1
__attribute__((unused))
static const uint32_t catpower_msb[] = {
10, // 1 and 0
10, // 2..3
10, // 4..7
10, // 8..15
100, // 16..31 // 2 digits even for the low end of the range
100, // 32..63
100, // 64..127
1000, // 128..255 // 3 digits
1000, // 256..511
1000, // 512..1023
10000, // 1024..2047
10000, // 2048..4095
10000, // 4096..8191
10000, // 8192..16383
100000, // 16384..32767
100000, // 32768..65535 // up to 2^16-1, enough for 16-bit inputs
// ... // fill in the rest yourself
};
//#elif INDEX_BY_MSB_POS == 0
// index on leading zeros
__attribute__((unused))
static const uint32_t catpower_lz32[] = {
// top entries overflow: 10^10 doesn't fit in uint32_t
// intentionally wrong to make it easier to spot bad output.
4000000000, // 2^31 .. 2^32-1 2*10^9 .. 4*10^9
2000000000, // 1,073,741,824 .. 2,147,483,647
// first correct entry
1000000000, // 536,870,912 .. 1,073,741,823
// ... fill in the rest
// for testing, skip until 16 leading zeros
[16] = 100000, // 32768..65535 // up to 2^16-1, enough for 16-bit inputs
100000, // 16384..32767
10000, // 8192..16383
10000, // 4096..8191
10000, // 2048..4095
10000, // 1024..2047
1000, // 512..1023
1000, // 256..511
1000, // 128..255
100, // 64..127
100, // 32..63
100, // 16..31 // low end of the range has 2 digits
10, // 8..15
10, // 4..7
10, // 2..3
10, // 1
// lzcnt32(0) == 32
10, // 0 // treat 0 as having one significant digit.
};
//#else
//#error "INDEX_BY_MSB_POS not set correctly"
//#endif
//#undef HAVE_LZCNT32 // codegen for the other path, for fun
static inline uint32_t msb_power10(uint32_t a)
{
#ifdef HAVE_LZCNT32 // 0-safe lzcnt32 macro available
#define INDEX_BY_MSB_POS 0
// a |= 1 would let us shorten the table, in case 32*4 is a lot nicer than 33*4 bytes
unsigned lzcnt = lzcnt32(a); // 32 for a=0
return catpower_lz32[lzcnt];
#else
// only generic __builtin_clz available
static_assert(sizeof(uint32_t) == sizeof(unsigned) && UINT_MAX == (1ULL<<32)-1, "__builtin_clz isn't 32-bit");
// See also https://foonathan.net/blog/2016/02/11/implementation-challenge-2.html
// for C++ templates for fixed-width wrappers for __builtin_clz
#if defined(__i386__) || defined(__x86_64__)
// x86 where MSB_index = 31-clz = BSR is most efficient
#define INDEX_BY_MSB_POS 1
unsigned msb = 31 - __builtin_clz(a|1); // BSR
return catpower_msb[msb];
//return unlikely(a==0) ? 10 : catpower_msb[msb];
#else
// use clz directly while still avoiding input=0
// I think all non-x86 CPUs with hardware CLZ do define clz(0) = 32 or 64 (the operand width),
// but gcc's builtin is still documented as not valid for input=0
// Most ISAs like PowerPC and ARM that have a bitscan instruction have clz, not MSB-index
// set the LSB to avoid the a==0 special case
unsigned clz = __builtin_clz(a|1);
// table[32] unused, could add yet another #ifdef for that
#define INDEX_BY_MSB_POS 0
//return unlikely(a==0) ? 10 : catpower_lz32[clz];
return catpower_lz32[clz]; // a|1 avoids the special-casing
#endif // optimize for BSR or not
#endif // HAVE_LZCNT32
}
uint32_t uintcat (uint32_t ms, uint32_t ls)
{
// if (ls==0) return ms * 10; // Another way to avoid the special case for clz
uint32_t mult = msb_power10(ls); // catpower[clz(ls)];
uint32_t high = mult * ms;
#if 0
if (mult <= ls)
high *= 10;
return high + ls;
#else
// hopefully compute both and then select
// because some CPUs can shift and add at the same time (x86, ARM)
// so this avoids having an ADD *after* the cmov / csel, if the compiler is smart
uint32_t another10 = high*10 + ls;
uint32_t enough = high + ls;
return (mult<=ls) ? another10 : enough;
#endif
}
From the Godbolt compiler explorer, this compiles efficiently for x86-64 with and without BSR:
# clang7.0 -O3 for x86-64 SysV, -march=skylake -mno-lzcnt
uintcat(unsigned int, unsigned int):
mov eax, esi
or eax, 1
bsr eax, eax # 31-clz(ls|1)
mov ecx, dword ptr [4*rax + catpower_msb]
imul edi, ecx # high = mult * ms
lea eax, [rdi + rdi]
lea eax, [rax + 4*rax] # retval = high * 10
cmp ecx, esi
cmova eax, edi # if(mult>ls) retval = high (drop the *10 result)
add eax, esi # retval += ls
ret
Or with lzcnt, (enabled by -march=haswell or later, or some AMD uarches),
uintcat(unsigned int, unsigned int):
# clang doesn't try to break the false dependency on EAX; gcc uses xor eax,eax
lzcnt eax, esi # C source avoids the |1, saving instructions
mov ecx, dword ptr [4*rax + catpower_lz32]
imul edi, ecx # same as above from here on
lea eax, [rdi + rdi]
lea eax, [rax + 4*rax]
cmp ecx, esi
cmova eax, edi
add eax, esi
ret
Factoring the last add out of both sides of the ternary is a missed optimization, adding 1 cycle of latency after the cmov. We can multiply by 10 and add just as cheaply as multiplying by 10 alone, on Intel CPUs:
... same start # hand-optimized version that clang should use
imul edi, ecx # high = mult * ms
lea eax, [rdi + 4*rdi] # high * 5
lea eax, [rsi + rdi*2] # retval = high * 10 + ls
add edi, esi # tmp = high + ls
cmp ecx, esi
cmova eax, edi # if(mult>ls) retval = high+ls
ret
So the high + ls latency would run in parallel with the high*10 + ls latency, both needed as inputs for cmov.
GCC branches instead of using CMOV for the last condition. GCC also makes a mess of 31-clz(a|1), calculating clz with BSR and XOR with 31. But then subtracting that from 31. And it has some extra mov instructions. Strangely, gcc seems to do better with that BSR code when lzcnt is available, even though it chooses not to use it.
clang has no trouble optimizing away the 31-clz double-inversion and just using BSR directly.
For PowerPC64, clang also makes branchless asm. gcc does something similar, but with a branch like on x86-64.
uintcat:
.Lfunc_gep0:
addis 2, 12, .TOC.-.Lfunc_gep0#ha
addi 2, 2, .TOC.-.Lfunc_gep0#l
ori 6, 4, 1 # OR immediate
addis 5, 2, catpower_lz32#toc#ha
cntlzw 6, 6 # CLZ word
addi 5, 5, catpower_lz32#toc#l # static table address
rldic 6, 6, 2, 30 # rotate left and clear immediate (shift and zero-extend the CLZ result)
lwzx 5, 5, 6 # Load Word Zero eXtend, catpower_lz32[clz]
mullw 3, 5, 3 # mul word
cmplw 5, 4 # compare mult, ls
mulli 6, 3, 10 # mul immediate
isel 3, 3, 6, 1 # conditional select high vs. high*10
add 3, 3, 4 # + ls
clrldi 3, 3, 32 # zero extend, clearing upper 32 bits
blr # return
Compressing the table
Using clz(ls|1) >> 1 or that +1 should work, because 4 < 10. The table always takes at least 3 entries to gain another digit. I haven't investigated this. (And have already spent longer than I meant to on this. :P)
Or right-shift a lot more to just get a starting point for the loop. e.g. mult = clz(ls) >= 18 ? 100000 : 10;, or a 3 or 4-way chain of if.
Or loop on mult *= 100, and after exiting that loop sort out whether you want old_mult * 10 or mult. (i.e. check if you went too far). This cuts the iteration count in half for even numbers of digits.
(Watch out for a possible infinite loop on large ls that would overflow the result. If mult *= 100 wraps to 0, it will always stay <= ls for ls = 1000000000, for example.)

When should prefetch be used on modern machines? [duplicate]

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:
It is a simple, small, self-contained example.
Removing the __builtin_prefetch instruction results in performance degradation.
Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.
That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.

Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I can find that had a noticable speedup from prefetching.)
This code performs a very large data transpose.
This example uses the SSE prefetch instructions, which may be the same as the one that GCC emits.
To run this example, you will need to compile this for x64 and have more than 4GB of memory. You can run it with a smaller datasize, but it will be too fast to time.
#include <iostream>
using std::cout;
using std::endl;
#include <emmintrin.h>
#include <malloc.h>
#include <time.h>
#include <string.h>
#define ENABLE_PREFETCH
#define f_vector __m128d
#define i_ptr size_t
inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
// To be super-optimized later.
f_vector *stop = A + L;
do{
f_vector tmpA = *A;
f_vector tmpB = *B;
*A++ = tmpB;
*B++ = tmpA;
}while (A < stop);
}
void transpose_even(f_vector *T,i_ptr block,i_ptr x){
// Transposes T.
// T contains x columns and x rows.
// Each unit is of size (block * sizeof(f_vector)) bytes.
//Conditions:
// - 0 < block
// - 1 < x
i_ptr row_size = block * x;
i_ptr iter_size = row_size + block;
// End of entire matrix.
f_vector *stop_T = T + row_size * x;
f_vector *end = stop_T - row_size;
// Iterate each row.
f_vector *y_iter = T;
do{
// Iterate each column.
f_vector *ptr_x = y_iter + block;
f_vector *ptr_y = y_iter + row_size;
do{
#ifdef ENABLE_PREFETCH
_mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif
swap_block(ptr_x,ptr_y,block);
ptr_x += block;
ptr_y += row_size;
}while (ptr_y < stop_T);
y_iter += iter_size;
}while (y_iter < end);
}
int main(){
i_ptr dimension = 4096;
i_ptr block = 16;
i_ptr words = block * dimension * dimension;
i_ptr bytes = words * sizeof(f_vector);
cout << "bytes = " << bytes << endl;
// system("pause");
f_vector *T = (f_vector*)_mm_malloc(bytes,16);
if (T == NULL){
cout << "Memory Allocation Failure" << endl;
system("pause");
exit(1);
}
memset(T,0,bytes);
// Perform in-place data transpose
cout << "Starting Data Transpose... ";
clock_t start = clock();
transpose_even(T,block,dimension);
clock_t end = clock();
cout << "Done" << endl;
cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;
_mm_free(T);
system("pause");
}
When I run it with ENABLE_PREFETCH enabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.725 seconds
Press any key to continue . . .
When I run it with ENABLE_PREFETCH disabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.822 seconds
Press any key to continue . . .
So there's a 13% speedup from prefetching.
EDIT:
Here's some more results:
Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release
Intel Core i7 860 # 2.8 GHz, 8 GB DDR3 # 1333 MHz
Prefetch : 0.868
No Prefetch: 0.960
Intel Core i7 920 # 3.5 GHz, 12 GB DDR3 # 1333 MHz
Prefetch : 0.725
No Prefetch: 0.822
Intel Core i7 2600K # 4.6 GHz, 16 GB DDR3 # 1333 MHz
Prefetch : 0.718
No Prefetch: 0.796
2 x Intel Xeon X5482 # 3.2 GHz, 64 GB DDR2 # 800 MHz
Prefetch : 2.273
No Prefetch: 2.666

Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.
In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int binarySearch(int *array, int number_of_elements, int key) {
int low = 0, high = number_of_elements-1, mid;
while(low <= high) {
mid = (low + high)/2;
#ifdef DO_PREFETCH
// low path
__builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1);
// high path
__builtin_prefetch (&array[(low + mid - 1)/2], 0, 1);
#endif
if(array[mid] < key)
low = mid + 1;
else if(array[mid] == key)
return mid;
else if(array[mid] > key)
high = mid-1;
}
return -1;
}
int main() {
int SIZE = 1024*1024*512;
int *array = malloc(SIZE*sizeof(int));
for (int i=0;i<SIZE;i++){
array[i] = i;
}
int NUM_LOOKUPS = 1024*1024*8;
srand(time(NULL));
int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
for (int i=0;i<NUM_LOOKUPS;i++){
lookups[i] = rand() % SIZE;
}
for (int i=0;i<NUM_LOOKUPS;i++){
int result = binarySearch(array, SIZE, lookups[i]);
}
free(array);
free(lookups);
}
When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:
$ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
$ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch
Performance counter stats for './with-prefetch':
356,675,702 L1-dcache-load-misses # 41.39% of all L1-dcache hits
861,807,382 L1-dcache-loads
8.787467487 seconds time elapsed
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch
Performance counter stats for './no-prefetch':
382,423,177 L1-dcache-load-misses # 97.36% of all L1-dcache hits
392,799,791 L1-dcache-loads
11.376439030 seconds time elapsed
Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.

I learned a lot from the excellent answers provided by #JamesScriven and #Mystical. However, their examples give only a modest boost - the objective of this answer is to present a (I must confess somewhat artificial) example, where prefetching has a bigger impact (about factor 4 on my machine).
There are three possible bottle-necks for the modern architectures: CPU-speed, memory-band-width and memory latency. Prefetching is all about reducing the latency of the memory-accesses.
In a perfect scenario, where latency corresponds to X calculation-steps, we would have a oracle, which would tell us which memory we would access in X calculation-steps, the prefetching of this data would be launched and it would arrive just in-time X calculation-steps later.
For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks are doing a very good job here, concealing the latency almost completely.
That is the reason, why there is such a modest improvement for #Mystical's example: The prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much band-width is left - it could be becoming the limiting factor. I could see at best around 8% improvement on my machine.
The crucial insight from the #JamesScriven example: neither we nor the CPU knows the next access-address before the the current data is fetched from memory - this dependency is pretty important, otherwise out-of-order execution would lead to a look-forward and the hardware would be able to prefetch the data. However, because we can speculate about only one step there is not that much potential. I was not able to get more than 40% on my machine.
So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):
//making random accesses to memory:
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
//the actual work is happening here
void operator()(){
//set up the oracle - let see it in the future oracle_offset steps
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//use oracle and prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work, the less the better
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index); #update oracle
index=next(mem[index]); #dependency on `mem[index]` is VERY important to prevent hardware from predicting future
}
}
Some remarks:
data is prepared in such a way, that the oracle is alway right.
maybe surprisingly, the less CPU-bound task the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is CPU-time+original-latency-time/CPU-time.
Compiling and executing leads:
>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops time no prefetch time prefetch factor
...
7 1.0711102260000001 0.230566831 4.6455521002498408
8 1.0511602149999999 0.22651144600000001 4.6406494398521474
9 1.049024333 0.22841439299999999 4.5926367389641687
....
to a speed-up between 4 and 5.
Listing of prefetch_demp.cpp:
//prefetch_demo.cpp
#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>
const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
template<bool prefetch>
struct Worker{
std::vector<int> mem;
double result;
int oracle_offset;
void operator()(){
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work:
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index);
index=next(mem[index]);
}
}
Worker(std::vector<int> &mem_):
mem(mem_), result(0.0), oracle_offset(0)
{}
};
template <typename Worker>
double timeit(Worker &worker){
auto begin = std::chrono::high_resolution_clock::now();
worker();
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}
int main() {
//set up the data in special way!
std::vector<int> keys(SIZE);
for (int i=0;i<SIZE;i++){
keys[i] = i;
}
Worker<false> without_prefetch(keys);
Worker<true> with_prefetch(keys);
std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
std::cout<<std::setprecision(17);
for(int i=0;i<20;i++){
//let oracle see i steps in the future:
without_prefetch.oracle_offset=i;
with_prefetch.oracle_offset=i;
//calculate:
double time_with_prefetch=timeit(with_prefetch);
double time_no_prefetch=timeit(without_prefetch);
std::cout<<i<<"\t"
<<time_no_prefetch<<"\t"
<<time_with_prefetch<<"\t"
<<(time_no_prefetch/time_with_prefetch)<<"\n";
}
}

From the documentation:
for (i = 0; i < n; i++)
{
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* ... */
}

Pre-fetching data can be optimized to the Cache Line size, which for most modern 64-bit processors is 64 bytes to for example pre-load a uint32_t[16] with one instruction.
For example on ArmV8 I discovered through experimentation casting the memory pointer to a uint32_t 4x4 matrix vector (which is 64 bytes in size) halved the required instructions required as before I had to increment by 8 as it was only loading half the data, even though my understanding was that it fetches a full cache line.
Pre-fetching an uint32_t[32] original code example...
int addrindex = &B[0];
__builtin_prefetch(&V[addrindex]);
__builtin_prefetch(&V[addrindex + 8]);
__builtin_prefetch(&V[addrindex + 16]);
__builtin_prefetch(&V[addrindex + 24]);
After...
int addrindex = &B[0];
__builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
__builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);
For some reason int datatype for the address index/offset gave better performance. Tested with GCC 8 on Cortex-a53. Using an equivalent 64 byte vector on other architectures might give the same performance improvement if you find it is not pre-fetching all the data like in my case. In my application with a one million iteration loop, it improved performance by 5% just by doing this. There were further requirements for the improvement.
the 128 megabyte "V" memory allocation had to be aligned to 64 bytes.
uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));
Also, I had to use C operators instead of Neon Intrinsics, since they require regular datatype pointers (in my case it was uint32_t *) otherwise the new built in prefetch method had a performance regression.
My real world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c in the scrypt_core() and its internal function which are all easy to read. The hard work is done by GCC8. Overall improvement to performance was 25%.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight