Decrease in instructions retired after loop unrolling - C

I have an O(N^4) image-processing loop, and after profiling it (using Intel VTune 2013) I see that the number of instructions retired is reduced drastically. I need help understanding this behavior on a multicore architecture. (I'm using an Intel Xeon X5365, which has 8 cores with an L2 cache shared by every 2 cores.) Also, the number of branch mispredictions has increased drastically!
EDIT: A sample of my non-unrolled code is shown below:
for (imageNo = 0; imageNo < 496; imageNo++) {
    for (unsigned int k = 0; k < 256; k++)
    {
        double z = O_L + (double)k * R_L;
        for (unsigned int j = 0; j < 256; j++)
        {
            double y = O_L + (double)j * R_L;
            for (unsigned int i = 0; i < 256; i++)
            {
                double x[1] = {O_L + (double)i * R_L};
                double w_n = (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]);
                double u_n = ((A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n);
                double v_n = ((A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n);
                for (int loop = 0; loop < 1; loop++)
                {
                    px_x[loop] = (int) floor(u_n);
                    px_y[loop] = (int) floor(v_n);
                    alpha[loop] = u_n - px_x[loop];
                    beta[loop]  = v_n - px_y[loop];
                }
                ///////////////////(i,j) pixels ///////////////////////////////
                if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)
                    pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
                else
                    pixel_1[0] = 0.0;
                if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)
                    pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
                else
                    pixel_1[2] = 0.0;
                /////////////////// (i+1, j) pixels/////////////////////////
                if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)
                    pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
                else
                    pixel_1[1] = 0.0;
                if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)
                    pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
                else
                    pixel_1[3] = 0.0;
                pix_1 = (1.0 - alpha[0]) * (1.0 - beta[0]) * pixel_1[0] + (1.0 - alpha[0]) * beta[0] * pixel_1[1]
                      + alpha[0] * (1.0 - beta[0]) * pixel_1[2] + alpha[0] * beta[0] * pixel_1[3];
                f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * pix_1);
            }
        }
    }
}
I'm unrolling the innermost loop by 4 iterations. (You will get a general idea of how I stripped the loop: basically I created arrays of size 4 and filled in the respective values.) Doing the math, I'm reducing the total number of iterations by 75%. Assuming 4 loop-handling instructions per iteration (load i, inc i, cmp i, jle loop), the total number of instructions after unrolling should drop by (256-64)*4*256*256*496 = 24.96G.
The profiled results are:
Before unrolling: instructions retired: 3.1603T; branch mispredictions: 96 million
After unrolling: instructions retired: 2.642240T; branch mispredictions: 144 million
The number of instructions retired decreased by 518.06G. I have no clue how this is happening, and I would appreciate any help (even a remote possibility for its occurrence). I would also like to know why the branch mispredictions are increasing. Thanks in advance!

It is not clear where gcc would be reducing the number of instructions. It is possible that increased register pressure might encourage gcc to use load+operate instructions (so the same number of primitive operations but fewer instructions). The index for f_L would only be incremented once per innermost loop, but this would only save 6.2G (3*64*256*256*496) instructions. (By the way, the loop overhead should only be three instructions since i should remain in a register.)
The following pseudo-assembly (for a RISC-like ISA) using a two-way unrolling shows how an increment can be saved:
// the address of f_L[k * L * L + j * L + i] is in r1
// (float)(1.0 / (w_n * w_n) * pix_1) results are in f1 and f2
load-single f9 [r1]; // load float at address in r1 to register f9
add-single f9 f9 f1; // f9 = f9 + f1
store-single [r1] f9; // store float in f9 to address in r1
load-single f10 4[r1]; // load float at address of r1+4 to f10
add-single f10 f10 f2; // f10 = f10 + f2
store-single 4[r1] f10; // store float in f10 to address of r1+4
add r1 r1 #8; // increase the address by 8 bytes
The trace of two iterations of the non-unrolled version would look more like:
load-single f9 [r1]; // load float at address of r1 to f9
add-single f9 f9 f2; // f9 = f9 + f2
store-single [r1] f9; // store float in f9 to address of r1
add r1 r1 #4; // increase the address by 4 bytes
...
load-single f9 [r1]; // load float at address of r1 to f9
add-single f9 f9 f2; // f9 = f9 + f2
store-single [r1] f9; // store float in f9 to address of r1
add r1 r1 #4; // increase the address by 4 bytes
Because memory-addressing instructions commonly include adding an immediate offset (Itanium is an unusual exception) and pipelines are not generally implemented to optimize the case where the immediate is zero, using a non-zero immediate offset is generally "free". (At a minimum it reduces the number of instructions, 7 vs. 8 in this case, but it generally also improves performance.)
With respect to branch prediction, according to Agner Fog's The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers (PDF), the Core 2 microarchitecture's branch predictor uses an 8-bit global history. This means that it tracks the results of the last 8 branches and uses these 8 bits (along with bits from the instruction address) to index a table. This allows correlations between nearby branches to be recognized.
For your code, the branch corresponding to, e.g., the 8th previous branch is not the same branch in each iteration (since short-circuiting is used), so it is not easy to conceptualize how well correlations would be recognized.
Some correlations in branches are obvious. If px_x[0]>=0 is true, px_x[0]+1>=0 will also be true. If px_x[0] <(int)threadCopy[0].S_x is true, then px_x[0]+1 <(int)threadCopy[0].S_x is likely to be true.
If the unrolling is done such that px_x[n] is tested for all four values of n then these correlations would be pushed farther away so that the results are not used by the branch predictor.
Some optimization possibilities
Although you did not ask about any optimization possibilities, I am going to offer some avenues for exploration.
First, for the branches, if not being strictly portable is OK, the test x>=0 && x<y can be simplified to (unsigned)x<(unsigned)y. This is not strictly portable because, e.g., a machine could theoretically represent negative numbers in a sign-magnitude format with the most significant bit as the sign and negative indicated by a zero bit. For the common representations of signed integers, such a reinterpreting cast will work as long as y is a positive signed integer since a negative x value will have the most significant bit set and so be larger than y interpreted as an unsigned integer.
Second, the number of branches can be significantly reduced by using the 100% correlations for either px_x or px_y:
if ((unsigned) px_y[0]<(unsigned int)threadCopy[0].S_y)
{
if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
else
pixel_1[0] = 0.0;
if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
else
pixel_1[2] = 0.0;
}
if ((unsigned)px_y[0]+1<(unsigned int)threadCopy[0].S_y)
{
if ((unsigned)px_x[0]<(unsigned int)threadCopy[0].S_x)
pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
else
pixel_1[1] = 0.0;
if ((unsigned)px_x[0]+1<(unsigned int)threadCopy[0].S_x)
pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
else
pixel_1[3] = 0.0;
}
(If the above section of code is replicated for unrolling, it should probably be replicated as a block rather than interleaving tests for different px_x and px_y values to allow the px_y branch to be near the px_y+1 branch and the first px_x branch to be near the other px_x branch and the px_x+1 branches.)
Another possible optimization is changing the calculation of w_n into a calculation of its reciprocal. This would change a multiply and three divisions into four multiplies and one division. Division is much more expensive than multiplication. In addition, calculating an approximate reciprocal is much more SIMD-friendly since there are usually reciprocal estimate instructions that provide a starting point which can be refined by the Newton-Raphson method.
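As a minimal sketch of that transformation (using the same variable names as the code above, untested):
double r_w = 1.0 / (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]); /* the only division */
double u_n = (A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) * r_w; /* multiply by the reciprocal */
double v_n = (A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) * r_w;
...
f_L[k * L * L + j * L + i] += (float)(r_w * r_w * pix_1);               /* 1/(w_n*w_n) == r_w*r_w */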
If even worse obfuscation of the code is acceptable, you might consider changing code like double y = O_L + (double)j * R_L; into double y = O_L; ... y += R_L;. (I ran a test, and gcc does not seem to recognize this optimization, probably because of the use of floating point and the cast to double.) Thus:
for(int imageNo =0; imageNo<496;imageNo++){
double z = O_L;
for (unsigned int k=0; k<256; k++)
{
double y = O_L;
for (unsigned int j=0; j<256; j++)
{
double x[1]; x[0] = O_L;
for (unsigned int i=0; i<256; i++)
{
...
x[0] += R_L ;
} // end of i loop
y += R_L;
} // end of j loop
z += R_L;
} // end of k loop
} // end of imageNo loop
I am guessing that this would only modestly improve performance, so the obfuscation cost would be high relative to the benefit.
Another change that might be worth trying is incorporating some of the pix_1 calculation into the sections that conditionally set the pixel_1[] values. This would significantly obfuscate the code and might not have much benefit. In addition, it might make autovectorization by the compiler more difficult. (With conditional setting of the values to the appropriate I_n or to zero, a SIMD comparison could set each element to -1 or 0, and a simple AND with the I_n value would provide the correct value. In this case, the overhead of forming the I_n vector would probably not be worthwhile given that Core 2 only supports 2-wide double-precision SIMD, but with gather support or even just a longer vector the tradeoffs might change.)
However, this change would increase the size of the basic blocks and reduce the amount of computation when any of px_x or px_y is out of range (I am guessing this is uncommon, so the benefit would be very small at best).
double pix_1 = 0.0;
double alpha_diff = 1.0 - alpha[0];
if ((unsigned) px_y[0] < (unsigned int)threadCopy[0].S_y)
{
    double beta_diff = 1.0 - beta[0];
    if ((unsigned)px_x[0] < (unsigned int)threadCopy[0].S_x)
        pix_1 += alpha_diff * beta_diff
               * threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
    // no need for an else statement since pix_1 is already zeroed and not
    // adding the pixel_1[0] term is the same as zeroing pixel_1[0]
    if ((unsigned)px_x[0]+1 < (unsigned int)threadCopy[0].S_x)
        pix_1 += alpha[0] * beta_diff
               * threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
}
if ((unsigned)px_y[0]+1 < (unsigned int)threadCopy[0].S_y)
{
    if ((unsigned)px_x[0] < (unsigned int)threadCopy[0].S_x)
        pix_1 += alpha_diff * beta[0]
               * threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
    if ((unsigned)px_x[0]+1 < (unsigned int)threadCopy[0].S_x)
        pix_1 += alpha[0] * beta[0]
               * threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];
}
Ideally, code like yours would be vectorized, but I do not know how to get gcc to recognize the opportunities, how to express the opportunities using intrinsics, nor whether significant effort at manually vectorizing this code would be worthwhile with an SIMD width of only two.
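As a very rough illustration of the masking idea mentioned above (not a full vectorization of the loop, and the helper name is mine), the branchless select for one bounds test could look like this in SSE2:
#include <emmintrin.h>   /* SSE2 */

/* For two pixels at a time: each compare produces an all-ones/all-zeros mask
   per lane, and ANDing the mask with the loaded I_n values yields either the
   pixel value or 0.0 without a branch. */
static inline __m128d select_in_range(__m128d px, __m128d limit, __m128d values)
{
    __m128d ge_zero  = _mm_cmpge_pd(px, _mm_setzero_pd());  /* px >= 0            */
    __m128d lt_limit = _mm_cmplt_pd(px, limit);             /* px < S_x (or S_y)  */
    __m128d mask     = _mm_and_pd(ge_zero, lt_limit);       /* both conditions    */
    return _mm_and_pd(mask, values);                        /* value or 0.0       */
}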
I am not a programmer (just someone who likes learning and thinking about computer architecture), and I have a significant inclination toward micro-optimization (as is clear from the above), so the above proposals should be considered in that light.


How do I obtain the theoretical/Linpack FLOPS performance on basic vector/matrix operations?

The question is simple: how do I further optimize my code, given that these basic matrix operations are critical and common in my calculation? BLAS and LAPACK are good for linear algebra, but neither provides basic element-by-element (Hadamard) addition/multiplication. Theoretical performance may be difficult to reach, but Linpack performance, or 60~80% of Linpack performance, should be achievable. (I can only do 12%; if I use multiply-add, then only 25%.)
For reference:
Theoretical performance: the 8259U has 4 cores * 3.8 GHz * 16 FLOPs/cycle ≈ 240 GFLOPS
Linpack performance: the 8259U can run double-precision operations as fast as 140~160 GFLOPS
Platform: Macbook Pro 2018, Monterey
CPU: i5-8259u, 4c8t
RAM: 8GB
CC: gcc 11.3.0
CFLAGS: -mavx2 -mfma -fopenmp -O3
Here's my attempt
The FLOPS are calculated as follows:
double time = stop - start;
double ops = 1.0 * Nx * Ny * iterNum; //2.0 for complex numbers
double flops = ops / time;
double gFlops = flops / 1E9;
Here are some results from running my code. Real and complex results are almost the same; only the (rough) real results are shown:
//Nx = Ny = 2048, iterNum = 10000
//Typical matrix size and iteration depth for my calculation
threads = 1: 1 GFlops
threads = 2: 2 GFlops
threads = 4: 3 GFlops
threads = 8: 4 GFlops
threads = 16: 9 GFlops
threads = 32: 11 GFlops
threads = 64: 15 GFlops
threads = 128: 18 GFlops
threads = 256: 19 GFlops
threads = 512: 21 GFlops
threads = 1024: 20 GFlops
threads = 2048: 40 GFlops // wrong answer
For the convenience of keeping large matrices on the heap and integrating with MathGL, each matrix is flattened into a vector of Nx * Ny elements laid out row by row.
// for real numbers
x = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
y = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
z = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
sum = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
// for complex numbers
x = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
y = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
z = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
sum = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
The addition is done in parallel using OpenMP:
double start = omp_get_wtime();
#pragma omp parallel private(shift)
{
for (int tds = omp_get_thread_num(); tds < threads; tds = tds + threads)
{
shift = Nx * Ny / threads * tds;
for (int i = 0; i < iterNum; i++)
{
AddComplex(sum+shift, sum+shift, z+shift, Nx/threads, Ny);
}
}
}
double stop = omp_get_wtime();
I wrote explicit vectorization code using the AVX intrinsics from "immintrin.h":
//real matrix addition
void AddReal(double *summation, const double *summand, const double *addend, int Nx, int Ny)
{
int nBlock = Nx * Ny / realPackSize;
int nRem = Nx * Ny % realPackSize;
register __m256d packSummand, packAddend, packSum;
const double *px = summand;
const double *py = addend;
double *pSum = summation;
for (int i = 0; i < nBlock; i++)
{
packSummand = _mm256_load_pd(px);
packAddend = _mm256_load_pd(py);
packSum = _mm256_add_pd(packSummand, packAddend);
_mm256_store_pd(pSum, packSum);
px = px + realPackSize;
py = py + realPackSize;
pSum = pSum + realPackSize;
}
for (int i = 0; i < nRem; i++)
{
pSum[i] = px[i] + py[i];
}
px = NULL;
py = NULL;
pSum = NULL;
return;
}
//Complex matrix addition
void AddComplex(double complex *summation, const double complex *summand, const double complex *addend, int Nx, int Ny)
{
int nBlock = Nx * Ny / complexPackSize;
int nRem = Nx * Ny % complexPackSize;
register __m256d packSummand, packAddend, packSum;
const double complex *px = summand;
const double complex *py = addend;
double complex *pSum = summation;
for (int i = 0; i < nBlock; i++)
{
packSummand = _mm256_load_pd(px);
packAddend = _mm256_load_pd(py);
packSum = _mm256_add_pd(packSummand, packAddend);
_mm256_store_pd(pSum, packSum);
px = px + complexPackSize;
py = py + complexPackSize;
pSum = pSum + complexPackSize;
}
for (int i = 0; i < nRem; i++)
{
pSum[i] = px[i] + py[i];
}
px = NULL;
py = NULL;
pSum = NULL;
return;
}
Level 1 (e.g., dot product) and level 2 (e.g., vector-matrix multiplication) BLAS functions are known not to scale (especially level 1 BLAS functions), as opposed to level 3 (e.g., matrix multiplication). Indeed, they are generally memory-bound: the amount of data read/written is O(n) while the number of floating-point operations is also O(n). This is not the case for level 3 BLAS, which is generally clearly compute-bound.
Theoretical performance maybe difficult, but Linpack performance or 60~80% Linpack performance should be achievable
If the computation is memory-bound, then no, this is not possible. Linpack is generally clearly compute-bound on nearly all machines. The thing is, memory is slow, and RAM speed has not increased as fast as processor speed over the last decades. This is known as the memory wall (formulated a few decades ago and still true today).
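To make the memory-bound argument concrete, here is a back-of-the-envelope roofline estimate for the element-wise add (the 30 GB/s bandwidth figure is an assumption, not a measurement of your machine; substitute your own):
#include <stdio.h>

int main(void)
{
    /* sum[i] += z[i]: 1 FLOP per double, but ~24 bytes of traffic
       (load sum, load z, store sum), or ~32 bytes if the write-allocate
       read discussed further below is counted. */
    double bandwidth_gb_per_s = 30.0;  /* assumed sustained DRAM bandwidth */
    double bytes_per_flop     = 24.0;
    printf("memory-bound ceiling: ~%.2f GFLOPS\n",
           bandwidth_gb_per_s / bytes_per_flop);
    return 0;
}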
Here's some results when I run my code.
Getting faster results from using 1024 threads instead of 512 on a mobile processor with 4 cores and 8 hardware threads makes me think that there is a huge problem somewhere. The maximum should be reached with 8 threads; otherwise the computation is clearly inefficient. Indeed, running more threads than hardware threads causes the OS scheduler to perform expensive context switches (higher overhead). In the end, your processor never runs more than 8 tasks at a time. There are a few possibilities:
The timings are not correct (the provided piece of code about that seems fine to me)
The program is bogus
The computation exhibits a super-linear speed-up (possibly due to caching)
I wrote explicit vectorization code using AVX intrinsics "immintrin.h".
The hot loop contains 2 loads, 1 store, 1 add, and a few instructions incrementing integers. Your processor can do 2 loads and 1 store per cycle, so the SIMD part can be done at a throughput of 1 iteration per cycle (though the latency can be much higher), assuming nBlock is large enough.
Your processor can do 2 adds per cycle, so half the add throughput is lost. However, you cannot write anything faster than that if the loads/stores are mandatory.
If complexPackSize is smaller than a full SIMD vector, then I think the processor has to deal with the overlap with the previous iteration, which will certainly make the loop run much less efficiently (a loop-carried dependency would make the loop latency-bound, which is very inefficient here). If complexPackSize is much larger than a cache line, then prefetching will likely be an issue.
Your processor cannot execute too many instructions at the same time. The increment instructions and the loop check cause 5 instructions to be executed, which consumes at least 1 cycle. This reduces the throughput by a factor of 2 again, so no more than 25% of the theoretical performance can be reached. This can be improved a bit by unrolling the loop. Unrolling might also improve execution because the _mm256_add_pd instruction has a fairly high latency. One should keep in mind that SIMD instructions are great for throughput but not for latency. Thus, when latency is not an issue, SIMD code should be fast.
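For instance, a minimal sketch of 4x unrolling of the hot loop in AddReal (assuming realPackSize == 4 and, for brevity, that nBlock is a multiple of 4) might look like this; the four adds are independent, which helps hide the latency of the vector add:
for (int i = 0; i < nBlock; i += 4)
{
    __m256d s0 = _mm256_add_pd(_mm256_load_pd(px +  0), _mm256_load_pd(py +  0));
    __m256d s1 = _mm256_add_pd(_mm256_load_pd(px +  4), _mm256_load_pd(py +  4));
    __m256d s2 = _mm256_add_pd(_mm256_load_pd(px +  8), _mm256_load_pd(py +  8));
    __m256d s3 = _mm256_add_pd(_mm256_load_pd(px + 12), _mm256_load_pd(py + 12));
    _mm256_store_pd(pSum +  0, s0);
    _mm256_store_pd(pSum +  4, s1);
    _mm256_store_pd(pSum +  8, s2);
    _mm256_store_pd(pSum + 12, s3);
    px += 16;  py += 16;  pSum += 16;
}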
Note that the write-allocate cache policy causes data to be read when _mm256_store_pd is used, increasing the amount of data transferred from RAM and reducing the observed throughput. _mm256_stream_pd can be used to avoid this effect, but it is fast only if the data are not read right afterwards or when the data do not fit in the cache anyway. It also requires the data to be aligned. In fact, _mm256_store_pd also requires that, and if it is not the case, it will certainly cause a bug. The same applies to _mm256_load_pd: _mm256_loadu_pd should be used instead for unaligned data. I am not sure the data read is always aligned. It should be fine if complexPackSize is a power of two divisible by 32, and shift as well. However, I highly doubt this is the case for shift, especially with a large number of threads. I also find it very suspicious to use a constant complexPackSize while the SIMD vectors have a fixed size. Did you check the results in all cases?
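A sketch of the same loop (again assuming realPackSize == 4) with unaligned loads and non-temporal stores, as discussed above; this is only worthwhile when the result is not reused soon, and _mm256_stream_pd itself still requires a 32-byte-aligned destination:
for (int i = 0; i < nBlock; i++)
{
    __m256d v = _mm256_add_pd(_mm256_loadu_pd(px), _mm256_loadu_pd(py));
    _mm256_stream_pd(pSum, v);          /* bypasses the write-allocate read */
    px += 4;  py += 4;  pSum += 4;
}
_mm_sfence();                           /* order the non-temporal stores */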

How to further optimize performance of Matrix Multiplication?

I am trying to optimize my matrix multiplication code running on a single core. How can I futher improve the performance in regards to loop unrolling, FMA/SSE? I'm also curious to know why the performance won't increase if you use four instead of two sums in the inner loop.
The problem size is a 1000x1000 matrix multiplication. Both gcc 9 and icc 19.0.5 are available. Intel Xeon @ 3.10 GHz, 32K L1d cache, Skylake architecture. Compiled with gcc -O3 -mavx.
void mmult(double* A, double* B, double* C)
{
    const int block_size = 64 / sizeof(double);
    __m256d sum[2], broadcast;
    for (int i0 = 0; i0 < SIZE_M; i0 += block_size) {
        for (int k0 = 0; k0 < SIZE_N; k0 += block_size) {
            for (int j0 = 0; j0 < SIZE_K; j0 += block_size) {
                int imax = i0 + block_size > SIZE_M ? SIZE_M : i0 + block_size;
                int kmax = k0 + block_size > SIZE_N ? SIZE_N : k0 + block_size;
                int jmax = j0 + block_size > SIZE_K ? SIZE_K : j0 + block_size;
                for (int i1 = i0; i1 < imax; i1++) {
                    for (int k1 = k0; k1 < kmax; k1++) {
                        broadcast = _mm256_broadcast_sd(A+i1*SIZE_N+k1);
                        for (int j1 = j0; j1 < jmax; j1+=8) {
                            sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0);
                            sum[0] = _mm256_add_pd(sum[0], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+0)));
                            _mm256_store_pd(C+i1*SIZE_K+j1+0, sum[0]);
                            sum[1] = _mm256_load_pd(C+i1*SIZE_K+j1+4);
                            sum[1] = _mm256_add_pd(sum[1], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+4)));
                            _mm256_store_pd(C+i1*SIZE_K+j1+4, sum[1]);
                            // doesn't improve performance.. why?
                            // sum[2] = _mm256_load_pd(C+i1*SIZE_K+j1+8);
                            // sum[2] = _mm256_add_pd(sum[2], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+8)));
                            // _mm256_store_pd(C+i1*SIZE_K+j1+8, sum[2]);
                            // sum[3] = _mm256_load_pd(C+i1*SIZE_K+j1+12);
                            // sum[3] = _mm256_add_pd(sum[3], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+12)));
                            // _mm256_store_pd(C+i1*SIZE_K+j1+12, sum[3]);
                        }
                    }
                }
            }
        }
    }
}
This code has 2 loads per FMA (if FMA contraction happens), but Skylake only supports at most one load per FMA in theory (if you want to max out 2/clock FMA throughput), and even that is usually too much in practice. (Peak throughput is 2 loads + 1 store per clock, but it usually can't quite sustain that.) See Intel's optimization guide and https://agner.org/optimize/
The loop overhead is not the biggest problem; the body itself forces the code to run at half speed.
If the k-loop were the inner loop, a lot of accumulation could be chained without having to load/store to and from C. This has a downside: with a loop-carried dependency chain like that, it would be up to the code to explicitly ensure that there is enough independent work to be done.
In order to have few loads but enough independent work, the body of the inner loop could calculate the product between a small column vector from A and a small row vector from B, for example using 4 scalar broadcasts to load the column and 2 normal vector loads from B, resulting in just 6 loads for 8 independent FMAs (even lower ratios are possible), which is enough independent FMAs to keep Skylake happy and not too many loads. Even a 3x4 footprint is possible, which also has enough independent FMAs to keep Haswell happy (it needs at least 10).
I happen to have some example code, it's for single precision and C++ but you'll get the point:
sumA_1 = _mm256_load_ps(&result[i * N + j]);
sumB_1 = _mm256_load_ps(&result[i * N + j + 8]);
sumA_2 = _mm256_load_ps(&result[(i + 1) * N + j]);
sumB_2 = _mm256_load_ps(&result[(i + 1) * N + j + 8]);
sumA_3 = _mm256_load_ps(&result[(i + 2) * N + j]);
sumB_3 = _mm256_load_ps(&result[(i + 2) * N + j + 8]);
sumA_4 = _mm256_load_ps(&result[(i + 3) * N + j]);
sumB_4 = _mm256_load_ps(&result[(i + 3) * N + j + 8]);
for (size_t k = kk; k < kk + akb; k++) {
auto bc_mat1_1 = _mm256_set1_ps(*mat1ptr);
auto vecA_mat2 = _mm256_load_ps(mat2 + m2idx);
auto vecB_mat2 = _mm256_load_ps(mat2 + m2idx + 8);
sumA_1 = _mm256_fmadd_ps(bc_mat1_1, vecA_mat2, sumA_1);
sumB_1 = _mm256_fmadd_ps(bc_mat1_1, vecB_mat2, sumB_1);
auto bc_mat1_2 = _mm256_set1_ps(mat1ptr[N]);
sumA_2 = _mm256_fmadd_ps(bc_mat1_2, vecA_mat2, sumA_2);
sumB_2 = _mm256_fmadd_ps(bc_mat1_2, vecB_mat2, sumB_2);
auto bc_mat1_3 = _mm256_set1_ps(mat1ptr[N * 2]);
sumA_3 = _mm256_fmadd_ps(bc_mat1_3, vecA_mat2, sumA_3);
sumB_3 = _mm256_fmadd_ps(bc_mat1_3, vecB_mat2, sumB_3);
auto bc_mat1_4 = _mm256_set1_ps(mat1ptr[N * 3]);
sumA_4 = _mm256_fmadd_ps(bc_mat1_4, vecA_mat2, sumA_4);
sumB_4 = _mm256_fmadd_ps(bc_mat1_4, vecB_mat2, sumB_4);
m2idx += 16;
mat1ptr++;
}
_mm256_store_ps(&result[i * N + j], sumA_1);
_mm256_store_ps(&result[i * N + j + 8], sumB_1);
_mm256_store_ps(&result[(i + 1) * N + j], sumA_2);
_mm256_store_ps(&result[(i + 1) * N + j + 8], sumB_2);
_mm256_store_ps(&result[(i + 2) * N + j], sumA_3);
_mm256_store_ps(&result[(i + 2) * N + j + 8], sumB_3);
_mm256_store_ps(&result[(i + 3) * N + j], sumA_4);
_mm256_store_ps(&result[(i + 3) * N + j + 8], sumB_4);
This means that the j-loop and the i-loop are unrolled, but not the k-loop, even though it is the inner loop now. Unrolling the k-loop a bit did help a bit in my experiments.
See @harold's answer for an actual improvement. This is mostly to repost what I wrote in comments.
four instead of two sums in the inner loop. (Why doesn't unrolling help?)
There's no loop-carried dependency through sum[i]. The next iteration assigns sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0); which has no dependency on the previous value.
Therefore register-renaming of the same architectural register onto different physical registers is sufficient to avoid write-after-write hazards that might stall the pipeline. No need to complicate the source with multiple tmp variables. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (In that question, one dot product of 2 arrays, there is a loop carried dependency through an accumulator. There, using multiple accumulators is valuable to hide FP FMA latency so we bottleneck on FMA throughput, not latency.)
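For contrast, a hedged sketch of the multiple-accumulator case referenced above (a dot product, not the code in this question), where the FMA into the accumulator is a loop-carried dependency and two independent accumulators hide FMA latency; assumes n is a multiple of 8 and FMA3 is available:
#include <immintrin.h>
#include <stddef.h>

static double dot(const double *a, const double *b, size_t n)
{
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 8) {
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),     _mm256_loadu_pd(b + i),     acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 4), _mm256_loadu_pd(b + i + 4), acc1);
    }
    __m256d acc = _mm256_add_pd(acc0, acc1);   /* combine the independent accumulators */
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);                /* simple horizontal sum */
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}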
A pipeline without register renaming (most in-order CPUs) would benefit from "software pipelining" to statically schedule for what out-of-order exec can do on the fly: load into different registers so there's distance (filled with independent work) between each load and the FMA that consumes it. And then between that and the store.
But all modern x86 CPUs are OoO; even Knights Landing has some limited OoO exec for SIMD vectors. (Silvermont doesn't support AVX, but does run SIMD instructions in order, only doing OoO exec for integer.)
Without any multiple-accumulator situation to hide latency, the benefits of unrolling (explicitly in the source or with -funroll-loops as enabled by -fprofile-use, or in clang by default) are:
Reduce front-end bandwidth to get the loop overhead into the back-end. More useful-work uops per loop overhead. Thus it helps if your "useful work" is close to bottlenecked on the front end.
Less back-end execution-unit demand for running the loop overhead. Normally not a problem on Haswell and later, or Zen; the back end can mostly keep up with the front-end when the instruction mix includes some integer stuff and some pure load instructions.
Fewer total uops per work done means OoO exec can "see" farther ahead for memory loads/stores.
Sometimes better branch prediction for short-running loops: The lower iteration count means a shorter pattern for branch prediction to learn. So for short trip-counts, a better chance of correctly predicting the not-taken for the last iteration when execution falls out of the loop.
Sometimes save a mov reg,reg in more complicated cases where it's easier for the compiler to generate a new result in a different reg. The same variable can alternate between living in two regs instead of needing to get moved back to the same one to be ready for the next iteration. Especially if you have a loop that uses a[i] and a[i+1] in a dependent way, or something like Fibonacci.
With 2 loads + 1 store in the loop, that will probably be the bottleneck, not FMA or front-end bandwidth. Unrolling by 2 might have helped avoid a front-end bottleneck, but more than that would only matter with contention from another hyperthread.
An interesting question came up in comments: doesn't unrolling need a lot of registers to be useful?
Harold commented:
16 is not a huge number of registers, but it's enough to have 12 accumulators and 3 pieces of row vector from B and the broadcasted scalar from A, so it works out to just about enough. The loop from OP above barely uses any registers anyway. The 8 registers in 32-bit are indeed too few.
Of course since the code in the question doesn't have "accumulators" in registers across loop iterations, only adding into memory, compilers could have optimized all of sum[0..n] to reuse the same register in asm; it's "dead" after storing. So actual register pressure is very low.
Yes, x86-64 is somewhat register-poor; that's why AVX512 doubles the number as well as the width of vector regs (zmm0..31). Yes, many RISCs have 32 int / 32 FP regs, including AArch64, up from 16 in 32-bit ARM.
x86-64 has 16 scalar integer registers (including the stack pointer, not including the program counter), so normal functions can use 15. There are also 16 vector regs, xmm0..15. (And with AVX they're double the width ymm0..15).
(Some of this was written before I noticed that sum[0..n] was pointless, not loop-carried.)
Register renaming onto a large physical register file is sufficient in this case. There are other cases where having more architectural registers helps, especially for higher FP latency, hence why AVX512 has 32 zmm regs. But for integer, 16 is close to enough. RISC CPUs were often designed for in-order execution without register renaming, needing software pipelining.
With OoO exec, the jump from 8 to 16 architectural GP integer regs is more significant than a jump from 16 to 32 would be, in terms of reducing spill/reloads. (I've seen a paper that measured total dynamic instruction count for SPECint with various numbers of architectural registers. I didn't look it up again, but 8->16 might have been 10% total saving while 16->32 was only a couple %. Something like that).
But this specific problem doesn't need a lot of FP registers, only 4 vectors for sum[0..3] (if they were loop-carried) and maybe 1 temporary; x86 can use memory-source mul/add/FMA. Register renaming removes any WAW hazards so we can reuse the same temporary register instead of needing software pipelining. (And OoO exec also hides load and ALU latency.)
You want multiple accumulators when there are loop-carried dependencies. This code is adding into memory, not into a few vector accumulators, so any dependency is through store/reload. But that only has ~7 cycle latency so any sane cache-blocking factor hides it.

Calculate (x exponent 0.19029) with low memory using lookup table?

I'm writing a C program for a PIC micro-controller which needs to do a very specific exponential function. I need to calculate the following:
A = k . (1 - (p/p0)^0.19029)
k and p0 are constant, so it's all pretty simple apart from finding x^0.19029
(p/p0) ratio would always be in the range 0-1.
It works well if I include math.h and use the power function, except that it uses up all of the available 16 kB of program memory. Talk about bloatware! (Rest of the program without the power function: ~20% flash memory usage; add math.h and the power function: 100%.)
I'd like the program to do some other things as well. I was wondering if I can write a special case implementation for x^0.19029, maybe involving iteration and some kind of lookup table.
My idea is to generate a look-up table for the function x^0.19029, with perhaps 10-100 values of x in the range 0-1. The code would find a close match, then (somehow) iteratively refine it by re-scaling the lookup table values. However, this is where I get lost because my tiny brain can't visualise the maths involved.
Could this approach work?
Alternatively, I've looked at using exp(x) and ln(x), which can be implemented with a Taylor expansion. b^x can then be found with:
b^x = (e^(ln b))^x = e^(x.ln(b))
(See: Wikipedia - Powers via Logarithms)
This looks a bit tricky and complicated to me, though. Am I likely to get the implementation smaller than the compiler's math library, and can I simplify it for my special case (i.e., base in 0-1, exponent always 0.19029)?
Note that RAM usage is OK at the moment, but I've run low on Flash (used for code storage). Speed is not critical. Somebody has already suggested that I use a bigger micro with more flash memory, but that sounds like profligate wastefulness!
[EDIT] I was being lazy when I said "(p/p0) ratio would always be in the range 0-1". Actually it will never reach 0, and I did some calculations last night and decided that a range of 0.3 - 1 would in fact be quite adequate! This means that some of the simpler solutions below should be suitable. Also, the "k" in the above is 44330, and I'd like the error in the final result to be less than 0.1. I guess that means the error in (p/p0)^0.19029 needs to be less than 1/443300, or 2.256e-6.
Use splines. The relevant part of the function is shown in the figure below. It varies approximately like the 5th root, so the problematic zone is close to p / p0 = 0. There is mathematical theory how to optimally place the knots of splines to minimize the error (see Carl de Boor: A Practical Guide to Splines). Usually one constructs the spline in B form ahead of time (using toolboxes such as Matlab's spline toolbox - also written by C. de Boor), then converts to Piecewise Polynomial representation for fast evaluation.
In C. de Boor, PGS, the function g(x) = sqrt(x + 1) is actually taken as an example (Chapter 12, Example II). This is exactly what you need here. The book comes back to this case a few times, since it is admittedly a hard problem for any interpolation scheme due to the infinite derivatives at x = -1. All software from PGS is available for free as PPPACK in netlib, and most of it is also part of SLATEC (also from netlib).
Edit (Removed)
(Multiplying by x once does not significantly help, since it only regularizes the first derivative, while all other derivatives at x = 0 are still infinite.)
Edit 2
My feeling is that optimally constructed splines (following de Boor) will be best (and fastest) for relatively low accuracy requirements. If the accuracy requirements are high (say 1e-8), one may be forced to get back to the algorithms that mathematicians have been researching for centuries. At this point, it may be best to simply download the sources of glibc and copy (provided GPL is acceptable) whatever is in
glibc-2.19/sysdeps/ieee754/dbl-64/e_pow.c
Since we don't have to include the whole math.h, there shouldn't be a problem with memory, but we will only marginally profit from having a fixed exponent.
Edit 3
Here is an adapted version of e_pow.c from netlib, as found by @Joni. This seems to be the grandfather of glibc's more modern implementation mentioned above. The old version has two advantages: (1) it is public domain, and (2) it uses a limited number of constants, which is beneficial if memory is a tight resource (glibc's version defines over 10000 lines of constants!). The following is completely standalone code that calculates x^0.19029 for 0 <= x <= 1 to double precision (I tested it against Python's power function and found that at most 2 bits differed):
#define __LITTLE_ENDIAN
#ifdef __LITTLE_ENDIAN
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
#else
#define __HI(x) *(int*)&x
#define __LO(x) *(1+(int*)&x)
#endif
static const double
bp[] = {1.0, 1.5,},
dp_h[] = { 0.0, 5.84962487220764160156e-01,}, /* 0x3FE2B803, 0x40000000 */
dp_l[] = { 0.0, 1.35003920212974897128e-08,}, /* 0x3E4CFDEB, 0x43CFD006 */
zero = 0.0,
one = 1.0,
two = 2.0,
two53 = 9007199254740992.0, /* 0x43400000, 0x00000000 */
/* poly coefs for (3/2)*(log(x)-2s-2/3*s**3 */
L1 = 5.99999999999994648725e-01, /* 0x3FE33333, 0x33333303 */
L2 = 4.28571428578550184252e-01, /* 0x3FDB6DB6, 0xDB6FABFF */
L3 = 3.33333329818377432918e-01, /* 0x3FD55555, 0x518F264D */
L4 = 2.72728123808534006489e-01, /* 0x3FD17460, 0xA91D4101 */
L5 = 2.30660745775561754067e-01, /* 0x3FCD864A, 0x93C9DB65 */
L6 = 2.06975017800338417784e-01, /* 0x3FCA7E28, 0x4A454EEF */
P1 = 1.66666666666666019037e-01, /* 0x3FC55555, 0x5555553E */
P2 = -2.77777777770155933842e-03, /* 0xBF66C16C, 0x16BEBD93 */
P3 = 6.61375632143793436117e-05, /* 0x3F11566A, 0xAF25DE2C */
P4 = -1.65339022054652515390e-06, /* 0xBEBBBD41, 0xC5D26BF1 */
P5 = 4.13813679705723846039e-08, /* 0x3E663769, 0x72BEA4D0 */
lg2 = 6.93147180559945286227e-01, /* 0x3FE62E42, 0xFEFA39EF */
lg2_h = 6.93147182464599609375e-01, /* 0x3FE62E43, 0x00000000 */
lg2_l = -1.90465429995776804525e-09, /* 0xBE205C61, 0x0CA86C39 */
ovt = 8.0085662595372944372e-0017, /* -(1024-log2(ovfl+.5ulp)) */
cp = 9.61796693925975554329e-01, /* 0x3FEEC709, 0xDC3A03FD =2/(3ln2) */
cp_h = 9.61796700954437255859e-01, /* 0x3FEEC709, 0xE0000000 =(float)cp */
cp_l = -7.02846165095275826516e-09, /* 0xBE3E2FE0, 0x145B01F5 =tail of cp_h*/
ivln2 = 1.44269504088896338700e+00, /* 0x3FF71547, 0x652B82FE =1/ln2 */
ivln2_h = 1.44269502162933349609e+00, /* 0x3FF71547, 0x60000000 =24b 1/ln2*/
ivln2_l = 1.92596299112661746887e-08; /* 0x3E54AE0B, 0xF85DDF44 =1/ln2 tail*/
double pow0p19029(double x)
{
double y = 0.19029e+00;
double z,ax,z_h,z_l,p_h,p_l;
double y1,t1,t2,r,s,t,u,v,w;
int i,j,k,n;
int hx,hy,ix,iy;
unsigned lx,ly;
hx = __HI(x); lx = __LO(x);
hy = __HI(y); ly = __LO(y);
ix = hx&0x7fffffff; iy = hy&0x7fffffff;
ax = x;
/* special value of x */
if(lx==0) {
if(ix==0x7ff00000||ix==0||ix==0x3ff00000){
z = ax; /*x is +-0,+-inf,+-1*/
return z;
}
}
s = one; /* s (sign of result -ve**odd) = -1 else = 1 */
double ss,s2,s_h,s_l,t_h,t_l;
n = ((ix)>>20)-0x3ff;
j = ix&0x000fffff;
/* determine interval */
ix = j|0x3ff00000; /* normalize ix */
if(j<=0x3988E) k=0; /* |x|<sqrt(3/2) */
else if(j<0xBB67A) k=1; /* |x|<sqrt(3) */
else {k=0;n+=1;ix -= 0x00100000;}
__HI(ax) = ix;
/* compute ss = s_h+s_l = (x-1)/(x+1) or (x-1.5)/(x+1.5) */
u = ax-bp[k]; /* bp[0]=1.0, bp[1]=1.5 */
v = one/(ax+bp[k]);
ss = u*v;
s_h = ss;
__LO(s_h) = 0;
/* t_h=ax+bp[k] High */
t_h = zero;
__HI(t_h)=((ix>>1)|0x20000000)+0x00080000+(k<<18);
t_l = ax - (t_h-bp[k]);
s_l = v*((u-s_h*t_h)-s_h*t_l);
/* compute log(ax) */
s2 = ss*ss;
r = s2*s2*(L1+s2*(L2+s2*(L3+s2*(L4+s2*(L5+s2*L6)))));
r += s_l*(s_h+ss);
s2 = s_h*s_h;
t_h = 3.0+s2+r;
__LO(t_h) = 0;
t_l = r-((t_h-3.0)-s2);
/* u+v = ss*(1+...) */
u = s_h*t_h;
v = s_l*t_h+t_l*ss;
/* 2/(3log2)*(ss+...) */
p_h = u+v;
__LO(p_h) = 0;
p_l = v-(p_h-u);
z_h = cp_h*p_h; /* cp_h+cp_l = 2/(3*log2) */
z_l = cp_l*p_h+p_l*cp+dp_l[k];
/* log2(ax) = (ss+..)*2/(3*log2) = n + dp_h + z_h + z_l */
t = (double)n;
t1 = (((z_h+z_l)+dp_h[k])+t);
__LO(t1) = 0;
t2 = z_l-(((t1-t)-dp_h[k])-z_h);
/* split up y into y1+y2 and compute (y1+y2)*(t1+t2) */
y1 = y;
__LO(y1) = 0;
p_l = (y-y1)*t1+y*t2;
p_h = y1*t1;
z = p_l+p_h;
j = __HI(z);
i = __LO(z);
/*
* compute 2**(p_h+p_l)
*/
i = j&0x7fffffff;
k = (i>>20)-0x3ff;
n = 0;
if(i>0x3fe00000) { /* if |z| > 0.5, set n = [z+0.5] */
n = j+(0x00100000>>(k+1));
k = ((n&0x7fffffff)>>20)-0x3ff; /* new k for n */
t = zero;
__HI(t) = (n&~(0x000fffff>>k));
n = ((n&0x000fffff)|0x00100000)>>(20-k);
if(j<0) n = -n;
p_h -= t;
}
t = p_l+p_h;
__LO(t) = 0;
u = t*lg2_h;
v = (p_l-(t-p_h))*lg2+t*lg2_l;
z = u+v;
w = v-(z-u);
t = z*z;
t1 = z - t*(P1+t*(P2+t*(P3+t*(P4+t*P5))));
r = (z*t1)/(t1-two)-(w+z*w);
z = one-(r-z);
__HI(z) += (n<<20);
return s*z;
}
Clearly, 50+ years of research have gone into this, so it's probably very hard to do any better. (One has to appreciate that there are 0 loops, only 2 divisions, and only 6 if statements in the whole algorithm!) The reason for this is, again, the behavior at x = 0, where all derivatives diverge, which makes it extremely hard to keep the error under control: I once had a spline representation with 18 knots that was good up to x = 1e-4, with absolute and relative errors < 5e-4 everywhere, but going to x = 1e-5 ruined everything again.
So, unless the requirement to go arbitrarily close to zero is relaxed, I recommend using the adapted version of e_pow.c given above.
Edit 4
Now that we know that the domain 0.3 <= x <= 1 is sufficient, and that we have very low accuracy requirements, Edit 3 is clearly overkill. As @MvG has demonstrated, the function is so well behaved that a polynomial of degree 7 is sufficient to satisfy the accuracy requirements, which can be considered a single spline segment. @MvG's solution minimizes the integral error, which already looks very good.
The question arises as to how much better we can still do. It would be interesting to find the polynomial of a given degree that minimizes the maximum error in the interval of interest. The answer is the minimax polynomial, which can be found using the Remez algorithm, which is implemented in the Boost library. I like @MvG's idea to clamp the value at x = 1 to 1, which I will do as well. Here is minimax.cpp:
#include <iostream>
#include <iomanip>
#include <boost/shared_ptr.hpp>
#define TARG_PREC 64
#define WORK_PREC (TARG_PREC*2)
#include <boost/multiprecision/cpp_dec_float.hpp>
typedef boost::multiprecision::number<boost::multiprecision::cpp_dec_float<WORK_PREC> > dtype;
using boost::math::pow;
#include <boost/math/tools/remez.hpp>
boost::shared_ptr<boost::math::tools::remez_minimax<dtype> > p_remez;
dtype f(const dtype& x) {
static const dtype one(1), y(0.19029);
return one - pow(one - x, y);
}
void out(const char *descr, const dtype& x, const char *sep="") {
std::cout << descr << boost::math::tools::real_cast<double>(x) << sep << std::endl;
}
int main() {
dtype a(0), b(0.7); // range to optimise over
bool rel_error(false), pin(true);
int orderN(7), orderD(0), skew(0), brake(50);
int prec = 2 + (TARG_PREC * 3010LL)/10000;
std::cout << std::scientific << std::setprecision(prec);
p_remez.reset(new boost::math::tools::remez_minimax<dtype>(
&f, orderN, orderD, a, b, pin, rel_error, skew, WORK_PREC));
out("Max error in interpolated form: ", p_remez->max_error());
p_remez->set_brake(brake);
unsigned i, count(50);
for (i = 0; i < count; ++i) {
std::cout << "Stepping..." << std::endl;
dtype r = p_remez->iterate();
out("Maximum Deviation Found: ", p_remez->max_error());
out("Expected Error Term: ", p_remez->error_term());
out("Maximum Relative Change in Control Points: ", r);
}
boost::math::tools::polynomial<dtype> n = p_remez->numerator();
for(i = n.size(); i--; ) {
out("", n[i], ",");
}
}
Since all parts of boost that we use are header-only, simply build with:
c++ -O3 -I<path/to/boost/headers> minimax.cpp -o minimax
We finally get the coefficients, which are after multiplication by 44330:
24538.3409, -42811.1497, 34300.7501, -11284.1276, 4564.5847, 3186.7541, 8442.5236, 0.
The following error plot demonstrates that this is really the best possible degree-7 polynomial approximation, since all extrema are of equal magnitude (0.06659):
Should the requirements ever change (while still keeping well away from 0!), the C++ program above can be simply adapted to spit out the new optimal polynomial approximation.
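For completeness, a minimal sketch of evaluating those coefficients with Horner's method, under my reading of the program's output (coefficients listed highest degree first, already scaled by k = 44330, with the argument t = 1 - p/p0 assumed to lie in [0, 0.7]); the helper name is hypothetical:
double A_approx(double t)
{
    double fx;
    fx =        24538.3409;
    fx = fx*t - 42811.1497;
    fx = fx*t + 34300.7501;
    fx = fx*t - 11284.1276;
    fx = fx*t +  4564.5847;
    fx = fx*t +  3186.7541;
    fx = fx*t +  8442.5236;
    fx = fx*t;              /* constant term is 0 */
    return fx;
}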
Instead of a lookup table, I'd use a polynomial approximation:
1 - x^0.19029 ≈ -1073365.91783 x^15 + 8354695.40833 x^14 - 29422576.6529 x^13 + 61993794.537 x^12 - 87079891.4988 x^11 + 86005723.842 x^10 - 61389954.7459 x^9 + 32053170.1149 x^8 - 12253383.4372 x^7 + 3399819.97536 x^6 - 672003.142815 x^5 + 91817.6782072 x^4 - 8299.75873768 x^3 + 469.530204564 x^2 - 16.6572179869 x + 0.722044145701
Or in code:
double f(double x) {
double fx;
fx = - 1073365.91783;
fx = fx*x + 8354695.40833;
fx = fx*x - 29422576.6529;
fx = fx*x + 61993794.537;
fx = fx*x - 87079891.4988;
fx = fx*x + 86005723.842;
fx = fx*x - 61389954.7459;
fx = fx*x + 32053170.1149;
fx = fx*x - 12253383.4372;
fx = fx*x + 3399819.97536;
fx = fx*x - 672003.142815;
fx = fx*x + 91817.6782072;
fx = fx*x - 8299.75873768;
fx = fx*x + 469.530204564;
fx = fx*x - 16.6572179869;
fx = fx*x + 0.722044145701;
return fx;
}
I computed this in Sage using a least-squares approach:
f(x) = 1-x^(19029/100000) # your function
d = 16 # number of terms, i.e. degree + 1
A = matrix(d, d, lambda r, c: integrate(x^r*x^c, (x, 0, 1)))
b = vector([integrate(x^r*f(x), (x, 0, 1)) for r in range(d)])
A.solve_right(b).change_ring(RDF)
Here is a plot of the error this will entail:
Blue is the error from my 16 term polynomial, while red is the error you'd get from piecewise linear interpolation with 16 equidistant values. As you can see, both errors are quite small for most parts of the range, but will become really huge close to x=0. I actually clipped the plot there. If you can somehow narrow the range of possible values, you could use that as the domain for the integration, and obtain an even better fit for the relevant range. At the cost of worse fit outside, of course. You could also increase the number of terms to obtain a closer fit, although that might also lead to higher oscillations.
I guess you can also combine this approach with the one Stefan posted: use his to split the domain into several parts, then use mine to find a close low degree polynomial for each part.
Update
Since you updated the specification of your question, with regard to both the domain and the error, here is a minimal solution to fit those requirements:
44330 (1 - x^0.19029) ≈ 23024.9160933 (1-x)^7 - 39408.6473636 (1-x)^6 + 31379.9086193 (1-x)^5 - 10098.7031260 (1-x)^4 + 4339.44098317 (1-x)^3 + 3202.85705860 (1-x)^2 + 8442.42528906 (1-x)
double f(double x) {
double fx, x1 = 1. - x;
fx = + 23024.9160933;
fx = fx*x1 - 39408.6473636;
fx = fx*x1 + 31379.9086193;
fx = fx*x1 - 10098.7031260;
fx = fx*x1 + 4339.44098317;
fx = fx*x1 + 3202.85705860;
fx = fx*x1 + 8442.42528906;
fx = fx*x1;
return fx;
}
I integrated x from 0.293 to 1 or equivalently 1 - x from 0 to 0.707 to keep the worst oscillations outside the relevant domain. I also omitted the constant term, to ensure an exact result at x=1. The maximal error for the range [0.3, 1] now occurs at x=0.3260 and amounts to 0.0972 < 0.1. Here is an error plot, which of course has bigger absolute errors than the one above due to the scale factor k=44330 which has been included here.
I can also state that the first three derivatives of the function will have constant sign over the range in question, so the function is monotonic, convex, and in general pretty well-behaved.
Not meant to answer the question, but it illustrates the Road Not To Go, and thus may be helpful:
This quick-and-dirty C code calculates pow(i, 0.19029) for 0.000 to 1.000 in steps of 0.01. The first half displays the error, in percents, when stored as 1/65536ths (as that theoretically provides slightly over 4 decimals of precision). The second half shows both interpolated and calculated values in steps of 0.001, and the difference between these two.
It kind of looks okay if you read from the bottom up, all 100s and 99.99s there, but about the first 20 values from 0.001 to 0.020 are worthless.
#include <stdio.h>
#include <math.h>
float powers[102];
int main (void)
{
int i, as_int;
double as_real, low, high, delta, approx, calcd, diff;
printf ("calculating and storing:\n");
for (i=0; i<=101; i++)
{
as_real = pow(i/100.0, 0.19029);
as_int = (int)round(65536*as_real);
powers[i] = as_real;
diff = 100*as_real/(as_int/65536.0);
printf ("%.5f %.5f %.5f ~ %.3f\n", i/100.0, as_real, as_int/65536.0, diff);
}
printf ("\n");
printf ("-- interpolating in 1/10ths:\n");
for (i=0; i<1000; i++)
{
as_real = i/1000.0;
low = powers[i/10];
high = powers[1+i/10];
delta = (high-low)/10.0;
approx = low + (i%10)*delta;
calcd = pow(as_real, 0.19029);
diff = 100.0*approx/calcd;
printf ("%.5f ~ %.5f = %.5f +/- %.5f%%\n", as_real, approx, calcd, diff);
}
return 0;
}
You can find a complete, correct standalone implementation of pow in fdlibm. It's about 200 lines of code, about half of which deal with special cases. If you remove the code that deals with special cases you're not interested in I doubt you'll have problems including it in your program.
LutzL's answer is a really good one: calculate your power as (x^1.52232)^(1/8), computing the inner power by spline interpolation or another method. The eighth root deals with the pathological non-differentiable behavior near zero. I took the liberty of mocking up an implementation this way. The code below, however, only does a linear interpolation for x^1.52232, and you'd need to get the full coefficients using your favorite numerical mathematics tools. You'll be adding scarcely 40 lines of code to get your needed power, plus however many knots you choose to use for your spline, as dictated by your required accuracy.
Don't be scared by the #include <math.h>; it's just for benchmarking the code.
#include <stdio.h>
#include <math.h>
double my_sqrt(double x) {
/* Newton's method for a square root. */
int i = 0;
double res = 1.0;
if (x > 0) {
for (i = 0; i < 10; i++) {
res = 0.5 * (res + x / res);
}
} else {
res = 0.0;
}
return res;
}
double my_152232(double x) {
/* Cubic spline interpolation for x ** 1.52232. */
int i = 0;
double res = 0.0;
/* coefs[i] will give the cubic polynomial coefficients between x =
i and x = i+1. Out of laziness, the below numbers give only a
linear interpolation. You'll need to do some work and research
to get the spline coefficients. */
double coefs[3][4] = {{0.0, 1.0, 0.0, 0.0},
{-0.872526, 1.872526, 0.0, 0.0},
{-2.032706, 2.452616, 0.0, 0.0}};
if ((x >= 0) && (x < 3.0)) {
i = (int) x;
/* Horner's method cubic. */
res = ((coefs[i][3] * x + coefs[i][2]) * x + coefs[i][1]) * x + coefs[i][0];
} else if (x >= 3.0) {
/* Scaled x ** 1.5 once you go off the spline. */
res = 1.024824 * my_sqrt(x * x * x);
}
return res;
}
double my_019029(double x) {
return my_sqrt(my_sqrt(my_sqrt(my_152232(x))));
}
int main() {
int i;
double x = 0.0;
for (i = 0; i < 1000; i++) {
x = 1e-2 * i;
printf("%f %f %f \n", x, my_019029(x), pow(x, 0.19029));
}
return 0;
}
EDIT: If you're just interested in a small region like [0,1], even simpler is to peel off one sqrt(x) and compute x^1.02232, which is quite well behaved, using a Taylor series:
double my_152232(double x) {
double part_050000 = my_sqrt(x);
double part_102232 = 1.02232 * x + 0.0114091 * x * x - 3.718147e-3 * x * x * x;
return part_102232 * part_050000;
}
This gets you within 1% of the exact power for approximately [0.1,6], though getting the singularity exactly right is always a challenge. Even so, this three-term Taylor series gets you within 2.3% for x = 0.001.

Can anyone help me to optimize this for loop use SSE?

I have a for loop which will run many times, and will cost a lot of time:
for (int z=0; z<temp; z++)
{
float findex= a + b * A[z];
int iindex = findex ;
outArray[z] += inArray[iindex] + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
a++;
}
I have optimized this code, but got no performance improvement! Maybe my SSE code is bad; can anyone help me?
Try using the restrict keyword on inArray and outArray. Otherwise the compiler has to assume that inArray could be == outArray. In this case no parallelization would be possible.
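A minimal sketch of what that might look like (C99; the wrapper function name is mine):
/* restrict promises the compiler that the arrays do not alias, so it is free
   to reorder and vectorize the loads and stores. */
void interpolate(float a, float b, int temp,
                 const float *restrict A,
                 const float *restrict inArray,
                 float *restrict outArray)
{
    for (int z = 0; z < temp; z++)
    {
        float findex = a + b * A[z];
        int   iindex = findex;
        outArray[z] += inArray[iindex]
                     + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
        a++;
    }
}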
Your loop has a loop-carried dependency when you write to outArray[z]. Your CPU can do more than one floating-point sum at once, but with your current loop you only allow one sum into outArray[z] at a time. To fix this you should unroll your loop.
for (int z=0; z<temp; z+=2) {
float findex_v1 = a + b * A[z];
int iindex_v1 = findex_v1;
outArray[z] += inArray[iindex_v1] + (findex_v1 - iindex_v1) * (inArray[iindex_v1+1] - inArray[iindex_v1]);
float findex_v2 = (a+1) + b * A[z+1];
int iindex_v2 = findex_v2;
outArray[z+1] += inArray[iindex_v2] + (findex_v2 - iindex_v2) * (inArray[iindex_v2+1] - inArray[iindex_v2]);
a+=2;
}
In terms of SIMD, the problem is that you have to gather non-contiguous data when you access inArray[iindex_v1]. AVX2 has some gather instructions, but I have not tried them. Otherwise it may be best to do the gather without SIMD. All the operations accessing z access contiguous memory, so that part is easy. Pseudo-code (without unrolling) would look something like this:
int indexa[4];
float inArraya[4];
float dinArraya[4];
int4 a4 = a + float4(0,1,2,3);
for (int z=0; z<temp; z+=4) {
//use SSE for contiguous memory
float4 findex4 = a4 + b * float4.load(&A[z]);
int4 iindex4 = truncate_to_int(findex4);
//don't use SSE for non-contiguous memory
iindex4.store(indexa);
for(int i=0; i<4; i++) {
inArraya[i] = inArray[indexa[i]];
dinArraya[i] = inArray[indexa[i]+1] - inArray[indexa[i]];
}
//loading from an array right after writing to it causes a CPU stall
float4 inArraya4 = float4.load(inArraya);
float4 dinArraya4 = float4.load(dinArraya);
//back to SSE
float4 outArray4 = float4.load(&outArray[z]);
outArray4 += inArraya4 + (findex4 - iindex4)*dinArraya4;
outArray4.store(&outArray[z]);
a4+=4;
}
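For reference, a hedged sketch of what the AVX2 gather variant mentioned above might look like with real intrinsics (untested; it assumes A, inArray, and outArray are float arrays, that temp is a multiple of 8, that every iindex+1 access stays in bounds, and that a fits in a float):
#include <immintrin.h>   /* AVX2 */

__m256 av = _mm256_add_ps(_mm256_set1_ps(a), _mm256_setr_ps(0,1,2,3,4,5,6,7));
__m256 bv = _mm256_set1_ps(b);
for (int z = 0; z < temp; z += 8)
{
    __m256  findex = _mm256_add_ps(av, _mm256_mul_ps(bv, _mm256_loadu_ps(A + z)));
    __m256i iindex = _mm256_cvttps_epi32(findex);                 /* truncate like the (int) cast */
    __m256  frac   = _mm256_sub_ps(findex, _mm256_cvtepi32_ps(iindex));
    __m256  lo     = _mm256_i32gather_ps(inArray,     iindex, 4); /* inArray[iindex]   */
    __m256  hi     = _mm256_i32gather_ps(inArray + 1, iindex, 4); /* inArray[iindex+1] */
    __m256  res    = _mm256_add_ps(lo, _mm256_mul_ps(frac, _mm256_sub_ps(hi, lo)));
    _mm256_storeu_ps(outArray + z,
                     _mm256_add_ps(_mm256_loadu_ps(outArray + z), res));
    av = _mm256_add_ps(av, _mm256_set1_ps(8.0f));                 /* a advances by 8 per chunk */
}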

Effects of Loop unrolling on memory bound data

I have been working with a piece of code which is intensively memory-bound. I am trying to optimize it within a single core by manually implementing cache blocking, software prefetching, loop unrolling, etc. Cache blocking gives a significant improvement in performance. However, when I introduce loop unrolling I get tremendous performance degradation.
I am compiling with Intel icc with compiler flags -O2 and -ipo in all my test cases.
My code is similar to this (3D 25-point stencil):
void stencil_baseline (double *V, double *U, int dx, int dy, int dz, double c0, double c1, double c2, double c3, double c4)
{
int i, j, k;
for (k = 4; k < dz-4; k++)
{
for (j = 4; j < dy-4; j++)
{
//x-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] = (c0 * (V[k*dy*dx+j*dx+i]) //center
+ c1 * (V[k*dy*dx+j*dx+(i-1)] + V[k*dy*dx+j*dx+(i+1)])
+ c2 * (V[k*dy*dx+j*dx+(i-2)] + V[k*dy*dx+j*dx+(i+2)])
+ c3 * (V[k*dy*dx+j*dx+(i-3)] + V[k*dy*dx+j*dx+(i+3)])
+ c4 * (V[k*dy*dx+j*dx+(i-4)] + V[k*dy*dx+j*dx+(i+4)]));
}
//y-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] += (c1 * (V[k*dy*dx+(j-1)*dx+i] + V[k*dy*dx+(j+1)*dx+i])
+ c2 * (V[k*dy*dx+(j-2)*dx+i] + V[k*dy*dx+(j+2)*dx+i])
+ c3 * (V[k*dy*dx+(j-3)*dx+i] + V[k*dy*dx+(j+3)*dx+i])
+ c4 * (V[k*dy*dx+(j-4)*dx+i] + V[k*dy*dx+(j+4)*dx+i]));
}
//z-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] += (c1 * (V[(k-1)*dy*dx+j*dx+i] + V[(k+1)*dy*dx+j*dx+i])
+ c2 * (V[(k-2)*dy*dx+j*dx+i] + V[(k+2)*dy*dx+j*dx+i])
+ c3 * (V[(k-3)*dy*dx+j*dx+i] + V[(k+3)*dy*dx+j*dx+i])
+ c4 * (V[(k-4)*dy*dx+j*dx+i] + V[(k+4)*dy*dx+j*dx+i]));
}
}
}
}
When I do loop unrolling on the innermost loop (dimension i) and unroll in directions x, y, z separately by unroll factors 2, 4, and 8 respectively, I get performance degradation in all 9 cases, i.e., unroll by 2 in direction x, unroll by 2 in direction y, unroll by 2 in direction z, unroll by 4 in direction x, etc.
But when I perform loop unrolling on the outermost loop (dimension k) by a factor of 8 (and also 2 and 4), I get a very good performance improvement, which is even better than cache blocking.
I even tried profiling my code with Intel VTune. It seemed like the bottlenecks were mainly due to (1) LLC misses and (2) LLC load misses serviced by remote DRAM.
I am unable to understand why unrolling the innermost, fastest loop gives performance degradation whereas unrolling the outermost, slowest dimension brings a performance improvement. (The improvement in the latter case is with -O2 and -ipo when compiling with icc.)
I am not sure how to interpret these statistics. Can someone help shed some light on this?
This strongly suggests that you are causing instruction cache misses by the unrolling, which is typical. In the age of modern hardware, unrolling no longer automatically means faster code. If each inner loop fits in a cache line, you will get better performance.
You may be able to unroll manually, to limit the size of the generated code, but this will require examining the generated machine-language instructions -- and their position -- to ensure that your loop is within a single cache line. Cache lines are typically 64 bytes long, and aligned on 64-byte boundaries.
Outer loops do not have the same effect. They will likely be outside of the instruction cache regardless of the unroll level. Unrolling these results in fewer branches, which is why you get better performance.
"Load misses serviced by remote DRAM" means that you allocated memory on one NUMA node, but now you are running on the other. Setting process or thread affinity based on NUMA is the answer.
Remote DRAM takes almost twice as long to read as local DRAM on the Intel machines that I have used.
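A hedged sketch of one way to do that with libnuma on Linux (nbytes stands for the size of one grid, dx*dy*dz*sizeof(double); error handling omitted; an alternative is simply launching the program under numactl):
#include <numa.h>    /* link with -lnuma */

/* Pin execution to one NUMA node and allocate the grids on that same node,
   so LLC misses are serviced by local rather than remote DRAM. */
if (numa_available() != -1) {
    numa_run_on_node(0);                          /* run on node 0            */
    double *V = numa_alloc_onnode(nbytes, 0);     /* allocate on node 0 too   */
    double *U = numa_alloc_onnode(nbytes, 0);
    /* ... call stencil_baseline(V, U, dx, dy, dz, c0, c1, c2, c3, c4) ... */
    numa_free(V, nbytes);
    numa_free(U, nbytes);
}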
