Related
I'm testing a c code for Linux with large arrays to measure thread performance, the application scales very well when threads are increased until max cores (8 for Intel 4770), but this is only for the pure math part of my code.
If I add the printf part for resulted arrays then the times becomes too large, from few seconds to several minutes even if redirected to a file, when printf those arrays should add just a few seconds.
The code:
(gcc 7.5.0-Ubuntu 18.04)
without printf loop:
gcc -O3 -m64 exp_multi.c -pthread -lm
with printf loop:
gcc -DPRINT_ARRAY -O3 -m64 exp_multi.c -pthread -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 8
static double xv[MAXSIZE];
static double yv[MAXSIZE];
/* gcc -O3 -m64 exp_multi.c -pthread -lm */
void* run(void *received_Val){
int single_val = *((int *) received_Val);
int r;
int i;
double p;
for (r = 0; r < REIT; r++) {
p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i++) {
xv[i]=p;
yv[i]=exp(p);
p += 0.00001;
}
}
return 0;
}
int main(){
int i;
pthread_t tid[num_threads];
for (i=0;i<num_threads;i++){
int *arg = malloc(sizeof(*arg));
if ( arg == NULL ) {
fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
exit(1);
}
*arg = i;
pthread_create(&(tid[i]), NULL, run, arg);
}
for(i=0; i<num_threads; i++)
{
pthread_join(tid[i], NULL);
}
#ifdef PRINT_ARRAY
for (i=0;i<MAXSIZE;i++){
printf("x=%.20lf, e^x=%.20lf\n",xv[i],yv[i]);
}
#endif
return 0;
}
malloc in pthread_create passing an integer as the last argument as suggested in this post.
I tried, without success, clang, add free(tid) instruction, avoid using malloc instruction, reverse loops, only 1 unidimensional array, 1 thread version without pthread.
EDIT2: I think the exp function is processor resource intensive, probably affected by the per core cache or SIMD resources implemented by the processor generation. The following sample code is based on a licensed code posted on Stack Overflow.
This code runs fast with or without the printf loop, and exp from math.h has been improved a few years ago, it can be around x40 faster, at least on the Intel 4770 (Haswell), this link is a known test code for math library vs SSE2, and now exp speed of math should be close to the AVX2 algorithm optimized for float and x8 parallel calculations.
Test results: expf vs other SSE2 algortihm (exp_ps):
sinf .. -> 55.5 millions of vector evaluations/second -> 12 cycles/value
cosf .. -> 57.3 millions of vector evaluations/second -> 11 cycles/value
sincos (x87) .. -> 9.1 millions of vector evaluations/second -> 71 cycles/value
expf .. -> 61.4 millions of vector evaluations/second -> 11 cycles/value
logf .. -> 55.6 millions of vector evaluations/second -> 12 cycles/value
cephes_sinf .. -> 52.5 millions of vector evaluations/second -> 12 cycles/value
cephes_cosf .. -> 41.9 millions of vector evaluations/second -> 15 cycles/value
cephes_expf .. -> 18.3 millions of vector evaluations/second -> 35 cycles/value
cephes_logf .. -> 20.2 millions of vector evaluations/second -> 32 cycles/value
sin_ps .. -> 54.1 millions of vector evaluations/second -> 12 cycles/value
cos_ps .. -> 54.8 millions of vector evaluations/second -> 12 cycles/value
sincos_ps .. -> 54.6 millions of vector evaluations/second -> 12 cycles/value
exp_ps .. -> 42.6 millions of vector evaluations/second -> 15 cycles/value
log_ps .. -> 41.0 millions of vector evaluations/second -> 16 cycles/value
/* Performance test exp(x) algorithm
based on AVX implementation of Giovanni Garberoglio
Copyright (C) 2020 Antonio R.
AVX implementation of exp:
Modified code from this source: https://github.com/reyoung/avx_mathfun
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/* gcc -O3 -m64 -Wall -mavx2 -march=haswell expc.c -lm */
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
__m256 exp256_ps(__m256 x) {
/*
To increase the compatibility across different compilers the original code is
converted to plain AVX2 intrinsics code without ingenious macro's,
gcc style alignment attributes etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/
__m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
__m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);
__m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);
__m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256 fx;
__m256i imm0;
__m256 one = _mm256_set1_ps(1.0f);
x = _mm256_min_ps(x, exp_hi);
x = _mm256_max_ps(x, exp_lo);
/* express exp(x) as exp(g + n*log(2)) */
fx = _mm256_mul_ps(x, cephes_LOG2EF);
fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
x = _mm256_sub_ps(x, z);
z = _mm256_mul_ps(x,x);
__m256 y = cephes_exp_p0;
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p1);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p2);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p3);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p4);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p5);
y = _mm256_mul_ps(y, z);
y = _mm256_add_ps(y, x);
y = _mm256_add_ps(y, one);
/* build 2^n */
imm0 = _mm256_cvttps_epi32(fx);
imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
imm0 = _mm256_slli_epi32(imm0, 23);
__m256 pow2n = _mm256_castsi256_ps(imm0);
y = _mm256_mul_ps(y, pow2n);
return y;
}
int main(){
int r;
int i;
float p;
static float xv[MAXSIZE];
static float yv[MAXSIZE];
float *xp;
float *yp;
for (r = 0; r < REIT; r++) {
p = XXX;
xp = xv;
yp = yv;
for (i = 0; i < MAXSIZE; i += 8) {
__m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
__m256 y = exp256_ps(x);
_mm256_store_ps(xp,x);
_mm256_store_ps(yp,y);
xp += 8;
yp += 8;
p += 0.00008;
}
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
For comparison, this is the code example with exp (x) from the math library, single thread and float.
#include <stdio.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
/* gcc -O3 -m64 exp_st.c -lm */
int main(){
int r;
int i;
float p;
static float xv[MAXSIZE];
static float yv[MAXSIZE];
for (r = 0; r < REIT; r++) {
p = XXX;
for (i = 0; i < MAXSIZE; i++) {
xv[i]=p;
yv[i]=expf(p);
p += 0.00001;
}
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
SOLUTION: As Andreas Wenzel said, the gcc compiler is smart enough and it decides that it is not necessary to actually write the results to the array, these writes are optimized away by the compiler. After new performance tests I made based on new information, or before I committed several mistakes or I assumed wrong facts, it seems the results are clearer: exp (double arg), or expf( float arg) which is x2+ exp(double arg), have been improved but it is not as a fast AVX2 algorithm (x8 parallel float arg), which is around x6 faster than SSE2 algorithm (x4 parallel float arg). Here are some results, as expected for Intel Hyper-Threading CPUs, excepts for SSE2 algorithm:
exp (double arg) single thread: 18 min 46 sec
exp (double arg) 4 threads: 5 min 4 sec
exp (double arg) 8 threads: 4 min 28 sec
expf (float arg) single thread: 7 min 32 sec
expf (float arg) 4 threads: 1 min 58 sec
expf (float arg) 8 threads: 1 min 41 sec
Relative error**:
i x y = expf(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059411496e-02 exp_dbl = 8.059411715e-02 rel_err =-2.716802480e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.867173195e-02 y = 9.620664716e-01 exp_dbl = 9.620664730e-01 rel_err =-1.481617428e-09
i = 620000 x = 1.201261759e+00 y = 3.324308872e+00 exp_dbl = 3.324308753e+00 rel_err = 3.571995830e-08
i = 744000 x = 2.441616058e+00 y = 1.149159718e+01 exp_dbl = 1.149159684e+01 rel_err = 2.955980805e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
*SSE2 algorithm by Julien Pommier, x6,8 speed increase from one thread to 8 threads. My performance test code uses aligned(16) for typedef union of vector/4 float array passed to the library, instead of aligned whole float array. This may causes lower performance, at least for other AVX2 code, its performance improvement with multithread too seems good for Intel Hyper-Threading but at a lower speed, time increased between x2.5-x1.5. Maybe SSE2 code could be sped up with better array alignment which I couldn't improve:
exp_ps (x4 parallel float arg) single thread: 12 min 7 sec
exp_ps (x4 parallel float arg) 4 threads: 3 min 10 sec
exp_ps (x4 parallel float arg) 8 threads: 1 min 46 sec
Relative error**:
i x y = exp_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059412241e-02 exp_dbl = 8.059411715e-02 rel_err = 6.527768787e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.977407143e-02 y = 9.610065222e-01 exp_dbl = 9.610065335e-01 rel_err =-1.174323454e-08
i = 620000 x = 1.200158238e+00 y = 3.320642233e+00 exp_dbl = 3.320642334e+00 rel_err =-3.054731957e-08
i = 744000 x = 2.441616058e+00 y = 1.149159622e+01 exp_dbl = 1.149159684e+01 rel_err =-5.342903415e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
AVX2 algorithm (x8 parallel float arg) single thread: 1 min 45 sec
AVX2 algorithm (x8 parallel float arg) 4 threads: 28 sec
AVX2 algorithm (x8 parallel float arg) 8 threads: 27 sec
Relative error**:
i x y = exp256_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.516632080e+00 y = 8.073104918e-02 exp_dbl = 8.073104510e-02 rel_err = 5.057888540e-08
i = 372000 x =-1.279417157e+00 y = 2.781994045e-01 exp_dbl = 2.781993997e-01 rel_err = 1.705288467e-08
i = 496000 x =-3.954863176e-02 y = 9.612231851e-01 exp_dbl = 9.612232069e-01 rel_err =-2.269774967e-08
i = 620000 x = 1.199879169e+00 y = 3.319715738e+00 exp_dbl = 3.319715775e+00 rel_err =-1.119642824e-08
i = 744000 x = 2.440370798e+00 y = 1.147729492e+01 exp_dbl = 1.147729571e+01 rel_err =-6.896860199e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.923286438e+00 y = 1.374535980e+02 exp_dbl = 1.374536045e+02 rel_err =-4.676466368e-08
**The relative errors are the same, since the codes with SSE2 and AVX2 use identical algorithms, and it is more than likely that it is also that of the library function exp(x).
Source code AVX2 algorithm multithread
/* Performance test of a multithreaded exp(x) algorithm
based on AVX implementation of Giovanni Garberoglio
Copyright (C) 2020 Antonio R.
AVX implementation of exp:
Modified code from this source: https://github.com/reyoung/avx_mathfun
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/* gcc -O3 -m64 -mavx2 -march=haswell expc_multi.c -pthread -lm */
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 4
typedef float FLOAT32[MAXSIZE] __attribute__((aligned(4)));
static FLOAT32 xv;
static FLOAT32 yv;
__m256 exp256_ps(__m256 x) {
/*
To increase the compatibility across different compilers the original code is
converted to plain AVX2 intrinsics code without ingenious macro's,
gcc style alignment attributes etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/
__m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
__m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);
__m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);
__m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256 fx;
__m256i imm0;
__m256 one = _mm256_set1_ps(1.0f);
x = _mm256_min_ps(x, exp_hi);
x = _mm256_max_ps(x, exp_lo);
/* express exp(x) as exp(g + n*log(2)) */
fx = _mm256_mul_ps(x, cephes_LOG2EF);
fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
x = _mm256_sub_ps(x, z);
z = _mm256_mul_ps(x,x);
__m256 y = cephes_exp_p0;
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p1);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p2);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p3);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p4);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p5);
y = _mm256_mul_ps(y, z);
y = _mm256_add_ps(y, x);
y = _mm256_add_ps(y, one);
/* build 2^n */
imm0 = _mm256_cvttps_epi32(fx);
imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
imm0 = _mm256_slli_epi32(imm0, 23);
__m256 pow2n = _mm256_castsi256_ps(imm0);
y = _mm256_mul_ps(y, pow2n);
return y;
}
void* run(void *received_Val){
int single_val = *((int *) received_Val);
int r;
int i;
float p;
float *xp;
float *yp;
for (r = 0; r < REIT; r++) {
p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
xp = xv + single_val*MAXSIZE/num_threads;
yp = yv + single_val*MAXSIZE/num_threads;
for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i += 8) {
__m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
__m256 y = exp256_ps(x);
_mm256_store_ps(xp,x);
_mm256_store_ps(yp,y);
xp += 8;
yp += 8;
p += 0.00008;
}
}
return 0;
}
int main(){
int i;
pthread_t tid[num_threads];
for (i=0;i<num_threads;i++){
int *arg = malloc(sizeof(*arg));
if ( arg == NULL ) {
fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
exit(1);
}
*arg = i;
pthread_create(&(tid[i]), NULL, run, arg);
}
for(i=0; i<num_threads; i++)
{
pthread_join(tid[i], NULL);
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
Charts overview:
exp (double arg) without printf loop, not real performance, as Andreas Wenzel found, gcc doesn't calculate exp(x) when results are not going to be printf, even float version is much slower because its different few assembly instructions. Although graph may be useful for some assembly algorithm that only uses low-level CPU cache/registers.
expf (float arg) real performance or with printf loop
AVX2 algorithm, the best performance.
When you don't print the arrays at the end of the program, the gcc compiler is smart enough to realize that the results of the calculations have no observable effects. Therefore, the compiler decides that it is not necessary to actually write the results to the array, because these results are never used. Instead, these writes are optimized away by the compiler.
Also, when you don't print the results, the library function exp has no observable effects, provided it is not called with input that is so high that it would cause a floating-point overflow (which would cause the function to raise a floating point exception). This allows the compiler to optimize these function calls away, too.
As you can see in the assembly instructions emitted by the gcc compiler for your code which does not print the results, the compiled program doesn't unconditionally call the function exp, but instead tests if the input to the function exp would be higher than 7.09e2 (to ensure that no overflow will occur). Only if an overflow would occur, will the program jump to the code which calls the function exp. Here is the relevant assembly code:
ucomisd xmm1, xmm3
jnb .L9
In the above assembly code, the CPU register xmm3 contains the double-precision floating-point value 7.09e2. If the input is higher than this constant, the function exp would cause a floating-point overflow, because the result cannot be represented in a double-precision floating-point value.
Since your input is always valid and low enough to not cause a floating-point overflow, your program will never perform this jump, so it will never actually call the function exp.
This explains why your code is so much faster when you do not print the results. If you do not print the results, your compiler will determine that the calculations have no observable effects, so it will optimize them away.
Therefore, if you want the compiler to actually perform the calculations, you must ensure that the calculations have some observable effect. This does not mean that you have to actually print all results (which are several megabytes large). It is sufficient if you just print one line which is dependant on all the results (for example the sum of all results).
However, if you replace the function call to the library function exp with a call to some other custom function, then, at least in my tests, the compiler is not smart enough to realize that the function call has no observable effects. In that case, it is unable to optimize the function call away, even if you do not print the results of the calculations.
For the reasons stated above, if you want to compare the performance of both functions, you must ensure that the calculations actually take place, by making sure that the results have an observable effect. Otherwise, you run the risk of the compiler optimizing away at least some of the calculations, and the comparison would not be fair.
I don't think this has much to with pthread because your code only appears to call printf after the threads are joined. Instead, the poor performance is likely due to cache misses by needing to read from the xv and yv arrays in every iteration of the print loop.
I'm writing a C program for a PIC micro-controller which needs to do a very specific exponential function. I need to calculate the following:
A = k . (1 - (p/p0)^0.19029)
k and p0 are constant, so it's all pretty simple apart from finding x^0.19029
(p/p0) ratio would always be in the range 0-1.
It works well if I add in math.h and use the power function, except that uses up all of the available 16 kB of program memory. Talk about bloatware! (Rest of program without power function = ~20% flash memory usage; add math.h and power function, =100%).
I'd like the program to do some other things as well. I was wondering if I can write a special case implementation for x^0.19029, maybe involving iteration and some kind of lookup table.
My idea is to generate a look-up table for the function x^0.19029, with perhaps 10-100 values of x in the range 0-1. The code would find a close match, then (somehow) iteratively refine it by re-scaling the lookup table values. However, this is where I get lost because my tiny brain can't visualise the maths involved.
Could this approach work?
Alternatively, I've looked at using Exp(x) and Ln(x), which can be implemented with a Taylor expansion. b^x can the be found with:
b^x = (e^(ln b))^x = e^(x.ln(b))
(See: Wikipedia - Powers via Logarithms)
This looks a bit tricky and complicated to me, though. Am I likely to get the implementation smaller then the compiler's math library, and can I simplify it for my special case (i.e. base = 0-1, exponent always 0.19029)?
Note that RAM usage is OK at the moment, but I've run low on Flash (used for code storage). Speed is not critical. Somebody has already suggested that I use a bigger micro with more flash memory, but that sounds like profligate wastefulness!
[EDIT] I was being lazy when I said "(p/p0) ratio would always be in the range 0-1". Actually it will never reach 0, and I did some calculations last night and decided that in fact a range of 0.3 - 1 would be quite adequate! This mean that some of the simpler solutions below should be suitable. Also, the "k" in the above is 44330, and I'd like the error in the final result to be less than 0.1. I guess that means an error in the (p/p0)^0.19029 needs to be less than 1/443300 or 2.256e-6
Use splines. The relevant part of the function is shown in the figure below. It varies approximately like the 5th root, so the problematic zone is close to p / p0 = 0. There is mathematical theory how to optimally place the knots of splines to minimize the error (see Carl de Boor: A Practical Guide to Splines). Usually one constructs the spline in B form ahead of time (using toolboxes such as Matlab's spline toolbox - also written by C. de Boor), then converts to Piecewise Polynomial representation for fast evaluation.
In C. de Boor, PGS, the function g(x) = sqrt(x + 1) is actually taken as an example (Chapter 12, Example II). This is exactly what you need here. The book comes back to this case a few times, since it is admittedly a hard problem for any interpolation scheme due to the infinite derivatives at x = -1. All software from PGS is available for free as PPPACK in netlib, and most of it is also part of SLATEC (also from netlib).
Edit (Removed)
(Multiplying by x once does not significantly help, since it only regularizes the first derivative, while all other derivatives at x = 0 are still infinite.)
Edit 2
My feeling is that optimally constructed splines (following de Boor) will be best (and fastest) for relatively low accuracy requirements. If the accuracy requirements are high (say 1e-8), one may be forced to get back to the algorithms that mathematicians have been researching for centuries. At this point, it may be best to simply download the sources of glibc and copy (provided GPL is acceptable) whatever is in
glibc-2.19/sysdeps/ieee754/dbl-64/e_pow.c
Since we don't have to include the whole math.h, there shouldn't be a problem with memory, but we will only marginally profit from having a fixed exponent.
Edit 3
Here is an adapted version of e_pow.c from netlib, as found by #Joni. This seems to be the grandfather of glibc's more modern implementation mentioned above. The old version has two advantages: (1) It is public domain, and (2) it uses a limited number of constants, which is beneficial if memory is a tight resource (glibc's version defines over 10000 lines of constants!). The following is completely standalone code, which calculates x^0.19029 for 0 <= x <= 1 to double precision (I tested it against Python's power function and found that at most 2 bits differed):
#define __LITTLE_ENDIAN
#ifdef __LITTLE_ENDIAN
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
#else
#define __HI(x) *(int*)&x
#define __LO(x) *(1+(int*)&x)
#endif
static const double
bp[] = {1.0, 1.5,},
dp_h[] = { 0.0, 5.84962487220764160156e-01,}, /* 0x3FE2B803, 0x40000000 */
dp_l[] = { 0.0, 1.35003920212974897128e-08,}, /* 0x3E4CFDEB, 0x43CFD006 */
zero = 0.0,
one = 1.0,
two = 2.0,
two53 = 9007199254740992.0, /* 0x43400000, 0x00000000 */
/* poly coefs for (3/2)*(log(x)-2s-2/3*s**3 */
L1 = 5.99999999999994648725e-01, /* 0x3FE33333, 0x33333303 */
L2 = 4.28571428578550184252e-01, /* 0x3FDB6DB6, 0xDB6FABFF */
L3 = 3.33333329818377432918e-01, /* 0x3FD55555, 0x518F264D */
L4 = 2.72728123808534006489e-01, /* 0x3FD17460, 0xA91D4101 */
L5 = 2.30660745775561754067e-01, /* 0x3FCD864A, 0x93C9DB65 */
L6 = 2.06975017800338417784e-01, /* 0x3FCA7E28, 0x4A454EEF */
P1 = 1.66666666666666019037e-01, /* 0x3FC55555, 0x5555553E */
P2 = -2.77777777770155933842e-03, /* 0xBF66C16C, 0x16BEBD93 */
P3 = 6.61375632143793436117e-05, /* 0x3F11566A, 0xAF25DE2C */
P4 = -1.65339022054652515390e-06, /* 0xBEBBBD41, 0xC5D26BF1 */
P5 = 4.13813679705723846039e-08, /* 0x3E663769, 0x72BEA4D0 */
lg2 = 6.93147180559945286227e-01, /* 0x3FE62E42, 0xFEFA39EF */
lg2_h = 6.93147182464599609375e-01, /* 0x3FE62E43, 0x00000000 */
lg2_l = -1.90465429995776804525e-09, /* 0xBE205C61, 0x0CA86C39 */
ovt = 8.0085662595372944372e-0017, /* -(1024-log2(ovfl+.5ulp)) */
cp = 9.61796693925975554329e-01, /* 0x3FEEC709, 0xDC3A03FD =2/(3ln2) */
cp_h = 9.61796700954437255859e-01, /* 0x3FEEC709, 0xE0000000 =(float)cp */
cp_l = -7.02846165095275826516e-09, /* 0xBE3E2FE0, 0x145B01F5 =tail of cp_h*/
ivln2 = 1.44269504088896338700e+00, /* 0x3FF71547, 0x652B82FE =1/ln2 */
ivln2_h = 1.44269502162933349609e+00, /* 0x3FF71547, 0x60000000 =24b 1/ln2*/
ivln2_l = 1.92596299112661746887e-08; /* 0x3E54AE0B, 0xF85DDF44 =1/ln2 tail*/
double pow0p19029(double x)
{
double y = 0.19029e+00;
double z,ax,z_h,z_l,p_h,p_l;
double y1,t1,t2,r,s,t,u,v,w;
int i,j,k,n;
int hx,hy,ix,iy;
unsigned lx,ly;
hx = __HI(x); lx = __LO(x);
hy = __HI(y); ly = __LO(y);
ix = hx&0x7fffffff; iy = hy&0x7fffffff;
ax = x;
/* special value of x */
if(lx==0) {
if(ix==0x7ff00000||ix==0||ix==0x3ff00000){
z = ax; /*x is +-0,+-inf,+-1*/
return z;
}
}
s = one; /* s (sign of result -ve**odd) = -1 else = 1 */
double ss,s2,s_h,s_l,t_h,t_l;
n = ((ix)>>20)-0x3ff;
j = ix&0x000fffff;
/* determine interval */
ix = j|0x3ff00000; /* normalize ix */
if(j<=0x3988E) k=0; /* |x|<sqrt(3/2) */
else if(j<0xBB67A) k=1; /* |x|<sqrt(3) */
else {k=0;n+=1;ix -= 0x00100000;}
__HI(ax) = ix;
/* compute ss = s_h+s_l = (x-1)/(x+1) or (x-1.5)/(x+1.5) */
u = ax-bp[k]; /* bp[0]=1.0, bp[1]=1.5 */
v = one/(ax+bp[k]);
ss = u*v;
s_h = ss;
__LO(s_h) = 0;
/* t_h=ax+bp[k] High */
t_h = zero;
__HI(t_h)=((ix>>1)|0x20000000)+0x00080000+(k<<18);
t_l = ax - (t_h-bp[k]);
s_l = v*((u-s_h*t_h)-s_h*t_l);
/* compute log(ax) */
s2 = ss*ss;
r = s2*s2*(L1+s2*(L2+s2*(L3+s2*(L4+s2*(L5+s2*L6)))));
r += s_l*(s_h+ss);
s2 = s_h*s_h;
t_h = 3.0+s2+r;
__LO(t_h) = 0;
t_l = r-((t_h-3.0)-s2);
/* u+v = ss*(1+...) */
u = s_h*t_h;
v = s_l*t_h+t_l*ss;
/* 2/(3log2)*(ss+...) */
p_h = u+v;
__LO(p_h) = 0;
p_l = v-(p_h-u);
z_h = cp_h*p_h; /* cp_h+cp_l = 2/(3*log2) */
z_l = cp_l*p_h+p_l*cp+dp_l[k];
/* log2(ax) = (ss+..)*2/(3*log2) = n + dp_h + z_h + z_l */
t = (double)n;
t1 = (((z_h+z_l)+dp_h[k])+t);
__LO(t1) = 0;
t2 = z_l-(((t1-t)-dp_h[k])-z_h);
/* split up y into y1+y2 and compute (y1+y2)*(t1+t2) */
y1 = y;
__LO(y1) = 0;
p_l = (y-y1)*t1+y*t2;
p_h = y1*t1;
z = p_l+p_h;
j = __HI(z);
i = __LO(z);
/*
* compute 2**(p_h+p_l)
*/
i = j&0x7fffffff;
k = (i>>20)-0x3ff;
n = 0;
if(i>0x3fe00000) { /* if |z| > 0.5, set n = [z+0.5] */
n = j+(0x00100000>>(k+1));
k = ((n&0x7fffffff)>>20)-0x3ff; /* new k for n */
t = zero;
__HI(t) = (n&~(0x000fffff>>k));
n = ((n&0x000fffff)|0x00100000)>>(20-k);
if(j<0) n = -n;
p_h -= t;
}
t = p_l+p_h;
__LO(t) = 0;
u = t*lg2_h;
v = (p_l-(t-p_h))*lg2+t*lg2_l;
z = u+v;
w = v-(z-u);
t = z*z;
t1 = z - t*(P1+t*(P2+t*(P3+t*(P4+t*P5))));
r = (z*t1)/(t1-two)-(w+z*w);
z = one-(r-z);
__HI(z) += (n<<20);
return s*z;
}
Clearly, 50+ years of research have gone into this, so it's probably very hard to do any better. (One has to appreciate that there are 0 loops, only 2 divisions, and only 6 if statements in the whole algorithm!) The reason for this is, again, the behavior at x = 0, where all derivatives diverge, which makes it extremely hard to keep the error under control: I once had a spline representation with 18 knots that was good up to x = 1e-4, with absolute and relative errors < 5e-4 everywhere, but going to x = 1e-5 ruined everything again.
So, unless the requirement to go arbitrarily close to zero is relaxed, I recommend using the adapted version of e_pow.c given above.
Edit 4
Now that we know that the domain 0.3 <= x <= 1 is sufficient, and that we have very low accuracy requirements, Edit 3 is clearly overkill. As #MvG has demonstrated, the function is so well behaved that a polynomial of degree 7 is sufficient to satisfy the accuracy requirements, which can be considered a single spline segment. #MvG's solution minimizes the integral error, which already looks very good.
The question arises as to how much better we can still do? It would be interesting to find the polynomial of a given degree that minimizes the maximum error in the interval of interest. The answer is the minimax
polynomial, which can be found using Remez' algorithm, which is implemented in the Boost library. I like #MvG's idea to clamp the value at x = 1 to 1, which I will do as well. Here is minimax.cpp:
#include <ostream>
#define TARG_PREC 64
#define WORK_PREC (TARG_PREC*2)
#include <boost/multiprecision/cpp_dec_float.hpp>
typedef boost::multiprecision::number<boost::multiprecision::cpp_dec_float<WORK_PREC> > dtype;
using boost::math::pow;
#include <boost/math/tools/remez.hpp>
boost::shared_ptr<boost::math::tools::remez_minimax<dtype> > p_remez;
dtype f(const dtype& x) {
static const dtype one(1), y(0.19029);
return one - pow(one - x, y);
}
void out(const char *descr, const dtype& x, const char *sep="") {
std::cout << descr << boost::math::tools::real_cast<double>(x) << sep << std::endl;
}
int main() {
dtype a(0), b(0.7); // range to optimise over
bool rel_error(false), pin(true);
int orderN(7), orderD(0), skew(0), brake(50);
int prec = 2 + (TARG_PREC * 3010LL)/10000;
std::cout << std::scientific << std::setprecision(prec);
p_remez.reset(new boost::math::tools::remez_minimax<dtype>(
&f, orderN, orderD, a, b, pin, rel_error, skew, WORK_PREC));
out("Max error in interpolated form: ", p_remez->max_error());
p_remez->set_brake(brake);
unsigned i, count(50);
for (i = 0; i < count; ++i) {
std::cout << "Stepping..." << std::endl;
dtype r = p_remez->iterate();
out("Maximum Deviation Found: ", p_remez->max_error());
out("Expected Error Term: ", p_remez->error_term());
out("Maximum Relative Change in Control Points: ", r);
}
boost::math::tools::polynomial<dtype> n = p_remez->numerator();
for(i = n.size(); i--; ) {
out("", n[i], ",");
}
}
Since all parts of boost that we use are header-only, simply build with:
c++ -O3 -I<path/to/boost/headers> minimax.cpp -o minimax
We finally get the coefficients, which are after multiplication by 44330:
24538.3409, -42811.1497, 34300.7501, -11284.1276, 4564.5847, 3186.7541, 8442.5236, 0.
The following error plot demonstrates that this is really the best possible degree-7 polynomial approximation, since all extrema are of equal magnitude (0.06659):
Should the requirements ever change (while still keeping well away from 0!), the C++ program above can be simply adapted to spit out the new optimal polynomial approximation.
Instead of a lookup table, I'd use a polynomial approximation:
1 - x0.19029 ≈ - 1073365.91783x15 + 8354695.40833x14 - 29422576.6529x13 + 61993794.537x12 - 87079891.4988x11 + 86005723.842x10 - 61389954.7459x9 + 32053170.1149x8 - 12253383.4372x7 + 3399819.97536x6 - 672003.142815x5 + 91817.6782072x4 - 8299.75873768x3 + 469.530204564x2 - 16.6572179869x + 0.722044145701
Or in code:
double f(double x) {
double fx;
fx = - 1073365.91783;
fx = fx*x + 8354695.40833;
fx = fx*x - 29422576.6529;
fx = fx*x + 61993794.537;
fx = fx*x - 87079891.4988;
fx = fx*x + 86005723.842;
fx = fx*x - 61389954.7459;
fx = fx*x + 32053170.1149;
fx = fx*x - 12253383.4372;
fx = fx*x + 3399819.97536;
fx = fx*x - 672003.142815;
fx = fx*x + 91817.6782072;
fx = fx*x - 8299.75873768;
fx = fx*x + 469.530204564;
fx = fx*x - 16.6572179869;
fx = fx*x + 0.722044145701;
return fx;
}
I computed this in sage using the least squares approach:
f(x) = 1-x^(19029/100000) # your function
d = 16 # number of terms, i.e. degree + 1
A = matrix(d, d, lambda r, c: integrate(x^r*x^c, (x, 0, 1)))
b = vector([integrate(x^r*f(x), (x, 0, 1)) for r in range(d)])
A.solve_right(b).change_ring(RDF)
Here is a plot of the error this will entail:
Blue is the error from my 16 term polynomial, while red is the error you'd get from piecewise linear interpolation with 16 equidistant values. As you can see, both errors are quite small for most parts of the range, but will become really huge close to x=0. I actually clipped the plot there. If you can somehow narrow the range of possible values, you could use that as the domain for the integration, and obtain an even better fit for the relevant range. At the cost of worse fit outside, of course. You could also increase the number of terms to obtain a closer fit, although that might also lead to higher oscillations.
I guess you can also combine this approach with the one Stefan posted: use his to split the domain into several parts, then use mine to find a close low degree polynomial for each part.
Update
Since you updated the specification of your question, with regard to both the domain and the error, here is a minimal solution to fit those requirements:
44330(1 - x0.19029) ≈ + 23024.9160933(1-x)7 - 39408.6473636(1-x)6 + 31379.9086193(1-x)5 - 10098.7031260(1-x)4 + 4339.44098317(1-x)3 + 3202.85705860(1-x)2 + 8442.42528906(1-x)
double f(double x) {
double fx, x1 = 1. - x;
fx = + 23024.9160933;
fx = fx*x1 - 39408.6473636;
fx = fx*x1 + 31379.9086193;
fx = fx*x1 - 10098.7031260;
fx = fx*x1 + 4339.44098317;
fx = fx*x1 + 3202.85705860;
fx = fx*x1 + 8442.42528906;
fx = fx*x1;
return fx;
}
I integrated x from 0.293 to 1 or equivalently 1 - x from 0 to 0.707 to keep the worst oscillations outside the relevant domain. I also omitted the constant term, to ensure an exact result at x=1. The maximal error for the range [0.3, 1] now occurs at x=0.3260 and amounts to 0.0972 < 0.1. Here is an error plot, which of course has bigger absolute errors than the one above due to the scale factor k=44330 which has been included here.
I can also state that the first three derivatives of the function will have constant sign over the range in question, so the function is monotonic, convex, and in general pretty well-behaved.
Not meant to answer the question, but it illustrates the Road Not To Go, and thus may be helpful:
This quick-and-dirty C code calculates pow(i, 0.19029) for 0.000 to 1.000 in steps of 0.01. The first half displays the error, in percents, when stored as 1/65536ths (as that theoretically provides slightly over 4 decimals of precision). The second half shows both interpolated and calculated values in steps of 0.001, and the difference between these two.
It kind of looks okay if you read from the bottom up, all 100s and 99.99s there, but about the first 20 values from 0.001 to 0.020 are worthless.
#include <stdio.h>
#include <math.h>
float powers[102];
int main (void)
{
int i, as_int;
double as_real, low, high, delta, approx, calcd, diff;
printf ("calculating and storing:\n");
for (i=0; i<=101; i++)
{
as_real = pow(i/100.0, 0.19029);
as_int = (int)round(65536*as_real);
powers[i] = as_real;
diff = 100*as_real/(as_int/65536.0);
printf ("%.5f %.5f %.5f ~ %.3f\n", i/100.0, as_real, as_int/65536.0, diff);
}
printf ("\n");
printf ("-- interpolating in 1/10ths:\n");
for (i=0; i<1000; i++)
{
as_real = i/1000.0;
low = powers[i/10];
high = powers[1+i/10];
delta = (high-low)/10.0;
approx = low + (i%10)*delta;
calcd = pow(as_real, 0.19029);
diff = 100.0*approx/calcd;
printf ("%.5f ~ %.5f = %.5f +/- %.5f%%\n", as_real, approx, calcd, diff);
}
return 0;
}
You can find a complete, correct standalone implementation of pow in fdlibm. It's about 200 lines of code, about half of which deal with special cases. If you remove the code that deals with special cases you're not interested in I doubt you'll have problems including it in your program.
LutzL's answer is a really good one: Calculate your power as (x^1.52232)^(1/8), computing the inner power by spline interpolation or another method. The eighth root deals with the pathological non-differentiable behavior near zero. I took the liberty of mocking up an implementation this way. The below, however, only does a linear interpolation to do x^1.52232, and you'd need to get the full coefficients using your favorite numerical mathematics tools. You'll adding scarcely 40 lines of code to get your needed power, plus however many knots you choose to use for your spline, as dicated by your required accuracy.
Don't be scared by the #include <math.h>; it's just for benchmarking the code.
#include <stdio.h>
#include <math.h>
double my_sqrt(double x) {
/* Newton's method for a square root. */
int i = 0;
double res = 1.0;
if (x > 0) {
for (i = 0; i < 10; i++) {
res = 0.5 * (res + x / res);
}
} else {
res = 0.0;
}
return res;
}
double my_152232(double x) {
/* Cubic spline interpolation for x ** 1.52232. */
int i = 0;
double res = 0.0;
/* coefs[i] will give the cubic polynomial coefficients between x =
i and x = i+1. Out of laziness, the below numbers give only a
linear interpolation. You'll need to do some work and research
to get the spline coefficients. */
double coefs[3][4] = {{0.0, 1.0, 0.0, 0.0},
{-0.872526, 1.872526, 0.0, 0.0},
{-2.032706, 2.452616, 0.0, 0.0}};
if ((x >= 0) && (x < 3.0)) {
i = (int) x;
/* Horner's method cubic. */
res = (((coefs[i][3] * x + coefs[i][2]) * x) + coefs[i][1] * x)
+ coefs[i][0];
} else if (x >= 3.0) {
/* Scaled x ** 1.5 once you go off the spline. */
res = 1.024824 * my_sqrt(x * x * x);
}
return res;
}
double my_019029(double x) {
return my_sqrt(my_sqrt(my_sqrt(my_152232(x))));
}
int main() {
int i;
double x = 0.0;
for (i = 0; i < 1000; i++) {
x = 1e-2 * i;
printf("%f %f %f \n", x, my_019029(x), pow(x, 0.19029));
}
return 0;
}
EDIT: If you're just interested in a small region like [0,1], even simpler is to peel off one sqrt(x) and compute x^1.02232, which is quite well behaved, using a Taylor series:
double my_152232(double x) {
double part_050000 = my_sqrt(x);
double part_102232 = 1.02232 * x + 0.0114091 * x * x - 3.718147e-3 * x * x * x;
return part_102232 * part_050000;
}
This gets you within 1% of the exact power for approximately [0.1,6], though getting the singularity exactly right is always a challenge. Even so, this three-term Taylor series gets you within 2.3% for x = 0.001.
I have a for loop which will run many times, and will cost a lot of time:
for (int z=0; z<temp; z++)
{
float findex= a + b * A[z];
int iindex = findex ;
outArray[z] += inArray[iindex] + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
a++;
}
I have optimized this code, but have no performance improvement! Maybe my SSE code is bad, can any one help me?
Try using the restrict keyword on inArray and outArray. Otherwise the compiler has to assume that inArray could be == outArray. In this case no parallelization would be possible.
Your loop has a loop carried dependency when you write to outArray[z]. Your CPU can do more than one floating point sum at once but with your current loop you only allows one sum of outArray[z]. To fix this you should unroll your loop.
for (int z=0; z<temp; z+=2) {
float findex_v1 = a + b * A[z];
int iindex_v1 = findex_v1;
outArray[z] += inArray[iindex_v1] + (findex_v1 - iindex_v1) * (inArray[iindex_v1+1] - inArray[iindex_v1]);
float findex_v2 = (a+1) + b * A[z+1];
int iindex_v2 = findex_v2;
outArray[z+1] += inArray[iindex_v2] + (findex_v2 - iindex_v2) * (inArray[iindex_v2+1] - inArray[iindex_v2]);
a+=2;
}
In terms of SIMD the problem is that you have to gather non-contiguous data when you access inArray[iindex_v1]. AVX2 has some gather instructions but I have not tried them. Otherwise it may be best to do the gather without SIMD. All the operations accessing z access contiguous memory so that part is easy. Psuedo-code (without unrolling) would look something like this
int indexa[4];
float inArraya[4];
float dinArraya[4];
int4 a4 = a + float4(0,1,2,3);
for (int z=0; z<temp; z+=4) {
//use SSE for contiguous memory
float4 findex4 = a4 + b * float4.load(&A[z]);
int4 iindex4 = truncate_to_int(findex4);
//don't use SSE for non-contiguous memory
iindex4.store(indexa);
for(int i=0; i<4; i++) {
inArraya[i] = inArray[indexa[i]];
dinArraya[i] = inArray[indexa[i+1]] - inArray[indexa[i]];
}
//loading from and array right after writing to it causes a CPU stall
float4 inArraya4 = float4.load(inArraya);
float4 dinArraya4 = float4.load(dinArraya);
//back to SSE
float4 outArray4 = float4.load(&outarray[z]);
outArray4 += inArray4 + (findex4 - iindex4)*dinArray4;
outArray4.store(&outArray[z]);
a4+=4;
}
This is a follow up to this question about getting GCC to optimize memcpy() in a loop; I've given up and decided to go the direct route of optimizing the loop manually.
I'm trying to stay as portable and maintainable as possible, though, so I'd like to get GCC to vectorize a simple optimized repeated copy-within-a-loop itself without resorting to SSE intrinsics. However, it seems to refuse doing so regardless of how much handholding I give it, despite the fact that the manually vectorized version (with the SSE2 MOVDQA instructions) is empirically up to 58% faster for small arrays (<32 elements) and at least 17% faster for larger ones (>=512).
Here's the version that isn't manually vectorized (with as many hints as I could think of to tell GCC to vectorize it):
__attribute__ ((noinline))
void take(double * out, double * in,
int stride_out_0, int stride_out_1,
int stride_in_0, int stride_in_1,
int * indexer, int n, int k)
{
int i, idx, j, l;
double * __restrict__ subout __attribute__ ((aligned (16)));
double * __restrict__ subin __attribute__ ((aligned (16)));
assert(stride_out_1 == 1);
assert(stride_out_1 == stride_in_1);
l = k - (k % 8);
for(i = 0; i < n; ++i) {
idx = indexer[i];
subout = &out[i * stride_out_0];
subin = &in[idx * stride_in_0];
for(j = 0; j < l; j += 8) {
subout[j+0] = subin[j+0];
subout[j+1] = subin[j+1];
subout[j+2] = subin[j+2];
subout[j+3] = subin[j+3];
subout[j+4] = subin[j+4];
subout[j+5] = subin[j+5];
subout[j+6] = subin[j+6];
subout[j+7] = subin[j+7];
}
for( ; j < k; ++j)
subout[j] = subin[j];
}
}
And here's my first attempt at manual vectorization, which I used for comparing performance (it could definitely be improved further, but I just wanted to test the most naive transformation possible):
__attribute__ ((noinline))
void take(double * out, double * in,
int stride_out_0, int stride_out_1,
int stride_in_0, int stride_in_1,
int * indexer, int n, int k)
{
int i, idx, j, l;
__m128i * __restrict__ subout1 __attribute__ ((aligned (16)));
__m128i * __restrict__ subin1 __attribute__ ((aligned (16)));
double * __restrict__ subout2 __attribute__ ((aligned (16)));
double * __restrict__ subin2 __attribute__ ((aligned (16)));
assert(stride_out_1 == 1);
assert(stride_out_1 == stride_in_1);
l = (k - (k % 8)) / 2;
for(i = 0; i < n; ++i) {
idx = indexer[i];
subout1 = (__m128i*)&out[i * stride_out_0];
subin1 = (__m128i*)&in[idx * stride_in_0];
for(j = 0; j < l; j += 4) {
subout1[j+0] = subin1[j+0];
subout1[j+1] = subin1[j+1];
subout1[j+2] = subin1[j+2];
subout1[j+3] = subin1[j+3];
}
j *= 2;
subout2 = &out[i * stride_out_0];
subin2 = &in[idx * stride_in_0];
for( ; j < k; ++j)
subout2[j] = subin2[j];
}
}
(The actual code is only slightly more complicated to handle some special cases, but not in the way that affects GCC vectorization, since even the versions given above don't vectorize either: my test harness can be found on LiveWorkspace)
I'm compiling the first version with the following command line:
gcc-4.7 -O3 -ftree-vectorizer-verbose=3 -march=pentium4m -fverbose-asm \
-msse -msse2 -msse3 take.c -DTAKE5 -S -o take5.s
And the resulting instructions used for the main copy loop are always FLDL/FSTPL pairs (i.e. copying in 8 byte units) rather than MOVDQA instructions, which result when I use SSE intrinsics manually.
The relevant output from tree-vectorize-verbose seems to be:
Analyzing loop at take.c:168
168: vect_model_store_cost: unaligned supported by hardware.
168: vect_model_store_cost: inside_cost = 8, outside_cost = 0 .
168: vect_model_load_cost: unaligned supported by hardware.
168: vect_model_load_cost: inside_cost = 8, outside_cost = 0 .
168: cost model: Adding cost of checks for loop versioning aliasing.
168: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
168: cost model: the vector iteration cost = 16 divided by the scalar iteration cost = 16 is greater or equal to the vectorization factor = 1.
168: not vectorized: vectorization not profitable.
I'm not sure why it's referring to "unaligned" stores and loads, and in any case the problem seem to be that the vectorization can't be proven to be profitable (even though empirically, it is for all cases that matter, and I'm not sure what cases where it wouldn't be).
Is there any simple flag or hint that I'm missing here, or does GCC just not want to do this no matter what?
I'm going to be embarrassed if this is something obvious, but hopefully this can help someone else, too, if it is.
All the __attribute__ ((aligned (16))) directives are achieving very little as they just define the alignment of the pointer variable itself, not the data that the pointer points to.
You probably need to look at __builtiin_assume_aligned.
I'm having several problems regarding OpenCL (total noob) but I think that if I manage to solve this one I will be able to solve some of the other. I have the following kernel that I want to store in a double array the a number calculated by the data of a struct. The argument that I pass to the kernel is a struct array and is initialised and the values are non zero (I tested it).
When executing the kernel though I get a "Floating point exception". If I got it right it means that the local_density variable is zero and the division causes an error. What I don't get is why it is zero since in the host the values of non-zero. Am I doing something wrong in the kernel?
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef struct
{
double speeds[9];
} t_speed;
__kernel void prepare(__global const t_speed* cells,
__global const int* obstacles,
__global double* results,
const unsigned int count)
{
int pos = get_global_id(0);
if(pos >= count) return;
if(obstacles[pos] == 1) results[pos] = 0.00;
else
{
double local_density = 0.00;
for(int kk = 0; kk < 9; kk++)
local_density += cells[pos].speeds[kk];
results[pos] = (cells[pos].speeds[1] + cells[pos].speeds[5] +
cells[pos].speeds[8] - (cells[pos].speeds[3] +
cells[pos].speeds[6] + cells[pos].speeds[7])) /
local_density;
}
}
Here is also the initialization of the variable that I pass as an argument. params->ny/nx have correct values.
cells = (t_speed*) malloc(sizeof(t_speed) * (params->ny * params->nx));
Also I quote the argument setting for the kernel for the cells variable.
m_cells = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(t_speed) * count, NULL, NULL);
err = clEnqueueWriteBuffer(commands, m_cells, CL_TRUE, 0, sizeof(t_speed) * count, cells, 0, NULL, NULL);
err |= clSetKernelArg(av_velocity_prepare_kernel, 0, sizeof(cl_mem), &m_cells);
------------------------------------------ EDIT ------------------------------------------
OK, what is really weird is that I'm getting the same error (Floating point exception) even with the very simple following kernel. Anyone has got a clue?
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void test(__global float* result, const unsigned int n)
{
int i = get_global_id(0);
if(i >= n) return;
result[i] += 1.0f;
}
I noticed that you are declaring your buffer as CL_MEM_READ_ONLY, yet your are writing to it inside the kernel. According to the OpenCL spec, this is undefined. Try using CL_MEM_READ_WRITE instead.
OK, so it was a completely different thing than I thought it was. The problem was that when I was calling
clEnqueueNDRangeKernel (command_queue, kernel, work_dim, *global_work_offset,
*global_work_size, *local_work_size, num_events_in_wait_list,
*event_wait_list, *event)
the global_work_size was not divisible by local_work_size. That caused the Floating point exception.