Determine the minimum across SIMD lanes of __m256 value

Determine the minimum across SIMD lanes of __m256 value - c

I understand that operations across SIMD lanes should generally be avoided.
However, sometimes it has to be done.
I am using AVX2 intrinsics, and have 8 floating point values in an __m256.
I want to know the lowest value in this vector, and to complicate matters: also in which slot this was.
My current solution makes a round trip to memory, which I don't like:
float closestvals[8];
_mm256_store_ps( closestvals, closest8 );
float closest = closestvals[0];
int closestidx = 0;
for ( int k=1; k<8; ++k )
{
if ( closestvals[k] < closest )
{
closest = closestvals[ k ];
closestidx = k;
}
}
What would be a good way to do this without going to/from memory?

You can try this:
#include <stdio.h>
#include <x86intrin.h>
#include <math.h>
/* gcc -O3 -Wall -m64 -march=haswell hor_min.c */
int print_vec_ps(__m256 x);
int main() {
float x[8]={1.2f, 3.6f, 2.1f, 9.4f, 4.0f, 0.1f, 8.9f, 3.3f};
/* Note that the results are not useful if one of the inputs is a 'not a number'. The input below leads to indx = 32 (!) */
// float x[8]={1.2f, 3.6f, 2.1f, NAN, 4.0f, 2.0f , 8.9f, 3.3f};
__m256 v0 = _mm256_load_ps(x); /* _mm256_shuffle_ps instead of _mm256_permute_ps is also possible, see Peter Cordes' comments */
__m256 v1 = _mm256_permute_ps(v0,0b10110001); /* swap floats: 0<->1, 2<->3, 4<->5, 6<->7 */
__m256 v2 = _mm256_min_ps(v0,v1);
__m256 v3 = _mm256_permute_ps(v2,0b01001110); /* swap floats */
__m256 v4 = _mm256_min_ps(v2,v3);
__m256 v5 = _mm256_castpd_ps(_mm256_permute4x64_pd(_mm256_castps_pd(v4),0b01001110)); /* swap 128-bit lanes */
__m256 v_min = _mm256_min_ps(v4,v5);
__m256 mask = _mm256_cmp_ps(v0,v_min,0);
int indx = _tzcnt_u32(_mm256_movemask_ps(mask));
printf(" 7 6 5 4 3 2 1 0\n");
printf("v0 = ");print_vec_ps(v0 );
printf("v1 = ");print_vec_ps(v1 );
printf("v2 = ");print_vec_ps(v2 );
printf("\nv3 = ");print_vec_ps(v3 );
printf("v4 = ");print_vec_ps(v4 );
printf("\nv5 = ");print_vec_ps(v5 );
printf("v_min = ");print_vec_ps(v_min );
printf("mask = ");print_vec_ps(mask );
printf("indx = ");printf("%d\n",indx);
return 0;
}
int print_vec_ps(__m256 x){
float v[8];
_mm256_storeu_ps(v,x);
printf("%5.2f %5.2f %5.2f %5.2f %5.2f %5.2f %5.2f %5.2f\n",
v[7],v[6],v[5],v[4],v[3],v[2],v[1],v[0]);
return 0;
}
Output:
./a.out
7 6 5 4 3 2 1 0
v0 = 3.30 8.90 0.10 4.00 9.40 2.10 3.60 1.20
v1 = 8.90 3.30 4.00 0.10 2.10 9.40 1.20 3.60
v2 = 3.30 3.30 0.10 0.10 2.10 2.10 1.20 1.20
v3 = 0.10 0.10 3.30 3.30 1.20 1.20 2.10 2.10
v4 = 0.10 0.10 0.10 0.10 1.20 1.20 1.20 1.20
v5 = 1.20 1.20 1.20 1.20 0.10 0.10 0.10 0.10
v_min = 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10
mask = 0.00 0.00 -nan 0.00 0.00 0.00 0.00 0.00
indx = 5
In the previous version of this answer, the 128-bit lanes were swapped with _mm256_permute2f128_ps.
In this updated answer _mm256_permute2f128_ps is replaced by _mm256_permute4x64_pd,
which is faster on AMD CPUs and on Intel KNL, see #Peter Cordes' comments.
But note that _mm256_permute4x64_pd requires AVX2, while AVX is sufficient for _mm256_permute2f128_ps.
Also note that the results of this code are useless if one of the input values is a 'not a number' (NAN).

Related

Poor C performance with both pthread and printf

I'm testing a c code for Linux with large arrays to measure thread performance, the application scales very well when threads are increased until max cores (8 for Intel 4770), but this is only for the pure math part of my code.
If I add the printf part for resulted arrays then the times becomes too large, from few seconds to several minutes even if redirected to a file, when printf those arrays should add just a few seconds.
The code:
(gcc 7.5.0-Ubuntu 18.04)
without printf loop:
gcc -O3 -m64 exp_multi.c -pthread -lm
with printf loop:
gcc -DPRINT_ARRAY -O3 -m64 exp_multi.c -pthread -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 8
static double xv[MAXSIZE];
static double yv[MAXSIZE];
/* gcc -O3 -m64 exp_multi.c -pthread -lm */
void* run(void *received_Val){
int single_val = *((int *) received_Val);
int r;
int i;
double p;
for (r = 0; r < REIT; r++) {
p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i++) {
xv[i]=p;
yv[i]=exp(p);
p += 0.00001;
}
}
return 0;
}
int main(){
int i;
pthread_t tid[num_threads];
for (i=0;i<num_threads;i++){
int *arg = malloc(sizeof(*arg));
if ( arg == NULL ) {
fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
exit(1);
}
*arg = i;
pthread_create(&(tid[i]), NULL, run, arg);
}
for(i=0; i<num_threads; i++)
{
pthread_join(tid[i], NULL);
}
#ifdef PRINT_ARRAY
for (i=0;i<MAXSIZE;i++){
printf("x=%.20lf, e^x=%.20lf\n",xv[i],yv[i]);
}
#endif
return 0;
}
malloc in pthread_create passing an integer as the last argument as suggested in this post.
I tried, without success, clang, add free(tid) instruction, avoid using malloc instruction, reverse loops, only 1 unidimensional array, 1 thread version without pthread.
EDIT2: I think the exp function is processor resource intensive, probably affected by the per core cache or SIMD resources implemented by the processor generation. The following sample code is based on a licensed code posted on Stack Overflow.
This code runs fast with or without the printf loop, and exp from math.h has been improved a few years ago, it can be around x40 faster, at least on the Intel 4770 (Haswell), this link is a known test code for math library vs SSE2, and now exp speed of math should be close to the AVX2 algorithm optimized for float and x8 parallel calculations.
Test results: expf vs other SSE2 algortihm (exp_ps):
sinf .. -> 55.5 millions of vector evaluations/second -> 12 cycles/value
cosf .. -> 57.3 millions of vector evaluations/second -> 11 cycles/value
sincos (x87) .. -> 9.1 millions of vector evaluations/second -> 71 cycles/value
expf .. -> 61.4 millions of vector evaluations/second -> 11 cycles/value
logf .. -> 55.6 millions of vector evaluations/second -> 12 cycles/value
cephes_sinf .. -> 52.5 millions of vector evaluations/second -> 12 cycles/value
cephes_cosf .. -> 41.9 millions of vector evaluations/second -> 15 cycles/value
cephes_expf .. -> 18.3 millions of vector evaluations/second -> 35 cycles/value
cephes_logf .. -> 20.2 millions of vector evaluations/second -> 32 cycles/value
sin_ps .. -> 54.1 millions of vector evaluations/second -> 12 cycles/value
cos_ps .. -> 54.8 millions of vector evaluations/second -> 12 cycles/value
sincos_ps .. -> 54.6 millions of vector evaluations/second -> 12 cycles/value
exp_ps .. -> 42.6 millions of vector evaluations/second -> 15 cycles/value
log_ps .. -> 41.0 millions of vector evaluations/second -> 16 cycles/value
/* Performance test exp(x) algorithm
based on AVX implementation of Giovanni Garberoglio
Copyright (C) 2020 Antonio R.
AVX implementation of exp:
Modified code from this source: https://github.com/reyoung/avx_mathfun
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/* gcc -O3 -m64 -Wall -mavx2 -march=haswell expc.c -lm */
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
__m256 exp256_ps(__m256 x) {
/*
To increase the compatibility across different compilers the original code is
converted to plain AVX2 intrinsics code without ingenious macro's,
gcc style alignment attributes etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/
__m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
__m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);
__m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);
__m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256 fx;
__m256i imm0;
__m256 one = _mm256_set1_ps(1.0f);
x = _mm256_min_ps(x, exp_hi);
x = _mm256_max_ps(x, exp_lo);
/* express exp(x) as exp(g + n*log(2)) */
fx = _mm256_mul_ps(x, cephes_LOG2EF);
fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
x = _mm256_sub_ps(x, z);
z = _mm256_mul_ps(x,x);
__m256 y = cephes_exp_p0;
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p1);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p2);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p3);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p4);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p5);
y = _mm256_mul_ps(y, z);
y = _mm256_add_ps(y, x);
y = _mm256_add_ps(y, one);
/* build 2^n */
imm0 = _mm256_cvttps_epi32(fx);
imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
imm0 = _mm256_slli_epi32(imm0, 23);
__m256 pow2n = _mm256_castsi256_ps(imm0);
y = _mm256_mul_ps(y, pow2n);
return y;
}
int main(){
int r;
int i;
float p;
static float xv[MAXSIZE];
static float yv[MAXSIZE];
float *xp;
float *yp;
for (r = 0; r < REIT; r++) {
p = XXX;
xp = xv;
yp = yv;
for (i = 0; i < MAXSIZE; i += 8) {
__m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
__m256 y = exp256_ps(x);
_mm256_store_ps(xp,x);
_mm256_store_ps(yp,y);
xp += 8;
yp += 8;
p += 0.00008;
}
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
For comparison, this is the code example with exp (x) from the math library, single thread and float.
#include <stdio.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
/* gcc -O3 -m64 exp_st.c -lm */
int main(){
int r;
int i;
float p;
static float xv[MAXSIZE];
static float yv[MAXSIZE];
for (r = 0; r < REIT; r++) {
p = XXX;
for (i = 0; i < MAXSIZE; i++) {
xv[i]=p;
yv[i]=expf(p);
p += 0.00001;
}
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
SOLUTION: As Andreas Wenzel said, the gcc compiler is smart enough and it decides that it is not necessary to actually write the results to the array, these writes are optimized away by the compiler. After new performance tests I made based on new information, or before I committed several mistakes or I assumed wrong facts, it seems the results are clearer: exp (double arg), or expf( float arg) which is x2+ exp(double arg), have been improved but it is not as a fast AVX2 algorithm (x8 parallel float arg), which is around x6 faster than SSE2 algorithm (x4 parallel float arg). Here are some results, as expected for Intel Hyper-Threading CPUs, excepts for SSE2 algorithm:
exp (double arg) single thread: 18 min 46 sec
exp (double arg) 4 threads: 5 min 4 sec
exp (double arg) 8 threads: 4 min 28 sec
expf (float arg) single thread: 7 min 32 sec
expf (float arg) 4 threads: 1 min 58 sec
expf (float arg) 8 threads: 1 min 41 sec
Relative error**:
i x y = expf(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059411496e-02 exp_dbl = 8.059411715e-02 rel_err =-2.716802480e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.867173195e-02 y = 9.620664716e-01 exp_dbl = 9.620664730e-01 rel_err =-1.481617428e-09
i = 620000 x = 1.201261759e+00 y = 3.324308872e+00 exp_dbl = 3.324308753e+00 rel_err = 3.571995830e-08
i = 744000 x = 2.441616058e+00 y = 1.149159718e+01 exp_dbl = 1.149159684e+01 rel_err = 2.955980805e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
*SSE2 algorithm by Julien Pommier, x6,8 speed increase from one thread to 8 threads. My performance test code uses aligned(16) for typedef union of vector/4 float array passed to the library, instead of aligned whole float array. This may causes lower performance, at least for other AVX2 code, its performance improvement with multithread too seems good for Intel Hyper-Threading but at a lower speed, time increased between x2.5-x1.5. Maybe SSE2 code could be sped up with better array alignment which I couldn't improve:
exp_ps (x4 parallel float arg) single thread: 12 min 7 sec
exp_ps (x4 parallel float arg) 4 threads: 3 min 10 sec
exp_ps (x4 parallel float arg) 8 threads: 1 min 46 sec
Relative error**:
i x y = exp_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059412241e-02 exp_dbl = 8.059411715e-02 rel_err = 6.527768787e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.977407143e-02 y = 9.610065222e-01 exp_dbl = 9.610065335e-01 rel_err =-1.174323454e-08
i = 620000 x = 1.200158238e+00 y = 3.320642233e+00 exp_dbl = 3.320642334e+00 rel_err =-3.054731957e-08
i = 744000 x = 2.441616058e+00 y = 1.149159622e+01 exp_dbl = 1.149159684e+01 rel_err =-5.342903415e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
AVX2 algorithm (x8 parallel float arg) single thread: 1 min 45 sec
AVX2 algorithm (x8 parallel float arg) 4 threads: 28 sec
AVX2 algorithm (x8 parallel float arg) 8 threads: 27 sec
Relative error**:
i x y = exp256_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.516632080e+00 y = 8.073104918e-02 exp_dbl = 8.073104510e-02 rel_err = 5.057888540e-08
i = 372000 x =-1.279417157e+00 y = 2.781994045e-01 exp_dbl = 2.781993997e-01 rel_err = 1.705288467e-08
i = 496000 x =-3.954863176e-02 y = 9.612231851e-01 exp_dbl = 9.612232069e-01 rel_err =-2.269774967e-08
i = 620000 x = 1.199879169e+00 y = 3.319715738e+00 exp_dbl = 3.319715775e+00 rel_err =-1.119642824e-08
i = 744000 x = 2.440370798e+00 y = 1.147729492e+01 exp_dbl = 1.147729571e+01 rel_err =-6.896860199e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.923286438e+00 y = 1.374535980e+02 exp_dbl = 1.374536045e+02 rel_err =-4.676466368e-08
**The relative errors are the same, since the codes with SSE2 and AVX2 use identical algorithms, and it is more than likely that it is also that of the library function exp(x).
Source code AVX2 algorithm multithread
/* Performance test of a multithreaded exp(x) algorithm
based on AVX implementation of Giovanni Garberoglio
Copyright (C) 2020 Antonio R.
AVX implementation of exp:
Modified code from this source: https://github.com/reyoung/avx_mathfun
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/* gcc -O3 -m64 -mavx2 -march=haswell expc_multi.c -pthread -lm */
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 4
typedef float FLOAT32[MAXSIZE] __attribute__((aligned(4)));
static FLOAT32 xv;
static FLOAT32 yv;
__m256 exp256_ps(__m256 x) {
/*
To increase the compatibility across different compilers the original code is
converted to plain AVX2 intrinsics code without ingenious macro's,
gcc style alignment attributes etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/
__m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
__m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);
__m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);
__m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256 fx;
__m256i imm0;
__m256 one = _mm256_set1_ps(1.0f);
x = _mm256_min_ps(x, exp_hi);
x = _mm256_max_ps(x, exp_lo);
/* express exp(x) as exp(g + n*log(2)) */
fx = _mm256_mul_ps(x, cephes_LOG2EF);
fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
x = _mm256_sub_ps(x, z);
z = _mm256_mul_ps(x,x);
__m256 y = cephes_exp_p0;
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p1);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p2);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p3);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p4);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p5);
y = _mm256_mul_ps(y, z);
y = _mm256_add_ps(y, x);
y = _mm256_add_ps(y, one);
/* build 2^n */
imm0 = _mm256_cvttps_epi32(fx);
imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
imm0 = _mm256_slli_epi32(imm0, 23);
__m256 pow2n = _mm256_castsi256_ps(imm0);
y = _mm256_mul_ps(y, pow2n);
return y;
}
void* run(void *received_Val){
int single_val = *((int *) received_Val);
int r;
int i;
float p;
float *xp;
float *yp;
for (r = 0; r < REIT; r++) {
p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
xp = xv + single_val*MAXSIZE/num_threads;
yp = yv + single_val*MAXSIZE/num_threads;
for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i += 8) {
__m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
__m256 y = exp256_ps(x);
_mm256_store_ps(xp,x);
_mm256_store_ps(yp,y);
xp += 8;
yp += 8;
p += 0.00008;
}
}
return 0;
}
int main(){
int i;
pthread_t tid[num_threads];
for (i=0;i<num_threads;i++){
int *arg = malloc(sizeof(*arg));
if ( arg == NULL ) {
fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
exit(1);
}
*arg = i;
pthread_create(&(tid[i]), NULL, run, arg);
}
for(i=0; i<num_threads; i++)
{
pthread_join(tid[i], NULL);
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
Charts overview:
exp (double arg) without printf loop, not real performance, as Andreas Wenzel found, gcc doesn't calculate exp(x) when results are not going to be printf, even float version is much slower because its different few assembly instructions. Although graph may be useful for some assembly algorithm that only uses low-level CPU cache/registers.
expf (float arg) real performance or with printf loop
AVX2 algorithm, the best performance.

When you don't print the arrays at the end of the program, the gcc compiler is smart enough to realize that the results of the calculations have no observable effects. Therefore, the compiler decides that it is not necessary to actually write the results to the array, because these results are never used. Instead, these writes are optimized away by the compiler.
Also, when you don't print the results, the library function exp has no observable effects, provided it is not called with input that is so high that it would cause a floating-point overflow (which would cause the function to raise a floating point exception). This allows the compiler to optimize these function calls away, too.
As you can see in the assembly instructions emitted by the gcc compiler for your code which does not print the results, the compiled program doesn't unconditionally call the function exp, but instead tests if the input to the function exp would be higher than 7.09e2 (to ensure that no overflow will occur). Only if an overflow would occur, will the program jump to the code which calls the function exp. Here is the relevant assembly code:
ucomisd xmm1, xmm3
jnb .L9
In the above assembly code, the CPU register xmm3 contains the double-precision floating-point value 7.09e2. If the input is higher than this constant, the function exp would cause a floating-point overflow, because the result cannot be represented in a double-precision floating-point value.
Since your input is always valid and low enough to not cause a floating-point overflow, your program will never perform this jump, so it will never actually call the function exp.
This explains why your code is so much faster when you do not print the results. If you do not print the results, your compiler will determine that the calculations have no observable effects, so it will optimize them away.
Therefore, if you want the compiler to actually perform the calculations, you must ensure that the calculations have some observable effect. This does not mean that you have to actually print all results (which are several megabytes large). It is sufficient if you just print one line which is dependant on all the results (for example the sum of all results).
However, if you replace the function call to the library function exp with a call to some other custom function, then, at least in my tests, the compiler is not smart enough to realize that the function call has no observable effects. In that case, it is unable to optimize the function call away, even if you do not print the results of the calculations.
For the reasons stated above, if you want to compare the performance of both functions, you must ensure that the calculations actually take place, by making sure that the results have an observable effect. Otherwise, you run the risk of the compiler optimizing away at least some of the calculations, and the comparison would not be fair.

I don't think this has much to with pthread because your code only appears to call printf after the threads are joined. Instead, the poor performance is likely due to cache misses by needing to read from the xv and yv arrays in every iteration of the print loop.

C format code for positive and negative float numbers

How can I force printf format codes like %f to produce aligned padded output for both positive and negative numbers. The following program is a minimal yet complete example of my problem:
#include<stdio.h>
#include<stdlib.h>
int main(int argc, char *argv[])
{
int step = 1;
float score = 0.1;
float phi = 0.1;
float rho = 0.2;
printf(" # score phi rho\n");
printf("%2d %2.3f %1.2f %1.2f\n", step, score, phi, rho);
step++;
score++;
phi++;
rho++;
printf("%2d %2.3f %1.2f %1.2f\n", step, score, phi, rho);
step++;
score -= 1.2;
phi++;
rho++;
printf("%2d %2.3f %1.2f %1.2f\n", step, score, phi, rho);
return 0;
}
and here is the result:
# score phi rho
1 0.100 0.10 0.20
2 1.100 1.10 1.20
3 -0.100 2.10 2.20
I wanted to have an output like
# score phi rho
1 +0.100 0.10 0.20
2 +1.100 1.10 1.20
3 -0.100 2.10 2.20
or
# score phi rho
1 0.100 0.10 0.20
2 1.100 1.10 1.20
3 -0.10 2.10 2.20
My current bad solution is an if-else statement conditioned on the sign of score variable to choose from two different printfs for plus and minus signs.

From the format specification syntax, the printf conversion specification follows
%[flags][width][.precision][size]type
So, if we set the flag directive to +, sign (+ or -) appears as a prefix of the output value. For example,
printf("%+5.2f\n%+5.2f\n", -19.86, 19.86); // prints -19.86 \newline +19.86

How do i turn this into a decrement loop?

float batt = ((float)rand()/(float)(RAND_MAX)) * 5.0;
demo_printf("Battery:%.2f\n", batt);
From the line above , its doing a random numbers for batt. would like to ask how do i do about it to make it as a loop decreasing like very bit by bit . maybe per 0.1.
I cannot for loop the demoprint. i cannot . i can only loop elswhere

Huh?
This question is very strange. Do you mean something like this:
float batt = 5.f * rand() / RAND_MAX;
while(batt > 0.f)
{
printf("Battery is %.1f\n", batt);
batt -= 0.1f;
}
Sample output:
Battery is 4.2
Battery is 4.1
Battery is 4.0
Battery is 3.9
Battery is 3.8
Battery is 3.7
Battery is 3.6
Battery is 3.5
Battery is 3.4
[ ... more like this ...]
Battery is 1.0
Battery is 0.9
Battery is 0.8
Battery is 0.7
Battery is 0.6
Battery is 0.5
Battery is 0.4
Battery is 0.3
Battery is 0.2
Battery is 0.1
Battery is 0.0

// obtain initial random float
float batt = ((float)rand()/(float)(RAND_MAX)) * 5.0;
// iterate beginning with this random float downwards by steps of 0.1f
for (float x = batt; x > 0; x -= 0.1) {
demo_printf("Battery:%.2f\n", x);
}

Use 2 variables, target and current:
float current = 5.0f;
float target = 2.5f;
while(current > target) {
// print here
current -= 0.1f;
}
current = target;
// final print here
Just be aware that conversion of float literal constant 0.1f to binary float will never be perfectly accurate, and the decrement will not be exactly 0.1. But fixing that is for another question.

Calling with Arguments versus using Globals in C

I have a decent understanding of x86 assembly and i know that when a function is called all the arguments are pushed onto the stack.
I have a function which basically loops through a 8 by 8 array and calls some functions based on the values in the array. Each of these function calls involves 6-10 arguments being passed. This program takes a very long time to run, it is a Chess AI, but this function takes 20% of the running time.
So i guess my question is, what can i do to give my functions access to the variables they need in a faster way?
int row,col,i;
determineCheckValidations(eval_check, b, turn);
int * eval_check_p = &(eval_check[0][0]);
for(row = 0; row < 8; row++){
for(col = 0; col < 8; col++, eval_check_p++){
if (b->colors[row][col] == turn){
int type = b->types[row][col];
if (type == PAWN)
findPawnMoves(b,moves_found,turn,row,col,last_move,*eval_check_p);
else if (type == KNIGHT)
findMappedNoIters(b,moves_found,turn,row,col,*move_map_knight, 8, *eval_check_p);
else if (type == BISHOP)
findMappedIters(b,moves_found,turn,row,col,*move_map_bishop, 4, *eval_check_p);
else if (type == ROOK)
findMappedIters(b,moves_found,turn,row,col,*move_map_rook, 4, *eval_check_p);
else if (type == QUEEN)
findMappedIters(b,moves_found,turn,row,col,*move_map_queen, 8, *eval_check_p);
else if (type == KING){
findMappedNoIters(b,moves_found,turn,row,col,*move_map_king, 8, *eval_check_p);
findCastles(b,moves_found,turn,row,col);
}
}
}
}
all the code can be found # https://github.com/AndyGrant/JChess/tree/master/_Core/_Scripts
A sample of the profile:
% cumulative self self total
time seconds seconds calls s/call s/call name
20.00 1.55 1.55 2071328 0.00 0.00 findAllValidMoves
14.84 2.70 1.15 10418354 0.00 0.00 checkMove
10.06 3.48 0.78 1669701 0.00 0.00 encodeBoard
7.23 4.04 0.56 10132526 0.00 0.00 findMappedIters
6.84 4.57 0.53 1669701 0.00 0.00 getElement
6.71 5.09 0.52 68112169 0.00 0.00 createNormalMove

You have performed good work on profiling. You need to take the function with the worst case and profile it in more detail.
You may want to try different compiler optimization settings when you profile.
Try some common optimization techniques, such as loop unrolling and factoring out invariants from loops.
You may get some improvements by designing your functions with the processor's data cache in mind. Search the web for "optimizing data cache".
If the function works correctly, I recommend posting to CodeReview#StackExchange.com.
Don't assume anything.

float vs double performance testing on beagle bone black

I am doing some image processing on my Beaglebone Black and am interested in the performance gain of using floats vs doubles in my algorithm.
I've tried to devise a simple test for this:
main.c
#define MAX_TEST 10
#define MAX_ITER 1E7
#define DELTA 1E-8
void float_test()
{
float n = 0.0;
for (int i=0; i<MAX_ITER; i++)
{
n += DELTA;
n /= 3.0;
}
}
void double_test()
{
double n = 0.0;
for (int i=0; i<MAX_ITER; i++)
{
n += DELTA;
n /= 3.0;
}
}
int main()
{
for (int i=0; i<MAX_TEST; i++)
{
double_test();
float_test();
}
return 0;
}
ran as:
gcc -Wall -pg main.c -std=c99
./a.out
gprof a.out gmon.out -q > profile.txt
profile.txt:
granularity: each sample hit covers 4 byte(s) for 0.03% of 35.31 seconds
index % time self children called name
<spontaneous>
[1] 100.0 0.00 35.31 main [1]
18.74 0.00 10/10 float_test [2]
16.57 0.00 10/10 double_test [3]
-----------------------------------------------
18.74 0.00 10/10 main [1]
[2] 53.1 18.74 0.00 10 float_test [2]
-----------------------------------------------
16.57 0.00 10/10 main [1]
[3] 46.9 16.57 0.00 10 double_test [3]
-----------------------------------------------
I am not sure if the compiler is optimizing away some of my code or if I am doing enough arithmetic for it to matter. I find it a bit odd that the double_test() is actually taking less time than the float_test().
I've tried switching the order in which the functions are called and they results are still the same. Could somebody explain this to me?

On my machine (x86_64), looking at the code generated, side by side:
double_test: .. float_test:
xorpd %xmm0,%xmm0 // double n -- xorps %xmm0,%xmm0 // float n
xor %eax,%eax // int i == xor %eax,%eax
loop: .. loop:
++ unpcklps %xmm0,%xmm0 // Extend float n to...
++ cvtps2pd %xmm0,%xmm0 // ...double n
add $0x1,%eax // ++i == add $0x1,%eax
addsd %xmm2,%xmm0 // double n += DELTA == addsd %xmm2,%xmm0
cvtsi2sd %eax,%xmm3 // (double)i == cvtsi2sd %eax,%xmm3
++ unpcklpd %xmm0,%xmm0 // Reduce double n to...
++ cvtpd2ps %xmm0,%xmm0 // ...float n
divsd %xmm5,%xmm0 // double n /= 3.0 -- divss %xmm4,%xmm0 // float n / 3.0
ucomisd %xmm3,%xmm1 // (double)i cmp 1E7 == ucomisd %xmm3,%xmm1
ja ...loop... // if (double)i < 1E7 == ja ...loop...
showing four extra instructions to change up to double and back down to float in order to add DELTA.
The DELTA is 1E-8 which is implictly double. So, adding that is done double. Of course, 3.0 is also implictly double, but I guess the compiler spots that there is no effective difference between double and single in this case.
Defining DELTAF as 1E-8f gets rid of the change up to and down from double for the add.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Determine the minimum across SIMD lanes of __m256 value - c

Related

Poor C performance with both pthread and printf

C format code for positive and negative float numbers

How do i turn this into a decrement loop?

Calling with Arguments versus using Globals in C

float vs double performance testing on beagle bone black

Categories

Resources