Why is SSE vectorization not faster in this case? [duplicate]


This question already has answers here:
Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled
(3 answers)
How to optimize these loops (with compiler optimization disabled)?
(3 answers)
Closed 3 years ago.
My SSE code is just as slow as the standard C version. What am I doing wrong?
I'm running on an Intel i3-6100 CPU, using C with MinGW and CLion, with the -O0 flag.
I'm measuring performance with the clock() function; both versions are equally fast to within about 45 ticks out of over 1000 (SSE: 1138 ticks, C: 1093 ticks).
I thought that SSE might somehow interfere with the clock() time measurement, but even when simply counting seconds there is no difference.
The function (swap the comments to switch versions):
void vTrace(struct Ray * ray, float t, struct Vec3f * r){
//__m128 * mr = (__m128 *)r;
//__m128 mt_m = _mm_set1_ps(t);
//*mr = _mm_add_ps(*(__m128*)&ray->o, _mm_mul_ps(*(__m128*)&ray->d, mt_m));
r->x = ray->o.x + ray->d.x*t;
r->y = ray->o.y + ray->d.y*t;
r->z = ray->o.z + ray->d.z*t;
}
the benchmarking code:
float benchmark_t = 1;
struct Ray benchmark_ray;
vInit3f(&benchmark_ray.o, 0.2, 0.23, 1.4);
vInit3f(&benchmark_ray.d, 0.2, 0.23, 1.4);
ticks = clock();
i = 0;
while(i < 1000000000 ){
vTrace(&benchmark_ray, benchmark_t, &benchmark_ray.o);
i ++;
}
printf("TIME : %ld ticks\n", (long)(clock()-ticks)); /* clock_t is not guaranteed to be int */
printVec("result", benchmark_ray.o);
the structures :
struct Vec3f{
float x;
float y;
float z;
float w;//just for SSE
};
struct Ray{
struct Vec3f o;
struct Vec3f d;
struct Vec3f inverse_d;
};
Using SSE, the code should be about 4 times as fast. Why is there no performance gain?

The code was somehow auto-vectorized; I don't know why, but it was.
So there was no significant performance difference.
(Next time, look at the assembly code first.)

Related

How do I obtain the theoretical/Linpack FLOPS performance on basic vector/matrix operations?

The question is simple. How do I further optimize my code, given that the basic matrix operations are critical and common in my calculation? BLAS and LAPACK operations are good for linear algebra, but neither of them provides basic element-by-element addition/multiplication operations (Hadamard). Theoretical performance may be difficult, but Linpack performance, or 60~80% of Linpack performance, should be achievable. (I can only reach 12%; if I use multiply-add, only 25%.)
For references
Theoretical performance: 8259u has 4 cores * 3.8GHz * 16 FLOPS = 240 GFlops
Linpack performance: 8259u can run as fast as 140~160 GFlops double precision operations.
Platform: Macbook Pro 2018, Monterey
CPU: i5-8259u, 4c8t
RAM: 8GB
CC: gcc 11.3.0
CFLAGS: -mavx2 -mfma -fopenmp -O3
Here's my attempt
the flops are calculated as follows:
double time = stop - start;
double ops = 1.0 * Nx * Ny * iterNum; //2.0 for complex numbers
double flops = ops / time;
double gFlops = flops / 1E9;
Here are some results from running my code. The real and complex results are almost the same, so only the real results are shown (roughly):
//Nx = Ny = 2048, iterNum = 10000
//Typical matrix size and iteration depth for my calculation
threads = 1: 1 GFlops
threads = 2: 2 GFlops
threads = 4: 3 GFlops
threads = 8: 4 GFlops
threads = 16: 9 GFlops
threads = 32: 11 GFlops
threads = 64: 15 GFlops
threads = 128: 18 GFlops
threads = 256: 19 GFlops
threads = 512: 21 GFlops
threads = 1024: 20 GFlops
threads = 2048: 40 GFlops // wrong answer
For convenience with large matrices on the heap and for integration with MathGL, the matrix is flattened into a vector of Nx * Ny elements stored row by row.
// for real numbers
x = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
y = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
z = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
sum = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
// for complex numbers
x = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
y = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
z = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
sum = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
The addition is done in parallel using OpenMP.
double start = omp_get_wtime();
#pragma omp parallel private(shift)
{
for (int tds = omp_get_thread_num(); tds < threads; tds = tds + threads)
{
shift = Nx * Ny / threads * tds;
for (int i = 0; i < iterNum; i++)
{
AddComplex(sum+shift, sum+shift, z+shift, Nx/threads, Ny);
}
}
}
double stop = omp_get_wtime();
I wrote explicit vectorization code using AVX intrinsics "immintrin.h".
//real matrix addition
void AddReal(double *summation, const double *summand, const double *addend, int Nx, int Ny)
{
int nBlock = Nx * Ny / realPackSize;
int nRem = Nx * Ny % realPackSize;
register __m256d packSummand, packAddend, packSum;
const double *px = summand;
const double *py = addend;
double *pSum = summation;
for (int i = 0; i < nBlock; i++)
{
packSummand = _mm256_load_pd(px);
packAddend = _mm256_load_pd(py);
packSum = _mm256_add_pd(packSummand, packAddend);
_mm256_store_pd(pSum, packSum);
px = px + realPackSize;
py = py + realPackSize;
pSum = pSum + realPackSize;
}
for (int i = 0; i < nRem; i++)
{
pSum[i] = px[i] + py[i];
}
px = NULL;
py = NULL;
pSum = NULL;
return;
}
//Complex matrix addition
void AddComplex(double complex *summation, const double complex *summand, const double complex *addend, int Nx, int Ny)
{
int nBlock = Nx * Ny / complexPackSize;
int nRem = Nx * Ny % complexPackSize;
register __m256d packSummand, packAddend, packSum;
const double complex *px = summand;
const double complex *py = addend;
double complex *pSum = summation;
for (int i = 0; i < nBlock; i++)
{
packSummand = _mm256_load_pd(px);
packAddend = _mm256_load_pd(py);
packSum = _mm256_add_pd(packSummand, packAddend);
_mm256_store_pd(pSum, packSum);
px = px + complexPackSize;
py = py + complexPackSize;
pSum = pSum + complexPackSize;
}
for (int i = 0; i < nRem; i++)
{
pSum[i] = px[i] + py[i];
}
px = NULL;
py = NULL;
pSum = NULL;
return;
}

Level 1 (e.g. dot product) and level 2 (e.g. matrix-vector multiplication) BLAS functions are known not to scale (especially level 1 BLAS functions), as opposed to level 3 (e.g. matrix-matrix multiplication). Indeed, they are generally memory-bound: the amount of data read/written is O(n), while the number of floating-point operations is also O(n). This is not the case for level 3 BLAS functions, which are generally clearly compute-bound.
Theoretical performance may be difficult, but Linpack performance, or 60~80% of Linpack performance, should be achievable
If the computation is memory-bound, then no, this is not possible. Linpack is generally clearly compute-bound on nearly all machines. The thing is, memory is slow, and the speed of RAM has not increased as fast as the speed of processors over the last decades. This is known as the memory wall (formulated a few decades ago and still true today).
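The memory-bound argument can be turned into a rough upper bound. The sketch below is only an illustration: memory_bound_gflops is a hypothetical helper, and the bandwidth figure passed to it is an assumption that has to be measured on the actual machine. For an in-place add such as sum[i] = sum[i] + z[i] on doubles, each addition moves 24 bytes (two 8-byte loads and one 8-byte store).

```c
#include <assert.h>

/* Upper bound on GFLOP/s for a streaming kernel limited by memory bandwidth.
   bandwidth_gb_s: sustained memory bandwidth in GB/s (assumed, machine-specific)
   bytes_per_flop: bytes moved to or from RAM per floating-point operation */
double memory_bound_gflops(double bandwidth_gb_s, double bytes_per_flop)
{
    assert(bytes_per_flop > 0.0);
    return bandwidth_gb_s / bytes_per_flop;
}
```

With a hypothetical 30 GB/s of sustained bandwidth, memory_bound_gflops(30.0, 24.0) gives 1.25 GFLOP/s for arrays that do not fit in cache, regardless of how many cores or SIMD lanes are used.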
Here's some results when I run my code.
Getting a faster computation with 1024 threads than with 512 on a mobile processor with 4 cores and 8 hardware threads makes me think that there is a huge problem somewhere. The maximum should be reached with 8 threads; otherwise, the computation is clearly inefficient. Indeed, running more threads than hardware threads causes the OS scheduler to perform expensive context switches (higher overhead). In the end, your processor never runs more than 8 tasks at a time. There are three possibilities:
The timings are not correct (though the provided timing code seems fine to me)
The program is buggy
The computation exhibits a super-linear speed-up (possibly due to caching)
I wrote explicit vectorization code using AVX intrinsics "immintrin.h".
The hot loop contains 2 loads, 1 store, 1 add and a few instructions incrementing integers. Your processor can do 2 loads and 1 store per cycle, so the SIMD part can run at a throughput of one iteration per cycle (though the latency can be much higher), assuming nBlock is large enough.
Your processor can do 2 adds per cycle, so half the add throughput is lost. However, you cannot write something faster than that if the loads and stores are mandatory.
If complexPackSize is smaller than a SIMD register, then I think the processor has to do complex work due to the overlap with the previous iteration, which will certainly make it run the loop much less efficiently (a loop-carried dependency makes the loop latency-bound, which is very inefficient here). If complexPackSize is much larger than a cache line, then prefetching will likely be an issue.
Your processor cannot execute too many instructions at the same time. The pointer increments and the loop check cause 5 instructions to be executed, which consume at least 1 cycle. This reduces the throughput by a factor of 2 again, so no more than 25% of the theoretical performance can be reached. This can be improved a bit by unrolling the loop. Unrolling might also help because the _mm256_add_pd instruction has a fairly high latency. Keep in mind that SIMD instructions are great for throughput but not for latency. Thus, when latency is not an issue, SIMD code should be fast.
Note that the write-allocate cache policy causes data to be read when _mm256_store_pd is used, increasing the amount of data transferred from RAM and reducing the observed throughput. _mm256_stream_pd can be used to avoid this effect, but it is fast only if the data is not read right afterwards or does not fit in the cache anyway. It also requires the data to be aligned. In fact, _mm256_store_pd also requires alignment, and if the data is not aligned, it will certainly cause a bug. The same applies to _mm256_load_pd: _mm256_loadu_pd should be used instead for unaligned data. I am not sure the data being read is always aligned. It should be fine if complexPackSize is a power of two divisible by 32, and likewise for shift. However, I highly doubt this is the case for shift, especially with a large number of threads. I also find it very suspicious to use a constant complexPackSize when the SIMD registers have a fixed size. Did you check the results in all cases?
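One way to address the alignment doubt about shift is to split the array in units of whole SIMD registers. The following is a minimal sketch, not the asker's code: aligned_chunk and PACK are hypothetical names, and it assumes the base pointer is 32-byte aligned (as returned by _mm_malloc(..., 32)) and that elements are doubles. Every non-empty chunk then starts at a multiple of 4 doubles, i.e. at a 32-byte boundary.

```c
#include <assert.h>
#include <stddef.h>

#define PACK 4  /* doubles per 256-bit AVX register */

/* Compute the half-open element range [*begin, *end) handled by chunk `tid`
   out of `chunks`, so that every non-empty chunk starts on a multiple of PACK. */
void aligned_chunk(size_t n, size_t chunks, size_t tid, size_t *begin, size_t *end)
{
    assert(chunks > 0 && tid < chunks);
    size_t blocks = (n + PACK - 1) / PACK;  /* PACK-sized blocks; the last may be partial */
    size_t base = blocks / chunks;
    size_t extra = blocks % chunks;         /* the first `extra` chunks get one more block */
    size_t first = tid * base + (tid < extra ? tid : extra);
    size_t last = first + base + (tid < extra ? 1 : 0);
    *begin = first * PACK < n ? first * PACK : n;  /* clamp so empty chunks are [n, n) */
    *end = last * PACK < n ? last * PACK : n;
}
```

Only the chunk that contains the end of the array can have a length that is not a multiple of PACK, so the scalar remainder loop is needed in at most one chunk.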

Optimizing n-body simulation

I'm trying to optimize the n-body algorithm, I have seen that the most expensive function is this:
real3 bodyBodyInteraction(real iPosx, real iPosy, real iPosz,
real jPosx, real jPosy, real jPosz, real jMass)
{
real rx, ry, rz;
rx = jPosx - iPosx;
ry = jPosy - iPosy;
rz = jPosz - iPosz;
real distSqr = rx*rx+ry*ry+rz*rz;
distSqr += SOFTENING_SQUARED;
real s = jMass / POW(distSqr,3.0/2.0); //very expensive
real3 f;
f.x = rx * s;
f.y = ry * s;
f.z = rz * s;
return f;
}
Using perf record, I can see that the division is the most expensive instruction, and this function is called O(n^2) times, but I don't really know how to optimize it.

Convert
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
into
for(int i=0; i<N;i++)
for(int j=i+1;j<N;j++)
Restructure to take advantage of SIMD operations; this can quadruple your throughput.
Use OpenMP to parallelize the loops either across your CPU or by offloading to your GPU (OpenMP 4.5+).
Learn about the Barnes-Hut algorithm, which groups particles to achieve O(N log N) complexity (down from your O(N^2)).

This is actually quite a nice candidate for SIMD. It's worth noting that this:
real s = jMass / POW(distSqr,3.0/2.0);
can be refactored into this if you negate the power: (removes a division)
real s = jMass * POW(distSqr, -3.0/2.0);
It's now worth noting that you can remove the call to pow completely here, since you are dealing with a very simple exponent. So:
real s = jMass * std::sqrt(distSqr) / (distSqr * distSqr);
If you know your laws of powers, you can do an additional refactor step here:
real s = jMass / (std::sqrt(distSqr) * distSqr);
With any luck, your compiler is already performing this transformation for you (you'll typically need -O2 and -ffast-math). Example:
https://godbolt.org/z/8YqFYA
The reason this is nice is that you have now removed a cmath call from your code completely. This makes it very easy to drop down to something like SIMD, and extremely easy if you happen to be using clang or gcc, e.g.:
#include <immintrin.h>
typedef __m256 real;
struct real3 { real x, y, z; };
// i had to make up a value
const __m256 SOFTENING_SQUARED = _mm256_set1_ps(1.23f);
real3 bodyBodyInteraction(real iPosx, real iPosy, real iPosz,
real jPosx, real jPosy, real jPosz, real jMass)
{
real rx, ry, rz;
rx = jPosx - iPosx;
ry = jPosy - iPosy;
rz = jPosz - iPosz;
real distSqr = rx*rx+ry*ry+rz*rz;
distSqr += SOFTENING_SQUARED;
real s = jMass / (_mm256_sqrt_ps(distSqr) * distSqr);
real3 f;
f.x = rx * s;
f.y = ry * s;
f.z = rz * s;
return f;
}
And in godbolt:
https://godbolt.org/z/JTCwm-

Why the cos function in math.h faster than x86 fcos instruction

The cos() in math.h runs faster than the x86 asm fcos.
The following code compares the x86 fcos with cos() from math.h.
In this code, 1000000 executions of asm fcos cost 150 ms; 1000000 calls to cos() cost only 80 ms.
How is fcos implemented in x86?
Why is fcos much slower than cos()?
My environment is Intel i7-6820HQ + Windows 10 + Visual Studio 2017.
#include "string"
#include "iostream"
#include<time.h>
#include "math.h"
int main()
{
int i;
const int i_max = 1000000;
float c = 10000;
float *d = &c;
float start_value = 8.333333f;
float* pstart_value = &start_value;
clock_t a, b;
a = clock();
__asm {
mov edx, pstart_value;
fld [edx];
}
for (i = 0; i < i_max; i++) {
__asm {
fcos;
}
}
b = clock();
printf("asm time = %u", b - a);
a = clock();
double y;
for (i = 0; i < i_max; i++) {
start_value = cos(start_value);
}
b = clock();
printf("math time = %u", b - a);
return 0;
}
As I understand it, a single asm instruction is usually faster than a function call.
Why is fcos so slow in this case?
Update:
I ran the same code on another laptop with an i7-6700HQ.
On this laptop, 1000000 executions of fcos cost only 51 ms. Why is there such a big difference between the two CPUs?

I bet the answer is easy. You do not use the result of cos, so the call is optimized out, as in this example:
https://godbolt.org/z/iw-nft
Change the variables to volatile to force the cos call:
https://godbolt.org/z/9_dpMs
Another guess:
Maybe your cos implementation uses lookup tables. In that case it could be faster than the hardware implementation.

Measuring processor ticks in C

I wanted to calculate the difference in execution time when executing the same code inside a function. To my surprise, however, sometimes the clock difference is 0 when I use clock()/clock_t for the start and stop timer. Does this mean that clock()/clock_t does not actually return the number of clicks the processor spent on the task?
After a bit of searching, it seemed to me that clock_gettime() would return more fine-grained results. And indeed it does, but I instead end up with an arbitrary number of nano(?)seconds. It gives a hint of the difference in execution time, but it's hardly accurate as to exactly how many clicks of difference it amounts to. What would I have to do to find this out?
#include <math.h>
#include <stdio.h>
#include <time.h>
#define M_PI_DOUBLE (M_PI * 2)
void rotatetest(const float *x, const float *c, float *result) {
float rotationfraction = *x / *c;
*result = M_PI_DOUBLE * rotationfraction;
}
int main() {
int i;
long test_total = 0;
int test_count = 1000000;
struct timespec test_time_begin;
struct timespec test_time_end;
float r = 50.f;
float c = 2 * M_PI * r;
float x = 3.f;
float result_inline = 0.f;
float result_function = 0.f;
for (i = 0; i < test_count; i++) {
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_begin);
float rotationfraction = x / c;
result_inline = M_PI_DOUBLE * rotationfraction;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_end);
test_total += test_time_end.tv_nsec - test_time_begin.tv_nsec;
}
printf("Inline clocks %li, avg %f (result is %f)\n", test_total, test_total / (float)test_count,result_inline);
for (i = 0; i < test_count; i++) {
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_begin);
rotatetest(&x, &c, &result_function);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_end);
test_total += test_time_end.tv_nsec - test_time_begin.tv_nsec;
}
printf("Function clocks %li, avg %f (result is %f)\n", test_total, test_total / (float)test_count, result_inline);
return 0;
}
I am using gcc version 4.8.4 on Linux 3.13.0-37-generic (Linux Mint 16)

First of all: as already mentioned in the comments, timing single executions one after the other will probably do you no good. In the worst case, the call that gets the time might actually take longer than the operation being timed.
Please time multiple runs of the operation (including a warm-up phase so everything is swapped in) and calculate the average running time.
clock() isn't guaranteed to be monotonic. It also isn't the number of processor clicks (whatever you define this to be) the program has run. The best way to describe the result from clock() is probably "a best effort estimation of the time any one of the CPUs has spent on calculation for the current process". For benchmarking purposes clock() is thus mostly useless.
As per specification:
The clock() function returns the implementation's best approximation to the processor time used by the process since the beginning of an implementation-dependent time related only to the process invocation.
And additionally
To determine the time in seconds, the value returned by clock() should be divided by the value of the macro CLOCKS_PER_SEC.
So, if you call clock() more often than the resolution, you are out of luck.
For profiling/benchmarking, you should, if possible, use one of the performance clocks that are available on modern hardware. The prime candidates are probably
The HPET
The TSC
Whether either (or both) is available depends on the hardware and is also operating-system specific.
Edit: The question now references CLOCK_PROCESS_CPUTIME_ID, which is Linux's way of exposing the TSC.
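The advice to time many runs at once can be sketched as follows. This is an illustration under assumptions: timespec_diff_ns and average_ns are hypothetical helpers, CLOCK_MONOTONIC is chosen here for wall-clock intervals (the question uses CLOCK_PROCESS_CPUTIME_ID), and a POSIX system is assumed. Note that the difference uses both tv_sec and tv_nsec; subtracting tv_nsec alone goes wrong whenever a second boundary is crossed.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Elapsed nanoseconds from *begin to *end, using both tv_sec and tv_nsec. */
long long timespec_diff_ns(const struct timespec *begin, const struct timespec *end)
{
    return (long long)(end->tv_sec - begin->tv_sec) * 1000000000LL
         + (long long)(end->tv_nsec - begin->tv_nsec);
}

/* Average cost in nanoseconds of one call to fn, measured over `runs`
   calls with a single pair of clock reads, after a short warm-up. */
double average_ns(void (*fn)(void), long runs)
{
    struct timespec begin, end;

    for (long i = 0; i < runs / 10 + 1; i++)  /* warm-up, not timed */
        fn();

    clock_gettime(CLOCK_MONOTONIC, &begin);
    for (long i = 0; i < runs; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)timespec_diff_ns(&begin, &end) / (double)runs;
}
```

The function under test has to produce a side effect the compiler cannot remove (for example a store to a volatile), otherwise the timed loop may be optimized away.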

After googling a little, I can see that the clock() function can be used as a standard mechanism to find the time taken for execution, but be aware that the time will vary from run to run depending on the load on your processor.
You can just use the code below for the calculation:
clock_t begin, end;
double time_spent;
begin = clock();
/* here, do your time-consuming job */
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;

What could be some possible problems with this use of OpenMP?

I was trying to figure out how to parallelize a segment of code in OpenMP, where the inside of the for loop is independent from the rest of it.
Basically the project deals with particle systems, but I don't think that should be relevant to the parallelization of the code. Is it a caching problem, where the for loop divides the work among threads in such a way that the particles are not cached efficiently in each core?
Edit: As mentioned by an answer below, I'm wondering why I'm not getting speedup.
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
// printf("%d", omp_get_thread_num());
}

If you're asking whether it's parallelized correctly, it looks fine. I don't see any data-races or loop-dependencies that could break it.
But I think you're wondering why you aren't getting any speedup with parallelism.
You mentioned that the trip count, psize-n_dead, will be on the order of 4000. I'd say that's actually pretty small given the amount of work in the loop.
In other words, you don't have enough total work to be worth parallelizing. So threading overhead is probably eating up any speedup that you would otherwise gain. If possible, you should try parallelizing at a higher level.
EDIT: You updated your comment to include up to 200000.
For larger values, it's likely that you'll be memory bound in some way. Your loop merely iterates through all the data doing very little work. So using more threads probably won't help much (if at all).

There are no correctness issues such as data races in this piece of code.
Assuming that the number of particles to process is big enough to warrant parallelism, I do not see OpenMP related performance issues in this code. By default, OpenMP will split the loop iterations statically in equal portions across all threads, so any cache conflicts may only occur at the boundaries of these portions, i.e. just in a few iterations of the loop.
Unrelated to OpenMP (and so to the parallel speedup problem), a performance improvement can possibly be achieved by switching from array-of-structs to struct-of-arrays, as this might help the compiler vectorize the code (i.e. use the SIMD instructions of the target processor):
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
}
Such a reorganization assumes that most of the time all particles are processed in a loop like this one. Working with an individual particle requires more cache lines to be loaded, but if you process them all in a loop, the net number of cache lines loaded is nearly the same.
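For completeness, the struct-of-arrays layout implied by the loop above could look like this. It is a minimal sketch with scalar double fields; particles_soa and update_soa are hypothetical names, and the real particle type in the question may hold vectors instead.

```c
#include <stddef.h>

/* Struct-of-arrays layout: one contiguous array per field (hypothetical names). */
typedef struct {
    double *pos;
    double *vel;
} particles_soa;

/* Same update as in the question, but each field is read and written
   as a contiguous stream, which is easier for the compiler to vectorize. */
void update_soa(particles_soa *p, size_t n, double force, double dt)
{
    double *pos = p->pos;
    double *vel = p->vel;

    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) {
        pos[i] = pos[i] + dt * vel[i];               /* uses the old velocity */
        vel[i] = (1 - dt * .1) * vel[i] + dt * force;
    }
}
```

Without -fopenmp the pragma is ignored and the loop simply runs serially, which makes it easy to compare the two builds.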

How sure are you that you're not getting speedup?
Trying it both ways - array of structs and struct of arrays, compiled with gcc -O3 (gcc 4.6), on a dual quad-core nehalem, I get for psize-n_dead = 200000, running 100 iterations for better timer accuracy:
Struct of arrays (reported times are in milliseconds):
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
However, because the operation is so short (sub-ms), I didn't see any speedup without doing 100 iterations, because of timer accuracy. Also, you'd have to have a machine with good memory bandwidth to get this sort of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.
Code for array-of-structs follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
typedef struct particle_struct {
double pos;
double vel;
} particle;
typedef struct simulation_struct {
particle *particles;
double force;
} simulation;
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}
void update(simulation *s, unsigned psize, double dt) {
#pragma omp parallel for
for (unsigned i = 0; i < psize; ++i)
{
s->particles[i].pos = s->particles[i].pos+ dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
}
}
void init(simulation *s, unsigned np) {
s->force = 1.;
s->particles = malloc(np*sizeof(particle));
for (unsigned i=0; i<np; i++) {
s->particles[i].pos = 1.;
s->particles[i].vel = 1.;
}
}
int main(void)
{
const unsigned np=200000;
simulation s;
struct timeval clock;
init(&s, np);
tick(&clock);
for (int iter=0;iter< 100; iter++)
update(&s, np, 0.75);
double elapsed=tock(&clock)*1000.;
printf("Took time %lf\n", elapsed);
free(s.particles);
}
