Suppose I have a very simple code like:
double array[SIZE_OF_ARRAY];
double sum = 0.0;
for (int i = 0; i < SIZE_OF_ARRAY; ++i)
{
sum += array[i];
}
I basically want to do the same operations using SSE2. How can I do that?
Here's a very simple SSE3 implementation:
#include <pmmintrin.h> // SSE3, needed for _mm_hadd_pd (also pulls in the SSE2 header)
__m128d vsum = _mm_set1_pd(0.0);
for (int i = 0; i < n; i += 2)
{
__m128d v = _mm_load_pd(&a[i]);
vsum = _mm_add_pd(vsum, v);
}
vsum = _mm_hadd_pd(vsum, vsum);
double sum = _mm_cvtsd_f64(vsum);
You can unroll the loop to get much better performance by using multiple accumulators to hide the latency of FP addition (as suggested by @Mysticial). Unroll 3 or 4 times with multiple "sum" vectors to bottleneck on load and FP-add throughput (one or two per clock cycle) instead of FP-add latency (one per 3 or 4 cycles):
__m128d vsum0 = _mm_setzero_pd();
__m128d vsum1 = _mm_setzero_pd();
for (int i = 0; i < n; i += 4)
{
__m128d v0 = _mm_load_pd(&a[i]);
__m128d v1 = _mm_load_pd(&a[i + 2]);
vsum0 = _mm_add_pd(vsum0, v0);
vsum1 = _mm_add_pd(vsum1, v1);
}
vsum0 = _mm_add_pd(vsum0, vsum1); // vertical ops down to one accumulator
vsum0 = _mm_hadd_pd(vsum0, vsum0); // horizontal add of the single register
double sum = _mm_cvtsd_f64(vsum0);
Note that the array a is assumed to be 16 byte aligned and the number of elements n is assumed to be a multiple of 2 (or 4, in the case of the unrolled loop).
See also Fastest way to do horizontal float vector sum on x86 for alternate ways of doing the horizontal sum outside the loop. SSE3 support is not totally universal (especially AMD CPUs were later to support it than Intel).
Also, _mm_hadd_pd is usually not the fastest way even on CPUs that support it, so an SSE2-only version won't be worse on modern CPUs. It's outside the loop and doesn't make much difference either way, though.
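As a rough illustration of the SSE2-only variant mentioned above, here is a minimal sketch. The function name sum_sse2 is hypothetical, and it makes two choices that are assumptions rather than part of the answer: unaligned loads (so the array need not be 16-byte aligned) and a scalar tail loop (so n need not be a multiple of 4). The horizontal sum uses _mm_unpackhi_pd and _mm_add_sd instead of _mm_hadd_pd.

```c
#include <emmintrin.h> /* SSE2 only */
#include <stddef.h>

/* Hypothetical helper: sums n doubles with two SSE2 accumulators. */
double sum_sse2(const double *a, size_t n)
{
    __m128d vsum0 = _mm_setzero_pd();
    __m128d vsum1 = _mm_setzero_pd();
    size_t i = 0;

    /* Main loop: 4 doubles per iteration, two independent add chains. */
    for (; i + 4 <= n; i += 4)
    {
        vsum0 = _mm_add_pd(vsum0, _mm_loadu_pd(&a[i]));
        vsum1 = _mm_add_pd(vsum1, _mm_loadu_pd(&a[i + 2]));
    }

    /* Vertical add down to one accumulator. */
    vsum0 = _mm_add_pd(vsum0, vsum1);

    /* SSE2 horizontal add: copy the high lane down, then add the low lanes. */
    __m128d high = _mm_unpackhi_pd(vsum0, vsum0);
    double sum = _mm_cvtsd_f64(_mm_add_sd(vsum0, high));

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        sum += a[i];

    return sum;
}
```

Note that the additions happen in a different order than in the scalar loop, so the result can differ in the last bits for general floating-point data.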
Related
I am working on a spiking neural network project in C where spikes are boolean values. Right now I have built a custom bit matrix type to represent the spike matrixes.
I frequently need the dot product of the bit matrix and a matrix of single precision floats of the same size, so I was wondering how I should speed things up?
I also need to do pointwise multiplication of the float matrix and the bit matrix later.
My plan right now was just to loop through with an if statement and a bit shift. I want to speed this up.
float current = 0;
for (int i = 0; i < n_elem; i++, bit_vec >>= 1) {
if (bit_vec & 1)
current += weights[i];
}
I don't necessarily need to use a bit vector, it could be represented in other ways too. I have seen other answers here, but they are hardware specific and I am looking for something that can be portable.
I am not using any BLAS functions either, mostly because I am never operating on two floats. Should I be?
Thanks.
The bit_vec >>= 1 and current += weights[i] operations create a loop-carried dependency that will almost certainly prevent the compiler from generating a fast implementation (and also prevent the processor from executing it efficiently).
You can solve this by unrolling the loop. Additionally, most mainstream compilers are not smart enough to optimize out the condition and use a blend instruction, which is available on most architectures. Conditional branches are slow, especially when they cannot be easily predicted (which is certainly your case). You can use a multiplication to help the compiler generate better instructions. Here is the result:
const unsigned int blockSize = 4;
float current[blockSize] = {0.f};
int i;
for (i = 0; i < n_elem-blockSize+1; i+=blockSize, bit_vec >>= blockSize)
for(int j = 0 ; j < blockSize ; ++j)
current[j] += weights[i+j] * ((bit_vec >> j) & 1); // element i+j is selected by bit j of the current block
for (; i < n_elem; ++i, bit_vec >>= 1)
if (bit_vec & 1)
current[0] += weights[i];
float sum = 0.f;
for(int j = 0 ; j < blockSize ; ++j)
sum += current[j];
This code should be faster, assuming n_elem is relatively big. It should still be far from efficient, since compilers like GCC and Clang fail to auto-vectorize it. This is sad, since it would be several times faster with SIMD instructions (like SSE, AVX, Neon, etc.). That being said, this is exactly why people use non-portable code: to manually use efficient instructions, since compilers often fail to do that in non-trivial cases.
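For bit vectors longer than one machine word, the same branch-free idea can be sketched in portable C as follows. This is only an illustration under stated assumptions: the function name masked_sum and the layout (bit i of the spike vector stored in bits[i / 64], least significant bit first) are hypothetical, and the multiplication trick assumes all weights are finite (0 * inf would give NaN, unlike the branching version).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of weights[i] over all i whose bit is set.
   Assumed layout: bit i lives in bits[i / 64], LSB first. */
float masked_sum(const uint64_t *bits, const float *weights, size_t n_elem)
{
    /* Four independent accumulators break the loop-carried dependency. */
    float acc[4] = {0.f, 0.f, 0.f, 0.f};
    size_t i = 0;

    for (; i + 4 <= n_elem; i += 4)
    {
        /* i is a multiple of 4, so these 4 bits never straddle two words. */
        uint64_t block = bits[i / 64] >> (i % 64);
        for (int j = 0; j < 4; ++j)
            acc[j] += weights[i + j] * (float)((block >> j) & 1u);
    }

    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Scalar tail for the last 0..3 elements. */
    for (; i < n_elem; ++i)
        if ((bits[i / 64] >> (i % 64)) & 1u)
            sum += weights[i];

    return sum;
}
```

The pointwise product of the float matrix and the bit matrix can reuse the same inner expression, writing weights[i + j] * bit to an output array instead of accumulating it.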
I wonder if there is a fast way of multiplying int8 arrays, i.e.
for(i = 0; i < n; ++i)
z[i] = x * y[i];
I see that the Intel intrinsics guide lists several SIMD instructions, such as _mm_mulhi_epi16 and _mm_mullo_epi16 that do something like this for int16. Is there something similar for int8 that I'm missing?
Breaking the input into low and high bytes, one can write:
__m128i const kff00ff00 = _mm_set1_epi32(0xff00ff00);
__m128i lo = _mm_mullo_epi16(y, x);
__m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x);
__m128i z = _mm_blendv_epi8(lo, hi, kff00ff00);
AFAIK, the high bits YY of the YYyy|YYyy|YYyy|YYyy multiplied by 00xx|00xx|00xx|00xx do not interfere with the low 8 bits ??ll, and likewise the product of YY00|YY00 * 00xx|00xx produces the correct 8 bit product at HH00. These two results at the correct alignment need to be blended.
Here __m128i x = _mm_set1_epi16(scalar_x); and __m128i y = _mm_loadu_si128(...);
An alternative is to use pshufb (_mm_shuffle_epi8) to calculate LutLo[y & 15] + LutHi[y >> 4], where unfortunately the shift must also be emulated by _mm_and_si128(_mm_srli_epi16(y,4),_mm_set1_epi8(15)).
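The reasoning about the low and high bytes can be checked with a scalar model of a single 16-bit lane. This is only a sketch: the function name mul_packed_u8 is hypothetical, and it models what the two _mm_mullo_epi16 calls and the blend do to one lane rather than being SIMD code itself.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 16-bit lane holding two packed bytes.
   Returns both 8-bit products (each truncated to 8 bits) packed the same way. */
uint16_t mul_packed_u8(uint16_t lane, uint8_t x)
{
    /* Models _mm_mullo_epi16(y, x): low 16 bits of YYyy * 00xx. */
    uint16_t lo = (uint16_t)((uint32_t)lane * x);

    /* Models _mm_mullo_epi16(y & 0xff00, x): low 16 bits of YY00 * 00xx. */
    uint16_t hi = (uint16_t)((uint32_t)(lane & 0xff00u) * x);

    /* Models the blend: low byte from lo, high byte from hi. */
    return (uint16_t)((lo & 0x00ffu) | (hi & 0xff00u));
}
```

Because only the low 8 bits of each product are kept, the same bit pattern is correct for signed and unsigned int8 inputs.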
Hi, I am trying to improve the performance of this code, supposing that I have a machine capable of handling 4 threads. I first thought about using omp parallel, but then I saw that this function is called inside a for loop, so creating threads that many times would not be very efficient. So I would like to know how to implement it with SSE, which would be more efficient:
unsigned char cubicInterpolate_paralelo(unsigned char p[4], unsigned char x) {
unsigned char resultado;
unsigned char intermedio;
intermedio = + x*(3.0*(p[1] - p[2]) + p[3] - p[0]);
resultado = p[1] + 0.5 * x *(p[2] - p[0] + x*(2.0*p[0] - 5.0*p[1] + 4.0*p[2] - p[3] + x*(3.0*(p[1] - p[2]) + p[3] - p[0])));
return resultado;
}
unsigned char bicubicInterpolate_paralelo (unsigned char p[4][4], unsigned char x, unsigned char y) {
unsigned char arr[4],valorPixelCanal;
arr[0] = cubicInterpolate_paralelo(p[0], y);
arr[1] = cubicInterpolate_paralelo(p[1], y);
arr[2] = cubicInterpolate_paralelo(p[2], y);
arr[3] = cubicInterpolate_paralelo(p[3], y);
valorPixelCanal = cubicInterpolate_paralelo(arr, x);
return valorPixelCanal;
}
This is used inside some nested for loops:
for(i=0; i<z_img.width(); i++) {
for(j=0; j<z_img.height(); j++) {
//For R,G,B
for(c=0; c<3; c++) {
for(l=0; l<4; l++){
for(k=0; k<4; k++){
arr[l][k] = img(i/zFactor +l, j/zFactor +k, 0, c);
}
}
color[c] = bicubicInterpolate_paralelo(arr, (unsigned char)(i%zFactor)/zFactor, (unsigned char)(j%zFactor)/zFactor);
}
z_img.draw_point(i,j,color);
}
}
I've taken some liberties with the code, so you may have to change it significantly, but here's an (untested) transliteration to SSE:
__m128i x = _mm_unpacklo_epi8(_mm_loadl_epi64(x_array), _mm_setzero_si128());
__m128i p0 = _mm_unpacklo_epi8(_mm_loadl_epi64(p0_array), _mm_setzero_si128());
__m128i p1 = _mm_unpacklo_epi8(_mm_loadl_epi64(p1_array), _mm_setzero_si128());
__m128i p2 = _mm_unpacklo_epi8(_mm_loadl_epi64(p2_array), _mm_setzero_si128());
__m128i p3 = _mm_unpacklo_epi8(_mm_loadl_epi64(p3_array), _mm_setzero_si128());
__m128i t = _mm_sub_epi16(p1, p2);
t = _mm_add_epi16(_mm_add_epi16(t, t), t); // 3 * (p[1] - p[2])
__m128i intermedio = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t, p3), p0));
t = _mm_add_epi16(p1, _mm_slli_epi16(p1, 2)); // 5 * p[1]
// t2 = 2 * p[0] + 4 * p[2]
__m128i t2 = _mm_add_epi16(_mm_add_epi16(p0, p0), _mm_slli_epi16(p2, 2));
t = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t2, intermedio), _mm_add_epi16(t, p3)));
t = _mm_mullo_epi16(x, _mm_add_epi16(_mm_sub_epi16(p2, p0), t));
__m128i resultado = _mm_add_epi16(p1, _mm_srli_epi16(t, 1));
return resultado;
The 16 bit intermediates that I use should be wide enough, the only way for information from the high bits to affect low bits in this code is the right shift by 1 (0.5 * in your code), so really we only need 9 bits, the rest cannot affect the result. Bytes wouldn't be wide enough (unless you have some extra guarantees that I don't know about), but they would be annoying anyway because there is no nice way to multiply them.
I pretended for simplicity that the input takes the form of contiguous arrays of x's, p[0]'s etc, that's not what you need here but I didn't have time to work out all the loading and shuffling.
SSE is quite unrelated to threads. A single thread executes a single instruction at a time; with SSE that single instruction may apply to 4 or 8 sets of arguments at a time. So with multiple threads you can also run multiple SSE instructions to process even more data.
You can use threads with for-loops. Just don't use them inside. Instead, take the for(i=0; i<z_img.width(); i++) { outer loop and split it in 4 bands of width/4. Thread 0 gets 0..width/4, thread 1 gets width/4..width/2 etc.
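The band split can be sketched as a small helper that computes the column range for each thread. The name band_range and its signature are hypothetical; each thread t would then run the original outer loop from begin to end instead of from 0 to z_img.width().

```c
/* Hypothetical helper: half-open column range [*begin, *end) for thread t
   out of n_threads. The bands cover 0..width with no gaps or overlaps,
   even when width is not a multiple of n_threads. */
void band_range(int width, int n_threads, int t, int *begin, int *end)
{
    /* long long avoids overflow in width * t for very wide images. */
    *begin = (int)((long long)width * t / n_threads);
    *end   = (int)((long long)width * (t + 1) / n_threads);
}
```

With width = 10 and 4 threads this gives the bands 0..2, 2..5, 5..7 and 7..10, so the leftover columns are spread across threads rather than all landing on the last one.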
On an unrelated note your code also suffers from mixing floating-point and integer math. 0.5 * x is not nearly as efficient as x/2.
Using OpenMP, you could try adding #pragma omp parallel for to the outermost for loop. The threads are then created once for the whole image rather than once per call, which should solve your problem.
Going the SSE route is trickier because of the extra alignment restrictions on data, but the easiest transform would be to extend cubicInterpolate_paralelo to handle multiple calculations at once. With enough luck, telling the compiler to use SSE will do the trick for you, but to make sure, you could use intrinsic functions and types.
I need to find the highest and lowest value in an array of floats. Optionally, I want to be able to skip members of the array and evaluate only every 2nd, 4th, 8th, etc. element:
float maxValue = 0.0;
float minValue = 0.0;
int index = 0;
int stepwidth = 8;
for( int i = 0; i < pixelCount; i++ )
{
float value = data[index];
if( value > maxValue )
maxValue = value;
if( value < minValue )
minValue = value;
index += stepwidth;
if( index >= dataLength )
break;
}
That seems to be the fastest way without using other magic.
So I tried other magic, namely the vIsmax() and vIsmin() functions from vecLib (included in OS X's Accelerate framework), which apparently use processor-level acceleration of vector operations:
int maxIndex = vIsmax( pixelCount, data );
int minIndex = vIsmin( pixelCount, data );
float maxValue = data[maxIndex];
float minValue = data[minIndex];
It is very fast but doesn't allow skipping values (the functions don't offer a 'stride' argument). This makes it actually slower than my first code because I can easily skip every 8th value.
I even found a third way with BLAS which implements a similar function:
cblas_isamax( const int __N, const float *__X, const int __incX )
with __incX = stride to skip values, but it isn't fast at all and only finds absolute maxima which doesn't work for me.
So my question: can anyone think of another way to accelerate this?
Following the suggestion in the comment I looked into Intel intrinsics and came up with this code. Fair warning: this is my very first approach to intrinsics and is highly experimental. It works though:
#include <immintrin.h> // AVX2, needed for _mm256_i32gather_ps
#include <stdlib.h> // posix_memalign, free
void vec_minmax( float * inData, int length, float * outMin, float * outMax )
{
// In each iteration of the loop we will gather 8 from 64
// values (only fetching every 8th value).
// Build an index set that points to 8 consecutive floats.
// These indexes will later be spread up by factor 8 so they
// point to every 8th float.
// NOTE: these indexes are bytes, in reverse order.
__m256i vindex = _mm256_set_epi32( 28, 24, 20, 16, 12, 8, 4, 0 );
// Gather the first 8 floats.
__m256 v_min = _mm256_i32gather_ps( inData, vindex, 8 );
__m256 v_max = v_min;
for( int i = 64; i < length; i += 64 )
{
// gather the next set of floats.
__m256 v_cur = _mm256_i32gather_ps(( inData + i ), vindex, 8 );
// Compare every member and store the results in v_min and v_max.
v_min = _mm256_min_ps( v_min, v_cur );
v_max = _mm256_max_ps( v_max, v_cur );
}
// Store the final result in two arrays.
float * max8;
float * min8;
posix_memalign( (void **)&min8, 32, ( 8 * sizeof( float )));
posix_memalign( (void **)&max8, 32, ( 8 * sizeof( float )));
_mm256_store_ps( min8, v_min );
_mm256_store_ps( max8, v_max );
// Find the min/max value in the arrays.
* outMin = min8[0];
* outMax = max8[0];
for( int i = 0; i < 8; i++ )
{
if( min8[i] < * outMin )
* outMin = min8[i];
if( max8[i] > * outMax )
* outMax = max8[i];
}
// Release the temporary buffers allocated with posix_memalign.
free( min8 );
free( max8 );
}
So this function finds the min and max value in a set of floats, examining only every 8th value which is enough precision for my needs.
Unfortunately, it is not significantly faster than the trivial scalar approach with a simple for-loop and two if-statements (like shown above). At least not with a stride of 8.
Here is an implementation for the stride = 8 case, using SSE. I've tested the code but I haven't had time to benchmark it yet. I'm not entirely confident that it will be any faster than a scalar implementation, but it's worth a try...
#include <tmmintrin.h>
#include <float.h>
void vec_minmax_stride_8(const float * inData, int length, float * outMin, float * outMax)
{
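__m128 vmax = _mm_set1_ps(-FLT_MAX);
__m128 vmin = _mm_set1_ps(FLT_MAX);
// NOTE: assumes length is a multiple of 32; each iteration reads inData[i] .. inData[i + 27].
for (int i = 0; i < length; i += 32)
{
__m128 v0 = _mm_loadu_ps(&inData[i]);
__m128 v1 = _mm_loadu_ps(&inData[i + 8]);
__m128 v2 = _mm_loadu_ps(&inData[i + 16]);
__m128 v3 = _mm_loadu_ps(&inData[i + 24]);
v0 = _mm_shuffle_ps(v0, v1, _MM_SHUFFLE(0, 0, 0, 0)); // d[i], d[i], d[i+8], d[i+8]
v2 = _mm_shuffle_ps(v2, v3, _MM_SHUFFLE(0, 0, 0, 0)); // d[i+16], d[i+16], d[i+24], d[i+24]
v0 = _mm_shuffle_ps(v0, v2, _MM_SHUFFLE(2, 0, 2, 0)); // d[i], d[i+8], d[i+16], d[i+24]
vmax = _mm_max_ps(vmax, v0);
vmin = _mm_min_ps(vmin, v0);
}
// Horizontal reduction: rotate by 4 and then 8 bytes so that every lane holds the overall max/min.
// _mm_alignr_epi8 is an integer intrinsic, so cast to __m128i and back.
vmax = _mm_max_ps(vmax, _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(vmax), _mm_castps_si128(vmax), 4)));
vmin = _mm_min_ps(vmin, _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(vmin), _mm_castps_si128(vmin), 4)));
vmax = _mm_max_ps(vmax, _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(vmax), _mm_castps_si128(vmax), 8)));
vmin = _mm_min_ps(vmin, _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(vmin), _mm_castps_si128(vmin), 8)));
_mm_store_ss(outMax, vmax);
_mm_store_ss(outMin, vmin);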
}
Except where the amount of computation is high -- this example is not such a case -- strides are typically death for vector architectures. Vector loads and stores do not work that way, and it is a lot, lot of work to load the lanes individually. You are usually better off with scalar in such cases, though some cleverness might let you beat scalar in some cases by small margins.
The way you go fast here with vector intrinsics is to find the min/max for several positions at once. For example, if we had a RGBA floating point image, then find the min/max for r,g,b,a all at the same time, and return four mins and four maxes at the end. It's not much faster than the code you have, but you get more work out of it -- assuming the work is needed.
Another method is to keep a decimated copy of your data set around and run the filter over the reduced-size variants as needed. This will use more memory, but with factor-of-two decimation, it is less than twice as much (1/3 for 2D, c.f. mipmaps). Here again, it is only useful if you intend to do this a lot.
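A minimal sketch of the decimated-copy idea, in portable scalar C. The function names decimate and minmax are hypothetical; the sketch assumes stride >= 1, a destination buffer large enough for every kept element, and at least one element when searching for the min/max.

```c
#include <stddef.h>

/* Hypothetical helper: copies every stride-th element of src into dst
   and returns the number of elements written. Assumes stride >= 1. */
size_t decimate(const float *src, size_t n, size_t stride, float *dst)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i += stride)
        dst[count++] = src[i];
    return count;
}

/* Hypothetical helper: min and max of a contiguous array. Assumes n >= 1.
   Seeding from data[0] avoids a wrong result when all values share a sign. */
void minmax(const float *data, size_t n, float *out_min, float *out_max)
{
    float lo = data[0];
    float hi = data[0];
    for (size_t i = 1; i < n; ++i)
    {
        if (data[i] < lo) lo = data[i];
        if (data[i] > hi) hi = data[i];
    }
    *out_min = lo;
    *out_max = hi;
}
```

Once the decimated copy exists, the min/max pass runs over contiguous memory, which is the access pattern that vector loads (or a library routine without a stride argument) handle well. The copy only pays off if it is reused across many queries.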
The difference operator, (similar to the derivative operator), and the sum operator, (similar to the integration operator), can be used to change an algorithm because they are inverses.
Sum of (difference of y) = y
Difference of (sum of y) = y
An example of using them that way in a C program is below.
This C program demonstrates three approaches to making an array of squares.
The first approach is the simple obvious approach, y = x*x .
The second approach uses the equation (difference in y) = (x0 + x1)*(difference in x) .
The third approach is the reverse and uses the equation (sum of y) = x(x+1)(2x+1)/6 .
The second approach is consistently slightly faster than the first one, even though I haven't bothered optimizing it. I imagine that if I tried harder I could make it even better.
The third approach is consistently twice as slow, but this doesn't mean the basic idea is dumb. I could imagine that for some function other than y = x*x this approach might be faster. Also there is an integer overflow issue.
Trying out all these transformations was very interesting, so now I want to know what are some other pairs of mathematical operators I could use to transform the algorithm?
Here is the code:
#include <stdio.h>
#include <time.h>
#define tries 201
#define loops 100000
void printAllIn(unsigned int array[tries]){
unsigned int index;
for (index = 0; index < tries; ++index)
printf("%u\n", array[index]);
}
int main (int argc, const char * argv[]) {
/*
Goal, Calculate an array of squares from 0 20 as fast as possible
*/
long unsigned int obvious[tries];
long unsigned int sum_of_differences[tries];
long unsigned int difference_of_sums[tries];
clock_t time_of_obvious1;
clock_t time_of_obvious0;
clock_t time_of_sum_of_differences1;
clock_t time_of_sum_of_differences0;
clock_t time_of_difference_of_sums1;
clock_t time_of_difference_of_sums0;
long unsigned int j;
long unsigned int index;
long unsigned int sum1;
long unsigned int sum0;
long signed int signed_index;
time_of_obvious0 = clock();
for (j = 0; j < loops; ++j)
for (index = 0; index < tries; ++index)
obvious[index] = index*index;
time_of_obvious1 = clock();
time_of_sum_of_differences0 = clock();
for (j = 0; j < loops; ++j)
for (index = 1, sum_of_differences[0] = 0; index < tries; ++index)
sum_of_differences[index] = sum_of_differences[index-1] + 2 * index - 1;
time_of_sum_of_differences1 = clock();
time_of_difference_of_sums0 = clock();
for (j = 0; j < loops; ++j)
for (signed_index = 0, sum0 = 0; signed_index < tries; ++signed_index) {
sum1 = signed_index*(signed_index+1)*(2*signed_index+1);
difference_of_sums[signed_index] = (sum1 - sum0)/6;
sum0 = sum1;
}
time_of_difference_of_sums1 = clock();
// printAllIn(obvious);
printf(
"The obvious approach y = x*x took, %f seconds\n",
((double)(time_of_obvious1 - time_of_obvious0))/CLOCKS_PER_SEC
);
// printAllIn(sum_of_differences);
printf(
"The sum of differences approach y1 = y0 + 2x - 1 took, %f seconds\n",
((double)(time_of_sum_of_differences1 - time_of_sum_of_differences0))/CLOCKS_PER_SEC
);
// printAllIn(difference_of_sums);
printf(
"The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, %f seconds\n",
(double)(time_of_difference_of_sums1 - time_of_difference_of_sums0)/CLOCKS_PER_SEC
);
return 0;
}
There are two classes of optimizations here: strength reduction and peephole optimizations.
Strength reduction is the usual term for replacing "expensive" mathematical functions with cheaper functions -- say, replacing a multiplication with two logarithm table lookups, an addition, and then an inverse logarithm lookup to find the final result.
Peephole optimization is the usual term for replacing something like multiplication by a power of two with a left shift. Some CPUs have simple instructions for these operations that run faster than generic integer multiplication for the specific case of multiplying by powers of two.
You can also perform optimizations of individual algorithms. You might write a * b, but there are many different ways to perform multiplication, and different algorithms perform better or worse under different conditions. Many of these decisions are made by the chip designers, but arbitrary-precision integer libraries make their own choices based on the merits of the primitives available to them.
When I tried to compile your code on Ubuntu 10.04, I got a segmentation fault right when main() started because you are declaring many megabytes worth of variables on the stack. I was able to compile it after I moved most of your variables outside of main to make them be global variables.
Then I got these results:
The obvious approach y = x*x took, 0.000000 seconds
The sum of differences approach y1 = y0 + 2x - 1 took, 0.020000 seconds
The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, 0.000000 seconds
The program runs so fast it's hard to believe it really did anything. I put the "-O0" option in to disable optimizations but it's possible GCC still might have optimized out all of the computations. So I tried adding the "volatile" qualifier to your arrays but still got similar results.
That's where I stopped working on it. In conclusion, I don't really know what's going on with your code but it's quite possible that something is wrong.
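One way to make such a benchmark harder to optimize away is to have the program consume the results, for example by computing a checksum over the array and printing it after the timed loop. The sketch below is only an illustration under that assumption; the names squares_by_differences and checksum are hypothetical, and the first function is just the sum-of-differences recurrence from the question moved into a function.

```c
#include <stddef.h>

/* Hypothetical helper: fills out[0..n-1] with squares using the
   recurrence y[x] = y[x-1] + 2x - 1 from the question. */
void squares_by_differences(unsigned long *out, size_t n)
{
    if (n == 0)
        return;
    out[0] = 0;
    for (size_t x = 1; x < n; ++x)
        out[x] = out[x - 1] + 2 * x - 1;
}

/* Hypothetical helper: sum of all elements. Printing this value after
   the timed loop gives the compiler a reason to keep the computation. */
unsigned long checksum(const unsigned long *a, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

The checksum also doubles as a correctness check, since all three approaches should produce the same value for the same array length.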