I wonder if there is a fast way of multiplying int8 arrays, i.e.
for (i = 0; i < n; ++i)
    z[i] = x * y[i];
I see that the Intel intrinsics guide lists several SIMD instructions, such as _mm_mulhi_epi16 and _mm_mullo_epi16 that do something like this for int16. Is there something similar for int8 that I'm missing?
Breaking the input into low and high bytes, one can do:
__m128i const kff00ff00 = _mm_set1_epi32(0xff00ff00);
__m128i lo = _mm_mullo_epi16(y, x);
__m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x);
__m128i z = _mm_blendv_epi8(lo, hi, kff00ff00);
AFAIK, the high bytes YY of YYyy|YYyy|YYyy|YYyy multiplied by 00xx|00xx|00xx|00xx do not interfere with the low 8-bit products ??ll, and likewise the product YY00|YY00 * 00xx|00xx leaves the correct 8-bit product in the high byte, HH00. These two results are already at the correct byte positions and only need to be blended.
Here __m128i x = _mm_set1_epi16(scalar_x); and __m128i y = _mm_loadu_si128(...);
An alternative is to use pshufb (_mm_shuffle_epi8), calculating LutLo[y & 15] + LutHi[y >> 4], where unfortunately the per-byte shift also has to be emulated, with _mm_and_si128(_mm_srli_epi16(y, 4), _mm_set1_epi8(15)).
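A rough sketch of that LUT idea (the helper name and the on-the-fly table construction are my own additions, untested):

#include <tmmintrin.h>  // SSSE3 for _mm_shuffle_epi8
#include <stdint.h>

/* Multiply 16 uint8 elements in y by scalar_x, keeping the low 8 bits of each product. */
static inline __m128i mul_epu8_by_scalar_lut(__m128i y, uint8_t scalar_x)
{
    uint8_t lo_tab[16], hi_tab[16];
    for (int i = 0; i < 16; ++i) {
        lo_tab[i] = (uint8_t)(i * scalar_x);          /* low-nibble contribution  */
        hi_tab[i] = (uint8_t)((i << 4) * scalar_x);   /* high-nibble contribution */
    }
    __m128i lut_lo = _mm_loadu_si128((const __m128i *)lo_tab);
    __m128i lut_hi = _mm_loadu_si128((const __m128i *)hi_tab);
    __m128i nib_lo = _mm_and_si128(y, _mm_set1_epi8(15));
    __m128i nib_hi = _mm_and_si128(_mm_srli_epi16(y, 4), _mm_set1_epi8(15));
    return _mm_add_epi8(_mm_shuffle_epi8(lut_lo, nib_lo),
                        _mm_shuffle_epi8(lut_hi, nib_hi));
}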
I'm totally new to SSE programming, but have an Intel Core i7 processor.
Basically, I want to take 4 32-bit unsigned integers and cube them all (raise to the power of 3) at once. It is my understanding that the SIMD functionality of SSE and its successors make this possible, but how in the world do I go about doing it? Preferably in C but I could manage assembly if necessary.
Edit to make clear my final goal:
Then, I want to add all the cubes together to come up with a single number.
Background: I'm just trying to use SSE to optimize figuring out if a number is an Armstrong number (a three-digit number for which the sum of the cubes of its digits equals the number itself). An example is 153. There seems to be no way to do this other than brute force. These are a subset of the narcissistic numbers, whose digits, each raised to the power of the number of decimal digits, sum to the number itself. Hopefully, I'd like to eventually expand it to be more flexible; to start, I'm just doing the Armstrong numbers. As you might imagine, this came up on another site and a few of us are trying to optimize the hell out of it. Taking your ideas together with my own research, I came up with this code:
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1

__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

int main(int argc, const char * argv[]) {
    for (unsigned int i = 1; i <= 500; i++) {
        unsigned int firstDigit = i / 100;
        unsigned int secondDigit = (i - firstDigit * 100) / 10;
        unsigned int thirdDigit = (i - firstDigit * 100 - secondDigit * 10);
        __m128i v = _mm_setr_epi32(0, firstDigit, secondDigit, thirdDigit);
        __m128 v3 = (__m128) vcube(v);
        v3 = _mm_hadd_ps(v3, v3);
        v3 = _mm_hadd_ps(v3, v3);
        if (_mm_extract_epi32((__m128i) v3, 0) == i)
            printf("%03d is an Armstrong number\n", i);
    }
    return 0;
}
Note: I had to do some type coercions to get it to compile in some systems (Solaris, at least some Linux).
So this works, but maybe it could be streamlined. Sorry I didn't post the whole task, but I was trying to break it down into steps and I wanted to make sure each digit was correctly cubed.
(END EDIT)
Thank you!
Edit: I guess I should add I'm running Mac OS X Sierra.
EDIT AGAIN:
So, let's say I make all of these unsigned shorts instead of unsigned ints and add more digits: how do I add them together when a short may not be able to hold the sum of all the digits? Is there a way to add them and store the result in a vector of larger variables, if you know what I mean, or in a plain larger number such as a UInt64?
Sorry for all the questions, but like I said, I'm totally new at vector processing even though I've had access to it since my first Mac G4.
If your input values are in the range 0..1625 (so that the result fits in 32 bits) then you can use _mm_mullo_epi32:
__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
Demo:
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1

__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

int main()
{
    __m128i v = _mm_setr_epi32(0, 1, 1000, 1625);
    __m128i v3 = vcube(v);
    printf("%vlu => %vlu\n", v, v3);
    return 0;
}
Compile and test:
$ gcc -Wall -Wno-format-invalid-specifier -Wno-format-extra-args -msse4 vcube.c && ./a.out
0 1 1000 1625 => 0 1 1000000000 4291015625
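If you also want the horizontal sum of the four cubes (the final goal stated in the question), one possibility (just a sketch, not benchmarked) is to stay in the integer domain with SSSE3's _mm_hadd_epi32 rather than reinterpreting the vector as floats:

#include <tmmintrin.h> // SSSE3 (for _mm_hadd_epi32)

// Sum the four 32-bit lanes of a vector into one scalar.
static inline int hsum_epi32(__m128i v)
{
    v = _mm_hadd_epi32(v, v);  // pairwise sums
    v = _mm_hadd_epi32(v, v);  // total in every lane
    return _mm_cvtsi128_si32(v);
}

// e.g. if (hsum_epi32(vcube(v)) == i) ...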
For x <= 2642245 you can compute x*x*x using the foo_SSE function below with SSE4.1. It takes two 32-bit unsigned integers as input, one in each 64-bit half of an SSE register, and outputs two 64-bit integers.
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>
__m128i foo_SSE(__m128i x) {
    __m128i mask = _mm_set_epi32(-1, 0, -1, 0);
    __m128i x2 = _mm_shuffle_epi32(x, 0x80);  // copy each input into the odd lane of its 64-bit half
    __m128i t0 = _mm_mul_epu32(x, x);         // x*x as two 64-bit products
    __m128i t1 = _mm_mul_epu32(t0, x);        // low32(x*x) * x, as 64-bit products
    __m128i t2 = _mm_mullo_epi32(t0, x2);     // high32(x*x) * x (low 32 bits), sitting in the odd lanes
    __m128i t3 = _mm_and_si128(t2, mask);     // keep only the odd-lane contribution
    __m128i t4 = _mm_add_epi32(t3, t1);       // combine into the full 64-bit cubes
    return t4;
}

int main(void) {
    uint64_t k1 = 100000;
    uint64_t k2 = 2642245;
    __m128i x = _mm_setr_epi32(k1, 0, k2, 0);
    uint64_t t[2];
    _mm_storeu_si128((__m128i*)t, foo_SSE(x));  // unaligned store: t is not guaranteed 16-byte aligned
    printf("%20" PRIu64 " ", t[0]);
    printf("%20" PRIu64 "\n", t[1]);
    printf("%20" PRIu64 " ", k1*k1*k1);
    printf("%20" PRIu64 "\n", k2*k2*k2);
}
This can probably be improved a bit. I'm a little out of practice.
To get a quick overview of the 3 main stages (loading, operating, storing), see the following snippet. For integers e0 and e1:
#include "emmintrin.h"
__m128i result __attribute__((aligned(16)));
__m128i x = _mm_setr_epi32(0, e1, 0, e0);
__m128i cube = _mm_mul_epu32(x, _mm_mul_epu32(x, x));
_mm_store_si128(&result, cube);
_mm_mul_epu32 takes the even-indexed 32-bit elements of two __m128i registers, multiplies them, and puts the results as a 2-tuple of 64-bit values into the result register.
To get the values out, either access them through a cast or use your compiler's convenience definition of __m128i, e.g. with icc:
printf("%llu %llu\n", result.m128i_i64[0], result.m128i_i64[1]); /* msc style */
Note: I'm using the Intel Intrinsics guide for SSE primitives.
Edited for clarity about what the code actually does.
Hi, I am trying to improve the performance of this code, supposing that I have a machine capable of handling 4 threads. I first thought about using omp parallel, but then I saw that this function is inside a for loop, so creating threads that many times would not be very efficient. So I would like to know how to implement it with SSE, which would be more efficient:
unsigned char cubicInterpolate_paralelo(unsigned char p[4], unsigned char x) {
    unsigned char resultado;
    unsigned char intermedio;
    intermedio = + x*(3.0*(p[1] - p[2]) + p[3] - p[0]);
    resultado = p[1] + 0.5 * x *(p[2] - p[0] + x*(2.0*p[0] - 5.0*p[1] + 4.0*p[2] - p[3] + x*(3.0*(p[1] - p[2]) + p[3] - p[0])));
    return resultado;
}

unsigned char bicubicInterpolate_paralelo (unsigned char p[4][4], unsigned char x, unsigned char y) {
    unsigned char arr[4], valorPixelCanal;
    arr[0] = cubicInterpolate_paralelo(p[0], y);
    arr[1] = cubicInterpolate_paralelo(p[1], y);
    arr[2] = cubicInterpolate_paralelo(p[2], y);
    arr[3] = cubicInterpolate_paralelo(p[3], y);
    valorPixelCanal = cubicInterpolate_paralelo(arr, x);
    return valorPixelCanal;
}
This is used inside some nested for loops:
for(i=0; i<z_img.width(); i++) {
    for(j=0; j<z_img.height(); j++) {
        // For R,G,B
        for(c=0; c<3; c++) {
            for(l=0; l<4; l++){
                for(k=0; k<4; k++){
                    arr[l][k] = img(i/zFactor +l, j/zFactor +k, 0, c);
                }
            }
            color[c] = bicubicInterpolate_paralelo(arr, (unsigned char)(i%zFactor)/zFactor, (unsigned char)(j%zFactor)/zFactor);
        }
        z_img.draw_point(i,j,color);
    }
}
I've taken some liberties with the code, so you may have to change it significantly, but here's an (untested) transliteration to SSE:
__m128i x = _mm_unpacklo_epi8(_mm_loadl_epi64(x_array), _mm_setzero_si128());
__m128i p0 = _mm_unpacklo_epi8(_mm_loadl_epi64(p0_array), _mm_setzero_si128());
__m128i p1 = _mm_unpacklo_epi8(_mm_loadl_epi64(p1_array), _mm_setzero_si128());
__m128i p2 = _mm_unpacklo_epi8(_mm_loadl_epi64(p2_array), _mm_setzero_si128());
__m128i p3 = _mm_unpacklo_epi8(_mm_loadl_epi64(p3_array), _mm_setzero_si128());
__m128i t = _mm_sub_epi16(p1, p2);
t = _mm_add_epi16(_mm_add_epi16(t, t), t); // 3 * (p[1] - p[2])
__m128i intermedio = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t, p3), p0));
t = _mm_add_epi16(p1, _mm_slli_epi16(p1, 2)); // 5 * p[1]
// t2 = 2 * p[0] + 4 * p[2]
__m128i t2 = _mm_add_epi16(_mm_add_epi16(p0, p0), _mm_slli_epi16(p2, 2));
t = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t2, intermedio), _mm_add_epi16(t, p3)));
t = _mm_mullo_epi16(x, _mm_add_epi16(_mm_sub_epi16(p2, p0), t));
__m128i resultado = _mm_add_epi16(p1, _mm_srli_epi16(t, 1));
return resultado;
The 16-bit intermediates that I use should be wide enough; the only way for information from the high bits to affect the low bits in this code is the right shift by 1 (the 0.5 * in your code), so really we only need 9 bits, and the rest cannot affect the result. Bytes wouldn't be wide enough (unless you have some extra guarantees that I don't know about), but they would be annoying anyway because there is no nice way to multiply them.
For simplicity, I pretended that the input takes the form of contiguous arrays of x's, p[0]'s, etc.; that's not what you need here, but I didn't have time to work out all the loading and shuffling.
SSE is quite unrelated to threads. A single thread executes a single instruction at a time; with SSE that single instruction may apply to 4 or 8 sets of arguments at a time. So with multiple threads you can also run multiple SSE instructions to process even more data.
You can use threads with for loops, just not inside them. Instead, take the outer for(i=0; i<z_img.width(); i++) loop and split it into 4 bands of width/4 each: thread 0 gets 0..width/4, thread 1 gets width/4..width/2, etc.
On an unrelated note, your code also suffers from mixing floating-point and integer math. 0.5 * x is not nearly as efficient as x/2.
Using OpenMP, you could try adding the #pragma to the outer-most for loop. This should solve your problem.
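For example, a minimal sketch (the data-sharing clauses are guesses and will need adjusting to how your variables are actually declared; it also assumes z_img.draw_point can be called concurrently for distinct pixels):

#pragma omp parallel for private(j, c, l, k, arr, color)
for (i = 0; i < z_img.width(); i++) {
    for (j = 0; j < z_img.height(); j++) {
        /* ... same body as above ... */
    }
}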
Going the SSE route is trickier because of the extra alignment restrictions on data, but the easiest transform would be to extend cubicInterpolate_paralelo to handle multiple calculations at once. With enough luck, telling the compiler to use SSE will do the trick for you, but to make sure, you could use intrinsic functions and types.
I need to find the highest and lowest value in an array of floats. Optionally, I want to be able to skip members of the array and evaluate only every 2nd, 4th, 8th, etc. element:
float maxValue = 0.0;
float minValue = 0.0;
int index = 0;
int stepwidth = 8;

for( int i = 0; i < pixelCount; i++ )
{
    float value = data[index];
    if( value > maxValue )
        maxValue = value;
    if( value < minValue )
        minValue = value;
    index += stepwidth;
    if( index >= dataLength )
        break;
}
That seems to be the fastest way without using other magic.
So I tried other magic, namely the vIsmax() and vIsmin() functions from vecLib (included in OS X's Accelerate framework), which apparently use processor-level acceleration of vector operations:
int maxIndex = vIsmax( pixelCount, data );
int minIndex = vIsmin( pixelCount, data );
float maxValue = data[maxIndex];
float minValue = data[minIndex];
It is very fast but doesn't allow skipping values (the functions don't offer a 'stride' argument). This makes it actually slower than my first code because I can easily skip every 8th value.
I even found a third way with BLAS which implements a similar function:
cblas_isamax( const int __N, const float *__X, const int __incX )
with __incX = stride to skip values, but it isn't fast at all and only finds absolute maxima which doesn't work for me.
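For reference, this is roughly how I called it (a sketch; note that the returned index counts in strided steps, so it has to be scaled back, and that it finds the largest magnitude, not the largest value):

#include <Accelerate/Accelerate.h>  // provides cblas_isamax on OS X

int stride = 8;
int n = pixelCount / stride;              // number of strided elements examined
int idx = cblas_isamax(n, data, stride);  // index, in strided steps, of the largest |value|
float absMax = data[idx * stride];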
So my question: can anyone think of another way to accelerate this?
Following the suggestion in the comment I looked into Intel intrinsics and came up with this code. Fair warning: this is my very first approach to intrinsics and is highly experimental. It works though:
#include <immintrin.h>   // AVX2 (gather intrinsics)
#include <stdlib.h>      // posix_memalign, free

void vec_minmax( float * inData, int length, float * outMin, float * outMax )
{
    // In each iteration of the loop we gather 8 out of 64 values
    // (only fetching every 8th value). The indices 0, 4, ..., 28 are
    // scaled by 8 bytes by the gather, so they address byte offsets
    // 0, 32, ..., 224, i.e. every 8th float.
    // NOTE: _mm256_set_epi32 lists the elements from the highest lane down.
    __m256i vindex = _mm256_set_epi32( 28, 24, 20, 16, 12, 8, 4, 0 );

    // Gather the first 8 floats.
    __m256 v_min = _mm256_i32gather_ps( inData, vindex, 8 );
    __m256 v_max = v_min;

    // Assumes length is a multiple of 64.
    for( int i = 64; i < length; i += 64 )
    {
        // Gather the next set of floats.
        __m256 v_cur = _mm256_i32gather_ps( inData + i, vindex, 8 );

        // Compare every member and keep the results in v_min and v_max.
        v_min = _mm256_min_ps( v_min, v_cur );
        v_max = _mm256_max_ps( v_max, v_cur );
    }

    // Store the partial results in two aligned arrays.
    float * max8;
    float * min8;
    posix_memalign( (void **)&min8, 32, 8 * sizeof( float ));
    posix_memalign( (void **)&max8, 32, 8 * sizeof( float ));
    _mm256_store_ps( min8, v_min );
    _mm256_store_ps( max8, v_max );

    // Find the min/max value in the arrays.
    *outMin = min8[0];
    *outMax = max8[0];
    for( int i = 0; i < 8; i++ )
    {
        if( min8[i] < *outMin )
            *outMin = min8[i];
        if( max8[i] > *outMax )
            *outMax = max8[i];
    }

    free( min8 );
    free( max8 );
}
So this function finds the min and max value in a set of floats, examining only every 8th value which is enough precision for my needs.
Unfortunately, it is not significantly faster than the trivial scalar approach with a simple for-loop and two if-statements (like shown above). At least not with a stride of 8.
Here is an implementation for the stride = 8 case, using SSE. I've tested the code but I haven't had time to benchmark it yet. I'm not entirely confident that it will be any faster than a scalar implementation, but it's worth a try...
#include <tmmintrin.h>   // SSSE3 (for _mm_alignr_epi8)
#include <float.h>

void vec_minmax_stride_8(const float * inData, int length, float * outMin, float * outMax)
{
    __m128 vmax = _mm_set1_ps(-FLT_MAX);
    __m128 vmin = _mm_set1_ps(FLT_MAX);

    // Assumes length is a multiple of 32.
    for (int i = 0; i < length; i += 32)
    {
        __m128 v0 = _mm_loadu_ps(&inData[i]);
        __m128 v1 = _mm_loadu_ps(&inData[i + 8]);
        __m128 v2 = _mm_loadu_ps(&inData[i + 16]);
        __m128 v3 = _mm_loadu_ps(&inData[i + 24]);

        v0 = _mm_shuffle_ps(v0, v1, _MM_SHUFFLE(0, 0, 0, 0));
        v2 = _mm_shuffle_ps(v2, v3, _MM_SHUFFLE(0, 0, 0, 0));
        v0 = _mm_shuffle_ps(v0, v2, _MM_SHUFFLE(2, 0, 2, 0)); // every 8th element of this block

        vmax = _mm_max_ps(vmax, v0);
        vmin = _mm_min_ps(vmin, v0);
    }

    // Horizontal reduction of the four lanes (rotate and compare).
    vmax = _mm_max_ps(vmax, _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(vmax), _mm_castps_si128(vmax), 4)));
    vmin = _mm_min_ps(vmin, _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(vmin), _mm_castps_si128(vmin), 4)));
    vmax = _mm_max_ps(vmax, _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(vmax), _mm_castps_si128(vmax), 8)));
    vmin = _mm_min_ps(vmin, _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(vmin), _mm_castps_si128(vmin), 8)));

    _mm_store_ss(outMax, vmax);
    _mm_store_ss(outMin, vmin);
}
Except where the amount of computation is high -- this example is not such a case -- strides are typically death for vector architectures. Vector loads and stores do not work that way, and it is a lot of work to load the lanes individually. You are usually better off with scalar code in such cases, though some cleverness might let you beat scalar by small margins.
The way you go fast here with vector intrinsics is to find the min/max for several positions at once. For example, if we had a RGBA floating point image, then find the min/max for r,g,b,a all at the same time, and return four mins and four maxes at the end. It's not much faster than the code you have, but you get more work out of it -- assuming the work is needed.
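For illustration, here is a minimal sketch of that idea for interleaved RGBA float data (my own example; it assumes 4 floats per pixel, pixelCount >= 1, and that unaligned loads are acceptable):

#include <xmmintrin.h>

// Per-channel min/max over interleaved RGBA floats; outMin/outMax each receive 4 values.
void rgba_minmax(const float *rgba, int pixelCount, float outMin[4], float outMax[4])
{
    __m128 vmin = _mm_loadu_ps(rgba);
    __m128 vmax = vmin;
    for (int i = 1; i < pixelCount; ++i)
    {
        __m128 v = _mm_loadu_ps(rgba + 4 * i);
        vmin = _mm_min_ps(vmin, v);
        vmax = _mm_max_ps(vmax, v);
    }
    _mm_storeu_ps(outMin, vmin);
    _mm_storeu_ps(outMax, vmax);
}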
Another method is to keep a decimated copy of your data set around and run the filter over the reduced-size variants as needed. This will use more memory, but with factor-of-two decimation it is less than twice as much (1/3 extra for 2D, cf. mipmaps). Here again, it is only useful if you intend to do this a lot.
In SSE there is a function _mm_cvtepi32_ps(__m128i input) which takes an input vector of 32-bit wide signed integers (int32_t) and converts them into floats.
Now, I want to interpret the input integers as unsigned. But there is no function _mm_cvtepu32_ps and I could not find an implementation of one. Do you know where I can find such a function, or can you at least give a hint on the implementation?
To illustrate the difference in results:
unsigned int a = 2480160505; // 10010011 11010100 00111110 11111001
float a1 = a; // 01001111 00010011 11010100 00111111;
float a2 = (signed int)a; // 11001110 11011000 01010111 10000010
With Paul R's solution and with my previous solution, the difference between the rounded floating point result and the original integer is less than or equal to 0.75 ULP (Unit in the Last Place). In these methods rounding may occur at two places: in _mm_cvtepi32_ps and in _mm_add_ps. This leads to results that are not as accurate as possible for some inputs. For example, with Paul R's method 0x2000003 = 33554435 is converted to 33554432.0, but 33554436.0 also exists as a float, which would have been better here. My previous solution suffers from similar inaccuracies. Such inaccurate results may also occur with compiler generated code, see here.
Following the approach of gcc (see Peter Cordes' answer to that other SO question), an accurate conversion within 0.5 ULP is obtained:
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i msk_lo     = _mm_set1_epi32(0xFFFF);
    __m128  cnst65536f = _mm_set1_ps(65536.0f);

    __m128i v_lo     = _mm_and_si128(v, msk_lo);          /* extract the 16 least significant bits of v */
    __m128i v_hi     = _mm_srli_epi32(v, 16);             /* 16 most significant bits of v              */
    __m128  v_lo_flt = _mm_cvtepi32_ps(v_lo);             /* No rounding                                */
    __m128  v_hi_flt = _mm_cvtepi32_ps(v_hi);             /* No rounding                                */
    v_hi_flt         = _mm_mul_ps(cnst65536f, v_hi_flt);  /* No rounding                                */
    return _mm_add_ps(v_hi_flt, v_lo_flt);                /* Rounding may occur here; mul and add may fuse to fma on Haswell and newer */
}   /* _mm_add_ps is guaranteed to give results with an error of at most 0.5 ULP */
Note that other high-bits/low-bits partitions are possible as long as _mm_cvtepi32_ps can convert both pieces to floats without rounding. For example, a partition with 20 high bits and 12 low bits will work equally well.
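A quick way to sanity-check the routine (my own test scaffold, using the values discussed above and in the question):

#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
    /* 0, 2480160505 (0x93D43EF9), 33554435 (0x2000003) and UINT_MAX */
    __m128i v = _mm_setr_epi32(0, (int)0x93D43EF9, 0x2000003, (int)0xFFFFFFFF);
    float out[4];
    _mm_storeu_ps(out, _mm_cvtepu32_ps(v));
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}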
This functionality exists in AVX-512, but if you can't wait until then the only thing I can suggest is to convert the unsigned int input values into pairs of smaller values, convert these, and then add them together again, e.g.
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i v2 = _mm_srli_epi32(v, 1);   // v2 = v / 2
    __m128i v1 = _mm_sub_epi32(v, v2);   // v1 = v - (v / 2)
    __m128 v2f = _mm_cvtepi32_ps(v2);
    __m128 v1f = _mm_cvtepi32_ps(v1);
    return _mm_add_ps(v2f, v1f);
}
UPDATE
As noted by @wim in his answer, the above solution fails for an input value of UINT_MAX. Here is a more robust, but slightly less efficient, solution which should work for the full uint32_t input range:
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i v2 = _mm_srli_epi32(v, 1);                 // v2 = v / 2
    __m128i v1 = _mm_and_si128(v, _mm_set1_epi32(1));  // v1 = v & 1
    __m128 v2f = _mm_cvtepi32_ps(v2);
    __m128 v1f = _mm_cvtepi32_ps(v1);
    return _mm_add_ps(_mm_add_ps(v2f, v2f), v1f);      // return 2 * v2 + v1
}
I think Paul's answer is nice, but it fails for v=4294967295U (=2^32-1). In that case v2=2^31-1 and v1=2^31. Intrinsic _mm_cvtepi32_ps converts 2^31 to -2.14748365E9. v2=2^31-1 is converted to 2.14748365E9, and consequently _mm_add_ps returns 0 (due to rounding, v1f and v2f are exact opposites of each other).
The idea of the solution below is to copy the most significant bit of v to v_high. The other bits of v are copied to v_low. v_high is converted to either 0 or 2.14748365E9.
inline __m128 _mm_cvtepu32_v3_ps(const __m128i v)
{
    __m128i msk0     = _mm_set1_epi32(0x7FFFFFFF);
    __m128i zero     = _mm_xor_si128(msk0, msk0);
    __m128i cnst2_31 = _mm_set1_epi32(0x4F000000);   /* IEEE representation of float 2^31 */

    __m128i v_high   = _mm_andnot_si128(msk0, v);
    __m128i v_low    = _mm_and_si128(msk0, v);
    __m128  v_lowf   = _mm_cvtepi32_ps(v_low);
    __m128i msk1     = _mm_cmpeq_epi32(v_high, zero);
    __m128  v_highf  = _mm_castsi128_ps(_mm_andnot_si128(msk1, cnst2_31));
    __m128  v_sum    = _mm_add_ps(v_lowf, v_highf);
    return v_sum;
}
Update
It was possible to reduce the number of instructions:
inline __m128 _mm_cvtepu32_v4_ps(const __m128i v)
{
    __m128i msk0     = _mm_set1_epi32(0x7FFFFFFF);
    __m128i cnst2_31 = _mm_set1_epi32(0x4F000000);

    __m128i msk1     = _mm_srai_epi32(v, 31);
    __m128i v_low    = _mm_and_si128(msk0, v);
    __m128  v_lowf   = _mm_cvtepi32_ps(v_low);
    __m128  v_highf  = _mm_castsi128_ps(_mm_and_si128(msk1, cnst2_31));
    __m128  v_sum    = _mm_add_ps(v_lowf, v_highf);
    return v_sum;
}
Intrinsic _mm_srai_epi32(v, 31) shifts each element of v right arithmetically by 31 bits, shifting in copies of the sign bit, so each lane becomes all ones if the most significant bit of v was set and all zeros otherwise, which turns out to be quite useful here.
Suppose I have a very simple code like:
double array[SIZE_OF_ARRAY];
double sum = 0.0;

for (int i = 0; i < SIZE_OF_ARRAY; ++i)
{
    sum += array[i];
}
I basically want to do the same operations using SSE2. How can I do that?
Here's a very simple SSE3 implementation:
#include <pmmintrin.h> // SSE3 (for _mm_hadd_pd)

__m128d vsum = _mm_set1_pd(0.0);
for (int i = 0; i < n; i += 2)
{
    __m128d v = _mm_load_pd(&a[i]);
    vsum = _mm_add_pd(vsum, v);
}
vsum = _mm_hadd_pd(vsum, vsum);
double sum = _mm_cvtsd_f64(vsum);
You can unroll the loop to get much better performance by using multiple accumulators to hide the latency of FP addition (as suggested by #Mysticial). Unroll 3 or 4 times with multiple "sum" vectors to bottleneck on load and FP-add throughput (one or two per clock cycle) instead of FP-add latency (one per 3 or 4 cycles):
__m128d vsum0 = _mm_setzero_pd();
__m128d vsum1 = _mm_setzero_pd();
for (int i = 0; i < n; i += 4)
{
    __m128d v0 = _mm_load_pd(&a[i]);
    __m128d v1 = _mm_load_pd(&a[i + 2]);
    vsum0 = _mm_add_pd(vsum0, v0);
    vsum1 = _mm_add_pd(vsum1, v1);
}
vsum0 = _mm_add_pd(vsum0, vsum1);   // vertical ops down to one accumulator
vsum0 = _mm_hadd_pd(vsum0, vsum0);  // horizontal add of the single register
double sum = _mm_cvtsd_f64(vsum0);
Note that the array a is assumed to be 16 byte aligned and the number of elements n is assumed to be a multiple of 2 (or 4, in the case of the unrolled loop).
See also Fastest way to do horizontal float vector sum on x86 for alternate ways of doing the horizontal sum outside the loop. SSE3 support is not totally universal (AMD CPUs in particular were later than Intel to support it).
Also, _mm_hadd_pd is usually not the fastest way even on CPUs that support it, so an SSE2-only version won't be worse on modern CPUs. It's outside the loop and doesn't make much difference either way, though.
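For completeness, a possible SSE2-only replacement for the _mm_hadd_pd line (a sketch):

// SSE2-only horizontal sum of the two lanes of vsum0.
__m128d vhigh = _mm_unpackhi_pd(vsum0, vsum0);         // broadcast the upper lane
double sum = _mm_cvtsd_f64(_mm_add_sd(vsum0, vhigh));  // add the lower lanes, extract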