SSE etc. vector programming (SIMD) - c

I'm totally new to SSE programming, but have an Intel Core i7 processor.
Basically, I want to take 4 32-bit unsigned integers and cube them all (raise to the power of 3) at once. It is my understanding that the SIMD functionality of SSE and its successors make this possible, but how in the world do I go about doing it? Preferably in C but I could manage assembly if necessary.
Edit to make clear my final goal:
Then, I want to add all the cubes together to come up with a single number.
Background: I'm just trying to use SSE to optimize figuring out if a number is an Armstrong number (a three-digit number whose sum of each digit cubed is the same as the number itself). An example is 153. There seems to be no way to do this other than brute force. These are a subset of Narcissistic numbers whose sum of all digits to the power of the length of the decimal number are equal to number itself. Hopefully, I'd like to eventually expand it to be more flexible, to start I'm just doing the Armstrong numbers. As you might imagine, this came up on another site and a few of us are trying to optimize the hell out of it. By taking your ideas and my own research, I came up with this code:
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1
__m128i vcube(const __m128i v)
{
return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
int main(int argc, const char * argv[]) {
for (unsigned int i = 1; i <= 500; i++) {
unsigned int firstDigit = i / 100;
unsigned int secondDigit = (i - firstDigit * 100) / 10;
unsigned int thirdDigit = (i - firstDigit * 100 - secondDigit * 10);
__m128i v = _mm_setr_epi32(0, firstDigit, secondDigit, thirdDigit);
__m128 v3 = (__m128) vcube(v);
v3 = _mm_hadd_ps(v3, v3);
v3 = _mm_hadd_ps(v3, v3);
if (_mm_extract_epi32((__m128i) v3, 0) == i)
printf ("%03d is an Armstrong number\n", i);
}
return 0;
}
Note: I had to do some type coercions to get it to compile in some systems (Solaris, at least some Linux).
So this works, but maybe it could be streamlined. Sorry I didn't post the whole task, but I was trying to break it down into steps and I wanted to make sure each digit was correctly cubed.
(END EDIT)
Thank you!
Edit: I guess I should add I'm running Mac OS X Sierra.
EDIT AGAIN:
So, let's say I make these all these unsigned shorts instead of unsigned ints and add more digits, how do I add them together when a short may not be able to hold the sum of all the digits? Is there a way to add them and store in a vector of larger variables if you know what I mean, or a plain larger number such as a UInt64?
Sorry for all the questions, but like I said I'm totally new at vector processing even though I had access to it since my first Mac G4.

If your input values are in the range 0..1625 (so that the result fits in 32 bits) then you can use _mm_mullo_epi32:
__m128i vcube(const __m128i v)
{
return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
Demo:
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1
__m128i vcube(const __m128i v)
{
return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
int main()
{
__m128i v = _mm_setr_epi32(0, 1, 1000, 1625);
__m128i v3 = vcube(v);
printf("%vlu => %vlu\n", v, v3);
return 0;
}
Compile and test:
$ gcc -Wall -Wno-format-invalid-specifier -Wno-format-extra-args -msse4 vcube.c && ./a.out
0 1 1000 1625 => 0 1 1000000000 4291015625

For x<=2642245 you can do x*x*x using the foo_SSE function below using SSE4.1. This takes two 32-bit unsigned intergs as input packed into the upper and lower 64-bits of a SSE register and outputs two 64-bit integers.
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>
__m128i foo_SSE(__m128i x) {
__m128i mask = _mm_set_epi32(-1, 0, -1, 0);
__m128i x2 =_mm_shuffle_epi32(x, 0x80);
__m128i t0 = _mm_mul_epu32(x,x);
__m128i t1 = _mm_mul_epu32(t0,x);
__m128i t2 = _mm_mullo_epi32(t0,x2);
__m128i t3 = _mm_and_si128(t2, mask);
__m128i t4 = _mm_add_epi32(t3, t1);
return t4;
}
int main(void) {
uint64_t k1 = 100000;
uint64_t k2 = 2642245;
__m128i x = _mm_setr_epi32(k1, 0, k2, 0);
uint64_t t[2];
_mm_store_si128((__m128i*)t, foo_SSE(x));
printf("%20" PRIu64 " ", t[0]);
printf("%20" PRIu64 "\n", t[1]);
printf("%20" PRIu64 " ", k1*k1*k1);
printf("%20" PRIu64 "\n", k2*k2*k2);
}
This can probably be improved a bit. I'm a little out of practice.

To get a quick overview about the 3 main stages (loading, operating, storing) see the following snippet. For integers e0 and e1:
#include "emmintrin.h"
__m128i result __attribute__((aligned(16)));
__m128i x = _mm_setr_epi32(0, e1, 0, e0);
__m128i cube = _mm_mul_epu32(x, _mm_mul_epu32(x, x));
_mm_store_si128(&result, cube);
The _mm_mul_epu32 takes the even multiples of 32bits of two _m128i registers, multiplies them and puts the result as 2-tuple of 64bits into the result register.
To get them out of there access either access them through a cast or use your compiler's convenience definition of __m128i, e.g. for icc:
printf("%llu %llu\n", result.m128i_i64[0], result.m128i_i64[1]); /* msc style */
Note: I'm using the Intel Intrinsics guide for SSE primitives.
Edited for clarity about what the code actually does.

Related

fast multiplication of int8 arrays by scalars

I wonder if there is a fast way of multiplying int8 arrays, i.e.
for(i = 0; i < n; ++i)
z[i] = x * y[i];
I see that the Intel intrinsics guide lists several SIMD instructions, such as _mm_mulhi_epi16 and _mm_mullo_epi16 that do something like this for int16. Is there something similar for int8 that I'm missing?
Breaking the input into low & hi, one can
__m128i const kff00ff00 = _mm_set1_epi32(0xff00ff00);
__m128i lo = _mm_mullo_epi16(y, x);
__m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x);
__m128i z = _mm_blendv_epi8(lo, hi, kff00ff00);
AFAIK, the high bits YY of the YYyy|YYyy|YYyy|YYyy multiplied by 00xx|00xx|00xx|00xx do not interfere with the low 8 bits ??ll, and likewise the product of YY00|YY00 * 00xx|00xx produces the correct 8 bit product at HH00. These two results at the correct alignment need to be blended.
__m128i x = _mm_set1_epi16(scalar_x);, and __m128i y = _mm_loadu_si128(...);
An alternative is to use shufb calculating LutLo[y & 15] + LutHi[y >> 4], where unfortunately the shift must be also emulated by _mm_and_si128(_mm_srli_epi16(y,4),_mm_set1_epi8(15)).

Better way to store or extract scalar int result using SSE2 intrinsic

I'm wondering how load and store efficiently vars when working with SSE2.
In this example, I want to bench the pclmulqdq instruction (carry less multiplication, useful for polynomial arithmetic) vs plain C function, so I need the same "calling convention" that a standard function.
a and b are 16 significant bits, result will have 32 significant bits
#include <wmmintrin.h>
int GFpoly_mul_i(int a, int b) {
__m128i xa = _mm_loadu_si128( (__m128i*) a);
__m128i xb = _mm_loadu_si128((__m128i*) b);
__m128i r = _mm_clmulepi64_si128(xa, xb, 0);
_MM_ALIGN16 int result[4];
__m128i* ptr_result = (__m128i*)result;
_mm_store_si128(ptr_result, r);
return result[0];
}
Extracting the 32bit integer from the lowest part of a vector can be done easily with _mm_cvtsi128_si32:
return _mm_cvtsi128_si32(r);
Loading a 32bit integer into the lowest part of a vector can be done with the "opposite" operation, _mm_cvtsi32_si128:
__m128i xa = _mm_cvtsi32_si128(a);
Loading the integer a into a vector cannot be done with _mm_loadu_si128( (__m128i*) a), this would cast a to a pointer and dereference it (reading a 128bit vector), but a is just an integer value and doesn't point anywhere useful, except perhaps by accident.

How to generate a uniformly distributed random double in C using libsodium in range [-a,a]?

The libsodium library has a function
uint32_t randombytes_uniform(const uint32_t upper_bound);
but obviously this returns an unsigned integer. Can I somehow use this to generate a uniformly distributed random double in range [-a,a] where a is also a double given by the user ? I am especially focused on the result being uniformly distributed/unbiased, so that is why I would like to use the libsodium library.
const uint32_t mybound = 1000000000; // Example
const uint32_t x = randombytes_uniform(mybound);
const double a = 3.5; // Example
const double variate = a * ( (2.0 * x / mybound) - 1);
Let me try to to do it step-by-step.
First, you obviously need to combine two calls to get up to 64bit of randomness for one double value output.
Second, you convert it to [0...1] interval. There are several ways to do it, all of the are good in some sense or another, I prefer uniform random dyadic rationals in the form n*2-53, see here for details. You could try other methods listed above as well. NB: methods in the link produce results in [0...1) range, I've tried to do acceptance/rejection to get closed [0...1] range.
Last, I scale result into desired range.
Sorry, C++ only but it is trivial to convert to C
#include <stdint.h>
#include <math.h>
#include <iostream>
#include <random>
// emulate libsodium RNG, valid for full 32bits result only!
static uint32_t randombytes_uniform(const uint32_t upper_bound) {
static std::mt19937 mt{9876713};
return mt();
}
// get 64bits from two 32bit numbers
static inline uint64_t rng() {
return (uint64_t)randombytes_uniform(UINT32_MAX) << 32 | randombytes_uniform(UINT32_MAX);
}
const int32_t bits_in_mantissa = 53;
const uint64_t max = (1ULL << bits_in_mantissa);
const uint64_t mask = (1ULL << (bits_in_mantissa+1)) - 1;
static double rnd(double a, double b) {
uint64_t r;
do {
r = rng() & mask; // get 54 random bits, need 53 or max
} while (r > max);
double v = ldexp( (double)r, -bits_in_mantissa ); // http://xoshiro.di.unimi.it/random_real.c
return a + (b-a)*v;
}
int main() {
double a = -3.5;
double b = 3.5;
for(int k = 0; k != 100; ++k)
std::cout << rnd(a, b) << '\n';
return 0;
}
First recognizing that finding a random number [0...a] is a sufficient step, followed by a coin flip for +/-.
Step 2. Find the expo such that a < 2**expo or ceil(log2(a)).
int sign;
do {
int exp;
frexp(a, &exp);
Step 3. Form an integral 63-bit random number [0...0x7FFF_FFFF_FFFF_FFFF] and random sign. The 63 should be at least as wide as the precision of a double - which is often 53 bits. At this point r is certainly uniform.
unit64_t r = randombytes_uniform(0xFFFFFFFF);
r <<= 32;
r |= randombytes_uniform(0xFFFFFFFF);
// peel off one bit for sign
sign = r & 1;
r >>= 1;
Step 4. Scale and test if in range. Repeat as needed.
double candidate = ldexp(r/pow(2 63), expo);
} while (candidate > a);
Step 5. Apply the sign.
if (sign) {
candidate = -candidate;
}
return candidate;
Avoid (2.0 * x / a) - 1 as the calculation is not symmetric about 0.0.
Code would benefit with improvements to deal with a near DBL_MAX.
Some rounding issues apply that this answer glosses over, yet the distribution remains uniform - except potentially at the edges.

Computing determinant and using division and multiplication of uint64_t C type

This question has been edited to clarify it.
I have the following matrix, defined in page 1 of [reteam.org/papers/e59.pdf], written in R notation:
m1 = matrix(c(207560540,956631177,1,956631177,2037688522,1,2037688522,1509348670,1),ncol=3,byrow=T)
The determinant of m1 should be a integer multiple of 2^31 -1.
As indicated in accepted answer, det(m1) should be -1564448852668574749.
However, in R, I got
> det(m1)
[1] -1564448852668573184
and, using a simple equation by hand:
> m1[1,1]*(m1[2,2]-m1[3,2]) - m1[2,1]*(m1[1,2] - m1[3,2]) + m1[3,1]*(m1[1,2]- m1[2,2])
[1] -1564448852668574720
As indicated in accepted answer, the correct determinant is obtained
and checked by:
#include <inttypes.h>
#include <stdio.h>
int main() {
int64_t m1[3][3] = {{INT64_C(207560540) , INT64_C(956631177) , INT64_C(1)},{ INT64_C(956631177), INT64_C(2037688522), INT64_C(1)},{INT64_C(2037688522), INT64_C(1509348670) , INT64_C(1)}};
int64_t dm1 = m1[0][0]*(m1[1][1]-m1[2][1]) - m1[1][0]*(m1[0][1] - m1[2][1]) + m1[2][0]*(m1[0][1]- m1[1][1]);
int64_t divisor = (INT64_C(1)<<31) -1;
int64_t tmp = dm1/divisor;
int64_t check = tmp * divisor;
printf("dm1 == %" PRIu64"\n",dm1);
printf("(dm1/(2^31 -1))* %" PRIu64 " == %" PRIu64 "\n", divisor, check);
}
The following text is the old question. The main error was using a unsigned type.
My old minimum non-working code example was:
#include <inttypes.h>
#include <stdio.h>
int main() {
uint64_t dm1 = 1564448852668573184;
uint64_t divisor = (UINT64_C(1)<<31) -1; //powl(2,31)-1;
uint64_t tmp = dm1/divisor;
uint64_t check = tmp*divisor;
printf("dm1 == %" PRIu64"\n",dm1);
printf("(dm1/(2^31 -1))* %" PRIu64 " == %" PRIu64 "\n", divisor, check);
}
Its output is
dm1 == 1564448852668573184
(dm1/(2^31 -1))* 2147483647 == 1564448850521091102
The problem is that the value of the second line should be equal to the one in the first line.
What is my mistake? How can I make this work?
The numerator is not an exact multiple of the divisor, so there is a remainder, and the quotient is truncated.
1564448852668573184 / 2147483647 = 728503266 remainder 2147482082
Multiplying back,
2147483647 * 728503266 + 2147482082 = 1564448852668573184
EDIT:
The determinant of the 3x3 matrix shown in your linked reference is -1564448852668574749. This is exactly divisible by 2147483647 to give -728503267.
So you have an arithmetic overflow somewhere.
ANSWER:
The value of the matrix determinant in your linked example is negative. Please use int64_t instead of uint64_t.
For the same reason that (using integers) you get 10 / 3 * 3 = 9. There is a remainder of the division. 10 / 3 = 3, with remainder 1. When you multiply by 3, the remainder is lost, so you get 9 instead of 10.
In this case, your remainder is 2147482082, which, when added to 1564448850521091102, gives 1564448852668573184. Try this:
uint64_t dm1 = 1564448852668573184;
uint64_t divisor = (UINT64_C(1)<<31) -1; //powl(2,31)-1;
uint64_t tmp = dm1/divisor;
uint64_t remainder = dm1%divisor;
uint64_t check = tmp*divisor+remainder;
And you should get the correct result.

How to perform uint32/float conversion with SSE?

In SSE there is a function _mm_cvtepi32_ps(__m128i input) which takes input vector of 32 bits wide signed integers (int32_t) and converts them into floats.
Now, I want to interpret input integers as not signed. But there is no function _mm_cvtepu32_ps and I could not find an implementation of one. Do you know where I can find such a function or at least give a hint on the implementation?
To illustrate the the difference in results:
unsigned int a = 2480160505; // 10010011 11010100 00111110 11111001
float a1 = a; // 01001111 00010011 11010100 00111111;
float a2 = (signed int)a; // 11001110 11011000 01010111 10000010
With Paul R's solution and with my previous solution
the difference between the rounded floating point and the original integer is less than or equal to
0.75 ULP (Unit in the Last Place). In these methods
at two places rounding may occur: in _mm_cvtepi32_ps and
in _mm_add_ps. This leads to results that are not as accurate as possible for some inputs.
For example, with Paul R's method 0x2000003=33554435 is converted to 33554432.0, but 33554436.0
also exists as a float, which would have been better here.
My previous solution suffers from similar inaccuracies.
Such inaccurate results may also occur with compiler generated code, see here.
Following the approach of gcc (see Peter Cordes' answer to that other SO question), an accurate conversion within 0.5 ULP is obtained:
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
__m128i msk_lo = _mm_set1_epi32(0xFFFF);
__m128 cnst65536f= _mm_set1_ps(65536.0f);
__m128i v_lo = _mm_and_si128(v,msk_lo); /* extract the 16 lowest significant bits of v */
__m128i v_hi = _mm_srli_epi32(v,16); /* 16 most significant bits of v */
__m128 v_lo_flt = _mm_cvtepi32_ps(v_lo); /* No rounding */
__m128 v_hi_flt = _mm_cvtepi32_ps(v_hi); /* No rounding */
v_hi_flt = _mm_mul_ps(cnst65536f,v_hi_flt); /* No rounding */
return _mm_add_ps(v_hi_flt,v_lo_flt); /* Rounding may occur here, mul and add may fuse to fma for haswell and newer */
} /* _mm_add_ps is guaranteed to give results with an error of at most 0.5 ULP */
Note that other high bits/low bits partitions are possible as long as _mm_cvt_ps can convert
both pieces to floats without rounding.
For example, a partition with 20 high bits and 12 low bits will work equally well.
This functionality exists in AVX-512, but if you can't wait until then the only thing I can suggest is to convert the unsigned int input values into pairs of smaller values, convert these, and then add them together again, e.g.
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
__m128i v2 = _mm_srli_epi32(v, 1); // v2 = v / 2
__m128i v1 = _mm_sub_epi32(v, v2); // v1 = v - (v / 2)
__m128 v2f = _mm_cvtepi32_ps(v2);
__m128 v1f = _mm_cvtepi32_ps(v1);
return _mm_add_ps(v2f, v1f);
}
UPDATE
As noted by #wim in his answer, the above solution fails for an input value of UINT_MAX. Here is a more robust, but slightly less efficient solution, which should work for the full uint32_t input range:
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
__m128i v2 = _mm_srli_epi32(v, 1); // v2 = v / 2
__m128i v1 = _mm_and_si128(v, _mm_set1_epi32(1)); // v1 = v & 1
__m128 v2f = _mm_cvtepi32_ps(v2);
__m128 v1f = _mm_cvtepi32_ps(v1);
return _mm_add_ps(_mm_add_ps(v2f, v2f), v1f); // return 2 * v2 + v1
}
I think Paul's answer is nice, but it fails for v=4294967295U (=2^32-1). In that case v2=2^31-1 and v1=2^31. Intrinsic _mm_cvtepi32_ps converts 2^31 to -2.14748365E9 . v2=2^31-1 is converted to 2.14748365E9 and consequently _mm_add_ps returns 0 (due to rounding v1f and v2f are the exact opposite of each other).
The idea of the solution below is to copy the most significant bit of v to v_high. The other bits of v are copied to v_low. v_high is converted to 0 or 2.14748365E9 .
inline __m128 _mm_cvtepu32_v3_ps(const __m128i v)
{
__m128i msk0=_mm_set1_epi32(0x7FFFFFFF);
__m128i zero=_mm_xor_si128(msk0,msk0);
__m128i cnst2_31=_mm_set1_epi32(0x4F000000); /* IEEE representation of float 2^31 */
__m128i v_high=_mm_andnot_si128(msk0,v);
__m128i v_low=_mm_and_si128(msk0,v);
__m128 v_lowf=_mm_cvtepi32_ps(v_low);
__m128i msk1=_mm_cmpeq_epi32(v_high,zero);
__m128 v_highf=_mm_castsi128_ps(_mm_andnot_si128(msk1,cnst2_31));
__m128 v_sum=_mm_add_ps(v_lowf,v_highf);
return v_sum;
}
Update
It was possible to reduce the number of instructions:
inline __m128 _mm_cvtepu32_v4_ps(const __m128i v)
{
__m128i msk0=_mm_set1_epi32(0x7FFFFFFF);
__m128i cnst2_31=_mm_set1_epi32(0x4F000000);
__m128i msk1=_mm_srai_epi32(v,31);
__m128i v_low=_mm_and_si128(msk0,v);
__m128 v_lowf=_mm_cvtepi32_ps(v_low);
__m128 v_highf=_mm_castsi128_ps(_mm_and_si128(msk1,cnst2_31));
__m128 v_sum=_mm_add_ps(v_lowf,v_highf);
return v_sum;
}
Intrinsic _mm_srai_epi32 shifts the most significant bit of v to the right, while shifting in sign bits, which turns out to be quite useful here.

Resources