SSE parallelization - c

Hi, I am trying to improve the performance of this code, supposing that I have a machine capable of handling 4 threads. I first thought about making it OpenMP parallel, but then I saw that this function is called inside a for loop, so creating threads that many times did not seem very efficient. So I would like to know how to implement it with SSE, which would be more efficient:
unsigned char cubicInterpolate_paralelo(unsigned char p[4], unsigned char x) {
    unsigned char resultado;
    unsigned char intermedio;
    intermedio = + x*(3.0*(p[1] - p[2]) + p[3] - p[0]);

    resultado = p[1] + 0.5 * x * (p[2] - p[0] + x*(2.0*p[0] - 5.0*p[1] + 4.0*p[2] - p[3] + x*(3.0*(p[1] - p[2]) + p[3] - p[0])));
    return resultado;
}
unsigned char bicubicInterpolate_paralelo(unsigned char p[4][4], unsigned char x, unsigned char y) {
    unsigned char arr[4], valorPixelCanal;
    arr[0] = cubicInterpolate_paralelo(p[0], y);
    arr[1] = cubicInterpolate_paralelo(p[1], y);
    arr[2] = cubicInterpolate_paralelo(p[2], y);
    arr[3] = cubicInterpolate_paralelo(p[3], y);
    valorPixelCanal = cubicInterpolate_paralelo(arr, x);
    return valorPixelCanal;
}
This is used inside some nested for loops:
for(i=0; i<z_img.width(); i++) {
    for(j=0; j<z_img.height(); j++) {
        // For R, G, B
        for(c=0; c<3; c++) {
            for(l=0; l<4; l++){
                for(k=0; k<4; k++){
                    arr[l][k] = img(i/zFactor +l, j/zFactor +k, 0, c);
                }
            }
            color[c] = bicubicInterpolate_paralelo(arr, (unsigned char)(i%zFactor)/zFactor, (unsigned char)(j%zFactor)/zFactor);
        }
        z_img.draw_point(i,j,color);
    }
}

I've taken some liberties with the code, so you may have to change it significantly, but here's an (untested) transliteration to SSE:
__m128i x = _mm_unpacklo_epi8(_mm_loadl_epi64(x_array), _mm_setzero_si128());
__m128i p0 = _mm_unpacklo_epi8(_mm_loadl_epi64(p0_array), _mm_setzero_si128());
__m128i p1 = _mm_unpacklo_epi8(_mm_loadl_epi64(p1_array), _mm_setzero_si128());
__m128i p2 = _mm_unpacklo_epi8(_mm_loadl_epi64(p2_array), _mm_setzero_si128());
__m128i p3 = _mm_unpacklo_epi8(_mm_loadl_epi64(p3_array), _mm_setzero_si128());
__m128i t = _mm_sub_epi16(p1, p2);
t = _mm_add_epi16(_mm_add_epi16(t, t), t); // 3 * (p[1] - p[2])
__m128i intermedio = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t, p3), p0));
t = _mm_add_epi16(p1, _mm_slli_epi16(p1, 2)); // 5 * p[1]
// t2 = 2 * p[0] + 4 * p[2]
__m128i t2 = _mm_add_epi16(_mm_add_epi16(p0, p0), _mm_slli_epi16(p2, 2));
t = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t2, intermedio), _mm_add_epi16(t, p3)));
t = _mm_mullo_epi16(x, _mm_add_epi16(_mm_sub_epi16(p2, p0), t));
__m128i resultado = _mm_add_epi16(p1, _mm_srli_epi16(t, 1));
return resultado;
The 16-bit intermediates that I use should be wide enough. The only way for information from the high bits to affect the low bits in this code is the right shift by 1 (the 0.5 * in your code), so really we only need 9 bits; the rest cannot affect the result. Bytes wouldn't be wide enough (unless you have some extra guarantees that I don't know about), but they would be annoying anyway because there is no nice way to multiply them.
I pretended for simplicity that the input takes the form of contiguous arrays of x's, p[0]'s, etc. That's not what you need here, but I didn't have time to work out all the loading and shuffling.
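For what it's worth, here is a sketch of that snippet wrapped into a callable function under the same contiguous-array assumption; the function name, the pointer casts, and the final mask-and-pack step that narrows the 16-bit results back to bytes are additions, not part of the snippet above:
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

static inline void cubicInterpolate_sse_8(const uint8_t *x_array,
                                          const uint8_t *p0_array,
                                          const uint8_t *p1_array,
                                          const uint8_t *p2_array,
                                          const uint8_t *p3_array,
                                          uint8_t *out)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i x  = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)x_array),  zero);
    __m128i p0 = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)p0_array), zero);
    __m128i p1 = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)p1_array), zero);
    __m128i p2 = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)p2_array), zero);
    __m128i p3 = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)p3_array), zero);

    __m128i t = _mm_sub_epi16(p1, p2);
    t = _mm_add_epi16(_mm_add_epi16(t, t), t);                                 /* 3 * (p[1] - p[2]) */
    __m128i intermedio = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t, p3), p0));
    t = _mm_add_epi16(p1, _mm_slli_epi16(p1, 2));                              /* 5 * p[1] */
    __m128i t2 = _mm_add_epi16(_mm_add_epi16(p0, p0), _mm_slli_epi16(p2, 2));  /* 2 * p[0] + 4 * p[2] */
    t = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t2, intermedio), _mm_add_epi16(t, p3)));
    t = _mm_mullo_epi16(x, _mm_add_epi16(_mm_sub_epi16(p2, p0), t));
    __m128i resultado = _mm_add_epi16(p1, _mm_srli_epi16(t, 1));

    /* Keep only the low byte of each 16-bit lane (the truncation that the scalar
       unsigned char assignment performs), then pack 8 results into 8 bytes. */
    __m128i low = _mm_and_si128(resultado, _mm_set1_epi16(0x00ff));
    _mm_storel_epi64((__m128i *)out, _mm_packus_epi16(low, low));
}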

SSE is quite unrelated to threads. A single thread executes a single instruction at a time; with SSE that single instruction may apply to 4 or 8 sets of arguments at a time. So with multiple threads you can also run multiple SSE instructions to process even more data.
You can use threads with for loops; just don't create them inside the loop. Instead, take the outer for(i=0; i<z_img.width(); i++) loop and split it into 4 bands of width/4: thread 0 gets 0..width/4, thread 1 gets width/4..width/2, and so on.
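A minimal sketch of that band split, assuming OpenMP is available and that drawing distinct (i, j) points is independent (z_img and the loop body are the ones from the question):
#include <omp.h>

#pragma omp parallel num_threads(4)
{
    int width = z_img.width();
    int band  = omp_get_thread_num();   // 0..3
    int begin = band * width / 4;       // thread 0: 0..width/4, thread 1: width/4..width/2, ...
    int end   = (band + 1) * width / 4;
    for (int i = begin; i < end; i++) {
        // ... the j / c / l / k loops from the question, unchanged ...
    }
}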
On an unrelated note, your code also suffers from mixing floating-point and integer math. 0.5 * x is not nearly as efficient as x/2.

Using OpenMP, you could try adding the #pragma to the outer-most for loop. This should solve your problem.
Going the SSE route is trickier because of the extra alignment restrictions on data, but the easiest transform would be to extend cubicInterpolate_paralelo to handle multiple calculations at once. With enough luck, telling the compiler to use SSE will do the trick for you, but to make sure, you could use intrinsic functions and types.
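For example (only a sketch; the per-pixel temporaries have to be private so the threads don't share them):
#pragma omp parallel for private(j, c, l, k, arr, color)
for (i = 0; i < z_img.width(); i++) {
    for (j = 0; j < z_img.height(); j++) {
        // ... same body as in the question ...
    }
}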

Related

fast multiplication of int8 arrays by scalars

I wonder if there is a fast way of multiplying int8 arrays, i.e.
for (i = 0; i < n; ++i)
    z[i] = x * y[i];
I see that the Intel intrinsics guide lists several SIMD instructions, such as _mm_mulhi_epi16 and _mm_mullo_epi16 that do something like this for int16. Is there something similar for int8 that I'm missing?
Breaking the input into low and high bytes, one can do:
__m128i const kff00ff00 = _mm_set1_epi32(0xff00ff00);
__m128i lo = _mm_mullo_epi16(y, x);
__m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x);
__m128i z = _mm_blendv_epi8(lo, hi, kff00ff00);
AFAIK, the high bits YY of the YYyy|YYyy|YYyy|YYyy multiplied by 00xx|00xx|00xx|00xx do not interfere with the low 8 bits ??ll, and likewise the product of YY00|YY00 * 00xx|00xx produces the correct 8 bit product at HH00. These two results at the correct alignment need to be blended.
Here, __m128i x = _mm_set1_epi16(scalar_x); and __m128i y = _mm_loadu_si128(...);
An alternative is to use pshufb (_mm_shuffle_epi8), calculating LutLo[y & 15] + LutHi[y >> 4], where unfortunately the shift must also be emulated by _mm_and_si128(_mm_srli_epi16(y,4), _mm_set1_epi8(15)).
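Putting the mullo/blend variant together, a sketch of a complete loop might look like this (the function name is made up; n is assumed to be a multiple of 16, and the result is truncated to 8 bits exactly like the scalar loop):
#include <smmintrin.h> /* SSE4.1, for _mm_blendv_epi8 */
#include <stdint.h>

static void mul_int8_by_scalar(int8_t *z, const int8_t *y, int8_t x, int n)
{
    const __m128i mask = _mm_set1_epi32(0xff00ff00);
    const __m128i vx   = _mm_set1_epi16((uint8_t)x);   /* 00xx in every 16-bit lane */
    for (int i = 0; i < n; i += 16) {
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i lo = _mm_mullo_epi16(vy, vx);                        /* low byte of each lane is correct */
        __m128i hi = _mm_mullo_epi16(_mm_and_si128(vy, mask), vx);   /* high byte of each lane is correct */
        _mm_storeu_si128((__m128i *)(z + i), _mm_blendv_epi8(lo, hi, mask));
    }
}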

SSE etc. vector programming (SIMD)

I'm totally new to SSE programming, but have an Intel Core i7 processor.
Basically, I want to take 4 32-bit unsigned integers and cube them all (raise to the power of 3) at once. It is my understanding that the SIMD functionality of SSE and its successors make this possible, but how in the world do I go about doing it? Preferably in C but I could manage assembly if necessary.
Edit to make clear my final goal:
Then, I want to add all the cubes together to come up with a single number.
Background: I'm just trying to use SSE to optimize figuring out whether a number is an Armstrong number (a three-digit number whose sum of each digit cubed is the same as the number itself). An example is 153. There seems to be no way to do this other than brute force. These are a subset of narcissistic numbers, whose sum of all digits raised to the power of the length of the decimal number is equal to the number itself. Hopefully I'd like to eventually expand it to be more flexible; to start I'm just doing the Armstrong numbers. As you might imagine, this came up on another site and a few of us are trying to optimize the hell out of it. Taking your ideas plus my own research, I came up with this code:
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1

__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

int main(int argc, const char * argv[]) {
    for (unsigned int i = 1; i <= 500; i++) {
        unsigned int firstDigit = i / 100;
        unsigned int secondDigit = (i - firstDigit * 100) / 10;
        unsigned int thirdDigit = (i - firstDigit * 100 - secondDigit * 10);
        __m128i v = _mm_setr_epi32(0, firstDigit, secondDigit, thirdDigit);
        __m128 v3 = (__m128) vcube(v);
        v3 = _mm_hadd_ps(v3, v3);
        v3 = _mm_hadd_ps(v3, v3);
        if (_mm_extract_epi32((__m128i) v3, 0) == i)
            printf("%03d is an Armstrong number\n", i);
    }
    return 0;
}
Note: I had to do some type coercions to get it to compile on some systems (Solaris, and at least some Linux distributions).
So this works, but maybe it could be streamlined. Sorry I didn't post the whole task, but I was trying to break it down into steps and I wanted to make sure each digit was correctly cubed.
(END EDIT)
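For what it's worth, one possible streamlining of the horizontal sum, assuming SSSE3 is available, is to replace the float reinterpretation and _mm_hadd_ps calls with the integer horizontal add:
#include <tmmintrin.h> // SSSE3: _mm_hadd_epi32

__m128i v3 = vcube(v);
v3 = _mm_hadd_epi32(v3, v3);
v3 = _mm_hadd_epi32(v3, v3);
if ((unsigned int)_mm_cvtsi128_si32(v3) == i)
    printf("%03d is an Armstrong number\n", i);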
Thank you!
Edit: I guess I should add I'm running Mac OS X Sierra.
EDIT AGAIN:
So, let's say I make all of these unsigned shorts instead of unsigned ints and add more digits; how do I add them together when a short may not be able to hold the sum of all the digits? Is there a way to add them and store the result in a vector of larger variables, if you know what I mean, or in a plain larger number such as a UInt64?
Sorry for all the questions, but like I said I'm totally new at vector processing even though I've had access to it since my first Mac G4.
If your input values are in the range 0..1625 (so that the result fits in 32 bits) then you can use _mm_mullo_epi32:
__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
Demo:
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1

__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

int main()
{
    __m128i v = _mm_setr_epi32(0, 1, 1000, 1625);
    __m128i v3 = vcube(v);
    printf("%vlu => %vlu\n", v, v3);
    return 0;
}
Compile and test:
$ gcc -Wall -Wno-format-invalid-specifier -Wno-format-extra-args -msse4 vcube.c && ./a.out
0 1 1000 1625 => 0 1 1000000000 4291015625
For x <= 2642245 you can do x*x*x using the foo_SSE function below with SSE4.1. This takes two 32-bit unsigned integers as input, packed into the upper and lower 64-bit halves of an SSE register, and outputs two 64-bit integers.
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>
__m128i foo_SSE(__m128i x) {
    __m128i mask = _mm_set_epi32(-1, 0, -1, 0);
    __m128i x2 = _mm_shuffle_epi32(x, 0x80);
    __m128i t0 = _mm_mul_epu32(x, x);
    __m128i t1 = _mm_mul_epu32(t0, x);
    __m128i t2 = _mm_mullo_epi32(t0, x2);
    __m128i t3 = _mm_and_si128(t2, mask);
    __m128i t4 = _mm_add_epi32(t3, t1);
    return t4;
}

int main(void) {
    uint64_t k1 = 100000;
    uint64_t k2 = 2642245;
    __m128i x = _mm_setr_epi32(k1, 0, k2, 0);
    uint64_t t[2];
    _mm_store_si128((__m128i*)t, foo_SSE(x));
    printf("%20" PRIu64 " ", t[0]);
    printf("%20" PRIu64 "\n", t[1]);
    printf("%20" PRIu64 " ", k1*k1*k1);
    printf("%20" PRIu64 "\n", k2*k2*k2);
}
This can probably be improved a bit. I'm a little out of practice.
To get a quick overview of the 3 main stages (loading, operating, storing), see the following snippet, for integers e0 and e1:
#include "emmintrin.h"
__m128i result __attribute__((aligned(16)));
__m128i x = _mm_set_epi32(0, e1, 0, e0);               /* e0 and e1 in the low 32 bits of each 64-bit half */
__m128i cube = _mm_mul_epu32(x, _mm_mul_epu32(x, x));  /* valid while e*e still fits in 32 bits */
_mm_store_si128(&result, cube);
_mm_mul_epu32 takes the even-indexed 32-bit elements of the two __m128i registers, multiplies them, and puts the results as a pair of 64-bit values into the result register.
To get them out of there, either access them through a cast or use your compiler's convenience definition of __m128i, e.g. for icc:
printf("%llu %llu\n", result.m128i_i64[0], result.m128i_i64[1]); /* msc style */
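A more portable alternative, assuming GCC/Clang-style attributes and C99, is to store the register into a plain array and print from there:
#include <inttypes.h>

uint64_t out[2] __attribute__((aligned(16)));
_mm_store_si128((__m128i *)out, cube);
printf("%" PRIu64 " %" PRIu64 "\n", out[0], out[1]);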
Note: I'm using the Intel Intrinsics guide for SSE primitives.
Edited for clarity about what the code actually does.

SSE SIMD Segmentation Fault when using resulting float

I'm trying to use Intel Intrinsics to perform an operation quickly on a float array. The operations themselves seem to work fine; however, when I try to get the result of the operation into a standard C variable I get a SEGFAULT. If I comment the indicated line below out, the program runs. If I save the result of the indicated line, but do not manipulate it in any way, the program runs fine. It is only when I try to (in any way) interact with the result of _mm_cvtss_f32(C) that my program crashes. Any ideas?
float proc(float *a, float *b, int n, int c, int width) {
    // Operation: SUM: (A - B) ^ 2
    __m128 A, B, C;
    float total = 0;
    for (int d = 0, k = 0; k < c; d += width, k++) {
        for (int i = 0; i < n / 4 * 4; i += 4) {
            A = _mm_load_ps(&a[i + d]);
            B = _mm_load_ps(&b[i + d]);
            C = _mm_sub_ps(A, B);
            C = _mm_mul_ps(C, C);
            C = _mm_hadd_ps(C, C);
            C = _mm_hadd_ps(C, C);
            total += _mm_cvtss_f32(C); // SEGFAULT HERE
        }
        for (int i = n / 4 * 4; i < n; i++) {
            float diff = a[i + d] - b[i + d];
            total += diff * diff;
        }
    }
    return total;
}
Are you sure your program actually crashes at the instruction you cited, or is the compiler just optimizing the rest of the loop away if you remove the _mm_cvtss_f32() line (it doesn't have any other visible side effects)? Potential failure causes would be improper alignment of the a and b arrays since you are using aligned load instructions. Are you sure they are 16-byte aligned? On contemporary Intel hardware, there is very little performance difference between 16-byte aligned and unaligned loads (see the comments on the question above for a discussion of the issue).
I mentioned in my original comment that movaps has a shorter encoding than movups. This is not correct. I was thinking instead of movaps versus movapd, which do the same memory transfer, only they're labeled as being for single-precision and double-precision data, respectively. In practice, they do the same thing, but movaps has a shorter encoding.
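If the arrays turn out not to be 16-byte aligned, a sketch of the two usual fixes (total_floats below is just a placeholder for however many floats the buffers hold):
// Option 1: use unaligned loads, which accept any address
A = _mm_loadu_ps(&a[i + d]);
B = _mm_loadu_ps(&b[i + d]);

// Option 2: allocate the buffers 16-byte aligned in the first place
// (the aligned loads in the inner loop then also require &a[i + d] to stay
//  on a 16-byte boundary, i.e. width should be a multiple of 4)
float *a = (float *)_mm_malloc(total_floats * sizeof(float), 16);
// ... fill and use a ...
_mm_free(a);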

How do I add all elements in an array using SSE2?

Suppose I have a very simple code like:
double array[SIZE_OF_ARRAY];
double sum = 0.0;
for (int i = 0; i < SIZE_OF_ARRAY; ++i)
{
    sum += array[i];
}
I basically want to do the same operations using SSE2. How can I do that?
Here's a very simple SSE3 implementation:
#include <pmmintrin.h> // SSE3 (for _mm_hadd_pd)

__m128d vsum = _mm_set1_pd(0.0);
for (int i = 0; i < n; i += 2)
{
    __m128d v = _mm_load_pd(&a[i]);
    vsum = _mm_add_pd(vsum, v);
}
vsum = _mm_hadd_pd(vsum, vsum);
double sum = _mm_cvtsd_f64(vsum);
You can unroll the loop to get much better performance by using multiple accumulators to hide the latency of FP addition (as suggested by @Mysticial). Unroll 3 or 4 times with multiple "sum" vectors to bottleneck on load and FP-add throughput (one or two per clock cycle) instead of FP-add latency (one per 3 or 4 cycles):
__m128d vsum0 = _mm_setzero_pd();
__m128d vsum1 = _mm_setzero_pd();
for (int i = 0; i < n; i += 4)
{
    __m128d v0 = _mm_load_pd(&a[i]);
    __m128d v1 = _mm_load_pd(&a[i + 2]);
    vsum0 = _mm_add_pd(vsum0, v0);
    vsum1 = _mm_add_pd(vsum1, v1);
}
vsum0 = _mm_add_pd(vsum0, vsum1);   // vertical ops down to one accumulator
vsum0 = _mm_hadd_pd(vsum0, vsum0);  // horizontal add of the single register
double sum = _mm_cvtsd_f64(vsum0);
Note that the array a is assumed to be 16-byte aligned and the number of elements n is assumed to be a multiple of 2 (or 4, in the case of the unrolled loop).
See also Fastest way to do horizontal float vector sum on x86 for alternate ways of doing the horizontal sum outside the loop. SSE3 support is not totally universal (AMD CPUs in particular gained it later than Intel).
Also, _mm_hadd_pd is usually not the fastest way even on CPUs that support it, so an SSE2-only version won't be worse on modern CPUs. It's outside the loop and doesn't make much difference either way, though.
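For reference, a minimal SSE2-only horizontal add of the final accumulator, in case _mm_hadd_pd is unavailable or unwanted:
#include <emmintrin.h> // SSE2 only

static inline double hsum_pd_sse2(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);       // move the upper double into the lower lane
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));  // add the two lanes and extract the result
}

// usage: double sum = hsum_pd_sse2(vsum0);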

Optimizing for speed - 4 dimensional array lookup in C

I have a fitness function that scores the values in an int array based on data that lies in a 4D array. The profiler says this function uses 80% of CPU time (it needs to be called several million times). I can't seem to optimize it further (if that's even possible). Here is the function:
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input) {
    register unsigned int i, score = 0;
    for (i = len - 3; i--; )
        score += lookup_array[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
    return(score);
}
I've tried to flatten the array to a single dimension but there was no improvement in performance. This is running on an IA32 CPU. Any CPU specific optimizations are also helpful.
Thanks
What is the range of the array items? If you can change the array base type to unsigned short or unsigned char, you might get fewer cache misses because a larger portion of the array fits into the cache.
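For example (a sketch of the smaller base type; whether it applies depends on the range of the scores):
/* 26^4 = 456,976 entries: about 1.8 MB as unsigned int, but only about
   0.46 MB as unsigned char (or about 0.9 MB as unsigned short). */
unsigned char lookup_array[26][26][26][26];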
Most of your time probably goes into cache misses. If you can optimize those away, you can get a big performance boost.
Remember that C/C++ arrays are stored in row-major order, so store your data in a way that keeps addresses referenced closely in time close together in memory. For example, it may make sense to store sub-results in a temporary array. Then you could process exactly one row of elements located sequentially. That way the processor cache will always contain the row during iterations and fewer memory operations will be required. However, you might need to modularize your lookup_array function. Maybe even split it into four (by the number of dimensions in your array).
The problem is definitely related to the size of the matrix. You cannot optimize it just by declaring it as a single array, because that is what the compiler does automatically anyway.
Everything depends on the order in which you access the data, namely on the content of the input array.
The only thing you can do is work on locality: read this one, it should give you some inspiration.
By the way, I suggest you replace the input array with four parameters: it will be more intuitive and less error prone.
Good luck
A few suggestions to improve performance:
Parallelise. This is a very easy reduction to program in OpenMP or MPI (see the sketch after this list).
Reorder the data to improve locality. Try sorting input first, for example.
Use streaming (SIMD) processing instructions if the compiler is not already doing it.
About reordering: it would be possible if you flatten the array and use linear coordinates instead.
Another point: compare the theoretical peak performance of your processor (integer operations) with the performance you're getting (do a quick count of the generated assembly instructions, multiply by the length of the input, etc.) and see if there's room for a significant improvement there.
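A minimal sketch of the OpenMP reduction mentioned in the first point (lookup_array is the question's global; the len parameter is added the same way other answers do):
unsigned int get_i_score_omp(const unsigned int *input, unsigned int len) {
    unsigned int score = 0;
    #pragma omp parallel for reduction(+:score)
    for (int i = 0; i < (int)(len - 3); i++)
        score += lookup_array[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
    return score;
}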
I have a couple of suggestions:
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input, unsigned int len) {
    register unsigned int i, score = 0;
    unsigned int *a = input;
    unsigned int *b = input + 1;
    unsigned int *c = input + 2;
    unsigned int *d = input + 3;
    for (i = 0; i < (len - 3); i++, a++, b++, c++, d++)
        score += lookup_array[*a][*b][*c][*d];
    return(score);
}
Or try
for (i = 0; i < (len - 3); i++, a=b, b=c, c=d, d++)
    score += lookup_array[*a][*b][*c][*d];
Also, given that there are only 26 values, why are you putting the input array in terms of unsigned ints? If it were char *input, you'd be using 1/4 as much memory and therefore using 1/4 of the memory bandwidth. Obviously the types of a through d have to match. Similarly, if the score values don't need to be unsigned ints, make the array smaller by using chars or uint16_t.
You might be able to squeeze a bit more out by unrolling the loop in some variation of Duff's device.
Multidimensional arrays often constrain the compiler to one or more multiply operations. This may be slow on some CPUs. A common workaround is to transform the N-dimensional array into an array of pointers to elements of (N-1) dimensions. With a 4-dimensional array this is quite annoying (26 pointers to 26*26 pointers to 26*26*26 rows...), so I suggest you try it and compare the results. It is not guaranteed to be faster: compilers are quite smart in optimizing array accesses, while a chain of indirect accesses has a higher probability of invalidating the cache.
Bye
If lookup_array is mostly zeroes, it could definitely be replaced with a hash table lookup on a smaller array. The inline lookup function could calculate the flat offset from the 4 dimensions ([4,5,6,7] = (4*26*26*26) + (5*26*26) + (6*26) + 7 = 73847). The hash key could just be the lower few bits of the offset (depending on how sparse the array is expected to be). If the offset exists in the hash table, use the value; if it doesn't exist, it's 0...
The loop could also be unrolled using something like this if the input has arbitrary length; only len accesses of input are needed (instead of around len * 4 in the original loop).
register int j, x1, x2, x3, x4;
register unsigned int *p;

p = input;
x1 = *p++;
x2 = *p++;
x3 = *p++;
for (j = (len - 3) / 20; j--; ) {
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    /* that's 20 iterations, add more if you like */
}
for (j = (len - 3) % 20; j--; ) {
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = x2;
    x2 = x3;
    x3 = x4;
}
If you convert it to a flat array of size 26*26*26*26, you only need to lookup the input array once per loop:
unsigned int get_i_score(unsigned int *input)
{
    unsigned int i = len - 3, score = 0, index;
    index = input[i] * 26 * 26 +
            input[i + 1] * 26 +
            input[i + 2];
    while (i--)
    {
        index += input[i] * 26 * 26 * 26;
        score += lookup_array[index];
        index /= 26;
    }
    return score;
}
The additional cost is a multiplication and a division. Whether it ends up being faster in practice - you'll have to test.
(By the way, the register keyword is often ignored by modern compilers - it's usually better to leave register allocation up to the optimiser).
Does the content of the array change much? Perhaps it would be faster to pre-calculate the score, and then modify that pre-calculated score every time the array changes? Similar to how you can materialize a view in SQL using triggers.
Maybe you can eliminate some accesses to the input array by using local variables.
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input, unsigned int len) {
    unsigned int i, score, a, b, c, d;
    score = 0;
    i = len - 3;
    a = input[i + 0];
    b = input[i + 1];
    c = input[i + 2];
    d = 0; /* overwritten before its first use in the loop */
    for (i = len - 3; i-- > 0; ) {
        d = c, c = b, b = a, a = input[i];
        score += lookup_array[a][b][c][d];
    }
    return score;
}
Moving around registers may be faster than accessing memory, although this kind of memory should remain in the innermost cache anyway.
