Qualcomm Hexagon: Optimize MAC loop (C)

I would like to DSP-optimize a simple multiply-accumulate for-loop for the Qualcomm Hexagon. The manual doesn't make it perfectly clear to me how to do that, for either the vector or the non-vector version.
Assume my loop has a length that is a multiple of 4 (e.g., 64), i.e., I want to unroll the loop by a factor of 4. How would I do that? I can use either C intrinsics or asm code, but I don't understand how to do the 4x memory load in the first place.
Here is what my loop might look like in C:
Word32 sum = 0;
Word16 *pointer1, *pointer2;
int i;
for (i = 0; i < 64; i++)
{
    sum += pointer1[i] * pointer2[i];
}
Any suggestions?

Here is a FIR filter implementation that demonstrates how to use Q6_P_vrmpyhacc_PP, the halfword multiply-accumulate. This instruction is described as 'big mac' in the PRM 😉
This instruction is in the scalar core, so it does not require the HVX vector coprocessor.
void FIR08(short_8B_align Input[],
           short_8B_align Coeff[],
           short_8B_align Output[],
           int unused, int ntaps,
           int nsamples)
{
    Word64 *vInput = (Word64 *)Input;
    Word64 *vCoeff = (Word64 *)Coeff;
    Word64 *__restrict vOutput = (Word64 *)Output;
    int i, j;
    Word64 sum0, sum1, sum2, sum3;
    for (i = 0; i < nsamples / 4; i++)
    {
        sum0 = sum1 = sum2 = sum3 = 0;
        for (j = 0; j < ntaps / 4; j++)
        {
            // Load two adjacent 4-halfword vectors so valignb can build
            // the input windows shifted by 1, 2, and 3 samples.
            Word64 vIn1 = vInput[i + j];
            Word64 vIn2 = vInput[i + j + 1];
            Word64 curCoeff = vCoeff[j];
            Word64 curIn;
            curIn = vIn1;
            sum0 = Q6_P_vrmpyhacc_PP(sum0, curIn, curCoeff);
            curIn = Q6_P_valignb_PPI(vIn2, vIn1, 2);  // shift by 1 halfword
            sum1 = Q6_P_vrmpyhacc_PP(sum1, curIn, curCoeff);
            curIn = Q6_P_valignb_PPI(vIn2, vIn1, 4);  // shift by 2 halfwords
            sum2 = Q6_P_vrmpyhacc_PP(sum2, curIn, curCoeff);
            curIn = Q6_P_valignb_PPI(vIn2, vIn1, 6);  // shift by 3 halfwords
            sum3 = Q6_P_vrmpyhacc_PP(sum3, curIn, curCoeff);
        }
        // Pack one 16-bit result from each accumulator into a 4-halfword
        // vector, then scale (arithmetic shift right by 2).
        Word64 curOut = Q6_P_combine_RR(Q6_R_combine_RhRh(sum3, sum2),
                                        Q6_R_combine_RhRh(sum1, sum0));
        vOutput[i] = Q6_P_vasrh_PI(curOut, 2);
    }
}
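For the original question's plain MAC loop (no filtering, just a dot product), a minimal sketch using the same scalar-core intrinsic could look like the following; it assumes 8-byte-aligned input pointers and the hexagon_protos.h intrinsics header from the Hexagon SDK:

#include <hexagon_protos.h>

Word32 dot64(const Word16 *pointer1, const Word16 *pointer2)
{
    // Each Word64 load brings in 4 halfwords, which is exactly the
    // 4x unrolling asked about.
    const Word64 *v1 = (const Word64 *)pointer1;
    const Word64 *v2 = (const Word64 *)pointer2;
    Word64 acc = 0;
    int i;
    for (i = 0; i < 64 / 4; i++)
    {
        // vrmpyh: four 16x16 products summed into one 64-bit accumulator
        acc = Q6_P_vrmpyhacc_PP(acc, v1[i], v2[i]);
    }
    return (Word32)acc;
}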

Related

OpenMP parallel for loop in C running in one thread

I have this part of a code which is reading a 2D array of structs, doing some math on them, and putting the results into a second 2D array:
#pragma omp parallel for private (n, i, j) schedule(dynamic)
for (n = 0; n < frames_read; n++) {
    for (i = 0; i < atoms_total; i++)
    {
        for (j = 0; j < atoms_total; j++)
        {
            if (timestep_array[i][n].atom_id == timestep_array[j][n].atom_id)
            {
                // calculates the vector magnitude and stores it in the created array MSD
                double temp1_x = timestep_array[i][n].normalized_x_position + timestep_array[i][n].x_box;
                double temp2_x = timestep_array[j][n+1].normalized_x_position + timestep_array[j][n+1].x_box;
                double temp3_x = temp2_x - temp1_x;
                double temp4_x = temp3_x * box_bound_x;
                double temp5_x = pow(temp4_x, 2);
                double temp1_y = timestep_array[i][n].normalized_y_position + timestep_array[i][n].y_box;
                double temp2_y = timestep_array[j][n+1].normalized_y_position + timestep_array[j][n+1].y_box;
                double temp3_y = temp2_y - temp1_y;
                double temp4_y = temp3_y * box_bound_y;
                double temp5_y = pow(temp4_y, 2);
                double temp1_z = timestep_array[i][n].normalized_z_position + timestep_array[i][n].z_box;
                double temp2_z = timestep_array[j][n+1].normalized_z_position + timestep_array[j][n+1].z_box;
                double temp3_z = temp2_z - temp1_z;
                double temp4_z = temp3_z * box_bound_z;
                double temp5_z = pow(temp4_z, 2);
                double temp = temp5_x + temp5_y + temp5_z;
                double temp2 = sqrt(temp);
                int atom_number = timestep_array[i][n].atom_id;
                MSD[atom_number][n].msd = sqrt(temp2);
                MSD[atom_number][n].atom_type = timestep_array[i][n].atom_type;
                MSD[atom_number][n].time_in_picoseconds = timestep_array[i][n].timestep / picoseconds;
            }
        }
    }
}
I have tried so many combinations of the #pragma statement (including making many more of the variables private). Nothing has resulted in a.out running in more than one thread. What am I doing wrong?
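One quick way to rule out a build problem (the most common cause of an OpenMP program staying single-threaded is compiling without OpenMP enabled, e.g. gcc without -fopenmp) is a minimal thread-count check, sketched here:

#include <omp.h>
#include <stdio.h>

// Prints the team size inside a parallel region. If this prints 1 on a
// multi-core machine, the program was almost certainly built without
// OpenMP support (the #pragma lines are then silently ignored).
int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("threads: %d\n", omp_get_num_threads());
    }
    return 0;
}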

How does this function change the parameter passed to it?

I seem to be lost with this Fourier transform function. There's a sample program that I have but don't understand. gFFTworksp contains the data and fftFrameSize is simply the frame size of the data. I don't understand how the function is supposed to put the FFT version of the data into fftBuffer if there is no part in the code where fftBuffer is actually edited or manipulated. Thank you in advance!
The function call is this:
static float gFFTworksp[2*MAX_FRAME_LENGTH];
long fftFrameSize;
smbFft(gFFTworksp, fftFrameSize, -1);
The function in question is this:
void smbFft(float *fftBuffer, long fftFrameSize, long sign)
/*
    FFT routine, (C)1996 S.M.Bernsee. Sign = -1 is FFT, 1 is iFFT (inverse)
    Fills fftBuffer[0...2*fftFrameSize-1] with the Fourier transform of the
    time domain data in fftBuffer[0...2*fftFrameSize-1]. The FFT array takes
    and returns the cosine and sine parts in an interleaved manner, ie.
    fftBuffer[0] = cosPart[0], fftBuffer[1] = sinPart[0], asf. fftFrameSize
    must be a power of 2. It expects a complex input signal (see footnote 2),
    ie. when working with 'common' audio signals our input signal has to be
    passed as {in[0],0.,in[1],0.,in[2],0.,...} asf. In that case, the transform
    of the frequencies of interest is in fftBuffer[0...fftFrameSize].
*/
{
    float wr, wi, arg, *p1, *p2, temp;
    float tr, ti, ur, ui, *p1r, *p1i, *p2r, *p2i;
    long i, bitm, j, le, le2, k;

    /* bit-reversal reordering of the complex pairs, in place */
    for (i = 2; i < 2*fftFrameSize-2; i += 2) {
        for (bitm = 2, j = 0; bitm < 2*fftFrameSize; bitm <<= 1) {
            if (i & bitm) j++;
            j <<= 1;
        }
        if (i < j) {
            p1 = fftBuffer+i; p2 = fftBuffer+j;
            temp = *p1; *(p1++) = *p2;
            *(p2++) = temp; temp = *p1;
            *p1 = *p2; *p2 = temp;
        }
    }

    /* log2(fftFrameSize) butterfly passes, also in place */
    for (k = 0, le = 2; k < (long)(log(fftFrameSize)/log(2.)+.5); k++) {
        le <<= 1;
        le2 = le>>1;
        ur = 1.0;
        ui = 0.0;
        arg = M_PI / (le2>>1);
        wr = cos(arg);
        wi = sign*sin(arg);
        for (j = 0; j < le2; j += 2) {
            p1r = fftBuffer+j; p1i = p1r+1;
            p2r = p1r+le2; p2i = p2r+1;
            for (i = j; i < 2*fftFrameSize; i += le) {
                tr = *p2r * ur - *p2i * ui;
                ti = *p2r * ui + *p2i * ur;
                *p2r = *p1r - tr; *p2i = *p1i - ti;
                *p1r += tr; *p1i += ti;
                p1r += le; p1i += le;
                p2r += le; p2i += le;
            }
            tr = ur*wr - ui*wi;
            ui = ur*wi + ui*wr;
            ur = tr;
        }
    }
}
In the following line:
p1 = fftBuffer+i; p2 = fftBuffer+j;
p1 and p2 become pointers into the fftBuffer array (at offsets i and j). And in these lines:
temp = *p1; *(p1++) = *p2;
*(p2++) = temp; temp = *p1;
*p1 = *p2; *p2 = temp;
the values at those memory locations are swapped. Since fftBuffer is a pointer to the caller's array (gFFTworksp decays to a pointer when passed), every write through p1, p2, p1r, p1i, p2r, and p2i modifies the caller's data in place. That is how the function "changes" its parameter without ever assigning to fftBuffer itself.
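The same mechanism in miniature (the negate helper is hypothetical, not part of the original program): an array argument decays to a pointer, so the callee writes straight into the caller's storage and no copy is made:

#include <stdio.h>

static void negate(float *buf, long n)
{
    for (long i = 0; i < n; i++)
        buf[i] = -buf[i];    // writes go to the caller's array
}

int main(void)
{
    float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    negate(data, 4);         // 'data' decays to &data[0]
    printf("%f\n", data[0]); // prints -1.000000
    return 0;
}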

Computing Hamming distances to several strings with SSE

I have n (8-bit) character strings, all of the same length (say m), and another string s of the same length. I need to compute the Hamming distance from s to each of the other strings. In plain C, something like:
unsigned char strings[n][m];
unsigned char s[m];
int distances[n];

for (i = 0; i < n; i++) {
    distances[i] = 0;
    for (j = 0; j < m; j++) {
        if (strings[i][j] != s[j])
            distances[i]++;
    }
}
I would like to use SIMD instructions with gcc to perform such computations more efficiently. I have read that PCMPISTRI in SSE 4.2 can be useful and my target computer supports that instruction set, so I would prefer a solution using SSE 4.2.
EDIT:
I wrote the following function to compute the Hamming distance between two strings:
static inline int popcnt128(__m128i n) {
    const __m128i n_hi = _mm_unpackhi_epi64(n, n);
    return _mm_popcnt_u64(_mm_cvtsi128_si64(n)) + _mm_popcnt_u64(_mm_cvtsi128_si64(n_hi));
}

int HammingDist(const unsigned char *p1, const unsigned char *p2, const int len) {
#define MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY)
    __m128i smm1 = _mm_loadu_si128((__m128i*) p1);
    __m128i smm2 = _mm_loadu_si128((__m128i*) p2);
    __m128i ResultMask;
    int iters = len / 16;
    int diffs = 0;
    int i;
    for (i = 0; i < iters; i++) {
        ResultMask = _mm_cmpestrm(smm1, 16, smm2, 16, MODE);
        diffs += popcnt128(ResultMask);
        p1 = p1 + 16;
        p2 = p2 + 16;
        smm1 = _mm_loadu_si128((__m128i*) p1);
        smm2 = _mm_loadu_si128((__m128i*) p2);
    }
    int mod = len % 16;
    if (mod > 0) {
        ResultMask = _mm_cmpestrm(smm1, mod, smm2, mod, MODE);
        diffs += popcnt128(ResultMask);
    }
    return diffs;
}
So I can solve my problem by means of:
for (i = 0; i < n; i++) {
    distances[i] = HammingDist(s, strings[i], m);
}
Is this the best I can do or can I use the fact that one of the strings compared is always the same? In addition, should I do some alignment on my arrays to improve performance?
ANOTHER ATTEMPT
Following Harold's recommendation, I have written the following code:
void _SSE_hammingDistances(const ByteP str, const ByteP strings, int *ds, const int n, const int m) {
    int iters = m / 16;
    __m128i *smm1, *smm2, diffs;
    for (int j = 0; j < n; j++) {
        smm1 = (__m128i*) str;
        smm2 = (__m128i*) &strings[j*(m+1)]; // m+1, as strings are '\0' terminated
        diffs = _mm_setzero_si128();
        for (int i = 0; i < iters; i++) {
            diffs = _mm_add_epi8(diffs, _mm_cmpeq_epi8(*smm1, *smm2));
            smm1 += 1;
            smm2 += 1;
        }
        int s = m;
        signed char *ptr = (signed char *) &diffs;
        for (int p = 0; p < 16; p++) {
            s += *ptr;
            ptr++;
        }
        *ds = s;
        ds++;
    }
}
but I am not able to do the final addition of bytes in __m128i by using psadbw. Can anyone please help me with that?
Here's an improved version of your latest routine, which uses PSADBW (_mm_sad_epu8) to eliminate the scalar code:
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

void hammingDistances_SSE(const uint8_t * str, const uint8_t * strings, int * const ds, const int n, const int m)
{
    const int iters = m / 16;
    assert((m & 15) == 0); // m must be a multiple of 16
    for (int j = 0; j < n; j++)
    {
        const uint8_t *p1 = str;
        const uint8_t *p2 = &strings[j * (m + 1)]; // m+1, as strings are '\0' terminated
        __m128i diffs = _mm_setzero_si128();
        for (int i = 0; i < iters; i++)
        {
            const __m128i smm1 = _mm_loadu_si128((const __m128i*)p1);
            const __m128i smm2 = _mm_loadu_si128((const __m128i*)p2);
            // cmpeq yields 0xFF (-1) per matching byte, so subtracting
            // accumulates a per-byte count of matches
            diffs = _mm_sub_epi8(diffs, _mm_cmpeq_epi8(smm1, smm2));
            p1 += 16;
            p2 += 16;
        }
        // horizontal sum of the 16 byte counters via PSADBW
        diffs = _mm_sad_epu8(diffs, _mm_setzero_si128());
        ds[j] = m - (_mm_extract_epi16(diffs, 0) + _mm_extract_epi16(diffs, 4));
    }
}
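If the PSADBW step looks opaque, here is a tiny standalone illustration (not from the original answer): summing byte absolute differences against zero collapses each 8-byte half into a 16-bit total, and the two halves are then read out of word lanes 0 and 4:

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    const __m128i v = _mm_set1_epi8(3);  // sixteen bytes, each holding 3
    const __m128i s = _mm_sad_epu8(v, _mm_setzero_si128());
    const int total = _mm_extract_epi16(s, 0) + _mm_extract_epi16(s, 4);
    printf("%d\n", total);               // prints 48 (16 * 3)
    return 0;
}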

compare buffers as fast as possible

I need to compare two buffers chunk-wise for equality. I don't need information about the relation of the two buffers, just whether each pair of chunks is equal or not. My Intel machine supports up to SSE4.2.
The naive approach is:
const size_t CHUNK_SIZE = 16; // 128 bits for SSE2 integer registers
const int ARRAY_SIZE = 200000000;
char* array_1 = (char*)_aligned_malloc(ARRAY_SIZE, 16);
char* array_2 = (char*)_aligned_malloc(ARRAY_SIZE, 16);

for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE)
{
    volatile bool result = memcmp(array_1 + i, array_2 + i, CHUNK_SIZE);
}
Compared to my first-ever attempt at using SSE:
union U
{
    __m128i m;
    volatile int i[4];
} res;

for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE)
{
    __m128i* pa1 = (__m128i*)(array_1 + i);
    __m128i* pa2 = (__m128i*)(array_2 + i);
    res.m = _mm_cmpeq_epi32(*pa1, *pa2);
    volatile bool result = ((res.i[0] == 0) || (res.i[1] == 0) || (res.i[2] == 0) || (res.i[3] == 0));
}
The gain in speed is about 33%. Could I do any better?
You really shouldn't be using scalar code and unions to test all the individual vector elements - do something like this instead:
for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE)
{
    const __m128i a1 = _mm_load_si128((const __m128i *)(array_1 + i));
    const __m128i a2 = _mm_load_si128((const __m128i *)(array_2 + i));
    const __m128i vcmp = _mm_cmpeq_epi32(a1, a2);
    const int vmask = _mm_movemask_epi8(vcmp);
    const bool result = (vmask == 0xffff);
    // you probably want to break here if you get a mismatch ???
}
Since you can use SSE 4.1, there is another alternative that might be faster:
for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE)
{
    __m128i* pa1 = (__m128i*)(array_1 + i);
    __m128i* pa2 = (__m128i*)(array_2 + i);
    __m128i temp = _mm_xor_si128(*pa1, *pa2);
    bool result = (bool)_mm_testz_si128(temp, temp);
}
_mm_testz_si128(a, b) returns 1 if a & b == 0, and 0 otherwise. The advantage is that you can use this version with the new AVX instructions as well, where the chunk size is 32 bytes.
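Since the answer mentions AVX, a sketch of the same XOR-and-test idea widened to 32-byte chunks might look like this (assuming AVX2 support and 32-byte-aligned buffers):

#include <immintrin.h>
#include <stdbool.h>

// Compares one 32-byte chunk; returns true iff all bytes are equal.
static bool chunk32_equal(const char *a, const char *b)
{
    const __m256i va = _mm256_load_si256((const __m256i *)a);
    const __m256i vb = _mm256_load_si256((const __m256i *)b);
    const __m256i diff = _mm256_xor_si256(va, vb);
    return _mm256_testz_si256(diff, diff) != 0; // 1 iff diff is all zeros
}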

How Can I Improve / Speed Up This Frequent Function in C?

How can I improve / speed up this frequent function?
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define M 10 // This is fixed
#define N 8  // This is NOT fixed

// Assumptions: 1. x, a and b are all arrays of 10 (M); c is an array of 8 (N).
//              2. y and z are both matrices of 8 x 10 (N x M).
// Requirement: 1. return the value of ret;
//              2. get all elements of array c
float fnFrequentFunction(const float* x, const float* const* y, const float* const* z,
                         const float* a, const float* b, float *c, int n)
{
    register float tmp;
    register float sum;
    register float ret = 0;
    register const float* yy;
    register const float* zz;
    int i;
    for (i = 0; i < n; i++) // n == 1, 2, 4, or 8
    {
        sum = 0;
        yy = y[i];
        zz = z[i];
        tmp = x[0] - yy[0]; sum += tmp * tmp * zz[0];
        tmp = x[1] - yy[1]; sum += tmp * tmp * zz[1];
        tmp = x[2] - yy[2]; sum += tmp * tmp * zz[2];
        tmp = x[3] - yy[3]; sum += tmp * tmp * zz[3];
        tmp = x[4] - yy[4]; sum += tmp * tmp * zz[4];
        tmp = x[5] - yy[5]; sum += tmp * tmp * zz[5];
        tmp = x[6] - yy[6]; sum += tmp * tmp * zz[6];
        tmp = x[7] - yy[7]; sum += tmp * tmp * zz[7];
        tmp = x[8] - yy[8]; sum += tmp * tmp * zz[8];
        tmp = x[9] - yy[9]; sum += tmp * tmp * zz[9];
        ret += (c[i] = log(a[i] * b[i]) + sum);
    }
    return ret;
}
// In the main function, all values are just example data.
int main()
{
    float x[M] = {0.001251f, 0.563585f, 0.193304f, 0.808741f, 0.585009f, 0.479873f, 0.350291f, 0.895962f, 0.622840f, 0.746605f};
    float* y[N];
    float* z[N];
    float a[M] = {0.870205f, 0.733879f, 0.711386f, 0.588244f, 0.484176f, 0.852962f, 0.168126f, 0.684286f, 0.072573f, 0.632160f};
    float b[M] = {0.871487f, 0.998108f, 0.798608f, 0.134831f, 0.576281f, 0.410779f, 0.402936f, 0.522935f, 0.623218f, 0.193030f};
    float c[N];
    float t1[M] = {0.864406f, 0.709006f, 0.091433f, 0.995727f, 0.227180f, 0.902585f, 0.659047f, 0.865627f, 0.846767f, 0.514359f};
    float t2[M] = {0.866817f, 0.581347f, 0.175542f, 0.620197f, 0.781823f, 0.778588f, 0.938688f, 0.721610f, 0.940214f, 0.811353f};
    int i, j;
    int n = 10000000;
    long start;

    // Initialize y, z for test example:
    for (i = 0; i < N; ++i)
    {
        y[i] = (float*)malloc(sizeof(float) * M);
        z[i] = (float*)malloc(sizeof(float) * M);
        for (j = 0; j < M; ++j)
        {
            y[i][j] = t1[j] * j;
            z[i][j] = t2[j] * j;
        }
    }

    // Speed test here:
    start = clock();
    while (--n)
        fnFrequentFunction(x, y, z, a, b, c, 8);
    printf("Time used: %ld\n", clock() - start);

    // Output the result here:
    printf("fnFrequentFunction == %f\n", fnFrequentFunction(x, y, z, a, b, c, 8));
    for (j = 0; j < N; ++j)
        printf(" c[%d] == %f\n", j, c[j]);
    printf("\n");

    // Free memory
    for (j = 0; j < N; ++j)
    {
        free(y[j]);
        free(z[j]);
    }
    return 0;
}
Any suggestions are welcome :-)
I feel terrible that I made a big mistake in my function. The code above is the new version; I'm rechecking it now to make sure it is what I need.
Put this outside the loop:
sum = 0;
tmp = x[0] - y[0]; sum += tmp * tmp * z[0];
tmp = x[1] - y[1]; sum += tmp * tmp * z[1];
tmp = x[2] - y[2]; sum += tmp * tmp * z[2];
tmp = x[3] - y[3]; sum += tmp * tmp * z[3];
tmp = x[4] - y[4]; sum += tmp * tmp * z[4];
tmp = x[5] - y[5]; sum += tmp * tmp * z[5];
tmp = x[6] - y[6]; sum += tmp * tmp * z[6];
tmp = x[7] - y[7]; sum += tmp * tmp * z[7];
tmp = x[8] - y[8]; sum += tmp * tmp * z[8];
tmp = x[9] - y[9]; sum += tmp * tmp * z[9];
This function is perfectly amenable to SIMD processing. Look into your compiler's documentation for the intrinsic functions that correspond to the SSE instructions; a sketch follows the C99 version below.
You could break up the dependence chain on the sum variable. Instead of a single sum accumulator, use two accumulators sum1 and sum2 alternately - one for even, one for odd indices. Add them up afterwards.
The single biggest performance bottleneck here is the log() function. Check if an approximation would be sufficient. The calculation of this could also be vectorized - I believe Intel published a high-performance math library - including vectorized versions of functions like log(). You may like to use this.
You are operating on floats here, and log() uses double precision. Use logf() instead. It may (or may not) be faster. It will certainly be no slower.
If your compiler understands C99, place a restrict qualifier on the pointers which are function arguments. This tells the compiler that those arrays do not overlap, and may help it generate more efficient code.
Change the way matrices are kept in memory. Instead of an array of pointers pointing to disjoint memory blocks, use a single array M*N elements in size.
So, to put it together, this is what the function could look like. This is portable C99. Using compiler-specific SIMD intrinsics, it could be made WAAAAY faster.
UPDATE: Note that I changed the way the input matrices are defined. A matrix is now a single, large array.
float fnFrequentFunction(const float *restrict x, const float *restrict y,
                         const float *restrict z, const float *restrict a,
                         const float *restrict b, float *restrict c, int n)
{
    float ret = 0;
    const float *restrict yy = y; // for readability
    const float *restrict zz = z; // -||-
    for (int i = 0; i < n; i++, yy += M, zz += M) // n == 1, 2, 4, or 8
    {
        float sum = 0;
        float sum2 = 0;
        for (int j = 0; j < 10; j += 2)
        {
            float tmp = x[j] - yy[j];      sum  += tmp * tmp * zz[j];
            float tmp2 = x[j+1] - yy[j+1]; sum2 += tmp2 * tmp2 * zz[j+1];
        }
        sum += sum2;
        ret += (c[i] = logf(a[i] * b[i]) + sum);
    }
    return ret;
}
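And here is the promised SSE sketch of the inner weighted squared-difference sum for one row (M == 10: two 4-wide vectors plus a scalar tail). The helper name is made up for illustration; it is a sketch of the technique, not a tuned implementation:

#include <xmmintrin.h>

static float weighted_sqdiff10(const float *x, const float *yy, const float *zz)
{
    // First 8 elements as two 4-lane vectors: (x - y)^2 * z
    __m128 d0 = _mm_sub_ps(_mm_loadu_ps(x),     _mm_loadu_ps(yy));
    __m128 d1 = _mm_sub_ps(_mm_loadu_ps(x + 4), _mm_loadu_ps(yy + 4));
    __m128 acc = _mm_add_ps(_mm_mul_ps(_mm_mul_ps(d0, d0), _mm_loadu_ps(zz)),
                            _mm_mul_ps(_mm_mul_ps(d1, d1), _mm_loadu_ps(zz + 4)));

    // Horizontal sum of the 4 lanes
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar tail for elements 8 and 9
    float t8 = x[8] - yy[8];
    float t9 = x[9] - yy[9];
    return sum + t8 * t8 * zz[8] + t9 * t9 * zz[9];
}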
Use memoization to cache the results. This is a time/space trade-off optimization.
It's really easy to do this in Perl with the memoize package, and probably in many other dynamic languages. In C, you'd need to roll your own.
Use a wrapper function to make a hash of the arguments and use it to check if the value has already been calculated. If it has, return it. If not, pass through to the original function and cache the returned result.
Alternatively, you could pre-calculate your lookup table at program startup, or even calculate it once and then persist it, depending on your needs.
The above suggestions of hoisting the tmp calculations out of the loop are correct. I might even consider dropping those 10 lines into a for loop of their own, as this may improve code cache efficiency.
Beyond this, you start getting to the point where you want to know what type of processor you are targeting: whether it has native SIMD support, an FPU, what kind of cache it uses, etc. Also, depending on how many arguments get passed via registers, reducing the parameters by combining them into a single struct and passing it by reference might get you a small boost. Declaring vars as register may or may not help. Again, profiling and examining the assembler output will answer that.
As sum is known before the loop, you could get away with adding M times its value after the loop for a boost. That just leaves the log and the multiply on the inside.
If M is always 8 or has some other known pattern, you could do some minor loop unrolling, but the gains there are almost nil against the log calls.
The only other major thing to look at is log(). How is it implemented? Could you perhaps roll your own faster version through table lookups, if your input range is known? Better yet, table the log products if there's enough RAM available.
Just a few thoughts.
Do you use compiler optimization?
The register qualifier on variables is antiquated for modern compilers. You can even harm performance if you use it together with compiler optimization. For example, a simple gcc compilation gives:
Time used: 8720000
and without register floats:
Time used: 8710000
I know this is not much.
I assume you wrote out all those sums to avoid a for loop because you think that would be much slower. It is not. A modern compiler will do that optimization for you too.
One big optimization, I think, is to use a table for log if you don't mind the memory; that will be faster. Use log only when you are out of range.
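A rough sketch of that table idea, assuming the products a[i]*b[i] fall in (0, 1] as in the example data (the table size and bucket scheme are illustrative assumptions, and accuracy depends on both):

#include <math.h>

#define LOG_TABLE_SIZE 4096
static float log_table[LOG_TABLE_SIZE];

// Precompute logf at the midpoint of each bucket in (0, 1].
void init_log_table(void)
{
    for (int i = 0; i < LOG_TABLE_SIZE; i++)
        log_table[i] = logf((i + 0.5f) / LOG_TABLE_SIZE);
}

// Table lookup with a libm fallback when out of range.
static inline float fast_logf(float v)
{
    if (v <= 0.0f || v > 1.0f)
        return logf(v);
    int idx = (int)(v * LOG_TABLE_SIZE);
    if (idx >= LOG_TABLE_SIZE) idx = LOG_TABLE_SIZE - 1;
    return log_table[idx];
}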
I wonder if doing it with scaled ints rather than floats might speed it up. I don't know the data ranges, so I don't know if this is even possible.
In addition to Andrey's answer, you can add some prefetching to the loop:
float fnFrequentFunction(const float* x, const float* y, const float* z,
                         const float* a, const float* b, float* c, int M)
{
    register float tmp;
    register float sum;
    register float ret = 0;
    int i;

    sum = 0;
    tmp = x[0] - y[0]; sum += tmp * tmp * z[0];
    tmp = x[1] - y[1]; sum += tmp * tmp * z[1];
    tmp = x[2] - y[2]; sum += tmp * tmp * z[2];
    tmp = x[3] - y[3]; sum += tmp * tmp * z[3];
    tmp = x[4] - y[4]; sum += tmp * tmp * z[4];
    tmp = x[5] - y[5]; sum += tmp * tmp * z[5];
    tmp = x[6] - y[6]; sum += tmp * tmp * z[6];
    tmp = x[7] - y[7]; sum += tmp * tmp * z[7];
    tmp = x[8] - y[8]; sum += tmp * tmp * z[8];
    tmp = x[9] - y[9]; sum += tmp * tmp * z[9];

    for (i = 0; i < M; i++) // M == 1, 2, 4, or 8
    {
        //----------------------------------------
        // Prefetch data into the processor's cache
        //----------------------------------------
        float a_value = a[i];
        float b_value = b[i];
        float c_value = 0.0;

        //----------------------------------------
        // Calculate using prefetched data.
        //----------------------------------------
        c_value = log(a_value * b_value) + sum;
        c[i] = c_value;
        ret += c_value;
    }
    return ret;
}
You could also try unrolling the loop:
float a_value = 0.0;
float b_value = 0.0;
float c_value = 0.0;
--M;
switch (M)
{
    case 7:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
        ret += c_value;
        --M;
    case 6:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
        ret += c_value;
        --M;
    case 5:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
        ret += c_value;
        --M;
    case 4:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
        ret += c_value;
        --M;
    case 3:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
        ret += c_value;
        --M;
    case 2:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
        ret += c_value;
        --M;
    case 1:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
        ret += c_value;
        --M;
    case 0:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
        ret += c_value;
        break;
}
Looking at the unrolled version, you could take the " + sum" out of the "loop" and add it in at the end as:
ret += (M + 1) * sum;
since sum doesn't change.
Finally, another alternative is to perform all multiplications at once, followed by all log calculations, then sum up everything:
float product[8];

for (i = 0; i < M; ++i)
{
    product[i] = a[i] * b[i];
}
for (i = 0; i < M; ++i)
{
    c[i] = log(product[i]);
    ret += c[i];
}
ret += M * sum;
If you are calling this multiple times when a and b have not changed, then combine a and b into logab where logab[i] = log(a[i] * b[i]) since a and b are not used anywhere else.
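Sketched out, that precomputation (hypothetical helper name) would be:

#include <math.h>

// Run once whenever a and b change; afterwards the hot loop only
// needs c[i] = logab[i] + sum.
void precompute_logab(const float *a, const float *b, float *logab, int n)
{
    for (int i = 0; i < n; i++)
        logab[i] = logf(a[i] * b[i]);
}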
This appears to be a Gaussian mixture model computation. Several years ago, I worked on an effort to optimize this same algorithm, which was being used as part of a speech processing program. I investigated a number of optimizations like you're attempting but never found anything in straight C that gained more than just a few percent. My biggest gain came from recoding the basic GMM kernel using SIMD instructions. Since that still wasn't providing the performance I was looking for, the next (and final) step was to use an Nvidia GPU. This sort of worked, but programming that thing was a headache in itself.
Sorry I can't be more helpful but I don't think you are going to pick up more than just a nominal amount of speed if you're sticking to a regular CPU.
