SSE SIMD Segmentation Fault when using resulting float - c

I'm trying to use Intel Intrinsics to perform an operation quickly on a float array. The operations themselves seem to work fine; however, when I try to get the result of the operation into a standard C variable I get a SEGFAULT. If I comment the indicated line below out, the program runs. If I save the result of the indicated line, but do not manipulate it in any way, the program runs fine. It is only when I try to (in any way) interact with the result of _mm_cvtss_f32(C) that my program crashes. Any ideas?
float proc(float *a, float *b, int n, int c, int width) {
    // Operation: SUM: (A - B) ^ 2
    __m128 A, B, C;
    float total = 0;
    for (int d = 0, k = 0; k < c; d += width, k++) {
        for (int i = 0; i < n / 4 * 4; i += 4) {
            A = _mm_load_ps(&a[i + d]);
            B = _mm_load_ps(&b[i + d]);
            C = _mm_sub_ps(A, B);
            C = _mm_mul_ps(C, C);
            C = _mm_hadd_ps(C, C);
            C = _mm_hadd_ps(C, C);
            total += _mm_cvtss_f32(C); // SEGFAULT HERE
        }
        for (int i = n / 4 * 4; i < n; i++) {
            float diff = a[i + d] - b[i + d];
            total += diff * diff;
        }
    }
    return total;
}

Are you sure your program actually crashes at the instruction you cited, or is the compiler just optimizing the rest of the loop away when you remove the _mm_cvtss_f32() line (it has no other visible side effects)? A likely failure cause is improper alignment of the a and b arrays, since you are using aligned load instructions. Are you sure they are 16-byte aligned? On contemporary Intel hardware, there is very little performance difference between aligned and unaligned 16-byte loads (see the comments on the question above for a discussion of the issue).
I mentioned in my original comment that movaps has a shorter encoding than movups. This is not correct. I was thinking instead of movaps versus movapd, which do the same memory transfer, only they're labeled as being for single-precision and double-precision data, respectively. In practice, they do the same thing, but movaps has a shorter encoding.
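If the arrays cannot be guaranteed to be 16-byte aligned, there are two usual fixes; here is a minimal, untested sketch of both (the helper names are illustrative, not from the original code):

#include <immintrin.h>
#include <stdlib.h>

/* Option 1: switch to unaligned loads; only the load intrinsics change. */
static float sum_sq_diff_unaligned(const float *a, const float *b, int i, int d) {
    __m128 A = _mm_loadu_ps(&a[i + d]);   /* movups: no 16-byte alignment requirement */
    __m128 B = _mm_loadu_ps(&b[i + d]);
    __m128 C = _mm_sub_ps(A, B);
    C = _mm_mul_ps(C, C);
    C = _mm_hadd_ps(C, C);                /* _mm_hadd_ps needs SSE3 (compile with -msse3) */
    C = _mm_hadd_ps(C, C);
    return _mm_cvtss_f32(C);
}

/* Option 2: keep _mm_load_ps, but guarantee 16-byte alignment when allocating.
   C11 aligned_alloc requires the size to be a multiple of the alignment. */
static float *alloc_aligned_floats(size_t n) {
    size_t bytes = ((n * sizeof(float) + 15) / 16) * 16;   /* round up to 16 bytes */
    return aligned_alloc(16, bytes);
}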

Related

Segment fault using Intel SIMD, even the space is very large and is of multiple of 32 bytes

I keep getting segment fault when using SIMD instructions to optimize matrix multiplication.
Here is the core computing part.
The matrices are stored like this:
A large vector<double> buf of size (3 * 1025 * 1025) is allocated. Matrix A starts at buf[0], matrix B at buf[1025 * 1025], and C at buf[2 * 1025 * 1025]. I perform matrix multiplications with sizes from 4 to 1024, so all of them fit in this vector.
#include <immintrin.h>
#define BLOCK_SIZE 4
/*
* performs 4 * 4 matrix multiplication C=A*B
* C is 4-by-4, A is 4-by-4, and B is 4-by-4, column major matrices
* lda is the size of the large matrix.
*/
static void do_block(int lda, double* A, double* B, double* C) {
    int n = 4;
    for (int i = 0; i < n; i++) {          // column i of A / row i of B
        for (int j = 0; j < n; j++) {      // column j of C
            __m256d c = _mm256_load_pd(C + j * lda);
            c = _mm256_fmadd_pd(_mm256_load_pd(A + i * lda),
                                _mm256_broadcast_sd(B + i + j * lda), c);
            _mm256_store_pd(C + j * lda, c);
        }
    }
}
/* This routine performs a dgemm operation
 *   C := C + A * B
 * where A, B, and C are lda-by-lda matrices stored in column-major format.
 * On exit, A and B maintain their input values. */
void square_dgemm(int lda, double* A, double* B, double* C) {
    for (int j = 0; j < lda; j += BLOCK_SIZE) {
        // Accumulate block dgemms into block of C
        for (int k = 0; k < lda; k += BLOCK_SIZE) {
            // For each block-row of A
            for (int i = 0; i < lda; i += BLOCK_SIZE) {
                do_block(lda, A + i + k * lda, B + k + j * lda, C + i + j * lda);
            }
        }
    }
}
The weird thing is:
When I change the vector size from (3 * 1025 * 1025) to (3 * 1024 * 1024), I get a segmentation fault.
My questions are:
I have learnt that these instructions require aligned data. Indeed, replacing them with unaligned versions like _mm256_loadu_pd eliminates the error. However, since (3 * 1024 * 1024 * sizeof(double)) % 32 == 0, isn't the buffer 32-byte aligned, or have I misunderstood the concept?
I have allocated a very large contiguous space, so why does it crash right at the start, when performing a small (4*4) matrix multiplication? I thought that as long as I call _mm256_load_pd(addr) with at least 32 bytes allocated starting from addr, it won't crash. Am I wrong?
Why doesn't it crash with a buf of (3 * 1025 * 1025) but crash with (3 * 1024 * 1024)? It seems it doesn't crash when the dimension is odd (1025, 1027, 1029) and always crashes when it is even (1024, 1026).
The code was compiled with GCC, using -march=native and -O3. The CPU supports FMA, AVX and AVX2. The machine is a Google Cloud VM; the CPU is an Intel Xeon whose exact model I cannot determine. Thanks for your advice!
Thanks for the comments; I think I have figured out what was wrong.
Answer to Q1: Yes, I misunderstood the concept of 'alignment'. I have to ensure that the address itself is a multiple of 32. After manually (not elegantly, by doing pointer math) making the three matrices start at 32-byte-aligned addresses, it works.
Answer to Q2: Since the address passed to _mm256_load_pd(addr) is not 32-byte aligned, an exception may be raised even if the allocation is very large, according to the Intel documentation.
Answer to Q3: This is quite tricky: neither A nor C was 32-byte aligned, yet changing the total size happened to make C 32-byte aligned (while A was still unaligned), and it did not crash. That works only by accident and is not safe, because A is unaligned. (Most likely the compiler folded the loads of A into the FMA's memory operand, and AVX memory operands of arithmetic instructions do not require alignment; only an explicitly aligned move such as the vmovapd used for C will fault.)
So vectorizing matrix multiplication is tricky; I need to do two things (see the sketch below):
Ensure matrices A and C are 32-byte aligned, perhaps using something like aligned_alloc as mentioned in the comments.
Pad the matrices to a multiple of 4, so that every address passed to _mm256_load_pd(addr) is properly aligned.
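A minimal sketch of those two points, assuming C11 aligned_alloc is available (the helper name and layout are illustrative):

#include <stdlib.h>

/* Round the leading dimension up to a multiple of 4 doubles (32 bytes) so every
   column starts on a 32-byte boundary, and allocate from 32-byte-aligned storage. */
static double *alloc_matrix(int n, int *lda_out) {
    int lda = (n + 3) & ~3;                          /* pad n up to a multiple of 4 */
    size_t bytes = (size_t)lda * (size_t)lda * sizeof(double);
    bytes = (bytes + 31) & ~(size_t)31;              /* aligned_alloc wants size % alignment == 0 */
    *lda_out = lda;
    return aligned_alloc(32, bytes);                 /* 32-byte aligned for _mm256_load_pd */
}

With lda a multiple of 4 and the base pointers 32-byte aligned, every address passed to _mm256_load_pd / _mm256_store_pd in do_block stays 32-byte aligned.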

SSE parallelization

Hi, I am trying to improve the performance of this code, supposing that I have a machine capable of handling 4 threads. I first thought about using omp parallel, but then I saw that this function is called inside a for loop, and creating threads that many times did not seem efficient. So I would like to know how to implement it with SSE, which would be more efficient:
unsigned char cubicInterpolate_paralelo(unsigned char p[4], unsigned char x) {
    unsigned char resultado;
    unsigned char intermedio;
    intermedio = + x * (3.0 * (p[1] - p[2]) + p[3] - p[0]);
    resultado = p[1] + 0.5 * x * (p[2] - p[0] + x * (2.0 * p[0] - 5.0 * p[1] + 4.0 * p[2] - p[3] + x * (3.0 * (p[1] - p[2]) + p[3] - p[0])));
    return resultado;
}

unsigned char bicubicInterpolate_paralelo(unsigned char p[4][4], unsigned char x, unsigned char y) {
    unsigned char arr[4], valorPixelCanal;
    arr[0] = cubicInterpolate_paralelo(p[0], y);
    arr[1] = cubicInterpolate_paralelo(p[1], y);
    arr[2] = cubicInterpolate_paralelo(p[2], y);
    arr[3] = cubicInterpolate_paralelo(p[3], y);
    valorPixelCanal = cubicInterpolate_paralelo(arr, x);
    return valorPixelCanal;
}
this is used inside some nested for:
for (i = 0; i < z_img.width(); i++) {
    for (j = 0; j < z_img.height(); j++) {
        // For R, G, B
        for (c = 0; c < 3; c++) {
            for (l = 0; l < 4; l++) {
                for (k = 0; k < 4; k++) {
                    arr[l][k] = img(i / zFactor + l, j / zFactor + k, 0, c);
                }
            }
            color[c] = bicubicInterpolate_paralelo(arr, (unsigned char)(i % zFactor) / zFactor, (unsigned char)(j % zFactor) / zFactor);
        }
        z_img.draw_point(i, j, color);
    }
}
I've taken some liberties with the code, so you may have to change it significantly, but here's an (untested) transliteration to SSE:
__m128i x = _mm_unpacklo_epi8(_mm_loadl_epi64(x_array), _mm_setzero_si128());
__m128i p0 = _mm_unpacklo_epi8(_mm_loadl_epi64(p0_array), _mm_setzero_si128());
__m128i p1 = _mm_unpacklo_epi8(_mm_loadl_epi64(p1_array), _mm_setzero_si128());
__m128i p2 = _mm_unpacklo_epi8(_mm_loadl_epi64(p2_array), _mm_setzero_si128());
__m128i p3 = _mm_unpacklo_epi8(_mm_loadl_epi64(p3_array), _mm_setzero_si128());
__m128i t = _mm_sub_epi16(p1, p2);
t = _mm_add_epi16(_mm_add_epi16(t, t), t); // 3 * (p[1] - p[2])
__m128i intermedio = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t, p3), p0));
t = _mm_add_epi16(p1, _mm_slli_epi16(p1, 2)); // 5 * p[1]
// t2 = 2 * p[0] + 4 * p[2]
__m128i t2 = _mm_add_epi16(_mm_add_epi16(p0, p0), _mm_slli_epi16(p2, 2));
t = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t2, intermedio), _mm_add_epi16(t, p3)));
t = _mm_mullo_epi16(x, _mm_add_epi16(_mm_sub_epi16(p2, p0), t));
__m128i resultado = _mm_add_epi16(p1, _mm_srli_epi16(t, 1));
return resultado;
The 16-bit intermediates I use should be wide enough: the only way for information from the high bits to affect the low bits in this code is the right shift by 1 (the 0.5 * in your code), so really only 9 bits are needed; the rest cannot affect the result. Bytes would not be wide enough (unless you have some extra guarantees that I don't know about), and they would be awkward anyway because there is no convenient way to multiply them.
I pretended, for simplicity, that the input takes the form of contiguous arrays of x's, p[0]'s and so on. That's not what you have here, but I didn't have time to work out all the loading and shuffling.
SSE is quite unrelated to threads. A single thread executes a single instruction at a time; with SSE that single instruction may apply to 4 or 8 sets of arguments at a time. So with multiple threads you can also run multiple SSE instructions to process even more data.
You can use threads with for-loops. Just don't use them inside. Instead, take the for(i=0; i<z_img.width(); i++) { outer loop and split it in 4 bands of width/4. Thread 0 gets 0..width/4, thread 1 gets width/4..width/2 etc.
On an unrelated note your code also suffers from mixing floating-point and integer math. 0.5 * x is not nearly as efficient as x/2.
Using OpenMP, you could try adding the #pragma to the outer-most for loop. This should solve your problem.
Going the SSE route is trickier because of the extra alignment restrictions on data, but the easiest transform would be to extend cubicInterpolate_paralelo to handle multiple calculations at once. With enough luck, telling the compiler to use SSE will do the trick for you, but to make sure, you could use intrinsic functions and types.
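For the OpenMP suggestion above, a rough sketch of where the pragma goes (width, height and the loop body are stand-ins for z_img.width(), z_img.height() and the original per-pixel code; compile with -fopenmp):

void zoom_image(int width, int height) {
    /* One parallel region for the whole image: each (i, j) writes a distinct
       pixel, so the iterations are independent. Variables declared inside the
       loop body are automatically private to each thread. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < height; j++) {
            /* gather the 4x4 neighbourhood, interpolate each channel,
               draw the point -- exactly as in the original loop body */
        }
    }
}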

sse precision error with Matrix multiplication

My program does NxN matrix multiplication where the elements of both matrices are initialized to the values (0, 1, 2, ... N) using a for loop. Both matrices hold floats. There is no memory-allocation problem. Matrix sizes are input as a multiple of 4, e.g. 4x4 or 8x8. The answers are verified against a sequential calculation. Everything works fine up to a matrix size of 64x64. A difference between the sequential version and the SSE version is observed only when the matrix size exceeds 64 (e.g. 68 x 68).
SSE snippet is as shown (size = 68):
void matrix_mult_sse(int size, float *mat1_in, float *mat2_in, float *ans_out) {
    __m128 a_line, b_line, r_line;
    int i, j, k;
    for (k = 0; k < size * size; k += size) {
        for (i = 0; i < size; i += 4) {
            j = 0;
            b_line = _mm_load_ps(&mat2_in[i]);
            a_line = _mm_set1_ps(mat1_in[j + k]);
            r_line = _mm_mul_ps(a_line, b_line);
            for (j = 1; j < size; j++) {
                b_line = _mm_load_ps(&mat2_in[j * size + i]);
                a_line = _mm_set1_ps(mat1_in[j + k]);
                r_line = _mm_add_ps(_mm_mul_ps(a_line, b_line), r_line);
            }
            _mm_store_ps(&ans_out[i + k], r_line);
        }
    }
}
With this, the answer differs at element 3673 where I get the answers of multiplication as follows
scalar: 576030144.000000 & SSE: 576030208.000000
I also wrote a similar program in Java with the same initialization and setup and N = 68 and for element 3673, I got the answer as 576030210.000000
Now there are three different answers and I'm not sure how to proceed. Why does this difference occur and how do we eliminate this?
I am summarizing the discussion in order to close this question as answered.
According to the linked article (What Every Computer Scientist Should Know About Floating-Point Arithmetic), floating-point arithmetic almost always involves rounding error, a direct consequence of the approximate representation of floating-point numbers.
Arithmetic operations such as addition and subtraction therefore accumulate precision error, and the result can depend on the order in which the operations are performed and on any extra intermediate precision; the scalar, SSE and Java versions do not necessarily perform them identically. Hence, roughly the 6 most significant decimal digits of a single-precision result (irrespective of where the decimal point is situated) can be considered accurate, while the remaining digits may be erroneous.
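To see why grouping alone changes a float result, here is a small self-contained illustration (not the matrix code, just the rounding effect):

#include <stdio.h>

int main(void) {
    /* The same three additions performed with two different groupings: each
       float addition rounds to 24 bits of mantissa, so the grouping matters. */
    float big = 1e8f, small = 3.0f;
    float left_to_right = ((big + small) + small) + small;  /* each small term is rounded away */
    float grouped       = big + (small + small + small);    /* small terms accumulate first */
    printf("%f\n%f\n", left_to_right, grouped);             /* the two lines differ */
    return 0;
}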

What are other mathematical operators one can use to transform an algorithm

The difference operator (similar to the derivative operator) and the sum operator (similar to the integration operator) can be used to transform an algorithm because they are inverses of each other.
Sum of (difference of y) = y
Difference of (sum of y) = y
An example of using them that way in a c program is below.
This c program demonstrates three approaches to making an array of squares.
The first approach is the simple obvious approach, y = x*x .
The second approach uses the equation (difference in y) = (x0 + x1)*(difference in x) .
The third approach is the reverse and uses the equation (sum of y) = x(x+1)(2x+1)/6 .
The second approach is consistently slightly faster than the first one, even though I haven't bothered optimizing it. I imagine that if I tried harder I could make it even better.
The third approach is consistently twice as slow, but this doesn't mean the basic idea is dumb. I could imagine that for some function other than y = x*x this approach might be faster. Also there is an integer overflow issue.
Trying out all these transformations was very interesting, so now I want to know what are some other pairs of mathematical operators I could use to transform the algorithm?
Here is the code:
#include <stdio.h>
#include <time.h>
#define tries 201
#define loops 100000
void printAllIn(unsigned int array[tries]) {
    unsigned int index;
    for (index = 0; index < tries; ++index)
        printf("%u\n", array[index]);
}
int main(int argc, const char * argv[]) {
    /* Goal: calculate an array of squares from 0 to 200 as fast as possible */
    long unsigned int obvious[tries];
    long unsigned int sum_of_differences[tries];
    long unsigned int difference_of_sums[tries];
    clock_t time_of_obvious1;
    clock_t time_of_obvious0;
    clock_t time_of_sum_of_differences1;
    clock_t time_of_sum_of_differences0;
    clock_t time_of_difference_of_sums1;
    clock_t time_of_difference_of_sums0;
    long unsigned int j;
    long unsigned int index;
    long unsigned int sum1;
    long unsigned int sum0;
    long signed int signed_index;

    time_of_obvious0 = clock();
    for (j = 0; j < loops; ++j)
        for (index = 0; index < tries; ++index)
            obvious[index] = index * index;
    time_of_obvious1 = clock();

    time_of_sum_of_differences0 = clock();
    for (j = 0; j < loops; ++j)
        for (index = 1, sum_of_differences[0] = 0; index < tries; ++index)
            sum_of_differences[index] = sum_of_differences[index - 1] + 2 * index - 1;
    time_of_sum_of_differences1 = clock();

    time_of_difference_of_sums0 = clock();
    for (j = 0; j < loops; ++j)
        for (signed_index = 0, sum0 = 0; signed_index < tries; ++signed_index) {
            sum1 = signed_index * (signed_index + 1) * (2 * signed_index + 1);
            difference_of_sums[signed_index] = (sum1 - sum0) / 6;
            sum0 = sum1;
        }
    time_of_difference_of_sums1 = clock();

    // printAllIn(obvious);
    printf(
        "The obvious approach y = x*x took, %f seconds\n",
        ((double)(time_of_obvious1 - time_of_obvious0)) / CLOCKS_PER_SEC
    );
    // printAllIn(sum_of_differences);
    printf(
        "The sum of differences approach y1 = y0 + 2x - 1 took, %f seconds\n",
        ((double)(time_of_sum_of_differences1 - time_of_sum_of_differences0)) / CLOCKS_PER_SEC
    );
    // printAllIn(difference_of_sums);
    printf(
        "The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, %f seconds\n",
        (double)(time_of_difference_of_sums1 - time_of_difference_of_sums0) / CLOCKS_PER_SEC
    );
    return 0;
}
There are two classes of optimizations here: strength reduction and peephole optimizations.
Strength reduction is the usual term for replacing "expensive" mathematical functions with cheaper functions -- say, replacing a multiplication with two logarithm table lookups, an addition, and then an inverse logarithm lookup to find the final result.
Peephole optimizations is the usual term for replacing something like multiplication by a power of two with left shifts. Some CPUs have simple instructions for these operations that run faster than generic integer multiplication for the specific case of multiplying by powers of two.
You can also perform optimizations of individual algorithms. You might write a * b, but there are many different ways to perform multiplication, and different algorithms perform better or worse under different conditions. Many of these decisions are made by the chip designers, but arbitrary-precision integer libraries make their own choices based on the merits of the primitives available to them.
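As a concrete illustration, here is roughly what strength reduction looks like for the squares loop from the question, plus the classic peephole shift rewrite (the function names are mine, this is only a sketch):

/* Strength reduction: the per-iteration multiplication x*x is replaced by an
   incremental update using (x + 1)^2 = x^2 + 2x + 1 -- essentially the
   "sum of differences" approach from the question. */
void squares_reduced(unsigned long *out, unsigned long n) {
    unsigned long sq = 0;                      /* 0 squared */
    for (unsigned long x = 0; x < n; ++x) {
        out[x] = sq;
        sq += 2 * x + 1;                       /* advance to (x + 1)^2 with additions only */
    }
}

/* Peephole-style rewrite: a multiplication by a power of two becomes a shift
   (compilers normally do this on their own). */
unsigned long times_eight(unsigned long x) { return x << 3; }   /* same as x * 8 */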
When I tried to compile your code on Ubuntu 10.04, I got a segmentation fault right when main() started because you are declaring many megabytes worth of variables on the stack. I was able to compile it after I moved most of your variables outside of main to make them be global variables.
Then I got these results:
The obvious approach y = x*x took, 0.000000 seconds
The sum of differences approach y1 = y0 + 2x - 1 took, 0.020000 seconds
The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, 0.000000 seconds
The program runs so fast it's hard to believe it really did anything. I put the "-O0" option in to disable optimizations but it's possible GCC still might have optimized out all of the computations. So I tried adding the "volatile" qualifier to your arrays but still got similar results.
That's where I stopped working on it. In conclusion, I don't really know what's going on with your code but it's quite possible that something is wrong.

Optimizing for speed - 4 dimensional array lookup in C

I have a fitness function that is scoring the values on an int array based on data that lies on a 4D array. The profiler says this function is using 80% of CPU time (it needs to be called several million times). I can't seem to optimize it further (if it's even possible). Here is the function:
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input) {
    register unsigned int i, score = 0;
    for (i = len - 3; i--; )   /* len is the input length, defined elsewhere */
        score += lookup_array[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
    return score;
}
I've tried to flatten the array to a single dimension but there was no improvement in performance. This is running on an IA32 CPU. Any CPU specific optimizations are also helpful.
Thanks
What is the range of the array items? If you can change the array base type to unsigned short or unsigned char, you might get fewer cache misses because a larger portion of the array fits into the cache.
Most of your time probably goes into cache misses. If you can optimize those away, you can get a big performance boost.
Remember that C/C++ arrays are stored in row-major order, so store your data so that addresses referenced closely in time also reside closely in memory. For example, it may make sense to store sub-results in a temporary array and then process exactly one row of sequentially located elements at a time; that way the processor cache always holds the row during the iterations and fewer memory operations are required. You might, however, need to modularize your lookup_array function, maybe even splitting it into four (by the number of dimensions in your array).
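As an illustration of the row-major point (a hedged sketch of access order only, not the scoring function itself):

/* C arrays are row-major: the last index is contiguous in memory. Keeping the
   last index in the innermost loop walks consecutive elements and cache lines;
   swapping the loops would stride 26*26*26 elements per step instead. */
unsigned int sum_all(unsigned int t[26][26][26][26]) {
    unsigned int s = 0;
    for (int a = 0; a < 26; a++)
        for (int b = 0; b < 26; b++)
            for (int c = 0; c < 26; c++)
                for (int d = 0; d < 26; d++)   /* innermost loop = contiguous accesses */
                    s += t[a][b][c][d];
    return s;
}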
The problem is definitely related to the size of the matrix. You cannot optimize it just by declaring it as a single flat array, because that is effectively what the compiler already does.
Everything depends on the order in which you access the data, that is, on the content of the input array.
The only thing you can do is work on locality: read this one, it should give you some inspiration.
By the way, I suggest you replace the input array with four parameters: it will be more intuitive and less error prone.
Good luck
A few suggestions to improve performance:
Parallelise. This is a very easy reduction to be programmed in OpenMP or MPI.
Reorder data to improve locality. Try sorting input first, for example.
Use streaming processing instructions if the compiler is not already doing it.
About reordering, it would be possible if you flatten the array and use linear coordinates instead.
Another point, compare the theoretical peak performance of your processor (integer operations) with the performance you're getting (do a quick count of the assembly generated instructions, multiply by the length of the input, etc.) and see if there's room for a significant improvement there.
I have a couple of suggestions:
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input, unsigned int len) {
    register unsigned int i, score = 0;
    unsigned int *a = input;
    unsigned int *b = input + 1;
    unsigned int *c = input + 2;
    unsigned int *d = input + 3;
    for (i = 0; i < (len - 3); i++, a++, b++, c++, d++)
        score += lookup_array[*a][*b][*c][*d];
    return score;
}
Or try
for (i = 0; i < (len - 3); i++, a = b, b = c, c = d, d++)
    score += lookup_array[*a][*b][*c][*d];
Also, given that there are only 26 values, why are you putting the input array in terms of unsigned ints? If it were char *input, you'd be using 1/4 as much memory and therefore using 1/4 of the memory bandwidth. Obviously the types of a through d have to match. Similarly, if the score values don't need to be unsigned ints, make the array smaller by using chars or uint16_t.
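A hedged sketch of that narrower-type suggestion, assuming every element is in 0..25 and no score exceeds 65535 (the names are illustrative):

#include <stdint.h>

/* The table shrinks from 26^4 * 4 bytes (~1.8 MB) to 26^4 * 2 bytes (~0.9 MB),
   and the input uses a quarter of the memory bandwidth. */
uint16_t lookup_small[26][26][26][26];

unsigned int get_i_score_small(const unsigned char *input, unsigned int len) {
    unsigned int score = 0;
    for (unsigned int i = 0; i + 3 < len; i++)   /* same windows as the original loop */
        score += lookup_small[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
    return score;
}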
You might be able to squeeze a bit more out by unrolling the loop in some variation of Duff's device.
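For reference, a hedged, untested sketch of what a Duff's-device-style unroll of this loop could look like (assumes len >= 4 and the lookup_array global from the question; the function name is mine):

unsigned int get_i_score_duff(const unsigned int *input, unsigned int len) {
    unsigned int score = 0;
    unsigned int i = len - 3;              /* number of 4-element windows */
    unsigned int n = (i + 3) / 4;          /* passes through the unrolled body */
    switch (i % 4) {                       /* jump into the body to handle the remainder */
    case 0: do { --i; score += lookup_array[input[i]][input[i+1]][input[i+2]][input[i+3]];
    case 3:      --i; score += lookup_array[input[i]][input[i+1]][input[i+2]][input[i+3]];
    case 2:      --i; score += lookup_array[input[i]][input[i+1]][input[i+2]][input[i+3]];
    case 1:      --i; score += lookup_array[input[i]][input[i+1]][input[i+2]][input[i+3]];
            } while (--n > 0);
    }
    return score;
}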
Multidimensional arrays often force the compiler to emit one or more multiply operations, which may be slow on some CPUs. A common workaround is to transform the N-dimensional array into an array of pointers to elements of (N-1) dimensions. With a 4-dimensional array this is quite annoying (26 pointers to 26*26 pointers to 26*26*26 rows...), so I suggest you try it and compare the results. It is not guaranteed to be faster: compilers are quite smart at optimizing array accesses, while a chain of indirect accesses has a higher probability of invalidating the cache.
Bye
If lookup_array is mostly zeroes, it could definitely be replaced with a hash-table lookup on a smaller array. An inline lookup function could calculate the offset from the 4 dimensions (e.g. [4][5][6][7] = (4*26*26*26) + (5*26*26) + (6*26) + 7 = 73847). The hash key could just be the lower few bits of the offset (depending on how sparse the array is expected to be): if the offset exists in the hash table, use the value; if it doesn't, the score is 0.
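A rough sketch of that idea, using a simple open-addressing table keyed on the flattened offset (HASH_BITS, the hashing constant and the names are illustrative tuning choices, not from the original post):

#include <stdint.h>

#define HASH_BITS 16
#define HASH_SIZE (1u << HASH_BITS)

struct entry { uint32_t key; uint32_t value; uint8_t used; };
static struct entry table[HASH_SIZE];            /* filled once from the sparse lookup_array */

static uint32_t sparse_lookup(uint32_t offset) { /* offset = a*26*26*26 + b*26*26 + c*26 + d */
    uint32_t h = (offset * 2654435761u) >> (32 - HASH_BITS);  /* multiplicative hash */
    while (table[h].used) {
        if (table[h].key == offset)
            return table[h].value;
        h = (h + 1) & (HASH_SIZE - 1);           /* linear probing */
    }
    return 0;                                    /* absent entries score 0 */
}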
The loop could also be unrolled using something like the following if the input has arbitrary length. Only about len accesses of input are needed (instead of around len * 4 in the original loop).
register int j, x1, x2, x3, x4;
register unsigned int *p;

p = input;
x1 = *p++;
x2 = *p++;
x3 = *p++;
for (j = (len - 3) / 20; j--; ) {
    x4 = *p++; score += lookup_array[x1][x2][x3][x4];
    x1 = *p++; score += lookup_array[x2][x3][x4][x1];
    x2 = *p++; score += lookup_array[x3][x4][x1][x2];
    x3 = *p++; score += lookup_array[x4][x1][x2][x3];
    x4 = *p++; score += lookup_array[x1][x2][x3][x4];
    x1 = *p++; score += lookup_array[x2][x3][x4][x1];
    x2 = *p++; score += lookup_array[x3][x4][x1][x2];
    x3 = *p++; score += lookup_array[x4][x1][x2][x3];
    x4 = *p++; score += lookup_array[x1][x2][x3][x4];
    x1 = *p++; score += lookup_array[x2][x3][x4][x1];
    x2 = *p++; score += lookup_array[x3][x4][x1][x2];
    x3 = *p++; score += lookup_array[x4][x1][x2][x3];
    x4 = *p++; score += lookup_array[x1][x2][x3][x4];
    x1 = *p++; score += lookup_array[x2][x3][x4][x1];
    x2 = *p++; score += lookup_array[x3][x4][x1][x2];
    x3 = *p++; score += lookup_array[x4][x1][x2][x3];
    x4 = *p++; score += lookup_array[x1][x2][x3][x4];
    x1 = *p++; score += lookup_array[x2][x3][x4][x1];
    x2 = *p++; score += lookup_array[x3][x4][x1][x2];
    x3 = *p++; score += lookup_array[x4][x1][x2][x3];
    /* that's 20 iterations, add more if you like */
}
for (j = (len - 3) % 20; j--; ) {
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = x2;
    x2 = x3;
    x3 = x4;
}
If you convert it to a flat array of size 26*26*26*26, you only need to look up the input array once per loop iteration:
unsigned int get_i_score(unsigned int *input)
{
    unsigned int i = len - 3, score = 0, index;
    index = input[i] * 26 * 26 +
            input[i + 1] * 26 +
            input[i + 2];
    while (i--)   /* i--, not --i, so the window starting at index 0 is included */
    {
        index += input[i] * 26 * 26 * 26;
        score += lookup_array[index];
        index /= 26;
    }
    return score;
}
The additional cost is a multiplication and a division. Whether it ends up being faster in practice - you'll have to test.
(By the way, the register keyword is often ignored by modern compilers - it's usually better to leave register allocation up to the optimiser).
Does the content of the array change much? Perhaps it would be faster to pre-calculate the score and then adjust that pre-calculated score every time the array changes, similar to how you can materialize a view in SQL using triggers.
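A hedged sketch of that incremental idea (lookup_array, input and len come from the question; the helper names are mine): when input[k] changes, only the at-most-four windows containing position k change, so the cached score can be patched rather than recomputed.

static unsigned int window(const unsigned int *in, unsigned int s) {
    return lookup_array[in[s]][in[s + 1]][in[s + 2]][in[s + 3]];
}

/* Assumes len >= 4 and 0 <= k < len. */
void update_score(unsigned int *input, unsigned int len,
                  unsigned int k, unsigned int new_value,
                  unsigned int *cached_score) {
    unsigned int lo = (k >= 3) ? k - 3 : 0;          /* first window containing position k */
    unsigned int hi = (k + 3 < len) ? k : len - 4;   /* last valid window start containing k */
    for (unsigned int s = lo; s <= hi; s++)
        *cached_score -= window(input, s);           /* remove the old contributions */
    input[k] = new_value;
    for (unsigned int s = lo; s <= hi; s++)
        *cached_score += window(input, s);           /* add the new contributions */
}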
Maybe you can eliminate some accesses to the input array by using local variables.
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input, unsigned int len) {
    unsigned int i, score, a, b, c, d;
    score = 0;
    i = len - 3;                /* start from the last window */
    a = input[i + 0];
    b = input[i + 1];
    c = input[i + 2];
    d = 0;                      /* overwritten before the first lookup */
    for (i = len - 3; i-- > 0; ) {
        d = c, c = b, b = a, a = input[i];
        score += lookup_array[a][b][c][d];
    }
    return score;
}
Moving around registers may be faster than accessing memory, although this kind of memory should remain in the innermost cache anyway.
