Optimizing the loop in a block by block transfer - c

In the below code, is there a way to avoid the if statement?
s = 13; /*Total size*/
b = 5;  /*Block size*/
x = 0;
b1 = b;
while(x < s)
{
    if(x + b > s)
        b1 = s - x;
    SendData(x, b1); /*SendData(offset,length);*/
    x += b1;
}
Thanks much!

I don't know, maybe you'll think:
s = 13; /*Total size*/
b = 5;  /*Block size*/
x = 0;
while(x + b <= s)
{
    SendData(x, b); /*SendData(offset,length);*/
    x += b;
}
if (s % b)              /* send the final partial block, if any */
    SendData(x, s % b);
is better?

Don't waste your time on pointless micro-optimizations your compiler probably does for you anyway.
Program for the programmer; not the computer. Compilers get better and better, but programmers don't.
If it makes your program more readable (@PaulPRO's answer), then do it. Otherwise, don't.

You can use a conditional move or branchless integer select to assign b1 without an if-statement:
// if a >= 0, return x, else y
// assumes 32-bit processors
inline int isel( int a, int x, int y ) // inlining is important here
{
    int mask = a >> 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
    return x + ((y - x) & mask);
}
// ...
while(x < s)
{
    b1 = isel( x + b - s, s - x, b1 );
    SendData(x, b1); /*SendData(offset,length);*/
    x += b1;
}
This is only a useful optimization on in-order processors, though. It won't make any difference on a modern PC's x86, which has a fast branch and a reorder unit. It might be useful on some embedded systems (like a Playstation), where pipeline latency matters more for performance than instruction count. I've used it to shave a few microseconds in tight loops.
In theory a compiler "should" be able to turn a ternary expression (b = (a > 0 ? x : y)) into a conditional move, but I've never met one that did.
Of course, in a larger sense everyone who says that this is a pointless optimization compared to the cost of SendData() is correct. The difference between a cmov and a branch is about 4 nanoseconds, which is negligible compared to the cost of a network call. Spending your time fixing this branch which happens once per network call is like driving across town to save 1¢ on gasoline.
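For comparison, here is the ternary form of the same selection; as noted above, whether the compiler actually turns it into a conditional move is not guaranteed, so treat this as a sketch rather than a promise:
// same logic as the isel() version, written as a ternary expression
while (x < s)
{
    b1 = (x + b > s) ? s - x : b;   /* no explicit if-statement in the source */
    SendData(x, b1);                /* SendData(offset, length); */
    x += b1;
}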

If you try to remove the if(), you might change your logic and then have to spend a lot of time testing. I see only one potential change:
s = 13;
b = 5;
x = 0;
b1 = b;
while(x < s)
{
    const unsigned int total = x + b; // <--- introduce 'total'
    if(total > s)
        b1 = s - x;
    SendData(x, b1);
    x = total; // <--- reusing it
}

Related

SSE SIMD Segmentation Fault when using resulting float

I'm trying to use Intel Intrinsics to perform an operation quickly on a float array. The operations themselves seem to work fine; however, when I try to get the result of the operation into a standard C variable I get a SEGFAULT. If I comment the indicated line below out, the program runs. If I save the result of the indicated line, but do not manipulate it in any way, the program runs fine. It is only when I try to (in any way) interact with the result of _mm_cvtss_f32(C) that my program crashes. Any ideas?
float proc(float *a, float *b, int n, int c, int width) {
    // Operation: SUM: (A - B) ^ 2
    __m128 A, B, C;
    float total = 0;
    for (int d = 0, k = 0; k < c; d += width, k++) {
        for (int i = 0; i < n / 4 * 4; i += 4) {
            A = _mm_load_ps(&a[i + d]);
            B = _mm_load_ps(&b[i + d]);
            C = _mm_sub_ps(A, B);
            C = _mm_mul_ps(C, C);
            C = _mm_hadd_ps(C, C);
            C = _mm_hadd_ps(C, C);
            total += _mm_cvtss_f32(C); // SEGFAULT HERE
        }
        for (int i = n / 4 * 4; i < n; i++) {
            float diff = a[i + d] - b[i + d];
            total += diff * diff;
        }
    }
    return total;
}
Are you sure your program actually crashes at the instruction you cited, or is the compiler just optimizing the rest of the loop away if you remove the _mm_cvtss_f32() line (it doesn't have any other visible side effects)? Potential failure causes would be improper alignment of the a and b arrays since you are using aligned load instructions. Are you sure they are 16-byte aligned? On contemporary Intel hardware, there is very little performance difference between 16-byte aligned and unaligned loads (see the comments on the question above for a discussion of the issue).
I mentioned in my original comment that movaps has a shorter encoding than movups. This is not correct. I was thinking instead of movaps versus movapd, which do the same memory transfer, only they're labeled as being for single-precision and double-precision data, respectively. In practice, they do the same thing, but movaps has a shorter encoding.
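If alignment is indeed the problem, two common fixes are to allocate the arrays 16-byte aligned or to switch to unaligned loads. A minimal sketch (these helper names are mine, not the asker's code; _mm_malloc/_mm_free and _mm_loadu_ps ship with the SSE intrinsics headers):
#include <stddef.h>
#include <xmmintrin.h>

/* option 1: allocate the float buffers 16-byte aligned so _mm_load_ps is legal */
float *make_aligned_buffer(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);   /* release with _mm_free() */
}

/* option 2: keep the existing buffers and use unaligned loads,
   which have no alignment requirement on the pointer */
__m128 load_unaligned(const float *p)
{
    return _mm_loadu_ps(p);
}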

Conversion to branchless of consecutive if statements

I'm stuck trying to figure out how to convert the last two "if" statements in the following code to a branchless form.
int u, x, y;
x = rand() % 100 - 50;
y = rand() % 100 - 50;
u = rand() % 4;
if ( y > x) u = 5;
if (-y > x) u = 4;
Or, in case the above turns out to be too difficult, you can consider them as:
if (x > 0) u = 5;
if (y > 0) u = 4;
I think what gets me is the fact that these don't have an else clause. If they did, I could probably have adapted a variation of a branchless abs (or max/min) function.
The rand() functions you see aren't part of the real code. I added them just to hint at the expected ranges that the variables x, y and u can have at the time the two branches happen.
Assembly code is allowed for this purpose.
EDIT:
After a bit of braingrinding I managed to put together a working branchless version:
int u, x, y;
x = rand() % 100 - 50;
y = rand() % 100 - 50;
u = rand() % 4;
u += (4-u)*((unsigned int)(x+y) >> 31);
u += (5-u)*((unsigned int)(x-y) >> 31);
Unfortunately, due to the integer arithmetic involved, the original version with if statements turns out to be about 30% faster.
Compiler knows where the party is at.
[All: this answer was written with the assumption that the calls on rand() were part of the problem. I offer improvement below under that assumption.
OP belatedly clarifies he only used rand to tell us ranges (and presumably distribution) of the values of x and y. Unclear if he meant for the value for u, too. Anyway, enjoy my improved answer to the problem he didn't really pose].
I think you'd be better off recoding this as:
int u, x, y;
x = rand() % 100 - 50;
y = rand() % 100 - 50;
if ( y > x) u = 5;
else if (-y > x) u = 4;
else u = rand() % 4;
This calls the last rand only 1/4 as often as OP's original code.
Since I assume rand (and the divides) are much more expensive
than compare-and-branch, this would be a significant savings.
If your rand generator produces a lot of truly random bits (e.g. 16) on each call as it should, you can call it just once (I've assumed rand is more expensive than divide, YMMV):
int u, x, y, t;
t = rand() ;
u = t % 4;
t = t >> 2;
x = t % 100 - 50;
y = ( t / 100 ) %100 - 50;
if ( y > x) u = 5;
else if (-y > x) u = 4;
I think that the rand function in the MS C library is not good enough for this if you want really random values. I had to code my own; turned out faster anyway.
You might also get rid of the divide, by using multiplication by a reciprocal (untested):
int u, x, y;
unsigned int t;
unsigned long long t2;
t = rand();
u = t % 4;
{   // Compute the value of x * 2^32 in a 64-bit integer by multiplying.
    // The constant below should be folded into a single value at compile time.
    // The remaining multiply can be done by one machine instruction
    // (typically 32 bits * 32 bits --> 64 bits), widely found in processors.
    // The "* 400" plays the same role as the t >>= 2 and % 100 in the previous version.
    t2 = (unsigned long long)t * (unsigned int)(1. / (4. * 100.) * 4294967296.);
}
x = (int)(t2 >> 32) - 50;   // take the upper word (if the compiler won't, do this in assembler)
{   // compute y from the fractional remainder of the above multiply,
    // which is sitting in the lower 32 bits of the t2 product
    y = (int)(((t2 & 0xFFFFFFFFu) * 100u) >> 32) - 50;
}
if ( y > x) u = 5;
else if (-y > x) u = 4;
If your compiler won't produce the "right" instructions, it should be straightforward to write assembly code to do this.
Some tricks using array indices; they may be quite fast if the compiler/CPU has one-step instructions to convert comparison results to 0-1 values (e.g. x86's "sete" and similar).
int ycpx[3];
/* ... */
ycpx[0] = 4;
ycpx[1] = u;
ycpx[2] = 5;
u = ycpx[1 - (-y > x) + (y > x)];
Alternate form
int v1[2];
int v2[2];
/* ... */
v1[0] = u;
v1[1] = 5;
v2[1] = 4;
v2[0] = v1[y > x];
u = v2[-y > x];
Almost unreadable...
NOTE: In both cases the initialization of array elements containing 4 and 5 may be included in declaration and arrays may be made static if reentrancy is not a problem for you.

Find the most significant bit or log base 2 (floor) of a positive integer via straight line bit manipulation in C

This is what I need to do:
int lg(int v)
{
    int r = 0;
    while (v >>= 1) // unroll for more speed...
    {
        r++;
    }
    return r;
}
I found the above solution at: http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
This works, but I need to do it without loops, control structures, or constants bigger than 0xFF (255), which has proven very hard for me. I've been trying to figure something out using conditionals in the form
( x ? y : z ) = (((~(!!x) + 1)) & y) | ((~(~(!!x) + 1)) & z)
but I can't get it to work. Thanks for your time.
Without any control structure, not even the ?: operator, you can simulate your own algorithm:
int r = 0;
x >>= 1;
r += (x != 0);
x >>= 1;
r += (x != 0);
...
provided that, in C,
x is assumed to be positive (otherwise, with int x = -1; for instance, x >>= 1 repeated n times is always != 0)
a condition like x != 0 evaluates to 0 (false) or 1 (true)
That sounds like homework. Well, if you can't use control structures, a good alternative is to precalculate what you can: divide and conquer. Solve for a smaller part (one byte, one nibble, your choice), and apply it to the parts of your integer.
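For instance, here is one divide-and-conquer shape that stays within the stated constraints (no loops, no control structures, no constant larger than 0xFF); a sketch assuming a 32-bit unsigned value greater than zero, not the only possible layout:
int lg(unsigned int v)   /* assumes v > 0 and a 32-bit unsigned int */
{
    int r = 0, s;
    s = ((v >> 16) != 0) << 4;  r += s;  v >>= s;   /* top half non-empty? add 16 */
    s = ((v >> 8)  != 0) << 3;  r += s;  v >>= s;   /* then 8, 4, 2, 1 ...        */
    s = ((v >> 4)  != 0) << 2;  r += s;  v >>= s;
    s = ((v >> 2)  != 0) << 1;  r += s;  v >>= s;
    r += (v >> 1) != 0;
    return r;
}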

Find Pythagorean triplet for which a + b + c = 1000

A Pythagorean triplet is a set of three natural numbers, a < b < c, for which,
a² + b² = c²
For example, 3² + 4² = 9 + 16 = 25 = 5².
There exists exactly one Pythagorean triplet for which a + b + c = 1000.
Find the product abc.
Source: http://projecteuler.net/index.php?section=problems&id=9
I tried but didn't know where my code went wrong. Here's my code in C:
#include <math.h>
#include <stdio.h>
#include <conio.h>
void main()
{
    int a=0, b=0, c=0;
    int i;
    for (a = 0; a<=1000; a++)
    {
        for (b = 0; b<=1000; b++)
        {
            for (c = 0; c<=1000; c++)
            {
                if ((a^(2) + b^(2) == c^(2)) && ((a+b+c) ==1000))
                    printf("a=%d, b=%d, c=%d",a,b,c);
            }
        }
    }
    getch();
}
#include <math.h>
#include <stdio.h>
int main()
{
    const int sum = 1000;
    int a;
    for (a = 1; a <= sum/3; a++)
    {
        int b;
        for (b = a + 1; b <= sum/2; b++)
        {
            int c = sum - a - b;
            if ( a*a + b*b == c*c )
                printf("a=%d, b=%d, c=%d\n",a,b,c);
        }
    }
    return 0;
}
explanation:
b = a + 1;
If a, b (a <= b) and c are a Pythagorean triplet, then b, a (b >= a) and c is the same solution, so we only need to search one ordering.
c = 1000 - a - b;
This is one of the conditions of the problem, so we don't need to scan all possible values of c: we can just calculate it.
I'm afraid ^ doesn't do what you think it does in C. Your best bet is to use a*a for integer squares.
Here's a solution using Euclid's formula (link).
Let's do some math:
In general, every solution will have the form
a=k(x²-y²)
b=2kxy
c=k(x²+y²)
where k, x and y are positive integers, y < x and gcd(x,y)=1 (We will ignore this condition, which will lead to additional solutions. Those can be discarded afterwards)
Now, a+b+c= kx²-ky²+2kxy+kx²+ky²=2kx²+2kxy = 2kx(x+y) = 1000
Divide by 2: kx(x+y) = 500
Now we set s=x+y: kxs = 500
Now we are looking for solutions of kxs=500, where k, x and s are integers and x < s < 2x.
Since all of them divide 500, they can only take the values 1, 2, 4, 5, 10, 20, 25, 50, 100, 125, 250, 500. Some pseudocode to do this for arbitrary n (it can be done by hand easily for n=1000):
If n is odd
    return "no solution"
else
    L = list of divisors of n/2
    for x in L
        for s in L
            if x < s < 2*x and n/2 is divisible by x*s
                y = s - x
                k = ((n/2)/x)/s
                add (k*(x*x-y*y), 2*k*x*y, k*(x*x+y*y)) to list of solutions
    sort the triples in the list of solutions
    delete solutions appearing twice
    return list of solutions
You can still improve this:
x will never be bigger than the square root of n/2
the loop for s can start at x and stop after 2x has been passed (if the list is ordered)
For n = 1000, the program has to check six values for x and depending on the details of implementation up to one value for y. This will terminate before you release the button.
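A small C sketch of the pseudocode above, under the same assumptions (the function name is mine; duplicates coming from non-coprime x and y are not filtered out here):
#include <stdio.h>

void euclid_triples(int n)
{
    if (n % 2 != 0) {
        printf("no solution\n");
        return;
    }
    int half = n / 2;
    for (int x = 1; x * x <= half; x++) {          /* x never exceeds sqrt(n/2) */
        if (half % x != 0)
            continue;
        for (int s = x + 1; s < 2 * x; s++) {      /* x < s < 2x */
            if ((half / x) % s != 0)
                continue;
            int k = half / x / s, y = s - x;
            printf("a=%d b=%d c=%d\n",
                   k * (x * x - y * y), 2 * k * x * y, k * (x * x + y * y));
        }
    }
}
For n = 1000 this prints the single triple a=375, b=200, c=425.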
As mentioned above, ^ is bitwise xor, not power.
You can also remove the third loop, and instead use
c = 1000-a-b; and optimize this a little.
Pseudocode
for a in 1..1000
    for b in a+1..1000
        c = 1000 - a - b
        print a, b, c if a*a+b*b=c*c
There is a quite dirty but quick solution to this problem. Given the two equations
a*a + b*b = c*c
a+b+c = 1000.
You can deduce the following relation
a = (1000*1000-2000*b)/(2000-2b)
or after two simple math transformations, you get:
a = 1000*(500-b) / (1000 - b)
since a must be a natural number. Hence you can:
for b in range(1, 500):
    if 1000*(500-b) % (1000-b) == 0:
        print b, 1000*(500-b) / (1000-b)
Got result 200 and 375.
Good luck
#include <stdio.h>
int main() // main always returns int!
{
    int a, b, c;
    for (a = 0; a<=1000; a++)
    {
        for (b = a + 1; b<=1000; b++) // no point starting from 0, otherwise you'll just try the same solution more than once. The condition says a < b < c.
        {
            for (c = b + 1; c<=1000; c++) // same, this ensures a < b < c.
            {
                if (((a*a + b*b == c*c) && ((a+b+c) ==1000))) // ^ is the bitwise xor operator, use multiplication for squaring
                    printf("a=%d, b=%d, c=%d",a,b,c);
            }
        }
    }
    return 0;
}
Haven't tested this, but it should set you on the right track.
From man pow:
POW(3) Linux Programmer's Manual POW(3)
NAME
pow, powf, powl - power functions
SYNOPSIS
#include <math.h>
double pow(double x, double y);
float powf(float x, float y);
long double powl(long double x, long double y);
Link with -lm.
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
powf(), powl(): _BSD_SOURCE || _SVID_SOURCE || _XOPEN_SOURCE >= 600 || _ISOC99_SOURCE; or cc -std=c99
DESCRIPTION
The pow() function returns the value of x raised to the power of y.
RETURN VALUE
On success, these functions return the value of x to the power of y.
If x is a finite value less than 0, and y is a finite non-integer, a domain error occurs, and a NaN is
returned.
If the result overflows, a range error occurs, and the functions return HUGE_VAL, HUGE_VALF, or HUGE_VALL, respectively.
As you see, pow() uses floating point arithmetic, which is unlikely to give you an exact result (although in this case it should be OK, since relatively small integers have an exact representation; just don't rely on that in general). Use n*n to square the numbers in integer arithmetic. (Also, on modern CPUs with powerful floating point units the throughput can even be higher in floating point, but converting from integer to floating point has a very high cost in CPU cycles, so if you're dealing with integers, try to stick to integer arithmetic.)
Some pseudocode to help you optimise your algorithm a little:
for a from 1 to 998:
    for b from 1 to 999-a:
        c = 1000 - a - b
        if a*a + b*b == c*c:
            print a, b, c
In C the ^ operator computes bitwise xor, not the power. Use x*x instead.
I know this question is quite old, and everyone has been posting solutions with 3 for loops, which are not needed. I solved this in O(n) by equating the formulas a+b+c = 1000 and a^2 + b^2 = c^2.
So, solving further we get:
a+b = 1000-c
(a+b)^2 = (1000-c)^2
Expanding and substituting a^2 + b^2 = c^2 gives 2ab = 1000000 - 2000c; replacing c with 1000-a-b and solving for a reduces this to:
a = (500000 - 1000*b)/(1000 - b)
We loop for "b", and find "a".
Once we have "a" and "b", we get "c".
public long pythagorasTriplet(){
    long a = 0, b=0 , c=0;
    for(long divisor=1; divisor<1000; divisor++){
        if( ((500000-(1000*divisor))%(1000-divisor)) ==0){
            a = (500000 - (1000*divisor))/(1000-divisor);
            b = divisor;
            c = (long)Math.sqrt(a*a + b*b);
            System.out.println("a is " + a + " b is: " + b + " c is : " + c);
            break;
        }
    }
    return a*b*c;
}
As others have mentioned you need to understand the ^ operator.
Also your algorithm will produce multiple equivalent answers with the parameters a,b and c in different orders.
As many people have pointed out, your code will work once you switch from ^ to a proper squaring (pow or a*a). If you're interested in learning a bit of math theory as it applies to CS, I would recommend trying to implement a more efficient version using "Euclid's formula" for generating Pythagorean triples (link).
Euclid's method gives the perimeter as m(m+n) = p/2, where m > n; the hypotenuse is m^2+n^2 and the legs are 2mn and m^2-n^2. Thus m(m+n) = 500 quickly gives m = 20 and n = 5. The sides are 200, 375 and 425. Use Euclid's formula to solve all primitive Pythagorean triple questions like this.
As there are two equations (a+b+c = 1000 && a^2 + b^2 = c^2) with three variables, we can solve it in linear time by just looping through all possible values of one variable, and then we can solve the other 2 variables in constant time.
From the first formula, we get b = 1000-a-c, and if we replace b in the 2nd formula with this, we get c^2 = a^2 + (1000-a-c)^2, which simplifies to c = (a^2 + 500000 - 1000a)/(1000-a).
Then we loop through all possible values of a, solve c and b with the above formulas, and if the conditions are satisfied we have found our triplet.
int n = 1000;
for (int a = 1; a < n; a++) {
    int c = (a*a + 500000 - 1000*a) / (1000 - a);
    int b = (1000 - a - c);
    if (b > a && c > b && (a * a + b * b) == c * c) {
        return a * b * c;
    }
}
for a in range(1,334):
    for b in range(500, a, -1):
        if a + b < 500:
            break
        c = 1000 - a - b
        if a**2 + b**2 == c**2:
            print(a,b,c)
Further optimization from Oleg's answer.
One side cannot be greater than the sum of the other two.
So a + b cannot be less than 500.
I think the best approach here is this:
int n = 1000;
unsigned long long b =0;
unsigned long long c =0;
for(int a =1;a<n/3;a++){
    b=((a*a)- (a-n)*(a-n)) /(2*(a-n));
    c=n-a-b;
    if(a*a+b*b==c*c)
        cout<<a<<' '<<b<<' '<<c<<endl;
}
explanation:
We express b and c in terms of n and a, so we don't need two more loops.
We can do this because
c = n - a - b and b = (a^2 - (a-n)^2)/(2(a-n))
I got these formulas by solving the system of equations:
a+b+c = n,
a^2+b^2 = c^2
func maxProd(sum:Int)->Int{
    var prod = 0
    // var b = 0
    var c = 0
    let bMin:Int = (sum/4)+1 // b cannot be less than sum/4+1, as (a+b) must be greater than c: there is no triangle if this condition is false, and any Pythagorean triple can be represented by a triangle.
    for b in bMin..<sum/2 {
        for a in ((sum/2) - b + 1)..<sum/3{ //as (a+b)>c for a valid triangle
            c = sum - a - b
            let csquare = Int(pow(Double(a), 2) + pow(Double(b), 2))
            if(c*c == csquare){
                let newProd = a*b*c
                if(newProd > prod){
                    prod = newProd
                    print(a,b,c)
                }
            }
        }
    }
    //
    return prod
}
The answers above are good enough, but they are missing one important piece of information: a + b > c. ;)
More details will be provided to those who ask.
With Python:
def findPythagorean1000():
    for c in range(1001):
        for b in range(1,c):
            for a in range(1,b):
                if (a+b+c==1000):
                    if (pow(a,2)+pow(b,2)) == pow(c,2):
                        print(a,b,c)
                        print(a*b*c)
                        return
findPythagorean1000()

Optimizing for speed - 4 dimensional array lookup in C

I have a fitness function that is scoring the values on an int array based on data that lies on a 4D array. The profiler says this function is using 80% of CPU time (it needs to be called several million times). I can't seem to optimize it further (if it's even possible). Here is the function:
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */
unsigned int get_i_score(unsigned int *input) {
    register unsigned int i, score = 0;
    for(i = len - 3; i--; )
        score += lookup_array[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
    return(score);
}
I've tried to flatten the array to a single dimension but there was no improvement in performance. This is running on an IA32 CPU. Any CPU specific optimizations are also helpful.
Thanks
What is the range of the array items? If you can change the array base type to unsigned short or unsigned char, you might get fewer cache misses because a larger portion of the array fits into the cache.
Most of your time probably goes into cache misses. If you can optimize those away, you can get a big performance boost.
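For instance (a sketch, assuming the stored scores actually fit in the smaller type):
/* 26*26*26*26 = 456,976 entries: about 1.8 MB as unsigned int,
   but only about 457 KB as unsigned char and 914 KB as unsigned short */
unsigned char  lookup_array8[26][26][26][26];   /* if values never exceed 255 */
unsigned short lookup_array16[26][26][26][26];  /* if they fit in 16 bits */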
Remember that C/C++ arrays are stored in row-major order. Remember to store your data so that addresses referenced closely in time reside closely in memory. For example, it may make sense to store sub-results in a temporary array. Then you could process exactly one row of elements located sequentially. That way the processor cache will always contain the row during iterations and less memory operations will be required. However, you might need to modularize your lookup_array function. Maybe even split it into four (by the number of dimensions in your array).
The problem is definitely related to the size of the matrix. You cannot optimize it just by declaring it as a single array, because that is what the compiler does automatically.
Everything depends on the order in which you access the data, namely on the content of the input array.
The only thing you can do is work on locality: read this one, it should give you some inspiration.
By the way, I suggest you replace the input array with four parameters: it will be more intuitive and less error prone.
Good luck
A few suggestions to improve performance:
Parallelise. This is a very easy reduction to be programmed in OpenMP or MPI (a sketch of the OpenMP version follows these suggestions).
Reorder data to improve locality. Try sorting input first, for example.
Use streaming processing instructions if the compiler is not already doing it.
About reordering, it would be possible if you flatten the array and use linear coordinates instead.
Another point, compare the theoretical peak performance of your processor (integer operations) with the performance you're getting (do a quick count of the assembly generated instructions, multiply by the length of the input, etc.) and see if there's room for a significant improvement there.
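A minimal sketch of the OpenMP reduction from the first suggestion, assuming the same loop shape as the question's function and the same global lookup_array (the function name is mine; the original i-- loop covers i = 0 .. len-4, i.e. i < len-3):
unsigned int get_i_score_parallel(const unsigned int *input, unsigned int len)
{
    unsigned int score = 0;
    long i;
    #pragma omp parallel for reduction(+:score)
    for (i = 0; i < (long)len - 3; i++)
        score += lookup_array[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
    return score;
}
Compile with your compiler's OpenMP flag (e.g. -fopenmp for gcc); without it the pragma is simply ignored and the loop runs serially.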
I have a couple of suggestions:
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */
unsigned int get_i_score(unsigned int *input, unsigned int len) {
    register unsigned int i, score = 0;
    unsigned int *a=input;
    unsigned int *b=input+1;
    unsigned int *c=input+2;
    unsigned int *d=input+3;
    for(i = 0; i < (len - 3); i++, a++, b++, c++, d++)
        score += lookup_array[*a][*b][*c][*d];
    return(score);
}
Or try
for(i = 0; i < (len - 3); i++, a=b, b=c, c=d, d++)
    score += lookup_array[*a][*b][*c][*d];
Also, given that there are only 26 values, why are you putting the input array in terms of unsigned ints? If it were char *input, you'd be using 1/4 as much memory and therefore using 1/4 of the memory bandwidth. Obviously the types of a through d have to match. Similarly, if the score values don't need to be unsigned ints, make the array smaller by using chars or uint16_t.
You might be able to squeeze a bit more out by unrolling the loop in some variation of Duff's device.
Multidimensional arrays often constrain the compiler to one or more multiply operations. This may be slow on some CPUs. A common workaround is to transform the N-dimensional array into an array of pointers to elements of (N-1) dimensions. With a 4-dim. array this is quite annoying (26 pointers to 26*26 pointers to 26*26*26 rows...), so I suggest you try it and compare the results. It is not guaranteed to be faster: compilers are quite smart in optimizing array accesses, while a chain of indirect accesses has a higher probability of invalidating the cache.
Bye
If lookup_array is mostly zeroes, it could definitely be replaced with a hash table lookup on a smaller array. The inline lookup function could calculate the flat offset of the 4 dimensions ([5,6,7,8] = (5*26*26*26)+(6*26*26)+(7*26)+8 = 92126). The hash key could just be the lower few bits of the offset (depending on how sparse the array is expected to be). If the offset exists in the hash table, use the value; if it doesn't exist, it's 0...
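A rough sketch of that sparse lookup (the names, table size and probing scheme are my own choices, and the table must be filled from the non-zero entries beforehand):
#define HASH_BITS 16
#define HASH_SIZE (1u << HASH_BITS)

/* key = flat offset + 1, so that 0 can mean "empty slot" */
static struct { unsigned int key, value; } sparse_table[HASH_SIZE];

static unsigned int flat_offset(unsigned int a, unsigned int b,
                                unsigned int c, unsigned int d)
{
    return ((a * 26u + b) * 26u + c) * 26u + d;
}

static unsigned int sparse_lookup(unsigned int off)
{
    unsigned int h = off & (HASH_SIZE - 1);          /* lower bits as hash key */
    while (sparse_table[h].key != 0) {               /* linear probing */
        if (sparse_table[h].key == off + 1)
            return sparse_table[h].value;
        h = (h + 1) & (HASH_SIZE - 1);
    }
    return 0;                                        /* absent entry scores 0 */
}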
the loop could also be unrolled using something like this if the input has arbitrary length. there are only len accesses of input needed (instead of around len * 4 in the original loop).
register int j, x1, x2, x3, x4;
register unsigned int *p;
p = input;
x1 = *p++;
x2 = *p++;
x3 = *p++;
for (j = (len - 3) / 20; j--; ) {
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = *p++;
    score += lookup_array[x2][x3][x4][x1];
    x2 = *p++;
    score += lookup_array[x3][x4][x1][x2];
    x3 = *p++;
    score += lookup_array[x4][x1][x2][x3];
    /* that's 20 iterations, add more if you like */
}
for (j = (len - 3) % 20; j--; ) {
    x4 = *p++;
    score += lookup_array[x1][x2][x3][x4];
    x1 = x2;
    x2 = x3;
    x3 = x4;
}
If you convert it to a flat array of size 26*26*26*26, you only need to look up the input array once per loop:
unsigned int get_i_score(unsigned int *input)
{
    unsigned int i = len - 3, score = 0, index;
    index = input[i] * 26 * 26 +
            input[i + 1] * 26 +
            input[i + 2];
    while (i--)   /* i runs len-4 down to 0, like the original loop */
    {
        index += input[i] * 26 * 26 * 26;
        score += lookup_array[index];
        index /= 26;
    }
    return score;
}
The additional cost is a multiplication and a division. Whether it ends up being faster in practice - you'll have to test.
(By the way, the register keyword is often ignored by modern compilers - it's usually better to leave register allocation up to the optimiser).
Does the content of the array change much? Perhaps it would be faster to pre-calculate the score, and then modify that pre-calculated score every time the array changes? Similar to how you can materialize a view in SQL using triggers.
Maybe you can eliminate some accesses to the input array by using local variables.
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */
unsigned int get_i_score(unsigned int *input, unsigned int len) {
    unsigned int i, score, a, b, c, d;
    score = 0;
    i = len - 3;          /* preload the last window... */
    a = input[i + 0];
    b = input[i + 1];
    c = input[i + 2];
    d = 0;                /* ...d is overwritten before its first use */
    for (i = len - 3; i-- > 0; ) {
        d = c, c = b, b = a, a = input[i];
        score += lookup_array[a][b][c][d];
    }
    return score;
}
Moving around registers may be faster than accessing memory, although this kind of memory should remain in the innermost cache anyway.
