How to do aligned additions without aligned arrays - c

So I was trying to do an array operation that looked something like
for (int i = 0; i < 32; i++)
{
    output[offset + i] += input[i];
}
where output and input are float arrays (which are 16-byte aligned thanks to malloc). However, I can't guarantee that offset % 4 == 0. I was wondering how you could fix these alignment problems.
I thought of something like
int c = 0;
while ((offset + c) % 4 != 0)
{
    output[offset + c] += input[c];
    c++;
}
followed by an aligned loop - but obviously this can't work, as the main loop would then need unaligned accesses to input.
Is there a way to vectorize my original loop?

Moving comments to an answer:
There are SSE instructions for misaligned memory accesses. They are accessible via the following intrinsics:
_mm_loadu_ps()
_mm_storeu_ps()
and similarly for all the double and integer types.
So if you can't guarantee alignment, then this is the easy way to go. If possible, the ideal solution is to align your arrays from the start so that you avoid this problem altogether.
There will still be a performance penalty for misaligned accesses, but they're unavoidable unless you resort to extremely messy shift/shuffle hacks (such as _mm_alignr_epi8()).
The code using _mm_loadu_ps and _mm_storeu_ps - this actually turned out to be about 50% slower than what gcc generates by itself:
for (int j = 0; j < 8; j++)
{
    float* out = &output[offset + j*4];
    __m128 in = ((__m128*)input)[j];                 // input is aligned, so no need for _mm_loadu_ps
    __m128 res = _mm_add_ps(in, _mm_loadu_ps(out));  // add values
    _mm_storeu_ps(out, res);                         // store result
}

Related

Maximizing the performance and efficiency of triangularizing a 24x24 matrix in C and then in MIPS assembly

Recently an interest in computer architecture and performance has been sparked in me. With that said, I have been picking up an "easier" assembly language to really learn how things work under the hood, namely MIPS assembly. I feel comfortable enough to experiment with some more advanced stuff, and as such I have decided to combine programming with my interest in mathematics.
My goal is simple: given a 24x24 matrix A (I don't care about any other size), I want to write an algorithm that finds the upper triangular form of the matrix as efficiently as possible. By efficiently I mean that I want to use the resources of the processor I am running on as well as I can: a high cache hit rate, efficient use of memory (locality of reference, etc.), and good wall-clock performance.
Eventually my goal is to translate the C solution into MIPS assembly and tailor it to the memory subsystem of the processor I will run it on. Regarding the processor, I will have different options to play around with when it comes to caches, write buffers and memory, in the sense that I can vary cache sizes, block sizes, associativity levels, memory access times, etc. Performance in this case will be measured as the time it takes to triangularize a 24x24 matrix.
To begin, I need to write some high-level code and solve the problem there before diving into MIPS assembly. I have looked around and came up with this seemingly standard solution. It isn't necessarily super fast, nor do I think it is optimal for triangularizing 24x24 matrices. Can I do better?
void triangularize(float **A, int N)
{
    int i, j, k;
    // Loop over the diagonal (pivot) elements
    for (k = 0; k < N; k++)
    {
        // Loop over all elements in the pivot row, right of the pivot ELEMENT
        for (j = k + 1; j < N; j++)
        {
            // Divide by the pivot element
            A[k][j] = A[k][j] / A[k][k];
        }
        // Set the pivot element
        A[k][k] = 1.0;
        // Loop over all elements below the pivot ROW and right of the pivot COLUMN
        for (i = k + 1; i < N; i++)
        {
            for (j = k + 1; j < N; j++)
            {
                A[i][j] = A[i][j] - A[i][k] * A[k][j];
            }
            A[i][k] = 0.0;
        }
    }
}
Furthermore, what should be my next steps when trying to convert the C code to MIPS assembly with respect to maximizing performance and minimizing cost (cache hit rates, IO costs when dealing with memory etc.) to get a lightning fast and efficient solution?
First of all, encoding a matrix as a jagged array (i.e. float**) is generally not efficient, as it causes unnecessary, expensive indirections, and the rows may not be contiguous in memory, resulting in more cache misses or even cache thrashing in pathological cases. It is certainly better to copy the matrix into a contiguous flattened array. Please consider storing your matrices as flattened arrays, which are generally more efficient (especially on MIPS). A flattened array can be indexed using something like array[i*24+j] instead of array[i][j].
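For illustration, a minimal sketch of that copy (the flat buffer name A_flat is mine, not something from the question):
/* Sketch: copy the jagged float** matrix into a contiguous 24x24 buffer
   and index it as A_flat[i*24 + j] from then on. */
float A_flat[24 * 24];
for (int i = 0; i < 24; i++)
    for (int j = 0; j < 24; j++)
        A_flat[i * 24 + j] = A[i][j];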
Moreover, if you do not care about matrices other than 24x24 ones, then you can write code specialized for 24x24 matrices. This helps compilers generate more efficient assembly code (typically by unrolling loops and using more efficient instructions, like multiplication by a constant).
Additionally, divisions are generally expensive, especially on embedded MIPS processors. Thus, you can replace the divisions with multiplications by the inverse. For example:
float inv = 1.0f / A[k][k];
for (j = k + 1; j < N; j++)
    A[k][j] *= inv;
Note that the result might be slightly different due to floating-point rounding. You can use the -ffast-math compiler flag to help the compiler generate such optimisations, if you know that special values like NaN or Inf do not appear in the matrix.
Moreover, it may be faster to unroll the loop manually, since not all compilers do that (properly). That being said, the benefit of loop unrolling is very dependent on the target processor (unspecified here); without more information, it is very hard to know whether this is useful. For example, some processors can execute multiple floating-point operations per cycle, while others cannot even execute them natively (i.e. no hardware FP unit): the operations are emulated with many instructions, which is very expensive (compilers like GCC emit function calls for basic operations like addition/subtraction on such processors). If there is no hardware FP unit, it might be faster to use fixed-point arithmetic.
Finally, some MIPS processors have a 128-bit SIMD unit. Using it should significantly speed up the execution. Compilers should be able to mostly auto-vectorize your code, but you need to tell them that your target processor supports it (see the -march flag for GCC/Clang). For a fixed-size matrix, manual vectorization often results in faster execution than auto-vectorisation, assuming you write efficient code.
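Putting the flattening, the fixed 24x24 size and the reciprocal trick together, a sketch of what the specialized routine might look like (the function name and the flat-array argument are my assumptions; a compiler with the right -march flag can then unroll and auto-vectorize the contiguous inner loops):
/* Sketch of a 24x24-specialized triangularization on a flat array. */
void triangularize_24(float *A)          /* A points to 24*24 contiguous floats */
{
    enum { N = 24 };
    for (int k = 0; k < N; k++)
    {
        float inv = 1.0f / A[k * N + k]; /* one division per pivot */
        for (int j = k + 1; j < N; j++)
            A[k * N + j] *= inv;         /* scale the pivot row */
        A[k * N + k] = 1.0f;
        for (int i = k + 1; i < N; i++)
        {
            float f = A[i * N + k];
            for (int j = k + 1; j < N; j++)   /* contiguous, auto-vectorizable inner loop */
                A[i * N + j] -= f * A[k * N + j];
            A[i * N + k] = 0.0f;
        }
    }
}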

Broadcasting each element of a SIMD register in a loop

I need to fill a SIMD register with one element of another SIMD register. i.e. "broadcast" or "splat" a single element to every position.
My current code for doing it is (it's simplified, my real functions are declared inline):
__m128
f4_broadcast_1(__m128 a, int i) {
    return _mm_set1_ps(a[i]);
}
This seems to generate efficient code on clang and gcc, but MSVC forbids index accesses. Therefore, I instead write:
__m128
f4_broadcast_2(__m128 a, int i) {
    union { __m128 reg; float f[4]; } r = { .reg = a };
    return _mm_set1_ps(r.f[i]);
}
It generates the same code on clang and gcc but bad code on msvc. Godbolt link: https://godbolt.org/z/IlOqZl
Is there a better way to do it? I know there are similar questions on SO already, but my use case involves both extracting a float32 from a register and putting it back into another one, which is a slightly different problem. It would be cool if you could do this without having to touch the main memory at all.
Is the index a variable or a constant? Apparently it matters a lot to SIMD performance which it is. In my case, the index is a loop variable:
for (int i = 0; i < M; i++) {
    ... broadcast element i of some reg
}
where M is either 4, 8 or 16. Maybe I should manually unroll the loops to make it a constant? It's a lot of code in the for-loop, so the amount of code would grow considerably.
I also wonder how to do the same thing for the __m256 and __m512 registers found on modern CPUs.
Some of the shuffles in Get an arbitrary float from a simd register at runtime? can be adapted to broadcast an element instead of just getting one copy of it in the low element. It discusses tradeoffs of shuffle vs. store/reload strategies in more detail.
x86 doesn't have a 32-bit-element variable-control shuffle until AVX vpermilps and AVX2 lane-crossing vpermps / vpermd. e.g.
// for runtime-variable i. Otherwise use something more efficient.
_mm_permutevar_ps(v, _mm_set1_epi32(i));
Or broadcast the low element with vbroadcastss (the vector-source version requires AVX2)
Broadcast loads are very efficient with AVX1: _mm_broadcast_ss(float*) (or _mm256/512 of the same) or simply 128/256/512 _mm_set1_ps(float) of a float that happened to come from memory, and let your compiler use a broadcast load if compiling with AVX1 enabled.
With a compile-time-constant control, you can broadcast any single element with SSE1
_mm_shuffle_ps(same,same, _MM_SHUFFLE(i,i,i,i));
Or for integer, with SSE2 pshufd: _mm_shuffle_epi32(v, _MM_SHUFFLE(i,i,i,i)).
Depending on your compiler, it may have to be a macro for i to be a compile-time constant with optimization disabled. The shuffle-control constant has to compile into an immediate byte (with 4x 2-bit fields) embedded in the machine code, not loaded as data or from a register.
Iterating over elements in a loop.
I'm using AVX2 in this section; this easily adapts to AVX512. Without AVX2 the store/reload strategy is your only good option for 256-bit vectors, or vpermilps for 128-bit vectors.
Incrementing a byte-shuffle control (by 4 each step) for SSSE3 pshufb (with casting between __m128i and __m128) could be a good idea without AVX, where you don't have an efficient broadcast load. A sketch of that idea follows.
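Here is my own illustration of that, assuming SSSE3 and a __m128 v in scope; the control starts by selecting bytes 0..3 and every byte is bumped by 4 per iteration:
/* Broadcast element i of v in iteration i using an incrementing pshufb control. */
__m128i ctrl = _mm_setr_epi8(0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3);
for (int i = 0; i < 4; i++) {
    __m128 bcast = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(v), ctrl));
    ... do stuff with bcast ...
    ctrl = _mm_add_epi8(ctrl, _mm_set1_epi8(4));   // select the next 4-byte element
}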
the index is a loop variable
Compilers will often fully unroll loops for you, turning the loop variable into a compile-time constant for each iteration. But only with optimization enabled. In C++ you could maybe use template recursion to iterate with a constexpr.
MSVC doesn't optimize intrinsics, so if you write _mm_permutevar_ps(v, _mm_set1_epi32(i)); you're actually going to get that in each iteration, not 4x vshufps. But gcc and especially clang do optimize shuffles, so they should do well with optimization enabled.
It's a lot of code in the for-loop
If it's going to need a lot of registers / spend a lot of time, a store/reload might be a good choice especially with AVX available for broadcast reloads. Shuffle throughput is more limited (1/clock) than load throughput (2/clock) on current Intel CPUs.
Compiling your code with AVX512 will even allow broadcast memory-source operands, not a separate load instruction, so the compiler can even fold a broadcast-load into a source operand if it's only needed once.
/********* Store/reload strategy ****************/
#include <stdalign.h>
void foo(__m256 v) {
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, v);

    // peeled first iteration: broadcast the low element with an ALU op
    // (with only AVX1, maybe don't peel it, or broadcast manually in 2 steps)
    __m256 bcast = _mm256_broadcastss_ps(_mm256_castps256_ps128(v)); // AVX2 vbroadcastss ymm, xmm
    ... do stuff with bcast ...

    for (int i = 1; i < 8; i++) {
        bcast = _mm256_broadcast_ss(&tmp[i]);   // AVX1 broadcast load
        ... do stuff with bcast ...
    }
}
I peeled the first iteration manually to just broadcast the low element with an ALU operation (lower latency) so it can get started right away. Later iterations then reload with a broadcast load.
Another option would be to use a SIMD increment for a vector shuffle-control (aka mask), if you have AVX2.
// Also AVX2
void foo(__m256 v) {
    __m256i shufmask = _mm256_setzero_si256();
    for (int i = 0; i < 8; i++) {
        __m256 bcast = _mm256_permutevar8x32_ps(v, shufmask);  // AVX2 vpermps
        // prep for next iteration by incrementing the element selectors
        shufmask = _mm256_add_epi32(shufmask, _mm256_set1_epi32(1));
        ... do stuff with bcast ...
    }
}
This does one redundant vpaddd on shufmask (in the last iteration), but that's probably fine and better than peeling the first or last iteration. And obviously better than starting with -1 and doing an add before the shuffle in the first iteration.
Lane-crossing shuffles have 3-cycle latency on Intel so putting it right after the shuffle is probably good scheduling unless there's other per-iteration work that doesn't depend on bcast; out-of-order exec makes this a minor issue anyway. In the first iteration, vpermps with a mask that was just xor-zeroed is basically just as good as vbroadcastss on Intel, for out-of-order exec to get started quickly.
But on AMD CPUs (at least before Zen2), lane-crossing vpermps is pretty slow; lane-crossing shuffles with granularity <128-bit are extra expensive because it has to decode into 128-bit uops. So this strategy isn't wonderful on AMD. If store/reload performs equally for your surrounding code on Intel, then it might be a better choice to make your code AMD-friendly as well.
vpermps also has a new intrinsic introduced with AVX512 intrinsics: _mm256_permutexvar_ps(__m256i idx, __m256 a) which has the operands in the order that matches asm. Use whichever one you like, if your compiler supports the new one.
Broadcasting can be achieved by using the AVX2 instruction VBROADCASTSS, but moving the value to the input position (first position) depends on your instruction set:
VBROADCASTSS (128-bit, VEX-encoded version)
This instruction broadcasts the value in position [0] of the source XMM register to all four FLOAT slots of the destination XMM register. Its intrinsic is __m128 _mm_broadcastss_ps(__m128 a);.
If the position of your value is constant, you can use the instruction PSHUFD to move the value from its current position to the first position. Its intrinsic is __m128i _mm_shuffle_epi32(__m128i a, int n). To move the value that should be broadcasted to the first position of the input XMM vector, use the following values for int n:
position 1: n = 0h
position 2: n = 1h
position 3: n = 2h
position 4: n = 3h
This moves the value from the chosen position to the first position.
So, for example, use the following to move the fourth position of the input vector to the first one:
__m128i newInput = _mm_shuffle_epi32(_mm_castps_si128(input), 3);
Then apply the following intrinsic:
__m128 result = _mm_broadcastss_ps(_mm_castsi128_ps(newInput));
Now the value from the fourth position of your input XMM vector should be on all positions of your result vector.

Matrix Multiplication of size 100*100 using SSE Intrinsics

int MAX_DIM = 100;
float a[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float b[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float d[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
/*
 * I fill these arrays with some values
 */
for (int i = 0; i < MAX_DIM; i += 1) {
    for (int j = 0; j < MAX_DIM; j += 4) {
        for (int k = 0; k < MAX_DIM; k += 4) {
            __m128 result  = _mm_load_ps(&d[i][j]);
            __m128 a_line  = _mm_load_ps(&a[i][k]);
            __m128 b_line0 = _mm_load_ps(&b[k][j+0]);
            __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
            __m128 b_line2 = _mm_loadu_ps(&b[k][j+2]);
            __m128 b_line3 = _mm_loadu_ps(&b[k][j+3]);
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x00), b_line0));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x55), b_line1));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xaa), b_line2));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xff), b_line3));
            _mm_store_ps(&d[i][j], result);
        }
    }
}
The above code is what I wrote to do matrix multiplication using SSE. The code works as follows: I take 4 elements from a row of a, multiply them by 4 elements from a column of b, then move to the next 4 elements in the row of a and the next 4 elements in the column of b.
I get a Segmentation fault (core dumped) error and I don't really know why.
I use gcc 5.4.0 on Ubuntu 16.04.5.
Edit:
The segmentation fault was solved by using _mm_loadu_ps.
Also, there is something wrong with the logic; I will be grateful if someone helps me find it.
The segmentation fault was solved by _mm_loadu_ps Also there is something wrong with logic...
You're loading 4 overlapping windows on b[k][j+0..7]. (This is why you needed loadu).
Perhaps you meant to load b[k][j+0], +4, +8, +12? If so, you should align b by 64, so all four loads come from the same cache line (for performance). Strided access is not great, but using all 64 bytes of every cache line you touch is a lot better than getting row-major vs. column-major totally wrong in scalar code with no blocking.
I take 4 elements from row from a multiply it by 4 elements from a column from b
I'm not sure your text description describes your code.
Unless you've already transposed b, you can't load multiple values from the same column with a SIMD load, because they aren't contiguous in memory.
C multidimensional arrays are "row major": the last index is the one that varies most quickly when moving to the next higher memory address. Did you think that _mm_loadu_ps(&b[k][j+1]) was going to give you b[k+0..3][j+1]? If so, this is a duplicate of SSE matrix-matrix multiplication (That question is using 32-bit integer, not 32-bit float, but same layout problem. See that for a working loop structure.)
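If it helps, here is a sketch of that working loop structure adapted to the arrays from the question (broadcast one element of a and use a contiguous row of b; this assumes d[][] starts zeroed, or already holds whatever you want to accumulate into):
for (int i = 0; i < MAX_DIM; i++) {
    for (int k = 0; k < MAX_DIM; k++) {
        __m128 a_ik = _mm_set1_ps(a[i][k]);       // broadcast one element of a
        for (int j = 0; j < MAX_DIM; j += 4) {
            __m128 res = _mm_load_ps(&d[i][j]);
            res = _mm_add_ps(res, _mm_mul_ps(a_ik, _mm_load_ps(&b[k][j])));
            _mm_store_ps(&d[i][j], res);          // d[i][j..j+3] += a[i][k] * b[k][j..j+3]
        }
    }
}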
To debug this, put a simple pattern of values into b[]. Like
#include <stdalign.h>
alignas(64) float b[MAX_DIM][MAX_DIM] = {
      0,   1,   2,   3,   4, ...,
    100, 101, 102, ...,
    200, 201, 202, ...,
};
// i.e. for (...) b[i][j] = 100 * i + j;
Then when you step through your code in the debugger, you can see what values end up in your vectors.
For your a[][] values, maybe use 90000.0 + 100 * i + j so if you're looking at registers (instead of C variables) you can still tell which values are a and which are b.
Related:
Ulrich Drepper's What Every Programmer Should Know About Memory shows an optimized matmul with cache-blocking with SSE intrinsics for double-precision. Should be straightforward to adapt for float.
How does BLAS get such extreme performance? (You might want to just use an optimized matmul library; tuning matmul for optimal cache-blocking is non-trivial but important)
Matrix Multiplication with blocks
Poor maths performance in C vs Python/numpy has some links to other questions
how to optimize matrix multiplication (matmul) code to run fast on a single processor core

Finding the instances of the number in a vector array in KNC (Xeon Phi)

I am trying to exploit the 512-bit SIMD offered by KNC (Xeon Phi) to improve the performance of the C code below using Intel intrinsics. However, my intrinsic-based code runs slower than the auto-vectorized code.
C Code
int64_t match = 0;
int *myArray __attribute__((align(64)));
myArray = (int*) malloc(sizeof(int) * SIZE);  // SIZE is the array size taken from the user
radomize(myArray);                            // fill with some random data
int searchVal = 24;
#pragma vector always
for (int i = 0; i < SIZE; i++) {
    if (myArray[i] == searchVal) match++;
}
return match;
Intrinsic embedded code:
In the code below I first load the array and compare it with the search key. The compare intrinsic returns a 16-bit mask value, which is reduced using _mm512_mask_reduce_add_epi32().
register int64_t match = 0;
int *myArray __attribute__((align(64)));
myArray = (int*) malloc(sizeof(int) * SIZE);  // SIZE is the array size taken from the user
const int values[16] = {
    1, 1, 1, 1,
    1, 1, 1, 1,
    1, 1, 1, 1,
    1, 1, 1, 1,
};
__m512i const flag = _mm512_load_epi32((void*) values);
__mmask16 countMask;
__m512i searchVal = _mm512_set1_epi32(16);
__m512i kV = _mm512_setzero_epi32();
for (int i = 0; i < SIZE; i += 16)
{
    // kV = _mm512_setzero_epi32();
    kV = _mm512_loadunpacklo_epi32(kV, (void*)(&myArray[i]));
    kV = _mm512_loadunpackhi_epi32(kV, (void*)(&myArray[i + 16]));
    countMask = _mm512_cmpeq_epi32_mask(kV, searchVal);
    match += _mm512_mask_reduce_add_epi32(countMask, flag);
}
return match;
I believe I have somehow introduced extra cycles in this code, and hence it runs slower than the auto-vectorized code. Unlike 128-bit SIMD, which directly returns the result of the compare in a 128-bit register, 512-bit SIMD returns the result in a mask register, which adds complexity to my code. Am I missing something here? There must be a way to directly compare and keep a count of successful matches rather than going through mask operations.
Finally, please suggest ways to increase the performance of this code using intrinsics. I believe I can squeeze out more performance with intrinsics. This was at least true for 128-bit SIMD, where using intrinsics gained me 25% performance.
I suggest the following optimizations (see the sketch after this list):
Use prefetching. Your code performs very little computation and is almost surely bandwidth-bound. Xeon Phi has hardware prefetching only for the L2 cache, so for optimal performance you need to insert prefetch instructions manually.
Use aligned reads (_mm512_load_epi32), as hinted by @PaulR. Use the memalign function instead of malloc to guarantee that the array is really aligned on 64 bytes. And in case you ever need misaligned loads, use _mm512_undefined_epi32() as the source for the first misaligned load, as it breaks the dependency on kV (in your current code) and lets the compiler do additional optimizations.
Unroll the loop by 2, or use at least two threads, to hide instruction latency.
Avoid using an int variable as the index; unsigned int, size_t or ssize_t are better options.
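A sketch of the aligned, 2x-unrolled loop, using only the intrinsics already shown above (this assumes myArray was allocated with memalign(64, ...) and that SIZE is a multiple of 32):
/* Aligned loads, unrolled by 2. */
__m512i vSearch = _mm512_set1_epi32(searchVal);
__m512i ones    = _mm512_set1_epi32(1);
int64_t match   = 0;
for (size_t i = 0; i < SIZE; i += 32)
{
    __m512i v0 = _mm512_load_epi32(&myArray[i]);
    __m512i v1 = _mm512_load_epi32(&myArray[i + 16]);
    match += _mm512_mask_reduce_add_epi32(_mm512_cmpeq_epi32_mask(v0, vSearch), ones);
    match += _mm512_mask_reduce_add_epi32(_mm512_cmpeq_epi32_mask(v1, vSearch), ones);
}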

32x32 Multiply and add optimization

I'm working on optimizing an application. I found that I need to optimize an inner loop for improved performance.
rgiFilter is a 16-bit array.
for (i = 0; i < iLen; i++) {
    iPredErr = (I32)*rgiResidue;
    rgiFilter = rgiFilterBuf;
    rgiPrevVal = rgiPrevValRdBuf + iRecent;
    rgiUpdate = rgiUpdateRdBuf + iRecent;
    iPred = iScalingOffset;
    for (j = 0; j < iOrder_Div_8; j++) {
        iPred += (I32) rgiFilter[0] * rgiPrevVal[0];
        rgiFilter[0] += rgiUpdate[0];
        iPred += (I32) rgiFilter[1] * rgiPrevVal[1];
        rgiFilter[1] += rgiUpdate[1];
        iPred += (I32) rgiFilter[2] * rgiPrevVal[2];
        rgiFilter[2] += rgiUpdate[2];
        iPred += (I32) rgiFilter[3] * rgiPrevVal[3];
        rgiFilter[3] += rgiUpdate[3];
        iPred += (I32) rgiFilter[4] * rgiPrevVal[4];
        rgiFilter[4] += rgiUpdate[4];
        iPred += (I32) rgiFilter[5] * rgiPrevVal[5];
        rgiFilter[5] += rgiUpdate[5];
        iPred += (I32) rgiFilter[6] * rgiPrevVal[6];
        rgiFilter[6] += rgiUpdate[6];
        iPred += (I32) rgiFilter[7] * rgiPrevVal[7];
        rgiFilter[7] += rgiUpdate[7];
        rgiFilter += 8;
        rgiPrevVal += 8;
        rgiUpdate += 8;
    }
    // ... rest of the outer loop body ...
}
Your only bet is to do more than one operation at a time, and that means one of these 3 options:
SSE instructions (SIMD). You process multiple memory locations with a single instruction.
Multi-threading (MIMD). This works best if you have more than one CPU core. Split your array into multiple, similarly sized strips that are independent of each other (dependencies will increase this option's complexity a lot, to the point of being slower than calculating everything sequentially if you need a lot of locks). Note that the array has to be big enough to offset the extra context-switching and synchronization overhead (it's pretty small, but not negligible). Best for 4 cores or more.
Both at once. If your array is really big, you could gain a lot by combining both.
If rgiFilterBuf, rgiPrevValRdBuf and rgiUpdateRdBuf are function parameters that don't alias, declare them with the restrict qualifier. This will allow the compiler to optimise more aggressively. For example:
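Here is a sketch of how such declarations could look (the function name and exact parameter types are my assumptions; only the array names come from the question):
/* 'restrict' promises the compiler these buffers never overlap. */
void predict_block(I16 *restrict rgiFilterBuf,
                   const I16 *restrict rgiPrevValRdBuf,
                   const I16 *restrict rgiUpdateRdBuf,
                   int iLen, int iOrder_Div_8);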
As some others have commented, your inner loop looks like it may be a good fit for vector processing instructions (like SSE, if you're on x86). Check your compiler's intrinsics.
I don't think you can do much to optimize it in C. Your compiler might have options to generate SIMD code, but you probably need to just go and write your own SIMD assembly code if performance is critical...
You can replace the inner loop with very few SSE2 intrinsics.
See _mm_madd_epi16 to replace the eight
iPred += (I32) rgiFilter[] * rgiPrevVal[];
and _mm_add_epi16 (or _mm_add_epi32) to replace the eight
rgiFilter[] += rgiUpdate[];
You should see a nice acceleration with that alone.
These intrinsics are specific to Microsoft and Intel compilers.
I am sure equivalents exist for GCC; I just haven't used them.
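A sketch of that replacement (these SSE2 intrinsics are also available via <emmintrin.h> on GCC/Clang; this assumes I16/I32 are signed 16/32-bit types, that unaligned loads are acceptable, and that the four pairwise 32-bit sums produced by madd cannot overflow):
#include <emmintrin.h>

__m128i acc = _mm_setzero_si128();                     /* four partial 32-bit sums of iPred */
for (j = 0; j < iOrder_Div_8; j++) {
    __m128i f = _mm_loadu_si128((__m128i const *)rgiFilter);
    __m128i p = _mm_loadu_si128((__m128i const *)rgiPrevVal);
    __m128i u = _mm_loadu_si128((__m128i const *)rgiUpdate);

    acc = _mm_add_epi32(acc, _mm_madd_epi16(f, p));    /* 8x iPred += filter * prevVal */
    _mm_storeu_si128((__m128i *)rgiFilter, _mm_add_epi16(f, u));  /* 8x filter += update */

    rgiFilter += 8;
    rgiPrevVal += 8;
    rgiUpdate += 8;
}
/* Horizontal sum of the four partial sums into iPred. */
acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
iPred += _mm_cvtsi128_si32(acc);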
EDIT: based on the comments below I would change the following...
If you have mixed types the compiler is not always smart enough to figure it out. I would suggest the following to make it more obvious and give it a better chance at autovectorizing:
Declare rgiFilter[] as I32 for the purposes of this function. You will pay one copy.
Change iPred to an I32 array iPred[] as well.
Do the iPred[] summing outside the inner (or even outer) loop.
Pack similar instructions in groups of four:
iPred[0] += rgiFilter[0] * rgiPrevVal[0];
iPred[1] += rgiFilter[1] * rgiPrevVal[1];
iPred[2] += rgiFilter[2] * rgiPrevVal[2];
iPred[3] += rgiFilter[3] * rgiPrevVal[3];
rgiFilter[0] += rgiUpdate[0];
rgiFilter[1] += rgiUpdate[1];
rgiFilter[2] += rgiUpdate[2];
rgiFilter[3] += rgiUpdate[3];
This should be enough for the Intel compiler to figure it out
Ensure that iPred is held in a register (not read from memory before and not written back to memory after each += operation).
Optimize the memory layout for the L1 cache. Ensure that the 3 arrays do not fight for the same cache entries. This depends on the CPU architecture and isn't simple at all.
Loop unrolling and vectorizing should be left to the compiler.
See GCC auto-vectorization.
Start out by making sure that the data is laid out linearly in memory so that you get no cache misses. This doesn't seem to be an issue though.
If you can't use SSE for the operations (and if the compiler fails at it - look at the assembly), try to separate the loop into several smaller for-loops (one for each of the 0..7 slots). Compilers tend to do better optimizations on loops that perform fewer operations (except in cases like this where it might be able to do vectorization/SSE).
16-bit integers are more expensive for a 32/64-bit architecture to use (unless they have specific 16-bit registers). Try converting the data to 32 bits before the loop (most 64-bit architectures have 32-bit registers as well, AFAIK).
Pretty good code.
At each step, you're basically doing three things, a multiplication and two additions.
The other suggestions are good. Also, I've sometimes found that I get faster code if I separate those activities into different loops, like
one loop to do the multiplication and save to a temporary array.
one loop to sum that array into iPred.
one loop to add rgiUpdate to rgiFilter.
With the unrolling, your loop overhead is negligible, but if the number of different things done inside each loop is minimized, the compiler can sometimes make better use of its registers.
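A sketch of that split (tmp is a scratch array I am introducing for illustration; n stands for the total number of taps, i.e. iOrder_Div_8 * 8, and MAX_ORDER is an assumed upper bound on it):
I32 tmp[MAX_ORDER];                                  /* scratch array, assumed large enough */
for (j = 0; j < n; j++)
    tmp[j] = (I32) rgiFilter[j] * rgiPrevVal[j];     /* multiplies only */
for (j = 0; j < n; j++)
    iPred += tmp[j];                                 /* reduction only */
for (j = 0; j < n; j++)
    rgiFilter[j] += rgiUpdate[j];                    /* filter update only */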
There's lots of optimizations that you can do that involve introducing target specific code. I'll stick mostly with generic stuff, though.
First, if you are going to loop with index limits then you should usually try to loop downward.
Change:
for (i = 0; i < iLen; i++) {
to
for (i = iLen-1; i >= 0; i--) {
This can take advantage of the fact that many common processors essentially do a comparison with 0 for the results of any math operation, so you don't have to do an explicit comparison.
This only works, though, if going backwards through the loop has the same results and if the index is signed (though you can sneak around that).
Alternately you could try limiting by pointer math. This might eliminate the need for an explicit index (counter) variable, which could speed things up, especially if registers are in short supply.
for (p = rgiFilter; p < rgiFilter + 8; ) {
    iPred += (I32) (*p) * *rgiPrevVal++;
    *p++ += *rgiUpdate++;
    ....
}
This also gets rid of the odd updating at the end of your inner loop. Updating at the end of the loop can confuse the compiler and make it produce worse code. You may also find that the loop unrolling you did produces worse or equally good results as keeping only two statements in the body of the inner loop; the compiler is likely able to make good decisions about how this loop should be rolled/unrolled. Or you might just want to make sure that the loop is unrolled twice, since rgiFilter is an array of 16-bit values, and see if the compiler can take advantage of accessing it just twice to accomplish two reads and two writes -- doing one 32-bit load and one 32-bit store.
for (p = rgiFilter; p < rgiFilter + 8; ) {
    I16 x = *p;
    I16 y = *(p+1);                    // Hope that the compiler can combine these loads
    iPred += (I32) x * *rgiPrevVal++;
    iPred += (I32) y * *rgiPrevVal++;
    *p++ += *rgiUpdate++;
    *p++ += *rgiUpdate++;              // Hope that the compiler can combine these stores
    ....
}
If your compiler and/or target processor supports it you can also try issuing prefetch instructions. For instance gcc has:
__builtin_prefetch (const void * addr)
__builtin_prefetch (const void * addr, int rw)
__builtin_prefetch (const void * addr, int rw, int locality)
These can be used to tell the compiler that, if the target has prefetch instructions, it should use them to try to get addr into the cache ahead of time. Optimally these should be issued once per cache-line step per array you're working on. The rw argument tells the compiler whether you want to read or write the address. locality has to do with whether the data needs to stay in cache after you access it. The compiler tries its best to generate the right instructions for this, but if it can't do what you ask for on a given target it simply does nothing, and it doesn't hurt anything.
Also, since the __builtin_ functions are special the normal rules about variable number of arguments don't really apply -- this is a hint to the compiler, not a call to a function.
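For instance, inside the inner loop you could prefetch a little ahead of the current position (the lookahead distance of 64 elements here is just a guess to tune for your target):
__builtin_prefetch(rgiPrevVal + 64, 0, 1);   /* read, moderate temporal locality */
__builtin_prefetch(rgiUpdate  + 64, 0, 1);   /* read */
__builtin_prefetch(rgiFilter  + 64, 1, 1);   /* written back as well */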
You should also look into any vector operations your target supports as well as any generic or platform specific functions, builtins, or pragmas that your compiler supports for doing vector operations.
