some mandelbrot drawing routine from c to sse2

I want to rewrite this simple routine as SSE2 code (preferably
in NASM), and I am not entirely sure how to do it. Two things are
not clear: how to express the calculations (the inner loop and those from
the outer loop too), and how to call the C function "SetPixelInDibInt(i ,j, palette[n]);"
from statically linked asm code.
void DrawMandelbrotD(double ox, double oy, double lx, int N_ITER)
{
double ly = lx * double(CLIENT_Y)/double(CLIENT_X);
double dx = lx / CLIENT_X;
double dy = ly / CLIENT_Y;
double ax = ox - lx * 0.5 + dx * 0.5;
double ay = oy - ly * 0.5 + dy * 0.5;
static double re, im, re_n, im_n, c_re, c_im, rere, imim;
static int n;
for(int j=0; j<CLIENT_Y; j+=1)
{
for(int i=0; i<CLIENT_X; i+=1)
{
c_re = ax + i * dx;
c_im = ay + j * dy;
re = c_re;
im = c_im;
rere=re*re;
imim=im*im;
n=1;
for(int k=0;k<N_ITER;k++)
{
im = (re+re)*im + c_im;
re = rere - imim + c_re;
rere=re*re;
imim=im*im;
if ( (rere + imim) > 4.0 ) break;
n++;
}
SetPixelInDibInt(i ,j, palette[n]);
}
}
}
Could someone help? I would rather not see other
implementations, just a NASM SSE translation of the code above;
that would be most helpful in my case.

Intel has a complete implementation as an AVX example. See below.
What makes Mandelbrot tricky is that the early-out condition for each point in the set (i.e. pixel) is different. You could keep a pair or quad of pixels iterating until the magnitude of both exceeds 2.0 (or you hit max iterations). To do otherwise would require tracking which pixel's points were in which vector element.
Anyway, a simplistic implementation to operate on a vector of 2 (or 4 with AVX) doubles at a time would have its throughput limited by the latency of the dependency chains. You'd need to do multiple dependency chains in parallel to keep both of Haswell's FMA units fed. So you'd duplicate your variables, and interleave operations for two iterations of the outer loop inside the inner loop.
Keeping track of which pixels are being calculated would be a little tricky. I think it might take less overhead to use one set of registers for one row of pixels, and another set of registers for another row. (So you can always just move 4 pixels to the right, rather than checking whether the other dep chain is already processing that vector.)
I suspect that only checking the loop exit condition every 4 iterations or so might be a win. Getting code to branch based on a packed vector comparison is slightly more expensive than in the scalar case. The extra FP add required is also expensive. (Haswell can do two FMAs per cycle (latency = 5). The lone FP add unit is on the same port as one of the FMA units. The two FP mul units are on the same ports that can run FMA.)
The loop condition can be checked with a packed-compare to generate a mask of zeros and ones, and a (V)PTEST of that register with itself to see if it's all zero. (edit: movmskps then test+jcc is fewer uops, but maybe higher latency.) Then obviously je or jne as appropriate, depending on whether you did a FP compare that leaves zeros when you should exit, or zeros when you shouldn't. NAN shouldn't be possible, but there's no reason not to choose your comparison op such that a NAN will result in the exit condition being true.
const __m256d const_four = _mm256_set1_pd(4.0); // outside the loop
__m256i cmp_result = _mm256_castpd_si256(_mm256_cmp_pd(mag_squared, const_four, _CMP_LE_OQ)); // vcmppd. result is non-zero if at least one element <= 4.0
if (_mm256_testz_si256(cmp_result, cmp_result)) // vptest: true when every element is zero, i.e. all > 4.0 (or NaN)
break;
There MIGHT be some way to use PTEST directly on a packed-double, with some bit-hack AND-mask that will pick out bits that will be set iff the FP value is > 4.0. Like maybe some bits in the exponent? Maybe worth considering. I found a forum post about it, but didn't try it out.
Hmm, oh crap, this doesn't record WHEN the loop condition failed, for each vector element separately, for the purpose of coloring the points outside the Mandelbrot set. Maybe test for any element hitting the condition (instead of all), record the result, and then set that element (and c for that element) to 0.0 so it won't trigger the exit condition again. Or maybe scheduling pixels into vector elements is the way to go after all. This code might do fairly well on a hyperthreaded CPU, since there will be a lot of branch mispredicts with every element separately triggering the early-out condition.
That might waste a lot of your throughput, and given that 4 uops per cycle is doable, but only 2 of them can be FP mul/add/FMA, there's room for a significant amount of integer code to schedule points into vector elements. (On Sandybridge/IvyBridge, without FMA, FP throughput is lower. But there are only 3 ports that can handle integer ops, and 2 of those are the ports for the FP mul and FP add units.)
Since you don't have to read any source data, there's only 1 memory access stream for each dep chain, and it's a write stream. (And it's low bandwidth, since most points take a lot of iterations before you're ready to write a single pixel value.) So the number of hardware prefetch streams isn't a limiting factor for the number of dep chains to run in parallel. Cache-miss latency should be hidden by write buffers.
I can write some code if anyone's still interested in this (just post a comment). I stopped at the high-level design stage since this is an old question, though.
==============
I also found that Intel already used the Mandelbrot set as an example for one of their AVX tutorials. They use the mask-off-vector-elements method for the loop condition. (using the mask generated directly by vcmpps to AND). Their results indicate that AVX (single-precision) gave a 7x speedup over scalar float, so apparently it's not common for neighbouring pixels to hit the early-out condition at very different numbers of iterations. (at least for the zoom / pan they tested with.)
They just let the FP results keep accumulating for elements that fail the early-out condition. They just stop incrementing the counter for that element. Hopefully most systems default to having the control word set to zero out denormals, if denormals still take extra cycles.
Their code is silly in one way, though: They track the iteration count for each vector element with a floating-point vector, and then convert it to int at the end before use. It'd be faster, and not occupy an FP execution unit, to use packed-integers for that. Oh, I know why they do that: AVX (without AVX2) doesn't support 256bit integer vector ops. They could have used packed 16bit int loop counters, but that could overflow. (And they'd have to compress the mask down from 256b to 128b).
They also test for all elements being > 4.0 with movmskps and then test that, instead of using ptest. I guess the test / jcc can macro-fuse, and run on a different execution unit than FP vector ops, so it's maybe not even slower. (256-bit VPTEST itself only needs AVX, not AVX2, so availability isn't the issue.) PTEST is 2 uops, though, so actually movmskps + test / jcc is fewer uops than ptest + jcc. (PTEST is 1 fused-domain uop on SnB, but still 2 unfused uops for the execution ports. On IvB/HSW, 2 uops even in the fused domain.) So it looks like movmskps is the optimal way, unless you can take advantage of the bitwise AND that's part of PTEST, or need to test more than just the high bit of each element. If a branch is unpredictable, ptest might be lower latency, and thus be worth it by catching mispredicts a cycle sooner.
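Since the question asked for SSE2 specifically, here is a minimal sketch of the mask-off-vector-elements method with two doubles per vector, written with C intrinsics rather than NASM. The function name mandel2_sse2 and its interface are made up for illustration; it keeps the question's recurrence and counting convention (n starts at 1) and uses an integer counter vector, which SSE2 does support for 128-bit registers. Each intrinsic maps to one instruction (mulpd, addpd, subpd, cmplepd, andpd, movmskpd, psubq), so a compiler's asm output for it could serve as a starting point for a hand-written version.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: iterate two points (c_re[k], c_im[k]) at once and
   write each point's iteration count to n_out[k]. */
void mandel2_sse2(const double c_re[2], const double c_im[2],
                  int n_iter, int n_out[2])
{
    const __m128d four = _mm_set1_pd(4.0);
    __m128d cre = _mm_loadu_pd(c_re), cim = _mm_loadu_pd(c_im);
    __m128d re = cre, im = cim;
    __m128d rere = _mm_mul_pd(re, re), imim = _mm_mul_pd(im, im);
    __m128i n = _mm_set1_epi64x(1);
    __m128d active = _mm_castsi128_pd(_mm_set1_epi32(-1)); /* both lanes alive */

    for (int k = 0; k < n_iter; k++) {
        im = _mm_add_pd(_mm_mul_pd(_mm_add_pd(re, re), im), cim);
        re = _mm_add_pd(_mm_sub_pd(rere, imim), cre);
        rere = _mm_mul_pd(re, re);
        imim = _mm_mul_pd(im, im);
        /* A lane stays active only while |z|^2 <= 4; NaN compares false,
           and the AND makes an escape permanent for that lane. */
        active = _mm_and_pd(active, _mm_cmple_pd(_mm_add_pd(rere, imim), four));
        if (_mm_movemask_pd(active) == 0)
            break;                      /* both lanes have escaped */
        /* Active lanes hold all-ones (-1 as int64): subtracting adds 1. */
        n = _mm_sub_epi64(n, _mm_castpd_si128(active));
    }
    int64_t tmp[2];
    _mm_storeu_si128((__m128i *)tmp, n);
    n_out[0] = (int)tmp[0];
    n_out[1] = (int)tmp[1];
}
```

The caller would run this for pixels i and i+1 of a row and then call SetPixelInDibInt twice from C, which sidesteps the question of calling C from asm.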

Related

Maximizing the performance and efficiency of triangularizing a 24x24 matrix in C and then in MIPS assembly

Recently I have become interested in computer architecture and performance. With that said, I have been picking up an "easier" assembly language, namely MIPS assembly, to really try and learn how stuff "works under the hood". I feel comfortable enough to experiment with some more advanced stuff, and so I have decided to combine programming with my interest in mathematics.
My goal is simple: given a 24x24 matrix A (I don't care about any other size), I want to write an algorithm that finds the upper triangular form of the matrix as efficiently as possible. By "efficiently" I mean that I eventually want to use the resources of the processor I am running on as well as I can: high cache hit rate, efficient usage of memory (locality of reference etc.), short running time, and so on.
Eventually my goal is to transform the C solution to MIPS-assembly and tailor it to fit the memory subsystem of the processor that I will be trying to run my algorithm on. Regarding the processor I will have different options to play around with when it comes to caches, write buffers and memory in the sense that I can play around with different cache sizes, block sizes, associativity levels, memory access times etc. Performance in this case will be measured in the time it takes to triangularize a 24x24 matrix.
To begin, I need to actually write some high level code and actually solve the problem there before diving into MIPS assembly. I have "looked around" and eventually came up with this seemingly standard solution. It isn't necessarily super fast, neither do I think it is optimal for triangularizing 24x24 matrices. Can I do better?
void triangularize(float **A, int N)
{
int i, j, k;
// Loop over the diagonal elements
for (k = 0; k < N; k++)
{
// Loop over all the elements in the pivot row and right of the pivot ELEMENT
for (j = k + 1; j < N; j++)
{
// divide by the pivot element
A[k][j] = A[k][j] / A[k][k];
}
// Set the pivot elements
A[k][k] = 1.0;
// Loop over all elements below the pivot row and right of the pivot COLUMN
for (i = k + 1; i < N; i++)
{
for (j = k + 1; j < N; j++)
{
A[i][j] = A[i][j] - A[i][k] * A[k][j];
}
A[i][k] = 0.0;
}
}
}
Furthermore, what should be my next steps when trying to convert the C code to MIPS assembly with respect to maximizing performance and minimizing cost (cache hit rates, IO costs when dealing with memory etc.) to get a lightning fast and efficient solution?

First of all, encoding a matrix as a jagged array (i.e. float**) is generally not efficient, as it causes unnecessary, expensive indirections, and the rows may not be contiguous in memory, resulting in more cache misses or even cache thrashing in pathological cases. Consider storing your matrices as flattened arrays, which are generally more efficient (especially on MIPS). A flattened array can be indexed using something like array[i*24+j] instead of array[i][j].
Moreover, if you do not care about matrices other than 24x24 ones, you can write code specialized for 24x24 matrices. This helps compilers generate more efficient assembly (typically by unrolling loops and using cheaper instructions, such as multiplication by a constant).
Additionally, divisions are generally expensive, especially on embedded MIPS processors. Thus, you can replace the divisions with multiplications by the reciprocal. For example:
float inv = 1.0f / A[k][k];
for (j = k + 1; j < N; j++)
A[k][j] *= inv;
Note that the result might be slightly different due to floating-point rounding. You can use the -ffast-math compiler flag to help it generate such optimisations if you know that special values like NaN or Inf do not appear in the matrix.
Moreover, it may be faster to unroll the loop manually, since not all compilers do that (properly). That being said, the benefit of loop unrolling depends heavily on the target processor (unspecified here). Without more information, it is very hard to know whether this is useful. For example, some processors can execute multiple floating-point operations per cycle, while others cannot even execute them natively (i.e. no hardware FP unit): the operations are emulated with many instructions, which is very expensive (compilers like GCC emit function calls for basic operations like addition/subtraction on such processors). If there is no hardware FP unit, it might be faster to use fixed-point arithmetic.
Finally, some MIPS processors have a 128-bit SIMD unit. Using it should significantly speed up execution. Compilers should be able to auto-vectorize most of your code, but you need to tell them that your target processor supports it (see the -march flag for GCC/Clang). For a fixed-size matrix, manual vectorization often results in faster execution than auto-vectorization, assuming you write efficient code.
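Putting the first three suggestions together, a minimal sketch could look like the following. The names triangularize24 and MAT_N are illustrative, the matrix is assumed to be stored row-major in one contiguous array, and, like the original, it assumes no pivot is zero.

```c
#define MAT_N 24  /* fixed size: constant strides, easy to unroll */

/* Same elimination as the question's triangularize(), but on a flattened
   array (element (i,j) lives at A[i*MAT_N + j]) and with one division per
   pivot instead of one per element. */
void triangularize24(float A[MAT_N * MAT_N])
{
    for (int k = 0; k < MAT_N; k++) {
        float *pivot_row = &A[k * MAT_N];
        const float inv = 1.0f / pivot_row[k];   /* reciprocal of the pivot */
        for (int j = k + 1; j < MAT_N; j++)
            pivot_row[j] *= inv;
        pivot_row[k] = 1.0f;

        /* Eliminate column k from every row below the pivot row. */
        for (int i = k + 1; i < MAT_N; i++) {
            float *row = &A[i * MAT_N];
            const float factor = row[k];
            for (int j = k + 1; j < MAT_N; j++)
                row[j] -= factor * pivot_row[j];
            row[k] = 0.0f;
        }
    }
}
```

The inner loops walk both rows with unit stride, which is also the access pattern a hand-written MIPS version would want to keep.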

Indexing with modulo has a huge performance hit

I have a simple code that sums elements from an array and returns them:
// Called with jump == 0
int performance(int jump, int *array, int size) {
int currentIndex = 0;
int total = 0;
// For i in 1...500_000_000
for (int i = 0; i < 500000000; i++) {
currentIndex = (currentIndex + jump) % size;
total += array[currentIndex];
}
return total;
}
I noticed a weird behavior: the presence of % size has a very large performance impact (~10x slower) even though jump is 0, so it is constantly accessing the same array element (0). Just removing % size improves performance a lot.
I would have thought this was just the modulo computation that was making this difference, but now say I replace my sum line with total += array[currentIndex] % size; (thus also computing a modulo) the performance difference is almost unnoticeable.
I am compiling this with -O3 with clang on an arm64 machine.
What could be causing this?

Sounds normal for sdiv+msub latency to be about 10x add latency.
Even if this inlined for a compile-time-constant size that wasn't a power of two, that's still a multiplicative inverse and an msub (multiply-subtract) to get the remainder, so a dep chain of at least two multiplies and a shift.
Maybe an extra few instructions on the critical path for a signed remainder with a constant size (even if positive) since the array is also signed int. e.g. -4 % 3 has to produce -1 in C.
See
How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
say I replace my sum line with total += array[currentIndex] % size; (thus also computing a modulo)
That remainder isn't part of a loop-carried dependency chain. (https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/)
Multiple remainder calculations can be in flight in parallel, since the next array[idx] load address only depends on a += jump add instruction.
If you don't bottleneck on throughput limits, those remainder results could potentially be ready with 1/clock throughput, with OoO exec overlapping dep chains between iterations. The only latency bottlenecks are the loop counter/index and total += ..., both of which are just integer add which has 1 cycle latency.
So really, the bottleneck is likely going to be on throughput (of the whole loop body), not those latency bottlenecks, unless you're testing on an extremely wide CPU that can get a lot done every cycle. (Surprised you don't get more slowdown from introducing the % at all. Unless total is getting optimized away after inlining if you're not using the result.)
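To make the loop-carried-dependency point concrete, here is a sketch of one way to take the division off the critical path. This is not something the answer above proposes; it is a common substitution that is only valid under the assumption 0 <= jump < size, which the function does not check. The name performance_no_div and the extra iters parameter are illustrative.

```c
/* Variant of the question's loop for the case 0 <= jump < size (an
   assumption, not checked): the wrap-around becomes a compare and a
   conditional subtract, so no sdiv+msub sits on the loop-carried chain. */
int performance_no_div(int jump, const int *array, int size, int iters)
{
    int currentIndex = 0;
    int total = 0;
    for (int i = 0; i < iters; i++) {
        currentIndex += jump;           /* at most 2*size - 2, no wrap needed yet */
        if (currentIndex >= size)
            currentIndex -= size;       /* same result as % size in this range */
        total += array[currentIndex];
    }
    return total;
}
```

The index chain is then an add plus a compare/select per iteration, which should be close to the cost of the version without any modulo.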

Broadcasting each element of a SIMD register in a loop

I need to fill a SIMD register with one element of another SIMD register. i.e. "broadcast" or "splat" a single element to every position.
My current code for doing it is (it's simplified, my real functions are declared inline):
__m128
f4_broadcast_1(__m128 a, int i) {
return _mm_set1_ps(a[i]);
}
This seem to generate efficient code on clang and gcc, but msvc forbids index accesses. Therefore, I instead write:
__m128
f4_broadcast_2(__m128 a, int i) {
union { __m128 reg; float f[4]; } r = { .reg = a };
return _mm_set1_ps(r.f[i]);
}
It generates the same code on clang and gcc but bad code on msvc. Godbolt link: https://godbolt.org/z/IlOqZl
Is there a better way to do it? I know there are similar questions on SO already, but my use case involves both extracting a float32 from a register and putting it back into another one, which is a slightly different problem. It would be cool if you could do this without having to touch the main memory at all.
Is the index variable or constant? Apparently it matters a lot to SIMD performance whether it is. In my case, the index is a loop variable:
for (int i = 0; i < M; i++) {
... broadcast element i of some reg
}
where M is either 4, 8 or 16. Maybe I should manually unroll the loops to make it a constant? It's a lot of code in the for-loop so the amount of code would grow considerably.
I also wonder how to do the same thing but for the __m256 and __m512 registers found on modern cpu:s.

Some of the shuffles in "Get an arbitrary float from a simd register at runtime?" can be adapted to broadcast an element instead of just getting 1 copy of it to the low element. It discusses tradeoffs of shuffle vs. store/reload strategies in more detail.
x86 doesn't have a 32-bit-element variable-control shuffle until AVX vpermilps and AVX2 lane-crossing vpermps / vpermd. e.g.
// for runtime-variable i. Otherwise use something more efficient.
_mm_permutevar_ps(v, _mm_set1_epi32(i));
Or broadcast the low element with vbroadcastss (the vector-source version requires AVX2)
Broadcast loads are very efficient with AVX1: _mm_broadcast_ss(float*) (or _mm256/512 of the same) or simply 128/256/512 _mm_set1_ps(float) of a float that happened to come from memory, and let your compiler use a broadcast load if compiling with AVX1 enabled.
With a compile-time-constant control, you can broadcast any single element with SSE1
_mm_shuffle_ps(same,same, _MM_SHUFFLE(i,i,i,i));
Or for integer, with SSE2 pshufd: _mm_shuffle_epi32(v, _MM_SHUFFLE(i,i,i,i)).
Depending on your compiler, it may have to be a macro for i to be a compile-time constant with optimization disabled. The shuffle-control constant has to compile into an immediate byte (with 4x 2-bit fields) embedded in the machine code, not loaded as data or from a register.
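As a sketch of the compile-time-constant approach for the question's __m128 case: the macro below keeps i an immediate even at -O0, and the wrapper (a hypothetical helper, not from the question) dispatches a runtime index to one of four immediate shuffles. Both names are made up for illustration.

```c
#include <xmmintrin.h>  /* SSE1: _mm_shuffle_ps, _MM_SHUFFLE */

/* Macro so that i stays a compile-time constant with optimization off. */
#define F4_BROADCAST_CONST(v, i) \
    _mm_shuffle_ps((v), (v), _MM_SHUFFLE((i), (i), (i), (i)))

/* Hypothetical runtime-index wrapper: a switch picks one of four
   immediate-control shuffles, so nothing goes through memory. */
__m128 f4_broadcast_sw(__m128 a, int i)
{
    switch (i & 3) {
    case 0:  return F4_BROADCAST_CONST(a, 0);
    case 1:  return F4_BROADCAST_CONST(a, 1);
    case 2:  return F4_BROADCAST_CONST(a, 2);
    default: return F4_BROADCAST_CONST(a, 3);
    }
}
```

If the surrounding loop is fully unrolled, the switch should fold away to a single shufps per iteration; if it is not, the branch may cost more than a store/reload, so this is worth measuring rather than assuming.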
Iterating over elements in a loop.
I'm using AVX2 in this section; this easily adapts to AVX512. Without AVX2 the store/reload strategy is your only good option for 256-bit vectors, or vpermilps for 128-bit vectors.
Possibly incrementing counters (by 4) for SSSE3 pshufb (with casting between __m128i and __m128) could be a good idea without AVX, where you don't have an efficient broadcast load.
the index is a loop variable
Compilers will often fully unroll loops for you, turning the loop variable into a compile-time constant for each iteration. But only with optimization enabled. In C++ you could maybe use template recursion to iterate with a constexpr.
MSVC doesn't optimize intrinsics, so if you write _mm_permutevar_ps(v, _mm_set1_epi32(i)); you're actually going to get that in each iteration, not 4x vshufps. But gcc and especially clang do optimize shuffles, so they should do well with optimization enabled.
It's a lot of code in the for-loop
If it's going to need a lot of registers / spend a lot of time, a store/reload might be a good choice especially with AVX available for broadcast reloads. Shuffle throughput is more limited (1/clock) than load throughput (2/clock) on current Intel CPUs.
Compiling your code with AVX512 will even allow broadcast memory-source operands, not a separate load instruction, so the compiler can even fold a broadcast-load into a source operand if it's only needed once.
/********* Store/reload strategy ****************/
#include <immintrin.h>
#include <stdalign.h>
void foo(__m256 v) {
alignas(32) float tmp[8];
_mm256_store_ps(tmp, v);
// with only AVX1, maybe don't peel first iteration, or broadcast manually in 2 steps
__m256 bcast = _mm256_broadcastss_ps(_mm256_castps256_ps128(v)); // AVX2 vbroadcastss ymm, xmm
... do stuff with bcast ...
for (int i=1; i<8 ; i++) {
bcast = _mm256_broadcast_ss(&tmp[i]); // takes a pointer: vbroadcastss ymm, [mem]
... do stuff with bcast ...
}
}
I peeled the first iteration manually to just broadcast the low element with an ALU operation (lower latency) so it can get started right away. Later iterations then reload with a broadcast load.
Another option would be to use a SIMD increment for a vector shuffle-control (aka mask), if you have AVX2.
// Also AVX2
void foo(__m256 v) {
__m256i shufmask = _mm256_setzero_si256();
for (int i=0; i<8 ; i++) { // no peeling here: the all-zero shufmask selects element 0 first
__m256 bcast = _mm256_permutevar8x32_ps(v, shufmask); // AVX2 vpermps
// prep for next iteration by incrementing the element selectors
shufmask = _mm256_add_epi32(shufmask, _mm256_set1_epi32(1));
... do stuff with bcast ...
}
}
This does one redundant vpaddd on shufmask (in the last iteration), but that's probably fine and better than peeling the first or last iteration. And obviously better than starting with -1 and doing an add before the shuffle in the first iteration.
Lane-crossing shuffles have 3-cycle latency on Intel so putting it right after the shuffle is probably good scheduling unless there's other per-iteration work that doesn't depend on bcast; out-of-order exec makes this a minor issue anyway. In the first iteration, vpermps with a mask that was just xor-zeroed is basically just as good as vbroadcastss on Intel, for out-of-order exec to get started quickly.
But on AMD CPUs (at least before Zen2), lane-crossing vpermps is pretty slow; lane-crossing shuffles with granularity <128-bit are extra expensive because it has to decode into 128-bit uops. So this strategy isn't wonderful on AMD. If store/reload performs equally for your surrounding code on Intel, then it might be a better choice to make your code AMD-friendly as well.
vpermps also has a new intrinsic introduced with AVX512 intrinsics: _mm256_permutexvar_ps(__m256i idx, __m256 a) which has the operands in the order that matches asm. Use whichever one you like, if your compiler supports the new one.

Broadcasting can be achieved by using the AVX2 instruction VBROADCASTSS, but moving the value to the input position (first position) depends on your instruction set:
VBROADCASTSS (128 bit version VEX and legacy)
This instruction broadcasts the source value on position [0] of the source XMM register to all four FLOATS of the destination XMM register. Its intrinsic is __m128 _mm_broadcastss_ps(__m128 a);.
If the position of your value is constant, you can use the instruction PSHUFD to move the value from its current position to the first position. Its intrinsic is __m128i _mm_shuffle_epi32(__m128i a, int n). To move the value that should be broadcast to the first position of the input XMM vector, use the following values for int n:
1st position: 0h
2nd position: 1h
3rd position: 2h
4th position: 3h
This moves the value from position 0..3 to the first position. (Only the low two bits of n select the source of the first element; the other elements don't matter here because the broadcast overwrites them.)
For example, use the following to move the fourth position of the input vector to the first one. Because PSHUFD is an integer instruction, the __m128 has to be cast to __m128i and back; the casts compile to no instructions:
__m128 newInput = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(input), 3));
Then apply the following intrinsic:
__m128 result = _mm_broadcastss_ps(newInput);
Now the value from the fourth position of your input XMM vector should be on all positions of your result vector.

optimization of a code in C

I am trying to optimize some C code, specifically a critical loop which takes almost 99.99% of the total execution time. Here is that loop:
#pragma omp parallel shared(NTOT,i) num_threads(4)
{
# pragma omp for private(dx,dy,d,j,V,E,F,G) reduction(+:dU) nowait
for(j = 1; j <= NTOT; j++){
if(j == i) continue;
dx = (X[j][0]-X[i][0])*a;
dy = (X[j][1]-X[i][1])*a;
d = sqrt(dx*dx+dy*dy);
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
E = dS[0]*dx+dS[1]*dy;
F = spin[2*j-2]*dx+spin[2*j-1]*dy;
G = -3*(D/(d*d*d*d*d))*E*F;
dU += (V+G);
}
}
All variables are local. The loop takes 0.7 second for NTOT=3600 which is a large amount of time, especially when I have to do this 500,000 times in the whole program, resulting in 97 hours spent in this loop. My question is if there are other things to be optimized in this loop?
My computer's processor is an Intel core i5 with 4 CPU(4X1600Mhz) and 3072K L3 cache.

Optimize for hardware or software?
Soft:
Getting rid of time consuming exceptions such as divide by zeros:
d = sqrt(dx*dx+dy*dy + 0.001f );
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
You could also try John Carmack, Terje Mathisen and Gary Tarolli's "Fast inverse square root" for the
D/(d*d*d)
part. You get rid of the division too:
float qrsqrt = Q_rsqrt(dx*dx+dy*dy + easing); // easing: small constant such as the 0.001f above
qrsqrt = qrsqrt*qrsqrt*qrsqrt * D;
at the cost of some precision.
There is another division also to be gotten rid of:
(D/(d*d*d*d*d))
such as
qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D
Here is the fast inverse sqrt:
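#include <stdint.h>
#include <string.h>
float Q_rsqrt( float number )
{
int32_t i; // must be exactly 32 bits; long is 64 bits on many 64-bit platforms
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
memcpy(&i, &y, sizeof i); // evil floating point bit level hacking, via memcpy to avoid aliasing problems
i = 0x5f3759df - ( i >> 1 ); // magic constant minus half the bit pattern gives the initial estimate
memcpy(&y, &i, sizeof y);
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st Newton-Raphson iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}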
To overcome big arrays' non-caching behaviour, you can do the computation in smaller patches/groups, especially when it is a many-to-many O(N*N) algorithm. For example:
get 256 particles.
compute 256 x 256 relations.
save 256 results on variables.
select another 256 particles as target(saving the first 256 group in place)
do same calculations but this time 1st group vs 2nd group.
save first 256 results again.
move to 3rd group
repeat.
do same until all particles are versused against first 256 particles.
Now get second group of 256.
iterate until all 256's are complete.
Your CPU has a big cache, so you could try 32k particles versus 32k particles directly. But L1 may not be big enough, so I would stick with 512 vs 512 (or 500 vs 500 to avoid cache line ---> this is going to be dependent on architecture) if I were you.
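The grouping steps above can be sketched as a tiled double loop. This is only an illustration of the traversal order: pair_term is a hypothetical stand-in for the real interaction, TILE is the group size from the text, and the function name is made up.

```c
#include <stddef.h>

#define TILE 256  /* group size from the text; tune to the cache */

/* Hypothetical pair term standing in for the real interaction. */
static double pair_term(double xi, double xj) { return xi * xj; }

/* Sum pair_term over all ordered pairs (i, j), i != j, visiting the
   data tile by tile so one group of TILE values is reused while hot. */
double sum_pairs_tiled(const double *x, size_t n)
{
    double total = 0.0;
    for (size_t ib = 0; ib < n; ib += TILE) {
        size_t iend = ib + TILE < n ? ib + TILE : n;   /* clamp last tile */
        for (size_t jb = 0; jb < n; jb += TILE) {
            size_t jend = jb + TILE < n ? jb + TILE : n;
            for (size_t i = ib; i < iend; i++)
                for (size_t j = jb; j < jend; j++)
                    if (i != j)
                        total += pair_term(x[i], x[j]);
        }
    }
    return total;
}
```

Note that the summation order differs from the untiled loop, so floating-point results can differ in the last bits.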
Hard:
SSE, AVX, GPGPU, FPGA .....
As @harold commented, SSE should be the starting point for comparison, and you should vectorize or at least parallelize through 4-packed vector instructions, which have the advantage of optimum memory fetching ability and pipelining. When you need 3x-10x more performance (on top of an SSE version using all cores), you will need an OpenCL/CUDA compliant GPU (equally priced as the i5) and the OpenCL (or CUDA) API, or you can learn OpenGL too, but it seems harder (maybe DirectX is easier).
Trying SSE is easiest, and should give 3x faster than the fast inverse I mentioned above. An equally priced GPU should give another 3x over SSE, at least for thousands of particles. Going over 100k particles, a whole GPU can achieve 80x the performance of a single CPU core for this type of algorithm when you optimize it enough (making it less dependent on main memory). OpenCL gives the ability to address cache to save your arrays, so you can use terabytes/s of bandwidth in it.

I would always do random pausing
to pin down exactly which lines were most costly.
Then, after fixing something I would do it again, to find another fix, and so on.
That said, some things look suspicious.
People will say the compiler's optimizer should fix these, but I never rely on that if I can help it.
X[i], X[j], spin[2*j-1(and 2)] look like candidates for pointers. There is no need to do this index calculation and then hope the optimizer can remove it.
You could define a variable d2 = dx*dx+dy*dy and then say d = sqrt(d2). Then wherever you have d*d you can instead write d2.
I suspect a lot of samples will land in the sqrt function, so I would try to figure a way around using that.
I do wonder if some of these quantities like (dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]) could be calculated in a separate unrolled loop outside this loop. In some cases two loops can be faster than one if the compiler can save some registers.

I cannot believe that 3600 iterations of an O(1) loop can take 0.7 seconds. Perhaps you meant the double loop with 3600 * 3600 iterations? Otherwise I suggest checking whether optimization is enabled, and how long spawning the threads takes.
General
Your inner loop is very simple and it contains only a few operations. Note that divisions and square roots are roughly 15-30 times slower than additions, subtractions and multiplications. You are doing three of them, so most of the time is eaten by them.
First of all, you can compute reciprocal square root in one operation instead of computing square root, then getting reciprocal of it. Second, you should save the result and reuse it when necessary (right now you divide by d twice). This would result in one problematic operation per iteration instead of three.
invD = rsqrt(dx*dx+dy*dy);
V = (D * (invD*invD*invD))*(...);
...
G = -3*(D * (invD*invD*invD*invD*invD))*E*F;
dU += (V+G);
In order to further reduce time taken by rsqrt, I advise vectorizing it. I mean: compute rsqrt for two or four input values at once with SSE. Depending on size of your arguments and desired precision of result, you can take one of the routines from this question. Note that it contains a link to a small GitHub project with all the implementations.
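As one possible shape for a vectorized rsqrt, here is a sketch using the SSE hardware estimate refined by one Newton-Raphson step. It is single precision only, so it applies only under the assumption that float accuracy is acceptable for this computation; the function name rsqrt4_nr is made up.

```c
#include <xmmintrin.h>  /* SSE1: _mm_rsqrt_ps */

/* Approximate 1/sqrt(x) for four floats at once: the hardware estimate
   (about 12 bits) refined by one Newton-Raphson step,
   y' = y * (1.5 - 0.5 * x * y * y). */
__m128 rsqrt4_nr(__m128 x)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 y = _mm_rsqrt_ps(x);                       /* initial estimate */
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);     /* x * y * y */
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}
```

Four values of dx*dx+dy*dy would be packed into x, and the result cubed and raised to the fifth power with plain multiplies as in the scalar snippet above.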
Indeed you can go further and vectorize the whole loop with SSE (or even AVX), that is not hard.
OpenCL
If you are ready to use some big framework, then I suggest using OpenCL. Your loop is very simple, so you won't have any problems porting it to OpenCL (except for some initial adaptation to OpenCL).
Then you can use CPU implementations of OpenCL, e.g. from Intel or AMD. Both of them would automatically use multithreading. Also, they are likely to automatically vectorize your loop (e.g. see this article). Finally, there is a chance that they would find a good implementation of rsqrt automatically, if you use native_rsqrt function or something like that.
Also, you would be able to run your code on GPU. If you use single precision, it may result in significant speedup. If you use double precision, then it is not so clear: modern consumer GPUs are often slow with double precision, because they lack the necessary hardware.

Minor optimisations:
(d * d * d) is calculated twice. Store d*d and use it for d^3 and d^5
Replace 2 * x with x<<1.

Quickly find whether a value is present in a C array?

I have an embedded application with a time-critical ISR that needs to iterate through an array of size 256 (preferably 1024, but 256 is the minimum) and check if a value matches the array's contents. A bool will be set to true if this is the case.
The microcontroller is an NXP LPC4357, ARM Cortex M4 core, and the compiler is GCC. I already have combined optimisation level 2 (3 is slower) and placing the function in RAM instead of flash. I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i!=0 is faster than checking if i<256). All in all, I end up with a duration of 12.5 µs which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:
uint32_t i;
uint32_t *array_ptr = &theArray[0];
uint32_t compareVal = 0x1234ABCD;
bool validFlag = false;
for (i=256; i!=0; i--)
{
if (compareVal == *array_ptr++)
{
validFlag = true;
break;
}
}
What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks are also allowed.

In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code compared to what you can do with hand tuned assembly language. I tend to take the path of least resistance - for small routines like this, I just write asm code and have a good idea how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop. Here's a way to do it which incorporates the 3 ideas I mentioned in my comment: Loop unrolling, cache prefetch and making use of the multiple load (ldm) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take into account memory delays.
Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.
; r0 = count, r1 = source ptr, r2 = comparison value
stmfd sp!,{r4-r11} ; save non-volatile registers
mov r3,r0,LSR #3 ; loop count = total count / 8
pld [r1,#128]
ldmia r1!,{r4-r7} ; pre load first set
loop_top:
pld [r1,#128]
ldmia r1!,{r8-r11} ; pre load second set
cmp r4,r2 ; search for match
cmpne r5,r2 ; use conditional execution to avoid extra branch instructions
cmpne r6,r2
cmpne r7,r2
beq found_it
ldmia r1!,{r4-r7} ; use 2 sets of registers to hide load delays
cmp r8,r2
cmpne r9,r2
cmpne r10,r2
cmpne r11,r2
beq found_it
subs r3,r3,#1 ; decrement loop count
bne loop_top
mov r0,#0 ; return value = false (not found)
ldmia sp!,{r4-r11} ; restore non-volatile registers
bx lr ; return
found_it:
mov r0,#1 ; return true
ldmia sp!,{r4-r11}
bx lr
Update:
There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:
.L9: cmp r3, r0
beq .L8
.L3: ldr r2, [r3, #4]!
cmp r2, r1
bne .L9
mov r0, #1
.L2: add sp, sp, #1024
bx lr
.L8: mov r0, #0
b .L2
GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.
Update 2:
I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):
loop_top:
ldr r3,[r1],#4
cmp r3,r2
beq true_exit
subs r0,r0,#1
bne loop_top
false_exit: xxx
bx lr
true_exit: xxx
bx lr
As I said, I don't own the OP's exact hardware, but I will be testing the performance on an nVidia Tegra 3 and Tegra 4 of the 3 different versions and post the results here soon.
Update 3:
I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.
                My Code    MS Code
Surface RT      297ns      562ns
Surface RT 2    172ns      296ns
In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.

There's a trick for optimizing it (I was asked this on a job-interview once):
If the last entry in the array holds the value that you're looking for, then return true
Write the value that you're looking for into the last entry in the array
Iterate the array until you encounter the value that you're looking for
If you've encountered it before the last entry in the array, then return true
Return false
bool check(uint32_t theArray[], uint32_t compareVal)
{
uint32_t i;
uint32_t x = theArray[SIZE-1];
if (x == compareVal)
return true;
theArray[SIZE-1] = compareVal;
for (i = 0; theArray[i] != compareVal; i++);
theArray[SIZE-1] = x;
return i != SIZE-1;
}
This yields one branch per iteration instead of two branches per iteration.
UPDATE:
If you're allowed to allocate the array to SIZE+1, then you can get rid of the "last entry swapping" part:
bool check(uint32_t theArray[], uint32_t compareVal)
{
uint32_t i;
theArray[SIZE] = compareVal;
for (i = 0; theArray[i] != compareVal; i++);
return i != SIZE;
}
You can also get rid of the additional arithmetic embedded in theArray[i], using the following instead:
bool check(uint32_t theArray[], uint32_t compareVal)
{
uint32_t *arrayPtr;
theArray[SIZE] = compareVal;
for (arrayPtr = theArray; *arrayPtr != compareVal; arrayPtr++);
return arrayPtr != theArray+SIZE;
}
If the compiler doesn't already apply it, then this function will do so for sure. On the other hand, it might make it harder on the optimizer to unroll the loop, so you will have to verify that in the generated assembly code...

Keep the table in sorted order, and use Bentley's unrolled binary search:
i = 0;
if (key >= a[i+512]) i += 512;
if (key >= a[i+256]) i += 256;
if (key >= a[i+128]) i += 128;
if (key >= a[i+ 64]) i += 64;
if (key >= a[i+ 32]) i += 32;
if (key >= a[i+ 16]) i += 16;
if (key >= a[i+ 8]) i += 8;
if (key >= a[i+ 4]) i += 4;
if (key >= a[i+ 2]) i += 2;
if (key >= a[i+ 1]) i += 1;
return (key == a[i]);
The point is,
if you know how big the table is, then you know how many iterations there will be, so you can fully unroll it.
Then, there's no point testing for the == case on each iteration because, except on the last iteration, the probability of that case is too low to justify spending time testing for it.**
Finally, by expanding the table to a power of 2, you add at most one comparison, and at most a factor of two storage.
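For reference, the fragment above can be wrapped into a self-contained function roughly like this. It is only a sketch: the name sorted_contains and the TABLE_SIZE constant are made up for illustration, and it assumes the table has exactly 1024 entries sorted in ascending order.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024u /* assumption: table padded to exactly 2^10 entries */

/* Fully unrolled binary search over a sorted 1024-entry table.
 * Each step halves the remaining range; equality is tested only once,
 * at the very end. */
static bool sorted_contains(const uint32_t a[TABLE_SIZE], uint32_t key)
{
    uint32_t i = 0;
    if (key >= a[i + 512]) i += 512;
    if (key >= a[i + 256]) i += 256;
    if (key >= a[i + 128]) i += 128;
    if (key >= a[i +  64]) i +=  64;
    if (key >= a[i +  32]) i +=  32;
    if (key >= a[i +  16]) i +=  16;
    if (key >= a[i +   8]) i +=   8;
    if (key >= a[i +   4]) i +=   4;
    if (key >= a[i +   2]) i +=   2;
    if (key >= a[i +   1]) i +=   1;
    return key == a[i];
}
```

If there are fewer than 1024 real values, one option is to pad the tail with copies of the largest value: that keeps the table sorted and cannot introduce false matches.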
** If you're not used to thinking in terms of probabilities, every decision point has an entropy, which is the average information you learn by executing it.
For the >= tests, the probability of each branch is about 0.5, and -log2(0.5) is 1, so that means if you take one branch you learn 1 bit, and if you take the other branch you learn one bit, and the average is just the sum of what you learn on each branch times the probability of that branch.
So 1*0.5 + 1*0.5 = 1, so the entropy of the >= test is 1. Since you have 10 bits to learn, it takes 10 branches.
That's why it's fast!
On the other hand, what if your first test is if (key == a[i+512])? The probability of being true is 1/1024, while the probability of false is 1023/1024. So if it's true you learn all 10 bits!
But if it's false you learn -log2(1023/1024) = .00141 bits, practically nothing!
So the average amount you learn from that test is 10/1024 + .00141*1023/1024 = .0098 + .00141 = .0112 bits. About one hundredth of a bit.
That test is not carrying its weight!

You're asking for help with optimising your algorithm, which may push you to assembler. But your algorithm (a linear search) is not so clever, so you should consider changing your algorithm. E.g.:
perfect hash function
binary search
Perfect hash function
If your 256 "valid" values are static and known at compile time, then you can use a perfect hash function. You need to find a hash function that maps your input value to a value in the range 0..n, where there are no collisions for all the valid values you care about. That is, no two "valid" values hash to the same output value. When searching for a good hash function, you aim to:
Keep the hash function reasonably fast.
Minimise n. The smallest you can get is 256 (minimal perfect hash function), but that's probably hard to achieve, depending on the data.
Note for efficient hash functions, n is often a power of 2, which is equivalent to a bitwise mask of low bits (AND operation). Example hash functions:
CRC of input bytes, modulo n.
((x << i) ^ (x >> j) ^ (x << k) ^ ...) % n (picking as many i, j, k, ... as needed, with left or right shifts)
Then you make a fixed table of n entries, where the hash maps the input values to an index i into the table. For valid values, table entry i contains the valid value. For all other table entries, ensure that each entry of index i contains some other invalid value which doesn't hash to i.
Then in your interrupt routine, with input x:
Hash x to index i (which is in the range 0..n)
Look up entry i in the table and see if it contains the value x.
This will be much faster than a linear search of 256 or 1024 values.
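To make the lookup steps concrete, here is a toy sketch with n = 8 and four "valid" values. Everything in it is made up for illustration (the values, the hash, and the names ph_hash, ph_table and ph_contains); a real table would be generated offline for your actual data. The hash was picked by hand so that the four valid values land in different slots, and each unused slot holds a filler that hashes to a different slot, so it can never match.

```c
#include <stdbool.h>
#include <stdint.h>

#define PH_SIZE 8u /* n, a power of two so the modulo becomes a mask */

/* Hypothetical hash: fold the high half into the low half, then mask. */
static uint32_t ph_hash(uint32_t x)
{
    return (x ^ (x >> 16)) & (PH_SIZE - 1u);
}

/* Slot i holds either the valid value that hashes to i,
 * or a filler value that hashes somewhere else. */
static const uint32_t ph_table[PH_SIZE] = {
    1u,          /* slot 0: filler (hashes to 1) */
    0x1234ABCDu, /* slot 1: valid */
    0xDEADBEEFu, /* slot 2: valid */
    0x00000003u, /* slot 3: valid */
    5u,          /* slot 4: filler (hashes to 5) */
    4u,          /* slot 5: filler (hashes to 4) */
    0xCAFE0000u, /* slot 6: valid */
    6u           /* slot 7: filler (hashes to 6) */
};

/* One hash, one table read, one compare. */
static bool ph_contains(uint32_t x)
{
    return ph_table[ph_hash(x)] == x;
}
```

The cost is the same whether or not the value is present, which is convenient in an ISR.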
I've written some Python code to find reasonable hash functions.
Binary search
If you sort your array of 256 "valid" values, then you can do a binary search, rather than a linear search. That means you should be able to search a 256-entry table in only 8 steps (log2(256)), or a 1024-entry table in 10 steps. Again, this will be much faster than a linear search of 256 or 1024 values.

If the set of constants in your table is known in advance, you can use perfect hashing to ensure that only one access is made to the table. Perfect hashing determines a hash function that maps every interesting key to a unique slot (that table isn't always dense, but you can decide how un-dense a table you can afford, with less dense tables typically leading to simpler hashing functions).
Usually, the perfect hash function for the specific set of keys is relatively easy to compute; you don't want that to be long and complicated because that competes for time perhaps better spent doing multiple probes.
Perfect hashing is a "1-probe max" scheme. One can generalize the idea, with the thought that one should trade simplicity of computing the hash code with the time it takes to make k probes. After all, the goal is "least total time to look up", not fewest probes or simplest hash function. However, I've never seen anybody build a k-probes-max hashing algorithm. I suspect one can do it, but that's likely research.
One other thought: if your processor is extremely fast, the one probe to memory from a perfect hash probably dominates the execution time. If the processor is not very fast, then k>1 probes might be practical.

Use a hash set. It will give O(1) lookup time.
The following code assumes that you can reserve value 0 as an 'empty' value, i.e. not occurring in actual data.
The solution can be expanded for a situation where this is not the case.
#define HASH(x) (((x >> 16) ^ x) & 1023)
#define HASH_LEN 1024
uint32_t my_hash[HASH_LEN];
int lookup(uint32_t value)
{
int i = HASH(value);
while (my_hash[i] != 0 && my_hash[i] != value) i = (i + 1) % HASH_LEN;
return i;
}
void store(uint32_t value)
{
int i = lookup(value);
if (my_hash[i] == 0)
my_hash[i] = value;
}
bool contains(uint32_t value)
{
return (my_hash[lookup(value)] == value);
}
In this example implementation, the lookup time will typically be very low, but in the worst case it can be up to the number of entries stored. For a realtime application, you can also consider an implementation using binary trees, which will have a more predictable lookup time.

In this case, it might be worthwhile investigating Bloom filters. They're capable of quickly establishing that a value is not present, which is a good thing since most of the 2^32 possible values are not in that 1024 element array. However, there are some false positives that will need an extra check.
Since your table is apparently static, you can determine which false positives exist for your Bloom filter and put those in a perfect hash.
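A minimal Bloom filter sketch, to show what the fast "definitely not present" check could look like. The sizes, the two hash functions and the names (bloom_add, bloom_maybe_contains) are assumptions chosen for illustration, not tuned for any particular data set.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS 8192u /* assumption: 1 KiB bitmap, power of two */

static uint32_t bloom[BLOOM_BITS / 32u];

/* Two cheap, hypothetical hash functions mapping to a bit index. */
static uint32_t bloom_h1(uint32_t x)
{
    return (x ^ (x >> 16)) & (BLOOM_BITS - 1u);
}

static uint32_t bloom_h2(uint32_t x)
{
    /* multiplicative hash; the top 13 bits give 0..8191 */
    return (x * 2654435761u) >> 19;
}

/* Called once per table entry when the filter is built. */
static void bloom_add(uint32_t x)
{
    uint32_t b1 = bloom_h1(x), b2 = bloom_h2(x);
    bloom[b1 >> 5] |= 1u << (b1 & 31u);
    bloom[b2 >> 5] |= 1u << (b2 & 31u);
}

/* false: x is definitely not in the set.
 * true:  x may be in the set, so confirm with an exact lookup. */
static bool bloom_maybe_contains(uint32_t x)
{
    uint32_t b1 = bloom_h1(x), b2 = bloom_h2(x);
    return ((bloom[b1 >> 5] >> (b1 & 31u)) & 1u) &&
           ((bloom[b2 >> 5] >> (b2 & 31u)) & 1u);
}
```

In the ISR you would call bloom_maybe_contains first and only fall back to the exact search (or the perfect hash of known false positives) when it returns true.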

Assuming your processor runs at 204 MHz which seems to be the maximum for the LPC4357, and also assuming your timing result reflects the average case (half of the array traversed), we get:
CPU frequency: 204 MHz
Cycle period: 4.9 ns
Duration in cycles: 12.5 µs / 4.9 ns = 2551 cycles
Cycles per iteration: 2551 / 128 = 19.9
So, your search loop spends around 20 cycles per iteration. That doesn't sound awful, but I guess that in order to make it faster you need to look at the assembly.
I would recommend dropping the index and using a pointer comparison instead, and making all the pointers const.
bool arrayContains(const uint32_t *array, size_t length)
{
const uint32_t * const end = array + length;
while(array != end)
{
if(*array++ == 0x1234ABCD)
return true;
}
return false;
}
That's at least worth testing.

Other people have suggested reorganizing your table, adding a sentinel value at the end, or sorting it in order to provide a binary search.
You state "I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i != 0 is faster than checking if i < 256)."
My first advice is: get rid of the pointer arithmetic and the downcounting. Stuff like
for (i=0; i<256; i++)
{
if (compareVal == the_array[i])
{
[...]
}
}
tends to be idiomatic to the compiler. The loop is idiomatic, and the indexing of an array over a loop variable is idiomatic. Juggling with pointer arithmetic and pointers will tend to obfuscate the idioms to the compiler and make it generate code related to what you wrote rather than what the compiler writer decided to be the best course for the general task.
For example, the above code might be compiled into a loop running from -256 or -255 to zero, indexing off &the_array[256]. Possibly stuff that is not even expressible in valid C but matches the architecture of the machine you are generating for.
So don't microoptimize. You are just throwing spanners into the works of your optimizer. If you want to be clever, work on the data structures and algorithms but don't microoptimize their expression. It will just come back to bite you, if not on the current compiler/architecture, then on the next.
In particular using pointer arithmetic instead of arrays and indexes is poison for the compiler being fully aware of alignments, storage locations, aliasing considerations and other stuff, and for doing optimizations like strength reduction in the way best suited to the machine architecture.

Vectorization can be used here, as it often is in implementations of memchr. You use the following algorithm:
Create a mask of your query repeating, equal in length to your machine's word size (64-bit, 32-bit, etc.). On a 64-bit system you would repeat the 32-bit query twice.
Process the list as a list of multiple pieces of data at once, simply by casting the list to a list of a larger data type and pulling values out. For each chunk, XOR it with the mask, then XOR with 0b0111...1, then add 1, then & with a mask of 0b1000...0 repeating. If the result is 0, there is definitely not a match. Otherwise, there may (usually with very high probability) be a match, so search the chunk normally.
Example implementation: https://sourceware.org/cgi-bin/cvsweb.cgi/src/newlib/libc/string/memchr.c?rev=1.3&content-type=text/x-cvsweb-markup&cvsroot=src
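A sketch of the idea for two 32-bit values packed into one 64-bit word. Note that it uses the well-known "has a zero lane" bit trick, (v - 0x0000000100000001) & ~v & 0x8000000080000000, instead of the exact XOR/add sequence described above, and the function name swar_contains is made up. It only pays off where 64-bit operations are cheap, so on a 32-bit core it is mostly illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANE_LSB 0x0000000100000001ULL /* lowest bit of each 32-bit lane */
#define LANE_MSB 0x8000000080000000ULL /* highest bit of each 32-bit lane */

/* Tests two 32-bit elements per 64-bit operation. */
static bool swar_contains(const uint32_t *a, size_t len, uint32_t key)
{
    /* the query repeated twice */
    const uint64_t mask = ((uint64_t)key << 32) | key;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        uint64_t chunk;
        memcpy(&chunk, &a[i], sizeof chunk); /* load without aliasing issues */
        uint64_t v = chunk ^ mask;           /* a lane is 0 iff it equals key */
        if ((v - LANE_LSB) & ~v & LANE_MSB)  /* nonzero iff some lane is 0 */
            return true;
    }
    /* odd length: check the leftover element normally */
    return (len & 1u) && a[len - 1] == key;
}
```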

If you can accommodate the domain of your values with the amount of memory that's available to your application, then, the fastest solution would be to represent your array as an array of bits:
bool theArray[MAX_VALUE]; // of which 1024 values are true, the rest false
uint32_t compareVal = 0x1234ABCD;
bool validFlag = theArray[compareVal];
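If memory is tight, the same idea can be packed into real bits, one per possible value. This is only a sketch under a strong assumption: the values fit in 16 bits, so the whole domain needs 8 KiB. The names valid_bits, bitmap_mark and bitmap_contains are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DOMAIN_SIZE 65536u /* assumption: values are 16-bit */

/* One bit per possible value: 65536 / 8 = 8 KiB. */
static uint32_t valid_bits[DOMAIN_SIZE / 32u];

/* Called once per valid value when the table is built. */
static void bitmap_mark(uint16_t v)
{
    valid_bits[v >> 5] |= 1u << (v & 31u);
}

/* One load, one shift, one mask: constant time. */
static bool bitmap_contains(uint16_t v)
{
    return (valid_bits[v >> 5] >> (v & 31u)) & 1u;
}
```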
EDIT
I'm astounded by the number of critics. The title of this thread is "How do I quickly find whether a value is present in a C array?" for which I will stand by my answer because it answers precisely that. I could argue that this has the most speed efficient hash function (since address === value). I've read the comments and I'm aware of the obvious caveats. Undoubtedly those caveats limit the range of problems this can be used to solve, but, for those problems that it does solve, it solves very efficiently.
Rather than reject this answer outright, consider it as the optimal starting point from which you can evolve by using hash functions to achieve a better balance between speed and memory use.

I'm sorry if this was already covered by another answer; I'm just a lazy reader. Feel free to downvote then ))
1) You could remove the counter 'i' altogether and just compare pointers, i.e.
for (ptr = &the_array[0]; ptr < the_array+1024; ptr++)
{
if (compareVal == *ptr)
{
break;
}
}
... compare ptr and the_array+1024 here; you do not need validFlag at all.
All that won't give any significant improvement though; such an optimization could probably be achieved by the compiler itself.
2) As was already mentioned in other answers, almost all modern CPUs are RISC-based, for example ARM. Even modern Intel x86 CPUs use RISC cores inside, as far as I know (translating from x86 on the fly). A major optimization for RISC (and for Intel and other CPUs as well) is pipeline optimization, i.e. minimizing code jumps. One type of such optimization (probably a major one) is "cycle rollback", more commonly called loop unrolling. It's incredibly stupid, and efficient; even the Intel compiler can do that AFAIK. It looks like:
if (compareVal == the_array[0]) { validFlag = true; goto end_of_compare; }
if (compareVal == the_array[1]) { validFlag = true; goto end_of_compare; }
...and so on...
end_of_compare:
This way the pipeline is not broken in the worst case (compareVal is absent from the array), so it is as fast as possible (not counting algorithmic optimizations such as hash tables, sorted arrays and so on, mentioned in other answers, which may give better results depending on array size). The cycle rollback approach can be applied there as well, by the way. I'm writing here about what I think I didn't see in the other answers.
The second part of this optimization is that each array item is accessed by a direct address (calculated at compile time; make sure you use a static array), so no additional ADD op is needed to calculate a pointer from the array's base address. This optimization may not have a significant effect, since AFAIK the ARM architecture has special features to speed up array addressing. But anyway it's always better to know that you did your best directly in the C code, right?
Cycle rollback may look awkward due to the waste of ROM (yes, you did right placing it in the fast part of RAM, if your board supports this feature), but actually it's a fair price for speed, being based on the RISC concept. This is just a general point of calculation optimization: you sacrifice space for the sake of speed, and vice versa, depending on your requirements.
If you think that rollback for array of 1024 elements is too large sacrifice for your case, you can consider 'partial rollback', for example dividing the array into 2 parts of 512 items each, or 4x256, and so on.
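One way to read "partial rollback" is partial loop unrolling: keep the loop, but test several elements per iteration so there is one loop branch per group instead of one per element. A sketch with a factor of 4 (the factor and the name contains_unrolled4 are arbitrary choices for illustration; the length is assumed to be a multiple of 4):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Linear search, manually unrolled by 4.
 * Assumption: len is a multiple of 4 (true for 256 and 1024). */
static bool contains_unrolled4(const uint32_t *a, size_t len, uint32_t key)
{
    for (size_t i = 0; i < len; i += 4) {
        if (a[i] == key || a[i + 1] == key ||
            a[i + 2] == key || a[i + 3] == key)
            return true;
    }
    return false;
}
```

Whether this beats what the compiler generates on its own is something to check in the generated assembly and with a timing measurement.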
3) Modern CPUs often support SIMD ops, for example the ARM NEON instruction set, which lets you execute the same ops in parallel. Frankly speaking I do not remember if it is suitable for comparison ops, but I feel it may be; you should check that. Googling shows that there may be some tricks as well to get max speed, see https://stackoverflow.com/a/5734019/1028256
I hope it can give you some new ideas.

This is more like an addendum than an answer.
I've had a similar case in the past, but my array was constant over a considerable number of searches.
In half of them, the searched value was NOT present in array. Then I realized I could apply a "filter" before doing any search.
This "filter" is just a simple integer number, calculated ONCE and used in each search.
It's in Java, but it's pretty simple:
binaryfilter = 0;
for (int i = 0; i < array.length; i++)
{
// just apply "Binary OR Operator" over values.
binaryfilter = binaryfilter | array[i];
}
So, before doing a binary search, I check binaryfilter:
// Check binaryfilter vs value with a "Binary AND Operator"
if ((binaryfilter & valuetosearch) != valuetosearch)
{
// valuetosearch is not in the array!
return false;
}
else
{
// valuetosearch MAYBE in the array, so let's check it out
// ... do binary search stuff ...
}
You can use a 'better' hash algorithm, but this can be very fast, especially for large numbers.
Maybe this could save you even more cycles.

Make sure the instructions ("the pseudo code") and the data ("theArray") are in separate (RAM) memories so CM4 Harvard architecture is utilized to its full potential. From the user manual:
To optimize the CPU performance, the ARM Cortex-M4 has three buses for Instruction (code) (I) access, Data (D) access, and System (S) access. When instructions and data are kept in separate memories, then code and data accesses can be done in parallel in one cycle. When code and data are kept in the same memory, then instructions that load or store data may take two cycles.
Following this guideline I observed ~30% speed increase (FFT calculation in my case).

I'm a great fan of hashing. The problem of course is to find an efficient algorithm that is both fast and uses a minimum amount of memory (especially on an embedded processor).
If you know beforehand the values that may occur you can create a program that runs through a multitude of algorithms to find the best one - or, rather, the best parameters for your data.
I created such a program that you can read about in this post and achieved some very fast results. 16000 entries translates roughly to 2^14 or an average of 14 comparisons to find the value using a binary search. I explicitly aimed for very fast lookups - on average finding the value in <=1.5 lookups - which resulted in greater RAM requirements. I believe that with a more conservative average value (say <=3) a lot of memory could be saved. By comparison the average case for a binary search on your 256 or 1024 entries would result in an average number of comparisons of 8 and 10, respectively.
My average lookup required around 60 cycles (on a laptop with an intel i5) with a generic algorithm (utilizing one division by a variable) and 40-45 cycles with a specialized (probably utilizing a multiplication). This should translate into sub-microsecond lookup times on your MCU, depending of course on the clock frequency it executes at.
It can be real-life-tweaked further if the entry array keeps track of how many times an entry was accessed. If the entry array is sorted from most to least accessed before the indices are computed then it'll find the most commonly occurring values with a single comparison.
