find nan in array of doubles using simd

find nan in array of doubles using simd - c

This question is very similar to:
SIMD instructions for floating point equality comparison (with NaN == NaN)
Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0.
I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/
My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like that route to have the best performance.
Initially I was going to do a comparison of 4 doubles to themselves, mirroring the non-SIMD approach for NaN detection (i.e. NaN only value where a != a is true). Something like:
data *double = ...
__m256d a, b;
int temp = 0;
//This bit would be in a loop over the array
//I'd probably put a sentinel in and loop over while !temp
a = _mm256_loadu_pd(data);
b = _mm256_cmp_pd(a, a, _CMP_NEQ_UQ);
temp = temp | _mm256_movemask_pd(b);
However, in some of the examples of comparison it looks like there is some sort of NaN detection already going on in addition to the comparison itself. I briefly thought, well if something like _CMP_EQ_UQ will detect NaNs, I can just use that and then I can compare 4 doubles to 4 doubles and magically look at 8 doubles at once at the same time.
__m256d a, b, c;
a = _mm256_loadu_pd(data);
b = _mm256_loadu_pd(data+4);
c = _mm256_cmp_pd(a, b, _CMP_EQ_UQ);
At this point I realized I wasn't quite thinking straight because I might happen to compare a number to itself that is not a NaN (i.e. 3 == 3) and get a hit that way.
So my question is, is comparing 4 doubles to themselves (as done above) the best I can do or is there some other better approach to finding out whether my array has a NaN?

You might be able to avoid this entirely by checking fenv status, or if not then cache block it and/or fold it into another pass over the same data, because it's very low computational intensity (work per byte loaded/stored), so it easily bottlenecks on memory bandwidth. See below.
The comparison predicate you're looking for is _CMP_UNORD_Q or _CMP_ORD_Q to tell you that the comparison is unordered or ordered, i.e. that at least one of the operands is a NaN, or that both operands are non-NaN, respectively. What does ordered / unordered comparison mean?
The asm docs for cmppd list the predicates and have equal or better details than the intrinsics guide.
So yes, if you expect NaN to be rare and want to quickly scan through lots of non-NaN values, you can vcmppd two different vectors against each other. If you cared about where the NaN was, you could do extra work to sort that out once you know that there is at least one in either of two input vectors. (Like _mm256_cmp_pd(a,a, _CMP_UNORD_Q) to feed movemask + bitscan for lowest set bit.)
OR or AND multiple compares per movemask
Like with other SSE/AVX search loops, you can also amortize the movemask cost by combining a few compare results with _mm256_or_pd (find any unordered) or _mm256_and_pd (check for all ordered). E.g. check a couple cache lines (4x _mm256d with 2x _mm256_cmp_pd) per movemask / test/branch. (glibc's asm memchr and strlen use this trick.) Again, this optimizes for your common case where you expect no early-outs and have to scan the whole array.
Also remember that it's totally fine to check the same element twice, so your cleanup can be simple: a vector that loads up to the end of the array, potentially overlapping with elements you already checked.
// checks 4 vectors = 16 doubles
// non-zero means there was a NaN somewhere in p[0..15]
static inline
int any_nan_block(double *p) {
__m256d a = _mm256_loadu_pd(p+0);
__m256d abnan = _mm256_cmp_pd(a, _mm256_loadu_pd(p+ 4), _CMP_UNORD_Q);
__m256d c = _mm256_loadu_pd(p+8);
__m256d cdnan = _mm256_cmp_pd(c, _mm256_loadu_pd(p+12), _CMP_UNORD_Q);
__m256d abcdnan = _mm256_or_pd(abnan, cdnan);
return _mm256_movemask_pd(abcdnan);
}
// more aggressive ORing is possible but probably not needed
// especially if you expect any memory bottlenecks.
I wrote the C as if it were assembly, one instruction per source line. (load / memory-source cmppd). These 6 instructions are all single-uop in the fused-domain on modern CPUs, if using non-indexed addressing modes on Intel. test/jnz as a break condition would bring it up to 7 uops.
In a loop, an add reg, 16*8 pointer increment is another 1 uop, and cmp / jne as a loop condition is one more, bringing it up to 9 uops. So unfortunately on Skylake this bottlenecks on the front-end at 4 uops / clock, taking at least 9/4 cycles to issue 1 iteration, not quite saturating the load ports. Zen 2 or Ice Lake could sustain 2 loads per clock without any more unrolling or another level of vorpd combining.
Another trick that might be possible is to use vptest or vtestpd on two vectors to check that they're both non-zero. But I'm not sure it's possible to correctly check that every element of both vectors is non-zero. Can PTEST be used to test if two registers are both zero or some other condition? shows that the other way (that _CMP_UNORD_Q inputs are both all-zero) is not possible.
But this wouldn't really help: vtestpd / jcc is 3 uops total, vs. vorpd / vmovmskpd / test+jcc also being 3 fused-domain uops on existing Intel/AMD CPUs with AVX, so it's not even a win for throughput when you're branching on the result. So even if it's possible, it's probably break even, although it might save a bit of code size. And wouldn't be worth considering if it takes more than one branch to sort out the all-zeros or mix_zeros_and_ones cases from the all-ones case.
Avoiding work: check fenv flags instead
If your array was the result of computation in this thread, just check the FP exception sticky flags (in MXCSR manually, or via fenv.h fegetexcept) to see if an FP "invalid" exception has happened since you last cleared FP exceptions. If not, I think that means the FPU hasn't produced any NaN outputs and thus there are none in arrays written since then by this thread.
If it is set, you'll have to check; the invalid exception might have been raised for a temporary result that didn't propagate into this array.
Cache blocking:
If/when fenv flags don't let you avoid the work entirely, or aren't a good strategy for your program, try to fold this check into whatever produced the array, or into the next pass that reads it. So you're reusing data while it's already loaded into vector registers, increasing computational intensity. (ALU work per load/store.)
Even if data is already hot in L1d, it will still bottleneck on load port bandwidth: 2 loads per cmppd still bottlenecks on 2/clock load port bandwidth, on CPUs with 2/clock vcmppd ymm (Skylake but not Haswell).
Also worthwhile to align your pointers to make sure you're getting full load throughput from L1d cache, especially if data is sometimes already hot in L1d.
Or at least cache-block it so you check a 128kiB block before running another loop on that same block while it's hot in cache. That's half the size of 256k L2 so your data should still be hot from the previous pass, and/or hot for the next pass.
Definitely avoid running this over a whole multi-megabyte array and paying the cost of getting it into the CPU core from DRAM or L3 cache, then evicting again before another loop reads it. That's worst case computational intensity, paying the cost of getting it into a CPU core's private cache more than once.

Related

How can I test if CRC32C is a "good" random generator?

I recently discovered that the _mm_crc32_* intel intrinsic instruction could be used to generate (pseudo-)random 32 bit numbers.
#include <nmmintrin.h> /* needs CRC32C instruction from SSE4.2 instruction set extension */
uint32_t rnd = 1; /* initialize with seed != 0 */
/* period length is 4,294,967,295 = 2^32-1 */
while (1) {
#if 0 // this was faster but worse than xorshift32 (fails more tests)
// rnd = _mm_crc32_u8(rnd, rnd >> 3);
#else // this is faster and better than xorshift32 (fails fewer tests)
rnd = _mm_crc32_u32(rnd, rnd << 18);
#endif
printf("%08X\n", rnd);
}
This method is as fast as a LCG, and faster than xorshift32. Wikipedia says that because xorshift generators "fail a few statistical tests, they have been accused of being unreliable".
Now I am wondering if the CRC32C method passes the various tests that are done on random number generators. I only verified that each bit, even the LSB, is "random" by trying to compress using PAQ8 compressors (which failed). Can someone help me do better tests?
EDIT: Using tests from the suggested TestU01 suite, the method that I previously used turned out to be worse than xorshift32. I have updated the source code above in case anyone is interested in using the better version.

This is an interesting question. Ultimately the only test that counts is "does this rng yield correct results for the problem I'm working on". What are you hoping to do with the rng?
In order to avoid answering that question for every different problem, various tests have been devised. See for example the "Diehard" tests devised by George Marsaglia. A web search for "marsaglia random number generator tests" turns up several interesting links.
I think Marsaglia's work is a couple of decades old at this point. I don't know if there has been more work on this topic since then. My guess is that for non-cryptographic purposes, an rng which passes the Diehard tests is probably sufficient.

There's a big difference in the requirements for a PRNG for a video game (especially single-player) vs. a monte-carlo simulation. Small biases can be a problem for scientific numerical computing, but generally not for a game, especially if numbers from the same PRNG are used in different ways.
There's a reason that different PRNGs with different speed / quality tradeoffs exist.
This one is very fast, especially if the seed / state stays in a register, taking only 2 or 3 uops on a modern Intel CPU. So it's fantastic if it can inline into a loop. Compared to anything else of the same speed, it's probably better quality. But compared to something only a little bit slower with larger state, it's probably pathetic if you care about statistical quality.
On x86 with BMI2, each RNG step should only require rorx edx, eax, 3 / crc32 eax, dl. On Haswell/Skylake, that's 2 uops with total latency = 1 + 3 cycles for the loop-carried dependency. (http://agner.org/optimize/). Or 3 uops without BMI2, for mov edx, eax / shr edx,3 / crc32 eax, dl, but still only 4 cycles of latency on CPUs with zero-latency mov for GP registers: Ivybridge+ and Ryzen.
2 uops is negligible impact on surrounding code in the normal case where you do enough work with each PRNG result that the 4-cycle dependency chain isn't a bottleneck. (Or ~9 cycle if your compiler stores/reloads the PRNG state inside the loop instead of keeping it live in a register and sinking the store to the global to after a loop, costing you 2 extra 1-uop instructions).
On Ryzen, crc32 is 3 uops with 3c total latency, so more impact on surrounding code but the same one per 4 clock bottleneck if you're doing so little with the PRNG results that you bottleneck on that.
I suspect you may have been benchmarking the loop-carried dependency chain bottleneck, not the impact on real surrounding code that does enough work to hide that latency. (Almost all relevant x86 CPUs are out-of-order execution.) Making an RNG even cheaper than xorshift128+, or even xorshift128, is likely to be of negligible benefit for most use-cases. xorshift128+ or xorshift128* is fast, and quite good quality for the speed.
If you want lots of PRNG results very quickly, consider using a SIMD xorshift128+ to run two or four generators in parallel (in different elements of XMM or YMM vectors). Especially if you can usefully use a __m256i vector of PRNG results. See AVX/SSE version of xorshift128+, and also this answer where I used it.
Returning the entire state as the RNG result is usually a bad thing, because it means that one value tells you exactly what the next one will be. i.e. 3 is always followed by 1897987234 (fake numbers), never 3 followed by something else. Most statistical quality tests should pick this up, but this might or might not be a problem for any given use-case.
Note that https://en.wikipedia.org/wiki/Xorshift is saying that even xorshift128 fails a few statistical tests. I assume xorshift32 is significantly worse. CRC32c is also based on XOR and shift (but also with bit-reflect and modulo in Galois Field(2)), so it's reasonable to think it might be similar or better in quality.
You say your choice of crc32(rnd, rnd>>3) gives a period of 2^32, and that's the best you can do with a state that small. (Of course rnd++ achieves the same period, so it's not the only measure of quality.) It's likely at least as good as an LCG, but those are not considered high quality, especially if the modulus is 2^32 (so you get it for free from fixed-width integer math).
I tested with a bitmap of previously seen values (NASM source), incrementing a counter until we reach a bitmap entry we've already seen. Indeed, all 2^32 values are seen before coming back to the original value. Since the period is exactly 0x100000000, that rules out any special values that are part of a shorter cycle.
Without memory access to the bitmap, Skylake does indeed run the loop at 4 cycles per iteration, as expected from the latency bottleneck. And rorx/crc32 is just 2 total uops.

One measure of the goodness of a PRNG is the length of the cycle. If that matters for your application, then a CRC-32 as you are using it would not be a good choice, since the cycle is only 232. One consequence is that if you are using more samples than that, which doesn't take very long, your results will repeat. Another is that there is a correlation between successive CRC-32 values, where there is only one possible value that will follow the current value.
Better PRNGs have exponentially longer cycles, and the values returned are smaller than the bits in the state, so that successive values do not have that correlation.
You don't need to use a CRC-32C instruction to be fast. Also you don't need to design your own PRNG, which is fraught with hidden perils. Best to leave that to the professionals. See this work for high-quality, small, and fast random number generators.

What is 'differential timing' technique for doing benchmarks?

At around 39 minute of "Writing Fast Code I" by Andrei Alexandrescu (link here to youtube)
there is a slide of how to use differential timing... can someone show me some basic code with this approach? It was only mentioned for a second, but I think that's an interesting idea.
Run baseline 2n times (t2a)
vs. baseline n times + contender n times (ta+b).
Relative improvement = "t2a / (2ta+b - t2a)"
some overhead noises canceled

Alexsandrescu's slide is rather trivial to pour into code:
auto start = clock::now();
for( int i = 0; i < 2*n; i++ )
baseline();
auto t2a = clock::now() - start;
start = clock::now();
for( int i = 0; i < n; i++ )
baseline();
// *
for( int i = 0; i < n; i++ )
contender();
auto taplusb = clock::now() - start;
double r = t2a / (2 * taplusb - t2a) // relative speedup
* Synchronization point which prevents optimization across the last two loops.
I would be more interested in the mathematical reasoning behind measuring the relative speed up this way as opposed to simply tBaseline / tContender as I've been doing for ever. He only vaguely hints at '...overhead noise (being) cancelled (out)', but doesn't explain it in detail.

If you keep watching until 41:40 or so, he mentions it again when warning about the pitfall of first run vs. subsequent (allocators warmed up, etc.)
The best solution for that is doing warm-up runs before the first timed region.
I think he's picturing that 2n baseline vs. n baseline + n contender in separate invocations of the benchmark program.
So instead of doing some warmup runs before the timed region, he's using the baseline as a controlled warmup inside the timed region. This might make it possible to just time the whole program, e.g. perf stat, instead of calling a time function inside the program. Depending on how much process startup overhead your OS has vs. how long you make your repeat loop.
Microbenchmarking is hard and there are many pitfalls. Notably benchmarking optimized code while still making sure there isn't optimization between iterations of your repeat loop. (Often it's useful to use inline asm "escape" macros to force the compiler to materialize a value in an integer register, and/or to forget about the value of a variable to defeat CSE. Sometimes it's sufficient to just add the final result of each iteration to a sum that you print at the end.)
This is the first I've heard of this differential idea. It doesn't sound more useful than normal warm-ups.
If anything it will make the contender look slightly worse than using the function under test for some warm-up runs before the timed region. Using the same function as the timed region will warm up branch-prediction for it. Or not because after inlining the warm-up vs. main versions will be at different addresses. The same pattern at different addresses may possibly still help a modern TAGE predictor but IDK.
Or if contender has any lookup tables, those will become hot in cache from the warmup.
In any case, warmups are essential, unless you make the repeat count long enough to dwarf the time it takes for the CPU to switch to max turbo and so on. And to page-fault in all the memory you touch.
If your calculated time/iteration doesn't stay constant with your repeat count, your microbenchmark is broken.
Take the rest of his advice with a grain of salt, too. Most of it use useful (e.g. prefer 32-bit integers even for local temporaries, not just for arrays for cache-footprint reasons), but the reasoning is wrong for some of them.
His explanation that an ALU can do 2x 32-bit adds or 1x 64-bit add only applies to SIMD: 4x 32-bit int in a vector for paddd or 2x 64-bit int in a vector for paddq. But x86 scalar add r32, r32 has the same throughput as add r64,r64. I don't think it was true even on Pentium 4 (Nocona) despite P4 having funky double-pumped ALUs with 0.5 cycle latency for add. At least before Prescott/Nocona which introduced 64-bit support.
Using 32-bit unsigned integers on x86-64 can stop the compiler from optimizing to pointer increments if it wants to. It has to maintain correctness in case of 32-bit wraparound of a variable before array indexing.
Using 16-bit or 8-bit locals to match the data in an array can sometimes help auto-vectorization, IIRC. Gcc/clang sometimes make really braindead code that unpacks to 32-bit and then re-packs down to 8-bit elements, when processing an array of int8_t or uint8_t. I forget if I've every actually worked around that by using narrow locals, though. C default integer promotions bring most expressions back up to 32-bit.
Also, at https://youtu.be/vrfYLlR8X8k?t=3498, he claims that FP->int is expensive. That's never been true on x86-64: FP math uses SSE/SSE2 which has an instruction that does truncating conversion. FP->int used to be slow in the bad old days of x87 math, where you had to change the FP rounding mode, fistp, then change it back, to get C truncation semantics. But SSE includes cvttsd2si exactly for that common case.
He also says float is no faster than double. That's true for scalar (other than div/sqrt), but if your code can auto-vectorize then you get twice as much work done per instruction and the instructions have the same throughput. (Twice as many elements fit in a SIMD vector.)
How the math works:
It just cancels out the n * baseline time from both parts, effectively doing (2 * baseline) / (2*contender) = baseline/contender.
It assumes that the times add normally (not overlapping computation). t_2a = 2 * baseline, and 2 * t_ab = 2 * baseline + 2 * contender. Subtracting cancels the 2*baseline parts, leaving you with 2*contender.
The trick isn't in the math, if anything this is more mathematically dangerous because subtracting two larger numbers accumulates error. i.e. if the n*baseline actually takes different amounts of time in the two runs (because you didn't control that perfectly), then it doesn't cancel and contributes error to your estimate.

Using SSE on floating point pixels with only 3 color components

I am creating a struct to store a single RGB pixel in an image.
struct Pixel
{
// color values range from 0.0 to 1.0
float r, g, b;
}__attribute__((aligned(16));
I want to use 128 bit SSE instructions to do things like adding, multiplying, etc. This way I can perform operations on all 3 color channels at once. So the first packed float in my SSE register would be red, then green, then blue, but I am not so sure what would go into my fourth register. I really don't care what bits are in the extra 32 bits of padding. When I load a pixel into the SSE register I would imagine it contains either zeros or junk values. Is this problematic? Should I add a fourth alpha channel even though I don't really need one? The only way I see this being an issue is if I were dividing by a pixel and there was a zero value in the fourth spot, or of I was taking a root of a negative, etc.

Integer ops will have no problem at all with uninitialized values, since the latency is never data-dependent. Floating point is different. Some FPUs slow down on denormals, NaNs, and infinities (in any one of the vector elements).
Intel Nehalem and earlier slow down a lot when doing math ops with denormal inputs/outputs, and on FP underflow/overflow. Sandybridge has a nice FPU with fast add/sub for any inputs (according to Agner Fog's instruction tables), but multiply can still slow down.
Add/sub/multiply are fine with zeros, but potentially a problem with uninitialized junk that might represent NaN or something.
Be careful with division that you aren't dividing by zero. That could even raise an FPU exception, depending on HW settings.
So yes, keeping the unused element zeroed is probably a good idea. Depending how you generate things in the first place, this may be pretty cheap to accomplish. (e.g. movd/pinsrd/pinsrd (or insertps) to put three 32bit elements into a vector, with the initial movd zeroing the high 96b.)
One workaround could be to store a 2nd copy of the blue channel in the 4th element. (or whatever is most convenient to shuffle there.) You could load vectors with movsldup(SSE3) / movlps. After movsldup, your register would hold { b b r r }. movlps would re-load the lower 64bits, so you'd have { b b g r }. (This is equivalent to movsd, BTW.) Or if the shuffle port is less busy than the load ports, do one 16B load and then shufps. (movsldup on Intel CPUs is a single uop that runs on a load port, even though it has the duplication built in.)
Another option would be to pack your pixels into 12 bytes, so a 16B load would get one component of the next pixel. Depending on what you're doing, overlapping stores that clobber one element of the next pixel might or might not be ok. Loading the next pixel before storing the current could work around that for some ops. It's quite easy to be cache or bandwidth-limited, so saving 1/4 space at the small cost of the occasional cache-line split load/store could be worth it.

Performance Optimization for Matrix Rotation

I'm now trapped by a performance optimization lab in the book "Computer System from a Programmer's Perspective" described as following:
In a N*N matrix M, where N is multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
A example for matrix rotation(N is 3 instead of 32 for simplicity):
------- -------
|1|2|3| |3|6|9|
------- -------
|4|5|6| after rotate is |2|5|8|
------- -------
|7|8|9| |1|4|7|
------- -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I come up with an idea by inner-loop-unroll. The result is:
Code Version Speed Up
original 1x
unrolled by 2 1.33x
unrolled by 4 1.33x
unrolled by 8 1.55x
unrolled by 16 1.67x
unrolled by 32 1.61x
I also get a code snippet from pastebin.com that seems can solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
int stride = 32;
int count = dim >> 5;
src += dim - 1;
int a1 = count;
do {
int a2 = dim;
do {
int a3 = stride;
do {
*dst++ = *src;
src += dim;
} while(--a3);
src -= dim * stride + 1;
dst += dim - stride;
} while(--a2);
src += dim * (stride + 1);
dst -= dim * dim - stride;
} while(--a1);
}
After carefully read the code, I think main idea of this solution is treat 32 rows as a data zone, and perform the rotating operation respectively. Speed up of this version is 1.85x, overwhelming all the loop-unroll version.
Here are the questions:
In the inner-loop-unroll version, why does increment slow down if the unrolling factor increase, especially change the unrolling factor from 8 to 16, which does not effect the same when switch from 4 to 8? Does the result have some relationship with depth of the CPU pipeline? If the answer is yes, could the degrade of increment reflect pipeline length?
What is the probable reason for the optimization of data-zone version? It seems that there is no too much essential difference from the original naive version.
EDIT:
My test environment is Intel Centrino Duo architecture and the verion of gcc is 4.4
Any advice will be highly appreciated!
Kind regards!

What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the centrino core2 duo has two processors, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (ie, when you run the task manager (if you're on windows, top if you're on linux), you'll see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you'll get a faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last, think about how a McDonald's drive through works (I think that that's relatively widespread?). A car enters the drivethrough, orders at one window, pays at a second window, and receives food at a third window. If a second drive enters when the first is still ordering, then by the time both finish (assuming each operation in the drive through takes one 'cycle' or time unit), then 2 full operations will be done by the time 4 cycles have elapsed. If each car did all of their operations at one window, then the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles for ordering, paying and getting food, for a total of 6 cycles. So, operation time due to pipelining decreases.
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
Going to 32 causing a decrease in performance may have to do with bandwidth to the processor from the cache (again a guess, can't know for sure without seeing your code exactly, as well as the machine code). If all the instructions can't fit into cache or into the registers, then there is some time necessary to prepare them all to run (ie, people have to get into their cars and get to the drive through in the first place). There will be some reduction in speed if they all get there all at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code is going straight to the pointers of the arrays and accessing them directly; basically, for each of the movements from src to dst, there are less operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + [] in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.

The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a coupe reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set by the decrement automatically) you have to explicitly compare to a max value with each iteration.
2) There is no math in the inner loop. You are doing a bunch of math in your inner loop. I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indexes to the base address of the array which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom up from the inside. This means taking the inner-most loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum to get to the next item, the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment" this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this like subtractions and multiplication is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache you should should see even more dramatic differences in implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 even though the number of data copies is the same.

The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.

What is the maximum theoretical speed-up due to SSE for a simple binary subtraction?

In trying to figure out whether or not my code's inner loop is hitting a hardware design barrier or a lack of understanding on my part barrier. There's a bit more to it, but the simplest question I can come up with to answer is as follows:
If I have the following code:
float px[32768],py[32768],pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);
for(i=0;i<32768-1;i++) {
xref=px[i];
yref=py[i];
zref=pz[i];
for(j=0;j<32768-1;j++ {
deltx=xref-px[j];
delty=yref-py[j];
deltz=zref-pz[j];
} }
What type of maximum theoretical speed up would I be able to see by going to SSE instructions in a situation where I have complete control over code (assembly, intrinsics, whatever) but no control over runtime environment other than architecture (i.e. it's a multi-user environment so I can't do anything about how the OS kernel assigns time to my particular process).
Right now I'm seeing a speed up of 3x with my code, when I would have thought using SSE would give me much more vector depth than the 3x speed up is indicating (presumably the 3x speed up tells me I have a 4x maximum theoretical throughput). (I've tried things such as letting deltx/delty/deltz be arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only 3x speed up.) I'm using the intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics obviously.

It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPU's can do at least one floating point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.

I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector, and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separate it after, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.

Consider: How wide is a float? How wide is the SSEx instruction? The ratio should should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipes play havok with getting good estimates of speedup.

You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight