Compute Max and Min efficiently on a data stream - C

I'm working in C with a stream of data. Basically, I receive a column array of 6 elements every n milliseconds, and I would like to compute the max and min value for each row of data.
To make this clear, this is what my data looks like (this is a toy example; in reality I'll have thousands of columns acquired):
[6] [-10] [5]
[1] [5] [3]
[5] [30] [10]
[2] [-10] [0]
[-2] [5] [10]
[-5] [0] [1]
So basically (as I said) I receive a column of data every n milliseconds, and I want to compute the max and min value row-wise. So in my previous example my result would be:
max_values=[6,5,30,2,10,1]
min_values=[-10,1,5,-10,-2,-5]
I want to point out that I have no access to the full matrix; I can only work on the single columns of 6 elements that I receive every n milliseconds.
This is my simple algorithm so far (I'm omitting the rest of the code since it's part of a bigger project):
for (int i = 0; i < 6; i++) {
    if (input[i] > temp_max[i]) {
        temp_max[i] = input[i];
    }
    if (input[i] < temp_min[i]) {
        temp_min[i] = input[i];
    }
}
Where input, temp_max and temp_min are all float arrays of dimension 6.
Basically, my code executes this piece of code every time a new input array is available and updates the maximum and minimum accordingly.
Since I'm interested in performance (this is going to run on an embedded system), is there any way to improve this part of the code? Performing a comparison for every single element of the two arrays doesn't seem like the smartest idea.

With random input data (i.e. unordered data), it'll be pretty hard (read: impossible) to find the min/max without a comparison per element.
You may get some minor improvement from something like:
for (int i = 0; i < 6; i++) {
    if (input[i] > temp_max[i]) {
        temp_max[i] = input[i];
    }
    else // If the current element raised the max, it cannot also lower the min
    {
        if (input[i] < temp_min[i]) {
            temp_min[i] = input[i];
        }
    }
}
but I doubt this will be a significant improvement.

Branching is slow, especially on embedded systems, and so is scalar computation.
Fortunately, your target processor appears to be an ARM-based processor supporting the NEON SIMD instruction set (apparently one based on a 64-bit ARMv8 Cortex-A53). NEON can operate on four 32-bit floating-point values per instruction. This should be much faster than the current code (which compilers apparently fail to vectorize).
Here is some example code (untested):
#include <arm_neon.h>

void minmax_optim(float temp_min[6], float temp_max[6], float input[6]) {
    /* Compute the first 4 floats */
    float32x4_t vInput = vld1q_f32(input);
    float32x4_t vMin = vld1q_f32(temp_min);
    float32x4_t vMax = vld1q_f32(temp_max);
    vMin = vminq_f32(vInput, vMin);
    vMax = vmaxq_f32(vInput, vMax);
    vst1q_f32(temp_min, vMin);
    vst1q_f32(temp_max, vMax);
    /* Remaining 2 floats */
    float32x2_t vLastInput = vld1_f32(input + 4);
    float32x2_t vLastMin = vld1_f32(temp_min + 4);
    float32x2_t vLastMax = vld1_f32(temp_max + 4);
    vLastMin = vmin_f32(vLastInput, vLastMin);
    vLastMax = vmax_f32(vLastInput, vLastMax);
    vst1_f32(temp_min + 4, vLastMin);
    vst1_f32(temp_max + 4, vLastMax);
}
The resulting code should be much faster. One can see on Godbolt that the number of instructions in this vectorized implementation is drastically smaller than in the reference implementation, without any conditional jump instructions.
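If the same source must also build for targets without NEON, one option (my addition, not part of the original answer) is to guard the intrinsics with the __ARM_NEON predefined macro and keep the scalar loop as a fallback:

#if defined(__ARM_NEON)
#include <arm_neon.h>
/* vectorized minmax_optim() from above goes here */
#else
/* Scalar fallback with the same interface */
void minmax_optim(float temp_min[6], float temp_max[6], float input[6]) {
    for (int i = 0; i < 6; i++) {
        if (input[i] < temp_min[i]) temp_min[i] = input[i];
        if (input[i] > temp_max[i]) temp_max[i] = input[i];
    }
}
#endif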

You nailed it -- you have to keep temporary max and min arrays. Unfortunately, if we're talking strictly C, this seems to be the only possible algorithm, and therefore also the most performant one.
Since you've mentioned it's going to run on an embedded system (but omitted which), please make sure you have hardware floating-point support. If there isn't any, that's going to carry a high performance penalty. If you have high-end hardware, you can look into the availability of vector instructions, but that is platform-specific, possibly requiring the use of assembly.

To my mind the approach as such cannot be substantially improved, as the input is not available as a whole. That being said, the inner comparisons can be compacted. The assignments
if (input[i] > temp_max[i]) {
    temp_max[i] = input[i];
}
if (input[i] < temp_min[i]) {
    temp_min[i] = input[i];
}
can be improved to
if (input[i] > temp_max[i]) {
    temp_max[i] = input[i];
}
else if (input[i] < temp_min[i]) {
    temp_min[i] = input[i];
}
because if the current value replaces the temporary maximum, it cannot also replace the temporary minimum (assuming some sensible initialization).
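For completeness, a minimal sketch of such an initialization (my assumption of what "sensible" means here: start the running extrema at the most extreme representable float values, once, before the first column arrives):

#include <float.h>

/* run once before processing the stream, so the first column always wins */
for (int i = 0; i < 6; i++) {
    temp_max[i] = -FLT_MAX;
    temp_min[i] =  FLT_MAX;
}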

This is only for the max, but it is easy to expand:
#include <stddef.h>

#define MAX(a,b,c) ((a) > (b) ? ((a) > (c) ? (a) : (c)) : ((b) > (c) ? (b) : (c)))

void rowmax(int *a, int *b, int *c, int *result, size_t size)
{
    for (size_t index = 0; index < size; index++)
    {
        result[index] = MAX(a[index], b[index], c[index]);
    }
}
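A hypothetical MIN counterpart, mirroring the MAX macro above:

#define MIN(a,b,c) ((a) < (b) ? ((a) < (c) ? (a) : (c)) : ((b) < (c) ? (b) : (c)))

void rowmin(int *a, int *b, int *c, int *result, size_t size)
{
    for (size_t index = 0; index < size; index++)
    {
        result[index] = MIN(a[index], b[index], c[index]);
    }
}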

Related

Cache-friendly copying of an array with readjustment by known index, gather, scatter

Suppose we have an array of data and another array with indexes.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data at position from index. Result should be
[4, 2, 5, 7, 3, 1]
The naive algorithm runs in O(N), but it performs random memory accesses.
Can you suggest a CPU cache-friendly algorithm with the same complexity?
PS: In my particular case, all elements in the data array are integers.
PPS: Arrays might contain millions of elements.
PPPS: I'm OK with SSE/AVX or any other x64-specific optimizations.
Combine index and data into a single array. Then use some cache-friendly sorting algorithm to sort these pairs (by index). Then get rid of the indexes. (You could combine merging/removing the indexes with the first/last pass of the sorting algorithm to optimize this a little bit.)
For cache-friendly O(N) sorting, use radix sort with a small enough radix (at most half the number of cache lines in the CPU cache).
Here is a C implementation of the radix-sort-like algorithm:
void reorder2(const unsigned size)
{
    const unsigned min_bucket = size / kRadix;
    const unsigned large_buckets = size % kRadix;
    g_counters[0] = 0;
    for (unsigned i = 1; i <= large_buckets; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket + 1;
    for (unsigned i = large_buckets + 1; i < kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket;
    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_index[i] % kRadix]++;
        g_sort[dst].index = g_index[i] / kRadix;
        g_sort[dst].value = g_input[i];
        __builtin_prefetch(&g_sort[dst + 1].value, 1);
    }
    g_counters[0] = 0;
    for (unsigned i = 1; i < (size + kRadix - 1) / kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + kRadix;
    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_sort[i].index]++;
        g_output[dst] = g_sort[i].value;
        __builtin_prefetch(&g_output[dst + 1], 1);
    }
}
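The globals used above are not shown in the answer; here is my guess at their declarations, based purely on how they are used (the kMaxSize and kRadix values are placeholders, not the ones used in the benchmark):

#define kMaxSize (1u << 22)   /* assumed maximum element count for this sketch */
#define kRadix   1021u        /* example non-power-of-2 radix; tune to the cache */

typedef struct { unsigned index; int value; } sort_pair;

static int       g_input[kMaxSize];                       /* the data[] array */
static unsigned  g_index[kMaxSize];                       /* the index[] array */
static sort_pair g_sort[kMaxSize];                        /* scratch pairs after pass 1 */
static int       g_output[kMaxSize];                      /* the reordered result */
static unsigned  g_counters[kMaxSize / kRadix + kRadix];  /* bucket offsets, both passes */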
It differs from radix sort in two aspects: (1) it does not do counting passes because all counters are known in advance; (2) it avoids using power-of-2 values for radix.
This C++ code was used for benchmarking (if you want to run it on a 32-bit system, slightly decrease the kMaxSize constant).
Here are the benchmark results (on a Haswell CPU with a 6 MB cache):
It is easy to see that small arrays (below ~2,000,000 elements) are cache-friendly even for the naive algorithm. You may also notice that the sorting approach starts to become cache-unfriendly at the last point on the diagram (with size/radix near 0.75 cache lines in the L3 cache). Between these limits the sorting approach is more efficient than the naive algorithm.
In theory (if we compare only the memory bandwidth needed by these algorithms, with 64-byte cache lines and 4-byte values) the sorting algorithm should be 3 times faster. In practice we get a much smaller difference, about 20%. This could be improved by using smaller 16-bit values for the data array (in that case the sorting algorithm is about 1.5 times faster).
One more problem with the sorting approach is its worst-case behavior when size/radix is close to some power of 2. This may either be ignored (because there are not that many "bad" sizes) or fixed by making the algorithm slightly more complicated.
If we increase the number of passes to 3, all 3 passes use mostly the L1 cache, but memory bandwidth is increased by 60%. I used this code to get experimental results. TL;DR: after determining (experimentally) the best radix value, I got somewhat better results for sizes greater than 4,000,000 (where the 2-pass algorithm uses the L3 cache for one pass) but somewhat worse results for smaller arrays (where the 2-pass algorithm uses the L2 cache for both passes). As might be expected, performance is better for 16-bit data.
Conclusion: the performance difference is much smaller than the difference in complexity of the algorithms, so the naive approach is almost always better; if performance is very important and only 2- or 4-byte values are used, the sorting approach is preferable.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data at position from index. Result should be
result -> [4, 2, 5, 7, 3, 1]
Single thread, one pass
I think, for a few million elements and on a single thread, the naive approach might be the best here.
Both data and index are accessed (read) sequentially, which is already optimal for the CPU cache. That leaves the random writes, but writing to memory isn't as cache-friendly as reading from it anyway.
This would only need one sequential pass through data and index. And chances are some (sometimes many) of the writes will already be cache-friendly too.
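A minimal sketch of that naive one-pass loop (assuming int elements, as in the question):

#include <stddef.h>

/* Sequential reads of index[] and data[], one random write per element. */
void reorder_naive(const int *data, const int *index, int *result, size_t n)
{
    for (size_t i = 0; i < n; i++)
        result[index[i]] = data[i];
}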
Using multiple blocks for result - multiple threads
We could allocate or use cache-friendly sized blocks for the result (blocks being regions in the result array), and loop through index and data multiple times (while they stay in the cache).
In each loop we then only write elements to result that fit in the current result-block. This would be 'cache friendly' for the writes too, but needs multiple loops (the number of loops could even get rather high - i.e. size of data / size of result-block).
The above might be an option when using multiple threads: data and index, being read-only, would be shared by all cores at some level in the cache (depending on the cache architecture). The result blocks in each thread would be totally independent (one core never has to wait for the result of another core, or for a write to the same region). For example: 10 million elements - each thread could be working on an independent result block of, say, 500,000 elements (the number should be a power of 2).
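A single-threaded sketch of that blocking idea (the function name, block_size parameter, and int types are my assumptions); each pass over the input writes only into one region of result, and in a multi-threaded variant each thread would simply own one such region:

#include <stddef.h>

/* One full scan of index[]/data[] per result block; writes stay inside [lo, hi). */
void reorder_blocked(const int *data, const int *index, int *result,
                     size_t n, size_t block_size)
{
    for (size_t lo = 0; lo < n; lo += block_size) {
        size_t hi = lo + block_size < n ? lo + block_size : n;
        for (size_t i = 0; i < n; i++) {
            size_t dst = (size_t)index[i];
            if (dst >= lo && dst < hi)
                result[dst] = data[i];
        }
    }
}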
Combining the values into pairs and sorting them first: this would already take much more time than the naive option (and wouldn't be that cache-friendly either).
Also, if there are only a few million elements (integers), it won't make much of a difference. If we were talking about billions, or data that doesn't fit in memory, other strategies might be preferable (like, for example, memory-mapping the result set if it doesn't fit in memory).
If your problem deals with a lot more data than you show here, the fastest way - and probably the most cache-friendly - would be to do a large and wide merge-sort operation.
So you would divide the input data into reasonable chunks, and have a separate thread operate on each chunk. The result of this operation would be two arrays much like the input (one of data and one of destination indexes), except that the indexes would be sorted. Then you would have a final thread do a merge operation on the data into the final output array.
As long as the segments are chosen well this should be quite a cache-friendly algorithm. By "chosen well" I mean that the data used by different threads maps onto different cache lines (of your chosen processor), so as to avoid cache thrashing.
If you have a lot of data and that is indeed the bottleneck, you will need to use a block-based algorithm where you read from and write to the same blocks as much as possible. It will take up to 2 passes over the data to ensure the new array is entirely populated, and the block size will need to be set appropriately. The pseudocode is below.
def populate(index, data, newArray, cache)
    blockSize = 1000
    for i = 0; i < size(index); i++
        // We cached this value earlier
        if i in cache
            newArray[i] = cache[i]
            remove(cache, i)
        else
            newIndex = index[i]
            newValue = data[i]
            // Check if this destination index falls in our current block
            if i / blockSize != newIndex / blockSize
                // This index is not in our current block, cache it
                cache[newIndex] = newValue
            else
                // This value is in our current block
                newArray[newIndex] = newValue

cache = {}
newArray = []
populate(index, data, newArray, cache)
populate(index, data, newArray, cache)
Analysis
The naive solution accesses the index and data arrays in order, but the new array is accessed in random order. Since the new array is randomly accessed, you essentially end up with O(N^2), where N is the number of blocks in the array.
The block-based solution does not jump from block to block. It reads the index, the data, and the new array all in sequence, reading and writing the same blocks. If an index belongs to another block, it is cached and either retrieved when the block it belongs to comes up, or, if that block has already been passed, it is retrieved in the second pass. A second pass will not hurt at all. This is O(N).
The only caveat is in dealing with the cache. There are a lot of opportunities to get creative here, but in general, if a lot of the reads and writes end up being on different blocks, the cache will grow, and this is not optimal. It depends on the makeup of your data, how often this occurs, and your cache implementation.
Let's imagine that all of the information inside the cache exists in one block and it fits in memory. And let's say the cache has y elements. The naive approach would have performed at least y random accesses. The block-based approach will get those in the second pass.
I notice your index completely covers the domain but is in random order.
If you were to sort the index array, but also apply the same reordering operations to the data array, the data array would become the result you are after.
There are plenty of sort algorithms to select from, and all would satisfy your cache-friendly criteria. But their complexity varies. I'd consider either quicksort or mergesort.
If you're interested in this answer I can elaborate with pseudo code.
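In the meantime, here is a rough sketch of the idea (not the answerer's code); it uses qsort for clarity, although a more cache-friendly sort could be substituted:

#include <stdlib.h>

typedef struct { int idx; int val; } idx_pair;

static int cmp_idx_pair(const void *a, const void *b)
{
    const idx_pair *pa = a, *pb = b;
    return (pa->idx > pb->idx) - (pa->idx < pb->idx);
}

/* Sort (index, data) pairs by index; afterwards pair i holds the value destined for position i. */
void reorder_by_sort(const int *data, const int *index, int *result, size_t n)
{
    idx_pair *pairs = malloc(n * sizeof *pairs);
    if (pairs == NULL) return;
    for (size_t i = 0; i < n; i++) {
        pairs[i].idx = index[i];
        pairs[i].val = data[i];
    }
    qsort(pairs, n, sizeof *pairs, cmp_idx_pair);
    for (size_t i = 0; i < n; i++)
        result[i] = pairs[i].val;
    free(pairs);
}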
I am concerned this may not be a winning pattern.
We had a piece of code which performed well, and we optimized it by removing a copy.
The result was that it performed poorly (due to cache issues). I can't see how you can produce a single-pass algorithm which solves the issue. Using OpenMP may allow the stalls this will cause to be shared amongst multiple threads.
I assume that the reordering happens only once and in the same way. If it happens multiple times, then creating some better strategy beforehand (by an appropriate sorting algorithm) will improve performance.
I wrote the following program to actually test whether a simple split of the target into N blocks helps, and my findings were:
a) even in the worst cases, the single-thread performance (using segmented writes) does not exceed the naive strategy, and is usually worse by at least a factor of 2;
b) however, the performance approaches parity for some subdivisions (this probably depends on the processor) and array sizes, thus indicating that it actually would improve the multi-core performance.
The consequence of this is: yes, it's more "cache-friendly" than not subdividing, but for a single thread (and only one reordering) this won't help you a bit.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

int main(void) {
    int N = 1 << 26;
    double *source = malloc(N * sizeof(double));
    double *target = malloc(N * sizeof(double));
    int *idx = malloc(N * sizeof(int));
    int i;
    for (i = 0; i < N; i++) {
        source[i] = i;
        target[i] = 0;
        idx[i] = rand() % N;
    }

    struct timeval now, then;

    /* Naive strategy: one pass, random writes */
    gettimeofday(&now, NULL);
    for (i = 0; i < N; i++) {
        target[idx[i]] = source[i];
    }
    gettimeofday(&then, NULL);
    printf("%f\n", (0.0 + then.tv_sec * 1e6 + then.tv_usec - now.tv_sec * 1e6 - now.tv_usec) / N);

    /* Split strategy: for each element, loop over the target blocks and
       write only when the element falls in block j */
    gettimeofday(&now, NULL);
    int j;
    int targetblocks;
    int M = 24;
    int targetblocksize = 1 << M;
    targetblocks = (N / targetblocksize);
    for (i = 0; i < N; i++) {
        for (j = 0; j < targetblocks; j++) {
            int k = idx[i];
            if ((k >> M) == j) {
                target[k] = source[i];
            }
        }
    }
    gettimeofday(&then, NULL);
    printf("%d,%f\n", targetblocks, (0.0 + then.tv_sec * 1e6 + then.tv_usec - now.tv_sec * 1e6 - now.tv_usec) / N);

    free(source);
    free(target);
    free(idx);
    return 0;
}

Run-time efficient transposition of a rectangular matrix of arbitrary size

I am pressed for time to optimize a large piece of C code for speed and I am looking for an algorithm (ideally a C "snippet") that transposes a rectangular source matrix u[r][c] of arbitrary size (r number of rows, c number of columns) into a target matrix v[s][d] (s = c number of rows, d = r number of columns) in a "cache-friendly", i.e. data-locality-respecting, way. The typical size of u is around 5000 ... 15000 rows by 50 to 500 columns, and it is clear that a row-wise access of elements is very cache-inefficient.
There are many discussions of this topic on the web (near this thread), but as far as I can see all of them discuss special cases like square matrices, u[r][r], or matrices defined as a one-dimensional array, e.g. u[r * c], not the above-mentioned "array of arrays" (of equal length) used in my context of Numerical Recipes (background see here).
I would be very thankful for any hint that helps spare me the "reinvention of the wheel".
Martin
I do not think that an array of arrays is much harder to transpose than a linear array in general. But if you are going to have 50 columns in each array, that sounds bad: it may not be enough to hide the overhead of pointer dereferencing.
I think that the overall strategy of a cache-friendly implementation is the same: process your matrix in tiles, and choose the tile size that performs best according to experiments.
/* Matrix, ProcessFullBlock and ProcessPartialBlock are defined in the full code linked below. */
template<int BLOCK>
void TransposeBlocked(Matrix &dst, const Matrix &src) {
    int r = dst.r, c = dst.c;
    assert(r == src.c && c == src.r);
    for (int i = 0; i < r; i += BLOCK)
        for (int j = 0; j < c; j += BLOCK) {
            if (i + BLOCK <= r && j + BLOCK <= c)
                ProcessFullBlock<BLOCK>(dst.data, src.data, i, j);
            else
                ProcessPartialBlock(dst.data, src.data, r, c, i, j, BLOCK);
        }
}
I have tried to optimize the best case, when r = 10000, c = 500 (with float type). On my local machine, 128 x 128 tiles give a speedup of about 2.5 times. Also, I tried to use SSE to accelerate the transposition, but it does not change the timings significantly. I think that's because the problem is memory bound.
Here are full timings (for 100 launches each) of various implementations on Core2 E4700 2.6GHz:
Trivial: 6.111 sec
Blocked(4): 8.370 sec
Blocked(16): 3.934 sec
Blocked(64): 2.604 sec
Blocked(128): 2.441 sec
Blocked(256): 2.266 sec
BlockedSSE(16): 4.158 sec
BlockedSSE(64): 2.604 sec
BlockedSSE(128): 2.245 sec
BlockedSSE(256): 2.036 sec
Here is the full code used.
So, I'm guessing you have an array of arrays of floats/doubles. This setup is already very bad for cache performance. The reason is that with a 1-dimensional array the compiler can output code that results in a prefetch operation and (in the case of a very new compiler) produce SIMD/vectorized code. With an array of pointers there's a dereference operation on each step, making prefetching more difficult. Not to mention there aren't any guarantees on memory alignment.
If this is for an assignment and you have no choice but to write the code from scratch, I'd recommend looking at how CBLAS does it (note that you'll still need your array to be "flattened"). Otherwise, you're much better off using a highly optimized BLAS implementation like
OpenBLAS. It's been optimized for nearly a decade and will produce the fastest code for your target processor (tuning for things like cache sizes and vector instruction set).
The tl;dr is that using an array of arrays will result in terrible performance no matter what. Flatten your arrays and make your code nice to read by using a #define to access elements of the array.
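For example, a flattened layout with a small access macro might look like this (the names and the example function are illustrative, not from the original code):

#include <stdlib.h>

/* Row-major access into a flat rows*cols array of doubles. */
#define U(i, j) u[(size_t)(i) * cols + (j)]

void example(size_t rows, size_t cols)
{
    double *u = malloc(rows * cols * sizeof *u);
    if (u == NULL) return;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            U(i, j) = 0.0;     /* contiguous, prefetch-friendly traversal */
    free(u);
}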

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the Poisson equation.
Essentially, I have a two-dimensional array, of which I require n doubles in the first row, n/2 in the second, n/4 in the third, and so on...
Now my difficulty is deciding whether or not to use a contiguous 2D array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, and more memory-efficient, method would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage, but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions that we have to grapple with in Computer Science. But you can only look at them once you have numbers. Otherwise you are just guessing, and a fair amount of the time our intuition when it comes to optimizations is not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1D contiguous array. You will need a function that returns the start of a given row within the array. Something like this:
int *Get(int *array, int n, int row)   // might contain logical errors
{
    int pos = 0;
    while (row--)
    {
        pos += n;
        n /= 2;
    }
    return array + pos;
}
Where n is the same n you described, and it is halved (rounding down) on every iteration.
You will have to call this function only once per entire row.
This function will never take more than O(log n) time, but if you want you can replace it with a single closed-form expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
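Assuming n is a power of two (so the repeated halving is exact) and row does not exceed the number of non-empty rows, the loop collapses to a closed form of that geometric series; a sketch:

/* offset(row) = n + n/2 + ... + n/2^(row-1) = 2*n - n/2^(row-1) */
int *GetClosedForm(int *array, int n, int row)
{
    int pos = (row == 0) ? 0 : 2 * n - (n >> (row - 1));
    return array + pos;
}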
You could use a single array and just calculate your offset yourself
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;
        n >>= 1;   /* each row holds half as many entries as the previous one */
    }
    return offset;
}

double *array = calloc(get_offset(n, 64, 0), sizeof(double));
access via
array[get_offset(n, row, column)]

Runtime of Initializing an Array Zero-Filled

If I were to define the following array using the zero-fill initialization syntax on the stack:
int arr[ 10 ] = { 0 };
... is the runtime constant or linear?
My assumption is that the runtime is linear -- based only on the fact that calloc must go over every byte to zero-fill it.
If you could also provide the why, and not just "it's order xxx", that would be tremendous!
The runtime is linear in the array size.
To see why, here's a sample implementation of memset, which initializes an array to an arbitrary value. At the assembly-language level, this is no different than what goes on in your code.
void *memset(void *dst, int val, size_t count) {
    unsigned char *start = dst;
    for (size_t i = 0; i < count; i++)
        *start++ = (unsigned char)val;
    return dst;
}
Of course, compilers will often use intrinsics to set multiple array elements at a time. Depending on the size of the array and things like alignment and padding, this might make the runtime over array length more like a staircase, with the step size based on the vector length. Over small differences in array size, this would effectively make the runtime constant, but the general pattern is still linear.
This is actually a tip-of-the-iceberg question. What you are really asking is: what is the order (Big O) of initializing an array? Essentially, the code loops through each element of the array and sets it to zero. You could write a for loop to do the same thing.
The order of magnitude of that loop is O(n); that is, the time spent in the loop increases in proportion to the number of elements being initialized.
If the hardware supported an instruction that sets all bytes from location X to Y to zero, and that instruction took M instruction cycles where M never changed regardless of the number of bytes being set, then that would be of constant order, O(1).
In general, O(1) is referred to as constant time and O(n) as linear.

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data, J, y);
    /* Use J */
}

static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter = 0; iter < 151; iter++) {
        for (jter = 0; jter < 151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        P[iter-1][jter-1] = -gamma * (data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        Jv_scratch += J[iter][jter] * Ith(v, jter);
    }
    Ith(Jv, iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset(J, 0, sizeof(double) * 151 * 151);
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex index calculation to locate elements of both P and J.
You may well get better performance by stepping through the rows with pointers:

for (iter = 1; iter < 151; iter++)
{
    double *pP = &P[iter - 1][0];
    const double *pJ = &data->J[iter][1];
    for (jter = 1; jter < 151; jter++, pP++, pJ++)
    {
        *pP = -gamma * *pJ;
    }
}

This way you move much of the array index calculation outside of the inner loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    memset((void*)J, 0, sizeof(double) * 151 * 151);
    /* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which causes thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, i.e.:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically, for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack, no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
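A minimal sketch of that dynamic allocation (sizes taken from the question; one possible variant, not from the original answer, is to use calloc(), which already returns zero-filled memory, so no separate zeroing loop or memset() is needed):

#include <stdlib.h>

void spam_heap(void)
{
    /* Heap-allocate a zero-filled 151x151 matrix instead of a stack array. */
    double (*J)[151] = calloc(151, sizeof *J);   /* sizeof *J == 151 * sizeof(double) */
    if (J == NULL)
        return;                                  /* allocation failed */
    /* calcJac(data, J, y);  ... use J as before ... */
    free(J);
}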
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization tricks such as counting down towards zero rather than up, or using ++i rather than i++, and so on. But the compiler really should be able to handle such things for you.
As for the matrix multiplication, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really need high accuracy, I'd consider using float or int to speed up the algorithm.
