Pointer Math with Complex Array

I have this snippet of code with some pointer math that I'm having trouble understanding:
#include <stdlib.h>
#include <complex.h>
#include <fftw3.h>
int main(void)
{
int i, j, k;
int N, N2;
fftwf_complex *box;
fftwf_plan plan;
float *smoothed_box;
// Allocate memory for arrays (Ns are set elsewhere and properly,
// I've just left it out for clarity)
box = (fftwf_complex *)fftwf_malloc(N * sizeof(fftwf_complex));
smoothed_box = (float *)malloc(N2 * sizeof(float));
// Create complex data and fill box with it. Do FFT. Box has the
// Hermitian symmetry that complex data has when doing FFTs with
// real data
plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
FFTW_ESTIMATE);
...
// end fft
// Now do the loop I don't understand
for(i = 0; i < N2; i++)
{
for(j = 0; j < N2; j++)
{
for(k = 0; k < N2; k++)
{
smoothed_box[R_INDEX(i,j,k)] = *((float *)box +
R_FFT_INDEX(i*f + 0.5, j*f + 0.5, k*f +0.5))/V;
}
}
}
// Do other stuff
...
return 0;
}
Where f and V are just some numbers that are set elsewhere in the code and don't matter for this particular question. Additionally, the functions R_FFT_INDEX and R_INDEX don't really matter, either. What's important is that, for the first loop iteration, when i=j=k=0, R_INDEX = 0 and R_FFT_INDEX = 45. smoothed_box has 8 elements and box has 320.
So, in gdb, when I print smoothed_box[0] after the loop, I get smoothed_box[0] = some number. Now, I understand that, for an array of normal types, say floats, array + integer will give array[integer], assuming that integer is within the bounds of the array.
However, fftwf_complex is defined as typedef float fftwf_complex[2], as you need to hold both the real and imaginary parts of the complex number. It's also being cast to a float * from a fftwf_complex *, and I'm unsure what this does, given the typedef.
All I know is that when I print box[45] in gdb, I get box[45] = some complex number that is not smoothed_box[0] * V. Even when I print *((float *)box + 45)/V, I get a different number than smoothed_box[0].
So, I was just wondering if anyone could explain to me the pointer math that is being done in the above loop? Thank you, and I appreciate your time!

box is allocated as an array of N fftwf_complex. Then a backward 3D c2r FFTW transform of size N,N,N is performed on box, which requires N*N*(N/2+1) fftwf_complex. See http://www.fftw.org/fftw3_doc/Real_002ddata-DFT-Array-Format.html#Real_002ddata-DFT-Array-Format Therefore, this code might trigger undefined behavior, such as a segmentation fault, before even reaching the pointer arithmetic.
It is practical to cast box back to an array of float because the DFT is performed in place. Indeed, box is used twice when the fftwf_plan is created: it is both the input array of complex values and the output array of reals:
plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
FFTW_ESTIMATE);
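To make the cast itself concrete, here is a minimal sketch that does not need FFTW. The type complex_pair is a hypothetical stand-in for the float[2] typedef described in the question, and flat_at / pair_at are illustrative helper names, not FFTW functions. It shows that *((float *)box + n) is simply the n-th float of the buffer, i.e. component n % 2 of complex element n / 2:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the float[2] typedef from the question:
   element [0] is the real part, element [1] the imaginary part. */
typedef float complex_pair[2];

/* Read the n-th float of the buffer, exactly like *((float *)box + n). */
float flat_at(complex_pair *box, size_t n)
{
    assert(box != NULL);
    return *((float *)box + n);
}

/* The same memory location addressed through the complex view. */
float pair_at(complex_pair *box, size_t n)
{
    assert(box != NULL);
    return box[n / 2][n % 2];
}
```

So before the transform, offset 45 is the imaginary slot of box[22]. After an in-place c2r transform the same memory holds real output values, which is why the loop reads it through a float pointer.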
Once fftwf_execute(plan); is called, box is better seen as an array of reals. Nevertheless, this array has size N*N*2*(N/2+1), and the items at positions i,j,k with k>N-1 are meaningless padding. See FFTW's Real-data DFT Array Format:
For an in-place transform, some complications arise since the complex data is slightly larger than the real data. In this case, the final dimension of the real data must be padded with extra values to accommodate the size of the complex data: two extra if the last dimension is even and one if it is odd. That is, the last dimension of the real data must physically contain 2 * (n_{d-1}/2+1) double values (exactly enough to hold the complex data). This physical array size does not, however, change the logical array size: only n_{d-1} values are actually stored in the last dimension, and n_{d-1} is the last dimension passed to the planner.
This is the reason why the real array smoothed_box is introduced, though an N*N*N array would be expected. If smoothed_box were an array of size N*N*N, then the following conversion could have been performed:
for(i=0;i<N;i++){
for(j=0;j<N;j++){
for(k=0;k<N;k++){
smoothed_box[(i*N+j)*N+k]=((float *)box)[(i*N+j)*(2*(N/2+1))+k];
}
}
}
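The index arithmetic of that conversion can be isolated in a few small helpers. This is only a sketch of the padded layout quoted above. The names padded_row_len, padded_index and unpad are made up for illustration, and the code assumes a cubic N*N*N logical size as in the question:

```c
#include <assert.h>
#include <stddef.h>

/* Floats physically stored per row of the last dimension after an
   in-place c2r transform of logical length n: 2*(n/2+1). */
size_t padded_row_len(size_t n)
{
    return 2 * (n / 2 + 1);
}

/* Offset of logical element (i,j,k) inside the padded real array. */
size_t padded_index(size_t n, size_t i, size_t j, size_t k)
{
    assert(i < n && j < n && k < n);
    return (i * n + j) * padded_row_len(n) + k;
}

/* Copy the n*n*n meaningful values into a dense array, skipping padding. */
void unpad(const float *padded, float *dense, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                dense[(i * n + j) * n + k] = padded[padded_index(n, i, j, k)];
}
```

For N = 4 each row holds 6 floats, of which only the first 4 carry data, so element (0,1,0) sits at offset 6 rather than 4.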

Related

Best approach to FIFO implementation in a kernel OpenCL

Goal: Implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array and temp array and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would be really helpful here.)
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, the values are passed into the array (temp array), which is the same size as the coefficient array. Every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel and the products at each index are then accumulated into one single element. This continues until the input array reaches its final element.
What happens with my code?
For 60 elements from the input, it takes over 8000 ms!! and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a way better solution to do what I am attempting. Here is my code below.
Here are some things that I know for sure are wrong with the code. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is simply multiply the two arrays in parallel.
I tried to figure out why it was taking so long to do the FIFO function, so I started commenting lines out. I first started by commenting everything except the first for loop of the FIFO function. As a result this took 50 ms. Then when I uncommented the next loop, it jumped to 8000ms. So the delay would have to do with the transfer of data.
Is there a register shift that I could use in OpenCL? Perhaps some logical shifting method for integer arrays? (I know there is a '>>' operator.)
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];
float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
//take array of 58 elements (or same size as number of coefficients)
//shift all elements to the right one
//bring next element into index 0 from input
//multiply the coefficient array with the array that's the same size as the coefficients and accumulate
//store into one output value of the output array
//repeat till input array has reached the end
int globalId = get_global_id(0);
float output = 0.0f;
//Shift everything down from 1 to 57
//takes about 50ms here
for(int i=1; i<58; i++){
tempArrayForShift[i] = temp[i];
}
//Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
tempArrayForShift[0] = inputValue;
//Takes about 8000ms with this loop included
//Write values back into temp array
for(int i=0; i<58; i++){
temp[i] = tempArrayForShift[i];
}
//all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
//I am 100% sure this line is crashing the program.
//multipliedResult[globalId] = coefficients[globalId] * temp[globalId];
//Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
for (int i = 0; i < 58; i ++) {
// output = multipliedResult[i] + output;
}
//Returned summed value of temp array
return output;
}
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
//Initialize the temporary array values to 0
for (int i = 0; i < 58; i ++) {
temp[i] = 0;
tempArrayForShift[i] = 0;
multipliedResult[i] = 0;
}
//fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
for (int i = 0; i < 60; i ++) {
Output[i] = fifo(Array[i], coefficients, 58);
}
}
I have had this problem with OpenCL for a long time. I am not sure how to implement parallel and sequential instructions together.
Another alternative I was thinking about
In the main cpp file, I was thinking of implementing the fifo buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would this be the better solution? Or would it just be completely inefficient.

To get good performance out of a GPU, you need to parallelize your work across many threads. In your code you are using just a single thread; a GPU is very slow per thread but can be very fast when many threads run at the same time. In this case you can use one thread for each output value. You do not actually need to shift values through an array: for every output value a window of 58 input values is considered, so you can just grab these values from memory, multiply them by the coefficients and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
int globalId = get_global_id(0);
float sum=0.0f;
for (int i=0; i< 58; i++)
{
float tmp=0;
if (globalId+i > 56)
{
tmp=Array[i+globalId-57]*coefficients[57-i];
}
sum += tmp;
}
Output[globalId]=sum;
}
This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, as the values are reused several times. The operation you are trying to perform is called a 1D convolution. NVIDIA has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You can adapt their convolutionRows kernel for a 1D convolution.
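When porting something like this to a GPU, it helps to have a plain host-side reference to compare the kernel output against. The following is a sketch in C of the same windowed sum as the simple kernel above, with the window length passed in instead of hard-coded to 58. The function name fir_reference and its parameters are assumptions for illustration only:

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for the per-output-value kernel:
   out[n] = sum over i of in[n + i - (taps-1)] * coeff[taps-1-i],
   skipping terms whose input index would fall before the start of in. */
void fir_reference(const float *in, const float *coeff, float *out,
                   size_t n_out, size_t taps)
{
    assert(taps > 0);
    for (size_t n = 0; n < n_out; n++) {
        float sum = 0.0f;
        for (size_t i = 0; i < taps; i++) {
            if (n + i >= taps - 1)  /* same guard as globalId+i > 56 for taps = 58 */
                sum += in[n + i - (taps - 1)] * coeff[taps - 1 - i];
        }
        out[n] = sum;
    }
}
```

Each output only depends on the inputs, never on a previous output, which is what makes the one-thread-per-output mapping possible.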

Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.
The steps:
1. init local values to 0
2. copy coefficients to local variable
looping over the output elements to compute:
3. shift existing elements (work items > 0 only)
4. copy new element (work item 0 only)
5. compute dot product
5a. multiplication - one per work item
5b. reduction loop to compute sum
6. copy dot product to output (WI 0 only)
7. final barrier
The code:
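//outputSize (the number of output values) is passed in as an extra argument here
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums, const int outputSize){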
int globalId = get_global_id(0);
int localId = get_local_id(0);
int localSize = get_local_size(0);
//1 init local values to 0
localArray[localId] = 0.0f;
//2 copy coefficients to local
//don't bother with this if __constant is working for you
//requires another local to be passed in: localCoeff
//localCoeff[localId] = coefficients[localId];
//barrier for both steps 1 and 2
barrier(CLK_LOCAL_MEM_FENCE);
float tmp = 0.0f;
for(int i = 0; i< outputSize; i++)
{
//3 shift elements (+barrier)
if(localId > 0){
tmp = localArray[localId -1];
}
barrier(CLK_LOCAL_MEM_FENCE);
localArray[localId] = tmp;
//4 copy new element (work item 0 only, + barrier)
if(localId == 0){
localArray[0] = Array[i];
}
barrier(CLK_LOCAL_MEM_FENCE);
//5 compute dot product
//5a multiply + barrier
localSums[localId] = localArray[localId] * coefficients[localId];
barrier(CLK_LOCAL_MEM_FENCE);
//5b reduction loop + barrier
for(int j = 1; j < localSize; j <<= 1) {
int mask = (j << 1) - 1;
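//the bound check keeps the read inside localSums when localSize is not a power of 2
if ((localId & mask) == 0 && localId + j < localSize) {
localSums[localId] += localSums[localId + j];
}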
barrier(CLK_LOCAL_MEM_FENCE);
}
//6 copy dot product (WI 0 only)
if(localId == 0){
Output[i] = localSums[0];
}
//7 barrier
//only needed if there is more code after the loop.
//the barrier in #3 covers this in the case where the loop continues
//barrier(CLK_LOCAL_MEM_FENCE);
}
}
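The reduction in step 5b can be checked on the host by running the same loop serially, with an inner loop standing in for the work items. This is a sketch only; tree_reduce is a made-up name, and the id + j < size guard is an addition assumed here so that sizes that are not a power of two (such as 65) do not read past the end of the array:

```c
#include <assert.h>
#include <stddef.h>

/* Serial simulation of the step-5b tree reduction. The loop over id plays
   the role of the work items; one pass of the outer loop corresponds to
   one barrier-separated round in the kernel. */
float tree_reduce(float *sums, size_t size)
{
    assert(size > 0);
    for (size_t j = 1; j < size; j <<= 1) {
        size_t mask = (j << 1) - 1;
        for (size_t id = 0; id < size; id++) {
            /* Only ids that are multiples of 2*j accumulate in this round. */
            if ((id & mask) == 0 && id + j < size)
                sums[id] += sums[id + j];
        }
    }
    return sums[0];
}
```

Within one round, the slots that are written (multiples of 2*j) are never the slots that are read (odd multiples of j), which is why a single barrier per round is enough in the kernel.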
What about more work groups?
This is slightly simplified to allow a single 1x65 work group to compute the entire 1.2M Output. To allow multiple work groups, you could use outputSize / get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:
for (i = workAmount * get_group_id(0); i < workAmount * (get_group_id(0)+1); i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
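//1 init local values
int groupId = get_group_id(0);
if(groupId == 0){
localArray[localId] = 0.0f;
}else{
//preload the inputs just before this group's first output, newest first
//(assumes workAmount >= localSize so the index cannot go negative)
localArray[localId] = Array[workAmount * groupId - 1 - localId];
}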
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands though. Play around with this, sometimes what seems optimal on a high-level will be detrimental to the kernel when it's running.
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way you could implement this with fewer barriers, but I still feel this is optimal because of the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2, or a multiple of 64. This won't be a huge issue though, because there are a lot of barriers in the kernel which makes them all wait fairly regularly.

Best solution to represent Data[i,j] in c?

There is some pseudocode that I want to implement in C, but I am in doubt about how to implement part of it. The pseudocode is:
for every pair of states qi, and qj, i<j, do
D[i,j] := 0
S[i,j] := notzero
end for
i and j, in qi and qj are subscripts.
How do I represent D[i,j] or S[i,j]? Which data structure should I use so that it's simple and fast?

You can use something like
int length= 10;
int i =0, j= 0;
int res1[10][10] = {0, }; //index is based on "length" value
int res2[10][10] = {0, }; //index is based on "length" value
and then
for (i =0; i < length; i++)
{
for (j =0; j < length; j++)
{
res1[i][j] = 0;
res2[i][j] = 1;//notzero
}
}
Here D[i,j] and S[i,j] are represented by res1[i][j] and res2[i][j], respectively. These are called two-dimensional arrays.
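The pseudocode only touches pairs with i < j, so the inner loop can start at i + 1 instead of 0. A minimal sketch, assuming 10 states and representing "notzero" as true (both are assumptions, as are the names NUM_STATES, mark_pair and init_pairs):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_STATES 10  /* hypothetical number of states */

int  D[NUM_STATES][NUM_STATES];
bool S[NUM_STATES][NUM_STATES];

/* Store the values for the pair (q_i, q_j); only i < j is meaningful. */
void mark_pair(int i, int j, int d, bool s)
{
    assert(0 <= i && i < j && j < NUM_STATES);
    D[i][j] = d;
    S[i][j] = s;
}

/* "for every pair of states q_i and q_j, i < j": upper triangle only. */
void init_pairs(void)
{
    for (int i = 0; i < NUM_STATES; i++)
        for (int j = i + 1; j < NUM_STATES; j++)
            mark_pair(i, j, 0, true);
}
```

The lower triangle and the diagonal are simply left unused, which wastes a little memory but keeps the indexing as direct as D[i][j].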

I guess a struct will be your friend here, depending on what you actually want to work with.
A struct would be fine if, say, a pair of states forms some kind of entity.
Otherwise you could use a two-dimensional array.

After accept answer.
Depending on coding goals and platform, to get "simple and fast", using a pointer to pointer to a number may be faster than a 2-D array in C.
// 2-D array
double x[MAX_ROW][MAX_COL];
// Code computes the address in `x`, often involving a i*MAX_COL, if not in a loop.
// Slower when multiplication is expensive and random array access occurs.
x[i][j] = f();
// pointer to pointer of double
double **y = calloc(MAX_ROW, sizeof *y);
for (i=0; i<MAX_ROW; i++) y[i] = calloc(MAX_COL, sizeof *(y[i]));
// Code computes the address in `y` by a lookup of y[i]
y[i][j] = f();
Flexibility
The first data type is easy to pass around, as in print(x), when the array size is fixed, but becomes challenging otherwise.
The 2nd data type is easy to pass around, as in print(y, rows, columns), when the array size is variable, and of course works well with a fixed size too.
The 2nd data type also allows row swapping simply by swapping pointers.
So if code is using a fixed array size, use double x[MAX_ROW][MAX_COL], otherwise recommend double **y. YMMV
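For the pointer-to-pointer variant, allocation, cleanup and the row swap might be wrapped like this. It is a sketch under the assumption that every row has the same length; matrix_alloc, matrix_free and matrix_swap_rows are illustrative names, not a standard API:

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate rows x cols doubles as an array of row pointers, zero-filled.
   Returns NULL (with nothing leaked) if any allocation fails. */
double **matrix_alloc(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    double **m = calloc(rows, sizeof *m);
    if (m == NULL)
        return NULL;
    for (size_t i = 0; i < rows; i++) {
        m[i] = calloc(cols, sizeof *m[i]);
        if (m[i] == NULL) {
            while (i > 0)
                free(m[--i]);   /* release the rows allocated so far */
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Free every row, then the array of row pointers. */
void matrix_free(double **m, size_t rows)
{
    if (m == NULL)
        return;
    for (size_t i = 0; i < rows; i++)
        free(m[i]);
    free(m);
}

/* Swap two rows in O(1) by exchanging their pointers. */
void matrix_swap_rows(double **m, size_t a, size_t b)
{
    double *tmp = m[a];
    m[a] = m[b];
    m[b] = tmp;
}
```

The price of this flexibility is one allocation per row and an extra pointer lookup on every access.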

C extract an array from a matrix using pointers

I wrote a code and I have some data stored in a 2d matrix:
double y[LENGTH][2];
I have a function that take as input a 1D array:
double function(double* data)
I am interested in passing the data stored in the first column of this matrix to this function. How can I do that using pointers?
My function is something like this (where data is an array of double containing LENGTH elements, i.e. double data[LENGTH];):
double function(double* data){
double result=0;
for(int i=0; i<LENGTH; i++){
result+=data[i];
}
return result;
}
And I want to pass to this function a row of a matrix as data input.
Thanks to everyone in advance!

If you pass a pointer to the first element of your 2D matrix, you can access it as a 1D array, since the elements are stored contiguously:
double y[LENGTH][2];
x = function(y[0]);
...
double function(double* p) {
int ii;
double sum=0;
for(ii=0; ii<2*LENGTH; ii++) sum += p[ii];
return sum;
}
Note that in this case the order of accessing the elements is
y[0][0]
y[0][1]
y[1][0]
y[1][1]
y[2][0]
... etc
update - you just clarified your question a little bit. If you want to access just one column of data, you need to skip through the array. This means you need to know the size of the second dimension. I would recommend something like this:
double function(double* p, int D2) {
int ii;
double sum=0;
for(ii=0; ii<D2*LENGTH; ii+=D2) sum += p[ii];
return sum;
}
And you would call it with
x = function(&y[0][colNum], numCols);
Now we start at a certain location, then, skip forward D2 elements to access the next element in the column.
I have to say that this is rather ugly - this is not really how C is intended to be used. I would recommend wrapping things into a class that handles these things for you cleanly - in other words, switch to C++ (although it's possible to write pure C functions that "hide" some of this complexity). You could of course copy the data to another memory block to make it contiguous, but that's usually considered a last resort.
Be careful that you don't end up with code that is unreadable / unmaintainable...
further update
Per your comment, the above is still not what you wanted. Then I recommend the following:
double *colPointer(double *p, int rowCount, int colCount) {
double *cp;
int ii;
cp = malloc(rowCount * sizeof *cp);
for(ii=0; ii<rowCount; ii++) cp[ii] = *(p + ii * colCount);
return cp;
}
This will return a pointer to a newly created copy of the column. You call it with
double *cc;
cc = colPointer(&y[0][colNum], LENGTH, 2);
answer = function(cc);
And now you can use cc in the way you wanted. If you have to do this many times you might be better off transposing the entire array just once - that way you can pass a pointer to a row of the transpose and achieve your result. You can adapt the code above to generate such a transpose.
Note that there is a risk of memory leaks if you don't clean up after yourself with this method.

the question is what you consider to be the row dimension.
usually the first one is rows and the second one cols.
that means that your double y[LENGTH][2]; is a matrix with LENGTH rows and 2 cols.
if that is also your interpretation then the answer to your question is "you can't", since the memory is laid out like this:
r0c0 r0c1 r1c0 r1c1 r2c0 r2c1 ...
you can retrieve a pointer to a row but not to a column.
matrix classes are usually designed so that the row and column step lengths are stored; by carefully setting them you can build sub-matrices on a big data chunk.
you may want to look at the OpenCV matrix implementation if you plan to perform more complex tasks.
if you can change the implementation of the function you want to call, you can change it to accept the row step (the number of columns), so that it does not just increment the pointer by one to reach the next element but increments it by the row step.
as an alternative there is the obvious way: copy the required column to a new array.
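The row-step idea can be sketched as a summing function that takes a pointer to the first element of interest, an element count, and a stride. The name sum_strided and its signature are assumptions for illustration; the original function in the question takes only a pointer:

```c
#include <assert.h>
#include <stddef.h>

/* Sum count elements starting at first, stepping stride elements each time.
   stride = 1 walks along a row; stride = number of columns walks down a column. */
double sum_strided(const double *first, size_t count, size_t stride)
{
    assert(first != NULL && stride > 0);
    double result = 0.0;
    for (size_t i = 0; i < count; i++)
        result += first[i * stride];
    return result;
}
```

For double y[LENGTH][2], column c would be summed with sum_strided(&y[0][c], LENGTH, 2) and row r with sum_strided(y[r], 2, 1), so no copy is needed.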
edit:
fixed stupid error on memory layout diagram

Remove 1000Hz tone from FFT array in C

I have an array of doubles which is the result of the FFT applied to an array that contains the audio data of a WAV audio file to which I have added a 1000Hz tone.
I obtained this array through the DREALFT routine defined in "Numerical Recipes" (I must use it).
(The original array has a length that is a power of two.)
My array has this structure:
array[0] = first real valued component of the complex transform
array[1] = last real valued component of the complex transform
array[2] = real part of the second element
array[3] = imaginary part of the second element
etc......
Now, I know that this array represents the frequency domain.
I want to determine and kill the 1000Hz frequency.
I have tried this formula for finding the index of the array which should contain the 1000Hz frequency:
index = 1000. * NElements /44100;
Also, since I assume that this index refers to an array with real values only, i have determined the correct(?) position in my array, that contains imaginary values too:
int correctIndex=2;
for(k=0;k<index;k++){
correctIndex+=2;
}
(I know there is surely an easier way, but it is the first that came to mind.)
Then I find this value: 16275892957.123705, which I suppose to be the real part of the 1000Hz frequency. (Sorry if this is an imprecise statement, but at the moment I do not care to know more about it.)
So i have tried to suppress it:
array[index]=-copy[index]*0.1f;
I don't know exactly why i used this formula but is the only one that gives some results, in fact the 1000hz tone appears to decrease slightly.
This is the part of the code in question:
double *copy = malloc( nCampioni * sizeof(double));
int nSamples;
/*...Fill copy with audio data...*/
/*...Apply ZERO PADDING and reach the length of 8388608 samples,
or rather 8388608 double values...*/
/*Apply the FFT (Sure this works)*/
drealft(copy - 1, nSamples, 1);
/*I determine the REAL(?) array index*/
i= 1000. * nSamples /44100;
/*I determine MINE(?) array index*/
int j=2;
for(k=0;k<i;k++){
j+=2;
}
/*I reduce the array value, AND some other values around it as an attempt*/
for(i=-12;i<12;i+=2){
copy[j-i]=-copy[i-j]*0.1f;
printf("%d\n",j-i);
}
/*Apply the inverse FFT*/
drealft(copy - 1, nSamples, -1);
/*...Write the audio data on the file...*/
NOTE: for simplicity I omitted the part where I get an array of double from an array of int16_t
How can i determine and totally kill the 1000Hz frequency?
Thank you!

As Oli Charlesworth writes, because your target frequency is not exactly one of the FFT bins (your index, TargetFrequency * NumberOfElements / SamplingRate, is not exactly an integer), the energy of the target frequency will be spread across all bins. For a start, you can eliminate some of the frequency by zeroing the bin closest to the target frequency. This will of course affect other frequencies somewhat too, since it is slightly off target. To better suppress the target frequency, you will need to consider a more sophisticated filter.
However, for educational purposes: To suppress the frequency corresponding to a bin, simply set that bin to zero. You must set both the real and the imaginary components of the bin to zero, which you can do with:
copy[index*2 + 0] = 0;
copy[index*2 + 1] = 0;
Some notes about this:
You had this code to calculate the position in the array:
int correctIndex = 2;
for (k = 0; k < index; k++) {
correctIndex += 2;
}
That is equivalent to:
correctIndex = 2*(index+1);
I believe you want 2*index, not 2*(index+1). So you were likely reducing the wrong bin.
At one point in your question, you wrote array[index] = -copy[index]*0.1f;. I do not know what array is. You appeared to be working in place in copy. I also do not know why you multiplied by 1/10. If you want to eliminate a frequency, just set it to zero. Multiplying it by 1/10 only reduces it to 10% of its original magnitude.
I understand that you must pass copy-1 to drealft because the Numerical Recipes code uses one-based indexing. However, the C standard does not support the way you are doing it. The behavior of the expression copy-1 is not defined by the standard. It will work in most C implementations. However, to write supported portable code, you should do this instead:
// Allocate one extra element.
double *memory = malloc((nCampioni+1) * sizeof *memory);
// Make a pointer that is convenient for your work.
double *copy = memory+1;
…
// Pass the necessary base address to drealft.
drealft(memory, nSamples, 1);
// Suppress a frequency.
copy[index*2 + 0] = 0;
copy[index*2 + 1] = 0;
…
// Free the memory.
free(memory);
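The bin lookup and the zeroing can also be pulled into two small helpers. This is a sketch, assuming the packed layout described in the question (real part of bin k at position 2*k, imaginary part at 2*k+1, for 0 < k < N/2, in a zero-based array). The names nearest_bin and zero_bin are made up here and are not part of Numerical Recipes:

```c
#include <assert.h>
#include <stddef.h>

/* Index of the FFT bin closest to freq_hz, i.e. freq * N / rate rounded
   to the nearest integer. */
size_t nearest_bin(double freq_hz, double sample_rate_hz, size_t n_samples)
{
    assert(freq_hz >= 0.0 && freq_hz <= sample_rate_hz / 2.0);
    return (size_t)(freq_hz * (double)n_samples / sample_rate_hz + 0.5);
}

/* Zero the real and imaginary parts of one bin in the packed spectrum.
   Bins 0 and N/2 are stored differently (slots 0 and 1) and are rejected. */
void zero_bin(double *spectrum, size_t n_samples, size_t bin)
{
    assert(bin > 0 && bin < n_samples / 2);
    spectrum[2 * bin]     = 0.0;
    spectrum[2 * bin + 1] = 0.0;
}
```

With the numbers from the question (1000 Hz, 44100 Hz sampling rate, 8388608 samples) the exact position is about 190217.87, so the nearest bin is 190218 and the tone does not fall exactly on a bin.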
One experiment I suggest you consider is to initialize an array with just a sine wave at the desired frequency:
for (i = 0; i < nSamples; ++i)
copy[i] = sin(TwoPi * Frequency / SampleRate * i);
(TwoPi is of course 2*3.1415926535897932384626433.) Then apply drealft and look at the results. You will see that much of the energy is at a peak in the closest bin to the target frequency, but much of it has also spread to other bins. Clearly, zeroing a single bin and performing the inverse FFT cannot eliminate all of the frequency. Also, you should see that the peak is in the same bin you calculated for index. If it is not, something is wrong.

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
double J[151][151];
/* Other relevant variables declared */
calcJac(data,J,y);
/* Use J */
}
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
/* The first expensive loop */
int iter, jter;
for (iter=0; iter<151; iter++) {
for (jter = 0; jter<151; jter++) {
J[iter][jter] = 0;
}
}
/* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
for(jter = 1; jter<151; jter++){
P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
}
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
for (jter=1; jter<151; jter++) {
Jv_scratch += J[iter][jter]*Ith(v,jter);
}
Ith(Jv,iter) = Jv_scratch;
Jv_scratch = 0;
}

1) No, they're not. You can memset the array as follows (memset is declared in <string.h>):
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the positions in P and in J.
You may well get better performance by stepping through with pointers:
for (iter = 1; iter<151; iter++)
{
double* pP = &P[iter - 1][0];
double* pJ = &data->J[iter][1];
for(jter = 1; jter<151; jter++, pP++, pJ++ )
{
*pP = - gamma * *pJ;
}
}
This way you move much of the array index calculation outside the inner loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.

First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
memset((void*)J, 0, sizeof(double) * 151 * 151);
/* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
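One detail worth spelling out: inside calcJac the parameter J is really a pointer to the first row, so sizeof J would give the size of a pointer, not of the matrix. A small sketch, with zero_matrix and count_nonzero as made-up helper names and DIM standing in for the hard-coded 151:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIM 151

/* J decays to a pointer to its first row, so the byte count is written
   as "number of rows times size of one row". */
void zero_matrix(double (*J)[DIM])
{
    assert(J != NULL);
    memset(J, 0, DIM * sizeof *J);
}

/* Count the entries that are not 0.0, to check the result. */
size_t count_nonzero(double (*J)[DIM])
{
    size_t n = 0;
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            if (J[i][j] != 0.0)
                n++;
    return n;
}
```

This relies, as noted above, on all-zero bytes representing 0.0, which holds for IEEE 754 doubles.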

Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which can cause thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
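For the matrix-vector product in the question, the access order is already the cache-friendly one for a row-major C array: the inner loop walks along one row. A plain sketch for comparison, assuming a zero-based, contiguous n x n matrix rather than the Sundials Ith macro (matvec and its parameters are illustrative names):

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a row-major n x n matrix stored contiguously.
   The inner loop reads A with unit stride and keeps the running sum in a
   local variable, which a compiler can hold in a register. */
void matvec(size_t n, const double *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        const double *row = A + i * n;  /* start of row i, computed once */
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += row[j] * x[j];
        y[i] = acc;
    }
}
```

Swapping the two loops would stride through memory n elements at a time, which is the kind of access order that causes the cache misses mentioned above.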

Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
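If the matrix is moved to the heap, one option is to allocate it as a pointer to rows, so that the J[i][j] syntax keeps working. The sketch below uses calloc, which hands back zero-filled memory and so combines the allocation and the memset() in one call. Row, alloc_zeroed_matrix and matrix_set are names invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DIM 151

typedef double Row[DIM];  /* one row of the matrix */

/* Allocate a DIM x DIM matrix on the heap with all bytes set to zero.
   The caller releases it with free(). Returns NULL on failure. */
Row *alloc_zeroed_matrix(void)
{
    return calloc(DIM, sizeof(Row));
}

/* Bounds-checked store, to show that normal 2-D indexing still applies. */
void matrix_set(Row *J, size_t i, size_t j, double value)
{
    assert(J != NULL && i < DIM && j < DIM);
    J[i][j] = value;
}
```

A Row * obtained this way can be passed directly to a function declared as taking double J[151][151], since that parameter type is itself a pointer to a row of 151 doubles.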
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.
As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really have need for high accuracy, I'd consider using float or int to speed up the algorithm.
