I have some Fortran 90 code to optimize.
I'd like to resolve the memory location of a structure element once in an outer loop, and then access its deepest component in a nested loop.
Something like this:
sample fortran loop - legacy version
do i = 1, N
   ii = ...  ! some integer
   jj = ...  ! some other integer
   do j = 1, M
      c = a(ii, jj)%b(i)
   enddo
enddo
has to become:
second fortran loop - what I would like to write
do i = 1, N
   ii = ...  ! some integer
   jj = ...  ! some other integer
   pointertoa = &a(ii, jj)  ! I know it's not correct in Fortran; that is the question!
   do j = 1, M
      c = pointertoa%b(i)
   enddo
enddo
I have this (sample) C code working as I expect:
Working memory addressing in C
#include <stdio.h>

struct mem {
    int a;
    struct mm {
        int b;
        float v;
    } mmm;
};

int main(void) {
    struct mem *m, dum;
    dum.a = 12;
    dum.mmm.b = 5;
    dum.mmm.v = 3.2f;
    m = &dum;  /* m is given dum's memory address */
    printf("dum.a = %d\n", dum.a);
    printf("dum.mmm.b = %d\n", dum.mmm.b);
    printf("dum.mmm.v = %f\n", dum.mmm.v);
    printf("m->a = %d\n", m->a);
    printf("m->mmm.b = %d\n", m->mmm.b);
    printf("m->mmm.v = %f\n", m->mmm.v);
    return 0;
}
A couple of questions:
How would you do in Fortran 90 what I did here in C?
Do you think the second Fortran loop would speed up the code?
Fortran will make it very, very difficult for you to get the memory address of a variable, or of anything else for that matter. The tricks and techniques you may have learned in C, messing around with pointers and memory addresses, just aren't supported in Fortran. Nor, generally, are they needed within Fortran's core application domains. Your question rather suggests you are trying to write C in Fortran. Don't.
Now I've got that off my chest: you may be able to achieve what you want using the associate construct, introduced in Fortran 2003. Something like
associate(pointertoa => a(ii, jj))
   do j = 1, M
      c = pointertoa%b(i)
   enddo
end associate
Whether this achieves your efficiency goals I haven't a scooby. But I'll be surprised if it does. Optimising access to array elements is something that Fortran compilers have been working on for 50+ years and they're really quite good at it.
EDIT, in response to OP's first comment ...
If your compiler supports associate you can certainly use it. But if someone is going to look over your shoulder and hit you painfully on the head if you use any feature introduced to Fortran after publication of the 90 standard then it's up to you whether or not you take the hit. The compiler ain't going to care, nor is the compiled code. associate is part of the standard and Fortran has a very good record of maintaining backwards compatibility so the likelihood that a future compiler will get upset is very small.
In writing your C function don't forget to take advantage of loop unrolling, memory prefetching, multiple instruction pipelines, vector operations, common subexpression elimination, all that sort of shit. If you manage to write a C function that outperforms the product of a Fortran compiler with optimisation turned up to 11 come back with the data to prove it and I'll eat my hat.
And, while I'm writing again, I note that the loop
do j = 1, M
   c = pointertoa%b(i)
enddo
is almost entirely redundant: the body does not depend on j, so a good optimising compiler would just create code to execute c = pointertoa%b(i) once.
Related
I'm working on code to do matrix operations for a satellite my school is making. Would it be faster and less resource-intensive to use a for loop or to just write out the operations? All matrices are of a known size.
for (i = 0; i < 3; i++)        // Row
{
    for (j = 0; j < 3; j++)    // Column
    {
        result[i][j] = a[i][j] * b;
    }
}
or
result[0][0] = a[0][0] * b;
result[0][1] = a[0][1] * b;
etc...
You are talking about loop unrolling. You're right, it is a common technique for decreasing a program's computing time. However, as has been said in the comments, it is not certain you will save time, because the outcome depends on many factors (compiler, compiler optimisation level, etc.). It is also possible that the compiler unrolls loops itself if you choose a high optimisation level.
Don't forget that unrolling costs code size, which is also a valuable resource.
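As an illustrative sketch (with 0-based indices, which is what C actually uses), manually unrolling the inner loop of your 3x3 scaling could look like the following; only a measurement can tell whether it beats the compiler's own unrolling:

#include <stdio.h>

int main(void)
{
    double a[3][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double result[3][3];
    double b = 2.0;

    /* Inner loop unrolled by hand: three explicit assignments per row
       instead of a j-loop. */
    for (int i = 0; i < 3; i++) {
        result[i][0] = a[i][0] * b;
        result[i][1] = a[i][1] * b;
        result[i][2] = a[i][2] * b;
    }

    printf("result[2][2] = %f\n", result[2][2]);  /* expect 18.0 */
    return 0;
}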
Keep in mind there are other ways to optimize code. For example, here you multiply all elements of an array by the same variable. Perhaps you can do that multiplication later in the code, at the point where you next access the result array? That saves a full traversal of the array, with all the memory accesses it implies.
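A tiny sketch of that deferred-multiplication idea (the helper name is hypothetical):

/* Scale on read instead of writing a scaled copy: the caller passes
   the factor b along, and no pass over a result[][] array is needed. */
static inline double scaled_at(const double a[3][3], double b, int i, int j)
{
    return a[i][j] * b;
}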
I am pressed for time to optimize a large piece of C code for speed and I am looking for an algorithm (at best a C "snippet") that transposes a rectangular source matrix u[r][c] of arbitrary size (r rows, c columns) into a target matrix v[s][d] (s = c rows, d = r columns) in a "cache-friendly", i.e. data-locality-respecting, way. The typical size of u is around 5000 ... 15000 rows by 50 to 500 columns, and it is clear that row-wise access of elements is very cache-inefficient.
There are many discussions on this topic on the web (near this thread), but as far as I can see all of them discuss special cases such as square matrices u[r][r], or matrices defined as a one-dimensional array, e.g. u[r * c], not the above-mentioned "array of arrays" (of equal length) used in my context of Numerical Recipes (background: see here).
I would be very thankful for any hint that helps spare me the "reinvention of the wheel".
Martin
I do not think an array of arrays is much harder to transpose than a linear array in general. But having only 50 columns in each row array sounds bad: it may not be enough to hide the overhead of pointer dereferencing.
I think the overall strategy of a cache-friendly implementation is the same: process your matrix in tiles, and choose the tile size that performs best according to experiments.
template<int BLOCK>
void TransposeBlocked(Matrix &dst, const Matrix &src) {
    int r = dst.r, c = dst.c;
    assert(r == src.c && c == src.r);
    // Walk the destination in BLOCK x BLOCK tiles.
    for (int i = 0; i < r; i += BLOCK)
        for (int j = 0; j < c; j += BLOCK) {
            if (i + BLOCK <= r && j + BLOCK <= c)
                ProcessFullBlock<BLOCK>(dst.data, src.data, i, j);
            else  // ragged tiles at the right/bottom edges
                ProcessPartialBlock(dst.data, src.data, r, c, i, j, BLOCK);
        }
}
I have tried to optimize the best case, r = 10000, c = 500 (with float type). On my local machine, 128 x 128 tiles give a speedup of about 2.5x. I also tried using SSE to accelerate the transposition, but it does not change the timings significantly. I think that's because the problem is memory-bound.
Here are full timings (for 100 launches each) of various implementations on Core2 E4700 2.6GHz:
Trivial: 6.111 sec
Blocked(4): 8.370 sec
Blocked(16): 3.934 sec
Blocked(64): 2.604 sec
Blocked(128): 2.441 sec
Blocked(256): 2.266 sec
BlockedSSE(16): 4.158 sec
BlockedSSE(64): 2.604 sec
BlockedSSE(128): 2.245 sec
BlockedSSE(256): 2.036 sec
Here is the full code used.
So I'm guessing you have an array of arrays of floats/doubles. This setup is already very bad for cache performance. The reason is that with a one-dimensional array the compiler can output code that results in a prefetch operation and (in the case of a very new compiler) produce SIMD/vectorized code. With an array of pointers there's a dereference operation at each step, making prefetching more difficult. Not to mention there aren't any guarantees on memory alignment.
If this is for an assignment and you have no choice but to write the code from scratch, I'd recommend looking at how CBLAS does it (note that you'll still need your array to be "flattened"). Otherwise, you're much better off using a highly optimized BLAS implementation like OpenBLAS. It's been optimized for nearly a decade and will produce the fastest code for your target processor (tuning for things like cache sizes and the vector instruction set).
The tl;dr is that using an array of arrays will result in terrible performance no matter what. Flatten your arrays, and keep your code readable by using a #define to access elements of the array, as sketched below.
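A minimal sketch of that suggestion (the AT macro and transpose_flat are illustrative names, not from any library):

#include <stddef.h>

/* Element (i, j) of a flattened row-major matrix with ncols columns. */
#define AT(m, i, j, ncols) ((m)[(size_t)(i) * (ncols) + (j)])

/* Naive transpose of an r x c matrix src into the c x r matrix dst;
   tile it (as in the answer above) for better cache behaviour. */
void transpose_flat(float *dst, const float *src, int r, int c)
{
    for (int i = 0; i < r; i++)
        for (int j = 0; j < c; j++)
            AT(dst, j, i, r) = AT(src, i, j, c);
}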
I'm fresh to C, coming from scripting languages like PHP, JS, Ruby, etc. I've got a query regarding performance. I know one should not micro-optimize too early; however, I'm writing a Ruby C extension for Google SketchUp where I do lots of 3D calculations, so performance is a concern. (And this question is also about learning how C works.)
Many iterations are needed to process all the 3D data, so I'm trying to work out what might be faster.
I'm wondering whether accessing an array entry many times is faster if I first make a pointer to that array entry. What would common practice be?
struct FooBar arr[10];
int i;
for (i = 0; i < 10; i++) {
    arr[i].foo = 10;
    arr[i].bar = 20;
    arr[i].biz = 30;
    arr[i].baz = 40;
}
Would this be faster or slower? Why?
struct FooBar arr[10], *item;
int i;
for (i = 0; i < 10; i++) {
    item = &arr[i];
    item->foo = 10;
    item->bar = 20;
    item->biz = 30;
    item->baz = 40;
}
I looked around and found discussions about variables vs pointers, where it was generally said that pointers require an extra step, since the address has to be looked up before the value, but that in general there isn't a big hit.
But what I was wondering was whether accessing an array entry in C has much of a performance cost. In Ruby it is faster to make a reference to the entry if you need to access it many times, but that's Ruby...
There's unlikely to be a significant difference. Possibly the emitted code will be identical. This is assuming a vaguely competent compiler, with optimization enabled. You might like to look at the disassembled code, just to get a feel for some of the things a C optimizer gets up to. You may well conclude, "my code is mangled beyond all recognition, there's no point worrying about this kind of thing at this stage", which is a good instinct.
Conceivably the first code could even be faster, if introducing the item pointer were to somehow interfere with any loop unrolling or other optimization that your compiler performs on the first version. Or it could be that the optimizer can figure out that arr[i].foo is equal to stack_pointer + sizeof(struct FooBar) * i, but fail to figure that out once you use the pointer, and end up using an extra register, spilling something else, with performance implications. But I'm speculating wildly on that point: there is usually little to no difference between accessing an array by pointer or by index. My point is just that any difference there is can come for surprising reasons.
If I were worried, and felt like micro-optimizing it (or were just in a pointer-oriented mood), I'd skip the integer index and use pointers all over:
struct FooBar arr[10], *item, *end = arr + sizeof arr / sizeof *arr;
for (item = arr; item < end; item++) {
    item->foo = 10;
    item->bar = 20;
    item->biz = 30;
    item->baz = 40;
}
But please note: I haven't compiled this (or your code) and counted the instructions, which is what you'd need to do. As well as running it and measuring of course, since some combinations of multiple instructions might be faster than shorter sequences of other instructions, and so on.
I'm new to C after many years of Matlab numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid: after profiling the code, I was surprised to see three loops taking ~90% of the computation time, despite the fact that they perform the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(void) {
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data, J, y);
    /* Use J */
}

static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter = 0; iter < 151; iter++) {
        for (jter = 0; jter < 151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix contained in the structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        P[iter-1][jter-1] = -gamma * (data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). In particular, is there a best way to get the dot product between v and the rows of J?
int iter, jter;
Jv_scratch = 0;
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        Jv_scratch += J[iter][jter] * Ith(v, jter);
    }
    Ith(Jv, iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset(J, 0, sizeof(double) * 151 * 151);
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the positions within P and within J. You may well get better performance by stepping through both arrays with pointers:
for (iter = 1; iter < 151; iter++)
{
    /* pP walks row iter-1 of P; pJ walks row iter of J,
       starting at column 1 to match the original indexing. */
    double *pP = &P[iter - 1][0];
    double *pJ = &data->J[iter][1];
    for (jter = 1; jter < 151; jter++, pP++, pJ++)
    {
        *pP = -gamma * *pJ;
    }
}
This way you move some of the array index calculations outside of the inner loop.
3) The best practice is to move as many calculations out of the loop as possible, much as I did in the loop above.
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    memset(J, 0, sizeof(double) * 151 * 151);
    /* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine that fills a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero (true on any platform with IEEE 754 doubles), so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which produces thousands of cache misses).
You can try Googling terms like "matrix multiplication", "cache" and "blocking" if you want to learn about the techniques used in fast libraries; a rough sketch of the blocking idea follows. But my advice is to just use a pre-existing maths library if performance is key.
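To give a flavour of what those searches turn up, here is a minimal, untuned sketch of loop blocking for a matrix multiply (the function name and BLOCK value are mine; a real library tunes these per processor):

#define N 151      /* matrix dimension, matching the question */
#define BLOCK 32   /* tile size; should be chosen by experiment */

/* Computes C += A * B in BLOCK-sized tiles, so each tile triple
   stays resident in cache while it is reused. Zero C beforehand
   if you want a plain product. */
static void matmul_blocked(double C[N][N], const double A[N][N],
                           const double B[N][N])
{
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK && i < N; i++)
                    for (int k = kk; k < kk + BLOCK && k < N; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + BLOCK && j < N; j++)
                            C[i][j] += aik * B[k][j];
                    }
}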
"Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?"
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, i.e.:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically, for this specific case, I very much doubt it is wise to allocate 151*151*sizeof(double) bytes (about 180 kB) on the stack, no matter which system you are using. You will likely have to allocate the matrix dynamically, and then none of the above matters: you must then use memset() to set all bytes to zero, as sketched below.
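A sketch of that dynamic route (assuming the same 151 x 151 shape; malloc and memset are standard C):

#include <stdlib.h>
#include <string.h>

void spam_dynamic(void)
{
    /* Heap-allocate the matrix instead of putting ~180 kB on the stack. */
    double (*J)[151] = malloc(151 * sizeof *J);
    if (J == NULL)
        return;                      /* allocation failed; handle as appropriate */
    memset(J, 0, 151 * sizeof *J);   /* all bytes zero => all doubles 0.0 */
    /* ... populate and use J[i][j] exactly as before ... */
    free(J);
}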
"In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?"
You should ensure that any function called from the loop is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (i.e. how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual micro-optimizations, such as counting down towards zero rather than up, or using ++i rather than i++, and so on. But the compiler really should be able to handle such things for you.
As for the matrix multiplication, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type: unless you really need the accuracy, I'd consider using float or int to speed up the algorithm.
I have this 'simplified' Fortran code:
real B(100, 200)
real A(100, 200)
... initialize B array code.
do I = 1, 200
   do J = 1, 100
      A(J, I) = B(J, I)
   end do
end do
One of the programming gurus warned me that Fortran accesses data efficiently in column order, while C accesses data efficiently in row order. He suggested that I take a good hard look at the code and be prepared to switch loops around to maintain the speed of the old program.
Being the lazy programmer that I am, and recognizing the days of effort involved and the mistakes I am likely to make, I started wondering if there might be a #define technique that would let me convert this code safely and easily.
Do you have any suggestions?
In C, multi-dimensional arrays work like this:
#define array_length(a) (sizeof(a) / sizeof((a)[0]))

float a[100][200];
a[x][y] == ((float *)a)[array_length(a[0]) * x + y];
In other words, they're really flat arrays and [][] is just syntactic sugar.
Suppose you do this:
/* typeof is a GCC extension */
#define at(a, i, j) ((typeof(**(a)) *)(a))[(i) + array_length((a)[0]) * (j)]

float a[100][200];
float b[100][200];

for (i = 0; i < 100; i++)
    for (j = 0; j < 200; j++)
        at(a, j, i) = at(b, j, i);
You're walking sequentially through memory, and pretending that a and b are actually laid out in column-major order. It's kind of horrible in that a[x][y] != at(a, x, y) != a[y][x], but as long as you remember that it's tricked out like this, you'll be fine.
Edit
Man, I feel dumb. The intention of this definition is to make at(a, x, y) == a[y][x], and it does. So the much simpler and easier to understand
#define at(a, i, j) (a)[j][i]
would be better than what I suggested above.
Are you sure your FORTRAN guys did things right?
The code snippet you originally posted is already accessing the arrays in row-major order (which is 'inefficient' for FORTRAN, 'efficient' for C).
As illustrated by the snippet of code and as mentioned in your question, getting this 'correct' can be error prone. Worry about getting the FORTRAN code ported to C first without worrying about details like this. When the port is working - then you can worry about changing column-order accesses to row-order accesses (if it even really matters after the port is working).
One of my first programming jobs out of college was to fix a long-running C app that had been ported from FORTRAN. The arrays were much larger than yours and it was taking something around 27 hours per run. After fixing it, they ran in about 2.5 hours... pretty sweet!
(OK, it really wasn't assigned, but I was curious and found a big problem with their code. Some of the old timers didn't like me much despite this fix.)
It would seem that the same issue is found here.
real B(100, 200)
real A(100, 200)
... initialize B array code.
do I = 1, 100
   do J = 1, 200
      A(I, J) = B(I, J)
   end do
end do
Your looping (to be good FORTRAN) would be:
real B(100, 200)
real A(100, 200)
... initialize B array code.
do J = 1, 200
   do I = 1, 100
      A(I, J) = B(I, J)
   end do
end do
Otherwise you are marching through the arrays in row-major order, which could be highly inefficient.
At least I believe that's how it would be in FORTRAN - it's been a long time.
Saw you updated the code...
Now you'd want to swap the loop control variables, so that you iterate on the rows and then, inside that, on the columns, if you are converting to C; a sketch follows.
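For reference, a sketch of the cache-friendly C equivalent after such a port (assuming the arrays keep their 100 x 200 shape; C is row-major, so the column index must vary fastest):

float A[100][200], B[100][200];
int i, j;

/* ... initialize B ... */

for (i = 0; i < 100; i++)        /* rows, outer */
    for (j = 0; j < 200; j++)    /* columns, inner: contiguous in C */
        A[i][j] = B[i][j];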