Best way to optimize performance of program with arrays that regularly need to be reset to zero in C?

I have searched to see whether this question has been asked before, but I could not find anything. While searching, though, I found many interesting points about optimization in this answer and the other answers to a question about optimization.
My question is which way is the most efficient/fastest to set the elements of a large array to zero in C.
The program will track a large number of particles, >>>1000. Each particle is described by several variables some of which will need to be reset to zero every time around a loop, which will be executed >>>1000 times. The exact number of particles that can be handled will depend on the efficiency of the code.
The choices seem to be the following, and I have ordered them by what I guess goes from least efficient to most efficient. (I describe them with indicative code fragments. Of course this is not code that can run, just something to indicate the strategy. I realise that loop unrolling might be a good idea, but for simplicity it is not included below.)
1) N particles are represented by an array of structures, each of which contains all the information about one particle:
/*structure definition*/
struct particle {
    double a;
    double b;
    ....
};
/*memory allocation*/
struct particle * part;
part = (struct particle *)calloc(N, sizeof(struct particle));
/*routine to set some particle variables to zero*/
for (i = 0; i < N; i++)
{
    part[i].a = 0;
    part[i].b = 0;
    .... etc....
}
2) N particles are represented by several arrays inside a structure that contains all the information about the ensemble of particles:
/*structure definition*/
struct ensemble {
    double * a;
    double * b;
    ....
};
/*memory allocation*/
struct ensemble group;
group.a = (double *)calloc(N, sizeof(double));
group.b = (double *)calloc(N, sizeof(double));
/*routine to set some particle variables to zero*/
for (i = 0; i < N; i++)
{
    group.a[i] = 0;
    group.b[i] = 0;
    .... etc....
}
3) Exactly the same as 2) above, but the variables are reset to zero with
/*routine to set some particle variables to zero*/
free(group.a); group.a = (double *)calloc(N,sizeof(double));
free(group.b); group.b = (double *)calloc(N,sizeof(double));
4) Instinctively I think there must be an easier way than 3) to write 0 to memory, one which does not require freeing and then reallocating large amounts of memory every time around the loop. The answers to this question mention memset, which I am guessing would work, provided that setting everything to zero bytewise gives doubles with values of 0.0000000e00.
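For what it's worth, a minimal sketch of what option 4) could look like with memset, assuming the struct ensemble layout from option 2) and that an all-zero-bytes pattern represents 0.0 (which holds for IEEE 754 doubles):

#include <string.h>

/* Sketch only: clear the per-particle arrays in place each iteration.
 * Assumes all-bits-zero is double 0.0, true for IEEE 754 doubles. */
static void reset_ensemble(struct ensemble *g, size_t n)
{
    memset(g->a, 0, n * sizeof(double));
    memset(g->b, 0, n * sizeof(double));
    /* ... etc. for the other per-particle arrays ... */
}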
5) The same as 2), 3), 4) above, but instead of using any data structure, just grab memory for separate arrays.
/*memory allocation*/
double * a, * b, ... ;
a = (double *)calloc(N, sizeof(double));
b = (double *)calloc(N, sizeof(double));
/*routine to set some particle variables to zero*/
for (i = 0; i < N; i++)
{
    a[i] = 0;
    b[i] = 0;
    .... etc....
}
Finally, I saw a suggestion that *(a+i)=0 would be quicker than a[i]=0, but for readability the code above uses a[i] array indexing.
I also guess that the compiler, with optimization flags turned on, will do some of these things anyway.
I would be really interested to hear what would be expected to be fastest and how much improvement might be obtained with each refinement.

Related

Pointer Math with Complex Array

I have this snippet of code with some pointer math that I'm having trouble understanding:
#include <stdlib.h>
#include <complex.h>
#include <fftw3.h>
int main(void)
{
    int i, j, k;
    int N, N2;
    fftwf_complex *box;
    fftwf_plan plan;
    float *smoothed_box;
    // Allocate memory for arrays (Ns are set elsewhere and properly,
    // I've just left it out for clarity)
    box = (fftwf_complex *)fftwf_malloc(N * sizeof(fftwf_complex));
    smoothed_box = (float *)malloc(N2 * sizeof(float));
    // Create complex data and fill box with it. Do FFT. Box has the
    // Hermitian symmetry that complex data has when doing FFTs with
    // real data
    plan = fftwf_plan_dft_c2r_3d(N, N, N, box, (float *)box,
                                 FFTW_ESTIMATE);
    ...
    // end fft
    // Now do the loop I don't understand
    for (i = 0; i < N2; i++)
    {
        for (j = 0; j < N2; j++)
        {
            for (k = 0; k < N2; k++)
            {
                smoothed_box[R_INDEX(i,j,k)] = *((float *)box +
                    R_FFT_INDEX(i*f + 0.5, j*f + 0.5, k*f + 0.5))/V;
            }
        }
    }
    // Do other stuff
    ...
    return 0;
}
Where f and V are just some numbers that are set elsewhere in the code and don't matter for this particular question. Additionally, the functions R_FFT_INDEX and R_INDEX don't really matter, either. What's important is that, for the first loop iteration, when i=j=k=0, R_INDEX = 0 and R_FFT_INDEX = 45. smoothed_box has 8 elements and box has 320.
So, in gdb, when I print smoothed_box[0] after the loop, I get smoothed_box[0] = some number. Now, I understand that, for an array of normal types, say floats, array + integer will give array[integer], assuming that integer is within the bounds of the array.
However, fftwf_complex is defined as typedef float fftw_complex[2], since you need to hold both the real and imaginary parts of the complex number. It's also being cast to a float * from an fftwf_complex *, and I'm unsure what this does, given the typedef.
All I know is that when I print box[45] in gdb, I get box[45] = some complex number that is not smoothed_box[0] * V. Even when I print *((float *)box + 45)/V, I get a different number than smoothed_box[0].
So, I was just wondering if anyone could explain to me the pointer math that is being done in the above loop? Thank you, and I appreciate your time!
box is allocated as an array of N fftwf_complex. Then a backward 3D c2r FFTW transform of size N,N,N is performed on box, which requires N*N*(N/2+1) fftwf_complex. See http://www.fftw.org/fftw3_doc/Real_002ddata-DFT-Array-Format.html#Real_002ddata-DFT-Array-Format Therefore, this code might trigger undefined behavior, such as a segmentation fault, before even reaching the pointer arithmetic...
It is practical to cast box back to an array of float because the DFT is performed in place. Indeed, box is used twice when the fftwf_plan is created; it is both the input array of complex and the output array of real:
plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
FFTW_ESTIMATE);
Once fftwf_execute(plan); is called, box is better seen as an array of real. Nevertheless, this array is of size N*N*2*(N/2+1), and the items located at positions i,j,k with k > N-1 are meaningless. See FFTW's Real-data DFT Array Format:
For an in-place transform, some complications arise since the complex data is slightly larger than the real data. In this case, the final dimension of the real data must be padded with extra values to accommodate the size of the complex data—two extra if the last dimension is even and one if it is odd. That is, the last dimension of the real data must physically contain 2 * (n_{d-1}/2 + 1) double values (exactly enough to hold the complex data). This physical array size does not, however, change the logical array size—only n_{d-1} values are actually stored in the last dimension, and n_{d-1} is the last dimension passed to the planner.
This is the reason why the real array smoothed_box is introduced, though an N*N*N array would be expected. If smoothed_box were an array of size N*N*N, then the following conversion could have been performed:
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            smoothed_box[(i*N+j)*N+k] = ((float *)box)[(i*N+j)*(2*(N/2+1))+k];
        }
    }
}

Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd()

I seem to have fixed it myself by casting the cij2 pointer inside the _mm256 call,
so _mm256_storeu_pd((double *)cij2,vecC);
I have no idea why this changed anything...
I'm writing some code and trying to take advantage of Intel intrinsics for manual vectorization. But whenever I run the code I get a segmentation fault when trying to use my double *cij2.
if( q == 0)
{
    __m256d vecA;
    __m256d vecB;
    __m256d vecC;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
        {
            double cij = C[i+j*lda];
            double *cij2 = (double *)malloc(4*sizeof(double));
            for (int k = 0; k < K; k+=4)
            {
                vecA = _mm256_load_pd(&A[i+k*lda]);
                vecB = _mm256_load_pd(&B[k+j*lda]);
                vecC = _mm256_mul_pd(vecA,vecB);
                _mm256_storeu_pd(cij2, vecC);
                for (int x = 0; x < 4; x++)
                {
                    cij += cij2[x];
                }
            }
            C[i+j*lda] = cij;
        }
I've pinpointed the problem to the cij2 pointer. If I comment out the two lines that use that pointer, the code runs fine; it doesn't work like it should, but it'll actually run.
My question is why would I get a segmentation fault here? I know I've allocated the memory correctly, and that the memory is enough for a 256-bit vector of doubles, each 64 bits in size.
After reading the comments, I've come to add some clarification.
First thing I did was change the _mm_malloc to just a normal allocation using malloc. It shouldn't affect things either way, but theoretically it gives me some more breathing room.
Second, the problem isn't coming from a null return on the allocation. I added a couple of loops to increment through the array and make sure I could modify the memory without it crashing, so I'm relatively sure that isn't the problem. The problem seems to stem from loading the data from vecC into the array.
Lastly, I cannot use BLAS calls. This is for a parallelism class. I know it would be much simpler to call on something way smarter than I am, but unfortunately I'll get a 0 if I try that.
You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.
In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address).
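A minimal sketch of that suggestion in C11 terms (alignas comes from <stdalign.h>); the helper name is just for illustration:

#include <immintrin.h>
#include <stdalign.h>   /* C11 alignas */

/* Illustrative sketch: horizontally sum a __m256d through an aligned
 * automatic buffer, the way the loop above does with its malloc'd cij2. */
static double hsum_via_buffer(__m256d v)
{
    alignas(32) double buf[4];
    _mm256_store_pd(buf, v);            /* aligned store is fine here */
    return buf[0] + buf[1] + buf[2] + buf[3];
}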
If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.
Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.
When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).
Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.
__m256d cij_vec = _mm256_setzero_pd();
for (int k = 0; k < K; k+=4) {
    vecA = _mm256_load_pd(&A[i+k*lda]);
    vecB = _mm256_load_pd(&B[k+j*lda]);
    vecC = _mm256_mul_pd(vecA,vecB);
    cij_vec = _mm256_add_pd(cij_vec, vecC); // TODO: use multiple accumulators to keep multiple VADDPD or VFMAPD instructions in flight.
}
C[i+j*lda] = hsum256_pd(cij_vec); // put the horizontal sum in an inline function
For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there. It should be easy to adapt the pattern of shuffling to 256b double-precision.) This will help your code a lot, since you still have O(N^2) horizontal sums (but not O(N^3) with this change).
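For reference, one possible shape for such an hsum256_pd; this is a sketch, not the version from the linked answer:

#include <immintrin.h>

/* One way to horizontally sum the four doubles in a __m256d. */
static inline double hsum256_pd(__m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);    /* lower two doubles */
    __m128d hi = _mm256_extractf128_pd(v, 1);  /* upper two doubles */
    lo = _mm_add_pd(lo, hi);                   /* (v0+v2, v1+v3) */
    __m128d swapped = _mm_unpackhi_pd(lo, lo); /* bring the high lane down */
    return _mm_cvtsd_f64(_mm_add_sd(lo, swapped));
}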
Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.
VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need from 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that. Also the x86 tag wiki.
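A rough sketch of the multiple-accumulator idea, reusing the variable names from the code above and assuming K is a multiple of 16:

__m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
__m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
for (int k = 0; k < K; k += 16) {              /* assumes K % 16 == 0 */
    __m256d a0 = _mm256_load_pd(&A[i+(k+ 0)*lda]), b0 = _mm256_load_pd(&B[(k+ 0)+j*lda]);
    __m256d a1 = _mm256_load_pd(&A[i+(k+ 4)*lda]), b1 = _mm256_load_pd(&B[(k+ 4)+j*lda]);
    __m256d a2 = _mm256_load_pd(&A[i+(k+ 8)*lda]), b2 = _mm256_load_pd(&B[(k+ 8)+j*lda]);
    __m256d a3 = _mm256_load_pd(&A[i+(k+12)*lda]), b3 = _mm256_load_pd(&B[(k+12)+j*lda]);
    /* four independent dependency chains keep several adds in flight */
    acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(a0, b0));
    acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(a1, b1));
    acc2 = _mm256_add_pd(acc2, _mm256_mul_pd(a2, b2));
    acc3 = _mm256_add_pd(acc3, _mm256_mul_pd(a3, b3));
}
acc0 = _mm256_add_pd(_mm256_add_pd(acc0, acc1), _mm256_add_pd(acc2, acc3));
C[i+j*lda] = hsum256_pd(acc0);                 /* one horizontal sum at the end */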
Also, ideally nest your loops in a way that gives you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse), even if you don't get fancy and transpose small blocks at a time that fit in cache. Even transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[]; a sketch of the transpose idea follows.
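A sketch under the assumption of a square lda-by-lda layout; the function name is hypothetical:

/* Transpose A once (O(N^2)) so the inner loop over k can read
 * At[k + i*lda] contiguously, like the already-contiguous B[k + j*lda]. */
static void transpose_square(const double *A, double *At, int n, int lda)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            At[k + i*lda] = A[i + k*lda];
}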

optimized copying from an openCV mat to a 2D float array

I want to copy an OpenCV Mat variable into a 2D float array. I used the code below for this purpose, but because I am developing code where speed is a very important metric, this copy method is not optimized enough. Is there another, more optimized way?
float *ImgSrc_f;
ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float));
for (int i = 0; i < 512; i++)
    for (int j = 0; j < 512; j++)
    {
        ImgSrc_f[i * 512 + j] = ImgSrc.at<float>(i, j);
    }
Really,
//first method
Mat img_cropped = ImgSrc(Rect(0, 0, 512, 512)).clone(); // assuming a 512x512 ROI from the top-left
float *ImgSrc_f = (float *)img_cropped.data;
should be no more than a couple of % less efficient than the best method. I suggest this one as long as it doesn't lose more than 1% to the second method.
Also try this, which is very close to the absolute best method unless you can use some kind of extended CPU instruction set (and one is available). You'll probably see minimal difference between method 2 and method 1.
//method 2
//preallocated memory
//largest possible with minimum number of direct copy calls
//caching pointer arithmetic can speed up things extremely minimally
//removing the Mat header with a malloc can speed up things minimally
//startx, starty and ROI are set elsewhere
Mat img_roi = ImgSrc(ROI);
Mat img_copied(ROI.height, ROI.width, CV_32FC1);
for (int y = starty; y < starty + ROI.height; ++y)
{
    unsigned char* rowptr = img_roi.data + y*img_roi.step1() + startx * sizeof(float);
    unsigned char* rowptr2 = img_copied.data + y*img_copied.step1();
    memcpy(rowptr2, rowptr, ROI.width * sizeof(float));
}
Basically, if you really, really care about these details of performance, you should stay away from overloaded operators in general. The more levels of abstraction in code, the higher the penalty cost. Of course that makes your code more dangerous, harder to read, and bug-prone.
Can you use a std::vector structure instead? If so, try
std::vector<float> container;
container.assign((float*)matrix.datastart, (float*)matrix.dataend);

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the poisson equation.
Essentially, I have a two-dimensional array, of which I require n doubles in the first row, n/2 in the second, n/4 in the third, and so on...
Now my difficulty is deciding whether or not to use a contiguous 2D array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, more memory-efficient, method would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage, but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions we have to grapple with in Computer Science. But you can only answer them once you have numbers. Otherwise you are just guessing, and a fair amount of the time our intuition when it comes to optimization is not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1d contiguous array. You will need a function that will return the start of array given the row. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
    int pos = 0 ;
    while( row-- )
    {
        pos += n ;
        n /= 2 ;
    }
    return array + pos ;
}
Where n is the same n you described and is rounded down on every iteration.
You will have to call this function only once per entire row.
This function will never take more than O(log n) time, but if you want you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
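For example, when n is a power of two (so every halving is exact), the geometric series collapses to a closed form and the lookup becomes O(1); the helper below is a sketch of that idea:

#include <stddef.h>

/* Start offset of a row: n + n/2 + ... + n/2^(row-1) = 2n - n/2^(row-1).
 * Assumes n is a power of two and row <= log2(n)+1, so every term is exact. */
static size_t row_start(size_t n, size_t row)
{
    return row ? 2*n - (n >> (row - 1)) : 0;
}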
You could use a single array and just calculate your offset yourself:
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;
        n /= 2;
    }
    return offset;
}
double * array = calloc(get_offset(n, 64, 0), sizeof(double));
access via
array[get_offset(n, row, column)]

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data,J,y);
    /* Use J */
}
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter=0; iter<151; iter++) {
        for (jter = 0; jter<151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
    for (jter = 1; jter<151; jter++) {
        P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
    for (jter=1; jter<151; jter++) {
        Jv_scratch += J[iter][jter]*Ith(v,jter);
    }
    Ith(Jv,iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to work out the position in P and the position in J.
You may well get better performance by stepping through with pointers:
for (iter = 1; iter<151; iter++)
{
    double* pP = &P[iter-1][0];     /* start of row iter-1 of P */
    double* pJ = &data->J[iter][1]; /* J[iter][1], matching the original loop */
    for (jter = 1; jter<151; jter++, pP++, pJ++ )
    {
        *pP = - gamma * *pJ;
    }
}
This way you move various of the array index calculation outside of the loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
memset((void*)J, 0, sizeof(double) * 151 * 151);
/* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness comes from the order in which you access array elements, which causes thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
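A minimal sketch of that dynamic route (the function name is purely illustrative):

#include <stdlib.h>
#include <string.h>

/* Allocate the 151x151 Jacobian on the heap; calloc zeroes it initially,
 * and memset re-zeroes it between uses. */
static void jacobian_example(void)
{
    double (*J)[151] = calloc(151, sizeof *J);
    if (!J)
        return;
    /* ... populate and use J ... */
    memset(J, 0, 151 * 151 * sizeof(double));  /* clear before the next pass */
    free(J);
}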
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.
As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really have need for high accuracy, I'd consider using float or int to speed up the algorithm.
