Optimize lower triangle matrix transpose - c

I have to optimize the following function so it runs faster. Note: this is a lower-triangle transpose.
void trans(int **source, int **destination)
{
    for (int i = 0; i < sizee; i++)
    {
        for (int j = i + 1; j < sizee; j++)
        {
            destination[i][j] = source[j][i];
        }
    }
}
I understand that the accesses to source don't have spatial locality, because it is accessed by columns, but I don't understand how I would fix this. Any help is appreciated. Thanks.
EDIT: I tried tiling; although the runtime improved, the optimized transpose produces the wrong result:
#define b 2
for (int ii = 0; ii < sizee; ii += b) {
    for (int jj = ii + 1; jj < sizee; jj += b) {
        for (int i = ii; i < std::min(ii + b - 1, sizee); i++)
        {
            for (int j = jj; j < std::min(jj + b - 1, sizee); j++)
            {
                destination[i][j] = source[j][i];
            }
        }
    }
}

One way of implementing a cache-friendly transpose is to tile the data:
- for each square tile
- load a square tile from source into a temporary buffer
- transpose tile in-place
- write out the transposed tile to its correct location in dest
Choose the tile size so that it fits comfortably within cache.
For further optimisation you can work on the in-place tile transpose routine - there are plenty of micro-optimisations you can do on e.g. an 8x8 or 16x16 in-place transpose.
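A minimal sketch of that scheme for a full square transpose, assuming the same global sizee as in the question and that sizee is a multiple of the tile size (remainder handling omitted):
#define TILE 16  /* pick so two TILE x TILE int tiles fit comfortably in L1 */

void trans_tiled(int **source, int **destination)
{
    int tmp[TILE][TILE];
    for (int ii = 0; ii < sizee; ii += TILE)
        for (int jj = 0; jj < sizee; jj += TILE)
        {
            /* load one square tile from source, reading row-wise */
            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++)
                    tmp[i][j] = source[ii + i][jj + j];
            /* transpose the tile in place */
            for (int i = 0; i < TILE; i++)
                for (int j = i + 1; j < TILE; j++)
                {
                    int t = tmp[i][j];
                    tmp[i][j] = tmp[j][i];
                    tmp[j][i] = t;
                }
            /* write it to the mirrored block of destination, again row-wise */
            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++)
                    destination[jj + i][ii + j] = tmp[i][j];
        }
}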
Note: this answer was provided for the original version of the question when it was not apparent that the requirement was for a partial transpose. I'm leaving the answer here though as it has some useful comments below.

You can start by inverting your loop. Put j on the outside and i on the inside. Here's why: the following locations are all right next to each other in memory:
source[j][0];
source[j][1];
source[j][2];
source[j][3];
But these locations are not:
source[0][i];
source[1][i];
source[2][i];
source[3][i];
The moment the CPU finishes reading source[j][0] into a register, you have an entire cache line of data in your L1 cache. Take advantage of that by having your reads progress linearly over the address space instead of being scattered.
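For the lower-triangle pattern in the question, the swapped loop order could look like this sketch; the reads from source now walk each row left to right, while the writes to destination become the strided ones:
for (int j = 1; j < sizee; j++)
{
    /* source[j][0] .. source[j][j-1] are contiguous in memory */
    for (int i = 0; i < j; i++)
        destination[i][j] = source[j][i];
}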
You can also unroll your loops. The CPU likes it when you can execute lots of instructions with no branching.
for (int j = i + 1; j + 7 < sizee; j += 8)  // note: a cleanup loop is still needed
{                                           // for the final (sizee - j) < 8 elements
    destination[i][j]     = source[j][i];
    destination[i][j + 1] = source[j + 1][i];
    destination[i][j + 2] = source[j + 2][i];
    destination[i][j + 3] = source[j + 3][i];
    destination[i][j + 4] = source[j + 4][i];
    destination[i][j + 5] = source[j + 5][i];
    destination[i][j + 6] = source[j + 6][i];
    destination[i][j + 7] = source[j + 7][i];
}
If your CPU has prefetching instructions then you can ask it to start loading the next row of data before you have finished with the current block of memory.
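With GCC or Clang this can be sketched with the __builtin_prefetch intrinsic; the one-row-ahead distance below is only a guess and would need tuning:
for (int j = i + 1; j < sizee; j++)
{
    if (j + 1 < sizee)
        __builtin_prefetch(&source[j + 1][i], 0, 1);  /* 0 = read, 1 = low temporal locality */
    destination[i][j] = source[j][i];
}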

Related

OpenMP: Parallelize for loops inside a while loop

I am implementing an algorithm to compute a graph layout using a force-directed method. I would like to add OpenMP directives to accelerate some loops. After reading some course material, I added some OpenMP directives according to my understanding. The code compiles, but it doesn't return the same result as the sequential version.
I wonder if you would be kind enough to look at my code and help me to figure out what is going wrong with my OpenMP version.
Please download the archive here:
http://www.mediafire.com/download/3m42wdiq3v77xbh/drawgraph.zip
As you see, the portion of code which I want to parallelize is:
unsigned long graphLayout(Graph * graph, double * coords, unsigned long maxiter)
In particular, these two loops consume a lot of computational resources:
/* compute repulsive forces (electrical: f=-C.K^2/|xi-xj|.Uij) */
for (int j = 0; j < graph->nvtxs; j++) {
    if (i == j) continue;
    double * _xj = _position + j*DIM;
    double dist = DISTANCE(_xi, _xj);
    // power used for repulsive force model (standard is 1/r, 1/r^2 works well)
    // double coef = 0.0; -C*K*K/dist; // power 1/r
    double coef = -C*K*K*K/(dist*dist); // power 1/r^2
    for (int d = 0; d < DIM; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
/* compute attractive forces (spring: f=|xi-xj|^2/K.Uij) */
for (int k = graph->xadj[i]; k < graph->xadj[i+1]; k++) {
    int j = graph->adjncy[k]; /* edge (i,j) */
    double * _xj = _position + j*DIM;
    double dist = DISTANCE(_xi, _xj);
    double coef = dist*dist/K;
    for (int d = 0; d < DIM; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
Thank you in advance for any help you can provide!
You have data races in your code, e.g., when doing maxmove = nmove; or energy += nforce2;. In multi-threaded code, you cannot write into a variable shared by threads unless you use atomic access (#pragma omp atomic read/write/update) or explicitly synchronize access to that variable (critical sections, locks). Read a tutorial about OpenMP first; there are more problems with your code, e.g.
if(nmove > maxmove) maxmove = nmove;
this line will generally not work even with atomics (you would have to use a so-called compare-and-exchange atomic operation to solve this). A much better solution here is to let each thread calculate its local maximum and then reduce it into a global maximum.
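A minimal sketch of that local-then-global pattern, assuming OpenMP 3.1 or later (which supports max reductions in C); compute_move() and compute_force2() are hypothetical placeholders for your per-vertex work:
double maxmove = 0.0, energy = 0.0;

#pragma omp parallel for reduction(max:maxmove) reduction(+:energy)
for (int i = 0; i < nvtxs; i++) {
    double nmove   = compute_move(i);    /* placeholder: displacement of vertex i  */
    double nforce2 = compute_force2(i);  /* placeholder: squared force on vertex i */
    if (nmove > maxmove) maxmove = nmove;  /* updates a thread-local copy           */
    energy += nforce2;                     /* thread-local sum, combined at the end */
}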

Parallelized execution in nested for using Cilk

I'm trying to implement a 2D-stencil algorithm that manipulates a matrix. For each field in the matrix, the fields above, below, left and right of it are to be added and divided by 4 in order to calculate the new value. This process may be iterated multiple times for a given matrix.
The program is written in C and compiles with the cilkplus gcc binary.
Edit: I figured you might be interested in the compiler flags:
~/cilkplus/bin/gcc -fcilkplus -lcilkrts -pedantic-errors -g -Wall -std=gnu11 -O3 `pkg-config --cflags glib-2.0 gsl` -c -o sal_cilk_tst.o sal_cilk_tst.c
Please note that the real code involves some pointer arithmetic to keep everything consistent. The sequential implementation works. I'm omitting these steps here to enhance understandability.
The pseudocode would look something like this (no edge-case handling):
for (int i = 0; i < iterations; i++) {
    for (int j = 0; j < matrix.width; j++) {
        for (int k = 0; k < matrix.height; k++) {
            result_matrix[j][k] = (matrix[j-1][k] +
                                   matrix[j+1][k] +
                                   matrix[j][k+1] +
                                   matrix[j][k-1]) / 4;
        }
    }
    matrix = result_matrix;
}
The stencil calculation itself is then moved to the function apply_stencil(...)
for (int i = 0; i < iterations; i++) {
    for (int j = 0; j < matrix.width; j++) {
        for (int k = 0; k < matrix.height; k++) {
            apply_stencil(matrix, result_matrix, j, k);
        }
    }
    matrix = result_matrix;
}
and parallelization is attempted:
for (int i = 0; i < iterations; i++) {
    for (int j = 0; j < matrix.width; j++) {
        cilk_for (int k = 0; k < matrix.height; k++) { /* <--- */
            apply_stencil(matrix, result_matrix, j, k);
        }
    }
    matrix = result_matrix;
}
This version compiles without errors or warnings, but straight out produces a floating point exception when executed. In case you are wondering: it does not matter which of the for loops is made into a cilk_for loop. All configurations (except no cilk_for at all) produce the same error.
The other possible method:
for (int i = 0; i < iterations; i++) {
    for (int j = 0; j < matrix.width; j++) {
        for (int k = 0; k < matrix.height; k++) {
            cilk_spawn apply_stencil(matrix, result_matrix, j, k); /* <--- */
        }
    }
    cilk_sync; /* <--- */
    matrix = result_matrix;
}
This produces 3 warnings when compiled: i, j and k appear to be uninitialized.
When trying to execute, the function which executes the matrix = result_matrix; step appears to be undefined.
Now for the actual question: Why and how does Cilk break my sequential code; or rather how can I prevent it from doing so?
The actual code is of course available too, should you be interested. However, this project is for a university class and is therefore subject to plagiarism by other students who find this thread, which is why I would prefer not to share it publicly.
UPDATE:
As suggested, I attempted to run the algorithm with only 1 worker thread, effectively making the Cilk implementation sequential. This did, surprisingly enough, work out fine. However, as soon as I change the number of workers to two, the familiar errors return.
I don't think this behavior is caused by race conditions, though. Since the working matrix is changed only after each iteration and cilk_sync is called, there is effectively no critical section; no thread depends on data written by others within the same iteration.
The next step I will attempt is to try other versions of the cilkplus compiler, to see whether it's perhaps an error on their side.
With regards to the floating point exception in a cilk_for, there are some issues that have been fixed in some versions of the Cilk Plus runtime. Is it possible that you are using an outdated version?
https://software.intel.com/en-us/forums/intel-cilk-plus/topic/558825
Also, what were the specific warning messages that were produced? There are some "uninitialized variable" warnings that occur with older versions of Cilk Plus GCC, which I thought were spurious.
The Cilk runtime uses a recursive divide and conquer algorithm to parallelize your loop. Essentially, it breaks the range in half, and recursively calls itself twice, spawning half and calling half.
As part of the initialization, it calculates a "grain size", which is the minimum chunk it will break your range into. By default, that's loopRange/8P, where P is the number of cores.
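If the default grain size turns out to be a poor fit for the stencil loop, Cilk Plus lets you override it per loop; a sketch, assuming I remember the pragma syntax correctly:
/* Request a fixed grain size for the following cilk_for */
#pragma cilk grainsize = 64
cilk_for (int k = 0; k < matrix.height; k++) {
    apply_stencil(matrix, result_matrix, j, k);
}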
One interesting experiment would be to set the number of Cilk workers to 1. When you do this, all of the cilk_for machinery is exercised, but because there's only 1 worker, nothing gets stolen.
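For that experiment the worker count can be pinned without code changes via the CILK_NWORKERS environment variable, or programmatically through the runtime API; a sketch (the binary name is made up):
/* Shell:  CILK_NWORKERS=1 ./stencil_test */

#include <cilk/cilk_api.h>

int main(void)
{
    /* must run before the first Cilk construct starts the runtime */
    __cilkrts_set_param("nworkers", "1");
    /* ... rest of the program ... */
    return 0;
}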
Another possibility is to try running your code under Cilkscreen - the Cilk race detector. Unfortunately only the cilkplus branch of GCC generates the annotations that Cilkscreen needs. Your choices are to use the Intel compiler, or to try the cilkplus branch of GCC 4.9. Directions on how to pull down the code and build it are at the cilkplus.org website.

OpenMP gives (core dumped)

I have a loop which I want to parallelize with OpenMP. When I compile with gcc -o prog prog.c -lm -fopenmp I get no errors, but when I execute the program I get a segmentation fault (core dumped). The problem surely comes from the OpenMP directives, because the program works when I delete the #pragma...
Here is the parallel loop:
ix = (i-1)%ILIGNE+1;
iy = (i-1)/ILIGNE+1;
k = 1;
#pragma omp parallel for private(j,jx,jy,r,R,voisin) shared(NTOT,k,i,ix,iy) num_threads(2) schedule(auto)
for (j = 1; j <= NTOT; j++) {
    if (j != i) {
        jx = (j-1)%ILIGNE+1;
        jy = (j-1)/ICOLONE+1;
        r[k][0] = (jx-ix)*a;
        r[k][1] = (jy-iy)*a;
        R[k] = sqrt(pow(r[k][0],2.0)+pow(r[k][1],2.0));
        voisin[k] = j;
        k++;
    }
}
I tried changing the stack size to unlimited, but it doesn't fix the problem. Is this a memory leak, a race condition, or something else? Thank you for your help.
As a side note, be careful when you make an array private.
If you declared it as a fixed-size array,
e.g.
int R[5] or something similar, then that's fine: each thread gets its own personal copy :).
If you malloc it however,
e.g.:
int *R = malloc(5*sizeof(int));
then the underlying buffer will act as shared regardless of whether you declare the pointer private (which could potentially lead to undefined behaviour, segfaults, gibberish in the array, etc.).
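A small sketch of that difference (the names are only for illustration):
#include <stdlib.h>
#include <omp.h>

void sketch(void)
{
    int *buf = malloc(5 * sizeof(int));      /* ONE block of 5 ints                      */

    #pragma omp parallel firstprivate(buf)
    {
        buf[0] = omp_get_thread_num();       /* every private copy of 'buf' points at    */
    }                                        /* the SAME block, so these writes race     */
    free(buf);

    #pragma omp parallel
    {
        int *mine = malloc(5 * sizeof(int)); /* one block PER thread: genuinely private  */
        mine[0] = omp_get_thread_num();
        free(mine);
    }
}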
I'm not sure what your code does, but I'm pretty sure your OpenMP version is wrong. Indeed, you parallelised over the j loop, but the heart of your algorithm revolves around k, which is loosely derived from j and i (i is not shown here, BTW).
So when you distribute your j indexes across your OpenMP threads, they all start from a different value of j, but all from the same value of k, which is shared. From there, k is incremented quite randomly, and the accesses to the various arrays using k are very likely to generate segmentation faults.
Moreover, the arrays r, R and voisin shouldn't be declared private if one wants the parallelisation to have any effect.
Finally, C loops like this for(j = 1;j <= NTOT;j++) look utterly suspicious to me for off-by-one accesses... Shouldn't that be rather for(j = 0;j < NTOT;j++)? (just mentioning this since the initial value of k is 1 as well...)
The bottom line is that you'd probably be better off deriving k from j's value with k = j<i ? j : j-1 instead of trying to increment it inside the loop.
Assuming all the rest is correct, this might be a valid version:
ix = (i-1)%ILIGNE+1;
iy = (i-1)/ILIGNE+1;
#pragma omp parallel for private(j,jx,jy,k) num_threads(2) schedule(auto)
for (j = 1; j <= NTOT; j++) {
    if (j != i) {
        k = j<i ? j : j-1;
        jx = (j-1)%ILIGNE+1;
        jy = (j-1)/ICOLONE+1;
        r[k][0] = (jx-ix)*a;
        r[k][1] = (jy-iy)*a;
        R[k] = sqrt(pow(r[k][0],2.0)+pow(r[k][1],2.0));
        voisin[k] = j;
    }
}
Still, be careful with the C indexing from 0 to size-1, not from 1 to size...

C- Peak detection via quadratic fit

I have an application where I need to find the position of peaks in a given set of data. The resolution must be much higher than the spacing between the datapoints (i.e. it is not sufficient to find the highest datapoint, instead a "virtual" peak position has to be estimated given the shape of the peak). A peak is made of about 4 or 5 datapoints. A dataset is acquired every few ms and the peak detection has to be performed in real time.
I compared several methods in LabVIEW and I found the best result (in terms of resolution and speed) is given by the LabVIEW PeakDetector.vi, which scans the dataset with a moving window (>= 3 points width) and for each position performs a quadratic fit. The resulting quadratic function (a parabola) has a local maximum, which is in turn compared to nearby points.
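For reference, the smallest (three-point) version of such a fit has a simple closed form; this is standard parabolic interpolation, not necessarily what PeakDetector.vi does internally:
/* Parabola through (-1, ym1), (0, y0), (+1, yp1): the vertex sits at an offset of
 * 0.5*(ym1 - yp1)/(ym1 - 2*y0 + yp1) samples from the centre point. */
double parabolic_peak_offset(double ym1, double y0, double yp1)
{
    double denom = ym1 - 2.0*y0 + yp1;
    if (denom >= 0.0)          /* curvature not negative: no local maximum here */
        return 0.0;
    return 0.5 * (ym1 - yp1) / denom;
}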
Now I want to implement the same method in C. The polynomial fit is implemented as follows (using Gaussian elimination on the normal equations):
// Fits *y from x_start to (x_start + window) with a parabola and returns x_max and y_max
int polymax(uint16_t *y_data, int x_start, int window, double *x_max, double *y_max)
{
    float sum[10], mat[3][4], temp = 0, temp1 = 0, a1, a2, a3;
    int i, j;
    float x[window];
    for (i = 0; i < window; i++)
        x[i] = (float)i;
    float y[window];
    for (i = 0; i < window; i++)
        y[i] = (float)(y_data[x_start + i] - y_data[x_start]);
    for (i = 0; i < window; i++)
    {
        temp = temp + x[i];    // sum of x
        temp1 = temp1 + y[i];  // sum of y
    }
    sum[0] = temp;
    sum[1] = temp1;
    sum[2] = sum[3] = sum[4] = sum[5] = sum[6] = 0;
    for (i = 0; i < window; i++)
    {
        sum[2] = sum[2] + (x[i]*x[i]);           // sum of x^2
        sum[3] = sum[3] + (x[i]*x[i]*x[i]);      // sum of x^3
        sum[4] = sum[4] + (x[i]*x[i]*x[i]*x[i]); // sum of x^4
        sum[5] = sum[5] + (x[i]*y[i]);           // sum of x*y
        sum[6] = sum[6] + (x[i]*x[i]*y[i]);      // sum of x^2*y
    }
    // Augmented 3x4 matrix of the normal equations for y = a1 + a2*x + a3*x^2
    mat[0][0] = window;
    mat[0][1] = mat[1][0] = sum[0];
    mat[0][2] = mat[1][1] = mat[2][0] = sum[2];
    mat[1][2] = mat[2][1] = sum[3];
    mat[2][2] = sum[4];
    mat[0][3] = sum[1];
    mat[1][3] = sum[5];
    mat[2][3] = sum[6];
    // Gauss-Jordan elimination: clear column 0 of rows 1 and 2...
    temp = mat[1][0]/mat[0][0];
    temp1 = mat[2][0]/mat[0][0];
    for (i = 0, j = 0; j < 3 + 1; j++)
    {
        mat[i+1][j] = mat[i+1][j] - (mat[i][j]*temp);
        mat[i+2][j] = mat[i+2][j] - (mat[i][j]*temp1);
    }
    // ...then column 1 of rows 2 and 0...
    temp = mat[2][1]/mat[1][1];
    temp1 = mat[0][1]/mat[1][1];
    for (i = 1, j = 0; j < 3 + 1; j++)
    {
        mat[i+1][j] = mat[i+1][j] - (mat[i][j]*temp);
        mat[i-1][j] = mat[i-1][j] - (mat[i][j]*temp1);
    }
    // ...then column 2 of rows 0 and 1
    temp = mat[0][2]/mat[2][2];
    temp1 = mat[1][2]/mat[2][2];
    for (i = 0, j = 0; j < 3 + 1; j++)
    {
        mat[i][j] = mat[i][j] - (mat[i+2][j]*temp);
        mat[i+1][j] = mat[i+1][j] - (mat[i+2][j]*temp1);
    }
    a3 = mat[2][3]/mat[2][2];
    a2 = mat[1][3]/mat[1][1];
    a1 = mat[0][3]/mat[0][0];
    // fitted parabola: a3*X^2 + a2*X + a1
    if (a3 < 0)
    {
        temp = -a2 / (2*a3);
        *x_max = temp + x_start;
        *y_max = (a3*temp*temp + a2*temp + a1) + y_data[x_start];
        return 0;
    }
    else
        return -1;
}
The scan is performed in an outer function, which calls the above function repeatedly and chooses then the highest local y_max.
The above works and peaks are found; only the noise is much worse than in the LabVIEW counterpart (i.e. I get a strongly oscillating peak position given the same input dataset and the same parameters). Since the algorithm works, the code should be conceptually correct, so I suspect a numerical problem: I simply use floats without any further effort to improve numerical accuracy. Is this a plausible explanation? Does anyone have a tip on where I should be looking?
Thanks.
PS: I have done my search and found this very good overview and also this question, similar to mine (unfortunately with not many answers). I will study these further.
EDIT: I have found that my problems lay elsewhere. Improving the algorithm by discarding certain output values (a sort of post-validation in which a result is only accepted if it lies within the moving window) solved the issue. Now I am satisfied with the results, i.e. they are comparable to those from LabVIEW. Nevertheless, thanks a lot for your comments.
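In terms of the code above, that post-validation amounts to something like the following sketch, where temp is the vertex position in window coordinates:
/* accept the fit only if its maximum lies inside the window that produced it */
if (temp >= 0.0 && temp <= (double)(window - 1))
{
    *x_max = temp + x_start;
    *y_max = (a3*temp*temp + a2*temp + a1) + y_data[x_start];
    return 0;
}
return -1;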
Sorry to be late to the party, but if you have C/C++ it is really easy to port it to C# using VS2013 Express (the free version) and then pull that into LabVIEW using the .NET toolset.

OpenMP - array counter with a different start number per thread

I am struggling with a quite tricky (I guess) problem with parallel file reading.
Right now I have the file mapped using mmap, and I want to read its values into three arrays. Well, maybe that explanation isn't so clear, so this is my current code:
int counter = header;
#pragma omp parallel for shared(red,green,blue,map) private(i,j) firstprivate(counter)
for (i = 0; i < x; i++)
{
    for (j = 0; j < y; j++)
    {
        red[i][j]   = map[counter];
        green[i][j] = map[counter+1];
        blue[i][j]  = map[counter+2];
        counter += 3;
    }
}
header is the beginning of the file - just image-related info like the size and some comments.
The problem here is that counter has to be private. What I find difficult is working out how to divide the counter across threads so that each one starts from the right number.
Can someone give me a hint how it can be achieved?
You could, if I've understood your code correctly, leave counter as a shared variable and enclose the update to it in a single section, something like
        blue[i][j] = map[counter+2];
        #pragma omp single
        {
            counter += 3;
        }
    }
I write "something like" because I haven't carefully checked the syntax, nor considered the usefulness of some of the optional clauses for the single directive. I suspect that this might prove a real drag on performance.
An alternative would be to radically re-order your loop nest, perhaps (and again this is not carefully checked):
#pragma omp parallel shared(red,green,blue,map) private(i,j)
#pragma omp for collapse(2) schedule(hmmm, this needs some thought)
for (counter = 0; counter < counter_max; counter += 3)
{
    for (i = 0; i < x; i++)
    {
        for (j = 0; j < y; j++)
        {
            red[i][j] = map[counter];
        }
    }
}
... loop for blue now ...
... loop for green now ...
Note that in this version OpenMP will collapse the first two loops into one iteration space. I expect this will provide better performance than uncollapsed looping, but it's worth experimenting to find out. It's probably also worth experimenting with the schedule clause.
If I understood well, your problem is that of initializing counter to a different value for each thread.
If this is the case, then it can be solved like the linearization of matrix indexes:
size_t getLineStart(size_t ii, size_t start, size_t length) {
    return start + ii*length*3;
}

int counter;
#pragma omp parallel for shared(red,green,blue,map) private(counter)
for (size_t ii = 0; ii < x; ii++) {
    // Get the starting point for the current row
    counter = getLineStart(ii, header, y);
    for (size_t jj = 0; jj < y; jj++) {
        red  [ii][jj] = map[counter  ];
        green[ii][jj] = map[counter+1];
        blue [ii][jj] = map[counter+2];
        counter += 3;
    }
}
These kinds of counters are useful when it's hard to know in advance how many values will be counted, but they don't work easily with OpenMP. In your case, however, it's trivial to figure out what the index is: 3*(y*i+j). I suggest fusing the loops and doing something like this:
#pragma omp parallel for
for (n = 0; n < x*y; n++) {
    (&red  [0][0])[n] = map[3*n + header + 0];
    (&green[0][0])[n] = map[3*n + header + 1];
    (&blue [0][0])[n] = map[3*n + header + 2];
}
