why is this matrix multiplication algorithm faster than the other? - c

int mmult_omp(double *c,
double *a, int aRows, int aCols,
double *b, int bRows, int bCols, int numThreads)
{
for (i = 0; i < aRows; i++) {
for (j = 0; j < bCols; j++) {
c[i*bCols + j] = 0;
}
for (k = 0; k < aCols; k++) {
for (j = 0; j < bCols; j++) {
c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];
}
}
}
for (i = 0; i < aRows; i++) {
for (j = 0; j < bCols; j++) {
c[i*bCols + j] = 0;
for (k = 0; k < aCols; k++) {
c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];
}
}
}
Why is the first algorithm faster than the second?
I’ve used C’s time library and the first algorithm is objectively faster than the second. Why is that?

This code is very hard to understand. I had to copy it and reformat it to see what loops were what. I'm not really sure why one is faster but here's a great resource to see why.
Here are links to inspect the assembly output:
link for #1
link for #2

Related

How to convert this C code into Assembly using intrinsics?

this is my first time using intrisics and I have to convert the C code below into C code that uses intrisics for assembly. I don't know where to start.
void slow_routine(float alpha, float beta){
unsigned int i,j;
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
A[i][j] = A[i][j] + u1[i] * v1[j] + u2[i] * v2[j];
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
x[i] = x[i] + beta * A[j][i] * y[j];
for (i = 0; i < N; i++)
x[i] = x[i] + z[i];
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
w[i] = w[i] + alpha * A[i][j] * x[j];
}
Any of the following can be used: x86-64 SSE/SSE2/SSE4/AVX/AVX2
Please help I have been trying for hours and I'm very stuck.

Wrong result with OpenMP to parallelize GEMM

I know OpenMP shares all variables declared in an outer scope between all workers. And that my be the answer of my question. But I really confused why function omp3 delivers right result while function omp2 delivers a wrong result.
void omp2(double *A, double *B, double *C, int m, int k, int n) {
for (int i = 0; i < m; ++i) {
#pragma omp parallel for
for (int ki = 0; ki < k; ++ki) {
for (int j = 0; j < n; ++j) {
C[i * n + j] += A[i * k + ki] * B[ki * n + j];
}
}
}
}
void omp3(double *A, double *B, double *C, int m, int k, int n) {
for (int i = 0; i < m; ++i) {
for (int ki = 0; ki < k; ++ki) {
#pragma omp parallel for
for (int j = 0; j < n; ++j) {
C[i * n + j] += A[i * k + ki] * B[ki * n + j];
}
}
}
}
The problem is that there is a race condition in this line:
C[i * n + j] += ...
Different threads can read and write the same memory location (C[i * n + j]) simultaneously, which causes data race. In omp2 this data race can occur, but not in omp3.
The solution is (as suggested by #Victor Eijkhout) is to reorder your loops, use a local variable to calculate the sum of the innermost loop. In this case C[i * n + j] is updated only once, so you got rid of data race and the outermost loop can be parallelized (which gives the best performance):
#pragma omp parallel for
for (int i = 0; i < m; ++i) {
for (int j = 0; j < n; ++j) {
double sum=0;
for (int ki = 0; ki < k; ++ki) {
sum += A[i * k + ki] * B[ki * n + j];
}
C[i * n + j] +=sum;
}
}
Note that you can use collapse(2) clause, which may increase the performance.

Find inverse of matrix using b^nth power

I've searched for hours and spent many more trying to figure how to fix this problem. I need to find the inverse of a predefined matrix using
A^-1 = I + (B + B^2 + ... + B^20) where B = I-A.
void invA(double a[][3], double id[][3], double z[][3])
{
int i, j, n, k;
double pb[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
double temp[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
double b[3][3];
temp[i][j] = 0;
b[i][j] = 0;
for(i = 0; i < 3; i++)
for (j = 0; j < 3; j++)
b[i][j] = id[i][j] - a[i][j];
for (n = 0; n < 20; n++) //run loop n times
{
for (i = 0; i < 3; i++) //find b to the power 20
for (j = 0; j < 3; j++)
for (k = 0; k < 3; k++)
temp[i][j] += pb[i][k] * b[k][j];
for (i = 0; i < 3; i++) //allocate pb from temp
for (j = 0; j < 3; j++)
pb[i][j] = temp[i][j];
for (i = 0; i < 3; i++) //summing b n time
for (j = 0; j < 3; j++) //to find inverse
z[i][j] = z[i][j] + pb[i][j];
}
}
Matrix a is the defined matrix, id is the identity and z is the inverse (result). I can't seem to figure out where I've gone wrong.
You have few problems.
First, temp[i][j] = 0; and b[i][j] = 0; at the beginning of the function use uninitialized variables i and j. The behaviour is undefined, and who knows how temp is actually initialized.
Then, temp must be reinitialized to a zero matrix at each iteration. I don't know what exactly does your code compute, but it is not a power for sure.
Finally, (unless z is initialized to I), you are missing the initial term.
All that said, I highly recommend to factor out most of the loops into functions: matAdd() and matMult(). Once they are unit tested, the rest is much simpler.

SSE memory access

I need to perform Gaussian Elimination using SSE and I am not sure how to access each element(32 bits) from the 128 bit registers(each storing 4 elements). This is the original code(without using SSE):
unsigned int i, j, k;
for (i = 0; i < num_elements; i ++) /* Copy the contents of the A matrix into the U matrix. */
for(j = 0; j < num_elements; j++)
U[num_elements * i + j] = A[num_elements*i + j];
for (k = 0; k < num_elements; k++){ /* Perform Gaussian elimination in place on the U matrix. */
for (j = (k + 1); j < num_elements; j++){ /* Reduce the current row. */
if (U[num_elements*k + k] == 0){
printf("Numerical instability detected. The principal diagonal element is zero. \n");
return 0;
}
/* Division step. */
U[num_elements * k + j] = (float)(U[num_elements * k + j] / U[num_elements * k + k]);
}
U[num_elements * k + k] = 1; /* Set the principal diagonal entry in U to be 1. */
for (i = (k+1); i < num_elements; i++){
for (j = (k+1); j < num_elements; j++)
/* Elimnation step. */
U[num_elements * i + j] = U[num_elements * i + j] -\
(U[num_elements * i + k] * U[num_elements * k + j]);
U[num_elements * i + k] = 0;
}
}
Okay I'm getting segmentation fault[core dumped] with this code. I'm new to SSE. Can someone help? Thanks.
int i,j,k;
__m128 a_i,b_i,c_i,d_i;
for (i = 0; i < num_rows; i++)
{
for (j = 0; j < num_rows; j += 4)
{
int index = num_rows * i + j;
__m128 v = _mm_loadu_ps(&A[index]); // load 4 x floats
_mm_storeu_ps(&U[index], v); // store 4 x floats
}
}
for (k = 0; k < num_rows; k++){
a_i= _mm_load_ss(&U[num_rows*k+k]);
for (j = (4*k + 1); j < num_rows; j+=4){
b_i= _mm_loadu_ps(&U[num_rows*k+j]);// Reduce the currentrow.
if (U[num_rows*k+k] == 0){
printf("Numerical instability detected.);
}
/* Division step. */
b_i = _mm_div_ps(b_i, a_i);
}
a_i = _mm_set_ss(1);
for (i = (k+1); i < num_rows; i++){
d_i= _mm_load_ss(&U[num_rows*i+k]);
for (j = (4*k+1); j < num_rows; j+=4){
c_i= _mm_loadu_ps(&U[num_rows*i+j]); /* Elimnation step. */
b_i= _mm_loadu_ps(&U[num_rows*k+j]);
c_i = _mm_sub_ps(c_i, _mm_mul_ss(b_i,d_i));
}
d_i= _mm_set_ss(0);
}
}
In order to get you started, your first loop should be more like this:
for (i = 0; i < num_elements; i++)
{
for (j = 0; j < num_elements; j += 4)
{
int index = num_elements * i + j;
__m128i v = _mm_loadu_ps((__m128i *)&A[index]); // load 4 x floats
_mm_storeu_ps((__m128i *)&U[index], v); // store 4 x floats
}
}
This assumes that num_elements is a multiple of 4, and that neither A nor U is correctly aligned.

Remove conditions in a C program for speedup

I have a number crunching C program which involves a main loop with two conditionals:
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
for (k = 0; k < N; k++) {
if (k == i || k == j) continue;
...(calculate a, b, c, d (depending on k)
if (a*a + b*b + c*c < d*d) {break;}
} //k
} //j
} //i
The hardware here is the SPE of the Cell processor, where there is a big penalty when using branching. So in order to optimize my program for speedup I need to remove these 2 conditionals, do you know about good strategies for this?
For the first one, you could break it into multiple loops, eg change:
for(int i = 0; i < 1000; i++)
for(int j = 0; j < 1000; j++) {
for(int k = 0; k < 1000; k++) {
if(k==i || k == j) continue;
// other code
}
}
to:
for(int i = 0; i < 1000; i++)
for(int j = 0; j < 1000; j++) {
for(int k = 0; k < min(i, j); k++) {
// other code
}
for(int k = min(i, j) + 1; k < max(i, j); k++) {
// other code
}
for(int k = max(i, j) + 1; k < 1000; k++) {
// other code
}
}
To remove the second, you could store the previous total and use it in the for loop conditions, i.e.:
int left_side = 1, right_side = 0;
for(int i = 0; i < N; i++)
for(int j = 0; j < N; j++) {
for(int k = 0; k < min(i, j) && left_side >= right_side; k++) {
// other code (calculate a, b, c, d)
left_side = a * a + b * b + c * c;
right_side = d * d;
}
for(int k = min(i, j) + 1; k < max(i, j) && left_side >= right_side; k++) {
// same as in previous loop
}
for(int k = max(i, j) + 1; k < N && left_side >= right_side; k++) {
// same as in previous loop
}
}
Implementing min and max without branching could also be tricky. Maybe this version is better:
int i, j, k,
left_side = 1, right_side = 0;
for(i = 0; i < N; i++) {
// this loop covers the case where j < i
for(j = 0; j < i; j++) {
k = 0;
for(; k < j && left_side >= right_side; k++) {
// other code (calculate a, b, c, d)
left_side = a * a + b * b + c * c;
right_side = d * d;
}
k++; // skip k == j
for(; k < i && left_side >= right_side; k++) {
// same as in previous loop
}
k++; // skip k == i
for(; k < N && left_side >= right_side; k++) {
// same as in previous loop
}
}
j++; // skip j == i
// and now, j > i
for(; j < N; j++) {
k = 0;
for(; k < i && left_side >= right_side; k++) {
// other code (calculate a, b, c, d)
left_side = a * a + b * b + c * c;
right_side = d * d;
}
k++; // skip k == i
for(; k < j && left_side >= right_side; k++) {
// same as in previous loop
}
k++; // skip k == j
for(; k < N && left_side >= right_side; k++) {
// same as in previous loop
}
}
}
I agree with 'sje397'.
Besides this, you provide too little information about your problem. You say branching is pricey. But how often does it actually happen? Maybe your problem is that compiler-generated code does branching in the common scenario?
Perhaps you could re-arrange your if-s. The implementation of the if is actually compiler-dependent, bust many compilers treat it in a straight-forward way. That is: if - common - else - rare (jump).
Then try the following:
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
for (k = 0; k < N; k++) {
if (k != i && k != j)
{
...(calculate a, b, c, d)
if (a*a + b*b + c*c >= d*d)
{
...
} else
break;
}
} //k
} //j
} //i
EDIT:
Of course you may go into assembler level to ensure correct code generated.
I would look first at your calculate code, because that could swamp all these branching issues. Some sampling would find out for sure.
However, it looks like you're doing, for each i,j, a linear search for the first point inside a sphere. Could you have 3 arrays, one for each of the X, Y, and Z axes, and in each array store indexes of all the original points in ascending order by that axis? That could facilitate a nearest-neighbor search. Also, you might be able to use an in-cube test, rather than an in-sphere test, since you're not hunting for the closest point, but only a nearby point.
Are you sure you actually need the first if-statement? Even if it jumps one calculation when k equals i or j, the penalty for checking it every iteration is very costly. Also, keep in mind that if N is not a constant, the compiler probably wont be able to unroll the for loops.
Although, if it's a cell processor, the compiler might even try to vectorize the loops.
If the for loops compiles to normal iterative loops it could be an idea to make them compare with zero instead, as the decrement operation will often do the comparison for you when it hits zero.
for (i = 0; i < N; i++) {
...can become...
for (i = N; i != 0; i--) {
Although, if "i" is used as an index or a variable in a calculation, you might get performance degradation as you will get cache misses.

Resources