How to convert this C code into Assembly using intrinsics? - c

This is my first time using intrinsics, and I have to convert the C code below into C code that uses intrinsics. I don't know where to start.
void slow_routine(float alpha, float beta){
    unsigned int i, j;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            A[i][j] = A[i][j] + u1[i] * v1[j] + u2[i] * v2[j];

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            x[i] = x[i] + beta * A[j][i] * y[j];

    for (i = 0; i < N; i++)
        x[i] = x[i] + z[i];

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            w[i] = w[i] + alpha * A[i][j] * x[j];
}
Any of the following can be used: x86-64 SSE/SSE2/SSE4/AVX/AVX2
Please help; I have been trying for hours and I'm very stuck.
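Not part of the original post, but as a starting point, here is a minimal sketch of how the first loop could be written with SSE intrinsics. It assumes N is a multiple of 4 and that A, u1, v1, u2, v2 are the global float arrays used above:

#include <immintrin.h>

/* Sketch only: vectorizes the rank-2 update loop, 4 floats at a time.
   Assumes N % 4 == 0 and that A, u1, v1, u2, v2 are global float arrays. */
void rank2_update_sse(void)
{
    unsigned int i, j;
    for (i = 0; i < N; i++) {
        __m128 u1_i = _mm_set1_ps(u1[i]);          /* broadcast u1[i] to all 4 lanes */
        __m128 u2_i = _mm_set1_ps(u2[i]);          /* broadcast u2[i] to all 4 lanes */
        for (j = 0; j < N; j += 4) {
            __m128 a   = _mm_loadu_ps(&A[i][j]);   /* load 4 floats of A */
            __m128 v1j = _mm_loadu_ps(&v1[j]);     /* load 4 floats of v1 */
            __m128 v2j = _mm_loadu_ps(&v2[j]);     /* load 4 floats of v2 */
            a = _mm_add_ps(a, _mm_mul_ps(u1_i, v1j));
            a = _mm_add_ps(a, _mm_mul_ps(u2_i, v2j));
            _mm_storeu_ps(&A[i][j], a);            /* store the updated 4 floats */
        }
    }
}

The remaining loops follow the same pattern; note that the second one (x[i] += beta * A[j][i] * y[j]) reads A by columns, so it is easier to vectorize after swapping the loop order so that the inner loop runs over i.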

Related

Matrix Multiplication function won't compile inside the main

This is a simple matrix multiplication, but the code won't compile. I also want to move the function outside of main. I know I have to have global variables and a function declaration, but the code won't even compile inside main.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define N 10

float *A[N], *B[N], *C[N];

int main(){
    int count = 0, i, j, k;
    for (i = 0; i < N; i++)
        A[i] = (float*)malloc(N * sizeof(float));
    for (i = 0; i < N; i++)
        B[i] = (float*)malloc(N * sizeof(float));
    for (i = 0; i < N; i++)
        C[i] = (float*)malloc(N * sizeof(float));
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++){
            A[i][j] = ++count;
            B[i][j] = count;
        }
    void multiply(float* A, float* B, int n) {
        for(int i=0; i<n; i++)
            for(int j=0; j<n; j++)
                for(int k=0; k<n; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            printf("%f\t", C[i][j]);
}
Defining your multiply function inside main makes it a nested function, which is not allowed in standard C. You can call as many functions as you like inside a function, but you cannot define a new function inside an existing one.
You have also not called the multiply function anywhere.
A, B and C are defined at the top as globals, each an array of pointers to float. Inside your multiply function you have locally redefined A and B as plain pointers to float, and that is why the multiplication fails inside multiply: the body still indexes them as if they were arrays of pointers to float, but the parameters have shadowed the globals with a different type.
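To make the type mismatch concrete (this restatement is mine, not the answerer's):

float *A[N];                              /* global: an array of N pointers to float            */
void multiply(float* A, float* B, int n); /* parameters: plain pointers to float that shadow it */

Inside multiply, A[i] is then a float, so A[i][k] subscripts a float, which is why the body fails to compile.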
Does this do what you were expecting?
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define N 10

float *A[N], *B[N], *C[N];

void multiply( float *a[], float *b[], float *c[], int n )
{
    for(int i=0; i<n; i++)
        for(int j=0; j<n; j++){
            c[i][j] = 0;                     /* malloc'd memory is not zeroed */
            for(int k=0; k<n; k++)
                c[i][j] += (a[i][k]) * b[k][j];
        }
}

int main(){
    int count = 0, i, j, k;
    for (i = 0; i < N; i++)
        A[i] = (float*)malloc(N * sizeof(float));
    for (i = 0; i < N; i++)
        B[i] = (float*)malloc(N * sizeof(float));
    for (i = 0; i < N; i++)
        C[i] = (float*)malloc(N * sizeof(float));
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
        {
            A[i][j] = ++count;
            B[i][j] = count;
        }
    multiply(A, B, C, N);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            printf("%f\t", C[i][j]);
}
If so, please explain to me how this works.

why is this matrix multiplication algorithm faster than the other?

int mmult_omp(double *c,
              double *a, int aRows, int aCols,
              double *b, int bRows, int bCols, int numThreads)
{
    int i, j, k;

    /* First version */
    for (i = 0; i < aRows; i++) {
        for (j = 0; j < bCols; j++) {
            c[i*bCols + j] = 0;
        }
        for (k = 0; k < aCols; k++) {
            for (j = 0; j < bCols; j++) {
                c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];
            }
        }
    }

    /* Second version */
    for (i = 0; i < aRows; i++) {
        for (j = 0; j < bCols; j++) {
            c[i*bCols + j] = 0;
            for (k = 0; k < aCols; k++) {
                c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];
            }
        }
    }
    return 0;
}
Why is the first algorithm faster than the second?
I’ve used C’s time library and the first algorithm is objectively faster than the second. Why is that?
This code is very hard to read; I had to copy it and reformat it to see which loops were which. I'm not really sure why one is faster, but here's a great resource to see why.
Here are links to inspect the assembly output:
link for #1
link for #2
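For what it's worth (this note is not from the original answer), the usual explanation for this pattern is cache locality. Here are the two inner loop nests from the question with the access stride on b annotated:

/* First version: with i and k fixed, j walks b[k*bCols + j] with stride 1,
   so every element of each loaded cache line is used. */
for (k = 0; k < aCols; k++)
    for (j = 0; j < bCols; j++)
        c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];   /* stride 1 through b */

/* Second version: with i and j fixed, k walks b[k*bCols + j] with stride bCols,
   touching a different cache line on almost every iteration. */
for (j = 0; j < bCols; j++)
    for (k = 0; k < aCols; k++)
        c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];   /* stride bCols through b */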

Find inverse of matrix using b^nth power

I've searched for hours and spent many more trying to figure out how to fix this problem. I need to find the inverse of a predefined matrix using
A^-1 = I + (B + B^2 + ... + B^20), where B = I - A.
void invA(double a[][3], double id[][3], double z[][3])
{
    int i, j, n, k;
    double pb[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
    double temp[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
    double b[3][3];

    temp[i][j] = 0;
    b[i][j] = 0;

    for(i = 0; i < 3; i++)
        for (j = 0; j < 3; j++)
            b[i][j] = id[i][j] - a[i][j];

    for (n = 0; n < 20; n++)                /* run loop n times */
    {
        for (i = 0; i < 3; i++)             /* find b to the power 20 */
            for (j = 0; j < 3; j++)
                for (k = 0; k < 3; k++)
                    temp[i][j] += pb[i][k] * b[k][j];

        for (i = 0; i < 3; i++)             /* copy temp into pb */
            for (j = 0; j < 3; j++)
                pb[i][j] = temp[i][j];

        for (i = 0; i < 3; i++)             /* summing b n times */
            for (j = 0; j < 3; j++)         /* to find the inverse */
                z[i][j] = z[i][j] + pb[i][j];
    }
}
Matrix a is the defined matrix, id is the identity and z is the inverse (result). I can't seem to figure out where I've gone wrong.
You have a few problems.
First, temp[i][j] = 0; and b[i][j] = 0; at the beginning of the function use the uninitialized variables i and j. The behaviour is undefined, and who knows how temp is actually initialized.
Then, temp must be reinitialized to a zero matrix at each iteration. I don't know what exactly your code computes, but it is certainly not a power.
Finally, unless z is initialized to I, you are missing the initial identity term.
All that said, I highly recommend factoring most of the loops out into functions: matAdd() and matMult() (see the sketch below). Once they are unit tested, the rest is much simpler.
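A minimal sketch of those helpers (the names matAdd and matMult come from the answer; the 3x3 signatures and bodies are my assumption):

/* r = a * b for 3x3 matrices; r must not alias a or b. */
void matMult(double r[3][3], double a[3][3], double b[3][3])
{
    int i, j, k;
    for (i = 0; i < 3; i++)
        for (j = 0; j < 3; j++) {
            r[i][j] = 0.0;                  /* reset before accumulating */
            for (k = 0; k < 3; k++)
                r[i][j] += a[i][k] * b[k][j];
        }
}

/* r += a for 3x3 matrices. */
void matAdd(double r[3][3], double a[3][3])
{
    int i, j;
    for (i = 0; i < 3; i++)
        for (j = 0; j < 3; j++)
            r[i][j] += a[i][j];
}

With these, each iteration of the n loop reduces to matMult(temp, pb, b), copying temp into pb, and matAdd(z, pb), and the re-initialization bugs become much harder to make.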

SSE memory access

I need to perform Gaussian elimination using SSE, and I am not sure how to access each element (32 bits) from the 128-bit registers (each storing 4 elements). This is the original code (without using SSE):
unsigned int i, j, k;

/* Copy the contents of the A matrix into the U matrix. */
for (i = 0; i < num_elements; i++)
    for (j = 0; j < num_elements; j++)
        U[num_elements * i + j] = A[num_elements * i + j];

/* Perform Gaussian elimination in place on the U matrix. */
for (k = 0; k < num_elements; k++){
    for (j = (k + 1); j < num_elements; j++){   /* Reduce the current row. */
        if (U[num_elements * k + k] == 0){
            printf("Numerical instability detected. The principal diagonal element is zero. \n");
            return 0;
        }
        /* Division step. */
        U[num_elements * k + j] = (float)(U[num_elements * k + j] / U[num_elements * k + k]);
    }
    U[num_elements * k + k] = 1;                /* Set the principal diagonal entry in U to be 1. */
    for (i = (k+1); i < num_elements; i++){
        for (j = (k+1); j < num_elements; j++)
            /* Elimination step. */
            U[num_elements * i + j] = U[num_elements * i + j] -
                (U[num_elements * i + k] * U[num_elements * k + j]);
        U[num_elements * i + k] = 0;
    }
}
Okay, I'm getting a segmentation fault (core dumped) with this code. I'm new to SSE. Can someone help? Thanks.
int i, j, k;
__m128 a_i, b_i, c_i, d_i;

for (i = 0; i < num_rows; i++)
{
    for (j = 0; j < num_rows; j += 4)
    {
        int index = num_rows * i + j;
        __m128 v = _mm_loadu_ps(&A[index]);   // load 4 x floats
        _mm_storeu_ps(&U[index], v);          // store 4 x floats
    }
}

for (k = 0; k < num_rows; k++){
    a_i = _mm_load_ss(&U[num_rows*k + k]);
    for (j = (4*k + 1); j < num_rows; j += 4){
        b_i = _mm_loadu_ps(&U[num_rows*k + j]);     // Reduce the current row.
        if (U[num_rows*k + k] == 0){
            printf("Numerical instability detected.");
        }
        /* Division step. */
        b_i = _mm_div_ps(b_i, a_i);
    }
    a_i = _mm_set_ss(1);
    for (i = (k+1); i < num_rows; i++){
        d_i = _mm_load_ss(&U[num_rows*i + k]);
        for (j = (4*k + 1); j < num_rows; j += 4){
            c_i = _mm_loadu_ps(&U[num_rows*i + j]); /* Elimination step. */
            b_i = _mm_loadu_ps(&U[num_rows*k + j]);
            c_i = _mm_sub_ps(c_i, _mm_mul_ss(b_i, d_i));
        }
        d_i = _mm_set_ss(0);
    }
}
In order to get you started, your first loop should be more like this:
for (i = 0; i < num_elements; i++)
{
    for (j = 0; j < num_elements; j += 4)
    {
        int index = num_elements * i + j;
        __m128 v = _mm_loadu_ps(&A[index]);   // load 4 x floats
        _mm_storeu_ps(&U[index], v);          // store 4 x floats
    }
}
This assumes that num_elements is a multiple of 4; because it uses unaligned loads and stores, it does not require A or U to be 16-byte aligned.
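Going one step further (this part is my sketch, not the original answer): the division step can be handled in the same way, by broadcasting the pivot with _mm_set1_ps and, crucially, storing the result back to memory, which the posted SSE attempt never does. This reuses the j and k from the surrounding loops and again assumes nothing about alignment:

__m128 pivot = _mm_set1_ps(U[num_elements * k + k]);       /* broadcast the pivot to all 4 lanes */
for (j = k + 1; j + 4 <= num_elements; j += 4){
    __m128 row = _mm_loadu_ps(&U[num_elements * k + j]);
    row = _mm_div_ps(row, pivot);
    _mm_storeu_ps(&U[num_elements * k + j], row);          /* write the result back into U */
}
for (; j < num_elements; j++)                              /* scalar tail for the leftover elements */
    U[num_elements * k + j] /= U[num_elements * k + k];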

I'm not seeing a performance boost while using the optimised memory-bandwidth method

I was presented with an example of a loop which should be slower than the one after it:
for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}
Compared to this one:
for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
Now, I coded a tool to test this with a number of different index values, but I'm not seeing much of a performance advantage after trying this concept, and I'm afraid my code has something to do with it...
The loop that should be slower, as it appears in my code:
for (i = 0; i < val; i++){
    column_sum[i] = 0.0;
    for (j = 0; j < val; j++){
        int index = i * (int)val + j;
        column_sum[i] += p[index];
    }
}
The code that should be "significantly" faster:
for (i = 0; i < val; i++) {
    column_sum[i] = 0.0;
}
for (j = 0; j < val; j++) {
    for (i = 0; i < val; i++) {
        int index = j * (int)val + i;
        column_sum[i] += p[index];
    }
}
Data comparison:
I had confused the index values in the loops: int index = j * (int)val + i. With that fixed:
Slower loop:
for (i = 0; i < val; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < val; j++){
        int index = j * (int)val + i;
        column_sum[i] += p[index];
    }
}
Faster loop:
for (i = 0; i < val; i++) {
    column_sum[i] = 0.0;
}
for (j = 0; j < val; j++) {
    for (i = 0; i < val; i++) {
        int index = j * (int)val + i;
        column_sum[i] += p[index];
    }
}
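For what it's worth (this note is not part of the original question), with the corrected indexing the difference comes down to the stride used on p. Here are the two inner accesses with the strides annotated:

/* Slower version: for a fixed i, j varies, so index = j*val + i
   jumps by val elements per iteration, hitting a new cache line
   almost every time. */
column_sum[i] += p[j * (int)val + i];   /* stride val */

/* Faster version: for a fixed j, i varies, so index = j*val + i
   advances by 1 element per iteration (sequential, cache friendly). */
column_sum[i] += p[j * (int)val + i];   /* stride 1 */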
