SSE memory access - c

I need to perform Gaussian elimination using SSE, and I am not sure how to access each 32-bit element in the 128-bit registers (each register stores 4 elements). This is the original code, without SSE:
unsigned int i, j, k;
for (i = 0; i < num_elements; i ++) /* Copy the contents of the A matrix into the U matrix. */
for(j = 0; j < num_elements; j++)
U[num_elements * i + j] = A[num_elements*i + j];
for (k = 0; k < num_elements; k++){ /* Perform Gaussian elimination in place on the U matrix. */
for (j = (k + 1); j < num_elements; j++){ /* Reduce the current row. */
if (U[num_elements*k + k] == 0){
printf("Numerical instability detected. The principal diagonal element is zero. \n");
return 0;
}
/* Division step. */
U[num_elements * k + j] = (float)(U[num_elements * k + j] / U[num_elements * k + k]);
}
U[num_elements * k + k] = 1; /* Set the principal diagonal entry in U to be 1. */
for (i = (k+1); i < num_elements; i++){
for (j = (k+1); j < num_elements; j++)
/* Elimination step. */
U[num_elements * i + j] = U[num_elements * i + j] -\
(U[num_elements * i + k] * U[num_elements * k + j]);
U[num_elements * i + k] = 0;
}
}
I'm getting a segmentation fault (core dumped) with the SSE version below. I'm new to SSE. Can someone help? Thanks.
int i,j,k;
__m128 a_i,b_i,c_i,d_i;
for (i = 0; i < num_rows; i++)
{
for (j = 0; j < num_rows; j += 4)
{
int index = num_rows * i + j;
__m128 v = _mm_loadu_ps(&A[index]); // load 4 x floats
_mm_storeu_ps(&U[index], v); // store 4 x floats
}
}
for (k = 0; k < num_rows; k++){
a_i= _mm_load_ss(&U[num_rows*k+k]);
for (j = (4*k + 1); j < num_rows; j+=4){
b_i= _mm_loadu_ps(&U[num_rows*k+j]);// Reduce the currentrow.
if (U[num_rows*k+k] == 0){
printf("Numerical instability detected.\n");
}
/* Division step. */
b_i = _mm_div_ps(b_i, a_i);
}
a_i = _mm_set_ss(1);
for (i = (k+1); i < num_rows; i++){
d_i= _mm_load_ss(&U[num_rows*i+k]);
for (j = (4*k+1); j < num_rows; j+=4){
c_i= _mm_loadu_ps(&U[num_rows*i+j]); /* Elimination step. */
b_i= _mm_loadu_ps(&U[num_rows*k+j]);
c_i = _mm_sub_ps(c_i, _mm_mul_ss(b_i,d_i));
}
d_i= _mm_set_ss(0);
}
}

In order to get you started, your first loop should be more like this:
for (i = 0; i < num_elements; i++)
{
for (j = 0; j < num_elements; j += 4)
{
int index = num_elements * i + j;
__m128 v = _mm_loadu_ps(&A[index]); // load 4 x floats (unaligned)
_mm_storeu_ps(&U[index], v); // store 4 x floats (unaligned)
}
}
This assumes that num_elements is a multiple of 4. It uses the unaligned load and store intrinsics, so neither A nor U needs to be 16-byte aligned.
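The division step can be vectorized in the same way. The following is a minimal sketch, not the original poster's code: scale_row_sse is a hypothetical helper name, and the scalar tail loop is an assumption added so that the row length does not have to be a multiple of 4. It broadcasts the pivot to all four lanes with _mm_set1_ps (note that _mm_load_ss fills only the lowest lane and zeroes the other three), and it stores the result back to memory, which the code in the question never does.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Divide row[start..n-1] by pivot, four floats at a time.
 * Hypothetical helper for illustration only. */
void scale_row_sse(float *row, size_t start, size_t n, float pivot)
{
    __m128 p = _mm_set1_ps(pivot); /* pivot copied into all four lanes */
    size_t j = start;

    /* Vector part: only runs while a full group of 4 floats fits,
     * so no load or store reaches past row[n-1]. */
    for (; j + 4 <= n; j += 4) {
        __m128 v = _mm_loadu_ps(&row[j]); /* unaligned load */
        v = _mm_div_ps(v, p);
        _mm_storeu_ps(&row[j], v);        /* write the result back */
    }

    /* Scalar tail: the remaining 0 to 3 elements. */
    for (; j < n; j++)
        row[j] /= pivot;
}
```

A call such as scale_row_sse(&U[num_elements * k], k + 1, num_elements, U[num_elements * k + k]) would correspond to the division step for row k, assuming the pivot has already been checked against zero. Bounding the vector loop with j + 4 <= n is one way to avoid reading past the end of the matrix, which is a plausible source of the segmentation fault in the question.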

Related

How to convert this C code into Assembly using intrinsics?

This is my first time using intrinsics, and I have to convert the C code below into C code that uses SIMD intrinsics. I don't know where to start.
void slow_routine(float alpha, float beta){
unsigned int i,j;
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
A[i][j] = A[i][j] + u1[i] * v1[j] + u2[i] * v2[j];
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
x[i] = x[i] + beta * A[j][i] * y[j];
for (i = 0; i < N; i++)
x[i] = x[i] + z[i];
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
w[i] = w[i] + alpha * A[i][j] * x[j];
}
Any of the following can be used: x86-64 SSE/SSE2/SSE4/AVX/AVX2
Please help I have been trying for hours and I'm very stuck.

why is this matrix multiplication algorithm faster than the other?

int mmult_omp(double *c,
double *a, int aRows, int aCols,
double *b, int bRows, int bCols, int numThreads)
{
int i, j, k;
/* Algorithm 1: loop order i, k, j */
for (i = 0; i < aRows; i++) {
for (j = 0; j < bCols; j++) {
c[i*bCols + j] = 0;
}
for (k = 0; k < aCols; k++) {
for (j = 0; j < bCols; j++) {
c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];
}
}
}
/* Algorithm 2: loop order i, j, k */
for (i = 0; i < aRows; i++) {
for (j = 0; j < bCols; j++) {
c[i*bCols + j] = 0;
for (k = 0; k < aCols; k++) {
c[i*bCols + j] += a[i*aCols + k] * b[k*bCols + j];
}
}
}
return 0;
}
Why is the first algorithm faster than the second?
I’ve used C’s time library and the first algorithm is objectively faster than the second. Why is that?
This code is very hard to understand. I had to copy it and reformat it to see what loops were what. I'm not really sure why one is faster but here's a great resource to see why.
Here are links to inspect the assembly output:
link for #1
link for #2
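One likely explanation is memory access order, and the two loop nests can be separated to make it visible. This is a sketch under assumptions: the function names mmult_ikj and mmult_ijk are hypothetical, and the unused bRows and numThreads parameters from the question are dropped. In the i, k, j order the innermost loop walks both c and b with stride 1. In the i, j, k order the innermost loop reads b with a stride of bCols doubles, which tends to use the cache poorly and is harder for a compiler to vectorize.

```c
/* Loop order i, k, j: the inner loop touches c[i*bCols + j] and
 * b[k*bCols + j] for consecutive j, i.e. contiguous memory. */
void mmult_ikj(double *c, const double *a, int aRows, int aCols,
               const double *b, int bCols)
{
    for (int i = 0; i < aRows; i++) {
        for (int j = 0; j < bCols; j++)
            c[i * bCols + j] = 0.0;
        for (int k = 0; k < aCols; k++) {
            double aik = a[i * aCols + k]; /* constant over the j loop */
            for (int j = 0; j < bCols; j++)
                c[i * bCols + j] += aik * b[k * bCols + j];
        }
    }
}

/* Loop order i, j, k: the inner loop reads b[k*bCols + j] for
 * consecutive k, jumping bCols doubles on every iteration. */
void mmult_ijk(double *c, const double *a, int aRows, int aCols,
               const double *b, int bCols)
{
    for (int i = 0; i < aRows; i++) {
        for (int j = 0; j < bCols; j++) {
            double sum = 0.0;
            for (int k = 0; k < aCols; k++)
                sum += a[i * aCols + k] * b[k * bCols + j];
            c[i * bCols + j] = sum;
        }
    }
}
```

Both functions compute the same product, so timing them on the same inputs isolates the effect of the loop order. How large the difference is depends on the matrix sizes, the compiler flags, and the cache hierarchy.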

C, matrix transposed multiplication using dynamic memory allocation

I'm trying to transpose and multiply some matrices, basically
I get 2 matrices, matrixA and matrixB the action to be performed is trace(transpose(matrixA)*matrixB).
I managed to get this working for nxn matrices but I can't get it to work with mxn where (n>m or m>n).
I've looked online for solutions but I can't adapt their solutions to my code.
I removed almost all the code to simplify reading, if you prefer the entire code I linked it here.
If you do want to run the entire code, to recreate the problem use the following commands:
zeroes matrixA 2 3
zeroes matrixB 2 3
set matrixA
1 2 3 4 5 6
set matrixB
6 5 4 3 2 1
frob matrixA matrixB
The above commands should return Sum 56 but instead I get Sum 18
int* matrixATransposed = (int*) malloc(matrixARowLenght * matrixAColLenght * sizeof(int));
for (int i = 0; i < matrixARowLenght; i++)
{
for (int j = 0; j < matrixAColLenght; j++)
{
*(matrixATransposed + i * matrixAColLenght + j) = *(matrixA + j * matrixAColLenght + i);
}
}
// Multiply
int* mulRes = (int*)malloc(matrixARowLenght * matrixAColLenght * sizeof(int));
for (int i = 0; i < matrixAColLenght; i++) {
for (int j = 0; j < matrixBColLenght; j++) {
*(mulRes + i * matrixARowLenght + j) = 0;
for (int k = 0; k < matrixARowLenght; k++)
*(mulRes + i * matrixAColLenght + j) += *(matrixATransposed + i * matrixAColLenght + k) * *(matrixB + k * matrixBColLenght + j);
}
}
// Sum the trace
int trace = 0;
for (int i = 0; i < matrixARowLenght; i++) {
for (int j = 0; j < matrixAColLenght; j++) {
if (i == j) {
trace += *(mulRes + i * matrixAColLenght + j);
}
}
}
printf_s("Sum: %d\n", trace);
Your array indices for calculating the transpose, multiplication, and the trace seem to be incorrect. I've corrected them in the following program:
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv) {
int matrixARowLenght = 2;
int matrixAColLenght = 3;
int matrixA[] = {1,2,3,4,5,6};
int matrixBRowLenght = 2;
int matrixBColLenght = 3;
int matrixB[] = {6,5,4,3,2,1};
// Transpose
int* matrixATransposed = (int *) malloc(matrixARowLenght * matrixAColLenght * sizeof(int));
for (int i = 0; i < matrixAColLenght; i++) {
for (int j = 0; j < matrixARowLenght; j++) {
*(matrixATransposed + i * matrixARowLenght + j) = *(matrixA + j * matrixAColLenght + i);
}
}
// Multiply
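// The product transpose(A) * B has matrixAColLenght rows and matrixBColLenght columns (3 x 3 here)
int *mulRes = (int *) malloc(matrixAColLenght * matrixBColLenght * sizeof(int));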
for (int i = 0; i < matrixAColLenght; ++i) {
for (int j = 0; j < matrixAColLenght; ++j) {
*(mulRes + (i * matrixAColLenght) + j) = 0;
for (int k = 0; k < matrixARowLenght; ++k) {
*(mulRes + (i * matrixAColLenght) + j) += *(matrixATransposed + (i * matrixARowLenght) + k) * *(matrixB + (k * matrixAColLenght) + j);
}
}
}
free(matrixATransposed);
// Sum the trace
int trace = 0;
for (int i = 0; i < matrixAColLenght; i++) {
for (int j = 0; j < matrixAColLenght; j++) {
if (i == j) {
trace += *(mulRes + i * matrixAColLenght + j);
}
}
}
printf("Sum: %d\n", trace);
free(mulRes);
return 0;
}
The above program will output your expected value:
Sum: 56
** UPDATE **
As pointed out by MFisherKDX, the above code will not work if the result matrix is not square. The following code fixes this issue:
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv) {
int matrixARowLenght = 2;
int matrixAColLenght = 3;
int matrixA[] = {1,2,3,4,5,6};
int matrixBRowLenght = 2;
int matrixBColLenght = 4;
int matrixB[] = {8,7,6,5,4,3,2,1};
// Transpose
int* matrixATransposed = (int *) malloc(matrixARowLenght * matrixAColLenght * sizeof(int));
for (int i = 0; i < matrixAColLenght; i++) {
for (int j = 0; j < matrixARowLenght; j++) {
*(matrixATransposed + i * matrixARowLenght + j) = *(matrixA + j * matrixAColLenght + i);
}
}
// Multiply
int *mulRes = (int *) malloc(matrixAColLenght * matrixBColLenght * sizeof(int));
for (int i = 0; i < matrixAColLenght; ++i) {
for (int j = 0; j < matrixBColLenght; ++j) {
*(mulRes + (i * matrixBColLenght) + j) = 0;
for (int k = 0; k < matrixARowLenght; ++k) {
*(mulRes + (i * matrixBColLenght) + j) += *(matrixATransposed + (i * matrixARowLenght) + k) * *(matrixB + (k * matrixBColLenght) + j);
}
}
}
free(matrixATransposed);
// Sum the trace
int trace = 0;
for (int i = 0; i < matrixAColLenght; i++) {
for (int j = 0; j < matrixBColLenght; j++) {
if (i == j) {
trace += *(mulRes + i * matrixBColLenght + j);
}
}
}
printf("Sum: %d\n", trace);
free(mulRes);
return 0;
}
This code will output the following as expected:
Sum: 83
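Since only the diagonal of transpose(A) * B is summed, the transpose and the full product are not strictly needed. The following is a minimal sketch of that shortcut, assuming row-major storage and that A and B have the same number of rows; trace_at_b is a hypothetical helper name, not part of the code above. Diagonal entry i of the product is the dot product of column i of A with column i of B.

```c
/* Sum of the main diagonal of transpose(A) * B without forming either
 * the transpose or the product.
 * a is rows x aCols, b is rows x bCols, both row-major.
 * Hypothetical helper for illustration only. */
int trace_at_b(const int *a, const int *b, int rows, int aCols, int bCols)
{
    /* The product is aCols x bCols, so its diagonal has
     * min(aCols, bCols) entries. */
    int diag = aCols < bCols ? aCols : bCols;
    int trace = 0;

    for (int i = 0; i < diag; i++)
        for (int k = 0; k < rows; k++)
            trace += a[k * aCols + i] * b[k * bCols + i];

    return trace;
}
```

With the two data sets used above, trace_at_b(matrixA, matrixB, 2, 3, 3) gives 56 and trace_at_b(matrixA, matrixB, 2, 3, 4) gives 83. It also avoids both malloc calls.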

Why does my code return -nan in visual studio, but not in Linux?

My Gauss Elimination code's results are -nan in visual studio, but not in Linux.
The Linux results also look wrong: in Gauss_Eli, no matter how far I raise the upper bound on k in the for loops, the function keeps running and never triggers a segmentation fault.
What is wrong with my code?
float ** Gauss_Eli(float ** matrix, int n) {
// -----------------------------------------------------
// | |
// | Eliminate elements except (i, i) element |
// | |
// -----------------------------------------------------
// Eliminate elements at lower triangle part
for (int i = 0; i < n; i++) {
for (int j = i + 1; j < n; j++) {
for (int k = 0; k < n + 1; k++) {
float e;
e = matrix[i][k] * (matrix[j][i] / matrix[i][i]);
matrix[j][k] -= e;
}
}
}
// Eliminate elements at upper triangle part
for (int i = n - 1; i >= 0; i--) {
for (int j = i - 1; j >= 0; j--) {
for (int k = 0; k < n + 1; k++) {
float e;
e = matrix[i][k] * (matrix[j][i] / matrix[i][i]);
matrix[j][k] -= e;
}
}
}
// Make 1 elements i, i
for (int i = 0; i < n; i++)
for (int j = 0; j < n + 1; j++) matrix[i][j] /= matrix[i][i];
return matrix;
}
int main() {
float ** matrix;
int n;
printf("Matrix Size : ");
scanf("%d", &n);
// Malloc variable matrix for Matrix
matrix = (float**)malloc(sizeof(float) * n);
for (int i = 0; i < n; i++) matrix[i] = (float*)malloc(sizeof(float) * (n + 1));
printf("Input elements : \n");
for (int i = 0; i < n; i++)
for (int j = 0; j < n + 1; j++) scanf("%f", &matrix[i][j]);
matrix = Gauss_Eli(matrix, n);
printf("Output result : \n");
//Print matrix after elimination
for (int i = 0; i < n; i++) {
for (int j = 0; j < n + 1; j++) printf("%.6f ", matrix[i][j]);
printf("\n");
}
return 0;
}
1.) OP allocates memory using the wrong type. This may lead to issues of insufficient memory and all sorts of UB and explain the difference between systems as they could have differing pointer and float sizes.
float ** matrix;
// v--- wrong type
// matrix = (float**)malloc(sizeof(float) * n);
Instead allocate to the size of the referenced variable. Easier to code (and get right), review and maintain.
matrix = malloc(sizeof *matrix * n);
if (matrix == NULL) Handle_Error();
2.) Code should look for division by 0.0
//for (int k = 0; k < n + 1; k++) {
// float e;
// e = matrix[i][k] * (matrix[j][i] / matrix[i][i]);
// matrix[j][k] -= e;
//}
if (matrix[i][i] == 0.0) Handle_Error();
float m = matrix[j][i] / matrix[i][i];
for (int k = 0; k < n + 1; k++) {
matrix[j][k] -= matrix[i][k]*m;
}
3.) General problem solving tips:
Check the return value of scanf("%f", &matrix[i][j]);. Is it 1?
Enable all warnings.
Especially for debug, print FP using "%e" rather than "%f".
4.) Numerical analysis tip: Ensure exact subtraction when i==j
if (i == j) {
for (int k = 0; k < n + 1; k++) {
matrix[j][k] = 0.0;
}
}
else {
if (matrix[i][i] == 0.0) Handle_Divide_by_0();
float m = matrix[j][i] / matrix[i][i];
for (int k = 0; k < n + 1; k++) {
matrix[j][k] -= matrix[i][k]*m;
}
}
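Point 1 can be sketched as a pair of helpers. This is an illustration under assumptions rather than the answerer's code: alloc_matrix and free_matrix are hypothetical names, and cleaning up on a partial failure is a design choice added here. Each malloc is sized from the object being assigned (sizeof *m for the row pointers, sizeof *m[i] for the floats), so the element type is never repeated and cannot be mismatched.

```c
#include <stdlib.h>

/* Allocate a rows x cols matrix of float as an array of row pointers.
 * Returns NULL on failure, after releasing anything already allocated.
 * Hypothetical helper for illustration only. */
float **alloc_matrix(int rows, int cols)
{
    float **m = malloc(sizeof *m * rows); /* rows pointers to float */
    if (m == NULL)
        return NULL;

    for (int i = 0; i < rows; i++) {
        m[i] = malloc(sizeof *m[i] * cols); /* cols floats per row */
        if (m[i] == NULL) {
            while (i-- > 0) /* free the rows allocated so far */
                free(m[i]);
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Release a matrix obtained from alloc_matrix. */
void free_matrix(float **m, int rows)
{
    if (m == NULL)
        return;
    for (int i = 0; i < rows; i++)
        free(m[i]);
    free(m);
}
```

In the question's main, matrix = alloc_matrix(n, n + 1); would replace both allocation lines, with a NULL check before the input loop.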

Find inverse of matrix using b^nth power

I've searched for hours and spent many more trying to figure how to fix this problem. I need to find the inverse of a predefined matrix using
A^-1 = I + (B + B^2 + ... + B^20) where B = I-A.
void invA(double a[][3], double id[][3], double z[][3])
{
int i, j, n, k;
double pb[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
double temp[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
double b[3][3];
temp[i][j] = 0;
b[i][j] = 0;
for(i = 0; i < 3; i++)
for (j = 0; j < 3; j++)
b[i][j] = id[i][j] - a[i][j];
for (n = 0; n < 20; n++) //run loop n times
{
for (i = 0; i < 3; i++) //find b to the power 20
for (j = 0; j < 3; j++)
for (k = 0; k < 3; k++)
temp[i][j] += pb[i][k] * b[k][j];
for (i = 0; i < 3; i++) //allocate pb from temp
for (j = 0; j < 3; j++)
pb[i][j] = temp[i][j];
for (i = 0; i < 3; i++) //summing b n time
for (j = 0; j < 3; j++) //to find inverse
z[i][j] = z[i][j] + pb[i][j];
}
}
Matrix a is the defined matrix, id is the identity and z is the inverse (result). I can't seem to figure out where I've gone wrong.
You have a few problems.
First, temp[i][j] = 0; and b[i][j] = 0; at the beginning of the function use uninitialized variables i and j. The behaviour is undefined, and who knows how temp is actually initialized.
Then, temp must be reinitialized to a zero matrix at each iteration. I don't know exactly what your code computes, but it is certainly not a power of b.
Finally, unless z is initialized to I, you are missing the initial term.
All that said, I highly recommend factoring most of the loops out into functions: matAdd() and matMult(). Once they are unit tested, the rest is much simpler.
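A minimal sketch of that refactoring follows. The names matAdd() and matMult() come from the suggestion above; their signatures, the DIM macro, and the invSeries wrapper are assumptions made for illustration. matMult writes into a separate output matrix and starts each entry from zero, which addresses the temp problem, and z starts as the identity so the initial term is included. The series only approximates the inverse when the powers of B = I - A shrink toward zero.

```c
#include <string.h>

#define DIM 3

/* c = a * b. The output c must not alias a or b. */
void matMult(double a[DIM][DIM], double b[DIM][DIM], double c[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double sum = 0.0; /* fresh accumulator for every entry */
            for (int k = 0; k < DIM; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* c = a + b. Aliasing is fine here because each entry is independent. */
void matAdd(double a[DIM][DIM], double b[DIM][DIM], double c[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            c[i][j] = a[i][j] + b[i][j];
}

/* z = I + B + B^2 + ... + B^terms, where B = I - a. */
void invSeries(double a[DIM][DIM], int terms, double z[DIM][DIM])
{
    double b[DIM][DIM], pb[DIM][DIM], tmp[DIM][DIM];

    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double id = (i == j) ? 1.0 : 0.0;
            b[i][j] = id - a[i][j];
            pb[i][j] = id; /* B^0 */
            z[i][j] = id;  /* initial term of the series */
        }

    for (int n = 1; n <= terms; n++) {
        matMult(pb, b, tmp);        /* tmp = B^n */
        memcpy(pb, tmp, sizeof pb); /* pb = B^n */
        matAdd(z, pb, z);           /* z += B^n */
    }
}
```

Calling invSeries(a, 20, z) corresponds to the formula in the question. Multiplying the result by a and comparing against the identity is a simple check on whether the series has converged for a given matrix.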

Resources