Vectorising a nested loop with AVX2 - C

I am trying to vectorise the inner loop of the following nested loop. Firstly, is this good practice, or should one avoid attempting to vectorise nested loops?
The following works; it already has some basic loop unrolling.
int sparsemv(struct mesh *A, const double * const x, double * const y) {
    const int nrow = (const int) A->local_nrow;
    int j = 0;
    double sum = 0.0;
    #pragma omp parallel for private(j, sum)
    for (int i = 0; i < nrow; i++) {
        sum = 0.0;
        const double * const cur_vals = (const double * const) A->ptr_to_vals_in_row[i];
        const int * const cur_inds = (const int * const) A->ptr_to_inds_in_row[i];
        const int cur_nnz = (const int) A->nnz_in_row[i];
        int unroll = (cur_nnz/4)*4;
        for (j = 0; j < unroll; j += 4) {
            sum += cur_vals[j] * x[cur_inds[j]];
            sum += cur_vals[j+1] * x[cur_inds[j+1]];
            sum += cur_vals[j+2] * x[cur_inds[j+2]];
            sum += cur_vals[j+3] * x[cur_inds[j+3]];
        }
        for (; j < cur_nnz; j++) {
            sum += cur_vals[j] * x[cur_inds[j]];
        }
        y[i] = sum;
    }
    return 0;
}
However, when I try to vectorise using 256-bit vector registers with AVX2, I get either incorrect answers or segfaults. x and y are aligned but A is not; for the moment, all loading and storing is done with unaligned operations, since that is the only way I avoid segfaults:
int sparsemv(struct mesh *A, const double * const x, double * const y) {
    const int nrow = (const int) A->local_nrow;
    int j = 0;
    double sum = 0.0;
    #pragma omp parallel for private(j, sum)
    for (int i = 0; i < nrow; i++) {
        sum = 0.0;
        const double * const cur_vals = (const double * const) A->ptr_to_vals_in_row[i];
        const int * const cur_inds = (const int * const) A->ptr_to_inds_in_row[i];
        const int cur_nnz = (const int) A->nnz_in_row[i];
        int unroll = (cur_nnz/4)*4;
        __m256d sumVec = _mm256_set1_pd(sum);
        for (j = 0; j < unroll; j += 4) {
            __m256d cur_valsVec = _mm256_loadu_pd(cur_vals + j);
            __m256d xVec = _mm256_loadu_pd(x + cur_inds[j]);
            sumVec = _mm256_add_pd(sumVec, _mm256_mul_pd(cur_valsVec, xVec));
        }
        _mm256_storeu_pd(y + i, sumVec); // Is this storing to y + i + 1, 2 and 3 as well?
        for (; j < cur_nnz; j++) {
            sum += cur_vals[j] * x[cur_inds[j]];
        }
        y[i] += sum;
    }
    return 0;
}
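For the record, two things in this AVX2 version diverge from the scalar code. _mm256_loadu_pd(x + cur_inds[j]) loads four consecutive doubles starting at x[cur_inds[j]], not the four indexed elements x[cur_inds[j]] .. x[cur_inds[j+3]], and _mm256_storeu_pd(y + i, sumVec) does indeed write y[i] through y[i+3], clobbering neighbouring rows and racing between threads. A minimal sketch of a corrected loop, assuming AVX2 is available and that cur_inds holds 32-bit indices, uses a hardware gather for the indexed loads and reduces sumVec to a scalar before a single store:

#include <immintrin.h>

int sparsemv(struct mesh *A, const double * const x, double * const y) {
    const int nrow = (int) A->local_nrow;

    #pragma omp parallel for
    for (int i = 0; i < nrow; i++) {
        const double * const cur_vals = (const double *) A->ptr_to_vals_in_row[i];
        const int * const cur_inds = (const int *) A->ptr_to_inds_in_row[i];
        const int cur_nnz = (int) A->nnz_in_row[i];
        const int unroll = (cur_nnz/4)*4;
        __m256d sumVec = _mm256_setzero_pd();
        int j;

        for (j = 0; j < unroll; j += 4) {
            __m256d valsVec = _mm256_loadu_pd(cur_vals + j);
            /* Gather x[cur_inds[j]] .. x[cur_inds[j+3]]; scale 8 = sizeof(double). */
            __m128i idx = _mm_loadu_si128((const __m128i *)(cur_inds + j));
            __m256d xVec = _mm256_i32gather_pd(x, idx, 8);
            sumVec = _mm256_add_pd(sumVec, _mm256_mul_pd(valsVec, xVec));
        }

        /* Horizontal reduction: four partial sums down to one scalar. */
        __m128d lo = _mm256_castpd256_pd128(sumVec);
        __m128d hi = _mm256_extractf128_pd(sumVec, 1);
        lo = _mm_add_pd(lo, hi);
        double sum = _mm_cvtsd_f64(_mm_add_sd(lo, _mm_unpackhi_pd(lo, lo)));

        for (; j < cur_nnz; j++)
            sum += cur_vals[j] * x[cur_inds[j]];

        y[i] = sum;  /* one scalar store per row: no overlap, no race */
    }
    return 0;
}

Two caveats: hardware gathers are not particularly fast on many AVX2 parts, so this is worth benchmarking against the unrolled scalar version, and where FMA is available _mm256_fmadd_pd can fuse the multiply and add.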

Related

Parallelise 2 for loops with OpenMP

So I have this function that I have to parallelize with OpenMP static scheduling for n threads:
void computeAccelerations(){
    int i, j;
    for(i=0; i<bodies; i++){
        accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
        for(j=0; j<bodies; j++){
            if(i!=j){
                //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
                vector sij = {positions[i].x-positions[j].x, positions[i].y-positions[j].y, positions[i].z-positions[j].z};
                vector sji = {positions[j].x-positions[i].x, positions[j].y-positions[i].y, positions[j].z-positions[i].z};
                double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
                double mod3 = mod * mod * mod;
                double s = GravConstant*masses[j]/mod3;
                vector S = {s*sji.x, s*sji.y, s*sji.z};
                accelerations[i].x += S.x; accelerations[i].y += S.y; accelerations[i].z += S.z;
            }
        }
    }
}
I tried to do something like:
void computeAccelerations_static(int num_of_threads){
    int i, j;
    #pragma omp parallel for num_threads(num_of_threads) schedule(static)
    for(i=0; i<bodies; i++){
        accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
        for(j=0; j<bodies; j++){
            if(i!=j){
                //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
                vector sij = {positions[i].x-positions[j].x, positions[i].y-positions[j].y, positions[i].z-positions[j].z};
                vector sji = {positions[j].x-positions[i].x, positions[j].y-positions[i].y, positions[j].z-positions[i].z};
                double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
                double mod3 = mod * mod * mod;
                double s = GravConstant*masses[j]/mod3;
                vector S = {s*sji.x, s*sji.y, s*sji.z};
                accelerations[i].x += S.x; accelerations[i].y += S.y; accelerations[i].z += S.z;
            }
        }
    }
}
It comes naturally to just add the #pragma omp parallel for num_threads(num_of_threads) schedule(static), but it isn't correct.
I think there is some kind of false sharing with accelerations[i], but I don't know how to approach it. I appreciate any kind of help. Thank you.
In your loop nest, only the iterations of the outer loop are parallelized. Because i is the loop-control variable, each thread gets its own private copy, but as a matter of style, it would be better to declare i in the loop control block.
j is another matter. It is declared outside the parallel region and it is not the control variable of a parallelized loop. As a result, it is shared among the threads. Because each of the threads executing i-loop iterations manipulates the shared variable j, you have a huge problem with data races. This would be resolved (among other alternatives) by moving the declaration of j into the parallel region, preferably into the control block of its associated loop.
Overall, then:
// int i, j;
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    accelerations[i].x = 0;
    accelerations[i].y = 0;
    accelerations[i].z = 0;
    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
            vector sij = { positions[i].x - positions[j].x,
                           positions[i].y - positions[j].y,
                           positions[i].z - positions[j].z };
            vector sji = { positions[j].x - positions[i].x,
                           positions[j].y - positions[i].y,
                           positions[j].z - positions[i].z };
            double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant * masses[j] / mod3;
            vector S = { s * sji.x, s * sji.y, s * sji.z };
            accelerations[i].x += S.x;
            accelerations[i].y += S.y;
            accelerations[i].z += S.z;
        }
    }
}
Note also that computing sji appears to be wasteful, as in mathematical terms it is just -sij, and neither sji nor sij is modified. I would probably reduce the above to something more like this:
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    accelerations[i].x = 0;
    accelerations[i].y = 0;
    accelerations[i].z = 0;
    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            vector sij = { positions[i].x - positions[j].x,
                           positions[i].y - positions[j].y,
                           positions[i].z - positions[j].z };
            double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant * masses[j] / mod3;
            accelerations[i].x -= s * sij.x;
            accelerations[i].y -= s * sij.y;
            accelerations[i].z -= s * sij.z;
        }
    }
}

Implementation of OpenMP

void collision_f()
{
    int x;
    long double feq[Q], feq_R[Q], feq_B[Q], feq_force[Q];
    long double mR[Q], mB[Q], meq_R[Q], meq_B[Q];
    long double col_R[Q], col_B[Q];

    forces_f();

    #pragma omp parallel
    {
        #pragma omp for
        for (x = 0; x < NX; x++)
        {
            for (int y = 0; y < NY; y++)
            {
                if ((bnode[x][y] == 0) || (bnode[x][y] == 1) || (bnode[x][y] == 2))
                {
                    long double uxeq = ux_f[x][y] + Force_x_f[x][y]/rho_f[x][y];
                    long double uyeq = uy_f[x][y] + Force_y_f[x][y]/rho_f[x][y];
                    for (int i = 0; i < Q; i++)
                    {
                        feq[i] = feq_R[i] = feq_B[i] = feq_force[i] = 0.0;
                        //equilibrium distribution
                        long double udotc_f = ux_f[x][y]*cx[i] + uy_f[x][y]*cy[i];
                        long double u2_f = pow(ux_f[x][y], 2) + pow(uy_f[x][y], 2);
                        feq[i] = wt[i]*rho_f[x][y]*(1.0 + 3.0*udotc_f + 4.5*pow(udotc_f, 2) - 1.5*u2_f);
                        feq_R[i] = wt[i]*rho_R_f[x][y]*(1.0 + 3.0*udotc_f + 4.5*pow(udotc_f, 2) - 1.5*u2_f);
                        feq_B[i] = wt[i]*rho_B_f[x][y]*(1.0 + 3.0*udotc_f + 4.5*pow(udotc_f, 2) - 1.5*u2_f);
                        long double udotc_force = uxeq*cx[i] + uyeq*cy[i];
                        long double u2_force = pow(uxeq, 2) + pow(uyeq, 2);
                        feq_force[i] = wt[i]*rho_f[x][y]*(1.0 + 3.0*udotc_force + 4.5*pow(udotc_force, 2) - 1.5*u2_force);
                        //printf("%d\t%d\t%d\t%Lf\t%Lf\t%Lf\t%Lf\n", x, y, i, feq[i], feq_R[i], feq_B[i], feq_force[i]);
                    }
                    //Calculating moments and meq
                    for (int i = 0; i < Q; i++)
                    {
                        meq_R[i] = meq_B[i] = mR[i] = mB[i] = 0.0;
                        for (int j = 0; j < Q; j++)
                        {
                            mR[i] += M[i][j]*r1[x][y][j];
                            meq_R[i] += M[i][j]*feq_R[j];
                            mB[i] += M[i][j]*b1[x][y][j];
                            meq_B[i] += M[i][j]*feq_B[j];
                            //printf("%d,%d\t%d,%d\t%Lf\t%Lf\t%Lf\t%Lf\n", x, y, i, j, mR[i], meq_R[i], mB[i], meq_B[i]);
                        }
                        //printf("%d\t%d\t%d\t%Lf\t%Lf\t%Lf\t%Lf\n", x, y, i, mR[i], meq_R[i], mB[i], meq_B[i]);
                    }
                    //Collision equation
                    for (int i = 0; i < Q; i++)
                    {
                        col_R[i] = col_B[i] = 0.0;
                        for (int j = 0; j < Q; j++)
                        {
                            col_R[i] += stmiv_f[x][y][i][j]*(mR[j] - meq_R[j]);
                            col_B[i] += stmiv_f[x][y][i][j]*(mB[j] - meq_B[j]);
                        }
                        long double force = feq_force[i] - feq[i];
                        r2[x][y][i] = r1[x][y][i] - col_R[i];
                        b2[x][y][i] = b1[x][y][i] - col_B[i];
                        f1[x][y][i] = r2[x][y][i] + b2[x][y][i] + force;
                        //Recoloring using d'Ortona's segregation method
                        r2[x][y][i] = (rho_R_f[x][y]/rho_f[x][y]) * (f1[x][y][i] + BETA_LKR*wt[i]*(rho_f[x][y] - rho_R_f[x][y])*(n_x[x][y]*cx[i] + n_y[x][y]*cy[i]));
                        b2[x][y][i] = f1[x][y][i] - r2[x][y][i];
                    }
                    if (rho_R_f[x][y] <= EVAP_LIM*rho_r_f)
                    {
                        for (int i = 0; i < Q; i++)
                        {
                            b2[x][y][i] = f1[x][y][i];
                            r2[x][y][i] = 0.0;
                        }
                    }
                    if (rho_B_f[x][y] <= EVAP_LIM*rho_r_f)
                    {
                        for (int i = 0; i < Q; i++)
                        {
                            r2[x][y][i] = f1[x][y][i];
                            b2[x][y][i] = 0.0;
                        }
                    }
                }
            }
        }
    }
    return;
}
I am trying to implement OpenMP in this function, but I am getting nan for feq_R and -nan for meq_R and meq_B, and only for a few combinations of x and y. Also, the x, y values where I get -nan differ from run to run. I have checked that rho_f[x][y] is not zero for those x, y. I have also tried omp for reduction for meq_R and meq_B, only to be unsuccessful. FYI, the serial code runs without any problem. Any help is greatly appreciated.
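For what it's worth, the symptoms described (NaNs at different x, y on every run while the serial code is fine) are the classic signature of a data race rather than a numerical problem. The scratch arrays feq, feq_R, feq_B, feq_force, mR, mB, meq_R, meq_B, col_R and col_B are declared outside the #pragma omp parallel region, so all threads share a single copy of each and overwrite one another's intermediate values (x itself is safe: as the control variable of the worksharing loop, it is implicitly private). A minimal sketch of a fix, assuming nothing else in the body is written by multiple threads, is to declare the scratch arrays inside the parallel region so each thread gets its own:

#pragma omp parallel
{
    /* Per-thread scratch storage: declaring these inside the parallel
       region makes them private to each thread (listing them in a
       private(...) clause would work equally well). */
    long double feq[Q], feq_R[Q], feq_B[Q], feq_force[Q];
    long double mR[Q], mB[Q], meq_R[Q], meq_B[Q];
    long double col_R[Q], col_B[Q];

    #pragma omp for
    for (int x = 0; x < NX; x++)
    {
        /* ... loop body unchanged ... */
    }
}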

Speed up matrix-matrix multiplication using SSE vector instructions

I am having some trouble vectorizing some C code using SSE vector instructions. The code I have to vectorize is:
#define N 1000
void matrix_mul(int mat1[N][N], int mat2[N][N], int result[N][N])
{
    int i, j, k;
    for (i = 0; i < N; ++i)
    {
        for (j = 0; j < N; ++j)
        {
            for (k = 0; k < N; ++k)
            {
                result[i][k] += mat1[i][j] * mat2[j][k];
            }
        }
    }
}
Here is what I have so far:
void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
    int i, j, k; int* l;
    __m128i v1, v2, v3;
    v3 = _mm_setzero_si128();
    for (i = 0; i < N; ++i)
    {
        for (j = 0; j < N; j += 4)
        {
            for (k = 0; k < N; k += 4)
            {
                v1 = _mm_set1_epi32(mat1[i][j]);
                v2 = _mm_loadu_si128((__m128i*)&mat2[j][k]);
                v3 = _mm_add_epi32(v3, _mm_mul_epi32(v1, v2));
                _mm_storeu_si128((__m128i*)&result[i][k], v3);
                v3 = _mm_setzero_si128();
            }
        }
    }
}
After execution I got the wrong result. I know that the reason is the load from memory into v2. I loop through mat1 in row-major order, so I need to load mat2[0][0], mat2[1][0], mat2[2][0], mat2[3][0]..., but what is actually loaded is mat2[0][0], mat2[0][1], mat2[0][2], mat2[0][3]..., because mat2 is stored in memory in row-major order. I tried to fix this problem, but without any improvement.
Can anyone help me, please?
Below is your implementation, fixed:
void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
    int i, j, k;
    __m128i v1, v2, v3, v4;
    for (i = 0; i < N; ++i)
    {
        for (j = 0; j < N; ++j) // 'j' must be incremented by 1
        {
            // read mat1 here because it does not use 'k' index
            v1 = _mm_set1_epi32(mat1[i][j]);
            for (k = 0; k < N; k += 4)
            {
                v2 = _mm_loadu_si128((const __m128i*)&mat2[j][k]);
                // read what's in the result array first as we will need to add it later to our calculations
                v3 = _mm_loadu_si128((const __m128i*)&result[i][k]);
                // use _mm_mullo_epi32 here instead of _mm_mul_epi32 and add it to the previous result
                v4 = _mm_add_epi32(v3, _mm_mullo_epi32(v1, v2));
                // store the result
                _mm_storeu_si128((__m128i*)&result[i][k], v4);
            }
        }
    }
}
In short, _mm_mullo_epi32 (which requires SSE4.1) produces 4 x int32 results, as opposed to _mm_mul_epi32, which produces 2 x int64 results. If you cannot use SSE4.1, then have a look at the answer here for an alternative SSE2 solution.
Full description from the Intel Intrinsics Guide:
_mm_mullo_epi32: Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.
_mm_mul_epi32: Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.
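As a quick illustration of the difference (a toy example, assuming SSE4.1; compile with -msse4.1 or similar):

#include <stdio.h>
#include <smmintrin.h>  /* SSE4.1 */

int main(void) {
    __m128i a = _mm_setr_epi32(1, 2, 3, 4);
    __m128i b = _mm_setr_epi32(10, 20, 30, 40);
    int lo[4];
    long long wide[2];
    _mm_storeu_si128((__m128i *)lo, _mm_mullo_epi32(a, b));  /* 4 x int32 */
    _mm_storeu_si128((__m128i *)wide, _mm_mul_epi32(a, b));  /* 2 x int64, from elements 0 and 2 */
    printf("mullo: %d %d %d %d\n", lo[0], lo[1], lo[2], lo[3]);  /* 10 40 90 160 */
    printf("mul:   %lld %lld\n", wide[0], wide[1]);              /* 10 90 */
    return 0;
}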
I kinda changed your code around to make the addressing explicit [it helps in this case].
#define N 100
This is a stub for the vector unit multiply & accumulate operation; you should be able to replace NV with whatever throw your vector unit has, and put the relevant opcodes in here.
#define NV 8
int Vmacc(int *A, int *B) {
    int i = 0;
    int x = 0;
    for (i = 0; i < NV; i++) {
        x += *A++ * *B++;
    }
    return x;
}
This multiply has two notable variations from the norm:
1. It caches the columnar vector into a contiguous one.
2. It attempts to push slices of the multiply accumulate into a vector-like func.
Even without using the vector unit, this takes half the time of the naive version just because of better cache/prefetch utilization.
void mm2(int *A, int *B, int n, int *C) {
    int c, r;
    int stride = 0;
    int cache[N];
    for (c = 0; c < n; c++) {
        /* cache column c: */
        for (r = 0; r < n; r++) {
            cache[r] = B[c + r*n];
        }
        for (r = 0; r < n; r++) {
            int k = 0;
            int x = 0;
            int *Av = A + r*n;
            for (k = 0; k+NV-1 < n; k += NV) {
                x += Vmacc(Av+k, cache+k);
            }
            while (k < n) {
                x += Av[k] * cache[k];
                k++;
            }
            C[r*n + c] = x;
        }
    }
}

parallelizing matrix multiplication through threading and SIMD

I am trying to speed up matrix multiplication on a multicore architecture. To this end, I try to use threads and SIMD at the same time, but my results are not good. I test the speedup against sequential matrix multiplication:
void sequentialMatMul(void* params)
{
    cout << "SequentialMatMul started.";
    int i, j, k;
    for (i = 0; i < N; i++)
    {
        for (k = 0; k < N; k++)
        {
            for (j = 0; j < N; j++)
            {
                X[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    cout << "\nSequentialMatMul finished.";
}
I tried to add threading and SIMD to matrix multiplication as follows:
void threadedSIMDMatMul(void* params)
{
    bounds *args = (bounds*)params;
    int lowerBound = args->lowerBound;
    int upperBound = args->upperBound;
    int idx = args->idx;
    int i, j, k;
    for (i = lowerBound; i < upperBound; i++)
    {
        for (k = 0; k < N; k++)
        {
            for (j = 0; j < N; j += 4)
            {
                mmx1 = _mm_loadu_ps(&X[i][j]);
                mmx2 = _mm_load_ps1(&A[i][k]);
                mmx3 = _mm_loadu_ps(&B[k][j]);
                mmx4 = _mm_mul_ps(mmx2, mmx3);
                mmx0 = _mm_add_ps(mmx1, mmx4);
                _mm_storeu_ps(&X[i][j], mmx0);
            }
        }
    }
    _endthread();
}
And the following section is used for calculating lowerbound and upperbound of each thread:
bounds arg[CORES];
for (int part = 0; part < CORES; part++)
{
    arg[part].idx = part;
    arg[part].lowerBound = (N / CORES)*part;
    arg[part].upperBound = (N / CORES)*(part + 1);
}
And finally the threaded SIMD version is called like this:
HANDLE handle[CORES];
for (int part = 0; part < CORES; part++)
{
    handle[part] = (HANDLE)_beginthread(threadedSIMDMatMul, 0, (void*)&arg[part]);
}
for (int part = 0; part < CORES; part++)
{
    WaitForSingleObject(handle[part], INFINITE);
}
The result is as follows:
Test 1:
// arrays are defined as follows
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=1 // just one thread
Sequential time: 11129ms
Threaded SIMD matmul time: 14650ms
Speed up=0.75x
Test 2:
// arrays are defined as follows
float **A = (float**)_aligned_malloc(N* sizeof(float), 16);
float **B = (float**)_aligned_malloc(N* sizeof(float), 16);
float **X = (float**)_aligned_malloc(N* sizeof(float), 16);
for (int k = 0; k < N; k++)
{
    A[k] = (float*)malloc(cols * sizeof(float));
    B[k] = (float*)malloc(cols * sizeof(float));
    X[k] = (float*)malloc(cols * sizeof(float));
}
N=2048
Core=1 // just one thread
Sequential time: 15907ms
Threaded SIMD matmul time: 18578ms
Speed up=0.85x
Test 3:
// arrays are defined as follows
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=2
Sequential time: 10855ms
Threaded SIMD matmul time: 27967ms
Speed up=0.38x
Test 4:
// arrays are defined as follows
float **A = (float**)_aligned_malloc(N* sizeof(float), 16);
float **B = (float**)_aligned_malloc(N* sizeof(float), 16);
float **X = (float**)_aligned_malloc(N* sizeof(float), 16);
for (int k = 0; k < N; k++)
{
    A[k] = (float*)malloc(cols * sizeof(float));
    B[k] = (float*)malloc(cols * sizeof(float));
    X[k] = (float*)malloc(cols * sizeof(float));
}
N=2048
Core=2
Sequential time: 16579ms
Threaded SIMD matmul time: 30160ms
Speed up=0.51x
My question: why don't I get a speedup?
Here are the times I get building on your algorithm on my four core i7 IVB processor.
sequential: 3.42 s
4 threads: 0.97 s
4 threads + SSE: 0.86 s
Here are the times on a 2-core P9600 @2.53 GHz, which is similar to the OP's E2200 @2.2 GHz:
sequential: time 6.52 s
2 threads: time 3.66 s
2 threads + SSE: 3.75 s
I used OpenMP because it makes this easy. Each thread in OpenMP effectively runs over
lowerBound = N*part/CORES;
upperBound = N*(part + 1)/CORES;
(note that this is slightly different from your definition, which can give the wrong result due to rounding for some values of N, since you divide by CORES first: for example, with N = 10 and CORES = 4, (N/CORES)*part gives bounds 0, 2, 4, 6, 8, so rows 8 and 9 are never processed, whereas N*part/CORES gives 0, 2, 5, 7, 10 and covers every row.)
As to the SIMD version: it's not much faster, probably because it is memory-bandwidth bound, and because GCC already vectorizes the scalar loop anyway.
The optimal solution is much more complicated. You need to use loop tiling and reorder the elements within tiles to get the best performance. I don't have time to do that today.
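For a flavour of what that looks like, here is a minimal cache-blocking sketch (the tile size BS is an assumption to be tuned for the cache level you target; packing the tiles of b into contiguous storage would be the further step):

#define BS 64  /* tile size: an assumed, tunable parameter */
void gemm_tiled(float * restrict a, float * restrict b, float * restrict c, int n) {
    /* Iterate over BS x BS tiles so each tile of b stays cache-resident
       while it is reused across the rows of a. */
    for(int ii=0; ii<n; ii+=BS)
    for(int kk=0; kk<n; kk+=BS)
    for(int jj=0; jj<n; jj+=BS)
        for(int i=ii; i<ii+BS && i<n; i++)
            for(int k=kk; k<kk+BS && k<n; k++) {
                float aik = a[i*n+k];
                for(int j=jj; j<jj+BS && j<n; j++)
                    c[i*n+j] += aik * b[k*n+j];
            }
}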
Here is the code I used:
//c99 -O3 -fopenmp -Wall foo.c
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>
#include <omp.h>

void gemm(float * restrict a, float * restrict b, float * restrict c, int n) {
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            for(int j=0; j<n; j++) {
                c[i*n+j] += a[i*n+k]*b[k*n+j];
            }
        }
    }
}

void gemm_tlp(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            for(int j=0; j<n; j++) {
                c[i*n+j] += a[i*n+k]*b[k*n+j];
            }
        }
    }
}

void gemm_tlp_simd(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            __m128 a4 = _mm_set1_ps(a[i*n+k]);
            for(int j=0; j<n; j+=4) {
                __m128 c4 = _mm_load_ps(&c[i*n+j]);
                __m128 b4 = _mm_load_ps(&b[k*n+j]);
                c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);
                _mm_store_ps(&c[i*n+j], c4);
            }
        }
    }
}

int main(void) {
    int n = 2048;
    float *a = _mm_malloc(n*n * sizeof *a, 64);
    float *b = _mm_malloc(n*n * sizeof *b, 64);
    float *c1 = _mm_malloc(n*n * sizeof *c1, 64);
    float *c2 = _mm_malloc(n*n * sizeof *c2, 64);
    float *c3 = _mm_malloc(n*n * sizeof *c3, 64);
    for(int i=0; i<n*n; i++) a[i] = 1.0*i;
    for(int i=0; i<n*n; i++) b[i] = 1.0*i;
    memset(c1, 0, n*n * sizeof *c1);
    memset(c2, 0, n*n * sizeof *c2);
    memset(c3, 0, n*n * sizeof *c3);

    double dtime;
    dtime = -omp_get_wtime();
    gemm(a,b,c1,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    gemm_tlp(a,b,c2,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    gemm_tlp_simd(a,b,c3,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    printf("error %d\n", memcmp(c1,c2, n*n*sizeof *c1));
    printf("error %d\n", memcmp(c1,c3, n*n*sizeof *c1));
}
It looks to me like the threads are sharing the __m128 mmx* variables; you probably defined them as global/static. You must be getting wrong results in your X array too. Define the __m128 mmx* variables inside the threadedSIMDMatMul function scope and it will run much faster.
void threadedSIMDMatMul(void* params)
{
    __m128 mmx0, mmx1, mmx2, mmx3, mmx4;
    // rest of the code here
}

Vector equations in C

I'm trying to code a function which computes the projection formula
P(x) = x + ((c - a·x) / |a|^2) a. Note that x and a are vectors and c is a scalar. Also note that a·x is the dot (inner) product of a and x. Here is what I have:
double dotProduct(double *q, double *b, int length) {
    double runningSum = 0;
    for (int index = 0; index < length; index++)
        runningSum += q[index] * b[index];
    return runningSum;
}

void project(double *x, int n, double *a, double c)
{
    double m_sum = 0.0;
    for (int i = 0; i < n; i++) {
        m_sum += a[i]*a[i];
    }
    for (int j = 0; j < n; j++) {
        x[j] = x[j] + (1/m_sum) * (c - dotProduct(a, x, n)) * a[j];
    }
}
So my question is: how do I create a test file to check the consistency of the project function, since it does not return anything? And is my code even right?
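Two observations, offered as a sketch rather than a definitive fix. First, the loop as written re-evaluates dotProduct(a, x, n) on every iteration after earlier iterations have already modified x, so later components are projected against a partially updated vector; hoisting the dot product out of the loop matches the formula. Second, P(x) lands on the hyperplane a·x = c, so a test can simply check that dotProduct(a, x, n) returns (approximately) c after the call; the arbitrary example values below would also flag the original loop, which produces 0.5 instead of 0 for them.

#include <assert.h>
#include <math.h>
#include <stdio.h>

double dotProduct(double *q, double *b, int length);  /* as defined above */

/* Projection with the dot product hoisted out of the loop, so every
   component is projected against the original x, as the formula requires. */
void project(double *x, int n, double *a, double c)
{
    double m_sum = 0.0;
    for (int i = 0; i < n; i++)
        m_sum += a[i] * a[i];
    double d = (c - dotProduct(a, x, n)) / m_sum;  /* computed once */
    for (int j = 0; j < n; j++)
        x[j] += d * a[j];
}

/* Consistency check: after the call, x should lie on the hyperplane
   a.x = c, so dotProduct(a, x, n) should return (approximately) c. */
int main(void)
{
    double x[3] = { 1.0, 1.0, 0.0 };
    double a[3] = { 1.0, 1.0, 0.0 };
    double c = 0.0;
    project(x, 3, a, c);
    double r = dotProduct(a, x, 3);
    printf("a.x after projection = %g (expected %g)\n", r, c);
    assert(fabs(r - c) < 1e-12);
    return 0;
}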
