I have the following matrix multiplication code that uses OpenMP:
#include <stdio.h>
#include <omp.h>
#include <math.h>
#define N 1000
int main()
{
long int i, j, k;
//long int N = atoi(argv[1]);
double t1, t2;
double a[N][N],b[N][N],c[N][N];
for (i=0; i<N; i++)
for (j=0; j<N; j++)
a[i][j]=b[i][j]=log(i*j/(i*j+1.)+1) +exp(-(i+j)*(i+j+1.));
t1=omp_get_wtime();
#pragma omp parallel for shared(a, b, c) private(i, j, k)
for(i=0; i<N; i++){
for(j=0; j<N; j++){
c[i][j] = 0.0;
for(k=0; k<N; k++) c[i][j]+=a[i][k]*b[k][j];
}
}
t2=omp_get_wtime();
printf("Time=%lf\n", t2-t1);
}
Now I want to set the number of threads from the command line. I do that with atoi(argv[1]), namely:
#include <stdio.h>
#include <omp.h>
#include <math.h>
#include <stdlib.h> /* atoi */
#define N 1000
int main(int argc, char *argv[])
{
long int i, j, k;
//long int N = atoi(argv[1]);
double t1, t2;
double a[N][N],b[N][N],c[N][N];
for (i=0; i<N; i++)
for (j=0; j<N; j++)
a[i][j]=b[i][j]=log(i*j/(i*j+1.)+1) +exp(-(i+j)*(i+j+1.));
int t = atoi(argv[1]);
t1=omp_get_wtime();
#pragma omp parallel for shared(a, b, c) private(i, j, k) num_threads(t)
for(i=0; i<N; i++){
for(j=0; j<N; j++){
c[i][j] = 0.0;
for(k=0; k<N; k++) c[i][j]+=a[i][k]*b[k][j];
}
}
t2=omp_get_wtime();
printf("Time=%lf\n", t2-t1);
}
Everything works, except for one crucial thing: when I try to multiply matrices of dimension larger than roughly 500, I get a "segmentation fault" error. Could someone explain the reason for this error?
I don't know anything about OpenMP, but you are almost certainly overflowing your stack. Default stack limits vary from system to system (8 MB is a common default on Linux, 1 MB on Windows), but with N == 1000 you are trying to put three 2D arrays totaling 3 million doubles on the stack. Assuming a double is 8 bytes, that's 24 million bytes, or about 22.9 MiB, far more than a default stack usually allows. Instead, I'd recommend taking that memory from the heap (malloc and free need #include <stdlib.h>). Something like this:
//double a[N][N],b[N][N],c[N][N];
double **a, **b, **c;
a = malloc(sizeof(double*) * N);
b = malloc(sizeof(double*) * N);
c = malloc(sizeof(double*) * N);
for (i=0; i<N; i++)
{
a[i] = malloc(sizeof(double) * N);
b[i] = malloc(sizeof(double) * N);
c[i] = malloc(sizeof(double) * N);
}
// do your calculations
for (i=0; i<N; i++)
{
free(a[i]);
free(b[i]);
free(c[i]);
}
free(a);
free(b);
free(c);
I've verified on my machine at least, that with N == 1000 I crash right out of the gate with EXC_BAD_ACCESS when trying to place those arrays on the stack. When I dynamically allocate the memory instead as shown above, I get no seg faults.
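A minimal sketch of a variant that keeps the a[i][k] indexing but takes each matrix from the heap as one contiguous block instead of N separate rows. The function name matmul and its parameter order are made up for illustration and do not come from the question. It assumes a C99 compiler with VLA support; the pragma is simply ignored when OpenMP is not enabled.

```c
#include <assert.h>
#include <stdlib.h>

/* c = a * b for n x n matrices stored contiguously in row-major order.
 * `threads` is the requested team size and must be at least 1. */
void matmul(int n, int threads, double a[n][n], double b[n][n], double c[n][n])
{
    assert(n > 0 && threads > 0);
    #pragma omp parallel for num_threads(threads)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;              /* local accumulator, private to each thread */
            for (int k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}
```

A caller would allocate each matrix with something like double (*a)[n] = malloc(sizeof(double[n][n])); check the result for NULL, and release it with a single free(a). The thread count could come from atoi(argv[1]) after checking argc. Declaring the arrays static or at file scope would also move them off the stack.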
I've been working with a heat transfer code. Basically, this code establishes the initial conditions for a cube and all of its faces. The six faces start at different temperatures, and the code then calculates how the temperature changes across the cube due to heat transfer. Now I've been trying to offload it to an NVIDIA GPU using OpenMP directives. The code stores the temperatures in a triple pointer, which is a sort of array of arrays of arrays. Reading a little about this, I've learned that such pointer-based 3D structures are not easily offloaded to GPUs. So my question is whether it is possible to offload these triple-pointer arrays to the GPU, or whether I have to use a flatter array layout.
Here is the code, which currently runs only on the CPU. This is the parallel version.
#include <omp.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define N 25 //This defines the number of points per dimension (Cube = N*N*N)
#define NUM_STEPS 6000 //This is the number of simulations time steps
/*writeFile: this function writes simulation results into a file.
* A file is created for each iteration that's passed to the function
* as a parameter. It also takes the triple pointer to the simulation
* data*/
void writeFile(int iteration, double*** data){
char filename[50];
char itr[12];
sprintf(itr, "%d", iteration);
strcpy(filename, "heat_");
strcat(filename, itr);
strcat(filename, ".txt");
//printf("Filename is %s\n", filename);
FILE *fp;
fp = fopen(filename, "w");
fprintf(fp, "x,y,z,T\n");
for(int i=0; i<N; i++){
for(int j=0;j<N; j++){
for(int k=0; k<N; k++){
fprintf(fp,"%d,%d,%d,%f\n", i,j,k,data[i][j][k]);
}
}
}
fclose(fp);
}
void compute_heat_transfer(double ***arrayOld, double ***arrayNew){
int i,j,k;
/*Compute steady-state solution*/
for(int nsteps=0; nsteps < NUM_STEPS; nsteps++){
/*if(nsteps % 100 == 0){
writeFile(nsteps, arrayOld);
}*/
#pragma omp parallel shared(arrayNew, arrayOld) private(i,j,k)
{
#pragma omp for
for(i=1; i<N-1; i++){
for(j=1; j<N-1; j++){
for(k=1;k<N-1;k++){
//This is the 6-neighbor stencil computation
arrayNew[i][j][k] = (arrayOld[i-1][j][k] + arrayOld[i+1][j][k] + arrayOld[i][j-1][k] + arrayOld[i][j+1][k] +
arrayOld[i][j][k-1] + arrayOld[i][j][k+1])/6.0;
}
}
}
#pragma omp for
for(i=1; i<N-1; i++){
for(j=1; j<N-1; j++){
for(k=1; k<N-1; k++){
arrayOld[i][j][k] = arrayNew[i][j][k];
}
}
}
}
}
}
int main (int argc, char *argv[]) {
int i,j,k,nsteps;
double mean;
double ***arrayOld; //Variable that will hold the data of the past iteration
double ***arrayNew; //Variable where newly computed data will be stored
arrayOld = (double***)malloc(N*sizeof(double**));
arrayNew = (double***)malloc(N*sizeof(double**));
if(arrayOld== NULL){
fprintf(stderr, "Out of memory");
exit(0);
}
for(i=0; i<N;i++){
arrayOld[i] = (double**)malloc(N*sizeof(double*));
arrayNew[i] = (double**)malloc(N*sizeof(double*));
if(arrayOld[i]==NULL){
fprintf(stderr, "Out of memory");
exit(0);
}
for(int j=0;j<N;j++){
arrayOld[i][j] = (double*)malloc(N*sizeof(double));
arrayNew[i][j] = (double*)malloc(N*sizeof(double));
if(arrayOld[i][j]==NULL){
fprintf(stderr,"Out of memory");
exit(0);
}
}
}
/*Set boundary values and compute mean boundary values*/
mean = 0.0;
for(i=0; i<N; i++){
for(j=0;j<N;j++){
arrayOld[i][j][0] = 100.0;
mean += arrayOld[i][j][0];
}
}
for(i=0; i<N; i++){
for(j=0;j<N;j++){
arrayOld[i][j][N-1] = 100.0;
mean += arrayOld[i][j][N-1];
}
}
for(j=0; j<N; j++){
for(k=0;k<N;k++){
arrayOld[0][j][k] = 100.0;
mean += arrayOld[0][j][k];
}
}
for(j=0; j<N; j++){
for(k=0;k<N;k++){
arrayOld[N-1][j][k] = 100.0;
mean += arrayOld[N-1][j][k];
}
}
for(i=0; i<N; i++){
for(k=0;k<N;k++){
arrayOld[i][0][k] = 100.0;
mean += arrayOld[i][0][k];
}
}
for(i=0; i<N; i++){
for(k=0;k<N;k++){
arrayOld[i][N-1][k] = 0.0;
mean += arrayOld[i][N-1][k];
}
}
mean /= (6.0 * (N*N));
/*Initialize interior values*/
for(i=1; i<N-1; i++){
for(j=1; j<N-1; j++){
for(k=1; k<N-1;k++){
arrayOld[i][j][k] = mean;
}
}
}
double tdata = omp_get_wtime();
compute_heat_transfer(arrayOld, arrayNew);
tdata = omp_get_wtime()-tdata;
printf("Execution time was %f secs\n", tdata);
for(i=0; i<N;i++){
for(int j=0;j<N;j++){
free(arrayOld[i][j]);
free(arrayNew[i][j]);
}
free(arrayOld[i]);
free(arrayNew[i]);
}
free(arrayOld);
free(arrayNew);
return 0;
}
Use variable length arrays with dynamic storage:
Allocation:
double (*arr)[N][N] = calloc(N, sizeof *arr);
Indexing.
Use good old arr[i][j][k] syntax
Deallocation.
free(arr)
Flattening.
double *flat = (double*)arr;
Note that this conversion is not guaranteed by the C standard to work.
Though it will very likely work on all platforms capable of using GPUs.
Passing to functions.
VLAs can be parameters of the functions.
void fun(int n, double arr[n][n][n]) {
...
}
Example usage:
fun(N, arr);
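To connect the flattening step to the offloading question, here is a rough sketch of one stencil sweep over a flat, contiguous cube with an OpenMP target region. The names sweep_flat and IDX are invented for this example, and the choice of map clauses is an assumption about what the stencil needs, not something taken from the question. It presumes an OpenMP 4.5 or later compiler; without offload support the loop should simply run on the host, and without OpenMP the pragma is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: flat index of element (i, j, k) in an n*n*n cube. */
#define IDX(n, i, j, k) \
    (((size_t)(i) * (size_t)(n) + (size_t)(j)) * (size_t)(n) + (size_t)(k))

/* Writes the 6-neighbor average of src into the interior of dst.
 * Boundary cells of dst are left untouched. */
void sweep_flat(int n, const double *src, double *dst)
{
    assert(n >= 3 && src != NULL && dst != NULL);
    size_t total = (size_t)n * n * n;   /* number of doubles to map */
    (void)total;                        /* only the pragma uses it */
    #pragma omp target teams distribute parallel for collapse(3) \
        map(to: src[0:total]) map(tofrom: dst[0:total])
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++)
                dst[IDX(n, i, j, k)] =
                    (src[IDX(n, i - 1, j, k)] + src[IDX(n, i + 1, j, k)] +
                     src[IDX(n, i, j - 1, k)] + src[IDX(n, i, j + 1, k)] +
                     src[IDX(n, i, j, k - 1)] + src[IDX(n, i, j, k + 1)]) / 6.0;
}
```

dst is mapped tofrom rather than from so that its boundary cells survive the copy back. In a real time-stepping loop you would probably want to keep both buffers on the device across iterations (for example with a target data region) instead of mapping them on every sweep.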
EDIT
VLA friendly variant of compute_heat_transfer():
void compute_heat_transfer(int n, double arrayOld[restrict n][n][n], double arrayNew[restrict n][n][n]) {
int i,j,k;
/*Compute steady-state solution*/
for(int nsteps=0; nsteps < NUM_STEPS; nsteps++){
/*if(nsteps % 100 == 0){
writeFile(nsteps, arrayOld);
}*/
#pragma omp parallel for collapse(3)
for(i=1; i<n-1; i++){
for(j=1; j<n-1; j++){
for(k=1; k<n-1; k++){
//This is the 6-neighbor stencil computation
arrayNew[i][j][k] = (arrayOld[i-1][j][k] + arrayOld[i+1][j][k] + arrayOld[i][j-1][k] + arrayOld[i][j+1][k] +
arrayOld[i][j][k-1] + arrayOld[i][j][k+1])/6.0;
}}}
#pragma omp parallel for collapse(3)
for(i=1; i<n-1; i++){
for(j=1; j<n-1; j++){
for(k=1; k<n-1; k++){
arrayOld[i][j][k] = arrayNew[i][j][k];
}}}
}
}
The keyword restrict in arrayNew[restrict n][n][n] lets the compiler assume that arrayNew and arrayOld do not alias, which should allow more aggressive optimizations.
Note that arrayNew and arrayOld are pointers to arrays. So rather than copying arrayNew into arrayOld, you could simply swap those pointers, forming a kind of simple double buffering. That should make the code even faster.
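The pointer swap could look roughly like the sketch below. The names sweep and run_sweeps are invented here, and the buffers are passed as void * only so the function can hand back whichever one holds the final result. One assumption matters: both buffers must start with the same boundary values, because after a swap the former "new" buffer is read as the old one.

```c
#include <assert.h>
#include <stdlib.h>

/* One sweep: writes the 6-neighbor average of src into the interior of dst. */
static void sweep(int n, double (*restrict dst)[n][n], double (*restrict src)[n][n])
{
    #pragma omp parallel for collapse(3)
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++)
                dst[i][j][k] = (src[i-1][j][k] + src[i+1][j][k] +
                                src[i][j-1][k] + src[i][j+1][k] +
                                src[i][j][k-1] + src[i][j][k+1]) / 6.0;
}

/* Runs `steps` sweeps, swapping the roles of the two buffers after each one.
 * Returns the buffer that holds the most recent result. */
void *run_sweeps(int n, int steps, void *buf_a, void *buf_b)
{
    assert(n >= 3 && steps >= 0 && buf_a != NULL && buf_b != NULL);
    double (*cur)[n][n] = buf_a;    /* data from the previous step */
    double (*next)[n][n] = buf_b;   /* receives the new step */
    for (int s = 0; s < steps; s++) {
        sweep(n, next, cur);
        double (*tmp)[n][n] = cur;  /* swap instead of copying n*n*n doubles */
        cur = next;
        next = tmp;
    }
    return cur;
}
```

Because the result can end up in either buffer depending on whether steps is odd or even, the caller should use the returned pointer rather than assume it is buf_a.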
Given a n x n matrix of ints, I have an algorithm that at each step of a for loop of range n traverses and modifies the matrix. Here is the code:
typedef int **Matrix;
void floyd_slow(Matrix dist, int n)
{
int d;
for (int k=0; k<n; k++)
{
for (int i=0; i<n; i++)
{
for (int j=0; j<n; j++)
if ((d=dist[k][j]+dist[i][k])<dist[i][j])
dist[i][j]=d;
}
}
for (int i=0; i<n; i++)
dist[i][i]=0;
}
The matrix is stored as a single array of n*n ints, and for each row index i, dist[i] is the address of row i (the code above is the standard way to write the Floyd-Warshall algorithm, but my question is not about the algorithm itself).
The following drawing tries to explain how the matrix is processed:
At each step of the loop of index k, the underlying matrix is traversed line by line.
Now, consider the following transformation of the previous code:
void relax(Matrix dist, int n, int* rowk, int* colk)
{
int d;
for (int i=0; i<n; i++)
for (int j=0; j<n; j++)
if ((d=rowk[j]+colk[i])<dist[i][j])
dist[i][j]=d;
}
void floyd_fast(Matrix dist, int n)
{
int i, k;
int* colk=malloc(n*sizeof(int));
if (!colk)
exit(EXIT_FAILURE);
for (k=0; k<n; k++)
{
int* rowk =dist[k];
for (i=0; i<n; i++)
colk[i]=dist[i][k];
relax(dist, n, rowk, colk);
}
free(colk);
for (i=0; i<n; i++)
dist[i][i]=0;
}
At every step, the elements of the matrix are accessed in the same order as in the previous algorithm.
The only difference is that at each step k of the outer loop, the column of index k is copied into a temporary array (see the colk malloc above). As a result, the element at position (i, k) is read from this array instead of being accessed directly from the matrix.
This seemingly innocuous change in fact leads to a significant speedup: a factor of about 4 for n=1000.
I know that in C it is faster to traverse an array in row-major order, but that is already the case in both versions. So I was wondering why the speedup is so large. Is it related to cache optimisation?
Complete code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
typedef int **Matrix;
void floyd_slow(Matrix dist, int n)
{
int d;
for (int k=0; k<n; k++)
{
for (int i=0; i<n; i++)
{
for (int j=0; j<n; j++)
if ((d=dist[k][j]+dist[i][k])<dist[i][j])
dist[i][j]=d;
}
}
for (int i=0; i<n; i++)
dist[i][i]=0;
}
void relax(Matrix dist, int n, int* rowk, int* colk)
{
int d;
for (int i=0; i<n; i++)
for (int j=0; j<n; j++)
if ((d=rowk[j]+colk[i])<dist[i][j])
dist[i][j]=d;
}
void floyd_fast(Matrix dist, int n)
{
int i, k;
int* colk=malloc(n*sizeof(int));
if (!colk)
exit(EXIT_FAILURE);
for (k=0; k<n; k++)
{
int* rowk =dist[k];
for (i=0; i<n; i++)
colk[i]=dist[i][k];
relax(dist, n, rowk, colk);
}
free(colk);
for (i=0; i<n; i++)
dist[i][i]=0;
}
void print(Matrix dist, int n)
{
int i, j;
for (i=0; i<n; i++)
{
for (j=0; j<n; j++)
printf("%d ", dist[i][j]);
printf("\n");
}
}
void test_slow(Matrix dist, int n)
{
clock_t now=clock();
floyd_slow(dist, n);
// print(dist, n);
int *p=dist[0];
free(dist);
free(p);
fprintf(stderr, "Elapsed slow: %.2f s\n",
(double) (clock() - now) / CLOCKS_PER_SEC);
}
void test_fast(Matrix dist, int n)
{
clock_t now=clock();
floyd_fast(dist, n);
// print(dist, n);
int *p=dist[0];
free(dist);
free(p);
fprintf(stderr, "Elapsed fast: %.2f s\n",
(double) (clock() - now) / CLOCKS_PER_SEC);
}
int * data(int n)
{
int N=n*n;
int *t=malloc(N*sizeof(int));
if (!t)
exit(EXIT_FAILURE);
srand(time(NULL));
for (int i=0; i<N;i++)
t[i]=(1+rand())%10;
return t;
}
Matrix getMatrix(int *t, int n)
{
int N=n*n;
int *tt=malloc(N*sizeof(int));
Matrix mat=malloc(n*sizeof(int*));
if (!tt || !mat)
exit(EXIT_FAILURE);
memcpy(tt, t, N*sizeof(int));
for (int i=0; i<n;i++)
mat[i]=&tt[i*n];
return mat;
}
int main(void)
{
int n=1000;
int *t=data(n);
Matrix mat_slow=getMatrix(t, n); /* both tests get a copy of the same data */
Matrix mat_fast=getMatrix(t, n);
free(t);
test_slow(mat_slow, n);
test_fast(mat_fast, n);
return 0;
}
Output:
Elapsed slow: 0.58 s
Elapsed fast: 0.14 s
Compilation options:
rm floyd
gcc -Wall -O3 -march=native -ffast-math -Wno-unused-result -Wno-unused-variable -Wno-unused-but-set-variable -Wno-unused-parameter floyd.c -o floyd -lm
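One hypothesis that could be tested here, and which nothing in this post confirms: in floyd_slow the compiler may have to reload dist[i][k] on every inner iteration, because the store to dist[i][j] might alias it, and that can get in the way of vectorization. The sketch below (floyd_hoisted is a made-up name) reads dist[i][k] once per (k, i) pair into a local, without the colk array. Timing it against the two versions above would show whether the copy itself or the removed reload accounts for the difference.

```c
#include <assert.h>
#include <stdlib.h>

typedef int **Matrix;

/* Same traversal order as floyd_slow, but dist[i][k] is read once per row.
 * The result matches floyd_slow as long as no diagonal entry is negative. */
void floyd_hoisted(Matrix dist, int n)
{
    assert(dist != NULL && n > 0);
    for (int k = 0; k < n; k++) {
        const int *rowk = dist[k];
        for (int i = 0; i < n; i++) {
            int *rowi = dist[i];
            const int dik = rowi[k];   /* hoisted out of the j loop */
            for (int j = 0; j < n; j++) {
                int d = rowk[j] + dik;
                if (d < rowi[j])
                    rowi[j] = d;
            }
        }
    }
    for (int i = 0; i < n; i++)
        dist[i][i] = 0;
}
```

If this version runs about as fast as floyd_fast with the same compiler flags, that would point at aliasing rather than cache behaviour; if not, the explanation lies elsewhere.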
Here is the section of my code where I am running into a race condition. I'm just trying to copy the values of the matrix "matxOriginal" into the matrix "cluster0", but when this runs on multiple threads using OpenMP, the sample values printed for "cluster0" differ from those in "matxOriginal".
Both matrices have been dynamically allocated, and both are 699 x 9 (row indices 0 to 698).
Also, I would like to keep the "cluster0IndexCounter" variable, because I need it for another purpose outside of what I'm posting. So if you can, please let me know how to make this work.
double **matxGen(int row, int col)
{
int i=0, j=0;
double **m;
m=(double **) malloc(row*sizeof(double *));
for (i=0; i<row; i++)
{
m[i]=(double *) malloc(col*sizeof(double ));
for (j=0; j<col; j++)
{
m[i][j]=j+i;
}
}
return m;
}
double **emptyMatxGen(int row, int col)
{
int i=0, j=0;
double **m;
m=(double **) malloc(row*sizeof(double *));
for (i=0; i<row; i++)
{
m[i]=(double *) malloc(col*sizeof(double ));
for (j=0; j<col; j++)
{
m[i][j]=0.0;
}
}
return m;
}
int main()
{
int x, i, j, tid, row=699, col=9,
cluster0IndexCounter=0;
double **matxOriginal, **matx, **cluster0;
matxOriginal=matxGen(row, col);
matx=matxGen(row, col);
double *centerPoint0=matx[99];
cluster0=emptyMatxGen(row, col);
#pragma omp parallel for private(x, j, tid) schedule(static) reduction(+:cluster0IndexCounter)
for (x=0; x<=698; x++)
{
for (j=0; j<9; j++)
{
cluster0[cluster0IndexCounter][j]=matxOriginal[x][j];
}
cluster0IndexCounter=cluster0IndexCounter+1;
}
printf("cluster0: %f, %f, %f, %f, %f\n", cluster0[9][0], cluster0[9][1], cluster0[9][2], cluster0[9][3], cluster0[9][4]);
free(cluster0);
free(matxOriginal);
free(matx);
return 0;
}
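A note on the loop above: with reduction(+:cluster0IndexCounter), every thread works on its own private copy of the counter that starts at 0, so each thread writes its rows into cluster0 starting at row 0 and they overwrite one another. A minimal sketch of one way around this, assuming every row is copied unconditionally: use the loop index x as the destination row and keep the reduction only for counting. The function name copy_rows is invented for this example.

```c
#include <assert.h>

/* Copies rows 0..rows-1 of src into the same rows of dst.
 * Returns the number of rows copied. */
int copy_rows(double **dst, double **src, int rows, int cols)
{
    assert(dst != 0 && src != 0 && rows >= 0 && cols >= 0);
    int copied = 0;
    #pragma omp parallel for schedule(static) reduction(+:copied)
    for (int x = 0; x < rows; x++) {
        for (int j = 0; j < cols; j++)
            dst[x][j] = src[x][j];   /* row x is written by exactly one thread */
        copied += 1;                 /* per-thread partial count, summed at the end */
    }
    return copied;
}
```

The returned value can then play the role of cluster0IndexCounter after the loop. If rows are later copied only when some condition holds, the destination index would depend on earlier iterations, and this simple scheme would need to change.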
So, I have to use this function from GSL.
This one:
gsl_matrix_view_array (double * base, size_t n1, size_t n2)
The first argument (double * base) is the matrix I need to pass to it, which is read as input from the user.
I'm dynamically allocating it this way:
double **base;
base = malloc(size*sizeof(double*));
for(i=0;i<size;i++)
base[i] = malloc(size*sizeof(double));
Where size is given by the user.
But when I compile the code, I get this warning:
"passing arg 1 of gsl_matrix_view_array from incompatible pointer type".
What is happening?
The function expects a flat array, e.g., double arr[size*size];.
Here's an example from the documentation that I have slightly modified to use a matrix view:
#include <stdio.h>
#include <stdlib.h>
#include <gsl/gsl_matrix.h>
int main(void) {
int i, j;
double *arr = malloc(10 * 3 * sizeof*arr);
gsl_matrix_view mv = gsl_matrix_view_array(arr, 10, 3);
gsl_matrix * m = &(mv.matrix);
for (i=0; i<10; i++)
for (j=0; j<3; j++)
gsl_matrix_set(m, i, j, 0.23 + 100*i + j);
for (i=0; i<10; i++)
for (j=0; j<3; j++)
printf("m(%d,%d) = %g\n", i, j, gsl_matrix_get(m, i, j));
free(arr);
return 0;
}
Note that you can also directly allocate memory for the matrix using the provided API.
Here's the original example:
#include <stdio.h>
#include <gsl/gsl_matrix.h>
int main(void)
{
int i, j;
gsl_matrix * m = gsl_matrix_alloc(10, 3);
for (i=0; i<10; i++)
for (j=0; j<3; j++)
gsl_matrix_set(m, i, j, 0.23 + 100*i + j);
for (i=0; i<10; i++)
for (j=0; j<3; j++)
printf("m(%d,%d) = %g\n", i, j, gsl_matrix_get(m, i, j));
gsl_matrix_free(m);
return 0;
}
For reference:
http://www.gnu.org/software/gsl/manual/html_node/Matrix-views.html
http://www.gnu.org/software/gsl/manual/html_node/Matrix-allocation.html#Matrix-allocation
http://www.gnu.org/software/gsl/manual/html_node/Accessing-matrix-elements.html#Accessing-matrix-elements
gsl_matrix_view_array expects your matrix as a contiguous single allocation in row-major order. You should be allocating your array like this:
double (*ar)[size] = malloc(sizeof(double[size][size]));
Then (after populating it)
gsl_matrix_view_array(ar[0], size, size);
Finally, free your allocation when done with a single call:
free(ar);
Note: Don't try this in C++, as VLAs are not part of the C++ standard.
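If the data has already been read into an array of row pointers as in the question, one option is to copy it into a single row-major block first. This is only a sketch; flatten_rows is a hypothetical helper and not part of GSL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies n1 rows of n2 doubles each into one contiguous row-major block.
 * Returns NULL if the allocation fails; the caller frees the result. */
double *flatten_rows(double **rows, size_t n1, size_t n2)
{
    assert(rows != NULL && n1 > 0 && n2 > 0);
    double *flat = malloc(n1 * n2 * sizeof *flat);
    if (flat == NULL)
        return NULL;
    for (size_t i = 0; i < n1; i++)
        memcpy(flat + i * n2, rows[i], n2 * sizeof *flat);
    return flat;
}
```

The returned pointer is what would then be handed to gsl_matrix_view_array(flat, n1, n2). The view does not own the memory, so flat has to stay allocated for as long as the view is in use.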
I'm having an issue that I cannot seem to fix with my memory allocations.
I create 3 dynamically allocated arrays (ipiv,k,b) using malloc, but when I try and free them, I get a seg fault. If I don't free them, the code works fine (but if I run too many iterations, I run out of memory).
Here is the code... I've taken out all of the parts that do not use the 3 arrays, since the code is pretty long.
#include<stdio.h>
#include <string.h>
#include<stdlib.h>
#include<math.h>
#include <mpi.h>
#include "mkl.h"
#define K(i,j) k[(i)+(j)*(n)]
void dgesv_( const MKL_INT* n, const MKL_INT* nrhs, double* a,
const MKL_INT* lda, MKL_INT* ipiv, double* b,
const MKL_INT* ldb, MKL_INT* info );
int main()
{
int *ipiv=malloc(n*sizeof(int));
for (i=0; i<n; i++) {
ipiv[i]=0;
}
for (globloop=0; globloop<=lasti; globloop++) {
double a[ndofs];
double rhs[ndofs];
double F[ndofs];
double *k=malloc(n*n*sizeof(double));
//var for stiffness matrix (this is the one actually fed to dgesv)
//see define at top
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
K(i,j)=0.0;
}
}
//bunch of stuff modified; a, rhs, and F filled, etc.
while (sos>=ep && nonlinloop<=maxit) {
double KFull[ndofs][ndofs];
for (i=0; i<ndofs; i++) {
for (j=0; j<ndofs; j++) {
KFull[i][j]=0.0;
}
}
//KFull filled with values..
//trim the arrays to account for bcs
double *b=malloc(n*sizeof(double));
for (i=0; i<n; i++) {
b[i]=rhs[i+2];
}
//k array filled
//see define above
for (i=0; i<n; i++) {
for (j=0; j<ndofs-2; j++) {
K(i,j)=KFull[i+2][j+2];
}
}
//SOLVER
dgesv_(&n,&one,k,&n,ipiv,b,&n,&info);
//now we must take our solution in b, and place back into rhs
for (i=0; i<n; i++) {
rhs[i+2]=b[i];
}
nonlinloop++;
free(b);
}
free(k);
}
free(ipiv);
return 0;
}
Freeing any one of these 3 variables gives me a segmentation fault. I am super-confused about this.
If n=ndofs-4 (as mentioned in the OP's comment), then ndofs-2 is greater than n, and the code corrupts memory at
K(i,j)=KFull[i+2][j+2];
because j runs up to ndofs-2-1 and K is (only) defined to be K[0..n-1][0..n-1].
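A sketch of the trimming step with both loops bounded by n, which keeps every write inside the n*n block. The helper trim_matrix and its offset parameter are made up for illustration, and the full matrix is taken as a flat row-major array here, whereas the question uses a 2D VLA. The offset of 2 used in the question would be passed as offset.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the n x n block starting at (offset, offset) of a row-major
 * ndofs x ndofs matrix into k, stored column-major like the K(i,j) macro. */
void trim_matrix(int ndofs, int n, int offset, const double *kfull, double *k)
{
    assert(kfull != NULL && k != NULL);
    assert(n > 0 && offset >= 0 && offset + n <= ndofs);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)   /* j stops at n, not ndofs-2 */
            k[(size_t)i + (size_t)j * n] =
                kfull[(size_t)(i + offset) * ndofs + (size_t)(j + offset)];
}
```

With the writes confined to k[0..n*n-1], the heap bookkeeping next to the allocations is no longer overwritten, which would be consistent with the crashes showing up only when free is called.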