Matrix and vector multiplication optimization algorithm - c

Assume that the dimensions are very large (up to 1 billion elements in a matrix). How would I implement a cache-oblivious algorithm for matrix-vector product? Based on Wikipedia, I will need to recursively divide and conquer; however, I feel like there would be a lot of overhead. Would it be efficient to do so?
Follow up question and answer: OpenMP with matrices and vectors

So the answer to the question, "how do I make this basic linear algebra operation fast", is always and everywhere to find and link to a tuned BLAS library for your platform, e.g. GotoBLAS (whose work is being continued in OpenBLAS), the slower autotuned ATLAS, or commercial packages like Intel's MKL. Linear algebra is so fundamental to so many other operations that enormous amounts of effort go into optimizing these packages for various platforms, and there's just no chance you're going to come up with something in a few afternoons' work that will compete. The particular subroutine calls you're looking for, for general dense matrix-vector multiplication, are SGEMV/DGEMV/CGEMV/ZGEMV.
Cache-oblivious algorithms, or autotuning, are for when you can't be bothered to tune for the specific cache architecture of your system - which might be fine, normally, but since people are willing to do that tuning for the BLAS routines and then make the results available, you're best off just using those routines.
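For example, if you link against a CBLAS interface (OpenBLAS and ATLAS ship cblas.h; MKL provides an equivalent), a single call does the whole y = Ax. This is just a sketch, assuming a contiguous row-major n-by-n array a and vectors x and y set up by the caller:

#include <cblas.h>

/* y = 1.0*A*x + 0.0*y for a row-major n-by-n matrix a (leading dimension n) */
void gemv_blas(int n, const float *a, const float *x, float *y) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0f, a, n, x, 1, 0.0f, y, 1);
}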
The memory access pattern for GEMV is straightforward enough that you don't really need divide and conquer (same for the standard case of matrix transpose) - you just find the cache blocking size and use it. In GEMV (y = Ax), you still have to scan through the entire matrix once, so there's nothing to be done for reuse (and thus effective cache use) there, but you can try to reuse x as much as possible so you load it once instead of (number of rows) times - and you still want access to A to be cache friendly. So the obvious cache-blocking thing to do is to break it up into blocks:
A x -> [ A11 | A12 ] [ x1 ]  =  [ A11 x1 + A12 x2 ]
       [ A21 | A22 ] [ x2 ]     [ A21 x1 + A22 x2 ]
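A minimal iterative sketch of that blocking (my illustration; BLOCK is a made-up tuning knob you would pick per cache level, and a is assumed stored contiguously in row-major order):

#include <stddef.h>
#define BLOCK 256   /* hypothetical block size - tune to your cache */

/* Sketch: y = A x with 2D blocking, so a BLOCK-long chunk of x stays in cache
   while a block of rows of A streams through it. */
void blocked_gemv(int n, const float *a, const float *x, float *y) {
    for (int i = 0; i < n; i++) y[i] = 0.0f;
    for (int jj = 0; jj < n; jj += BLOCK)          /* chunk of x, reused across rows */
        for (int ii = 0; ii < n; ii += BLOCK)      /* block of rows of A and y       */
            for (int i = ii; i < ii + BLOCK && i < n; i++) {
                float sum = 0.0f;
                for (int j = jj; j < jj + BLOCK && j < n; j++)
                    sum += a[(size_t)i*n + j] * x[j];
                y[i] += sum;
            }
}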
And you can certainly do that recursively. But doing a naive implementation, it's slower than the simple double-loop, and way slower than a proper SGEMV library call:
$ ./gemv
Testing for N=4096
Double Loop: time = 0.024995, error = 0.000000
Divide and conquer: time = 0.299945, error = 0.000000
SGEMV: time = 0.013998, error = 0.000000
The code follows:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mkl.h"
float **alloc2d(int n, int m) {
    float *data = malloc(n*m*sizeof(float));
    float **array = malloc(n*sizeof(float *));
    for (int i=0; i<n; i++)
        array[i] = &(data[i*m]);
    return array;
}
void tick(struct timeval *t) {
    gettimeofday(t, NULL);
}

/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}

float checkans(float *y, int n) {
    float err = 0.;
    for (int i=0; i<n; i++)
        err += (y[i] - 1.*i)*(y[i] - 1.*i);
    return err;
}
/* assume square matrix */
void divConquerGEMV(float **a, float *x, float *y, int n,
                    int startr, int endr, int startc, int endc) {
    int nr = endr - startr + 1;
    int nc = endc - startc + 1;
    if (nr == 1 && nc == 1) {
        /* y = Ax: the row index selects the entry of y, the column index selects the entry of x */
        y[startr] += a[startr][startc] * x[startc];
    } else {
        int midr = (endr + startr + 1)/2;
        int midc = (endc + startc + 1)/2;
        divConquerGEMV(a, x, y, n, startr, midr-1, startc, midc-1);
        divConquerGEMV(a, x, y, n, midr,   endr,   startc, midc-1);
        divConquerGEMV(a, x, y, n, startr, midr-1, midc,   endc);
        divConquerGEMV(a, x, y, n, midr,   endr,   midc,   endc);
    }
}
int main(int argc, char **argv) {
    const int n=4096;
    float **a = alloc2d(n,n);
    float *x = malloc(n*sizeof(float));
    float *y = malloc(n*sizeof(float));
    struct timeval clock;
    double eltime;

    printf("Testing for N=%d\n", n);

    for (int i=0; i<n; i++) {
        x[i] = 1.*i;
        for (int j=0; j<n; j++)
            a[i][j] = 0.;
        a[i][i] = 1.;
    }

    /* naive double loop */
    tick(&clock);
    for (int i=0; i<n; i++) {
        y[i] = 0.;
        for (int j=0; j<n; j++) {
            y[i] += a[i][j]*x[j];
        }
    }
    eltime = tock(&clock);
    printf("Double Loop: time = %lf, error = %f\n", eltime, checkans(y,n));

    for (int i=0; i<n; i++) y[i] = 0.;

    /* naive divide and conquer */
    tick(&clock);
    divConquerGEMV(a, x, y, n, 0, n-1, 0, n-1);
    eltime = tock(&clock);
    printf("Divide and conquer: time = %lf, error = %f\n", eltime, checkans(y,n));

    /* decent GEMV implementation */
    tick(&clock);
    float alpha = 1.;
    float beta = 0.;
    int incrx=1;
    int incry=1;
    char trans='N';
    sgemv(&trans,&n,&n,&alpha,&(a[0][0]),&n,x,&incrx,&beta,y,&incry);
    eltime = tock(&clock);
    printf("SGEMV: time = %lf, error = %f\n", eltime, checkans(y,n));

    return 0;
}

Related

Segfault with large int - not enough memory?

I am fairly new to C and to how arrays and memory allocation work. I'm writing a very simple function right now, vector_average(), which computes the mean value between two successive array entries, i.e., the average of entries (i) and (i + 1). This average function is the following:
void
vector_average(double *cc, double *nc, int n)
{
    //#pragma omp parallel for
    double tbeg;
    double tend;
    tbeg = Wtime();
    for (int i = 0; i < n; i++) {
        cc[i] = .5 * (nc[i] + nc[i+1]);
    }
    tend = Wtime();
    printf("vector_average() took %g seconds\n", tend - tbeg);
}
My goal is to set int n extremely high, to the point where it actually takes some time to complete this loop (hence, why I am tracking wall time in this code). I'm passing this function a random test function of x, f(x) = sin(x) + 1/3 * sin(3 x), denoted in this code as x_nc, in main() in the following form:
int
main(int argc, char **argv)
{
    int N = 1.E6;
    double x_nc[N+1];
    double dx = 2. * M_PI / N;
    for (int i = 0; i <= N; i++) {
        double x = i * dx;
        x_nc[i] = sin(x) + 1./3. * sin(3.*x);
    }
    double x_cc[N];
    vector_average(x_cc, x_nc, N);
}
But my problem here is that if I set int N any higher than 1.E5, it segfaults. Please provide any suggestions for how I might set N much higher. Perhaps I have to do something with malloc, but, again, I am new to all of this stuff and I'm not quite sure how I would implement this.
-CJW
A thread's stack is typically limited to about 1 MB on Windows (and a few MB on most other systems). The local array x_nc needs (N+1) * sizeof(double), which is roughly 8 MB for N = 1e6, so it overflows the stack. You should allocate x_nc (and x_cc) on the heap instead:
int
main(int argc, char **argv)
{
    int N = 1.E6;
    double *x_nc = (double*)malloc(sizeof(double)*(N+1));
    double dx = 2. * M_PI / N;
    for (int i = 0; i <= N; i++) {
        double x = i * dx;
        x_nc[i] = sin(x) + 1./3. * sin(3.*x);
    }
    double *x_cc = (double*)malloc(sizeof(double)*N);
    vector_average(x_cc, x_nc, N);
    free(x_nc);
    free(x_cc);
    return 0;
}

Why is my Julia implementation for computing Euclidean distances in 3D faster than my C implementation

I am comparing the time it takes Julia to compute the Euclidean distances between two sets of points in 3D space against an equivalent implementation in C. I was very surprised to observe that (for this particular case and my particular implementations) Julia is 22% faster than C. When I also included @fastmath in the Julia version, it was even 83% faster than C.
This leads to my question: why? Either Julia is more amazing than I originally thought or I am doing something very inefficient in C. I am betting my money on the latter.
Some particulars about the implementation:
In Julia I use 2D arrays of Float64.
In C I use dynamically allocated 1D arrays of double.
In C I use the sqrt function from math.h.
The computations are very fast, therefore I compute them a 1000 times to avoid comparing on the micro/millisecond level.
Some particulars about the compilation:
Compiler: gcc 5.4.0
Optimisation flags: -O3 -ffast-math
Timings:
Julia (without @fastmath): 90 s
Julia (with @fastmath): 20 s
C: 116 s
I use the bash command time for the timings
$ time ./particleDistance.jl (with shebang in file)
$ time ./particleDistance
particleDistance.jl
#!/usr/local/bin/julia
function distance!(x::Array{Float64, 2}, y::Array{Float64, 2}, r::Array{Float64, 2})
    nx = size(x, 1)
    ny = size(y, 1)
    for k = 1:1000
        for j = 1:ny
            @fastmath for i = 1:nx
                @inbounds dx = y[j, 1] - x[i, 1]
                @inbounds dy = y[j, 2] - x[i, 2]
                @inbounds dz = y[j, 3] - x[i, 3]
                rSq = dx*dx + dy*dy + dz*dz
                @inbounds r[i, j] = sqrt(rSq)
            end
        end
    end
end
function main()
    n = 4096
    m = 4096
    x = rand(n, 3)
    y = rand(m, 3)
    r = zeros(n, m)
    distance!(x, y, r)
    println("r[n, m] = $(r[n, m])")
end
main()
particleDistance.c
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
void distance(int n, int m, double* x, double* y, double* r)
{
int i, j, I, J;
double dx, dy, dz, rSq;
for (int k = 0; k < 1000; k++)
{
for (j = 0; j < m; j++)
{
J = 3*j;
for (i = 0; i < n; i++)
{
I = 3*i;
dx = y[J] - x[I];
dy = y[J+1] - x[I+1];
dz = y[J+2] - x[I+2];
rSq = dx*dx + dy*dy + dz*dz;
r[j*n+i] = sqrt(rSq);
}
}
}
}
int main()
{
int i;
int n = 4096;
int m = 4096;
double *x, *y, *r;
size_t xbytes = 3*n*sizeof(double);
size_t ybytes = 3*m*sizeof(double);
x = (double*) malloc(xbytes);
y = (double*) malloc(ybytes);
r = (double*) malloc((size_t)n*m*sizeof(double));
for (i = 0; i < 3*n; i++)
{
x[i] = (double) rand()/RAND_MAX*2.0-1.0;
}
for (i = 0; i < 3*m; i++)
{
y[i] = (double) rand()/RAND_MAX*2.0-1.0;
}
distance(n, m, x, y, r);
printf("r[n*m-1] = %f\n", r[n*m-1]);
free(x);
free(y);
free(r);
return 0;
}
Makefile
all: particleDistance.c
gcc -o particleDistance particleDistance.c -O3 -ffast-math -lm
Maybe this should be a comment, but the point is that Julia is indeed pretty well optimized. On the Julia web page you can see benchmarks where it beats C in some cases (mandel).
I see that you are using -ffast-math in your compilation, but maybe you could do some optimizations in your code (although nowadays compilers are pretty smart and this might not solve the issue):
Instead of using int for your indexes, try unsigned int; this allows you to try the next point.
Instead of multiplying by 3, with an unsigned index you can do a shift and an add, which can save some computation time.
When accessing elements like x[J], maybe try using pointers directly and stepping through the elements sequentially, e.g. x += 3.
Instead of int n and int m, try setting them as macros. If they are known in advance, you can take advantage of that.
Does the malloc make a difference in this case? If n and m are known, fixed-size arrays would reduce the time the OS spends allocating memory.
There might be a few other things, but Julia is pretty well optimized with just-in-time compilation, so everything that is constant and known in advance is used in its favor. I have tried Julia with no regrets.
Your index calculation in C is rather slow.
Try something like the following (I have not compiled it, it may still have errors; it is just to visualize the idea):
void distance(int n, int m, double* x, double* y, double* r)
{
    int i, j;
    double dx, dy, dz, rSq;
    double *X, *Y, *R;
    for (int k = 0; k < 1000; k++)
    {
        R = r;
        Y = y;
        for (j = 0; j < m; j++)
        {
            X = x;
            for (i = 0; i < n; i++)
            {
                dx = Y[0] - *X++;
                dy = Y[1] - *X++;
                dz = Y[2] - *X++;
                rSq = dx*dx + dy*dy + dz*dz;
                *R++ = sqrt(rSq);
            }
            Y += 3;
        }
    }
}
Alternatively you could try the following; it might be a little bit faster (one increment instead of three):
dx = Y[0] - X[0];
dy = Y[1] - X[1];
dz = Y[2] - X[2];
X+=3;
Y[x] is the same as *(Y+x).
Good luck
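One further thing worth trying (a sketch of mine, not something tested in the answers above): restrict-qualify the pointers so gcc can assume the arrays never alias, which often makes the difference for auto-vectorizing the inner loop:

#include <math.h>

/* Same computation as the question's distance(), but with restrict-qualified
   pointers; the body is otherwise unchanged apart from the name. */
void distance_restrict(int n, int m, const double *restrict x,
                       const double *restrict y, double *restrict r)
{
    for (int k = 0; k < 1000; k++)
        for (int j = 0; j < m; j++)
            for (int i = 0; i < n; i++) {
                double dx = y[3*j]     - x[3*i];
                double dy = y[3*j + 1] - x[3*i + 1];
                double dz = y[3*j + 2] - x[3*i + 2];
                r[j*n + i] = sqrt(dx*dx + dy*dy + dz*dz);
            }
}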

OpenMP parallelization (Block Matrix Mult)

I'm attempting to implement block matrix multiplication and make it more parallel.
This is my code :
int i,j,jj,k,kk;
float sum;
int en = 4 * (2048/4);

#pragma omp parallel for collapse(2)
for (i=0; i<2048; i++) {
    for (j=0; j<2048; j++) {
        C[i][j] = 0;
    }
}

for (kk=0; kk<en; kk+=4) {
    for (jj=0; jj<en; jj+=4) {
        for (i=0; i<2048; i++) {
            for (j=jj; j<jj+4; j++) {
                sum = C[i][j];
                for (k=kk; k<kk+4; k++) {
                    sum += A[i][k]*B[k][j];
                }
                C[i][j] = sum;
            }
        }
    }
}
I've been playing around with OpenMP but still have had no luck in figuring what the best way to have this done in the least amount of time.
Getting good performance from matrix multiplication is a big job. Since "The best code is the code I don't have to write", a much better use of your time would be to understand how to use a BLAS library.
If you are using X86 processors, the Intel Math Kernel Library (MKL) is available free, and includes optimized, parallelized, matrix multiplication operations.
https://software.intel.com/en-us/articles/free-mkl
(FWIW, I work for Intel, but not on MKL :-))
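For concreteness, a sketch of what the BLAS route looks like for the 2048x2048 loop above (assuming contiguous row-major float arrays and a CBLAS header, e.g. cblas.h from OpenBLAS; MKL ships an equivalent):

#include <cblas.h>

/* C = A * B for square row-major n-by-n matrices stored as contiguous float arrays */
void matmul_blas(int n, const float *A, const float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* M, N, K        */
                1.0f, A, n,     /* alpha, A, lda  */
                B, n,           /* B, ldb         */
                0.0f, C, n);    /* beta, C, ldc   */
}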
I recently started looking into dense matrix multiplication (GEMM) again. It turns out the Clang compiler is really good at optimizing GEMM without needing any intrinsics (GCC still needs intrinsics). The following code gets 60% of the peak FLOPS of my four-core/eight-hardware-thread Skylake system. It uses block matrix multiplication.
Hyper-threading gives worse performance, so make sure you use only as many threads as you have cores, and bind the threads to prevent thread migration.
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=4
Then compile like this
clang -Ofast -march=native -fopenmp -Wall gemm_so.c
The code
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>
#define SM 80
typedef __attribute((aligned(64))) float * restrict fast_float;
static void reorder2(fast_float a, fast_float b, int n) {
for(int i=0; i<SM; i++) memcpy(&b[i*SM], &a[i*n], sizeof(float)*SM);
}
static void kernel(fast_float a, fast_float b, fast_float c, int n) {
for(int i=0; i<SM; i++) {
for(int k=0; k<SM; k++) {
for(int j=0; j<SM; j++) {
c[i*n + j] += a[i*n + k]*b[k*SM + j];
}
}
}
}
void gemm(fast_float a, fast_float b, fast_float c, int n) {
int bk = n/SM;
#pragma omp parallel
{
float *b2 = _mm_malloc(sizeof(float)*SM*SM, 64);
#pragma omp for collapse(3)
for(int i=0; i<bk; i++) {
for(int j=0; j<bk; j++) {
for(int k=0; k<bk; k++) {
reorder2(&b[SM*(k*n + j)], b2, n);
kernel(&a[SM*(i*n+k)], b2, &c[SM*(i*n+j)], n);
}
}
}
_mm_free(b2);
}
}
static int doublecmp(const void *x, const void *y) { return *(double*)x < *(double*)y ? -1 : *(double*)x > *(double*)y; }
double median(double *x, int n) {
qsort(x, n, sizeof(double), doublecmp);
return 0.5f*(x[n/2] + x[(n-1)/2]);
}
int main(void) {
int cores = 4;
double frequency = 3.1; // i7-6700HQ turbo 4 cores
double peak = 32*cores*frequency;
int n = SM*10*2;
int mem = sizeof(float) * n * n;
float *a = _mm_malloc(mem, 64);
float *b = _mm_malloc(mem, 64);
float *c = _mm_malloc(mem, 64);
memset(a, 1, mem), memset(b, 1, mem);
printf("%dx%d matrix\n", n, n);
printf("memory of matrices: %.2f MB\n", 3.0*mem*1E-6);
printf("peak SP GFLOPS %.2f\n", peak);
puts("");
while(1) {
int r = 10;
double times[r];
for(int j=0; j<r; j++) {
times[j] = -omp_get_wtime();
gemm(a, b, c, n);
times[j] += omp_get_wtime();
}
double flop = 2.0*1E-9*n*n*n; //GFLOP
double time_mid = median(times, r);
double flops_low = flop/times[r-1], flops_mid = flop/time_mid, flops_high = flop/times[0];
printf("%.2f %.2f %.2f %.2f\n", 100*flops_low/peak, 100*flops_mid/peak, 100*flops_high/peak, flops_high);
}
}
This does GEMM 10 times per iteration of an infinite loop and prints the low, median, and high ratio of FLOPS to peak_FLOPS, and finally the highest measured FLOPS.
You will need to adjust the following lines
int cores = 4;
double frequency = 3.1; // i7-6700HQ turbo 4 cores
double peak = 32*cores*frequency;
to the number of physical cores, the frequency for all cores (with turbo if enabled), and the number of single-precision floating point operations per core per cycle, which is 16 for Core2 through Ivy Bridge, 32 for Haswell through Kaby Lake, and 64 for the Xeon Phi Knights Landing.
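For example, with the numbers already assumed in the code above: a core in the Haswell-to-Kaby-Lake range can issue two 8-wide single-precision FMAs per cycle, i.e. 2 x 8 x 2 = 32 SP FLOPs per cycle, so peak = 32 x 4 cores x 3.1 GHz = 396.8 SP GFLOPS, which is exactly what the peak variable computes.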
This code may be less efficient on NUMA systems. It also does not do nearly as well on Knights Landing (I just started looking into this).

C code for particle interaction in 2d

Imagine that you have a particle at each coordinate of a 2D Cartesian plane. Each particle emits a substance that diffuses in all directions, with a decay over distance based on a Bessel function, and the other particles each absorb this substance. Thus all particles at the same distance from a given particle have the same influence on that particle. (The original post illustrated this with a figure here.)
I'm calculating such an interaction using this code:
EDIT:31/03: Complete code for both.
#include <stdio.h>  // for the input and output routines
#include <stdlib.h> //
#include <stdarg.h> // to read the elements of the command line
#include <math.h>
#include <string.h>
#include <ctype.h>
#include <malloc.h>
#include <time.h>
#include"ran1.c"
#include"bessel.c"
#define tmax 90000
#define N 50
#define beta 0.001
#define sigma 0.001
#define pi acos(-1.0)
#define trans 50000
#define epsilon 0.1
void condicoes_iniciais(double **xold,double **yold,double **a)
{
int l,j;
long idum=-120534;
for(l=0;l<= N; l++)
{
for(j=0;j<= N; j++)
{
a[l][j]=5.0;
}
}
for(l=0;l<= N; l++)
{
for(j=0;j<= N; j++)
{
while(a[l][j]>4.4)
a[l][j]=4.1+ran1(& idum);
}
}
for(l=0;l<= N; l++)
{
for(j=0;j<= N; j++)
{
xold[l][j]=0.1*ran1(& idum);
}
}
for(l=0;l<= N; l++)
{
for(j=0;j<= N; j++)
{
yold[l][j]=0.1*ran1(& idum);
}
}
}
void Matriz_Bessel(double **Bess,double gama)
{
int x,y;
double r;
for(x=0;x<=N;x++)
{
for(y=0;y<=N;y++)
{
if(y!=0 || x!=0)
{
r = gama*sqrt(x*x +y*y);
Bess[x][y] = bessk0(r);
}
}
}
}
void acoplamento(int x, int y,double **xold, double *Acopl,double **Bess)
{
int j, i, h, k,xdist, ydist;
int Nmeio = N/2;
double Xf;
Xf = 0;
for(i=0;i<=N;i++)
{
for(j=0;j<=N;j++)
{
h = x+i;
k = y+j;
ydist = j;
xdist = i;
if(i>Nmeio)
{
h = x +i;
xdist = (N+1) -h +x;
}
if(h>N)
{
h=h-(N+1);
xdist = x-h;
if(xdist >Nmeio){xdist = i;
}
}
if(j>Nmeio)
{
k = y +j;
ydist = (N+1) -k +y;
}
if(k>N)
{
k=k-(N+1);
ydist = y-k;
if(ydist >Nmeio){ydist = j;
}
}
if(ydist!=0 || xdist!=0)
{
Xf = Xf +Bess[xdist][ydist]*xold[h][k];
}
}
}
*Acopl = Xf;
}
void constante(double *c, double gama, double **Bess){
double soma;
int x, y;
soma = 0;
for(x=0;x<=(N/2);x++)
{
for(y=0;y<=(N/2);y++)
{
if(y!=0 || x!=0)
{
soma = soma +Bess[x][y];
}
}
}
*c = (1/(4*soma));
}
int main(int argc, char* argv[])
{
double **xold, **xnew, **yold, **ynew, **a;
double gama, C;
int x,y;
int t,i;
double Mn, acopl;
char arqnome[100];
FILE *fout;
double **Bess;
Bess= (double**)malloc(sizeof(double*)*(N+3));
for(i=0; i<(N+3); i++){Bess[i] = (double*)malloc(sizeof(double)* (N+3));}
xold= (double**)malloc(sizeof(double*)*(N+3));
for(i=0; i<(N+3); i++){xold[i] = (double*)malloc(sizeof(double)* (N+3));}
yold= (double**)malloc(sizeof(double*)*(N+3));
for(i=0; i<(N+3); i++){yold[i] = (double*)malloc(sizeof(double)*(N+3));}
xnew= (double**)malloc(sizeof(double*)*(N+3));
for(i=0; i<(N+3); i++){xnew[i] = (double*)malloc(sizeof(double)*(N+3));}
ynew= (double**)malloc(sizeof(double*)*(N+3));
for(i=0; i<(N+3); i++){ynew[i] = (double*)malloc(sizeof(double)*(N+3));}
a= (double**)malloc(sizeof(double*)*(N+3));
for(i=0; i<(N+3); i++){a[i] = (double*)malloc(sizeof(double)*(N+3));}
srand (time(NULL));
gama = 0.005;
sprintf(arqnome,"serie_%.3f_%.3f.dat",gama,epsilon);
fout = fopen(arqnome,"w");
Matriz_Bessel(Bess,gama);
condicoes_iniciais(xold,yold,a);
a[0][0] = 4.1;
a[N/2][N/2] = 4.3;
constante(&C, gama,Bess);
for(t=0;t<=tmax;t++)
{
Mn = 0;
for(x=0;x<=N;x++)
{
for(y=0;y<=N;y++)
{
acoplamento(x,y,xold,&acopl,Bess);
xnew[x][y] = (a[x][y]/(1+xold[x][y]*xold[x][y])) +yold[x][y] + epsilon*C*acopl;
ynew[x][y] = yold[x][y] - sigma*xold[x][y] - beta;
Mn = Mn + xnew[x][y];
xold[x][y] = xnew[x][y];
yold[x][y] = ynew[x][y];
}
}
if(t>trans){fprintf(fout,"%d %f %f %f %f %f\n",(t-trans),xold[0][0],yold[0][0],xold[N/2][N/2],yold[N/2][N/2],Mn/((N+1)*(N+1)));}
}
return 0;
}
Bess[N][N] is the Bessel function for each radius, which is calculated using Numerical Recipes routines. This program takes around 1 hour to finish.
With the suggestion of Francis I have:
#include <fftw3.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include"bessel.c"
#include"ran1.c"
#define tmax 90000
#define beta 0.001
#define N 50
#define sigma 0.001
#define pi acos(-1.0)
#define trans 50000
#define epsilon 0.1
void condicoes_iniciais(double *xold,double *yold,double *a)
{
int l;
long idum=-120534;
for(l=0;l<= N*N; l++){
a[l]=5.0;}
for(l=0;l<= N*N; l++){
while(a[l]>4.4)
a[l]=4.1+ran1(& idum);}
for(l=0;l<=N* N; l++){
xold[l]=0.1*ran1(& idum);
yold[l]=0.1*ran1(& idum);}
a[0]=4.1;
a[N]=4.4;
}
void Matriz_Bessel(double *bessel,double gama)
{
int x,y,i,j;
double dist;
for(x=0,i=-N/2;x<N;x++,i++)
{
for(y=0,j=-N/2;y<N;y++,j++)
{
double dist=sqrt(i*i+j*j);
if(dist>0){
bessel[x*N+y]=bessk0(gama*dist);
}
else{
bessel[x*N+y]=1;
}
}
}
}
void constante(double *c, double *bessel)
{
int x;
int y;
double soma = 0;
for(x=0;x<N;x++){
for(y=0;y<N;y++){
soma = soma + bessel[x*N+y];
}}
*c =(1/(4*soma));
}
int main(int argc, char* argv[]){
double *xnew=fftw_malloc(sizeof(double)*N*N);
double *acopl=fftw_malloc(sizeof(double)*N*N);
double *xold=malloc(sizeof(double)*N*N);
double *yold = malloc(sizeof(double)*N*N);
double *a = malloc(sizeof(double)*N*N);
fftw_complex *xfourier;
xfourier = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*(N/2+1)*N);
fftw_complex *aux;
aux= (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*(N/2+1)*N);
double *bessel= fftw_malloc(sizeof(double)*N*N);
fftw_complex *besself;
besself=fftw_malloc(sizeof(fftw_complex)*(N/2+1)*N);
double scale=1.0/(N*N);
int t,i;
double gama,Mn,C;
gama = 0.005;
char arqnome[1000];
FILE *fout;
sprintf(arqnome,"opt2_tamanho_plato_%.3f_%d.dat",gama,N);
fout = fopen(arqnome,"w");
//initial
printf("initial\n");
condicoes_iniciais(xold,yold,a);
//xold[(N/2)*N+N/2]=1;
// fftw_plan
printf("fftw_plan\n");
fftw_plan plan;
plan=fftw_plan_dft_r2c_2d(N, N, xnew, xfourier, FFTW_MEASURE | FFTW_PRESERVE_INPUT);
fftw_plan planb;
planb=fftw_plan_dft_r2c_2d(N, N,(double*) bessel, besself, FFTW_MEASURE);
fftw_plan plani;
plani=fftw_plan_dft_c2r_2d(N, N, aux, acopl, FFTW_MEASURE);
Matriz_Bessel(bessel,gama);
constante(&C, bessel);
fftw_execute(planb);
//time loop
printf("time loop\n");
for(t=0;t<=tmax;t++){
//convolution= products in fourier space
fftw_execute(plan);
for(i=0;i<N*(N/2+1);i++){
aux[i][0]=(xfourier[i][0]*besself[i][0]-xfourier[i][2]*besself[i][3]);
aux[i][4]=(xfourier[i][0]*besself[i][5]+xfourier[i][6]*besself[i][0]);
}
fftw_execute(plani);//xnew is updated
Mn = 0;
for(i=0;i<N*N;i++){
xnew[i]=(a[i]/(1+xold[i]*xold[i])) +yold[i] + epsilon*C* (acopl[i]/(double)(N*N));
yold[i] = yold[i] - sigma*xold[i] - beta;
Mn = Mn +xnew[i];
}
memcpy(xold,xnew,N*N*sizeof(double));
if(t>trans){fprintf(fout,"%d %f %f %f %f %f\n",(t-trans),xold[0],yold[0],xold[N],yold[N],Mn/((N+1)*(N+1)));}
}
printf("destroy\n");
fftw_destroy_plan(plan);
fftw_destroy_plan(plani);
fftw_destroy_plan(planb);
printf("free\n");
fftw_free(bessel);
fftw_free(xnew);
fftw_free(xold);
fftw_free(yold);
fftw_free(besself);
fftw_free(xfourier);
return 0;
}
This takes around 1 min to finish, but I got these results (the original post showed a plot here).
The scale factor in the fftw3 code has to be that value; I don't know how to make it work.
The operation you are describing is called a convolution. Let f(x,y) be your periodic sources and B(x,y) the Bessel function. You are trying to compute the convolution f* = B * f, i.e. f*(x,y) is the sum over all source positions (x',y') of B(x-x', y-y') f(x',y').
Discretized on a grid of size N+1, it reads: f*[i][j] = sum over all k,l of B[i-k][j-l] * f[k][l].
Since this sum is performed at all points, the complexity is very high: O(N^4). It means that the number of operations to be performed is of the magnitude of N*N*N*N. How can this complexity be reduced?
If B(x,y) gets rapidly small as the distance increases, long-range interactions may be neglected and the window of the convolution may be reduced. This will affect the precision of the output and may not be acceptable for your problem. Let N_W << N be the size of this window. The sum is then restricted to |i-k| <= N_W and |j-l| <= N_W.
And the number of operations to be performed is about N*N*N_W*N_W << N^4.
Yet, from a practical point of view, the kernel has to be very small for the method described above to be interesting. Since the Bessel functions decrease slowly, approximately as 1/sqrt(x) (from Abramowitz and Stegun: Handbook of Mathematical Functions, p. 364), the previous method is unlikely to be successful.
According to the convolution theorem, the Discrete Fourier Transform may be applied to convolve periodic signals. A convolution in real space reduces to pointwise products of the corresponding frequencies in Fourier space.
The algorithm is the following :
1 Compute the DFT of f, named hatf
2 Compute the DFT of B, named hatB
3 For all frequencies p,q, perform the product :
hatf*(p,q)=hatf(p,q)*hatB(p,q)
4 Inverse the DFT to get f*
The method described above is really efficient, since its complexity is that of the 2D DFT, i.e. N*N*log(N). Moreover, dedicated libraries such as FFTW make it easy to implement. Take a look at fftw_plan_dft_r2c_2d and be careful about the data layout.
EDIT: I still think there is a way to make it work... Here is some starting code; compile it with gcc main.c -o main -lfftw3 -lm
#include <fftw3.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
int save_image(int N,double* data,int nb){
char filename[1000];
sprintf(filename,"xxx%d.vtk",nb);
FILE * pFile;
pFile = fopen (filename,"w");
if (pFile!=NULL)
{
fputs ("# vtk DataFile Version 2.0\n",pFile);
fputs ("Volume example\n",pFile);
fputs ("ASCII\n",pFile);
fputs ("DATASET STRUCTURED_POINTS\n",pFile);
fprintf(pFile,"DIMENSIONS %d %d 1\n",N,N);
fputs ("ASPECT_RATIO 1 1 1\n",pFile);
fputs ("ORIGIN 0 0 0\n",pFile);
fprintf(pFile,"POINT_DATA %d\n",N*N);
fputs ("SCALARS volume_scalars float 1\n",pFile);
fputs ("LOOKUP_TABLE default\n",pFile);
int i;
for(i=0;i<N*N;i++){
fprintf(pFile,"%f ",data[i]);
}
fclose (pFile);
}
return 0;
}
int main(int argc, char* argv[]){
int N=64;
double *xnew=fftw_malloc(sizeof(double)*N*N);
double *xold=fftw_malloc(sizeof(double)*N*N);
double *yold=fftw_malloc(sizeof(double)*N*N);
fftw_complex *xfourier=fftw_malloc(sizeof(fftw_complex)*(N/2+1)*N);
double *bessel=fftw_malloc(sizeof(double)*N*N);
fftw_complex *besself=fftw_malloc(sizeof(fftw_complex)*(N/2+1)*N);
//initial
printf("initial\n");
memset(xold,0,sizeof(double)*N*N);
memset(yold,0,sizeof(double)*N*N);
xold[(N/2)*N+N/2]=1;
// fftw_plan
printf("fftw_plan\n");
fftw_plan plan;
plan=fftw_plan_dft_r2c_2d(N, N, xold, xfourier, FFTW_ESTIMATE | FFTW_PRESERVE_INPUT);
fftw_plan planb;
planb=fftw_plan_dft_r2c_2d(N, N,(double*) bessel, besself, FFTW_ESTIMATE);
fftw_plan plani;
plani=fftw_plan_dft_c2r_2d(N, N, xfourier, xnew, FFTW_ESTIMATE);
//bessel function
//crude approximate of bessel...
printf("bessel function\n");
double dx=1.0/(double)N;
double dy=1.0/(double)N;
int x,y;int i,j;
for(x=0,i=-N/2;x<N;x++,i++){
for(y=0,j=-N/2;y<N;y++,j++){
double dist=sqrt(dx*dx*(i*i+j*j));
double range=0.01;
dist=dist/range;
if(dist>0){
bessel[x*N+y]=sqrt(2./(M_PI*dist))*cos(dist-M_PI/4.0);
}else{
bessel[x*N+y]=1;
}
}
}
fftw_execute(planb);
fftw_destroy_plan(planb);
fftw_free(bessel);
//time loop
printf("time loop\n");
int t,tmax=100;
for(t=0;t<=tmax;t++){
save_image(N,xold,t);
printf("t=%d\n",t);
//convolution= products in fourier space
fftw_execute(plan);
double scale=1.0/((double)N*N);
//scale*=scale; //may be needed to correct scaling
for(i=0;i<N*(N/2+1);i++){
double re = xfourier[i][0], im = xfourier[i][1]; /* save both parts before overwriting them */
xfourier[i][0]=(re*besself[i][0]-im*besself[i][1])*scale;
xfourier[i][1]=(re*besself[i][1]+im*besself[i][0])*scale;
}
fftw_execute(plani);//xnew is updated
double C=1;double epsilon=1; double a=1; double beta=1;double sigma=1;
for(i=0;i<N*N;i++){
xnew[i]=(a/(1+xold[i]*xold[i])) +yold[i] + epsilon*C*xnew[i];
yold[i] = yold[i] - sigma*xold[i] - beta;
}
memcpy(xold,xnew,N*N*sizeof(double));
}
printf("destroy\n");
fftw_destroy_plan(plan);
fftw_destroy_plan(plani);
// fftw_destroy_plan(planb);
printf("free\n");
fftw_free(xnew);
fftw_free(xold);
fftw_free(yold);
fftw_free(besself);
fftw_free(xfourier);
return 0;
}
It produces some vtk images of xold which may be opened with the ParaView software. It is likely that saving the images slows down the computations...
My coefficients are wrong, so the output is wrong...
EDIT: Here is a piece of code based on yours, to be compiled with gcc main.c -o main -lfftw3 -lm. I found bessk0.c and bessi0.c.
The code reads:
#include <fftw3.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include"bessi0.c"
#include"bessk0.c"
//#include"bessel.c"
//#include"ran1.c"
#define tmax 90000
#define beta 0.001
#define N 50
#define sigma 0.001
#define pi acos(-1.0)
#define trans 50000
#define epsilon 0.1
double ran1(long* idum){
return ((double)rand())/((double)RAND_MAX);
}
void condicoes_iniciais(double *xold,double *yold,double *a)
{
int l;
long idum=-120534;
for(l=0;l<= N*N; l++){
a[l]=5.0;}
for(l=0;l<= N*N; l++){
while(a[l]>4.4)
a[l]=4.1+ran1(& idum);}
for(l=0;l<=N* N; l++){
xold[l]=0.1*ran1(& idum);
yold[l]=0.1*ran1(& idum);
//printf("%g %g %g\n",xold[l],yold[l],a[l]);
}
a[0]=4.1;
a[N]=4.4;
}
void Matriz_Bessel(double *bessel,double gama)
{
int x,y,i,j;
double dist;
for(x=0,i=-N/2;x<N;x++,i++)
{
for(y=0,j=-N/2;y<N;y++,j++)
{
double dist=sqrt(i*i+j*j);
if(dist>0){
bessel[x*N+y]=bessk0(gama*dist);
//printf("%g %g\n",dist,bessel[x*N+y]);
}
else{
bessel[x*N+y]=1;
}
}
}
}
void constante(double *c, double *bessel)
{
int x;
int y;
double soma = 0;
for(x=0;x<N;x++){
for(y=0;y<N;y++){
soma = soma + bessel[x*N+y];
}}
// *c =(1.0/(4.0*soma));
*c =(1.0/(soma));
}
int main(int argc, char* argv[]){
//srand (time(NULL));
srand (0);
double *xnew=fftw_malloc(sizeof(double)*N*N);
double *acopl=fftw_malloc(sizeof(double)*N*N);
double *xold=malloc(sizeof(double)*N*N);
double *yold = malloc(sizeof(double)*N*N);
double *a = malloc(sizeof(double)*N*N);
fftw_complex *xfourier;
xfourier = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*(N/2+1)*N);
fftw_complex *aux;
aux= (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*(N/2+1)*N);
double *bessel= fftw_malloc(sizeof(double)*N*N);
fftw_complex *besself;
besself=fftw_malloc(sizeof(fftw_complex)*(N/2+1)*N);
double scale=1.0/((double)N*N);
int t,i;
double gama,Mn,C;
gama = 0.005;
char arqnome[1000];
FILE *fout;
sprintf(arqnome,"opt2_tamanho_plato_%.3f_%d.dat",gama,N);
fout = fopen(arqnome,"w");
//initial
printf("initial\n");
condicoes_iniciais(xold,yold,a);
//xold[(N/2)*N+N/2]=1;
// fftw_plan
printf("fftw_plan\n");
fftw_plan plan;
plan=fftw_plan_dft_r2c_2d(N, N, xnew, xfourier, FFTW_MEASURE | FFTW_PRESERVE_INPUT);
fftw_plan planb;
planb=fftw_plan_dft_r2c_2d(N, N, bessel, besself, FFTW_MEASURE);
fftw_plan plani;
plani=fftw_plan_dft_c2r_2d(N, N, aux, acopl, FFTW_MEASURE);
Matriz_Bessel(bessel,gama);
constante(&C, bessel);
fftw_execute(planb);
//time loop
printf("time loop\n");
for(t=0;t<=tmax;t++){
//convolution= products in fourier space
fftw_execute(plan);
for(i=0;i<N*(N/2+1);i++){
aux[i][0]=(xfourier[i][0]*besself[i][0]-xfourier[i][1]*besself[i][1]);
aux[i][1]=(xfourier[i][0]*besself[i][1]+xfourier[i][1]*besself[i][0]);
}
fftw_execute(plani);//xnew is updated
Mn = 0;
for(i=0;i<N*N;i++){
xnew[i]=(a[i]/(1+xold[i]*xold[i])) +yold[i] + epsilon*C* (acopl[i]/(double)(N*N));
yold[i] = yold[i] - sigma*xold[i] - beta;
Mn = Mn +xnew[i];
}
memcpy(xold,xnew,N*N*sizeof(double));
if(t>trans){fprintf(fout,"%d %f %f %f %f %f\n",(t-trans),xold[0],yold[0],xold[N],yold[N],Mn/((N+1)*(N+1)));}
}
printf("destroy\n");
fftw_destroy_plan(plan);
fftw_destroy_plan(plani);
fftw_destroy_plan(planb);
printf("free\n");
fftw_free(bessel);
fftw_free(xnew);
fftw_free(xold);
fftw_free(yold);
fftw_free(besself);
fftw_free(xfourier);
fftw_free(aux);
fftw_free(acopl);
return 0;
}
The result is the following (the original answer showed a plot here):
The lines :
aux[i][0]=(xfourier[i][0]*besself[i][0]-xfourier[i][1]*besself[i][1]);
aux[i][1]=(xfourier[i][0]*besself[i][1]+xfourier[i][1]*besself[i][0]);
correspond to the product of two complex numbers: aux[i] is a complex number, aux[i][0] is its real part and aux[i][1] its imaginary part. Hence aux[i][4] does not correspond to anything meaningful. These complex numbers are the magnitudes of the frequencies in Fourier space.
I also modified the constant : *c =(1.0/(soma));
Do not forget to add srand(0) if you wish to compare outputs and build the initial state in the same way.
You could perhaps use the symmetry of the grid to reduce the number of computations needed, especially if you are modelling an infinite periodic system, as the apparent wrap-around logic makes me think you may be doing.
Consider:
the same influence is exerted on a particle at the coordinates [35][35] by particles at [35 - x][35], [35 + x][35], [35][35 - x], and [35][35 + x], for any x; also,
another influence is exerted equally by the particles at [35 - x][35 - x], [35 + x][35 - x], [35 - x][35 + x], and [35 + x][35 + x], for any x; and
yet another influence is exerted equally by the particles at [35 + x][35 + y], [35 + x][35 - y], [35 - x][35 + y], [35 - x][35 - y], [35 + y][35 + x], [35 + y][35 - x], [35 - y][35 + x], and [35 - y][35 - x], for any x != y.
You should be able to speed your computation by a little less than a factor of 8 by using those equivalences.
If indeed you are simulating an infinite periodic system, however, then I observe that your approach incorporates a bias: by computing the influences from a square grid, you are including the influence of some of the particles at distances between N and sqrt(2) * N from the target, but not of others. You should compute on a (virtual) disc, instead, to avoid such bias.
Furthermore, the appearance of input parameters x and y leads me to suppose that you are performing that computation once for each grid position. If, again, you are modelling an infinite, periodic grid with an emitter at each grid point, and in which each point's influence depends only on distance, then every point will experience the same influence. You could cut your runtime several thousand-fold, and reduce the asymptotic complexity of your algorithm if you can make use of that.
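As one small, concrete instance of the symmetry noted above (a sketch reusing the names from the question's code; it only exploits the mirror symmetry Bess[x][y] == Bess[y][x], i.e. a 2x saving on building the kernel table, not the full factor of 8 on the interaction sum):

/* Fill only the x <= y half of the Bessel table and mirror it. */
void Matriz_Bessel_sym(double **Bess, double gama) {
    for (int x = 0; x <= N; x++) {
        for (int y = x; y <= N; y++) {
            if (x != 0 || y != 0) {
                double v = bessk0(gama*sqrt((double)(x*x + y*y)));
                Bess[x][y] = v;
                Bess[y][x] = v;   /* the distance, hence the influence, is symmetric */
            }
        }
    }
}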

SSE Intrinsics arithmetic error

I've been experimenting with SSE intrinsics and I seem to have run into a weird bug that I can't figure out. I am computing the inner product of two float arrays, 4 elements at a time.
For testing I've set each element of both arrays to 1, so the product should be == size.
It runs correctly, but whenever I run the code with size > ~68000000 the code using the sse intrinsics starts computing the wrong inner product. It seems to get stuck at a certain sum and never exceeds this number. Here is an example run:
joe:~$./test_sse 70000000
sequential inner product: 70000000.000000
sse inner product: 67108864.000000
sequential time: 0.417932
sse time: 0.274255
Compilation:
gcc -fopenmp test_sse.c -o test_sse -std=c99
This error seems to be consistent amongst the handful of computers I've tested it on. Here is the code, perhaps someone might be able to help me figure out what is going on:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <math.h>
#include <assert.h>
#include <xmmintrin.h>
double inner_product_sequential(float * a, float * b, unsigned int size) {
double sum = 0;
for(unsigned int i = 0; i < size; i++) {
sum += a[i] * b[i];
}
return sum;
}
double inner_product_sse(float * a, float * b, unsigned int size) {
assert(size % 4 == 0);
__m128 X, Y, Z;
Z = _mm_set1_ps(0.0f);
float arr[4] __attribute__((aligned(sizeof(float) * 4)));
for(unsigned int i = 0; i < size; i += 4) {
X = _mm_load_ps(a+i);
Y = _mm_load_ps(b+i);
X = _mm_mul_ps(X, Y);
Z = _mm_add_ps(X, Z);
}
_mm_store_ps(arr, Z);
return arr[0] + arr[1] + arr[2] + arr[3];
}
int main(int argc, char ** argv) {
if(argc < 2) {
fprintf(stderr, "usage: ./test_sse <size>\n");
exit(EXIT_FAILURE);
}
unsigned int size = atoi(argv[1]);
srand(time(0));
float *a = (float *) _mm_malloc(size * sizeof(float), sizeof(float) * 4);
float *b = (float *) _mm_malloc(size * sizeof(float), sizeof(float) * 4);
for(int i = 0; i < size; i++) {
a[i] = b[i] = 1;
}
double start, time_seq, time_sse;
start = omp_get_wtime();
double inner_seq = inner_product_sequential(a, b, size);
time_seq = omp_get_wtime() - start;
start = omp_get_wtime();
double inner_sse = inner_product_sse(a, b, size);
time_sse = omp_get_wtime() - start;
printf("sequential inner product: %f\n", inner_seq);
printf("sse inner product: %f\n", inner_sse);
printf("sequential time: %f\n", time_seq);
printf("sse time: %f\n", time_sse);
_mm_free(a);
_mm_free(b);
}
You are running into the precision limit of single precision floating point numbers. The number 16777216 (2^24), which is the value of each component of the vector Z when it reaches the "limit" inner product, is represented in 32-bit floating point as hexadecimal 0x4b800000, or binary 0 10010111 00000000000000000000000, i.e. the 23-bit mantissa is all zeros (with an implicit leading 1 bit), and the 8-bit exponent field is 151, representing the exponent 151 - 127 = 24. Adding 1 to that value would require increasing the exponent, and then the added one can no longer be represented in the mantissa, so in single precision floating point arithmetic 2^24 + 1 = 2^24.
You do not see that in your sequential function because there you are using a 64-bit double precision value to store the result, and since we are working on an x86 platform, internally an 80-bit extended precision register is most probably used.
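A minimal stand-alone check of that limit (my own illustration, not part of the original answer):

#include <stdio.h>

int main(void) {
    float f = 16777216.0f;           /* 2^24 */
    float g = f + 1.0f;              /* rounds back down to 2^24 in single precision */
    printf("%.1f %.1f\n", f, g);     /* prints 16777216.0 16777216.0 */
    printf("%d\n", g == f);          /* prints 1 */
    return 0;
}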
You can force single precision to be used throughout in your sequential code by rewriting it as
float sum;
float inner_product_sequential(float * a, float * b, unsigned int size) {
    sum = 0;
    for(unsigned int i = 0; i < size; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
and you will see 16777216.000000 as maximum computed value.
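If you want to keep the vectorized loop and still avoid the limit, one option (a sketch, assuming SSE2 is available and, as in the original, that size is a multiple of 4) is to widen each group of products to double precision before accumulating:

#include <emmintrin.h>  /* SSE2 */

double inner_product_sse_double_acc(const float *a, const float *b, unsigned int size) {
    __m128d sum_lo = _mm_setzero_pd();
    __m128d sum_hi = _mm_setzero_pd();
    for (unsigned int i = 0; i < size; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
        /* widen the four float products to doubles before accumulating */
        sum_lo = _mm_add_pd(sum_lo, _mm_cvtps_pd(prod));                       /* elements 0,1 */
        sum_hi = _mm_add_pd(sum_hi, _mm_cvtps_pd(_mm_movehl_ps(prod, prod)));  /* elements 2,3 */
    }
    double tmp[2];
    _mm_storeu_pd(tmp, _mm_add_pd(sum_lo, sum_hi));
    return tmp[0] + tmp[1];
}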
