Background
Imagine that we have N particles inside a box of length L, which interact with each other (through a Lennard Jones potential).
I want to compute the total potential energy of the system. I implemented the function POT which calculates all the contributions from all the particles and gives the correct results (this is tested and can be assumed true).
I also wrote a function POT_ONE which only calculates the potential energy of one particle with respect to all the others. This means that if I want to calculate the total potential energy I will have to call this function N times (making sure that the particle does not interact with itself) and then divide by 2 since I double count the interactions.
Goal
My goal is to make the second function yield the same results as the first one.
Problem
There is something really strange going on: If I put 4 particles, the two functions give the same results. If I put a fifth one then there is deviation. Then for 6,7,8 particles,again, it gives correct results and then for N=9 I am getting a different result. In the case N=1000 the result that I am getting from POT_ONE is somemthing like 113383820348202024.
My results for N=5 are:
-0.003911 with POT and
12.864234 with POT_ONE
In case someone tries to run the code and wants to check the N=4 case, he/she should change the number of particles (np) which is defined as global variable and then comment the line pos[12]=1;pos[13]=1;pos[14]=1;.
Code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
/*GENERAL PARAMETERS*/
int dim=3; //number of dimensions
int np=5; //number of particles
double L=36.413; //box length (A)
double invL=1/36.413; //inverse of box length
/*ARGON CHARACTERISTICS*/
double sig=3.4; // Angstroms (A)
double e=0.001; // eV
double distSQ(double array[]){
/*calculates the squared distance given the array x=[dx,dy,dz]*/
int i;
double r2=0;
for(i=0;i<dim;i++) r2+=array[i]*array[i];
return r2;
}//distSQ
void MIC(double dr[],double L, int dim){
/* MINIMUM IMAGE CONVENTION: dr[] is the array dr = [dx,dy,dz] describing relative
positions of two particles, L is the box length, dim the number
of dimensions */
int i;
for(i=0;i<dim;i++) dr[i]-=round(dr[i]*invL)*L;
}//MIC
void POT(double x1[],double* potential){
/*given the positions of each particle in the form x=[x0,y0,z0;x1,y1,z1;...;xn-1,yn-1,zn-1],
the number of dimensions dim and particles np, it calculates the potential energy of the configuration*/
//variables for potential calculation
int i,j,k;
double *x2;
double r2inv; // 1/r^2
double foo,bar;
double dr[dim];
*potential=0; // set potential energy to zero
//main part of POT
for(i=0;i<np-1;i++){
x2=x1+dim;
for(j=i+1;j<np;j++){
//calculate relative distances between particles i & j
//apply periodic BCs and then calculate squared distance
//and the potential energy between them.
for(k=0;k<dim;k++) dr[k] = x2[k]-x1[k];
MIC(dr,L,dim); //periodic boundary conditions
r2inv=1/distSQ(dr);
//calculate potential energy
foo = sig*sig*r2inv;
bar = foo*foo*foo;
*potential+=bar*(bar-1);
}//for j
x1+=dim;
}//for i
*potential*=4*e; //scale and give energy units
}//POT
void POT_ONE(int particle,double pos[],double* potential){
*potential=0;
int i,k;
double dr[dim];
double r2inv,foo,bar;
double par_pos[dim];
int index=particle*dim;
par_pos[0]=pos[index];
par_pos[1]=pos[index+1];
par_pos[2]=pos[index+2];
for(i=0;i<np;i++){
if(i!=particle){
for(k=0;k<dim;k++) dr[k]=pos[k]-par_pos[k];
MIC(dr,L,dim);
r2inv=1/distSQ(dr);
foo=sig*sig*r2inv;
bar=foo*foo*foo;
*potential+=bar*(bar-1);
}
pos+=dim;
}
*potential*=4*e; //scale and give energy units
}//POT_ONE
int main(){
int D=np*dim;
double* pos=malloc(D*sizeof(double));
double potential=0; //calculated with POT
double U=0; ////calculated with POT_ONE
double tempU=0;
pos[0]=0;pos[1]=0;pos[2]=0;
pos[3]=4;pos[4]=0;pos[5]=0;
pos[6]=0;pos[7]=4;pos[8]=0;
pos[9]=0;pos[10]=0;pos[11]=4;
pos[12]=1;pos[13]=1;pos[14]=1;
POT(pos,&potential);
printf("POT: %f\n", potential);
int i,j;
for(i=0;i<np;i++){
POT_ONE(i,pos,&tempU);
U+=tempU;
}
U=U/2;
printf("POT_ONE: %f\n\n", U);
return 0;
}
Your error is in POT, where you forgot to update x2 at the end of the inner loop.
for (i = 0; i < np - 1; i++) {
double *x2 = x1 + dim;
for (j = i + 1; j < np; j++) {
// ... calculate stuff ..
x2 += dim;
}
x1 += dim;
}
An easier and arguably more readable variant is to forgo pointer arithmetic altogether and use boring old indices:
for (k = 0; k < dim; k++) {
dr[k] = x[j * dim + k] - x[i * dim + k];
}
Further observations:
Please make your variables local to the scope where they are used. A large list of uninitialized variables at the top of the function makes it very hard to track variables, even in a short function like yours.
Please consider returning single values from functions instead of passing in pointers. In my opinion, that makes functions like the square of the distance more readable.
The structure of your code is hard to see, because everything is run togeher very tightly, even the comments.
Related
I made matrix-vector multiplication program using AVX2, FMA in C. I compiled using GCC ver7 with -mfma, -mavx.
However, I got the error "incorrect checksum for freed object - object was probably modified after being freed."
I think the error would generate if the matrix dimension isn't multiples of 4.
I know AVX2 use ymm register that can use 4 double precision floating point number. Therefore, I can use AVX2 without error in case the matrix is multiples of 4.
But, here is my question.
How can I use AVX2 efficiently if the matrix isn't multiples of 4 ???
Here is my code.
#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "time.h"
#include "x86intrin.h"
void mv(double *a,double *b,double *c, int m, int n, int l)
{
__m256d va,vb,vc;
int k;
int i;
for (k = 0; k < l; k++) {
vb = _mm256_broadcast_sd(&b[k]);
for (i = 0; i < m; i+=4) {
va = _mm256_loadu_pd(&a[m*k+i]);
vc = _mm256_loadu_pd(&c[i]);
vc = _mm256_fmadd_pd(vc, va, vb);
_mm256_storeu_pd( &c[i], vc );
}
}
}
int main(int argc, char* argv[]) {
// set variables
int m;
double* a;
double* b;
double* c;
int i;
int temp=0;
struct timespec startTime, endTime;
m=9;
// main program
// set vector or matrix
a=(double *)malloc(sizeof(double) * m*m);
b=(double *)malloc(sizeof(double) * m*1);
c=(double *)malloc(sizeof(double) * m*1);
for (i=0;i<m;i++) {
a[i]=1;
b[i]=1;
c[i]=0.0;
}
for (i=m;i<m*m;i++) {
a[i]=1;
}
// check start time
clock_gettime(CLOCK_REALTIME, &startTime);
mv(a, b, c, m, 1, m);
// check end time
clock_gettime(CLOCK_REALTIME, &endTime);
free(a);
free(b);
free(c);
return 0;
}
You load and store vectors of 4 double, but your loop condition only checks that the first vector element is in-bounds, so you can write outside objects by up to 3x8 = 24 bytes when m is not a multiple of 4.
You need something like i < (m-3) in main loop, and a cleanup strategy for handling the last partial vector of data. Vectorizing with SIMD is very much like unrolling: you have to check that it's ok to do multiple future elements in the loop condition.
A scalar cleanup loop works well, but we can do better. For example, do as many 128-bit vectors as possible after the last full 256-bit vector (i.e. up to 1), before going scalar.
In many cases (e.g. write-only destination) an unaligned vector load that ends at the end of your arrays is very good (when m>=4). It can overlap with your main loop if m%4 != 0, but that's fine because your output array doesn't overlap your inputs, so redoing an element as part of a single cleanup is cheaper than branching to avoid it.
But that doesn't work here, because your logic is c[i+0..3] += ..., so redoing an element would make it wrong.
// cleanup using a 128-bit FMA, then scalar if there's an odd element.
// untested
void mv(double *a,double *b,double *c, int m, int n, int l)
{
/* the loop below should actually work for m=1..3, but a separate strategy might be good.
if (m < 4) {
// maybe check m >= 2 and use __m128 vectors?
// or vectorize differently?
}
*/
for (int k = 0; k < l; k++) {
__m256 vb = _mm256_broadcast_sd(&b[k]);
int i;
for (i = 0; i < (m-3); i+=4) {
__m256d va = _mm256_loadu_pd(&a[m*k+i]);
__m256d vc = _mm256_loadu_pd(&c[i]);
vc = _mm256_fmadd_pd(vc, va, vb);
_mm256_storeu_pd( &c[i], vc );
}
if (i<(m-1)) {
__m128d lasta = _mm_loadu_pd(&a[m*k+i]);
__m128d lastc = _mm_loadu_pd(&c[i]);
lastc = _mm_fmadd_pd(lastc, va, _mm256_castpd256_pd128(vb));
_mm_storeu_pd( &c[i], lastc );
// i+=2; // last element only checks m odd/even, doesn't use i
}
// if (i<m)
if (m&1) {
// odd number of elements, do the last non-vector one
c[m-1] += a[m*k + m-1] * _mm256_cvtsd_f64(vb);
}
}
}
I haven't looked at exactly how gcc/clang -O3 compile that. Sometimes compilers try to get too smart with cleanup code (e.g. trying to auto-vectorize scalar cleanup loops).
Other strategies could include doing the last up-to-4 elements with an AVX masked store: you need the same mask for the end of every matrix row, so generating it once and then using it at the end of every row could be good. See Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. (To simplify branching, you'd set it up so your main loop only goes to i < (m-4), then you always run the cleanup. In the m%4 == 0 case, the mask is all-ones so you do the final full vector.) If you can't safely read past the end of the matrix, you probably need a masked load as well as masked store.
You could also look at aligning your rows for efficiency, or a row stride that's separate from the logical length of rows. (i.e. pad rows out to 32-byte boundaries). Leaving padding at the end of rows simplifies the cleanup, because you can always do whole vectors that write padding.
Special case m==2: instead of broadcasting one element from b[], you'd like to broadcast 2 elements into two 128-bit lanes of a __m256d, so one 256-bit FMA could do 2 rows at once.
Check the image
This is my 1st post so have that in mind while reading my question.
I have an exam of a colloquium but my code does not provide me the correct result.
So if anyone could help me that would be great. :)
These are the informations that are provided in the exam:
A function y=f(x)=ax^2+bx+c
We have to find the surface that is below the chart but keep in mind that dx(Delta X)=B-A and the height goes like this: A,A+dx,A+2dx, .... , B-dx.
As dx value gets lower the surface will be more accurate.
You have to write the program so that the surface with precision 0.001
This is my code so could someone who is good in C check it please.
Thank you.
#include <stdio.h>
#include <math.h>
int main()
{
int a,b,c;
double A,B,dx,p,D,q,x,y,nv,nv1,nv2,sv;
do{
printf("Insert a & b: "),scanf("%lf %lf",&A,&B);
} while(A<1 || B<1);
nv=dx=B-A;
do{
printf("enter odds: "),scanf("%d %d %d",&a,&b,&c);
p=(-b)/2;
D=sqrt(pow(b,2)-4*a*c);
q= -D/4*a;
} while( a<0 || p<0 || q<0);
do{
sv=nv;
dx/=2;
nv=0;
for(x=A;x<p;x+=dx)
for(dx=B-A;dx<q;dx/=2)
nv1+=x*dx;
for(y=p;y<=B;y+=dx)
for(dx=q;dx<B;dx/=2)
nv2+=y*dx;
nv=nv1+nv2;
}while(fabs(nv-sv)>0.001);
printf("The surface is %lf",nv);
return 0;
}
You want to find the approximation of a definite integral of a quadratic function. There are several issues with your code:
What is the restriction of A ≥ 1 and B ≥ 1 for? A parabola is defined over the whole abscissa. If anything, you should enforce that the input is numeric and that two values were given.
You don't need to find the vertex of the parabola. Your task is to create small rectangles based on the left x value of each interval as the image shows. Therefore, you don't need p and q. And you shouldn't enforce that the vertex is in the first quadrant on the input without indication.
Why are the coefficients of the parabola integers? Make them doubles to be consistent.
Because you don't need to know the vertex, you don't need to split your loop in two. In your code, you don't even check that p is between A and B, which is a requirement of cour code.
What is the inner loop for? You are supposed to just calculate the area of the current rectangle here. What's worse: you re-use the variable dx as iteration variable, which means you lose it as an indicator of how large your current interval is.
The repeated incrementing of dx may lead to an accumulated floating-point error when the number of intervals is large. A common technique to avoid this is to use an integer variable for loop control and the determine the actual floating-point variable by multiplication.
The absolute value as a convergence criterion may lead to problems with small and big numbers. The iteration ends too early for small values and it may never reach the criterion for big numbers, where a difference of 0.001 cannot be resolved.
Here's a version of your code that puts all that into practice:
#include <stdio.h>
#include <math.h>
int main()
{
double a, b, c;
double A, B;
printf("Lower and upper limit A, B: ");
scanf("%lf %lf", &A, &B);
printf("enter coefficients a, b, c: ");
scanf("%lf %lf %lf", &a, &b, &c);
double nv = 0;
double sv;
int n = 1;
do {
int i;
double dx;
sv = nv;
n *= 2;
dx = (B - A) / n;
nv = 0;
for (i = 0; i < n; i++) {
double x = A + i * (B - A) / n;
double y = a*x*x + b*x + c;
nv += dx * y;
}
} while(fabs(nv - sv) > 0.0005 * fabs(nv + sv));
printf("Surface: %lf\n", nv);
return 0;
}
The code is well-behaved for empty intervals (where A = B) or reversed intervals (where A > B). The inpt is still quick and dirty. It should really heck that the entered values are valid numbers. There's no need to restrict the input arbitrarily, though.
This is for all you C experts out there..
The first function takes a two-dimensional matrix src[dim][dim] representing pixels of an image, and rotates it 90 degrees into a destination matrix dst[dim][dim]. The second function takes the same src[dim][dim] and smoothens the image by replacing every pixel value with the average of all the pixels around it (in a maximum of 3 × 3 window centered at that pixel).
I need to optimize the program in account for time and cycles, how else would I be able to optimize the following?:
void rotate(int dim, pixel *src, pixel *dst,)
{
int i, j, nj;
nj = 0;
/* below are the main computations for the implementation of rotate. */
for (j = 0; j < dim; j++) {
nj = dim-1-j; /* Code Motion moved operation outside inner for loop */
for (i = 0; i < dim; i++) {
dst[RIDX(nj, i, dim)] = src[RIDX(i, j, dim)];
}
}
}
/* A struct used to compute averaged pixel value */
typedef struct {
int red;
int green;
int blue;
int num;
} pixel_sum;
/* Compute min and max of two integers, respectively */
static int minimum(int a, int b)
{ return (a < b ? a : b); }
static int maximum(int a, int b)
{ return (a > b ? a : b); }
/*
* initialize_pixel_sum - Initializes all fields of sum to 0
*/
static void initialize_pixel_sum(pixel_sum *sum)
{
sum->red = sum->green = sum->blue = 0;
sum->num = 0;
return;
}
/*
* accumulate_sum - Accumulates field values of p in corresponding
* fields of sum
*/
static void accumulate_sum(pixel_sum *sum, pixel p)
{
sum->red += (int) p.red;
sum->green += (int) p.green;
sum->blue += (int) p.blue;
sum->num++;
return;
}
/*
* assign_sum_to_pixel - Computes averaged pixel value in current_pixel
*/
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum)
{
current_pixel->red = (unsigned short) (sum.red/sum.num);
current_pixel->green = (unsigned short) (sum.green/sum.num);
current_pixel->blue = (unsigned short) (sum.blue/sum.num);
return;
}
/*
* avg - Returns averaged pixel value at (i,j)
*/
static pixel avg(int dim, int i, int j, pixel *src)
{
int ii, jj;
pixel_sum sum;
pixel current_pixel;
initialize_pixel_sum(&sum);
for(ii = maximum(i-1, 0); ii <= minimum(i+1, dim-1); ii++)
for(jj = maximum(j-1, 0); jj <= minimum(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(¤t_pixel, sum);
return current_pixel;
}
void smooth(int dim, pixel *src, pixel *dst)
{
int i, j;
/* below are the main computations for the implementation of the smooth function. */
for (j = 0; j < dim; j++)
for (i = 0; i < dim; i++)
dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
I moved dim-1-j outside of the inner for loop of rotate which reduces time and cycles used in the program, but is there anything else that can be used for either main function?
Thanks!
There are several oprimizations you can do; some a compiler might do for you but best to write it out yourself. For example: moving constant expressions out of the loop (you did that once; there are more places you can do that - don't forget that the condition is checked every iteration too, so optimize the loop condition in this manner too) and, as Chris pointed out, use pointers that you increment instead of full array indexing. I also see some function calls that can be rewritten in-line.
I also want to point to an article on stackoverflow about matrix multiplication and optimizing that to use the processor cache. In essence it first rearranges the arrrays into memory bocks that fit the cache, then performs the operation on those blocks, then moves to the next block, and so on. You may be able to re-use the ideas for your rotation.
See Optimizing assembly generated by Microsoft Visual Studio Compiler
For the rotation, you get a better utilization of the cache by decomposing in smaller image tiles.
For the smoothing,
1) expand the whole operation inside the main double loop, do not use these intermediate micro-functions;
2) completely unroll the accumulation and averaging (it's only a sum of 9 terms), hard coding the indexes;
3) process in different loops along the edges (where not all 9 pixels are available) and in the middle. The middle deserves maximum optimization (especially (2));
4) try and avoid the divisions by 9 (you can think of replacing the division by a table lookup).
Top speed will be obtained by handcrafting vectorized optimization (SSE/AVX), but this requires some deal of experience. Multicore parallelization is also an option.
To give you an idea, it is possible to apply a 3x3 average on a 1 MB grayscale image in less than 0.5 ms (monocore, Core i7#3.4 GHz). We can extrapolate to 2 ms or so for a 1 Mpixel RGB image.
Since you can't provide a running program these are just ideas of things that could help:
Assuming values in the range [0,256) then use uint8_t as your rgbn values. This takes up 1/4 of the memory of the int version but will likely require more cycles; I can't know if this would be faster or not without more knowledge. The idea is that since you use 1/4 of the memory you are more likely to keep more values in L1-L3 cache.
Since your neighbors are the same whether you are rotated or not, calculate the average before rotating. I suspect this would help out with caching but again can't be sure; it depends on some code I can't see.
Parallelize the outer loop. Since you have easy grid dimensions and the inputs and outputs don't have read/write conflicts this is a trivial thing to do. This will certainly take more cycles but will possibly be faster.
Hard-code your edges; you are currently doing maximum and minimum operations on every call to average, but for the inner points it is unneeded. Calculate the edges and the inner points separately.
I'm trying to optimize some of my code in C, which is a lot bigger than the snippet below. Coming from Python, I wonder whether you can simply multiply an entire array by a number like I do below.
Evidently, it does not work the way I do it below. Is there any other way that achieves the same thing, or do I have to step through the entire array as in the for loop?
void main()
{
int i;
float data[] = {1.,2.,3.,4.,5.};
//this fails
data *= 5.0;
//this works
for(i = 0; i < 5; i++) data[i] *= 5.0;
}
There is no short-cut you have to step through each element of the array.
Note however that in your example, you may achieve a speedup by using int rather than float for both your data and multiplier.
If you want to, you can do what you want through BLAS, Basic Linear Algebra Subprograms, which is optimised. This is not in the C standard, it is a package which you have to install yourself.
Sample code to achieve what you want:
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
int main () {
int limit =10;
float *a = calloc( limit, sizeof(float));
for ( int i = 0; i < limit ; i++){
a[i] = i;
}
cblas_sscal( limit , 0.5f, a, 1);
for ( int i = 0; i < limit ; i++){
printf("%3f, " , a[i]);
}
printf("\n");
}
The names of the functions is not obvious, but reading the guidelines you might start to guess what BLAS functions does. sscal() can be split into s for single precision and scal for scale, which means that this function works on floats. The same function for double precision is called dscal().
If you need to scale a vector with a constant and adding it to another, BLAS got a function for that too:
saxpy()
s a x p y
float a*x + y
y[i] += a*x
As you might guess there is a daxpy() too which works on doubles.
I'm afraid that, in C, you will have to use for(i = 0; i < 5; i++) data[i] *= 5.0;.
Python allows for so many more "shortcuts"; however, in C, you have to access each element and then manipulate those values.
Using the for-loop would be the shortest way to accomplish what you're trying to do to the array.
EDIT: If you have a large amount of data, there are more efficient (in terms of running time) ways to multiply 5 to each value. Check out loop tiling, for example.
data *= 5.0;
Here data is address of array which is constant.
if you want to multiply the first value in that array then use * operator as below.
*data *= 5.0;
I have a problem with a series of functions. I have an array of 'return values' (i compute them through matrices) from a single function sys which depends on a integer variable, lets say, j, and I want to return them according to this j , i mean, if i want the equation number j, for example, i just write sys(j)
For this, i used a for loop but i don't know if it's well defined, because when i run my code, i don't get the right values.
Is there a better way to have an array of functions and call them in a easy way? That would make easier to work with a function in a Runge Kutta method to solve a diff equation.
I let this part of the code here: (c is just the j integer i used to explain before)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int N=3;
double s=10.;
//float r=28.;
double b=8.0/3.0;
/ * Define functions * /
double sys(int c,double r,double y[])
{
int l,m,n,p=0;
double tmp;
double t[3][3]={0};
double j[3][3]={{-s,s,0},{r-y[2],-1,-y[0]},{y[1],y[0],-b}}; //Jacobiano
double id[3][3] = { {y[3],y[6],y[9]} , {y[4],y[7],y[10]} , {y[5],y[8],y[11]} };
double flat[N*(N+1)];
// Multiplication of matrices J * Y
for(l=0;l<N;l++)
{
for(m=0;m<N;m++)
{
for(n=0;n<N;n++)
{
t[l][m] += j[l][n] * id[n][m];
}
}
}
// Transpose the matrix (J * Y) -> () t
for(l=0;l<N;l++)
{
for(m=l+1;m<N;m++)
{
tmp = t[l][m];
t[l][m] = t[m][l];
t[m][l] = tmp;
}
}
// We flatten the array to be left in one array
for(l=0;l<N;l++)
{
for(m=0;m<N;m++)
{
flat[p+N] = t[l][m];
}
}
flat[0] = s*(y[1]-y[0]);
flat[1] = y[0]*(r-y[2])-y[1];
flat[2] = y[0]*y[1]-b*y[2];
for(l=0;l<(N*(N+1));l++)
{
if(c==l)
{
return flat[c];
}
}
}
EDIT ----------------------------------------------------------------
Ok, this is the part of the code where i use the function
int main(){
output = fopen("lyapcoef.dat","w");
int j,k;
int N2 = N*N;
int NN = N*(N+1);
double r;
double rmax = 29;
double t = 0;
double dt = 0.05;
double tf = 50;
double z[NN]; // Temporary matrix for RK4
double k1[N2],k2[N2],k3[N2],k4[N2];
double y[NN]; // Matrix for all variables
/* Initial conditions */
double u[N];
double phi[N][N];
double phiu[N];
double norm;
double lyap;
//Here we integrate the system using Runge-Kutta of fourth order
for(r=28;r<rmax;r++){
y[0]=19;
y[1]=20;
y[2]=50;
for(j=N;j<NN;j++) y[j]=0;
for(j=N;j<NN;j=j+3) y[j]=1; // Identity matrix for y from 3 to 11
while(t<tf){
/* RK4 step 1 */
for(j=0;j<NN;j++){
k1[j] = sys(j,r,y)*dt;
z[j] = y[j] + k1[j]*0.5;
}
/* RK4 step 2 */
for(j=0;j<NN;j++){
k2[j] = sys(j,r,z)*dt;
z[j] = y[j] + k2[j]*0.5;
}
/* RK4 step 3 */
for(j=0;j<NN;j++){
k3[j] = sys(j,r,z)*dt;
z[j] = y[j] + k3[j];
}
/* RK4 step 4 */
for(j=0;j<NN;j++){
k4[j] = sys(j,r,z)*dt;
}
/* Updating y matrix with new values */
for(j=0;j<NN;j++){
y[j] += (k1[j]/6.0 + k2[j]/3.0 + k3[j]/3.0 + k4[j]/6.0);
}
printf("%lf %lf %lf \n",y[0],y[1],y[2]);
t += dt;
}
Since you're actually computing all these values at the same time, what you really want is for the function to return them all together. The easiest way to do this is to pass in a pointer to an array, into which the function will write the values. Or perhaps two arrays; it looks to me as if the output of your function is (conceptually) a 3x3 matrix together with a length-3 vector.
So the declaration of sys would look something like this:
void sys(double v[3], double JYt[3][3], double r, const double y[12]);
where v would end up containing the first three elements of your flat and JYt would contain the rest. (More informative names are probably possible.)
Incidentally, the for loop at the end of your code is exactly equivalent to just saying return flat[c]; except that if c happens not to be >=0 and <N*(N+1) then control will just fall off the end of your function, which in practice means that it will return some random number that almost certainly isn't what you want.
Your function sys() does an O(N3) calculation to multiply two matrices, then does a couple of O(N2) operations, and finally selects a single number to return. Then it is called the next time and goes through most of the same processing. It feels a tad wasteful unless (even if?) the matrices are really small.
The final loop in the function is a little odd, too:
for(l=0;l<(N*(N+1));l++)
{
if(c==l)
{
return flat[c];
}
}
Isn't that more simply written as:
return flat[c];
Or, perhaps:
if (c < N * (N+1))
return flat[c];
else
...do something on disastrous error other than fall off the end of the
...function without returning a value as the code currently does...
I don't see where you are selecting an algorithm by the value of j. If that's what you're trying to describe, in C you can have an array of pointers to functions; you could use a numerical index to choose a function from the array, but you can also pass a pointer-to-a-function to another function that will call it.
That said: Judging from your code, you should keep it simple. If you want to use a number to control which code gets executed, just use an if or switch statement.
switch (c) {
case 0:
/* Algorithm 0 */
break;
case 1:
/* Algorithm 1 */
etc.