I am trying to optimise the divide operation from the Jacobi relaxation formula.
I am also profiling it using perf.
Here is my code:
for (int l = 0; l < iter; l++) {
for (i = 1; i < height; i++) {
for (j = 1; j < width; j++) {
for (k = 1; k < length; k++) {
float val = 0.0f;
// Do the Jacobi additions here
// From profiling, fastest is to fetch k+/-1,j,i
// Slowest is to fetch k,j,i+/-1
// Scale with dimensions of the array
val -= dim * array[k][j][i];
// Want to optimise this
val /= 6.0; // profiling shows this as the slowest op
// Some code here to put the result into the output array
}
}
}
}
The size of the 3D array can be from 100x100x100 up to 1000x1000x1000.
I've tried multiplying by 1.0f/6.0f instead, but this does not seem to make a difference. The array is a 3D array of floats.
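For reference, here is a minimal sketch of a sweep that multiplies by a precomputed float reciprocal and keeps the contiguous index in the innermost loop. Note that `val /= 6.0` uses a double constant, so the float is promoted to double, divided, and converted back. The flat array layout, the helper names `idx` and `jacobi_sweep`, and the plain six-neighbour average (without the `dim` scaling term above) are assumptions for illustration, not the original code. If the multiply makes no measurable difference, the loop is probably limited by memory access rather than arithmetic.

```c
#include <stddef.h>

/* Hypothetical helper: index into an nx*ny*nz grid stored as one flat
   array, with x as the fastest-varying (contiguous) index. */
static size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* One Jacobi sweep over the interior points: each output value is the
   average of the six face neighbours of the corresponding input value. */
void jacobi_sweep(const float *in, float *out,
                  size_t nx, size_t ny, size_t nz)
{
    const float sixth = 1.0f / 6.0f;   /* float constant, computed once */

    for (size_t z = 1; z + 1 < nz; z++) {
        for (size_t y = 1; y + 1 < ny; y++) {
            for (size_t x = 1; x + 1 < nx; x++) {   /* contiguous in memory */
                float sum = in[idx(x - 1, y, z, nx, ny)]
                          + in[idx(x + 1, y, z, nx, ny)]
                          + in[idx(x, y - 1, z, nx, ny)]
                          + in[idx(x, y + 1, z, nx, ny)]
                          + in[idx(x, y, z - 1, nx, ny)]
                          + in[idx(x, y, z + 1, nx, ny)];
                out[idx(x, y, z, nx, ny)] = sum * sixth;
            }
        }
    }
}
```

Boundary points are left untouched, so the caller is expected to fill them in separately.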
My code is as follows. In the main function, I call the Mat_product function about 223440 times; it takes 179 ns, which is 23% of the whole runtime.
struct Matrix_SE3 {
float R[3][3];
double P[3]; //here i need use double type.
};
struct Matrix_SE3 Mat_product(struct Matrix_SE3 A, struct Matrix_SE3 B) {
struct Matrix_SE3 result = { { { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 } }, { 0,
0, 0 } };
for (int i = 0; i < 3; i++) {
result.P[i] += A.P[i];
for (int j = 0; j < 3; j++) {
result.P[i] += A.R[i][j] * B.P[j];
for (int k = 0; k < 3; k++)
result.R[i][j] += A.R[i][k] * B.R[k][j];
}
}
return result;
}
where $R$ is the rotation matrix and $P$ represents the position. The function multiplies two special Euclidean group $SE(3)$ matrices and returns an $SE(3)$ matrix.
This may be a duplicate of Optimized matrix multiplication in C; the difference is that my code uses a struct to describe the matrix. Does that affect the efficiency of the calculation?
I'm not sure what the P and R parts of your code represent, but you should not use the ijk ordering for matrix multiplication.
Because C arrays are stored in row-major order, accessing B.R[k][j] in your inner loop walks down a column, so many accesses can lead to cache misses and reduce performance significantly, even with your small matrices.
The usual way to perform matrix multiplication is to iterate in ikj order:
for (int i = 0; i < 3; i++) {
    result.P[i] += A.P[i];
    for (int k = 0; k < 3; k++) {
        float r = A.R[i][k];
        // Translation term: each A.R[i][k] * B.P[k] is added exactly once
        result.P[i] += r * B.P[k];
        for (int j = 0; j < 3; j++) {
            result.R[i][j] += r * B.R[k][j];
        }
    }
}
All accesses are then performed in row-major order and benefit from the cache.
And do not forget to enable -O3 optimization. Most compilers will then use SSE/AVX instructions to optimize the code.
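As a self-contained sketch, the ikj version can be written so that each term of the rotation-times-position product is added exactly once, by accumulating it in the k loop rather than the j loop. The name `Mat_product_ikj` and the choice to pass the operands by pointer (to avoid copying two structs per call) are assumptions, not part of the original code.

```c
struct Matrix_SE3 {
    float  R[3][3];   /* rotation */
    double P[3];      /* position, kept in double as in the question */
};

/* ikj-ordered SE(3) product: result.R = A.R * B.R,
   result.P = A.P + A.R * B.P. */
struct Matrix_SE3 Mat_product_ikj(const struct Matrix_SE3 *A,
                                  const struct Matrix_SE3 *B)
{
    struct Matrix_SE3 result = { { { 0 } }, { 0 } };

    for (int i = 0; i < 3; i++) {
        result.P[i] = A->P[i];
        for (int k = 0; k < 3; k++) {
            const float r = A->R[i][k];
            result.P[i] += r * B->P[k];          /* one term per k */
            for (int j = 0; j < 3; j++)
                result.R[i][j] += r * B->R[k][j]; /* row of B.R, contiguous */
        }
    }
    return result;
}
```

Whether this is measurably faster than the ijk version for 3x3 matrices is something to confirm by profiling with optimization enabled.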
I'm generally an R user, but I am trying to use C for some lower-level cumulative sums and multiplications.
I am trying to generate a cumulative sum of eta and storing the result in tmp0. However, when I output tmp0 it either gives me Inf, NaN, or some arbitrarily large number. I double checked the same cumulative sum in R and it works fine; I am not sure why C is not handling it. Below is the code that I am using:
int i,j;
const int p = ncov, n = nin;
double accNum0[n]; //accumulate first part of likelihood sum eta_i
double accNum1[n]; //accumulate the backwards numerator
double accNum2[n]; //accumulate the forward numerator (weighted)
double tmp0 = 0;
double eta[n]; //calculate linear predictor in this step (X %*% beta)
for(i = 0; i < n; i++) {
for (j = 0; j < p; j++)
eta[i] += b[j] * x[n * j + i];
}
for (i = 0; i < n; ++i) {
tmp0 += eta[i];
}
return (tmp0);
Again, I am fairly new to C so I may be making some rookie mistakes and would greatly appreciate any (and all) suggestions!
There might be errors with how you are initializing b or x. However, one definite error is that eta is being used uninitialized. This means eta[i] may begin with some arbitrary value instead of 0 as you are likely expecting.
Add an initialization before accumulating into it.
for(i = 0; i < n; i++) {
eta[i] = 0;
for (j = 0; j < p; j++)
eta[i] += b[j] * x[n * j + i];
}
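If only the total is needed, a local accumulator that starts at zero avoids the uninitialized array altogether. The following is a minimal sketch under that assumption; the function name `sum_linear_predictor` is hypothetical, and `x` is assumed to be stored column-major (n rows, p columns), as the indexing `x[n * j + i]` in the question suggests.

```c
#include <stddef.h>

/* Sum over i of eta_i, where eta_i = sum over j of b[j] * x[n * j + i].
   x is assumed column-major with n rows and p columns. */
double sum_linear_predictor(const double *x, const double *b,
                            size_t n, size_t p)
{
    double total = 0.0;

    for (size_t i = 0; i < n; i++) {
        double eta_i = 0.0;              /* explicitly starts at zero */
        for (size_t j = 0; j < p; j++)
            eta_i += b[j] * x[n * j + i];
        total += eta_i;
    }
    return total;
}
```

If the individual eta values are needed later, the same pattern applies: assign `eta[i] = eta_i` after the inner loop instead of accumulating into an array element that was never set.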
I have a 2D double complex array of size 1001 (rows) x 144 (columns) in C. I want to apply an FFT to each of the rows and want the final output in 4096 x 144 format, where the number of FFT points is N = 4096. Finally, I want to compare the result with the MATLAB output.
I am using the well-known FFTW C library. I have read the tutorials but could not understand how to use it properly. Which routine should I use, the 1D routine or the 2D routine, and how?
#Update
double complex reshaped_calib[1001][144]={{1.0 + 0.1 * I}};
double complex** input; //[4096][144];
// have to take dynamic array as I was getting segmentation fault here
input = malloc(4096 * sizeof(double complex*));
for (i = 0; i < 4096; i++) {
input[i] = malloc(144* sizeof(double complex));
}
// input is array I am sending to fftw to apply fft
for (i= 0; i< 1001; i++)
{
for (j= 0; j< 144; j++)
{
input[i][j]=reshaped_calib[i][j];
}
}
// Pad the extra rows
for (i= 1001; i< 4096; i++)
{
for (j= 0; j< 144; j++)
{
input[i][j] = 0.0;
}
}
int N=144, howmany=4096;
fftw_complex* data = (fftw_complex*) fftw_malloc(N*howmany*sizeof(fftw_complex));
i=0,j=0;
int dataCount=0;
for(i=0;i<4096;i++)
{
for(j=0;j<144;j++)
{
data[dataCount++]=CMPLX(creal(input[i][j]),cimag(input[i][j]));
}
}
int istride=1, idist=N;// as in C data as row major
// ... if data is column-major, set istride=howmany, idist=1
// if data is row-major, set istride=1, idist=N
fftw_plan p = fftw_plan_many_dft(1,&N,howmany,data,NULL,howmany,1,data,NULL,howmany,1,FFTW_FORWARD,FFTW_MEASURE);
fftw_execute(p);
Your attempt at padding the array with
int pad = 4096;
memset(reshaped_calib, 0.0, pad * sizeof(double complex)); //zero padding
essentially overwrites the first 4096 values of the array reshaped_calib. For proper padding, you would instead need to extend the size of the 2D array to your required size of 4096 x 144, and set to zero only the entries outside the input's 1001 x 144 range.
Since you are only extending the number of rows, the following could be used for the padding:
double complex input[1001][144]={{1.0 + 0.1 * I}};
// Allocate storage for the larger 4096x144 2D size, and copy the input.
double complex reshaped_calib[4096][144];
for (int row = 0; row < 1001; row++)
{
for (int col = 0; col < 144; col++)
{
reshaped_calib[row][col] = input[row][col];
}
}
// Pad the extra rows
for (int row = 1001; row < 4096; row++)
{
for (int col = 0; col < 144; col++)
{
reshaped_calib[row][col] = 0.0;
}
}
That said, if you want a 1D FFT of each of the rows separately, you should instead use fftw_plan_dft_1d and call fftw_execute multiple times, or otherwise use fftw_plan_many_dft. This is described in more detail in this answer by @Dylan.
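A 4096 x 144 array of double complex values takes about 9 MB (4096 * 144 * 16 bytes), which is more than many default stack sizes allow and may explain a segmentation fault when it is declared as a local array. As a rough sketch, the padding can instead be done into a single zero-filled heap buffer. The helper name `pad_rows` is hypothetical, and the code relies on all-bits-zero representing 0.0, which holds on IEEE 754 platforms.

```c
#include <complex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy a rows_in x cols row-major array into a new
   rows_out x cols buffer whose extra rows are zero.
   Returns NULL on failure; the caller must free() the result. */
double complex *pad_rows(const double complex *in, size_t rows_in,
                         size_t rows_out, size_t cols)
{
    if (rows_out < rows_in)
        return NULL;

    /* calloc zero-fills, which gives 0.0 + 0.0i on IEEE 754 platforms */
    double complex *out = calloc(rows_out * cols, sizeof *out);
    if (out == NULL)
        return NULL;

    /* Rows are contiguous, so the original block is copied in one call */
    memcpy(out, in, rows_in * cols * sizeof *out);
    return out;
}
```

For the sizes discussed here, the call would look like `pad_rows(&input[0][0], 1001, 4096, 144)`, with `input` declared as a static or heap array rather than a local one.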
I am in the middle of porting prototype MATLAB code to C, and have hit a hitch inverting an n x n matrix of complex numbers.
I am using GSL to perform LU decomposition and inversion with the following code:
size_t matSize = 4;
gsl_permutation *perm = gsl_permutation_calloc(matSize);
gsl_matrix_complex *mat = gsl_matrix_complex_calloc(matSize, matSize);
gsl_matrix_complex *inv = gsl_matrix_complex_calloc(matSize, matSize);
gsl_complex temp;
for(int i = 0; i < matSize; i++) {
for(int j = 0;j < matSize; j++) {
GSL_SET_COMPLEX(&temp,i+1,j+1);
gsl_matrix_complex_set(mat,i,j,temp);
}
}
int s;
gsl_linalg_complex_LU_decomp(mat, perm, &s);
gsl_linalg_complex_LU_invert(mat, perm, inv);
for(int i = 0; i < matSize; i++) {
for(int j = 0; j < matSize; j++) {
printf("%f + %fi \n",GSL_REAL(gsl_matrix_complex_get(inv, i, j)),GSL_IMAG(gsl_matrix_complex_get(inv, i, j)));
}
}
However, comparing the results I get to MATLAB, they are not similar at all. I know MATLAB does not always use LU for inversion, as it has an algorithm decision tree depending on matrix properties, but I would expect comparable results.
Any help is welcome, as are recommendations for other libraries or techniques for inversion (not that GSL is wrong; without a doubt I am!). The matrix I have is of type fftw_complex, and I would rather avoid converting to/from gsl_matrix_complex.
If I have a 2D array, it is trivial to loop through the entire array, a row or a column by using for loops. However, occasionally, I need to traverse an arbitrary 2D sub-array.
A great example would be sudoku in which I might store an entire grid in a 2D array but then need to analyse each individual block of 9 squares. Currently, I would do something like the following:
for(i = 0; i < 9; i += 3) {
for(j = 0; j < 9; j += 3) {
for(k = 0; k < 3; k++) {
for(m = 0; m < 3; m++) {
block[m][k] = grid[j + m][i + k];
}
}
//At this point in each iteration of i/j we will have a 2D array in block
//which we can then iterate over using more for loops.
}
}
Is there a better way to iterate over arbitrary sub-arrays especially when they occur in a regular pattern such as above?
The performance on this loop structure will be horrendous. Consider the inner most loop:
for(m = 0; m < 3; m++) {
block[m][k] = grid[j + m][i + k];
}
C is "row-major" ordered, which means that accessing block will cause a cache miss on each iteration! That's because the memory is not accessed contiguously.
There's a similar issue for grid. Your nested loop order is to fix i before varying j, yet you are accessing grid on j as the row. This again is not contiguous and will cache miss on every iteration.
So a rule of thumb when dealing with nested loops and multidimensional arrays is to place the loop indices and array indices in the same order. For your code, that's:
for(j = 0; j < 9; j += 3) {
for(m = 0; m < 3; m++) {
for(i = 0; i < 9; i += 3) {
for(k = 0; k < 3; k++) {
block[m][k] = grid[j + m][i + k];
}
}
// make sure you access everything so that order doesn't change
// your program's semantics
}
}
Well, in the case of sudoku, couldn't you just store nine 3x3 arrays? Then you don't need to bother with sub-arrays. If you move to much larger grids than sudoku, this would improve cache performance as well.
Ignoring that, your code above works fine.
Imagine you have a 2D array a[n][m]. To loop over a q x r sub-array whose top-left corner is at position x,y, use:
for(int i = x; i < n && i < x + q; ++i)
for(int j = y; j < m && j < y + r; ++j)
{
///
}
For your sudoku example, you could do this
for(int i = 0; i<3; ++i)
for(int j = 0; j < 3; ++j)
for(int locali = 0; locali < 3; ++locali)
for(int localj = 0; localj < 3; ++localj)
//the locali,localj element of the bigger i,j 3X3 square is
a[3*i + locali][3*j + localj];
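To make the indexing concrete, here is a small sketch that uses it to check a single 3x3 block directly in the full grid, without copying into a separate block array. The function name `block_is_valid` and the convention that cells hold the digits 1 to 9 are assumptions for illustration.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if block (bi, bj), with bi and bj
   in 0..2, contains each digit 1..9 exactly once. */
bool block_is_valid(int grid[9][9], int bi, int bj)
{
    bool seen[10] = { false };

    for (int li = 0; li < 3; ++li) {
        for (int lj = 0; lj < 3; ++lj) {
            /* the li,lj element of the bi,bj 3x3 square */
            int v = grid[3 * bi + li][3 * bj + lj];
            if (v < 1 || v > 9 || seen[v])
                return false;
            seen[v] = true;
        }
    }
    return true;
}
```

Checking the whole grid is then two more loops over bi and bj, which matches the four-loop structure above.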