Fast computation of cumulative sums over four-dimensional arrays in R - arrays

I'm relatively new to R programming, and this website has been very helpful for me so far, but I was unable to find a question that already covered what I want to know. So I decided to post a question myself.
My problem is the following: I want to find efficient ways to compute cumulative sums over four-dimensional arrays, i.e. I have data in a four-dimensional array x and want to write a function that computes an array x_sum such that
x_sum[i,j,k,l] = sum_{ind1 <= i, ind2 <= j, ind3 <= k, ind4 <= l} x[ind1, ind2, ind3, ind4].
I want to use this function billions of times, which makes it very important that it be as efficient as possible. Although I have come up with several ways to calculate the sums (see below), I suspect more experienced R programmers might be able to find a more efficient solution. So if anyone can suggest a better way of doing this, I would be very grateful.
Here's what I've tried so far:
I have found three different implementations (each of which brought a gain in speed) that work (see code below):
One in R using the cumsum() function (cumsum_4R) and two implementations where the "heavy lifting" is done in C (using the .C() interface).
The first implementation in C is merely a naive attempt to write the sums using nested for-loops and pointer arithmetic (cumsumC_4_old).
In the second C-implementation (cumsumC_4) I tried to adapt my code using the ideas in the following article
As you can see in the source code below, the adaptation is incomplete: for some dimensions I was able to replace all the nested for-loops, but for others I was not. Do you have ideas on how to do that?
Using microbenchmark on the three implementations, I get the following result for arrays of size 40x40x40x40:
Unit: milliseconds
             expr       min         lq       mean     median         uq       max neval
     cumsum_4R(x) 976.13258 1029.33371 1064.35100 1051.37782 1074.23234 1381.5832    50
 cumsumC_4_old(x) 174.72868  177.95875  192.75392  184.11121  203.18141  283.2384    50
     cumsumC_4(x)  56.87169   57.73512   67.34714   63.20269   68.80326  105.7099    50
Additional information:
1) I ran the benchmarks on my personal computer under Windows because that made it easier to install the needed packages, but I plan on running the finished simulations on a computer from my university, which runs under Linux.
EDIT: 2) The four-dimensional data x[i,j,k,l] actually arises from two applications of the outer function: first the outer product of a matrix with itself (i.e. outer(mat, mat)), and then the pairwise minima of a second matrix (i.e. outer(mat2, mat2, pmin)). The data is the product
x = outer(mat, mat) * outer(mat2, mat2, pmin),
i.e.
x[i,j,k,l] = mat[i,j] * mat[k,l] * min(mat2[i,j], mat2[k,l])
The four-dimensional array has the corresponding symmetries.
3) The reason I need these cumulative sums in the first place is that I want to run simulations of a test for which I need partial sums over "rectangles" of indices: I want to iterate over all sums of the form
sum_{k1 <= i1 <= m1, k2 <= i2 <= m2, k1 <= i3 <= m1, k2 <= i4 <= m2} x[i1, i2, i3, i4],
where 1 <= k1 <= m1 <= n and 1 <= k2 <= m2 <= n. To avoid calculating the sum of the same variables over and over again, I first calculate all the cumulative sums and then obtain each rectangle sum from the cumulative sums by inclusion-exclusion. Do you know of a more efficient way to do this?
EDIT to 3): In order to include all potentially important aspects: I also want to calculate sums of the form
sum_{k1 <= i1 <= m1, k2 <= i2 <= m2, 1 <= i3 <= n, 1 <= i4 <= n} x[i1, i2, i3, i4].
(Since I can obtain them trivially from the cumulative sums, I had not included this specification before.) A sketch of the inclusion-exclusion step follows below.
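For reference, here is a minimal sketch in C of that inclusion-exclusion step (my own illustration, not part of the original code: the names csum4 and rect_sum are made up, and indices are 0-based and inclusive, unlike the 1-based notation above):
static double csum4(const double *c, int n, int i, int j, int k, int l) {
    /* read the cumulative-sum array; an index of -1 means an empty range */
    long n2 = (long)n * n, n3 = n2 * n;
    if (i < 0 || j < 0 || k < 0 || l < 0) return 0.0;
    return c[i + (long)j * n + k * n2 + l * n3];   /* same layout as the C code below */
}
/* sum of x over k1<=i1<=m1, k2<=i2<=m2, k3<=i3<=m3, k4<=i4<=m4,
   as a signed combination of 16 corners of the cumulative-sum array */
double rect_sum(const double *c, int n,
                int k1, int m1, int k2, int m2,
                int k3, int m3, int k4, int m4) {
    double s = 0.0;
    for (int b = 0; b < 16; b++) {
        int bits = 0;                               /* parity of the corner */
        for (int t = b; t; t >>= 1) bits += t & 1;
        s += (bits % 2 ? -1.0 : 1.0) *
             csum4(c, n, (b & 1) ? k1 - 1 : m1, (b & 2) ? k2 - 1 : m2,
                         (b & 4) ? k3 - 1 : m3, (b & 8) ? k4 - 1 : m4);
    }
    return s;
}
With this, every rectangle sum costs 16 lookups, independent of the rectangle's size.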
Here is the C code I use (which I save as "cumsumC.c"):
#include <R.h>
#include <math.h>
#include <stdio.h>

int min(int a, int b){
    if(a <= b) return a;
    else return b;
}
void cumsumC_4_old(double* x, int* nv){
    int n = *nv;
    int n2 = n*n;
    int n3 = n*n*n;
    //Dim 1
    for(int i=0; i<n; i++){
        for(int j=0; j<n; j++){
            for(int k=0; k<n; k++){
                for(int l=1; l<n; l++){
                    x[i+j*n+k*n2+l*n3] += x[i + j*n + k*n2 + (l-1)*n3];
                }
            }
        }
    }
    //Dim 2
    for(int i=0; i<n; i++){
        for(int j=0; j<n; j++){
            for(int k=1; k<n; k++){
                for(int l=0; l<n; l++){
                    x[i+j*n+k*n2+l*n3] += x[i + j*n + (k-1)*n2 + l*n3];
                }
            }
        }
    }
    //Dim 3
    for(int i=0; i<n; i++){
        for(int j=1; j<n; j++){
            for(int k=0; k<n; k++){
                for(int l=0; l<n; l++){
                    x[i+j*n+k*n2+l*n3] += x[i + (j-1)*n + k*n2 + l*n3];
                }
            }
        }
    }
    //Dim 4
    for(int i=1; i<n; i++){
        for(int j=0; j<n; j++){
            for(int k=0; k<n; k++){
                for(int l=0; l<n; l++){
                    x[i+j*n+k*n2+l*n3] += x[i-1 + j*n + k*n2 + l*n3];
                }
            }
        }
    }
}
void cumsumC_4(double* x, int* nv){
    int n = *nv;
    int n2 = n*n;
    int n3 = n*n*n;
    long ind1, ind2;
    long index, indexges = n + (n-1)*n + (n-1)*n2 + (n-1)*n3, indexend;
    //Dim 1
    index = n3;
    while(index != indexges){
        x[index] += x[index-n3];
        index++;
    }
    //Dim 2
    long teilind = n + (n-1)*n;
    for(int k=1; k<n; k++){
        ind1 = k*n2;
        ind2 = ind1 - n2;
        for(int l=0; l<n; l++){
            index = l*n3;
            indexend = teilind + index;
            while(index != indexend){
                x[index+ind1] += x[index+ind2];
                index++;
            }
        }
    }
    //Dim 3
    ind1 = n;
    while(ind1 < n+(n-1)*n){
        index = 0;
        indexend = indexges - ind1;
        ind2 = ind1 - n;
        while(index < indexend){
            x[ind1+index] += x[ind2+index];
            index += n2;
        }
        ind1++;
    }
    //Dim 4
    index = 0;
    int i;
    long minind;
    while(index < indexges){
        i = 1;
        minind = min(indexges, index+n);
        while(index+i < minind){
            x[index+i] += x[index+i-1];
            i++;
        }
        index += n;
    }
}
Here is the R function "cumsum_4R" together with the code used to call and compare the cumulative sum functions in R (under Windows; for Linux, the dyn.load/dyn.unload commands need to be adjusted). Ideally I want to use the functions on arrays of size 50^4, but since the call to microbenchmark would then take a while, I have chosen n=40 here:
library("microbenchmark")
# dyn.load("cumsumC.so")
dyn.load("cumsumC.dll")
cumsum_4R <- function(x){
    return(aperm(apply(apply(aperm(apply(apply(x, 2:4, function(a) cumsum(as.numeric(a))),
        c(1,3,4), function(a) cumsum(as.numeric(a))), c(2,1,3,4)), c(1,2,4),
        function(a) cumsum(as.numeric(a))), 1:3, function(a) cumsum(as.numeric(a))), c(3,4,2,1)))
}
cumsumC_4_old <- function(x){
    n <- dim(x)[1]
    arr <- array(.C("cumsumC_4_old", res=as.double(x), as.integer(n))$res, dim=c(n,n,n,n))
    return(arr)
}
cumsumC_4 <- function(x){
    n <- dim(x)[1]
    arr <- array(.C("cumsumC_4", res=as.double(x), as.integer(n))$res, dim=c(n,n,n,n))
    return(arr)
}
set.seed(1234)
n <- 40
x <- array(rnorm(n^4),dim=c(n,n,n,n))
r <- 6 #parameter for rounding results for comparison
res1 <- cumsum_4R(x)
res2 <- cumsumC_4_old(x)
res3 <- cumsumC_4(x)
print(c("Identical R and C1:", identical(round(res1,r),round(res2,r))))
print(c("Identical R and C2:",identical(round(res1,r),round(res3,r))))
times <- microbenchmark(cumsum_4R(x), cumsumC_4_old(x),cumsumC_4(x),times=50)
print(times)
dyn.unload("cumsumC.dll")
# dyn.unload("cumsumC.so")
Thank you for your help!

You can indeed use points 2 and 3 in your original question to solve the problem more efficiently. In fact, they make the problem separable: by separable I mean that the limits of the 4 sums in Equation 3 do not depend on the variables you sum over. This, together with the fact that x is an outer product of 2 matrices, enables you to separate the 4-fold sum in Eq. 3 into an outer product of two 2-fold sums. Even better: the 2 matrices used to define x are the same (denoted by mat by you), so the two 2-fold sums give the same matrix, which has to be calculated only once.
Here is the code:
set.seed(1234)
n = 40
mat = array(rnorm(n^2), dim=c(n,n))
x = outer(mat, mat)
cumsum_sep = function(x) {
    # calculate the matrix corresponding to the 2-fold sums;
    # it is just one matrix because x is an outer product of mat with itself
    tmp = t(apply(apply(x, 2, cumsum), 1, cumsum))
    # outer product of the two-fold sums
    outer(tmp, tmp)
}
y1=cumsum_4R(x)
#note that cumsum_sep operates on the original matrix mat!
y2=cumsum_sep(mat)
Check whether the results are the same:
all.equal(y1,y2)
[1] TRUE
This gives the benchmark results
microbenchmark(cumsum_4R(x), cumsum_sep(mat), times=10)
Unit: milliseconds
            expr         min          lq       mean     median         uq       max neval cld
    cumsum_4R(x) 2084.454155 2135.852305 2226.59692 2251.95928 2270.15198 2402.2724    10   b
 cumsum_sep(mat)    6.844939    7.145546   32.75852   14.45762   34.94397  120.0846    10  a
Quite a difference! :)

Related

More efficient way of iterating over every small square in big square array

I'm in my first few months of learning to code in C through a high school program. Someone recently mentioned to me that there's often a way to make code more efficient, and I think I have a problem that could be made more efficient. I'm not sure how, but I have a hunch that it could be made faster.
We're given a 2D square array of integers with row and column size n. We have subsquares within the array with row and column size s. We can always assume that s evenly divides n. I've written the following code to iterate over each subsquare.
Currently my code looks something like this:
int **grid;
int s, i, j, k, l;
// reading in inputs, other processing
for (i = 0; i < n; i += s) {
    for (j = 0; j < n; j += s) {
        for (k = 0; k < s; k++) {
            for (l = 0; l < s; l++) {
                printf("%d \n", grid[i + k][j + l]);
            }
        }
        printf("next subsquare: \n");
    }
}
As you can see, I've got 4 nested for loops and I feel like it's a bit messy to have it in this format. Is there a better way to do this? Later on I might be summing each subsquare or performing some other operation with each subsquare.

Parallelization of a prefix sum (Openmp)

I have two vectors, a[n] and b[n], where n is a large number.
a[0] = b[0];
for (i = 1; i < size; i++) {
    a[i] = a[i-1] + b[i];
}
With this code we want a[i] to contain the sum of all the numbers in b[] up to b[i]. I need to parallelise this loop using OpenMP.
The main problem is that a[i] depends on a[i-1], so the only direct way that comes to my mind would be waiting for each a[i-1] value to be ready, which takes a lot of time and makes no sense. Is there any approach in OpenMP for solving this problem?
You're Carl Friedrich Gauss in the 18th century and your grade school teacher has decided to punish the class with a homework problem that requires a lot of mundane repeated arithmetic. The previous week your teacher told you to add up the first 100 counting numbers, and because you're clever you came up with a quick solution. Your teacher did not like this, so he came up with a new problem which he thinks can't be optimized. In your own notation you rewrite the problem like this:
a[0] = b[0];
for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];
then you realize that
a[0] = b[0]
a[1] = (b[0]) + b[1]
a[2] = ((b[0]) + b[1]) + b[2]
...
a[n] = b[0] + b[1] + b[2] + ... + b[n]
Using your notation again, you rewrite the problem as
int sum = 0;
for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;
How to optimize this? First you define
int sum(int n0, int n) {
    int sum = 0;
    for (int i = n0; i < n; i++) sum += b[i], a[i] = sum;
    return sum;
}
Then you realize that
a[n+1]   = sum(0, n) + sum(n, n+1)
a[n+2]   = sum(0, n) + sum(n, n+2)
a[n+m]   = sum(0, n) + sum(n, n+m)
a[n+m+k] = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)
So now you know what to do. Find t classmates and have each one work on a subset of the numbers. To keep it simple, you choose size = 100 and four classmates t0, t1, t2, t3; then each one computes
t0               t1               t2               t3
s0 = sum(0,25)   s1 = sum(25,50)  s2 = sum(50,75)  s3 = sum(75,100)
at the same time. Then define
void fix(int n0, int n, int offset) {
    for (int i = n0; i < n; i++) a[i] += offset;
}
and then each classmate goes back over their subset, again at the same time, like this:
t0 t1 t2 t3
fix(0, 25, 0) fix(25, 50, s0) fix(50, 75, s0+s1) fix(75, 100, s0+s1+s2)
You realize that with t classmates each taking about the same K seconds per operation, you can finish the job in 2*K*size/t seconds, whereas one person would take K*size seconds. It's clear you're going to need at least two classmates just to break even. So with four classmates they should finish in half the time of one classmate.
Now you write up your algorithm in your own notation:
int *suma; // array of partial results from each classmate
#pragma omp parallel
{
    int ithread = omp_get_thread_num();   //label of classmate
    int nthreads = omp_get_num_threads(); //number of classmates
    #pragma omp single
    suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;
    //now have each classmate calculate their partial result s = sum(n0, n)
    int s = 0;
    #pragma omp for schedule(static) nowait
    for (int i=0; i<size; i++) s += b[i], a[i] = s;
    suma[ithread+1] = s;
    //now wait for each classmate to finish
    #pragma omp barrier
    //now each classmate sums the results of the previous classmates
    int offset = 0;
    for (int i=0; i<(ithread+1); i++) offset += suma[i];
    //now each classmate corrects their result
    #pragma omp for schedule(static)
    for (int i=0; i<size; i++) a[i] += offset;
}
free(suma);
You realize that you could optimize the part where each classmate has to add up the results of the previous classmates, but since size >> t you don't think it's worth the effort.
Your solution is not nearly as fast as your solution for adding the counting numbers, but nevertheless your teacher is not happy that several of his students finished much earlier than the other students. So now he decides that one student has to read the b array slowly to the class, and when you report the result a it has to be read out slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?
The only thing you can think of is to get multiple classmates to read and record different subsets of the numbers to the class at the same time.

cyclic permutation in O(1) space and O(n) time

I saw an interview question which asked to
Interchange arr[i] and i for i=[0,n-1]
EXAMPLE:
input: 1 2 4 5 3 0
answer: 5 0 1 4 2 3
explanation: a[1]=2 in the input, so a[2]=1 in the answer, and so on.
I attempted this but am not getting the correct answer.
What I am able to do: for a pair of numbers p and q, a[p]=q and a[q]=p.
Any thoughts on how to improve it are welcome.
FOR(j,0,n-1)
{
    i=j;
    do{
        temp=a[i];
        next=a[temp];
        a[temp]=i;
        i=next;
    }while(i>j);
}
print_array(a,i,n);
It would be easier for me to understand your answer if it contains pseudocode with some explanation.
EDIT: I came to know that this is a cyclic permutation, so I changed the question title.
Below is what I came up with (Java code).
For each value x in a, it sets a[x] to the index where x was found, takes the value that a[x] previously held as the next x, and repeats until the cycle returns to its starting point.
I use negative values as a flag to indicate that the value's already been processed.
Running time:
Since it only processes each value once, the running time is O(n).
Code:
int[] a = {1,2,4,5,3,0};
for (int i = 0; i < a.length; i++)
{
    if (a[i] < 0)
        continue;
    int j = a[i];
    int last = i;
    do
    {
        int temp = a[j];
        a[j] = -last-1;
        last = j;
        j = temp;
    }
    while (i != j);
    a[j] = -last-1;
}
for (int i = 0; i < a.length; i++)
    a[i] = -a[i]-1;
System.out.println(Arrays.toString(a));
Here's my suggestion, O(n) time, O(1) space:
void OrderArray(int[] A)
{
    int X = A.Max() + 1;
    for (int i = 0; i < A.Length; i++)
        A[i] *= X;
    for (int i = 0; i < A.Length; i++)
        A[A[i] / X] += i;
    for (int i = 0; i < A.Length; i++)
        A[i] = A[i] % X;
}
A short explanation:
We use X as a basic unit for values in the original array: we multiply each value in the original array by X, which is larger than any number in A (A.Max() + 1). At any point we can then retrieve the number that was in a certain cell of the original array by computing A[i] / X, as long as we haven't added more than X to that cell.
This gives us two layers of values, where A[i] % X represents the value of the cell after the ordering. These two layers don't intersect during the process.
When we are finished, we clean A of the original values multiplied by X by performing A[i] = A[i] % X.
Hope that's clear enough.
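To make the trick concrete, here is a rough C port of the same idea (my own translation, assuming as in the question that the array holds a permutation of 0..n-1), run on the question's sample input:
#include <stdio.h>

/* encode two "layers" per cell: original value * X + new value, with X > any value */
void order_array(int *a, int n) {
    int X = n;                                     /* a permutation of 0..n-1, so n > max */
    for (int i = 0; i < n; i++) a[i] *= X;         /* upper layer: original value */
    for (int i = 0; i < n; i++) a[a[i] / X] += i;  /* lower layer: inverse entry */
    for (int i = 0; i < n; i++) a[i] %= X;         /* keep only the new values */
}

int main(void) {
    int a[] = {1, 2, 4, 5, 3, 0};
    order_array(a, 6);
    for (int i = 0; i < 6; i++) printf("%d ", a[i]);   /* prints: 5 0 1 4 2 3 */
    printf("\n");
    return 0;
}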
Perhaps it is possible by using the images of the input permutation as indices:
void inverse( unsigned int* input, unsigned int* output, unsigned int n )
{
    for ( unsigned int i = 0; i < n; i++ )
        output[ input[ i ] ] = i;
}

LU Decomposition from Numerical Recipes not working; what am I doing wrong?

I've literally copied and pasted from the supplied source code of Numerical Recipes in C for in-place LU matrix decomposition; the problem is that it's not working.
I'm sure I'm doing something stupid, but would appreciate anyone being able to point me in the right direction on this; I've been working on it all day and can't see what I'm doing wrong.
POST-ANSWER UPDATE: The project is finished and working. Thanks to everyone for their guidance.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#define MAT1 3
#define TINY 1e-20
int h_NR_LU_decomp(float *a, int *indx){
    //Taken from Numerical Recipes in C
    int i,imax,j,k;
    float big,dum,sum,temp;
    int n=MAT1;
    float vv[MAT1];
    int d=1.0;
    //Loop over rows to get implicit scaling info
    for (i=0;i<n;i++) {
        big=0.0;
        for (j=0;j<n;j++)
            if ((temp=fabs(a[i*MAT1+j])) > big)
                big=temp;
        if (big == 0.0) return -1; //Singular Matrix
        vv[i]=1.0/big;
    }
    //Outer kij loop
    for (j=0;j<n;j++) {
        for (i=0;i<j;i++) {
            sum=a[i*MAT1+j];
            for (k=0;k<i;k++)
                sum -= a[i*MAT1+k]*a[k*MAT1+j];
            a[i*MAT1+j]=sum;
        }
        big=0.0;
        //search for largest pivot
        for (i=j;i<n;i++) {
            sum=a[i*MAT1+j];
            for (k=0;k<j;k++) sum -= a[i*MAT1+k]*a[k*MAT1+j];
            a[i*MAT1+j]=sum;
            if ((dum=vv[i]*fabs(sum)) >= big) {
                big=dum;
                imax=i;
            }
        }
        //Do we need to swap any rows?
        if (j != imax) {
            for (k=0;k<n;k++) {
                dum=a[imax*MAT1+k];
                a[imax*MAT1+k]=a[j*MAT1+k];
                a[j*MAT1+k]=dum;
            }
            d = -d;
            vv[imax]=vv[j];
        }
        indx[j]=imax;
        if (a[j*MAT1+j] == 0.0) a[j*MAT1+j]=TINY;
        for (k=j+1;k<n;k++) {
            dum=1.0/(a[j*MAT1+j]);
            for (i=j+1;i<n;i++) a[i*MAT1+j] *= dum;
        }
    }
    return 0;
}
void main(){
    //3x3 Matrix
    float exampleA[]={1,3,-2,3,5,6,2,4,3};
    //pivot array (not used currently)
    int* h_pivot = (int *)malloc(sizeof(int)*MAT1);
    int retval = h_NR_LU_decomp(&exampleA[0],h_pivot);
    for (unsigned int i=0; i<3; i++){
        printf("\n%d:",h_pivot[i]);
        for (unsigned int j=0;j<3; j++){
            printf("%.1lf,",exampleA[i*3+j]);
        }
    }
}
WolframAlpha says the answer should be
1,3,-2
2,-2,7
3,2,-2
I'm getting:
2,4,3
0.2,2,-2.8
0.8,1,6.5
And so far I have found at least 3 different versions of the 'same' algorithm, so I'm completely confused.
PS yes I know there are at least a dozen different libraries to do this, but I'm more interested in understanding what I'm doing wrong than the right answer.
PPS: since in LU decomposition the lower resultant matrix has ones on its diagonal, and using Crout's algorithm as (I think) implemented, array index access is still safe; both L and U can therefore be superimposed on each other in place, hence the single resultant matrix.
I think there's something inherently wrong with your indices. They sometimes have unusual start and end values, and the outer loop over j instead of i makes me suspicious.
Before you ask anyone to examine your code, here are a few suggestions:
double-check your indices
get rid of those obfuscation attempts using sum
use a macro a(i,j) instead of a[i*MAT1+j]
write sub-functions instead of comments
remove unnecessary parts, isolating the erroneous code
Here's a version that follows these suggestions:
#define MAT1 3
#define a(i,j) a[(i)*MAT1+(j)]
int h_NR_LU_decomp(float *a, int *indx)
{
    int i, j, k;
    int n = MAT1;
    for (i = 0; i < n; i++) {
        // compute U (row i, on and above the diagonal)
        for (j = i; j < n; j++)
            for (k = 0; k < i; k++)
                a(i,j) -= a(i,k) * a(k,j);
        // compute L (column i, below the diagonal)
        for (j = i+1; j < n; j++) {
            for (k = 0; k < i; k++)
                a(j,i) -= a(j,k) * a(k,i);
            a(j,i) /= a(i,i);
        }
    }
    return 0;
}
Its main advantages are:
it's readable
it works
It lacks pivoting, though. Add sub-functions as needed.
My advice: don't copy someone else's code without understanding it.
Most programmers are bad programmers.
For the love of all that is holy, don't use Numerical Recipes code for anything except as a toy implementation for teaching purposes of the algorithms described in the text -- and, really, the text isn't that great. And, as you're learning, neither is the code.
Certainly don't put any Numerical Recipes routine in your own code -- the license is insanely restrictive, particularly given the code quality. You won't be able to distribute your own code if you have NR stuff in there.
See if your system already has a LAPACK library installed. It's the standard interface to linear algebra routines in computational science and engineering, and while it's not perfect, you'll be able to find LAPACK libraries for any machine you ever move your code to, and you can just compile, link, and run. If it's not already installed on your system, your package manager (rpm, apt-get, fink, port, whatever) probably knows about LAPACK and can install it for you. If not, as long as you have a Fortran compiler on your system, you can download and compile it from here, and the standard C bindings can be found just below on the same page.
The reason it's so handy to have a standard API to linear algebra routines is that they are so common, but their performance is so system-dependent. Goto BLAS, for instance, is an insanely fast implementation for x86 systems of the low-level operations needed for linear algebra; once you have LAPACK working, you can install that library to make everything as fast as possible.
Once you have any sort of LAPACK installed, the routine for doing an LU factorization of a general matrix is SGETRF for floats, or DGETRF for doubles. There are other, faster routines if you know something about the structure of the matrix: that it's symmetric positive definite, say (SPOTRF), or that it's tridiagonal (SGTTRF). It's a big library, but once you learn your way around it you'll have a very powerful piece of gear in your numerical toolbox.
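For illustration, here is a minimal sketch of what the call looks like from C, assuming the LAPACKE C interface is installed (link flags vary by system, e.g. -llapacke -llapack):
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    double A[9] = {1, 3, -2,
                   3, 5,  6,
                   2, 4,  3};
    lapack_int ipiv[3];
    /* in-place LU factorization with partial pivoting: A = P*L*U */
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, 3, 3, A, 3, ipiv);
    if (info != 0) { printf("dgetrf failed: %d\n", (int)info); return 1; }
    /* A now holds L (unit diagonal, below) and U (on and above the diagonal);
       ipiv holds the 1-based pivot rows */
    for (int i = 0; i < 3; i++)
        printf("%8.4f %8.4f %8.4f   ipiv=%d\n",
               A[3*i], A[3*i+1], A[3*i+2], (int)ipiv[i]);
    return 0;
}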
The thing that looks most suspicious to me is the part marked "search for largest pivot". It does not only search, it also changes the matrix A; I find it hard to believe that is correct.
The different versions of the LU algorithm differ in pivoting, so make sure you understand that. You cannot compare the results of different algorithms. A better check is to see whether L times U equals your original matrix, or a permutation thereof if your algorithm does pivoting. That being said, your result is wrong because the determinant is wrong (pivoting does not change the determinant, except for its sign).
Apart from that, @Philip has good advice. If you want to understand the code, start by understanding LU decomposition without pivoting.
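Here is a minimal sketch of such a check (the helper check_lu is mine, written against the NR conventions in the question's code: U on and above the diagonal, unit-diagonal L below it, row-major n x n, and indx[j] recording the row swapped with row j at step j):
#include <math.h>
#include <stdlib.h>

/* returns 1 if L.U (packed in lu) equals the row-permuted original matrix */
int check_lu(const double *lu, const int *indx, const double *orig, int n) {
    double *pa = malloc((size_t)n * n * sizeof *pa);
    int ok = 1;
    if (!pa) return 0;
    /* replay the recorded row swaps on a copy of the original: pa = P.A */
    for (int i = 0; i < n*n; i++) pa[i] = orig[i];
    for (int j = 0; j < n; j++) {
        if (indx[j] == j) continue;
        for (int k = 0; k < n; k++) {
            double t = pa[j*n+k];
            pa[j*n+k] = pa[indx[j]*n+k];
            pa[indx[j]*n+k] = t;
        }
    }
    /* compare (L.U)[i][j] against pa[i][j] */
    for (int i = 0; i < n && ok; i++)
        for (int j = 0; j < n && ok; j++) {
            double prod = 0.0;
            int kmax = i < j ? i : j;
            for (int k = 0; k <= kmax; k++) {
                double lik = (k == i) ? 1.0 : lu[i*n+k];  /* L has a unit diagonal */
                prod += lik * lu[k*n+j];                  /* U entry */
            }
            if (fabs(prod - pa[i*n+j]) > 1e-6) ok = 0;
        }
    free(pa);
    return ok;
}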
To badly paraphrase Albert Einstein: "... a man with a watch always knows the exact time, but a man with two is never sure ..."
Your code is definitely not producing the correct result, but even if it were, the result with pivoting will not directly correspond to the result without pivoting. In the context of a pivoting solution, what Alpha has really given you is probably the equivalent of this:
1 0 0 1 0 0 1 3 -2
P= 0 1 0 L= 2 1 0 U = 0 -2 7
0 0 1 3 2 1 0 0 -2
which will then satisfy the condition A = P.L.U (where . denotes the matrix product). If I compute the (notionally) same decomposition operation another way (using the LAPACK routine dgetrf via numpy in this case):
In [27]: A
Out[27]:
array([[ 1, 3, -2],
[ 3, 5, 6],
[ 2, 4, 3]])
In [28]: import scipy.linalg as la
In [29]: LU,ipivot = la.lu_factor(A)
In [30]: print LU
[[ 3. 5. 6. ]
[ 0.33333333 1.33333333 -4. ]
[ 0.66666667 0.5 1. ]]
In [31]: print ipivot
[1 1 2]
After a little bit of black magic with ipivot we get
0 1 0 1 0 0 3 5 6
P = 0 0 1 L = 0.33333 1 0 U = 0 1.3333 -4
1 0 0 0.66667 0.5 1 0 0 1
which also satisfies A = P.L.U . Both of these factorizations are correct, but they are different and they won't correspond to a correctly functioning version of the NR code.
So before you can go deciding whether you have the "right" answer, you really should spend a bit of time understanding the actual algorithm that the code you copied implements.
This thread has been viewed 6k times in the past 10 years. I had used NR Fortran and C for many years, and do not share the low opinions expressed here.
I explored the issue you encountered, and I believe the problem in your code is here:
for (k=j+1;k<n;k++) {
    dum=1.0/(a[j*MAT1+j]);
    for (i=j+1;i<n;i++) a[i*MAT1+j] *= dum;
}
while the original uses if (j != n-1) { ... }. The two are not equivalent: your outer loop over k re-applies the scaling of column j once for every k from j+1 to n-1, so the sub-diagonal entries are divided by the pivot repeatedly instead of exactly once.
NR's lubksb() does have a small issue in the way it sets up finding the first non-zero element, but this can be skipped at very low cost, even for a large matrix. With that, both ludcmp() and lubksb(), entered as published, work just fine, and as far as I can tell perform well.
Here's a complete test code, mostly preserving the notation of NR, with minor simplifications (tested under Ubuntu Linux/gcc):
/* A sample program to demonstrate matrix inversion using
 * Crout's algorithm from Teukolsky and Press (Numerical Recipes):
 * LU decomposition + back-substitution, with partial pivoting
 * 2022.06 edward.sternin at brocku.ca
 */
#define N 7
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define a(i,j) a[(i)*n+(j)]
/* implied 1D layout is a(0,0), a(0,1), ... a(0,n-1), a(1,0), a(1,1), ... */
void matrixPrint (double *M, int nrow, int ncol) {
    int i,j;
    for (i=0;i<nrow;i++) {
        for (j=0;j<ncol;j++) { fprintf(stderr," %+.3f\t",M[i*ncol+j]); }
        fprintf(stderr,"\n");
    }
}
void die(char msg[]) {
    fprintf(stderr,"ERROR in %s, aborting\n",msg);
    exit(1);
}
void ludcmp(double *a, int n, int *indx) {
    int i, imax, j, k;
    double big, dum, sum, temp;
    double *vv;
    /* i=row index, i=0..(n-1); j=col index, j=0..(n-1) */
    vv=(double *)malloc((size_t)(n * sizeof(double)));
    if (!vv) die("ludcmp: allocation failure");
    for (i = 0; i < n; i++) { /* loop over rows */
        big = 0.0;
        for (j = 0; j < n; j++) {
            if ((temp=fabs(a(i,j))) > big) big=temp;
        }
        if (big == 0.0) die("ludcmp: a singular matrix provided");
        vv[i] = 1.0 / big; /* vv stores the scaling factor for each row */
    }
    for (j = 0; j < n; j++) { /* Crout's method: loop over columns */
        for (i = 0; i < j; i++) { /* except for i=j */
            sum = a(i,j);
            for (k = 0; k < i; k++) { sum -= a(i,k) * a(k,j); }
            a(i,j) = sum; /* Eq. 2.3.12, in situ */
        }
        big = 0.0; /* searching for the largest pivot element */
        for (i = j; i < n; i++) {
            sum = a(i,j);
            for (k = 0; k < j; k++) { sum -= a(i,k) * a(k,j); }
            a(i,j) = sum;
            if ((dum = vv[i] * fabs(sum)) >= big) {
                big = dum;
                imax = i;
            }
        }
        if (j != imax) { /* if needed, interchange rows */
            for (k = 0; k < n; k++){
                dum = a(imax,k);
                a(imax,k) = a(j,k);
                a(j,k) = dum;
            }
            vv[imax] = vv[j]; /* keep the scale factor with the new row location */
        }
        indx[j] = imax;
        if (j != n-1) { /* divide by the pivot element */
            dum = 1.0 / a(j,j);
            for (i = j + 1; i < n; i++) a(i,j) *= dum;
        }
    }
    free(vv);
}
void lubksb(double *a, int n, int *indx, double *b) {
    int i, ip, j;
    double sum;
    for (i = 0; i < n; i++) {
        /* Forward substitution, Eq.2.3.6, unscrambling permutations from indx[] */
        ip = indx[i];
        sum = b[ip];
        b[ip] = b[i];
        for (j = 0; j < i; j++) sum -= a(i,j) * b[j];
        b[i] = sum;
    }
    for (i = n-1; i >= 0; i--) { /* backsubstitution, Eq. 2.3.7 */
        sum = b[i];
        for (j = i + 1; j < n; j++) sum -= a(i,j) * b[j];
        b[i] = sum / a(i,i);
    }
}
int main() {
    double *a,*y,*col,*aa,*res,sum;
    int i,j,k,*indx;
    a=(double *)malloc((size_t)(N*N * sizeof(double)));
    y=(double *)malloc((size_t)(N*N * sizeof(double)));
    col=(double *)malloc((size_t)(N * sizeof(double)));
    indx=(int *)malloc((size_t)(N * sizeof(int)));
    aa=(double *)malloc((size_t)(N*N * sizeof(double)));
    res=(double *)malloc((size_t)(N*N * sizeof(double)));
    if (!a || !y || !col || !indx || !aa || !res) die("main: memory allocation failure");
    srand48((long int) N);
    for (i=0;i<N;i++) {
        for (j=0;j<N;j++) { aa[i*N+j] = a[i*N+j] = drand48(); }
    }
    fprintf(stderr,"\nRandomly generated matrix A = \n");
    matrixPrint(a,N,N);
    ludcmp(a,N,indx);
    for(j=0;j<N;j++) {
        for(i=0;i<N;i++) { col[i]=0.0; }
        col[j]=1.0;
        lubksb(a,N,indx,col);
        for(i=0;i<N;i++) { y[i*N+j]=col[i]; }
    }
    fprintf(stderr,"\nResult of LU/BackSub is inv(A) :\n");
    matrixPrint(y,N,N);
    for (i=0; i<N; i++) {
        for (j=0;j<N;j++) {
            sum = 0;
            for (k=0; k<N; k++) { sum += y[i*N+k] * aa[k*N+j]; }
            res[i*N+j] = sum;
        }
    }
    fprintf(stderr,"\nResult of inv(A).A = (should be 1):\n");
    matrixPrint(res,N,N);
    return(0);
}

How to calculate the total number of iterations of the innermost loop of a nested for loop? Is there a formula?

For example:
int count = 0;
for (int i = 0; i < 12; i++)
    for (int j = i+1; j < 10; j++)
        for (int k = j+1; k < 8; k++)
            count++;
System.out.println("count = " + count);
or
for (int i = 0; i < I; i++)
    for (int j = i+1; j < J; j++)
        for (int k = j+1; k < K; k++)
            ...
                for (int z = y+1; z < Z; z++)
                    count++;
What is the value of count after all iterations? Is there a formula to calculate it?
It's a math problem of summation.
Basically, one can prove that
for (i = a; i < b; i++)
    count += 1;
is equivalent to
count += b - a;
Similarly,
for (i = a; i < b; i++)
    count += i;
is equivalent to
count += 0.5 * (b*(b-1) - a*(a-1));
You can get similar formulas using for instance wolframalpha (Wolfram's Mathematica)
This system will do the symbolic calculation for you, so for instance,
for(int i=0;i<A;i++)
for(int j=i+1;j<B;j++)
for(int k=j+1;k<C;k++)
count++
is a Mathematica query:
http://www.wolframalpha.com/input/?i=Sum[Sum[Sum[1,{k,j%2B1,C-1}],{j,i%2B1,B-1}],{i,0,A-1}]
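As a quick sanity check, here is a small self-contained program (the bounds 3 and 11 are arbitrary choices of mine) confirming both equivalences above:
#include <stdio.h>

int main(void) {
    int a = 3, b = 11, count1 = 0, count2 = 0;
    for (int i = a; i < b; i++) count1 += 1;
    printf("%d == %d\n", count1, b - a);                     /* 8 == 8 */
    for (int i = a; i < b; i++) count2 += i;
    printf("%d == %d\n", count2, (b*(b-1) - a*(a-1)) / 2);   /* 52 == 52 */
    return 0;
}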
Not a full answer, but when the loop bounds are all the same (say they're all n), the formula is C(n, nb_for_loops), which may already interest you :)
final int n = 50;
int count = 0;
for (int i = 0; i < n; i++) {
    for (int j = i + 1; j < n; j++) {
        for (int k = j + 1; k < n; k++) {
            for (int l = k+1; l < n; l++) {
                count++;
            }
        }
    }
}
System.out.println( count );
Will give 230300 which is C(50,4).
You can compute this easily using the binomial coefficient:
http://en.wikipedia.org/wiki/Binomial_coefficient
One formula to compute this is: n! / (k! * (n-k)!)
For example if you want to know how many different sets of 5 cards can be taken out of a 52 cards deck, you can either use 5 nested loops or use the formula above, they'll both give: 2 598 960
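For illustration, a small sketch (the helper binom is mine) that computes the coefficient iteratively instead of running five nested loops:
#include <stdio.h>

long binom(int n, int k) {
    long r = 1;
    for (int i = 1; i <= k; i++)
        r = r * (n - k + i) / i;   /* exact: i consecutive factors are divisible by i! */
    return r;
}

int main(void) {
    printf("C(52,5) = %ld\n", binom(52, 5));   /* prints 2598960 */
    return 0;
}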
That's roughly the volume of a hyperpyramid (http://www.physicsinsights.org/pyramids-1.html): 1/d * n^d, with d the number of dimensions.
The formula works for real numbers, so you have to adapt it for integers. For the case d=2 (the hyperpyramid is then a triangle), 1/2 * (n*n) becomes the well-known formula n(n+1)/2 (or n(n-1)/2, depending on whether you include the diagonal or not). I'll let you do the math.
The fact that you're not using n every time but I, J, K is not a problem, since you can rewrite each loop as two loops stopping in the middle, so they all stop at the same number; the formula might then become 1/d * ((n/2)^d) * 2 (I'm not sure, but something similar should be OK).
That's not really the answer to your question, but I hope it will help you find a real one.
