DTRMM & DTRSM hangs on certain matrix sizes

DTRMM & DTRSM hangs on certain matrix sizes - c

I'm testing performance of ?GEMM, ?TRMM, ?TRSM using MKL's automatic offload on the new Intel Xeon Phi coprocessors and am having some issues with DTRMM and DTRSM. I have code to test the performance for matrix size in steps of 1024 up to 10240 and performance seems to drop off significantly somewhere after N=M=K=8192. When I try testing exactly where by using step sizes of 2, my script was hanging. I then checked 512 step sizes, which work fine, 256 work as well, but anything under 256 just stalls. I cannot find any known issues in regards to this problem. All single precision versions work, as well as single and double precision on ?GEMM. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>
#include <time.h>
#include "mkl.h"
#define DBG 0
int main(int argc, char **argv)
{
char transa = 'N', side = 'L', uplo = 'L', diag = 'U';
MKL_INT N, NP; // N = M, N, K, lda, ldb, ldc
double alpha = 1.0; // Scaling factors
double *A, *B; // Matrices
int matrix_bytes; // Matrix size in bytes
int matrix_elements; // Matrix size in elements
int i, j; // Counters
int msec;
clock_t start, diff;
N = atoi(argv[1]);
start = clock();
matrix_elements = N * N;
matrix_bytes = sizeof(double) * matrix_elements;
// Allocate the matrices
A = malloc(matrix_bytes);
if (A == NULL)
{
printf("Could not allocate matrix A\n");
return -1;
}
B = malloc(matrix_bytes);
if (B == NULL)
{
printf("Could not allocate matrix B\n");
return -1;
}
for (i = 0; i < matrix_elements; i++)
{
A[i] = 0.0;
B[i] = 0.0;
}
// Initialize the matrices
for (i = 0; i < N; i++)
for (j = 0; j <= i; j++)
{
A[i+N*j] = 1.0;
B[i+N*j] = 2.0;
}
// DTRMM call
dtrmm(&side, &uplo, &transa, &diag, &N, &N, &alpha, A, &N, B, &N);
diff = clock() - start;
msec = diff * 1000 / CLOCKS_PER_SEC;
printf("%f\n", (float)msec * 10e-4);
if (DBG == 1)
{
printf("\nMatrix dimension is set to %d \n\n", (int)N);
// Display the result
printf("\nResulting matrix B:\n");
if (N > 10)
{
printf("NOTE: B is too large, print only upper-left 10x10 block...\n");
NP = 10;
}
else
NP = N;
printf("\n");
for (i = 0; i < NP; i++)
{
for (j = 0; j < NP; j++)
printf("%7.3f ", B[i + j * N]);
printf("\n");
}
}
// Free the matrix memory
free(A);
free(B);
return 0;
}
Any help or insight would be greatly appreciated.

This phenomenon has been extensively discussed in other questions, and also in Intel's Software Optimization Manual and Agner Fog's notes.
Typically, you are experiencing a perfect storm of evictions in the memory hierarchy, such that suddenly (nearly) every single access misses cache and/or TLB (one can determine exactly which resource is missing by looking at the specific data access pattern or by using the PMCs; I can do the calculation later when I'm near a whiteboard, unless mystical gets to you first).
You can also search through some of my or Mystical's answers to find previous answers.

The issue was an older version of Intel's icc compiler (beta 10 update, I believe.. maybe). Gold update works like a charm.

Related

Parallel computing using multiple cores with Open-MP.

I am struggling to figure out how to parallelize this code with OpenMP, any help is appreciated. Below is the base code and a description.
In the simulation of a collection of soft particles (such as proteins in a fluid), there is a repulsive force between a pair of particles when they overlap. The goal of this assignment is to use parallel computing to accelerate the computation of these repulsive forces, using multiple cores with Open-MP.
In the force repulsion function, the particles are assumed to have unit radius. The particles are in a “simulation box” of dimensions L × L × L. The dimension L is chosen such that the volume fraction of particles is φ = 0.3. The simulation box has periodic (wrap-around) boundary conditions, which explains why we need to use the remainder function to compute the distance between two particles. If the particles overlap, i.e., the distance s between two particles is less than 2, then the repulsive force is proportional to k(2−s) where k is a force constant. The force is along the vector joining the two particles. Write a program that tests the correctness of your code. This can be done by computing the correct forces and comparing them to the forces computed by your optimized code. Give evidence in your report that your program works correctly using your test program
How much faster is your accelerated code compared to the provided baseline code? Include timings for different problem sizes. Be sure to include a listing of your code in your report.
Code to parallelize
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>
double get_walltime() {
struct timeval tp;
gettimeofday(&tp, NULL);
return (double) (tp.tv_sec + tp.tv_usec*1e-6); }
void force_repulsion(int np, const double *pos, double L, double krepulsion, double *forces)
{
int i, j;
double posi [4]; double rvec [4];
double s2, s, f;
// initialize forces to zero
for (i=0; i<3*np; i++)
forces[i] = 0.;
// loop over all pairs
for (i=0; i<np; i++)
{
posi[0] = pos[3*i ];
posi[1] = pos[3*i+1]; posi[2] = pos[3*i+2];
for (j=i+1; j<np; j++)
{
// compute minimum image difference
rvec[0] = remainder(posi[0] - pos[3*j ], L);
rvec[1] = remainder(posi[1] - pos[3*j+1], L);
rvec[2] = remainder(posi[2] - pos[3*j+2], L);
s2 = rvec [0]* rvec [0] + rvec [1]* rvec [1] + rvec [2]* rvec [2];
if (s2 < 4)
{
s = sqrt(s2);
rvec[0] /= s; rvec[1] /= s;
rvec[2] /= s;
f = krepulsion*(2.-s);
forces[3*i ] += f*rvec[0];
forces[3*i+1] += f*rvec[1];
forces[3*i+2] += f*rvec[2];
forces[3*j ] += -f*rvec[0];
forces[3*j+1] += -f*rvec[1];
forces[3*j+2] += -f*rvec[2]; }
} }
}
int main(int argc, char *argv[]) {
int i;
int np = 100; // default number of particles
double phi = 0.3; // volume fraction
double krepulsion = 125.; // force constant
double *pos; double *forces;
double L, time0 , time1;
if (argc > 1)
np = atoi(argv[1]);
L = pow(4./3.*3.1415926536*np/phi, 1./3.);
// generate random particle positions inside simulation box
forces = (double *) malloc(3*np*sizeof(double));
pos = (double *) malloc(3*np*sizeof(double));
for (i=0; i<3*np; i++)
pos[i] = rand()/(double)RAND_MAX*L;
// measure execution time of this function
time0 = get_walltime ();
force_repulsion(np, pos, L, krepulsion, forces);
time1 = get_walltime ();
printf("number of particles: %d\n", np);
printf("elapsed time: %f\n", time1-time0);
free(forces);
free(pos);
return 0; }

Theoretically, it would be as simple as this:
void force_repulsion(int np, const double *pos, double L, double krepulsion,
double *forces)
{
// initialize forces to zero
#pragma omp parallel for
for (int i = 0; i < 3 * np; i++)
forces[i] = 0.;
// loop over all pairs
#pragma omp parallel for
for (int i = 0; i < np; i++)
{
double posi[4];
double rvec[4];
double s2, s, f;
posi[0] = pos[3 * i];
//...
Compilation:
g++ -fopenmp example.cc -o example
Note that I did not check for correctness. Make sure you won't have global variable inside the parallel for (as I updated your code..)

Weird results: matrix multiplication using pthreads

I made a program that multiplies matrices of the same dimension using (p)threads. The program accepts command line flags -N n -M m where n is the size of the matrix arrays and m is the number of threads (computing threshold). The program compiles and runs but I get strange times for elapsed time, USR time, SYS time, and USR+SYS time. I am testing sizes n = {1000,2000,4000} with each threshold m = {1,2,4}.
I should be seeing the elapsed time reduced and a fairly constant USR+SYS time for each value of n but that is not the case. The output will fluctuate but the problem is that a higher threshold doesn't result in a reduction of elapsed time. Am I implementing threads wrong or is there an issue with my timing?
compile with: -pthread
./* -N n -M m
main
#include<stdio.h>
#include<stdlib.h>
#include<unistd.h>
#include<pthread.h>
#include<sys/time.h>
#include<sys/resource.h>
struct matrix {
double **Matrix_A;
double **Matrix_B;
int begin;
int end;
int n;
};
void *calculon(void *mtrx) {
struct matrix *f_mat = (struct matrix *)mtrx;
// transfer data
int f_begin = f_mat->begin;
int f_end = f_mat->end;
int f_n = f_mat->n;
// definition of temp matrix
double ** Matrix_C;
Matrix_C = (double**)malloc(sizeof(double)*f_n);
int f_pholder;
for(f_pholder=0; f_pholder < f_n; f_pholder++)
Matrix_C[f_pholder] = (double*)malloc(sizeof(double)*f_n);
int x, y, z;
for(x = f_begin; x < f_end; x++)
for(y = 0; y < f_n; y++)
for(z = 0; z < f_n; z++)
Matrix_C[x][y] += f_mat->Matrix_A[x][z]*f_mat->Matrix_B[z][y];
for(f_pholder = 0; f_pholder < f_n; f_pholder++)
free(Matrix_C[f_pholder]);
free(Matrix_C);
}
int main(int argc, char **argv) {
char *p;
int c, i, j, x, y, n, m, pholder, n_m, make_thread;
int m_begin = 0;
int m_end = 0;
while((c=getopt(argc, argv, "NM")) != -1) {
switch(c) {
case 'N':
n = strtol(argv[optind], &p, 10);
break;
case 'M':
m = strtol(argv[optind], &p, 10);
break;
default:
printf("\n**WARNING**\nUsage: -N n -M m");
break;
}
}
if(m > n)
printf("\n**WARNING**\nUsage: -N n -M m\n=> m > n");
else if(n%m != 0)
printf("\n**WARNING**\nUsage: -N n -M m\n=> n % m = 0");
else {
n_m = n/m;
// initialize input matrices
double ** thread_matrixA;
double ** thread_matrixB;
// allocate rows onto heap
thread_matrixA=(double**)malloc(sizeof(double)*n);
thread_matrixB=(double**)malloc(sizeof(double)*n);
// allocate columns onto heap
for(pholder = 0; pholder < n; pholder++) {
thread_matrixA[pholder]=(double*)malloc(sizeof(double)*n);
thread_matrixB[pholder]=(double*)malloc(sizeof(double)*n);
}
// populate input matrices with random numbers
for(i = 0; i < n; i++)
for(j = 0; j < n; j++)
thread_matrixA[i][j] = (double)rand()/RAND_MAX+1;
for(x = 0; x < n; x++)
for(y = 0; y < n; y++)
thread_matrixB[x][y] = (double)rand()/RAND_MAX+1;
printf("\n*** Matrix will be of size %d x %d *** \n", n, n);
printf("*** Creating matrix with %d thread(s) ***\n", m);
struct rusage r_usage;
struct timeval usage;
struct timeval time1, time2;
struct timeval cpu_time1, cpu_time2;
struct timeval sys_time1, sys_time2;
struct matrix mat;
pthread_t thread_lord[m];
// begin timing
getrusage(RUSAGE_SELF, &r_usage);
cpu_time1 = r_usage.ru_utime;
sys_time1 = r_usage.ru_stime;
gettimeofday(&time1, NULL);
for(make_thread = 0; make_thread < m; make_thread++) {
m_begin += n_m;
// assign values to struct
mat.Matrix_A = thread_matrixA;
mat.Matrix_B = thread_matrixB;
mat.n = n;
mat.begin = m_begin;
mat.end = m_end;
// create threads
pthread_create(&thread_lord[make_thread], NULL, calculon, (void *)&mat);
m_begin = (m_end + 1);
}
// wait for thread to finish before joining
for(i = 0; i < m; i++)
pthread_join(thread_lord[i], NULL);
// end timing
getrusage(RUSAGE_SELF, &r_usage);
cpu_time2 = r_usage.ru_utime;
sys_time2 = r_usage.ru_stime;
gettimeofday(&time2, NULL);
printf("\nUser time: %f seconds\n", ((cpu_time2.tv_sec * 1000000 + cpu_time2.tv_usec) - (cpu_time1.tv_sec * 1000000 + cpu_time1.tv_usec))/1e6);
printf("System time: %f seconds\n", ((sys_time2.tv_sec * 1000000 + sys_time2.tv_usec) - (sys_time1.tv_sec * 1000000 + sys_time1.tv_usec))/1e6);
printf("Wallclock time: %f seconds\n\n", ((time2.tv_sec * 1000000 + time2.tv_usec) - (time1.tv_sec * 1000000 + time1.tv_usec))/1e6);
// deallocate matrices
for(pholder = 0; pholder < n; pholder++) {
free(thread_matrixA[pholder]);
free(thread_matrixB[pholder]);
}
free(thread_matrixA);
free(thread_matrixB);
}
return 0;
}
Timing

My guess is that all those malloc()s you're using in the individual threads take a lot more time than you save by splitting the calculation among the threads. Math is fast; malloc() is slow. (To oversimplify a bit)
Sometimes weird timing behavior with threads happens when you have multiple threads trying to access a shared resource which is protected by some kind of exclusive lock. (Example, from something I did a long time ago) But I don't think that's the case here because, first of all, you don't seem to be using any locks, and second, the timing pattern that results typically has the runtime increasing by a little bit as you increase the number of threads. In this case, your runtime increases as you increase the number of threads, by a lot (specifically: it seems to be related to the thread number), so I suspect per-thread resource usage to be the culprit.
That being said, I'm having a hard time confirming my guess, so I can't be sure about this.

Alpha blending using table lookup is not as fast as expected

I thought memory access would be faster than the multiplication and division (although compiler-optimized) done with alpha blending. But it wasn't as fast as expected.
The 16 megabytes used for the table is not an issue in this case. But it is a problem if table lookup could even be slower than doing all the CPU calculations.
Can anyone explain to me why and what is happening? Will the table lookup beat out with a slower CPU?
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>
#define COLOR_MAX UCHAR_MAX
typedef unsigned char color;
color (*blending_table)[COLOR_MAX + 1][COLOR_MAX + 1];
static color blend(unsigned int destination, unsigned int source, unsigned int a) {
return (source * a + destination * (COLOR_MAX - a)) / COLOR_MAX;
}
void initialize_blending_table(void) {
int destination, source, a;
blending_table = malloc((COLOR_MAX + 1) * sizeof *blending_table);
for (destination = 0; destination <= COLOR_MAX; ++destination) {
for (source = 0; source <= COLOR_MAX; ++source) {
for (a = 0; a <= COLOR_MAX; ++a) {
blending_table[destination][source][a] = blend(destination, source, a);
}
}
}
}
struct timer {
double start;
double end;
};
void timer_start(struct timer *self) {
self->start = clock();
}
void timer_end(struct timer *self) {
self->end = clock();
}
double timer_measure_in_seconds(struct timer *self) {
return (self->end - self->start) / CLOCKS_PER_SEC;
}
#define n 300
int main(void) {
struct timer timer;
volatile int i, j, k, l, m;
timer_start(&timer);
initialize_blending_table();
timer_end(&timer);
printf("init %f\n", timer_measure_in_seconds(&timer));
timer_start(&timer);
for (i = 0; i <= n; ++i) {
for (j = 0; j <= COLOR_MAX; ++j) {
for (k = 0; k <= COLOR_MAX; ++k) {
for (l = 0; l <= COLOR_MAX; ++l) {
m = blending_table[j][k][l];
}
}
}
}
timer_end(&timer);
printf("table %f\n", timer_measure_in_seconds(&timer));
timer_start(&timer);
for (i = 0; i <= n; ++i) {
for (j = 0; j <= COLOR_MAX; ++j) {
for (k = 0; k <= COLOR_MAX; ++k) {
for (l = 0; l <= COLOR_MAX; ++l) {
m = blend(j, k, l);
}
}
}
}
timer_end(&timer);
printf("function %f\n", timer_measure_in_seconds(&timer));
return EXIT_SUCCESS;
}
result
$ gcc test.c -O3
$ ./a.out
init 0.034328
table 14.176643
function 14.183924

Table lookup is not a panacea. It helps when the table is small enough, but in your case the table is very big. You write
16 megabytes used for the table is not an issue in this case
which I think is very wrong, and is possibly the source of the problem you experience. 16 megabytes is too big for L1 cache, so reading data from random indices in the table will involve the slower caches (L2, L3, etc). The penalty for cache misses is typically large; your blending algorithm must be very complex if you want your LUT solution to be faster.
Read the Wikipedia article for more info.

Your benchmark is hopelessly broken, it makes the LUT look a lot better than it actually is because it reads the table in-order.
If your performance results show that the LUT is worse than direct calculation, then when you start with real-world random access patterns and cache misses, the LUT is going to be much worse.
Focus on improving the computation, and enabling vectorization. It's likely to pay off far better than a table-based approach.
(source * a + destination * (COLOR_MAX - a)) / COLOR_MAX
with rearrangement becomes
(source * a + destination * COLOR_MAX - destination * a) / COLOR_MAX
which simplifies to
destination + (source - destination) * a / COLOR_MAX
which has one multiply and one division by a constant, both of which are very efficient. And it is easily vectorized.
You should also mark your helper function as inline, although a good optimizing compiler is probably inlining it anyway.

What is the convention of indexing 2D array with x/y coordinates in C?

I have been writing small program, in which I had to use coordinates system on board (x/y in 2d array) and was thinking if I should use indexing like array[x][y], which seems more natural to me or array[y][x] which would match better the way array is represented in memory. I believe both of these methods will be working if I am consistent and it's just naming issue, but what about conventions when writing larger programs?

In my field (image manipulation) the [y][x] convention is more usual. Whatever you do, be consistent and document it well.

You should also consider what you are going to do with these arrays, and whether this is time-critical.
As mentioned in the comments: The element a[r][c+1] is right next to a[r][c]. This fact may have a considerable impact on the performance when iterating over larger arrays. A proper traversal order will cause the cache lines to be fully exploited: When one array index is accessed, it is considered as being "likely" that afterwards, the next index will be accessed, and a whole block of memory will be loaded into the cache. If you are afterwards accessing a completely different memory location (namely, one in the next row), then this cache bandwidth is wasted.
If possible, you should try to use a traversal order that fits the actual memory layout.
(Of course, this is much about "conventions" and "habits": When writing an array access like a[row][col], this is usually interpreted as array access a[y][x], due to the convention of the x-axis being horizontal and the y-axis being vertical...)
Here is a small example that demonstrates the potential performance impact of a "wrong" traversal order:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
float computeSumRowMajor(float **array, int rows, int cols)
{
float sum = 0;
for (int r=0; r<rows; r++)
{
for (int c=0; c<cols; c++)
{
sum += array[r][c];
}
}
return sum;
}
float computeSumColMajor(float **array, int rows, int cols)
{
float sum = 0;
for (int c=0; c<cols; c++)
{
for (int r=0; r<rows; r++)
{
sum += array[r][c];
}
}
return sum;
}
int main()
{
int rows = 5000;
int cols = 5000;
float **array = (float**)malloc(rows*sizeof(float*));
for (int r=0; r<rows; r++)
{
array[r] = (float*)malloc(cols*sizeof(float));
for (int c=0; c<cols; c++)
{
array[r][c] = 0.01f;
}
}
clock_t start, end;
start = clock();
float sumRowMajor = 0;
for (int i=0; i<10; i++)
{
sumRowMajor += computeSumRowMajor(array, rows, cols);
}
end = clock();
double timeRowMajor = ((double) (end - start)) / CLOCKS_PER_SEC;
start = clock();
float sumColMajor = 0;
for (int i=0; i<10; i++)
{
sumColMajor += computeSumColMajor(array, rows, cols);
}
end = clock();
double timeColMajor = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("Row major %f, result %f\n", timeRowMajor, sumRowMajor);
printf("Col major %f, result %f\n", timeColMajor, sumColMajor);
return 0;
}
(apologies if I violated some best practices here, I'm usually a Java guy...)
For me, the row-major access is nearly an order of magnitude faster than the column-major access. Of course, the exact numbers will heavily depend on the target system, but the general issue should be the same on all targets.

LU Decomposition from Numerical Recipes not working; what am I doing wrong?

I've literally copied and pasted from the supplied source code for Numerical Recipes for C for in-place LU Matrix Decomposition, problem is its not working.
I'm sure I'm doing something stupid but would appreciate anyone being able to point me in the right direction on this; I've been working on its all day and can't see what I'm doing wrong.
POST-ANSWER UPDATE: The project is finished and working. Thanks to everyone for their guidance.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#define MAT1 3
#define TINY 1e-20
int h_NR_LU_decomp(float *a, int *indx){
//Taken from Numerical Recipies for C
int i,imax,j,k;
float big,dum,sum,temp;
int n=MAT1;
float vv[MAT1];
int d=1.0;
//Loop over rows to get implicit scaling info
for (i=0;i<n;i++) {
big=0.0;
for (j=0;j<n;j++)
if ((temp=fabs(a[i*MAT1+j])) > big)
big=temp;
if (big == 0.0) return -1; //Singular Matrix
vv[i]=1.0/big;
}
//Outer kij loop
for (j=0;j<n;j++) {
for (i=0;i<j;i++) {
sum=a[i*MAT1+j];
for (k=0;k<i;k++)
sum -= a[i*MAT1+k]*a[k*MAT1+j];
a[i*MAT1+j]=sum;
}
big=0.0;
//search for largest pivot
for (i=j;i<n;i++) {
sum=a[i*MAT1+j];
for (k=0;k<j;k++) sum -= a[i*MAT1+k]*a[k*MAT1+j];
a[i*MAT1+j]=sum;
if ((dum=vv[i]*fabs(sum)) >= big) {
big=dum;
imax=i;
}
}
//Do we need to swap any rows?
if (j != imax) {
for (k=0;k<n;k++) {
dum=a[imax*MAT1+k];
a[imax*MAT1+k]=a[j*MAT1+k];
a[j*MAT1+k]=dum;
}
d = -d;
vv[imax]=vv[j];
}
indx[j]=imax;
if (a[j*MAT1+j] == 0.0) a[j*MAT1+j]=TINY;
for (k=j+1;k<n;k++) {
dum=1.0/(a[j*MAT1+j]);
for (i=j+1;i<n;i++) a[i*MAT1+j] *= dum;
}
}
return 0;
}
void main(){
//3x3 Matrix
float exampleA[]={1,3,-2,3,5,6,2,4,3};
//pivot array (not used currently)
int* h_pivot = (int *)malloc(sizeof(int)*MAT1);
int retval = h_NR_LU_decomp(&exampleA[0],h_pivot);
for (unsigned int i=0; i<3; i++){
printf("\n%d:",h_pivot[i]);
for (unsigned int j=0;j<3; j++){
printf("%.1lf,",exampleA[i*3+j]);
}
}
}
WolframAlpha says the answer should be
1,3,-2
2,-2,7
3,2,-2
I'm getting:
2,4,3
0.2,2,-2.8
0.8,1,6.5
And so far I have found at least 3 different versions of the 'same' algorithm, so I'm completely confused.
PS yes I know there are at least a dozen different libraries to do this, but I'm more interested in understanding what I'm doing wrong than the right answer.
PPS since in LU Decomposition the lower resultant matrix is unity, and using Crouts algorithm as (i think) implemented, array index access is still safe, both L and U can be superimposed on each other in-place; hence the single resultant matrix for this.

I think there's something inherently wrong with your indices. They sometimes have unusual start and end values, and the outer loop over j instead of i makes me suspicious.
Before you ask anyone to examine your code, here are a few suggestions:
double-check your indices
get rid of those obfuscation attempts using sum
use a macro a(i,j) instead of a[i*MAT1+j]
write sub-functions instead of comments
remove unnecessary parts, isolating the erroneous code
Here's a version that follows these suggestions:
#define MAT1 3
#define a(i,j) a[(i)*MAT1+(j)]
int h_NR_LU_decomp(float *a, int *indx)
{
int i, j, k;
int n = MAT1;
for (i = 0; i < n; i++) {
// compute R
for (j = i; j < n; j++)
for (k = 0; k < i-2; k++)
a(i,j) -= a(i,k) * a(k,j);
// compute L
for (j = i+1; j < n; j++)
for (k = 0; k < i-2; k++)
a(j,i) -= a(j,k) * a(k,i);
}
return 0;
}
Its main advantages are:
it's readable
it works
It lacks pivoting, though. Add sub-functions as needed.
My advice: don't copy someone else's code without understanding it.
Most programmers are bad programmers.

For the love of all that is holy, don't use Numerical Recipies code for anything except as a toy implementation for teaching purposes of the algorithms described in the text -- and, really, the text isn't that great. And, as you're learning, neither is the code.
Certainly don't put any Numerical Recipies routine in your own code -- the license is insanely restrictive, particularly given the code quality. You won't be able to distribute your own code if you have NR stuff in there.
See if your system already has a LAPACK library installed. It's the standard interface to linear algebra routines in computational science and engineering, and while it's not perfect, you'll be able to find lapack libraries for any machine you ever move your code to, and you can just compile, link, and run. If it's not already installed on your system, your package manager (rpm, apt-get, fink, port, whatever) probably knows about lapack and can install it for you. If not, as long as you have a Fortran compiler on your system, you can download and compile it from here, and the standard C bindings can be found just below on the same page.
The reason it's so handy to have a standard API to linear algebra routines is that they are so common, but their performance is so system-dependant. So for instance, Goto BLAS
is an insanely fast implementation for x86 systems of the low-level operations which are needed for linear algebra; once you have LAPACK working, you can install that library to make everything as fast as possible.
Once you have any sort of LAPACK installed, the routine for doing an LU factorization of a general matrix is SGETRF for floats, or DGETRF for doubles. There are other, faster routines if you know something about the structure of the matrix - that it's symmetric positive definite, say (SBPTRF), or that it's tridiagonal (STDTRF). It's a big library, but once you learn your way around it you'll have a very powerful piece of gear in your numerical toolbox.

The thing that looks most suspicious to me is the part marked "search for largest pivot". This does not only search but it also changes the matrix A. I find it hard to believe that is correct.
The different version of the LU algorithm differ in pivoting, so make sure you understand that. You cannot compare the results of different algorithms. A better check is to see whether L times U equals your original matrix, or a permutation thereof if your algorithm does pivoting. That being said, your result is wrong because the determinant is wrong (pivoting does not change the determinant, except for the sign).
Apart from that #Philip has good advice. If you want to understand the code, start by understanding LU decomposition without pivoting.

To badly paraphrase Albert Einstein:
... a man with a watch always knows the
exact time, but a man with two is
never sure ....
Your code is definitely not producing the correct result, but even if it were, the result with pivoting will not directly correspond to the result without pivoting. In the context of a pivoting solution, what Alpha has really given you is probably the equivalent of this:
1 0 0 1 0 0 1 3 -2
P= 0 1 0 L= 2 1 0 U = 0 -2 7
0 0 1 3 2 1 0 0 -2
which will then satisfy the condition A = P.L.U (where . denotes the matrix product). If I compute the (notionally) same decomposition operation another way (using the LAPACK routine dgetrf via numpy in this case):
In [27]: A
Out[27]:
array([[ 1, 3, -2],
[ 3, 5, 6],
[ 2, 4, 3]])
In [28]: import scipy.linalg as la
In [29]: LU,ipivot = la.lu_factor(A)
In [30]: print LU
[[ 3. 5. 6. ]
[ 0.33333333 1.33333333 -4. ]
[ 0.66666667 0.5 1. ]]
In [31]: print ipivot
[1 1 2]
After a little bit of black magic with ipivot we get
0 1 0 1 0 0 3 5 6
P = 0 0 1 L = 0.33333 1 0 U = 0 1.3333 -4
1 0 0 0.66667 0.5 1 0 0 1
which also satisfies A = P.L.U . Both of these factorizations are correct, but they are different and they won't correspond to a correctly functioning version of the NR code.
So before you can go deciding whether you have the "right" answer, you really should spend a bit of time understanding the actual algorithm that the code you copied implements.

This thread has been viewed 6k times in the past 10 years. I had used NR Fortran and C for many years, and do not share the low opinions expressed here.
I explored the issue you encountered, and I believe the problem in your code is here:
for (k=j+1;k<n;k++) {
dum=1.0/(a[j*MAT1+j]);
for (i=j+1;i<n;i++) a[i*MAT1+j] *= dum;
}
while in the original if (j != n-1) { ... } is used. I think the two are not equivalent.
NR's lubksb() does have a small issue in the way they set up finding the first non-zero element, but this can be skipped at very low cost, even for a large matrix. With that, both ludcmp() and lubksb(), entered as published, work just fine, and as far as I can tell perform well.
Here's a complete test code, mostly preserving the notation of NR, wth minor simplifications (tested under Ubuntu Linux/gcc):
/* A sample program to demonstrate matrix inversion using the
* Crout's algorithm from Teukolsky and Press (Numerical Recipes):
* LU decomposition + back-substitution, with partial pivoting
* 2022.06 edward.sternin at brocku.ca
*/
#define N 7
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define a(i,j) a[(i)*n+(j)]
/* implied 1D layout is a(0,0), a(0,1), ... a(0,n-1), a(1,0), a(1,1), ... */
void matrixPrint (double *M, int nrow, int ncol) {
int i,j;
for (i=0;i<nrow;i++) {
for (j=0;j<ncol;j++) { fprintf(stderr," %+.3f\t",M[i*ncol+j]); }
fprintf(stderr,"\n");
}
}
void die(char msg[]) {
fprintf(stderr,"ERROR in %s, aborting\n",msg);
exit(1);
}
void ludcmp(double *a, int n, int *indx) {
int i, imax, j, k;
double big, dum, sum, temp;
double *vv;
/* i=row index, i=0..(n-1); j=col index, j=0..(n-1) */
vv=(double *)malloc((size_t)(n * sizeof(double)));
if (!vv) die("ludcmp: allocation failure");
for (i = 0; i < n; i++) { /* loop over rows */
big = 0.0;
for (j = 0; j < n; j++) {
if ((temp=fabs(a(i,j))) > big) big=temp;
}
if (big == 0.0) die("ludcmp: a singular matrix provided");
vv[i] = 1.0 / big; /* vv stores the scaling factor for each row */
}
for (j = 0; j < n; j++) { /* Crout's method: loop over columns */
for (i = 0; i < j; i++) { /* except for i=j */
sum = a(i,j);
for (k = 0; k < i; k++) { sum -= a(i,k) * a(k,j); }
a(i,j) = sum; /* Eq. 2.3.12, in situ */
}
big = 0.0; /* searching for the largest pivot element */
for (i = j; i < n; i++) {
sum = a(i,j);
for (k = 0; k < j; k++) { sum -= a(i,k) * a(k,j); }
a(i,j) = sum;
if ((dum = vv[i] * fabs(sum)) >= big) {
big = dum;
imax = i;
}
}
if (j != imax) { /* if needed, interchange rows */
for (k = 0; k < n; k++){
dum = a(imax,k);
a(imax,k) = a(j,k);
a(j,k) = dum;
}
vv[imax] = vv[j]; /* keep the scale factor with the new row location */
}
indx[j] = imax;
if (j != n-1) { /* divide by the pivot element */
dum = 1.0 / a(j,j);
for (i = j + 1; i < n; i++) a(i,j) *= dum;
}
}
free(vv);
}
void lubksb(double *a, int n, int *indx, double *b) {
int i, ip, j;
double sum;
for (i = 0; i < n; i++) {
/* Forward substitution, Eq.2.3.6, unscrambling permutations from indx[] */
ip = indx[i];
sum = b[ip];
b[ip] = b[i];
for (j = 0; j < i; j++) sum -= a(i,j) * b[j];
b[i] = sum;
}
for (i = n-1; i >= 0; i--) { /* backsubstitution, Eq. 2.3.7 */
sum = b[i];
for (j = i + 1; j < n; j++) sum -= a(i,j) * b[j];
b[i] = sum / a(i,i);
}
}
int main() {
double *a,*y,*col,*aa,*res,sum;
int i,j,k,*indx;
a=(double *)malloc((size_t)(N*N * sizeof(double)));
y=(double *)malloc((size_t)(N*N * sizeof(double)));
col=(double *)malloc((size_t)(N * sizeof(double)));
indx=(int *)malloc((size_t)(N * sizeof(int)));
aa=(double *)malloc((size_t)(N*N * sizeof(double)));
res=(double *)malloc((size_t)(N*N * sizeof(double)));
if (!a || !y || !col || !indx || !aa || !res) die("main: memory allocation failure");
srand48((long int) N);
for (i=0;i<N;i++) {
for (j=0;j<N;j++) { aa[i*N+j] = a[i*N+j] = drand48(); }
}
fprintf(stderr,"\nRandomly generated matrix A = \n");
matrixPrint(a,N,N);
ludcmp(a,N,indx);
for(j=0;j<N;j++) {
for(i=0;i<N;i++) { col[i]=0.0; }
col[j]=1.0;
lubksb(a,N,indx,col);
for(i=0;i<N;i++) { y[i*N+j]=col[i]; }
}
fprintf(stderr,"\nResult of LU/BackSub is inv(A) :\n");
matrixPrint(y,N,N);
for (i=0; i<N; i++) {
for (j=0;j<N;j++) {
sum = 0;
for (k=0; k<N; k++) { sum += y[i*N+k] * aa[k*N+j]; }
res[i*N+j] = sum;
}
}
fprintf(stderr,"\nResult of inv(A).A = (should be 1):\n");
matrixPrint(res,N,N);
return(0);
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight