Multithreading for backpropagation algorithm - c

To speed up some neural network learning, I tried to do some multi-threading, since for a particular layer, the calculations for each neuron are independent from one another.
The original function I used is some basic backpropagation algorithm, for the evaluation of the deltas in the net :
δ = derivative * Σ (weight * previous δ)
void backpropagation (Autoencoder* AE)
{
int i, j, k;
for(i = AE->numLayer-2; i >= 0; i--)
{
for(j = 0; j < AE->layer[i].size; j++)
{
register double sum = 0.0;
for(k = 0; k < AE->layer[i+1].size; k++)
{
sum += AE->layer[i+1].neuron[k].weight[j] * AE->layer[i+1].neuron[k].delta;
}
AE->layer[i].neuron[j].delta = AE->layer[i].neuron[j].derivative * sum;
}
}
}
Autoencoder being the structure containing the neural net. It worked fine enough, if a bit slow, and it seems like a good idea to try this function first.
The modified functions are the following :
void backpropagationmultithread (Autoencoder* AE, unsigned int ncore, pthread_t* pth)
{
int i, j;
unsigned int neuronpercore, extra;
sem_t semaphore;
argThread* args[ncore];
for(i = AE->numLayer-2; i >= 0; i--)
{
neuronpercore = AE->layer[i].size / ncore;
extra = neuronpercore + (AE->layer[i].size % ncore);
sem_init(&semaphore, 0, -ncore);
for(j = 0; j < ncore; j++)
{
args[j] = malloc(sizeof(argThread));
args[j]->layer = i;
args[j]->AE = AE;
args[j]->sem = &semaphore;
args[j]->startat = neuronpercore * j;
args[j]->nneurons = (j!=ncore-1)?neuronpercore:extra;
pthread_create(&pth[j], NULL, backpropagationthread, (void*)args[j]);
}
sem_wait(&semaphore);
for(j = 0; j < ncore; j++)
{
pthread_cancel(pth[j]);
}
}
}
And the function for the new threads :
void* backpropagationthread (void* arg)
{
argThread* args = (argThread*) arg;
unsigned int j,k,layer = args->layer, start = args->startat, end = args->startat + args->nneurons;
Autoencoder* AE = args->AE;
for(j = start; j < end; j++)
{
register double sum = 0.0;
for(k = 0; k < AE->layer[layer+1].size; k++)
{
sum += AE->layer[layer+1].neuron[k].weight[j] * AE->layer[layer+1].neuron[k].delta;
}
AE->layer[layer].neuron[j].delta = AE->layer[layer].neuron[j].derivative * sum;
}
sem_post(args->sem);
free(arg);
return NULL;
}
argThread is just a little structure that contains all the arguments to be passed to the thread, ncore the number of CPU cores. The idea was to split up each layer into a roughly equal number of neurons to be treated individually by each thread (the last one with all the extra neurons if they are not multiples).
The new function does work to some degree, and much faster, but after a certain threshold does not converge anymore, where the old function did, and I cannot find why its behaviour would change. Am I missing some neurons or weights?

I implemented a multi-threaded backpropagation algorithm for Encog some time ago. I wrote this in Java, but when I implemented it in C, I made use of OpenMP, rather than pthreads. You can see my C implementation here.
https://github.com/encog/encog-c
I also wrote an article about my approach for performing backpropagation in multi-threaded.
You can see my article here.
http://www.heatonresearch.com/encog/mprop/compare.html
There are a few other questions about this on Stack Overflow, as well. Most seem to reference my algorithm.
Multithreaded backpropagation
How can I apply multithreading to the backpropagation neural network training?

Related

Trying out with some code to make use of multithreading in C

I have an app that can spawn any given number of threads. So I'd like this code to become multithreaded
void *some_thread_fuction(void *param)
{
struct some_struct *obj=(struct some_struct *)param;
int m=obj->m;
int n=...
double t[m+2][n+2]={0};
for (i=0; i <= m+1; i++) {
for (j=0; j <= n+1; j++) {
t[i][j] = 30.0;
}
}
for (i=1; i <= m; i++) {
t[i][0] = 40.0;
t[i][n+1] = 90.0;
}
for (j=1; j <= n; j++) {
t[0][j] = 30.0;
t[m+1][j] = 50.0;
}
memcpy(global_t,t,...);
}
I am having simple reasoning issue as to why I like to make this a multi-threaded program. but it make sense because if I have 5 threads (assumed I am taking how many number of threads to spawn at program start in program parameter) and n=20 m=20 which is also fed at program start as parameters then I can try working on 0-4 in one thread, 5-8 in second thread and so on until 16-20 in last iteration of first loop(just an example, because m=etc n=etc and number of threads can be anything values fed by user).
But more importantly I am having tough time as to how to even dissect the three for loops to distribute the processing over amount of work to multiple threads to completion of all the loops in this code. This is simple code, so it's just a real world example that I am having difficulty understanding how to do it in code for a threaded program for this scenario.
Move this piece of code into a function:
for (i=0; i <= m+1; i++) {
for (j=0; j <= n+1; j++) {
t[i][j] = 30.0;
}
}
as follows:
void initialize(double t[], int start_i, int end_i, int n) {
for (i=start_i; i <= end_i; i++) {
for (j=0; j <= n+1; j++) {
t[i][j] = 30.0;
}
}
}
You can then split the interval [0 m+1] into as 5 intervals and call the initialize function from each thread for each interval.
That said, there must be more efficient ways to achieve the same thing using some copy instructions.

How to make an IIR filter?

i'm trying to make IIR filter. I made FIR filter, but I feels IIR is more difficult than FIR.
I think IIR is similar with FIR, but it made me feels confused.
I think the filters are like this.
FIR : y(n) = b0(x[n]) + ... +bM-1(x[n-M+1])
IIR : y(n) = {b0(x[n]) + ... +bM-1(x[n-M+1])} - {a1(y[n-1]) + ... +aN(y[n-N}
in this case, how about a0? Is it just 1?
The part of y[n-1]..... is the problem. I confused how to make it.
Here is my code.
for (n = 0; n < length; n++) {
coeffa = coeffs_A;
coeffb = coeffs_B;
inputp = &insamp[filterLength - 1 + n];
acc = 0;
bcc = 0;
for (k = 0; k < filterLength; k++) {
bcc += (*coeffb++) * (*inputp--);
}
for (k = 0; k < filterLength; k++) {
acc += (*coeffa++) * (////////);
}
output[n] = bcc-acc;
}
In this case, filterLength is 7 and n is 80
////// is what i want to know.
Am I think wrong?
Typically IIR filters are implemented using direct form I or direct form II topologies. Each form requires memory states. These memory states keep the histories of the outputs and inputs in them. This makes it a lot simpler to implement the IIR filter. gtkiostream implements a direct form II approach and may be a helpful reference.
In addressing your question, I like the approach of directly estimating your filters from inputs and outputs, using the input and output buffers as memory states. As a coefficients always operate on outputs, use the outputs as the missing variable, something like this :
acc += (*coeffa++) * output[n-k];
I took your programm and rewrote it a bit to make it a MCVE and be easier readable:
#include <stdio.h>
#define flen 3
#define slen 10
int main(void)
{
int n,k;
double a[flen] = {1,-0.5,0}, b[flen]={1,0,0};
double input[slen]={0,0,1}; // 0s for flen-1 to avoid out of bounds
double output[slen]={0};
double acc, bcc;
for (n = flen-1; n < slen; n++) {
acc = 0;
bcc = 0;
for (k = 0; k < flen; k++) {
bcc += b[k] * (input[n-k]);
}
for (k = 0; k < flen; k++) {
acc += a[k] * (output[n-k]);
}
output[n] = bcc-acc;
printf("%f\t%f\t%f\t%f\n",input[n],acc,bcc,output[n]);
}
return 0;
}
I am not entirely sure this is correct because I do not have the means to test with different filter settings against a filter design tool.
The impulse response for the given coefficents seems to be okay. But I haven't done anything in this topic for a few years now so I am not certain. The general idea should be okay I think.

Using Hash Tables in Leu of Multidimensional Array

EDIT: Found a solution! Like the commenters suggested, using memset is an insanely better approach. Replace the entire for loop with
memset(lookup->n, -3, (dimensions*sizeof(signed char)));
where
long int dimensions = box1 * box2 * box3 * box4 * box5 * box6 * box7 * box8 * memvara * memvarb * memvarc * memvard * adirect * tdirect * fs * bs * outputnum;
Intro
Right now, I'm looking at a beast of a for-loop:
for (j = 0;j < box1; j++)
{
for (k = 0; k < box2; k++)
{
for (l = 0; l < box3; l++)
{
for (m = 0; m < box4; m++)
{
for (x = 0;x < box5; x++)
{
for (y = 0; y < box6; y++)
{
for (xa = 0;xa < box7; xa++)
{
for (xb = 0; xb < box8; xb++)
{
for (nb = 0; nb < memvara; nb++)
{
for (na = 0; na < memvarb; na++)
{
for (nx = 0; nx < memvarc; nx++)
{
for (nx1 = 0; nx1 < memvard; nx1++)
{
for (naa = 0; naa < adirect; naa++)
{
for (nbb = 0; nbb < tdirect; nbb++)
{
for (ncc = 0; ncc < fs; ncc++)
{
for (ndd = 0; ndd < bs; ndd++)
{
for (o = 0; o < outputnum; o++)
{
lookup->n[j][k][l][m][x][y][xa][xb][nb][na][nx][nx1][naa][nbb][ncc][ndd][o] = -3; //set to default value
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
The Problem
This loop is called every cycle in the main run to reset values to an initial state. Unfortunately, it is necessary for the structure of the program that this many values are kept in a single data structure.
Here's the kicker: for every 60 seconds of program run time, 57 seconds goes to this function alone.
The Question
My question is this: would hash tables be an appropriate substitute for a linear array? This array has an O(n^17) cardinality, yet hash tables have an ideal of O(1).
If so, what hash library would you recommend? This program is in C and has no native hash support.
If not, what would you recommend instead?
Can you provide some pseudo-code on how you think this should be implemented?
Notes
OpenMP was used in an attempt to parallelize this loop. Numerous implementations only resulted in slightly-to-greatly increased run time.
Memory usage is not particularly an issue -- this program is intended to be ran on an insanely high-spec'd computer.
We are student researchers, thrust into a heretofore unknown world of optimization and parallelization -- please bear with us, and thank you for any help
Hash vs Array
As comments have specified, an array should not be a problem here. Lookup into an array with a known offset is O(1).
The Bottleneck
It seems to me that the bulk of the work here (and the reason it is slow) is the number of pointer de-references in the inner-loop.
To explain in a bit more detail, consider myData[x][y][z] in the following code:
for (int x = 0; x < someVal1; x++) {
for (int y = 0; y < someVal2; y++) {
for (int z = 0; z < someVal3; z++) {
myData[x][y][z] = -3; // x and y only change in outer-loops.
}
}
}
To compute the location for the -3, we do a lookup and add a value - once for myData[x], then again to get to myData[x][y], and once more finally for myData[x][y][z].
Since this lookup is in the inner-most portion of the loop, we have redundant reads. myData[x] and myData[x][y] are being recomputed, even when only z's value is changing. The lookups were performed during a previous iteration, but the results weren't stored.
For your loop, there are many layers of lookups being computed each iteration, even when only the value of o is changing in that inner-loop.
An Improvement for the Bottleneck
To make one lookup, per loop iteration, per loop level, simply store intermediate lookups. Using int* as the indirection (though any type would work here), the sample code above (with myData) would become:
int **a, *b;
for (int x = 0; x < someVal1; x++) {
a = myData[x]; // Store the lookup.
for (int y = 0; y < someVal2; y++) {
b = a[y]; // Indirection based on the stored lookup.
for (int z = 0; z < someVal3; z++) {
b[z] = -3; // This can be extrapolated as needed to deeper levels.
}
}
}
This is just sample code, small adjustments may be necessary to get it to compile (casts and so forth). Note that there is probably no advantage to using this approach with a 3-dimensional array. However, for a 17-dimensional large data set with simple inner-loop operations (such as assignment), this approach should help quite a bit.
Finally, I'm assuming you aren't actually just assigning the value of -3. You can use memset to accomplish that goal much more efficiently.

OpenMP and 17 Nested For-Loops

I have a giant nested for-loop, designed to set a large array to its default value. I'm trying to use OpenMP for the first time to parallelize, and have no idea where to begin. I have been reading tutorials, and am afraid the process will be performed independently on N number of cores, instead of N cores divided the process amongst itself for a common output. The code is in C, compiled in Visual Studio v14. Any help for this newbie is appreciated -- thanks!
(Attached below is the monster nested for-loop...)
for (j = 0;j < box1; j++)
{
for (k = 0; k < box2; k++)
{
for (l = 0; l < box3; l++)
{
for (m = 0; m < box4; m++)
{
for (x = 0;x < box5; x++)
{
for (y = 0; y < box6; y++)
{
for (xa = 0;xa < box7; xa++)
{
for (xb = 0; xb < box8; xb++)
{
for (nb = 0; nb < memvara; nb++)
{
for (na = 0; na < memvarb; na++)
{
for (nx = 0; nx < memvarc; nx++)
{
for (nx1 = 0; nx1 < memvard; nx1++)
{
for (naa = 0; naa < adirect; naa++)
{
for (nbb = 0; nbb < tdirect; nbb++)
{
for (ncc = 0; ncc < fs; ncc++)
{
for (ndd = 0; ndd < bs; ndd++)
{
for (o = 0; o < outputnum; o++)
{
lookup->n[j][k][l][m][x][y][xa][xb][nb][na][nx][nx1][naa][nbb][ncc][ndd][o] = -3; //set to default value
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
If n is actually a multidimensional array, you can do this:
size_t i;
size_t count = sizeof(lookup->n) / sizeof(int);
int *p = (int*)lookup->n;
for( i = 0; i < count; i++ )
{
p[i] = -3;
}
Now, that's much easier to parallelize.
Read more on why this works here (applies to C as well): How do I use arrays in C++?
This is more of an extended comment than an answer.
Find the iteration limit (ie the variable among box1, box2, etc) with the largest value. Revise your loop nest so that the outermost loop runs over that. Simply parallelise the outermost loop. Choosing the largest value means that you'll get, in the limit, an equal number of inner loop iterations to run for each thread.
Collapsing loops, whether you can use OpenMP's collapse clause or have to do it by hand, is only useful when you have reason to believe that parallelising over only the outermost loop will result in significant load imbalance. That seems very unlikely in this case, so distributing the work (approximately) evenly across the available threads at the outermost level would probably provide reasonably good load balancing.
I believe, based on tertiary research, that the solution might be found in adding #pragma omp parallel for collapse(N) directly above the nested loops. However, this seems to only work in OpenMP v3.0, and the whole project is based on Visual Studio (and therefore, OpenMP v2.0) for now...

LU Decomposition from Numerical Recipes not working; what am I doing wrong?

I've literally copied and pasted from the supplied source code for Numerical Recipes for C for in-place LU Matrix Decomposition, problem is its not working.
I'm sure I'm doing something stupid but would appreciate anyone being able to point me in the right direction on this; I've been working on its all day and can't see what I'm doing wrong.
POST-ANSWER UPDATE: The project is finished and working. Thanks to everyone for their guidance.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#define MAT1 3
#define TINY 1e-20
int h_NR_LU_decomp(float *a, int *indx){
//Taken from Numerical Recipies for C
int i,imax,j,k;
float big,dum,sum,temp;
int n=MAT1;
float vv[MAT1];
int d=1.0;
//Loop over rows to get implicit scaling info
for (i=0;i<n;i++) {
big=0.0;
for (j=0;j<n;j++)
if ((temp=fabs(a[i*MAT1+j])) > big)
big=temp;
if (big == 0.0) return -1; //Singular Matrix
vv[i]=1.0/big;
}
//Outer kij loop
for (j=0;j<n;j++) {
for (i=0;i<j;i++) {
sum=a[i*MAT1+j];
for (k=0;k<i;k++)
sum -= a[i*MAT1+k]*a[k*MAT1+j];
a[i*MAT1+j]=sum;
}
big=0.0;
//search for largest pivot
for (i=j;i<n;i++) {
sum=a[i*MAT1+j];
for (k=0;k<j;k++) sum -= a[i*MAT1+k]*a[k*MAT1+j];
a[i*MAT1+j]=sum;
if ((dum=vv[i]*fabs(sum)) >= big) {
big=dum;
imax=i;
}
}
//Do we need to swap any rows?
if (j != imax) {
for (k=0;k<n;k++) {
dum=a[imax*MAT1+k];
a[imax*MAT1+k]=a[j*MAT1+k];
a[j*MAT1+k]=dum;
}
d = -d;
vv[imax]=vv[j];
}
indx[j]=imax;
if (a[j*MAT1+j] == 0.0) a[j*MAT1+j]=TINY;
for (k=j+1;k<n;k++) {
dum=1.0/(a[j*MAT1+j]);
for (i=j+1;i<n;i++) a[i*MAT1+j] *= dum;
}
}
return 0;
}
void main(){
//3x3 Matrix
float exampleA[]={1,3,-2,3,5,6,2,4,3};
//pivot array (not used currently)
int* h_pivot = (int *)malloc(sizeof(int)*MAT1);
int retval = h_NR_LU_decomp(&exampleA[0],h_pivot);
for (unsigned int i=0; i<3; i++){
printf("\n%d:",h_pivot[i]);
for (unsigned int j=0;j<3; j++){
printf("%.1lf,",exampleA[i*3+j]);
}
}
}
WolframAlpha says the answer should be
1,3,-2
2,-2,7
3,2,-2
I'm getting:
2,4,3
0.2,2,-2.8
0.8,1,6.5
And so far I have found at least 3 different versions of the 'same' algorithm, so I'm completely confused.
PS yes I know there are at least a dozen different libraries to do this, but I'm more interested in understanding what I'm doing wrong than the right answer.
PPS since in LU Decomposition the lower resultant matrix is unity, and using Crouts algorithm as (i think) implemented, array index access is still safe, both L and U can be superimposed on each other in-place; hence the single resultant matrix for this.
I think there's something inherently wrong with your indices. They sometimes have unusual start and end values, and the outer loop over j instead of i makes me suspicious.
Before you ask anyone to examine your code, here are a few suggestions:
double-check your indices
get rid of those obfuscation attempts using sum
use a macro a(i,j) instead of a[i*MAT1+j]
write sub-functions instead of comments
remove unnecessary parts, isolating the erroneous code
Here's a version that follows these suggestions:
#define MAT1 3
#define a(i,j) a[(i)*MAT1+(j)]
int h_NR_LU_decomp(float *a, int *indx)
{
int i, j, k;
int n = MAT1;
for (i = 0; i < n; i++) {
// compute R
for (j = i; j < n; j++)
for (k = 0; k < i-2; k++)
a(i,j) -= a(i,k) * a(k,j);
// compute L
for (j = i+1; j < n; j++)
for (k = 0; k < i-2; k++)
a(j,i) -= a(j,k) * a(k,i);
}
return 0;
}
Its main advantages are:
it's readable
it works
It lacks pivoting, though. Add sub-functions as needed.
My advice: don't copy someone else's code without understanding it.
Most programmers are bad programmers.
For the love of all that is holy, don't use Numerical Recipies code for anything except as a toy implementation for teaching purposes of the algorithms described in the text -- and, really, the text isn't that great. And, as you're learning, neither is the code.
Certainly don't put any Numerical Recipies routine in your own code -- the license is insanely restrictive, particularly given the code quality. You won't be able to distribute your own code if you have NR stuff in there.
See if your system already has a LAPACK library installed. It's the standard interface to linear algebra routines in computational science and engineering, and while it's not perfect, you'll be able to find lapack libraries for any machine you ever move your code to, and you can just compile, link, and run. If it's not already installed on your system, your package manager (rpm, apt-get, fink, port, whatever) probably knows about lapack and can install it for you. If not, as long as you have a Fortran compiler on your system, you can download and compile it from here, and the standard C bindings can be found just below on the same page.
The reason it's so handy to have a standard API to linear algebra routines is that they are so common, but their performance is so system-dependant. So for instance, Goto BLAS
is an insanely fast implementation for x86 systems of the low-level operations which are needed for linear algebra; once you have LAPACK working, you can install that library to make everything as fast as possible.
Once you have any sort of LAPACK installed, the routine for doing an LU factorization of a general matrix is SGETRF for floats, or DGETRF for doubles. There are other, faster routines if you know something about the structure of the matrix - that it's symmetric positive definite, say (SBPTRF), or that it's tridiagonal (STDTRF). It's a big library, but once you learn your way around it you'll have a very powerful piece of gear in your numerical toolbox.
The thing that looks most suspicious to me is the part marked "search for largest pivot". This does not only search but it also changes the matrix A. I find it hard to believe that is correct.
The different version of the LU algorithm differ in pivoting, so make sure you understand that. You cannot compare the results of different algorithms. A better check is to see whether L times U equals your original matrix, or a permutation thereof if your algorithm does pivoting. That being said, your result is wrong because the determinant is wrong (pivoting does not change the determinant, except for the sign).
Apart from that #Philip has good advice. If you want to understand the code, start by understanding LU decomposition without pivoting.
To badly paraphrase Albert Einstein:
... a man with a watch always knows the
exact time, but a man with two is
never sure ....
Your code is definitely not producing the correct result, but even if it were, the result with pivoting will not directly correspond to the result without pivoting. In the context of a pivoting solution, what Alpha has really given you is probably the equivalent of this:
1 0 0 1 0 0 1 3 -2
P= 0 1 0 L= 2 1 0 U = 0 -2 7
0 0 1 3 2 1 0 0 -2
which will then satisfy the condition A = P.L.U (where . denotes the matrix product). If I compute the (notionally) same decomposition operation another way (using the LAPACK routine dgetrf via numpy in this case):
In [27]: A
Out[27]:
array([[ 1, 3, -2],
[ 3, 5, 6],
[ 2, 4, 3]])
In [28]: import scipy.linalg as la
In [29]: LU,ipivot = la.lu_factor(A)
In [30]: print LU
[[ 3. 5. 6. ]
[ 0.33333333 1.33333333 -4. ]
[ 0.66666667 0.5 1. ]]
In [31]: print ipivot
[1 1 2]
After a little bit of black magic with ipivot we get
0 1 0 1 0 0 3 5 6
P = 0 0 1 L = 0.33333 1 0 U = 0 1.3333 -4
1 0 0 0.66667 0.5 1 0 0 1
which also satisfies A = P.L.U . Both of these factorizations are correct, but they are different and they won't correspond to a correctly functioning version of the NR code.
So before you can go deciding whether you have the "right" answer, you really should spend a bit of time understanding the actual algorithm that the code you copied implements.
This thread has been viewed 6k times in the past 10 years. I had used NR Fortran and C for many years, and do not share the low opinions expressed here.
I explored the issue you encountered, and I believe the problem in your code is here:
for (k=j+1;k<n;k++) {
dum=1.0/(a[j*MAT1+j]);
for (i=j+1;i<n;i++) a[i*MAT1+j] *= dum;
}
while in the original if (j != n-1) { ... } is used. I think the two are not equivalent.
NR's lubksb() does have a small issue in the way they set up finding the first non-zero element, but this can be skipped at very low cost, even for a large matrix. With that, both ludcmp() and lubksb(), entered as published, work just fine, and as far as I can tell perform well.
Here's a complete test code, mostly preserving the notation of NR, wth minor simplifications (tested under Ubuntu Linux/gcc):
/* A sample program to demonstrate matrix inversion using the
* Crout's algorithm from Teukolsky and Press (Numerical Recipes):
* LU decomposition + back-substitution, with partial pivoting
* 2022.06 edward.sternin at brocku.ca
*/
#define N 7
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define a(i,j) a[(i)*n+(j)]
/* implied 1D layout is a(0,0), a(0,1), ... a(0,n-1), a(1,0), a(1,1), ... */
void matrixPrint (double *M, int nrow, int ncol) {
int i,j;
for (i=0;i<nrow;i++) {
for (j=0;j<ncol;j++) { fprintf(stderr," %+.3f\t",M[i*ncol+j]); }
fprintf(stderr,"\n");
}
}
void die(char msg[]) {
fprintf(stderr,"ERROR in %s, aborting\n",msg);
exit(1);
}
void ludcmp(double *a, int n, int *indx) {
int i, imax, j, k;
double big, dum, sum, temp;
double *vv;
/* i=row index, i=0..(n-1); j=col index, j=0..(n-1) */
vv=(double *)malloc((size_t)(n * sizeof(double)));
if (!vv) die("ludcmp: allocation failure");
for (i = 0; i < n; i++) { /* loop over rows */
big = 0.0;
for (j = 0; j < n; j++) {
if ((temp=fabs(a(i,j))) > big) big=temp;
}
if (big == 0.0) die("ludcmp: a singular matrix provided");
vv[i] = 1.0 / big; /* vv stores the scaling factor for each row */
}
for (j = 0; j < n; j++) { /* Crout's method: loop over columns */
for (i = 0; i < j; i++) { /* except for i=j */
sum = a(i,j);
for (k = 0; k < i; k++) { sum -= a(i,k) * a(k,j); }
a(i,j) = sum; /* Eq. 2.3.12, in situ */
}
big = 0.0; /* searching for the largest pivot element */
for (i = j; i < n; i++) {
sum = a(i,j);
for (k = 0; k < j; k++) { sum -= a(i,k) * a(k,j); }
a(i,j) = sum;
if ((dum = vv[i] * fabs(sum)) >= big) {
big = dum;
imax = i;
}
}
if (j != imax) { /* if needed, interchange rows */
for (k = 0; k < n; k++){
dum = a(imax,k);
a(imax,k) = a(j,k);
a(j,k) = dum;
}
vv[imax] = vv[j]; /* keep the scale factor with the new row location */
}
indx[j] = imax;
if (j != n-1) { /* divide by the pivot element */
dum = 1.0 / a(j,j);
for (i = j + 1; i < n; i++) a(i,j) *= dum;
}
}
free(vv);
}
void lubksb(double *a, int n, int *indx, double *b) {
int i, ip, j;
double sum;
for (i = 0; i < n; i++) {
/* Forward substitution, Eq.2.3.6, unscrambling permutations from indx[] */
ip = indx[i];
sum = b[ip];
b[ip] = b[i];
for (j = 0; j < i; j++) sum -= a(i,j) * b[j];
b[i] = sum;
}
for (i = n-1; i >= 0; i--) { /* backsubstitution, Eq. 2.3.7 */
sum = b[i];
for (j = i + 1; j < n; j++) sum -= a(i,j) * b[j];
b[i] = sum / a(i,i);
}
}
int main() {
double *a,*y,*col,*aa,*res,sum;
int i,j,k,*indx;
a=(double *)malloc((size_t)(N*N * sizeof(double)));
y=(double *)malloc((size_t)(N*N * sizeof(double)));
col=(double *)malloc((size_t)(N * sizeof(double)));
indx=(int *)malloc((size_t)(N * sizeof(int)));
aa=(double *)malloc((size_t)(N*N * sizeof(double)));
res=(double *)malloc((size_t)(N*N * sizeof(double)));
if (!a || !y || !col || !indx || !aa || !res) die("main: memory allocation failure");
srand48((long int) N);
for (i=0;i<N;i++) {
for (j=0;j<N;j++) { aa[i*N+j] = a[i*N+j] = drand48(); }
}
fprintf(stderr,"\nRandomly generated matrix A = \n");
matrixPrint(a,N,N);
ludcmp(a,N,indx);
for(j=0;j<N;j++) {
for(i=0;i<N;i++) { col[i]=0.0; }
col[j]=1.0;
lubksb(a,N,indx,col);
for(i=0;i<N;i++) { y[i*N+j]=col[i]; }
}
fprintf(stderr,"\nResult of LU/BackSub is inv(A) :\n");
matrixPrint(y,N,N);
for (i=0; i<N; i++) {
for (j=0;j<N;j++) {
sum = 0;
for (k=0; k<N; k++) { sum += y[i*N+k] * aa[k*N+j]; }
res[i*N+j] = sum;
}
}
fprintf(stderr,"\nResult of inv(A).A = (should be 1):\n");
matrixPrint(res,N,N);
return(0);
}

Resources