I'm trying to optimize some of my code in C, which is a lot bigger than the snippet below. Coming from Python, I wonder whether you can simply multiply an entire array by a number like I do below.
Evidently, it does not work the way I do it below. Is there any other way that achieves the same thing, or do I have to step through the entire array as in the for loop?
void main()
{
int i;
float data[] = {1.,2.,3.,4.,5.};
//this fails
data *= 5.0;
//this works
for(i = 0; i < 5; i++) data[i] *= 5.0;
}
There is no short-cut you have to step through each element of the array.
Note however that in your example, you may achieve a speedup by using int rather than float for both your data and multiplier.
If you want to, you can do what you want through BLAS, Basic Linear Algebra Subprograms, which is optimised. This is not in the C standard, it is a package which you have to install yourself.
Sample code to achieve what you want:
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
int main () {
int limit =10;
float *a = calloc( limit, sizeof(float));
for ( int i = 0; i < limit ; i++){
a[i] = i;
}
cblas_sscal( limit , 0.5f, a, 1);
for ( int i = 0; i < limit ; i++){
printf("%3f, " , a[i]);
}
printf("\n");
}
The names of the functions is not obvious, but reading the guidelines you might start to guess what BLAS functions does. sscal() can be split into s for single precision and scal for scale, which means that this function works on floats. The same function for double precision is called dscal().
If you need to scale a vector with a constant and adding it to another, BLAS got a function for that too:
saxpy()
s a x p y
float a*x + y
y[i] += a*x
As you might guess there is a daxpy() too which works on doubles.
I'm afraid that, in C, you will have to use for(i = 0; i < 5; i++) data[i] *= 5.0;.
Python allows for so many more "shortcuts"; however, in C, you have to access each element and then manipulate those values.
Using the for-loop would be the shortest way to accomplish what you're trying to do to the array.
EDIT: If you have a large amount of data, there are more efficient (in terms of running time) ways to multiply 5 to each value. Check out loop tiling, for example.
data *= 5.0;
Here data is address of array which is constant.
if you want to multiply the first value in that array then use * operator as below.
*data *= 5.0;
Related
For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't so good... here's the code:
int mod(int a, int b)
{
// calculate mod to get the correct index with periodic padding
int r = a % b;
return r < 0 ? r + b : r;
}
void convolve3D(const double *image, const double *kernel, const int imageDimX, const int imageDimY, const int imageDimZ, const int stencilDimX, const int stencilDimY, const int stencilDimZ, double *result)
{
int imageSize = imageDimX * imageDimY * imageDimZ;
int kernelSize = kernelDimX * kernelDimY * kernelDimZ;
int i, j, k, l, m, n;
int kernelCenterX = (kernelDimX - 1) / 2;
int kernelCenterY = (kernelDimY - 1) / 2;
int kernelCenterZ = (kernelDimZ - 1) / 2;
int xShift,yShift,zShift;
int outIndex, outI, outJ, outK;
int imageIndex = 0, kernelIndex = 0;
// Loop through each voxel
for (k = 0; k < imageDimZ; k++){
for ( j = 0; j < imageDimY; j++) {
for ( i = 0; i < imageDimX; i++) {
stencilIndex = 0;
// for each voxel, loop through each kernel coefficient
for (n = 0; n < kernelDimZ; n++){
for ( m = 0; m < kernelDimY; m++) {
for ( l = 0; l < kernelDimX; l++) {
// find the index of the corresponding voxel in the output image
xShift = l - kernelCenterX;
yShift = m - kernelCenterY;
zShift = n - kernelCenterZ;
outI = mod ((i - xShift), imageDimX);
outJ = mod ((j - yShift), imageDimY);
outK = mod ((k - zShift), imageDimZ);
outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
// calculate and add
result[outIndex] += stencil[stencilIndex]* image[imageIndex];
stencilIndex++;
}
}
}
imageIndex ++;
}
}
}
}
by convention, all the matrices (image, kernel, result) are stored in column-major fashion, and that's why I loop through them in such way so they are closer in memory (heard this would help).
I know the implementation is very naive, but since it's written in C, I was hoping the performance would be good, but instead it's a little disappointing. I tested it with image of size 100^3 and kernel of size 10^3 (Total ~1GFLOPS if only count the multiplication and addition), and it took ~7s, which I believe is way below the capability of a typical CPU.
If possible, could you guys help me optimize this routine?
I'm open to anything that could help, with just a few things if you could consider:
The problem I'm working with could be big (e.g. image of size 200 by 200 by 200 with kernel of size 50 by 50 by 50 or even larger). I understand that one way of optimizing this is by converting this problem into a matrix multiplication problem and use the blas GEMM routine, but I'm afraid memory could not hold such a big matrix
Due to the nature of the problem, I would prefer direct convolution instead of FFTConvolve, since my model is developed with direct convolution in mind, and my impression of FFT convolve is that it gives slightly different result than direct convolve especially for rapidly changing image, a discrepancy I'm trying to avoid.
That said, I'm in no way an expert in this. so if you have a great implementation based on FFTconvolve and/or my impression on FFT convolve is totally biased, I would really appreciate if you could help me out.
The input images are assumed to be periodic, so periodic padding is necessary
I understand that utilizing blas/SIMD or other lower level ways would definitely help a lot here. but since I'm a newbie here I dont't really know where to start... I would really appreciate if you help pointing me to the right direction if you have experience in these libraries,
Thanks a lot for your help, and please let me know if you need more info about the nature of the problem
As a first step, replace your mod ((i - xShift), imageDimX) with something like this:
inline int clamp( int x, int size )
{
if( x < 0 ) return x + size;
if( x >= size ) return x - size;
return x;
}
These branches are very predictable because they yield same results for very large count of consecutive elements. Integer modulo is relatively slow.
Now, next step (ordered by cost/profit) is going to be parallelizing. If you have any modern C++ compiler, just enable OpenMP somewhere in project settings. After that you need 2 changes.
Decorate your very outer loop with something like this: #pragma omp parallel for schedule(guided)
Move your function-level variables within that loop. This also means you’ll have to compute initial imageIndex from your k, for each iteration.
Next option, rework your code so you only write each output value once. Compute the final value in your innermost 3 loops, reading from random locations from both image and kernel, and only write the result once. When you have that result[outIndex] += in the inner loop, CPU stalls waiting for the data from memory. When you accumulate in a variable that’s a register not memory, there’s no access latency.
SIMD is the most complicated optimization for that. But in short, you’ll need maximum width of the FMA your hardware has (if you have AVX and need double precision, that width is 4), and you’ll also need multiple independent accumulators for your 3 innermost loops, to avoid hitting the latency as opposed to saturating the throughput. Here’s my answer to much easier problem as an example what I mean.
I am solving the system of linear algebraic equations Ax = b by using Jacobian method but by taking manual inputs. I want to analyze the performance of the solver for large system. Is there any method to generate matrix A i.e non singular?
I am attaching my code here.`
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define TOL = 0.0001
void main()
{
int size,i,j,k = 0;
printf("\n enter the number of equations: ");
scanf("%d",&size);
double reci = 0.0;
double *x = (double *)malloc(size*sizeof(double));
double *x_old = (double *)malloc(size*sizeof(double));
double *b = (double *)malloc(size*sizeof(double));
double *coeffMat = (double *)malloc(size*size*sizeof(double));
printf("\n Enter the coefficient matrix: \n");
for(i = 0; i < size; i++)
{
for(j = 0; j < size; j++)
{
printf(" coeffMat[%d][%d] = ",i,j);
scanf("%lf",&coeffMat[i*size+j]);
printf("\n");
//coeffMat[i*size+j] = 1.0;
}
}
printf("\n Enter the b vector: \n");
for(i = 0; i < size; i++)
{
x[i] = 0.0;
printf(" b[%d] = ",i);
scanf("%lf",&b[i]);
}
double sum = 0.0;
while(k < size)
{
for(i = 0; i < size; i++)
{
x_old[i] = x[i];
}
for(i = 0; i < size; i++)
{
sum = 0.0;
for(j = 0; j < size; j++)
{
if(i != j)
{
sum += (coeffMat[i * size + j] * x_old[j] );
}
}
x[i] = (b[i] -sum) / coeffMat[i * size + i];
}
k = k+1;
}
printf("\n Solution is: ");
for(i = 0; i < size; i++)
{
printf(" x[%d] = %lf \n ",i,x[i]);
}
}
This is all a bit Heath Robinson, but here's what I've used. I have no idea how 'random' such matrices all, in particular I don't know what distribution they follow.
The idea is to generate the SVD of the matrix. (Called A below, and assumed nxn).
Initialise A to all 0s
Then generate n positive numbers, and put them, with random signs, in the diagonal of A. I've found it useful to be able to control the ratio of the largest of these positive numbers to the smallest. This ratio will be the condition number of the matrix.
Then repeat n times: generate a random n vector f , and multiply A on the left by the Householder reflector I - 2*f*f' / (f'*f). Note that this can be done more efficiently than by forming the reflector matrix and doing a normal multiplication; indeed its easy to write a routine that given f and A will update A in place.
Repeat the above but multiplying on the right.
As for generating test data a simple way is to pick an x0 and then generate b = A * x0. Don't expect to get exactly x0 back from your solver; even if it is remarkably well behaved you'll find that the errors get bigger as the condition number gets bigger.
Talonmies' comment mentions http://www.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-658.pdf which is probably the right approach (at least in principle, and in full generality).
However, you are probably not handling "very large" matrixes (e.g. because your program use naive algorithms, and because you don't run it on a large supercomputer with a lot of RAM). So the naive approach of generating a matrix with random coefficients and testing afterwards that it is non-singular is probably enough.
Very large matrixes would have many billions of coefficients, and you need a powerful supercomputer with e.g. terabytes of RAM. You probably don't have that, if you did, your program probably would run too long (you don't have any parallelism), might give very wrong results (read http://floating-point-gui.de/ for more) so you don't care.
A matrix of a million coefficients (e.g. 1024*1024) is considered small by current hardware standards (and is more than enough to test your code on current laptops or desktops, and even to test some parallel implementations), and generating randomly some of them (and computing their determinant to test that they are not singular) is enough, and easily doable. You might even generate them and/or check their regularity with some external tool, e.g. scilab, R, octave, etc. Once your program computed a solution x0, you could use some tool (or write another program) to compute Ax0 - b and check that it is very close to the 0 vector (there are some cases where you would be disappointed or surprised, since round-off errors matter).
You'll need some good enough pseudo random number generator perhaps as simple as drand48(3) which is considered as nearly obsolete (you should find and use something better); you could seed it with some random source (e.g. /dev/urandom on Linux).
BTW, compile your code with all warnings & debug info (e.g. gcc -Wall -Wextra -g). Your #define TOL = 0.0001 is probably wrong (should be #define TOL 0.0001 or const double tol = 0.0001;). Use the debugger (gdb) & valgrind. Add optimizations (-O2 -mcpu=native) when benchmarking. Read the documentation of every used function, notably those from <stdio.h>. Check the result count from scanf... In C99, you should not cast the result of malloc, but you forgot to test against its failure, so code:
double *b = malloc(size*sizeof(double));
if (!b) {perror("malloc b"); exit(EXIT_FAILURE); };
You'll rather end, not start, your printf control strings with \n because stdout is often (not always!) line buffered. See also fflush.
You probably should read also some basic linear algebra textbook...
Notice that actually writing robust and efficient programs to invert matrixes or to solve linear systems is a difficult art (which I don't know at all : it has programming issues, algorithmic issues, and mathematical issues; read some numerical analysis book). You can still get a PhD and spend your whole life working on that. Please understand that you need ten years to learn programming (or many other things).
I don't know C well at all, and I'm trying to edit someone's code, but I'm having issues when trying to convert values from the log to linear domains.
For example, let's say we have an array A that is full of log values equal to -100 dB, i.e.
float A[100];
int i;
for( i=0; i<100; i++ )
A[i] = -100;
What I want to do is find the average of all the values (which clearly is -100), but by taking the average in the linear and not log domain, i.e.
float tmp_avg = 0.0;
float avg;
int count = 0;
for( i=0; i<100; i++ ) {
tmp_avg += pow(10.0, A[i]/10.0);
count++;
}
avg = 10*log10(tmp_avg / count);
However, the result I'm getting is all 0's. Now the code I'm working on is much more complex than this, but I was wondering if there's anything obvious that I'm missing as to why this won't work.
One thought I had is that 10^(-100/10) is a very small value (1e-10), and perhaps too small to be accurately defined as a float. I've tried making it a double instead, but I still get a result of all 0's.
Thanks!
Just figured out what the problem was: I needed to include the math.h library at the top of the program:
#include <math.h>
Without that, I believe that there was no reference for the log10 function, which in turn caused the result to be all 0's. I now include math.h and everything seems to be working fine.
I have a code, that was not made by me.
In this complex code, many rules are being applied to calculate a quantity, d(x). in the code is being used a pointer to calculate it.
I want to calculate an integral over this, like:
W= Int_0 ^L d(x) dx ?
I am doing this:
#define DX 0.003
void WORK(double *d, double *W)
{
double INTE5=0.0;
int N_X_POINTS=333;
double h=((d[N_X_POINTS]-d[0])/N_X_POINTS);
W[0]=W[0]+((h/2)*(d[1]+2.0*d[0]+d[N_X_POINTS-1])); /*BC*/
for (i=1;i<N_X_POINTS-1;i++)
{
W[i]=W[i]+((h/2)*(d[0]+2*d[i]+d[N_X_POINTS]))*DX;
INTE5+=W[i];
}
W[N_X_POINTS-1]=W[N_X_POINTS-1]+((h/2)*(d[0]+2.0*d[N_X_POINTS-1]+d[N_X_POINTS-2])); /*BC*/
}
And I am getting "Segmentation fault". I was wondering to know if, I am doing right in calculate W as a pointer, or should declare it as a simple double? I guess the Segmentation fault is coming for this.
Other point, am I using correctly the trapezoidal rule?
Any help/tip, will very much appreciate.
Luiz
I don't know where that code come from, but it is a lot ugly and has some limits hard-encoded (333 points and increment by 0.003). To use it you need to "sample" properly your function and generate pairs (x, f(x))...
A possible clearer solution to your problem is here.
Let us consider you function and let us suppose it works (I believe it does't, it's a really obscure code...; e.g. when you integrate a function, you expect a number as result; where's this number? Maybe INTE5? It is not given back... and if it is so, why the final update of the W array? It's useless, or maybe we have something meaningful into W?). How does would you use it?
The prototype
void WORK(double *d, double *W);
means the WORK wants two pointers. What these pointers must be depends on the code; a look at it suggests that indeed you need two arrays, with N_X_POINTS elements each. The code reads from and writes into array W, and reads only from d. The N_X_POINTS int is 333, so you need to pass to the function arrays of at least 333 doubles:
double d[333];
double W[333];
Then you have to fill them properly. I thought you need to fill them with (x, f(x)), sampling the function with a proper step. But of course this makes no too much sense. Already said that the code is obscure (now I don't want to try to reverse engineering the intention of the coder...).
Anyway, if you call it with WORK(d, W), you won't get seg fault, since the arrays are big enough. The result will be wrong, but this is harder to track (again, sorry, no "reverse engineering" for it).
Final note (from comments too): if you have double a[N], then a has type double *.
A segmentation fault error often happens in C when you try to access some part of memory that you shouldn't be accessing. I suspect that the expression d[N_X_POINTS] is the culprit (because arrays in C are zero-indexed), but without seeing the definition of d I can't be sure.
Try putting informative printf debugging statements before/after each line of code in your function so you can narrow down the possible sources of the problem.
Here's a simple program that integrates $f(x) = x^2$ over the range [0..10]. It should send you in the right direction.
#include <stdio.h>
#include <stdlib.h>
double int_trapezium(double f[], double dX, int n)
{
int i;
double sum;
sum = (f[0] + f[n-1])/2.0;
for (i = 1; i < n-1; i++)
sum += f[i];
return dX*sum;
}
#define N 1000
int main()
{
int i;
double x;
double from = 0.0;
double to = 10.0;
double dX = (to-from)/(N-1);
double *f = malloc(N*sizeof(*f));
for (i=0; i<N; i++)
{
x = from + i*dX*(to-from);
f[i] = x*x;
}
printf("%f\n", int_trapezium(f, dX, N));
free(f);
return 0;
}
I'm fairly new to C, not having much need to anything faster than python for most of my research. However, it turns out that recent work I've been doing required the computation of fairly large vectors/matrices, and there therefore a C+MPI solution might be in order.
Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker Product of selected pairs of these vectors, and then sum these kronecker products.
The question is, how to do this efficiently? Is there anything wrong with the following structure of code, using for loops, or obtain the effect?
The function kron described below passes vectors A and B of lengths vector_size, and computes their kronecker product, which it stores in C, a vector_size*vector_size matrix.
void kron(int *A, int *B, int *C, int vector_size) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] = A[i] * B[j];
}
}
return;
}
This seems fine to me, and certainly (if I've not made some silly syntax error) produce the right result, but I have a sneaking suspicion that embedded for loops is not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.
I thank you for you patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy for this query.
Since your loop bodies are all completely independent, there is certainly a way to accelerate this. Easiest would be already to take advantage of several cores before thinking of MPI. OpenMP should do quite fine on this.
#pragma omp parallel for
for(int i = 0; i < vector_size; i++) {
for (int j = 0; j < vector_size; j++) {
C[i][j] = A[i] * B[j];
}
}
This is supported by many compilers nowadays.
You could also try to drag some common expressions out of the inner loop but decent compilers e.g gcc, icc or clang should do this quite well all by themselves:
#pragma omp parallel for
for(int i = 0; i < vector_size; ++i) {
int const x = A[i];
int * vec = &C[i][0];
for (int j = 0; j < vector_size; ++j) {
vec[j] = x * B[j];
}
}
BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.
For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one-at-a-time, since they are all on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which you can treat the Kronecker products as) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.
If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);
The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.
Solution found (thanks to #Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding them to some 'running total' vector/matrix C, the following modified version of the above kron function
void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] = alpha * A[i] * B[j];
}
}
return;
}
precisely corresponds to the BLAS DGER function (accessible as gsl_blas_dger), functionally speaking. The initial kron function is DGER with alpha = 0 and C being an uninitialised (zeroed) matrix/vector of the correct dimensionality.
It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!
This is a common enough problem in numerical computational circles, that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).
You could probably even find a python binding to it, so you can get rid of C.
All of the above is (probably) going to be faster than code written strictly in python. If you need more speed than that, I'd suggest a couple of things:
Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.
Another optimisation that would be easy to implement is that if you know that the inner dimension of your arrays will be divisible by n then add n assignment statements to the body of the loop, reducing the number of necessary iterations, with corresponding changes to the loop counting.
This strategy can be generalised by using a switch statement around the outer loop with cases for array sizes divisible by two, three, four and five, or whatever is most common. This can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling).
Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:
int i, j;
for(i = 0; i < vector_size; i++) {
int d = *A++;
int *e = B;
for (j = 0; j < vector_size; j++) {
*C++ = *e++ * d;
}
}
This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable since it alters the value of the pointers, but would still work with loop unrolling.)
To solve your problem, I think you should try to use Eigen 3, it's a C++ library which use all matrix functions!
If you have time, go to see its documentation! =)
Good luck !
uint32_t rA = 3;
uint32_t cA = 5;
uint32_t lda = cA;
uint32_t rB = 5;
uint32_t cB = 3;
uint32_t ldb = cB;
uint32_t rC = rA*rB;
uint32_t cC = cA*cB;
uint32_t ldc = cC;
double *A = (double *)malloc(rA*cA*sizeof(double));
double *B = (double *)malloc(rB*cB*sizeof(double));
double *C = (double *)malloc(rC*cC*sizeof(double));
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
A[i]=i;
for (uint32_t i=0, allB=rB*cB; i<allB; i++)
B[i]=i;
for (uint32_t i=0, allC=rC*cC; i<allC; i++)
C[i]=0;
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
{
for (uint32_t j=0, allB=rB*cB; j<allB; j++)
C[((i/lda)*rB+j/ldb)*ldc
+ (i%lda)*cB+j%ldb ]=A[i]*B[j];
}