For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't so good... here's the code:
int mod(int a, int b)
{
// calculate mod to get the correct index with periodic padding
int r = a % b;
return r < 0 ? r + b : r;
}
void convolve3D(const double *image, const double *kernel, const int imageDimX, const int imageDimY, const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift, yShift, zShift;
    int outIndex, outI, outJ, outK;
    int imageIndex = 0, kernelIndex = 0;
    // Loop through each voxel (result is assumed to be zero-initialized)
    for (k = 0; k < imageDimZ; k++) {
        for (j = 0; j < imageDimY; j++) {
            for (i = 0; i < imageDimX; i++) {
                kernelIndex = 0;
                // for each voxel, loop through each kernel coefficient
                for (n = 0; n < kernelDimZ; n++) {
                    for (m = 0; m < kernelDimY; m++) {
                        for (l = 0; l < kernelDimX; l++) {
                            // find the index of the corresponding voxel in the output image
                            xShift = l - kernelCenterX;
                            yShift = m - kernelCenterY;
                            zShift = n - kernelCenterZ;
                            outI = mod(i - xShift, imageDimX);
                            outJ = mod(j - yShift, imageDimY);
                            outK = mod(k - zShift, imageDimZ);
                            outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
                            // calculate and add
                            result[outIndex] += kernel[kernelIndex] * image[imageIndex];
                            kernelIndex++;
                        }
                    }
                }
                imageIndex++;
            }
        }
    }
}
By convention, all the matrices (image, kernel, result) are stored in column-major order, which is why I loop through them this way, so that neighboring accesses are closer in memory (I've heard this helps).
I know the implementation is very naive, but since it's written in C, I was hoping the performance would be good; instead it's a little disappointing. I tested it with an image of size 100^3 and a kernel of size 10^3 (~10^9 multiply-adds in total), and it took ~7 s, which I believe is way below the capability of a typical CPU.
If possible, could you guys help me optimize this routine?
I'm open to anything that could help; there are just a few things I'd ask you to consider:
The problem I'm working with can be big (e.g. an image of size 200 by 200 by 200 with a kernel of size 50 by 50 by 50, or even larger). I understand that one way to optimize this is to convert the problem into a matrix multiplication and use the BLAS GEMM routine, but I'm afraid memory could not hold such a big matrix.
Due to the nature of the problem, I would prefer direct convolution over FFT convolution, since my model was developed with direct convolution in mind, and my impression is that FFT convolution gives slightly different results than direct convolution, especially for rapidly changing images, a discrepancy I'm trying to avoid.
That said, I'm in no way an expert in this, so if you have a great implementation based on FFT convolution, and/or my impression of FFT convolution is totally biased, I would really appreciate your help.
The input images are assumed to be periodic, so periodic padding is necessary.
I understand that utilizing BLAS/SIMD or other lower-level approaches would definitely help a lot here, but since I'm a newbie I don't really know where to start. I would really appreciate it if you could point me in the right direction if you have experience with these libraries.
Thanks a lot for your help, and please let me know if you need more info about the nature of the problem.
As a first step, replace your mod(i - xShift, imageDimX) with something like this:
inline int clamp( int x, int size )
{
if( x < 0 ) return x + size;
if( x >= size ) return x - size;
return x;
}
These branches are very predictable because they yield the same result for long runs of consecutive elements, while integer modulo is relatively slow. (The simple single-step wrap is valid here because i - xShift is never more than one period outside [0, size).)
Now, the next step (ordered by cost/benefit) is parallelizing. If you have any modern C or C++ compiler, just enable OpenMP somewhere in the project settings. After that you need two changes.
Decorate your outermost loop with something like this: #pragma omp parallel for schedule(guided)
Move your function-level variables inside that loop. This also means you'll have to compute the initial imageIndex from k for each iteration.
Next option: rework your code so you only write each output value once. Compute the final value in your innermost 3 loops, reading from scattered locations in both image and kernel, and write the result only once. With that result[outIndex] += in the inner loop, the CPU stalls waiting for data from memory. When you accumulate in a variable that lives in a register rather than memory, there's no access latency.
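For illustration, here is a minimal sketch of that gather-style rewrite, combined with the OpenMP pragma from above. The name convolve3D_gather is hypothetical; it assumes the corrected kernelDim* parameter names from the question and that the kernel is no larger than the image (so the single-wrap clamp() is valid):

void convolve3D_gather(const double *image, const double *kernel,
                       const int imageDimX, const int imageDimY, const int imageDimZ,
                       const int kernelDimX, const int kernelDimY, const int kernelDimZ,
                       double *result)
{
    const int kernelCenterX = (kernelDimX - 1) / 2;
    const int kernelCenterY = (kernelDimY - 1) / 2;
    const int kernelCenterZ = (kernelDimZ - 1) / 2;
    #pragma omp parallel for schedule(guided)
    for (int k = 0; k < imageDimZ; k++) {
        for (int j = 0; j < imageDimY; j++) {
            for (int i = 0; i < imageDimX; i++) {
                double acc = 0.0;      /* accumulates in a register, no memory latency */
                int kernelIndex = 0;
                for (int n = 0; n < kernelDimZ; n++) {
                    const int srcK = clamp(k + n - kernelCenterZ, imageDimZ);
                    for (int m = 0; m < kernelDimY; m++) {
                        const int srcJ = clamp(j + m - kernelCenterY, imageDimY);
                        for (int l = 0; l < kernelDimX; l++) {
                            const int srcI = clamp(i + l - kernelCenterX, imageDimX);
                            acc += kernel[kernelIndex++]
                                 * image[(srcK * imageDimY + srcJ) * imageDimX + srcI];
                        }
                    }
                }
                /* each output voxel is written exactly once */
                result[(k * imageDimY + j) * imageDimX + i] = acc;
            }
        }
    }
}

With this structure the kernel is read sequentially, the output writes are fully sequential, and no two threads ever write the same element, which also makes the OpenMP parallelization safe.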
SIMD is the most complicated optimization here. In short, you'll want the maximum FMA width your hardware has (if you have AVX and need double precision, that width is 4), and you'll also need multiple independent accumulators in your 3 innermost loops, to avoid being limited by latency instead of saturating the throughput. Here's my answer to a much easier problem as an example of what I mean.
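For a flavor of what that looks like, here is a hedged, standalone sketch (not the linked answer) of a double-precision sum of products with four independent AVX/FMA accumulators. It assumes the length is divisible by 16 and compilation with FMA support (e.g. -mavx2 -mfma):

#include <immintrin.h>
#include <stddef.h>

/* Sum of x[i]*y[i] over a contiguous run; four accumulators hide FMA latency. */
static double dot_fma(const double *x, const double *y, size_t n)
{
    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 16) {   /* n assumed divisible by 16 */
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i),      _mm256_loadu_pd(y + i),      acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 4),  _mm256_loadu_pd(y + i + 4),  acc1);
        acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 8),  _mm256_loadu_pd(y + i + 8),  acc2);
        acc3 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 12), _mm256_loadu_pd(y + i + 12), acc3);
    }
    acc0 = _mm256_add_pd(_mm256_add_pd(acc0, acc1), _mm256_add_pd(acc2, acc3));
    double t[4];
    _mm256_storeu_pd(t, acc0);
    return t[0] + t[1] + t[2] + t[3];
}

Applying the same idea to the convolution requires the inner kernel row to be read contiguously, which the gather-style rewrite above already provides.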
I'm trying to perform operations such as multiplying 7D arrays that have 32 million elements. I have written a MEX file as I am under the impression that these operations should be quicker in C than in Matlab. However, I'm finding that the MEX file is about twice as slow as performing the operations directly in Matlab (2017b).
An example operation I want to perform is:
T8 = rand(1,1e3,2,2,2,2,2);
wsm = rand(1e3,1e3,2,2);
CM = bsxfun(@times,T8,wsm);
On my machine this takes 0.117065 seconds (I call this and other similar operations ~1000 times per run of a model, and the model is run thousands of times to optimize its parameters; these operations are making optimization prohibitively slow).
Here is the MEX file I wrote, it uses 7 for loops to access the elements of T8 and wsm by linear indexing (maybe I should be accessing the elements in a more efficient manner or avoiding for loops?):
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
mwSize i, j, k, l, m, n, o, I, J, K, L, M, N, O;
mwSize *dims,*dims1;
double *T8, *wsm, *CM;
T8 = mxGetPr(prhs[0]);
wsm = mxGetPr(prhs[1]);
dims = mxGetDimensions(prhs[0]);
dims1 = mxGetDimensions(prhs[1]);
dims[0] = dims1[0];
I = dims[0];
J = dims[1];
K = dims[2];
L = dims[3];
M = dims[4];
N = dims[5];
O = dims[6];
plhs[0] = mxCreateNumericArray(7,dims,mxDOUBLE_CLASS,mxREAL);
CM = mxGetPr(plhs[0]);
for( o=0; o<O; o++ ) {
for( n=0; n<N; n++ ) {
for( m=0; m<M; m++ ) {
for( l=0; l<L; l++ ) {
for( k=0; k<K; k++ ) {
for( j=0; j<J; j++ ) {
for( i=0; i<I; i++ ) {
*CM++ = T8[j + k*J + +l*J*K + m*L*J*K + n*M*L*J*K + o*N*M*L*J*K] * wsm[i + j*I + k*I*J + l*I*J*K];
}
}
}
}
}
}
}
}
When I call the above MEX file
CM = arrayProduct(T8,wsm);
it takes 0.215211 seconds (nearly twice as long).
My code was very loosely based on the code suggested here (https://uk.mathworks.com/matlabcentral/answers/210352-optimize-speed-up-a-big-and-slow-matrix-operation-with-addition-and-bsxfun).
Any suggestions as to what I can do differently to speed up my code would be greatly appreciated!
It is a big mistake to assume you can beat Matlab at trivial matrix math like this. Matlab is optimized from the beginning to perform matrix math.
There are good reasons to write MEX functions sometimes, including for performance reasons, but that's typically in cases where a pure Matlab solution is not feasible to write in an optimal way (e.g. when you would need to write lots of explicit loops).
Two major reasons why your code might be slower than the optimized matrix math already present in Matlab are:
Matlab might use multiple threads to do calculations in parallel. Your code does not, but a truly optimal solution probably would.
You may have made a mistake in the memory access pattern, leading to inferior cache hit rates.
Another way to look at this is: if Matlab can't be trusted to implement multiplication in an optimal way, would people be using it for serious math on large data sets? There are algorithms Matlab doesn't know, and sometimes those can be sped up using MEX, but multiplication is not one of them.
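To illustrate the first point: a hedged sketch of how the question's loop nest could be parallelized inside the MEX file, replacing the *CM++ running pointer with one flat, explicitly indexed loop so that iterations become independent. It assumes OpenMP is enabled in the mex compiler flags and that the compiler accepts an unsigned loop counter (OpenMP 3.0):

#include <omp.h>

/* inside mexFunction, after CM = mxGetPr(plhs[0]); */
mwSize total = I*J*K*L*M*N*O;
#pragma omp parallel for
for (mwSize idx = 0; idx < total; idx++) {
    /* recover the column-major subscripts from the flat index */
    mwSize i = idx % I;
    mwSize j = (idx / I) % J;
    mwSize k = (idx / (I*J)) % K;
    mwSize l = (idx / (I*J*K)) % L;
    mwSize m = (idx / (I*J*K*L)) % M;
    mwSize n = (idx / (I*J*K*L*M)) % N;
    mwSize o =  idx / (I*J*K*L*M*N);
    CM[idx] = T8[j + k*J + l*J*K + m*L*J*K + n*M*L*J*K + o*N*M*L*J*K]
            * wsm[i + j*I + k*I*J + l*I*J*K];
}

Whether this actually beats Matlab's own threading is exactly the question the answer above raises, so measure before committing to it.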
I'm trying to optimize some of my code in C, which is a lot bigger than the snippet below. Coming from Python, I wonder whether you can simply multiply an entire array by a number like I do below.
Evidently, it does not work the way I do it below. Is there any other way that achieves the same thing, or do I have to step through the entire array as in the for loop?
int main(void)
{
    int i;
    float data[] = {1., 2., 3., 4., 5.};
    // this fails to compile: an array is not a modifiable lvalue
    // data *= 5.0;
    // this works
    for (i = 0; i < 5; i++)
        data[i] *= 5.0;
    return 0;
}
There is no shortcut; you have to step through each element of the array.
Note, however, that in your example you may achieve a speedup by using int rather than float for both your data and your multiplier.
If you want to, you can do what you want through BLAS (Basic Linear Algebra Subprograms), which is optimised. It is not part of the C standard; it is a package you have to install yourself.
Sample code to achieve what you want:
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
int main () {
int limit =10;
float *a = calloc( limit, sizeof(float));
for ( int i = 0; i < limit ; i++){
a[i] = i;
}
cblas_sscal( limit , 0.5f, a, 1);
for ( int i = 0; i < limit ; i++){
printf("%3f, " , a[i]);
}
printf("\n");
}
The names of the functions are not obvious, but after reading the naming guidelines you can start to guess what the BLAS functions do. sscal() can be split into s for single precision and scal for scale, which means this function works on floats. The same function for double precision is called dscal().
If you need to scale a vector by a constant and add it to another, BLAS has a function for that too: saxpy(). The name breaks down the same way: s for single precision (float), and axpy for a*x + y; that is, it computes y[i] += a*x[i].
As you might guess there is a daxpy() too which works on doubles.
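A minimal usage sketch, under the same cblas.h setup as above:

#include <cblas.h>

float x[5] = {1, 2, 3, 4, 5};
float y[5] = {10, 20, 30, 40, 50};
/* y = 2.0f * x + y  ->  y becomes {12, 24, 36, 48, 60} */
cblas_saxpy(5, 2.0f, x, 1, y, 1);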
I'm afraid that, in C, you will have to use for(i = 0; i < 5; i++) data[i] *= 5.0;.
Python allows for many more "shortcuts"; in C, however, you have to access each element and manipulate the values yourself.
Using the for loop is the shortest way to accomplish what you're trying to do to the array.
EDIT: If you have a large amount of data, there are more efficient (in terms of running time) ways to multiply each value by 5. Check out loop tiling, for example.
data *= 5.0;
Here data decays to the address of the array, which is not a modifiable value.
If you want to multiply the first value in that array, dereference it with the * operator as below:
*data *= 5.0;
I'm working on a homework assignment, and I've been stuck for hours on my solution. The problem we've been given is to optimize the following code, so that it runs faster, regardless of how messy it becomes. We're supposed to use stuff like exploiting cache blocks and loop unrolling.
Problem:
//transpose a dim x dim matrix into dst by swapping all i,j with j,i
void transpose(int *dst, int *src, int dim) {
int i, j;
for(i = 0; i < dim; i++) {
for(j = 0; j < dim; j++) {
dst[j*dim + i] = src[i*dim + j];
}
}
}
What I have so far:
//attempt 1
void transpose(int *dst, int *src, int dim) {
int i, j, id, jd;
id = 0;
for(i = 0; i < dim; i++, id+=dim) {
jd = 0;
for(j = 0; j < dim; j++, jd+=dim) {
dst[jd + i] = src[id + j];
}
}
}
//attempt 2
void transpose(int *dst, int *src, int dim) {
int i, j, id;
int *pd, *ps;
id = 0;
for(i = 0; i < dim; i++, id+=dim) {
pd = dst + i;
ps = src + id;
for(j = 0; j < dim; j++) {
*pd = *ps++;
pd += dim;
}
}
}
Some ideas; please correct me if I'm wrong:
I have thought about loop unrolling, but I don't think it would help, because we don't know whether the NxN matrix has prime dimensions or not. Checking for that would add excess calculations which would just slow down the function.
Cache blocking wouldn't be very useful, because no matter what, we will access one array linearly (1, 2, 3, 4...) while the other in jumps of N. While we can get the function to abuse the cache and access the src block faster, it will still take a long time to place those values into the dst matrix.
I have also tried using pointers instead of array subscripts, but I don't think that actually speeds up the program in any way.
Any help would be greatly appreciated.
Thanks
Cache blocking can be useful. For example, say we have a cache line size of 64 bytes (which is what x86 uses these days), and a matrix large enough that it exceeds the cache size. If we transpose a 16x16 block (since sizeof(int) == 4, 16 ints fit in a cache line, assuming the matrix is aligned on a cache-line boundary), we need to load 32 cache lines from memory (16 from the source matrix and 16 from the destination matrix, before we can dirty them) and store another 16 lines (even though the stores are not sequential). In contrast, without cache blocking, transposing the equivalent 16x16 elements requires us to load 16 cache lines from the source matrix, but 16*16 = 256 cache lines to be loaded and then stored for the destination matrix.
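As a rough illustration (a sketch, not necessarily the expected homework answer), a blocked transpose might look like this; the 16x16 block matches the cache-line arithmetic above, and the bounds checks handle dimensions that aren't a multiple of the block size:

#define BLOCK 16   /* 16 ints = one 64-byte cache line */

void transpose_blocked(int *dst, int *src, int dim) {
    int i, j, ii, jj;
    for (ii = 0; ii < dim; ii += BLOCK) {
        for (jj = 0; jj < dim; jj += BLOCK) {
            int imax = (ii + BLOCK < dim) ? ii + BLOCK : dim;
            int jmax = (jj + BLOCK < dim) ? jj + BLOCK : dim;
            /* transpose one block that fits in cache */
            for (i = ii; i < imax; i++)
                for (j = jj; j < jmax; j++)
                    dst[j*dim + i] = src[i*dim + j];
        }
    }
}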
Unrolling is useful for large matrices.
You'll need some code to deal with the excess elements when the matrix size isn't a multiple of the unroll factor, but that will be outside the most critical loop, so for a large matrix it's worth it.
Regarding the direction of accesses: it may be better to read linearly and write in jumps of N, rather than vice versa. This is because read operations block the CPU, while write operations don't (up to a limit).
Other suggestions:
1. Can you use parallelization? OpenMP can help (though if you're expected to deliver single CPU performance, it's no good).
2. Disassemble the function and read it, focusing on the innermost loop. You may find things you wouldn't notice in C code.
3. Using decreasing counters (stopping at 0) might be slightly more efficient than increasing counters.
4. The compiler must assume that src and dst may alias (point to the same or overlapping memory), which limits its optimization options. If you could somehow tell the compiler that they can't overlap, it may be a great help; one way is the restrict qualifier, as sketched below.
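A minimal sketch of that signature change (requires C99; the const on src is my addition):

/* restrict promises the compiler that dst and src never overlap */
void transpose(int * restrict dst, const int * restrict src, int dim) {
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[j*dim + i] = src[i*dim + j];
}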
Messiness is not a problem, so: I would add a transposed flag to each matrix. This flag indicates whether the stored data array of a matrix is to be interpreted in normal or transposed order.
All matrix operations should receive these new flags in addition to each matrix parameter. Inside each operation, implement the code for all possible combinations of flags; perhaps macros can save redundant writing here.
In this new implementation, the matrix transposition just toggles the flag: the space and time needed for the transpose operation is constant.
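A minimal sketch of that idea (the struct and function names are hypothetical):

typedef struct {
    int *data;       /* dim x dim elements */
    int dim;
    int transposed;  /* 0: data[i*dim + j] is (i,j); 1: interpret as transposed */
} Matrix;

/* O(1) "transpose": just toggle the flag */
void transpose_flag(Matrix *m) { m->transposed = !m->transposed; }

/* element access that honors the flag */
int mat_get(const Matrix *m, int i, int j) {
    return m->transposed ? m->data[j * m->dim + i]
                         : m->data[i * m->dim + j];
}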
Just an idea how to implement unrolling:
void transpose(int *dst, int *src, int dim) {
int i, j;
const int dim1 = (dim / 4) * 4;
for(i = 0; i < dim; i++) {
for(j = 0; j < dim1; j+=4) {
dst[j*dim + i] = src[i*dim + j];
dst[(j+1)*dim + i] = src[i*dim + (j+1)];
dst[(j+2)*dim + i] = src[i*dim + (j+2)];
dst[(j+3)*dim + i] = src[i*dim + (j+3)];
}
for( ; j < dim; j++) {
dst[j*dim + i] = src[i*dim + j];
}
__builtin_prefetch (&src[(i+1)*dim], 0, 1);
}
}
Of course you should hoist computations like i*dim out of the inner loop, as you already did in your attempts.
A cache prefetch can be used for the source matrix, as shown above.
You probably know this, but declaring the loop counters as register int (telling the compiler it would be smart to keep them in registers) and making the ints unsigned may make things go a little bit faster.
I'm fairly new to C, not having had much need for anything faster than Python in most of my research. However, it turns out that recent work I've been doing requires the computation of fairly large vectors/matrices, and therefore a C+MPI solution might be in order.
Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker product of selected pairs of these vectors, and then sum these Kronecker products.
The question is, how do I do this efficiently? Is there anything wrong with the following structure of code, using nested for loops to obtain the effect?
The function kron described below takes vectors A and B of length vector_size and computes their Kronecker product, which it stores in C, a vector_size*vector_size matrix.
void kron(int *A, int *B, int *C, int vector_size) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] = A[i] * B[j];
}
}
return;
}
This seems fine to me, and it certainly (barring some silly syntax error) produces the right result, but I have a sneaking suspicion that nested for loops are not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.
Thank you for your patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy for this query.
Since your loop bodies are all completely independent, there is certainly a way to accelerate this. The easiest would be to take advantage of several cores before thinking about MPI. OpenMP should do quite fine on this.
#pragma omp parallel for
for (int i = 0; i < vector_size; i++) {
    for (int j = 0; j < vector_size; j++) {
        C[i * vector_size + j] = A[i] * B[j];
    }
}
This is supported by many compilers nowadays.
You could also try to hoist some common expressions out of the inner loop, but decent compilers, e.g. gcc, icc or clang, should do this quite well all by themselves:
#pragma omp parallel for
for (int i = 0; i < vector_size; ++i) {
    int const x = A[i];
    int *vec = &C[i * vector_size];
    for (int j = 0; j < vector_size; ++j) {
        vec[j] = x * B[j];
    }
}
BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.
For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one at a time, since they are all on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which is what your Kronecker products of vectors amount to) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.
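For illustration, a hedged sketch of the DGER route through the standard CBLAS interface; kron_add is a hypothetical wrapper and assumes double data in row-major order:

#include <cblas.h>

/* C (n x n, row-major) += alpha * A * B^T, i.e. one scaled outer product */
void kron_add(const double *A, const double *B, double *C, int n, double alpha)
{
    cblas_dger(CblasRowMajor, n, n, alpha, A, 1, B, 1, C, n);
}

Calling it repeatedly with alpha = 1.0 accumulates the sum of outer products directly into C.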
If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);
The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.
Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding their product to some 'running total' matrix C, the following modified version of the above kron function
void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {
    int i, j;
    for (i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i * vector_size + j] += alpha * A[i] * B[j];
        }
    }
    return;
}
precisely corresponds to the BLAS DGER operation (accessible as gsl_blas_dger), functionally speaking. The initial kron function is then DGER with alpha = 1 and C being a zero-initialized matrix of the correct dimensionality.
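For reference, a minimal sketch of the GSL calls under those assumptions (double data; n is whatever your vector dimensionality is):

#include <gsl/gsl_blas.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

const size_t n = 1000;                     /* e.g.; ~40k in the question */
gsl_matrix *C = gsl_matrix_calloc(n, n);   /* zeroed running total */
gsl_vector *a = gsl_vector_alloc(n);
gsl_vector *b = gsl_vector_alloc(n);
/* ... fill a and b for each selected pair, then accumulate: */
gsl_blas_dger(1.0, a, b, C);               /* C += 1.0 * a b^T */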
It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!
This is a common enough problem in numerical computing circles that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).
You could probably even find a Python binding to it, so you can get rid of C.
All of the above is (probably) going to be faster than code written strictly in Python. If you need more speed than that, I'd suggest a couple of things:
Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.
Another optimisation that is easy to implement: if you know that the inner dimension of your arrays is divisible by n, then add n assignment statements to the body of the loop, reducing the number of necessary iterations, with corresponding changes to the loop counting.
This strategy can be generalised by using a switch statement around the outer loop, with cases for array sizes divisible by two, three, four and five, or whatever is most common. This can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling).
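A hedged sketch of that unrolling idea applied to the kron inner loop, assuming vector_size is divisible by four (otherwise add a remainder loop as described):

for (i = 0; i < vector_size; i++) {
    const int a = A[i];
    int *row = &C[i * vector_size];
    for (j = 0; j < vector_size; j += 4) {   /* vector_size % 4 == 0 assumed */
        row[j]     = a * B[j];
        row[j + 1] = a * B[j + 1];
        row[j + 2] = a * B[j + 2];
        row[j + 3] = a * B[j + 3];
    }
}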
Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:
int i, j;
for(i = 0; i < vector_size; i++) {
int d = *A++;
int *e = B;
for (j = 0; j < vector_size; j++) {
*C++ = *e++ * d;
}
}
This also avoids reading the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable, since it mutates the pointers, but it would still work with loop unrolling.)
To solve your problem, I think you should try Eigen 3; it's a C++ library that provides all these matrix operations!
If you have time, go and have a look at its documentation! =)
Good luck!
#include <stdint.h>
#include <stdlib.h>

/* Kronecker product of two row-major matrices A (rA x cA) and B (rB x cB),
   stored into C ((rA*rB) x (cA*cB)). */
int main(void)
{
    uint32_t rA = 3, cA = 5, lda = cA;
    uint32_t rB = 5, cB = 3, ldb = cB;
    uint32_t rC = rA*rB, cC = cA*cB, ldc = cC;

    double *A = malloc(rA*cA*sizeof(double));
    double *B = malloc(rB*cB*sizeof(double));
    double *C = malloc(rC*cC*sizeof(double));

    for (uint32_t i = 0, allA = rA*cA; i < allA; i++)
        A[i] = i;
    for (uint32_t i = 0, allB = rB*cB; i < allB; i++)
        B[i] = i;
    for (uint32_t i = 0, allC = rC*cC; i < allC; i++)
        C[i] = 0;

    /* element (i/lda, i%lda) of A times element (j/ldb, j%ldb) of B lands in
       row (i/lda)*rB + j/ldb, column (i%lda)*cB + j%ldb of C */
    for (uint32_t i = 0, allA = rA*cA; i < allA; i++)
        for (uint32_t j = 0, allB = rB*cB; j < allB; j++)
            C[((i/lda)*rB + j/ldb)*ldc + (i%lda)*cB + j%ldb] = A[i]*B[j];

    free(A);
    free(B);
    free(C);
    return 0;
}