Optimization of 3D Direct Convolution Implementation in C

For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't so good... here's the code:
int mod(int a, int b)
{
    // calculate mod to get the correct index with periodic padding
    int r = a % b;
    return r < 0 ? r + b : r;
}
void convolve3D(const double *image, const double *kernel, const int imageDimX, const int imageDimY, const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
    // note: result is accumulated into, so the caller must zero it first
    int imageSize = imageDimX * imageDimY * imageDimZ;
    int kernelSize = kernelDimX * kernelDimY * kernelDimZ;
    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift, yShift, zShift;
    int outIndex, outI, outJ, outK;
    int imageIndex = 0, kernelIndex = 0;
    // Loop through each voxel
    for (k = 0; k < imageDimZ; k++) {
        for (j = 0; j < imageDimY; j++) {
            for (i = 0; i < imageDimX; i++) {
                kernelIndex = 0;
                // for each voxel, loop through each kernel coefficient
                for (n = 0; n < kernelDimZ; n++) {
                    for (m = 0; m < kernelDimY; m++) {
                        for (l = 0; l < kernelDimX; l++) {
                            // find the index of the corresponding voxel in the output image
                            xShift = l - kernelCenterX;
                            yShift = m - kernelCenterY;
                            zShift = n - kernelCenterZ;
                            outI = mod((i - xShift), imageDimX);
                            outJ = mod((j - yShift), imageDimY);
                            outK = mod((k - zShift), imageDimZ);
                            outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
                            // calculate and add
                            result[outIndex] += kernel[kernelIndex] * image[imageIndex];
                            kernelIndex++;
                        }
                    }
                }
                imageIndex++;
            }
        }
    }
}
By convention, all the matrices (image, kernel, result) are stored in column-major fashion, which is why I loop through them in this order, so that consecutive accesses are close in memory (I heard this would help).
I know the implementation is very naive, but since it's written in C I was hoping the performance would be good; instead it's a little disappointing. I tested it with an image of size 100^3 and a kernel of size 10^3 (about 10^9 multiply-add pairs in total), and it took ~7 s, which I believe is way below the capability of a typical CPU.
If possible, could you guys help me optimize this routine?
I'm open to anything that could help, with just a few things if you could consider:
The problem I'm working with could be big (e.g. image of size 200 by 200 by 200 with kernel of size 50 by 50 by 50 or even larger). I understand that one way of optimizing this is by converting this problem into a matrix multiplication problem and use the blas GEMM routine, but I'm afraid memory could not hold such a big matrix
Due to the nature of the problem, I would prefer direct convolution instead of FFTConvolve, since my model is developed with direct convolution in mind, and my impression of FFT convolve is that it gives slightly different result than direct convolve especially for rapidly changing image, a discrepancy I'm trying to avoid.
That said, I'm in no way an expert in this. so if you have a great implementation based on FFTconvolve and/or my impression on FFT convolve is totally biased, I would really appreciate if you could help me out.
The input images are assumed to be periodic, so periodic padding is necessary
I understand that utilizing BLAS/SIMD or other lower-level approaches would definitely help a lot here, but since I'm a newbie I don't really know where to start. I would really appreciate it if you could point me in the right direction if you have experience with these libraries.
Thanks a lot for your help, and please let me know if you need more info about the nature of the problem

As a first step, replace your mod ((i - xShift), imageDimX) with something like this:
static inline int clamp( int x, int size )
{
if( x < 0 ) return x + size;
if( x >= size ) return x - size;
return x;
}
These branches are very predictable because they yield the same result for long runs of consecutive elements. Integer modulo is relatively slow. Note that this only works while the index is less than one period out of range, i.e. when the kernel is no larger than the image.
Now, the next step (ordered by cost/benefit) is parallelizing. If you have any modern C or C++ compiler, just enable OpenMP in the project settings (e.g. -fopenmp for GCC and Clang). After that you need 2 changes.
Decorate your very outer loop with something like this: #pragma omp parallel for schedule(guided)
Move your function-level variables within that loop. This also means you’ll have to compute initial imageIndex from your k, for each iteration.
Next option: rework your code so you only write each output value once. Compute the final value in your innermost 3 loops, reading from random locations in both image and kernel, and only write the result once. When you have that result[outIndex] += in the inner loop, the CPU stalls waiting for the data from memory. When you accumulate in a variable that lives in a register rather than in memory, there's no access latency.
SIMD is the most complicated optimization for that. But in short, you'll need the maximum width of the FMA your hardware has (if you have AVX and need double precision, that width is 4), and you'll also need multiple independent accumulators for your 3 innermost loops, so that you are limited by FMA throughput rather than FMA latency. Here's my answer to a much easier problem as an example of what I mean.
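The "write each output once" step can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's code: the names convolve3D_gather and wrap_index are made up here, the memory layout and kernel centering follow the question's code, and the single-branch wrap assumes the kernel is no larger than the image in each dimension.

```c
#include <stddef.h>

/* Wrap an index that is less than one period out of range.
 * Assumes -size <= x < 2*size, i.e. the kernel is no larger than the image. */
static inline int wrap_index(int x, int size)
{
    if (x < 0) return x + size;
    if (x >= size) return x - size;
    return x;
}

/* Gather form of the periodic convolution: every output voxel is
 * accumulated in a local variable and stored exactly once. */
void convolve3D_gather(const double *image, const double *kernel,
                       int imageDimX, int imageDimY, int imageDimZ,
                       int kernelDimX, int kernelDimY, int kernelDimZ,
                       double *result)
{
    const int cx = (kernelDimX - 1) / 2;
    const int cy = (kernelDimY - 1) / 2;
    const int cz = (kernelDimZ - 1) / 2;

    /* With OpenMP enabled, "#pragma omp parallel for schedule(guided)"
     * can be placed on this loop: iterations write disjoint outputs. */
    for (int k = 0; k < imageDimZ; k++) {
        for (int j = 0; j < imageDimY; j++) {
            for (int i = 0; i < imageDimX; i++) {
                double acc = 0.0;           /* local accumulator */
                const double *kp = kernel;  /* walks the kernel linearly */
                for (int n = 0; n < kernelDimZ; n++) {
                    const int z = wrap_index(k + n - cz, imageDimZ);
                    for (int m = 0; m < kernelDimY; m++) {
                        const int y = wrap_index(j + m - cy, imageDimY);
                        const double *row =
                            image + ((size_t)z * imageDimY + y) * imageDimX;
                        for (int l = 0; l < kernelDimX; l++) {
                            const int x = wrap_index(i + l - cx, imageDimX);
                            acc += *kp++ * row[x];
                        }
                    }
                }
                result[((size_t)k * imageDimY + j) * imageDimX + i] = acc;
            }
        }
    }
}
```

Unlike the scatter version in the question, this overwrites result rather than adding to it, so the output does not need to be zeroed first. It is also the natural starting point for the multiple-accumulator SIMD variant, since the innermost loop is a plain dot product.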

Related

wavelet transform opencl for loop

I want to code a wavelet transform in an OpenCL 1.0 kernel. I know how to do this in C but not in OpenCL. What I want to know is how to iterate over the image with for loops. In C I do:
for ( j = 0; j < n; j++ )
{
for ( i = 0; i < m; i++ )
{
v[i+j*m] = u[i+j*m];
}
}
With m and n the size of the image. In OpenCL I can't do this. I have just the beginning of my kernel:
__kernel void wavelet(__global float* output, __global float* input1)
{
int WIDTH = 320;
int HEIGHT = 200;
int i;
int j;
int k;
const int column = get_global_id(0);
const int row = get_global_id(1);
}
How am I supposed to code the two for loops in OpenCL?
Thank you

Each dimension of your kernel "unwraps" one for-loop into a parallel process. You have a 2D kernel, so you should need no loops at all in your kernel. Think of the row and column variables in your kernel as i and j (or j and i, depending on how you have things set up) in your C code.
It's somewhat more difficult when trying to accumulate values between different locations in the image. Each work-item runs in parallel, introducing potential race conditions. You may need one or more for-loops in your kernel to accumulate values sequentially.
Loops whose trip count is only known at run time are allowed inside a kernel, and their syntax is identical to C. You can extract the image dimensions in your kernel using get_global_size(uint dimindx).
Make sure to call clEnqueueNDRangeKernel with the right number of dimensions. You also need the global size in this call to match your image dimensions, for instance size_t global_size[2] = {w, h};. Your local_size must not exceed the device's maximum work-group size, and in OpenCL 1.x each global dimension must be a multiple of the corresponding local dimension; I like to work with size_t local_size[2] = {16, 16};. The enqueue fails outright (CL_INVALID_WORK_GROUP_SIZE) if the local size does not divide the global size. For guaranteed results, you can set local_size to {1,1}, or pass NULL and let the implementation choose.
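To make the "each dimension unwraps one for-loop" idea concrete, here is a plain-C emulation. It is not OpenCL code and uses no OpenCL API; copy_work_item and run_2d_range are hypothetical names chosen for this sketch. The first function stands in for the kernel body, the second for what the runtime does with a 2D global size.

```c
#include <stddef.h>

/* Body of a hypothetical 2D kernel: handles exactly one pixel.
 * `column` and `row` play the role of get_global_id(0) and get_global_id(1). */
static void copy_work_item(size_t column, size_t row, size_t width,
                           const float *input, float *output)
{
    output[column + row * width] = input[column + row * width];
}

/* Plain-C stand-in for a 2D global size of {width, height}: the kernel
 * body is invoked once per (column, row) pair. On a real device these
 * invocations run in parallel and in no guaranteed order. */
void run_2d_range(size_t width, size_t height,
                  const float *input, float *output)
{
    for (size_t row = 0; row < height; row++)
        for (size_t column = 0; column < width; column++)
            copy_work_item(column, row, width, input, output);
}
```

The two loops from the C version survive only in the emulated "runtime"; the kernel body itself contains none, which is the point the answer is making.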

How to generate a very large non singular matrix A in Ax = b?

I am solving the system of linear algebraic equations Ax = b using the Jacobi method, currently with manually entered inputs. I want to analyze the performance of the solver for large systems. Is there any method to generate a matrix A that is non-singular?
I am attaching my code here.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define TOL = 0.0001
void main()
{
int size,i,j,k = 0;
printf("\n enter the number of equations: ");
scanf("%d",&size);
double reci = 0.0;
double *x = (double *)malloc(size*sizeof(double));
double *x_old = (double *)malloc(size*sizeof(double));
double *b = (double *)malloc(size*sizeof(double));
double *coeffMat = (double *)malloc(size*size*sizeof(double));
printf("\n Enter the coefficient matrix: \n");
for(i = 0; i < size; i++)
{
for(j = 0; j < size; j++)
{
printf(" coeffMat[%d][%d] = ",i,j);
scanf("%lf",&coeffMat[i*size+j]);
printf("\n");
//coeffMat[i*size+j] = 1.0;
}
}
printf("\n Enter the b vector: \n");
for(i = 0; i < size; i++)
{
x[i] = 0.0;
printf(" b[%d] = ",i);
scanf("%lf",&b[i]);
}
double sum = 0.0;
while(k < size)
{
for(i = 0; i < size; i++)
{
x_old[i] = x[i];
}
for(i = 0; i < size; i++)
{
sum = 0.0;
for(j = 0; j < size; j++)
{
if(i != j)
{
sum += (coeffMat[i * size + j] * x_old[j] );
}
}
x[i] = (b[i] -sum) / coeffMat[i * size + i];
}
k = k+1;
}
printf("\n Solution is: ");
for(i = 0; i < size; i++)
{
printf(" x[%d] = %lf \n ",i,x[i]);
}
}

This is all a bit Heath Robinson, but here's what I've used. I have no idea how 'random' such matrices are; in particular I don't know what distribution they follow.
The idea is to generate the SVD of the matrix. (Called A below, and assumed nxn).
Initialise A to all 0s
Then generate n positive numbers, and put them, with random signs, in the diagonal of A. I've found it useful to be able to control the ratio of the largest of these positive numbers to the smallest. This ratio will be the condition number of the matrix.
Then repeat n times: generate a random n-vector f, and multiply A on the left by the Householder reflector I - 2*f*f' / (f'*f). Note that this can be done more efficiently than by forming the reflector matrix and doing a normal multiplication; indeed it's easy to write a routine that, given f and A, will update A in place.
Repeat the above but multiplying on the right.
As for generating test data a simple way is to pick an x0 and then generate b = A * x0. Don't expect to get exactly x0 back from your solver; even if it is remarkably well behaved you'll find that the errors get bigger as the condition number gets bigger.
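The recipe above might look like this in C. It is a sketch under several assumptions that are not in the answer: the matrix is stored row-major, the singular values are spaced linearly from 1 to cond (the answer only asks for positive numbers with a controlled ratio), rand() is used for brevity, and all function names are made up for illustration.

```c
#include <stdlib.h>

/* Uniform random number in (0, 1), built on rand() for brevity. */
static double uniform01(void)
{
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* A <- (I - 2 f f'/(f'f)) A, in place, for a row-major n x n matrix. */
static void reflect_left(double *A, const double *f, int n)
{
    double ff = 0.0;
    for (int i = 0; i < n; i++) ff += f[i] * f[i];
    for (int j = 0; j < n; j++) {
        double dot = 0.0;                      /* f' * A(:,j) */
        for (int i = 0; i < n; i++) dot += f[i] * A[i * n + j];
        const double s = 2.0 * dot / ff;
        for (int i = 0; i < n; i++) A[i * n + j] -= s * f[i];
    }
}

/* A <- A (I - 2 f f'/(f'f)), in place. */
static void reflect_right(double *A, const double *f, int n)
{
    double ff = 0.0;
    for (int j = 0; j < n; j++) ff += f[j] * f[j];
    for (int i = 0; i < n; i++) {
        double dot = 0.0;                      /* A(i,:) * f */
        for (int j = 0; j < n; j++) dot += A[i * n + j] * f[j];
        const double s = 2.0 * dot / ff;
        for (int j = 0; j < n; j++) A[i * n + j] -= s * f[j];
    }
}

/* Fill A (n x n, row-major) with a random matrix whose singular values
 * run linearly from 1 to cond (cond >= 1 assumed).
 * Returns 0 on success, -1 if the work vector cannot be allocated. */
int random_nonsingular(double *A, int n, double cond)
{
    double *f = malloc((size_t)n * sizeof *f);
    if (!f) return -1;

    for (int i = 0; i < n * n; i++) A[i] = 0.0;
    for (int i = 0; i < n; i++) {
        const double s = (n > 1) ? 1.0 + (cond - 1.0) * i / (n - 1) : 1.0;
        A[i * n + i] = (rand() & 1) ? s : -s;  /* random sign */
    }
    /* n reflections from the left, then n from the right. */
    for (int pass = 0; pass < 2 * n; pass++) {
        for (int i = 0; i < n; i++) f[i] = 2.0 * uniform01() - 1.0;
        if (pass < n) reflect_left(A, f, n);
        else          reflect_right(A, f, n);
    }
    free(f);
    return 0;
}
```

Because the reflectors are orthogonal, the singular values placed on the diagonal are unchanged, so the matrix is non-singular and its 2-norm condition number is cond up to rounding. Each reflection costs O(n^2), so the whole generator is O(n^3).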

Talonmies' comment mentions http://www.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-658.pdf which is probably the right approach (at least in principle, and in full generality).
However, you are probably not handling "very large" matrixes (e.g. because your program use naive algorithms, and because you don't run it on a large supercomputer with a lot of RAM). So the naive approach of generating a matrix with random coefficients and testing afterwards that it is non-singular is probably enough.
Very large matrixes would have many billions of coefficients and need a powerful supercomputer with e.g. terabytes of RAM. You probably don't have that; and if you did, your program would probably run too long (you don't have any parallelism) and might give very wrong results (read http://floating-point-gui.de/ for more), so you don't care.
A matrix of a million coefficients (e.g. 1024*1024) is considered small by current hardware standards (and is more than enough to test your code on current laptops or desktops, and even to test some parallel implementations), and generating randomly some of them (and computing their determinant to test that they are not singular) is enough, and easily doable. You might even generate them and/or check their regularity with some external tool, e.g. scilab, R, octave, etc. Once your program computed a solution x0, you could use some tool (or write another program) to compute Ax0 - b and check that it is very close to the 0 vector (there are some cases where you would be disappointed or surprised, since round-off errors matter).
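The residual check mentioned above (computing Ax0 - b and seeing how close it is to zero) could be written as a small helper like this. The name max_residual and the row-major layout are assumptions of this sketch; what counts as "very close to 0" depends on the matrix and is left to the caller.

```c
#include <stddef.h>

/* Largest absolute entry of A*x - b for a row-major n x n matrix A. */
double max_residual(const double *A, const double *x, const double *b, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double r = -b[i];
        for (size_t j = 0; j < n; j++)
            r += A[i * n + j] * x[j];
        if (r < 0.0) r = -r;          /* absolute value without <math.h> */
        if (r > worst) worst = r;
    }
    return worst;
}
```

A typical use would be to compare the returned value against a tolerance scaled by the size of the entries of b, rather than against a fixed constant.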
You'll need some good enough pseudo random number generator perhaps as simple as drand48(3) which is considered as nearly obsolete (you should find and use something better); you could seed it with some random source (e.g. /dev/urandom on Linux).
BTW, compile your code with all warnings & debug info (e.g. gcc -Wall -Wextra -g). Your #define TOL = 0.0001 is probably wrong (should be #define TOL 0.0001 or const double tol = 0.0001;). Use the debugger (gdb) & valgrind. Add optimizations (-O2 -march=native) when benchmarking. Read the documentation of every used function, notably those from <stdio.h>. Check the result count from scanf... In C99, you should not cast the result of malloc, but you forgot to test against its failure, so code:
double *b = malloc(size*sizeof(double));
if (!b) {perror("malloc b"); exit(EXIT_FAILURE); };
You'll rather end, not start, your printf control strings with \n because stdout is often (not always!) line buffered. See also fflush.
You probably should read also some basic linear algebra textbook...
Notice that actually writing robust and efficient programs to invert matrixes or to solve linear systems is a difficult art (which I don't know at all : it has programming issues, algorithmic issues, and mathematical issues; read some numerical analysis book). You can still get a PhD and spend your whole life working on that. Please understand that you need ten years to learn programming (or many other things).

C picture rotation optimization

This is for all you C experts out there..
The first function takes a two-dimensional matrix src[dim][dim] representing pixels of an image, and rotates it 90 degrees into a destination matrix dst[dim][dim]. The second function takes the same src[dim][dim] and smoothens the image by replacing every pixel value with the average of all the pixels around it (in a maximum of 3 × 3 window centered at that pixel).
I need to optimize the program for time and cycles; how else could I optimize the following?
void rotate(int dim, pixel *src, pixel *dst)
{
int i, j, nj;
nj = 0;
/* below are the main computations for the implementation of rotate. */
for (j = 0; j < dim; j++) {
nj = dim-1-j; /* Code Motion moved operation outside inner for loop */
for (i = 0; i < dim; i++) {
dst[RIDX(nj, i, dim)] = src[RIDX(i, j, dim)];
}
}
}
/* A struct used to compute averaged pixel value */
typedef struct {
int red;
int green;
int blue;
int num;
} pixel_sum;
/* Compute min and max of two integers, respectively */
static int minimum(int a, int b)
{ return (a < b ? a : b); }
static int maximum(int a, int b)
{ return (a > b ? a : b); }
/*
* initialize_pixel_sum - Initializes all fields of sum to 0
*/
static void initialize_pixel_sum(pixel_sum *sum)
{
sum->red = sum->green = sum->blue = 0;
sum->num = 0;
return;
}
/*
* accumulate_sum - Accumulates field values of p in corresponding
* fields of sum
*/
static void accumulate_sum(pixel_sum *sum, pixel p)
{
sum->red += (int) p.red;
sum->green += (int) p.green;
sum->blue += (int) p.blue;
sum->num++;
return;
}
/*
* assign_sum_to_pixel - Computes averaged pixel value in current_pixel
*/
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum)
{
current_pixel->red = (unsigned short) (sum.red/sum.num);
current_pixel->green = (unsigned short) (sum.green/sum.num);
current_pixel->blue = (unsigned short) (sum.blue/sum.num);
return;
}
/*
* avg - Returns averaged pixel value at (i,j)
*/
static pixel avg(int dim, int i, int j, pixel *src)
{
int ii, jj;
pixel_sum sum;
pixel current_pixel;
initialize_pixel_sum(&sum);
for(ii = maximum(i-1, 0); ii <= minimum(i+1, dim-1); ii++)
for(jj = maximum(j-1, 0); jj <= minimum(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(&current_pixel, sum);
return current_pixel;
}
void smooth(int dim, pixel *src, pixel *dst)
{
int i, j;
/* below are the main computations for the implementation of the smooth function. */
for (j = 0; j < dim; j++)
for (i = 0; i < dim; i++)
dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
I moved dim-1-j outside of the inner for loop of rotate which reduces time and cycles used in the program, but is there anything else that can be used for either main function?
Thanks!

There are several optimizations you can do; some a compiler might do for you, but it's best to write them out yourself. For example: moving constant expressions out of the loop (you did that once; there are more places you can do that. Don't forget that the condition is checked every iteration too, so optimize the loop condition in this manner as well) and, as Chris pointed out, using pointers that you increment instead of full array indexing. I also see some function calls that can be rewritten inline.
I also want to point to an article on Stack Overflow about matrix multiplication and optimizing it to use the processor cache. In essence it first rearranges the arrays into memory blocks that fit the cache, then performs the operation on those blocks, then moves to the next block, and so on. You may be able to reuse the ideas for your rotation.
See Optimizing assembly generated by Microsoft Visual Studio Compiler

For the rotation, you get a better utilization of the cache by decomposing in smaller image tiles.
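A tiled version of rotate might look like the sketch below. The pixel type and the RIDX macro are not shown in the question, so the definitions here are assumptions (three unsigned short channels, row-major indexing), and the tile edge of 16 is only a starting guess to be tuned by measurement.

```c
/* Assumed definitions; the question does not show them. */
typedef struct { unsigned short red, green, blue; } pixel;
#define RIDX(i, j, n) ((i) * (n) + (j))

#define TILE 16   /* tile edge in pixels; a tunable guess */

/* Rotate by 90 degrees, visiting the image one TILE x TILE block at a
 * time so that the rows of src and dst being touched stay in cache. */
void rotate_tiled(int dim, const pixel *src, pixel *dst)
{
    for (int jj = 0; jj < dim; jj += TILE) {
        const int jmax = (jj + TILE < dim) ? jj + TILE : dim;
        for (int ii = 0; ii < dim; ii += TILE) {
            const int imax = (ii + TILE < dim) ? ii + TILE : dim;
            for (int j = jj; j < jmax; j++) {
                const int nj = dim - 1 - j;
                for (int i = ii; i < imax; i++)
                    dst[RIDX(nj, i, dim)] = src[RIDX(i, j, dim)];
            }
        }
    }
}
```

The min-style bounds on jmax and imax let the same code handle dimensions that are not a multiple of the tile size.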
For the smoothing,
1) expand the whole operation inside the main double loop, do not use these intermediate micro-functions;
2) completely unroll the accumulation and averaging (it's only a sum of 9 terms), hard coding the indexes;
3) process in different loops along the edges (where not all 9 pixels are available) and in the middle. The middle deserves maximum optimization (especially (2));
4) try and avoid the divisions by 9 (you can think of replacing the division by a table lookup).
Top speed will be obtained by handcrafting vectorized optimization (SSE/AVX), but this requires some deal of experience. Multicore parallelization is also an option.
To give you an idea, it is possible to apply a 3x3 average on a 1 MB grayscale image in less than 0.5 ms (single core, Core i7 @ 3.4 GHz). We can extrapolate to 2 ms or so for a 1 Mpixel RGB image.
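Points 1) to 3) can be sketched roughly as follows. As before, pixel and RIDX are assumed definitions, the function names are invented for this example, and the division by 9 is left in place (point 4 is not attempted here). No vectorization is shown either.

```c
/* Assumed definitions; the question does not show them. */
typedef struct { unsigned short red, green, blue; } pixel;
#define RIDX(i, j, n) ((i) * (n) + (j))

/* Border pixels: average over whatever part of the 3x3 window is in range. */
static pixel avg_border(int dim, int i, int j, const pixel *src)
{
    const int i0 = i > 0 ? i - 1 : 0, i1 = i < dim - 1 ? i + 1 : dim - 1;
    const int j0 = j > 0 ? j - 1 : 0, j1 = j < dim - 1 ? j + 1 : dim - 1;
    int r = 0, g = 0, b = 0, num = 0;
    for (int ii = i0; ii <= i1; ii++)
        for (int jj = j0; jj <= j1; jj++) {
            const pixel *p = &src[RIDX(ii, jj, dim)];
            r += p->red; g += p->green; b += p->blue; num++;
        }
    pixel out = { (unsigned short)(r / num), (unsigned short)(g / num),
                  (unsigned short)(b / num) };
    return out;
}

/* Sum of one colour channel over the full 3x3 window around column j. */
#define SUM9(f) (up[j - 1].f   + up[j].f   + up[j + 1].f +   \
                 mid[j - 1].f  + mid[j].f  + mid[j + 1].f +  \
                 down[j - 1].f + down[j].f + down[j + 1].f)

/* src and dst must not overlap. */
void smooth_split(int dim, const pixel *src, pixel *dst)
{
    /* Edges: the four border rows/columns use the general routine. */
    for (int t = 0; t < dim; t++) {
        dst[RIDX(0, t, dim)]       = avg_border(dim, 0, t, src);
        dst[RIDX(dim - 1, t, dim)] = avg_border(dim, dim - 1, t, src);
        dst[RIDX(t, 0, dim)]       = avg_border(dim, t, 0, src);
        dst[RIDX(t, dim - 1, dim)] = avg_border(dim, t, dim - 1, src);
    }
    /* Interior: no bounds checks, no helper calls, nine hard-coded terms. */
    for (int i = 1; i < dim - 1; i++) {
        const pixel *up   = src + RIDX(i - 1, 0, dim);
        const pixel *mid  = src + RIDX(i, 0, dim);
        const pixel *down = src + RIDX(i + 1, 0, dim);
        pixel *out = dst + RIDX(i, 0, dim);
        for (int j = 1; j < dim - 1; j++) {
            out[j].red   = (unsigned short)(SUM9(red) / 9);
            out[j].green = (unsigned short)(SUM9(green) / 9);
            out[j].blue  = (unsigned short)(SUM9(blue) / 9);
        }
    }
}
```

The interior loop walks three row pointers side by side, which also replaces the RIDX multiplications with simple offsets.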

Since you can't provide a running program these are just ideas of things that could help:
Assuming values in the range [0,256) then use uint8_t as your rgbn values. This takes up 1/4 of the memory of the int version but will likely require more cycles; I can't know if this would be faster or not without more knowledge. The idea is that since you use 1/4 of the memory you are more likely to keep more values in L1-L3 cache.
Since your neighbors are the same whether you are rotated or not, calculate the average before rotating. I suspect this would help out with caching but again can't be sure; it depends on some code I can't see.
Parallelize the outer loop. Since you have easy grid dimensions and the inputs and outputs don't have read/write conflicts this is a trivial thing to do. This will certainly take more cycles but will possibly be faster.
Hard-code your edges; you are currently doing maximum and minimum operations on every call to average, but for the inner points it is unneeded. Calculate the edges and the inner points separately.

Modulo 2*Pi using SSE/SSE2

I'm still pretty new to using SSE and am trying to implement a modulo of 2*Pi for double-precision inputs of the order 1e8 (the result of which will be fed into some vectorised trig calculations).
My current attempt at the code is based around the idea that mod(x, 2*Pi) = x - floor(x/(2*Pi))*2*Pi and looks like:
#define _PD_CONST(Name, Val) \
static const double _pd_##Name[2] __attribute__((aligned(16))) = { Val, Val }
_PD_CONST(2Pi, 6.283185307179586); /* = 2*pi */
_PD_CONST(recip_2Pi, 0.159154943091895); /* = 1/(2*pi) */
void vec_mod_2pi(const double * vec, int Size, double * modAns)
{
__m128d sse_a, sse_b, sse_c;
int i;
int k = 0;
double t = 0;
unsigned int initial_mode;
initial_mode = _MM_GET_ROUNDING_MODE();
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
for (i = 0; i < Size; i += 2)
{
sse_a = _mm_loadu_pd(vec+i);
sse_b = _mm_mul_pd( _mm_cvtepi32_pd( _mm_cvtpd_epi32( _mm_mul_pd(sse_a, *(__m128d*)_pd_recip_2Pi) ) ), *(__m128d*)_pd_2Pi);
sse_c = _mm_sub_pd(sse_a, sse_b);
_mm_storeu_pd(modAns+i,sse_c);
}
k = i-2;
for (i = 0; i < Size%2; i++)
{
t = (double)((int)(vec[k+i] * 0.159154943091895)) * 6.283185307179586;
modAns[k+i] = vec[k+i] - t;
}
_MM_SET_ROUNDING_MODE(initial_mode);
}
Unfortunately, this is currently returning a lot of NaN with a couple of answers of 1.128e119 as well (somewhat outside the range of 0 -> 2*Pi that I was aiming for!). I suspect that where I'm going wrong is in the double-to-int-to-double conversion that I'm trying to use to do the floor.
Can anyone suggest where I've gone wrong and how to improve it?
P.S. sorry about the format of that code, it's the first time I've posted a question on here and can't seem to get it to give me empty lines within the code block to make it readable.

If you want any kind of accuracy, the simple algorithm is terribly bad. For an accurate range reduction algorithm, see e.g. Ng et al., ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit (now available via the Wayback Machine: 2012-12-24)

For large arguments the Payne-Hanek algorithm is typically used. However, the Payne-Hanek paper is quite difficult to read, and I suggest having a look at Chapter 11 in the Handbook of Floating-Point Arithmetic for a more accessible explanation.
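For reference, the simple formula from the question, mod(x, 2*Pi) = x - floor(x/(2*Pi))*2*Pi, can be written in scalar C as below. This is only a baseline to compare a vectorised version against: it is the "simple algorithm" the answers warn about, not an accurate range reduction, and the function name is invented for this sketch.

```c
/* Scalar reference for mod(x, 2*pi) = x - floor(x / (2*pi)) * 2*pi.
 * Assumes |x| is small enough for x / (2*pi) to fit in a long long.
 * Accuracy degrades as |x| grows, because 2*pi is only stored to
 * double precision and n * two_pi is rounded. */
double mod_2pi_naive(double x)
{
    const double two_pi = 6.283185307179586;
    const double q = x / two_pi;
    long long n = (long long)q;   /* conversion truncates toward zero */
    if ((double)n > q) n--;       /* turn truncation into floor for negative q */
    return x - (double)n * two_pi;
}
```

For inputs around 1e8 the absolute error of this formula is on the order of 1e-8, which may or may not be acceptable depending on what the trig results are used for.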

Optimizing array transposing function

I'm working on a homework assignment, and I've been stuck for hours on my solution. The problem we've been given is to optimize the following code, so that it runs faster, regardless of how messy it becomes. We're supposed to use stuff like exploiting cache blocks and loop unrolling.
Problem:
//transpose a dim x dim matrix into dst by swapping all i,j with j,i
void transpose(int *dst, int *src, int dim) {
int i, j;
for(i = 0; i < dim; i++) {
for(j = 0; j < dim; j++) {
dst[j*dim + i] = src[i*dim + j];
}
}
}
What I have so far:
//attempt 1
void transpose(int *dst, int *src, int dim) {
int i, j, id, jd;
id = 0;
for(i = 0; i < dim; i++, id+=dim) {
jd = 0;
for(j = 0; j < dim; j++, jd+=dim) {
dst[jd + i] = src[id + j];
}
}
}
//attempt 2
void transpose(int *dst, int *src, int dim) {
int i, j, id;
int *pd, *ps;
id = 0;
for(i = 0; i < dim; i++, id+=dim) {
pd = dst + i;
ps = src + id;
for(j = 0; j < dim; j++) {
*pd = *ps++;
pd += dim;
}
}
}
Some ideas, please correct me if I'm wrong:
I have thought about loop unrolling but I don't think that would help, because we don't know if the NxN matrix has prime dimensions or not. If I checked for that, it would include excess calculations which would just slow down the function.
Cache blocks wouldn't be very useful, because no matter what, we will be accessing one array linearly (1,2,3,4) while the other we will be accessing in jumps of N. While we can get the function to abuse the cache and access the src block faster, it will still take a long time to place those into the dst matrix.
I have also tried using pointers instead of array accessors, but I don't think that actually speeds up the program in any way.
Any help would be greatly appreciated.
Thanks

Cache blocking can be useful. For example, let's say we have a cache line size of 64 bytes (which is what x86 uses these days). For a matrix larger than the cache, if we transpose a 16x16 block (sizeof(int) == 4, so 16 ints fit in a cache line, assuming the matrix is aligned on a cache-line boundary), we need to load 32 cache lines from memory (16 from the source matrix and 16 from the destination matrix, which must be loaded before we can dirty them) and store another 16 lines (even though the stores are not sequential). In contrast, without cache blocking, transposing the equivalent 16*16 elements requires us to load 16 cache lines from the source matrix, but 16*16=256 cache lines have to be loaded and then stored for the destination matrix.
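A blocked transpose along these lines could look like the following sketch. The block size of 16 matches the 64-byte cache line estimate above but is otherwise an assumption to be tuned, and the restrict qualifiers assume that dst and src never overlap.

```c
#define BLOCK 16   /* 16 ints = one 64-byte cache line, as in the estimate above */

/* Transpose a dim x dim matrix one BLOCK x BLOCK tile at a time.
 * dst and src must not overlap (that is what `restrict` promises). */
void transpose_blocked(int *restrict dst, const int *restrict src, int dim)
{
    for (int ib = 0; ib < dim; ib += BLOCK) {
        const int imax = (ib + BLOCK < dim) ? ib + BLOCK : dim;
        for (int jb = 0; jb < dim; jb += BLOCK) {
            const int jmax = (jb + BLOCK < dim) ? jb + BLOCK : dim;
            for (int i = ib; i < imax; i++)
                for (int j = jb; j < jmax; j++)
                    dst[j * dim + i] = src[i * dim + j];
        }
    }
}
```

Within one tile, the 16 destination cache lines are each written 16 times before the code moves on, instead of being evicted and reloaded between writes.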

Unrolling is useful for large matrixes.
You'll need some code to deal with excess elements if the matrix size isn't a multiple of the times you unroll. But this will be outside the most critical loop, so for a large matrix it's worth it.
Regarding the direction of accesses - it may be better to read linearly and write in jumps of N, rather than vice versa. This is because read operations block the CPU, while write operations don't (up to a limit).
Other suggestions:
1. Can you use parallelization? OpenMP can help (though if you're expected to deliver single CPU performance, it's no good).
2. Disassemble the function and read it, focusing on the innermost loop. You may find things you wouldn't notice in C code.
3. Using decreasing counters (stopping at 0) might be slightly more efficient than increasing counters.
4. The compiler must assume that src and dst may alias (point to the same or overlapping memory), which limits its optimization options. If you can tell the compiler that they can't overlap, it may be a great help. In C99 the restrict qualifier on both pointer parameters is the standard way to make that promise.

Messiness is not a problem, so: I would add a transposed flag to each matrix. This flag indicates whether the stored data array of a matrix is to be interpreted in normal or transposed order.
All matrix operations should receive these new flags in addition to each matrix parameter. Inside each operation implement the code for all possible combinations of flags. Perhaps macros can save redundant writing here.
In this new implementation, the matrix transposition just toggles the flag: The space and time needed for the transpose operation is constant.
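A minimal sketch of the flag idea, with hypothetical type and function names (flagged_matrix, fm_transpose, fm_get) that are not part of the assignment:

```c
#include <stddef.h>

/* Hypothetical matrix type carrying a "transposed" flag. */
typedef struct {
    int   *data;        /* row-major storage of the untransposed matrix */
    size_t dim;         /* the matrix is dim x dim */
    int    transposed;  /* nonzero: element (i, j) is read from stored (j, i) */
} flagged_matrix;

/* Transposition in constant time and space: just toggle the flag. */
static inline void fm_transpose(flagged_matrix *m)
{
    m->transposed = !m->transposed;
}

/* Every accessor has to honour the flag. */
static inline int fm_get(const flagged_matrix *m, size_t i, size_t j)
{
    return m->transposed ? m->data[j * m->dim + i]
                         : m->data[i * m->dim + j];
}
```

The cost does not disappear: it moves into every later operation, which now reads with a stride of dim whenever the flag is set.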

Just an idea how to implement unrolling:
void transpose(int *dst, int *src, int dim) {
int i, j;
const int dim1 = (dim / 4) * 4;
for(i = 0; i < dim; i++) {
for(j = 0; j < dim1; j+=4) {
dst[j*dim + i] = src[i*dim + j];
dst[(j+1)*dim + i] = src[i*dim + (j+1)];
dst[(j+2)*dim + i] = src[i*dim + (j+2)];
dst[(j+3)*dim + i] = src[i*dim + (j+3)];
}
for( ; j < dim; j++) {
dst[j*dim + i] = src[i*dim + j];
}
__builtin_prefetch (&src[(i+1)*dim], 0, 1);
}
}
Of course you should remove index computations (like i*dim) from the inner loop, as you already did in your attempts.
Cache prefetch could be used for source matrix.

you probably know this but register int (you tell the compiler that it would be smart to put this in register). And making the int's unsigned, may make things go little bit faster.