Loop Tiling Optimisations

Loop Tiling Optimisations - c

I've been attempting to optimise one of my loops in my C code in order to make it use the cache more efficiently. I have a few issues. I'm not 100% sure if I'm even writing the code correctly to loop block due to the fact that I am seeing no increase in speed in the run time of my programme. Here is the code:
for(int k = 0; k < N; k+=b){
for (int i = k; i<MIN(N,i+b); ++i) {
a1[i] = 0.0f;
a2[i] = 0.0f;
for (int j = 0; j < N; j++) {
x = x[j] - x[i];
y = y[j] - y[i];
2 = x*x + y*y + eps;
r2inv = 1.0f / sqrt(r2);
r6inv = r2inv * r2inv * r2inv;
s = m[j] * r6inv;
ax[i] += s * x;
ay[i] += s * y;
}
}
}
I also have another issue. How do I go about choosing a correct block size? I understand that you want to load in enough to fill the l1 cache.
Thanks for the help in advance.

What you are doing is rather pointless, because i goes from 0 to N-1 in your code, just in a slightly more complicated way. So you benefit exactly zero from your attempts at tiling.
What is more critical is the array y, so that is what you should be tiling (if N is large, and if the speed isn't limited by the division and square root). For every value i, you make one complete pass through the array y. You can also easily save a few floating point operations for each j, and since r6inv is symmetrical between i and j, only half the values need to be calculated.

Related

(Computational fluid dynamics) Problem with arrays. Why is my C program outputting -1.#IND00

I have a problem with a program i'm writing in C that solves the 1D linear convection equation. Basically i have initialized an two arrays. The first array (u0_array) is an array of ones with the arrays elements set equal to two over the interval or 0.5 < x < 1. The second array (usol_array) serves as a temporary array that the result or solution will be stored in.
The problem i am running into is the nested for loop at the end of my code. This loop applies the update equation required to calculate the next point, to each element in the array. When i run my script and try to print the results the output i get is just -1.IND00 for each iteration of the loop. (I am following Matlab style pseudo code which i also have attached below) I'm very new with C so my inexperience shows. I don't know why this is happening. If anyone could suggest a possible fix to this i would be very grateful. I have attached my code so far below with a few comments so you can follow my thought process. I have also attached the Matlab style pseudo code i'm followed.
# include <math.h>
# include <stdlib.h>
# include <stdio.h>
# include <time.h>
int main ( )
{
//eqyation to be solved 1D linear convection -- du/dt + c(du/dx) = 0
//initial conditions u(x,0) = u0(x)
//after discretisation using forward difference time and backward differemce space
//update equation becomes u[i] = u0[i] - c * dt/dx * (u0[i] - u[i - 1]);
int nx = 41; //num of grid points
int dx = 2 / (nx - 1); //magnitude of the spacing between grid points
int nt = 25;//nt is the number of timesteps
double dt = 0.25; //the amount of time each timestep covers
int c = 1; //assume wavespeed
//set up our initial conditions. The initial velocity u_0
//is 2 across the interval of 0.5 <x < 1 and u_0 = 1 everywhere else.
//we will define an array of ones
double* u0_array = (double*)calloc(nx, sizeof(double));
for (int i = 0; i < nx; i++)
{
u0_array[i] = 1;
}
// set u = 2 between 0.5 and 1 as per initial conditions
//note 0.5/dx = 10, 1/dx+1 = 21
for (int i = 10; i < 21; i++)
{
u0_array[i] = 2;
//printf("%f, ", u0_array[i]);
}
//make a temporary array that allows us to store the solution
double* usol_array = (double*)calloc(nx, sizeof(double));
//apply numerical scheme forward difference in
//time an backward difference in space
for (int i = 0; i < nt; i++)
{
//first loop iterates through each time step
usol_array[i] = u0_array[i];
//printf("%f", usol_array[i]);
//MY CODE WORKS FINE AS I WANT UP TO THIS LOOP
//second array iterates through each grid point for that time step and applies
//the update equation
for (int j = 1; j < nx - 1; j++)
{
u0_array[j] = usol_array[j] - c * dt/dx * (usol_array[j] - usol_array[j - 1]);
printf("%f, ", u0_array[j]);
}
}
return EXIT_SUCCESS;
}
For reference also the pseudo code i am following is attached below
1D linear convection pseudo Code (Matlab Style)

Instead of integer division, use FP math.
This avoid the later division by 0 and -1.#IND00
// int dx = 2 / (nx - 1); quotient is 0.
double dx = 2.0 / (nx - 1);
OP's code does not match comment
// u[i] = u0[i] - c * dt/dx * (u0[i] - u[i - 1]
u0_array[j] = usol_array[j] - c * dt/dx * (usol_array[j] - usol_array[j - 1]);
Easier to see if the redundant _array removed.
// v correct?
// u[i] = u0[i] - c * dt/dx * (u0[i] - u[i - 1]
u0[j] = usol[j] - c * dt/dx * (usol[j] - usol[j - 1]);
// ^^^^ Correct?

What you probably want is in matlab terms
for i = 1:nt
usol = u0
u0(2:nx) = usol(2:nx) - c*dt/dx*(usol(2:nx)-usol(1:nx-1))
end%for
This means that you have 2 inner loops over the space dimension, one for each of the two vector operations. These two separate loops have to be explicit in the C code
//apply numerical scheme forward difference in
//time an backward difference in space
for (int i = 0; i < nt; i++) {
//first loop iterates through each time step
for (int j = 0; j < nx; j++) {
usol_array[j] = u0_array[j];
//printf("%f", usol_array[i]);
}
//second loop iterates through each grid point for that time step
//and applies the update equation
for (int j = 1; j < nx; j++)
{
u0_array[j] = usol_array[j] - c * dt/dx * (usol_array[j] - usol_array[j - 1]);
printf("%f, ", u0_array[j]);
}
}

Slow OMP vs serial

I'm trying to optimize a C subroutine called from R that takes up ~60% of the computation time for a problem I'm trying to solve. This is down from 86% when coded purely in R. The vast majority of the execution time in my C code is taking place in a nested for loop and so this seems an obvious candidate to try and parallelize using OpenMP. I've tried doing so with variable results – at best the elapsed time is fractionally worse than not using OMP, at worst the performance scaled inversely to the number of threads. The code for the fastest version is below:
#include <R.h>
#include <Rmath.h>
#include <omp.h>
void gradNegLogLik_c(double *param, double *delta, double *X, double *M, int *nBeta, int *nEpsilon, int *nObs, double *gradient){
// ========================================================================================
// param: double[nBeta + nEpsilon] values of parameters at which to evaluate gradient
// delta: double[nObs] satellite - buoy differences
// X: double[nObs * (nBeta + nEpsilon)] design matrix for mean components (i.e. beta terms)
// M: double[nObs * (nBeta + nEpsilon)] design matrix for variance components (i.e. epsilon terms)
// nBeta: int number of mean terms
// nEpsilon: int number of variance terms
// nObs: int number of observations
// gradient: double[nBeta + nEpsilon] output array of gradients
// ========================================================================================
// ========================================================================================
// local variables
size_t i, j, ind;
size_t nterms = *nBeta + *nEpsilon;
size_t nbeta = *nBeta;
size_t nepsilon = *nEpsilon;
size_t nobs = *nObs;
// allocate local memory and set to zero
double *sigma2 = calloc( nobs , sizeof(double) );
double *fittedValues = calloc( nobs , sizeof(double) );
double *residuals = calloc( nobs , sizeof(double) );
double *beta = calloc( nbeta , sizeof(double) );
double *epsilon2 = calloc( nepsilon , sizeof(double) );
double *residuals2 = calloc( nobs , sizeof(double) );
double gradBeta, gradEpsilon;
// extract beta and epsilon terms from param
// =========================================
for(i = 0 ; i < nbeta ; i++){
beta[i] = param[ i ];
epsilon2[i] = param[ nbeta + i ];
}
// Initialise gradient to zero for return value
// =========================================
for( i = 0 ; i < nterms ; i++){
gradient[i] = 0;
}
// calculate sigma, fitted values and residuals
// ============================================
for( i = 0 ; i < nbeta ; i++){
for( j = 0 ; j < nobs ; j++){
ind = i * nobs + j;
sigma2[j] += M[ind] * epsilon2[i];
fittedValues[j] += X[ind] * beta[i];
}
}
for( j = 0 ; j < nobs ; j++){
// calculate reciprocal as this is what we actually use and
// we only want to do it once.
sigma2[j] = 1 / sigma2[j];
residuals[j] = delta[j] - fittedValues[j];
residuals2[j] = residuals[j]*residuals[j];
}
// Loop over all observations and calculate value of (negative) derivative
// =======================================================================
#pragma omp parallel for private(i, j, ind, gradBeta, gradEpsilon)\
shared(gradient, nbeta, nobs, X, M, sigma2, fittedValues, delta, residuals2) \
default(none)
for( i = 0 ; i < nbeta ; i++){
gradBeta = 0.0;
gradEpsilon = 0.0;
for(j = 0 ; j < nobs ; j++){
ind = i * nobs + j;
gradBeta -= -1.0*X[ind] * sigma2[j]*(fittedValues[j] - delta[j]);
gradEpsilon -= 0.5*M[ind] * sigma2[j]*(residuals2[j] * sigma2[j] - 1);
}
gradient[i] = gradBeta;
gradient[nbeta + i] = gradEpsilon;
}
// End of function
// free local memory
free(sigma2);
free(fittedValues);
free(residuals);
free(beta);
free(epsilon2);
free(residuals2);
}
nObs is order 10000.
nBeta is in the range 20 – several hundred.
nEpsilon = nBeta and is not currently used.
After searching through this site and an afternoon googling and trying different things I don't seem to be able to make any further improvement. My first thoughts were false sharing – I've tried various things such as unrolling the outer loop to set 8 elements of gradient[] at a time to creating a temporary padded array to store the results in. I've also tried different combinations of shared, private and firstprivate. None of this appears to improve things and my fastest execution time is marginally worse in parallel than in serial. This leads to two questions before I spend any more time on this:
Is my problem (repeating ~9000 of the same set of calculations 20 - 900 times) too small to make it worthwhile using OMP?
Is there something I'm missing or doing wrong?
I suspect it's the latter as I'm relatively inexperienced when using C and OMP. Any help / thoughts would be appreciated.
(For info, I'm running on SLED11 server with 16 cores and 192GB of memory and using GCC 4.7.2 to compile my C code). Other users are using the server but the relative performance of OMP vs serial code seems independent of the other users.
Thanks in advance,
Dave.
EDIT: For info the compile command I've used is
gcc -I/RHOME/R/3.0.1/lib64/R/include -DNDEBUG -I/usr/local/include -fpic \
-std=c99 -Wall -pedantic –O3 -fopenmp -c src/gradNegLogLik_call.c \
-o src/gradNegLogLik_call.o
Most of the flags are set by the R CMD SHLIB command - I've added the -O3 -fopenmp manually.

It may be useful to give some context to my question above before giving my answer to what I've done to speed up my code (although this has been achieved without using OMP).
My original C function was written to calculate the gradient of a log likelihood function to be used with the R optim() command and the L-BFGS-B method. For each call of optim my log likelihood and gradient functions are each called ~100 times as optim finds the best solution. As a result, these two functions take up the bulk of my execution time, as expected and reported by Rprof, and so were the two targets for converting to C to improve the efficiency of my code.
Converting my two functions to C and optimizing that code has resulted in my calls to optim reducing from an average elapsed time of 1.88s per call to 0.25s per call. This has reduced my processing time from ~1 month to a few days. The change that had the biggest impact (beside calling C) was changing the ordering of the nested loops. The original order was chosen due to the way R stores matrices and chosen to avoid having to transpose my matrices for each call of my C functions. Recognizing that the transpose only needs to be done once for each call to optim(), and not each C call as I had originally coded, this is a small overhead to pay compared to the impact / benefit of changing the order in the C functions.
Given this increase in speed it's had to justify spending any more time on this. The final version of my gradient function (as per my original post) is given below.
Note that whilst I've changed from using .C to .Call in R (hence the change to the function arguments etc) this in itself doesn’t account for the speed increase.
#include <R.h>
#include <Rmath.h>
#include <Rinternals.h>
#include <omp.h>
SEXP gradNegLogLik_call(SEXP param ,SEXP delta, SEXP X, SEXP M, SEXP nBeta, SEXP nEpsilon){
// local variables
double *par, *d;
double *sigma2, *fittedValues, *residuals, *grad, *Xuse, *Muse;
double val, sig2, gradBeta, gradEpsilon;
int n, m, ind, nterms, i, j;
SEXP gradient;
// get / associate parameters with local pointer
par = REAL(param);
Xuse = REAL(X);
Muse = REAL(M);
d = REAL(delta);
n = LENGTH(delta);
m = INTEGER(nBeta)[0];
nterms = m + m;
// allocate memory
PROTECT( gradient = allocVector(REALSXP, nterms ));
// set pointer to real portion of gradient
grad = REAL(gradient);
// set all gradient terms to zero
for(i = 0 ; i < nterms ; i++){
grad[i] = 0.0;
}
sigma2 = Calloc(n, double );
fittedValues = Calloc(n, double );
residuals = Calloc(n, double );
// calculate sigma, fitted values and residuals
for(i = 0 ; i < n ; i++){
val = 0.0;
sig2 = 0.0;
for(j = 0 ; j < m ; j++){
ind = i*m + j;
val += Xuse[ind]*par[j];
sig2 += Muse[ind]*par[j+m];
}
// calculate reciprocal of sigma as this is what we actually use
// and we only want to do it once
sigma2[i] = 1.0 / sig2;
fittedValues[i] = val;
residuals[i] = d[i] - val;
}
// now loop over each paramter and calculate derivative
for(i = 0 ; i < n ; i++){
gradBeta = -1.0*sigma2[i]*(fittedValues[i] - d[i]);
gradEpsilon = 0.5*sigma2[i]*(residuals[i]*residuals[i]*sigma2[i] - 1);
for(j = 0 ; j < m ; j++){
ind = i*m + j;
grad[j] -= Xuse[ind]*gradBeta;
grad[j+m] -= Muse[ind]*gradEpsilon;
}
}
UNPROTECT(1);
Free(sigma2);
Free(residuals);
Free(fittedValues);
// return array of gradients
return gradient;
}

Sum 3D matrix cuda

I need to do calculation like: A[x][y] = sum{from z=0 till z=n}{B[x][y][z]+C[x][y][z]}, where matrix A has dimensions [height][width] and matrix B,C has dimensions [height][width][n].
Values are mapped to memory with something like:
index = 0;
for (z = 0; z<n; ++z)
for(y = 0; y<width; ++y)
for(x = 0; x<height; ++x) {
matrix[index] = value;
index++;
}
Q1: is this Cuda kernel ok?
idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
for(z=0; z<n; z++){
A[idx*width+idy] += B[idx*width+idy+z*width*height] + C[idx*width+idy+z*width*height];
}
Q2: Is this faster way to do the calculation?
idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
idz = blockIdx.z*blockDim.z + threadIdx.z;
int stride_x = blockDim.x * gridDim.x;
int stride_y = blockDim.y * gridDim.y;
int stride_z = blockDim.z * gridDim.z;
while ( idx < height && idy < width && idz < n ) {
atomicAdd( &(A[idx*width+idy]), B[idx*width+idy+idz*width*height] + C[idx*width+idy+idz*width*height] );
idx += stride_x;
idy += stride_y;
idz += stride_z;
}

First kernel is ok. But we have not coalesced access to matrix B and C.
As for second kernel function. You have data racing cause not only one thread has an an ability to write in A[idx*width+idy] addres. You need in additional synchronization like AttomicAdd
As for general question:
I think that experiments show that it is better. It's depends on typical matrix sizes that you have. Remember that maximum thread block size on Fermi < 1024 and if matrices have large size you gem many thread blocks. Usually it's slower (to have many thread blocks).

Real simple in ArrayFire:
array A = randu(nx,ny,nz);
array B = sum(A,2); // sum along 3rd dimension
print(B);

Q1: Test it with matrices where you know the answer
Remark: You might have problems when using very large matrices. Use a while loop with appropriate increments. Cuda by Example is as usual the reference book.
An example for implementing a nested loop can be found here: For nested loops with CUDA. There a while loop is implemented.
marina.k is right about the race condition. That would favor approach one, as atomic operations tend to slow down the code.

Determining the complexities given codes

Given a snipplet of code, how will you determine the complexities in general. I find myself getting very confused with Big O questions. For example, a very simple question:
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
System.out.println("*");
}
}
The TA explained this with something like combinations. Like this is n choose 2 = (n(n-1))/2 = n^2 + 0.5, then remove the constant so it becomes n^2. I can put int test values and try but how does this combination thing come in?
What if theres an if statement? How is the complexity determined?
for (int i = 0; i < n; i++) {
if (i % 2 ==0) {
for (int j = i; j < n; j++) { ... }
} else {
for (int j = 0; j < i; j++) { ... }
}
}
Then what about recursion ...
int fib(int a, int b, int n) {
if (n == 3) {
return a + b;
} else {
return fib(b, a+b, n-1);
}
}

In general, there is no way to determine the complexity of a given function
Warning! Wall of text incoming!
1. There are very simple algorithms that no one knows whether they even halt or not.
There is no algorithm that can decide whether a given program halts or not, if given a certain input. Calculating the computational complexity is an even harder problem since not only do we need to prove that the algorithm halts but we need to prove how fast it does so.
//The Collatz conjecture states that the sequence generated by the following
// algorithm always reaches 1, for any initial positive integer. It has been
// an open problem for 70+ years now.
function col(n){
if (n == 1){
return 0;
}else if (n % 2 == 0){ //even
return 1 + col(n/2);
}else{ //odd
return 1 + col(3*n + 1);
}
}
2. Some algorithms have weird and off-beat complexities
A general "complexity determining scheme" would easily get too complicated because of these guys
//The Ackermann function. One of the first examples of a non-primitive-recursive algorithm.
function ack(m, n){
if(m == 0){
return n + 1;
}else if( n == 0 ){
return ack(m-1, 1);
}else{
return ack(m-1, ack(m, n-1));
}
}
function f(n){ return ack(n, n); }
//f(1) = 3
//f(2) = 7
//f(3) = 61
//f(4) takes longer then your wildest dreams to terminate.
3. Some functions are very simple but will confuse lots of kinds of static analysis attempts
//Mc'Carthy's 91 function. Try guessing what it does without
// running it or reading the Wikipedia page ;)
function f91(n){
if(n > 100){
return n - 10;
}else{
return f91(f91(n + 11));
}
}
That said, we still need a way to find the complexity of stuff, right? For loops are a simple and common pattern. Take your initial example:
for(i=0; i<N; i++){
for(j=0; j<i; j++){
print something
}
}
Since each print something is O(1), the time complexity of the algorithm will be determined by how many times we run that line. Well, as your TA mentioned, we do this by looking at the combinations in this case. The inner loop will run (N + (N-1) + ... + 1) times, for a total of (N+1)*N/2.
Since we disregard constants we get O(N2).
Now for the more tricky cases we can get more mathematical. Try to create a function whose value represents how long the algorithm takes to run, given the size N of the input. Often we can construct a recursive version of this function directly from the algorithm itself and so calculating the complexity becomes the problem of putting bounds on that function. We call this function a recurrence
For example:
function fib_like(n){
if(n <= 1){
return 17;
}else{
return 42 + fib_like(n-1) + fib_like(n-2);
}
}
it is easy to see that the running time, in terms of N, will be given by
T(N) = 1 if (N <= 1)
T(N) = T(N-1) + T(N-2) otherwise
Well, T(N) is just the good-old Fibonacci function. We can use induction to put some bounds on that.
For, example, Lets prove, by induction, that T(N) <= 2^n for all N (ie, T(N) is O(2^n))
base case: n = 0 or n = 1
T(0) = 1 <= 1 = 2^0
T(1) = 1 <= 2 = 2^1
inductive case (n > 1):
T(N) = T(n-1) + T(n-2)
aplying the inductive hypothesis in T(n-1) and T(n-2)...
T(N) <= 2^(n-1) + 2^(n-2)
so..
T(N) <= 2^(n-1) + 2^(n-1)
<= 2^n
(we can try doing something similar to prove the lower bound too)
In most cases, having a good guess on the final runtime of the function will allow you to easily solve recurrence problems with an induction proof. Of course, this requires you to be able to guess first - only lots of practice can help you here.
And as f final note, I would like to point out about the Master theorem, the only rule for more difficult recurrence problems I can think of now that is commonly used. Use it when you have to deal with a tricky divide and conquer algorithm.
Also, in your "if case" example, I would solve that by cheating and splitting it into two separate loops that don; t have an if inside.
for (int i = 0; i < n; i++) {
if (i % 2 ==0) {
for (int j = i; j < n; j++) { ... }
} else {
for (int j = 0; j < i; j++) { ... }
}
}
Has the same runtime as
for (int i = 0; i < n; i += 2) {
for (int j = i; j < n; j++) { ... }
}
for (int i = 1; i < n; i+=2) {
for (int j = 0; j < i; j++) { ... }
}
And each of the two parts can be easily seen to be O(N^2) for a total that is also O(N^2).
Note that I used a good trick trick to get rid of the "if" here. There is no general rule for doing so, as shown by the Collatz algorithm example

In general, deciding algorithm complexity is theoretically impossible.
However, one cool and code-centric method for doing it is to actually just think in terms of programs directly. Take your example:
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
System.out.println("*");
}
}
Now we want to analyze its complexity, so let's add a simple counter that counts the number of executions of the inner line:
int counter = 0;
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
System.out.println("*");
counter++;
}
}
Because the System.out.println line doesn't really matter, let's remove it:
int counter = 0;
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
counter++;
}
}
Now that we have only the counter left, we can obviously simplify the inner loop out:
int counter = 0;
for (int i = 0; i < n; i++) {
counter += n;
}
... because we know that the increment is run exactly n times. And now we see that counter is incremented by n exactly n times, so we simplify this to:
int counter = 0;
counter += n * n;
And we emerged with the (correct) O(n2) complexity :) It's there in the code :)
Let's look how this works for a recursive Fibonacci calculator:
int fib(int n) {
if (n < 2) return 1;
return fib(n - 1) + fib(n - 2);
}
Change the routine so that it returns the number of iterations spent inside it instead of the actual Fibonacci numbers:
int fib_count(int n) {
if (n < 2) return 1;
return fib_count(n - 1) + fib_count(n - 2);
}
It's still Fibonacci! :) So we know now that the recursive Fibonacci calculator is of complexity O(F(n)) where F is the Fibonacci number itself.
Ok, let's look at something more interesting, say simple (and inefficient) mergesort:
void mergesort(Array a, int from, int to) {
if (from >= to - 1) return;
int m = (from + to) / 2;
/* Recursively sort halves */
mergesort(a, from, m);
mergesort(m, m, to);
/* Then merge */
Array b = new Array(to - from);
int i = from;
int j = m;
int ptr = 0;
while (i < m || j < to) {
if (i == m || a[j] < a[i]) {
b[ptr] = a[j++];
} else {
b[ptr] = a[i++];
}
ptr++;
}
for (i = from; i < to; i++)
a[i] = b[i - from];
}
Because we are not interested in the actual result but the complexity, we change the routine so that it actually returns the number of units of work carried out:
int mergesort(Array a, int from, int to) {
if (from >= to - 1) return 1;
int m = (from + to) / 2;
/* Recursively sort halves */
int count = 0;
count += mergesort(a, from, m);
count += mergesort(m, m, to);
/* Then merge */
Array b = new Array(to - from);
int i = from;
int j = m;
int ptr = 0;
while (i < m || j < to) {
if (i == m || a[j] < a[i]) {
b[ptr] = a[j++];
} else {
b[ptr] = a[i++];
}
ptr++;
count++;
}
for (i = from; i < to; i++) {
count++;
a[i] = b[i - from];
}
return count;
}
Then we remove those lines that do not actually impact the counts and simplify:
int mergesort(Array a, int from, int to) {
if (from >= to - 1) return 1;
int m = (from + to) / 2;
/* Recursively sort halves */
int count = 0;
count += mergesort(a, from, m);
count += mergesort(m, m, to);
/* Then merge */
count += to - from;
/* Copy the array */
count += to - from;
return count;
}
Still simplifying a bit:
int mergesort(Array a, int from, int to) {
if (from >= to - 1) return 1;
int m = (from + to) / 2;
int count = 0;
count += mergesort(a, from, m);
count += mergesort(m, m, to);
count += (to - from) * 2;
return count;
}
We can now actually dispense with the array:
int mergesort(int from, int to) {
if (from >= to - 1) return 1;
int m = (from + to) / 2;
int count = 0;
count += mergesort(from, m);
count += mergesort(m, to);
count += (to - from) * 2;
return count;
}
We can now see that actually the absolute values of from and to do not matter any more, but only their distance, so we modify this to:
int mergesort(int d) {
if (d <= 1) return 1;
int count = 0;
count += mergesort(d / 2);
count += mergesort(d / 2);
count += d * 2;
return count;
}
And then we get to:
int mergesort(int d) {
if (d <= 1) return 1;
return 2 * mergesort(d / 2) + d * 2;
}
Here obviously d on the first call is the size of the array to be sorted, so you have the recurrence for the complexity M(x) (this is in plain sight on the second line :)
M(x) = 2(M(x/2) + x)
and this you need to solve in order to get to a closed form solution. This you do easiest by guessing the solution M(x) = x log x, and verify for the right side:
2 (x/2 log x/2 + x)
= x log x/2 + 2x
= x (log x - log 2 + 2)
= x (log x - C)
and verify it is asymptotically equivalent to the left side:
x log x - Cx
------------ = 1 - [Cx / (x log x)] = 1 - [C / log x] --> 1 - 0 = 1.
x log x

Even though this is an over generalization, I like to think of Big-O in terms of lists, where the length of the list is N items.
Thus, if you have a for-loop that iterates over everything in the list, it is O(N). In your code, you have one line that (in isolation all by itself) is 0(N).
for (int i = 0; i < n; i++) {
If you have a for loop nested inside another for loop, and you perform an operation on each item in the list that requires you to look at every item in the list, then you are doing an operation N times for each of N items, thus O(N^2). In your example above you do in fact, have another for loop nested inside your for loop. So you can think about it as if each for loop is 0(N), and then because they are nested, multiply them together for a total value of 0(N^2).
Conversely, if you are just doing a quick operation on a single item then that would be O(1). There is no 'list of length n' to go over, just a single one time operation.To put this in context, in your example above, the operation:
if (i % 2 ==0)
is 0(1). What is important isn't the 'if', but the fact that checking to see if a single item is equal to another item is a quick operation on a single item. Like before, the if statement is nested inside your external for loop. However, because it is 0(1), then you are multiplying everything by '1', and so there is no 'noticeable' affect in your final calculation for the run time of the entire function.
For logs, and dealing with more complex situations (like this business of counting up to j or i, and not just n again), I would point you towards a more elegant explanation here.

I like to use two things for Big-O notation: standard Big-O, which is worst case scenario, and average Big-O, which is what normally ends up happening. It also helps me to remember that Big-O notation is trying to approximate run-time as a function of N, the number of inputs.
The TA explained this with something like combinations. Like this is n choose 2 = (n(n-1))/2 = n^2 + 0.5, then remove the constant so it becomes n^2. I can put int test values and try but how does this combination thing come in?
As I said, normal big-O is worst case scenario. You can try to count the number of times that each line gets executed, but it is simpler to just look at the first example and say that there are two loops over the length of n, one embedded in the other, so it is n * n. If they were one after another, it'd be n + n, equaling 2n. Since its an approximation, you just say n or linear.
What if theres an if statement? How is the complexity determined?
This is where for me having average case and best case helps a lot for organizing my thoughts. In worst case, you ignore the if and say n^2. In average case, for your example, you have a loop over n, with another loop over part of n that happens half of the time. This gives you n * n/x/2 (the x is whatever fraction of n gets looped over in your embedded loops. This gives you n^2/(2x), so you'd get n^2 just the same. This is because its an approximation.
I know this isn't a complete answer to your question, but hopefully it sheds some kind of light on approximating complexities in code.
As has been said in the answers above mine, it is clearly not possible to determine this for all snippets of code; I just wanted to add the idea of using average case Big-O to the discussion.

For the first snippet, it's just n^2 because you perform n operations n times. If j was initialized to i, or went up to i, the explanation you posted would be more appropriate but as it stands it is not.
For the second snippet, you can easily see that half of the time the first one will be executed, and the second will be executed the other half of the time. Depending on what's in there (hopefully it's dependent on n), you can rewrite the equation as a recursive one.
The recursive equations (including the third snippet) can be written as such: the third one would appear as
T(n) = T(n-1) + 1
Which we can easily see is O(n).

Big-O is just an approximation, it doesn't say how long an algorithm takes to execute, it just says something about how much longer it takes when the size of its input grows.
So if the input is size N and the algorithm evaluates an expression of constant complexity: O(1) N times, the complexity of the algorithm is linear: O(N). If the expression has linear complexity, the algorithm has quadratic complexity: O(N*N).
Some expressions have exponential complexity: O(N^N) or logarithmic complexity: O(log N). For an algorithm with loops and recursion, multiply the complexities of each level of loop and/or recursion. In terms of complexity, looping and recursion are equivalent. An algorithm that has different complexities at different stages in the algorithm, choose the highest complexity and ignore the rest. And finally, all constant complexities are considered equivalent: O(5) is the same as O(1), O(5*N) is the same as O(N).

Fast 2D convolution for DSP

I want to implement some image-processing algorithms which are intended to run on a beagleboard. These algorithms use convolutions extensively. I'm trying to find a good C implementation for 2D convolution (probably using the Fast Fourier Transform). I also want the algorithm to be able to run on the beagleboard's DSP, because I've heard that the DSP is optimized for these kinds of operations (with its multiply-accumulate instruction).
I have no background in the field so I think it won't be a good idea to implement the convolution myself (I probably won't do it as good as someone who understands all the math behind it). I believe a good C convolution implementation for DSP exists somewhere but I wasn't able find it?
Could someone help?
EDIT: Turns out the kernel is pretty small. Its dimensions are either 2X2 or 3X3. So I guess I'm not looking for an FFT-based implementation. I was searching for convolution on the web to see its definition so I can implement it in a straight forward way (I don't really know what convolution is). All I've found is something with multiplied integrals and I have no idea how to do it with matrices. Could somebody give me a piece of code (or pseudo code) for the 2X2 kernel case?

What are the dimensions of the image and the kernel ? If the kernel is large then you can use FFT-based convolution, otherwise for small kernels just use direct convolution.
The DSP might not be the best way to do this though - just because it has a MAC instruction doesn't mean that it will be more efficient. Does the ARM CPU on the Beagle Board have NEON SIMD ? If so then that might be the way to go (and more fun too).
For a small kernel, you can do direct convolution like this:
// in, out are m x n images (integer data)
// K is the kernel size (KxK) - currently needs to be an odd number, e.g. 3
// coeffs[K][K] is a 2D array of integer coefficients
// scale is a scaling factor to normalise the filter gain
for (i = K / 2; i < m - K / 2; ++i) // iterate through image
{
for (j = K / 2; j < n - K / 2; ++j)
{
int sum = 0; // sum will be the sum of input data * coeff terms
for (ii = - K / 2; ii <= K / 2; ++ii) // iterate over kernel
{
for (jj = - K / 2; jj <= K / 2; ++jj)
{
int data = in[i + ii][j +jj];
int coeff = coeffs[ii + K / 2][jj + K / 2];
sum += data * coeff;
}
}
out[i][j] = sum / scale; // scale sum of convolution products and store in output
}
}
You can modify this to support even values of K - it just takes a little care with the upper/lower limits on the two inner loops.

I know it might be off topic but due to the similarity between C and JavaScript I believe it could still be helpful. PS.: Inspired by #Paul R answer.
Two dimensions 2D convolution algorithm in JavaScript using arrays
function newArray(size){
var result = new Array(size);
for (var i = 0; i < size; i++) {
result[i] = new Array(size);
}
return result;
}
function convolveArrays(filter, image){
var result = newArray(image.length - filter.length + 1);
for (var i = 0; i < image.length; i++) {
var imageRow = image[i];
for (var j = 0; j <= imageRow.length; j++) {
var sum = 0;
for (var w = 0; w < filter.length; w++) {
if(image.length - i < filter.length) break;
var filterRow = filter[w];
for (var z = 0; z < filter.length; z++) {
if(imageRow.length - j < filterRow.length) break;
sum += image[w + i][z + j] * filter[w][z];
}
}
if(i < result.length && j < result.length)
result[i][j] = sum;
}
}
return result;
}
You can check the full blog post at http://ec2-54-232-84-48.sa-east-1.compute.amazonaws.com/two-dimensional-convolution-algorithm-with-arrays-in-javascript/

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight