I'm looking to write a quick benchmark program that can be compiled and run on various machines. Rather than using commercially or freely available options, I'd rather have my own to play around with threading and algorithm optimization techniques.
I have a couple that I use already, which include recursively calculating the nth number of the Fibonacci sequence, and seeding and calling rand() a few thousand times.
Are there any other algorithms that are relatively simple, but at the same time computationally-intensive (and possibly math-related)?
(Note that these operations will be implemented in the C language.)
The Ackermann function is usually a fun one, but don't give it very large inputs if you want it to finish in your lifetime.
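For reference, the naive recursive definition is only a few lines of C; the first argument m is what makes the cost explode:

/* Classic recursive Ackermann function. ackermann(3, 10) is already a
 * noticeable amount of work; ackermann(4, 2) will not finish in your lifetime. */
unsigned long ackermann(unsigned long m, unsigned long n)
{
    if (m == 0) return n + 1;
    if (n == 0) return ackermann(m - 1, 1);
    return ackermann(m - 1, ackermann(m, n - 1));
}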
Fractals (at various resolutions); there is plenty of fractal source in C (without OpenGL) out there.
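As a minimal sketch, an escape-time Mandelbrot loop in pure C (ASCII output, no OpenGL) is enough to keep a CPU busy; raise the grid size and iteration cap to make it as heavy as you like:

/* Escape-time Mandelbrot rendered as ASCII. Increase WIDTH/HEIGHT/MAX_ITER
 * to turn this from a toy into a benchmark. */
#include <stdio.h>

#define WIDTH    120
#define HEIGHT   40
#define MAX_ITER 1000

int main(void)
{
    for (int py = 0; py < HEIGHT; py++) {
        for (int px = 0; px < WIDTH; px++) {
            double cr = -2.5 + 3.5 * px / (WIDTH - 1);    /* map pixel to the plane */
            double ci = -1.25 + 2.5 * py / (HEIGHT - 1);
            double zr = 0.0, zi = 0.0;
            int it = 0;
            while (zr * zr + zi * zi <= 4.0 && it < MAX_ITER) {
                double tmp = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = tmp;
                it++;
            }
            putchar(it == MAX_ITER ? '#' : ' ');
        }
        putchar('\n');
    }
    return 0;
}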
I know you said you wanted to make your own, but perhaps you could draw upon existing benchmarks for inspiration. The Computer Language Benchmarks Game has run many programming languages through a set of benchmarks. Perhaps you can get some ideas looking at their benchmarks.
Some quick ideas off the top of my head:
Matrix multiplication: multiplying two large matrices is relatively computationally intensive, though you will have to take caching into account (a sketch follows below)
Generating prime numbers
Integer factorization
Numerical methods for solving ODEs - Runge-Kutta, for example
Inverting big matrices
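For the matrix multiplication item, a naive sketch looks like the following; the (i, k, j) loop order keeps the innermost accesses sequential for row-major C arrays, which is exactly where the caching concern starts to bite:

/* Naive O(n^3) matrix multiplication of two n x n matrices stored as flat
 * row-major arrays: C = A * B. */
void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0;

    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];          /* reused across the inner loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}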
You could calculate big primes or factorize integers.
Take a look at the NAS Parallel Benchmarks. These were originally written by NASA in Fortran for supercomputers using MPI (and are still available that way), but there are also C, Java, and OpenMP implementations available now.
Most of these are very computationally intensive, as they're intended to be representative of numerical algorithms used in scientific computing.
Try calculating thousands or millions of digits of pi. There are quite a few formulas for that task.
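One self-contained option (a sketch, not a tuned implementation) is the Rabinowitz-Wagon spigot algorithm, which emits decimal digits of pi using only integer arithmetic; it is O(n^2) in the digit count, so it suits thousands of digits rather than millions:

/* Rabinowitz-Wagon spigot: prints the first ndigits decimal digits of pi
 * (as a raw digit stream, no decimal point) using only integer arithmetic. */
#include <stdio.h>
#include <stdlib.h>

static void pi_spigot(int ndigits)
{
    int len = ndigits * 10 / 3 + 2;           /* mixed-radix working array */
    int *a = malloc(len * sizeof *a);
    for (int i = 0; i < len; i++) a[i] = 2;   /* pi = 2 + 1/3*(2 + 2/5*(2 + ...)) */

    int predigit = 0, nines = 0, started = 0;
    for (int j = 0; j < ndigits; j++) {
        /* Multiply the mixed-radix number by 10 and renormalize from the right. */
        int carry = 0;
        for (int i = len - 1; i > 0; i--) {
            int x = 10 * a[i] + carry;
            a[i]  = x % (2 * i + 1);
            carry = (x / (2 * i + 1)) * i;
        }
        int x = 10 * a[0] + carry;
        a[0] = x % 10;
        int q = x / 10;                        /* provisional next digit */

        /* Hold back 9s: a later carry may still turn them into 0s. */
        if (q == 9) {
            nines++;
        } else if (q == 10) {
            printf("%d", predigit + 1);
            while (nines-- > 0) printf("0");
            nines = 0;
            predigit = 0;
        } else {
            if (started) printf("%d", predigit);
            started = 1;
            while (nines-- > 0) printf("9");
            nines = 0;
            predigit = q;
        }
    }
    printf("%d\n", predigit);
    free(a);
}

int main(void)
{
    pi_spigot(1000);
    return 0;
}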
There are some really nice ones in Project Euler; they are all math-related and can be made as time-consuming as you want by using larger input values.
Finding prime numbers is considered quite time-consuming.
Check out the benchmarks from the language shootout: http://shootout.alioth.debian.org/
However: benchmarks are only benchmarks; they don't necessarily tell you a lot about the real world and can, on the contrary, be misleading.
If you want to try parallelism, do lots of matrix math. The size of the matrices you can use will be limited by memory, but you can do as many iterations as you want.
This will stress the SIMD instructions that modern CPUs come with.
This does a lot of addition (shrink the loop bounds if you actually want it to finish):
#include <limits.h>
#include <stdio.h>
int main(void)
{
    unsigned long long c = 0;   /* a plain int would overflow long before this finishes */
    for (int n = 0; n < INT_MAX; n++)
        for (int m = 0; m < INT_MAX; m++)
            c++;
    printf("%llu\n", c);
}
You could try a tsort (Turbo Sort) with a very large input set. I understand this to be a common operation.
Heuristics for NP-complete problems are a fun way to get some CPU-intensive code. You could code a "solution" :) for one of Karp's NP-complete problems.
Related
I'm running the simulator with MATLAB.
However, it takes a few days.
Hence, I decided to change the code into C.
(First, I tried to use c-mex in MATLAB, but I found coding and debugging very hard. mxType!!!?!?!? So I decided to write the C code using Visual Studio 2017.)
In my MATLAB code, I used
x = [unifrnd(varmin(1),varmax(1),varnum,1),...
unifrnd(varmin(2),varmax(2),varnum,1),...
unifrnd(varmin(3),varmax(3),varnum,1)];
That is, x is a matrix of size varnum*3 whose 1st column contains random numbers uniformly distributed between varmin(1) and varmax(1), whose 2nd column contains random numbers uniformly distributed between varmin(2) and varmax(2), and whose 3rd column contains random numbers uniformly distributed between varmin(3) and varmax(3).
When I make the matrix in C, I would write something like the following:
srand(time(NULL));
for (j = 0; j < 3; j++) {
    for (i = 0; i < varnum; i++) {
        /* uniform double in [varmin[j], varmax[j]], matching unifrnd */
        x[i][j] = varmin[j] + (varmax[j] - varmin[j]) * ((double)rand() / RAND_MAX);
    }
}
I am changing my code from MATLAB to C because the running time is so long. However, I think MATLAB handles a matrix all at once, while C handles a matrix of size MxN by running M*N iterations.
Nonetheless, is the C code faster than the MATLAB code above?
Also, is there a better way to build a matrix of random numbers in C?
Both MATLAB and C will run the random generation M*N times and you can't really get around that. In C it might be a bit more explicit, but in the end it's just an issue of typing rather than actual speed.
I would say that MATLAB is extremely fast if you use it the right way. That means leaning a lot on the built-in matrix operations and vectorizing the others as much as possible. The fact that it automatically parallelizes the computation for you is just icing on the cake.
If on the other hand you're doing a lot of manual for loops over large matrices you're likely to have problems. Mind you, newer versions of MATLAB know how to JIT that sort of code as well, but it's just not going to be as fast as a hyper-tuned builtin method.
So I'd avoid going outside MATLAB to C. At most do just some small functions in C where MATLAB really doesn't help you, but leave the rest in MATLAB.
Unless you really know what you're doing, you're unlikely to see a performance increase. In fact, I'll wager that you'll see a decrease in performance. And a guaranteed increase in development cost.
If resources permit I'd investigate getting a bigger machine for the team, or renting some compute resources from AWS/GCP. Though the logistics of getting MATLAB to work there might be prohibitive.
I am computing a statistical model in MATLAB which has to run about 200 Kalman filters per iteration, and I want to iterate the model at least 10,000 times, which means running the filter at least 2,000,000 times. I am therefore searching for a way to optimize the computational speed of MATLAB on this part. I have already gone operation by operation to try to optimize the computation in MATLAB, using all the tricks that could be used, but I would like to go further...
I am not familiar with C/C++, but I read that mex-files could be useful in some cases. Could anyone tell me whether it would be worth going in this direction?
Thanks...
Writing mex files will definitely speed up the entire process, but you will not be able to use a lot of the built-in MATLAB functions. You are limited to what you can do in C++ and C. Of course you can write your own functions, as long as you know how to.
The speed increase mainly comes from the fact the mex files are compiled and not interpreted line by line as are standard MATLAB scripts. Once you compile the mex you can call it the same way you do any other MATLAB function.
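To give a feel for the shape of a mex file, here is a minimal gateway sketch (the doubling operation and the file name times_two.c are just placeholders); compile it from the MATLAB prompt with mex times_two.c and call it like any other function:

/* Minimal MEX gateway: takes one double matrix and returns it scaled by 2. */
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]))
        mexErrMsgTxt("Expected one double matrix as input.");

    mwSize m = mxGetM(prhs[0]);
    mwSize n = mxGetN(prhs[0]);
    const double *in = mxGetPr(prhs[0]);

    plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);
    double *out = mxGetPr(plhs[0]);

    for (mwSize k = 0; k < m * n; k++)
        out[k] = 2.0 * in[k];      /* placeholder computation */
}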
For a class I took in college I had to write my own image scaling function, I had initially written it in a standard script and it would take a couple seconds to complete on large images, but when I wrote it in C in a mex it would complete in less than 0.1 seconds.
MEX Files Documentation
If you are not familiar at all with C/C++ this will be hard going. Hopefully you have some experience with another language besides MATLAB? You can try learning/copying from the many included examples, but you'll really need to figure out the basics first.
One thing in particular. If you use mex you'll need some way to obtain decent random numbers for your Kalman filter noise. You might be surprised, but a very large percentage of your calculation time will likely be in generating random numbers for noise (depending on the complexity of your filter it could be > 50%).
Don't use the default random number generators in C/C++.
These are not suitable for scientific computation, especially when generating vast numbers of values as you seem to need. Your first option is to pass in a large array of random numbers generated via randn in Matlab to your mex code. Or look into including the C code Mersenne Twister algorithm itself and find/implement a scheme for generating normal random numbers from the uniform ones (log-polar is simplest, but Ziggurat will be faster). This is not too hard. I've done it myself and the Double precision SIMD-oriented Fast Mersenne Twister (dSFMT) is actually 2+ times faster than Matlab's current implementation for uniform variates.
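A sketch of the polar approach might look like this; rand() appears only as a stand-in for whatever good uniform generator (dSFMT, say) you actually plug in:

/* Marsaglia polar method: turns pairs of uniform variates into standard
 * normal variates. Replace uniform01() with a proper generator for real use. */
#include <math.h>
#include <stdlib.h>

static double uniform01(void)               /* placeholder uniform in (0, 1) */
{
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

double randn(void)
{
    static int have_spare = 0;
    static double spare;

    if (have_spare) { have_spare = 0; return spare; }

    double u, v, s;
    do {
        u = 2.0 * uniform01() - 1.0;
        v = 2.0 * uniform01() - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);         /* reject points outside the unit disc */

    double factor = sqrt(-2.0 * log(s) / s);
    spare = v * factor;
    have_spare = 1;
    return u * factor;
}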
You could use parfor loops or the Parallel Computing Toolbox in general to speed up your calculations. Have you already checked whether MATLAB is using 100% CPU?
How would I improve the efficiency of the standard matrix addition algorithm?
The matrix is represented by a 2D array and is added sequentially.
I am not going to read all your code. As far as I can see, this is the addition part:
for (i = 0; i < r1; i++)
    for (j = 0; j < c1; j++)
        C[i][j] = A[i][j] + B[i][j];
I don't think this can be improved complexity-wise. As for other types of microoptimizations such as doing a ++i instead of i++ or changing the order of the loops etc. - I think you shouldn't care about these until you've run a profiler which shows you that these are your performance bottlenecks. Remember, premature optimization is the root of all evil :)
The naive double for loop is pretty close to optimal for portable code, so long as you get your two for loops in the right order. You need to be accessing the memory sequentially to get best performance.
You could unroll the loops but this won't make very much difference to performance.
If you want best performance then don't write it yourself and instead use a BLAS that has been optimised for your platform.
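To make the loop-order point concrete, here is a small sketch: both functions compute the same result, but the first walks memory sequentially for row-major C arrays while the second strides a whole row between consecutive accesses:

#define N 1024
static double A[N][N], B[N][N], C[N][N];

void add_fast(void)               /* j innermost: consecutive memory accesses */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}

void add_slow(void)               /* i innermost: jumps N doubles per access */
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            C[i][j] = A[i][j] + B[i][j];
}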
You can try using the GPU instead of the CPU for the intensive operations. You can use AMP for this.
We are looking for exemplar problems and codes that will run on any or all of shared memory, distributed memory, and GPGPU architectures. The reference platform we are using is LittleFe (littlefe.net), an open-design, low cost educational cluster currently with six dual core CPUs, each with an nVidia chipset.
These problems and solutions will be good for teaching parallelism to any newbie by providing working examples and opportunities to roll up your sleeves and code. Stackoverflow experts have good insight and are likely to have some favorites.
Calculating area under a curve is interesting, simple and easy to understand, but there are bound to be ones that are just as easily expressed and chock full of opportunities to practice and learn.
Hybrid examples using more than one of the memory architectures are most desirable, and reflective of where parallel programming seems to be trending.
On LittleFe we have predominantly been using three applications. The first is an analysis of optimal targets on a dartboard, which is highly parallel with little communication overhead. The second is Conway's Game of Life, which is typical of problems sharing boundary conditions. It has a moderate communication overhead. The third is an n-body model of galaxy formation, which requires heavy communication overhead.
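Of those three, the Game of Life has the most compact kernel; a serial sketch of one generation is below (the boundary handling here is what a distributed version would replace with halo exchange between neighbouring ranks):

/* One generation of Conway's Game of Life on a W x H grid with dead borders. */
#define W 512
#define H 512

void life_step(unsigned char cur[H][W], unsigned char next[H][W])
{
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int n = 0;                               /* live neighbours */
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    if (dx == 0 && dy == 0) continue;
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < H && xx >= 0 && xx < W)
                        n += cur[yy][xx];
                }
            next[y][x] = (n == 3) || (cur[y][x] && n == 2);
        }
    }
}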
The CUDA programming guide (PDF) contains a detailed analysis of the implementation of matrix multiplication on a GPU. That seems to be the staple "hello world" example for learning GPU programming.
Furthermore, the CUDA SDK contains dozens of other well-explained examples of GPU programming in CUDA and OpenCL. My favorite is the colliding balls example (a demo with a few thousand balls colliding in real time).
Two of my favorites are numerical integration and finding prime numbers. For the first we code the midpoint rectangle rule on the function f(x) = 4.0 / (1.0 + x*x). Integrating the function between 0 and 1 gives an approximation of the constant pi, which makes checking the correctness of the answer easy. The parallelism is across the range of the integration (computing the areas of the rectangles).
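A minimal shared-memory sketch of that exercise (using OpenMP here as one possible backend) looks like this; the reduction over the partial sums is the only coordination the threads need:

/* Midpoint rectangle rule for f(x) = 4/(1+x^2) on [0,1]; the sum approximates pi.
 * Compile with -fopenmp (gcc/clang) to enable the parallel loop. */
#include <stdio.h>

int main(void)
{
    const long n = 100000000L;          /* number of rectangles */
    const double h = 1.0 / (double)n;   /* width of each rectangle */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = ((double)i + 0.5) * h;    /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.15f\n", sum * h);
    return 0;
}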
For the second, we input an integer range and then identify and save the prime numbers in that range. We use brute-force division of each value by all possible factors; if any divisor is found that is not 1 or the number itself, then the value is composite. If a prime is found, count it and store it in a shared array. The parallelism comes from dividing up the range, since testing N for primality is independent of testing M. There is some trickiness needed to share the prime store between threads or to gather distributed partial answers.
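A stripped-down sketch of the prime exercise (counting instead of storing, to sidestep the shared-array bookkeeping) might look like this, again with OpenMP as an example backend:

/* Brute-force primality testing over [2, limit); each value is independent,
 * so the outer loop parallelizes trivially. Compile with -fopenmp. */
#include <stdio.h>
#include <stdbool.h>

static bool is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

int main(void)
{
    const long limit = 10000000L;
    long count = 0;

    #pragma omp parallel for reduction(+:count) schedule(dynamic, 1000)
    for (long n = 2; n < limit; n++)
        if (is_prime(n)) count++;

    printf("%ld primes below %ld\n", count, limit);
    return 0;
}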
These are very basic and simple problems to solve, which allows students to focus on the parallel implementation and not so much on the computation involved.
One of the more complex but easy example problems is the BLAS routine sgemm or dgemm (C = alpha * A x B + beta * C) where A, B, C are matrices of valid sizes and alpha and beta are scalars. The types may be single precision floating point (sgemm) or double precision floating point (dgemm).
The implementation of this simple routine on different platforms and architectures offers some insight into how they work. For more details on BLAS and the ?gemm routine, have a look at http://www.netlib.org/blas.
Just keep in mind that, for a double-precision implementation on the GPU, the GPU needs to have double-precision capability.
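As a point of reference, a naive row-major dgemm in C is only a dozen lines; real BLAS implementations add blocking, vectorization, column-major storage, and leading-dimension/transpose arguments on top of this:

/* Reference dgemm sketch: C = alpha*A*B + beta*C, with A (m x k), B (k x n)
 * and C (m x n) stored row-major as flat arrays. */
void dgemm_ref(int m, int n, int k,
               double alpha, const double *A, const double *B,
               double beta, double *C)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}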
As part of a larger problem, I need to solve small linear systems (i.e. NxN where N ~ 10), so using the relevant CUDA libraries doesn't make any sense in terms of speed.
Unfortunately something that's also unclear is how to go about solving such systems without pulling in the big guns like GSL, EIGEN etc.
Can anyone point me in the direction of a dense matrix solver (Ax=B) in straight C?
For those interested, the basic structure of the generator for this section of code is:
ndarray = some.generator(N, N)
for v in range(N):
    B[v] = _F(v) * constant
    for x in range(N):
        A[v, x] = -_F(v) * ndarray[x, v]
Unfortunately I have approximately zero knowledge of higher mathematics, so any advice would be appreciated.
UPDATE: I've been working away at this, and have a near-solution that runs but isn't working correctly. Anyone lurking is welcome to check out what I've got so far on pastebin.
I'm using Crout Decomposition with Pivoting which seems to be the most general approach. The idea for this test is that every thread does the same work. Boring I know, but the plan is that the matrixcount variable is increased, actual data is put in, and each thread solves the small matrices individually.
Thanks for everyone who's been checking on this.
POST-ANSWER UPDATE: Finished the matrix solving code for CPU and GPU operation, check out my lazy-writeup here
CUDA won't help here, that's true. Matrices like that are just too small for it.
What you do to solve a system of linear equations is LU decomposition:
http://en.wikipedia.org/wiki/LU_decomposition
http://mathworld.wolfram.com/LUDecomposition.html
Or, even better, a QR decomposition, e.g. using Householder reflections or the Gram-Schmidt process.
http://en.wikipedia.org/wiki/QR_decomposition#Computing_the_QR_decomposition
Solving the linear equation becomes easy afterwards, but I'm afraid there always is some "higher mathematics" (linear algebra) involved. That, and there are many (many!) C libraries out there for solving linear equations. Doesn't seem like "big guns" to me.
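For N ~ 10, plain Gaussian elimination with partial pivoting (which is LU decomposition in disguise) is entirely adequate; a self-contained sketch in straight C:

/* Solve A x = b in place via Gaussian elimination with partial pivoting.
 * A is n*n row-major and is destroyed; b is overwritten with the solution x.
 * Returns 0 on success, -1 if the matrix looks (numerically) singular. */
#include <math.h>

int solve_dense(int n, double *A, double *b)
{
    for (int k = 0; k < n; k++) {
        /* Partial pivoting: largest |A[i][k]| on or below the diagonal. */
        int piv = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i * n + k]) > fabs(A[piv * n + k])) piv = i;
        if (fabs(A[piv * n + k]) < 1e-12) return -1;

        if (piv != k) {                          /* swap rows k and piv */
            for (int j = 0; j < n; j++) {
                double t = A[k * n + j]; A[k * n + j] = A[piv * n + j]; A[piv * n + j] = t;
            }
            double t = b[k]; b[k] = b[piv]; b[piv] = t;
        }

        /* Eliminate the entries below the pivot. */
        for (int i = k + 1; i < n; i++) {
            double f = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; j++) A[i * n + j] -= f * A[k * n + j];
            b[i] -= f * b[k];
        }
    }

    /* Back substitution on the upper-triangular system. */
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++) s -= A[i * n + j] * b[j];
        b[i] = s / A[i * n + i];
    }
    return 0;
}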