Matrix solving with C (within CUDA) - c

As part of a larger problem, I need to solve small linear systems (i.e NxN where N ~10) so using the relevant cuda libraries doesn't make any sense in terms of speed.
Unfortunately something that's also unclear is how to go about solving such systems without pulling in the big guns like GSL, EIGEN etc.
Can anyone point me in the direction of a dense matrix solver (Ax=B) in straight C?
For those interested, the basic structure of the generator for this section of code is:
ndarray=some.generator(N,N)
for v in range N:
B[v]=_F(v)*constant
for x in range N:
A[v,x]=-_F(v)*ndarray[x,v]
Unfortunately I have approximately zero knowledge of higher mathematics, so any advice would be appreciated.
UPDATE: I've been working away at this, and have a nearly-solution that runs but isn't working. Anyone lurking is welcome to check out what I've got so far on pastebin.
I'm using Crout Decomposition with Pivoting which seems to be the most general approach. The idea for this test is that every thread does the same work. Boring I know, but the plan is that the matrixcount variable is increased, actual data is put in, and each thread solves the small matrices individually.
Thanks for everyone who's been checking on this.
POST-ANSWER UPDATE: Finished the matrix solving code for CPU and GPU operation, check out my lazy-writeup here

CUDA won't help here, that's true. Matrices like that are just too small for it.
What you do to solve a system of linear equations is LU decomposition:
http://en.wikipedia.org/wiki/LU_decomposition
http://mathworld.wolfram.com/LUDecomposition.html
Or even better a QR decomposition with Householder reflections like in the Gram-Schmidt process.
http://en.wikipedia.org/wiki/QR_decomposition#Computing_the_QR_decomposition
Solving the linear equation becomes easy afterwards, but I'm afraid there always is some "higher mathematics" (linear algebra) involved. That, and there are many (many!) C libraries out there for solving linear equations. Doesn't seem like "big guns" to me.

Related

How to find a better algorithm to compute eigenvalue and eigenvector of a very large matrix

I have used Jacobi method to find all eigenvalues and eigenvectors in c code. Though the complexity of Jacobi method is O(n^3) but the dimension of my matrix is huge (17814 X 17814). It takes a lot of time. I want to know a better algorithm by which I can solve this problem. If you want I can attach my c code.
The algorithm suggested in the comments is not necessarily the best one.
As you can see here, the Jacobi method can be vastly faster when using special techniques.
On top of that, Jacobi is quite easy to run in parallel, and it's much faster for sparse matrices than for dense matrices so you can take advantage of that as well, depending on your architecture and the type of matrix you have.
I'd say that the best thing is to test a few different methods and see in practice where you can get the best results.
O(n^2.376) is not necessarily better than O(n^3) depending on constants.

Implementation of Image Convolution (Image Processing) in C

I am testing some convolution algorithms i found in some sites but none of them apply the matrix filters as it should.
I am writing a very simple 24 bits bmp library on my own, but now i need a little help with the convolution, i don't need FFT or complex algorithm, running time is not important at this time.
The last code i was testing was this: http://lodev.org/cgtutor/filtering.html But i didn't work fine.
Could some one indicate me a code or algorithm in C?.
Thank you very much.
You can have a look at this algorithm - this is the closest which i can find:
Convolution to blur the image
Know that the basic convolution algorithm is more or less the same, the affect changes only by the kernel values.
There is an open source C# library which provides methods to perform image convolution of simple filters. It would be an easy port to C.
The actual methods to perform convolution can be found here. The BitmapContext class is used to just wrap a pointer to bitmap. I believe in C# this is treated as int*, so this code is operating on 4 bytes at a time.
I created Image Convolution library for simple cases - https://github.com/RoyiAvital/Projects/tree/master/ImageConvolution.
It is pretty fast (OpenMP + SIMD).
Though I'm not an advanced programmer of something, just tried doing it to do first steps in utilizing SIMD.
Still, from what can be seen in VS 2015, the CPU utilization is pretty good.
If you have ideas to make it even faster, I will be happy.
Feel free to use it in any manner you'd like.

How much mxRealloc can affect a C-Mex matlab code?

For these days I was working on C-mex code in order to improve speed in DBSCAN matlab code. In fact, at the moment I finished a DBSCAN on C-mex. But instead, it takes more time (14.64 seconds in matlab, 53.39 seconds in C-Mex) with my test data which is a matrix 3 x 14414. I think this is due to the use of mxRealloc function in several parts of my code. Would be great that someone give me some suggestion with the aim to get better results.
Here is the code DBSCAN1.c:
https://www.dropbox.com/sh/mxn757a2qmniy06/PmromUQCbO
Using mxRealloc in every iteration of a loop is indeed a performance killer. You can use vector or similar class instead. Dynamic allocation is not needed at all in your distance function.
If your goal is not to implement DBSCAN as a mex but to speed it up, I will offer you a different solution.
I don't know which Matlab implementation are you using, but you won't make a trivial n^2 implementation much faster by just rewriting it to C in the same way. Most of the time is spent calculating the nearest neighbors which won't be faster in C than it is in Matlab. DBSCAN can run in nlogn time by using an index structure to get the nearest neighbors.
For my application, I am using this implementation of dbscan, but I have changed the calculation of nearest neighbors to use a KD-tree (available here). The speedup was sufficient for my application and no reimplementation was required. I think this will be faster than any n^2 c implementation no matter how good you write it.

How to use cepstral?

Recently I asked this question: How to get the fundamental frequency from FFT? (you don't actually need to read it)
My doubt right now it: how to use the cepstral algorithm?
I just don't know how to use it because the only language that I know is ActionScript 3, and for this reason I have few references about the native functions found in C, Java and so on, and how I should implement them on AS. Most articles are about these languages =/
(althought, answers in other languages than AS are welcome, just explain how the script works please)
The articles I found about cepstral to find the fundamental frequency of a FFT result told me that I should do this:
signal → FT → abs() → square → log → FT → abs() → square → power cepstrum
mathematically:
|F{log(|F{f(t)}|²)}|²
Important info:
I am developing a GUITAR TUNER in flash
This is the first time I am dealing with advanced sound
I am using an FFT to extract frequency bins from the signal that reaches user's microphone, but I got stuck in getting the fundamental frequency from it
I don't know:
How to apply a square in an ARRAY (I mean, the data that my FFT gives me is an array. Should I multiply it by itself? ActionScript's debug throws errors when I try to fftResults * fftResults)
How to apply the "log". I would not know how to apply it even if I had a single number.
What is the difference between complex cepstral and power cepstral. Also, what of them should I use? I am trying to develop a guitar tuner.
Thanks!
Note that the output of an FFT is an array of complex values, i.e. each bin = re + j*im. I think you can just combine the abs and square operations and calculate re*re + im*im for each bin. This gives you a single positive value for each bin, and obviously you can calculate the log value for each bin quite easily. You then need to do a second FFT on this log squared data and again using the output of this second FFT you will calculate re*re + im*im for each bin. You will then have an array of postive values which will have one or more peaks representing the fundamental frequency or frequencies of your input.
The autocorrelation is the easiest and most logical approach, and the best place to start.
To get this working, start with a simple autocorrelation, and then, if necessary, improve it following the outline provided by YIN. (YIN is based on the autocorrelation with refinements. But whether or not you'll need these refinements depends on details of your situation.) This way also, you can learn as you go rather than trying to understand the whole thing in one shot.
Although FFT approaches can also work, they are a bit more confusing. The issue is that what you are really after is the period, and this isn't well represented by the FFT. The missing fundamental is a good example of this, where if you have 2Hz and 3Hz, the fundamental is 1Hz, but is nowhere in the FFT, while 1Hz is obvious in a time based representation (e.g. the autocorrelation). Add to this that overtones aren't necessarily harmonic, and noise, etc... and all of these issues make it usually best to start with a direct approach to the problem.
There are many ways of finding fundamental frequency (F0).
For languages like Java etc there are many libraries with those type of algorithms already implemented (you can study their sources).
MFCC (based on cepstral) implemented in Comirva (Open source).
Audacity (beta version!) (Open source) presents cepstrum, autocorellation, enhanced autocorellation,
Yin based on autocorrelation (example )
Finding max signal values after FFT
All these algorithms may be be very helpful for you. However easiest way to get F0 (one value in Hz) would be to use Yin.

How does the Levenberg–Marquardt algorithm work in detail but in an understandable way?

Im a programmer that wants to learn how the Levenberg–Marquardt curvefitting algorithm works so that i can implement it myself. Is there a good tutorial anywhere that can explain how it works in detail with the reader beeing a programmer and not a mathemagician.
My goal is to implement this algorithm in opencl so that i can have it run hardware accelerated.
Minimizing a function is like trying to find lowest point on a surface. Think of yourself walking on a hilly surface and that you are trying to get to the lowest point. You would find the direction that goes downhill and walk until it doesn't go downhill anymore. Then you would chose a new direction that goes downhill and walk in that direction until it doesn't go downhill anymore, and so on. Eventually (hopefully) you would reach a point where no direction goes downhill anymore. You would then be at a (local) minimum.
The LM algorithm, and many other minimization algorithms, use this scheme.
Suppose that the function being minimized is F and we are at the point x(n) in our iteration. We wish to find the next iterate x(n+1) such that F(x(n+1)) < F(x(n)), i.e. the function value is smaller. In order to chose x(n+1) we need two things, a direction from x(n) and a step size (how far to go in that direction). The LM algorithm determines these values as follows -
First, compute a linear approximation to F at the point x(n). It is easy to find out the downhill direction of a linear function, so we use the linear approximating function to determine the downhill direction.
Next, we need to know how far we can go in this chosen direction. If our approximating linear function is a good approximation for F for a large area around x(n), then we can take a fairly large step. If it's a good approximation only very close to x(n), then we can take only a very small step.
This is what LM does - calculates a linear approximation to F at x(n), thus giving the downhill direction, then it figures out how big a step to take based on how well the linear function approximates F at x(n). LM figures out how good the approximating function is by basically taking a step in the direction thus determined and comparing how much the linear approximation to F decreased to the how much the the actual function F decreased. If they are close, the approximating function is good and we can take a little larger step. If they are not close then the approximation function is not good and we should back off and take a smaller step.
Try http://en.wikipedia.org/wiki/Levenberg–Marquardt_algorithm
PDF Tutorial from Ananth Ranganathan
JavaNumerics has a pretty readable implementation
The ICS has a C/C++ implementation
The basic ideas of the LM algorithm can be explained in a few pages - but for a production-grade implementation that is fast and robust, many subtle optimizations are necessary. State of the art is still the Minpack implementation by Moré et al., documented in detail by Moré 1978 (http://link.springer.com/content/pdf/10.1007/BFb0067700.pdf) and in the Minpack user guide (http://www.mcs.anl.gov/~more/ANL8074b.pdf). To study the code, my C translation (https://jugit.fz-juelich.de/mlz/lmfit) is probably more accessible than the original Fortran code.
Try Numerical Recipes (Levenberg-Marquardt is in Section 15.5). It's available online, and I find that they explain algorithms in a way that's detailed (they have complete source code, how much more detailed can you get...), yet accessible.
I used these notes from a course at Purdue University to code up a generic Levenberg-Marquardt curve-fitting algorithm in MATLAB that computes numerical derivatives and therefore accepts any function of the form f(x;p) where p is a vector of fitting parameters.

Resources