Performing in-place calculations with Commons Math's RealMatrix? - apache-commons-math

I am using the Apache Commons Math library to perform some calculations and I need to accumulate a number of matrices. The RealMatrix interface seems designed to return results, rather than to store results in-place.
Should I just make do with creating lots of short-lived matrices returned from add() while I accumulate the values? Is there a better alternative?
Were in-place functions intentionally left out for some reason?

Related

Computational Efficiency of Forward Mode Automatic vs Numeric vs Symbolic Differentiation

I am trying to solve a problem of finding the roots of a function using the Newton-Raphson (NR) method in the C language. The functions in which I would like to find the roots are mostly polynomial functions but may also contain trigonometric and logarithmic.
The NR method requires finding the differential of the function. There are 3 ways to implement differentiation:
Symbolic
Numerical
Automatic (with sub types being forward mode and reverse mode. For this particular question, I would like to focus on forward mode)
I have thousands of these functions all requiring finding roots in the quickest time possible.
From the little that I do know, automatic differentiation is in general quicker than symbolic because it handles the problem of "expression swell" a lot more efficiently.
My question therefore is, all other things being equal, which method of differentiation is more computationally efficient: Automatic Differentiation (and more specifically, forward mode) or Numeric differentiation?
If your functions are truly all polynomials, then symbolic derivatives are dirt simple. Let the coefficients of the polynomial be stored in an array with entries p[k] = a_k, where index k corresponds to the coefficient of x^k; the derivative is then represented by the array with entries dp[k] = (k+1) p[k+1]. For multivariable polynomials, this extends straightforwardly to multidimensional arrays. If your polynomials are not in standard form, e.g. if they include terms like (x-a)^2 or ((x-a)^2-b)^3 or whatever, a little bit of work is needed to transform them into standard form, but this is something you probably should be doing anyway.
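As a rough sketch of the coefficient-array idea above (the function names poly_eval, poly_derivative and poly_newton are made up for illustration, and the stopping rule is only one reasonable choice), the derivative array and a Newton-Raphson loop might look like this in C:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Evaluate p(x) = p[0] + p[1]*x + ... + p[n-1]*x^(n-1) with Horner's rule. */
double poly_eval(const double *p, size_t n, double x)
{
    double y = 0.0;
    for (size_t k = n; k-- > 0; )
        y = y * x + p[k];
    return y;
}

/* Fill dp[k] = (k+1) * p[k+1] for k = 0 .. n-2.
 * dp must have room for n-1 entries. */
void poly_derivative(const double *p, size_t n, double *dp)
{
    assert(n >= 1);
    for (size_t k = 0; k + 1 < n; ++k)
        dp[k] = (double)(k + 1) * p[k + 1];
}

/* Newton-Raphson on a polynomial with n coefficients p and its n-1
 * derivative coefficients dp. Stops when the step is below tol, when the
 * derivative is exactly zero, or after max_iter iterations.
 * Returns the last iterate; convergence is NOT guaranteed. */
double poly_newton(const double *p, const double *dp, size_t n,
                   double x0, double tol, int max_iter)
{
    double x = x0;
    for (int i = 0; i < max_iter; ++i) {
        double fx  = poly_eval(p, n, x);
        double dfx = poly_eval(dp, n - 1, x);
        if (dfx == 0.0)
            break;                 /* flat spot: Newton step undefined */
        double step = fx / dfx;
        x -= step;
        if (fabs(step) < tol)
            break;
    }
    return x;
}
```

For example, with p = {-2, 0, 1} (that is, x^2 - 2), poly_derivative yields dp = {0, 2}, and poly_newton started at x0 = 1 approaches the square root of 2.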
If the derivative is not available, you should consider using the secant or regula falsi methods. They have very decent convergence speed (order φ ≈ 1.618 for the secant method, instead of quadratic). An additional benefit of regula falsi is that the iterates remain confined to the initial interval, which allows reliable root separation (which Newton does not).
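A minimal sketch of the secant method mentioned above, in C. The names scalar_fn, secant_root and example_cubic are hypothetical, and the test function x^3 - 2x - 5 is only an illustrative choice, not something from the question:

```c
#include <math.h>

/* Pointer to a scalar function of one variable. */
typedef double (*scalar_fn)(double);

/* Hypothetical test function: f(x) = x^3 - 2x - 5 (root near 2.0946). */
double example_cubic(double x)
{
    return x * x * x - 2.0 * x - 5.0;
}

/* Secant iteration starting from two guesses x0 and x1.
 * Uses one new function evaluation per iteration and no derivative.
 * Stops when successive iterates differ by less than tol, when the
 * secant is horizontal, or after max_iter iterations.
 * Returns the last iterate; convergence is NOT guaranteed. */
double secant_root(scalar_fn f, double x0, double x1,
                   double tol, int max_iter)
{
    double f0 = f(x0);
    double f1 = f(x1);
    for (int i = 0; i < max_iter; ++i) {
        double denom = f1 - f0;
        if (denom == 0.0)
            break;                 /* horizontal secant: no update possible */
        double x2 = x1 - f1 * (x1 - x0) / denom;
        x0 = x1;
        f0 = f1;
        x1 = x2;
        f1 = f(x2);                /* the only new evaluation this step */
        if (fabs(x1 - x0) < tol)
            break;
    }
    return x1;
}
```

Calling secant_root(example_cubic, 2.0, 3.0, 1e-12, 100) should converge to the real root of the cubic. Unlike regula falsi, this variant does not keep the root bracketed.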
Also note that in the case of numerical evaluation of the derivatives, you will require several evaluations of the function per iteration, at best two of them. The effective convergence order per function evaluation then drops to √2, which is outperformed by the derivative-free methods.
Also note that the symbolic expression of the derivatives is often more costly to evaluate than the functions themselves. So one iteration of Newton costs at least two function evaluations, spoiling the benefit of the convergence rate.
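Since the question asks specifically about forward-mode automatic differentiation, here is a minimal sketch of the idea using dual numbers in C. The type and function names (dual, dual_var, dual_mul, and so on) are invented for this illustration, only the arithmetic needed for a polynomial is shown, and the example function is an arbitrary choice:

```c
/* A dual number carries a value and the derivative of that value
 * with respect to the chosen input variable. */
typedef struct {
    double val;
    double der;
} dual;

/* The independent variable x: dx/dx = 1. */
dual dual_var(double x)
{
    dual d = { x, 1.0 };
    return d;
}

/* A constant c: dc/dx = 0. */
dual dual_const(double c)
{
    dual d = { c, 0.0 };
    return d;
}

dual dual_add(dual a, dual b)
{
    dual d = { a.val + b.val, a.der + b.der };
    return d;
}

dual dual_sub(dual a, dual b)
{
    dual d = { a.val - b.val, a.der - b.der };
    return d;
}

/* Product rule: (ab)' = a'b + ab'. */
dual dual_mul(dual a, dual b)
{
    dual d = { a.val * b.val, a.der * b.val + a.val * b.der };
    return d;
}

/* Hypothetical example: f(x) = x^3 - 2x - 5, so f'(x) = 3x^2 - 2.
 * One pass computes both f(x) and f'(x). */
dual example_f(dual x)
{
    dual x3   = dual_mul(dual_mul(x, x), x);
    dual twox = dual_mul(dual_const(2.0), x);
    return dual_sub(dual_sub(x3, twox), dual_const(5.0));
}
```

Evaluating example_f(dual_var(2.0)) gives val = -1 and der = 10 in a single pass, which is exactly the pair a Newton step needs. Supporting trigonometric or logarithmic terms would require adding the corresponding chain-rule functions.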

Efficient way to perform tensor products in Fortran

I need to perform some tensor products and contractions on some large arrays in Fortran. Sometimes they are vectors or matrices and sometimes some of the objects involved are 3-arrays or 4-arrays.
Of course, it is very easy to write a subroutine achieving this with some nested loops, and that's just what I've done. But I have to call this subroutine with all its loops a lot of times for very large arrays, and I was just wondering whether there is some optimized function or subroutine implemented in Fortran which I could benefit from.
Last time I looked (about a year ago) I did not find a high performance general purpose tensor product library in Fortran. I think one of the reasons for this might be Fortran's cumbersome way of resizing arrays, which is a constant requirement when dealing with tensors.
If you only need multiplication, you might be able to get away with using your own code. However, if you need high performance or more general operations, I would highly recommend writing a C interface and using one of the excellent C++ libraries out there, which are probably already optimized for your type of application:
Physics:
http://itensor.org/
Machine learning:
https://github.com/tensorflow/tensorflow
These are only examples. For a more complete listing see:
Tensor multiplication library

Optimizing functions with GAs

Firstly, sorry if this is not the right Stack Exchange site for this question; it might fit better on Math.
I've been working on a project to maximize a function's output using a GA. However, from the limited calculus I know, I thought there were methods to find the maximum of a mathematical function using calculus? I'd assume the reason GAs are sometimes used to maximize functions is that there are functions where the mathematical methods don't work. I wondered what conditions those were? Maybe that it's not continuous or differentiable?
A superficial explanation
For simple™ mathematical functions, the solution would be to use your calculus and find the derivative f'(x). If it's not mathematically possible to differentiate the error function f(x), you need to break out the other tools from your math-box. If the error function's solution space is convex, you could possibly use a numerical approach to find your optimum, such as gradient descent or the conjugate gradient algorithm.
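To make the numerical approach concrete, here is a rough sketch of fixed-step gradient descent in C. Everything named here (gradient_fn, example_gradient, gradient_descent) is hypothetical, and the convex test function f(x, y) = (x - 1)^2 + 2 (y + 3)^2 is just an assumed example whose minimum at (1, -3) is known in advance:

```c
#include <stddef.h>

/* Callback that writes the gradient of the objective at x into grad. */
typedef void (*gradient_fn)(const double *x, double *grad, size_t n);

/* Hypothetical convex example: f(x, y) = (x - 1)^2 + 2 (y + 3)^2.
 * Its gradient is (2 (x - 1), 4 (y + 3)). */
void example_gradient(const double *x, double *grad, size_t n)
{
    (void)n;                       /* this example is fixed at n = 2 */
    grad[0] = 2.0 * (x[0] - 1.0);
    grad[1] = 4.0 * (x[1] + 3.0);
}

/* Fixed-step gradient descent. x is updated in place; grad is scratch
 * space of length n. Stops when the gradient norm falls below tol or
 * after max_iter iterations. Returns the number of iterations used.
 * The step size must be small enough for the problem at hand. */
int gradient_descent(gradient_fn grad_f, double *x, double *grad, size_t n,
                     double step, double tol, int max_iter)
{
    for (int it = 0; it < max_iter; ++it) {
        grad_f(x, grad, n);
        double norm2 = 0.0;
        for (size_t i = 0; i < n; ++i)
            norm2 += grad[i] * grad[i];
        if (norm2 < tol * tol)     /* compare squared norms, no sqrt needed */
            return it;
        for (size_t i = 0; i < n; ++i)
            x[i] -= step * grad[i];
    }
    return max_iter;
}
```

Starting from (0, 0) with step = 0.1, the iterates contract toward (1, -3). On a non-convex function the same loop would only find whichever local optimum is downhill from the starting point, which is the weakness the rest of this answer is about.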
The Genetic Algorithm (and other search algorithms) comes in handy if the function you are trying to optimize consists of multiple undefined variables. This would make calculating the optimum using calculus very difficult. If you are familiar with Neural Networks: the genetic algorithm has been applied to find optimal weight configurations for neural networks. In these problem instances, there might be thousands of unknown variables (weights).
A mathematical approach would have to search the solution space incrementally, whereas the genetic algorithm is a bit "all over the place"™. By adjusting the mutation frequency, the GA is able to jump around in the search space.
[Figure: an (oversimplified) difficult solution space. Image: Ciumac Sergiu]
Well, for starters, you don't always have easily differentiable functions. You may especially have very high dimensional functions, which can be very difficult to differentiate.
Moreover, even if you do have a function you can differentiate, what you find are local optima, not global optima, and you may end up finding a great many of them-- potentially infinite numbers of them-- with no clear way to decide which are better than others.
While you may know enough about some particular function to be able to optimize it with calculus, there is no method guaranteed to get you the global optimum for any possible function. Hence we rely on a number of probabilistic techniques and heuristics, of which genetic algorithms are only one.

Any way to vectorize in C

My question may seem primitive or dumb because I've just switched to C.
I have been working with MATLAB for several years and I've learned that any computation should be vectorized in MATLAB and I should avoid any for loop to get an acceptable performance.
It seems that if I want to add two vectors, or multiply matrices, or do any other matrix computation, I should use a for loop.
I would appreciate it if you could let me know whether there is any way to do the computations in a vectorized sense, e.g. reading all elements of a vector using only one command and adding those elements to another vector using one command.
Thanks
MATLAB suggests that you avoid for loops because most of the operations available on vectors and matrices are already implemented in its API and ready to be used. They are probably optimized, and they work directly on the underlying data instead of working at the MATLAB language level, a sort of opaque implementation I guess.
Even MATLAB uses for loops underneath to implement most of its magic (or delegates them to highly specialized assembly instructions or through CUDA to the GPU).
What you are asking is not directly possible: you will need to use loops to work on vectors and matrices. What you should actually look for is a library that lets you do most of the work without writing for loops yourself, by calling already-defined functions that wrap them.
As was mentioned, it is not possible to hide the for loops. However, I doubt that the code MATLAB produces is in any way faster than the code produced by a C compiler. If you compile your C code with the -O3 flag, the compiler will try to use the hardware features your computer has available, such as SIMD extensions and multiple issue. Moreover, if your code is good, doesn't cause too many pipeline stalls, and makes good use of the cache, it will be really fast.
But I think what you are looking for are some libraries. Search Google for LAPACK or BLAS; they might be what you are looking for.
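As an illustration of the point about loops and -O3, here is a sketch of element-wise vector addition in plain C. The function name vec_add is made up; the restrict qualifiers (C99) are an assumption that the three arrays do not overlap, which is what typically lets an optimizing compiler emit SIMD instructions for a loop like this:

```c
#include <stddef.h>

/* out[i] = a[i] + b[i] for i = 0 .. n-1.
 * restrict promises the compiler that a, b and out do not alias,
 * so the loop is a straightforward candidate for auto-vectorization
 * when optimizations (e.g. -O3 with GCC or Clang) are enabled. */
void vec_add(size_t n, const double *restrict a,
             const double *restrict b, double *restrict out)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The call site then reads like a single "vectorized" command, vec_add(n, a, b, out), even though a loop runs underneath. Whether the compiler actually vectorizes it depends on the compiler, flags, and target hardware.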
In C there is no built-in way to perform operations in a vectorized way. You can use structures and functions to abstract away the details of the operations, but in the end you will always be using for loops to process your data.
As for speed, C is a compiled language and you will not take a performance hit from using for loops in C. C has the benefit (compared to MATLAB) that it does not hide anything from you, so you can always see where your time is being spent. On the downside, you will notice that things that MATLAB makes trivial (svd, cholesky, inv, cond, imread, etc.) are challenging in C.

Matrix operations in CUDA

What is the best way to organize matrix operations in CUDA (in terms of performance)?
For example, I want to calculate C * C^(-1) * B^T + C, C and B are matrices.
Should I write separate functions for multiplication, transposition and so on or write one function for the whole expression?
Which way is the fastest?
I'd recommend that you use the CUBLAS library. It's normally much faster and more reliable than anything you could write on your own. In addition, its API is similar to that of the BLAS library, which is the standard library for numerical linear algebra.
I think the answer depends heavily on the size of your matrices.
If you can fit a matrix in shared memory, I would probably use a single block to compute that and have it all inside a single kernel (probably a bigger one, where this computation is only a part of it). Hopefully, if you have more matrices and you need to compute the above expression several times, you can do it in parallel, utilising all of the GPU's computing power.
However, if your matrices are much bigger, you will want more blocks to compute that (check matrix multiplication example in CUDA manual). You need a guarantee that multiplication is finished by all blocks before you proceed with the next part of your equation, and if so, you will need a kernel call for each of your operations.
