Computational Efficiency of Forward Mode Automatic vs Numeric vs Symbolic Differentiation - c

I am trying to solve a problem of finding the roots of a function using the Newton-Raphson (NR) method in the C language. The functions whose roots I would like to find are mostly polynomials, but they may also contain trigonometric and logarithmic terms.
The NR method requires the derivative of the function. There are three ways to implement differentiation:
Symbolic
Numerical
Automatic (with the sub-types being forward mode and reverse mode; for this particular question, I would like to focus on forward mode)
I have thousands of these functions all requiring finding roots in the quickest time possible.
From the little that I do know, automatic differentiation is in general quicker than symbolic differentiation because it handles the problem of "expression swell" a lot more efficiently.
My question therefore is: all other things being equal, which method of differentiation is more computationally efficient, automatic differentiation (more specifically, forward mode) or numeric differentiation?

If your functions are truly all polynomials, then symbolic derivatives are dirt simple. If the coefficients of the polynomial are stored in an array with entries p[k] = a_k, where index k corresponds to the coefficient of x^k, then the derivative is represented by the array with entries dp[k] = (k+1) p[k+1]. For multivariable polynomials, this extends straightforwardly to multidimensional arrays. If your polynomials are not in standard form, e.g. if they include terms like (x-a)^2 or ((x-a)^2-b)^3 or whatever, a little bit of work is needed to transform them into standard form, but this is something you probably should be doing anyway.
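A minimal sketch of that idea in C (the coefficient layout, the example polynomial, and the Newton loop are illustrative assumptions, not part of the answer above):

#include <stdio.h>

/* Derivative of a polynomial in standard form: dp[k] = (k+1) * p[k+1].
   p has n coefficients (degree n-1); dp receives n-1 coefficients. */
static void poly_derivative(const double *p, double *dp, int n)
{
    for (int k = 0; k + 1 < n; k++)
        dp[k] = (k + 1) * p[k + 1];
}

/* Horner evaluation of a polynomial with n coefficients at x. */
static double poly_eval(const double *p, int n, double x)
{
    double y = 0.0;
    for (int k = n - 1; k >= 0; k--)
        y = y * x + p[k];
    return y;
}

int main(void)
{
    /* Example: f(x) = x^3 - 2x - 5, so f'(x) = 3x^2 - 2 */
    double p[]  = { -5.0, -2.0, 0.0, 1.0 };
    double dp[3];
    poly_derivative(p, dp, 4);

    double x = 2.0;                      /* Newton-Raphson from x0 = 2 */
    for (int i = 0; i < 10; i++)
        x -= poly_eval(p, 4, x) / poly_eval(dp, 3, x);
    printf("root ~ %.15f\n", x);
    return 0;
}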

If the derivative is not available, you should consider using the secant or regula falsi methods. They have very decent convergence speed (order φ ≈ 1.618 instead of quadratic). An additional benefit of regula falsi is that the iterates remain confined to the initial interval, which allows reliable root bracketing (which Newton does not).
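For illustration, a minimal secant-method sketch in C (the test function, starting points, and tolerance are assumptions for the example, not part of the answer):

#include <math.h>
#include <stdio.h>

/* Secant method: needs only function values, no derivative.
   Convergence order is the golden ratio (~1.618). */
static double secant(double (*f)(double), double x0, double x1,
                     double tol, int max_iter)
{
    double f0 = f(x0), f1 = f(x1);
    for (int i = 0; i < max_iter && fabs(f1) > tol; i++) {
        double x2 = x1 - f1 * (x1 - x0) / (f1 - f0);
        x0 = x1; f0 = f1;
        x1 = x2; f1 = f(x1);
    }
    return x1;
}

static double g(double x) { return x * x * x - 2.0 * x - 5.0; }

int main(void)
{
    printf("root ~ %.12f\n", secant(g, 2.0, 3.0, 1e-12, 100));
    return 0;
}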
Also note that in the case of numerical evaluation of the derivative, you will require several evaluations of the function, at best two of them. The effective convergence order per function evaluation then drops to √2, which is outperformed by the derivative-free methods.
Also note that the symbolic expression of the derivatives is often more costly to evaluate than the functions themselves. So one iteration of Newton costs at least two function evaluations, spoiling the benefit of the convergence rate.

Related

How to compare two Math library implementations?

As you know, the C standard library defines several standard function calls that should be implemented by any compliant implementation, e.g. Newlib, musl, glibc ...
If I am targeting Linux, for example, I have to choose between glibc and musl, and the criterion for me is the accuracy of the math library libm. How can I compare two possible implementations of, say, sin() or cos()?
A naive approach would be to test the output quality of both implementations on a set of randomly generated inputs against a reference implementation (from Matlab, for example), but is there a more reliable/formal/structured/guided way to compare or model the two? I tried to see if there is any research in this direction but found none, so any pointers are appreciated.
Some thoughts:
You can use the GNU Multiple Precision Arithmetic Library (GNU MP) to generate good reference results.
It is possible to test most, if not all, of the single-argument single-precision (IEEE-754 binary32) routines exhaustively. (For some of the macOS trigonometric routines, such as sinf, we tested one implementation exhaustively, verifying that it returned faithfully rounded results, meaning the result was the mathematical value [if representable] or one of the two adjacent values [if not]. Then, when changing implementations, we compared one to the other. If a new-implementation result was identical to the old-implementation result, it passed. Otherwise, GNU MP was used to test it. Since new implementations largely coincided with old implementations, this resulted in few invocations of GNU MP, so we were able to exhaustively test a new routine implementation in about three minutes, if I recall correctly.) A sketch of this kind of exhaustive comparison against a multiple-precision reference appears after these notes.
It is not feasible to test the multiple-argument or double-precision routines exhaustively.
When comparing implementations, you have to choose a metric, or several metrics. A library that has a good worst-case error is good for proofs; its bound can be asserted to hold true for any argument, and that can be used to derive further bounds in subsequent computations. But a library that has a good average error may tend to produce better results for, say, physics simulations that use large arrays of data. For some applications, only the errors in a “normal” domain may be relevant (angles around −2π to +2π), so errors in reducing large arguments (up to around 10^308) may be irrelevant because those arguments are never used.
There are some common points where various routines should be tested. For example, for the trigonometric routines, test at various fractions of π. Aside from being mathematically interesting, these tend to be where implementations switch between approximations, internally. Also test at large numbers that are representable but happen to be very near multiples of simple fractions of π. These are the worst cases for argument reduction and can yield huge relative errors if not done correctly. They require number theory to find. Testing in any sort of scattershot approach, or even orderly approaches that fail to consider this reduction problem, will fail to find these troublesome arguments, so it would be easy to report as accurate a routine that had huge errors.
On the other hand, there are important points to test that cannot be known without internal knowledge of the implementation. For example, when designing a sine routine, I would use the Remez algorithm to find a minimax polynomial, aiming for it to be good from, say, –π/2 to +π/2 (quite large for this sort of thing, but just for example). Then I would look at the arithmetic and rounding errors that could occur during argument reduction. Sometimes they would produce a result a little outside that interval. So I would go back to the minimax polynomial generation and push for a slightly larger interval. And I would also look for improvements in the argument reduction. In the end, I would end up with a reduction guaranteed to produce results within a certain interval and a polynomial known to be good to a certain accuracy within that interval. To test my routine, you then need to know the endpoints of that interval, and you have to be able to find some arguments for which the argument reduction yields points near those endpoints, which means you have to have some idea of how my argument reduction is implemented—how many bits does it use, and such. Like the troublesome arguments mentioned above, these points cannot be found with a scattershot approach. But unlike those above, they cannot be found from pure mathematics; you need information about the implementation. This makes it practically impossible to know you have compared the worst potential arguments for implementations.
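A hedged sketch of the exhaustive-comparison idea from the notes above: walk every finite non-negative binary32 value, compare sinf() against a high-precision reference, and track the worst error. The use of MPFR (which builds on GNU MP) and the ULP-style error metric are my assumptions, not part of the answer; the loop is slow as written and is only meant to show the structure.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <mpfr.h>

/* Walk every finite non-negative float, compare sinf() against an MPFR
   reference rounded to double, and track the worst error measured in
   units of float spacing. Link with -lmpfr -lgmp. */
int main(void)
{
    mpfr_t x, ref;
    mpfr_init2(x, 128);
    mpfr_init2(ref, 128);

    double worst_err = 0.0;
    float worst_arg = 0.0f;

    for (uint32_t bits = 0; bits < 0x7F800000u; bits++) {  /* finite, >= 0 */
        float a;
        memcpy(&a, &bits, sizeof a);

        float got = sinf(a);

        mpfr_set_flt(x, a, MPFR_RNDN);
        mpfr_sin(ref, x, MPFR_RNDN);
        double want = mpfr_get_d(ref, MPFR_RNDN);

        /* Error in units of the float spacing around the reference value. */
        double ulp = nextafterf((float)want, INFINITY) - (float)want;
        double err = fabs((double)got - want) / ulp;
        if (err > worst_err) { worst_err = err; worst_arg = a; }
    }

    printf("worst error: %.3f ulp at x = %a\n", worst_err, worst_arg);
    mpfr_clears(x, ref, (mpfr_ptr)0);
    return 0;
}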

Difference between symbolic differentiation and automatic differentiation?

I just cannot seem to understand the difference. For me it looks like both just go through an expression and apply the chain rule. What am I missing?
There are 3 popular methods to calculate the derivative:
Numerical differentiation
Symbolic differentiation
Automatic differentiation
Numerical differentiation relies on the definition of the derivative: f'(x) ≈ (f(x + h) − f(x)) / h, where you choose a very small h and evaluate the function in two places. This is the most basic formula; in practice, people use other formulas which give smaller estimation error. This way of calculating a derivative is suitable mostly if you do not know your function and can only sample it. It also requires a lot of computation for a high-dimensional function.
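A minimal central-difference sketch in C; the step size and the example function are illustrative assumptions (the central formula has smaller truncation error than the one-sided definition above):

#include <math.h>
#include <stdio.h>

/* Central difference: truncation error is O(h^2) instead of O(h) for
   the one-sided formula, at the cost of two function evaluations. */
static double num_deriv(double (*f)(double), double x, double h)
{
    return (f(x + h) - f(x - h)) / (2.0 * h);
}

static double f(double x) { return sin(x) * x * x; }

int main(void)
{
    double x = 1.5;
    double exact = 2.0 * x * sin(x) + x * x * cos(x);
    printf("numeric  %.12f\n", num_deriv(f, x, 1e-6));
    printf("exact    %.12f\n", exact);
    return 0;
}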
Symbolic differentiation manipulates mathematical expressions. If you have ever used Matlab or Mathematica, you have seen them take an expression such as x^2·sin(x) and return the derivative 2x·sin(x) + x^2·cos(x) as another expression. For every elementary expression they know the derivative, and they use various rules (product rule, chain rule) to calculate the resulting derivative. Then they simplify the end expression to obtain the resulting expression.
Automatic differentiation manipulates blocks of computer programs. A differentiator has the rules for taking the derivative of each element of a program (when you define any op in core TF, you need to register a gradient for this op). It also uses the chain rule to break complex expressions into simpler ones. Here is a good example of how it works in real TF programs, with some explanation.
You might think that automatic differentiation is the same as symbolic differentiation (one operates on math expressions, the other on computer programs). And yes, they are sometimes very similar. But for control flow statements (if, while, loops) the results can be very different:
symbolic differentiation leads to inefficient code (unless carefully done) and faces the difficulty of converting a computer program into a single expression
It is a common claim that automatic differentiation and symbolic differentiation are different. However, this is not true: forward mode automatic differentiation and symbolic differentiation are in fact equivalent. Please see this paper.
In short, they both apply the chain rule from the input variables to the output variables of an expression graph. It is often said that symbolic differentiation operates on mathematical expressions and automatic differentiation on computer programs, but in the end both are represented as expression graphs.
On the other hand, automatic differentiation also provides more modes. For instance, applying the chain rule from the output variables to the input variables of an expression graph is called reverse mode automatic differentiation.
"For me it looks like both just go through an expression and apply the chain rule. What am I missing?"
What you're missing is that AD works with numerical values, while symbolic differentiation works with symbols which represent those values. Let's look at a simple example to flesh this out.
Suppose I want to compute the derivative of the expression y = x^2.
If I were doing symbolic differentiation, I would start with the symbol x, square it to get y = x^2, and then use the differentiation rules to find that the derivative dy/dx = 2x. Now, if I want the derivative for x = 5, I can plug that into my expression and get the derivative. But since I have the expression for the derivative, I can plug in any value of x and compute the derivative without having to repeat the rule applications.
If I were doing automatic differentiation, I would start with the value x = 5, and then compute y = 5^2 = 25, and compute the derivative as dy/dx = 2*5 = 10. I would have computed the value and the derivative. However, I know nothing about the value of the derivative at x=4. I would have to repeat the process with x=4 to get the derivative at x=4.
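A hedged sketch of that forward-mode process in C: each value carries its derivative along with it, so evaluating y = x^2 at x = 5 automatically produces dy/dx = 10. The dual-number struct and helper names are my own, not from the answer.

#include <stdio.h>

/* Forward-mode AD via dual numbers: val carries the value,
   dot carries the derivative with respect to the chosen input. */
typedef struct { double val, dot; } dual;

static dual dual_var(double x)       { return (dual){ x, 1.0 }; }  /* seed dx/dx = 1 */
static dual dual_mul(dual a, dual b) { return (dual){ a.val * b.val,
                                                      a.dot * b.val + a.val * b.dot }; }

int main(void)
{
    dual x = dual_var(5.0);
    dual y = dual_mul(x, x);                      /* y = x^2 */
    printf("y = %g, dy/dx = %g\n", y.val, y.dot); /* prints 25 and 10 */
    return 0;
}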

Optimizing functions with GAs

Firstly, sorry if this is not the right Stack Exchange for this question; it might fit better on Math.
I've been working on a project to maximize a function's output using a GA. However, from the limited calculus I know, I thought there were methods to find the maximum of a mathematical function using calculus. I'd assume the reason GAs are sometimes used to maximize functions is that there are functions where the mathematical methods don't work. I wondered what those conditions were. Maybe that the function is not continuous or differentiable?
A superficial explanation
For simple™ mathematical functions, the solution would be to use your calculus and find the derivative function f'(x). If it's not mathematically possible to differentiate the error function f(x), you need to break out the other tools from your math toolbox. If the error function's solution space is convex, you could possibly use a numerical approach to find your optimum, such as gradient descent or the conjugate gradient algorithm.
The genetic algorithm (and other search algorithms) comes in handy if the function you are trying to optimize involves many unknown variables, which would make calculating the optimum using calculus very difficult. If you are familiar with neural networks: the genetic algorithm has been applied to find optimal weight configurations for neural networks, and in those problem instances there might be thousands of unknown variables (weights).
A mathematical approach would have to search the solution space in some incremental fashion, whereas the genetic algorithm is a bit "all over the place"™. By adjusting the mutation frequency, the GA is able to jump around in the search space (see the sketch below).
[Image: an (oversimplified) difficult solution space. Credit: Ciumac Sergiu]
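A toy sketch of that "jump around the search space" idea in C; it is a mutation-only hill climber rather than a full GA (no population, no crossover), and the objective function and parameters are made up for illustration:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Example objective with many local maxima. */
static double objective(double x)
{
    return sin(5.0 * x) + 0.5 * sin(13.0 * x) - 0.05 * x * x;
}

static double uniform(double lo, double hi)
{
    return lo + (hi - lo) * ((double)rand() / RAND_MAX);
}

int main(void)
{
    srand(42);
    double best = uniform(-10.0, 10.0);
    double best_val = objective(best);

    for (int gen = 0; gen < 100000; gen++) {
        /* Small mutation most of the time, occasional large jump so the
           search is not trapped by the nearest local optimum. */
        double cand = (rand() % 20 == 0) ? uniform(-10.0, 10.0)
                                         : best + uniform(-0.1, 0.1);
        double v = objective(cand);
        if (v > best_val) { best = cand; best_val = v; }
    }
    printf("best x ~ %.4f, f(x) ~ %.4f\n", best, best_val);
    return 0;
}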
Well, for starters, you don't always have easily differentiable functions. You may especially have very high dimensional functions, which can be very difficult to differentiate.
Moreover, even if you do have a function you can differentiate, what you find are local optima, not global optima, and you may end up finding a great many of them-- potentially infinite numbers of them-- with no clear way to decide which are better than others.
While you may know enough about some particular function to be able to optimize it with calculus, there is no method guaranteed to get you the global optimum for any possible function. Hence we rely on a number of probabilistic techniques and heuristics, of which genetic algorithms are only one.

Using the modified Bessel functions in Matlab and GSL

I am trying to make a Kaiser window for an audio signal using both Matlab and C. I have been looking at the Matlab and GNU Scientific Library documentation to understand how to use a modified Bessel function of the first kind and 0th order, but I still have some questions:
It seems that GSL does not accept a 0th-order Bessel function; I don't understand the documentation on this point.
I don't know if I should use a regular or irregular function. What is the difference? Matlab does not have that distinction.
Which is the fastest method to filter the signal: time domain or frequency domain?
How do I filter the signal in the frequency domain?
I will only answer the last three points. (Warning: I am French and my English isn't great...)
1) When you consider the Fourier transform of a signal multiplied by a specific window, in the spectral domain you convolve the original spectrum of the signal with the spectrum of your window. In an ideal mathematical world, you would love that window spectrum to be a Dirac, since convolving with a Dirac only shifts the signal. But to get a Dirac in the frequency domain you would need a periodic signal in the time domain, which cannot be defined on a compact (i.e. finite, like your sound recording) support. And this is too bad, because a corollary of the Paley-Wiener theorem states that if your time-domain support is compact, your frequency-domain support is unbounded, and the decay of the Fourier transform improves with the regularity of the signal (i.e. the window, in our case). Great, then! All we have to choose is a nice, regular (smooth) window. Unfortunately, to get a really smooth window we have to narrow it (wide smooth windows exist but have other drawbacks due to their derivatives, much like large constants hiding behind an appealing algorithmic complexity), and its spectrum will then be wider (for the same reason invoked in the theorem). But you believe in compromise to face the (Pontryagin) duality, don't you? The Gaussian is a great compromise, since its Fourier transform is again a Gaussian (sums of random variables, convolution, everything is linked, but it is too long a story to tell here). Therefore a lot of windows tend to look like a Gaussian.
Here is a bunch of windows and their spectra, borrowed from my speech processing teacher:
2) It's a pure mathematical duality, so it depends on what you mean by filter. Does applying a Sobel filter in the frequency domain make any sense? (In fact it may...)
3) Again, it depends on what you mean by filter.
I think I can answer (1) and (2):
(1) You can program the zeroth-order Bessel functions yourself (spherical or not) by consulting a handbook on mathematical functions such as Abramowitz and Stegun, Gradshteyn and Ryzhik, or the Digital Library of Mathematical Functions (http://dlmf.nist.gov/); a power-series sketch is given after this answer.
(2) By regular and irregular, I presume you mean regular or modified Bessel functions. Bessel functions are solutions to the three-dimensional heat equation posed in cylindrical coordinates. Your boundary conditions determine your use of regular or modified Bessel functions. For nice discussions about regular and modified Bessel functions, I suggest reading The Conduction of Heat in Solids by Carslaw and Jaeger and Boundary Value Problems of Heat Conduction by M. Necati Ozisik. For difficult problems you can also try Classical Electrodynamics by John David Jackson.
How are the regular and modified Bessel functions different? The regular J Bessel function is oscillatory in nature (see a nice old book like Jahnke, Emde, and Losch for hand-drawn Bessel function graphs, a lost art form if you ask me), whereas the modified functions I and K grow and decay monotonically rather than oscillating.
I can't really help you much on (3) and (4), as I'm not much of an electrical engineer (although I would like to learn more!).
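A hedged sketch of point (1): the modified Bessel function of the first kind and order zero can be summed from its power series, I0(x) = sum over k of ((x/2)^(2k) / (k!)^2), which is all a Kaiser window needs. The window length and beta below are illustrative assumptions. (GSL also appears to provide this function as gsl_sf_bessel_I0, if memory serves.)

#include <math.h>
#include <stdio.h>

/* Modified Bessel function of the first kind, order 0, via its power
   series: I0(x) = sum_k ((x/2)^(2k) / (k!)^2). Converges quickly for
   the argument range used by Kaiser windows. */
static double bessel_i0(double x)
{
    double sum = 1.0, term = 1.0;
    for (int k = 1; k < 50; k++) {
        term *= (x / (2.0 * k)) * (x / (2.0 * k));
        sum += term;
        if (term < 1e-16 * sum)
            break;
    }
    return sum;
}

/* Kaiser window of length n with shape parameter beta. */
static void kaiser_window(double *w, int n, double beta)
{
    double denom = bessel_i0(beta);
    for (int i = 0; i < n; i++) {
        double r = 2.0 * i / (n - 1) - 1.0;   /* maps i to [-1, 1] */
        w[i] = bessel_i0(beta * sqrt(1.0 - r * r)) / denom;
    }
}

int main(void)
{
    double w[16];
    kaiser_window(w, 16, 8.6);
    for (int i = 0; i < 16; i++)
        printf("%d %.6f\n", i, w[i]);
    return 0;
}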

Efficiency of arcsin computation from sine lookup table

I have implemented a lookup table to compute sine/cosine values in my system. I now need inverse trigonometric functions (arcsin/arccos).
My application is running on an embedded device on which I can't add a second lookup table for arcsin as I am limited in program memory. So the solution I had in mind was to browse over the sine lookup table to retrieve the corresponding index.
I am wondering if this solution will be more efficient than using the standard implementation coming from the math standard library.
Has someone already experimented on this?
The current implementation of the LUT is an array of the sine values from 0 to PI/2. The values stored in the table are multiplied by 4096 to stay with integer values that have enough precision for my application. The lookup table has a resolution of 1/4096, which gives an array of 6434 values.
Then I have two functions, sine and cosine, which take an angle in radians multiplied by 4096 as their argument. These functions map the given angle to the corresponding angle in the first quadrant and read the corresponding value from the table.
My application runs on a dsPIC33F at 40 MIPS and I use the C30 compiler suite.
It's pretty hard to say anything with certainty since you have not told us about the hardware, the compiler or your code. However, a priori, I'd expect the standard library from your compiler to be more efficient than your code.
It is perhaps unfortunate that you have to use the C30 compiler which does not support C++, otherwise I'd point you to Optimizing Math-Intensive Applications with Fixed-Point Arithmetic and its associated library.
However, the general principles of the CORDIC algorithm apply, and the memory footprint will be far smaller than that of your current implementation. The article explains the generation of arctan(), and arccos() and arcsin() can be calculated from that as described here.
Of course, that also suggests you will need square root and division. These may be expensive, though the PIC24/dsPIC has hardware integer division. The article on math acceleration deals with square root as well. It is likely that your look-up table approach will be faster for the direct look-up, but perhaps not for the reverse search; however, the approaches explained in the article are more general and more precise (the library uses 64-bit integers as 36.28-bit fixed point; you might get away with less precision and range in your application), and certainly faster than a standard library implementation using software floating point.
You can use a "halfway" approach, combining a coarse-grained lookup table to save memory with a numeric approximation for the intermediate values (e.g. a Maclaurin series, which will be more accurate than linear interpolation).
Some examples here.
This question also has some related links.
A binary search of 6434 entries will take about 13 lookups to find the value, followed by an interpolation if more accuracy is needed. Due to the nature of the sine curve, you will get much more accuracy at one end than the other. If you can spare the memory, making your own inverse table evenly spaced on the inputs is likely a better bet for speed and accuracy.
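A hedged sketch of that reverse lookup in C; the table layout and Q12 scaling follow the question's description, but the code is otherwise my assumption, not a tested dsPIC implementation:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed table layout (matches the question): sin_lut[i] holds
   round(sin(i / 4096.0) * 4096) for i = 0 .. 6433, i.e. angles from 0
   to PI/2 in Q12 steps, values in Q12, monotonically increasing. On
   the dsPIC this would be a const table in program memory; it is
   filled at run time here only to keep the example self-contained. */
#define LUT_SIZE 6434
static int16_t sin_lut[LUT_SIZE];

/* Reverse lookup: given s = sin(x) * 4096 (0 <= s <= 4096), return
   the angle x * 4096 by binary search over the monotone table. */
static int32_t arcsin_q12(int32_t s)
{
    int32_t lo = 0, hi = LUT_SIZE - 1;
    while (hi - lo > 1) {                    /* about 13 iterations */
        int32_t mid = (lo + hi) / 2;
        if (sin_lut[mid] <= s)
            lo = mid;
        else
            hi = mid;
    }
    /* Pick the nearer of the two bracketing entries. The flat top of
       the sine curve means many entries share a value near PI/2, so
       accuracy is worse at that end, as noted above. */
    if (2 * (s - sin_lut[lo]) > sin_lut[hi] - sin_lut[lo])
        lo = hi;
    return lo;
}

int main(void)
{
    for (int i = 0; i < LUT_SIZE; i++)
        sin_lut[i] = (int16_t)lround(sin(i / 4096.0) * 4096.0);

    int32_t s = (int32_t)lround(sin(0.5) * 4096.0);  /* sin(0.5 rad) in Q12 */
    printf("arcsin(%ld/4096) ~ %ld/4096 (expect ~2048)\n",
           (long)s, (long)arcsin_q12(s));
    return 0;
}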
In terms of comparison to the built-in version, you'll have to test that. When you do, pay attention to how much the size of your image increases: the standard library implementations can be pretty hefty on some systems.
