I am currently trying to implement the Baum-Welch algorithm in C, but I run into the following problem with the gamma function:
gamma(i,t) = alpha(i,t) * beta(i,t) / sum over `i` of (alpha(i,t) * beta(i,t))
Unfortunately, for large enough observation sets, alpha drops rapidly to 0 as t increases and beta drops rapidly to 0 as t decreases, so that, due to values rounding down to zero, there is never a spot where both alpha and beta are non-zero, which makes things rather problematic.
Is there a way around this problem, or should I just try to increase the precision of the values? I fear the problem may just pop up again if I try this approach, as alpha and beta drop by about one order of magnitude per observation.
You should do these computations, and generally all computations for probability models, in log-space:
lg_gamma(i, t) = (lg_alpha(i, t) + lg_beta(i, t)
- logsumexp over i of (lg_alpha(i, t) + lg_beta(i, t)))
where lg_gamma(i, t) represents the logarithm of gamma(i, t), etc., and logsumexp is the usual log-sum-exp function. At the end of the computation, you can convert to probabilities using exp, if needed (that's typically only needed for displaying probabilities, but even there logs may be preferable).
The base of the logarithm is not important, as long as you use the same base everywhere. I prefer the natural logarithm, because log saves typing compared to log2 :)
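For illustration, here is a minimal C sketch of this approach, assuming the log-domain values live in N-by-T arrays; the names (logsumexp, compute_lg_gamma, lg_alpha, lg_beta) are only illustrative, not from any particular library:

#include <math.h>
#include <float.h>

/* Numerically stable log(sum_i exp(v[i])): subtracting the maximum first
   keeps exp() from underflowing to 0 or overflowing to infinity. */
double logsumexp(const double *v, int n)
{
    double max = -DBL_MAX;
    for (int i = 0; i < n; i++)
        if (v[i] > max) max = v[i];

    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += exp(v[i] - max);

    return max + log(sum);
}

/* lg_gamma(i, t) = lg_alpha(i, t) + lg_beta(i, t)
                    - logsumexp over i of (lg_alpha(i, t) + lg_beta(i, t)) */
void compute_lg_gamma(int N, int T,
                      double lg_alpha[N][T], double lg_beta[N][T],
                      double lg_gamma[N][T])
{
    double term[N];                      /* C99 variable-length array */
    for (int t = 0; t < T; t++) {
        for (int i = 0; i < N; i++)
            term[i] = lg_alpha[i][t] + lg_beta[i][t];
        double norm = logsumexp(term, N);
        for (int i = 0; i < N; i++)
            lg_gamma[i][t] = term[i] - norm;
    }
}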
I am using the 'psych' package in R to run EFA. After running the EFA, I use lavaan to run a CFA with the factor structure from the EFA, purely to humor myself and compare results. After doing so, I am suspicious of the results produced by 'psych'.
I am aware that CFA is used to essentially test a hypothesis of how it is believed certain items may factor, and I am aware that the two are not typically run together. However, it is my understanding that if I know the factor structure, results from EFA (RMSEA, TLI, factor loadings, residual variance, etc.) should be approximately similar to results from CFA.
My issue is that sometimes in using 'psych' I can have a 3-factor structure derived from EFA where all factor loadings appear to be below 1. Then, when putting the same factor structure through CFA, I can have a standardized loading exceed 1 and warnings of negative residual variance, thus yielding unreliable estimates.
This is troubling to me, because I feel uneasy reporting results from 'psych', especially because I am dealing with an oblique rotation (correlated factors) that can yield factor loadings greater than 1 (say, if I use a different number of factors). I have read that factor loadings can exceed 1 if residual variance is non-negative. I am just having a hard time deciphering whether I do have negative residual variance when my EFA is telling me no but the CFA is telling me yes. Has anyone seen anything like this before?
I know that when I would like to check if double == double I should write:
#include <math.h>     // fabs
#include <stdbool.h>  // bool

// EPSILON is an application-specific tolerance.
bool AreSame(double a, double b)
{
    return fabs(a - b) < EPSILON;
}
But what about when I would like to check if a > b or b > a?
There is no general solution for comparing floating-point numbers that contain errors from previous operations. The code that must be used is application-specific. So, to get a proper answer, you must describe your situation more specifically. For example, if you are sorting numbers in a list or other data structure, you should not use any tolerance for comparison.
Usually, if your program needs to compare two numbers for order but cannot do so because it has only approximations of those numbers, then you should redesign the program rather than try to allow numbers to be ordered incorrectly.
The underlying problem is that performing a correct computation using incorrect data is in general impossible. If you want to compute some function of two exact mathematical values x and y, but the only data you have are approximations x̃ and ỹ that contain errors from previous operations, it is generally impossible to compute the exactly correct result. For example, suppose you want to know the sum x+y, but you only know that x̃ is 3 and ỹ is 4; you do not know what the true, exact x and y are. Then you cannot compute x+y.
If you know that x̃ and ỹ are approximately x and y, then you can compute an approximation of x+y by adding x̃ and ỹ. This works when the function being computed has a reasonable derivative: slightly changing the inputs of a function with a reasonable derivative slightly changes its outputs. This fails when the function you want to compute has a discontinuity or a large derivative. For example, if you want to compute the square root of x (in the real domain) using an approximation x̃, but x̃ might be negative due to previous rounding errors, then computing sqrt(x̃) may produce an exception. Similarly, comparing for inequality or order is a discontinuous function: a slight change in the inputs can change the answer completely.
The common bad advice is to compare with a “tolerance”. This method trades false negatives (incorrect rejections of numbers that would satisfy the comparison if the true mathematical values were compared) for false positives (incorrect acceptance of numbers that would not satisfy the comparison).
Whether or not an application can tolerate false acceptance depends on the application. Therefore, there is no general solution.
The level of tolerance to set, and even the nature by which it is calculated, depend on the data, the errors, and the previous calculations. So, even when it is acceptable to compare with a tolerance, the amount of tolerance to use and how to calculate it depends on the application. There is no general solution.
The analogous comparisons are:
a > b - EPSILON
and
b > a - EPSILON
I am assuming that EPSILON is some small positive number.
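If, after weighing the application's needs as described above, a tolerance-based comparison is acceptable, the two comparisons can be wrapped as a helper. This is only a sketch under that assumption; the EPSILON value here is purely a placeholder, and the right value (and whether to use an absolute or relative tolerance) is application-specific:

#include <stdbool.h>

#define EPSILON 1e-9   /* application-specific; this value is only a placeholder */

/* True if a is greater than b, or within EPSILON of it.
   This trades false negatives for false positives, as discussed above. */
bool GreaterWithinTolerance(double a, double b)
{
    return a > b - EPSILON;
}

/* Usage: GreaterWithinTolerance(a, b) for "a > b",
          GreaterWithinTolerance(b, a) for "b > a". */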
I'm playing around with Neural Networks trying to understand the best practices for designing their architecture based on the kind of problem you need to solve.
I generated a very simple data set composed of a single convex region as you can see below:
Everything works fine when I use an architecture with L = 1, or L = 2 hidden layers (plus the output layer), but as soon as I add a third hidden layer (L = 3) my performance drops down to slightly better than chance.
I know that the more complexity you add to a network (number of weights and parameters to learn) the more you tend to go towards over-fitting your data, but I believe this is not the nature of my problem for two reasons:
my performance on the training set is also around 60% (whereas over-fitting typically means you have a very low training error and a high test error),
and I have a very large number of data examples (don't look at the figure; that's only a toy figure I uploaded).
Can anybody help me understand why adding an extra hidden layer gives me this drop in performance on such a simple task?
Here is an image of my performance as a function of the number of layers used:
ADDED PART DUE TO COMMENTS:
I am using a sigmoid function taking values between 0 and 1, L(s) = 1 / (1 + exp(-s)).
I am using early stopping (after 40000 iterations of backprop) as the criterion to stop the learning. I know it is not the best way to stop, but I thought it would be OK for such a simple classification task; if you believe this is the main reason I'm not converging, I might implement a better criterion.
At least on the surface of it, this appears to be a case of the so-called "vanishing gradient" problem.
Activation functions
Your neurons activate according to the logistic sigmoid function, f(x) = 1 / (1 + e^-x):
This activation function is used frequently because it has several nice properties. One of these nice properties is that the derivative of f(x) is expressible computationally in terms of the value of the function itself, as f'(x) = f(x)(1 - f(x)). This derivative is appreciably nonzero for x near zero, but quickly goes to zero as |x| gets large:
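For reference, a minimal C version of the function and this derivative identity (just a sketch for illustration):

#include <math.h>

/* Logistic sigmoid f(x) = 1 / (1 + e^-x). */
double sigmoid(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

/* Its derivative expressed through the function value: f'(x) = f(x) * (1 - f(x)). */
double sigmoid_derivative(double x)
{
    double fx = sigmoid(x);
    return fx * (1.0 - fx);
}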
Gradient descent
In a feedforward neural network with logistic activations, the error is typically propagated backwards through the network using the first derivative as a learning signal. The usual update for a weight in your network is proportional to the error attributable to that weight times the current weight value times the derivative of the logistic function.
delta_w(w) ~= w * f'(err(w)) * err(w)
As the product of three potentially very small values, the first derivative in such networks can become small very rapidly if the weights in the network fall outside the "middle" regime of the logistic function's derivative. In addition, this rapidly vanishing derivative becomes exacerbated by adding more layers, because the error in a layer gets "split up" and partitioned out to each unit in the layer. This, in turn, further reduces the gradient in layers below that.
In networks with more than, say, two hidden layers, this can become a serious problem for training the network, since the first-order gradient information will lead you to believe that the weights cannot usefully change.
However, there are some solutions that can help! The ones I can think of involve changing your learning method to use something more sophisticated than first-order gradient descent, generally incorporating some second-order derivative information.
Momentum
The simplest way to approximate some second-order information is to include a momentum term in your network parameter updates. Instead of updating parameters using:
w_new = w_old - learning_rate * delta_w(w_old)
incorporate a momentum term:
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old)
w_new = w_old + w_dir_new
Intuitively, you want to use information from past derivatives to help determine whether you want to follow the new derivative entirely (which you can do by setting mu = 0), or to keep going in the direction you were heading on the previous update, tempered by the new gradient information (by setting mu > 0).
You can actually get even better than this by using "Nesterov's Accelerated Gradient":
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old + mu * w_dir_old)
w_new = w_old + w_dir_new
I think the idea here is that instead of computing the derivative at the "old" parameter value w, compute it at what would be the "new" setting for w if you went ahead and moved there according to a standard momentum term. Read more in a neural-networks context here (PDF).
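As a rough C sketch of the two update rules for a single weight, assuming the caller supplies grad(w), the derivative of the error with respect to that weight (all names here are illustrative, not from any particular library):

/* Classical momentum: blend the previous direction with the new gradient. */
double momentum_step(double w, double *velocity, double mu,
                     double learning_rate, double (*grad)(double))
{
    *velocity = mu * (*velocity) - learning_rate * grad(w);
    return w + *velocity;
}

/* Nesterov's accelerated gradient: evaluate the gradient at the
   "looked-ahead" position w + mu * velocity instead of at w. */
double nesterov_step(double w, double *velocity, double mu,
                     double learning_rate, double (*grad)(double))
{
    *velocity = mu * (*velocity) - learning_rate * grad(w + mu * (*velocity));
    return w + *velocity;
}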
Hessian-Free
The textbook way to incorporate second-order gradient information into your neural network training algorithm is to use Newton's Method to compute the first- and second-order derivatives of your objective function with respect to the parameters. However, the matrix of second-order derivatives, called the Hessian, is often extremely large and prohibitively expensive to compute.
Instead of computing the entire Hessian, some clever research in the past few years has indicated a way to compute just the values of the Hessian in a particular search direction. You can then use this process to identify a better parameter update than just the first-order gradient.
You can learn more about this by reading through a research paper (PDF) or looking at a sample implementation.
Others
There are many other optimization methods that could be useful for this task -- conjugate gradient (PDF -- definitely worth a read), Levenberg-Marquardt (PDF), L-BFGS -- but from what I've seen in the research literature, momentum and Hessian-free methods seem to be the most common ones.
Because the number of iterations of training required for convergence increases as you add complexity to a neural network, holding the length of training constant while adding layers to a neural network will certainly result in you eventually observing a drop like this. To figure out whether that is the explanation for this particular observation, try increasing the number of iterations of training that you're using and see if it improves. Using a more intelligent stopping criterion is also a good option, but a simple increase in the cut-off will give you answers faster.
I wrote code that numerically evaluates Legendre polynomials up to some high n-th order. For example:
....
case 8
p = (6435*x.^8-12012*x.^6+6930*x.^4-1260*x.^2+35)/128; return
case 9
...
If the vector x is long this can become slow. I saw that there is a performance difference between, say, x.^4 and x.*x.*x.*x, and thought I could use this to improve my code. I've used timeit and found that for:
x=linspace(0,10,1e6);
f1 = @() power(x,4);
f2 = @() x.^4;
f3 = @() x.^2.^2;
f4 = @() x.*x.*x.*x;
f4 is faster by a factor of 2 than the rest. However, when I go to x.^6, there is very little difference between (x.*x.*x).^2 and x.*x.*x.*x.*x.*x (while all other options are slower).
Is there a way to tell what will be the most efficient way to take a power of a vector?
Can you explain why there is such a big difference in performance?
This is not exactly an answer to your question, but it may solve your problem:
x2 = x.*x; % or x.^2 or power(x,2), whichever is most efficient
p = ((((6435*x2-12012).*x2+6930).*x2-1260).*x2+35)/128
This way you do the power just once, and only with exponent 2. This trick can be applied to all Legendre polynomials (in the odd-degree polynomials, the final x2 factor is replaced by x).
Here are some thoughts:
power(x,4) and x.^4 are equivalent (just read the doc).
x.*x.*x.*x is probably optimized to something like x.^2.^2
x.^2.^2 is probably evaluated as: Take the square of each element (fast), and take the square of that again (fast again).
x.^4 is probably directly evaluated as: Take the fourth power of each element (slow).
It is not so strange to see that 2 fast operations take less time than 1 slow operation. It's just too bad that the optimization is not performed in the power-4 case; perhaps it doesn't always work, or it comes at a cost (input checking, memory?).
About the timings: Actually there is much more difference than a factor 2!
As you call them in a function now, the function overhead is added in each case, making the relative differences smaller:
y=x;tic,power(x,4);toc
y=x;tic,x.^4;toc
y=x;tic,x.^2.^2;toc
y=x;tic,x.*x.*x.*x;toc
will give:
Elapsed time is 0.034826 seconds.
Elapsed time is 0.029186 seconds.
Elapsed time is 0.003891 seconds.
Elapsed time is 0.003840 seconds.
So, it is nearly a factor 10 difference. However, note that the time difference in seconds is still minor, so for most practical applications I would just go for the simple syntax.
It seems as though MathWorks has special-cased squares in its power function (unfortunately, it's all built-in closed source that we cannot see). In my testing on R2013b, it appears as though .^, power, and realpow use the same algorithm. For squares, I believe they have special-cased it to be x.*x.
1.0x (4.4ms): @()x.^2
1.0x (4.4ms): @()power(x,2)
1.0x (4.5ms): @()x.*x
1.0x (4.5ms): @()realpow(x,2)
6.1x (27.1ms): @()exp(2*log(x))
For cubes, the story is different. They're no longer special-cased. Again, .^, power, and realpow all are similar, but much slower this time:
1.0x (4.5ms): @()x.*x.*x
1.0x (4.6ms): @()x.*x.^2
5.9x (26.9ms): @()exp(3*log(x))
13.8x (62.3ms): @()power(x,3)
14.0x (63.2ms): @()x.^3
14.1x (63.7ms): @()realpow(x,3)
Let's jump up to the 16th power to see how these algorithms scale:
1.0x (8.1ms): @()x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x
2.2x (17.4ms): @()x.^2.^2.^2.^2
3.5x (27.9ms): @()exp(16*log(x))
7.9x (63.8ms): @()power(x,16)
7.9x (63.9ms): @()realpow(x,16)
8.3x (66.9ms): @()x.^16
So: .^, power, and realpow all run in constant time with regard to the exponent, unless it was special-cased (-1 also appears to have been special-cased). Using the exp(n*log(x)) trick is also constant time with regard to the exponent, and faster. The only result I don't quite understand is why the repeated squaring is slower than the repeated multiplication.
As expected, increasing the size of x by a factor of 100 increases the time similarly for all algorithms.
So, moral of the story? When using scalar integer exponents, always do the multiplication yourself. There's a whole lot of smarts in power and friends (the exponent can be floating point, a vector, etc.). The only exceptions are where MathWorks has done the optimization for you. In R2013b, it seems to be x^2 and x^(-1). Hopefully they'll add more as time goes on. But, in general, exponentiation is hard and multiplication is easy. In performance-sensitive code, I don't think you can go wrong by always typing x.*x.*x.*x. (Of course, in your case, follow Luis' advice and make use of the intermediate results within each term!)
function powerTest(x)
f{1} = @() x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x;
f{2} = @() x.^2.^2.^2.^2;
f{3} = @() exp(16.*log(x));
f{4} = @() x.^16;
f{5} = @() power(x,16);
f{6} = @() realpow(x,16);
for i = 1:length(f)
t(i) = timeit(f{i});
end
[t,idxs] = sort(t);
fcns = f(idxs);
for i = 1:length(fcns)
fprintf('%.1fx (%.1fms):\t%s\n',t(i)/t(1),t(i)*1e3,func2str(fcns{i}));
end
I have a vector A, represented by an angle and a length. I want to add vector B, updating the original A. B comes from a lookup table, so it can be represented in whichever way makes the computation easier.
Specifically, A is defined thusly:
uint16_t A_angle; // 0-65535 = 0-2π
int16_t A_length;
Approximations are fine. Checking for overflow is not necessary. A fast sin/cos approximation is available.
The fastest way I can think of is to have B represented as a component vector, convert A to components, add A and B, convert the result back to angle/length, and replace A. (This requires the addition of a fast asin/acos.)
I am not especially good at math and wonder if I am missing a more sensible approach?
I am primarily looking for a general approach, but specific answers/comments about useful micro-optimizations in C is also interesting.
If you need to do a lot of additive operations, it would probably be worth considering storing everything in Cartesian coordinates, rather than polar.
Polar is well-suited to rotation operations (and scaling, I guess), but sticking with Cartesian (where a rotation is four multiplies, see below) is probably going to be cheaper than using cos/sin/acos/asin every time you want to do a vector addition. Although, of course, it depends on the distribution of operations in your case.
FYI, a rotation in Cartesian coordinates is as follows (see http://en.wikipedia.org/wiki/Rotation_matrix):
x' = x.cos(a) - y.sin(a)
y' = x.sin(a) + y.cos(a)
If a is known ahead of time, then cos(a) and sin(a) can be precomputed.
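To make the suggestion concrete, here is a rough C sketch using plain floating point and the standard library's sinf/cosf as stand-ins for the fixed-point representation and the fast approximations mentioned in the question; all names are illustrative:

#include <math.h>
#include <stdint.h>

typedef struct { float x, y; } vec2;     /* Cartesian components */

/* Convert polar (angle 0-65535 = 0-2*pi, signed length) to Cartesian. */
vec2 polar_to_cartesian(uint16_t angle, int16_t length)
{
    float a = angle * (2.0f * 3.14159265f / 65536.0f);
    vec2 v = { length * cosf(a), length * sinf(a) };
    return v;
}

/* Vector addition is just component-wise addition in Cartesian coordinates. */
vec2 vec2_add(vec2 a, vec2 b)
{
    vec2 r = { a.x + b.x, a.y + b.y };
    return r;
}

/* Rotation by angle a, with cos(a) and sin(a) precomputed by the caller. */
vec2 vec2_rotate(vec2 v, float cos_a, float sin_a)
{
    vec2 r = { v.x * cos_a - v.y * sin_a, v.x * sin_a + v.y * cos_a };
    return r;
}

Converting back to angle/length is only needed when something actually consumes the polar form, at which point an atan2 approximation and a square root (or a fast magnitude approximation) do the job.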